
    Day 18: Web Scraping with BeautifulSoup

    Learn how to scrape data from websites using Python’s BeautifulSoup library

    Topics to Learn:

    • Introduction to Web Scraping
    • Using requests and BeautifulSoup
    • Parsing HTML

    Steps to Scrape a Website:

    Step 1: Install Required Libraries

    To start web scraping, you need to install the following libraries:

    pip install requests beautifulsoup4

    Step 2: Making a Request to a Website

    Use the requests library to fetch the HTML content of the webpage you want to scrape.

    import requests

    url = "https://example.com"
    response = requests.get(url)

    if response.status_code == 200:
        html_content = response.text
        print(html_content)  # This will print the HTML content of the page
    else:
        print(f"Failed to retrieve the webpage: {response.status_code}")

    Step 3: Parsing HTML with BeautifulSoup

    After fetching the HTML, use BeautifulSoup to parse it and navigate the DOM (Document Object Model) to find specific data.

    from bs4 import BeautifulSoup

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # Print the prettified HTML (for better readability)
    print(soup.prettify())
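
    Once parsed, the soup object lets you navigate the DOM directly through tag names. A quick sketch (the tags accessed here are illustrative; it assumes the page has a <title> and at least one <p>):

    print(soup.title)       # The <title> tag itself
    print(soup.title.text)  # Just the text inside it
    print(soup.find("p"))   # The first <p> tag, or None if there is none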

    Step 4: Extracting Data

    You can now extract specific data using BeautifulSoup’s methods such as find(), which returns the first matching tag, and find_all(), which returns a list of all matches.

    headings = soup.find_all("h1")
    for heading in headings:
        print(heading.text)
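
    find() works the same way but returns only the first match, or None when nothing matches. For example, to grab just the first heading:

    first_heading = soup.find("h1")
    if first_heading is not None:
        print(first_heading.text)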

    Step 5: Extracting Attributes

    You can also extract attributes (like href or src) from tags.

    links = soup.find_all("a")
    for link in links:
        print(link.get("href"))
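
    Note that href values are often relative paths (like /about). If you need absolute URLs, the standard library’s urljoin can resolve them against the page URL, as in this small sketch (reusing the url variable from Step 2):

    from urllib.parse import urljoin

    for link in soup.find_all("a"):
        href = link.get("href")
        if href:  # Skip <a> tags that have no href attribute
            print(urljoin(url, href))  # Resolve relative paths against the base URL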

    Step 6: Storing Scraped Data

    After scraping the data, you can store it in a structured format, like a CSV file.

    import csv

    import requests
    from bs4 import BeautifulSoup

    url = "https://example-blog.com"
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all article titles and links
        articles = soup.find_all("article")

        with open("scraped_articles.csv", mode="w", newline="") as file:
            writer = csv.writer(file)
            writer.writerow(["Title", "Link"])
            for article in articles:
                title = article.find("h2").text
                link = article.find("a")["href"]
                writer.writerow([title, link])

        print("Data scraped and saved to 'scraped_articles.csv'")

    Exercises

    • Scrape a website’s data and save it to a CSV file
    • Scrape news headlines from a news website

    Challenges to Tackle

    • Handle Dynamic Content (content rendered by JavaScript is not present in the HTML that requests fetches)
    • Pagination (following “next page” links across multiple pages; see the sketch below)
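
    For pagination, one common pattern is to keep following a “next” link until it disappears. A hedged sketch, assuming the site marks its next-page link with a class named "next" (both the URL and the selector are hypothetical):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    url = "https://example-blog.com"  # Hypothetical starting page
    while url:
        response = requests.get(url)
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.text, "html.parser")

        # ... extract the data you need from this page here ...

        # Follow the "next" link if one exists; otherwise stop
        next_link = soup.find("a", class_="next")
        url = urljoin(url, next_link["href"]) if next_link else None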

    Important Considerations

    Always respect the website’s scraping policies (check its robots.txt and terms of service before scraping), and handle missing data gracefully, since not every page contains every tag you look for.
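
    Handling missing data gracefully mostly means checking for None before using a tag, because find() returns None when nothing matches. A minimal sketch (the tag names are illustrative):

    article = soup.find("article")
    if article:
        title_tag = article.find("h2")
        # find() returns None when no <h2> exists, so guard before using .text
        title = title_tag.text.strip() if title_tag else "Untitled"
        print(title)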