Day 18: Web Scraping with BeautifulSoup
Learn how to scrape data from websites using Python’s BeautifulSoup library
Topics to Learn:
- Introduction to Web Scraping
- Using requests and BeautifulSoup
- Parsing HTML
Steps to Scrape a Website:
Step 1: Install Required Libraries
To start web scraping, you need to install the following libraries:
pip install requests beautifulsoup4
Step 2: Making a Request to a Website
Use the requests library to fetch the HTML content of the webpage you want to scrape.
import requests

url = "https://example.com"
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    print(html_content)  # This will print the HTML content of the page
else:
    print(f"Failed to retrieve the webpage: {response.status_code}")
Step 3: Parsing HTML with BeautifulSoup
After fetching the HTML, use BeautifulSoup to parse it and navigate the DOM (Document Object Model) to find specific data.
from bs4 import BeautifulSoup
# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
# Print the prettified HTML (for better readability)
print(soup.prettify())
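Once parsed, the soup object also supports direct navigation of the tree, which is handy for quick inspection. For example:

# The first <title> tag and its text content (if the page has one)
if soup.title:
    print(soup.title)
    print(soup.title.string)

# The first <p> tag, if any, and the name of its parent tag
first_paragraph = soup.find("p")
if first_paragraph:
    print(first_paragraph.parent.name)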
Step 4: Extracting Data
You can now extract specific data using BeautifulSoup's methods such as find(), find_all(), etc.
headings = soup.find_all("h1")
for heading in headings:
    print(heading.text)
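find_all() can also filter by attributes, and soup.select() accepts CSS selectors; both are part of BeautifulSoup's standard API. In this sketch the class name post-title is a made-up placeholder:

# Filter by class attribute (class_ has a trailing underscore because
# "class" is a reserved word in Python)
titles = soup.find_all("h2", class_="post-title")

# The same lookup written as a CSS selector
titles = soup.select("h2.post-title")

for title in titles:
    print(title.get_text(strip=True))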
Step 5: Extracting Attributes
You can also extract attributes (like href or src) from tags.
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
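href values are frequently relative paths (like /about), so it is common to resolve them against the page's base URL with urljoin from the standard library:

from urllib.parse import urljoin

base_url = "https://example.com"
for link in soup.find_all("a"):
    href = link.get("href")
    if href:  # some <a> tags have no href attribute
        print(urljoin(base_url, href))  # e.g. /about -> https://example.com/about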
Step 6: Storing Scraped Data
After scraping the data, you can store it in a structured format, like a CSV file.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example-blog.com"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    # Find all article titles and links
    articles = soup.find_all("article")
    with open("scraped_articles.csv", mode="w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Link"])
        for article in articles:
            heading = article.find("h2")
            anchor = article.find("a")
            if heading and anchor:  # skip articles missing a title or link
                writer.writerow([heading.text, anchor["href"]])
    print("Data scraped and saved to 'scraped_articles.csv'")
Exercises
- Scrape a website's data and save it to a CSV file
- Scrape news headlines from a news website
Challenges to Tackle
- Handle Dynamic Content: content rendered by JavaScript never appears in the HTML that requests returns, so it needs a browser automation tool such as Selenium or Playwright
- Pagination: follow "next page" links so you can scrape listings that span multiple pages (see the sketch after this list)
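As a starting point for the pagination challenge, here is a hedged sketch that follows rel="next" links until none remain. The starting URL and the assumption that the site marks its next-page link with rel="next" are both placeholders; real sites vary:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example-blog.com/page/1"  # hypothetical starting page

while url:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Process the current page (here: print every <h2> heading)
    for heading in soup.find_all("h2"):
        print(heading.get_text(strip=True))

    # Follow the "next page" link if one exists; otherwise stop
    next_link = soup.find("a", attrs={"rel": "next"})
    url = urljoin(url, next_link["href"]) if next_link else None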
Important Considerations
Always respect the website's scraping policies (check its robots.txt and terms of service), and handle missing data gracefully so a single malformed entry doesn't crash your scraper.
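One concrete way to respect those policies is to check robots.txt before fetching, using the standard library's urllib.robotparser; a minimal sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

page = "https://example.com/some-page"
if rp.can_fetch("*", page):  # "*" means any user agent
    print(f"Allowed to fetch {page}")
else:
    print(f"Disallowed by robots.txt: {page}")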