Day 18: Web Scraping with BeautifulSoup
Learn how to scrape data from websites using Python’s BeautifulSoup library
Topics to Learn:
- Introduction to Web Scraping
- Using requests and BeautifulSoup
- Parsing HTML
Steps to Scrape a Website:
Step 1: Install Required Libraries
To start web scraping, you need to install the following libraries:
pip install requests beautifulsoup4
Step 2: Making a Request to a Website
Use the requests library to fetch the HTML content of the webpage you want to scrape.
import requests

url = "https://example.com"
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    print(html_content)  # This will print the HTML content of the page
else:
    print(f"Failed to retrieve the webpage: {response.status_code}")
Step 3: Parsing HTML with BeautifulSoup
After fetching the HTML, use BeautifulSoup to parse it and navigate the DOM (Document Object Model) to find specific data.
from bs4 import BeautifulSoup
# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
# Print the prettified HTML (for better readability)
print(soup.prettify())
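Once parsed, the soup object also supports direct navigation of the tree, which is handy for quick inspection. For example:

# The first <title> tag and its text content (if the page has one)
if soup.title:
    print(soup.title)
    print(soup.title.string)

# The first <p> tag, if any, and the name of its parent tag
first_paragraph = soup.find("p")
if first_paragraph:
    print(first_paragraph.parent.name)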
Step 4: Extracting Data
You can now extract specific data using BeautifulSoup's methods such as find(), find_all(), etc.
headings = soup.find_all("h1")
for heading in headings:
    print(heading.text)
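find_all() can also filter by attributes, and soup.select() accepts CSS selectors; both are part of BeautifulSoup's standard API. In this sketch the class name post-title is a made-up placeholder:

# Filter by class attribute (class_ has a trailing underscore because
# "class" is a reserved word in Python)
titles = soup.find_all("h2", class_="post-title")

# The same lookup written as a CSS selector
titles = soup.select("h2.post-title")

for title in titles:
    print(title.get_text(strip=True))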
Step 5: Extracting Attributes
You can also extract attributes (like href or src) from tags.
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
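href values are frequently relative paths (like /about), so it is common to resolve them against the page's base URL with urljoin from the standard library:

from urllib.parse import urljoin

base_url = "https://example.com"
for link in soup.find_all("a"):
    href = link.get("href")
    if href:  # some <a> tags have no href attribute
        print(urljoin(base_url, href))  # e.g. /about -> https://example.com/about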
Step 6: Storing Scraped Data
After scraping the data, you can store it in a structured format, like a CSV file.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example-blog.com"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    # Find all article titles and links
    articles = soup.find_all("article")
    with open("scraped_articles.csv", mode="w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Link"])
        for article in articles:
            heading = article.find("h2")
            anchor = article.find("a")
            if heading and anchor:  # skip articles missing a title or link
                writer.writerow([heading.text, anchor["href"]])
    print("Data scraped and saved to 'scraped_articles.csv'")
Exercises
- Scrape a website's data and save it to a CSV file
- Scrape news headlines from a news website
Challenges to Tackle
- Handle Dynamic Content: content rendered by JavaScript never appears in the HTML that requests returns, so it needs a browser automation tool such as Selenium or Playwright
- Pagination: follow "next page" links so you can scrape listings that span multiple pages (see the sketch after this list)
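As a starting point for the pagination challenge, here is a hedged sketch that follows rel="next" links until none remain. The starting URL and the assumption that the site marks its next-page link with rel="next" are both placeholders; real sites vary:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example-blog.com/page/1"  # hypothetical starting page

while url:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Process the current page (here: print every <h2> heading)
    for heading in soup.find_all("h2"):
        print(heading.get_text(strip=True))

    # Follow the "next page" link if one exists; otherwise stop
    next_link = soup.find("a", attrs={"rel": "next"})
    url = urljoin(url, next_link["href"]) if next_link else None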
Important Considerations
Always respect the website's scraping policies (check its robots.txt and terms of service), and handle missing data gracefully so a single malformed entry doesn't crash your scraper.
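One concrete way to respect those policies is to check robots.txt before fetching, using the standard library's urllib.robotparser; a minimal sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

page = "https://example.com/some-page"
if rp.can_fetch("*", page):  # "*" means any user agent
    print(f"Allowed to fetch {page}")
else:
    print(f"Disallowed by robots.txt: {page}")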