Key Takeaways
- Requests handles HTTP calls; BeautifulSoup parses the HTML response
- Always check robots.txt and the site's terms of service before scraping
- Add delays between requests to avoid overloading servers and getting blocked
- CSS selectors (.select()) are often cleaner than tag-based navigation (.find())
- Selenium or Playwright handle JavaScript-rendered pages that requests cannot
Web scraping is one of the most practical skills a Python developer or data professional can have. Price monitoring, lead generation, research data collection, competitive analysis — if data is on a webpage and not available via API, scraping gets it. This guide shows you the Python stack that handles 90% of scraping tasks, plus when to reach for heavier tools.
The Requests Library: Fetching Web Pages
The requests library makes HTTP calls in Python simple. Install it: pip install requests. A basic scrape:
```python
import requests

url = 'https://example.com/products'
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # raises an error if the request failed
html_content = response.text
```

Always set a User-Agent header — many sites block requests with no User-Agent. raise_for_status() throws an exception for 4xx/5xx responses. For sites requiring authentication, requests.Session() maintains cookies across multiple requests.
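Since requests.Session() comes up whenever cookies matter, here is a minimal sketch; the /login path and form field names below are assumptions you would adapt to the real site:

```python
import requests

def make_session(user_agent='Mozilla/5.0 (compatible; MyBot/1.0)'):
    """Build a reusable session: headers set here apply to every
    request made through it, and cookies from responses are
    stored and re-sent automatically."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session

# Typical authenticated flow (network calls shown commented out;
# '/login' and the form field names are assumptions):
# session = make_session()
# session.post('https://example.com/login',
#              data={'username': 'me', 'password': 'secret'})
# profile = session.get('https://example.com/account')  # cookie sent automatically
```

A session also reuses the underlying TCP connection, which speeds up repeated requests to the same host.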
BeautifulSoup: Parsing HTML to Extract Data
BeautifulSoup parses HTML into a navigable tree. Install: pip install beautifulsoup4 lxml. Parse and extract:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')

# Find by tag
title = soup.find('h1').text.strip()

# Find all elements of a type
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'], link.text)

# CSS selectors (often cleaner)
prices = soup.select('.product-price')
for price in prices:
    print(price.text.strip())

# Find by attribute
img = soup.find('img', {'class': 'hero-image'})
img_url = img['src']
```

The .text property returns all inner text; .get_text(strip=True) removes leading/trailing whitespace.
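One practical caveat: find() and select_one() return None when nothing matches, so chaining .text directly raises AttributeError on a missing element. A defensive helper along these lines avoids that (the sample markup is made up):

```python
from bs4 import BeautifulSoup

def extract_field(soup, selector, default=''):
    """Return stripped text for the first CSS match, or a default
    when the element is missing instead of raising AttributeError."""
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else default

html = '<div class="card"><span class="name"> Widget </span></div>'
soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; 'lxml' also works
print(extract_field(soup, '.name'))                  # -> Widget
print(extract_field(soup, '.price', default='N/A'))  # -> N/A
```

Real pages are messy, so this kind of guard saves a scrape from dying halfway through on one malformed listing.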
Handling Pagination and Multiple Pages
Most data you want spans multiple pages. Two common patterns: URL parameter pagination (?page=1, ?page=2) and next-button navigation. URL parameter approach:
```python
import time
import requests
from bs4 import BeautifulSoup

all_items = []
for page in range(1, 11):  # pages 1-10
    url = f'https://example.com/products?page={page}'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    items = soup.select('.product-card')
    if not items:
        break  # no more pages
    for item in items:
        all_items.append({
            'name': item.select_one('.name').text.strip(),
            'price': item.select_one('.price').text.strip()
        })
    time.sleep(1)  # be polite
```
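The next-button pattern works by extracting the link to the following page on each cycle instead of constructing URLs. A sketch of the link-extraction half, which is the part that varies per site — the a.next-page selector is an assumption to adjust to the real markup:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_next_url(html, base_url):
    """Return the absolute URL of the next page, or None on the
    last page. Relative hrefs are resolved against the current URL."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.select_one('a.next-page')  # assumed class name
    return urljoin(base_url, link['href']) if link else None

# Crawl loop sketch (network calls commented out):
# url = 'https://example.com/products'
# while url:
#     response = requests.get(url, headers=headers)
#     ...extract items from response.text...
#     url = find_next_url(response.text, url)
#     time.sleep(1)

html = '<a class="next-page" href="/products?page=2">Next</a>'
print(find_next_url(html, 'https://example.com/products'))
# -> https://example.com/products?page=2
```

Returning None when the link is absent gives the crawl loop a natural stopping condition, and a max-page cap on top of it is cheap insurance against a site whose "next" link never disappears.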
Storing Scraped Data: CSV, JSON, and Databases
After scraping, store the data in a usable format. CSV suits simple flat data: import csv; writer = csv.DictWriter(f, fieldnames=keys). JSON suits nested data: import json; json.dump(data, f, indent=2). Pandas handles quick cleanup and analysis:
```python
import pandas as pd

df = pd.DataFrame(all_items)
# Strip currency symbols and commas, then convert to float
df['price_clean'] = df['price'].str.replace('[^0-9.]', '', regex=True).astype(float)
df.to_csv('products.csv', index=False)
print(df.describe())
```

For large ongoing scrapes, a SQLite database (built into Python) or PostgreSQL gives you query capabilities and deduplication. Store a hash of the URL or content to avoid re-scraping the same page.
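The URL-hash deduplication idea can be sketched with the stdlib sqlite3 module; the table and column names here are illustrative:

```python
import hashlib
import sqlite3

def make_store(path=':memory:'):
    """Open (or create) the scrape database."""
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pages ('
        'url_hash TEXT PRIMARY KEY, url TEXT, name TEXT, price TEXT)'
    )
    return conn

def save_item(conn, url, name, price):
    """Insert an item, skipping URLs already scraped. The SHA-256 of
    the URL is the primary key, so INSERT OR IGNORE silently drops
    duplicates. Returns True if the row was new."""
    url_hash = hashlib.sha256(url.encode()).hexdigest()
    cur = conn.execute(
        'INSERT OR IGNORE INTO pages VALUES (?, ?, ?, ?)',
        (url_hash, url, name, price),
    )
    conn.commit()
    return cur.rowcount == 1

conn = make_store()
print(save_item(conn, 'https://example.com/p/1', 'Widget', '$9.99'))  # -> True
print(save_item(conn, 'https://example.com/p/1', 'Widget', '$9.99'))  # -> False
```

Checking the return value lets a resumed scrape skip pages it has already stored rather than re-fetching them.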
JavaScript-Rendered Sites: Selenium and Playwright
Requests only fetches the initial HTML response. If a site renders its content with JavaScript after load (React, Angular, Vue apps), that initial HTML is a nearly empty shell. Use Playwright (preferred) or Selenium to automate a real browser:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/spa')
    page.wait_for_selector('.product-list')  # wait until JS has rendered
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, 'lxml')
# continue as normal
```

Install: pip install playwright && playwright install. Playwright is faster and more reliable than Selenium for most tasks.
Ethics, Legality, and Best Practices
Before scraping: check robots.txt (e.g., https://example.com/robots.txt) to see which paths are allowed, and review the site's Terms of Service. Never scrape personal data without a legal basis. Rate-limit yourself: sleep 1-3 seconds between requests, and never scrape at a rate that could degrade the site's performance. Use APIs when available — they're sanctioned, stable, and faster. Sites that block scrapers typically use rate limiting by IP, bot detection (Cloudflare), login requirements, or dynamic content loading. Rotating proxies and user agents can help, but at that point ask whether the data is worth the effort versus buying it or using an API.
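The robots.txt check can be automated with the stdlib urllib.robotparser. Rules are parsed from a string here for illustration; in practice you would point set_url() at the live file and call read():

```python
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(robots_txt, url, user_agent='MyBot'):
    """Return True if robots.txt permits this user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """User-agent: *
Disallow: /admin/
"""
print(allowed_to_fetch(rules, 'https://example.com/products'))     # -> True
print(allowed_to_fetch(rules, 'https://example.com/admin/users'))  # -> False
```

Running every candidate URL through a check like this before requesting it keeps a crawler honest without any manual review of the rules file.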
Frequently Asked Questions
- Is web scraping legal?
- It depends. Scraping publicly available data is generally legal in most jurisdictions, but the Computer Fraud and Abuse Act (US) and GDPR (EU) add restrictions. Always check the site's robots.txt and Terms of Service. Never scrape login-protected areas without authorization. When in doubt, use an official API instead.
- What is the difference between BeautifulSoup and Scrapy?
- BeautifulSoup is a parsing library — you use it alongside requests to scrape individual pages. Scrapy is a full scraping framework with built-in request management, pipelines, middleware, and crawling logic. Use BeautifulSoup for small or one-off scrapes. Use Scrapy for large-scale, production scraping with thousands of pages.
- How do I handle login-required pages?
- Use requests.Session() to maintain cookies. Log in once using session.post() with the form data, then subsequent session.get() calls include the session cookie automatically. For complex authentication, Playwright's browser automation handles it more reliably.
- How do I avoid getting blocked while scraping?
- Add delays between requests, rotate user agents, use residential proxies for large-scale scrapes, and avoid predictable patterns like scraping every item in alphabetical order. The most reliable approach is to scrape slowly and politely.
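The delay-and-rotation advice in the answer above can be sketched as follows; the User-Agent pool is purely illustrative:

```python
import random
import time

# Hypothetical pool of User-Agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def polite_headers():
    """Pick a random User-Agent for each request."""
    return {'User-Agent': random.choice(USER_AGENTS)}

def polite_pause(base=1.0, jitter=2.0):
    """Sleep for a randomized interval (base to base + jitter seconds)
    so request timing isn't a perfectly regular, bot-like pattern.
    Returns the delay actually used."""
    delay = base + random.random() * jitter
    time.sleep(delay)
    return delay
```

In a crawl loop you would call requests.get(url, headers=polite_headers()) followed by polite_pause() on each iteration.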
Ready to Level Up Your Skills?
Python, data collection, cleaning, and analysis are all covered in our hands-on bootcamp. Build real skills with real projects. Next cohorts October 2026 in Denver, NYC, Dallas, LA, and Chicago. Only $1,490.
View Bootcamp Details