Web Scraping with Python [2026]: BeautifulSoup and Requests

Learn web scraping with Python using requests and BeautifulSoup. Collect data from websites, handle pagination, and store results cleanly.

Key Takeaways

  • Requests handles HTTP calls; BeautifulSoup parses the HTML response
  • Always check robots.txt and the site's terms of service before scraping
  • Add delays between requests to avoid overloading servers and getting blocked
  • CSS selectors (.select()) are often cleaner than tag-based navigation (.find())
  • Selenium or Playwright handle JavaScript-rendered pages that requests cannot

Web scraping is one of the most practical skills a Python developer or data professional can have. Price monitoring, lead generation, research data collection, competitive analysis — if data is on a webpage and not available via API, scraping gets it. This guide shows you the Python stack that handles 90% of scraping tasks, plus when to reach for heavier tools.

The Requests Library: Fetching Web Pages

The requests library makes HTTP calls in Python simple. Install it: pip install requests. A basic scrape:

import requests

url = 'https://example.com/products'
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # raises error if request failed
html_content = response.text

Always set a User-Agent header; many sites block requests that arrive without one. raise_for_status() raises an exception for 4xx/5xx responses. For sites requiring authentication, requests.Session() maintains cookies across multiple requests, as sketched below.
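
If a site requires login, a minimal session flow looks like the sketch below. The login URL and form field names here are hypothetical; inspect the site's actual login form first:

import requests

session = requests.Session()

# Hypothetical login endpoint and form field names -- check the real form
login_url = 'https://example.com/login'
credentials = {'username': 'me@example.com', 'password': 'secret'}

response = session.post(login_url, data=credentials)
response.raise_for_status()

# The session cookie is now sent automatically on every subsequent request
profile = session.get('https://example.com/account')
print(profile.status_code)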

BeautifulSoup: Parsing HTML to Extract Data

BeautifulSoup parses HTML into a navigable tree. Install: pip install beautifulsoup4 lxml. Parse and extract:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')

# Find by tag
title = soup.find('h1').text.strip()

# Find all elements of a type
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'], link.text)

# CSS selectors (often cleaner)
prices = soup.select('.product-price')
for price in prices:
    print(price.text.strip())

# Find by attribute (class_ avoids clashing with Python's class keyword)
img = soup.find('img', class_='hero-image')
img_url = img['src'] if img else None  # find() returns None when nothing matches

The .text property returns all inner text, including text from nested tags; .get_text(strip=True) does the same but strips whitespace from each text fragment before joining them.
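
A quick illustration of the difference on a small snippet:

from bs4 import BeautifulSoup

snippet = '<div>  Price: <span> $19.99 </span></div>'
div = BeautifulSoup(snippet, 'lxml').div

print(repr(div.text))                 # '  Price:  $19.99 ' (whitespace preserved)
print(div.get_text(strip=True))       # 'Price:$19.99' (fragments stripped, joined with '')
print(div.get_text(' ', strip=True))  # 'Price: $19.99' (explicit separator)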

Handling Pagination and Multiple Pages

Most data you want spans multiple pages. Two common patterns: URL parameter pagination (?page=1, ?page=2) and next-button navigation (sketched after the example below). The URL parameter approach:

import time
import requests
from bs4 import BeautifulSoup

all_items = []
for page in range(1, 11):  # pages 1-10
    url = f'https://example.com/products?page={page}'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    items = soup.select('.product-card')
    if not items:
        break  # no more pages
    for item in items:
        all_items.append({
            'name': item.select_one('.name').text.strip(),
            'price': item.select_one('.price').text.strip()
        })
    time.sleep(1)  # be polite
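
The next-button pattern follows the page's own "next" link instead of constructing URLs. A minimal sketch reusing the imports and headers above; the a.next-page selector is hypothetical, so inspect the real page for the actual one:

from urllib.parse import urljoin

url = 'https://example.com/products'
while url:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # ... extract items exactly as above ...
    next_link = soup.select_one('a.next-page')  # hypothetical selector
    # urljoin resolves relative hrefs against the current page URL
    url = urljoin(url, next_link['href']) if next_link else None
    time.sleep(1)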

Storing Scraped Data: CSV, JSON, and Databases

After scraping, store the data in a usable form. CSV suits simple flat data (csv.DictWriter), JSON suits nested data (json.dump(data, f, indent=2)), and pandas handles quick cleanup and analysis:

import pandas as pd

df = pd.DataFrame(all_items)
df['price_clean'] = df['price'].str.replace('[^0-9.]', '', regex=True).astype(float)
df.to_csv('products.csv', index=False)
print(df.describe())
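
Spelled out, the csv and json one-liners look like this (filenames are arbitrary, and all_items is the list built in the pagination example):

import csv
import json

# CSV: one row per scraped item
with open('products_raw.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=all_items[0].keys())
    writer.writeheader()
    writer.writerows(all_items)

# JSON: preserves nesting if items contain lists or dicts
with open('products.json', 'w') as f:
    json.dump(all_items, f, indent=2)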

For large ongoing scrapes, a SQLite database (built into Python) or PostgreSQL gives you query capabilities and deduplication. Store a hash of the URL or content to avoid re-scraping the same page.
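
As a sketch of that deduplication idea using Python's built-in sqlite3 and hashlib (table and column names are illustrative):

import hashlib
import sqlite3

conn = sqlite3.connect('scrape.db')
conn.execute('''CREATE TABLE IF NOT EXISTS pages (
    url_hash TEXT PRIMARY KEY,
    url TEXT,
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
)''')

def already_scraped(url):
    h = hashlib.sha256(url.encode()).hexdigest()
    return conn.execute('SELECT 1 FROM pages WHERE url_hash = ?', (h,)).fetchone() is not None

def mark_scraped(url):
    h = hashlib.sha256(url.encode()).hexdigest()
    conn.execute('INSERT OR IGNORE INTO pages (url_hash, url) VALUES (?, ?)', (h, url))
    conn.commit()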

JavaScript-Rendered Sites: Selenium and Playwright

Requests only gets the initial HTML response. If a site renders its content with JavaScript after that initial load (React, Angular, and Vue apps), the data you want won't be in response.text; you'll get a mostly empty HTML shell. Use Playwright (preferred) or Selenium to automate a real browser:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/spa')
    page.wait_for_selector('.product-list')
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, 'lxml')
# continue as normal

Install: pip install playwright && playwright install. Playwright is faster and more reliable than Selenium for most tasks.

Ethics, Legality, and Best Practices

Before scraping:

  • Check robots.txt (e.g., https://example.com/robots.txt) to see what's allowed, and review the site's Terms of Service.
  • Never scrape personal data without a legal basis.
  • Rate-limit yourself: sleep 1-3 seconds between requests (e.g., time.sleep(random.uniform(1, 3))), and don't scrape at a rate that could affect the site's performance.
  • Use APIs when available; they're legal, stable, and faster.

Sites that block scrapers typically rely on rate limiting by IP, bot detection (Cloudflare), login requirements, or dynamic content loading. Rotating proxies and user agents can help, but at that point ask whether the data is worth the effort versus buying it or using an API. A robots.txt check is sketched below.
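
Python's standard library can do the robots.txt check for you. A minimal sketch combining urllib.robotparser with a randomized delay:

import random
import time
import requests
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

url = 'https://example.com/products'
if rp.can_fetch('MyBot/1.0', url):
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'})
    time.sleep(random.uniform(1, 3))  # randomized 1-3 second pause between requests
else:
    print('robots.txt disallows this URL')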

Frequently Asked Questions

Is web scraping legal?
It depends. Scraping publicly available data is generally legal in most jurisdictions, but the Computer Fraud and Abuse Act (US) and GDPR (EU) add restrictions. Always check the site's robots.txt and Terms of Service. Never scrape login-protected areas without authorization. When in doubt, use an official API instead.
What is the difference between BeautifulSoup and Scrapy?
BeautifulSoup is a parsing library — you use it alongside requests to scrape individual pages. Scrapy is a full scraping framework with built-in request management, pipelines, middleware, and crawling logic. Use BeautifulSoup for small or one-off scrapes. Use Scrapy for large-scale, production scraping with thousands of pages.
How do I handle login-required pages?
Use requests.Session() to maintain cookies. Log in once using session.post() with the form data, then subsequent session.get() calls include the session cookie automatically. For complex authentication, Playwright's browser automation handles it more reliably.
How do I avoid getting blocked while scraping?
Add delays between requests, rotate user agents, use residential proxies for large-scale scrapes, and avoid predictable patterns like scraping every item in alphabetical order. The most reliable approach is to scrape slowly and politely.
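
As a small illustration of user-agent rotation with randomized pacing; the agent strings below are placeholder examples, so substitute current real browser strings:

import random
import time
import requests

# Placeholder browser-style strings -- use current, real user agents
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

def polite_get(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers)
    time.sleep(random.uniform(2, 5))  # slow, non-uniform pacing
    return response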

Ready to Level Up Your Skills?

Python, data collection, cleaning, and analysis are all covered in our hands-on bootcamp. Build real skills with real projects. Next cohorts October 2026 in Denver, NYC, Dallas, LA, and Chicago. Only $1,490.

View Bootcamp Details

About the Author

Bo Peng is an AI Instructor and Founder of Precision AI Academy. He has trained 400+ professionals in AI, machine learning, and cloud technologies. His bootcamps run in Denver, NYC, Dallas, LA, and Chicago.