Key Takeaways
- Requests handles HTTP calls; BeautifulSoup parses the HTML response
- Always check robots.txt and the site's terms of service before scraping
- Add delays between requests to avoid overloading servers and getting blocked
- CSS selectors (.select()) are often cleaner than tag-based navigation (.find())
- Selenium or Playwright handle JavaScript-rendered pages that requests cannot
Web scraping is one of the most practical skills a Python developer or data professional can have. Price monitoring, lead generation, research data collection, competitive analysis — if data is on a webpage and not available via API, scraping gets it. This guide shows you the Python stack that handles 90% of scraping tasks, plus when to reach for heavier tools.
The Requests Library: Fetching Web Pages
The requests library makes HTTP calls in Python simple. Install it: pip install requests. A basic scrape:
```python
import requests

url = 'https://example.com/products'
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # raises an error if the request failed
html_content = response.text
```

Always set a User-Agent header — many sites block requests with no User-Agent. raise_for_status() throws an exception for 4xx/5xx responses. For sites requiring authentication, requests.Session() maintains cookies across multiple requests.
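Since requests.Session() comes up whenever cookies matter, here is a minimal sketch; the /login path and form field names below are assumptions you would adapt to the real site:

```python
import requests

def make_session(user_agent='Mozilla/5.0 (compatible; MyBot/1.0)'):
    """Build a reusable session: headers set here apply to every
    request made through it, and cookies from responses are
    stored and re-sent automatically."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session

# Typical authenticated flow (network calls shown commented out;
# '/login' and the form field names are assumptions):
# session = make_session()
# session.post('https://example.com/login',
#              data={'username': 'me', 'password': 'secret'})
# profile = session.get('https://example.com/account')  # cookie sent automatically
```

A session also reuses the underlying TCP connection, which speeds up repeated requests to the same host.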
BeautifulSoup: Parsing HTML to Extract Data
BeautifulSoup parses HTML into a navigable tree. Install: pip install beautifulsoup4 lxml. Parse and extract:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')

# Find by tag
title = soup.find('h1').text.strip()

# Find all elements of a type
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'], link.text)

# CSS selectors (often cleaner)
prices = soup.select('.product-price')
for price in prices:
    print(price.text.strip())

# Find by attribute
img = soup.find('img', {'class': 'hero-image'})
img_url = img['src']
```

The .text property returns all inner text; .get_text(strip=True) removes leading/trailing whitespace.
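One practical caveat: find() and select_one() return None when nothing matches, so chaining .text directly raises AttributeError on a missing element. A defensive helper along these lines avoids that (the sample markup is made up):

```python
from bs4 import BeautifulSoup

def extract_field(soup, selector, default=''):
    """Return stripped text for the first CSS match, or a default
    when the element is missing instead of raising AttributeError."""
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else default

html = '<div class="card"><span class="name"> Widget </span></div>'
soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; 'lxml' also works
print(extract_field(soup, '.name'))                  # -> Widget
print(extract_field(soup, '.price', default='N/A'))  # -> N/A
```

Real pages are messy, so this kind of guard saves a scrape from dying halfway through on one malformed listing.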
Handling Pagination and Multiple Pages
Most data you want spans multiple pages. Two common patterns: URL parameter pagination (?page=1, ?page=2) and next-button navigation. URL parameter approach:
```python
import time
import requests
from bs4 import BeautifulSoup

all_items = []
for page in range(1, 11):  # pages 1-10
    url = f'https://example.com/products?page={page}'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    items = soup.select('.product-card')
    if not items:
        break  # no more pages
    for item in items:
        all_items.append({
            'name': item.select_one('.name').text.strip(),
            'price': item.select_one('.price').text.strip()
        })
    time.sleep(1)  # be polite
```
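The next-button pattern works by extracting the link to the following page on each cycle instead of constructing URLs. A sketch of the link-extraction half, which is the part that varies per site — the a.next-page selector is an assumption to adjust to the real markup:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_next_url(html, base_url):
    """Return the absolute URL of the next page, or None on the
    last page. Relative hrefs are resolved against the current URL."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.select_one('a.next-page')  # assumed class name
    return urljoin(base_url, link['href']) if link else None

# Crawl loop sketch (network calls commented out):
# url = 'https://example.com/products'
# while url:
#     response = requests.get(url, headers=headers)
#     ...extract items from response.text...
#     url = find_next_url(response.text, url)
#     time.sleep(1)

html = '<a class="next-page" href="/products?page=2">Next</a>'
print(find_next_url(html, 'https://example.com/products'))
# -> https://example.com/products?page=2
```

Returning None when the link is absent gives the crawl loop a natural stopping condition, and a max-page cap on top of it is cheap insurance against a site whose "next" link never disappears.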
Storing Scraped Data: CSV, JSON, and Databases
After scraping, store the data in a usable format. CSV suits simple flat data: import csv; writer = csv.DictWriter(f, fieldnames=keys). JSON suits nested data: import json; json.dump(data, f, indent=2). Pandas handles quick cleanup and analysis:
```python
import pandas as pd

df = pd.DataFrame(all_items)
# Strip currency symbols and commas, then convert to float
df['price_clean'] = df['price'].str.replace('[^0-9.]', '', regex=True).astype(float)
df.to_csv('products.csv', index=False)
print(df.describe())
```

For large ongoing scrapes, a SQLite database (built into Python) or PostgreSQL gives you query capabilities and deduplication. Store a hash of the URL or content to avoid re-scraping the same page.
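The URL-hash deduplication idea can be sketched with the stdlib sqlite3 module; the table and column names here are illustrative:

```python
import hashlib
import sqlite3

def make_store(path=':memory:'):
    """Open (or create) the scrape database."""
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pages ('
        'url_hash TEXT PRIMARY KEY, url TEXT, name TEXT, price TEXT)'
    )
    return conn

def save_item(conn, url, name, price):
    """Insert an item, skipping URLs already scraped. The SHA-256 of
    the URL is the primary key, so INSERT OR IGNORE silently drops
    duplicates. Returns True if the row was new."""
    url_hash = hashlib.sha256(url.encode()).hexdigest()
    cur = conn.execute(
        'INSERT OR IGNORE INTO pages VALUES (?, ?, ?, ?)',
        (url_hash, url, name, price),
    )
    conn.commit()
    return cur.rowcount == 1

conn = make_store()
print(save_item(conn, 'https://example.com/p/1', 'Widget', '$9.99'))  # -> True
print(save_item(conn, 'https://example.com/p/1', 'Widget', '$9.99'))  # -> False
```

Checking the return value lets a resumed scrape skip pages it has already stored rather than re-fetching them.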
JavaScript-Rendered Sites: Selenium and Playwright
Requests only fetches the initial HTML response. If a site renders its content with JavaScript after load (React, Angular, Vue apps), that initial HTML is a nearly empty shell. Use Playwright (preferred) or Selenium to automate a real browser:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/spa')
    page.wait_for_selector('.product-list')  # wait until JS has rendered
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, 'lxml')
# continue as normal
```

Install: pip install playwright && playwright install. Playwright is faster and more reliable than Selenium for most tasks.
Ethics, Legality, and Best Practices
Before scraping: check robots.txt (e.g., https://example.com/robots.txt) to see which paths are allowed, and review the site's Terms of Service. Never scrape personal data without a legal basis. Rate-limit yourself: sleep 1-3 seconds between requests, and never scrape at a rate that could degrade the site's performance. Use APIs when available — they're sanctioned, stable, and faster. Sites that block scrapers typically use rate limiting by IP, bot detection (Cloudflare), login requirements, or dynamic content loading. Rotating proxies and user agents can help, but at that point ask whether the data is worth the effort versus buying it or using an API.
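The robots.txt check can be automated with the stdlib urllib.robotparser. Rules are parsed from a string here for illustration; in practice you would point set_url() at the live file and call read():

```python
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(robots_txt, url, user_agent='MyBot'):
    """Return True if robots.txt permits this user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """User-agent: *
Disallow: /admin/
"""
print(allowed_to_fetch(rules, 'https://example.com/products'))     # -> True
print(allowed_to_fetch(rules, 'https://example.com/admin/users'))  # -> False
```

Running every candidate URL through a check like this before requesting it keeps a crawler honest without any manual review of the rules file.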
Frequently Asked Questions
- Is web scraping legal?
- It depends. Scraping publicly available data is generally legal in most jurisdictions, but the Computer Fraud and Abuse Act (US) and GDPR (EU) add restrictions. Always check the site's robots.txt and Terms of Service. Never scrape login-protected areas without authorization. When in doubt, use an official API instead.
- What is the difference between BeautifulSoup and Scrapy?
- BeautifulSoup is a parsing library — you use it alongside requests to scrape individual pages. Scrapy is a full scraping framework with built-in request management, pipelines, middleware, and crawling logic. Use BeautifulSoup for small or one-off scrapes. Use Scrapy for large-scale, production scraping with thousands of pages.
- How do I handle login-required pages?
- Use requests.Session() to maintain cookies. Log in once using session.post() with the form data, then subsequent session.get() calls include the session cookie automatically. For complex authentication, Playwright's browser automation handles it more reliably.
- How do I avoid getting blocked while scraping?
- Add delays between requests, rotate user agents, use residential proxies for large-scale scrapes, and avoid predictable patterns like scraping every item in alphabetical order. The most reliable approach is to scrape slowly and politely.
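The delay-and-rotation advice in the answer above can be sketched as follows; the User-Agent pool is purely illustrative:

```python
import random
import time

# Hypothetical pool of User-Agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def polite_headers():
    """Pick a random User-Agent for each request."""
    return {'User-Agent': random.choice(USER_AGENTS)}

def polite_pause(base=1.0, jitter=2.0):
    """Sleep for a randomized interval (base to base + jitter seconds)
    so request timing isn't a perfectly regular, bot-like pattern.
    Returns the delay actually used."""
    delay = base + random.random() * jitter
    time.sleep(delay)
    return delay
```

In a crawl loop you would call requests.get(url, headers=polite_headers()) followed by polite_pause() on each iteration.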
Ready to Level Up Your Skills?
Python, data collection, cleaning, and analysis are all covered in our hands-on bootcamp. Build real skills with real projects. Next cohorts October 2026 in Denver, NYC, Dallas, LA, and Chicago. Only $1,490.
View Bootcamp Details