# How to scrape web data with Beautiful Soup: step-by-step guide in 2026

> Source: <https://spidra.io/blog/web-scraping-with-beautifulsoup>
> Published: 2026-06-27 00:00:00+00:00

There is a lot of data on the internet that is not available through an API. Product prices, news articles, job listings, research data. Most of it lives in HTML pages rather than structured endpoints. Web scraping is how you get to it programmatically.

In this guide you will learn how to scrape web pages using Python's `requests` library and Beautiful Soup. We use a real scraping practice site throughout so every code example produces actual output.

## What is Beautiful Soup and why use it

[Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library that parses [HTML](https://spidra.io/blog/html-vs-markdown-for-ai-llms) and XML documents and gives you a clean interface to navigate and search the document tree. It does not fetch web pages. That is the job of `requests`. Once you have the HTML, Beautiful Soup makes it straightforward to pull out the data you need.

It is a good first tool for web scraping because the learning curve is gentle, it handles messy HTML gracefully, and it works well with other libraries. Pair it with `requests` for static pages, with `Selenium` or `Playwright` when you need JavaScript rendering, and with `Scrapy` when you need to crawl an entire site.

The limitation to understand early: Beautiful Soup is a parser, not a browser. It reads whatever HTML your `requests` call returns. If a page loads its content via JavaScript after the initial HTML is delivered, that content is not in the HTML Beautiful Soup receives, so your searches come back empty. This is a common source of confusion and we cover it in the challenges section below.

## Prerequisites

You need Python 3.8 or higher. Check your version:

```
python --version
```

Create and activate a virtual environment before installing anything:

```
# macOS / Linux
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate
```

Install the required libraries:

```
pip install requests beautifulsoup4 lxml
```

`requests` handles HTTP. `beautifulsoup4` is the parser. `lxml` is a fast C-based parser that Beautiful Soup can use under the hood and also provides XPath support when you need it.

## The target page

We scrape `books.toscrape.com` throughout this guide. It is a sandbox site built specifically for scraping practice, publicly accessible and stable, designed to let you experiment without worrying about rate limits or legal concerns. Each page lists 20 books with a title, price, star rating, and availability.

```
https://books.toscrape.com/catalogue/page-1.html
```

## Step 1: download the page with requests

Start by fetching the HTML. The `get()` method sends an HTTP GET request and returns a response object. The `text` attribute on that object contains the raw HTML:

``` python
import requests

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)

print(response.status_code)
print(response.text[:500])
```

A `200` status code means the request succeeded. Anything in the 4xx or 5xx range is an error. The `text[:500]` slice just prints the first 500 characters so you can confirm you got real HTML back rather than a block page.

Adding a realistic User-Agent header reduces the chance of getting blocked on sites that check for it:

``` python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)
```
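If you send several requests, one way to avoid repeating the headers is a `requests.Session`. The sketch below is an illustration rather than part of the scraper built in this guide; the `fetch` helper and its 10-second timeout are assumptions chosen for the example.

```python
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# A Session reuses the connection and sends these headers with every request.
session = requests.Session()
session.headers.update(HEADERS)


def fetch(url, timeout=10):
    """Hypothetical helper: return the page HTML or raise on a 4xx/5xx status."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for error responses
    return response.text
```

Calling `fetch('https://books.toscrape.com/catalogue/page-1.html')` would then return the same HTML as the `requests.get()` call above, with the timeout preventing a stalled connection from hanging the script.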

## Step 2: parse the HTML with Beautiful Soup

Once you have the HTML, pass it to the `BeautifulSoup` constructor along with the parser you want to use. We use `lxml` because it is fast and handles malformed HTML well:

``` python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')
```

`soup` is now a navigable tree of Python objects representing the entire HTML document. You can search it, navigate it, and extract text and attributes from it.
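To get a feel for that tree without touching the network, here is a minimal sketch that parses a small hand-written document. The HTML string is made up for illustration, and the example uses Python's built-in `html.parser` so it runs even if `lxml` is not installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, used here instead of a live page.
html = """
<html>
  <head><title>Demo shop</title></head>
  <body>
    <h1>Books</h1>
    <p class="intro">Two titles in stock.</p>
  </body>
</html>
"""

# 'html.parser' ships with Python; swap in 'lxml' if it is installed.
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)        # text inside <title>
print(soup.h1.get_text())       # text of the first <h1>
print(soup.find('p')['class'])  # class values come back as a list
```

Tag names work as attributes (`soup.title`, `soup.h1`) and return the first matching element, while attributes of a tag are read with dictionary-style indexing.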

## Step 3: inspect the page to find your selectors

Before writing extraction code, you need to know what selectors to use. Open the target page in Chrome, right-click the element you want to extract, and select Inspect. The browser's DevTools panel opens with the HTML for that element highlighted.

On Books to Scrape, each book is an `<article>` element with the class `product_pod`. Inside it you will find:

- `<h3><a title="Book Title">`: the book title lives in the `title` attribute
- `<p class="price_color">`: the price including the currency symbol
- `<p class="star-rating {One|Two|Three|Four|Five}">`: the rating encoded in the class name
- `<p class="instock availability">`: stock status

Knowing this structure is what you need to write the extraction code.

## Step 4: extract data from the page

### Get an element by HTML tag

`find()` returns the first matching element. `find_all()` returns a list of all matching elements. To get all book containers:

``` python
books = soup.find_all('article', class_='product_pod')
print(f'Found {len(books)} books')
# Output: Found 20 books
```

### Get an element by CSS class

Prefix the class name with a dot in `select()` calls, or pass it as `class_=` in `find()` calls. To get the price of the first book:

``` python
first_book = books[0]
price = first_book.find('p', class_='price_color').text.strip()
print(price)
# Output: £51.77
```
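The two styles can be compared side by side. This is a minimal sketch on a hand-written snippet that mimics one book card; the markup is an assumption based on the structure described in Step 3, not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics one book card on the target page.
html = """
<article class="product_pod">
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() takes the class as a keyword argument ...
price_via_find = soup.find('p', class_='price_color').text.strip()

# ... while select_one() takes a CSS selector, so the class gets a dot prefix.
price_via_select = soup.select_one('article.product_pod p.price_color').text.strip()

print(price_via_find == price_via_select)
```

Both calls reach the same element. The CSS form tends to read better once you need to express nesting, such as "a price inside a product card", in a single string.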

### Get an element by ID

When an element has a unique ID, `select_one()` with a `#` prefix is the cleanest way to find it. This page has a wrapper with the ID `default` around the main content:

``` python
main = soup.select_one('#default')
print(main.name)
```

### Get an element by attribute

Square bracket notation in CSS selectors matches on attribute values, while indexing a tag like a dictionary reads the value of one of its attributes. The book title is stored in the `title` attribute of the `<a>` tag inside each `<h3>`:

``` python
title = first_book.find('h3').find('a')['title']
print(title)
# Output: A Light in the Attic
```
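For the CSS side of attribute matching, the following sketch shows square brackets inside `select()` calls. The two links and their `href` values are invented sample data.

```python
from bs4 import BeautifulSoup

# Invented sample: one link with a title attribute, one without.
html = """
<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="no-title/index.html">Untitled link</a></h3>
"""
soup = BeautifulSoup(html, 'html.parser')

# a[title] matches only <a> tags that carry a title attribute.
titled_links = soup.select('h3 > a[title]')
titles = [a['title'] for a in titled_links]

# a[title="..."] matches on the attribute's exact value.
exact = soup.select_one('a[title="A Light in the Attic"]')

# .get() returns None instead of raising KeyError when the attribute is absent.
missing = soup.select('h3 > a')[1].get('title')

print(titles, exact['href'], missing)
```

Using `.get('title')` rather than `['title']` is a reasonable default whenever an attribute might be absent on some elements.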

### Get an element using XPath

For XPath queries you need `lxml` directly alongside Beautiful Soup. Right-click an element in DevTools, choose Copy, then Copy XPath to get the path. Here is how to use it:

``` python
from bs4 import BeautifulSoup
from lxml import etree

dom = etree.HTML(str(soup))
first_title = dom.xpath('//article[@class="product_pod"]/h3/a/@title')[0]
print(first_title)
# Output: A Light in the Attic
```

XPath is more powerful than CSS selectors for complex traversals, but CSS selectors are usually cleaner for straightforward extraction.

## Step 5: scrape the full page

Now that you know how to find each field, combine everything into a single scraper:

``` python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)

if response.status_code != 200:
    print(f'Request failed: {response.status_code}')
else:
    soup = BeautifulSoup(response.text, 'lxml')
    books_data = []

    for article in soup.find_all('article', class_='product_pod'):
        title    = article.find('h3').find('a')['title']
        price    = article.find('p', class_='price_color').text.strip()
        rating   = article.find('p', class_='star-rating')['class'][1]
        in_stock = 'In stock' in article.find('p', class_='instock').text

        books_data.append({
            'title':    title,
            'price':    price,
            'rating':   rating,
            'in_stock': in_stock,
        })

    for book in books_data[:3]:
        print(book)
# Output:
# {'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}
# {'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'in_stock': True}
# {'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'in_stock': True}
```

The `rating` field comes from the class name. `star-rating Three` has two classes, so `['class'][1]` picks the second one, which is the word that represents the rating. You would map it to a number in post-processing: `{'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}`.
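That post-processing step could look like the sketch below. The `clean_book` helper is hypothetical, and the price conversion is an added assumption: it strips everything except digits and the decimal point, so a stray character from an encoding mismatch would not break the conversion.

```python
import re

RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def clean_book(book):
    """Return a copy of a scraped record with a numeric rating and price."""
    cleaned = dict(book)
    # Unknown rating words become None rather than raising KeyError.
    cleaned['rating'] = RATING_WORDS.get(book['rating'])
    # Keep only digits and the decimal point, then convert to float.
    cleaned['price'] = float(re.sub(r'[^\d.]', '', book['price']))
    return cleaned


sample = {'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}
print(clean_book(sample))
```

Returning a copy keeps the raw scraped values available, which helps when you need to debug a record that did not convert the way you expected.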

## Step 6: export to CSV

Python's built-in `csv` module writes the data to a file you can open in Excel or Google Sheets:

``` python
import csv

csv_file = 'books.csv'

with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'in_stock'])
    writer.writeheader()
    writer.writerows(books_data)

print(f'Saved {len(books_data)} books to {csv_file}')
```
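One thing to keep in mind with CSV is that every value comes back as a string when you read the file again. The sketch below illustrates this with a single sample record and an in-memory buffer standing in for the file, so it has no side effects.

```python
import csv
import io

books_data = [
    {'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['title', 'price', 'rating', 'in_stock'])
writer.writeheader()
writer.writerows(books_data)

# Read it back: every value is now a string, including the boolean.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]['in_stock'], type(rows[0]['in_stock']))

# Convert back explicitly when loading.
in_stock = rows[0]['in_stock'] == 'True'
```

If downstream code depends on booleans or numbers, either convert on load as shown or prefer a typed format.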

## Step 7: export to JSON

For pipelines that consume JSON or need to store the data in a format that preserves the types:

``` python
import json

with open('books.json', 'w', encoding='utf-8') as f:
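    # ensure_ascii=False keeps '£' readable instead of escaping it as '\u00a3'
    json.dump(books_data, f, indent=4, ensure_ascii=False)

print('Saved to books.json')

# books.json now contains:
# [
#     {
#         "title": "A Light in the Attic",
#         "price": "£51.77",
#         "rating": "Three",
#         "in_stock": true
#     },
#     ...
# ]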
```
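To illustrate what "preserves the types" means in practice, here is a small round trip through `json.dumps` and `json.loads` on one sample record. It uses strings in memory rather than a file, and passes `ensure_ascii=False` so the pound sign is written as-is.

```python
import json

books_data = [
    {'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True},
]

# ensure_ascii=False keeps '£' as-is instead of escaping it to '\u00a3'.
text = json.dumps(books_data, indent=4, ensure_ascii=False)
loaded = json.loads(text)

print(type(loaded[0]['in_stock']))  # the boolean survives the round trip
print(loaded == books_data)
```

Unlike the CSV export, the loaded data compares equal to the original list, so no manual conversion is needed.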

## Step 8: handle pagination

Books to Scrape spreads its catalogue across 50 pages. The Next button at the bottom of each page has a predictable pattern: the URL increments from `page-1.html` through `page-50.html`. This function walks through every page and collects all the books:

``` python
import requests, time
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

base_url = 'https://books.toscrape.com/catalogue/page-{}.html'
all_books = []
page = 1

while True:
    response = requests.get(base_url.format(page), headers=headers)

    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.text, 'lxml')
    articles = soup.find_all('article', class_='product_pod')

    if not articles:
        break

    for article in articles:
        title    = article.find('h3').find('a')['title']
        price    = article.find('p', class_='price_color').text.strip()
        rating   = article.find('p', class_='star-rating')['class'][1]
        in_stock = 'In stock' in article.find('p', class_='instock').text
        all_books.append({'title': title, 'price': price, 'rating': rating, 'in_stock': in_stock})

    print(f'Page {page}: {len(all_books)} books collected so far')

    # Check if there is a next page
    next_btn = soup.find('li', class_='next')
    if not next_btn:
        break

    page += 1
    time.sleep(1)  # Be respectful of the server

print(f'\nTotal: {len(all_books)} books')
```

The `time.sleep(1)` between requests gives the server breathing room. Sending requests as fast as possible is what gets scrapers blocked or rate-limited.
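On sites where the page URLs do not follow a predictable pattern, an alternative is to follow the Next link itself. The sketch below assumes pager markup like the hand-written sample, where the link sits inside `<li class="next">`; the `next_page_url` helper is hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written pager markup; the real page's structure is assumed to be similar.
html = """
<ul class="pager">
  <li class="current">Page 1 of 50</li>
  <li class="next"><a href="page-2.html">next</a></li>
</ul>
"""


def next_page_url(current_url, page_html):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(page_html, 'html.parser')
    link = soup.select_one('li.next > a')
    if link is None or not link.get('href'):
        return None
    # The href is relative, so resolve it against the page we just fetched.
    return urljoin(current_url, link['href'])


current = 'https://books.toscrape.com/catalogue/page-1.html'
print(next_page_url(current, html))
```

A crawl loop built on this would keep fetching until the helper returns `None`, which removes the need to know the page count or URL scheme in advance.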

## Common challenges

### Dynamic content

Beautiful Soup reads the HTML that `requests` returns. If a page uses JavaScript to load its data after the initial HTML arrives, `requests` gets the empty skeleton and Beautiful Soup finds nothing to parse. You will see this on modern e-commerce sites, single-page applications, and any page where the content only appears after scrolling or clicking.

The fix is to use a headless browser like Playwright or Selenium to render the page first, then pass the rendered HTML to Beautiful Soup. This adds significant overhead because a full browser instance runs for every page, and sites that use JavaScript-rendered content are also typically the ones with more aggressive bot detection.
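Before reaching for a headless browser, it can help to confirm that missing data really is a rendering problem. The sketch below is a rough heuristic, not a reliable detector: it flags a page when your item selector matches nothing but the HTML contains `<script>` tags. The skeleton markup and the `looks_js_rendered` helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written skeleton of the kind a JavaScript-driven page might return.
skeleton = """
<html>
  <body>
    <div id="app"></div>
    <script src="/static/bundle.js"></script>
  </body>
</html>
"""


def looks_js_rendered(html, item_selector):
    """Heuristic (an assumption, not a rule): no items found but scripts present."""
    soup = BeautifulSoup(html, 'html.parser')
    has_items = bool(soup.select(item_selector))
    has_scripts = bool(soup.find_all('script'))
    return not has_items and has_scripts


print(looks_js_rendered(skeleton, 'article.product_pod'))
```

A positive result only suggests that the content is injected later. Comparing the page source with what DevTools shows in the Elements panel remains the more dependable check.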

### Getting blocked

Sites detect scrapers through a combination of signals: missing or suspicious User-Agent headers, requests arriving too fast, datacenter IP addresses, and missing browser fingerprints. A plain `requests` call with no headers is the most obvious signal you can send.

Common mitigations are adding realistic headers, introducing random delays between requests, and routing through residential proxies. Each adds complexity and maintenance burden. Free proxy lists in particular are unreliable. They have short lifespans and many are already flagged by the time you use them.
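Random delays are the simplest of these to add. Here is a minimal sketch using only the standard library; both helper names and their default values are illustrative assumptions rather than recommended settings.

```python
import random


def polite_delay(base=1.0, jitter=2.0):
    """Return a delay in seconds between base and base + jitter."""
    return base + random.uniform(0, jitter)


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff for retries: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


print(round(polite_delay(), 2))
print([backoff_delay(n) for n in range(4)])
```

In the pagination loop you could call `time.sleep(polite_delay())` in place of a fixed `time.sleep(1)`, and use `backoff_delay(attempt)` between retries after a failed or rate-limited response.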

### Selectors breaking

Site owners update their HTML without any concern for external scrapers. Class names change, element nesting shifts, and IDs get renamed. Any selector you write is a dependency on the current state of a page that you do not control. This is the hidden cost of selector-based scraping: it works until it does not, and it breaks silently.
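One way to make such breakage loud rather than silent is to validate each page's results before saving them. The following is a sketch under assumptions: the `validate_page` helper, the required field names, and the minimum record count are all choices you would adapt to your own scraper.

```python
REQUIRED_FIELDS = ('title', 'price', 'rating')


def validate_page(records, expected_min=1):
    """Raise ValueError if a scraped page looks wrong instead of saving bad data."""
    if len(records) < expected_min:
        raise ValueError(f'Expected at least {expected_min} records, got {len(records)}')
    for index, record in enumerate(records):
        # A field counts as missing when the key is absent or the value is empty.
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            raise ValueError(f'Record {index} is missing {missing}')
    return records
```

For the page scraped in this guide, `validate_page(books_data, expected_min=20)` would fail as soon as a layout change caused the selectors to match fewer than 20 books or to return empty fields.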

### Error handling

Web scraping produces a lot of edge cases. An element you expect to be present is missing on one page. A `find()` returns `None` and the next attribute access throws an `AttributeError`. Always write defensive code when accessing nested elements:

``` python
price_elem = article.find('p', class_='price_color')
price = price_elem.text.strip() if price_elem else None
```
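When the same check repeats for every field, it can be wrapped in small helpers. The two functions below are hypothetical names for illustration, and the sample card deliberately omits the price so the fallback path is visible.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return stripped text of the first match under `parent`, or None."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


def attr_or_none(parent, selector, attribute):
    """Return an attribute of the first match under `parent`, or None."""
    element = parent.select_one(selector)
    return element.get(attribute) if element is not None else None


# Hand-written card with the price deliberately missing.
article = BeautifulSoup(
    '<article class="product_pod"><h3><a title="No Price Book">No Price ...</a></h3></article>',
    'html.parser',
).article

print(text_or_none(article, 'p.price_color'))
print(attr_or_none(article, 'h3 > a', 'title'))
```

The missing price comes back as `None` instead of raising, so the record can still be stored and the gap handled later.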

A try-except inside the extraction loop keeps one bad article from stopping the rest:

``` python
for article in articles:
    try:
        title = article.find('h3').find('a')['title']
        price = article.find('p', class_='price_color').text.strip()
        books_data.append({'title': title, 'price': price})
    except Exception as e:
        print(f'Failed to parse article: {e}')
        continue
```

## The easier approach for more complex targets

Books to Scrape is a scraping sandbox built to be easy. Real targets like [Amazon](https://spidra.io/blog/how-to-scrape-amazon-product-data), [eBay](https://spidra.io/blog/scrape-ebay-product-data), news sites, and job boards have bot detection, JavaScript-rendered content, rotating layouts, and CAPTCHA challenges. The selector maintenance, proxy management, and rendering overhead add up quickly.

The [Spidra API](https://spidra.io/spidra-api) takes a different approach. You describe what you want from a page in plain English and define the output shape in a JSON Schema. Spidra loads the page in a real browser, routes through [residential proxies](https://spidra.io/proxy-scraping), handles CAPTCHA automatically, and returns structured JSON matching your schema. When the site changes its layout, the prompt keeps working because it describes the data, not where it sits in the DOM.

``` python
import requests, time, os, json

API_KEY = os.environ['SPIDRA_API_KEY']
BASE    = 'https://api.spidra.io/api'
HEADERS = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}

SCHEMA = {
    'type': 'object',
    'required': ['books'],
    'properties': {
        'books': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'title':    {'type': 'string'},
                    'price':    {'type': 'string'},
                    'rating':   {'type': 'string'},
                    'in_stock': {'type': 'boolean'},
                }
            }
        }
    }
}

resp = requests.post(f'{BASE}/scrape', headers=HEADERS, json={
    'urls':   [{'url': 'https://books.toscrape.com/catalogue/page-1.html'}],
    'prompt': 'Extract all books on this page with their title, price, star rating as a word, and whether they are in stock',
    'output': 'json',
    'schema': SCHEMA,
})
job_id = resp.json()['jobId']

while True:
    result = requests.get(f'{BASE}/scrape/{job_id}', headers=HEADERS).json()
    if result['status'] == 'completed':
        break
    time.sleep(3)

books = result['result']['content']['books']
print(f'Got {len(books)} books')
print(json.dumps(books[:2], indent=2))
```

Sign up free at [app.spidra.io](https://app.spidra.io/signup). The free plan gives you 300 credits with no card required. If you want to generate a schema from real output rather than writing it by hand, the free [JSON Schema Generator](https://spidra.io/tools/json-schema-generator) builds the structure for you from any JSON sample.
