How to scrape web data with Beautiful Soup: step-by-step guide in 2026

Python's Beautiful Soup library remains a top choice for web scraping in 2026, offering a gentle learning curve and robust HTML parsing. A step-by-step guide demonstrates scraping book data from books.toscrape.com using requests and Beautiful Soup, covering page download, parsing, and common pitfalls like JavaScript-rendered content.

There is a lot of data on the internet that is not available through an API: product prices, news articles, job listings, research data. Most of it lives in HTML pages rather than structured endpoints. Web scraping is how you get to it programmatically.

In this guide you will learn how to scrape web pages using Python's `requests` library and Beautiful Soup. We use a real scraping practice site throughout so every code example produces actual output.

What is Beautiful Soup and why use it

Beautiful Soup (https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library that parses HTML (https://spidra.io/blog/html-vs-markdown-for-ai-llms) and XML documents and gives you a clean interface to navigate and search the document tree. It does not fetch web pages. That is the job of `requests`. Once you have the HTML, Beautiful Soup makes it straightforward to pull out the data you need.

It is a good first tool for web scraping because the learning curve is gentle, it handles messy HTML gracefully, and it works well with other libraries. Pair it with `requests` for static pages, with Selenium or Playwright when you need JavaScript rendering, and with Scrapy when you need to crawl an entire site.

The limitation to understand early: Beautiful Soup is a parser, not a browser. It reads whatever HTML your `requests` call returns. If a page loads its content via JavaScript after the initial HTML is delivered, Beautiful Soup only sees the initial markup, and your searches come back empty. This is a common source of confusion and we cover it in the challenges section below.

Prerequisites

You need Python 3.8 or higher. Check your version:

```
python --version
```

Create and activate a virtual environment before installing anything:

```
# macOS / Linux
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate
```

Install the required libraries:

```
pip install requests beautifulsoup4 lxml
```

`requests` handles HTTP. `beautifulsoup4` is the parser.
`lxml` is a fast C-based parser that Beautiful Soup can use under the hood and also provides XPath support when you need it.

The target page

We scrape books.toscrape.com throughout this guide. It is a sandbox site built specifically for scraping practice: publicly accessible, stable, and designed to let you experiment without worrying about rate limits or legal concerns. Each page lists 20 books with a title, price, star rating, and availability.

https://books.toscrape.com/catalogue/page-1.html

Step 1: download the page with requests

Start by fetching the HTML. The `get` method sends an HTTP GET request and returns a response object. The `text` attribute on that object contains the raw HTML:

```python
import requests

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)

print(response.status_code)
print(response.text[:500])
```

A 200 status code means the request succeeded. Anything in the 4xx or 5xx range is an error. The `text[:500]` slice just prints the first 500 characters so you can confirm you got real HTML back rather than a block page.

Adding a realistic User-Agent header reduces the chance of getting blocked on sites that check for it:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)
```

Step 2: parse the HTML with Beautiful Soup

Once you have the HTML, pass it to the `BeautifulSoup` constructor along with the parser you want to use. We use `lxml` because it is fast and handles malformed HTML well:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')
```

`soup` is now a navigable tree of Python objects representing the entire HTML document. You can search it, navigate it, and extract text and attributes from it.
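To make "navigable tree" concrete without touching the network, here is a minimal offline sketch. The HTML string is a hypothetical snippet shaped like one product card, not markup copied from the site, and it uses Python's built-in `html.parser` so it runs even without `lxml` installed. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like one product card; not copied from the site.
SAMPLE_HTML = """
<html><body>
  <article class="product_pod">
    <h3><a href="index.html" title="A Light in the Attic">A Light in the ...</a></h3>
    <p class="price_color">£51.77</p>
  </article>
</body></html>
"""

# 'html.parser' ships with Python, so no extra parser is required here.
sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

card = sample_soup.find('article')       # first <article> tag in the tree
link = card.h3.a                         # dotted access walks down to nested tags
link_text = link.get_text(strip=True)    # visible text inside the tag
link_title = link['title']               # attributes read like dictionary keys
parent_name = link.parent.name           # every tag knows its parent

print(link_text, '|', link_title, '|', parent_name)
```

The same calls work on the `soup` object built from the live page; only the input differs.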
Step 3: inspect the page to find your selectors

Before writing extraction code, you need to know what selectors to use. Open the target page in Chrome, right-click the element you want to extract, and select Inspect. The browser's DevTools panel opens with the HTML for that element highlighted.

On Books to Scrape, each book is an `<article>` element with the class `product_pod`. Inside it you will find:

- `<h3><a title="Book Title">`: the book title lives in the `title` attribute
- `<p class="price_color">`: the price including the currency symbol
- `<p class="star-rating {One|Two|Three|Four|Five}">`: the rating encoded in the class name
- `<p class="instock availability">`: stock status

This structure is all you need to write the extraction code.

Step 4: extract data from the page

Get an element by HTML tag

`find` returns the first matching element. `find_all` returns a list of all matching elements. To get all book containers:

```python
books = soup.find_all('article', class_='product_pod')
print(f'Found {len(books)} books')
```

```
Found 20 books
```

Get an element by CSS class

Prefix the class name with a dot in `select` calls, or pass it as `class_=` in `find` calls. To get the price of the first book:

```python
first_book = books[0]
price = first_book.find('p', class_='price_color').text.strip()
print(price)
```

```
£51.77
```

Get an element by ID

When an element has a unique ID, `select_one` with a `#` prefix is the cleanest way to find it. This page has a `#default` wrapper on the main content:

```python
main = soup.select_one('#default')
print(main.name)
```

Get an element by attribute

Square bracket notation in CSS selectors (for example `a[title]`) matches on attributes, and subscripting a tag reads an attribute's value. The book title is stored in the `title` attribute of the `<a>` tag inside each `<h3>`:

```python
title = first_book.find('h3').find('a')['title']
print(title)
```

```
A Light in the Attic
```

Get an element using XPath

Beautiful Soup does not support XPath itself, so for XPath queries you use `lxml` directly alongside it. Right-click an element in DevTools, choose Copy, then Copy XPath to get the path.
Here is how to use it:

```python
from bs4 import BeautifulSoup
from lxml import etree

dom = etree.HTML(str(soup))
first_title = dom.xpath('//article[@class="product_pod"]/h3/a/@title')[0]
print(first_title)
```

```
A Light in the Attic
```

XPath is more powerful than CSS selectors for complex traversals, but CSS selectors are usually cleaner for straightforward extraction.

Step 5: scrape the full page

Now that you know how to find each field, combine everything into a single scraper:

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)

if response.status_code != 200:
    print(f'Request failed: {response.status_code}')
else:
    soup = BeautifulSoup(response.text, 'lxml')
    books_data = []

    for article in soup.find_all('article', class_='product_pod'):
        title = article.find('h3').find('a')['title']
        price = article.find('p', class_='price_color').text.strip()
        # The second class on the star-rating element is the rating word
        rating = article.find('p', class_='star-rating')['class'][1]
        in_stock = 'In stock' in article.find('p', class_='instock').text

        books_data.append({
            'title': title,
            'price': price,
            'rating': rating,
            'in_stock': in_stock,
        })

    for book in books_data[:3]:
        print(book)
```

```
{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}
{'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'in_stock': True}
{'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'in_stock': True}
```

The rating field comes from the class name. `star-rating Three` has two classes, so `['class'][1]` picks the second one, which is the word that represents the rating. You would map it to a number in post-processing: `{'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}`.
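The post-processing step mentioned above can be sketched as follows. This is one possible approach, not part of the original scraper: `normalise_book` and `RATING_WORDS` are hypothetical names, and the record layout is assumed to match the dictionaries built in Step 5. The price is pulled out with a regular expression so that the currency symbol, or any stray character left by a mis-detected response encoding, does not matter.

```python
import re

# Mapping from the rating word in the class name to a number.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def normalise_book(book):
    """Return a copy of a scraped record with a numeric price and rating."""
    # Take the first decimal number in the price string and ignore the rest.
    match = re.search(r'\d+(?:\.\d+)?', book['price'])
    return {
        **book,
        'price': float(match.group()) if match else None,
        'rating': RATING_WORDS.get(book['rating']),  # None for unknown words
    }

sample = {'title': 'A Light in the Attic', 'price': '£51.77',
          'rating': 'Three', 'in_stock': True}
clean = normalise_book(sample)
print(clean)
```

Returning a new dictionary rather than mutating the input keeps the raw scraped values available if you need to debug a parse later.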
Step 6: export to CSV

Python's built-in `csv` module writes the data to a file you can open in Excel or Google Sheets:

```python
import csv

csv_file = 'books.csv'

with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'in_stock'])
    writer.writeheader()
    writer.writerows(books_data)

print(f'Saved {len(books_data)} books to {csv_file}')
```

Step 7: export to JSON

For pipelines that consume JSON, or when you need a format that preserves types such as booleans, write JSON instead:

```python
import json

with open('books.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False writes '£' as-is instead of the escape '\u00a3'
    json.dump(books_data, f, indent=4, ensure_ascii=False)

print('Saved to books.json')
```

```
[
    {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "rating": "Three",
        "in_stock": true
    },
    ...
]
```

Step 8: handle pagination

Books to Scrape spreads its catalogue across 50 pages. The page URLs follow a predictable pattern, incrementing from `page-1.html` through `page-50.html`, and each page that has a successor shows a Next button at the bottom. This loop walks through every page and collects all the books:

```python
import requests, time
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

base_url = 'https://books.toscrape.com/catalogue/page-{}.html'
all_books = []
page = 1

while True:
    response = requests.get(base_url.format(page), headers=headers)
    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.text, 'lxml')
    articles = soup.find_all('article', class_='product_pod')
    if not articles:
        break

    for article in articles:
        title = article.find('h3').find('a')['title']
        price = article.find('p', class_='price_color').text.strip()
        rating = article.find('p', class_='star-rating')['class'][1]
        in_stock = 'In stock' in article.find('p', class_='instock').text
        all_books.append({'title': title, 'price': price, 'rating': rating, 'in_stock': in_stock})

    print(f'Page {page}: {len(all_books)} books collected so far')

    # Check if there is a next page
    next_btn = soup.find('li', class_='next')
    if not next_btn:
        break

    page += 1
    time.sleep(1)  # Be respectful of the server

print(f'\nTotal: {len(all_books)} books')
```

The `time.sleep(1)` between requests gives the server breathing room. Sending requests as fast as possible is what gets scrapers blocked or rate-limited.

Common challenges

Dynamic content

Beautiful Soup reads the HTML that `requests` returns. If a page uses JavaScript to load its data after the initial HTML arrives, `requests` gets the empty skeleton and Beautiful Soup finds nothing to parse. You will see this on modern e-commerce sites, single-page applications, and any page where the content only appears after scrolling or clicking.

The fix is to use a headless browser like Playwright or Selenium to render the page first, then pass the rendered HTML to Beautiful Soup. This adds significant overhead because a full browser instance runs for every page, and sites that use JavaScript-rendered content are also typically the ones with more aggressive bot detection.

Getting blocked

Sites detect scrapers through a combination of signals: missing or suspicious User-Agent headers, requests arriving too fast, datacenter IP addresses, and missing browser fingerprints. A plain `requests` call with no headers is the most obvious signal you can send.

Common mitigations are adding realistic headers, introducing random delays between requests, and routing through residential proxies. Each adds complexity and maintenance burden. Free proxy lists in particular are unreliable: they have short lifespans and many are already flagged by the time you use them.

Selectors breaking

Site owners update their HTML without any concern for external scrapers. Class names change, element nesting shifts, and IDs get renamed. Any selector you write is a dependency on the current state of a page that you do not control. This is the hidden cost of selector-based scraping: it works until it does not, and it breaks silently.

Error handling

Web scraping produces a lot of edge cases.
An element you expect to be present is missing on one page. A `find` call returns `None` and the next attribute access throws an `AttributeError`. Always write defensive code when accessing nested elements:

```python
price_elem = article.find('p', class_='price_color')
price = price_elem.text.strip() if price_elem else None
```

A try-except inside the extraction loop keeps one bad article from stopping the rest:

```python
for article in articles:
    try:
        title = article.find('h3').find('a')['title']
        price = article.find('p', class_='price_color').text.strip()
        books_data.append({'title': title, 'price': price})
    except (AttributeError, KeyError, TypeError) as e:
        # Missing tag (None) or missing attribute: log it and move on
        print(f'Failed to parse article: {e}')
        continue
```

The easier approach for more complex targets

Books to Scrape is a scraping sandbox built to be easy. Real targets like Amazon (https://spidra.io/blog/how-to-scrape-amazon-product-data), eBay (https://spidra.io/blog/scrape-ebay-product-data), news sites, and job boards have bot detection, JavaScript-rendered content, rotating layouts, and CAPTCHA challenges. The selector maintenance, proxy management, and rendering overhead add up quickly.

The Spidra API (https://spidra.io/spidra-api) takes a different approach. You describe what you want from a page in plain English and define the output shape in a JSON Schema. Spidra loads the page in a real browser, routes through residential proxies (https://spidra.io/proxy-scraping), handles CAPTCHA automatically, and returns structured JSON matching your schema. When the site changes its layout, the prompt keeps working because it describes the data, not where it sits in the DOM.
```python
import requests, time, os, json

API_KEY = os.environ['SPIDRA_API_KEY']
BASE = 'https://api.spidra.io/api'
HEADERS = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}

SCHEMA = {
    'type': 'object',
    'required': ['books'],
    'properties': {
        'books': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'title': {'type': 'string'},
                    'price': {'type': 'string'},
                    'rating': {'type': 'string'},
                    'in_stock': {'type': 'boolean'},
                }
            }
        }
    }
}

# Submit the scrape job
resp = requests.post(f'{BASE}/scrape', headers=HEADERS, json={
    'urls': [{'url': 'https://books.toscrape.com/catalogue/page-1.html'}],
    'prompt': 'Extract all books on this page with their title, price, star rating as a word, and whether they are in stock',
    'output': 'json',
    'schema': SCHEMA,
})
job_id = resp.json()['jobId']

# Poll until the job reports completion
while True:
    result = requests.get(f'{BASE}/scrape/{job_id}', headers=HEADERS).json()
    if result['status'] == 'completed':
        break
    time.sleep(3)

books = result['result']['content']['books']
print(f'Got {len(books)} books')
print(json.dumps(books[:2], indent=2))
```

Sign up free at app.spidra.io (https://app.spidra.io/signup). The free plan gives you 300 credits with no card required. If you want to generate a schema from real output rather than writing it by hand, the free JSON Schema Generator (https://spidra.io/tools/json-schema-generator) builds the structure for you from any JSON sample.
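One note on the polling loop in the example above: it runs until the job reports `completed`, so a job that never reaches that state would keep the script waiting. A minimal sketch of a bounded poll is shown here. `poll_until_complete` and `fetch_status` are hypothetical names, the only status value taken from the example is `'completed'`, and the `'pending'` value in the stand-in data is a placeholder rather than a documented API status.

```python
import time

def poll_until_complete(fetch_status, interval=3.0, timeout=120.0, sleep=time.sleep):
    """Call fetch_status() until it reports 'completed' or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        if result.get('status') == 'completed':
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Job not completed after {timeout} seconds')
        sleep(interval)

# Stand-in for the HTTP status call: reports 'completed' on the third check.
# 'pending' is a placeholder value, not a documented status.
responses = iter([
    {'status': 'pending'},
    {'status': 'pending'},
    {'status': 'completed', 'result': {}},
])
final = poll_until_complete(lambda: next(responses), interval=0, timeout=5)
print(final['status'])
```

In a real script, `fetch_status` would wrap the `requests.get(...).json()` call from the example, so the timeout logic stays separate from the HTTP details.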