How to scrape web data with Beautiful Soup: step-by-step guide in 2026

Python's Beautiful Soup library remains a top choice for web scraping in 2026, offering a gentle learning curve and robust HTML parsing. This step-by-step guide demonstrates scraping book data from books.toscrape.com using requests and Beautiful Soup, covering page download, parsing, and common pitfalls like JavaScript-rendered content.

There is a lot of data on the internet that is not available through an API. Product prices, news articles, job listings, research data. Most of it lives in HTML pages rather than structured endpoints. Web scraping is how you get to it programmatically.

In this guide you will learn how to scrape web pages using Python's `requests` library and Beautiful Soup. We use a real scraping practice site throughout so every code example produces actual output.

What is Beautiful Soup and why use it

Beautiful Soup (https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library that parses HTML (https://spidra.io/blog/html-vs-markdown-for-ai-llms) and XML documents and gives you a clean interface to navigate and search the document tree. It does not fetch web pages. That is the job of `requests`. Once you have the HTML, Beautiful Soup makes it straightforward to pull out the data you need.

It is a good first tool for web scraping because the learning curve is gentle, it handles messy HTML gracefully, and it works well with other libraries. Pair it with `requests` for static pages, with Selenium or Playwright when you need JavaScript rendering, and with Scrapy when you need to crawl an entire site.

The limitation to understand early: Beautiful Soup is a parser, not a browser. It reads whatever HTML your `requests` call returns. If a page loads its content via JavaScript after the initial HTML is delivered, Beautiful Soup never sees that content, and your searches come back empty. This is a common source of confusion and we cover it in the challenges section below.

Prerequisites

You need Python 3.8 or higher.
Check your version:

```
python --version
```

Create and activate a virtual environment before installing anything.

macOS / Linux:

```
python3 -m venv venv
source venv/bin/activate
```

Windows:

```
python -m venv venv
venv\Scripts\activate
```

Install the required libraries:

```
pip install requests beautifulsoup4 lxml
```

`requests` handles HTTP. `beautifulsoup4` is the parser. `lxml` is a fast C-based parser that Beautiful Soup can use under the hood; used on its own, `lxml` also provides XPath support when you need it.

The target page

We scrape books.toscrape.com throughout this guide. It is a sandbox site built specifically for scraping practice, publicly accessible and stable, designed to let you experiment without worrying about rate limits or legal concerns. Each page lists 20 books with a title, price, star rating, and availability.

https://books.toscrape.com/catalogue/page-1.html

Step 1: download the page with requests

Start by fetching the HTML. The `get` method sends an HTTP GET request and returns a response object. The `text` attribute on that object contains the raw HTML:

```python
import requests

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)

print(response.status_code)  # HTTP status of the request
print(response.text[:500])   # first 500 characters of the HTML
```

A 200 status code means the request succeeded. Anything in the 4xx or 5xx range is an error. The `[:500]` slice just prints the first 500 characters so you can confirm you got real HTML back rather than a block page.

Adding a realistic User-Agent header reduces the chance of getting blocked on sites that check for it:

```python
import requests

# Browser-like headers for sites that inspect the User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)
```

Step 2: parse the HTML with Beautiful Soup

Once you have the HTML, pass it to the `BeautifulSoup` constructor along with the parser you want to use.
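To make the parser argument concrete, here is a minimal sketch that parses one deliberately malformed fragment with each parser and lists what it finds. The helper name `list_item_texts` and the sample markup are illustrative assumptions, not part of this guide's scraper. `html.parser` ships with Python, while `lxml` is the optional package from the install step, so the sketch reports any parser that is not available instead of failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup: the <li> tags are never closed.
MALFORMED = "<ul><li>One<li>Two</ul>"


def list_item_texts(markup, parser):
    """Return the text of every <li> found when parsing with `parser`."""
    soup = BeautifulSoup(markup, parser)
    return [li.get_text() for li in soup.find_all("li")]


if __name__ == "__main__":
    for parser in ("html.parser", "lxml"):
        try:
            print(parser, list_item_texts(MALFORMED, parser))
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            print(parser, "is not installed")
```

Both parsers should find the two list items, but they may repair the broken nesting differently, which is why the same search can return slightly different text depending on the parser you pick.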
We use `lxml` because it is fast and handles malformed HTML well:

```python
from bs4 import BeautifulSoup

# response is the object returned by requests.get in Step 1
soup = BeautifulSoup(response.text, 'lxml')
```

`soup` is now a navigable tree of Python objects representing the entire HTML document. You can search it, navigate it, and extract text and attributes from it.

Step 3: inspect the page to find your selectors

Before writing extraction code, you need to know what selectors to use. Open the target page in Chrome, right-click the element you want to extract, and select Inspect. The browser's DevTools panel opens with the HTML for that element highlighted.

On Books to Scrape, each book is an