{"slug": "how-to-scrape-web-data-with-beautiful-soup-step-by-step-guide-in-2026", "title": "How to scrape web data with Beautiful Soup: step-by-step guide in 2026", "summary": "Python's Beautiful Soup library remains a top choice for web scraping in 2026, offering a gentle learning curve and robust HTML parsing. A step-by-step guide demonstrates scraping book data from books.toscrape.com using requests and Beautiful Soup, covering page download, parsing, and common pitfalls like JavaScript-rendered content.", "body_md": "There is a lot of data on the internet that is not available through an API. Product prices, news articles, job listings, research data. Most of it lives in HTML pages rather than structured endpoints. Web scraping is how you get to it programmatically.\n\nIn this guide you will learn how to scrape web pages using Python's `requests` library and Beautiful Soup. We use a real scraping practice site throughout so every code example produces actual output.\n\n## What is Beautiful Soup and why use it\n\n[Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library that parses [HTML](https://spidra.io/blog/html-vs-markdown-for-ai-llms) and XML documents and gives you a clean interface to navigate and search the document tree. It does not fetch web pages. That is the job of `requests`. Once you have the HTML, Beautiful Soup makes it straightforward to pull out the data you need.\n\nIt is a good first tool for web scraping because the learning curve is gentle, it handles messy HTML gracefully, and it works well with other libraries. Pair it with `requests` for static pages, with `Selenium` or `Playwright` when you need JavaScript rendering, and with `Scrapy` when you need to crawl an entire site.\n\nThe limitation to understand early: Beautiful Soup is a parser, not a browser. It reads whatever HTML your `requests` call returns. 
If a page loads its content via JavaScript after the initial HTML is delivered, Beautiful Soup sees only the empty skeleton and your searches return nothing. This is a common source of confusion and we cover it in the challenges section below.\n\n## Prerequisites\n\nYou need Python 3.8 or higher. Check your version:\n\n```\npython --version\n```\n\nCreate and activate a virtual environment before installing anything:\n\n```\n# macOS / Linux\npython3 -m venv venv\nsource venv/bin/activate\n\n# Windows\npython -m venv venv\nvenv\\Scripts\\activate\n```\n\nInstall the required libraries:\n\n```\npip install requests beautifulsoup4 lxml\n```\n\n`requests` handles HTTP. `beautifulsoup4` is the package that provides Beautiful Soup. `lxml` is a fast C-based parser that Beautiful Soup can use under the hood and also provides XPath support when you need it.\n\n## The target page\n\nWe scrape `books.toscrape.com` throughout this guide. It is a sandbox site built specifically for scraping practice, publicly accessible and stable, designed to let you experiment without worrying about rate limits or legal concerns. Each page lists 20 books with a title, price, star rating, and availability.\n\n```\nhttps://books.toscrape.com/catalogue/page-1.html\n```\n\n## Step 1: download the page with requests\n\nStart by fetching the HTML. The `get()` method sends an HTTP GET request and returns a response object. The `text` attribute on that object contains the decoded HTML as a string:\n\n``` python\nimport requests\n\nurl = 'https://books.toscrape.com/catalogue/page-1.html'\nresponse = requests.get(url)\n\nprint(response.status_code)\nprint(response.text[:500])\n```\n\nA `200` status code means the request succeeded. Anything in the 4xx or 5xx range is an error. 
The `text[:500]` slice prints only the first 500 characters so you can confirm you got real HTML back rather than a block page.\n\nAdding a realistic User-Agent header reduces the chance of getting blocked on sites that check for it:\n\n``` python\nimport requests\n\nheaders = {\n    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',\n    'Accept-Language': 'en-US,en;q=0.9',\n}\n\nurl = 'https://books.toscrape.com/catalogue/page-1.html'\nresponse = requests.get(url, headers=headers)\n```\n\n## Step 2: parse the HTML with Beautiful Soup\n\nOnce you have the HTML, pass it to the `BeautifulSoup` constructor along with the parser you want to use. We use `lxml` because it is fast and handles malformed HTML well:\n\n``` python\nfrom bs4 import BeautifulSoup\n\n# Pass the raw bytes so Beautiful Soup detects the encoding from the document itself\nsoup = BeautifulSoup(response.content, 'lxml')\n```\n\nPassing `response.content` rather than `response.text` avoids mis-decoded characters, such as a garbled `£`, when the server does not declare a charset. `soup` is now a navigable tree of Python objects representing the entire HTML document. You can search it, navigate it, and extract text and attributes from it.\n\n## Step 3: inspect the page to find your selectors\n\nBefore writing extraction code, you need to know what selectors to use. Open the target page in Chrome, right-click the element you want to extract, and select Inspect. The browser's DevTools panel opens with the HTML for that element highlighted.\n\nOn Books to Scrape, each book is an `<article>` element with the class `product_pod`. Inside it you will find:\n\n- `<h3><a title=\"Book Title\">`: the book title lives in the `title` attribute\n- `<p class=\"price_color\">`: the price including the currency symbol\n- `<p class=\"star-rating {One|Two|Three|Four|Five}\">`: the rating encoded in the class name\n- `<p class=\"instock availability\">`: stock status\n\nKnowing this structure is what you need to write the extraction code.\n\n## Step 4: extract data from the page\n\n### Get an element by HTML tag\n\n`find()` returns the first matching element. 
`find_all()` returns a list of all matching elements. To get all book containers:\n\n``` python\nbooks = soup.find_all('article', class_='product_pod')\nprint(f'Found {len(books)} books')\n# Output: Found 20 books\n```\n\n### Get an element by CSS class\n\nPrefix the class name with a dot in `select()` calls, or pass it as `class_=` in `find()` calls. To get the price of the first book:\n\n``` python\nfirst_book = books[0]\nprice = first_book.find('p', class_='price_color').text.strip()\nprint(price)\n# Output: £51.77\n```\n\n### Get an element by ID\n\nWhen an element has a unique ID, `select_one()` with a `#` prefix is the cleanest way to find it. This page has a `#default` wrapper around the main content:\n\n``` python\nmain = soup.select_one('#default')\nprint(main.name)  # prints the tag name of the matched element\n```\n\n### Get an element by attribute\n\nSquare-bracket notation in CSS selectors (for example `a[title]`) matches on attributes, and indexing a tag like a dictionary reads an attribute's value. The book title is stored in the `title` attribute of the `<a>` tag inside each `<h3>`:\n\n``` python\ntitle = first_book.find('h3').find('a')['title']\nprint(title)\n# Output: A Light in the Attic\n```\n\n### Get an element using XPath\n\nBeautiful Soup has no XPath support of its own, so for XPath queries you use `lxml` directly alongside it. Right-click an element in DevTools, choose Copy, then Copy XPath to get the path. 
Here is how to use it:\n\n``` python\nfrom lxml import etree\n\n# Reuses the soup object built in Step 2\ndom = etree.HTML(str(soup))\nfirst_title = dom.xpath('//article[@class=\"product_pod\"]/h3/a/@title')[0]\nprint(first_title)\n# Output: A Light in the Attic\n```\n\nXPath is more powerful than CSS selectors for complex traversals, but CSS selectors are usually cleaner for straightforward extraction.\n\n## Step 5: scrape the full page\n\nNow that you know how to find each field, combine everything into a single scraper:\n\n``` python\nimport requests\nfrom bs4 import BeautifulSoup\n\nheaders = {\n    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',\n}\n\nurl = 'https://books.toscrape.com/catalogue/page-1.html'\nresponse = requests.get(url, headers=headers)\n\nif response.status_code != 200:\n    print(f'Request failed: {response.status_code}')\nelse:\n    # Raw bytes let Beautiful Soup detect the page encoding itself\n    soup = BeautifulSoup(response.content, 'lxml')\n    books_data = []\n\n    for article in soup.find_all('article', class_='product_pod'):\n        title    = article.find('h3').find('a')['title']\n        price    = article.find('p', class_='price_color').text.strip()\n        rating   = article.find('p', class_='star-rating')['class'][1]\n        in_stock = 'In stock' in article.find('p', class_='instock').text\n\n        books_data.append({\n            'title':    title,\n            'price':    price,\n            'rating':   rating,\n            'in_stock': in_stock,\n        })\n\n    for book in books_data[:3]:\n        print(book)\n\n# Output:\n# {'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}\n# {'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'in_stock': True}\n# {'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'in_stock': True}\n```\n\nThe `rating` field comes from the class name. `star-rating Three` has two classes, so `['class'][1]` picks the second one, which is the word that represents the rating. 
You would map it to a number in post-processing: `{'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}`.\n\n## Step 6: export to CSV\n\nPython's built-in `csv` module writes the data to a file you can open in Excel or Google Sheets:\n\n``` python\nimport csv\n\ncsv_file = 'books.csv'\n\nwith open(csv_file, mode='w', newline='', encoding='utf-8') as f:\n    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'in_stock'])\n    writer.writeheader()\n    writer.writerows(books_data)\n\nprint(f'Saved {len(books_data)} books to {csv_file}')\n```\n\n## Step 7: export to JSON\n\nFor pipelines that consume JSON, or when you want a format that preserves types such as booleans:\n\n``` python\nimport json\n\nwith open('books.json', 'w', encoding='utf-8') as f:\n    # ensure_ascii=False writes the £ sign as-is instead of a \\u00a3 escape\n    json.dump(books_data, f, indent=4, ensure_ascii=False)\n\nprint('Saved to books.json')\n\n# books.json now contains:\n# [\n#     {\n#         \"title\": \"A Light in the Attic\",\n#         \"price\": \"£51.77\",\n#         \"rating\": \"Three\",\n#         \"in_stock\": true\n#     },\n#     ...\n# ]\n```\n\n## Step 8: handle pagination\n\nBooks to Scrape spreads its catalogue across 50 pages. The URLs follow a predictable pattern, incrementing from `page-1.html` through `page-50.html`, and every page except the last has a Next button at the bottom. 
This loop walks through every page and collects all the books:\n\n``` python\nimport time\n\nimport requests\nfrom bs4 import BeautifulSoup\n\nheaders = {\n    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',\n}\n\nbase_url = 'https://books.toscrape.com/catalogue/page-{}.html'\nall_books = []\npage = 1\n\nwhile True:\n    response = requests.get(base_url.format(page), headers=headers)\n\n    # Stop on any non-200 response, such as a 404 past the last page\n    if response.status_code != 200:\n        break\n\n    soup = BeautifulSoup(response.content, 'lxml')\n    articles = soup.find_all('article', class_='product_pod')\n\n    if not articles:\n        break\n\n    for article in articles:\n        title    = article.find('h3').find('a')['title']\n        price    = article.find('p', class_='price_color').text.strip()\n        rating   = article.find('p', class_='star-rating')['class'][1]\n        in_stock = 'In stock' in article.find('p', class_='instock').text\n        all_books.append({'title': title, 'price': price, 'rating': rating, 'in_stock': in_stock})\n\n    print(f'Page {page}: {len(all_books)} books collected so far')\n\n    # Check if there is a next page\n    next_btn = soup.find('li', class_='next')\n    if not next_btn:\n        break\n\n    page += 1\n    time.sleep(1)  # Be respectful of the server\n\nprint(f'\\nTotal: {len(all_books)} books')\n```\n\nThe `time.sleep(1)` between requests gives the server breathing room. Sending requests as fast as possible is what gets scrapers blocked or rate-limited.\n\n## Common challenges\n\n### Dynamic content\n\nBeautiful Soup reads the HTML that `requests` returns. If a page uses JavaScript to load its data after the initial HTML arrives, `requests` gets the empty skeleton and Beautiful Soup finds nothing to parse. 
You will see this on modern e-commerce sites, single-page applications, and any page where the content only appears after scrolling or clicking.\n\nThe fix is to use a headless browser like Playwright or Selenium to render the page first, then pass the rendered HTML to Beautiful Soup. This adds significant overhead because a full browser instance runs for every page, and sites that use JavaScript-rendered content are also typically the ones with more aggressive bot detection.\n\n### Getting blocked\n\nSites detect scrapers through a combination of signals: missing or suspicious User-Agent headers, requests arriving too fast, datacenter IP addresses, and missing browser fingerprints. A plain `requests` call with no custom headers is the most obvious signal you can send.\n\nCommon mitigations are adding realistic headers, introducing random delays between requests, and routing through residential proxies. Each adds complexity and maintenance burden. Free proxy lists in particular are unreliable. They have short lifespans and many are already flagged by the time you use them.\n\n### Selectors breaking\n\nSite owners update their HTML without any concern for external scrapers. Class names change, element nesting shifts, and IDs get renamed. Any selector you write is a dependency on the current state of a page that you do not control. This is the hidden cost of selector-based scraping: it works until it does not, and it breaks silently.\n\n### Error handling\n\nWeb scraping produces a lot of edge cases. An element you expect to be present is missing on one page. A `find()` call returns `None` and the next attribute access raises an `AttributeError`. 
Always write defensive code when accessing nested elements:\n\n``` python\nprice_elem = article.find('p', class_='price_color')\nprice = price_elem.text.strip() if price_elem else None\n```\n\nA try-except inside the extraction loop keeps one malformed article from stopping the rest:\n\n``` python\nfor article in articles:\n    try:\n        title = article.find('h3').find('a')['title']\n        price = article.find('p', class_='price_color').text.strip()\n        books_data.append({'title': title, 'price': price})\n    except (AttributeError, KeyError, TypeError) as e:\n        # A missing tag or attribute ends up here; skip this article and move on\n        print(f'Failed to parse article: {e}')\n```\n\n## The easier approach for more complex targets\n\nBooks to Scrape is a scraping sandbox built to be easy. Real targets like [Amazon](https://spidra.io/blog/how-to-scrape-amazon-product-data), [eBay](https://spidra.io/blog/scrape-ebay-product-data), news sites, and job boards have bot detection, JavaScript-rendered content, rotating layouts, and CAPTCHA challenges. The selector maintenance, proxy management, and rendering overhead add up quickly.\n\nThe [Spidra API](https://spidra.io/spidra-api) takes a different approach. You describe what you want from a page in plain English and define the output shape in a JSON Schema. Spidra loads the page in a real browser, routes through [residential proxies](https://spidra.io/proxy-scraping), handles CAPTCHAs automatically, and returns structured JSON matching your schema. 
When the site changes its layout, the prompt keeps working because it describes the data, not where it sits in the DOM.\n\n``` python\nimport requests, time, os, json\n\nAPI_KEY = os.environ['SPIDRA_API_KEY']\nBASE    = 'https://api.spidra.io/api'\nHEADERS = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}\n\nSCHEMA = {\n    'type': 'object',\n    'required': ['books'],\n    'properties': {\n        'books': {\n            'type': 'array',\n            'items': {\n                'type': 'object',\n                'properties': {\n                    'title':    {'type': 'string'},\n                    'price':    {'type': 'string'},\n                    'rating':   {'type': 'string'},\n                    'in_stock': {'type': 'boolean'},\n                }\n            }\n        }\n    }\n}\n\nresp = requests.post(f'{BASE}/scrape', headers=HEADERS, json={\n    'urls':   [{'url': 'https://books.toscrape.com/catalogue/page-1.html'}],\n    'prompt': 'Extract all books on this page with their title, price, star rating as a word, and whether they are in stock',\n    'output': 'json',\n    'schema': SCHEMA,\n})\njob_id = resp.json()['jobId']\n\nwhile True:\n    result = requests.get(f'{BASE}/scrape/{job_id}', headers=HEADERS).json()\n    if result['status'] == 'completed':\n        break\n    time.sleep(3)\n\nbooks = result['result']['content']['books']\nprint(f'Got {len(books)} books')\nprint(json.dumps(books[:2], indent=2))\n```\n\nSign up free at [app.spidra.io](https://app.spidra.io/signup). The free plan gives you 300 credits with no card required. 
If you want to generate a schema from real output rather than writing it by hand, the free [JSON Schema Generator](https://spidra.io/tools/json-schema-generator) builds the structure for you from any JSON sample.", "url": "https://wpnews.pro/news/how-to-scrape-web-data-with-beautiful-soup-step-by-step-guide-in-2026", "canonical_source": "https://spidra.io/blog/web-scraping-with-beautifulsoup", "published_at": "2026-06-27 00:00:00+00:00", "updated_at": "2026-06-27 14:08:30.911686+00:00", "lang": "en", "topics": ["developer-tools", "machine-learning"], "entities": ["Beautiful Soup", "Python", "requests", "lxml", "Selenium", "Playwright", "Scrapy", "books.toscrape.com"], "alternates": {"html": "https://wpnews.pro/news/how-to-scrape-web-data-with-beautiful-soup-step-by-step-guide-in-2026", "markdown": "https://wpnews.pro/news/how-to-scrape-web-data-with-beautiful-soup-step-by-step-guide-in-2026.md", "text": "https://wpnews.pro/news/how-to-scrape-web-data-with-beautiful-soup-step-by-step-guide-in-2026.txt", "jsonld": "https://wpnews.pro/news/how-to-scrape-web-data-with-beautiful-soup-step-by-step-guide-in-2026.jsonld"}}