When HTML parsing fails: using LLMs to extract messy web data

A developer turned to large language models to extract product data from e-commerce sites with unpredictable HTML, after traditional scraping tools like BeautifulSoup and Scrapy failed due to constantly changing page structures. By feeding raw HTML to OpenAI's GPT-4o with a defined JSON schema, the engineer successfully extracted fields such as product names, prices, and availability on the first attempt, bypassing the need for fragile CSS selectors or XPath expressions. The approach combines LLM-based extraction for problematic sites with traditional parsers for stable ones, and includes validation steps to catch errors.

I’ve been scraping websites for years. BeautifulSoup, Scrapy, Playwright: I’ve used them all. But last month I hit a wall.

A client needed me to extract product details from a dozen e-commerce sites. Most were straightforward: find the right CSS selectors, handle pagination, done. But one particular site was a nightmare. The HTML was a mess of nested divs, inline styles, and data scattered across attributes, text nodes, and even JavaScript variables. The layout changed every week. My carefully crafted selectors broke constantly.

I spent two days fixing and refactoring. Every time I thought I had it, the site updated and my pipeline broke again. I was about to tell the client it wasn’t feasible. Then a colleague said: “Why not just give the raw HTML to an LLM and ask it to extract what you need?” At first I laughed. LLMs hallucinate, they’re slow, expensive, right? But I was desperate. I decided to prototype it.

Before going down the AI route, I exhausted traditional approaches. Prices showed up sometimes in data-price attributes, sometimes in nested <span>s. A class name changed from product-price to price-info and my whole script died.

The root problem: the HTML structure was unpredictable. A human can look at a page and say “that’s the price”. A CSS selector cannot, because it relies on structure.

I built a small script that takes raw HTML, sends it to an LLM (I used OpenAI’s GPT-4o, but you can use any model that can handle long contexts), and asks it to return a JSON object according to a schema I define. The key insight: instead of teaching the computer where the data is, I teach it what the data looks like. I provide a description and let the LLM figure out the mapping.

Here’s a simplified version:

```python
import json

import openai
import requests
from bs4 import BeautifulSoup

# Fetch the page (I'll use requests here, but you might need Playwright for dynamic sites)
page = requests.get("https://example.com/product-page", timeout=30)
page.raise_for_status()
raw_html = page.text

# Clean up HTML a bit - remove scripts, styles, reduce noise
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style", "meta", "link", "svg"]):
    tag.decompose()
cleaned_html = str(soup)[:12000]  # limit context size

# Define the schema
schema = {
    "product_name": "string",
    "price": "string (e.g., '$19.99')",
    "availability": "string ('In Stock' or 'Out of Stock')",
    "description": "string",
    "rating": "string (e.g., '4.5 out of 5')",
}

# Call the LLM
prompt = f"""
Extract the following fields from this HTML and return a valid JSON object.
Fields: {json.dumps(schema, indent=2)}
HTML: {cleaned_html}
Return ONLY the JSON object, no explanation.
"""

client = openai.OpenAI(api_key="sk-...")
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)

try:
    result = json.loads(completion.choices[0].message.content)
    print(result)
except json.JSONDecodeError:
    # This simplified version only reports the failure; a real pipeline would retry here.
    print("LLM did not return valid JSON. Retrying...")
```

That’s it. For the problematic site, this worked on the first try. No selectors, no XPath, no regular expressions. I just described what I wanted and the LLM figured out the rest.

I’ve used this approach for a few weeks now, and the main thing I learned is that I’d combine approaches. Use traditional parsers for stable sites, and fall back to an LLM only for the tricky ones. Also, I’d implement a validation step: check that extracted prices look like prices, ratings are within range, etc. If validation fails, re-run with a different prompt or a more powerful model. Another improvement: provide the LLM with a few examples (few-shot prompting) to improve accuracy on ambiguous fields.

I’m not the only one doing this. There are now services that wrap this idea into nice APIs.
For example, I came across InterWest AI (https://ai.interwestinfo.com/), which offers a similar extraction API. I haven’t used it extensively, but it’s interesting to see this pattern being productized.

LLM-based extraction isn’t a silver bullet. It’s expensive and slow. But for the 10% of cases where traditional parsing fails (changing layouts, inconsistent HTML, or just pure laziness) it’s a lifesaver.

I’m still torn. Part of me feels like I’m cheating by throwing AI at a problem that used to require elegant code. But then again, the site’s layout changes every week, and I have better things to do than update selectors.

What’s your experience with extracting data from messy websites? Have you tried AI-based parsing, or do you still prefer the precision of XPath and CSS? I’d love to hear how others handle this.
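
P.S. For anyone wondering what the validation step mentioned above might look like, here is a minimal sketch. It assumes the field names and formats from the schema in the script ('$19.99', '4.5 out of 5', 'In Stock' or 'Out of Stock'). The regexes and the names `validate_product` and `extract_with_escalation` are hypothetical, made up for illustration and not part of any library. Each extractor is just a callable that takes HTML and returns a dict, standing in for "the same LLM call with a different prompt or a more powerful model".

```python
import re
from typing import Callable, Optional

# Hypothetical patterns; adjust to the currencies and rating formats you actually see.
PRICE_RE = re.compile(r"^[$€£]\s?\d[\d,]*(\.\d{1,2})?$")
RATING_RE = re.compile(r"^(\d+(?:\.\d+)?) out of (\d+(?:\.\d+)?)$")
ALLOWED_AVAILABILITY = ("In Stock", "Out of Stock")


def validate_product(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if not str(result.get("product_name") or "").strip():
        problems.append("product_name is empty")
    if not PRICE_RE.match(str(result.get("price") or "").strip()):
        problems.append("price does not look like a price")
    if result.get("availability") not in ALLOWED_AVAILABILITY:
        problems.append("availability is not an allowed value")
    match = RATING_RE.match(str(result.get("rating") or "").strip())
    if not match or not 0 <= float(match.group(1)) <= float(match.group(2)):
        problems.append("rating is missing or out of range")
    return problems


def extract_with_escalation(
    html: str, extractors: list[Callable[[str], dict]]
) -> Optional[dict]:
    """Try each extractor in order; return the first result that validates."""
    for extract in extractors:
        result = extract(html)
        if not validate_product(result):
            return result
    return None  # nothing passed; flag the page for manual review
```

In practice the list would go from cheapest to most expensive, for example a plain CSS-selector parser first, then the LLM call, then the LLM call with a few-shot prompt. A record that fails everywhere is better flagged for a human than silently written to the database.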