# When HTML parsing fails: using LLMs to extract messy web data

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/when-html-parsing-fails-using-llms-to-extract-messy-web-data-20ab>
> Published: 2026-06-05 08:34:43+00:00

I’ve been scraping websites for years. BeautifulSoup, Scrapy, Playwright — I’ve used them all. But last month I hit a wall.

A client needed me to extract product details from a dozen e-commerce sites. Most were straightforward: find the right CSS selectors, handle pagination, done. But one particular site was a nightmare. The HTML was a mess of nested divs, inline styles, and data scattered across attributes, text nodes, and even JavaScript variables. The layout changed every week. My carefully crafted selectors broke constantly.

I spent two days fixing and refactoring. Every time I thought I had it, the site updated and my pipeline broke again. I was about to tell the client it wasn’t feasible.

Then a colleague said: “Why not just give the raw HTML to an LLM and ask it to extract what you need?”

At first I laughed. LLMs hallucinate, they’re slow, expensive — right? But I was desperate. I decided to prototype it.

Before going down the AI route, I exhausted traditional approaches:

- The price sat sometimes in `data-price` attributes, sometimes in nested `<span>`s.
- One class rename from `product-price` to `price-info` and my whole script died.

The root problem: **the HTML structure was unpredictable**. A human can look at a page and say “that’s the price”. A CSS selector cannot, because it relies on structure.
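To make the fragility concrete, here is a minimal sketch of a class-based lookup surviving one week and failing the next. The two snippets are hypothetical stand-ins for the real pages, kept well-formed so the standard library’s `xml.etree.ElementTree` can parse them; a real scraper would use an HTML parser such as BeautifulSoup instead.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Two hypothetical snapshots of the same product block, before and after a redesign.
# They are well-formed on purpose so the standard library can parse them.
WEEK_1 = '<div><span class="product-price">$19.99</span></div>'
WEEK_2 = '<div><span class="price-info">$19.99</span></div>'


def price_by_class(markup: str, class_name: str) -> Optional[str]:
    """Return the text of the first <span> whose class is exactly class_name, or None."""
    root = ET.fromstring(markup)
    node = root.find(f".//span[@class='{class_name}']")
    return node.text if node is not None else None


print(price_by_class(WEEK_1, "product-price"))  # finds the price
print(price_by_class(WEEK_2, "product-price"))  # None: the price is still on the page, the lookup misses it
```

The price never changed; only the label the lookup depended on did.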

I built a small script that takes raw HTML, sends it to an LLM (I used OpenAI’s GPT-4o, but you can use any model that can handle long contexts), and asks it to return a JSON object according to a schema I define.

The key insight: **instead of teaching the computer where the data is, I teach it what the data looks like**. I provide a description and let the LLM figure out the mapping.

Here’s a simplified version:

``` python
import json

import openai
import requests
from bs4 import BeautifulSoup

# Fetch the page (requests is enough for static pages; dynamic sites may need Playwright)
page = requests.get("https://example.com/product-page", timeout=30)
page.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
raw_html = page.text

# Clean up the HTML a bit: remove scripts, styles and other noisy tags
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style", "meta", "link", "svg"]):
    tag.decompose()
cleaned_html = str(soup)[:12000]  # crude character cap to limit context size

# Define the schema: field name -> description of what the value looks like
schema = {
    "product_name": "string",
    "price": "string (e.g., '$19.99')",
    "availability": "string ('In Stock' or 'Out of Stock')",
    "description": "string",
    "rating": "string (e.g., '4.5 out of 5')"
}

# Build the prompt
prompt = f"""
Extract the following fields from this HTML and return a valid JSON object.
Fields: {json.dumps(schema, indent=2)}

HTML:
{cleaned_html}

Return ONLY the JSON object, no explanation.
"""

# Call the LLM
client = openai.OpenAI(api_key="sk-...")
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)

# Parse the reply; the model can still return something that is not JSON
try:
    result = json.loads(completion.choices[0].message.content)
    print(result)
except json.JSONDecodeError:
    print("LLM did not return valid JSON.")
```

That’s it. For the problematic site, this worked on the first try. No selectors, no XPath, no regular expressions. I just described what I wanted and the LLM figured out the rest.
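One gap in the snippet above: when the reply isn’t valid JSON, it only prints a message. Below is a rough sketch of an actual retry loop. `call_llm` is a hypothetical callable (prompt in, reply text out) that you would wire to your own client, and the fence-stripping step is my assumption about a common failure mode, namely the model wrapping its JSON in a Markdown code fence.

```python
import json
import re
from typing import Callable, Optional

# Matches a reply that is entirely wrapped in a Markdown code fence (optionally tagged json).
FENCE_RE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence."""
    stripped = text.strip()
    match = FENCE_RE.match(stripped)
    if match:
        stripped = match.group(1)
    data = json.loads(stripped)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def extract_with_retries(
    call_llm: Callable[[str], str], prompt: str, attempts: int = 3
) -> Optional[dict]:
    """Call the model up to `attempts` times; return the first reply that parses, else None."""
    for _ in range(attempts):
        try:
            return parse_json_reply(call_llm(prompt))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
    return None
```

Passing the model call in as a function keeps the retry logic independent of any particular SDK, and makes it easy to exercise with canned replies.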

I’ve used this approach for a few weeks now, and here’s what I learned: I’d combine approaches. Use traditional parsers for stable sites, and fall back to the LLM only for the tricky ones. I’d also add a validation step: check that extracted prices look like prices, that ratings are within range, and so on. If validation fails, re-run with a different prompt or a more powerful model.
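Here is one way the validation and fallback idea might look. It is a sketch, not what I run in production: the regexes assume the field formats from the schema earlier (`'$19.99'`, `'4.5 out of 5'`), and the extractors passed to `first_valid` are hypothetical callables, for example a selector-based parser first, then an LLM call, then a stronger model.

```python
import re
from typing import Callable, Iterable, List, Optional

# Assumed formats, taken from the example schema: '$19.99' and '4.5 out of 5'.
PRICE_RE = re.compile(r"^[$€£]\s?\d[\d,]*(?:\.\d{1,2})?$")
RATING_RE = re.compile(r"^(\d+(?:\.\d+)?) out of 5$")


def validation_errors(record: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks sane."""
    errors = []
    price = str(record.get("price", "")).strip()
    if not PRICE_RE.match(price):
        errors.append(f"price does not look like a price: {price!r}")
    rating = RATING_RE.match(str(record.get("rating", "")).strip())
    if rating is None or not 0.0 <= float(rating.group(1)) <= 5.0:
        errors.append("rating is missing or out of range")
    if record.get("availability") not in ("In Stock", "Out of Stock"):
        errors.append("availability must be 'In Stock' or 'Out of Stock'")
    if not str(record.get("product_name", "")).strip():
        errors.append("product_name is empty")
    return errors


def first_valid(
    html: str, extractors: Iterable[Callable[[str], dict]]
) -> Optional[dict]:
    """Try extractors in order (cheapest first); return the first record that validates."""
    for extract in extractors:
        record = extract(html)
        if not validation_errors(record):
            return record
    return None
```

Ordering the extractors from cheapest to most expensive means the LLM only gets called when the selector-based result fails validation.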

Another improvement: provide the LLM with a few examples (few-shot prompting) to improve accuracy on ambiguous fields.
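A minimal sketch of what few-shot prompting could look like with a chat-style API: each example becomes a user message (the HTML) followed by an assistant message (the JSON you expect), and the real page goes last. The function name and the wording of the system message are my own placeholders; the `role`/`content` message shape is the one the chat completions call above already uses.

```python
import json
from typing import Dict, List, Tuple


def build_few_shot_messages(
    schema: dict, examples: List[Tuple[str, dict]], html: str
) -> List[Dict[str, str]]:
    """Build a chat message list: instructions, worked examples, then the page to extract."""
    messages = [{
        "role": "system",
        "content": (
            "Extract these fields from the HTML the user sends and reply with "
            "only a JSON object.\nFields: " + json.dumps(schema)
        ),
    }]
    for example_html, expected in examples:
        # Each example is shown as a past exchange: HTML in, correct JSON out.
        messages.append({"role": "user", "content": example_html})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": html})
    return messages
```

The returned list can be passed as `messages=` in place of the single user message. Keep the example HTML short, since every example is sent (and billed) on every request.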

I’m not the only one doing this. There are now services that wrap this idea into nice APIs. For example, I came across [InterWest AI](https://ai.interwestinfo.com/) which offers a similar extraction API. I haven’t used it extensively, but it’s interesting to see this pattern being productized.

LLM-based extraction isn’t a silver bullet. It’s expensive and slow. But for the 10% of cases where traditional parsing fails — changing layouts, inconsistent HTML, or just pure laziness — it’s a lifesaver.

I’m still torn. Part of me feels like I’m cheating by throwing AI at a problem that used to require elegant code. But then again, the site’s layout changes every week, and I have better things to do than update selectors.

What’s your experience with extracting data from messy websites? Have you tried AI-based parsing, or do you still prefer the precision of XPath and CSS? I’d love to hear how others handle this.
