[ARTICLE · art-22322] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

When HTML parsing fails: using LLMs to extract messy web data

A developer turned to large language models to extract product data from e-commerce sites with unpredictable HTML, after traditional scraping tools like BeautifulSoup and Scrapy failed due to constantly changing page structures. By feeding raw HTML to OpenAI's GPT-4o with a defined JSON schema, the engineer successfully extracted fields such as product names, prices, and availability on the first attempt, bypassing the need for fragile CSS selectors or XPath expressions. The approach combines LLM-based extraction for problematic sites with traditional parsers for stable ones, and includes validation steps to catch errors.

4 min read · published Jun 5, 2026

I’ve been scraping websites for years. BeautifulSoup, Scrapy, Playwright — I’ve used them all. But last month I hit a wall.

A client needed me to extract product details from a dozen e-commerce sites. Most were straightforward: find the right CSS selectors, handle pagination, done. But one particular site was a nightmare. The HTML was a mess of nested divs, inline styles, and data scattered across attributes, text nodes, and even JavaScript variables. The layout changed every week. My carefully crafted selectors broke constantly.

I spent two days fixing and refactoring. Every time I thought I had it, the site updated and my pipeline broke again. I was about to tell the client it wasn’t feasible.

Then a colleague said: “Why not just give the raw HTML to an LLM and ask it to extract what you need?”

At first I laughed. LLMs hallucinate, they’re slow, expensive — right? But I was desperate. I decided to prototype it.

Before going down the AI route, I exhausted traditional approaches:

The price was sometimes in data-price attributes, sometimes in nested <span>s. Then the class name changed from product-price to price-info and my whole script died.

The root problem: the HTML structure was unpredictable. A human can look at a page and say “that’s the price”. A CSS selector cannot, because it relies on structure.
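To make the failure mode concrete, here is a minimal sketch of a lookup tied to a class name. It uses only the standard library's html.parser so it runs without dependencies. The two HTML snippets and the names ClassTextFinder and price_by_class are hypothetical and not taken from the client's site.

```python
from html.parser import HTMLParser


class ClassTextFinder(HTMLParser):
    """Collect the text of the first element carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0     # > 0 while inside the matched element
        self.text = None   # stays None if the class never appears

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.text is None and self.wanted_class in classes:
            self.depth = 1
            self.text = ""

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def price_by_class(html, css_class="product-price"):
    """Return the text of the first element with `css_class`, or None."""
    finder = ClassTextFinder(css_class)
    finder.feed(html)
    return finder.text.strip() if finder.text is not None else None


# Two hypothetical snapshots of the same page, one week apart.
last_week = '<div><span class="product-price">$19.99</span></div>'
this_week = '<div><span class="price-info">$19.99</span></div>'

print(price_by_class(last_week))  # $19.99
print(price_by_class(this_week))  # None
```

The price is still on the page in the second snapshot. Only the structural hint the code depended on is gone.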

I built a small script that takes raw HTML, sends it to an LLM (I used OpenAI’s GPT-4o, but you can use any model that can handle long contexts), and asks it to return a JSON object according to a schema I define.

The key insight: instead of teaching the computer where the data is, I teach it what the data looks like. I provide a description and let the LLM figure out the mapping.

Here’s a simplified version:

import json

import openai
import requests
from bs4 import BeautifulSoup

# Fetch the page (the URL is a placeholder).
page = requests.get("https://example.com/product-page", timeout=30)
page.raise_for_status()
raw_html = page.text

# Drop tags that rarely carry visible product data, then cap the size.
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style", "meta", "link", "svg"]):
    tag.decompose()
cleaned_html = str(soup)[:12000]  # limit context size

# Describe what each field looks like, not where it lives in the markup.
schema = {
    "product_name": "string",
    "price": "string (e.g., '$19.99')",
    "availability": "string ('In Stock' or 'Out of Stock')",
    "description": "string",
    "rating": "string (e.g., '4.5 out of 5')"
}

prompt = f"""
Extract the following fields from this HTML and return a valid JSON object.
Fields: {json.dumps(schema, indent=2)}

HTML:
{cleaned_html}

Return ONLY the JSON object, no explanation.
"""

client = openai.OpenAI(api_key="sk-...")  # placeholder key
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)

try:
    result = json.loads(completion.choices[0].message.content)
    print(result)
except json.JSONDecodeError:
    # This simplified version only reports the failure; it does not retry.
    print("LLM did not return valid JSON.")

That’s it. For the problematic site, this worked on the first try. No selectors, no XPath, no regular expressions. I just described what I wanted and the LLM figured out the rest.
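One practical wrinkle: a model may wrap its reply in a Markdown code fence or add a sentence of chatter, and json.loads then fails. The sketch below shows one way to tolerate a fenced reply and retry a few times. The names parse_llm_json and extract_with_retry are hypothetical, and call_llm stands in for whatever function sends the prompt and returns the reply text.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_llm_json(text):
    """Parse a JSON object from a reply, tolerating a surrounding code fence."""
    text = text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop both fence markers, then the rest of the opening fence line
        # (which may carry a language tag such as "json").
        body = text[len(FENCE):-len(FENCE)]
        first_newline = body.find("\n")
        text = body[first_newline + 1:] if first_newline != -1 else body
    data = json.loads(text)  # raises json.JSONDecodeError on bad input
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def extract_with_retry(call_llm, prompt, attempts=3):
    """Call `call_llm(prompt)` until the reply parses, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_llm_json(call_llm(prompt))
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            last_error = err
    raise RuntimeError(f"no valid JSON after {attempts} attempts") from last_error
```

Passing the model call in as a function keeps the parsing logic testable without network access. A real version would probably also log the rejected replies.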

I’ve used this approach for a few weeks now, and here’s what I learned:

I’d combine approaches. Use traditional parsers for stable sites, and fall back to LLM only for tricky ones. Also, I’d implement a validation step: check that extracted prices look like prices, ratings are within range, etc. If validation fails, re-run with a different prompt or a more powerful model.
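A minimal sketch of that validation-plus-fallback idea, assuming records shaped like the schema above. The regular expressions, the accepted currency symbols, and the names validate and extract_first_valid are illustrative assumptions. Each extractor is any function that takes HTML and returns a dict, listed cheapest first (for example a selector-based parser, then an LLM call, then a stronger model).

```python
import re

# Assumed formats: "$19.99", "$1,299.00", "€5", and "4.5 out of 5".
PRICE_RE = re.compile(r"^[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?$")
RATING_RE = re.compile(r"^(\d+(?:\.\d+)?) out of 5$")


def validate(record):
    """Return a list of problems; an empty list means the record looks sane."""
    problems = []
    if not str(record.get("product_name") or "").strip():
        problems.append("product_name is empty")
    if not PRICE_RE.match(str(record.get("price") or "")):
        problems.append("price does not look like a price")
    if record.get("availability") not in ("In Stock", "Out of Stock"):
        problems.append("availability is not one of the allowed values")
    rating = RATING_RE.match(str(record.get("rating") or ""))
    if not rating or not 0 <= float(rating.group(1)) <= 5:
        problems.append("rating is missing or outside 0-5")
    return problems


def extract_first_valid(html, extractors):
    """Run extractors in order and return the first record that validates."""
    for extractor in extractors:
        try:
            record = extractor(html)
        except Exception:  # sketch only: any failure means "try the next one"
            continue
        if isinstance(record, dict) and not validate(record):
            return record
    return None
```

Returning the list of problems rather than a bare boolean makes it possible to feed them back into a second, more specific prompt.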

Another improvement: provide the LLM with a few examples (few-shot prompting) to improve accuracy on ambiguous fields.
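Few-shot prompting can be sketched as a chat history in which earlier turns pair a small HTML snippet with the JSON expected for it. The build_messages helper and the example snippet below are hypothetical. The role/content message shape is the one the chat completions call above already uses.

```python
import json


def build_messages(schema, examples, html):
    """Build a chat message list: instructions, worked examples, then the page."""
    messages = [{
        "role": "system",
        "content": (
            "Extract these fields from the HTML and reply with a JSON object only.\n"
            f"Fields: {json.dumps(schema)}"
        ),
    }]
    # Each example is a (html_snippet, expected_record) pair.
    for example_html, expected in examples:
        messages.append({"role": "user", "content": example_html})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": html})
    return messages


# Hypothetical example showing a price that lives in an attribute.
examples = [
    (
        '<div data-price="19.99"><h1>Mug</h1></div>',
        {"product_name": "Mug", "price": "$19.99"},
    ),
]
messages = build_messages(
    {"product_name": "string", "price": "string"},
    examples,
    "<html>...</html>",
)
```

The resulting list can be passed as the messages argument of the API call. Each example adds tokens to every request, so a small number of well-chosen ones is likely the better trade-off.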

I’m not the only one doing this. There are now services that wrap this idea in an API. For example, I came across InterWest AI, which offers a similar extraction API. I haven’t used it extensively, but it’s interesting to see this pattern being productized.

LLM-based extraction isn’t a silver bullet. It’s expensive and slow. But for the 10% of cases where traditional parsing fails — changing layouts, inconsistent HTML, or just pure laziness — it’s a lifesaver.

I’m still torn. Part of me feels like I’m cheating by throwing AI at a problem that used to require elegant code. But then again, the site’s layout changes every week, and I have better things to do than update selectors.

What’s your experience with extracting data from messy websites? Have you tried AI-based parsing, or do you still prefer the precision of XPath and CSS? I’d love to hear how others handle this.
