When HTML parsing fails: using LLMs to extract messy web data


Source: dev.to · Published: 2026-06-05T08:34Z · Topic: large-language-models


A developer turned to large language models to extract product data from e-commerce sites with unpredictable HTML, after traditional scraping tools like BeautifulSoup and Scrapy failed due to constantly changing page structures. By feeding raw HTML to OpenAI's GPT-4o with a defined JSON schema, the engineer successfully extracted fields such as product names, prices, and availability on the first attempt, bypassing the need for fragile CSS selectors or XPath expressions. The approach combines LLM-based extraction for problematic sites with traditional parsers for stable ones, and includes validation steps to catch errors.


I’ve been scraping websites for years. BeautifulSoup, Scrapy, Playwright — I’ve used them all. But last month I hit a wall.

A client needed me to extract product details from a dozen e-commerce sites. Most were straightforward: find the right CSS selectors, handle pagination, done. But one particular site was a nightmare. The HTML was a mess of nested divs, inline styles, and data scattered across attributes, text nodes, and even JavaScript variables. The layout changed every week. My carefully crafted selectors broke constantly.

I spent two days fixing and refactoring. Every time I thought I had it, the site updated and my pipeline broke again. I was about to tell the client it wasn’t feasible.

Then a colleague said: “Why not just give the raw HTML to an LLM and ask it to extract what you need?”

At first I laughed. LLMs hallucinate, they’re slow, expensive — right? But I was desperate. I decided to prototype it.

Before going down the AI route, I exhausted traditional approaches:

- The price was sometimes in data-price attributes, sometimes in nested <span>s.
- A class changed from product-price to price-info and my whole script died.

The root problem: the HTML structure was unpredictable. A human can look at a page and say “that’s the price”. A CSS selector cannot, because it relies on structure.
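To make that failure concrete, here is a minimal sketch of a structure-dependent lookup. It uses the standard library's html.parser instead of BeautifulSoup so it runs without dependencies, and the two markup snippets are hypothetical stand-ins, not the client's real pages. The ClassTextFinder and find_by_class names are invented for this illustration.

```python
from html.parser import HTMLParser


class ClassTextFinder(HTMLParser):
    """Collect the text inside the first element carrying a given class.

    Simplification: assumes every tag inside the match has a closing tag.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside the matched element
        self.text = None    # stays None if the class is never seen

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif self.text is None:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1
                self.text = ""

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def find_by_class(html, target_class):
    """Return the stripped text of the first matching element, or None."""
    finder = ClassTextFinder(target_class)
    finder.feed(html)
    return finder.text.strip() if finder.text is not None else None


# Hypothetical markup before and after a redesign; only the class name differs.
before = '<div><span class="product-price">$19.99</span></div>'
after = '<div><span class="price-info">$19.99</span></div>'

old_result = find_by_class(before, "product-price")
new_result = find_by_class(after, "product-price")
print(old_result)  # $19.99
print(new_result)  # None
```

The price is still plainly on the page after the rename, but the lookup only knows the class name, so it comes back empty.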

I built a small script that takes raw HTML, sends it to an LLM (I used OpenAI’s GPT-4o, but you can use any model that can handle long contexts), and asks it to return a JSON object according to a schema I define.

The key insight: instead of teaching the computer where the data is, I teach it what the data looks like. I provide a description and let the LLM figure out the mapping.

Here’s a simplified version:

import json
import os

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

# Fetch the page; fail fast on HTTP errors instead of parsing an error page.
page = requests.get("https://example.com/product-page", timeout=30)
page.raise_for_status()
raw_html = page.text

# Drop tags that carry no visible product data, to save tokens.
# Note: this also removes values that live only in <script> variables.
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style", "meta", "link", "svg"]):
    tag.decompose()
cleaned_html = str(soup)[:12000]  # crude character cap to limit context size

# Describe what each field looks like, not where it sits in the markup.
schema = {
    "product_name": "string",
    "price": "string (e.g., '$19.99')",
    "availability": "string ('In Stock' or 'Out of Stock')",
    "description": "string",
    "rating": "string (e.g., '4.5 out of 5')"
}

prompt = f"""
Extract the following fields from this HTML and return a valid JSON object.
Fields: {schema}

HTML:
{cleaned_html}

Return ONLY the JSON object, no explanation.
"""

# Read the key from the environment rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)

try:
    result = json.loads(completion.choices[0].message.content)
    print(result)
except json.JSONDecodeError:
    # This simplified version only reports the failure; it does not retry.
    print("LLM did not return valid JSON.")

That’s it. For the problematic site, this worked on the first try. No selectors, no XPath, no regular expressions. I just described what I wanted and the LLM figured out the rest.
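The simplified version gives up as soon as the reply is not bare JSON. One possible hardening, sketched here under assumptions rather than taken from the original script, is to tolerate a Markdown code fence around the reply and to retry a few times. The helper names parse_json_reply and extract_with_retry are hypothetical, and ask_model stands in for whatever function wraps the API call.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, spelled this way to keep the listing readable
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_json_reply(reply):
    """Parse a model reply as a JSON object, tolerating a surrounding code fence."""
    text = reply.strip()
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on non-JSON text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def extract_with_retry(ask_model, prompt, max_attempts=3):
    """Call ask_model(prompt) until the reply parses or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_json_reply(ask_model(prompt))
        except ValueError as error:  # json.JSONDecodeError is a ValueError subclass
            last_error = error
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error


# Stand-in for the real API call so the sketch runs offline:
# the first reply is chatty text, the second is fenced JSON.
fake_replies = iter([
    "Sure! Here is the data you asked for.",
    f'{FENCE}json\n{{"product_name": "Widget", "price": "$19.99"}}\n{FENCE}',
])
result = extract_with_retry(lambda prompt: next(fake_replies), "extract...")
print(result)
```

With temperature set to 0.0, an identical retry is likely to produce much the same reply, so in practice a retry probably needs a reworded prompt or a different model to be worth the extra call.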

I’ve used this approach for a few weeks now, and here’s what I learned:

I’d combine approaches. Use traditional parsers for stable sites, and fall back to LLM only for tricky ones. Also, I’d implement a validation step: check that extracted prices look like prices, ratings are within range, etc. If validation fails, re-run with a different prompt or a more powerful model.
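A rough sketch of how the fallback and the validation step might fit together. The patterns, the validation_errors and extract helpers, and the two stand-in extractors are all assumptions made for illustration; a real pipeline would plug in the selector-based parser and the LLM call.

```python
import re

# Assumed formats, matching the examples given in the schema above.
PRICE_PATTERN = re.compile(r"^[$€£]\s?\d+(?:,\d{3})*(?:\.\d{1,2})?$")
RATING_PATTERN = re.compile(r"^(\d+(?:\.\d+)?) out of 5$")


def validation_errors(record):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not str(record.get("product_name") or "").strip():
        errors.append("product_name is empty")
    if not PRICE_PATTERN.match(str(record.get("price") or "")):
        errors.append("price does not look like a price")
    rating = RATING_PATTERN.match(str(record.get("rating") or ""))
    if not rating or not 0 <= float(rating.group(1)) <= 5:
        errors.append("rating is missing or outside 0-5")
    if record.get("availability") not in ("In Stock", "Out of Stock"):
        errors.append("availability is not one of the allowed values")
    return errors


def extract(html, extractors):
    """Try each extractor in order; return the first record that validates."""
    for extractor in extractors:
        record = extractor(html)
        if record is not None and not validation_errors(record):
            return record
    return None


good = {
    "product_name": "Widget",
    "price": "$19.99",
    "availability": "In Stock",
    "rating": "4.5 out of 5",
}
bad = dict(good, price="nineteen dollars", rating="7 out of 5")


def selector_extractor(html):
    return None  # stand-in: pretend the CSS selectors found nothing


def llm_extractor(html):
    return dict(good)  # stand-in: pretend the LLM returned a clean record


result = extract("<html>...</html>", [selector_extractor, llm_extractor])
print(validation_errors(bad))
print(result)
```

Ordering the extractors from cheapest to most expensive means the LLM is only called when the cheaper parser returns nothing or returns something that fails validation.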

Another improvement: provide the LLM with a few examples (few-shot prompting) to improve accuracy on ambiguous fields.
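Few-shot prompting here could mean putting a couple of worked examples in front of the real page, as alternating user and assistant turns. The sketch below only builds the message list, in the same shape the messages parameter of the chat completions call accepts. The example snippets and the build_few_shot_messages helper are hypothetical.

```python
import json

# Hypothetical worked examples: (HTML snippet, expected extraction).
# Both are cases where the "right" price is ambiguous without guidance.
EXAMPLES = [
    ('<div class="p">Now <b>$24.00</b> <s>$30.00</s></div>',
     {"price": "$24.00"}),
    ('<span data-price="1299">from $12.99</span>',
     {"price": "$12.99"}),
]


def build_few_shot_messages(instructions, examples, html):
    """Build a chat message list: instructions, example turns, then the real input."""
    messages = [{"role": "system", "content": instructions}]
    for example_html, expected in examples:
        messages.append({"role": "user", "content": example_html})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": html})
    return messages


messages = build_few_shot_messages(
    "Extract the current selling price as a JSON object with the key 'price'.",
    EXAMPLES,
    "<p>Price: <em>$5.00</em></p>",
)
for message in messages:
    print(message["role"], "|", message["content"])
```

Each example costs tokens on every request, so it probably pays to keep the set small and limited to the fields that actually come back wrong.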

I’m not the only one doing this. There are now services that wrap this idea into nice APIs. For example, I came across InterWest AI which offers a similar extraction API. I haven’t used it extensively, but it’s interesting to see this pattern being productized.

LLM-based extraction isn’t a silver bullet. It’s expensive and slow. But for the 10% of cases where traditional parsing fails — changing layouts, inconsistent HTML, or just pure laziness — it’s a lifesaver.

I’m still torn. Part of me feels like I’m cheating by throwing AI at a problem that used to require elegant code. But then again, the site’s layout changes every week, and I have better things to do than update selectors.

What’s your experience with extracting data from messy websites? Have you tried AI-based parsing, or do you still prefer the precision of XPath and CSS? I’d love to hear how others handle this.



