I’ve been scraping websites for years. BeautifulSoup, Scrapy, Playwright — I’ve used them all. But last month I hit a wall.
A client needed me to extract product details from a dozen e-commerce sites. Most were straightforward: find the right CSS selectors, handle pagination, done. But one particular site was a nightmare. The HTML was a mess of nested divs, inline styles, and data scattered across attributes, text nodes, and even JavaScript variables. The layout changed every week. My carefully crafted selectors broke constantly.
I spent two days fixing and refactoring. Every time I thought I had it, the site updated and my pipeline broke again. I was about to tell the client it wasn’t feasible.
Then a colleague said: “Why not just give the raw HTML to an LLM and ask it to extract what you need?”
At first I laughed. LLMs hallucinate, they’re slow, expensive — right? But I was desperate. I decided to prototype it.
Before going down the AI route, I exhausted traditional approaches:
The price sometimes lived in data-price attributes, sometimes in nested <span>s.
Then a class was renamed from product-price to price-info and my whole script died.

The root problem: the HTML structure was unpredictable. A human can look at a page and say “that’s the price”. A CSS selector cannot, because it relies on structure.
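To make that failure mode concrete, here is a tiny sketch. It uses the standard library's ElementTree as a stand-in for a real HTML parser, and the two markup snippets are made up for illustration (only the class names come from the rename described above). The query is the rough equivalent of a span.product-price selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same product before and after a class rename.
before = '<div><span class="product-price">$19.99</span></div>'
after = '<div><span class="price-info">$19.99</span></div>'

# A structure-bound query, roughly equivalent to the CSS selector span.product-price.
QUERY = ".//span[@class='product-price']"


def find_price(markup):
    """Return the price text, or None when the query matches nothing."""
    node = ET.fromstring(markup).find(QUERY)
    return node.text if node is not None else None


price_before = find_price(before)
price_after = find_price(after)
print(price_before)  # the price is found while the class name holds
print(price_after)   # nothing is found after the rename, though the price is still on the page
```

The data never moved; only the label the query depended on did. That is the whole problem in two lines of markup.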
I built a small script that takes raw HTML, sends it to an LLM (I used OpenAI’s GPT-4o, but you can use any model that can handle long contexts), and asks it to return a JSON object according to a schema I define.
The key insight: instead of teaching the computer where the data is, I teach it what the data looks like. I provide a description and let the LLM figure out the mapping.
Here’s a simplified version:
import json

import openai
import requests
from bs4 import BeautifulSoup

# Fetch the page and fail loudly on HTTP errors
page = requests.get("https://example.com/product-page", timeout=30)
page.raise_for_status()
raw_html = page.text

# Strip tags that rarely carry product data, to keep the prompt small.
# Note: this also drops any data that lives only in <script> variables.
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style", "meta", "link", "svg"]):
    tag.decompose()
cleaned_html = str(soup)[:12000]  # limit context size (characters, not tokens)

# Describe what each field looks like, not where it lives
schema = {
    "product_name": "string",
    "price": "string (e.g., '$19.99')",
    "availability": "string ('In Stock' or 'Out of Stock')",
    "description": "string",
    "rating": "string (e.g., '4.5 out of 5')"
}

prompt = f"""
Extract the following fields from this HTML and return a valid JSON object.
Fields: {json.dumps(schema, indent=2)}
HTML:
{cleaned_html}
Return ONLY the JSON object, no explanation.
"""

client = openai.OpenAI(api_key="sk-...")
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)

try:
    result = json.loads(completion.choices[0].message.content)
    print(result)
except json.JSONDecodeError:
    # A real pipeline would retry here; this simplified version only reports it
    print("LLM did not return valid JSON.")
That’s it. For the problematic site, this worked on the first try. No selectors, no XPath, no regular expressions. I just described what I wanted and the LLM figured out the rest.
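One thing the simplified version glosses over is what happens when the reply is not valid JSON. Below is a minimal sketch of a retry loop, under a few assumptions: call_model is a hypothetical callable (prompt in, reply text out) that wraps whatever API client you use, and the only cleanup attempted is stripping a Markdown code fence around the reply, which models sometimes add despite being told not to. The function names are mine, not from any library.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_llm_json(text):
    """Parse a model reply as JSON, tolerating a surrounding Markdown fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) ...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ... and everything from the closing fence onward.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


def extract_with_retry(call_model, prompt, max_attempts=3):
    """Call the model until it returns parseable JSON, or give up.

    call_model is a hypothetical callable: prompt text in, reply text out.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        reply = call_model(prompt)
        try:
            return parse_llm_json(reply)
        except json.JSONDecodeError as err:
            last_error = err
            print(f"Attempt {attempt}: reply was not valid JSON")
    raise ValueError(f"No valid JSON after {max_attempts} attempts") from last_error


# Demo with a fake model: the first reply is chatter, the second is fenced JSON.
fake_replies = iter([
    "Sure! Here is the data:",
    FENCE + 'json\n{"price": "$19.99"}\n' + FENCE,
])
result = extract_with_retry(lambda prompt: next(fake_replies), "extract the price")
print(result)
```

Passing the model call in as a function keeps the retry logic testable without network access, and makes it easy to swap in a different model on later attempts.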
I’ve used this approach for a few weeks now, and here’s what I learned:
I’d combine approaches. Use traditional parsers for stable sites, and fall back to an LLM only for the tricky ones. I’d also add a validation step: check that extracted prices look like prices, that ratings are within range, and so on. If validation fails, re-run with a different prompt or a more powerful model.
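Here is roughly what the validation step and the parser-first fallback could look like. Treat it as a sketch under assumptions: the accepted formats are taken from the example schema ("$19.99", "4.5 out of 5", "In Stock"), and selector_extract and llm_extract are hypothetical callables you would supply, each taking HTML and returning a dict.

```python
import re

# Assumed formats, based on the example schema; adjust for real sites.
PRICE_RE = re.compile(r"^[$€£]\s?\d+(?:,\d{3})*(?:\.\d{1,2})?$")
RATING_RE = re.compile(r"^(\d+(?:\.\d+)?) out of 5$")


def validate(record):
    """Return a list of problems; an empty list means the record looks plausible."""
    problems = []
    if not str(record.get("product_name") or "").strip():
        problems.append("product_name is empty")
    price = str(record.get("price") or "")
    if not PRICE_RE.match(price):
        problems.append(f"price does not look like a price: {price!r}")
    if record.get("availability") not in ("In Stock", "Out of Stock"):
        problems.append("availability must be 'In Stock' or 'Out of Stock'")
    match = RATING_RE.match(str(record.get("rating") or ""))
    if match is None or not 0 <= float(match.group(1)) <= 5:
        problems.append("rating is not 'N out of 5' with N between 0 and 5")
    return problems


def extract_with_fallback(html, selector_extract, llm_extract):
    """Try the cheap selector-based parser first; use the LLM only if it fails.

    Both extractors are hypothetical callables: HTML in, dict out.
    Returns (record, source) where source is "selectors" or "llm".
    """
    try:
        record = selector_extract(html)
    except Exception:
        record = None  # a broken selector is treated like a failed validation
    if record is not None and not validate(record):
        return record, "selectors"

    record = llm_extract(html)
    problems = validate(record)
    if problems:
        # This is where a re-run with another prompt or model would go.
        raise ValueError(f"LLM result failed validation: {problems}")
    return record, "llm"
```

Returning the source alongside the record makes it easy to log how often the fallback fires, which is a decent early warning that a site's layout has changed.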
Another improvement: provide the LLM with a few examples (few-shot prompting) to improve accuracy on ambiguous fields.
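A sketch of how few-shot examples might be packed into a chat request: each hand-labeled example becomes a user message (the HTML snippet) followed by an assistant message (the JSON you expect), and the real page goes last. The function name and the sample data are made up for illustration; the message format is the role/content list that chat-style APIs such as OpenAI's accept.

```python
import json


def build_few_shot_messages(schema, examples, html):
    """Build a chat message list: instructions, worked examples, then the real page.

    examples is a list of (html_snippet, expected_record) pairs labeled by hand.
    """
    messages = [{
        "role": "system",
        "content": (
            "Extract these fields from the HTML and reply with only a JSON object: "
            + json.dumps(schema)
        ),
    }]
    for snippet, expected in examples:
        messages.append({"role": "user", "content": snippet})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": html})
    return messages


# Made-up example showing one ambiguous case: a struck-through old price.
schema = {"price": "string (e.g., '$19.99')"}
examples = [
    ('<div><s>$24.99</s> <b>$19.99</b></div>', {"price": "$19.99"}),
]
messages = build_few_shot_messages(schema, examples, "<div>...real page...</div>")
print(len(messages))
```

The resulting list can be passed as the messages argument in place of the single user message. Keep the examples short, since every one of them is paid for on every request.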
I’m not the only one doing this. There are now services that wrap this idea into nice APIs. For example, I came across InterWest AI, which offers a similar extraction API. I haven’t used it extensively, but it’s interesting to see this pattern being productized.
LLM-based extraction isn’t a silver bullet. It’s expensive and slow. But for the 10% of cases where traditional parsing fails — changing layouts, inconsistent HTML, or just pure laziness — it’s a lifesaver.
I’m still torn. Part of me feels like I’m cheating by throwing AI at a problem that used to require elegant code. But then again, the site’s layout changes every week, and I have better things to do than update selectors.
What’s your experience with extracting data from messy websites? Have you tried AI-based parsing, or do you still prefer the precision of XPath and CSS? I’d love to hear how others handle this.