A few months ago, I was building a price comparison tool that needed to pull product info from a dozen different e-commerce sites. Each one had its own lovingly crafted HTML structure – nested <div>s with classes like price-123abc that changed on every deployment. My initial approach was traditional: XPath, CSS selectors, and a sprinkle of regex. It worked until it didn’t. Then I discovered that I could throw an LLM at the raw HTML and let it figure out the extraction. Here’s what I learned.
I had a scraper for Site A that used document.querySelector('.product-price'). It was fragile but worked for months. Then Site A redesigned. The selector broke. I updated it. A week later, another redesign. I started using regex to find patterns like \$\d+\.\d{2}. Then someone added a badge that said “$5 off” and my regex grabbed the wrong number.
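To make that failure mode concrete, here is a minimal sketch. The page text is made up, and the badge is written as "$5.00" (an assumption on my part) so that it actually matches the two-decimal pattern above.

```python
import re

# The same pattern quoted above: a dollar sign, digits, and two decimals.
PRICE_PATTERN = re.compile(r"\$\d+\.\d{2}")

# Hypothetical page text: a promo badge appears before the real price.
page_text = "Save $5.00 today! Blue Widget $19.99 In Stock"

# search() returns the first match in document order, which is the badge.
first_match = PRICE_PATTERN.search(page_text).group()
# findall() shows every candidate; the regex cannot tell which one is the price.
all_matches = PRICE_PATTERN.findall(page_text)

print(first_match)
print(all_matches)
```

The pattern matches the shape of a price correctly. What it cannot do is say which of the matching strings is the product's price.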
I needed something that could understand the meaning of a price, not just its structure. That’s when I wondered: could GPT-4 (or any language model) parse the raw HTML and give me the structured data I needed?
First, I tried passing the full HTML of a product page directly to an LLM and asking, “extract the product name, price, and availability.” Two problems:
I also tried simplifying the HTML with html2text to reduce tokens. That lost too much structure – the model couldn’t distinguish between a price in the main content and a price in a footer ad.
Then I tried extracting only the parts of the page that looked price-like using regex first, then feeding that to the LLM. That was a maintenance nightmare – I was back to writing brittle patterns.
The breakthrough came when I stopped trying to reduce what the model sees and instead improved how I asked. Here’s the approach that stuck:
Instead of raw HTML, I converted the page to a stripped-down listing of common elements (headings, paragraphs, list items, tables, price containers), each reduced to its tag name and text content. This reduced token count by ~70% while preserving enough structure for the model to tell the elements apart.
    from bs4 import BeautifulSoup

    def simplify_html(html):
        soup = BeautifulSoup(html, 'html.parser')
        # Drop elements that rarely contain product data
        for tag in soup(['script', 'style', 'nav', 'footer', 'aside']):
            tag.decompose()
        simplified = []
        # select() takes CSS selectors, so 'div.price' matches <div class="price">;
        # find_all() with a list only compares tag names and would never match it
        for element in soup.select('h1, h2, h3, p, li, table, div.price'):
            tag = element.name
            text = element.get_text(strip=True)
            simplified.append(f"<{tag}>{text}</{tag}>")
        return '\n'.join(simplified)
I created 3–5 examples of product pages with the exact JSON output I wanted. I hardcoded them into the system prompt. This was key – it told the model exactly what “price” meant in my context (first product, not recommended items).
system_prompt = """You are a precise data extractor for e-commerce product pages.
Given simplified HTML, output a JSON object with fields:
- name: product name
- price: numeric value without currency symbol
- availability: 'in_stock' or 'out_of_stock'
Examples:
---
Input:
<h1>Blue Widget</h1>
<div class="price">$19.99</div>
<span>In Stock</span>
Output:
{"name": "Blue Widget", "price": 19.99, "availability": "in_stock"}
---
(More examples...)
"""
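Hardcoding the examples works, but once there are several of them it can be easier to keep them as data and assemble the prompt from that. The following is a minimal sketch, not the code I originally ran: build_system_prompt and the examples list are hypothetical names, and the layout simply mirrors the prompt shown above.

```python
import json

def build_system_prompt(examples):
    """Assemble a few-shot prompt from (simplified_html, expected_dict) pairs."""
    header = (
        "You are a precise data extractor for e-commerce product pages.\n"
        "Given simplified HTML, output a JSON object with fields:\n"
        "- name: product name\n"
        "- price: numeric value without currency symbol\n"
        "- availability: 'in_stock' or 'out_of_stock'\n"
        "Examples:\n"
    )
    blocks = []
    for html, expected in examples:
        # json.dumps guarantees the example output is valid JSON
        blocks.append(f"---\nInput:\n{html}\nOutput:\n{json.dumps(expected)}\n")
    return header + "".join(blocks) + "---\n"

# One example pair, copied from the prompt above; add more in the same shape.
examples = [
    (
        '<h1>Blue Widget</h1>\n<div class="price">$19.99</div>\n<span>In Stock</span>',
        {"name": "Blue Widget", "price": 19.99, "availability": "in_stock"},
    ),
]

prompt = build_system_prompt(examples)
print(prompt)
```

Keeping the expected outputs as dicts also means the same pairs can double as regression tests for the extractor.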
I used OpenAI’s API (but you could swap in any compatible endpoint – even a local model). The key was setting temperature to 0 so that the extraction is as repeatable as possible from one call to the next.
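    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_product_info(simplified_html):
        # openai>=1.0 client call; older releases used openai.ChatCompletion.create
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": simplified_html}
            ],
            temperature=0
        )
        # The reply is a JSON string; parse it before using the fields
        return response.choices[0].message.content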
Yes, it’s that simple – and surprisingly reliable for most pages I threw at it.
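One caveat: the model's reply is just a string, so it is worth checking before trusting it. Below is a minimal sketch of such a check, assuming the three-field schema from the system prompt. parse_product_json is a hypothetical helper name, and the specific rules (non-empty name, non-negative number, one of two availability values) are my reading of that schema rather than anything the API enforces.

```python
import json

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock"}

def parse_product_json(raw):
    """Parse the model's reply and check it against the prompt's schema.

    Returns a cleaned dict on success, or None if the reply is unusable.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # not JSON at all, e.g. the model added chatty text
    if not isinstance(data, dict):
        return None

    name = data.get("name")
    price = data.get("price")
    availability = data.get("availability")

    if not isinstance(name, str) or not name.strip():
        return None
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
        return None
    if not isinstance(availability, str) or availability not in ALLOWED_AVAILABILITY:
        return None

    return {"name": name.strip(), "price": float(price), "availability": availability}

good = parse_product_json(
    '{"name": "Blue Widget", "price": 19.99, "availability": "in_stock"}'
)
bad = parse_product_json("Sure! Here is the JSON you asked for.")
print(good, bad)
```

Returning None instead of raising keeps the calling loop simple: a failed page can be retried or logged without a try/except around every call.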
This approach isn’t a silver bullet. Here’s what I discovered:
I also experimented with specialized APIs like the one at https://ai.interwestinfo.com/ that abstract away some of these trade-offs (they handle chunking and validation behind the scenes). But honestly, the core technique of few-shot prompting with simplified DOM structure is what made the difference.
This approach is overkill if:
And if you’re scraping sites that explicitly forbid bots, remember to respect robots.txt and consider asking for permission. This technique makes extraction easy, but it doesn’t give you a free pass.
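Checking robots.txt does not need anything beyond Python's standard library. Here is a minimal sketch using urllib.robotparser. The rules, the bot name, and the URLs are hypothetical, and the rules are parsed from a string so the example runs offline. For a live site you would call set_url() and read() on the parser instead.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content and bot name, for illustration only.
sample_rules = "User-agent: *\nDisallow: /checkout/\n"

product_ok = is_allowed(
    sample_rules, "price-compare-bot", "https://example.com/products/42"
)
checkout_ok = is_allowed(
    sample_rules, "price-compare-bot", "https://example.com/checkout/cart"
)
print(product_ok, checkout_ok)
```

Passing this check only means the site's robots.txt does not object. Terms of service and rate limits are separate questions.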
I’d start with the LLM-based approach from the beginning. The hours I spent debugging regex and CSS selectors were a sunk cost. I’d also add more validation: extract multiple candidates and take a vote across calls, or use a small local model (like a fine-tuned BERT) for structured extraction if the domain is narrow enough.
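The voting idea could look something like the sketch below. majority_vote and min_agreement are hypothetical names, and the candidate list stands in for the prices returned by separate calls. One assumption worth stating: voting only helps if the calls can disagree, for example with a temperature above 0 or slightly different prompts. At temperature 0 they will mostly return the same answer.

```python
from collections import Counter

def majority_vote(candidates, min_agreement=2):
    """Return the most common non-None value, or None if agreement is too weak."""
    votes = Counter(c for c in candidates if c is not None)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    # Refuse to answer unless enough calls agree on the same value
    return value if count >= min_agreement else None

# Hypothetical prices extracted by three separate calls for the same page.
agreed = majority_vote([19.99, 19.99, 5.0])
split = majority_vote([19.99, 5.0, 14.5])
print(agreed, split)
```

A None result is a useful signal in itself: pages where the calls disagree are the ones worth checking by hand.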
Now that language models can read HTML like a human, the game has changed. But I’m still experimenting – do you pre-process differently? Use a different model? Or do you swear by old-school selectors and a prayer? I’d love to hear what your scraping stack looks like.