Why I ditched regex scrapers for an LLM parser (and when you shouldn't)

A developer building a price comparison tool for outdoor gear replaced brittle regex and CSS selectors with an LLM-based parser to extract product details from 30 e-commerce sites. Using GPT-4o-mini with a simple prompt, the parser correctly extracted product name, price, and availability from trimmed HTML snippets on roughly 80% of pages, with no per-site rules to maintain. The developer notes that while the LLM approach works well for inconsistent sites, traditional scrapers remain preferable for stable, high-volume scraping because of cost and latency.

Last month I needed to scrape product details from 30 different e-commerce sites. Each site used its own HTML structure, class names changed weekly, and some were just plain inconsistent. I had two options: write a mountain of brittle CSS selectors, or try something I'd been avoiding and let an LLM figure out the extraction. Here's what I learned the hard way, including the code that actually worked and the cases where I should have just stuck with BeautifulSoup.

I was building a price comparison tool for niche outdoor gear. The data I needed was simple: product name, price, availability, and a few specs. But the sources ranged from massive marketplaces to small family-run shops. Every time a site pushed a new template, my carefully built regex broke. I spent more time maintaining scrapers than actually using the data.

A typical selector for a price field looked like this:

```python
import re

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/product/123')
soup = BeautifulSoup(response.text, 'html.parser')

# This selector changed three times in two weeks
price_element = soup.select_one('span.price--current span.value')
if not price_element:
    # Fallback: any div whose class starts with "price"
    price_element = soup.find('div', class_=re.compile(r'price.*'))
```

I was debugging selectors more than I was analyzing prices. Something had to change.

First I tried using XPath with fuzzy matching. That helped a little, but still required per-site rules. Then I reached for machine learning and considered training a small model on HTML structure. That was overkill for a side project, and I didn't have labeled data for each site. I looked at commercial scraping services, but they were either too expensive or required sending my data through their pipelines, which felt like over-sharing for a small personal tool.

Then I heard about people using LLMs to parse unstructured data directly from raw HTML or even just the visible text. I was skeptical: LLMs are slow, expensive, and they hallucinate.
But the pain was real, so I gave it a shot. Instead of writing selectors per site, I started sending the raw HTML (or a trimmed version) to an LLM with a simple instruction: "Extract the product name, price, and availability status. Return JSON."

Here's the core function I ended up with:

```python
import json

import requests
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def extract_product_data(html_snippet: str) -> dict:
    # Truncate the snippet to 4000 characters to reduce tokens
    prompt = f"""You are a data extraction assistant. From the following HTML, extract:
- product_name (string)
- price (string, include currency symbol if present)
- in_stock (boolean)

Return only valid JSON with no extra text.

HTML:
{html_snippet[:4000]}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper and fast enough
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

To use it, I just fetch the page and pass a cleaned snippet (removing scripts, styles, and navigation elements to keep the token count low).

```python
import re


def clean_html(raw_html: str) -> str:
    # Remove script and style tags along with their contents
    cleaned = re.sub(r'<script[^>]*>.*?</script>', '', raw_html, flags=re.DOTALL)
    cleaned = re.sub(r'<style[^>]*>.*?</style>', '', cleaned, flags=re.DOTALL)
    return cleaned[:5000]  # Keep first 5000 chars as context
```

Then I called:

```python
raw = requests.get('https://example.com/product/123').text
snippet = clean_html(raw)
data = extract_product_data(snippet)
print(data)
# {'product_name': 'Trail Pro Jacket', 'price': '$89.99', 'in_stock': True}
```

It worked surprisingly well, on maybe 80% of the pages. The LLM could find the price even when it was buried in a table or formatted with weird spans. No regex, no per-site logic.

One of the services I evaluated for this approach was Interwest AI (https://ai.interwestinfo.com/), which offers a similar extraction API. I ended up rolling my own with OpenAI because I wanted full control, but the technique is the same.
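Back to the cleaning step for a moment: the regex version only strips scripts and styles, not navigation. Here is a minimal sketch of a cleaner that also skips navigation-style elements, using only the standard library's `html.parser`. The names `VisibleTextExtractor`, `visible_text`, and the `SKIPPED_TAGS` set are illustrative assumptions, not the code I actually ran, and a page with an unclosed `<nav>` or `<footer>` would make it drop everything after that tag.

```python
from html.parser import HTMLParser

# Assumption: these tags rarely hold product data; adjust per project.
SKIPPED_TAGS = {"script", "style", "noscript", "nav", "header", "footer"}


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag in SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # Greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def visible_text(raw_html: str, max_chars: int = 5000) -> str:
    """Return the visible text of a page, truncated to max_chars."""
    parser = VisibleTextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()[:max_chars]
```

The result could be passed to `extract_product_data` in place of the `clean_html` output. The trade-off is that dropping the markup also drops attributes, so anything stored only in an attribute (a `data-price` value, say) would be lost.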
Here's how it held up in practice.

Speed: Each extraction takes 1-3 seconds. That's fine for a hundred products, but not for millions. Caching responses helps.

Cost: GPT-4o-mini is cheap (~$0.15 per million input tokens). A single extraction with a 4K-token page costs about $0.0006 in input tokens, so $0.001 per page is a comfortable upper estimate. For my 30 sites with 50 products each (1,500 pages), that's at most about $1.50 total, which is acceptable for a hobby project.

Accuracy: The LLM sometimes missed the price if it was inside a JavaScript-rendered component (like a React app), because the price simply isn't in the HTML that `requests` downloads. For those, I had to fall back to browser automation or use an API like ScrapingBee. The LLM can also hallucinate: it once returned a price that looked plausible but was actually the shipping cost. I added a validation step that checks whether the price contains a currency symbol and a numeric value.

When NOT to use this approach: stable, high-volume scraping. If a site's markup rarely changes and you're pulling a lot of pages, a traditional scraper is still the better choice on both cost and latency.

Next time, I'd combine both worlds: use an LLM as a fallback for sites that change often, but keep a simple CSS selector cache for stable pages. I'd also try fine-tuning a smaller model (like a Llama variant) for cheaper on-premise extraction, especially if I needed to process thousands of pages.

Another improvement: instead of sending raw HTML, I could extract only visible text blocks using a library like trafilatura or readability-lxml. That reduces tokens and should improve accuracy, because the LLM doesn't get distracted by markup noise.

LLM-powered scraping isn't a silver bullet, but for messy, semi-structured data, it saved me weekends of frustration. Have you tried letting an AI parse your scraped pages? What worked, or didn't, for you?
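A postscript on the validation step from the accuracy notes, since a few people tend to ask: below is a rough sketch of that kind of check, not my exact code. The function names, the currency list, and the regular expressions are all assumptions chosen for illustration. It only catches malformed output; a well-formed but wrong number (like the shipping cost) would still pass.

```python
import re

# Assumption: a small, illustrative set of currency markers.
CURRENCY_PATTERN = re.compile(r"[$€£¥]|\b(?:USD|EUR|GBP|CAD)\b")
# At least one digit run, optionally with . or , separators (89.99, 1,299.00).
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def is_plausible_price(price) -> bool:
    """True if price is a string with both a currency marker and a number."""
    if not isinstance(price, str):
        return False
    has_currency = CURRENCY_PATTERN.search(price) is not None
    has_number = NUMBER_PATTERN.search(price) is not None
    return has_currency and has_number


def validate_extraction(data: dict) -> bool:
    """Check the three fields the prompt asks for before trusting the result."""
    name = data.get("product_name")
    return (
        isinstance(name, str)
        and bool(name.strip())
        and is_plausible_price(data.get("price"))
        and isinstance(data.get("in_stock"), bool)
    )
```

Anything that fails `validate_extraction` is a candidate for a retry or manual review rather than going straight into the comparison table.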