Why I ditched regex scrapers for an LLM parser (and when you shouldn't)


[ARTICLE · art-27437] src=dev.to ↗ pub=2026-06-15T02:00Z topic=large-language-models verified=true sentiment=· neutral

A developer building a price comparison tool for outdoor gear replaced brittle regex and CSS selectors with an LLM-based parser to extract product details from 30 e-commerce sites. Using GPT-4o-mini with a simple prompt, the LLM successfully extracted product name, price, and availability from raw HTML snippets with about 80% accuracy, eliminating per-site maintenance. The developer notes that while the LLM approach works well for inconsistent sites, traditional scrapers remain preferable for stable, high-volume scraping due to cost and latency.

4 min read · 19 views · published Jun 15, 2026

Last month I needed to scrape product details from 30 different e-commerce sites. Each site used its own HTML structure, class names changed weekly, and some were just plain inconsistent. I had two options: write a mountain of brittle CSS selectors or try something I’d been avoiding—letting an LLM figure out the extraction.

Here’s what I learned the hard way, including the code that actually worked and the cases where I should have just stuck with BeautifulSoup.

I was building a price comparison tool for niche outdoor gear. The data I needed was simple: product name, price, availability, and a few specs. But the sources ranged from massive marketplaces to small family-run shops. Every time a site pushed a new template, my carefully built regex broke. I spent more time maintaining scrapers than actually using the data.

A typical selector for a price field looked like this:

import re
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/product/123', timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Primary selector, tied to one site's current template
price_element = soup.select_one('span.price--current > span.value')
if not price_element:
    # Fallback: any div whose class contains "price"
    price_element = soup.find('div', class_=re.compile(r'price.*'))

I was debugging selectors more than I was analyzing prices. Something had to change.

First I tried using XPath with fuzzy matching. That helped a little, but still required per-site rules. Then I reached for machine learning—training a small model on HTML structure. Overkill for a side project, and I didn’t have labeled data for each site.

I looked at commercial scraping services, but they were either too expensive or required sending my data through their pipelines, which felt like over-sharing for a small personal tool.

Then I heard about people using LLMs to parse unstructured data directly from raw HTML or even just the visible text. I was skeptical—LLMs are slow, expensive, and hallucinate. But the pain was real, so I gave it a shot.

Instead of writing selectors per site, I started sending the raw HTML (or a trimmed version) to an LLM with a simple instruction: “Extract the product name, price, and availability status. Return JSON.”

Here’s the core function I ended up with:

import json
from openai import OpenAI
import requests

client = OpenAI()

def extract_product_data(html_snippet: str) -> dict:
    """Ask the model for name, price, and stock status; return the parsed JSON."""
    prompt = f"""You are a data extraction assistant. From the following HTML, extract:
- product_name (string)
- price (string, include currency symbol if present)
- in_stock (boolean)

Return only valid JSON with no extra text.

HTML: {html_snippet[:4000]}"""  # Truncated to reduce tokens

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper and fast enough
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

To use it, I just fetch the page and pass a cleaned snippet (removing scripts, styles, and navigation elements to keep token count low).

import re

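def clean_html(raw_html: str) -> str:
    """Strip script, style, and nav blocks, then cap the length."""
    cleaned = raw_html
    # Remove each block together with its contents, regardless of tag case
    for tag in ('script', 'style', 'nav'):
        cleaned = re.sub(rf'<{tag}\b[^>]*>.*?</{tag}>', '', cleaned, flags=re.DOTALL | re.IGNORECASE)
    return cleaned[:5000]  # Keep first 5000 chars as context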

Then I called:

raw = requests.get('https://example.com/product/123').text
snippet = clean_html(raw)
data = extract_product_data(snippet)
print(data)

It worked surprisingly well—on maybe 80% of the pages. The LLM could find the price even when it was buried in a table or formatted with weird spans. No regex, no per-site logic.

One of the services I evaluated for this approach was Interwest AI, which offers a similar extraction API. I ended up rolling my own with OpenAI because I wanted full control, but the technique is the same.

Speed: Each extraction takes 1-3 seconds. That’s fine for a hundred products, but not for millions. Caching helps.
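The caching mentioned above could look something like this minimal sketch. It keys results by a SHA-256 hash of the cleaned snippet, so an unchanged page never triggers a second API call. The `cached_extract` name and the in-memory dict are assumptions for illustration, and the `extract` argument stands in for a function like `extract_product_data`.

```python
import hashlib
from typing import Callable

# In-memory cache (assumption); a real tool would likely persist this to disk.
_cache: dict[str, dict] = {}

def cached_extract(html_snippet: str, extract: Callable[[str], dict]) -> dict:
    """Return the cached result for this snippet, calling `extract` only on a miss."""
    key = hashlib.sha256(html_snippet.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = extract(html_snippet)
    return _cache[key]
```

Hashing the cleaned snippet rather than the URL means a page is re-extracted only when its content actually changes.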

Cost: GPT-4o-mini is cheap (~$0.15 per million input tokens). A single extraction with a 4K token page costs about $0.001. For my 30 sites with 50 products each, that’s about $1.50 total—acceptable for a hobby project.
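To make the arithmetic explicit, here is a rough sketch that counts input tokens only, at the rate quoted above. Output tokens are billed separately and are ignored here, an assumption that is reasonable only because the JSON reply is tiny. The function name is hypothetical.

```python
INPUT_PRICE_PER_MILLION = 0.15  # USD per million input tokens, the figure quoted above

def input_cost(tokens_per_page: int, pages: int) -> float:
    """Estimated input-token cost in USD; output tokens are not counted."""
    return tokens_per_page * pages * INPUT_PRICE_PER_MILLION / 1_000_000

per_page = input_cost(4_000, 1)      # one 4K-token page
total = input_cost(4_000, 30 * 50)   # 30 sites x 50 products
print(f"per page: ${per_page:.4f}, total: ${total:.2f}")
```

On input alone this comes to roughly $0.0006 per page and $0.90 for 1,500 pages, so the rounded figures of $0.001 and $1.50 leave headroom for output tokens and retries.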

Accuracy: The LLM sometimes missed the price if it was inside a JavaScript-rendered component (like a React app). For those, I had to fall back to browser automation or use an API like ScrapingBee. Also, the LLM can hallucinate—it once returned a price that looked plausible but was actually the shipping cost. I added a validation step that checks if the price contains a currency symbol and numeric value.
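The validation step described above might look like the following sketch. The regex and the set of currency symbols are assumptions, not the original check. It only confirms the shape of the value, so a shipping cost mistaken for the price would still pass.

```python
import re

# Assumed pattern: a currency symbol or 3-letter code next to a number.
_CURRENCY = r"(?:[$€£¥]|[A-Z]{3})"
_NUMBER = r"\d[\d.,]*"
_PRICE_RE = re.compile(
    rf"^\s*(?:{_CURRENCY}\s*{_NUMBER}|{_NUMBER}\s*{_CURRENCY})\s*$"
)

def looks_like_price(value: object) -> bool:
    """True if `value` is a string with a currency marker and a numeric amount."""
    return isinstance(value, str) and bool(_PRICE_RE.match(value))
```

A result that fails this check can be retried or flagged for manual review instead of being stored.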

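When NOT to use this approach: on stable, high-volume sites, where per-page latency and cost add up and a plain selector does the job; and on JavaScript-rendered pages, where the data never appears in the raw HTML the LLM sees.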

If I were starting over, I’d combine both worlds: use an LLM as a fallback for sites that change often, but keep a simple CSS selector cache for stable pages. I’d also try fine-tuning a smaller model (like a Llama variant) for cheaper on-premises extraction, especially if I needed to process thousands of pages.
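A minimal sketch of that hybrid routing, under stated assumptions: each stable host gets a cheap extractor (in practice a wrapper around a CSS selector), and the LLM is called only when there is no extractor for the host or the extractor returns `None`. The `extract_with_fallback` name, the per-host mapping, and the `llm_extract` callable are all hypothetical.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

SelectorExtractor = Callable[[str], Optional[dict]]

def extract_with_fallback(
    url: str,
    html: str,
    selector_extractors: dict[str, SelectorExtractor],
    llm_extract: Callable[[str], dict],
) -> dict:
    """Try the host's selector-based extractor first; fall back to the LLM."""
    host = urlparse(url).netloc
    fast = selector_extractors.get(host)
    if fast is not None:
        result = fast(html)
        if result is not None:  # selector still matches the current template
            return result
    # Unknown host, or the template changed and the selector found nothing
    return llm_extract(html)
```

Passing the extractors in as callables keeps the routing logic independent of any particular parser or model client.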

Another improvement: instead of sending raw HTML, I could extract only visible text blocks using a library like trafilatura or readability-lxml. That reduces tokens and improves accuracy because the LLM doesn’t get distracted by markup noise.
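As a rough illustration of the idea, the sketch below pulls visible text using only Python's standard-library `html.parser`. It is a stand-in, not how trafilatura or readability-lxml work; those libraries also detect and drop page boilerplate, which this does not. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class _VisibleText(HTMLParser):
    """Collects text nodes, skipping anything inside <script>, <style>, or <nav>."""
    SKIP = {"script", "style", "nav"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)

def visible_text(raw_html: str) -> str:
    """Return the page's visible text, one text node per line."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.parts)
```

The returned text could be passed to the extraction prompt in place of the HTML snippet.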

LLM-powered scraping isn't a silver bullet, but for messy, semi-structured data, it saved me weekends of frustration. Have you tried letting an AI parse your scraped pages? What worked—or didn’t—for you?
