{"slug": "why-i-ditched-regex-scrapers-for-an-llm-parser-and-when-you-shouldn-t", "title": "Why I ditched regex scrapers for an LLM parser (and when you shouldn't)", "summary": "A developer building a price comparison tool for outdoor gear replaced brittle regex and CSS selectors with an LLM-based parser to extract product details from 30 e-commerce sites. Using GPT-4o-mini with a simple prompt, the LLM successfully extracted product name, price, and availability from raw HTML snippets with about 80% accuracy, eliminating per-site maintenance. The developer notes that while the LLM approach works well for inconsistent sites, traditional scrapers remain preferable for stable, high-volume scraping due to cost and latency.", "body_md": "Last month I needed to scrape product details from 30 different e-commerce sites. Each site used its own HTML structure, class names changed weekly, and some were just plain inconsistent. I had two options: write a mountain of brittle CSS selectors or try something I’d been avoiding—letting an LLM figure out the extraction.\n\nHere’s what I learned the hard way, including the code that actually worked and the cases where I should have just stuck with BeautifulSoup.\n\nI was building a price comparison tool for niche outdoor gear. The data I needed was simple: product name, price, availability, and a few specs. But the sources ranged from massive marketplaces to small family-run shops. Every time a site pushed a new template, my carefully built regex broke. 
I spent more time maintaining scrapers than actually using the data.\n\nA typical selector for a price field looked like this:\n\n``` python\nimport re\nimport requests\nfrom bs4 import BeautifulSoup\n\nresponse = requests.get('https://example.com/product/123')\nsoup = BeautifulSoup(response.text, 'html.parser')\n\n# This selector changed three times in two weeks\nprice_element = soup.select_one('span.price--current > span.value')\nif not price_element:\n    price_element = soup.find('div', class_=re.compile(r'price.*'))\n```\n\nI was debugging selectors more than I was analyzing prices. Something had to change.\n\nFirst I tried using XPath with fuzzy matching. That helped a little, but still required per-site rules. Then I reached for machine learning—training a small model on HTML structure. Overkill for a side project, and I didn’t have labeled data for each site.\n\nI looked at commercial scraping services, but they were either too expensive or required sending my data through their pipelines, which felt like over-sharing for a small personal tool.\n\nThen I heard about people using LLMs to parse unstructured data directly from raw HTML or even just the visible text. I was skeptical—LLMs are slow, expensive, and hallucinate. But the pain was real, so I gave it a shot.\n\nInstead of writing selectors per site, I started sending the raw HTML (or a trimmed version) to an LLM with a simple instruction: “Extract the product name, price, and availability status. Return JSON.”\n\nHere’s the core function I ended up with:\n\n``` python\nimport json\nfrom openai import OpenAI\nimport requests\n\nclient = OpenAI()\n\ndef extract_product_data(html_snippet: str) -> dict:\n    prompt = f\"\"\"You are a data extraction assistant. 
From the following HTML, extract:\n- product_name (string)\n- price (string, include currency symbol if present)\n- in_stock (boolean)\n\nReturn only valid JSON with no extra text.\n\nHTML: {html_snippet[:4000]}\"\"\"  # Truncated to 4000 characters to reduce tokens\n\n    response = client.chat.completions.create(\n        model=\"gpt-4o-mini\",  # Cheaper and fast enough\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n        temperature=0,\n        response_format={\"type\": \"json_object\"}\n    )\n\n    return json.loads(response.choices[0].message.content)\n```\n\nTo use it, I just fetch the page and pass a cleaned snippet (removing scripts, styles, and navigation elements to keep token count low).\n\n``` python\nimport re\n\ndef clean_html(raw_html: str) -> str:\n    # Remove script, style, and nav blocks (DOTALL lets '.' span newlines)\n    cleaned = re.sub(r'<script[^>]*>.*?</script>', '', raw_html, flags=re.DOTALL)\n    cleaned = re.sub(r'<style[^>]*>.*?</style>', '', cleaned, flags=re.DOTALL)\n    cleaned = re.sub(r'<nav[^>]*>.*?</nav>', '', cleaned, flags=re.DOTALL)\n    return cleaned[:5000]  # Keep first 5000 chars as context\n```\n\nThen I called:\n\n``` python\nraw = requests.get('https://example.com/product/123').text\nsnippet = clean_html(raw)\ndata = extract_product_data(snippet)\nprint(data)\n# {'product_name': 'Trail Pro Jacket', 'price': '$89.99', 'in_stock': True}\n```\n\nIt worked surprisingly well, on maybe 80% of the pages. The LLM could find the price even when it was buried in a table or formatted with weird spans. No regex, no per-site logic.\n\nOne of the services I evaluated for this approach was [Interwest AI](https://ai.interwestinfo.com/), which offers a similar extraction API. I ended up rolling my own with OpenAI because I wanted full control, but the technique is the same.\n\n**Speed**: Each extraction takes 1-3 seconds. That’s fine for a hundred products, but not for millions. Caching helps.\n\n**Cost**: GPT-4o-mini is cheap (~$0.15 per million input tokens). A single extraction with a 4K token page costs about $0.001. 
For my 30 sites with 50 products each, that’s about $1.50 total, which is acceptable for a hobby project.\n\n**Accuracy**: The LLM sometimes missed the price if it was inside a JavaScript-rendered component (like a React app). For those, I had to fall back to browser automation or use an API like ScrapingBee. Also, the LLM can hallucinate: it once returned a price that looked plausible but was actually the shipping cost. I added a validation step that checks if the price contains a currency symbol and numeric value.\n\n**When NOT to use this approach**: For stable, high-volume scraping, traditional scrapers are still preferable because of cost and latency.\n\nI’d combine both worlds: use an LLM as a fallback for sites that change often, but keep a simple CSS selector cache for stable pages. I’d also try fine-tuning a smaller model (like a Llama variant) for cheaper on-premise extraction, especially if I needed to process thousands of pages.\n\nAnother improvement: instead of sending raw HTML, I could extract only visible text blocks using a library like `trafilatura` or `readability-lxml`. That reduces tokens and improves accuracy because the LLM doesn’t get distracted by markup noise.\n\nLLM-powered scraping isn't a silver bullet, but for messy, semi-structured data, it saved me weekends of frustration. Have you tried letting an AI parse your scraped pages? 
What worked—or didn’t—for you?", "url": "https://wpnews.pro/news/why-i-ditched-regex-scrapers-for-an-llm-parser-and-when-you-shouldn-t", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/why-i-ditched-regex-scrapers-for-an-llm-parser-and-when-you-shouldnt-3e78", "published_at": "2026-06-15 02:00:27+00:00", "updated_at": "2026-06-15 02:40:55.030092+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-tools"], "entities": ["OpenAI", "GPT-4o-mini", "BeautifulSoup", "Interwest AI"], "alternates": {"html": "https://wpnews.pro/news/why-i-ditched-regex-scrapers-for-an-llm-parser-and-when-you-shouldn-t", "markdown": "https://wpnews.pro/news/why-i-ditched-regex-scrapers-for-an-llm-parser-and-when-you-shouldn-t.md", "text": "https://wpnews.pro/news/why-i-ditched-regex-scrapers-for-an-llm-parser-and-when-you-shouldn-t.txt", "jsonld": "https://wpnews.pro/news/why-i-ditched-regex-scrapers-for-an-llm-parser-and-when-you-shouldn-t.jsonld"}}