Why I ditched regex scrapers for an LLM parser (and when you shouldn't)

A developer building a price comparison tool for outdoor gear replaced brittle regex and CSS selectors with an LLM-based parser to extract product details from 30 e-commerce sites. Using GPT-4o-mini with a simple prompt, the LLM successfully extracted product name, price, and availability from raw HTML snippets with about 80% accuracy, eliminating per-site maintenance. The developer notes that while the LLM approach works well for inconsistent sites, traditional scrapers remain preferable for stable, high-volume scraping due to cost and latency.

Last month I needed to scrape product details from 30 different e-commerce sites. Each site used its own HTML structure, class names changed weekly, and some were just plain inconsistent. I had two options: write a mountain of brittle CSS selectors or try something I’d been avoiding: letting an LLM figure out the extraction. Here’s what I learned the hard way, including the code that actually worked and the cases where I should have just stuck with BeautifulSoup.

I was building a price comparison tool for niche outdoor gear. The data I needed was simple: product name, price, availability, and a few specs. But the sources ranged from massive marketplaces to small family-run shops. Every time a site pushed a new template, my carefully built regex broke. I spent more time maintaining scrapers than actually using the data.

A typical selector for a price field looked like this:

```python
import re
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/product/123')
soup = BeautifulSoup(response.text, 'html.parser')

# This selector changed three times in two weeks
price_element = soup.select_one('span.price--current span.value')
if not price_element:
    price_element = soup.find('div', class_=re.compile(r'price.*'))
```

I was debugging selectors more than I was analyzing prices. Something had to change.
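To make the maintenance problem concrete, here is a minimal sketch of the fallback chain that selector code like the above tends to grow into. It is not the code from the project: it uses only the standard library, and `PRICE_PATTERNS`, `find_price`, and the sample HTML strings are hypothetical names and data made up for illustration.

```python
import re
from typing import Optional

# Hypothetical patterns, ordered from most specific to most generic.
PRICE_PATTERNS = [
    # Exact markup of the "current" template.
    re.compile(r'<span class="price--current">\s*<span class="value">([^<]+)</span>'),
    # Any div whose class attribute contains "price".
    re.compile(r'<div class="[^"]*price[^"]*">([^<]+)</div>'),
    # Last resort: anything that looks like a currency amount.
    re.compile(r'([$€£]\s?\d+(?:[.,]\d{2})?)'),
]


def find_price(html: str) -> Optional[str]:
    """Return the first match from the pattern chain, or None if all miss."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(html)
        if match:
            return match.group(1).strip()
    return None


# Made-up snippets standing in for a template before and after a redesign.
old_template = '<span class="price--current"><span class="value">$49.99</span></span>'
new_template = '<div class="product-price-box">$49.99</div>'

print(find_price(old_template))
print(find_price(new_template))
```

Each redesign adds another entry to the list, and the generic last-resort pattern will happily match the wrong number (a shipping fee, a crossed-out old price) on a busy page. That trade-off between specificity and robustness is what made per-site rules so costly to keep alive.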
First I tried using XPath with fuzzy matching. That helped a little, but still required per-site rules. Then I reached for machine learning, training a small model on HTML structure. Overkill for a side project, and I didn’t have labeled data for each site.

I looked at commercial scraping services, but they were either too expensive or required sending my data through their pipelines, which felt like over-sharing for a small personal tool.

Then I heard about people using LLMs to parse unstructured data directly from raw HTML or even just the visible text. I was skeptical: LLMs are slow, expensive, and hallucinate. But the pain was real, so I gave it a shot.

Instead of writing selectors per site, I started sending the raw HTML (or a trimmed version) to an LLM with a simple instruction: “Extract the product name, price, and availability status. Return JSON.”

Here’s the core function I ended up with:

```python
import json
from openai import OpenAI
import requests

client = OpenAI()

def extract_product_data(html_snippet: str) -> dict:
    prompt = f"""You are a data extraction assistant. From the following HTML, extract:
- product_name (string)
- price (string, include currency symbol if present)
- in_stock (boolean)

Return only valid JSON with no extra text.

HTML:
{html_snippet[:4000]}"""  # Truncated to reduce tokens

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper and fast enough
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
```

To use it, I just fetch the page and pass a cleaned snippet (removing scripts, styles, and navigation elements to keep token count low).

```python
import re

def clean_html(raw_html: str) -> str:
    # Remove script and style tags
    cleaned = re.sub(r'
```