When Regex Fails: LLMs for Messy HTML Data

A developer replaced brittle regex and CSS selectors with a local LLM to extract product data from messy legacy HTML, achieving reliable results at a fraction of the cost of cloud-based models. The approach uses Llama 3.1 8B, run locally through Ollama, to parse inconsistent HTML structures and output structured JSON, handling edge cases that broke traditional parsing rules.

Last month I inherited a project that needed to extract product information from a legacy e‑commerce site. The HTML was a nightmare: no semantic classes, inconsistent attribute names, and the occasional blob of inline JavaScript. I thought I could just write a few regular expressions and be done in an hour. Six hours later I was staring at a wall of conditional logic that broke every time the page changed. I needed a better way, and I ended up using a large language model (LLM) to handle the fuzzy extraction. Here’s what I learned, dead ends included, and a working approach you can copy‑paste today.

The site had product cards like this:

```html
<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <span>Price: <b>$29.99</b></span>
  <p>SKU: WID-001</p>
  <div class="desc">A handy gadget<br>with extra features</div>
  <span>In Stock</span>
</div>
```

But other cards would swap `<span>` for `<div>`, omit the SKU entirely, or use inline styles. A few pages even dumped the price into a `data-` attribute inside a script tag. Parsing this with BeautifulSoup and CSS selectors worked on 80% of the pages, but that last 20% caused silent failures. I spent days writing custom parsers that became unmaintainable.

First I tried regex, with patterns like `/Price:\s*<[^>]+>([^<]+)<\/b>/i`. It worked on one page but broke on another where the `<b>` was nested differently. Regex is brittle for HTML. We all know this, but sometimes we pretend we don't.

Next I wrote a set of rules: “if `.name` exists, use that; else try `itemprop="name"`; else fall back to the first `<h3>`.” Every new page meant new rules. The rule count exploded, and I still missed edge cases.

Then I fed entire HTML blocks to GPT‑4 with a prompt like “extract name, price, SKU, description, stock status.” It worked beautifully, but it cost $0.03 per product. For 10,000 products that’s $300. And latency was 2–3 seconds per call. Not feasible for a one‑time migration.
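To make the GPT‑4 numbers above concrete, here is a small back‑of‑the‑envelope sketch. It only multiplies out the figures quoted in the text ($0.03 per product, 2–3 seconds per call, 10,000 products) and assumes the calls run one after another; the function name `estimate_batch` is made up for illustration.

```python
from typing import Dict, Tuple


def estimate_batch(
    n_items: int,
    cost_per_call_usd: float,
    latency_range_s: Tuple[float, float],
) -> Dict[str, float]:
    """Estimate total cost and serial wall-clock time for n_items LLM calls."""
    low_s, high_s = latency_range_s
    return {
        "total_cost_usd": round(n_items * cost_per_call_usd, 2),
        # Assumes strictly sequential calls (no batching or concurrency).
        "hours_min": round(n_items * low_s / 3600, 1),
        "hours_max": round(n_items * high_s / 3600, 1),
    }


estimate = estimate_batch(10_000, 0.03, (2.0, 3.0))
print(estimate)
# {'total_cost_usd': 300.0, 'hours_min': 5.6, 'hours_max': 8.3}
```

So on top of the $300, a sequential run would tie up roughly five and a half to eight hours of wall‑clock time, before any retries.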
I used a smaller, cheaper model (like Llama 3.1 8B via Ollama, or a service that wraps similar models) and asked it to output JSON according to a predefined schema. The trick was to show it the schema and only ask for the fields I needed, with clear instructions on how to handle missing data.

Here’s the core idea: I wrote a Python script using `requests` and `json`. For the LLM, I used Ollama with `llama3.1:8b` running locally, but you can swap in any API that supports chat completions.

```python
import json
import re
from typing import Dict, Optional

import requests

LLM_URL = "http://localhost:11434/api/generate"  # Ollama endpoint
MODEL = "llama3.1:8b"


def extract_product(html: str) -> Optional[Dict]:
    # Field descriptions shown to the model; types are spelled out explicitly.
    schema = {
        "name": "string (required)",
        "price": "float (required, in USD)",
        "sku": "string (optional)",
        "description": "string (optional)",
        "in_stock": "boolean (optional)",
    }

    prompt = f"""You are an HTML extraction expert. Given a product card's HTML, return a JSON object with these fields:
{schema}

Return ONLY valid JSON. If a field is missing, use null.

Examples:
HTML: <div><span class="name">Widget</span><span>Price: <b>$10.00</b></span></div>
JSON: {{"name": "Widget", "price": 10.00, "sku": null, "description": null, "in_stock": null}}

HTML: {html}
JSON:"""

    response = requests.post(
        LLM_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            # Ollama takes sampling parameters under "options",
            # not as a top-level "temperature" key.
            "options": {"temperature": 0.1},
        },
        timeout=120,
    )
    text = response.json()["response"]

    # Grab the JSON object, skipping any markdown code fences around it
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            return None
    return None


# Test with our HTML
html_sample = """<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <span>Price: <b>$29.99</b></span>
  <p>SKU: WID-001</p>
  <div class="desc">A handy gadget<br>with extra features</div>
  <span>In Stock</span>
</div>"""

result = extract_product(html_sample)
print(result)
# Output: {'name': 'Widget Alpha', 'price': 29.99, 'sku': 'WID-001', 'description': 'A handy gadget with extra features', 'in_stock': True}
```

If the result is `None` or fails a quick sanity check (e.g., price is negative), I retry once with `temperature=0.3`. That’s usually enough to fix formatting issues.

Two things I learned along the way: a high temperature hurts (I tried `temp=0.7` and got weird field names), and the schema should spell out types such as `float` and `boolean`, because LLMs can guess wrong.

One service I tested that abstracts this exact pattern is InterwestInfo AI (https://ai.interwestinfo.com/). It provides a prompt‑based API with built‑in JSON validation, so you don’t have to write the retry logic yourself. But the technique is the same regardless of the endpoint.

If I were starting over, I’d begin with a small local model and measure accuracy on a sample of 100 pages. If it’s above 95%, done. If not, I’d add a few‑shot examples for the tricky cases instead of building a rule‑based fallback. Also, I’d cache the LLM responses: if two products share the same HTML structure, the model often gives identical results.

This approach saved me from writing fragile parsing code that would have needed constant updates. It’s not perfect, but for messy, real‑world HTML, it’s the most maintainable solution I’ve found.

What’s your go‑to when traditional scraping fails? Do you reach for an LLM or something else?
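For anyone who wants the retry‑once step and the response cache spelled out, here is a minimal sketch. It is not the exact code from the migration: it assumes the extractor is adapted to accept a temperature argument (the `Extractor` signature is hypothetical), it keys the cache on a hash of the exact HTML, which is narrower than “same HTML structure”, and `fake_extract` is a stand‑in so the example runs without a model.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical extractor signature: (html, temperature) -> parsed dict or None.
Extractor = Callable[[str, float], Optional[Dict]]

_cache: Dict[str, Dict] = {}  # in-memory cache keyed by a hash of the raw HTML


def is_valid(result: Optional[Dict]) -> bool:
    """Quick sanity check: non-empty name and a non-negative numeric price."""
    if not isinstance(result, dict):
        return False
    name = result.get("name")
    price = result.get("price")
    if not isinstance(name, str) or not name.strip():
        return False
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        return False
    return price >= 0


def extract_with_retry(
    html: str,
    extract: Extractor,
    temperatures: Sequence[float] = (0.1, 0.3),
) -> Optional[Dict]:
    """Try each temperature in order; cache and return the first valid result."""
    key = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]
    for temperature in temperatures:
        result = extract(html, temperature)
        if is_valid(result):
            _cache[key] = result
            return result
    return None


# Stand-in extractor for demonstration only (no model is called): it returns
# a negative price on the first attempt and a clean record on the retry.
calls: List[float] = []


def fake_extract(html: str, temperature: float) -> Optional[Dict]:
    calls.append(temperature)
    if temperature < 0.3:
        return {"name": "Widget Alpha", "price": -1}
    return {"name": "Widget Alpha", "price": 29.99, "sku": "WID-001"}


first = extract_with_retry("<div>card</div>", fake_extract)
second = extract_with_retry("<div>card</div>", fake_extract)  # served from cache
print(first, calls)
```

Passing the extractor in as a parameter keeps the retry and cache logic independent of the endpoint, so the same wrapper would work with a local model or a hosted API.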