{"slug": "when-regex-fails-llms-for-messy-html-data", "title": "When Regex Fails: LLMs for Messy HTML Data", "summary": "A developer replaced brittle regex and CSS selectors with a local LLM to extract product data from messy legacy HTML, achieving reliable results at a fraction of the cost of cloud-based models. The approach runs the Llama 3.1 8B model through Ollama to parse inconsistent HTML structures and output structured JSON, handling edge cases that broke traditional parsing rules.", "body_md": "Last month I inherited a project that needed to extract product information from a legacy e‑commerce site. The HTML was a nightmare: no semantic classes, inconsistent attribute names, and the occasional blob of inline JavaScript. I thought I could just write a few regular expressions and be done in an hour. Six hours later I was staring at a wall of conditional logic that broke every time the page changed.\n\nI needed a better way, and I ended up using a large language model (LLM) to handle the fuzzy extraction. Here’s what I learned, dead ends included, and a working approach you can copy‑paste today.\n\nThe site had product cards like this:\n\n```\n<div id=\"prod_123\">\n  <span class=\"name\">Widget Alpha</span>\n  <span>Price: <b>$29.99</b></span>\n  <p>SKU: WID-001</p>\n  <div class=\"desc\">A handy gadget<br>with extra features</div>\n  <span>In Stock</span>\n</div>\n```\n\nBut other cards would swap `<span>` for `<div>`, omit the SKU entirely, or use inline styles. A few pages even dumped the price into a `data-*` attribute inside a script tag.\n\nParsing this with BeautifulSoup and CSS selectors worked on 80% of the pages, but that last 20% caused silent failures. I spent days writing custom parsers that became unmaintainable.\n\nI tried patterns like `/(Price:)\\s*<[^>]+>([^<]+)<\\/b>/i`. It worked on one page but broke on another where the `<b>` was nested differently. 
Regex is brittle for HTML. We all know this, but sometimes we pretend we don't.\n\nI wrote a set of rules: “if `.name` exists, use that; else try `[itemprop=\"name\"]`; else fall back to the first `<h3>`.” Every new page meant new rules. The rule count exploded, and I still missed edge cases.\n\nI fed entire HTML blocks to GPT‑4 with a prompt like “extract name, price, SKU, description, stock status.” It worked beautifully, but it cost $0.03 per product. For 10,000 products that’s $300. And latency was 2–3 seconds per call. Not feasible for a one‑time migration.\n\nI used a smaller, cheaper model (like Llama 3.1 8B via Ollama, or a service that wraps similar models) and asked it to output JSON according to a predefined schema. The trick was to *show* it the schema and only ask for the fields I needed, with clear instructions on how to handle missing data.\n\nHere’s the core idea:\n\nI wrote a Python script using `requests` and `json`. For the LLM, I used Ollama with `llama3.1:8b` running locally, but you can swap in any API that supports chat completions.\n\n```python\nimport requests\nimport json\nimport re\nfrom typing import Optional, Dict\n\nLLM_URL = \"http://localhost:11434/api/generate\"  # Ollama endpoint\nMODEL = \"llama3.1:8b\"\n\ndef extract_product(html: str) -> Optional[Dict]:\n    schema = {\n        \"name\": \"string (required)\",\n        \"price\": \"float (required, in USD)\",\n        \"sku\": \"string (optional)\",\n        \"description\": \"string (optional)\",\n        \"in_stock\": \"boolean (optional)\"\n    }\n    prompt = f\"\"\"You are an HTML extraction expert. Given a product card's HTML, return a JSON object with these fields:\n{schema}\n\nReturn ONLY valid JSON. 
If a field is missing, use null.\n\nExamples:\nHTML: <div><span class=\"name\">Widget</span><span>Price: <b>$10.00</b></span></div>\nJSON: {{\"name\": \"Widget\", \"price\": 10.00, \"sku\": null, \"description\": null, \"in_stock\": null}}\n\nHTML: {html}\nJSON:\"\"\"\n    response = requests.post(\n        LLM_URL,\n        json={\n            \"model\": MODEL,\n            \"prompt\": prompt,\n            \"stream\": False,\n            # Ollama reads sampling settings from the options object;\n            # a top-level temperature key is ignored.\n            \"options\": {\"temperature\": 0.1}\n        },\n        timeout=120\n    )\n    response.raise_for_status()\n    text = response.json()[\"response\"]\n    # Grab the outermost {...} span, which also drops any markdown fences around it\n    match = re.search(r'\\{.*\\}', text, re.DOTALL)\n    if match:\n        try:\n            return json.loads(match.group())\n        except json.JSONDecodeError:\n            return None\n    return None\n\n# Test with our HTML\nhtml_sample = \"\"\"<div id=\"prod_123\">\n  <span class=\"name\">Widget Alpha</span>\n  <span>Price: <b>$29.99</b></span>\n  <p>SKU: WID-001</p>\n  <div class=\"desc\">A handy gadget<br>with extra features</div>\n  <span>In Stock</span>\n</div>\"\"\"\n\nresult = extract_product(html_sample)\nprint(result)\n# Example output (model output can vary between runs):\n# {'name': 'Widget Alpha', 'price': 29.99, 'sku': 'WID-001', 'description': 'A handy gadget with extra features', 'in_stock': True}\n```\n\nIf the result is `None` or fails a quick sanity check (e.g., price is negative), I retry once with `temperature=0.3`. That’s usually enough to fix formatting issues.\n\n- Keep the temperature low. I tried `temp=0.7` and got weird field names.\n- State the expected types in the schema (`float`, `boolean`). LLMs can guess wrong.\n\nOne service I tested that abstracts this exact pattern is [InterwestInfo AI](https://ai.interwestinfo.com/). It provides a prompt‑based API with built‑in JSON validation, so you don’t have to write the retry logic yourself. But the technique is the same regardless of the endpoint.\n\nI’d start with a small local model and measure accuracy on a sample of 100 pages. If it’s above 95%, done. 
If not, I’d add a few‑shot examples for the tricky cases instead of building a rule‑based fallback. Also, I’d cache the LLM responses – if two products share the same HTML structure, the model often gives identical results.\n\nThis approach saved me from writing fragile parsing code that would have needed constant updates. It’s not perfect, but for messy, real‑world HTML, it’s the most maintainable solution I’ve found.\n\nWhat’s your go‑to when traditional scraping fails? Do you reach for an LLM or something else?", "url": "https://wpnews.pro/news/when-regex-fails-llms-for-messy-html-data", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-llms-for-messy-html-data-3j7f", "published_at": "2026-06-12 02:00:43+00:00", "updated_at": "2026-06-12 02:43:08.376099+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "natural-language-processing", "generative-ai", "ai-products"], "entities": ["BeautifulSoup", "CSS", "LLM"], "alternates": {"html": "https://wpnews.pro/news/when-regex-fails-llms-for-messy-html-data", "markdown": "https://wpnews.pro/news/when-regex-fails-llms-for-messy-html-data.md", "text": "https://wpnews.pro/news/when-regex-fails-llms-for-messy-html-data.txt", "jsonld": "https://wpnews.pro/news/when-regex-fails-llms-for-messy-html-data.jsonld"}}