# When Regex Fails: LLMs for Messy HTML Data

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-llms-for-messy-html-data-3j7f>
> Published: 2026-06-12 02:00:43+00:00

Last month I inherited a project that needed to extract product information from a legacy e‑commerce site. The HTML was a nightmare—no semantic classes, inconsistent attribute names, and the occasional blob of inline JavaScript. I thought I could just write a few regular expressions and be done in an hour. Six hours later I was staring at a wall of conditional logic that broke every time the page changed.

I needed a better way, and I ended up using a large language model (LLM) to handle the fuzzy extraction. Here’s what I learned—dead ends included—and a working approach you can copy‑paste today.

The site had product cards like this:

```
<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <span>Price: <b>$29.99</b></span>
  <p>SKU: WID-001</p>
  <div class="desc">A handy gadget<br>with extra features</div>
  <span>In Stock</span>
</div>
```

But other cards would swap `<span>` for `<div>`, omit the SKU entirely, or use inline styles. A few pages even dumped the price into a `data-*` attribute inside a script tag.

Parsing this with BeautifulSoup and CSS selectors worked on 80% of the pages, but that last 20% caused silent failures. I spent days writing custom parsers that became unmaintainable.

I tried patterns like `/(Price:)\s*<[^>]+>([^<]+)<\/b>/i`. It worked on one page but broke on another where the `<b>` was nested differently. Regex is brittle for HTML. We all know this, but sometimes we pretend we don't.
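
To make the failure concrete, here is a small sketch that runs the pattern above against two snippets. The `nested` variant is a hypothetical example of "nested differently" (an extra wrapper around the `<b>`), not markup taken from the actual site.

```python
import re

# The pattern from the text, written as a Python regex.
PRICE_RE = re.compile(r"(Price:)\s*<[^>]+>([^<]+)</b>", re.IGNORECASE)

# Same shape as the sample card: <b> directly follows "Price:".
simple = "<span>Price: <b>$29.99</b></span>"

# Hypothetical variant: the <b> sits inside one more <span>.
nested = '<span>Price: <span class="amt"><b>$29.99</b></span></span>'


def find_price(html):
    """Return the captured price text, or None when the pattern misses."""
    match = PRICE_RE.search(html)
    return match.group(2) if match else None


print(find_price(simple))  # $29.99
print(find_price(nested))  # None
```

The pattern allows exactly one tag between the label and the price text, so one extra wrapper is enough to make it miss without raising any error.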

I wrote a set of rules: “if `.name` exists, use that; else try `[itemprop="name"]`; else fall back to the first `<h3>`.” Every new page meant new rules. The rule count exploded, and I still missed edge cases.

I fed entire HTML blocks to GPT‑4 with a prompt like “extract name, price, SKU, description, stock status.” It worked beautifully—but it cost $0.03 per product. For 10,000 products that’s $300. And latency was 2–3 seconds per call. Not feasible for a one‑time migration.
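
The arithmetic behind those numbers, as a quick sketch. It assumes the calls run one after another with no batching or parallelism, which is a simplification.

```python
PRODUCTS = 10_000
COST_PER_CALL_USD = 0.03     # figure quoted above
LATENCY_RANGE_S = (2, 3)     # seconds per call, figure quoted above

total_cost = PRODUCTS * COST_PER_CALL_USD
# Wall-clock time if every call runs sequentially.
hours = [PRODUCTS * seconds / 3600 for seconds in LATENCY_RANGE_S]

print(f"Total cost: ${total_cost:.2f}")
print(f"Sequential runtime: {hours[0]:.1f} to {hours[1]:.1f} hours")
# Total cost: $300.00
# Sequential runtime: 5.6 to 8.3 hours
```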

I used a smaller, cheaper model (like Llama 3.1 8B via Ollama, or a service that wraps similar models) and asked it to output JSON according to a predefined schema. The trick was to *show* it the schema and only ask for the fields I needed, with clear instructions on how to handle missing data.

Here’s the core idea:

I wrote a Python script using `requests` and `json`. For the LLM, I used Ollama with `llama3.1:8b` running locally, but you can swap in any API that supports chat completions.

``` python
import requests
import json
import re
from typing import Optional, Dict

LLM_URL = "http://localhost:11434/api/generate"  # Ollama endpoint
MODEL = "llama3.1:8b"

def extract_product(html: str) -> Optional[Dict]:
    schema = {
        "name": "string (required)",
        "price": "float (required, in USD)",
        "sku": "string (optional)",
        "description": "string (optional)",
        "in_stock": "boolean (optional)"
    }
    prompt = f"""You are an HTML extraction expert. Given a product card's HTML, return a JSON object with these fields:
{json.dumps(schema, indent=2)}

Return ONLY valid JSON. If a field is missing, use null.

Examples:
HTML: <div><span class="name">Widget</span><span>Price: <b>$10.00</b></span></div>
JSON: {{"name": "Widget", "price": 10.00, "sku": null, "description": null, "in_stock": null}}

HTML: {html}
JSON:"""
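    response = requests.post(
        LLM_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            # Ollama expects sampling parameters under "options"
            "options": {"temperature": 0.1},
        },
        timeout=120,  # avoid hanging forever if the model stalls
    )
    response.raise_for_status()
    text = response.json()["response"]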
    # Keep everything from the first "{" to the last "}"; this drops markdown fences and stray chatter
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            return None
    return None

# Test with our HTML
html_sample = """<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <span>Price: <b>$29.99</b></span>
  <p>SKU: WID-001</p>
  <div class="desc">A handy gadget<br>with extra features</div>
  <span>In Stock</span>
</div>"""

result = extract_product(html_sample)
print(result)
# Example output (model output can vary between runs):
# {'name': 'Widget Alpha', 'price': 29.99, 'sku': 'WID-001', 'description': 'A handy gadget with extra features', 'in_stock': True}
```

If the result is `None` or fails a quick sanity check (e.g., price is negative), I retry once with `temperature=0.3`. That’s usually enough to fix formatting issues.
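
A minimal sketch of that retry step. It assumes the extractor accepts the temperature as a second argument, which `extract_product` above does not do as written. The names `is_sane` and `extract_with_retry` are hypothetical, and the checks are only examples of a "quick sanity check".

```python
from typing import Callable, Optional

# Assumed signature: extract(html, temperature) -> dict or None
Extractor = Callable[[str, float], Optional[dict]]


def is_sane(result: Optional[dict]) -> bool:
    """Cheap checks: a non-empty name and a non-negative numeric price."""
    if not isinstance(result, dict):
        return False
    name = result.get("name")
    price = result.get("price")
    if not isinstance(name, str) or not name.strip():
        return False
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        return False
    return price >= 0


def extract_with_retry(html: str, extract: Extractor) -> Optional[dict]:
    """Try at temperature 0.1, then once more at 0.3 if the result looks wrong."""
    for temperature in (0.1, 0.3):
        result = extract(html, temperature)
        if is_sane(result):
            return result
    return None
```

Passing the extractor in as a callable keeps the retry logic independent of the endpoint, so the same wrapper works with a local model or a hosted API.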

Two things to watch for. Keep the temperature low: I tried `temp=0.7` and got weird field names. And spell out the type of each field in the schema (`float`, `boolean`). LLMs can guess wrong.

One service I tested that abstracts this exact pattern is [InterwestInfo AI](https://ai.interwestinfo.com/). It provides a prompt‑based API with built‑in JSON validation, so you don’t have to write the retry logic yourself. But the technique is the same regardless of the endpoint.

I’d start with a small local model and measure accuracy on a sample of 100 pages. If it’s above 95%, done. If not, I’d add a few‑shot examples for the tricky cases instead of building a rule‑based fallback. Also, I’d cache the LLM responses – if two products share the same HTML structure, the model often gives identical results.
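
Here is a sketch of the two helpers this paragraph implies: an accuracy check against hand-labelled pages and a response cache. All names are hypothetical. The cache is keyed on the whitespace-normalized HTML, so it only skips the model for cards that are identical apart from whitespace; cards that merely share a layout still contain different values and need their own call.

```python
import hashlib
from typing import Callable, Dict, List, Optional


def accuracy(predicted: List[Optional[dict]], expected: List[dict]) -> float:
    """Share of pages whose extracted record equals the hand-labelled one."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need equally sized, non-empty lists")
    hits = sum(1 for got, want in zip(predicted, expected) if got == want)
    return hits / len(expected)


def cache_key(html: str) -> str:
    """Hash the HTML after collapsing runs of whitespace."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CachedExtractor:
    """Wraps an extractor so identical cards trigger only one model call."""

    def __init__(self, extract: Callable[[str], Optional[dict]]):
        self._extract = extract
        self._cache: Dict[str, dict] = {}

    def __call__(self, html: str) -> Optional[dict]:
        key = cache_key(html)
        if key in self._cache:
            return self._cache[key]
        result = self._extract(html)
        if result is not None:  # do not cache failures, so they can be retried
            self._cache[key] = result
        return result
```

Wrapping the extractor, for example `cached = CachedExtractor(extract_product)`, also makes re-runs of the whole job cheap while you tune the prompt, as long as the cache is kept alive or persisted between runs.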

This approach saved me from writing fragile parsing code that would have needed constant updates. It’s not perfect, but for messy, real‑world HTML, it’s the most maintainable solution I’ve found.

What’s your go‑to when traditional scraping fails? Do you reach for an LLM or something else?
