When Regex Fails: LLMs for Messy HTML Data

wpnews.pro

cd /news/large-language-models/when-regex-fails-llms-for-messy-html… · home › topics › large-language-models › article

[ARTICLE · art-24705] src=dev.to ↗ pub=2026-06-12T02:00Z topic=large-language-models verified=true sentiment=· neutral

When Regex Fails: LLMs for Messy HTML Data

A developer replaced brittle regex and CSS selectors with a local LLM to extract product data from messy legacy HTML, achieving reliable results at a fraction of the cost of cloud-based models. The approach uses Ollama's Llama 3.1 8B model to parse inconsistent HTML structures and output structured JSON, handling edge cases that broke traditional parsing rules.

read4 min views27 publishedJun 12, 2026

Last month I inherited a project that needed to extract product information from a legacy e‑commerce site. The HTML was a nightmare—no semantic classes, inconsistent attribute names, and the occasional blob of inline JavaScript. I thought I could just write a few regular expressions and be done in an hour. Six hours later I was staring at a wall of conditional logic that broke every time the page changed.

I needed a better way, and I ended up using a large language model (LLM) to handle the fuzzy extraction. Here’s what I learned—dead ends included—and a working approach you can copy‑paste today.

The site had product cards like this:

<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <span>Price: <b>$29.99</b></span>
  <p>SKU: WID-001</p>
  <div class="desc">A handy gadget<br>with extra features</div>
  <span>In Stock</span>
</div>

But other cards would swap <span>

for <div>

, omit the SKU entirely, or use inline styles. A few pages even dumped the price into a data-*

attribute inside a script tag.

Parsing this with BeautifulSoup and CSS selectors worked on 80% of the pages, but that last 20% caused silent failures. I spent days writing custom parsers that became unmaintainable.

I tried patterns like /(Price:)\s*<[^>]+>([^<]+)<\/b>/i

. It worked on one page but broke on another where the <b>

was nested differently. Regex is brittle for HTML—we all know this, but sometimes we pretend we don't.

I wrote a set of rules: “if .name

exists, use that; else try [itemprop="name"]

; else fallback to first <h3>

.” Every new page meant new rules. The rule count exploded, and I still missed edge cases.

I fed entire HTML blocks to GPT‑4 with a prompt like “extract name, price, SKU, description, stock status.” It worked beautifully—but it cost $0.03 per product. For 10,000 products that’s $300. And latency was 2–3 seconds per call. Not feasible for a one‑time migration.

I used a smaller, cheaper model (like Llama 3.1 8B via Ollama, or a service that wraps similar models) and asked it to output JSON according to a predefined schema. The trick was to show it the schema and only ask for the fields I needed, with clear instructions on how to handle missing data.

Here’s the core idea:

I wrote a Python script using requests

and json

. For the LLM, I used Ollama with llama3.1:8b

running locally, but you can swap in any API that supports chat completions.

import requests
import json
import re
from typing import Optional, Dict

LLM_URL = "http://localhost:11434/api/generate"  # Ollama endpoint
MODEL = "llama3.1:8b"

def extract_product(html: str) -> Optional[Dict]:
    schema = {
        "name": "string (required)",
        "price": "float (required, in USD)",
        "sku": "string (optional)",
        "description": "string (optional)",
        "in_stock": "boolean (optional)"
    }
    prompt = f"""You are an HTML extraction expert. Given a product card's HTML, return a JSON object with these fields:
{schema}

Return ONLY valid JSON. If a field is missing, use null.

Examples:
HTML: <div><span class="name">Widget</span><span>Price: <b>$10.00</b></span></div>
JSON: {{"name": "Widget", "price": 10.00, "sku": null, "description": null, "in_stock": null}}

HTML: {html}
JSON:"""
    response = requests.post(
        LLM_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            "temperature": 0.1
        }
    )
    text = response.json()["response"]
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            return None
    return None

html_sample = """<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <span>Price: <b>$29.99</b></span>
  <p>SKU: WID-001</p>
  <div class="desc">A handy gadget<br>with extra features</div>
  <span>In Stock</span>
</div>"""

result = extract_product(html_sample)
print(result)

If the result is None

or fails a quick sanity check (e.g., price is negative), I retry once with temperature=0.3

. That’s usually enough to fix formatting issues.

temp=0.7

and got weird field names.float

, boolean

). LLMs can guess wrong.One service I tested that abstracts this exact pattern is InterwestInfo AI. It provides a prompt‑based API with built‑in JSON validation, so you don’t have to write the retry logic yourself. But the technique is the same regardless of the endpoint.

I’d start with a small local model and measure accuracy on a sample of 100 pages. If it’s above 95%, done. If not, I’d add a few‑shot examples for the tricky cases instead of building a rule‑based fallback. Also, I’d cache the LLM responses – if two products share the same HTML structure, the model often gives identical results.

This approach saved me from writing fragile parsing code that would have needed constant updates. It’s not perfect, but for messy, real‑world HTML, it’s the most maintainable solution I’ve found.

What’s your go‑to when traditional scraping fails? Do you reach for an LLM or something else?

source & further reading

dev.to — original article `finish_reason=length` Returned Empty Content — and the Error Message Lied to Me Combined Offense + Defense (Engineering Edition) — Cross-Project Reuse Matrix and When Not to Use What actually belongs in CLAUDE.md — and what to move to skills, hooks, or docs

~/api · this article 200

$curl api.wpnews.pro/v1/news/when-regex-fails-llms-fo…

Read original on dev.to → dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-l…

mentioned entities

BeautifulSoup

CSS

LLM

metadata

slugwhen-regex-fails-llms-for-messy-html-data

topic#large-language-models

secondary4 topics

sentimentneutral

canonicaldev.to

navigation

← prevChaining LLM and web bugs to Adm…

next →RAG (Retrieval-Augmented Generat…

── more in #large-language-models 4 stories · sorted by recency

techstrong.ai · 29 Jul · #large-language-models

LLM Routers Have Become a Service Category of Their Own

promptcube3.com · 30 Jul · #large-language-models

Higgsfield vs Artlist: Which AI Workflow is Safer?

smarterarticles.co.uk · 30 Jul · #large-language-models

Machines That Never Push Back: What AI Toys Cost Childhood Empathy

industrycontents.com · 30 Jul · #large-language-models

Do Language Models Flatten Your Business? We Tested Four Real Ones to Find Out

── more on @beautifulsoup 3 stories trending now

wpnews · 29 Jul · #ai-safety

News Summary for July 29, 2026

wpnews · 29 Jul · #artificial-intelligence

Investors are selling Meta as it heads to its earnings report

wpnews · 28 Jul · #large-language-models

How to Download and Run Kimi K3 Open Weights

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required