When Regex Fails: LLMs for Messy HTML Data

A developer replaced brittle regex and CSS selectors with a local LLM to extract product data from messy legacy HTML, achieving reliable results at a fraction of the cost of cloud-based models. The approach uses Llama 3.1 8B, run locally through Ollama, to parse inconsistent HTML structures and output structured JSON, handling edge cases that broke traditional parsing rules.

Last month I inherited a project that needed to extract product information from a legacy e-commerce site. The HTML was a nightmare: no semantic classes, inconsistent attribute names, and the occasional blob of inline JavaScript. I thought I could just write a few regular expressions and be done in an hour. Six hours later I was staring at a wall of conditional logic that broke every time the page changed. I needed a better way, and I ended up using a large language model (LLM) to handle the fuzzy extraction. Here's what I learned, dead ends included, and a working approach you can copy-paste today.

The site had product cards like this: