When HTML parsing fails: using LLMs to extract messy web data
A developer turned to large language models to extract product data from e-commerce sites with unpredictable HTML, after traditional scraping tools like BeautifulSoup and Scrapy failed due to constantly changing page structures. By feeding raw HTML to OpenAI's GPT-4o with a defined JSON schema, the engineer successfully extracted fields such as product names, prices, and availability on the first attempt, bypassing the need for fragile CSS selectors or XPath expressions. The approach combines LLM-based extraction for problematic sites with traditional parsers for stable ones, and includes validation steps to catch errors.
I’ve been scraping websites for years. BeautifulSoup, Scrapy, Playwright — I’ve used them all. But last month I hit a wall. A client needed me to extract product details from a dozen e-commerce sites. Most were straightforward: find the right CSS selectors, handle pagination, done. But one particular site was a nightmare. The HTML was a mess of nested divs, inline styles, and data scattered across attributes, text nodes, and even JavaScript variables. The layout changed every week. My carefully crafted selectors broke constantly. I spent two days fixing and refactoring. Every time I thought I had it, the site updated and my pipeline broke again. I was about to tell the client it wasn’t feasible. Then a colleague said: “Why not just give the raw HTML to an LLM and ask it to extract what you need?” At first I laughed. LLMs hallucinate, they’re slow, expensive — right? But I was desperate. I decided to prototype it. Before going down the AI route, I exhausted traditional approaches: data-price attributes, sometimes in nested