{"slug": "i-tried-ai-powered-web-scraping-so-my-selectors-could-finally-rest", "title": "I Tried AI-Powered Web Scraping So My Selectors Could Finally Rest", "summary": "A developer built an AI-powered web scraper that uses large language models to extract product data from e-commerce sites, replacing fragile CSS selectors and regex patterns. The approach converts raw HTML into a simplified JSON tree, reducing token usage by 70%, then feeds it to GPT-4 with few-shot examples to reliably extract fields like price and availability. The system proved more resilient to site redesigns than traditional selector-based methods.", "body_md": "A few months ago, I was building a price comparison tool that needed to pull product info from a dozen different e-commerce sites. Each one had its own lovingly crafted HTML structure: nested `<div>`s with classes like `price-123abc` that changed on every deployment. My initial approach was traditional: XPath, CSS selectors, and a sprinkle of regex. It worked until it didn’t. Then I discovered that I could throw an LLM at the raw HTML and let it figure out the extraction. Here’s what I learned.\n\nI had a scraper for Site A that used `document.querySelector('.product-price')`. It was fragile but worked for months. Then Site A redesigned. The selector broke. I updated it. A week later, another redesign. I started using regex to find patterns like `\\$\\d+\\.\\d{2}`. Then someone added a badge that said “$5 off” and my regex grabbed the wrong number.\n\nI needed something that could understand the *meaning* of a price, not just its structure. That’s when I wondered: could GPT-4 (or any language model) parse the raw HTML and give me the structured data I needed?\n\nFirst, I tried passing the full HTML of a product page directly to an LLM and asking, “extract the product name, price, and availability.” Two problems:\n\nI also tried simplifying the HTML with `html2text` to reduce tokens. 
That lost too much structure – the model couldn’t distinguish between a price in the main content and a price in a footer ad.\n\nThen I tried extracting only the parts of the page that looked price-like using regex first, then feeding that to the LLM. That was a maintenance nightmare – I was back to writing brittle patterns.\n\nThe breakthrough came when I stopped trying to reduce *what* the model sees and instead improved *how* I asked. Here’s the approach that stuck:\n\nInstead of raw HTML, I converted the page to a clean JSON tree of common elements (headings, paragraphs, lists, tables) and their text content. This reduced token count by ~70% while preserving structure.\n\n```python\nfrom bs4 import BeautifulSoup\n\ndef simplify_html(html):\n    soup = BeautifulSoup(html, 'html.parser')\n    # Remove script, style, nav, footer, aside\n    for tag in soup(['script', 'style', 'nav', 'footer', 'aside']):\n        tag.decompose()\n    # Extract only text with basic structure\n    simplified = []\n    # select() takes CSS selectors, so 'div.price' matches <div class=\"price\">;\n    # find_all() would treat 'div.price' as a literal tag name and match nothing\n    for element in soup.select('h1, h2, h3, p, li, table, div.price'):\n        tag = element.name\n        text = element.get_text(strip=True)\n        simplified.append(f\"<{tag}>{text}</{tag}>\")\n    return '\\n'.join(simplified)\n```\n\nI created 3–5 examples of product pages with the exact JSON output I wanted. I hardcoded them into the system prompt. 
This was key – it told the model exactly what “price” meant in my context (first product, not recommended items).\n\n```python\nsystem_prompt = \"\"\"You are a precise data extractor for e-commerce product pages.\nGiven simplified HTML, output a JSON object with fields:\n- name: product name\n- price: numeric value without currency symbol\n- availability: 'in_stock' or 'out_of_stock'\n\nExamples:\n---\nInput:\n<h1>Blue Widget</h1>\n<div class=\"price\">$19.99</div>\n<span>In Stock</span>\n\nOutput:\n{\"name\": \"Blue Widget\", \"price\": 19.99, \"availability\": \"in_stock\"}\n---\n(More examples...)\n\"\"\"\n```\n\nI used OpenAI’s API (but you could swap in any compatible endpoint – even a local model). The key was setting temperature to 0 to make the extraction as repeatable as possible.\n\n```python\nfrom openai import OpenAI\n\n# openai>=1.0 client interface; reads OPENAI_API_KEY from the environment\nclient = OpenAI()\n\ndef extract_product_info(simplified_html):\n    response = client.chat.completions.create(\n        model=\"gpt-4\",\n        messages=[\n            {\"role\": \"system\", \"content\": system_prompt},\n            {\"role\": \"user\", \"content\": simplified_html}\n        ],\n        temperature=0  # minimize randomness in the output\n    )\n    # The reply is a JSON string; parse it with json.loads() before use\n    return response.choices[0].message.content\n```\n\nYes, it’s that simple – and surprisingly reliable for most pages I threw at it.\n\nThis approach isn’t a silver bullet. Here’s what I discovered:\n\nI also experimented with specialized APIs like the one at `https://ai.interwestinfo.com/` that abstract away some of these trade-offs (they handle chunking and validation behind the scenes). But honestly, the core technique of few-shot prompting with a simplified DOM structure is what made the difference.\n\nThis approach is overkill if:\n\nAnd if you’re scraping sites that explicitly forbid bots, remember to respect `robots.txt` and consider asking for permission. This technique makes it easy to *not* break the law, but it doesn’t give you a free pass.\n\nI’d start with the LLM-based approach from the beginning. 
The hours I spent debugging regex and CSS selectors were a sunk cost. I’d also add more validation: extract multiple candidates and take a vote across calls, or use a small local model (like a fine-tuned BERT) for structured extraction if the domain is narrow enough.\n\nNow that language models can read HTML like a human, the game has changed. But I’m still experimenting – do you pre-process differently? Use a different model? Or do you swear by old-school selectors and a prayer? I’d love to hear what your scraping stack looks like.", "url": "https://wpnews.pro/news/i-tried-ai-powered-web-scraping-so-my-selectors-could-finally-rest", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/i-tried-ai-powered-web-scraping-so-my-selectors-could-finally-rest-2llf", "published_at": "2026-06-05 02:00:45+00:00", "updated_at": "2026-06-05 02:41:29.215812+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "generative-ai", "ai-tools", "natural-language-processing"], "entities": ["GPT-4", "Site A"], "alternates": {"html": "https://wpnews.pro/news/i-tried-ai-powered-web-scraping-so-my-selectors-could-finally-rest", "markdown": "https://wpnews.pro/news/i-tried-ai-powered-web-scraping-so-my-selectors-could-finally-rest.md", "text": "https://wpnews.pro/news/i-tried-ai-powered-web-scraping-so-my-selectors-could-finally-rest.txt", "jsonld": "https://wpnews.pro/news/i-tried-ai-powered-web-scraping-so-my-selectors-could-finally-rest.jsonld"}}