
# I Tried AI-Powered Web Scraping So My Selectors Could Finally Rest

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/i-tried-ai-powered-web-scraping-so-my-selectors-could-finally-rest-2llf>
> Published: 2026-06-05 02:00:45+00:00

A few months ago, I was building a price comparison tool that needed to pull product info from a dozen different e-commerce sites. Each one had its own lovingly crafted HTML structure: nested `<div>`s with classes like `price-123abc` that changed on every deployment. My initial approach was traditional: XPath, CSS selectors, and a sprinkle of regex. It worked until it didn’t. Then I discovered that I could throw an LLM at the raw HTML and let it figure out the extraction. Here’s what I learned.

I had a scraper for Site A that used `document.querySelector('.product-price')`. It was fragile but worked for months. Then Site A redesigned. The selector broke. I updated it. A week later, another redesign. I started using regex to find patterns like `\$\d+\.\d{2}`. Then someone added a badge that said “$5 off” and my regex grabbed the wrong number.
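To make the failure mode concrete, here is a minimal sketch of how a pattern-only approach picks up a promo amount instead of the product price. The page text is invented for illustration, and the badge is written with cents because the pattern above only matches amounts with two decimal places.

```python
import re

# The same pattern as in the text: a dollar sign, digits, and two decimals.
PRICE_PATTERN = re.compile(r"\$\d+\.\d{2}")

# Hypothetical page text: a promo badge appears before the real price.
page_text = "Save $5.00 today! Blue Widget now $19.99"

# search() returns the first match in document order, which is the badge.
first_match = PRICE_PATTERN.search(page_text).group()
all_matches = PRICE_PATTERN.findall(page_text)

print(first_match)  # $5.00
print(all_matches)  # ['$5.00', '$19.99']
```

The pattern has no way to know which dollar amount is the price; both matches are equally valid as far as the regex is concerned.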

I needed something that could understand the *meaning* of a price, not just its structure. That’s when I wondered: could GPT-4 (or any language model) parse the raw HTML and give me the structured data I needed?

First, I tried passing the full HTML of a product page directly to an LLM and asking, “extract the product name, price, and availability.” Two problems:

I also tried simplifying the HTML with `html2text` to reduce tokens. That lost too much structure – the model couldn’t distinguish between a price in the main content and a price in a footer ad.

Then I tried extracting only the parts of the page that looked price-like using regex first, then feeding that to the LLM. That was a maintenance nightmare – I was back to writing brittle patterns.

The breakthrough came when I stopped trying to reduce *what* the model sees and instead improved *how* I asked. Here’s the approach that stuck:

Instead of raw HTML, I converted the page to a clean JSON tree of common elements (headings, paragraphs, lists, tables) and their text content. This reduced token count by ~70% while preserving structure.

``` python
from bs4 import BeautifulSoup

def simplify_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Remove script, style, nav, footer, and aside elements
    for tag in soup(['script', 'style', 'nav', 'footer', 'aside']):
        tag.decompose()
    # Extract only text with basic structure
    simplified = []
    # select() takes CSS selectors; find_all() would treat 'div.price' as a literal tag name
    for element in soup.select('h1, h2, h3, p, li, table, div.price'):
        tag = element.name
        text = element.get_text(strip=True)
        simplified.append(f"<{tag}>{text}</{tag}>")
    return '\n'.join(simplified)
```
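The function above emits tag-wrapped lines rather than JSON. If a JSON outline of the page is preferred, a rough standard-library sketch might look like the following. The names `OutlineParser` and `html_to_outline` and the `KEEP`/`SKIP` tag sets are assumptions made for this example, not part of the original scraper, and nested elements of the same tag (such as nested lists) are not handled.

```python
import json
from html.parser import HTMLParser

KEEP = {"h1", "h2", "h3", "p", "li"}                   # tags whose text we keep
SKIP = {"script", "style", "nav", "footer", "aside"}   # subtrees we ignore


class OutlineParser(HTMLParser):
    """Collects {"tag", "text"} records for kept tags outside skipped subtrees."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._skip_depth = 0   # > 0 while inside a skipped subtree
        self._current = None   # kept tag currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag in KEEP and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self._current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append({"tag": tag, "text": text})
            self._current = None

    def handle_data(self, data):
        if self._skip_depth == 0 and self._current is not None:
            self._buffer.append(data)


def html_to_outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.items


sample = (
    "<body><nav><p>Home</p></nav><h1>Blue Widget</h1>"
    "<p>Price: <b>$19.99</b></p><footer><p>Ad $1.00</p></footer></body>"
)
print(json.dumps(html_to_outline(sample)))
```

Because the footer is skipped entirely, the ad price never reaches the model, which addresses the main-content versus footer confusion described earlier.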

I created 3–5 examples of product pages with the exact JSON output I wanted. I hardcoded them into the system prompt. This was key – it told the model exactly what “price” meant in my context (first product, not recommended items).

``` python
system_prompt = """You are a precise data extractor for e-commerce product pages.
Given simplified HTML, output a JSON object with fields:
- name: product name
- price: numeric value without currency symbol
- availability: 'in_stock' or 'out_of_stock'

Examples:
---
Input:
<h1>Blue Widget</h1>
<div class="price">$19.99</div>
<span>In Stock</span>

Output:
{"name": "Blue Widget", "price": 19.99, "availability": "in_stock"}
---
(More examples...)
"""
```
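Hardcoding works for a handful of examples, but the prompt can also be assembled from a list so that examples are easy to add or swap. This is only a sketch: `build_system_prompt` and `INSTRUCTIONS` are hypothetical names, and the layout simply mirrors the prompt shown above.

```python
import json

INSTRUCTIONS = (
    "You are a precise data extractor for e-commerce product pages.\n"
    "Given simplified HTML, output a JSON object with fields:\n"
    "- name: product name\n"
    "- price: numeric value without currency symbol\n"
    "- availability: 'in_stock' or 'out_of_stock'"
)


def build_system_prompt(examples):
    """examples: list of (simplified_html, expected_dict) pairs."""
    parts = [INSTRUCTIONS, "", "Examples:"]
    for simplified_html, expected in examples:
        parts += ["---", "Input:", simplified_html, "", "Output:", json.dumps(expected)]
    parts.append("---")
    return "\n".join(parts)


examples = [
    (
        "<h1>Blue Widget</h1>\n<div>$19.99</div>\n<p>In Stock</p>",
        {"name": "Blue Widget", "price": 19.99, "availability": "in_stock"},
    ),
]
prompt = build_system_prompt(examples)
print(prompt)
```

Generating the expected output with `json.dumps` keeps every example valid JSON, so a typo in a hand-written example cannot teach the model a malformed format.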

I used OpenAI’s API (but you could swap in any compatible endpoint – even a local model). The key was setting temperature to 0, which makes the extraction as repeatable as possible, though it does not strictly guarantee identical output on every call.

``` python
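from openai import OpenAI

# Requires openai>=1.0; the client reads OPENAI_API_KEY from the environment.
client = OpenAI()

def extract_product_info(simplified_html):
    # system_prompt is the few-shot prompt defined earlier
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": simplified_html}
        ],
        temperature=0  # minimize sampling randomness
    )
    # The reply is a JSON string; parse it with json.loads() before use
    return response.choices[0].message.content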
```
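Since the model returns plain text, it is worth checking the reply before trusting it. Below is a minimal validation sketch, assuming the three-field schema from the system prompt; `parse_product_json` is a hypothetical helper name, and the specific checks are my own choices rather than anything the API provides.

```python
import json

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock"}
REQUIRED_FIELDS = {"name", "price", "availability"}


def parse_product_json(raw):
    """Parse the model's reply and return a validated dict, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    price = data["price"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise ValueError("price must be a number")
    if price < 0:
        raise ValueError("price must not be negative")
    if data["availability"] not in ALLOWED_AVAILABILITY:
        raise ValueError(f"unexpected availability: {data['availability']!r}")
    return {
        "name": str(data["name"]).strip(),
        "price": float(price),
        "availability": data["availability"],
    }


reply = '{"name": "Blue Widget", "price": 19.99, "availability": "in_stock"}'
print(parse_product_json(reply))
```

A rejected reply can then be retried or logged for review instead of silently ending up in the price database.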

Yes, it’s that simple – and surprisingly reliable for most pages I threw at it.

This approach isn’t a silver bullet. Here’s what I discovered:

I also experimented with specialized APIs like the one at `https://ai.interwestinfo.com/` that abstract away some of these trade-offs (they handle chunking and validation behind the scenes). But honestly, the core technique of few-shot prompting with simplified DOM structure is what made the difference.

This approach is overkill if:

And if you’re scraping sites that explicitly forbid bots, remember to respect `robots.txt` and consider asking for permission. This technique makes it easy to *not* break the law, but it doesn’t give you a free pass.
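Checking `robots.txt` can be automated with Python’s standard library. The sketch below parses rules from a string so that it runs offline; the rules, the `price-bot` user agent, and the `example.com` URLs are made up for illustration. In practice you would fetch the site’s real file, for example with `RobotFileParser.set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everything is allowed except the checkout pages.
rules = """User-agent: *
Disallow: /checkout/
"""

print(is_allowed(rules, "price-bot", "https://example.com/products/42"))   # True
print(is_allowed(rules, "price-bot", "https://example.com/checkout/cart")) # False
```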

I’d start with the LLM-based approach from the beginning. The hours I spent debugging regex and CSS selectors were a sunk cost. I’d also add more validation: extract multiple candidates and take a vote across calls, or use a small local model (like a fine-tuned BERT) for structured extraction if the domain is narrow enough.
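The voting idea could be sketched roughly as follows. `majority_vote` and `min_agreement` are hypothetical names, and the candidates are assumed to be already-parsed dicts from separate extraction calls. Voting is only informative if those calls can actually differ, for instance with a non-zero temperature or varied prompts.

```python
import json
from collections import Counter


def majority_vote(candidates, min_agreement=2):
    """Return the most common extraction, or None if agreement is too weak."""
    if not candidates:
        return None
    # Serialize with sorted keys so that equal dicts produce identical strings.
    counts = Counter(json.dumps(c, sort_keys=True) for c in candidates)
    winner, votes = counts.most_common(1)[0]
    if votes < min_agreement:
        return None
    return json.loads(winner)


runs = [
    {"name": "Blue Widget", "price": 19.99, "availability": "in_stock"},
    {"name": "Blue Widget", "price": 5.0, "availability": "in_stock"},
    {"availability": "in_stock", "price": 19.99, "name": "Blue Widget"},
]
print(majority_vote(runs))
```

Returning `None` when no result reaches the agreement threshold gives the caller a clear signal to fall back to manual review.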

Now that language models can read HTML like a human, the game has changed. But I’m still experimenting – do you pre-process differently? Use a different model? Or do you swear by old-school selectors and a prayer? I’d love to hear what your scraping stack looks like.