I Tried AI-Powered Web Scraping So My Selectors Could Finally Rest


Originally published on dev.to, 2026-06-05.

A developer built an AI-powered web scraper that uses large language models to extract product data from e-commerce sites, replacing fragile CSS selectors and regex patterns. The approach converts raw HTML into a simplified JSON tree, reducing token usage by 70%, then feeds it to GPT-4 with few-shot examples to reliably extract fields like price and availability. The system proved more resilient to site redesigns than traditional selector-based methods.

A few months ago, I was building a price comparison tool that needed to pull product info from a dozen different e-commerce sites. Each one had its own lovingly crafted HTML structure, with nested <div>s whose classes, like price-123abc, changed on every deployment. My initial approach was traditional: XPath, CSS selectors, and a sprinkle of regex. It worked until it didn’t. Then I discovered that I could throw an LLM at the raw HTML and let it figure out the extraction. Here’s what I learned.

I had a scraper for Site A that used document.querySelector('.product-price'). It was fragile but worked for months. Then Site A redesigned. The selector broke. I updated it. A week later, another redesign. I started using regex to find patterns like \$\d+\.\d{2}. Then someone added a badge that said “$5 off” and my regex grabbed the wrong number.
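To make that failure concrete, here is a minimal sketch using the pattern quoted above. The page text is made up for illustration, and the badge is written as "$5.00 off" (an assumption) so that the two-decimal pattern actually matches it.

```python
import re

# The price pattern from the paragraph above
PRICE_PATTERN = re.compile(r"\$\d+\.\d{2}")

# Hypothetical flattened page text: a discount badge appears before the real price
page_text = "Save $5.00 off today! Blue Widget $19.99 In Stock"

# search() returns the first match in the text, which is the badge, not the price
first_match = PRICE_PATTERN.search(page_text).group()

# findall() shows both candidates; the regex has no way to tell which one matters
all_matches = PRICE_PATTERN.findall(page_text)

print(first_match)
print(all_matches)
```

The pattern is doing exactly what it was told to do. What it lacks is any notion of which dollar amount is the product's price.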

I needed something that could understand the meaning of a price, not just its structure. That’s when I wondered: could GPT-4 (or any language model) parse the raw HTML and give me the structured data I needed?

First, I tried passing the full HTML of a product page directly to an LLM and asking, “extract the product name, price, and availability.” Two problems:

I also tried simplifying the HTML with html2text to reduce tokens. That lost too much structure – the model couldn’t distinguish between a price in the main content and a price in a footer ad.

Then I tried extracting only the parts of the page that looked price-like using regex first, then feeding that to the LLM. That was a maintenance nightmare – I was back to writing brittle patterns.

The breakthrough came when I stopped trying to reduce what the model sees and instead improved how I asked. Here’s the approach that stuck:

Instead of raw HTML, I converted the page to a clean JSON tree of common elements (headings, paragraphs, lists, tables) and their text content. This reduced token count by ~70% while preserving structure.

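from bs4 import BeautifulSoup

def simplify_html(html):
    """Reduce a page to one line of text per content element."""
    soup = BeautifulSoup(html, 'html.parser')
    # Drop elements that rarely carry product data but still cost tokens
    for tag in soup(['script', 'style', 'nav', 'footer', 'aside']):
        tag.decompose()
    simplified = []
    # find_all() matches tag names only, so the class-based 'div.price'
    # needs a CSS selector via select()
    for element in soup.select('h1, h2, h3, p, li, table, div.price'):
        tag = element.name
        text = element.get_text(strip=True)
        simplified.append(f"<{tag}>{text}</{tag}>")
    return '\n'.join(simplified)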

I created 3–5 examples of product pages with the exact JSON output I wanted. I hardcoded them into the system prompt. This was key – it told the model exactly what “price” meant in my context (first product, not recommended items).

system_prompt = """You are a precise data extractor for e-commerce product pages.
Given simplified HTML, output a JSON object with fields:
- name: product name
- price: numeric value without currency symbol
- availability: 'in_stock' or 'out_of_stock'

Examples:
---
Input:
<h1>Blue Widget</h1>
<div class="price">$19.99</div>
<span>In Stock</span>

Output:
{"name": "Blue Widget", "price": 19.99, "availability": "in_stock"}
---
(More examples...)
"""
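Hardcoding works, but once the example set grows it can be easier to keep the examples as data and assemble the prompt from them. The following is a minimal sketch of that idea, not the code I originally ran: build_system_prompt and FEW_SHOT_EXAMPLES are hypothetical names, and the second example (Red Gadget) is invented for illustration.

```python
import json

INSTRUCTIONS = """You are a precise data extractor for e-commerce product pages.
Given simplified HTML, output a JSON object with fields:
- name: product name
- price: numeric value without currency symbol
- availability: 'in_stock' or 'out_of_stock'"""

# Each example pairs simplified input with the exact output expected for it.
# The first comes from the prompt above; the second is a made-up illustration.
FEW_SHOT_EXAMPLES = [
    {
        "input": '<h1>Blue Widget</h1>\n<div class="price">$19.99</div>\n<span>In Stock</span>',
        "output": {"name": "Blue Widget", "price": 19.99, "availability": "in_stock"},
    },
    {
        "input": '<h1>Red Gadget</h1>\n<div class="price">$5.00</div>\n<span>Sold Out</span>',
        "output": {"name": "Red Gadget", "price": 5.0, "availability": "out_of_stock"},
    },
]

def build_system_prompt(instructions, examples):
    """Join the instructions and examples using the same '---' layout as above."""
    parts = [instructions, "", "Examples:"]
    for example in examples:
        parts.extend(["---", "Input:", example["input"], "", "Output:"])
        # json.dumps keeps the example output valid JSON, so the model sees
        # exactly the format it is expected to produce
        parts.append(json.dumps(example["output"]))
    parts.append("---")
    return "\n".join(parts)

prompt = build_system_prompt(INSTRUCTIONS, FEW_SHOT_EXAMPLES)
print(prompt)
```

Keeping the examples in a list also makes it simple to add one whenever a new page layout trips the model up.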

I used OpenAI’s API (but you could swap in any compatible endpoint – even a local model). The key was setting temperature to 0, which makes the output far more consistent from call to call, though not strictly deterministic.

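from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def extract_product_info(simplified_html):
    # openai>=1.0 replaced openai.ChatCompletion.create with this client call
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": simplified_html}
        ],
        temperature=0
    )
    # The reply is a JSON string; parse it with json.loads() before use
    return response.choices[0].message.content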

Yes, it’s that simple – and surprisingly reliable for most pages I threw at it.
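Because the function hands back the model's reply as a raw string, a small validation step before the data reaches a price database seems prudent. Here is a minimal sketch, assuming the three fields described in the system prompt; parse_product_json is a hypothetical helper, not part of the original scraper.

```python
import json

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock"}

def parse_product_json(raw_reply):
    """Parse and validate the model's reply; raise ValueError if it is unusable."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")

    name = data.get("name")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("missing or empty name")

    price = data.get("price")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
        raise ValueError("price must be a non-negative number")

    availability = data.get("availability")
    if not isinstance(availability, str) or availability not in ALLOWED_AVAILABILITY:
        raise ValueError("availability must be 'in_stock' or 'out_of_stock'")

    return {"name": name.strip(), "price": float(price), "availability": availability}

product = parse_product_json(
    '{"name": "Blue Widget", "price": 19.99, "availability": "in_stock"}'
)
print(product)
```

A ValueError here is a useful signal in its own right: it marks a page worth retrying or adding to the few-shot examples.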

This approach isn’t a silver bullet. Here’s what I discovered:

I also experimented with specialized APIs like the one at https://ai.interwestinfo.com/ that abstract away some of these trade-offs (they handle chunking and validation behind the scenes). But honestly, the core technique of few-shot prompting with simplified DOM structure is what made the difference.

This approach is overkill if:

And if you’re scraping sites that explicitly forbid bots, remember to respect robots.txt and consider asking for permission. This technique makes scraping easier, but it doesn’t give you a free pass.
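Checking robots.txt can be automated with Python's standard library. Below is a minimal sketch using urllib.robotparser; the sample rules, the price-bot user agent, and the example.com URLs are all made up for illustration, and is_allowed is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules: product pages are open, the checkout flow is off limits
SAMPLE_ROBOTS = """User-agent: *
Disallow: /checkout/
"""

allowed = is_allowed(SAMPLE_ROBOTS, "price-bot", "https://example.com/products/blue-widget")
blocked = is_allowed(SAMPLE_ROBOTS, "price-bot", "https://example.com/checkout/cart")
print(allowed, blocked)
```

Against a live site, the same class can load the file itself with set_url() followed by read(). Note that robots.txt only covers crawling rules; a site's terms of service are a separate matter.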

I’d start with the LLM-based approach from the beginning. The hours I spent debugging regex and CSS selectors were a sunk cost. I’d also add more validation: extract multiple candidates and take a vote across calls, or use a small local model (like a fine-tuned BERT) for structured extraction if the domain is narrow enough.
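The voting idea could look something like this. It is a minimal sketch of one way to do it, not something I have run in production: majority_vote and the min_agreement threshold are hypothetical, and the candidate dicts stand in for the parsed results of several extraction calls on the same page.

```python
import json
from collections import Counter

def majority_vote(candidates, min_agreement=2):
    """Return the most common candidate dict, or None if too few calls agree."""
    if not candidates:
        return None
    # Serialize with sorted keys so dicts with equal content compare equal
    serialized = [json.dumps(candidate, sort_keys=True) for candidate in candidates]
    winner, count = Counter(serialized).most_common(1)[0]
    if count < min_agreement:
        return None
    return json.loads(winner)

# Hypothetical results from three calls on one page; the third call
# picked up a discount badge instead of the product price
candidates = [
    {"name": "Blue Widget", "price": 19.99, "availability": "in_stock"},
    {"name": "Blue Widget", "price": 19.99, "availability": "in_stock"},
    {"name": "Blue Widget", "price": 5.0, "availability": "in_stock"},
]
result = majority_vote(candidates)
print(result)
```

Voting only helps when the calls can disagree, so it probably pairs best with a slightly higher temperature or with differently worded prompts. It also multiplies the API cost by the number of calls.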

Now that language models can read HTML like a human, the game has changed. But I’m still experimenting – do you pre-process differently? Use a different model? Or do you swear by old-school selectors and a prayer? I’d love to hear what your scraping stack looks like.
