[ARTICLE · art-18245] src=dev.to pub= topic=artificial-intelligence verified=true sentiment=↑ positive

When Traditional Web Scraping Fails: A Practical AI Approach

A developer built an AI-based web scraper using GPT-4 to extract product data from a dynamic e-commerce site that changed its HTML structure every few days, breaking traditional CSS selectors. The approach feeds raw HTML to a language model with natural language instructions like "find the price," eliminating the need for brittle selector maintenance. The developer ultimately deployed a hybrid system using traditional selectors for stable elements and AI fallback when selectors fail.

3 min read · Published May 30, 2026

I've been building web scrapers for years. BeautifulSoup, Scrapy, Selenium: I've used them all. But last month I hit a wall. A client needed me to extract product data from a site that changed its HTML structure every few days. One week the price was in a <span class="price">, the next it was inside a <div> with a random ID. My scraper kept breaking, and I was spending more time fixing selectors than actually getting data.

The site was a dynamic e-commerce platform. It used JavaScript to render content, and the developers seemed to enjoy shuffling class names. I tried the usual suspects:

I needed something that could understand the meaning of the data, not just its position in the DOM. That's when I thought: why not use an AI model to read the page like a human would?

Instead of writing CSS selectors, I'd feed the raw HTML (or even a screenshot) to a language model and ask it to extract structured data. The model doesn't care about class names — it understands context. "Find the price" becomes a natural language instruction.

I decided to test this with OpenAI's GPT-4, but the same approach works with any capable LLM (Claude, local models via Ollama, or specialized APIs like the one at https://ai.interwestinfo.com/).

Here's a simple Python script that extracts product info from a webpage using GPT-4. You'll need an OpenAI API key.

import requests
from bs4 import BeautifulSoup
import openai
import json

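url = "https://example.com/product-page"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')

# Drop tags that rarely hold product data, to keep the prompt small
for tag in soup(['script', 'style', 'nav', 'footer']):
    tag.decompose()
clean_html = soup.prettify()[:5000]  # limit to first 5000 chars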

prompt = f"""
Extract the following fields from this HTML and return them as JSON:
- product_name
- price (as a number, without currency symbol)
- availability (in stock / out of stock)
- description (first 100 characters)

HTML:
{clean_html}

Return ONLY valid JSON, no extra text.
"""

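# Requires openai>=1.0. The client reads the key from the OPENAI_API_KEY
# environment variable, so it never has to be hardcoded in the script.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)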

try:
    data = json.loads(response.choices[0].message.content)
    print(data)
except json.JSONDecodeError:
    print("Failed to parse response:", response.choices[0].message.content)

This is a minimal example. In production, you'd want to handle pagination, retries, and rate limiting.
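
As a minimal sketch of the retry part, here is one way to wrap a flaky call in exponential backoff. The helper name call_with_retries and its parameters are my own illustration, not part of the OpenAI library, and it retries on any exception for simplicity; in practice you would probably narrow that to rate-limit and timeout errors.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception, with exponential backoff and jitter.

    fn is a zero-argument callable, e.g. a lambda wrapping the API request.
    sleep is injectable so the helper can be tested without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as 1s, 2s, 4s, ... plus up to 10% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the script above, the model call could be wrapped as call_with_retries(lambda: client.chat.completions.create(...)), assuming client is the OpenAI client created earlier.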

It works — but it's not magic.

| Approach | Pros | Cons |
| --- | --- | --- |
| Traditional scraping (CSS/XPath) | Fast, cheap, predictable | Brittle, requires constant maintenance |
| AI-based extraction | Robust to layout changes, understands context | Slow, expensive, can hallucinate |
| Hybrid | Best of both worlds | More complex to implement |
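
Since hallucination is the riskiest item in that list, it helps to check the model's reply before trusting it. The sketch below is one possible guard, not a complete defense: parse_and_validate and REQUIRED_FIELDS are hypothetical names, the field list mirrors the prompt above, and the "product name must appear in the page" check is a simple heuristic that can reject valid results when the HTML uses entities or unusual whitespace.

```python
import json

REQUIRED_FIELDS = ("product_name", "price", "availability", "description")


def parse_and_validate(raw_reply, source_html):
    """Return the extracted dict, or None if the reply looks unreliable."""
    text = raw_reply.strip()
    # Models sometimes wrap JSON in Markdown code fences; strip them first.
    if text.startswith("`"):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    price = data["price"]
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
        return None
    # Heuristic hallucination guard: the name must appear verbatim in the page.
    if str(data["product_name"]) not in source_html:
        return None
    return data
```

A None result is a signal to retry, log the page for review, or skip the record rather than store a guessed value.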

For my client, I ended up using a hybrid: traditional selectors for stable parts (like the product title), and AI fallback when selectors fail. That reduced costs while keeping reliability high.
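
The control flow of that hybrid can be sketched in a few lines. This is an illustration under assumptions rather than the code I shipped: extract_hybrid is a hypothetical helper, and by_selectors and by_llm are stubs standing in for the BeautifulSoup lookups and the GPT-4 call.

```python
def extract_hybrid(html, selector_extract, llm_extract,
                   required=("product_name", "price")):
    """Try the cheap selector path first; call the LLM only if it comes up short.

    Both extractors take the HTML string and return a dict of fields.
    Returns (data, source), where source is "selectors" or "llm".
    """
    try:
        data = selector_extract(html)
    except Exception:
        data = None  # a broken selector should trigger the fallback, not a crash
    if data and all(data.get(field) not in (None, "") for field in required):
        return data, "selectors"
    return llm_extract(html), "llm"


def by_selectors(html):
    # Stub for real selector code, e.g. soup.select_one(".price")
    price = 19.99 if 'class="price"' in html else None
    return {"product_name": "Blue Widget", "price": price}


def by_llm(html):
    # Stub for the LLM extraction call
    return {"product_name": "Blue Widget", "price": 19.99}


stable = '<span class="price">19.99</span>'
shuffled = '<div id="x9f2">19.99</div>'
print(extract_hybrid(stable, by_selectors, by_llm)[1])
print(extract_hybrid(shuffled, by_selectors, by_llm)[1])
```

Returning the source alongside the data makes it easy to track how often the fallback fires, which is what drives the API bill.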

AI won't replace traditional scraping entirely, but it's a powerful tool for those annoying edge cases where selectors break. The technique I showed here is just one example — you could also use vision models on screenshots, or structured extraction APIs.

Have you tried using LLMs for data extraction? What's your setup look like?
