# When Traditional Web Scraping Fails: A Practical AI Approach

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/when-traditional-web-scraping-fails-a-practical-ai-approach-3o6p>
> Published: 2026-05-30 01:01:55+00:00

I've been building web scrapers for years. BeautifulSoup, Scrapy, Selenium: I've used them all. But last month I hit a wall. A client needed me to extract product data from a site that changed its HTML structure every few days. One week the price was in a `<span class="price">`, the next it was inside a `<div>` with a random ID. My scraper kept breaking, and I was spending more time fixing selectors than actually getting data.

The site was a dynamic e-commerce platform. It used JavaScript to render content, and the developers seemed to enjoy shuffling class names. I tried the usual suspects, and none of them held up for long.

I needed something that could understand the *meaning* of the data, not just its position in the DOM. That's when I thought: why not use an AI model to read the page like a human would?

Instead of writing CSS selectors, I'd feed the raw HTML (or even a screenshot) to a language model and ask it to extract structured data. The model doesn't care about class names — it understands context. "Find the price" becomes a natural language instruction.

I decided to test this with OpenAI's GPT-4, but the same approach works with any capable LLM (Claude, local models via Ollama, or specialized APIs like the one at `https://ai.interwestinfo.com/`).

Here's a simple Python script that extracts product info from a webpage using GPT-4. You'll need an OpenAI API key.

``` python
import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI  # requires openai>=1.0; openai.ChatCompletion was removed in 1.0

# 1. Fetch the page (use a headless browser if JS-heavy)
url = "https://example.com/product-page"
page = requests.get(url, timeout=30)
page.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(page.text, "html.parser")

# 2. Clean the HTML to reduce tokens
# Remove scripts, styles, and navigation/footer chrome
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()
# Crude cap: only the first 5000 characters are sent, so anything further down is lost
clean_html = soup.prettify()[:5000]

# 3. Prompt the model
prompt = f"""
Extract the following fields from this HTML and return them as JSON:
- product_name
- price (as a number, without currency symbol)
- availability (in stock / out of stock)
- description (first 100 characters)

HTML:
{clean_html}

Return ONLY valid JSON, no extra text.
"""

client = OpenAI()  # reads the key from the OPENAI_API_KEY environment variable
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
reply = completion.choices[0].message.content

# 4. Parse the JSON response
try:
    data = json.loads(reply)
    print(data)
except json.JSONDecodeError:
    print("Failed to parse response:", reply)
```
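
Even with "Return ONLY valid JSON" in the prompt, models sometimes wrap the answer in a Markdown code fence or add a sentence around it, which makes a plain `json.loads` fail. Here's a rough sketch of a more forgiving parser for step 4. The name `parse_json_reply` is my own, and it assumes the reply contains a single JSON object.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the literal out of the source

def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating a code fence or surrounding prose."""
    text = text.strip()
    # If the reply contains a fenced block, keep only what is inside it
    fenced = re.search(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Last resort: try the span from the first "{" to the last "}"
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])
```

This only rescues replies that are almost right. If the model returns something that isn't JSON at all, it still raises `json.JSONDecodeError`, and that's the point where a retry makes sense.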

This is a minimal example. In production, you'd want to handle pagination, retries, and rate limiting.
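
For the retry part, a small wrapper with exponential backoff goes a long way. This is a minimal sketch, not what I'd call production code: `with_retries` and its parameters are names I made up, and it retries on any exception, which is broader than you'd want in practice (you'd normally retry only on timeouts and rate-limit errors).

```python
import random
import time

def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds or attempts run out, backing off exponentially."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... plus up to 10% jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

You'd wrap the model call from the script in a function or lambda and pass it in, for example `with_retries(lambda: client.chat.completions.create(...))`. The `sleep` parameter exists only so the waiting can be swapped out in tests.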

**It works — but it's not magic.**

| Approach | Pros | Cons |
|---|---|---|
| Traditional scraping (CSS/XPath) | Fast, cheap, predictable | Brittle, requires constant maintenance |
| AI-based extraction | Robust to layout changes, understands context | Slow, expensive, can hallucinate |
| Hybrid | Best of both worlds | More complex to implement |
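
The "can hallucinate" entry is the one I'd take most seriously, so it's worth checking every record before trusting it. Below is a sketch of a sanity check, assuming the four fields from the prompt above. `validate_extraction` is a hypothetical helper, and the rules (positive numeric price, product name must appear in the page text) are just the ones I'd start with.

```python
REQUIRED_FIELDS = ("product_name", "price", "availability", "description")

def validate_extraction(data, page_text):
    """Return a list of problems; an empty list means the record passed."""
    if not isinstance(data, dict):
        return ["response is not a JSON object"]
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in data]

    if "price" in data:
        price = data["price"]
        # bool is a subclass of int in Python, so rule it out explicitly
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        if not is_number or price <= 0:
            problems.append("price is not a positive number")

    if "product_name" in data:
        name = data["product_name"]
        if not isinstance(name, str) or not name.strip():
            problems.append("product_name is empty or not a string")
        elif name.strip().lower() not in page_text.lower():
            # A name that never appears on the page is a likely hallucination
            problems.append("product_name not found in page text")
    return problems
```

Here `page_text` would be something like `soup.get_text()` from the script. The check won't catch a wrong-but-plausible price, but it does catch the model inventing a product name or returning the price as a string.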

For my client, I ended up using a hybrid: traditional selectors for stable parts (like the product title), and AI fallback when selectors fail. That reduced costs while keeping reliability high.
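
The hybrid logic looks roughly like this. It's a simplified sketch rather than the client code: `extract_hybrid` is an illustrative name, `selector_extractors` maps each field to a function that returns a value or `None`, and `llm_extract` is a hypothetical callable that takes the HTML plus the list of missing fields and returns a dict.

```python
def extract_hybrid(html, selector_extractors, llm_extract):
    """Try cheap per-field extractors first; call the LLM once for whatever is missing."""
    result = {}
    missing = []
    for field, extractor in selector_extractors.items():
        try:
            value = extractor(html)
        except Exception:
            value = None  # a selector that blows up counts as a miss
        if value:
            result[field] = value
        else:
            missing.append(field)

    if missing:
        # One model call covers every field the selectors could not find
        llm_data = llm_extract(html, missing)
        for field in missing:
            result[field] = llm_data.get(field)
    return result, missing
```

Each extractor would typically wrap a BeautifulSoup lookup such as `soup.select_one("h1")`. Returning `missing` alongside the result makes it easy to log how often the fallback fires, which tells you when a selector needs fixing.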

AI won't replace traditional scraping entirely, but it's a powerful tool for those annoying edge cases where selectors break. The technique I showed here is just one example — you could also use vision models on screenshots, or structured extraction APIs.

Have you tried using LLMs for data extraction? What's your setup look like?
