How I stopped wrestling with regex and started using AI for data extraction


[ARTICLE · art-18885] src=dev.to ↗ pub=2026-05-31T01:04Z topic=artificial-intelligence verified=true sentiment=↑ positive


A developer replaced a 40-line regex system with GPT-4o-mini for extracting product data from unstructured supplier descriptions, achieving nearly 100% valid JSON output after struggling with a 37% success rate from regex. The AI-based approach cost about $8 to process 10,000 records, far less than the time spent debugging regex patterns. The developer used a strict system prompt requiring JSON-only output and a temperature of 0.1 for consistency, though the model still struggles with heavily ambiguous text.

4 min read · Published May 31, 2026

Last month, I spent three days fighting with regular expressions.

I had a pile of unstructured product descriptions from various suppliers—some with prices hidden in paragraphs, others with specs scattered across bullet points. My job was to normalize them into a clean JSON structure: { name, price, specs, description }

It started simple: a few regex patterns, \$\d+\.\d{2} for prices and (?<=Brand:)\w+ for brands. Then the edge cases hit me like a freight train.

The first supplier used "$12.99" format. The second used "USD 12.99". One even wrote "costs around twelve dollars and ninety nine cents". My regex grew into a monster spanning 40 lines, with lookaheads, groups, and conditional statements. It worked for the first 20 products. Then I ran it on the full dataset (10,000 records).

I got a 37% success rate. The rest were either wrong or empty. I spent another two days adding fallback patterns, but every new pattern introduced new false positives. I knew I was fighting a losing battle.
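To make the failure mode concrete, here is a minimal sketch that runs the original price pattern against the three price formats mentioned above. The sample strings and the `find_price` helper are invented for illustration; only the pattern and the formats come from the article.

```python
import re

# The price pattern from the article: a dollar sign, digits, and two decimals.
PRICE_RE = re.compile(r"\$\d+\.\d{2}")

# Hypothetical sample descriptions, one per price format described above.
samples = [
    "Widget, only $12.99 while stocks last",
    "Widget, USD 12.99 per unit",
    "Widget costs around twelve dollars and ninety nine cents",
]


def find_price(text):
    """Return the first price-like match, or None if the pattern finds nothing."""
    match = PRICE_RE.search(text)
    return match.group(0) if match else None


results = [find_price(s) for s in samples]
print(results)  # ['$12.99', None, None]
```

Only the first format matches. Each additional format needs its own pattern, which is how a single expression turns into 40 lines.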

I considered spaCy and NLTK. Train a custom NER model for product attributes? That would require labeled data, compute time, and ongoing maintenance as supplier formats changed. It was overkill for a one-time migration project. I needed something that could handle unstructured text on the fly without training.

A colleague mentioned using GPT-style models for data extraction. I was skeptical—seemed like using a sledgehammer to crack a nut. But after hitting that regex wall, I tried it.

The key insight: you don't need to fine-tune a model. You just need a well-crafted system prompt and a consistent output format. Here's what I ended up with:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment by default

def extract_product_info(text):
    system_prompt = """
You are a data extraction assistant. Given a product description, extract the following fields and return ONLY a valid JSON object:
- name (string)
- price (float, in USD, if not specified use null)
- specs (object of key-value pairs if any specs mentioned, else empty object)
- description (string, cleaned summary of the product)

Rules:
- If price uses words like 'twelve dollars', convert to number.
- If multiple prices, pick the one for the product, not shipping.
- If no price found, use null.
- Return ONLY JSON, no markdown, no extra text.
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}
        ],
        temperature=0.1,  # low for consistency
        max_tokens=500
    )
    raw = response.choices[0].message.content
    raw = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(raw)

Prompt engineering matters more than model size. I started with GPT-3.5 and got inconsistent outputs. Switching to GPT-4o-mini with a strict system prompt ("Return ONLY JSON") gave nearly 100% valid JSON. But I also learned to explicitly parse out markdown fences—models sometimes wrap JSON in triple backticks, even when told not to.
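A slightly more tolerant way to strip fences is sketched below. It is an illustration, not the code from the original post: `strip_code_fence`, `FENCE`, and `FENCE_RE` are hypothetical names, and the sketch assumes the whole reply is either bare JSON or a single fenced block with an optional language tag.

```python
import json
import re

# Build the three-backtick marker programmatically so this snippet can itself
# sit inside a fenced block without confusing a renderer.
FENCE = "`" * 3

# Opening fence with an optional language tag, the body, then a closing fence.
FENCE_RE = re.compile(
    r"^" + re.escape(FENCE) + r"[\w-]*[ \t]*\n(.*?)\n?" + re.escape(FENCE) + r"$",
    re.DOTALL,
)


def strip_code_fence(raw: str) -> str:
    """Return the fenced body if the reply is one fenced block, else the reply itself."""
    text = raw.strip()
    match = FENCE_RE.match(text)
    return match.group(1).strip() if match else text


reply = FENCE + 'json\n{"name": "Mouse", "price": 12.99}\n' + FENCE
print(json.loads(strip_code_fence(reply)))  # {'name': 'Mouse', 'price': 12.99}
```

Unlike a fixed prefix check, this also handles a fence with no language tag or a different tag, and leaves unfenced replies untouched.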

Validation saves the day. json.loads raises a JSONDecodeError if the model emits malformed JSON, such as a stray trailing comma. I added a retry loop with a fallback prompt (the fallback prompt itself is not shown here):

import json

def extract_with_retry(text, max_retries=2):
    # Retry when the model returns malformed JSON; re-raise on the final attempt.
    for attempt in range(max_retries):
        try:
            return extract_product_info(text)
        except (json.JSONDecodeError, KeyError):
            if attempt == max_retries - 1:
                raise
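The loop above re-sends the same request on every attempt. One way a fallback prompt could be wired in is sketched here. Everything in it is an assumption for illustration: `call_model` is a hypothetical callable that takes the text plus an optional extra instruction and returns the raw reply, and `FALLBACK_HINT` is made-up wording, not the prompt from the original post.

```python
import json
from typing import Callable, Optional

# Hypothetical extra instruction appended after a failed parse.
FALLBACK_HINT = (
    "Your previous reply was not valid JSON. "
    "Reply with a single JSON object and nothing else."
)


def extract_with_fallback(
    text: str,
    call_model: Callable[[str, Optional[str]], str],
    max_retries: int = 2,
) -> dict:
    """Parse the model reply as JSON, retrying with a stricter hint on failure."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    hint: Optional[str] = None
    last_error: Optional[json.JSONDecodeError] = None
    for _ in range(max_retries):
        raw = call_model(text, hint)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            hint = FALLBACK_HINT  # tighten the instruction for the next attempt
    assert last_error is not None
    raise last_error
```

Passing the model call in as a function keeps the retry logic testable with a stub, without touching the API.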

Cost isn't ridiculous. Processing 10,000 records with GPT-4o-mini cost about $8—far cheaper than my time debugging regex patterns. Each product description averaged ~150 tokens, and output ~80 tokens.
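For a rough budget before running a job like this, the arithmetic can be sketched as below. The function name and parameters are hypothetical, and the rates in the example call are placeholders, not real pricing: plug in the current per-million-token rates from the provider's pricing page. Note that the system prompt is sent with every request, so it counts toward input tokens on each record. For reference, the article's figure of about $8 for 10,000 records works out to roughly $0.0008 per record.

```python
def estimate_cost_usd(
    records: int,
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
    prompt_overhead_tokens: int = 0,
) -> float:
    """Estimate total cost from per-record token counts and per-million-token rates."""
    total_in = records * (input_tokens + prompt_overhead_tokens)
    total_out = records * output_tokens
    return (
        total_in * input_rate_per_million + total_out * output_rate_per_million
    ) / 1_000_000


# Token counts from the article; the two rates are PLACEHOLDERS, not real prices.
estimate = estimate_cost_usd(
    records=10_000,
    input_tokens=150,
    output_tokens=80,
    input_rate_per_million=1.00,
    output_rate_per_million=2.00,
)
print(f"Estimated cost: ${estimate:.2f}")
```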

But it's not a silver bullet. The model still struggles with heavily ambiguous text. If a supplier describes a "wireless mouse" and later mentions "batteries not included" without a price, the model might guess a price based on training data, which is wrong. I learned to make null the default and add a human review step for any record where price is null.
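The review step can be as simple as splitting the extracted records into two piles. This is a minimal sketch, assuming each record is a dict shaped like the JSON above; `split_for_review` is a hypothetical helper name.

```python
def split_for_review(records):
    """Separate records with a price from those that need a human look."""
    accepted, needs_review = [], []
    for record in records:
        # A missing key is treated the same as an explicit null price.
        if record.get("price") is None:
            needs_review.append(record)
        else:
            accepted.append(record)
    return accepted, needs_review


extracted = [
    {"name": "Wireless mouse", "price": None},
    {"name": "USB cable", "price": 4.5},
]
accepted, needs_review = split_for_review(extracted)
print(len(accepted), len(needs_review))  # 1 1
```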

If I did this again, I'd start with AI from the beginning, but pair it with a robust validation layer: check that extracted fields conform to expected types (price as float, name non-empty). Use Pydantic models to enforce structure. Also, I'd batch the requests to amortize latency and reduce cost.
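The type checks described here can be sketched with the standard library alone. This deliberately swaps Pydantic for a plain dataclass and hand-written checks to stay dependency-free; a Pydantic model would express the same rules declaratively. `Product` and `validate_product` are hypothetical names, and the rule that price must be non-negative is an added assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Product:
    name: str
    price: Optional[float]
    specs: dict = field(default_factory=dict)
    description: str = ""


def validate_product(data: dict) -> Product:
    """Check the extracted fields and return a typed Product, or raise ValueError."""
    name = data.get("name")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("name must be a non-empty string")

    price = data.get("price")
    if price is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            raise ValueError("price must be a number or null")
        if price < 0:
            raise ValueError("price must not be negative")  # assumed rule
        price = float(price)

    specs = data.get("specs")
    if specs is None:
        specs = {}
    if not isinstance(specs, dict):
        raise ValueError("specs must be an object")

    description = data.get("description")
    if description is None:
        description = ""
    if not isinstance(description, str):
        raise ValueError("description must be a string")

    return Product(name.strip(), price, specs, description)
```

Running every parsed reply through a check like this catches the case where the JSON is syntactically valid but a field has the wrong type, such as a price returned as the string "12.99".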

Oh, and I'd explore specialized extraction endpoints like the one at ai.interwestinfo.com that claims to handle this sort of thing. Honestly, though, the general-purpose approach with prompt engineering gave me enough control. I might use a dedicated tool if I revisit this project next quarter.

In the end, I stopped writing regex. I started writing prompts. And I got my weekends back.

What's your experience with AI for data parsing? Do you lean on regex or are you all-in on LLMs? I'm curious to hear what works for you.


Read original on dev.to → dev.to/__c1b9e06dc90a7e0a676b/how-i-stopped-wres…
