# How I stopped wrestling with regex and started using AI for data extraction

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/how-i-stopped-wrestling-with-regex-and-started-using-ai-for-data-extraction-4mja>
> Published: 2026-05-31 01:04:59+00:00

Last month, I spent three days fighting with regular expressions.

I had a pile of unstructured product descriptions from various suppliers: some with prices hidden in paragraphs, others with specs scattered across bullet points. My job was to normalize them into a clean JSON structure: `{ name, price, specs, description }`.

It started simple. A few regex patterns: `\$\d+\.\d{2}` for prices, `(?<=Brand:)\w+` for brands. Then the edge cases hit me like a freight train.

The first supplier used "$12.99" format. The second used "USD 12.99". One even wrote "costs around twelve dollars and ninety nine cents". My regex grew into a monster spanning 40 lines, with lookaheads, groups, and conditional statements. It worked for the first 20 products. Then I ran it on the full dataset (10,000 records).

I got a 37% success rate. The rest were either wrong or empty. I spent another two days adding fallback patterns, but every new pattern introduced new false positives. I knew I was fighting a losing battle.
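To make the failure mode concrete, here is a small sketch that runs the two patterns quoted earlier against the three price formats I described. The sample strings are illustrative stand-ins, not rows from the real dataset.

```python
import re

# The two patterns quoted earlier in the post.
PRICE_RE = re.compile(r"\$\d+\.\d{2}")
BRAND_RE = re.compile(r"(?<=Brand:)\w+")

# Illustrative inputs modeled on the three price formats described above.
samples = [
    "Wireless mouse, only $12.99 while stocks last",
    "Wireless mouse. Price: USD 12.99",
    "Costs around twelve dollars and ninety nine cents",
]

def find_price(text):
    """Return the first dollar-sign price in text, or None if the pattern misses."""
    match = PRICE_RE.search(text)
    return match.group(0) if match else None

for sample in samples:
    print(f"{find_price(sample)!r:10} <- {sample}")

# The lookbehind is just as brittle: one space after the colon defeats it.
print(BRAND_RE.search("Brand:Acme"))   # a match object for "Acme"
print(BRAND_RE.search("Brand: Acme"))  # None, because \w+ cannot start at a space
```

Only the first sample yields a price. Every extra format needs another pattern, which is how the 40-line monster happened.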

I considered spaCy and NLTK. Train a custom NER model for product attributes? That would require labeled data, compute time, and ongoing maintenance as supplier formats changed. Overkill for a one-time migration project. I needed something that could handle unstructured text on the fly without training.

A colleague mentioned using GPT-style models for data extraction. I was skeptical—seemed like using a sledgehammer to crack a nut. But after hitting that regex wall, I tried it.

The key insight: you don't need to fine-tune a model. You just need a well-crafted system prompt and a consistent output format. Here's what I ended up with:

``` python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment by default

def extract_product_info(text):
    system_prompt = """
You are a data extraction assistant. Given a product description, extract the following fields and return ONLY a valid JSON object:
- name (string)
- price (float, in USD, if not specified use null)
- specs (object of key-value pairs if any specs mentioned, else empty object)
- description (string, cleaned summary of the product)

Rules:
- If price uses words like 'twelve dollars', convert to number.
- If multiple prices, pick the one for the product, not shipping.
- If no price found, use null.
- Return ONLY JSON, no markdown, no extra text.
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}
        ],
        temperature=0.1,  # low for consistency
        max_tokens=500
    )
    raw = response.choices[0].message.content
    # Clean up possible markdown code fences
    raw = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(raw)
```

**Prompt engineering matters more than model size.** I started with GPT-3.5 and got inconsistent outputs. Switching to GPT-4o-mini with a strict system prompt ("Return ONLY JSON") gave nearly 100% valid JSON. I also learned to strip markdown fences explicitly, because models sometimes wrap JSON in triple backticks even when told not to.
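The fence cleanup in the function above keys on one exact prefix, so a bare fence or an uppercase language tag would slip through. A slightly more tolerant version, as a sketch: `strip_fences` and `FENCED_RE` are names I made up for this example, and the regex assumes the reply is either one fenced block or plain JSON.

```python
import json
import re

FENCE = "`" * 3  # a literal triple backtick, built this way to keep the example readable

# Optional language tag after the opening fence, then the payload, then the
# closing fence. DOTALL lets the payload span several lines.
FENCED_RE = re.compile(
    r"^\s*" + re.escape(FENCE) + r"[A-Za-z]*\s*(.*?)\s*" + re.escape(FENCE) + r"\s*$",
    re.DOTALL,
)

def strip_fences(raw):
    """Return the text inside a markdown code fence, or the stripped input if unfenced."""
    match = FENCED_RE.match(raw)
    return match.group(1) if match else raw.strip()

fenced_reply = FENCE + "json\n" + '{"name": "Mouse", "price": 12.99}' + "\n" + FENCE
print(json.loads(strip_fences(fenced_reply)))
```

Unfenced replies pass through untouched apart from whitespace trimming, so the helper is safe to call on every response.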

**Validation saves the day.** The `json.loads` call will crash if the model hallucinates an extra comma. I added a retry loop, leaving a hook for a fallback prompt:

``` python
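import json

def extract_with_retry(text, max_retries=2):
    """Call extract_product_info up to max_retries times, re-raising the last failure."""
    for attempt in range(max_retries):
        try:
            return extract_product_info(text)
        except json.JSONDecodeError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the parse error
            # Placeholder: a repair prompt asking the model to fix the JSON
            # would go here. As written, the loop simply calls the model again.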
```
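The retry loop leaves the repair step as a placeholder. One way to fill it in, sketched under assumptions: `extract_with_repair` and `call_model` are hypothetical names, and `call_model` stands for any function that takes a list of chat messages and returns the model's raw text reply. The system prompt is left out to keep the sketch short.

```python
import json

def extract_with_repair(text, call_model, max_attempts=2):
    """Parse the model's reply as JSON, showing it its own broken output on failure.

    call_model is a hypothetical callable: list of chat messages in, raw reply text out.
    """
    messages = [{"role": "user", "content": text}]
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(messages)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            # Feed the invalid reply back and ask for a corrected version.
            messages = messages + [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": f"That was not valid JSON ({err.msg}). "
                                            "Reply with the corrected JSON object only."},
            ]
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error

# Offline demo with a fake model: the first reply has a trailing comma, the second is valid.
replies = iter(['{"name": "Mouse", "price": 12.99,}', '{"name": "Mouse", "price": 12.99}'])

def fake_model(messages):
    return next(replies)

print(extract_with_repair("Wireless mouse, $12.99", fake_model))
```

Passing the model in as a callable keeps the loop testable without network access. In real use it would wrap the `client.chat.completions.create` call.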

**Cost isn't ridiculous.** Processing 10,000 records with GPT-4o-mini cost about $8, far cheaper than my time debugging regex patterns. Each product description averaged ~150 tokens, and each response ~80 tokens.
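If you want to budget a run like this up front, the arithmetic is simple enough to sketch. `estimate_cost` is a hypothetical helper, and the per-million-token prices in the usage line are placeholders, not actual rates, so check the current pricing page before trusting any number it prints. Keep in mind that the system prompt is billed as input on every request, so `input_tokens` should include it, and that retries add to the total.

```python
def estimate_cost(records, input_tokens, output_tokens,
                  usd_per_million_input, usd_per_million_output):
    """Estimate total API spend in USD, assuming every request is the same size."""
    input_cost = records * input_tokens * usd_per_million_input / 1_000_000
    output_cost = records * output_tokens * usd_per_million_output / 1_000_000
    return input_cost + output_cost

# Placeholder prices (1.00 and 2.00 USD per million tokens), for illustration only.
total = estimate_cost(
    records=10_000,
    input_tokens=150,
    output_tokens=80,
    usd_per_million_input=1.00,
    usd_per_million_output=2.00,
)
print(f"${total:.2f}")
```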

**But it's not a silver bullet.** The model still struggles with heavily ambiguous text. If a supplier describes a "wireless mouse" and later mentions "batteries not included" without a price, the model might guess a price based on training data, which is wrong. I learned to make `null` the default and add a human review step for any record where `price` is null.
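The review step can be as small as splitting the results into two piles. A minimal sketch, where `split_for_review` is a name I made up and the records are invented examples:

```python
def split_for_review(records):
    """Separate records with a usable price from those a person should check."""
    accepted, needs_review = [], []
    for record in records:
        if record.get("price") is None:
            needs_review.append(record)
        else:
            accepted.append(record)
    return accepted, needs_review

# Invented records shaped like the extraction output.
extracted = [
    {"name": "Wireless mouse", "price": 12.99, "specs": {}, "description": "A mouse."},
    {"name": "USB hub", "price": None, "specs": {}, "description": "Batteries not included."},
]

accepted, needs_review = split_for_review(extracted)
print(len(accepted), "accepted,", len(needs_review), "for review")
```

A missing `price` key is treated the same as an explicit null, so a record the model truncated also lands in the review pile.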

If I did this again, I'd start with AI from the beginning, but pair it with a robust validation layer: check that extracted fields conform to expected types (price as float, name non-empty). Use Pydantic models to enforce structure. Also, I'd batch the requests to amortize latency and reduce cost.
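Here is roughly what that validation layer checks, written as a dependency-free sketch rather than an actual Pydantic model (which would express the same constraints declaratively). `validate_record` is a hypothetical name, and the rule that prices cannot be negative is my own assumption, not something the data required.

```python
def validate_record(record):
    """Return a list of problems found in one extracted record (empty if clean)."""
    problems = []

    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name must be a non-empty string")

    price = record.get("price")
    if price is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            problems.append("price must be a number or null")
        elif price < 0:
            problems.append("price must not be negative")  # assumed business rule

    if not isinstance(record.get("specs"), dict):
        problems.append("specs must be an object")
    if not isinstance(record.get("description"), str):
        problems.append("description must be a string")

    return problems

good = {"name": "Wireless mouse", "price": 12.99, "specs": {"dpi": "1600"}, "description": "A mouse."}
bad = {"name": " ", "price": "12.99", "specs": [], "description": "A mouse."}

print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to log everything wrong with a record in a single pass.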

Oh, and I'd explore specialized extraction endpoints like the one at `ai.interwestinfo.com` that claims to handle this sort of thing. Honestly, though, the general-purpose approach with prompt engineering gave me enough control. I might use a dedicated tool if I revisit this project next quarter.

In the end, I stopped writing regex. I started writing prompts. And I got my weekends back.

What's your experience with AI for data parsing? Do you lean on regex or are you all-in on LLMs? I'm curious to hear what works for you.
