Last month, I spent three days fighting with regular expressions.
I had a pile of unstructured product descriptions from various suppliers—some with prices hidden in paragraphs, others with specs scattered across bullet points. My job was to normalize them into a clean JSON structure: { name, price, specs, description }
.
It started simple. A few regex patterns. \$\d+\.\d{2}
for prices. (?<=Brand:)\w+
for brands. Then the edge cases hit me like a freight train.
The first supplier used "$12.99" format. The second used "USD 12.99". One even wrote "costs around twelve dollars and ninety nine cents". My regex grew into a monster spanning 40 lines, with lookaheads, groups, and conditional statements. It worked for the first 20 products. Then I ran it on the full dataset (10,000 records).
I got a 37% success rate. The rest were either wrong or empty. I spent another two days adding fallback patterns, but every new pattern introduced new false positives. I knew I was fighting a losing battle.
I considered spaCy and NLTK. Trained a custom NER model for product attributes? That would require labeled data, compute time, and ongoing maintenance as supplier formats changed. Overkill for a one-time migration project. I needed something that could handle unstructured text on the fly without training.
A colleague mentioned using GPT-style models for data extraction. I was skeptical—seemed like using a sledgehammer to crack a nut. But after hitting that regex wall, I tried it.
The key insight: you don't need to fine-tune a model. You just need a well-crafted system prompt and a consistent output format. Here's what I ended up with:
import json
from openai import OpenAI
client = OpenAI() # or pass your key from env
def extract_product_info(text):
system_prompt = """
You are a data extraction assistant. Given a product description, extract the following fields and return ONLY a valid JSON object:
- name (string)
- price (float, in USD, if not specified use null)
- specs (object of key-value pairs if any specs mentioned, else empty object)
- description (string, cleaned summary of the product)
Rules:
- If price uses words like 'twelve dollars', convert to number.
- If multiple prices, pick the one for the product, not shipping.
- If no price found, use null.
- Return ONLY JSON, no markdown, no extra text.
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": text}
],
temperature=0.1, # low for consistency
max_tokens=500
)
raw = response.choices[0].message.content
raw = raw.strip().removeprefix("```
json").removesuffix("
```").strip()
return json.loads(raw)
Prompt engineering matters more than model size. I started with GPT-3.5 and got inconsistent outputs. Switching to GPT-4o-mini with a strict system prompt ("Return ONLY JSON") gave nearly 100% valid JSON. But I also learned to explicitly parse out markdown fences—models sometimes wrap JSON in triple backticks, even when told not to.
Validation saves the day. The json.loads
will crash if the model hallucinates an extra comma. I added a retry loop with a fallback prompt:
import json
import re
def extract_with_retry(text, max_retries=2):
for attempt in range(max_retries):
try:
return extract_product_info(text)
except (json.JSONDecodeError, KeyError) as e:
if attempt == max_retries - 1:
raise
pass
Cost isn't ridiculous. Processing 10,000 records with GPT-4o-mini cost about $8—far cheaper than my time debugging regex patterns. Each product description averaged ~150 tokens, and output ~80 tokens.
But it's not a silver bullet. The AI model still struggles with heavily ambiguous text. If a supplier describes a "wireless mouse" and later mentions "batteries not included" without a price, the model might guess a price based on training data—which is wrong. I learned to set null
default and add a human review step for any record where price
is null.
I'd start with AI from the beginning, but pair it with a robust validation layer: check that extracted fields conform to expected types (price as float, name non-empty). Use Pydantic models to enforce structure. Also, I'd batch the requests to amortize latency and reduce cost.
Oh, and I'd explore specialized extraction endpoints like the one at ai.interwestinfo.com
that claims to handle this sort of thing—but honestly, the general-purpose approach with prompt engineering gave me enough control. I might use a dedicated tool if I revisit this project next quarter.
In the end, I stopped writing regex. I started writing prompts. And I got my weekends back.
What's your experience with AI for data parsing? Do you lean on regex or are you all-in on LLMs? I'm curious to hear what works for you.