How I stopped wrestling with regex and started using AI for data extraction

A developer replaced a 40-line regex system with GPT-4o-mini for extracting product data from unstructured supplier descriptions, achieving nearly 100% valid JSON output after struggling with a 37% success rate from regex. The AI-based approach cost about $8 to process 10,000 records, far less than the time spent debugging regex patterns. The developer used a strict system prompt requiring JSON-only output and a temperature of 0.1 for consistency, though the model still struggles with heavily ambiguous text.

Last month, I spent three days fighting with regular expressions. I had a pile of unstructured product descriptions from various suppliers, some with prices hidden in paragraphs, others with specs scattered across bullet points. My job was to normalize them into a clean JSON structure: `{ name, price, specs, description }`.

It started simple. A few regex patterns: `\$\d+\.\d{2}` for prices, `(?<=Brand: )\w+` for brands. Then the edge cases hit me like a freight train. The first supplier used the "$12.99" format. The second used "USD 12.99". One even wrote "costs around twelve dollars and ninety nine cents". My regex grew into a monster spanning 40 lines, with lookaheads, groups, and conditional statements.

It worked for the first 20 products. Then I ran it on the full dataset (10,000 records) and got a 37% success rate. The rest were either wrong or empty. I spent another two days adding fallback patterns, but every new pattern introduced new false positives. I knew I was fighting a losing battle.

I considered spaCy and NLTK. Train a custom NER model for product attributes? That would require labeled data, compute time, and ongoing maintenance as supplier formats changed. Overkill for a one-time migration project. I needed something that could handle unstructured text on the fly without training.

A colleague mentioned using GPT-style models for data extraction. I was skeptical; it seemed like using a sledgehammer to crack a nut. But after hitting that regex wall, I tried it. The key insight: you don't need to fine-tune a model. You just need a well-crafted system prompt and a consistent output format. Here's what I ended up with:

```python
import json

from openai import OpenAI

client = OpenAI()  # or pass your key from env


def extract_product_info(text):
    system_prompt = """
You are a data extraction assistant.
Given a product description, extract the following fields and return ONLY a valid JSON object:
- name (string)
- price (float, in USD, if not specified use null)
- specs (object of key-value pairs if any specs mentioned, else empty object)
- description (string, cleaned summary of the product)

Rules:
- If price uses words like 'twelve dollars', convert to number.
- If multiple prices, pick the one for the product, not shipping.
- If no price found, use null.
- Return ONLY JSON, no markdown, no extra text.
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        temperature=0.1,  # low for consistency
        max_tokens=500,
    )
    raw = response.choices[0].message.content
    # Clean up possible markdown code fences
    raw = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(raw)
```

Prompt engineering matters more than model size. I started with GPT-3.5 and got inconsistent outputs. Switching to GPT-4o-mini with a strict system prompt ("Return ONLY JSON") gave nearly 100% valid JSON. But I also learned to explicitly strip markdown fences: models sometimes wrap JSON in triple backticks, even when told not to.

Validation saves the day. The `json.loads` call raises `json.JSONDecodeError` if the model hallucinates an extra comma. I added a retry loop with a fallback prompt:

```python
import json


def extract_with_retry(text, max_retries=2):
    for attempt in range(max_retries):
        try:
            return extract_product_info(text)
        except (json.JSONDecodeError, KeyError):
            # Out of attempts: re-raise the last error to the caller
            if attempt == max_retries - 1:
                raise
            # Ask model to fix the JSON (fallback prompt omitted here);
            # as written, the loop simply retries the same request
            pass
```

Cost isn't ridiculous. Processing 10,000 records with GPT-4o-mini cost about $8, far cheaper than my time debugging regex patterns. Each product description averaged ~150 input tokens, with ~80 output tokens.

But it's not a silver bullet. The AI model still struggles with heavily ambiguous text.
If a supplier describes a "wireless mouse" and later mentions "batteries not included" without a price, the model might guess a price based on training data, which is wrong. I learned to default the price to null and add a human review step for any record where price is null.

If I were doing this again, I'd start with AI from the beginning, but pair it with a robust validation layer: check that extracted fields conform to expected types (price as float, name non-empty). Use Pydantic models to enforce structure. Also, I'd batch the requests to amortize latency and reduce cost.

Oh, and I'd explore specialized extraction endpoints like the one at ai.interwestinfo.com that claims to handle this sort of thing. Honestly, though, the general-purpose approach with prompt engineering gave me enough control. I might use a dedicated tool if I revisit this project next quarter.

In the end, I stopped writing regex. I started writing prompts. And I got my weekends back.

What's your experience with AI for data parsing? Do you lean on regex or are you all-in on LLMs? I'm curious to hear what works for you.
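
For anyone who wants a starting point for the validation layer mentioned above, here is a minimal sketch in plain Python rather than Pydantic, so it runs without extra dependencies. The function names `validate_record` and `needs_human_review` are hypothetical and not from the original project; the field names follow the `{ name, price, specs, description }` structure described earlier, and the "no negative prices" rule is an added assumption.

```python
from typing import Any


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in an extracted record (empty if valid).

    Hypothetical helper: a dependency-free stand-in for a Pydantic model.
    """
    problems = []

    # name must be a non-empty string
    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name must be a non-empty string")

    # price must be a number or null; bool is a subclass of int, so reject it
    price = record.get("price")
    if price is not None:
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            problems.append("price must be a number or null")
        elif price < 0:  # assumption: negative prices are always extraction errors
            problems.append("price must not be negative")

    # specs must be an object (possibly empty)
    if not isinstance(record.get("specs"), dict):
        problems.append("specs must be an object")

    # description must be a string
    if not isinstance(record.get("description"), str):
        problems.append("description must be a string")

    return problems


def needs_human_review(record: dict[str, Any]) -> bool:
    """Flag records that fail validation or that have a null price."""
    return bool(validate_record(record)) or record.get("price") is None
```

Something like `needs_human_review(extract_with_retry(text))` could then route doubtful records to a review queue. A Pydantic model would express the same checks declaratively, which is probably the better choice once the schema grows.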