{"slug": "how-i-stopped-wrestling-with-regex-and-started-using-ai-for-data-extraction", "title": "How I stopped wrestling with regex and started using AI for data extraction", "summary": "A developer replaced a 40-line regex system with GPT-4o-mini for extracting product data from unstructured supplier descriptions, achieving nearly 100% valid JSON output after struggling with a 37% success rate from regex. The AI-based approach cost about $8 to process 10,000 records, far less than the time spent debugging regex patterns. The developer used a strict system prompt requiring JSON-only output and a temperature of 0.1 for consistency, though the model still struggles with heavily ambiguous text.", "body_md": "Last month, I spent three days fighting with regular expressions.\n\nI had a pile of unstructured product descriptions from various suppliers, some with prices hidden in paragraphs, others with specs scattered across bullet points. My job was to normalize them into a clean JSON structure: `{ name, price, specs, description }`.\n\nIt started simple, with a few regex patterns: `\\$\\d+\\.\\d{2}` for prices, `(?<=Brand:)\\w+` for brands. Then the edge cases hit me like a freight train.\n\nThe first supplier used the \"$12.99\" format. The second used \"USD 12.99\". One even wrote \"costs around twelve dollars and ninety nine cents\". My regex grew into a monster spanning 40 lines, with lookaheads, groups, and conditional statements. It worked for the first 20 products. Then I ran it on the full dataset (10,000 records).\n\nI got a 37% success rate. The rest were either wrong or empty. I spent another two days adding fallback patterns, but every new pattern introduced new false positives. I knew I was fighting a losing battle.\n\nI considered spaCy and NLTK. Train a custom NER model for product attributes? That would require labeled data, compute time, and ongoing maintenance as supplier formats changed. Overkill for a one-time migration project. 
I needed something that could handle unstructured text on the fly without training.\n\nA colleague mentioned using GPT-style models for data extraction. I was skeptical; it seemed like using a sledgehammer to crack a nut. But after hitting that regex wall, I tried it.\n\nThe key insight: you don't need to fine-tune a model. You just need a well-crafted system prompt and a consistent output format. Here's what I ended up with:\n\n``` python\nimport json\nfrom openai import OpenAI\n\nclient = OpenAI()  # reads OPENAI_API_KEY from the environment by default\n\ndef extract_product_info(text):\n    system_prompt = \"\"\"\nYou are a data extraction assistant. Given a product description, extract the following fields and return ONLY a valid JSON object:\n- name (string)\n- price (float, in USD, if not specified use null)\n- specs (object of key-value pairs if any specs mentioned, else empty object)\n- description (string, cleaned summary of the product)\n\nRules:\n- If price uses words like 'twelve dollars', convert to number.\n- If multiple prices, pick the one for the product, not shipping.\n- If no price found, use null.\n- Return ONLY JSON, no markdown, no extra text.\n\"\"\"\n    response = client.chat.completions.create(\n        model=\"gpt-4o-mini\",\n        messages=[\n            {\"role\": \"system\", \"content\": system_prompt},\n            {\"role\": \"user\", \"content\": text}\n        ],\n        temperature=0.1,  # low for consistency\n        max_tokens=500\n    )\n    raw = response.choices[0].message.content\n    # Strip markdown code fences if the model adds them anyway (str.removeprefix needs Python 3.9+)\n    raw = raw.strip().removeprefix(\"```json\").removeprefix(\"```\").removesuffix(\"```\").strip()\n    return json.loads(raw)\n```\n\n**Prompt engineering matters more than model size.** I started with GPT-3.5 and got inconsistent outputs. Switching to GPT-4o-mini with a strict system prompt (\"Return ONLY JSON\") gave nearly 100% valid JSON. 
But I also learned to strip markdown fences explicitly, because models sometimes wrap JSON in triple backticks even when told not to.\n\n**Validation saves the day.** The `json.loads` call will raise an exception if the model hallucinates an extra comma. I added a retry loop with a fallback prompt:\n\n``` python\nimport json\n\ndef extract_with_retry(text, max_retries=2):\n    for attempt in range(max_retries):\n        try:\n            return extract_product_info(text)\n        except (json.JSONDecodeError, KeyError):\n            # Out of attempts: surface the error to the caller\n            if attempt == max_retries - 1:\n                raise\n            # Otherwise try again; the fallback prompt that asks the\n            # model to fix its JSON would go here (not shown)\n            continue\n```\n\n**Cost isn't ridiculous.** Processing 10,000 records with GPT-4o-mini cost about $8, far cheaper than my time debugging regex patterns. Each product description averaged ~150 input tokens and produced ~80 output tokens.\n\n**But it's not a silver bullet.** The model still struggles with heavily ambiguous text. If a supplier describes a \"wireless mouse\" and later mentions \"batteries not included\" without a price, the model might guess a price based on training data, which is wrong. I learned to make `null` the default and add a human review step for any record where `price` is null.\n\nIf I did this again, I'd start with AI from the beginning, but pair it with a robust validation layer: check that extracted fields conform to expected types (price as float, name non-empty). Use Pydantic models to enforce structure. Also, I'd batch the requests to amortize latency and reduce cost.\n\nOh, and I'd explore specialized extraction endpoints like the one at `ai.interwestinfo.com` that claims to handle this sort of thing. But honestly, the general-purpose approach with prompt engineering gave me enough control. I might use a dedicated tool if I revisit this project next quarter.\n\nIn the end, I stopped writing regex. I started writing prompts. And I got my weekends back.\n\nWhat's your experience with AI for data parsing? 
Do you lean on regex or are you all-in on LLMs? I'm curious to hear what works for you.", "url": "https://wpnews.pro/news/how-i-stopped-wrestling-with-regex-and-started-using-ai-for-data-extraction", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/how-i-stopped-wrestling-with-regex-and-started-using-ai-for-data-extraction-4mja", "published_at": "2026-05-31 01:04:59+00:00", "updated_at": "2026-05-31 01:42:25.478036+00:00", "lang": "en", "topics": ["artificial-intelligence", "natural-language-processing", "large-language-models", "ai-tools"], "entities": ["spaCy", "NLTK", "GPT"], "alternates": {"html": "https://wpnews.pro/news/how-i-stopped-wrestling-with-regex-and-started-using-ai-for-data-extraction", "markdown": "https://wpnews.pro/news/how-i-stopped-wrestling-with-regex-and-started-using-ai-for-data-extraction.md", "text": "https://wpnews.pro/news/how-i-stopped-wrestling-with-regex-and-started-using-ai-for-data-extraction.txt", "jsonld": "https://wpnews.pro/news/how-i-stopped-wrestling-with-regex-and-started-using-ai-for-data-extraction.jsonld"}}