I spent a week on regex before realizing an AI agent was the answer for data extraction


[ARTICLE · art-20214] src=dev.to ↗ pub=2026-06-03T10:00Z topic=artificial-intelligence verified=true sentiment=↓ negative

A developer spent a week trying to extract structured data from free-form emails using regex and spaCy before switching to an AI agent with function calling. The agent, built with OpenAI's API and a Pydantic schema, resolved relative dates and extracted fields like names, amounts, and purposes by calling an LLM to output JSON. This approach eliminated the brittleness of pattern matching and the overhead of fine-tuning models on small datasets.


A couple of months ago, I was building a small internal tool that had to parse user emails and extract structured data: names, dates, amounts, and some custom fields. The emails weren't formal forms — they were free-form requests like "Hey, can we schedule a meeting for next Tuesday at 3 PM to discuss the $500 invoice?"

At first, I thought, "Regex will handle this, it's just pattern matching." I was wrong. So wrong.

I needed to extract a date, a monetary amount, a person's name, and the purpose of the request.

The input was email bodies — no standard structure, no templates. People write the way they talk.

I started with Python's re module. I wrote patterns like r"\$?\d+(\.\d{2})?" for amounts and r"(next|this) (Monday|Tuesday|...)" for dates. They worked on my test cases but failed on real data.

Regex is brittle. Every edge case required a new pattern. After 50 lines of regex, I was still missing half the extractions.
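To make the brittleness concrete, here is a minimal sketch that runs the two patterns quoted above against a few email-like sentences. The sample sentences and the `first_match` helper are illustrative assumptions, and the article elides the weekday list, so the full alternation below is also an assumption.

```python
import re

# Amount pattern quoted in the article.
AMOUNT = re.compile(r"\$?\d+(\.\d{2})?")
# The article elides the weekday list; the full alternation is an assumption.
RELATIVE_DAY = re.compile(
    r"(next|this) (Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
)

def first_match(pattern, text):
    """Return the first substring the pattern matches, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None

# Hypothetical sample sentences, not taken from the author's data.
samples = [
    "The invoice total is $500.",                        # clean case
    "Can we meet at 3 PM to discuss the $500 invoice?",  # the time is read as an amount
    "The budget is $1,200 for next tuesday.",            # comma and lowercase day
    "Let's talk Tuesday next week.",                     # different word order
]

for text in samples:
    amount = first_match(AMOUNT, text)
    day = first_match(RELATIVE_DAY, text)
    print(f"amount={amount!r} day={day!r} <- {text}")
```

Only the first sentence yields the intended amount. The second returns the "3" from "3 PM", the third stops at the thousands separator and misses the lowercase weekday, and the fourth finds no date at all. Each of these could be patched with another pattern, which is exactly how the pattern count grows.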

I thought I'd be smart and use spaCy's named entity recognition (NER). I loaded the en_core_web_lg model and applied it to each email. It found dates and money entities reasonably well, but not well enough to use directly.

I ended up writing post-processing rules on top of spaCy. That was another rabbit hole.

I even tried fine-tuning a small BERT model on a dataset of 200 annotated emails. It was overkill. It took hours to train, and the results weren't much better than spaCy because the dataset was too small and diverse. I gave up on that after two days.

After hitting multiple dead ends, I stepped back and asked: "What's the most flexible way to extract structured data from free text?" The answer was an AI language model that can follow instructions and output JSON.

I built a lightweight agent that takes the raw text and a schema definition, then calls an LLM (I used OpenAI's API, but you can use any model with function calling) to extract the fields. The key was function calling: I defined a function that the model could "call" with the extracted parameters.

Here's the core approach:

import openai
from pydantic import BaseModel, Field
from typing import Optional
import json

class EmailExtraction(BaseModel):
    # default=None keeps every field optional, so validation still passes when the model omits one.
    date: Optional[str] = Field(default=None, description="The date mentioned, in YYYY-MM-DD format. Use relative date resolution.")
    amount: Optional[float] = Field(default=None, description="Monetary amount mentioned, as a number.")
    person: Optional[str] = Field(default=None, description="Full name of the person mentioned.")
    purpose: Optional[str] = Field(default=None, description="Short description of the meeting or request purpose.")

# Assumes the openai>=1.0 SDK and Pydantic v2; the client reads OPENAI_API_KEY from the environment.
client = openai.OpenAI()

def extract_email_data(text: str) -> EmailExtraction:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # or gpt-3.5-turbo for faster/cheaper
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an extraction assistant. Extract structured fields from the user's email text. "
                    "If a field is not present, leave it as null. For relative dates like 'next Tuesday', "
                    "resolve them to an absolute date in YYYY-MM-DD format assuming today is 2024-03-20. "
                    "Output only the JSON matching the provided schema."
                )
            },
            {
                "role": "user",
                "content": f"Extract from this email:\n\n{text}"
            }
        ],
        # Expose the Pydantic schema as a tool and force the model to call it.
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "extract_email_fields",
                    "description": "Extract structured fields from an email",
                    "parameters": EmailExtraction.model_json_schema()
                }
            }
        ],
        tool_choice={"type": "function", "function": {"name": "extract_email_fields"}}
    )

    tool_calls = response.choices[0].message.tool_calls
    if not tool_calls:
        raise ValueError("Model did not call the extraction function")
    # The arguments arrive as a JSON string; Pydantic validates the parsed fields.
    args = json.loads(tool_calls[0].function.arguments)
    return EmailExtraction(**args)

email_text = """
Hi, I need to meet with Dr. Alice next Wednesday at 2pm to go over the $3000 proposal. Let me know if that works.
"""

result = extract_email_data(email_text)
print(result.model_dump_json(indent=2))  # Pydantic v2; on v1 use result.json(indent=2)

The beauty of this approach: you change the schema, and the model adapts. Adding a new field? Just add it to the Pydantic model. No regex rewrites, no pipeline changes.

For my internal tool, I deployed a small Flask app that hits the OpenAI API. I also tested it with a local model via Ollama (like llama3), but the extraction accuracy was lower: enough for prototyping, not production. If you want to try a similar endpoint, there are services like https://ai.interwestinfo.com/ that offer structured extraction endpoints (I used a hosted one to offload the LLM call). But the technique is the same regardless of provider.
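For readers who want to reproduce the local-model experiment, the following is a rough sketch of the same extraction against Ollama's chat endpoint using only the standard library. It is not the author's setup: the prompt wording, the `build_ollama_request` and `extract_with_ollama` names, and the choice of Ollama's JSON output mode instead of function calling are all assumptions, and it presumes an Ollama server running on its default local port with the llama3 model pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address

def build_ollama_request(email_text, model="llama3"):
    """Build the body for a non-streaming chat call that asks for JSON output."""
    return {
        "model": model,
        "stream": False,   # return one complete reply instead of chunks
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
        "messages": [
            {
                "role": "system",
                "content": (
                    "Extract date, amount, person and purpose from the email. "
                    "Reply with a JSON object using exactly those keys; "
                    "use null for anything that is missing."
                ),
            },
            {"role": "user", "content": email_text},
        ],
    }

def extract_with_ollama(email_text, model="llama3"):
    """Send the request to a running Ollama server and parse the reply as JSON."""
    body = json.dumps(build_ollama_request(email_text, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    # The assistant text lives under message.content and should itself be JSON.
    return json.loads(reply["message"]["content"])
```

Calling `extract_with_ollama(email_text)` returns a plain dict, which can be passed to the same validation step as the hosted model's output so the two can be compared on identical emails.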

Next time, I'd start with the AI agent approach from day one, but also build a hybrid: use regex for the easy, high-confidence patterns (like email addresses), then fall back to the LLM for fuzzy extractions. I'd also add a validation layer to catch obvious LLM errors (e.g., date out of range).
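One way the hybrid and the validation layer could fit together is sketched below. Everything here is hypothetical: `llm_extract` stands for whatever function wraps the model call, the email regex is deliberately simplified, and the one-year window for plausible dates is an arbitrary assumption to tune per use case.

```python
import re
from datetime import date, timedelta

# High-confidence pattern: a simplified email-address regex (not full RFC 5322).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_hybrid(text, llm_extract):
    """Take fuzzy fields from the LLM and email addresses from regex.

    llm_extract is a hypothetical callable: email text in, dict of fields out.
    """
    fields = dict(llm_extract(text))
    fields["emails"] = EMAIL_RE.findall(text)
    return fields

def validate(fields, today, max_days_ahead=365):
    """Return a list of problems; an empty list means nothing looked wrong."""
    problems = []
    raw_date = fields.get("date")
    if raw_date is not None:
        try:
            parsed = date.fromisoformat(str(raw_date))
        except ValueError:
            problems.append(f"date is not a valid ISO date: {raw_date!r}")
        else:
            latest = today + timedelta(days=max_days_ahead)
            if not today <= parsed <= latest:
                problems.append(f"date out of range: {raw_date}")
    amount = fields.get("amount")
    if amount is not None and amount < 0:
        problems.append(f"negative amount: {amount}")
    return problems

def fake_llm(text):
    # Stand-in for the real LLM call so the sketch runs offline.
    return {"date": "2024-03-27", "amount": 3000.0,
            "person": "Dr. Alice", "purpose": "proposal review"}

fields = extract_hybrid("Ping me at alice@example.com about the $3000 proposal.", fake_llm)
print(fields["emails"], validate(fields, today=date(2024, 3, 20)))
```

Returning a list of problems instead of raising lets the caller decide what to do with a suspicious extraction, for example retrying the model or routing the email to a person.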

Also, I'd spend more time crafting the system prompt — a good prompt reduces hallucination and improves accuracy dramatically.
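One concrete prompt improvement is to stop hardcoding the reference date. The sketch below builds the system prompt at call time with the current date and weekday; `build_system_prompt` is a hypothetical helper, and including the weekday is an assumption that may make phrases like "next Tuesday" easier for the model to resolve.

```python
from datetime import date

# Fixed English names avoid depending on the process locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def build_system_prompt(today=None):
    """Return the extraction prompt anchored to a reference date (default: today)."""
    today = today or date.today()
    weekday = WEEKDAYS[today.weekday()]  # date.weekday() returns 0 for Monday
    return (
        "You are an extraction assistant. Extract structured fields from the "
        "user's email text. If a field is not present, leave it as null. "
        f"Today is {weekday}, {today.isoformat()}. Resolve relative dates such "
        "as 'next Tuesday' to an absolute date in YYYY-MM-DD format. "
        "Output only the JSON matching the provided schema."
    )

print(build_system_prompt(date(2024, 3, 20)))
```

Passing the date as a parameter also makes the prompt reproducible in tests, since a fixed date always yields the same string.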

Have you tried using LLMs for data extraction? Or do you still swear by regex? I'd love to hear about your experiences — especially if you've found a good open-source model that matches GPT-4 for this task.
