I spent a week on regex before realizing an AI agent was the answer for data extraction
A couple of months ago, I was building a small internal tool that had to parse user emails and extract structured data: names, dates, amounts, and some custom fields. The emails weren't formal forms; they were free-form requests like "Hey, can we schedule a meeting for next Tuesday at 3 PM to discuss the $500 invoice?"
At first, I thought, "Regex will handle this, it's just pattern matching." I was wrong. So wrong.
I needed to extract names, dates, amounts, and a few custom fields.
The input was email bodies: no standard structure, no templates. People write the way they talk.
I started with Python's re module. I wrote patterns like r"\$?\d+(\.\d{2})?" for amounts and r"(next|this) (Monday|Tuesday|...)" for dates. They worked on my test cases but failed on real data.
Regex is brittle. Every edge case required a new pattern. After 50 lines of regex, I was still missing half the extractions.
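To make the brittleness concrete, here is a small sketch that runs the two patterns quoted above against the example email from the start of this post. The weekday list is spelled out in full as an assumption (the original pattern elides it), and the helper names find_amounts and find_relative_day are only for illustration.

```python
import re
from typing import List, Optional

# Patterns quoted from the post; the full weekday list is an assumption.
AMOUNT = re.compile(r"\$?\d+(\.\d{2})?")
RELATIVE_DAY = re.compile(
    r"(next|this) (Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
)


def find_amounts(text: str) -> List[str]:
    """Return every substring the amount pattern matches."""
    return [m.group(0) for m in AMOUNT.finditer(text)]


def find_relative_day(text: str) -> Optional[str]:
    """Return the first 'next/this <weekday>' phrase, or None."""
    m = RELATIVE_DAY.search(text)
    return m.group(0) if m else None


email = (
    "Hey, can we schedule a meeting for next Tuesday at 3 PM "
    "to discuss the $500 invoice?"
)

print(find_amounts(email))                        # ['3', '$500'], the "3" of "3 PM" counts as an amount
print(find_amounts("about $1,200"))               # ['$1', '200'], the comma splits the match
print(find_relative_day(email))                   # next Tuesday
print(find_relative_day("Tuesday of next week"))  # None, same meaning but no match
```

Each of these can be patched with another pattern, which is exactly how the regex pile keeps growing.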
I thought I'd be smart and use spaCy's named entity recognition (NER). I loaded the en_core_web_lg model and applied it to each email. It found dates and money entities reasonably well, but the output still needed work: I ended up writing post-processing rules on top of spaCy. That was another rabbit hole.
I even tried fine-tuning a small BERT model on a dataset of 200 annotated emails. It was overkill. It took hours to train, and the results weren't much better than spaCy because the dataset was too small and diverse. I gave up on that after two days.
After hitting multiple dead ends, I stepped back and asked: "What's the most flexible way to extract structured data from free text?" The answer was an AI language model that can follow instructions and output JSON.
I built a lightweight agent that takes the raw text and a schema definition, then calls an LLM (I used OpenAI's API, but you can use any model with function calling) to extract the fields. The key was function calling: I defined a function that the model could "call" with the extracted parameters.
Here's the core approach:
import openai  # written against the legacy SDK interface (openai<1.0)
from pydantic import BaseModel, Field
from typing import Optional
import json


class EmailExtraction(BaseModel):
    # Every field defaults to None so a missing value does not fail validation.
    date: Optional[str] = Field(None, description="The date mentioned, in YYYY-MM-DD format. Use relative date resolution.")
    amount: Optional[float] = Field(None, description="Monetary amount mentioned, as a number.")
    person: Optional[str] = Field(None, description="Full name of the person mentioned.")
    purpose: Optional[str] = Field(None, description="Short description of the meeting or request purpose.")


def extract_email_data(text: str) -> EmailExtraction:
    response = openai.ChatCompletion.create(
        model="gpt-4-1106-preview",  # or gpt-3.5-turbo for faster/cheaper
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an extraction assistant. Extract structured fields from the user's email text. "
                    "If a field is not present, leave it as null. For relative dates like 'next Tuesday', "
                    "resolve them to an absolute date in YYYY-MM-DD format assuming today is 2024-03-20. "
                    "Output only the JSON matching the provided schema."
                )
            },
            {
                "role": "user",
                "content": f"Extract from this email:\n\n{text}"
            }
        ],
        functions=[
            {
                "name": "extract_email_fields",
                "description": "Extract structured fields from an email",
                # .schema() is the pydantic v1 method name (model_json_schema() in v2)
                "parameters": EmailExtraction.schema()
            }
        ],
        # Force the model to call the extraction function instead of replying in prose.
        function_call={"name": "extract_email_fields"}
    )
    # The function arguments arrive as a JSON string that still has to be parsed.
    function_call = response.choices[0].message.get("function_call")
    if function_call:
        args = json.loads(function_call.arguments)
        return EmailExtraction(**args)
    else:
        raise ValueError("Model did not call the extraction function")


email_text = """
Hi, I need to meet with Dr. Alice next Wednesday at 2pm to go over the $3000 proposal. Let me know if that works.
"""
result = extract_email_data(email_text)
# .json(indent=2) is the pydantic v1 call (model_dump_json(indent=2) in v2)
print(result.json(indent=2))
The beauty of this approach: you change the schema, and the model adapts. Adding a new field? Just add it to the Pydantic model. No regex rewrites, no pipeline changes.
For my internal tool, I deployed a small Flask app that hits the OpenAI API. I also tested it with a local model via Ollama (like llama3), but the extraction accuracy was lower: enough for prototyping, not production. If you want to try a similar endpoint, there are services like https://ai.interwestinfo.com/ that offer structured extraction endpoints (I used a hosted one to offload the LLM call). But the technique is the same regardless of provider.
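For the local-model route, here is a rough sketch of what a call to Ollama could look like using only the standard library. It assumes Ollama is running on its default port (11434) with llama3 already pulled. The prompt wording and the helper names build_payload and extract_with_ollama are illustrative assumptions, not the code from my tool.

```python
import json
import urllib.request

# Assumption: Ollama's chat endpoint at its default local address.
OLLAMA_URL = "http://localhost:11434/api/chat"

# Assumption: a simplified prompt; the wording is illustrative only.
SYSTEM_PROMPT = (
    "Extract date, amount, person and purpose from the email. "
    "Reply with a single JSON object using exactly those keys; "
    "use null for any field that is not present."
)


def build_payload(text: str, model: str = "llama3") -> dict:
    """Build the request body for a non-streaming, JSON-formatted chat call."""
    return {
        "model": model,
        "stream": False,   # return one complete response instead of chunks
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    }


def extract_with_ollama(text: str, model: str = "llama3") -> dict:
    """Send the email text to the local model and parse its JSON reply."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # The model's reply is a JSON string inside message.content.
    return json.loads(body["message"]["content"])
```

Calling extract_with_ollama(email_text) returns a plain dict, which can then be passed to the same Pydantic model for validation, so swapping providers only changes the transport.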
Next time, I'd start with the AI agent approach from day one, but also build a hybrid: use regex for the easy, high-confidence patterns (like email addresses), then fall back to the LLM for fuzzy extractions. I'd also add a validation layer to catch obvious LLM errors (e.g., date out of range).
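A minimal sketch of that hybrid idea, under a few assumptions: the email-address regex is deliberately simple, the one-year date window is an arbitrary threshold, and llm_extract stands for any callable that returns a dict of extracted fields (for example a thin wrapper around the extraction function above). The names validate and hybrid_extract are hypothetical.

```python
import re
from datetime import date
from typing import Callable, List

# Assumption: a deliberately simple pattern, good enough for a high-confidence first pass.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def validate(fields: dict, today: date, max_days: int = 365) -> List[str]:
    """Return a list of problems found in LLM-extracted fields (empty if none)."""
    problems = []
    raw_date = fields.get("date")
    if raw_date is not None:
        try:
            parsed = date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            problems.append(f"date is not YYYY-MM-DD: {raw_date!r}")
        else:
            # Assumption: anything more than max_days away from today is suspicious.
            if abs((parsed - today).days) > max_days:
                problems.append(f"date out of range: {raw_date}")
    amount = fields.get("amount")
    if amount is not None:
        if not isinstance(amount, (int, float)) or amount < 0:
            problems.append(f"invalid amount: {amount!r}")
    return problems


def hybrid_extract(text: str, llm_extract: Callable[[str], dict], today: date) -> dict:
    """Regex for the easy fields, LLM for the fuzzy ones, then a sanity check."""
    result = {"emails": EMAIL_RE.findall(text)}  # step 1: cheap and deterministic
    fuzzy = llm_extract(text)                    # step 2: fuzzy fields from the model
    result.update(fuzzy)
    result["problems"] = validate(fuzzy, today)  # step 3: flag obvious errors
    return result


# Stand-in for a real LLM call, returning a deliberately wrong year.
def fake_llm(text: str) -> dict:
    return {"date": "2034-03-27", "amount": 3000.0}


out = hybrid_extract(
    "Reach me at alice@example.com about the $3000 proposal.",
    fake_llm,
    today=date(2024, 3, 20),
)
print(out["emails"])    # ['alice@example.com']
print(out["problems"])  # ['date out of range: 2034-03-27']
```

Returning the problems instead of raising keeps the decision (retry, drop the field, or send to a human) with the caller.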
Also, I'd spend more time crafting the system prompt, since a good prompt reduces hallucination and improves accuracy dramatically.
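One concrete prompt tweak worth sketching: the prompt in the code above hard-codes "today is 2024-03-20", so relative dates would drift as soon as the calendar moves on. A possible fix is to build the prompt at call time and include the weekday as well. The helper name build_system_prompt is hypothetical, and whether the weekday hint actually helps the model is an assumption I have not measured.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def build_system_prompt(today: date) -> str:
    """Build the extraction prompt with the real current date instead of a fixed one."""
    weekday = WEEKDAYS[today.weekday()]  # date.weekday() returns 0 for Monday
    return (
        "You are an extraction assistant. Extract structured fields from the user's email text. "
        "If a field is not present, leave it as null. For relative dates like 'next Tuesday', "
        "resolve them to an absolute date in YYYY-MM-DD format assuming today is "
        f"{today.isoformat()} ({weekday}). "
        "Output only the JSON matching the provided schema."
    )


prompt = build_system_prompt(date(2024, 3, 20))
print(prompt)
```

In production you would pass date.today() (or the date the email was received, which is usually the better reference point for "next Tuesday").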
Have you tried using LLMs for data extraction? Or do you still swear by regex? I'd love to hear about your experiences, especially if you've found a good open-source model that matches GPT-4 for this task.