Regex Hell to LLM Function Calling: My Data Extraction Journey

Source: dev.to · Published: 2026-06-05T08:01Z · Topic: large-language-models

A developer replaced a fragile regex-based PDF invoice extraction pipeline with LLM function calling, achieving reliable structured data output after a week of struggling with brittle pattern matching and OCR noise. The engineer used OpenAI's API with a defined JSON schema to extract invoice fields from 500+ PDFs with varying formats, eliminating the need for hardcoded templates and conditional logic.

A few months ago, I had a problem that made me question my career choices.

I was staring down a folder with 500+ PDF invoices. Each one had the same fields - invoice number, date, line items, totals - but the formatting was different on every single one. Some had tables, some had columns, some were just text blocks with a dash separator. The client wanted all of it in a clean JSON array.

At first, I thought: "Regex. I'll write a few patterns, test them, and be done in a day." That was naive.

I spent two days writing regexes that worked for 80% of the documents. The remaining 20% had edge cases - missing fields, extra whitespace, merged cells - that broke everything. Every fix for one edge case broke another. I ended up with a sprawling Python script full of conditional logic that was impossible to maintain.
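To make the brittleness concrete, here is a small sketch of how a pattern tuned to one layout silently misses others. The pattern and the sample strings are hypothetical illustrations, not taken from the actual invoices.

```python
import re

# Hypothetical pattern tuned to a single layout: "Invoice No: INV-004211"
INVOICE_RE = re.compile(r"Invoice No:\s*(INV-\d{6})")

# Hypothetical snippets showing the same field in three layouts
samples = [
    "Invoice No: INV-004211",      # the layout the pattern was written for
    "Invoice #INV-004212",         # different label
    "INVOICE NUMBER\nINV-004213",  # upper-case label, value on the next line
]

def find_invoice_number(text):
    """Return the captured invoice number, or None when the layout differs."""
    match = INVOICE_RE.search(text)
    return match.group(1) if match else None

results = [find_invoice_number(sample) for sample in samples]
print(results)  # ['INV-004211', None, None]
```

Each missed layout needs its own alternation or special case, which is how the conditional logic piles up.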

Then I tried extracting text positions and matching templates. Same problem. The fonts, alignment, and indentation varied too much. Hardcoding coordinates was doomed.

When the PDFs were scanned images, OCR added another layer of noise - typos, misrecognized characters. I spent more time cleaning up OCR output than extracting data.

After a week, I had a fragile pipeline that still spat out errors on almost every invoice. I was ready to tell the client this was impossible.

Then I heard about using LLMs for structured data extraction - not just summarization, but actual JSON output through function calling. I was skeptical. LLMs are probabilistic; how could they reliably extract exact invoice numbers?

But I decided to try with a small subset. I wrote a Python script that sent each PDF's text (extracted via pdfplumber) to OpenAI's API with a function call definition for the invoice schema. The idea: instead of parsing text with brittle patterns, let the LLM understand the semantics and map fields.

Here's what the function call looked like:

import json
import openai
import pdfplumber

def extract_invoice_data(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for a page with no text layer, so fall back to ""
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    functions = [
        {
            "name": "store_invoice",
            "description": "Store extracted invoice data",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_number": {
                        "type": "string",
                        "description": "The invoice number (e.g., INV-12345)"
                    },
                    "date": {
                        "type": "string",
                        "description": "Invoice date in YYYY-MM-DD format"
                    },
                    "vendor": {
                        "type": "string"
                    },
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "quantity": {"type": "integer"},
                                "unit_price": {"type": "number"},
                                "amount": {"type": "number"}
                            },
                            "required": ["description", "amount"]
                        }
                    },
                    "total": {
                        "type": "number"
                    }
                },
                "required": ["invoice_number", "date", "total"]
            }
        }
    ]

    # Legacy call style from the pre-1.0 openai SDK
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": f"Extract invoice data from this text:\n\n{text}"}
        ],
        functions=functions,
        function_call={"name": "store_invoice"}
    )

    return json.loads(response.choices[0].message["function_call"]["arguments"])
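The snippet above is written against the pre-1.0 openai SDK. In the 1.x SDK the same request is expressed with `tools` and `tool_choice` on `client.chat.completions.create`. The following is a hedged sketch of that translation, not the code the author ran. The helper names `to_tools`, `forced_tool_choice`, and `extract_with_tools` are made up for illustration, and the function assumes an `OPENAI_API_KEY` in the environment.

```python
import json

def to_tools(functions):
    """Wrap legacy `functions` entries in the `tools` format used by openai>=1.0."""
    return [{"type": "function", "function": fn} for fn in functions]

def forced_tool_choice(name):
    """Build a tool_choice value that forces the model to call one named function."""
    return {"type": "function", "function": {"name": name}}

def extract_with_tools(text, functions, model="gpt-4"):
    """Send invoice text and return the parsed arguments of the forced tool call."""
    # Imported lazily so the helpers above work without the SDK installed.
    from openai import OpenAI  # requires openai>=1.0

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": f"Extract invoice data from this text:\n\n{text}"}
        ],
        tools=to_tools(functions),
        tool_choice=forced_tool_choice(functions[0]["name"]),
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)
```

The schema dictionary itself does not change. Only the wrapper around it and the path to the returned arguments differ.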

I ran this on 20 invoices. The first ten were perfect. Numbers matched, dates were correct, even the tricky line items with missing quantities got flagged as null. I was stunned.
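Running the extractor over a whole folder and collecting the results into one JSON array, as the client asked for, might look like the sketch below. `extract_folder` is a hypothetical helper. It takes the extraction function as a parameter so that one unreadable file does not stop the batch.

```python
import json
from pathlib import Path

def extract_folder(folder, extract, output_path):
    """Run `extract` on every PDF in `folder` and write the records as a JSON array.

    `extract` is any callable that takes a file path and returns a dict.
    Returns (records, failures) so failed files can be reviewed by hand.
    """
    records, failures = [], []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            record = extract(str(pdf_path))
        except Exception as exc:  # keep going; record the failure for later review
            failures.append({"file": pdf_path.name, "error": str(exc)})
            continue
        record["source_file"] = pdf_path.name  # trace each record back to its PDF
        records.append(record)

    Path(output_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records, failures
```

A call such as `extract_folder("invoices", extract_invoice_data, "invoices.json")` would then process the full set, where both paths are placeholders.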

But then the next ten had problems - the LLM hallucinated a vendor name, or misread a decimal. So I added post-processing validation. And then I learned about structured output formats - specifically JSON mode or constrained decoding - to reduce hallucinations.
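The post does not show its validation code, so the following is only a sketch of what such post-processing might check. `validate_invoice` is a hypothetical name. The invoice-number pattern is the one quoted in this post, and the line-item cross-check assumes the total equals the sum of line-item amounts, which will not hold when tax or shipping is listed separately.

```python
import re
from datetime import datetime

INVOICE_NUMBER_RE = re.compile(r"INV-\d{6}")  # pattern quoted in the post

def validate_invoice(data, tolerance=0.01):
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    # Fields the schema marks as required
    for field in ("invoice_number", "date", "total"):
        if data.get(field) is None:
            problems.append(f"missing required field: {field}")

    number = data.get("invoice_number")
    if isinstance(number, str) and not INVOICE_NUMBER_RE.fullmatch(number):
        problems.append(f"unexpected invoice number format: {number!r}")

    date = data.get("date")
    if isinstance(date, str):
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            problems.append(f"date is not YYYY-MM-DD: {date!r}")

    # Assumption: total == sum of line-item amounts (no separate tax or shipping)
    items = data.get("line_items") or []
    total = data.get("total")
    if items and isinstance(total, (int, float)):
        item_sum = sum(item.get("amount") or 0 for item in items)
        if abs(item_sum - total) > tolerance:
            problems.append(f"line items sum to {item_sum:.2f}, total is {total:.2f}")

    return problems
```

Returning a list of problems rather than a boolean makes it easier to log why a record was rejected. A misread decimal, for instance, tends to show up as a line-item mismatch.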

For example, a validation check can require the invoice number to match the pattern INV-\d{6}. Reject and retry if not.

One tool that helped me prototype faster was the Interwest AI extraction platform. It essentially wraps this pattern - you upload a template, define fields, and it uses an LLM backend. But the approach is what matters: semantic extraction via function calling.
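Returning to the reject-and-retry step: a minimal sketch of that loop, under the assumption that a failed check simply triggers another extraction attempt up to a fixed limit. `extract_with_retry` and `has_valid_invoice_number` are hypothetical names, and the extractor is passed in as a callable.

```python
import re

INVOICE_NUMBER_RE = re.compile(r"INV-\d{6}")  # pattern quoted in the post

def has_valid_invoice_number(data):
    """True when the record carries an invoice number in the expected format."""
    number = data.get("invoice_number")
    return isinstance(number, str) and INVOICE_NUMBER_RE.fullmatch(number) is not None

def extract_with_retry(extract, source, is_valid=has_valid_invoice_number, max_attempts=3):
    """Call `extract(source)` until the result passes `is_valid` or attempts run out."""
    last_result = None
    for _ in range(max_attempts):
        last_result = extract(source)
        if is_valid(last_result):
            return last_result
    raise ValueError(
        f"no valid extraction after {max_attempts} attempts; last result: {last_result!r}"
    )
```

Capping the attempts matters because every retry is another paid API call, and an invoice that fails three times probably needs a human look rather than a fourth attempt.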

LLM function calling turned a near-impossible task into a weekend project. It's not a silver bullet - you still need validation, error handling, and a clear schema. But it's dramatically better than trying to write rules for every edge case.

What's your approach to extracting messy data? Have you tried this technique, or do you stick with more traditional methods? I'd love to hear what's worked for you.
