[ARTICLE · art-22327] src=dev.to pub= topic=large-language-models verified=true sentiment=↑ positive

Regex Hell to LLM Function Calling: My Data Extraction Journey

A developer replaced a fragile regex-based PDF invoice extraction pipeline with LLM function calling, achieving reliable structured data output after a week of struggling with brittle pattern matching and OCR noise. The engineer used OpenAI's API with a defined JSON schema to extract invoice fields from 500+ PDFs with varying formats, eliminating the need for hardcoded templates and conditional logic.

3 min read · published Jun 5, 2026

A few months ago, I had a problem that made me question my career choices.

I was staring down a folder with 500+ PDF invoices. Each one had the same fields - invoice number, date, line items, totals - but the formatting was different on every single one. Some had tables, some had columns, some were just text blocks with a dash separator. The client wanted all of it in a clean JSON array.

At first, I thought: "Regex. I'll write a few patterns, test them, and be done in a day." That was naive.

I spent two days writing regexes that worked for 80% of the documents. The remaining 20% had edge cases - missing fields, extra whitespace, merged cells - that broke everything. Every fix for one edge case broke another. I ended up with a sprawling Python script full of conditional logic that was impossible to maintain.
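To make the failure mode concrete, here is a minimal sketch of how a pattern tuned to one layout silently misses the same field in another. The sample strings and the `find_invoice_number` helper are invented for illustration; they are not taken from the real invoices.

```python
import re

# Hypothetical pattern written against one layout: "Invoice No: INV-000123"
INVOICE_NO = re.compile(r"Invoice No:\s*(INV-\d{6})")

def find_invoice_number(text):
    """Return the invoice number if the pattern matches, else None."""
    match = INVOICE_NO.search(text)
    return match.group(1) if match else None

# Made-up samples: the same field, three different presentations.
samples = [
    "Invoice No: INV-000123\nDate: 2024-01-15",   # the layout the pattern was written for
    "INVOICE #INV-000124 | Dated 15 Jan 2024",    # different label and casing
    "Invoice No : INV 000126",                    # OCR-style stray space and missing dash
]

for text in samples:
    print(repr(text), "->", find_invoice_number(text))
```

Each miss can be patched with another alternation or flag, but every patch widens the pattern and risks matching something it should not, which is the cycle described above.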

Then I tried extracting text positions and matching templates. Same problem. The fonts, alignment, and indentation varied too much. Hardcoding coordinates was doomed.

When the PDFs were scanned images, OCR added another layer of noise - typos, misrecognized characters. I spent more time cleaning up OCR output than extracting data.

After a week, I had a fragile pipeline that still spat out errors on almost every invoice. I was ready to tell the client this was impossible.

Then I heard about using LLMs for structured data extraction - not just summarization, but actual JSON output through function calling. I was skeptical. LLMs are probabilistic; how could they reliably extract exact invoice numbers?

But I decided to try with a small subset. I wrote a Python script that sent each PDF's text (extracted via pdfplumber) to OpenAI's API with a function call definition for the invoice schema. The idea: instead of parsing text with brittle patterns, let the LLM understand the semantics and map fields.

Here's what the function call looked like:

import json
import openai
import pdfplumber

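def extract_invoice_data(pdf_path):
    """Extract structured invoice fields from one PDF via LLM function calling."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for pages without a text layer
        # (e.g. scans) in some pdfplumber versions, so fall back to "".
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)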

    functions = [
        {
            "name": "store_invoice",
            "description": "Store extracted invoice data",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_number": {
                        "type": "string",
                        "description": "The invoice number (e.g., INV-12345)"
                    },
                    "date": {
                        "type": "string",
                        "description": "Invoice date in YYYY-MM-DD format"
                    },
                    "vendor": {
                        "type": "string"
                    },
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "quantity": {"type": "integer"},
                                "unit_price": {"type": "number"},
                                "amount": {"type": "number"}
                            },
                            "required": ["description", "amount"]
                        }
                    },
                    "total": {
                        "type": "number"
                    }
                },
                "required": ["invoice_number", "date", "total"]
            }
        }
    ]

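    # Note: openai.ChatCompletion is the pre-1.0 SDK interface; openai>=1.0
    # exposes the same call as client.chat.completions.create.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": f"Extract invoice data from this text:\n\n{text}"}
        ],
        functions=functions,
        # Force the model to call store_invoice instead of replying in free text
        function_call={"name": "store_invoice"}
    )

    # The function arguments arrive as a JSON string; parse them into a dict
    return json.loads(response.choices[0].message["function_call"]["arguments"])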

I ran this on 20 invoices. The first ten were perfect. Numbers matched, dates were correct, even the tricky line items with missing quantities got flagged as null. I was stunned.
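Running a small batch like this can be wrapped in a loop that collects results into the JSON array the client asked for. The sketch below is one way to do it, not the script from the original project: `extract_folder` is a hypothetical helper, and it takes the extractor as a parameter so that `extract_invoice_data` (or a stub) can be plugged in.

```python
import json
from pathlib import Path
from typing import Callable

def extract_folder(folder: str, extract: Callable[[str], dict]):
    """Run `extract` on every PDF in `folder`.

    Returns (results, failures): a list of extracted records and a dict
    mapping file names to error messages for the files that raised.
    """
    results, failures = [], {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            record = extract(str(pdf_path))
        except Exception as exc:  # keep going; one bad file should not stop the batch
            failures[pdf_path.name] = str(exc)
            continue
        record["source_file"] = pdf_path.name  # keep provenance for later review
        results.append(record)
    return results, failures

def save_results(results: list, out_path: str) -> None:
    """Write the extracted records as a single JSON array."""
    Path(out_path).write_text(json.dumps(results, indent=2), encoding="utf-8")
```

A call might look like `results, failures = extract_folder("invoices/", extract_invoice_data)` followed by `save_results(results, "invoices.json")`, where the folder and output names are placeholders. Recording failures separately makes it easier to see which documents need a second look.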

But then the next ten had problems - the LLM hallucinated a vendor name, or misread a decimal. So I added post-processing validation. And then I learned about structured output formats - specifically JSON mode or constrained decoding - to reduce hallucinations.
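One form of constrained decoding is a schema-enforced ("strict") mode, where the provider guarantees that the output conforms to the JSON schema. As far as I know, OpenAI's strict mode expects every property to be listed under `required` and `additionalProperties` to be `false`, with optional fields expressed as nullable types; treat that as an assumption and check the current provider documentation. Under that assumption, a schema like the one above could be tightened with a small helper (`to_strict_schema` is a hypothetical name):

```python
import copy

def _types(node: dict) -> list:
    """Return the node's declared JSON types as a list."""
    declared = node.get("type")
    return declared if isinstance(declared, list) else [declared]

def _tighten(node: dict) -> None:
    kinds = _types(node)
    if "object" in kinds:
        props = node.get("properties", {})
        required = set(node.get("required", []))
        for name, prop in props.items():
            # Optional fields become nullable so the model can say "not present"
            if name not in required and "type" in prop and "null" not in _types(prop):
                prop["type"] = _types(prop) + ["null"]
            _tighten(prop)
        node["required"] = list(props)        # every property must be listed
        node["additionalProperties"] = False  # no invented extra keys
    if "array" in kinds and isinstance(node.get("items"), dict):
        _tighten(node["items"])

def to_strict_schema(schema: dict) -> dict:
    """Return a tightened copy of `schema`; the input is left unchanged."""
    strict = copy.deepcopy(schema)
    _tighten(strict)
    return strict
```

Making optional fields nullable matters here: if `vendor` is required but not nullable, the model has to produce some string even when the invoice names no vendor, which invites exactly the kind of hallucination this is meant to reduce.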

On the validation side, one check is that the invoice number matches a pattern like INV-\d{6}. Reject and retry if not.

One tool that helped me prototype faster was the Interwest AI extraction platform. It essentially wraps this pattern - you upload a template, define fields, and it uses an LLM backend. But the approach is what matters: semantic extraction via function calling.
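The validate-then-retry step mentioned above can be sketched as follows. This is an illustration under assumptions, not the original post-processing code: `validate_invoice` and `extract_with_retry` are hypothetical helpers, the `INV-\d{6}` pattern is the one quoted in the post and should be adjusted to the real numbering scheme, and the extractor is passed in so any extraction function can be used.

```python
import re
from datetime import datetime
from typing import Callable

INVOICE_NO = re.compile(r"INV-\d{6}")         # pattern quoted in the post
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")  # YYYY-MM-DD, zero-padded

def _is_iso_date(value) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not DATE_SHAPE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

def validate_invoice(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if not INVOICE_NO.fullmatch(str(record.get("invoice_number", ""))):
        problems.append("invoice_number does not match the expected pattern")
    if not _is_iso_date(record.get("date")):
        problems.append("date is not a valid YYYY-MM-DD date")
    total = record.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        problems.append("total is missing or not a number")
    return problems

def extract_with_retry(pdf_path: str, extract: Callable[[str], dict],
                       max_attempts: int = 3) -> dict:
    """Call `extract` until the record validates or attempts run out."""
    problems = []
    for _ in range(max_attempts):
        record = extract(pdf_path)
        problems = validate_invoice(record)
        if not problems:
            return record
    raise ValueError(f"{pdf_path}: invalid after {max_attempts} attempts: {problems}")
```

Retrying only helps when the failure is random; if the same document fails every attempt, the raised error is the signal to route it to manual review rather than loop forever.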

LLM function calling turned a near-impossible task into a weekend project. It's not a silver bullet - you still need validation, error handling, and a clear schema. But it's dramatically better than trying to write rules for every edge case.

What's your approach to extracting messy data? Have you tried this technique, or do you stick with more traditional methods? I'd love to hear what's worked for you.
