From Regex Hell to AI: How I Finally Tamed Messy PDF Invoices

A developer built an AI-powered system to extract structured data from messy PDF invoices, replacing unreliable regex and rule-based parsers. By using a large language model with few-shot prompting and a JSON schema, the system achieved reliable extraction across varied layouts. The approach works with any LLM that supports function calling or JSON mode.

Last month, I spent three days wrestling with 500 PDF invoices. Each one had the same data (vendor name, invoice number, total amount), but the layouts were all over the place. Different fonts, missing headers, tables that somehow broke across pages. I tried regex. I tried OCR with layout analysis. I even tried building a rule-based parser that looked for keywords like "Total:". Nothing worked reliably. Every time I fixed one pattern, another invoice broke. I was one commit away from throwing my laptop out the window.

Then I took a step back. I realized I didn't need to understand every layout variation. I just needed to understand the data. And that's where AI came in.

Let me be clear: I tried the usual suspects first.

Regex. Classic. I wrote patterns like `r"Total\s*:\s*\$?(\d+\.\d{2})"`. Worked on 60% of invoices. The rest had "Total Due" or "Amount Total" or the dollar sign in a different place. Regex is great when you control the input. I didn't.

OCR with layout parsing. I used Tesseract with `--psm 6` and tried to extract lines by bounding boxes. It helped a bit, but tables with merged cells or rotated text threw it off. Plus, I had to write code to guess which box was a field name and which was a value.

Rule-based parser. I built a dictionary of known vendors and their layouts. That worked... until I got an invoice from a new vendor. Maintenance became a nightmare.

I was solving the wrong problem. Instead of fighting formatting, I needed to focus on meaning. I remembered that large language models are surprisingly good at understanding context. If I could give the model the raw text from a PDF and a description of what I wanted, maybe it could extract the fields directly.

Here's the core idea: treat extraction as a structured generation task. Provide a prompt with a few examples (few-shot) or just describe the schema, and let the model output JSON. I found an API that did exactly this with a simple HTTP call.
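
To make the few-shot idea concrete, here is a rough sketch of how such a prompt could be assembled as a chat-style message list. It is an illustration, not the exact prompt from this project: the helper name `build_few_shot_messages`, the two-field schema, and the sample invoice strings are all made up for the example, and it assumes an API that accepts the common `role`/`content` message format.

```python
import json


def build_few_shot_messages(schema, examples, invoice_text):
    """Assemble instructions, worked examples, and the new invoice as chat messages.

    `examples` is a list of (invoice_text, expected_fields) pairs. Each pair is
    replayed as a user turn followed by the assistant's ideal JSON answer.
    """
    system = (
        "You are a data extraction assistant. "
        "Return ONLY valid JSON matching this schema:\n"
        + json.dumps(schema, indent=2)
    )
    messages = [{"role": "system", "content": system}]
    for example_text, example_fields in examples:
        messages.append({"role": "user", "content": f"Invoice text:\n\n{example_text}"})
        messages.append({"role": "assistant", "content": json.dumps(example_fields)})
    # The invoice we actually want parsed always goes last.
    messages.append({"role": "user", "content": f"Invoice text:\n\n{invoice_text}"})
    return messages


# Hypothetical data, only to show the shape of the result.
schema = {"vendor_name": "string", "total_amount": "number"}
examples = [
    ("ACME Corp\nTotal Due: $120.50", {"vendor_name": "ACME Corp", "total_amount": 120.5}),
]
messages = build_few_shot_messages(schema, examples, "Globex\nAmount Total 99.00 USD")
```

Putting the examples in as prior user/assistant turns, rather than pasting them into one long system prompt, keeps each example's input and expected output clearly paired. Whether that helps more than a schema-only prompt likely depends on the model, so treat it as something to try rather than a rule.
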

Full disclosure: I used Interwest AI (https://ai.interwestinfo.com/) because it had a free tier and a straightforward endpoint. But the technique works with any LLM that supports function calling or JSON mode: OpenAI, Anthropic, local models, etc.

I used PyMuPDF (`fitz`) to grab all text in order.

```python
import fitz  # PyMuPDF


def extract_text(pdf_path):
    """Return the plain text of every page, concatenated in page order."""
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text
```

This gives me a big string, with no table structure. That's fine; the AI will figure it out.

I defined a clear JSON schema for the output I wanted.

```python
# Informal description of the expected output: field name -> expected type.
extraction_schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "invoice_date": "YYYY-MM-DD string",
    "total_amount": "number (float)",
    "currency": "string (e.g. USD, EUR)",
    "line_items": [
        {"description": "string", "quantity": "number", "unit_price": "number", "amount": "number"}
    ],
}
```

Then I wrote a system prompt that explains the task and provides two examples.

```python
import json

system_prompt = f"""
You are a data extraction assistant. Extract the requested fields from the invoice text below.
Output ONLY valid JSON matching this schema:
{json.dumps(extraction_schema, indent=2)}

If a field is missing, use null. For line items, include all items mentioned in the invoice.
"""
```

Here's the function that sends the document text and prompt to the AI API.

```python
import requests

API_URL = "https://ai.interwestinfo.com/v1/extract"  # or any compatible endpoint
API_KEY = "your key here"


def extract_invoice_data(text):
    """Send the invoice text to the API and return the raw JSON string it produces."""
    payload = {
        "model": "gpt-4o-mini",  # or whatever model the API supports
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Invoice text:\n\n{text}"},
        ],
        "response_format": {"type": "json_object"},
    }
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    # A timeout keeps one stuck request from hanging the whole batch.
    resp = requests.post(API_URL, json=payload, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

AI outputs aren't perfect.
I added a validation step that checks for required fields and retries with a simpler prompt if the JSON is malformed.

```python
import json


def validate_and_parse(raw_output):
    """Return the parsed dict, or None if the output needs a retry or manual review."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        print("AI returned invalid JSON. Falling back to retry...")
        return None
    # Valid JSON that isn't an object (e.g. a bare list) is treated like a missing field.
    if not isinstance(data, dict) or data.get("total_amount") is None:
        print("Missing critical field. Marking for manual review.")
        return None
    return data
```

Then I loop through all invoices, collect results, and export to CSV.

Out of 500 invoices, the AI correctly extracted all requested fields for 478 (95.6%). The remaining 22 had issues: usually a missing line item or a hallucinated date. I set those aside for manual review.

Total time: ~45 minutes of processing (with a 3-second delay per request) plus 30 minutes of manual fixes. Way better than three days of regex.

This approach isn't magic. Here's what I wish I'd known:

When not to use this: If you have thousands of identical PDFs with zero variation, a regex + OCR pipeline is cheaper and faster. If you need real-time extraction (<1 second), this isn't it. Also, if you don't have a clear schema or need nested relationships, the AI can get confused.

This whole experience changed how I approach extractive tasks. Instead of trying to reverse-engineer every layout, I let the model understand the content. It's not perfect, but it's the closest thing to a universal parser I've seen.

What messy data problem are you dealing with right now? Have you tried throwing an LLM at it? I'd love to hear your war stories.
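
P.S. For anyone who wants to try the batch step (loop over invoices, collect results, export to CSV), here is a minimal sketch of what it could look like. It is an approximation, not the exact script: `process_invoices`, the `extract` callable, and the `FIELDNAMES` column list are hypothetical names chosen for the example. It assumes `extract(path)` returns a dict of fields on success and `None` on failure, and that nested line items are left out of the flat CSV.

```python
import csv
import time

# Flat columns only; nested line items are dropped from the CSV in this sketch.
FIELDNAMES = ["file", "vendor_name", "invoice_number", "invoice_date", "total_amount", "currency"]


def process_invoices(pdf_paths, extract, csv_path, delay_seconds=3.0, max_attempts=2):
    """Run `extract` on each path, write good rows to CSV, return paths needing review."""
    needs_review = []
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" silently skips keys that are not in FIELDNAMES.
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
        writer.writeheader()
        for path in pdf_paths:
            data = None
            for _ in range(max_attempts):
                data = extract(path)
                time.sleep(delay_seconds)  # simple pause between requests
                if data is not None:
                    break
            if data is None:
                needs_review.append(path)
            else:
                writer.writerow({"file": path, **data})
    return needs_review
```

Here `extract` could be a small wrapper that chains text extraction, the API call, and validation for a single file. Note that the retry in this sketch simply repeats the same call; switching to a simpler prompt on the second attempt would have to live inside that wrapper.
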