Last month, I spent three days wrestling with 500 PDF invoices. Each one had the same data (vendor name, invoice number, total amount), but the layouts were all over the place. Different fonts, missing headers, tables that somehow broke across pages. I tried regex. I tried OCR with layout analysis. I even tried building a rule-based parser that looked for keywords like "Total:".
Nothing worked reliably. Every time I fixed one pattern, another invoice broke. I was one commit away from throwing my laptop out the window.
Then I took a step back. I realized I didn't need to understand every layout variation. I just needed to understand the data. And that's where AI came in.
Let me be clear: I tried the usual suspects first.
Regex. Classic. I wrote patterns like r"Total\s*:\s*\$?(\d+\.\d{2})". Worked on 60% of invoices. The rest had "Total Due" or "Amount Total" or the dollar sign in a different place. Regex is great when you control the input. I didn't.
OCR with layout parsing. I used Tesseract with --psm 6 and tried to extract lines by bounding boxes. It helped a bit, but tables with merged cells or rotated text threw it off. Plus, I had to write code to guess which box was a field name and which was a value.
Rule-based parser. I built a dictionary of known vendors and their layouts. That worked … until I got an invoice from a new vendor. Maintenance became a nightmare.
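To make the regex failure mode concrete, here is a small sketch that runs the pattern quoted above against a few sample lines. The sample lines are invented for illustration (they are not from my real invoices), and find_total is just a hypothetical helper name.

```python
import re

# The pattern quoted above, unchanged.
TOTAL_RE = re.compile(r"Total\s*:\s*\$?(\d+\.\d{2})")

# Hypothetical sample lines, invented for illustration.
samples = [
    "Total: $123.45",        # the layout the pattern was written for
    "Total Due: $123.45",    # extra word between "Total" and the colon
    "Amount Total $123.45",  # no colon at all
    "Total: 1,234.56 USD",   # thousands separator breaks \d+\.\d{2}
]

def find_total(line):
    """Return the captured amount as a string, or None if the pattern misses."""
    match = TOTAL_RE.search(line)
    return match.group(1) if match else None

for line in samples:
    print(f"{line!r:28} -> {find_total(line)}")
```

Only the first line matches. Each of the others needs its own tweak to the pattern, which is exactly the whack-a-mole loop I was stuck in.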
I was solving the wrong problem. Instead of fighting formatting, I needed to focus on meaning.
I remembered that large language models are surprisingly good at understanding context. If I could give the model the raw text from a PDF and a description of what I wanted, maybe it could extract the fields directly.
Here’s the core idea: treat extraction as a structured generation task. Provide a prompt with a few examples (few-shot) or just describe the schema, and let the model output JSON.
I found an API that did exactly this with a simple HTTP call. (Full disclosure: I used Interwest AI because it had a free tier and a straightforward endpoint.) But the technique works with any LLM that supports function calling or JSON mode—OpenAI, Anthropic, local models, etc.
I used PyMuPDF (fitz) to grab all text in order.
import fitz  # PyMuPDF

def extract_text(pdf_path):
    # Concatenate the plain text of every page, in page order.
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text
This gives me a big string, with no table structure. That’s fine—the AI will figure it out.
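I defined a clear schema for the output I wanted. It is an informal description of the JSON shape (field names mapped to type hints), not a formal JSON Schema document.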
extraction_schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "invoice_date": "YYYY-MM-DD string",
    "total_amount": "number (float)",
    "currency": "string (e.g. USD, EUR)",
    "line_items": [{"description": "string", "quantity": "number", "unit_price": "number", "amount": "number"}]
}
Then I wrote a system prompt that explains the task and embeds the schema. (The snippet here shows the schema-only version; the two examples I provided are not reproduced.)
import json  # needed for json.dumps in the prompt

system_prompt = f"""
You are a data extraction assistant. Extract the requested fields from the invoice text below.
Output ONLY valid JSON matching this schema:
{json.dumps(extraction_schema, indent=2)}
If a field is missing, use null. For line_items, include all items mentioned in the invoice.
"""
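Since the prompt above only describes the schema, here is one way few-shot examples could be added: as earlier user/assistant turns in the chat-style role/content message list that the request payload uses. This is a sketch under assumptions, not my exact setup. The two sample invoices are invented, and build_messages and FEW_SHOT_EXAMPLES are hypothetical names.

```python
import json

# Hypothetical worked examples, invented for illustration. Real ones would be
# taken from invoices you have already checked by hand.
FEW_SHOT_EXAMPLES = [
    (
        "ACME Supplies\nInvoice # A-1001\nDate: 2024-01-15\nTotal Due: $250.00",
        {"vendor_name": "ACME Supplies", "invoice_number": "A-1001",
         "invoice_date": "2024-01-15", "total_amount": 250.0,
         "currency": "USD", "line_items": []},
    ),
    (
        "Globex GmbH\nInvoice No. 77\n03.02.2024\nAmount Total EUR 99.90",
        {"vendor_name": "Globex GmbH", "invoice_number": "77",
         "invoice_date": "2024-02-03", "total_amount": 99.9,
         "currency": "EUR", "line_items": []},
    ),
]

def build_messages(system_prompt, invoice_text, examples=FEW_SHOT_EXAMPLES):
    """Build a message list: system prompt, worked examples, then the real invoice."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_text, expected in examples:
        # Each example is a user turn followed by the ideal assistant answer.
        messages.append({"role": "user", "content": f"Invoice text:\n\n{example_text}"})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": f"Invoice text:\n\n{invoice_text}"})
    return messages
```

The returned list can be passed as the "messages" value of the request. Picking examples with deliberately different layouts (a "Total Due" label, a non-ISO date) shows the model the kind of normalization you expect.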
Here’s the function that sends the document text and prompt to the AI API.
import requests
import json

API_URL = "https://ai.interwestinfo.com/v1/extract"  # or any compatible endpoint
API_KEY = "your_key_here"

def extract_invoice_data(text):
    """Send the invoice text to the API and return the model's raw JSON string."""
    payload = {
        "model": "gpt-4o-mini",  # or whatever model the API supports
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Invoice text:\n\n{text}"},
        ],
        "response_format": {"type": "json_object"},
    }
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    # A timeout keeps one stalled request from hanging the whole batch.
    resp = requests.post(API_URL, json=payload, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
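AI outputs aren’t perfect. I added a validation step that checks for required fields. It returns None when the JSON is malformed, which is the caller's signal to retry with a simpler prompt, or when a critical field is missing, which sends the invoice to manual review.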
import json

def validate_and_parse(raw_output):
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        print("AI returned invalid JSON. Falling back to retry...")
        return None
    if data.get("total_amount") is None:
        print("Missing critical field. Marking for manual review.")
        return None
    return data
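The retry itself can be sketched roughly like this. It is a minimal version under assumptions: call_model is a hypothetical callable that wraps the API request for a given prompt, and prompts is a list ordered from the detailed prompt to the simpler fallback.

```python
import json

def extract_with_retry(text, call_model, prompts):
    """Try each prompt in order until one returns parseable JSON.

    call_model(prompt, text) -> str is a hypothetical wrapper around the API
    call. Returns the parsed dict, or None when the invoice needs manual review.
    """
    for prompt in prompts:
        raw = call_model(prompt, text)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry with the next, simpler prompt
        if not isinstance(data, dict) or data.get("total_amount") is None:
            return None  # parsed, but the critical field is missing
        return data
    return None  # every prompt produced malformed JSON
```

Only malformed JSON triggers a retry here. A parsed response with no total goes straight to manual review, since asking again tends to invite a guessed value.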
Then I loop through all invoices, collect results, and export to CSV.
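A rough sketch of that loop, using only the standard library. The names process_folder and CSV_FIELDS are hypothetical, and extract_fn stands in for whatever function chains text extraction, the API call, and validation for one file. The nested line_items field is left out of the CSV here because it does not fit a flat row.

```python
import csv
import time
from pathlib import Path

# Flat columns only; line_items is nested and is dropped from the CSV.
CSV_FIELDS = ["file", "vendor_name", "invoice_number", "invoice_date",
              "total_amount", "currency"]

def process_folder(pdf_dir, out_csv, extract_fn, delay_seconds=3.0):
    """Run extract_fn on every PDF in pdf_dir and write one CSV row per success.

    extract_fn(path) -> dict | None is a hypothetical callable for one invoice.
    Returns the file names that need manual review.
    """
    needs_review = []
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=CSV_FIELDS, extrasaction="ignore")
        writer.writeheader()
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            data = extract_fn(pdf_path)
            if data is None:
                needs_review.append(pdf_path.name)
            else:
                writer.writerow({"file": pdf_path.name, **data})
            time.sleep(delay_seconds)  # crude rate limiting between requests
    return needs_review
```

Returning the failed file names keeps the manual-review pile separate from the CSV, so a partial run still leaves usable output.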
Out of 500 invoices, the AI correctly extracted all requested fields for 478 (95.6%). The remaining 22 had issues: usually a missing line item or a hallucinated date. I set those aside for manual review. Total time: ~45 minutes of processing (with a 3-second delay per request) plus 30 minutes of manual fixes. Way better than three days of regex.
This approach isn't magic. Here’s what I wish I’d known:
When not to use this: If you have thousands of identical PDFs with zero variation, a regex + OCR pipeline is cheaper and faster. If you need real-time extraction (<1 second), this isn’t it. Also, if you don’t have a clear schema or need nested relationships, the AI can get confused.
This whole experience changed how I approach extractive tasks. Instead of trying to reverse-engineer every layout, I let the model understand the content. It’s not perfect, but it’s the closest thing to a universal parser I’ve seen.
What messy data problem are you dealing with right now? Have you tried throwing an LLM at it? I’d love to hear your war stories.