{"slug": "from-regex-hell-to-ai-how-i-finally-tamed-messy-pdf-invoices", "title": "From Regex Hell to AI: How I Finally Tamed Messy PDF Invoices", "summary": "A developer built an AI-powered system to extract structured data from messy PDF invoices, replacing unreliable regex and rule-based parsers. By using a large language model with few-shot prompting and JSON schema, the system achieved reliable extraction across varied layouts. The approach works with any LLM supporting function calling or JSON mode.", "body_md": "Last month, I spent three days wrestling with 500 PDF invoices. Each one had the same data (vendor name, invoice number, total amount), but the layouts were all over the place. Different fonts, missing headers, tables that somehow broke across pages. I tried regex. I tried OCR with layout analysis. I even tried building a rule-based parser that looked for keywords like \"Total:\".\n\nNothing worked reliably. Every time I fixed one pattern, another invoice broke. I was one commit away from throwing my laptop out the window.\n\nThen I took a step back. I realized I didn't need to understand every layout variation. I just needed to *understand the data*. And that's where AI came in.\n\nLet me be clear: I tried the usual suspects first.\n\n**Regex.** Classic. I wrote patterns like `r\"Total\\s*:\\s*\\$?(\\d+\\.\\d{2})\"`. That worked on 60% of invoices. The rest had \"Total Due\" or \"Amount Total\" or the dollar sign in a different place. Regex is great when you control the input. I didn't.\n\n**OCR with layout parsing.** I used Tesseract with `--psm 6` and tried to extract lines by bounding boxes. It helped a bit, but tables with merged cells or rotated text threw it off. Plus, I had to write code to guess which box was a field name and which was a value.\n\n**Rule-based parser.** I built a dictionary of known vendors and their layouts. That worked … until I got an invoice from a new vendor. 
Maintenance became a nightmare.\n\nI was solving the wrong problem. Instead of fighting formatting, I needed to focus on *meaning*.\n\nI remembered that large language models are surprisingly good at understanding context. If I could give the model the raw text from a PDF and a description of what I wanted, maybe it could extract the fields directly.\n\nHere’s the core idea: treat extraction as a structured generation task. Provide a prompt with a few examples (few-shot) or just describe the schema, and let the model output JSON.\n\nI found an API that did exactly this with a simple HTTP call. (Full disclosure: I used [Interwest AI](https://ai.interwestinfo.com/) because it had a free tier and a straightforward endpoint.) But the technique works with any LLM that supports function calling or JSON mode: OpenAI, Anthropic, local models, etc.\n\nI used `PyMuPDF` (fitz) to grab all the text in page order.\n\n``` python\nimport fitz  # PyMuPDF\n\ndef extract_text(pdf_path):\n    # The context manager closes the document when we are done.\n    with fitz.open(pdf_path) as doc:\n        # Concatenate the plain text of every page, in page order.\n        return \"\".join(page.get_text() for page in doc)\n```\n\nThis gives me one big string with no table structure. That’s fine; the AI will figure it out.\n\nI described the output I wanted as a simple schema: a Python dict that maps each field to its expected type.\n\n``` python\nextraction_schema = {\n    \"vendor_name\": \"string\",\n    \"invoice_number\": \"string\",\n    \"invoice_date\": \"YYYY-MM-DD string\",\n    \"total_amount\": \"number (float)\",\n    \"currency\": \"string (e.g. USD, EUR)\",\n    \"line_items\": [{\"description\": \"string\", \"quantity\": \"number\", \"unit_price\": \"number\", \"amount\": \"number\"}]\n}\n```\n\nThen I wrote a system prompt that explains the task and embeds the schema.\n\n``` python\nimport json\n\nsystem_prompt = f\"\"\"\nYou are a data extraction assistant. Extract the requested fields from the invoice text below.\nOutput ONLY valid JSON matching this schema:\n{json.dumps(extraction_schema, indent=2)}\n\nIf a field is missing, use null. 
For line_items, include all items mentioned in the invoice.\n\"\"\"\n```\n\nHere’s the function that sends the document text and prompt to the AI API.\n\n``` python\nimport requests\nimport json\n\nAPI_URL = \"https://ai.interwestinfo.com/v1/extract\"  # or any compatible endpoint\nAPI_KEY = \"your_key_here\"\n\ndef extract_invoice_data(text):\n    payload = {\n        \"model\": \"gpt-4o-mini\",  # or whatever model the API supports\n        \"messages\": [\n            {\"role\": \"system\", \"content\": system_prompt},\n            {\"role\": \"user\", \"content\": f\"Invoice text:\\n\\n{text}\"}\n        ],\n        # JSON mode: ask the API to return a single JSON object\n        \"response_format\": {\"type\": \"json_object\"}\n    }\n    headers = {\"Authorization\": f\"Bearer {API_KEY}\", \"Content-Type\": \"application/json\"}\n    # A timeout keeps one stalled request from hanging the whole batch\n    resp = requests.post(API_URL, json=payload, headers=headers, timeout=60)\n    resp.raise_for_status()\n    # The model's reply is a JSON string; it still has to be parsed\n    return resp.json()[\"choices\"][0][\"message\"][\"content\"]\n```\n\nAI outputs aren’t perfect. I added a validation step that parses the output and checks for the critical `total_amount` field. It returns `None` when the JSON is malformed or that field is missing, so the caller can retry with a simpler prompt or flag the invoice for manual review.\n\n``` python\nimport json\n\ndef validate_and_parse(raw_output):\n    # Returns the parsed dict, or None if the output is unusable\n    try:\n        data = json.loads(raw_output)\n    except json.JSONDecodeError:\n        print(\"AI returned invalid JSON. Flagging for retry.\")\n        return None\n    if data.get(\"total_amount\") is None:\n        print(\"Missing critical field. Marking for manual review.\")\n        return None\n    return data\n```\n\nThen I loop through all invoices, collect results, and export to CSV.\n\nOut of 500 invoices, the AI correctly extracted all requested fields for **478** (95.6%). The remaining 22 had issues: usually a missing line item or a hallucinated date. I set those aside for manual review. Total time: ~45 minutes of processing (with a 3-second delay per request) plus 30 minutes of manual fixes. Way better than three days of regex.\n\nThis approach isn't magic. 
Here’s what I wish I’d known:\n\n**When not to use this:** If you have thousands of identical PDFs with zero variation, a regex + OCR pipeline is cheaper and faster. If you need real-time extraction (<1 second), this isn’t it. Also, if you don’t have a clear schema or need nested relationships, the AI can get confused.\n\nThis whole experience changed how I approach extractive tasks. Instead of trying to reverse-engineer every layout, I let the model understand the *content*. It’s not perfect, but it’s the closest thing to a universal parser I’ve seen.\n\nWhat messy data problem are you dealing with right now? Have you tried throwing an LLM at it? I’d love to hear your war stories.", "url": "https://wpnews.pro/news/from-regex-hell-to-ai-how-i-finally-tamed-messy-pdf-invoices", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/from-regex-hell-to-ai-how-i-finally-tamed-messy-pdf-invoices-3p64", "published_at": "2026-06-28 10:00:45+00:00", "updated_at": "2026-06-28 10:03:34.548676+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "developer-tools", "natural-language-processing"], "entities": ["Interwest AI", "OpenAI", "Anthropic", "PyMuPDF", "Tesseract"], "alternates": {"html": "https://wpnews.pro/news/from-regex-hell-to-ai-how-i-finally-tamed-messy-pdf-invoices", "markdown": "https://wpnews.pro/news/from-regex-hell-to-ai-how-i-finally-tamed-messy-pdf-invoices.md", "text": "https://wpnews.pro/news/from-regex-hell-to-ai-how-i-finally-tamed-messy-pdf-invoices.txt", "jsonld": "https://wpnews.pro/news/from-regex-hell-to-ai-how-i-finally-tamed-messy-pdf-invoices.jsonld"}}