AI Document Processing in Production: Full Pipeline Guide

A developer built a production-grade AI document processing pipeline that handles PDF invoices, contracts, and bank statements at scale. The pipeline uses pdfplumber for text extraction, falls back to OCR for scanned documents, and includes page splitting, table extraction, and structured model calls to overcome common failures like token limits, scanned documents, table misalignment, and high costs.

Someone emails you a PDF invoice. You want to extract the vendor name, line items, total amount, currency, and due date automatically, at scale, without manual keying. You call the OpenAI API, pass the PDF as base64, get a JSON blob back. It works. You ship it.

Then reality arrives: a scanned invoice from a vendor who still uses a physical stamp. A 60-page contract where the key clause is on page 47. A table-heavy bank statement where amounts bleed across column boundaries. A PDF that's actually an image with no embedded text at all.

The naive approach collapses on all of them. Here's the production architecture that handles them.

The simplest version (encode the whole PDF, send it to GPT, ask it to return JSON) fails in four common ways:

Token limits. A 50-page contract is roughly 25,000–40,000 tokens of text, plus image tokens if you're sending page renders. Most model context windows can technically hold that, but accuracy degrades on long documents. The model loses track of structure. Extraction quality on page 45 is noticeably worse than on page 2.

Scanned documents. A PDF with no embedded text layer is just a sequence of images. No amount of prompting extracts text that isn't there. You need OCR. This affects more documents than you expect: expense receipts, legacy contracts, anything printed and scanned, anything generated by certain accounting systems.

Tables. Tables are the hardest part of PDF extraction. Embedded text in a PDF doesn't encode column relationships; the text objects are just positioned by x/y coordinates. A naive extraction reads the text linearly and loses the table structure entirely. Line items from an invoice become a flat list with no mapping between description, quantity, and amount.

Cost at scale. Sending a 10MB PDF as a single API call costs real money and consumes tokens inefficiently. Most of that context is headers, footers, boilerplate legal text, and page numbers. The fields you actually need are in 5% of the document.

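
To put the token-limit and cost points in numbers, a rough pre-flight estimate is often enough. The sketch below is illustrative only: the function names are hypothetical, and the four-characters-per-token ratio is a common rule of thumb for English text, not a real tokenizer. For exact counts, use the tokenizer that matches your model.

```python
from __future__ import annotations

import math

# Rule of thumb for English text; an assumption, not a tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_document_tokens(page_texts: list[str]) -> int:
    """Sum the per-page estimates for a whole document."""
    return sum(estimate_tokens(text) for text in page_texts)


if __name__ == "__main__":
    # 50 pages of about 2,400 characters each (a made-up but plausible contract)
    pages = ["x" * 2400] * 50
    print(estimate_document_tokens(pages))
```

With these made-up page sizes the estimate lands inside the 25,000–40,000 range quoted above, which is a quick sanity check before deciding how many pages to send.
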
```text
Input PDF
  → File validation (size, type, not encrypted)
  → Text extraction attempt (pdfplumber / pypdf)
  → Quality check: does extracted text look usable?
  → If not: OCR pipeline (Tesseract / Textract / Document AI)
  → Page splitting + relevant-page detection
  → Table extraction (if document type warrants it)
  → Structured model call (JSON mode / tool use)
  → Output validation (schema check + cross-field rules)
  → Confidence scoring
  → Storage or human-review queue
```

Each stage has a failure mode. Each one needs its own handling.

Start with embedded text: it's faster, cheaper, and more accurate than OCR when available. I use pdfplumber for extraction in Python. It handles text positioning better than pypdf and gives you bounding box data that's useful for table detection:

```python
# extract.py
from __future__ import annotations

import pdfplumber


def extract_text_by_page(pdf_path: str) -> list[dict]:
    """
    Extract text from each page. Returns a list of dicts with
    page number, raw text, and a usability flag.
    """
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text(x_tolerance=3, y_tolerance=3) or ""
            word_count = len(text.split())
            pages.append({
                "page": i + 1,
                "text": text,
                "word_count": word_count,
                # Heuristic: fewer than 30 words on a non-blank page = likely scanned
                "needs_ocr": word_count < 30 and len(page.images) > 0,
            })
    return pages


def extract_tables(pdf_path: str, page_number: int) -> list[list[list[str]]]:
    """
    Extract tables from a specific page. Returns a list of tables,
    each table being a list of rows, each row being a list of cell strings.
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        tables = page.extract_tables()
        # Normalize: replace None cells with empty string
        return [
            [[cell or "" for cell in row] for row in table]
            for table in (tables or [])
        ]
```

The per-page `needs_ocr` flag is the key decision point. A hybrid document (mostly text with one scanned attachment page) gets OCR applied only to the scanned pages, not the whole file.

Three OCR options, each with different tradeoffs:

Tesseract: open source, free, runs locally. Accuracy is acceptable for clean scans, poor for low-resolution, skewed, or multi-column layouts. Good default for low-volume pipelines where you control the document source.

AWS Textract: purpose-built for documents. Handles tables natively, returns structured output with bounding boxes and confidence scores per word. Costs $0.0015 per page for basic detection, $0.015 for table/form extraction. The table extraction is worth the cost for invoices and financial statements.

Google Document AI: strongest accuracy for complex layouts and multilingual documents. More expensive than Textract, but noticeably better on documents with mixed scripts or unusual formatting.

My heuristic: Tesseract for internal tooling and prototypes, Textract for invoices and financial documents, Document AI if you're processing government documents or multi-language content.

For Textract integration from Python:

```python
# ocr_textract.py
from __future__ import annotations

import boto3


def ocr_page_with_textract(image_bytes: bytes) -> dict:
    """
    Run a single page image through AWS Textract.
    Returns raw blocks with type, text, and confidence.
    """
    client = boto3.client("textract", region_name="eu-west-1")
    response = client.detect_document_text(Document={"Bytes": image_bytes})

    lines: list[dict] = []
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            lines.append({
                "text": block["Text"],
                "confidence": block["Confidence"],
                "bbox": block["Geometry"]["BoundingBox"],
            })

    return {
        "lines": lines,
        "full_text": "\n".join(b["text"] for b in lines),
        "min_confidence": min((b["confidence"] for b in lines), default=0),
    }
```

For table extraction specifically, use `analyze_document` with `FeatureTypes=["TABLES"]`. This is a separate call with a higher per-page cost, but it's the only reliable way to get structured table data from scanned documents.

For a 60-page contract, you don't send all 60 pages to the model. You find the pages that contain the data you need.

The approach: keyword scoring per page. For invoice extraction, I score pages by the presence of terms like "invoice", "total", "amount", "due date", "vendor", and their localized equivalents. The top-scoring 3–5 pages get sent to the model. For contracts, I look for "payment terms", "effective date", "party", "agrees to".

```python
# relevance.py
from __future__ import annotations

INVOICE_KEYWORDS = [
    "invoice", "total", "amount due", "subtotal", "tax", "vat",
    "due date", "payment terms", "bill to", "vendor", "supplier",
    # Finnish (common in EU processing)
    "lasku", "summa", "eräpäivä", "alv",
]


def score_page_relevance(text: str, keywords: list[str]) -> float:
    """
    Returns a 0.0–1.0 relevance score for a page given target keywords.
    """
    if not text:
        return 0.0
    text_lower = text.lower()
    matches = sum(1 for kw in keywords if kw in text_lower)
    # Matching 30% of the keywords already counts as fully relevant
    return min(matches / max(len(keywords) * 0.3, 1), 1.0)


def select_relevant_pages(
    pages: list[dict],
    keywords: list[str],
    max_pages: int = 5,
) -> list[dict]:
    scored = [
        {**page, "relevance": score_page_relevance(page["text"], keywords)}
        for page in pages
    ]
    scored.sort(key=lambda p: p["relevance"], reverse=True)
    return [p for p in scored[:max_pages] if p["relevance"] > 0.05]
```

This alone cuts token consumption by 70–80% on long documents while keeping extraction quality the same or better, because the model isn't distracted by irrelevant content. Keyword scoring is a pragmatic baseline. For higher recall and precision, embedding-based retrieval per page can replace it, but at higher cost and complexity.

This is where most tutorials go wrong. They send a prompt like "extract the invoice fields and return JSON". The model returns JSON most of the time. Sometimes it returns JSON wrapped in a markdown code fence. Sometimes it adds commentary before the JSON. Sometimes it invents fields. You end up writing a fragile parser on top of an unpredictable output.

Use JSON mode (OpenAI) or tool use. These are not optional conveniences. They're the difference between a system that works reliably and one that works most of the time.

Here is the TypeScript extraction call using the Vercel AI SDK with a Zod schema enforced via tool use:

```ts
// lib/documents/extract-invoice.ts
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";

const InvoiceSchema = z.object({
  vendor_name: z.string().describe("The name of the company issuing the invoice"),
  vendor_vat_number: z
    .string()
    .nullable()
    .describe("VAT registration number if present, null otherwise"),
  invoice_number: z.string().describe("The invoice reference number"),
  invoice_date: z
    .string()
    .describe("Date the invoice was issued, ISO 8601 format if possible"),
  due_date: z
    .string()
    .nullable()
    .describe("Payment due date, ISO 8601 format if possible, null if not found"),
  currency: z.string().describe("ISO 4217 currency code, e.g. EUR, USD, GBP"),
  subtotal: z.number().nullable().describe("Pre-tax amount as a number, null if not found"),
  tax_amount: z.number().nullable().describe("Tax/VAT amount as a number, null if not found"),
  total_amount: z.number().describe("Total amount due as a number"),
  line_items: z.array(
    z.object({
      description: z.string(),
      quantity: z.number().nullable(),
      unit_price: z.number().nullable(),
      total: z.number().nullable(),
    })
  ),
  confidence: z
    .number()
    .min(0)
    .max(1)
    .describe("Your confidence in this extraction, 0.0 to 1.0"),
});

export type InvoiceExtraction = z.infer<typeof InvoiceSchema>;

export async function extractInvoice(
  pageTexts: string[],
  tableData: string[][][] = [],
  // Optional override so a caller can escalate to a stronger model on retry
  modelId: string = "gpt-4o-mini"
): Promise<InvoiceExtraction> {
  const context = pageTexts.join("\n\n---\n\n");
  const tableContext =
    tableData.length > 0
      ? "\n\nExtracted tables:\n" +
        tableData.map((t) => t.map((r) => r.join(" | ")).join("\n")).join("\n\n")
      : "";

  const { object } = await generateObject({
    model: openai(modelId),
    schema: InvoiceSchema,
    prompt: `Extract all invoice fields from the following document text.
Use null for fields not present in the document.
For currency, always return an ISO 4217 code.
For dates, normalize to YYYY-MM-DD if possible.
Set confidence to reflect how certain you are about the extraction overall.

Document:
${context}${tableContext}`,
  });

  return object;
}
```

The `confidence` field in the schema is deliberate: I ask the model to self-report its certainty. It's not always calibrated perfectly, but it's a useful first filter. Extractions with confidence < 0.6 go to the human review queue automatically.

Structured output from the model is not validated output. The model will comply with the schema, but it can still produce values that pass schema validation while being logically wrong.

```ts
// lib/documents/validate-invoice.ts
import type { InvoiceExtraction } from "./extract-invoice";

export interface ValidationResult {
  valid: boolean;
  errors: string[];
  warnings: string[];
}

export function validateInvoiceExtraction(
  data: InvoiceExtraction
): ValidationResult {
  const errors: string[] = [];
  const warnings: string[] = [];

  // Hard failures: this extraction is unreliable
  if (!data.vendor_name || data.vendor_name.trim().length < 2) {
    errors.push("vendor_name is missing or too short");
  }

  if (data.total_amount <= 0) {
    errors.push("total_amount must be greater than zero");
  }

  // Format check only: three uppercase letters
  if (!data.currency || !/^[A-Z]{3}$/.test(data.currency)) {
    errors.push(`currency '${data.currency}' is not a valid ISO 4217 code`);
  }

  // Cross-field logic
  if (data.subtotal !== null && data.tax_amount !== null) {
    const expectedTotal = data.subtotal + data.tax_amount;
    const delta = Math.abs(expectedTotal - data.total_amount);
    if (delta > 0.02) {
      // Tolerance for rounding
      errors.push(
        `subtotal (${data.subtotal}) + tax (${data.tax_amount}) = ${expectedTotal}, ` +
          `but total_amount is ${data.total_amount}. Delta: ${delta.toFixed(2)}`
      );
    }
  }

  if (data.total_amount > 0 && !data.currency) {
    errors.push("total_amount is present but currency is missing");
  }

  // Soft warnings: flag for review but don't reject
  if (data.line_items.length === 0) {
    warnings.push("no line items extracted, verify manually");
  }

  if (data.confidence < 0.6) {
    warnings.push(`low model confidence: ${data.confidence}`);
  }

  if (!data.due_date) {
    warnings.push("due_date not found, may need manual entry");
  }

  const vatPattern = /^[A-Z]{2}\d{8,12}$/;
  if (
    data.vendor_vat_number &&
    !vatPattern.test(data.vendor_vat_number.replace(/\s/g, ""))
  ) {
    warnings.push(
      `vendor_vat_number '${data.vendor_vat_number}' doesn't match expected format`
    );
  }

  return {
    valid: errors.length === 0,
    errors,
    warnings,
  };
}
```

The subtotal + tax = total cross-check catches more extraction errors than any other single rule. Models sometimes extract the subtotal as the total, or miss the tax component entirely.

Two strategies, not mutually exclusive:

Automated retry with a stronger model. If confidence < 0.7 and validation passes but has warnings, retry with `gpt-4o` instead of `gpt-4o-mini`. The cost difference is roughly 15×, so don't do this for all documents, only for those where the faster model flagged uncertainty. In practice this applies to 10–15% of documents and catches most of the edge cases.

Human-in-the-loop queue. If validation returns errors, or if the retry also produces low confidence, route to a review interface. The key is to pre-fill the UI with the extracted values: the reviewer confirms or corrects, they don't start from scratch. This makes human review fast enough to be operationally viable even at moderate volume.

```ts
// lib/documents/pipeline.ts
import { extractInvoice } from "./extract-invoice";
import { validateInvoiceExtraction } from "./validate-invoice";
import { db } from "@/lib/db";
// Assumed location of the table definitions; adjust to your own schema module
import { invoices, reviewQueue } from "@/lib/db/schema";

type ProcessingOutcome = "stored" | "retry" | "human_review";

export async function processInvoiceDocument(
  pageTexts: string[],
  tables: string[][][],
  documentId: string
): Promise<ProcessingOutcome> {
  let result = await extractInvoice(pageTexts, tables);
  let validation = validateInvoiceExtraction(result);

  // Retry with stronger model if first pass is uncertain.
  // Assumes extractInvoice accepts an optional model id as its third argument.
  if (!validation.valid || result.confidence < 0.7) {
    result = await extractInvoice(pageTexts, tables, "gpt-4o");
    validation = validateInvoiceExtraction(result);
  }

  // Still failing hard validation after the retry: hand over to a person
  if (!validation.valid) {
    await db.insert(reviewQueue).values({
      documentId,
      extractedData: result,
      validationErrors: validation.errors,
      validationWarnings: validation.warnings,
      status: "pending_review",
    });
    return "human_review";
  }

  await db.insert(invoices).values({
    documentId,
    ...result,
    processedAt: new Date(),
  });

  return "stored";
}
```

At low volume (under 100 documents/day), synchronous processing is fine. At scale, you need async. If this is part of a broader [automation workflow](https://iurii.rogulia.fi/services/automation-workflows), BullMQ fits naturally as the processing backbone.

Queue documents with BullMQ. Set concurrency based on your rate limits: OpenAI's Tier 2 allows 5,000 RPM for `gpt-4o-mini`. With 3 API calls per document (extraction + possible retry + any enrichment), you can process roughly 1,600 documents per minute at full throughput before hitting rate limits.

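
The throughput figure is plain division of the request limit by calls per document, and worker concurrency can be derived the same way. The Python sketch below is a back-of-the-envelope helper, not part of the pipeline: the function names, the 2-second average call latency, and the 80% headroom factor are assumptions for illustration. Token-per-minute limits may bind before the request limit does, so treat the result as an upper bound.

```python
from __future__ import annotations


def max_docs_per_minute(rpm_limit: int, calls_per_doc: int) -> int:
    """Upper bound on documents per minute given a requests-per-minute limit."""
    return rpm_limit // calls_per_doc


def suggested_workers(
    rpm_limit: int,
    avg_call_seconds: float,
    headroom: float = 0.8,
) -> int:
    """
    Number of workers that keeps the request rate under the limit when each
    worker has one API call in flight at a time:
    in-flight calls = (requests per second) * (seconds per call).
    """
    in_flight = (rpm_limit / 60) * avg_call_seconds * headroom
    return max(1, int(in_flight))


if __name__ == "__main__":
    print(max_docs_per_minute(5000, 3))        # 5,000 RPM, 3 calls per document
    print(suggested_workers(5000, 2.0))        # assumed 2 s average latency
```

The second number is what you would pass as the queue's concurrency setting, lowered further if you share the rate limit with other services.
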
Model selection by document type matters for cost:

| Document type | Model | Avg cost/doc | Notes |
|---|---|---|---|
| Clean digital invoice | gpt-4o-mini | ~$0.004 | High accuracy, fast |
| Complex table-heavy doc | gpt-4o | ~$0.06 | On retry/escalation only |
| Scanned + OCR needed | Textract + 4o-mini | ~$0.018 | Textract adds $0.015/page |
| Contract (10+ pages) | gpt-4o-mini | ~$0.012 | After page filtering |

The page relevance filtering from Stage 3 is the biggest cost lever. A 20-page document with 3 relevant pages costs 85% less than sending all 20 pages.

What breaks silently: extraction quality drift. The model doesn't throw an error. It just starts extracting `total_amount` less reliably as your document corpus evolves and new formats appear.

Log metrics for every processing run. Set an alert if the validation error rate crosses 5% in a rolling 1-hour window. That's your signal that a new document format is breaking the pipeline and needs a schema or prompt update.

pdfplumber hangs on encrypted PDFs. If the document is password-protected, the extraction call never returns. Add a timeout and catch the exception. Check for encryption before attempting extraction.

Scanned PDFs often have a text layer that's garbage. Some scan-to-PDF workflows produce PDFs with embedded text, but the text is from a failed OCR pass: garbled characters, incorrect word boundaries. My heuristic for detecting this: extract text, then count the ratio of alphabetic characters to total characters. Below 0.6, treat it as scanned regardless of what pdfplumber returns.

The model sometimes returns numbers with thousand separators, such as `"1,234.56"`, and Zod will reject it because the schema expects a number. Normalize numeric strings before schema validation: strip commas, and handle both `.` and `,` as decimal separators (European invoices use the comma as the decimal separator).

Date formats vary wildly: `14.03.2025`, `14/03/2025`, `March 14, 2025`, `2025-03-14`.
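
Three of the gotchas here (garbled text layers, thousand separators, and mixed date formats) come down to small normalization helpers. The following is a minimal Python sketch, not the pipeline's actual code: the function names, the day-first reading of `14/03/2025`, and the rule that a lone comma followed by exactly three digits is a thousands separator are assumptions made for illustration.

```python
from __future__ import annotations

import re
from datetime import datetime

# Day-first formats are assumed for slash and dot dates (EU convention).
DATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y", "%B %d, %Y"]


def looks_garbled(text: str, threshold: float = 0.6) -> bool:
    """True if the share of alphabetic characters falls below the threshold."""
    if not text:
        return True  # no text at all: treat like a scanned page
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) < threshold


def normalize_amount(raw: str) -> float:
    """Parse '1,234.56', '1.234,56' or '12,50' into a float. Raises ValueError on junk."""
    s = re.sub(r"[^\d,.\-]", "", raw)  # drop currency symbols and spaces
    last_comma, last_dot = s.rfind(","), s.rfind(".")
    if last_comma != -1 and last_dot != -1:
        # Both present: the rightmost one is the decimal separator
        if last_comma > last_dot:
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif last_comma != -1:
        # Only commas: one comma not followed by 3 digits is a decimal comma
        if s.count(",") == 1 and len(s) - last_comma - 1 != 3:
            s = s.replace(",", ".")
        else:
            s = s.replace(",", "")
    elif s.count(".") > 1:
        s = s.replace(".", "")  # '1.234.567' uses dots as thousands separators
    return float(s)


def parse_date(raw: str) -> str | None:
    """Try each known format in order; return ISO 8601 or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Values that come back as `None` or raise `ValueError` are good candidates for the review queue instead of a silent default. Month names are matched in the process locale, so `%B` only recognizes English names under an English or C locale.
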
The model normalizes most of these date formats, but it's worth adding a post-processing step that parses the extracted date string through `date-fns/parse` with multiple format attempts before storing.

Textract is slow for real-time use. Average latency is 3–8 seconds per page for `detect_document_text`. If you need synchronous responses, use Tesseract locally for a fast first pass and Textract in an async enrichment step.

I built this pipeline for internal document processing and as a reusable module across client projects.

If you're building document processing automation for EU businesses (invoice processing, contract extraction, compliance document workflows), the pipeline above is where you start. The naive "pass to GPT" approach works in a demo. Production requires the full stack: pre-processing, OCR fallbacks, structured schemas, validation, and a human-in-the-loop safety net for the edge cases.

I've deployed this end-to-end across several production systems, including [htpbe.tech](https://iurii.rogulia.fi/projects/htpbe-pdf-analysis) PDF forensics and [document automation pipelines](https://iurii.rogulia.fi/services/automation-workflows) for e-commerce and logistics clients. Document processing is not an AI problem. It's a systems problem with an AI component.

If you need a senior developer who can build AI document processing that actually works at scale, [get in touch](https://iurii.rogulia.fi/contact). I'm available for freelance projects and long-term engagements.

Related reading:

- [How I Detect Tampered PDFs in 9 Seconds](https://iurii.rogulia.fi/blog/detect-tampered-pdfs): forensic analysis of PDF structure for document authenticity verification.
- [How to Add AI to an Existing Product Without Rewriting It](https://iurii.rogulia.fi/blog/add-ai-without-rewriting): the three integration patterns and where document processing fits.