AI Document Processing in Production: Full Pipeline Guide

A developer built a production-grade AI document processing pipeline that handles PDF invoices, contracts, and bank statements at scale. The pipeline uses pdfplumber for text extraction, falls back to OCR for scanned documents, and includes page splitting, table extraction, and structured model calls to overcome common failures like token limits, scanned documents, table misalignment, and high costs.

Someone emails you a PDF invoice. You want to extract the vendor name, line items, total amount, currency, and due date, automatically, at scale, without manual keying. You call the OpenAI API, pass the PDF as base64, get a JSON blob back. It works. You ship it.

Then reality arrives: a scanned invoice from a vendor who still uses a physical stamp. A 60-page contract where the key clause is on page 47. A table-heavy bank statement where amounts bleed across column boundaries. A PDF that's actually an image with no embedded text at all.

The naive approach collapses on all of them. Here's the production architecture that doesn't.

The simplest version (encode the whole PDF, send it to GPT, ask it to return JSON) fails in four common ways:

Token limits. A 50-page contract is roughly 25,000–40,000 tokens of text, plus image tokens if you're sending page renders. Most model context windows can technically hold that, but accuracy degrades on long documents. The model loses track of structure. Extraction quality on page 45 is noticeably worse than on page 2.

Scanned documents. A PDF with no embedded text layer is just a sequence of images. No amount of prompting extracts text that isn't there. You need OCR. This affects more documents than you expect: expense receipts, legacy contracts, anything printed and scanned, anything generated by certain accounting systems.

Tables. Tables are the hardest part of PDF extraction.
Embedded text in a PDF doesn't encode column relationships; the text objects are just positioned by x/y coordinates. A naive extraction reads the text linearly and loses the table structure entirely. Line items from an invoice become a flat list with no mapping between description, quantity, and amount.

Cost at scale. Sending a 10MB PDF as a single API call costs real money and consumes tokens inefficiently. Most of that context is headers, footers, boilerplate legal text, and page numbers. The fields you actually need are in 5% of the document.

Input PDF
→ File validation (size, type, not encrypted)
→ Text extraction attempt (pdfplumber / pypdf)
→ Quality check: does extracted text look usable?
→ If not: OCR pipeline (Tesseract / Textract / Document AI)
→ Page splitting + relevant-page detection
→ Table extraction (if document type warrants it)
→ Structured model call (JSON mode / tool use)
→ Output validation (schema check + cross-field rules)
→ Confidence scoring
→ Storage or human-review queue

Each stage has a failure mode. Each one needs its own handling.

Start with embedded text: it's faster, cheaper, and more accurate than OCR when available. I use pdfplumber for extraction in Python. It handles text positioning better than pypdf and gives you bounding box data that's useful for table detection:

```python
# extract.py
from __future__ import annotations

import pdfplumber


def extract_text_by_page(pdf_path: str) -> list[dict]:
    """
    Extract text from each page.
    Returns a list of dicts with page number, raw text, and a usability flag.
    """
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text(x_tolerance=3, y_tolerance=3) or ""
            word_count = len(text.split())
            pages.append(
                {
                    "page": i + 1,
                    "text": text,
                    "word_count": word_count,
                    # Heuristic: fewer than 30 words on a non-blank page = likely scanned
                    "needs_ocr": word_count < 30 and len(page.images) > 0,
                }
            )
    return pages


def extract_tables(pdf_path: str, page_number: int) -> list[list[list[str]]]:
    """
    Extract tables from a specific page (1-indexed).
    Returns a list of tables, each table being a list of rows,
    each row being a list of cell strings.
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        tables = page.extract_tables()
    # Normalize: replace None cells with empty string
    return [
        [[cell or "" for cell in row] for row in table]
        for table in (tables or [])
    ]
```

The per-page `needs_ocr` flag is the key decision point. A hybrid document (mostly text with one scanned attachment page) gets OCR applied only to the scanned pages, not the whole file.

For OCR there are three options, each with different tradeoffs:

Tesseract: open source, free, runs locally. Accuracy is acceptable for clean scans, poor for low-resolution, skewed, or multi-column layouts. Good default for low-volume pipelines where you control the document source.

AWS Textract: purpose-built for documents. Handles tables natively, returns structured output with bounding boxes and confidence scores per word. Costs $0.0015 per page for basic detection, $0.015 for table/form extraction. The table extraction is worth the cost for invoices and financial statements.

Google Document AI: strongest accuracy for complex layouts and multilingual documents. More expensive than Textract, but noticeably better on documents with mixed scripts or unusual formatting.

My heuristic: Tesseract for internal tooling and prototypes, Textract for invoices and financial documents, Document AI if you're processing government documents or multi-language content.
For Textract integration from Python:

```python
# ocr_textract.py
from __future__ import annotations

import boto3


def ocr_page_with_textract(image_bytes: bytes) -> dict:
    """
    Run a single page image through AWS Textract.
    Returns the detected lines with text, confidence, and bounding box,
    plus the joined text and the lowest line confidence.
    """
    client = boto3.client("textract", region_name="eu-west-1")
    response = client.detect_document_text(Document={"Bytes": image_bytes})

    lines: list[dict] = []
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            lines.append(
                {
                    "text": block["Text"],
                    "confidence": block["Confidence"],
                    "bbox": block["Geometry"]["BoundingBox"],
                }
            )

    return {
        "lines": lines,
        "full_text": "\n".join(b["text"] for b in lines),
        # default=0 covers pages where Textract finds no lines at all
        "min_confidence": min((b["confidence"] for b in lines), default=0),
    }
```

For table extraction specifically, use `analyze_document` with `FeatureTypes=["TABLES"]`. This is a separate call and a higher per-page cost, but it's the only reliable way to get structured table data from scanned documents.

For a 60-page contract, you don't send all 60 pages to the model. You find the pages that contain the data you need.

The approach: keyword scoring per page. For invoice extraction, I score pages by the presence of terms like "invoice", "total", "amount", "due date", "vendor", and their localized equivalents. The top-scoring 3–5 pages get sent to the model. For contracts, I look for "payment terms", "effective date", "party", "agrees to".

```python
# relevance.py
from __future__ import annotations

INVOICE_KEYWORDS = [
    "invoice", "total", "amount due", "subtotal", "tax", "vat",
    "due date", "payment terms", "bill to", "vendor", "supplier",
    # Finnish (common in EU processing)
    "lasku", "summa", "eräpäivä", "alv",
]


def score_page_relevance(text: str, keywords: list[str]) -> float:
    """
    Returns a 0.0–1.0 relevance score for a page given target keywords.
    A page that matches 30% of the keywords already scores 1.0.
    """
    if not text:
        return 0.0
    text_lower = text.lower()
    matches = sum(1 for kw in keywords if kw in text_lower)
    return min(matches / max(len(keywords) * 0.3, 1), 1.0)


def select_relevant_pages(
    pages: list[dict],
    keywords: list[str],
    max_pages: int = 5,
) -> list[dict]:
    """Return up to max_pages pages, best first, dropping near-zero scores."""
    scored = [
        {**page, "relevance": score_page_relevance(page["text"], keywords)}
        for page in pages
    ]
    scored.sort(key=lambda p: p["relevance"], reverse=True)
    return [p for p in scored[:max_pages] if p["relevance"] > 0.05]
```

This alone cuts token consumption by 70–80% on long documents while keeping extraction quality the same or better, because the model isn't distracted by irrelevant content. Keyword scoring is a pragmatic baseline; for higher recall and precision, embedding-based retrieval per page can replace it, but at higher cost and complexity.

The structured model call is where most tutorials go wrong. They send a prompt like "extract the invoice fields and return JSON". The model returns JSON most of the time. Sometimes it returns JSON wrapped in a markdown code fence. Sometimes it adds commentary before the JSON. Sometimes it invents fields. You end up writing a fragile parser on top of an unpredictable output.

Use JSON mode (OpenAI) or tool use. These are not optional conveniences. They're the difference between a system that works reliably and one that works most of the time.
Here is the TypeScript extraction call using the Vercel AI SDK with a Zod schema enforced via tool use:

```ts
// lib/documents/extract-invoice.ts
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";

const InvoiceSchema = z.object({
  vendor_name: z.string().describe("The name of the company issuing the invoice"),
  vendor_vat_number: z
    .string()
    .nullable()
    .describe("VAT registration number if present, null otherwise"),
  invoice_number: z.string().describe("The invoice reference number"),
  invoice_date: z
    .string()
    .describe("Date the invoice was issued, ISO 8601 format if possible"),
  due_date: z
    .string()
    .nullable()
    .describe("Payment due date, ISO 8601 format if possible, null if not found"),
  currency: z.string().describe("ISO 4217 currency code, e.g. EUR, USD, GBP"),
  subtotal: z
    .number()
    .nullable()
    .describe("Pre-tax amount as a number, null if not found"),
  tax_amount: z
    .number()
    .nullable()
    .describe("Tax/VAT amount as a number, null if not found"),
  total_amount: z.number().describe("Total amount due as a number"),
  line_items: z.array(
    z.object({
      description: z.string(),
      quantity: z.number().nullable(),
      unit_price: z.number().nullable(),
      total: z.number().nullable(),
    })
  ),
  confidence: z
    .number()
    .min(0)
    .max(1)
    .describe("Your confidence in this extraction, 0.0 to 1.0"),
});

export type InvoiceExtraction = z.infer<typeof InvoiceSchema>;
```