{"slug": "ai-document-processing-in-production-full-pipeline-guide", "title": "AI Document Processing in Production: Full Pipeline Guide", "summary": "A developer built a production-grade AI document processing pipeline that handles PDF invoices, contracts, and bank statements at scale. The pipeline uses pdfplumber for text extraction, falls back to OCR for scanned documents, and includes page splitting, table extraction, and structured model calls to overcome common failures like token limits, scanned documents, table misalignment, and high costs.", "body_md": "Someone emails you a PDF invoice. You want to extract the vendor name, line items, total amount, currency, and due date — automatically, at scale, without manual keying.\n\nYou call the OpenAI API, pass the PDF as base64, get a JSON blob back. It works. You ship it. Then reality arrives: a scanned invoice from a vendor who still uses a physical stamp. A 60-page contract where the key clause is on page 47. A table-heavy bank statement where amounts bleed across column boundaries. A PDF that's actually an image with no embedded text at all.\n\nThe naive approach collapses on all of them. Here's the production architecture that doesn't.\n\nThe simplest version — encode the whole PDF, send it to GPT, ask it to return JSON — fails in four common ways:\n\n**Token limits.** A 50-page contract is roughly 25,000–40,000 tokens of text, plus image tokens if you're sending page renders. Most model context windows handle it technically, but accuracy degrades on long documents. The model loses track of structure. Extraction quality on page 45 is noticeably worse than on page 2.\n\n**Scanned documents.** A PDF with no embedded text layer is just a sequence of images. No amount of prompting extracts text that isn't there. You need OCR. 
This affects more documents than you expect — expense receipts, legacy contracts, anything printed and scanned, anything generated by certain accounting systems.\n\n**Tables.** Tables are the hardest part of PDF extraction. Embedded text in a PDF doesn't encode column relationships — the text objects are just positioned by x/y coordinates. A naive extraction reads the text linearly and loses the table structure entirely. Line items from an invoice become a flat list with no mapping between description, quantity, and amount.\n\n**Cost at scale.** Sending a 10MB PDF as a single API call costs real money and consumes tokens inefficiently. Most of that context is headers, footers, boilerplate legal text, and page numbers. The fields you actually need are in 5% of the document.\n\n```\nInput PDF\n  → File validation (size, type, not encrypted)\n  → Text extraction attempt (pdfplumber / pypdf)\n  → Quality check: does extracted text look usable?\n  → If not: OCR pipeline (Tesseract / Textract / Document AI)\n  → Page splitting + relevant-page detection\n  → Table extraction (if document type warrants it)\n  → Structured model call (JSON mode / tool use)\n  → Output validation (schema check + cross-field rules)\n  → Confidence scoring\n  → Storage or human-review queue\n```\n\nEach stage has a failure mode. Each one needs its own handling.\n\nStart with embedded text — it's faster, cheaper, and more accurate than OCR when available.\n\nI use `pdfplumber` for extraction in Python. It handles text positioning better than `pypdf` and gives you bounding box data that's useful for table detection:\n\n``` python\n# extract.py\nfrom __future__ import annotations\n\nimport pdfplumber\n\ndef extract_text_by_page(pdf_path: str) -> list[dict]:\n    \"\"\"\n    Extract text from each page. 
Returns a list of dicts with\n    page number, raw text, and a usability flag.\n    \"\"\"\n    pages = []\n\n    with pdfplumber.open(pdf_path) as pdf:\n        for i, page in enumerate(pdf.pages):\n            text = page.extract_text(x_tolerance=3, y_tolerance=3) or \"\"\n            word_count = len(text.split())\n\n            pages.append({\n                \"page\": i + 1,\n                \"text\": text,\n                \"word_count\": word_count,\n                # Heuristic: fewer than 30 words on a non-blank page = likely scanned\n                \"needs_ocr\": word_count < 30 and len(page.images) > 0,\n            })\n\n    return pages\n\ndef extract_tables(pdf_path: str, page_number: int) -> list[list[list[str]]]:\n    \"\"\"\n    Extract tables from a specific page. Returns a list of tables,\n    each table being a list of rows, each row being a list of cell strings.\n    \"\"\"\n    with pdfplumber.open(pdf_path) as pdf:\n        page = pdf.pages[page_number - 1]\n        tables = page.extract_tables()\n        # Normalize: replace None cells with empty string\n        return [\n            [[cell or \"\" for cell in row] for row in table]\n            for table in (tables or [])\n        ]\n```\n\nThe per-page `needs_ocr` flag is the key decision point. A hybrid document — mostly text with one scanned attachment page — gets OCR applied only to the scanned pages, not the whole file.\n\nThree options, each with different tradeoffs:\n\n**Tesseract** — open source, free, runs locally. Accuracy is acceptable for clean scans, poor for low-resolution, skewed, or multi-column layouts. Good default for low-volume pipelines where you control the document source.\n\n**AWS Textract** — purpose-built for documents. Handles tables natively, returns structured output with bounding boxes and confidence scores per word. Costs $0.0015 per page for basic detection, $0.015 for table/form extraction. 
The table extraction is worth the cost for invoices and financial statements.\n\n**Google Document AI** — strongest accuracy for complex layouts and multilingual documents. More expensive than Textract, but noticeably better on documents with mixed scripts or unusual formatting.\n\nMy heuristic: Tesseract for internal tooling and prototypes, Textract for invoices and financial documents, Document AI if you're processing government documents or multi-language content.\n\nFor Textract integration from Python:\n\n``` python\n# ocr_textract.py\nfrom __future__ import annotations\n\nimport boto3\n\ndef ocr_page_with_textract(image_bytes: bytes) -> dict:\n    \"\"\"\n    Run a single page image through AWS Textract.\n    Returns the LINE blocks (text, confidence, bounding box), the joined\n    full text, and the lowest line confidence on the page.\n    \"\"\"\n    client = boto3.client(\"textract\", region_name=\"eu-west-1\")\n    response = client.detect_document_text(\n        Document={\"Bytes\": image_bytes}\n    )\n\n    lines: list[dict] = []\n    for block in response[\"Blocks\"]:\n        if block[\"BlockType\"] == \"LINE\":\n            lines.append({\n                \"text\": block[\"Text\"],\n                \"confidence\": block[\"Confidence\"],\n                \"bbox\": block[\"Geometry\"][\"BoundingBox\"],\n            })\n\n    return {\n        \"lines\": lines,\n        \"full_text\": \"\\n\".join(b[\"text\"] for b in lines),\n        \"min_confidence\": min((b[\"confidence\"] for b in lines), default=0),\n    }\n```\n\nFor table extraction specifically, use `analyze_document` with `FeatureTypes=[\"TABLES\"]` — this is a separate call and a higher per-page cost, but it's the only reliable way to get structured table data from scanned documents.\n\nFor a 60-page contract, you don't send all 60 pages to the model. You find the pages that contain the data you need.\n\nThe approach: keyword scoring per page. 
For invoice extraction, I score pages by the presence of terms like \"invoice\", \"total\", \"amount\", \"due date\", \"vendor\", and their localized equivalents. The top-scoring 3–5 pages get sent to the model. For contracts, I look for \"payment terms\", \"effective date\", \"party\", \"agrees to\".\n\n``` python\n# relevance.py\nfrom __future__ import annotations\n\nimport re\n\nINVOICE_KEYWORDS = [\n    \"invoice\", \"total\", \"amount due\", \"subtotal\", \"tax\", \"vat\",\n    \"due date\", \"payment terms\", \"bill to\", \"vendor\", \"supplier\",\n    # Finnish (common in EU processing)\n    \"lasku\", \"summa\", \"eräpäivä\", \"alv\",\n]\n\ndef score_page_relevance(text: str, keywords: list[str]) -> float:\n    \"\"\"\n    Returns a 0.0–1.0 relevance score for a page given target keywords.\n    \"\"\"\n    if not text:\n        return 0.0\n\n    text_lower = text.lower()\n    matches = sum(1 for kw in keywords if kw in text_lower)\n    return min(matches / max(len(keywords) * 0.3, 1), 1.0)\n\ndef select_relevant_pages(\n    pages: list[dict],\n    keywords: list[str],\n    max_pages: int = 5,\n) -> list[dict]:\n    scored = [\n        {**page, \"relevance\": score_page_relevance(page[\"text\"], keywords)}\n        for page in pages\n    ]\n    scored.sort(key=lambda p: p[\"relevance\"], reverse=True)\n    return [p for p in scored[:max_pages] if p[\"relevance\"] > 0.05]\n```\n\nThis alone cuts token consumption by 70–80% on long documents while keeping extraction quality the same or better — because the model isn't distracted by irrelevant content.\n\nKeyword scoring is a pragmatic baseline — for higher recall and precision, embedding-based retrieval per page can replace it, but at higher cost and complexity.\n\nThis is where most tutorials go wrong. They send a prompt like \"extract the invoice fields and return JSON\". The model returns JSON most of the time. Sometimes it returns JSON wrapped in a markdown code fence. 
Sometimes it adds commentary before the JSON. Sometimes it invents fields. You end up writing a fragile parser on top of an unpredictable output.\n\nUse JSON mode (OpenAI) or tool use. These are not optional conveniences — they're the difference between a system that works reliably and one that works most of the time.\n\nHere is the TypeScript extraction call using the Vercel AI SDK, with a Zod schema enforced by `generateObject`:\n\n``` ts\n// lib/documents/extract-invoice.ts\nimport { openai } from \"@ai-sdk/openai\";\nimport { generateObject } from \"ai\";\nimport { z } from \"zod\";\n\nconst InvoiceSchema = z.object({\n  vendor_name: z.string().describe(\"The name of the company issuing the invoice\"),\n  vendor_vat_number: z\n    .string()\n    .nullable()\n    .describe(\"VAT registration number if present, null otherwise\"),\n  invoice_number: z.string().describe(\"The invoice reference number\"),\n  invoice_date: z.string().describe(\"Date the invoice was issued, ISO 8601 format if possible\"),\n  due_date: z\n    .string()\n    .nullable()\n    .describe(\"Payment due date, ISO 8601 format if possible, null if not found\"),\n  currency: z.string().describe(\"ISO 4217 currency code, e.g. 
EUR, USD, GBP\"),\n  subtotal: z.number().nullable().describe(\"Pre-tax amount as a number, null if not found\"),\n  tax_amount: z.number().nullable().describe(\"Tax/VAT amount as a number, null if not found\"),\n  total_amount: z.number().describe(\"Total amount due as a number\"),\n  line_items: z.array(\n    z.object({\n      description: z.string(),\n      quantity: z.number().nullable(),\n      unit_price: z.number().nullable(),\n      total: z.number().nullable(),\n    })\n  ),\n  confidence: z.number().min(0).max(1).describe(\"Your confidence in this extraction, 0.0 to 1.0\"),\n});\n\nexport type InvoiceExtraction = z.infer<typeof InvoiceSchema>;\n\nexport async function extractInvoice(\n  pageTexts: string[],\n  tableData: string[][][][] = []\n): Promise<InvoiceExtraction> {\n  const context = pageTexts.join(\"\\n\\n---\\n\\n\");\n  // Assumed nesting: pages > tables > rows > cells. Flatten the page level\n  // so each row's cells (not whole rows) are joined with \" | \".\n  const tables = tableData.flat();\n  const tableContext =\n    tables.length > 0\n      ? \"\\n\\nExtracted tables:\\n\" +\n        tables.map((table) => table.map((row) => row.join(\" | \")).join(\"\\n\")).join(\"\\n\\n\")\n      : \"\";\n\n  const { object } = await generateObject({\n    model: openai(\"gpt-4o-mini\"),\n    schema: InvoiceSchema,\n    prompt: `Extract all invoice fields from the following document text.\nUse null for fields not present in the document.\nFor currency, always return an ISO 4217 code.\nFor dates, normalize to YYYY-MM-DD if possible.\nSet confidence to reflect how certain you are about the extraction overall.\n\nDocument:\n${context}${tableContext}`,\n  });\n\n  return object;\n}\n```\n\nThe `confidence` field in the schema is deliberate — I ask the model to self-report its certainty. It's not always calibrated perfectly, but it's a useful first filter. Extractions with `confidence < 0.6` go to the human review queue automatically.\n\nStructured output from the model is not validated output. 
The model will comply with the schema — but it can still produce values that pass schema validation while being logically wrong.\n\n``` ts\n// lib/documents/validate-invoice.ts\nimport type { InvoiceExtraction } from \"./extract-invoice\";\n\nexport interface ValidationResult {\n  valid: boolean;\n  errors: string[];\n  warnings: string[];\n}\n\nexport function validateInvoiceExtraction(data: InvoiceExtraction): ValidationResult {\n  const errors: string[] = [];\n  const warnings: string[] = [];\n\n  // Hard failures — this extraction is unreliable\n  if (!data.vendor_name || data.vendor_name.trim().length < 2) {\n    errors.push(\"vendor_name is missing or too short\");\n  }\n\n  if (data.total_amount <= 0) {\n    errors.push(\"total_amount must be greater than zero\");\n  }\n\n  if (!data.currency || !/^[A-Z]{3}$/.test(data.currency)) {\n    errors.push(`currency '${data.currency}' is not a valid ISO 4217 code`);\n  }\n\n  // Cross-field logic\n  if (data.subtotal !== null && data.tax_amount !== null) {\n    const expectedTotal = data.subtotal + data.tax_amount;\n    const delta = Math.abs(expectedTotal - data.total_amount);\n    if (delta > 0.02) {\n      // Tolerance for rounding\n      errors.push(\n        `subtotal (${data.subtotal}) + tax (${data.tax_amount}) = ${expectedTotal}, ` +\n          `but total_amount is ${data.total_amount}. 
Delta: ${delta.toFixed(2)}`\n      );\n    }\n  }\n\n  if (data.total_amount > 0 && !data.currency) {\n    errors.push(\"total_amount is present but currency is missing\");\n  }\n\n  // Soft warnings — flag for review but don't reject\n  if (data.line_items.length === 0) {\n    warnings.push(\"no line items extracted — verify manually\");\n  }\n\n  if (data.confidence < 0.6) {\n    warnings.push(`low model confidence: ${data.confidence}`);\n  }\n\n  if (!data.due_date) {\n    warnings.push(\"due_date not found — may need manual entry\");\n  }\n\n  const vatPattern = /^[A-Z]{2}\\d{8,12}$/;\n  if (data.vendor_vat_number && !vatPattern.test(data.vendor_vat_number.replace(/\\s/g, \"\"))) {\n    warnings.push(`vendor_vat_number '${data.vendor_vat_number}' doesn't match expected format`);\n  }\n\n  return {\n    valid: errors.length === 0,\n    errors,\n    warnings,\n  };\n}\n```\n\nThe subtotal + tax = total cross-check catches more extraction errors than any other single rule. Models sometimes extract the subtotal as the total, or miss the tax component entirely.\n\nTwo strategies, not mutually exclusive:\n\n**Automated retry with a stronger model.** If `confidence < 0.7` and validation passes but has warnings, retry with `gpt-4o` instead of `gpt-4o-mini`. The cost difference is roughly 15×, so don't do this for all documents — only for those where the faster model flagged uncertainty. In practice this applies to 10–15% of documents and catches most of the edge cases.\n\n**Human-in-the-loop queue.** If validation returns errors, or if the retry also produces low confidence, route to a review interface. The key is to pre-fill the UI with the extracted values — the reviewer confirms or corrects, they don't start from scratch. 
This makes human review fast enough to be operationally viable even at moderate volume.\n\n``` ts\n// lib/documents/pipeline.ts\nimport { extractInvoice } from \"./extract-invoice\";\nimport { validateInvoiceExtraction } from \"./validate-invoice\";\nimport { db } from \"@/lib/db\";\n// Assumed location of the table definitions; adjust to your schema module\nimport { invoices, reviewQueue } from \"@/lib/db/schema\";\n\ntype ProcessingOutcome = \"stored\" | \"retry\" | \"human_review\";\n\nexport async function processInvoiceDocument(\n  pageTexts: string[],\n  tables: string[][][][],\n  documentId: string\n): Promise<ProcessingOutcome> {\n  let result = await extractInvoice(pageTexts, tables);\n  let validation = validateInvoiceExtraction(result);\n\n  // Retry once if the first pass fails validation or is uncertain\n  if (!validation.valid || result.confidence < 0.7) {\n    // NOTE: extractInvoice as shown is pinned to gpt-4o-mini, so this repeats\n    // the same call. To escalate to gpt-4o, make its model id configurable.\n    result = await extractInvoice(pageTexts, tables);\n    validation = validateInvoiceExtraction(result);\n  }\n\n  // Validation errors or persistently low confidence go to human review\n  if (!validation.valid || result.confidence < 0.6) {\n    await db.insert(reviewQueue).values({\n      documentId,\n      extractedData: result,\n      validationErrors: validation.errors,\n      validationWarnings: validation.warnings,\n      status: \"pending_review\",\n    });\n    return \"human_review\";\n  }\n\n  await db.insert(invoices).values({\n    documentId,\n    ...result,\n    processedAt: new Date(),\n  });\n  return \"stored\";\n}\n```\n\nAt low volume (under 100 documents/day), synchronous processing is fine. At scale, you need async. If this is part of a broader [automation workflow](https://iurii.rogulia.fi/services/automation-workflows), BullMQ fits naturally as the processing backbone.\n\nQueue documents with BullMQ. Set concurrency based on your rate limits — OpenAI's Tier 2 allows 5,000 RPM for `gpt-4o-mini`. 
With up to 3 API calls per document (extraction + possible retry + any enrichment), you can process roughly 1,600 documents per minute (5,000 / 3) at full throughput before hitting rate limits.\n\nModel selection by document type matters for cost:\n\n| Document type | Model | Avg cost/doc | Notes |\n|---|---|---|---|\n| Clean digital invoice | gpt-4o-mini | ~$0.004 | High accuracy, fast |\n| Complex table-heavy doc | gpt-4o | ~$0.06 | On retry/escalation only |\n| Scanned + OCR needed | Textract + 4o-mini | ~$0.018 | Textract adds $0.015/page |\n| Contract (10+ pages) | gpt-4o-mini | ~$0.012 | After page filtering |\n\nThe page relevance filtering from Stage 3 is the biggest cost lever. A 20-page document with 3 relevant pages costs roughly 85% less than sending all 20 pages.\n\nWhat breaks silently: **extraction quality drift**. The model doesn't throw an error — it just starts extracting `total_amount` less reliably as your document corpus evolves and new formats appear.\n\nLog these metrics per processing run:\n\nSet an alert if the validation error rate crosses 5% in a rolling 1-hour window. That's your signal that a new document format is breaking the pipeline and needs a schema or prompt update.\n\n**pdfplumber hangs on encrypted PDFs.** If the document is password-protected, the extraction call never returns. Add a timeout and catch the exception. Check for encryption before attempting extraction.\n\n**Scanned PDFs often have a text layer that's garbage.** Some scan-to-PDF workflows produce PDFs with embedded text, but the text is from a failed OCR pass — garbled characters, incorrect word boundaries. My heuristic for detecting this: extract text, then count the ratio of alphabetic characters to total characters. Below 0.6, treat it as scanned regardless of what `pdfplumber` returns.\n\n**The model sometimes returns numbers with thousand separators.** A value like `\"1,234.56\"` comes back as a string, and Zod rejects it because the schema expects a number. 
Normalize numeric strings before schema validation: strip commas, handle both `.` and `,` as decimal separators (European invoices use comma as decimal).\n\n**Date formats vary wildly.** `14.03.2025`, `14/03/2025`, `March 14, 2025`, `2025-03-14`. The model normalizes most of them, but it's worth adding a post-processing step that parses the extracted date string through `date-fns/parse` with multiple format attempts before storing.\n\n**Textract is slow for real-time use.** Average latency is 3–8 seconds per page for `detect_document_text`. If you need synchronous responses, use Tesseract locally for a fast first pass and Textract in an async enrichment step.\n\nI built this pipeline for internal document processing and as a reusable module across client projects. In production use:\n\nIf you're building document processing automation for EU businesses — invoice processing, contract extraction, compliance document workflows — the pipeline above is where you start. The naive \"pass to GPT\" approach works in a demo. Production requires the full stack: pre-processing, OCR fallbacks, structured schemas, validation, and a human-in-the-loop safety net for the edge cases.\n\nI've deployed this end-to-end across several production systems, including [htpbe.tech](https://iurii.rogulia.fi/projects/htpbe-pdf-analysis) (PDF forensics) and [document automation pipelines](https://iurii.rogulia.fi/services/automation-workflows) for e-commerce and logistics clients.\n\nDocument processing is not an AI problem — it's a systems problem with an AI component.\n\nIf you need a senior developer who can build AI document processing that actually works at scale — [get in touch](https://iurii.rogulia.fi/contact). 
I'm available for freelance projects and long-term engagements.\n\n**Related reading:** [How I Detect Tampered PDFs in 9 Seconds](https://iurii.rogulia.fi/blog/detect-tampered-pdfs) — forensic analysis of PDF structure for document authenticity verification. [How to Add AI to an Existing Product Without Rewriting It](https://iurii.rogulia.fi/blog/add-ai-without-rewriting) — the three integration patterns and where document processing fits.", "url": "https://wpnews.pro/news/ai-document-processing-in-production-full-pipeline-guide", "canonical_source": "https://dev.to/iurii_rogulia/ai-document-processing-in-production-full-pipeline-guide-2cj", "published_at": "2026-06-24 10:00:41+00:00", "updated_at": "2026-06-24 10:13:29.318571+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "natural-language-processing", "developer-tools", "ai-products"], "entities": ["OpenAI", "pdfplumber", "Tesseract", "Textract", "Document AI", "Python"], "alternates": {"html": "https://wpnews.pro/news/ai-document-processing-in-production-full-pipeline-guide", "markdown": "https://wpnews.pro/news/ai-document-processing-in-production-full-pipeline-guide.md", "text": "https://wpnews.pro/news/ai-document-processing-in-production-full-pipeline-guide.txt", "jsonld": "https://wpnews.pro/news/ai-document-processing-in-production-full-pipeline-guide.jsonld"}}