AI Document Processing in Production: Full Pipeline Guide


Someone emails you a PDF invoice. You want to extract the vendor name, line items, total amount, currency, and due date — automatically, at scale, without manual keying.

You call the OpenAI API, pass the PDF as base64, get a JSON blob back. It works. You ship it. Then reality arrives: a scanned invoice from a vendor who still uses a physical stamp. A 60-page contract where the key clause is on page 47. A table-heavy bank statement where amounts bleed across column boundaries. A PDF that's actually an image with no embedded text at all.

The naive approach collapses on all of them. Here's the production architecture that doesn't.

The simplest version — encode the whole PDF, send it to GPT, ask it to return JSON — fails in four common ways:

Token limits. A 50-page contract is roughly 25,000–40,000 tokens of text, plus image tokens if you're sending page renders. Most model context windows handle it technically, but accuracy degrades on long documents. The model loses track of structure. Extraction quality on page 45 is noticeably worse than page 2.

Scanned documents. A PDF with no embedded text layer is just a sequence of images. No amount of prompting extracts text that isn't there. You need OCR. This affects more documents than you expect — expense receipts, legacy contracts, anything printed and scanned, anything generated by certain accounting systems.

Tables. Tables are the hardest part of PDF extraction. Embedded text in a PDF doesn't encode column relationships — the text objects are just positioned by x/y coordinates. A naive extraction reads the text linearly and loses the table structure entirely. Line items from an invoice become a flat list with no mapping between description, quantity, and amount.

Cost at scale. Sending a 10MB PDF as a single API call costs real money and consumes tokens inefficiently. Most of that context is headers, footers, boilerplate legal text, and page numbers. The fields you actually need are in 5% of the document.

Input PDF
  → File validation (size, type, not encrypted)
  → Text extraction attempt (pdfplumber / pypdf)
  → Quality check: does extracted text look usable?
  → If not: OCR pipeline (Tesseract / Textract / Document AI)
  → Page splitting + relevant-page detection
  → Table extraction (if document type warrants it)
  → Structured model call (JSON mode / tool use)
  → Output validation (schema check + cross-field rules)
  → Confidence scoring
  → Storage or human-review queue

Each stage has a failure mode. Each one needs its own handling.
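As an illustration of the first stage, here is a minimal sketch of file validation using only the standard library. The function name, the 25 MB limit, and the byte-level `/Encrypt` check are assumptions for this example, not part of the original pipeline; a PDF library should make the final call on encryption.

```python
from __future__ import annotations

MAX_BYTES = 25 * 1024 * 1024  # assumed upload limit, tune for your pipeline


def validate_pdf_bytes(data: bytes, max_bytes: int = MAX_BYTES) -> list[str]:
    """Return a list of problems; an empty list means the file may proceed."""
    if len(data) == 0:
        return ["file is empty"]

    problems: list[str] = []
    if len(data) > max_bytes:
        problems.append(f"file is {len(data)} bytes, limit is {max_bytes}")

    # PDFs start with a "%PDF-" header. Some producers emit a few junk bytes
    # first, so search the first 1 KB instead of only offset 0.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header, probably not a PDF")

    # Heuristic only: encrypted PDFs reference an /Encrypt dictionary in the
    # trailer. This can produce false positives on unusual files.
    if b"/Encrypt" in data:
        problems.append("PDF appears to be encrypted")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every reason a file was rejected.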


Start with embedded text — it's faster, cheaper, and more accurate than OCR when available.

I use pdfplumber for extraction in Python. It handles text positioning better than pypdf and gives you bounding box data that's useful for table detection:

from __future__ import annotations

import pdfplumber

def extract_text_by_page(pdf_path: str) -> list[dict]:
    """
    Extract text from each page. Returns a list of dicts with
    page number, raw text, and a usability flag.
    """
    pages = []

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text(x_tolerance=3, y_tolerance=3) or ""
            word_count = len(text.split())

            pages.append({
                "page": i + 1,
                "text": text,
                "word_count": word_count,
                "needs_ocr": word_count < 30 and len(page.images) > 0,
            })

    return pages

def extract_tables(pdf_path: str, page_number: int) -> list[list[list[str]]]:
    """
    Extract tables from a specific page. Returns a list of tables,
    each table being a list of rows, each row being a list of cell strings.
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        tables = page.extract_tables()
        return [
            [[cell or "" for cell in row] for row in table]
            for table in (tables or [])
        ]

The per-page needs_ocr flag is the key decision point. A hybrid document (mostly text with one scanned attachment page) gets OCR applied only to the scanned pages, not the whole file.

Three options, each with different tradeoffs:

Tesseract — open source, free, runs locally. Accuracy is acceptable for clean scans, poor for low-resolution, skewed, or multi-column layouts. Good default for low-volume pipelines where you control the document source.

AWS Textract: purpose-built for documents. Handles tables natively, returns structured output with bounding boxes and confidence scores per word. Costs roughly $0.0015 per page for basic text detection and $0.015 per page for table extraction (form extraction is priced higher, and prices vary by region). The table extraction is worth the cost for invoices and financial statements.

Google Document AI — strongest accuracy for complex layouts and multilingual documents. More expensive than Textract, but noticeably better on documents with mixed scripts or unusual formatting.

My heuristic: Tesseract for internal tooling and prototypes, Textract for invoices and financial documents, Document AI if you're processing government documents or multi-language content.

For Textract integration from Python:

from __future__ import annotations

import boto3

def ocr_page_with_textract(image_bytes: bytes) -> dict:
    """
    Run a single page image through AWS Textract.
    Returns raw blocks with type, text, and confidence.
    """
    client = boto3.client("textract", region_name="eu-west-1")
    response = client.detect_document_text(
        Document={"Bytes": image_bytes}
    )

    lines: list[dict] = []
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            lines.append({
                "text": block["Text"],
                "confidence": block["Confidence"],
                "bbox": block["Geometry"]["BoundingBox"],
            })

    return {
        "lines": lines,
        "full_text": "\n".join(b["text"] for b in lines),
        "min_confidence": min((b["confidence"] for b in lines), default=0),
    }

For table extraction specifically, use analyze_document with FeatureTypes=["TABLES"]. This is a separate call with a higher per-page cost, but it's the only reliable way to get structured table data from scanned documents.
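The table call returns a flat list of blocks rather than ready-made rows, so the grid has to be rebuilt from the TABLE, CELL, and WORD blocks. The sketch below shows one way to do that; the function names are illustrative, merged cells and selection elements are ignored, and the AWS call is wrapped separately so the parser can be tested without credentials.

```python
from __future__ import annotations


def tables_from_blocks(blocks: list[dict]) -> list[list[list[str]]]:
    """Rebuild tables (rows of cell strings) from Textract AnalyzeDocument blocks."""
    by_id = {b["Id"]: b for b in blocks}

    def children(block: dict, block_type: str) -> list[dict]:
        # Follow CHILD relationships and keep only blocks of the wanted type.
        found = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = by_id.get(child_id)
                if child is not None and child["BlockType"] == block_type:
                    found.append(child)
        return found

    tables: list[list[list[str]]] = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = children(table, "CELL")
        if not cells:
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [w["Text"] for w in children(cell, "WORD")]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        tables.append(grid)
    return tables


def analyze_tables(image_bytes: bytes, region: str = "eu-west-1") -> list[list[list[str]]]:
    """Call Textract with table analysis enabled and return the rebuilt tables."""
    import boto3  # imported lazily so the parser above works without boto3

    client = boto3.client("textract", region_name=region)
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES"],
    )
    return tables_from_blocks(response["Blocks"])
```

The output has the same shape as extract_tables above (tables, then rows, then cell strings), so both paths can feed the same downstream step.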

For a 60-page contract, you don't send all 60 pages to the model. You find the pages that contain the data you need.

The approach: keyword scoring per page. For invoice extraction, I score pages by the presence of terms like "invoice", "total", "amount", "due date", "vendor", and their localized equivalents. The top-scoring 3–5 pages get sent to the model. For contracts, I look for "payment terms", "effective date", "party", "agrees to".

from __future__ import annotations

import re

INVOICE_KEYWORDS = [
    "invoice", "total", "amount due", "subtotal", "tax", "vat",
    "due date", "payment terms", "bill to", "vendor", "supplier",
    "lasku", "summa", "eräpäivä", "alv",
]

def score_page_relevance(text: str, keywords: list[str]) -> float:
    """
    Returns a 0.0–1.0 relevance score for a page given target keywords.
    """
    if not text:
        return 0.0

    text_lower = text.lower()
    matches = sum(1 for kw in keywords if kw in text_lower)
    return min(matches / max(len(keywords) * 0.3, 1), 1.0)

def select_relevant_pages(
    pages: list[dict],
    keywords: list[str],
    max_pages: int = 5,
) -> list[dict]:
    scored = [
        {**page, "relevance": score_page_relevance(page["text"], keywords)}
        for page in pages
    ]
    scored.sort(key=lambda p: p["relevance"], reverse=True)
    return [p for p in scored[:max_pages] if p["relevance"] > 0.05]

This alone cuts token consumption by 70–80% on long documents while keeping extraction quality the same or better — because the model isn't distracted by irrelevant content.
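To see how the scoring formula behaves, the arithmetic can be worked through in isolation. The helper below repeats only the formula from score_page_relevance; the name score_from_matches is introduced here for illustration.

```python
def score_from_matches(matches: int, n_keywords: int) -> float:
    """Same formula as score_page_relevance, taking the match count directly."""
    return min(matches / max(n_keywords * 0.3, 1), 1.0)


# With the 15 invoice keywords, the denominator is 15 * 0.3 = 4.5.
one_match = score_from_matches(1, 15)    # about 0.22
five_matches = score_from_matches(5, 15) # capped at 1.0

# A single match already clears the 0.05 cutoff, so with this list the cutoff
# only drops pages with zero matches. It starts to exclude single-match pages
# once the keyword list grows to about 67 entries (1 / (67 * 0.3) < 0.05).
print(round(one_match, 2), five_matches)
```

In other words, with a short keyword list the ranking and max_pages do the real filtering, and the cutoff acts as a guard against sending pages with no matches at all.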

Keyword scoring is a pragmatic baseline — for higher recall and precision, embedding-based retrieval per page can replace it, but at higher cost and complexity.

This is where most tutorials go wrong. They send a prompt like "extract the invoice fields and return JSON". The model returns JSON most of the time. Sometimes it returns JSON wrapped in a markdown code fence. Sometimes it adds commentary before the JSON. Sometimes it invents fields. You end up writing a fragile parser on top of an unpredictable output.

Use JSON mode (OpenAI) or tool use. These are not optional conveniences — they're the difference between a system that works reliably and one that works most of the time.

Here is the TypeScript extraction call using the Vercel AI SDK with a Zod schema enforced via tool use:

// lib/documents/extract-invoice.ts
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";

const InvoiceSchema = z.object({
  vendor_name: z.string().describe("The name of the company issuing the invoice"),
  vendor_vat_number: z
    .string()
    .nullable()
    .describe("VAT registration number if present, null otherwise"),
  invoice_number: z.string().describe("The invoice reference number"),
  invoice_date: z.string().describe("Date the invoice was issued, ISO 8601 format if possible"),
  due_date: z
    .string()
    .nullable()
    .describe("Payment due date, ISO 8601 format if possible, null if not found"),
  currency: z.string().describe("ISO 4217 currency code, e.g. EUR, USD, GBP"),
  subtotal: z.number().nullable().describe("Pre-tax amount as a number, null if not found"),
  tax_amount: z.number().nullable().describe("Tax/VAT amount as a number, null if not found"),
  total_amount: z.number().describe("Total amount due as a number"),
  line_items: z.array(
    z.object({
      description: z.string(),
      quantity: z.number().nullable(),
      unit_price: z.number().nullable(),
      total: z.number().nullable(),
    })
  ),
  confidence: z.number().min(0).max(1).describe("Your confidence in this extraction, 0.0 to 1.0"),
});

export type InvoiceExtraction = z.infer<typeof InvoiceSchema>;

export async function extractInvoice(
  pageTexts: string[],
  tableData: string[][][][] = []
): Promise<InvoiceExtraction> {
  const context = pageTexts.join("\n\n---\n\n");
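  // tableData is assumed to be pages → tables → rows → cells. Flatten one
  // level so each entry is a single table, then render rows as "a | b | c".
  const tableContext =
    tableData.length > 0
      ? "\n\nExtracted tables:\n" +
        tableData
          .flat()
          .map((table) => table.map((row) => row.join(" | ")).join("\n"))
          .join("\n\n")
      : "";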

  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: InvoiceSchema,
    prompt: `Extract all invoice fields from the following document text.
Use null for fields not present in the document.
For currency, always return an ISO 4217 code.
For dates, normalize to YYYY-MM-DD if possible.
Set confidence to reflect how certain you are about the extraction overall.

Document:
${context}${tableContext}`,
  });

  return object;
}

The confidence field in the schema is deliberate: I ask the model to self-report its certainty. It's not always calibrated perfectly, but it's a useful first filter. Extractions with confidence < 0.6 go to the human review queue automatically.

Structured output from the model is not validated output. The model will comply with the schema — but it can still produce values that pass schema validation while being logically wrong.

// lib/documents/validate-invoice.ts
import type { InvoiceExtraction } from "./extract-invoice";

export interface ValidationResult {
  valid: boolean;
  errors: string[];
  warnings: string[];
}

export function validateInvoiceExtraction(data: InvoiceExtraction): ValidationResult {
  const errors: string[] = [];
  const warnings: string[] = [];

  // Hard failures — this extraction is unreliable
  if (!data.vendor_name || data.vendor_name.trim().length < 2) {
    errors.push("vendor_name is missing or too short");
  }

  if (data.total_amount <= 0) {
    errors.push("total_amount must be greater than zero");
  }

  if (!data.currency || !/^[A-Z]{3}$/.test(data.currency)) {
    errors.push(`currency '${data.currency}' is not a valid ISO 4217 code`);
  }

  // Cross-field logic
  if (data.subtotal !== null && data.tax_amount !== null) {
    const expectedTotal = data.subtotal + data.tax_amount;
    const delta = Math.abs(expectedTotal - data.total_amount);
    if (delta > 0.02) {
      // Tolerance for rounding
      errors.push(
        `subtotal (${data.subtotal}) + tax (${data.tax_amount}) = ${expectedTotal}, ` +
          `but total_amount is ${data.total_amount}. Delta: ${delta.toFixed(2)}`
      );
    }
  }

  if (data.total_amount > 0 && !data.currency) {
    errors.push("total_amount is present but currency is missing");
  }

  // Soft warnings — flag for review but don't reject
  if (data.line_items.length === 0) {
    warnings.push("no line items extracted — verify manually");
  }

  if (data.confidence < 0.6) {
    warnings.push(`low model confidence: ${data.confidence}`);
  }

  if (!data.due_date) {
    warnings.push("due_date not found — may need manual entry");
  }

  const vatPattern = /^[A-Z]{2}\d{8,12}$/;
  if (data.vendor_vat_number && !vatPattern.test(data.vendor_vat_number.replace(/\s/g, ""))) {
    warnings.push(`vendor_vat_number '${data.vendor_vat_number}' doesn't match expected format`);
  }

  return {
    valid: errors.length === 0,
    errors,
    warnings,
  };
}

The subtotal + tax = total cross-check catches more extraction errors than any other single rule. Models sometimes extract the subtotal as the total, or miss the tax component entirely.

Two strategies, not mutually exclusive:

Automated retry with a stronger model. If confidence < 0.7 and validation passes but has warnings, retry with gpt-4o instead of gpt-4o-mini. The cost difference is roughly 15×, so don't do this for all documents, only for those where the faster model flagged uncertainty. In practice this applies to 10–15% of documents and catches most of the edge cases.

Human-in-the-loop queue. If validation returns errors, or if the retry also produces low confidence, route to a review interface. The key is to pre-fill the UI with the extracted values — the reviewer confirms or corrects, they don't start from scratch. This makes human review fast enough to be operationally viable even at moderate volume.

// lib/documents/pipeline.ts
import { extractInvoice } from "./extract-invoice";
import { validateInvoiceExtraction } from "./validate-invoice";
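import { db } from "@/lib/db";
// Assumed location of the table definitions; adjust to your schema module.
import { invoices, reviewQueue } from "@/lib/db/schema";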

type ProcessingOutcome = "stored" | "retry" | "human_review";

export async function processInvoiceDocument(
  pageTexts: string[],
  tables: string[][][][],
  documentId: string
): Promise<ProcessingOutcome> {
  let result = await extractInvoice(pageTexts, tables);
  let validation = validateInvoiceExtraction(result);

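  // Retry if the first pass is invalid or uncertain.
  // NOTE: extractInvoice as shown always calls gpt-4o-mini, so this repeats
  // the same model. To escalate to gpt-4o, give extractInvoice a model
  // parameter and pass the stronger model here.
  if (!validation.valid || result.confidence < 0.7) {
    result = await extractInvoice(pageTexts, tables);
    validation = validateInvoiceExtraction(result);
  }

  // Hard errors or persistently low confidence go to human review.
  if (!validation.valid || result.confidence < 0.6) {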
    await db.insert(reviewQueue).values({
      documentId,
      extractedData: result,
      validationErrors: validation.errors,
      validationWarnings: validation.warnings,
      status: "pending_review",
    });
    return "human_review";
  }

  await db.insert(invoices).values({
    documentId,
    ...result,
    processedAt: new Date(),
  });
  return "stored";
}
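The routing rules described in prose (retry below 0.7 when there are warnings, human review on errors or persistent low confidence) can also be written as a pure decision function, which is easier to unit-test than the full pipeline. This is a sketch: the Extraction type and next_step name are hypothetical, and treating "low confidence after retry" as below 0.6 is an assumption taken from the review threshold mentioned earlier.

```python
from __future__ import annotations

from dataclasses import dataclass

RETRY_BELOW = 0.7   # retry with the stronger model under this confidence
REVIEW_BELOW = 0.6  # send to human review under this confidence


@dataclass(frozen=True)
class Extraction:
    valid: bool          # no hard validation errors
    has_warnings: bool   # at least one soft warning
    confidence: float    # model self-reported confidence, 0.0 to 1.0


def next_step(result: Extraction, already_retried: bool) -> str:
    """Return "store", "retry_stronger_model", or "human_review"."""
    if not result.valid:
        return "human_review"
    if not already_retried and result.has_warnings and result.confidence < RETRY_BELOW:
        return "retry_stronger_model"
    if result.confidence < REVIEW_BELOW:
        return "human_review"
    return "store"


print(next_step(Extraction(valid=True, has_warnings=True, confidence=0.65), already_retried=False))
```

Keeping the decision separate from the API calls means the thresholds can be tuned against logged extractions without touching the extraction code.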

At low volume (under 100 documents/day), synchronous processing is fine. At scale, you need async. If this is part of a broader automation workflow, BullMQ fits naturally as the processing backbone.

Queue documents with BullMQ. Set concurrency based on your rate limits: OpenAI's Tier 2 allows 5,000 RPM for gpt-4o-mini. With 3 API calls per document (extraction + possible retry + any enrichment), you can process roughly 1,600 documents per minute at full throughput before hitting rate limits.
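The throughput figure is a simple division, and the same numbers give a starting point for queue concurrency. The sketch below assumes a request-per-minute limit only (token-per-minute limits may bind first), and the 2-second average call latency and 80% headroom are placeholder values, not measurements.

```python
from __future__ import annotations


def max_docs_per_minute(rpm_limit: int, calls_per_doc: int) -> int:
    """Upper bound on documents per minute if every document uses all its calls."""
    return rpm_limit // calls_per_doc


def worker_concurrency(rpm_limit: int, avg_call_seconds: float, headroom: float = 0.8) -> int:
    """In-flight calls that keep the request rate under rpm_limit.

    Uses Little's law: concurrency = arrival rate * time in system.
    """
    calls_per_second = rpm_limit / 60
    return max(1, int(calls_per_second * avg_call_seconds * headroom))


print(max_docs_per_minute(5000, 3))   # 1666
print(worker_concurrency(5000, 2.0))  # 133
```

Since most documents need only one call, 1,666 per minute is the worst-case floor rather than the typical rate.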

Model selection by document type matters for cost:

Document type	Model	Avg cost/doc	Notes
Clean digital invoice	gpt-4o-mini	~$0.004	High accuracy, fast
Complex table-heavy doc	gpt-4o	~$0.06	On retry/escalation only
Scanned + OCR needed	Textract + 4o-mini	~$0.018	Textract adds $0.015/page
Contract (10+ pages)	gpt-4o-mini	~$0.012	After page filtering

The page relevance filtering from Stage 3 is the biggest cost lever. A 20-page document with 3 relevant pages costs 85% less than sending all 20 pages.
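The 85% figure follows from 3 of 20 pages being sent, assuming cost scales with input pages. A slightly more careful estimate includes the fixed prompt overhead; the token counts below are illustrative placeholders, and output tokens are ignored.

```python
def estimated_saving(
    total_pages: int,
    kept_pages: int,
    tokens_per_page: int = 600,   # assumed average, measure on your corpus
    overhead_tokens: int = 300,   # assumed fixed prompt + schema tokens
) -> float:
    """Fraction of input tokens saved by sending only the kept pages."""
    full = overhead_tokens + total_pages * tokens_per_page
    filtered = overhead_tokens + kept_pages * tokens_per_page
    return 1 - filtered / full


print(round(estimated_saving(20, 3, overhead_tokens=0), 2))  # 0.85
print(round(estimated_saving(20, 3), 2))                     # 0.83
```

The fixed overhead pulls the saving slightly below the page ratio, and the gap grows for short documents.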

What breaks silently: extraction quality drift. The model doesn't throw an error. It just starts extracting total_amount less reliably as your document corpus evolves and new formats appear.

Log these metrics per processing run:

Set an alert if the validation error rate crosses 5% in a rolling 1-hour window. That's your signal that a new document format is breaking the pipeline and needs a schema or prompt update.
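A rolling-window error rate like the one described can be tracked in memory with a deque. This is a minimal single-process sketch; the class name and the min_samples guard (to avoid alerting on one failure out of two documents) are assumptions added for the example, and a production setup would more likely compute this in a metrics system.

```python
from __future__ import annotations

from collections import deque


class RollingErrorRate:
    """Track validation failures over a sliding time window."""

    def __init__(self, window_seconds: float = 3600.0, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples
        self._events = deque()  # (timestamp, failed) pairs, oldest first
        self._failures = 0

    def record(self, timestamp: float, failed: bool) -> None:
        self._events.append((timestamp, failed))
        self._failures += int(failed)
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, failed = self._events.popleft()
            self._failures -= int(failed)

    def rate(self, now: float) -> float:
        self._evict(now)
        return self._failures / len(self._events) if self._events else 0.0

    def should_alert(self, now: float) -> bool:
        self._evict(now)
        return len(self._events) >= self.min_samples and self.rate(now) > self.threshold
```

Call record(time.time(), failed=not validation.valid) after each document and check should_alert on a timer or after each record.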

pdfplumber hangs on encrypted PDFs. If the document is password-protected, the extraction call never returns. Add a timeout and catch the exception. Check for encryption before attempting extraction.

Scanned PDFs often have a text layer that's garbage. Some scan-to-PDF workflows produce PDFs with embedded text, but the text is from a failed OCR pass: garbled characters, incorrect word boundaries. My heuristic for detecting this: extract text, then count the ratio of alphabetic characters to total characters. Below 0.6, treat it as scanned regardless of what pdfplumber returns.
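The ratio check is a few lines of Python. In this sketch whitespace is excluded from the total, which is an assumption (the heuristic as stated does not say either way), and the function names are illustrative.

```python
def alpha_ratio(text: str) -> float:
    """Share of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def looks_scanned(text: str, threshold: float = 0.6) -> bool:
    """True when the text layer looks like failed OCR and the page should be re-OCRed."""
    return alpha_ratio(text) < threshold


print(looks_scanned("Invoice number and payment terms for services rendered"))
print(looks_scanned("~~#@ 1|l0 %% ^^ 3$ ::;; 0O0"))
```

str.isalpha is Unicode-aware, so words like "eräpäivä" count as alphabetic. Pages that are legitimately number-heavy, such as bank statement tables, may fall under the threshold and could need a separate rule.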

The model sometimes returns numbers with thousand separators, such as "1,234.56", and Zod will reject it because the schema expects a number. Normalize numeric strings before schema validation: strip thousands separators and handle both . and , as decimal separators (European invoices use a comma as the decimal separator).
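One way to normalize these strings is to decide which separator is the decimal one from its position. The sketch below is a heuristic, not a locale-aware parser: the parse_amount name is hypothetical, and a lone comma followed by exactly three digits ("1,234") is assumed to be a thousands separator, which is a guess that will be wrong for some inputs.

```python
from __future__ import annotations

import re


def parse_amount(raw: str) -> float:
    """Parse amounts like "1,234.56", "1.234,56", "12,50" or "€ 99.00"."""
    cleaned = re.sub(r"[^\d,.\-]", "", raw)  # drop currency symbols and spaces
    if not re.search(r"\d", cleaned):
        raise ValueError(f"no digits in {raw!r}")

    last_comma = cleaned.rfind(",")
    last_dot = cleaned.rfind(".")

    if last_comma != -1 and last_dot != -1:
        # Both present: whichever appears last is the decimal separator.
        decimal_sep = "," if last_comma > last_dot else "."
    elif last_comma != -1:
        # Only commas: "12,50" is decimal, "1,234" or "1,234,567" is grouping.
        digits_after = len(cleaned) - last_comma - 1
        decimal_sep = "" if cleaned.count(",") > 1 or digits_after == 3 else ","
    elif last_dot != -1:
        # Only dots: several dots means grouping ("1.234.567").
        decimal_sep = "" if cleaned.count(".") > 1 else "."
    else:
        decimal_sep = ""

    if decimal_sep:
        integer_part, _, fraction = cleaned.rpartition(decimal_sep)
    else:
        integer_part, fraction = cleaned, ""

    integer_part = re.sub(r"[,.]", "", integer_part)
    return float(f"{integer_part}.{fraction}" if fraction else integer_part)
```

When the invoice's country or locale is known, passing that in and choosing the separator explicitly is more reliable than guessing from the string.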

Date formats vary wildly: 14.03.2025, 14/03/2025, March 14, 2025, 2025-03-14. The model normalizes most of them, but it's worth adding a post-processing step that parses the extracted date string through date-fns/parse with multiple format attempts before storing.
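The same multi-format approach looks like this in Python with datetime.strptime, shown here as an analogue of the date-fns step rather than the original implementation. The format list and its day-first ordering for slashed dates are assumptions suited to European invoices.

```python
from __future__ import annotations

from datetime import datetime

# Tried in order. Slashed dates are assumed to be day-first (European style).
# %B matches month names in the current locale (English by default).
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(raw: str, formats: tuple[str, ...] = DATE_FORMATS) -> str | None:
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    text = raw.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for sample in ("14.03.2025", "14/03/2025", "March 14, 2025", "2025-03-14"):
    print(sample, "->", normalize_date(sample))
```

Returning None instead of raising lets the validator turn an unparseable date into a warning and route the document to review.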

Textract is slow for real-time use. Average latency is 3–8 seconds per page for detect_document_text. If you need synchronous responses, use Tesseract locally for a fast first pass and Textract in an async enrichment step.

I built this pipeline for internal document processing and as a reusable module across client projects. In production use:

If you're building document processing automation for EU businesses — invoice processing, contract extraction, compliance document workflows — the pipeline above is where you start. The naive "pass to GPT" approach works in a demo. Production requires the full stack: pre-processing, OCR fallbacks, structured schemas, validation, and a human-in-the-loop safety net for the edge cases.

I've deployed this end-to-end across several production systems, including htpbe.tech (PDF forensics) and document automation pipelines for e-commerce and logistics clients.

Document processing is not an AI problem — it's a systems problem with an AI component.

If you need a senior developer who can build AI document processing that actually works at scale — get in touch. I'm available for freelance projects and long-term engagements.

Related reading: How I Detect Tampered PDFs in 9 Seconds — forensic analysis of PDF structure for document authenticity verification. How to Add AI to an Existing Product Without Rewriting It — the three integration patterns and where document processing fits.

source & further reading

dev.to (original article)
