Extract Structured Data from Documents with Claude Tool Use and Pydantic

wpnews.pro

Build a Python pipeline that sends invoice PDFs or images to Claude, forces a structured response via tool use, and returns a fully validated Pydantic model.

Priya Nair

What you'll build #

A Python pipeline that sends invoices (PDFs or images) to Claude, forces a structured response via tool use, and returns a validated Pydantic model. Swap the schema for any document type: receipts, contracts, medical forms.

Prerequisites #

- Python 3.10+
- ANTHROPIC_API_KEY in your environment (get one at console.anthropic.com)
- Familiarity with Pydantic v2 models

pip install "anthropic>=0.40.0" "pydantic>=2.0"

macOS/Linux: export ANTHROPIC_API_KEY=sk-ant-...

Windows: set ANTHROPIC_API_KEY=sk-ant-...

Step 1: Define your Pydantic model #

The schema you define here becomes the contract between Claude and your code. Field descriptions are included in the JSON Schema sent to Claude and serve as extraction hints.

from pydantic import BaseModel, Field
from typing import Optional

class LineItem(BaseModel):
    description: str = Field(description="Product or service name")
    quantity: float
    unit_price: float = Field(description="Price per unit before tax")
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    invoice_date: str = Field(description="Date in YYYY-MM-DD format")
    line_items: list[LineItem]
    subtotal: float
    tax_amount: Optional[float] = None
    total_amount: float = Field(description="Final amount due including tax")

Because Invoice contains a nested LineItem model, Pydantic v2's model_json_schema() generates a schema with $defs and $ref entries. Claude's API rejects those references, so Step 3 includes a helper to inline them before the schema is sent.
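To see where those references sit, here is a minimal sketch that walks a schema and lists every $ref it finds. The schema dict is a hand-written, trimmed approximation of the shape a nested model produces (real output carries extra keys such as titles and descriptions), and find_refs is a hypothetical helper, not part of Pydantic or the Anthropic SDK.

```python
# Hand-written approximation of the schema shape for a nested model.
# Real model_json_schema() output includes more keys (title, description, ...).
schema = {
    "$defs": {
        "LineItem": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "quantity": {"type": "number"},
            },
            "required": ["description", "quantity"],
        }
    },
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {"$ref": "#/$defs/LineItem"},
        },
    },
    "required": ["invoice_number", "line_items"],
}


def find_refs(node, path="$"):
    """Return (location, target) pairs for every $ref in a schema.

    Hypothetical helper for inspection only.
    """
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref":
                found.append((path, value))
            else:
                found.extend(find_refs(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_refs(item, f"{path}[{index}]"))
    return found


refs = find_refs(schema)
print(refs)  # [('$.properties.line_items.items', '#/$defs/LineItem')]
```

Running the same helper on the inlined schema should return an empty list, which makes it a cheap pre-flight check.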

Step 2: Build the document #

Claude accepts PDFs as document content blocks and images as image content blocks. Both use base64 encoding. PDF support requires claude-3-5-sonnet-20241022 or later.

import base64
from pathlib import Path

MEDIA_TYPES = {
    ".pdf":  "application/pdf",
    ".png":  "image/png",
    ".jpg":  "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif":  "image/gif",
}

def load_document(file_path: str) -> dict:
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix not in MEDIA_TYPES:
        raise ValueError(f"Unsupported file type: {suffix}")

    data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    media_type = MEDIA_TYPES[suffix]

    if suffix == ".pdf":
        return {
            "type": "document",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
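Because both block types carry base64 text, the payload that goes over the wire is about a third larger than the file on disk. The sketch below estimates the encoded size before any encoding happens. The 32 MB ceiling is an assumption used for illustration; check the limit that applies to your account. Both function names are hypothetical.

```python
import base64
import math

# Assumed request-size ceiling, for illustration only.
MAX_REQUEST_BYTES = 32 * 1024 * 1024


def encoded_size(raw_bytes: int) -> int:
    """Length of the padded base64 text for a payload of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_in_request(raw_bytes: int, limit: int = MAX_REQUEST_BYTES) -> bool:
    """True if the base64-encoded payload alone stays within the limit."""
    return encoded_size(raw_bytes) <= limit


sample = b"x" * 1000
# The estimate matches what the standard library actually produces.
assert encoded_size(len(sample)) == len(base64.standard_b64encode(sample))
print(encoded_size(len(sample)))            # 1336
print(fits_in_request(30 * 1024 * 1024))    # False: 30 MiB grows to 40 MiB
```

Calling fits_in_request(path.stat().st_size) before load_document would let the pipeline fail fast on oversized files instead of waiting for an API error.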

Step 3: Wire Claude tool use to your model #

Two things to know before the code. First, Claude's API rejects tool input_schema values containing $ref or $defs, so inline_refs resolves all references before the schema leaves your process. Second, the PDF document block type is in public beta and requires an anthropic-beta header on every request that includes one.

Setting tool_choice to {"type": "tool", "name": "..."} forces Claude to call your specific tool instead of answering in prose. The response then contains a tool_use content block whose input is a plain dict, so you can hand it straight to model_validate.
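Since the forced tool is identified by name alone, a typo in either place is easy to miss. As a minimal sketch, the guard below compares the two request fragments locally before anything is sent. check_forced_tool is a hypothetical helper, and the schema here is a stub rather than the real invoice schema.

```python
def check_forced_tool(tools: list[dict], tool_choice: dict) -> None:
    """Raise early if a forced tool_choice names a tool that is not defined."""
    if tool_choice.get("type") != "tool":
        return  # nothing is being forced by name
    names = {tool["name"] for tool in tools}
    wanted = tool_choice.get("name")
    if wanted not in names:
        raise ValueError(
            f"tool_choice names {wanted!r}, but the defined tools are {sorted(names)}"
        )


tools = [
    {
        "name": "extract_invoice",
        "description": "Extract all invoice fields from the provided document.",
        "input_schema": {"type": "object", "properties": {}},  # stub schema
    }
]

check_forced_tool(tools, {"type": "tool", "name": "extract_invoice"})  # passes silently
```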

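import anthropic
from models import Invoice
# The module name is missing in the source; "loader" is a placeholder for
# whichever file holds load_document from Step 2.
from loader import load_document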

client = anthropic.Anthropic()

def inline_refs(schema: dict) -> dict:
    """Recursively resolve $ref/$defs so Claude's API accepts the schema."""
    def resolve(node, defs):
        if isinstance(node, dict):
            if "$ref" in node:
                ref_path = node["$ref"]
                def_name = ref_path.split("/")[-1]
                return resolve(defs[def_name], defs)
            return {k: resolve(v, defs) for k, v in node.items()}
        elif isinstance(node, list):
            return [resolve(item, defs) for item in node]
        return node

    defs = schema.get("$defs", {})
    resolved = resolve(schema, defs)
    resolved.pop("$defs", None)
    return resolved

def extract_invoice(file_path: str) -> Invoice:
    tool_def = {
        "name": "extract_invoice",
        "description": "Extract all invoice fields from the provided document.",
        "input_schema": inline_refs(Invoice.model_json_schema()),
    }

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[tool_def],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        extra_headers={"anthropic-beta": "pdfs-2024-09-25"},
        messages=[
            {
                "role": "user",
                "content": [
                    load_document(file_path),
                    {"type": "text", "text": "Extract the invoice data from this document."},
                ],
            }
        ],
    )

    for block in response.content:
        if block.type == "tool_use":
            return Invoice.model_validate(block.input)

    raise RuntimeError(
        "Claude did not return a tool_use block. Verify tool_choice name matches tool_def name."
    )

The loop over response.content guards against Claude occasionally prepending a short text block before the tool_use block, even when the tool is forced.
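The same loop can also report why no usable block came back. The sketch below is one possible variant, assuming a response cut off by the token limit should be treated as a failure rather than validated. It runs against SimpleNamespace stand-ins that mimic only the attributes used here (content, stop_reason, and each block's type, name, and input). first_tool_input is a hypothetical name.

```python
from types import SimpleNamespace


def first_tool_input(response, tool_name: str) -> dict:
    """Return the input dict of the first matching tool_use block."""
    # A truncated response may carry incomplete tool input, so reject it first.
    if response.stop_reason == "max_tokens":
        raise RuntimeError("Response hit max_tokens; raise the limit and retry.")
    for block in response.content:
        if block.type == "tool_use" and block.name == tool_name:
            return block.input
    raise RuntimeError(f"No tool_use block for {tool_name!r} in the response.")


# Stand-in for an SDK response: a text block followed by the tool call.
fake_response = SimpleNamespace(
    stop_reason="tool_use",
    content=[
        SimpleNamespace(type="text", text="Here is the extracted data."),
        SimpleNamespace(
            type="tool_use",
            name="extract_invoice",
            input={"invoice_number": "INV-1"},
        ),
    ],
)

print(first_tool_input(fake_response, "extract_invoice"))  # {'invoice_number': 'INV-1'}
```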

Step 4: Run the pipeline #

from extractor import extract_invoice

invoice = extract_invoice("sample_invoice.pdf")  # also accepts .jpg, .png, .webp

print(f"Vendor:    {invoice.vendor_name}")
print(f"Invoice #: {invoice.invoice_number}")
print(f"Total:     ${invoice.total_amount:.2f}")
for item in invoice.line_items:
    print(f"  {item.description}: {item.quantity} x ${item.unit_price:.2f}")

Verify it works #

Run python main.py against any invoice PDF or image. Expected output:

Vendor:    Acme Supplies Co.
Invoice #: INV-2024-0042
Total:     $1398.00
  Widget A: 10.0 x $89.90
  Widget B: 5.0 x $99.80

If Pydantic raises a ValidationError, the message pinpoints exactly which field came back in an unexpected shape. That's the contract at work: silent bad data becomes an explicit failure that names the offending field.

To inspect the inlined schema that Claude actually receives:

import json
from models import Invoice
from extractor import inline_refs
print(json.dumps(inline_refs(Invoice.model_json_schema()), indent=2))
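If Pydantic is not at hand, the inlining step can still be checked in isolation. The sketch below repeats the inline_refs function from Step 3 so that it runs on its own, and applies it to a small hand-written schema that only approximates Pydantic's output.

```python
import json


def inline_refs(schema: dict) -> dict:
    """Copy of the Step 3 helper: replace each $ref with its definition."""
    def resolve(node, defs):
        if isinstance(node, dict):
            if "$ref" in node:
                def_name = node["$ref"].split("/")[-1]
                return resolve(defs[def_name], defs)
            return {k: resolve(v, defs) for k, v in node.items()}
        elif isinstance(node, list):
            return [resolve(item, defs) for item in node]
        return node

    defs = schema.get("$defs", {})
    resolved = resolve(schema, defs)
    resolved.pop("$defs", None)
    return resolved


# Hand-written stand-in for a nested-model schema.
raw = {
    "$defs": {
        "LineItem": {
            "type": "object",
            "properties": {"description": {"type": "string"}},
            "required": ["description"],
        }
    },
    "type": "object",
    "properties": {
        "line_items": {"type": "array", "items": {"$ref": "#/$defs/LineItem"}}
    },
    "required": ["line_items"],
}

inlined = inline_refs(raw)
print(json.dumps(inlined, indent=2))
```

Two limits of this approach are worth keeping in mind. Any keys that sit beside a $ref in the same node are dropped when the reference is replaced, and a self-referential model would make resolve recurse without end, so the helper only suits tree-shaped schemas like the invoice one.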

Troubleshooting #

**ValidationError on invoice_date** Claude returned "January 5, 2024" instead of ISO format. Tighten the field description to "strict ISO 8601, e.g. 2024-01-05", or add a Pydantic field_validator to normalize common date string formats.

**anthropic.BadRequestError on the request** Two common causes. If the error mentions the PDF or document type, check that you're using claude-3-5-sonnet-20241022 or a newer model and that extra_headers={"anthropic-beta": "pdfs-2024-09-25"} is present on the call. If the error mentions the tool schema, confirm you're passing inline_refs(Invoice.model_json_schema()) rather than the raw schema, which contains $ref entries the API will reject.

**RuntimeError: Claude did not return a tool_use block** The tool name in tool_choice must match tool_def["name"] exactly. A mismatch causes Claude to fall back to a plain text answer. Double-check for typos.

**Extraction stops mid-invoice on large files** The max_tokens=1024 value set in Step 3 can be too small for invoices with many line items. Increase it to 2048 or 4096. Claude 3.5 Sonnet handles PDFs up to roughly 100 pages; beyond that, use pypdf (pip install pypdf) to slice pages before encoding.
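For the invoice_date case, the normalization logic can be written as a plain function first and wired into the model afterwards. The sketch below uses only the standard library. The list of accepted formats is an assumption and should be extended to match the invoices you actually receive, and normalize_date is a hypothetical name.

```python
from datetime import datetime

# Assumed set of input formats; month names are parsed in the default locale.
DATE_FORMATS = (
    "%Y-%m-%d",    # 2024-01-05
    "%B %d, %Y",   # January 5, 2024
    "%b %d, %Y",   # Jan 5, 2024
    "%d %B %Y",    # 5 January 2024
)


def normalize_date(value: str) -> str:
    """Return value as YYYY-MM-DD, or raise ValueError if no format matches."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


print(normalize_date("January 5, 2024"))  # 2024-01-05
```

Inside the Invoice model, a field_validator on invoice_date with mode="before" could call this function, so that the stored value is always ISO formatted. Purely numeric formats such as 01/05/2024 are left out on purpose, since day-first and month-first readings cannot be told apart without knowing the vendor's locale.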

Next steps #

- Make it generic: parameterize extract_invoice with a type[T] bound to BaseModel so one function handles any schema.
- Async batch processing: swap in anthropic.AsyncAnthropic() and run extractions with asyncio.gather() for throughput.
- Self-healing extraction: catch ValidationError, format the error message, and send it back to Claude as a follow-up user turn asking it to correct the specific fields.
- Confidence flags: add Optional[float] confidence fields to your model and ask Claude to populate them; route low-confidence results to a human review queue.
- Anthropic's tool use documentation covers parallel tool calls and multi-turn tool workflows.
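As a rough sketch of the self-healing idea, the follow-up turn can be assembled as plain dicts before it is passed to client.messages.create. It assumes the failed attempt is replayed as an assistant tool_use block and that the validation error goes back in a tool_result block flagged with is_error. build_retry_messages is a hypothetical helper, and the id, input, and error text below are made-up sample values.

```python
def build_retry_messages(
    messages: list[dict],
    tool_use_id: str,
    tool_name: str,
    tool_input: dict,
    error_text: str,
) -> list[dict]:
    """Append the failed tool call and a corrective tool_result turn."""
    return [
        *messages,
        {
            "role": "assistant",
            "content": [
                {
                    "type": "tool_use",
                    "id": tool_use_id,       # taken from block.id on the first response
                    "name": tool_name,
                    "input": tool_input,     # the input that failed validation
                }
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tool_use_id,
                    "is_error": True,
                    "content": (
                        "Validation failed. Call the tool again with these "
                        f"fields corrected:\n{error_text}"
                    ),
                }
            ],
        },
    ]


# Sample values for illustration only.
first_turn = [{"role": "user", "content": "Extract the invoice data from this document."}]
retry = build_retry_messages(
    first_turn,
    tool_use_id="toolu_example",
    tool_name="extract_invoice",
    tool_input={"invoice_date": "January 5, 2024"},
    error_text="invoice_date: expected YYYY-MM-DD",
)
print(len(retry))  # 3
```

In a real pipeline the error text would come from str(exc) on the caught ValidationError, and the retry loop should be capped at one or two attempts so a stubborn document cannot run up costs.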

Priya Nair · AI & Developer Experience Writer

Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.


source & further reading

devclubhouse.com (original article)