Build a Python pipeline that sends invoice PDFs or images to Claude, forces a structured response via tool use, and returns a fully validated Pydantic model.
What you'll build #
A Python pipeline that sends invoices (PDFs or images) to Claude, forces a structured response via tool use, and returns a validated Pydantic model. Swap the schema for any document type: receipts, contracts, medical forms.
Prerequisites #
- Python 3.10+
- ANTHROPIC_API_KEY in your environment (get one at console.anthropic.com)
- Familiarity with Pydantic v2 models
pip install "anthropic>=0.40.0" "pydantic>=2.0"
macOS/Linux: export ANTHROPIC_API_KEY=sk-ant-...
Windows: set ANTHROPIC_API_KEY=sk-ant-...
Step 1: Define your Pydantic model #
The schema you define here becomes the contract between Claude and your code. Field descriptions are included in the JSON Schema sent to Claude and serve as extraction hints.
from typing import Optional

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str = Field(description="Product or service name")
    quantity: float
    unit_price: float = Field(description="Price per unit before tax")
    total: float


class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    invoice_date: str = Field(description="Date in YYYY-MM-DD format")
    line_items: list[LineItem]
    subtotal: float
    tax_amount: Optional[float] = None
    total_amount: float = Field(description="Final amount due including tax")
Because Invoice contains a nested LineItem model, Pydantic v2's model_json_schema() generates a schema with $defs and $ref entries. Claude's API rejects those references, so Step 3 includes a helper to inline them before the schema is sent.
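To make the $defs/$ref shape concrete, here is a minimal sketch that walks a schema dict and lists every reference it contains. The schema literal is a hand-written abbreviation of the kind of schema a nested model produces, not Pydantic's exact output, and find_refs is a hypothetical helper name.

```python
def find_refs(node) -> list[str]:
    """Collect every "$ref" pointer found anywhere in a JSON Schema dict."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(find_refs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_refs(item))
    return found


# Hand-written, abbreviated stand-in for a nested-model schema (assumption).
schema = {
    "$defs": {
        "LineItem": {
            "type": "object",
            "properties": {"description": {"type": "string"}},
        }
    },
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "items": {"$ref": "#/$defs/LineItem"},
        }
    },
}

print(find_refs(schema))
```

An empty list from a check like this means the schema has no references left to inline.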
Step 2: Build the document #
Claude accepts PDFs as document content blocks and images as image content blocks. Both use base64 encoding. PDF support requires claude-3-5-sonnet-20241022 or later.
import base64
from pathlib import Path

MEDIA_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def load_document(file_path: str) -> dict:
    path = Path(file_path)
    suffix = path.suffix.lower()
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"Unsupported file type: {suffix}")
    data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    media_type = MEDIA_TYPES[suffix]
    # PDFs go in a "document" block; every other supported type is an "image" block.
    if suffix == ".pdf":
        return {
            "type": "document",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
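Content blocks carry the whole file as base64, so printing one while debugging floods the terminal. The sketch below shows one way to summarize a block instead. summarize_block is a hypothetical helper, and the sample block is built by hand to mirror the shape load_document returns.

```python
import base64


def summarize_block(block: dict) -> dict:
    """Return a short summary of a content block without the base64 payload."""
    source = block["source"]
    payload = source["data"]
    return {
        "type": block["type"],
        "media_type": source["media_type"],
        "base64_chars": len(payload),
        "decoded_bytes": len(base64.b64decode(payload)),
    }


# Stand-in block built by hand (assumption); not a real PDF.
sample = {
    "type": "document",
    "source": {
        "type": "base64",
        "media_type": "application/pdf",
        "data": base64.standard_b64encode(b"%PDF-1.4 demo").decode("utf-8"),
    },
}

print(summarize_block(sample))
```

The decoded size is also a quick way to spot files that are too large to send before you make the request.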
Step 3: Wire Claude tool use to your model #
Two things to know before the code. First, Claude's API rejects tool input_schema values containing $ref or $defs, so inline_refs resolves all references before the schema leaves your process. Second, the PDF document block type is in public beta and requires an anthropic-beta header on every request that includes one.
Setting tool_choice to {"type": "tool", "name": "..."} forces Claude to call your specific tool instead of answering in prose. The response then contains a tool_use content block whose input is a plain dict, which you can hand straight to model_validate.
import anthropic

from models import Invoice
# Assumed module name: the original import omits it. Use the file where you saved Step 2.
from loader import load_document

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def inline_refs(schema: dict) -> dict:
    """Recursively resolve $ref/$defs so Claude's API accepts the schema."""

    def resolve(node, defs):
        if isinstance(node, dict):
            if "$ref" in node:
                # "#/$defs/LineItem" -> "LineItem"
                ref_path = node["$ref"]
                def_name = ref_path.split("/")[-1]
                return resolve(defs[def_name], defs)
            return {k: resolve(v, defs) for k, v in node.items()}
        elif isinstance(node, list):
            return [resolve(item, defs) for item in node]
        return node

    defs = schema.get("$defs", {})
    resolved = resolve(schema, defs)
    resolved.pop("$defs", None)
    return resolved


def extract_invoice(file_path: str) -> Invoice:
    tool_def = {
        "name": "extract_invoice",
        "description": "Extract all invoice fields from the provided document.",
        "input_schema": inline_refs(Invoice.model_json_schema()),
    }
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[tool_def],
        # Force a call to this specific tool instead of a prose answer.
        tool_choice={"type": "tool", "name": "extract_invoice"},
        extra_headers={"anthropic-beta": "pdfs-2024-09-25"},
        messages=[
            {
                "role": "user",
                "content": [
                    load_document(file_path),
                    {"type": "text", "text": "Extract the invoice data from this document."},
                ],
            }
        ],
    )
    for block in response.content:
        if block.type == "tool_use":
            return Invoice.model_validate(block.input)
    raise RuntimeError(
        "Claude did not return a tool_use block. Verify tool_choice name matches tool_def name."
    )
The loop over response.content guards against Claude occasionally prepending a short text block before the tool_use block, even when the tool is forced.
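The block-scanning pattern can be tried without an API key. The sketch below uses SimpleNamespace objects as stand-ins for the SDK's content blocks, on the assumption that only the type, name, and input attributes matter here. first_tool_input is a hypothetical helper, and unlike the loop above it also checks the tool name.

```python
from types import SimpleNamespace


def first_tool_input(content, tool_name: str) -> dict:
    """Return the input dict of the first tool_use block for tool_name."""
    for block in content:
        if block.type == "tool_use" and block.name == tool_name:
            return block.input
    raise RuntimeError(f"No tool_use block for {tool_name!r} in response")


# Stand-ins for SDK content blocks (assumption): a text block, then the tool call.
content = [
    SimpleNamespace(type="text", text="Here is the extracted data."),
    SimpleNamespace(
        type="tool_use",
        name="extract_invoice",
        input={"invoice_number": "INV-1"},
    ),
]

print(first_tool_input(content, "extract_invoice"))
```

Matching on the name as well as the type costs one comparison and keeps the helper correct if you later register more than one tool.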
Step 4: Run the pipeline #
from extractor import extract_invoice

invoice = extract_invoice("sample_invoice.pdf")  # also accepts .jpg, .png, .webp
print(f"Vendor: {invoice.vendor_name}")
print(f"Invoice #: {invoice.invoice_number}")
print(f"Total: ${invoice.total_amount:.2f}")
for item in invoice.line_items:
    print(f"  {item.description}: {item.quantity} x ${item.unit_price:.2f}")
Verify it works #
Run python main.py against any invoice PDF or image. The output should look like this, with values taken from your own invoice:
Vendor: Acme Supplies Co.
Invoice #: INV-2024-0042
Total: $1348.00
  Widget A: 10.0 x $89.90
  Widget B: 5.0 x $99.80
If Pydantic raises a ValidationError, the message pinpoints exactly which field came back in an unexpected shape. That's the contract at work: silent bad data becomes an explicit failure that names the offending field.
To inspect the inlined schema that Claude actually receives:
import json
from models import Invoice
from extractor import inline_refs
print(json.dumps(inline_refs(Invoice.model_json_schema()), indent=2))
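Inspecting the schema is also a good moment to consider one edge case. A model that references itself (a hypothetical category tree, say) would make a recursive inliner like inline_refs recurse until Python raises RecursionError. The variant below is a hedged sketch, not a drop-in replacement: inline_refs_checked is a hypothetical name, and it assumes every $ref points into the top-level $defs.

```python
def inline_refs_checked(schema: dict) -> dict:
    """Inline local $ref pointers, raising on self-referencing definitions."""
    defs = schema.get("$defs", {})

    def resolve(node, active: tuple):
        if isinstance(node, dict):
            if "$ref" in node:
                name = node["$ref"].split("/")[-1]
                if name in active:
                    # We are already expanding this definition: a cycle.
                    chain = " -> ".join(active + (name,))
                    raise ValueError(f"Recursive schema cannot be inlined: {chain}")
                return resolve(defs[name], active + (name,))
            return {k: resolve(v, active) for k, v in node.items()}
        if isinstance(node, list):
            return [resolve(item, active) for item in node]
        return node

    # Drop the top-level $defs first so only definitions in use get expanded.
    top = {k: v for k, v in schema.items() if k != "$defs"}
    return resolve(top, ())


# Hand-written schemas for illustration (assumptions, not Pydantic output).
flat = {
    "$defs": {"LineItem": {"type": "object"}},
    "type": "object",
    "properties": {
        "line_items": {"type": "array", "items": {"$ref": "#/$defs/LineItem"}}
    },
}
tree = {
    "$defs": {
        "Node": {
            "type": "object",
            "properties": {
                "children": {"type": "array", "items": {"$ref": "#/$defs/Node"}}
            },
        }
    },
    "type": "object",
    "properties": {"root": {"$ref": "#/$defs/Node"}},
}

print(inline_refs_checked(flat))
try:
    inline_refs_checked(tree)
except ValueError as exc:
    print(exc)
```

A clear error at schema-build time is easier to act on than a RecursionError raised deep inside the helper.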
Troubleshooting #
**ValidationError on invoice_date**
Claude returned "January 5, 2024" instead of ISO format. Tighten the field description to "strict ISO 8601, e.g. 2024-01-05", or add a Pydantic field_validator to normalize common date string formats.
**anthropic.BadRequestError on the request**
Two common causes. If the error mentions the PDF or document type, check that you're using claude-3-5-sonnet-20241022 or a newer model and that extra_headers={"anthropic-beta": "pdfs-2024-09-25"} is present on the call. If the error mentions the tool schema, confirm you're passing inline_refs(Invoice.model_json_schema()) rather than the raw schema, which contains $ref entries the API will reject.
**RuntimeError: Claude did not return a tool_use block**
The tool name in tool_choice must match tool_def["name"] exactly. A mismatch causes Claude to fall back to a plain text answer. Double-check for typos.
**Extraction stops mid-invoice on large files**
The max_tokens=1024 value set in Step 3 can be too small for invoices with many line items. Increase it to 2048 or 4096. Claude 3.5 Sonnet handles PDFs up to roughly 100 pages; beyond that, use pypdf (pip install pypdf) to slice pages before encoding.
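For the invoice_date case, the normalization step might look like the sketch below. It uses only the standard library so it can be tested on its own. normalize_date and the KNOWN_FORMATS list are assumptions to adapt, not part of the pipeline above.

```python
from datetime import datetime

# Formats to try, in order. This list is an assumption; extend it for your vendors.
# "%m/%d/%Y" reads 01/05/2024 as January 5 (US order), which may not suit every source.
KNOWN_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%b %d, %Y", "%d %B %Y", "%m/%d/%Y")


def normalize_date(raw: str) -> str:
    """Return raw as YYYY-MM-DD, or raise ValueError if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized date format: {raw!r}")


print(normalize_date("January 5, 2024"))
```

Inside the Invoice model, a classmethod decorated with @field_validator("invoice_date", mode="before") could call a function like this, so the value is cleaned up before Pydantic validates the field.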
Next steps #
- Make it generic: parameterize extract_invoice with a type[T] bound to BaseModel so one function handles any schema.
- Async batch processing: swap in anthropic.AsyncAnthropic() and run extractions with asyncio.gather() for throughput.
- Self-healing extraction: catch ValidationError, format the error message, and send it back to Claude as a follow-up user turn asking it to correct the specific fields.
- Confidence flags: add Optional[float] confidence fields to your model and ask Claude to populate them; route low-confidence results to a human review queue.
- Anthropic's tool use documentation covers parallel tool calls and multi-turn tool workflows.
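The self-healing idea from the list above reduces to a small retry loop. The sketch below takes the model call and the validator as plain callables so it runs without an API key. extract_with_retry, fake_model, and fake_validate are all hypothetical names; in a real pipeline the first callable would wrap client.messages.create and the second would be Invoice.model_validate.

```python
from typing import Callable


def extract_with_retry(
    call_model: Callable[[list[str]], dict],
    validate: Callable[[dict], object],
    max_attempts: int = 3,
):
    """Call the model, validate, and feed validation errors back until it passes."""
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        raw = call_model(feedback)
        try:
            return validate(raw)
        except ValueError as exc:  # Pydantic v2's ValidationError subclasses ValueError
            feedback.append(f"Attempt {attempt} failed validation: {exc}")
    last = feedback[-1] if feedback else "no attempts made"
    raise RuntimeError(f"Still invalid after {max_attempts} attempts: {last}")


# Stand-ins so the sketch runs offline (both are hypothetical).
def fake_model(feedback: list[str]) -> dict:
    # The first call returns a bad date; once feedback exists, it "corrects" itself.
    return {"invoice_date": "2024-01-05" if feedback else "January 5, 2024"}


def fake_validate(raw: dict) -> dict:
    if len(raw["invoice_date"]) != 10:
        raise ValueError("invoice_date must be YYYY-MM-DD")
    return raw


print(extract_with_retry(fake_model, fake_validate))
```

Capping the number of attempts keeps a stubborn document from turning into an unbounded stream of API calls.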
Priya Nair · AI & Developer Experience Writer
Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.