Extract Structured Data from Documents with Claude Tool Use and Pydantic

Priya Nair published a tutorial on building a Python pipeline that extracts structured data from invoice PDFs or images using Anthropic's Claude API with tool use and Pydantic validation. The pipeline forces Claude to return a validated Pydantic model by defining a schema, loading documents as base64, and handling JSON Schema references. This approach can be adapted for any document type such as receipts, contracts, or medical forms.

Build a Python pipeline that sends invoice PDFs or images to Claude, forces a structured response via tool use, and returns a fully validated Pydantic model.

Priya Nair https://www.devclubhouse.com/u/priya nair

What you'll build

A Python pipeline that sends invoice PDFs or images to Claude, forces a structured response via tool use, and returns a validated Pydantic model. Swap the schema for any document type: receipts, contracts, medical forms.

Prerequisites

- Python 3.10+
- `ANTHROPIC_API_KEY` in your environment (get one at console.anthropic.com, https://console.anthropic.com)
- Familiarity with Pydantic v2 models

```bash
pip install "anthropic>=0.40.0" "pydantic>=2.0"
```

macOS/Linux: `export ANTHROPIC_API_KEY=sk-ant-...`. Windows: `set ANTHROPIC_API_KEY=sk-ant-...`.

Step 1: Define your Pydantic model

The schema you define here becomes the contract between Claude and your code. Field descriptions are included in the JSON Schema sent to Claude and serve as extraction hints.

```python
# models.py
from typing import Optional

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str = Field(description="Product or service name")
    quantity: float
    unit_price: float = Field(description="Price per unit before tax")
    total: float


class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    invoice_date: str = Field(description="Date in YYYY-MM-DD format")
    line_items: list[LineItem]
    subtotal: float
    tax_amount: Optional[float] = None
    total_amount: float = Field(description="Final amount due including tax")
```

Because `Invoice` contains a nested `LineItem` model, Pydantic v2's `model_json_schema()` generates a schema with `$defs` and `$ref` entries. Claude's API rejects those references, so Step 3 includes a helper to inline them before the schema is sent.

Step 2: Build the document loader

Claude accepts PDFs as `document` content blocks and images as `image` content blocks. Both use base64 encoding.
PDF support requires `claude-3-5-sonnet-20241022` or later.

```python
# loader.py
import base64
from pathlib import Path

# Maps a lowercase file suffix to the media type Claude expects.
MEDIA_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def load_document(file_path: str) -> dict:
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix not in MEDIA_TYPES:
        raise ValueError(f"Unsupported file type: {suffix}")

    data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    media_type = MEDIA_TYPES[suffix]

    # PDFs go in a "document" block; everything else is an "image" block.
    if suffix == ".pdf":
        return {
            "type": "document",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }

    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

Step 3: Wire Claude tool use to your model

Two things to know before the code. First, Claude's API rejects tool `input_schema` values containing `$ref` or `$defs`, so `inline_refs` resolves all references before the schema leaves your process. Second, the PDF `document` block type is in public beta and requires an `anthropic-beta` header on every request that includes one.

Setting `tool_choice` to `{"type": "tool", "name": "..."}` forces Claude to call your specific tool instead of answering in prose. The response then contains a `tool_use` content block whose `input` is a plain dict, which you can hand straight to `model_validate`.

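
To make the first point concrete, the sketch below shows roughly what a nested model's schema looks like before inlining. The dictionary is hand-written and abbreviated rather than generated by Pydantic, and `find_refs` is a hypothetical helper (not part of Pydantic or the Anthropic SDK) that lists where references remain.

```python
# Hand-written, abbreviated stand-in for the schema Pydantic v2 generates for a
# model with a nested model field. Real output carries more keys (titles, etc.).
RAW_SCHEMA = {
    "$defs": {
        "LineItem": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "quantity": {"type": "number"},
            },
            "required": ["description", "quantity"],
        }
    },
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {"$ref": "#/$defs/LineItem"},
        },
    },
    "required": ["invoice_number", "line_items"],
}


def find_refs(node, path="$"):
    """Return the JSON paths of every "$ref" key found under node."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "$ref":
                found.append(child)
            else:
                found.extend(find_refs(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_refs(item, f"{path}[{index}]"))
    return found


if __name__ == "__main__":
    print(find_refs(RAW_SCHEMA))  # ['$.properties.line_items.items.$ref']
```

Running the same helper on the output of `inline_refs` should return an empty list, which makes it a cheap pre-flight check before a request is sent.
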

```python
# extractor.py
import anthropic

from loader import load_document
from models import Invoice

client = anthropic.Anthropic()


def inline_refs(schema: dict) -> dict:
    """Recursively resolve $ref/$defs so Claude's API accepts the schema."""

    def resolve(node, defs):
        if isinstance(node, dict):
            if "$ref" in node:
                # "#/$defs/LineItem" -> "LineItem"
                ref_path = node["$ref"]
                def_name = ref_path.split("/")[-1]
                return resolve(defs[def_name], defs)
            return {k: resolve(v, defs) for k, v in node.items()}
        elif isinstance(node, list):
            return [resolve(item, defs) for item in node]
        return node

    defs = schema.get("$defs", {})
    resolved = resolve(schema, defs)
    resolved.pop("$defs", None)
    return resolved


def extract_invoice(file_path: str) -> Invoice:
    tool_def = {
        "name": "extract_invoice",
        "description": "Extract all invoice fields from the provided document.",
        "input_schema": inline_refs(Invoice.model_json_schema()),
    }

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[tool_def],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        extra_headers={"anthropic-beta": "pdfs-2024-09-25"},
        messages=[
            {
                "role": "user",
                "content": [
                    load_document(file_path),
                    {"type": "text", "text": "Extract the invoice data from this document."},
                ],
            }
        ],
    )

    # Return the first tool_use block; any leading text block is skipped.
    for block in response.content:
        if block.type == "tool_use":
            return Invoice.model_validate(block.input)

    raise RuntimeError(
        "Claude did not return a tool_use block. "
        "Verify tool_choice name matches tool_def name."
    )
```

The loop over `response.content` guards against Claude occasionally prepending a short text block before the `tool_use` block, even when the tool is forced.

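
One caveat worth noting: a helper like `inline_refs` follows every reference it meets, so a self-referencing model (for example, a line item that contains sub-items of its own type) would recurse without end. The sketch below is one possible variant, not the tutorial's implementation, that raises a clear error instead. The name `inline_refs_checked` is hypothetical, and the code assumes every reference uses the local `#/$defs/Name` form.

```python
def inline_refs_checked(schema: dict) -> dict:
    """Inline local "#/$defs/..." references, failing fast on recursive ones."""
    defs = schema.get("$defs", {})

    def resolve(node, active):
        # `active` holds the definition names currently being expanded.
        if isinstance(node, dict):
            if "$ref" in node:
                name = node["$ref"].split("/")[-1]
                if name in active:
                    raise ValueError(f"Recursive $ref to {name!r} cannot be inlined")
                return resolve(defs[name], active | {name})
            # Drop the "$defs" table itself; its entries are inlined on demand.
            return {k: resolve(v, active) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(item, active) for item in node]
        return node

    return resolve(schema, frozenset())


if __name__ == "__main__":
    # Assumed toy schema: one nested definition, no recursion.
    flat = {
        "$defs": {"Item": {"type": "object", "properties": {"n": {"type": "number"}}}},
        "type": "object",
        "properties": {"items": {"type": "array", "items": {"$ref": "#/$defs/Item"}}},
    }
    print(inline_refs_checked(flat))
```

For a recursive model, inlining is not possible at all, so the realistic options are to flatten the model or cap its nesting depth by hand.
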
Step 4: Run the pipeline

```python
# main.py
from extractor import extract_invoice

invoice = extract_invoice("sample_invoice.pdf")  # also accepts .jpg, .png, .webp

print(f"Vendor: {invoice.vendor_name}")
print(f"Invoice #: {invoice.invoice_number}")
print(f"Total: ${invoice.total_amount:.2f}")
for item in invoice.line_items:
    print(f"  {item.description}: {item.quantity} x ${item.unit_price:.2f}")
```

Verify it works

Run `python main.py` against any invoice PDF or image. Example output (the values depend on your invoice):

```
Vendor: Acme Supplies Co.
Invoice #: INV-2024-0042
Total: $1348.00
  Widget A: 10.0 x $89.90
  Widget B: 5.0 x $99.80
```

If Pydantic raises a `ValidationError`, the message pinpoints exactly which field came back in an unexpected shape. That's the contract at work: silent bad data becomes an explicit failure that names the offending field.

To inspect the inlined schema that Claude actually receives:

```python
import json

from extractor import inline_refs
from models import Invoice

print(json.dumps(inline_refs(Invoice.model_json_schema()), indent=2))
```

Troubleshooting

`ValidationError` on `invoice_date`

Claude returned "January 5, 2024" instead of ISO format. Note that with `invoice_date` declared as `str`, as in Step 1, that value passes validation silently; it only raises a `ValidationError` if you type the field as `datetime.date` or add a format check. Either way, tighten the field description to "strict ISO 8601, e.g. 2024-01-05", or add a Pydantic `field_validator` to normalize common date string formats.

`anthropic.BadRequestError` on the request

Two common causes. If the error mentions the PDF or document type, check that you're using `claude-3-5-sonnet-20241022` or a newer model and that `extra_headers={"anthropic-beta": "pdfs-2024-09-25"}` is present on the call. If the error mentions the tool schema, confirm you're passing `inline_refs(Invoice.model_json_schema())` rather than the raw schema, which contains `$ref` entries the API will reject.

`RuntimeError: Claude did not return a tool_use block`

The tool name in `tool_choice` must match `tool_def["name"]` exactly. A mismatch causes Claude to fall back to a plain text answer. Double-check for typos.

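
One way to rule out this kind of mismatch is to build both `tools` and `tool_choice` from a single constant. A minimal sketch, in which `TOOL_NAME` and `build_tool_kwargs` are hypothetical names and the schema is assumed to be passed in already inlined:

```python
TOOL_NAME = "extract_invoice"  # single source of truth for the tool name


def build_tool_kwargs(input_schema: dict, description: str) -> dict:
    """Return matching `tools` and `tool_choice` keyword arguments."""
    tool_def = {
        "name": TOOL_NAME,
        "description": description,
        "input_schema": input_schema,
    }
    return {
        "tools": [tool_def],
        # Reuse the name from tool_def so the two can never drift apart.
        "tool_choice": {"type": "tool", "name": tool_def["name"]},
    }


if __name__ == "__main__":
    kwargs = build_tool_kwargs(
        {"type": "object", "properties": {}},  # placeholder schema for the demo
        "Extract all invoice fields from the provided document.",
    )
    print(kwargs["tool_choice"])
```

The returned dict can be unpacked with `**` into the existing `client.messages.create` call in place of the separate `tools` and `tool_choice` arguments.
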
Extraction stops mid-invoice on large files

The `max_tokens=1024` value set in `extract_invoice` can be too small for invoices with many line items; a truncated response reports `stop_reason == "max_tokens"`. Increase the limit to 2048 or 4096. Claude 3.5 Sonnet handles PDFs up to roughly 100 pages; beyond that, use `pypdf` (`pip install pypdf`) to slice pages before encoding.

Next steps

- Make it generic: parameterize `extract_invoice` with a type `T` bound to `BaseModel` so one function handles any schema.
- Async batch processing: swap in `anthropic.AsyncAnthropic` and run extractions with `asyncio.gather` for throughput.
- Self-healing extraction: catch `ValidationError`, format the error message, and send it back to Claude as a follow-up user turn asking it to correct the specific fields.
- Confidence flags: add `Optional[float]` confidence fields to your model and ask Claude to populate them; route low-confidence results to a human review queue.
- Anthropic's tool use documentation (https://docs.anthropic.com/en/docs/tool-use) covers parallel tool calls and multi-turn tool workflows.

Priya Nair · AI & Developer Experience Writer

Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.