{"slug": "extract-structured-data-from-documents-with-claude-tool-use-and-pydantic", "title": "Extract Structured Data from Documents with Claude Tool Use and Pydantic", "summary": "Priya Nair published a tutorial on building a Python pipeline that extracts structured data from invoice PDFs or images using Anthropic's Claude API with tool use and Pydantic validation. The pipeline forces Claude to return a validated Pydantic model by defining a schema, loading documents as base64, and handling JSON Schema references. This approach can be adapted for any document type such as receipts, contracts, or medical forms.", "body_md": "# Extract Structured Data from Documents with Claude Tool Use and Pydantic\n\nBuild a Python pipeline that sends invoice PDFs or images to Claude, forces a structured response via tool use, and returns a fully validated Pydantic model.\n\n[Priya Nair](https://www.devclubhouse.com/u/priya_nair)\n\n## What you'll build\n\nA Python pipeline that sends invoices (PDFs or images) to Claude, forces a structured response via tool use, and returns a validated Pydantic model. Swap the schema for any document type: receipts, contracts, medical forms.\n\n## Prerequisites\n\n- Python 3.10+\n- `ANTHROPIC_API_KEY` in your environment (get one at [console.anthropic.com](https://console.anthropic.com))\n- Familiarity with Pydantic v2 models\n\n```\npip install \"anthropic>=0.40.0\" \"pydantic>=2.0\"\n```\n\nmacOS/Linux: `export ANTHROPIC_API_KEY=sk-ant-...`. Windows: `set ANTHROPIC_API_KEY=sk-ant-...`.\n\n## Step 1: Define your Pydantic model\n\nThe schema you define here becomes the contract between Claude and your code. 
Field descriptions are included in the JSON Schema sent to Claude and serve as extraction hints.\n\n``` python\n# models.py\nfrom pydantic import BaseModel, Field\nfrom typing import Optional\n\nclass LineItem(BaseModel):\n    description: str = Field(description=\"Product or service name\")\n    quantity: float\n    unit_price: float = Field(description=\"Price per unit before tax\")\n    total: float\n\nclass Invoice(BaseModel):\n    invoice_number: str\n    vendor_name: str\n    invoice_date: str = Field(description=\"Date in YYYY-MM-DD format\")\n    line_items: list[LineItem]\n    subtotal: float\n    tax_amount: Optional[float] = None\n    total_amount: float = Field(description=\"Final amount due including tax\")\n```\n\nBecause `Invoice` contains a nested `LineItem` model, Pydantic v2's `model_json_schema()` generates a schema with `$defs` and `$ref` entries. Claude's API rejects those references, so Step 3 includes a helper to inline them before the schema is sent.\n\n## Step 2: Build the document loader\n\nClaude accepts PDFs as `document` content blocks and images as `image` content blocks. Both use base64 encoding. 
PDF support requires `claude-3-5-sonnet-20241022` or later.\n\n``` python\n# loader.py\nimport base64\nfrom pathlib import Path\n\nMEDIA_TYPES = {\n    \".pdf\":  \"application/pdf\",\n    \".png\":  \"image/png\",\n    \".jpg\":  \"image/jpeg\",\n    \".jpeg\": \"image/jpeg\",\n    \".webp\": \"image/webp\",\n    \".gif\":  \"image/gif\",\n}\n\ndef load_document(file_path: str) -> dict:\n    path = Path(file_path)\n    suffix = path.suffix.lower()\n\n    if suffix not in MEDIA_TYPES:\n        raise ValueError(f\"Unsupported file type: {suffix}\")\n\n    data = base64.standard_b64encode(path.read_bytes()).decode(\"utf-8\")\n    media_type = MEDIA_TYPES[suffix]\n\n    if suffix == \".pdf\":\n        return {\n            \"type\": \"document\",\n            \"source\": {\"type\": \"base64\", \"media_type\": media_type, \"data\": data},\n        }\n    return {\n        \"type\": \"image\",\n        \"source\": {\"type\": \"base64\", \"media_type\": media_type, \"data\": data},\n    }\n```\n\n## Step 3: Wire Claude tool use to your model\n\nTwo things to know before the code. First, Claude's API rejects tool `input_schema` values containing `$ref` or `$defs`, so `inline_refs` resolves all references before the schema leaves your process. Second, the PDF `document` block type is in public beta and requires an `anthropic-beta` header on every request that includes one.\n\nSetting `tool_choice` to `{\"type\": \"tool\", \"name\": \"...\"}` forces Claude to call your specific tool instead of answering in prose. 
The response then contains a `tool_use` content block whose `input` is a plain dict, so you can hand it straight to `model_validate`.\n\n``` python\n# extractor.py\nimport anthropic\nfrom models import Invoice\nfrom loader import load_document\n\nclient = anthropic.Anthropic()\n\ndef inline_refs(schema: dict) -> dict:\n    \"\"\"Recursively resolve $ref/$defs so Claude's API accepts the schema.\"\"\"\n    def resolve(node, defs):\n        if isinstance(node, dict):\n            if \"$ref\" in node:\n                ref_path = node[\"$ref\"]\n                def_name = ref_path.split(\"/\")[-1]\n                return resolve(defs[def_name], defs)\n            return {k: resolve(v, defs) for k, v in node.items()}\n        elif isinstance(node, list):\n            return [resolve(item, defs) for item in node]\n        return node\n\n    defs = schema.get(\"$defs\", {})\n    resolved = resolve(schema, defs)\n    resolved.pop(\"$defs\", None)\n    return resolved\n\ndef extract_invoice(file_path: str) -> Invoice:\n    tool_def = {\n        \"name\": \"extract_invoice\",\n        \"description\": \"Extract all invoice fields from the provided document.\",\n        \"input_schema\": inline_refs(Invoice.model_json_schema()),\n    }\n\n    response = client.messages.create(\n        model=\"claude-3-5-sonnet-20241022\",\n        max_tokens=1024,\n        tools=[tool_def],\n        tool_choice={\"type\": \"tool\", \"name\": \"extract_invoice\"},\n        extra_headers={\"anthropic-beta\": \"pdfs-2024-09-25\"},\n        messages=[\n            {\n                \"role\": \"user\",\n                \"content\": [\n                    load_document(file_path),\n                    {\"type\": \"text\", \"text\": \"Extract the invoice data from this document.\"},\n                ],\n            }\n        ],\n    )\n\n    for block in response.content:\n        if block.type == \"tool_use\":\n            return Invoice.model_validate(block.input)\n\n    raise RuntimeError(\n   
     \"Claude did not return a tool_use block. Verify tool_choice name matches tool_def name.\"\n    )\n```\n\nThe loop over `response.content` guards against Claude occasionally prepending a short `text` block before the `tool_use` block, even when the tool is forced.\n\n## Step 4: Run the pipeline\n\n``` python\n# main.py\nfrom extractor import extract_invoice\n\ninvoice = extract_invoice(\"sample_invoice.pdf\")  # also accepts .jpg, .png, .webp\n\nprint(f\"Vendor:    {invoice.vendor_name}\")\nprint(f\"Invoice #: {invoice.invoice_number}\")\nprint(f\"Total:     ${invoice.total_amount:.2f}\")\nfor item in invoice.line_items:\n    print(f\"  {item.description}: {item.quantity} x ${item.unit_price:.2f}\")\n```\n\n## Verify it works\n\nRun `python main.py` against any invoice PDF or image. Example output (your values will depend on the invoice):\n\n```\nVendor:    Acme Supplies Co.\nInvoice #: INV-2024-0042\nTotal:     $1348.00\n  Widget A: 10.0 x $89.90\n  Widget B: 5.0 x $99.80\n```\n\nIf Pydantic raises a `ValidationError`, the message pinpoints exactly which field came back in an unexpected shape. That's the contract at work: silent bad data becomes an explicit failure that names the offending field.\n\nTo inspect the inlined schema that Claude actually receives:\n\n``` python\nimport json\nfrom models import Invoice\nfrom extractor import inline_refs\nprint(json.dumps(inline_refs(Invoice.model_json_schema()), indent=2))\n```\n\n## Troubleshooting\n\n**ValidationError on invoice_date**\nClaude returned \"January 5, 2024\" instead of ISO format. Tighten the field description to `\"strict ISO 8601, e.g. 2024-01-05\"`, or add a Pydantic `field_validator` to normalize common date string formats.\n\n**anthropic.BadRequestError on the request**\nTwo common causes. If the error mentions the PDF or document type, check that you're using `claude-3-5-sonnet-20241022` or a newer model and that `extra_headers={\"anthropic-beta\": \"pdfs-2024-09-25\"}` is present on the call. 
If the error mentions the tool schema, confirm you're passing `inline_refs(Invoice.model_json_schema())` rather than the raw schema, which contains `$ref` entries the API will reject.\n\n**RuntimeError: Claude did not return a tool_use block**\nThe tool name in `tool_choice` must match `tool_def[\"name\"]` exactly. A mismatch causes Claude to fall back to a plain text answer. Double-check for typos.\n\n**Extraction stops mid-invoice on large files**\nThe `max_tokens=1024` set in `extract_invoice` can be too small for invoices with many line items. Increase it to `2048` or `4096`. Claude 3.5 Sonnet handles PDFs up to roughly 100 pages; beyond that, use `pypdf` (`pip install pypdf`) to slice pages before encoding.\n\n## Next steps\n\n- **Make it generic**: parameterize `extract_invoice` with a `type[T]` bound to `BaseModel` so one function handles any schema.\n- **Async batch processing**: swap in `anthropic.AsyncAnthropic()` and run extractions with `asyncio.gather()` for throughput.\n- **Self-healing extraction**: catch `ValidationError`, format the error message, and send it back to Claude as a follow-up user turn asking it to correct the specific fields.\n- **Confidence flags**: add `Optional[float]` confidence fields to your model and ask Claude to populate them; route low-confidence results to a human review queue.\n- Anthropic's [tool use documentation](https://docs.anthropic.com/en/docs/tool-use) covers parallel tool calls and multi-turn tool workflows.\n\n[Priya Nair](https://www.devclubhouse.com/u/priya_nair) · AI & Developer Experience Writer\n\nPriya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. 
She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.", "url": "https://wpnews.pro/news/extract-structured-data-from-documents-with-claude-tool-use-and-pydantic", "canonical_source": "https://www.devclubhouse.com/a/extract-structured-data-from-documents-with-claude-tool-use-and-pydantic", "published_at": "2026-06-23 07:39:25+00:00", "updated_at": "2026-06-24 00:15:25.068437+00:00", "lang": "en", "topics": ["ai-tools", "developer-tools", "large-language-models"], "entities": ["Claude", "Anthropic", "Pydantic", "Priya Nair"], "alternates": {"html": "https://wpnews.pro/news/extract-structured-data-from-documents-with-claude-tool-use-and-pydantic", "markdown": "https://wpnews.pro/news/extract-structured-data-from-documents-with-claude-tool-use-and-pydantic.md", "text": "https://wpnews.pro/news/extract-structured-data-from-documents-with-claude-tool-use-and-pydantic.txt", "jsonld": "https://wpnews.pro/news/extract-structured-data-from-documents-with-claude-tool-use-and-pydantic.jsonld"}}