# Extract Structured Data from Documents with Claude Tool Use and Pydantic

> Source: <https://www.devclubhouse.com/a/extract-structured-data-from-documents-with-claude-tool-use-and-pydantic>
> Published: 2026-06-23 07:39:25+00:00


Build a Python pipeline that sends invoice PDFs or images to Claude, forces a structured response via tool use, and returns a fully validated Pydantic model.

[Priya Nair](https://www.devclubhouse.com/u/priya_nair)

## What you'll build

A Python pipeline that sends invoices (PDFs or images) to Claude, forces a structured response via tool use, and returns a validated Pydantic model. Swap the schema for any document type: receipts, contracts, medical forms.

## Prerequisites

- Python 3.10+
- `ANTHROPIC_API_KEY` in your environment (get one at [console.anthropic.com](https://console.anthropic.com))
- Familiarity with Pydantic v2 models

```
pip install "anthropic>=0.40.0" "pydantic>=2.0"
```

macOS/Linux: `export ANTHROPIC_API_KEY=sk-ant-...`. Windows: `set ANTHROPIC_API_KEY=sk-ant-...`.

## Step 1: Define your Pydantic model

The schema you define here becomes the contract between Claude and your code. Field descriptions are included in the JSON Schema sent to Claude and serve as extraction hints.

``` python
# models.py
from pydantic import BaseModel, Field
from typing import Optional

class LineItem(BaseModel):
    description: str = Field(description="Product or service name")
    quantity: float
    unit_price: float = Field(description="Price per unit before tax")
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    invoice_date: str = Field(description="Date in YYYY-MM-DD format")
    line_items: list[LineItem]
    subtotal: float
    tax_amount: Optional[float] = None
    total_amount: float = Field(description="Final amount due including tax")
```

Because `Invoice` contains a nested `LineItem` model, Pydantic v2's `model_json_schema()` generates a schema with `$defs` and `$ref` entries. Claude's API rejects those references, so Step 3 includes a helper to inline them before the schema is sent.
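To see the references described here, the following minimal sketch prints where Pydantic places them. It uses trimmed two-field stand-ins for the models above (an assumption made to keep the example short), not the full `Invoice`.

```python
from pydantic import BaseModel


class LineItem(BaseModel):  # trimmed stand-in for the LineItem model above
    description: str
    total: float


class Invoice(BaseModel):  # trimmed stand-in for the Invoice model above
    invoice_number: str
    line_items: list[LineItem]


schema = Invoice.model_json_schema()

# Nested models are hoisted into a top-level "$defs" table...
print(sorted(schema["$defs"]))
# ...and each place that uses one points back at the table with a "$ref".
print(schema["properties"]["line_items"]["items"])
```

The second `print` shows the pointer that has to be replaced by the full `LineItem` definition before the schema is sent.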

## Step 2: Build the document loader

Claude accepts PDFs as `document` content blocks and images as `image` content blocks. Both use base64 encoding. PDF support requires `claude-3-5-sonnet-20241022` or later.

``` python
# loader.py
import base64
from pathlib import Path

MEDIA_TYPES = {
    ".pdf":  "application/pdf",
    ".png":  "image/png",
    ".jpg":  "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif":  "image/gif",
}

def load_document(file_path: str) -> dict:
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix not in MEDIA_TYPES:
        raise ValueError(f"Unsupported file type: {suffix}")

    data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    media_type = MEDIA_TYPES[suffix]

    if suffix == ".pdf":
        return {
            "type": "document",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

## Step 3: Wire Claude tool use to your model

Two things to know before the code. First, Claude's API rejects tool `input_schema` values containing `$ref` or `$defs`, so `inline_refs` resolves all references before the schema leaves your process. Second, the PDF `document` block type is in public beta and requires an `anthropic-beta` header on every request that includes one.

Setting `tool_choice` to `{"type": "tool", "name": "..."}` forces Claude to call your specific tool instead of answering in prose. The response then contains a `tool_use` content block whose `input` is a plain dict, which you can hand straight to `model_validate`.

``` python
# extractor.py
import anthropic
from models import Invoice
from loader import load_document

client = anthropic.Anthropic()

def inline_refs(schema: dict) -> dict:
    """Recursively resolve $ref/$defs so Claude's API accepts the schema."""
    def resolve(node, defs):
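        if isinstance(node, dict):
            if "$ref" in node:
                # "#/$defs/LineItem" -> "LineItem"
                def_name = node["$ref"].split("/")[-1]
                inlined = resolve(defs[def_name], defs)
                # Keep sibling keys (such as a field-level description) instead of dropping them.
                siblings = {k: resolve(v, defs) for k, v in node.items() if k != "$ref"}
                return {**inlined, **siblings}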
            return {k: resolve(v, defs) for k, v in node.items()}
        elif isinstance(node, list):
            return [resolve(item, defs) for item in node]
        return node

    defs = schema.get("$defs", {})
    resolved = resolve(schema, defs)
    resolved.pop("$defs", None)
    return resolved

def extract_invoice(file_path: str) -> Invoice:
    tool_def = {
        "name": "extract_invoice",
        "description": "Extract all invoice fields from the provided document.",
        "input_schema": inline_refs(Invoice.model_json_schema()),
    }

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[tool_def],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        extra_headers={"anthropic-beta": "pdfs-2024-09-25"},
        messages=[
            {
                "role": "user",
                "content": [
                    load_document(file_path),
                    {"type": "text", "text": "Extract the invoice data from this document."},
                ],
            }
        ],
    )

    for block in response.content:
        if block.type == "tool_use":
            return Invoice.model_validate(block.input)

    raise RuntimeError(
        "Claude did not return a tool_use block. Verify tool_choice name matches tool_def name."
    )
```

The loop over `response.content` guards against Claude occasionally prepending a short `text` block before the `tool_use` block, even when the tool is forced.
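The same loop can also check `stop_reason` before trusting the tool input, since a response cut off by `max_tokens` may carry incomplete data. The sketch below is one way to do that; `first_tool_input` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins that only mimic the attribute layout of a real response so the example runs without an API call.

```python
from types import SimpleNamespace


def first_tool_input(response, tool_name: str) -> dict:
    """Return the input dict of the first tool_use block named tool_name."""
    # Fail early on truncated output instead of validating partial data.
    if response.stop_reason == "max_tokens":
        raise RuntimeError("Response hit max_tokens; raise the limit and retry.")
    for block in response.content:
        # Skip any text blocks that precede the tool call.
        if block.type == "tool_use" and block.name == tool_name:
            return block.input
    raise RuntimeError(f"No tool_use block named {tool_name!r} in the response.")


# Stand-in response: a short text block followed by the forced tool call.
fake_response = SimpleNamespace(
    stop_reason="tool_use",
    content=[
        SimpleNamespace(type="text", text="Here is the extracted data."),
        SimpleNamespace(type="tool_use", name="extract_invoice", input={"vendor_name": "Acme"}),
    ],
)

print(first_tool_input(fake_response, "extract_invoice"))
```

In `extract_invoice`, the returned dict would then go to `Invoice.model_validate`.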

## Step 4: Run the pipeline

``` python
# main.py
from extractor import extract_invoice

invoice = extract_invoice("sample_invoice.pdf")  # also accepts .jpg, .png, .webp

print(f"Vendor:    {invoice.vendor_name}")
print(f"Invoice #: {invoice.invoice_number}")
print(f"Total:     ${invoice.total_amount:.2f}")
for item in invoice.line_items:
    print(f"  {item.description}: {item.quantity} x ${item.unit_price:.2f}")
```

## Verify it works

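Run `python main.py` with an invoice saved as `sample_invoice.pdf`, or edit the path in `main.py` to point at your own PDF or image. For a sample invoice, the output looks like this (your values will differ):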

```
Vendor:    Acme Supplies Co.
Invoice #: INV-2024-0042
Total:     $1348.00
  Widget A: 10.0 x $89.90
  Widget B: 5.0 x $99.80
```

If Pydantic raises a `ValidationError`, the message pinpoints exactly which field came back in an unexpected shape. That's the contract at work: silent bad data becomes an explicit failure that names the offending field.
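As an illustration of how a `ValidationError` reports its problems, the sketch below validates a deliberately broken payload and lists the affected fields. `Totals` is a hypothetical two-field stand-in for `Invoice`, used only to keep the example small.

```python
from pydantic import BaseModel, ValidationError


class Totals(BaseModel):  # hypothetical stand-in for Invoice
    invoice_number: str
    total_amount: float


# One field is missing and the other cannot be parsed as a number.
bad_payload = {"total_amount": "twelve hundred"}

problems = []
try:
    Totals.model_validate(bad_payload)
except ValidationError as exc:
    # errors() returns one dict per problem; "loc" is the path to the field.
    for err in exc.errors():
        field = ".".join(str(part) for part in err["loc"])
        problems.append((field, err["msg"]))

for field, message in problems:
    print(f"{field}: {message}")
```

For nested models the `loc` path includes the list index, so a bad line item shows up as something like `line_items.2.quantity`.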

To inspect the inlined schema that Claude actually receives:

``` python
import json
from models import Invoice
from extractor import inline_refs
print(json.dumps(inline_refs(Invoice.model_json_schema()), indent=2))
```

## Troubleshooting

**ValidationError on invoice_date**
Claude returned "January 5, 2024" instead of ISO format. Tighten the field description to `"strict ISO 8601, e.g. 2024-01-05"`, or add a Pydantic `field_validator` to normalize common date string formats.

**anthropic.BadRequestError on the request**
Two common causes. If the error mentions the PDF or document type, check that you're using `claude-3-5-sonnet-20241022` or a newer model and that `extra_headers={"anthropic-beta": "pdfs-2024-09-25"}` is present on the call. If the error mentions the tool schema, confirm you're passing `inline_refs(Invoice.model_json_schema())` rather than the raw schema, which contains `$ref` entries the API will reject.

**RuntimeError: Claude did not return a tool_use block**
The tool name in `tool_choice` must match `tool_def["name"]` exactly. A mismatch causes Claude to fall back to a plain text answer. Double-check for typos.

**Extraction stops mid-invoice on large files**
The `max_tokens=1024` set in Step 3 can be too small for invoices with many line items. Increase it to `2048` or `4096`. Claude 3.5 Sonnet handles PDFs up to roughly 100 pages; beyond that, use `pypdf` (`pip install pypdf`) to slice pages before encoding.
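For the `invoice_date` case at the top of this section, a normalization step might look like the sketch below. The function name and the list of accepted formats are assumptions; extend the formats to match what your documents actually contain. Month-name parsing with `strptime` also depends on the process locale, so this assumes an English locale.

```python
from datetime import datetime

# Formats tried in order (an assumed starting set, not an exhaustive one).
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%b %d, %Y")


def normalize_invoice_date(value: str) -> str:
    """Return value as YYYY-MM-DD, or raise ValueError if no format matches."""
    cleaned = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized date format: {value!r}")


print(normalize_invoice_date("January 5, 2024"))
```

A `field_validator("invoice_date", mode="before")` on `Invoice` could call this function, so the model stores ISO dates whichever of these formats Claude returns and still fails loudly on anything else.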

## Next steps

- **Make it generic**: parameterize `extract_invoice` with a `type[T]` bound to `BaseModel` so one function handles any schema.
- **Async batch processing**: swap in `anthropic.AsyncAnthropic()` and run extractions with `asyncio.gather()` for throughput.
- **Self-healing extraction**: catch `ValidationError`, format the error message, and send it back to Claude as a follow-up user turn asking it to correct the specific fields.
- **Confidence flags**: add `Optional[float]` confidence fields to your model and ask Claude to populate them; route low-confidence results to a human review queue.
- Anthropic's [tool use documentation](https://docs.anthropic.com/en/docs/tool-use) covers parallel tool calls and multi-turn tool workflows.
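For the async batch idea in the list above, the concurrency pattern can be sketched without touching the API. Here `extract_one` is a hypothetical placeholder for an async version of `extract_invoice`, and the limit of 4 concurrent calls is an arbitrary assumption; pick a value that fits your rate limits.

```python
import asyncio


async def extract_one(path: str) -> dict:
    """Hypothetical stand-in for an async extraction call."""
    await asyncio.sleep(0)  # placeholder for the awaited API request
    if path.endswith(".txt"):
        raise ValueError(f"Unsupported file type: {path}")
    return {"source": path}


async def extract_many(paths: list[str], limit: int = 4) -> list:
    """Run extractions concurrently, at most `limit` at a time."""
    gate = asyncio.Semaphore(limit)

    async def guarded(path: str):
        async with gate:
            return await extract_one(path)

    # return_exceptions=True returns failures in place as exception objects,
    # so one bad file does not hide the results of the others.
    return await asyncio.gather(*(guarded(p) for p in paths), return_exceptions=True)


results = asyncio.run(extract_many(["a.pdf", "b.png", "notes.txt"]))
for item in results:
    print("failed:" if isinstance(item, Exception) else "ok:", item)
```

Results come back in the same order as the input paths, which makes it straightforward to pair each failure with the file that caused it.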

[Priya Nair](https://www.devclubhouse.com/u/priya_nair) · AI & Developer Experience Writer

Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.

