Pydantic AI: Typed, Testable Agents for Engineers Who Like Guarantees


You ship an agent that resolves billing disputes. It works in the demo. Two weeks later a support ticket lands: the agent tried to refund $4,000 on a $19 charge. You read the trace. The model returned a JSON blob, your code did json.loads, pulled amount, and passed it straight to the payments API. No cap. No type. No check. The model hallucinated a number and your code trusted it.

The model is stochastic. Your code does not have to be. The gap between those two facts is where most production agent bugs live, and it is exactly the gap Pydantic AI is built to close.

Most agent frameworks hand you an Agent object and a bag of strings. Pydantic AI hands you Agent[Deps, Output]: a generic parameterized by its dependency type and its output type. The IDE and your type checker read those parameters. The runtime validates against the output type as well.

Install pulls in the framework plus an optional tracing extra:

pip install "pydantic-ai[logfire]"

The smallest program that earns its keep:

from dataclasses import dataclass
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

@dataclass
class Deps:
    customer_name: str

class SupportReply(BaseModel):
    reply: str
    escalate: bool

agent = Agent(
    "anthropic:claude-opus-4-8",
    deps_type=Deps,
    output_type=SupportReply,
    system_prompt="You are a support agent.",
)

A tool is a plain function whose type hints become the schema the model sees, and the run returns the validated SupportReply:

@agent.tool
def customer_name(ctx: RunContext[Deps]) -> str:
    return ctx.deps.customer_name

result = agent.run_sync(
    "What is my name?",
    deps=Deps(customer_name="Ana"),
)
print(result.output.reply)
print(result.output.escalate)

Three things are load-bearing there. deps_type declares what the agent needs from you. output_type declares what it must return. @agent.tool wraps a plain Python function and reads its type hints to build the tool schema the model sees.
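To make "reads its type hints" concrete, here is a minimal sketch of turning a function signature into a parameter schema. It is an illustration built on inspect and Pydantic's create_model, not Pydantic AI's internal code. The refund_preview tool and the parameters_schema helper are hypothetical names invented for this example.

```python
import inspect
from typing import Any, get_type_hints

from pydantic import create_model


def refund_preview(charge_id: str, amount_cents: int, notify: bool = False) -> str:
    """Hypothetical tool, used only for this illustration."""
    return f"{charge_id}:{amount_cents}:{notify}"


def parameters_schema(func) -> dict[str, Any]:
    """Build a JSON Schema for a function's parameters from its type hints."""
    hints = get_type_hints(func)
    fields = {}
    for name, param in inspect.signature(func).parameters.items():
        # Ellipsis marks a field as required; otherwise keep the default.
        default = ... if param.default is inspect.Parameter.empty else param.default
        fields[name] = (hints[name], default)
    model = create_model(f"{func.__name__}_params", **fields)
    return model.model_json_schema()


schema = parameters_schema(refund_preview)
print(schema["required"])
print(schema["properties"]["amount_cents"]["type"])
```

The point is the direction of flow: the annotations you already write for the type checker are enough to generate a machine-readable contract, with no second schema to keep in sync.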

Pydantic AI ships no implicit default model, so you always pass a model string. This post reaches for Anthropic's Claude for a reason: it follows tool schemas closely and returns well-formed structured output, which is precisely what the validation layer below leans on.

When the model returns something that does not parse into SupportReply, Pydantic AI does not hand you a broken object. It catches the ValidationError, formats it, and sends it back to the model as a correction request. You get a validated object or a clean exception, never a string with a JSON fence stuck to it.
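It helps to see what such a ValidationError carries. The sketch below uses plain Pydantic, with no agent involved, on a hypothetical payload that is missing a field. How Pydantic AI words the correction it sends back is not shown here; this only shows the raw material it has to work with.

```python
from pydantic import BaseModel, ValidationError


class SupportReply(BaseModel):
    reply: str
    escalate: bool


# Hypothetical raw model output: valid JSON, but "escalate" is missing.
bad_payload = '{"reply": "Your name is Ana."}'

error_report = None
try:
    SupportReply.model_validate_json(bad_payload)
except ValidationError as exc:
    # errors() returns one dict per problem: location, error type, message.
    error_report = exc.errors()

for err in error_report or []:
    print(err["loc"], err["type"], err["msg"])
```

Each entry names the offending field and the kind of failure, which is specific enough for a model to act on in a follow-up turn.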

Push that idea onto the billing agent and the types stop being documentation. They become rails.

from dataclasses import dataclass
from typing import Literal
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext

@dataclass
class BillingDeps:
    customer_id: str
    api_key: str

class BillingAction(BaseModel):
    action: Literal["refund", "retry", "escalate"]
    amount_cents: int = Field(ge=0, le=20_000)
    reason: str

agent = Agent(
    "anthropic:claude-opus-4-8",
    deps_type=BillingDeps,
    output_type=BillingAction,
    system_prompt=(
        "You resolve billing disputes. Refund under "
        "$200, retry on transient failures, escalate "
        "everything else."
    ),
)

The tools receive ctx.deps (stubbed here to return fixed values), and their signatures and docstrings become the tool definitions the model reads:

@agent.tool
def last_charge(ctx: RunContext[BillingDeps]) -> int:
    """Return the last charge in cents."""
    return 1899

@agent.tool
def charge_status(
    ctx: RunContext[BillingDeps],
) -> Literal["ok", "failed", "pending"]:
    """Return the status of the last charge."""
    return "failed"

Run it, and the output is a validated BillingAction or an exception, never a raw string:

result = agent.run_sync(
    "My card was charged but the order never "
    "shipped. Fix it.",
    deps=BillingDeps(customer_id="cus_123", api_key="..."),
)
assert isinstance(result.output, BillingAction)
print(result.output.action, result.output.amount_cents)

Every annotation is doing work. output_type=BillingAction guarantees the return is a BillingAction or an exception. Literal["refund", "retry", "escalate"] closes the action set so the model cannot invent a fourth. Field(ge=0, le=20_000) caps the refund at two hundred dollars in the type system, not in a post-hoc check you will forget to write. The tool signatures carry the same discipline: charge_status is annotated to return only "ok", "failed", or "pending", so your type checker rejects any other status before a tool result ever reaches the model.
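One way to check that the constraints are more than comments is to print the JSON Schema Pydantic derives from the output model. This is a sketch using plain Pydantic. Exactly how the schema is forwarded to a given provider is up to the framework and is not shown.

```python
from typing import Literal

from pydantic import BaseModel, Field


class BillingAction(BaseModel):
    action: Literal["refund", "retry", "escalate"]
    amount_cents: int = Field(ge=0, le=20_000)
    reason: str


schema = BillingAction.model_json_schema()
action_schema = schema["properties"]["action"]
amount_schema = schema["properties"]["amount_cents"]

# The Literal becomes a closed enum; ge/le become numeric bounds.
print(action_schema["enum"])
print(amount_schema["minimum"], amount_schema["maximum"])
```

The closed action set and the refund cap both appear in the schema itself, so the contract is stated up front instead of living only in validation code.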

The $4,000 refund from the opening cannot happen here. It fails validation before it reaches your payments code, and the model gets one shot to correct itself.
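You can replay the opening incident against the model directly. The sketch below feeds three hypothetical payloads through BillingAction with plain Pydantic. The parse_action helper is an illustrative name, not part of any library.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class BillingAction(BaseModel):
    action: Literal["refund", "retry", "escalate"]
    amount_cents: int = Field(ge=0, le=20_000)
    reason: str


def parse_action(raw: dict) -> Optional[BillingAction]:
    """Return a validated action, or None if the payload breaks the contract."""
    try:
        return BillingAction.model_validate(raw)
    except ValidationError:
        return None


# The $4,000 refund: far above the 20_000 cent cap.
hallucinated = parse_action(
    {"action": "refund", "amount_cents": 400_000, "reason": "customer upset"}
)
# An action outside the closed Literal set.
invented = parse_action(
    {"action": "chargeback", "amount_cents": 1899, "reason": "dispute"}
)
# A payload that respects every constraint.
valid = parse_action(
    {"action": "refund", "amount_cents": 1899, "reason": "order never shipped"}
)

print(hallucinated, invented, valid)
```

Only the last payload produces an object your payments code could ever receive.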

That correction loop is worth respecting before you depend on it. On a validation failure, the framework formats the error and posts it back to the model as a tool-call-style correction. The model gets a configurable number of retries: set retries on the Agent.

Most of the time this is what you want. Sometimes it is not. On a mistyped field name the model can burn three retries guessing at the schema, returning the same wrong shape each time and running up your token bill. Watch the retry count in your traces. If you see the same validation error repeating, the prompt is the bug, not the retry ceiling.
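The shape of that loop, and of its failure mode, fits in a few lines. What follows is a hand-rolled sketch of the idea, not Pydantic AI's implementation. The two fake models, run_with_correction, and RetriesExhausted are hypothetical names made up for the illustration.

```python
import json

from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class BillingAction(BaseModel):
    action: Literal["refund", "retry", "escalate"]
    amount_cents: int = Field(ge=0, le=20_000)
    reason: str


class RetriesExhausted(Exception):
    def __init__(self, history):
        super().__init__(f"gave up after {len(history)} attempts")
        self.history = history  # one error signature per failed attempt


def run_with_correction(model, prompt, retries=1):
    """Call model(prompt, last_error) until the output validates."""
    last_error, history = None, []
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        raw = model(prompt, last_error)
        try:
            return BillingAction.model_validate_json(raw), attempt
        except ValidationError as exc:
            last_error = str(exc)  # fed back as the correction
            history.append(tuple((e["loc"], e["type"]) for e in exc.errors()))
    raise RetriesExhausted(history)


def self_correcting_model(prompt, last_error):
    # Over the cap on the first try, fixed once it sees the error.
    if last_error is None:
        return json.dumps({"action": "refund", "amount_cents": 400_000, "reason": "apology"})
    return json.dumps({"action": "escalate", "amount_cents": 0, "reason": "over cap"})


def stubborn_model(prompt, last_error):
    # Mistyped field name, repeated no matter what the error says.
    return json.dumps({"action": "refund", "amount": 1899, "reason": "shipping"})


action, attempts = run_with_correction(self_correcting_model, "refund me", retries=1)

stubborn_history = []
try:
    run_with_correction(stubborn_model, "refund me", retries=3)
except RetriesExhausted as exc:
    stubborn_history = exc.history

print(action.action, attempts, len(stubborn_history), len(set(stubborn_history)))
```

The stubborn case is the one to look for in traces: several attempts, one distinct error signature. Raising the ceiling only buys more copies of the same failure.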

Here is where the type discipline turns into something you feel every day. An agent is a function from input to output with a network call and a nondeterministic model in the middle. That normally makes it miserable to test. Pydantic AI ships two tools that make it ordinary.

TestModel runs the agent end to end without any network call. It inspects your output schema, generates data that satisfies it, and calls every tool once. It is the "does this wire up at all" test.

from pydantic_ai.models.test import TestModel

def test_billing_agent_wiring():
    with agent.override(model=TestModel()):
        result = agent.run_sync(
            "charged twice",
            deps=BillingDeps(customer_id="c", api_key="k"),
        )
    assert isinstance(result.output, BillingAction)
    assert 0 <= result.output.amount_cents <= 20_000

No API key. No latency. No token spend. The test asserts the contract holds (a BillingAction with an amount inside the capped range) and runs in milliseconds in CI.

When you need to pin exact behavior, FunctionModel lets you script what the model returns for a given set of messages:

from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import (
    ModelMessage,
    ModelResponse,
    ToolCallPart,
)

def always_escalate(
    messages: list[ModelMessage], info: AgentInfo
) -> ModelResponse:
    args = {
        "action": "escalate",
        "amount_cents": 0,
        "reason": "policy",
    }
    return ModelResponse(
        parts=[ToolCallPart("final_result", args)]
    )

def test_escalation_path():
    with agent.override(model=FunctionModel(always_escalate)):
        result = agent.run_sync(
            "refund $5000 now",
            deps=BillingDeps(customer_id="c", api_key="k"),
        )
    assert result.output.action == "escalate"

You are testing your own logic: the tool wiring, the dependency injection, the validation. The model is held fixed. The stochastic part is mocked out, so the test is deterministic and fast. This is the same discipline you already apply to a database or an HTTP client: swap the real dependency for a fake at the boundary. agent.override is that boundary.

Static types cannot make a language model deterministic. What they can do is bound its output before that output touches anything that costs money or mutates state. The model proposes; the type system disposes. Literal closes a set. Field clamps a range. output_type refuses a malformed shape. Everything the model returns passes through a gate you defined in Python, checked by Pyright before you ship and by Pydantic at runtime.

For a shop already living in Pydantic — most FastAPI backends in 2026 — the payoff is that agents start to feel like routes. Same type hints, same IDE support, same validation contract, same test ergonomics. The agent is no longer a special, scary thing bolted onto the side of the system. It is another typed function you can reason about.

Start with one agent. Give it an output_type with a Field constraint on the one value that could hurt you if the model got it wrong. Write a TestModel test for it. Ship that. You will have closed the exact gap that produced the $4,000 refund, and you will have a test that proves it stays closed.

If you want the wider map — how typed agents sit next to the other frameworks, how to trace them once they are running, and how to keep their cost honest — that is what The AI Engineer's Library covers. Agents in Production walks the framework landscape and the patterns for building and shipping multi-step agents; Observability for LLM Applications is the companion on tracing, evals, and cost. Both aim at the same thing this post does: agents you can trust because you can see and constrain what they do.
