[ARTICLE · art-38942] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Pydantic passed. Types matched. The downstream system still got garbage.

A developer at a company building a contract-extraction agent using Pydantic schemas with Claude 3.5 Sonnet and GPT-4o/4.5 encountered three production failures that appeared unrelated but stemmed from the same root cause: schema validation ensures syntactic correctness but not semantic meaning. The failures included paraphrased clause text passing validation but failing downstream exact-string matching, unlimited retry logic causing billing incidents, and model upgrades introducing regressions in nested field handling. The developer implemented fixes including a second-pass semantic check, capped retries with human escalation, and shadow evaluation for model changes, concluding that structured-output validation is syntax validation and semantic validation requires a separate layer.

3 min read · published Jun 25, 2026

I want to walk through three production failures on the same contract-extraction agent, because they looked unrelated at the time and turned out to be the same problem wearing different clothes. My claim, stated up front so you can disagree with it early: schema validation tells you the grammar is correct and nothing about whether the meaning is. Those are two different jobs, and most teams (mine included, for a while) only build the first one.

The extractor used Claude 3.5 Sonnet with Pydantic schemas. A termination_clauses field accepted list[str]. Validation passed every time. The trouble was that the model returned paraphrases, not verbatim clause text, and the downstream tool did exact-string matching against a database. Paraphrases never matched.

Pydantic had no way to catch this. The schema said list[str]. Strings arrived. Valid. The fix was a second-pass semantic check (a model call with a rubric asking, in effect, "are these strings verbatim from the source?"). Success on that field moved from 61% to 94%.
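The fix described above was a model call with a rubric. As a complement, a deterministic pre-check can catch the obvious paraphrases before any second model call is spent. The following is a minimal sketch of that swapped-in technique, not the rubric-based check from the article; the function names and the whitespace-normalization rule are assumptions made for illustration.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def non_verbatim_clauses(clauses: list[str], source_text: str) -> list[str]:
    """Return the extracted clauses that do not appear verbatim in the source.

    Hypothetical helper: an empty clause is treated as a failure, because an
    empty string is trivially a substring of any document.
    """
    haystack = _normalize(source_text)
    return [
        clause
        for clause in clauses
        if not _normalize(clause) or _normalize(clause) not in haystack
    ]


# Illustrative data, not taken from the article.
source = "Either party may terminate this Agreement\nupon thirty (30) days written notice."
extracted = [
    "Either party may terminate this Agreement upon thirty (30) days written notice.",
    "Both parties can end the contract with 30 days notice.",  # a paraphrase
]
failures = non_verbatim_clauses(extracted, source)
```

Here `failures` holds only the paraphrased string. A check like this only answers "is it a substring?"; it says nothing about whether the right clause was picked, which is where a rubric-based pass would still be needed.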

Lesson: structured-output validation is syntax validation. Semantic validation is a separate layer (and you have to build it on purpose).

Retry logic ran through tenacity. One customer's documents carried a dual-signatory clause with an optional co-signer. The schema expected co_signer: Optional[str]; the model kept returning nested objects instead. Each retry cost about $0.04, and on the worst documents that compounded past $2 each before anything escalated.

Two changes: cap retries at 5 with escalation to human review, and audit any new document type before it hits production.
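For scale: at about $0.04 per retry, passing $2 on one document means more than 50 retries, while a cap of five calls bounds the spend at roughly $0.20. The sketch below shows one way to cap attempts and escalate, written as a plain loop so it runs without dependencies (with tenacity, `stop_after_attempt(5)` plays the same role). It treats the cap as five total calls, which is an assumption, and `ValidationFailed`, `ExtractionOutcome`, and `extract_with_cap` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 5            # cap from the article, read here as total calls
COST_PER_CALL_USD = 0.04    # approximate per-retry cost quoted in the article


class ValidationFailed(Exception):
    """Hypothetical stand-in for a schema validation error."""


@dataclass
class ExtractionOutcome:
    result: Optional[dict]
    attempts: int
    escalated: bool       # True means: route this document to human review
    spend_usd: float


def extract_with_cap(
    call_model: Callable[[], dict], max_attempts: int = MAX_ATTEMPTS
) -> ExtractionOutcome:
    """Call the model until it validates or the attempt budget is used up."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = call_model()
        except ValidationFailed:
            continue  # try again, but only while budget remains
        return ExtractionOutcome(
            result, attempt, False, round(attempt * COST_PER_CALL_USD, 2)
        )
    # Budget exhausted: stop spending and hand the document to a person.
    return ExtractionOutcome(
        None, max_attempts, True, round(max_attempts * COST_PER_CALL_USD, 2)
    )


def always_invalid() -> dict:
    raise ValidationFailed("co_signer arrived as a nested object")


outcome = extract_with_cap(always_invalid)
```

With a call that never validates, `outcome.escalated` is true after five attempts and the loop stops, so the worst-case cost per document is known at design time.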

Lesson: unlimited retry logic on validation failures is a latent billing incident (it just hasn't billed you yet).

We moved from GPT-4o to GPT-4.5. Success on party_obligations (a field that needs three-level nesting for conditional logic) fell from 91% to 73%. The newer model handled ambiguous cases with flatter structures. Valid JSON, wrong nesting, Pydantic waved it through, downstream broke quietly.

The fix was shadow evaluation after any upgrade: run old and new models against the same production documents, and flag any field where agreement drops below 95% before shipping.
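A minimal sketch of the per-field agreement check this describes, assuming both models' outputs are already parsed into dictionaries and that "agreement" means exact equality of the field value (a strict choice that suits structured fields more than free text). The function names and the sample data are hypothetical.

```python
from collections import defaultdict

AGREEMENT_THRESHOLD = 0.95  # threshold from the article
_MISSING = object()         # distinguishes "field absent" from "field is None"


def field_agreement(old_outputs: list[dict], new_outputs: list[dict]) -> dict[str, float]:
    """Fraction of documents on which old and new models agree, per field."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("both models must be run on the same documents")
    matches: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for old, new in zip(old_outputs, new_outputs):
        for field in old.keys() | new.keys():
            totals[field] += 1
            if old.get(field, _MISSING) == new.get(field, _MISSING):
                matches[field] += 1
    return {field: matches[field] / totals[field] for field in totals}


def flagged_fields(agreement: dict[str, float], threshold: float = AGREEMENT_THRESHOLD) -> list[str]:
    """Fields whose agreement rate falls below the shipping threshold."""
    return sorted(field for field, rate in agreement.items() if rate < threshold)


# Illustrative data: 20 documents, the new model flattens nesting on 3 of them.
old_runs = [
    {"parties": ["A", "B"], "party_obligations": {"if": {"then": {"pay": True}}}}
    for _ in range(20)
]
new_runs = [dict(doc) for doc in old_runs]
for i in range(3):
    new_runs[i] = {**old_runs[i], "party_obligations": {"if": "pay"}}

agreement = field_agreement(old_runs, new_runs)
flagged = flagged_fields(agreement)
```

Because nested dictionaries compare by structure, a flatter but still valid output counts as a disagreement, which is the kind of regression type validation lets through.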

Lesson: model upgrades are schema-compatibility events (treat them like a dependency bump, not a free swap).

None of these surfaced as a Pydantic error. The schema was valid each time. The real failures were semantic drift, an uncontrolled retry loop, and a model-specific regression. In every case the grammar was fine and the meaning was not, which is precisely the thing type validation cannot see.

What the stack looks like now: Pydantic for syntax, a lightweight evaluator for semantics, DeepEval's correctness metric for the text fields, retries capped, an escalation field on every extraction schema so failure modes are a design-time decision, and a shadow-eval checklist of 200 production documents on any model change.

A pushback I'd accept: "stricter schemas would have caught some of this." Partly true. Enums, discriminated unions, and constrained types genuinely shrink the semantic-validation surface when your domain is stable and bounded. If that's you, lean on them. One I wouldn't accept: "so you don't need eval around structured output." Three production failures, two of them customer escalations, disagree. Stricter types reduce the surface; they do not remove it, and they get brittle the moment a new document shape arrives.

If I'm steelmanning the opposite of my own thesis: maybe the honest read is that I under-specified my schemas and called it a semantics problem to feel better about it. A verbatim-quote field could have been a constrained type backed by a span reference into the source, not a free str. A lot of what I'm calling "semantic validation" is really "validation I was too lazy to encode structurally."
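To make the span-reference idea concrete, here is a minimal sketch under stated assumptions: the field stores character offsets into the source instead of a string, so the clause text is always sliced from the document and cannot be a paraphrase. `SourceSpan` and its offset convention are hypothetical, and whether a model emits correct offsets reliably is a separate question this sketch does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Half-open character range [start, end) into the source document."""
    start: int
    end: int

    def resolve(self, source_text: str) -> str:
        """Return the referenced text, rejecting ranges outside the document."""
        if not (0 <= self.start < self.end <= len(source_text)):
            raise ValueError(
                f"span {self.start}:{self.end} is outside a source of length {len(source_text)}"
            )
        return source_text[self.start:self.end]


# Illustrative data, not taken from the article.
source = (
    "1. Term. This Agreement runs for two years. "
    "2. Termination. Either party may terminate on notice."
)
clause = "Either party may terminate on notice."
start = source.index(clause)
span = SourceSpan(start, start + len(clause))
resolved = span.resolve(source)
```

The resolved text is verbatim by construction, so exact-string matching downstream can only fail if the span points at the wrong place, and that remains a question of meaning that offsets alone do not settle.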

So here's the concession. If you have shipped high-volume extraction without a semantic eval layer and held accuracy above 92% for more than six months, I'd genuinely like to see the schema design, because either you bounded the domain harder than I did, or you encoded meaning into types better than I did. The part I won't give up: somewhere, a field has to assert meaning, and if it isn't your schema doing it, it has to be something downstream of the schema.
