Pydantic passed. Types matched. The downstream system still got garbage.


Source: dev.to · Published: 2026-06-25T07:01Z · Topic: large-language-models


A developer at a company building a contract-extraction agent using Pydantic schemas with Claude 3.5 Sonnet and GPT-4o/4.5 encountered three production failures that appeared unrelated but stemmed from the same root cause: schema validation ensures syntactic correctness but not semantic meaning. The failures included paraphrased clause text passing validation but failing downstream exact-string matching, unlimited retry logic causing billing incidents, and model upgrades introducing regressions in nested field handling. The developer implemented fixes including a second-pass semantic check, capped retries with human escalation, and shadow evaluation for model changes, concluding that structured-output validation is syntax validation and semantic validation requires a separate layer.


I want to walk through three production failures on the same contract-extraction agent, because they looked unrelated at the time and turned out to be the same problem wearing different clothes. My claim, stated up front so you can disagree with it early: schema validation tells you the grammar is correct and nothing about whether the meaning is. Those are two different jobs, and most teams (mine included, for a while) only build the first one.

The extractor used Claude 3.5 Sonnet with Pydantic schemas. A termination_clauses field accepted list[str]. Validation passed every time. The trouble was that the model returned paraphrases, not verbatim clause text, and the downstream tool did exact-string matching against a database. Paraphrases never matched.

Pydantic had no way to catch this. The schema said list[str]. Strings arrived. Valid. The fix was a second-pass semantic check (a model call with a rubric asking, in effect, "are these strings verbatim from the source?"). Success on that field moved from 61% to 94%.

Lesson: structured-output validation is syntax validation. Semantic validation is a separate layer (and you have to build it on purpose).
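For the verbatim case specifically, part of that second layer can be deterministic. The sketch below is not the model-with-rubric check described above; it is a plain substring test that could run before it as a cheap pre-filter. The function names and the sample clause text are made up for illustration, and whitespace normalization is an assumption about how the source documents wrap lines.

```python
import re


def normalize_ws(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not cause false misses."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(source: str, clause: str) -> bool:
    """True if the clause appears word-for-word in the source (empty strings never count)."""
    needle = normalize_ws(clause)
    return bool(needle) and needle in normalize_ws(source)


def non_verbatim_clauses(source: str, clauses: list[str]) -> list[str]:
    """Return the extracted clauses that a list[str] schema accepts but the source does not contain."""
    return [c for c in clauses if not is_verbatim(source, c)]


# Hypothetical source text and extraction output.
source_doc = (
    "Either party may terminate this Agreement\n"
    "upon thirty (30) days written notice."
)
extracted = [
    "Either party may terminate this Agreement upon thirty (30) days written notice.",
    "Both parties can end the contract with 30 days notice.",  # paraphrase
]

flagged = non_verbatim_clauses(source_doc, extracted)
print(flagged)
```

Both strings satisfy list[str]; only the second is flagged. A check like this says nothing about whether the right clause was picked, so it narrows the job of the rubric-based pass without replacing it.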

Retry logic ran through tenacity. One customer's documents carried a dual-signatory clause with an optional co-signer. The schema expected co_signer: Optional[str]; the model kept returning nested objects instead. Each retry was about $0.04, and on the worst documents that compounded past $2 each before anything escalated.

Two changes: cap retries at 5 with escalation to human review, and audit any new document type before it hits production.

Lesson: unlimited retry logic on validation failures is a latent billing incident (it just hasn't billed you yet).
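A minimal sketch of the capped loop, written as a plain for-loop and not with tenacity so it stays self-contained. The names (extract_with_cap, Outcome, ValidationFailed) are hypothetical, counting the first call as one of the five attempts is an assumption, and cost is tracked in whole cents using the rough $0.04 figure above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 5          # cap from the article; counting the first call is an assumption
COST_PER_CALL_CENTS = 4   # roughly $0.04 per call, per the article


class ValidationFailed(Exception):
    """Stand-in for whatever error the schema layer raises."""


@dataclass
class Outcome:
    value: Optional[dict]
    attempts: int
    spent_cents: int
    needs_human_review: bool


def extract_with_cap(call_model: Callable[[], dict]) -> Outcome:
    """Call the model until it validates or the attempt cap is hit, then escalate."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            value = call_model()
        except ValidationFailed:
            continue  # try again, but only while under the cap
        return Outcome(value, attempt, attempt * COST_PER_CALL_CENTS, False)
    # Cap reached: stop spending and hand the document to a person.
    return Outcome(None, MAX_ATTEMPTS, MAX_ATTEMPTS * COST_PER_CALL_CENTS, True)


def always_nested() -> dict:
    # Simulates the model returning a nested object where a string was expected.
    raise ValidationFailed("co_signer must be a string")


result = extract_with_cap(always_nested)
print(result.attempts, result.spent_cents, result.needs_human_review)
```

At about $0.04 a call, passing $2 on one document implies on the order of 50 retries; with the cap, the worst case is bounded at 20 cents and ends in a review flag. In tenacity the equivalent stop condition is stop_after_attempt(5).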

We moved from GPT-4o to GPT-4.5. Success on party_obligations (a field that needs three-level nesting for conditional logic) fell from 91% to 73%. The newer model handled ambiguous cases with flatter structures. Valid JSON, wrong nesting, Pydantic waved it through, downstream broke quietly.

The fix was shadow evaluation after any upgrade: run old and new models against the same production documents, and flag any field where agreement drops below 95% before shipping.

Lesson: model upgrades are schema-compatibility events (treat them like a dependency bump, not a free swap).
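The shadow-evaluation gate can be sketched as a per-field agreement rate between the two models' outputs on the same documents. This is an illustration under assumptions, not the author's harness: agreement here means exact equality of the parsed values, the function names are invented, and governing_law is a hypothetical field added only to show a field that does not regress.

```python
from collections import defaultdict

AGREEMENT_THRESHOLD = 0.95  # flag any field below this, per the article


def field_agreement(old_outputs: list[dict], new_outputs: list[dict]) -> dict[str, float]:
    """Per-field fraction of documents where old and new models returned equal values."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("both runs must cover the same documents")
    matches: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for old, new in zip(old_outputs, new_outputs):
        for field in old.keys() | new.keys():
            totals[field] += 1
            # A field missing from either side counts as a disagreement.
            if field in old and field in new and old[field] == new[field]:
                matches[field] += 1
    return {field: matches[field] / totals[field] for field in totals}


def regressed_fields(agreement: dict[str, float]) -> list[str]:
    """Fields whose agreement rate falls below the shipping threshold."""
    return sorted(f for f, rate in agreement.items() if rate < AGREEMENT_THRESHOLD)


# Two toy documents; the second shows the newer model flattening a nested field.
old_run = [
    {"governing_law": "NY", "party_obligations": {"buyer": {"if_late": {"pay": "fee"}}}},
    {"governing_law": "DE", "party_obligations": {"seller": {"if_defect": {"do": "repair"}}}},
]
new_run = [
    {"governing_law": "NY", "party_obligations": {"buyer": {"if_late": {"pay": "fee"}}}},
    {"governing_law": "DE", "party_obligations": {"seller": "repair if defect"}},
]

agreement = field_agreement(old_run, new_run)
flagged = regressed_fields(agreement)
print(flagged)
```

Both versions of party_obligations are valid JSON, which is why a type check alone stays silent; comparing the parsed structures is what exposes the flattening. Free-text fields would likely need a softer comparison than strict equality.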

None of these surfaced as a Pydantic error. The schema was valid each time. The real failures were semantic drift, an uncontrolled retry loop, and a model-specific regression. In every case the grammar was fine and the meaning was not, which is precisely the thing type validation cannot see.

What the stack looks like now: Pydantic for syntax, a lightweight evaluator for semantics, DeepEval's correctness metric for the text fields, retries capped, an escalation field on every extraction schema so failure modes are a design-time decision, and a shadow-eval checklist of 200 production documents on any model change.

Accept: "stricter schemas would have caught some of this." Partly true. Enums, discriminated unions, and constrained types genuinely shrink the semantic-validation surface when your domain is stable and bounded. If that's you, lean on them. Wouldn't accept: "so you don't need eval around structured output." Three production failures, two of them customer escalations, disagree. Stricter types reduce the surface; they do not remove it, and they get brittle the moment a new document shape arrives.

If I'm steelmanning the opposite of my own thesis: maybe the honest read is that I under-specified my schemas and called it a semantics problem to feel better about it. A verbatim-quote field could have been a constrained type backed by a span reference into the source, not a free str. A lot of what I'm calling "semantic validation" is really "validation I was too lazy to encode structurally."
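One way to read "a span reference into the source" is sketched below: the model returns character offsets and the clause text is read out of the document, so it cannot be a paraphrase. SpanRef and its resolve method are hypothetical names, the sample document is invented, and a real schema might carry this as a Pydantic model with its own validators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanRef:
    """A half-open character range [start, end) into the source document."""
    start: int
    end: int

    def resolve(self, source: str) -> str:
        """Return the referenced text, rejecting ranges that fall outside the source."""
        if not (0 <= self.start < self.end <= len(source)):
            raise ValueError(f"span {self.start}:{self.end} is outside the source")
        return source[self.start:self.end]


# Hypothetical source document and a span pointing at its termination clause.
source_doc = "1. Term. 2. Either party may terminate on 30 days notice. 3. Fees."
needle = "Either party may terminate on 30 days notice."
start = source_doc.index(needle)
span = SpanRef(start, start + len(needle))

clause = span.resolve(source_doc)
print(clause)
```

The resolved text is verbatim by construction, which removes the exact-match failure structurally. A span can still point at the wrong clause, so something still has to check that the selected text is the one that was asked for.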

So here's the concession. If you have shipped high-volume extraction without a semantic eval layer and held accuracy above 92% for more than six months, I'd genuinely like to see the schema design, because either you bounded the domain harder than I did, or you encoded meaning into types better than I did. The part I won't give up: somewhere, a field has to assert meaning, and if it isn't your schema doing it, it has to be something downstream of the schema.


Read original on dev.to → dev.to/james_oconnor_dev/pydantic-passed-types-m…
