# Pydantic passed. Types matched. The downstream system still got garbage.

> Source: <https://dev.to/james_oconnor_dev/pydantic-passed-types-matched-the-downstream-system-still-got-garbage-530b>
> Published: 2026-06-25 07:01:37+00:00

I want to walk through three production failures on the same contract-extraction agent, because they looked unrelated at the time and turned out to be the same problem wearing different clothes. My claim, stated up front so you can disagree with it early: schema validation tells you the grammar is correct and nothing about whether the meaning is. Those are two different jobs, and most teams (mine included, for a while) only build the first one.

The extractor used Claude 3.5 Sonnet with Pydantic schemas. A `termination_clauses` field accepted `list[str]`. Validation passed every time. The trouble was the model returned paraphrases, not verbatim clause text, and the downstream tool did exact-string matching against a database. Paraphrases never matched.

Pydantic had no way to catch this. The schema said `list[str]`. Strings arrived. Valid. The fix was a second-pass semantic check (a model call with a rubric asking, in effect, "are these strings verbatim from the source?"). Success on that field moved from 61% to 94%.

Lesson: structured-output validation is syntax validation. Semantic validation is a separate layer (and you have to build it on purpose).
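For the narrow "is this verbatim?" question, a deterministic check can sit in front of (or beside) the model-based rubric. The sketch below is a substring check, not the rubric call described above; the function names and the whitespace normalization are assumptions made for illustration.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping in the source does not cause false misses."""
    return re.sub(r"\s+", " ", text).strip()


def non_verbatim_clauses(clauses: list[str], source_text: str) -> list[str]:
    """Return the extracted clauses that do not appear verbatim in the source.

    Hypothetical helper: an empty clause is treated as a failure, because an
    empty string is trivially a substring of anything.
    """
    haystack = _normalize(source_text)
    failures = []
    for clause in clauses:
        needle = _normalize(clause)
        if not needle or needle not in haystack:
            failures.append(clause)
    return failures


# Made-up contract text, wrapped across two lines as a PDF extraction might be.
source = (
    "Either party may terminate this Agreement\n"
    "upon thirty (30) days written notice."
)
extracted = [
    "Either party may terminate this Agreement upon thirty (30) days written notice.",
    "Both parties can end the contract with a month's notice.",  # paraphrase
]

# Both strings satisfy list[str]; only the second one fails this check.
print(non_verbatim_clauses(extracted, source))
```

A check like this only answers the verbatim question. Anything fuzzier (is this the right clause, is it complete) still needs the rubric-style evaluator.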

Retry logic ran through tenacity, with no cap on attempts. One customer's documents carried a dual-signatory clause with an optional co-signer. The schema expected `co_signer: Optional[str]`; the model kept returning nested objects instead. Each retry was about $0.04, and on the worst documents that compounded past $2 each before anything escalated.

Two changes: cap retries at 5 with escalation to human review, and audit any new document type before it hits production.

Lesson: unlimited retry logic on validation failures is a latent billing incident (it just hasn't billed you yet).
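A minimal sketch of a capped retry loop with escalation, written in plain Python rather than tenacity so it has no dependencies. `extract_with_cap`, `ExtractionOutcome`, and the stub functions are hypothetical names; the cap of 5 and the rough $0.04 per attempt are the figures from the post. The sketch assumes validation failures surface as `ValueError`.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 5         # cap from the post
COST_PER_ATTEMPT = 0.04  # approximate cost per attempt from the post, in USD


@dataclass
class ExtractionOutcome:
    value: Optional[dict]
    attempts: int
    estimated_cost: float
    needs_human_review: bool


def extract_with_cap(
    extract: Callable[[], dict],
    validate: Callable[[dict], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> ExtractionOutcome:
    """Try up to max_attempts times, then escalate instead of retrying forever."""
    for attempt in range(1, max_attempts + 1):
        candidate = extract()
        try:
            validate(candidate)  # assumed to raise ValueError on a bad payload
        except ValueError:
            continue
        return ExtractionOutcome(candidate, attempt, attempt * COST_PER_ATTEMPT, False)
    # Cap reached: stop spending and hand the document to a person.
    return ExtractionOutcome(None, max_attempts, max_attempts * COST_PER_ATTEMPT, True)


def always_nested() -> dict:
    """Stub model call that keeps returning a nested object for co_signer."""
    return {"co_signer": {"name": "A. Example"}}


def require_str_or_none(payload: dict) -> None:
    """Stub validator standing in for the schema's Optional[str] check."""
    value = payload.get("co_signer")
    if value is not None and not isinstance(value, str):
        raise ValueError("co_signer must be a string or None")


outcome = extract_with_cap(always_nested, require_str_or_none)
print(outcome.attempts, outcome.needs_human_review)
```

At roughly $0.04 per attempt, a cap of 5 bounds the spend at about $0.20 per document. A document that ran past $2 implies something over 50 attempts.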

We moved from `GPT-4o` to `GPT-4.5`. Success on `party_obligations` (a field that needs three-level nesting for conditional logic) fell from 91% to 73%. The newer model handled ambiguous cases with flatter structures. Valid JSON, wrong nesting, Pydantic waved it through, downstream broke quietly.

The fix was shadow evaluation after any upgrade: run old and new models against the same production documents, and flag any field where agreement drops below 95% before shipping.

Lesson: model upgrades are schema-compatibility events (treat them like a dependency bump, not a free swap).
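The shadow comparison can be sketched as a per-field agreement rate between the two models' outputs on the same documents. Everything here is an assumption for illustration: the function names, the sample payloads, and the use of exact equality as the definition of "agreement". Only the 95% threshold comes from the post.

```python
from collections import defaultdict

AGREEMENT_THRESHOLD = 0.95  # threshold from the post


def field_agreement(old_outputs: list[dict], new_outputs: list[dict]) -> dict[str, float]:
    """Fraction of documents on which both models produced the same value, per field."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("both runs must cover the same documents, in the same order")
    matches: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for old, new in zip(old_outputs, new_outputs):
        for field in old.keys() | new.keys():
            totals[field] += 1
            # A field missing on either side counts as a disagreement.
            if field in old and field in new and old[field] == new[field]:
                matches[field] += 1
    return {field: matches[field] / totals[field] for field in totals}


def fields_to_block(agreement: dict[str, float], threshold: float = AGREEMENT_THRESHOLD) -> list[str]:
    """Fields whose agreement rate is below the threshold."""
    return sorted(field for field, rate in agreement.items() if rate < threshold)


# Four made-up documents; the new model flattens the nesting on the last one.
nested = {"if": {"late": {"then": "penalty"}}}
flat = {"if_late_then": "penalty"}
old_run = [{"governing_law": "NY", "party_obligations": nested} for _ in range(4)]
new_run = [{"governing_law": "NY", "party_obligations": nested} for _ in range(3)]
new_run.append({"governing_law": "NY", "party_obligations": flat})

agreement = field_agreement(old_run, new_run)
print(fields_to_block(agreement))
```

Agreement is not correctness: a field where both models are wrong in the same way scores 100%. The comparison flags where behavior changed, which is the question a model swap raises.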

None of these surfaced as a Pydantic error. The schema was valid each time. The real failures were semantic drift, an uncontrolled retry loop, and a model-specific regression. In every case the grammar was fine and the meaning was not, which is precisely the thing type validation cannot see.

What the stack looks like now: Pydantic for syntax, a lightweight evaluator for semantics, DeepEval's correctness metric for the text fields, retries capped, an escalation field on every extraction schema so failure modes are a design-time decision, and a shadow-eval checklist of 200 production documents on any model change.

What I'd accept: "stricter schemas would have caught some of this." Partly true. Enums, discriminated unions, and constrained types genuinely shrink the semantic-validation surface when your domain is stable and bounded. If that's you, lean on them.

What I wouldn't accept: "so you don't need eval around structured output." Three production failures, two of them customer escalations, disagree. Stricter types reduce the surface; they do not remove it, and they get brittle the moment a new document shape arrives.

If I'm steelmanning the opposite of my own thesis: maybe the honest read is that I under-specified my schemas and called it a semantics problem to feel better about it. A verbatim-quote field could have been a constrained type backed by a span reference into the source, not a free `str`. A lot of what I'm calling "semantic validation" is really "validation I was too lazy to encode structurally."
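One way the span-reference idea could look, as a sketch: the field stores character offsets into the source, and the clause text is sliced out of the document, never generated. `SourceSpan` is a hypothetical type, and it uses a stdlib dataclass rather than a Pydantic model to stay dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Hypothetical span reference: character offsets into the source document."""
    start: int
    end: int

    def resolve(self, source_text: str) -> str:
        """Return the referenced text, or raise if the offsets fall outside the source."""
        if not (0 <= self.start < self.end <= len(source_text)):
            raise ValueError(f"span [{self.start}, {self.end}) is outside the source")
        return source_text[self.start:self.end]


# Made-up source text; in practice the model would emit the offsets.
source = "1. Term. 2. Either party may terminate on 30 days notice. 3. Fees."
clause = SourceSpan(
    start=source.index("Either"),
    end=source.index(" 3. Fees."),
)

# The text is sliced from the source, so it cannot be a paraphrase.
print(clause.resolve(source))
```

This makes paraphrase structurally impossible, but it does not make the span correct: offsets can still point at the wrong clause, so the question of meaning moves rather than disappears.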

So here's the concession. If you have shipped high-volume extraction without a semantic eval layer and held accuracy above 92% for more than six months, I'd genuinely like to see the schema design, because either you bounded the domain harder than I did, or you encoded meaning into types better than I did. The part I won't give up: somewhere, a field has to assert meaning, and if it isn't your schema doing it, it has to be something downstream of the schema.
