Benchmarking LLM Structured Outputs

At Carrick, a developer built a benchmark that tests eight synthetic JSON schemas against six models from OpenAI, Anthropic, and Google Gemini. It shows that structured output features do not guarantee schema conformance in production. Anthropic's models silently corrupted a deeply nested object by returning it as a string, OpenAI's strict mode rejected non-conforming schemas before the model was called, and Gemini rejected a narrow set of schema features. These failure modes are why Carrick's pipeline keeps a four-stage fallback parser for malformed responses.

Cross-posted from carrick.tools.

When you read the API documentation for OpenAI, Anthropic, or Google Gemini, the feature called "structured outputs" looks like a solved problem: pass a JSON schema, get back JSON that conforms to it. In production, it is not a contract. It is a well-typed, best-effort suggestion.

At Carrick (https://carrick.tools), the code-analysis scanner I work on, our post-LLM pipeline relies on a four-stage fallback parser. We attempt a direct parse, strip markdown fences, scan for array bounds inside surrounding garbage text, and finally apply regex cleanup. If all four fail, we drop the payload and proceed. If structured outputs worked as advertised, this would be a single serde_json::from_str(response).

To isolate why this defensive parsing is necessary, I built a benchmark testing 8 synthetic schemas against six models (the flagship and cheaper tiers from each provider). The schemas isolate one structural stressor each, referred to below as S1 through S8 in this order: a flat baseline, a 3-level nested object, a 7-level nested chain, a long enum, a oneOf tagged union, nullable + format fields, a 20-item array, and a closed object with additionalProperties: false. Every response is validated against the original schema using two independent validators (ajv and hyperjump). A response only counts as strict adherence when both agree.

Here is how the implementations actually behave. Counting, for each model, how many of the 8 stressor schemas it handled with full strict adherence on every run and how many tripped a specific failure mode, three patterns emerge. OpenAI rejects most schemas at submit time and then conforms perfectly on what is left. Anthropic accepts every schema but silently corrupts one specific structure. Gemini rejects a narrow set of features and conforms perfectly on the rest. Each pattern is the symmetric mirror of the others.

Anthropic's tool-use API is the most permissive of the three.
It accepts almost any standard JSON schema as the input_schema for a tool, and on 7 of the 8 schemas in this bench, both Claude Sonnet 4.6 and Claude Opus 4.7 produce strict-conforming output 100% of the time. The failure mode is concentrated on one schema: the 7-level nested object chain (S3).

On S3, at n=20 runs per model, the failure mode is unusual. Instead of returning a 7-level nested object, the model emits the entire nested structure as a single JSON-encoded string assigned to the root level1 field. Here is one of the Opus failures verbatim:

    {"level1":"{\"name\":\"system\",\"child\":{\"name\":\"ingest pipeline\", \"child\":{\"name\":\"batch 24a17\",\"child\":{\"name\":\"parse stage\", \"child\":{\"name\":\"error handling\",\"child\":{\"name\":\"dlq promotion\", \"leaf\":{\"value\":\"2 rows failed JSON parsing and were promoted to dlq .ingest.parse-errors; weekly cleanup later inspected 412 items, removed 312, returned 100 for reprocessing\",\"kind\":\"outcome summary\", \"count\":2}}}}}}}}"}

The schema declares level1 as type: object. The model returned type: string containing a JSON serialisation of what the object should have been. ajv's diagnostic:

    /level1 must be object {"type":"object"}

This is the most dangerous failure mode in the benchmark because:

- The API hands tool_use.input back to your application without checking whether it conforms to the input_schema you sent.
- JSON.parse(response) succeeds, returning { level1: "{\"name\": ..." }.
- Only an explicit schema validator catches the type drift.

The mechanism is consistent across all 27 silent failures in the dataset (20 Sonnet plus 7 Opus): the model wraps the entire nested payload in a single string value. Run-to-run variance is in where the string boundary sits, not in whether the wrapping happens.

OpenAI's strict: true mode is the symmetric mirror of Anthropic. Where it accepts a schema, it produces strict-conforming output.
Where the schema does not meet strict mode's narrow dialect, the request never reaches the model. Of the 8 bench schemas, only 2 pass OpenAI's strict-mode rules: S1 (the baseline, which I deliberately shaped to be strict-compliant) and S8 (the closed object). The other 6 are rejected before the call is sent.

OpenAI strict mode requires:

- Every object declares additionalProperties: false.
- Every property is listed in the required array.
- No type arrays such as type: ["string", "null"] and no oneOf unions; both are unsupported.

The bench performs the same schema validation OpenAI's API would perform, locally, before submission. A representative rejection (for the 7-level schema):

    OpenAI strict mode violations: $: object missing additionalProperties: false; $.level1: object missing additionalProperties: false; $.level1.child: object missing additionalProperties: false

The rejection rate is identical between gpt-5.4-mini and gpt-5.5. The check applies to the schema alone, before any model is invoked, so flagship intelligence does not change the outcome.

If you pull a schema from an OpenAPI spec or package.json, it will likely fail. Your options are to rewrite the schema to the strict dialect, or disable strict mode and inherit Anthropic's silent-failure problem.

Gemini's schema validator rejects modern JSON Schema features that OpenAI strict also bans (oneOf, type-arrays, $ref) but accepts the looser shapes OpenAI strict refuses. On the 6 of 8 bench schemas that clear Gemini's pre-flight, both Gemini Pro 3.1 and Gemini Flash 3.5 maintain 100% strict adherence at n=5 each (Wilson 95% CI for 5/5: 56.6%–100%; tight enough across 6 schemas to support the pattern). The two rejected schemas are S5 (uses oneOf) and S6 (uses type: ["string", "null"] plus format: date-time). Gemini surfaces the rejection at submission time with a clear error naming the unsupported feature.

Notably, Gemini handled the same 7-level deeply nested schema that broke Anthropic's models, with 100% strict adherence on every run. Where Gemini accepts a schema, it conforms.
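
Returning to the OpenAI pre-flight check described above: the following is a minimal sketch of how such a check might be written, assuming only the four rules named in this article (additionalProperties: false on every object, every property required, no type arrays, no oneOf). It is written in Python for illustration. The function name strict_mode_violations and the wording of the last three messages are hypothetical, and the bench's own validator may differ in both structure and coverage.

```python
from typing import Any


def strict_mode_violations(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Collect violations of the strict-mode rules named in the article.

    Hypothetical helper: it covers only those four rules, not the full
    rule set a provider may enforce.
    """
    violations: list[str] = []
    schema_type = schema.get("type")

    # Rule: type arrays such as ["string", "null"] are not allowed.
    if isinstance(schema_type, list):
        violations.append(f"{path}: type arrays are unsupported")

    # Rule: oneOf unions are not allowed.
    if "oneOf" in schema:
        violations.append(f"{path}: oneOf is unsupported")

    if schema_type == "object":
        properties = schema.get("properties", {})
        # Rule: every object must be closed.
        if schema.get("additionalProperties") is not False:
            violations.append(f"{path}: object missing additionalProperties: false")
        # Rule: every declared property must appear in "required".
        missing = sorted(set(properties) - set(schema.get("required", [])))
        if missing:
            violations.append(f"{path}: properties not in required: {', '.join(missing)}")
        # Recurse so nested objects are checked with their own path.
        for name, sub in properties.items():
            if isinstance(sub, dict):
                violations.extend(strict_mode_violations(sub, f"{path}.{name}"))

    if schema_type == "array" and isinstance(schema.get("items"), dict):
        violations.extend(strict_mode_violations(schema["items"], f"{path}[]"))

    return violations


# A three-level object chain that lists every property as required
# but never sets additionalProperties: false.
nested = {
    "type": "object",
    "required": ["level1"],
    "properties": {
        "level1": {
            "type": "object",
            "required": ["child"],
            "properties": {"child": {"type": "object", "properties": {}}},
        }
    },
}

print("; ".join(strict_mode_violations(nested)))
```

For this input the walker reports one missing additionalProperties: false per object level, in the same root-to-leaf order as the rejection message quoted earlier. Running a check like this locally gives the error before any request is made, which is the behaviour the bench relies on.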
The full pilot, condensed to one grid. S3 and S7 ran at n=20 for Anthropic; all other cells ran at n=5.

The provider feature called "structured output" cannot be trusted as an application boundary. To handle the realities of the current APIs, your pipeline needs explicit guardrails. Here is the implementation priority:

1. Validate every response with ajv, hyperjump, or a custom walker in your own codebase before passing the data to your application logic.
2. Write schemas to the strict dialect, with additionalProperties: false propagated to every sub-level and no optional fields.
3. Avoid oneOf and ["string", "null"]. Use anyOf for unions and rely on a single nullable type constraint.

Three caveats worth surfacing explicitly:

OpenAI rejection is bench-side, server-rule-mirrored. The 6 of 8 schemas reported as rejected by OpenAI are rejected by a pre-flight validator inside the bench that implements the documented strict-mode rules (additionalProperties: false, every property required, no type-arrays, no oneOf). I did not separately submit each schema to the OpenAI API and observe the server's 400 response, so the rejection rate reported here is the rate at which OpenAI's documented strict-mode rules disqualify normal JSON Schema, not the rate at which OpenAI's server returns an error. If OpenAI relaxed strict mode tomorrow, the bench would not notice.

Gemini schemas are normalised before submission. Gemini's structured-output API supports a narrower keyword set than OpenAPI / draft-2020-12 JSON Schema. The bench's convertSchemaToGemini function passes through the keywords Gemini's docs list as supported (type, enum, format, min/max, required, properties, items) and drops the rest before submission. The validator still checks Gemini's output against the original schema, so any constraint the converter drops is implicitly given a free pass on the Gemini side.
For the current corpus this only affects S5 and S6 (already rejected at pre-flight), but it would matter for any future schema relying on const, pattern, or additionalProperties as a real constraint.

Sample sizes are uneven. The two cells the article quotes specifically (Anthropic Sonnet and Opus on S3 deep nesting) ran at n=20 each. The S7 long-array cells also ran at n=20 after an initial pilot revealed the Anthropic adapter was hard-capped at max_tokens: 4096, which was inflating the truncation rate; raising the cap to 8192 brought both Anthropic tiers to 100% strict adherence on S7. Everywhere else the bench ran at n=5 per cell, which is enough to see the dominant outcome but not enough to claim sharp rates.

Methodology, raw JSONL, schemas, and reproducible scripts are available at carrick-llm-structured-bench (https://github.com/daveymoores/carrick-llm-structured-bench). The full re-run that backs the figures above cost roughly $8 in API credits and took about an hour of wall time.
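
As a closing illustration of the "custom walker" option among the guardrails, here is a minimal sketch of a type check that catches the stringified-object failure seen on S3. It assumes a schema that uses only the type, required, and properties keywords, and it is written in Python for illustration. The names type_errors and PYTHON_TYPES are hypothetical, the payload is a shortened stand-in rather than a recorded response, and a full validator such as ajv or hyperjump covers far more of JSON Schema.

```python
import json
from typing import Any

# Python types for the JSON Schema "type" keywords this sketch understands.
PYTHON_TYPES = {"object": dict, "array": list, "string": str, "integer": int}


def type_errors(value: Any, schema: dict[str, Any], path: str = "") -> list[str]:
    """Return ajv-style messages for type mismatches, recursing into objects."""
    expected = schema.get("type")
    python_type = PYTHON_TYPES.get(expected) if isinstance(expected, str) else None

    # bool is a subclass of int in Python, so reject it explicitly for integers.
    wrong_type = python_type is not None and (
        not isinstance(value, python_type)
        or (expected == "integer" and isinstance(value, bool))
    )
    if wrong_type:
        return [f"{path or '/'} must be {expected}"]

    errors: list[str] = []
    if expected == "object":
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}/{name} is required")
        for name, sub in schema.get("properties", {}).items():
            if name in value and isinstance(sub, dict):
                errors.extend(type_errors(value[name], sub, f"{path}/{name}"))
    return errors


schema = {
    "type": "object",
    "required": ["level1"],
    "properties": {
        "level1": {
            "type": "object",
            "required": ["name"],
            "properties": {"name": {"type": "string"}},
        }
    },
}

# Stand-in for the failure shape: the nested object is JSON-encoded a second
# time, so it arrives as a string value under "level1".
raw = json.dumps({"level1": json.dumps({"name": "system"})})

parsed = json.loads(raw)  # parsing succeeds; parsed["level1"] is a str
print(type_errors(parsed, schema))
print(type_errors({"level1": {"name": "system"}}, schema))
```

The first call reports /level1 must be object even though the parse step raised nothing, and the second returns an empty list for a conforming payload. That gap between "parses" and "conforms" is the reason validation sits first in the priority list.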