{"slug": "benchmarking-llm-structured-outputs", "title": "Benchmarking LLM Structured Outputs", "summary": "At Carrick, a developer built a benchmark testing eight synthetic JSON schemas against six LLMs from OpenAI, Anthropic, and Google Gemini, revealing that structured output features fail to guarantee schema conformance in production. The benchmark found that Anthropic's models silently corrupted deeply nested objects by returning them as strings, OpenAI rejected non-conforming schemas at submit time, and Gemini rejected a narrow set of schema features, with each failure mode requiring a four-stage fallback parser to handle malformed responses.", "body_md": "Cross-posted from [carrick.tools].\n\nWhen you read the API documentation for OpenAI, Anthropic, or Google Gemini, the feature called \"structured outputs\" looks like a solved problem: pass a JSON schema, get back JSON that conforms to it.\n\nIn production, it is not a contract. It is a well-typed, best-effort suggestion.\n\nAt [Carrick](https://carrick.tools), the code-analysis scanner I work on, our post-LLM pipeline relies on a four-stage fallback parser. We attempt a direct parse, strip markdown fences, scan for array bounds inside surrounding garbage text, and finally apply regex cleanup. If all four fail, we drop the payload and proceed. If structured outputs worked as advertised, this would be a single `serde_json::from_str(response)`.\n\nTo isolate why this defensive parsing is necessary, I built a benchmark testing 8 synthetic schemas against six models (the flagship and cheaper tiers from each provider). The schemas isolate one structural stressor each: a flat baseline, a 3-level nested object, a 7-level nested chain, a long enum, a `oneOf` tagged union, nullable + format fields, a 20-item array, and a closed object with `additionalProperties: false`. Every response is validated against the original schema using two independent validators (`ajv` and `hyperjump`). 
A response only counts as strict adherence when both agree.\n\nHere is how the implementations actually behave.\n\nOf the 8 stressor schemas, here is how many each model handled with full strict adherence on every run, and how many tripped a specific failure mode:\n\nThree patterns emerge. OpenAI rejects most schemas at submit time and then conforms perfectly on what is left. Anthropic accepts every schema but silently corrupts one specific structure. Gemini rejects a narrow set of features and conforms perfectly on the rest. Each pattern is the symmetric mirror of the others.\n\nAnthropic's tool-use API is the most permissive of the three. It accepts almost any standard JSON schema as the `input_schema` for a tool, and on 7 of the 8 schemas in this bench, both Claude Sonnet 4.6 and Claude Opus 4.7 produce strict-conforming output 100% of the time. The failure mode is concentrated on one schema: a 7-level nested object chain (S3).\n\nOn S3 at n=20 runs per model:\n\nThe failure mode is unusual. Instead of returning a 7-level nested object, the model emits the entire nested structure as a single JSON-encoded *string* assigned to the root `level1` field. Here is one of the Opus failures verbatim:\n\n```\n{\"level1\":\"{\\\"name\\\":\\\"system\\\",\\\"child\\\":{\\\"name\\\":\\\"ingest_pipeline\\\",\n\\\"child\\\":{\\\"name\\\":\\\"batch_24a17\\\",\\\"child\\\":{\\\"name\\\":\\\"parse_stage\\\",\n\\\"child\\\":{\\\"name\\\":\\\"error_handling\\\",\\\"child\\\":{\\\"name\\\":\\\"dlq_promotion\\\",\n\\\"leaf\\\":{\\\"value\\\":\\\"2 rows failed JSON parsing and were promoted to dlq\n.ingest.parse-errors; weekly cleanup later inspected 412 items, removed\n312, returned 100 for reprocessing\\\",\\\"kind\\\":\\\"outcome_summary\\\",\n\\\"count\\\":2}}}}}}}}\"}\n```\n\nThe schema declares `level1` as `type: object`. The model returned `type: string` containing a JSON serialisation of what the object should have been. 
`ajv`'s diagnostic:\n\n```\n/level1 must be object {\"type\":\"object\"}\n```\n\nThis is the most dangerous failure mode in the benchmark because:\n\n- `tool_use.input` is passed back to your application without any check that it conforms to the `input_schema` you sent.\n- `JSON.parse(response)` succeeds, returning `{ level1: \"{\\\"name\\\": ...\" }`. Only an explicit schema validator catches the type drift.\n\nThe mechanism is consistent across all 27 silent failures in the dataset (20 Sonnet plus 7 Opus): the model wraps the entire nested payload in a single string value. Run-to-run variance is in where the string boundary sits, not in whether the wrapping happens.\n\nOpenAI's `strict: true` mode is the symmetric mirror of Anthropic. Where it accepts a schema, it produces strict-conforming output. Where the schema does not meet strict mode's narrow dialect, the request never reaches the model.\n\nOf the 8 bench schemas, only 2 pass OpenAI's strict-mode rules (S1 baseline, which I deliberately shaped to be strict-compliant, and S8 closed object). The other 6 are rejected before the call is sent.\n\nOpenAI strict mode requires:\n\n- Every object must declare `additionalProperties: false`.\n- Every property must be listed in the `required` array.\n- Type-arrays (such as `type: [\"string\", \"null\"]`) and `oneOf` unions are unsupported.\n\nThe bench performs the same schema validation OpenAI's API would perform, locally, before submission. A representative rejection (for the 7-level schema):\n\n```\nOpenAI strict mode violations:\n  $: object missing additionalProperties: false;\n  $.level1: object missing additionalProperties: false;\n  $.level1.child: object missing additionalProperties: false\n```\n\nThe rejection rate is identical between gpt-5.4-mini and gpt-5.5. The check is a schema-level rule applied before any model is invoked (locally in the bench, at the schema-submission layer on OpenAI's side), so flagship intelligence does not change the outcome.\n\nIf you pull a schema from an OpenAPI spec or `package.json`, it will likely fail. 
Your options are to rewrite the schema to the strict dialect, or disable strict mode and inherit Anthropic's silent-failure problem.\n\nGemini's schema validator rejects modern JSON Schema features that OpenAI strict also bans (`oneOf`, type-arrays, `$ref`) but accepts the looser shapes OpenAI strict refuses. On the 6 of 8 bench schemas that clear Gemini's pre-flight, both Gemini Pro 3.1 and Gemini Flash 3.5 maintain 100% strict adherence at n=5 each (Wilson 95% CI for 5/5: 56.6%–100%; tight enough across 6 schemas to support the pattern).\n\nThe two rejected schemas are S5 (uses `oneOf`) and S6 (uses `type: [\"string\", \"null\"]` plus `format: date-time`). Gemini surfaces the rejection at submission time with a clear error naming the unsupported feature.\n\nNotably, Gemini maintained 100% strict adherence on every run of the same 7-level deeply nested schema that broke Anthropic. Where Gemini accepts a schema, it conforms.\n\nThe full pilot, condensed to one grid. S3 and S7 ran at n=20 for Anthropic; all other cells ran at n=5.\n\nThe provider feature called \"structured output\" cannot be trusted as an application boundary. To handle the realities of the current APIs, your pipeline needs explicit guardrails. Here is the implementation priority:\n\n1. Validate every response against the original schema with `ajv`, `hyperjump`, or a custom walker in your own codebase before passing the data to your application logic.\n2. For OpenAI strict mode, write schemas with `additionalProperties: false` propagated to every sub-level and no optional fields.\n3. Avoid `oneOf` and `[\"string\", \"null\"]`. Use `anyOf` for unions and rely on a single nullable type constraint.\n\nThree caveats worth surfacing explicitly:\n\n**OpenAI rejection is bench-side, server-rule-mirrored.** The 6 of 8 schemas reported as rejected by OpenAI are rejected by a pre-flight validator inside the bench that implements the documented strict-mode rules (`additionalProperties: false`, every property required, no type-arrays, no `oneOf`). 
I did not separately submit each schema to the OpenAI API and observe the server's 400 response, so the rejection rate reported here is the rate at which OpenAI's documented strict-mode rules disqualify normal JSON Schema, not the rate at which OpenAI's server returns an error. If OpenAI relaxed strict mode tomorrow, the bench would not notice.\n\n**Gemini schemas are normalised before submission.** Gemini's structured-output API supports a narrower keyword set than OpenAPI / draft-2020-12 JSON Schema. The bench's `convertSchemaToGemini` function passes through the keywords Gemini's docs list as supported (`type`, `enum`, `format`, `min/max`, `required`, `properties`, `items`) and drops the rest before submission. The validator still checks Gemini's output against the original schema, so any constraint the converter drops is implicitly given a free pass on the Gemini side. For the current corpus this only affects S5 and S6 (already rejected at pre-flight), but it would matter for any future schema relying on `const`, `pattern`, or `additionalProperties` as a real constraint.\n\n**Sample sizes are uneven.** The two cells the article quotes specifically (Anthropic Sonnet and Opus on S3 deep nesting) ran at n=20 each. The S7 long-array cells also ran at n=20 after an initial pilot revealed the Anthropic adapter was hard-capped at `max_tokens: 4096`, which was inflating the truncation rate; raising the cap to 8192 brought both Anthropic tiers to 100% strict adherence on S7. Everywhere else the bench ran at n=5 per cell, which is enough to see the dominant outcome but not enough to claim sharp rates.\n\nMethodology, raw JSONL, schemas, and reproducible scripts are available at [carrick-llm-structured-bench](https://github.com/daveymoores/carrick-llm-structured-bench). 
The full re-run that backs the figures above cost roughly $8 in API credits and took about an hour of wall time.", "url": "https://wpnews.pro/news/benchmarking-llm-structured-outputs", "canonical_source": "https://dev.to/david_moores_cbc0233b7447/benchmarking-llm-structured-outputs-1ijc", "published_at": "2026-05-25 18:33:04+00:00", "updated_at": "2026-05-25 19:03:23.928772+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-tools", "ai-products", "ai-research"], "entities": ["OpenAI", "Anthropic", "Google Gemini", "Carrick"], "alternates": {"html": "https://wpnews.pro/news/benchmarking-llm-structured-outputs", "markdown": "https://wpnews.pro/news/benchmarking-llm-structured-outputs.md", "text": "https://wpnews.pro/news/benchmarking-llm-structured-outputs.txt", "jsonld": "https://wpnews.pro/news/benchmarking-llm-structured-outputs.jsonld"}}