{"slug": "we-rebuilt-the-structured-output-problem-one-layer-up", "title": "We rebuilt the structured output problem one layer up", "summary": "Evolution of \"structured output\" in AI, from late 2022 to 2024, where developers moved from prompting models to produce valid JSON to using specialized libraries and grammar-constrained decoding. Key libraries like Microsoft Guidance, Outlines, Instructor, and llama.cpp pioneered different approaches, including sampling-time constraints and retry-and-repair loops, before OpenAI's August 2024 launch of strict-schema Structured Outputs appeared to solve the problem. However, the author notes that despite this apparent victory, the fundamental challenge of ensuring reliable structured output has resurfaced in a new phase.", "body_md": "# We rebuilt the structured output problem one layer up\n\nIn late 2023, every JSON-extracting system prompt I shipped for production tooling looked roughly like this:\n\n```\nYou are a JSON API. Respond ONLY with valid JSON.\nDo not include explanations. Do not wrap in markdown.\nDo not say \"Here is the JSON\".\nThe output is parsed by a strict parser.\nIf the JSON is invalid, downstream systems break.\nReturn only the JSON object.\n```\n\nI am embarrassed by every line of it. I am also, in early 2026, still using a near-identical version in three production systems, because it works and because nothing else available at the time worked better.\n\nThat prompt is an artifact of an era, and the era has a story worth telling because it's already repeating itself.\n\n## A field guide to the era\n\nThe structured-output era runs roughly from late 2022 to August 2024. 
Here is what shipped, in chronological order:\n\n| Date | Artifact |\n|---|---|\n| Nov 10, 2022 | [Microsoft Guidance](https://github.com/guidance-ai/guidance) (repo created) |\n| Mar 2023 | [Outlines (dottxt-ai)](https://github.com/dottxt-ai/outlines) |\n| May 6, 2023 | [Glazkov's \"Schemish\" post](https://glazkov.com/2023/05/06/schemish/) |\n| Jun 13, 2023 | [OpenAI function calling](https://openai.com/index/function-calling-and-other-api-updates/) |\n| Jun 14, 2023 | [jxnl/Instructor](https://github.com/567-labs/instructor) |\n| Jul 2023 | [llama.cpp grammar sampling](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md) |\n| Nov 2023 | `response_format: json_object`, model trained to emit valid JSON |\n| Dec 2023 | [Eric Hartford's Dolphin \"kitten\" prompt](https://erichartford.com/dolphin-25-mixtral-8x7b); Theia Vogel's tipping experiment |\n| Feb 2024 | [Hamel Husain's \"Fuck You, Show Me The Prompt\"](https://hamel.dev/blog/posts/prompt/) |\n| Aug 2024 | [OpenAI Structured Outputs (strict schemas)](https://openai.com/index/introducing-structured-outputs-in-the-api/), `gpt-4o-2024-08-06` |\n| Nov 2024 | [Anthropic Model Context Protocol](https://www.anthropic.com/news/model-context-protocol) |\n| Dec 2, 2024 | [Pydantic AI](https://ai.pydantic.dev/) |\n| Late 2025 | [Anthropic: Code execution with MCP](https://www.anthropic.com/engineering/code-execution-with-mcp) |\n\nIf you were paying attention in 2023, every row in that table felt like progress. We were inventing the abstractions at the same time we were using them in production. The Instructor library was created on June 14, 2023, the day after OpenAI launched function calling on June 13. That is the pace.\n\nLooking back from 2026, the table reads as a single arc with a beginning, a middle, and an apparent end. The beginning is \"ask the model nicely, then more nicely, then with threats.\" The middle is a Cambrian explosion of libraries trying to put structural rails on prompt engineering. The apparent end is OpenAI's August 2024 launch of strict-schema Structured Outputs, which claimed 100% schema conformance on `gpt-4o-2024-08-06`.\n\nPhase one closed with confetti. 
We had won the war.\n\nThen phase two started, and we noticed it was the same war.\n\n## The libraries that mattered\n\nThe libraries weren't the headline at the time. The model launches were. But the libraries were doing the actual work of teaching us what structured output is, and they're worth naming.\n\n[Microsoft Guidance](https://github.com/guidance-ai/guidance) was the earliest of them, with its repo created in November 2022, before ChatGPT had finished its launch news cycle. Guidance pioneered the idea of treating LLM output as something you compose with control flow, regex constraints, and grammars. Most of the patterns we now take for granted (JSON-from-grammar, structured generation as opposed to structured prompting) trace back here.\n\n[Outlines](https://github.com/dottxt-ai/outlines), launched in March 2023 by .txt, was the cleanest expression of grammar-constrained decoding. The thesis was straightforward: if you can write a grammar for the output, the sampler should refuse to emit anything that violates it. This is a profoundly correct idea that, three years later, is finally becoming the default in open-source inference engines like vLLM, through grammar backends such as XGrammar.\n\n[Instructor](https://github.com/567-labs/instructor), shipped by jxnl on June 14, 2023, took a different path. It wrapped OpenAI's brand-new function calling and pretended the result was structured output. Define a Pydantic model, get a Pydantic model back. Retries handled. Validation handled. It was the right abstraction at the right moment, and it's part of why the term \"structured outputs\" stuck as the framing for the whole field.\n\n[llama.cpp added grammar-based sampling](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md) in July 2023, and that's when the conversation about constrained decoding got serious in the open-source world. 
The fact that you could ship a four-line grammar file and force a 7B model to emit perfect JSON every time was, at the time, witchcraft.\n\nLangChain's output parsers, especially `OutputFixingParser`, standardized the retry-and-repair loop: parse, fail, send the validation error back to the model, ask it to fix the JSON, parse again. That pattern is everywhere now. It's also a lot of what people complain about when they complain about LangChain.\n\nThese libraries split into three approaches. Constrained decoding (Outlines, Guidance, llama.cpp grammars) intervenes at sampling time. Schema-driven wrappers (Instructor, later Pydantic AI) lean on the underlying API and add validation. Retry-and-repair loops (LangChain, Guardrails) treat the LLM as an unreliable black box and validate after the fact. All three still ship.\n\n## The fork in the road\n\nBy mid-2024 the field had split into three camps that didn't always realize they were having the same argument.\n\nThe constrained-decoding camp said: structure should be enforced at sampling time, by the inference engine, before the model can emit anything wrong. Outlines and llama.cpp grammars are this. So is OpenAI's Structured Outputs feature: the launch post describes constrained decoding against a grammar compiled from the supplied JSON Schema, the same family of techniques as llguidance and XGrammar.\n\nThe instruction-tuned camp said: train the model to emit schema-conformant output. OpenAI's function calling was the first major implementation. Anthropic's tool use, when it went GA in May 2024, was the second. The argument was that for closed-source APIs you don't control sampling, so you have to bake schema awareness into the model itself.\n\nThe wrapper camp said: it doesn't matter how the model emits the JSON, you still need validation, retries, and provider-agnosticism. Instructor and Pydantic AI (publicly launched December 2, 2024) are the cleanest expressions of this.\n\nWhat actually won? All three did, layered on top of each other. 
For most app developers using closed APIs, OpenAI Structured Outputs and Anthropic tool use are the default. Underneath them, constrained decoding is the implementation. On top of them, validation wrappers like Instructor handle retries and content checks the schema can't enforce.\n\nHamel Husain wrote the canonical skeptical essay about wrappers in February 2024 ([Fuck You, Show Me The Prompt](https://hamel.dev/blog/posts/prompt/)), and he was right that many of these libraries are mostly prompt manipulation. He was also wrong that this means they're not useful. Prompt manipulation, well-engineered and validated, was the bridge that got us from \"respond ONLY in valid JSON\" to schema-strict APIs. Bridges are useful, and they're also temporary.\n\n## Three lessons we learned\n\nThe era taught us things, and the things are worth naming clearly because they're falling out of working memory already.\n\n### The model wasn't broken; the interface was\n\nIn early 2023 the consensus was that LLMs \"couldn't follow instructions reliably.\" By late 2024 the consensus was that LLMs \"follow JSON schemas with 100% reliability.\" The models hadn't fundamentally changed in that window. The interface had. When a model seems unreliable at a task, the most productive question to ask is \"what does the API surface look like?\" before \"is the model good enough?\"\n\n### Every workaround eventually becomes infrastructure\n\nThe retry-and-repair loop was a hack. It became LangChain's `OutputFixingParser`, then Instructor's tenacity-backed retries, then a built-in part of Pydantic AI. The \"respond ONLY in valid JSON\" prompt was a hack. It became, near-verbatim, the default system-prompt example in OpenAI's own documentation for years. 
The lesson is to take your hacks seriously, because the half-life of a \"temporary\" workaround in this field is approximately five years.\n\n### The constraint doesn't disappear, it moves up the stack\n\nThe 2022 problem was: how do I get this model to emit a parseable JSON object. By 2024 we had solved that. The 2025 problem became: how do I get this model to pick the right tool from a list of 250 of them, with each tool's schema preloaded into the system prompt, and not blow my context window before the model has emitted a single token. The shape of the problem didn't change. The layer changed. We're in the middle of the same pattern again.\n\n## Why we're forgetting the lessons\n\nIf the lessons of the structured-output era were propagating, the tool-calling and MCP era would look different than it does. It doesn't.\n\nThe clearest public example: in late 2025 / early 2026, [Anthropic published an engineering post titled Code execution with MCP: building more efficient AI agents](https://www.anthropic.com/engineering/code-execution-with-mcp). The argument, in their own framing, is that direct tool calls don't scale because each tool definition consumes context, and a five-server MCP setup with 58 tools can burn ~50,000 tokens before the user has typed anything. Their proposed fix is to let the model write code that calls tools, instead of expecting the model to pick from a flat list of pre-loaded tool schemas. They report that lazy-loaded tool discovery improves Claude Opus 4.5 task accuracy from 79.5% to 88.1% while cutting tokens by roughly 85%.\n\nThat paragraph deserves to be re-read. Anthropic, the company that designed MCP, is publicly arguing that the way MCP currently works has a fundamental scaling problem, and the fix is to abandon flat tool lists in favor of letting the model write code. That maps cleanly onto the lessons above. The model isn't bad at tool calling; the interface is. 
The \"load all your tools upfront\" pattern is becoming infrastructure even though it was always a hack. The constraint moved up: from \"make the model emit valid JSON\" to \"make the model pick the right tool from a flat list of 250.\"\n\nA second example, with a useful counterpoint: Waleed K's piece [The Evolution of AI Tool Use: MCP Went Sideways](https://waleedk.medium.com/the-evolution-of-ai-tool-use-mcp-went-sideways-8ef4b1268126) makes a related observation about MCP's context-bloat problem with concrete numbers and a concrete war story. He argues, drawing on Cloudflare's framing, that LLMs are \"bad at tool calling\" because tool-call traces are \"out-of-distribution\" for the base models. I think he's half right. The base distributions are absolutely thin on canonical tool-call traces, which is exactly why model providers fine-tune for tool use and why Anthropic's lazy-loaded tool search lifts Opus 4.5 from 79.5% to 88.1%. But the framing \"models can't do tool calling\" misses the same point we missed in 2022 about JSON. The model isn't bad at the task. We've handed it a clumsy interface to the task. Code execution feels like a fix because it routes around the clumsy interface and lets the model do something it has trillions of training examples for: write code. That's the same insight as \"use a JSON Schema grammar.\" It's \"the interface was wrong, again.\" The lesson generalizes, and it's the lesson worth carrying forward.\n\nA third example, drawn from less rigorous evidence: production systems I've worked on in 2026 still ship system prompts that say things like \"you are a backend tool router. Output ONLY a single tool call. Do not explain. Do not apologize.\" Strip the word \"tool\" and replace with \"JSON object\" and you have my 2023 prompt. 
Same shape, same hack, just at a different layer.\n\n## What the next \"JSON repair library\" looks like\n\nThree predictions, with low hedging.\n\nFirst, tool-search-as-routing becomes the default agent design pattern within twelve months. Flat tool lists in the system prompt will look as embarrassing in 2027 as \"respond ONLY in valid JSON\" looks now. The tool-search tool is the new JSON-Schema strict mode.\n\nSecond, agent-as-code-author beats agent-as-tool-picker for any non-trivial workflow. Anthropic's MCP-code-execution post is the most prominent signal, but the same pattern shows up in the way Claude Code agents get work done internally and in OpenAI's evolving Responses API semantics. We'll look back at \"load 250 tools into the system prompt\" the same way we look back at the kitten prompt.\n\nThird, whatever standard replaces or absorbs MCP will be a *protocol-of-protocols*: a thin layer that brokers between code-executing agents and the underlying tool servers, rather than a flat schema dump. MCP solved the \"describe your tools\" problem. The next standard has to solve \"let the model discover tools as needed without paying upfront context.\"\n\nI have been writing system prompts that look like 2023 prompts again, this time around tool calling, and so has everyone else. 
The fix is the same one we figured out the first time: stop coercing, change the interface.\n\nIf you've shipped systems that hit this same wall, I'd be glad to compare notes.\n\n— *NJ*\n\n*Building KaribuKit (AI-native PMS for hospitality), running Mara Hilltop (eco-lodge in the Maasai Mara), and consulting through SimbaStack.*", "url": "https://wpnews.pro/news/we-rebuilt-the-structured-output-problem-one-layer-up", "canonical_source": "https://blog.simbastack.com/we-rebuilt-the-structured-output-problem-one-layer-up/", "published_at": "2026-05-06 13:57:30+00:00", "updated_at": "2026-05-21 16:14:33.509449+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "artificial-intelligence", "machine-learning", "products"], "entities": ["OpenAI", "Instructor"], "alternates": {"html": "https://wpnews.pro/news/we-rebuilt-the-structured-output-problem-one-layer-up", "markdown": "https://wpnews.pro/news/we-rebuilt-the-structured-output-problem-one-layer-up.md", "text": "https://wpnews.pro/news/we-rebuilt-the-structured-output-problem-one-layer-up.txt", "jsonld": "https://wpnews.pro/news/we-rebuilt-the-structured-output-problem-one-layer-up.jsonld"}}