{"slug": "llm-give-me-a-json-make-no-mistakes", "title": "LLM, give me a JSON. Make no mistakes.", "summary": "Developers seeking reliable JSON output from large language models can move beyond simple prompt instructions and retry loops by implementing constrained sampling techniques that mask invalid tokens during generation. By setting the probability of tokens that would break JSON formatting to zero at each step, inference engines can guarantee valid structured output without wasting compute on discarded responses. This token-level masking approach, while requiring more sophisticated implementation than black-box retry strategies, eliminates the risk of infinite loops and ensures every generated token contributes to a valid JSON result.", "body_md": "# LLM, give me a JSON. Make no mistakes.\n\nSo how exactly do you make your LLM output a JSON? What happens under the hood? And how do you make it reliable and fast?\n\n## Make no mistakes\n\nImagine you have finally managed to set up the LLM inference for your application, and now it is even able to respond to you. And it can do so much stuff! But for most of these use cases, getting \"just\" text back is very limiting. In fact, to make most of the non-chatbot use cases work, you need more structured info like JSON. So you just append to the prompt:\n\n```\nRemember to give me the output in JSON format. Make no mistakes.\n```\n\nAs the JSON output gets longer and longer, somehow your super smart model fails from time to time. Apart from not getting the object keys right,\nit appends an extra `,` at the end of the last key-value pair, which makes the parser complain. You might ask, is there a better way?\n\nThere is!\n\nBeing able to control exactly what format your LLM produces is super valuable and technically super interesting. 
Let us thus take a deep dive into how you go past \"make no mistakes\" and how the inference engines do it reliably and fast.\n\nNote: If you feel familiar with JSON schemas and GBNF, just skip to the section \"Processing Grammars\".\n\n## Autoretries\n\nThe first solution that comes to mind is just to employ some retry strategy at the message level. Essentially:\n\n```\nwhile True:\n  answer = llm(prompt)\n  # retry the whole message until it parses as JSON\n  if is_json(answer):\n    break\n```\n\nThis works. The only positive thing I have to say about it is that you can treat the LLM as a complete black box, which might be viable for some libraries (actually I believe this is what LangChain does). The negatives, on the other hand, are plenty:\n\n- by being \"unlucky\" or employing smaller models, you can be looping for a very long time, or forever, before reaching the desired output\n- you're wasting an enormous number of tokens by discarding whole messages, even though they might not be all wrong\n- to be able to get a JSON, you need to construct or download a specific parser, which is not very extensible\n\nSo if you don't have to, just **don't** do this please. However, by looking more closely into the LLM, you can take a slightly\nmore principled approach.\n\n## Constrained Sampling\n\nThere are two observations we can make.\n\nFirst, LLMs generate outputs token by token. Usually you don't have to generate the whole answer to see that something is wrong. We can retry right away, as soon as the model produces the first token that breaks the format. This way, we are not deleting the whole message, but just the last token:\n\n```\nanswer = \"\"\nwhile True:\n  token = llm.next_token(prompt + answer)\n  if token == \"<eos>\":\n    break\n\n  # keep the token only if the answer is still a valid JSON prefix,\n  # otherwise drop it and sample again\n  if is_partial_json(answer + token):\n    answer += token\n```\n\nSecond, the answer tokens don't appear out of the blue. Given a text, LLMs produce a probability distribution over the next token. 
Instead of simply sampling from the distribution immediately, we can start by setting the probability of all tokens leading to incorrect output to 0. This way, we are guaranteed to only sample (and thus output) a correct token. If we wanted just a number, for example, we would zero out every token that cannot continue a number.\n\nTechnically, this process is called \"masking\". One more detail to address is that just shrinking the probability of the tokens we don't want to 0\nwould break the distribution property (we want the probabilities to sum to 1). In practice, the solution is therefore to set the underlying\nlogits to `-inf`, which turns the unwanted tokens' probabilities into 0 after the softmax and slightly bumps the other tokens up.\nThe pseudocode then could look like this:\n\n```\nanswer = \"\"\nwhile True:\n  token = llm.next_token(\n    prompt + answer,\n    # only the answer has to be JSON, so the mask depends on it alone\n    mask=possible_next_json_tokens(answer)\n  )\n\n  if token == \"<eos>\":\n    break\n\n  # every sampled token is valid, so we always keep it\n  answer += token\n```\n\nSo even though we are still looping, there is no discarding going on: we can't be unlucky and we are not wasting compute. The hard part is now how to specify the possible next tokens, generally called the \"mask\", and how to do that quickly, so the LLM is not waiting for us.\n\n## Specifying the Format\n\nWhat exactly is JSON? To construct the mask, we have to be able to answer this token by token.\nOne great way to precisely specify a text format is regexes.\nUnfortunately, as the name suggests, regexes are made for specifying regular languages, which JSON is not.\nWith the (basic set of) regex features, you won't be able to guarantee, for example, that every opened `{` will also be closed by a corresponding `}`.\n\nA more fitting way for this use case is employing [JSON schemas](https://json-schema.org/).\nIf you don't know JSON schemas, they are essentially a metalanguage\non top of JSON, to specify what JSON format you expect. 
This way, we can say for example:\n\n```\n{ \"type\": \"object\" }\n```\n\nto get any JSON object. Or, if you want something more specific:\n\n```\n{\n  \"type\": \"object\",\n  \"properties\": {\n    \"name\": { \"type\": \"string\" },\n    \"age\": { \"type\": \"integer\" }\n  }\n}\n```\n\nThis is a more concrete specification, and some of the inference engines/APIs accept JSON schemas directly! (e.g. [Claude API](https://platform.claude.com/docs/en/build-with-claude/structured-outputs), [OpenAI API](https://developers.openai.com/api/docs/guides/structured-outputs), [vLLM](https://docs.vllm.ai/en/latest/features/structured_outputs/))\nOn the other hand, we did not really move towards a \"lower level\" specification of what we want; JSON schemas are still quite abstract.\n\nNote that with all of the formats (regexes, JSON schemas, grammars) it is still important to tell the LLM what it is that you expect in the answer. The LLM does not know it is being constrained, so if you do not tell it what you expect, the token distributions will stay as if the output was not constrained, possibly hurting the quality of the output.\n\n## Grammars to the Rescue\n\nA more low-level and general specification can be achieved by passing in grammars, specifically Context-Free Grammars (CFGs).\nGrammars are precise descriptions of what strings can be generated. They are typically utilized in fields like programming language theory (parsers), linguistics, and some parts of theoretical computer science.\nIn inference engines, they typically underlie all of the constrained generation, whether it is specified by regexes,\nJSON schemas or \"manually\". They also cover most of the widespread structured formats, such as [Python syntax](https://docs.python.org/3/reference/grammar.html).\n\nAn example of such a grammar could be:\n\n```\nINT    ::= \"-\"? [0-9]+\n```\n\nwhich provides an INT rule for generating any signed integer. 
These rules can then be composed in a way where they \"call\" each other, to form larger structures:\n\n```\nROOT   ::= \"{ \\\"name\\\":\" STRING \",\\\"age\\\":\" INT \"}\"\nSTRING ::= \"\\\"\" [^\"]* \"\\\"\"\nINT    ::= \"-\"? [0-9]+\n```\n\nwhich is a very simplified grammar for the JSON schema we discussed earlier. The STRING rule consumes any sequence of characters without `\"`,\nand ROOT just acts as an entrypoint and assembles everything together. This exact format of writing down the grammars is known as GBNF\nand underlies the constrained generation in llama.cpp, as well as in other engines.\nIf you want to read more about it, I recommend starting [here](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md).\n\nWith grammars in place, we have done the work on the \"user's\" side: we have a general and exact specification from the user of what they actually want. Still, these are not the token masks. Now, we will dive deeper into how the inference engines convert the grammars into masks (fast).\n\n## Processing Grammars\n\nTo understand how the grammars then translate into masks, we will need a bit of theory under our belts. Let us start simple: with an integer rule and a whole string given upfront, for which we just decide whether it conforms to the grammar. Starting with the INT rule:\n\n```\nINT ::= \"-\"? [0-9]+\n```\n\nWe remove the `?` syntactic sugar:\n\n```\nINT ::= \"-\" [0-9]+ | [0-9]+\n```\n\nand replace the `+` (and, in other rules, the `*`) with recursion:\n\n```\nINT ::= \"-\" DIGITS | DIGITS\nDIGITS ::= [0-9] | [0-9] DIGITS\n```\n\nThis gets us simplified expressions, which are easier to work with. 
If you feel a bit lost on what actually happened, we just made a few regex transformations, which are equivalent to what we had before.\n\nNow, to produce the masks, we will use a machine with a stack (or, in proper terms, a \"pushdown automaton\", PDA).\nIf you don't know what a stack is, you can think of it as an array where we are only allowed to append and pop elements at one end.\nIn our case, the elements will be characters (in practice bytes) and *names* of the rules (INT, ROOT, STRING, ...).\n\n## Top-Down Parsing\n\nTo make things a little bit easier, we will again focus on a simpler case: we won't generate strings from tokens yet.\nInstead, we will be given a whole string at the start, and only be tasked with checking whether such a string\ncan be generated with the current grammar or not. This approach is usually called top-down parsing\nand, again, has been around since the [70s](https://en.wikipedia.org/wiki/Top-down_parsing).\n\nWe will continue with the following four simple rules:\n\n1. Start by pushing the root rule onto the stack.\n2. If there is a rule at the top, pop it and replace it with one of its definitions.\n3. If there is a character at the top which matches the input, consume both. Otherwise get stuck and *reject*.\n4. If both stack and input are empty, *accept*.\n\nTo illustrate how this would go, look at the INT rule and input `-1234`. First, the stack is empty, so we proceed with (1).\n\nAs the top is a rule, we follow with (2). How do we know which definition to choose from?\nFor now, let's pretend the machine *just knows*.\n\nWe then apply rule (3), as the top is a character which matches the input.\n\nThen, nothing new happens. We just keep going by the previous rules (2, 3, 2, 3).\n\nThis continues until we arrive at an empty stack and empty input. Well done! We can apply rule (4) and accept.\n\nThis way we know the string conforms to the grammar specified. 
If we were parsing non-integer input, like `.677`,\nwe would arrive at a state where the top of the stack expects `\"-\"` or `[0-9]`, while the input starts with `.`.\n\nAs both are characters (or character ranges) and don't match, we are forced to reject the input.\n\nHopefully, now you have an idea of how you would implement such an approach. The last two problems we have are:\n\n- We can consume whole strings, but LLMs produce tokens.\n- How does the machine *just know* which definition to choose?\n\nHow exactly you solve these problems then depends on the particular grammar library you use.\nIn the following, I will describe how [llama.cpp](https://github.com/ggml-org/llama.cpp) does this.\n\n## From Whole Strings to Masks\n\nAdapting to tokens is now easier than it looks. What we will do is simply run the top-down parsing for each token in the vocabulary. The only difference from the previous algorithm is that we don't need to end up with both the input and the stack empty. It simply suffices that we don't get stuck on the way. The result should be a distribution containing only tokens conforming to the grammar.\n\nThere is, however, one small caveat. Imagine sampling from the following distribution:\n\nInitializing the stack again with the INT rule, and running the parsing, we end up just with `-42` and `-`.\nNow we sample, choosing the respective stack associated with the token. Let's say we sampled `-42`.\n\nTo continue and get the next token, we can't just reset the stack with the INT rule, but have to remember the position in the grammar and the corresponding stack.\n\n## Dealing with Non-determinism\n\nYou might still argue that we are dealing with this oracle machine, which somehow knows\nwhether `[0-9]` or `[0-9] DIGITS` is the right path. You would be right. 
Again, we can propose a simple solution.\n\nWhenever there is a point where you have to make a decision, just go both ways.\n\nConcretely, that means that when finding a rule like `DIGITS ::= [0-9] | [0-9] DIGITS`, you just fork the current stack you have:\none copy goes with `[0-9]`, whereas the other one will have `[0-9] DIGITS`. Checking if a token goes through\nis then about answering \"Is there a stack which accepts?\". Since all tokens are of finite length, you don't have to worry\nabout infinite recursion (having a stack that would keep expanding the DIGITS branch forever).\n\nAnd that's it! Conceptually, this is how llama.cpp works!\n\n## Other Solutions\n\nBear in mind that the llama.cpp solution is more on the side of *slow* solutions, as it does everything\nat inference time. In the end, you incur something like `O(vocab_size * active_stacks)` cost at every step,\nas you are keeping stacks for each token and for each different route through the grammar.\nWith current vocab sizes (100k+) and longer grammars, that can be just incredibly time-consuming.\n\nCertainly, better solutions like [XGrammar](https://github.com/mlc-ai/xgrammar) or [LLGuidance](https://github.com/guidance-ai/llguidance) exist.\nThe main problem with llama.cpp is that we are doing *a lot* of work per step. Generally, smart precomputation is where you would go next.\nIf, however, you had to choose a solution yourself, consider this graph:\n\n*Source: guidance-ai/jsonschemabench, MaskBench benchmark*\n\nClearly, there is a tradeoff going on. Engines like XGrammar and Outlines, which choose more precomputation, suffer from long \"loading\" times (shown as TTFM, time to first mask). On the other hand, llama.cpp does little precomputation, but then is generally slower per step (shown as TBM, time between masks). And then there is LLGuidance, which seems to excel at both.\n\n## How Do We Use It\n\nA few weeks ago, we decided to integrate NobodyWho with LLGuidance. This way, we get all of the benefits. 
Now, you can simply:\n\n- pass in a regex,\n- pass in a JSON schema,\n- and pass in grammars.\n\nAll wrapped nicely in an API, which is just as simple as it should be:\n\n``` python\nfrom nobodywho import Chat, SamplerPresets\n\n# only the strings \"yes\" and \"no\" can be sampled\nsampler = SamplerPresets.constrain_with_regex(r\"yes|no\")\nchat = Chat('./model.gguf', sampler=sampler)\nanswer = chat.ask(\"Is the sky blue?\").completed()\nprint(answer) # yes\n```\n\n## Conclusion\n\nI very much like how constrained generation is a new problem, where we were able to apply existing theory to solve it in a nice, tractable way. I feel like that does not happen as often as it should.\n\nAt this point you know why:\n\n- you should *not* do auto-retries if you absolutely don't have to,\n- having to sample formats other than JSON is not the end of the world, as you have grammars,\n- there is a tradeoff between precomputation and per-step work in constrained sampling.\n\nIf you want to dive even deeper, I would go for:\n\n- reading this [incredible blog post](https://guidance-ai.github.io/llguidance/llg-go-brrr) about how LLGuidance is built,\n- looking at the [outlines paper](https://arxiv.org/pdf/2307.09702) to understand what you might cache before generating,\n- or the [xgrammar paper](https://proceedings.mlsys.org/paper_files/paper/2025/file/5c20ca4b0b20b0bd2f1d839dc605e70f-Paper-Conference.pdf), which has nice insights into what is good to precompute.\n\nDid you like this post? Consider [giving us a star](https://github.com/nobodywho-ooo/nobodywho)! That would mean a lot.\n\nThanks for reading!\n\n*This post was written entirely by a human. No words were made up by the machine.*\n\n## Who We Are\n\nWe're NobodyWho, a local inference library, which enables running small models on edge devices. We value open-source code, control over your models, solid software engineering, standardization and making simple things simple. All of which is missing in today's AI world. 
Running a model with us is as easy as:\n\n``` python\nfrom nobodywho import Chat\n\nchat = Chat(\"model.gguf\")\nanswer = chat.ask(\"Is water wet?\").completed()\nprint(answer)\n```\n\nIf you value the same things, come and [become a contributor](https://github.com/nobodywho-ooo/nobodywho) or just [download and test our library](https://docs.nobodywho.ooo/).\n\nPublished Jun 1, 2026", "url": "https://wpnews.pro/news/llm-give-me-a-json-make-no-mistakes", "canonical_source": "https://nobodywho.ooo/posts/llm-give-me-a-json/", "published_at": "2026-06-01 00:00:00+00:00", "updated_at": "2026-06-04 13:17:47.144412+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "natural-language-processing", "ai-tools", "ai-infrastructure"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/llm-give-me-a-json-make-no-mistakes", "markdown": "https://wpnews.pro/news/llm-give-me-a-json-make-no-mistakes.md", "text": "https://wpnews.pro/news/llm-give-me-a-json-make-no-mistakes.txt", "jsonld": "https://wpnews.pro/news/llm-give-me-a-json-make-no-mistakes.jsonld"}}