LLM, give me a JSON. Make no mistakes.

Developers seeking reliable JSON output from large language models can move beyond simple prompt instructions and retry loops by implementing constrained sampling techniques that mask invalid tokens during generation. By setting the probability of tokens that would break JSON formatting to zero at each step, inference engines can guarantee valid structured output without wasting compute on discarded responses. This token-level masking approach, while requiring a more sophisticated implementation than black-box retry strategies, eliminates the risk of infinite loops and ensures every generated token contributes to a valid JSON result.

So how exactly do you make your LLM output a JSON? What happens under the hood? And how do you make it reliable and fast?

Make no mistakes

Imagine you have finally managed to set up LLM inference for your application, and now it is even able to respond to you. And it can do so much stuff. But for most of these use cases, getting "just" text back is very limiting. In fact, to make most non-chatbot use cases work, you need more structured output, like JSON. So you just append to the prompt:

    Remember to give me the output in JSON format. Make no mistakes.

As the JSON output gets longer and longer, somehow your super smart model fails from time to time. Apart from getting the object keys wrong, it appends an extra comma (,) after the last key-value pair, which makes the parser complain.

You might ask, is there a better way? There is.

Being able to control exactly what format your LLM produces is super valuable and technically super interesting. So let us take a deep dive into how you get past "make no mistakes" and how inference engines do it reliably and fast.

Note: If you are familiar with JSON schemas and GBNF, just skip to the section "Processing Grammars".
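The trailing-comma failure described above is easy to reproduce. Here is a minimal sketch using Python's standard `json` module; the `is_json` helper and the sample strings are made up for illustration and are not taken from any particular library.

```python
import json


def is_json(text: str) -> bool:
    """Return True if `text` parses as JSON (illustrative helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical model outputs: the second one has a comma after the last pair.
good = '{"name": "Ada", "age": 36}'
bad = '{"name": "Ada", "age": 36,}'

print(is_json(good))  # True
print(is_json(bad))   # False
```

A strict parser rejects the second string even though only one character is wrong, which is why a single stray token can invalidate an otherwise correct response.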
Autoretries

The first solution that comes to mind is to employ a retry strategy at the message level. Essentially:

    while True:
        answer = llm(prompt)
        if is_json(answer):
            break

This works. The only positive thing I have to say about it is that you can treat the LLM as a complete black box, which might be viable for some libraries (actually, I believe this is what LangChain does). The negatives are plenty:

- by being "unlucky" or using smaller models, you can end up looping for a very long time, or forever, before reaching the desired output
- you're wasting an enormous number of tokens by discarding whole messages, even though they might not be entirely wrong
- to get a JSON, you need to construct or download a format-specific parser, which is not very extensible

So if you don't have to, please just don't do this. However, by looking more closely into the LLM, you can take a slightly more principled approach.

Constrained Sampling

There are two observations we can make. First, LLMs generate outputs token by token. Usually you don't have to generate the whole answer to see that something is wrong. We can retry right away, as soon as the model emits the first token that breaks the format. This way, we are not deleting the whole message, just the last token:

    answer = ""
    while True:
        token = llm.next_token(prompt + answer)
        if token == "