Your LLM Obeys 99% of the Time. That 1% Is Taking Down Production.

wpnews.pro

You shipped the feature. It worked in the demo. It worked in staging. It worked in the first week of prod. Then one Tuesday at 2 a.m. your pager goes off because a downstream consumer choked on this:

{ "priority": "urgent", "category": "billing", "needs_human": "yes", "confidence": 0.91 }

Spot the bombs. priority was supposed to be one of low | medium | high — "urgent" isn't in the enum. needs_human was supposed to be a boolean — you got the string "yes". Your parser did exactly what you told it to. Your model did almost exactly what you told it to. And "almost" is the whole problem.

Here’s the uncomfortable math. A 99% obedient model sounds great until you’re doing a few million calls a day. 1% of a few million is tens of thousands of malformed objects — every single day. That’s not an edge case. That’s a feature with a built-in outage schedule.

The instinct is to fix this where you can see it: in the prompt. Add more instructions. Add examples. Add a validation-and-retry loop. All of that helps. None of it guarantees anything. This article is about why, and about the one technique that actually closes the gap: logit masking.

To understand the fix, you have to look one level below the token. Most of us treat the model as a black box that emits text. It isn’t. At every single step, it does this:

The thing to internalize is that there’s a precise, surgical moment between step 1 and step 2 — between the raw scores and the dice roll — where you can reach in and edit the numbers before they ever become probabilities. That’s the only place a hard guarantee can live. Everything you do in the prompt is upstream of the model deciding; logit masking happens inside the decision.

Logit masking is almost insultingly simple to state: for any token you want to forbid, set its logit to negative infinity before the softmax.

Why −∞? Because softmax exponentiates. e^(-∞) = 0. A token with a logit of −∞ comes out the other side of softmax with a probability of exactly zero. Not 0.3%. Not 0.0001%. Zero. It cannot be sampled. It has been deleted from the model's vocabulary for that one step.

Go back to our triage object. When the model is filling in the priority field and it's at the position right after "priority": ", you know — mechanically, from the schema — that the only legal next tokens are the ones that spell low, medium, or high. So you mask everything else. The token for urg (the start of "urgent") gets −∞. The model physically cannot start typing "urgent," no matter how confident it was. The 2 a.m. bug isn't caught — it's made unrepresentable.

That’s the entire idea. Now the interesting part: why can’t the cheaper, more familiar tools do the same thing?

Every technique for steering a model lives in one of two layers, and they are not interchangeable.

The semantic layer is where you persuade. Prompts, few-shot examples, “respond ONLY with JSON, do not add commentary, I will tip you $200” — all of it operates by nudging the probability distribution toward what you want. It’s negotiation. The model considers your request and is more likely to comply.

The mechanical layer is where you overrule. Logit masking doesn’t ask the model to behave. It removes the model’s ability to misbehave. It’s not a stronger argument; it’s a different category of thing entirely.

The reason your prod is on fire is that you’ve been trying to win a guarantee with persuasion. You can’t. Persuasion shifts odds. It never zeroes them. Let me show you exactly why, for both of the usual suspects.

Few-shot is genuinely useful. Show the model five perfect examples of the triage object and your malformed-output rate will drop. Maybe a lot. But watch what’s actually happening to the distribution.

Examples pile probability mass onto the valid tokens. The peak over low | medium | high gets taller and sharper. But the token for "urgent" doesn't go to zero — it goes to small. There's still a little bump out in the tail. The model still believes, with some tiny probability, that "urgent" is a reasonable thing to say, because in its training data it absolutely was.

And here’s the part people miss: small times big is not small. A 0.3% chance of a bad token, across millions of generations, is a steady drip of broken objects into your pipeline. Few-shot moved you from “broken often” to “broken rarely.” It cannot move you to “broken never,” because there is no prompt — none — that sets a probability to exactly zero. That’s not a thing the semantic layer can do. It doesn’t have the API surface for it.

This is the trap with prompting in general: it feels like control because the failure rate drops, so you keep tuning, chasing the last 1%. But the last 1% is asymptotic. You’ll spend a week of prompt-golf to go from 99% to 99.4% and still get paged.

The other reflex is to let the model generate, validate the output, and if it’s malformed, ask again. This works, in the sense that you’ll eventually get valid JSON. But it’s reactive, not preventive — you’re cleaning up a mistake instead of preventing it — and it comes with a tax.

The retry loop costs you on four axes:

Masking collapses all of that into a single pass. The output is valid by construction — there’s nothing to validate-and-retry because there’s no path to an invalid object in the first place. One call. Zero retries. Bounded latency. This is the difference between a bouncer who checks IDs at the door and a cleanup crew you call after the party gets raided.

“Just set the bad logits to −∞” raises an obvious question: which logits are bad? The answer changes at every single position, and that’s where the real engineering lives.

The wrinkle is tokenization. "priority" isn't one token — it might be ", prior, ity, ". low might be one token or two. So you can't compute a static "allowed list" once and reuse it. You need to know, given everything generated so far, what tokens are legal next. That's a finite state machine (FSM).

You compile your schema (or regex, or grammar) into an FSM once. Then, at each decoding step, the FSM tells you the set of valid next tokens for the current state, and you mask everything else. In pseudocode the decode loop looks like this:

fsm = compile_schema_to_fsm(triage_schema)   # one-time coststate = fsm.start
while not done:    logits = model.forward(tokens)           # raw scores for whole vocab
allowed = fsm.allowed_token_ids(state)    # depends on what we've emitted    mask = full_of(-inf)    mask[allowed] = 0    logits = logits + mask                    # forbidden tokens -> -inf
next_token = sample(softmax(logits))      # can ONLY be a legal token    tokens.append(next_token)    state = fsm.advance(state, next_token)

That logits + mask line is the whole ballgame. Everything else is bookkeeping to figure out what's allowed.

You almost never write this loop yourself. On open models, libraries do it for you. With Outlines, constraining generation to a Pydantic schema is a few lines:

from outlines import models, generatefrom pydantic import BaseModelfrom enum import Enum
class Priority(str, Enum):    low = "low"; medium = "medium"; high = "high"
class Triage(BaseModel):    priority: Priority    category: str    needs_human: bool    confidence: float
model = models.transformers("your-model")generator = generate.json(model, Triage)   # schema -> FSM -> per-token masksresult = generator("Classify this ticket: ...")  # guaranteed to match Triage

generate.json compiles your schema into exactly the kind of FSM-driven masking above. The model literally cannot emit "urgent" for priority or "yes" for needs_human.

One sharp edge worth knowing: simple FSMs can’t express recursion — arbitrarily nested structures, balanced brackets in deeply nested JSON. That’s why the serious implementations (OpenAI’s, for one) compile to a context-free grammar instead, which is strictly more expressive than an FSM and can handle recursive schemas. Same masking idea, bigger class of languages it can enforce.

Whether you’re calling an API or running the model yourself, it’s the same mechanism. Only the steering wheel is in a different place.

If you’re on a hosted API, you’re often using logit masking without knowing it. OpenAI’s Structured Outputs (response_format: { type: "json_schema", strict: true }) compiles your JSON Schema into a grammar and constrains decoding server-side — that's why it can promise 100% schema compliance rather than "usually." Anthropic shipped the same capability for Claude in late 2025 (constrained decoding via structured outputs / strict tool use), and Google's Gemini exposes it through responseSchema. The one raw knob most APIs expose is logit_bias: a static map of token-id → bias (roughly −100 to +100) added to logits before sampling. It's the primitive itself, but it's blunt — it's the same bias at every position, with no FSM tracking state, so it's good for nudging or banning a handful of specific tokens and useless for enforcing a whole schema.

If you’re running open models (vLLM, llama.cpp, TGI), you wire it yourself with Outlines, lm-format-enforcer, llama.cpp’s GBNF grammars, or vLLM’s guided decoding. Under the hood, modern engines like XGrammar and llguidance do the per-token grammar check in tens of microseconds — negligible next to the model’s own per-token inference time, which is why “constrained” decoding is essentially free at runtime.

The thesis holds in both worlds: form gets masked, meaning gets prompted. Now, the two questions that actually matter in practice.

Masking earns its keep anywhere the output crosses a system boundary and a machine has to parse it:

The pattern: the constraint is about shape, and the shape is knowable in advance.

This is the section the breathless “just use constrained decoding!” posts skip, and it’s the one that separates people who’ve shipped this from people who’ve read about it.

Masking enforces shape, not truth. This is the big one. A masked confidence field is guaranteed to be a valid float. It is not guaranteed to be the right float. You can produce a perfectly schema-valid object that is completely wrong — priority: "low" on a ticket that's actually a five-alarm fire. Masking moves your failures from "won't parse" to "parses fine, means nothing," which can be worse because now they're silent. You still need evals and business-logic validation on top. The mask guarantees the box is the right shape; it says nothing about what's inside.

Over-constraining degrades quality — measurably. When you force the model down a narrow grammar, you sometimes force it to emit a token it assigned almost no probability to, and that distorts everything downstream of it. There’s real evidence here, not just hand-waving: a 2026 structured-extraction benchmark found that flipping providers into strict structured-output mode actually lowered both validity and accuracy on complex schemas versus plain prompting — overall validity dropped from around 51% to 37%. Forcing the shape made the content worse. Constrained decoding is not a free lunch; it’s a trade.

The greedy-local trap. Masking is myopic. It picks the best legal token at each step without knowing that a choice now might paint the model into a corner later, where every remaining legal continuation is one the model hates. Token-by-token masking optimizes locally and can land you in a globally awkward generation. (This is exactly what the research on “grammar-aligned decoding” is trying to fix.)

It fights chain-of-thought. If you constrain the model to emit JSON from token one, you’ve forbidden it from thinking out loud first. For tasks that need reasoning, that’s a real cost. The usual fix is to let it reason freely in an unconstrained scratchpad, then mask only the final structured answer — not to clamp the grammar over the whole generation.

Semantic constraints can’t be masked at all. “Be polite.” “Match our brand voice.” “Don’t be condescending.” There’s no FSM for tone. These are meaning, and meaning lives in the semantic layer. Trying to mask your way to politeness is a category error.

And you need to control the decoder. If your provider doesn’t expose structured outputs or grammar support, you don’t have this lever — you’re back to prompting plus validation, and that’s fine, just know which world you’re in.

If you take one thing from all of this, take the dividing line:

Mask the form. Prompt the meaning.

Shape, types, enums, syntax, the set of legal tokens — that’s the mechanical layer. Pin it with a mask and make bad output impossible, not merely unlikely. Tone, judgment, reasoning, whether the value is actually correct — that’s the semantic layer. Prompt it, example it, and catch the rest with evals.

The reason your LLM obeys 99% of the time is that you’ve been asking it nicely, and asking has a ceiling. The last 1% — the one paging you at 2 a.m. — doesn’t yield to a better argument. It yields to having its options taken away.

Stop negotiating with the tail. Delete it.

*If this was useful, I write about LLM systems, retrieval, and the unglamorous engineering that keeps them from falling over. Follow me on *

Your LLM Obeys 99% of the Time. That 1% Is Taking Down Production. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

source & further reading

pub.towardsai.net — original article I Deleted Every Static Claude API Key I Owned. Here’s the Keyless Migration, Provider by Provider. I Replaced ChatGPT With Local AI for 30 Days. Here’s What Actually Happened. AsyncIO in Python: What It Actually Is and Why Your ‘Async’ Code Might Not Be Async

Your LLM Obeys 99% of the Time. That 1% Is Taking Down Production.

Run your AI side-project on zahid.host