{"slug": "your-llm-obeys-99-of-the-time-that-1-is-taking-down-production", "title": "Your LLM Obeys 99% of the Time. That 1% Is Taking Down Production.", "summary": "A 99% obedient LLM still produces tens of thousands of malformed outputs daily at scale, causing production outages. The root cause is that prompt engineering and validation loops only persuade, not guarantee, correct output. Logit masking—setting forbidden tokens' logits to negative infinity before softmax—is the only technique that provides a hard guarantee by making invalid tokens unrepresentable.", "body_md": "You shipped the feature. It worked in the demo. It worked in staging. It worked in the first week of prod. Then one Tuesday at 2 a.m. your pager goes off because a downstream consumer choked on this:\n\n```\n{ \"priority\": \"urgent\", \"category\": \"billing\", \"needs_human\": \"yes\", \"confidence\": 0.91 }\n```\n\nSpot the bombs. priority was supposed to be one of low | medium | high — \"urgent\" isn't in the enum. needs_human was supposed to be a boolean — you got the *string* \"yes\". Your parser did exactly what you told it to. Your model did *almost* exactly what you told it to. And \"almost\" is the whole problem.\n\nHere’s the uncomfortable math. A 99% obedient model sounds great until you’re doing a few million calls a day. 1% of a few million is tens of thousands of malformed objects — every single day. That’s not an edge case. That’s a feature with a built-in outage schedule.\n\nThe instinct is to fix this where you can see it: in the prompt. Add more instructions. Add examples. Add a validation-and-retry loop. All of that helps. *None of it guarantees anything.* This article is about why, and about the one technique that actually closes the gap: **logit masking**.\n\nTo understand the fix, you have to look one level below the token. Most of us treat the model as a black box that emits text. It isn’t. 
At every single step, it does this:\n\nThe thing to internalize is that there’s a precise, surgical moment between step 1 and step 2 — between the raw scores and the dice roll — where you can reach in and edit the numbers before they ever become probabilities. That’s the only place a hard guarantee can live. Everything you do in the prompt is upstream of the model deciding; logit masking happens *inside* the decision.\n\nLogit masking is almost insultingly simple to state: **for any token you want to forbid, set its logit to negative infinity before the softmax.**\n\nWhy −∞? Because softmax exponentiates. e^(-∞) = 0. A token with a logit of −∞ comes out the other side of softmax with a probability of *exactly* zero. Not 0.3%. Not 0.0001%. Zero. It cannot be sampled. It has been deleted from the model's vocabulary for that one step.\n\nGo back to our triage object. When the model is filling in the priority field and it's at the position right after \"priority\": \", you know — mechanically, from the schema — that the only legal next tokens are the ones that spell low, medium, or high. So you mask everything else. The token for urg (the start of \"urgent\") gets −∞. The model *physically cannot* start typing \"urgent,\" no matter how confident it was. The 2 a.m. bug isn't caught — it's made unrepresentable.\n\nThat’s the entire idea. Now the interesting part: why can’t the cheaper, more familiar tools do the same thing?\n\nEvery technique for steering a model lives in one of two layers, and they are not interchangeable.\n\nThe **semantic layer** is where you *persuade*. Prompts, few-shot examples, “respond ONLY with JSON, do not add commentary, I will tip you $200” — all of it operates by nudging the probability distribution toward what you want. It’s negotiation. The model considers your request and is *more likely* to comply.\n\nThe **mechanical layer** is where you *overrule*. Logit masking doesn’t ask the model to behave. 
It removes the model’s ability to misbehave. It’s not a stronger argument; it’s a different category of thing entirely.\n\nThe reason your prod is on fire is that you’ve been trying to win a *guarantee* with *persuasion*. You can’t. Persuasion shifts odds. It never zeroes them. Let me show you exactly why, for both of the usual suspects.\n\nFew-shot is genuinely useful. Show the model five perfect examples of the triage object and your malformed-output rate will drop. Maybe a lot. But watch what’s actually happening to the distribution.\n\nExamples pile probability mass onto the valid tokens. The peak over low | medium | high gets taller and sharper. But the token for \"urgent\" doesn't go to zero — it goes to *small*. There's still a little bump out in the tail. The model still believes, with some tiny probability, that \"urgent\" is a reasonable thing to say, because in its training data it absolutely was.\n\nAnd here’s the part people miss: **small times big is not small.** A 0.3% chance of a bad token, across millions of generations, is a steady drip of broken objects into your pipeline. Few-shot moved you from “broken often” to “broken rarely.” It cannot move you to “broken never,” because there is no prompt — none — that sets a probability to exactly zero. That’s not a thing the semantic layer can do. It doesn’t have the API surface for it.\n\nThis is the trap with prompting in general: it *feels* like control because the failure rate drops, so you keep tuning, chasing the last 1%. But the last 1% is asymptotic. You’ll spend a week of prompt-golf to go from 99% to 99.4% and still get paged.\n\nThe other reflex is to let the model generate, validate the output, and if it’s malformed, ask again. This *works*, in the sense that you’ll eventually get valid JSON. 
But it’s reactive, not preventive — you’re cleaning up a mistake instead of preventing it — and it comes with a tax.\n\nThe retry loop costs you on four axes:\n\nMasking collapses all of that into a single pass. The output is valid *by construction* — there’s nothing to validate-and-retry because there’s no path to an invalid object in the first place. One call. Zero retries. Bounded latency. This is the difference between a bouncer who checks IDs at the door and a cleanup crew you call after the party gets raided.\n\n“Just set the bad logits to −∞” raises an obvious question: *which* logits are bad? The answer changes at every single position, and that’s where the real engineering lives.\n\nThe wrinkle is tokenization. \"priority\" isn't one token — it might be \", prior, ity, \". low might be one token or two. So you can't compute a static \"allowed list\" once and reuse it. You need to know, *given everything generated so far*, what tokens are legal *next*. That's a finite state machine (FSM).\n\nYou compile your schema (or regex, or grammar) into an FSM once. Then, at each decoding step, the FSM tells you the set of valid next tokens for the current state, and you mask everything else. In pseudocode the decode loop looks like this:\n\n```\nfsm = compile_schema_to_fsm(triage_schema)   # one-time cost\nstate = fsm.start\n\nwhile not done:\n    logits = model.forward(tokens)            # raw scores for whole vocab\n    allowed = fsm.allowed_token_ids(state)    # depends on what we've emitted\n    mask = full_of(-inf)\n    mask[allowed] = 0\n    logits = logits + mask                    # forbidden tokens -> -inf\n    next_token = sample(softmax(logits))      # can ONLY be a legal token\n    tokens.append(next_token)\n    state = fsm.advance(state, next_token)\n```\n\nThat logits + mask line is the whole ballgame. Everything else is bookkeeping to figure out what's allowed.\n\nYou almost never write this loop yourself. On open models, libraries do it for you. 
With [Outlines](https://github.com/dottxt-ai/outlines), constraining generation to a Pydantic schema is a few lines:\n\n``` python\n# Uses the Outlines 0.x entry points (models.transformers, generate.json);\n# later releases reorganized this API, so check the docs for your version.\nfrom outlines import models, generate\nfrom pydantic import BaseModel\nfrom enum import Enum\n\nclass Priority(str, Enum):\n    low = \"low\"\n    medium = \"medium\"\n    high = \"high\"\n\nclass Triage(BaseModel):\n    priority: Priority\n    category: str\n    needs_human: bool\n    confidence: float\n\nmodel = models.transformers(\"your-model\")\ngenerator = generate.json(model, Triage)   # schema -> FSM -> per-token masks\nresult = generator(\"Classify this ticket: ...\")  # guaranteed to match Triage\n```\n\ngenerate.json compiles your schema into exactly the kind of FSM-driven masking above. The model literally cannot emit \"urgent\" for priority or \"yes\" for needs_human.\n\nOne sharp edge worth knowing: simple FSMs can’t express *recursion* — arbitrarily nested structures, balanced brackets in deeply nested JSON. That’s why the serious implementations (OpenAI’s, for one) compile to a **context-free grammar** instead, which is strictly more expressive than an FSM and can handle recursive schemas. Same masking idea, bigger class of languages it can enforce.\n\nWhether you’re calling an API or running the model yourself, it’s the *same mechanism*. Only the steering wheel is in a different place.\n\nIf you’re on a **hosted API**, you’re often using logit masking without knowing it. OpenAI’s Structured Outputs (response_format: { type: \"json_schema\", json_schema: { strict: true, ... } }) compiles your JSON Schema into a grammar and constrains decoding server-side — that's why it can promise 100% schema compliance rather than \"usually.\" Anthropic shipped the same capability for Claude in late 2025 (constrained decoding via structured outputs / strict tool use), and Google's Gemini exposes it through responseSchema. The one *raw* knob most APIs expose is logit_bias: a static map of token-id → bias (roughly −100 to +100) added to logits before sampling. 
It's the primitive itself, but it's blunt — it's the same bias at every position, with no FSM tracking state, so it's good for nudging or banning a handful of specific tokens and useless for enforcing a whole schema.\n\nIf you’re running **open models** (vLLM, llama.cpp, TGI), you wire it yourself with Outlines, lm-format-enforcer, llama.cpp’s GBNF grammars, or vLLM’s guided decoding. Under the hood, modern engines like XGrammar and llguidance do the per-token grammar check in tens of microseconds — negligible next to the model’s own per-token inference time, which is why “constrained” decoding is essentially free at runtime.\n\nThe thesis holds in both worlds: **form gets masked, meaning gets prompted.** Now, the two questions that actually matter in practice.\n\nMasking earns its keep anywhere the output crosses a system boundary and a machine has to parse it:\n\nThe pattern: **the constraint is about shape, and the shape is knowable in advance.**\n\nThis is the section the breathless “just use constrained decoding!” posts skip, and it’s the one that separates people who’ve shipped this from people who’ve read about it.\n\n**Masking enforces shape, not truth.** This is the big one. A masked confidence field is *guaranteed* to be a valid float. It is *not* guaranteed to be the *right* float. You can produce a perfectly schema-valid object that is completely wrong — priority: \"low\" on a ticket that's actually a five-alarm fire. Masking moves your failures from \"won't parse\" to \"parses fine, means nothing,\" which can be *worse* because now they're silent. You still need evals and business-logic validation on top. The mask guarantees the box is the right shape; it says nothing about what's inside.\n\n**Over-constraining degrades quality — measurably.** When you force the model down a narrow grammar, you sometimes force it to emit a token it assigned almost no probability to, and that distorts everything downstream of it. 
There’s real evidence here, not just hand-waving: a 2026 structured-extraction benchmark found that flipping providers into strict structured-output mode actually *lowered* both validity and accuracy on complex schemas versus plain prompting — overall validity dropped from around 51% to 37%. Forcing the shape made the content worse. Constrained decoding is not a free lunch; it’s a trade.\n\n**The greedy-local trap.** Masking is myopic. It picks the best *legal* token at each step without knowing that a choice now might paint the model into a corner later, where every remaining legal continuation is one the model hates. Token-by-token masking optimizes locally and can land you in a globally awkward generation. (This is exactly what the research on “grammar-aligned decoding” is trying to fix.)\n\n**It fights chain-of-thought.** If you constrain the model to emit JSON from token one, you’ve forbidden it from thinking out loud first. For tasks that need reasoning, that’s a real cost. The usual fix is to let it reason freely in an unconstrained scratchpad, *then* mask only the final structured answer — not to clamp the grammar over the whole generation.\n\n**Semantic constraints can’t be masked at all.** “Be polite.” “Match our brand voice.” “Don’t be condescending.” There’s no FSM for tone. These are meaning, and meaning lives in the semantic layer. Trying to mask your way to politeness is a category error.\n\n**And you need to control the decoder.** If your provider doesn’t expose structured outputs or grammar support, you don’t have this lever — you’re back to prompting plus validation, and that’s fine, just know which world you’re in.\n\nIf you take one thing from all of this, take the dividing line:\n\n**Mask the form. Prompt the meaning.**\n\nShape, types, enums, syntax, the set of legal tokens — that’s the mechanical layer. Pin it with a mask and make bad output *impossible*, not merely unlikely. 
Tone, judgment, reasoning, whether the value is actually *correct* — that’s the semantic layer. Prompt it, example it, and catch the rest with evals.\n\nThe reason your LLM obeys 99% of the time is that you’ve been asking it nicely, and asking has a ceiling. The last 1% — the one paging you at 2 a.m. — doesn’t yield to a better argument. It yields to having its options taken away.\n\nStop negotiating with the tail. Delete it.\n\n[Your LLM Obeys 99% of the Time. That 1% Is Taking Down Production.](https://pub.towardsai.net/your-llm-obeys-99-of-the-time-that-1-is-taking-down-production-a6ea1b6f00c1) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/your-llm-obeys-99-of-the-time-that-1-is-taking-down-production", "canonical_source": "https://pub.towardsai.net/your-llm-obeys-99-of-the-time-that-1-is-taking-down-production-a6ea1b6f00c1?source=rss----98111c9905da---4", "published_at": "2026-06-24 04:52:14+00:00", "updated_at": "2026-06-24 05:19:01.783595+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-products", "ai-infrastructure"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/your-llm-obeys-99-of-the-time-that-1-is-taking-down-production", "markdown": "https://wpnews.pro/news/your-llm-obeys-99-of-the-time-that-1-is-taking-down-production.md", "text": "https://wpnews.pro/news/your-llm-obeys-99-of-the-time-that-1-is-taking-down-production.txt", "jsonld": "https://wpnews.pro/news/your-llm-obeys-99-of-the-time-that-1-is-taking-down-production.jsonld"}}