When an LLM feature breaks in production, the first instinct in the room is to open the prompt and reword it. We had that instinct too. Then we built a production assistant for financial advisors and kept a record of every LLM-related failure the system hit, along with the fix that actually closed each one. Across the whole build, almost nothing that mattered was fixable by editing a prompt. The durable fixes were architectural. The single time we tried a prompt-only fix on the hardest problem, it made things measurably worse, and we reverted it.
This is the case for treating the model as one untrusted component in a larger system, written from the failures that taught us to do it.
Routing was our most unstable surface, and it was unstable in a way that prompt editing could not reach. The same question routed one way on one run and a different way on the next, with no code change in between. On the ambiguous edges, routing accuracy sat around 56 to 64 percent and was non-deterministic from run to run. A question like “rank my households by AUM” came back as a clarification request on one run and a confident answer on the next.
The obvious move was to explain the boundary better in the routing prompt. We added a block of guidance describing how to tell the categories apart. The classifier got less stable, not more, so we took the prose back out.
The reason mattered more than the symptom. The flakiness was not bad luck. It was structural, and it got worse every time we added a domain. The router guessed an abstract category first, and then code mapped that category to a concrete tool. The category was a lossy middle step. When the guess was wrong, the right tool was unreachable, and there was no signal left to recover from.
The fix was to delete the abstract category. We collapsed routing into a single stage that picks one concrete tool directly from the catalog, with the scope derived from the tool it picked rather than guessed up front. We narrowed what each decision had to weigh, so any one call discriminates among a handful of tools instead of the full set. We grounded each tool with a few example utterances carried as structured data, not as prose inside the prompt. Accuracy moved from a flaky 98 percent on the old design, down through a real regression to 72 percent right after the rewrite, and up to 100 percent on both of our evaluation suites once the grounding was in place.
The improvement came from removing decisions the model did not need to make. That set the pattern for most of what followed: take work away from the model wherever code can do it, and catch what is left in deterministic guardrails.
A related failure looked like a bad model decision and was actually a missing option. Asked to “show the first account,” the model picked the nearest available tool, a holdings lookup, because no account-listing tool existed, and it invented an ordinal to make the answer fit. The model was choosing the closest thing within reach. We fixed it by building the tool it needed, not by writing a prompt that apologized for the gap.
A model fills a blank confidently, and confidence is the dangerous part. “Create a task for 2pm” put the string “2pm” into a field our code expected to hold a computed timestamp. The parser tried to read “2pm” as an ISO instant, threw, and the user saw a generic server error. This crash only reproduced with real model output. Every offline test we had written passed an empty argument map, and an empty map never triggers the bug. The lesson we wrote down for the next person chasing a live crash they cannot reproduce: vary the arguments the model actually produces, because empty-argument mocks hide a whole class of failure.
The crash was one instance of a wider habit. When asked to create a task with no subject, the model invented a subject that was a bare echo of the tool’s own name, and because the field was now non-empty, it sent the task straight to confirmation instead of asking. It presented a specific row as “the first” on a list our schema never ordered, reading fixture order as if it were real order. Given a comparison of two figures, it would do the arithmetic itself and sometimes get it wrong.
These were not instruction problems, and the fixes had the same shape every time. Detect the value the model invented and refuse it. A matcher catches tool-name-echo placeholders and forces a clarification. A check drops a non-ISO time argument and re-derives the time from the user’s words in code. A flag on unordered lists, plus a guard that returns the full list with a notice instead of asserting an ordinal. Percentages, deltas, and filtered totals moved into code, with the prompt reduced to one line: do not compute this yourself. The principle underneath all of them is to never trust a model-populated value as if it were the typed or computed value. Validate before you parse.
We run a deterministic grounding check over every answer that cites figures. It compares the numbers in the rendered answer against the numbers the tool actually returned, and when the answer cites a figure with no support, it withholds the answer behind a message saying we could not fully verify it. This is the reason a fabricated “that’s 5% of your $8.3M,” which the renderer invented on three or four runs out of six, never reached a user. The render prompt already banned invented statistics. The ban alone did not hold, so deterministic shaping that drops the unreliable input and renders from the authoritative summary is what took those failures from four in six to one in six.
That same layer was also our most common source of false holds, where a correct answer was withheld because of a verification bug. A correct answer held back is expensive precisely because the fallback message is the right behavior when verification genuinely fails, which made every false positive look legitimate.
The verification bugs were all in code, never in a prompt. A currency regex read the leading “t” of the word “today” as a trillion suffix, turned a small dollar figure into a number on the order of ten to the twelfth, failed to find it in the source, and hard-blocked about 80 percent of runs. The first attempt to fix it widened the accepted values in a test fixture, which masked the symptom rather than the cause, and we reverted that. The real fix was a magnitude token that terminates on a word boundary. The check was sign-sensitive, so a negative withdrawal rendered as a positive figure failed to ground until we compared absolute values. It scanned only numeric fields, so a figure living inside a human-readable explanation string was flagged as fabricated until we extracted numbers from source strings too.
Two design decisions about that layer were the fix rather than a bug. We stopped running the non-deterministic rubric judge on executed writes, because the confirmation there is deterministic and templated, so judging it added flake for no signal. We made the data path fail closed, so a tool-backed read that cites figures against an empty source is blocked outright, and an empty tool result returns a fixed unavailable message instead of calling the renderer with nothing to render.
There is a cost to running a guardrail this strict, and we are still paying it. A required gate that flakes trains people to bypass it, which is the subject of the second part of this series.
Prompt edits did real work in this system, inside a narrow band. A one-line rule so “compare A and B” is not mis-read as a request that needs clarification. A broadened description of an activity category so a class of question routes correctly. A shared preamble that owns the never-invent-a-link rule so it cannot drift across the several render prompts that each need it. Each of these is a small, generalizable nudge sitting on top of code that is already correct. Correctness lives in the code beneath them.
We did not close all of it. One residual case where the renderer occasionally fabricates a derived percentage is still left to the prompt, with the grounding check as the backstop. The grounding layer is tuned conservatively, and reducing its false holds without reopening a fabrication gap is a standing piece of work. There is a model-variance lever we have not pulled: running the tool-selection call at temperature zero with a self-consistency vote. We deferred it because example-grounding already reached 100 percent on our current suites, which also means the lever is untested against the cases that would actually exercise it.
We work on AI and data infrastructure for wealth management at Advisor360°. We know these failure modes at this resolution because we kept the receipts: every incident here came out of the build history of a system we shipped, not from a whiteboard.
The reflex to fix an LLM by talking to it differently is strong because the prompt is the part of the system that feels like it is in our hands. The work that held up was less satisfying than that. We took decisions away from the model, we validated what it returned before we trusted it, and we let deterministic code block the outcomes we could not afford. The model is one component, and treating its output as untrusted input is what made the rest of the system safe to build on.
This is the first of a two-part series. Part 2 covers what happens when the test harness meant to catch these failures is itself non-deterministic: the rerun-until-green trap, accuracy floors that stay green while a real regression merges, and LLM judges that disagree with themselves run to run.
You Can’t Prompt Your Away Your LLM Problems was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.