# The hardest LLM bugs are contract failures, not hallucinations

> Source: <https://dev.to/rishabh_jain_4d7dd020e595/the-hardest-llm-bugs-are-contract-failures-not-hallucinations-7g2>
> Published: 2026-06-20 03:43:11+00:00

When people talk about LLM failures, the default word is usually "hallucination."

But after building and testing LLM apps, I think many production bugs are better described as contract failures.

A hallucination is when the model makes something up. That matters, but it is not the only failure mode.

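The subtler bugs happen when the model had enough context but still violated the contract of the surrounding system. It skipped a required tool call, returned output that fails validation downstream, or made a claim without the citation that backs it.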

Retrieval failures are usually easier to notice. You can compare the answer against the retrieved chunks and see that the grounding is weak.

Contract failures are harder because the answer may look plausible. The text might even be correct in isolation. But the system still failed because the model did not do the thing the application required.

For example, in a support agent, the problem may not be that the model gave a bad refund answer. The problem may be that it answered without first calling the refund eligibility tool.
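To make the support agent case concrete, here is a minimal sketch of a check that the eligibility tool ran before the final answer. The `Event` type, the trace layout, and the tool name `check_refund_eligibility` are all hypothetical. They stand in for whatever trace format your agent framework actually records.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical trace entry for one step of an agent run."""
    kind: str  # "tool_call" or "final_answer"
    name: str  # tool name, or "" for the final answer


def called_before_answer(trace: list[Event], required_tool: str) -> bool:
    """Return True only if required_tool was called before the first final answer."""
    for event in trace:
        if event.kind == "tool_call" and event.name == required_tool:
            return True
        if event.kind == "final_answer":
            # The model answered before the required tool ran.
            return False
    return False


skipped = [Event("final_answer", "")]
compliant = [Event("tool_call", "check_refund_eligibility"), Event("final_answer", "")]

print(called_before_answer(skipped, "check_refund_eligibility"))
print(called_before_answer(compliant, "check_refund_eligibility"))
```

The check never looks at the answer text. That is the point: the refund answer can read fine and the run still fails the contract.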

In a data extraction app, the problem may not be that the model misunderstood the document. The problem may be that it returned almost-correct JSON that fails validation in the next service.
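A rough sketch of catching that "almost-correct JSON" before it reaches the next service, using only the standard library. The field names and types in `REQUIRED_FIELDS` are made up for illustration. A real app would more likely use its own schema or a validation library.

```python
import json

# Hypothetical schema for an invoice extraction task.
REQUIRED_FIELDS = {"invoice_id": str, "total": float, "currency": str}


def validate_extraction(raw: str) -> list[str]:
    """Return a list of contract violations; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"Output is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["Top-level JSON value is not an object"]

    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"Missing field: {name}")
        elif not isinstance(data[name], expected_type):
            actual = type(data[name]).__name__
            errors.append(f"Field {name} should be {expected_type.__name__}, got {actual}")
    return errors


# The model understood the document, but emitted the total as a string.
almost_correct = '{"invoice_id": "INV-17", "total": "42.50", "currency": "USD"}'
print(validate_extraction(almost_correct))
```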

In a RAG workflow, the problem may not be missing context. The problem may be that the answer made a claim without attaching the citation or artifact that proves it.
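The citation contract can be sketched the same way. This example assumes answers cite retrieved chunks with numeric markers like `[1]` placed before the sentence's final punctuation, and it uses a naive regex sentence split. Both are simplifying assumptions, not a general solution.

```python
import re

# Assumed citation format: numeric markers such as [1] or [2].
CITATION = re.compile(r"\[(\d+)\]")


def uncited_sentences(answer: str, retrieved_ids: set[int]) -> list[str]:
    """Flag sentences with no citation, or with a citation to a chunk that was never retrieved."""
    problems = []
    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = {int(n) for n in CITATION.findall(sentence)}
        if not cited:
            problems.append(f"No citation: {sentence}")
        elif not cited <= retrieved_ids:
            unknown = sorted(cited - retrieved_ids)
            problems.append(f"Cites unknown chunk {unknown}: {sentence}")
    return problems


answer = "Refunds are allowed within 30 days [1]. Shipping is always free."
print(uncited_sentences(answer, retrieved_ids={1, 2}))
```

The second sentence might even be true. It is flagged because it broke the rule that every claim carries its evidence.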

This changed how I think about LLM debugging.

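Prompt and response logs are useful, but they are not enough. A debugger should also inspect the contracts around the model call, such as which tools had to run, what output schema the next service expects, and what evidence each claim needs.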

This is the direction I am taking with DebugAI, a Python SDK I have been building.

The goal is to take a bad LLM response and return a structured debug artifact: a failure class, the evidence for it, and a suggested fix.

For example, instead of only saying "bad answer," the diagnosis should say something like:

```json
{
  "failure": "tool_call_failure",
  "evidence": [
    "Expected refund_order tool before answering",
    "No tool call was made"
  ],
  "fix": "Require tool execution before final answer and reject responses without tool evidence."
}
```
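As a rough illustration of how a check could produce an artifact with that shape, here is a small sketch. This is not DebugAI's actual API. The `Diagnosis` class and the `diagnose_tool_contract` function are hypothetical names chosen to mirror the three JSON fields above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Diagnosis:
    """Hypothetical debug artifact with the same fields as the JSON example."""
    failure: str
    evidence: list[str]
    fix: str


def diagnose_tool_contract(tool_calls: list[str], required_tool: str) -> Optional[Diagnosis]:
    """Return a Diagnosis if the required tool was never called, else None."""
    if required_tool in tool_calls:
        return None
    if tool_calls:
        observed = f"Only called: {', '.join(tool_calls)}"
    else:
        observed = "No tool call was made"
    return Diagnosis(
        failure="tool_call_failure",
        evidence=[f"Expected {required_tool} tool before answering", observed],
        fix="Require tool execution before final answer and reject responses without tool evidence.",
    )


diagnosis = diagnose_tool_contract(tool_calls=[], required_tool="refund_order")
if diagnosis is not None:
    print(json.dumps(asdict(diagnosis), indent=2))
```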

I think this framing is more useful than treating every LLM bug as a hallucination.

Some bugs are grounding failures.

Some are retrieval failures.

Some are instruction failures.

Some are output validation failures.

Some are tool contract failures.

The more specific the failure class, the easier it is to fix and test.
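One way to make those classes testable is to give each one a name and a check, then assert on the class in a regression test. The enum, the trace layout, and the two example checks below are illustrative assumptions. Only the five class names come from the list above.

```python
from enum import Enum
from typing import Callable, Optional


class FailureClass(Enum):
    GROUNDING = "grounding failure"
    RETRIEVAL = "retrieval failure"
    INSTRUCTION = "instruction failure"
    OUTPUT_VALIDATION = "output validation failure"
    TOOL_CONTRACT = "tool contract failure"


# Each check takes a trace dict and returns True when its contract holds.
Check = Callable[[dict], bool]


def classify(trace: dict, checks: dict[FailureClass, Check]) -> Optional[FailureClass]:
    """Return the first failure class whose contract is violated, or None if all hold."""
    for failure_class, contract_holds in checks.items():
        if not contract_holds(trace):
            return failure_class
    return None


# Hypothetical checks for a refund agent; only two classes are wired up here.
checks: dict[FailureClass, Check] = {
    FailureClass.TOOL_CONTRACT: lambda t: "check_refund_eligibility" in t.get("tool_calls", []),
    FailureClass.OUTPUT_VALIDATION: lambda t: isinstance(t.get("output"), dict),
}

trace = {"tool_calls": [], "output": {"refund": True}}
print(classify(trace, checks))
```

A test can then assert on `FailureClass.TOOL_CONTRACT` directly instead of pattern-matching on answer text.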

Repo is public here if useful: [https://github.com/civicRJ/DebugAI](https://github.com/civicRJ/DebugAI)