{"slug": "the-hardest-llm-bugs-are-contract-failures-not-hallucinations", "title": "The hardest LLM bugs are contract failures, not hallucinations", "summary": "A developer argues that many production bugs in LLM applications are better described as contract failures rather than hallucinations. The developer introduces DebugAI, a Python SDK that diagnoses failures by inspecting system contracts around model calls, categorizing issues such as tool call failures, output validation failures, and instruction failures.", "body_md": "When people talk about LLM failures, the default word is usually \"hallucination.\"\n\nBut after building and testing LLM apps, I think many production bugs are better described as contract failures.\n\nA hallucination is when the model makes something up. That matters, but it is not the only failure mode.\n\nThe subtler bugs happen when the model had enough context, but violated the surrounding system contract.\n\nExamples:\n\nRetrieval failures are usually easier to notice. You can compare the answer against the retrieved chunks and see that the grounding is weak.\n\nContract failures are harder because the answer may look plausible. The text might even be correct in isolation. But the system still failed because the model did not do the thing the application required.\n\nFor example, in a support agent, the problem may not be that the model gave a bad refund answer. The problem may be that it answered without first calling the refund eligibility tool.\n\nIn a data extraction app, the problem may not be that the model misunderstood the document. The problem may be that it returned almost-correct JSON that fails validation in the next service.\n\nIn a RAG workflow, the problem may not be missing context. The problem may be that the answer made a claim without attaching the citation or artifact that proves it.\n\nThis changed how I think about LLM debugging.\n\nPrompt and response logs are useful, but they are not enough. 
A debugger should also inspect the contracts around the model call, such as the tool calls that were required, the schema the output has to validate against, and the citations a claim has to carry.\n\nThis is the direction I am taking with DebugAI, a Python SDK I have been building.\n\nThe goal is to take a bad LLM response and return a structured debug artifact that names the failure class, the evidence for it, and a suggested fix.\n\nFor example, instead of only saying \"bad answer,\" the diagnosis should say something like:\n\n    {\n      \"failure\": \"tool_call_failure\",\n      \"evidence\": [\n        \"Expected refund_order tool before answering\",\n        \"No tool call was made\"\n      ],\n      \"fix\": \"Require tool execution before final answer and reject responses without tool evidence.\"\n    }\n\nI think this framing is more useful than treating every LLM bug as a hallucination.\n\nSome bugs are grounding failures.\n\nSome are retrieval failures.\n\nSome are instruction failures.\n\nSome are output validation failures.\n\nSome are tool contract failures.\n\nThe more specific the failure class, the easier it is to fix and test.\n\nRepo is public here if useful: [https://github.com/civicRJ/DebugAI](https://github.com/civicRJ/DebugAI)", "url": "https://wpnews.pro/news/the-hardest-llm-bugs-are-contract-failures-not-hallucinations", "canonical_source": "https://dev.to/rishabh_jain_4d7dd020e595/the-hardest-llm-bugs-are-contract-failures-not-hallucinations-7g2", "published_at": "2026-06-20 03:43:11+00:00", "updated_at": "2026-06-20 04:36:50.824648+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-agents", "ai-products", "ai-research"], "entities": ["DebugAI", "Python", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/the-hardest-llm-bugs-are-contract-failures-not-hallucinations", "markdown": "https://wpnews.pro/news/the-hardest-llm-bugs-are-contract-failures-not-hallucinations.md", "text": "https://wpnews.pro/news/the-hardest-llm-bugs-are-contract-failures-not-hallucinations.txt", "jsonld": "https://wpnews.pro/news/the-hardest-llm-bugs-are-contract-failures-not-hallucinations.jsonld"}}