{"slug": "rag-for-codebases-is-hard-trusting-the-answer-is-harder", "title": "RAG for codebases is hard. Trusting the answer is harder.", "summary": "A developer argues that retrieval-augmented generation (RAG) for codebases improves context but not verifiability, citing that 30% of failed SWE-agent runs still claimed success. They introduce 'truth', a verification tool that checks model claims against real code and git diffs, returning Supported, Contradicted, or Inconclusive verdicts with cited evidence.", "body_md": "There's a good post making the rounds, [\"RAG for Codebases Is Harder Than It Looks\"](https://dev.to/mahima_thacker/rag-for-codebases-is-harder-than-it-looks-1nhg), on why naive retrieval falls apart on code. It covers filtering out `node_modules`, chunking on AST boundaries instead of token counts, preserving file-path metadata, and treating architecture questions as graph traversals rather than nearest-neighbor lookups. All correct. If you're building code RAG, read it.\n\nBut notice where the author lands at the very end:\n\nDeveloper trust depends on retrieval quality, honest uncertainty acknowledgment, and verifiable sources, not just algorithmic sophistication.\n\nThat's the part I want to pull on. Everything before it (better chunking, better embeddings, repo maps) makes the model's answer *more likely to be right*. None of it makes the answer *checkable*. And once an LLM is writing your code, \"more likely to be right\" is not the same problem as \"did it actually do what it just told me it did.\"\n\nGood RAG gets the model better context. The model then makes an edit and reports back: *\"I added the /v1/refund route, set MAX_RETRIES to 5, only touched the parser, and the tests pass.\"*\n\nEvery one of those is a factual claim about the repo as it exists right now. 
And here's the uncomfortable measurement: across 100 real SWE-agent runs on real GitHub issues, **30% of the attempts that failed the test suite still claimed they fixed the issue** (\"the method has been successfully added,\" \"the issue has been resolved\"). Ground truth comes from the SWE-bench eval, the claims were judged by an LLM, and the result is reproducible.\n\nBetter retrieval doesn't fix that. A model with perfect context can still over-claim what it did. The hallucination moves from \"wrong fact about the codebase\" to \"wrong fact about my own change,\" which is harder to catch because it sounds like a status update, not a guess.\n\nThe article's three closing points (honest uncertainty, verifiable sources, no speculation) are exactly the design constraints I've been building `truth` around. But I'd argue you can't get them from a better RAG pipeline, because they're not retrieval properties. They're *verification* properties, with evidence cited down to `file:line`.\n\nSo `truth` does the inverse of RAG. RAG retrieves context *to help the model answer*. `truth` retrieves evidence *to check what the model claimed* (against the real code, the working-tree git diff, recorded command runs, and logs) and returns **Supported / Contradicted / Refused**, every verdict cited.\n\n``` bash\n$ truth verify-turn \"I added the /v1/refund endpoint, set MAX_RETRIES to 3, I only changed src/api.rs, and tests pass\"\n\n  ✓ Supported     I added the /v1/refund endpoint  (src/api.rs)\n  ✗ Contradicted  set MAX_RETRIES to 3             (src/config.rs:1)\n  ✗ Contradicted  I only changed src/api.rs        (src/api.rs)\n  ✗ Contradicted  tests pass                       (recorded 2026-06-11 09:14 UTC)\n```\n\nAn LLM only parses the sentence into a structured claim. 
The verdict comes from a deterministic engine.\n\n\"Honest uncertainty acknowledgment\" has a sharp edge the article doesn't dwell on: **a verifier that false-accuses gets uninstalled on day one.** Crying wolf on a truthful agent is the one unforgivable failure.\n\nSo `truth` only ever says `Contradicted` on a structured binary fact, never on a noisy count (a text scan over-counts comments; a log window only samples). Claims like *\"Nobody uses this,\"* *\"updated all 4 call sites,\"* or *\"X is unused\"* surface their evidence as **Inconclusive**: a suspicion to act on, not a verdict that blocks. And it's measured, not asserted: a labeled corpus of *real* agent over-claims holds the false-contradiction rate at **0** as a hard CI gate. When `truth` blocks, it means it.\n\nThis is the same instinct the article has about retrieval noise (\"if chunks are too large, retrieval becomes noisy\"), applied to verdicts. A noisy verifier is as useless as noisy retrieval. Worse, because it erodes the exact trust it was supposed to provide.\n\nThe author is right that code RAG needs preprocessing, structure-awareness, and citations. I'd add one thing to the conclusion: those buy you a *better-informed* model, not a *trustworthy* one. When the model is editing your repo, the last mile isn't retrieval quality; it's whether you can independently confirm what it told you, with a citation, deterministically, and without it ever crying wolf.\n\n`truth` is that last mile. It runs entirely locally (a single SQLite file in `.truth/`; your code never leaves the machine), exposes one MCP tool an agent calls on *its own* message before reporting done, and can be a hook the agent can't skip.\n\nIf you're building code RAG, your model's answers are about to get a lot better. 
Make them checkable too.", "url": "https://wpnews.pro/news/rag-for-codebases-is-hard-trusting-the-answer-is-harder", "canonical_source": "https://dev.to/blasrodri/rag-for-codebases-is-hard-trusting-the-answer-is-harder-4lfi", "published_at": "2026-06-29 17:16:36+00:00", "updated_at": "2026-06-29 17:49:09.554826+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-agents"], "entities": ["SWE-bench", "truth"], "alternates": {"html": "https://wpnews.pro/news/rag-for-codebases-is-hard-trusting-the-answer-is-harder", "markdown": "https://wpnews.pro/news/rag-for-codebases-is-hard-trusting-the-answer-is-harder.md", "text": "https://wpnews.pro/news/rag-for-codebases-is-hard-trusting-the-answer-is-harder.txt", "jsonld": "https://wpnews.pro/news/rag-for-codebases-is-hard-trusting-the-answer-is-harder.jsonld"}}