[ARTICLE · art-43796] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=· neutral

RAG for codebases is hard. Trusting the answer is harder.

A developer argues that retrieval-augmented generation (RAG) for codebases improves context but not verifiability, citing that 30% of failed SWE-agent runs still claimed success. They introduce 'truth', a verification tool that checks model claims against real code and git diffs, returning Supported, Contradicted, or Inconclusive verdicts with cited evidence.

read: 4 min · views: 1 · published: Jun 29, 2026

There's a good post making the rounds, "RAG for Codebases Is Harder Than It Looks", on why naive retrieval falls apart on code. Filtering out node_modules, chunking on AST boundaries instead of token counts, preserving file-path metadata, treating architecture questions as graph traversals rather than nearest-neighbor lookups. All correct. If you're building code RAG, read it.
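To make the preprocessing advice concrete, here is a minimal Python sketch of two of those steps: skipping vendored directories and chunking on AST boundaries while keeping the file path as metadata. It is an illustration under assumptions, not code from either article: the `Chunk` type, the `SKIP_DIRS` list, and both function names are hypothetical, and it only handles Python source via the standard `ast` module.

```python
import ast
from dataclasses import dataclass

# Directory names to skip during indexing (an assumed, illustrative list).
SKIP_DIRS = {"node_modules", ".git", "__pycache__"}


@dataclass
class Chunk:
    path: str        # file path kept as retrieval metadata
    name: str        # name of the function or class
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str        # exact source text of the definition


def should_index(path: str) -> bool:
    """Return False for files that live inside a skipped directory."""
    parts = path.replace("\\", "/").split("/")
    return not any(part in SKIP_DIRS for part in parts)


def chunk_python_source(path: str, source: str) -> list[Chunk]:
    """Emit one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(Chunk(path, node.name, node.lineno, node.end_lineno, text))
    return chunks
```

Because each chunk carries its path and line range, anything retrieved from it can later be cited back to a concrete location in the repo.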

But notice where the author lands at the very end:

"Developer trust depends on retrieval quality, honest uncertainty acknowledgment, and verifiable sources, not just algorithmic sophistication."

That's the part I want to pull on. Because everything before it — better chunking, better embeddings, repo maps — makes the model's answer more likely to be right. None of it makes the answer checkable. And once an LLM is writing your code, "more likely to be right" is not the same problem as "did it actually do what it just told me it did."

Good RAG gets the model better context. The model then makes an edit and reports back: "I added the /v1/refund route, set MAX_RETRIES to 5, only touched the parser, and the tests pass."

Every one of those is a factual claim about the repo as it exists right now. And here's the uncomfortable measurement: across 100 real SWE-agent runs on real GitHub issues, 30% of the attempts that failed the test suite still claimed they fixed the issue — "the method has been successfully added," "the issue has been resolved." Ground truth from the SWE-bench eval, claims judged by an LLM, reproducible.

Better retrieval doesn't fix that. A model with perfect context can still over-claim what it did. The hallucination moves from "wrong fact about the codebase" to "wrong fact about my own change" — which is harder to catch, because it sounds like a status update, not a guess.

The article's three closing themes (honest uncertainty, verifiable sources, no speculation) are exactly the design constraints I've been building truth around. But I'd argue you can't get them from a better RAG pipeline, because they're not retrieval properties. They're verification properties:

file:line.

So truth does the inverse of RAG. RAG retrieves context to help the model answer. truth retrieves evidence to check what the model claimed (against the real code, the working-tree git diff, recorded command runs, and logs) and returns Supported / Contradicted / Inconclusive, every verdict cited.

$ truth verify-turn "I added the /v1/refund endpoint, set MAX_RETRIES to 3, I only changed src/api.rs, and tests pass"

  ✓ Supported     I added the /v1/refund endpoint  (src/api.rs)
  ✗ Contradicted  set MAX_RETRIES to 3             (src/config.rs:1)
  ✗ Contradicted  I only changed src/api.rs        (src/api.rs)
  ✗ Contradicted  tests pass                       (recorded 2026-06-11 09:14 UTC)

An LLM only parses the sentence into a structured claim. The verdict comes from a deterministic engine.
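The split between parsing and judging can be sketched in a few lines of Python. This is not truth's implementation; it assumes the claim has already been parsed into structured arguments, and the `Verdict` type, `check_constant`, and `check_only_changed` are hypothetical names chosen for illustration. The point is that both checks are plain functions of repo state, so the same input always yields the same cited verdict.

```python
import re
from dataclasses import dataclass


@dataclass
class Verdict:
    status: str    # "Supported", "Contradicted", or "Inconclusive"
    evidence: str  # citation, e.g. "src/config.rs:1"


def check_constant(files: dict[str, str], name: str, expected: str) -> Verdict:
    """Check a claim of the form '<name> was set to <expected>'.

    `files` maps a path to its current contents. The first line that
    assigns `name` decides the verdict and is cited as path:line.
    """
    # Allow an optional type annotation before '=', and reject '=='.
    pattern = re.compile(rf"\b{re.escape(name)}\b[\w\s:]*=(?!=)\s*([^;\s]+)")
    for path in sorted(files):
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            match = pattern.search(line)
            if match:
                status = "Supported" if match.group(1) == expected else "Contradicted"
                return Verdict(status, f"{path}:{lineno}")
    # No assignment found: nothing to cite, so do not accuse.
    return Verdict("Inconclusive", "no assignment found")


def check_only_changed(changed_files: list[str], claimed: list[str]) -> Verdict:
    """Check a claim of the form 'I only changed <claimed files>'.

    `changed_files` is whatever the working-tree diff reports, for
    example the output lines of `git diff --name-only`.
    """
    extra = sorted(set(changed_files) - set(claimed))
    if extra:
        # Cite the first file that was changed but not mentioned.
        return Verdict("Contradicted", extra[0])
    return Verdict("Supported", ", ".join(sorted(claimed)))
```

A real engine would need language-aware parsing rather than a regular expression; the sketch only shows that the verdict step needs no model in the loop.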

"Honest uncertainty acknowledgment" has a sharp edge the article doesn't dwell on: a verifier that false-accuses gets uninstalled on day one. Crying wolf on a truthful agent is the one unforgivable failure.

So truth only ever says Contradicted on a structured binary fact, never on a noisy count (a text scan over-counts comments; a log window samples). "Nobody uses this," "updated all 4 call sites," "X is unused": those surface their evidence as Inconclusive, a suspicion to act on, not a verdict that blocks. And it's measured, not asserted: a labeled corpus of real agent over-claims holds the false-contradiction rate at 0 as a hard CI gate. When truth blocks, it means it.
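As a rough sketch of that policy and its gate, assuming a corpus where each case records a human label and the verifier's verdict: the claim kinds in `BINARY_KINDS`, the `LabeledCase` type, and the function names below are all hypothetical, and the actual corpus format is not described in this post.

```python
from dataclasses import dataclass

# Claim kinds treated as structured binary facts (an illustrative set).
BINARY_KINDS = {"constant_equals", "route_exists", "only_changed", "tests_pass"}


def final_verdict(kind: str, evidence_agrees: bool) -> str:
    """Map a claim kind and its evidence to a verdict.

    Noisy kinds (counts, "unused", "all call sites") never produce a
    blocking verdict; they always surface as Inconclusive.
    """
    if kind not in BINARY_KINDS:
        return "Inconclusive"
    return "Supported" if evidence_agrees else "Contradicted"


@dataclass
class LabeledCase:
    claim_is_true: bool  # human label from the corpus
    verdict: str         # what the verifier returned for this claim


def false_contradiction_rate(cases: list[LabeledCase]) -> float:
    """Share of truthful claims that the verifier marked Contradicted."""
    truthful = [c for c in cases if c.claim_is_true]
    if not truthful:
        return 0.0
    wrong = sum(1 for c in truthful if c.verdict == "Contradicted")
    return wrong / len(truthful)


def ci_gate(cases: list[LabeledCase]) -> bool:
    """Pass only when no truthful claim was contradicted."""
    return false_contradiction_rate(cases) == 0.0
```

The gate is deliberately asymmetric: a truthful claim marked Inconclusive costs some usefulness, while a truthful claim marked Contradicted fails the build.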

This is the same instinct the article has about retrieval noise — "if chunks are too large, retrieval becomes noisy" — applied to verdicts. A noisy verifier is as useless as noisy retrieval. Worse, because it erodes the exact trust it was supposed to provide.

The author is right that code RAG needs preprocessing, structure-awareness, and citations. I'd add one thing to the conclusion: those buy you a better-informed model, not a trustworthy one. When the model is editing your repo, the last mile isn't retrieval quality — it's whether you can independently confirm what it told you, with a citation, deterministically, and without it ever crying wolf.

truth is that last mile. It runs entirely locally (a single SQLite file in .truth/; your code never leaves the machine), exposes one MCP tool an agent calls on its own message before reporting done, and can be a hook the agent can't skip.

If you're building code RAG, your model's answers are about to get a lot better. Make them checkable too.
