RAG for codebases is hard. Trusting the answer is harder.

wpnews.pro

cd /news/large-language-models/rag-for-codebases-is-hard-trusting-t… · home › topics › large-language-models › article

[ARTICLE · art-43796] src=dev.to ↗ pub=2026-06-29T17:16Z topic=large-language-models verified=true sentiment=· neutral

RAG for codebases is hard. Trusting the answer is harder.

A developer argues that retrieval-augmented generation (RAG) for codebases improves context but not verifiability, citing that 30% of failed SWE-agent runs still claimed success. They introduce 'truth', a verification tool that checks model claims against real code and git diffs, returning Supported, Contradicted, or Inconclusive verdicts with cited evidence.

read4 min views1 publishedJun 29, 2026

There's a good post making the rounds — "RAG for Codebases Is Harder Than It Looks" — on why naive retrieval falls apart on code. Filtering out node_modules

, chunking on AST boundaries instead of token counts, preserving file-path metadata, treating architecture questions as graph traversals rather than nearest-neighbor lookups. All correct. If you're building code RAG, read it.

But notice where the author lands at the very end:

Developer trust depends on retrieval quality,

honest uncertainty acknowledgment, and verifiable sources— not just algorithmic sophistication.

That's the part I want to pull on. Because everything before it — better chunking, better embeddings, repo maps — makes the model's answer more likely to be right. None of it makes the answer checkable. And once an LLM is writing your code, "more likely to be right" is not the same problem as "did it actually do what it just told me it did."

Good RAG gets the model better context. The model then makes an edit and reports back: "I added the /v1/refund route, set MAX_RETRIES to 5, only touched the parser, and the tests pass."

Every one of those is a factual claim about the repo as it exists right now. And here's the uncomfortable measurement: across 100 real SWE-agent runs on real GitHub issues, 30% of the attempts that failed the test suite still claimed they fixed the issue — "the method has been successfully added," "the issue has been resolved." Ground truth from the SWE-bench eval, claims judged by an LLM, reproducible.

Better retrieval doesn't fix that. A model with perfect context can still over-claim what it did. The hallucination moves from "wrong fact about the codebase" to "wrong fact about my own change" — which is harder to catch, because it sounds like a status update, not a guess.

The article's three closing words — honest uncertainty, verifiable sources, no speculation — are exactly the design constraints I've been building truth

around. But I'd argue you can't get them from a better RAG pipeline, because they're not retrieval properties. They're verification properties:

file:line

.So truth

does the inverse of RAG. RAG retrieves context to help the model answer. truth

retrieves evidence to check what the model claimed — against the real code, the working-tree git diff, recorded command runs, and logs — and returns Supported / Contradicted / Refused, every verdict cited.

$ truth verify-turn "I added the /v1/refund endpoint, set MAX_RETRIES to 3, I only changed src/api.rs, and tests pass"

  ✓ Supported     I added the /v1/refund endpoint  (src/api.rs)
  ✗ Contradicted  set MAX_RETRIES to 3             (src/config.rs:1)
  ✗ Contradicted  I only changed src/api.rs        (src/api.rs)
  ✗ Contradicted  tests pass                       (recorded 2026-06-11 09:14 UTC)

An LLM only parses the sentence into a structured claim. The verdict comes from a deterministic engine.

"Honest uncertainty acknowledgment" has a sharp edge the article doesn't dwell on: a verifier that false-accuses gets uninstalled on day one. Crying wolf on a truthful agent is the one unforgivable failure.

So truth

only ever says Contradicted

on a structured binary fact — never on a noisy count (a text scan over-counts comments; a log window samples). "Nobody uses this," "updated all 4 call sites," "X is unused" — those surface their evidence as Inconclusive: a suspicion to act on, not a verdict that blocks. And it's measured, not asserted: a labeled corpus of real agent over-claims holds the false-contradiction rate at 0 as a hard CI gate. When truth

blocks, it means it.

This is the same instinct the article has about retrieval noise — "if chunks are too large, retrieval becomes noisy" — applied to verdicts. A noisy verifier is as useless as noisy retrieval. Worse, because it erodes the exact trust it was supposed to provide.

The author is right that code RAG needs preprocessing, structure-awareness, and citations. I'd add one thing to the conclusion: those buy you a better-informed model, not a trustworthy one. When the model is editing your repo, the last mile isn't retrieval quality — it's whether you can independently confirm what it told you, with a citation, deterministically, and without it ever crying wolf.

truth

is that last mile. It runs entirely locally (a single SQLite file in .truth/

; your code never leaves the machine), exposes one MCP tool an agent calls on its own message before reporting done, and can be a hook the agent can't skip.

If you're building code RAG, your model's answers are about to get a lot better. Make them checkable too.

source & further reading

dev.to — original article AccessiBe Alternative: Why I Preferred TestGrid for Automated Accessibility Testing What AutoGPT ships in 2026: a low-code platform for continuous AI agents AI took me somewhere new, and proved me wrong

~/api · this article 200

$curl api.wpnews.pro/v1/news/rag-for-codebases-is-har…

Read original on dev.to → dev.to/blasrodri/rag-for-codebases-is-hard-trust…

mentioned entities

SWE-bench

truth

metadata

slugrag-for-codebases-is-hard-trusting-the-answer-is-harder

topic#large-language-models

secondary2 topics

sentimentneutral

canonicaldev.to

navigation

← prevThe stale eval fixture that pass…

next →OpenAI teases Codex-branded hard…

── more in #large-language-models 4 stories · sorted by recency

techcrunch.com · 29 Jun · #large-language-models

Cursor now has a mobile app for guiding your coding agent on the go

github.com · 29 Jun · #large-language-models

Relay – open-source coding agent for non-mainstream/Chinese LLM providers

cursor.com · 29 Jun · #large-language-models

Build from Anywhere with Cursor for iOS

github.blog · 29 Jun · #large-language-models

Claude Opus 4.8 (fast mode) is now in preview for GitHub Copilot

── more on @swe-bench 3 stories trending now

wpnews · 28 May · #ai-startups

[AINews] Cognition raises $1B in $26B Series D

wpnews · 5 Jun · #ai-agents

Miasma Worm Targets AI Coding Agents via GitHub Repos

wpnews · 28 May · #ai-startups

The Niche SaaS Opportunity Map 2026: Highly Demanded Subscribed Categories Beyond Mainstream

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required