Layered retrieval beats grep alone for LLM-generated engineering docs

A new empirical study found that layering three retrieval methods—typed discovery, semantic context, and file verification—achieved a 0.954 score for LLM-generated engineering artifacts, outperforming grep alone (0.918) and semantic search (0.720). The research, conducted on a production Kubernetes platform with three months of engineering history, also showed that Claude Sonnet with layered retrieval matched the performance of the more expensive Opus model at five times lower cost. The findings indicate that retrieval method composition matters more than model choice, though a minimum model capability floor exists below which even rich context fails.

3 min read · published May 26, 2026

Don't Choose Your Memory Tool — Layer Them.

An empirical study comparing retrieval methods for LLM-generated engineering artifacts (Architecture Decision Records). Tests 5 retrieval conditions + 3 model tiers on a production K8s engineering platform with 3 months of accumulated engineering history.

Layered retrieval (typed discovery → semantic context → file verification) scores 0.954 on a 5-dimension rubric, beating every individual method:

Condition                       Mean Score   Cost/ADR
A: No memory                    0.572        ~$1.00
B: Semantic search (Qdrant)     0.720        ~$1.50
C: Grep + file read             0.918        ~$1.80
D: Typed-fact retrieval only    0.650        ~$1.20
E: All three layered            0.954        ~$2.50
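To make the trade-off in the table easier to read, the short sketch below recomputes each condition's score gain and extra cost relative to the no-memory baseline, and checks that the layered condition beats the best single method. The scores and costs are copied from the table; the variable and function names (`RESULTS`, `compare`) are illustrative and not part of the benchmark repository.

```python
# Mean score and approximate cost per ADR, copied from the results table.
RESULTS = {
    "A": ("No memory", 0.572, 1.00),
    "B": ("Semantic search (Qdrant)", 0.720, 1.50),
    "C": ("Grep + file read", 0.918, 1.80),
    "D": ("Typed-fact retrieval only", 0.650, 1.20),
    "E": ("All three layered", 0.954, 2.50),
}


def compare(results, baseline="A"):
    """Return score gain and extra cost of each condition versus a baseline."""
    _, base_score, base_cost = results[baseline]
    return {
        key: {
            "gain": round(score - base_score, 3),
            "extra_cost": round(cost - base_cost, 2),
        }
        for key, (_, score, cost) in results.items()
    }


comparison = compare(RESULTS)

# The "layering beats every individual method" claim: E > max(B, C, D).
best_single = max(RESULTS[k][1] for k in "BCD")
layered_beats_all = RESULTS["E"][1] > best_single

# Cost of the layered workflow relative to grep + file read alone.
cost_ratio_vs_grep = RESULTS["E"][2] / RESULTS["C"][2]

for key, row in comparison.items():
    print(key, RESULTS[key][0], row)
print("E beats best single method:", layered_beats_all)
print(f"E costs {cost_ratio_vs_grep:.2f}x grep alone")
```

Because the costs in the table are approximate, the ratio is only indicative.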

Sonnet + layered retrieval (0.88) comes within 0.03 of Opus + layered (0.91) at roughly one-fifth the cost. Haiku fails on complex topics (0.35) despite rich context, so there is a minimum model capability floor.

- Retrieval methods compose super-linearly: E > max(B, C, D) because each layer catches errors the others introduce.
- Semantic search can hurt below baseline: it returns adjacent-but-wrong context that the LLM trusts.
- Extraction quality is the binding constraint: typed retrieval is only as good as what was extracted.
- Model matters less than retrieval: Sonnet+E ≈ Opus+E, but Haiku+E fails (capability floor between Haiku and Sonnet).

├── PAPER.md                    Full paper (3,700 words)
├── data/
│   ├── ground-truth/           5 real ADRs from production (gold standard)
│   ├── condition-a/            Generated with no memory
│   ├── condition-b/            Generated with semantic search only
│   ├── condition-c/            Generated with grep + file read
│   ├── condition-d/            Generated with typed memory tools only
│   ├── condition-e/            Generated with all three layered (Opus)
│   ├── condition-e-sonnet/     Generated with layered retrieval (Sonnet)
│   └── condition-e-haiku/      Generated with layered retrieval (Haiku)
├── scores/                     23 JSON score files (per-claim decomposition)
├── rubric/
│   └── locked-rubric-v1.md     Immutable scoring rubric (5 dimensions)
├── scripts/
│   └── score_with_gpt4o.py     GPT-4o dual-judge scoring script
├── calibration-manifest.json   15 calibration artifacts
└── LICENSE                     CC-BY-4.0

- Rubric: 5 dimensions (technical correctness, citation, completeness, conciseness, pattern adoption), locked per RULERS methodology (arXiv 2601.08654)
- Judge: Claude Opus 4.7 (primary) + GPT-4o (dual-judge validation, 100% rank agreement on top condition)
- Isolation: each condition runs in a fresh LLM session with only the tools that condition allows
- Evidence trail: every score JSON includes per-claim reasoning explaining why each score was given
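One plausible reading of "rank agreement on top condition" is the fraction of topics on which both judges place the same condition first. The sketch below computes that under this assumption. It does not read the repository's score files, whose JSON schema is not described here, and every number in it is a made-up placeholder rather than a benchmark result. The function names are hypothetical.

```python
def top_condition(scores):
    """Return the condition label with the highest score."""
    return max(scores, key=scores.get)


def top_agreement_rate(primary_runs, secondary_runs):
    """Fraction of topics where both judges pick the same top condition.

    Each argument maps topic -> {condition label -> score}.
    """
    shared = primary_runs.keys() & secondary_runs.keys()
    if not shared:
        raise ValueError("no topics scored by both judges")
    hits = sum(
        top_condition(primary_runs[t]) == top_condition(secondary_runs[t])
        for t in shared
    )
    return hits / len(shared)


# PLACEHOLDER scores for illustration only; not taken from the study.
primary = {
    "topic-1": {"C": 0.90, "E": 0.95},
    "topic-2": {"C": 0.92, "E": 0.96},
}
secondary = {
    "topic-1": {"C": 0.85, "E": 0.91},
    "topic-2": {"C": 0.88, "E": 0.93},
}

rate = top_agreement_rate(primary, secondary)
print(f"top-condition agreement: {rate:.0%}")
```

Agreement on the top condition is a weaker check than agreement on the full ranking or on absolute scores, which is worth keeping in mind when reading the 100% figure.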

Step 1 — DISCOVERY (typed memory)
  "What decisions/problems exist about this topic?"
  → recall_decisions(topic=X), find_problems(topic=X)

Step 2 — CONTEXT (semantic search)
  "What else is related?"
  → auto_search_vault(query=X)

Step 3 — VERIFICATION (file access)
  "Do the facts check out against source?"
  → grep + read the actual files
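The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the Rootweaver implementation: the tool names `recall_decisions`, `find_problems`, and `auto_search_vault` come from the workflow above, but their signatures and return types are assumptions, so they are passed in as plain callables. The `verify_in_files` helper, the `FILES` dictionary, and the file path are hypothetical stand-ins for grep plus a file read.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    facts: list = field(default_factory=list)       # step 1: typed discovery
    related: list = field(default_factory=list)     # step 2: semantic context
    verified: list = field(default_factory=list)    # step 3: confirmed in files
    unverified: list = field(default_factory=list)  # step 3: not found in files


def layered_retrieve(topic, recall_decisions, find_problems,
                     auto_search_vault, verify_in_files):
    """Run discovery, context, and verification in order for one topic."""
    ev = Evidence()
    # Step 1: typed memory answers "what decisions/problems exist?"
    ev.facts = list(recall_decisions(topic)) + list(find_problems(topic))
    # Step 2: semantic search adds related material.
    ev.related = list(auto_search_vault(topic))
    # Step 3: check every candidate claim against the source files.
    for claim in ev.facts + ev.related:
        target = ev.verified if verify_in_files(claim) else ev.unverified
        target.append(claim)
    return ev


# Hypothetical in-memory "repository" standing in for real files.
FILES = {"docs/adr-001.md": "We chose K3s for the single-node cluster."}


def verify_in_files(claim):
    """Stand-in for grep + read: true if any file contains the claim text."""
    return any(claim.lower() in body.lower() for body in FILES.values())


evidence = layered_retrieve(
    "cluster",
    recall_decisions=lambda topic: ["chose K3s"],
    find_problems=lambda topic: [],
    auto_search_vault=lambda query: ["chose Nomad"],
    verify_in_files=verify_in_files,
)
print("verified:", evidence.verified)
print("unverified:", evidence.unverified)
```

In this toy run the semantic layer returns an adjacent-but-wrong claim, and the verification step is what keeps it out of the verified set, which is the failure mode the layering is meant to catch.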

Skip layers only for trivial lookups. The full workflow costs more than grep alone (about $2.50 versus $1.80 per ADR in the table above, roughly 40% more) but consistently produces better output.

Built on Rootweaver — a typed engineering-memory platform running on single-node K3s (RTX 4080). 248 sessions, 2,748 typed facts, 6,135 artifacts, 376 v2-quality enriched facts across 3 months of real engineering work.

Duffy, R. G. (2026). Don't Choose Your Memory Tool — Layer Them: How Typed
Discovery + Semantic Context + File Verification Produces Near-Human Engineering
Artifacts. https://github.com/rduffyuk/engineering-memory-benchmark

Ryan G. Duffy — SRE, AI-orchestration practitioner

CC-BY-4.0 — use freely with attribution.
