Layered retrieval beats grep alone for LLM-generated engineering docs


Source: github.com · Published: 2026-05-26T08:09Z · Topic: large-language-models


A new empirical study found that layering three retrieval methods—typed discovery, semantic context, and file verification—achieved a 0.954 score for LLM-generated engineering artifacts, outperforming grep alone (0.918) and semantic search (0.720). The research, conducted on a production Kubernetes platform with three months of engineering history, also showed that Claude Sonnet with layered retrieval matched the performance of the more expensive Opus model at five times lower cost. The findings indicate that retrieval method composition matters more than model choice, though a minimum model capability floor exists below which even rich context fails.

3 min read · Published May 26, 2026

Don't Choose Your Memory Tool — Layer Them.

An empirical study comparing retrieval methods for LLM-generated engineering artifacts (Architecture Decision Records). Tests 5 retrieval conditions + 3 model tiers on a production K8s engineering platform with 3 months of accumulated engineering history.

Layered retrieval (typed discovery → semantic context → file verification) scores 0.954 on a 5-dimension rubric, beating every individual method:

Condition	Mean Score	Cost/ADR
A — No memory	0.572	~$1.00
B — Semantic search (Qdrant)	0.720	~$1.50
C — Grep + file read	0.918	~$1.80
D — Typed-fact retrieval only	0.650	~$1.20
E — All three layered	0.954	~$2.50
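As a rough way to read the table, the sketch below computes each condition's score gain and extra cost relative to another condition. The scores and costs are copied from the table; the helper names (`gain_over`, `best_single_method`) are illustrative and not part of the benchmark repository.

```python
# Mean score and approximate cost per ADR, copied from the table above.
CONDITIONS = {
    "A": ("No memory", 0.572, 1.00),
    "B": ("Semantic search (Qdrant)", 0.720, 1.50),
    "C": ("Grep + file read", 0.918, 1.80),
    "D": ("Typed-fact retrieval only", 0.650, 1.20),
    "E": ("All three layered", 0.954, 2.50),
}


def gain_over(key, baseline="A"):
    """Return (score gain, extra cost in dollars) of `key` relative to `baseline`."""
    _, score, cost = CONDITIONS[key]
    _, base_score, base_cost = CONDITIONS[baseline]
    return round(score - base_score, 3), round(cost - base_cost, 2)


def best_single_method():
    """Return the label of the best-scoring single-method condition (B, C, or D)."""
    return max(("B", "C", "D"), key=lambda k: CONDITIONS[k][1])


if __name__ == "__main__":
    for key in "BCDE":
        gain, extra = gain_over(key)
        print(f"{key}: +{gain:.3f} score for ~${extra:.2f} extra per ADR vs. no memory")
    best = best_single_method()
    gain, extra = gain_over("E", baseline=best)
    print(f"E vs. {best}: +{gain:.3f} score for ~${extra:.2f} extra per ADR")
```

On these figures, grep + file read is the strongest single method, and layering adds 0.036 to the mean score for roughly $0.70 more per ADR.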

Sonnet + layered retrieval (0.88) matches Opus + layered (0.91) at 5x less cost. Haiku fails on complex topics (0.35) despite rich context — there's a minimum model capability floor.

- **Retrieval methods compose super-linearly**: E > max(B, C, D), because each layer catches errors the others introduce.
- **Semantic search can fall below the no-memory baseline**: it returns adjacent-but-wrong context that the LLM trusts.
- **Extraction quality is the binding constraint**: typed retrieval is only as good as what was extracted.
- **Model matters less than retrieval**: Sonnet+E ≈ Opus+E, but Haiku+E fails (the capability floor sits between Haiku and Sonnet).

├── PAPER.md                    Full paper (3,700 words)
├── data/
│   ├── ground-truth/           5 real ADRs from production (gold standard)
│   ├── condition-a/            Generated with no memory
│   ├── condition-b/            Generated with semantic search only
│   ├── condition-c/            Generated with grep + file read
│   ├── condition-d/            Generated with typed memory tools only
│   ├── condition-e/            Generated with all three layered (Opus)
│   ├── condition-e-sonnet/     Generated with layered retrieval (Sonnet)
│   └── condition-e-haiku/      Generated with layered retrieval (Haiku)
├── scores/                     23 JSON score files (per-claim decomposition)
├── rubric/
│   └── locked-rubric-v1.md     Immutable scoring rubric (5 dimensions)
├── scripts/
│   └── score_with_gpt4o.py     GPT-4o dual-judge scoring script
├── calibration-manifest.json   15 calibration artifacts
└── LICENSE                     CC-BY-4.0

- **Rubric**: 5 dimensions (technical correctness, citation, completeness, conciseness, pattern adoption), locked per RULERS methodology (arXiv 2601.08654)
- **Judge**: Claude Opus 4.7 (primary) + GPT-4o (dual-judge validation, 100% rank agreement on top condition)
- **Isolation**: Each condition runs in a fresh LLM session with only the tools that condition allows
- **Evidence trail**: Every score JSON includes per-claim reasoning explaining why each score was given
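The "rank agreement on top condition" check can be pictured as comparing which condition each judge ranks first. The sketch below is one minimal way to express that, not the repository's scoring script. The primary values reuse the mean scores reported earlier (treating them as the primary judge's numbers is an assumption), and the secondary values are made-up placeholders, not GPT-4o results.

```python
def top_condition(scores):
    """Return the condition label with the highest mean score."""
    return max(scores, key=scores.get)


def judges_agree_on_top(primary, secondary):
    """True when both judges rank the same condition first."""
    return top_condition(primary) == top_condition(secondary)


# Mean scores from the results table (assumed here to be the primary judge's).
primary = {"A": 0.572, "B": 0.720, "C": 0.918, "D": 0.650, "E": 0.954}

# HYPOTHETICAL placeholder values for a second judge, for illustration only.
secondary = {"A": 0.60, "B": 0.70, "C": 0.90, "D": 0.66, "E": 0.93}

if __name__ == "__main__":
    print("Top condition (primary):", top_condition(primary))
    print("Judges agree on top condition:", judges_agree_on_top(primary, secondary))
```

Agreement on the top condition is a weaker check than agreement on the full ranking: two judges can pick the same winner while ordering the remaining conditions differently.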

Step 1 — DISCOVERY (typed memory)
  "What decisions/problems exist about this topic?"
  → recall_decisions(topic=X), find_problems(topic=X)

Step 2 — CONTEXT (semantic search)
  "What else is related?"
  → auto_search_vault(query=X)

Step 3 — VERIFICATION (file access)
  "Do the facts check out against source?"
  → grep + read the actual files
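The three steps above can be sketched as a small pipeline. The tool names `recall_decisions`, `find_problems`, and `auto_search_vault` come from the workflow, but their signatures, return types, and everything else here (the `Fact` record, the in-memory `FILES`, `FACTS`, and `NOTES` toy data, the keyword match standing in for vector search) are assumptions made for illustration, not Rootweaver's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    kind: str         # "decision" or "problem"
    topic: str
    claim: str        # what the typed memory asserts
    source_path: str  # file the claim was extracted from
    evidence: str     # literal text expected to appear in that file


# Invented toy data standing in for the repository, typed memory, and vault.
FILES = {
    "docs/adr-001.md": "We chose Redis for the session cache.",
    "docs/incident-07.md": "Cache eviction caused login failures.",
}
FACTS = [
    Fact("decision", "cache", "Session cache uses Redis", "docs/adr-001.md", "Redis"),
    # A stale extraction: the file no longer supports this claim.
    Fact("decision", "cache", "Session cache uses Memcached", "docs/adr-001.md", "Memcached"),
    Fact("problem", "cache", "Eviction caused login failures", "docs/incident-07.md", "login failures"),
]
NOTES = [
    "Cache TTLs were discussed alongside the login rework.",
    "Ingress routing notes.",
]


def recall_decisions(topic):
    return [f for f in FACTS if f.kind == "decision" and f.topic == topic]


def find_problems(topic):
    return [f for f in FACTS if f.kind == "problem" and f.topic == topic]


def auto_search_vault(query):
    # Keyword match as a stand-in for semantic (vector) search.
    return [n for n in NOTES if query.lower() in n.lower()]


def grep_file(path, needle):
    # Stand-in for grep + file read against the real source tree.
    return needle in FILES.get(path, "")


def layered_retrieve(topic):
    # Step 1, discovery: typed facts about the topic.
    candidates = recall_decisions(topic) + find_problems(topic)
    # Step 2, context: loosely related material.
    related = auto_search_vault(topic)
    # Step 3, verification: keep only facts the source files still support.
    verified = [f for f in candidates if grep_file(f.source_path, f.evidence)]
    rejected = [f for f in candidates if f not in verified]
    return {"verified": verified, "rejected": rejected, "related": related}


if __name__ == "__main__":
    result = layered_retrieve("cache")
    for fact in result["rejected"]:
        print("Dropped unverified claim:", fact.claim)
```

In this toy run the stale "Memcached" claim is discovered in step 1 but rejected in step 3, which mirrors the idea that each layer catches errors another layer introduces.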

Skip layers only for trivial lookups. The full workflow costs more than grep alone (~$2.50 vs. ~$1.80 per ADR in the results table) but consistently produces better output.

Built on Rootweaver — a typed engineering-memory platform running on single-node K3s (RTX 4080). 248 sessions, 2,748 typed facts, 6,135 artifacts, 376 v2-quality enriched facts across 3 months of real engineering work.

Duffy, R. G. (2026). Don't Choose Your Memory Tool — Layer Them: How Typed
Discovery + Semantic Context + File Verification Produces Near-Human Engineering
Artifacts. https://github.com/rduffyuk/engineering-memory-benchmark

Ryan G. Duffy — SRE, AI-orchestration practitioner

ORCID: 0009-0009-6464-0617 - Blog: rduffy.uk - Email: rduffyuk@gmail.com

CC-BY-4.0 — use freely with attribution.


