{"slug": "layered-retrieval-beats-grep-alone-for-llm-generated-engineering-docs", "title": "Layered retrieval beats grep alone for LLM-generated engineering docs", "summary": "A new empirical study found that layering three retrieval methods—typed discovery, semantic context, and file verification—achieved a 0.954 score for LLM-generated engineering artifacts, outperforming grep alone (0.918) and semantic search (0.720). The research, conducted on a production Kubernetes platform with three months of engineering history, also showed that Claude Sonnet with layered retrieval matched the performance of the more expensive Opus model at five times lower cost. The findings indicate that retrieval method composition matters more than model choice, though a minimum model capability floor exists below which even rich context fails.", "body_md": "**Don't Choose Your Memory Tool — Layer Them.**\n\nAn empirical study comparing retrieval methods for LLM-generated engineering artifacts (Architecture Decision Records). It tests 5 retrieval conditions and 3 model tiers on a production K8s engineering platform with 3 months of accumulated engineering history.\n\nLayered retrieval (typed discovery → semantic context → file verification) scores **0.954** on a 5-dimension rubric, beating every individual method:\n\n| Condition | Mean Score | Cost/ADR |\n|---|---|---|\n| A — No memory | 0.572 | ~$1.00 |\n| B — Semantic search (Qdrant) | 0.720 | ~$1.50 |\n| C — Grep + file read | 0.918 | ~$1.80 |\n| D — Typed-fact retrieval only | 0.650 | ~$1.20 |\n| E — All three layered | 0.954 | ~$2.50 |\n\nSonnet + layered retrieval (0.88) nearly matches Opus + layered (0.91) at one-fifth the cost. 
Haiku fails on complex topics (0.35) despite rich context, which points to a minimum model capability floor.\n\n- **Retrieval methods compose super-linearly**: E > max(B,C,D) because each layer catches errors the others introduce\n- **Semantic search can hurt below baseline**: returns adjacent-but-wrong context that the LLM trusts\n- **Extraction quality is the binding constraint**: typed retrieval is only as good as what was extracted\n- **Model matters less than retrieval**: Sonnet+E ≈ Opus+E, but Haiku+E fails (capability floor between Haiku and Sonnet)\n\n```\n├── PAPER.md                    Full paper (3,700 words)\n├── data/\n│   ├── ground-truth/           5 real ADRs from production (gold standard)\n│   ├── condition-a/            Generated with no memory\n│   ├── condition-b/            Generated with semantic search only\n│   ├── condition-c/            Generated with grep + file read\n│   ├── condition-d/            Generated with typed memory tools only\n│   ├── condition-e/            Generated with all three layered (Opus)\n│   ├── condition-e-sonnet/     Generated with layered retrieval (Sonnet)\n│   └── condition-e-haiku/      Generated with layered retrieval (Haiku)\n├── scores/                     23 JSON score files (per-claim decomposition)\n├── rubric/\n│   └── locked-rubric-v1.md     Immutable scoring rubric (5 dimensions)\n├── scripts/\n│   └── score_with_gpt4o.py     GPT-4o dual-judge scoring script\n├── calibration-manifest.json   15 calibration artifacts\n└── LICENSE                     CC-BY-4.0\n```\n\n- **Rubric**: 5 dimensions (technical correctness, citation, completeness, conciseness, pattern adoption), locked per RULERS methodology (arXiv 2601.08654)\n- **Judge**: Claude Opus 4.7 (primary) + GPT-4o (dual-judge validation, 100% rank agreement on top condition)\n- **Isolation**: Each condition runs in a fresh LLM session with only the tools that condition allows\n- **Evidence trail**: Every score JSON includes per-claim reasoning explaining why each score was 
given\n\n```\nStep 1 — DISCOVERY (typed memory)\n  \"What decisions/problems exist about this topic?\"\n  → recall_decisions(topic=X), find_problems(topic=X)\n\nStep 2 — CONTEXT (semantic search)\n  \"What else is related?\"\n  → auto_search_vault(query=X)\n\nStep 3 — VERIFICATION (file access)\n  \"Do the facts check out against source?\"\n  → grep + read the actual files\n```\n\nSkip layers only for trivial lookups. The full workflow costs 5% more than grep alone but consistently produces better output.\n\nBuilt on [Rootweaver](https://gitlab.com/ryanduffy.uk/rootweaver-platform) — a typed engineering-memory platform running on single-node K3s (RTX 4080). 248 sessions, 2,748 typed facts, 6,135 artifacts, 376 v2-quality enriched facts across 3 months of real engineering work.\n\n```\nDuffy, R. G. (2026). Don't Choose Your Memory Tool — Layer Them: How Typed\nDiscovery + Semantic Context + File Verification Produces Near-Human Engineering\nArtifacts. https://github.com/rduffyuk/engineering-memory-benchmark\n```\n\n**Ryan G. 
Duffy**, SRE, AI-orchestration practitioner\n\n- ORCID: [0009-0009-6464-0617](https://orcid.org/0009-0009-6464-0617)\n- Blog: [rduffy.uk](https://blog.rduffy.uk)\n- Email: [rduffyuk@gmail.com](mailto:rduffyuk@gmail.com)\n\nCC-BY-4.0: use freely with attribution.", "url": "https://wpnews.pro/news/layered-retrieval-beats-grep-alone-for-llm-generated-engineering-docs", "canonical_source": "https://github.com/rduffyuk/engineering-memory-benchmark", "published_at": "2026-05-26 08:09:33+00:00", "updated_at": "2026-05-26 08:39:51.690098+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-tools", "ai-infrastructure", "mlops"], "entities": ["Qdrant", "Sonnet", "Opus", "Haiku"], "alternates": {"html": "https://wpnews.pro/news/layered-retrieval-beats-grep-alone-for-llm-generated-engineering-docs", "markdown": "https://wpnews.pro/news/layered-retrieval-beats-grep-alone-for-llm-generated-engineering-docs.md", "text": "https://wpnews.pro/news/layered-retrieval-beats-grep-alone-for-llm-generated-engineering-docs.txt", "jsonld": "https://wpnews.pro/news/layered-retrieval-beats-grep-alone-for-llm-generated-engineering-docs.jsonld"}}