{"slug": "evidence-saturation-k-retrieval-depth-should-be-calibrated-not-guessed", "title": "Evidence saturation k*: retrieval depth should be calibrated, not guessed", "summary": "A new analysis of retrieval-augmented generation (RAG) systems argues that the optimal retrieval depth k* should be calibrated per pipeline stage and reliability axis, not guessed as a single hyperparameter. The author proposes separating k into candidate_k, rerank_k, injected_k, and effective_k, and testing three regions: evidence-insufficient, evidence-sufficient, and over-context contamination. This approach builds on the Sufficient Context framework to distinguish retrieval failures from model failures.", "body_md": "I thought about how to make this easier to test:\n\nShort version: I think the most important part is not “what is the best `k`?”, but “which `k` are we talking about, at which stage of the RAG pipeline, and for which reliability axis?”\n\nI have seen nearby patterns in adaptive retrieval depth, sufficient-context evaluation, irrelevant-context robustness, RAG diagnostic metrics, and long-context position effects. I would not treat those as the same claim, though. The interesting part of your framing, to me, is that `k*` is not just a retrieval hyperparameter. 
It is a calibration target for final evidence depth under a chosen reliability axis.\n\nThe smallest useful next step may be to separate `top_k` into several pipeline-level variables.\n\nThe rest of this comment is just a proposed testing map.\n\nI would avoid using one overloaded `top_k` term.\n\n| Name | Meaning |\n|---|---|\n| `candidate_k` | number of candidates returned by the first retriever |\n| `rerank_k` / `top_n` | number of candidates kept after reranking or filtering |\n| `injected_k` | number of evidence fragments actually inserted into the LLM prompt |\n| `effective_k` | number of distinct/useful evidence units after deduplication, compression, summarization, or prompt packing |\n\nI would interpret your `k*` mostly as an `injected_k` / `effective_k` question, not as the retriever’s initial `top_k`.\n\nThis matters because real RAG stacks often separate retrieval, reranking, filtering, compression, and final prompt construction. For example, LlamaIndex has retrieval-time [node postprocessors](https://developers.llamaindex.ai/python/framework/module_guides/querying/node_postprocessors/) and rerankers where `top_n` can mean the number of nodes returned after reranking, not the number of original candidates. Open WebUI also has practical discussion around `RAG_TOP_K` and `RAG_TOP_K_RERANKER`, which is a useful reminder that “top k” can mean different things in different layers: [discussion](https://github.com/open-webui/open-webui/discussions/14428), [env docs](https://docs.openwebui.com/reference/env-configuration/).\n\nSo before interpreting any `k*` curve, I would log the actual fragment IDs and fragment types that reach the model.\n\nI would not look for one universal `k*` at first.\n\nI would separate at least three regions:\n\n| Region | Question |\n|---|---|\n| evidence-insufficient | are required facts still missing? |\n| evidence-sufficient | do the retrieved snippets contain enough information to answer? 
|\n| over-context / contamination region | does additional context start shifting wording, assumptions, framing, source emphasis, or state? |\n\nThat distinction seems important because a model can be correct over a range of `k` values while still changing some other reliability axis.\n\nThis connects well with the “sufficient context” line of work. The [Sufficient Context](https://github.com/hljoren/sufficientcontext) project tests whether retrieved snippets alone could plausibly answer the question, and the paper [Sufficient Context: A New Lens on Retrieval Augmented Generation Systems](https://arxiv.org/abs/2411.06037) uses that idea to separate “retrieval did not provide enough information” from “the model had enough information but failed to use it.”\n\nFor this thread, I would phrase it as a minimal sufficiency-first test:\n\nFirst measure the smallest `injected_k` where the context becomes sufficient; then measure whether correctness, groundedness, or contamination/framing leakage changes after that point.\n\nOne implementation trap: the condition label is not enough.\n\nA row labeled `gold + distractor` should be checked against the actual injected fragment IDs and fragment kinds. 
Otherwise the contamination curve can accidentally measure data assembly mistakes.\n\nI would add a validation step like this before any judge or generator result is interpreted:\n\n| Intended condition | Required validation |\n|---|---|\n| `gold_only` | only required evidence is injected |\n| `gold_plus_random` | gold evidence plus at least one random irrelevant fragment |\n| `gold_plus_semantic` | gold evidence plus a semantically related non-answer fragment |\n| `gold_plus_duplicate` | gold evidence plus a duplicate or near-duplicate |\n| `gold_plus_stale` | gold evidence plus temporally stale evidence |\n| `gold_plus_conflict` | gold evidence plus a genuinely conflicting fragment |\n| `gold_plus_frame_shift` | gold evidence plus a fragment that changes framing but not necessarily the answer |\n| `gold_plus_adversarial` | gold evidence plus instruction-like or adversarial text |\n\nIf this validation fails, I would not interpret the contamination score yet.\n\nWhen `k` changes, the decisive evidence may move inside the prompt.\n\nSo a `k` effect can be mixed with a prompt-position effect.\n\nFor every run, I would log:\n\n```\ndecisive_evidence_rank_in_retrieval\ndecisive_evidence_rank_after_rerank\ndecisive_evidence_position_in_prompt\ndecisive_evidence_token_start\ndecisive_evidence_token_end\n```\n\nThis is not only theoretical. [Lost in the Middle](https://arxiv.org/abs/2307.03172) showed that models can be sensitive to where relevant information appears in long context. 
LlamaIndex also has a [LongContextReorder](https://developers.llamaindex.ai/python/examples/node_postprocessor/longcontextreorder/) postprocessor, which is a practical sign that node order can matter when a large top-k is placed into context.\n\nSo if contamination changes at larger `k`, I would check whether the decisive evidence was also pushed into a worse prompt position.\n\nI think your contamination/framing axis is adjacent to existing RAG metrics, but probably not identical to any one of them.\n\nExisting tools already give useful vocabulary. For example, [RAGAS metrics](https://docs.ragas.io/en/v0.1.21/concepts/metrics/) include faithfulness, answer relevancy, context recall, context precision, and context utilization. RAGAS also has [Noise Sensitivity](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/noise_sensitivity/), which measures how often a system makes errors when using relevant or irrelevant retrieved documents.\n\nThat is related, but I would not collapse framing leakage into noise sensitivity.\n\nA useful separation might be:\n\n| Axis | What it measures |\n|---|---|\n| correctness | final answer is factually right |\n| sufficiency | context contains enough evidence to answer |\n| context recall | necessary evidence was retrieved |\n| context precision | irrelevant evidence is not dominating the context |\n| context utilization | answer-relevant context is ranked/used well |\n| faithfulness / groundedness | answer is supported by supplied context |\n| noise sensitivity | retrieved docs cause incorrect responses |\n| contamination / framing leakage | extra context shifts framing, assumptions, vocabulary, source emphasis, or state even if correctness is flat |\n\nThe last row is the hard part.\n\nI would treat it as an additional rubric, not as a replacement for correctness or groundedness.\n\nIf the contamination score is judged by an LLM or by a heuristic 
rubric, I would test that evaluator before trusting the `k` curve.\n\nOtherwise, the curve may partly measure the evaluator rather than the pipeline.\n\nA small rubric test suite could include:\n\n| Case | Expected behavior |\n|---|---|\n| gold only | no contamination |\n| gold + neutral filler | no contamination |\n| gold + random irrelevant | usually no framing leakage unless used |\n| gold + semantic distractor | possible distractor sensitivity |\n| gold + duplicate | should not be called framing leakage by default |\n| gold + stale evidence | stale-memory / temporal-validity issue |\n| gold + conflict | conflict-handling issue |\n| gold + frame-shifting evidence | likely framing leakage candidate |\n| gold + adversarial instruction-like text | adversarial stress test, separate from ordinary leakage |\n\nOutside the details, my proposed minimal protocol would be:\n\n`injected_k`.\n\nIf you wanted to make this easier for others to reproduce, I would add these in roughly this order:\n\n| Effort | Addition |\n|---|---|\n| low | define whether `k` means `candidate_k`, `rerank_k`, `injected_k`, or `effective_k` |\n| low | log actual injected fragment IDs and fragment kinds |\n| low | log decisive evidence position in the prompt |\n| medium | add retrieval-only sufficiency sweep |\n| medium | add condition validation before generation/judging |\n| medium | add unit tests for the contamination/framing rubric |\n| medium | compare fixed-k with one adaptive-k baseline |\n| higher | add HotpotQA-style public multi-hop preview |\n| higher | compare raw chunks vs reranked vs compressed/filtered evidence |\n\nMy main takeaway is:\n\nThe valuable target is probably not a universal `k*`, but a reproducible calibration procedure: for a given corpus, chunking scheme, retriever, reranker, prompt builder, model, and reliability axis, what final evidence depth is sufficient, and when do extra fragments start changing another axis?\n\nThat would make the idea easier to test, compare, and falsify without 
forcing every reader to guess which part of the RAG pipeline the reported `k*` belongs to.", "url": "https://wpnews.pro/news/evidence-saturation-k-retrieval-depth-should-be-calibrated-not-guessed", "canonical_source": "https://discuss.huggingface.co/t/evidence-saturation-k-retrieval-depth-should-be-calibrated-not-guessed/177363#post_2", "published_at": "2026-07-03 23:26:08+00:00", "updated_at": "2026-07-03 23:58:39.362136+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-infrastructure", "natural-language-processing", "ai-tools"], "entities": ["LlamaIndex", "Open WebUI", "Sufficient Context", "RAG", "LLM"], "alternates": {"html": "https://wpnews.pro/news/evidence-saturation-k-retrieval-depth-should-be-calibrated-not-guessed", "markdown": "https://wpnews.pro/news/evidence-saturation-k-retrieval-depth-should-be-calibrated-not-guessed.md", "text": "https://wpnews.pro/news/evidence-saturation-k-retrieval-depth-should-be-calibrated-not-guessed.txt", "jsonld": "https://wpnews.pro/news/evidence-saturation-k-retrieval-depth-should-be-calibrated-not-guessed.jsonld"}}