[ARTICLE · art-47457] src=discuss.huggingface.co ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Evidence saturation k*: retrieval depth should be calibrated, not guessed

A new analysis of retrieval-augmented generation (RAG) systems argues that the optimal retrieval depth k* should be calibrated per pipeline stage and reliability axis, not guessed as a single hyperparameter. The author proposes separating k into candidate_k, rerank_k, injected_k, and effective_k, and testing three regions: evidence-insufficient, evidence-sufficient, and over-context contamination. This approach builds on the Sufficient Context framework to distinguish retrieval failures from model failures.

7 min read · published Jul 3, 2026

I thought about how to make this easier to test:

Short version: I think the most important part is not “what is the best k?”, but “which k are we talking about, at which stage of the RAG pipeline, and for which reliability axis?”

I have seen nearby patterns in adaptive retrieval depth, sufficient-context evaluation, irrelevant-context robustness, RAG diagnostic metrics, and long-context position effects. I would not treat those as the same claim, though. The interesting part of your framing, to me, is that k* is not just a retrieval hyperparameter. It is a calibration target for final evidence depth under a chosen reliability axis.

The smallest useful next step may be to separate top_k into several pipeline-level variables.

The rest of this comment is just a proposed testing map.

I would avoid using one overloaded top_k term.

| Name | Meaning |
| --- | --- |
| candidate_k | number of candidates returned by the first retriever |
| rerank_k / top_n | number of candidates kept after reranking or filtering |
| injected_k | number of evidence fragments actually inserted into the LLM prompt |
| effective_k | number of distinct/useful evidence units after deduplication, compression, summarization, or prompt packing |

I would interpret your k* mostly as an injected_k / effective_k question, not as the retriever’s initial top_k.

This matters because real RAG stacks often separate retrieval, reranking, filtering, compression, and final prompt construction. For example, LlamaIndex has retrieval-time node postprocessors and rerankers where top_n can mean the number of nodes returned after reranking, not the number of original candidates. Open WebUI also has practical discussion around RAG_TOP_K and RAG_TOP_K_RERANKER, which is a useful reminder that “top k” can mean different things in different layers (see the Open WebUI discussion and env docs).

So before interpreting any k* curve, I would log the actual fragment IDs and fragment types that reach the model.
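As a rough illustration of that logging, here is a minimal sketch of a per-run record that keeps the four depths apart and stores what actually reached the prompt. The class names, the kind labels, and the text-normalization proxy for effective_k are my own assumptions, not part of any library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Fragment:
    fragment_id: str
    kind: str  # hypothetical label, e.g. "gold", "random", "duplicate"
    text: str


@dataclass
class DepthLog:
    candidate_k: int  # candidates returned by the first retriever
    rerank_k: int     # candidates kept after reranking or filtering
    injected: List[Fragment] = field(default_factory=list)  # fragments placed in the prompt

    @property
    def injected_k(self) -> int:
        return len(self.injected)

    @property
    def effective_k(self) -> int:
        # Crude proxy (assumption): count distinct texts after lowercasing
        # and whitespace normalization. Real deduplication would be fuzzier.
        return len({" ".join(f.text.lower().split()) for f in self.injected})

    def injected_ids_and_kinds(self) -> List[tuple]:
        return [(f.fragment_id, f.kind) for f in self.injected]
```

With a record like this, a reported k* can always be traced back to a specific stage, and exact duplicates show up as a gap between injected_k and effective_k.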

I would not look for one universal k* at first.

I would separate at least three regions:

| Region | Question |
| --- | --- |
| evidence-insufficient | are required facts still missing? |
| evidence-sufficient | do the retrieved snippets contain enough information to answer? |
| over-context / contamination region | does additional context start shifting wording, assumptions, framing, source emphasis, or state? |

That distinction seems important because a model can be correct over a range of k values while still changing some other reliability axis.

This connects well with the “sufficient context” line of work. The Sufficient Context project tests whether retrieved snippets alone could plausibly answer the question, and the paper Sufficient Context: A New Lens on Retrieval Augmented Generation Systems uses that idea to separate “retrieval did not provide enough information” from “the model had enough information but failed to use it.”

For this thread, I would phrase it as:

A minimal sufficiency-first test: first measure the smallest injected_k where the context becomes sufficient; then measure whether correctness, groundedness, or contamination/framing leakage changes after that point.
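The first half of that test can be sketched as a retrieval-only sweep. This is a minimal sketch, assuming the fragments are already in final injection order and that some sufficiency judge is available; smallest_sufficient_k and is_sufficient are hypothetical names, and the judge could be a human label, an autorater, or a simple rule.

```python
from typing import Callable, Optional, Sequence


def smallest_sufficient_k(
    question: str,
    ranked_fragments: Sequence[str],
    is_sufficient: Callable[[str, Sequence[str]], bool],
    max_k: Optional[int] = None,
) -> Optional[int]:
    """Return the smallest injected_k whose context is judged sufficient.

    Returns None if no prefix up to max_k is sufficient, which marks the
    run as evidence-insufficient rather than as a model failure.
    """
    limit = len(ranked_fragments)
    if max_k is not None:
        limit = min(limit, max_k)
    # Linear scan on purpose: a judged sufficiency label is not guaranteed
    # to be monotonic in k, so binary search could skip the first hit.
    for k in range(1, limit + 1):
        if is_sufficient(question, ranked_fragments[:k]):
            return k
    return None
```

Generation and judging would then only be interpreted for k at or above the returned value, so that retrieval failures and model failures stay separate.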

One implementation trap: the condition label is not enough.

A row labeled gold + distractor should be checked against the actual injected fragment IDs and fragment kinds. Otherwise the contamination curve can accidentally measure data assembly mistakes.

I would add a validation step like this before any judge or generator result is interpreted:

| Intended condition | Required validation |
| --- | --- |
| gold_only | only required evidence is injected |
| gold_plus_random | gold evidence plus at least one random irrelevant fragment |
| gold_plus_semantic | gold evidence plus a semantically related non-answer fragment |
| gold_plus_duplicate | gold evidence plus a duplicate or near-duplicate |
| gold_plus_stale | gold evidence plus temporally stale evidence |
| gold_plus_conflict | gold evidence plus a genuinely conflicting fragment |
| gold_plus_frame_shift | gold evidence plus a fragment that changes framing but not necessarily the answer |
| gold_plus_adversarial | gold evidence plus instruction-like or adversarial text |

If this validation fails, I would not interpret the contamination score yet.
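A validation step of this kind could look roughly like the following. It is a sketch under the assumption that every injected fragment carries a kind label; the kind names ("gold", "random", and so on), the mapping, and the function name are hypothetical, and the rule that no other kinds may be present is my own stricter reading of the table.

```python
from collections import Counter
from typing import List, Optional, Sequence

# Hypothetical mapping from condition label to the one extra fragment kind
# that must accompany the gold evidence (None means gold only).
REQUIRED_EXTRA_KIND = {
    "gold_only": None,
    "gold_plus_random": "random",
    "gold_plus_semantic": "semantic",
    "gold_plus_duplicate": "duplicate",
    "gold_plus_stale": "stale",
    "gold_plus_conflict": "conflict",
    "gold_plus_frame_shift": "frame_shift",
    "gold_plus_adversarial": "adversarial",
}


def validate_condition(condition: str, injected_kinds: Sequence[str]) -> List[str]:
    """Return a list of problems; an empty list means the row is usable."""
    if condition not in REQUIRED_EXTRA_KIND:
        return [f"unknown condition: {condition}"]
    counts = Counter(injected_kinds)
    problems: List[str] = []
    if counts["gold"] == 0:
        problems.append("no gold fragment injected")
    extra: Optional[str] = REQUIRED_EXTRA_KIND[condition]
    if extra is not None and counts[extra] == 0:
        problems.append(f"missing required '{extra}' fragment")
    allowed = {"gold"} if extra is None else {"gold", extra}
    unexpected = sorted(set(counts) - allowed)
    if unexpected:
        problems.append(f"unexpected fragment kinds: {unexpected}")
    return problems
```

Rows with a non-empty problem list would be excluded, or at least flagged, before any generator or judge output is scored.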

When k changes, the decisive evidence may move inside the prompt.

So a k effect can be mixed with a prompt-position effect.

For every run, I would log:

decisive_evidence_rank_in_retrieval
decisive_evidence_rank_after_rerank
decisive_evidence_position_in_prompt
decisive_evidence_token_start
decisive_evidence_token_end
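One way to fill those fields is sketched below. It assumes the injected fragments are available as (id, text) pairs in prompt order and uses whitespace splitting as a stand-in for the model's tokenizer, so the token offsets are approximate and relative to the start of the evidence block, not the full prompt. The function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class EvidencePosition:
    decisive_evidence_rank_in_retrieval: Optional[int]
    decisive_evidence_rank_after_rerank: Optional[int]
    decisive_evidence_position_in_prompt: Optional[int]
    decisive_evidence_token_start: Optional[int]
    decisive_evidence_token_end: Optional[int]


def _rank(ids: Sequence[str], target: str) -> Optional[int]:
    # 1-based rank, or None if the fragment was dropped at this stage.
    return list(ids).index(target) + 1 if target in ids else None


def locate_decisive_evidence(
    decisive_id: str,
    retrieved_ids: Sequence[str],
    reranked_ids: Sequence[str],
    injected: Sequence[Tuple[str, str]],
) -> EvidencePosition:
    position = start = end = None
    cursor = 0
    for idx, (frag_id, text) in enumerate(injected, start=1):
        n_tokens = len(text.split())  # whitespace tokens, an approximation
        if frag_id == decisive_id:
            position, start, end = idx, cursor, cursor + n_tokens
            break
        cursor += n_tokens
    return EvidencePosition(
        _rank(retrieved_ids, decisive_id),
        _rank(reranked_ids, decisive_id),
        position,
        start,
        end,
    )
```

Logging this per run makes it possible to check later whether a change at larger k coincides with the decisive fragment moving deeper into the context.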

This is not only theoretical. Lost in the Middle showed that models can be sensitive to where relevant information appears in long context. LlamaIndex also has a LongContextReorder postprocessor, which is a practical sign that node order can matter when a large top-k is placed into context.

So if contamination changes at larger k, I would check whether the decisive evidence was also pushed into a worse prompt position.

I think your contamination/framing axis is adjacent to existing RAG metrics, but probably not identical to any one of them.

Existing tools already give useful vocabulary. For example, RAGAS metrics include faithfulness, answer relevancy, context recall, context precision, and context utilization. RAGAS also has Noise Sensitivity, which measures how often a system makes errors when using relevant or irrelevant retrieved documents.

That is related, but I would not collapse framing leakage into noise sensitivity.

A useful separation might be:

| Axis | What it measures |
| --- | --- |
| correctness | final answer is factually right |
| sufficiency | context contains enough evidence to answer |
| context recall | necessary evidence was retrieved |
| context precision | irrelevant evidence is not dominating the context |
| context utilization | answer-relevant context is ranked/used well |
| faithfulness / groundedness | answer is supported by supplied context |
| noise sensitivity | retrieved docs cause incorrect responses |
| contamination / framing leakage | extra context shifts framing, assumptions, vocabulary, source emphasis, or state even if correctness is flat |

The last row is the hard part.

I would treat it as an additional rubric, not as a replacement for correctness or groundedness.

If the contamination score is judged by an LLM or by a heuristic rubric, I would test that evaluator before trusting the k curve.

Otherwise, the curve may partly measure the evaluator's own behavior rather than the RAG system.

A small rubric test suite could include:

| Case | Expected behavior |
| --- | --- |
| gold only | no contamination |
| gold + neutral filler | no contamination |
| gold + random irrelevant | usually no framing leakage unless used |
| gold + semantic distractor | possible distractor sensitivity |
| gold + duplicate | should not be called framing leakage by default |
| gold + stale evidence | stale-memory / temporal-validity issue |
| gold + conflict | conflict-handling issue |
| gold + frame-shifting evidence | likely framing leakage candidate |
| gold + adversarial instruction-like text | adversarial stress test, separate from ordinary leakage |
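A few of these expectations can be written down as a small table-driven check on the evaluator itself. This is only a sketch: the fixture answers are invented toy strings, the judge signature (baseline answer, answer with extra context) is an assumption, and I only encode the rows whose expected label is unambiguous.

```python
from typing import Callable, Dict, List, Tuple

# name -> (baseline_answer, answer_with_extra_context, leakage_expected)
# All strings are invented toy fixtures, not real model outputs.
RUBRIC_CASES: Dict[str, Tuple[str, str, bool]] = {
    "gold_only": ("The fee is 5 EUR.", "The fee is 5 EUR.", False),
    "gold_plus_neutral_filler": ("The fee is 5 EUR.", "The fee is 5 EUR.", False),
    "gold_plus_duplicate": ("The fee is 5 EUR.", "The fee is 5 EUR.", False),
    "gold_plus_frame_shift": (
        "The fee is 5 EUR.",
        "The controversial fee is 5 EUR.",
        True,
    ),
}


def rubric_failures(
    judge: Callable[[str, str], bool],
    cases: Dict[str, Tuple[str, str, bool]] = RUBRIC_CASES,
) -> List[str]:
    """Return the case names where the judge disagrees with the expected label."""
    return [
        name
        for name, (baseline, candidate, expected) in cases.items()
        if judge(baseline, candidate) != expected
    ]
```

A judge that never flags anything fails only the frame-shift case, and a judge that flags everything fails the three clean cases, so both degenerate evaluators are caught before their scores are plotted against k.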

Outside the details, my proposed minimal protocol would be:

injected_k.

If you wanted to make this easier for others to reproduce, I would add these in roughly this order:

| Effort | Addition |
| --- | --- |
| low | define whether k means candidate_k, rerank_k, injected_k, or effective_k |
| low | log actual injected fragment IDs and fragment kinds |
| low | log decisive evidence position in the prompt |
| medium | add retrieval-only sufficiency sweep |
| medium | add condition validation before generation/judging |
| medium | add unit tests for the contamination/framing rubric |
| medium | compare fixed-k with one adaptive-k baseline |
| higher | add HotpotQA-style public multi-hop preview |
| higher | compare raw chunks vs reranked vs compressed/filtered evidence |

My main takeaway is:

The valuable target is probably not a universal k*, but a reproducible calibration procedure: for a given corpus, chunking scheme, retriever, reranker, prompt builder, model, and reliability axis, what final evidence depth is sufficient, and when do extra fragments start changing another axis?

That would make the idea easier to test, compare, and falsify without forcing every reader to guess which part of the RAG pipeline the reported k* belongs to.
