Evidence saturation k*: retrieval depth should be calibrated, not guessed


I thought about how to make this easier to test:

Short version: I think the most important part is not “what is the best `k`?”, but “which `k` are we talking about, at which stage of the RAG pipeline, and for which reliability axis?”

I have seen nearby patterns in adaptive retrieval depth, sufficient-context evaluation, irrelevant-context robustness, RAG diagnostic metrics, and long-context position effects. I would not treat those as the same claim, though. The interesting part of your framing, to me, is that `k*` is not just a retrieval hyperparameter. It is a calibration target for final evidence depth under a chosen reliability axis.

The smallest useful next step may be to split `top_k` into several pipeline-level variables.

The rest of this comment is just a proposed testing map.

I would avoid using one overloaded `top_k` term.

Name	Meaning
`candidate_k`	number of candidates returned by the first retriever
`rerank_k` / `top_n`	number of candidates kept after reranking or filtering
`injected_k`	number of evidence fragments actually inserted into the LLM prompt
`effective_k`	number of distinct/useful evidence units after deduplication, compression, summarization, or prompt packing

I would interpret your `k*` mostly as an `injected_k` / `effective_k` question, not as the retriever’s initial `top_k`.

This matters because real RAG stacks often separate retrieval, reranking, filtering, compression, and final prompt construction. For example, LlamaIndex has retrieval-time node postprocessors and rerankers where `top_n` can mean the number of nodes returned after reranking, not the number of original candidates. Open WebUI also has practical discussion around `RAG_TOP_K` and `RAG_TOP_K_RERANKER`, which is a useful reminder that “top k” can mean different things in different layers (see the Open WebUI discussion and env docs).

So before interpreting any `k*` curve, I would log the actual fragment IDs and fragment types that reach the model.
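As a minimal sketch of what such a log could look like, the record below keeps the four depth variables from the table next to the fragment IDs and kinds that were actually injected. The class names, field names, and the `kind` labels are my own assumptions for illustration, and `effective_k` here is only a crude proxy that collapses exact duplicates after case and whitespace normalization.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    fragment_id: str  # hypothetical stable ID for a chunk
    kind: str         # hypothetical label, e.g. "gold", "random", "duplicate"
    text: str


@dataclass
class DepthLog:
    """Per-query record of how much evidence survives each pipeline stage."""

    candidate_k: int  # returned by the first retriever
    rerank_k: int     # kept after reranking or filtering
    injected: list[Fragment] = field(default_factory=list)

    @property
    def injected_k(self) -> int:
        # Number of fragments actually placed into the prompt.
        return len(self.injected)

    @property
    def effective_k(self) -> int:
        # Crude proxy: fragments with identical normalized text count once.
        return len({" ".join(f.text.lower().split()) for f in self.injected})

    def as_row(self) -> dict:
        # Flat row that can be written to CSV/JSONL next to the model output.
        return {
            "candidate_k": self.candidate_k,
            "rerank_k": self.rerank_k,
            "injected_k": self.injected_k,
            "effective_k": self.effective_k,
            "injected_ids": [f.fragment_id for f in self.injected],
            "injected_kinds": [f.kind for f in self.injected],
        }
```

A real stack would probably need a better `effective_k` (near-duplicate detection, or counting units after compression), but even this version makes it visible when `injected_k` and `effective_k` diverge.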

I would not look for one universal `k*` at first.

I would separate at least three regions:

Region	Question
evidence-insufficient	are required facts still missing?
evidence-sufficient	do the retrieved snippets contain enough information to answer?
over-context / contamination region	does additional context start shifting wording, assumptions, framing, source emphasis, or state?

That distinction seems important because a model can be correct over a range of `k` values while still changing some other reliability axis.

This connects well with the “sufficient context” line of work. The Sufficient Context project tests whether retrieved snippets alone could plausibly answer the question, and the paper Sufficient Context: A New Lens on Retrieval Augmented Generation Systems uses that idea to separate “retrieval did not provide enough information” from “the model had enough information but failed to use it.”

For this thread, I would phrase it as a minimal sufficiency-first test: first measure the smallest `injected_k` where the context becomes sufficient; then measure whether correctness, groundedness, or contamination/framing leakage changes after that point.
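The two steps can be sketched roughly as follows. This is only an illustration under stated assumptions: `is_sufficient` stands in for whatever sufficiency check is used (a human label, an autorater, or a fact-coverage test), `scores_by_k` is a hypothetical mapping from `injected_k` to a score on one chosen axis, and `tolerance` is an arbitrary threshold, not a recommended value.

```python
from typing import Callable, Optional, Sequence


def smallest_sufficient_k(
    fragments: Sequence[str],
    is_sufficient: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Smallest prefix length of the injected fragments that passes the check."""
    for k in range(1, len(fragments) + 1):
        if is_sufficient(fragments[:k]):
            return k
    return None  # context never became sufficient


def first_shift_after(
    scores_by_k: dict[int, float],
    k_sufficient: int,
    tolerance: float,
) -> Optional[int]:
    """First k above k_sufficient whose score moves more than `tolerance`
    away from the score measured at k_sufficient."""
    baseline = scores_by_k[k_sufficient]
    for k in sorted(scores_by_k):
        if k > k_sufficient and abs(scores_by_k[k] - baseline) > tolerance:
            return k
    return None  # no shift observed in the measured range


# Toy usage: sufficiency is approximated by "both required facts are present".
fragments = ["filler", "fact A", "fact B", "noise"]
needs = {"fact A", "fact B"}
k_suf = smallest_sufficient_k(fragments, lambda ctx: needs <= set(ctx))
```

Keeping the two functions separate mirrors the point of the test: the sufficiency threshold is found from the context alone, and only afterwards is a generator or judge score compared against the value at that threshold.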

One implementation trap: the condition label is not enough.

A row labeled `gold + distractor` should be checked against the actual injected fragment IDs and fragment kinds. Otherwise the contamination curve can accidentally measure data assembly mistakes.

I would add a validation step like this before any judge or generator result is interpreted:

Intended condition	Required validation
`gold_only`	only required evidence is injected
`gold_plus_random`	gold evidence plus at least one random irrelevant fragment
`gold_plus_semantic`	gold evidence plus a semantically related non-answer fragment
`gold_plus_duplicate`	gold evidence plus a duplicate or near-duplicate
`gold_plus_stale`	gold evidence plus temporally stale evidence
`gold_plus_conflict`	gold evidence plus a genuinely conflicting fragment
`gold_plus_frame_shift`	gold evidence plus a fragment that changes framing but not necessarily the answer
`gold_plus_adversarial`	gold evidence plus instruction-like or adversarial text

If this validation fails, I would not interpret the contamination score yet.
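A validation step along these lines could be as small as the sketch below. The condition names come from the table; the fragment `kind` strings (`"gold"`, `"random"`, and so on) and the strict rule that no other kinds may appear are assumptions of mine, so adjust them to however fragments are actually labeled.

```python
from collections import Counter
from typing import Optional

# Extra fragment kind each condition must contain in addition to "gold".
# The kind labels are hypothetical; use whatever your data assembly emits.
REQUIRED_EXTRA_KIND: dict[str, Optional[str]] = {
    "gold_only": None,
    "gold_plus_random": "random",
    "gold_plus_semantic": "semantic",
    "gold_plus_duplicate": "duplicate",
    "gold_plus_stale": "stale",
    "gold_plus_conflict": "conflict",
    "gold_plus_frame_shift": "frame_shift",
    "gold_plus_adversarial": "adversarial",
}


def validate_condition(condition: str, injected_kinds: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the row matches its label."""
    if condition not in REQUIRED_EXTRA_KIND:
        return [f"unknown condition: {condition}"]

    counts = Counter(injected_kinds)
    required = REQUIRED_EXTRA_KIND[condition]
    allowed = {"gold"} | ({required} if required else set())

    problems = []
    if counts["gold"] == 0:
        problems.append("no gold fragment injected")
    if required and counts[required] == 0:
        problems.append(f"missing required '{required}' fragment")
    # Strict by assumption: any kind outside the label counts as an assembly error.
    unexpected = sorted(set(counts) - allowed)
    if unexpected:
        problems.append(f"unexpected fragment kinds: {unexpected}")
    return problems
```

Rows with a non-empty problem list would be dropped or fixed before any generator or judge output is scored.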

Why I think condition validation matters

When `k` changes, the decisive evidence may move inside the prompt.

So a `k` effect can be mixed with a prompt-position effect.

For every run, I would log:

decisive_evidence_rank_in_retrieval
decisive_evidence_rank_after_rerank
decisive_evidence_position_in_prompt
decisive_evidence_token_start
decisive_evidence_token_end

This is not only theoretical. Lost in the Middle showed that models can be sensitive to where relevant information appears in long context. LlamaIndex also has a LongContextReorder postprocessor, which is a practical sign that node order can matter when a large top-k is placed into context.

So if contamination changes at larger `k`, I would check whether the decisive evidence was also pushed into a worse prompt position.
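One way to produce the five logged fields is sketched below. It assumes each injected fragment is an `(id, text)` pair in prompt order, counts tokens by whitespace as a stand-in for the model's real tokenizer, and reports token offsets relative to the start of the evidence block (prompt template and separators are ignored). The helper names are hypothetical.

```python
from typing import Callable, Optional, Sequence


def whitespace_token_count(text: str) -> int:
    # Stand-in for a real tokenizer; swap in the model's own token counter.
    return len(text.split())


def rank_of(fragment_id: str, ordered_ids: Sequence[str]) -> Optional[int]:
    """1-based rank of the fragment in an ordered ID list, or None if absent."""
    ids = list(ordered_ids)
    return ids.index(fragment_id) + 1 if fragment_id in ids else None


def locate_decisive_evidence(
    decisive_id: str,
    retrieved_ids: Sequence[str],
    reranked_ids: Sequence[str],
    injected: Sequence[tuple[str, str]],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> dict:
    row = {
        "decisive_evidence_rank_in_retrieval": rank_of(decisive_id, retrieved_ids),
        "decisive_evidence_rank_after_rerank": rank_of(decisive_id, reranked_ids),
        "decisive_evidence_position_in_prompt": None,
        "decisive_evidence_token_start": None,
        "decisive_evidence_token_end": None,
    }
    offset = 0  # tokens consumed by fragments placed before the current one
    for position, (fragment_id, text) in enumerate(injected, start=1):
        n_tokens = count_tokens(text)
        if fragment_id == decisive_id:
            row["decisive_evidence_position_in_prompt"] = position
            row["decisive_evidence_token_start"] = offset
            row["decisive_evidence_token_end"] = offset + n_tokens
            break
        offset += n_tokens
    return row
```

With these fields in every row, a contamination change at larger `k` can be compared between runs where the decisive evidence stayed in place and runs where it moved.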

I think your contamination/framing axis is adjacent to existing RAG metrics, but probably not identical to any one of them.

Existing tools already give useful vocabulary. For example, RAGAS metrics include faithfulness, answer relevancy, context recall, context precision, and context utilization. RAGAS also has Noise Sensitivity, which measures how often a system makes errors when using relevant or irrelevant retrieved documents.

That is related, but I would not collapse framing leakage into noise sensitivity.

A useful separation might be:

Axis	What it measures
correctness	final answer is factually right
sufficiency	context contains enough evidence to answer
context recall	necessary evidence was retrieved
context precision	irrelevant evidence is not dominating the context
context utilization	answer-relevant context is ranked/used well
faithfulness / groundedness	answer is supported by supplied context
noise sensitivity	retrieved docs cause incorrect responses
contamination / framing leakage	extra context shifts framing, assumptions, vocabulary, source emphasis, or state even if correctness is flat

The last row is the hard part.

I would treat it as an additional rubric, not as a replacement for correctness or groundedness.

Adjacent prior work I would use as controls, not as exact equivalents

If the contamination score is judged by an LLM or by a heuristic rubric, I would test that evaluator before trusting the `k` curve.

Otherwise, the curve may partly measure artifacts of the evaluator rather than contamination itself.

A small rubric test suite could include:

Case	Expected behavior
gold only	no contamination
gold + neutral filler	no contamination
gold + random irrelevant	usually no framing leakage unless used
gold + semantic distractor	possible distractor sensitivity
gold + duplicate	should not be called framing leakage by default
gold + stale evidence	stale-memory / temporal-validity issue
gold + conflict	conflict-handling issue
gold + frame-shifting evidence	likely framing leakage candidate
gold + adversarial instruction-like text	adversarial stress test, separate from ordinary leakage
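A rubric test suite like this can be run as ordinary unit tests against whatever judge is used. The sketch below assumes the judge is a callable that compares an answer with the gold-only answer and returns `True` when it sees framing leakage; the example answers are invented for illustration, and `toy_judge` is a deliberately naive placeholder, not a proposed metric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RubricCase:
    name: str
    gold_only_answer: str  # answer produced with gold evidence only
    answer: str            # answer produced under the tested condition
    expect_leakage: bool   # what the rubric should say


def run_rubric_suite(
    judge: Callable[[str, str], bool],
    cases: Sequence[RubricCase],
) -> list[str]:
    """Return the names of cases where the judge disagrees with the expectation."""
    return [
        case.name
        for case in cases
        if judge(case.gold_only_answer, case.answer) != case.expect_leakage
    ]


def toy_judge(gold_only_answer: str, answer: str) -> bool:
    # Placeholder: flags any answer that adds words absent from the baseline.
    return bool(set(answer.lower().split()) - set(gold_only_answer.lower().split()))


# Invented example answers; real cases would come from logged runs.
CASES = [
    RubricCase("gold only", "the limit is 10 requests", "the limit is 10 requests", False),
    RubricCase("gold + neutral filler", "the limit is 10 requests", "The limit is 10 requests", False),
    RubricCase("gold + duplicate", "the limit is 10 requests", "the limit is 10 requests", False),
    RubricCase(
        "gold + frame-shifting evidence",
        "the limit is 10 requests",
        "the restrictive limit is 10 requests",
        True,
    ),
]

failures = run_rubric_suite(toy_judge, CASES)
```

If an LLM judge fails the "gold only" or "gold + duplicate" cases, the contamination curve over `k` is probably not interpretable yet.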

Outside the details, my proposed minimal protocol would be:

`injected_k`.

If you wanted to make this easier for others to reproduce, I would add these in roughly this order:

Effort	Addition
low	define whether `k` means `candidate_k`, `rerank_k`, `injected_k`, or `effective_k`
low	log actual injected fragment IDs and fragment kinds
low	log decisive evidence position in the prompt
medium	add retrieval-only sufficiency sweep
medium	add condition validation before generation/judging
medium	add unit tests for the contamination/framing rubric
medium	compare fixed-k with one adaptive-k baseline
higher	add HotpotQA-style public multi-hop preview
higher	compare raw chunks vs reranked vs compressed/filtered evidence

My main takeaway is:

The valuable target is probably not a universal `k*`, but a reproducible calibration procedure: for a given corpus, chunking scheme, retriever, reranker, prompt builder, model, and reliability axis, what final evidence depth is sufficient, and when do extra fragments start changing another axis?

That would make the idea easier to test, compare, and falsify without forcing every reader to guess which part of the RAG pipeline the reported `k*` belongs to.

source & further reading

discuss.huggingface.co (original article)