I thought about how to make this easier to test:
Short version: I think the most important part is not “what is the best `k`?”, but “which `k` are we talking about, at which stage of the RAG pipeline, and for which reliability axis?”
I have seen nearby patterns in adaptive retrieval depth, sufficient-context evaluation, irrelevant-context robustness, RAG diagnostic metrics, and long-context position effects. I would not treat those as the same claim, though. The interesting part of your framing, to me, is that `k*` is not just a retrieval hyperparameter. It is a calibration target for final evidence depth under a chosen reliability axis.
The smallest useful next step may be to split `top_k` into several pipeline-level variables.
The rest of this comment is just a proposed testing map.
I would avoid using one overloaded `top_k` term.
| Name | Meaning |
|---|---|
| `candidate_k` | number of candidates returned by the first retriever |
| `rerank_k` / `top_n` | number of candidates kept after reranking or filtering |
| `injected_k` | number of evidence fragments actually inserted into the LLM prompt |
| `effective_k` | number of distinct/useful evidence units after deduplication, compression, summarization, or prompt packing |
I would interpret your `k*` mostly as an `injected_k` / `effective_k` question, not as the retriever’s initial `top_k`.
This matters because real RAG stacks often separate retrieval, reranking, filtering, compression, and final prompt construction. For example, LlamaIndex has retrieval-time node postprocessors and rerankers where `top_n` can mean the number of nodes returned after reranking, not the number of original candidates. Open WebUI also has practical discussion around `RAG_TOP_K` and `RAG_TOP_K_RERANKER`, which is a useful reminder that “top k” can mean different things in different layers (discussion, env docs).
So before interpreting any `k*` curve, I would log the actual fragment IDs and fragment types that reach the model.
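A minimal sketch of that logging step, assuming each fragment carries a stable ID and a kind label. The `Fragment` class, the field names, and the crude text-based deduplication used for `effective_k` are all hypothetical choices for illustration, not part of any library:

```python
import json
from dataclasses import dataclass


@dataclass
class Fragment:
    fragment_id: str  # hypothetical stable ID for a chunk
    kind: str         # hypothetical label, e.g. "gold", "random", "duplicate"
    text: str


def injected_log_record(run_id: str, fragments: list[Fragment]) -> dict:
    """Describe exactly what reached the prompt, in prompt order."""
    # Crude dedup: fragments count as the same unit if their normalized text matches.
    distinct_texts = {f.text.strip().lower() for f in fragments}
    return {
        "run_id": run_id,
        "injected_k": len(fragments),
        "effective_k": len(distinct_texts),
        "fragment_ids": [f.fragment_id for f in fragments],
        "fragment_kinds": [f.kind for f in fragments],
    }


fragments = [
    Fragment("doc1#3", "gold", "Paris is the capital of France."),
    Fragment("doc7#1", "random", "Bananas are rich in potassium."),
    Fragment("doc1#3-copy", "duplicate", "paris is the capital of france."),
]
record = injected_log_record("run-001", fragments)
print(json.dumps(record))  # one JSON line per run
```

Writing one such record per run makes it possible to check later whether a reported `k` was really the number of fragments the model saw.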
I would not look for one universal `k*` at first.
I would separate at least three regions:
| Region | Question |
|---|---|
| evidence-insufficient | are required facts still missing? |
| evidence-sufficient | do the retrieved snippets contain enough information to answer? |
| over-context / contamination region | does additional context start shifting wording, assumptions, framing, source emphasis, or state? |
That distinction seems important because a model can be correct over a range of `k` values while still changing some other reliability axis.
This connects well with the “sufficient context” line of work. The Sufficient Context project tests whether retrieved snippets alone could plausibly answer the question, and the paper Sufficient Context: A New Lens on Retrieval Augmented Generation Systems uses that idea to separate “retrieval did not provide enough information” from “the model had enough information but failed to use it.”
For this thread, I would phrase it as a minimal sufficiency-first test: first measure the smallest `injected_k` where the context becomes sufficient; then measure whether correctness, groundedness, or contamination/framing leakage changes after that point.
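The first half of that test could be sketched as a retrieval-only sweep. This assumes some sufficiency judge is available as a callable; `smallest_sufficient_k` and `toy_judge` are hypothetical names, and the toy judge is only a string-matching stand-in for an LLM-based or human sufficiency label:

```python
from typing import Callable, Optional, Sequence


def smallest_sufficient_k(
    ranked_fragments: Sequence[str],
    is_sufficient: Callable[[Sequence[str]], bool],
    max_k: Optional[int] = None,
) -> Optional[int]:
    """Return the smallest injected_k whose top-k prefix is judged sufficient, else None."""
    limit = len(ranked_fragments) if max_k is None else min(max_k, len(ranked_fragments))
    # Linear scan rather than binary search: a noisy judge may not be monotone in k.
    for k in range(1, limit + 1):
        if is_sufficient(ranked_fragments[:k]):
            return k
    return None


def toy_judge(context: Sequence[str]) -> bool:
    """Toy stand-in: the context is sufficient if every required fact appears in it."""
    required = ("born in 1912", "died in 1954")
    joined = " ".join(context)
    return all(fact in joined for fact in required)


ranked = [
    "Turing was born in 1912 in London.",
    "He worked at Bletchley Park.",
    "Turing died in 1954.",
]
k_suff = smallest_sufficient_k(ranked, toy_judge)
print(k_suff)
```

Generation and judging would then only be swept at and beyond `k_suff`, so that insufficient-evidence failures are not mixed into the contamination curve.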
One implementation trap: the condition label is not enough.
A row labeled `gold + distractor` should be checked against the actual injected fragment IDs and fragment kinds. Otherwise the contamination curve can accidentally measure data assembly mistakes.
I would add a validation step like this before any judge or generator result is interpreted:
| Intended condition | Required validation |
|---|---|
| `gold_only` | only required evidence is injected |
| `gold_plus_random` | gold evidence plus at least one random irrelevant fragment |
| `gold_plus_semantic` | gold evidence plus a semantically related non-answer fragment |
| `gold_plus_duplicate` | gold evidence plus a duplicate or near-duplicate |
| `gold_plus_stale` | gold evidence plus temporally stale evidence |
| `gold_plus_conflict` | gold evidence plus a genuinely conflicting fragment |
| `gold_plus_frame_shift` | gold evidence plus a fragment that changes framing but not necessarily the answer |
| `gold_plus_adversarial` | gold evidence plus instruction-like or adversarial text |
If this validation fails, I would not interpret the contamination score yet.
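One way to sketch this validation, assuming every injected fragment has already been tagged with a kind. The kind labels (`"gold"`, `"random"`, `"semantic"`, and so on) and the `validate_condition` helper are hypothetical; only the condition names come from the table above:

```python
from collections import Counter

# Hypothetical mapping: condition label -> fragment kinds required in addition to "gold".
REQUIRED_EXTRA_KINDS = {
    "gold_only": set(),
    "gold_plus_random": {"random"},
    "gold_plus_semantic": {"semantic"},
    "gold_plus_duplicate": {"duplicate"},
    "gold_plus_stale": {"stale"},
    "gold_plus_conflict": {"conflict"},
    "gold_plus_frame_shift": {"frame_shift"},
    "gold_plus_adversarial": {"adversarial"},
}


def validate_condition(condition: str, injected_kinds: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the row matches its label."""
    if condition not in REQUIRED_EXTRA_KINDS:
        return [f"unknown condition: {condition}"]
    counts = Counter(injected_kinds)
    problems = []
    if counts["gold"] == 0:
        problems.append("no gold fragment injected")
    required = REQUIRED_EXTRA_KINDS[condition]
    for kind in sorted(required - set(counts)):
        problems.append(f"missing required kind: {kind}")
    for kind in sorted(set(counts) - required - {"gold"}):
        problems.append(f"unexpected kind: {kind}")
    return problems


print(validate_condition("gold_plus_random", ["gold", "semantic"]))
```

Rows with a non-empty problem list would be excluded or rebuilt before any generator or judge output is scored.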
When `k` changes, the decisive evidence may move inside the prompt.
So a `k` effect can be mixed with a prompt-position effect.
For every run, I would log:
- `decisive_evidence_rank_in_retrieval`
- `decisive_evidence_rank_after_rerank`
- `decisive_evidence_position_in_prompt`
- `decisive_evidence_token_start`
- `decisive_evidence_token_end`
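The prompt-side fields could be computed roughly like this. It is a sketch under two assumptions: whitespace splitting stands in for the model's real tokenizer, and offsets are counted within the evidence block only, ignoring any instructions or template text around it. `locate_decisive_evidence` is a hypothetical helper:

```python
from typing import Optional


def locate_decisive_evidence(
    prompt_fragments: list[tuple[str, str]], decisive_id: str
) -> Optional[dict]:
    """Find the decisive fragment among (fragment_id, text) pairs given in prompt order.

    Returns its 0-based position and its token span (end exclusive), or None if absent.
    """
    token_cursor = 0
    for position, (fragment_id, text) in enumerate(prompt_fragments):
        n_tokens = len(text.split())  # assumption: whitespace tokens, not model tokens
        if fragment_id == decisive_id:
            return {
                "decisive_evidence_position_in_prompt": position,
                "decisive_evidence_token_start": token_cursor,
                "decisive_evidence_token_end": token_cursor + n_tokens,
            }
        token_cursor += n_tokens
    return None


prompt = [("a", "one two three"), ("b", "four five"), ("c", "six")]
location = locate_decisive_evidence(prompt, "b")
print(location)
```

A `None` result is itself worth logging, since it means the decisive evidence never reached the model for that run.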
This is not only theoretical. Lost in the Middle showed that models can be sensitive to where relevant information appears in long context. LlamaIndex also has a LongContextReorder postprocessor, which is a practical sign that node order can matter when a large top-k is placed into context.
So if contamination changes at larger `k`, I would check whether the decisive evidence was also pushed into a worse prompt position.
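A rough way to run that check is to average the contamination score per depth and per position bucket, so the two effects can be read separately. The record fields (`injected_k`, `position`, `contamination`) and the three-bucket split are assumptions made for this sketch:

```python
from collections import defaultdict
from statistics import mean


def contamination_by_k_and_position(runs: list[dict], n_buckets: int = 3) -> dict:
    """Average a contamination score per (injected_k, relative-position bucket)."""
    grouped = defaultdict(list)
    for run in runs:
        # Relative position in [0, 1): 0 means the start of the evidence block.
        relative = run["position"] / run["injected_k"]
        bucket = min(int(relative * n_buckets), n_buckets - 1)
        grouped[(run["injected_k"], bucket)].append(run["contamination"])
    return {key: mean(scores) for key, scores in sorted(grouped.items())}


runs = [
    {"injected_k": 4, "position": 0, "contamination": 0.25},
    {"injected_k": 4, "position": 0, "contamination": 0.75},
    {"injected_k": 4, "position": 2, "contamination": 0.5},
    {"injected_k": 8, "position": 0, "contamination": 0.25},
]
summary = contamination_by_k_and_position(runs)
print(summary)
```

If the score moves across buckets at a fixed `injected_k`, position is implicated; if it moves across `injected_k` within one bucket, depth is the more likely driver.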
I think your contamination/framing axis is adjacent to existing RAG metrics, but probably not identical to any one of them.
Existing tools already give useful vocabulary. For example, RAGAS metrics include faithfulness, answer relevancy, context recall, context precision, and context utilization. RAGAS also has Noise Sensitivity, which measures how often a system makes errors when using relevant or irrelevant retrieved documents.
That is related, but I would not collapse framing leakage into noise sensitivity.
A useful separation might be:
| Axis | What it measures |
|---|---|
| correctness | final answer is factually right |
| sufficiency | context contains enough evidence to answer |
| context recall | necessary evidence was retrieved |
| context precision | irrelevant evidence is not dominating the context |
| context utilization | answer-relevant context is ranked/used well |
| faithfulness / groundedness | answer is supported by supplied context |
| noise sensitivity | retrieved docs cause incorrect responses |
| contamination / framing leakage | extra context shifts framing, assumptions, vocabulary, source emphasis, or state even if correctness is flat |
The last row is the hard part.
I would treat it as an additional rubric, not as a replacement for correctness or groundedness.
I would use the adjacent prior work as controls, not as exact equivalents.
If the contamination score is judged by an LLM or by a heuristic rubric, I would test that evaluator before trusting the `k` curve.
Otherwise, the curve may partly measure artifacts of the evaluator rather than behavior of the RAG system.
A small rubric test suite could include:
| Case | Expected behavior |
|---|---|
| gold only | no contamination |
| gold + neutral filler | no contamination |
| gold + random irrelevant | usually no framing leakage unless used |
| gold + semantic distractor | possible distractor sensitivity |
| gold + duplicate | should not be called framing leakage by default |
| gold + stale evidence | stale-memory / temporal-validity issue |
| gold + conflict | conflict-handling issue |
| gold + frame-shifting evidence | likely framing leakage candidate |
| gold + adversarial instruction-like text | adversarial stress test, separate from ordinary leakage |
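A few rows of the table above could be turned into unit tests for the evaluator along these lines. The label names (`"none"`, `"framing_leakage"`), the `check_rubric` helper, and the keyword-based `toy_judge` are hypothetical; a real judge would be an LLM call or a fuller heuristic:

```python
from typing import Callable

# (case name, expected label) pairs for a subset of the table; label names are hypothetical.
RUBRIC_CASES = [
    ("gold_only", "none"),
    ("gold_plus_neutral_filler", "none"),
    ("gold_plus_duplicate", "none"),
    ("gold_plus_frame_shift", "framing_leakage"),
]


def check_rubric(judge: Callable[[str], str], examples: dict[str, str]) -> list[str]:
    """Run the judge on one answer per case and return the names of failing cases."""
    failures = []
    for case, expected in RUBRIC_CASES:
        if judge(examples[case]) != expected:
            failures.append(case)
    return failures


def toy_judge(answer: str) -> str:
    """Toy stand-in: flags leakage when a loaded word from the distractor shows up."""
    return "framing_leakage" if "controversial" in answer else "none"


examples = {
    "gold_only": "The bridge opened in 1932.",
    "gold_plus_neutral_filler": "The bridge opened in 1932.",
    "gold_plus_duplicate": "The bridge opened in 1932.",
    "gold_plus_frame_shift": "The controversial bridge opened in 1932.",
}
print(check_rubric(toy_judge, examples))
```

An evaluator that fails the clean cases, for example by flagging everything, would inflate the contamination curve regardless of `k`.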
Outside the details, my proposed minimal protocol would be:
`injected_k`.
If you wanted to make this easier for others to reproduce, I would add these in roughly this order:
| Effort | Addition |
|---|---|
| low | define whether `k` means `candidate_k`, `rerank_k`, `injected_k`, or `effective_k` |
| low | log actual injected fragment IDs and fragment kinds |
| low | log decisive evidence position in the prompt |
| medium | add retrieval-only sufficiency sweep |
| medium | add condition validation before generation/judging |
| medium | add unit tests for the contamination/framing rubric |
| medium | compare fixed-k with one adaptive-k baseline |
| higher | add HotpotQA-style public multi-hop preview |
| higher | compare raw chunks vs reranked vs compressed/filtered evidence |
My main takeaway is:
The valuable target is probably not a universal `k*`, but a reproducible calibration procedure: for a given corpus, chunking scheme, retriever, reranker, prompt builder, model, and reliability axis, what final evidence depth is sufficient, and when do extra fragments start changing another axis?
That would make the idea easier to test, compare, and falsify without forcing every reader to guess which part of the RAG pipeline the reported `k*` belongs to.