# A Cognitive Benchmark for Code-RAG Retrieval: Part 2 — Why Model Rankings Depend on the Pipeline

> Source: <https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on-the-pipeline-12a4>
> Published: 2026-06-14 21:00:41+00:00

When developers enter an unfamiliar project, they rarely search for a specific
file by name. They usually ask about system behavior: where incoming
connections are accepted, which component cleans logs, or how a request travels
between architectural layers.

Code-RAG tries to answer such questions through semantic search. It splits and
indexes the source code, then retrieves the context most closely related to a
developer's query.

The quality of this search is often reduced to the choice of embedding model:
compare several candidates and select the one with the highest metric. In
practice, the result also depends on how the code was split and which retrieval
mode was used.

To study these dependencies, I built a Code-RAG benchmark on the Apache Kafka
4.0.0 broker core, a real polyglot project written in Java and Scala. For
thirty questions about system behavior, I identified the correct files in
advance, allowing me to measure how accurately the retrieval pipeline finds
the relevant code.

The results show that a model ranking exists only within a specific
configuration. Changing the chunking strategy, retrieval mode, or query
phrasing can change both the metric value and the order of models in the
ranking.

In this part of the study, I compare sixteen embedding models, five chunking
configurations, and three retrieval modes: BM25, vector search, and hybrid
search. Each of the thirty questions was expressed in up to five forms, ranging
from a natural developer question to a query with inaccurate terminology or a
reference to a neighboring module (the cross-module form exists for only ten of
the questions). The structure of these variants and the evaluation methodology
are described in
[Part 1 of the study](https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l).

I compared four groups of variables:

| Variable | What changed | What it tested |
|---|---|---|
| Embedding model | 7 local models through Ollama and 9 commercial APIs | How strongly quality depends on the vector representation of the query and code |
| Chunking | Whole-file indexing and four fixed-size chunks with overlap | How the indexed fragment size affects a particular model |
| Retrieval mode | `BM25_ONLY`, `VECTOR_ONLY`, `HYBRID_RRF` | Whether lexical search, vector search, or their combination works best |
| Query phrasing | Natural question, technical query, keywords, inaccurate terminology, and selected cross-module queries | How strongly the result depends on the language of the query |

The three retrieval modes work as follows:

| Mode | How the ranking is produced |
|---|---|
| `BM25_ONLY` | Lucene lexical search. Files rank highly when query terms match terms in the code. No embedding model is used. |
| `VECTOR_ONLY` | The query and code fragments are converted into embeddings. Ranking is based on vector similarity, so exact word overlap is unnecessary. |
| `HYBRID_RRF` | BM25 and vector search run independently, then their positions are combined using Reciprocal Rank Fusion. RRF uses rank positions rather than directly adding incomparable scores. |
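To make the `HYBRID_RRF` row concrete, here is a minimal sketch of Reciprocal Rank Fusion. It is not the benchmark's implementation: the function name `rrf_fuse`, the smoothing constant `k=60` (a commonly used default, not stated in this article), and the file names are assumptions chosen for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. Only positions are used, never raw scores.
    k=60 is an assumed default, not a value taken from the study.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by name for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical top-3 lists from a lexical and a vector retriever.
bm25_ranking = ["LogCleaner.scala", "LogManager.scala", "SocketServer.scala"]
vector_ranking = ["LogManager.scala", "KafkaApis.scala", "LogCleaner.scala"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)
```

In this toy input, a file that both retrievers place near the top outranks a file that only one of them found, which is the behavior the table describes.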

The primary metric in this article is `recall@10`. For a single question, it
equals 1 when the primary correct file appears in the first ten results and 0
otherwise. The final value is the average across thirty questions. For example,
`recall@10 = 0.900` means that the correct file appeared in the top ten for 27
of 30 questions.
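The definition can be sketched in a few lines. The function names and the toy data below are illustrative assumptions, not the benchmark's own code.

```python
def recall_at_k(ranked_files, gold_file, k=10):
    """Return 1 if the primary correct file is among the first k results, else 0."""
    return int(gold_file in ranked_files[:k])


def mean_recall_at_k(results, k=10):
    """Average per-question hits; `results` is a list of (ranked_files, gold_file)."""
    hits = [recall_at_k(ranked, gold, k) for ranked, gold in results]
    return sum(hits) / len(hits)


# Toy data: 30 questions; for 27 of them the gold file sits at rank 3,
# for the remaining 3 it is absent from the ranking.
results = []
for i in range(30):
    ranked = [f"file_{j}.java" for j in range(20)]
    gold = "file_2.java" if i < 27 else "missing.java"
    results.append((ranked, gold))

print(round(mean_recall_at_k(results), 3))  # 0.9
```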

The model ranking also reports a `95% CI`, a 95% bootstrap confidence interval.
To calculate it, I repeatedly resampled the set of questions with replacement
and recalculated recall for each sample. A wide interval means that thirty
questions are insufficient for precisely estimating small differences.
Overlapping intervals are not themselves a formal pairwise test, but they warn
against treating the order of neighboring rows as stable.
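The resampling procedure can be sketched as follows. This is a minimal percentile bootstrap and only an approximation of what the project's script does: the 2,000 samples and seed 42 match the settings mentioned at the end of this article, while the percentile method, the function name, and the toy hit vector are assumptions.

```python
import random


def bootstrap_ci(hits, n_samples=2000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for mean recall over per-question 0/1 hits."""
    rng = random.Random(seed)
    n = len(hits)
    means = []
    for _ in range(n_samples):
        # Resample the questions with replacement and recompute recall.
        sample = [hits[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_samples)]
    upper = means[int((1 - alpha / 2) * n_samples) - 1]
    return lower, upper


# Toy input: 27 hits out of 30 questions (recall@10 = 0.900).
hits = [1] * 27 + [0] * 3
low, high = bootstrap_ci(hits)
print(f"[{low:.3f}, {high:.3f}]")
```

With only thirty 0/1 outcomes, each resampled mean moves in steps of 1/30, which is why the intervals in the ranking are wide.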

The chunking label `c500-o100` means fragments of 500 characters with a
100-character overlap. `whole-file` means that an entire file is indexed as one
fragment.
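A minimal sketch of fixed-size chunking with overlap, assuming a plain character-based sliding window. The function name `chunk_text` is hypothetical, and the benchmark's own splitter may treat file boundaries or trailing fragments differently.

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into fixed-size character chunks with the given overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


# c500-o100 applied to a 1,200-character stand-in for a source file.
source = "".join(chr(97 + i % 26) for i in range(1200))
chunks = chunk_text(source, size=500, overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```

Each chunk starts 400 characters after the previous one, so the last 100 characters of one fragment are repeated at the start of the next.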

I did not test the complete Cartesian product of all parameters. Models were
compared under a fixed baseline configuration; local and commercial models were
tested across five chunking configurations; and `VECTOR_ONLY` was compared
with `HYBRID_RRF` at the fixed `c1500-o200` chunking. The full interaction
between retrieval mode and chunking remained outside the study. BM25 was run as
a single baseline because it does not depend on an embedding model.

To compare embedding models, the remaining retrieval-pipeline parameters must
be fixed. Otherwise, it is impossible to tell whether a difference was caused
by the model, fragment size, or retrieval method.

The baseline comparison used natural `human` questions, `HYBRID_RRF`, and
`c1500-o200` chunking. For each model, I measured the share of thirty questions
for which the correct file appeared in the first ten results.

This ranking compares models under identical conditions, but it does not
describe their quality outside the selected configuration. For example, OpenAI
`text-embedding-3-large` achieved `recall@10 = 0.833` with `c1500-o200`,
`0.900` with the smaller `c500-o100` fragments, and `0.433` when files were
indexed whole.

The value `0.833` therefore cannot be treated as an independent property of the
model. It describes one combination of model, chunking, retrieval mode, corpus,
and question set. The baseline ranking is a useful starting point, but it
cannot identify the best configuration without testing the other parameters.

Ideally, code should be split along logical boundaries such as methods, classes,
or other structural units. Structural chunking, however, requires a dedicated
parser for every language.

This study deliberately uses a polyglot Java and Scala project. I therefore
split the code into fixed-size fragments. This is not presented as the optimal
way to index code; it provides a common denominator across languages and makes
it possible to isolate the effect of fragment size.

Every value in the table is `recall@10` for natural `human` questions using
hybrid retrieval. The best observed result for each model is shown in bold.

| Model | c500-o100 | c1500-o200 | c3000-o300 | c5000-o500 | whole-file | Type |
|---|---|---|---|---|---|---|
| all-minilm | **0.733** | 0.700 | 0.667 | 0.700 | 0.567 | local |
| bge-m3 | **0.833** | 0.767 | 0.767 | 0.633 | 0.733 | local |
| granite-embedding | **0.800** | 0.767 | 0.767 | 0.633 | 0.433 | local |
| mxbai-embed-large | **0.900** | 0.833 | 0.733 | 0.633 | 0.433 | local |
| nomic-embed-text | **0.733** | 0.667 | 0.700 | 0.633 | 0.133 | local |
| qwen3-embedding:0.6b | 0.800 | 0.800 | **0.933** | 0.833 | 0.900 | local |
| snowflake-arctic-embed2 | **0.800** | **0.800** | **0.800** | 0.733 | 0.733 | local |
| EmbeddingsGigaR | 0.767 | 0.767 | **0.800** | 0.789 | 0.167 | commercial |
| GigaEmbeddings-3B | 0.767 | **0.867** | 0.833 | 0.767 | 0.300 | commercial |
| codestral-embed | 0.800 | **0.900** | 0.867 | 0.800 | 0.467 | commercial |
| mistral-embed-2312 | 0.800 | 0.800 | **0.900** | 0.800 | 0.400 | commercial |
| text-embedding-3-large | **0.900** | 0.833 | 0.867 | 0.800 | 0.433 | commercial |
| text-embedding-3-small | **0.867** | **0.867** | **0.867** | **0.867** | 0.433 | commercial |
| voyage-4-large | **0.900** | 0.833 | 0.867 | 0.833 | 0.867 | commercial |
| voyage-code-2 | **0.933** | 0.867 | 0.800 | 0.800 | 0.533 | commercial |
| voyage-code-3 | **0.900** | 0.867 | **0.900** | 0.833 | 0.833 | commercial |

For five of the seven local models, `c500-o100` produced the highest observed
result. One possible explanation is that a small fragment contains less
unrelated code. Its embedding can describe a local implementation more
precisely, while BM25 benefits from matching specific terms.

The experiment does not establish this mechanism directly. Doing so would
require inspecting retrieved fragments and comparing hybrid and vector-only
search at every chunk size.

`qwen3-embedding:0.6b` achieved its highest result with `c3000-o300` and still
reached `0.900` when indexing whole files. Unlike most local models, it retained
quality on larger fragments.

A possible explanation is the model's ability to process longer context. A
larger fragment preserves relationships between methods and their surrounding
class that smaller fragments may lose. A similar pattern appeared for
`mistral-embed-2312`, `EmbeddingsGigaR`, and partly for `voyage-code-3`.

This remains a hypothesis: the experiment measured retrieval outcomes, not the
internal cause of each model's behavior.

With `whole-file`, results ranged from `0.133` to `0.900`. The approach
remained viable for `qwen3-embedding:0.6b`, `voyage-4-large`, and
`voyage-code-3`, but quality dropped sharply for `nomic-embed-text` and
`EmbeddingsGigaR`.

The likely explanation is context-window limits and truncation of long files.
Because I did not directly measure truncation by provider tokenizers, this must
also remain a hypothesis.

The matrix does not reveal a universally best fragment size. Instead, it shows
three kinds of behavior:

Chunking should therefore be selected together with the embedding model. When
tuning time is limited, `c500-o100` is a reasonable starting point, but at
least one larger alternative should also be tested, and `whole-file` should
not be used without separate validation.

After choosing how to split the code, the next question is how to retrieve the
relevant fragments. The experiment compared three modes:

- `BM25_ONLY` matches words in the query against words in the code;
- `VECTOR_ONLY` compares semantic similarity between embeddings;
- `HYBRID_RRF` combines the rank positions from BM25 and vector search.

The retrieval-mode comparison used `c1500-o200`. In an earlier experiment, the
combination `c1500-o200 + HYBRID_RRF` produced the strongest result available
at the time and became the control configuration for later runs.

The subsequent chunking matrix showed that there is no universally optimal
fragment size. Keeping `c1500-o200`, however, allowed retrieval modes to be
compared under identical conditions without mixing their effect with a
chunking change.

The full matrix of retrieval modes and chunking configurations was not tested.
The results below therefore describe retrieval-mode behavior only at
`c1500-o200`.

Every value is `recall@10` for natural `human` questions. The best mode for
each model is shown in bold.

| Model | BM25_ONLY | VECTOR_ONLY | HYBRID_RRF | Type |
|---|---|---|---|---|
| No embedding model | **0.600** | n/a | n/a | lexical baseline |
| all-minilm | n/a | 0.667 | **0.700** | local |
| bge-m3 | n/a | **0.867** | 0.767 | local |
| granite-embedding | n/a | 0.733 | **0.767** | local |
| mxbai-embed-large | n/a | **0.833** | **0.833** | local |
| nomic-embed-text | n/a | **0.667** | **0.667** | local |
| qwen3-embedding:0.6b | n/a | **0.800** | **0.800** | local |
| snowflake-arctic-embed2 | n/a | **0.900** | 0.800 | local |
| EmbeddingsGigaR | n/a | 0.711 | **0.767** | commercial |
| GigaEmbeddings-3B | n/a | 0.833 | **0.867** | commercial |
| codestral-embed | n/a | **0.967** | 0.900 | commercial |
| mistral-embed-2312 | n/a | **0.900** | 0.800 | commercial |
| text-embedding-3-large | n/a | **0.867** | 0.833 | commercial |
| text-embedding-3-small | n/a | 0.833 | **0.867** | commercial |
| voyage-4-large | n/a | **0.878** | 0.833 | commercial |
| voyage-code-2 | n/a | **0.933** | 0.867 | commercial |
| voyage-code-3 | n/a | **0.933** | 0.867 | commercial |

Commercial-model values are averaged across three repeated runs, so some
values are not multiples of one question out of thirty.

Adding BM25 to vector search helped two local and three commercial models. It
made no difference for three local models. In the remaining cases, hybrid
retrieval reduced `recall@10`.
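The tally can be re-derived from the retrieval-mode table. The sketch below copies the `VECTOR_ONLY` and `HYBRID_RRF` values from that table and counts the three outcomes; the dictionary layout and the `classify` helper are illustrative, and the comparison is on point estimates only, with no significance test.

```python
from collections import Counter

# (VECTOR_ONLY, HYBRID_RRF) recall@10 at c1500-o200, copied from the table.
modes = {
    "all-minilm": (0.667, 0.700),
    "bge-m3": (0.867, 0.767),
    "granite-embedding": (0.733, 0.767),
    "mxbai-embed-large": (0.833, 0.833),
    "nomic-embed-text": (0.667, 0.667),
    "qwen3-embedding:0.6b": (0.800, 0.800),
    "snowflake-arctic-embed2": (0.900, 0.800),
    "EmbeddingsGigaR": (0.711, 0.767),
    "GigaEmbeddings-3B": (0.833, 0.867),
    "codestral-embed": (0.967, 0.900),
    "mistral-embed-2312": (0.900, 0.800),
    "text-embedding-3-large": (0.867, 0.833),
    "text-embedding-3-small": (0.833, 0.867),
    "voyage-4-large": (0.878, 0.833),
    "voyage-code-2": (0.933, 0.867),
    "voyage-code-3": (0.933, 0.867),
}


def classify(vector, hybrid):
    """Label the effect of adding BM25 to vector search for one model."""
    if hybrid > vector:
        return "hybrid helped"
    if hybrid < vector:
        return "hybrid hurt"
    return "no difference"


summary = Counter(classify(v, h) for v, h in modes.values())
print(dict(summary))
```

On these values, hybrid retrieval helped five models, changed nothing for three, and lowered the point estimate for the remaining eight.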

Among local models, the clearest differences appeared for `bge-m3` and
`snowflake-arctic-embed2`: vector-only search improved their results by
`0.100`. Among commercial models, `mistral-embed-2312` showed the same
improvement.

One possible explanation is that BM25 helps when the correct file contains
query terms missed by vector search. It can also promote lexically similar but
semantically incorrect files and weaken an already strong vector ranking. The
experiment did not test this mechanism directly.

For natural questions, BM25 achieved `recall@10 = 0.600`, below every tested
embedding-based combination. Its result, however, depended strongly on query
language.

For queries composed of technical terms and keywords, BM25 reached
`0.833–0.867`. With inaccurate terminology, it fell to `0.400`. Lexical search
works well when the developer already knows the names of relevant entities,
but it is less effective when system behavior is described in the developer's
own words.

The choice of retrieval mode, like the choice of chunking, depends on the
embedding model. Hybrid retrieval cannot be assumed to improve vector search:
it helped some models, left some unchanged, and reduced the results of others.

A practical evaluation should compare at least `VECTOR_ONLY` and `HYBRID_RRF`
on the selected model and representative queries. BM25 remains both a useful
control point and a standalone option for precise technical searches.

The same question about code can be expressed in different ways. A developer
may describe system behavior in natural language, list known technical terms,
or use a plausible but incorrect name for a component.

To test retrieval robustness under these changes, each question was represented
in several forms:

- `human`: a natural developer question;
- `ai_optimized`: a detailed query using precise technical terminology;
- `keyword`: a short list of keywords;
- `wrong_terminology`: the original intent with one controlled terminology error;
- `cross_module`: a question connecting multiple system components.

The construction rules for these variants are described in
[Part 1 of the study](https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l).

This comparison fixed chunking at `c1500-o200` and used `HYBRID_RRF`. Every
value is `recall@10`. The `cross_module` variant existed for only ten
applicable questions, while the other results were calculated across all
thirty.

| Model | human | ai_optimized | keyword | wrong_terminology | cross_module | Type |
|---|---|---|---|---|---|---|
| BM25 without an embedding model | 0.600 | 0.833 | 0.867 | 0.400 | 0.600 | baseline |
| all-minilm | 0.700 | 0.933 | 0.933 | 0.433 | 0.600 | local |
| bge-m3 | 0.767 | 1.000 | 0.833 | 0.633 | 0.700 | local |
| granite-embedding | 0.767 | 1.000 | 0.967 | 0.567 | 0.700 | local |
| mxbai-embed-large | 0.833 | 0.967 | 0.867 | 0.633 | 0.700 | local |
| nomic-embed-text | 0.667 | 0.700 | 0.733 | 0.467 | 0.700 | local |
| qwen3-embedding:0.6b | 0.800 | 1.000 | 0.900 | 0.600 | 0.700 | local |
| snowflake-arctic-embed2 | 0.800 | 1.000 | 1.000 | 0.633 | 0.700 | local |
| EmbeddingsGigaR | 0.767 | 1.000 | 0.900 | 0.600 | 0.700 | commercial |
| GigaEmbeddings-3B | 0.867 | 1.000 | 0.800 | 0.633 | 0.700 | commercial |
| codestral-embed | 0.900 | 1.000 | 1.000 | 0.667 | 0.700 | commercial |
| mistral-embed-2312 | 0.800 | 1.000 | 1.000 | 0.633 | 0.700 | commercial |
| text-embedding-3-large | 0.833 | 1.000 | 0.933 | 0.600 | 0.700 | commercial |
| text-embedding-3-small | 0.867 | 1.000 | 0.967 | 0.567 | 0.700 | commercial |
| voyage-4-large | 0.833 | 1.000 | 1.000 | 0.733 | 0.700 | commercial |
| voyage-code-2 | 0.867 | 1.000 | 1.000 | 0.700 | 0.700 | commercial |
| voyage-code-3 | 0.867 | 1.000 | 1.000 | 0.700 | 0.700 | commercial |

The `ai_optimized` variant contains class names, component names, and
operations already present in the code. On these queries, all nine commercial
and four of the seven local models reached `recall@10 = 1.000`.

This does not mean that code retrieval is solved. It shows that Code-RAG works
far better when the user already knows the terminology and approximate
location of the answer. In practice, retrieval is often needed precisely
because that knowledge is missing.

Short `keyword` queries also performed well. Even BM25 reached `0.867`, because
the keywords often matched names and terms in the source code directly.

Replacing one term with a plausible but incorrect alternative reduced the
result of every tested model. Among commercial models, the drop relative to
`human` ranged from `0.100` for `voyage-4-large` to `0.300` for
`text-embedding-3-small`.

The model ranked first on natural questions was not the most robust to
terminology distortion. `codestral-embed` fell from `0.900` to `0.667`, while
`voyage-4-large` fell from `0.833` to `0.733`.

Code specialization also failed to predict robustness. The smallest and
largest observed drops among commercial models both belonged to
general-purpose models.
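The drops can be recomputed from the `human` and `wrong_terminology` columns of the query-phrasing table. The sketch below copies the commercial-model values from that table; the variable names are illustrative.

```python
# (human, wrong_terminology) recall@10 for the commercial models,
# copied from the query-phrasing table (c1500-o200, HYBRID_RRF).
scores = {
    "EmbeddingsGigaR": (0.767, 0.600),
    "GigaEmbeddings-3B": (0.867, 0.633),
    "codestral-embed": (0.900, 0.667),
    "mistral-embed-2312": (0.800, 0.633),
    "text-embedding-3-large": (0.833, 0.600),
    "text-embedding-3-small": (0.867, 0.567),
    "voyage-4-large": (0.833, 0.733),
    "voyage-code-2": (0.867, 0.700),
    "voyage-code-3": (0.867, 0.700),
}

# Drop in recall@10 caused by one controlled terminology error.
drops = {model: round(human - wrong, 3) for model, (human, wrong) in scores.items()}

smallest = min(drops, key=drops.get)
largest = max(drops, key=drops.get)

for model, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{model:<24} {drop:.3f}")
```

Sorting by drop rather than by `human` score gives a different order of models, which is the point of the paragraph above: the ranking on natural questions does not carry over to distorted terminology.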

The `cross_module` values barely distinguish the embedding models: every model
except `all-minilm` received `0.700`, while `all-minilm` received `0.600`.
This variant existed for only ten questions, so the result cannot be
interpreted as evidence of equal robustness.

A meaningful comparison would require a separate question set focused on
relationships between modules and containing more examples of this type.

Query phrasing is another parameter of the retrieval pipeline. Precise
terminology can bring almost every model close to the maximum result, while a
small terminology error can reduce quality substantially.

Model selection should therefore account for where queries come from. A system
for developers familiar with the codebase and a system for new team members or
non-technical users may require different configurations.

For the baseline comparison, every model ran under the same conditions:
natural `human` questions, `HYBRID_RRF`, and `c1500-o200` chunking.

| Rank | Model | Model types in row | Recall@10 | 95% CI |
|---|---|---|---|---|
| 1 | mistral/codestral-embed | commercial | 0.900 | [0.800–1.000] |
| 2–5 | GigaEmbeddings-3B, text-embedding-3-small, voyage-code-2, voyage-code-3 | commercial | 0.867 | [0.733–0.967] |
| 6–8 | mxbai-embed-large, text-embedding-3-large, voyage-4-large | local and commercial | 0.833 | [0.700–0.967] |
| 9–11 | qwen3-embedding:0.6b, snowflake-arctic-embed2 | local | 0.800 | [0.633–0.933] |
| 9–11 | mistral-embed-2312 | commercial | 0.800 | [0.666–0.933] |
| 12–14 | bge-m3, granite-embedding, EmbeddingsGigaR | local and commercial | 0.767 | [0.600–0.900] |
| 15 | all-minilm | local | 0.700 | [0.533–0.867] |
| 16 | nomic-embed-text | local | 0.667 | [0.500–0.833] |

At first glance, this looks like a conventional ranking: a specialized
commercial model takes first place, and the remaining results gradually fall
from `0.867` to `0.667`. With thirty questions, however, a difference of
`0.033` represents only one retrieved file.

One question separates `codestral-embed` from the next four models. Those four
retrieved the correct files for the same 26 questions out of 30, while the
leader retrieved one additional file. A paired bootstrap analysis showed that
the confidence interval for every pairwise difference among the top five
models included zero. The available data is therefore insufficient to treat
their order as stable.
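A paired bootstrap resamples questions once per iteration and applies the same sample to both models, so the interval describes the difference on shared questions. The sketch below is an assumption about the general technique, not the study's script: the function name, the percentile method, and the toy per-question hits are all illustrative.

```python
import random


def paired_bootstrap_diff(hits_a, hits_b, n_samples=2000, seed=42, alpha=0.05):
    """Percentile CI for mean(hits_a) - mean(hits_b), resampling questions jointly."""
    if len(hits_a) != len(hits_b):
        raise ValueError("both models must be scored on the same questions")
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = []
    for _ in range(n_samples):
        # One shared resample of question indices for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(hits_a[i] - hits_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_samples)]
    upper = diffs[int((1 - alpha / 2) * n_samples) - 1]
    return lower, upper


# Toy data: model A finds 27 of 30 files, model B finds 26,
# and the two models disagree on three questions.
hits_a = [1] * 25 + [1, 1, 0, 0, 0]
hits_b = [1] * 25 + [0, 0, 1, 0, 0]

low, high = paired_bootstrap_diff(hits_a, hits_b)
print(f"difference CI: [{low:.3f}, {high:.3f}]")
```

When the interval contains zero, as it does for this toy one-question gap, the data does not support ordering the two models.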

The separation between local and commercial models was also less pronounced
than expected. Local `mxbai-embed-large` achieved `0.833`, two correct answers
behind the leader out of thirty, and its confidence interval overlaps those of
commercial models.

Larger and more expensive models did not always produce better results.
`text-embedding-3-large` achieved `0.833`, while the cheaper
`text-embedding-3-small` reached `0.867`. The compact 62 MB
`granite-embedding` tied the 1.2 GB `bge-m3` at `0.767`. Two generations of
specialized Voyage models, `voyage-code-2` and `voyage-code-3`, also completed
the baseline comparison with the same result of `0.867`.

These observations do not prove that the models are equal: thirty questions
are insufficient for confidently comparing small differences. They do show
why an embedding-model ranking is meaningful only together with its
measurement conditions. Changing chunking, retrieval mode, or query phrasing
can alter both the metric and the order of models in the table.

The experiment does not identify one configuration suitable for every
Code-RAG project. The decision depends on the queries the system will receive,
where it will run, and how much time is available for tuning.

| Requirement | Candidate | Evidence from this study |
|---|---|---|
| Highest observed `recall@10` | `codestral-embed` + `c1500-o200` + `VECTOR_ONLY` | Highest point estimate among completed runs: `0.967` |
| No commercial APIs | `qwen3-embedding:0.6b` + `c3000-o300` | Highest tuned local-model result: `0.933` |
| Minimal initial tuning | `voyage-4-large` or `voyage-code-3` | Smallest observed range across chunking configurations: `0.067` |
| Restricted memory | `granite-embedding` | A 62 MB model with the same baseline point estimate, `0.767`, as the 1.2 GB `bge-m3` |
| Precise technical queries | BM25 as standalone search or a baseline | BM25 reached `recall@10 = 0.867` on `keyword` queries without embedding infrastructure |
| Inaccurate user queries | Vector or hybrid retrieval after testing the chosen model | The advantage of semantic retrieval over BM25 was most visible with inaccurate terminology |

These candidates reflect the best observed results inside the experiment, not
universal production configurations. For example, `codestral-embed` produced
the highest `recall@10`, but its advantage was measured at one chunking
configuration, and statistical superiority over nearby models was not
established. Local models avoid API charges but move cost into hardware and
operations.

Practical Code-RAG tuning should begin with a description of future queries,
not a large model leaderboard. If the system is used by developers familiar
with project terminology, BM25 may be a strong starting point. If questions
come from new team members or users describing behavior in their own words,
semantic retrieval becomes more important.

The next step is to choose a short list of models that meet cost, memory, and
deployment constraints. For each candidate, test several fragment sizes, then
compare `VECTOR_ONLY` and `HYBRID_RRF`. A chunking or retrieval-mode choice
should not be transferred from one model to another without retesting.

The final comparison should include not only convenient technical queries, but
also natural phrasing, inaccurate terminology, and plausible but incorrect
files. Results should be retained per question so that an apparent advantage
can be traced to a stable pattern rather than a few favorable examples.

In practice, selecting a Code-RAG configuration becomes a process of narrowing
the search space:

The central result of this study is that an embedding model cannot be evaluated
independently from the retrieval pipeline around it. Each model has its own
effective combination of chunking, retrieval mode, and query phrasing.

These parameters are connected. Fragment size determines how much code enters
an embedding. Retrieval mode sets the balance between semantic similarity and
exact terminology. Query phrasing determines how easily the system can connect
a developer's intent to the vocabulary of the source code.

There is therefore no universal ranking of Code-RAG models. Models can only be
compared under explicit conditions: on a particular codebase, with a selected
chunking strategy and retrieval mode, and for a known distribution of user
queries.

The practical question is not "Which model is best?" but "Which configuration
best solves the tasks of this project's users?" Answering it requires joint
tuning of the pipeline and evaluation on a project-specific question set.

This study used one polyglot project and a small gold set, so its selected
configurations should not be transferred to other codebases without retesting.
Part 3 will describe the reproducible benchmark harness used to run these
comparisons.

The raw results are published with the project. The main tables in this article
can be verified using these artifacts:

```
results/E003-full/all.parquet
results/E007-commercial-chunking-merged/all.parquet
results/E004-vector/all.parquet
results/E005-production/all.parquet
results/E006-production-merged/all.parquet
results/E007-commercial-vector-merged/all.parquet
results/E004-bm25/all.parquet
results/forest_plot_data.csv
```

The confidence intervals for the baseline ranking can be recalculated from the
source CSV files:

```
git clone https://github.com/Daeryss/karta-rag-map
cd karta-rag-map
python3 -m venv .venv
.venv/bin/pip install -r scripts/requirements.txt
.venv/bin/python scripts/bootstrap_cis.py \
  results/E005-production/all.csv \
  results/E006-production-merged/all.csv
```

The script fixes the baseline conditions at `k=10`, the `human` query variant,
2,000 bootstrap samples, and seed `42`.

*A Cognitive Benchmark for Code-RAG Retrieval · Part 2 of 3 · Previous:
Part 1 — Methodology · Next: Part 3 —
Engineering a Reproducible Benchmark*