{"slug": "a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on", "title": "A Cognitive Benchmark for Code-RAG Retrieval: Part 2 — Why Model Rankings Depend on the Pipeline", "summary": "A developer built a Code-RAG benchmark on Apache Kafka 4.0.0 to study how retrieval pipeline choices affect model rankings. Comparing 16 embedding models, 5 chunking strategies, and 3 retrieval modes across 30 questions, the results show that model rankings are not absolute but depend on specific configurations like chunking, retrieval mode, and query phrasing.", "body_md": "When developers enter an unfamiliar project, they rarely search for a specific file by name. They usually ask about system behavior: where incoming connections are accepted, which component cleans logs, or how a request travels between architectural layers.\n\nCode-RAG tries to answer such questions through semantic search. It splits and indexes the source code, then retrieves the context most closely related to a developer's query.\n\nThe quality of this search is often reduced to the choice of embedding model: compare several candidates and select the one with the highest metric. In practice, the result also depends on how the code was split and which retrieval mode was used.\n\nTo study these dependencies, I built a Code-RAG benchmark on the Apache Kafka 4.0.0 broker core, a real polyglot project written in Java and Scala. For thirty questions about system behavior, I identified the correct files in advance, allowing me to measure how accurately the retrieval pipeline finds the relevant code.\n\nThe results show that a model ranking exists only within a specific configuration. 
Changing the chunking strategy, retrieval mode, or query phrasing can change both the metric value and the order of models in the ranking.\n\nIn this part of the study, I compare sixteen embedding models, five chunking configurations, and three retrieval modes: BM25, vector search, and hybrid search. Each of the thirty questions was expressed in up to five forms, ranging from a natural developer question to a query with inaccurate terminology or a reference to a neighboring module. The structure of these variants and the evaluation methodology are described in [Part 1 of the study](https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l).\n\nI compared four groups of variables:\n\n| Variable | What changed | What it tested |\n|---|---|---|\n| Embedding model | 7 local models through Ollama and 9 commercial APIs | How strongly quality depends on the vector representation of the query and code |\n| Chunking | Whole-file indexing and four fixed-size chunk configurations with overlap | How the indexed fragment size affects a particular model |\n| Retrieval mode | `BM25_ONLY`, `VECTOR_ONLY`, `HYBRID_RRF` | Whether lexical search, vector search, or their combination works best |\n| Query phrasing | Natural question, technical query, keywords, inaccurate terminology, and selected cross-module queries | How strongly the result depends on the language of the query |\n\nThe three retrieval modes work as follows:\n\n| Mode | How the ranking is produced |\n|---|---|\n| `BM25_ONLY` | Lucene lexical search. Files rank highly when query terms match terms in the code. No embedding model is used. |\n| `VECTOR_ONLY` | The query and code fragments are converted into embeddings. Ranking is based on vector similarity, so exact word overlap is unnecessary. |\n| `HYBRID_RRF` | BM25 and vector search run independently, then their positions are combined using Reciprocal Rank Fusion. 
RRF uses rank positions rather than directly adding incomparable scores. |\n\nThe primary metric in this article is `recall@10`. For a single question, it equals 1 when the primary correct file appears in the first ten results and 0 otherwise. The final value is the average across thirty questions. For example, `recall@10 = 0.900` means that the correct file appeared in the top ten for 27 of 30 questions.\n\nThe model ranking also reports a `95% CI`, a 95% bootstrap confidence interval. To calculate it, I repeatedly resampled the set of questions with replacement and recalculated recall for each sample. A wide interval means that thirty questions are insufficient for precisely estimating small differences. Overlapping intervals are not themselves a formal pairwise test, but they warn against treating the order of neighboring rows as stable.\n\nThe chunking label `c500-o100` means fragments of 500 characters with a 100-character overlap. `whole-file` means that an entire file is indexed as one fragment.\n\nI did not test the complete Cartesian product of all parameters. Models were compared under a fixed baseline configuration; local and commercial models were tested across five chunking configurations; and `VECTOR_ONLY` was compared with `HYBRID_RRF` at the fixed `c1500-o200` chunking. The full interaction between retrieval mode and chunking remained outside the study. BM25 was run as a single baseline because it does not depend on an embedding model.\n\nTo compare embedding models, the remaining retrieval-pipeline parameters must be fixed. Otherwise, it is impossible to tell whether a difference was caused by the model, fragment size, or retrieval method.\n\nThe baseline comparison used natural `human` questions, `HYBRID_RRF`, and `c1500-o200` chunking. 
For each model, I measured the share of thirty questions for which the correct file appeared in the first ten results.\n\nThis ranking compares models under identical conditions, but it does not describe their quality outside the selected configuration. For example, OpenAI `text-embedding-3-large` achieved `recall@10 = 0.833` with `c1500-o200`, `0.900` with the smaller `c500-o100` fragments, and `0.433` when files were indexed whole.\n\nThe value `0.833` therefore cannot be treated as an independent property of the model. It describes one combination of model, chunking, retrieval mode, corpus, and question set. The baseline ranking is a useful starting point, but it cannot identify the best configuration without testing the other parameters.\n\nIdeally, code should be split along logical boundaries such as methods, classes, or other structural units. Structural chunking, however, requires a dedicated parser for every language.\n\nThis study deliberately uses a polyglot Java and Scala project. I therefore split the code into fixed-size fragments. This is not presented as the optimal way to index code; it provides a common denominator across languages and makes it possible to isolate the effect of fragment size.\n\nEvery value in the table is `recall@10` for natural `human` questions using hybrid retrieval. 
The best observed result for each model is shown in bold.\n\n| Model | c500-o100 | c1500-o200 | c3000-o300 | c5000-o500 | whole-file | Type |\n|---|---|---|---|---|---|---|\n| all-minilm | **0.733** | 0.700 | 0.667 | 0.700 | 0.567 | local |\n| bge-m3 | **0.833** | 0.767 | 0.767 | 0.633 | 0.733 | local |\n| granite-embedding | **0.800** | 0.767 | 0.767 | 0.633 | 0.433 | local |\n| mxbai-embed-large | **0.900** | 0.833 | 0.733 | 0.633 | 0.433 | local |\n| nomic-embed-text | **0.733** | 0.667 | 0.700 | 0.633 | 0.133 | local |\n| qwen3-embedding:0.6b | 0.800 | 0.800 | **0.933** | 0.833 | 0.900 | local |\n| snowflake-arctic-embed2 | **0.800** | **0.800** | **0.800** | 0.733 | 0.733 | local |\n| EmbeddingsGigaR | 0.767 | 0.767 | **0.800** | 0.789 | 0.167 | commercial |\n| GigaEmbeddings-3B | 0.767 | **0.867** | 0.833 | 0.767 | 0.300 | commercial |\n| codestral-embed | 0.800 | **0.900** | 0.867 | 0.800 | 0.467 | commercial |\n| mistral-embed-2312 | 0.800 | 0.800 | **0.900** | 0.800 | 0.400 | commercial |\n| text-embedding-3-large | **0.900** | 0.833 | 0.867 | 0.800 | 0.433 | commercial |\n| text-embedding-3-small | **0.867** | **0.867** | **0.867** | **0.867** | 0.433 | commercial |\n| voyage-4-large | **0.900** | 0.833 | 0.867 | 0.833 | 0.867 | commercial |\n| voyage-code-2 | **0.933** | 0.867 | 0.800 | 0.800 | 0.533 | commercial |\n| voyage-code-3 | **0.900** | 0.867 | **0.900** | 0.833 | 0.833 | commercial |\n\nFor five of the seven local models, `c500-o100` alone produced the highest observed result; `snowflake-arctic-embed2` tied at `0.800` across the three smallest configurations. One possible explanation is that a small fragment contains less unrelated code. Its embedding can describe a local implementation more precisely, while BM25 benefits from matching specific terms.\n\nThe experiment does not establish this mechanism directly. Doing so would require inspecting retrieved fragments and comparing hybrid and vector-only search at every chunk size.\n\n`qwen3-embedding:0.6b` achieved its highest result with `c3000-o300` and still reached `0.900` when indexing whole files. 
Unlike most local models, it retained quality on larger fragments.\n\nA possible explanation is the model's ability to process longer context. A larger fragment preserves relationships between methods and their surrounding class that smaller fragments may lose. A similar pattern appeared for `mistral-embed-2312`, `EmbeddingsGigaR`, and partly for `voyage-code-3`.\n\nThis remains a hypothesis: the experiment measured retrieval outcomes, not the internal cause of each model's behavior.\n\nWith `whole-file`, results ranged from `0.133` to `0.900`. The approach remained viable for `qwen3-embedding`, `voyage-4-large`, and `voyage-code-3`, but quality dropped sharply for `nomic-embed-text` and `EmbeddingsGigaR`.\n\nThe likely explanation is context-window limits and truncation of long files. Because I did not directly measure truncation by provider tokenizers, this must also remain a hypothesis.\n\nThe matrix does not reveal a universally best fragment size. Instead, it shows three kinds of behavior:\n\nChunking should therefore be selected together with the embedding model. When tuning time is limited, `c500-o100` is a reasonable starting point, but at least one larger alternative should also be tested, and `whole-file` should not be used without separate validation.\n\nAfter choosing how to split the code, the next question is how to retrieve the relevant fragments. The experiment compared three modes:\n\n- `BM25_ONLY` matches words in the query against words in the code;\n- `VECTOR_ONLY` compares semantic similarity between embeddings;\n- `HYBRID_RRF` combines the rank positions from BM25 and vector search.\n\nThe retrieval-mode comparison used `c1500-o200`. 
In an earlier experiment, the combination `c1500-o200 + HYBRID_RRF` produced the strongest result available at the time and became the control configuration for later runs.\n\nThe subsequent chunking matrix showed that there is no universally optimal fragment size. Keeping `c1500-o200`, however, allowed retrieval modes to be compared under identical conditions without mixing their effect with a chunking change.\n\nThe full matrix of retrieval modes and chunking configurations was not tested. The results below therefore describe retrieval-mode behavior only at `c1500-o200`.\n\nEvery value is `recall@10` for natural `human` questions. The best mode for each model is shown in bold.\n\n| Model | BM25_ONLY | VECTOR_ONLY | HYBRID_RRF | Type |\n|---|---|---|---|---|\n| No embedding model | **0.600** | — | — | lexical baseline |\n| all-minilm | — | 0.667 | **0.700** | local |\n| bge-m3 | — | **0.867** | 0.767 | local |\n| granite-embedding | — | 0.733 | **0.767** | local |\n| mxbai-embed-large | — | **0.833** | **0.833** | local |\n| nomic-embed-text | — | **0.667** | **0.667** | local |\n| qwen3-embedding:0.6b | — | **0.800** | **0.800** | local |\n| snowflake-arctic-embed2 | — | **0.900** | 0.800 | local |\n| EmbeddingsGigaR | — | 0.711 | **0.767** | commercial |\n| GigaEmbeddings-3B | — | 0.833 | **0.867** | commercial |\n| codestral-embed | — | **0.967** | 0.900 | commercial |\n| mistral-embed-2312 | — | **0.900** | 0.800 | commercial |\n| text-embedding-3-large | — | **0.867** | 0.833 | commercial |\n| text-embedding-3-small | — | 0.833 | **0.867** | commercial |\n| voyage-4-large | — | **0.878** | 0.833 | commercial |\n| voyage-code-2 | — | **0.933** | 0.867 | commercial |\n| voyage-code-3 | — | **0.933** | 0.867 | commercial |\n\nCommercial-model values are averaged across three repeated runs, so some values are not multiples of one question out of thirty.\n\nAdding BM25 to vector search helped two local and three commercial models. It made no difference for three local models. 
In the remaining cases, hybrid retrieval reduced `recall@10`.\n\nAmong local models, the clearest differences appeared for `bge-m3` and `snowflake-arctic-embed2`: vector-only search scored `0.100` higher than hybrid retrieval. Among commercial models, `mistral-embed-2312` showed the same gap.\n\nOne possible explanation is that BM25 helps when the correct file contains query terms missed by vector search. It can also promote lexically similar but semantically incorrect files and weaken an already strong vector ranking. The experiment did not test this mechanism directly.\n\nFor natural questions, BM25 achieved `recall@10 = 0.600`, below every tested embedding-based combination. Its result, however, depended strongly on query language.\n\nFor queries composed of technical terms and keywords, BM25 reached `0.833–0.867`. With inaccurate terminology, it fell to `0.400`. Lexical search works well when the developer already knows the names of relevant entities, but it is less effective when system behavior is described in the developer's own words.\n\nThe choice of retrieval mode, like the choice of chunking, depends on the embedding model. Hybrid retrieval cannot be assumed to improve vector search: it helped some models, left some unchanged, and reduced the results of others.\n\nA practical evaluation should compare at least `VECTOR_ONLY` and `HYBRID_RRF` on the selected model and representative queries. BM25 remains both a useful control point and a standalone option for precise technical searches.\n\nThe same question about code can be expressed in different ways. 
A developer may describe system behavior in natural language, list known technical terms, or use a plausible but incorrect name for a component.\n\nTo test retrieval robustness under these changes, each question was represented in several forms:\n\n- `human`: a natural developer question;\n- `ai_optimized`: a detailed query using precise technical terminology;\n- `keyword`: a short list of keywords;\n- `wrong_terminology`: the original intent with one controlled terminology error;\n- `cross_module`: a question connecting multiple system components.\n\nThe construction rules for these variants are described in [Part 1 of the study](https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l).\n\nThis comparison fixed chunking at `c1500-o200` and used `HYBRID_RRF`. Every value is `recall@10`. The `cross_module` variant existed for only ten applicable questions, while the other results were calculated across all thirty.\n\n| Model | human | ai_optimized | keyword | wrong_terminology | cross_module | Type |\n|---|---|---|---|---|---|---|\n| BM25 without an embedding model | 0.600 | 0.833 | 0.867 | 0.400 | 0.600 | baseline |\n| all-minilm | 0.700 | 0.933 | 0.933 | 0.433 | 0.600 | local |\n| bge-m3 | 0.767 | 1.000 | 0.833 | 0.633 | 0.700 | local |\n| granite-embedding | 0.767 | 1.000 | 0.967 | 0.567 | 0.700 | local |\n| mxbai-embed-large | 0.833 | 0.967 | 0.867 | 0.633 | 0.700 | local |\n| nomic-embed-text | 0.667 | 0.700 | 0.733 | 0.467 | 0.700 | local |\n| qwen3-embedding:0.6b | 0.800 | 1.000 | 0.900 | 0.600 | 0.700 | local |\n| snowflake-arctic-embed2 | 0.800 | 1.000 | 1.000 | 0.633 | 0.700 | local |\n| EmbeddingsGigaR | 0.767 | 1.000 | 0.900 | 0.600 | 0.700 | commercial |\n| GigaEmbeddings-3B | 0.867 | 1.000 | 0.800 | 0.633 | 0.700 | commercial |\n| codestral-embed | 0.900 | 1.000 | 1.000 | 0.667 | 0.700 | commercial |\n| mistral-embed-2312 | 0.800 | 1.000 | 1.000 | 0.633 | 0.700 | commercial |\n| 
text-embedding-3-large | 0.833 | 1.000 | 0.933 | 0.600 | 0.700 | commercial |\n| text-embedding-3-small | 0.867 | 1.000 | 0.967 | 0.567 | 0.700 | commercial |\n| voyage-4-large | 0.833 | 1.000 | 1.000 | 0.733 | 0.700 | commercial |\n| voyage-code-2 | 0.867 | 1.000 | 1.000 | 0.700 | 0.700 | commercial |\n| voyage-code-3 | 0.867 | 1.000 | 1.000 | 0.700 | 0.700 | commercial |\n\nThe `ai_optimized` variant contains class names, component names, and operations already present in the code. On these queries, all nine commercial and four of the seven local models reached `recall@10 = 1.000`.\n\nThis does not mean that code retrieval is solved. It shows that Code-RAG works far better when the user already knows the terminology and approximate location of the answer. In practice, retrieval is often needed precisely because that knowledge is missing.\n\nShort `keyword` queries also performed well. Even BM25 reached `0.867`, because the keywords often matched names and terms in the source code directly.\n\nReplacing one term with a plausible but incorrect alternative reduced the result of every tested model. Among commercial models, the drop relative to `human` ranged from `0.100` for `voyage-4-large` to `0.300` for `text-embedding-3-small`.\n\nThe model ranked first on natural questions was not the most robust to terminology distortion. `codestral-embed` fell from `0.900` to `0.667`, while `voyage-4-large` fell from `0.833` to `0.733`.\n\nCode specialization also failed to predict robustness. 
The smallest and largest observed drops among commercial models both belonged to general-purpose models.\n\nThe `cross_module` values barely distinguish the embedding models: every model except `all-minilm` received `0.700`, while `all-minilm` received `0.600`. This variant existed for only ten questions, so the result cannot be interpreted as evidence of equal robustness.\n\nA meaningful comparison would require a separate question set focused on relationships between modules and containing more examples of this type.\n\nQuery phrasing is another parameter of the retrieval pipeline. Precise terminology can bring almost every model close to the maximum result, while a small terminology error can reduce quality substantially.\n\nModel selection should therefore account for where queries come from. A system for developers familiar with the codebase and a system for new team members or non-technical users may require different configurations.\n\nFor the baseline comparison, every model ran under the same conditions: natural `human` questions, `HYBRID_RRF`, and `c1500-o200` chunking.\n\n| Rank | Model | Model types in row | Recall@10 | 95% CI |\n|---|---|---|---|---|\n| 1 | mistral/codestral-embed | commercial | 0.900 | [0.800–1.000] |\n| 2–5 | GigaEmbeddings-3B, text-embedding-3-small, voyage-code-2, voyage-code-3 | commercial | 0.867 | [0.733–0.967] |\n| 6–8 | mxbai-embed-large, text-embedding-3-large, voyage-4-large | local and commercial | 0.833 | [0.700–0.967] |\n| 9–11 | qwen3-embedding:0.6b, snowflake-arctic-embed2 | local | 0.800 | [0.633–0.933] |\n| 9–11 | mistral-embed-2312 | commercial | 0.800 | [0.666–0.933] |\n| 12–14 | bge-m3, granite-embedding, EmbeddingsGigaR | local and commercial | 0.767 | [0.600–0.900] |\n| 15 | all-minilm | local | 0.700 | [0.533–0.867] |\n| 16 | nomic-embed-text | local | 0.667 | [0.500–0.833] |\n\nAt first glance, this looks like a conventional ranking: a 
specialized commercial model takes first place, and the remaining results gradually fall from `0.867` to `0.667`. With thirty questions, however, a difference of `0.033` represents only one retrieved file.\n\nOne question separates `codestral-embed` from the next four models. Those four retrieved the correct files for the same 26 questions out of 30, while the leader retrieved one additional file. A paired bootstrap analysis showed that the confidence interval for every pairwise difference among the top five models included zero. The available data is therefore insufficient to treat their order as stable.\n\nThe separation between local and commercial models was also less pronounced than expected. Local `mxbai-embed-large` achieved `0.833`, two correct answers behind the leader out of thirty, and its confidence interval overlaps those of commercial models.\n\nLarger and more expensive models did not always produce better results. `text-embedding-3-large` achieved `0.833`, while the cheaper `text-embedding-3-small` reached `0.867`. The compact 62 MB `granite-embedding` tied the 1.2 GB `bge-m3` at `0.767`. Two generations of specialized Voyage models, `voyage-code-2` and `voyage-code-3`, also completed the baseline comparison with the same result of `0.867`.\n\nThese observations do not prove that the models are equal: thirty questions are insufficient for confidently comparing small differences. They do show why an embedding-model ranking is meaningful only together with its measurement conditions. Changing chunking, retrieval mode, or query phrasing can alter both the metric and the order of models in the table.\n\nThe experiment does not identify one configuration suitable for every Code-RAG project. 
The decision depends on the queries the system will receive, where it will run, and how much time is available for tuning.\n\n| Requirement | Candidate | Evidence from this study |\n|---|---|---|\n| Highest observed `recall@10` | `codestral-embed` + `c1500-o200` + `VECTOR_ONLY` | Highest point estimate among completed runs: `0.967` |\n| No commercial APIs | `qwen3-embedding:0.6b` + `c3000-o300` | Highest tuned local-model result: `0.933` |\n| Minimal initial tuning | `voyage-4-large` or `voyage-code-3` | Smallest observed range across chunking configurations: `0.067` |\n| Restricted memory | `granite-embedding` | A 62 MB model with the same baseline point estimate, `0.767`, as the 1.2 GB `bge-m3` |\n| Precise technical queries | BM25 as standalone search or a baseline | BM25 reached `recall@10 = 0.867` on `keyword` queries without embedding infrastructure |\n| Inaccurate user queries | Vector or hybrid retrieval after testing the chosen model | The advantage of semantic retrieval over BM25 was most visible with inaccurate terminology |\n\nThese candidates reflect the best observed results inside the experiment, not universal production configurations. For example, `codestral-embed` produced the highest `recall@10`, but its advantage was measured at one chunking configuration, and statistical superiority over nearby models was not established. Local models avoid API charges but move cost into hardware and operations.\n\nPractical Code-RAG tuning should begin with a description of future queries, not a large model leaderboard. If the system is used by developers familiar with project terminology, BM25 may be a strong starting point. If questions come from new team members or users describing behavior in their own words, semantic retrieval becomes more important.\n\nThe next step is to choose a short list of models that meet cost, memory, and deployment constraints. 
For each candidate, test several fragment sizes, then compare `VECTOR_ONLY` and `HYBRID_RRF`. A chunking or retrieval-mode choice should not be transferred from one model to another without retesting.\n\nThe final comparison should include not only convenient technical queries, but also natural phrasing, inaccurate terminology, and plausible but incorrect files. Results should be retained per question so that an apparent advantage can be traced to a stable pattern rather than a few favorable examples.\n\nIn practice, selecting a Code-RAG configuration becomes a process of narrowing the search space:\n\nThe central result of this study is that an embedding model cannot be evaluated independently from the retrieval pipeline around it. Each model has its own effective combination of chunking, retrieval mode, and query phrasing.\n\nThese parameters are connected. Fragment size determines how much code enters an embedding. Retrieval mode sets the balance between semantic similarity and exact terminology. Query phrasing determines how easily the system can connect a developer's intent to the vocabulary of the source code.\n\nThere is therefore no universal ranking of Code-RAG models. Models can only be compared under explicit conditions: on a particular codebase, with a selected chunking strategy and retrieval mode, and for a known distribution of user queries.\n\nThe practical question is not \"Which model is best?\" but \"Which configuration best solves the tasks of this project's users?\" Answering it requires joint tuning of the pipeline and evaluation on a project-specific question set.\n\nThis study used one polyglot project and a small gold set, so its selected configurations should not be transferred to other codebases without retesting. Part 3 will describe the reproducible benchmark harness used to run these comparisons.\n\nThe raw results are published with the project. 
The main tables in this article can be verified using these artifacts:\n\n```\nresults/E003-full/all.parquet\nresults/E007-commercial-chunking-merged/all.parquet\nresults/E004-vector/all.parquet\nresults/E005-production/all.parquet\nresults/E006-production-merged/all.parquet\nresults/E007-commercial-vector-merged/all.parquet\nresults/E004-bm25/all.parquet\nresults/forest_plot_data.csv\n```\n\nThe confidence intervals for the baseline ranking can be recalculated from the source CSV files:\n\n```\ngit clone https://github.com/Daeryss/karta-rag-map\ncd karta-rag-map\npython3 -m venv .venv\n.venv/bin/pip install -r scripts/requirements.txt\n.venv/bin/python scripts/bootstrap_cis.py \\\n  results/E005-production/all.csv \\\n  results/E006-production-merged/all.csv\n```\n\nThe script fixes the baseline conditions at `k=10`, the `human` query variant, 2,000 bootstrap samples, and seed `42`.\n\n*A Cognitive Benchmark for Code-RAG Retrieval · Part 2 of 3 · Previous: Part 1 — Methodology · Next: Part 3 — Engineering a Reproducible Benchmark*", "url": "https://wpnews.pro/news/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on", "canonical_source": "https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on-the-pipeline-12a4", "published_at": "2026-06-14 21:00:41+00:00", "updated_at": "2026-06-14 21:10:32.517398+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "natural-language-processing", "ai-research"], "entities": ["Apache Kafka", "BM25", "vector search", "hybrid search", "Lucene", "Reciprocal Rank Fusion", "Ollama"], "alternates": {"html": "https://wpnews.pro/news/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on", "markdown": "https://wpnews.pro/news/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on.md", "text": 
"https://wpnews.pro/news/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on.txt", "jsonld": "https://wpnews.pro/news/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on.jsonld"}}