My RAG pipeline looked fine on paper. Fast retrieval. Decent cosine scores. But when I tested it with real queries, the top results were always a little off. Documents that shared vocabulary with the query kept showing up instead of documents that actually answered it. The model was doing its job. The architecture was not.
The fix was not a better model. It was a second model doing a different job.
This post breaks down what that means, why it matters, and how to build the two-stage pipeline in Python.
Every search system faces a hard tradeoff between speed and accuracy.
You cannot run a deep computation against every document in a million-item corpus at query time. The latency would be unacceptable. So most pipelines use a fast embedding model to retrieve candidates, stop there, and call it done.
The result is a "close but not quite right" problem. The retrieved documents are topically related but not precisely relevant. The pipeline optimized for speed at the cost of meaning.
Single-stage pipeline: Query --> [Dual Encoder] --> Top-K results

Two-stage pipeline: Query --> [Dual Encoder] --> Top-50 candidates --> [Cross-Encoder] --> Reranked Top-5
The two models are not competing alternatives. They solve different halves of the same problem.
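A dual encoder, also called a bi-encoder or two-tower model, encodes the query and the document independently. The two towers can be separate transformer networks, but many popular models, including the one used below, run both inputs through the same shared-weight encoder. Either way, each side produces a fixed-size vector, and the system measures cosine similarity between the two vectors.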
That single number is the relevance signal.
The reason this is fast is precomputation. You encode every document at index time and store those vectors. At query time, you encode only the query, which takes milliseconds, and run an approximate nearest-neighbor search against the precomputed vectors. That search is typically sublinear in corpus size, so latency grows far more slowly than the corpus does.
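The index-time versus query-time split can be sketched without any model at all. The snippet below is a minimal, brute-force illustration: `PrecomputedIndex` is a hypothetical helper, the three-dimensional vectors are made up, and a production system would replace the linear scan with an approximate nearest-neighbor library.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

class PrecomputedIndex:
    """Hypothetical brute-force index: all document work happens once, at build time."""

    def __init__(self, doc_vectors):
        # Index time: normalize and store every document vector once.
        self.doc_vectors = [normalize(v) for v in doc_vectors]

    def search(self, query_vector, top_k=3):
        # Query time: one normalization, then one dot product per stored vector.
        q = normalize(query_vector)
        scores = [sum(a * b for a, b in zip(q, d)) for d in self.doc_vectors]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [(i, scores[i]) for i in order[:top_k]]

# Made-up 3-dimensional "embeddings" standing in for real model output.
index = PrecomputedIndex([[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]])
results = index.search([2.0, 0.0, 0.0], top_k=2)
print(results)  # list of (document position, cosine score), best first
```

Nothing about the documents is recomputed inside `search`, which is the property that lets a dual encoder scale.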
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Python is a high-level programming language.",
    "Transformer models revolutionized NLP benchmarks.",
    "Cosine similarity measures the angle between two vectors.",
    "RAG systems combine retrieval with language model generation.",
]

# Index time: encode every document once and keep the vectors.
doc_embeddings = model.encode(documents, convert_to_numpy=True)

# Query time: encode only the query.
query = "How does vector similarity work in search?"
query_embedding = model.encode(query, convert_to_numpy=True)

# Cosine similarity between the query and each document vector.
scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Print the three highest-scoring documents.
ranked = sorted(zip(scores, documents), reverse=True)
for score, doc in ranked[:3]:
    print(f"{score:.4f} | {doc}")
```
The tradeoff is that the query and document never actually interact. Each gets compressed into one vector independently. That compression loses nuance. The model has no way to understand whether the document answers the query. It only knows whether they live in the same region of vector space.
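To make that loss concrete, here is a deliberately exaggerated toy. The `toy_embed` function is a hypothetical word-count "encoder", not a real transformer, and real embedding models are far more capable. The failure mode it shows is the same in kind, though: when each text is compressed on its own, a document that contradicts the query can land almost on top of it.

```python
import math
from collections import Counter

def toy_embed(text, vocab):
    """Toy stand-in for an encoder: a word-count vector (not a real transformer)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vocab = ["the", "drug", "is", "not", "safe", "unsafe"]
query_vec = toy_embed("is the drug safe", vocab)
doc_yes = toy_embed("the drug is safe", vocab)
doc_no = toy_embed("the drug is not safe", vocab)

# Both documents score high, even though they say opposite things.
print(f"{cosine(query_vec, doc_yes):.3f} | the drug is safe")
print(f"{cosine(query_vec, doc_no):.3f} | the drug is not safe")
```

The scores reflect shared vocabulary only. Neither vector was built with any knowledge of what the query was asking.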
A cross-encoder takes the query and a candidate document as a single concatenated input: `[CLS] query [SEP] document [SEP]`. One transformer runs on this joint sequence. Every query token attends to every document token across all layers. The output is a single relevance score. Depending on the model, that score is a raw logit or, once a sigmoid is applied, a value between 0 and 1.
Because the model reads both at the same time, it catches what a dual encoder misses. It handles negation far better. It distinguishes between a document that mentions a concept and a document that answers a question about it. It scores based on whether the document actually addresses the query, not whether they share vocabulary.
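The joint input format can be sketched in a few lines. This is an illustration only: `build_pair_input` is a hypothetical helper, whitespace splitting stands in for the subword tokenizer a real model uses, and the segment ids follow the BERT convention, which not every architecture shares.

```python
def build_pair_input(query, document, max_len=32):
    """Build a BERT-style joint sequence from whitespace 'tokens'."""
    q_tokens = query.lower().split()
    d_tokens = document.lower().split()
    # Reserve three slots for [CLS] and the two [SEP] markers,
    # then truncate the document to whatever room is left.
    budget = max_len - len(q_tokens) - 3
    d_tokens = d_tokens[:max(budget, 0)]
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + d_tokens + ["[SEP]"]
    # Segment ids: 0 for the query half, 1 for the document half.
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    "stopping medication suddenly",
    "Abrupt discontinuation can cause withdrawal symptoms.",
)
print(tokens)
print(segment_ids)
```

Because the query is part of the sequence itself, none of this can be prepared before the query arrives.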
```python
from sentence_transformers.cross_encoder import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "side effects of stopping medication suddenly"
candidates = [
    "Medication dosage guidelines for common prescriptions.",
    "Abrupt discontinuation of certain medications can cause withdrawal symptoms.",
    "Drug interaction checkers and pharmacy tools.",
    "How to schedule medication reminders on your phone.",
]

# Score each (query, document) pair jointly: one forward pass per pair.
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

# Higher score means more relevant.
ranked = sorted(zip(scores, candidates), reverse=True)
for score, doc in ranked:
    print(f"{score:.4f} | {doc}")
```
The cost is real. Every cross-encoder call requires a full transformer forward pass on the combined input. Nothing can be precomputed. At query time, each candidate costs a separate forward pass. Running this against a million documents is not viable.
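A back-of-the-envelope estimate shows how quickly this adds up. The throughput figures below are hypothetical placeholders, not benchmarks, so measure your own hardware before relying on them.

```python
import math

def rerank_latency_ms(num_candidates, ms_per_batch, batch_size):
    """Estimated latency when candidates are scored in sequential batches."""
    num_batches = math.ceil(num_candidates / batch_size)
    return num_batches * ms_per_batch

# Hypothetical figures: 20 ms per batch of 32 query-document pairs.
MS_PER_BATCH = 20
BATCH_SIZE = 32

rerank_100 = rerank_latency_ms(100, MS_PER_BATCH, BATCH_SIZE)
full_corpus = rerank_latency_ms(1_000_000, MS_PER_BATCH, BATCH_SIZE)

print(f"100 candidates: {rerank_100} ms")                 # 100 candidates: 80 ms
print(f"1,000,000 documents: {full_corpus / 1000:.0f} s")  # 1,000,000 documents: 625 s
```

Under these assumptions, reranking 100 candidates fits inside an interactive latency budget, while scoring the whole corpus takes more than ten minutes per query.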
This is why you chain them.
Stage one uses the dual encoder to retrieve the top 50 to 100 candidates. High recall matters here. Any relevant document that misses this cut is permanently gone.
Stage two passes only those candidates to the cross-encoder for reranking. The candidate set is now small enough that deep joint attention is computationally viable. The reranker reorders the list. Only the top 5 to 10 results reach the user or the LLM.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np

retriever = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Stopping blood pressure medication without a doctor's guidance can be dangerous.",
    "Common blood pressure drugs include ACE inhibitors and beta blockers.",
    "Medication adherence improves outcomes in chronic disease management.",
    "Withdrawal effects vary depending on the type and duration of medication use.",
    "Pharmacists can review drug interactions and dosage schedules.",
    "Abrupt cessation of antidepressants can cause discontinuation syndrome.",
    "Always consult a physician before changing any medication regimen.",
    "Over-the-counter pain relievers are generally safe for short-term use.",
]
query = "is it dangerous to stop taking my medication without a doctor?"

# Stage 1: dual encoder retrieval by cosine similarity.
doc_embeddings = retriever.encode(corpus, convert_to_numpy=True)
query_embedding = retriever.encode(query, convert_to_numpy=True)
cosine_scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Keep the top_k highest-scoring documents as candidates.
top_k = 5
top_indices = np.argsort(cosine_scores)[::-1][:top_k]
candidates = [corpus[i] for i in top_indices]

print("Stage 1 - Dual Encoder Retrieval:")
for i, doc in enumerate(candidates):
    print(f"  {i+1}. {doc}")

# Stage 2: cross-encoder reranking of the candidates only.
pairs = [[query, doc] for doc in candidates]
rerank_scores = reranker.predict(pairs)
final_results = sorted(zip(rerank_scores, candidates), reverse=True)

print("\nStage 2 - Cross-Encoder Reranked:")
for score, doc in final_results:
    print(f"  {score:.4f} | {doc}")
```
Note for RAG builders: The ceiling of this pipeline is always stage-one recall. If the dual encoder misses a relevant document entirely, the cross-encoder never sees it. Retrieve generously with a higher top-k, then rerank aggressively.
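One way to pick that top-k is to measure stage-one recall on a small labeled set. The sketch below assumes you have, for each test query, the ids your retriever returned and the ids a human marked as relevant. The ids and labels here are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical stage-one ranking (document ids, best first) and gold labels.
retrieved = [7, 2, 9, 4, 1, 8, 3, 5]
relevant = {4, 3, 6}

for k in (3, 5, 8):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
# recall@3 = 0.00
# recall@5 = 0.33
# recall@8 = 0.67
```

Document 6 never appears in the retrieved list, so no amount of reranking can surface it. Averaging this metric over many queries shows where raising top-k stops paying off.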
| Property | Dual Encoder | Cross-Encoder |
|---|---|---|
| Input format | Query and document encoded separately | Query and document concatenated as one input |
| Output | Two vectors, cosine compared | One relevance score per pair |
| Speed | Very fast, supports precomputation | Slow, no precomputation possible |
| Accuracy | Moderate, misses nuanced relevance | High, full query-document interaction |
| Scalability | Scales to millions of documents | Practical only on 50 to 200 candidates |
| Pipeline role | Stage 1 retrieval | Stage 2 reranking |
| Example models | all-MiniLM-L6-v2, text-embedding-ada-002 | ms-marco-MiniLM-L-6-v2, Cohere Rerank, BGE-Reranker |
| Token interaction | None (independent encoding) | Full cross-attention across all layers |

Which setup fits depends on the scenario:

| Scenario | Recommended Approach |
|---|---|
| Corpus under 500 documents | Cross-encoder directly, skip dual encoder |
| Large corpus, latency is the priority | Dual encoder only |
| Large corpus, accuracy matters | Dual encoder retrieval plus cross-encoder reranking |
| High QPS, tight latency budget | Dual encoder plus ColBERT late interaction |
| Domain-specific content, general models underperform | Fine-tune the cross-encoder on your domain data |
| Multilingual corpus | BGE-Reranker or mGTE for multilingual reranking |
If cross-encoder latency is too high for your QPS budget, ColBERT is worth knowing. It sits between the two architectures. It encodes query and document separately like a dual encoder, preserving the ability to precompute document vectors. But instead of comparing two pooled vectors, it compares individual token embeddings using a MaxSim operation: for each query token, find the most similar document token across the full sequence.
This gives ColBERT much better accuracy than a standard dual encoder while keeping document precomputation intact. According to the original ColBERT paper from Stanford, it runs two orders of magnitude faster and uses four orders of magnitude fewer FLOPs per query than BERT-based rerankers while remaining competitive in retrieval quality. The RAGatouille library is one of the quicker ways to plug ColBERT into an existing pipeline.
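The MaxSim operation itself is small enough to write out. This is a minimal sketch of the scoring step only, using made-up two-dimensional unit vectors in place of real per-token embeddings. It does not cover ColBERT's encoders, compression, or indexing.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take its best match among document tokens, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Made-up unit vectors standing in for per-token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [0.8, 0.6]]              # matches only the first query token well

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(f"doc_a: {score_a:.2f}")  # doc_a: 2.00
print(f"doc_b: {score_b:.2f}")  # doc_b: 1.60
```

The document token vectors can be computed and stored ahead of time. Only the query token vectors and the MaxSim sums are computed per query.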
| Approach | Precompute Docs | Token Interaction | Speed | Accuracy |
|---|---|---|---|---|
| Dual Encoder | Yes | None | Fastest | Lowest |
| ColBERT | Yes (per token) | MaxSim per token | Fast | High |
| Cross-Encoder | No | Full cross-attention | Slowest | Highest |
What is a dual encoder in NLP?
A dual encoder encodes the query and document into separate vectors and computes cosine similarity between them for fast, scalable retrieval.
What is a cross-encoder?
A cross-encoder takes a query and document as a single concatenated input and produces one precise relevance score by attending to both simultaneously.
Why not use a cross-encoder for all retrieval?
Cross-encoders cannot precompute anything, so running one against millions of documents at query time is computationally infeasible.
What is the two-stage retrieval pipeline?
The dual encoder retrieves the top 50 to 100 candidates for speed, then the cross-encoder reranks only those candidates for precision before passing results to the user or LLM.
What is ColBERT?
ColBERT is a late-interaction model that sits between dual and cross-encoders, comparing per-token vectors with MaxSim instead of pooled vectors, giving better precision than a dual encoder without losing document precomputation.
Which cross-encoder model should I start with?
`cross-encoder/ms-marco-MiniLM-L-6-v2` from Sentence Transformers is the standard starting point for English; for managed reranking, Cohere Rerank works well in RAG pipelines.
Does LangChain support cross-encoder reranking?
Yes, both LangChain and LlamaIndex have built-in reranking steps that accept cross-encoder models as a second-stage reranker.
I write deeper technical breakdowns on search architecture, RAG systems, and AI infrastructure over at krunalkanojiya.com. The full version of this article with complete pipeline walkthroughs lives there if you want to go further.