Dual Encoder vs Cross-Encoder: Why Your RAG Pipeline Needs Both


My RAG pipeline looked fine on paper. Fast retrieval. Decent cosine scores. But when I tested it with real queries, the top results were always a little off. Documents that shared vocabulary with the query kept showing up instead of documents that actually answered it. The model was doing its job. The architecture was not.

The fix was not a better model. It was a second model doing a different job.

This post breaks down what that means, why it matters, and how to build the two-stage pipeline in Python.

Every search system faces a hard tradeoff between speed and accuracy.

You cannot run a deep computation against every document in a million-item corpus at query time. The latency would be unacceptable. So most pipelines use a fast embedding model to retrieve candidates, stop there, and call it done.

The result is a "close but not quite right" problem. The retrieved documents are topically related but not precisely relevant. The pipeline optimized for speed at the cost of meaning.

  Single-stage:  Query --> [Dual Encoder] --> Top-K results

  Two-stage:     Query --> [Dual Encoder] --> Top-50 candidates
                       --> [Cross-Encoder] --> Reranked Top-5

The two models are not competing alternatives. They solve different halves of the same problem.

A dual encoder, also called a bi-encoder or two-tower model, uses two encoder towers. One encodes the query. The other encodes the document. In many implementations the two towers are the same transformer with shared weights. Both produce a fixed-size vector. Then the system measures cosine similarity between the two vectors.

That single number is the relevance signal.

The reason this is fast is precomputation. You encode every document at index time and store those vectors. At query time, you only encode the query, which takes milliseconds, and run an approximate nearest-neighbor search against the precomputed vectors. Latency then grows only slowly with corpus size instead of linearly.

from sentence_transformers import SentenceTransformer
import numpy as np

# Small general-purpose embedding model; downloaded on first use.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Python is a high-level programming language.",
    "Transformer models revolutionized NLP benchmarks.",
    "Cosine similarity measures the angle between two vectors.",
    "RAG systems combine retrieval with language model generation.",
]
# Index time: encode every document once and keep the vectors.
doc_embeddings = model.encode(documents, convert_to_numpy=True)

query = "How does vector similarity work in search?"
# Query time: only the query needs encoding.
query_embedding = model.encode(query, convert_to_numpy=True)

# Cosine similarity between the query vector and each document vector.
scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Highest score first; sort on the score only so ties never compare strings.
ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
for score, doc in ranked[:3]:
    print(f"{score:.4f} | {doc}")
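The precomputation idea can be pushed one step further: if every document vector is scaled to unit length at index time, cosine similarity at query time reduces to a plain dot product. The following is a minimal sketch in pure Python, not the article's pipeline. The 3-dimensional vectors, the `doc_a`/`doc_b`/`doc_c` ids, and the `search` helper are hypothetical stand-ins for real embeddings and a real vector index.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Index time: normalize once and store. Toy vectors, not real embeddings.
raw_doc_vectors = {
    "doc_a": [3.0, 4.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 0.0, 1.0],
}
index = {doc_id: normalize(vec) for doc_id, vec in raw_doc_vectors.items()}


def search(query_vector, index, top_k=2):
    """Query time: normalize the query, then rank documents by dot product."""
    q = normalize(query_vector)
    scored = [(dot(q, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


results = search([0.0, 2.0, 0.0], index)
for score, doc_id in results:
    print(f"{score:.4f} | {doc_id}")
```

With Sentence Transformers, passing `normalize_embeddings=True` to `encode` should give unit-length vectors directly, so the manual normalization above is only for illustration.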

The tradeoff is that the query and document never actually interact. Each gets compressed into one vector independently. That compression loses nuance. The model has no way to understand whether the document answers the query. It only knows whether they live in the same region of vector space.

A cross-encoder takes the query and a candidate document as a single concatenated input: [CLS] query [SEP] document [SEP]. One transformer runs on this joint sequence. Every query token attends to every document token across all layers. The output is a single relevance score, either a raw logit or a value squashed into the 0 to 1 range, depending on the model and its configuration.

Because the model reads both at the same time, it catches what a dual encoder misses. It handles negation far more reliably. It distinguishes between a document that mentions a concept and a document that answers a question about it. It scores based on whether the document actually addresses the query, not whether they share vocabulary.

from sentence_transformers.cross_encoder import CrossEncoder

# Reranker trained on MS MARCO passage ranking; downloaded on first use.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "side effects of stopping medication suddenly"

candidates = [
    "Medication dosage guidelines for common prescriptions.",
    "Abrupt discontinuation of certain medications can cause withdrawal symptoms.",
    "Drug interaction checkers and pharmacy tools.",
    "How to schedule medication reminders on your phone.",
]

# Every (query, document) pair is scored jointly by the same transformer.
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

# Higher score means more relevant; sort on the score only.
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
for score, doc in ranked:
    print(f"{score:.4f} | {doc}")
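Depending on the model and library configuration, a reranker may return raw logits rather than values between 0 and 1. If you need a bounded score, for example to apply a cutoff threshold, a sigmoid maps logits into that range without changing the ranking. This sketch assumes raw logits as input; the numbers in `raw_logits` are made up for illustration and are not real model output.

```python
import math


def sigmoid(x):
    """Map a raw logit to the open interval (0, 1) in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Hypothetical logits for four candidates (not real model output).
raw_logits = [-10.2, 7.9, -8.4, -11.0]
probabilities = [sigmoid(s) for s in raw_logits]

# Sigmoid is monotonic, so sorting by logit or by probability gives the same order.
order_by_logit = sorted(range(len(raw_logits)), key=lambda i: raw_logits[i], reverse=True)
order_by_prob = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)

for i in order_by_prob:
    print(f"candidate {i}: logit={raw_logits[i]:+.1f} prob={probabilities[i]:.6f}")
```

Because the order is unchanged, the transformation only matters when downstream code expects a probability-like value.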

The cost is real. Every cross-encoder call requires a full transformer forward pass on the combined input. Nothing can be precomputed. At query time, each candidate costs a separate forward pass. Running this against a million documents is not viable.
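A rough back-of-envelope calculation makes the gap concrete. The sketch below counts model forward passes and multiplies by a per-pass latency. The 5 ms figure is an assumption chosen for illustration, not a benchmark, and the estimate ignores batching, GPU parallelism, and the nearest-neighbor search itself.

```python
def cross_encoder_only_passes(corpus_size):
    """Scoring every document with the cross-encoder: one pass per document."""
    return corpus_size


def two_stage_passes(num_candidates):
    """One query encoding for retrieval plus one cross-encoder pass per candidate."""
    return 1 + num_candidates


def estimate_ms(passes, ms_per_pass):
    """Sequential latency estimate; ignores batching and parallel hardware."""
    return passes * ms_per_pass


ASSUMED_MS_PER_PASS = 5.0  # hypothetical figure, not measured

full_scan_ms = estimate_ms(cross_encoder_only_passes(1_000_000), ASSUMED_MS_PER_PASS)
two_stage_ms = estimate_ms(two_stage_passes(100), ASSUMED_MS_PER_PASS)

print(f"cross-encoder over 1M docs: {full_scan_ms / 1000:.0f} s per query")
print(f"two-stage, 100 candidates:  {two_stage_ms:.0f} ms per query")
```

Under these assumptions the full scan takes on the order of an hour per query, while the two-stage version stays around half a second before any batching.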

This is why you chain them.

Stage one uses the dual encoder to retrieve the top 50 to 100 candidates. High recall matters here. Any relevant document that misses this cut is permanently gone.

Stage two passes only those candidates to the cross-encoder for reranking. The candidate set is now small enough that deep joint attention is computationally viable. The reranker reorders the list. Only the top 5 to 10 results reach the user or the LLM.

from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np

retriever = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Stopping blood pressure medication without a doctor's guidance can be dangerous.",
    "Common blood pressure drugs include ACE inhibitors and beta blockers.",
    "Medication adherence improves outcomes in chronic disease management.",
    "Withdrawal effects vary depending on the type and duration of medication use.",
    "Pharmacists can review drug interactions and dosage schedules.",
    "Abrupt cessation of antidepressants can cause discontinuation syndrome.",
    "Always consult a physician before changing any medication regimen.",
    "Over-the-counter pain relievers are generally safe for short-term use.",
]

query = "is it dangerous to stop taking my medication without a doctor?"

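# Stage 1: embed corpus and query, then score by cosine similarity.
# In production the document vectors would be precomputed and stored in an index.
doc_embeddings = retriever.encode(corpus, convert_to_numpy=True)
query_embedding = retriever.encode(query, convert_to_numpy=True)

cosine_scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Keep the best candidates. top_k is small only because this corpus has 8 documents;
# a real pipeline would typically keep 50 to 100.
top_k = 5
top_indices = np.argsort(cosine_scores)[::-1][:top_k]
candidates = [corpus[i] for i in top_indices]

print("Stage 1 - Dual Encoder Retrieval:")
for i, doc in enumerate(candidates):
    print(f"  {i+1}. {doc}")

# Stage 2: score each (query, candidate) pair jointly with the cross-encoder.
pairs = [[query, doc] for doc in candidates]
rerank_scores = reranker.predict(pairs)

# Sort by reranker score, highest first.
final_results = sorted(zip(rerank_scores, candidates), key=lambda pair: pair[0], reverse=True)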

print("\nStage 2 - Cross-Encoder Reranked:")
for score, doc in final_results:
    print(f"  {score:.4f} | {doc}")

Note for RAG builders: The ceiling of this pipeline is always stage one recall. If the dual encoder misses a relevant document entirely, the cross-encoder never sees it. Retrieve generously with a higher top-k, then rerank aggressively.
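One way to check whether stage one is the bottleneck is to measure recall@k on a small set of queries with known relevant documents. The sketch below is a generic metric implementation, not something from the original pipeline. The document ids and relevance labels are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must contain at least one document id")
    found = relevant & set(retrieved_ids[:k])
    return len(found) / len(relevant)


# Hypothetical stage-one output for one query, best match first.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d5"]
# Hypothetical ground truth: the documents that actually answer the query.
relevant = ["d1", "d4", "d8"]

recall_3 = recall_at_k(retrieved, relevant, k=3)
recall_5 = recall_at_k(retrieved, relevant, k=5)
recall_6 = recall_at_k(retrieved, relevant, k=6)

for k, value in [(3, recall_3), (5, recall_5), (6, recall_6)]:
    print(f"recall@{k} = {value:.2f}")
```

In this toy case, raising k from 3 to 5 recovers two of the three relevant documents, while `d8` is never retrieved at any k, so no reranker could surface it.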

Property	Dual Encoder	Cross-Encoder
Input format	Query and document encoded separately	Query and document concatenated as one input
Output	Two vectors, cosine compared	One relevance score per pair
Speed	Very fast, supports precomputation	Slow, no precomputation possible
Accuracy	Moderate, misses nuanced relevance	High, full query-document interaction
Scalability	Scales to millions of documents	Practical only on 50 to 200 candidates
Pipeline role	Stage 1 retrieval	Stage 2 reranking
Example models	all-MiniLM-L6-v2, text-embedding-ada-002	ms-marco-MiniLM-L-6-v2, Cohere Rerank, BGE-Reranker
Token interaction	None (independent encoding)	Full cross-attention across all layers

Scenario	Recommended Approach
Corpus under 500 documents	Cross-encoder directly, skip dual encoder
Large corpus, latency is the priority	Dual encoder only
Large corpus, accuracy matters	Dual encoder retrieval plus cross-encoder reranking
High QPS, tight latency budget	Dual encoder plus ColBERT late interaction
Domain-specific content, general models underperform	Fine-tune the cross-encoder on your domain data
Multilingual corpus	BGE-Reranker or mGTE for multilingual reranking

If cross-encoder latency is too high for your QPS budget, ColBERT is worth knowing. It sits between the two architectures. It encodes query and document separately like a dual encoder, preserving the ability to precompute document vectors. But instead of comparing two pooled vectors, it compares individual token embeddings using a MaxSim operation: for each query token, find the most similar document token across the full sequence.

This gives ColBERT much better accuracy than a standard dual encoder while keeping document precomputation intact. The original ColBERT paper from Stanford reports that it runs two orders of magnitude faster and needs four orders of magnitude fewer FLOPs per query than BERT-based cross-encoder rankers, while staying competitive on retrieval quality. The RAGatouille library is a convenient way to plug ColBERT into an existing pipeline.
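The MaxSim operation described above is simple to write down. This is a minimal sketch in pure Python, assuming unit-length token vectors so that the dot product equals cosine similarity. The 2-dimensional vectors are toy values, not real ColBERT embeddings, and the function names are illustrative.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_token_vecs, doc_token_vecs):
    """Late-interaction score: for each query token take the best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Toy unit-length token vectors (hypothetical, one vector per token).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has a close match for each query token
doc_b_tokens = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)

print(f"doc_a: {score_a:.2f}")
print(f"doc_b: {score_b:.2f}")
```

The document token vectors can be computed and stored ahead of time, which is why this scheme keeps the precomputation benefit of a dual encoder while still comparing at token level.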

Approach	Precompute Docs	Token Interaction	Speed	Accuracy
Dual Encoder	Yes	None	Fastest	Lowest
ColBERT	Yes (per token)	MaxSim per token	Fast	High
Cross-Encoder	No	Full cross-attention	Slowest	Highest

What is a dual encoder in NLP?

A dual encoder encodes the query and document into separate vectors and computes cosine similarity between them for fast, scalable retrieval.

What is a cross-encoder?

A cross-encoder takes a query and document as a single concatenated input and produces one precise relevance score by attending to both simultaneously.

Why not use a cross-encoder for all retrieval?

Cross-encoders cannot precompute anything, so running one against millions of documents at query time is computationally infeasible.

What is the two-stage retrieval pipeline?

The dual encoder retrieves the top 50 to 100 candidates for speed, then the cross-encoder reranks only those candidates for precision before passing results to the user or LLM.

What is ColBERT?

ColBERT is a late-interaction model that sits between dual and cross-encoders, comparing per-token vectors with MaxSim instead of pooled vectors, giving better precision than a dual encoder without losing document precomputation.

Which cross-encoder model should I start with?

cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers is the standard starting point for English; for managed reranking, Cohere Rerank works well in RAG pipelines.

Does LangChain support cross-encoder reranking?

Yes, both LangChain and LlamaIndex have built-in reranking steps that accept cross-encoder models as a second-stage reranker.

I write deeper technical breakdowns on search architecture, RAG systems, and AI infrastructure over at krunalkanojiya.com. The full version of this article with complete pipeline walkthroughs lives there if you want to go further.

Source: dev.to (original article)
