Your retriever returned the right documents. The similarity scores look fine. The answer is still wrong. If you've shipped RAG, you've seen this — and it's the failure that survives every retrieval upgrade.
Reranker. Higher top-k. Hybrid search. A better embedding model. All of these chase the same goal: documents more similar to the query. They help when the right document wasn't being retrieved. They do nothing when the right document was retrieved and the answer is still wrong.
Similarity answers "is this chunk about the same topic?" It does not answer "does this chunk contain the facts needed to support the answer?" Those come apart constantly. A chunk can be highly similar — same vocabulary, same subject — and contain nothing that actually grounds the answer. Hand the model a pile of on-topic text and it will produce a fluent, plausible, even cited-looking answer. The grounding is cosmetic: the text was nearby, not load-bearing.
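Here's a toy sketch of that gap. It is not how a real retriever scores anything: token-overlap (Jaccard) similarity stands in for embedding similarity, the query and chunk are made up, and "contains a digit" is a deliberately crude, hypothetical proxy for "states the refund duration."

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase word/number tokens; a stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, used here only as a toy proxy for embeddings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Made-up example data.
query = "What is the refund window for annual plans?"
chunk = "Our refund policy for annual plans describes how the refund window works."

similarity = jaccard(query, chunk)

# The answer needs a duration. Crude proxy: does the chunk contain any number?
states_a_duration = bool(re.search(r"\d", chunk))

print(f"similarity={similarity:.2f}, states_a_duration={states_a_duration}")
# prints: similarity=0.46, states_a_duration=False
```

The chunk shares most of the query's vocabulary and never says how long the window is. A generator handed this chunk has nothing to ground a number on, yet it will usually produce one anyway.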
High similarity with a wrong answer isn't a contradiction. You asked retrieval to find related text. It did. Nobody asked whether the text was enough.
Stop treating retrieval output as evidence. Treat it as candidate material that has to pass an explicit evidence check before it can support an answer. Put a step between retrieval and generation that asks one question: does the retrieved set actually contain the facts this answer requires? If not, abstain. When the documents don't contain the facts, the system should return nothing rather than a confident guess.
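A minimal sketch of that gate, under stated assumptions: `gated_answer`, `duration_check`, and `stub_generate` are hypothetical names, not part of any library. The evidence check here is a regex that looks for an explicit duration; in a real system you'd likely swap in something stronger (an entailment model or an LLM judge, for instance), but the control flow is the point.

```python
import re
from typing import Callable, List, Optional, Sequence

# Hypothetical interfaces: a check returns the chunks that actually state the
# needed fact (empty list means "insufficient"); a generator writes the answer.
EvidenceCheck = Callable[[str, Sequence[str]], List[str]]
Generator = Callable[[str, Sequence[str]], str]


def gated_answer(
    question: str,
    chunks: Sequence[str],
    check: EvidenceCheck,
    generate: Generator,
) -> Optional[str]:
    """Generate only from chunks that pass the evidence check; else abstain."""
    evidence = check(question, chunks)
    if not evidence:
        return None  # abstain: retrieved text was on-topic but not sufficient
    return generate(question, evidence)


def duration_check(question: str, chunks: Sequence[str]) -> List[str]:
    """Toy check for 'how long' questions: keep chunks that state a duration."""
    pattern = re.compile(r"\b\d+\s+(?:day|week|month|year)s?\b", re.IGNORECASE)
    return [c for c in chunks if pattern.search(c)]


def stub_generate(question: str, evidence: Sequence[str]) -> str:
    """Placeholder for the model call; just echoes the supporting chunk."""
    return f"Based on: {evidence[0]}"


question = "What is the refund window for annual plans?"
on_topic_only = [
    "Our refund policy for annual plans describes how the refund window works.",
]
with_fact = on_topic_only + [
    "Annual plans can be refunded within 30 days of purchase.",
]

print(gated_answer(question, on_topic_only, duration_check, stub_generate))
print(gated_answer(question, with_fact, duration_check, stub_generate))
```

The first call returns `None` because nothing retrieved states a duration. The second answers from the one chunk that does, and the on-topic filler never reaches the generator.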
Relevant context in, only sufficient evidence allowed through. That's the line between a RAG demo and a RAG system you can trust in production.
I write about the three boundaries where production RAG dies — query, evidence, output — from the angle of shipping under security and model constraints. Read the full version on my blog, where this connects to the practical RAG Failure Diagnosis Kit for teams debugging production RAG.