# RAG should never be your default

> Source: <https://dev.to/chou_500/rag-should-never-be-your-default-38lh>
> Published: 2026-06-14 01:27:35+00:00

Vector RAG is the reflexive answer to "give the model more context," and when I built a production tutoring AI, I reached for it too. The product is simple: a student uploads a photo of a problem, and our tutor explains it step by step and produces an answer. Our client also had a large database of past problems, stored as images — and of course we wanted to leverage it. So the system retrieved the most similar past problem and fed it, together with its answer, into the model to help generate a solution. It was such an obvious move that I never questioned it.

It performed poorly.

I went hunting for the usual suspect: retrieval quality. I tried different retrieval strategies, from matching on the problem description to matching on the image content. I benchmarked different embedding models, from single-vector to late-interaction. None of it moved the needle. If retrieval quality wasn't the problem, where was the bug?

Vector RAG represents text or images in an embedding space and returns the stored chunks whose embeddings are closest to the query. In other words, it optimizes for exactly one thing: semantic similarity to what you've already stored. Hidden inside that is a silent assumption: that the most similar stored item is the one the model needs. For FAQs or document lookup, the assumption usually holds. The most similar passage generally is the right one, so RAG works well there, and reaching for it feels so natural that the assumption never surfaces. The tool isn't the problem; it does exactly what it claims. The real question is whether *most similar == what's needed* holds for your case, and you have to check that **before** reaching for it.
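To make "optimizes for exactly one thing" concrete, here is a minimal sketch of the retrieval step in plain Python. The three-dimensional vectors, the item names, and the `retrieve_top1` helper are all made up for illustration; a real pipeline would get its embeddings from a model and search them in a vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top1(query_vec, store):
    """Return (item_id, score) for the stored embedding closest to the query."""
    return max(
        ((item_id, cosine(query_vec, vec)) for item_id, vec in store.items()),
        key=lambda pair: pair[1],
    )


# Toy embeddings (assumption): real ones come from an embedding model.
store = {
    "problem_a": [0.9, 0.1, 0.0],
    "problem_b": [0.1, 0.9, 0.0],
    "problem_c": [0.0, 0.2, 0.9],
}

best_id, score = retrieve_top1([0.8, 0.2, 0.1], store)
print(best_id, round(score, 3))
```

Nothing in this loop knows what the model needs. It ranks by closeness in the embedding space and nothing else, which is exactly the assumption worth checking.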

In my case, it didn't. I wanted the most similar problem to raise answer accuracy, so I built a multimodal image-retrieval pipeline: past problems indexed in Qdrant, with ColPali/ColQwen and Jina multi-vector retrieval benchmarked against each other. They all generalized poorly. Thinking it through from first principles, I realized the approach had two fundamental problems.

First, the retrieval was ill-posed. For our task, "similar" had to mean *similar in solution method* — not visually or semantically similar. And that kind of similarity is almost impossible to define from the problem image alone. No wonder neither single-vector nor late-interaction could fix retrieval quality: I was asking the retriever for something the images simply didn't encode.
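A rough sketch of the two scoring styles shows why swapping one for the other couldn't help. Single-vector retrieval compares one embedding per item; late interaction (the ColBERT-style MaxSim used by ColPali-family models) compares many embeddings per item and sums the best matches. The toy vectors and function names below are illustrative assumptions, not our production code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query_vec, doc_vec):
    """Single-vector retrieval: one embedding per query, one per document."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query vector keeps its best-matching
    document vector, and the per-query maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy multi-vector embeddings (assumption): two vectors per item.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.5, 0.5], [1.0, 0.0]]

print(maxsim_score(query_vecs, doc_vecs))
```

Both functions only rearrange how embeddings are compared. If the embeddings carry no signal about solution method, neither scoring rule can recover it.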

Second, even a perfect retriever wouldn't have helped. Hand the model a similar problem and its answer, and it still has to reverse-engineer the solution method from that example before it can apply it — an extra reasoning hop. On a small qualitative eval (~10 problems), feeding in the similar problem and its answer produced no meaningful lift.

Both problems pointed at the same thing: what the model actually needed was the solution method, directly. The similar-problem image was the wrong unit — indirect at best.

So we dropped RAG entirely. Instead, we pre-structured the solution methods into reusable knowledge and let the LLM select the relevant method and apply it directly. The answers got noticeably better — the model no longer had to infer a method before using one; it just used it. (We never ran a formal benchmark on the replacement, so treat that as a qualitative read, not a number.)
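As a rough illustration of the shape this can take, here is a minimal sketch of a playbook rendered into a prompt. Everything in it is an assumption made for the example: the `Method` fields, the two sample methods, the prompt wording, and the idea that the problem arrives as transcribed text. It is not the playbook or prompt we shipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """One reusable solution method (hypothetical schema)."""
    name: str
    when_to_use: str
    steps: tuple


# Illustrative entries only; a real playbook is written by domain experts.
PLAYBOOK = (
    Method(
        name="Substitution",
        when_to_use="Two linear equations where one variable is easy to isolate.",
        steps=(
            "Isolate one variable in one equation.",
            "Substitute it into the other equation.",
            "Solve, then back-substitute.",
        ),
    ),
    Method(
        name="Completing the square",
        when_to_use="A quadratic that does not factor cleanly.",
        steps=(
            "Move the constant term to the right-hand side.",
            "Add the square of half the linear coefficient to both sides.",
            "Take square roots and solve.",
        ),
    ),
)


def build_prompt(problem_text, playbook):
    """Render the whole playbook plus the problem into one prompt string."""
    lines = [
        "You are a tutor. Pick the single most relevant method below,",
        "name it, then apply it step by step.",
        "",
        "Methods:",
    ]
    for i, method in enumerate(playbook, start=1):
        lines.append(f"{i}. {method.name}: {method.when_to_use}")
        lines.extend(f"   - {step}" for step in method.steps)
    lines += ["", "Problem:", problem_text]
    return "\n".join(lines)


prompt = build_prompt("Solve x + y = 5 and x - y = 1.", PLAYBOOK)
print(prompt)
```

The point of the design is that the method reaches the model in the form it will use, and the whole playbook is plain data that a person can read and edit.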

The bigger win was operational. We built a web interface that lets the client edit the knowledge the LLM relies on. Because each change is reflected in the model's behavior immediately, they can experiment, iterate, and watch the results in real time.

**And they don't need an engineer to do it.** Non-technical users update the playbook themselves — something that was never possible when the model's behavior was locked inside embeddings and a retrieval pipeline. The system keeps improving after launch, and a knowledge playbook turns out to be far easier to maintain than a RAG stack. As a bonus, it cut our retrieval infra cost.

The case taught me never to treat RAG as the default. Before reaching for it, ask one question: is the most similar stored item the same as what the model needs most here? If not, figure out what it *does* need, and give it that. Retrieval is a design decision about the unit of need — not a default you reach for.
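One cheap way to ask that question before building anything is to hand-label a small sample and count how often the most similar stored item actually carries what the model needs. The sketch below assumes you can label both sides; the field names and the four sample rows are invented for illustration and are not data from our project.

```python
def top1_agreement(samples):
    """Fraction of queries whose most-similar stored item has the needed label."""
    if not samples:
        raise ValueError("need at least one labeled sample")
    hits = sum(1 for s in samples if s["retrieved_label"] == s["needed_label"])
    return hits / len(samples)


# Made-up labels for illustration: what top-1 retrieval returned vs. what
# a human says the model needed for that query.
samples = [
    {"query": "q1", "retrieved_label": "substitution", "needed_label": "substitution"},
    {"query": "q2", "retrieved_label": "substitution", "needed_label": "elimination"},
    {"query": "q3", "retrieved_label": "factoring", "needed_label": "completing the square"},
    {"query": "q4", "retrieved_label": "factoring", "needed_label": "factoring"},
]

rate = top1_agreement(samples)
print(f"most similar == what's needed in {rate:.0%} of sampled queries")
```

A low rate on even a few dozen hand-labeled queries is a strong hint that similarity is the wrong unit for the task.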
