Why “Just RAG” Breaks During Multilingual Support Surges

A prototype for multilingual retrieval-augmented generation (RAG) revealed that caching answers across languages is unsafe unless routing identity, evidence propagation, and reuse eligibility are verified. The system failed when Thai retrieval namespaces mismatched, source metadata was lost, or cached answers were reused after prompt or model changes. The findings show that safe reuse requires proving two requests share the same evidence context, not just the same question text.

During a support spike, repeated questions are both an opportunity and a risk. They are an opportunity because the system should not pay full retrieval and generation cost every time customers ask about returns, refunds, shipping timelines, promotion rules, account access, or support policies. They are a risk because support answers are not just text; they are policy-backed responses that agents may need to defend.

That tradeoff gets harder in multilingual RAG. The same customer intent can appear in English, Chinese, or Thai, but safe reuse depends on whether the cached answer still maps to the same retrieval namespace, retrieved evidence, and runtime conditions.

For this prototype, safe reuse depended on three boundaries: routing identity, evidence propagation, and reuse eligibility. Routing identity meant the request used the right retrieval namespace for its language lane. Evidence propagation meant retrieved source metadata and route metadata reached the final response. Reuse eligibility meant the system could prove, before generation, that a cached answer was still valid under the current prompt version, model version, and index version.

The failure mode is quiet: no error, no routing alarm, and a fluent response that can still come from the wrong retrieval namespace or lose the source metadata a support rep needs to verify the answer. So the design question changed from “Can I make this faster?” to “What makes two multilingual RAG requests equivalent enough to reuse one answer?”

Cached RAG is only safe when “same question” also means the same evidence context.

The useful failures were the ones that exposed the three reuse boundaries.

The first failure was Routing Identity. Routing bugs often look like retrieval bugs until namespace identity is visible. For example, the Thai retrieval namespace appeared under different names, such as idx_th and idx_th_bge.
The naming mismatch mattered because it made routing harder to verify: when a Thai answer looked wrong, the system needed to show which evidence namespace it had actually used. That made retrieval namespace identity a final-response concern, not just a configuration detail.

The second failure was Evidence Propagation. Retrieved document metadata did not consistently survive from retrieval output to the final response. The generated answer could look reasonable, but the response could still fail the support workflow because it could not show which retrieved chunks or sources supported the answer. That made source fields part of correctness. In RAG, an answer is not support-ready if the response cannot show which retrieval evidence made it safe to use.

The third failure was Reuse Eligibility. The cache could answer whether a normalized question had appeared before, but not whether the previous answer was still safe to reuse:

    def answer_question(question):
        # The key covers only the normalized question text.
        key = normalize(question)
        cached = cache.get(key)
        if cached:
            return cached

        # Cache miss: run fresh retrieval and generation, then store the answer.
        docs = retriever.search(question)
        answer = llm.generate(question, docs)
        cache.set(key, answer)
        return answer

This is a real read path, but the eligibility check is too thin. It treats repeated text as reusable text. For multilingual RAG, that can produce the wrong kind of cache hit: a Thai request could reuse an answer produced for another namespace, or a request after a prompt, model, or index change could reuse an answer produced under older runtime conditions. The consequence is not just lower answer quality. The system may skip fresh RAG exactly when it is needed.

Once those boundaries were clear, the runtime design reduced to request-level decisions. The architecture tradeoff was live explainability versus safe reuse. Running live RAG every time is auditable but pays the generation cost on every request. Exact answer caching is fast but unsafe if the key only represents the question string. This prototype needed a middle path: reuse only when the request still belonged to the same evidence context.
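The wrong kind of cache hit described above can be reproduced with a minimal sketch. Everything in it is an assumption for illustration: `NaiveCachedRag`, the `normalize` helper, the underscore spelling of the namespace names, the version labels, and the stand-in for retrieval and generation are hypothetical, not the prototype's code. For simplicity, the same question text is sent to two different lanes.

```python
from dataclasses import dataclass, field


def normalize(question: str) -> str:
    """Hypothetical normalizer: lowercase and collapse whitespace."""
    return " ".join(question.lower().split())


@dataclass
class NaiveCachedRag:
    """Question-only cache in front of stand-in retrieval and generation."""

    cache: dict = field(default_factory=dict)
    generations: int = 0

    def answer(self, question: str, namespace: str, prompt_version: str) -> dict:
        # The lane and the prompt version are NOT part of the key.
        key = normalize(question)
        if key in self.cache:
            return {**self.cache[key], "cache_hit": True}

        # Stand-in for retriever.search(...) followed by llm.generate(...).
        self.generations += 1
        result = {
            "answer": f"answer from {namespace} under {prompt_version}",
            "namespace": namespace,
            "prompt_version": prompt_version,
        }
        self.cache[key] = result
        return {**result, "cache_hit": False}


rag = NaiveCachedRag()
first = rag.answer("What is the refund window?", "idx_en_bge", "v1")
# Same normalized text, but a different lane and a newer prompt version.
second = rag.answer("what is the  refund window?", "idx_th_bge", "v2")

print(second["cache_hit"])       # True: generation was skipped
print(second["namespace"])       # idx_en_bge: evidence came from the other lane
print(second["prompt_version"])  # v1: produced under older runtime conditions
```

Nothing fails in this run. The second request simply receives an answer whose recorded namespace and prompt version do not match what it asked for, and that mismatch is visible only because the stored entry carries those fields.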
That led to decisions at request time: choose the lane, check reuse eligibility, and return enough metadata to verify the path.

First, the request needed a lane. A shared multilingual index would have been easier to maintain, but separate namespaces made routing failures diagnosable. The interface supplied the language value, and the request was then routed to a named namespace such as idx_en_bge or idx_th_bge. When a Thai answer looked wrong, I could inspect whether the request actually hit idx_th_bge instead of treating every bad answer as a retrieval-quality problem.

Second, reuse needed to be decided before generation. If the request matched a valid cached answer for the same lane and runtime state, the system could return that answer with route metadata. If not, it had to run fresh RAG.

Third, the response needed enough context for a support workflow to trust it: request ID, namespace, index version, model, latency, cache-hit status, and source fields where present.

These decisions became cache-key checks. The simplified flow looked like this:

    User Query
      -> Lane Selector
      -> Cache Lookup
          -> Hit: return cached answer + response metadata
          -> Miss: retrieve -> generate -> trace -> store
      -> Response

Cache identity was how the system enforced the boundaries. A weak key would make reuse fast but unsafe: the same normalized question could appear in another language lane, under another prompt version, or after an index update. In this prototype, the cache key had to describe the request text and the conditions that made the answer safe to reuse.

A cache entry was considered reusable only when three key dimensions matched: the normalized question, the lane, and the version state. The measured fast path here was exact repeat reuse, so the lookup started with the normalized question plus the lane and version dimensions. It should not require fresh retrieval before checking the cache. Retrieved document hashes belonged outside the first lookup key because they are not available until after retrieval.
They can help validate stored answers later, but they cannot be required before an exact cache hit can happen.

    cache_key = make_key(
        query=normalize(question),
        language=lang,
        namespace=namespace,
        prompt_version=PROMPT_VERSION,
        index_version=INDEX_VERSION,
        model=MODEL_ID,
    )

A semantic cache would still need the same routing and version checks; similarity alone would not make reuse safe. The key made wrong reuse harder: an English answer should not silently satisfy a Thai request, and an answer produced under one prompt, model, or index version should not be treated as equivalent after those conditions change.

The evaluation used a controlled hot-set replay: repeated FAQ queries across EN/ZH/TH, two embedder lanes, two execution modes, and top_k=3 retrieval. Each cell used n=20, with the fast path primed before measurement.

Validation results: The cache behaved as expected across the validation workload. When the request stayed inside the same lane and runtime conditions, the prototype skipped generation and returned route/source context with the response. On the primary lane, cache-hit latency dropped into the low tens of milliseconds; the backup lane was slower, but stayed visible as a separate fallback/comparison path.

This does not predict production hit rate or long-term invalidation behavior. It shows the narrower mechanism working: eligible repeats could reuse an answer without losing the evidence context needed to inspect that reuse.

The conclusion is correspondingly narrow: repeated or similar questions are reusable only when the system can prove the routing, evidence, and runtime conditions still match. For multilingual RAG, that means three checks:

1. Reuse needs routing identity, not just matching text or similar intent.
2. Reuse needs evidence propagation, not just fluent answer text.
3. Reuse needs eligibility checks, not just a cache hit.

Before reusing a RAG answer, I would ask:

1. What makes two requests equivalent?
2. What changes invalidate reuse?
3. How can we verify the cached answer came from the correct evidence context?

In multilingual support, avoiding generation is useful only when the system can still explain why the reused answer was valid.

The safest cached answer is not the fastest one. It is the one whose evidence context you can still explain.

Further context

This article focuses on the cache/reuse boundary of a prototype I built. For readers who want to inspect the work more closely, the implementation annex is on GitHub (https://github.com/NaughtyRex/faq-surge-assistant). I also keep a portfolio case study (https://rexchang.dev/ case-studies) for the product/architecture framing and a short, fun walkthrough video (https://www.youtube.com/watch?v=0mZvhp8Wurg) showing the prototype behavior.

Why “Just RAG” Breaks During Multilingual Support Surges (https://pub.towardsai.net/why-just-rag-breaks-during-multilingual-support-surges-88d8bce5aae2) was originally published in Towards AI (https://pub.towardsai.net) on Medium.