# RAG Fails Upstream and Most Teams Are Fixing the Wrong Problem

> Source: <https://techstrong.ai/features/rag-fails-upstream-and-most-teams-are-fixing-the-wrong-problem/>
> Published: 2026-06-25 10:37:02+00:00

Retrieval-augmented generation (RAG) is not one engineering problem but two, and most teams are solving the easier one while the harder one turns into a production incident.

Unstructured enterprise data was always a problem worth solving. [RAG made the solution look simple enough](https://techstrong.ai/articles/from-rag-to-riches-why-retrieval-augmented-generation-is-key/) that most teams were in production before they understood what they had actually built. The pitch was straightforward: connect your documents to a language model, let the model retrieve what it needs before it answers and the gap between a generic AI assistant and one that knows your business closes itself. But the tooling ecosystem around RAG evolved so quickly that most teams were running their first pipeline before they had a clear picture of what making it production-ready would actually require.

The gap between what teams expected to build and what they actually built is where a significant amount of engineering time and organizational credibility has disappeared. The problem is not that RAG does not work. It is that most teams treat it as one problem when it is structurally two, building the solution to one while leaving the other largely unaddressed. The consequences surface not in the demo but three months into production, when retrieval quality degrades and edge cases accumulate. The AI that impressed in the stakeholder presentation starts giving answers that nobody on the engineering team can defend.

### Data Quality Kills More RAG Pipelines

[Mayank Bhola](https://in.linkedin.com/in/mayankbhola), co-founder and head of products at [TestMu AI](https://www.testmuai.com/), an AI-native quality engineering platform formerly known as LambdaTest, has seen this expectation gap play out consistently across enterprise teams. Most come in with strong instincts about which LLM to use and how to prompt it, but have not asked the harder question of whether the data those models are going to retrieve is actually ready to be retrieved. “That readiness problem does not announce itself in the demo because demos are curated,” Bhola explains. “It is exposed weeks into production when the edge cases the demo never touched start hitting the retrieval layer and the system has no mechanism to handle them.”

Treating data as production-ready before the pipeline is built means running a structured audit of everything the system is going to ingest. That means checking whether tables are parseable, whether images and charts have text equivalents, whether legacy documents follow any schema the extraction tooling can interpret and whether the content has been reviewed for consistency across sources. Teams that skip this step are not moving faster. They are deferring the work into a production environment where it is significantly harder and more expensive to fix.

[Carlos Rolo](https://www.linkedin.com/in/carlosjuzarterolo/) — an open-source community contributor with work across Cassandra, OpenSearch and Cadence and an engineering expert at [Instaclustr, a NetApp company](https://www.instaclustr.com/) — describes the data reality that most teams encounter the moment they move past the demo. Organizations walk in with PDFs, Word documents and SharePoint content, with the reasonable expectation that a RAG pipeline will ingest it all and make it queryable.

Tables with ambiguous commas, images and charts that extraction tools cannot reliably interpret, inconsistent schemas and legacy content that predates any formatting standard the current team would recognize are all waiting on the other side of that expectation. None of it is what the vendor pitch prepared the team for, because the vendor pitch was about the retrieval and generation layers rather than the data preparation layer that both depend on entirely.

“I have seen very robust RAG pipelines fall over because a table had a comma in it, which is immediately understandable by us humans, but it completely breaks down the RAG pipeline,” Rolo explains. “Then you cannot make any retrieval.” The fix is not complicated once the problem is identified, but the category it belongs to — malformed source data — is one that requires dedicated tooling and a deliberate pre-ingestion step rather than a quick patch.
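Rolo does not say exactly how the comma broke the pipeline, so the following is only a minimal sketch of one plausible version of the failure: a table cell containing a comma is split naively and ends up with one column too many. The sample data and the `find_ragged_rows` helper are hypothetical, written for illustration, and a real pre-ingestion check would cover far more cases.

```python
import csv
import io

# Hypothetical table export: the data row has a comma inside a quoted cell.
raw = 'product,price,notes\n"Widget, large",19.99,in stock\n'

# Naive parsing: splitting on every comma treats the quoted cell as two columns.
naive_rows = [line.split(",") for line in raw.strip().splitlines()]

# The csv module honors the quoting, so the cell stays intact.
parsed_rows = list(csv.reader(io.StringIO(raw)))


def find_ragged_rows(rows):
    """Return indexes of rows whose column count differs from the header's."""
    width = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != width]


print(find_ragged_rows(naive_rows))   # the data row is flagged
print(find_ragged_rows(parsed_rows))  # nothing is flagged
```

A check like this is cheap to run before ingestion, and flagged rows can be routed to a person or a dedicated parser instead of being embedded as-is.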

Open-source projects such as [Docling from IBM](https://github.com/DS4SD/docling) and [Markitdown from Microsoft](https://github.com/microsoft/markitdown) exist specifically to solve this problem, and teams building RAG pipelines over complex enterprise documents should be evaluating them as part of the data preparation layer rather than as an afterthought.

### Access Controls and Semantic Governance Can’t Be Retrofitted

The governance dimension of data readiness is the part Rolo identifies as most consistently underestimated, and the one that creates the most friction in enterprise deployments where the data being ingested is not just messy but sensitive. Once data is in the RAG pipeline, the questions of who can access it, who owns it and whether the retrieval system can enforce access controls at the document level rather than just the query level become operational requirements rather than compliance considerations.

“Is this being seen by someone’s eyes that should not be seeing it?” Rolo asks, framing the question that organizations running cloud-based AI systems over enterprise content have to answer before they can call the pipeline production-ready. In the rush to move fast, it is the question that gets answered last rather than first.

Document-level access control has to be designed into the retrieval architecture from the start, not retrofitted after the pipeline is running. This means mapping which documents belong to which access tiers before ingestion, building retrieval filters that enforce those tiers at query time and auditing the pipeline regularly to verify that access boundaries are holding as the document corpus grows. Teams that treat this as a post-launch compliance task typically discover the gap when an employee retrieves content they were never supposed to see, at which point the fix requires rebuilding parts of the pipeline rather than adjusting a configuration.
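As a rough illustration of enforcing tiers at query time, the sketch below tags each chunk with an access tier before ingestion and drops anything the caller's role is not cleared for. The tier names, the `ROLE_TIERS` mapping and the `filter_by_access` helper are all assumptions made for the example, not part of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    access_tier: str  # assigned before ingestion, e.g. "public" or "restricted"


# Hypothetical mapping from a role to the tiers that role may read.
ROLE_TIERS = {
    "contractor": {"public"},
    "employee": {"public", "internal"},
    "finance": {"public", "internal", "restricted"},
}


def filter_by_access(candidates, role):
    """Keep only chunks the role is cleared for; unknown roles get nothing."""
    allowed = ROLE_TIERS.get(role, set())
    return [c for c in candidates if c.access_tier in allowed]


candidates = [
    Chunk("handbook", "Vacation policy ...", "internal"),
    Chunk("press-kit", "Company overview ...", "public"),
    Chunk("payroll", "Salary bands ...", "restricted"),
]

print([c.doc_id for c in filter_by_access(candidates, "employee")])
```

In a production system the same tier condition would more likely be pushed into the search query itself, so unauthorized chunks are never retrieved in the first place. The point of the sketch is only that the tier has to exist on every chunk before that is possible.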

[Alex Merced](https://www.linkedin.com/in/alexmerced), head of developer relations at [Dremio](https://www.dremio.com/), frames the governance problem in terms that go beyond access controls and into the semantic layer underneath them. “Most enterprises do not actually have an embedding problem. They have a context problem. Their data is fragmented across systems, poorly described, inconsistently modeled and governed through processes that were never designed for autonomous or semi-autonomous agents.”

Without a governed semantic layer, AI systems will retrieve information that is ambiguous, incomplete, stale or outright unauthorized, regardless of how strong the underlying vector search is. Merced puts it plainly: “Better embeddings simply make the wrong answers faster.”

Building the semantic layer first means establishing a shared vocabulary for business concepts across data sources, resolving inconsistencies in how those concepts are modeled and creating a governed data foundation that the retrieval layer can reason over reliably. Teams that start from open lakehouse foundations before touching the retrieval layer are not slowing down. They are eliminating the category of retrieval failures that no amount of embedding optimization can fix after the fact.

### Hybrid Retrieval Beats Over-Optimizing the Vector Layer

The architectural insight that Rolo arrived at through building and debugging RAG pipelines is the one most teams discover too late and have to rebuild around. RAG is a two-step problem, and the two steps have different failure modes that require different solutions. The first step is search, finding the right chunks of content from the knowledge base to pass to the model. The second step is context, giving the model those chunks in an order and format that lets it reason about them effectively. Most implementations treat the second step as the hard problem and underinvest in the first, then spend months debugging retrieval quality when the real failure is happening upstream.

“I improved my retrieval massively because I was missing searches and then my content was miserable for the LLM,” Rolo recalls, describing the moment he stepped back from vector-based retrieval and reintroduced keyword search into the pipeline. Keyword search handles precise term retrieval and vector search handles semantic similarity, and the two are solving different retrieval problems.

The practical rule is straightforward: Use keyword search when the query contains specific terms, identifiers or named entities that need exact matching and use vector search when the query is conceptual and the relevant content may not share vocabulary with the question being asked. Running both in parallel and merging the results by relevance score consistently outperforms either method used in isolation.
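One way to sketch the merge step is shown below. Raw keyword scores and vector similarities are not on the same scale, so this example rescales each set to the 0 to 1 range before combining them. The min-max normalization, the equal default weighting and the function names are assumptions chosen for illustration; other fusion methods exist and may suit a given corpus better.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def merge_hybrid(keyword_scores, vector_scores, keyword_weight=0.5):
    """Combine normalized keyword and vector scores into one ranked list."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    combined = {
        doc_id: keyword_weight * kw.get(doc_id, 0.0)
        + (1 - keyword_weight) * vec.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    # Sort by score, highest first; break ties by doc_id for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores returned by the two searches for the same query.
keyword_scores = {"doc_a": 12.0, "doc_b": 3.0}
vector_scores = {"doc_a": 0.80, "doc_b": 0.91, "doc_c": 0.60}

for doc_id, score in merge_hybrid(keyword_scores, vector_scores):
    print(doc_id, round(score, 3))
```

Here `doc_a` ranks first because both searches rate it well, even though the vector search alone would have preferred `doc_b`.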

Merced sees the same over-correction playing out across enterprise teams. “Teams routinely over-optimize the vector layer, tuning embeddings, swapping vector databases or chasing marginal recall improvements, while underestimating issues such as data governance, semantic consistency, hybrid retrieval strategies and end-to-end latency constraints.” The more productive use of that optimization effort is to stop tuning the vector layer and start auditing what is upstream of it.

Check whether the chunking strategy matches the actual structure of the documents. Check whether the metadata attached to each chunk is rich enough to support filtering. Check whether the retrieval evaluation is being run against representative queries drawn from real usage rather than synthetic test cases. Standards such as MCP and other open interfaces allow AI agents to combine symbolic and vector-based retrieval and operate across tools with predictable performance, giving teams the flexibility to evolve the pipeline as models and use cases change without rebuilding from scratch.

On the database decision, Rolo’s advice is grounded in operational reality rather than architectural idealism. “The best advice in the database world is the database you already have,” he argues. “If you already have a database and that database supports vectors, start there. Do not reinvent the wheel.” [Postgres with pgvector](https://github.com/pgvector/pgvector), OpenSearch with ML Commons and Cassandra with vector search are mature infrastructure choices that preserve the operational knowledge a team has already built and avoid the cost of adopting a new system while simultaneously trying to ship a reliable AI product.

### Bigger Context Windows Don’t Fix a Precision Problem

The context side of the two-step problem has its own failure mode that has become more visible as teams have pushed toward models with larger context windows and discovered that larger capacity does not translate linearly into better performance. Rolo describes the pattern that tends to emerge from teams that try to solve retrieval quality problems by sending more content to the model.

“Having a big context window with low quality, even if somewhere within the low-quality context there is the context you need, is not a good technique,” he points out. Models that receive too much content lose track of what is relevant in the middle, a failure mode known as context rot. A model that could reason clearly about three precise chunks becomes unreliable when those same three chunks are buried among 30 that are only marginally related.

The implication for retrieval architecture is that precision matters more than recall at the context stage. The goal is not to retrieve everything that might be relevant but to retrieve exactly what is needed and nothing that is not, because the model’s ability to reason degrades as the signal-to-noise ratio of its context falls.

Benchmarking for context degradation means running structured tests that vary the number of retrieved chunks passed to the model and measuring answer quality against a fixed set of ground truth queries, then identifying the point at which adding more context starts reducing accuracy rather than improving it. Rolo underscores that this threshold is model-specific and data-specific. “Each LLM performs differently,” he notes. “It is not that LLM A is better than LLM B or C. It is about benchmarking our own LLMs to understand how they manage context.”
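The benchmark described above can be sketched as a small harness. Everything here is a simplified assumption: `retrieve` and `answer` stand in for a real retriever and a real model call, correctness is judged by exact match (real evaluations usually need a more forgiving scorer), and the toy functions exist only so the example runs.

```python
def accuracy_by_chunk_count(eval_set, retrieve, answer, chunk_counts):
    """For each k, pass the top-k chunks to the model and score the answers."""
    results = {}
    for k in chunk_counts:
        correct = sum(
            1 for query, expected in eval_set
            if answer(query, retrieve(query, k)) == expected
        )
        results[k] = correct / len(eval_set)
    return results


def first_drop(results):
    """Return the first k whose accuracy is lower than the previous k's."""
    ks = sorted(results)
    for prev, cur in zip(ks, ks[1:]):
        if results[cur] < results[prev]:
            return cur
    return None


# Toy stand-ins: replace with a real retriever and a real model call.
CORPUS = {
    "q1": ["a1", "noise", "noise", "noise"],
    "q2": ["noise", "a2", "noise", "noise"],
}


def toy_retrieve(query, k):
    return CORPUS[query][:k]


def toy_answer(query, chunks):
    # Pretend model: it finds the answer only while the context stays small.
    hits = [c for c in chunks if c != "noise"]
    return hits[0] if hits and len(chunks) <= 3 else None


eval_set = [("q1", "a1"), ("q2", "a2")]
results = accuracy_by_chunk_count(eval_set, toy_retrieve, toy_answer, [1, 2, 3, 4])
print(results)
print(first_drop(results))
```

With real models the curve is noisier than this toy one, so a single small dip is not necessarily the threshold. Repeating the run and looking for a sustained decline is a safer reading.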

Keeping a vector index synchronized with a high-transaction operational database is a harder engineering problem than the initial pipeline build, and it compounds over time as the production data distribution evolves away from the data the index was built on. On-the-fly embedding computation avoids the stale index problem but trades it for latency. The teams building the most reliable RAG pipelines are the ones that have been honest about that limitation and designed their systems around explicit refresh cycles and retrieval quality monitoring rather than assuming the pipeline will maintain its own accuracy over time.
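An explicit refresh cycle needs some way to tell which embeddings are out of date. The sketch below assumes, hypothetically, that the source database records an update timestamp per document and that the index records when each embedding was computed. The helper names and sample data are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def find_stale_ids(source_updated_at, index_embedded_at, max_lag=timedelta(0)):
    """Ids whose source changed after embedding, or that were never embedded."""
    stale = []
    for doc_id, updated_at in source_updated_at.items():
        embedded_at = index_embedded_at.get(doc_id)
        if embedded_at is None or updated_at - embedded_at > max_lag:
            stale.append(doc_id)
    return sorted(stale)


def find_orphan_ids(source_updated_at, index_embedded_at):
    """Ids still in the index although the source document no longer exists."""
    return sorted(set(index_embedded_at) - set(source_updated_at))


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
source_updated_at = {
    "doc_1": t0,                       # unchanged since it was embedded
    "doc_2": t0 + timedelta(hours=5),  # edited after it was embedded
    "doc_3": t0,                       # never embedded
}
index_embedded_at = {
    "doc_1": t0 + timedelta(hours=1),
    "doc_2": t0 + timedelta(hours=1),
    "doc_9": t0,                       # source row has since been deleted
}

print(find_stale_ids(source_updated_at, index_embedded_at))
print(find_orphan_ids(source_updated_at, index_embedded_at))
```

The `max_lag` parameter makes the trade-off explicit: a larger value means fewer re-embedding jobs and a staler index.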

### RAG Is an Engineering Discipline, Not a One-Time Build

The RAG projects that have delivered durable value share a set of architectural commitments that are less about technology choices and more about sequencing and ownership. Data readiness is treated as a prerequisite rather than a parallel track. The retrieval layer is treated as a first-class engineering concern with its own quality benchmarks, monitoring and improvement cycle. The context assembly layer is built around what the specific model can handle reliably rather than what it theoretically supports at maximum capacity. Governance is addressed at the design stage rather than the compliance review stage.

Bhola frames the operational consequence of getting this sequencing wrong in terms that engineering leaders will recognize from their own post-deployment experience. Drawing on his work building KaneAI, an end-to-end testing agent, he says: “What we have consistently found is that the teams who treat RAG as a one-time build rather than an ongoing engineering discipline end up spending more time in firefighting mode than they saved by moving fast on the initial implementation. The retrieval layer degrades gradually, the data distribution shifts and the model gets updated. None of those changes trigger an alert because the system is still functioning; it is just functioning worse than it was three months ago, and nobody has instrumented for that.”

Avoiding that outcome means assigning a named owner to the retrieval pipeline with a mandate that goes beyond keeping the system running. The owner runs retrieval quality benchmarks on a regular cadence, tracks answer accuracy against a fixed evaluation set, monitors for data distribution shifts and treats any meaningful change in the underlying data as a trigger to re-evaluate the pipeline. Without this ownership structure, retrieval quality degradation becomes a background condition that surfaces only when a user or a stakeholder notices it, by which point the gap between where the system is and where it should be is already significant.
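A recurring benchmark of this kind can be as small as the sketch below, which computes recall@k over a fixed evaluation set and compares it with a stored baseline. The choice of recall@k as the metric, the 0.05 tolerance and the sample data are assumptions made for illustration. A real setup would track several metrics and log them over time.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


def has_regressed(current, baseline, tolerance=0.05):
    """True when the metric fell below the baseline by more than the tolerance."""
    return baseline - current > tolerance


# Hypothetical evaluation run: what the retriever returned vs. ground truth.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d9", "d4", "d5"], {"d4", "d7"}),
]

current = mean_recall_at_k(runs, k=3)
print(current)
print(has_regressed(current, baseline=0.90))
```

Wiring `has_regressed` into an alert is what turns gradual degradation from something a stakeholder notices into something the pipeline owner sees first.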

“The biggest risk you have is not using AI,” Rolo adds, and the observation cuts against the narrative that the complexity of production RAG is an argument for caution. The engineering problems are real and the path through them is demanding. But the gap between a RAG system that works in a demo and one that works in production is an engineering gap, not a technology gap, and the discipline to close it is available to any team willing to take both steps of the problem seriously rather than just the second one.
