{"slug": "most-rag-problems-are-r-etrieval-problems", "title": "Most RAG Problems Are R(etrieval) Problems", "summary": "A developer reports that most failures in production RAG (Retrieval-Augmented Generation) systems stem from retrieval and data quality issues, not the language model itself. Common problems include degraded retrieval accuracy when scaling from demo-sized datasets to real-world document stores, corrupted text from PDF parsing, outdated or conflicting source documents, and permission leaks that expose sensitive data. The engineer notes that adding a reranker and investing in preprocessing infrastructure solved more quality issues than changing the LLM, and that EU-specific challenges like GDPR compliance, on-premise requirements, and right-to-erasure with vector databases add significant complexity.", "body_md": "Most RAG blog posts read like product brochures. After building a few systems over the last few months and reading way too many production post-mortems, I'm pretty convinced the LLM is usually not the thing that breaks first.\n\nEspecially not in EU mid-market deployments.\n\nA few failure modes I see again and again:\n\nThe demo with 500 PDFs looks amazing.\n\nThen the first real pilot starts, somebody uploads 30k documents from SharePoint and suddenly top-3 retrieval becomes semi-random.\n\nTypical example:\n\nQuery is `Lieferantenbewertung 2024`.\n\nWhat comes back:\n\nThis problem is way more common than most tutorials mention.\n\nWhat people in production seem to converge on:\n\nHonestly, adding a reranker solved more quality issues for us than changing the LLM ever did.\n\nMost demos run on clean PDFs.\n\nReal document stores are:\n\n`pypdf` turns many of these into complete garbage text.\n\nThings I saw multiple times already:\n\n`ü` becoming weird symbols\n\nCurrent stack that works reasonably okay:\n\nThis preprocessing layer is very unsexy work, but probably 30% of the actual implementation effort.\n\nAnd if you skip it, the 
whole RAG quality later becomes fake-good.\n\nEvery stakeholder asks:\n\n“What about hallucinations?”\n\nAlmost nobody asks:\n\n“What if the source itself is outdated?”\n\nFrom what I’ve seen, this kills more pilots than hallucinations do.\n\nThe model gives a perfectly grounded answer.\n\nIt cites the right document.\n\nThe document is just no longer valid.\n\nOr worse:\n\ntwo valid documents disagree and the system confidently picks one.\n\nWhat seems to work:\n\nA lot of “hallucination problems” are actually retrieval problems wearing a fake mustache.\n\nThis one appears in basically every internal rollout thread.\n\nThe assistant accidentally answers something using an HR spreadsheet or salary export the user should never have seen.\n\nTechnically the solution is easy:\n\npermission filtering before semantic retrieval.\n\nIn reality:\n\nIn EU environments this becomes even more annoying because GDPR changes this from “oops” into potentially reportable incident territory.\n\nHonestly I would not even start a pilot anymore before the customer can explain who should access what.\n\nEverybody budgets the first embedding run.\n\nAlmost nobody budgets:\n\nEmbedding APIs look cheap until somebody realizes the SharePoint dump contains 800 million tokens.\n\nWhat seems to become the default setup now:\n\nOtherwise migrations become painful very quickly.\n\nThis changes the architecture more than many US blog posts suggest.\n\nOn-premise is usually the default ask now.\n\nGDPR + Art. 28 contracts eliminate half the providers immediately.\n\nMost legal departments only accept a very small shortlist without months of discussions.\n\nAlso:\n\nright-to-erasure with vector DBs is more annoying than many teams expect. 
If embeddings are derived from customer documents, you need to know exactly where they are.\n\nStill feels like many teams underestimate how much “boring infrastructure work” is inside production RAG systems.\n\nThe LLM part is honestly often the easiest component.\n\nIf you want a longer version with concrete vendor breakdowns and cost ranges, we wrote one up here: [RAG mit eigenen Daten](https://dagentic.de/blog/rag-eigene-daten/) (in German). The broader take on agentic AI in EU-regulated environments: [KI-Agenten im Mittelstand 2026](https://dagentic.de/blog/ki-agenten-mittelstand-2026/).", "url": "https://wpnews.pro/news/most-rag-problems-are-r-etrieval-problems", "canonical_source": "https://dev.to/dagentic/most-rag-problems-are-retrieval-problems-327h", "published_at": "2026-05-27 14:09:52+00:00", "updated_at": "2026-05-27 14:40:10.955783+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-infrastructure", "mlops"], "entities": ["SharePoint", "pypdf", "LLM"], "alternates": {"html": "https://wpnews.pro/news/most-rag-problems-are-r-etrieval-problems", "markdown": "https://wpnews.pro/news/most-rag-problems-are-r-etrieval-problems.md", "text": "https://wpnews.pro/news/most-rag-problems-are-r-etrieval-problems.txt", "jsonld": "https://wpnews.pro/news/most-rag-problems-are-r-etrieval-problems.jsonld"}}