Most RAG Problems Are R(etrieval) Problems


[ARTICLE · art-15351] src=dev.to ↗ pub=2026-05-27T14:09Z topic=large-language-models verified=true sentiment=↓ negative

A developer reports that most failures in production RAG (Retrieval-Augmented Generation) systems stem from retrieval and data quality issues, not the language model itself. Common problems include degraded retrieval accuracy when scaling from demo-sized datasets to real-world document stores, corrupted text from PDF parsing, outdated or conflicting source documents, and permission leaks that expose sensitive data. The engineer notes that adding a reranker and investing in preprocessing infrastructure solved more quality issues than changing the LLM, and that EU-specific challenges like GDPR compliance, on-premise requirements, and right-to-erasure with vector databases add significant complexity.

2 min read · 13 views · published May 27, 2026

Most RAG blog posts read like product brochures. After building a few systems over the last few months and reading way too many production post-mortems, I'm pretty convinced the LLM is usually not the thing that breaks first.

Especially not in EU mid-market deployments.

A few failure modes I see again and again:

The demo with 500 PDFs looks amazing.

Then the first real pilot starts, somebody uploads 30k documents from SharePoint, and suddenly top-3 retrieval becomes semi-random.

Typical example:

The query is "Lieferantenbewertung 2024" (German for "supplier evaluation 2024").

What comes back:

This problem is way more common than most tutorials mention.

What people in production seem to converge on:

Honestly, adding a reranker solved more quality issues for us than changing the LLM ever did.
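A minimal sketch of the retrieve-then-rerank idea, to make the shape of it concrete. Everything here is an assumption for illustration: the function names, the sample documents, and especially the two toy scorers, which stand in for a real embedding search and a real reranking model.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    documents: List[str],
    cheap_score: Scorer,
    rerank_score: Scorer,
    candidates: int = 50,
    top_k: int = 3,
) -> List[str]:
    """Two-stage retrieval: wide, cheap recall first, precise reranking second."""
    # Stage 1: take a generous shortlist with the fast scorer.
    shortlist = sorted(
        documents, key=lambda d: cheap_score(query, d), reverse=True
    )[:candidates]
    # Stage 2: re-order only the shortlist with the slower, stricter scorer.
    reranked = sorted(
        shortlist, key=lambda d: rerank_score(query, d), reverse=True
    )
    return reranked[:top_k]


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def token_overlap(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def exact_phrase_bonus(query: str, doc: str) -> float:
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return token_overlap(query, doc) + bonus


# Made-up document titles, not taken from any real system.
docs = [
    "Lieferantenbewertung 2023 Zusammenfassung",
    "Protokoll 2024 Lieferantenbewertung offen",
    "Lieferantenbewertung 2024 final",
    "Reisekostenrichtlinie 2024",
]
result = retrieve_then_rerank(
    "Lieferantenbewertung 2024", docs, token_overlap, exact_phrase_bonus,
    candidates=3, top_k=1,
)
```

The cheap scorer cannot tell the two 2024 documents apart, so the second stage is what pushes the exact match to the top. In a real setup the second scorer would be a proper reranking model, which is slower per document and therefore only sees the shortlist.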

Most demos run on clean PDFs.

Real document stores are:

pypdf turns many of these into complete garbage text.

Things I saw multiple times already:

ü becoming weird symbols

Current stack that works reasonably okay:

This preprocessing layer is very unsexy work, but probably 30% of the actual implementation effort.

And if you skip it, the whole RAG quality later becomes fake-good.
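One small piece of that preprocessing layer, sketched under assumptions: the text above does not say which kind of "weird symbols" showed up, so this handles two failure modes that are common with umlauts in general. One is UTF-8 bytes that were decoded as Latin-1, the other is decomposed characters (a plain `u` followed by a combining diaeresis). The function name `repair_text` is hypothetical, and the first step is a heuristic, not a guarantee.

```python
import unicodedata


def repair_text(text: str) -> str:
    """Best-effort cleanup for two common umlaut failure modes."""
    # Case 1: UTF-8 bytes that were wrongly decoded as Latin-1,
    # which shows up as two characters where one umlaut should be.
    try:
        text = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the text was most likely fine already.
        pass
    # Case 2: decomposed characters ("u" + combining diaeresis).
    # NFC normalization composes them into a single code point.
    return unicodedata.normalize("NFC", text)


mojibake = "Pr\u00c3\u00bcfung"    # "Prüfung" after a wrong Latin-1 decode
decomposed = "Pru\u0308fung"       # "u" followed by U+0308
clean = "Pr\u00fcfung"             # already correct

fixed = [repair_text(s) for s in (mojibake, decomposed, clean)]
```

All three inputs come out as the same correctly composed string, and text that is already clean passes through unchanged. Running something like this before chunking also means the same word always embeds the same way.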

Every stakeholder asks:

“What about hallucinations?”

Almost nobody asks:

“What if the source itself is outdated?”

This kills more pilots from what I’ve seen.

The model gives a perfectly grounded answer.

It cites the right document.

The document is just no longer valid.

Or worse:

two valid documents disagree and the system confidently picks one.

What seems to work:

A lot of “hallucination problems” are actually retrieval problems wearing a fake mustache.

This one appears in basically every internal rollout thread.

The assistant accidentally answers something using an HR spreadsheet or salary export the user should never have seen.

Technically the solution is easy:

permission filtering before semantic retrieval.

In reality:

In EU environments this becomes even more annoying because GDPR changes this from “oops” into potentially reportable incident territory.

Honestly I would not even start a pilot anymore before the customer can explain who should access what.
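The "permission filtering before semantic retrieval" idea from above can be sketched in a few lines. This is an illustration under assumptions, not a production design: the `Chunk` class, the group names, and the tiny two-dimensional embeddings are all made up, and a plain list stands in for the vector database.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Chunk:
    text: str
    embedding: Sequence[float]
    allowed_groups: FrozenSet[str]  # groups permitted to read the source


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_embedding: Sequence[float],
    user_groups: FrozenSet[str],
    chunks: List[Chunk],
    top_k: int = 3,
) -> List[Chunk]:
    # Filter FIRST: chunks the user may not read never reach the ranking,
    # so they cannot end up in the prompt no matter how similar they are.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    ranked = sorted(
        visible, key=lambda c: cosine(query_embedding, c.embedding), reverse=True
    )
    return ranked[:top_k]


# Hypothetical sample data.
chunks = [
    Chunk("Salary export 2024", (0.9, 0.1), frozenset({"hr"})),
    Chunk("Travel expense policy", (0.7, 0.3), frozenset({"hr", "staff"})),
    Chunk("Canteen menu", (0.1, 0.9), frozenset({"staff"})),
]
hits = search((1.0, 0.0), frozenset({"staff"}), chunks, top_k=2)
```

The salary export is the closest match to the query vector, but a user who is only in `staff` never sees it. Filtering after ranking would technically also hide it, but then a restricted document can silently eat one of the top-k slots.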

Everybody budgets the first embedding run.

Almost nobody budgets:

Embedding APIs look cheap until somebody realizes the SharePoint dump contains 800 million tokens.

What seems to become the default setup now:

Otherwise migrations become painful very quickly.
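A back-of-the-envelope calculation for the 800 million token case. Only the token count comes from the text above. The price per million tokens is a hypothetical placeholder, and counting migrations as full re-embedding runs is an assumption about where the unbudgeted cost comes from.

```python
def embedding_cost(tokens: int, price_per_million: float, runs: int = 1) -> float:
    """Cost of embedding `tokens` tokens `runs` times at a flat price."""
    return tokens / 1_000_000 * price_per_million * runs


CORPUS_TOKENS = 800_000_000   # figure from the text
PRICE_PER_MILLION = 0.10      # hypothetical price, plug in your provider's

first_run = embedding_cost(CORPUS_TOKENS, PRICE_PER_MILLION)
# Assumption: two later model or chunking changes each force a full re-embed.
with_two_migrations = embedding_cost(CORPUS_TOKENS, PRICE_PER_MILLION, runs=3)

print(f"first run: {first_run:.2f}, with two migrations: {with_two_migrations:.2f}")
```

The point is less the absolute number than the multiplier: every full re-embed costs the same as the first run again.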

This changes the architecture more than many US blog posts suggest.

On-premise is usually the default ask now.

GDPR + Art. 28 contracts eliminate half the providers immediately.

Most legal departments only accept a very small shortlist without months of discussions.

Also:

right-to-erasure with vector DBs is more annoying than many teams expect. If embeddings are derived from customer documents, you need to know exactly where they are.
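"Knowing exactly where they are" can be as simple as a registry that maps each source document to the vectors derived from it. A minimal sketch, assuming a plain dict as a stand-in for the vector database. The `EmbeddingRegistry` class and all IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class EmbeddingRegistry:
    """Tracks which vector IDs were derived from which source document."""

    def __init__(self) -> None:
        self._by_document: Dict[str, Set[str]] = defaultdict(set)

    def record(self, document_id: str, vector_id: str) -> None:
        # Call this at indexing time, once per stored chunk embedding.
        self._by_document[document_id].add(vector_id)

    def erase(self, document_id: str, vector_store: Dict[str, List[float]]) -> List[str]:
        """Delete every vector derived from the document, return removed IDs."""
        vector_ids = sorted(self._by_document.pop(document_id, set()))
        for vector_id in vector_ids:
            vector_store.pop(vector_id, None)
        return vector_ids


# A dict stands in for the real vector database.
store = {"v1": [0.1, 0.2], "v2": [0.3, 0.4], "v3": [0.5, 0.6]}
registry = EmbeddingRegistry()
registry.record("contract-17", "v1")
registry.record("contract-17", "v2")
registry.record("handbook", "v3")

removed = registry.erase("contract-17", store)
```

After the call, only the handbook vector is left in the store, and the returned list doubles as an audit record of what was deleted. A real system would also have to cover caches, backups, and any second index that holds copies.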

Still feels like many teams underestimate how much “boring infrastructure work” is inside production RAG systems.

The LLM part is honestly often the easiest component.

If you want a longer version with concrete vendor breakdowns and cost ranges, we wrote one up here: [RAG mit eigenen Daten](https://dagentic.de/blog/rag-eigene-daten/) (in German). The broader take on agentic AI in EU-regulated environments: [KI-Agenten im Mittelstand 2026](https://dagentic.de/blog/ki-agenten-mittelstand-2026/).


Read original on dev.to → dev.to/dagentic/most-rag-problems-are-retrieval-…
