Source: dev.to · Topic: large-language-models

Your RAG App Is Broken Because You're Still Parsing PDFs Like It's 2023

A developer argues that most RAG applications fail because they still parse PDFs as plain text instead of preserving document layout. The recommended approach is to use tools that output structured Markdown, such as IBM's Docling for bulk processing or vision-language models like Mistral OCR for complex layouts. Legacy libraries like PyPDF and pdfminer are no longer sufficient for modern RAG pipelines.

2 min read · Published Jun 14, 2026

Most developers building "chat with your data" apps hit the exact same wall. You chunk the text, embed it, dump it in a vector database, and the retrieval is still terrible. The model hallucinates or completely scrambles tables.

People think data ingestion is just text extraction. It isn't. In 2026, text extraction is a solved, boring problem. The actual hard part is layout. If your ingestion layer doesn't know that a bold header implies hierarchy, or that a two-column page isn't just one long string of text read left-to-right, your LLM is reading garbage.
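To make the two-column failure concrete, here is a minimal sketch. It assumes a parser has already produced text blocks with page coordinates; the `Block` type, the sample coordinates, and the split at the page midpoint are illustrative assumptions, not how any particular tool works.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge of the block
    y: float  # top edge of the block (0 = top of the page)


# Hypothetical two-column page, 600 units wide.
PAGE_WIDTH = 600
blocks = [
    Block("Left column, first paragraph.", x=50, y=100),
    Block("Right column, first paragraph.", x=330, y=100),
    Block("Left column, second paragraph.", x=50, y=200),
    Block("Right column, second paragraph.", x=330, y=200),
]


def naive_order(blocks):
    # Row by row, left to right: roughly what a layout-unaware extractor does.
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_aware_order(blocks, page_width=PAGE_WIDTH):
    # Assign each block to a column by its left edge,
    # then read each column top to bottom.
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x < mid), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= mid), key=lambda b: b.y)
    return [b.text for b in left + right]
```

The naive ordering interleaves the two columns, so a chunk ends up mixing sentences from unrelated paragraphs. The column-aware ordering keeps each column's paragraphs together. Real pages need more than a midpoint split (full-width headers, figures, three columns), which is the part layout-aware tools take on.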

We've mostly stopped treating PDFs as plain text. Markdown is now the default format for document ingestion, simply because it preserves structure.

Modern ingestion tools don't just dump strings. They output Markdown where headers, lists, and tables actually mean something. This gives the LLM the context it needs to figure out where a piece of information lived in the original document, which makes citations and retrieval significantly more accurate.
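One way to use that structure downstream is to chunk on headings and carry the heading path along with each chunk. The sketch below is an assumption about how you might do it, not a method from any specific tool; `chunk_markdown` and the `section` / `text` keys are hypothetical names.

```python
import re

# Matches ATX headings such as "# Title" or "## Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(md: str):
    """Split Markdown into chunks, each tagged with its heading path.

    Simplification: lines starting with '#' inside fenced code blocks
    would be misread as headings.
    """
    chunks = []
    path = []  # current heading stack, e.g. ["Report", "Revenue"]
    body = []  # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"section": " > ".join(path), "text": text})
        body.clear()

    for line in md.splitlines():
        m = HEADING.match(line)
        if m:
            flush()
            level = len(m.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(m.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk now knows it came from, say, "Report > Revenue", so a table stays attached to the section that explains it, and that path can be stored as metadata next to the embedding and surfaced in citations.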

Right now, there are basically two ways to handle this layout problem.

First, you have local deterministic engines like IBM's Docling or OpenData PDF. Docling has quietly become a standard for enterprise RAG because it natively handles the whole Office suite and spits out clean Markdown. It runs locally without a GPU. OpenData does something similar. If you have a massive volume of private documents, this is the realistic path.
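For reference, a minimal sketch of the Docling path, assuming the `docling` package is installed and using the converter from its quickstart. The wrapper name `pdf_to_markdown` and the file name in the usage comment are hypothetical.

```python
def pdf_to_markdown(source: str) -> str:
    """Convert a document to Markdown with Docling (assumes `pip install docling`)."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # local file path or URL
    return result.document.export_to_markdown()


# Usage (hypothetical file name):
#   markdown = pdf_to_markdown("quarterly_report.pdf")
```

The first call is likely to download Docling's layout models, so expect it to be slower than later runs.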

Then you have the Vision-Language Model (VLM) approach. Instead of trying to parse messy PDF code, tools like Mistral OCR and LlamaParse just look at the document as an image. They see it the way we do. This completely bypasses the nightmare of multi-column layouts and nested tables that broke older parsers.

VLM parsing feels like magic, but it's expensive. If you process millions of pages, running everything through a cloud vision API will destroy your budget.
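A back-of-the-envelope estimate shows how this scales. The price used in the usage note is a placeholder, not a quote from any provider; check current pricing before relying on the numbers.

```python
def vlm_cost_usd(pages: int, usd_per_1000_pages: float) -> float:
    """API cost of sending `pages` pages through a per-page-priced vision service."""
    return pages / 1000 * usd_per_1000_pages


def blended_cost_usd(pages: int, vlm_fraction: float, usd_per_1000_pages: float) -> float:
    """API cost when only a fraction of pages goes to the VLM.

    Assumption: local parsing has no per-page API fee (hardware cost is ignored).
    """
    return vlm_cost_usd(round(pages * vlm_fraction), usd_per_1000_pages)
```

With a placeholder price of 2.0 USD per 1,000 pages, `vlm_cost_usd(5_000_000, 2.0)` comes to 10,000 USD for five million pages, while `blended_cost_usd(5_000_000, 0.05, 2.0)` comes to 500 USD if only 5% of pages need the vision model.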

If I'm building a RAG pipeline today, my default is a robust local engine like Docling for the bulk of the documents. I only reach for the expensive VLM calls when a PDF is too visually complex for the local parser to figure out. Whatever you do, don't use legacy libraries like PyPDF or pdfminer for RAG anymore. If your ingestion layer isn't outputting structured Markdown or using vision to understand layout, your app is broken before the prompt even starts.
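That local-first, VLM-as-fallback routing can be sketched as below. The parsers are passed in as plain callables, and `looks_broken` is a deliberately crude, assumed heuristic; a real pipeline would likely use better signals, such as table detection or parser confidence.

```python
from typing import Callable, Tuple


def looks_broken(markdown: str, min_chars: int = 200) -> bool:
    """Crude, assumed heuristic for 'the local parser struggled'."""
    text = markdown.strip()
    if len(text) < min_chars:  # near-empty output, e.g. a scanned page
        return True
    # A high share of U+FFFD replacement characters suggests garbled extraction.
    return text.count("\ufffd") / len(text) > 0.01


def parse_with_fallback(
    path: str,
    local_parse: Callable[[str], str],
    vlm_parse: Callable[[str], str],
    is_bad: Callable[[str], bool] = looks_broken,
) -> Tuple[str, str]:
    """Try the cheap local parser first; pay for the VLM only when needed."""
    markdown = local_parse(path)
    if is_bad(markdown):
        return vlm_parse(path), "vlm"
    return markdown, "local"
```

Returning which parser was used makes it easy to log the share of documents that hit the expensive path and to tune the heuristic over time.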

