Your RAG App Is Broken Because You're Still Parsing PDFs Like It's 2023


Source: dev.to · Published 2026-06-14T03:05Z

A developer argues that most RAG applications fail because they still parse PDFs as plain text instead of preserving document layout. The recommended approach is to use tools that output structured Markdown, such as IBM's Docling for bulk processing or vision-language models like Mistral OCR for complex layouts. Legacy libraries like PyPDF and pdfminer are no longer sufficient for modern RAG pipelines.

2 min read

Most developers building "chat with your data" apps hit the exact same wall. You chunk the text, embed it, dump it in a vector database, and the retrieval is still terrible. The model hallucinates or completely scrambles tables.

People think data ingestion is just text extraction. It isn't. In 2026, text extraction is a solved, boring problem. The actual hard part is layout. If your ingestion layer doesn't know that a bold header implies hierarchy, or that a two-column page isn't just one long string of text read left-to-right, your LLM is reading garbage.
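To make the two-column failure concrete, here is a minimal sketch. The `Word` class, the coordinates, and the fixed `gutter_x` split are all made up for illustration; real layout engines detect columns instead of being told where the gutter is.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word; grows downward


def naive_order(words):
    """Read the page row by row, left to right, ignoring columns."""
    ordered = sorted(words, key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in ordered)


def column_aware_order(words, gutter_x):
    """Read the whole left column first, then the whole right column.

    gutter_x is a hypothetical, hand-picked x position between the columns.
    """
    left = sorted((w for w in words if w.x < gutter_x), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= gutter_x), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)


# A toy two-column page: the left column is about revenue, the right about risks.
page = [
    Word("Revenue", 50, 100), Word("Risks", 350, 100),
    Word("grew", 50, 120), Word("include", 350, 120),
    Word("12%", 50, 140), Word("churn", 350, 140),
]

print(naive_order(page))              # Revenue Risks grew include 12% churn
print(column_aware_order(page, 200))  # Revenue grew 12% Risks include churn
```

The first string is what a plain-text extractor can hand your embedder: two unrelated sentences interleaved word by word.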

We've mostly stopped treating PDFs as plain text. Markdown is now the default format for document ingestion, simply because it preserves structure.

Modern ingestion tools don't just dump strings. They output Markdown where headers, lists, and tables actually mean something. This gives the LLM the context it needs to figure out where a piece of information lived in the original document, which makes citations and retrieval significantly more accurate.
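Here is a rough sketch of what that structure buys you at chunking time. The `chunk_markdown` function and its output shape are my own illustration, not any library's API. It splits on Markdown headings and tags each chunk with its full heading path, so a retrieved chunk can say where it came from. It does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown):
    """Split Markdown into chunks, each tagged with its heading path."""
    chunks = []
    path = []    # stack of (level, title) for the current position
    buffer = []  # body lines collected since the last heading

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            section = " > ".join(title for _, title in path)
            chunks.append({"section": section, "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks


doc = """# Report
Intro text.
## Pricing
| Plan | Cost |
|---|---|
| Pro | $20 |
## Limits
Max 5 users.
"""

for chunk in chunk_markdown(doc):
    print(chunk["section"], "->", repr(chunk["text"]))
```

The table stays in one chunk under "Report > Pricing" instead of being cut at an arbitrary character count, and that section label is what you would surface as a citation.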

Right now, there are basically two ways to handle this layout problem.

First, you have local deterministic engines like IBM's Docling or OpenDataLoader PDF. Docling has quietly become a standard for enterprise RAG because it natively handles PDFs alongside Office formats such as DOCX, PPTX, and XLSX, and spits out clean Markdown. It runs locally and does not need a GPU. OpenDataLoader PDF does something similar. If you have a massive volume of private documents, this is the realistic path.
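For a sense of how little code the local path takes, here is a minimal sketch using Docling's `DocumentConverter`. The `to_markdown` wrapper and its optional `converter` argument are my own additions so the converter can be reused or swapped out in tests. It assumes `docling` is installed, and the file name in the usage comment is hypothetical.

```python
def to_markdown(source, converter=None):
    """Convert a document (path or URL) to Markdown.

    If no converter is passed, build Docling's default one. The import is
    deferred so this module still loads where docling is not installed.
    """
    if converter is None:
        from docling.document_converter import DocumentConverter
        converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


# Usage, assuming docling is installed and the (hypothetical) file exists:
#   markdown = to_markdown("quarterly_report.pdf")
#   print(markdown[:500])
```

For bulk jobs you would build one converter up front and pass it in on every call, rather than constructing a new one per file.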

Then you have the Vision-Language Model (VLM) approach. Instead of trying to parse messy PDF code, tools like Mistral OCR and LlamaParse just look at the document as an image. They see it the way we do. This completely bypasses the nightmare of multi-column layouts and nested tables that broke older parsers.

VLM parsing feels like magic, but it's expensive. If you process millions of pages, running everything through a cloud vision API will destroy your budget.

If I'm building a RAG pipeline today, my default is a robust local engine like Docling for the bulk of the documents. I only reach for the expensive VLM calls when a PDF is too visually complex for the local parser to figure out. Whatever you do, don't use legacy libraries like PyPDF or pdfminer for RAG anymore. If your ingestion layer isn't outputting structured Markdown or using vision to understand layout, your app is broken before the prompt even starts.
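That routing can be sketched in a few lines. Everything here is an assumption for illustration: `local_parse` and `vlm_parse` are callables you would supply (for example a Docling wrapper and a cloud OCR client), and the "too complex" check is a crude characters-per-page heuristic with an arbitrary threshold, not a tested rule.

```python
from typing import Callable, Tuple


def looks_too_complex(markdown: str, page_count: int,
                      min_chars_per_page: int = 200) -> bool:
    """Crude heuristic: very little text per page suggests a scan or a
    layout the local parser could not read. The threshold is arbitrary."""
    return len(markdown.strip()) < min_chars_per_page * page_count


def ingest(path: str, page_count: int,
           local_parse: Callable[[str], str],
           vlm_parse: Callable[[str], str]) -> Tuple[str, str]:
    """Try the cheap local parser first; pay for the VLM only on failure.

    Returns (markdown, route), where route is "local" or "vlm".
    """
    markdown = local_parse(path)
    if looks_too_complex(markdown, page_count):
        return vlm_parse(path), "vlm"
    return markdown, "local"


# Stub parsers standing in for real ones, just to show the control flow.
def good_local(path):
    return "# Contract\n" + "Clause text. " * 100


def empty_local(path):
    return ""


def fake_vlm(path):
    return "# Scanned form\n| Field | Value |"


print(ingest("contract.pdf", 2, good_local, fake_vlm)[1])   # local
print(ingest("scan.pdf", 2, empty_local, fake_vlm)[1])      # vlm
```

In practice you would swap the heuristic for whatever signal you trust, such as missing tables or a parser confidence score if your tool exposes one, and log the route so you can see what share of pages hits the paid path.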

Read original on dev.to → dev.to/hefty_69a4c2d631c9dd70724/your-rag-app-is…
