Benchmarking RAG Architectures Locally on a Real Financial PDF — Part 3: The Measurement Problem


Two earlier articles left two loose threads. In Part 1, the strongest text pipeline had the lowest context precision. In Part 2, Graph RAG was the most faithful pipeline and yet one of the worst overall, while Vision RAG was the most correct and one of the least faithful. Each looked like a small anomaly on its own. Put all six methods on one page and they turn out to be the same anomaly, and it is not really a fact about the pipelines. It is a fact about the metrics.

This final article does three things: lay out all six architectures together, break the results down by question type, and explain why two of the five standard RAGAS metrics [1] point in the opposite direction from the truth.

Here is the whole study on the same 57 questions, same local judge (qwen3:14b). Vision RAG is shown with both vision models from Part 2.

Read the last column and Vision RAG wins clearly: 0.823 answer correctness, well above the best text method’s 0.544. Read the first column and you reach the opposite conclusion: Vision RAG is near the bottom on faithfulness (0.538), while Graph RAG — the most faithful pipeline in the study at 0.902 — is the least correct (0.238). Faithfulness simply does not track correctness: the most faithful pipeline is also the least accurate one.

So which is it, best or worst? The answer depends entirely on which question you actually care about, and the two questions are not the same. The middle of the table is worth a glance on the way past. Among the text methods, Hybrid + ColBERT is the strongest (0.544), as Part 1 found. ColPali sits just below it (0.528): its visual retrieval finds the right page more cleanly than any text method, but generating from that page’s text caps the payoff. Before resolving the contradiction, it helps to see where the correctness comes from.

Every question was tagged by where its answer lives: in plain text (20 questions), in a table (19), or in a chart (18). Splitting answer correctness by that tag is the most revealing cut in the study:

The chart that matters most is the contrast between the best text pipeline and vision:

Look at the three groups in order. On table questions the two are close — 0.66 for text RAG, 0.72 for vision. That is the table-aware OCR from Part 1 paying off: TableFormer reconstructs the grids well enough that a text pipeline can read them, so vision’s advantage is small. On text questions vision is well ahead (0.94 vs 0.57), but text pipelines at least function. On chart questions the floor falls out: the best text pipeline manages 0.391, and vision holds at 0.812, better than two to one.
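As a quick consistency check, the per-type scores quoted above can be recombined into an overall figure using the question counts (20 text, 19 table, 18 chart). This is a minimal sketch, not the study's own code: the function name is hypothetical, and because the text and table scores are quoted to only two decimals, the result can match the headline numbers only approximately.

```python
# Question counts per answer location, as stated in the article.
COUNTS = {"text": 20, "table": 19, "chart": 18}

# Answer correctness per question type, as quoted in the prose (rounded values).
BEST_TEXT = {"text": 0.57, "table": 0.66, "chart": 0.391}
VISION = {"text": 0.94, "table": 0.72, "chart": 0.812}


def weighted_mean(scores, counts):
    """Average per-type scores, weighting each type by its question count."""
    total = sum(counts.values())
    return sum(scores[k] * counts[k] for k in counts) / total


if __name__ == "__main__":
    print(f"best text pipeline: {weighted_mean(BEST_TEXT, COUNTS):.3f}")
    print(f"vision:             {weighted_mean(VISION, COUNTS):.3f}")
    print(f"chart ratio:        {VISION['chart'] / BEST_TEXT['chart']:.2f}x")
```

Both weighted means should land close to the reported 0.544 and 0.823, which suggests the per-type breakdown and the headline column describe the same 57 questions. The chart ratio comes out just above two.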

This is the first finding, and it is not subtle: charts are recoverable only by vision. Tables survive OCR; plain text survives OCR; charts do not, because, as Part 2 showed, they come out of extraction as a bare <!-- image --> placeholder with every value gone. No retrieval strategy, no reranker, no graph trick fixes that, because the information was never in the text to retrieve. A model that reads the pixels is the only thing that recovers it.

Now back to the contradiction. Vision RAG is the most correct pipeline and one of the least faithful. That should be impossible if both metrics measure answer quality, so they must not.

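They don’t, and the difference is what each metric compares the answer against: faithfulness checks it against the retrieved context, which here is the extracted text, while answer correctness checks it against an independent reference answer.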

For a text pipeline those two references are aligned: the answer comes from the retrieved text, so checking it against that text is fair. For Vision RAG they come apart completely. Take a chart question, the kind where Part 2 watched the model read 310 straight off a NIM bar. The model reads the value off the page image, the reference answer matches, and correctness is high. But the extracted text for that page never contained the value (the chart was an <!-- image --> placeholder), so faithfulness, checking the answer against that text, sees a number that "isn't supported by the context" and marks it down. On chart questions the vision pipeline scores 0.294 faithfulness against 0.812 correctness: the metric punishes the answer precisely for being grounded in something the metric cannot see.

One question makes it concrete. Asked “What was Akbank’s total revenue in 2025 (TL mn)?”, Vision RAG answered 222,033, exactly the reference value, for an answer correctness of 1.000. Its faithfulness on that same question was 0.000. The number is right, read straight off the chart in the page image; and because that figure appears nowhere in the extracted text, the faithfulness check files it as a fabrication. A perfect answer, scored as a hallucination.

It is worth seeing why with the actual page. The value lives on the revenue chart on page 7:

Now here is everything extraction recovers from that same page: the title, three text bullets, and two empty placeholders where the charts used to be.

## Renewed NII support & strong fee income reinforced core revenues
<!-- image -->
- Strong fee income generation and NII recovery ... drove 50% YoY increase in revenue
- Fee income surged by 64% YoY ...
- NII advanced by 54% YoY ...
<!-- image -->

The figure 222,033 is not in there, and neither is any other number from the two charts. Faithfulness can only check the answer against this text, so it has no way to confirm a value it was never shown. The 0.000 is forced by the channel, not earned by the answer: the metric is grading the OCR, not the model.
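The mechanism can be reproduced with a deliberately crude stand-in for the two metrics. The sketch below is not how RAGAS computes either score (RAGAS uses an LLM judge over extracted claims); it assumes, for illustration only, that a claim is a number and that support means the number appears in the text. The names `toy_faithfulness` and `toy_correctness` are hypothetical.

```python
import re

# Text recovered from page 7, as shown in the article (ellipses are in the source).
EXTRACTED_PAGE_TEXT = """
## Renewed NII support & strong fee income reinforced core revenues
<!-- image -->
- Strong fee income generation and NII recovery ... drove 50% YoY increase in revenue
- Fee income surged by 64% YoY ...
- NII advanced by 54% YoY ...
<!-- image -->
"""

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text):
    """Return the set of numeric tokens in text, with thousands separators removed."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def toy_faithfulness(answer, context):
    """Fraction of the answer's numbers that also appear in the context."""
    claimed = numbers_in(answer)
    if not claimed:
        return 1.0  # assumption: no numeric claims means nothing to contradict
    return len(claimed & numbers_in(context)) / len(claimed)


def toy_correctness(answer, reference):
    """1.0 if the answer contains every number in the reference, else 0.0."""
    return 1.0 if numbers_in(reference) <= numbers_in(answer) else 0.0


if __name__ == "__main__":
    answer = "222,033"      # Vision RAG's answer, read off the chart
    reference = "222,033"   # the reference value
    print("faithfulness:", toy_faithfulness(answer, EXTRACTED_PAGE_TEXT))
    print("correctness: ", toy_correctness(answer, reference))
```

Even this toy version gives 0.0 faithfulness and 1.0 correctness for the revenue question, because the only numbers in the extracted text are 50, 64 and 54. Any check that sees only this text is bound to reject a value that exists only in the page image.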

The same metric will reward a wrong answer just as readily. Asked for Akbank’s capital ratio without forbearances — the reference is 16.8% — Hybrid + ColBERT returned a different entity’s number entirely: “Akbank Group’s European flagship has a robust 34% CAR.” That answer is almost completely wrong (0.14 correctness), yet it scored 0.667 faithfulness, because the wrong figure was copied faithfully from a chunk the pipeline had retrieved. Vision RAG answered 16.8% (correct) and scored 0.000. And this time the channel gap is not the explanation: unlike the revenue figure, 16.8% CAR sits verbatim in the retrieved text, so faithfulness had the right value in front of it and still rejected the correct answer while rewarding the wrong one. On terse one-number answers like these the metric is checking the surface form of a single claim rather than whether it is right, and it is noisy enough to sometimes do the exact opposite of its job.

Graph RAG is the same coin, flipped. It mostly answers “that figure isn’t in the provided context.” An answer that makes no claims cannot contradict its context, so it is trivially faithful (0.902, the highest in the study) while being 0.238 correct, the lowest. Here the metric rewards saying nothing.

The chart case and the graph case come from one fact: four of the five metrics measure grounding in the extracted text, not correctness. That is a perfectly good thing to measure for a text pipeline. It is the wrong instrument for a pipeline that deliberately bypasses the text and reads the page. It is also a noisy measure for the terse numeric answers in this study, so the faithfulness column is best read as directional, not precise. Answer correctness carries the whole comparison, so I checked that judge against the reference answers directly: exact matches land near 1.0, clearly wrong answers below 0.25, and minor format gaps (a missing percent sign or unit) around 0.7. That is sensible grading, and it is what lets the final column be trusted where the others cannot. Judge Vision RAG by faithfulness and you will rank the most accurate system near the bottom and ship a worse one. Only answer correctness, anchored to an independent reference answer, tracks what actually happened.
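The ranking flip can be checked directly from the two pipelines for which the text quotes both scores. This is a small sketch using only those figures; the other pipelines are left out because their overall faithfulness is not stated here, and `rank_by` is a hypothetical helper.

```python
# (faithfulness, answer correctness), as quoted in the article.
SCORES = {
    "Graph RAG": (0.902, 0.238),
    "Vision RAG": (0.538, 0.823),
}

FAITHFULNESS, CORRECTNESS = 0, 1


def rank_by(scores, index):
    """Return pipeline names ordered best-first on the chosen metric."""
    return sorted(scores, key=lambda name: scores[name][index], reverse=True)


if __name__ == "__main__":
    print("by faithfulness:", rank_by(SCORES, FAITHFULNESS))
    print("by correctness: ", rank_by(SCORES, CORRECTNESS))
```

The two orderings are exact reverses of each other, so the choice of column decides the winner.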

There is a subtler version of the same problem hiding in the context metrics. The visual pipelines (ColPali and Vision RAG) retrieve whole pages, while the text pipelines retrieve small chunks — so “how much of the retrieved context was relevant” is judged at the page level for one and the chunk level for the other. A page and a chunk are different units of “relevant,” which shifts context precision, context recall, and faithfulness for reasons unrelated to whether the answer was found. (It is not that the visual pipelines flood the judge with text: these image-only slides carry so little extractable text that a retrieved page is no bigger than a couple of chunks.) That makes those metrics hard to compare straight across the two kinds of architecture — one more reason the verdict has to rest on answer correctness.
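A toy calculation shows how the unit of retrieval alone can move a precision-style score. The sketch uses a simplified definition (relevant units divided by retrieved units), which is an assumption and not the RAGAS formula, and the retrieval results are invented for illustration rather than taken from the study.

```python
def unit_precision(relevance_flags):
    """Share of retrieved units flagged relevant (a simplification, not the RAGAS formula)."""
    if not relevance_flags:
        raise ValueError("no retrieved units")
    return sum(relevance_flags) / len(relevance_flags)


# Hypothetical retrieval results for one question; both contain the answer.
chunk_level = [True, False, False, False, False]  # 5 small chunks, 1 holds the answer
page_level = [True, False]                        # 2 whole pages, 1 holds the answer

if __name__ == "__main__":
    print("chunk-level precision:", unit_precision(chunk_level))
    print("page-level precision: ", unit_precision(page_level))
```

Both hypothetical pipelines retrieved the answer once, yet the scores differ (0.2 against 0.5) purely because one counts chunks and the other counts pages.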

There are two obvious objections, and both lead somewhere I chose not to go.

The first: just fix the metric. RAGAS has a multimodal faithfulness variant [2] that judges the answer against the page image instead of the text. That would “repair” Vision RAG’s score — but it would mean grading text pipelines against text and vision pipelines against images, scoring the same metric two different ways across the very systems I am comparing. That is not a comparison anymore. A vision-grounded faithfulness check is worth running, but only as a separate vision-only diagnostic, with an independent vision judge (not the generator grading its own work).

The second, bigger one: why keep a lossy text layer at all? You could parse the whole document with a VLM at index time, or add chart-de-rendering models, and hand every pipeline good text. That is a reasonable thing to build — but it dissolves the exact distinction this study measures. It becomes a different, three-arm question (OCR vs VLM-parsing vs query-time vision), not an addition to this one. Each extra axis makes the result muddier, not richer. One question, one set of held-fixed variables; I kept it there on purpose.

On a real, image-only corporate financial PDF, evaluated end to end on a single machine with open models, the most accurate RAG architecture is Vision RAG with a small local VLM: qwen3-vl:8b at 0.823 answer correctness, against 0.544 for the best text pipeline. Its advantage is concentrated exactly where the document stops being text and becomes a picture: charts, where it more than doubles the best text result.

But the most useful lesson is the measurement one. If I had evaluated these systems the standard way, by faithfulness, I would have placed the best pipeline near the bottom and the worst near the top. The metric has to match the architecture. For multimodal RAG, trust answer correctness against an independent reference, and read faithfulness and the context metrics for what they are: diagnostics of the text channel, not verdicts on the answer.

That point outlives this one document. Choosing and aligning your evaluation metric to the architecture and the data is itself a design decision, not a box to tick at the end — pick the wrong one and it will rank your systems backwards, confidently. Before trusting any RAG leaderboard, including your own, it is worth checking that the metric is measuring the thing you actually care about.

And the constraint that started the series — no proprietary model endpoints, everything local — turned out to cost almost nothing. A small open vision model, run on one workstation, reads a messy financial deck more accurately than any of the text architectures it was compared against. Read these numbers as a floor, not a ceiling: they come from open models on a single machine, and a larger local model or a proper GPU server would lift every one of them. What stays true is the shape of the comparison — and the fact that a fully offline stack can already read a document most people assume needs the cloud. For a regulated environment that cannot send its data to an external API, that is the result that matters.

[1] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAS: Automated evaluation of retrieval augmented generation,” in Proc. 18th Conf. European Chapter of the ACL (EACL): System Demonstrations, 2024. arXiv:2309.15217.

[2] Exploding Gradients, “Multimodal faithfulness,” RAGAS documentation. [Online]. Available: https://docs.ragas.io

Citations for the individual architectures (RAG, ColBERT, Graph RAG, ColPali, Qwen2-VL, and the local stack) appear in Parts 1 and 2.

Benchmarking RAG Architectures Locally on a Real Financial PDF — Part 3: The Measurement Problem was originally published in Towards AI on Medium.
