# Paper Walkthrough — MACT: A Multi-Agent Collaboration Framework for Visual Document Understanding

> Source: <https://pub.towardsai.net/paper-walkthrough-mact-a-multi-agent-collaboration-framework-for-visual-document-understanding-12a907406a1b?source=rss----98111c9905da---4>
> Published: 2026-07-04 07:06:28+00:00

A financial report is not a photograph.

It is a stack of dense tables, cropped charts, multi-column text, and footnotes — all demanding a different kind of attention at every step. Reading it isn’t a single glance; it’s a *procedure*: figure out what’s being asked, decide where to look, extract the right numbers, check your own work, and only then answer.

Most large Vision-Language Models (VLMs) don’t work that way. They take a document and a question, run one forward pass, and hope the answer falls out. **MACT**, a CVPR 2026 paper from NUS, Tencent Youtu Lab, Tsinghua, and collaborators, argues that this “monolithic scaling” — just making the model bigger — is the wrong lever to pull for documents. Instead, MACT splits the job into four cooperating agents, each specialized for one cognitive step, and lets a smaller model beat much larger ones.

In this walkthrough, we’ll trace the full story: a) why bigger VLMs are hitting diminishing returns on documents, b) how MACT’s four agents divide the labor, c) how test-time compute is allocated differently to each of them, and d) how the reward signal is designed so the agents don’t just game their own local objective.

Scaling up parameter count has been the default strategy for VLMs, and it works well for general vision-language tasks. But for document-based visual question answering (VQA), the paper shows that increasing parameter count yields only marginal gains while computational cost climbs steeply.

The paper attributes this to three properties that documents have, and natural images mostly don’t:

The proposed fix is a shift from *monolithic scaling* to what the authors call **procedural scaling**: decompose the workflow into functional roles, and scale each role independently, in a way that the role actually benefits from more compute.

MACT decomposes document understanding into four collaborative agents, each handling one stage of the workflow:

**Planning Agent (𝒜_plan).** Given the question and the document, this agent doesn’t just write one plan — it first generates several relevant reference plans for similar hypothetical tasks (inspired by analogical prompting), then uses those as scaffolding to produce the actual execution plan for the real question. The plan stays high-level: it states targets and requirements per step but doesn’t commit to specific tools or execution details, so it doesn’t box in the agent that has to actually do the work.
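The two-stage planning described above can be sketched as a pair of prompts chained through a model call. This is a minimal illustration, not the paper's implementation: the `LLM` callable, the prompt wording, and the function names are all assumptions, and the document image that the real agent also receives is left out to keep the example text-only.

```python
from typing import Callable, List

# Hypothetical text-only LLM interface: prompt in, generated text out.
LLM = Callable[[str], str]


def reference_prompt(question: str, n_refs: int) -> str:
    """Stage 1: ask for plans on similar, made-up tasks."""
    return (
        f"Question about a document: {question}\n"
        f"Invent {n_refs} similar hypothetical tasks and write a short "
        "high-level plan for each."
    )


def plan_prompt(question: str, reference_plans: str) -> str:
    """Stage 2: use the reference plans as scaffolding for the real plan."""
    return (
        f"Reference plans for similar tasks:\n{reference_plans}\n\n"
        f"Write a plan for the real question: {question}\n"
        "One step per line. State each step's target and requirements only; "
        "do not name specific tools."
    )


def make_plan(question: str, llm: LLM, n_refs: int = 3) -> List[str]:
    references = llm(reference_prompt(question, n_refs))
    raw_plan = llm(plan_prompt(question, references))
    # Split the model output into one plan step per non-empty line.
    return [line.strip() for line in raw_plan.splitlines() if line.strip()]


# Stub model so the sketch runs without any API access.
def stub_llm(prompt: str) -> str:
    if prompt.startswith("Reference plans"):
        return "Locate the revenue table\nRead the 2023 and 2024 rows\nCompute the change"
    return "Task A: locate table, read values\nTask B: locate chart, read axis"


steps = make_plan("How much did revenue grow from 2023 to 2024?", stub_llm)
print(steps)
```

The second prompt asks only for targets and requirements, which mirrors the point that the plan should not commit to tools on the execution agent's behalf.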

**Execution Agent (𝒜_exe).** This agent walks through the plan step by step, treating each step as a unit with its own definition, expected output, and access to results from prior steps. It pulls tools from a shared tool library to actually extract information — OCR, cropping, table parsing, and so on — and passes along the full trace of what it did and found.
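As a rough sketch of that step-by-step execution, the loop below treats each plan step as a record with a definition and an expected output, lets a chooser pick a tool, and gives every tool access to the results of earlier steps. Everything here is hypothetical: the toy tools, the `choose_tool` heuristic (standing in for the model's own tool selection), and the use of a plain string in place of a document image.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class PlanStep:
    definition: str       # what this step should accomplish
    expected_output: str  # what kind of result it should produce


@dataclass
class StepRecord:
    step: PlanStep
    tool_name: str
    output: Any


# A tool receives the document and the records of all earlier steps.
Tool = Callable[[str, List[StepRecord]], Any]


def run_plan(
    document: str,
    steps: List[PlanStep],
    tools: Dict[str, Tool],
    choose_tool: Callable[[PlanStep], str],
) -> List[StepRecord]:
    trace: List[StepRecord] = []
    for step in steps:
        name = choose_tool(step)               # stand-in for the model picking a tool
        output = tools[name](document, trace)  # each step can read prior results
        trace.append(StepRecord(step, name, output))
    return trace  # the full trace is what gets passed downstream


# Toy tool library (assumption: real tools would be OCR, cropping, table parsing).
def extract_numbers(document: str, trace: List[StepRecord]) -> List[int]:
    return [int(tok) for tok in document.split() if tok.isdigit()]


def sum_previous(document: str, trace: List[StepRecord]) -> int:
    return sum(trace[-1].output)


TOOLS: Dict[str, Tool] = {"extract_numbers": extract_numbers, "sum_previous": sum_previous}


def choose_tool(step: PlanStep) -> str:
    return "sum_previous" if "total" in step.definition.lower() else "extract_numbers"


steps = [
    PlanStep("Find the quarterly revenue figures", "list of numbers"),
    PlanStep("Compute the total revenue", "single number"),
]
trace = run_plan("Q1 120 Q2 135 Q3 150", steps, TOOLS, choose_tool)
print([record.output for record in trace])  # [[120, 135, 150], 405]
```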

**Judgment Agent (𝒜_judg).** This is the self-correction layer, and its design is the most interesting part of the paper. Rather than having the planning or execution agent grade its own homework, MACT uses a separate, independent judge that only assesses correctness and never attempts a fix itself. If it finds a mistake, it flags which step failed, describes the mistake, and routes the problem back to whichever agent (planning or execution) is responsible. This loop runs for up to three correction rounds.
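The judge-and-route loop can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the `Verdict` fields, the callable signatures, the choice to re-execute a revised plan from scratch, and the behavior once the round limit is reached (return the latest attempt, marked as not passed) are all guesses made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Verdict:
    ok: bool
    failed_step: Optional[int] = None  # index of the step that failed
    description: str = ""              # what went wrong
    responsible: Optional[str] = None  # "planning" or "execution"


def correction_loop(
    plan: Any,
    trace: Any,
    judge: Callable[[Any, Any], Verdict],
    replan: Callable[[Any, Verdict], Any],
    execute: Callable[[Any, Optional[Verdict]], Any],
    max_rounds: int = 3,
) -> Tuple[Any, Any, List[Tuple[Any, Any, Verdict]], bool]:
    rejected: List[Tuple[Any, Any, Verdict]] = []
    rounds = 0
    while True:
        verdict = judge(plan, trace)  # the judge only assesses, never edits
        if verdict.ok or rounds == max_rounds:
            return plan, trace, rejected, verdict.ok
        rejected.append((plan, trace, verdict))
        if verdict.responsible == "planning":
            plan = replan(plan, verdict)
            trace = execute(plan, None)     # assumption: a new plan is run from scratch
        else:
            trace = execute(plan, verdict)  # same plan, feedback goes to execution
        rounds += 1


# Toy agents so the loop can run end to end.
def toy_judge(plan: Any, trace: Any) -> Verdict:
    if trace[-1] == 405:
        return Verdict(ok=True)
    return Verdict(False, failed_step=1, description="sum is off", responsible="execution")


def toy_replan(plan: Any, verdict: Verdict) -> Any:
    return plan


def toy_execute(plan: Any, verdict: Optional[Verdict]) -> Any:
    return [[120, 135, 150], 405]


plan, trace, rejected, passed = correction_loop(
    ["extract figures", "sum them"], [[120, 135, 150], 404],
    toy_judge, toy_replan, toy_execute,
)
print(passed, len(rejected))  # True 1
```

The judge's return type carries no corrected plan or trace, which is the structural point: it can report and route, but the fix has to come from the agent that made the mistake.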

**Answer Agent (𝒜_ans).** Once a plan-execution pair has passed judgment, this agent synthesizes the final answer — deliberately keeping both the corrected trace *and* the earlier incorrect attempts in context, so it can see exactly what changed rather than silently discarding the error-prone parts.
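One way to picture what "keeping both in context" means is a prompt builder that lists the rejected attempts alongside the accepted trace. The function name, the prompt layout, and the string-based traces below are hypothetical; the paper's actual context format is not described here.

```python
from typing import List, Tuple


def build_answer_context(
    question: str,
    accepted_trace: str,
    rejected: List[Tuple[str, str]],  # (earlier attempt, judge feedback)
) -> str:
    parts = [f"Question: {question}"]
    # Earlier, rejected attempts stay visible instead of being dropped.
    for i, (attempt, feedback) in enumerate(rejected, start=1):
        parts.append(
            f"Earlier attempt {i} (rejected):\n{attempt}\nJudge feedback: {feedback}"
        )
    parts.append(f"Accepted trace:\n{accepted_trace}")
    parts.append(
        "Answer using the accepted trace. The earlier attempts are kept "
        "so the corrections are visible."
    )
    return "\n\n".join(parts)


context = build_answer_context(
    "What is total revenue for Q1 to Q3?",
    "extracted [120, 135, 150]; total = 405",
    [("extracted [120, 135, 150]; total = 404", "step 2: sum is wrong")],
)
print(context)
```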

The reason for a *separate* judge is worth sitting with. The paper compares three self-correction designs: a) let the same agent correct itself internally, b) use one extra agent that both judges and corrects, and c) MACT’s approach — a judge that only judges. Options a) and b) share a weakness: whatever component is responsible for correction is being trained, via RL, toward “pass verification” as its objective. That creates an incentive to produce vague or incomplete fixes that look right on the surface but aren’t actually right. Separating judgment from correction removes that incentive from the judge, at the cost of needing four coordinated components instead of one or two.

The second pillar of MACT is that test-time compute isn’t spent the same way for every agent — each one gets a scaling strategy suited to its actual job:

This matters because it directly counters a real failure mode of naive test-time scaling: pouring the same extra compute into every component regardless of whether that component actually benefits from it. The paper’s ablations show this agent-specific allocation beats parallel, sequential, hybrid, and internal (budget-forcing-everywhere) scaling strategies applied uniformly across the whole pipeline.

The paper trains three variants of MACT on different base model families — Qwen2.5-VL, MiMo-VL, and InternVL3 — each kept under 30B total parameters, and evaluates on 15 benchmarks spanning document types (text, webpage, chart, table) plus general and mathematical reasoning, to check that the framework doesn’t just help documents at the expense of everything else.

The headline result: despite staying under 30B parameters, all three MACT variants outperform every open-source and closed-source model under 100B parameters that the paper tested, and beat their own base models by 9.9–11.5% on average across the 15 benchmarks. Compared against much larger monolithic models from the same family (Qwen2.5-VL-72B and InternVL3-78B), the MACT variants still come out ahead by 3.7–6.6% on average, with the widest margins showing up on the longest-context and most reasoning-heavy benchmarks.

MACT’s core bet is that document understanding is inherently a multi-step procedure rather than a single perceptual judgment, so the right place to add capacity isn’t a bigger model; it’s more roles, and more targeted compute per role. That’s a different axis than the “just scale parameters” story that has dominated VLM progress. The fact that a set of sub-30B-parameter agents can outperform 70–90B monolithic models on this class of task is a reasonable existence proof that procedural scaling is worth taking seriously, at least for tasks that are naturally sequential and self-checkable, which document QA clearly is.

[This walkthrough](https://pub.towardsai.net/paper-walkthrough-mact-a-multi-agent-collaboration-framework-for-visual-document-understanding-12a907406a1b) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.
