arXiv:2606.12898v1 Announce Type: new Abstract: Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension
Vision-language models (VLMs) often locate the relevant text on a rendered page but fail to use it when answering, a regime the authors call "localization-without-utilization." To address this, they propose AGAR (Attention-Guided Adaptive Rendering), a training-free method that uses the VLM's own middle-to-late layer attention to identify important text spans, enlarges those spans on a re-rendered page, and then re-infers the answer. Across nine benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones, AGAR consistently improves off-the-shelf VLMs on visual text comprehension tasks.
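The selection step described above (top-K patches, mapped back to word spans, then enlarged) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the attention has already been extracted from the model and aggregated into a 2D grid of per-patch scores, and that the renderer recorded a pixel bounding box for each word. The names `Word`, `top_k_patches`, `words_in_patches`, and `font_scales`, the 16-pixel patch size, and the 1.5x enlargement factor are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    box: tuple  # (x0, y0, x1, y1) in pixels on the rendered page (assumed available)


def top_k_patches(attn, k):
    """Return (row, col) of the k highest-scoring cells in a 2D attention grid."""
    cells = [(score, r, c) for r, row in enumerate(attn) for c, score in enumerate(row)]
    cells.sort(key=lambda t: t[0], reverse=True)
    return [(r, c) for _, r, c in cells[:k]]


def words_in_patches(words, patches, patch_size):
    """Return sorted indices of words whose box overlaps any selected patch."""
    hits = set()
    for r, c in patches:
        px0, py0 = c * patch_size, r * patch_size
        px1, py1 = px0 + patch_size, py0 + patch_size
        for i, w in enumerate(words):
            x0, y0, x1, y1 = w.box
            # Standard axis-aligned rectangle overlap test.
            if x0 < px1 and px0 < x1 and y0 < py1 and py0 < y1:
                hits.add(i)
    return sorted(hits)


def font_scales(words, selected, enlarge=1.5):
    """Per-word font scale: `enlarge` for selected words, 1.0 otherwise."""
    chosen = set(selected)
    return [enlarge if i in chosen else 1.0 for i in range(len(words))]


# Toy page: two text lines, 16-pixel patches (hypothetical values).
words = [
    Word("The", (0, 0, 28, 14)),
    Word("capital", (30, 0, 80, 14)),
    Word("is", (0, 16, 14, 30)),
    Word("Paris", (16, 16, 60, 30)),
]
attn = [
    [0.01, 0.02, 0.01, 0.01, 0.00],
    [0.02, 0.40, 0.30, 0.05, 0.00],
]

patches = top_k_patches(attn, k=2)
selected = words_in_patches(words, patches, patch_size=16)
scales = font_scales(words, selected)
print([words[i].text for i in selected], scales)
```

In a full pipeline, the resulting scales would be passed to whatever renderer produced the page, and the VLM would then be queried again on the re-rendered image. How attention is aggregated across heads and layers, and how much to enlarge, are details this sketch does not take from the source.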