{"slug": "magnifying-what-matters-attention-guided-adaptive-rendering-for-visual-text", "title": "Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension", "summary": "Researchers have identified that vision-language models (VLMs) often locate relevant text in images but fail to utilize it for answering questions, a phenomenon called \"localization-without-utilization.\" To address this, the team developed AGAR (Attention-Guided Adaptive Rendering), a training-free method that uses the VLM's own attention patterns to identify and enlarge critical text spans on the rendered page before re-inferring answers. Testing across nine benchmarks and four VLM backbones showed AGAR consistently improves off-the-shelf VLM performance on visual text comprehension tasks, including long-page OCR and multi-page memory QA.", "body_md": "arXiv:2606.12898v1 Announce Type: new\nAbstract: Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. 
Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.", "url": "https://wpnews.pro/news/magnifying-what-matters-attention-guided-adaptive-rendering-for-visual-text", "canonical_source": "https://arxiv.org/abs/2606.12898", "published_at": "2026-06-12 04:00:00+00:00", "updated_at": "2026-06-12 04:50:36.427217+00:00", "lang": "en", "topics": ["computer-vision", "natural-language-processing", "large-language-models", "artificial-intelligence", "machine-learning"], "entities": ["AGAR", "VTC", "VLM", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/magnifying-what-matters-attention-guided-adaptive-rendering-for-visual-text", "markdown": "https://wpnews.pro/news/magnifying-what-matters-attention-guided-adaptive-rendering-for-visual-text.md", "text": "https://wpnews.pro/news/magnifying-what-matters-attention-guided-adaptive-rendering-for-visual-text.txt", "jsonld": "https://wpnews.pro/news/magnifying-what-matters-attention-guided-adaptive-rendering-for-visual-text.jsonld"}}