{"slug": "not-all-modalities-are-equal-instruction-aware-gating-for-multimodal-videos", "title": "Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos", "summary": "Researchers have developed UniMVU, a unified multimodal video understanding framework that uses instruction-aware gating to dynamically balance the importance of different input modalities like video, audio, and depth maps. The system addresses modality interference by employing inner-modality gates to emphasize salient regions and modality-level gates to re-weight entire streams based on text instructions. Across six benchmarks, UniMVU achieved consistent improvements over static-fusion baselines, with gains of up to 13.5 on the CIDEr metric.", "body_md": "arXiv:2605.26232v1 Announce Type: new\nAbstract: Pre-trained video large language models excel at visual reasoning. However, they struggle when videos arrive with auxiliary streams, such as audio, depth maps, or dense temporal evidence. In such scenarios, uniform fusion induces modality interference, allowing irrelevant channels to distract the model. To address this issue, we present a unified multimodal video understanding framework, named UniMVU, that performs instruction-aware fusion across video, audio, depth maps, or any other modality input via two levels of dynamic gating: inner-modality gates emphasize salient regions within each modality, whereas modality-level gates re-weight whole streams; both are conditioned on the text instruction to adaptively balance modality importance. UniMVU combines cross-modal self-attention with an instruction-driven inner-modality gating module and a modality-level gating module with a control token; for time-aligned streams, we further adopt a fast-to-slow fusion scheme that reduces redundancy. 
Across six benchmarks (AVQA, AVSD, Music-AVQA, ScanQA, SQA3D and MVBench), UniMVU achieves consistent gains over static-fusion baselines, with improvements of up to 13.5 on the CIDEr metric. Further, our analysis shows that the gating mechanism aligns with human-interpretable modality relevance, and ablations show the contributions of both inner-modality and modality-level gating. UniMVU provides a simple, unified recipe for instruction-aware multimodal video understanding that scales to diverse modalities without hand-crafted fusion rules.", "url": "https://wpnews.pro/news/not-all-modalities-are-equal-instruction-aware-gating-for-multimodal-videos", "canonical_source": "https://arxiv.org/abs/2605.26232", "published_at": "2026-05-27 04:00:00+00:00", "updated_at": "2026-05-27 04:26:29.787947+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "computer-vision"], "entities": ["UniMVU", "AVQA", "AVSD", "Music-AVQA", "ScanQA", "SQA3D", "MVBench", "CIDEr"], "alternates": {"html": "https://wpnews.pro/news/not-all-modalities-are-equal-instruction-aware-gating-for-multimodal-videos", "markdown": "https://wpnews.pro/news/not-all-modalities-are-equal-instruction-aware-gating-for-multimodal-videos.md", "text": "https://wpnews.pro/news/not-all-modalities-are-equal-instruction-aware-gating-for-multimodal-videos.txt", "jsonld": "https://wpnews.pro/news/not-all-modalities-are-equal-instruction-aware-gating-for-multimodal-videos.jsonld"}}