{"slug": "disagreement-based-cross-model-routing-for-implicit-video-question-answering", "title": "Disagreement-Based Cross-Model Routing for Implicit Video Question Answering", "summary": "Researchers introduced disagreement-based cross-model routing for implicit video question answering, achieving a +1.43 AvgAcc improvement on the ImplicitQA benchmark by routing disputed questions from Gemini 3.1 Pro Preview to Claude Opus 4.8. The method, which requires no training, targets hard questions involving motion, counting, and spatial reasoning, and was validated on an independent test set with a +1.81 gain.", "body_md": "arXiv:2606.14723v1\nAbstract: We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout. On this benchmark a single frontier video LLM already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies -- majority voting across repeated samples of the same model -- can hurt rather than help, because the model's errors on hard questions are correlated. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training. We triple-sample a native-video model (Gemini 3.1 Pro Preview) at temperature zero, exploit the genuine sample-to-sample variance of its video-processing pipeline to identify the roughly 20% subset of questions where the three samples disagree, and route only that subset to a second model from a different family (Claude Opus 4.8) that consumes uniformly sampled frames with adaptive thinking. 
On the 1001-question validation set with public ground truth -- our main evaluation -- the method improves AvgAcc by +1.43 over the best single sample of the primary model, with per-category gains concentrated on Motion & Trajectory (+5.49), Inferred Counting (+3.45), and Vertical Spatial Reasoning (+1.82) -- the categories most dependent on cross-shot reference resolution. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc (+1.81 over the best single sample of the primary model), confirming the validation result on an independent split.", "url": "https://wpnews.pro/news/disagreement-based-cross-model-routing-for-implicit-video-question-answering", "canonical_source": "https://arxiv.org/abs/2606.14723", "published_at": "2026-06-16 04:00:00+00:00", "updated_at": "2026-06-16 04:18:21.470735+00:00", "lang": "en", "topics": ["computer-vision", "large-language-models", "natural-language-processing", "ai-research"], "entities": ["Gemini 3.1 Pro Preview", "Claude Opus 4.8", "ImplicitQA", "CVPR 2026"], "alternates": {"html": "https://wpnews.pro/news/disagreement-based-cross-model-routing-for-implicit-video-question-answering", "markdown": "https://wpnews.pro/news/disagreement-based-cross-model-routing-for-implicit-video-question-answering.md", "text": "https://wpnews.pro/news/disagreement-based-cross-model-routing-for-implicit-video-question-answering.txt", "jsonld": "https://wpnews.pro/news/disagreement-based-cross-model-routing-for-implicit-video-question-answering.jsonld"}}