HorusEye: Language as Dynamic Attention for Emergency Visual Analysis

Researchers introduced HorusEye, a framework using language as dynamic attention for emergency visual analysis, and benchmarked it on the RefCOCO-Degraded dataset. Testing multiple vision-language models, they found Gemini improved by 47.3% in thermal conditions with iterative language feedback, while Qwen2-VL degraded by 5.1%. The study also identified a 'Thermal Paradox' where cropping strategies fail in thermal imagery, and BLIP-2 hallucinates more under degradation, making it unsuitable for emergency deployment.

arXiv:2606.14741v1. Abstract: We introduce HorusEye, Language as Dynamic Attention for Emergency Visual Analysis. Our investigation followed five stages. The first is benchmarking RefCOCO-Degraded, a dataset of 15,244 images (3,811 base images × 4 conditions: Clean, Fog, Smoke, and Thermal) with systematic visual degradation. Through four research questions, we evaluate multiple VLMs (Gemini, Qwen2-VL, BLIP-2, LLaVA, Kosmos-2) across visual grounding (the second stage), language feedback recovery (the third), health VQA tasks (the fourth), and hallucination analysis (the final stage). Our key finding is that language feedback effectiveness is model-dependent: Gemini achieves a +47.3% improvement in thermal conditions through iterative language feedback, while Qwen2-VL shows a -5.1% degradation under the same protocol. We also identify the 'Thermal Paradox', where cropping strategies that improve RGB performance catastrophically fail in thermal imagery. Furthermore, BLIP-2 uniquely hallucinates more under degradation, making it unsuitable for emergency deployment.
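
The dataset size quoted in the abstract follows from pairing each base image with each condition. A minimal sketch of that enumeration is given here, assuming integer image IDs and a simple (image, condition) pairing; the function name and ID scheme are hypothetical and are not taken from the paper, which does not describe its data layout.

```python
from itertools import product

# Condition names and counts come from the abstract.
# The integer ID scheme and function name are illustrative assumptions.
CONDITIONS = ("Clean", "Fog", "Smoke", "Thermal")
NUM_BASE_IMAGES = 3_811


def enumerate_samples(num_base=NUM_BASE_IMAGES, conditions=CONDITIONS):
    """Return one (base_image_id, condition) pair per benchmark image."""
    return list(product(range(num_base), conditions))


samples = enumerate_samples()
print(len(samples))  # 15244, i.e. 3,811 base images x 4 conditions
```

Keeping the base image ID alongside the condition makes it straightforward to compare a model's output on the same scene across Clean, Fog, Smoke, and Thermal variants.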