{"slug": "vision-driven-preference-synthesis-for-mitigating-hallucinations-in-vlms", "title": "Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs", "summary": "Researchers propose ViPSy, a framework for constructing preference data that reduces hallucinations in Vision-Language Models (VLMs) by leveraging visual cues from semantically aligned image variants. The method achieves state-of-the-art hallucination mitigation, reducing hallucination rates on AMBER and Object HalBench by 35.7% and 24.5%, respectively, while improving performance on general visual grounding benchmarks.", "body_md": "arXiv:2606.28401v1 Announce Type: new\nAbstract: Vision-Language Models (VLMs) have shown strong performance in visual understanding, yet they still suffer from hallucinations, generating content that is not grounded in the image. Preference alignment is a promising approach to improve visual faithfulness, but its success depends heavily on how preference pairs are constructed. Existing methods exhibit two key limitations: (a) intervention-based methods often introduce significant deviation from the policy distribution, and (b) sampling-based methods often underuse visual information during construction. In this paper, we propose ViPSy (Vision-driven Preference Synthesis), a framework for constructing preference data that are both policy-aligned and visually grounded. Our framework consists of two stages. In the first stage, ViPSy derives a visual cue from recurring object-level content across semantically aligned image variants, so preference construction can rely on visual information rather than language priors. In the second stage, ViPSy conditions the policy's own rollouts on this cue, allowing candidates to be guided by visually grounded content while staying close to the policy's response distribution. The resulting candidates remain close to the policy's response distribution while better leveraging visual information from the image. 
Experiments show that the resulting VLM, preference-aligned with ViPSy-constructed preference pairs, achieves a new state-of-the-art in hallucination mitigation. Compared with the previous state-of-the-art method, it reduces hallucination rates on AMBER and Object HalBench by 35.7% and 24.5%, respectively. The resulting model further improves on general visual grounding benchmarks, e.g., MMStar, MMVP, and CV-Bench, while also yielding gains in semantic segmentation and ImageNet linear probing, underscoring the effectiveness of our framework in enhancing the model's visual capabilities.", "url": "https://wpnews.pro/news/vision-driven-preference-synthesis-for-mitigating-hallucinations-in-vlms", "canonical_source": "https://arxiv.org/abs/2606.28401", "published_at": "2026-06-30 04:00:00+00:00", "updated_at": "2026-06-30 04:25:19.923396+00:00", "lang": "en", "topics": ["computer-vision", "large-language-models", "ai-research"], "entities": ["ViPSy", "AMBER", "Object HalBench", "MMStar", "MMVP", "CV-Bench", "ImageNet"], "alternates": {"html": "https://wpnews.pro/news/vision-driven-preference-synthesis-for-mitigating-hallucinations-in-vlms", "markdown": "https://wpnews.pro/news/vision-driven-preference-synthesis-for-mitigating-hallucinations-in-vlms.md", "text": "https://wpnews.pro/news/vision-driven-preference-synthesis-for-mitigating-hallucinations-in-vlms.txt", "jsonld": "https://wpnews.pro/news/vision-driven-preference-synthesis-for-mitigating-hallucinations-in-vlms.jsonld"}}