Source: arxiv.org · Topic: computer-vision

Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs

Researchers propose ViPSy (Vision-driven Preference Synthesis), a framework for constructing preference data that reduces hallucinations in Vision-Language Models (VLMs) by deriving visual cues from semantically aligned image variants. Compared with the previous state-of-the-art method, the resulting model reduces hallucination rates on AMBER and Object HalBench by 35.7% and 24.5%, respectively, while also improving on general visual grounding benchmarks.

Published Jun 30, 2026 · 1 min read

arXiv:2606.28401v1. Abstract: Vision-Language Models (VLMs) have shown strong performance in visual understanding, yet they still suffer from hallucinations, generating content that is not grounded in the image. Preference alignment is a promising approach to improving visual faithfulness, but its success depends heavily on how preference pairs are constructed. Existing methods exhibit two key limitations: (a) intervention-based methods often introduce significant deviation from the policy distribution, and (b) sampling-based methods often underuse visual information during construction. In this paper, we propose ViPSy (Vision-driven Preference Synthesis), a framework for constructing preference data that are both policy-aligned and visually grounded. Our framework consists of two stages. In the first stage, ViPSy derives a visual cue from recurring object-level content across semantically aligned image variants, so that preference construction can rely on visual information rather than language priors. In the second stage, ViPSy conditions the policy's own rollouts on this cue, so that candidates are guided by visually grounded content while staying close to the policy's response distribution. Experiments show that the resulting VLM, preference-aligned with ViPSy-constructed preference pairs, achieves a new state of the art in hallucination mitigation. Compared with the previous state-of-the-art method, it reduces hallucination rates on AMBER and Object HalBench by 35.7% and 24.5%, respectively. The resulting model further improves on general visual grounding benchmarks, e.g., MMStar, MMVP, and CV-Bench, while also yielding gains in semantic segmentation and ImageNet linear probing, underscoring the effectiveness of our framework in enhancing the model's visual capabilities.
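
The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how objects are extracted from each image variant, how recurrence is measured, or how the cue is passed to the policy. The function names `derive_visual_cue` and `build_cued_prompt`, the `min_fraction` threshold, and the prompt wording are all hypothetical, and object detection is assumed to have already produced a list of labels per variant.

```python
from collections import Counter
from typing import Iterable, List


def derive_visual_cue(
    objects_per_variant: Iterable[Iterable[str]],
    min_fraction: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Stage 1 (assumed form): keep objects that recur across image variants.

    Each inner iterable holds the object labels found in one semantically
    aligned variant of the same image. An object is kept if it appears in
    at least `min_fraction` of the variants.
    """
    variants = [set(objs) for objs in objects_per_variant]
    if not variants:
        return []
    # Count in how many variants each object appears (sets remove repeats).
    counts = Counter(obj for variant in variants for obj in variant)
    threshold = min_fraction * len(variants)
    return sorted(obj for obj, n in counts.items() if n >= threshold)


def build_cued_prompt(question: str, cue: List[str]) -> str:
    """Stage 2 (assumed form): condition the policy's rollout on the cue.

    Here the cue is simply appended to the prompt; the policy model itself
    would then sample candidate responses from this cued prompt.
    """
    if not cue:
        return question
    return f"{question}\nObjects visible in the image: {', '.join(cue)}."


if __name__ == "__main__":
    detections = [
        ["dog", "ball"],
        ["dog", "grass"],
        ["dog", "ball", "tree"],
    ]
    cue = derive_visual_cue(detections)
    print(cue)
    print(build_cued_prompt("Describe the image.", cue))
```

Under these assumptions, objects seen in only one variant are dropped as unreliable, and the surviving labels steer the policy's own sampling rather than replacing its output, which is the property the abstract emphasizes: candidates stay close to the policy's response distribution while drawing on visual evidence.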
