{"slug": "cave-vlm-cot-an-interpretable-vision-language-model-framework", "title": "CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework", "summary": "Researchers introduced CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning in vision-language models through a five-stage closed-loop pipeline. The framework achieves 87.1% accuracy on ScienceQA and 55.2% on MMMU, reducing hallucinations by enforcing step-level citation grounding and enabling targeted re-retrieval for ungrounded claims.", "body_md": "arXiv:2606.18385v1 Announce Type: new\nAbstract: Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. 
Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1\\% accuracy and 56.6\\% CaVeScore on ScienceQA, and 55.2\\% accuracy and 35.7\\% CaVeScore on MMMU (30 subjects).", "url": "https://wpnews.pro/news/cave-vlm-cot-an-interpretable-vision-language-model-framework", "canonical_source": "https://arxiv.org/abs/2606.18385", "published_at": "2026-06-18 04:00:00+00:00", "updated_at": "2026-06-18 04:22:37.255075+00:00", "lang": "en", "topics": ["large-language-models", "computer-vision", "ai-research", "ai-safety", "ai-agents"], "entities": ["CaVe-VLM-CoT", "ScienceQA", "MMMU", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/cave-vlm-cot-an-interpretable-vision-language-model-framework", "markdown": "https://wpnews.pro/news/cave-vlm-cot-an-interpretable-vision-language-model-framework.md", "text": "https://wpnews.pro/news/cave-vlm-cot-an-interpretable-vision-language-model-framework.txt", "jsonld": "https://wpnews.pro/news/cave-vlm-cot-an-interpretable-vision-language-model-framework.jsonld"}}