CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Source: arxiv.org · Published: 2026-06-18 · Topic: large-language-models

Researchers introduced CaVe-VLM-CoT, a modular, reflection-based agentic-RAG framework that enforces evidence-grounded reasoning in vision-language models through a five-stage closed-loop pipeline. The framework reports 87.1% accuracy on ScienceQA and 55.2% on MMMU, and it targets hallucinations by enforcing step-level citation grounding and triggering targeted re-retrieval for ungrounded claims.
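
To make "step-level citation grounding" concrete, here is a minimal sketch of one way such a check could look. It is not the paper's implementation: the `Step` schema, the `ungrounded_steps` helper, and the evidence IDs are all hypothetical, and a real verifier would presumably judge whether the cited evidence actually supports the claim rather than only checking that the ID exists.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One reasoning step plus the evidence IDs it cites (hypothetical schema)."""
    text: str
    citations: list[str] = field(default_factory=list)


def ungrounded_steps(steps: list[Step], evidence_ids: set[str]) -> list[int]:
    """Return indices of steps with no citation that resolves to retrieved evidence."""
    return [
        i for i, step in enumerate(steps)
        if not any(c in evidence_ids for c in step.citations)
    ]


# Toy data: the sentences and IDs are invented for illustration only.
steps = [
    Step("The diagram shows a food web.", ["img:region1"]),
    Step("Hawks are the top predator in it.", []),          # no citation at all
    Step("Removing hawks lets mice multiply.", ["doc:99"]),  # cites unretrieved evidence
]
evidence_ids = {"img:region1", "doc:7"}

flagged = ungrounded_steps(steps, evidence_ids)
# The flagged claims are what a re-retrieval pass would target.
requery = [steps[i].text for i in flagged]
print(flagged)  # [1, 2]
```

Flagging by index keeps the failure tied to a specific step, which is what makes the re-retrieval "targeted" rather than a full restart.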

Abstract (arXiv:2606.18385v1): Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).
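
The control flow of the five-stage loop described in the abstract can be sketched roughly as follows. Only the stage order and the Verifier-to-Extractor feedback come from the abstract; everything else is an assumption made for illustration: the function names, the `max_rounds` cap (the abstract does not state a stopping rule), and the toy stages, which use dictionary lookups where the real system would presumably call models and a retriever.

```python
def run_closed_loop(question, extract, retrieve, solve, inject_citations, verify,
                    max_rounds=3):
    """Run Extractor -> Retriever -> Solver -> Citation Injector -> Verifier,
    feeding ungrounded claims back to the Extractor (max_rounds is an assumption)."""
    feedback, cited, ungrounded, rounds = [], [], [], 0
    for rounds in range(1, max_rounds + 1):
        queries = extract(question, feedback)      # feedback widens the query set
        evidence = retrieve(queries)
        steps = solve(question, evidence)
        cited = inject_citations(steps, evidence)
        ungrounded = verify(cited)
        if not ungrounded:
            break                                  # every step is grounded
        feedback = ungrounded                      # route failures back to retrieval
    return {"steps": cited, "ungrounded": ungrounded, "rounds": rounds}


# --- Toy stand-ins for the five stages (all hypothetical) ---
CORPUS = {
    "boiling point": "Water boils at 100 C at sea level.",
    "altitude": "The boiling point drops as altitude rises.",
}

def extract(question, feedback):
    return ["boiling point"] + list(feedback)

def retrieve(queries):
    return {q: CORPUS[q] for q in queries if q in CORPUS}

def solve(question, evidence):
    # Each step names the topic it needs support from (a simplification).
    return [
        {"claim": "Water boils at 100 C at sea level.", "topic": "boiling point"},
        {"claim": "On a mountain it boils at a lower temperature.", "topic": "altitude"},
    ]

def inject_citations(steps, evidence):
    return [{**s, "citation": s["topic"] if s["topic"] in evidence else None}
            for s in steps]

def verify(cited):
    return [s["topic"] for s in cited if s["citation"] is None]


result = run_closed_loop("Does water boil faster on a mountain?",
                         extract, retrieve, solve, inject_citations, verify)
print(result["rounds"], result["ungrounded"])  # 2 []
```

In this toy run the first pass leaves the altitude claim uncited, the Verifier reports it, and the second pass retrieves the missing evidence. Passing the stages in as plain callables mirrors the modular framing in the abstract, since any stage can be swapped without touching the loop.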

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.18385
