Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

A new study auditing eight automatic attribution metrics across multiple datasets finds that no single metric is consistently best: per-dataset rankings invert between AttributedQA and LFQA (Kendall tau = -0.64, p = 0.031). This instability has a measurable cost. A naive best-on-average selection rule, evaluated leave-one-dataset-out, incurs a mean held-out regret of 0.172 AUROC. Prompt-based LLM judges avoid the chance-level collapses but are not uniformly best, roughly 100x costlier, and non-deterministic, so they relocate the validation burden rather than remove it.
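To make the ranking-inversion statistic concrete, here is a minimal sketch of Kendall tau computed over per-scorer AUROCs on two datasets. It assumes the simple tau-a variant with no tied scores; the study does not state which variant it uses, and the AUROC values below are hypothetical placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (assumes no ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both lists order scorers i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical AUROCs for four scorers, listed in the same scorer order.
dataset_a = [0.90, 0.80, 0.70, 0.60]
dataset_b = [0.55, 0.65, 0.75, 0.85]  # exactly the reverse ordering

tau = kendall_tau(dataset_a, dataset_b)
print(tau)
```

A value near -1 means the two datasets rank the scorers in nearly opposite order, near +1 means they agree, and near 0 means the rankings are unrelated.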

Published Jun 24, 2026

arXiv:2606.23915v1

Abstract: Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. We audit eight automatic scorers: lexical, embedding, and BERTScore baselines alongside entailment/grounding-trained models (clean and FEVER NLI, the checker MiniCheck). The audit spans three evaluation constructs (provenance/topicality, generated-answer attribution, and fact-check entailment) and asks whether any scorer transfers, that is, stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct.

In the construct with the most multi-dataset human-labeled coverage, generated-answer attribution (AttributionBench's four source datasets, n = 1,610, with independent HAGRID, n = 2,150), none does. The per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031 on AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to AUROC 0.53 (chance) on long-form LFQA, where BERTScore wins (0.91); the flip is not a length or truncation artifact.

This instability has a concrete decision cost: a naive "best-on-average" rule for choosing an evaluator fails leave-one-dataset-out (mean held-out regret 0.172 AUROC, worse than fixing one scorer), so metric choice must be validated on the target dataset rather than learned from others. A prompt-based LLM judge avoids the chance-level collapses the automatic scorers suffer (no LFQA collapse) but is not uniformly best, ~100x costlier, and non-deterministic. It relocates, rather than removes, the validation burden.
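The leave-one-dataset-out check on the "best-on-average" rule can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: it assumes regret is the gap between the best scorer's AUROC on the held-out dataset and the AUROC of the scorer chosen from the remaining datasets. The function names, dataset keys, scorer names, and AUROC values are all hypothetical.

```python
from statistics import mean


def lodo_regret(auroc):
    """auroc: {dataset: {scorer: AUROC}}. Returns (per-dataset regret, mean regret)."""
    regrets = {}
    for held_out in auroc:
        train = [d for d in auroc if d != held_out]
        # "Best-on-average" rule: pick the scorer with the highest mean AUROC
        # on the other datasets (ties resolve to the first scorer seen).
        chosen = max(
            auroc[held_out],
            key=lambda s: mean([auroc[d][s] for d in train]),
        )
        best = max(auroc[held_out].values())
        regrets[held_out] = best - auroc[held_out][chosen]
    return regrets, mean(list(regrets.values()))


def fixed_scorer_regret(auroc):
    """Mean regret from always using one scorer, for each scorer."""
    scorers = next(iter(auroc.values())).keys()
    return {
        s: mean([max(auroc[d].values()) - auroc[d][s] for d in auroc])
        for s in scorers
    }


# Hypothetical AUROCs (not the paper's numbers): two scorers, three datasets.
toy = {
    "ds1": {"scorer_a": 0.90, "scorer_b": 0.70},
    "ds2": {"scorer_a": 0.50, "scorer_b": 0.90},
    "ds3": {"scorer_a": 0.80, "scorer_b": 0.75},
}

per_dataset, mean_regret = lodo_regret(toy)
print(per_dataset, round(mean_regret, 3))
print(fixed_scorer_regret(toy))
```

In this toy setup the averaged choice is wrong on every held-out dataset, so its mean regret exceeds that of committing to either scorer outright. That is the same qualitative pattern the abstract reports, here produced by construction rather than by data.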

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.23915
