Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

A new study auditing eight automatic attribution metrics across multiple datasets finds that no single metric consistently performs best: per-dataset rankings invert between AttributedQA and LFQA (Kendall tau = -0.64, p = 0.031). The instability carries a concrete cost, with a naive best-on-average selection rule incurring a mean held-out regret of 0.172 AUROC under leave-one-dataset-out evaluation. Prompt-based LLM judges avoid the chance-level collapses but are not uniformly best, cost roughly 100x more, and are non-deterministic, so they shift rather than remove the validation burden.

arXiv:2606.23915v1. Abstract: Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. We audit eight automatic scorers (lexical, embedding, and BERTScore baselines alongside entailment/grounding-trained models: clean and FEVER NLI, the checker MiniCheck) across three evaluation constructs (provenance/topicality, generated-answer attribution, and fact-check entailment), asking whether any scorer transfers: stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct. In the construct with the most multi-dataset human-labeled coverage -- generated-answer attribution (AttributionBench's four source datasets, n = 1,610, with independent HAGRID, n = 2,150) -- none does: the per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031 on AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to AUROC 0.53 (chance) on long-form LFQA, where BERTScore wins (0.91); the flip is not a length or truncation artifact. This instability has a concrete decision cost: a naive "best-on-average" rule for choosing an evaluator fails leave-one-dataset-out (mean held-out regret 0.172 AUROC, worse than fixing one scorer), so metric choice must be validated on the target dataset rather than learned from others. A prompt-based LLM judge avoids the chance-level collapses the automatic scorers suffer (no LFQA collapse) but is not uniformly best, ~100x costlier, and non-deterministic -- relocating, not removing, the validation burden.
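
The leave-one-dataset-out regret mentioned in the abstract can be illustrated with a minimal sketch. This is one plausible reading of the "best-on-average" rule, not the authors' code: for each held-out dataset, pick the scorer with the highest mean AUROC on the remaining datasets, then measure how far it falls short of the best scorer on the held-out one. The dataset names, scorer names, and AUROC values below are hypothetical toy numbers chosen only to show the mechanics; the function names are likewise assumptions.

```python
from statistics import mean

# Hypothetical toy AUROC table: dataset -> scorer -> AUROC.
# These numbers are illustrative only and are NOT results from the paper.
TOY_AUROC = {
    "dataset_a": {"nli": 0.90, "bertscore": 0.70, "lexical": 0.60},
    "dataset_b": {"nli": 0.55, "bertscore": 0.90, "lexical": 0.65},
    "dataset_c": {"nli": 0.85, "bertscore": 0.75, "lexical": 0.60},
}


def best_on_average(auroc, train_datasets):
    """Return the scorer with the highest mean AUROC over train_datasets."""
    scorers = next(iter(auroc.values())).keys()
    return max(scorers, key=lambda s: mean(auroc[d][s] for d in train_datasets))


def lodo_regret(auroc):
    """Leave-one-dataset-out regret of the best-on-average selection rule.

    Regret on a held-out dataset is the gap between the best scorer's AUROC
    on that dataset and the AUROC of the scorer chosen from the other datasets.
    """
    regrets = {}
    for held_out in auroc:
        train = [d for d in auroc if d != held_out]
        chosen = best_on_average(auroc, train)
        oracle = max(auroc[held_out].values())
        regrets[held_out] = oracle - auroc[held_out][chosen]
    return regrets


if __name__ == "__main__":
    regrets = lodo_regret(TOY_AUROC)
    for name, value in regrets.items():
        print(f"{name}: regret = {value:.3f}")
    print(f"mean held-out regret = {mean(regrets.values()):.3f}")
```

In this toy table the rule picks a scorer that does well on the training datasets but poorly on the held-out one whenever rankings flip between datasets, which is the failure mode the abstract describes. A regret of zero on every held-out dataset would indicate that the selection rule transfers.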