MT-Bench

mentions: 3, type: Organization

// recent coverage 3 mentions

04:00

2026-06-29

arxiv.org

large-language-models

Masked Language Flow Models

Researchers introduced Masked Language Flow Models (MLFMs), combining masked diffusion and flow-based methods for efficient language generation. MLFMs enable conditional generation via continuous flow…

04:00

2026-06-19

arxiv.org

large-language-models

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

A systematic evaluation of 21 LLM-as-a-Judge models across 118 runs and 541,000 judgments reveals that exact-match agreement overstates discriminative ability, with kappa deflation of 33–41 percentage…

04:00

2026-06-06

arxiv.org

large-language-models

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

A new study reveals that LLM judges, widely used to evaluate AI outputs, can be manipulated after making an initial decision through targeted conversation, overturning stable judgments and shifting be…

// co-occurs with top 8 entities

AlpacaEval (1), LLM (1), Evaluation Robustness Score (1), JudgeBench (1), RewardBench (1), Cohen's kappa (1), Masked Language Flow Models (1), Masked Diffusion Models (1)

// topics top 6 topics

large language models (3), ai research (3), ai safety (2), ai ethics (2), natural language processing (2), generative ai (1)