GPT-4.1-mini

mentions: 1 · type: Organization

// recent coverage: 1 mention

04:00

2026-06-15

arxiv.org

large-language-models

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

A study of LLM-as-a-Judge evaluations found that pairwise preferences flip 13.6% of the time on average, with some questions reaching a 56% flip rate, and GPT-4o-mini showed a significant first-positi…

// co-occurs with top 2 entities

OpenAI (1) · GPT-4o-mini (1)