BigCodeBench

Mentions: 2 | Type: Organization | Feed: RSS

// recent coverage 2 mentions

2026-05-27 04:00 | arxiv.org | large-language-models

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

Researchers have developed Conv-to-Bench, a framework that automatically converts real-world user-assistant dialogues into structured evaluation benchmarks for large language models. In programming ta…

2026-05-26 05:00 | alex.smola.org | large-language-models

You don't need all the LLM benchmarks

A new analysis of over 5,400 AI models reveals that benchmark scores for large language models are highly correlated, with just five subjects on the MMLU test predicting the remaining 52 with 91% accu…

// co-occurs with top 8 entities

MMLU (1), MTEB (1), HELM (1), Open LLM Leaderboard (1), AlpacaEval (1), LiveBench (1), WildBench (1), Conv-to-Bench (1)