04:00
2026-06-25
arxiv.org
large-language-models
LLM Performance on a Real, Double-Marked GCSE Benchmark
Researchers introduced a dataset of 32,534 double-marked GCSE student responses and tested large language models (LLMs) against examiner consensus. Top-performing LLMs agreed more closely with examine…