14:00
2026-06-24
arize.com
ai-agents
Long-horizon agent benchmarks are fragmenting: a field guide to what each one actually measures
A new wave of long-horizon agent benchmarks, including Agents' Last Exam, SWE-Marathon, the Meta-Agent Challenge, and Arena's Agent Mode, reveals a fundamental trade-off between realism and verifiabil…