SWE-Marathon

mentions: 1, type: Organization

// recent coverage: 1 mention

14:00

2026-06-24

arize.com

ai-agents

Long-horizon agent benchmarks are fragmenting: a field guide to what each one actually measures

A new wave of long-horizon agent benchmarks, including Agents' Last Exam, SWE-Marathon, the Meta-Agent Challenge, and Arena's Agent Mode, reveals a fundamental trade-off between realism and verifiabil…

// co-occurs with top 7 entities

OpenAI (1), Apollo Research (1), o4-mini (1), o3 (1), Agents' Last Exam (1), Meta-Agent Challenge (1), Arena (1)