BenchBench
A new benchmark called BenchBench tests AI models on their ability to create benchmarks for other models, revealing that only GPT 5.2 succeeded in generating a practically solvable yet challenging eva…
A new experiment shows that AI auditors can be manipulated by the very systems they are meant to oversee, with one in eight auditor verdicts changing to "compliant" after the audited AI system explain…
AI agents, described as a new silicon-based species called Homo Agenticus Sapiens, now function as workers, buyers, sellers, and managers in the economy but differ fundamentally from humans by lacking…
A new experiment comparing AI model coordination strategies found that a market-based system where models bid on tasks outperformed a hub-spoke approach where a single frontier model delegated work. T…