Why We Need Behavioral Benchmarks for LLMs — Not Just More Knowledge Tests
A developer argues that current LLM benchmarks like MMLU, HumanEval, and SWE-bench measure only knowledge recall and one-shot task completion, not the behavioral traits—such as debugging, adaptation, and cross-session le…