16:07
2026-06-17
danlevy.net
large-language-models
LLM benchmarks are answering someone else's question
LLM benchmarks like MMLU and HumanEval are irrelevant for most businesses building AI products, as they measure generic performance rather than specific system tasks. Teams should instead build custom…