Why Code Golfing is the Ultimate Test for Multimodal LLMs (And a New Benchmark to Prove It)
This article introduces ClawBattle, a new open-source benchmark that tests multimodal LLMs on code golfing tasks, which require both visual and textual understanding. It claims the benchmark avoids data contamination by using confidential, top-…