This wasn't a nail biter. Claude Opus 4.8 took a decisive 37–31 win over Kimi K2.7 Code, and the per task breakdown tells a clear story: Opus simply operates with more precision and better judgment across the board. The only task Kimi didn't lose outright was messy orders to json , where both models produced identical, rule compliant output. A tie on a rigid data transformation task is fine, but it's not where the real differentiation happens. On python log redactor , the gap was technical an...
Anthropic Explains How Claude Builds Its Own Execution Harnesses