Alibaba’s Qwen3.7-Max Ran Autonomously for 35 Hours on Unfamiliar Hardware. It Still Kept Getting Better.

Alibaba’s Qwen3.7-Max spent 35 hours autonomously optimizing a kernel on unfamiliar hardware, making 1,158 tool calls and achieving a 10x speedup over the reference implementation. It outperformed GLM 5.1 (7.3x), Kimi K2.6 (5x), and DeepSeek V4 Pro (3.3x) on the same task; the models that stopped early did so after concluding they could not improve further. The result suggests the model can sustain self-directed improvement without human intervention.

Alibaba gave Qwen3.7-Max a kernel optimization task on a hardware platform the model had never encountered before. It had no documentation, no profiling data, and no example kernels for the architecture. It had just a task description, an existing implementation, and an evaluation script.

The model ran for 35 hours and made 1,158 tool calls. It wrote, compiled, profiled, and rewrote the kernel repeatedly, diagnosing failures, fixing bugs, identifying bottlenecks, and redesigning the architecture multiple times without anyone watching. After 30 hours it was still finding meaningful improvements. The final result was a 10x speedup over the reference implementation.

For context: GLM 5.1 ran the same task and reached 7.3x, Kimi K2.6 reached 5x, and DeepSeek V4 Pro reached 3.3x. The models that stopped early did so because they issued no tool calls for five consecutive rounds: they concluded they couldn't make further progress and stopped. Qwen3.7-Max didn't stop.
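
To make the stopping condition and the speedup figures concrete, here is a minimal Python sketch. It is not Alibaba's evaluation harness: the function names `round_when_stopped` and `speedup`, the `patience` parameter, and the sample inputs are all hypothetical, and the sketch assumes that "speedup" means reference runtime divided by optimized runtime, which the article does not spell out.

```python
from typing import Iterable, Optional


def round_when_stopped(tool_calls_per_round: Iterable[int],
                       patience: int = 5) -> Optional[int]:
    """Return the 1-based round at which a run would be ended, or None.

    Hypothetical rule modeled on the article: a run ends once the model
    has issued no tool calls for `patience` consecutive rounds.
    """
    idle_rounds = 0
    for round_no, calls in enumerate(tool_calls_per_round, start=1):
        # Any tool call resets the idle counter; an empty round extends it.
        idle_rounds = idle_rounds + 1 if calls == 0 else 0
        if idle_rounds >= patience:
            return round_no
    return None


def speedup(reference_seconds: float, optimized_seconds: float) -> float:
    """Assumed definition: how many times faster the new kernel runs."""
    return reference_seconds / optimized_seconds


# Illustrative inputs only; these are not measurements from the article.
stalled_run = [3, 2, 0, 0, 0, 0, 0, 4]   # five idle rounds in a row
active_run = [3, 0, 0, 0, 0, 1, 0, 0]    # idle streak is broken by a call

print(round_when_stopped(stalled_run))   # round at which the run ends
print(round_when_stopped(active_run))    # None: the run keeps going
print(speedup(2.0, 0.5))                 # ratio of the two runtimes
```

Under this rule, a model that keeps issuing tool calls, even occasionally, is never cut off, which is consistent with Qwen3.7-Max still running at the 35-hour mark.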