grok-4.3 edges gpt-5.4-mini on execution

Grok 4.3 outperformed GPT 5.4 Mini in a head-to-head execution benchmark, scoring 38.3 to 36.2 by demonstrating greater reliability on formatting, tone control, and frictionless output. In a key test converting messy orders to JSON, only Grok 4.3 returned valid JSON directly, while GPT 5.4 Mini wrapped its response in Markdown fences, violating the requirement. The margin, though narrow, signals a meaningful advantage in practical task execution.
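
The benchmark's grader is not shown here, so the following is only a minimal sketch of why a fenced answer can fail a "return valid JSON directly" requirement. It assumes a strict check that hands the raw model output to a JSON parser without stripping anything. The function name `is_direct_json` and the sample order are hypothetical, made up for illustration.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable


def is_direct_json(output: str) -> bool:
    """Return True only if the entire output parses as JSON as-is.

    Hypothetical strict check: nothing is stripped or repaired first,
    so any wrapper text around the JSON makes the output fail.
    """
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


# Illustrative payload only, not taken from the benchmark.
bare = '[{"order_id": 1, "item": "widget", "qty": 2}]'
fenced = f"{FENCE}json\n{bare}\n{FENCE}"

print(is_direct_json(bare))    # bare JSON passes the strict check
print(is_direct_json(fenced))  # the same JSON inside Markdown fences does not
```

Under this assumption, the fenced answer carries exactly the same data as the bare one, yet a downstream consumer that parses the response directly would reject it. That is the kind of avoidable friction the comparison penalizes.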

This wasn’t a blowout, but the margin is real. Grok 4.3 takes the aggregate, 38.3 to 36.2, because it was more reliable on the kinds of details that decide practical head-to-heads: exact formatting, tone control, and not adding avoidable friction. The cleanest example is the messy-orders-to-JSON test, where both models parsed, normalized, and sorted correctly, but only Grok 4.3 actually obeyed the requirement to return valid JSON directly. GPT 5.4 Mini wrapped its answer in Markdown fences, which is ...