This wasn’t a blowout, but the margin is real. grok 4.3 takes the aggregate, 38.3 to 36.2, because it was more reliable on the kinds of details that decide practical head to heads: exact formatting, tone control, and not adding avoidable friction. The cleanest example is messy orders to json , where both models parsed, normalized, and sorted correctly, but only grok 4.3 actually obeyed the requirement to return valid JSON directly. gpt 5.4 mini wrapped its answer in Markdown fences, which is ...
OpenAI WebRTC Audio Session, now with document context