Head to head: grok-4.3 vs mistral-small-2503 Grok-4.3 defeated Mistral-Small-2503 in a head-to-head text model comparison, scoring 38.0 to 22.0 across four tasks including coding, writing, summarization, and data wrangling. The evaluation found Grok-4.3 more accurate and disciplined on edge cases and output formatting, while Mistral-Small-2503 repeatedly failed on details such as incorrect logic in a billing task and adding Markdown fences when code-only was required. grok-4.3 wins because it was reliably correct where these evaluations actually matter: edge cases, audience fit, and output discipline. The aggregate score says 38.0 to 22.0, but the more important story is that grok-4.3 kept clearing the practical bar while mistral-small-2503 repeatedly stumbled on details that should not be optional. The clearest separation came in the Python billing task. grok-4.3 handled month rollover, preserved end-of-month behavior, clamped correctly for shorter months, and included the required asserts. mistral-small-2503 reached for a timedelta -style monthly fix that is simply the wrong tool for calendar billing, failed to preserve the required day semantics, and then ignored the format instruction by adding Markdown fences to a code-only prompt. That is not a near miss; that is a model missing both the logic and the contract. The writing and summarization results tell a similar story. In the vendor incident update, grok-4.3 wrote for technical customer contacts the way a competent operator should: candid, concise, and specific about impact, cause, remediation, next steps, and support. mistral-small-2503 padded the message with generic customer-comms fluff, placeholder signature fields, and less precise phrasing. In the meeting-notes task, grok-4.3 stayed faithful to the source and extracted the requested fields cleanly, while mistral-small-2503 invented a year for the launch date and misfiled action items and dependencies as blocked items. Those are accuracy errors, not stylistic preferences. Even in the data-wrangling round, where both models normalized the orders correctly, grok-4.3 still separated itself by obeying the instruction to return JSON only. mistral-small-2503 again wrapped valid content in Markdown fences and turned a straightforward formatting requirement into a failure mode. That kind of sloppiness is exactly what breaks downstream pipelines. Final call: grok-4.3 is the clearly better text model here. It was more accurate, more disciplined, and more usable in real workflows; mistral-small-2503 lost on the details that professionals cannot afford to excuse. How they were tested We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 38.0 to mistral-small-2503's 22.0. 1. Practical coding — Python invoice due date fix Language: Python 3.11. Return code only, no explanation. Fix this function so it calculates the next invoice due date correctly for a monthly billing system. Requirements: - Input start date is YYYY-MM-DD . - months ahead is a nonnegative integer. - Preserve end-of-month behavior: if the start date is the last day of its month, the result must be the last day of the target month. - Otherwise, keep the same day-of-month when possible; if the target month is shorter, clamp to that month's last day. - Raise ValueError for invalid dates. Current buggy code: python from datetime import datetime def next due date start date: str, months ahead: int - str: dt = datetime.strptime start date, "%Y-%m-%d" month = dt.month + months ahead year = dt.year + month // 12 month = month % 12 return dt.replace year=year, month=month .strftime "%Y-%m-%d" Also include these asserts in your answer: python assert next due date "2024-01-31", 1 == "2024-02-29" assert next due date "2023-01-31", 1 == "2023-02-28" assert next due date "2024-08-31", 6 == "2025-02-28" assert next due date "2024-05-30", 1 == "2024-06-30" assert next due date "2024-05-15", 10 == "2025-03-15" Winner: grok-4.3 — A correctly handles month rollover, end-of-month preservation, clamping for shorter months, and includes the required asserts. B uses an incorrect timedelta-based approach for monthly billing, does not preserve the required day semantics, and includes markdown fences despite the prompt requiring code only. 2. Professional writing — vendor incident status update Write a status update email to enterprise customers about a resolved service incident. Context: - Company: Northline Metrics - Product: RouteCast API - Incident window: 08:42–10:17 UTC on 14 May - Impact: about 18% of requests to /v2/forecast returned HTTP 503 - Cause: bad cache invalidation rule deployed during a rollout in eu-west-1 - Resolution: rollback completed, cache warmed, extra alert added - No data loss or security issue - Next step: publish full RCA by 17 May Audience: technical customer contacts at logistics companies. Tone: candid, calm, professional; no marketing fluff. Length: 140–190 words. Include: apology, customer impact, plain-English cause, what was done, what happens next, and a support contact line. Winner: grok-4.3 — A is more appropriately tailored to technical customer contacts: it is candid, concise, and avoids marketing language while clearly covering impact, cause, remediation, next steps, and support contact. B includes fluff 'Dear Valued Customers,' 'continued trust' , placeholder signature fields, and slightly less precise phrasing for this audience. 3. Summarization & extraction — meeting notes to bullets + facts Read the meeting notes below. Then produce: 1 a 3-bullet summary for an exec who missed the meeting 2 a JSON object with keys launch date , owner , blocked items , and numeric targets Meeting notes: """ Atlas mobile onboarding sync — 3 Sept, 09:30 - Priya said Android crash-free sessions improved from 96.8% to 98.1% after the image-loader patch. - iOS is still blocked by the App Store review; Mateo expects a decision by Friday. - Marketing needs the final screenshot set by 16:00 Thursday or the paid social campaign slips a week. - We agreed not to launch with Apple SSO yet; too many edge cases in school-managed accounts. - Revised target launch date: 18 Sept, assuming iOS approval lands this week. - Juno will own the go/no-go checklist and send it for signoff. - Support asked for a macros doc covering invite-code failures, duplicate profiles, and timezone-related reminder confusion. - KPI target for week 1 remains: onboarding completion 72% and day-7 retention 31%. """ Return valid JSON for part 2. Winner: grok-4.3 — A is more faithful and concise: its summary captures the key decisions, risks, owner, and KPI targets, and its JSON cleanly extracts the requested fields without overreaching. B adds an unsupported year to the launch date and misclassifies action items/dependencies as blocked items, making its extraction less accurate. 4. Data wrangling — messy orders into clean JSON Transform the messy order lines below into a valid JSON array. Output JSON only. Schema for each object, in this exact key order: order id string , customer string , sku string , qty integer , unit price number , currency string , ship date string YYYY-MM-DD or null Rules: - Trim spaces. - Normalize currency to 3-letter uppercase codes. - Parse quantity as integer. - Parse unit price as a number without currency symbols. - Normalize dates to YYYY-MM-DD. - If ship date is missing, use null. - Keep input order. Messy data: A-9041 | Lumen Dockworks | WX-14B | qty 3 | $19.50 | usd | ships 2025/04/09 A-9042|Pine & Circuit Co.|QZ-8|2 units|EUR 7.2|eur|ship: 09-04-2025 A-9043 | Nara Studio | MESH-2 | 11 | GBP 0.85 | gbp | pending A-9044| Helio Foods|COLD-7|qty:1| CAD$104.00 | cad | 2025-04-12 A-9045 | Orbit Kiosk | TAG-44 | Qty 5 | 12.00 usd | USD | ship 2025.04.15 Winner: grok-4.3 — Both outputs correctly normalize the data, but A follows the instruction to output JSON only. B wraps the JSON in Markdown code fences, which violates the format requirement. See every prompt and the full side-by-side outputs in the interactive Head-to-Head /head-to-head/head-to-head-grok-4-3-vs-mistral-small-2503 .