Head to head: Anthropic: Claude Opus 4.8 vs Google: Gemini 3.5 Flash

Google's Gemini 3.5 Flash narrowly defeated Anthropic's Claude Opus 4.8 in a head-to-head benchmark, scoring 35.4 to 34.8 across four tasks. Gemini won three of four tasks by demonstrating superior instruction-following, format compliance, and tone control, while Claude produced the best single summary but was less reliable overall.

The final margin is narrow — 35.4 to 34.8 for Google: Gemini 3.5 Flash — but the reasons are not mysterious. Gemini takes three of four tasks, and in each of those wins the edge comes from cleaner compliance with the brief rather than flashy ambition. That matters in real production use, where small formatting misses and slightly loose pattern matching are exactly how good outputs become annoying ones. The most convincing Gemini win is probably messy-orders-to-json . Both models normalized and sorted the data correctly, but Claude wrapped the answer in a Markdown code fence despite an explicit JSON-only instruction. That is a straight instruction-following error, and Gemini didn’t make it. On release-delay-note , the gap is subtler but still real: both were accurate, yet Gemini better matched the requested calm, professional Slack tone by staying concise, accountable, and free of unnecessary stylistic garnish. The python-log-redactor result reinforces the same pattern. Both solutions were solid, standard-library-only, and correctly avoided clobbering version numbers like v2.14.7 . Gemini’s advantage came from a better Bearer-token regex: using word boundaries lowers the risk of partially redacting a longer alphanumeric string. Claude’s solution works, but it is a little sloppier at the edges, and edge cases are where redaction tools live or die. Claude’s best showing came in meeting-notes-summary , and it was a legitimate win. It captured both key risks, included the extra owner/deadline detail, and produced a richer numbers mentioned field. Gemini was mostly right, but it missed the Safari/Klarna risk in the JSON and lost some nuance around draft versus final timing. That result shows Claude still has the stronger instinct for fuller synthesis when the task rewards completeness over strict brevity. Final call: Google: Gemini 3.5 Flash wins because it was more reliable where users actually feel reliability — format compliance, tone control, and tighter handling of implementation details. Claude Opus 4.8 produced the best single summary, but Gemini was the better editor’s choice across the full slate. How they were tested We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. Anthropic: Claude Opus 4.8 scored 34.8 to Google: Gemini 3.5 Flash's 35.4. 1. python-log-redactor Practical coding — Python. Return code only. Write a function redact log line: str - str that sanitizes a single application log line before it is shipped to a vendor. Replace: - any email address with EMAIL - any IPv4 address with IP - any bearer token of the form Bearer <token with Bearer TOKEN where <token is 20+ characters of letters, digits, or - Requirements: - preserve all other text exactly - do not redact version numbers like v2.14.7 - do not redact invalid IP-like text such as 999.12.3.4 - use only the Python standard library Example input line: 2026-04-09T07:14:55Z WARN user=mina.cho@northlane.io ip=10.44.18.201 auth='Bearer sk live 9AbCdEfGhIjKlMnOpQrStUv' path=/api/v2.14.7/status Expected output for that example: 2026-04-09T07:14:55Z WARN user= EMAIL ip= IP auth='Bearer TOKEN ' path=/api/v2.14.7/status Winner: Google: Gemini 3.5 Flash — Both solutions are standard-library-only and correctly handle valid IPv4s without redacting version numbers like v2.14.7. Model B is slightly better because its Bearer-token regex uses word boundaries, reducing the chance of partially redacting a longer alphanumeric string, while Model A could match just the first 20+ valid characters of a longer token-like sequence. 2. release-delay-note Professional writing. Draft a Slack update to the company-wide ops-and-sales channel. Audience: account managers and support leads. Tone: calm, accountable, no blame. Length: 110-140 words. Situation: This morning's rollout of the Atlas invoicing export is delayed. During final checks, the team found duplicate rows appearing only for customers with archived cost centers. No customer data was lost or exposed. We have paused the rollout, disabled the new export toggle for all workspaces, and expect another update by 3:30 PM Eastern. Ask customer-facing teams not to promise a launch time yet, and route urgent client asks to Priya Nadeem in Revenue Systems. Winner: Google: Gemini 3.5 Flash — Both are strong and accurate, but B is slightly better aligned to the requested calm, professional Slack update: it is more concise, avoids extra formatting/emojis, and maintains an accountable, no-blame tone while covering every required detail. A is also good, but the headline, numbered list, and emoji make it feel a bit more stylized than necessary. 3. meeting-notes-summary Summarization & extraction. Read the meeting notes below, then produce: 1 a 2-sentence summary 2 a JSON object with keys decision , owner , deadline , risks , and numbers mentioned Source notes: "Checkout reliability sync — 11 May - Jae: error rate fell from 3.8% to 1.1% after disabling the coupon recompute step. - Marta: mobile Safari still times out on the payment page for some Klarna users; 17 complaints since Friday. - Decision: keep the recompute step off through the Cedar & Finch campaign. - Owner for root-cause writeup: Marta. - Deadline: draft by Tuesday 14:00 UTC, final by Wednesday noon. - Risk: finance team says coupon totals may be off by up to $0.27 on roughly 2% of orders. - Jae will add a banner to the internal status page by 09:30 UTC. - No evidence of fraud or duplicate charges." Return only the summary and the JSON. Winner: Anthropic: Claude Opus 4.8 — A is more complete and accurate: it captures both key risks, includes the additional owner/deadline detail from the notes, and provides a richer numbers mentioned field. B is mostly correct but omits the Safari/Klarna risk from the JSON, misses the draft/final nuance in the summary, and its numbers list is less informative. 4. messy-orders-to-json Data wrangling / structured output. Convert the messy order lines below into valid JSON: an array of objects sorted by order id ascending. Each object must have exactly these fields: order id string , customer string in title case , sku uppercase string , qty integer , unit price number , rush boolean , ship date string YYYY-MM-DD or null . Rules: - trim extra spaces - title case customer names - uppercase SKU - parse rush yes/no/Y/N into true/false - if ship date is TBD or blank, use null - preserve numeric values exactly as decimals, not strings Messy data: ord-204 | nora feld | qx-9 | 3 | 19.95 | yes | 2026/07/03 ord-198|ELI TRAN|ab-11|1|250|N|TBD ord-203 |mara iTO| lk-2 | 12 | 4.5 | y | 2026-07-01 ord-201| jules park |zx-77|2| 88.00|no| Return JSON only. Winner: Google: Gemini 3.5 Flash — Both outputs correctly normalize and sort the data, but Model A violates the instruction to return JSON only by wrapping the array in a Markdown code fence. Model B provides valid raw JSON and is therefore better. See every prompt and the full side-by-side outputs in the interactive Head-to-Head /head-to-head/head-to-head-anthropic-claude-opus-4-8-vs-google-gemini-3-5-flash .