Head to head: Anthropic: Claude Opus 4.8 vs Google: Gemini 3.5 Flash

Google's Gemini 3.5 Flash narrowly defeated Anthropic's Claude Opus 4.8 in a head-to-head benchmark, scoring 35.4 to 34.8 across four tasks. Gemini won three of four tasks by demonstrating superior instruction-following, format compliance, and tone control, while Claude produced the best single summary but was less reliable overall.

The final margin is narrow, 35.4 to 34.8 for Google: Gemini 3.5 Flash, but the reasons are not mysterious. Gemini takes three of four tasks, and in each of those wins the edge comes from cleaner compliance with the brief rather than flashy ambition. That matters in real production use, where small formatting misses and slightly loose pattern matching are exactly how good outputs become annoying ones.

The most convincing Gemini win is probably messy-orders-to-json. Both models normalized and sorted the data correctly, but Claude wrapped the answer in a Markdown code fence despite an explicit JSON-only instruction. That is a straight instruction-following error, and Gemini didn’t make it. On release-delay-note, the gap is subtler but still real: both were accurate, yet Gemini better matched the requested calm, professional Slack tone by staying concise, accountable, and free of unnecessary stylistic garnish.

The python-log-redactor result reinforces the same pattern. Both solutions were solid, standard-library-only, and correctly avoided clobbering version numbers like v2.14.7. Gemini’s advantage came from a better Bearer-token regex: using word boundaries lowers the risk of partially redacting a longer alphanumeric string. Claude’s solution works, but it is a little sloppier at the edges, and edge cases are where redaction tools live or die.

Claude’s best showing came in meeting-notes-summary, and it was a legitimate win. It captured both key risks, included the extra owner/deadline detail, and produced a richer numbers mentioned field.
Gemini was mostly right, but it missed the Safari/Klarna risk in the JSON and lost some nuance around draft versus final timing. That result shows Claude still has the stronger instinct for fuller synthesis when the task rewards completeness over strict brevity.

Final call: Google: Gemini 3.5 Flash wins because it was more reliable where users actually feel reliability: format compliance, tone control, and tighter handling of implementation details. Claude Opus 4.8 produced the best single summary, but Gemini was the better editor’s choice across the full slate.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. Anthropic: Claude Opus 4.8 scored 34.8 to Google: Gemini 3.5 Flash's 35.4.

1. python-log-redactor

Practical coding: Python.

Return code only. Write a function redact log line: str - str that sanitizes a single application log line before it is shipped to a vendor. Replace:
- any email address with EMAIL
- any IPv4 address with IP
- any bearer token of the form Bearer