Gemma 4 outpaces Qwen 3.6 on code review

Source: runagentrun.co.uk · Published 2026-06-25

Google's Gemma 4 31B outperforms Alibaba's Qwen 3.6 27B on agentic code review tasks, finishing faster due to superior Multi-Token Prediction (MTP) design, according to benchmarks and field reports. While Qwen 3.6 leads in hard math and world knowledge, Gemma 4 excels in instruction following, graduate reasoning, and latency, making it more reliable for practical coding workflows.

Gemma 4 finishes the code review first

A controlled benchmark on the Kaitchup substack and a self-hoster’s field report both reach the same verdict: Google’s Gemma 4 31B beats Alibaba’s Qwen 3.6 27B on agentic code work, and finishes faster. The surprising variable is Multi-Token Prediction (MTP), a technique that drafts several tokens at once to speed up generation. Gemma 4’s MTP implementation is doing real work; Qwen 3.6’s is producing weaker output on coding tasks.

Kaitchup ran both models through identical accuracy, latency and memory tests. Qwen 3.6 dominated hard maths (AIME-style problems, with a CoDeC contamination score above 62, which is rare in this size class) and world knowledge (MMLU Pro). Gemma 4 31B led on instruction following (IFBench), graduate-level reasoning (GPQA Diamond) and raw latency. A larger model running faster than a smaller dense one is the headline that took off on X.

What the benchmarks actually show

Kaitchup’s numbers, cross-checked against Artificial Analysis on at least one metric, paint a more nuanced picture:

- Hard maths (AIME): Qwen 3.6 ahead of both Qwen 3.5 and Gemma 4. CoDeC score above 62.
- World knowledge (MMLU Pro): Qwen 3.6 ahead.
- Single-turn coding (LiveCodeBench): Qwen 3.6 ahead of Qwen 3.5 but behind Gemma 4 on pass@1; tied at pass@4.
- Instruction following (IFBench): Gemma 4 ahead by a wide margin.
- Graduate reasoning (GPQA Diamond): Gemma 4 ahead. This is a surprise, since Alibaba’s own numbers claim a 2.3-point improvement for Qwen 3.6. Kaitchup suspects different evaluation setups; Artificial Analysis found the same.
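For readers unfamiliar with the pass@1 and pass@4 figures on LiveCodeBench, here is a minimal sketch of the commonly used unbiased pass@k estimator. It is an assumption that Kaitchup computed the metric this way; the article does not say. The function name `pass_at_k` and the sample counts in the usage note are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled solutions passes.

    n: total solutions sampled for a problem
    c: how many of those n passed the tests
    k: the budget being scored (1 for pass@1, 4 for pass@4)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples failed, any draw of k must include a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With hypothetical counts, a model that solves a problem in 1 of 4 samples scores `pass_at_k(4, 1, 1) == 0.25`, while one that solves it in 2 of 4 scores `0.5`; both reach `1.0` at k=4. That is one way two models can differ at pass@1 and still tie at pass@4.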

Qwen 3.6 is sharper on raw knowledge and maths; Gemma 4’s combination of a mixture-of-experts (MoE) architecture — where only some parameters fire per token — plus MTP is calmer and faster on the agent workflow that matters in practice.

The MTP surprise in the field

Qwen 3.6 27B is great but I have found Gemma 4 31B much more reliable. It doesn’t overthink, uses the right tools only when needed, and can run faster thanks to its superior MTP design. A larger model running faster than a smaller one, that’s crazy!!

— Behnam (@OrganicGPT), X, 6 June 2026

Benchmarks don’t always survive contact with real code. One self-hoster running Qwen 3.6 27B Q8_K_XL (an 8-bit quantisation tuned for quality) on four RTX 5070 Ti cards through llama.cpp and the OpenCode CLI reported that in roughly eight out of ten runs, the non-MTP variant produced more findings, in more detail, on a simple “Do a code review of this branch.” prompt than the MTP variant did.

MTP is a latency play, not always a quality play. For code review and other reasoning-heavy agentic tasks, drafting multiple tokens at once can hurt as much as it helps. The post above attributes the difference to Gemma 4’s MTP design — it doesn’t overthink simple steps and only invokes tools when they’re needed.
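To see why MTP is a latency lever, a rough model helps. The sketch below assumes drafted tokens are checked in a single verification pass and that each is accepted independently with the same probability, a simplification borrowed from speculative-decoding analysis. Neither `accept_prob` nor `draft_len` is a measured value for Gemma 4 or Qwen 3.6; both are hypothetical inputs.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    accept_prob: assumed chance that each drafted token is accepted (0..1)
    draft_len:   number of tokens drafted ahead per pass
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        # Every draft is accepted, plus the one token the pass itself yields.
        return float(draft_len + 1)
    # Geometric series: 1 + p + p^2 + ... + p^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)
```

Under these assumptions, drafting 3 tokens at an 80% acceptance rate yields about 2.95 tokens per pass, while a 50% rate yields about 1.88. A draft head that guesses well buys real speed, and one that guesses poorly buys very little.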

For UK teams self-hosting on modest hardware, MTP support varies by engine: llama.cpp doesn’t yet support MTP for Gemma 4 31B, so if you want the speed-up you’ll need vLLM (an inference engine optimised for serving models at scale) or another runtime.

How to try it this afternoon

You don’t need a four-GPU rig. A single 24 GB card runs both models in Q4 or Q5 quantisation (4-bit or 5-bit — quality is good enough for code review, and the models fit in roughly 18–22 GB of VRAM).
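As a back-of-the-envelope check on those VRAM figures, weight memory can be estimated from parameter count and bits per weight. This is a sketch only: it uses nominal bit widths, and the `overhead_gb` allowance for KV cache and runtime buffers is an assumed placeholder, not a measured number.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (decimal): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = 24.0, overhead_gb: float = 3.0) -> bool:
    """Rough fit check; overhead_gb is an assumed allowance, not a measurement."""
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb
```

On this estimate, a 27B model at 4 bits needs about 13.5 GB for weights and a 31B model at 5 bits about 19.4 GB. Real quant formats store scaling data alongside the weights, so actual files run somewhat larger than the nominal figure, and context length adds KV-cache memory on top.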

- Pull both with Ollama (`ollama pull qwen3.6:27b` and `ollama pull gemma4:31b`), or browse the Qwen and Gemma repos on Hugging Face for a specific quant. We compared Ollama and LM Studio in “LM Studio vs Ollama in 2026” if you want the trade-offs first.
- Install OpenCode CLI (`npm i -g opencode`), a small open-source coding agent that talks to local endpoints via Ollama.
- Point both at the same prompt on a small repo: *Do a code review of this branch and list findings with file:line references.* Save each output separately.
- Time them. Record wall-clock seconds and total tokens consumed. MoE-vs-dense and MTP differences show up clearly at the token level.
- Turn MTP on and off in vLLM to reproduce the field report. With Qwen 3.6, expect the non-MTP run to be more thorough; with Gemma 4, MTP is the speed lever and quality stays flat.
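The timing step above can be roughed out without the agent in the loop by calling Ollama's local HTTP API directly. This sketch assumes Ollama is serving on its default port and that the non-streaming `/api/generate` response carries `eval_count`, `eval_duration` and `total_duration` (durations in nanoseconds). It bypasses OpenCode, so it measures raw generation rather than a full agentic review. The `FILE_LINE` pattern is a crude, hypothetical way to count findings.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
FILE_LINE = re.compile(r"\b[\w./-]+\.\w+:\d+\b")    # rough match for path.ext:line


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarise(result: dict) -> dict:
    """Reduce a response to the numbers worth comparing between models."""
    eval_count = result.get("eval_count", 0)
    eval_ns = result.get("eval_duration", 0)
    total_ns = result.get("total_duration", 0)
    return {
        "output_tokens": eval_count,
        "wall_seconds": total_ns / 1e9,  # server-reported, includes model load
        "tokens_per_second": eval_count / (eval_ns / 1e9) if eval_ns else 0.0,
        "findings": len(FILE_LINE.findall(result.get("response", ""))),
    }


def compare(models: list, prompt: str) -> dict:
    """Run the same prompt against each model tag and summarise the results."""
    return {model: summarise(generate(model, prompt)) for model in models}
```

Calling `compare(["qwen3.6:27b", "gemma4:31b"], prompt)` with the review prompt and a pasted diff gives one row per model. Run it several times per model, since a single sample says little about an eight-in-ten pattern.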

What to weigh up:

- Gemma 4 31B wins if your daily workload is agent-style coding, code review, or anything where “stop thinking and call the tool” matters more than raw knowledge.
- Qwen 3.6 27B wins if you want one model for maths, summarisation and reasoning-heavy Q&A without swapping weights, and you’re quantising hard.
- If you’re tight on VRAM, the Qwen 3.6-35B-A3B MoE we covered in “Qwen3.6-35B-A3B is the local coding agent” stays under 24 GB.
