[ARTICLE · art-16088] src=deepresearch.ninja pub= topic=large-language-models verified=true sentiment=· neutral

Qwen3.6 27B/35B-A3B vs Gemma 4 vs DeepSeek V4: A Comprehensive Analysis of the Open-Weight Frontier (May 2026)

Alibaba's Qwen3.6-27B dense model outperformed its own 397-billion-parameter MoE predecessor on SWE-bench Verified (77.2% vs. ~76.2%) while running on a single RTX 4090, challenging the prevailing MoE paradigm for models under 80 billion parameters. DeepSeek V4-Pro led all open-weight models in raw coding performance with 80.6% on SWE-bench Verified but required a datacenter-scale 8× H100 cluster, while none of the three model families published standardized safety evaluation data, creating a critical gap for enterprise deployment.

46 min read · published May 20, 2026

A deep-dive into the latest generation of open-weight large language models — examining architectures, benchmarks, trade-offs, and the strategic positioning of Alibaba's Qwen3.6 series against Google's Gemma 4 and DeepSeek V4 in the rapidly evolving 2026 open-source LLM landscape.

Executive Summary

Bottom line: By April 2026, open-weight models in the ~27–35B active-parameter range have crossed the frontier threshold on coding and reasoning benchmarks — but only when evaluated under comparable conditions. Independent analysis reveals that benchmark scores across Qwen3.6, Gemma 4, and DeepSeek V4 were collected under different protocols (agent scaffolding, temperature differences, think-mode vs non-think), making head-to-head comparisons inherently imprecise. After normalizing for these methodological differences, the following conclusions emerge:

1. Architectural design matters more than parameter count at the 27–35B scale. Qwen3.6-27B (dense, 27B params) outperforms its own Qwen3.5-397B-A17B MoE predecessor on SWE-bench Verified (77.2% vs. ~76.2%) despite having roughly 1/15th the total parameters (it does activate more parameters per token: 27B vs. 17B). Its hybrid Gated DeltaNet + Gated Attention architecture, in which 75% of layers use linear-attention-style DeltaNet for long-context efficiency, enables this. This challenges the prevailing MoE paradigm: dense models win in the 27–80B range on efficiency, while MoE dominates above ~200B total parameters.

2. The ~27B dense sweet spot: Qwen3.6-27B is the best consumer-hardware model for coding. It runs on a single RTX 4090 (24GB VRAM at Q4 quantization, ~16.8GB weight footprint), scores 77.2% on SWE-bench Verified, 94.1% on AIME 2026 (math), and 87.8% on GPQA Diamond (science). It is the only model in its class to combine frontier-tier coding, mathematics, multimodal vision, and a single-GPU deployment profile under Apache 2.0.

3. DeepSeek V4-Pro delivers the highest raw coding performance but at datacenter scale. At 1.6T total / 49B active parameters, it leads LiveCodeBench v6 (93.5%) and SWE-bench Verified (80.6%) among open models. However, it requires ~400GB+ VRAM (8× H100 cluster), making self-hosting impractical for most organizations. At $0.435/$0.87 per million tokens via API, it remains ~34× cheaper than Claude Opus but is economically viable only for high-volume use cases.

4. The MoE comparison: Qwen3.6-35B-A3B (256 routed experts, 8 routed + 1 shared active) significantly outperforms Gemma 4-26B-A4B (128 experts, 8 routed + 1 shared active) on agentic reasoning. Despite similar active parameter counts (3B vs. 3.8B), Qwen3.6-35B-A3B scores 21.4 vs. 8.7 on HLE (no tools), a 2.5× gap. Routing sparsity alone is unlikely to explain it: Gemma activates a larger share of its routed experts (8 of 128 = 6.25%) than Qwen does (8 of 256 ≈ 3.1%), so differences in expert count, training data, and task-specific optimization are more plausible factors. Qwen3.6-35B-A3B achieves 73.4% on SWE-bench Verified, while Gemma 4-26B-A4B is not officially benchmarked on this metric.

5. Safety evaluations remain the weakest link across all three families. None of the flagship models publishes standardized safety scores (AdvGLUE, JailbreakBench, GCG attack success rates). Only Gemma 4 provides qualitative safety claims (“significantly outperforms Gemma 3 in safety while keeping unjustified refusals low”), tested without safety filters. Qwen3.6 and DeepSeek V4 provide zero safety evaluation data. This gap is critical for enterprise deployment decisions.

6. Competitive landscape has expanded beyond the three core families. GLM-5.1 (Z.AI, 744B/40B MoE) leads open models on SWE-bench Pro at 58.4%, surpassing GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). Kimi K2.6 (Moonshot AI, 1T/32B MoE) leads all models on HLE with tools at 54.0 and scores Intelligence Index 54 (highest among open-weight models). Llama 4 Scout (Meta, 109B/17B MoE) offers a unique 10M context window. These models, while outside the ~27–35B scope of this report, define the broader competitive environment.

Licensing: Qwen3.6 and Gemma 4 use Apache 2.0 (unrestricted commercial use). DeepSeek V4 uses MIT license. All are fully open-weight for self-hosting.

Strategic recommendation: For local deployment, Qwen3.6-27B is the best all-around open model — strong coding (77.2% SWE-bench Verified), excellent math (94.1% AIME 2026), multimodal vision support, Apache 2.0 license, and runs on a single RTX 4090. For API-based use where maximum coding performance matters, DeepSeek V4-Pro offers the best open-weight option at $0.435/$0.87 per million tokens. For enterprises prioritizing compliance and transparency, Gemma 4 (Google’s transparent data pipeline and qualitative safety evaluations) is preferable.

Background and Context

The Open-Weight Revolution: Where We Are in May 2026

The open-weight LLM landscape has undergone a radical transformation over the past twelve months. In mid-2024, the “open” models were typically 7–13B parameter distilled versions of frontier models, trailing closed models by 20–40 percentage points on coding and reasoning benchmarks. By early 2025, the Qwen3 and Llama 3 releases showed that open models could approach frontier quality at the 70B scale. The DeepSeek R1 release in January 2025 (which used test-time compute scaling via RL) briefly closed the gap entirely on reasoning tasks.

By February–April 2026, the landscape shifted again. Alibaba’s Qwen3.5 family (including a 397B-A17B MoE flagship), Google’s Gemma 3 (March 2025), and Meta’s Llama 4 series established a new baseline where models in the 27–80B active parameter range could compete with previous-generation closed flagships. April 2026 was the inflection point: three major open-weight families launched within ten days of each other, and the open-weight curve crossed the closed-weight curve on enterprise-critical metrics.

Why This Question Matters Now

The convergence of Qwen3.6, Gemma 4, and DeepSeek V4 represents a structural shift in the AI industry:

- Cost compression: Open models now offer frontier-tier performance at 1/10th to 1/70th the API cost of closed alternatives. DeepSeek V4-Pro costs $0.435/$0.87 per million input/output tokens vs. Claude Opus at ~$15/$75, a 34× price difference.
- Hardware democratization: Models in the 27–35B range now run on consumer hardware (single RTX 4090, M-series Macs), removing the need for enterprise GPU clusters for many use cases.
- Architectural innovation: The three models represent distinct architectural philosophies (Qwen’s hybrid Gated DeltaNet + Attention, Google’s dense transformer with sliding window attention, and DeepSeek’s massive MoE with compressed attention), each solving the scale-efficiency trade-off differently.
- Agentic coding as the new benchmark: SWE-bench and Terminal-Bench have become the de facto measures of “real utility” for developers, moving beyond static benchmarks like MMLU toward dynamic, task-completion evaluation.

Generational Progression: Qwen3 → Qwen3.5 → Qwen3.6

The Qwen series has progressed rapidly:

- Qwen3 (April 28, 2025): Dense models from 0.6B to 32B, MoE at 30B-A3B and 235B-A22B. Trained on 36T tokens across 119 languages. Hybrid thinking/non-thinking mode.
- Qwen3.5 (February 16, 2026): Flagship 397B-A17B MoE, medium models at 27B/35B-A3B/122B-A10B, small at 0.8B–9B. Optimized for agentic workflows.
- Qwen3.6 (April 2026): API-only Plus variant (Intelligence Index: 50), open 35B-A3B (Index: 43/32), and dense 27B (Index: 46/37). The 27B model notably outperforms the Qwen3.5-397B MoE on SWE-bench Verified.

Key Definitions

- **Dense model**: Every parameter is active for every input token (e.g., Qwen3.6-27B, Gemma 4-31B)
- **MoE (Mixture of Experts)**: Each token activates only a subset of the model’s total parameters via routing (e.g., Qwen3.6-35B-A3B with 35B total / 3B active, DeepSeek V4-Pro with 1.6T total / 49B active)
- **Active parameters**: The number of parameters actually computed per token during inference; this, not total parameter count, determines compute cost
- **Open-weight**: Model weights are publicly available for download and self-hosting (as opposed to API-only access)
- **Apache 2.0 / MIT license**: Permissive open-source licenses allowing unrestricted commercial use, modification, and redistribution

Current State: Model Profiles and Technical Specifications

### Qwen3.6 Family (Alibaba / Qwen Team)

**Qwen3.6-27B** (released April 22, 2026)

- **Type**: Dense, multimodal (text + image input → text output)
- **Parameters**: ~27B total, all active
- **Architecture**: 64 layers with hybrid layout: 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
  - Gated DeltaNet: 48 heads for values (V), 16 heads for queries/keys (QK), head dimension 128
  - Gated Attention: 24 query heads, 4 KV heads, head dimension 256, RoPE dim 64
  - FFN intermediate dimension: 17,408
  - Multi-Token Prediction (MTP) trained for speculative decoding
- **Context window**: 262,144 tokens native; extensible to ~1,010,000 via YaRN RoPE scaling
- **Vocabulary**: 248,320 tokens (padded)
- **License**: Apache 2.0
- **Key features**: Agentic coding optimization, hybrid thinking/non-thinking modes, vision-language capabilities, 201 languages supported, MCP tool integration
- **Hardware requirements**: ~54GB VRAM at BF16; ~27GB at INT8; ~16.8–18GB at Q4_K_M GGUF (fits single RTX 4090/3090 with headroom) [1,2]; ~22–24GB on M-series Macs with unified memory
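The BF16 and INT8 figures above follow from parameter count times bits per weight. The sketch below reproduces that arithmetic and inverts it to see what average bit width a 16.8GB file implies. It assumes the footprint is dominated by weights (activations, KV cache, and runtime overhead are ignored), and the helper names `weight_footprint_gb` and `effective_bits_per_weight` are hypothetical, not part of any tool mentioned here.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(num_params: float, footprint_gb: float) -> float:
    """Invert the formula: average bits per weight implied by a given size."""
    return footprint_gb * 1e9 * 8 / num_params


PARAMS_27B = 27e9  # approximate parameter count quoted in the profile

bf16_gb = weight_footprint_gb(PARAMS_27B, 16)
int8_gb = weight_footprint_gb(PARAMS_27B, 8)
q4_bits = effective_bits_per_weight(PARAMS_27B, 16.8)

print(f"BF16: {bf16_gb:.1f} GB, INT8: {int8_gb:.1f} GB")
print(f"A 16.8 GB file implies about {q4_bits:.2f} bits per weight")
```

The implied width of just under 5 bits per weight is plausible for a mixed 4-bit GGUF scheme, which stores some tensors at higher precision.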

**Qwen3.6-35B-A3B** (released mid-April 2026)

- **Type**: MoE, multimodal
- **Parameters**: 35B total, 3B active per token (8 routed + 1 shared active out of 256 routed experts)
- **Architecture**: Hybrid Gated DeltaNet + Gated Attention with MoE routing across 40 layers; MoE layers applied after both Gated DeltaNet and Gated Attention blocks in a repeating 10-block structure: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
  - Gated DeltaNet: 32 V heads, 16 QK heads, head dim 128
  - Gated Attention: 16 Q heads, 2 KV heads, head dim 256, RoPE dim 64
  - Expert intermediate dimension: 512
- **Context window**: 262,144 tokens native; extensible to ~1,010,000 via YaRN
- **Vocabulary**: 248,320 tokens
- **License**: Apache 2.0
- **Key features**: Same hybrid thinking architecture, 12:1 efficiency ratio, agentic coding focus
- **Hardware requirements**: ~70GB VRAM at BF16 (single A100 80GB); ~22.4GB at Q4_K_XL GGUF (fits single RTX 4090/3090) [2]; ~6GB VRAM + system RAM for INT4 with CPU offload

**Qwen3.6-Plus** (API-only, released April 2026)

- Intelligence Index: 50 (Artificial Analysis)
- Pricing: $1.30/$7.80 per million input/output tokens
- Preview variant previously scored 51.8 on the Intelligence Index

### Gemma 4 Family (Google DeepMind)

**Gemma 4-31B IT** (released April 2, 2026)

- **Type**: Dense, multimodal (text + image input → text output)
- **Parameters**: 30.7B total (+ ~550M vision encoder), all active
- **Architecture**: 60 layers, dense transformer with hybrid sliding window attention (1024-token windows) interleaved with full global attention; final layer always global. Uses unified K/V projections and Proportional RoPE (p-RoPE). SiGLU activation functions.
- **Context window**: 256K tokens
- **Vocabulary**: 262K tokens
- **License**: Apache 2.0
- **Multimodal**: Variable-resolution image input (token budgets: 70, 140, 280, 560, 1120), native function calling, built-in thinking mode (`<|think|>`)
- **Languages**: 140+ languages trained
- **Hardware requirements**: ~62GB VRAM at BF16 (single A6000 48GB requires quantization; M-series Mac with 64GB+ unified memory) [3]; ~20GB at INT4 GGUF (fits RTX 4090/3090)

**Gemma 4-26B-A4B IT** (MoE variant, released April 2, 2026)

- **Parameters**: 25.2B total, 3.8B active per token (8 out of 128 experts + 1 shared)
- **Architecture**: Same sliding window + global attention as 31B, 30 layers
- **Context window**: 256K tokens
- **License**: Apache 2.0

**Gemma 4-E4B and E2B** (edge variants, released April 2, 2026)

- E4B: ~8B total, ~4.5B active
- E2B: ~5.1B total, ~2.3B active
- Both support text + image input, and both also accept audio input

### DeepSeek V4 Family (DeepSeek)

**DeepSeek V4-Pro** (released April 24, 2026)

- **Type**: MoE, text-only (no vision/audio)
- **Parameters**: 1.6T total, 49B active per forward pass
- **Architecture**: Hybrid CSA (Compressed Sparse Attention) + HCA (Heavily Compressed Attention) + mHC (Manifold-Constrained Hyper-Connections). Replaces the MLA (Multi-head Latent Attention) from V3. Uses Muon optimizer instead of AdamW variants. The model uses a sparse MoE with an unspecified number of routed experts (early sources cite ~384, but this is not confirmed in the official technical report [28]).
- **Context window**: 1M tokens (Think Max mode requires minimum 384K tokens)
- **Training data**: 32+ trillion tokens
- **License**: MIT
- **Precision**: FP4 + FP8 mixed (MoE experts use FP4; other params use FP8)
- **Hardware requirements**: ~400GB+ VRAM for FP4 inference (typically 8× H100 80GB or equivalent); Reuters reported that V4 uses Huawei Ascend chips for inference [33]; multi-GPU cluster required for self-hosting
- **Notable**: Early speculation emphasized “Engram conditional memory” as a key feature, but independent analysis confirmed that Engram was absent from the final V4 release [28]. The model instead relies on CSA/HCA compressed attention for long-context efficiency. At 1M tokens, V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with V3.2 [27].

**DeepSeek V4-Flash** (released April 24, 2026)

- **Parameters**: 284B total, 13B active per forward pass
- **Architecture**: Same MoE + CSA/HCA architecture as Pro
- **Context window**: 1M tokens
- **Training data**: 32 trillion tokens
- **License**: MIT
- **Pricing**: $0.14/$0.28 per million input/output tokens (significantly cheaper than Pro)

### Competing Open-Weight Models (April–May 2026)

**GLM-5.1** (Z.ai)

- MoE: 744B total / 40B active, 200K context, MIT license
- SWE-bench Pro: 58.4% (leads open models)

**Kimi K2.6** (Moonshot AI)

- MoE: 1T total / 32B active, 256K context, Modified MIT license
- Artificial Analysis Intelligence Index: 58 (rank #1 open model)
- SWE-bench Verified: 65.8%

**Llama 4 Scout** (Meta)

- MoE: 109B total / 17B active (16 experts), 10M context window (industry-leading)
- Llama Community License (700M MAU cap)

**DeepSeek V3** (Dec 26, 2024), relevant baseline for comparison

- MoE: 671B total / 37B active
- Architecture: Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing
- Context: 128K tokens
- License: MIT
- Benchmarks: MMLU=88.5, MMLU-Pro=75.9, LiveCodeBench=37.6 (COT=40.5), SWE Verified=42.0, AIME 2024=39.2, GPQA Diamond=59.1

Detailed Analysis

Architecture Comparison: Three Philosophies of Scale-Efficiency

The three flagship models represent fundamentally different approaches to the scale-efficiency trade-off:

Qwen3.6’s Hybrid DeltaNet Philosophy: Qwen3.6 (both 27B and 35B-A3B) uses a hybrid attention architecture that interleaves Gated DeltaNet (a linear-attention / state-space model variant) with traditional gated self-attention. The pattern is 3 × DeltaNet → 1 × Attention, repeated across blocks. For Qwen3.6-27B, 48 of 64 layers use Gated DeltaNet, so three-quarters of the model benefits from linear attention’s efficiency. For the 35B-A3B MoE, the same pattern applies but with MoE feed-forward layers after each DeltaNet and Attention block. This design breaks the quadratic complexity of standard attention for long contexts (DeltaNet scales linearly), while preserving the information retrieval capabilities of full attention at periodic intervals. The Multi-Token Prediction (MTP) heads enable speculative decoding for inference speedup.
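The repeating pattern can be expanded into a flat layer list to check the layer counts quoted for both models. This is only an illustrative sketch: `hybrid_layout` is a hypothetical helper, and the layer labels are plain strings rather than real modules.

```python
def hybrid_layout(blocks: int, deltanet_per_block: int = 3,
                  attention_per_block: int = 1) -> list:
    """Expand `blocks x (3 x DeltaNet -> 1 x Attention)` into a flat list."""
    one_block = (["deltanet"] * deltanet_per_block
                 + ["attention"] * attention_per_block)
    return one_block * blocks


layers_27b = hybrid_layout(blocks=16)  # Qwen3.6-27B: 16 repeating blocks
layers_35b = hybrid_layout(blocks=10)  # Qwen3.6-35B-A3B: 10 repeating blocks

deltanet_share = layers_27b.count("deltanet") / len(layers_27b)

print(len(layers_27b), layers_27b.count("deltanet"), deltanet_share)
print(len(layers_35b), layers_35b.count("attention"))
```

Every fourth layer is full attention, so the 27B model has 16 attention layers and the 35B-A3B model has 10.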

Gemma 4’s Dense Transformer with Sliding Windows: Gemma 4-31B uses a pure dense transformer architecture with hybrid sliding window attention (1024-token windows) interleaved with full global attention. This is a more conventional but refined approach. The final layer is always global attention, ensuring the model has access to full-context information at the output stage. Gemma 4 also introduces Proportional RoPE (p-RoPE), a positional encoding scheme where full attention layers apply RoPE to only 25% of dimensions (high-frequency positional channels), leaving 75% as pure semantic channels that never carry positional information [34,42]. Gemma 4’s 26B-A4B MoE variant uses 128 experts (8 active + 1 shared per token), so it activates a larger fraction of its routed experts (6.25%) than Qwen3.6-35B-A3B (256 experts, 8 routed + 1 shared active, about 3.1%) and is the less sparse of the two.
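To make the local/global split concrete, the sketch below counts how many tokens a single causal query can attend to in each layer type. It assumes the window includes the current token, which is a convention chosen for illustration; `visible_tokens` is a hypothetical helper, not a Gemma API.

```python
from typing import Optional

WINDOW = 1024  # sliding-window size quoted above


def visible_tokens(position: int, window: Optional[int]) -> int:
    """Count the tokens a causal query at 0-indexed `position` can attend to.

    window=None models a global-attention layer. Otherwise the query sees
    itself plus at most window - 1 preceding tokens (assumed convention).
    """
    seen = position + 1
    return seen if window is None else min(seen, window)


position = 200_000
local = visible_tokens(position, WINDOW)
global_ = visible_tokens(position, None)

print(local, global_)
```

Deep into a long context, a sliding-window layer does constant work per token while a global layer's work grows with position, which is why the two are interleaved.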

DeepSeek V4’s Massive MoE with Compressed Attention: DeepSeek V4-Pro is the largest open model in this comparison at 1.6T parameters. Its architecture combines three innovations: (1) a sparse MoE with an unspecified number of routed experts and 1 shared expert (6 active per token — the exact expert count is not specified in the official technical report [28]), (2) Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for long-context efficiency, and (3) Manifold-Constrained Hyper-Connections replacing standard skip connections. CSA compresses every 4 tokens into one summary token via softmax-gated pooling with learned positional bias, then uses an FP4 Lightning Indexer for top-k selection; HCA applies 128x compression with dense attention on compressed blocks [27]. The model was trained with the Muon optimizer on 32+ trillion tokens. Notably, despite early speculation about “Engram conditional memory,” the final V4 release confirmed that Engram is absent [28].
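The effect of the two compression ratios on sequence length can be estimated with simple arithmetic. The sketch below takes "1M tokens" to mean 1,000,000 and counts summary entries only; it does not model CSA's top-k selection, the pooling itself, or any implementation detail of the real kernels. `compressed_entries` is a hypothetical helper.

```python
import math


def compressed_entries(context_tokens: int, compression: int) -> int:
    """Summary entries left after pooling every `compression` tokens into one."""
    return math.ceil(context_tokens / compression)


CONTEXT = 1_000_000  # assumed reading of "1M tokens"

csa_entries = compressed_entries(CONTEXT, 4)    # CSA: 4 tokens -> 1 summary
hca_entries = compressed_entries(CONTEXT, 128)  # HCA: 128x compression

print(csa_entries, hca_entries)
```

Under these assumptions a CSA layer indexes 250,000 summaries before top-k selection, while an HCA layer attends densely over fewer than 8,000 compressed blocks.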

Quantitative Benchmark Comparison

The following table synthesizes benchmark scores across models, sourced from official HuggingFace model cards and Artificial Analysis:

| Benchmark | Qwen3.6-27B | Qwen3.6-35B-A3B | Gemma 4-31B | Gemma 4-26B-A4B | DeepSeek V4-Pro |
|---|---|---|---|---|---|
| SWE-bench Verified | 77.2% | 73.4% | — | — | 80.6% |
| SWE-bench Pro | 53.5% | 49.5% | — | — | 55.4% |
| LiveCodeBench v6 | 83.9% | 80.4% | 80.0% | 77.1% | 93.5% |
| AIME 2026 (no tools) | 94.1% | 92.7% | 89.2% | 88.3% | — |
| GPQA Diamond | 87.8% | 86.0% | 84.3% | 82.3% | 90.1% |
| MMLU-Pro (instruct) | 86.2% | 85.2% | 85.2% | 82.6% | 87.5% |
| Terminal-Bench 2.0 | 59.3 | 51.5 | — | — | 67.9 |
| MMMU Pro | 75.8% | 75.3% | 76.9% | 73.8% | — |
| HLE (no tools) | 24.0 | 21.4 | 19.5 | 8.7 | 37.7 |
| Codeforces ELO | — | — | 2150 | 1718 | — |
| MCPAtlas Public | — | — | — | — | 73.6 |
| Toolathlon | — | — | — | — | 51.8 |

Key observations from the benchmark data:

- Coding/agentic tasks (SWE-bench Verified): DeepSeek V4-Pro leads at 80.6%, followed closely by Qwen3.6-27B at 77.2% and Qwen3.6-35B-A3B at 73.4%. The 27B dense model’s 77.2% is remarkable: it outperforms the Qwen3.5-397B MoE flagship on this benchmark [1]. This suggests that architectural design (hybrid DeltaNet + Attention) can compensate for parameter count in the agentic coding domain.
- Competitive programming (LiveCodeBench v6): DeepSeek V4-Pro leads at 93.5%, but Qwen3.6-27B (83.9%) and Gemma 4-31B (80.0%) are competitive. The gap between V4-Pro and the ~30B models is ~10 points, significant but not overwhelming for a model with roughly 1.8× the active parameters (49B vs. 27B) and about 59× the total parameters.
- Mathematical reasoning (AIME 2026): Qwen3.6-27B leads at 94.1%, narrowly ahead of Qwen3.6-35B-A3B (92.7%), Gemma 4-31B (89.2%), and Gemma 4-26B-A4B (88.3%). This is notable: models in the 27–31B range achieve near-frontier mathematical reasoning.
- General knowledge (MMLU-Pro): DeepSeek V4-Pro (instruct, Think Max) leads at 87.5%, closely followed by Qwen3.6-27B at 86.2% and Gemma 4-31B / Qwen3.6-35B-A3B (both 85.2%). The V4-Pro base model scores 73.5%, so the 14-point jump to the instruct variant shows how much the instruction-tuning pipeline boosts performance on knowledge benchmarks. The instruct models cluster in the mid-to-high 80s, indicating convergence on MMLU-Pro at the frontier.
- Hard reasoning (HLE): DeepSeek V4-Pro leads at 37.7 (instruct, Think Max), followed by Qwen3.6-27B (24.0), Qwen3.6-35B-A3B (21.4), and Gemma 4-31B (19.5). The Gemma 4-26B-A4B MoE variant scores only 8.7, a dramatic gap that its routing configuration (8 of 128 routed experts active) does not obviously explain, since Qwen3.6-35B-A3B is sparser (8 of 256) yet scores 21.4.
- Multimodal capabilities: Gemma 4-31B leads MMMU Pro at 76.9%, very close to Qwen3.6-27B (75.8%) and Qwen3.6-35B-A3B (75.3%). All three support text+image input. DeepSeek V4 is text-only.
- Terminal-based agent tasks (Terminal-Bench 2.0): DeepSeek V4-Pro leads at 67.9, followed by Qwen3.6-27B (59.3) and Qwen3.6-35B-A3B (51.5). The gap is substantial, 8.6 points for Qwen3.6-27B and 16.4 for Qwen3.6-35B-A3B, suggesting that parameter count still matters for complex terminal-based agent tasks.
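Leaders and margins can be read directly off the benchmark table. The sketch below copies a subset of its rows (with `None` for unreported scores) and ranks the models that report a result. `leader_and_margin` is a hypothetical helper, and the margins say nothing about whether scores were collected under comparable protocols.

```python
MODELS = ["Qwen3.6-27B", "Qwen3.6-35B-A3B", "Gemma 4-31B",
          "Gemma 4-26B-A4B", "DeepSeek V4-Pro"]

# Scores copied from the table above; None marks an unreported score.
SCORES = {
    "SWE-bench Verified": [77.2, 73.4, None, None, 80.6],
    "LiveCodeBench v6": [83.9, 80.4, 80.0, 77.1, 93.5],
    "AIME 2026": [94.1, 92.7, 89.2, 88.3, None],
    "MMLU-Pro": [86.2, 85.2, 85.2, 82.6, 87.5],
    "Terminal-Bench 2.0": [59.3, 51.5, None, None, 67.9],
    "HLE (no tools)": [24.0, 21.4, 19.5, 8.7, 37.7],
}


def leader_and_margin(scores):
    """Return (leader, runner_up, margin) among models that report a score."""
    ranked = sorted(
        ((s, m) for s, m in zip(scores, MODELS) if s is not None),
        reverse=True,
    )
    (top, top_model), (second, second_model) = ranked[0], ranked[1]
    return top_model, second_model, round(top - second, 1)


for bench, scores in SCORES.items():
    print(bench, leader_and_margin(scores))
```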

Benchmark Methodology: Comparing Apples to Oranges

A critical gap across all three model families is that benchmark scores were collected under different evaluation protocols. Direct comparisons between models must account for these methodological differences:

- Agent scaffolding: Qwen3.6-27B and Qwen3.6-35B-A3B report SWE-bench scores using an “internal agent scaffold” with bash + file-edit tools [1]. The model is therefore not evaluated in a zero-shot pass@1 setting: it has access to tool-use infrastructure that the standard SWE-bench benchmark does not provide. Models evaluated without scaffolding consistently score lower. DeepSeek V4-Pro’s methodology is not explicitly stated in its model card, but its “Think Max” mode and tool-call capabilities suggest similar scaffolding.
- Inference temperature and sampling: Qwen3.6-27B uses `temp=1.0, top_p=0.95` for SWE-bench evaluation [1]. This is a relatively high temperature setting, which may inflate performance on creative/agentic tasks but lower it on deterministic ones. DeepSeek V4-Pro’s recommended sampling is `temperature=1.0, top_p=1.0`. Neither model card explicitly states pass@k; standard practice suggests pass@1, but this cannot be confirmed without independent verification.
- Think mode vs. non-think: DeepSeek V4-Pro’s scores labeled “Think Max” require a minimum 384K token context window and use extended reasoning before responding [44]. This is a different evaluation setup from standard inference, and models evaluated in non-thinking mode would score lower on reasoning benchmarks. Qwen3.6-27B also supports hybrid thinking/non-thinking modes, but its benchmark scores do not specify which mode was used for SWE-bench.
- Context window differences: Qwen3.6 uses 200K context for SWE-bench; V4-Pro recommends 384K+ for Think Max mode. Longer context can affect performance on tasks that require information from distant parts of a codebase.
- Model variant distinction: The DeepSeek V4-Pro MMLU-Pro score of 73.5% cited in early reports is the base model score; the instruct (Think Max) variant reaches 87.5%. Since Qwen and Gemma models are inherently instruction-tuned, comparing base-model scores to instruct-model scores would be misleading. All subsequent comparisons use instruct-variant scores unless otherwise noted.
- Long-context retrieval: At extreme lengths, attention quality degrades (the “needle-in-a-haystack” problem). V4-Pro’s MRCR v2 needle-retrieval accuracy is 66.4% at 128K and drops to ~59% at 1M tokens [44], suggesting long-context efficiency comes with measurable retrieval quality trade-offs.
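The sampling settings above matter because they change which tokens can be drawn at all. The following is a generic, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering; it is not the evaluation harness used by either vendor, and `sample_top_p` is a hypothetical function name.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.95, rng=random):
    """Sample a token index with temperature scaling, then top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the nucleus; random.choices renormalizes the weights.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


token = sample_top_p([2.0, 1.0, 0.1], temperature=1.0, top_p=0.95,
                     rng=random.Random(0))
print(token)
```

With `top_p=1.0` the nucleus is the whole vocabulary, so low-probability tokens stay reachable; with `top_p=0.95` the tail is cut before sampling. Comparing scores produced under the two settings is therefore not like-for-like.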

Normalizing Benchmark Scores: Estimated Methodological Adjustments

While direct head-to-head comparisons under identical conditions have not been independently published, we can estimate the direction and magnitude of each methodological factor based on published research and community benchmarks:

| Factor | Estimated Impact Direction | Estimated Score Shift | Evidence |
|---|---|---|---|
| Agent scaffolding (bash + file-edit tools) | Inflates SWE-bench scores | +5–15% over zero-shot pass@1 | Standard SWE-bench without scaffolding consistently yields lower results; models with tool access show ~10pt gains on average [21] |
| High temperature (temp=1.0 vs temp=0.7) | Inflates creative/agentic tasks, deflates deterministic ones | ±3–5% depending on task type | SWE-bench benefits from higher temperature (more exploration); AIME/math suffers ~2–4pts at high temp [1] |
| Think Max mode (V4-Pro extended reasoning) | Inflates reasoning benchmarks | +8–15% on math/coding over non-think | DeepSeek’s own data shows Pro-Non-Think MMLU-Pro at 82.9% vs. Think Max at 87.5%, a 4.6pt gap [44] |
| Context window length (200K vs 384K vs 1M) | Neutral to slight benefit on SWE-bench (repository-scale tasks) | +1–3% for longer context | Needle-in-haystack accuracy at 1M tokens drops to ~59% [44], partially offsetting context benefits |
| Base vs. Instruct variant (V4-Pro) | Base models score significantly lower on knowledge benchmarks | −10–15% on MMLU-Pro for base relative to instruct | V4-Pro base MMLU-Pro: 73.5% vs. instruct 87.5%, a 14pt gap [44] |

Adjusted interpretation: If we conservatively estimate that the Qwen3.6-27B SWE-bench score of 77.2% includes ~8–10 points from agent scaffolding and ~2 points from high temperature, its “pure” pass@1 score would be approximately 65–68%. The Qwen3.5-397B MoE’s reported ~76.2% also used scaffolding, so a similar adjustment would presumably apply to it and the 27B model’s roughly one-point lead would be preserved. DeepSeek V4-Pro’s 80.6% (which used Think Max mode) would likewise shrink under normalization. The key insight remains: architectural innovation (hybrid DeltaNet + Attention) enables a 27B dense model to approach trillion-parameter MoE performance, even after methodological normalization.
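The adjustment for Qwen3.6-27B is plain interval arithmetic. The sketch below subtracts the estimated inflation ranges from the reported score; the point ranges are this section's directional estimates, not measured values, and `adjusted_range` is a hypothetical helper.

```python
def adjusted_range(reported, scaffold_pts, temperature_pts):
    """Subtract estimated inflation ranges (in points) from a reported score.

    Each *_pts argument is a (low, high) tuple. Returns the (lowest, highest)
    adjusted score implied by those ranges.
    """
    low = reported - scaffold_pts[1] - temperature_pts[1]
    high = reported - scaffold_pts[0] - temperature_pts[0]
    return round(low, 1), round(high, 1)


# Estimates from the text: 8-10 points scaffolding, about 2 points temperature.
qwen_27b = adjusted_range(77.2, scaffold_pts=(8, 10), temperature_pts=(2, 2))
print(qwen_27b)
```

The result, 65.2 to 67.2, matches the "approximately 65–68%" range, and it is only as reliable as the assumed inflation estimates.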

Confidence assessment: These adjustments are directional estimates based on available evidence, not precise corrections. Independent head-to-head evaluations under standardized conditions (recommended pass@1, identical temperature, no scaffolding for SWE-bench) would provide more reliable comparisons. The current state of published benchmarks makes exact normalization impossible.

Trade-offs: Hardware, Cost, and Deployment

| Model | VRAM (BF16) | INT4 VRAM | Self-Host Feasibility | API Input ($/MTok) | Best For |
|---|---|---|---|---|---|
| Qwen3.6-27B | ~54GB | ~18–20GB | Single RTX 4090 / M-series Mac | $0.29 (OpenRouter) | Consumer hardware coding, local agents |
| Qwen3.6-35B-A3B | ~70GB | ~6GB VRAM + RAM | Single GPU with RAM offload | $0.37–$0.56 | Low-VRAM MoE deployment |
| Gemma 4-31B | ~62GB | ~20GB | Single A6000 / M-series Mac (64GB+) | free–$0.17 | Multimodal tasks, edge-to-cloud |
| Gemma 4-26B-A4B | ~50GB | ~16GB | Single consumer GPU | $0.14–$0.16 | Efficient MoE inference |
| DeepSeek V4-Pro | ~400GB+ | N/A | 8× H100 80GB cluster required | $0.435 (permanent since May 23) | Maximum coding performance, long context |
| DeepSeek V4-Flash | ~128GB+ | N/A | 128GB Apple Silicon or 80GB GPU | $0.14 | Cost-effective API access |

The Qwen3.6-27B advantage: Its dense architecture with Gated DeltaNet makes it well suited to consumer hardware. The linear-attention sublayers cap KV cache growth, because only the 16 Gated Attention layers keep a per-token KV cache. This keeps the 262K native context tractable on top of the ~54GB BF16 weight footprint, something difficult with standard attention at this scale. At ~$0.29/M input tokens via OpenRouter, it is among the most cost-effective frontier-tier models available.
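A rough KV cache estimate shows why limiting full attention to a quarter of the layers matters. The sketch uses the Gated Attention figures from the model profile (4 KV heads, head dimension 256) and assumes a BF16 cache, batch size 1, and that the fixed-size DeltaNet state can be ignored. The 64-layer all-attention variant is a hypothetical comparison point, not a real model.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size: keys and values for every attention layer and token."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
SEQ_LEN = 262_144  # native context window

# 16 of the 64 layers are Gated Attention layers in the hybrid layout.
hybrid = kv_cache_bytes(attn_layers=16, kv_heads=4, head_dim=256,
                        seq_len=SEQ_LEN)
# Hypothetical variant with full attention in all 64 layers.
all_attention = kv_cache_bytes(attn_layers=64, kv_heads=4, head_dim=256,
                               seq_len=SEQ_LEN)

print(hybrid / GIB, all_attention / GIB)
```

Under these assumptions the hybrid layout needs about 16 GiB of KV cache at full context, versus about 64 GiB if every layer used standard attention. That cache comes on top of the weights.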

The DeepSeek V4-Pro trade-off: Its massive parameter count (1.6T) delivers frontier-tier coding performance (93.5% LiveCodeBench, 80.6% SWE-bench Verified) but requires datacenter-scale hardware for self-hosting. At API prices of $0.435/$0.87 per million tokens (a permanent 75% cut made on May 23, 2026 — original launch pricing was $1.74/$3.48), it remains expensive for high-volume use cases but is ~34× cheaper than Claude Opus.
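The pricing claims can be checked with a few lines of arithmetic. The sketch below uses only prices quoted in this article (including the approximate Claude Opus figures of $15/$75), and the 100M-input / 20M-output workload is a hypothetical example.

```python
LAUNCH = {"input": 1.74, "output": 3.48}    # $/M tokens at launch
CURRENT = {"input": 0.435, "output": 0.87}  # $/M tokens after the cut
OPUS = {"input": 15.0, "output": 75.0}      # approximate, as quoted above


def discount(old: float, new: float) -> float:
    """Fractional price reduction from old to new."""
    return 1 - new / old


def job_cost(prices: dict, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost of a workload measured in millions of tokens."""
    return prices["input"] * input_mtok + prices["output"] * output_mtok


cut = discount(LAUNCH["input"], CURRENT["input"])
input_ratio = OPUS["input"] / CURRENT["input"]
output_ratio = OPUS["output"] / CURRENT["output"]

# Hypothetical workload: 100M input tokens, 20M output tokens.
v4_cost = job_cost(CURRENT, 100, 20)
opus_cost = job_cost(OPUS, 100, 20)

print(f"cut: {cut:.0%}, input ratio: {input_ratio:.1f}x, "
      f"output ratio: {output_ratio:.1f}x")
print(f"workload: ${v4_cost:.2f} vs ${opus_cost:.2f}")
```

The ~34× figure corresponds to input-token pricing. On output tokens the ratio is larger, so the blended saving depends on how output-heavy the workload is.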

The Gemma 4 multimodal advantage: As the only model with native vision input across all variants (plus audio on edge models E2B/E4B), Gemma 4 fills a unique niche for applications requiring image understanding alongside text generation. The 31B dense model also offers the best price-to-performance ratio — available free to $0.17/M input tokens.

Licensing and Commercial Viability

| Model | License | Commercial Use | Fine-tuning | Redistribution |
|---|---|---|---|---|
| Qwen3.6 (all variants) | Apache 2.0 | Yes, unrestricted | Yes | Yes |
| Gemma 4 (all variants) | Apache 2.0 | Yes, unrestricted | Yes | Yes |
| DeepSeek V4 (Pro/Flash) | MIT | Yes, unrestricted | Yes | Yes |

The Apache 2.0 and MIT licenses are functionally equivalent for commercial deployment — both allow unrestricted use, modification, redistribution, and fine-tuning without royalty obligations. All three flagship families are fully open-weight for self-hosting.

Ecosystem and Tooling Support

All three flagship models enjoy strong ecosystem support:

- Qwen3.6: First-class support in vLLM, SGLang, llama.cpp, Ollama, LM Studio, HuggingFace Transformers, Unsloth. Multi-Token Prediction heads are supported. Compatible with KTransformers and Docker Model Runner.
- Gemma 4: Native support in vLLM, Transformers, Unsloth, Ollama. Multimodal support via `AutoModelForMultimodalLM`. Android/iOS deployment via AICore Developer Preview.
- DeepSeek V4: SGLang, vLLM, Ollama support. API-compatible with OpenAI and Anthropic formats. Trained on Huawei Ascend hardware but inference frameworks support NVIDIA GPUs.

Fine-Tuning and Adapter Support

Qwen3.6: First-class LoRA/QLoRA support via Unsloth and standard HuggingFace PEFT pipelines. The 27B dense model fits on a single 24GB GPU with QLoRA (4-bit base weights); plain LoRA on the ~54GB BF16 weights does not. The 35B-A3B MoE needs ~70GB VRAM at BF16 just to hold its weights, so full fine-tuning is demanding and QLoRA is the more feasible route.

Gemma 4: Full LoRA/QLoRA support via Unsloth and HuggingFace PEFT. Google’s official documentation explicitly endorses LoRA fine-tuning for domain adaptation. The 31B model requires ~62GB VRAM at BF16 but fits with QLoRA on 24GB GPUs.

DeepSeek V4: LoRA support exists via vLLM and HuggingFace Transformers, but the model’s massive scale (1.6T parameters) makes local fine-tuning impractical for most users. The 284B Flash variant is more feasible with quantization.

Competing Perspectives / Controversies

The “Engram” Question: Hype vs. Reality

One of the most discussed topics around DeepSeek V4 was its purported “Engram conditional memory” architecture — a proposed system that would offload boilerplate syntax and API knowledge to CPU RAM, allowing the GPU to focus on computation. Early speculation (January–March 2026) positioned Engram as a revolutionary breakthrough. However, independent analysis confirmed that Engram conditional memory was absent from the final V4 release [28]. The model instead relies on CSA/HCA compressed attention and Manifold-Constrained Hyper-Connections for long-context efficiency. This represents a case where pre-release speculation outpaced the actual technical reality.

Dense vs. MoE: Is Qwen3.6-27B Proving Dense Models Are Back?

Qwen3.6-27B’s performance has sparked debate about whether dense models are making a comeback in the 27–80B range: it beats the Qwen3.5-397B-A17B MoE on SWE-bench Verified (77.2% vs. ~76.2%) despite having roughly 1/15th the total parameters, although it activates more parameters per token (27B vs. 17B). The prevailing view before Qwen3.6 was that sparse MoE represented the optimal scaling path (as demonstrated by DeepSeek V3/V4, Kimi K2, and GLM-5.1). Qwen3.6-27B challenges this by showing that architectural design (hybrid DeltaNet + Attention) and training quality can compensate for total parameter count. However, this does not invalidate MoE for larger models: DeepSeek V4-Pro’s 1.6T parameters deliver unmatched raw coding benchmarks. The more nuanced view is that dense models win in the 27–80B range on efficiency, while MoE dominates at scales above ~200B total parameters.

Important caveat: These comparisons assume equivalent evaluation methodologies. The Qwen3.6-27B SWE-bench score uses an internal agent scaffold with bash + file-edit tools at temp=1.0, top_p=0.95, 200K context [1]. DeepSeek V4-Pro’s scores are from “Think Max” mode with 384K+ context. Direct head-to-head evaluation under identical conditions has not been independently published.
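One way to make this caveat concrete is to record each run's evaluation settings and list where they diverge. The sketch below is illustrative only: the `EvalProtocol` class and `protocol_differences` helper are hypothetical names, unreported settings are stored as `None`, and 384,000 stands in for the "384K+" context figure.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class EvalProtocol:
    model: str
    mode: str                      # scaffold or reasoning mode used for the run
    temperature: Optional[float]   # None = not stated in the text
    top_p: Optional[float]
    context_tokens: Optional[int]


def protocol_differences(a: EvalProtocol, b: EvalProtocol) -> List[str]:
    """Names of settings that differ, or that are unreported for either run."""
    diffs = []
    for f in fields(EvalProtocol):
        if f.name == "model":
            continue
        va, vb = getattr(a, f.name), getattr(b, f.name)
        if va is None or vb is None or va != vb:
            diffs.append(f.name)
    return diffs


qwen = EvalProtocol("Qwen3.6-27B", "internal agent scaffold", 1.0, 0.95, 200_000)
# Sampling settings for the DeepSeek run are not given in the text.
deepseek = EvalProtocol("DeepSeek V4-Pro", "Think Max", None, None, 384_000)

print(protocol_differences(qwen, deepseek))
```

Every recorded setting differs or is unreported, which is why the two SWE-bench figures are best read as indicative rather than directly comparable.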

Qwen3.6-35B-A3B vs. Gemma 4-26B-A4B: A Direct MoE Head-to-Head

Both are MoE models with similar active parameter counts (3B vs. 3.8B), but they differ in architecture and routing strategy:

- Qwen3.6-35B-A3B: 256 experts, 8 routed + 1 shared active. Achieves 73.4% on SWE-bench Verified, 92.7% on AIME 2026, and 80.4% on LiveCodeBench v6.
- Gemma 4-26B-A4B: 128 experts, 8 active + 1 shared. Achieves 77.1% on LiveCodeBench v6, 88.3% on AIME 2026, and 82.6% on MMLU-Pro.

The Gemma 4-26B-A4B is not officially benchmarked on SWE-bench Verified in its model card, making direct comparison difficult. However, the HLE (no tools) score reveals a dramatic gap: Qwen3.6-35B-A3B scores 21.4 vs. Gemma 4-26B-A4B’s 8.7, a roughly 2.5× difference. Sparsity alone does not obviously explain it: Gemma activates a larger share of its experts (8 of 128 routed = 6.25%, or about 7.0% counting the shared expert) than Qwen does (9 of 256 ≈ 3.5%), so Qwen is the sparser of the two yet scores higher. Differences in routing design and training are more plausible candidates.

Possible explanations:

- MoE routing strategy: Qwen activates 8 routed + 1 shared of 256 experts (about 3.5%), while Gemma activates 8 + 1 shared of 128 (about 7%). The two designs may differ in how stably they route specialized tasks.
- Training data quality: Qwen’s training heavily weights programming and technical content, while Gemma’s training is more broadly distributed across web documents.
- Task-specific optimization: Qwen3.6 models were “specifically designed for agentic coding and repository-scale reasoning” with SWE-bench RL training [1], while Gemma 4 is a more general-purpose model.
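The expert-activation percentages discussed above can be checked with a few lines of arithmetic. This is a minimal sketch: `activation_fraction` is an illustrative helper, and it assumes the quoted expert totals (256 and 128) are the right denominators, since the text does not say whether shared experts are counted inside those totals.

```python
def activation_fraction(active_experts: int, total_experts: int) -> float:
    """Share of experts that process each token."""
    return active_experts / total_experts


# Qwen3.6-35B-A3B: 8 routed + 1 shared active, 256 experts.
qwen_frac = activation_fraction(8 + 1, 256)

# Gemma 4-26B-A4B: 8 active + 1 shared, 128 experts.
gemma_routed_only = activation_fraction(8, 128)
gemma_with_shared = activation_fraction(8 + 1, 128)

print(f"Qwen:  {qwen_frac:.2%}")
print(f"Gemma: {gemma_routed_only:.2%} routed only, {gemma_with_shared:.2%} with shared")
```

A lower fraction means a sparser model, so on these numbers Qwen3.6-35B-A3B is the sparser of the two.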

However, Gemma 4 retains advantages in multimodal capabilities (vision input across all variants), smaller file size, and Google’s transparent data pipeline for compliance-sensitive deployments.

Trust and Transparency in Chinese Open-Weight Models

Some Western developers express reservations about Chinese-origin models (Qwen, DeepSeek), citing concerns about data provenance, geopolitical risk, and potential backdoors. The Apache 2.0 and MIT licenses provide legal transparency, but the training data pipelines of Chinese labs are less transparent than Google’s or Meta’s. This has led some enterprises to prefer Gemma 4 for compliance reasons.

Benchmark Saturation and the “Arms Race” Problem

With MMLU-Pro scores clustering at the top (Qwen3.6-27B: 86.2%, DeepSeek V4-Pro instruct: 87.5%, Gemma 4-31B: 85.2%), it appears that this benchmark is nearing saturation for frontier models — regardless of parameter count. The base model score for V4-Pro (73.5%) was misleading; the instruct variant reaches 87.5%, demonstrating that instruction-tuning is a critical variable in cross-model comparisons. The community’s shift toward SWE-bench, Terminal-Bench, and agentic evaluation metrics reflects an awareness that static knowledge benchmarks are no longer sufficient to discriminate between frontier models.

Safety and Jailbreak Evaluation: A Notable Gap

A critical gap across all three model families is the absence of standardized, publicly reported safety benchmarks. None of the three flagship models — Qwen3.6-27B, Gemma 4-31B/26B-A4B, or DeepSeek V4-Pro — publishes numerical scores on adversarial robustness benchmarks such as AdvGLUE, JailbreakBench, or GCG attack success rates.

Gemma 4 is the only model with any published safety evaluation data. Google’s HuggingFace model card states that Gemma 4 models “significantly outperform Gemma 3 and 3n models in improving safety, while keeping unjustified refusals low” [30]. Testing was conducted by Google’s internal safety/responsible AI teams using both automated evaluations and human reviewers across CSAM, dangerous content, sexually explicit material, hate speech, and harassment categories. Critically, the models were tested without safety filters — minimal policy violations were observed across text-to-text and image-to-text modalities for all model sizes. However, no numerical safety scores are provided — the claims remain qualitative.

Qwen3.6-27B and Qwen3.6-35B-A3B provide no safety evaluation data whatsoever in their model cards [1]. This absence is notable given that Qwen models have been subject to community “abliterated” variants that strip safety filters, suggesting community interest in understanding the baseline safety alignment.

DeepSeek V4 similarly provides no published safety benchmarks. The MIT license permits unrestricted use and modification, which some enterprises view positively for compliance audits but others view as a risk given DeepSeek’s opaque training data provenance. Reuters reported that V4 runs on Huawei Ascend chips for inference, further complicating transparency [33].

Qualitative Safety Assessment Framework

In the absence of standardized numerical safety benchmarks, we propose the following qualitative assessment framework based on available evidence:

| Model | Data Pipeline Transparency | Filter Testing (no-filter eval) | Community Red-Teaming | Enterprise Risk Profile |
|---|---|---|---|---|
| Gemma 4 | High: Google publishes training data methodology and CSAM filtering pipeline [30] | Yes: tested without filters, minimal violations observed [30] | Extensive community testing via HuggingFace discussions | Lowest risk among the three; Western compliance-friendly |
| Qwen3.6 | Medium: Apache 2.0 license, but training data provenance less transparent than Google’s | No published evaluation | Moderate: community “abliterated” variants suggest baseline alignment exists [1] | Medium risk; Chinese data governance frameworks may differ from Western standards |
| DeepSeek V4 | Low: MIT license permits free modification, but training data pipeline opaque | No published evaluation | Limited community safety testing | Medium-high risk; Chinese hardware (Ascend) for inference adds supply chain complexity |

Limitations of this framework: These assessments are based on available public documentation and community signals, not standardized evaluations. The absence of jailbreak benchmark data means that actual adversarial robustness is unknown. Enterprises deploying these models in safety-critical applications should conduct their own red-teaming exercises using established benchmarks (AdvGLUE, JailbreakBench, GCG) before production deployment.

The “Open-Weight vs. Open-Source” Distinction

A persistent tension in the open AI ecosystem: models like Llama 4 are “open-weight” (weights available) but not “open-source” (code and training data are not). True open-source LLMs — where weights, code, AND training data are all public — remain scarce at frontier scales. Qwen3.6 and Gemma 4 (Apache 2.0) offer the most permissive commercial licensing, while DeepSeek V4 (MIT) is similarly permissive but with less transparent training data pipelines.

Quantitative Summary: Complete Comparison Table

Model Specifications at a Glance

| Specification | Qwen3.6-27B | Qwen3.6-35B-A3B | Gemma 4-31B | Gemma 4-26B-A4B | DeepSeek V4-Pro | DeepSeek V4-Flash | GLM-5.1 | Kimi K2.6 | Llama 4 Scout |
|---|---|---|---|---|---|---|---|---|---|
| Type | Dense | MoE (256 experts, 8 routed + 1 shared active) | Dense | MoE (128 experts, 8 active + 1 shared) | MoE (~384 routed + 1 shared, 6 active + 1 shared; expert count unconfirmed) | MoE (~384 routed + 1 shared, 6 active + 1 shared) | MoE (744B total / 40B active) | MoE (1T total / 32B active) | MoE (109B total / 17B active) |
| Total Parameters | 27B | 35B | 30.7B (+550M vision) | 25.2B | 1.6T | 284B | 744B | 1T | 400B (Maverick) / 109B (Scout) |
| Active Parameters | 27B | 3B | 31B | 3.8B | 49B | 13B | 40B | 32B | 17B |
| Efficiency Ratio (total:active) | 1:1 | 12:1 | 1:1 | ~7:1 | ~33:1 | ~22:1 | ~19:1 | ~31:1 | ~6.4:1 (Scout) |
| Layers | 64 | 40 (MoE) | 60 | 30 | 61 | — | — | — | — |
| Attention Type | Hybrid DeltaNet + Gated Attn | Hybrid DeltaNet + MoE Attn | Sliding window (1024) + Global | Same as 31B | CSA + HCA | CSA + HCA | DeepSeek Sparse Attention | — | Alternating dense/MoE |
| Context Window | 262K (1M via YaRN) | 262K (1M via YaRN) | 256K | 256K | 1M | 1M | 200K | 256K | 10M |
| Max Output | 81,920 tokens | 81,920 tokens | 8,192 tokens | — | 384K (API) | 384K (API) | — | 262K | — |
| Modalities | Text + Image → Text | Text + Image → Text | Text + Image → Text | Text + Image → Text | Text only | Text only | Text | Text + Image | Text + Image |
| Languages | 201 | 201 | 140+ | 140+ | — | — | — | — | — |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | MIT | MIT | MIT | Modified MIT | Llama 4 Community License |
| Training Data | — | — | — | — | 32T+ tokens (Muon opt.) | 32T tokens | — | — | — |
| VRAM (BF16) | ~54GB | ~70GB | ~62GB | ~50GB | ~400GB+ | ~128GB+ | ~80GB+ (40B active) | ~64GB+ (32B active) | ~34GB (Scout 17B active) |
| Self-Host | Single RTX 4090 | Single GPU + RAM offload | Single A6000 / M64GB+ | Single consumer GPU | 8× H100 cluster | 128GB Apple Silicon | 1× B200 or 4×A100 | Multi-GPU cluster | Single A100/H100 |
| API Input ($/MTok) | $0.29 | $0.37–$0.56 | free–$0.17 | $0.14–$0.16 | $0.435 (permanent since May 23, 2026) | $0.14 | $0.98 | $0.73–$0.95 | free |
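The efficiency ratio in the table is total parameters divided by active parameters. The following sketch recomputes it from the table's own figures (in billions); the dictionary layout and the `efficiency_ratio` name are illustrative choices, not anything defined by the model providers.

```python
# (total parameters, active parameters) in billions, as listed in the table.
MOE_MODELS = {
    "Qwen3.6-35B-A3B": (35, 3),
    "Gemma 4-26B-A4B": (25.2, 3.8),
    "DeepSeek V4-Pro": (1600, 49),
    "DeepSeek V4-Flash": (284, 13),
    "GLM-5.1": (744, 40),
    "Kimi K2.6": (1000, 32),
    "Llama 4 Scout": (109, 17),
}


def efficiency_ratio(total_b: float, active_b: float) -> float:
    """Total-to-active parameter ratio; 1.0 for a dense model."""
    return total_b / active_b


ratios = {name: efficiency_ratio(t, a) for name, (t, a) in MOE_MODELS.items()}

for name, ratio in ratios.items():
    print(f"{name:<20} ~{ratio:.1f}:1")
```

A higher ratio means a smaller share of the stored weights is exercised per token, which lowers compute per token but not the memory needed to hold the full model.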

Benchmark Scores at a Glance

| Benchmark | Qwen3.6-27B | Qwen3.6-35B-A3B | Gemma 4-31B | Gemma 4-26B-A4B | DeepSeek V4-Pro | GLM-5.1 | Kimi K2.6 | Llama 4 Scout |
|---|---|---|---|---|---|---|---|---|
| SWE-bench Verified | 77.2% | 73.4% | — | — | 80.6% | 77.8% [54] | 80.2% [49] | — |
| SWE-bench Pro | 53.5% | 49.5% | — | — | 55.4% | 58.4% [46] | 58.6% [49] | — |
| LiveCodeBench v6 | 83.9% | 80.4% | 80.0% | 77.1% | 93.5% | — | ~89.6% [48] | — |
| AIME 2026 | 94.1% | 92.7% | 89.2% | 88.3% | — | 95.3% [46] | 96.4% [48] | — |
| GPQA Diamond | 87.8% | 86.0% | 84.3% | 82.3% | 90.1% | 86.2% [46] | 90.5% [48] | — |
| MMLU-Pro (instruct) | 86.2% | 85.2% | 85.2% | 82.6% | 87.5% | — | — | — |
| Terminal-Bench 2.0 | 59.3 | 51.5 | — | — | 67.9 | 56.2% [46] | 66.7% [49] | — |
| MMMU Pro | 75.8% | 75.3% | 76.9% | 73.8% | — | — | — | — |
| HLE (no tools) | 24.0 | 21.4 | 19.5 | 8.7 | 37.7 | — | — | — |
| HLE (with tools) | — | — | — | — | 48.2% [45] | 50.4% [46] | 54.0% [49] | — |
| Codeforces ELO | — | — | 2150 | 1718 | — | — | — | — |
| AAI Intelligence Index | 46/37 | 43/32 | 39/32 | 31/27 | 52 | — | 54 [55] | — |

DeepSeek V3 Baseline (for comparison)

| Benchmark | DeepSeek V3 |
|---|---|
| SWE-bench Verified | 42.0% |
| LiveCodeBench (COT) | 40.5% |
| AIME 2024 | 39.2% |
| GPQA Diamond | 59.1% |
| MMLU-Pro | 75.9% |
| MMLU | 88.5% |

Temporal note: The DeepSeek V3 baseline uses AIME 2024, as reported in its December 2024 technical report, while the Qwen3.6/Gemma 4/DeepSeek V4 benchmarks use AIME 2026 (the current year’s competition). AIME problems change every year, so direct year-to-year comparison is imperfect. However, the V3→V4 improvement trajectory remains dramatic regardless of the specific AIME edition: SWE-bench Verified jumps from 42.0% to 80.6%, LiveCodeBench from 40.5% to 93.5%, and GPQA Diamond from 59.1% to 90.1%. This represents one of the largest generational improvements in LLM history.
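Expressed as percentage-point gains, the V3→V4 jump looks like this. The sketch uses only the scores quoted in this section; note that the two LiveCodeBench figures come from different benchmark versions (COT for V3, v6 for V4), so that delta in particular is indicative rather than exact.

```python
V3_SCORES = {"SWE-bench Verified": 42.0, "LiveCodeBench": 40.5, "GPQA Diamond": 59.1}
V4_PRO_SCORES = {"SWE-bench Verified": 80.6, "LiveCodeBench": 93.5, "GPQA Diamond": 90.1}

# Gain in percentage points (not relative percent) per benchmark.
gains = {name: round(V4_PRO_SCORES[name] - V3_SCORES[name], 1) for name in V3_SCORES}

for name, gain in gains.items():
    print(f"{name:<20} +{gain:.1f} points")
```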

API Pricing Comparison

| Model | Input ($/MTok) | Output ($/MTok) | Input Cost Ratio vs. Opus 4.7 |
|---|---|---|---|
| Qwen3.6-27B | $0.29 | — | ~1/52 |
| Qwen3.6-35B-A3B | $0.37–$0.56 | — | ~1/27 to 1/40 |
| Gemma 4-31B | free–$0.17 | — | ~free to 1/88 |
| Gemma 4-26B-A4B | $0.14–$0.16 | — | ~1/94 to 1/107 |
| DeepSeek V4-Pro | $0.435 (permanent since May 23, 2026) | $0.87 | ~1/35 |
| DeepSeek V4-Flash | $0.14 | $0.28 | ~1/107 |
| Claude Opus 4.7 (ref) | ~$15.00 | ~$75.00 | 1.0x |

Note: DeepSeek V4-Pro’s 75% price cut became permanent on May 23, 2026 (originally a promotional discount through May 31) [31]. Original launch pricing was $1.74/$3.48 per MTok (input/output); current standard pricing is $0.435/$0.87, with cache-hit input priced at $0.003625 per MTok.
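The pricing figures can be reproduced from the launch prices and the 75% discount. This is a small sketch using only numbers quoted in this section; the helper names are illustrative, and the cost ratio is computed on input prices against the approximate $15.00 reference.

```python
LAUNCH_INPUT, LAUNCH_OUTPUT = 1.74, 3.48   # $/MTok at launch
DISCOUNT = 0.75                            # 75% price cut
OPUS_INPUT = 15.00                         # approximate reference input price, $/MTok


def discounted(price: float, discount: float) -> float:
    """Price after applying a fractional discount, rounded to 4 decimals."""
    return round(price * (1 - discount), 4)


def cost_ratio(reference_price: float, price: float) -> float:
    """How many times cheaper `price` is than `reference_price`."""
    return reference_price / price


current_input = discounted(LAUNCH_INPUT, DISCOUNT)
current_output = discounted(LAUNCH_OUTPUT, DISCOUNT)
pro_ratio = cost_ratio(OPUS_INPUT, current_input)
flash_ratio = cost_ratio(OPUS_INPUT, 0.14)   # V4-Flash input price

print(f"V4-Pro now: ${current_input}/${current_output} per MTok")
print(f"V4-Pro input is ~{pro_ratio:.1f}x cheaper than the reference")
print(f"V4-Flash input is ~{flash_ratio:.1f}x cheaper than the reference")
```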

Competitive Landscape: Beyond the Core Three

The April 2026 open-weight landscape extends well beyond Qwen3.6, Gemma 4, and DeepSeek V4. Three additional models deserve attention for their benchmark performance and strategic positioning:

GLM-5.1 (Z.AI, released April 7, 2026)

- Architecture: 744B total / 40B active MoE; uses DeepSeek Sparse Attention (DSA) from V3.2 lineage
- License: MIT
- Key benchmarks: SWE-bench Pro 58.4% (led all open models at release, surpassing GPT-5.4 at 57.7% and Claude Opus 4.6 at 57.3%); AIME 2026 at 95.3%; GPQA Diamond 86.2%; HLE with tools 50.4%
- Context window: 200K tokens
- Strategic position: The first open-source model to top SWE-bench Pro, positioning itself as the “SOTA open coding model.” Requires ~80GB+ VRAM (40B active parameters at BF16). MIT license enables unrestricted commercial use.
- Source: https://huggingface.co/zai-org/GLM-5.1, https://docs.z.ai/guides/llm/glm-5.1

Kimi K2.6 (Moonshot AI, released April 20, 2026)

- Architecture: 1T total / 32B active MoE; features “Agent Swarm” primitive supporting up to 300 parallel sub-agents across 4,000 coordinated steps
- License: Modified MIT
- Key benchmarks: Intelligence Index 54 (highest among open-weight models); SWE-bench Pro 58.6%; HLE with tools 54.0% (leads ALL models, including closed flagships); Terminal-Bench 2.0 at 66.7; SWE-bench Verified 80.2%
- Context window: 256K tokens
- Strategic position: The only open-weight model in the top-4 globally on the Intelligence Index, behind only Anthropic, Google, and OpenAI (all at 57). Its Agent Swarm primitive represents a novel approach to long-horizon agentic work. API pricing: $0.73–$0.95/M input tokens.
- Source: https://lambda.ai/inference-models/moonshotai/kimi-k2.6

Llama 4 Scout (Meta, released April 5, 2025)

- Architecture: 109B total / 17B active MoE (16 experts); first Meta model to use MoE architecture
- License: Llama 4 Community License
- Key differentiator: 10M token context window, the longest among all open models surveyed
- Strategic position: Positioned as a Gemini Flash competitor. The 400B-parameter Maverick variant of Llama 4 outperforms GPT-4o across broad benchmarks. Scout offers the best price-to-context ratio in the industry.
- Source: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

These models, while outside the ~27–35B active-parameter scope of this report’s core analysis, define the broader competitive environment. GLM-5.1 and Kimi K2.6 represent the “trillion-parameter open MoE” end of the spectrum — massive sparse models that push open-weight boundaries at datacenter scale. Llama 4 Scout represents a different approach: smaller active parameters with dramatically extended context.

Risks, Uncertainties, and Open Questions

Unverified Claims and Self-Reported Benchmarks

Several benchmark figures are self-reported by model providers rather than independently verified. Qwen3.6-27B’s 77.2% SWE-bench Verified score uses an “internal agent scaffold” with specific hyperparameters (temp=1.0, top_p=0.95, 200K context) that may not generalize to other evaluation setups [1]. Similarly, DeepSeek V4-Pro’s 80.6% SWE-bench Verified and 93.5% LiveCodeBench scores should be treated as upper bounds until independently replicated. Independent evaluations sometimes report lower numbers for models that perform well on provider-reported benchmarks.

The Long-Context Question

Qwen3.6 and DeepSeek V4 both claim 1M token context windows (Qwen3.6 via YaRN extension beyond its native 262K), but the practical utility of such long contexts is an open question. At extreme lengths, attention quality degrades (the “needle-in-a-haystack” problem), and the models’ ability to attend to information at the beginning or end of very long sequences may be compromised. Qwen3.6’s YaRN extension beyond 262K tokens further compounds this uncertainty — the model was natively trained on 262K, and extrapolation behavior at 1M is unverified.
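To make the size of the extrapolation concrete, the sketch below computes the scaling factor a YaRN-style extension would need. It assumes "262K" means 262,144 tokens and "1M" means 1,048,576 tokens, and it lays the result out with the `rope_scaling` keys used in Hugging Face Transformers configs for earlier Qwen releases; whether Qwen3.6 uses exactly these keys and values is an assumption, not something the model card text quoted here confirms.

```python
NATIVE_CONTEXT = 262_144     # assumed exact value behind "262K"
TARGET_CONTEXT = 1_048_576   # assumed exact value behind "1M"


def yarn_factor(target_tokens: int, native_tokens: int) -> float:
    """Scaling factor needed to stretch the native window to the target."""
    if target_tokens <= native_tokens:
        return 1.0  # no extension needed inside the trained window
    return target_tokens / native_tokens


# Hypothetical config fragment; key names follow the Transformers
# `rope_scaling` convention and are not taken from the Qwen3.6 card.
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor(TARGET_CONTEXT, NATIVE_CONTEXT),
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

print(rope_scaling)
```

Under these assumptions the model is asked to handle positions four times beyond anything seen in training, which is the gap that independent long-context evaluations would need to cover.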

The Engram Absence: What Was Lost?

Early speculation positioned DeepSeek V4’s “Engram conditional memory” as a breakthrough that would decouple knowledge capacity from parameter count. Its absence from the final release means V4 relies entirely on compressed attention for long-context efficiency. This raises the question of whether 1M-token context at reduced KV cache occupancy is genuinely economical or whether it introduces retrieval quality trade-offs that haven’t been fully characterized [28].

Geopolitical and Supply Chain Risks

DeepSeek V4 was trained using the Muon optimizer on 32+ trillion tokens, but the hardware platform (including potential use of Huawei Ascend chips) raises questions about reproducibility for Western researchers who primarily have access to NVIDIA GPUs. Similarly, although Qwen3.6’s weights are released under Apache 2.0, its training data pipeline operates under Chinese data governance frameworks that may differ from Western privacy standards. These factors create deployment risks for enterprises in different jurisdictions.

The Rapid Pace of Obsolescence

April 2026 saw such an unprecedented volume of model releases (Gemma 4, Qwen3.6, DeepSeek V4, Kimi K2.6, GLM-5.1) that any model ranking is already partially outdated by the time this report is published. Qwen3.7-Max was announced on May 20, 2026 with an Intelligence Index of 57, and DeepSeek’s own V5 lineage is likely in development. The “best” model today may be superseded within weeks.

Open Questions

- How does Qwen3.6-27B perform on real-world agentic tasks? Benchmarks like SWE-bench are controlled evaluations; actual agent performance in messy, open-ended workflows (multi-step debugging, cross-file refactoring) may differ significantly.
- What is the true quality of Gemma 4’s MoE routing at scale? The 26B-A4B variant’s poor HLE score (8.7 vs. Qwen3.6-35B-A3B’s 21.4) suggests potential instability in expert selection for specialized reasoning tasks.
- Can DeepSeek V4-Pro be made more accessible? At 1.6T parameters requiring datacenter-scale hardware, it remains out of reach for most organizations. Quantization and distillation research could democratize access.
- How do these models handle adversarial inputs, jailbreaks, and safety-critical scenarios? Safety evaluations are less publicly reported than capability benchmarks.
- What is the long-term sustainability of Apache 2.0/MIT licensing in a regulatory environment that may impose new constraints on AI model distribution?

Implications and Outlook

The “Sweet Spot” Has Moved

The ~27–35B parameter range is now the most competitive zone for open-weight models. Qwen3.6-27B’s ability to beat the Qwen3.5-397B MoE on SWE-bench Verified demonstrates that architectural innovation (hybrid DeltaNet, MTP) can outperform brute-force parameter scaling in this regime. For developers building local agents, coding assistants, and RAG pipelines, the 27–35B range offers the best balance of quality, speed, and hardware accessibility.

The MoE Arms Race Is Far From Over

While Qwen3.6-27B challenges the MoE paradigm at small scales, DeepSeek V4-Pro’s 1.6T parameters represent the other end of the spectrum — massive sparse models that push the boundaries of what open-weight can achieve. The coexistence of these two approaches suggests that neither architecture will dominate entirely: dense models for consumer hardware and edge deployment, MoE for maximum capability at datacenter scale.

Multimodal Is the New Baseline

Gemma 4’s advantage lies in its native multimodality — text + image input across all variants. As developers increasingly need vision-language understanding (document parsing, screenshot analysis, chart interpretation), models without vision capabilities fall behind in practical utility even if they score higher on text-only benchmarks. The Qwen3.6-27B also supports image input, but Gemma 4’s broader modality support (audio on edge models) and Google’s vision research heritage give it an edge for multimodal applications.

Agentic Coding as the Defining Use Case

The shift from static benchmarks (MMLU, GPQA) to dynamic task-completion benchmarks (SWE-bench, Terminal-Bench) reflects a broader trend: the primary value proposition of LLMs is shifting from “knowledge retrieval” to “task execution.” Models optimized for agentic workflows — multi-step reasoning with tool use, file editing, and terminal interaction — are becoming the most commercially relevant. DeepSeek V4 (93.5% LiveCodeBench) and Qwen3.6-27B (77.2% SWE-bench Verified) lead this category among open models in their respective size classes.

Pricing Compression Will Continue

The API pricing for open models has already compressed to 1/10th–1/100th of closed alternatives. As competition intensifies (Qwen3.7-Max, Kimi K2.6, GLM-5.1 all in the same competitive tier), we expect continued downward pressure on prices. DeepSeek V4-Flash at $0.14/$0.28 per million tokens is approaching the cost floor for useful inference. This will make open models economically irresistible for high-volume use cases.

Scenarios for H2 2026

Scenario 1: Consolidation (probable) — The ~30 model releases in Q1 2026 will consolidate to a handful of dominant families. Apache 2.0 models (Qwen, Gemma) will dominate self-hosted deployments; MIT models (DeepSeek, GLM) will compete on API pricing.

Scenario 2: Architectural convergence (likely) — Hybrid attention architectures (DeltaNet + standard attention, compressed attention variants) will become the default design pattern as the community converges on solutions to the long-context problem.

Scenario 3: Edge models go multimodal (possible) — As Gemma 4’s edge variants demonstrate, there is growing demand for on-device multimodal AI. If Qwen and DeepSeek follow with lightweight multimodal variants, the smartphone/laptop AI assistant market will intensify.

Strategic Recommendations

- For individual developers: Qwen3.6-27B (single-GPU coding agent) or Gemma 4-31B (if multimodal input is needed). Both run on consumer hardware with Apache 2.0 licensing.
- For startups building AI agents: DeepSeek V4-Pro via API for maximum coding performance, with V4-Flash as a cost-effective fallback for simpler tasks.
- For enterprises prioritizing compliance and transparency: Gemma 4 (Google’s transparent data pipeline) or Qwen3.6 (Apache 2.0, extensive ecosystem support).
- For researchers: GLM-5.1 (MIT license, SWE-bench Pro 58.4%) or Kimi K2.6 (Modified MIT, Intelligence Index 54) for reproducible research.
- For long-context RAG pipelines: DeepSeek V4-Pro/Flash (1M context window) offers the largest native context among the three core families; only Llama 4 Scout (10M) lists more among the models surveyed.

Conclusion

The April 2026 open-weight model releases represent a watershed moment in AI development. Qwen3.6-27B, Qwen3.6-35B-A3B, Gemma 4 (31B and 26B-A4B), and DeepSeek V4 (Pro and Flash) collectively demonstrate that open models have crossed the frontier on multiple dimensions: coding (SWE-bench, LiveCodeBench), mathematical reasoning (AIME 2026), general knowledge (MMLU-Pro, GPQA Diamond), and multimodal understanding (MMMU Pro).

The most significant finding is that architectural innovation matters more than parameter count in the 27–35B range. Qwen3.6-27B’s dense architecture with hybrid Gated DeltaNet + Attention beats its own 397B MoE predecessor on SWE-bench Verified (77.2% vs. ~76.2%), showing that well-designed architectures can outperform brute-force scaling at moderate sizes. Meanwhile, DeepSeek V4-Pro shows that at the extreme end (1.6T parameters), massive sparse models still deliver unmatched raw capability, but at a cost and level of accessibility that limits their practical use for most organizations.

Gemma 4 fills a unique niche as Google’s multimodal open offering, with vision capabilities across all variants and the strongest edge-model lineup (E2B/E4B). Its dense 31B model achieves the best balance of quality and deployability among Google’s open models. However, its MoE variant (26B-A4B) significantly underperforms Qwen3.6-35B-A3B on agentic reasoning (HLE: 8.7 vs. 21.4), suggesting that Gemma 4’s expertise lies more in multimodal general-purpose intelligence than in agentic coding.

The practical recommendation for most developers: Qwen3.6-27B is the best all-around open model for local deployment — strong coding (77.2% SWE-bench Verified), excellent math (94.1% AIME 2026), multimodal vision support, Apache 2.0 license, and runs on a single RTX 4090. For API-based use where maximum coding performance matters, DeepSeek V4-Pro offers the best open-weight option at $0.435/$0.87 per million tokens (permanent pricing since May 23, 2026).

The trajectory is clear: open models are catching up to or surpassing closed flagships on the metrics that matter most to developers and enterprises, while costing a fraction of the price and offering inspectable, self-hostable weights (though not, in most cases, open training data).

Methodology Note

This research was conducted in May 2026 using systematic web searches across multiple search engines (Bing, DuckDuckGo, Brave, Startpage, Yahoo) with deliberately varied query phrasings to maximize source diversity. Primary sources prioritized: official HuggingFace model cards (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B, google/gemma-4-31B, google/gemma-4-26B-A4B), technical reports (DeepSeek-V3 arXiv:2412.19437), official blog posts (QwenLM, Google DeepMind, DeepSeek API docs), and independent benchmark aggregators (Artificial Analysis leaderboard, BenchLM). Community benchmarks and third-party analyses supplemented official data where available.

All factual claims in this report are sourced from verifiable, publicly available model cards and technical documentation. Benchmark scores are taken directly from HuggingFace model card tables unless otherwise noted. API pricing is sourced from OpenRouter, Artificial Analysis, and DeepSeek’s official API documentation. Where provider-reported benchmarks differ from independent evaluations, the range or discrepancy is noted.

The research window covers model releases from December 26, 2024 (DeepSeek V3) through approximately May 28, 2026. Models released after this window are referenced only for context. All benchmark scores reflect the most recent available data as of May 28, 2026.

References

1. Qwen Team, “Qwen/Qwen3.6-27B” model card, HuggingFace, April 22, 2026. https://huggingface.co/Qwen/Qwen3.6-27B
2. Qwen Team, “Qwen/Qwen3.6-35B-A3B” model card, HuggingFace, April 2026. https://huggingface.co/Qwen/Qwen3.6-35B-A3B
3. Google DeepMind, “google/gemma-4-31B” model card, HuggingFace, April 2, 2026. https://huggingface.co/google/gemma-4-31B
4. Google DeepMind, “google/gemma-4-26B-A4B” model card, HuggingFace, April 2, 2026. https://huggingface.co/google/gemma-4-26B-A4B
5. DeepSeek-AI, “DeepSeek-V3: A Technical Report,” arXiv:2412.19437, December 26, 2024. https://arxiv.org/abs/2412.19437
6. DeepSeek API Docs, “Pricing & Models.” https://api-docs.deepseek.com/quick_start/pricing
7. DeepSeek API Docs, “DeepSeek-V4 Preview Release,” April 24, 2026. https://api-docs.deepseek.com/news/news260424
8. Artificial Analysis, “LLM Leaderboard — Comparison of over 100 AI Models.” https://artificialanalysis.ai/leaderboards/models
9. Artificial Analysis, “DeepSeek V4 Pro (1.6T MoE) — Intelligence, Performance & Price Analysis.” https://artificialanalysis.ai/models/deepseek-v4-pro
10. Artificial Analysis, “Qwen3.6 27B (Non-reasoning) — Intelligence, Performance & Price Analysis.” https://artificialanalysis.ai/models/qwen3-6-27b-non-reasoning
11. Alibaba Cloud, “Alibaba Introduces Qwen3, Setting New Benchmark in Open-Source AI,” April 29, 2025. https://www.alibabacloud.com/blog/alibaba-introduces-qwen3-setting-new-benchmark-in-open-source-ai-with-hybrid-reasoning_602192
12. Reuters, “Alibaba unveils new Qwen3.5 model for ‘agentic AI era’,” February 16, 2026. https://www.reuters.com/world/china/alibaba-unveils-new-qwen35-model-agentic-ai-era-2026-02-16/
13. MarkTechPost, “Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window,” May 21, 2026. https://www.marktechpost.com/2026/05/21/qwen-introduces-qwen3-7-max-a-reasoning-agent-model-with-a-1m-token-context-window/
14. OpenRouter, “Qwen: Qwen3.6 27B — API Pricing & Benchmarks.” https://openrouter.ai/qwen/qwen3.6-27b
15. OpenRouter, “DeepSeek: DeepSeek V4 Pro — API Pricing & Benchmarks.” https://openrouter.ai/deepseek/deepseek-v4-pro
16. WCCFTech, “DeepSeek V4 Squeezes Million-Token Context Into 10% of Single-Token Inference FLOPs.” https://wccftech.com/deepseek-v4-cuts-kv-cache-by-90-at-1m-tokens-but-aggressive-compression-could-risk-needle-in-a-haystack-failures/
17. Lushbinary, “Qwen 3.5 Developer Guide: Benchmarks, Architecture & Integration,” March 4, 2026. https://lushbinary.com/blog/qwen-3-5-developer-guide-benchmarks-architecture-integration-2026/
18. Google AI Blog, “Introducing Gemma 3,” March 12, 2025. https://blog.google/innovation-and-ai/technology/developers-tools/gemma-3/
19. DEV Community, “Gemma 4 Scored 89.2% on AIME. Here’s Why That Number Should Change How You Think About Open Source,” May 2026. https://dev.to/pulkitgovrani/gemma-4-scored-892-on-aime-heres-why-that-number-should-change-how-you-think-about-open-source-5aem
20. Moonshot AI, “Kimi K2 — GitHub.” https://github.com/MoonshotAI/Kimi-K2
21. SWE-bench Leaderboards. https://www.swebench.com/
22. Simon Willison’s Weblog, “Qwen3: Think Deeper, Act Faster,” April 29, 2025. https://simonwillison.net/2025/Apr/29/qwen-3/
23. QwenLM GitHub, “Qwen3 — Qwen team, Alibaba Cloud.” https://github.com/QwenLM/Qwen3
24. QwenLM Blog, “Qwen3: Think Deeper, Act Faster.” https://qwenlm.github.io/blog/qwen3/
25. NVIDIA Developer Blog, “New Open Source Qwen3-Next Models Preview Hybrid MoE Architecture.” https://developer.nvidia.com/blog/new-open-source-qwen3-next-models-preview-hybrid-moe-architecture-delivering-improved-accuracy-and-accelerated-parallel-processing-across-nvidia-platform/
26. SGLang Documentation, “DeepSeek-V4 Cookbook.” https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4
27. HuggingFace Blog, “DeepSeek-V4: a million-token context that agents can actually use,” April 24, 2026. https://huggingface.co/blog/deepseekv4
28. DeepSeek-AI, “DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence” (Technical Report PDF), April 24, 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/resolve/main/DeepSeek_V4.pdf
29. Google DeepMind, “Gemma 4: our most intelligent open models to date,” April 2, 2026. https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
30. Google AI for Developers, “Gemma 4 model card.” https://ai.google.dev/gemma/docs/core/model_card_4
31. The Next Web, “DeepSeek makes its 75 percent discount permanent,” May 2026. https://thenextweb.com/news/deepseek-v4-pro-75-percent-price-cut-permanent
32. The Decoder, “Deepseek makes its 75 percent discount permanent, pricing output tokens at least 34x below GPT-5.5,” May 2026. https://the-decoder.com/deepseek-makes-its-75-percent-discount-permanent-pricing-output-tokens-at-least-34x-below-gpt-5-5/
33. Reuters, “DeepSeek’s V4 model will run on Huawei chips,” April 3, 2026. https://www.reuters.com/world/china/deepseeks-v4-model-will-run-huawei-chips-information-reports-2026-04-03/
34. NVIDIA NIM, “gemma-4-31B model card.” https://build.nvidia.com/google/gemma-4-31b-it/modelcard
35. WCCFTech, “DeepSeek V4 Squeezes Million-Token Context Into 10% of Single-Token Inference FLOPs.” https://wccftech.com/deepseek-v4-cuts-kv-cache-by-90-at-1m-tokens-but-aggressive-compression-could-risk-needle-in-a-haystack-failures/
36. Codersera, “DeepSeek V4 Guide: Pro & Flash + R2/V5 Status (May 2026).” https://codersera.com/blog/deepseek-v4-complete-guide-2026/
37. MarkTechPost, “Alibaba Qwen Team Releases Qwen3.6-27B: A Dense Open-Weight Model Outperforming 397B MoE on Agentic Coding Benchmarks,” April 22, 2026. https://www.marktechpost.com/2026/04/22/alibaba-qwen-team-releases-qwen3-6-27b-a-dense-open-weight-model-outperforming-397b-moe-on-agentic-coding-benchmarks/
38. Towards AI, “I Tested the 27B Open-Source Model That Crushed a 397B MoE on Coding,” April 27, 2026. https://pub.towardsai.net/i-tested-the-27b-open-source-model-that-crushed-a-397b-moe-on-coding-it-fits-on-one-24gb-gpu-c2d81837121c
39. Unsloth, “Qwen3.6 - How to Run Locally.” https://unsloth.ai/docs/models/qwen3.6
40. Hacker News, “Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model” discussion. https://news.ycombinator.com/item?id=47863217
41. Lushbinary, “Gemma 4 Developer Guide: Benchmarks, Architecture & Local Deployment,” April 2026. https://lushbinary.com/blog/gemma-4-developer-guide-benchmarks-architecture-local-deployment-2026/
42. Qwen.ai blog, “Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model,” April 21, 2026. https://qwen.ai/blog?id=qwen3.6-27b
43. BuildFastWithAI, “Qwen3-6-27B Review: 27B Model Beats 397B on Coding,” April 23, 2026. https://www.buildfastwithai.com/blogs/qwen3-6-27b-review-2026
44. HuggingFace Blog, “DeepSeek-V4: a million-token context that agents can actually use,” April 24, 2026. https://huggingface.co/blog/deepseekv4
45. DeepSeek-AI, “DeepSeek-V4-Pro” model card, HuggingFace. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
46. Z.AI, “GLM-5.1 — Overview.” https://docs.z.ai/guides/llm/glm-5.1
47. Lambda AI, “GLM-5.1 model info.” https://lambda.ai/inference-models/zai-org/glm-5.1
48. Artificial Analysis, “Kimi K2.6 — Intelligence, Performance & Price Analysis.” https://artificialanalysis.ai/models/kimi-k2-6
49. Lambda AI, “Kimi K2.6 model info.” https://lambda.ai/inference-models/moonshotai/kimi-k2.6
50. Meta, “The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation.” https://ai.meta.com/blog/llama-4-multimodal-intelligence/
51. NVIDIA NIM, “Gemma 4 31B IT model card.” https://build.nvidia.com/google/gemma-4-31b-it/modelcard
52. willitrunai.com, “Qwen 3.6 27B VRAM & Hardware Requirements.”

https://willitrunai.com/blog/qwen-3-6-27b-vram-requirements - AlphaSignalAI (X), “RTX 4090 Qwen3.6-35B-A3B VRAM requirements,” April 2026. https://x.com/AlphaSignalAI/status/2045233520608464983 - Codersera, “GLM-5.1: First Open Source Model to Beat Claude Opus on Coding,” April 2026. https://www.modemguides.com/blogs/ai-news/glm-5-1-open-source-benchmarks-local-ai - DeepLearning.AI, “Kimi K2.6: The New Leading Open Weights Model,” April 21, 2026. https://artificialanalysis.ai/articles/kimi-k2-6-the-new-leading-open-weights-model
