{"slug": "we-asked-10-llms-to-write-efficient-code-only-4-got-better", "title": "We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.", "summary": "Vilius Vystartas tested 10 large language models on 10 coding tasks to determine whether prompting them to write efficient code actually improves output. Only four models showed measurable gains, with GPT-5.4 achieving the largest efficiency boost of +0.20, while several models produced worse code when given the efficiency instruction.", "body_md": "*By Vilius Vystartas | May 2026*\n\nEvery LLM can write code that works. The question is: can they write code that's *efficient*, and does telling them to be efficient actually help?\n\nI tested 10 models on 10 coding tasks, each in two phases: **unprompted** (the model writes its own code) and **prompted** (explicitly told to write clean, DRY, efficient code). That's 200 API calls, $0.56 total. The results are... not what most prompt engineers would predict.\n\nGPT-5.4 got the largest boost from prompting (+0.20), with Qwen 3.6 Plus close behind (+0.17). For most models, the \"write efficient code\" prompt was meaningless or actively harmful.\n\nEach task has a known **optimal token budget**: the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The **efficiency score** is `optimal_tokens / actual_tokens`, capped at 1.0.\n\nA score of 0.63 means the model used about 1.6x the optimal, which is not bad. A score of 0.43 means it used about 2.3x the optimal. 
The gap between unprompted and prompted tells you whether the \"write efficient code\" instruction actually changes behaviour.\n\n| # | Model | Unprompted | Prompted | Δ | Frugal | Cost | Correctness |\n|---|---|---|---|---|---|---|---|\n| 🥇 | GPT-5.4 | 0.43 | 0.63 | +0.20 | 30% | $0.096 | 78% → 85% |\n| 🥈 | Qwen 3.6 Plus | 0.44 | 0.60 | +0.17 | 40% | $0.158 | 78% → 87% |\n| 🥉 | Gemma 4 31B | 0.54 | 0.58 | +0.04 | 50% | $0.003 | 92% both |\n| 4 | DeepSeek Chat | 0.51 | 0.55 | +0.04 | 30% | $0.006 | 91% → 80% |\n| 5 | Claude Sonnet 4 | 0.47 | 0.52 | +0.04 | 40% | $0.121 | 92% both |\n| 6 | LFM 2 24B A2B | 0.54 | 0.47 | -0.06 | 30% | $0.001 | 90% → 80% |\n| 7 | Mistral Large 2411 | 0.54 | 0.46 | -0.08 | 40% | $0.050 | 90% → 82% |\n| 8 | Gemini 2.5 Flash | 0.47 | 0.46 | -0.01 | 50% | $0.020 | 92% → 90% |\n| 9 | Cohere Command A | 0.60 | 0.44 | -0.17 | 40% | $0.071 | 90% → 82% |\n| 10 | Kimi K2.6 | 0.34 | 0.43 | +0.09 | 30% | $0.029 | 76% → 86% |\n\nGPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were **config-generation** (+0.81, going from 12 inline JSON blocks to a template loop), **html-from-data** (+0.71), and **magic-strings** (+0.38, after switching to an Enum). It's the only model in the batch where the \"write efficient code\" instruction consistently produces different (and better) output.\n\nThe cost is mid-range: $0.10 for 20 tasks is neither cheap nor expensive. But the efficiency gain is real.\n\nHalf of Gemma 4's tasks were already \"frugal\": naturally efficient without being told. It scored 92% correctness on both phases at just $0.003 total. That's roughly a 32x cost advantage over GPT-5.4 ($0.096 vs $0.003 in the table), with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.\n\nCohere Command A had the **highest unprompted efficiency** in the batch (0.60): it naturally writes concise code. 
But when told \"write efficient code,\" it ballooned output on several tasks. **html-from-data** went from a tight 45-token solution to a 600+-token monstrosity (-0.92 gap). The prompt made it overthink.\n\nLesson: if a model is already efficient, don't prompt it to be more efficient.\n\nQwen 3.6 Plus scored second in prompted efficiency (+0.17 improvement) but took **26 minutes** for 20 tasks, by far the slowest model. The efficiency gain is real (especially on html-from-data, where it went from hardcoded rows to a map/join pattern), but you're waiting for it. Batch workloads only.\n\nKimi K2.6 had the lowest unprompted efficiency (0.34, with verbose, boilerplate-heavy code) but improved the most at the bottom end (+0.09). It is still in last place, but the prompt actually helped it compress, which is the opposite of the Cohere effect. Some models need the nudge.\n\n\"Frugal\" means the model naturally produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50%: half their tasks were already efficient. GPT-5.4, DeepSeek Chat, and Kimi K2.6 were only 30% frugal, so they needed the prompt to tighten up.\n\n| Group | Models | Behaviour |\n|---|---|---|\n| Prompt-responsive | GPT-5.4, Qwen 3.6 Plus | Efficiency improves substantially with prompting |\n| Prompt-neutral | Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6 | Prompt has little effect (within ±0.04, except Kimi K2.6 at +0.09) |\n| Prompt-antagonistic | LFM 2 24B A2B, Mistral Large 2411, Cohere Command A | Efficiency drops when prompted |\n\nThe prompt-antagonistic group is the most interesting. 
These models know how to write efficient code (0.54 to 0.60 unprompted), but the explicit instruction triggers over-engineering: they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric.\n\nIf the prompt says \"write efficient code\" and the model responds by writing *more* tokens, something in the training signal is misaligned.\n\nThe benchmark used ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash, each with a known optimal token budget for a correct, DRY solution. Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and a middleware decorator pattern (Python/Flask).\n\nEach model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended. Scoring: **efficiency_ratio = optimal_tokens / actual_tokens** (capped at 1.0). Correctness was scored against expected output patterns.\n\nTotal cost: **$0.56** for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. 
Max tokens: 600.\n\nFull results: [benchmarks.workswithagents.dev](https://benchmarks.workswithagents.dev)", "url": "https://wpnews.pro/news/we-asked-10-llms-to-write-efficient-code-only-4-got-better", "canonical_source": "https://dev.to/vystartasv/we-asked-10-llms-to-write-efficient-code-only-4-got-better-47gf", "published_at": "2026-05-26 22:46:48+00:00", "updated_at": "2026-05-26 23:03:27.536573+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "generative-ai", "ai-research", "ai-tools"], "entities": ["GPT-5.4", "Qwen 3.6 Plus", "Gemma 4 31B", "Vilius Vystartas"], "alternates": {"html": "https://wpnews.pro/news/we-asked-10-llms-to-write-efficient-code-only-4-got-better", "markdown": "https://wpnews.pro/news/we-asked-10-llms-to-write-efficient-code-only-4-got-better.md", "text": "https://wpnews.pro/news/we-asked-10-llms-to-write-efficient-code-only-4-got-better.txt", "jsonld": "https://wpnews.pro/news/we-asked-10-llms-to-write-efficient-code-only-4-got-better.jsonld"}}