We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.

Vilius Vystartas tested 10 large language models on 10 coding tasks to determine whether prompting them to write efficient code actually improves output. Only four models showed measurable gains, with GPT-5.4 achieving the largest efficiency boost of +0.20, while several models produced worse code when given the efficiency instruction.

By Vilius Vystartas | May 2026

Every LLM can write code that works. The question is whether they can write code that's efficient, and whether telling them to be efficient actually helps.

I tested 10 models on 10 coding tasks, each in two phases: unprompted (the model writes its own code) and prompted (explicitly told to write clean, DRY, efficient code). That's 200 API calls, $0.56 total.

The results are not what most prompt engineers would predict. GPT-5.4 was the only model where prompting gave a substantial boost (+0.20). For most models, the "write efficient code" prompt was meaningless or actively harmful.

Each task has a known optimal token budget: the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The efficiency score is optimal tokens / actual tokens, capped at 1.0. A score of 0.63 means the model used about 1.6x the optimal, which is not bad. A score of 0.43 means it used about 2.3x the optimal. The gap between unprompted and prompted tells you whether the "write efficient code" instruction actually changes behaviour.
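To make the scoring concrete, here is a minimal sketch of the efficiency ratio as described above. The function names `efficiency_score` and `overhead_factor` are illustrative assumptions, not the benchmark's actual code, and the 340-token figure is treated here as a model's actual output purely for illustration.

```python
def efficiency_score(optimal_tokens: int, actual_tokens: int) -> float:
    """Return optimal / actual, capped at 1.0 (hypothetical helper)."""
    if optimal_tokens <= 0 or actual_tokens <= 0:
        raise ValueError("token counts must be positive")
    return min(1.0, optimal_tokens / actual_tokens)


def overhead_factor(score: float) -> float:
    """How many times the optimal budget a given score implies (1 / score)."""
    return 1.0 / score


# Button example from the text: 70-token class-based CSS
# versus 340 tokens for 10 separate button blocks.
dry_score = efficiency_score(70, 70)
verbose_score = efficiency_score(70, 340)

# Coming in under the budget cannot push the score above 1.0.
capped_score = efficiency_score(70, 50)

print(f"DRY: {dry_score:.2f}, verbose: {verbose_score:.2f}, capped: {capped_score:.2f}")
print(f"0.63 -> {overhead_factor(0.63):.1f}x optimal")
print(f"0.43 -> {overhead_factor(0.43):.1f}x optimal")
```

Inverting the score is what gives the "1.6x" and "2.3x" readings quoted above: a lower score means proportionally more tokens than the optimal budget.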

| Rank | Model | Unprompted | Prompted | Δ | Frugal | Cost | Correctness |
|---|---|---|---|---|---|---|---|
| 🥇 | GPT-5.4 | 0.43 | 0.63 | +0.20 | 30% | $0.096 | 78% → 85% |
| 🥈 | Qwen 3.6 Plus | 0.44 | 0.60 | +0.17 | 40% | $0.158 | 78% → 87% |
| 🥉 | Gemma 4 31B | 0.54 | 0.58 | +0.04 | 50% | $0.003 | 92% both |
| 4 | DeepSeek Chat | 0.51 | 0.55 | +0.04 | 30% | $0.006 | 91% → 80% |
| 5 | Claude Sonnet 4 | 0.47 | 0.52 | +0.04 | 40% | $0.121 | 92% both |
| 6 | LFM 2 24B A2B | 0.54 | 0.47 | -0.06 | 30% | $0.001 | 90% → 80% |
| 7 | Mistral Large 2411 | 0.54 | 0.46 | -0.08 | 40% | $0.050 | 90% → 82% |
| 8 | Gemini 2.5 Flash | 0.47 | 0.46 | -0.01 | 50% | $0.020 | 92% → 90% |
| 9 | Cohere Command A | 0.60 | 0.44 | -0.17 | 40% | $0.071 | 90% → 82% |
| 10 | Kimi K2.6 | 0.34 | 0.43 | +0.09 | 30% | $0.029 | 76% → 86% |

GPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were config-generation (+0.81, went from 12 inline JSON blocks to a template loop), html-from-data (+0.71), and magic-strings (+0.38, switched to an Enum). It's the only model in the batch where the "write efficient code" instruction consistently produces different and better output. The cost is notable: $0.10 for 20 tasks is mid-range, neither cheap nor expensive. But the efficiency gain is real.

Half of Gemma 4's tasks were already "frugal": naturally efficient without being told. It scored 92% correctness on both phases at just $0.003 total. That's a 40x cost advantage over GPT-5.4 with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.

Cohere Command A had the highest unprompted efficiency in the batch (0.60): it naturally writes concise code. But when told "write efficient code," it ballooned output on several tasks. html-from-data went from a tight 45-token solution to a 600+-token monstrosity (a -0.92 gap). The prompt made it overthink.
Lesson: if a model is already efficient, don't prompt it to be more efficient.

Qwen 3.6 Plus scored second in prompted efficiency (+0.17 improvement) but took 26 minutes for 20 tasks, by far the slowest model. The efficiency gain is real (especially on html-from-data, where it went from hardcoded rows to a map/join pattern), but you're waiting for it. Batch workloads only.

Kimi K2.6 had the lowest unprompted efficiency (0.34: verbose, boilerplate-heavy code) but improved the most at the bottom end (+0.09). Still last place, but the prompt actually helped it compress, which is the opposite of the Cohere effect. Some models need the nudge.

"Frugal" means the model naturally produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50%: half their tasks were already efficient. GPT-5.4, DeepSeek Chat, and Kimi K2.6 were only 30% frugal, so they needed the prompt to tighten up.

| Group | Models | Behaviour |
|---|---|---|
| Prompt-responsive | GPT-5.4, Qwen 3.6 Plus | Efficiency improves substantially with prompting |
| Prompt-neutral | Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6 | Prompt has little effect (±0.04) |
| Prompt-antagonistic | LFM 2 24B A2B, Mistral Large 2411, Cohere Command A | Efficiency drops when prompted |

The prompt-antagonistic group is the most interesting. These models know how to write efficient code (0.54-0.60 unprompted), but the explicit instruction triggers over-engineering: they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric. If the prompt says "write efficient code" and the model responds by writing more tokens, something in the training signal is misaligned.

Ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash, each with a known optimal token budget for a correct, DRY solution.
Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and middleware decorator pattern (Python/Flask).

Each model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended.

Scoring: efficiency ratio = optimal tokens / actual tokens (capped at 1.0). Correctness scored against expected output patterns.

Total cost: $0.56 for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. Max tokens: 600.

Full results: https://benchmarks.workswithagents.dev
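The call count and total cost quoted in the methodology can be re-derived from the per-model costs in the results table. The sketch below only does that arithmetic. The variable names are illustrative, and the per-model figures are the rounded values from the table, so the sum is approximate.

```python
# Per-model cost in USD, copied from the (rounded) results table.
COSTS = {
    "GPT-5.4": 0.096,
    "Qwen 3.6 Plus": 0.158,
    "Gemma 4 31B": 0.003,
    "DeepSeek Chat": 0.006,
    "Claude Sonnet 4": 0.121,
    "LFM 2 24B A2B": 0.001,
    "Mistral Large 2411": 0.050,
    "Gemini 2.5 Flash": 0.020,
    "Cohere Command A": 0.071,
    "Kimi K2.6": 0.029,
}

TASKS = 10   # coding tasks per model
PHASES = 2   # unprompted and prompted

api_calls = len(COSTS) * TASKS * PHASES
total_cost = sum(COSTS.values())
avg_cost_per_call = total_cost / api_calls
cheapest = min(COSTS, key=COSTS.get)

print(f"API calls: {api_calls}")
print(f"Total cost: ${total_cost:.3f}")
print(f"Average per call: ${avg_cost_per_call:.5f}")
print(f"Cheapest model: {cheapest}")
```

Summing the rounded table values gives about $0.555, consistent with the reported $0.56 total.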