[ARTICLE · art-14661] src=dev.to pub= topic=large-language-models verified=true sentiment=neutral

We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.

Vilius Vystartas tested 10 large language models on 10 coding tasks to determine whether prompting them to write efficient code actually improves output. Only four models showed measurable gains, with GPT-5.4 achieving the largest efficiency boost of +0.20, while several models produced worse code when given the efficiency instruction.

5 min read · Published May 26, 2026

By Vilius Vystartas | May 2026

Every LLM can write code that works. The question is: can they write code that's efficient, and does telling them to be efficient actually help?

I tested 10 models on 10 coding tasks, each in two phases: unprompted (the model writes its own code) and prompted (explicitly told to write clean, DRY, efficient code). That's 200 API calls, $0.56 total. The results are... not what most prompt engineers would predict.

GPT-5.4 was the only model where prompting gave a substantial boost (+0.20). For most models, the "write efficient code" prompt was meaningless or actively harmful.

Each task has a known optimal token budget: the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The efficiency score is optimal_tokens / actual_tokens, capped at 1.0.

A score of 0.63 means the model used about 1.6x the optimal, which is not bad. A score of 0.43 means it used about 2.3x the optimal. The gap between unprompted and prompted tells you whether the "write efficient code" instruction actually changes behaviour.
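The scoring rule above is simple enough to sketch directly. This is a minimal illustration of the stated formula, not the benchmark's actual code; the function names `efficiency_score` and `overhead_factor` are made up for this example.

```python
def efficiency_score(optimal_tokens: int, actual_tokens: int) -> float:
    """Return optimal_tokens / actual_tokens, capped at 1.0."""
    if optimal_tokens <= 0 or actual_tokens <= 0:
        raise ValueError("token counts must be positive")
    return min(1.0, optimal_tokens / actual_tokens)


def overhead_factor(score: float) -> float:
    """Invert a score to see how many times the optimal budget was used."""
    return 1.0 / score


# The button example from the text: a 70-token optimum vs 340 tokens of repeated blocks.
print(f"{efficiency_score(70, 340):.2f}")   # 0.21
# Beating the budget does not earn extra credit because of the cap.
print(efficiency_score(70, 50))             # 1.0
print(f"{overhead_factor(0.63):.1f}x")      # 1.6x
print(f"{overhead_factor(0.43):.1f}x")      # 2.3x
```

Note that inverting a score is exact for a single task. For a score averaged over ten tasks, the "1.6x" reading is only a rough guide.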

| # | Model | Unprompted | Prompted | Δ | Frugal | Cost | Correctness |
|---|---|---|---|---|---|---|---|
| 🥇 | GPT-5.4 | 0.43 | 0.63 | +0.20 | 30% | $0.096 | 78% → 85% |
| 🥈 | Qwen 3.6 Plus | 0.44 | 0.60 | +0.17 | 40% | $0.158 | 78% → 87% |
| 🥉 | Gemma 4 31B | 0.54 | 0.58 | +0.04 | 50% | $0.003 | 92% both |
| 4 | DeepSeek Chat | 0.51 | 0.55 | +0.04 | 30% | $0.006 | 91% → 80% |
| 5 | Claude Sonnet 4 | 0.47 | 0.52 | +0.04 | 40% | $0.121 | 92% both |
| 6 | LFM 2 24B A2B | 0.54 | 0.47 | -0.06 | 30% | $0.001 | 90% → 80% |
| 7 | Mistral Large 2411 | 0.54 | 0.46 | -0.08 | 40% | $0.050 | 90% → 82% |
| 8 | Gemini 2.5 Flash | 0.47 | 0.46 | -0.01 | 50% | $0.020 | 92% → 90% |
| 9 | Cohere Command A | 0.60 | 0.44 | -0.17 | 40% | $0.071 | 90% → 82% |
| 10 | Kimi K2.6 | 0.34 | 0.43 | +0.09 | 30% | $0.029 | 76% → 86% |

GPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were config-generation (+0.81, where it went from 12 inline JSON blocks to a template loop), html-from-data (+0.71), and magic-strings (+0.38, where it switched to an Enum). It's the only model in the batch where the "write efficient code" instruction consistently produces different (and better) output.
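For readers who haven't met the pattern, the magic-strings change described above might look something like this. It is an illustrative sketch only, not the benchmark task or any model's output; `OrderStatus` and `can_cancel` are hypothetical names.

```python
from enum import Enum


class OrderStatus(Enum):
    """Each status is defined once instead of being retyped as a string literal."""
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"


def can_cancel(status: OrderStatus) -> bool:
    # Comparing against an Enum member; a misspelled member name raises
    # AttributeError, whereas a misspelled string would silently compare unequal.
    return status is OrderStatus.PENDING


# Raw strings from storage or an API can still be converted by value.
print(can_cancel(OrderStatus("pending")))   # True
print(can_cancel(OrderStatus.SHIPPED))      # False
```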

The cost is worth noting: about $0.10 for 20 tasks is mid-range, neither cheap nor expensive. But the efficiency gain is real.

Half of Gemma 4's tasks were already "frugal", meaning naturally efficient without being told. It scored 92% correctness on both phases at just $0.003 total. That's a cost advantage of roughly 30x to 40x over GPT-5.4 ($0.096), with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.

Cohere Command A had the highest unprompted efficiency in the batch (0.60); it naturally writes concise code. But when told "write efficient code," it ballooned output on several tasks. html-from-data went from a tight 45-token solution to a 600+-token monstrosity (-0.92 gap). The prompt made it overthink.

Lesson: if a model is already efficient, don't prompt it to be more efficient.

Qwen 3.6 Plus scored second in prompted efficiency (+0.17 improvement) but took 26 minutes for 20 tasks, by far the slowest model. The efficiency gain is real (especially on html-from-data, where it went from hardcoded rows to a map/join pattern), but you're waiting for it. Batch workloads only.
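The benchmark's html-from-data task was in JS/HTML, but the same idea carries over to other languages. Here is a rough Python analogue of moving from hardcoded rows to a map/join pattern, with made-up data and a hypothetical `render_rows` helper; it is not taken from any model's output.

```python
from html import escape

# Sample records invented for this sketch.
rows = [
    {"name": "Ada", "role": "Engineer"},
    {"name": "Grace", "role": "Admiral"},
]


def render_rows(items):
    # One row template applied to every record, instead of a hand-written <tr> per row.
    return "\n".join(
        f"<tr><td>{escape(r['name'])}</td><td>{escape(r['role'])}</td></tr>"
        for r in items
    )


print(render_rows(rows))
```

The token count of this version stays flat as the data grows from 2 rows to 20, which is the property the efficiency metric rewards.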

Kimi K2.6 had the lowest unprompted efficiency (0.34, with verbose, boilerplate-heavy code) but improved the most at the bottom end (+0.09). Still last place, but the prompt actually helped it compress, which is the opposite of the Cohere effect. Some models need the nudge.

"Frugal" means the model naturally produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50%: half their tasks were already efficient. GPT-5.4, DeepSeek Chat, and Kimi K2.6 were only 30% frugal; they needed the prompt to tighten up.
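The article does not say how close to optimal counts as "at or near", so the sketch below assumes a cutoff of 0.9 on the unprompted score purely for illustration. The per-task scores are invented, and `frugal_rate` is a hypothetical helper.

```python
def frugal_rate(unprompted_scores, threshold=0.9):
    """Share of tasks whose unprompted score meets the (assumed) threshold."""
    if not unprompted_scores:
        raise ValueError("need at least one score")
    frugal = sum(1 for s in unprompted_scores if s >= threshold)
    return frugal / len(unprompted_scores)


# Hypothetical per-task unprompted scores for one model across 10 tasks.
scores = [1.0, 0.95, 0.92, 1.0, 0.9, 0.4, 0.3, 0.25, 0.2, 0.15]
print(f"{frugal_rate(scores):.0%}")   # 50%
```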

| Group | Models | Behaviour |
|---|---|---|
| Prompt-responsive | GPT-5.4, Qwen 3.6 Plus | Efficiency improves substantially with prompting |
| Prompt-neutral | Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6 | Prompt has little effect (±0.04) |
| Prompt-antagonistic | LFM 2 24B A2B, Mistral Large 2411, Cohere Command A | Efficiency drops when prompted |

The prompt-antagonistic group is the most interesting. These models know how to write efficient code (0.54-0.60 unprompted), but the explicit instruction triggers over-engineering: they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric.

If the prompt says "write efficient code" and the model responds by writing more tokens, something in the training signal is misaligned.

The benchmark used ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash, each with a known optimal token budget for a correct, DRY solution. Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and middleware decorator pattern (Python/Flask).

Each model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended. Scoring: efficiency_ratio = optimal_tokens / actual_tokens (capped at 1.0). Correctness scored against expected output patterns.

Total cost: $0.56 for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. Max tokens: 600.
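The run matrix is easy to check with a few lines. This sketch only enumerates the combinations and does the arithmetic; the model and task names are placeholders, and no API is called.

```python
from itertools import product

models = [f"model-{i}" for i in range(1, 11)]   # placeholders for the 10 models
tasks = [f"task-{i}" for i in range(1, 11)]     # placeholders for the 10 tasks
phases = ["unprompted", "prompted"]

# Settings as reported in the article.
settings = {"temperature": 0.1, "max_tokens": 600}
total_cost = 0.56

# Every (model, task, phase) combination is one API call.
runs = list(product(models, tasks, phases))
print(len(runs))                           # 200
print(f"${total_cost / len(runs):.4f}")    # $0.0028
```

Averaged over all models, that works out to well under a cent per call, though the per-model costs in the results table vary by two orders of magnitude.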

Full results: benchmarks.workswithagents.dev
