Evaluating Kimi 2.5 vs Kimi 2.6: What happens to agent skills when the model gets smarter?


Source: dev.to · Published: 2026-06-21T06:41Z · Topic: large-language-models


Moonshot's Kimi K2.6 outperforms K2.5 on agent skills, with a +1.9 percentage point baseline gain and four skills now solved without any skill installed. In evaluations across 21 skills and 100 paired scenarios, K2.6 achieved 92.2% with skill (vs 90.2% for K2.5) and showed comparable performance to Claude Sonnet 4.5. Skills still provide significant uplift (around +17 pp) even as models improve, though some skills become unnecessary as models handle tasks natively.

4 min read · published Jun 21, 2026

When a stronger model ships, there are two questions every skill author should want answered, and evals are the only honest way to answer either:

Moonshot gave us early access to Kimi K2.6. We ran the Tessl agent skill evaluation harness on the same 21 skills and 100 paired scenarios against three solvers: Kimi K2.5, Kimi K2.6, and Claude Sonnet 4.5.

A solver is the model whose output the grader scores; a paired scenario is the same task run twice per solver, once without the skill installed and once with it. These are early signals from one pre-release on one skill set. A deeper cross-model analysis with clean baselines across the board is in progress and will be its own piece.

Scenarios and rubrics are held fixed across the two Moonshot runs. The only variable is the solver.
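The paired design can be sketched in a few lines of Python. This is a minimal illustration, not the Tessl harness: the `PairedResult` type, the `aggregate` function, and the demo scores are all hypothetical, and it assumes the aggregate is a plain mean over scenarios, which the article does not confirm.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairedResult:
    """One paired scenario: the same task scored without and with the skill."""
    skill: str
    baseline: float    # rubric score in percent, skill not installed
    with_skill: float  # rubric score in percent, skill installed


def aggregate(results: list[PairedResult]) -> dict[str, float]:
    """Average both arms over all scenarios and report the uplift in pp."""
    base = mean(r.baseline for r in results)
    skilled = mean(r.with_skill for r in results)
    return {"baseline": base, "with_skill": skilled, "uplift_pp": skilled - base}


# Hypothetical scores for illustration only; not data from the evaluation.
demo = [
    PairedResult("skill-a", 40.0, 90.0),
    PairedResult("skill-a", 60.0, 100.0),
    PairedResult("skill-b", 80.0, 70.0),    # a scenario where the skill hurts
    PairedResult("skill-b", 100.0, 100.0),
]
summary = aggregate(demo)
print(summary)
```

Because both arms run on the same scenarios with the same rubric, the difference between the two means is attributable to the skill rather than to the task mix.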

SKILL.md

Per-skill n=5 is noisy; the aggregate over 100 scenarios is where the signal lives.
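A rough way to see why n=5 is noisy is to compare the standard error of a mean at both sample sizes. The sketch below assumes a per-scenario standard deviation of 25 pp, which is a made-up figure (the article reports no variance), and treats scenarios as independent, which scenarios within one skill may not be.

```python
import math


def standard_error(sd_pp: float, n: int) -> float:
    """Standard error of a mean over n independent scores, in pp."""
    return sd_pp / math.sqrt(n)


ASSUMED_SD_PP = 25.0  # hypothetical spread of per-scenario scores

se_per_skill = standard_error(ASSUMED_SD_PP, 5)    # one skill, n=5
se_aggregate = standard_error(ASSUMED_SD_PP, 100)  # all 100 scenarios

print(f"n=5:   +/- {se_per_skill:.1f} pp")
print(f"n=100: +/- {se_aggregate:.1f} pp")
```

Whatever the true spread is, the ratio between the two errors is sqrt(100 / 5), roughly 4.5, so a per-skill swing needs to be several times larger than an aggregate swing before it means the same thing.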

K2.6 lands ~2 pp (percentage points) above K2.5 in aggregate, with double-digit moves on specific skills.

[…] ~8 pp […]

Skill uplift holds (+17.05 pp on K2.5, +17.20 pp on K2.6).

| Solver | Baseline (no skill) | With skill | Uplift |
|---|---|---|---|
| Kimi K2.5 | 73.2% | 90.2% | +17.05 pp |
| Kimi K2.6 | 75.0% | 92.2% | +17.20 pp |

Kimi K2.6 is a better model than K2.5 on this skill set. Two findings to back this up:

Four skills are now solved without any skill installed. agent-gossip-coordinator is the clearest example: K2.5 needed the skill (+8.0 pp uplift), K2.6 already solves it at 96.4%, and the skill now hurts by 4.8 pp. These skills no longer earn their context budget, because the stronger model handles the task on its own.

The two K2.5 regressions (3d-molecule-ray-tracer: −7.0 pp; agent-base-template-generator: −2.6 pp) both resolve on K2.6. The skills were not wrong; the weaker model was just interpreting them awkwardly.

Putting K2.6 next to Sonnet 4.5 on the same 21 skills and same rubric, the early picture is this:

| Solver | Baseline (no skill) | With skill | Uplift |
|---|---|---|---|
| Kimi K2.6 | 75.0% | 92.2% | +17.20 pp |
| Sonnet 4.5 | 63.2% | 84.5% | +21.3 pp |

On these early signals, Kimi K2.6 appears competitive with Sonnet 4.5 for the task categories these skills cover. A deeper cross-model study with clean baselines across all three solvers is in progress, but this is an early signal that Kimi K2.6 is comparable to models from some of the world's leading providers.
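The gaps between solvers can be read straight off the tables above. The snippet below is only arithmetic on the published, rounded aggregates; the `SCORES` layout and the `gap_pp` helper are illustrative names, and because the inputs are rounded the results can differ slightly from figures computed on unrounded scores (for example, the K2.5 uplift is reported as +17.05 pp, while the rounded columns give 17.0).

```python
# Aggregate scores copied from the article's tables, in percent:
# solver -> (baseline without skill, score with skill)
SCORES = {
    "Kimi K2.5": (73.2, 90.2),
    "Kimi K2.6": (75.0, 92.2),
    "Sonnet 4.5": (63.2, 84.5),
}


def gap_pp(a: str, b: str) -> tuple[float, float]:
    """Return (baseline gap, with-skill gap) of solver a over solver b, in pp."""
    (a_base, a_skill), (b_base, b_skill) = SCORES[a], SCORES[b]
    return round(a_base - b_base, 1), round(a_skill - b_skill, 1)


base_gap, skill_gap = gap_pp("Kimi K2.6", "Sonnet 4.5")
print(f"K2.6 over Sonnet 4.5: baseline {base_gap:+.1f} pp, with skill {skill_gap:+.1f} pp")
```

On these numbers the gap narrows once the skill is installed, which matches Sonnet 4.5 showing the larger uplift. The same caveat applies as in the text: this is one pre-release run, and the cross-model baselines are still being cleaned up.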

With vs without the skill installed, on Kimi:

K2.5: +17.05 pp.

K2.6: +17.20 pp.

The uplift the skill buys does not shrink as the solver gets stronger. The baseline moves, the with-skill score moves with it, and the delta the skill contributes stays in the same range.

Two illustrative cases, both Kimi versions, same rubric:

agent-agent. K2.5 17.7% → 79.9%. K2.6 33.9% → 88.8%. The baseline closed 16 pp of the gap. The skill still buys roughly 55 pp on top.

agent-development. K2.5 41.2% → 100.0%. K2.6 55.0% → 100.0%. The baseline closed 14 pp of the gap. The skill covers the rest.

One nuance worth flagging here and reserving for a dedicated follow-up: not every uplift is equal. An initial pass comparing the same skills on Sonnet 4.5 suggests that skills prescribing ecosystem-specific tool calls or conventions lose the most in the cross-family handoff, while skills graded against real, verifiable behaviour (actual CLI flags, actual API shapes) transfer more readily. We view this as the most actionable signal for skill authors, but a broader sample and matched baselines across models are needed before we publish a complete analysis.
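The arithmetic behind the two illustrative cases can be checked directly. The sketch below reads the first pair of scores in each case as K2.5 and the second as K2.6; the `CASES` layout and the `decompose` helper are illustrative names, not part of any harness.

```python
# skill -> ((K2.5 baseline, K2.5 with skill), (K2.6 baseline, K2.6 with skill))
# Scores in percent, taken from the two cases in the article.
CASES = {
    "agent-agent": ((17.7, 79.9), (33.9, 88.8)),
    "agent-development": ((41.2, 100.0), (55.0, 100.0)),
}


def decompose(old: tuple[float, float], new: tuple[float, float]) -> tuple[float, float]:
    """Split the newer model's result into baseline gain and remaining skill uplift."""
    baseline_gain = round(new[0] - old[0], 1)  # what the stronger model does alone
    skill_uplift = round(new[1] - new[0], 1)   # what the skill still adds on top
    return baseline_gain, skill_uplift


breakdown = {skill: decompose(k25, k26) for skill, (k25, k26) in CASES.items()}
for skill, (gain, uplift) in breakdown.items():
    print(f"{skill}: baseline gain {gain:+.1f} pp, skill uplift on K2.6 {uplift:+.1f} pp")
```

In both cases the stronger model closes part of the gap on its own, and the skill's remaining contribution on K2.6 is still far larger than the baseline gain.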

Kimi K2.6 is a better model than K2.5 on this skill set: a +1.9 pp baseline gain, four skills now solved without any skill installed, and both K2.5 regressions cleaned up.

Skills still matter as models get better: the +17 pp uplift we saw on K2.5 held on K2.6, and uplift in a similar range appears on Sonnet. All of this comes from a single pre-release evaluation on 21 skills; a deeper study with clean baselines across the board is the next piece.

These are early signals. Kimi K2.6 appears competitive with Sonnet 4.5, though a deeper study across more models and a balanced skill sample is in progress and will be published separately.

Thanks to Moonshot for early access to K2.6! Head over to Tessl to evaluate and optimize your skills.


Read original on dev.to → dev.to/tessl-io/evaluating-kimi-25-vs-kimi-26-wh…
