When a stronger model ships, there are two questions every skill author should want answered, and evals are the only honest way to answer either:
Moonshot gave us early access to Kimi K2.6. We ran the Tessl agent skill evaluation harness on the same 21 skills and 100 paired scenarios against three solvers: Kimi K2.5, Kimi K2.6, and Claude Sonnet 4.5.
A solver is the model whose output the grader scores; a paired scenario is the same task run twice per solver, once without the skill installed and once with it. These are early signals from one pre-release on one skill set. A deeper cross-model analysis with clean baselines across the board is in progress and will be its own piece.
Scenarios and rubrics are held fixed across the two Moonshot runs. The only variable is the solver.
Per-skill n=5 is noisy; the aggregate over 100 scenarios is where the signal lives.
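A rough sketch of why five scenarios per skill is noisy: the standard error of a mean score shrinks with the square root of the sample size. The code assumes independent scenarios and a hypothetical per-scenario standard deviation of 25 pp; neither assumption comes from the evaluation data.

```python
import math


def standard_error(per_scenario_sd: float, n: int) -> float:
    """Standard error of a mean score over n independent scenarios."""
    return per_scenario_sd / math.sqrt(n)


# Hypothetical spread of per-scenario scores in percentage points.
# This value is an assumption for illustration, not a measured figure.
ASSUMED_SD_PP = 25.0

per_skill_se = standard_error(ASSUMED_SD_PP, 5)    # one skill, n=5
aggregate_se = standard_error(ASSUMED_SD_PP, 100)  # all 100 scenarios

print(f"per-skill SE: {per_skill_se:.1f} pp")
print(f"aggregate SE: {aggregate_se:.1f} pp")
print(f"ratio: {per_skill_se / aggregate_se:.2f}x")
```

Whatever the true spread is, the ratio between the two depends only on the sample sizes: under these assumptions a per-skill mean carries about 4.5 times the uncertainty of the aggregate.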
- K2.6 scores ~2 pp (percentage points) above K2.5 in aggregate, with double-digit moves on specific skills.
- … ~8 pp.
- Skill uplift holds across both Kimi versions (+17.05 pp on K2.5, +17.20 pp on K2.6).

| Solver | Baseline (no skill) | With skill | Uplift |
|---|---|---|---|
| Kimi K2.5 | 73.2% | 90.2% | +17.05 pp |
| Kimi K2.6 | 75.0% | 92.2% | +17.20 pp |

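The uplift column is the with-skill score minus the no-skill baseline. As a minimal sketch, the snippet below recomputes it from the rounded percentages in the table; `SolverResult` is a hypothetical helper type, not part of the Tessl harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolverResult:
    """Aggregate scores for one solver (hypothetical container type)."""
    solver: str
    baseline_pct: float    # score with no skill installed
    with_skill_pct: float  # score with the skill installed

    @property
    def uplift_pp(self) -> float:
        """Uplift in percentage points: with-skill minus baseline."""
        return self.with_skill_pct - self.baseline_pct


# Rounded values copied from the table above.
results = [
    SolverResult("Kimi K2.5", 73.2, 90.2),
    SolverResult("Kimi K2.6", 75.0, 92.2),
]

for r in results:
    print(f"{r.solver}: {r.uplift_pp:+.1f} pp")

baseline_gain = results[1].baseline_pct - results[0].baseline_pct
print(f"Baseline gain K2.5 to K2.6: {baseline_gain:+.1f} pp")
```

Recomputing from the rounded table values gives +17.0 pp for K2.5 rather than the reported +17.05 pp. The likely explanation is that the published uplift was computed from unrounded scores, though that is an assumption.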
Kimi K2.6 is a better model than K2.5 on this skill set. Two findings to back this up:
- Four skills are now solved without any skill installed. `agent-gossip-coordinator` is the clearest example: K2.5 needed the skill (+8.0 pp uplift), K2.6 already solves it at 96.4%, and the skill now hurts by 4.8 pp. These skills no longer earn their context budget, because the stronger model handles the task unaided.
- The two skills that regressed on K2.5 (`3d-molecule-ray-tracer`: −7.0 pp; `agent-base-template-generator`: −2.6 pp) both resolve on K2.6. The skills were not wrong; the weaker model was just interpreting them awkwardly.

Putting K2.6 next to Sonnet 4.5 on the same 21 skills and same rubric, the early picture is this:

| Solver | Baseline (no skill) | With skill | Uplift |
|---|---|---|---|
| Kimi K2.6 | 75.0% | 92.2% | +17.20 pp |
| Sonnet 4.5 | 63.2% | 84.5% | +21.3 pp |
On these early signals, Kimi K2.6 appears competitive with Sonnet 4.5 for the task categories these skills cover. A deeper cross-model study with clean baselines across all three solvers is in progress, but this is an early sign that Kimi K2.6 is comparable to models from some of the world’s leading providers.
With vs without the skill installed, on Kimi:

- K2.5: +17.05 pp.
- K2.6: +17.20 pp.

The uplift the skill buys does not shrink as the solver gets stronger. The baseline moves, the with-skill score moves with it, and the delta the skill contributes stays in the same range.

Two illustrative cases, both Kimi versions, same rubric:

- `agent-agent`. K2.5: 17.7% → 79.9%. K2.6: 33.9% → 88.8%. The baseline closed 16 pp of the gap. The skill still buys roughly 55 pp on top.
- `agent-development`. K2.5: 41.2% → 100.0%. K2.6: 55.0% → 100.0%. The baseline closed 14 pp of the gap. The skill covers the rest.

One nuance worth flagging here and reserving for a dedicated follow-up: not every uplift is equal. An initial pass comparing the same skills on Sonnet 4.5 suggests that skills prescribing ecosystem-specific tool calls or conventions lose the most in the cross-family handoff, while skills graded against real, verifiable behaviour (actual CLI flags, actual API shapes) transfer more readily. We view this as the most actionable signal for skill authors, but a broader sample and matched baselines across models are needed before we publish a complete analysis.
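The arithmetic behind the two illustrative cases (`agent-agent` and `agent-development`) can be reproduced with a short sketch. It assumes the first pair of figures in each case belongs to K2.5 and the second to K2.6; `summarize` is a hypothetical helper, and each tuple is (baseline %, with-skill %).

```python
from typing import Dict, Tuple

Scores = Tuple[float, float]  # (baseline %, with-skill %)

# Figures from the two cases; the K2.5 labelling is an assumption.
cases: Dict[str, Dict[str, Scores]] = {
    "agent-agent": {"K2.5": (17.7, 79.9), "K2.6": (33.9, 88.8)},
    "agent-development": {"K2.5": (41.2, 100.0), "K2.6": (55.0, 100.0)},
}


def summarize(runs: Dict[str, Scores]) -> Tuple[float, float]:
    """Return (baseline gain K2.5 to K2.6, uplift the skill adds on K2.6)."""
    old_baseline, _ = runs["K2.5"]
    new_baseline, new_with_skill = runs["K2.6"]
    baseline_gain = new_baseline - old_baseline
    remaining_uplift = new_with_skill - new_baseline
    return baseline_gain, remaining_uplift


for skill, runs in cases.items():
    gain, uplift = summarize(runs)
    print(f"{skill}: baseline +{gain:.1f} pp, skill adds +{uplift:.1f} pp on K2.6")
```

In both cases the stronger baseline closes only part of the distance to the with-skill score, which is the sense in which the skill "still buys" the remainder.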
Kimi K2.6 is a better model than K2.5 on this skill set: a +1.9 pp baseline gain, four skills now solved without any skill installed, and both K2.5 regressions cleaned up.
Skills still matter as models get better: the +17 pp uplift we saw on K2.5 held on K2.6, and uplift in a similar range appears on Sonnet. All of this comes from a single pre-release evaluation on 21 skills; a deeper study with clean baselines across the board is the next piece.
These are early signals. On this evidence, Kimi K2.6 appears competitive with Sonnet 4.5, though a deeper study across more models and a balanced skill sample is in progress and will be published separately.
Thanks to Moonshot for early access to K2.6! Head over to Tessl to evaluate and optimize your skills.