Is Your Agent Skill Actually Good? Microsoft's Dual-Paper Deep Dive into Skill Evaluation and Self-Evolving Optimization


You spent an afternoon crafting a carefully structured Skill for your agent. Clear steps, thorough edge-case notes, well-formatted output requirements. You tested it manually a few times, and the outputs looked great. You shipped it.

Three weeks later, you notice that some task success rates have gone down compared to before the Skill existed.

This is not a hypothetical. In May 2026, Microsoft Research published two concurrent papers: SkillLens ("From Raw Experience to Skill Consumption") and SkillOpt.

One paper answers "why skills sometimes backfire." The other answers "how to make skills systematically better." Together they sketch a new paradigm for agent capability improvement.

Most practitioners think of a Skill as "a block of text instructions for an agent." SkillLens decomposes this into a three-stage lifecycle:

Stage 1: Experience Generation
    Target model M runs training tasks, producing an experience pool
    of trajectories (both successes and failures)
    ↓
Stage 2: Skill Extraction
    Extractor model E distills the experience pool into a structured
    skill document — procedural knowledge under a fixed budget
    ↓
Stage 3: Skill Consumption
    The same target model M, equipped with the extracted skill,
    is evaluated on held-out test tasks

Notice there are two distinct roles in this chain: the Extractor (distills knowledge from trajectories) and the Target (consumes knowledge to improve task performance). SkillLens's central insight is that these two roles are independent — a strong task executor is not necessarily a strong extractor, and vice versa.
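To make the two roles concrete, here is a minimal sketch of the three stages as plain Python. The function name `run_lifecycle`, the callable signatures, and the choice to represent a trajectory as a `(task, score)` pair are all assumptions made for illustration; the paper's actual pipeline is not shown in this article. Δ is taken here to mean the mean test score with the skill minus the mean test score without it.

```python
from statistics import mean

def run_lifecycle(target, extractor, train_tasks, test_tasks):
    """Hypothetical three-stage loop.

    target(task, skill)   -> numeric score (skill may be None)
    extractor(experience) -> skill text
    """
    # Stage 1: the target runs training tasks without a skill.
    experience = [(task, target(task, None)) for task in train_tasks]
    # Stage 2: a (possibly different) model distills the experience pool.
    skill = extractor(experience)
    # Stage 3: the same target is evaluated on held-out tasks, with and without the skill.
    with_skill = mean(target(task, skill) for task in test_tasks)
    baseline = mean(target(task, None) for task in test_tasks)
    return skill, with_skill - baseline
```

Because `target` and `extractor` are separate arguments, the same target can be paired with different extractors, which is the separation the two metrics below rely on.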

To separate these two effects, the paper introduces two complementary metrics:

Extraction Efficacy (EE): Fix an extractor. How reliably does it produce helpful skills across different target models?

$$\text{EE}(E, \mathcal{D}) = \frac{1}{|\mathcal{M}|} \sum_{M \in \mathcal{M}} \Delta(E, M, \mathcal{D})$$

Target Evolvability (TE): Fix a target model. How much does it improve when different extractors distill skills from its own experience?

$$\text{TE}(M, \mathcal{D}) = \frac{1}{|\mathcal{E}|} \sum_{E \in \mathcal{E}} \Delta(E, M, \mathcal{D})$$

A useful analogy: EE measures "how good a teacher is at helping many different students," while TE measures "how much a student can learn from many different teachers." Both dimensions being high is ideal.
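Both metrics are averages over the same grid of Δ values, taken along different axes. The sketch below assumes the grid is stored as a dictionary keyed by `(extractor, target)`; the keys and numbers are invented for illustration and are not results from the paper.

```python
from statistics import mean

# Hypothetical Δ values for one domain, keyed by (extractor, target).
deltas = {
    ("E1", "M1"): 2.0, ("E1", "M2"): -1.0,
    ("E2", "M1"): 4.0, ("E2", "M2"): 3.0,
}

def extraction_efficacy(deltas, extractor):
    """EE: mean Δ of one fixed extractor across all target models."""
    return mean(d for (e, _m), d in deltas.items() if e == extractor)

def target_evolvability(deltas, target):
    """TE: mean Δ of one fixed target across all extractors."""
    return mean(d for (_e, m), d in deltas.items() if m == target)
```

In this toy grid, `E2` is the better teacher (EE 3.5 versus 0.5) and `M1` is the more teachable student (TE 3.0 versus 1.0).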

The study is comprehensive: it spans five domains, six target models, and five extractors.

A key design principle is minimal scaffolding: the extraction framework intentionally strips out domain-specific heuristics, filtering rules, and optimization tricks. Only a bare two-stage "per-trajectory analysis → hierarchical consolidation" pipeline remains, ensuring performance differences reflect the extractor's intrinsic capability, not pipeline engineering.

Table 1 shows a stark picture:

Model-generated skills improve downstream performance in 75% of entries. Yet negative transfer remains common: 25% of entries have Δ < 0.

Negative transfer rates vary substantially across domains: SpreadsheetBench and SWE-bench-Verified show the lowest rates (~13%), while ALFWorld reaches 47% — nearly half the time, adding a skill to this domain hurts rather than helps.

Even more counterintuitive: a stronger task-execution model does not predict extraction quality. On SpreadsheetBench, the lightweight Gemini-3.1-Flash-Lite achieves the highest EE, while GPT-5.4 — the strongest performer on the benchmark itself — ranks last as an extractor. Converting target-specific trajectories into procedural guidance that the target can actually use is a distinct capability from solving the tasks.

You might guess that an ordered-list Skill outperforms a prose-format Skill. SkillLens tested this directly:

Rewrite the same skill into four canonical formats (ordered list, unordered list, checklist, prose) and use the Friedman test to check whether any format consistently ranks higher.

Result: Format has no detectable effect on any target (all p > 0.34). Swapping extractors, by contrast, produced significant effects on 5 of 6 targets (p < 0.005).

The implication is direct: obsessing over Skill formatting is wasted effort. What the skill says matters far more than how it's laid out.
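For readers who want to run the same kind of check on their own skills, the following is a standard-library sketch of the Friedman statistic. The data layout (one row per skill, one score per format) is an assumption, and this version omits the tie correction and the p-value; `scipy.stats.friedmanchisquare` provides both if SciPy is available.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """Friedman chi-square statistic without tie correction.

    blocks: list of rows; each row holds one score per format for the same skill.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
```

A statistic near zero means the formats trade places from skill to skill; a large value (compared against a chi-square distribution with k − 1 degrees of freedom) means one format ranks consistently higher.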

This is the most surprising finding. The experiment asks GPT-5.4 to act as a judge: given two skills extracted from the same (target model, domain) pair, pick the one that will perform better on downstream tasks.

Without guidance, the judge achieves only 46.4% accuracy, indistinguishable from random guessing (50%). On pairs where the actual performance gap is large (δ ≥ 5%, so one skill is clearly better), the judge picks the genuinely higher-performing skill only 15.8% of the time: a clear inversion of actual utility.

In other words, the skill that reads more fluently and coherently tends to be the one that performs worse downstream. Textual plausibility has become divorced from actual utility.

This has a direct practical implication: you cannot reliably screen skills by asking an LLM to judge the text. The quality gap lies deeper than surface form.
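Measuring this on your own skill pairs is straightforward once you have a measured Δ for each skill. The sketch below assumes each pair is recorded as `(delta_a, delta_b, judge_pick)`; the tuple layout and the `min_gap` parameter are illustrative choices, not the paper's evaluation code.

```python
def judge_accuracy(pairs, min_gap=0.0):
    """Share of pairs where the judge picked the skill with the higher measured Δ.

    pairs: list of (delta_a, delta_b, judge_pick), judge_pick in {"a", "b"}.
    Only pairs whose gap |delta_a - delta_b| is at least min_gap are counted;
    exact ties are skipped because neither pick is correct.
    Returns None when no pair qualifies.
    """
    eligible = [
        (da, db, pick) for da, db, pick in pairs
        if da != db and abs(da - db) >= min_gap
    ]
    if not eligible:
        return None
    correct = sum((pick == "a") == (da > db) for da, db, pick in eligible)
    return correct / len(eligible)
```

Calling it once with `min_gap=0.0` and once with `min_gap=5.0` reproduces the two views discussed above: overall accuracy and accuracy on high-gap pairs.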

Stage 1 (Experience Generation): Success/Failure Ratio Sets the Ceiling

Fixing the extractor (GPT-5.4-mini), the researchers sampled experience pools with success ratios of 0%/25%/50%/75%/100% from the same trajectories and compared the resulting skills.

Key finding: experience composition strongly shapes skill quality, and the optimal success-failure ratio is domain-specific.

One universal rule: all-failure pools consistently produce the worst skills. Successful trajectories are the foundation of skill extraction — they provide positive procedural signals that narrow the agent's action space rather than merely listing what to avoid.
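A controlled pool of this kind can be built by sampling separately from the successful and failed trajectories. The helper below is a sketch under assumed names (`sample_pool`, `pool_size`, `seed`); the article does not state the pool size or sampling procedure the researchers used.

```python
import random

def sample_pool(successes, failures, success_ratio, pool_size, seed=0):
    """Draw a pool with a fixed share of successful trajectories.

    successes / failures: lists of trajectories from the same target model.
    success_ratio: e.g. 0.0, 0.25, 0.5, 0.75, or 1.0.
    """
    rng = random.Random(seed)  # seeded so pools are reproducible
    n_success = round(pool_size * success_ratio)
    n_failure = pool_size - n_success
    if n_success > len(successes) or n_failure > len(failures):
        raise ValueError("not enough trajectories for this ratio and pool size")
    pool = rng.sample(successes, n_success) + rng.sample(failures, n_failure)
    rng.shuffle(pool)  # avoid presenting all successes first
    return pool
```

Holding `pool_size` fixed across ratios keeps the extractor's input budget comparable, so differences in the resulting skills can be attributed to composition rather than volume.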

Stage 2 (Skill Extraction): Depth Matters, Not Aesthetics

Starting from "why do plausible-looking skills fail?", a qualitative inspection of high-Δ vs. low-Δ skill pairs reveals the real difference:

High-Δ skills provide concrete, executable remedies (e.g., "when the host engine doesn't evaluate formula strings, precompute static values and write them into cells directly"). Low-Δ skills offer generic platitudes (e.g., "resolve the contract before coding").

A vivid analogy: a high-quality skill reads like a practitioner's debugging journal — recording specific failure modes in specific contexts and their concrete fixes. A low-quality skill reads like a "we all know this already" lecture about best practices.

Stage 3 (Skill Consumption): Same Skill, Very Different Effects Across Targets

Injecting the same skill into different models can produce wildly different results. On SpreadsheetBench, the strong-pool skill boosts GPT-5.4 by +9.0 but produces negative transfer on some Qwen3.5-9B conditions.

Behavioral analysis explains why: skill consumption reshapes the target's default policy rather than triggering new explicit tool calls. For GPT-5.4, the skill steers it away from writing spreadsheet formulas toward computing results in Python and writing back static values — exactly the right strategy correction for formula stability issues. For Qwen3.5-9B, the same guidance pushes it toward more complex workbook-native workflows that improve structural correctness on sheet-level tasks but introduce more execution failures on fine-grained operations.

The analysis reveals that skill quality is driven by hidden dimensions that are invisible on the surface. RQ3 asks: can these findings be turned into a concrete, drop-in improvement to the extraction process itself?

Step 1: Discover which dimensions actually predict utility

The paper designs a fully automated rubric-discovery pipeline:

Only 3 dimensions consistently align with utility, forming the validated rubric:

Dimension	What It Captures
Failure Mechanism Encoding	Does the skill encode specific failure modes and their triggers?
Actionable Specificity	Does the skill provide executable guidance tailored to concrete situations?
High-Risk Action Blacklist	Does the skill explicitly name high-risk operations to avoid?

Step 2: Verify the rubric's discriminating power

Using the 3-dimension validated rubric to guide the judge raises overall accuracy from 46.4% to 73.8% on 151 high-gap pairs. On the hardest pairs (δ ≥ 5%), where the unguided judge was actively wrong at 15.8%, the guided judge now picks correctly the majority of the time.

Step 3: Operationalize it as a Meta-Skill

The validated rubric is packaged as a compact meta-skill: a generation-time prior injected directly into the extractor's system prompt. Testing it against three conditions:

The results are unambiguous:

The plausibility rubric hurts average performance (−0.59pp, 6 of 9 cells regress). The validated rubric improves all nine cells (+1.55pp average), with the largest gains on SpreadsheetBench (+2.3 to +3.7pp).

Using "seems reasonable" criteria actively damages extraction. Only empirically validated dimensions reliably improve it.
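Injecting the validated rubric as a generation-time prior can be as simple as appending it to the extractor's system prompt. The sketch below uses the three dimension names and questions from the rubric table above; the function name, the connecting sentence, and the base prompt are assumptions, since the article does not reproduce the paper's actual meta-skill text.

```python
# Dimension names and questions as listed in the validated rubric.
VALIDATED_RUBRIC = {
    "Failure Mechanism Encoding":
        "Does the skill encode specific failure modes and their triggers?",
    "Actionable Specificity":
        "Does the skill provide executable guidance tailored to concrete situations?",
    "High-Risk Action Blacklist":
        "Does the skill explicitly name high-risk operations to avoid?",
}

def build_extractor_prompt(base_prompt, rubric=VALIDATED_RUBRIC):
    """Append the rubric to a base system prompt (wording here is hypothetical)."""
    lines = [base_prompt.rstrip(), "", "The extracted skill should satisfy each of the following:"]
    lines += [f"- {name}: {question}" for name, question in rubric.items()]
    return "\n".join(lines)
```

Keeping the rubric in a dictionary makes it easy to swap in a different rubric and compare the resulting skills on held-out tasks, which is the comparison the results above describe.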

SkillLens tells us skills can be systematically evaluated and improved. SkillOpt asks: can we apply an optimization loop to the skill document itself — the same discipline that makes weight-space optimization reproducible?

This analogy is the foundation of SkillOpt's entire design, laid out explicitly in the paper:

Deep Learning	SkillOpt Equivalent
Parameter (weights)	Skill document
Gradient direction	Trajectory-derived edit direction
Learning rate	Edit budget $L_t$
Validation check	Held-out selection gate
Stable training setting	Batch / minibatch / schedule / gate

This is not decorative framing — it's operational. Batch size, learning rate, validation, momentum — every concept has a corresponding text-space implementation in SkillOpt.

The optimization pipeline operates at two timescales: per-step updates and epoch-wise consolidation.

At each optimization step, the frozen target model runs a batch of training tasks with the current skill and produces scored trajectories. Small batches update quickly but noisily; larger batches expose more stable patterns. SkillOpt also supports accumulation: multiple rollout batches can be reflected on separately and merged into one update, decoupling execution throughput from update frequency.

The optimizer model (a separate frontier LLM) converts trajectories into structured skill edits. Crucially, it processes both failure and success trajectories:

Single trajectories tend to produce anecdotal fixes; minibatches expose reusable procedural errors — the agent consistently searches the wrong source, always formats answers incorrectly, or reliably fails to verify tool results.

Local proposals are merged hierarchically: failure-driven edits consolidated first, then success-driven edits, with failure corrections given priority. This filters duplicates, contradictions, and instance-specific suggestions before the optimizer selects the final bounded update.
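One simple way to realize failure-first merging is to walk the failure-driven proposals before the success-driven ones and let the first edit on a given target win. The `Edit` structure and the one-edit-per-target rule below are assumptions made for this sketch; filtering instance-specific suggestions would need a model call and is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    op: str          # "add", "delete", or "replace"
    target: str      # hypothetical identifier for the skill section being edited
    text: str = ""

def merge_proposals(failure_edits, success_edits):
    """Merge local proposals, giving failure-driven edits priority.

    An edit is dropped if an earlier edit already claimed the same target,
    which removes exact duplicates and resolves contradictions in favor of
    the earlier (failure-driven) proposal.
    """
    merged, claimed = [], set()
    for edit in list(failure_edits) + list(success_edits):
        if edit.target in claimed:
            continue
        claimed.add(edit.target)
        merged.append(edit)
    return merged
```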

This is the sharpest distinction between SkillOpt and "just rewrite the skill when it seems wrong."

Each optimization step has an edit budget $L_t$: the optimizer may apply at most $L_t$ add/delete/replace operations to the skill document. Candidate edits ranked below the cutoff are discarded.

Why bounding matters:

SkillOpt supports four edit budget schedules: constant, linear, cosine (default), and autonomous. The default cosine schedule starts with larger edit budgets and decays toward smaller consolidation steps.
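The budget schedules map directly onto familiar learning-rate schedules. The sketch below covers constant, linear, and cosine; the cosine formula is the standard cosine-annealing form, and `max_budget`, `min_budget`, and the rounding rule are assumptions, since the article does not give SkillOpt's exact formula. The autonomous schedule is not described in enough detail here to sketch.

```python
import math

def edit_budget(step, total_steps, max_budget, min_budget=1, schedule="cosine"):
    """Return the edit budget L_t for a zero-based optimization step."""
    if total_steps < 1 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    progress = step / (total_steps - 1) if total_steps > 1 else 0.0
    if schedule == "constant":
        value = max_budget
    elif schedule == "linear":
        value = max_budget - (max_budget - min_budget) * progress
    elif schedule == "cosine":
        value = min_budget + 0.5 * (max_budget - min_budget) * (1 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return max(min_budget, round(value))

def select_bounded_update(ranked_edits, budget):
    """Keep the top `budget` edits; everything ranked below the cutoff is discarded."""
    return ranked_edits[:budget]
```

With `total_steps=5` and `max_budget=8`, the cosine schedule starts at 8 edits and ends at 1, so early steps can restructure the skill while late steps only make small consolidating changes.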

Every candidate skill is evaluated on the held-out selection split. SkillOpt accepts a candidate only if its selection score is strictly greater than the current best — ties are rejected. This converts reflection into propose-and-test optimization rather than unconditional self-editing.

Rejected edits don't disappear. The optimizer records an epoch-local rejected-edit buffer containing:

Later reflection calls in the same epoch receive this buffer, steering the optimizer away from known-harmful directions. This provides negative feedback at zero additional inference cost during deployment.

Think of it as running an A/B test on every proposed Skill revision: only the version that passes validation gets promoted. Rejected attempts become institutional memory that prevents repeating mistakes.
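The gate and the buffer fit in a few lines. In the sketch below, the class name, method names, and the fields stored per rejected edit are assumptions; only the strict-improvement rule and the epoch-local lifetime of the buffer come from the description above.

```python
class SelectionGate:
    """Propose-and-test gate with an epoch-local rejected-edit buffer (sketch)."""

    def __init__(self, initial_skill, initial_score):
        self.best_skill = initial_skill
        self.best_score = initial_score
        self.rejected = []  # cleared at the start of every epoch

    def start_epoch(self):
        self.rejected.clear()

    def consider(self, candidate_skill, candidate_score, edits):
        """Accept only on strict improvement of the held-out selection score."""
        if candidate_score > self.best_score:  # a tie is a rejection
            self.best_skill = candidate_skill
            self.best_score = candidate_score
            return True
        # Hypothetical record format: remember what was tried and how it scored.
        self.rejected.append({"edits": edits, "score": candidate_score})
        return False
```

Later reflection calls in the same epoch would receive `gate.rejected` as extra context. Since the buffer is used only while optimizing, nothing from it needs to ship with the deployed skill.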

Fast updates learn from the current rollout batch. The slow/meta update learns from the comparison across adjacent epochs — longer-horizon patterns that individual batches can't expose.

At epoch end, SkillOpt runs the same training items under both the previous epoch's skill and the current epoch's skill, categorizing results into: improvements, regressions, persistent failures, and stable successes. The optimizer model writes a concise longitudinal guidance block — capturing which edit patterns helped, which failed, and which failure modes persist across epochs.
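The four-way categorization is a direct comparison of per-task outcomes under the two skills. The sketch assumes each run is summarized as a mapping from task id to a pass/fail boolean; real runs may use graded scores, in which case a threshold would be needed.

```python
def categorize_epoch(prev_results, curr_results):
    """Compare the same training items under the previous and current skill.

    prev_results / curr_results: dict mapping task id -> True if solved.
    """
    buckets = {
        "improvements": [],         # failed before, solved now
        "regressions": [],          # solved before, failed now
        "persistent_failures": [],  # failed under both skills
        "stable_successes": [],     # solved under both skills
    }
    for task_id, before in prev_results.items():
        after = curr_results[task_id]
        if after and not before:
            buckets["improvements"].append(task_id)
        elif before and not after:
            buckets["regressions"].append(task_id)
        elif not before and not after:
            buckets["persistent_failures"].append(task_id)
        else:
            buckets["stable_successes"].append(task_id)
    return buckets
```

The regressions and persistent failures are the buckets most worth passing to the optimizer, since they show what recent edits broke and what no edit has fixed yet.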

This guidance is stored in a protected region of the skill document that step-level edits cannot overwrite, preventing short-term noise from erasing long-term lessons.

Key deployment note: the optimizer-side meta skill is never shipped with best_skill.md. It only lives in the teacher's reflection context. The deployed artifact stays compact and portable; the training process benefits from the full editing history.

SkillOpt is evaluated across 6 benchmarks × 7 target models × 3 execution harnesses (direct chat, Codex agentic loop, Claude Code agentic loop), against 7 baselines: no skill, human skill, one-shot LLM skill, Trace2Skill, TextGrad, GEPA, and EvoSkill.

SkillOpt is best or tied-best on all 52 evaluated (model, benchmark, harness) cells.

Key headline numbers for GPT-5.5:

Execution mode	Average gain over no-skill
Direct chat	+23.5 points
Codex harness	+24.8 points
Claude Code harness	+19.1 points

Individual benchmark gains are striking: OfficeQA jumps from 33.1 to 72.1 (+39.0), SpreadsheetBench from 41.8 to 80.7 (+38.9), ALFWorld on GPT-5.4-nano from 34.3 to 69.4 (×2.0). Procedural benchmarks with strict format requirements see the largest absolute gains — exactly where frontier models are most exposed zero-shot.

Table 3 ablation results confirm each component contributes:

Component removed	SearchQA drop	SpreadsheetBench drop	LiveMath drop
Learning-rate form (unbounded)	-2.5	-1.8	-4.0
Rejected-edit buffer	-1.6	-4.6	-2.4
Meta skill + slow update	-0.6	-22.5	-3.2

Removing the slow/meta update is most damaging for SpreadsheetBench (−22.5 points), because this benchmark requires accumulated procedural knowledge — output format conventions, formula evaluation strategies — exactly what the epoch-wise slow update is designed to protect.

Figure 4 in the paper reproduces one representative learned rule per benchmark, verbatim from the deployed best_skill.md:

SearchQA: "Infer the expected answer type from clue wording, then choose the shortest canonical entity supported by co-occurring distinctive evidence."

SpreadsheetBench: "Inspect workbook structure and formulas, then write evaluated static values across the full requested target range instead of relying on Excel recalculation."

OfficeQA: "Treat oracle parsed pages as primary evidence, lock table/date/unit context, and output exactly the requested rounded value without extra labels."

LiveMathematicianBench: "In strongest-statement MCQs, rank choices by theorem strength and prefer a justified stronger-result option over true but weaker corollaries."

ALFWorld: "Keep a horizon-aware visited/frontier ledger, diversify search after repeated same-type failures, and avoid revisiting the destination until holding the target."

Three properties stand out:

Compactness: final best_skill.md lengths range from 379 tokens (LiveMathematicianBench) to 1,995 tokens (SpreadsheetBench), with a median around 920 tokens. The number of actually accepted edits is 1 to 4 (median 2.5); the optimizer proposes far more, but only a handful survive the validation gate.

One of SkillOpt's most compelling findings is that the optimized artifact transfers well beyond its training setting.

Cross-model transfer: A SpreadsheetBench skill trained on GPT-5.4 retains 82% of the in-domain gain when transferred to GPT-5.4-mini (+9.4 of +11.4). A LiveMath skill transferred to GPT-5.4-nano actually surpasses the in-domain SkillOpt reference (28.8 transferred vs. 27.2 in-domain).

Cross-harness transfer: A SpreadsheetBench skill trained in the Codex loop transfers to Claude Code with a +59.7 point absolute gain over the Claude Code no-skill baseline (22.1 → 81.8), exceeding the Claude Code in-domain SkillOpt reference of 80.4. The transferred skill appears to encode workbook-level procedures — structure-first inspection, formula-aware verification, static-value materialization — that are harness-agnostic.

Cross-benchmark transfer: An OlympiadBench skill applied to Omni-MATH (sharing only the broad "math" family) produces positive gains across all three model scales (+3.7/+1.8/+1.3), supporting the interpretation that the skill encodes reusable mathematical procedure rather than memorized test-specific formatting.

Practical implication: optimize a skill in one execution environment, reuse it across multiple models, harnesses, and related benchmarks — without touching model weights.
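The transfer figures quoted above can be re-derived from the reported scores. The helper names below are hypothetical; the inputs are the numbers stated in the text.

```python
def retained_fraction(transferred_gain, in_domain_gain):
    """Share of the in-domain gain that survives transfer to another model."""
    return transferred_gain / in_domain_gain

def absolute_gain(baseline_score, score_with_skill):
    """Point gain over the no-skill baseline."""
    return score_with_skill - baseline_score

# Cross-model: +9.4 transferred out of +11.4 in-domain.
cross_model_pct = round(retained_fraction(9.4, 11.4) * 100)   # 82

# Cross-harness: Claude Code no-skill 22.1, with the transferred skill 81.8.
cross_harness_gain = round(absolute_gain(22.1, 81.8), 1)       # 59.7
```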

Placing both papers side by side, their contributions form a closed loop:

SkillLens: Understand the problem
├── Finding: 25% negative transfer — driven by variable extraction quality
├── Finding: Format doesn't matter; content does
├── Finding: Plausible text ≠ downstream utility (46.4% = random)
└── Solution: Validated rubric (3 dimensions) + Meta-Skill improves extraction

SkillOpt: Systematically solve it
├── Core idea: Skill as trainable text parameter
├── Mechanism: Bounded edits + validation gate + rejected buffer + slow update
├── Results: 52/52 best-or-tied, +17.6 average across 7 models
└── Properties: compact artifact, transferable, auditable, zero inference overhead

Both papers converge on the same core insight: a Skill should not be a static document written by intuition. It should be a dynamically optimized artifact driven by execution data. SkillLens tells you which dimensions genuinely matter. SkillOpt gives you the machinery to push those dimensions forward systematically.

If you maintain a Skill library:

If you're considering systematic skill optimization:

Two Microsoft papers, one cohesive answer to why skills sometimes fail and how to fix them systematically.

SkillLens maps the full three-stage skill lifecycle across five domains, six targets, and five extractors. It discovers that 25% of skill deployments produce negative transfer, that skill format is irrelevant while content depth is decisive, and that "reads well" is a poor predictor of "performs well." It distills these findings into three validated quality dimensions — Failure Mechanism Encoding, Actionable Specificity, High-Risk Action Blacklist — and packages them as a meta-skill prior that improves every evaluated extraction condition.

SkillOpt treats the skill document as a trainable text-space parameter. By combining bounded edit budgets, a strict held-out validation gate, a rejected-edit buffer for negative feedback, and an epoch-wise slow/meta update for long-horizon consolidation, it turns ad hoc skill editing into a controlled optimization loop. The result: best or tied-best on 52 of 52 evaluated cells, +17.6 average improvement across seven models, compact deployable artifacts (< 2,000 tokens, 1–4 accepted edits), and transfer that works across model scales, harnesses, and related benchmarks without touching model weights.

Skill optimization is graduating from craft to engineering.
