{"slug": "is-your-agent-skill-actually-good-microsoft-s-dual-paper-deep-dive-into-skill", "title": "Is Your Agent Skill Actually Good? Microsoft's Dual-Paper Deep Dive into Skill Evaluation and Self-Evolving Optimization", "summary": "Microsoft Research published two concurrent papers in May 2026—SkillLens and a companion work—revealing that agent skills can degrade task performance, with negative transfer occurring in 25% of cases and reaching 47% in some domains. The research introduces a three-stage skill lifecycle and two complementary metrics, Extraction Efficacy and Target Evolvability, demonstrating that a model's ability to execute tasks does not predict its ability to extract useful skills. The study found that skill format (ordered list, prose, etc.) has no detectable effect on performance, while the choice of extractor model significantly impacts outcomes.", "body_md": "You spent an afternoon crafting a carefully structured Skill for your agent. Clear steps, thorough edge-case notes, well-formatted output requirements. You tested it manually a few times, the outputs looked great. You shipped it.\n\nThree weeks later, you notice that some task success rates have gone down compared to before the Skill existed.\n\nThis is not a hypothetical. 
In May 2026, Microsoft Research published two concurrent papers — [SkillLens](https://arxiv.org/abs/2605.23899) (\"From Raw Experience to Skill Consumption\") and SkillOpt.\n\nOne paper answers \"why skills sometimes backfire.\" The other answers \"how to make skills systematically better.\" Together they sketch a new paradigm for agent capability improvement.\n\nMost practitioners think of a Skill as \"a block of text instructions for an agent.\" SkillLens decomposes this into a **three-stage lifecycle**:\n\n```\nStage 1: Experience Generation\n    Target model M runs training tasks, producing an experience pool\n    of trajectories (both successes and failures)\n    ↓\nStage 2: Skill Extraction\n    Extractor model E distills the experience pool into a structured\n    skill document — procedural knowledge under a fixed budget\n    ↓\nStage 3: Skill Consumption\n    The same target model M, equipped with the extracted skill,\n    is evaluated on held-out test tasks\n```\n\nNotice there are two distinct roles in this chain: the **Extractor** (distills knowledge from trajectories) and the **Target** (consumes knowledge to improve task performance). SkillLens's central insight is that **these two roles are independent — a strong task executor is not necessarily a strong extractor, and vice versa**.\n\nTo separate these two effects, the paper introduces two complementary metrics:\n\n**Extraction Efficacy (EE)**: Fix an extractor. How reliably does it produce helpful skills across different target models?\n\n$$\\text{EE}(E, \\mathcal{D}) = \\frac{1}{|\\mathcal{M}|} \\sum_{M \\in \\mathcal{M}} \\Delta(E, M, \\mathcal{D})$$\n\n**Target Evolvability (TE)**: Fix a target model. 
How much does it improve when different extractors distill skills from its own experience?\n\n$$\\text{TE}(M, \\mathcal{D}) = \\frac{1}{|\\mathcal{E}|} \\sum_{E \\in \\mathcal{E}} \\Delta(E, M, \\mathcal{D})$$\n\nIn both formulas, $\\Delta(E, M, \\mathcal{D})$ is the change in target $M$'s test performance on domain $\\mathcal{D}$ when it consumes the skill extracted by $E$; a negative value means the skill hurt.\n\nA useful analogy: EE measures \"how good a teacher is at helping many different students,\" while TE measures \"how much a student can learn from many different teachers.\" Ideally, both are high.\n\nThe study is comprehensive, spanning five domains, six target models, and five extractors.\n\nA key design principle is **minimal scaffolding**: the extraction framework intentionally strips out domain-specific heuristics, filtering rules, and optimization tricks. Only a bare two-stage \"per-trajectory analysis → hierarchical consolidation\" pipeline remains, ensuring performance differences reflect the extractor's intrinsic capability, not pipeline engineering.\n\nTable 1 shows a stark picture:\n\nModel-generated skills improve downstream performance in 75% of entries. Yet negative transfer remains common: 25% of entries have Δ < 0.\n\nNegative transfer rates vary substantially across domains: SpreadsheetBench and SWE-bench-Verified show the lowest rates (~13%), while ALFWorld reaches 47% — nearly half the time, adding a skill to this domain *hurts* rather than helps.\n\nEven more counterintuitive: **task-execution strength does not predict extraction quality**. On SpreadsheetBench, the lightweight Gemini-3.1-Flash-Lite achieves the highest EE, while GPT-5.4 — the strongest performer on the benchmark itself — ranks last as an extractor. Converting target-specific trajectories into procedural guidance that the target can actually use is a distinct capability from solving the tasks.\n\nYou might guess that an ordered-list Skill outperforms a prose-format Skill. 
SkillLens tested this directly: the authors rewrote the same skill into four canonical formats (ordered list, unordered list, checklist, prose) and used the Friedman test to check whether any format consistently ranks higher.\n\nResult: **Format has no detectable effect on any target (all p > 0.34)**. Swapping extractors, by contrast, produced significant effects on 5 of 6 targets (p < 0.005).\n\nThe implication is direct: obsessing over Skill formatting is wasted effort. What the skill *says* matters far more than *how it's laid out*.\n\nThe most surprising finding concerns whether an LLM can tell a good skill from a bad one by reading it. The experiment asks GPT-5.4 to act as a judge: given two skills extracted from the same (target model, domain) pair, pick the one that will perform better on downstream tasks.\n\n**The unguided judge achieves only 46.4% accuracy, indistinguishable from random guessing (50%)**. On pairs where the actual performance gap is δ ≥ 5% (clearly better), the judge picks the genuinely higher-performing skill only **15.8%** of the time — a clear *inversion* of actual utility.\n\nIn other words, the skill that reads more fluently and coherently tends to be the one that performs *worse* downstream. Textual plausibility has come apart from actual utility.\n\nThis has a direct practical implication: **you cannot reliably screen skills by asking an LLM to judge the text**. The quality gap lies deeper than surface form.\n\n**Stage 1 (Experience Generation): Success/Failure Ratio Sets the Ceiling**\n\nFixing the extractor (GPT-5.4-mini), the researchers sampled experience pools with success ratios of 0%/25%/50%/75%/100% from the same trajectories and compared the resulting skills.\n\nKey finding: **experience composition strongly shapes skill quality, and the optimal success-failure ratio is domain-specific**.\n\nOne universal rule: **all-failure pools consistently produce the worst skills**. 
Successful trajectories are the foundation of skill extraction — they provide positive procedural signals that narrow the agent's action space rather than merely listing what to avoid.\n\n**Stage 2 (Skill Extraction): Depth Matters, Not Aesthetics**\n\nStarting from \"why do plausible-looking skills fail?\", a qualitative inspection of high-Δ vs. low-Δ skill pairs reveals the real difference:\n\nHigh-Δ skills provide **concrete, executable remedies** (e.g., \"when the host engine doesn't evaluate formula strings, precompute static values and write them into cells directly\"). Low-Δ skills offer **generic platitudes** (e.g., \"resolve the contract before coding\").\n\nA vivid analogy: a high-quality skill reads like a **practitioner's debugging journal** — recording specific failure modes in specific contexts and their concrete fixes. A low-quality skill reads like a \"we all know this already\" lecture about best practices.\n\n**Stage 3 (Skill Consumption): Same Skill, Very Different Effects Across Targets**\n\nInjecting the same skill into different models can produce wildly different results. On SpreadsheetBench, the strong-pool skill boosts GPT-5.4 by +9.0 but produces negative transfer on some Qwen3.5-9B conditions.\n\nBehavioral analysis explains why: skill consumption **reshapes the target's default policy** rather than triggering new explicit tool calls. For GPT-5.4, the skill steers it away from writing spreadsheet formulas toward computing results in Python and writing back static values — exactly the right strategy correction for formula stability issues. For Qwen3.5-9B, the same guidance pushes it toward more complex workbook-native workflows that improve structural correctness on sheet-level tasks but introduce more execution failures on fine-grained operations.\n\nThe analysis reveals that skill quality is driven by hidden dimensions that are invisible on the surface. 
RQ3 asks: **can these findings be turned into a concrete, drop-in improvement to the extraction process itself?**\n\n**Step 1: Discover which dimensions actually predict utility**\n\nThe paper designs a fully automated rubric-discovery pipeline:\n\n**Only 3 dimensions consistently align with utility**, forming the **validated rubric**:\n\n| Dimension | What It Captures |\n|---|---|\n| Failure Mechanism Encoding | Does the skill encode specific failure modes and their triggers? |\n| Actionable Specificity | Does the skill provide executable guidance tailored to concrete situations? |\n| High-Risk Action Blacklist | Does the skill explicitly name high-risk operations to avoid? |\n\n**Step 2: Verify the rubric's discriminating power**\n\nUsing the 3-dimension validated rubric to guide the judge raises overall accuracy from **46.4% to 73.8%** on 151 high-gap pairs. On the hardest pairs (δ ≥ 5%), where the unguided judge was actively wrong at 15.8%, the guided judge now picks correctly the majority of the time.\n\n**Step 3: Operationalize it as a Meta-Skill**\n\nThe validated rubric is packaged as a compact **meta-skill**: a generation-time prior injected directly into the extractor's system prompt. Testing it against three conditions:\n\nThe results are unambiguous:\n\nThe plausibility rubric hurts average performance (−0.59pp, 6 of 9 cells regress). The validated rubric improves all nine cells (+1.55pp average), with the largest gains on SpreadsheetBench (+2.3 to +3.7pp).\n\nUsing \"seems reasonable\" criteria actively damages extraction. Only empirically validated dimensions reliably improve it.\n\nSkillLens tells us skills can be systematically evaluated and improved. 
SkillOpt asks: **can we apply an optimization loop to the skill document itself — the same discipline that makes weight-space optimization reproducible?**\n\nThis analogy is the foundation of SkillOpt's entire design, laid out explicitly in the paper:\n\n| Deep Learning | SkillOpt Equivalent |\n|---|---|\n| Parameter (weights) | Skill document |\n| Gradient direction | Trajectory-derived edit direction |\n| Learning rate | Edit budget $L_t$ |\n| Validation check | Held-out selection gate |\n| Stable training setting | Batch / minibatch / schedule / gate |\n\nThis is not decorative framing — it's **operational**. Batch size, learning rate, validation, momentum — every concept has a corresponding text-space implementation in SkillOpt.\n\nThe optimization pipeline operates at two timescales: per-step updates and epoch-wise consolidation.\n\nAt each optimization step, the frozen target model runs a batch of training tasks with the current skill and produces scored trajectories. Small batches update quickly but noisily; larger batches expose more stable patterns. SkillOpt also supports accumulation: multiple rollout batches can be reflected on separately and merged into one update, decoupling execution throughput from update frequency.\n\nThe **optimizer model** (a separate frontier LLM) converts trajectories into structured skill edits. Crucially, it processes **both failure and success trajectories**:\n\nSingle trajectories tend to produce anecdotal fixes; minibatches expose reusable procedural errors — the agent *consistently* searches the wrong source, *always* formats answers incorrectly, or *reliably* fails to verify tool results.\n\nLocal proposals are merged hierarchically: failure-driven edits consolidated first, then success-driven edits, with failure corrections given priority. 
This filters duplicates, contradictions, and instance-specific suggestions before the optimizer selects the final bounded update.\n\n**This is the sharpest distinction between SkillOpt and \"just rewrite the skill when it seems wrong.\"**\n\nEach optimization step has an **edit budget $L_t$**: the optimizer may apply at most $L_t$ add/delete/replace operations to the skill document. Candidate edits ranked below the cutoff are discarded.\n\nWhy bounding matters:\n\nSkillOpt supports four edit budget schedules: constant, linear, cosine (default), and autonomous. The default cosine schedule starts with larger edit budgets and decays toward smaller consolidation steps.\n\nEvery candidate skill is evaluated on the held-out selection split. SkillOpt accepts a candidate **only if its selection score is strictly greater than the current best** — ties are rejected. This converts reflection into propose-and-test optimization rather than unconditional self-editing.\n\n**Rejected edits don't disappear.** The optimizer records an epoch-local **rejected-edit buffer** containing:\n\nLater reflection calls in the same epoch receive this buffer, steering the optimizer away from known-harmful directions. This provides **negative feedback at zero additional inference cost during deployment**.\n\nThink of it as running an A/B test on every proposed Skill revision: only the version that passes validation gets promoted. Rejected attempts become institutional memory that prevents repeating mistakes.\n\nFast updates learn from the current rollout batch. The slow/meta update learns from **the comparison across adjacent epochs** — longer-horizon patterns that individual batches can't expose.\n\nAt epoch end, SkillOpt runs the same training items under both the previous epoch's skill and the current epoch's skill, categorizing results into: improvements, regressions, persistent failures, and stable successes. 
The optimizer model writes a concise **longitudinal guidance block** — capturing which edit patterns helped, which failed, and which failure modes persist across epochs.\n\nThis guidance is stored in a **protected region** of the skill document that step-level edits cannot overwrite, preventing short-term noise from erasing long-term lessons.\n\nKey deployment note: the optimizer-side meta skill is never shipped with `best_skill.md`. It only lives in the teacher's reflection context. The deployed artifact stays compact and portable; the training process benefits from the full editing history.\n\nSkillOpt is evaluated across 6 benchmarks × 7 target models × 3 execution harnesses (direct chat, Codex agentic loop, Claude Code agentic loop), against 7 baselines: no skill, human skill, one-shot LLM skill, Trace2Skill, TextGrad, GEPA, and EvoSkill.\n\n**SkillOpt is best or tied-best on all 52 evaluated (model, benchmark, harness) cells.**\n\nKey headline numbers for GPT-5.5:\n\n| Execution mode | Average gain over no-skill |\n|---|---|\n| Direct chat | +23.5 points |\n| Codex harness | +24.8 points |\n| Claude Code harness | +19.1 points |\n\nIndividual benchmark gains are striking: OfficeQA jumps from 33.1 to 72.1 (+39.0), SpreadsheetBench from 41.8 to 80.7 (+38.9), ALFWorld on GPT-5.4-nano from 34.3 to 69.4 (×2.0). 
Procedural benchmarks with strict format requirements see the largest absolute gains — exactly where frontier models are most exposed zero-shot.\n\nTable 3 ablation results confirm each component contributes:\n\n| Component removed | SearchQA drop | SpreadsheetBench drop | LiveMath drop |\n|---|---|---|---|\n| Learning-rate form (unbounded) | -2.5 | -1.8 | -4.0 |\n| Rejected-edit buffer | -1.6 | -4.6 | -2.4 |\n| Meta skill + slow update | -0.6 | -22.5 | -3.2 |\n\nRemoving the slow/meta update is most damaging for SpreadsheetBench (−22.5 points), because this benchmark requires accumulated procedural knowledge — output format conventions, formula evaluation strategies — exactly what the epoch-wise slow update is designed to protect.\n\nFigure 4 in the paper reproduces one representative learned rule per benchmark, verbatim from the deployed `best_skill.md`:\n\nSearchQA: \"Infer the expected answer type from clue wording, then choose the shortest canonical entity supported by co-occurring distinctive evidence.\"\n\nSpreadsheetBench: \"Inspect workbook structure and formulas, then write evaluated static values across the full requested target range instead of relying on Excel recalculation.\"\n\nOfficeQA: \"Treat oracle parsed pages as primary evidence, lock table/date/unit context, and output exactly the requested rounded value without extra labels.\"\n\nLiveMathematicianBench: \"In strongest-statement MCQs, rank choices by theorem strength and prefer a justified stronger-result option over true but weaker corollaries.\"\n\nALFWorld: \"Keep a horizon-aware visited/frontier ledger, diversify search after repeated same-type failures, and avoid revisiting the destination until holding the target.\"\n\nThree properties stand out:\n\n**Compactness**: final `best_skill.md` lengths range from 379 tokens (LiveMathematicianBench) to 1,995 tokens (SpreadsheetBench), with a median around 920 tokens. 
The number of actually accepted edits is 1 to 4 (median 2.5) — the optimizer proposes far more, but only a handful survive the validation gate.\n\nOne of SkillOpt's most compelling findings is that the optimized artifact **transfers** well beyond its training setting.\n\n**Cross-model transfer**: A SpreadsheetBench skill trained on GPT-5.4 retains 82% of the in-domain gain when transferred to GPT-5.4-mini (+9.4 of +11.4). A LiveMath skill transferred to GPT-5.4-nano actually *surpasses* the in-domain SkillOpt reference (28.8 transferred vs. 27.2 in-domain).\n\n**Cross-harness transfer**: A SpreadsheetBench skill trained in the Codex loop transfers to Claude Code with a +59.7 point absolute gain over the Claude Code no-skill baseline (22.1 → 81.8), **exceeding the Claude Code in-domain SkillOpt reference of 80.4**. The transferred skill appears to encode workbook-level procedures — structure-first inspection, formula-aware verification, static-value materialization — that are harness-agnostic.\n\n**Cross-benchmark transfer**: An OlympiadBench skill applied to Omni-MATH (sharing only the broad \"math\" family) produces positive gains across all three model scales (+3.7/+1.8/+1.3), supporting the interpretation that the skill encodes reusable mathematical procedure rather than memorized test-specific formatting.\n\n**Practical implication**: optimize a skill in one execution environment, reuse it across multiple models, harnesses, and related benchmarks — without touching model weights.\n\nPlacing both papers side by side, their contributions form a closed loop:\n\n```\nSkillLens: Understand the problem\n├── Finding: 25% negative transfer — driven by variable extraction quality\n├── Finding: Format doesn't matter; content does\n├── Finding: Plausible text ≠ downstream utility (46.4% = random)\n└── Solution: Validated rubric (3 dimensions) + Meta-Skill improves extraction\n\nSkillOpt: Systematically solve it\n├── Core idea: Skill as trainable text parameter\n├── 
Mechanism: Bounded edits + validation gate + rejected buffer + slow update\n├── Results: 52/52 best-or-tied, +17.6 average across 7 models\n└── Properties: compact artifact, transferable, auditable, zero inference overhead\n```\n\nBoth papers converge on the same core insight: **a Skill should not be a static document written by intuition. It should be a dynamically optimized artifact driven by execution data.** SkillLens tells you which dimensions genuinely matter. SkillOpt gives you the machinery to push those dimensions forward systematically.\n\n**If you maintain a Skill library:**\n\n**If you're considering systematic skill optimization:**\n\nTwo Microsoft papers, one cohesive answer to why skills sometimes fail and how to fix them systematically.\n\n**SkillLens** maps the full three-stage skill lifecycle across five domains, six targets, and five extractors. It discovers that 25% of skill deployments produce negative transfer, that skill format is irrelevant while content depth is decisive, and that \"reads well\" is a poor predictor of \"performs well.\" It distills these findings into three validated quality dimensions — Failure Mechanism Encoding, Actionable Specificity, High-Risk Action Blacklist — and packages them as a meta-skill prior that improves every evaluated extraction condition.\n\n**SkillOpt** treats the skill document as a trainable text-space parameter. By combining bounded edit budgets, a strict held-out validation gate, a rejected-edit buffer for negative feedback, and an epoch-wise slow/meta update for long-horizon consolidation, it turns ad hoc skill editing into a controlled optimization loop. 
The result: best or tied-best on 52 of 52 evaluated cells, +17.6 average improvement across seven models, compact deployable artifacts (< 2,000 tokens, 1–4 accepted edits), and transfer that works across model scales, harnesses, and related benchmarks without touching model weights.\n\nSkill optimization is graduating from craft to engineering.", "url": "https://wpnews.pro/news/is-your-agent-skill-actually-good-microsoft-s-dual-paper-deep-dive-into-skill", "canonical_source": "https://dev.to/wonderlab/is-your-agent-skill-actually-good-microsofts-dual-paper-deep-dive-into-skill-evaluation-and-28b7", "published_at": "2026-05-31 09:51:01+00:00", "updated_at": "2026-05-31 10:12:50.167030+00:00", "lang": "en", "topics": ["ai-agents", "artificial-intelligence", "machine-learning", "ai-research"], "entities": ["Microsoft", "SkillLens"], "alternates": {"html": "https://wpnews.pro/news/is-your-agent-skill-actually-good-microsoft-s-dual-paper-deep-dive-into-skill", "markdown": "https://wpnews.pro/news/is-your-agent-skill-actually-good-microsoft-s-dual-paper-deep-dive-into-skill.md", "text": "https://wpnews.pro/news/is-your-agent-skill-actually-good-microsoft-s-dual-paper-deep-dive-into-skill.txt", "jsonld": "https://wpnews.pro/news/is-your-agent-skill-actually-good-microsoft-s-dual-paper-deep-dive-into-skill.jsonld"}}