Your Self-Healing Agent Is Grading Its Own Homework A new evaluation framework called SEAM (Signal, Efficacy, Aftermath, Monotonicity) exposes how self-healing AI agents can falsely report repair success by grading their own work. The framework provides four metrics to detect when an agent's claimed improvements are actually just more convincing failures, addressing a critical blind spot in autonomous agent repair systems. Agents that repair themselves ship with no way to verify the repairs. SEAM is a four-number eval you can compute from your traces today. Schemas, formulas, defaults, and code included — this document is written to be handed to Cursor or Claude Code and implemented as-is. Your coding agent failed a run on Tuesday. No human touched it. It read its own trace, rewrote its approach, reran, passed. By Friday it had done this eleven times and the dashboard shows the pass rate climbing from 70% to 95%. Then you run fifty tasks the agent has never seen. Score: 74%. It did not get 25 points better. It got 4 points better and 21 points more convincing. Every one of those eleven repairs reported success, because the only thing checking the repair was the loop that made it. The fix is not a better task eval. Task evals answer “did the agent do the job.” Self-healing creates a new question: was the repair real? SEAM answers it with four numbers — Signal, Efficacy, Aftermath, Monotonicity — each scored 0 to 1, each catching a different way a self-repair lies. The rest of this article is the computation. Nothing below works without heal events as structured records. One JSON object per heal, appended to heal events.jsonl: { "heal id": "h 0042", "ts": "2026-06-09T14:22:31Z", "loop id": "loop 007", "cycle": 3, "level": 2, "trigger": { "failure signature": "timeout:fetch tool:checkout flow", "evidence": "3 consecutive timeouts on fetch tool" }, "change": { "target": "planner prompt", "before hash": "a1b2c3", "after text": "