Trace-to-Training: how agent runs become learning data

WasmAgent converts agent execution traces into training data for supervised fine-tuning (SFT) and direct preference optimization (DPO) without human labeling. Its compliance engine evaluates each run, ranks outcomes, and exports typed ComplianceEvalRecords. On IFEval with Qwen2.5-1.5B-Q4 (3 seeds × 50 samples), the full repair loop (full_pcl) reaches a 54.7% pass rate, 8.7 percentage points above prompt retry at 46.0%. Compliance verification serves as the reward signal, so models can learn from failure traces as well as final outputs.

Every agent run is a data point. Most frameworks throw it away. WasmAgent keeps it: evaluated by the compliance engine, ranked by outcome, exported as a typed ComplianceEvalRecord ready for SFT or DPO training. No human labeling.

```js
import { ComplianceRun } from "@wasmagent/compliance";

const run = new ComplianceRun({
  mode: "full_pcl", // "direct" | "prompt_retry" | "full_pcl"
  taskSpec: {
    instruction: "Write a summary in exactly 3 bullet points.",
    constraints: [{ type: "format", rule: "bullet_count", value: 3 }],
  },
});

const result = await run.execute(agent, input);
// result.complianceEvalRecord → typed, versioned, schema-validated
```

- `direct`: one shot, record pass/fail.
- `prompt_retry`: retry once with a rephrased prompt.
- `full_pcl`: full repair loop: run → evaluate → patch/regenerate → re-evaluate → record the entire trace.

IFEval × Qwen2.5-1.5B-Q4 (3 seeds × 50 samples):

| Mode | Pass rate | Std dev |
|---|---|---|
| prompt_retry | 46.0% | ±2.0pp |
| full_pcl | 54.7% | ±1.2pp |

That is +8.7pp for `full_pcl`. The variance drop (±2.0 → ±1.2) matters for production reliability.

Reproduce:

```bash
bun packages/compliance/benchmarks/ifeval/run.ts --limit=50 --seed=42
```

When `full_pcl` repairs a failing output, RepairPlanner records every attempt:

```js
// Inside ComplianceEvalRecord
attempts: [
  { strategy: "direct", output: "...", passed: false },
  { strategy: "patch", output: "...", passed: false },
  { strategy: "regenerate", output: "...", passed: true },
]
```

The full sequence (what failed, what was tried, what worked) is what feeds DPO training. The model learns from failure traces, not just final outputs.

```js
import { RolloutForkRunner, RolloutRanker } from "@wasmagent/core";

const runner = new RolloutForkRunner({ forks: 4 });
const rollouts = await runner.run(agent, input, taskSpec);
const ranked = new RolloutRanker().rank(rollouts);
// ranked[0]   → chosen (SFT)
// ranked[1..] → rejected (DPO pairs)
```

The compliance verifier is the reward signal. No human annotation.
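To make the trace-to-pairs step concrete, here is a minimal sketch of how a recorded trace might be turned into training rows. It is an illustration under stated assumptions, not WasmAgent's exporter: the helper names `attemptsToDpoPairs` and `rankedToTrainingRows`, the `{ prompt, chosen, rejected }` row shape, and the `output` field on ranked rollouts are all hypothetical. Only the attempt fields (`strategy`, `output`, `passed`) and the chosen/rejected split come from the article.

```js
// Hypothetical helpers for illustration; not part of @wasmagent/compliance
// or @wasmagent/core.

// Pair the first passing attempt (chosen) with every failed attempt (rejected).
// Returns [] when no attempt passed, because there is nothing to prefer.
function attemptsToDpoPairs(prompt, attempts) {
  const chosen = attempts.find((a) => a.passed);
  if (!chosen) return [];
  return attempts
    .filter((a) => !a.passed)
    .map((rejected) => ({
      prompt,
      chosen: chosen.output,
      rejected: rejected.output,
    }));
}

// Split an already-ranked rollout list: the top rollout becomes the SFT
// target, and each remaining rollout becomes the rejected side of a DPO pair.
function rankedToTrainingRows(prompt, ranked) {
  if (ranked.length === 0) return { sft: null, dpo: [] };
  const [best, ...rest] = ranked;
  return {
    sft: { prompt, completion: best.output },
    dpo: rest.map((r) => ({
      prompt,
      chosen: best.output,
      rejected: r.output,
    })),
  };
}

const prompt = "Write a summary in exactly 3 bullet points.";
const attempts = [
  { strategy: "direct", output: "one long paragraph", passed: false },
  { strategy: "patch", output: "- a\n- b", passed: false },
  { strategy: "regenerate", output: "- a\n- b\n- c", passed: true },
];

const pairs = attemptsToDpoPairs(prompt, attempts);
console.log(pairs.length); // 2
```

In this sketch a trace with no passing attempt yields no pairs, so only verified outputs ever appear on the chosen side. Whether the real exporter applies the same filter is not stated in the article.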
```bash
git clone https://github.com/WasmAgent/wasmagent-js
cd wasmagent-js
bun install   # install workspace dependencies before running tests
bun test packages/compliance/
# 113 pass / 0 fail
```

Code: [packages/compliance](https://github.com/WasmAgent/wasmagent-js/tree/main/packages/compliance) · [RolloutForkRunner](https://github.com/WasmAgent/wasmagent-js/tree/main/packages/core/src/enhancement) · [RolloutRanker](https://github.com/WasmAgent/wasmagent-js/tree/main/packages/core/src/ranking)

Series: AEP (part 1) · MCP Trust Pack (part 2) · Trace-to-Training (part 3)