{"slug": "why-we-built-replay-before-everything-else", "title": "Why we built Replay before everything else", "summary": "The post explains why \"Replay\" was the first feature built for SafeRun, reliability infrastructure for AI agents, prioritizing post-incident debugging over pre-flight validation. It argues that most teams struggle to reproduce agent failures because observability tools only describe what happened, whereas Replay captures the complete, deterministic state of an agent run for frame-by-frame analysis. The author emphasizes that understanding a failure through replay is essential before creating rules to prevent it, making Replay the foundation of the product's reliability loop.", "body_md": "A note on building reliability infrastructure for AI agents, and why post-incident debugging matters more than pre-flight validation.\nA few weeks ago I started building SafeRun: inline reliability infrastructure for AI agents in production. The temptation, when you're building something in the agent reliability space, is to lead with validation. Block the bad action before it happens. Stop the runaway loop. Enforce the policy.\nThese are real features. SafeRun ships all of them. But they're not the first thing we built. The first thing we built was Replay.\nHere's why.\nThe failure mode no one talks about\nMost teams shipping AI agents into production discover the same problem after their first bad incident. The agent did something it shouldn't have. They go to investigate. And they find that they can't reproduce what happened.\nThe traces are flat. The logs don't show the model's reasoning between tool calls. The arguments to the failed call aren't fully captured. The retrieved context that informed the decision is missing. The agent's plan, if it had one, isn't anywhere.\nSo the engineer does what engineers do. They start rerunning the agent, trying to recreate the conditions that led to the failure. The agent is non-deterministic. The conditions change. 
They spend a weekend trying to reproduce one bad action.\nThis is the universal pain. I've talked to maybe twenty engineers shipping agents in production, and every single one of them has lived this. Not \"heard about it.\" Lived it.\nWhy observability tools don't solve this\nLangSmith, Langfuse, Helicone, Arize, and the broader observability category do something genuinely useful: they tell you what happened. But \"what happened\" is a description, not a reproduction. You can read a trace. You can't re-execute it.\nReplay is different. Replay means capturing the complete state of an agent run with enough fidelity to step through it frame by frame after the fact, see the exact arguments to each tool call, see the model's reasoning between calls, see the retrieved context at each decision point, see the policy that evaluated each action, see the decision that was returned.\nThis is a different engineering problem than logging. It requires deterministic state capture. It requires decision-time context snapshotting separately from outcome context. It requires versioning every policy and every rule and every classifier that participated in a decision. We built this first because everything else depends on it.\nThe four-step loop, and why Replay is the foundation\nSafeRun's product loop is Replay → Understand → Create Rule → Prevent.\nYou can't understand a failure you can't reproduce.\nYou can't create a rule to prevent a failure you don't understand.\nYou can't prevent a category of failure if your rule was created against an incomplete picture of what happened.\nThe order matters. Build Replay first, and everything else compounds. Build prevention first, and your rules will be flat patches against failures you don't fully see.\nThe Stripe boolean problem\nHere's the failure that taught me Replay matters more than any other layer.\nAn agent issues a Stripe refund instead of a charge because a single boolean flipped in the agent's planning step. 
The call shape is correct. The schema passes. Type-checking passes. Most observability tools log a successful refund and move on.\nThe engineer notices the next morning when the customer complains. They go to investigate. They have a trace. The trace tells them \"Stripe refund issued, amount $4,500, customer cus_9281.\" That's true. It tells them nothing about why.\nWith Replay, they can step back through the agent's decision frame by frame. See the user's request was actually a charge. See the agent's planning step had is_refund: false. See that somewhere between the plan and the tool call, the boolean flipped. See whether it was a model hallucination, a prompt injection, a code bug, or a retrieved-context misinterpretation.\nNow they know what to do. They can write a prevention rule. They can fix the upstream cause. They can ship a fix that actually prevents recurrence, instead of patching the symptom.\nThis is what Replay enables. None of the rest of the product matters without it.\nWhat we shipped, in order:\nPhase 0: Working prototype with six failure simulations, including the Stripe boolean problem.\nPhase 1: Persistent backend on Supabase. Replays survive page reload, browser close, account switch.\nPhase 2: POST /v1/check-action API with sub-50ms p95 latency. Decision-time context snapshotting (inputs, retrieved context, external state, policy version, evaluator model version) captured synchronously, persisted asynchronously. The replay is built from the decision, not assembled after.\nPhase 3: Python and TypeScript SDKs. Three-line install. @guard decorator wraps any tool call.\nPhase 4: Intent Guard — catches valid-shape, wrong-intent tool calls. The Stripe boolean problem from above. 
Visible confidence scores, threshold calibration as a product surface, feedback loop closes back into recalibration.\nPhase 5: Multi-tenant, project-scoped API keys, environment separation (dev logs, staging warns, production blocks), replay redaction, audit log, rule versioning.\nPhase 6: Design partner onboarding, Prevention Impact Dashboard.\nPhase 7: Self-hosted/VPC, SSO/SAML, audit log export, SOC 2 readiness, SafeRun as an MCP-callable tool.\nThe whole roadmap exists in service of the Replay layer. Every phase compounds on the previous one. Every feature ladders to Replay → Understand → Create Rule → Prevent.\nWhat's next\nWe're onboarding the first design partners now. Engineering teams shipping AI agents into production — agents that move real money, modify real customer data, talk to real customers. Free during the partnership in exchange for honest feedback.\nIf you're shipping agents and want to be one of the first teams running SafeRun in production, get in touch. saferun.dev.\nIf you're shipping agents and don't want to be a design partner but want to try the SDK, it's pip install saferun and three lines.\nEither way, the bet is this: replay the failure, prevent the next one. The first one always happens. 
The second one is the company's choice.", "url": "https://wpnews.pro/news/why-we-built-replay-before-everything-else", "canonical_source": "https://dev.to/saferunai/why-we-built-replay-before-everything-else-1ild", "published_at": "2026-05-20 02:09:59+00:00", "updated_at": "2026-05-20 02:34:57.532085+00:00", "lang": "en", "topics": ["artificial-intelligence", "developer-tools", "startups", "enterprise-software"], "entities": ["SafeRun", "Replay"], "alternates": {"html": "https://wpnews.pro/news/why-we-built-replay-before-everything-else", "markdown": "https://wpnews.pro/news/why-we-built-replay-before-everything-else.md", "text": "https://wpnews.pro/news/why-we-built-replay-before-everything-else.txt", "jsonld": "https://wpnews.pro/news/why-we-built-replay-before-everything-else.jsonld"}}