A backend service falls over at 2am and you know the drill: open the dashboard, follow the trace, find the bad deploy, roll back. Twenty years of tooling (logs, metrics, traces, APM) exists to answer "what just happened, and why?"
Now your robot bricks a grasp at 2am. What do you open?
There's no trace. The "request" was a 40-second episode of a policy reacting to the physical world. The failure isn't in a log line. It's in the half-second where the gripper closed early, which only makes sense if you can see the wrist camera, the joint torques, and the policy's action outputs on the same clock. You can't grep that. And the regression that caused it shipped because "it worked in sim" and nobody re-ran it against the 3,000 episodes where it used to work.
We have Datadog for services and Weights & Biases for training. We have almost nothing for the part in between: the run itself. That gap is where robotics observability lives, and it's about to matter a lot, because every team shipping VLA, imitation, and RL policies is hitting the same wall.
This is the whole thesis. Backend observability is built around the request. Robotics observability has to be built around the episode, a synchronized bundle of video, sensor streams, and policy actions,
all locked to a single timeline, and tagged with the four things that make it reproducible: policy_version, env_version, git_sha, seed.
Drop any of those and you've got a video you can't trust. Keep them, and an episode stops being a memory and becomes a test case.
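As a rough illustration of that data model, here is a minimal sketch in plain Python. The `Episode` class, its field names, and the `is_test_case` check are hypothetical and are not part of the RoboTrace SDK; only the four tag names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four tags named above; everything else in this sketch is an assumption.
REQUIRED_TAGS = ("policy_version", "env_version", "git_sha", "seed")


@dataclass
class Episode:
    """Hypothetical container: reproducibility tags plus per-stream timestamps."""
    tags: Dict[str, object]
    streams: Dict[str, List[int]] = field(default_factory=dict)

    def add_stream(self, name: str, timestamps_ms: List[int]) -> None:
        # Every stream must be non-decreasing on the shared episode clock.
        pairs = zip(timestamps_ms, timestamps_ms[1:])
        if any(later < earlier for earlier, later in pairs):
            raise ValueError(f"stream {name!r} is not ordered in time")
        self.streams[name] = list(timestamps_ms)

    def missing_tags(self) -> List[str]:
        # A seed of 0 is valid, so only None and "" count as missing.
        return [k for k in REQUIRED_TAGS if self.tags.get(k) in (None, "")]

    def is_test_case(self) -> bool:
        # Reproducible only if every tag is present and there is data to replay.
        return not self.missing_tags() and bool(self.streams)


ep = Episode(tags={"policy_version": "grasp-v7", "env_version": "cell-3", "seed": 42})
ep.add_stream("video", [0, 33, 66])
print(ep.missing_tags())  # git_sha was never recorded for this episode
```

Checking the tags when the episode is recorded, rather than at replay time, is one way to keep untrustworthy videos out of the history in the first place.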
Steal the shape of service observability, but redefine each pillar for physical runs:
import robotrace as rt

ep = rt.log_episode(
    policy_version="grasp-v7",
    env_version="cell-3",
    seed=42,
    video="wrist_cam.mp4",
    sensors="joint_states.npz",    # timestamped
    actions="policy_outputs.npz",  # same clock
)
Replay. Scrub every run frame-accurately, with all streams on one timeline. On the frame where it broke, copy a ?t=12840ms link, and a teammate lands on the exact moment. This is the "follow the trace" of robotics.
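One way a timestamp link could resolve to "the exact moment" across streams that tick at different rates is a per-stream lookup of the latest sample at or before the requested time. The sketch below uses only the standard library; the URL, the stream names, and the sample-and-hold choice are assumptions for illustration, not a description of how RoboTrace resolves links.

```python
from bisect import bisect_right
from urllib.parse import urlparse, parse_qs


def parse_t_ms(url):
    """Read the millisecond offset out of a '?t=12840ms' style link."""
    raw = parse_qs(urlparse(url).query)["t"][0]
    return int(raw[:-2] if raw.endswith("ms") else raw)


def sample_at(timestamps_ms, t_ms):
    """Index of the latest sample at or before t_ms, or None if t_ms is too early."""
    i = bisect_right(timestamps_ms, t_ms) - 1
    return i if i >= 0 else None


def frame_at(streams, t_ms):
    """For each stream (name -> sorted timestamps), pick the sample shown at t_ms."""
    return {name: sample_at(ts, t_ms) for name, ts in streams.items()}


# Hypothetical episode: a slow camera and faster joint/action streams.
streams = {
    "video": [12800, 12833, 12866],
    "joint_states": [12830, 12835, 12840, 12845],
    "actions": [12820, 12840, 12860],
}
t = parse_t_ms("https://example.invalid/episodes/abc?t=12840ms")
print(frame_at(streams, t))
```

Holding the last sample rather than interpolating keeps the replay honest about what the policy could have seen at that instant.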
Explain. A failed run should tell you why, not hand you a metadata dump. Ranked root causes (replay regression, raised exception, battery brownout, action saturation) surfaced the moment the episode finalizes.
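To make "ranked root causes" concrete, here is a deliberately simple sketch that scores three of the causes named above with hand-written heuristics and sorts them. The function names, thresholds, and scoring rules are all assumptions chosen for illustration; a real ranking would likely draw on far more signal.

```python
def saturation_fraction(actions, limit):
    """Share of action values pinned at or beyond +/-limit."""
    flat = [abs(a) for step in actions for a in step]
    if not flat:
        return 0.0
    return sum(1 for a in flat if a >= limit) / len(flat)


def rank_causes(actions, action_limit, battery_volts, brownout_volts, exception=None):
    """Return (cause, score) pairs, highest score first. Scores are heuristic."""
    candidates = []
    if exception is not None:
        # Assumption: a raised exception is always the strongest signal.
        candidates.append(("raised exception", 1.0))
    sat = saturation_fraction(actions, action_limit)
    if sat > 0:
        candidates.append(("action saturation", sat))
    if battery_volts and min(battery_volts) < brownout_volts:
        # Score by how far the supply sagged below the threshold, capped at 1.
        depth = (brownout_volts - min(battery_volts)) / brownout_volts
        candidates.append(("battery brownout", min(1.0, depth)))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Hypothetical failed episode: half the action values are pinned at the limit.
ranked = rank_causes(
    actions=[[0.2, 1.0], [1.0, -1.0], [0.1, 0.3]],
    action_limit=1.0,
    battery_volts=[24.1, 23.8, 18.0],
    brownout_volts=20.0,
)
for cause, score in ranked:
    print(f"{cause}: {score:.2f}")
```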
Verify. The one service observability never needed. Before you ship a new policy to real hardware, re-roll it against thousands of historical episodes and read the diff: where does the candidate do better, where does it regress? Gate the deploy on that, without booking another hour on the arm.
That last pillar is the point. It closes the loop from "we recorded a run" to "we won't ship that regression to a real robot."
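A deploy gate of this kind can be sketched as a diff over per-episode outcomes. The example assumes the re-roll step has already produced a success flag per historical episode for both the current and the candidate policy; the function names, episode IDs, and the zero-regression default are hypothetical.

```python
def diff_outcomes(baseline, candidate):
    """Compare success flags (episode_id -> bool) on episodes both policies ran."""
    shared = baseline.keys() & candidate.keys()
    improved = sorted(e for e in shared if not baseline[e] and candidate[e])
    regressed = sorted(e for e in shared if baseline[e] and not candidate[e])
    return {"compared": len(shared), "improved": improved, "regressed": regressed}


def gate(diff, max_regressions=0):
    """Allow the deploy only if regressions stay within the budget."""
    return len(diff["regressed"]) <= max_regressions


# Hypothetical outcomes for four historical episodes.
baseline = {"ep-001": True, "ep-002": False, "ep-003": True, "ep-004": True}
candidate = {"ep-001": True, "ep-002": True, "ep-003": False, "ep-004": True}

diff = diff_outcomes(baseline, candidate)
print(diff)
print("ship" if gate(diff) else "blocked")
```

Gating on the list of regressed episodes rather than on the aggregate success rate matters: a candidate can raise the average while still breaking runs that used to work.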
RoboTrace: log an episode in a few lines, get a portal URL back
Three things are converging. Foundation/VLA policies are getting cheap enough to iterate on weekly, so the bottleneck moves from training to trusting. Real-robot time stays expensive and scarce, so you can't validate by re-running on hardware; you validate against history. And teams are scaling from one robot to fleets, where "it worked on my arm" is no longer an argument.
The teams that win won't be the ones with the flashiest policy. They'll be the ones who can answer "is this safe to ship?" in minutes instead of days, because they treated every run as a reproducible, replayable, re-rollable artifact from day one.
Robotics is about to get its observability moment, the same way backend did in the 2010s and ML training did with experiment trackers. The winning tool won't be a log viewer bolted onto robots. It'll be episode-first: replay, explain, and regression-gating as first-class primitives, with reproducibility baked into the data model instead of bolted on later.
That's the bet we're making with RoboTrace, observability and evals for AI-powered robots. The SDK is pip install robotrace-dev (early access during alpha), and the source lives on
You can't grep a robot. So let's build the thing you can do instead.