Sessions measure user activity. Agent runs measure delegated work. Learn why the agent run is the right unit for measuring AI product performance.
Sessions Are the Wrong Unit of Measurement #
When you build a web app, you track sessions. Someone visits, clicks around, and leaves. The session is the unit — it captures one person’s engagement with your product over a period of time.
But AI products don’t work that way. When a user delegates a task to an AI agent, they’re not browsing. They’re handing off work. The agent might query a database, call three external APIs, generate a report, and send an email — all in response to a single instruction. Nothing about that is a “session.”
The agent run is the right unit for measuring AI product performance. It captures what actually happened: what the agent was asked to do, what it did, and whether the outcome was useful. If you’re building AI-powered products and you’re still measuring them like traditional web apps, you’re flying with the wrong instruments.
This article explains what an agent run is, why it matters for product analytics, and what teams should actually be tracking when they ship AI agents.
What Is an Agent Run? #
An agent run is a single, bounded execution of an AI agent — from the moment it receives a task to the moment it completes (or fails to complete) that task.
Think of it as the atomic unit of agentic work. One prompt in, one outcome out, with everything in between logged.
A run typically includes:
- **The input**: What the user (or another system) asked the agent to do
- **The steps taken**: Which tools were called, which models were used, which branches of logic were executed
- **The output**: What the agent produced or actioned
- **Metadata**: Duration, token usage, cost, status (completed, failed, timed out)
This is fundamentally different from a session, which measures user presence. An agent run measures delegated work — something the system did on the user’s behalf.
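As a rough illustration, the four parts of a run listed above could be captured in a record like the one below. This is a minimal sketch, not any platform's schema: the class names, field names, and status labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class RunStatus(Enum):
    # Status values mirror the ones named in the text.
    COMPLETED = "completed"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


@dataclass
class Step:
    kind: str          # hypothetical labels, e.g. "model_call" or "tool_call"
    name: str          # which model or tool was used
    duration_s: float
    ok: bool = True


@dataclass
class AgentRun:
    run_id: str
    input_summary: str                       # what the agent was asked to do
    steps: List[Step] = field(default_factory=list)
    output_summary: Optional[str] = None     # what the agent produced or actioned
    status: RunStatus = RunStatus.COMPLETED
    duration_s: float = 0.0
    tokens: int = 0
    cost_usd: float = 0.0


# One illustrative run with made-up values.
run = AgentRun(
    run_id="run-001",
    input_summary="summarize contract",
    steps=[
        Step("model_call", "example-model", 2.5),
        Step("tool_call", "send_email", 0.4),
    ],
    output_summary="summary emailed",
    duration_s=2.9,
    tokens=1800,
    cost_usd=0.03,
)
print(run.status.value, len(run.steps))
```

Note that nothing in the record refers to a session or a logged-in user; a scheduled or webhook-triggered run fits the same shape.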
Why the Distinction Matters
A single user session in an AI product might contain zero agent runs (the user just read something) or twenty (they asked the agent to process a batch of documents). Counting sessions tells you nothing about how much work your agents are actually doing or how well they’re doing it.
Conversely, a single agent run might have zero active users involved at all. Scheduled agents, webhook-triggered automations, and background processing tasks all produce agent runs with no human in the loop at the time of execution.
If you measure AI products the same way you measure content websites, you’ll consistently misread what’s working.
Why Traditional Web Analytics Break Down for AI Products #
Traditional analytics were built around human behavior. Pageviews, bounce rates, session duration — these metrics assume a person is navigating through a product and making decisions.
AI agents change the ratio of human decisions to system actions. In a traditional app, every meaningful action traces back to a click. In an agentic app, a single click can trigger hundreds of downstream actions.
The Measurement Gap
Here’s where the gap shows up in practice:
Latency means something different. In a web app, a 5-second load time is bad UX. In an agent-based product, a 30-second run time might be perfectly acceptable — or it might signal a stuck loop, a failing API call, or an over-engineered prompt chain. You need run-level data to know which.
Errors are structural, not incidental. When a webpage 404s, it’s usually a broken link. When an agent run fails, it could mean a tool call timed out, a model hallucinated an invalid function call, or the input was malformed. Error type and error location within the run are the signal — just knowing “it failed” isn’t enough.
Cost is variable per run. Traditional apps have relatively predictable per-user infrastructure costs. Agent runs have highly variable token consumption depending on task complexity, model selection, and how many steps the agent takes. Tracking cost at the run level is how you find the 10% of tasks eating 60% of your compute budget.
Output quality isn’t binary. A button either works or it doesn’t. An agent output might be technically successful but practically useless. Run-level analytics need to accommodate quality signals — user ratings, downstream task completion, retry rates — that don’t exist in traditional funnels.
The Anatomy of an Agent Run #
Understanding what’s inside a run is the foundation of good AI product analytics. Every run has a few core layers worth instrumenting.
The Input Layer
This is what triggered the run: a user prompt, a scheduled cron trigger, an incoming webhook, or a call from another agent. The input layer tells you what category of work is being requested and, over time, which requests are most common, most error-prone, or most expensive.
Logging input structure rather than raw input (which may contain private data) lets you identify patterns. Are certain types of requests consistently failing? Are users asking for things the agent wasn’t designed to handle?
The Execution Layer
This is the meat of the run — every step the agent took to produce its output. For a multi-step agent, this might include:
- Model calls (which model, how many tokens, what was the latency)
- Tool invocations (which tools, whether they succeeded, how long they took)
- Branching logic (which conditional paths were taken)
- Retries and fallbacks (did the agent have to try something more than once)
The execution layer is where you diagnose problems. A run that took 45 seconds instead of 5 seconds probably has a story to tell somewhere in the execution trace.
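To make that concrete, here is a small sketch of reading an execution trace to find where the time went. The trace format, step names, and numbers are invented for illustration; a real trace would come from whatever logging your agent framework provides.

```python
# Hypothetical execution trace: one dict per step, in the order the agent ran them.
trace = [
    {"step": "model_call:plan", "duration_s": 1.2, "ok": True, "attempts": 1},
    {"step": "tool_call:query_db", "duration_s": 38.5, "ok": True, "attempts": 3},
    {"step": "model_call:write_report", "duration_s": 4.1, "ok": True, "attempts": 1},
    {"step": "tool_call:send_email", "duration_s": 0.9, "ok": True, "attempts": 1},
]

# Total run time and the single step that contributed the most to it.
total_s = sum(s["duration_s"] for s in trace)
slowest = max(trace, key=lambda s: s["duration_s"])
share = slowest["duration_s"] / total_s

print(f"total: {total_s:.1f}s")
print(f"slowest: {slowest['step']} ({share:.0%} of run time, {slowest['attempts']} attempts)")
```

In this made-up trace, one retried database call accounts for most of the run, which is the kind of story a run-level duration alone would not tell.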
The Output Layer
This is what the agent produced — a message, a document, a completed action, a data transformation. Output layer analytics should track:
- **Completion rate**: Did the agent finish, or did it time out or error?
- **Output type**: What kind of thing was produced?
- **Quality signals**: Any downstream feedback about whether the output was useful
The Cost Layer
Token usage, API call costs, and infrastructure time should be tracked at the run level. This gives you a per-task cost structure, which is essential for pricing AI products and for identifying runs that are dramatically more expensive than expected.
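One way to spot disproportionately expensive runs is to check how much of the total spend comes from the most expensive slice of runs. The sketch below assumes you already have a per-run cost in dollars; the function name and the sample costs are hypothetical.

```python
def cost_share_of_top_runs(costs, top_fraction=0.10):
    """Return the fraction of total cost contributed by the most expensive runs."""
    if not costs:
        return 0.0
    ranked = sorted(costs, reverse=True)
    # Always include at least one run, even for very small samples.
    top_n = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Made-up per-run costs in USD: nine cheap runs and one expensive outlier.
run_costs_usd = [0.01, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.87]
share = cost_share_of_top_runs(run_costs_usd)
print(f"top 10% of runs account for {share:.0%} of cost")
```

A high share suggests looking at what those few runs were asked to do before changing pricing or models across the board.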
Key Metrics Built Around Agent Runs #
Once you accept the agent run as your primary unit of measurement, a different set of metrics becomes useful.
Run Volume and Run Rate
How many runs is your product executing per day, per user, per workflow? Run volume tells you adoption and utilization in a way that user counts don’t. A product with 500 users running 10,000 agent runs per day is doing fundamentally different work than one with 5,000 users running 500 runs per day.
Success Rate
What percentage of runs complete without errors? This is your baseline health metric. A declining success rate is often the first signal that something in your stack has changed — a model update, a tool API change, or a new pattern of user input the agent wasn’t designed for.
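A minimal sketch of tracking success rate over time, assuming each logged run carries a day label and a status string. The log format and the sample data are invented for the example.

```python
from collections import defaultdict

# Hypothetical run log: (day, status) pairs.
runs = [
    ("day-1", "completed"), ("day-1", "completed"),
    ("day-1", "completed"), ("day-1", "failed"),
    ("day-2", "completed"), ("day-2", "failed"),
    ("day-2", "timed_out"), ("day-2", "completed"),
]


def success_rate_by_day(run_log):
    """Map each day to the fraction of its runs that completed without error."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for day, status in run_log:
        totals[day] += 1
        if status == "completed":
            successes[day] += 1
    return {day: successes[day] / totals[day] for day in totals}


rates = success_rate_by_day(runs)
for day, rate in rates.items():
    print(f"{day}: {rate:.0%}")
```

Grouping by day (or by deploy, or by model version) is what turns the rate into a signal that something changed.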
Average Run Duration
How long do runs take, on average? Tracked over time, duration is a useful stability signal. Sudden increases in average run time usually indicate a bottleneck somewhere — a slow external API, a prompt that’s generating overly long model responses, or a loop that’s taking more iterations than expected.
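A simple way to flag a sudden increase is to compare the mean of the most recent runs against the mean of earlier ones. This is only a sketch: the window size and the 1.5x threshold are arbitrary assumptions, and the function name is hypothetical.

```python
from statistics import mean


def recent_is_slower(durations_s, window=5, threshold=1.5):
    """Return True if the last `window` runs average more than
    `threshold` times the average of all earlier runs."""
    if len(durations_s) <= window:
        return False  # not enough history to form a baseline
    baseline = mean(durations_s[:-window])
    recent = mean(durations_s[-window:])
    return recent > threshold * baseline


# Made-up durations in seconds, oldest first.
durations = [5.0, 6.0, 5.0, 4.0, 5.0, 6.0, 5.0, 12.0, 14.0, 11.0, 13.0, 15.0]
print(recent_is_slower(durations))
```

When the check fires, the execution trace of the slow runs is the place to look for the bottleneck.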
Cost Per Run
Total cost divided by total runs. This is the number you need to build unit economics for AI products. If your cost per run is $0.03 and you’re charging $10/month for unlimited runs, the math only works until a user passes roughly 333 runs a month ($10 ÷ $0.03); beyond that, every additional run loses money. Understanding AI agent costs at the run level is how you build sustainable pricing.
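The pricing example above can be worked through in a few lines. This sketch uses the $0.03 and $10 figures from the text and ignores every cost other than per-run cost, which is a simplifying assumption; the function names are hypothetical.

```python
import math


def break_even_runs(price_usd, cost_per_run_usd):
    """Largest number of runs per month a flat price can cover."""
    return math.floor(price_usd / cost_per_run_usd)


def monthly_margin(price_usd, cost_per_run_usd, runs_per_month):
    """Revenue minus run costs for one subscriber in one month."""
    return price_usd - cost_per_run_usd * runs_per_month


print(break_even_runs(10.0, 0.03))
for runs_per_month in (100, 300, 500):
    margin = monthly_margin(10.0, 0.03, runs_per_month)
    print(f"{runs_per_month} runs: margin ${margin:.2f}")
```

A light user is comfortably profitable under these numbers, while a heavy user at 500 runs a month costs more to serve than they pay.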
Failure Mode Distribution
Not all failures are equal. Categorizing failures by type — tool call errors, model errors, input validation failures, timeouts — lets you prioritize fixes based on actual frequency and impact.
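If each failed run is tagged with a failure type, the distribution is a one-line count. The category labels below follow the ones named in the text; the record format and sample data are assumptions.

```python
from collections import Counter

# Hypothetical failed runs, each tagged with a failure type at logging time.
failed_runs = [
    {"run_id": "r1", "failure_type": "tool_call_error"},
    {"run_id": "r2", "failure_type": "timeout"},
    {"run_id": "r3", "failure_type": "tool_call_error"},
    {"run_id": "r4", "failure_type": "model_error"},
    {"run_id": "r5", "failure_type": "tool_call_error"},
    {"run_id": "r6", "failure_type": "input_validation"},
    {"run_id": "r7", "failure_type": "timeout"},
]

counts = Counter(r["failure_type"] for r in failed_runs)
total = sum(counts.values())

# Most frequent failure modes first.
for failure_type, n in counts.most_common():
    print(f"{failure_type}: {n} ({n / total:.0%})")
```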
Retry Rate
How often does your agent have to retry a step before moving forward? High retry rates on specific tool calls or model operations signal fragility. They also inflate costs in ways that raw success/failure rates won’t surface.
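A sketch of computing retry rate per step, assuming each logged step records how many attempts it took. Here "retry rate" is taken to mean the fraction of step executions that needed more than one attempt, which is one reasonable definition among several; the step names and data are made up.

```python
from collections import defaultdict

# Hypothetical step log across many runs.
steps = [
    {"name": "query_db", "attempts": 3},
    {"name": "query_db", "attempts": 1},
    {"name": "query_db", "attempts": 2},
    {"name": "query_db", "attempts": 1},
    {"name": "send_email", "attempts": 1},
    {"name": "send_email", "attempts": 1},
]


def retry_stats(step_log):
    """Per step name: share of executions that were retried, plus extra attempts spent."""
    total = defaultdict(int)
    retried = defaultdict(int)
    extra = defaultdict(int)
    for s in step_log:
        total[s["name"]] += 1
        if s["attempts"] > 1:
            retried[s["name"]] += 1
            extra[s["name"]] += s["attempts"] - 1
    return {
        name: {"retry_rate": retried[name] / total[name], "extra_attempts": extra[name]}
        for name in total
    }


stats = retry_stats(steps)
for name, s in stats.items():
    print(f"{name}: retry rate {s['retry_rate']:.0%}, extra attempts {s['extra_attempts']}")
```

The extra-attempts count is the part that maps to hidden cost: every one of them is a tool or model call that a plain success rate never shows.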
Agent Runs in Multi-Agent Systems #
Single-agent systems are tractable. You have one agent, one run per user task, and a relatively clean audit trail.
Multi-agent workflows complicate this considerably. When one agent calls another, you have a parent run spawning child runs. A user’s single request might trigger a chain of five or ten separate agent runs, each with its own execution trace, cost footprint, and potential failure modes.
Parent and Child Runs
In a multi-agent system, runs should be linked. A parent run that delegates to a specialist agent should have a reference to the child run, and the child run should carry metadata about its origin. Without this linkage, debugging is nearly impossible — you see a failure in a child agent but have no context about what triggered it or what larger task it was part of.
This hierarchical run structure is how teams trace end-to-end behavior in agentic pipelines. It’s the difference between knowing “Agent B failed at 2:14 PM” and knowing “Agent A was processing a user’s request to summarize a contract, handed off to Agent B for the extraction step, and that’s where the chain broke.”
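The linkage can be as simple as a parent ID on each run. The sketch below walks from a failed child run back to the root to rebuild that context; the field names, run IDs, and agent names are hypothetical.

```python
# Hypothetical run store keyed by run ID; each child records its parent's ID.
runs = {
    "run-a": {"agent": "Agent A", "parent_id": None,
              "task": "summarize contract", "status": "failed"},
    "run-b": {"agent": "Agent B", "parent_id": "run-a",
              "task": "extract clauses", "status": "failed"},
}


def lineage(run_id, runs_by_id):
    """Return the chain of runs from the root parent down to `run_id`."""
    chain = []
    seen = set()  # guards against accidental cycles in the parent links
    while run_id is not None and run_id not in seen:
        seen.add(run_id)
        run = runs_by_id[run_id]
        chain.append(run)
        run_id = run["parent_id"]
    return list(reversed(chain))


for depth, run in enumerate(lineage("run-b", runs)):
    print("  " * depth + f"{run['agent']}: {run['task']} [{run['status']}]")
```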
Aggregate vs. Per-Step Metrics
In multi-agent systems, you need metrics at multiple levels:
- **Per-run metrics**: What happened in this specific agent’s execution
- **Per-workflow metrics**: What happened across all agents involved in a single user task
- **System-level metrics**: Aggregate health across all agents, all workflows, all users
The per-workflow view is often the most actionable one. It shows you where in a pipeline things are slow, expensive, or unreliable — not just which individual agent is struggling.
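A per-workflow view can be built by grouping runs on a shared workflow ID. This is a sketch under stated assumptions: the `workflow_id` field, agent names, and numbers are invented, and durations are summed as if the agents ran one after another.

```python
from collections import defaultdict

# Hypothetical runs, each tagged with the workflow (user task) it belongs to.
runs = [
    {"workflow_id": "wf-1", "agent": "planner", "cost_usd": 0.01,
     "duration_s": 2.0, "status": "completed"},
    {"workflow_id": "wf-1", "agent": "extractor", "cost_usd": 0.05,
     "duration_s": 20.0, "status": "completed"},
    {"workflow_id": "wf-2", "agent": "planner", "cost_usd": 0.01,
     "duration_s": 1.5, "status": "completed"},
    {"workflow_id": "wf-2", "agent": "extractor", "cost_usd": 0.02,
     "duration_s": 3.0, "status": "failed"},
]


def summarize_workflows(run_log):
    """Roll per-run records up into one summary per workflow."""
    grouped = defaultdict(list)
    for r in run_log:
        grouped[r["workflow_id"]].append(r)
    return {
        wf: {
            "cost_usd": sum(r["cost_usd"] for r in rs),
            "duration_s": sum(r["duration_s"] for r in rs),  # assumes sequential agents
            "slowest_agent": max(rs, key=lambda r: r["duration_s"])["agent"],
            "ok": all(r["status"] == "completed" for r in rs),
        }
        for wf, rs in grouped.items()
    }


summary = summarize_workflows(runs)
for wf, s in summary.items():
    print(wf, s)
```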
How MindStudio Handles Agent Run Observability #
When you build agents in MindStudio, every execution is logged as a discrete run. You can see exactly what happened inside each run — which steps executed, what each step produced, how long each step took, and where errors occurred.
This matters practically when you’re trying to improve an agent. Rather than guessing why a workflow occasionally fails, you can pull the run trace and see the exact step where things went wrong, with the full context of what the model was given and what it returned.
For multi-agent workflows — where one MindStudio agent calls another — runs are linked, so you can trace a user’s request through the full chain. The platform tracks token usage and cost at the run level automatically, which means you always have a real picture of what your agents are actually costing to operate. MindStudio’s analytics layer is built around runs, not sessions, because that’s the right unit for this kind of product. You can filter by workflow, by time period, by status, and by user to slice into exactly the performance data you need.
If you’re building AI agents and want instrumentation that reflects how agents actually work, you can try MindStudio free at mindstudio.ai.
Frequently Asked Questions #
What is the difference between an agent run and a session?
A session measures a user’s active period in a product — it starts when they arrive and ends when they leave. An agent run measures one discrete execution of an AI agent — it starts when the agent receives a task and ends when that task is completed or fails. A session can contain many agent runs, and agent runs often happen with no active user session at all (triggered by schedules or other systems).
How do you measure the success of an agent run?
Success in an agent run has a few layers. At the technical level, success means the run completed without errors and produced an output. At the product level, success means the output was actually useful to the user — which requires additional signals like user ratings, downstream task completion, or explicit feedback. Most teams track technical success rate as a leading indicator and layer in quality metrics as the product matures.
What should you log in every agent run?
At minimum: the input that triggered the run, the output produced, the duration, the total token usage and cost, the completion status, and any errors with their location in the execution trace. For multi-agent systems, also log parent/child run relationships so you can trace full workflow chains.
Why is cost tracking at the run level important?
Because AI agent costs are highly variable. Unlike traditional software where per-user infrastructure costs are relatively stable, different agent tasks can have dramatically different token footprints and tool call costs. Run-level cost tracking lets you identify which tasks are economically unsustainable, where model selection should be optimized, and how to build pricing that reflects actual resource consumption. Industry research on AI system observability consistently points to run-level instrumentation as the foundation of cost control in production AI systems.
How do agent run metrics differ between synchronous and asynchronous agents?
Synchronous agents (where the user waits for a response in real time) need low latency. Run duration is a direct UX metric, and slow runs hurt perceived quality immediately. Asynchronous agents (background processing, scheduled tasks) have more latitude on duration but need reliable completion tracking — users need to know the job finished and where to find the results. The core metrics are the same, but the thresholds and priorities differ.
What’s the right way to handle failed agent runs in analytics?
Don’t just count failures — categorize them. Different failure modes tell different stories: input validation failures suggest the agent is receiving requests outside its design scope; tool call failures suggest external dependency issues; model errors suggest prompt or output parsing problems; timeouts suggest capacity or complexity issues. A failure dashboard that separates these categories is far more actionable than a single failure rate metric. Many teams also distinguish between “hard” failures (the agent stopped with an error) and “soft” failures (the agent completed but produced a useless output), since these require different fixes. See how to build reliable AI workflows for more on designing agents that fail gracefully.
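The hard/soft distinction can be encoded as a small classification step at logging time. The sketch below assumes a status string and an optional usefulness signal (for example, a user rating reduced to true or false); the labels and the function name are hypothetical.

```python
def classify_run(status, useful=None):
    """Label a run using its technical status and an optional quality signal.

    `useful` is True or False when feedback exists, or None when there is none.
    """
    if status != "completed":
        return "hard_failure"   # the agent stopped with an error or timed out
    if useful is False:
        return "soft_failure"   # finished, but the output was not useful
    if useful is True:
        return "success"
    return "unrated"            # finished, no quality signal yet


examples = [("failed", None), ("completed", False), ("completed", True), ("completed", None)]
for status, useful in examples:
    print(status, useful, "->", classify_run(status, useful))
```

Keeping "unrated" separate from "success" avoids overstating quality for runs that simply never received feedback.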
Conclusion #
The core shift here is simple: traditional analytics measure user activity. AI product analytics need to measure delegated work.
Key takeaways:
- An agent run is one bounded execution of an AI agent, from task input to task output, and it’s the right atomic unit for measuring AI product performance.
- Sessions and pageviews don’t capture the variable cost, multi-step execution, or non-user-triggered nature of agentic systems.
- The most important run-level metrics are success rate, run duration, cost per run, and failure mode distribution.
- In multi-agent systems, linked parent/child run tracking is essential for debugging and optimization.
- Good AI product analytics require run-level instrumentation, not retrofitted session-based dashboards.
Building AI agents without run-level analytics is like running a factory with no process data — you can see whether finished products come out, but you have no idea where the line is breaking down. MindStudio gives you that run-level visibility built in, so you can iterate on your agents based on what’s actually happening, not what you think is happening.