{"slug": "how-to-curate-observability-data-for-ai-agents", "title": "How to curate observability data for AI agents", "summary": "Multiplayer's debugging agent failed because raw observability data overwhelmed the AI, leading to incorrect fixes. The company developed a four-stage data curation process (grouping and correlating signals, assessing fixability, adding release context, and formatting for machine consumption) that runs before data reaches the agent, transforming noisy telemetry into actionable context.", "body_md": "# How to curate observability data for AI agents\n\nMost debugging agents fail not because the model is wrong, but because the data going in is not ready for machine consumption. Here's what data curation actually looks like in practice.\n\nWhen we started building Multiplayer's debugging agent, we made the same mistake almost everyone makes. We gave our coding agent access to observability data and expected it to figure out what was relevant.\n\nIt didn't. The agent called the wrong tools, chased the wrong signals, and produced fixes that looked plausible but failed in production. We were using state-of-the-art models, but we were handing them raw observability data without any curation or filtering. We later realized that we were just routing them noise.\n\nWhat follows is what we learned about what you actually have to do with observability data before it's fit for an AI agent to act on.\n\n**The signal-to-noise problem**\n\nObservability data has one of the worst signal-to-noise ratios of any data type you could feed an AI agent.\n\nA single production issue might involve hundreds of spans across a dozen services, thousands of log lines, missing request and response payloads, redacted headers, clock-skewed timestamps, and events distributed across tools that have never been correlated with each other. 
A human debugging this issue brings years of context: they know which services are noisy, which logs matter, which timestamps to trust, and roughly where in the stack the problem lives. They navigate the noise because they understand the system.\n\nAn agent sees everything with equal weight. Garbage spans get the same attention as the one span that actually shows the failure. Thousands of log lines get processed before the agent can ask a useful question. And because context windows are finite and expensive, you burn through your budget before you've even framed the problem correctly.\n\nThis is a data preparation problem. And it's one that has to be solved before the data reaches the agent, not by the agent itself.\n\n**What data curation actually means**\n\nData curation for AI agents shouldn’t be confused with summarization or compression, which is what most engineering teams end up doing.\n\nInstead, it's the process of transforming raw observability data into a structured, scoped, context-rich package that an agent can reason about correctly. That means making a series of deliberate decisions: what to include, what to exclude, how to group related signals, and what additional context the agent needs to understand the problem.\n\nAt Multiplayer, we do this in four stages before any data reaches a coding agent.\n\n**Stage one: group and correlate aggressively**\n\nThe first thing we do with raw observability data is group related events and correlate them across service boundaries.\n\nA single bug will typically surface across many sessions, environments, and services. Without grouping, each occurrence looks like a separate issue. And without correlation, the agent can't see the causal chain that connects a user action on the frontend to a failure deep in the backend.\n\nWe correlate aggressively: user interactions, session metadata, network requests, backend traces, and log events get tied together into a single timeline before anything else happens. 
The agent needs to see that the click at 14:32:01 caused the cascade that showed up in the backend logs at 14:32:04. It can't infer that from timestamps alone (especially under any real load or clock skew). The correlation has to be built into the data structure before the agent sees it.\n\nWe also deduplicate at this stage. The same bug appearing across a hundred user sessions becomes one issue, not a hundred separate signals. This matters for both cost and quality. An agent acting on deduplicated, grouped data produces one PR for one issue. An agent acting on raw, ungrouped data produces dozens of PRs for the same issue, burns through tokens unnecessarily, or gets confused trying to reconcile conflicting signals from the same underlying failure.\n\n**Stage two: assess fixability before routing to the agent**\n\nNot every issue is worth routing to a coding agent, and not every issue is something a coding agent can fix.\n\nBefore anything reaches the coding agent, we run a fixability assessment through a dedicated agent. Is this a deterministic, reproducible failure with a clear root cause? Or is it an intermittent, environment-specific issue that requires human judgment to diagnose?\n\nThis matters for a few reasons. First, coding agents produce their worst outputs on problems they don't have enough context to solve correctly, which are often the hardest, most intermittent bugs. Routing those to a coding agent without human oversight wastes tokens and produces plausible-looking fixes that don't hold.\n\nSecond, fixability scoring lets you prioritize. High-fixability issues (clear root cause, deterministic reproduction, well-scoped impact) go to the coding agent immediately. 
Lower-fixability issues get flagged for human review with the curated context already attached.\n\nThe goal is to keep humans in the loop where human judgment is actually needed, and route everything else through the automated fix cycle.\n\n**Stage three: add release context and metadata**\n\nRaw observability data tells you that something broke, but it doesn't tell you what changed that caused it to break.\n\nBefore the data reaches the coding agent, we automatically add release context: build information, deployment timestamps, recent commits, the specific version of each service involved in the failure. Bugs don't appear in a vacuum. They're usually introduced at a specific point in the git history, often in a specific commit, often by a specific change that touched the affected code path.\n\nA coding agent producing a fix without this context is guessing about the causal history of the bug. With release metadata attached, the agent can connect the failure signal to the change that introduced it. That changes the quality of the fix significantly: it goes from \"*here's a patch that handles this error case*\" to \"*here's what the change introduced and here's how to correct it.*\"\n\nWe also add service metadata, environment information, and any relevant configuration context that helps the agent understand the system it's operating in. Custom service name mappings (e.g. \"payment-service,\" \"svc-payments,\" and \"payments_v2\" all referring to the same thing) get resolved here so the agent isn't treating three names as three entities.\n\n**Stage four: format and summarize for machine consumption**\n\nThe final stage is the one that most teams skip entirely, and it's where a lot of debugging agent performance gets left on the table.\n\nRaw observability data is formatted for humans: JSON payloads, nested span structures, log lines with internal formatting conventions. 
These are designed to be readable by someone who understands the system and is looking at a dashboard.\n\nWe reformat data before it reaches the coding agent. Spans get converted from nested JSON into a structured narrative that describes the execution path. Log lines get filtered to the ones that are actually relevant to the failure window and reformatted to make the timeline legible. Request and response payloads (which most observability tools strip out by default, and which we capture specifically because they're the most useful debugging signal) get included with the context that explains why they're relevant.\n\nWe also produce an issue summary that we call \"explain it like I'm 5\". The goal is to bring the coding agent up to speed the way you'd brief a developer who's just joined an incident call: here's what broke, here's when it started, here's what changed recently, here's where in the stack the failure lives, here's what the error looks like when it surfaces.\n\n**What this looks like in practice**\n\nThe difference between V1 and V2 of Multiplayer's debugging agent was almost entirely in the curation layer.\n\nV1 mirrored our API and gave the agent a lot of tools to work with. The agent called the wrong tools, used the wrong parameters, burned through tokens, and produced PRs that missed the actual root cause. The model wasn't the problem. The data access pattern was the problem.\n\nV2 had one main tool that returned a curated, correlated, formatted package of everything the agent needed to understand the issue. 
The agent called the right thing at the right time, asked focused follow-up questions when it needed more context, and produced fixes that held up in production.\n\nWhat made the difference was the curation layer: grouping, deduplication, fixability assessment, release context, formatting, and issue summary.\n\n**The question to ask yourself**\n\nMost debugging agents and MCP servers are built to answer the question: \"How do I give the AI access to my observability data?\"\n\nThat's the wrong question.\n\nThe right question is: \"What does the agent need to understand about this specific issue in order to produce a fix worth shipping?\"\n\nThose questions lead to very different architectures. The first leads to raw data exposure: give the agent access to everything and let it figure out what's relevant. The second leads to curation: do the work of making the data fit for machine consumption before it ever reaches the agent.\n\nThe observability data you have right now was built for humans. It's sampled, aggregated, siloed, and formatted for dashboards. 
Sending it directly to a coding agent without transformation is the reason most debugging agents produce output that looks right and fails in production.\n\nOne copy/paste in your terminal and the debugging agent is running:\n\n`npm install -g @multiplayer-app/cli && multiplayer`\n\nRather explore first?👇\n\n[multiplayer.app](https://multiplayer.app/?ref=localhost)", "url": "https://wpnews.pro/news/how-to-curate-observability-data-for-ai-agents", "canonical_source": "https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/", "published_at": "2026-06-25 16:25:41+00:00", "updated_at": "2026-06-25 16:45:34.488823+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "developer-tools", "large-language-models", "ai-infrastructure"], "entities": ["Multiplayer", "Multiplayer's debugging agent"], "alternates": {"html": "https://wpnews.pro/news/how-to-curate-observability-data-for-ai-agents", "markdown": "https://wpnews.pro/news/how-to-curate-observability-data-for-ai-agents.md", "text": "https://wpnews.pro/news/how-to-curate-observability-data-for-ai-agents.txt", "jsonld": "https://wpnews.pro/news/how-to-curate-observability-data-for-ai-agents.jsonld"}}