{"slug": "instrumenting-your-ai-agent-fleet-from-black-box-to-full-observability", "title": "Instrumenting Your AI Agent Fleet: From Black Box to Full Observability", "summary": "A developer instrumented his autonomous AI agent fleet with full observability after weeks of operating them as black boxes. The system captures metrics from two execution environments—launchd schedules and Cowork sessions—into a SQLite database, revealing patterns across 455 sessions such as heavy reads of role-definition files like CLAUDE.md and core source files like server.py.", "body_md": "## The Problem With Invisible Agents\n\nMy development team runs on schedules. Every morning, a security agent scans the codebase for vulnerabilities. A QA agent validates the previous day's code changes. A marketing agent drafts content and checks the editorial queue. A project manager agent synthesizes everything into a dashboard.\n\nNone of them are supervised while they run. That is the point -- autonomous agents work while I sleep.\n\nThe problem is that for the first few weeks, I had no reliable answer to the question: \"What did the agent actually do?\" I could read the output files. I could check git history. But I could not tell whether an agent used 12 tool calls or 58, which files it read most often, whether it hit its turn budget limit, or whether it threw errors it quietly swallowed. The agents were a black box with visible outputs but invisible process.\n\nIn data engineering, you do not ship a pipeline without monitoring. You have metrics on row counts, processing time, error rates, and resource consumption. You know when something is underperforming before it fails. I applied the same discipline to the agent fleet.\n\n## Two Collection Paths\n\nThe instrumentation pipeline has two paths because the agents run in two different environments.\n\nAgents running on launchd schedules (the Mac-native job scheduler) go through a shell wrapper called `run_agent_code.sh`. 
This wrapper captures the full session output in JSONL format -- every tool call, every message, every timestamp. After the session completes, `parse_agent_metrics.py` runs against that JSONL file and extracts structured metrics: total tool calls, a breakdown by tool name, which files were read and written, bash commands executed, session duration, error count, and subagent spawns. The parser can distinguish between reading a SKILL.md file (a planning document) and reading a source file, so I can see how much of each session is orientation versus actual work.\n\nAgents running in Cowork scheduled sessions do not have the launchd wrapper, so they self-report at session end using `save_agent_metrics.py`. They call this script with the same fields the JSONL parser would produce -- tool call counts, files read, duration, errors. The data ends up in the same format, from the same schema. Whether an agent ran locally or in Cowork, the downstream analytics layer cannot tell the difference.\n\n## What Goes Into the Database\n\nBoth paths emit JSON metric files that get ingested into a SQLite database via `ingest_agent_metrics.py`. The schema is straightforward. Three core tables do most of the work.\n\nThe `agent_sessions` table holds one row per session with aggregate metrics -- agent name, date, duration, total tool calls, files read count, files written count, bash command count, subagent spawns, and error count. The `tool_usage` table holds one row per tool per session, so I can see that a particular builder session used Read 18 times, Edit 12 times, and Bash 6 times. The `files_accessed` table records every file touched, tagged by access type (read, write, edit) and a flag indicating whether the file was a SKILL.md or other planning document.\n\nThe query interface is plain SQL against a local SQLite file. No dashboard, no visualization layer, no cloud dependency. 
I run queries directly when I want to know something, and the answers come back in milliseconds.\n\n## What the Data Shows\n\nAfter 6 weeks of data collection across 455 agent sessions (2026-04-19 to 2026-06-01), a few patterns have emerged.\n\nThe most-read files across the fleet are `CLAUDE.md` (291 reads), `server.py` (251), `LEARNINGS.md` (164), `BUSINESS_STATUS.md` (161), `COMPLETED.md` (160), and individual SKILL.md files (130). The mix tells an interesting story: agents read role-definition and shared-correction files heavily, but also spend significant time on core source files like `server.py`. What surprised me was how much variance exists in skill-file read patterns: some agents read their SKILL.md once and proceed, while others return to it multiple times mid-session. That re-reading behavior correlates with sessions where the agent encountered an unexpected situation and needed to recheck the rules.\n\n| Agent role | Sessions | Avg tool calls | Avg turns | Hit soft cap | Hit hard cap |\n|---|---|---|---|---|---|\n| Builder | 145 | 71 | 119 | 14% (20/145) | 6% (9/145) |\n| QA | 49 | 49 | 81 | 4% (2/49) | 2% (1/49) |\n\nTurn budget utilization is another useful signal. Each agent has a turn budget -- a cap on how many turns it takes per session before wrapping up. Agents that consistently run near their budget are doing more work per session, which is efficient if the output quality holds up, and a warning signal if quality drops. The builder agent averages 71 tool calls and 119 turns per session, hitting the soft cap in 14% of sessions and the hard cap in 6%. The QA agent runs more constrained -- 49 tool calls and 81 turns per session -- and rarely bumps against its limits (4% soft, 2% hard). 
The builder runs closer to the edge, which makes sense: it is doing more implementation work per session.\n\nError rates are low across the board -- 10 single-error sessions out of 455, and they only began surfacing in late May as instrumentation matured. April had zero errors across 84 sessions. May had 10 across 360 sessions, all single-error events spread across several roles. No agent had a run of errors. The signal the instrumentation is designed to surface showed up exactly as expected: not catastrophic failures, but quiet signals worth watching.\n\nThe most valuable query has been something simple:\n\n```\nSELECT agent, session_date, errors\nFROM agent_sessions\nWHERE errors > 0\nORDER BY session_date;\n```\n\nTen rows. Every one a single-error session, all clustered in late May. Spread across roles -- no single agent was struggling. It takes about two seconds to run and tells me exactly where to look.\n\n## The LEARNINGS.md Pattern\n\nOne piece of the observability stack is not technical infrastructure -- it is a shared text file called `LEARNINGS.md` at the repo root. Every agent is required to read it at session start and append a concrete rule when a human reviewer or another agent corrects them during a session.\n\nThis pattern comes from the AGENTS.md open standard, which formalizes how multi-agent systems communicate expectations and corrections. The idea is lightweight institutional memory: instead of re-explaining the same correction every session, you write it down once and every subsequent session inherits the lesson.\n\nThe rules in LEARNINGS.md are deliberately concrete. \"Always pin dependencies to exact versions in requirements.txt\" rather than \"be careful with dependencies.\" \"The META_PAGE_ID must be the actual Facebook Page ID from Page Settings, not the Business Portfolio ID\" rather than \"check your IDs carefully.\" Concrete rules are hard to misinterpret. 
Abstract rules get interpreted differently by different agents, or quietly ignored.\n\nThe database does not capture LEARNINGS.md interactions directly, but the files_accessed table shows which sessions read it. Agents that consistently skip it are worth investigating -- either the session was so constrained it went straight to work, or the agent is not following the startup checklist.\n\n## What This Changes\n\nThe before picture: an agent runs, produces output, and I look at the output to judge whether the session was successful. The after picture: an agent runs, produces output, and I have structured metrics on everything it did to produce that output, queryable in seconds.\n\nThis does not replace reading the output. An agent can use 30 tool calls efficiently or inefficiently, and the metric alone does not tell you which. But it surfaces the cases worth investigating: sessions with high error counts, sessions that hit turn limits, sessions where the skill-file read ratio is unusually high (possible orientation problem), sessions that ran much longer or shorter than normal.\n\nFor anyone running autonomous AI agents in production -- whether that is one agent or ten -- the same principle applies that has always applied to production data pipelines. You cannot operate what you cannot observe. The implementation here took a single session to build. 
The data it produces is already changing how I allocate attention.\n\nIf you are working through how to instrument your own agent systems, or evaluating whether autonomous agents make sense for your data operations, the contact page is the right place to start: [labyrinthanalyticsconsulting.com/contact](/contact).", "url": "https://wpnews.pro/news/instrumenting-your-ai-agent-fleet-from-black-box-to-full-observability", "canonical_source": "https://labyrinthanalyticsconsulting.com/blog/instrumenting-ai-agent-fleet-observability", "published_at": "2026-06-13 00:00:00+00:00", "updated_at": "2026-06-14 02:02:22.266095+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "developer-tools", "mlops"], "entities": ["CLAUDE.md", "server.py", "LEARNINGS.md", "BUSINESS_STATUS.md", "COMPLETED.md", "SKILL.md", "SQLite", "Cowork"], "alternates": {"html": "https://wpnews.pro/news/instrumenting-your-ai-agent-fleet-from-black-box-to-full-observability", "markdown": "https://wpnews.pro/news/instrumenting-your-ai-agent-fleet-from-black-box-to-full-observability.md", "text": "https://wpnews.pro/news/instrumenting-your-ai-agent-fleet-from-black-box-to-full-observability.txt", "jsonld": "https://wpnews.pro/news/instrumenting-your-ai-agent-fleet-from-black-box-to-full-observability.jsonld"}}