# Instrumenting Your AI Agent Fleet: From Black Box to Full Observability

> Source: <https://labyrinthanalyticsconsulting.com/blog/instrumenting-ai-agent-fleet-observability>
> Published: 2026-06-13

## The Problem With Invisible Agents

My development team runs on schedules. Every morning, a security agent scans the codebase for vulnerabilities. A QA agent validates the previous day's code changes. A marketing agent drafts content and checks the editorial queue. A project manager agent synthesizes everything into a dashboard.

None of them are supervised while they run. That is the point -- autonomous agents work while I sleep.

The problem is that for the first few weeks, I had no reliable answer to the question: "What did the agent actually do?" I could read the output files. I could check git history. But I could not tell whether an agent used 12 tool calls or 58, which files it read most often, whether it hit its turn budget limit, or whether it threw errors it quietly swallowed. The agents were a black box with visible outputs but invisible process.

In data engineering, you do not ship a pipeline without monitoring. You have metrics on row counts, processing time, error rates, and resource consumption. You know when something is underperforming before it fails. I applied the same discipline to the agent fleet.

## Two Collection Paths

The instrumentation pipeline has two paths because the agents run in two different environments.

Agents running on launchd schedules (the Mac-native job scheduler) go through a shell wrapper called `run_agent_code.sh`. This wrapper captures the full session output in JSONL format -- every tool call, every message, every timestamp. After the session completes, `parse_agent_metrics.py` runs against that JSONL file and extracts structured metrics: total tool calls, a breakdown by tool name, which files were read and written, bash commands executed, session duration, error count, and subagent spawns. The parser can distinguish between reading a SKILL.md file (a planning document) and reading a source file, so I can see how much of each session is orientation versus actual work.
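A minimal sketch of the kind of extraction described above, assuming a hypothetical event shape. The post does not show the real log format or the internals of `parse_agent_metrics.py`, so the field names `type`, `tool`, and `path`, and the function `summarize_session`, are illustrative assumptions.

```python
import json
from collections import Counter


def summarize_session(jsonl_lines):
    """Reduce an iterable of JSONL strings to aggregate session metrics.

    The event fields used here ("type", "tool", "path") are hypothetical.
    """
    tool_counts = Counter()
    files_read = []
    skill_reads = 0
    errors = 0

    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        event = json.loads(line)

        if event.get("type") == "error":
            errors += 1
        elif event.get("type") == "tool_call":
            tool = event.get("tool", "unknown")
            tool_counts[tool] += 1
            path = event.get("path")
            if tool == "Read" and path:
                files_read.append(path)
                # Planning documents are counted separately from source files.
                if path.endswith("SKILL.md"):
                    skill_reads += 1

    return {
        "total_tool_calls": sum(tool_counts.values()),
        "tool_breakdown": dict(tool_counts),
        "files_read": files_read,
        "skill_file_reads": skill_reads,
        "errors": errors,
    }


# Made-up log lines for illustration only.
sample_log = [
    '{"type": "tool_call", "tool": "Read", "path": "skills/qa/SKILL.md"}',
    '{"type": "tool_call", "tool": "Read", "path": "server.py"}',
    '{"type": "tool_call", "tool": "Bash", "command": "pytest -q"}',
    '{"type": "error", "message": "exit status 1"}',
]
metrics = summarize_session(sample_log)
```

Counting SKILL.md reads separately from other reads is what makes the orientation-versus-work split possible later on.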

Agents running in Cowork scheduled sessions do not have the launchd wrapper, so they self-report at session end using `save_agent_metrics.py`. They call this script with the same fields the JSONL parser would produce -- tool call counts, files read, duration, errors. The data ends up in the same format, under the same schema. Whether an agent ran locally or in Cowork, the downstream analytics layer cannot tell the difference.
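A sketch of what a shared record shape could look like on the self-reporting side. The post does not show the interface of `save_agent_metrics.py`, so the function `write_metrics`, the key names, and the file-naming scheme are assumptions, and the values are made up.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical field set: the post lists the kinds of metrics, not the exact keys.
REQUIRED_FIELDS = {
    "agent", "session_date", "duration_seconds",
    "total_tool_calls", "files_read", "errors",
}


def write_metrics(record, out_dir):
    """Validate a metrics record and write it as one JSON file."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    out_path = Path(out_dir) / f"{record['agent']}_{record['session_date']}.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path


# Illustrative values only.
record = {
    "agent": "qa",
    "session_date": "2026-05-20",
    "duration_seconds": 840,
    "total_tool_calls": 49,
    "files_read": ["CLAUDE.md", "LEARNINGS.md", "server.py"],
    "errors": 0,
}
with tempfile.TemporaryDirectory() as tmp:
    path = write_metrics(record, tmp)
    round_trip = json.loads(path.read_text())
```

Rejecting incomplete records at write time keeps both collection paths interchangeable for whatever reads the files next.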

## What Goes Into the Database

Both paths emit JSON metric files that get ingested into a SQLite database via `ingest_agent_metrics.py`. The schema is straightforward. Three core tables do most of the work.

The `agent_sessions` table holds one row per session with aggregate metrics -- agent name, date, duration, total tool calls, files read count, files written count, bash command count, subagent spawns, and error count. The `tool_usage` table holds one row per tool per session, so I can see that a particular builder session used Read 18 times, Edit 12 times, and Bash 6 times. The `files_accessed` table records every file touched, tagged by access type (read, write, edit) and a flag indicating whether the file was a SKILL.md or other planning document.
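One way the three tables could be laid out, as a sketch rather than the actual schema. Only the table names and the columns `agent`, `session_date`, and `errors` appear in this post. Every other column name, the `session_id` join key, and the constraints are assumptions.

```python
import sqlite3

# Column names are assumptions, except agent, session_date, and errors.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_sessions (
    session_id        INTEGER PRIMARY KEY,
    agent             TEXT NOT NULL,
    session_date      TEXT NOT NULL,
    duration_seconds  REAL,
    total_tool_calls  INTEGER DEFAULT 0,
    files_read        INTEGER DEFAULT 0,
    files_written     INTEGER DEFAULT 0,
    bash_commands     INTEGER DEFAULT 0,
    subagent_spawns   INTEGER DEFAULT 0,
    errors            INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS tool_usage (
    session_id  INTEGER NOT NULL REFERENCES agent_sessions(session_id),
    tool_name   TEXT NOT NULL,
    call_count  INTEGER NOT NULL,
    PRIMARY KEY (session_id, tool_name)
);
CREATE TABLE IF NOT EXISTS files_accessed (
    session_id       INTEGER NOT NULL REFERENCES agent_sessions(session_id),
    file_path        TEXT NOT NULL,
    access_type      TEXT NOT NULL CHECK (access_type IN ('read', 'write', 'edit')),
    is_planning_doc  INTEGER NOT NULL DEFAULT 0
);
"""

conn = sqlite3.connect(":memory:")  # a file path would be used in practice
conn.executescript(SCHEMA)

# Illustrative rows mirroring the builder example: Read 18, Edit 12, Bash 6.
conn.execute(
    "INSERT INTO agent_sessions (session_id, agent, session_date, total_tool_calls) "
    "VALUES (1, 'builder', '2026-05-20', 36)"
)
conn.executemany(
    "INSERT INTO tool_usage (session_id, tool_name, call_count) VALUES (?, ?, ?)",
    [(1, "Read", 18), (1, "Edit", 12), (1, "Bash", 6)],
)
breakdown = conn.execute(
    "SELECT tool_name, call_count FROM tool_usage "
    "WHERE session_id = 1 ORDER BY call_count DESC"
).fetchall()
```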

The query interface is plain SQL against a local SQLite file. No dashboard, no visualization layer, no cloud dependency. I run queries directly when I want to know something, and the answers come back in milliseconds.

## What the Data Shows

After 6 weeks of data collection across 455 agent sessions (2026-04-19 to 2026-06-01), a few patterns have emerged.

The most-read files across the fleet are `CLAUDE.md` (291 reads), `server.py` (251), `LEARNINGS.md` (164), `BUSINESS_STATUS.md` (161), `COMPLETED.md` (160), and individual SKILL.md files (130). The mix tells an interesting story: agents read role-definition and shared-correction files heavily, but also spend significant time on core source files like `server.py`. What surprised me was how much variance exists in skill-file read patterns: some agents read their SKILL.md once and proceed, while others return to it multiple times mid-session. That re-reading behavior correlates with sessions where the agent encountered an unexpected situation and needed to recheck the rules.
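Both observations map onto simple aggregate queries over `files_accessed`. The sketch below assumes hypothetical column names (`session_id`, `file_path`, `access_type`, `is_planning_doc`) and uses made-up rows. It is not the query that produced the counts above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files_accessed ("
    "session_id INTEGER, file_path TEXT, access_type TEXT, is_planning_doc INTEGER)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO files_accessed VALUES (?, ?, ?, ?)",
    [
        (1, "CLAUDE.md", "read", 0),
        (1, "server.py", "read", 0),
        (1, "server.py", "edit", 0),
        (2, "CLAUDE.md", "read", 0),
        (2, "skills/qa/SKILL.md", "read", 1),
        (2, "skills/qa/SKILL.md", "read", 1),
        (2, "skills/qa/SKILL.md", "read", 1),
    ],
)

# Most-read files across all sessions (reads only, edits excluded).
top_reads = conn.execute(
    "SELECT file_path, COUNT(*) AS reads FROM files_accessed "
    "WHERE access_type = 'read' "
    "GROUP BY file_path ORDER BY reads DESC, file_path LIMIT 5"
).fetchall()

# Sessions that returned to the same SKILL.md more than once.
rereads = conn.execute(
    "SELECT session_id, file_path, COUNT(*) AS reads FROM files_accessed "
    "WHERE access_type = 'read' AND file_path LIKE '%SKILL.md' "
    "GROUP BY session_id, file_path HAVING COUNT(*) > 1"
).fetchall()
```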

| Agent role | Sessions | Avg tool calls | Avg turns | Hit soft cap | Hit hard cap |
|---|---|---|---|---|---|
| Builder | 145 | 71 | 119 | 14% (20/145) | 6% (9/145) |
| QA | 49 | 49 | 81 | 4% (2/49) | 2% (1/49) |

Turn budget utilization is another useful signal. Each agent has a turn budget -- a cap on how many turns it takes per session before wrapping up. Agents that consistently run near their budget are doing more work per session, which is efficient if the output quality holds up, and a warning signal if quality drops. The builder agent averages 71 tool calls and 119 turns per session, hitting the soft cap in 14% of sessions and the hard cap in 6%. The QA agent runs more constrained -- 49 tool calls and 81 turns per session -- and rarely bumps against its limits (4% soft, 2% hard). The builder runs closer to the edge, which makes sense: it is doing more implementation work per session.
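The percentages follow directly from the counts in the table. A short check, where the dictionary layout and the `hit_rate` helper are only for illustration:

```python
# Counts taken from the table above.
cap_hits = {
    "Builder": {"sessions": 145, "soft": 20, "hard": 9},
    "QA": {"sessions": 49, "soft": 2, "hard": 1},
}


def hit_rate(hits, sessions):
    """Share of sessions that hit a cap, as a whole-number percentage."""
    return round(100 * hits / sessions)


rates = {
    role: (hit_rate(c["soft"], c["sessions"]), hit_rate(c["hard"], c["sessions"]))
    for role, c in cap_hits.items()
}
```

Here `rates` maps each role to its (soft, hard) percentages, matching the 14% and 6% reported for the builder and the 4% and 2% for QA.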

Error rates are low across the board -- 10 single-error sessions out of 455, and they only began surfacing in late May as instrumentation matured. April had zero errors across 84 sessions. May had 10 across 360 sessions, all single-error events spread across several roles. No agent had a run of errors. That is the kind of result the instrumentation is designed to surface: not catastrophic failures, but quiet signals worth watching.

The most valuable query has been something simple:

```sql
SELECT agent, session_date, errors
FROM agent_sessions
WHERE errors > 0
ORDER BY session_date;
```

Ten rows. Every one a single-error session, all clustered in late May. Spread across roles -- no single agent was struggling. It takes about two seconds to run and tells me exactly where to look.
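The April-versus-May comparison can be expressed as a monthly rollup over the same three columns. This is a sketch with made-up rows, assuming `session_date` is stored as ISO `YYYY-MM-DD` text so that `substr` can extract the month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_sessions (agent TEXT, session_date TEXT, errors INTEGER)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO agent_sessions VALUES (?, ?, ?)",
    [
        ("builder", "2026-04-20", 0),
        ("qa", "2026-04-21", 0),
        ("builder", "2026-05-27", 1),
        ("qa", "2026-05-28", 0),
        ("marketing", "2026-05-29", 1),
    ],
)

# In SQLite, (errors > 0) evaluates to 0 or 1, so SUM counts sessions with errors.
monthly = conn.execute(
    "SELECT substr(session_date, 1, 7) AS month, "
    "       COUNT(*) AS sessions, "
    "       SUM(errors > 0) AS error_sessions "
    "FROM agent_sessions "
    "GROUP BY month ORDER BY month"
).fetchall()
```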

## The LEARNINGS.md Pattern

One piece of the observability stack is not technical infrastructure -- it is a shared text file called `LEARNINGS.md` at the repo root. Every agent is required to read it at session start and append a concrete rule when a human reviewer or another agent corrects them during a session.

This pattern borrows from the AGENTS.md open format, a convention for giving coding agents project-specific instructions and expectations in a plain Markdown file. The idea is lightweight institutional memory: instead of re-explaining the same correction every session, you write it down once and every subsequent session inherits the lesson.

The rules in LEARNINGS.md are deliberately concrete. "Always pin dependencies to exact versions in requirements.txt" rather than "be careful with dependencies." "The META_PAGE_ID must be the actual Facebook Page ID from Page Settings, not the Business Portfolio ID" rather than "check your IDs carefully." Concrete rules are hard to misinterpret. Abstract rules get interpreted differently by different agents, or quietly ignored.

The database does not capture `LEARNINGS.md` interactions directly, but the `files_accessed` table shows which sessions read it. Agents that consistently skip it are worth investigating -- either the session was so constrained it went straight to work, or the agent is not following the startup checklist.
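Finding the sessions that skipped the file is an anti-join between the two tables. The sketch below assumes a hypothetical `session_id` key linking `agent_sessions` to `files_accessed`, plus assumed `file_path` and `access_type` columns. The rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE agent_sessions (session_id INTEGER PRIMARY KEY, agent TEXT, session_date TEXT);
    CREATE TABLE files_accessed (session_id INTEGER, file_path TEXT, access_type TEXT);
    """
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO agent_sessions VALUES (?, ?, ?)",
    [(1, "builder", "2026-05-20"), (2, "qa", "2026-05-20"), (3, "marketing", "2026-05-21")],
)
conn.executemany(
    "INSERT INTO files_accessed VALUES (?, ?, ?)",
    [
        (1, "LEARNINGS.md", "read"),
        (1, "server.py", "edit"),
        (2, "server.py", "read"),
        (3, "LEARNINGS.md", "edit"),  # appended a rule but never read the file
    ],
)

# Sessions with no recorded read of LEARNINGS.md.
skipped = conn.execute(
    "SELECT s.session_id, s.agent FROM agent_sessions AS s "
    "WHERE NOT EXISTS ("
    "    SELECT 1 FROM files_accessed AS f "
    "    WHERE f.session_id = s.session_id "
    "      AND f.access_type = 'read' "
    "      AND f.file_path LIKE '%LEARNINGS.md'"
    ") ORDER BY s.session_id"
).fetchall()
```

Grouping the result by agent would separate a one-off skip from a consistent pattern.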

## What This Changes

The before picture: an agent runs, produces output, and I look at the output to judge whether the session was successful. The after picture: an agent runs, produces output, and I have structured metrics on everything it did to produce that output, queryable in seconds.

This does not replace reading the output. An agent can use 30 tool calls efficiently or inefficiently, and the metric alone does not tell you which. But it surfaces the cases worth investigating: sessions with high error counts, sessions that hit turn limits, sessions where the skill-file read ratio is unusually high (possible orientation problem), sessions that ran much longer or shorter than normal.
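Those triage criteria can be combined into a single pass over session records. This is a sketch under stated assumptions: the record keys, the `flag_sessions` helper, and both thresholds (a 0.3 skill-file read ratio and a 2x deviation from the median duration) are hypothetical and would need tuning against real data.

```python
from statistics import median


def flag_sessions(sessions, ratio_limit=0.3, duration_factor=2.0):
    """Return {session_id: [reasons]} for sessions worth a manual look.

    Thresholds and record keys are illustrative assumptions.
    """
    typical = median(s["duration_seconds"] for s in sessions)
    flagged = {}
    for s in sessions:
        reasons = []
        if s["errors"] > 0:
            reasons.append("errors")
        if s["hit_turn_cap"]:
            reasons.append("turn cap")
        reads = s["files_read"]
        if reads and s["skill_file_reads"] / reads > ratio_limit:
            reasons.append("high skill-file read ratio")
        duration = s["duration_seconds"]
        if duration > typical * duration_factor or duration < typical / duration_factor:
            reasons.append("unusual duration")
        if reasons:
            flagged[s["session_id"]] = reasons
    return flagged


# Made-up session records for illustration only.
sessions = [
    {"session_id": 1, "duration_seconds": 900, "errors": 0, "hit_turn_cap": False,
     "files_read": 20, "skill_file_reads": 1},
    {"session_id": 2, "duration_seconds": 1000, "errors": 1, "hit_turn_cap": False,
     "files_read": 18, "skill_file_reads": 2},
    {"session_id": 3, "duration_seconds": 2600, "errors": 0, "hit_turn_cap": True,
     "files_read": 10, "skill_file_reads": 5},
    {"session_id": 4, "duration_seconds": 950, "errors": 0, "hit_turn_cap": False,
     "files_read": 22, "skill_file_reads": 1},
    {"session_id": 5, "duration_seconds": 300, "errors": 0, "hit_turn_cap": False,
     "files_read": 5, "skill_file_reads": 1},
]
flagged = flag_sessions(sessions)
```

The output is a shortlist for human review, not a verdict: a flagged session may still have produced good work.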

For anyone running autonomous AI agents in production -- whether that is one agent or ten -- the same principle applies that has always applied to production data pipelines. You cannot operate what you cannot observe. The implementation here took a single session to build. The data it produces is already changing how I allocate attention.

If you are working through how to instrument your own agent systems, or evaluating whether autonomous agents make sense for your data operations, the contact page is the right place to start: [labyrinthanalyticsconsulting.com/contact](https://labyrinthanalyticsconsulting.com/contact).
