# My agent swarm had a productive night. My pipeline lied about it.

> Source: <https://dev.to/niclydon/my-agent-swarm-had-a-productive-night-my-pipeline-lied-about-it-1kao>
> Published: 2026-06-05 14:13:54+00:00

I gave a Grok CLI agent swarm one instruction around 1am and went to bed. By 5:30 it had closed 37 items off my operator backlog and landed 20 commits on main: 2,838 lines added, 112 removed, across 51 files. Good night’s work for a process I wasn’t awake for.

Then I tried to audit it, and discovered my own ingest pipeline had quietly turned the entire run into garbage. The schema mismatch is mundane. The lesson underneath it is not.

I run a personal automation stack with an operator backlog: a Postgres-backed queue of tickets tagged by whether an agent can safely act on them unattended. My usual agents drain that queue against a set of contracts: an investigator role that diagnoses, and a worker role that ships with test gates. I wanted to see how a Grok CLI swarm would handle the same contracts.

The kickoff was one line: keep working through the queue, do the autonomous-safe work. The first session fanned out a dozen subagents in about three seconds. Each read the same prompt contracts, and each claimed tickets with a lease so they wouldn’t collide.

Every session writes a transcript to disk. My stack ingests those transcripts into Postgres so I can answer questions like “what did agent #7 actually run.” Standard local-first observability: the model provider doesn’t hold my data, I do.

I went to look at the per-session breakdown and the numbers were obviously wrong:

```
SELECT count(*)                            AS sessions,
       count(model)                        AS with_model,
       sum(turn_count)                     AS turns,
       count(*) FILTER (WHERE is_subagent) AS subagents
FROM grok_sessions;

 sessions | with_model | turns | subagents
----------+------------+-------+-----------
       95 |          0 |    99 |         0
```

Ninety-five sessions, zero with a model recorded, ninety-nine total turns, zero subagents. That immediately failed a sanity check: six sessions had `subagent-<uuid>` in their project name, and a dozen subagents launching in three seconds do not average one turn each. The data was lying.

The importer was written against a Claude-Code-shaped assumption. It expected a `role` field on each record, tool calls embedded inside content blocks, and a model field somewhere on the message. Here is what Grok CLI actually writes, one JSON object per line:

``` bash
$ cat chat_history.jsonl | jq -r '.type' | sort | uniq -c
     47 assistant
     47 reasoning
      1 system
     51 tool_result
      3 user
```

Three things break immediately.

**There is no role.** The discriminator is `type`, and the values include `reasoning` and `tool_result`, record types the parser had never heard of. Anything that wasn’t `system` or `assistant` got bucketed into `user`. So a 149-record transcript with 47 real assistant turns counted as roughly three turns, the actual user messages. Everything else disappeared.

**Tool calls aren’t embedded in content blocks.** They’re a top-level array on assistant records:

``` bash
$ cat chat_history.jsonl | jq -c 'select(.type=="assistant")
    | {model_id, tools: (.tool_calls | length), first: .tool_calls[0].name}' | head -3
{"model_id":"grok-build","tools":2,"first":"read_file"}
{"model_id":"grok-build","tools":3,"first":"read_file"}
{"model_id":"grok-build","tools":2,"first":"grep"}
```

The parser scanned content blocks looking for tool-use records that never existed, so every session appeared to have made zero tool calls.

**The model was right there.** Every assistant record carried `"model_id": "grok-build"`. The importer had a `state.model` variable. It declared it, threaded it through the insert path, and never assigned it. A dead variable writing NULL ninety-five times.
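Taken together, the three breakages point at a parse loop that dispatches on `type`, counts the top-level `tool_calls` array, and actually assigns the model. What follows is a minimal Python sketch under assumptions, not the importer's real code: the importer's language isn't stated here, `SessionStats` and `parse_transcript` are hypothetical names, and the set of known record types comes only from the `jq` output above.

```python
import json
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Record types seen in the jq output above. Assumption: this list is complete.
KNOWN_TYPES = {"system", "user", "assistant", "reasoning", "tool_result"}


@dataclass
class SessionStats:
    model: Optional[str] = None
    tool_calls: int = 0
    by_type: Counter = field(default_factory=Counter)


def parse_transcript(lines):
    """Summarize one JSONL transcript, dispatching on `type` rather than `role`."""
    stats = SessionStats()
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        record = json.loads(raw)
        rtype = record.get("type")
        if rtype not in KNOWN_TYPES:
            # Fail loudly instead of silently bucketing unknown records as "user".
            raise ValueError(f"line {lineno}: unknown record type {rtype!r}")
        stats.by_type[rtype] += 1
        if rtype == "assistant":
            # Tool calls are a top-level array on the record, not content blocks.
            stats.tool_calls += len(record.get("tool_calls") or [])
            # Assign the model; the first non-empty value wins.
            stats.model = stats.model or record.get("model_id")
    return stats
```

Raising on an unknown `type` is the design choice that matters: a format change then surfaces as a failed import rather than as plausible-looking zeros.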

The real failure wasn’t misreading the transcript. It was assuming the transcript was the source of truth. The parser bug merely exposed the design flaw.

In hindsight the warning signs were obvious. Every session directory already contained multiple independent records of what happened: transcripts, event streams, session metadata, and git state. I had accidentally built an audit pipeline that ignored corroborating evidence and trusted the narrative, so when the narrative parser failed, the entire picture failed with it.

The session metadata alone contained everything I was missing:

```
{
  "current_model_id": "grok-build",
  "session_kind": "subagent",
  "num_chat_messages": 149,
  "created_at": "2026-06-05T00:28:06.848Z",
  "last_active_at": "2026-06-05T00:33:48.184Z"
}
```

Every missing column was already available. `session_kind` identifies subagents, `num_chat_messages` is the real turn count, `created_at` and `last_active_at` are the real timestamps, and additional metadata links the session back to the correct repository context. Meanwhile my importer was deriving timestamps from file modification times and inferring project identity from directory names.

The fix is not a patch to the parse loop. It’s architectural: trust authoritative session metadata for session-level facts, use transcripts only for message bodies, handle the record types that actually exist, and corroborate one source against another.
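One way to picture that split, as a hedged Python sketch rather than the actual fix: session-level columns come from the metadata object, and the transcript is used only to cross-check it. The function names are hypothetical. The check assumes `num_chat_messages` should equal the number of transcript records, which holds for the one session shown above (149 and 149) but is not confirmed in general.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Timestamps look like 2026-06-05T00:28:06.848Z. Replace the trailing Z
    # so fromisoformat also accepts it on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def session_facts(metadata: dict, transcript_lines: list) -> dict:
    """Take session-level facts from metadata; use the transcript to corroborate."""
    record_count = sum(1 for line in transcript_lines if line.strip())
    claimed = metadata["num_chat_messages"]
    return {
        "model": metadata["current_model_id"],
        "is_subagent": metadata["session_kind"] == "subagent",
        "turn_count": claimed,
        "started_at": parse_ts(metadata["created_at"]),
        "ended_at": parse_ts(metadata["last_active_at"]),
        # Store disagreement between sources instead of hiding it.
        "sources_agree": record_count == claimed,
    }
```

A row where `sources_agree` is false is exactly the kind of signal the original importer had no way to raise.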

Here’s the uncomfortable bit. I could not answer “what did the swarm actually do” from the swarm’s own records. I had to reconstruct the night from two sources the agents didn’t write: the backlog table, which recorded what changed state, and git history, which recorded what actually landed. Those two agreed, and they’re trustworthy precisely because the agents couldn’t rewrite them afterward.

``` bash
$ git log --since="2026-06-04 23:30" --until="2026-06-05 08:00" \
    --pretty=tformat: --numstat \
  | awk 'NF==3 {a+=$1; d+=$2; f++}
         END {printf "%d files, +%d/-%d\n", f, a, d}'
51 files, +2838/-112
```

That number I believe. The transcript-derived “99 turns” I don’t.

This has nothing to do with Grok specifically. Autonomous agents are trivial to launch and genuinely hard to audit, and the gap between those two facts is where teams are going to get hurt over the next few years. An agent telling you what it did is not evidence. It’s a claim, and it’s a claim sourced from the least trustworthy possible place: the thing being audited.

I filed two tickets against my own importer. One fixes the parser; the other forces a re-parse of the 95 corrupted rows, since the incremental loader skips files whose size hasn’t changed and a logic fix doesn’t change the file. They’ll get fixed. But the design flaw matters more than the bug: verify agent work from sources the agent cannot write to, and preferably from multiple sources that must agree. Otherwise you’re not auditing behavior, you’re auditing a story about behavior.
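The re-parse ticket exists because file size is the loader's only change signal. One common way to avoid that class of problem, sketched here in Python as an assumption rather than a description of the real loader, is to include a parser version in the skip key so a logic fix invalidates old rows by itself. `PARSER_VERSION`, `needs_parse`, and `mark_parsed` are all hypothetical names.

```python
# Hypothetical: bump this whenever the parsing logic changes.
PARSER_VERSION = 2


def needs_parse(path, size, seen, parser_version=PARSER_VERSION):
    """True if the file is new, changed size, or was parsed by different logic."""
    return seen.get(path) != (size, parser_version)


def mark_parsed(path, size, seen, parser_version=PARSER_VERSION):
    """Record the size and parser version this file was last ingested with."""
    seen[path] = (size, parser_version)
```

In a real loader `seen` would live in the database next to the ingested rows; a plain dict keeps the sketch self-contained.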

If your only record of an agent’s actions comes from a channel the agent controls, you don’t have observability. You have a press release with a timestamp.

The swarm had a productive night. I just had to prove it the hard way.