My agent swarm had a productive night. My pipeline lied about it.

A developer's Grok CLI agent swarm autonomously closed 37 backlog items and landed 20 commits overnight, adding 2,838 lines of code across 51 files. However, an audit revealed the developer's own data ingest pipeline had silently corrupted the entire session record, reporting zero subagents and only 99 total turns despite evidence of a dozen subagents launching within about three seconds. The schema mismatch stemmed from the importer being built against a Claude-Code-shaped assumption that failed to parse Grok CLI's distinct JSON structure, including missing role fields and tool calls stored as top-level arrays rather than embedded content blocks.

I gave a Grok CLI agent swarm one instruction around 1am and went to bed. By 5:30 it had closed 37 items off my operator backlog and landed 20 commits on main: 2,838 lines added, 112 removed, across 51 files. Good night’s work for a process I wasn’t awake for.

Then I tried to audit it, and discovered my own ingest pipeline had quietly turned the entire run into garbage.

The schema mismatch is mundane. The lesson underneath it is not.

I run a personal automation stack with an operator backlog: a Postgres-backed queue of tickets tagged by whether an agent can safely act on them unattended. My usual agents drain that queue against a set of contracts: an investigator role that diagnoses, and a worker role that ships with test gates. I wanted to see how a Grok CLI swarm would handle the same contracts.

The kickoff was one line: keep working through the queue, do the autonomous-safe work. The first session fanned out a dozen subagents in about three seconds. Each read the same prompt contracts, and each claimed tickets with a lease so they wouldn’t collide.

Every session writes a transcript to disk. My stack ingests those transcripts into Postgres so I can answer questions like “what did agent 7 actually run.” Standard local-first observability: the model provider doesn’t hold my data, I do.

I went to look at the per-session breakdown and the numbers were obviously wrong:

```sql
SELECT count(*)                            AS sessions,
       count(model)                        AS with_model,
       sum(turn_count)                     AS turns,
       count(*) FILTER (WHERE is_subagent) AS subagents
FROM grok_sessions;

 sessions | with_model | turns | subagents
----------+------------+-------+-----------
       95 |          0 |    99 |         0
```

Ninety-five sessions, zero with a model recorded, ninety-nine total turns, zero subagents. That immediately failed a sanity check: six sessions had `subagent-<uuid>` in their project name, and a dozen subagents launching in three seconds do not average one turn each. The data was lying.

The importer was written against a Claude-Code-shaped assumption.
It expected a `role` field on each record, tool calls embedded inside content blocks, and a model field somewhere on the message. Here is what Grok CLI actually writes, one JSON object per line:

```bash
$ cat chat_history.jsonl | jq -r '.type' | sort | uniq -c
  47 assistant
  47 reasoning
   1 system
  51 tool_result
   3 user
```

Three things break immediately.

There is no `role`. The discriminator is `type`, and the values include `reasoning` and `tool_result`, record types the parser had never heard of. Anything that wasn’t `system` or `assistant` got bucketed into `user`. So a 149-record transcript with 47 real assistant turns counted as roughly three turns, the actual user messages. Everything else disappeared.

Tool calls aren’t embedded in content blocks. They’re a top-level array on assistant records:

```bash
$ cat chat_history.jsonl | jq -c 'select(.type=="assistant") | {model_id, tools: (.tool_calls | length), first: .tool_calls[0].name}' | head -3
{"model_id":"grok-build","tools":2,"first":"read_file"}
{"model_id":"grok-build","tools":3,"first":"read_file"}
{"model_id":"grok-build","tools":2,"first":"grep"}
```

The parser scanned content blocks looking for tool-use records that never existed, so every session appeared to have made zero tool calls.

The model was right there. Every assistant record carried `"model_id": "grok-build"`. The importer had a `state.model` variable. It declared it, threaded it through the insert path, and never assigned it. A dead variable writing `NULL` ninety-five times.

The real failure wasn’t misreading the transcript. It was assuming the transcript was the source of truth. The parser bug merely exposed the design flaw.

In hindsight the warning signs were obvious. Every session directory already contained multiple independent records of what happened: transcripts, event streams, session metadata, and git state.
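
To make the three parser breaks concrete, here is a minimal Python sketch of a reader that keys on `type` instead of `role`, counts tool calls from the top-level `tool_calls` array, and actually assigns the model. It is not my importer's real code: the function name `summarize_transcript` and the keys of the returned dict are made up for illustration, and the set of known record types is just the five that showed up in the `jq` count.

```python
import json
from collections import Counter

# Record types observed in the transcript; anything else is reported, not guessed at.
KNOWN_TYPES = {"system", "user", "assistant", "reasoning", "tool_result"}


def summarize_transcript(lines):
    """Summarize a JSONL transcript, one JSON object per line (hypothetical helper)."""
    by_type = Counter()
    unknown = Counter()
    tool_calls = 0
    model = None

    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        kind = record.get("type")  # the discriminator is `type`, not `role`

        if kind not in KNOWN_TYPES:
            # Surface unknown types instead of bucketing them into "user".
            unknown[str(kind)] += 1
            continue

        by_type[kind] += 1
        if kind == "assistant":
            # Tool calls are a top-level array on assistant records.
            tool_calls += len(record.get("tool_calls") or [])
            # Assign the model rather than leaving it as a dead variable.
            model = record.get("model_id") or model

    return {
        "by_type": dict(by_type),
        "assistant_turns": by_type["assistant"],
        "tool_calls": tool_calls,
        "model": model,
        "unknown_types": dict(unknown),
    }
```

Calling it with `open("chat_history.jsonl")` as the `lines` argument would be the obvious usage. The design choice that matters is the `unknown_types` counter: a record type the parser has never heard of becomes a visible number instead of silently inflating or deflating another bucket.
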
I had accidentally built an audit pipeline that ignored corroborating evidence and trusted the narrative, so when the narrative parser failed, the entire picture failed with it.

The session metadata alone contained everything I was missing:

```json
{
  "current_model_id": "grok-build",
  "session_kind": "subagent",
  "num_chat_messages": 149,
  "created_at": "2026-06-05T00:28:06.848Z",
  "last_active_at": "2026-06-05T00:33:48.184Z"
}
```

Every missing column was already available. `session_kind` identifies subagents, `num_chat_messages` is the real turn count, `created_at` and `last_active_at` are the real timestamps, and additional metadata links the session back to the correct repository context. Meanwhile my importer was deriving timestamps from file modification times and inferring project identity from directory names.

The fix is not a patch to the parse loop. It’s architectural: trust authoritative session metadata for session-level facts, use transcripts only for message bodies, handle the record types that actually exist, and corroborate one source against another.

Here’s the uncomfortable bit. I could not answer “what did the swarm actually do” from the swarm’s own records. I had to reconstruct the night from two sources the agents didn’t write: the backlog table, which recorded what changed state, and git history, which recorded what actually landed. Those two agreed, and they’re trustworthy precisely because the agents couldn’t rewrite them afterward.

```bash
$ git log --since="2026-06-04 23:30" --until="2026-06-05 08:00" \
    --pretty=tformat: --numstat \
  | awk 'NF==3 {a+=$1; d+=$2; f++} END {printf "%d files, +%d/-%d\n", f, a, d}'
51 files, +2838/-112
```

That number I believe. The transcript-derived “99 turns” I don’t.

This has nothing to do with Grok specifically. Autonomous agents are trivial to launch and genuinely hard to audit, and the gap between those two facts is where teams are going to get hurt over the next few years. An agent telling you what it did is not evidence.
It’s a claim, and it’s a claim sourced from the least trustworthy possible place: the thing being audited.

I filed two tickets against my own importer. One fixes the parser; the other forces a re-parse of the 95 corrupted rows, since the incremental loader skips files whose size hasn’t changed and a logic fix doesn’t change the file. They’ll get fixed. But the design flaw matters more than the bug: verify agent work from sources the agent cannot write to, and preferably from multiple sources that must agree. Otherwise you’re not auditing behavior, you’re auditing a story about behavior.

If your only record of an agent’s actions comes from a channel the agent controls, you don’t have observability. You have a press release with a timestamp.

The swarm had a productive night. I just had to prove it the hard way.
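
As a postscript, the "sources that must agree" rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the fix I shipped: `corroborate` and the keys of the `parsed` dict (`record_count`, `model`, `is_subagent`) are hypothetical names, and I'm assuming `num_chat_messages` in the session metadata should equal the number of records the transcript parser saw. The metadata field names are the ones from the session metadata shown earlier.

```python
def corroborate(metadata, parsed):
    """Return a list of disagreements between session metadata and parsed transcript.

    `metadata` is the session metadata dict; `parsed` is whatever the transcript
    parser produced (hypothetical keys: record_count, model, is_subagent).
    An empty list means the two sources agree on every checked fact.
    """
    checks = [
        # (label, value derived from the transcript, value from session metadata)
        ("record count", parsed.get("record_count"), metadata.get("num_chat_messages")),
        ("model", parsed.get("model"), metadata.get("current_model_id")),
        ("is subagent", parsed.get("is_subagent"), metadata.get("session_kind") == "subagent"),
    ]
    return [
        f"{label}: transcript says {got!r}, metadata says {want!r}"
        for label, got, want in checks
        if got != want
    ]
```

Run against the numbers from that night, a row claiming 3 records, no model, and no subagent flag would disagree with the metadata on all three checks. A loader could refuse to insert, or flag the row, whenever the returned list is non-empty, so a broken parser produces noise instead of quiet garbage.
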