# Reading the Agent Log Like a Detective

> Source: <https://tacoda.medium.com/reading-the-agent-log-like-a-detective-cdf92758da24?source=rss-bf474619cf47------2>
> Published: 2026-06-24 16:21:53+00:00

The agent shipped a broken migration last month. The PR was clean, the tests passed locally, the reviewer signed off in twelve minutes, and the deploy to staging green-lit on the first try. Production started erroring within four minutes of the merge because the migration assumed a column existed that had been renamed two weeks earlier.

The first instinct was to blame the model. *Bad agent, didn’t read the docs.* The second instinct was to add a rule. *Always check the current schema before writing migrations.* Both instincts felt productive. Neither would have fixed it.

What fixed it was reading the transcript. The agent had, in fact, looked at the schema file. The schema file in the repo was correct. It had been updated when the column was renamed. But the agent had also loaded a database/CLAUDE.md rule that described the column under its *old* name as the canonical reference. The rule was stale. The agent followed the rule, but the rule was wrong.

You could call this a model problem: the agent trusted a stale rule over the live schema file sitting right next to it. **But that’s not the lever you can pull.** The fixable cause was in the harness (the rules, skills, hooks, and tools wrapped around the model), and here it was a rule that *lied*. It was visible in the log if you read it forensically.

That’s what this post is about. Reading agent transcripts as evidence, not as narrative. The five failure modes that show up over and over. And the habit of treating every wrong commit as a case to investigate before reaching for a fix.

The first move when an agent ships bad code is to resist the urge to fix and move on. The fix is downstream. The question worth asking is *how did the agent decide to do that*. The answer lives in the transcript.

Most modern agent harnesses save full transcripts somewhere. Claude Code stores them in ~/.claude/projects/<project>/, Cursor keeps recent ones in its UI, and most CI-driven agent runs log to a build artifact. The transcript is your scene of the incident. It records what the agent loaded, what it read, what it ran, what it produced, and in what order. Everything you need to reconstruct its reasoning is in that file.

The discipline is to treat the transcript the same way an SRE treats a stack trace: not as a list of complaints, but as the most reliable evidence of what happened. The agent’s text answers in the transcript are sometimes hand-wavy. The tool calls are not. The tool calls are facts.

Read the tool calls first.

Across the agent transcripts I’ve reviewed, the same five failure modes keep producing wrong-code incidents. Each one has a cause in the model but each also has a cause in the harness, and the harness is the half you can change. All five are visible in the log.

The agent never read the file that mattered. It wrote a migration without checking the schema. It wrote a test without checking the test conventions. It added a new module without checking how other modules in the same area were structured.

How to spot it in the log: scroll back to the start of the task and look at the Read and Glob tool calls. Note which files the agent loaded. Compare to the files the change touched. The pattern is: *the agent edited file X, but never read file Y, where Y was the file that defined the convention X had to match*.
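That comparison can be partly mechanized. Here is a minimal sketch, assuming the transcript has been rendered as bracketed tool calls such as `[Read path]` and `[Edit path]`; real harness logs will need their own parser. The `must_read` map is hypothetical: a hand-maintained list of which convention files each edited file depends on.

```python
import re

# Matches bracketed tool calls such as "[Read app/Models/User.php]".
# The bracket notation is an assumed, simplified transcript rendering.
TOOL_CALL = re.compile(r"\[(Read|Glob|Edit|Write)\s+([^\]]+)\]")


def files_by_tool(transcript: str) -> dict[str, set[str]]:
    """Group the paths in a transcript by the tool that touched them."""
    seen: dict[str, set[str]] = {}
    for tool, path in TOOL_CALL.findall(transcript):
        seen.setdefault(tool, set()).add(path.strip())
    return seen


def unread_dependencies(
    transcript: str, must_read: dict[str, set[str]]
) -> dict[str, set[str]]:
    """For each edited file, list the convention files that were never read.

    `must_read` is a hypothetical, hand-maintained map from an edited file
    to the files that define the conventions it has to match.
    """
    calls = files_by_tool(transcript)
    read = calls.get("Read", set())
    changed = calls.get("Edit", set()) | calls.get("Write", set())
    gaps: dict[str, set[str]] = {}
    for path in sorted(changed):
        missing = must_read.get(path, set()) - read
        if missing:
            gaps[path] = missing
    return gaps
```

An empty result doesn't prove the agent had enough context. It only says the dependencies you already know about were loaded.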

The fix isn’t in the model. The fix is to make sure the relevant context loads automatically: by giving the rule a sharper path scope (so it loads when the agent touches the area), by adding a skill that walks the agent through the right files in order, or by adding a hook that injects the relevant context at the start of tasks of that shape.

The agent did what it was told to do. The instruction set didn’t include *read Y first*. That’s a harness gap.

The agent followed a rule that was wrong. The rule was outdated, or it referenced a pattern the codebase no longer uses, or two rules in the harness said different things and the agent picked the one that produced the bug.

How to spot it in the log: look at the rules the agent cited in its reasoning. If the rule cited matches the code change exactly, but the change is still wrong, the rule is the bug. Cross-check the rule’s claims against the current code. *Rule says use UserRepository, code uses UserService* is a smoking gun.

The fix is the rule-rot audit. Rules drift faster than people think. A rule written six months ago can be confidently citing a class that was renamed in April. The agent trusts the harness; if the harness lies, the agent ships lies.
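A rule-rot audit can start very small. The sketch below is one possible shape, not a finished tool: it pulls class-like names out of a rule's text and reports the ones that no longer appear anywhere in the source it is given. The CamelCase heuristic and the in-memory `source_files` map are assumptions made to keep the example self-contained.

```python
import re

# CamelCase names with at least two "humps", e.g. UserRepository.
# A heuristic: it will miss snake_case names and single-word classes.
CLASS_NAME = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")


def stale_references(rule_text: str, source_files: dict[str, str]) -> list[str]:
    """Return class-like names a rule mentions that no source file contains.

    `source_files` maps a path to that file's contents; how you load it
    (walking the repo, reading from git) is left out of this sketch.
    """
    stale = []
    for name in sorted(set(CLASS_NAME.findall(rule_text))):
        pattern = re.compile(rf"\b{re.escape(name)}\b")
        if not any(pattern.search(text) for text in source_files.values()):
            stale.append(name)
    return stale
```

Anything it returns is a candidate for review, not a verdict. A rule may legitimately name something that lives outside the files you scanned.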

The contradicted-rules failure mode is the one I find most embarrassing because the agent did exactly what the harness said to do. The agent isn’t wrong. The harness is wrong. And the harness was wrong because I didn’t audit it.

The agent called a function that doesn’t exist. Used an option that isn’t supported. Imported from a module that doesn’t export what it imported.

How to spot it in the log: search the agent’s output for function names and option names, then grep the actual codebase or library for them. The mismatches are obvious. The agent might have hallucinated findOrFailWithLock when the real method is findOrFail and there’s no lock option at all.
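The grep can be scripted. This is a rough sketch that assumes PHP-style call syntax (`->name(` and `::name(`) and treats a method as real only if a `function name(` definition appears in the source text you hand it. Both regexes are simplifications, and frameworks that resolve methods dynamically will produce false positives.

```python
import re

# Method calls in PHP-style code: $obj->name( or Class::name(
CALL = re.compile(r"(?:->|::)\s*([A-Za-z_]\w*)\s*\(")
# Method or function definitions: function name(
DEFINITION = re.compile(r"\bfunction\s+([A-Za-z_]\w*)\s*\(")


def unknown_calls(agent_code: str, known_sources: list[str]) -> list[str]:
    """Return called method names with no matching definition in known_sources."""
    defined: set[str] = set()
    for source in known_sources:
        defined.update(DEFINITION.findall(source))
    called = set(CALL.findall(agent_code))
    return sorted(called - defined)
```

Treat the output as a list of names to go look up by hand. A type checker does this job properly; the script is only there to make the first pass through a transcript faster.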

The fix has two parts. Short-term, tighten the harness: documentation links or type signatures that point at the real reference are the cheapest cure. Longer-term, give the agent tools instead of documentation. An MCP server that exposes the library’s actual API (so the agent can *call* it instead of guessing) ends the hallucination class. A type checker run as a sensor (an automated check wired into the harness that fails the run when something’s wrong) catches the hallucinations that slip through.

The recurrence pattern worth noting: hallucinations cluster around APIs the model is confidently familiar with from training but where the project has chosen to use the library differently than the common convention. The model isn’t making up the function from nothing; it’s substituting the public-API version when the project actually wraps it.

The agent changed more than it was asked to. The task was *fix a bug in the login flow*. The diff also reformatted three unrelated files, renamed a helper for clarity, and updated four imports because the rename touched them. The bug fix is correct. The reformatting is harmless. But the PR is now many times the size it should be, and the reviewer’s eye is going to slide past the actual fix.

How to spot it in the log: look at the Edit and Write tool calls. Compare the files touched against the files the task description called out. The gap is the creep.

The fix is a rule and a sensor. The rule says *touch only what you must, clean up only your own mess*. The sensor is a pre-PR check that fails if the diff is wider than a budget: N files, M lines, whatever your numbers happen to be.
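A diff-budget sensor can be a few lines. The sketch below parses the tab-separated output of `git diff --numstat` (added, deleted, path) and reports any budget that was exceeded. The default budgets of 5 files and 200 lines are placeholders, not recommendations.

```python
def diff_size(numstat: str) -> tuple[int, int]:
    """Return (files_changed, lines_changed) from `git diff --numstat` output."""
    files = lines = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) != 3:
            continue  # skip blank or unexpected rows
        added, deleted, _path = parts
        files += 1
        # Binary files report "-" for both counts: count the file, not lines.
        if added.isdigit() and deleted.isdigit():
            lines += int(added) + int(deleted)
    return files, lines


def over_budget(numstat: str, max_files: int = 5, max_lines: int = 200) -> list[str]:
    """Return a human-readable problem for each budget the diff exceeds."""
    files, lines = diff_size(numstat)
    problems = []
    if files > max_files:
        problems.append(f"{files} files changed, budget is {max_files}")
    if lines > max_lines:
        problems.append(f"{lines} lines changed, budget is {max_lines}")
    return problems
```

In a pre-PR check, the input would come from running `git diff --numstat` against the target branch, and a non-empty result would fail the run. Some tasks are legitimately wide, so the budget probably needs an explicit override.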

Scope creep is one of the most common failure modes in my experience and the most invisible. It doesn’t break anything. It just makes every PR worse to review and every regression harder to track down.

The agent committed before the work was done. The tests pass because the agent skipped writing the failing test first. The acceptance criteria look met because the agent declared them met without verifying. The PR description claims feature X is implemented when, under the hood, feature X is a stub.

How to spot it in the log: look for the moment the agent decided the work was done. Compare against the actual evidence of doneness. Did the agent run the test it claimed proves the fix? Did the test exist before the change, or did the agent write it after seeing the change pass? Did the agent verify the acceptance criteria, or did it skip the verification step and write the PR description from memory of what the task asked?

The fix is procedural. A skill called finishing-work that walks the agent through the verification steps before allowing a commit. Perhaps a pre-commit hook that fails if the changed files don’t have at least one paired test change. An agent that has to produce a *completion report* (see my earlier post on [post-mortems for agent runs](https://blog.tacoda.dev/post-mortems-for-agent-runs-2cceeaf13f2e)) before declaring done.
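The paired-test hook is the easiest of these to prototype. Here is a minimal sketch, assuming a layout where source lives under `app/`, tests live under `tests/`, and a test for `Foo.php` is named `FooTest.php`. All three conventions are assumptions; adjust them to your project.

```python
from pathlib import PurePosixPath


def unpaired_sources(changed_paths: list[str]) -> list[str]:
    """Return changed source files that have no matching changed test file.

    Assumes source under app/, tests under tests/, and tests named
    <SourceStem>Test.<ext>. Files outside app/ are ignored.
    """
    changed_tests = {
        PurePosixPath(p).stem for p in changed_paths if p.startswith("tests/")
    }
    unpaired = []
    for path in changed_paths:
        if not path.startswith("app/"):
            continue
        if f"{PurePosixPath(path).stem}Test" not in changed_tests:
            unpaired.append(path)
    return unpaired
```

A hook built on this would pass it the staged file list and refuse the commit when the result is non-empty. It only proves a test file changed. It says nothing about whether the test actually exercises the fix, so the transcript is still where you check that.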

Premature commit is the failure mode where the agent looks most productive and is least useful. The PR ships, the work isn’t finished, and the next person to touch the code finds out.

Here’s a transcript fragment from a real run I investigated. The task was to add a new field to a user profile.

```
> User: Add a `pronouns` field to the user profile. Optional string, nullable.

[Read .claude/CLAUDE.md]
[Read app/Models/User.php]
[Glob database/migrations/*_users_*.php]
[Read database/migrations/2024_05_12_add_email_to_users.php]
[Write database/migrations/2026_06_18_add_pronouns_to_users.php]
[Edit app/Models/User.php] // added 'pronouns' to $fillable
[Bash: php artisan migrate --pretend]
[Edit app/Http/Resources/UserResource.php] // added pronouns to toArray
[Bash: vendor/bin/pest tests/Feature/UserResourceTest.php]
[Write app/Http/Controllers/UserController.php] // added validation rule
```

The PR shipped. The reviewer signed off. Production threw an error the next morning, from the admin panel: Undefined array key 'pronouns'. The admin user view read $user['pronouns'] straight off the resource array, and the resource never put it there.

Reading the transcript: the agent loaded the project’s CLAUDE.md, looked at the User model, looked at a recent migration as a template, and wrote a similar migration. It also added the field to the User resource and the controller validation. It did *not* read app/Http/Resources/AdminUserResource.php, which the admin panel uses and which lists its fields explicitly, key by key. Unlike the model’s $fillable, the admin resource doesn’t auto-pick up new columns. A field exists there only if someone names it. The agent never opened the file, so pronouns was never added, and the admin view that assumed the key blew up.

The failure mode: *missing context*. The harness had no rule that said *user-profile fields appear in three places: User, UserResource, AdminUserResource*. The agent followed the convention from the two places it knew about and shipped a partial change.

The fix wasn’t *be smarter, agent*. The fix was a path-scoped rule in app/Http/Resources/CLAUDE.md that named the resource files that needed to stay in sync. The rule went in. The sensor (a quick AST check that any new $fillable field appears in both resource files) went in alongside. The next pronouns-shaped task didn’t repeat the bug.
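For illustration, here is roughly what such a sync check does. This is not the sensor from that project: it is a regex-based approximation rather than a real AST check, and it compares every `$fillable` field, not only newly added ones. A real version would likely need an ignore list for fields that should never be serialized.

```python
import re

# Contents of the first `$fillable = [...]` array literal in a model file.
FILLABLE = re.compile(r"\$fillable\s*=\s*\[(.*?)\]", re.DOTALL)
# Any quoted identifier, e.g. 'pronouns' or "pronouns".
QUOTED = re.compile(r"['\"]([A-Za-z_]\w*)['\"]")


def fillable_fields(model_source: str) -> set[str]:
    """Extract the field names listed in a model's $fillable array."""
    match = FILLABLE.search(model_source)
    return set(QUOTED.findall(match.group(1))) if match else set()


def missing_from_resources(
    model_source: str, resources: dict[str, str]
) -> dict[str, set[str]]:
    """Map each resource name to the fillable fields it never mentions."""
    fields = fillable_fields(model_source)
    report: dict[str, set[str]] = {}
    for name, source in resources.items():
        missing = fields - set(QUOTED.findall(source))
        if missing:
            report[name] = missing
    return report
```

Run against the pronouns change, a check like this would have flagged the admin resource before the PR opened.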

That’s the loop. The transcript pointed at the gap and the harness change closed the gap. The model wasn’t the variable.

The payoff of reading transcripts isn’t fixing the specific bug. It’s the patterns that show up after twenty or thirty of them.

You start noticing which files the agent rarely reads, and asking why. You notice which rules it cites a lot (they’re load-bearing) and which it never cites (they’re dead weight). You notice the directories where transcripts go wrong most often — usually the directories with the worst seams, the least context, or the most conventions that aren’t written down.

There’s a cost the failure modes hide from each other. The fix for missing context is almost always a new rule, and every rule you add is a future candidate for contradicted-rules rot. Fix enough gaps by piling on instructions and you build the exact stale-harness problem that produced failure mode two. So the audit cuts both ways: you read transcripts to find the rule you’re missing, and you read them to find the rule that’s now lying. A harness isn’t a pile of rules that only grows. The transcripts tell you which rules to add and which to retire, and the second list matters as much as the first. The transcript is the highest-resolution feedback signal the harness produces.

Two practical reading habits help.

**Read at least one failed transcript a week.** Not all of them. One. Pick the one that produced the most surprising failure. Read it end to end. Note the failure mode. Ask whether the harness has a gap there. The discipline isn’t to fix everything. Rather, it’s to develop the muscle.

**Keep a transcript notebook.** A file in the repo. Each entry: link or paste of the transcript, the failure mode (pick from the five above), the gap in the harness, the change that fixed it. After ten entries, the notebook becomes a map of where your harness is weak.

The notebook doesn’t have to be pretty. It has to exist.
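If you'd rather keep the notebook as structured data than as prose, one possible shape is sketched below. The field names and the `weak_spots` helper are hypothetical, and a Markdown file with the same four fields works just as well.

```python
from collections import Counter
from dataclasses import dataclass

# The five failure modes described in this post.
FAILURE_MODES = (
    "missing context",
    "contradicted rules",
    "hallucinated API",
    "scope creep",
    "premature commit",
)


@dataclass(frozen=True)
class NotebookEntry:
    transcript: str    # link to, or path of, the saved transcript
    failure_mode: str  # one of FAILURE_MODES
    harness_gap: str   # what the harness failed to provide
    fix: str           # the harness change that closed the gap

    def __post_init__(self) -> None:
        if self.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {self.failure_mode!r}")


def weak_spots(entries: list[NotebookEntry]) -> list[tuple[str, int]]:
    """Tally entries by failure mode, most frequent first."""
    return Counter(entry.failure_mode for entry in entries).most_common()
```

Forcing each entry into one of the five modes is the point. The tally is what turns ten separate incidents into a map.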

A theme runs through every failure mode above. You can argue each one back to *“the model should have known better”*. But arguing about whose fault it is doesn’t change the rate. Changing the harness does. Missing context is a closable harness gap. Contradicted rules are stale rules. Hallucinated APIs are weak documentation or missing tools. Scope creep is an unset constraint. Premature commit is a missing procedure.

This isn’t a defense of the model. The model is fallible, and the next generation of models will be more capable than this one. The point is that *the cheap improvements live in the harness, not in the model*. You can’t retrain the model from your laptop. You can fix a rule, scope a rule, add a sensor, sharpen a tool, and write a procedure. All five are at your fingertips. All five reduce the rate of the same bug shipping again.

Forensic reading of transcripts is the practice that tells you which one to fix.

Find the last commit your agent made that you wish it hadn’t, or the last PR that came back with a “this isn’t quite right” review. Open the transcript that produced it.

Read the tool calls first. Note which files the agent loaded and which it should have loaded. Find the rule it cited or skipped. Cross-check the agent’s claims against the actual code. Pick which of the five failure modes you’re looking at.

Then write the harness change that would have stopped it: a new rule, a path scope, a sensor, a skill, or a tool. Ship it the same day.

After three or four passes, you’ll start seeing the same failure mode recur. That’s the signal. The pattern is the rule waiting to be written. The transcript is just the place the pattern lives until you spot it.

The model can’t read its own transcript. You can.
