I recorded my agent auditing a 36k-file Rails app: the play-by-play

A developer tested a coding agent on a 36k-file Rails monolith (GitLab) by asking it to audit all dependents of MergeRequest. Without a codebase map, the agent found only 2 of 16 actual dependents in five minutes, producing a report that looked complete but missed 14 critical dependencies. A second run with a codebase map improved results, highlighting the danger of relying on agents for large-scale code audits.

I gave a coding agent a real maintainer chore on the largest open-source Rails monolith there is, and recorded the whole session. Then I gave it the same chore with a map of the codebase and recorded that one too. What follows is the play-by-play. The agent's moves are real, lightly trimmed for length; my commentary is the indented notes between them. Watch where the first run stops, and why it doesn't know it stopped.

The repo is GitLab, indexed at a pinned commit:

gitlabhq @ 1f9c256f0
tracked files / ruby       68,289 / 29,784
files indexed              36,829
coverage                   100%
symbols / graph edges      177,929 / 1,121,147

A million edges in one Rails app. Nobody holds that in their head. The question is whether the agent can rebuild enough of it from files to do the job.

The task, identical on both runs:

"You are about to rework how a MergeRequest, and everything attached to it, behaves when it changes or is torn down. Before touching the model, audit every place that depends on the MergeRequest contract."

The gold set was built by hand against the source first: 16 scattered dependents, planted across app/services, app/workers, app/models, app/graphql, app/serializers, and lib/, so no single grep rounds them all up. The agent never sees the list. It gets graded against it.

Same model both runs: Claude Code, Opus 4.8. Same prompt. Same commit. The only variable is the map.

00:00  me → audit every dependent of MergeRequest before the rework

00:06  agent → grep -rin "merge_request" app/ lib/ ee/ | wc -l = 41,800+

    First move, the only move it has without structure: search the token. Forty-one thousand hits. That's not a list of dependents, it's a second codebase. The agent has a token budget, not a weekend.

00:31  agent → grep -rinE "belongs_to :merge_request|has_many :merge_requests" = 380+ named associations

    This part it does well.
    The named associations come back fast, and the Rails conventions earn their reputation here: a service in app/services, a worker in app/workers, each named for what it is. The agent reads these correctly. The easy third of the answer.

01:10  agent → grep -rinE "merge_request_id|\.merge_request\b|MergeRequest\." | wc -l = 9,200+

    Now the real work, and the dead end. Nine thousand hits, and the agent has to decide which are dependents and which are an incidental local variable or a log line. It samples. It can't read all of them. It walks straight past things like this:

    Issuable is mixed into both Issue and MergeRequest. Code that resolves notes on "the issuable" depends on MergeRequest without ever typing the string a grep could match.

        module Issuable
          def resolvable_discussions
            @resolvable_discussions ||= discussion_notes.resolvable
          end
        end

    The notes-resolution service depends on MergeRequest through the Issuable concern. It never types the class name. No grep for MergeRequest reaches it. The dependents that matter are exactly the ones with no shared token to match on.

04:50  agent → writes the audit: "Audited the dependents of MergeRequest. Found: the named associations, the diff serializer, the pipeline link. The contract is depended on primarily through..."

05:12  agent → done.

    Five minutes, near the harness ceiling. It ground honestly the whole way and produced a calm, finished-looking report. It never invented a citation; every dependent it named is real. It just found 2 of the 16 that matter. On a rerun, 1. It does not know it stopped at two. Nothing in the output could tell it.

Cold result: 2 then 1 of 16. Scattered recall 0.09. The audit looked complete.

That last sentence is the whole danger. On a small repo you'd notice two-sixteenths. On a monolith you ship it, and the fourteen you missed are the ones that page someone at 3am.
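A minimal plain-Ruby sketch of why the token search walks past this kind of dependent. It is a stand-in, not GitLab's code: the `ResolveNotes` class, the stubbed `discussion_notes` data, and the hash-based notes are all hypothetical, and the real concern is built on Rails rather than a bare module. The service is defined from the same text that gets searched, so the "grep" and the runtime are looking at identical source.

```ruby
# Plain-Ruby stand-in for the pattern above; not GitLab's actual code.
module Issuable
  def resolvable_discussions
    @resolvable_discussions ||= discussion_notes.select { |n| n[:resolvable] }
  end
end

class Issue
  include Issuable

  # Stubbed data, purely illustrative.
  def discussion_notes
    [{ id: 1, resolvable: false }]
  end
end

class MergeRequest
  include Issuable

  # Stubbed data, purely illustrative.
  def discussion_notes
    [{ id: 2, resolvable: true }, { id: 3, resolvable: false }]
  end
end

# Hypothetical service. Its source never contains the class name.
SERVICE_SOURCE = <<~RUBY
  class ResolveNotes
    def initialize(issuable)
      @issuable = issuable
    end

    def execute
      @issuable.resolvable_discussions.map { |note| note[:id] }
    end
  end
RUBY

eval(SERVICE_SOURCE) # define the service from the exact text we search

# What a token search sees: nothing to match on.
grep_hit = SERVICE_SOURCE.match?(/merge_?request/i)

# What happens at runtime: the service runs against a MergeRequest anyway.
resolved_ids = ResolveNotes.new(MergeRequest.new).execute

puts "grep hit: #{grep_hit}"
puts "resolved note ids on a MergeRequest: #{resolved_ids.inspect}"
```

The dependency only exists as an edge (service → concern → including class), which is why no amount of sampling the grep output would surface it.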
00:00  me → identical prompt

00:03  agent → sense blast MergeRequest

        MergeRequest  app/models/merge_request.rb
        932 symbols in blast radius
          app/workers/merge_requests/process_auto_merge_worker.rb
          app/services/discussions/resolve_service.rb
          app/serializers/merge_request_diff_entity.rb
          app/graphql/types/issuable_type.rb
          ee/app/services/ee/milestones/promote_service.rb
          lib/gitlab/url_builder.rb
          ... 926 more

    One call. 932 symbols, the resolved set, returned in a single shot instead of reconstructed from 41,800 grep hits. The notes-resolution service that the Issuable concern hid? It's right there in the list. The map walked the edge the grep couldn't see.

00:18  agent → reads each candidate, pins to file:line

03:40  agent → writes the audit: "16-item dependency audit of MergeRequest, each pinned to file:line: auto-merge worker, notes-resolution service, cycle-analytics builder, API discussions, GraphQL issuable, Jira integration, milestone promotion, ghost-user handler, timelog..."

    Same model. Same budget, spent differently: on reading and pinning instead of hunting. It caught 13 of 16 on its best run, with a floor of 10. Twelve of the dependents it found are ones the cold runs reached in no attempt:

    auto-merge worker · notes-resolution service · cycle-analytics builder · API discussions · GraphQL issuable · Jira integration · milestone promotion · ghost-user handler · timelog · timeline event · URL builder · enterprise discussion

Mapped result: 10 then 13 of 16. Scattered recall 0.72, full audit 0.26 → 0.67.

The map cost more here, and I'm not hiding it. On this run the map billed about 9% more tokens, not fewer (27,604 → 30,128). On other repos it came in cheaper. Token cost is task-dependent and I'd never compare it across agents. What held is reach: 2.6x more of the real dependent set, plus the twelve silent breaks. Nine percent more tokens to go from two of sixteen to thirteen is a rounding error against the incident you didn't have.
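The arithmetic behind those figures can be checked in a few lines. This is a sketch under one assumption: that "scattered recall" is hits pooled across the two runs, divided by the gold-set size times the number of runs. The post does not define the metric, but this reading reproduces both reported values. The helper name `pooled_recall` is mine.

```ruby
GOLD_SIZE = 16

# Hits per run, as reported in the transcripts.
cold_hits   = [2, 1]
mapped_hits = [10, 13]

# Assumption: recall is pooled over runs (total hits / total gold items).
def pooled_recall(hits, gold_size)
  hits.sum.to_f / (gold_size * hits.size)
end

cold_recall   = pooled_recall(cold_hits, GOLD_SIZE)
mapped_recall = pooled_recall(mapped_hits, GOLD_SIZE)

# Token overhead of the mapped run relative to the cold run.
cold_tokens   = 27_604
mapped_tokens = 30_128
overhead_pct  = (mapped_tokens - cold_tokens) * 100.0 / cold_tokens

puts format("cold recall:    %.2f", cold_recall)    # 3 / 32  -> 0.09
puts format("mapped recall:  %.2f", mapped_recall)  # 23 / 32 -> 0.72
puts format("token overhead: %.1f%%", overhead_pct) # -> 9.1%
```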
The first time I ran this, the map lost, and that's why I trust it now. Early runs scored 12, then 8, then 1, all over the place. The lazy read was to blame the scenario. The transcripts said otherwise: sense blast was returning a different set of callers on each call. On a hub this size almost every dependency is a plain method call sharing one confidence score, and the index capped that tied list with an unstable sort, even evicting direct callers for distant ones. A non-reproducible impact analysis, handed to any large-repo user, silently. The fix made the cap deterministic, with ties broken by confidence then direct-over-indirect, and it ships for everyone now. The benchmark was supposed to score the tool. It kept fixing it instead.

That determinism is also the quiet argument for why this is structure, not a model trick. The map computes the same 932 every time. A model infers a different answer every run; you watched it go 2 then 1. A better model infers more confidently, not more reproducibly. The map reads this repo at this commit, not a training snapshot, and any agent can call it over MCP. None of that rides on which model you run next quarter. It's the part that doesn't change when the model does.

This is worth watching on your own code, because your monolith has a MergeRequest too. Pick it: the model that half your services reach into and nobody fully tracks. Ask your agent cold, "before I change how this model is torn down, find every place that depends on it." Watch the grep grind. Count the answer. Then give it the map.

→ curl -fsSL https://luuuc.github.io/sense/install.sh | sh
→ sense scan in the root of the app that pages your team at night
→ sense setup to connect your agent

Ask again and diff the two transcripts. On a tree this size, the dependents you couldn't find cold are exactly the ones the change would have broken.

The full session logs, the answer key, every transcript for thirteen repos.
https://github.com/luuuc/sense/blob/main/bench/verticals/ruby-rails/results/report.md

I build the map in this recording. Everything you'd need to call me wrong is public: the transcripts, the harness, the pinned commit, the judge. So check the session instead of taking my read of it.

PS. The scariest frame in that whole recording is the cold run writing a composed, confident audit of two dependents out of sixteen and signing off. No flailing. That composure is the thing to be afraid of on a big repo.
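A footnote for anyone who wants the determinism fix from the benchmark story in concrete form. This is a minimal sketch of a deterministic cap, not the tool's implementation: the `Caller` struct, its field names, and the sample data are hypothetical, and the final tiebreak on symbol name is my assumption (the post names only confidence and direct-over-indirect). Ruby's `sort_by` makes no stability guarantee, so entries that tie on every key can come back in any order unless the key is made total.

```ruby
# Hypothetical caller record; field names are illustrative, not the tool's schema.
Caller = Struct.new(:symbol, :confidence, :direct)

# Cap a caller list deterministically: highest confidence first, direct
# callers before indirect ones, then symbol name as a last tiebreak so that
# fully tied entries always land in the same order regardless of input order.
def capped_callers(candidates, limit)
  candidates
    .sort_by { |c| [-c.confidence, c.direct ? 0 : 1, c.symbol] }
    .first(limit)
end

# On a hub, nearly everything shares one confidence score.
candidates = [
  Caller.new("indirect_far",  0.9, false),
  Caller.new("direct_b",      0.9, true),
  Caller.new("direct_a",      0.9, true),
  Caller.new("indirect_near", 0.9, false)
]

capped = capped_callers(candidates, 2).map(&:symbol)

# The same cap on a shuffled input returns the same set in the same order.
reshuffled = capped_callers(candidates.shuffle(random: Random.new(42)), 2).map(&:symbol)

puts capped.inspect     # ["direct_a", "direct_b"]
puts reshuffled.inspect # ["direct_a", "direct_b"]
```

With the direct flag in the sort key, a cap can no longer evict a direct caller in favor of a distant one at the same confidence, which is the failure the early runs hit.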