Nobody wants to review the robot's 600-line pull request

A developer at a service company describes the growing problem of reviewing AI-generated code, noting that while agents now produce high-quality pull requests, the volume overwhelms human reviewers, leading to shallow reviews that miss contextual errors. The author illustrates this with an example where an agent's deduplication logic assumed webhooks could be deduplicated on payload, conflicting with the team's use of a separate idempotency field.

An agent opened a pull request on our service last week. Six hundred lines. It rewrote how we handle webhook retries and deduplication, an area that is fiddly and easy to get subtly wrong. The diff was clean. The tests were green. The commit messages were better than mine usually are.

And I felt the specific dread that I think a lot of engineers are starting to feel in 2026. I was the reviewer. I had not written any of this. I had no idea why it was shaped the way it was. To review it properly, the way I would want my own code reviewed, I was looking at the better part of an hour of carefully reconstructing intent from the code itself. I did not have that hour. So I did what almost everyone does in that situation, which is skim it, decide it looked reasonable, and approve.

That moment is the actual problem with AI-written code, and it is not the one people argue about. The tired debate is whether agents write good code. In 2026 that argument is mostly over. They do. They plan, they read the codebase, they run the tests, they back out of dead ends, they open pull requests that clear most review bars. If you are still litigating whether the code is any good, you have not used a current agent in a while.

But here is what follows from that, and it is the part teams have not absorbed: if writing the code is no longer the slow step, then reviewing it is. And review does not scale the way generation does. An agent can produce five well-tested pull requests before lunch. Your senior engineers cannot deeply review five pull requests before lunch, not on top of their own work. The volume went up and the review capacity did not, and something has to give.

What gives is the depth of review. It degrades, quietly, into a skim. People approve fluent diffs they have not truly read, because reading them properly costs more time than anyone has. The green check still appears. It just means less than it used to.
That is a governance failure wearing the costume of a passing review, and it is happening on a lot of teams right now without anyone deciding it should.

It is worth being precise about why this is worse than the human version of the same problem. When a teammate sends you a pull request, you usually share context with them. You were in the standup. You saw the thread. You know roughly why they are doing this and what constraints they are under. You review the diff against intent you already hold in your head.

With an agent PR, that shared context is gone. You are handed a substantial, confident, well-structured diff and no narrative of why it is shaped the way it is. You cannot ask the agent in the hallway. So reviewing it well means reconstructing intent purely from the code, which is slow, error-prone, and exactly the work that gets skipped under time pressure. The fluency makes it worse, not better, because fluent code invites you to assume the thinking behind it was equally sound.

The mistakes that survive this are not syntax. They are contextual. The agent leaned on an assumption that is true in general and false for your system. In our case, the dedup rewrite quietly assumed webhooks could be deduplicated on payload, when our whole reason for keying on a separate idempotency field was that one upstream sends byte-identical payloads for genuinely distinct events. Perfectly reasonable code. Wrong for us, for a reason that lives in our history and nowhere in the diff.

The instinctive responses do not hold up. "Review everything as carefully as before" does not survive contact with rising volume. The math does not work. "Trust the agent" is not a governance posture, it is the absence of one. "Add more CI rules" helps with the syntactic class of problem that is already mostly solved, and does little for architectural mistakes, because a contextual violation is not a lint rule, it is a conflict with a decision your team made that no checker knows about.
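
To make the dedup mistake above concrete, here is a minimal sketch of the two keying strategies side by side. This is not our actual service code. The event shape, the `idempotency_key` field name, and the function names are all hypothetical, chosen only for illustration. It shows how an upstream that sends identical payloads for distinct events gets one of them silently dropped when you key on payload.

```python
import hashlib
import json


def payload_key(event: dict) -> str:
    """Dedup key derived from the payload contents (the assumption the agent made)."""
    body = json.dumps(event["payload"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def idempotency_key(event: dict) -> str:
    """Dedup key taken from a separate field (hypothetical name) set by the sender."""
    return event["idempotency_key"]


def dedupe(events, key_fn):
    """Keep the first event seen for each key, preserving arrival order."""
    seen = set()
    kept = []
    for event in events:
        key = key_fn(event)
        if key in seen:
            continue
        seen.add(key)
        kept.append(event)
    return kept


# Hypothetical traffic: one event, one retry of it, and one genuinely
# distinct event whose payload happens to be identical.
same_payload = {"type": "invoice.paid", "amount": 100}
events = [
    {"idempotency_key": "evt-1", "payload": dict(same_payload)},
    {"idempotency_key": "evt-1", "payload": dict(same_payload)},  # retry of evt-1
    {"idempotency_key": "evt-2", "payload": dict(same_payload)},  # distinct event
]

by_payload = dedupe(events, payload_key)
by_key = dedupe(events, idempotency_key)

print(len(by_payload))  # 1: the distinct event evt-2 was dropped along with the retry
print(len(by_key))      # 2: only the retry was dropped
```

Both versions pass any test that only sends retries. Nothing in the code tells you which one is right. That knowledge is in the team's history of how a particular upstream behaves, which is the point.
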
The thing that actually changes the equation is not reviewing harder. It is changing what review consumes.

Here is the shift. Instead of handing a reviewer a raw diff and asking them to reverse-engineer the intent, hand them the story of how the work came to be, first. The decisions it relied on. The constraints it was operating under. The context it was given before it started. The questions it hit and who answered them. The approaches it tried and reverted. The points where a human stepped in and steered it. Then the reviewer reads that narrative, and reads the diff second, with intent already established. A contextual mistake that is invisible when you are staring at code becomes obvious the moment you can see the reasoning behind it. If I had seen, up front, that the agent's plan rested on deduplicating by payload, I would have caught it in five seconds, because I know why that is wrong here. Buried in six hundred lines, I missed it entirely.

This is what I am building as Branch Story. For any branch or pull request, it reconstructs that narrative, grounded in your team's actual captured decisions rather than guessed from the diff. It reads as a story of how the work happened, not a log dump, because a log dump is just more material to skim and the entire point is to stop skimming.

It works for human-written branches too, which matters more than it sounds. A teammate reviewing a teammate gets the same narrative, so review becomes consistent across human and agent work instead of being two different activities. And honestly, that consistency is part of what makes a team willing to let an agent open pull requests at all. The blocker on agent autonomy was never whether the agent could write the code. It is whether a human can responsibly sign off on it. Trust is the unlock, and trust comes from being able to see the why quickly.

I want to be straight about the limits, because overselling this would be its own kind of damage.
Branch Story does not replace your judgment and it does not certify that code is correct. It makes the why legible so your judgment is fast and accurate instead of slow and guessed. And it is only as good as the context behind it. If your team's decisions are not captured anywhere, there is no story to reconstruct, which is exactly why the capture side is the hard part and where most of my work goes.

But the direction feels right to me. As more of the code gets written by something that cannot explain itself in the hallway, the scarce resource is not code and it is not review hours. It is legible intent. That is the thing worth building.

When you review an agent's pull request, are you reading it, or are you approving it? Be honest. That gap is the whole problem.