Agents Write My Code. Agents Review It. I Referee.

wpnews.pro

AI agents review my pull requests now, not me. But there is one thing I will never be able to hand them: everything my company knows that was never written down.

I haven’t read a full pull request diff line by line in weeks. Other agents do that now. I run a deep, multi-agent review on every PR I open, across every repo I touch. By the time I look, the code has already been attacked from a dozen angles by review agents, each better-read than I am in corners I never mastered: frontend, security, databases. My job is no longer to catch bugs. It’s to referee.

This sounds a lot like the argument Linear made recently: as agents write more of the code, reviewers stop catching bugs and graduate to product judgment. The engineer gets “elevated.” It’s a good piece, and they’re half right. But the half they skip is the interesting one.

You don’t stop catching bugs because the bots got good #

You stop catching bugs because you handed bug-catching to a different bot.

A few months ago I wrote that when AI writes the code, correctness is mostly fine if the tests pass. I still believe that. What I underestimated was how much work hides inside if the tests pass. The authoring agent does not get you there. It writes plausible code, which is not the same thing as correct code, and a green check on tests it also wrote is not the floor it looks like. If I trusted that output, I’d be shipping the corruption tax straight into production.

So I don’t trust it. I run an adversarial pass: a swarm of review agents, each one prompted to break the change from a different direction, looking for races, missing error handling, the edge case the author never considered.

The obvious objection: if I don’t trust the agent that wrote the code, why would I trust the agents that review it? I don’t, not individually. They share a base model, so they share blind spots, and they raise false alarms I have to wave off. But writing correct code means getting everything right at once, while a reviewer only has to surface one candidate worth a second look. Run a wide, cheap, diverse sweep and you raise the odds that a real problem gets flagged by at least one angle, even when no single agent is reliable. I’m not building a chain of trust where each agent vouches for the next. The things that actually backstop me don’t hallucinate: the deterministic checks I wrote and read myself, and my own final pass.

So bug-catching didn’t disappear when the writing got cheap. It forked. It went from “a human reads the diff” to “one agent writes, several agents try to tear it apart, and a human referees the fight.” Linear’s “reviewers no longer catch bugs” describes the symptom, not the cause. The work didn’t stop. You just stopped being the one doing it.

And I’ll admit the obvious: a review agent pointed at a database migration surfaces things I would have waved through. Not because it’s a better engineer than me, but because it carries breadth I don’t. I am not a security specialist. My CSS is competent, not great. I have shipped enough subtle SQL mistakes to respect the category. Delegating the review isn’t laziness. It’s borrowing breadth I don’t have, on demand, on every PR.

Can I prove the swarm catches more than I would alone? Not cleanly, and I’d distrust anyone who handed you a number. Code review was never measurable. A human reviewer misses things, waves others through, and two good reviewers disagree about the same diff every day of the week. That subjectivity didn’t arrive with the agents. So I don’t hold them to a bar human review never cleared. I hold them to the average of a good reviewer, I make them as sharp as I can, and I accept that their judgment will differ from mine, the way a colleague’s would. Sometimes they’re wrong. Sometimes I’m the one who would have been.

The thing you can never hand off #

So if agents write the code and agents review the code, what is left for the human?

Context. Specifically, the context that was never written down.

You can feed your whole codebase to an LLM. You cannot get 100% of your company’s brain into the files an agent can read. Why we picked this ugly abstraction two years ago. The customer who churned because an earlier, cleverer version of this feature burned them. The Slack thread where we agreed not to touch this module before a release. The half sentence in a hallway that quietly became a constraint. None of it is in the repo. None of it is in the prompt. The model isn’t wrong about the code. It’s blind to the company.

That’s why I still do a final pass. Not the line-by-line read I opened by saying I’ve given up. A context pass: does this change fit the things the agents had no way of knowing? And it’s why I only step in as referee where a human left a comment. A human comment is the signal that undocumented context lives there, the kind no agent could have inferred. The final pass survives not because the models are weak, but because documentation is never finished. You can’t write down a culture. You can only carry it.

Two things that follow from this #

First: stop using staging as your first bug filter. Most teams lean on a staging environment to catch what their checks should have caught, which means they find problems late, by hand, in a place that looks nothing like production. Agents change the economics. Writing tests, fuzzers, property checks, and assertions used to be expensive enough that teams skipped them and shipped to staging to “see what breaks.” It’s cheap now, though a generated test you don’t understand is worse than none, so you still have to read those. Push the verification left. Make your deterministic checks ten times stronger, because the test suite is the first reviewer now and it should act like one. A green build you actually trust is worth more than any amount of manual poking around.

Second: when someone tells you the LLM is bad at their codebase, they’re usually describing a context problem, not a model problem. Not always. Models still invent APIs that never existed, no matter how much you feed them. But the systematic gap between the teams getting great results and the teams cursing the model is rarely the model. You can watch it close in real time. Connect the model to where your context actually lives, the ticketing system, the Notion pages, the RFCs, the incident history, and the same “dumb” model starts making calls that look informed, because for the first time it is. The teams winning aren’t using secret models. They gave the model the context everyone else keeps in their heads. It was never underperforming. It was working with the fraction of your company you bothered to hand it.

What’s actually left #

The pull request used to be a place where I read code. Now it’s a place where I arbitrate between agents that wrote it and agents that attacked it, and then I add the one thing none of them have: everything my company knows and never bothered to write down.

The model can write all of my code. It can’t be in the room. That last part is still the job.

I Shipped a Rust Binary. I Can't Write Rust.

A rewrite used to be career suicide. Porting our CLI to Rust took a month of review and a model that knew the language better than I ever will.

[Read more →](/blog/i-shipped-rust-i-cant-write-rust)

### [ The Hidden Corruption Tax of AI Delegation ](/blog/the-hidden-corruption-tax-of-ai-delegation)

Frontier LLMs corrupt 25% of what you delegate. The fix isn't going back to writing by hand. It's the same linters and CI we built for humans, finally pointed at the new worker.