# Where AI agents pay off

> Source: <https://rico.codes/agents>
> Published: 2026-06-04 07:31:25+00:00

# Where AI Agents Actually Pay Off

Posted June 4, 2026

I am starting to get real leverage from AI agents.

Not theoretical leverage. Not "look, the chatbot wrote a function" leverage. I mean the kind where a messy voice note turns into a draft, a repo change, a test, a pull request, a live fix, a follow-up task, and a breadcrumb that gives the next agent more context.

That leverage is exciting. It is also a little cursed.

The cursed part is not that the models are secretly alive or that software engineers are all immediately obsolete. The cursed part is more boring and more important: the economics are starting to work in weird places, especially for individuals and very small teams, and they do not work everywhere. The window is small. The workflow changes are nontrivial. The token bill can get gross fast. And if you do not build the surrounding system, agents can easily become an expensive way to generate unfinishedness.

This is where I think most agent discourse gets a little too smooth. People ask "is AI faster?" as if there is one answer.

There is not.

Sometimes it is slower. Sometimes the model churns. Sometimes the first answer is plausible but wrong. Sometimes the agent burns twenty minutes going in the wrong direction.

But the interesting question is not whether one agent is always faster than one human on one task. The interesting question is:

What happens when a human can specify, run, review, and improve many bounded execution loops in parallel?

That is where the ROI starts showing up.

It is also where the danger starts showing up.

George Hotz wrote the sharp negative version of this in
["The Eternal Sloptember"](https://geohot.github.io/blog/jekyll/update/2026/05/24/the-eternal-sloptember.html).
His argument, as I read it, is not just "AI code bad." It is that agent output
frontloads the impressive part, leaves the hard polish and coherence work to the
human, and produces artifacts that are broken in ways old quality proxies do not
catch anymore.

I do not fully buy the permanent claim that agents cannot program. I do buy the organizational warning. If your feedback loops are slow and your average worker is not carefully reading and error-correcting the output, agents can raise the volume of mediocre work faster than they raise the quality of good work.

That distinction matters. The question is not "agents: yes or no?" The question is "who can absorb the leverage without degrading their own system?"

## The ROI Is A System Property

The useful unit is not "the model."

The useful unit is the whole system:

```
Capability = model x harness x tools x environment x evaluator
```

The model matters. Obviously. A stronger model listens better, repairs better, and survives ambiguity better. GPT-5.5, in particular, has felt like a genuinely good foundational engineering model in my current workflow. It is often good enough that I can hand it a real codebase, a weird constraint, and a fuzzy product taste problem, then get back something I can review instead of something I have to babysit from first principles.

The annoying wrinkle is that models are not good in one global way. Some cloud/chat models feel much better at one-shot apps, UX exploration, visual design, and frontend taste. Codex/GPT-5.5 feels more steerable for deep repo engineering, but it can be pretty rough by default on product polish. That is not a contradiction. It is routing. Different tasks want different model/harness/tool combinations.

But the model is not the product.

The harness matters. Can it read the repo? Can it run tests? Can it browse current docs? Can it keep a plan? Can it spawn parallel work safely? Can it preserve local changes it did not make? Can it say clearly when it is blocked?

The tools matter. A model with a terminal, browser, GitHub access, docs, image inspection, and a real test suite is a different creature from the same model in a textbox. Tool access changes the shape of cognition because the agent can externalize uncertainty into the world: read the file, run the command, inspect the screenshot, check the deployed page.

The environment matters. A legible repo is agent fuel. Good scripts are agent fuel. Clear boundaries are agent fuel. Stable design primitives, typed connectors, preview/apply workflows, and boring test commands are all forms of intelligence that do not live in the model weights.

And the evaluator matters most of all. A task becomes delegable when there is a way to tell whether it worked.

Typecheck. Test. Build. Screenshot. Read back the external system. Ask a human to review a tight diff. Run an eval. Compare against a rubric. Verify the live URL. Whatever. Without an evaluator, the agent is not really operating. It is describing completion instead of proving it.
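As a minimal sketch of "proving completion instead of describing it", an evaluator can be a list of named checks that each return evidence. Everything here is illustrative: `CheckResult`, `command_check`, `run_checks`, and `is_done` are hypothetical names, and the two example checks are stand-ins for a real typecheck or test command.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    evidence: str  # what a reviewer can read back: output, URL, diff summary


def command_check(name: str, command: list[str]) -> Callable[[], CheckResult]:
    """Wrap a shell command (for example a test runner) as a check."""
    def run() -> CheckResult:
        proc = subprocess.run(command, capture_output=True, text=True)
        output = (proc.stdout + proc.stderr).strip()
        return CheckResult(name, proc.returncode == 0, output)
    return run


def run_checks(checks: list[Callable[[], CheckResult]]) -> list[CheckResult]:
    """Run every check and keep the evidence, even after a failure."""
    return [check() for check in checks]


def is_done(results: list[CheckResult]) -> bool:
    """No checks means no proof, so an empty result list is not 'done'."""
    return bool(results) and all(r.passed for r in results)


# Hypothetical stand-ins so the sketch runs without a real repo.
example_results = run_checks([
    lambda: CheckResult("build", True, "build finished"),
    lambda: CheckResult("screenshot", False, "header overlaps nav at 375px"),
])
print(is_done(example_results))
for result in example_results:
    print(result.name, "PASS" if result.passed else "FAIL", "-", result.evidence)
```

The design choice that matters is the empty-list case: a task with no evaluator is treated as unproven rather than as passing.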

## Manual Testing Is Underrated

The best agent workflows I have found are not the most autonomous ones. They are the ones with the tightest feedback loops.

Manual testing is underrated here. So is manual tasking.

People sometimes treat manual intervention as failure, as if the agent only counts if it runs alone and returns with a perfect artifact. That is the wrong fantasy. The fastest path is often:

- Ask for a bounded change.
- Let the agent inspect, edit, and test.
- Manually poke the thing.
- Notice the failure.
- Make the agent fix it.
- Turn the failure into a durable guardrail.

The last step is the compounding step.

If I manually catch a bug and only fix that bug, I got one fix. If I catch a bug and then add a test, a lint rule, a PR gate, a repo instruction, a skill, or an eval, I changed the future working conditions. Every later agent now has a slightly narrower path to repeat the same mistake.

This sounds obvious, but it is the difference between "using AI" and building an agentic work system.

For example, in one repo I added a PR compliance pattern that is almost comically literal: repository skills contain attestation words, and the agent has to include the current words in the PR body to prove it read the relevant instructions. The CI gate checks the JSON. If the branch changes, the head SHA has to be updated. If the agent tries to hand-wave the process, the gate fails.

It is silly.

It works.

And that is the point. You do not need the model to become careful by default. You need the environment to make the desired behavior easier to do than to skip.
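A gate like the PR compliance check described above could be sketched roughly as follows. This is not the actual gate: the `attestation:` line format, the JSON field names (`words`, `head_sha`), and the function name `check_pr_body` are all assumptions made for illustration.

```python
import json
import re

# Assumed convention: the PR body has one line of the form
#   attestation: {"words": {...}, "head_sha": "..."}
ATTESTATION_LINE = re.compile(r"^attestation:\s*(\{.*\})\s*$", re.MULTILINE)


def check_pr_body(pr_body: str, required_words: dict[str, str], head_sha: str) -> list[str]:
    """Return failure reasons; an empty list means the gate passes."""
    match = ATTESTATION_LINE.search(pr_body)
    if match is None:
        return ["no attestation line found"]
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return ["attestation is not valid JSON"]
    if not isinstance(data, dict):
        return ["attestation is not a JSON object"]

    failures = []
    words = data.get("words")
    if not isinstance(words, dict):
        words = {}
    # Each skill file carries a current word; the PR must echo it back.
    for skill, word in required_words.items():
        if words.get(skill) != word:
            failures.append(f"missing or stale attestation word for skill '{skill}'")
    # The attestation is tied to a specific commit, so new pushes invalidate it.
    if data.get("head_sha") != head_sha:
        failures.append("head_sha does not match the current branch head")
    return failures


example_body = (
    "Fixes the login redirect.\n\n"
    'attestation: {"words": {"testing": "heron"}, "head_sha": "abc123"}\n'
)
print(check_pr_body(example_body, {"testing": "heron"}, "abc123"))
print(check_pr_body(example_body, {"testing": "heron"}, "def456"))
```

Tying the attestation to the head SHA is what makes the check hard to hand-wave: a stale PR body fails as soon as the branch moves.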

## Parallelism Is Not Just Splitting A Project

In serial, agents are often not as magical as people want them to be.

If I sit and watch one agent do one thing, I still have to wait. I still have to review. I still have to catch drift. I still have to close the loop. Sometimes I could have done the task myself faster.

The return starts to make sense when the work can run in parallel.

But "parallel" means two different things.

The first is normal decomposition. Some goals are naturally splittable: add
several model providers, support several import paths, fix a cluster of bounded
bugs, smoke test several integrations, write the plan while another branch works
on the primitive. In those cases, the move is to write the map, split the slices,
give each slice a narrow success condition, and periodically re-ground in
`main`.

This is where planning documents become more useful than they sound. A good plan is shared state. It tells future agents what exists, what is blocked, what should merge first, and what "done" actually means.

A model-provider push is a good example. The goal was not "have agents do provider stuff." The goal was to make additional providers usable, cheap enough to matter, and provable through the real product path. That split into capability research, shared adapter work, provider-specific implementation, smoke tests, and an integration pass that checked what was actually merged, deployed, and usable.

That last part matters. A branch can be merged and still not be done. A local smoke test can pass and still not mean the product works. Sometimes success has to mean a real production turn, through the real auth path, with enough output to prove the provider is not merely returning a polite error.

The second kind of parallelism is less clean and more honest: working on more than one thing because the agent is busy and I am sitting there.

I am literally dictating parts of this article while other agents are fixing other things. Some of those things are related. Some are not. I am doing it because I am bored waiting for loops to finish, because I am anxious, because the queue is there, and because if I care about maximizing my own output the incentive is obvious: keep useful work in flight.

That is not the same as one project neatly split into ten slices. It is a more ambient multiplexing of attention. While one thread builds, another reviews, a third researches docs, a fourth waits on CI, and I use the dead air to think about the next thing.

This changes the latency math. If I have one task running, the difference between 20 minutes and 40 minutes is painful. If I have several bounded loops running and my real bottleneck is review, merging, and deciding what to queue next, the difference matters less. Not zero. But less.
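A back-of-the-envelope version of that latency math, under deliberately simple assumptions: every task takes the same agent time, the human reviews results one at a time, and review only starts once the agents finish. The 20 and 40 minute figures come from the paragraph above; the task count and review time are made-up numbers.

```python
def serial_minutes(tasks: int, agent_minutes: int, review_minutes: int) -> int:
    """One loop at a time: wait for the agent, review, then start the next."""
    return tasks * (agent_minutes + review_minutes)


def parallel_minutes(tasks: int, agent_minutes: int, review_minutes: int) -> int:
    """All agents start together; the human then reviews results back to back."""
    return agent_minutes + tasks * review_minutes


TASKS = 4            # assumed number of bounded loops
REVIEW_MINUTES = 10  # assumed human review time per loop

for agent_minutes in (20, 40):
    print(
        f"agent={agent_minutes}m",
        f"serial={serial_minutes(TASKS, agent_minutes, REVIEW_MINUTES)}m",
        f"parallel={parallel_minutes(TASKS, agent_minutes, REVIEW_MINUTES)}m",
    )
# agent=20m serial=120m parallel=60m
# agent=40m serial=200m parallel=80m
```

Under these assumptions, doubling agent latency adds 80 minutes in serial but only 20 minutes in parallel, because review time dominates the parallel total.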

The job becomes orchestration: what is running, what is worth checking now, what can wait, what needs to be killed, what should become a primitive, and what should merge before another branch drifts.

That does not mean "start ten random branches and vibe." It means keeping enough explicit state that parallel work stays reviewable instead of decaying into a pile of branches nobody clearly owns.

## The Small-Team Ownership Window

This is the part I keep coming back to: agents may be a much better deal for a small number of high-agency people than for the average large org.

Large organizations have advantages: money, distribution, legal cover, procurement, internal data, and teams of specialists. But they also have slow feedback loops. The person prompting may not own the architecture. The person reviewing may not understand the product context. The person paying the token bill may not see the cleanup burden. The person measuring productivity may count output instead of coherence.

That is how you get the Sloptember failure mode: more code, more features, more artifacts, more surface area, and less understanding.

Small teams have a different advantage. The loop can be brutally short:

- I feel a roadblock.
- I decide whether the roadblock is recurring.
- I build or ask an agent to build the primitive that removes it.
- I manually test the new path.
- I use the improved path immediately on the next task.

That loop is hard to buy with headcount.

It is also why the $200-a-month subscription tier is not just a pricing detail. For an individual or tiny shop, a heavy consumer subscription can feel like access to an absurd amount of subsidized frontier-ish execution. I can burn through most of a weekly Codex allowance, keep thinking, keep delegating, and keep building. Inside a big company, that same behavior may be blocked by policy, data rules, vendor approval, or simply the fact that the enterprise has to pay usage-based prices for every team.

So there is a weird temporary arbitrage here. Individuals can sometimes get something that looks like enterprise execution capacity before enterprises can comfortably operationalize it.

But it only works for a narrow class of people and teams. You need taste. You need error correction. You need enough technical depth to know when the agent is wrong even though the output sounds confident. You need enough product judgment to know when not to run another branch. You need enough executive function scaffolding to remember what is already running.

This is not "AI makes everyone 10x."

It is more like: AI lets some people build a little execution machine around their own judgment, if they are willing to do the practical work of making that machine reliable.

That matters personally because "just get a job" does not feel like the stable fallback it is supposed to be. I have applied for roles where I thought I was a real fit, gotten through parts of the process, and still felt the hesitation in the market. Companies are reluctant to hire right now. Maybe because budgets are weird. Maybe because AI has made everyone unsure how many people they need. Maybe because the middle of the labor market is just having a bad time.

Whatever the reason, it changes the calculation. If the old bargain is less available, then ownership starts to look less like an idealized founder story and more like a practical survival strategy.

That is the uncomfortable capitalism part. One way to reduce dependence on someone else's allocation decision is to own more of the upside. Not everyone can take that risk. Not everyone should. But if you have an idea, a little room to be wrong, and enough taste to keep the machine pointed at something real, this is a pretty interesting window.

There is still a large gap between "AI features" and **AI-native** products.
Most software is still shaped like old software with a chat box taped to the
side. Whole experiences can be rethought around voice, ambient context,
agent-readable state, reviewable diffs, previews, receipts, and interfaces where
chat is one input mode instead of the entire product. Product taste matters a
lot here, because the winning interaction pattern is probably not "same app, but
with more tokens."

That gap will not stay open forever. The patterns will get copied. The market will saturate. Tokens may get priced more honestly. Frontier intelligence may stay expensive even if everyday intelligence gets cheaper. So the question becomes: what can a small team build while the leverage is temporarily this weird?

## Foundations Are The Factory Game

The closest metaphor I have is not an office. It is a factory game.

The agent game feels like playing an old Minecraft automation modpack. You do not start with a giant perfect machine. You start by punching trees, then you build a tree farm, then the tree farm feeds some other machine, then that machine unlocks a better material, then the whole base starts producing things you used to gather manually.

That is how good agent work feels.

You spend time making machines. You make primitives. You create little languages. You build scripts, evals, checklists, prompts, skills, adapters, schemas, preview flows, and review surfaces. None of those things are the final artifact. They are the production line. Then you ramble into the system, test the output manually, notice where the machine jams, and improve the machine.

The important move is not "ask AI to do my work." The important move is "build the machine that makes this kind of work delegatable." Once that exists, it can keep paying out. The tree farm is not valuable because watching it run is beautiful. It is valuable because you stop punching trees.

This is why the highest-leverage work does not always look like feature work. Sometimes the right move is a script, a DSL, a connector, a preview/apply contract, a replayable eval, a design-system rule, a browser smoke test, a publishing script, or a tiny internal language that lets the agent express intent safely.

This is where agents reward taste at the low level.

If you have good concepts for how APIs should work, agents can fill in a lot of edges. If you do not, they will still fill in the edges, but now the edges are attached to an abstraction you may not want to own.

I think this is part of what people miss when they say AI makes software engineering easy. It can make implementation cheap. It does not make conceptual integrity cheap. In fact, it makes conceptual integrity more important because the implementation surface area expands.

Mitchell Hashimoto's ["Building Block Economy"](https://mitchellh.com/writing/building-block-economy)
gets at this from another angle: agents are extremely good at gluing together
high-quality, well-documented building blocks. That matches my experience. The
better the primitive, the more useful the agent.

The product I am building is partly a bet on this. Not "AI replaces judgment," but "good primitives let human judgment travel farther." It is early, so I do not want to overclaim the state of it. But the direction is an execution environment where agents have enough language, tools, memory, connectors, traces, and evals to do real work while the user stays close to authorship.

My current rule is simple: if a foundation changes a class of work from undelegatable to delegatable, strongly consider building it.

That is the real asset. The list of things I can safely delegate.

Every new bullet point on that list compounds. "Can update this doc safely." "Can test this route in a browser." "Can inspect this spreadsheet and propose a diff." "Can ship this kind of UI fix if a screenshot check passes." Each bullet point means a future thought has somewhere to go.

This is why the early phase can feel so slow. You are not only doing the task. You are building the language that makes the next task delegatable.

It also means not building fake foundations. An abstraction that does not reduce future risk is just ceremony.

## Input Bandwidth And The Execution Horizon

Voice is a huge part of this for me.

I get far more done when I can talk through the mess. A good voice dump can carry intent, priority, frustration, constraints, taste, and emotional salience in one sloppy packet. Typing can do that too, but speech catches the thought while it is still alive.

That matters because agents are hungry for context and I am not always willing to produce a perfect written brief before starting. The workflow that works is closer to:

- Ramble.
- Let the agent structure the ramble.
- Correct the structure.
- Split it into delegatable slices.
- Run the slices.
- Review the evidence.

This is not a side quest. Input modality changes throughput.

So does reading. If agents can create more output than I can absorb, the next bottleneck is not generation. It is review bandwidth. That is why I keep playing with speed readers and faster reading interfaces. Not because reading text one word at a time is the grand future of civilization. Because the interface between "agents produced a lot of stuff" and "Rico understands what happened" is now a serious part of the system.

The same is true for executive function.

Once execution capacity rises, the painful question becomes "what do I do next?" That used to sound like a productivity problem. In an agentic workflow it becomes infrastructure. You need to know:

- what is running
- what is done
- what needs review
- what is blocked
- what can be delegated next
- what should not be delegated at all.

There are a few terms I keep wanting to preserve here.

**Delegation minimum** is the point where a task becomes safe enough,
inspectable enough, and valuable enough that handing it to an agent feels better
than doing it yourself.

**Delegation saturation** is what happens after enough task classes cross that
minimum. The problem stops being "can I delegate this?" and becomes "how do I
keep all this delegated work visible enough to review?"

**Execution horizon** is the point where your supported execution rate exceeds
your ability to generate, prioritize, and review good ideas.

That is the real executive-function problem. Once supported execution outruns your review and prioritization capacity, a visible work surface stops being a nice-to-have. It becomes the control surface that keeps leverage from turning into fragmentation.

That is the real reason I keep circling the idea of a board or current screen for delegated work. The agentic environment needs harnesses, tools, connectors, evals, traces, skills, and preview/apply/verify loops. But the user also needs a surface for Now, Running, Review, Blocked, and Next. That surface is how delegated work stays visible enough for a human to remain the author.
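One possible shape for that surface, as a small data-structure sketch. The lane names come from the paragraph above; `WorkItem`, `board`, `review_overloaded`, and the example items are hypothetical, and a real version would need persistence and links to actual branches and traces.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    NOW = "Now"
    RUNNING = "Running"
    REVIEW = "Review"
    BLOCKED = "Blocked"
    NEXT = "Next"


@dataclass
class WorkItem:
    title: str
    lane: Lane
    evidence: str = ""  # e.g. a diff link or test run; empty until there is proof


def board(items: list[WorkItem]) -> dict[Lane, list[str]]:
    """Group delegated work by lane so nothing running is invisible."""
    lanes: dict[Lane, list[str]] = {lane: [] for lane in Lane}
    for item in items:
        lanes[item.lane].append(item.title)
    return lanes


def review_overloaded(items: list[WorkItem], review_capacity: int) -> bool:
    """True when more work is waiting for review than the human can absorb."""
    waiting = sum(1 for item in items if item.lane is Lane.REVIEW)
    return waiting > review_capacity


example_items = [
    WorkItem("write this article", Lane.NOW),
    WorkItem("provider smoke tests", Lane.RUNNING),
    WorkItem("UI fix for settings page", Lane.REVIEW, evidence="screenshot check passed"),
    WorkItem("import path refactor", Lane.REVIEW),
    WorkItem("connector auth", Lane.BLOCKED),
]
for lane, titles in board(example_items).items():
    print(f"{lane.value}: {titles}")
print("pause new delegation:", review_overloaded(example_items, review_capacity=1))
```

The `review_overloaded` check is a crude proxy for the execution horizon: when the Review lane outgrows review capacity, queueing more work adds fragmentation rather than output.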

## Benchmarks Are Also Harnesses

This is why I am increasingly suspicious of benchmark takes that collapse everything into a model leaderboard.

Coding benchmarks are useful. They are also weird. The closer you get to real software engineering, the less you are measuring only "the model." You are measuring a model inside a scaffold, with a prompt, a tool policy, a repo, a timeout, an environment, tests, hidden verifiers, retry rules, and a definition of success.

In other words, the evaluated object is really:

```
model + harness + task + verifier
```

Benchmark harnesses are designed for consistency, and consistency matters. If you want to compare models, you need to hold the scaffold as still as you can. But that does not mean the scaffold is incidental. It means the published score is a score for a model running inside a particular harness.

OpenAI recently stopped reporting SWE-bench Verified because, in its audit, many
remaining failures were test or contamination problems rather than clean signals
of frontier coding ability. OpenAI now recommends reporting SWE-bench Pro until
better uncontaminated evals exist
([OpenAI, February 2026](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)).

SWE-bench Pro is a serious attempt to raise the bar: 1,865 tasks across 41
repositories, including public, held-out, and commercial subsets. It uses
human-augmented problem statements, Docker environments, fail-to-pass and
pass-to-pass tests, and a public/private split meant to reduce contamination
([Scale Labs paper](https://labs.scale.com/papers/swe_bench_pro),
[methodology](https://labs.scale.com/leaderboard/swe_bench_pro_public)).

DeepSWE is interesting for a different reason. It uses 113 original long-horizon
tasks across 91 repositories and five languages, with hand-written behavioral
verifiers. Its public table says all models are run on mini-swe-agent for
consistency, and the current June 2026 snapshot reports `gpt-5.5[xhigh]` at
70% +/- 4%, ahead of `claude-opus-4.8[max]` at 58% +/- 5%
([DeepSWE](https://deepswe.datacurve.ai/),
[repository](https://github.com/datacurve-ai/deep-swe)).

That last sentence is the whole article hiding in a benchmark note.

"All models run on mini-swe-agent" is not a minor implementation detail. It is a claim about the harness. It means the score is not "GPT-5.5 in the abstract" or "Claude in the abstract." It is a model, with a particular reasoning setting, inside a particular agent scaffold, under a particular verifier regime.

That is good. We need that kind of specificity.

But it also means benchmark scores should be read as system scores. DeepSWE
itself has already drawn methodological audits around reproducibility,
denominators, and verdict receipts, which is exactly the kind of pressure a
serious benchmark should invite
([June Kim's audit](https://www.june.kim/auditing-deepswe)). The point is not
that one chart is valid and the other is invalid. The point is that the harness
is part of the measurement.

If my real workflow uses Codex with repo memory, browser verification, local scripts, PR gates, subagents, and manual review, then a benchmark using mini-swe-agent tells me something. It does not tell me everything. Likewise, if an enterprise agent runs inside a locked-down internal platform with different tools, data boundaries, and approval gates, the model leaderboard is only a starting prior.

That also matches my anecdotal experience with GPT-5.5. I have not seriously used Opus 4.8 yet, so I do not want to overstate the comparison. But compared with the 4.6 and 4.7-era models I was using before, GPT-5.5 has felt better at foundational engineering: reading the system, preserving constraints, building the primitive, and staying steerable over a long repo task. DeepSWE is not proof of my workflow, but it is at least evidence in the same direction.

The benchmark I actually want is closer to:

```
model + harness + tools + context + verifier + cost + review burden
```

That is less elegant than a leaderboard.

It is also much closer to reality.

## Token Economics Will Matter More

Right now, some of this feels distorted by consumer subscription economics.

The $200-ish personal AI plan is a strange object. OpenAI introduced ChatGPT Pro
as a $200 monthly plan for scaled access to its best models and tools, and the
larger industry has been playing with similar heavy-user tiers
([OpenAI, December 2024](https://openai.com/index/introducing-chatgpt-pro/)).
For an individual doing personal work, contract work, or small agency work, that
can feel like access to a subsidized compute well.

This creates funny incentives.

If I am operating as an individual, I may be able to pour a lot of agentic compute into my own work. If I am inside an enterprise, I may not be allowed to use that same personal compute on company code or data. The enterprise may need API billing, compliance, data controls, admin policy, audit logs, and a vendor relationship. That can be the correct boundary, but it changes the economics.

And the scale can get strange quickly. I am not a casual user of this stuff right now. I push close to the weekly Codex budget because I keep thinking, forking, reviewing, and building. Looking at my own usage, a month of this can start to look like something on the order of 20 billion tokens. Priced as raw API-style compute, depending on the model mix, that can look like maybe $15,000 of monthly compute value.

I am treating that as mostly free right now because, for me, it effectively is. That is absurd. It is also part of why the window feels temporary. If I had to pay the unsubsidized bill directly, the ROI math would get much harsher, much faster.
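To make the size of that gap concrete, here is the arithmetic implied by the rough figures above. The inputs are the article's own estimates (about 20 billion tokens, about $15,000 of API-style value, a $200 plan); the derived rates are only as good as those estimates.

```python
tokens_per_month = 20_000_000_000  # "on the order of 20 billion tokens"
api_value_usd = 15_000             # "maybe $15,000 of monthly compute value"
subscription_usd = 200             # the $200 monthly plan

millions_of_tokens = tokens_per_month / 1_000_000

# Blended API-style price implied by the two estimates.
implied_rate_per_million = api_value_usd / millions_of_tokens

# What the subscription works out to at the same usage.
effective_rate_per_million = subscription_usd / millions_of_tokens

# How many times larger the API-style value is than the subscription price.
subsidy_multiple = api_value_usd / subscription_usd

print(f"implied API-style rate: ${implied_rate_per_million:.2f} per million tokens")
print(f"effective subscription rate: ${effective_rate_per_million:.2f} per million tokens")
print(f"API-style value is {subsidy_multiple:.0f}x the subscription price")
```

On these estimates the blended rate is about $0.75 per million tokens, the subscription works out to about $0.01 per million, and the gap is roughly 75x.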

Eventually, some subsidies will go away or get priced more precisely. When that happens, the agent workflows that survive will be the ones where the value is measurable enough to justify the total system cost.

And total cost is not just tokens.

Total cost includes:

- model spend
- latency
- retries
- duplicated work
- review burden
- merge conflicts
- bad abstractions created too quickly
- security and data risk
- the emotional tax of tracking too many half-finished branches of intention.

A cheap model that loops forever is expensive. A premium model that solves the task once and teaches you how to make the cheap route reliable may be cheap. A fast model with the right tool and verifier may beat both.

This is why my model strategy has shifted away from "frontier by default" and toward cost, speed, and sufficient capability. Premium models are still important. They are teachers, judges, ambiguity resolvers, and escalation paths. But ordinary workflows should move toward the fastest route that crosses the quality bar at the lowest acceptable review burden.
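The comparison between a cheap model that loops and a premium model that lands the task can be sketched as an expected cost per accepted result. This is a toy model with assumptions stated up front: attempts are independent with a fixed success rate, every attempt gets reviewed, and all dollar figures, review times, and success rates below are invented for illustration.

```python
def expected_cost_per_success(
    model_usd: float,
    review_minutes: float,
    success_rate: float,
    human_usd_per_hour: float,
) -> float:
    """Expected total cost (model spend plus review time) per accepted result."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1 / success_rate  # mean of a geometric distribution
    cost_per_attempt = model_usd + (review_minutes / 60) * human_usd_per_hour
    return expected_attempts * cost_per_attempt


HUMAN_RATE = 60.0  # assumed value of an hour of review attention, in USD

# Hypothetical routes: (model spend per attempt, review minutes, success rate).
routes = {
    "cheap model, no verifier": (0.10, 15, 0.25),
    "premium model": (2.00, 15, 0.90),
    "fast model with tools and verifier": (0.50, 5, 0.80),
}
costs = {
    name: expected_cost_per_success(usd, minutes, rate, HUMAN_RATE)
    for name, (usd, minutes, rate) in routes.items()
}
for name, cost in sorted(costs.items(), key=lambda pair: pair[1]):
    print(f"{name}: ${cost:.2f} per accepted result")
```

With these made-up inputs the review burden, not the token price, drives the ranking, which is why a verifier that shortens review can matter more than a cheaper model.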

That is the actual economic game.

Not "which model is smartest?"

"Which route gets this class of work done with enough correctness, taste, speed, and cost discipline that I would delegate it again?"

## The Human Moves Upstream And Downstream

The optimistic version of agents is not that humans disappear.

The optimistic version is that humans move.

Upstream, into intent, taste, architecture, decomposition, and deciding what is worth doing.

Downstream, into review, verification, synthesis, and closure.

The middle gets more delegable. Not all of it. Not uniformly. Not safely by default. But enough of it that the shape of work changes.

This is also why I am wary, and why I do not want this to read like a simple argument for acceleration.

I am not exactly happy about all of this. I am trying to understand it because it is where the work seems to be going, and because I need a livelihood. I would like to keep some version of middle-class security. I would like to afford health care. I would like housing and assets to feel less out of reach than they do.

So yes, I am learning the game. I am trying to get good at the strange new workflow. I am building the primitives and the boards and the review loops. But that is not the same as believing the social direction is cleanly good.

I would not die for this shit. If, collectively, society looked at the tradeoffs and decided "eh, maybe not," I would agree with a lot of the hesitation.

Individually, though, in the world we are actually in, I do not think I can afford to pretend the leverage is not real. The same thing that helps indie builders and tiny agencies can also help already-powerful actors with capital, distribution, compute, and data. If compute access becomes more capital-gated over time, the leverage gap can widen.

So I do not want to make this sound clean.

Agents are not free leverage. They are leverage with a control problem.

The control problem is not only alignment in the sci-fi sense. It is much more ordinary:

- What did I ask for?
- What is running?
- What changed?
- How do I know it worked?
- What should become a reusable primitive?
- What should be thrown away?
- What did this cost?
- Would I delegate this again?

Those are managerial questions, philosophical questions, and software questions all at once.

## Where I Think Agents Actually Pay Off

For coding and engineering work, the current sweet spot looks something like this:

Agents pay off when the task is bounded, the repo is legible, the tools are available, the success condition is verifiable, and the result can be reviewed in a tight loop.

They pay off more when tasks can run in parallel without coordination chaos.

They pay off even more when each failure becomes a durable improvement to the environment: a test, a script, an eval, a clearer tool, a skill, a better primitive, a sharper instruction.

They pay off most when a small team uses them not as magic employees, but as an execution substrate that compounds with taste.

That is less glamorous than the usual pitch.

It is also more actionable.

Do not just burn tokens.

Build great foundations.