Plan Mode Is a Crutch

The standard plan-then-implement cycle ignores critical properties of the human-agent optimization problem. We derive these properties and show how to design for them.

It's fair to say the bottleneck in most agentic coding workflows is no longer the model. It is the human.

However, the UX of most coding agents still seems to ignore this. The agent researches, writes a huge complete plan, asks for approval, then disappears into execution. The reasoning is familiar: front-load the human, get alignment early, minimize surprises.

This is what makes plan mode so useful. It gives you a coherent artifact to review, and it makes the agent feel less chaotic. But as a protocol for human-agent communication, it is far from the best we can do.

The noisy channel

You start with a vague feature idea, which is consistent with a large space of possible implementations. A more detailed description narrows the space. A spec narrows it further. Code narrows it to exactly one. Every step here reduces ambiguity.

Shannon called this entropy: uncertainty over possible outcomes. A vague idea has high entropy and working code has zero. A coding agent's job is to get from one to the other.

But the agent can't close the gap alone. Which database? How does this interact with the billing module? Some of those answers live in your head. Some don't live anywhere yet. Your intent may even be partially formed and full of ambiguity you haven't confronted yet.

In coding, entropy usually reduces when a decision gets made: picking the database type, deciding whether the billing edge case matters, accepting or rejecting a proposed interface.
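
To make the entropy framing concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the vague idea is consistent with eight equally plausible implementations and that each decision rules out half of them. The helper names `entropy_bits` and `uniform` are hypothetical.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits: sum of p * log2(1 / p) over outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def uniform(n):
    """Uniform distribution over n equally plausible implementations."""
    return [1.0 / n] * n

# Assumed for illustration: each binary decision (say, SQL vs. document
# store) halves the set of implementations still consistent with intent.
for remaining in (8, 4, 2, 1):
    print(remaining, "candidates:", entropy_bits(uniform(remaining)), "bits")
```

Under these assumptions each decision removes one bit, and the single remaining implementation has zero entropy. Real decisions are rarely this clean: some remove far more uncertainty than others, which is why choosing what to ask matters.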

The agent has two ways to handle ambiguity. It can guess: use the prompt and its own knowledge to pick the most plausible implementation. Better models make this option stronger. Their priors get sharper, their guesses improve, and fewer choices need to reach the human. Or it can ask. This spends human channel capacity to reduce ambiguity before committing to a choice.

The protocol problem is not "ask whenever anything is ambiguous." That would waste your bandwidth. It is also not "guess everything." That pushes distortion into the implementation. The protocol problem is deciding which ambiguities are cheap enough to guess and which are worth spending human attention on.

The limits

This channel is not limited by the agent. It is limited by how many decisions the human can absorb, evaluate, and answer without losing the thread. Call that the channel's capacity.

And those decisions are not independent tokens. An auth question is cheaper when you are already holding the auth model in your head. A database question between two auth questions forces you to evict one context and load another. The channel is still open, but its effective capacity has dropped.

We can model that effective capacity as the raw capacity discounted by switching overhead, which depends on two things: how often you switch contexts and the recovery cost per switch. Three auth questions in a row are cheap. Auth, database, auth is not, even if the questions are individually simple.
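
One plausible way to write this down, as a sketch rather than a definitive model: let C be the raw capacity (decisions per hour with no switching), f the number of context switches per hour, and tau the hours lost recovering from each switch. Then a fraction f * tau of every hour goes to recovery, and the effective capacity is roughly C * (1 - f * tau). The symbols, the linear form, and the function name are all assumptions chosen for illustration.

```python
def effective_capacity(capacity, switch_rate, recovery_cost):
    """Assumed model: C_eff = C * (1 - f * tau), floored at zero.

    capacity      -- C, decisions per hour with no context switching
    switch_rate   -- f, context switches per hour
    recovery_cost -- tau, hours lost per switch
    """
    lost_fraction = min(1.0, switch_rate * recovery_cost)
    return capacity * (1.0 - lost_fraction)

# Hypothetical numbers: 20 decisions per hour, 15 minutes to recover.
print(effective_capacity(20, switch_rate=0, recovery_cost=0.25))
print(effective_capacity(20, switch_rate=2, recovery_cost=0.25))
```

With these made-up numbers, two switches per hour cut the usable capacity in half even though no individual question got harder.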

Now combine the pieces. The agent needs to reduce entropy by choosing or by asking. Choosing is cheap but risks distortion. Asking is more precise but spends a finite, degradable channel. So the real design question is not "how do we make the agent plan better?" It's "what protocol reduces the most uncertainty per unit of human attention?"

Every approach to human-agent collaboration is a protocol for solving this: plan mode, spec-driven development, flat multi-agent parallelism, fork-join orchestration. Some protocols use the human channel well. Most do not.

The failure mode

The usual cycle is: prompt, wait, review plan, wait, review implementation.

"Wait" is doing a lot of work here. You are not really waiting. You are either staring idly at your screen, or you're switching to something else and accepting that the context will be cold when the agent comes back.

Most of the time you context switch: to another agent, a different review, Slack, doomscrolling, or some other distraction. Then the plan arrives and asks for a high-bandwidth decision from a cold start.

So before implementation has even started, the protocol has already forced a bad tradeoff: idle time or context reload cost.

Consider the set of implementations consistent with the prompt. The agent models a distribution over this set; some implementations are more likely to satisfy you than others. A better model sharpens that distribution, but it does not remove ambiguity entirely.

Guessing selects one point from that distribution. Asking sends bits through the human channel to reshape the distribution before selection. If the expected distortion from guessing is low, guess. If it is high, ask.
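
The guess-or-ask rule can be sketched as a small cost comparison. This is an illustration under stated assumptions, not a calibrated policy: the probabilities, the costs in minutes, and the function names are all hypothetical, and distortion is simplified to "probability the top guess is wrong times the cost of redoing the work."

```python
def expected_distortion(option_probs, rework_cost):
    """Expected rework if the agent commits to its most likely option.

    option_probs -- agent's belief that each option is what the human wants
    rework_cost  -- assumed cost of redoing the work after a wrong guess
    """
    p_wrong = 1.0 - max(option_probs.values())
    return p_wrong * rework_cost

def should_ask(option_probs, rework_cost, ask_cost):
    """Ask only when guessing is expected to cost more than the question."""
    return expected_distortion(option_probs, rework_cost) > ask_cost

# Hypothetical beliefs and costs (minutes of human time).
naming = {"snake_case": 0.9, "camelCase": 0.1}
database = {"postgres": 0.5, "sqlite": 0.3, "dynamodb": 0.2}
print(should_ask(naming, rework_cost=5, ask_cost=2))
print(should_ask(database, rework_cost=120, ask_cost=2))
```

With these numbers the naming convention is cheap enough to guess, while the database choice is worth a question. A sharper model raises the top probability and pushes more decisions below the threshold.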

That channel has an effective capacity: the number of useful decisions the human can make per unit time after switching costs. Shannon's noisy-channel coding theorem says that reliable communication is possible at rates below the channel's capacity and impossible above it. No protocol can extract more human signal than that effective capacity.

But capacity is only the first constraint. Timing matters too. During the first wait, the agent is producing a plan while the human channel is idle. During plan review, the channel receives a burst: architecture, sequencing, error handling, tests, naming, migration strategy, all compressed into one approval decision. During the second wait, the agent learns new implementation facts, but the channel is idle again.

The plan is a lossy encoding

This is also a rate-distortion problem. A plan is a compressed encoding of intent. The lower the rate, the more distortion you should expect between what the human meant and what the agent builds. A giant plan reviewed after an idle gap or a context switch is low-rate in the place that matters: it asks the human to validate many hidden decisions through one coarse approval action.

A pre-implementation plan is maximum-compression, minimum-context. It captures what the human and agent knew before the implementation exposed the hard parts. Ambiguities, edge cases, and interactions with existing modules often surface only during execution, when plan mode has already put the channel to sleep. The distortion is baked in during planning and paid for during implementation.

That is the failure mode of plan-then-execute. It sends the biggest burst when the human has the least concrete context. It idles the channel while the agent is discovering the facts that would make human input most valuable. It turns many domain-specific decisions into one approve/reject action, which increases distortion.

A better protocol should do the opposite: keep the channel warm while work is happening, ask smaller questions when the relevant context exists, batch nearby decisions together, and keep state alive instead of treating the approved plan as a finished artifact.

Scheduling the human channel

Once you view the human as the scarce channel, the design problem starts to look less like "write a better plan" and more like scheduling work around an expensive shared resource. Computer architecture has been doing this for decades: pipeline dependent stages, use massive multithreading for independent work, and preserve locality so the cache does not thrash.

Pipelining is the single-thread optimization. When tasks depend on each other, you cannot make them fully parallel, but you can overlap their stages. Research, planning, and execution do not wait for each other as global phases. If one agent is implementing auth, the orchestrator can already be researching the database task. If that research exposes one real ambiguity, it can ask while auth is still running. By the time auth lands, the database task is not starting cold.
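
The latency effect of overlapping stages can be sketched with a two-stage pipeline. The model below assumes one research stage and one implementation stage, tasks processed in a fixed order, and made-up durations; the function names are hypothetical.

```python
def sequential_latency(tasks):
    """Each task is researched, then implemented, before the next starts."""
    return sum(research + implement for research, implement in tasks)

def pipelined_latency(tasks):
    """Research for the next task overlaps implementation of the current one."""
    research_done = 0
    implement_done = 0
    for research, implement in tasks:
        research_done += research
        # Implementation starts once its research is finished and the
        # implementation stage is free.
        implement_done = max(implement_done, research_done) + implement
    return implement_done

# Hypothetical (research, implement) durations in minutes.
tasks = [(2, 5), (3, 4), (1, 6)]
print(sequential_latency(tasks), pipelined_latency(tasks))
```

For these durations the sequential schedule takes 21 minutes and the pipelined one 17, because research for the second and third tasks is hidden behind earlier implementation work.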

That lowers latency because the next task is prepared before the current task finishes. It also spends human attention at a better time: questions arrive while work is happening, not as one big upfront approval burst.

Massive multithreading is the independent-work optimization. A single workstream alternates between needing input and going quiet. Several workstreams smooth that demand. Auth may be implementing while database is clarifying a migration choice and frontend is waiting on a product decision.

But multithreading only helps if it preserves locality. If questions arrive in the order auth, database, frontend, auth, the human pays a reload cost almost every time. Three auth decisions in a row are cheap. Auth, database, auth is not.

Forking is the structure that gives us both multithreading and locality. Each fork owns one domain. Auth questions come from the auth fork. Database questions come from the database fork. The parent handles only cross-cutting decisions. The question stream from each fork is domain-coherent by construction, so parallelism does not turn the human into a random-interrupt handler.
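
A minimal sketch of why per-fork question streams help: route each question to a queue owned by its fork, then present one fork's questions together. The question texts and the helper names `count_switches` and `batch_by_fork` are hypothetical, and the sketch assumes the questions can safely be reordered.

```python
from collections import defaultdict

def count_switches(domains):
    """Number of times consecutive questions change domain."""
    return sum(1 for prev, cur in zip(domains, domains[1:]) if prev != cur)

def batch_by_fork(questions):
    """Group (domain, text) questions by fork.

    Order within a fork is preserved; forks are drained in order of
    first appearance (dicts keep insertion order).
    """
    queues = defaultdict(list)
    for domain, text in questions:
        queues[domain].append((domain, text))
    return [question for queue in queues.values() for question in queue]

arrivals = [
    ("auth", "Session cookies or tokens?"),
    ("database", "Migrate in place or dual-write?"),
    ("frontend", "Inline errors or toast?"),
    ("auth", "Token lifetime?"),
]
batched = batch_by_fork(arrivals)
print(count_switches([d for d, _ in arrivals]))
print(count_switches([d for d, _ in batched]))
```

Here the interleaved stream forces three context switches and the batched one two. Batching trades a little delay for locality, so it only makes sense for questions that do not block each other.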

The tree also bounds coordination. A single orchestrator managing leaf tasks becomes its own bottleneck: its context fills, dependencies blur, and it starts losing track of what each worker owns. Recursive decomposition keeps fan-out small. If each node manages a bounded number of children, coordination cost grows with the height of the tree, which is logarithmic in the number of leaves, rather than with the number of leaves itself. You can route decisions, propagate completion, and integrate work in a logarithmic number of steps instead of making one agent scan a flat list of everything.
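
The logarithmic claim can be checked with a few lines. This sketch counts how many levels of coordinators sit above a set of leaf tasks when every node manages at most a fixed number of children; `tree_depth` and the example sizes are assumptions for illustration.

```python
def tree_depth(num_leaves, max_fan_out):
    """Levels of coordination above num_leaves leaf tasks when each
    node manages at most max_fan_out children."""
    if max_fan_out < 2:
        raise ValueError("max_fan_out must be at least 2")
    depth = 0
    nodes = num_leaves
    while nodes > 1:
        nodes = -(-nodes // max_fan_out)  # ceiling division
        depth += 1
    return depth

print(tree_depth(100, max_fan_out=5))    # bounded fan-out
print(tree_depth(100, max_fan_out=100))  # one flat orchestrator
```

With a fan-out of 5, a hundred leaf tasks need three levels, and no node ever tracks more than five children. The flat orchestrator needs only one level but has to hold all hundred tasks in its own context.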

The state structure for this is not a plan. It is a living DAG. Nodes are units of work: research a question, ask for a decision, implement a module, review an integration point. Edges are dependencies. When research exposes ambiguity, the graph spawns a decision node. When the human answers, downstream implementation unblocks. When implementation exposes a new constraint, the graph changes again.
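
A minimal sketch of such a work graph, assuming nodes are plain string labels and ignoring cycle detection and persistence. The class name `WorkGraph`, its methods, and the node labels are hypothetical.

```python
class WorkGraph:
    """Hypothetical living DAG: nodes are units of work, edges are dependencies."""

    def __init__(self):
        self.deps = {}     # node -> set of nodes it depends on
        self.done = set()  # completed nodes

    def add(self, node, depends_on=()):
        """Add a node, or attach new dependencies to an existing one."""
        self.deps.setdefault(node, set()).update(depends_on)
        for dep in depends_on:
            self.deps.setdefault(dep, set())

    def complete(self, node):
        self.done.add(node)

    def ready(self):
        """Unfinished nodes whose dependencies are all complete."""
        return sorted(
            node for node, deps in self.deps.items()
            if node not in self.done and deps <= self.done
        )

graph = WorkGraph()
graph.add("research:db")
graph.add("implement:db", depends_on=["research:db"])
graph.complete("research:db")
# Research exposed an ambiguity: spawn a decision node that blocks implementation.
graph.add("decide:migration")
graph.add("implement:db", depends_on=["decide:migration"])
print(graph.ready())
graph.complete("decide:migration")  # the human answers
print(graph.ready())
```

After research completes, the only ready node is the decision; once the human answers, implementation unblocks. The point of the sketch is that `add` can be called at any time, so the graph keeps changing as work exposes new constraints.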

That is what makes the architecture work. Pipelining is overlapping active nodes. Forking is subgraphs with local ownership. Recursive decomposition is a tree of DAGs with bounded fan-out. The protocol does not wait for a complete plan before acting; it keeps updating the work graph as entropy is reduced.

Plan mode is the degenerate case

Plan mode is the degenerate protocol: one source, one burst, one idle phase, one level, zero multiplexing. The human's channel is overloaded during planning (every decision at once, with minimal context) and then left idle during execution.

That does not make planning useless. It makes plan mode a crutch: helpful when the system cannot maintain living state, cannot ask small questions at the right time, and cannot coordinate multiple workstreams safely.

The future is not better upfront plans. It is better protocols for spending human attention.

source & further reading

graphcoder.ai — original article