How to make AI coding agents do real work — repeatedly, verifiably, and without you babysitting every step.
Synthesized from current practice (2025–2026): Addy Osmani's "Loop Engineering," Peter Steinberger's "Just Talk To It" and "Shipping at Inference‑Speed," Boris Cherny's talks on Claude Code, Geoffrey Huntley's Ralph technique, Matt Van Horn's "WTF Is a Loop?", the Forward Future
Loop Library, the Lushbinary loop‑engineering guide, and the working loops shared by practitioners like Matthew Berman, Eric Lott, Hiten Shah, and others.
An agentic loop is the simplest unit of useful agent work: do something → check the result → decide whether to continue or stop. The whole craft is in making the check real and defining when to stop. Everything else — model choice, harness, MCPs, subagents — is secondary.
If you remember one sentence: A loop is a task with a check. A task without a check is just hope.
Zoom out and the same idea has a name: loop engineering — designing the system that prompts your agent on a schedule and against a goal, instead of typing every prompt yourself. As Anthropic's Boris Cherny put it, "My job is to write loops." This guide takes you from one good loop to a self‑running one — and tells you where the brakes are.
Most people picture "an agent" as a chatbot that writes code in one shot. That's a one-time task. A loop is different. The agent:
┌───────────────────────────────────────────┐
│ │
▼ │
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────┴─────┐
│ OBSERVE │──▶│ ACT │──▶│ CHECK │──▶│ DECIDE │
│ (inputs) │ │ (1 step) │ │ (fixed) │ │ continue │
└──────────┘ └──────────┘ └──────────┘ │ /stop? │
└────┬─────┘
│ stop
▼
┌───────────────────┐
│ HANDOFF / REPORT │
└───────────────────┘
Use a loop when the result of one step should change the next step. If it won't, use a one-time task instead. (Forward Future, How agent loops work.)
This is why "improve the code" fails and "make every page load under 50ms under the same test conditions" works. The first has no finish line; the second has a check the agent can run after every change, and a number that says done.
The cycle above is the inner loop — what a coding agent already runs on every turn: it perceives the state, reasons about what to do, acts (calls a tool, edits a file, runs a test), observes the result, and reasons again. You don't build that; the harness does.
What you build is the outer loop: the system that runs that inner loop on a schedule, feeds it work, checks the result, and decides the next thing — without you typing each prompt. Everything past this section is about designing that outer loop well.
In June 2026 this pattern got a name. Addy Osmani called it loop engineering, crystallizing what Peter Steinberger and Anthropic's Boris Cherny (head of Claude Code) had been saying:
"You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents." —
Peter Steinberger"I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops." —
Boris Cherny
It's the third layer in a stack that's been building for years. Each layer wraps the one inside it and moves the leverage point further from the raw model call:
| Layer | What you optimize | Unit of work |
|---|---|---|
| Prompt engineering | ||
| how you phrase one instruction | one turn you type by hand | |
| Context engineering | ||
| what else is in the window: docs, history, tool defs | the conditions around one answer | |
| Loop engineering | ||
| the system that decides what to prompt, when, and whether the result passes | ||
| a self‑running cycle across many turns |
The lower layers don't disappear — a sloppy prompt inside a loop just produces sloppy work faster, and the loop still has to put the right files in front of the model each turn. What loop engineering adds is the autonomous control structure around all of it.
The leverage moved; the work didn't get easier. A well‑designed loop multiplies a good engineer. A badly designed one multiplies a bad decision just as fast, with less of you watching.
Before it had a name, there was Ralph. In July 2025 Geoffrey Huntley described running a coding agent inside a plain while
loop and named it after Ralph Wiggum — "deterministically simple in an unpredictable world." It looks too dumb to work, and it works. (Huntley built an entire programming language with it for about $297.)
while ! grep -q "ALL TASKS DONE" STATUS.md; do
claude -p "Read PLAN.md and STATUS.md. Pick the next unchecked task,
implement it, run the tests, commit on success, and update
STATUS.md. Then stop."
done
The non‑obvious insight is the context reset. A long session degrades as the window fills with old reasoning, dead ends, and stale file contents. Ralph sidesteps that: every iteration is a fresh agent with a clean context that reads the current repo state and task list from disk, does exactly one unit of work, commits, and exits. The intelligence doesn't live in one heroic run — it lives in clear, granular specs and verifiable outcomes, applied over and over against an external memory the model can't pollute.
Loop engineering is Ralph, productized. The while
loop becomes a scheduled automation, the context reset becomes a worktree plus a sub‑agent, and the ALL TASKS DONE
grep becomes a /goal
condition graded by a separate model. Same shape, fewer sharp edges.
Ralph didn't appear from nowhere — and what Steinberger and Cherny mean today isn't Ralph either. The word loop hides at least five distinct things. Knowing where you are on this ladder is the fastest way to stop talking past people:
| Stage | When | What it was | What it added |
|---|
- ReAct | 2022 | the academic while‑loop: reason → act → observe → repeat | one model, one loop, a human watching |
- AutoGPT | 2023 | gave the loop a goal and let it prompt itself | autonomy — and infamous infinite spinning |
- Ralph | Jul 2025 | a bash one‑liner piping the same prompt, fresh context each pass | discipline: reset context to fixed anchor files |
/goal| spring 2026 | Ralph productized in Codex & Claude Code; runs until a validator model confirms done | a built‑in verifiable stop |- Orchestration | now | loops supervising loops, on a schedule, with durable git‑backed state | the loop becomes the unit of work |
Stages 1–4 are single‑agent. Stage 5 is what's genuinely new: the loop became the unit of work (not the task), loops started supervising other loops concurrently and on a schedule, scheduling replaced the human kickoff (so it runs on infrastructure time, not your attention), and durability became explicit (git‑backed state and crash recovery, because Ralph assumed your terminal stayed open and the 2026 version assumes it does not).
"It's just cron with a hat on" — half right. The sharpest skeptic line in the whole discourse was four words: "Cronjobs have funny re‑branding right now." And yes, the scheduling layer is cron — Claude Code's /loop
runs on cron under the hood. What cron never had is the body. A cron job runs a fixed script; a loop runs a model that reads the current state, decides what to do next, does it, checks whether it worked, and decides whether to continue. A loop is cron plus a decision‑maker in the body. Stack those — let one loop dispatch and supervise others with durable shared state — and you get something cron can't express. The open‑source proof is Steve Yegge's Gas Town: 20–30 Claude Code instances coordinated by a "Mayor" agent, patrol agents running continuous loops, and state in git so work survives a crash.
Read it top to bottom: a scheduler tick wakes the Mayor (the outer loop), which hands each patrol agent one bounded task in its own worktree. Each patrol agent runs its own inner observe → act → check cycle, then a verifier gates the result — failures bounce back to the Mayor for rework, passes are committed to durable git state. The next tick reads that state and picks up where the last one stopped. The Mayor enforces the three hard stops (max iterations, no‑progress, budget) so the whole thing halts instead of running off a cliff.
The capability bar moved. Practitioners report that agentic coding went from "this is crap" to "this is good" around mid‑2025, and from good to "this is amazing" with the newest frontier coding models. The practical consequence, in Steinberger's words: the amount of software you can create is now mostly limited by inference time and hard thinking — not by typing.
That shifts where your effort goes. The bottleneck is no longer writing code; it's specifying the goal and the check precisely enough that an agent can run unattended and you can trust the result. The agentic loop is the format that encodes exactly those two things.
A second reason it matters: closing the loop. The recurring theme across every credible source is that agents get dramatically more reliable when they can verify their own work — run the CLI, run the test, diff the screenshot, hit the endpoint. Whatever you build, build it so the agent can check itself. "By default, whatever I wanna build, it starts as a CLI. Agents can call it directly and verify output — closing the loop." (Steinberger.)
Every reliable loop names five things explicitly. Miss one and the loop drifts, runs forever, or "succeeds" while tests fail.
| Part | Question it answers | Failure if missing |
|---|---|---|
| Trigger | ||
| When does the loop run? | Never starts, or runs at the wrong time | |
| Inputs | ||
| What fresh state does the agent inspect each pass? | Acts on stale assumptions | |
| Action | ||
| What single bounded, reversible change may it make? | Huge blast radius, impossible to undo | |
| Check | ||
| What fixed test/benchmark/rubric decides success? | "Looks done" while broken | |
| Stop | ||
| Success? No‑op? Blocked? Out of budget? | Infinite loop, wasted tokens, runaway authority |
This single prompt shape works across Cursor, Codex, Claude Code, Factory, Devin — anything. Fill the brackets:
When [trigger], inspect [fresh inputs]. Choose one in-scope action using
[criteria], then make the change.
Run [acceptance check] under the same conditions. Record what changed, the
evidence, and the next step in [state file].
Repeat only while progress is measurable and [budget] remains. Stop when
[success gate] passes. Stop without changes when [no-op condition] is true.
Ask for approval or report a blocker when [escalation condition] occurs.
Never [forbidden action]. Finish with [pull request, report, artifact, or handoff].
Run it once by hand before you schedule it. The first manual run almost always reveals a missing check, a fuzzy boundary, or a stop condition that needs to be sharper. (Forward Future.)
A year ago a loop meant a pile of bash you maintained forever. As of mid‑2026 the pieces ship inside the products — and the shape is the same across OpenAI Codex and Anthropic's Claude Code, so you stop arguing about which tool and just design a loop that works in either. A loop needs five blocks plus one place to remember state.
| # | Block | What it does | In Codex | In Claude Code |
|---|---|---|---|---|
| 1 | Automations | |||
| scheduled discovery + triage | Automations tab (project, prompt, cadence, env); Triage inbox; /goal |
|||
/loop , scheduled tasks/cron, hooks, GitHub Actions, /goal |
||||
| 2 | Worktrees | |||
| isolate parallel agents | built‑in worktree per thread | |||
git worktree , --worktree , isolation: worktree on a subagent |
||||
| 3 | Skills | |||
| codify project knowledge | ||||
SKILL.md , called with $name or implicitly |
||||
Agent Skills (SKILL.md ) |
||||
| 4 | Connectors | |||
| reach your real tools | Connectors (MCP) + plugins | MCP servers + plugins | ||
| 5 | Sub‑agents | |||
| separate maker from checker | TOML in .codex/agents/ |
|||
.claude/agents/ , agent teams |
||||
| + | Memory | |||
| durable state between runs | markdown / Linear via connector | markdown (AGENTS.md , progress files) / Linear via MCP |
Four of these are mechanics; two are where loops live or die.
Automations are the heartbeat. They surface work on a schedule without you asking — everything else reacts to what they find. Runs that find something land in triage; runs that find nothing archive themselves. The in‑session cousin is the most important primitive of 2026: /goal
keeps working across turns until a condition you wrote is verifiably true, and a separate small model checks "are we done?" after every turn.
Memory is the spine. The model forgets everything between runs, so state must live on disk, not in the context window. The agent forgets; the repo doesn't. Tomorrow's run reads the state file and picks up exactly where today stopped. Keep two things separate: skills hold durable knowledge (how we build, our conventions, "we don't do it this way because of that one incident"); memory holds changing state (what's been tried, what passed, what's still open). Never put secrets in either.
The maker–checker split is the single most useful structural move. The model that wrote the code is far too generous grading its own homework. A second agent — different instructions, sometimes a stronger model on higher reasoning effort, told to be adversarial and to trust tests over its own read of the diff — catches what the first talked itself into. This is exactly what /goal
does under the hood: a fresh model decides whether the loop is done, not the one that did the work.
The reusable unit is a skill, not a prompt. Steinberger's other rule pairs with the loop one and is arguably the more durable half: if you do something more than once, turn it into a named skill; if you do something hard, turn it into a skill afterward so next time is free. A loop with no reusable skills inside it is just a while
‑true around a stranger. A loop that calls a library of sharp, tested, named skills compounds — every run gets cheaper and sharper instead of re‑deriving your project from zero. The loop is plumbing; the skills are the asset.
You are still the ceiling. Worktrees remove the mechanical collision, but your bandwidth to review merged work caps how many parallel agents you can actually run. Ten agents producing changes you can't review is worse than two you can.
A goal is only as good as the evidence that proves it. "Make the checkout flow better" gives the loop nothing to grade against, so it stops whenever it feels like it. Practitioners running long, unattended agents converged on the same fix: specify the desired end state, the evidence required, the constraints that must hold, and a hard budget. The agent stays the executor; you write the acceptance test it must pass before it may claim done.
| Dimension | A wish (don't) | A contract (do) |
|---|---|---|
| End state | ||
| "Improve test coverage" | "Coverage for src/billing is ≥ 90%" |
|
| Evidence | ||
| "It looks done" | "npm test exits 0 and the coverage report confirms the number" |
|
| Constraints | ||
| (unstated) | ||
| "Do not touch public APIs or delete existing tests" | ||
| Budget | ||
| (unbounded) | ||
| "Stop after 25 turns or $5, whichever comes first" |
Three habits make a loop trustworthy: preserve mistakes so the loop learns instead of repeating them, build verification into the loop rather than bolting it on after, and treat the failing test or red CI as the signal that keeps the agent honest. A loop with no evidence to fail against will always think it succeeded.
These are real, attributed patterns from the Loop Library. Each one is a worked example of the template — notice how every one has a concrete check and an explicit stop.
sleep
or retry. Repeat until The pattern is identical every time: fresh inputs → one change → fixed check → keep only verified wins → explicit stop.
Don't jump straight to an auto‑merging loop. Earn trust one rung at a time, and only climb when the current rung is already producing work you'd have done by hand anyway. Each level adds exactly one new power and keeps a human in the path until the evidence says you can step back.
| Level | What the loop does | What you do |
|---|---|---|
| 0 — Manual | ||
| you prompt turn by turn | every turn | |
| 1 — Triage | ||
| scheduled run writes findings to a markdown file; no code changes | read and act on the findings | |
| 2 — Draft | ||
| drafts fixes on a branch in an isolated worktree | review and merge every PR | |
| 3 — Verified PR | ||
| a verifier sub‑agent gates the PR before it reaches you | approve; the verifier filters | |
| 4 — Auto‑merge | ||
| low‑risk classes (dep bumps, lint, flaky‑test retries) merge on green | audit the log, not each change |
Start smaller than you think. A single automation that triages CI failures into a markdown file each morning — no auto‑merge — already removes a recurring chore and lets you watch how the loop behaves before you trust it with PRs.
Watch the token bill. A scheduled loop with a verifier running after every turn burns tokens fast, and usage swings wildly with cadence and sub‑agent count. Start with a slow cadence and a tight goal, watch cost for a few days, and scale up only once the loop produces work you actually merge.
The same prompt works everywhere; only where you store recurring instructions and how you schedule differ.
| Tool | Stable instructions | Scheduling |
|---|---|---|
| Cursor | ||
.cursor/rules or AGENTS.md |
||
| Cursor Automations | ||
| Codex | ||
AGENTS.md |
||
| Codex Automations (Worktree thread for isolation) | ||
| Claude Code | ||
CLAUDE.md (/init to scaffold) |
||
| Routines, scheduled tasks, GitHub Actions | ||
| Factory | ||
.factory/prompts/*.md |
||
droid exec --auto medium -f … from CI/cron |
||
| Devin | ||
| Playbooks | Scheduled Sessions |
In all cases: review the diff and the actual check output — not the agent's summary — to decide whether the work is done.
The lowest on‑ramp is one line. Boris Cherny's own canonical example — paste it and change the nouns:
/loop babysit all my PRs. Auto-fix build issues, and when comments come in,
use a worktree agent to fix them.
Notice what you didn't write: the steps. You wrote the intent and the stopping behavior; the loop prompts the agent each tick. Cherny's five tips for running an agent autonomously for hours or days:
/goal
or /loop
A loop is delegated authority. Bound it.
tmp/<file>.md
). Here's the plot twist of 2026. Once the model writes the code for almost nothing, the cost moves to the loop running it. The expensive resource shifted from tokens‑per‑feature to loop management.
"The costliest thing in AI coding is no longer writing code, it's managing the agent loop."
The receipts are real: Uber capped its engineers at $1,500 per person, per tool, per month for Claude Code and Cursor after burning its annual AI budget in four months. And the failure mode every production team fears is the loop that doesn't stop — "without guardrails, you get infinite loops and billing surprises orders of magnitude over budget."
Which is why every serious 2026 write‑up converges on the same three hard stops. Bake all three into every loop:
The romantic version of loops is that you write them and a thousand agents build your company overnight. The production version is that you write the loops, and most of your job is making sure they halt. (For perspective: Gartner places agentic AI at the peak of inflated expectations, with only ~17% of organizations actually deploying agents. Mind the gap between the timeline and the receipts.)
Here's a nuance worth naming: the same Peter Steinberger quoted in §2 as a loop‑engineering advocate also wrote "Just Talk To It," a manifesto for dropping the ceremony. That's not a contradiction \u2014 it's the boundary. Structure pays off for unattended, repeated work; it's overhead for interactive, exploratory work where you're watching the stream. Both are right, for different situations.
His core claims for the hands‑on mode, distilled:
--help
.The shared truth underneath both: build for verification, keep changes small, and never trust a summary over a check.
A loop changes the work; it doesn't delete you from it. Three problems get sharper as the loop gets better, not easier.
Build the loop. But build it like someone who intends to stay the engineer — not just the person who presses go.
| Symptom | Root cause | Fix |
|---|---|---|
| Loop runs forever | No budget / no stop condition | Add max iterations + explicit success gate |
| "Done" but broken | The check is the agent's opinion | Replace with a mechanical, fixed check |
| Huge unreviewable diff | Action wasn't bounded | One reversible change per pass; manage blast radius |
| Looks better, performs worse | Check moved between passes | Freeze the check; use a holdout for optimization |
| Acts on wrong assumptions | Stale inputs | Re‑inspect fresh state every pass |
| Silently overwrites your WIP | No protected scope | Protect uncommitted/active work; scope the loop |
| Token / cost blowups | Context bloat, no limits | Cap iterations/cost; prefer CLIs over heavy MCPs |
| Quality degrades over a long run | Context window fills with cruft | Reset to a fresh context each pass (Ralph‑style); keep state on disk |
| Verifier rubber‑stamps the work | Maker is grading its own homework | Use a separate model/instructions for the checker; trust tests over its read of the diff |
| You no longer understand the repo | Comprehension debt from unread merges | Read what the loop produced; keep human review on merges |
| Surprise bill orders of magnitude over budget | Loop that won't halt | Enforce the three hard stops: max iterations, no‑progress detection, $ ceiling |
Building your first real loop? Walk this list:
Get these right and you have a loop you can trust to run while you sleep. Everything else is iteration.
On loop engineering (the discipline):
Loops in practice (the patterns):
Tool commands and product capabilities (Codex /goal, Claude Code /loop, worktree flags, Automations) change frequently — verify against each vendor's current official docs before relying on a specific behavior. Loops in the library are shared under their authors' attribution.
These documents live in this same repo and pair directly with topics covered above. Read them in the order that matches where you are right now.
| Document | Why it pairs with this guide |
|---|---|
/loop
, worktrees, scheduled tasks) map directly to §7's five building blocks.
Suggested reading path:
- This guide (orientation + mental model)
- → [🤖 SWE-agent — Deep Dive & Build-Your-Own Guide 📘](understand the inner loop mechanics)- → [🏗️ Building High-Quality AI Agents 🤖 — A Comprehensive, Actionable Field Guide 📚](harness + tool design)- → [🚀 Claude Code: From Zero to Pro 🤖](tool-specific execution)- → [🏗️ Building Production-Grade Fullstack Products with AI Coding Agents 🤖 — A Practical Playbook 📘](ship it end-to-end)