# Loop engineering: the complete guide

> Source: <https://spidra.io/blog/loop-engineering>
> Published: 2026-06-16 00:00:00+00:00

On June 2, 2026, Boris Cherny, the person who built Claude Code at Anthropic, said something at a private WorkOS event that spread across the developer internet within days.

"I don't prompt Claude anymore. I have loops that are running. They're the ones that are prompting Claude and figuring out what to do. My job is to write loops."

Six days later, Peter Steinberger, founder of OpenClaw and now at OpenAI, posted two sentences on X that hit 6.5 million views in under 24 hours: "Here's your monthly reminder that you shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

A Google engineer named Addy Osmani published an essay that same week, giving the practice a name: loop engineering.

That is the origin story. What follows is everything the origin story does not tell you.

## What loop engineering actually is

Prompt engineering was about crafting the right message. Context engineering was about feeding the right information. Harness engineering was about building the right environment for a single agent to run in. Loop engineering is one floor above all of that. It is about building the system that decides what the agent works on, when it works, how it verifies the result, and what state survives to the next run.

Osmani's definition is the clearest: loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead.

For about two years, the way you got something out of a coding agent was: write a good prompt, share enough context, read what came back, type the next thing. You held the tool the entire time, one turn after the other. A loop inverts that arrangement. You write a small program that finds the work, hands it to the agent, checks the result, writes down what happened, and decides the next move. That program prompts the agent from then on. You become the author of the loop. The model becomes a subroutine.
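That small program can be very small. Here is a minimal Python sketch of the shape, not anyone's production loop: `find_work`, `run_agent`, and `verify` are hypothetical callables you would supply, and the iteration cap is an added assumption so the sketch cannot spin forever.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopResult:
    """What one loop run did, kept so the next run can resume."""
    log: list = field(default_factory=list)


def run_loop(
    find_work: Callable[[], Optional[str]],
    run_agent: Callable[[str], str],
    verify: Callable[[str], bool],
    max_iterations: int = 10,
) -> LoopResult:
    """Find work, hand it to the agent, check the result, record it, repeat."""
    result = LoopResult()
    for _ in range(max_iterations):
        task = find_work()
        if task is None:            # nothing left to do: stop instead of spinning
            break
        output = run_agent(task)    # the model is a subroutine here
        accepted = verify(output)   # an objective check, not the agent's opinion
        result.log.append({"task": task, "accepted": accepted})
    return result
```

Everything else in this guide is a refinement of one of those three callables or of the log that survives the run.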

The closest parallel is moving from operating a lathe to designing the production line the lathe sits on.

## Before you build anything: the 4-condition test

This is the part most X threads and blog posts skip because it is less exciting than the framework. It is also the most important.

Loops earn their cost under four conditions. Miss one and the loop costs more than it returns. Run this test on any task before you turn it into a loop.

### 1. The task repeats at least weekly.

A loop amortizes its setup cost across many runs. For a one-time job, a well-aimed prompt is faster and cheaper. If the work does not recur weekly, you do not have a loop candidate. You have a script you ran once.

### 2. Verification is automated.

This is the non-negotiable one. The loop needs something that can fail the work without you in the room. A test suite, a type checker, a linter, a build command. No automated gate means you are back in the chair reading every diff, which is exactly the job the loop was supposed to remove. A second agent told to "review this" without an objective signal is not verification. It is a second optimist.

### 3. Your token budget can absorb the waste.

Loops re-read context, retry, explore. That burns tokens whether or not the run ships anything useful. The technique reads as obvious to people with effectively unlimited tokens (Cherny and Steinberger work at Anthropic and OpenAI respectively) and expensive to people on metered plans. Single agents consume roughly 4x more tokens than standard chat. Multi-agent systems consume roughly 15x more. Peter Steinberger acknowledged $1.3 million in monthly token usage at one point. That is not a universal budget.

### 4. The agent has senior engineer tools.

Logs, a reproduction environment, the ability to run the code it writes and see what breaks. Without this, the loop iterates blind. It is guessing rather than observing.

Miss one of these four conditions and skip the loop. Use a good prompt instead.
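The test is simple enough to write down as a checklist. A sketch with hypothetical field names; the only threshold in it is the weekly cadence from condition 1, and the other three are yes-or-no answers you supply honestly.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LoopCandidate:
    runs_per_week: float        # how often the task recurs
    has_automated_gate: bool    # tests, type checker, linter, or build
    budget_absorbs_waste: bool  # retries and re-reads are affordable
    agent_can_run_code: bool    # logs, repro environment, execution


def failed_conditions(c: LoopCandidate) -> List[str]:
    """Return the conditions this task misses. An empty list means build the loop."""
    checks = {
        "repeats weekly": c.runs_per_week >= 1,
        "automated verification": c.has_automated_gate,
        "token budget": c.budget_absorbs_waste,
        "senior engineer tools": c.agent_can_run_code,
    }
    return [name for name, ok in checks.items() if not ok]
```

Any non-empty result is a reason to use a prompt instead.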

## Who actually benefits and who should wait

The economics are not universal and the honest version of this story says so upfront.

### Who benefits today:

The developers who get the most out of loops are working on problems that repeat on a predictable schedule and have a clear automated way to check whether the work was done correctly.

Think of a team that runs CI every night and spends part of every morning triaging the failures. Or a codebase where dependency updates are a weekly chore that always follows the same steps. Or a team where linting and style fixes pile up across dozens of open PRs. These situations share the same shape: the task repeats, it is well-defined, and a test suite or linter can tell you objectively whether the agent did it right without anyone reading a diff.

Engineers at companies where token costs are not a real constraint also benefit significantly. Boris Cherny works at Anthropic. Peter Steinberger works at OpenAI. When they describe their workflows, they are describing setups built inside organizations where model usage is effectively free for employees. That context matters more than most loop engineering articles acknowledge.

### Who should wait:

Solo builders on consumer plans tend to hit the token bill before they see the productivity gain. That gap is discouraging and real.

Teams without automated test coverage are in a harder position. Without an objective gate, the loop has no reliable way to know if it actually did the work correctly. A second agent saying "this looks good" is not verification. It is just a second opinion that usually agrees with the first.

There is also a subtler case worth naming: teams where the bottleneck is review capacity, not typing speed. A loop generates more code faster. If the PR queue is already hard to keep up with, more output into that same pipeline makes the problem worse, not better.

Clinton, a developer who replied to one of the original threads, put the counterargument well: "I like being the loop. I like going back and forth with the agent, reviewing its choices, questioning its assumptions, correcting its direction, and learning new things in the process. Manual interaction still has value, especially when you are trying to learn new things."

That is not a wrong take. Direct interaction with an agent is still valuable for exploratory work, learning, and anything where the right answer is not yet clear. Loops are for work that is well-defined and repeating. They are not a replacement for thinking.

## The six building blocks

Osmani mapped a loop into six primitives. Both Claude Code and OpenAI Codex implement all six, with different names but the same shape.

### 1. Automations: the heartbeat

Automations are what make a loop an actual loop rather than a one-time run. They fire on a schedule, on an event, or on a trigger condition, without you typing anything. The agent finds work and triages it before you ask.

In Claude Code, three primitives compose into this: `/loop` for session-scoped cadence, scheduled tasks for runs that survive a laptop closing, and hooks for firing shell commands at specific points in the agent lifecycle. GitHub Actions handles anything that needs to keep running in the cloud.

In Codex, the Automations tab covers the same ground. You pick a project, a prompt, a cadence, and whether it runs on a local checkout or a background worktree. Runs that find something land in a Triage inbox. Runs that find nothing archive themselves.

Two specific commands are worth understanding here because they do different things:

`/loop` re-runs a prompt on a cadence. Use it when you want regular checks regardless of state.

`/goal` shipped in Claude Code version 2.1.139 on May 12, 2026. It keeps running until a condition you wrote is actually true. The key mechanic is that a separate small model checks whether the condition is met at the end of each turn. The agent that wrote the code is not the one grading it. If the condition is not met, the task prompt reinjects and the loop continues.

One practical note from developers who have tested this: `/loop` and `/goal` cannot currently be combined directly in Claude Code. `/goal` is a UI command and cannot be auto-invoked inside a loop. If you try, you get a warning. Worth knowing before you design a loop around both of them.

### 2. Worktrees: parallel without chaos

The second you run more than one agent against the same codebase, the files start colliding. Two agents writing the same file at the same time is the same headache as two engineers committing to the same lines without talking first.

A git worktree is a separate working directory on its own branch that shares the same repo history. One agent's edits cannot touch another's checkout because they are physically in different directories.
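The underlying git command is `git worktree add`. A minimal Python sketch that gives each agent its own checkout on its own branch; the function names and the branch naming are assumptions for illustration, and it expects `git` on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_command(repo: Path, branch: str, checkout_dir: Path) -> List[str]:
    """Build the argv that creates a new branch checked out in its own directory."""
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout_dir)]


def create_worktree(repo: Path, branch: str, checkout_dir: Path) -> None:
    """Create the worktree; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_add_command(repo, branch, checkout_dir), check=True)
```

Calling `create_worktree(Path("."), "agent/fix-1", Path("../wt-fix-1"))` once per agent keeps their edits in separate directories while they share one history.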

Codex builds this in. Multiple threads hit the same repo at once without bumping into each other.

Claude Code exposes it through `git worktree`, a `--worktree` flag to open a session in its own checkout, and an `isolation: worktree` setting on subagents so each helper gets a fresh checkout that cleans itself up after.

The thing worktrees do not solve: your review bandwidth is still the ceiling. Worktrees take away the mechanical collision. You are still the bottleneck on how many parallel changes you can actually evaluate and merge.

### 3. Skills: write project knowledge once

A skill is how you stop re-explaining the same project context every session. Both tools use the same format: a folder with a `SKILL.md` file containing instructions and metadata, plus optional scripts, references, and assets.

Why this matters specifically inside loops: a loop without skills re-derives your entire project context from zero every cycle. With skills, intent compounds. The conventions, the build steps, the "we do not do it like this because of that one incident" are all written once, where the agent reads them every run.

A skill file for CI triage might look like this:

```
---
name: ci-triage
description: Classify CI failures by root cause and draft fixes for the
  easy ones. Trigger whenever a workflow run fails or on the morning triage loop.
---

## Classification rules
- env: missing secret, wrong env var, infra not provisioned → escalate to human
- flake: passes on retry without code change → retry once, then file
- bug: deterministic failure tied to recent commit → draft fix
- dependency: failure tied to a version bump → draft rollback
- infra: timeout, OOM, runner issue → escalate

## Fix patterns
- Auth tests → check src/auth/middleware first
- Database tests → verify migration applied in CI env

## Never do
- Disable failing tests. Always escalate instead.
- Modify CI config without human approval.
- Touch src/payments/ or src/billing/ under any circumstances.

## State
Update STATE.md after each run: paths checked, classifications, PRs opened, items escalated.
```

One distinction worth keeping clear: a skill is the authoring format. A plugin is how you distribute it. When you want to share a skill across repos or bundle several together, package them as a plugin.

### 4. Connectors: the loop touches your real tools

A loop that can only see the filesystem is a limited loop. Connectors, built on the Model Context Protocol (MCP), let the agent read your issue tracker, query a database, hit a staging API, drop a message in Slack. Both Claude Code and Codex speak MCP, so a connector you write for one usually works in the other.

This is the difference between an agent that says "here is the fix" and a loop that opens the PR, links the Linear ticket, and pings the channel when CI is green. Without connectors, the loop can only tell you what it would do. With connectors, it acts.

The connectors that pay back fastest for loop work:

**GitHub** is the obvious one. Read repos, create branches, open PRs, comment on issues, react to webhooks. If your loop does anything with code, this is the first connector to set up.

**Linear or Jira** close the loop between code work and project tracking. The loop updates tickets as it progresses, links PRs back to issues, closes items when verification passes.

**Slack** surfaces overnight run results without anyone having to check a dashboard. Post triage results, ping humans on escalations, summarize what happened while everyone was asleep.

**Sentry or your error tracker** lets the loop investigate live alerts and draft fixes for the high-frequency ones.

A word of caution on community connectors and skills: an audit of publicly available skill files found that 520 of 17,022 contained credential leaks. Read the source before installing anything you did not write yourself.

### 5. Sub-agents: keep the maker away from the checker

This is the structural decision that most determines whether a loop is trustworthy or not.

The agent that wrote the code is, in Osmani's words, "way too nice grading its own homework." A model evaluating its own output will mark its own work as done more often than it should. It reasons itself into believing it succeeded because it remembers the reasoning that led to the action.

A second agent with different instructions, sometimes running on a different model, catches the failures the first one talked itself into. This is the evaluator-optimizer pattern that Anthropic documented in December 2024 in its post "Building effective agents." The vocabulary went viral in June 2026, but the pattern was published eighteen months earlier.

In Claude Code, subagents live in `.claude/agents/` and can be configured with their own system prompts and model settings. In Codex, they are TOML files in `.codex/agents/`. A typical split: one agent explores, one implements, one verifies against the tests.

The important thing is that the verifier has an objective standard to check against, not just an opinion. A test that passes or fails. A build that compiles or does not. A linter that returns zero or non-zero. An agent told to "review this and tell me if it looks good" is not a verifier. It is a second agent that will usually agree with the first.
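An objective standard usually reduces to an exit code. A minimal sketch of a gate that runs each check as a subprocess and passes only if every one exits zero; the function name and the ten-minute timeout are assumptions, and the commands themselves are whatever your project already uses for tests, linting, and builds.

```python
import subprocess
from typing import Sequence


def gate_passes(commands: Sequence[Sequence[str]], timeout_s: float = 600) -> bool:
    """Run each check in order. Any non-zero exit, timeout, or missing binary fails the gate."""
    for cmd in commands:
        try:
            completed = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        except (OSError, subprocess.TimeoutExpired):
            return False            # a check that cannot run counts as a failure
        if completed.returncode != 0:
            return False            # stop at the first failing check
    return True
```

Treating "could not run" as a failure is the deliberate choice here: a gate that passes when the test runner is missing is the second optimist again.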

Sub-agents cost more tokens because each one does its own model calls. Spend them where a second opinion is genuinely worth paying for, not on every task.

### 6. Memory: the agent forgets, the file does not

This is the piece that sounds too simple to matter and is actually the spine of every working loop. A markdown file, a Linear board, a JSON state object — anything that lives outside the single conversation and holds what has been done and what comes next.

Agents have no memory between runs by default. What they learn this session is gone next session unless you write it down outside the context window. The loop without persistent state restarts from zero every run. The loop with state resumes.

A minimal STATE.md might look like this:

```
# Loop state · ci-triage

## Last run
2026-06-09 03:30 UTC · 7 failures classified, 3 fixes drafted, 4 escalated

## In progress
- claude/fix-auth-token-refresh — tests passing locally, awaiting CI
- claude/fix-flaky-payment-webhook — retry pattern applied, monitoring

## Completed today
- claude/bump-axios-1.7.4 → merged
- claude/lint-fix-pass-june-9 → merged

## Escalated to humans
- src/billing/refund.ts — tests failing in 3 ways, root cause unclear

## Lessons learned
- 2026-06-08: PowerShell hits TLS 1.2 issue on this Windows runner. Use bash.
- 2026-06-07: tests/e2e/checkout requires Stripe webhook secret in env. Skip if missing.
```

The lessons learned section is the most important part. Every mistake the agent makes, the correction goes here. Future runs read this section and do not repeat the same mistake. This is CLAUDE.md applied to loop-specific knowledge. It compounds over time.
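If you prefer a JSON state object to markdown, the same idea fits in a few functions. This is a sketch under assumptions: the key names (`completed`, `escalated`, `lessons`) are hypothetical, and the write goes through a temporary file so a crash mid-run cannot leave a half-written state file behind.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the state file, or return an empty state on the first run."""
    if not path.exists():
        return {"completed": [], "escalated": [], "lessons": []}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)


def record_lesson(path: Path, date: str, lesson: str) -> None:
    """Append a lesson once; repeated runs do not duplicate it."""
    state = load_state(path)
    entry = {"date": date, "lesson": lesson}
    if entry not in state["lessons"]:
        state["lessons"].append(entry)
    save_state(path, state)
```

The agent reads the file at the start of a run and the loop calls `record_lesson` whenever a human corrects it.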

For long-running loops, pair the state file with a standing high-level spec (VISION.md or AGENTS.md) that the agent reads each run. The state file tells it where it is. The spec tells it where to go.

## The loop types: from ReAct to the Ralph Loop

Loop engineering did not appear from nowhere. The underlying patterns have been developing since 2022. Understanding where each loop type came from helps you choose the right one.

### ReAct (2022)

Published by Princeton and Google Research in October 2022, five months before AutoGPT made the concept mainstream. ReAct stands for Reasoning + Acting.

At each step, the agent produces two things: a reasoning trace explaining what it is doing and why, and a concrete action. The result of each action feeds into the next reasoning step. When something unexpected comes back, the agent can reason about it rather than retrying blindly.
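The control flow is short. Below is a sketch of the thought, action, observation cycle, not the paper's code: `policy` stands in for the model call and is hypothetical, as are the `finish` action name and the step cap.

```python
from typing import Callable, Dict, List, Tuple

# One step is (thought, action_name, action_input); the action "finish" ends the loop.
Step = Tuple[str, str, str]


def react(
    question: str,
    policy: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> str:
    """Alternate reasoning and acting, feeding each observation into the next step."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)   # the model sees everything so far
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return arg
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("step budget exhausted without an answer")
```

An unknown tool becomes an observation rather than a crash, which is what lets the next reasoning step react to it.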

ReAct is the foundation most production frameworks build on. It is the right starting point for almost every loop engineering project. Add complexity only when ReAct hits a clear limit.

### Reflexion (2023)

ReAct with a self-evaluation layer. After completing or failing a task, the agent generates a critique of what went wrong. That critique gets stored in memory and injected into the next attempt.

More expensive than ReAct because of the extra model calls for reflection. Better on trial-and-error tasks: debugging, unfamiliar codebases, multi-step problems where early approaches fail. Usually not worth the overhead for straightforward retrieval or well-defined tasks.
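Stripped down, the mechanism is a retry loop that carries its critiques forward. A sketch with hypothetical callables: `attempt` and `critique` would be model calls, and `evaluate` should be an external check such as a test run.

```python
from typing import Callable, List, Optional


def reflexion(
    attempt: Callable[[List[str]], str],
    evaluate: Callable[[str], bool],
    critique: Callable[[str], str],
    max_trials: int = 3,
) -> Optional[str]:
    """Try, check, and on failure store a critique that the next try receives."""
    reflections: List[str] = []
    for _ in range(max_trials):
        candidate = attempt(reflections)        # past critiques are injected here
        if evaluate(candidate):
            return candidate
        reflections.append(critique(candidate)) # the extra model call that costs more
    return None
```

Each failed trial adds one `critique` call, which is where the overhead relative to plain ReAct comes from.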

### Plan-and-Execute (2023)

Separates thinking from doing. A planner generates a full task breakdown upfront. An executor works through the steps. A re-planner adjusts when execution diverges from the plan.

A 2024 paper from LangChain reported a 3.6x speedup over sequential ReAct by running independent steps in parallel. The tradeoff: Plan-and-Execute is less adaptive when early steps produce unexpected results. ReAct recalibrates at every step. Plan-and-Execute commits to a plan.
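The three roles map onto three callables. This sketch runs steps sequentially for clarity, so it shows the planner, executor, and re-planner split but not the parallel speedup; all names and the replan cap are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def plan_and_execute(
    goal: str,
    plan: Callable[[str], Sequence[str]],
    execute: Callable[[str], Tuple[bool, str]],
    replan: Callable[[str, List[Tuple[str, str]], str], Sequence[str]],
    max_replans: int = 2,
) -> List[Tuple[str, str]]:
    """Plan once up front, execute step by step, re-plan only when a step fails."""
    steps = list(plan(goal))
    done: List[Tuple[str, str]] = []
    replans = 0
    while steps:
        step = steps.pop(0)
        ok, result = execute(step)
        if ok:
            done.append((step, result))
            continue
        if replans >= max_replans:
            raise RuntimeError(f"step {step!r} still failing after {replans} replans")
        replans += 1
        # The re-planner sees the goal, what already succeeded, and what just failed.
        steps = list(replan(goal, done, step))
    return done
```

Compare this with ReAct: the model is consulted once per plan, not once per step, which is both the saving and the loss of adaptivity.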

### The Inner/Outer Dual Loop

Microsoft's Magentic-One architecture. An outer loop handles strategic planning and monitors progress against the original goal. An inner loop handles step-by-step execution within the current strategy.

The key advantage: when the inner loop stalls, the outer loop can reset the entire strategy rather than retrying the broken approach indefinitely. This prevents the failure mode where an agent repeats a broken approach because it has no mechanism to step back far enough.
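The stall-then-reset behavior can be sketched with two nested loops and a counter. This is an illustration of the idea, not Magentic-One's implementation: the outcome strings, the stall threshold, and the fixed list of strategies are all assumptions.

```python
from typing import Callable, Iterable, Optional


def dual_loop(
    strategies: Iterable[str],
    step: Callable[[str], str],
    max_stalls: int = 2,
    max_inner_steps: int = 20,
) -> Optional[str]:
    """step(strategy) returns "done", "progress", or "stall"."""
    for strategy in strategies:                  # outer loop: pick or reset the strategy
        stalls = 0
        for _ in range(max_inner_steps):         # inner loop: execute within it
            outcome = step(strategy)
            if outcome == "done":
                return strategy
            # Consecutive stalls accumulate; any progress resets the counter.
            stalls = stalls + 1 if outcome == "stall" else 0
            if stalls > max_stalls:
                break                            # hand control back to the outer loop
    return None
```

The function returns the strategy that finished the job, or `None` if every strategy stalled out.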

### The Ralph Loop (2025)

Invented by Geoffrey Huntley in July 2025. Named after Ralph Wiggum from The Simpsons, who announces "I'm helping!" while walking into doorframes. Deliberately simple. Became a standard pattern in under six months.

The mechanics: a coding agent runs inside an infinite shell loop. Each iteration reads the same prompt file from disk. The agent modifies the codebase and exits. The loop restarts with a fresh context window. State lives in the file system.

It solves two specific production problems:

**Context overflow.** Long sessions degrade as the context window fills. The Ralph Loop resets context each iteration. The new session reads current state from disk rather than carrying forward a degraded context.

**Premature exit.** LLMs stop when they subjectively decide the task is complete. A Stop Hook intercepts exit attempts, checks whether the completion criteria actually hold (tests green, coverage above threshold, type checks clean), and reinjects the task prompt if they do not. The loop cannot exit by claiming it is done. It can only exit by proving it is done.
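Both fixes fit in a few lines. The original pattern is a shell loop; this is a Python sketch of the same shape, with two assumptions worth flagging: `run_agent_once` is a hypothetical callable that starts a fresh agent session on the prompt file and returns when it exits, and the iteration cap is an added safety limit the infinite version does not have.

```python
from typing import Callable, Sequence


def ralph_loop(
    run_agent_once: Callable[[], None],
    completion_checks: Sequence[Callable[[], bool]],
    max_iterations: int = 50,
) -> int:
    """Restart the agent with fresh context until every external check holds."""
    for iteration in range(1, max_iterations + 1):
        run_agent_once()    # fresh context each time; state lives on disk
        # The agent's own claim of completion is never consulted.
        if all(check() for check in completion_checks):
            return iteration
    raise RuntimeError("iteration cap reached before completion criteria held")
```

The checks would be things like the test suite, the type checker, and a coverage threshold, each wrapped as a function that returns `True` or `False`.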

### The /goal command (2026)

Built directly into Claude Code as of version 2.1.139, shipped May 12, 2026. You set a completion condition and Claude works autonomously across multiple turns until it holds. A separate evaluator model checks the condition at the end of each turn. The agent that wrote the code is not the one deciding it is finished.

Early adopters called it "the most underrated AI feature of 2026" because it eliminates the manual iteration cycle on multi-step tasks entirely.

## How a loop fails: the failure modes

These are not edge cases. They happen in production.

### The Ralph Wiggum loop

Named for the same Simpsons character as the Ralph Loop, but for the opposite reason. An agent that is supposed to emit a completion signal only when genuinely finished emits it early because it has decided it is done. The loop exits on a half-finished job believing it succeeded.

This happens when the stop condition is based on the agent's own judgment rather than an external verifiable check. The fix is always the same: replace "the agent says it is done" with "the test suite says it is done" or "the build says it is done."

### Context overflow

Long loops fill the context window. Reasoning degrades. Constraints introduced at turn 2 disappear by turn 47 because every summarization step is lossy. "Do not modify src/billing/" becomes a faint signal in an ocean of conversation history.

The Ralph Loop's context reset on each iteration addresses this. So does a VISION.md or AGENTS.md file that gets re-read at the start of each run, re-establishing the constraints the agent should never lose.

### Self-preferential bias

The model that generated the code evaluates it generously. It remembers its own reasoning and is inclined to trust it. A 2024 Anthropic post documented this as the core reason to split the maker from the checker. It is not a speculation about model behavior. It is an observed and measured failure mode.

### Silent failure

The agent produces confident output, calls are being made, tokens are being consumed, and nothing is actually changing. The hardest failure mode to catch because nothing breaks. The loop just spins without progress.

Detection: a no-progress check in your stopping conditions. If the output state has not changed between two iterations, exit and report rather than continuing to run.
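One way to implement that check is to fingerprint whatever the loop is supposed to change and compare between iterations. A sketch, assuming the output state is a set of files; the class name and the default of one idle iteration are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Optional


def fingerprint(paths: Iterable[Path]) -> str:
    """Hash the names and contents of the files the loop is expected to modify."""
    h = hashlib.sha256()
    for p in sorted(Path(x) for x in paths):
        h.update(str(p).encode())
        h.update(p.read_bytes() if p.is_file() else b"<missing>")
    return h.hexdigest()


class ProgressWatch:
    """Reports a stall when the fingerprint is unchanged for too many iterations."""

    def __init__(self, max_idle_iterations: int = 1) -> None:
        self.max_idle = max_idle_iterations
        self._last: Optional[str] = None
        self._idle = 0

    def stalled(self, current: str) -> bool:
        if current == self._last:
            self._idle += 1
        else:
            self._idle = 0
            self._last = current
        return self._idle >= self.max_idle
```

Call `stalled(fingerprint(watched_files))` at the end of each iteration and exit with a report when it returns `True`.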

### Token cost explosion

One documented production incident: an agent called a broken tool 400 times in five minutes. Without a hard token budget cap and a circuit breaker on tool calls, a loop that hits a broken environment turns into an expensive infinite loop very quickly.

Single agents run at roughly 4x the token cost of standard chat. Multi-agent systems at roughly 15x. A loop that runs nightly and has no hard spending limit will produce a surprise at the end of the month.
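A hard cap and a circuit breaker are a few lines each. A sketch with hypothetical names; token counts are whatever your provider reports per call, and the failure threshold is an assumption you would tune.

```python
from typing import Any, Callable


class BudgetExceeded(RuntimeError):
    pass


class CircuitOpen(RuntimeError):
    pass


class RunGuard:
    """Stops a run when it overspends or keeps hitting a broken tool."""

    def __init__(self, max_tokens: int, max_consecutive_tool_failures: int = 5) -> None:
        self.max_tokens = max_tokens
        self.max_failures = max_consecutive_tool_failures
        self.tokens_used = 0
        self._failures = 0

    def charge(self, tokens: int) -> None:
        """Record token usage and abort the run once the cap is crossed."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"{self.tokens_used} tokens used, cap is {self.max_tokens}")

    def call_tool(self, tool: Callable[..., Any], *args: Any) -> Any:
        """Refuse further tool calls after too many failures in a row."""
        if self._failures >= self.max_failures:
            raise CircuitOpen("tool calls suspended after repeated failures")
        try:
            result = tool(*args)
        except Exception:
            self._failures += 1
            raise
        self._failures = 0      # any success closes the streak
        return result
```

With a threshold of five, the 400-call incident above would have stopped at the sixth attempt.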

### Comprehension debt

Osmani's most important warning, and the one that gets sharper, not easier, as the loop gets better.

The faster a loop ships code you did not write, the larger the gap between what the repository contains and what you actually understand. The token bill is visible immediately. The comprehension debt is invisible until the day you have to debug a system nobody on the team has actually read.

The mitigation is not technical. Read the diffs. Every loop run that merges something you did not read is a withdrawal from a comprehension account that eventually becomes overdrawn.

### Cognitive surrender

The companion failure mode to comprehension debt. Once loops are running, the pull toward accepting whatever they produce grows. Designing a loop with judgment is the cure. Using a loop to avoid forming a judgment is the accelerant. Same action, opposite result.

## Boris Cherny's actual workflow

The numbers behind the quote that started this conversation.

In the 30 days before December 27, 2025, Cherny reported that 100% of his contributions to Claude Code were written by Claude Code itself: 259 PRs merged.

His setup: five Claude Code instances running in numbered terminal tabs, five to ten Claude browser sessions going simultaneously, and system notifications configured so he only checks in when an agent specifically needs his input. A teleport command handles handing context between local and cloud sessions. CLAUDE.md sits at the center of all of it as the persistent instruction layer that every new session reads at startup.

The CLAUDE.md practice is the key insight that often gets lost. Every mistake an agent makes, the correction goes into CLAUDE.md. Future sessions do not repeat it. The file becomes a cumulative record of learned project knowledge that survives context resets. It is the loop's long-term memory, manually curated over time.

## The autonomy ladder

Not every loop should run fully unattended from day one. This framework from Cherny maps four levels of autonomy, each earned through demonstrated reliability at the previous level.

**Level 1: Suggests only.** The loop identifies work and proposes actions. A human approves every action before anything happens. No code is written, no files are changed, no PRs are opened without explicit approval.

**Level 2: Drafts for human application.** The loop writes the code, opens draft PRs, generates the changes. A human reviews everything and applies it manually. The loop does the generation. The human does the execution.

**Level 3: Applies with approval gate.** The loop applies low-risk changes automatically: lint fixes, minor dependency bumps, and test additions can merge after a brief review window. Everything else still requires human approval before publish, merge, or deployment. Anything that touches production requires a human to approve.

**Level 4: Fully autonomous with audit logs.** The loop applies and completes actions automatically. Human oversight comes from audit logs, not approval gates. Reserved for tasks where the loop has demonstrated reliable level 3 performance for an extended period.

Start every new loop at level 1 or 2. Run it for at least a week. Read its output. Correct what it gets wrong in the skill file. Move to level 3 only when the loop is consistently producing work you would approve without changes. Level 4 is earned, not assumed.

## The minimum viable loop

If you passed the 4-condition test, build the smallest loop that works before anything more complex. Four parts, no multi-agent swarm required.

**One automation.** A scheduled run that fires on a cadence and stops on a clear condition. Use `/loop` in Claude Code or an automation in Codex. `/goal` for run-until-done behavior.

**One skill.** A single `SKILL.md` that stores the project context the agent would otherwise re-derive from zero every run.

**One state file.** A markdown file that records what is done and what is next. Tomorrow's run resumes instead of restarting.

**One gate.** The test, type check, or build that fails bad work automatically. This is the only part that decides whether the loop helps or just spends.

Order matters: get one manual run reliable first. Turn the working prompt into a skill. Wrap it in a scheduled loop. Then connect it to external tools. Skip ahead and you are paying for a system no one understands.

The metric that matters is cost per accepted change. Not tokens spent, not tasks attempted, not loops scheduled. If your accepted-change rate is below 50%, the loop is creating review work, not removing it.
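The arithmetic is worth automating so it gets looked at. A sketch with a hypothetical function name; the 50% threshold is the one stated above, and the inputs are whatever your billing page and PR history report.

```python
def loop_metrics(total_cost_usd: float, changes_proposed: int, changes_accepted: int) -> dict:
    """Compute acceptance rate and cost per accepted change for one reporting period."""
    if changes_proposed < 0 or not 0 <= changes_accepted <= changes_proposed:
        raise ValueError("accepted must be between 0 and proposed")
    rate = changes_accepted / changes_proposed if changes_proposed else 0.0
    # With nothing accepted, every dollar was waste: report infinite unit cost.
    cost_per = total_cost_usd / changes_accepted if changes_accepted else float("inf")
    return {
        "acceptance_rate": rate,
        "cost_per_accepted_change": cost_per,
        "creating_review_work": rate < 0.5,
    }
```

For example, $120 spent on 40 proposed changes with 30 accepted is a 75% acceptance rate at $4 per accepted change.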

## Good first loops and bad first loops

**Good first loops:**

**CI failure triage** runs nightly, scans failures, classifies root causes, and drafts fix PRs for the easy ones while escalating the rest. It has a clear trigger, a clear output, and automated verification baked in.

**Dependency bump PRs** check for updates weekly, test compatibility, and open PRs. The task is deterministic, machine-checkable, and low-risk.

**Lint-and-fix passes** run on every PR open event and apply style fixes automatically. The outcome is binary: the linter passes or it does not.

**Flaky test reproduction** loops until a theory about why a test fails survives three consecutive runs. Clear stopping condition, no judgment required.

**Issue-to-PR drafts** on codebases with strong test coverage let bad output get rejected by the test suite automatically. The gate does the review for you.

**Bad first loops:**

**Architecture rewrites** have no automated verification and no clear definition of done. Everything about what "good" looks like is a judgment call.

**Auth and payments code** are too high-stakes for an unattended loop. A mistake here is not a failed test. It is a security incident or a financial error.

**Production deploys** fall into the same category. The cost of getting it wrong is too high.

**Vague product work** has no clear stopping condition, no automated gate, and accumulates comprehension debt faster than almost any other loop type.

The common thread across all the bad ones: the correct answer requires taste, context, or judgment that cannot be captured in a skill file and checked by a machine.

## The security tax

A loop running unattended is also an attack surface running unattended. These threats are not hypothetical.

**Unreviewed PRs merging automatically.** The loop opens PRs faster than a human can read them. Without security checks baked into the gate (SAST, dependency audit, secret scanning), insecure code can merge without anyone seeing it.

**Skills as injection vectors.** An audit of publicly available skill files found that 520 of 17,022 contained credential leaks. A loop that auto-installs community skills inherits every prompt injection in their descriptions. Read the source before installing anything you did not write yourself.

**Credentials in logs.** Debug logging during a long-running loop scatters secrets across log files you may not be monitoring. Disable verbose logging in production loops. Sanitize what does get logged.
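Sanitizing can be done at the logging layer, before anything reaches disk. A sketch using a standard `logging.Filter`; the regular expression is illustrative only and will miss secrets that do not follow a `name=value` or `name: value` shape, so treat it as a backstop, not a guarantee.

```python
import logging
import re

# Illustrative pattern: a secret-looking key, a separator, then the value to hide.
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|token|secret|password)(\s*[=:]\s*)(\S+)")


class RedactSecrets(logging.Filter):
    """Rewrites each record so matching values never reach a handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so secrets passed as %-style arguments are covered too.
        record.msg = SECRET_PATTERN.sub(r"\1\2[REDACTED]", record.getMessage())
        record.args = ()
        return True     # keep the record, just with the value masked
```

Attach it with `handler.addFilter(RedactSecrets())` on every handler the loop writes to.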

**Permission scope creep.** A loop tested with read-only permissions gets "just one" write permission added for convenience. That permission never gets re-audited. Audit your loop's permissions every 30 days and strip anything it does not actively need.

**Command allowlists.** Any loop that can execute shell commands should have an explicit allowlist of exactly which commands it can run. An agent with unrestricted shell access inside an unattended loop is the fastest way to turn a token cost problem into a security incident.
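An allowlist check can sit between the agent and the shell. A conservative sketch: the allowlist contents are hypothetical examples, and the function rejects anything containing shell metacharacters outright, which will refuse some legitimate commands in exchange for being hard to trick.

```python
import shlex
from typing import Dict, Optional, Set

# Hypothetical allowlist: program -> permitted first argument, or None for any arguments.
ALLOWED: Dict[str, Optional[Set[str]]] = {
    "git": {"status", "diff", "add", "commit"},
    "pytest": None,
    "ruff": {"check", "format"},
}

SHELL_METACHARACTERS = set(";|&`$><\n")


def is_allowed(command_line: str, allowlist: Dict[str, Optional[Set[str]]]) -> bool:
    """Accept a command only if its program and subcommand are explicitly listed."""
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return False                    # no chaining, pipes, redirects, or substitution
    try:
        argv = shlex.split(command_line)
    except ValueError:
        return False                    # unbalanced quotes and similar
    if not argv or argv[0] not in allowlist:
        return False
    subcommands = allowlist[argv[0]]
    if subcommands is None:
        return True
    return len(argv) > 1 and argv[1] in subcommands
```

Run the approved command as an argument list without a shell, so the string that was checked is the string that executes.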

## The bigger picture

Developers went from prompt engineering to context engineering to harness engineering to loop engineering in fewer than 18 months. The pattern is consistent: each wave is one level of abstraction above the previous one.

A cron job and a loop look similar from the outside. Both run on a schedule. The difference is in the middle. A cron job runs a fixed script. A loop runs a model that reads the current state and chooses its next action. The scheduling layer is the same. The decision layer is not.

What the people who built these tools are saying is that the leverage point has moved up the stack. The prompt is no longer where the meaningful work happens. The loop is. Your job, as a developer working with AI tools in 2026, is less about crafting the perfect message and more about designing the system that crafts messages for you, checks the results, and decides what to do next.

That is a real shift. It is not universal yet. The token costs are real, the tooling is new, and most developers are still a few conditions short of making a loop worth building. But the direction is clear, and the window to learn this before it becomes mainstream is shorter than it looks.

## What most developers should do right now

Not build a loop. Run the 4-condition test first.

Does the task repeat at least weekly? Is verification automated? Can your token budget absorb the waste? Does the agent have senior engineer tools?

If you pass all four: start with one small, manually supervised loop on a low-risk task. CI failure triage, dependency bumps, or lint-and-fix are all good starting points. Build the skill file first. Then the state file. Then the gate. Run it for a week at level 1 or 2 autonomy, read its output, and correct what it gets wrong. Then, and only then, step back and let it run more autonomously.

If you do not pass all four: use a well-aimed prompt. It is still the right tool for most tasks. The leverage point moved, but it has not moved for everyone yet.