OpenAI Codex now finishes 85% of scoped tasks. Here is the /goal workflow that gets you there.

OpenAI's Codex now completes 85-90% of well-scoped maintenance tasks using its /goal workflow, which loops until a goal is verified complete or a token budget is exhausted. The feature, available in Codex CLI 0.128.0 and generally available in version 0.133.0, relies on binary success checks, tight scope, and observable completion conditions. Tasks like fixing failing tests or adding typed interfaces succeed reliably, while ambiguous goals like redesigning data models fail.

OpenAI has been circulating an 85 to 90 percent success rate for Codex on well-scoped maintenance work. That number comes from internal testing, not an independent benchmark. But the mechanics behind it are real, and they explain both why it works and when it falls apart.

The feature is /goal. It shipped in Codex CLI 0.128.0 and became generally available across the CLI, IDE extension, and Codex app in version 0.133.0 on May 21, 2026. The short version: you set a goal, Codex loops until it believes the goal is complete, and the only hard stops are an evaluation that says "done" or a token budget that runs dry. Understanding why that loop succeeds or fails on any given task is the whole game.

| Scenario | Outcome | Why |
|---|---|---|
| Fix a failing test with a known error message | High pass rate | Scope is tight, completion is verifiable |
| Add a typed interface to an existing module | High pass rate | Output shape is checkable |
| Refactor a cross-cutting concern across 12 files | Fails often | Ambiguous scope, no clear done signal |
| Redesign the data model | Fails always | No binary done-check possible |
| Update a dependency and fix breakage | Medium | Depends on how far the breakage spreads |

A standard Codex turn is stateless. You ask something, it runs, the session ends. /goal breaks that pattern. When you set a goal, Codex injects two prompts at the end of every turn automatically: `goals/continuation.md` and `goals/budget_limit.md`. The first tells the model to check whether the goal is complete and decide whether to continue. The second tracks token consumption and stops the loop before it exceeds your budget. The loop runs forward until one of those two conditions triggers.

Before version 0.133.0, goals were session-scoped. When the CLI process died, the goal died. The 0.133.0 release backed goals with dedicated storage so they track progress across active turns, including across CLI restarts. That is the "persisted" part.
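
The two stop conditions described above can be sketched as a small loop. This is a minimal TypeScript illustration under stated assumptions, not Codex's implementation: `runGoalLoop`, `TurnResult`, and the stub turn are hypothetical names, and the real tool works by injecting prompts rather than calling a function. The sketch also checks the budget between turns, so its final turn can overshoot.

```typescript
// Hypothetical shapes for illustration; none of these names come from Codex.
type TurnResult = { tokensUsed: number; goalComplete: boolean };
type StopReason = "goal_complete" | "budget_exhausted";

interface LoopOutcome {
  reason: StopReason;
  turns: number;
  tokensUsed: number;
}

// Runs turns until the evaluation reports completion or the budget is spent.
function runGoalLoop(
  runTurn: (turn: number) => TurnResult,
  tokenBudget: number,
): LoopOutcome {
  let tokensUsed = 0;
  let turns = 0;
  // Budget is checked before each turn starts (an assumption of this sketch).
  while (tokensUsed < tokenBudget) {
    turns += 1;
    const result = runTurn(turns);
    tokensUsed += result.tokensUsed;
    if (result.goalComplete) {
      return { reason: "goal_complete", turns, tokensUsed };
    }
  }
  return { reason: "budget_exhausted", turns, tokensUsed };
}

// Usage: a stub turn that reports completion on its third attempt.
const outcome = runGoalLoop(
  (turn) => ({ tokensUsed: 1000, goalComplete: turn === 3 }),
  10000,
);
console.log(outcome);
```

A turn that never reports completion leaves only the budget as a stop, which is why a goal without a checkable finish line tends to burn through its allowance.
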
The goal state survives a reboot.

Version 0.132.0 (May 19, 2026) added one important fix: goal continuations now stop at usage limits instead of spinning indefinitely. Before that fix, a goal with no clear completion signal would run until the process died or the account hit a rate limit.

The loop pattern OpenAI uses here is not novel. Practitioners call this the "Ralph loop": an agent that checks its own output and decides whether to keep going. Codex adds budget accounting and a persistence layer on top. The prompt injection runs automatically; you never write the continuation prompts yourself.

Three properties push a task into the high success range.

The goal must have a binary success check. "Fix the failing tests in `src/auth`" works. "Improve the auth module" does not. The agent needs to run a verification step and get a yes or no result. Passing CI is yes or no. "Better code" is not.

The scope must stay tight. A goal that touches one module or one interface definition gives the agent a small search space. If the fix requires changes in five unrelated parts of the codebase, the agent will solve three of them and stall on the fourth with no way to know it stalled.

The success condition must be observable from within the session. Write a shell command that returns 0 on success and non-zero on failure, and the agent can self-check. Tests are the obvious example. Type checks work too. Lint rules work. "The PR passes review" does not, because the agent cannot run that check.

The tasks I have seen work well share one trait: every one of them has a finish line the agent can reach and measure.

The failure modes split into two categories: scope creep and unprovable completion.

Scope creep happens when the agent fixes one thing and reveals another. You ask it to fix a failing integration test. It fixes the test by updating the mock. The mock now diverges from the real API. The agent has no instruction to check that, so it declares done.
The CI passes locally and fails in staging two days later. The agent did exactly what you said. The goal was too narrow.

Unprovable completion happens when the agent cannot self-check. "Refactor this service to be more readable" gives the agent nothing to verify. The agent will make changes, decide the changes look reasonable, mark the goal complete, and stop. Whether the code reads better is a human judgment. The agent will produce something and stop confidently regardless.

Architectural changes fail almost every time. If the task requires deciding where a module boundary should sit, or which service owns a responsibility, the agent hits the ambiguity and either picks one arbitrarily or loops until budget. That is not a capability gap. The task is genuinely underdetermined. No amount of looping closes that.

The 85% number, whatever its exact measurement method, almost certainly applies to a curated set of maintenance tasks with clear success criteria. If you point /goal at open-ended design work, you are not in the 85%. You are in a different distribution entirely.

Install or update the Codex CLI:

```bash
npm install -g @openai/codex
codex --version  # 0.133.0 or later for persistent goals
```

Check that goals are active (on by default since 0.133.0, but worth confirming):

```bash
codex doctor  # look for: goals: enabled, storage: ok
```

Set a goal from the CLI:

```bash
codex goal set "All tests in src/payments pass with no TypeScript errors"
```

Start a session in the repo and let it run:

```bash
cd /your/repo
codex  # Codex picks up the active goal and begins the loop
```

Watch it loop:

```bash
codex goal status  # shows: active goal, turns completed, tokens used, last evaluation result
```

The agent runs `npm test` or your configured test command at the end of each turn, checks the output, and decides whether to continue. If it cannot find a test command, it looks for `package.json` scripts named `test`, `typecheck`, or `lint`, in that order.
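
The fallback order above is easy to picture as a lookup over the `scripts` field. The following is a rough TypeScript sketch of that order only; `inferVerifyCommand` and `PackageJson` are hypothetical names, and how Codex performs the lookup internally is not something the article specifies.

```typescript
// Order taken from the text: test, then typecheck, then lint.
const FALLBACK_ORDER = ["test", "typecheck", "lint"] as const;

// Minimal shape of the part of package.json this sketch reads.
interface PackageJson {
  scripts?: Record<string, string>;
}

// Returns an npm command for the first matching script, or null if none exists.
// Hypothetical helper for illustration; not part of any Codex API.
function inferVerifyCommand(pkg: PackageJson): string | null {
  const scripts = pkg.scripts ?? {};
  for (const name of FALLBACK_ORDER) {
    const command = scripts[name];
    // Skip missing or blank entries so an empty script is not chosen.
    if (command !== undefined && command.trim() !== "") {
      return `npm run ${name}`;
    }
  }
  return null;
}

// Usage: no "test" script is defined, so "typecheck" wins over "lint".
const examplePkg: PackageJson = JSON.parse(
  '{"scripts": {"lint": "eslint src", "typecheck": "tsc --noEmit"}}',
);
console.log(inferVerifyCommand(examplePkg));
```

If a repository's real done-check is not the first script in that order, inference will pick the wrong command, which is a reason to name the check explicitly rather than rely on the fallback.
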
For a task with a tighter scope, you can inline the success command:

```bash
codex goal set "Fix TypeScript errors in src/api/routes.ts" \
  --verify "npx tsc --noEmit --project tsconfig.json"
```

The `--verify` flag tells Codex which command to use as the done-check instead of inferring it. Pass anything that exits 0 on success.

Cancel a goal that has stalled:

```bash
codex goal cancel
```

List past goals and their outcomes:

```bash
codex goal list --limit 10
```

The loop does not replace CI. Treat it as a way to get closer to green before CI runs. The agent's output goes through type check, lint, and tests before merging, same as any other code.

A GitHub Actions job that verifies Codex-generated changes:

```yaml
name: verify-codex-output

on:
  pull_request:
    branches: [main]

jobs:
  type-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - name: Install
        run: npm ci
      - name: Type check
        run: npx tsc --noEmit
      - name: Lint
        run: npx eslint src --max-warnings 0
      - name: Test
        run: npm test -- --coverage --passWithNoTests

  detect-scope-creep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main is available for the diff
      - name: Count changed files
        run: |
          CHANGED=$(git diff --name-only origin/main...HEAD | wc -l)
          echo "Changed files: $CHANGED"
          if [ "$CHANGED" -gt 20 ]; then
            echo "::warning::PR changes $CHANGED files. Review for unintended scope creep."
          fi
```

The scope-creep check is the one I added specifically for agent-authored PRs. If Codex touches more than 20 files on what should be a five-file task, someone needs to read what happened. The warning does not block the PR; it flags it for a slower review.

The important CI rule: never relax your existing quality gates for agent-generated code. If anything, add the file-count check. An agent that cannot measure its own scope will not stop itself from editing 40 files to fix a one-line bug.

Pre-commit hooks are the other layer.
Add a quick type check before the commit even reaches CI:

```yaml
# .pre-commit-config.yaml (if using pre-commit)
repos:
  - repo: local
    hooks:
      - id: tsc
        name: TypeScript check
        entry: npx tsc --noEmit
        language: system
        pass_filenames: false
```

Or wire it directly in `package.json` using husky:

```json
{
  "scripts": {
    "prepare": "husky install"
  }
}
```

```bash
# .husky/pre-commit
npm run typecheck
```

Now every commit the agent makes, whether from a /goal loop or a single turn, goes through the type check locally before it can push.

The /goal loop works on tasks where "done" has a binary answer the agent can check itself. Write that verify command before you set the goal. If you cannot write that command, the task needs more scoping before you hand it to the agent.

The 85% figure covers curated maintenance tasks. You cannot carry that rate over to any task you hand the tool. Architectural decisions, ambiguous refactors, and cross-cutting changes will not approach that number regardless of turn count.

The persistence layer that shipped in 0.133.0 is the real unlock. A goal that survives a CLI restart means you can set a task running, close the terminal, and come back to a result rather than a dead session. That changes the workflow from "supervised agent" to something closer to a slow async job. Wire it into CI, cap the budget, and treat the output like any other unreviewed PR.

What is the first maintenance task in your backlog that has a clear test-based done condition? That is the one to try /goal on first.

GDS K S · [thegdsks.com](https://thegdsks.com) · follow on X [@thegdsks](https://x.com/thegdsks)

Set the verify command before the goal. If you cannot write it, the scope is not ready.