[ARTICLE · art-26705] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

OpenAI Codex now finishes 85% of scoped tasks. Here is the /goal workflow that gets you there.

OpenAI's Codex now completes 85-90% of well-scoped maintenance tasks using its /goal workflow, which loops until a goal is verified complete or a token budget is exhausted. The feature, available in Codex CLI 0.128.0 and generally available in version 0.133.0, relies on binary success checks, tight scope, and observable completion conditions. Tasks like fixing failing tests or adding typed interfaces succeed reliably, while ambiguous goals like redesigning data models fail.

8 min read · published Jun 14, 2026

OpenAI has been circulating an 85 to 90 percent success rate for Codex on well-scoped maintenance work. That number comes from internal testing, not an independent benchmark. But the mechanics behind it are real, and they explain both why it works and when it falls apart.

The feature is /goal. It shipped in Codex CLI 0.128.0 and became generally available across the CLI, IDE extension, and Codex app in version 0.133.0 on May 21, 2026. The short version: you set a goal, Codex loops until it believes the goal is complete, and the only hard stops are an evaluation that says "done" or a token budget that runs dry.

Understanding why that loop succeeds or fails on any given task is the whole game.

| Scenario | Outcome | Why |
| --- | --- | --- |
| Fix a failing test with a known error message | High pass rate | Scope is tight, completion is verifiable |
| Add a typed interface to an existing module | High pass rate | Output shape is checkable |
| Refactor a cross-cutting concern across 12 files | Fails often | Ambiguous scope, no clear done signal |
| Redesign the data model | Fails always | No binary done-check possible |
| Update a dependency and fix breakage | Medium | Depends on how far the breakage spreads |

A standard Codex turn is stateless. You ask something, it runs, the session ends. /goal breaks that pattern.

When you set a goal, Codex injects two prompts at the end of every turn automatically: goals/continuation.md and goals/budget_limit.md. The first tells the model to check whether the goal is complete and decide whether to continue. The second tracks token consumption and stops the loop before it exceeds your budget. The loop runs forward until one of those two conditions triggers.
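The control flow described here can be sketched in a few lines. This is a minimal simulation in Python, not Codex's implementation: run_goal, do_turn, is_complete, and GoalResult are hypothetical names. The completion check stands in for the injected continuation prompt, and the token counter stands in for the budget prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoalResult:
    status: str       # "complete" or "budget_exhausted"
    turns: int        # number of agent turns that ran
    tokens_used: int  # total tokens reported by those turns


def run_goal(
    do_turn: Callable[[], int],       # hypothetical: runs one agent turn, returns tokens spent
    is_complete: Callable[[], bool],  # hypothetical: the done-check, e.g. "do the tests pass?"
    token_budget: int,
) -> GoalResult:
    """Loop until the done-check passes or the token budget is spent."""
    turns = 0
    tokens_used = 0
    while True:
        tokens_used += do_turn()
        turns += 1
        # End-of-turn check 1: is the goal verifiably complete?
        if is_complete():
            return GoalResult("complete", turns, tokens_used)
        # End-of-turn check 2: is there budget left for another turn?
        # Simplification: the last turn may overshoot the budget slightly.
        if tokens_used >= token_budget:
            return GoalResult("budget_exhausted", turns, tokens_used)
```

With a done-check that never passes, the only exit is the budget branch, which is why a goal without a verifiable finish line burns its whole allowance.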

Before version 0.133.0, goals were session-scoped. When the CLI process died, the goal died. The 0.133.0 release backed goals with dedicated storage so they track progress across active turns, including across CLI restarts. That is the "persisted" part. The goal state survives a reboot.

Version 0.132.0 (May 19, 2026) added one important fix: goal continuations now stop at usage limits instead of spinning indefinitely. Before that fix, a goal with no clear completion signal would run until the process died or the account hit a rate limit.

The loop pattern OpenAI uses here is not novel. Practitioners call this the "Ralph loop": an agent that checks its own output and decides whether to keep going. Codex adds budget accounting and a persistence layer on top. The prompt injection runs automatically; you never write the continuation prompts yourself.

Three properties push a task into the high success range.

The goal must have a binary success check. "Fix the failing tests in src/auth" works. "Improve the auth module" does not. The agent needs to run a verification step and get a yes or no result. Passing CI is yes or no. "Better code" is not.

The scope must stay tight. A goal that touches one module or one interface definition gives the agent a small search space. If the fix requires changes in five unrelated parts of the codebase, the agent will solve three of them and stall on the fourth with no way to know it stalled.

The success condition must be observable from within the session. Write a shell command that returns 0 on success and non-zero on failure, and the agent can self-check. Tests are the obvious example. Type checks work too. Lint rules work. "The PR passes review" does not, because the agent cannot run that check.
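One way to get a single yes-or-no answer is to wrap several checks in one script whose exit code is the verdict. The sketch below is an illustration in Python under stated assumptions: run_checks and CHECKS are hypothetical names, and the three commands assume a TypeScript repo with ESLint and an npm test script.

```python
import subprocess
import sys
from typing import Sequence


def run_checks(commands: Sequence[Sequence[str]]) -> int:
    """Run each command in order; return 0 only if every one exits 0."""
    for cmd in commands:
        try:
            returncode = subprocess.run(cmd).returncode
        except OSError as exc:
            # A missing executable counts as a failed check, not a crash.
            print(f"FAILED (could not start): {' '.join(cmd)}: {exc}", file=sys.stderr)
            return 1
        if returncode != 0:
            print(f"FAILED (exit {returncode}): {' '.join(cmd)}", file=sys.stderr)
            return 1
    return 0


# Hypothetical check list for a TypeScript repo; swap in your own commands.
CHECKS = [
    ["npx", "tsc", "--noEmit"],
    ["npx", "eslint", "src", "--max-warnings", "0"],
    ["npm", "test"],
]

# In a real script, finish with: sys.exit(run_checks(CHECKS))
```

Stopping at the first failure keeps the output short, so the agent sees one concrete error to fix instead of three interleaved logs.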

The tasks I have seen work well share one trait: every one of them has a finish line the agent can reach and measure.

The failure modes split into two categories: scope creep and unprovable completion.

Scope creep happens when the agent fixes one thing and reveals another. You ask it to fix a failing integration test. It fixes the test by updating the mock. The mock now diverges from the real API. The agent has no instruction to check that, so it declares done. The suite passes locally and the change fails in staging two days later. The agent did exactly what you said. The goal was too narrow.
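One possible guard against the mock-drift case is to make the done-check compare the mock's structure with a recorded real response. The following is a rough Python sketch, not something the Codex workflow provides: shape_of and mock_matches_recording are hypothetical helpers, and it assumes you keep a recorded JSON response from the real API alongside the tests.

```python
from typing import Any, Mapping


def shape_of(value: Any) -> Any:
    """Reduce a JSON-like value to its structure: key names and type names only."""
    if isinstance(value, Mapping):
        return {key: shape_of(item) for key, item in value.items()}
    if isinstance(value, list):
        # Assumes homogeneous lists; the first element stands in for the rest.
        return [shape_of(value[0])] if value else []
    return type(value).__name__


def mock_matches_recording(mock: Any, recorded: Any) -> bool:
    """True when the mock has the same keys and value types as the recording."""
    return shape_of(mock) == shape_of(recorded)
```

Adding a check like this to the verify command widens the goal just enough that "update the mock until the test passes" no longer counts as done.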

Unprovable completion happens when the agent cannot self-check. "Refactor this service to be more readable" gives the agent nothing to verify. The agent will make changes, decide the changes look reasonable, mark the goal complete, and stop. Whether the code reads better is a human judgment. The agent will produce something and stop confidently regardless.

Architectural changes fail almost every time. If the task requires deciding where a module boundary should sit, or which service owns a responsibility, the agent hits the ambiguity and either picks one arbitrarily or loops until budget. That is not a capability gap. The task is genuinely underdetermined. No amount of looping closes that.

The 85% number, whatever its exact measurement method, almost certainly applies to a curated set of maintenance tasks with clear success criteria. If you point /goal at open-ended design work, you are not in the 85%. You are in a different distribution entirely.

Install or update the Codex CLI:

npm install -g @openai/codex
codex --version

Check that goals are active (on by default since 0.133.0, but worth confirming):

codex doctor

Set a goal from the CLI:

codex goal set "All tests in src/payments pass with no TypeScript errors"

Start a session in the repo and let it run:

cd /your/repo
codex

Watch it loop:

codex goal status

The agent runs npm test or your configured test command at the end of each turn, checks the output, and decides whether to continue. If it cannot find a test command, it looks for package.json scripts named test, typecheck, or lint in that order.
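To predict which script would end up as the done-check in a given repo, the fallback order above can be mirrored in a few lines. This Python sketch only restates the lookup as the article describes it; pick_check_script and FALLBACK_ORDER are hypothetical names, and nothing here reflects Codex's actual code.

```python
import json
from typing import Optional

# Order taken from the article text: test, then typecheck, then lint.
FALLBACK_ORDER = ("test", "typecheck", "lint")


def pick_check_script(package_json_text: str) -> Optional[str]:
    """Return the first script name from FALLBACK_ORDER defined in package.json, or None."""
    scripts = json.loads(package_json_text).get("scripts", {})
    for name in FALLBACK_ORDER:
        if name in scripts:
            return name
    return None
```

If this returns None for your package.json, the agent has nothing to infer from, which is a good reason to pass the check explicitly.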

For a task with a tighter scope, you can inline the success command:

codex goal set "Fix TypeScript errors in src/api/routes.ts" \
  --verify "npx tsc --noEmit --project tsconfig.json"

The --verify flag tells Codex which command to use as the done-check instead of inferring it. Pass anything that exits 0 on success.

Cancel a goal that has stalled:

codex goal cancel

List past goals and their outcomes:

codex goal list --limit 10

The loop does not replace CI. Treat it as a way to get closer to green before CI runs. The agent's output goes through type check, lint, and tests before merging, same as any other code.

A GitHub Actions job that verifies Codex-generated changes:

name: verify-codex-output

on:
  pull_request:
    branches: [main]

jobs:
  type-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - name: Install
        run: npm ci

      - name: Type check
        run: npx tsc --noEmit

      - name: Lint
        run: npx eslint src --max-warnings 0

      - name: Test
        run: npm test -- --coverage --passWithNoTests

  detect-scope-creep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Count changed files
        run: |
          CHANGED=$(git diff --name-only origin/main...HEAD | wc -l)
          echo "Changed files: $CHANGED"
          if [ "$CHANGED" -gt 20 ]; then
            echo "::warning::PR changes $CHANGED files. Review for unintended scope creep."
          fi

The scope-creep check is the one I added specifically for agent-authored PRs. If Codex touches more than 20 files on what should be a five-file task, someone needs to read what happened. The warning does not block the PR; it flags it for a slower review.

The important CI rule: never relax your existing quality gates for agent-generated code. If anything, add the file-count check. An agent that cannot measure its own scope will not stop itself from editing 40 files to fix a one-line bug.
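The same file-count idea can run locally before a PR is even opened. Here is a small Python sketch, assuming you feed it the lines printed by git diff --name-only; summarize_changes is a hypothetical helper, and the default limit of 20 simply mirrors the workflow above.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_changes(paths: Iterable[str], limit: int = 20) -> Tuple[bool, Counter]:
    """Return (over_limit, files per top-level directory) for a list of changed paths."""
    cleaned = [p.strip() for p in paths if p.strip()]
    # Group by the first path segment; files at the repo root are counted under ".".
    by_dir = Counter(p.split("/", 1)[0] if "/" in p else "." for p in cleaned)
    return len(cleaned) > limit, by_dir
```

The per-directory breakdown is the useful part for review: a one-module task that shows changes under several top-level directories is worth reading slowly even when the total stays under the limit.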

Pre-commit hooks are the other layer. Add a quick type check before the commit even reaches CI:

repos:
  - repo: local
    hooks:
      - id: tsc
        name: TypeScript check
        entry: npx tsc --noEmit
        language: system
        pass_filenames: false

Or wire it directly in package.json using husky:

{
  "scripts": {
    "prepare": "husky install",
    "typecheck": "tsc --noEmit"
  }
}

Then put the check in the hook file, .husky/pre-commit:

npm run typecheck

Now every commit the agent makes, whether from a /goal loop or a single turn, goes through the type check locally before it can be pushed.

The /goal loop works on tasks where "done" has a binary answer the agent can check itself. Write that verify command before you set the goal. If you cannot write that command, the task needs more scoping before you hand it to the agent.

The 85% figure covers curated maintenance tasks. You cannot carry that rate over to any task you hand the tool. Architectural decisions, ambiguous refactors, and cross-cutting changes will not approach that number regardless of turn count.

The persistence layer that shipped in 0.133.0 is the real unlock. A goal that survives a CLI restart means you can set a task running, close the terminal, and come back to a result rather than a dead session. That changes the workflow from "supervised agent" to something closer to a slow async job. Wire it into CI, cap the budget, and treat the output like any other unreviewed PR.

What is the first maintenance task in your backlog that has a clear test-based done condition? That is the one to try /goal on first.

GDS K S · thegdsks.com · follow on X @thegdsks

Set the verify command before the goal. If you cannot write it, the scope is not ready.
