{"slug": "openai-codex-now-finishes-85-of-scoped-tasks-here-is-the-goal-workflow-that-gets", "title": "OpenAI Codex now finishes 85% of scoped tasks. Here is the /goal workflow that gets you there.", "summary": "OpenAI's Codex now completes 85-90% of well-scoped maintenance tasks using its /goal workflow, which loops until a goal is verified complete or a token budget is exhausted. The feature, available in Codex CLI 0.128.0 and generally available in version 0.133.0, relies on binary success checks, tight scope, and observable completion conditions. Tasks like fixing failing tests or adding typed interfaces succeed reliably, while ambiguous goals like redesigning data models fail.", "body_md": "OpenAI has been circulating an 85 to 90 percent success rate for Codex on well-scoped maintenance work. That number comes from internal testing, not an independent benchmark. But the mechanics behind it are real, and they explain both why it works and when it falls apart.\n\nThe feature is `/goal`. It shipped in Codex CLI `0.128.0` and became generally available across the CLI, IDE extension, and Codex app in version `0.133.0` on May 21, 2026. The short version: you set a goal, Codex loops until it believes the goal is complete, and the only hard stops are an evaluation that says \"done\" or a token budget that runs dry.\n\nUnderstanding why that loop succeeds or fails on any given task is the whole game.\n\n| Scenario | Outcome | Why |\n|---|---|---|\n| Fix a failing test with a known error message | High pass rate | Scope is tight, completion is verifiable |\n| Add a typed interface to an existing module | High pass rate | Output shape is checkable |\n| Refactor a cross-cutting concern across 12 files | Fails often | Ambiguous scope, no clear done signal |\n| Redesign the data model | Fails always | No binary done-check possible |\n| Update a dependency and fix breakage | Medium | Depends on how far the breakage spreads |\n\nA standard Codex turn is stateless. 
You ask something, it runs, the session ends. `/goal` breaks that pattern.\n\nWhen you set a goal, Codex injects two prompts at the end of every turn automatically: `goals/continuation.md` and `goals/budget_limit.md`. The first tells the model to check whether the goal is complete and decide whether to continue. The second tracks token consumption and stops the loop before it exceeds your budget. The loop runs forward until one of those two conditions triggers.\n\nBefore version `0.133.0`, goals were session-scoped. When the CLI process died, the goal died. The `0.133.0` release backed goals with dedicated storage so they track progress across active turns, including across CLI restarts. That is the \"persisted\" part. The goal state survives a reboot.\n\nVersion `0.132.0` (May 19, 2026) added one important fix: goal continuations now stop at usage limits instead of spinning indefinitely. Before that fix, a goal with no clear completion signal would run until the process died or the account hit a rate limit.\n\nThe loop pattern OpenAI uses here is not novel. Practitioners call this the \"Ralph loop\": an agent that checks its own output and decides whether to keep going. Codex adds budget accounting and a persistence layer on top. The prompt injection runs automatically; you never write the continuation prompts yourself.\n\nThree properties push a task into the high success range.\n\nThe goal must have a binary success check. \"Fix the failing tests in `src/auth`\" works. \"Improve the auth module\" does not. The agent needs to run a verification step and get a yes or no result. Passing CI is yes or no. \"Better code\" is not.\n\nThe scope must stay tight. A goal that touches one module or one interface definition gives the agent a small search space. 
If the fix requires changes in five unrelated parts of the codebase, the agent will solve three of them and stall on the fourth with no way to know it stalled.\n\nThe success condition must be observable from within the session. Write a shell command that returns 0 on success and non-zero on failure, and the agent can self-check. Tests are the obvious example. Type checks work too. Lint rules work. \"The PR passes review\" does not, because the agent cannot run that check.\n\nTasks I have seen work well:\n\n`as` cast\n\nEvery one of those has a finish line the agent can reach and measure.\n\nThe failure modes split into two categories: scope creep and unprovable completion.\n\nScope creep happens when the agent fixes one thing and reveals another. You ask it to fix a failing integration test. It fixes the test by updating the mock. The mock now diverges from the real API. The agent has no instruction to check that, so it declares the goal done. The test passes locally and fails in staging two days later. The agent did exactly what you said. The goal was too narrow.\n\nUnprovable completion happens when the agent cannot self-check. \"Refactor this service to be more readable\" gives the agent nothing to verify. The agent will make changes, decide the changes look reasonable, mark the goal complete, and stop. Whether the code reads better is a human judgment. The agent will produce something and stop confidently regardless.\n\nArchitectural changes fail almost every time. If the task requires deciding where a module boundary should sit, or which service owns a responsibility, the agent hits the ambiguity and either picks one arbitrarily or loops until budget. That is not a capability gap. The task is genuinely underdetermined. No amount of looping closes that.\n\nThe 85% number, whatever its exact measurement method, almost certainly applies to a curated set of maintenance tasks with clear success criteria. 
If you point `/goal` at open-ended design work, you are not in the 85%. You are in a different distribution entirely.\n\nInstall or update the Codex CLI:\n\n```\nnpm install -g @openai/codex\ncodex --version\n# 0.133.0 or later for persistent goals\n```\n\nCheck that goals are active (on by default since 0.133.0, but worth confirming):\n\n```\ncodex doctor\n# look for: goals: enabled, storage: ok\n```\n\nSet a goal from the CLI:\n\n```\ncodex goal set \"All tests in src/payments pass with no TypeScript errors\"\n```\n\nStart a session in the repo and let it run:\n\n```\ncd /your/repo\ncodex\n# Codex picks up the active goal and begins the loop\n```\n\nWatch it loop:\n\n```\ncodex goal status\n# shows: active goal, turns completed, tokens used, last evaluation result\n```\n\nThe agent runs `npm test` or your configured test command at the end of each turn, checks the output, and decides whether to continue. If it cannot find a test command, it looks for `package.json` scripts named `test`, `typecheck`, or `lint` in that order.\n\nFor a task with a tighter scope, you can inline the success command:\n\n```\ncodex goal set \"Fix TypeScript errors in src/api/routes.ts\" \\\n  --verify \"npx tsc --noEmit --project tsconfig.json\"\n```\n\nThe `--verify` flag tells Codex which command to use as the done-check instead of inferring it. Pass anything that exits 0 on success.\n\nCancel a goal that has stalled:\n\n```\ncodex goal cancel\n```\n\nList past goals and their outcomes:\n\n```\ncodex goal list --limit 10\n```\n\nThe loop does not replace CI. Treat it as a way to get closer to green before CI runs. 
The agent's output goes through type check, lint, and tests before merging, same as any other code.\n\nA GitHub Actions job that verifies Codex-generated changes:\n\n```\nname: verify-codex-output\n\non:\n  pull_request:\n    branches: [main]\n\njobs:\n  type-check:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n\n      - name: Setup Node\n        uses: actions/setup-node@v4\n        with:\n          node-version: 20\n          cache: npm\n\n      - name: Install\n        run: npm ci\n\n      - name: Type check\n        run: npx tsc --noEmit\n\n      - name: Lint\n        run: npx eslint src --max-warnings 0\n\n      - name: Test\n        run: npm test -- --coverage --passWithNoTests\n\n  detect-scope-creep:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          fetch-depth: 0\n\n      - name: Count changed files\n        run: |\n          CHANGED=$(git diff --name-only origin/main...HEAD | wc -l)\n          echo \"Changed files: $CHANGED\"\n          if [ \"$CHANGED\" -gt 20 ]; then\n            echo \"::warning::PR changes $CHANGED files. Review for unintended scope creep.\"\n          fi\n```\n\nThe scope-creep check is the one I added specifically for agent-authored PRs. If Codex touches more than 20 files on what should be a five-file task, someone needs to read what happened. The warning does not block the PR; it flags it for a slower review.\n\nThe important CI rule: never relax your existing quality gates for agent-generated code. If anything, add the file-count check. An agent that cannot measure its own scope will not stop itself from editing 40 files to fix a one-line bug.\n\nPre-commit hooks are the other layer. 
Add a quick type check before the commit even reaches CI:\n\n```\n# .pre-commit-config.yaml (if using pre-commit)\nrepos:\n  - repo: local\n    hooks:\n      - id: tsc\n        name: TypeScript check\n        entry: npx tsc --noEmit\n        language: system\n        pass_filenames: false\n```\n\nOr wire it directly in `package.json` using `husky`:\n\n```\n{\n  \"scripts\": {\n    \"prepare\": \"husky install\",\n    \"typecheck\": \"tsc --noEmit\"\n  }\n}\n# .husky/pre-commit\nnpm run typecheck\n```\n\nNow every commit the agent makes, whether from a `/goal` loop or a single turn, passes the type check locally before it can be pushed.\n\nThe `/goal` loop works on tasks where \"done\" has a binary answer the agent can check itself. Write that verify command before you set the goal. If you cannot write that command, the task needs more scoping before you hand it to the agent.\n\nThe 85% figure covers curated maintenance tasks. You cannot carry that rate over to any task you hand the tool. Architectural decisions, ambiguous refactors, and cross-cutting changes will not approach that number regardless of turn count.\n\nThe persistence layer that shipped in `0.133.0` is the real unlock. A goal that survives a CLI restart means you can set a task running, close the terminal, and come back to a result rather than a dead session. That changes the workflow from \"supervised agent\" to something closer to a slow async job. Wire it into CI, cap the budget, and treat the output like any other unreviewed PR.\n\nWhat is the first maintenance task in your backlog that has a clear test-based done condition? That is the one to try `/goal` on first.\n\n**GDS K S** · [thegdsks.com](https://thegdsks.com) · follow on X [@thegdsks](https://x.com/thegdsks)\n\n*Set the verify command before the goal. 
If you cannot write it, the scope is not ready.*", "url": "https://wpnews.pro/news/openai-codex-now-finishes-85-of-scoped-tasks-here-is-the-goal-workflow-that-gets", "canonical_source": "https://dev.to/thegdsks/openai-codex-now-finishes-85-of-scoped-tasks-here-is-the-goal-workflow-that-gets-you-there-1dae", "published_at": "2026-06-14 02:56:59+00:00", "updated_at": "2026-06-14 03:29:05.283084+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["OpenAI", "Codex", "Codex CLI"], "alternates": {"html": "https://wpnews.pro/news/openai-codex-now-finishes-85-of-scoped-tasks-here-is-the-goal-workflow-that-gets", "markdown": "https://wpnews.pro/news/openai-codex-now-finishes-85-of-scoped-tasks-here-is-the-goal-workflow-that-gets.md", "text": "https://wpnews.pro/news/openai-codex-now-finishes-85-of-scoped-tasks-here-is-the-goal-workflow-that-gets.txt", "jsonld": "https://wpnews.pro/news/openai-codex-now-finishes-85-of-scoped-tasks-here-is-the-goal-workflow-that-gets.jsonld"}}