AI Bug-Fix & Draft-PR Agent

AgentAz™ released a flagship reference blueprint for an AI bug-fix and draft-PR agent that reproduces, locates, fixes, tests, and submits a pull request in a sandboxed environment. The agent operates under a governance specification with limited autonomy, requiring human approval for high-risk actions and maintaining an append-only audit trail. The blueprint is open source under Apache-2.0 and aims to provide safe, minimal, and cost-controlled automated code fixes.

Overview

- Reproduce → locate → fix → test → PR: a complete loop that ends in a focused, reviewable draft pull request, not a pile of speculative edits.
- Grounded in the real repo: it reads the actual code and reproduces the bug before changing anything, so fixes target the true root cause.
- Minimal and safe by default: smallest viable diff, a regression test that fails before and passes after, and no changes to protected paths without human sign-off.
- Cost- and blast-radius-controlled: sandboxed execution, capped tool calls and files touched, and escalation when the fix is ambiguous or high-risk.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do, and why, and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Machine-readable contract: agentaz.json, validated against the open AgentAz™ JSON Schema (bundled for offline use and published at a permanent URL):

    {
      "$schema": "./agentaz.schema.json",
      "version": "2.0.0",
      "last_reviewed": "2026-06-24",
      "agent_id": "issue-to-pr-agent",
      "trust_level": "A4",
      "dna_pattern": "Execution",
      "worst_case_action": "Opens a draft PR with an incorrect fix on a sandboxed branch; never auto-merged. Human reviews and merges.",
      "authority_boundary": "Writes fixes on an isolated branch and opens draft PRs; merge-to-main and deploy tools absent.",
      "tags": ["software-engineering", "bug-fix", "sandboxed", "draft-pr", "human-approval"],
      "tool_boundary": {
        "auto_executable_tools": ["read_issue", "write_branch", "run_tests_sandbox", "open_draft_pr"],
        "approval_required_tools": ["merge_pr"],
        "execution_tools_absent": false,
        "rollback_required": true,
        "branch_isolated": true
      },
      "output_boundary": {
        "format": "structured_json",
        "never_without_approval": ["merge_pr", "deploy", "force_push"]
      },
      "cost_boundary": {
        "max_usd_per_trace_loop": 0.5,
        "alert_threshold_usd": 0.35
      },
      "loop_boundary": {
        "max_reasoning_turns": 14
      },
      "human_handoff": {
        "triggers": ["tests_failing", "risky_change", "low_confidence"],
        "destination": "maintainer"
      },
      "audit": {
        "append_only": true,
        "logs": ["diff", "test_results", "reasoning", "approvals"]
      }
    }

New to this? Read the AgentAz specification guide (/agentaz-specifications): Trust Levels, DNA patterns, and how it complements your runtime.

This is a flagship reference blueprint for AgentAz v1.0.0. AgentAz™ is open source under Apache-2.0 (https://www.apache.org/licenses/LICENSE-2.0), with spec text under CC-BY-4.0; schema and source are on GitHub (https://github.com/agent-kits/agentaz).

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship, not new functionality.

| Agent goal | Bounded by the authority spec above |
|---|---|
| Trust Level | A4 (Limited Autonomy) |
| Tool access | Scoped tools; high-risk actions gated behind approval |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on tests failing, risky change, low confidence → maintainer |
| Audit trail | Append-only log (diff, test results, reasoning, approvals) |
| Cost & loop bounds | ≤ $0.50 per loop · ≤ 14 reasoning turns |
| Recovery / escalation | Escalates to maintainer |

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity, not an official integration or certified compatibility.

| Agent | Primary reasoner; Limited Autonomy authority (A4) |
|---|---|
| Tools | read_issue, write_branch, run_tests_sandbox, open_draft_pr; approval-gated: merge_pr |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst case classified (A4); high-risk actions gated; ≤ $0.50/loop · ≤ 14 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to maintainer on tests failing, risky change, low confidence |

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each: the boundaries that make it safe to run, stated plainly.

Writes an incorrect fix that passes a weak test suite.
- Detection: The existing suite runs in a sandbox; low coverage is flagged in the PR.
- Mitigation: A draft PR only, never auto-merged; a human reviews.
- Recovery: The human rejects the PR and the branch is discarded (reversible).

The fix introduces a regression elsewhere.
- Detection: A full sandbox test run executes and the diff scope is checked.
- Mitigation: The branch is isolated and merge-to-main is absent from the registry.
- Recovery: The draft is closed and the branch is reverted.

Misreads the issue and fixes the wrong thing.
- Detection: The fix is linked to the issue's acceptance criteria; low confidence is flagged in the PR.
- Mitigation: A human approves the merge.
- Recovery: The maintainer redirects and the PR is closed.

Attempts to write to a protected branch.
- Detection: A branch-isolation check runs; protected-branch writes are absent from the tool registry.
- Mitigation: The capability is structurally not granted.
- Recovery: Prevented by construction; the attempt is logged.

Evaluation

Fix correctness (verified by tests) is the core metric: does the proposed change actually resolve the issue without regressions?

| Resolution rate | Share of draft PRs where the change resolves the issue and passes its acceptance tests. |
|---|---|
| Regression rate | Of generated fixes, the share that introduce new failures in the full suite. |
| Issue-match accuracy | Whether the fix addresses the actual issue rather than the wrong thing. |
| Human-merge rate | Share of draft PRs a maintainer merges with little or no change. |
| Latency & cost | Time and token cost per issue. |

Recommended approach. Use a benchmark of issues with known fixes and tests (SWE-bench-style) in a sandbox; measure resolution rate by test outcome and regression rate on the full suite. Never auto-merge during evaluation.

When to use

Use it when
- You have a backlog of well-described, reproducible bugs that follow common patterns and drain senior time.
- Your repo has a test suite and CI the agent can use to verify a fix actually works.
- You want a draft PR with a real fix and test to review, not just a suggestion or a comment.
- You can run the agent against a sandboxed checkout with scoped permissions.
- You want automation that opens PRs for the easy-to-medium fixes and routes the hard ones to humans with a clear plan.

Avoid it when
- The bug report has no reproduction steps and the failure cannot be reproduced; the agent should ask or escalate, not invent a fix.
- The change is architectural, security-sensitive, or spans many subsystems; those need a human author.
- You have no tests or CI, so a fix cannot be verified before it is proposed.
- You are unwilling to keep human review on the PR and a sandbox between the agent and production.

System prompt

You are an Autonomous Bug-Fix Engineer. Your job is to take ONE issue and produce a small, correct, reviewed pull request or, when that is not safe or possible, a clear plan and an escalation. You are judged on fixes that are correct, minimal, and tested, and on never breaking the build, never widening scope, and never touching things you are not allowed to.

== CORE PRINCIPLES ==
1. Reproduce before you fix. Do not change code until you have reproduced the reported behavior (a failing test or a documented repro). If you cannot reproduce it, you do not understand it; ask for details or escalate.
2. Smallest correct diff. Fix the root cause, not the symptom, with the minimum change. Do not refactor, reformat, rename, or "improve" unrelated code. A 6-line fix beats a 200-line rewrite.
3. Evidence over guessing. Ground every claim in code you have actually read (cite path:line). If the root cause is unclear, say so and stop; never ship a speculative fix.

== HARD RULES (NON-NEGOTIABLE) ==
- PROTECTED PATHS: You must NOT modify authentication, authorization, cryptography, payments/billing, database migrations, access control, or infra/deploy config. If the fix requires touching these, STOP, write the plan, and escalate to a human.
- TESTS REQUIRED: Every fix must include a regression test that fails on the original code and passes on the fixed code. No test, no PR.
- NO DESTRUCTIVE GIT: Never force-push, never rewrite history, never delete branches, never commit to main directly. Work on a fresh branch and open a DRAFT PR.
- SANDBOX ONLY: Run code and tests only in the provided sandbox. Never run untrusted scripts outside it, never exfiltrate secrets, and if you find a secret in the repo, flag it and do not echo its value.
- SCOPE: Touch only the files needed for this one issue, within the configured file/diff budget. If the fix would exceed the budget, stop and propose splitting the work.

== WORKFLOW POLICY ==
- Step 1 (Reproduce): write or run a test that demonstrates the bug. If it cannot be reproduced after a reasonable attempt, set decision=NEEDS_INFO and list exactly what you need.
- Step 2 (Locate): trace the root cause through the code; cite the responsible lines. State your hypothesis explicitly.
- Step 3 (Fix): apply the minimal change. Re-run the failing test (now passing) and the surrounding suite to check for regressions.
- Step 4 (Verify): run static analysis/type checks if available. If anything fails, fix forward only within scope, or escalate.
- Step 5 (Propose): open a draft PR with the diff, the failing→passing test, a plain-language explanation, and any risks.

== DECISION (calibrated confidence 0.0-1.0) ==
- OPEN_PR: confidence >= 0.8, reproduced, fixed, tested, no protected paths, within budget.
- NEEDS_INFO: cannot reproduce or the report is ambiguous. Ask specific questions; make no code change.
- ESCALATE: touches protected paths, exceeds budget/scope, security-sensitive, or confidence < 0.8 after investigation. Provide a plan a human can act on.

== COST CONTROL ==
Read only the files you need (use search before reading whole trees). Do not re-read files already in context. Cap tool calls per issue; if you would exceed the cap, escalate with what you have. Keep the PR description concise.

== OUTPUT FORMAT (return ONE JSON object) ==

    {
      "decision": "OPEN_PR|NEEDS_INFO|ESCALATE",
      "confidence": <0.0-1.0>,
      "root_cause": "<grounded explanation with path:line, or empty>",
      "reproduction": "<how you reproduced it / the failing test, or what's missing>",
      "patch": "<unified diff of the minimal fix, or empty>",
      "test": "<the regression test added, or empty>",
      "files_touched": ["..."],
      "risks": "<what a reviewer should double-check>",
      "pr": {
        "title": "<concise>",
        "body": "<explanation + test note + risks>",
        "draft": true
      },
      "escalation": {
        "needed": <bool>,
        "reason": "<protected path / scope / uncertainty, or empty>",
        "plan": "<next steps for a human, or empty>"
      }
    }

If decision is NEEDS_INFO or ESCALATE, leave patch/test empty and do not modify code.

Simulate run

Try the agent with a sample task. This is a frontend-only preview that shows how the kit would plan and execute: no API calls, nothing leaves your browser.

Setup guide

Install and create a sandbox runner
Install the agent and prepare an isolated workspace it can clone into.

    pipx install bugfix-agent
    mkdir -p ~/.bugfix-sandbox
    bugfix-agent doctor
    # checks git, test runner, sandbox isolation

Configure models, repo access, and limits
Use a least-privilege token. Caps live in config so the agent cannot widen its own scope.

    cp .env.example .env

    # .env
    ANTHROPIC_API_KEY=sk-ant-...
    # scoped: contents:write, pull_requests:write on ONE repo
    GITHUB_TOKEN=ghp_...
    TEST_CMD="pytest -q"
    MAX_FILES=8
    MAX_DIFF_LINES=200
    REQUIRE_TEST=true

Declare protected paths
Anything matching these is off-limits to the agent and forces escalation.

    # .bugfix.yml
    protected_paths:
      - "**/auth/**"
      - "**/payments/**"
      - "migrations/**"
      - "infra/**"
    open_as: draft
    base_branch: main

Dry-run on a real issue
Point it at an issue and inspect the plan, patch, and test before letting it open a PR.

    bugfix-agent run --issue 1423 --dry-run --explain
    # prints decision, root cause, patch, test, risks

Enable PR creation in CI
Trigger on labeled issues so a human stays in the loop by choosing what to hand the agent.

    # .github/workflows/bugfix.yml
    name: Autonomous Bug Fix
    on:
      issues:
        types: [labeled]
    jobs:
      fix:
        if: github.event.label.name == 'agent-fix'
        runs-on: ubuntu-latest
        permissions: { contents: write, pull-requests: write }
        steps:
          - uses: actions/checkout@v4
            with: { fetch-depth: 0 }
          - run: pipx install bugfix-agent
          - run: bugfix-agent run --issue ${{ github.event.issue.number }} --open-pr
            env:
              ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
              GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Workflow

1. Intake the issue. Fetch the report, logs, and acceptance criteria. Confirm it is a single, scoped bug; if it is actually several, propose splitting.
2. Reproduce in a sandbox. Clone into the sandbox and reproduce via an existing or newly written failing test. If it cannot be reproduced, switch to NEEDS_INFO and ask precise questions.
3. Find the root cause. Search and read only the relevant code, trace the failure to its source, and state the hypothesis with cited lines before changing anything.
4. Apply the minimal fix. Make the smallest change that fixes the root cause. Reject anything that strays into protected paths or beyond the diff budget.
5. Test and verify. Run the regression test (failing → passing) plus the surrounding suite and static analysis. If something breaks, fix forward in scope or escalate.
6. Open a draft PR. Publish a draft PR with the diff, the test, a plain-language explanation, and explicit risks for the reviewer.
7. Learn from review. Record whether the PR was merged, edited, or rejected, and use it to tune which bug classes the agent should attempt versus escalate.

Examples

Reproducible null-handling bug → tested fix + draft PR

A clear issue: an endpoint 500s when an optional field is missing.
There's a stack trace and steps to reproduce.

Input

    Issue #1423: GET /profile crashes when user has no avatar.
    Traceback: AttributeError: 'NoneType' object has no attribute 'url' at api/profile.py:51
    Repro: create a user without an avatar, call GET /profile -> 500.

Output

    {
      "decision": "OPEN_PR",
      "confidence": 0.9,
      "root_cause": "api/profile.py:51 accesses user.avatar.url without checking that avatar exists; users without an avatar have avatar=None, raising AttributeError.",
      "reproduction": "Added test_profile_without_avatar, which creates an avatarless user and asserts GET /profile returns 200; it fails on the original code with the reported AttributeError.",
      "patch": "--- a/api/profile.py\n+++ b/api/profile.py\n@@ -49,7 +49,7 @@\n- avatar_url = user.avatar.url\n+ avatar_url = user.avatar.url if user.avatar else None",
      "test": "def test_profile_without_avatar(client):\n    u = make_user(avatar=None)\n    r = client.get('/profile', user=u)\n    assert r.status_code == 200\n    assert r.json()['avatar_url'] is None",
      "files_touched": ["api/profile.py", "tests/test_profile.py"],
      "risks": "Confirm clients tolerate a null avatar_url (they should; it was previously crashing).",
      "pr": {
        "title": "Fix 500 on /profile when user has no avatar",
        "body": "Guards avatar access against None and adds a regression test (fails before, passes after). Minimal change, no behavior change for users with an avatar.",
        "draft": true
      },
      "escalation": { "needed": false, "reason": "", "plan": "" }
    }

Note: The textbook case: reproduce with a failing test, fix the exact line, keep the diff to one expression, and open a draft PR. The test is the proof: it fails on the old code and passes on the new, which is what makes the PR trustworthy.

Unreproducible report → NEEDS_INFO instead of a guess

A vague report with no steps. A naive agent would invent a plausible-looking fix; this one refuses.

Input

    Issue #1502: 'The dashboard is sometimes slow and sometimes wrong.'
    No steps, no timestamps, no account, no logs.

Output

    {
      "decision": "NEEDS_INFO",
      "confidence": 0.2,
      "root_cause": "",
      "reproduction": "Could not reproduce: no steps, no affected account/tenant, no timeframe, and 'slow' vs 'wrong' are two different problems. Tried the default dashboard with seed data and saw correct, fast results.",
      "patch": "",
      "test": "",
      "files_touched": [],
      "risks": "Guessing here would likely produce a fix for the wrong problem and add risk without value.",
      "pr": { "title": "", "body": "", "draft": true },
      "escalation": {
        "needed": false,
        "reason": "",
        "plan": "Need: (1) exact steps or a screen recording, (2) an affected account/tenant id, (3) approximate timestamps, (4) whether 'wrong' means missing data or incorrect numbers. With those I can reproduce and fix."
      }
    }

Note: This is the discipline that keeps an autonomous fixer safe: with confidence 0.2 and no reproduction, it makes zero code changes and asks four specific questions. Shipping a speculative fix for an unreproducible bug is how agents erode trust.

Fix would touch a migration → escalate with a plan

A real, reproducible bug whose correct fix requires a schema change. The protected-path gate forces a human.

Input

    Issue #1571: duplicate orders can be created on rapid double-submit.
    Repro confirmed. Root cause: no unique constraint on (user_id, idempotency_key); fix needs a DB migration to add it.

Output

    {
      "decision": "ESCALATE",
      "confidence": 0.86,
      "root_cause": "services/orders.py inserts without enforcing idempotency; the real fix is a unique constraint on orders(user_id, idempotency_key), which requires a migration.",
      "reproduction": "Added a concurrency test that fires two identical submits and asserts one order; it fails today (two orders created).",
      "patch": "",
      "test": "tests/test_orders_idempotency.py (failing repro included in the PR plan)",
      "files_touched": [],
      "risks": "Adding a unique constraint can fail if duplicate rows already exist; needs a data backfill/cleanup step first.",
      "pr": { "title": "", "body": "", "draft": true },
      "escalation": {
        "needed": true,
        "reason": "Fix requires a database migration (protected path).",
        "plan": "(1) De-dupe existing orders. (2) Add migration for UNIQUE(user_id, idempotency_key). (3) Make the insert idempotent in services/orders.py. (4) Ship the included concurrency test. I can prepare the app-code change and test now; a human should own the migration and backfill."
      }
    }

Note: Even with high confidence in the diagnosis and a ready test, the agent stops because the correct fix touches migrations. It hands over a concrete, ordered plan and an existing failing test, which is maximally useful while staying inside its authority.

Implementation notes

- Make 'a regression test that fails before and passes after' a hard merge requirement; it is the single best guard against speculative or wrong fixes.
- Run everything in a sandbox with a least-privilege, single-repo token. The agent should never hold broad credentials or touch production.
- Enforce protected paths and the diff budget in a deterministic gate, not just the prompt; any violation downgrades to ESCALATE with a plan.
- Start with --dry-run and label-gated triggers so a human chooses which issues the agent attempts; expand the allowed bug classes as merge rates prove out.
- Keep diffs tiny on purpose. Small, single-cause PRs are easy to review and safe to merge; resist the urge to bundle cleanups.
- Log merged/edited/rejected outcomes per bug class. Rejection patterns tell you exactly where to tighten the prompt or stop auto-attempting.
- A cheaper model is usually enough to triage and search, so keep the strong model for root-cause reasoning and patch generation.

Variations

- Basic (Plan & patch suggester): Reproduces and diagnoses the bug and posts a proposed patch + test as an issue comment for a human to apply. No branch, no PR; the safest starting point.
- Advanced (Draft-PR author): Opens a draft PR with the minimal fix and a passing regression test, gated by protected paths, a diff budget, and a required green test run before the PR is created.
- Enterprise (Governed fleet fixer): Runs across many repos with per-repo policies and CODEOWNERS routing, sandboxed execution, secret-scanning, full audit trails, and a feedback loop that tunes which bug classes auto-open PRs.

Download the Agent Blueprint

- Download Blueprint (.zip): /downloads/issue-to-pr-planner.zip
- View the source on GitHub: https://github.com/agent-kits/agentaz/tree/main/kits/issue-to-pr-planner

This flagship blueprint and the AgentAz™ specification live in the central AgentKits registry, open source under Apache-2.0 (code & schema) and CC-BY-4.0 (text).

Frequently asked questions

- No. It opens a DRAFT pull request with the fix and a passing regression test; a human reviews and merges. It never commits to main or force-pushes.
- It must reproduce the bug with a failing test before changing anything, and the fix must make that test pass without breaking the suite. If it can't reproduce or isn't confident, it asks for info or escalates instead of guessing.
- A deterministic diff budget and protected-path gate. It can only touch a small number of files, can't modify auth/payments/migrations/infra, and escalates with a plan when a fix would exceed those bounds.
- It runs in an isolated sandbox with a least-privilege, single-repo token, never executes untrusted code outside the sandbox, and flags (without echoing) any secret it encounters.
- Well-described, reproducible bugs that follow common patterns: null/None handling, off-by-one, validation gaps, simple logic errors. Architectural or security-sensitive changes are routed to humans.
- It searches before reading, caps tool calls and files per issue, and uses a cheaper model for triage with the strong model reserved for root-cause analysis and the patch.