Hire a Code Watchdog: Building a Claude Code Auto-Review Bot as a Quality Gate for Solo Projects (with Real Ops Logs)

wpnews.pro

When you ship alone, nobody catches your 2 AM mistakes. This article gives you a working Claude-powered review bot that runs on every pull request via GitHub Actions, posts inline findings, and — crucially — refuses to bankrupt you. By the end you can copy two files into any repo and have a guard dog that barks at real bugs and stays quiet otherwise.

I'll also walk through the failures I hit wiring this up, because the interesting part of a review bot isn't the happy path. It's the night the dog wouldn't stop barking.

Conclusion first: a linter checks syntax and style, but a Claude-based reviewer checks intent — and for a solo dev, intent drift is what actually breaks production.

ESLint won't tell you that your new retry wrapper swallows the original exception. ruff

won't notice that you compare a Decimal

to a float

in a billing path. Those are semantic bugs, and they're exactly what a second pair of eyes catches in a team. When you don't have a team, you rent the eyes.

The design constraint that shapes everything below: a review bot has no human deciding when to stop. A human reviewer gets tired and goes home. An automated one will happily call the API in a loop until your card declines. So the architecture is two parts — a reviewer and a leash.

The single biggest mistake people make is asking Claude to "review this diff" and then regex-parsing prose out of the reply. Don't. Use tool use to force a JSON schema, so the model cannot return free text where you expect structured findings.

Here is the core reviewer. It reads a unified diff on stdin, sends it to Claude with a forced tool call, and prints findings as JSON. It defaults to Sonnet 4.6 (claude-sonnet-4-6

) because review volume is high and Sonnet is the cost-sane workhorse; you escalate to Opus only when you mean it.

import os, sys, json
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL = os.environ.get("REVIEW_MODEL", "claude-sonnet-4-6")
MAX_DIFF_CHARS = 60_000  # hard cap; see the truncation failure below

REVIEW_TOOL = {
    "name": "submit_review",
    "description": "Submit code review findings for the diff.",
    "input_schema": {
        "type": "object",
        "properties": {
            "findings": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "file": {"type": "string"},
                        "line": {"type": "integer",
                                  "description": "line number in the NEW file, from the diff hunk header"},
                        "severity": {"type": "string", "enum": ["blocker", "warning", "nit"]},
                        "title": {"type": "string"},
                        "detail": {"type": "string"},
                    },
                    "required": ["file", "line", "severity", "title", "detail"],
                },
            }
        },
        "required": ["findings"],
    },
}

SYSTEM = (
    "You are a senior reviewer for a SOLO developer with no second reviewer. "
    "Report only semantic bugs, security issues, and correctness traps. "
    "Do NOT report style or formatting — a linter handles that. "
    "If a line number is not derivable from a hunk header, omit the finding rather than guessing. "
    "An empty findings list is a valid, good outcome."
)

def main():
    diff = sys.stdin.read()
    if not diff.strip():
        print(json.dumps({"findings": []})); return
    truncated = diff[:MAX_DIFF_CHARS]
    note = "" if len(diff) <= MAX_DIFF_CHARS else "\n[diff truncated — review only what is shown]"

    resp = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        system=SYSTEM,
        tools=[REVIEW_TOOL],
        tool_choice={"type": "tool", "name": "submit_review"},  # force the schema
        messages=[{"role": "user",
                   "content": f"Review this unified diff:\n\n{truncated}{note}"}],
    )
    for block in resp.content:
        if block.type == "tool_use" and block.name == "submit_review":
            print(json.dumps(block.input))
            return
    print(json.dumps({"findings": []}))

if __name__ == "__main__":
    main()

Two things earn their keep here. tool_choice

with a named tool means Claude is required to call submit_review

— you never parse prose. And the system prompt explicitly blesses an empty list as success. Without that line, the model invents nits to feel useful, and a guard dog that barks at the mailman gets ignored.

Now the leash and the delivery. This workflow triggers on PRs, computes the diff against the base branch, runs review.py

, and posts a single sticky comment. It pins max_tokens

, uses concurrency

to kill stale runs, and never re-reviews the bot's own comments.

name: Claude Review
on:
  pull_request:
    types: [opened, synchronize]

concurrency:                       # a new push cancels the previous review run
  group: claude-review-${{ github.event.pull_request.number }}
  cancel-in-progress: true

permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0           # need base branch history for the diff
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install anthropic
      - name: Run Claude review
        id: review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          BASE="origin/${{ github.event.pull_request.base.ref }}"
          git diff "$BASE"...HEAD -- . ':(exclude)*.lock' ':(exclude)dist/**' \
            | python review.py > findings.json
          cat findings.json
      - name: Post sticky comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const data = JSON.parse(fs.readFileSync('findings.json', 'utf8'));
            const f = data.findings || [];
            const TAG = '<!-- claude-review -->';
            const body = f.length === 0
              ? `${TAG}\n🐕 Claude review: no blocking issues found.`
              : `${TAG}\n🐕 Claude review — ${f.length} finding(s):\n\n` +
                f.map(x => `- **[${x.severity}]** \`${x.file}:${x.line}\` — ${x.title}\n  ${x.detail}`).join('\n');
            const {owner, repo} = context.repo;
            const issue = context.issue.number;
            const comments = await github.rest.issues.listComments({owner, repo, issue_number: issue});
            const mine = comments.data.find(c => c.body.includes(TAG));
            if (mine) await github.rest.issues.updateComment({owner, repo, comment_id: mine.id, body});
            else await github.rest.issues.createComment({owner, repo, issue_number: issue, body});

The 

HTML marker is how the bot finds and updates its own comment instead of stacking a new wall of text on every push. The :(exclude)*.lock

pathspec keeps lockfile churn out of the token bill. And cancel-in-progress

matters more than it looks — see below.

Here are the concrete failures, because reproducing them is how you understand the system.

Failure 1 — the retry loop that machine-gunned Opus. My first version wrapped messages.create

in a naive while True

retry around the SDK's RateLimitError

. When the API returned 429, the code retried immediately with no backoff. Worse, my early config used Opus for every review. Under a burst of pushes, each 429 fired another full Opus request within milliseconds, which produced another 429, which fired another request. The bot wasn't reviewing anything — it was a busy loop hammering the rate limiter. The fix is exponential backoff that respects the retry-after

header, and the SDK already does this for you if you stop fighting it:

client = Anthropic(max_retries=4)  # built-in exponential backoff; don't hand-roll a while True

The lesson: an unattended process plus a retry loop plus a per-call cost is a money fire. The concurrency

block in the workflow is the second guardrail — without it, ten rapid pushes spawn ten concurrent review jobs, each holding its own retry loop.

Failure 2 — hallucinated line numbers from a truncated diff. Before MAX_DIFF_CHARS

, a 4,000-line refactor blew past the context I budgeted, and git diff

output got cut mid-hunk. Claude, trying to be helpful, reported a blocker

at payments.py:812

— a line that didn't exist in the diff because the hunk header it referenced had been sliced off. The bot looked confident and was completely wrong, which is the worst kind of reviewer. The fix is the explicit truncation cap plus the system-prompt rule "if a line number is not derivable from a hunk header, omit the finding." Confident-and-wrong is worse than silent.

Failure 3 — the feedback loop on its own comments. An early iteration diffed the whole PR including review threads via a script step, and on the next synchronize

event the bot ended up "reviewing" the text of its own previous comment, generating meta-findings about its own findings. Scoping the diff strictly to git diff base...HEAD -- .

(code only, never comment bodies) and the sticky-comment marker killed it.

The last decision is policy: should a finding block the merge or just inform? I keep the gate soft on purpose. Making the job exit 1

on any blocker

sounds disciplined, but a single false-positive blocker on a Friday teaches you to merge with --admin

override, and once you've trained yourself to ignore the gate, the gate is dead.

Instead I let the comment post and only fail the check when there are blocker-severity findings and the diff touches a protected path:

BLOCKERS=$(jq '[.findings[] | select(.severity=="blocker")] | length' findings.json)
if [ "$BLOCKERS" -gt 0 ] && git diff --name-only "$BASE"...HEAD | grep -qE '(auth|payment|migrations)/'; then
  echo "Blocking: critical path has $BLOCKERS blocker(s)"; exit 1
fi

This is the watchdog philosophy in one rule: bark at everything, bite only near the things that actually hurt. The bot reviews 100% of PRs, comments on most, and blocks only when a high-severity finding lands in auth/

, payment/

, or migrations/

.

Copy review.py

and claude-review.yml

, add ANTHROPIC_API_KEY

to repo secrets, and every PR gets a Claude reviewer that posts structured, line-anchored findings, updates one comment instead of spamming, respects a token cap, and won't loop your bill to the moon.

The natural extensions, in order of payoff: feed the file's full content (not just the diff) for the few files Claude flags, so it can reason about surrounding code; cache the system prompt with prompt caching to cut input cost on every run; and add a REVIEW_MODEL=claude-opus-4-8

override on PRs labeled high-risk

so the expensive eyes show up only when you ask. The watchdog metaphor holds the whole way down — a good guard dog is cheap to keep, loud at the right moments, and smart enough to know the mailman.

source & further reading

dev.to — original article I Built the First Collaborative Multi-Persona Sandbox (Because I'm Sick of Cloud Wrappers) ACP vs AP2: the two AI-checkout protocols, and what your store actually has to build Jump into Junior Engr.: Your Web Development Journey with ReactJS (AI-Augmented Edition)

Hire a Code Watchdog: Building a Claude Code Auto-Review Bot as a Quality Gate for Solo Projects (with Real Ops Logs)

Run your AI side-project on zahid.host