Hire a Code Watchdog: Building a Claude Code Auto-Review Bot as a Quality Gate for Solo Projects (with Real Ops Logs)

A solo developer built a Claude-powered code review bot that runs on every pull request via GitHub Actions, posting structured JSON findings for semantic bugs while refusing to overuse the API. The system uses forced tool calls to enforce a JSON schema output, preventing the model from returning free-text prose, and defaults to the cost-efficient Sonnet model for high review volume. The developer documented real operational failures, including diff truncation issues and the bot's tendency to "bark" excessively, to show how the architecture includes a "leash" that prevents runaway API costs.

When you ship alone, nobody catches your 2 AM mistakes. This article gives you a working Claude-powered review bot that runs on every pull request via GitHub Actions, posts inline findings, and — crucially — refuses to bankrupt you. By the end you can copy two files into any repo and have a guard dog that barks at real bugs and stays quiet otherwise. I'll also walk through the failures I hit wiring this up, because the interesting part of a review bot isn't the happy path. It's the night the dog wouldn't stop barking. Conclusion first: a linter checks syntax and style , but a Claude-based reviewer checks intent — and for a solo dev, intent drift is what actually breaks production. ESLint won't tell you that your new retry wrapper swallows the original exception. ruff won't notice that you compare a Decimal to a float in a billing path. Those are semantic bugs, and they're exactly what a second pair of eyes catches in a team. When you don't have a team, you rent the eyes. The design constraint that shapes everything below: a review bot has no human deciding when to stop. A human reviewer gets tired and goes home. An automated one will happily call the API in a loop until your card declines. So the architecture is two parts — a reviewer and a leash . The single biggest mistake people make is asking Claude to "review this diff" and then regex-parsing prose out of the reply. Don't. Use tool use to force a JSON schema, so the model cannot return free text where you expect structured findings. Here is the core reviewer. It reads a unified diff on stdin, sends it to Claude with a forced tool call, and prints findings as JSON. It defaults to Sonnet 4.6 claude-sonnet-4-6 because review volume is high and Sonnet is the cost-sane workhorse; you escalate to Opus only when you mean it. review.py — reads a unified diff on stdin, emits JSON findings on stdout import os, sys, json from anthropic import Anthropic client = Anthropic api key=os.environ "ANTHROPIC API KEY" MODEL = os.environ.get "REVIEW MODEL", "claude-sonnet-4-6" MAX DIFF CHARS = 60 000 hard cap; see the truncation failure below REVIEW TOOL = { "name": "submit review", "description": "Submit code review findings for the diff.", "input schema": { "type": "object", "properties": { "findings": { "type": "array", "items": { "type": "object", "properties": { "file": {"type": "string"}, "line": {"type": "integer", "description": "line number in the NEW file, from the diff hunk header"}, "severity": {"type": "string", "enum": "blocker", "warning", "nit" }, "title": {"type": "string"}, "detail": {"type": "string"}, }, "required": "file", "line", "severity", "title", "detail" , }, } }, "required": "findings" , }, } SYSTEM = "You are a senior reviewer for a SOLO developer with no second reviewer. " "Report only semantic bugs, security issues, and correctness traps. " "Do NOT report style or formatting — a linter handles that. " "If a line number is not derivable from a hunk header, omit the finding rather than guessing. " "An empty findings list is a valid, good outcome." def main : diff = sys.stdin.read if not diff.strip : print json.dumps {"findings": } ; return truncated = diff :MAX DIFF CHARS note = "" if len diff <= MAX DIFF CHARS else "\n diff truncated — review only what is shown " resp = client.messages.create model=MODEL, max tokens=2000, system=SYSTEM, tools= REVIEW TOOL , tool choice={"type": "tool", "name": "submit review"}, force the schema messages= {"role": "user", "content": f"Review this unified diff:\n\n{truncated}{note}"} , for block in resp.content: if block.type == "tool use" and block.name == "submit review": print json.dumps block.input return print json.dumps {"findings": } if name == " main ": main Two things earn their keep here. tool choice with a named tool means Claude is required to call submit review — you never parse prose. And the system prompt explicitly blesses an empty list as success. Without that line, the model invents nits to feel useful, and a guard dog that barks at the mailman gets ignored. Now the leash and the delivery. This workflow triggers on PRs, computes the diff against the base branch, runs review.py , and posts a single sticky comment. It pins max tokens , uses concurrency to kill stale runs, and never re-reviews the bot's own comments. .github/workflows/claude-review.yml name: Claude Review on: pull request: types: opened, synchronize concurrency: a new push cancels the previous review run group: claude-review-${{ github.event.pull request.number }} cancel-in-progress: true permissions: contents: read pull-requests: write jobs: review: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 with: fetch-depth: 0 need base branch history for the diff - uses: actions/setup-python@v5 with: { python-version: "3.12" } - run: pip install anthropic - name: Run Claude review id: review env: ANTHROPIC API KEY: ${{ secrets.ANTHROPIC API KEY }} run: | BASE="origin/${{ github.event.pull request.base.ref }}" git diff "$BASE"...HEAD -- . ': exclude .lock' ': exclude dist/ ' \ | python review.py findings.json cat findings.json - name: Post sticky comment uses: actions/github-script@v7 with: script: | const fs = require 'fs' ; const data = JSON.parse fs.readFileSync 'findings.json', 'utf8' ; const f = data.findings || ; const TAG = '< -- claude-review -- '; const body = f.length === 0 ? ${TAG}\n🐕 Claude review: no blocking issues found. : ${TAG}\n🐕 Claude review — ${f.length} finding s :\n\n + f.map x = - ${x.severity} \ ${x.file}:${x.line}\ — ${x.title}\n ${x.detail} .join '\n' ; const {owner, repo} = context.repo; const issue = context.issue.number; const comments = await github.rest.issues.listComments {owner, repo, issue number: issue} ; const mine = comments.data.find c = c.body.includes TAG ; if mine await github.rest.issues.updateComment {owner, repo, comment id: mine.id, body} ; else await github.rest.issues.createComment {owner, repo, issue number: issue, body} ; The < -- claude-review -- HTML marker is how the bot finds and updates its own comment instead of stacking a new wall of text on every push. The : exclude .lock pathspec keeps lockfile churn out of the token bill. And cancel-in-progress matters more than it looks — see below. Here are the concrete failures, because reproducing them is how you understand the system. Failure 1 — the retry loop that machine-gunned Opus. My first version wrapped messages.create in a naive while True retry around the SDK's RateLimitError . When the API returned 429, the code retried immediately with no backoff. Worse, my early config used Opus for every review. Under a burst of pushes, each 429 fired another full Opus request within milliseconds, which produced another 429, which fired another request. The bot wasn't reviewing anything — it was a busy loop hammering the rate limiter. The fix is exponential backoff that respects the retry-after header, and the SDK already does this for you if you stop fighting it: client = Anthropic max retries=4 built-in exponential backoff; don't hand-roll a while True The lesson: an unattended process plus a retry loop plus a per-call cost is a money fire. The concurrency block in the workflow is the second guardrail — without it, ten rapid pushes spawn ten concurrent review jobs, each holding its own retry loop. Failure 2 — hallucinated line numbers from a truncated diff. Before MAX DIFF CHARS , a 4,000-line refactor blew past the context I budgeted, and git diff output got cut mid-hunk. Claude, trying to be helpful, reported a blocker at payments.py:812 — a line that didn't exist in the diff because the hunk header it referenced had been sliced off. The bot looked confident and was completely wrong, which is the worst kind of reviewer. The fix is the explicit truncation cap plus the system-prompt rule "if a line number is not derivable from a hunk header, omit the finding." Confident-and-wrong is worse than silent. Failure 3 — the feedback loop on its own comments. An early iteration diffed the whole PR including review threads via a script step, and on the next synchronize event the bot ended up "reviewing" the text of its own previous comment, generating meta-findings about its own findings. Scoping the diff strictly to git diff base...HEAD -- . code only, never comment bodies and the sticky-comment marker killed it. The last decision is policy: should a finding block the merge or just inform? I keep the gate soft on purpose. Making the job exit 1 on any blocker sounds disciplined, but a single false-positive blocker on a Friday teaches you to merge with --admin override, and once you've trained yourself to ignore the gate, the gate is dead. Instead I let the comment post and only fail the check when there are blocker-severity findings and the diff touches a protected path: BLOCKERS=$ jq ' .findings | select .severity=="blocker" | length' findings.json if "$BLOCKERS" -gt 0 && git diff --name-only "$BASE"...HEAD | grep -qE ' auth|payment|migrations /'; then echo "Blocking: critical path has $BLOCKERS blocker s "; exit 1 fi This is the watchdog philosophy in one rule: bark at everything, bite only near the things that actually hurt. The bot reviews 100% of PRs, comments on most, and blocks only when a high-severity finding lands in auth/ , payment/ , or migrations/ . Copy review.py and claude-review.yml , add ANTHROPIC API KEY to repo secrets, and every PR gets a Claude reviewer that posts structured, line-anchored findings, updates one comment instead of spamming, respects a token cap, and won't loop your bill to the moon. The natural extensions, in order of payoff: feed the file's full content not just the diff for the few files Claude flags, so it can reason about surrounding code; cache the system prompt with prompt caching to cut input cost on every run; and add a REVIEW MODEL=claude-opus-4-8 override on PRs labeled high-risk so the expensive eyes show up only when you ask. The watchdog metaphor holds the whole way down — a good guard dog is cheap to keep, loud at the right moments, and smart enough to know the mailman.