# Hire a Code Watchdog: Building a Claude Code Auto-Review Bot as a Quality Gate for Solo Projects (with Real Ops Logs)

> Source: <https://dev.to/_7fb6011b57d383122b5a/hire-a-code-watchdog-building-a-claude-code-auto-review-bot-as-a-quality-gate-for-solo-projects-4i92>
> Published: 2026-06-07 00:07:49+00:00

When you ship alone, nobody catches your 2 AM mistakes. This article gives you a working Claude-powered review bot that runs on every pull request via GitHub Actions, posts inline findings, and — crucially — refuses to bankrupt you. By the end you can copy two files into any repo and have a guard dog that barks at real bugs and stays quiet otherwise.

I'll also walk through the failures I hit wiring this up, because the interesting part of a review bot isn't the happy path. It's the night the dog wouldn't stop barking.

Conclusion first: a linter checks *syntax and style*, but a Claude-based reviewer checks *intent* — and for a solo dev, intent drift is what actually breaks production.

ESLint won't tell you that your new retry wrapper swallows the original exception. `ruff`

won't notice that you compare a `Decimal`

to a `float`

in a billing path. Those are semantic bugs, and they're exactly what a second pair of eyes catches in a team. When you don't have a team, you rent the eyes.

The design constraint that shapes everything below: a review bot has no human deciding when to stop. A human reviewer gets tired and goes home. An automated one will happily call the API in a loop until your card declines. So the architecture is two parts — a *reviewer* and a *leash*.

The single biggest mistake people make is asking Claude to "review this diff" and then regex-parsing prose out of the reply. Don't. Use tool use to force a JSON schema, so the model *cannot* return free text where you expect structured findings.

Here is the core reviewer. It reads a unified diff on stdin, sends it to Claude with a forced tool call, and prints findings as JSON. It defaults to Sonnet 4.6 (`claude-sonnet-4-6`

) because review volume is high and Sonnet is the cost-sane workhorse; you escalate to Opus only when you mean it.

```
# review.py — reads a unified diff on stdin, emits JSON findings on stdout
import os, sys, json
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL = os.environ.get("REVIEW_MODEL", "claude-sonnet-4-6")
MAX_DIFF_CHARS = 60_000  # hard cap; see the truncation failure below

REVIEW_TOOL = {
    "name": "submit_review",
    "description": "Submit code review findings for the diff.",
    "input_schema": {
        "type": "object",
        "properties": {
            "findings": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "file": {"type": "string"},
                        "line": {"type": "integer",
                                  "description": "line number in the NEW file, from the diff hunk header"},
                        "severity": {"type": "string", "enum": ["blocker", "warning", "nit"]},
                        "title": {"type": "string"},
                        "detail": {"type": "string"},
                    },
                    "required": ["file", "line", "severity", "title", "detail"],
                },
            }
        },
        "required": ["findings"],
    },
}

SYSTEM = (
    "You are a senior reviewer for a SOLO developer with no second reviewer. "
    "Report only semantic bugs, security issues, and correctness traps. "
    "Do NOT report style or formatting — a linter handles that. "
    "If a line number is not derivable from a hunk header, omit the finding rather than guessing. "
    "An empty findings list is a valid, good outcome."
)

def main():
    diff = sys.stdin.read()
    if not diff.strip():
        print(json.dumps({"findings": []})); return
    truncated = diff[:MAX_DIFF_CHARS]
    note = "" if len(diff) <= MAX_DIFF_CHARS else "\n[diff truncated — review only what is shown]"

    resp = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        system=SYSTEM,
        tools=[REVIEW_TOOL],
        tool_choice={"type": "tool", "name": "submit_review"},  # force the schema
        messages=[{"role": "user",
                   "content": f"Review this unified diff:\n\n{truncated}{note}"}],
    )
    for block in resp.content:
        if block.type == "tool_use" and block.name == "submit_review":
            print(json.dumps(block.input))
            return
    print(json.dumps({"findings": []}))

if __name__ == "__main__":
    main()
```

Two things earn their keep here. `tool_choice`

with a named tool means Claude is *required* to call `submit_review`

— you never parse prose. And the system prompt explicitly blesses an empty list as success. Without that line, the model invents nits to feel useful, and a guard dog that barks at the mailman gets ignored.

Now the leash and the delivery. This workflow triggers on PRs, computes the diff against the base branch, runs `review.py`

, and posts a single sticky comment. It pins `max_tokens`

, uses `concurrency`

to kill stale runs, and never re-reviews the bot's own comments.

```
# .github/workflows/claude-review.yml
name: Claude Review
on:
  pull_request:
    types: [opened, synchronize]

concurrency:                       # a new push cancels the previous review run
  group: claude-review-${{ github.event.pull_request.number }}
  cancel-in-progress: true

permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0           # need base branch history for the diff
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install anthropic
      - name: Run Claude review
        id: review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          BASE="origin/${{ github.event.pull_request.base.ref }}"
          git diff "$BASE"...HEAD -- . ':(exclude)*.lock' ':(exclude)dist/**' \
            | python review.py > findings.json
          cat findings.json
      - name: Post sticky comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const data = JSON.parse(fs.readFileSync('findings.json', 'utf8'));
            const f = data.findings || [];
            const TAG = '<!-- claude-review -->';
            const body = f.length === 0
              ? `${TAG}\n🐕 Claude review: no blocking issues found.`
              : `${TAG}\n🐕 Claude review — ${f.length} finding(s):\n\n` +
                f.map(x => `- **[${x.severity}]** \`${x.file}:${x.line}\` — ${x.title}\n  ${x.detail}`).join('\n');
            const {owner, repo} = context.repo;
            const issue = context.issue.number;
            const comments = await github.rest.issues.listComments({owner, repo, issue_number: issue});
            const mine = comments.data.find(c => c.body.includes(TAG));
            if (mine) await github.rest.issues.updateComment({owner, repo, comment_id: mine.id, body});
            else await github.rest.issues.createComment({owner, repo, issue_number: issue, body});
```

The `<!-- claude-review -->`

HTML marker is how the bot finds and *updates* its own comment instead of stacking a new wall of text on every push. The `:(exclude)*.lock`

pathspec keeps lockfile churn out of the token bill. And `cancel-in-progress`

matters more than it looks — see below.

Here are the concrete failures, because reproducing them is how you understand the system.

**Failure 1 — the retry loop that machine-gunned Opus.** My first version wrapped `messages.create`

in a naive `while True`

retry around the SDK's `RateLimitError`

. When the API returned 429, the code retried *immediately* with no backoff. Worse, my early config used Opus for every review. Under a burst of pushes, each 429 fired another full Opus request within milliseconds, which produced another 429, which fired another request. The bot wasn't reviewing anything — it was a busy loop hammering the rate limiter. The fix is exponential backoff that respects the `retry-after`

header, and the SDK already does this for you if you stop fighting it:

```
client = Anthropic(max_retries=4)  # built-in exponential backoff; don't hand-roll a while True
```

The lesson: an unattended process plus a retry loop plus a per-call cost is a money fire. The `concurrency`

block in the workflow is the second guardrail — without it, ten rapid pushes spawn ten concurrent review jobs, each holding its own retry loop.

**Failure 2 — hallucinated line numbers from a truncated diff.** Before `MAX_DIFF_CHARS`

, a 4,000-line refactor blew past the context I budgeted, and `git diff`

output got cut mid-hunk. Claude, trying to be helpful, reported a `blocker`

at `payments.py:812`

— a line that *didn't exist in the diff* because the hunk header it referenced had been sliced off. The bot looked confident and was completely wrong, which is the worst kind of reviewer. The fix is the explicit truncation cap plus the system-prompt rule "if a line number is not derivable from a hunk header, omit the finding." Confident-and-wrong is worse than silent.

**Failure 3 — the feedback loop on its own comments.** An early iteration diffed the whole PR including review threads via a script step, and on the next `synchronize`

event the bot ended up "reviewing" the text of its own previous comment, generating meta-findings about its own findings. Scoping the diff strictly to `git diff base...HEAD -- .`

(code only, never comment bodies) and the sticky-comment marker killed it.

The last decision is policy: should a finding *block* the merge or just inform? I keep the gate soft on purpose. Making the job `exit 1`

on any `blocker`

sounds disciplined, but a single false-positive blocker on a Friday teaches you to merge with `--admin`

override, and once you've trained yourself to ignore the gate, the gate is dead.

Instead I let the comment post and only fail the check when there are blocker-severity findings *and* the diff touches a protected path:

```
BLOCKERS=$(jq '[.findings[] | select(.severity=="blocker")] | length' findings.json)
if [ "$BLOCKERS" -gt 0 ] && git diff --name-only "$BASE"...HEAD | grep -qE '(auth|payment|migrations)/'; then
  echo "Blocking: critical path has $BLOCKERS blocker(s)"; exit 1
fi
```

This is the watchdog philosophy in one rule: bark at everything, bite only near the things that actually hurt. The bot reviews 100% of PRs, comments on most, and blocks only when a high-severity finding lands in `auth/`

, `payment/`

, or `migrations/`

.

Copy `review.py`

and `claude-review.yml`

, add `ANTHROPIC_API_KEY`

to repo secrets, and every PR gets a Claude reviewer that posts structured, line-anchored findings, updates one comment instead of spamming, respects a token cap, and won't loop your bill to the moon.

The natural extensions, in order of payoff: feed the file's full content (not just the diff) for the few files Claude flags, so it can reason about surrounding code; cache the system prompt with prompt caching to cut input cost on every run; and add a `REVIEW_MODEL=claude-opus-4-8`

override on PRs labeled `high-risk`

so the expensive eyes show up only when you ask. The watchdog metaphor holds the whole way down — a good guard dog is cheap to keep, loud at the right moments, and smart enough to know the mailman.