cd /news/large-language-models/i-make-codex-review-every-diff-claud… · home topics large-language-models article
[ARTICLE · art-38572] src=zackproser.com ↗ pub= topic=large-language-models verified=true sentiment=· neutral

I Make Codex Review Every Diff Claude Writes

A developer implements a system where Claude Code writes code diffs and a separate Codex model reviews them before human approval, using cross-model disagreement to catch bugs that self-review misses. The hook-based workflow gates substantial changes through adversarial review, blocking or flagging issues before they reach the developer.

read13 min views1 publishedJun 24, 2026
I Make Codex Review Every Diff Claude Writes
Image: Zackproser (auto-discovered)

I Make Codex Review Every Diff Claude Writes

Claude Code writes the diff. Before that diff reaches me, a second model from a different lab reads it with one job: find what's wrong. Claude writes the code, Codex tries to break it, then I read the findings. I don't let the model that wrote the patch be the only thing that approves it.

I run this on every substantial change. A one-line typo fix slides through. Anything that touches logic, auth, money, or a migration gets handed to a Codex reviewer with an adversarial brief before it lands in front of me. The reviewer reads the unified diff, the changed files (in full when they fit a size budget), and the task that produced them, then returns a structured verdict: block, comment, or approve, with line-anchored findings.

I already do this for my prose. My blog bot sends every draft to two cold readers from different model families before I see it, and where they disagree is the spot worth looking at. This is the same idea pointed at code. Same rule: the model that wrote the code is the worst possible judge of it.

Why self-review is structurally weak

Ask Claude to review the code Claude just wrote and you get a confident yes. Not because the model is dishonest, but because review and authorship share the same context. The reasoning that produced the bug is still in the window, still feels correct, and the "review" just re-runs it. You're asking the author to re-read their own sentence. They read what they meant, not what they typed.

In practice, self-review misses three classes of bug I care about, and a cross-model reviewer kills all three.

Shared blind spots. A model has characteristic mistakes: the off-by-one it tends to make, the error case it tends to skip, the framework footgun it doesn't know about. When the same model reviews, it shares those exact blind spots. It can't see the bug for the same reason it wrote it.

Context contamination. The reviewer that saw the drafting prompt inherits the author's assumptions. If the task said "this input is always validated upstream," the author trusts that and so does the same-model reviewer. A fresh reviewer that only sees the diff asks the obvious question: where's the validation?

Sycophancy toward its own output. Models rate their own work higher. Give Claude its own patch and ask "any issues?" and the prior is approval. Give a different model the same patch cold, briefed to find problems, and the prior flips.

Claude and Codex make different mistakes. That's the point. Where Codex and Claude agree, that's signal. Where they diverge, that's the line I read myself. I'm not trusting either model to be right. I'm using their disagreement as a flashlight.

The hook flow

The whole thing hangs off a Claude Code hook. When Claude finishes a turn that produced edits, a Stop

hook fires, computes the diff against the branch point, and routes it.

The gate in the middle keeps this usable. Small, cold diffs pass straight through on three cheap signals (lines changed, paths touched, a high-risk glob for auth, billing, migrations, and infra/

). Hot paths always go to review. A README tweak doesn't wake Codex up. A change to the session-token check always does.

Here's the hook, trimmed to the parts that matter:

#!/usr/bin/env bash
set -euo pipefail

BASE_REF="${REVIEW_BASE:-origin/main}"
BASE="$(git merge-base HEAD "$BASE_REF")"
git ls-files --others --exclude-standard -z | xargs -0 -r git add -N -- 2>/dev/null || true

DIFF="$(git diff "$BASE" --)"          # working tree vs branch base
[ -z "$DIFF" ] && exit 0

changed_loc=$(git diff --numstat "$BASE" -- | awk '{a+=$1+$2} END{print a+0}')
hot_paths=$(git diff --name-only "$BASE" -- \
  | grep -cE '(auth|billing|payment|migrations?/|infra/|\.sql$)' || true)

if [ "$changed_loc" -lt 25 ] && [ "$hot_paths" -eq 0 ]; then
  exit 0   # trivial + cold: pass through, no review
fi

FILES_CONTEXT=""
skipped_context=0
while IFS= read -r -d '' file; do
  [ -f "$file" ] || continue                                  # skip deletes
  git diff --numstat "$BASE" -- "$file" \
    | awk '$1=="-"||$2=="-"{exit 1}' || { skipped_context=1; continue; }   # binary
  if [ "$(wc -l < "$file")" -gt 600 ]; then skipped_context=1; continue; fi  # too big
  FILES_CONTEXT+=$'\n### '"$file"$'\n```\n'"$(cat "$file")"$'\n```\n'
done < <(git diff --name-only -z "$BASE" --)

PROMPT="$(cat .claude/prompts/adversarial-review.md)

## Task that produced this diff
$(cat .claude/last-task.txt 2>/dev/null || echo "(not recorded)")

## Changed files (full)
$FILES_CONTEXT

## Unified diff
\`\`\` diff
$DIFF
\`\`\`"

RAW=$(printf '%s' "$PROMPT" | timeout 60s codex exec --json -) \
  || { echo "codex review failed or timed out — blocking" >&2; exit 2; }

REVIEW=$(printf '%s' "$RAW" | jq -r 'select(.type=="message") | .content' | tail -n1)

if ! printf '%s' "$REVIEW" | jq -e '.verdict and (.findings | type == "array")' >/dev/null 2>&1; then
  echo "codex review returned no valid JSON — blocking" >&2
  exit 2
fi

verdict=$(printf '%s' "$REVIEW" | jq -r '.verdict')   # approve | comment | block
printf '%s' "$REVIEW" | jq -r '.findings[] | "\(.severity) \(.file):\(.line) \(.message)"'

[ "$skipped_context" = "1" ] && [ "$verdict" = "approve" ] && verdict="comment"

[ "$verdict" = "block" ] && exit 2
exit 0

Two choices matter more than the rest.

The reviewer gets the diff, the changed files, and the task, and nothing else. No drafting transcript, no chain of thought, no "here's why I think this is right." A cold reader with an adversarial brief catches what a warm one waves through. The brief itself is blunt. The whole of .claude/prompts/adversarial-review.md

is short enough to paste here, and it's the most reusable part of the setup:

You are a hostile senior reviewer. Assume the author is wrong.
Your job is to find the bug, the security hole, or the irreversible
action in this diff — not to praise it.

Rules:
- Cite the exact file and line for every finding.
- Every `block` must name a concrete failure mode: how it breaks,
  who it hurts, when. "This could be cleaner" is a comment, not a block.
- Score against this checklist, in order: correctness, security,
  reversibility, tests, scope.
- Return ONLY this JSON: { "verdict": "approve|comment|block",
  "findings": [{ "severity", "file", "line", "message" }] }

(Codex CLI streams JSON events, so the hook pulls the final message

payload off the stream with jq

, then parses the review object out of it.)

The reviewer returns structured output, not prose, against a fixed schema:

{
  "verdict": "block",
  "findings": [
    {
      "severity": "high",
      "file": "src/app/api/route.ts",
      "line": 42,
      "message": "the auth check moved below the handler; the body runs before the user is verified."
    }
  ]
}

Structure is what makes the hook automatable. block

exits 2, which Claude Code surfaces as a failed Stop hook, which means Claude has to read the findings and respond instead of barreling on. The model can't talk its way past a schema.

The rubric the reviewer scores against

The reviewer doesn't get to invent what "good" means in the moment. It scores the diff against the same fixed checklist every time, in priority order:

Correctness. Does the change do what the task asked, and does it break anything adjacent? Off-by-ones, inverted conditionals, dropped error cases.Security. Unvalidated input, authz checks that moved or vanished, secrets in code, SQL built by string concatenation, a scope widened without reason.Reversibility. Can this be rolled back? Destructive migrations, data deletes without a backup, schema changes that aren't backward-compatible. (This maps straight ontothe line I draw for what agents may do unattended.)Tests. Does new logic come with a test that would fail without the change? A diff that adds a branch and no test to cover it gets flagged.Scope. Did the author change more than the task asked? Drive-by edits, unrelated refactors, formatting churn that buries the real change.

Correctness and security block. Scope and style comment. The severity in each finding is what the hook's exit code keys off, so the rubric isn't decoration. It's wired to whether Claude gets to move on.

What it actually catches

The pattern earns its keep on a specific shape of bug: the diff that does exactly what the task asked, passes every test the author wrote, and is still wrong. Self-review waves those through, because the author already believes the code is correct. A cold reviewer doesn't share the belief.

The recurring categories, in roughly the order they show up:

The missing edge. The change handles the happy path and the obvious error, and silently skips the third case the task never mentioned. The author's tests cover the cases the author thought of, so they stay green. A reviewer reading the diff cold asks the question the author never had: what about the input that isn't either of those?The assumption that no longer holds. A guard, a null check, or an auth step that some earlier code "always" did, that this diff quietly moved or removed. The author still pictures the old invariant. The cold reviewer only sees what the diff leaves in place.The irreversible step with no undo. A migration that drops before it backfills, a delete with no soft-delete, a schema change that isn't backward compatible. Correct-looking, and a one-way door if it's wrong.The test that proves nothing. New logic, a new branch, and a test that would still pass if you deleted the branch. The reviewer flags coverage that doesn't exercise the change.The quiet scope creep. A drive-by refactor or a formatting sweep riding along in a diff that was supposed to do one thing, burying the line that matters.

None of these need a smarter model to catch. They need a reader who doesn't already know what the code was supposed to do. That's the entire trick: the author reads the diff through the intent that produced it and sees the code it meant to write; the cold reviewer reads the lines that are actually on the page.

Codex is not reliably smarter than Claude, in either direction. I use it because it comes in cold. A reviewer with no memory of what the code was supposed to do reads what the code does.

When they disagree

The interesting cases aren't the clean blocks. They're the splits. Claude writes the code and is sure it's right; Codex flags something; I'm the tiebreaker.

I treat a disagreement as a prompt to go read the code myself, not as a vote to average. Sometimes Codex is wrong. It flags a "missing" null check on a value that's typed non-null three frames up, and I dismiss it. Sometimes it's right and Claude's defense is exactly the kind of confident-but-wrong reasoning that produced the bug. Either way, the disagreement bought me a thirty-second read of the one line that mattered instead of a full review of a diff I would have rubber-stamped. The reviewer's job isn't to be right. It's to point my attention at the line worth my judgment.

What it costs, and what it doesn't catch

The tax is latency and false positives.

It adds latency. A substantial diff waits ten to thirty seconds for the Codex pass before I see it. The gate is what makes that acceptable. Trivial changes skip it entirely, so the tax only lands on changes where I'd want a careful read anyway. I'll trade thirty seconds of a machine's time against an hour of mine debugging that bug in production every time.

It produces false positives. Codex flags things that are fine, and a reviewer that cries wolf gets ignored, so I tuned the brief to demand a concrete failure mode for every block

. A finding has to say how it breaks, not just that it smells off. "This could be cleaner" is a comment, never a block.

And it does not check whether the change is the right change to make. The reviewer grades the diff in front of it: is this code correct, safe, reversible, tested, in scope? It has no opinion on whether I should have built this feature at all, or whether the architecture is sound. That judgment stays mine. The cross-model loop protects the code. It can't protect the decision.

This is the boundary I use for every agent I run. The machine takes the middle of the work, the drafting and the review and the catch, and the ends stay with me. I wrote about that division as building agents that hand you a reviewable artifact instead of a chat reply. A diff that's already been attacked by a hostile reviewer from a different lab is a much better artifact than one fresh out of the model that wrote it.

Steal the setup

If you want to run this, the parts that matter are short:

Put the reviewer in a It fires automatically when Claude finishes editing, so review isn't a thing you remember to do. It's a thing that happens.Stop

hook.Gate on substance. Score the diff on LOC, hot paths, and risk globs. Skip the trivial; always review auth, money, and migrations.Use a different lab's model. Claude writes, a GPT-class model reviews. Shared blind spots are the problem. A second model family is the fix.Starve the reviewer of context. Give it the diff, the files, the task. Not the drafting transcript. A cold read with an adversarial brief beats a warm one.Demand structured output and wire it to an exit code. Verdict enum, line-anchored findings, severities.block

fails the hook so Claude has to respond.Make Every block names how it breaks. Everything softer is a comment. That's how you keep the reviewer worth listening to.block

require a failure mode.Treat disagreement as a flashlight. When the two models split, go read that line yourself. Don't average the verdicts.

The bug this catches is the one that does what the task asked, passes its tests, and is still wrong. You don't need a smarter model to catch it. You need a second one with no loyalty to the patch.

If you copy one thing, copy the brief and the failure-mode rule. I covered this pattern (cold reviewers, fixed rubrics, fail-closed gates) in the Claude Code skills workshop Nick Nisi and I ran at AI Engineering London: constraints over instructions, evidence over guesses, measurement over vibes.

── more in #large-language-models 4 stories · sorted by recency
── more on @claude code 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/i-make-codex-review-…] indexed:0 read:13min 2026-06-24 ·