{"slug": "ai-code-reviewer-with-senior-level-judgment-and-strict-rubric", "title": "AI code reviewer with senior-level judgment and strict rubric", "summary": "Lazycoder, an AI code review agent with senior-level judgment, evaluates every changed code block against a fixed 17-rule rubric, runs real checks, and returns a defensible verdict of APPROVE, REQUEST_CHANGES, or BLOCK before code is merged. The tool enforces deterministic, auditable reviews with cited evidence, and integrates into CI via exit codes, aiming to remove human inconsistency and ensure all rules are checked every time.", "body_md": "A code review agent with senior-level judgment. It interrogates every changed\nblock against a fixed rubric, runs the real checks, and returns a defensible\nverdict (**APPROVE / REQUEST_CHANGES / BLOCK**) before code is trusted or merged.\n\nCode gets written fast. The bottleneck is trusting it. lazycoder is the reviewer that never gets tired, never skips a rule, and never self-reports green without running the checks.\n\n```\nexport ANTHROPIC_API_KEY=sk-ant-...\n\nuvx lazycoder my.diff              # zero-install run\npipx install lazycoder             # or install the CLI permanently\n\ngit diff main | uvx lazycoder -    # review your branch straight from a pipe\n```\n\nExit codes map to the verdict (`0` APPROVE, `1` REQUEST_CHANGES, `2` BLOCK), so it\ndrops into CI as a gate with no glue code. 
`--json` emits the full report.\n\n| | Manual review | lazycoder |\n|---|---|---|\n| Coverage | Whatever the reviewer remembers to look at | Every rule (R1–R17) evaluated, every time |\n| Consistency | Varies by reviewer, mood, time of day | Same rubric, same policy, deterministic |\n| Verdict | \"LGTM\" / gut feel | APPROVE / REQUEST_CHANGES / BLOCK from a severity policy |\n| Evidence | Comments, sometimes | Every finding cites `rule_id` + exact file:line |\n| Green claims | \"tests pass\" (trust me) | Real linter/typecheck/test output in a sandbox |\n| Untrusted code | Reviewer may run it locally | Reviewed code is data, never executed outside the sandbox |\n| Speed at scale | Slows down as diffs grow | Loops the rubric per block, unattended |\n| Auditability | Lives in someone's head | Append-only decision log; any verdict is replayable |\n\nlazycoder does not replace the human; a person still confirms consequential decisions. It removes the parts humans are bad at: remembering all 17 rules, staying consistent across 200 files, and proving the checks actually ran.\n\nTwo structural facts, at a glance. These are not benchmarks; they are properties enforced by the schema, so they hold on every single review:\n\n```\nxychart-beta\n    title \"Rubric rules guaranteed evaluated per code block\"\n    x-axis [\"manual review\", \"lazycoder\"]\n    y-axis \"rules (of 17)\" 0 --> 17\n    bar [0, 17]\n```\n\nManual review *may* cover all 17, but nothing guarantees it. 
lazycoder cannot emit\na verdict until every rule has a recorded pass/fail (`APPROVE` is refused\notherwise).\n\n```\nxychart-beta\n    title \"Findings that cite rule_id + exact file:line (%)\"\n    x-axis [\"manual review\", \"lazycoder\"]\n    y-axis \"% enforced\" 0 --> 100\n    bar [0, 100]\n```\n\nA human reviewer *can* cite evidence; the lazycoder domain model makes an\nuncited finding unrepresentable: pydantic rejects it before it exists.\n\nThe **full pipeline is live end to end**: deterministic core plus the real\nmodel. A unified diff flows all the way to an aggregated verdict:\n\n```\ndiff → parse_diff → CodeBlock[]\n         └─ review_rubric(block, rubric)  # every rule, every block\n              └─ RuleResult[] → from_rule_results → aggregate → verdict\n```\n\nThe same flow runs in two modes, sharing every line of plumbing:\n\n- **Fake client** (default, CI): deterministic, network-free. `pytest -q` proves the parser, aggregator, and verdict policy on every run.\n- **Real client** (opt-in): `AnthropicClient` hits the live API. The first live run of eval E3 already passed: the model caught the SQL injection, flagged R7, and the pipeline derived `BLOCK` with zero parse failures.\n\nBecause the model was the *last* thing plugged in, any failure isolates to the\nprompt or the model, never to the plumbing, which is already proven. The\nresponse parser is hardened against real LLM output (code fences, surrounding\nprose, severity casing), and the reviewer prompt teaches the model the exact\n`Finding` schema with a literal example, so form errors die at the source.\n\nPolicy is declarative and lives in `config/`, not buried in code. 
Each file is\none part of the setup (reviewable, diffable, swappable):\n\n```\nlazycoder/\n├── config/\n│   ├── harness.json              # project context, stack, hard rules, definition of done\n│   ├── guardrails.json           # what the agent may / may not do; injection defense; limits\n│   ├── setup.json                # runtime, deps + rationale, env vars, bootstrap\n│   ├── working_loop.json         # specify → plan → execute → verify → decide\n│   ├── task_loop.json            # orchestrator + review subagents, isolation, aggregation\n│   ├── review_rules.json         # R1..R17 — the interrogation rubric (the core)\n│   ├── production_readiness.json # the release gate\n│   ├── evals.json                # known-flawed/clean cases that test the reviewer\n│   └── observability.json        # append-only decision log, tracing, redaction\n├── src/argus/                    # domain, config loader, reviewers, llm client\n└── tests/                        # unit + integration + eval coverage\n```\n\nCode-level: data structure (R1), control flow (R2), inputs/outputs (R3), failure modes (R4), side effects (R5), dependencies (R6). Security: validation, secrets, injection (R7). Simplicity: simplest form (R8). System-level: state (R9), sync vs async (R10), monolith vs services (R11), invariant (R12). Plus maintainability, tests, and compatibility rules through R17.\n\nThe interesting part of this project is not the review logic; it's the choices that make the review logic trustworthy.\n\n- **Deterministic core, model last.** Everything that can be pure logic *is* pure logic, and the non-deterministic LLM is bolted on at the very end. This is a deliberate failure-isolation strategy: when a review goes wrong, the bug is in the prompt or the model, because the plumbing has tests proving it isn't there.\n- **Contracts make invalid state unrepresentable.** The domain types are strict pydantic models with validators, not bags of fields. 
A *passed* rule cannot carry a finding; a *failed* one must. Every finding must cite its `rule_id` and an exact `file:line`. The verdict is a *computed* field over findings, never a value someone can set by hand. You cannot construct a lying `ReviewReport`.\n- **Normalize at the boundary, keep the core strict.** Untrusted LLM text is cleaned up where it enters (`\"HIGH\"` → `\"high\"`), but the domain enum stays the single source of truth and never loosens. Leniency lives at the edge; the core does not bend.\n- **Debt is executable, not documented.** The one known parser limitation is pinned by a `strict` xfail test, not a comment someone can ignore. The day the fix lands, that test passes unexpectedly, `strict` turns the unexpected pass into a failure, and the suite *tells you* the debt is closed. Notes rot; tests don't.\n- **TDD throughout.** Every behavior went RED before GREEN, including the garbage-input fixtures that hardened the parser.\n- **The eval is the product.** `config/evals.json` is a set of known-flawed and known-clean cases whose job is to measure *the reviewer itself*. Wired as a CI gate, it closes the loop: a code reviewer that has its own reviewer, and knows whether it's still good every time it changes.\n\n```\nuv sync --extra dev\npre-commit install\n\npytest -q                       # deterministic suite — no network, no key\nruff check . && black --check .\nmypy src\n```\n\nTo run the live-API suite (opt-in, never part of `pytest -q`):\n\n```\ncp .env.example .env            # fill in ANTHROPIC_API_KEY — .env is gitignored\nset -a; source .env; set +a\npytest -m integration\n```\n\n- ~~Multi-file / diff orchestration on top of `review_rubric`.~~ ✓\n- ~~Harden the response parser against real LLM output (fixtures).~~ ✓\n- ~~Wire `config/evals.json` as a regression gate on the fake client; a missed rule fails the gate.~~ ✓\n- ~~Wire the real Anthropic client behind the same `LLMClient` protocol, with an opt-in integration suite (`pytest -m integration`).~~ ✓ 
First live run: the model caught eval E3's SQL injection (R7 → BLOCK).\n- **Run the full evals.json set against the live model** and track the score over time, so the eval stops measuring the plumbing and starts measuring the reviewer: does this prompt, on this model, still catch what it must?\n- ~~Distribution: published to [PyPI](https://pypi.org/project/lazycoder/) with a `lazycoder` console entry point (`uvx lazycoder my.diff`), rubric bundled in the wheel, releases via trusted publishing on `v*` tags.~~ ✓\n- **GitHub Action** wrapping the CLI, so `uses: aisona-lab/lazycoder` gates a PR with the same rubric and exit codes.", "url": "https://wpnews.pro/news/ai-code-reviewer-with-senior-level-judgment-and-strict-rubric", "canonical_source": "https://github.com/aisona-lab/lazycoder", "published_at": "2026-07-04 13:19:11+00:00", "updated_at": "2026-07-04 13:49:58.238161+00:00", "lang": "en", "topics": ["ai-tools", "developer-tools", "large-language-models", "ai-agents", "ai-safety"], "entities": ["lazycoder", "Anthropic", "pydantic"], "alternates": {"html": "https://wpnews.pro/news/ai-code-reviewer-with-senior-level-judgment-and-strict-rubric", "markdown": "https://wpnews.pro/news/ai-code-reviewer-with-senior-level-judgment-and-strict-rubric.md", "text": "https://wpnews.pro/news/ai-code-reviewer-with-senior-level-judgment-and-strict-rubric.txt", "jsonld": "https://wpnews.pro/news/ai-code-reviewer-with-senior-level-judgment-and-strict-rubric.jsonld"}}