Change-time review for data-engineering and financial-controlling work.
An agent skill that reviews the data impact of a code change: it re-runs the affected pipeline lanes and checks whether the numbers move the way the author said they would.
Code review checks the code; tests check the assertions you remembered to write. Neither runs a query, so locale mis-parses, sign flips, silent NULLs, join fan-out, and double counting all pass green. Scripts compute, agents judge: no agent does arithmetic a query can do, no script decides whether a number should have moved.
you → a PR touching parsers / transforms / models / deliverables
blast_radius → which lanes the diff hits (deterministic)
T1 tieouts → is the world in a known-good state? (deterministic)
T2 rerun+diff → scratch rebuild vs committed baseline (deterministic)
T3 row diff → row-level EXCEPT ALL + top-k moved groups (deterministic)
judgment layer → declared / explained / UNEXPLAINED (agents)
report → findings ranked by € impact + coverage table
A real before/after data bug caught in under a minute, no agent or account:
git clone https://github.com/10fra/data-review.git
cd data-review && ./demo.sh
It builds a toy two-stage pipeline, blesses a baseline, then drops a cents-to-dollars conversion in staging. A unit test and the orders-vs-staging tie-out both stay green (everything scaled in lockstep); the baseline diff catches sum(amount) jumping $134.50 → $13,450 and escalates it.
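The lockstep effect can be reproduced in a few lines. This is a minimal sketch, not the code in demo.sh: the three raw amounts, the function names, and the pass-through orders model are all invented for illustration, chosen only so the totals match the $134.50 and $13,450 figures above.

```python
from decimal import Decimal

# Hypothetical raw order amounts in cents; they total 13,450 cents.
RAW_CENTS = [5000, 4450, 4000]

def stage(raw_cents, convert=True):
    """Staging step: cents -> dollars. convert=False simulates the dropped conversion."""
    divisor = Decimal(100) if convert else Decimal(1)
    return [Decimal(c) / divisor for c in raw_cents]

def build_orders(staged):
    """Downstream model: here it simply carries the staged amounts forward."""
    return list(staged)

def tieout_holds(staged, orders):
    """Orders-vs-staging tie-out: both totals must agree."""
    return sum(staged) == sum(orders)

# Bless a baseline from the correct pipeline.
baseline = sum(build_orders(stage(RAW_CENTS)))

# Re-run with the conversion dropped.
staged_bad = stage(RAW_CENTS, convert=False)
orders_bad = build_orders(staged_bad)

tieout_ok = tieout_holds(staged_bad, orders_bad)  # both sides scaled together
delta = sum(orders_bad) - baseline                # only the baseline sits outside the change
```

The tie-out compares two artifacts that were both rebuilt from the changed code, so it cannot see a uniform scale error. The baseline was committed before the change, which is why it is the only check here that moves.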
git clone https://github.com/10fra/data-review.git
cd data-review && ./install.sh
install.sh builds the venv, links the skill into ~/.claude/skills/, and runs the tests; re-run after a git pull. A consumer project carries a data-review.yml and committed baselines under .data-review/baselines/.
In a project with a data-review.yml, ask Claude Code:
review the data impact of this branch
The skill is auto-discovered from ~/.claude/skills/. It runs the evidence scripts, then reviews the diff through ten failure-class lenses (each seeded from a real production bug), checks every delta against the PR's ## Expected data impact section, and verifies each finding adversarially before it reaches the report.
The evidence scripts are deterministic, so they can be run by hand or in CI. The data-review launcher uses the repo's venv from any consumer directory:
data-review blast --manifest data-review.yml --diff main...HEAD # diff × manifest → tier plan
data-review bless --manifest data-review.yml --lane <name> # snapshot live state → committed baseline
data-review t1 --manifest data-review.yml # tieouts + live-vs-baseline conformance
data-review t2 --manifest data-review.yml --lane <name> # scratch re-run, metric diff vs baseline
data-review t3 --manifest data-review.yml --lane <name> --scratch <db> # row-level EXCEPT ALL
data-review xlsx --manifest data-review.yml --workbook <path> # Excel tie-outs, hardcodes, short SUMs
(alias dr=/path/to/data-review/data-review to shorten.)
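The row-level comparison behind t3 uses EXCEPT ALL, which is a multiset difference: duplicates count. A rough Python picture of that semantics, with made-up rows and helper names (this is not the skill's implementation, which runs the comparison as SQL):

```python
from collections import Counter

def except_all(left, right):
    """Multiset difference: rows of `left` not matched one-for-one in `right`.

    Mirrors SQL's EXCEPT ALL, which keeps duplicate counts instead of deduplicating.
    """
    remaining = Counter(left) - Counter(right)
    return sorted(remaining.elements())

def row_diff(baseline_rows, scratch_rows):
    """Return (removed, added) rows between a baseline and a scratch rebuild."""
    removed = except_all(baseline_rows, scratch_rows)
    added = except_all(scratch_rows, baseline_rows)
    return removed, added

# Hypothetical rows: (order_id, amount). The scratch build duplicates order 2.
baseline_rows = [(1, 50.0), (2, 44.5), (3, 40.0)]
scratch_rows = [(1, 50.0), (2, 44.5), (2, 44.5), (3, 40.0)]

removed, added = row_diff(baseline_rows, scratch_rows)

# A set-based difference would deduplicate and report nothing.
set_based_added = set(scratch_rows) - set(baseline_rows)
```

Here `added` holds the one duplicated row while `set_based_added` is empty, which is the join fan-out case a plain EXCEPT would miss.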
One data-review.yml per project, declaring what the skill cannot infer:
engine: # where the numbers live (DuckDB; adapters convert anything else)
lanes: # unit of blast radius: code globs + scratch-safe run command + output tables
baselines: # metrics + group-by dims snapshotted into committed JSON
tieouts: # cross-artifact invariants, verified on live data before being committed
deliverables: # Excel workbooks audited against SQL
context: # prose priors for the judgment agents (locale, sign conventions, known traps)
A lane run command must take a scratch-target override (scratch_env) or the skill refuses it, so a re-run never touches the canonical store. It must also build what it consumes, otherwise the re-run exercises stale code; any non-DuckDB outputs are converted to DuckDB in the same command.
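As a sketch of how code globs turn a diff into a lane list, and how a lane without a scratch override gets refused: the lane names, paths, and dictionary layout below are hypothetical and do not reflect the real manifest schema; only the scratch_env name comes from the text above.

```python
from fnmatch import fnmatch

# Hypothetical lane declarations; the field layout is illustrative only.
LANES = {
    "staging": {"globs": ["parsers/*.ts", "sql/staging/*.sql"], "scratch_env": "SCRATCH_DB"},
    "orders": {"globs": ["sql/marts/orders*.sql"], "scratch_env": "SCRATCH_DB"},
    "legacy": {"globs": ["legacy/*"], "scratch_env": None},
}

def blast_radius(changed_files, lanes):
    """Return the sorted names of lanes whose globs match any changed file."""
    return sorted(
        name
        for name, lane in lanes.items()
        if any(fnmatch(path, glob) for path in changed_files for glob in lane["globs"])
    )

def runnable(lane):
    """A lane without a scratch-target override is refused rather than re-run."""
    return bool(lane["scratch_env"])

changed = ["parsers/supplier_a.ts", "legacy/old_job.py", "README.md"]
hit = blast_radius(changed, LANES)
refused = [name for name in hit if not runnable(LANES[name])]
```

README.md matches no lane, so it contributes nothing to the plan; the legacy lane is hit by the diff but refused because it cannot be pointed at a scratch target.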
- Baselines are committed JSON. Re-blessing is a reviewed git change, so baseline drift shows up in diffs.
- Tiers escalate mechanically. Any beyond-tolerance T2 delta makes the lane a T3 candidate; money_critical lanes always get T3. Whether the movement is explained is the agents' call.
- Findings are falsifiable. Each carries a one-sentence claim, € impact from evidence, a reproduce command, and its verification votes.
- Coverage is reported, not implied. The report ends with a lane × tier table; an empty report is not a clean review.
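The escalation rule is mechanical enough to write down. A minimal sketch, assuming an absolute tolerance and a plain dict per lane (both are assumptions; the real tolerance semantics and manifest shape may differ):

```python
def plan_tier(lane, deltas, tolerance):
    """Decide how far a lane escalates after its T2 metric diff.

    lane: dict with a boolean 'money_critical' flag (hypothetical shape).
    deltas: metric name -> change versus the committed baseline.
    tolerance: largest absolute change still treated as noise (assumed absolute).
    """
    beyond = {metric: d for metric, d in deltas.items() if abs(d) > tolerance}
    if lane.get("money_critical"):
        return "T3", beyond            # always row-diffed
    if beyond:
        return "T3-candidate", beyond  # a metric moved past tolerance
    return "T2", beyond                # nothing to escalate

quiet = plan_tier({"money_critical": False}, {"sum_amount": 0.0}, tolerance=0.005)
moved = plan_tier({"money_critical": False}, {"sum_amount": 13315.5}, tolerance=0.005)
critical = plan_tier({"money_critical": True}, {"sum_amount": 0.0}, tolerance=0.005)
```

The function only says which lanes need a closer look. It deliberately returns no verdict on whether a delta was declared or explained, since that belongs to the judgment layer.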
Tested on four stacks (one private, three public), replaying real historical bugs from their git history.
| Project | Stack | Bug replayed / injected | Caught | Signal |
|---|---|---|---|---|
| consumer #1 (private) | TS parsers + DuckDB | supplier A amount explosion (real) | T2 | sum €63.5M → €3.91B, 74 group deltas, zero leakage |
| consumer #1 (private) | TS parsers + DuckDB | supplier B US-locale dates (real) | T2 | null_rate(accounting_period) +16pp, 96 group deltas |
| owid/etl | … | … (ef7a53c9b, real) | … | … accidents … |
| pudl | … | … (09d8efa7d, real) | … | … |

Every clean-control rerun produced zero beyond-tolerance facts, deterministic to the cent, and one caught real drift on its own (a canonical store stale against merged parser fixes). The recurring lesson: tie-outs and pipeline tests are blind to scaling, double-counting, and drop distortions. Shares renormalize, identities scale in lockstep, sets dedupe. Only the committed-baseline diff is anchored outside the change.
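A small illustration of why those checks stay green under double counting, using invented group totals (none of these numbers come from the replayed bugs):

```python
from fractions import Fraction

def shares(amounts):
    """Each group's share of the total, as exact fractions."""
    total = sum(amounts.values())
    return {group: Fraction(value, total) for group, value in amounts.items()}

# Hypothetical committed baseline, and a rebuild where every row was counted twice.
baseline = {"north": 60, "south": 40}
doubled = {group: 2 * value for group, value in baseline.items()}

shares_sum_to_one = sum(shares(doubled).values()) == 1  # share tie-out still holds
shares_unchanged = shares(doubled) == shares(baseline)  # shares renormalize
groups_unchanged = set(doubled) == set(baseline)        # sets dedupe
baseline_delta = sum(doubled.values()) - sum(baseline.values())
```

All three internal checks pass on the doubled data; only `baseline_delta` is non-zero, because the baseline total was fixed before the change was made.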
- Scripts emit facts, never findings. Severity is the judgment layer's job.
- Exit 2 (evidence-unavailable) is never a pass. Missing input is a finding, not a skip.
- Silence never looks green: unmatched files, skipped formulas, dropped metrics, and unavailable tiers are all reported facts.
- A stale scratch database is never diffed; pre-existing scratch files are deleted before every re-run.
- A tieout is only committed after it has been verified to hold on live data.
- € impact is computed from evidence by script, never estimated by an agent.
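One way to picture the "never a pass" and "silence never looks green" rules is a coverage table in which every lane × tier cell has an explicit status. This is a sketch under assumptions: only the meaning of exit code 2 comes from the list above; treating 0 as "ran" and every other code as an error, along with the status names and function names, is invented here.

```python
EVIDENCE_UNAVAILABLE = 2  # the one exit code named in the text

def classify(exit_code):
    """Map a tier's exit code to a status (0 and 'error' meanings are assumed)."""
    if exit_code == 0:
        return "ran"
    if exit_code == EVIDENCE_UNAVAILABLE:
        return "evidence-unavailable"
    return "error"

def coverage(results, lanes, tiers):
    """Build a lane x tier table; a tier with no result is 'not-run', never blank."""
    return {
        lane: {
            tier: classify(results[(lane, tier)]) if (lane, tier) in results else "not-run"
            for tier in tiers
        }
        for lane in lanes
    }

def fully_covered(table):
    """True only if the table is non-empty and every cell actually ran."""
    cells = [status for row in table.values() for status in row.values()]
    return bool(cells) and all(status == "ran" for status in cells)

results = {("staging", "T1"): 0, ("staging", "T2"): 2}
table = coverage(results, ["staging"], ["T1", "T2", "T3"])
```

The `bool(cells)` guard matters: without it an empty table would satisfy `all(...)` and read as clean, which is exactly the outcome the rules above rule out.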
- Write data-review.yml (schema: skills/data-review/manifest.schema.json).
- data-review bless --manifest data-review.yml --lane <name>, commit the baselines.
- data-review t1 --manifest data-review.yml; a clean run means ready.
Setup cost: from 3 minutes (dbt + DuckDB) to half a day (Dagster monorepo).