Show HN: Data-review, diff the data/numbers a PR changes

wpnews.pro

Change-time review for data-engineering and financial-controlling work.

An agent skill that reviews the data impact of a code change: it re-runs the affected pipeline lanes and checks whether the numbers move the way the author said they would.

Code review checks the code; tests check the assertions you remembered to write. Neither runs a query, so locale mis-parses, sign flips, silent NULLs, join fan-out, and double counting all pass green. Scripts compute, agents judge: no agent does arithmetic a query can do, no script decides whether a number should have moved.

you             →  a PR touching parsers / transforms / models / deliverables
blast_radius    →  which lanes the diff hits                    (deterministic)
T1  tieouts     →  is the world in a known-good state?          (deterministic)
T2  rerun+diff  →  scratch rebuild vs committed baseline        (deterministic)
T3  row diff    →  row-level EXCEPT ALL + top-k moved groups    (deterministic)
judgment layer  →  declared / explained / UNEXPLAINED           (agents)
report          →  findings ranked by € impact + coverage table

A real before/after data bug caught in under a minute, no agent or account:

git clone https://github.com/10fra/data-review.git
cd data-review && ./demo.sh

It builds a toy two-stage pipeline, blesses a baseline, then drops a cents-to-dollars conversion in staging. A unit test and the orders-vs-staging tie-out both stay green (everything scaled in lockstep); the baseline diff catches sum(amount)

jumping $134.50 → $13,450 and escalates it.

git clone https://github.com/10fra/data-review.git
cd data-review && ./install.sh

install.sh

builds the venv, links the skill into ~/.claude/skills/

, and runs the tests; re-run after a git pull

. A consumer project carries a data-review.yml

and committed baselines under .data-review/baselines/

.

In a project with a data-review.yml

, ask Claude Code:

review the data impact of this branch

Auto-discovered from ~/.claude/skills/

, it runs the evidence scripts, then reviews the diff through ten failure-class lenses (each seeded from a real production bug), checks every delta against the PR's ## Expected data impact

, and verifies each finding adversarially before it reaches the report.

Deterministic, so runnable by hand or in CI. The data-review

launcher uses the repo's venv from any consumer directory:

data-review blast --manifest data-review.yml --diff main...HEAD   # diff × manifest → tier plan
data-review bless --manifest data-review.yml --lane <name>        # snapshot live state → committed baseline
data-review t1    --manifest data-review.yml                      # tieouts + live-vs-baseline conformance
data-review t2    --manifest data-review.yml --lane <name>        # scratch re-run, metric diff vs baseline
data-review t3    --manifest data-review.yml --lane <name> --scratch <db>  # row-level EXCEPT ALL
data-review xlsx  --manifest data-review.yml --workbook <path>    # Excel tie-outs, hardcodes, short SUMs

(alias dr=/path/to/data-review/data-review

to shorten.)

One data-review.yml

per project, declaring what the skill cannot infer:

engine:                 # where the numbers live (DuckDB; adapters convert anything else)
lanes:                  # unit of blast radius: code globs + scratch-safe run command + output tables
baselines:              # metrics + group-by dims snapshotted into committed JSON
tieouts:                # cross-artifact invariants, verified on live data before being committed
deliverables:           # Excel workbooks audited against SQL
context:                # prose priors for the judgment agents (locale, sign conventions, known traps)

A lane run command must take a scratch-target override (scratch_env

) or the skill refuses it, so a re-run never touches the canonical store. It must build what it consumes, or it re-runs stale code; non-DuckDB outputs convert there too.

Baselines are committed JSON. Re-blessing is a reviewed git change, so baseline drift shows up in diffs.Tiers escalate mechanically. Any beyond-tolerance T2 delta makes the lane a T3 candidate;money_critical

lanes always get T3. Whether the movement isexplainedis the agents' call.Findings are falsifiable. Each carries a one-sentence claim, € impact from evidence, a reproduce command, and its verification votes.Coverage is reported, not implied. The report ends with a lane × tier table; an empty report is not a clean review.

Tested on four stacks (one private, three public), replaying real historical bugs from their git history.

Project	Stack	Bug replayed / injected	Caught	Signal
consumer #1 (private)	TS parsers + DuckDB	supplier A amount explosion (real)	T2	sum €63.5M → €3.91B, 74 group deltas, zero leakage
consumer #1 (private)	TS parsers + DuckDB	supplier B US-locale dates (real)	T2	null_rate(accounting_period) +16pp, 96 group deltas

owid/etlef7a53c9b

, real)accidents

pudl09d8efa7d

, real)Every clean-control rerun produced zero beyond-tolerance facts, deterministic to the cent, and one caught real drift on its own (a canonical store stale against merged parser fixes). The recurring lesson: tie-outs and pipeline tests are blind to scaling, double-counting, and drop distortions. Shares renormalize, identities scale in lockstep, sets dedupe. Only the committed-baseline diff is anchored outside the change.

Scripts emit facts, never findings. Severity is the judgment layer's job. - Exit 2 ( evidence-unavailable

) isnever a pass. Missing input is a finding, not a skip. - Silence never looks green: unmatched files, skipped formulas, dropped metrics, and unavailable tiers are all reported facts.

A stale scratch database is never diffed; pre-existing scratch files are deleted before every re-run.
A tieout is only committed after it has been verified to hold on live data.
€ impact is computed from evidence by script, never estimated by an agent.
Write data-review.yml

(schema:skills/data-review/manifest.schema.json

). data-review bless --manifest data-review.yml --lane <name>

, commit the baselines.data-review t1 --manifest data-review.yml

; a clean run means ready.

Cost: 3 minutes (dbt + DuckDB) to half a day (dagster monorepo).

source & further reading

github.com — original article

Show HN: Data-review, diff the data/numbers a PR changes

Run your AI side-project on zahid.host