cd /news/ai-tools/show-hn-data-review-diff-the-data-nu… · home topics ai-tools article
[ARTICLE · art-27180] src=github.com ↗ pub= topic=ai-tools verified=true sentiment=↑ positive

Show HN: Data-review, diff the data/numbers a PR changes

A developer released Data-review, an open-source agent skill that automatically diffs data and numbers changed by a pull request, re-running affected pipeline lanes to catch bugs like sign flips, NULLs, and double counting that unit tests miss. The tool caught a real cents-to-dollars conversion error in under a minute during a demo, and is designed for data engineering and financial controlling workflows.

read5 min publishedJun 14, 2026

Change-time review for data-engineering and financial-controlling work.

An agent skill that reviews the data impact of a code change: it re-runs the affected pipeline lanes and checks whether the numbers move the way the author said they would.

Code review checks the code; tests check the assertions you remembered to write. Neither runs a query, so locale mis-parses, sign flips, silent NULLs, join fan-out, and double counting all pass green. Scripts compute, agents judge: no agent does arithmetic a query can do, no script decides whether a number should have moved.

you             →  a PR touching parsers / transforms / models / deliverables
blast_radius    →  which lanes the diff hits                    (deterministic)
T1  tieouts     →  is the world in a known-good state?          (deterministic)
T2  rerun+diff  →  scratch rebuild vs committed baseline        (deterministic)
T3  row diff    →  row-level EXCEPT ALL + top-k moved groups    (deterministic)
judgment layer  →  declared / explained / UNEXPLAINED           (agents)
report          →  findings ranked by € impact + coverage table

A real before/after data bug caught in under a minute, no agent or account:

git clone https://github.com/10fra/data-review.git
cd data-review && ./demo.sh

It builds a toy two-stage pipeline, blesses a baseline, then drops a cents-to-dollars conversion in staging. A unit test and the orders-vs-staging tie-out both stay green (everything scaled in lockstep); the baseline diff catches sum(amount)

jumping $134.50 → $13,450 and escalates it.

git clone https://github.com/10fra/data-review.git
cd data-review && ./install.sh

install.sh

builds the venv, links the skill into ~/.claude/skills/

, and runs the tests; re-run after a git pull

. A consumer project carries a data-review.yml

and committed baselines under .data-review/baselines/

.

In a project with a data-review.yml

, ask Claude Code:

review the data impact of this branch

Auto-discovered from ~/.claude/skills/

, it runs the evidence scripts, then reviews the diff through ten failure-class lenses (each seeded from a real production bug), checks every delta against the PR's ## Expected data impact

, and verifies each finding adversarially before it reaches the report.

Deterministic, so runnable by hand or in CI. The data-review

launcher uses the repo's venv from any consumer directory:

data-review blast --manifest data-review.yml --diff main...HEAD   # diff × manifest → tier plan
data-review bless --manifest data-review.yml --lane <name>        # snapshot live state → committed baseline
data-review t1    --manifest data-review.yml                      # tieouts + live-vs-baseline conformance
data-review t2    --manifest data-review.yml --lane <name>        # scratch re-run, metric diff vs baseline
data-review t3    --manifest data-review.yml --lane <name> --scratch <db>  # row-level EXCEPT ALL
data-review xlsx  --manifest data-review.yml --workbook <path>    # Excel tie-outs, hardcodes, short SUMs

(alias dr=/path/to/data-review/data-review

to shorten.)

One data-review.yml

per project, declaring what the skill cannot infer:

engine:                 # where the numbers live (DuckDB; adapters convert anything else)
lanes:                  # unit of blast radius: code globs + scratch-safe run command + output tables
baselines:              # metrics + group-by dims snapshotted into committed JSON
tieouts:                # cross-artifact invariants, verified on live data before being committed
deliverables:           # Excel workbooks audited against SQL
context:                # prose priors for the judgment agents (locale, sign conventions, known traps)

A lane run command must take a scratch-target override (scratch_env

) or the skill refuses it, so a re-run never touches the canonical store. It must build what it consumes, or it re-runs stale code; non-DuckDB outputs convert there too.

Baselines are committed JSON. Re-blessing is a reviewed git change, so baseline drift shows up in diffs.Tiers escalate mechanically. Any beyond-tolerance T2 delta makes the lane a T3 candidate;money_critical

lanes always get T3. Whether the movement isexplainedis the agents' call.Findings are falsifiable. Each carries a one-sentence claim, € impact from evidence, a reproduce command, and its verification votes.Coverage is reported, not implied. The report ends with a lane × tier table; an empty report is not a clean review.

Tested on four stacks (one private, three public), replaying real historical bugs from their git history.

Project Stack Bug replayed / injected Caught Signal
consumer #1 (private) TS parsers + DuckDB supplier A amount explosion (real) T2 sum €63.5M → €3.91B, 74 group deltas, zero leakage
consumer #1 (private) TS parsers + DuckDB supplier B US-locale dates (real) T2 null_rate(accounting_period) +16pp, 96 group deltas

owid/etlef7a53c9b

, real)accidents

pudl09d8efa7d

, real)Every clean-control rerun produced zero beyond-tolerance facts, deterministic to the cent, and one caught real drift on its own (a canonical store stale against merged parser fixes). The recurring lesson: tie-outs and pipeline tests are blind to scaling, double-counting, and drop distortions. Shares renormalize, identities scale in lockstep, sets dedupe. Only the committed-baseline diff is anchored outside the change.

  • Scripts emit facts, never findings. Severity is the judgment layer's job. - Exit 2 ( evidence-unavailable

) isnever a pass. Missing input is a finding, not a skip. - Silence never looks green: unmatched files, skipped formulas, dropped metrics, and unavailable tiers are all reported facts.

  • A stale scratch database is never diffed; pre-existing scratch files are deleted before every re-run.

  • A tieout is only committed after it has been verified to hold on live data.

  • € impact is computed from evidence by script, never estimated by an agent.

  • Write data-review.yml

(schema:skills/data-review/manifest.schema.json

). data-review bless --manifest data-review.yml --lane <name>

, commit the baselines.data-review t1 --manifest data-review.yml

; a clean run means ready.

Cost: 3 minutes (dbt + DuckDB) to half a day (dagster monorepo).

── more in #ai-tools 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/show-hn-data-review-…] indexed:0 read:5min 2026-06-14 ·