cd /news/machine-learning/a-umap-with-arrows-is-not-a-benchmar… · home topics machine-learning article
[ARTICLE · art-30318] src=dev.to ↗ pub= topic=machine-learning verified=true sentiment=· neutral

A UMAP With Arrows Is Not a Benchmark. This Is

A developer built a three-task evaluation framework for RNA velocity trajectory inference, measuring global ordering, pairwise rank preservation, and robustness to missing branches on pancreatic endocrinogenesis data. The benchmark revealed a Spearman correlation of 0.8926 with diffusion pseudotime, robust performance when rare cell types were removed, but only 71.4% of consecutive transitions were correctly ordered. The framework provides a more rigorous test than typical visual inspections of UMAP plots.

read8 min views1 publishedJun 16, 2026

How I built a three-task evaluation framework for RNA velocity trajectory inference -- measuring global ordering, pairwise rank preservation, and robustness to missing branches on real pancreatic endocrinogenesis data.

Most RNA velocity tutorials end the same way.

Moments are computed. Velocities are fitted. A UMAP appears with arrows pointing vaguely in the right direction. The author says "and here we can see the trajectory from progenitors to terminal cell types" and closes the notebook.

What they have produced is a picture. What they have not produced is evidence.

This post is about building evidence. Specifically, a three-task benchmark framework that evaluates whether RNA velocity actually recovers the known developmental ordering of pancreatic endocrinogenesis and whether the signal holds up when the rarest cell population is surgically removed from the dataset.

The dataset is pancreatic endocrinogenesis from Bastidas-Ponce et al. 2019 -- 3,696 cells captured at day 15.5 of mouse embryonic development. It is the canonical RNA velocity benchmark dataset, used in the Bergen et al. 2020 Nature Biotechnology paper introducing scVelo. The known developmental trajectory runs from Ductal progenitors through two Ngn3-expressing intermediate stages to four terminal hormone-producing cell types: Alpha, Beta, Delta, and Epsilon.

The benchmark has three tasks:

Task 1 measures whether velocity pseudotime correlates with oracle diffusion pseudotime (Spearman rho). These are genuinely independent methods. Diffusion pseudotime uses the transcriptional similarity graph. RNA velocity uses the ratio of unspliced precursor to mature spliced mRNA. High correlation between two independent methods measuring the same biology is meaningful.

Task 2 tests whether each consecutive developmental transition is statistically preserved using a one-sided Mann-Whitney U test. This catches cases where global rho is high but specific transitions are misordered.

Task 3 removes all Epsilon cells (n=73) from the dataset, re-runs the velocity pipeline from scratch, and checks whether the remaining trajectory is still correctly recovered. A rho drop near zero means velocity is detecting global transcriptional momentum. A large drop would mean the pipeline was relying on the presence of the rare terminal state to establish directionality.

The oracle is diffusion pseudotime computed by sc.tl.dpt()

rooted at Ductal cells during preprocessing. It is computed fresh -- not loaded from a stored file -- using an independent algorithm.

This matters. If you evaluate RNA velocity pseudotime against a stored pseudotime that was itself computed by a velocity-adjacent method, you have circularity. The diffusion pseudotime oracle uses the transcriptional similarity graph, which has no knowledge of splicing kinetics. Evaluating a velocity prediction against a similarity-based oracle is a genuine independent test.

The oracle table before running any velocity:

Cell type Mean DPT (oracle)
Ductal 0.060
Ngn3 low EP 0.076
Ngn3 high EP 0.478
Pre-endocrine 0.806
Epsilon 0.882
Delta 0.900
Beta 0.904
Alpha 0.905

This is the correct developmental ordering from the published biology. The oracle is valid.

Task 1 -- Trajectory Recovery (Spearman rho)
  rho = 0.8926  |  PASS

Task 2 -- Rank Preservation (Mann-Whitney U)
  5/7 pairs (71.4%)  |  FAIL

Task 3 -- Hidden Branch Recovery
  rho_full=0.8926  rho_masked=0.8897  drop=0.0029  |  ROBUST

Task 1 rho of 0.8926 is directly comparable to published scVelo benchmarks on the same dataset. Task 3 rho drop of 0.0029 is the most compelling result: removing the rarest cell type causes essentially no degradation. The velocity signal is global.

Task 2 failing at 71.4% is documented honestly. Here is the full pairwise table:

Pair Earlier mean vpt Later mean vpt Result
Ductal vs Ngn3 low EP 0.170 0.233 PASS
Ngn3 low EP vs Ngn3 high EP 0.233 0.688 PASS
Ngn3 high EP vs Pre-endocrine 0.688 0.920 PASS
Pre-endocrine vs Alpha 0.920 0.928 PASS
Pre-endocrine vs Beta 0.920 0.959 PASS
Pre-endocrine vs Delta 0.920 0.839 FAIL
Pre-endocrine vs Epsilon 0.920 0.891 FAIL

The linear progenitor axis is perfectly ordered. The two failures are the two rarest terminal cell types: Delta (approximately 168 cells) and Epsilon (73 cells). Sparse populations produce fewer edges in the velocity graph, reducing the directional signal available to velocity pseudotime. This is a known limitation, and the benchmark surfaced it precisely.

The initial benchmark design listed the known ordering as:

Ductal -> Ngn3 low EP -> Ngn3 high EP -> Pre-endocrine -> Beta -> Alpha -> Delta -> Epsilon

Task 2 tested consecutive pairs: Beta < Alpha, Alpha < Delta, and so on.

This is biologically wrong.

Alpha, Beta, Delta, and Epsilon are not sequential fates. They are four parallel terminal lineages that branch simultaneously from Pre-endocrine. There is no defined ordering among them -- a Beta cell does not come after an Alpha cell. They are siblings, not parent and child.

Testing Beta < Alpha imposes a false linear structure on a biological fan-out. The benchmark was finding "failures" that were not failures at all -- they were the correct result of a mis-specified test.

The fix: test only the linear trunk (Ductal through Pre-endocrine) and each terminal fate separately against their common progenitor (Pre-endocrine). Never test terminal fates against each other.

known_ordering = ["Ductal", "Ngn3 low EP", "Ngn3 high EP",
                  "Pre-endocrine", "Beta", "Alpha", "Delta", "Epsilon"]

known_ordering = ["Ductal", "Ngn3 low EP", "Ngn3 high EP", "Pre-endocrine", "Alpha"]
terminal_pairs = [
    ["Pre-endocrine", "Alpha"],
    ["Pre-endocrine", "Beta"],
    ["Pre-endocrine", "Delta"],
    ["Pre-endocrine", "Epsilon"],
]

This distinction -- between a linear trajectory and a branching fan-out -- is what separates someone who understands developmental biology from someone who has run a tutorial.

scv.tl.velocity(mode='stochastic')

raised ValueError: setting an array element with a sequence

on both synthetic fixtures and the real pancreas data.

The root cause is in scvelo's leastsq_generalized

:

gamma[i] = np.linalg.pinv(A.T.dot(A)).dot(A.T.dot(y[:, i]))

In NumPy 2.x, pinv(A).dot(b)

returns a 1-element array rather than a scalar. Assigning a 1-element array into a pre-allocated float32

scalar slot raises ValueError

. This is a known upstream incompatibility (github.com/theislab/scvelo/issues/966).

The fix: mode='deterministic'

. The deterministic steady-state model from LaManno et al. 2018 is the foundational method and produces correct trajectory ordering without triggering the NumPy 2.x incompatibility.

Adding sc.pp.highly_variable_genes

after scv.pp.normalize_per_cell

raised ValueError: cannot specify integer bins when input data contains infinity

.

Root cause: some genes have near-zero total counts after per-cell normalisation. When pandas tries to bin gene expression levels for HVG dispersion computation, infinity values in the bin array trigger the error.

Fix: remove the HVG step entirely. The scvelo workflow handles its own feature selection inside scv.pp.moments

. A separate HVG step before log1p is architecturally incorrect for the scvelo pipeline.

The initial implementation required dpt_pseudotime

to exist in the loaded AnnData. It does not. The pancreas h5ad from scvelo's repository stores cell-type labels and raw counts, but not pseudotime.

Fix: compute diffusion pseudotime during preprocessing using sc.tl.dpt()

rooted at Ductal cells. This is actually a better design -- the oracle is computed fresh from the data, not loaded from a file that might have been computed by a different pipeline version.

Unit tests for a velocity benchmark cannot download real data on every run. The solution is a biologically grounded synthetic fixture that embeds a genuine velocity signal.

pseudotime = np.linspace(0, 1, n_cells)
spliced_velocity = np.outer(pseudotime, np.ones(n_velocity_genes)) * 8
unspliced_velocity = np.outer(
    np.clip(pseudotime + 0.15, 0, 1), np.ones(n_velocity_genes)
) * 8

The test test_velocity_signal_in_early_cells

verifies this works:

early_ratio = (unspliced[early_mask] / spliced[early_mask]).mean()
late_ratio = (unspliced[late_mask] / spliced[late_mask]).mean()
assert early_ratio > late_ratio

Early trajectory cells have higher unspliced-to-spliced ratio than late cells. This is the biological constraint RNA velocity is built to detect. If the fixture does not satisfy it, the fixture is wrong. Having a test that verifies this means you can trust the downstream velocity tests.

Metric Value
Dataset Pancreas endocrinogenesis (Bastidas-Ponce 2019)
Cells 3,696
Velocity genes 1,598
Task 1 Spearman rho 0.8926 PASS
Task 2 pairs passing 5/7 (71.4%) FAIL
Task 3 rho drop 0.0029 ROBUST
Tests passing 99
Test coverage 92%
Pipeline runtime 47.2 seconds

The rho of 0.8926 is a good number. But what the project demonstrates is something different: the ability to ask whether the number is meaningful.

Anyone can run scv.tl.velocity_pseudotime

and get a number. The question that separates a bioinformatics engineer from a tutorial follower is: how do you know whether that number reflects real biology? The answer requires an independent oracle, a test that checks specific biologically defined transitions, a perturbation experiment, and honest documentation of what fails and why -- including cases like the terminal fates where the benchmark design itself needs correction because the biology is branching, not linear.

The benchmark framework is the contribution. The rho is just the output.

github.com/gbadedata/scvelo-trajectory-benchmark

The README covers the full pipeline architecture, all four engineering challenges, the branching topology correction, and complete instructions for reproducing the results from a fresh clone.

Building and shipping bioinformatics, data engineering, or DevOps projects, trajectory inference benchmarks, or single-cell evaluation frameworks? Connect on GitHub or LinkedIn.

── more in #machine-learning 4 stories · sorted by recency
── more on @scvelo 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/a-umap-with-arrows-i…] indexed:0 read:8min 2026-06-16 ·