A UMAP With Arrows Is Not a Benchmark. This Is

wpnews.pro

How I built a three-task evaluation framework for RNA velocity trajectory inference -- measuring global ordering, pairwise rank preservation, and robustness to missing branches on real pancreatic endocrinogenesis data.

Most RNA velocity tutorials end the same way.

Moments are computed. Velocities are fitted. A UMAP appears with arrows pointing vaguely in the right direction. The author says "and here we can see the trajectory from progenitors to terminal cell types" and closes the notebook.

What they have produced is a picture. What they have not produced is evidence.

This post is about building evidence. Specifically, a three-task benchmark framework that evaluates whether RNA velocity actually recovers the known developmental ordering of pancreatic endocrinogenesis and whether the signal holds up when the rarest cell population is surgically removed from the dataset.

The dataset is pancreatic endocrinogenesis from Bastidas-Ponce et al. 2019 -- 3,696 cells captured at day 15.5 of mouse embryonic development. It is the canonical RNA velocity benchmark dataset, used in the Bergen et al. 2020 Nature Biotechnology paper introducing scVelo. The known developmental trajectory runs from Ductal progenitors through two Ngn3-expressing intermediate stages to four terminal hormone-producing cell types: Alpha, Beta, Delta, and Epsilon.

The benchmark has three tasks:

Task 1 measures whether velocity pseudotime correlates with oracle diffusion pseudotime (Spearman rho). These are genuinely independent methods. Diffusion pseudotime uses the transcriptional similarity graph. RNA velocity uses the ratio of unspliced precursor to mature spliced mRNA. High correlation between two independent methods measuring the same biology is meaningful.

Task 2 tests whether each consecutive developmental transition is statistically preserved using a one-sided Mann-Whitney U test. This catches cases where global rho is high but specific transitions are misordered.

Task 3 removes all Epsilon cells (n=73) from the dataset, re-runs the velocity pipeline from scratch, and checks whether the remaining trajectory is still correctly recovered. A rho drop near zero means velocity is detecting global transcriptional momentum. A large drop would mean the pipeline was relying on the presence of the rare terminal state to establish directionality.

The oracle is diffusion pseudotime computed by sc.tl.dpt()

rooted at Ductal cells during preprocessing. It is computed fresh -- not loaded from a stored file -- using an independent algorithm.

This matters. If you evaluate RNA velocity pseudotime against a stored pseudotime that was itself computed by a velocity-adjacent method, you have circularity. The diffusion pseudotime oracle uses the transcriptional similarity graph, which has no knowledge of splicing kinetics. Evaluating a velocity prediction against a similarity-based oracle is a genuine independent test.

The oracle table before running any velocity:

Cell type	Mean DPT (oracle)
Ductal	0.060
Ngn3 low EP	0.076
Ngn3 high EP	0.478
Pre-endocrine	0.806
Epsilon	0.882
Delta	0.900
Beta	0.904
Alpha	0.905

This is the correct developmental ordering from the published biology. The oracle is valid.

Task 1 -- Trajectory Recovery (Spearman rho)
  rho = 0.8926  |  PASS

Task 2 -- Rank Preservation (Mann-Whitney U)
  5/7 pairs (71.4%)  |  FAIL

Task 3 -- Hidden Branch Recovery
  rho_full=0.8926  rho_masked=0.8897  drop=0.0029  |  ROBUST

Task 1 rho of 0.8926 is directly comparable to published scVelo benchmarks on the same dataset. Task 3 rho drop of 0.0029 is the most compelling result: removing the rarest cell type causes essentially no degradation. The velocity signal is global.

Task 2 failing at 71.4% is documented honestly. Here is the full pairwise table:

Pair	Earlier mean vpt	Later mean vpt	Result
Ductal vs Ngn3 low EP	0.170	0.233	PASS
Ngn3 low EP vs Ngn3 high EP	0.233	0.688	PASS
Ngn3 high EP vs Pre-endocrine	0.688	0.920	PASS
Pre-endocrine vs Alpha	0.920	0.928	PASS
Pre-endocrine vs Beta	0.920	0.959	PASS
Pre-endocrine vs Delta	0.920	0.839	FAIL
Pre-endocrine vs Epsilon	0.920	0.891	FAIL

The linear progenitor axis is perfectly ordered. The two failures are the two rarest terminal cell types: Delta (approximately 168 cells) and Epsilon (73 cells). Sparse populations produce fewer edges in the velocity graph, reducing the directional signal available to velocity pseudotime. This is a known limitation, and the benchmark surfaced it precisely.

The initial benchmark design listed the known ordering as:

Ductal -> Ngn3 low EP -> Ngn3 high EP -> Pre-endocrine -> Beta -> Alpha -> Delta -> Epsilon

Task 2 tested consecutive pairs: Beta < Alpha, Alpha < Delta, and so on.

This is biologically wrong.

Alpha, Beta, Delta, and Epsilon are not sequential fates. They are four parallel terminal lineages that branch simultaneously from Pre-endocrine. There is no defined ordering among them -- a Beta cell does not come after an Alpha cell. They are siblings, not parent and child.

Testing Beta < Alpha imposes a false linear structure on a biological fan-out. The benchmark was finding "failures" that were not failures at all -- they were the correct result of a mis-specified test.

The fix: test only the linear trunk (Ductal through Pre-endocrine) and each terminal fate separately against their common progenitor (Pre-endocrine). Never test terminal fates against each other.

known_ordering = ["Ductal", "Ngn3 low EP", "Ngn3 high EP",
                  "Pre-endocrine", "Beta", "Alpha", "Delta", "Epsilon"]

known_ordering = ["Ductal", "Ngn3 low EP", "Ngn3 high EP", "Pre-endocrine", "Alpha"]
terminal_pairs = [
    ["Pre-endocrine", "Alpha"],
    ["Pre-endocrine", "Beta"],
    ["Pre-endocrine", "Delta"],
    ["Pre-endocrine", "Epsilon"],
]

This distinction -- between a linear trajectory and a branching fan-out -- is what separates someone who understands developmental biology from someone who has run a tutorial.

scv.tl.velocity(mode='stochastic')

raised ValueError: setting an array element with a sequence

on both synthetic fixtures and the real pancreas data.

The root cause is in scvelo's leastsq_generalized

:

gamma[i] = np.linalg.pinv(A.T.dot(A)).dot(A.T.dot(y[:, i]))

In NumPy 2.x, pinv(A).dot(b)

returns a 1-element array rather than a scalar. Assigning a 1-element array into a pre-allocated float32

scalar slot raises ValueError

. This is a known upstream incompatibility (github.com/theislab/scvelo/issues/966).

The fix: mode='deterministic'

. The deterministic steady-state model from LaManno et al. 2018 is the foundational method and produces correct trajectory ordering without triggering the NumPy 2.x incompatibility.

Adding sc.pp.highly_variable_genes

after scv.pp.normalize_per_cell

raised ValueError: cannot specify integer bins when input data contains infinity

.

Root cause: some genes have near-zero total counts after per-cell normalisation. When pandas tries to bin gene expression levels for HVG dispersion computation, infinity values in the bin array trigger the error.

Fix: remove the HVG step entirely. The scvelo workflow handles its own feature selection inside scv.pp.moments

. A separate HVG step before log1p is architecturally incorrect for the scvelo pipeline.

The initial implementation required dpt_pseudotime

to exist in the loaded AnnData. It does not. The pancreas h5ad from scvelo's repository stores cell-type labels and raw counts, but not pseudotime.

Fix: compute diffusion pseudotime during preprocessing using sc.tl.dpt()

rooted at Ductal cells. This is actually a better design -- the oracle is computed fresh from the data, not loaded from a file that might have been computed by a different pipeline version.

Unit tests for a velocity benchmark cannot download real data on every run. The solution is a biologically grounded synthetic fixture that embeds a genuine velocity signal.

pseudotime = np.linspace(0, 1, n_cells)
spliced_velocity = np.outer(pseudotime, np.ones(n_velocity_genes)) * 8
unspliced_velocity = np.outer(
    np.clip(pseudotime + 0.15, 0, 1), np.ones(n_velocity_genes)
) * 8

The test test_velocity_signal_in_early_cells

verifies this works:

early_ratio = (unspliced[early_mask] / spliced[early_mask]).mean()
late_ratio = (unspliced[late_mask] / spliced[late_mask]).mean()
assert early_ratio > late_ratio

Early trajectory cells have higher unspliced-to-spliced ratio than late cells. This is the biological constraint RNA velocity is built to detect. If the fixture does not satisfy it, the fixture is wrong. Having a test that verifies this means you can trust the downstream velocity tests.

Metric	Value
Dataset	Pancreas endocrinogenesis (Bastidas-Ponce 2019)
Cells	3,696
Velocity genes	1,598
Task 1 Spearman rho	0.8926 PASS
Task 2 pairs passing	5/7 (71.4%) FAIL
Task 3 rho drop	0.0029 ROBUST
Tests passing	99
Test coverage	92%
Pipeline runtime	47.2 seconds

The rho of 0.8926 is a good number. But what the project demonstrates is something different: the ability to ask whether the number is meaningful.

Anyone can run scv.tl.velocity_pseudotime

and get a number. The question that separates a bioinformatics engineer from a tutorial follower is: how do you know whether that number reflects real biology? The answer requires an independent oracle, a test that checks specific biologically defined transitions, a perturbation experiment, and honest documentation of what fails and why -- including cases like the terminal fates where the benchmark design itself needs correction because the biology is branching, not linear.

The benchmark framework is the contribution. The rho is just the output.

github.com/gbadedata/scvelo-trajectory-benchmark

The README covers the full pipeline architecture, all four engineering challenges, the branching topology correction, and complete instructions for reproducing the results from a fresh clone.

Building and shipping bioinformatics, data engineering, or DevOps projects, trajectory inference benchmarks, or single-cell evaluation frameworks? Connect on GitHub or LinkedIn.

source & further reading

dev.to — original article I stopped reviewing my own code. Here's what had to be true first. qm multiplayer AI agent tutorial: Cut Latency 20% with Node.js Will AI replace software?

A UMAP With Arrows Is Not a Benchmark. This Is

Run your AI side-project on zahid.host