# A UMAP With Arrows Is Not a Benchmark. This Is

> Source: <https://dev.to/gbadedata/a-umap-with-arrows-is-not-a-benchmark-this-is-1k7o>
> Published: 2026-06-16 23:51:24+00:00

*How I built a three-task evaluation framework for RNA velocity trajectory inference -- measuring global ordering, pairwise rank preservation, and robustness to missing branches on real pancreatic endocrinogenesis data.*

Most RNA velocity tutorials end the same way.

Moments are computed. Velocities are fitted. A UMAP appears with arrows pointing vaguely in the right direction. The author says "and here we can see the trajectory from progenitors to terminal cell types" and closes the notebook.

What they have produced is a picture. What they have not produced is evidence.

This post is about building evidence. Specifically, a three-task benchmark framework that evaluates whether RNA velocity actually recovers the known developmental ordering of pancreatic endocrinogenesis and whether the signal holds up when the rarest cell population is surgically removed from the dataset.

The dataset is pancreatic endocrinogenesis from Bastidas-Ponce et al. 2019 -- 3,696 cells captured at day 15.5 of mouse embryonic development. It is the canonical RNA velocity benchmark dataset, used in the Bergen et al. 2020 Nature Biotechnology paper introducing scVelo. The known developmental trajectory runs from Ductal progenitors through two Ngn3-expressing intermediate stages to four terminal hormone-producing cell types: Alpha, Beta, Delta, and Epsilon.

The benchmark has three tasks:

**Task 1** measures whether velocity pseudotime correlates with oracle diffusion pseudotime (Spearman rho). These are genuinely independent methods. Diffusion pseudotime uses the transcriptional similarity graph. RNA velocity uses the ratio of unspliced precursor to mature spliced mRNA. High correlation between two independent methods measuring the same biology is meaningful.

**Task 2** tests whether each consecutive developmental transition is statistically preserved using a one-sided Mann-Whitney U test. This catches cases where global rho is high but specific transitions are misordered.

**Task 3** removes all Epsilon cells (n=73) from the dataset, re-runs the velocity pipeline from scratch, and checks whether the remaining trajectory is still correctly recovered. A rho drop near zero means velocity is detecting global transcriptional momentum. A large drop would mean the pipeline was relying on the presence of the rare terminal state to establish directionality.

The oracle is diffusion pseudotime computed by `sc.tl.dpt()`

rooted at Ductal cells during preprocessing. It is computed fresh -- not loaded from a stored file -- using an independent algorithm.

This matters. If you evaluate RNA velocity pseudotime against a stored pseudotime that was itself computed by a velocity-adjacent method, you have circularity. The diffusion pseudotime oracle uses the transcriptional similarity graph, which has no knowledge of splicing kinetics. Evaluating a velocity prediction against a similarity-based oracle is a genuine independent test.

The oracle table before running any velocity:

| Cell type | Mean DPT (oracle) |
|---|---|
| Ductal | 0.060 |
| Ngn3 low EP | 0.076 |
| Ngn3 high EP | 0.478 |
| Pre-endocrine | 0.806 |
| Epsilon | 0.882 |
| Delta | 0.900 |
| Beta | 0.904 |
| Alpha | 0.905 |

This is the correct developmental ordering from the published biology. The oracle is valid.

```
Task 1 -- Trajectory Recovery (Spearman rho)
  rho = 0.8926  |  PASS

Task 2 -- Rank Preservation (Mann-Whitney U)
  5/7 pairs (71.4%)  |  FAIL

Task 3 -- Hidden Branch Recovery
  rho_full=0.8926  rho_masked=0.8897  drop=0.0029  |  ROBUST
```

Task 1 rho of 0.8926 is directly comparable to published scVelo benchmarks on the same dataset. Task 3 rho drop of 0.0029 is the most compelling result: removing the rarest cell type causes essentially no degradation. The velocity signal is global.

Task 2 failing at 71.4% is documented honestly. Here is the full pairwise table:

| Pair | Earlier mean vpt | Later mean vpt | Result |
|---|---|---|---|
| Ductal vs Ngn3 low EP | 0.170 | 0.233 | PASS |
| Ngn3 low EP vs Ngn3 high EP | 0.233 | 0.688 | PASS |
| Ngn3 high EP vs Pre-endocrine | 0.688 | 0.920 | PASS |
| Pre-endocrine vs Alpha | 0.920 | 0.928 | PASS |
| Pre-endocrine vs Beta | 0.920 | 0.959 | PASS |
| Pre-endocrine vs Delta | 0.920 | 0.839 | FAIL |
| Pre-endocrine vs Epsilon | 0.920 | 0.891 | FAIL |

The linear progenitor axis is perfectly ordered. The two failures are the two rarest terminal cell types: Delta (approximately 168 cells) and Epsilon (73 cells). Sparse populations produce fewer edges in the velocity graph, reducing the directional signal available to velocity pseudotime. This is a known limitation, and the benchmark surfaced it precisely.

The initial benchmark design listed the known ordering as:

``` php
Ductal -> Ngn3 low EP -> Ngn3 high EP -> Pre-endocrine -> Beta -> Alpha -> Delta -> Epsilon
```

Task 2 tested consecutive pairs: Beta < Alpha, Alpha < Delta, and so on.

This is biologically wrong.

Alpha, Beta, Delta, and Epsilon are not sequential fates. They are four parallel terminal lineages that branch simultaneously from Pre-endocrine. There is no defined ordering among them -- a Beta cell does not come after an Alpha cell. They are siblings, not parent and child.

Testing Beta < Alpha imposes a false linear structure on a biological fan-out. The benchmark was finding "failures" that were not failures at all -- they were the correct result of a mis-specified test.

The fix: test only the linear trunk (Ductal through Pre-endocrine) and each terminal fate separately against their common progenitor (Pre-endocrine). Never test terminal fates against each other.

```
# Wrong: imposes false linear ordering
known_ordering = ["Ductal", "Ngn3 low EP", "Ngn3 high EP",
                  "Pre-endocrine", "Beta", "Alpha", "Delta", "Epsilon"]

# Correct: respects the branching topology
known_ordering = ["Ductal", "Ngn3 low EP", "Ngn3 high EP", "Pre-endocrine", "Alpha"]
terminal_pairs = [
    ["Pre-endocrine", "Alpha"],
    ["Pre-endocrine", "Beta"],
    ["Pre-endocrine", "Delta"],
    ["Pre-endocrine", "Epsilon"],
]
```

This distinction -- between a linear trajectory and a branching fan-out -- is what separates someone who understands developmental biology from someone who has run a tutorial.

`scv.tl.velocity(mode='stochastic')`

raised `ValueError: setting an array element with a sequence`

on both synthetic fixtures and the real pancreas data.

The root cause is in scvelo's `leastsq_generalized`

:

```
gamma[i] = np.linalg.pinv(A.T.dot(A)).dot(A.T.dot(y[:, i]))
```

In NumPy 2.x, `pinv(A).dot(b)`

returns a 1-element array rather than a scalar. Assigning a 1-element array into a pre-allocated `float32`

scalar slot raises `ValueError`

. This is a known upstream incompatibility (github.com/theislab/scvelo/issues/966).

The fix: `mode='deterministic'`

. The deterministic steady-state model from LaManno et al. 2018 is the foundational method and produces correct trajectory ordering without triggering the NumPy 2.x incompatibility.

Adding `sc.pp.highly_variable_genes`

after `scv.pp.normalize_per_cell`

raised `ValueError: cannot specify integer bins when input data contains infinity`

.

Root cause: some genes have near-zero total counts after per-cell normalisation. When pandas tries to bin gene expression levels for HVG dispersion computation, infinity values in the bin array trigger the error.

Fix: remove the HVG step entirely. The scvelo workflow handles its own feature selection inside `scv.pp.moments`

. A separate HVG step before log1p is architecturally incorrect for the scvelo pipeline.

The initial implementation required `dpt_pseudotime`

to exist in the loaded AnnData. It does not. The pancreas h5ad from scvelo's repository stores cell-type labels and raw counts, but not pseudotime.

Fix: compute diffusion pseudotime during preprocessing using `sc.tl.dpt()`

rooted at Ductal cells. This is actually a better design -- the oracle is computed fresh from the data, not loaded from a file that might have been computed by a different pipeline version.

Unit tests for a velocity benchmark cannot download real data on every run. The solution is a biologically grounded synthetic fixture that embeds a genuine velocity signal.

```
# Velocity genes: unspliced leads spliced along the trajectory axis
# This is the precursor-product relationship RNA velocity detects
pseudotime = np.linspace(0, 1, n_cells)
spliced_velocity = np.outer(pseudotime, np.ones(n_velocity_genes)) * 8
unspliced_velocity = np.outer(
    np.clip(pseudotime + 0.15, 0, 1), np.ones(n_velocity_genes)
) * 8
```

The test `test_velocity_signal_in_early_cells`

verifies this works:

```
early_ratio = (unspliced[early_mask] / spliced[early_mask]).mean()
late_ratio = (unspliced[late_mask] / spliced[late_mask]).mean()
assert early_ratio > late_ratio
```

Early trajectory cells have higher unspliced-to-spliced ratio than late cells. This is the biological constraint RNA velocity is built to detect. If the fixture does not satisfy it, the fixture is wrong. Having a test that verifies this means you can trust the downstream velocity tests.

| Metric | Value |
|---|---|
| Dataset | Pancreas endocrinogenesis (Bastidas-Ponce 2019) |
| Cells | 3,696 |
| Velocity genes | 1,598 |
| Task 1 Spearman rho | 0.8926 PASS |
| Task 2 pairs passing | 5/7 (71.4%) FAIL |
| Task 3 rho drop | 0.0029 ROBUST |
| Tests passing | 99 |
| Test coverage | 92% |
| Pipeline runtime | 47.2 seconds |

The rho of 0.8926 is a good number. But what the project demonstrates is something different: the ability to ask whether the number is meaningful.

Anyone can run `scv.tl.velocity_pseudotime`

and get a number. The question that separates a bioinformatics engineer from a tutorial follower is: how do you know whether that number reflects real biology? The answer requires an independent oracle, a test that checks specific biologically defined transitions, a perturbation experiment, and honest documentation of what fails and why -- including cases like the terminal fates where the benchmark design itself needs correction because the biology is branching, not linear.

The benchmark framework is the contribution. The rho is just the output.

[github.com/gbadedata/scvelo-trajectory-benchmark](https://github.com/gbadedata/scvelo-trajectory-benchmark)

The README covers the full pipeline architecture, all four engineering challenges, the branching topology correction, and complete instructions for reproducing the results from a fresh clone.

*Building and shipping bioinformatics, data engineering, or DevOps projects, trajectory inference benchmarks, or single-cell evaluation frameworks? Connect on GitHub or LinkedIn.*