{"slug": "a-umap-with-arrows-is-not-a-benchmark-this-is", "title": "A UMAP With Arrows Is Not a Benchmark. This Is", "summary": "A developer built a three-task evaluation framework for RNA velocity trajectory inference, measuring global ordering, pairwise rank preservation, and robustness to missing branches on pancreatic endocrinogenesis data. The benchmark revealed a Spearman correlation of 0.8926 with diffusion pseudotime, robust performance when rare cell types were removed, but only 71.4% of consecutive transitions were correctly ordered. The framework provides a more rigorous test than typical visual inspections of UMAP plots.", "body_md": "*How I built a three-task evaluation framework for RNA velocity trajectory inference -- measuring global ordering, pairwise rank preservation, and robustness to missing branches on real pancreatic endocrinogenesis data.*\n\nMost RNA velocity tutorials end the same way.\n\nMoments are computed. Velocities are fitted. A UMAP appears with arrows pointing vaguely in the right direction. The author says \"and here we can see the trajectory from progenitors to terminal cell types\" and closes the notebook.\n\nWhat they have produced is a picture. What they have not produced is evidence.\n\nThis post is about building evidence. Specifically, a three-task benchmark framework that evaluates whether RNA velocity actually recovers the known developmental ordering of pancreatic endocrinogenesis and whether the signal holds up when the rarest cell population is surgically removed from the dataset.\n\nThe dataset is pancreatic endocrinogenesis from Bastidas-Ponce et al. 2019 -- 3,696 cells captured at day 15.5 of mouse embryonic development. It is the canonical RNA velocity benchmark dataset, used in the Bergen et al. 2020 Nature Biotechnology paper introducing scVelo. The known developmental trajectory runs from Ductal progenitors through two Ngn3-expressing intermediate stages to four terminal hormone-producing cell types: Alpha, Beta, Delta, and Epsilon.\n\nThe benchmark has three tasks:\n\n**Task 1** measures whether velocity pseudotime correlates with oracle diffusion pseudotime (Spearman rho). These are genuinely independent methods. Diffusion pseudotime uses the transcriptional similarity graph. RNA velocity uses the ratio of unspliced precursor to mature spliced mRNA. High correlation between two independent methods measuring the same biology is meaningful.\n\n**Task 2** tests whether each consecutive developmental transition is statistically preserved using a one-sided Mann-Whitney U test. This catches cases where global rho is high but specific transitions are misordered.\n\n**Task 3** removes all Epsilon cells (n=73) from the dataset, re-runs the velocity pipeline from scratch, and checks whether the remaining trajectory is still correctly recovered. A rho drop near zero means velocity is detecting global transcriptional momentum. A large drop would mean the pipeline was relying on the presence of the rare terminal state to establish directionality.\n\nThe oracle is diffusion pseudotime computed by `sc.tl.dpt()`\n\nrooted at Ductal cells during preprocessing. It is computed fresh -- not loaded from a stored file -- using an independent algorithm.\n\nThis matters. If you evaluate RNA velocity pseudotime against a stored pseudotime that was itself computed by a velocity-adjacent method, you have circularity. The diffusion pseudotime oracle uses the transcriptional similarity graph, which has no knowledge of splicing kinetics. Evaluating a velocity prediction against a similarity-based oracle is a genuine independent test.\n\nThe oracle table before running any velocity:\n\n| Cell type | Mean DPT (oracle) |\n|---|---|\n| Ductal | 0.060 |\n| Ngn3 low EP | 0.076 |\n| Ngn3 high EP | 0.478 |\n| Pre-endocrine | 0.806 |\n| Epsilon | 0.882 |\n| Delta | 0.900 |\n| Beta | 0.904 |\n| Alpha | 0.905 |\n\nThis is the correct developmental ordering from the published biology. The oracle is valid.\n\n```\nTask 1 -- Trajectory Recovery (Spearman rho)\n  rho = 0.8926  |  PASS\n\nTask 2 -- Rank Preservation (Mann-Whitney U)\n  5/7 pairs (71.4%)  |  FAIL\n\nTask 3 -- Hidden Branch Recovery\n  rho_full=0.8926  rho_masked=0.8897  drop=0.0029  |  ROBUST\n```\n\nTask 1 rho of 0.8926 is directly comparable to published scVelo benchmarks on the same dataset. Task 3 rho drop of 0.0029 is the most compelling result: removing the rarest cell type causes essentially no degradation. The velocity signal is global.\n\nTask 2 failing at 71.4% is documented honestly. Here is the full pairwise table:\n\n| Pair | Earlier mean vpt | Later mean vpt | Result |\n|---|---|---|---|\n| Ductal vs Ngn3 low EP | 0.170 | 0.233 | PASS |\n| Ngn3 low EP vs Ngn3 high EP | 0.233 | 0.688 | PASS |\n| Ngn3 high EP vs Pre-endocrine | 0.688 | 0.920 | PASS |\n| Pre-endocrine vs Alpha | 0.920 | 0.928 | PASS |\n| Pre-endocrine vs Beta | 0.920 | 0.959 | PASS |\n| Pre-endocrine vs Delta | 0.920 | 0.839 | FAIL |\n| Pre-endocrine vs Epsilon | 0.920 | 0.891 | FAIL |\n\nThe linear progenitor axis is perfectly ordered. The two failures are the two rarest terminal cell types: Delta (approximately 168 cells) and Epsilon (73 cells). Sparse populations produce fewer edges in the velocity graph, reducing the directional signal available to velocity pseudotime. This is a known limitation, and the benchmark surfaced it precisely.\n\nThe initial benchmark design listed the known ordering as:\n\n``` php\nDuctal -> Ngn3 low EP -> Ngn3 high EP -> Pre-endocrine -> Beta -> Alpha -> Delta -> Epsilon\n```\n\nTask 2 tested consecutive pairs: Beta < Alpha, Alpha < Delta, and so on.\n\nThis is biologically wrong.\n\nAlpha, Beta, Delta, and Epsilon are not sequential fates. They are four parallel terminal lineages that branch simultaneously from Pre-endocrine. There is no defined ordering among them -- a Beta cell does not come after an Alpha cell. They are siblings, not parent and child.\n\nTesting Beta < Alpha imposes a false linear structure on a biological fan-out. The benchmark was finding \"failures\" that were not failures at all -- they were the correct result of a mis-specified test.\n\nThe fix: test only the linear trunk (Ductal through Pre-endocrine) and each terminal fate separately against their common progenitor (Pre-endocrine). Never test terminal fates against each other.\n\n```\n# Wrong: imposes false linear ordering\nknown_ordering = [\"Ductal\", \"Ngn3 low EP\", \"Ngn3 high EP\",\n                  \"Pre-endocrine\", \"Beta\", \"Alpha\", \"Delta\", \"Epsilon\"]\n\n# Correct: respects the branching topology\nknown_ordering = [\"Ductal\", \"Ngn3 low EP\", \"Ngn3 high EP\", \"Pre-endocrine\", \"Alpha\"]\nterminal_pairs = [\n    [\"Pre-endocrine\", \"Alpha\"],\n    [\"Pre-endocrine\", \"Beta\"],\n    [\"Pre-endocrine\", \"Delta\"],\n    [\"Pre-endocrine\", \"Epsilon\"],\n]\n```\n\nThis distinction -- between a linear trajectory and a branching fan-out -- is what separates someone who understands developmental biology from someone who has run a tutorial.\n\n`scv.tl.velocity(mode='stochastic')`\n\nraised `ValueError: setting an array element with a sequence`\n\non both synthetic fixtures and the real pancreas data.\n\nThe root cause is in scvelo's `leastsq_generalized`\n\n:\n\n```\ngamma[i] = np.linalg.pinv(A.T.dot(A)).dot(A.T.dot(y[:, i]))\n```\n\nIn NumPy 2.x, `pinv(A).dot(b)`\n\nreturns a 1-element array rather than a scalar. Assigning a 1-element array into a pre-allocated `float32`\n\nscalar slot raises `ValueError`\n\n. This is a known upstream incompatibility (github.com/theislab/scvelo/issues/966).\n\nThe fix: `mode='deterministic'`\n\n. The deterministic steady-state model from LaManno et al. 2018 is the foundational method and produces correct trajectory ordering without triggering the NumPy 2.x incompatibility.\n\nAdding `sc.pp.highly_variable_genes`\n\nafter `scv.pp.normalize_per_cell`\n\nraised `ValueError: cannot specify integer bins when input data contains infinity`\n\n.\n\nRoot cause: some genes have near-zero total counts after per-cell normalisation. When pandas tries to bin gene expression levels for HVG dispersion computation, infinity values in the bin array trigger the error.\n\nFix: remove the HVG step entirely. The scvelo workflow handles its own feature selection inside `scv.pp.moments`\n\n. A separate HVG step before log1p is architecturally incorrect for the scvelo pipeline.\n\nThe initial implementation required `dpt_pseudotime`\n\nto exist in the loaded AnnData. It does not. The pancreas h5ad from scvelo's repository stores cell-type labels and raw counts, but not pseudotime.\n\nFix: compute diffusion pseudotime during preprocessing using `sc.tl.dpt()`\n\nrooted at Ductal cells. This is actually a better design -- the oracle is computed fresh from the data, not loaded from a file that might have been computed by a different pipeline version.\n\nUnit tests for a velocity benchmark cannot download real data on every run. The solution is a biologically grounded synthetic fixture that embeds a genuine velocity signal.\n\n```\n# Velocity genes: unspliced leads spliced along the trajectory axis\n# This is the precursor-product relationship RNA velocity detects\npseudotime = np.linspace(0, 1, n_cells)\nspliced_velocity = np.outer(pseudotime, np.ones(n_velocity_genes)) * 8\nunspliced_velocity = np.outer(\n    np.clip(pseudotime + 0.15, 0, 1), np.ones(n_velocity_genes)\n) * 8\n```\n\nThe test `test_velocity_signal_in_early_cells`\n\nverifies this works:\n\n```\nearly_ratio = (unspliced[early_mask] / spliced[early_mask]).mean()\nlate_ratio = (unspliced[late_mask] / spliced[late_mask]).mean()\nassert early_ratio > late_ratio\n```\n\nEarly trajectory cells have higher unspliced-to-spliced ratio than late cells. This is the biological constraint RNA velocity is built to detect. If the fixture does not satisfy it, the fixture is wrong. Having a test that verifies this means you can trust the downstream velocity tests.\n\n| Metric | Value |\n|---|---|\n| Dataset | Pancreas endocrinogenesis (Bastidas-Ponce 2019) |\n| Cells | 3,696 |\n| Velocity genes | 1,598 |\n| Task 1 Spearman rho | 0.8926 PASS |\n| Task 2 pairs passing | 5/7 (71.4%) FAIL |\n| Task 3 rho drop | 0.0029 ROBUST |\n| Tests passing | 99 |\n| Test coverage | 92% |\n| Pipeline runtime | 47.2 seconds |\n\nThe rho of 0.8926 is a good number. But what the project demonstrates is something different: the ability to ask whether the number is meaningful.\n\nAnyone can run `scv.tl.velocity_pseudotime`\n\nand get a number. The question that separates a bioinformatics engineer from a tutorial follower is: how do you know whether that number reflects real biology? The answer requires an independent oracle, a test that checks specific biologically defined transitions, a perturbation experiment, and honest documentation of what fails and why -- including cases like the terminal fates where the benchmark design itself needs correction because the biology is branching, not linear.\n\nThe benchmark framework is the contribution. The rho is just the output.\n\n[github.com/gbadedata/scvelo-trajectory-benchmark](https://github.com/gbadedata/scvelo-trajectory-benchmark)\n\nThe README covers the full pipeline architecture, all four engineering challenges, the branching topology correction, and complete instructions for reproducing the results from a fresh clone.\n\n*Building and shipping bioinformatics, data engineering, or DevOps projects, trajectory inference benchmarks, or single-cell evaluation frameworks? Connect on GitHub or LinkedIn.*", "url": "https://wpnews.pro/news/a-umap-with-arrows-is-not-a-benchmark-this-is", "canonical_source": "https://dev.to/gbadedata/a-umap-with-arrows-is-not-a-benchmark-this-is-1k7o", "published_at": "2026-06-16 23:51:24+00:00", "updated_at": "2026-06-17 00:21:28.639783+00:00", "lang": "en", "topics": ["machine-learning", "developer-tools", "ai-research"], "entities": ["scVelo", "Bastidas-Ponce et al. 2019", "Bergen et al. 2020", "Nature Biotechnology", "UMAP", "RNA velocity", "diffusion pseudotime", "Mann-Whitney U test"], "alternates": {"html": "https://wpnews.pro/news/a-umap-with-arrows-is-not-a-benchmark-this-is", "markdown": "https://wpnews.pro/news/a-umap-with-arrows-is-not-a-benchmark-this-is.md", "text": "https://wpnews.pro/news/a-umap-with-arrows-is-not-a-benchmark-this-is.txt", "jsonld": "https://wpnews.pro/news/a-umap-with-arrows-is-not-a-benchmark-this-is.jsonld"}}