Spectral DPPs via NEPv: A Scalable Continuous Relaxation of Determinantal MAP for Diversity-Aware Data Selection

wpnews.pro

Source: arxiv.org · Published: 2026-06-19T04:00Z · Topic: machine-learning

Researchers introduced a continuous relaxation of the determinantal point process (DPP) maximum a posteriori (MAP) problem for diversity-aware data selection, reformulating it as a nonlinear eigenvalue problem (NEPv) on the Stiefel manifold. The resulting algorithm, Spectral DPPs via NEPv (SDvN), runs in near-linear time relative to the ground-set size, enabling scalable subset selection from millions to billions of candidates for applications like data curation, active learning, and retrieval diversification.
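To make the discrete problem being relaxed concrete, here is a minimal sketch of the DPP MAP objective (the log-determinant of the kernel restricted to a subset) together with a naive greedy baseline. It is an illustration, not the paper's method: the function names, the tiny feature matrix `X`, and the linear kernel `L = X X^T` are all assumptions made for this example. Note that it materializes the full n-by-n kernel, which is exactly the kind of cost that stops being affordable at large ground-set sizes.

```python
import numpy as np

def logdet_subset(L, S):
    """Log-determinant of the kernel submatrix indexed by S (the DPP MAP objective)."""
    sign, val = np.linalg.slogdet(L[np.ix_(S, S)])
    return val if sign > 0 else -np.inf

def greedy_dpp_map(L, k):
    """Naive greedy baseline: repeatedly add the item that most increases log det."""
    S = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(L.shape[0]):
            if i in S:
                continue
            val = logdet_subset(L, S + [i])
            if val > best_val:
                best, best_val = i, val
        S.append(best)
    return S

# Hypothetical toy data: items 0 and 1 point in nearly the same direction,
# item 2 is orthogonal to item 0.
X = np.array([[1.2, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
L = X @ X.T  # dense linear kernel, O(n^2) memory

print(greedy_dpp_map(L, 2))  # → [0, 2]
```

The near-duplicate pair {0, 1} has a determinant close to zero, so the objective steers the selection toward the orthogonal item instead; that is the sense in which the determinant rewards diversity.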

1 min read · published Jun 19, 2026

arXiv:2606.19411v1. Abstract: Selecting a small, diverse, high-quality subset from a massive pool of candidates is a recurring primitive in modern machine learning: data curation and coreset selection for training and fine-tuning large models, active-learning batch acquisition, prompt and exemplar selection for in-context learning, retrieval diversification, and experimental design. Determinantal Point Processes (DPPs) give a principled, well-calibrated notion of diversity for this task, but their MAP objective (pick a size-$k$ subset $S$ maximizing $\log\det(L_S)$) is NP-hard, and the standard greedy and sampling algorithms scale superlinearly in the ground-set size $n$. This cost is prohibitive precisely in the data-centric regime where diversity matters most, where $n$ ranges over millions to billions of candidate examples, features, or embeddings. We recast DPP-MAP as a continuous optimization problem over the Stiefel manifold, and show that its first-order optimality conditions form a Nonlinear Eigenvalue Problem with eigenvector dependency (NEPv) of a previously unstudied form. This NEPv admits a self-consistent field (SCF) iteration with a spectral-gap-based local contraction guarantee, giving a principled iterative solver where the diversity objective drives an eigenvector-dependent operator. The resulting algorithm, SDvN, requires only matrix-vector products with the kernel and runs in time $O\big((ndk+nk^2)\,t\big)$ for a small number of iterations $t$, scaling near-linearly in $n$ and integrating directly with low-rank and feature-map kernels common in ML. This paper focuses on the relaxation, solver, and scaling analysis; full real-data benchmarking is left to a planned empirical study.
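The abstract does not spell out the paper's eigenvector-dependent operator, so the following is only a toy sketch of what an SCF-style loop on the Stiefel manifold can look like and where a per-iteration cost of order $ndk + nk^2$ comes from. Everything here is an assumption for illustration: the kernel is taken to be low-rank, $L = BB^\top$ with a hypothetical $n \times d$ feature matrix `B` (the abstract does not define $d$), and the operator $H(V) = BB^\top + \alpha\,\mathrm{Diag}(\mathrm{diag}(VV^\top))$ is a stand-in NEPv, not the one derived in the paper. Each update takes a single orthogonal-iteration step with the current operator instead of solving a full eigenproblem.

```python
import numpy as np

def make_features(n, d, seed=0):
    """Hypothetical n x d feature matrix with a clear spectral gap after 3 directions."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))  # orthonormal columns
    s = np.concatenate([[10.0, 9.0, 8.0], np.linspace(1.0, 0.1, d - 3)])
    return U * s  # scale column j by s[j]

def apply_H(B, V, alpha):
    """Matrix-free product H(V) @ V for the toy operator
    H(V) = B B^T + alpha * Diag(diag(V V^T)). The n x n kernel is never formed."""
    rho = np.sum(V * V, axis=1)                      # diag(V V^T), O(nk)
    return B @ (B.T @ V) + alpha * rho[:, None] * V  # two thin products, O(ndk)

def scf_subspace_iteration(B, k, alpha=0.1, iters=50, seed=1):
    """SCF-style fixed-point loop: apply the current operator, then re-orthonormalize."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((B.shape[0], k)))
    for _ in range(iters):
        V, _ = np.linalg.qr(apply_H(B, V, alpha))    # thin QR, O(nk^2)
    return V

def nepv_residual(B, V, alpha):
    """Norm of H(V)V minus its projection onto span(V); zero at a self-consistent V."""
    HV = apply_H(B, V, alpha)
    return np.linalg.norm(HV - V @ (V.T @ HV))

B = make_features(n=500, d=8)
V = scf_subspace_iteration(B, k=3)
print(nepv_residual(B, V, alpha=0.1))  # small when the loop has converged
```

The toy spectrum is chosen with a large gap between the third and fourth directions so that the loop contracts quickly; this mirrors, in a loose way, the role the abstract assigns to the spectral gap, but it says nothing about the convergence behavior of the paper's actual operator. Turning a relaxed solution `V` back into a discrete size-$k$ subset requires a rounding step that the abstract does not describe and that is not shown here.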

source & further reading

arxiv.org — original article
