Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts


[ARTICLE · art-28971] src=arxiv.org ↗ pub=2026-06-16T04:00Z topic=machine-learning verified=true sentiment=· neutral

Researchers formalized embedding model routing as an adversarial contextual linear bandit with low-rank experts, identifying a log-quadratic policy class for efficient online learning. They proposed the Hypentropy Policy Gradient (HPG) algorithm, which achieves sublinear regret and avoids the curse of dimensionality. The work addresses practical challenges in modern recommendation systems under adversarial queries and bandit feedback.

Published Jun 16, 2026

arXiv:2606.14929v1. Abstract: Modern recommendation systems increasingly rely on dynamically routing diverse queries to multiple embedding models. Despite its practical significance, this problem remains poorly understood under realistic conditions such as adversarial queries, bandit feedback, and limited observability of models. We formalize embedding model routing as an adversarial contextual linear bandit with low-rank experts, where contexts are queries, actions are items, and experts are the embedding models operating on low-rank latent representation spaces. First, we establish that standard regret notions suffer from structural misspecification or statistical intractability, and we identify a log-quadratic policy class that is expressive enough to capture query-dependent model routing, yet structured enough to allow efficient online learning. Second, we propose a policy gradient algorithm called Hypentropy Policy Gradient (HPG). It provably adapts to the unknown low-rank structure under incomplete information and attains $\tilde{\mathcal O}(s\sqrt{M T})$ linearized policy regret (where $s$, $M$, and $T$ are, respectively, the intrinsic rank of the experts, the number of models, and the number of rounds), thus avoiding a curse of dimensionality. Finally, we provide a computationally efficient and parameter-free implementation of HPG.
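The abstract does not spell out the policy class, so the following is only an illustrative sketch of one possible reading, not the paper's construction and not HPG itself. It assumes each of the $M$ embedding models is a rank-$s$ linear map, and that the policy is a softmax over items whose log-probabilities are a quadratic (bilinear) form in the query and item features. The names `routing_matrix`, `log_quadratic_policy`, `expert_maps`, and `theta` are hypothetical.

```python
import numpy as np

def routing_matrix(expert_maps, theta):
    """Combine M low-rank experts into one d x d matrix (hypothetical form).

    expert_maps: (M, s, d) array, one rank-s projection U_m per model.
    theta:       (M,) array of per-model weights.
    Returns W = sum_m theta_m * U_m^T U_m, whose rank is at most M * s.
    """
    return np.einsum("m,msd,mse->de", theta, expert_maps, expert_maps)

def log_quadratic_policy(query, items, expert_maps, theta):
    """Softmax over items with logits item^T W query (an assumed policy form)."""
    W = routing_matrix(expert_maps, theta)
    logits = items @ (W @ query)      # one score per item, shape (K,)
    logits = logits - logits.max()    # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy instance: all sizes and values are made up for illustration.
rng = np.random.default_rng(0)
d, s, M, K = 16, 2, 3, 5              # feature dim, rank, models, items
expert_maps = rng.normal(size=(M, s, d))
theta = np.full(M, 1.0 / M)           # uniform weights over models
query = rng.normal(size=d)
items = rng.normal(size=(K, d))

probs = log_quadratic_policy(query, items, expert_maps, theta)
action = rng.choice(K, p=probs)       # bandit setting: only this item's reward is observed
```

In this sketch the learnable object has at most $M \cdot s$ effective directions rather than $d^2$ free entries, which gives some intuition for why a bound that depends on $s$ and $M$ instead of the ambient dimension is plausible; the actual algorithm and its analysis are in the paper.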

source & further reading

arxiv.org — original article
