Video Optimal Transport Enables Feedback-Efficient Reward Learning

Researchers from KAIST introduced Video-based Optimal Transport Preference (VOTP), a method that uses optimal transport over Video Foundation Model embeddings to generate high-fidelity pseudo-labels from a small number of human preferences. The approach reduces required human supervision and outperforms state-of-the-art offline preference-based reinforcement learning methods on locomotion and manipulation benchmarks. The paper was accepted for oral presentation at ICML 2026.

Per the arXiv abstract (arXiv:2606.16856, submitted 15 Jun 2026), the paper "Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning" by Minh-Tung Luu, Hwanhee Kim, Younghwan Lee, and Chang D. Yoo introduces Video-based Optimal Transport Preference (VOTP). Per the abstract, VOTP leverages optimal transport over Video Foundation Model (ViFM) embeddings to generate high-fidelity pseudo-labels from a small number of human preferences, reducing required human supervision and outperforming state-of-the-art offline preference-based RL (PbRL) methods on locomotion and manipulation benchmarks. The ICML 2026 program page lists an oral presentation on July 8, 2026, indicating the paper was accepted for oral presentation at ICML.

Editorial analysis: This work situates recent ViFM representation gains inside semi-supervised preference learning, offering a practical path to lower labeling budgets for offline PbRL tasks.

What happened

Per the arXiv abstract (arXiv:2606.16856, submitted 15 Jun 2026), the paper titled "Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning" introduces a method named Video-based Optimal Transport Preference (VOTP). The ICML 2026 program page lists the paper as an oral presentation scheduled for July 8, 2026, confirming acceptance to the conference. Per the abstract, the authors report that VOTP uses optimal transport in the representation space of Video Foundation Models to produce high-fidelity pseudo-labels from a handful of human preference labels, and that the method outperforms state-of-the-art offline preference-based RL methods on locomotion and manipulation benchmarks.

Technical details

Per the arXiv abstract and accompanying OpenReview/ICML materials, VOTP aligns visual trajectories by computing an optimal transport plan over embeddings produced by a Video Foundation Model, then uses that alignment to assign pseudo-preferences to unlabeled trajectory pairs. The paper frames the offline PbRL problem as having two inputs, an offline dataset collected from an unknown policy and a small preference dataset, and reports that VOTP uses semi-supervised pseudo-labeling to scale preference learning with minimal human queries. The authors also report experiments showing robustness to visual distractors and real-robot validations, per the abstract and ICML page.

Industry context

Editorial analysis: Preference-based RL aims to replace manual reward engineering with human judgements, but practitioners routinely face steep labeling costs and distributional mismatch between offline logs and evaluation scenarios.

Industry-pattern observations: Recent advances in Video Foundation Models provide rich trajectory-level embeddings that researchers increasingly use as a substrate for downstream supervision via similarity metrics, clustering, or retrieval. Companies and labs exploring offline PbRL are likely to watch methods that convert ViFM similarity into pseudo-supervision because they can reduce annotation budgets while leveraging large offline datasets.

Context and significance

Editorial analysis: If VOTP's reported gains hold across broader benchmarks, the approach could materially lower the practical cost of preference collection in offline settings, particularly for robotic manipulation and locomotion, where physical trials and human labeling are expensive.

Editorial analysis: The use of optimal transport to align entire trajectories, rather than frame-by-frame heuristics, is technically notable because it preserves temporal structure and can produce more semantically consistent pseudo-labels when ViFM embeddings capture dynamics and intent.

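The alignment step described under Technical details can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not specify these details. The sketch assumes each trajectory segment is already an array of per-clip embeddings (any array stands in for ViFM output here), uses cosine distance as the transport cost, solves entropy-regularized optimal transport with Sinkhorn iterations between uniform marginals, and propagates labels with a simple nearest-labeled-pair rule. The function names (`cosine_cost`, `sinkhorn_plan`, `ot_distance`, `pseudo_label`) and the `reg` and `n_iters` parameters are hypothetical.

```python
import numpy as np


def cosine_cost(a, b):
    """Pairwise cosine distance between rows of a (n, d) and rows of b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals (assumed)."""
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)        # mass on each step of the first segment
    nu = np.full(m, 1.0 / m)        # mass on each step of the second segment
    kernel = np.exp(-cost / reg)    # Gibbs kernel of the cost matrix
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):        # alternate scaling to match both marginals
        u = mu / (kernel @ v)
        v = nu / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def ot_distance(seg_a, seg_b, reg=0.1):
    """Transport cost between two segments of embeddings under the Sinkhorn plan."""
    cost = cosine_cost(seg_a, seg_b)
    return float(np.sum(sinkhorn_plan(cost, reg) * cost))


def pseudo_label(pair, labeled_pairs, reg=0.1):
    """Copy the label of the closest labeled pair (illustrative rule, an assumption).

    pair: (seg0, seg1); labeled_pairs: list of ((seg0, seg1), label), where
    label is 0 or 1 and names the index of the preferred segment.
    """
    best_label, best_dist = None, np.inf
    for (ref0, ref1), label in labeled_pairs:
        # Try both orientations; matching the pair in swapped order flips the label.
        direct = ot_distance(pair[0], ref0, reg) + ot_distance(pair[1], ref1, reg)
        swapped = ot_distance(pair[0], ref1, reg) + ot_distance(pair[1], ref0, reg)
        if direct < best_dist:
            best_label, best_dist = label, direct
        if swapped < best_dist:
            best_label, best_dist = 1 - label, swapped
    return best_label, best_dist
```

Calling `pseudo_label((seg_x, seg_y), [((seg_a, seg_b), 0)])` returns the propagated label together with the matching distance, which could serve as a confidence filter. Because the plan is computed over whole segments, the two segments may have different lengths.
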
What to watch

Editorial analysis: Observers should look for the paper's public code release, detailed ablations on ViFM choice, and sensitivity to embedding quality, since the method's reliability depends on the representational fidelity of the underlying Video Foundation Model.

Editorial analysis: Practitioners should also monitor evaluation scope beyond standard benchmarks, including how well pseudo-labels generalize when offline datasets contain diverse policies or when human preference signals are sparse and noisy.

Scoring Rationale

The paper presents a practical technique that combines Video Foundation Model embeddings and optimal transport to reduce preference-labeling costs, a notable advance for offline PbRL and robotics. The ICML oral acceptance raises visibility, but broader adoption depends on code release and replication across datasets, so the story rates as a notable research contribution rather than a paradigm shift.