QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

wpnews.pro

[ARTICLE · art-28968] src=arxiv.org ↗ pub=2026-06-16T04:00Z topic=machine-learning verified=true sentiment=↑ positive

Researchers propose QPILOTS, a method that steers flow-matching and diffusion policies at inference time without modifying the policy itself. At each denoising step it projects the noisy intermediate action to an estimate of the final clean action and computes the critic gradient there. QPILOTS reaches a 90% average success rate across 50 tasks on a standard offline-to-online RL benchmark, and it matches or outperforms prior inference-time approaches when steering a frozen VLA foundation model on six simulated manipulation tasks.

Published Jun 16, 2026 · 1 min read

arXiv:2606.14801v1

Abstract: Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.
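To make the steering idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation, and everything in it is an assumption made for illustration. It assumes a one-dimensional action, a straight-line flow path `x_t = (1 - t) * noise + t * action`, a frozen base policy whose actions follow a Gaussian (so its velocity field has a closed form), a quadratic critic, and a simple rule that adds the scaled critic gradient to the velocity. The names `velocity`, `project_to_clean`, `critic_grad`, `sample_action`, `mean_q`, and `guidance` are hypothetical. The single-point projection `x + (1 - t) * v` is the standard clean-sample estimate for this straight-line path. Whether QPILOTS-U uses exactly this form is not stated in the abstract.

```python
import random

MU, SIGMA = 0.0, 0.5  # assumed toy frozen policy: final actions ~ N(MU, SIGMA^2)
A_STAR = 2.0          # assumed toy critic Q(a) = -(a - A_STAR)^2 peaks here

def velocity(x, t):
    """Closed-form marginal velocity of the toy flow (stands in for a frozen network)."""
    var = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    return MU + (t * SIGMA ** 2 - (1.0 - t)) / var * (x - t * MU)

def project_to_clean(x, t, v):
    """Single-point estimate of the final clean action from a noisy intermediate x."""
    return x + (1.0 - t) * v

def critic_grad(a):
    """Analytic action gradient of the toy critic Q(a) = -(a - A_STAR)^2."""
    return -2.0 * (a - A_STAR)

def sample_action(noise, steps=20, guidance=0.0):
    """Euler-integrate the flow; guidance > 0 adds the critic gradient at the clean estimate."""
    x, dt = noise, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)                 # frozen policy, never updated
        a_hat = project_to_clean(x, t, v)  # evaluate the critic here, not at the noisy x
        x += dt * (v + guidance * critic_grad(a_hat))
    return x

def mean_q(guidance, n=200, seed=0):
    """Average toy critic value over n sampled actions, using the same noise per seed."""
    rng = random.Random(seed)
    actions = [sample_action(rng.gauss(0.0, 1.0), guidance=guidance) for _ in range(n)]
    return sum(-(a - A_STAR) ** 2 for a in actions) / n
```

Comparing `mean_q(guidance=0.0)` with `mean_q(guidance=0.5)` shows the effect in this toy setting: the base policy's velocity function is untouched, yet the sampled actions move toward the region the critic prefers. In a real system the velocity and critic would be learned networks, and the critic gradient would come from automatic differentiation.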

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.14801

