{"slug": "flash-wam-modality-aware-distillation-for-world-action-models", "title": "Flash-WAM: Modality-Aware Distillation for World Action Models", "summary": "Researchers introduced Flash-WAM, a modality-aware distillation framework that compresses world-action model inference to a single step, reducing per-chunk latency from 8.1 seconds to 348 milliseconds on NVIDIA L40S hardware. The method addresses the challenge of joint video-action generation by applying a different consistency function to each modality, matched to that modality's noise schedule. Flash-WAM preserves task success rates of 85.5% on RoboTwin 2.0 and 95.7% on LIBERO and reaches a 60% average on a real Unitree G1 humanoid robot, whereas naive consistency distillation drops to 24% success at the same step budget.", "body_md": "arXiv:2606.05254v1\nAbstract: World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce **Flash-WAM**, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. 
Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from $8.1$ seconds to $348$ ms on NVIDIA L40S, a $23{\\times}$ speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks ($85.5\\%$ RoboTwin 2.0, $95.7\\%$ LIBERO) and substantially recovers real-world performance ($60\\%$ average on a Unitree G1 humanoid robot), while naive consistency distillation drops to $24\\%$ at the same step budget.", "url": "https://wpnews.pro/news/flash-wam-modality-aware-distillation-for-world-action-models", "canonical_source": "https://arxiv.org/abs/2606.05254", "published_at": "2026-06-05 04:00:00+00:00", "updated_at": "2026-06-05 04:36:51.176953+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "robotics", "computer-vision", "generative-ai"], "entities": ["Flash-WAM", "LingBot-VA", "RoboTwin 2.0", "NVIDIA L40S"], "alternates": {"html": "https://wpnews.pro/news/flash-wam-modality-aware-distillation-for-world-action-models", "markdown": "https://wpnews.pro/news/flash-wam-modality-aware-distillation-for-world-action-models.md", "text": "https://wpnews.pro/news/flash-wam-modality-aware-distillation-for-world-action-models.txt", "jsonld": "https://wpnews.pro/news/flash-wam-modality-aware-distillation-for-world-action-models.jsonld"}}