Paper Walkthrough — U-Mind: A Unified Framework for Real-Time Multimodal Interaction with…

Researchers from Tsinghua University and Meituan introduced U-Mind, a unified framework for real-time multimodal interaction that enables a large language model to reason across text, speech, and motion. The model, accepted at CVPR 2026, uses a two-stage training recipe and a text-first decoding strategy to integrate reasoning into multimodal generation, building on prior work such as MotionGPT and SOLAMI.

How can an MLLM think?

The multimodal LLM (MLLM) field has advanced rapidly in recent years and has converged on two main styles: the LLM decoder-based style (https://arxiv.org/pdf/2306.14795) and the DiT-based style (https://arxiv.org/pdf/2605.22344). Most commercial models, e.g., Seedance (https://seedancev2.ai/), use the DiT architecture, but how to inject “reasoning” into such models remains an open problem. The research community, on the other hand, actively explores LLM decoder-based models, which make direct reasoning plausible. For example, MotionGPT (https://arxiv.org/pdf/2306.14795, NeurIPS ’23) defines 15 core motion tasks to support the instruction-tuning stage for