How can an MLLM think?
The multimodal LLM (MLLM) field has advanced rapidly in recent years and has converged on two main styles: LLM decoder-based and DiT-based. Most commercial models (e.g., Seedance) use the DiT architecture, but how to inject “reasoning” into such models remains an open problem. The research community, meanwhile, actively explores LLM decoder-based models, which make direct reasoning plausible. For example, MotionGPT (NeurIPS ’23) defines 15 core motion tasks for instruction tuning on <text, motion> pairs, while SOLAMI (CVPR ’25) uses its instruction-tuning stage to train on multi-turn conversation data to handle speech/motion alignment.
U-Mind, a CVPR 2026 paper from Tsinghua University and Meituan, asks a more demanding question: can a [...]?
This walkthrough traces the technical path from “why is this hard” to how U-Mind’s two-stage training recipe and text-first decoding strategy address it.
Let’s take a detailed look at the MotionGPT and SOLAMI papers.
MotionGPT is the paper that established the architectural pattern U-Mind builds on. Its core claim is that human motion has a “semantic coupling” similar to natural language (it is a form of body language), so it should be representable the same way text is, rather than treated as a separate continuous signal needing its own model. It uses a VQ-VAE to discretize 3D motion sequences into tokens, expands an LLM’s vocabulary to include them, and then runs motion-language pre-training so that a single autoregressive model can move fluidly between text-to-motion generation, motion captioning, motion prediction, and motion in-betweening, all driven by natural-language instructions. There is no dialogue loop and no speech modality; the instruction-tuning stage further trains on <text, motion> pairs across 15 different motion tasks to strengthen the model’s text/motion alignment.
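To make the discretization step concrete, here is a minimal sketch of the quantization lookup at the heart of a VQ-VAE: each latent vector is replaced by the index of its nearest codebook entry, and that index becomes a new vocabulary item. The 2-D codebook values, the latent vectors, and the `<motion_k>` token naming are all illustrative assumptions, not MotionGPT's actual configuration. A real VQ-VAE learns both the encoder and the codebook.

```python
import math

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        distances = [math.dist(z, c) for c in codebook]
        indices.append(distances.index(min(distances)))
    return indices

def to_motion_tokens(indices):
    """Render codebook indices as vocabulary entries (hypothetical naming)."""
    return [f"<motion_{i}>" for i in indices]

# Toy codebook with 4 entries in a 2-D latent space (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Stand-ins for encoder outputs, one latent vector per chunk of motion frames.
latents = [(0.1, -0.1), (0.9, 0.2), (0.8, 0.9)]

indices = quantize(latents, codebook)
motion_tokens = to_motion_tokens(indices)
print(motion_tokens)
```

Once motion is a sequence of such tokens, the LLM can treat it like any other text span in its training data.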
SOLAMI takes that same architectural pattern, motion tokens living inside an LLM’s vocabulary, and wraps it in an actual interactive setting. It focuses on immersive interaction with 3D characters: a user talks and gestures to a character inside a VR interface, and the model generates a synchronized speech-and-motion response, not just a motion clip. Like U-Mind, SOLAMI is built on the AnyGPT backbone (itself based on LLaMA2-7B). In its instruction-tuning stage it trains on multi-turn conversations, a step beyond MotionGPT in that motion and speech are aligned with character settings and conversational context.
U-Mind’s backbone is LLaMA2-7B, extended into a multimodal model in the lineage of AnyGPT: rather than bolting on separate encoder/decoder heads per modality, every modality is discretized into tokens living in the same vocabulary as text, so the whole system stays pure next-token prediction. For motion, body pose is represented with the SMPL-X parametric model, converted to continuous 6D joint rotations (more stable than quaternions or Euler angles), then compressed into discrete “motion tokens” by a Residual Vector-Quantized VAE (RVQ-VAE). This turns a continuous control problem into something an autoregressive LLM can predict one token at a time. Speech receives the same treatment, using the same RVQ-VAE architecture as SpeechTokenizer, which discretizes raw waveforms into tokens that carry both the words said and paralinguistic cues such as prosody and emotion.
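The "residual" part of RVQ can be illustrated with a small sketch: the first codebook captures a coarse approximation, and each later codebook quantizes whatever error the previous levels left behind. The two toy codebooks and the input vector below are assumptions chosen for readability. U-Mind's actual codebook sizes, depth, and learned encoder are not reproduced here.

```python
def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for i, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, entry))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def rvq_encode(vec, codebooks):
    """Quantize vec level by level, passing the leftover residual downstream."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Level 1 is coarse, level 2 refines the residual (illustrative values only).
coarse = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fine = [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, 0.0), (0.0, -0.25)]
codebooks = [coarse, fine]

vec = (1.2, 0.1)
codes = rvq_encode(vec, codebooks)
one_level = rvq_decode(codes[:1], codebooks[:1])
two_levels = rvq_decode(codes, codebooks)
print(codes, squared_error(vec, one_level), squared_error(vec, two_levels))
```

Each extra level costs one more token per frame but shrinks the reconstruction error, which is the trade-off an RVQ tokenizer tunes.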
The piece that ties it together is a set of explicit reasoning tokens: <think> / </think> delimiters wrap text-only internal Chain-of-Thought planning that's never shown to the user, while speech and motion segments get their own start/end delimiters and a global token wraps the full response (plain text stays undelimited). Stack text vocabulary, motion tokens, speech tokens, and these CoT delimiters into one resized embedding matrix, and you get a single unified token space — no modality-specific decoder heads needed, just autoregressive next-token prediction where "next token" might mean the next word, acoustic frame, or motion frame.
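A rough sketch of that unified token space follows: every token family gets its own contiguous id range in a single vocabulary, so one embedding matrix covers all of them. Only `<think>` and `</think>` come from the description above. The other delimiter names, the tiny vocabulary sizes, and the exact layout of the sample response are hypothetical placeholders.

```python
def build_unified_vocab(text_tokens, n_motion, n_speech, special_tokens):
    """Assign one contiguous id range per token family in a shared vocabulary."""
    families = [
        list(text_tokens),
        [f"<motion_{i}>" for i in range(n_motion)],
        [f"<speech_{i}>" for i in range(n_speech)],
        list(special_tokens),
    ]
    vocab = {}
    for family in families:
        for token in family:
            if token in vocab:
                raise ValueError(f"duplicate token: {token}")
            vocab[token] = len(vocab)
    return vocab

# Only <think> and </think> are named in the text; the rest are assumed names.
SPECIAL_TOKENS = [
    "<think>", "</think>",
    "<speech>", "</speech>",
    "<motion>", "</motion>",
    "<response>", "</response>",
]

vocab = build_unified_vocab(["wave", "and", "say", "hi"], 8, 8, SPECIAL_TOKENS)

# One possible layout: hidden plan first, then a wrapped multimodal answer.
response = [
    "<think>", "wave", "and", "say", "hi", "</think>",
    "<response>", "hi",
    "<speech>", "<speech_2>", "<speech_5>", "</speech>",
    "<motion>", "<motion_1>", "<motion_7>", "</motion>",
    "</response>",
]
token_ids = [vocab[token] for token in response]
print(len(vocab), token_ids)
```

The size of `vocab` is the number of rows the resized embedding matrix would need, and `token_ids` is what the model predicts, one id at a time, regardless of modality.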
This is where U-Mind actually fights reasoning degradation, the risk that training on motion and speech tokens makes the model unlearn how to reason in plain text. The framework is called the Unified Alignment and Reasoning Framework, and it is split into two stages.
The goal of Stage 1 is narrow: teach the model to speak the language of motion and speech tokens without unlearning how to reason in plain text. The training mixture splits into two halves: modality-grounding tasks (text-to-motion, speech-to-motion, text-to-speech) that teach temporally coherent generation from linguistic or acoustic prompts, and rehearsal tasks interleaved purely to keep the model’s symbolic reasoning circuitry exercised. The intuition is simple: the grounding tasks teach the model to produce temporally and contextually coherent sequences via segment-wise alignment, while the rehearsal tasks let the LLM keep practicing pure text reasoning, so the two objectives run side by side rather than competing for the same gradient updates.
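One simple way to picture the interleaving is a task sampler that draws from the rehearsal pool with some fixed probability and from the grounding pool otherwise. This is a sketch under stated assumptions: the grounding task names come from the description above, but the rehearsal task names and the 0.3 mixing probability are hypothetical, since the actual mixture ratio is not given here.

```python
import random
from collections import Counter

# Grounding tasks are named in the text; rehearsal task names are hypothetical.
GROUNDING_TASKS = ["text_to_motion", "speech_to_motion", "text_to_speech"]
REHEARSAL_TASKS = ["text_qa", "text_reasoning"]

def sample_task(rng, rehearsal_prob=0.3):
    """Pick the next training task; rehearsal_prob is an assumed mixing ratio."""
    pool = REHEARSAL_TASKS if rng.random() < rehearsal_prob else GROUNDING_TASKS
    return rng.choice(pool)

rng = random.Random(0)  # seeded so the schedule is reproducible
schedule = [sample_task(rng) for _ in range(10_000)]

counts = Counter(schedule)
rehearsal_fraction = sum(counts[t] for t in REHEARSAL_TASKS) / len(schedule)
print(counts, round(rehearsal_fraction, 3))
```

Because rehearsal examples are scattered through the schedule rather than batched at the end, the text-reasoning objective keeps receiving updates throughout Stage 1.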
Cross-modal coordination is handled by a segment-wise alignment strategy: inputs are split along prosodic boundaries (rhythm breaks and pause points in speech), and the model trains on randomized recombinations of these segments rather than on whole, intact utterances. This forces it to learn fine-grained speech-motion correspondence rather than memorizing a fixed global mapping. The ablation confirms this matters: removing segmentation (wo-seg) drives FGD from 11.12 up to 16.89 and angle error from 0.188 up to 0.219, a fairly large hit to raw motion fidelity.
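The segment-wise idea can be sketched as follows: cut the paired speech and motion streams at the same boundaries, then build a training example from a random selection of segments while keeping each speech segment attached to its own motion segment. This is one plausible reading of "randomized recombination", not the paper's exact procedure. It also assumes, for simplicity, that both streams share a frame rate so the same boundary indices apply to each.

```python
import random

def split_at_boundaries(frames, boundaries):
    """Cut a frame sequence into segments at the given boundary indices."""
    cuts = [0] + sorted(boundaries) + [len(frames)]
    return [frames[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

def recombine(speech, motion, boundaries, k, rng):
    """Pick k segments in random order, keeping speech and motion paired."""
    speech_segments = split_at_boundaries(speech, boundaries)
    motion_segments = split_at_boundaries(motion, boundaries)
    chosen = rng.sample(range(len(speech_segments)), k)
    new_speech = [f for i in chosen for f in speech_segments[i]]
    new_motion = [f for i in chosen for f in motion_segments[i]]
    return new_speech, new_motion

# Toy streams: frame i of speech is aligned with frame i of motion.
speech = [f"s{i}" for i in range(10)]
motion = [f"m{i}" for i in range(10)]
boundaries = [3, 7]  # stand-ins for detected prosodic boundaries

rng = random.Random(0)
new_speech, new_motion = recombine(speech, motion, boundaries, k=2, rng=rng)
print(new_speech)
print(new_motion)
```

The global order of the utterance changes from sample to sample, but the frame-level pairing inside each segment never does, which is the correspondence the model is meant to learn.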
Stage 2 takes that stable backbone and aligns it with actual user intent via supervised fine-tuning on a CoT-driven instruction corpus that covers open dialogue and instruction-following.
The key design choice here is the text-first decoding strategy: every response is required to begin with a <think>...</think> block — a latent, CoT-style plan — before any text, speech, or motion is generated. Symbolic planning happens first; continuous-modality generation is conditioned on that plan, not the other way around.
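The text-first constraint is easy to express as a check on a decoded response: it must open with a `<think>...</think>` block, and only what follows is user-facing. The helper below is a hypothetical illustration of that contract on plain strings, not part of U-Mind's code, and the sample response text is made up.

```python
import re

# The response must start with a think block; everything after it is output.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>(.*)$", re.DOTALL)

def split_plan_and_output(response):
    """Separate the hidden plan from the user-facing part of a response."""
    match = THINK_RE.match(response)
    if match is None:
        raise ValueError("response must begin with a <think>...</think> block")
    return match.group(1).strip(), match.group(2).strip()

response = (
    "<think>The user greeted me. Greet back and wave with the right hand."
    "</think> Hello! Nice to see you."
)
plan, visible = split_plan_and_output(response)
print(plan)
print(visible)

try:
    split_plan_and_output("Hello! Nice to see you.")
    rejected = False
except ValueError:
    rejected = True
```

In a real system the user-facing part would also contain speech and motion token spans. The point here is only the ordering: plan first, everything else after.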
The big differentiator between this stage and pretraining is that instruction tuning specifically exercises the model’s reasoning capability: it reduces the model’s reliance on raw speech/motion conditioning and trains it to produce rich, explicit text reasoning before anything continuous is generated. This is where the unique strength of the decoder-based architecture shows up. Reasoning isn’t bolted on through a second model or a complicated training pattern; it follows from the architecture itself, simply because the plan comes earlier in the sequence under causal attention. Forcing the <think> block first means everything downstream (text, speech, motion) is mechanically conditioned on that plan, for free, just by token order.
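The "for free, just by token order" argument can be checked with a toy causal mask. Under causal attention, position i may attend to position j only when j <= i, so every output token sees the whole plan while no plan token sees any output. The six-token sequence below is a made-up stand-in for a real response.

```python
def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Illustrative sequence: a plan block followed by one token per modality.
sequence = ["<think>", "plan", "</think>", "text_tok", "speech_tok", "motion_tok"]
mask = causal_mask(len(sequence))

plan_positions = range(0, 3)    # <think>, plan, </think>
output_positions = range(3, 6)  # text, speech, and motion tokens

# Every output token can attend to every plan token.
outputs_see_plan = all(
    mask[i][j] for i in output_positions for j in plan_positions
)
# No plan token can attend to any later output token.
plan_sees_outputs = any(
    mask[i][j] for i in plan_positions for j in output_positions
)
print(outputs_see_plan, plan_sees_outputs)
```

No extra conditioning module is involved: placing the plan earlier in the sequence is enough for the mask to route it into every later prediction.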
The paper compares U-Mind directly against SOLAMI and another LLM-based baseline, and reports that it exceeds both on generated motion quality, supported modality types, and open-world reasoning capability.
Final Takeaway. Multi-modality LLMs have greatly expanded their capabilities in the past few years, but how to properly reason across and align different modalities for real-world problems remains a challenge. The autoregressive decoder-style architecture is still a path worth exploring for its ability to naturally embed reasoning.
Deng X, Gao F, Zhang Y, Pang Y, Xu X, Kang Z, Wei X, Liu Y. U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, pp. 10874–10886.
Related work referenced in this walkthrough:
This walkthrough was originally published in Towards AI on Medium.