LiteRT-LM brings native support for Gemma 4 Multi-Token Prediction (MTP) drafters, enabling up to 2.2x faster inference. The framework is also expanding beyond Kotlin and C++, adding new Swift and JavaScript APIs.
LiteRT-LM includes a specialized orchestration layer built on top of LiteRT, formerly known as TensorFlow Lite, and designed specifically to handle large language models (LLMs). According to Google, it is a production-proven, highly optimized runtime for running Gemma 4 on-device across platforms like Android, iOS, and the web.
Its LiteRT foundation enables it to efficiently handle constraints like limited memory, limited compute, and fragmented hardware by leveraging advanced quantization schemes along with accelerated XNNPACK and MLDrift kernels. At the orchestration level, it employs optimized pipelines to minimize costly CPU-GPU data transfers, supports multi-token prediction, and features advanced session management. According to Google, this combination makes it "the highest-performing runtime environment for Gemma models".
LiteRT-LM adopts speculative decoding for MTP and implements it in a way that avoids the bottlenecks of naive approaches by "optimizing the data interplay between the primary Gemma 4 model and the MTP drafter".
To achieve this, LiteRT-LM enforces memory locality by executing both the lightweight MTP drafter and the primary model on the same hardware IP (e.g., the GPU). Managing the shared KV cache and activations within local memory entirely eliminates the latency penalties of cross-IP synchronization and data transfers. Once the drafter predicts future tokens, the primary model evaluates them using optimized kernels that maximize parallelization during verification.
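The draft-then-verify loop behind speculative decoding can be illustrated with a minimal sketch. This is not LiteRT-LM code or its API: the function names, the toy "models", and the draft length `k` are all assumptions made for illustration, and the sketch uses the simple greedy variant in which a draft token is accepted only if it matches what the primary model would have produced.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks every draft position. A real runtime would score
    #    all positions in one batched pass; this loop only emulates that.
    accepted: List[int] = []
    for i in range(k):
        expected = target(context + draft[:i])
        if draft[i] != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(draft[i])

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target(context + draft))
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + max_new_tokens]


# Toy models, purely illustrative.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + 7) % 50


def toy_drafter(ctx: List[int]) -> int:
    # Copies the target, but guesses 0 when the context length is a multiple of 3.
    return toy_target(ctx) if len(ctx) % 3 else 0


def greedy(target: NextToken, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out
```

In this greedy variant, `generate` returns exactly the sequence that plain token-by-token decoding with the target would return. The drafter only changes how many target passes are needed, never the result.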
Based on its own benchmarks, Google says that MTP decoding speed is 1.6x faster for Gemma 4 E2B and 2.2x faster for Gemma 4 E4B. The company also reports that both prefill and decode performance are 1.8x to 3.7x faster than competing frameworks like llama.cpp, MLX, Cactus, and ONNX.
Session management is treated as a first-class feature in LiteRT-LM. It can save and restore KV cache state, enabling seamless continuation of long interactions while avoiding expensive recomputation. This can improve both overall user experience and efficiency.
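As a rough illustration of why saving KV cache state avoids recomputation, the following sketch models a session as a list of per-token cache entries. The `ToySession` class, its methods, and the use of `pickle` for serialization are hypothetical and do not reflect the LiteRT-LM API. The cache entries are placeholders rather than real attention keys and values.

```python
import pickle
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToySession:
    """Hypothetical session holding one cache entry per processed token."""
    tokens: List[int] = field(default_factory=list)
    kv_cache: List[Tuple[int, int]] = field(default_factory=list)
    tokens_computed: int = 0  # counts the expensive per-token work done

    def append(self, new_tokens: List[int]) -> None:
        for tok in new_tokens:
            # Placeholder for computing keys/values at this position.
            position = len(self.tokens)
            self.kv_cache.append((position, tok * 2))
            self.tokens.append(tok)
            self.tokens_computed += 1

    def save(self) -> bytes:
        # Serialize the conversation and its cache so it can be resumed later.
        return pickle.dumps((self.tokens, self.kv_cache))

    @classmethod
    def restore(cls, blob: bytes) -> "ToySession":
        tokens, kv_cache = pickle.loads(blob)
        return cls(tokens=tokens, kv_cache=kv_cache)


first = ToySession()
first.append([5, 6, 7, 8])          # the conversation so far
blob = first.save()

resumed = ToySession.restore(blob)
resumed.append([9])                 # only the new token is processed
```

After the restore, `resumed.tokens_computed` is 1 rather than 5, because the four earlier positions were reloaded from the saved state instead of being recomputed.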
Another pillar of LiteRT-LM is memory efficiency: it minimizes its footprint by keeping per-layer embeddings out of memory and dynamically loading image and audio encoders only when required. As a result, the runtime remains lean. For example, the ~2.58GB Gemma 4 E2B model takes just 607MB on Apple mobile CPUs.
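The idea of loading encoders only on demand can be sketched generically as follows. The `LazyComponents` class, the component names, and the fake loaders are assumptions for illustration only and say nothing about how LiteRT-LM actually manages its encoders.

```python
from typing import Callable, Dict, List


class LazyComponents:
    """Loads optional components on first use and allows explicit release."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]) -> None:
        self._loaders = loaders
        self._loaded: Dict[str, object] = {}

    def get(self, name: str) -> object:
        # Pay the memory cost only when the component is first requested.
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def release(self, name: str) -> None:
        # Drop the reference so the memory can be reclaimed.
        self._loaded.pop(name, None)

    def resident(self) -> List[str]:
        return sorted(self._loaded)


# Hypothetical loaders; byte buffers stand in for encoder weights.
components = LazyComponents({
    "image_encoder": lambda: bytearray(1024),
    "audio_encoder": lambda: bytearray(2048),
})

text_only = components.resident()        # nothing loaded for a text prompt
components.get("image_encoder")          # an image arrives, load its encoder
with_image = components.resident()
components.release("image_encoder")      # free it once the image is processed
after_release = components.resident()
```

A text-only conversation in this sketch never pays for either encoder, which is the effect the article describes.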
The system also emphasizes agentic capabilities through native support for Gemma 4 "Thinking Mode", constrained decoding for structured outputs, and function calling. These features allow it to pause execution, return structured tool-call requests, and eventually resume.
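The pause, request, and resume cycle of function calling can be modeled with a Python generator, as in the sketch below. The `toy_model` generator, the `get_weather` tool, and the request shape are all hypothetical. The example shows only the control flow, not LiteRT-LM's actual interface.

```python
from typing import Callable, Dict, Generator


def toy_model(prompt: str) -> Generator[dict, str, str]:
    """Hypothetical model turn: pauses with a tool-call request, then resumes."""
    # Yielding hands a structured request to the host and suspends generation.
    tool_result = yield {"name": "get_weather", "arguments": {"city": "Paris"}}
    # Execution continues here once the host sends the tool result back.
    return f"{prompt} -> {tool_result}"


def run_turn(prompt: str, tools: Dict[str, Callable[..., str]]) -> str:
    turn = toy_model(prompt)
    request = next(turn)                    # run until the first pause
    while True:
        result = tools[request["name"]](**request["arguments"])
        try:
            request = turn.send(result)     # resume; the model may pause again
        except StopIteration as done:
            return done.value               # the model finished its answer


answer = run_turn("Weather?", {"get_weather": lambda city: f"18C in {city}"})
```

Because the request is a plain dictionary with a name and arguments, the host can validate it before running any tool. Constrained decoding matters here because it can guarantee that the request arrives in that shape.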
Launched with Gemma 4, multi-token prediction drafters use speculative decoding to generate multiple tokens in parallel, which can then be verified together in a single pass. This approach reduces constant data movement between VRAM and compute units while exploiting the fact that many predictions are "obvious" and do not require the same amount of computation as others.
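A back-of-the-envelope model shows why "obvious" tokens matter. It makes a simplifying assumption that is common in the speculative decoding literature: each draft token is accepted independently with probability `alpha`. Under that assumption, a round with `k` draft tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens per verification pass on average. The values of `alpha` and `k` below are illustrative and are not measurements from Google's benchmarks.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass of the primary model.

    alpha: assumed independent acceptance probability per draft token.
    k:     number of tokens the drafter proposes per round.
    """
    if not 0.0 <= alpha <= 1.0 or k < 0:
        raise ValueError("alpha must be in [0, 1] and k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus one bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


# Illustrative values only: a drafter that is right half the time
# versus one that is right 90% of the time, both proposing 4 tokens.
modest = expected_tokens_per_pass(0.5, 4)
strong = expected_tokens_per_pass(0.9, 4)
```

The higher the share of easy predictions, the closer each pass gets to the `k + 1` ceiling. The actual speedup also depends on the drafter's own cost, which this model ignores.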
LiteRT-LM is available on GitHub and includes a CLI for experimenting on the desktop, as well as a mobile app for on-device usage.