[ARTICLE · art-22363] src=infoq.com pub= topic=large-language-models verified=true sentiment=↑ positive

Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction

Google released LiteRT-LM, a new runtime framework that accelerates on-device inference for its Gemma 4 large language models by up to 2.2x using multi-token prediction. The framework, built on the LiteRT platform formerly known as TensorFlow Lite, now supports Swift and JavaScript APIs in addition to Kotlin and C++, enabling faster AI processing on Android, iOS, and web platforms. Google reports that LiteRT-LM achieves 1.8x to 3.7x faster prefill and decode performance compared to competing frameworks like llama.cpp and MLX.

2 min read · published Jun 5, 2026

LiteRT-LM brings native support for Gemma 4 Multi-Token Prediction (MTP) drafters, enabling up to 2.2x faster inference. The framework is also expanding beyond Kotlin and C++, adding new Swift and JavaScript APIs.

LiteRT-LM includes a specialized orchestration layer built on top of LiteRT, formerly known as TensorFlow Lite, and designed to handle large language models (LLMs). According to Google, it is a production-proven, highly optimized runtime for running Gemma 4 on-device across platforms like Android, iOS, and the web.

Its LiteRT foundation lets it cope with constraints like limited memory, limited compute, and fragmented hardware by leveraging advanced quantization schemes along with accelerated XNNPACK and MLDrift kernels. At the orchestration level, it employs optimized pipelines to minimize costly CPU-GPU data transfers, supports multi-token prediction, and features advanced session management. According to Google, this combination makes it "the highest-performing runtime environment for Gemma models".

LiteRT-LM adopts speculative decoding for MTP and implements it in a way that avoids the bottlenecks of naive approaches by "optimizing the data interplay between the primary Gemma 4 model and the MTP drafter".

To achieve this, LiteRT-LM enforces memory locality by executing both the lightweight MTP drafter and the primary model on the same hardware IP (e.g., the GPU). Managing the shared KV cache and activations within local memory entirely eliminates the latency penalties of cross-IP synchronization and data transfers. Once the drafter predicts future tokens, the primary model evaluates them using optimized kernels that maximize parallelization during verification.
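The draft-and-verify idea behind speculative decoding can be illustrated with a minimal sketch. This is not LiteRT-LM code: the function names and the toy "models" below are assumptions made up for illustration, and the verification loop only mimics the result of what a real runtime would compute in one batched forward pass on the accelerator.

```python
from typing import Callable, List

# Hypothetical stand-ins: each "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, drafter: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy draft-and-verify; the output equals plain greedy decoding with `target`."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1) The drafter cheaply proposes up to k future tokens.
        draft: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            draft.append(drafter(tokens + draft))

        # 2) The target checks each drafted position. A real runtime would do
        #    this in a single parallel pass; this loop only reproduces the result.
        accepted: List[int] = []
        for i, tok in enumerate(draft):
            expected = target(tokens + draft[:i])
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the same pass yields one bonus token.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


def toy_target(prefix: List[int]) -> int:
    """Toy primary model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[int]) -> int:
    """Toy drafter: agrees with the target except on multiples of 3."""
    nxt = (prefix[-1] + 1) % 10
    return 0 if nxt % 3 == 0 else nxt
```

Because rejected draft tokens are replaced by the target's own choice, the drafter affects only speed, never the generated text.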

Based on its own benchmarks, Google says that MTP decoding speed is 1.6x faster for Gemma 4 E2B and 2.2x faster for Gemma 4 E4B. The company also reports that both prefill and decode performance are 1.8x to 3.7x faster than competing frameworks like llama.cpp, MLX, Cactus, and ONNX.

Session management is treated as a first-class feature in LiteRT-LM. It can save and restore KV cache state, enabling seamless continuation of long interactions while avoiding expensive recomputation. This can improve both overall user experience and efficiency.
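As a rough illustration of why restoring KV cache state avoids recomputation, consider the toy sketch below. `ToySession` and its methods are hypothetical names, not the LiteRT-LM API, and the cache entries are placeholder tuples rather than real key/value tensors.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToySession:
    """Hypothetical session; `kv_cache` stands in for per-token key/value tensors."""
    kv_cache: List[Tuple[int, int]] = field(default_factory=list)
    prefill_calls: int = 0  # number of tokens this session actually processed

    def prefill(self, tokens: List[int]) -> None:
        for t in tokens:
            # Placeholder for computing attention keys/values for one token.
            self.kv_cache.append((t, t * 2))
            self.prefill_calls += 1

    def save(self) -> List[Tuple[int, int]]:
        """Return an independent snapshot of the cache."""
        return copy.deepcopy(self.kv_cache)

    @classmethod
    def restore(cls, snapshot: List[Tuple[int, int]]) -> "ToySession":
        """Rebuild a session from a snapshot without reprocessing old tokens."""
        return cls(kv_cache=copy.deepcopy(snapshot))


session = ToySession()
session.prefill([1, 2, 3])      # initial conversation history
snapshot = session.save()

resumed = ToySession.restore(snapshot)
resumed.prefill([4])            # only the new token is processed
```

In this sketch the resumed session holds four cache entries but processed just one token, which is the saving the article describes for long interactions.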

Another pillar of LiteRT-LM is memory efficiency: it minimizes its footprint by keeping per-layer embeddings out of memory and dynamically loading image and audio encoders only when required. As a result, the runtime remains lean, with, for example, the ~2.58GB Gemma 4 E2B model taking just 607MB on Apple mobile CPUs.
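Loading encoders only on demand is a standard lazy-initialization pattern. The sketch below shows the general idea under assumed names (`LazyEncoders` and the string placeholders for weights are hypothetical) and says nothing about how LiteRT-LM implements it internally.

```python
from typing import Callable, Dict, List


class LazyEncoders:
    """Hypothetical registry that builds an encoder only on first use."""

    def __init__(self, factories: Dict[str, Callable[[], object]]) -> None:
        self._factories = factories
        self._loaded: Dict[str, object] = {}

    def get(self, modality: str) -> object:
        # Pay the memory cost only when a request actually needs this modality.
        if modality not in self._loaded:
            self._loaded[modality] = self._factories[modality]()
        return self._loaded[modality]

    def loaded(self) -> List[str]:
        return sorted(self._loaded)


encoders = LazyEncoders({
    "image": lambda: "image-encoder-weights",  # placeholder for real weights
    "audio": lambda: "audio-encoder-weights",
})
encoders.get("image")  # a text-plus-image request never loads the audio encoder
```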

The system also emphasizes agentic capabilities through native support for Gemma 4 "Thinking Mode", constrained decoding for structured outputs, and function calling. These features allow it to pause execution, return structured tool-call requests, and eventually resume.
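The pause, tool-call, resume cycle can be modeled with a Python generator. Everything here is an illustrative assumption: the `get_weather` tool, the JSON request shape, and the function names are hypothetical and are not taken from LiteRT-LM.

```python
import json
from typing import Callable, Dict, Generator


def toy_generation() -> Generator[str, str, str]:
    """Hypothetical model turn: yields a JSON tool-call request, then resumes."""
    # Pause: emit a structured tool-call request and wait for the host's result.
    request = json.dumps({"name": "get_weather", "arguments": {"city": "Paris"}})
    tool_result = yield request
    # Resume: continue generating with the tool result in context.
    return f"The weather in Paris is {tool_result}."


# Hypothetical host-side tool registry.
TOOLS: Dict[str, Callable[..., str]] = {"get_weather": lambda city: "sunny"}


def run_turn() -> str:
    gen = toy_generation()
    request = json.loads(next(gen))  # run until the first pause
    result = TOOLS[request["name"]](**request["arguments"])
    try:
        gen.send(result)             # resume with the tool output
    except StopIteration as done:
        return done.value
    raise RuntimeError("generation paused again; a real loop would keep dispatching")
```

Constrained decoding matters in this flow because the host can only dispatch the request if it parses as valid structured output.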

Launched with Gemma 4, multi-token prediction drafters use speculative decoding to generate multiple tokens in parallel, which can be verified together in a single pass. This approach reduces constant data movement between VRAM and compute units while exploiting the fact that many predictions are "obvious" and do not require the same amount of computation as others.
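A common back-of-the-envelope model from the speculative decoding literature makes the "obvious predictions" point concrete. Assuming each drafted token is accepted independently with probability `alpha` (a simplifying assumption, not something Google's benchmarks state), a pass that verifies `k` drafted tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass.

    alpha: assumed independent per-token acceptance probability (0..1).
    k: number of tokens the drafter proposes per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

This is an upper bound on the speedup rather than a prediction of it, since running the drafter and verifying extra positions both cost time.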

LiteRT-LM is available on GitHub and includes a CLI for experimenting on the desktop, as well as a mobile app for on-device usage.
