Source: letsdatascience.com · Topic: large-language-models

Google LiteRT-LM Accelerates Gemma 4 Local Inference

Google added native support for Gemma 4 Multi-Token Prediction (MTP) to LiteRT-LM, its on-device LLM runtime built on LiteRT (formerly TensorFlow Lite). Google reports that MTP speeds up decoding by 1.6x on Gemma 4 E2B and 2.2x on Gemma 4 E4B, and that LiteRT-LM delivers 1.8x to 3.7x faster prefill and decode than competing frameworks including llama.cpp, MLX, Cactus, and ONNX. The update also adds Swift and JavaScript language bindings alongside the existing Kotlin and C++ support.

2 min read · Published Jun 5, 2026

Google has added native support for Gemma 4 Multi-Token Prediction (MTP) to LiteRT-LM, its on-device LLM runtime built on LiteRT (formerly TensorFlow Lite), InfoQ reports. According to Google, MTP speeds up decoding by 1.6x on Gemma 4 E2B and 2.2x on Gemma 4 E4B, and LiteRT-LM delivers 1.8x to 3.7x faster prefill and decode than the frameworks Google cites, including llama.cpp, MLX, Cactus, and ONNX. InfoQ also reports that LiteRT-LM uses advanced quantization, accelerated XNNPACK and MLDrift kernels, optimized pipelines that reduce CPU-GPU transfers, speculative decoding for MTP, and session management features. The runtime now exposes Swift and JavaScript APIs in addition to the existing Kotlin and C++ support, per InfoQ.

What happened

Google expanded LiteRT-LM, its production LLM runtime built on LiteRT (formerly TensorFlow Lite), to add native support for Gemma 4 Multi-Token Prediction (MTP), InfoQ reports. According to Google, MTP decoding is 1.6x faster on Gemma 4 E2B and 2.2x faster on Gemma 4 E4B than single-token baselines. InfoQ reports that Google also claims 1.8x to 3.7x faster prefill and decode performance than the competing frameworks it names, including llama.cpp, MLX, Cactus, and ONNX. The release also expands runtime language bindings beyond Kotlin and C++ to include Swift and JavaScript, per InfoQ.

Technical details

According to InfoQ's coverage of Google's description, LiteRT-LM layers orchestration on top of LiteRT and combines it with advanced quantization and accelerated kernels such as XNNPACK and MLDrift to handle constrained on-device environments. InfoQ reports that the runtime applies speculative decoding for MTP, minimizes costly CPU-GPU data movement via optimized pipelines, and treats session management as a first-class feature. InfoQ adds that LiteRT-LM enforces memory locality: it runs both the lightweight MTP drafter and the primary model on the same hardware IP and keeps shared KV caches and activations in local memory, avoiding cross-IP synchronization costs.
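
The draft-and-verify idea behind speculative decoding with a lightweight drafter can be sketched in a few lines. This is a toy, greedy illustration with hypothetical names (`speculative_decode`, `target_next`, `draft_next`, and the two toy models); it is not LiteRT-LM's API or Google's implementation, and it leaves out details such as shared KV caches and the extra token a real verification pass can emit.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify decoding. Returns (tokens, target_passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap drafter proposes up to draft_len tokens in a row.
        k = min(draft_len, goal - len(tokens))
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every drafted position. Counted as one pass:
        #    a real runtime would score all k positions in a single batched
        #    forward pass; this sketch loops per position for clarity.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if expected == draft[i]:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens, target_passes


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "primary model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    # Hypothetical drafter: agrees with the target except at every
    # fifth context length, where it guesses wrong.
    if len(ctx) % 5 == 0:
        return (ctx[-1] + 2) % 10
    return toy_target(ctx)


def plain_greedy(next_fn: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n):
        out.append(next_fn(out))
    return out


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_drafter, [0], 12)
    print(out == plain_greedy(toy_target, [0], 12), passes)
```

Because only tokens the target agrees with are kept, the output matches plain greedy decoding with the target alone. Any gain comes from needing fewer target passes than generated tokens, which depends on how often the drafter is right.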

Industry context

Editorial analysis: Companies shipping on-device LLM runtimes increasingly combine model-level techniques (speculative decoding, MTP) with low-level engineering (quantization, kernel tuning, data locality) to close the latency gap with server inference. In similar efforts, reducing CPU-GPU transfers and consolidating cache management are common levers for improving multi-token and batched decoding performance.

What to watch

Editorial analysis: Observers and practitioners should track independent benchmarks that compare LiteRT-LM's MTP results with open-source tooling, along with real-world latency and throughput measurements across mobile GPUs and the web. Also watch adoption signals for the new Swift and JavaScript APIs, and whether third-party frameworks integrate LiteRT-LM pipelines or replicate its memory-locality patterns.
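
For readers running their own comparisons, one way to keep prefill and decode numbers separate is to time the first token apart from the rest of the stream. The sketch below assumes a hypothetical `generate_stream(prompt)` callable that yields tokens one at a time; it is a stand-in for whatever streaming interface a given runtime offers, not a LiteRT-LM call, and `measure_stream` and `fake_stream` are illustrative names.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(
    generate_stream: Callable[[str], Iterable[str]],
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time a token stream: time to first token, then decode rate."""
    start = clock()
    first = last = None
    count = 0
    for _ in generate_stream(prompt):
        last = clock()
        if first is None:
            # Time to first token approximates prefill cost.
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    # Decode rate covers tokens after the first one only.
    decode_window = last - first
    rate = (count - 1) / decode_window if count > 1 and decode_window > 0 else 0.0
    return {"ttft_s": first - start, "decode_tokens_per_s": rate, "tokens": count}


def fake_stream(prompt: str) -> Iterable[str]:
    # Hypothetical stand-in for a runtime's streaming generator.
    for piece in ("a", "b", "c"):
        yield piece


if __name__ == "__main__":
    print(measure_stream(fake_stream, "hello"))
```

The `clock` parameter is injectable so the harness can be checked with a deterministic fake clock; real measurements should also repeat runs and discard warm-up iterations.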

Scoring Rationale

This is a notable engineering improvement for on-device LLM inference that matters to ML engineers and mobile practitioners. The change is not a frontier-model release but materially affects latency and deployment trade-offs for Gemma 4 on constrained hardware.
