Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction

wpnews.pro

cd /news/large-language-models/google-litert-lm-speeds-up-local-inf… · home › topics › large-language-models › article

[ARTICLE · art-22363] src=infoq.com ↗ pub=2026-06-05T09:00Z topic=large-language-models verified=true sentiment=↑ positive

Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction

Google released LiteRT-LM, a new runtime framework that accelerates on-device inference for its Gemma 4 large language models by up to 2.2x using multi-token prediction. The framework, built on the LiteRT platform formerly known as TensorFlow Lite, now supports Swift and JavaScript APIs in addition to Kotlin and C++, enabling faster AI processing on Android, iOS, and web platforms. Google reports that LiteRT-LM achieves 1.8x to 3.7x faster prefill and decode performance compared to competing frameworks like llama.cpp and MLX.

read2 min views20 publishedJun 5, 2026

LiteRT-LM brings native support for Gemma 4 Multi-Token Prediction (MTP) drafters, enabling up to 2.2x faster inference. The framework is expanding beyond Kotlin and C++ adding support for new Swift and a JavaScript APIs.

LiteRT-LM includes a specialized orchestration layer built on top of LiteRT, formerly known as TensorFlow Lite, specifically designed to handle large language models (LLM). According to Google, it is a production-proven, highly optimized runtime for running Gemma 4 on-device across platforms like Android, iOS, and the web.

Its LiteRT foundation enables it to efficiently handle constraints like limited memory, compute, and fragmented hardware leveraging advanced quantization schemes along with accelerated XNNPACK and MLDrift kernels. At the orchestration level, it employs optimized pipelines to minimize costly CPU-GPU data transfers, supports multi-token prediction, and features advanced session management. According to Google, this combination makes it "the highest-performing runtime environment for Gemma models".

LiteRT-LM adopts speculative decoding for MTP and implements it in a way that avoids the bottlenecks of naive approaches by "optimizing the data interplay between the primary Gemma 4 model and the MTP drafter".

To achieve this, LiteRT-LM enforces memory locality by executing both the lightweight MTP drafter and the primary model on the same hardware IP (e.g., the GPU). Managing the shared KV cache and activations within local memory entirely eliminates the latency penalties of cross-IP synchronization and data transfers. Once the drafter predicts future tokens, the primary model evaluates them using optimized kernels that maximize parallelization during verification

Based on its own benchmarks, Google says that MTP decoding speed is 1.6x faster for Gemma 4 E2B and 2.2x faster for Gemma 4 E4B. The company also reports that both prefill and decode performance are 1.8x to 3.7x faster than competing frameworks like llama.cpp, MLX, Cactus, and ONNX.

Session management is treated as a first-class feature in LiteRT-LM. It can save and restore KV cache state, enabling seamless continuation of long interactions while avoiding expensive recomputation. This can improve both overall user experience and efficiency.

Another pillar in LiteRT-LM is memory efficiency, as it minimizes its footprint by keeping per-layer embeddings out of memory and dynamically image and audio encoders only when required. As a result, the runtime remains lean, with, for example, the the ~2.58GB Gemma 4 E2B model taking just 607MB on Apple mobile CPUs.

The system also emphasizes agentic capabilities through native support for Gemma 4 "Thinking Mode", constrained decoding for structured outputs, and function-calling. These features allow it to execution, return structured tool-call requests, and eventually resume.

Launched with Gemma 4, multi-token prediction drafters use speculative decoding to generate multiple tokens in parallel. which can be verified together in a single pass. This approach reduces constant data movement between VRAM and compute units while exploiting the fact that many predictions are "obvious" and do not require the same amount of computation as others.

LiteRT-LM is available on GitHub and includes a CLI for experimenting on the desktop, as well as a mobile app for on-device usage.

source & further reading

infoq.com — original article

~/api · this article 200

$curl api.wpnews.pro/v1/news/google-litert-lm-speeds-…

Read original on infoq.com → www.infoq.com/news/2026/06/google-litertlm-gemma…

mentioned entities

Google

LiteRT-LM

Gemma 4

TensorFlow Lite

XNNPACK

MLDrift

Kotlin

Swift

metadata

sluggoogle-litert-lm-speeds-up-local-inference-up-to-2-2x-with-gemma-4-multi-token

topic#large-language-models

secondary4 topics

sentimentpositive

canonicalinfoq.com

navigation

← prevPostcard from Computex 2026: New…

next →This AI startup says it saves $3…

── more in #large-language-models 4 stories · sorted by recency

sourcefeed.dev · 25 Jul · #large-language-models

Small Models That Know When to Phone Home

dev.to · 25 Jul · #large-language-models

SCOTOMA: Gemma 4 31B Abliteration Review

dev.to · 25 Jul · #large-language-models

Google reengineers data center hardware for AI agents

cryptobriefing.com · 25 Jul · #large-language-models

Anthropic seeks semiconductor supplies from SK Hynix as AI firms race to build custom chips

── more on @google 3 stories trending now

wpnews · 24 Jul · #artificial-intelligence

SK Hynix reports Q2 2026 earnings as the AI memory supercycle faces its first real test

wpnews · 24 Jul · #artificial-intelligence

As agentic AI inference surges, tokenomics becomes the enterprise’s defining budget constraint

wpnews · 24 Jul · #artificial-intelligence

A $700 Billion Sovereign Fund Just Made the Chinese AI Cost Argument Impossible to Ignore

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required