Google LiteRT-LM Accelerates Gemma 4 Local Inference

Source: letsdatascience.com · Published: 2026-06-05T09:55Z · Topic: large-language-models

Google added native support for Gemma 4 Multi-Token Prediction (MTP) to LiteRT-LM, its on-device LLM runtime built on LiteRT (formerly TensorFlow Lite). Google reports that MTP speeds up decoding by 1.6x for Gemma 4 E2B and 2.2x for Gemma 4 E4B, and claims 1.8x to 3.7x faster prefill and decode performance than competing frameworks including llama.cpp, MLX, Cactus, and ONNX. The update also adds Swift and JavaScript language bindings alongside the existing Kotlin and C++ support.

Google has added native support for Gemma 4 Multi-Token Prediction (MTP) to LiteRT-LM, its on-device LLM runtime built on LiteRT (formerly TensorFlow Lite), InfoQ reports. According to Google, MTP speeds up decoding by 1.6x on Gemma 4 E2B and 2.2x on Gemma 4 E4B. Google also claims 1.8x to 3.7x faster prefill and decode performance than the frameworks it compared against, including llama.cpp, MLX, Cactus, and ONNX. InfoQ also reports that LiteRT-LM uses advanced quantization, accelerated XNNPACK and MLDrift kernels, optimized pipelines that reduce CPU-GPU transfers, speculative decoding for MTP, and session management features. The runtime now exposes Swift and JavaScript APIs in addition to the existing Kotlin and C++ support, per InfoQ.

What happened

Google expanded LiteRT-LM, its production LLM runtime built on LiteRT (formerly TensorFlow Lite), to add native support for Gemma 4 Multi-Token Prediction (MTP), InfoQ reports. According to Google, MTP decoding speed on Gemma 4 E2B is 1.6x faster and on Gemma 4 E4B is 2.2x faster compared with single-token baselines. InfoQ reports Google claims 1.8x to 3.7x faster prefill and decode performance versus competing frameworks named by Google, including llama.cpp, MLX, Cactus, and ONNX. The release also expands runtime language bindings beyond Kotlin and C++ to include Swift and JavaScript, per InfoQ.

Technical details

According to InfoQ's coverage of Google's description, LiteRT-LM pairs an orchestration layer on top of LiteRT with advanced quantization and accelerated kernels such as XNNPACK and MLDrift to cope with constrained on-device environments. InfoQ reports that the runtime applies speculative decoding for MTP, minimizes costly CPU-GPU data movement via optimized pipelines, and treats session management as a first-class feature. InfoQ further reports that LiteRT-LM enforces memory locality by running both the lightweight MTP drafter and the primary model on the same hardware IP block, and by keeping shared KV caches and activations in local memory to avoid cross-IP synchronization costs.
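The drafter-plus-primary-model arrangement described above follows the general shape of speculative decoding. The following is a minimal sketch of the greedy variant of that technique, not LiteRT-LM's implementation or API: `target_next`, `draft_next`, and `speculative_decode` are hypothetical names, the two "models" are toy functions over integer tokens, and the verification loop only simulates what a real runtime would do in a single batched forward pass of the primary model.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the primary model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the drafter: agrees with the target except after 7."""
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


def greedy_decode(model: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    n: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < n:
        # 1. The cheap drafter proposes draft_len tokens one at a time.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every proposed position. A real runtime would
        #    score all positions in one batched pass; here that single pass is
        #    simulated with per-position calls and counted as one round.
        rounds += 1
        accepted: List[Token] = []
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess == expected:
                accepted.append(guess)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every guess matched, so the same pass also yields one extra token.
            accepted.append(target(tokens + proposal))

        tokens.extend(accepted)
    return tokens[: len(prompt) + n], rounds
```

Calling `speculative_decode(target_next, draft_next, [0], 12)` returns the same tokens as `greedy_decode(target_next, [0], 12)` while using fewer verification rounds than generated tokens. That is the property that makes the technique attractive: output is unchanged and the expensive model runs less often. The actual speedup depends on how often the drafter's guesses are accepted and on how cheap the drafter is relative to the primary model.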

Industry context

Editorial analysis: Companies shipping on-device LLM runtimes increasingly combine model-level techniques (speculative decoding, MTP) with low-level engineering (quantization, kernel tuning, data locality) to close the latency gap with server inference. In similar efforts, reducing CPU-GPU transfers and consolidating cache management are common levers for improving multi-token and batched decoding performance.

What to watch

Editorial analysis: Observers and practitioners should track independent benchmarks comparing LiteRT-LM's MTP results to open-source tooling and real-world latency/throughput measurements across mobile GPUs and the web. Also watch adoption signals for the new Swift and JavaScript APIs and whether third-party frameworks integrate LiteRT-LM pipelines or replicate its memory-locality patterns.
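For readers who want to run their own comparison, a small timing harness is one way to separate prefill latency from decode throughput. This is a sketch under stated assumptions: `measure_stream` and `RunStats` are hypothetical names, and `stream_fn` stands for any wrapper you write that yields tokens as a runtime produces them. It does not use or assume any LiteRT-LM, llama.cpp, or MLX API. Time to first token is used here as an approximation of prefill cost.

```python
import time
from typing import Callable, Iterable, NamedTuple, Optional


class RunStats(NamedTuple):
    time_to_first_token_s: float  # approximates prefill latency
    decode_tokens_per_s: float    # throughput after the first token
    tokens: int                   # total tokens received


def measure_stream(
    stream_fn: Callable[[str], Iterable[str]],
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> RunStats:
    """Time a token stream produced by a (hypothetical) runtime wrapper."""
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in stream_fn(prompt):
        last = clock()
        if first is None:
            first = last  # timestamp of the first generated token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")

    # Decode rate covers the tokens that arrived after the first one.
    decode_time = last - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return RunStats(first - start, rate, count)
```

Running the same prompts through each runtime's wrapper on the same device, with a few warm-up runs discarded, gives numbers that can be compared like for like. The `clock` parameter exists so the harness can be tested with a deterministic fake clock.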

Scoring Rationale

This is a notable engineering improvement for on-device LLM inference that matters to ML engineers and mobile practitioners. The change is not a frontier-model release but materially affects latency and deployment trade-offs for Gemma 4 on constrained hardware.
