Firefox Integrates On-Device AI Runtime for Features

Mozilla has integrated an on-device ML inference runtime into Firefox, using ONNX Runtime and Transformers.js, to power features like PDF.js alt text generation and Smart Tab grouping. The runtime runs in a separate content process, with models distributed via Remote Settings and cached locally, achieving 2-10x faster inference after switching from WebAssembly to a native C++ backend.
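"Models distributed via Remote Settings and cached locally" describes a cache-first loading pattern. The code below is a minimal sketch of that pattern under stated assumptions. The `ModelRecord` fields (`modelId`, `revision`, `url`), the `Fetcher` type, and the `ModelCache` class are all hypothetical. A synchronous Map-backed store also stands in for IndexedDB, which is asynchronous in a browser. This is not Firefox's implementation.

```typescript
// Hypothetical config record, standing in for what Remote Settings might
// deliver; these field names are illustrative, not Firefox's schema.
interface ModelRecord {
  modelId: string;
  revision: string;
  url: string;
}

// Key-value store interface. Firefox uses IndexedDB, which is asynchronous;
// this synchronous, Map-backed stand-in keeps the sketch short.
interface KeyValueStore {
  get(key: string): Uint8Array | undefined;
  set(key: string, value: Uint8Array): void;
}

class MemoryStore implements KeyValueStore {
  private readonly data = new Map<string, Uint8Array>();

  get(key: string): Uint8Array | undefined {
    return this.data.get(key);
  }

  set(key: string, value: Uint8Array): void {
    this.data.set(key, value);
  }
}

// Downloads model bytes from a URL (hypothetical; a real one would use fetch).
type Fetcher = (url: string) => Uint8Array;

class ModelCache {
  private readonly store: KeyValueStore;
  private readonly fetcher: Fetcher;

  constructor(store: KeyValueStore, fetcher: Fetcher) {
    this.store = store;
    this.fetcher = fetcher;
  }

  // Cache-first load: return local bytes when present, otherwise download
  // once and persist. Keying on modelId@revision means a new revision in the
  // config triggers exactly one fresh download.
  load(record: ModelRecord): Uint8Array {
    const key = `${record.modelId}@${record.revision}`;
    const cached = this.store.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const bytes = this.fetcher(record.url);
    this.store.set(key, bytes);
    return bytes;
  }
}
```

Loading the same record twice triggers a single download, and publishing a record with a new `revision` triggers exactly one more. Keying on the revision lets a config update supersede a cached model without an explicit invalidation step, although stale entries would still need cleanup.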

What happened

In a December 5, 2025 blog post, Tarek Ziade, a Mozilla engineer who spent two years on Firefox AI features, describes how Firefox now ships multiple AI capabilities that run entirely on-device. The post reports that the team built a dedicated ML inference runtime using ONNX Runtime paired with Transformers.js, running inside a separate content process isolated from the main browser process. According to the post, Remote Settings distributes both runtime and model configurations, while IndexedDB caches downloaded models locally.

Technical architecture

A key architectural shift described in the post was the move from WebAssembly (onnxruntime-web) to a native C++ ONNX Runtime under Transformers.js. This migration is independently confirmed by a Mozilla Blog post co-authored by Ziade, Paul Adenot, and Serge Guelton (August 2025), which reported 2-10x faster inference with zero WASM warm-up overhead. PDF.js alt-text latency specifically dropped from 3.5 seconds to 350ms on the same hardware, per the Mozilla Blog.
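A quick arithmetic check shows that the PDF.js latency figures reported by the Mozilla Blog sit at the top of its quoted 2-10x range:

```latex
\text{speedup} = \frac{t_{\text{WASM}}}{t_{\text{C++}}}
  = \frac{3.5\,\text{s}}{350\,\text{ms}}
  = \frac{3500\,\text{ms}}{350\,\text{ms}}
  = 10\times
```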
The C++ backend was first rolled out to Smart Tab Grouping in Firefox 142. The developer-facing API exposes inference by task name and model ID, supporting both synchronous and streaming output. The runtime is also available to web extension developers through the browser.trial.ml namespace.

Features shipped

The first end-to-end feature was PDF.js alt text generation (Firefox 130), which pairs a Vision Transformer encoder with a distilled GPT-2 decoder, compressed to 180 million parameters and under 200MB. The post reports that the model runs in under two seconds on a laptop and never transmits PDF content externally.

Smart Tabs, launched in Firefox 141, uses the runtime to analyze tab titles and descriptions and suggest groupings locally.

Mozilla also announced AI Window in November 2025, an opt-in, server-side conversational assistant for more complex tasks. Ziade notes that it requires external inference and therefore raises privacy tradeoffs, which the post addresses in detail.

Context

Browser-based, on-device ML removes network round-trip latency and avoids transmitting user data to external servers. Distributing models via Remote Settings lets Mozilla update them without shipping a full browser release, while the separate inference process keeps model execution isolated from the main browser process. Ziade also advocates for industry adoption of end-to-end encryption and confidential compute, citing NVIDIA H100 trusted execution environments and Flower.ai federated learning, as prerequisites for server-side AI to match the privacy standard of local inference.

What to watch

The Mozilla Blog notes that GPU acceleration, multi-threaded DequantizeLinear operations, and caching of compiled graphs are in progress for the C++ ONNX backend, and WebNN support is also planned. Practitioners evaluating client-side ML inference should track binary size, cross-platform portability, and startup latency tradeoffs as the runtime matures beyond its current experimental status.
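The developer-facing API is described only as "inference by task name and model ID, supporting synchronous or streaming output." The sketch below illustrates that shape with a toy engine. Every name in it is hypothetical: `createEngine`, `EngineOptions`, `InferenceEngine`, and the `"uppercase"` and `"reverse-words"` tasks. None of it is the actual browser.trial.ml interface. The toy tasks also do simple string work in place of model inference.

```typescript
// Hypothetical options: inference is selected by task name and model ID.
interface EngineOptions {
  taskName: string;
  modelId: string;
}

// Hypothetical engine shape: one-shot `run` or incremental `stream`.
interface InferenceEngine {
  readonly options: EngineOptions;
  run(input: string): string;
  stream(input: string): Generator<string>;
}

function words(text: string): string[] {
  return text.split(/\s+/).filter((w) => w.length > 0);
}

// Toy "tasks" that transform words; stand-ins for real model pipelines.
const TOY_TASKS: Record<string, (input: string) => string[]> = {
  "uppercase": (input) => words(input).map((w) => w.toUpperCase()),
  "reverse-words": (input) => words(input).reverse(),
};

function createEngine(options: EngineOptions): InferenceEngine {
  const task = TOY_TASKS[options.taskName];
  if (task === undefined) {
    throw new Error(`unknown task: ${options.taskName}`);
  }
  return {
    options,
    // Synchronous: compute everything, then return the full result.
    run(input: string): string {
      return task(input).join(" ");
    },
    // Streaming: yield each chunk as soon as it exists, so a caller can
    // render partial output; joining all chunks equals run(input).
    *stream(input: string): Generator<string> {
      const out = task(input);
      for (let i = 0; i < out.length; i++) {
        yield i === 0 ? out[i] : " " + out[i];
      }
    },
  };
}
```

For example, on an `"uppercase"` engine, `run("on device")` returns `"ON DEVICE"`, while `stream("on device")` yields `"ON"` and then `" DEVICE"`. Keeping both modes behind one engine object means the caller chooses between latency to first output and simplicity without changing how the model is selected.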
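The PDF.js model figures of 180 million parameters and under 200MB also imply something about weight precision. This back-of-envelope check assumes 1 MB = 10^6 bytes and ignores non-weight overhead. Under those assumptions, fp32 or fp16 weights alone would not fit, so the model must average roughly one byte per parameter, which is consistent with 8-bit quantization. The post, as summarized here, does not name the compression method, so treat this as an inference rather than a documented detail.

```typescript
// Back-of-envelope sizing for the PDF.js alt-text model described above.
// Assumption: "MB" means 10^6 bytes and the budget covers weights only.
const PARAMS = 180e6; // 180 million parameters (from the post)
const BUDGET_BYTES = 200e6; // "under 200MB"

// Bytes needed to store all weights at a given precision.
function weightBytes(params: number, bytesPerParam: number): number {
  return params * bytesPerParam;
}

// Largest average storage per parameter that still fits the budget.
function maxBytesPerParam(budgetBytes: number, params: number): number {
  return budgetBytes / params;
}

const precisions: Record<string, number> = { fp32: 4, fp16: 2, int8: 1 };
for (const [name, bytes] of Object.entries(precisions)) {
  const total = weightBytes(PARAMS, bytes);
  console.log(`${name}: ${total / 1e6} MB, fits: ${total < BUDGET_BYTES}`);
}
console.log(`max bytes/param: ${maxBytesPerParam(BUDGET_BYTES, PARAMS).toFixed(2)}`);
```

Running it reports 720 MB for fp32, 360 MB for fp16, and 180 MB for int8, with a ceiling of about 1.11 bytes per parameter. Reading "MB" as binary megabytes (MiB) raises that ceiling only to about 1.17, which does not change the conclusion.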
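The summary says only that Smart Tabs analyzes tab titles and descriptions locally to suggest groupings. The sketch below is a toy illustration of local grouping, not Firefox's algorithm. It substitutes bag-of-words vectors for the ML model the feature actually runs and applies greedy threshold clustering with cosine similarity. The `groupTabs` helper and its 0.3 threshold are hypothetical.

```typescript
type Vec = Map<string, number>;

// Bag-of-words counts: a crude stand-in for a learned text embedding.
function vectorize(text: string): Vec {
  const v: Vec = new Map();
  for (const w of text.toLowerCase().split(/\W+/)) {
    if (w.length > 0) {
      v.set(w, (v.get(w) ?? 0) + 1);
    }
  }
  return v;
}

function norm(v: Vec): number {
  let sum = 0;
  v.forEach((x) => {
    sum += x * x;
  });
  return Math.sqrt(sum);
}

// Cosine similarity between two sparse vectors; 0 if either is empty.
function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  a.forEach((x, key) => {
    dot += x * (b.get(key) ?? 0);
  });
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Greedy clustering: each tab joins the first group whose seed (first tab)
// is similar enough; otherwise it seeds a new group. Result depends on order.
function groupTabs(titles: string[], threshold = 0.3): string[][] {
  const groups: { seed: Vec; members: string[] }[] = [];
  for (const title of titles) {
    const vec = vectorize(title);
    const match = groups.find((g) => cosine(g.seed, vec) >= threshold);
    if (match !== undefined) {
      match.members.push(title);
    } else {
      groups.push({ seed: vec, members: [title] });
    }
  }
  return groups.map((g) => g.members);
}
```

Given "Rust ownership docs", "Tomato pasta recipe", "Rust borrow docs", and "Tomato soup recipe", it returns one group per topic. Word overlap cannot tell that "pasta" and "spaghetti" are related, which is why a real grouping feature would rely on a learned model instead.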
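DequantizeLinear, one of the in-progress optimizations, is the ONNX operator that maps quantized integers back to floating point as y = (x - x_zero_point) * x_scale. Below is a minimal per-tensor version, which omits the per-axis scales that the ONNX operator also supports. Each output element depends only on its own input, so the work splits into disjoint chunks. That independence is presumably what makes the operator a good multi-threading target. The helper names (`chunkRanges`, `dequantizeRange`, `dequantizeLinear`) are illustrative, and the chunks here run sequentially rather than on real threads.

```typescript
// Split [0, n) into contiguous half-open ranges, e.g. one per worker thread.
function chunkRanges(n: number, chunks: number): Array<[number, number]> {
  const size = Math.ceil(n / Math.max(1, chunks));
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < n; start += size) {
    ranges.push([start, Math.min(start + size, n)]);
  }
  return ranges;
}

// Per-tensor DequantizeLinear over one range: y[i] = (x[i] - zeroPoint) * scale.
function dequantizeRange(
  x: Uint8Array | Int8Array,
  y: Float32Array,
  scale: number,
  zeroPoint: number,
  start: number,
  end: number,
): void {
  for (let i = start; i < end; i++) {
    y[i] = (x[i] - zeroPoint) * scale;
  }
}

function dequantizeLinear(
  x: Uint8Array | Int8Array,
  scale: number,
  zeroPoint: number,
  chunks = 1,
): Float32Array {
  const y = new Float32Array(x.length);
  // Each chunk writes a disjoint slice of y, so chunks could run on separate
  // threads without coordination; here they simply run one after another.
  for (const [start, end] of chunkRanges(x.length, chunks)) {
    dequantizeRange(x, y, scale, zeroPoint, start, end);
  }
  return y;
}
```

For uint8 input `[0, 128, 255]` with scale 0.5 and zero point 128, the output is `[-64, 0, 63.5]`, whether computed in one chunk or several.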
Scoring Rationale

An engineering retrospective on shipping on-device ML inference in a major browser, with independently confirmed 2-10x performance gains after migrating to native C++ ONNX Runtime. It is valuable for practitioners building client-side AI features, but it is a personal blog post from December 2025 summarizing completed work rather than a new release announcement, which places it in the solid-niche tier. Scored at 5.2 rather than the n8n estimate of 6.6.