{"slug": "webgpu-feature-detection-was-not-enough-to-run-small-llms-on-phones", "title": "WebGPU feature detection was not enough to run small LLMs on phones", "summary": "WebGPU feature detection proved insufficient for running small LLMs on phones. On an iPhone 11 Pro Max, all runs failed due to page reloads or load errors, while a Pixel 8a in LINE's in-app browser stalled mid-download. Even on capable hardware, throughput varied by up to 2x between engines, and long prompts on a Pixel 8a took over 76 seconds for the first token.", "body_md": "# WebGPU feature detection was not enough to run small LLMs on phones\n\nFour test environments where the browser exposed WebGPU, and what the measurements say.\n\nI wanted to run a small language model in the browser, on the phone, without sending inference to a server. The feature detection is easy. You ask for a WebGPU adapter, you read its limits, and if the buffer sizes are large enough you assume it will run. Every browser environment I tested exposed WebGPU. As a first-pass check, the reported limits looked large enough for the model weights.\n\nThen I ran them. What a device reports about its GPU and whether an inference run completes are two different things. Four cases from my own measurements.\n\nAll numbers below come from the raw measurement files in the repository. The models are Llama-3.2-1B-Instruct, Qwen2.5-1.5B-Instruct, and Qwen2.5-0.5B-Instruct, quantized to roughly 4-bit. The engines are WebLLM 0.2.84, transformers.js 4.2.0, and wllama 3.4.1. Each run was cold cache, with a short prompt near 50 tokens and a long prompt near 1200 tokens.\n\n## 1. Safari on iPhone reloads the page during generation\n\nThe device is an iPhone 11 Pro Max on iOS 18.7, Safari 26.5. It reports `webgpu: true`, an Apple adapter with f16 support, and a\n`maxBufferSize` of 715827880 bytes. The reported `maxBufferSize` was\nlarge enough for the model weights, at least as a first-pass check.\n\nNone of the runs completed. 
Qwen2.5-1.5B through WebLLM downloaded all 728 MB and\nthen failed at init with `TypeError: Load failed`. Llama-3.2-1B\nthrough WebLLM got further, reached generation on the WebGPU backend, and then\nthe page reloaded mid-generation with no JavaScript-visible exception and no\nout-of-memory error I could catch. The smaller Qwen2.5-0.5B through wllama did\nthe same thing at init: the\ntab reloaded before it ever became ready. Across every engine and model on this\ndevice, zero runs completed. The failure mode is not an error you handle. It is\nthe tab restarting under you.\n\n## 2. LINE's in-app browser exposes WebGPU but the run never completes\n\nThe device is a Pixel 8a, 8 GB of memory, opened inside the LINE in-app browser\non Android 16. It reports `webgpu: true`, an Arm Valhall adapter with\nf16, and a `maxBufferSize` of 4294967292 bytes, which is the full\n4 GB ceiling. Nothing in the adapter limits distinguished it from the Chrome run\nthat completed.\n\nThe Llama-3.2-1B session started, stalled mid-download, and never reached a single completed run. The results file for that session has an empty runs list. The adapter report told me nothing about whether the in-app browser would carry a download and an init to the end. It did not.\n\n## 3. Same hardware and model, about two times the throughput by engine alone\n\nOn a Windows desktop with an AMD RDNA 4 GPU, Chrome 148, I ran the same Llama-3.2-1B with the short prompt through all three engines. WebGPU is present and used in every case. The decode rate is the median of three runs.\n\n| engine | decode tok/s |\n|---|---|\n| WebLLM 0.2.84 | 196.17 |\n| transformers.js 4.2.0 | 125.41 |\n| wllama 3.4.1 | 97.61 |\n\nThe fastest engine decodes about twice as fast as the slowest on identical hardware running the identical model. The WebGPU support flag reads the same for all three. The measured throughput does not.\n\n## 4. 
Pixel 8a completes, but a long prefill takes 76 seconds\n\nThe device is a Pixel 8a again, this time in plain Chrome 149, not an in-app browser. The Arm Valhall adapter reports the same 4 GB buffer ceiling. Here the model loads and runs to completion, so I have full timings.\n\nWith the short prompt of 52 input tokens, time to first token is about 3.8 seconds across three runs (3782, 3954, 3752 ms). With the long prompt of 1213 input tokens, time to first token is 77153, 76996, and 76449 ms. That is 76 to 77 seconds before the first token of the answer appears. Decode after that holds near 9 tokens per second. The same device that handles a one-line prompt in a few seconds takes well over a minute to read a page of context.\n\nAcross these four test environments, WebGPU exposure and large adapter limits were not enough to predict whether a small LLM run would complete. Feature detection answered whether WebGPU could be requested, not whether inference would finish.", "url": "https://wpnews.pro/news/webgpu-feature-detection-was-not-enough-to-run-small-llms-on-phones", "canonical_source": "https://ludion.ai/blog/webgpu-reports-vs-reality/", "published_at": "2026-06-22 02:40:18+00:00", "updated_at": "2026-06-22 03:09:47.713544+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "ai-products", "developer-tools"], "entities": ["WebGPU", "iPhone 11 Pro Max", "Pixel 8a", "LINE", "WebLLM", "transformers.js", "wllama", "Chrome"], "alternates": {"html": "https://wpnews.pro/news/webgpu-feature-detection-was-not-enough-to-run-small-llms-on-phones", "markdown": "https://wpnews.pro/news/webgpu-feature-detection-was-not-enough-to-run-small-llms-on-phones.md", "text": "https://wpnews.pro/news/webgpu-feature-detection-was-not-enough-to-run-small-llms-on-phones.txt", "jsonld": "https://wpnews.pro/news/webgpu-feature-detection-was-not-enough-to-run-small-llms-on-phones.jsonld"}}