Source: ludion.ai

WebGPU feature detection was not enough to run small LLMs on phones

WebGPU feature detection proved insufficient for running small LLMs on phones. On an iPhone 11 Pro Max, all runs failed due to page reloads or load errors, while a Pixel 8a in LINE's in-app browser stalled mid-download. Even on capable hardware, throughput varied by up to 2x between engines, and long prompts on a Pixel 8a took over 76 seconds for the first token.

Published Jun 22, 2026 · 4 min read

Four test environments where the browser exposed WebGPU, and what the measurements say.

I wanted to run a small language model in the browser, on the phone, without sending inference to a server. The feature detection is easy. You ask for a WebGPU adapter, you read its limits, and if the buffer sizes are large enough you assume it will run. Every browser environment I tested exposed WebGPU. As a first-pass check, the reported limits looked large enough for the model weights.
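The first-pass check described above can be sketched roughly as follows. This is a minimal sketch, not the code used for the measurements: the interface names (`GpuLike`, `AdapterLike`, `AdapterReport`), the function names, and the idea of comparing `maxBufferSize` against a single caller-chosen byte threshold are assumptions for illustration. In a browser, `navigator.gpu` would be passed as the `gpu` argument.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
// In a browser, navigator.gpu satisfies GpuLike.
interface AdapterLike {
  limits: { maxBufferSize: number };
  features: { has(name: string): boolean };
}

interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

interface AdapterReport {
  webgpu: boolean;
  f16: boolean;
  maxBufferSize: number;
}

// Ask for an adapter and record what it claims about itself.
async function readAdapterReport(gpu: GpuLike | undefined): Promise<AdapterReport> {
  const none: AdapterReport = { webgpu: false, f16: false, maxBufferSize: 0 };
  if (!gpu) return none;
  const adapter = await gpu.requestAdapter();
  if (!adapter) return none;
  return {
    webgpu: true,
    f16: adapter.features.has("shader-f16"),
    maxBufferSize: adapter.limits.maxBufferSize,
  };
}

// The naive verdict: WebGPU is present and the largest buffer looks big enough.
// requiredBytes is a hypothetical threshold chosen by the caller.
function looksCapable(report: AdapterReport, requiredBytes: number): boolean {
  return report.webgpu && report.maxBufferSize >= requiredBytes;
}
```

Note what the verdict is built from: only values the adapter reports about itself. Nothing in it exercises a download, an init, or a generation step.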

Then I ran them. What a device reports about its GPU and what an inference run completes are two different things. Four cases from my own measurements.

All numbers below come from the raw measurement files in the repository. The models are Llama-3.2-1B-Instruct, Qwen2.5-1.5B-Instruct, and Qwen2.5-0.5B-Instruct, quantized to roughly 4-bit. The engines are WebLLM 0.2.84, transformers.js 4.2.0, and wllama 3.4.1. Each run was cold cache, with a short prompt near 50 tokens and a long prompt near 1200 tokens.

1. Safari on iPhone reloads the page during generation

The device is an iPhone 11 Pro Max on iOS 18.7, Safari 26.5. It reports webgpu: true, an Apple adapter with f16 support, and a maxBufferSize of 715827880 bytes. The reported maxBufferSize was large enough for the model weights, at least as a first-pass check.

None of the runs completed. Qwen2.5-1.5B through WebLLM downloaded all 728 MB and then failed at init with TypeError: Load failed. Llama-3.2-1B through WebLLM got further, reached generation on the WebGPU backend, and then the page reloaded mid-generation with no JavaScript-visible exception and no out-of-memory error I could catch. The smaller Qwen2.5-0.5B through wllama did the same thing at init: the tab reloaded before it ever became ready. Across every engine and model on this device, zero runs completed. The failure mode is not an error you handle. It is the tab restarting under you.
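Because the reload raises nothing the page can catch, one way to at least observe it is to leave a marker in storage before each phase and look for it on the next page load. The sketch below is an assumption for illustration, not the method used for these measurements: the key name, the `RunMarker` shape, and the function names are hypothetical. In a browser, `sessionStorage` or `localStorage` would be passed as `store`, and whether a given store survives a forced reload on a given browser is something to verify rather than assume.

```typescript
// Structural subset of the Web Storage API used by this sketch.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface RunMarker {
  engine: string;
  model: string;
  phase: "download" | "init" | "generate";
}

const MARKER_KEY = "llm-run-in-progress"; // hypothetical key name

// Record the phase just before entering it.
function markPhase(store: KeyValueStore, marker: RunMarker): void {
  store.setItem(MARKER_KEY, JSON.stringify(marker));
}

// Clear the marker only when a run finishes normally.
function markCompleted(store: KeyValueStore): void {
  store.removeItem(MARKER_KEY);
}

// Call once at page load. A leftover marker means the previous run never
// reached markCompleted, for example because the tab was reloaded.
function takeInterruptedRun(store: KeyValueStore): RunMarker | null {
  const raw = store.getItem(MARKER_KEY);
  if (raw === null) return null;
  store.removeItem(MARKER_KEY);
  try {
    return JSON.parse(raw) as RunMarker;
  } catch {
    return null; // unreadable marker: treat as no information
  }
}
```

This does not prevent the reload. It only turns a silent restart into a record of which engine, model, and phase were active when the page went away.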

2. LINE's in-app browser exposes WebGPU but the run never completes

The device is a Pixel 8a, 8 GB of memory, opened inside the LINE in-app browser on Android 16. It reports webgpu: true, an Arm Valhall adapter with f16, and a maxBufferSize of 4294967292 bytes, which is the full 4 GB ceiling. Nothing in the adapter limits distinguished it from the Chrome run that completed.

The Llama-3.2-1B session started, stalled mid-download, and never reached a single completed run. The results file for that session has an empty runs list. The adapter report told me nothing about whether the in-app browser would carry a download and an init to the end. It did not.
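A stalled download like this one produces no error, so the only signal is the absence of progress. As a minimal sketch under stated assumptions (the class name, the threshold, and the idea of feeding it cumulative byte counts from an engine's progress callback are all hypothetical and not taken from the measurement code), a watchdog can track when the byte count last advanced:

```typescript
// Tracks whether reported progress has advanced recently.
// Timestamps are passed in explicitly so the logic stays easy to test.
class StallDetector {
  private lastBytes = -1;
  private lastAdvanceMs: number;
  private readonly stallAfterMs: number;

  constructor(stallAfterMs: number, startMs: number) {
    this.stallAfterMs = stallAfterMs;
    this.lastAdvanceMs = startMs;
  }

  // Call from a progress callback with the cumulative bytes loaded so far.
  report(bytesLoaded: number, nowMs: number): void {
    if (bytesLoaded > this.lastBytes) {
      this.lastBytes = bytesLoaded;
      this.lastAdvanceMs = nowMs;
    }
  }

  // True when no forward progress has been seen for at least stallAfterMs.
  isStalled(nowMs: number): boolean {
    return nowMs - this.lastAdvanceMs >= this.stallAfterMs;
  }
}
```

In use, the page would construct it with `Date.now()`, call `report` from the progress callback, and poll `isStalled(Date.now())` on a timer. A stalled verdict is then something to log next to the adapter report, since the adapter report alone did not predict it.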

3. Same hardware and model, about two times the throughput by engine alone

On a Windows desktop with an AMD RDNA 4 GPU, Chrome 148, I ran the same Llama-3.2-1B with the short prompt through all three engines. WebGPU is present and used in every case. The decode rate is the median of three runs.

| engine | decode tok/s |
| --- | --- |
| WebLLM 0.2.84 | 196.17 |
| transformers.js 4.2.0 | 125.41 |
| wllama 3.4.1 | 97.61 |

The fastest engine decodes about twice as fast as the slowest on identical hardware running the identical model. The WebGPU support flag reads the same for all three. The measured throughput does not.
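The "about twice" figure follows directly from the reported medians. Here is a small sketch of the arithmetic, with a `median` helper of the kind that would reduce three runs to one number. The helper and variable names are illustrative, and the per-run values behind each median are not reproduced here.

```typescript
// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Median decode rates reported above, in tokens per second.
const decodeRates = [
  { engine: "WebLLM 0.2.84", tokPerSec: 196.17 },
  { engine: "transformers.js 4.2.0", tokPerSec: 125.41 },
  { engine: "wllama 3.4.1", tokPerSec: 97.61 },
];

const rates = decodeRates.map((r) => r.tokPerSec);

// Ratio of the fastest engine to the slowest: roughly 2.01.
const fastestOverSlowest = Math.max(...rates) / Math.min(...rates);
```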

4. Pixel 8a completes, but a long prefill takes 76 seconds

The device is a Pixel 8a again, this time in plain Chrome 149, not an in-app browser. The Arm Valhall adapter reports the same 4 GB buffer ceiling. Here the model loads and runs to completion, so I have full timings.

With the short prompt of 52 input tokens, time to first token is about 3.8 seconds across three runs (3782, 3954, 3752 ms). With the long prompt of 1213 input tokens, time to first token is 77153, 76996, and 76449 ms. That is 76 to 77 seconds before the first token of the answer appears. Decode after that holds near 9 tokens per second. The same device that handles a one-line prompt in a few seconds takes well over a minute to read a page of context.
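To put the two prompts on one scale, the time to first token can be converted into an effective prompt-processing rate. This is a rough sketch: the function name is illustrative, and because time to first token includes whatever else falls inside the measured interval, the results are best read as lower bounds on the prefill rate itself.

```typescript
// Effective input tokens processed per second, given time to first token.
function effectivePrefillRate(inputTokens: number, ttftMs: number): number {
  if (ttftMs <= 0) throw new Error("ttftMs must be positive");
  return inputTokens / (ttftMs / 1000);
}

// Middle value of each set of three runs reported above.
const shortRate = effectivePrefillRate(52, 3782);   // roughly 13.7 tok/s
const longRate = effectivePrefillRate(1213, 76996); // roughly 15.8 tok/s
```

Both prompts come out at roughly 14 to 16 input tokens per second. That suggests the long wait is mostly prompt length multiplied by a low per-token rate, rather than a separate slowdown that appears only with long inputs.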

Across these four test environments, WebGPU exposure and large adapter limits were not enough to predict whether a small LLM run would complete. Feature detection answered whether WebGPU could be requested, not whether inference would finish.
