I stress-tested Gemma 4 E4B's 128K context on a laptop GPU — recall is great, prefill is not


Thursday night I let a benchmark run while I slept. By Friday morning Gemma 4 E4B had answered twenty needle-in-a-haystack questions (five needles at each of four context sizes) on my RTX 5050 laptop. The recall numbers were better than I expected. The latency numbers were worse. Here's both, with the ~30 lines of Python to reproduce it on your own hardware.

I keep seeing "Gemma 4 E4B has a 128K context window" repeated as if it were a single property, like "the engine is 3.5 litres". It is not a single property. A context-window number means at least three different things: *will the model accept this many tokens?*, *will it remember what's in the middle of them?*, and *how fast does the first answer token arrive?* The answers diverge sharply once you move to a laptop GPU, away from the hardware most spec sheets assume.

This is the post I wish I'd had when I started building on E4B. The TL;DR is in the table further down. The reproducible test rig is at the bottom.

## The setup

- Hardware: RTX 5050 Laptop, 8 GB VRAM, 24 GB system RAM, Intel i7-13620H
- Software: Ollama 0.24.0, `gemma4:e4b` (Q4_K_M, ~9.6 GB on disk), Linux 7.x
- Test: needle-in-a-haystack. Five unique 4-character codes are embedded at fixed positions inside a long synthetic English document; the model has to recover each one in isolation by exact match.

The test is deliberately simple. I want to know whether the model can find a fact at a known position, not whether it can paraphrase it. Reasoning quality is a different benchmark and needs human evaluation, which I didn't have budget for.

I ran the sweep at 5K, 20K, 60K, and 100K target context sizes. I didn't push to the 128K spec because Ollama's `num_ctx` setting interacts with the K/V cache headroom in ways I didn't have time to characterize cleanly, and 100K is already roughly 80% of the spec.
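
To make "recover each one by exact match" concrete, here is a minimal sketch of two ways a reply could be scored. The rig later in this post counts a pass when the code appears anywhere in the reply; the stricter whole-reply comparison is shown only for contrast. Both function names and the sample replies are mine, for illustration.

```python
def contains_code(reply: str, code: str) -> bool:
    """Lenient check: the expected code appears somewhere in the reply."""
    return code in reply


def exact_match(reply: str, code: str) -> bool:
    """Strict check: the reply, minus surrounding whitespace, is exactly the code."""
    return reply.strip() == code


# Hypothetical replies, not real model output.
replies = ["Q7ZK", "The code is Q7ZK.", "Q7ZX"]
for reply in replies:
    lenient = contains_code(reply, "Q7ZK")
    strict = exact_match(reply, "Q7ZK")
    print(f"{reply!r:22} lenient={lenient} strict={strict}")
```

The lenient form forgives a model that wraps the code in a sentence; the strict form also tests whether it followed the "answer with only the code" instruction.
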
## The numbers

| Context | Pass rate (out of 5) | Generation (tokens/sec) | Time to first token |
|---|---|---|---|
| 5K | 5/5 ✓ | 9.2 | 4 s |
| 20K | 5/5 ✓ | 8.6 | 15 s |
| 60K | 5/5 ✓ | 7.6 | 38 s |
| 100K | 5/5 ✓ | 6.8 | 72 s |

Three things stand out.

**Recall stayed perfect.** I expected E4B to wobble somewhere past 60K. That's the failure mode I see most reported for 4B-class models, the "middle of the context is fuzzy" problem. The needles at 25% and 75% are exactly where I'd expect drop-off. They held. I re-ran the sweep twice to be sure.

**Generation throughput barely moved.** 9.2 tok/s at 5K vs. 6.8 tok/s at 100K. That's a 26% drop across a 20x context increase. The K/V cache is the obvious culprit, but in practical terms: once the answer starts streaming, it streams at roughly the same speed.

**Time to first token (TTFT) blew up.** 4 s at 5K, 72 s at 100K. Almost linear in context size. This is the prefill phase: the model encoding everything you sent it before producing the first output token. On a laptop GPU, prefill is where the consumer-hardware tax lives.

## What this means if you're building on E4B

Let me write the practical zones the way I actually think about them, not the marketing version:

- **Under 20K tokens: interactive.** First token in ~15 seconds, full answer in ~25-30 s. This feels like a real conversation. Most single-paper Q&A lives here.
- **20K to 60K tokens: research-assistant.** 30-40 second TTFT. You're going to glance away from the screen. That's fine, the answer will be there when you look back. Multi-paper comparisons, longer contexts.
- **60K to 100K tokens: batch.** You're queuing a job. 60-80 second TTFT means you might as well make coffee. Loading a whole codebase, a textbook chapter, a quarter's worth of meeting notes.
- **Above 100K:** I didn't measure. The prefill cost was already breaching my "is this still interactive?" threshold and the use case I was solving for didn't need it.

If you're designing a UI on top of this model, surface these zones to the user.
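
One way the zones could be surfaced, as a rough sketch: a helper that labels a prompt size and estimates the wait by interpolating between the four measured TTFT points from the table. The function names, the handling of the exact boundaries (60K counts as batch here), and the choice to return no estimate outside the measured range are my assumptions, and the numbers only describe this one laptop.

```python
from bisect import bisect_left
from typing import Optional

# (context tokens, measured TTFT in seconds) from the table above.
MEASURED_TTFT = [(5_000, 4.0), (20_000, 15.0), (60_000, 38.0), (100_000, 72.0)]


def zone_for(tokens: int) -> str:
    """Map a prompt size to one of the zones described above."""
    if tokens < 20_000:
        return "interactive"
    if tokens < 60_000:
        return "research-assistant"
    if tokens <= 100_000:
        return "batch"
    return "unmeasured"


def estimate_ttft(tokens: int) -> Optional[float]:
    """Linearly interpolate TTFT between measured points; None outside them."""
    sizes = [size for size, _ in MEASURED_TTFT]
    if tokens < sizes[0] or tokens > sizes[-1]:
        return None  # no measurement here, so no guess
    hi = bisect_left(sizes, tokens)
    if sizes[hi] == tokens:
        return MEASURED_TTFT[hi][1]
    (x0, y0), (x1, y1) = MEASURED_TTFT[hi - 1], MEASURED_TTFT[hi]
    return y0 + (y1 - y0) * (tokens - x0) / (x1 - x0)


for n in (8_000, 40_000, 90_000, 120_000):
    print(n, zone_for(n), estimate_ttft(n))
```
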
A progress bar or a tier label ("interactive / research / batch") tells someone what their next click will feel like before they ask. The 128K spec is honest; it just doesn't tell you when it'll start.

## Reproduce it yourself

The whole rig is about 30 lines once you strip the CLI scaffolding. Save this as `bench.py`, install `ollama` (`pip install ollama`), then run it:

```python
import random
import time

import ollama

MODEL = "gemma4:e4b"
NEEDLE_POSITIONS = (0.05, 0.25, 0.50, 0.75, 0.95)


def make_needles(k=5, seed=20260521):
    # Fixed seed, so the same (label, code) pairs come out on every run.
    rng = random.Random(seed)
    chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return [(f"box-{i+1}", "".join(rng.choices(chars, k=4))) for i in range(k)]


def build_haystack(target_tokens: int, needles):
    # Repeated filler sentence, English-ish prose.
    filler = (
        "The committee continued its review of the operational notes "
        "submitted during the prior fiscal quarter, with particular "
        "attention paid to procedural anomalies. "
    )
    # Generate more filler than needed, then trim to ~4 characters per token
    # (a rough heuristic, not a real tokenizer count).
    sentences_needed = target_tokens // 20
    body = (filler * sentences_needed)[: target_tokens * 4]
    # Splice needles in at fixed fractional positions.
    out = body
    for pos, (label, code) in zip(NEEDLE_POSITIONS, needles):
        i = int(pos * len(out))
        out = out[:i] + f"\n\nNote: {label} contains the code {code}.\n\n" + out[i:]
    return out


def ask(haystack: str, label: str, num_ctx: int) -> tuple[str, float, float]:
    # Returns (answer, time to first token, total time), both times in seconds.
    t0 = time.time()
    first_t = None
    chunks = []
    for r in ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer with only the 4-character code, nothing else."},
            {"role": "user", "content": haystack + f"\n\nWhat code is in {label}?"},
        ],
        stream=True,
        options={"num_ctx": num_ctx},
    ):
        delta = r.get("message", {}).get("content", "")
        if delta:
            first_t = first_t or time.time()  # timestamp of the first streamed text
            chunks.append(delta)
    answer = "".join(chunks).strip()
    return answer, (first_t - t0) if first_t else 0.0, time.time() - t0


if __name__ == "__main__":
    needles = make_needles()
    for ctx in (5_000, 20_000, 60_000, 100_000):
        hay = build_haystack(ctx, needles)
        passed = 0
        for label, code in needles:
            # 4K of headroom on top of the haystack for the prompt and the answer.
            ans, ttft, total = ask(hay, label, num_ctx=ctx + 4_000)
            passed += code in ans
            print(f"  ctx={ctx:>6,}  {label}  expected={code}  got={ans!r}  ttft={ttft:.1f}s  total={total:.1f}s")
        print(f"ctx={ctx:>6,}  pass={passed}/{len(needles)}")
```

It writes to stdout. If you want JSON-lines results to plot, redirect to a file and parse the `ctx=… pass=…` lines. The whole sweep takes ~30 minutes on an RTX 5050; longer on smaller GPUs.

The seed is fixed (`20260521`) so the needle strings are deterministic. If your pass rate doesn't match mine at the same (model, ctx, seed), that's a real signal: likely Ollama version, quantization, or hardware-driver path.

## Things this rig deliberately doesn't measure

**Quality of paraphrase.** The needles are literal 4-character codes. I'm measuring *can the model find it?*, not *can the model reason about it?* Those are different benchmarks.

**VRAM consumption.** Ollama owns the K/V cache and I'm not going to fight it for memory accounting. `nvidia-smi` says it sits around 7.4 GB at 100K context, but I haven't characterized the curve.

**Cross-document attention.** Each needle is asked in isolation. Multi-fact composition ("how does the figure on page 12 of paper A relate to section 3 of paper B?") is a different problem. I don't have a clean benchmark for it. I'm working on it.

## The honest comparison

Qwen 3.5 27B has ~190K effective context on similar hardware. Llama 3.1 70B (if you can fit it) goes further. On raw context size *alone*, Gemma 4 E4B isn't the winner.

What E4B is the winner at is the *combination*: 128K context + native vision + native audio + ~9.6 GB on disk, all in one model. That combination is what makes whole-document workloads tractable on a laptop. Qwen 27B doesn't fit in 8 GB of VRAM. Llama 3.1 70B doesn't either. If your hardware constraint is "consumer GPU", E4B is the only model in this class with 128K context and multimodality.

That's the framing I'd give someone choosing an open-weights model for a single-machine deployment in 2026.
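
The "doesn't fit in 8 GB" point can be checked with back-of-envelope arithmetic on weight size alone. The sketch below assumes 4 bits per weight, which is optimistic for a 4-bit quantization, and ignores the K/V cache and runtime overhead, so real footprints would be larger. The function name is mine, and the parameter counts are simply read off the model names.

```python
VRAM_GB = 8.0  # the laptop GPU in this post


def weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Lower-bound size of the weights in GB (10^9 bytes); nothing else is counted."""
    return params_billion * bits_per_weight / 8


for name, params in [("27B-class model", 27), ("70B-class model", 70)]:
    size = weights_gb(params)
    print(f"{name}: at least {size:.1f} GB of weights vs {VRAM_GB:.0f} GB of VRAM")
```

Even under that generous assumption, both models need more memory for weights than the card has, before any context is loaded.
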
## Three places I'd take this benchmark next

- **Mixed-modality recall.** Embed half the needles in text, half in rendered images. See if vision-encoded needles degrade differently from text-encoded ones. This is the part most relevant to anyone building doc-Q&A.
- **Cross-document needles.** Two documents in context, the needle in paper A, the question phrased to require paper B's vocabulary. The actual "I have a library, I want to ask questions" workload.
- **Long-document Q&A with human evaluation.** Pay five grad students to grade 100 questions about a single 25-page research paper. Real quality numbers, not synthetic ones.

If you run any of these, I'd genuinely like to read the results.

Connect with me:

- Website: https://yashksaini.vercel.app/
- GitHub: https://github.com/yashksaini-coder
- LinkedIn: https://www.linkedin.com/in/yashksaini/
- X (Twitter): https://x.com/0xcrackedDev