{"slug": "i-stress-tested-gemma-4-e4b-s-128k-context-on-a-laptop-gpu-recall-is-great-is", "title": "I stress-tested Gemma 4 E4B's 128K context on a laptop GPU — recall is great, prefill is not", "summary": "The article details a stress test of the Gemma 4 E4B model's 128K context window on a laptop GPU (RTX 5050). The test found that while the model's recall of information within the context remained perfect across all tested sizes, the \"time to first token\" (prefill latency) increased dramatically and almost linearly with context length, rising from 4 seconds at 5K tokens to 72 seconds at 100K tokens. The author concludes that the 128K specification is accurate but incomplete, as it says nothing about the prefill latency that makes long contexts impractical for interactive use on consumer hardware.", "body_md": "Thursday night I let a benchmark run while I slept. By Friday morning Gemma 4 E4B had answered twenty needle-in-a-haystack questions (five needles at each of four context sizes) on my RTX 5050 laptop. The recall numbers were better than I expected. The latency numbers were worse. Here are both, with the ~30 lines of Python to reproduce it on your own hardware.\n\nI keep seeing \"Gemma 4 E4B has a 128K context window\" repeated as if it were a single property, like *\"the engine is 3.5 litres\"*. It is not a single property. A context-window number bundles at least three different questions: *will the model accept this many tokens?*, *will it remember what's in the middle of them?*, and *how fast does the first answer token arrive?* The answers diverge sharply once you leave the datacenter-GPU regime that most spec sheets assume.\n\nThis is the post I wish I'd had when I started building on E4B. The TL;DR is in the table further down. 
The reproducible test rig is at the bottom.\n\n## The setup\n\n- **Hardware:** RTX 5050 Laptop, 8 GB VRAM, 24 GB system RAM, Intel i7-13620H\n- **Software:** Ollama 0.24.0, `gemma4:e4b` (Q4_K_M, ~9.6 GB on disk), Linux 7.x\n- **Test:** needle-in-a-haystack: five unique 4-character codes embedded at fixed positions (5%, 25%, 50%, 75%, and 95% of the way through) inside a long synthetic English document; the model is asked for each one in isolation, and the answer is checked against the exact code.\n\nThe test is deliberately simple. I want to know whether the model can *find* a fact at a known position, not whether it can paraphrase it. Reasoning quality is a different benchmark and needs human evaluation, which I didn't have budget for.\n\nI ran the sweep at 5K, 20K, 60K, and 100K target context sizes. I didn't push to the 128K spec because Ollama's `num_ctx` setting interacts with the K/V cache headroom in ways I didn't have time to characterize cleanly, and 100K is already about 78% of the spec.\n\n## The numbers\n\n| Context | Pass rate (5/5) | Tokens/sec | Time to first token |\n|---|---|---|---|\n| 5K | 5/5 ✓ | 9.2 | 4 s |\n| 20K | 5/5 ✓ | 8.6 | 15 s |\n| 60K | 5/5 ✓ | 7.6 | 38 s |\n| 100K | 5/5 ✓ | 6.8 | 72 s |\n\nThree things stand out.\n\n**Recall stayed perfect.** I expected E4B to wobble somewhere past 60K. That's the failure mode I see most reported for 4B-class models, the \"middle of the context is fuzzy\" problem. The needles at 25% and 75% are exactly where I'd expect drop-off. They held. I re-ran the sweep twice to be sure.\n\n**Generation throughput barely moved.** 9.2 tok/s at 5K vs. 6.8 tok/s at 100K. That's a 26% drop across a 20x context increase. The growing K/V cache, which every decode step has to attend over, is the obvious culprit for that drop, but in practical terms: once the answer starts streaming, it streams at roughly the same speed.\n\n**Time to first token blew up.** 4s at 5K, 72s at 100K. That is almost linear in context size: 18x the latency for 20x the tokens. This is the prefill phase, where the model encodes everything you sent it before producing the first output token. 
On a laptop GPU, prefill is where the consumer-hardware tax lives.\n\n## What this means if you're building on E4B\n\nLet me lay out the practical zones the way I actually think about them, not the marketing version:\n\n- **Under 20K tokens:** *interactive.* First token in ~15 seconds or less, full answer in ~25-30s. This feels like a real conversation. Most single-paper Q&A lives here.\n- **20K to 60K tokens:** *research-assistant.* 15-40 second TTFT. You're going to glance away from the screen. That's fine, the answer will be there when you look back. Multi-paper comparisons and longer contexts live here.\n- **60K to 100K tokens:** *batch.* You're queuing a job. A 40-70 second TTFT means you might as well make coffee. Loading a whole codebase, a textbook chapter, a quarter's worth of meeting notes.\n- **Above 100K:** I didn't measure. The prefill cost was already breaching my \"is this still interactive?\" threshold and the use case I was solving for didn't need it.\n\nIf you're designing a UI on top of this model, *surface these zones to the user*. A progress bar or a tier label (\"interactive / research / batch\") tells someone what their next click will *feel* like before they ask. The 128K spec is honest; it just doesn't tell you when it'll start.\n\n## Reproduce it yourself\n\nThe whole rig is about 30 lines once you strip the CLI scaffolding. 
Save this as `bench.py`, install the `ollama` Python client (`pip install ollama`), then run it:\n\n```python\nimport random, time\nimport ollama\n\nMODEL = \"gemma4:e4b\"\nNEEDLE_POSITIONS = [0.05, 0.25, 0.50, 0.75, 0.95]\n\ndef make_needles(k=5, seed=20260521):\n    # Fixed seed, so the (label, code) pairs are identical on every run.\n    rng = random.Random(seed)\n    chars = \"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\"\n    return [(f\"box-{i+1}\", \"\".join(rng.choices(chars, k=4))) for i in range(k)]\n\ndef build_haystack(target_tokens: int, needles):\n    # One filler sentence: ~160 characters of English-ish prose.\n    filler = (\n        \"The committee continued its review of the operational notes \"\n        \"submitted during the prior fiscal quarter, with particular \"\n        \"attention paid to procedural anomalies. \"\n    )\n    sentences_needed = target_tokens // 20  # deliberate overshoot; trimmed on the next line\n    body = (filler * sentences_needed)[: target_tokens * 4]  # rough heuristic: ~4 characters per token\n    # Splice needles in at fixed fractional positions\n    out = body\n    for pos, (label, code) in zip(NEEDLE_POSITIONS, needles):\n        i = int(pos * len(out))\n        out = out[:i] + f\"\\n\\nNote: {label} contains the code {code}.\\n\\n\" + out[i:]\n    return out\n\ndef ask(haystack: str, label: str, num_ctx: int) -> tuple[str, float, float]:\n    # Returns (answer, seconds to first token, total seconds).\n    t0 = time.time()\n    first_t = None\n    chunks = []\n    for r in ollama.chat(\n        model=MODEL,\n        messages=[\n            {\"role\": \"system\", \"content\": \"Answer with only the 4-character code, nothing else.\"},\n            {\"role\": \"user\", \"content\": haystack + f\"\\n\\nWhat code is in {label}?\"},\n        ],\n        stream=True,\n        options={\"num_ctx\": num_ctx},\n    ):\n        delta = r.get(\"message\", {}).get(\"content\", \"\")\n        if delta:\n            first_t = first_t or time.time()  # timestamp of the first non-empty chunk\n            chunks.append(delta)\n    answer = \"\".join(chunks).strip()\n    return answer, (first_t - t0) if first_t else 0.0, time.time() - t0\n\nif __name__ == \"__main__\":\n    needles = 
make_needles()\n    for ctx in (5_000, 20_000, 60_000, 100_000):\n        hay = build_haystack(ctx, needles)\n        passed = 0\n        for label, code in needles:\n            ans, ttft, total = ask(hay, label, num_ctx=ctx + 4_000)\n            passed += code in ans  # substring check: tolerates stray whitespace or punctuation\n            print(f\"  ctx={ctx:>6,}  {label}  expected={code}  got={ans!r}  ttft={ttft:.1f}s  total={total:.1f}s\")\n        print(f\"ctx={ctx:>6,}  pass={passed}/{len(needles)}\")\n```\n\nIt writes plain text to stdout. If you want results to plot, redirect to a file and parse the `ctx=… pass=…` lines. The whole sweep takes ~30 minutes on an RTX 5050; longer on smaller GPUs.\n\nThe seed is fixed (`20260521`) so the needle strings are deterministic. If your pass rate doesn't match mine at the same `(model, ctx, seed)`, that's a real signal, most likely a difference in Ollama version, quantization, or hardware-driver path.\n\n## Things this rig deliberately doesn't measure\n\n**Quality of paraphrase.** The needles are literal 4-character codes. I'm measuring *can the model find it?*, not *can the model reason about it?*. Those are different benchmarks.\n\n**VRAM consumption.** Ollama owns the K/V cache and I'm not going to fight it for memory accounting. `nvidia-smi` says it sits around 7.4 GB at 100K context, but I haven't characterized the curve.\n\n**Cross-document attention.** Each needle is asked in isolation. Multi-fact composition (\"how does the figure on page 12 of paper A relate to section 3 of paper B?\") is a different problem. I don't have a clean benchmark for it. I'm working on it.\n\n## The honest comparison\n\nQwen 3.5 27B has ~190K effective context on similar hardware. Llama 3.1 70B (if you can fit it) is specced at the same 128K. On *raw context size alone*, Gemma 4 E4B isn't the winner.\n\nWhat E4B *is* the winner at is the **combination**: 128K context + native vision + native audio + ~9.6 GB on disk, all in one model. That combination is what makes whole-document workloads tractable on a laptop. 
Qwen 27B doesn't fit in 8 GB of VRAM. Llama 3.1 70B doesn't either. If your hardware constraint is \"consumer GPU\", E4B is the only model I know of in this class with 128K context *and* multimodality.\n\nThat's the framing I'd give someone choosing an open-weights model for a single-machine deployment in 2026.\n\n## Three places I'd take this benchmark next\n\n- **Mixed-modality recall.** Embed half the needles in text, half in rendered images. See if vision-encoded needles degrade differently from text-encoded ones. (This is the part most relevant to anyone building doc-Q&A.)\n- **Cross-document needles.** Two documents in context, the needle in paper A, the question phrased to require paper B's vocabulary. The actual \"I have a library, I want to ask questions\" workload.\n- **Long-document Q&A with human evaluation.** Pay five grad students to grade 100 questions about a single 25-page research paper. Real quality numbers, not synthetic ones.\n\nIf you run any of these, I'd genuinely like to read the results.\n\n**Connect with me:**\n\n• [Website](https://yashksaini.vercel.app/)\n\n• [GitHub](https://github.com/yashksaini-coder)\n\n• [LinkedIn](https://www.linkedin.com/in/yashksaini/)\n\n• [X (Twitter)](https://x.com/0xcrackedDev)", "url": "https://wpnews.pro/news/i-stress-tested-gemma-4-e4b-s-128k-context-on-a-laptop-gpu-recall-is-great-is", "canonical_source": "https://dev.to/yashksaini/i-stress-tested-gemma-4-e4bs-128k-context-on-a-laptop-gpu-recall-is-great-prefill-is-not-244i", "published_at": "2026-05-24 06:44:15+00:00", "updated_at": "2026-05-24 07:16:54.102756+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "open-source", "hardware"], "entities": ["Gemma 4 E4B", "RTX 5050", "Ollama", "Linux"], "alternates": {"html": "https://wpnews.pro/news/i-stress-tested-gemma-4-e4b-s-128k-context-on-a-laptop-gpu-recall-is-great-is", "markdown": 
"https://wpnews.pro/news/i-stress-tested-gemma-4-e4b-s-128k-context-on-a-laptop-gpu-recall-is-great-is.md", "text": "https://wpnews.pro/news/i-stress-tested-gemma-4-e4b-s-128k-context-on-a-laptop-gpu-recall-is-great-is.txt", "jsonld": "https://wpnews.pro/news/i-stress-tested-gemma-4-e4b-s-128k-context-on-a-laptop-gpu-recall-is-great-is.jsonld"}}