# I stress-tested Gemma 4 E4B's 128K context on a laptop GPU — recall is great, prefill is not

> Source: <https://dev.to/yashksaini/i-stress-tested-gemma-4-e4bs-128k-context-on-a-laptop-gpu-recall-is-great-prefill-is-not-244i>
> Published: 2026-05-24 06:44:15+00:00

Thursday night I let a benchmark run while I slept. By Friday morning Gemma 4 E4B had answered twenty needle-in-a-haystack questions (five needles at each of four context sizes) on my RTX 5050 laptop. The recall numbers were better than I expected. The latency numbers were worse. Here's both, with the ~30 lines of Python to reproduce it on your own hardware.

I keep seeing "Gemma 4 E4B has a 128K context window" repeated as if it were a single property, like *"the engine is 3.5 litres"*. It is not a single property. A context-window number means at least three different things: *will the model accept this many tokens?*, *will it remember what's in the middle of them?*, and *how fast does the first answer token arrive?* The answers diverge sharply once you leave the data-center regime that most spec sheets assume and move to a laptop GPU.

This is the post I wish I'd had when I started building on E4B. The TL;DR is in the table further down. The reproducible test rig is at the bottom.

## The setup

- **Hardware:** RTX 5050 Laptop, 8 GB VRAM, 24 GB system RAM, Intel i7-13620H
- **Software:** Ollama 0.24.0, `gemma4:e4b` (Q4_K_M, ~9.6 GB on disk), Linux 7.x
- **Test:** needle-in-a-haystack: five unique 4-character codes embedded at fixed positions inside a long synthetic English document; the model has to recover each one in isolation by exact match.

The test is deliberately simple. I want to know whether the model can *find* a fact at a known position, not whether it can paraphrase it. Reasoning quality is a different benchmark and needs human evaluation, which I didn't have budget for.
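"Exact match" can be read strictly or leniently, and the choice changes what counts as a pass. Here is a minimal sketch of both readings, assuming the model may wrap the code in whitespace, punctuation, or a short sentence. The function names `strict_match` and `lenient_match` are illustrative and not part of the rig; the rig at the bottom of this post uses the lenient form (`code in ans`).

```python
import string

def strict_match(answer: str, code: str) -> bool:
    """True only if the answer is the code itself, ignoring outer whitespace and punctuation."""
    cleaned = answer.strip(string.punctuation + string.whitespace)
    return cleaned == code

def lenient_match(answer: str, code: str) -> bool:
    """True if the code appears anywhere in the answer (plain substring check)."""
    return code in answer

# "K7QZ" is a made-up code used only for illustration.
print(strict_match(" K7QZ. ", "K7QZ"))            # True
print(strict_match("The code is K7QZ", "K7QZ"))   # False
print(lenient_match("The code is K7QZ", "K7QZ"))  # True
```

The lenient form forgives a chatty model; the strict form also tests whether the system prompt ("only the 4-character code") was followed.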

I ran the sweep at 5K, 20K, 60K, and 100K target context sizes. I didn't push to the 128K spec because Ollama's `num_ctx` setting interacts with the K/V cache headroom in ways I didn't have time to characterize cleanly, and 100K is already 80% of the spec.

## The numbers

| Context | Pass rate (5/5) | Tokens/sec | Time to first token |
|---|---|---|---|
| 5K | 5/5 ✓ | 9.2 | 4 s |
| 20K | 5/5 ✓ | 8.6 | 15 s |
| 60K | 5/5 ✓ | 7.6 | 38 s |
| 100K | 5/5 ✓ | 6.8 | 72 s |

Three things stand out.

**Recall stayed perfect.** I expected E4B to wobble somewhere past 60K — that's the failure mode I see most reported for 4B-class models, the "middle of the context is fuzzy" problem. The needles at 25% and 75% are exactly where I'd expect drop-off. They held. I re-ran the sweep twice to be sure.

**Generation throughput barely moved.** 9.2 tok/s at 5K vs. 6.8 tok/s at 100K. That's a 26% drop across a 20x context increase. The K/V cache is the obvious culprit, but in practical terms: once the answer starts streaming, it streams at roughly the same speed.

**Time to first token blew up.** 4s at 5K, 72s at 100K. Almost linear in context size. This is the prefill phase — the model encoding everything you sent it before producing the first output token. On a laptop GPU, prefill is where the consumer-hardware tax lives.
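The arithmetic behind those two claims is easy to check. The sketch below copies the figures from the table above and assumes, as a simplification, that the target context size equals the real prompt token count and that TTFT is pure prefill. The helper names are illustrative.

```python
# Values copied from the table above: context size -> (tokens/sec, TTFT in seconds).
measurements = {
    5_000: (9.2, 4),
    20_000: (8.6, 15),
    60_000: (7.6, 38),
    100_000: (6.8, 72),
}

def throughput_drop_pct(results: dict) -> float:
    """Percentage drop in generation speed from the smallest to the largest context."""
    first = results[min(results)][0]
    last = results[max(results)][0]
    return (first - last) / first * 100

def prefill_rates(results: dict) -> dict:
    """Implied prefill speed in prompt tokens per second, treating TTFT as pure prefill."""
    return {ctx: ctx / ttft for ctx, (_, ttft) in results.items()}

print(f"generation drop: {throughput_drop_pct(measurements):.0f}%")
for ctx, rate in prefill_rates(measurements).items():
    print(f"{ctx:>7,} tokens: ~{rate:,.0f} prompt tokens/sec")
```

Under those assumptions the implied prefill rate stays between roughly 1,250 and 1,580 prompt tokens per second at every size, which is what "almost linear" looks like: a near-constant per-token cost multiplied by a much longer prompt.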

## What this means if you're building on E4B

Let me write the practical zones the way I actually think about them, not the marketing version:

- **Under 20K tokens:** *interactive.* First token in ~15 seconds or less, full answer in ~25-30s. This feels like a real conversation. Most single-paper Q&A lives here.
- **20K to 60K tokens:** *research-assistant.* Roughly 15-40 second TTFT. You're going to glance away from the screen. That's fine, the answer will be there when you look back. Multi-paper comparisons, longer contexts.
- **60K to 100K tokens:** *batch.* You're queuing a job. Roughly 40-75 second TTFT means you might as well make coffee. Loading a whole codebase, a textbook chapter, a quarter's worth of meeting notes.
- **Above 100K:** I didn't measure. The prefill cost was already breaching my "is this still interactive?" threshold and the use case I was solving for didn't need it.

If you're designing a UI on top of this model, *surface these zones to the user*. A progress bar or a tier label ("interactive / research / batch") tells someone what their next click will *feel* like before they ask. The 128K spec is honest; it just doesn't tell you when it'll start.
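One way to surface the zones is a small helper that labels a prompt before it is sent. This is a sketch, not a recommendation for every setup: the function names are hypothetical, the thresholds are the zone boundaries from this post, and the TTFT estimate simply interpolates between my four measurements, so it only holds for hardware similar to mine.

```python
from bisect import bisect_right
from typing import Optional

# (context tokens, TTFT seconds) as measured in this post on an RTX 5050 laptop.
MEASURED = [(5_000, 4.0), (20_000, 15.0), (60_000, 38.0), (100_000, 72.0)]

def latency_zone(tokens: int) -> str:
    """Map a prompt size to the tier labels used in this post."""
    if tokens < 20_000:
        return "interactive"
    if tokens < 60_000:
        return "research-assistant"
    if tokens <= 100_000:
        return "batch"
    return "unmeasured"

def estimate_ttft(tokens: int) -> Optional[float]:
    """Linearly interpolate TTFT between measured points; None outside the measured range."""
    xs = [x for x, _ in MEASURED]
    if tokens < xs[0] or tokens > xs[-1]:
        return None
    # Index of the right-hand neighbour, clamped so the last point still has a segment.
    i = min(bisect_right(xs, tokens), len(xs) - 1)
    (x0, y0), (x1, y1) = MEASURED[i - 1], MEASURED[i]
    return y0 + (y1 - y0) * (tokens - x0) / (x1 - x0)

print(latency_zone(40_000), estimate_ttft(40_000))  # research-assistant 26.5
```

Returning `None` outside the measured range is deliberate: extrapolating past 100K would present a guess as a measurement.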

## Reproduce it yourself

The whole rig is about 30 lines once you strip the CLI scaffolding. Save this as `bench.py`, install `ollama` (`pip install ollama`), then run it:

``` python
import random, time
import ollama

MODEL = "gemma4:e4b"
NEEDLE_POSITIONS = [0.05, 0.25, 0.50, 0.75, 0.95]

def make_needles(k=5, seed=20260521):
    rng = random.Random(seed)
    chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return [(f"box-{i+1}", "".join(rng.choices(chars, k=4))) for i in range(k)]

def build_haystack(target_tokens: int, needles):
    # Filler: one ~160-character sentence of English-ish prose, repeated.
    filler = (
        "The committee continued its review of the operational notes "
        "submitted during the prior fiscal quarter, with particular "
        "attention paid to procedural anomalies. "
    )
    # Over-generate (~8 characters per target token), then trim to ~4
    # characters per target token, a rough chars-per-token estimate for English.
    sentences_needed = target_tokens // 20
    body = (filler * sentences_needed)[: target_tokens * 4]
    # Splice needles in at fixed positions
    out = body
    for pos, (label, code) in zip(NEEDLE_POSITIONS, needles):
        i = int(pos * len(out))
        out = out[:i] + f"\n\nNote: {label} contains the code {code}.\n\n" + out[i:]
    return out

def ask(haystack: str, label: str, num_ctx: int) -> tuple[str, float, float]:
    t0 = time.time()
    first_t = None
    chunks = []
    for r in ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer with only the 4-character code, nothing else."},
            {"role": "user", "content": haystack + f"\n\nWhat code is in {label}?"},
        ],
        stream=True,
        options={"num_ctx": num_ctx},
    ):
        delta = r.get("message", {}).get("content", "")
        if delta:
            first_t = first_t or time.time()
            chunks.append(delta)
    answer = "".join(chunks).strip()
    return answer, (first_t - t0) if first_t else 0, time.time() - t0

if __name__ == "__main__":
    needles = make_needles()
    for ctx in (5_000, 20_000, 60_000, 100_000):
        hay = build_haystack(ctx, needles)
        passed = 0
        for label, code in needles:
            ans, ttft, total = ask(hay, label, num_ctx=ctx + 4_000)
            passed += code in ans
            print(f"  ctx={ctx:>6,}  {label}  expected={code}  got={ans!r}  ttft={ttft:.1f}s  total={total:.1f}s")
        print(f"ctx={ctx:>6,}  pass={passed}/{len(needles)}")
```

It writes to stdout. If you want JSON-lines results to plot, redirect to a file and parse the `ctx=… pass=…` lines. The whole sweep takes ~30 minutes on an RTX 5050; longer on smaller GPUs.
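A minimal sketch of that parsing step, assuming the output format printed by the rig above. The regex, the function name `summaries_to_jsonl`, and the sample lines are illustrative; the `AB12` code in the sample is a placeholder, not a real needle.

```python
import json
import re

# Matches the per-context summary lines the rig prints, e.g. "ctx=100,000  pass=5/5".
# Per-needle detail lines start with two spaces, so they never match "^ctx=".
SUMMARY = re.compile(r"^ctx=\s*([\d,]+)\s+pass=(\d+)/(\d+)\s*$")

def summaries_to_jsonl(lines):
    """Yield one JSON string per summary line; all other lines are skipped."""
    for line in lines:
        m = SUMMARY.match(line)
        if m:
            ctx, passed, total = m.groups()
            yield json.dumps({
                "ctx": int(ctx.replace(",", "")),
                "passed": int(passed),
                "total": int(total),
            })

# Sample lines in the rig's format, for illustration only.
sample = [
    "  ctx= 5,000  box-1  expected=AB12  got='AB12'  ttft=4.0s  total=5.1s\n",
    "ctx= 5,000  pass=5/5\n",
    "ctx=100,000  pass=5/5\n",
]
for record in summaries_to_jsonl(sample):
    print(record)
```

To run it on real output, redirect the rig to a file (`python bench.py > results.txt`, where the filename is your choice) and pass the open file object in place of `sample`.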

The seed is fixed (`20260521`) so the needle strings are deterministic. If your pass rate doesn't match mine at the same `(model, ctx, seed)`, that's a real signal: likely a difference in Ollama version, quantization, or hardware-driver path.
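The rig reports TTFT and wall-clock time but not tokens per second. One way to get that number is sketched below, assuming the final streamed chunk carries Ollama's timing fields (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds). The helper name `rates_from_metrics` is illustrative, and this is not necessarily how the table above was produced.

```python
def rates_from_metrics(metrics: dict) -> dict:
    """Convert Ollama-style counters (durations in nanoseconds) into tokens per second."""
    def rate(count_key: str, duration_key: str):
        count = metrics.get(count_key)
        duration_ns = metrics.get(duration_key)
        if not count or not duration_ns:
            return None  # field missing or zero, e.g. this is not the final chunk
        return count / (duration_ns / 1e9)

    return {
        "prefill_tok_per_s": rate("prompt_eval_count", "prompt_eval_duration"),
        "generation_tok_per_s": rate("eval_count", "eval_duration"),
    }

# Illustrative numbers only, not measurements from this post.
example = {
    "prompt_eval_count": 2_000,
    "prompt_eval_duration": 2_000_000_000,  # 2 seconds
    "eval_count": 10,
    "eval_duration": 1_000_000_000,         # 1 second
}
print(rates_from_metrics(example))
# {'prefill_tok_per_s': 1000.0, 'generation_tok_per_s': 10.0}
```

In `ask`, you could keep the last chunk from the streaming loop and pass it in. It is also worth comparing `prompt_eval_count` against the size of the haystack you sent: a much smaller number suggests the prompt was truncated or that part of it was served from a cache, either of which would change what the TTFT figure means.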

## Things this rig deliberately doesn't measure

**Quality of paraphrase.** The needles are literal 4-character codes. I'm measuring *can the model find it?*, not *can the model reason about it?*. Those are different benchmarks.

**VRAM consumption.** Ollama owns the K/V cache and I'm not going to fight it for memory accounting. `nvidia-smi` says it sits around 7.4 GB at 100K context, but I haven't characterized the curve.
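If you want to characterize that curve yourself, a rough starting point is to sample `nvidia-smi` once per context size while the model is loaded. This is a sketch under assumptions: it relies on `nvidia-smi` being on the PATH, it reports total used memory per GPU (not just Ollama's share), and the function names are mine.

```python
import subprocess

def parse_memory_used_mib(output: str) -> list[int]:
    """Parse one integer (MiB used) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]

def sample_vram_mib() -> list[int]:
    """Run nvidia-smi once and return used memory per GPU in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi is missing or fails
    )
    return parse_memory_used_mib(result.stdout)
```

Calling `sample_vram_mib()` right after each `ask(...)` returns, while the model is still resident, gives one data point per context size. Anything else using the GPU (a desktop compositor, a browser) is included in the reading, so treat the numbers as an upper bound on what the model itself holds.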

**Cross-document attention.** Each needle is asked in isolation. Multi-fact composition ("how does the figure on page 12 of paper A relate to section 3 of paper B?") is a different problem. I don't have a clean benchmark for it. I'm working on it.

## The honest comparison

Qwen 3.5 27B has ~190K effective context if you have the hardware to hold it. Llama 3.1 70B (if you can fit it) is specced at the same 128K. On *raw context size alone*, Gemma 4 E4B isn't the winner.

What E4B *is* the winner at is the **combination**: 128K context + native vision + native audio + ~9.6 GB on disk, all in one model. That combination is what makes whole-document workloads tractable on a laptop. Qwen 27B doesn't fit in 8 GB of VRAM. Llama 3.1 70B doesn't either. If your hardware constraint is "consumer GPU", E4B is the only model in this class with 128K context *and* multimodality.

That's the framing I'd give someone choosing an open-weights model for a single-machine deployment in 2026.

## Three places I'd take this benchmark next

- **Mixed-modality recall.** Embed half the needles in text, half in rendered images. See if vision-encoded needles degrade differently from text-encoded ones. (This is the part most relevant to anyone building doc-Q&A.)
- **Cross-document needles.** Two documents in context, the needle in paper A, the question phrased to require paper B's vocabulary. The actual "I have a library, I want to ask questions" workload.
- **Long-document Q&A with human evaluation.** Pay five grad students to grade 100 questions about a single 25-page research paper. Real quality numbers, not synthetic ones.

If you run any of these, I'd genuinely like to read the results.

**Connect with me:**

• [Website](https://yashksaini.vercel.app/)

• [GitHub](https://github.com/yashksaini-coder)

• [LinkedIn](https://www.linkedin.com/in/yashksaini/)

• [X (Twitter)](https://x.com/0xcrackedDev)
