{"slug": "popping-the-gpu-bubble", "title": "Popping the GPU Bubble", "summary": "Moondream HQ reveals that GPUs often sit idle during AI model inference due to CPU overhead, a phenomenon called the 'GPU bubble.' The company's Photon system uses pipelined decoding to overlap CPU and GPU work, eliminating idle time and accelerating token generation.", "body_md": "How do you make an AI model run as fast as possible? This is a question we obsess over at\nMoondream HQ. The GPU handles all the math involved in model inference, so at first glance it\ndoesn't seem like there's much to it: just tell it what to do and wait for the answer. But if\nyou start looking at how it actually works under the hood, you find that the GPU often sits\nidle, not for lack of work, but because the CPU hasn't told it what to do next yet. This\nphenomenon is called a **GPU bubble**.\n\nWhen a typical AI model generates text, it produces one **token** at a time (a token is a\nchunk of text, roughly a few characters). Each token depends on the tokens before it, a\nproperty called *autoregressive*, so generation is sequential. You can't compute the third\ntoken before you have the second. This decode loop involves a round trip between the CPU and\nGPU. The GPU does most of the heavy lifting to run the actual model, performing billions of\narithmetic operations to produce the next token. But there's also a surprising amount of work\ndone by the CPU. It selects which requests to run next, sets up the metadata the GPU needs for\nthem, picks the actual token out of the model's output and records it, and more.\n\nThe challenge is that one token's worth of GPU work is *small*, while the CPU housekeeping is a\nfixed cost paid on every trip. If the GPU has to wait for that housekeeping before it can start\nthe next token, it sits idle for part of every loop. 
This is why we get GPU bubbles.\n\nIn this post we're going to dive into how [Photon](/p/photon) hides these bubbles using a\ntechnique called *pipelined decoding*. The idea is to overlap the two kinds of work: we start\nGPU work on the next token while the CPU is still finishing the last one.\n\n## The bubble\n\nHere's the shape of the problem.\n\nIn the blocking version (top), every step is a baton pass. The CPU plans and launches a\nforward, the GPU runs it, then the CPU *synchronizes* (waits for the results to land),\ncommits them, and only then starts planning the next step. The wait is there because the plan depends\non the token we select. For example, if the model indicates it has finished answering,\nthen we need to schedule a new pending request from our queue. The GPU sits idle waiting\nfor the CPU to finish its commit-plan-launch work.\n\nThe fix is to **pipeline the loop.** Launch the next forward\nwhile the current step's token is still coming back and being committed. That's the\n**pipelined** version (bottom): the forwards run back-to-back, and the CPU work is overlapped\nunderneath them.\n\nWe can do this because the token we just sampled doesn't have to leave the GPU. The next forward reads it straight from GPU memory as its input. We still want a copy on the CPU eventually, to detokenize it, stream it, and decide whether the request is done, but that is bookkeeping we can do a moment later, in the background, while the next forward already runs. 
Not waiting on that copy is the move that removes the bubble.\n\nMaking it safe requires three things, which we cover in the rest of this post: keeping step buffers from colliding (ping-pong slots), getting the sampling order right for constrained decoding (forward now, sample later), and cleaning up after a request finishes (zombies).\n\n## Mechanism 1: ping-pong slots\n\nTo run a decode step, the GPU needs a working set of buffers: a place to stage the input (the\nlast generated token and its position in the sequence), a place for the model to write its\noutput (the *logits*, one score per token in the vocabulary), a place to land the sampled token,\nand some bookkeeping the attention kernel needs to find each sequence's cached keys and values\n(its KV cache). We keep *pinned* (page-locked) host buffers on both ends, so the copies on and\noff the GPU run as background DMA (direct memory access) transfers instead of blocking the CPU.\n\nThese buffers are allocated once and reused on every step. We work hard to avoid performing\nGPU memory allocations at runtime, because they can cause device synchronization and introduce\nbubbles. Fixed buffer addresses are also needed for capturing the decode step once as a\n[CUDA graph](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) and replaying it,\nreducing kernel launch overhead. We call this bundle a [DecodeSlot](https://github.com/m87-labs/kestrel/blob/bb530fad318ff82c1367af4629964938cff72eaa/kestrel/models/moondream/decode_slot.py).\n\nThis works, but introduces a blocker for pipelining. The buffers stay in use until the step is done, so we cannot start the next step until the current one finishes. To overlap two steps, the second step needs its own working set; otherwise it can overwrite the results of the first step before the CPU has read them. 
So we keep two slots and alternate between them, ping-pong style.\n\nOne thing to note about launch: kernels don't execute the instant the CPU issues a launch.\nInstead, they are enqueued onto a *stream*, an ordered queue that the GPU drains in order. Work\non the same stream runs sequentially, while work on separate streams can overlap. Both slots put\ntheir forwards onto the same compute stream. The slots are not for GPU parallelism. They only\nexist so the CPU can process one slot's results while the GPU runs the other slot's forward.\n\nThe forwards all share that one compute stream, but the copies do not. Each step's\ndevice-to-host copy, the one that brings the sampled token back for bookkeeping, goes on a\n*separate* copy stream, so it can run while the GPU is busy with the next forward. That is what\nlets us not wait for it. We anchor the copy to an event recorded the instant the step's outputs\nare written, so it waits on exactly that step's work and nothing queued behind it.\n\nA slot only becomes free once its results have been read, not just once the GPU is done with it. Its pinned host buffer is the landing site for a copy that may still be in flight, so handing the slot to a new step too early would overwrite a copy mid-transfer, creating a hard-to-debug corruption bug. So the slot stays reserved through the commit that reads it, and is released only once that commit has finished.\n\n## Mechanism 2: forward now, sample later\n\nThe next forward can run ahead because it doesn't depend on anything the CPU does with the last\ntoken. But two things about the *next* step do depend on the last step's committed result. One\nis which sequences are still in the batch: if a request just finished, it shouldn't be in the\nnext forward. That is covered in the next section (zombies). The other is which tokens the next step is even\nallowed to sample, and that is the subject of this section.\n\nIt comes from *constrained decoding*. 
Moondream's spatial skills return structured output\ninstead of free text: `point` returns a coordinate, `detect` returns boxes, `segment` returns\nan outline. We get those from the same decode loop by restricting which tokens the model may\nproduce at each step: we force the scores (the *logits*) of the disallowed ones to negative\ninfinity before we sample. A `point` step has to emit a coordinate, a `detect` request walks an\nx, y, size cycle, and so on. Which tokens are allowed, the *mask*, depends on what has been\nproduced so far, so the mask for step *t+1* depends on the token we sampled at *t*.\n\nThe dependency is in *sampling*, not in the forward.\n\nEach scheduler tick goes through three phases, **launch, commit, and finalize**:\n\n1. **Launch** the forward for *t+1*. It doesn't depend on the mask, so it goes immediately.\n2. **Commit** step *t*: wait on the in-flight copy and advance the request's decode state. That is needed to decide the mask for *t+1*.\n3. **Finalize sampling** for *t+1*: with the state current, build the mask and sample.\n\nSampling *t+1* lands after committing *t* because the commit is what makes *t+1*'s mask correct.\nWe call this \"commit-before-finalize\" ordering. The GPU runs the *t+1* forward through steps 2\nand 3, so the commit disappears from the critical path.\n\nFor plain text there is no mask, so forward and sampling can both run a step ahead. For constrained sequences the forward still runs ahead, but sampling waits on the previous commit, which caps how far ahead we get with no special-casing. One loop handles both.\n\n## Mechanism 3: zombies: finalize early, release late\n\nBack in *forward now, sample later* we flagged two ways the next step depends on the last\nstep's committed result. The sampling mask was one. Batch membership is the other, and it\ntakes a bit of care to handle right.\n\nTo launch step *t+1* we first decide its batch, which sequences are in it, and we do that\nbefore committing step *t*. 
So what happens when a sequence hits its stop token at *t*, but is\nalready baked into *t+1*'s forward? You can't un-launch GPU work. The sequence is finished, yet\nstill physically present in a batch that's executing.\n\nPhoton calls these **zombies**, and instead of bolting on cancellation logic, it lets the\nbehavior emerge from two per-sequence fields:\n\n- `finalized`: `True` after the sequence has hit EOS or its length cap.\n- `inflight_refs`: the number of in-flight steps that still reference this sequence (0, 1, or 2).\n\nWhen step *t* commits and detects EOS, the sequence is marked `finalized` and its result is\nemitted, but it isn't torn down, because `inflight_refs` is still nonzero (step *t+1*\nreferences it). At step *t+1*'s commit, the sequence is already `finalized`, so the commit\nis **skipped**: no token is appended, no state mutates. The zombie was harmlessly along for\nthe ride: it occupied its slot and wrote some KV that nobody will read. Only when\n`inflight_refs` finally hits 0 are its KV pages and LoRA slot released.\n\nThis finalize-early, release-late dance is a small amount of refcounting that replaces what would otherwise be a thicket of \"cancel this row mid-flight\" special cases.\n\n## Prefill rides the same pipeline\n\nSo far this has all been about decode steps, but a real serving loop is constantly doing two\n*different* kinds of work: **prefill** (processing a new request's prompt + image, the\nexpensive one-shot forward over many tokens) and **decode** (one token at a time for everyone\nalready running).\n\nPhoton doesn't separate them. A prefill is just another `kind=\"prefill\"` launch in the\n*same* two-slot pipeline. Because the pipeline only cares that a slot is free, not what kind\nof work last used it, a prefill forward can be launched into one slot while a decode step\nfrom the other slot is still being committed, and vice versa. 
The expensive prefill forward\nruns on the GPU while the CPU commits decode results; the next decode forward runs while the\nCPU finishes admitting the just-prefilled request. The same commit ordering (and the same\n`inflight_refs` bookkeeping) keeps everything correct across the two kinds, so none of the\nzombie or constrained-decode logic needs a special case for \"what if a prefill is in flight.\"\n\nThis matters most when outputs are short. A request that emits three tokens spends almost all of its life in prefill and admission, not decode, so a workload of many short requests is really a stream of prefills with a little decode sprinkled in. Sharing one pipeline is what lets that stream overlap its own CPU bookkeeping instead of serializing prefill behind decode and back again.\n\n## A cost model for the bubble\n\nHow much should pipelining actually buy you? You can predict it from the parts of a decode step, and then check the prediction against measurement.\n\nA decode step is three pieces of work:\n\n- **forward**: the heavy GPU matmuls. At decode this is memory-bandwidth bound: every step streams the whole weight set through the cores, so it has a floor near `weight_bytes / memory_bandwidth`. It shrinks as memory gets faster or as the model gets smaller.\n- **sampling**: turning the scores into a committed token: the constrained-decode mask, the argmax/sample, the spatial (grounding) decode, and the device→host copy of the result. All GPU work.\n- **bookkeeping**: the CPU work around it. Choose the next batch (`plan`), launch the graph (`launch`), commit the previous step (`commit`).\n\nA blocking loop runs the three in series, so the GPU sits idle through the bookkeeping, and that\nidle is the bubble. Pipelining slides the bookkeeping of one step underneath the *forward +\nsampling* of the next, so the period collapses toward `forward + sampling` and the bubble\ndisappears. 
Measured per step with pipelining on, that's exactly what we see. The GPU is busy for\nessentially the whole period (steady-state medians, moondream2):\n\n| | forward (ms) | sampling (ms) | period (ms) |\n|---|---|---|---|\n| 3090 · 1 stream | 4.87 | 0.20 | 5.10 |\n| 3090 · 8 streams | 6.66 | 0.27 | 6.97 |\n| 3090 · 32 streams | 10.24 | 0.26 | 10.52 |\n| B200 · 1 stream | 2.45 | 0.14 | 2.63 |\n| B200 · 8 streams | 3.12 | 0.14 | 3.30 |\n| B200 · 32 streams | 3.80 | 0.14 | 3.98 |\n\nIn every row, `forward + sampling ≈ period`; the leftover GPU idle is under 0.05 ms. So what was hiding the bubble\nworth? It comes down to a tug-of-war between how much of a step you manage to tuck\naway and a small penalty for running ahead:\n\n```\nspeedup  =   T_block / T_pipe    ×      (1 − z)\n            └─ bubble hidden ─┘     └─ zombie tax ─┘\n```\n\nTwo terms, two ideas. The first term is the win, and it's the whole GPU-speed story: how long\na step takes blocking (`T_block`) over how long it takes pipelined (`T_pipe`), i.e. how much\nfaster the step runs once the bookkeeping is tucked underneath it.\n\nThe second, `z`, is the price of running ahead: the **zombie tax** from Mechanism 3. Launch step\n*t+1* before committing *t*, and a sequence that just finished still has a forward in flight: a\nwasted step. On a single stream that's one wasted forward for every `L` tokens the request\ngenerated, so `z ≈ 1/L`, about 1% at `L ≈ 110`. Pack a batch, though, and it nearly vanishes: the zombie is\njust one more row in a step that's already paying full price to stream the weights, so it rides\nalong almost free. 
The tax bites hardest at one stream and fades exactly where throughput lives,\nwhich is why predicting it needs both `L` and the batch size.\n\nHere's that step both ways: blocking idles each step while the CPU commits the last token and re-launches; pipelining runs that work (and the async mask upload) underneath the forward, so the forwards never stop.\n\nNow put real numbers in it. Measure each piece on its own (the two step times and `L`), and the\nmodel's prediction should land on what the benchmark actually delivers (depth-1 blocking vs\ndepth-2 pipelined, nothing else changed):\n\n| | blocking (ms) | pipelined (ms) | L | predicted | observed |\n|---|---|---|---|---|---|\n| 3090 · 1 stream | 5.44 | 5.10 | 104 | +5.7% | +6.5% |\n| 3090 · 8 streams | 7.52 | 6.97 | 113 | +7.6% | +7.8% |\n| 3090 · 32 streams | 11.74 | 10.52 | 113 | +11.1% | +11.6% |\n| B200 · 1 stream | 3.11 | 2.63 | 115 | +17.2% | +17.6% |\n| B200 · 8 streams | 4.04 | 3.30 | 115 | +22.2% | +21.9% |\n| B200 · 32 streams | 5.55 | 3.98 | 104 | +39.1% | +35.4% |\n\nThree things to read out of it:\n\n- **The win grows with GPU speed.** Same workload, +12% on a 3090 but +35% on a B200 at 32 streams. The bookkeeping is GPU-speed-independent, so as the forward shrinks (faster memory, or a smaller model) the bubble is a bigger share of the step. Pipelining is insurance against the GPU getting faster, which for us is the same thing as the model getting smaller.\n- **The zombie tax is real but small, and it amortizes.** At one stream the zombie is a whole wasted forward, about 1% at L≈110. At batch it's one extra *row* in a step that's memory-bound on the weights, not the row count, so it costs almost nothing: at 32 streams the 3090's observed +11.6% lands right on the *no-zombie* per-step ratio. The tax bites at a single stream and fades exactly where throughput lives. 
(The B200's 32-stream row sits a few points under prediction for a duller reason: at ~4 ms/step the whole run is under half a second, so prefill and the end-of-run batch ramp-down are a visible slice of the wall-clock time.)\n- **It only pays once the bubble is actually hideable.** (This is how we caught a bug, in fact: the pipelined numbers came out at *blocking* speed, traced to an accidental synchronous copy while building the constrained-decode mask. Moving it to the copy stream was worth +11% on the 3090 and +34% on the B200.)\n\n## It's never just one thing\n\nThat's the whole technique: ping-pong slots so two steps don't collide, a forward/sampling split so even constrained decoding can run ahead, and a little zombie refcounting so finished requests tear down cleanly. The GPU stops waiting on the CPU, and you get back anywhere from a few percent to a third, with larger gains the faster your accelerator or model is.\n\nBut Photon isn't fast because of this one technique, or any single technique. It's fast because dozens of these details compound across the serving stack: how we resize and tile images on the way in, the kernels that run the model, the scheduler ordering here, and the synchronization points we remove from the hot path. No one piece is the whole story; the stack gets fast when enough of them line up.\n\nWe'll keep writing these up, one corner of the stack at a time. [Follow us on Twitter](https://x.com/moondreamai)\nso you don't miss the next one. 
And keep an eye out for Photon 2.0, coming soon: we can't share\ndetails yet, but it's a big one.", "url": "https://wpnews.pro/news/popping-the-gpu-bubble", "canonical_source": "https://moondream.ai/blog/popping-the-gpu-bubble", "published_at": "2026-06-30 05:14:35+00:00", "updated_at": "2026-06-30 05:20:17.504073+00:00", "lang": "en", "topics": ["ai-infrastructure", "machine-learning", "large-language-models", "ai-tools"], "entities": ["Moondream HQ", "Photon", "GPU", "CPU"], "alternates": {"html": "https://wpnews.pro/news/popping-the-gpu-bubble", "markdown": "https://wpnews.pro/news/popping-the-gpu-bubble.md", "text": "https://wpnews.pro/news/popping-the-gpu-bubble.txt", "jsonld": "https://wpnews.pro/news/popping-the-gpu-bubble.jsonld"}}