{"slug": "stop-generating-what-you-already-have", "title": "Stop generating what you already have", "summary": "A developer reduced LLM extraction latency from 42 seconds to 6 seconds by replacing verbatim text copying with pointer-based extraction and splitting a single large call into multiple parallel calls. The bottleneck was output token generation, which is serial, while input processing is parallel. By asking the model for start and end phrases instead of full summaries, and parallelizing small extraction calls, the total time dropped dramatically.", "body_md": "A teammate pinged me in the morning. They were using a self-hosted LLM I maintain to convert large text documents into structured JSON. Each extraction was taking 42 to 50 seconds. They needed it faster.\n\nThe model is a 26B parameter model, AWQ quantized, running on vLLM on a single GPU. Solid setup. Not exotic hardware. The task was straightforward: feed in a long text document, get back structured fields. Names, dates, addresses, section summaries.\n\n42 seconds per document is not a latency problem. It is a design problem. I dug in.\n\n## The bottleneck is always output tokens\n\nEveryone optimizes prompt tokens. They trim context, compress system prompts, switch to shorter models. None of this matters if your problem is output latency.\n\nHere is the math. On a self-hosted vLLM endpoint, input tokens are batch-processed in parallel on the GPU. The entire prompt is consumed in one forward pass. Output tokens are generated auto-regressively, one at a time, each requiring a full forward pass through the model. Input is parallel. Output is serial.\n\nIf your extraction prompt takes 900 input tokens and generates 850 output tokens, the input processing takes maybe 200 milliseconds. The output generation takes 40 seconds. You are not waiting on the prompt. You are waiting on generation.\n\n## What the model was actually doing\n\nI logged the token breakdown for a typical extraction call. 
The model was generating 854 completion tokens. Of those, roughly 600 were summary text copied verbatim from the input document.\n\nThe LLM was acting as a copy machine.\n\nWhen you ask a model to extract a \"summary\" field from a document and put it in JSON, it does not summarize. It copies. Word for word. The same text that is already sitting in your prompt gets generated back to you one token at a time. You sent it once as input (fast, batch-parallel). It sends it back as output (slow, serial). That round trip is the entire latency problem.\n\n## The insight: ask for pointers, not content\n\nIf the model is going to copy text verbatim anyway, stop asking it to copy. Ask it for the location of the text instead.\n\nInstead of:\n\n```\n{\n  \"summary\": \"Cross-functional team building scalable frontend architecture with React and TypeScript, collaborating with designers and backend engineers to deliver accessible web applications.\"\n}\n```\n\nAsk for:\n\n```\n{\n  \"summary_start\": \"Cross-functional team\",\n  \"summary_end\": \"accessible web applications.\"\n}\n```\n\nFirst 3 words and last 3 words. 12 tokens instead of 300. Then slice the summary from the source document yourself using `str.find()`. Zero LLM tokens spent on the actual content. The model tells you where the text starts and ends. You do the copying in under 1 millisecond.\n\nThis works because the summaries in extraction pipelines are almost always verbatim copies from the source. The model is not generating new content. It is locating existing content and transcribing it. So ask it to locate, not transcribe.\n\n## Splitting the call\n\nOnce I realized output tokens were the problem, I split the single extraction call into many small parallel calls.\n\nThe original approach: one call, one prompt, one massive JSON response with every field including full summary text. 
854 output tokens, 42 seconds.\n\nThe split approach: 6 scalar extraction calls in parallel (fullname, headline, location, etc.), each generating 2 to 30 tokens. Plus one call to list the section headers found in the document. All 7 calls fire simultaneously and finish in about 3 seconds because the longest output is 30 tokens.\n\nThen a second phase: for each section found in phase 1, one parallel call extracts the metadata plus summary anchors for that section. 5 sections means 5 parallel calls, each generating about 50 tokens. Another 3 seconds.\n\nTotal: about 6 seconds. 12 calls (7 in phase 1, 5 in phase 2), parallel within each phase, instead of 1 sequential call.\n\n## Why two phases instead of flat parallelism\n\nMy first attempt split everything flat: one call per field, all in parallel. It ran in 2.2 seconds. The problem was the section-level fields. When you ask the model \"extract details for section 4,\" it sometimes skips sections, duplicates them, or invents ones that do not exist.\n\nThe model is reliable at listing what it sees. It is unreliable at counting. \"List all sections you can find\" produces a clean, complete list every time. \"Extract section 4\" produces chaos.\n\nSo phase 1 asks for the list. Phase 2 uses that list to make targeted extraction calls. The serial dependency between phases costs 3 seconds. The reliability gain is worth it.\n\n## Where this applies\n\nThis is not specific to document extraction. Any LLM pipeline where the output contains large blocks of text copied from the input has the same problem. Resume parsing, contract analysis, product page extraction, log summarization, meeting transcription. If the model is copying, you are paying serial output token costs for text you already have.\n\nThe fix is always the same. Stop asking the model to copy. Ask it to point. Do the slicing yourself.\n\n## The post-processing layer\n\nAfter both phases complete, the summary slicing runs in under 1 millisecond per section. 
Case-insensitive `str.find` locates the anchor words in the source document. Slice between them. Truncate at \"...\" markers and next-section boundaries.\n\nNo GPU time. No API call. No model involvement. Just string operations on text you already had.\n\n## What I measured\n\n| Approach | Wall time |\n|---|---|\n| Single JSON call (original) | 42-50s |\n| Per-field flat split (v1) | 33s |\n| Per-field + anchor slicing (v2) | 2.2s |\n| Two-phase + anchor slicing (final) | 6.2s |\n\nVersion 2 is faster than the final version. The final version trades 4 seconds of latency for reliability. Phase 1 guarantees a complete section list before phase 2 starts extracting. That correctness matters more than 4 seconds.\n\n## What I did not do\n\nI did not change the model. I did not change the quantization. I did not buy a bigger GPU. I did not add caching. I did not switch to a smaller model. I did not use a different framework.\n\nI changed how I asked the question. That is it.\n\nThe model was never the problem. The prompt was never the problem. The output was the problem. Specifically, asking the model to generate text it had already received as input. Remove that one thing and the latency collapses.\n\nMost LLM latency problems are output token problems in disguise. 
Profile your completions before you profile anything else.", "url": "https://wpnews.pro/news/stop-generating-what-you-already-have", "canonical_source": "https://aazar.me/posts/stop-generating-what-you-already-have", "published_at": "2026-06-26 10:30:06+00:00", "updated_at": "2026-06-26 11:06:12.270106+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["vLLM", "AWQ"], "alternates": {"html": "https://wpnews.pro/news/stop-generating-what-you-already-have", "markdown": "https://wpnews.pro/news/stop-generating-what-you-already-have.md", "text": "https://wpnews.pro/news/stop-generating-what-you-already-have.txt", "jsonld": "https://wpnews.pro/news/stop-generating-what-you-already-have.jsonld"}}