# Stop generating what you already have

> Source: <https://aazar.me/posts/stop-generating-what-you-already-have>
> Published: 2026-06-26 10:30:06+00:00

A teammate pinged me in the morning. They were using a self-hosted LLM I maintain to convert large text documents into structured JSON. Each extraction was taking 42 to 50 seconds. They needed it faster.

The model is a 26B parameter model, AWQ quantized, running on vLLM on a single GPU. Solid setup. Not exotic hardware. The task was straightforward: feed in a long text document, get back structured fields. Names, dates, addresses, section summaries.

42 seconds per document is not a latency problem. It is a design problem. I dug in.

## The bottleneck is always output tokens

Everyone optimizes prompt tokens. They trim context, compress system prompts, switch to smaller models. None of this matters if your problem is output latency.

Here is the math. On a self-hosted vLLM endpoint, input tokens are batch-processed in parallel on the GPU. The entire prompt is consumed in one forward pass. Output tokens are generated auto-regressively, one at a time, each requiring a full forward pass through the model. Input is parallel. Output is serial.

If your extraction prompt takes 900 input tokens and generates 850 output tokens, the input processing takes maybe 200 milliseconds. The output generation takes 40 seconds. You are not waiting on the prompt. You are waiting on generation.
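To make that arithmetic concrete, here is a back-of-the-envelope estimator. The two throughput figures are assumptions back-derived from the example numbers above (900 tokens in 0.2 s of prefill, 850 tokens in 40 s of decoding), not measurements; real rates depend on the model, the hardware, and the batch load.

```python
def estimate_latency_s(
    prompt_tokens: int,
    completion_tokens: int,
    prefill_tokens_per_s: float = 4500.0,  # assumed: 900 tokens / 0.2 s
    decode_tokens_per_s: float = 21.25,    # assumed: 850 tokens / 40 s
) -> float:
    """Rough wall time: parallel prefill plus serial token-by-token decoding."""
    prefill_s = prompt_tokens / prefill_tokens_per_s
    decode_s = completion_tokens / decode_tokens_per_s
    return prefill_s + decode_s


full_json = estimate_latency_s(900, 850)    # full response with copied text
short_output = estimate_latency_s(900, 50)  # same prompt, 50 output tokens

print(f"full JSON: {full_json:.1f}s, short output: {short_output:.1f}s")
```

Under these assumptions the prompt contributes a fraction of a second either way. Almost all of the difference comes from the number of completion tokens.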

## What the model was actually doing

I logged the token breakdown for a typical extraction call. The model was generating 854 completion tokens. Of those, roughly 600 were summary text copied verbatim from the input document.
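One way to get that breakdown, sketched under the assumption that the endpoint is vLLM's OpenAI-compatible server, which returns a `usage` object with `prompt_tokens` and `completion_tokens` on each response. The helper name `token_breakdown` and the sample payload are illustrative; the 900-token prompt is the hypothetical figure from the previous section, not a logged value.

```python
def token_breakdown(response: dict) -> dict:
    """Summarize prompt vs. completion tokens from an OpenAI-style response."""
    usage = response.get("usage", {})
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    total = prompt + completion
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        # Share of all tokens that had to be generated serially.
        "completion_share": completion / total if total else 0.0,
    }


# Illustrative payload shaped like a chat completion response.
sample_response = {
    "usage": {"prompt_tokens": 900, "completion_tokens": 854, "total_tokens": 1754}
}
breakdown = token_breakdown(sample_response)
print(breakdown)
```

Logging this per call is usually enough to see whether the completion side dominates before looking at anything else.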

The LLM was acting as a copy machine.

When you ask a model to extract a "summary" field from a document and put it in JSON, it does not summarize. It copies. Word for word. The same text that is already sitting in your prompt gets generated back to you one token at a time. You sent it once as input (fast, batch-parallel). It sends it back as output (slow, serial). That round trip is the entire latency problem.

## The insight: ask for pointers, not content

If the model is going to copy text verbatim anyway, stop asking it to copy. Ask it for the location of the text instead.

Instead of:

```
{
  "summary": "Cross-functional team building scalable frontend architecture with React and TypeScript, collaborating with designers and backend engineers to deliver accessible web applications."
}
```

Ask for:

```
{
  "summary_start": "Cross-functional team",
  "summary_end": "accessible web applications."
}
```

First 3 words and last 3 words. 12 tokens instead of 300. Then slice the summary from the source document yourself using `str.find()`. Zero LLM tokens spent on the actual content. The model tells you where the text starts and ends. You do the copying in under 1 millisecond.

This works because the summaries in extraction pipelines are almost always verbatim copies from the source. The model is not generating new content. It is locating existing content and transcribing it. So ask it to locate, not transcribe.
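A minimal sketch of the slicing step, assuming the model returned the `summary_start` and `summary_end` anchors shown above. The function name `slice_between` and the sample document are made up for illustration; the only real API used is `str.find()`.

```python
from typing import Optional


def slice_between(source: str, start_anchor: str, end_anchor: str) -> Optional[str]:
    """Return the text from start_anchor through end_anchor, or None if not found."""
    start = source.find(start_anchor)
    if start == -1:
        return None
    # Look for the end anchor only after the start anchor.
    end = source.find(end_anchor, start + len(start_anchor))
    if end == -1:
        return None
    return source[start:end + len(end_anchor)]


document = (
    "About the team. Cross-functional team building scalable frontend "
    "architecture with React and TypeScript, collaborating with designers "
    "and backend engineers to deliver accessible web applications. "
    "Next section starts here."
)
anchors = {
    "summary_start": "Cross-functional team",
    "summary_end": "accessible web applications.",
}
summary = slice_between(document, anchors["summary_start"], anchors["summary_end"])
print(summary)
```

Returning `None` when an anchor is missing gives the caller a clear signal to fall back, for example to a full-text extraction call for that one field.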

## Splitting the call

Once I realized output tokens were the problem, I split the single extraction call into many small parallel calls.

The original approach: one call, one prompt, one massive JSON response with every field including full summary text. 854 output tokens, 42 seconds.

The split approach: 6 scalar extraction calls in parallel (fullname, headline, location, etc.), each generating 2 to 30 tokens. Plus one call to list the section headers found in the document. All 7 calls fire simultaneously and finish in about 3 seconds because the longest output is 30 tokens.

Then a second phase: for each section found in phase 1, one parallel call extracts the metadata plus summary anchors for that section. 5 sections means 5 parallel calls, each generating about 50 tokens. Another 3 seconds.

Total: about 6 seconds. With 5 sections, that is 12 calls in two parallel batches instead of 1 sequential call.
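The shape of that pipeline can be sketched with `asyncio.gather`. This is an illustration, not the production code: `call_llm` is a hypothetical stub standing in for a request to the model endpoint, the prompts are placeholders, and only the three scalar fields named above are listed.

```python
import asyncio

# The post names these three scalar fields; the remaining ones are elided there.
SCALAR_FIELDS = ["fullname", "headline", "location"]


async def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for one request to the model endpoint."""
    await asyncio.sleep(0.01)  # simulate network plus generation time
    if prompt.startswith("List the section headers"):
        return "Experience\nEducation"
    return "stub answer"


async def extract(document: str) -> dict:
    # Phase 1: every scalar field plus the section list, all in flight at once.
    *scalars, section_list = await asyncio.gather(
        *(call_llm(f"Extract the {field} from:\n{document}") for field in SCALAR_FIELDS),
        call_llm(f"List the section headers in:\n{document}"),
    )
    sections = [line.strip() for line in section_list.splitlines() if line.strip()]

    # Phase 2: one call per section found in phase 1, again in parallel.
    details = await asyncio.gather(
        *(call_llm(f"Extract metadata and summary anchors for section "
                   f"{name!r} from:\n{document}") for name in sections)
    )
    return {
        "fields": dict(zip(SCALAR_FIELDS, scalars)),
        "sections": dict(zip(sections, details)),
    }


result = asyncio.run(extract("example document text"))
print(result)
```

Each phase takes roughly as long as its slowest call, so wall time tracks the longest single output rather than the sum of all outputs.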

## Why two phases instead of flat parallelism

My first anchor-based attempt split everything flat: one call per field, all in parallel. It ran in 2.2 seconds. The problem was the section-level fields. When you ask the model "extract details for section 4," it sometimes skips sections, duplicates them, or invents ones that do not exist.

The model is reliable at listing what it sees. It is unreliable at counting. "List all sections you can find" produces a clean, complete list every time. "Extract section 4" produces chaos.

So phase 1 asks for the list. Phase 2 uses that list to make targeted extraction calls. The serial dependency between phases costs 3 seconds. The reliability gain is worth it.
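The listing call was reliable in practice, so the following is optional. It is a cheap guard one could run between the phases, not something described above: drop duplicate headers and any header that does not literally appear in the source. The function name `clean_section_list` is hypothetical.

```python
def clean_section_list(source: str, listed: list[str]) -> list[str]:
    """Keep each listed header once, and only if it occurs in the source text."""
    haystack = source.lower()
    seen = set()
    kept = []
    for name in listed:
        key = name.strip().lower()
        # Skip blanks, repeats, and headers the document does not contain.
        if not key or key in seen or key not in haystack:
            continue
        seen.add(key)
        kept.append(name.strip())
    return kept


source_text = "Experience\nBuilt things.\nEducation\nStudied things."
cleaned = clean_section_list(
    source_text, ["Experience", "experience", "Awards", " Education "]
)
print(cleaned)
```

Since this is plain string matching, it adds no measurable latency to the gap between phase 1 and phase 2.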

## Where this applies

This is not specific to document extraction. Any LLM pipeline where the output contains large blocks of text copied from the input has the same problem. Resume parsing, contract analysis, product page extraction, log summarization, meeting transcription. If the model is copying, you are paying serial output token costs for text you already have.

The fix is always the same. Stop asking the model to copy. Ask it to point. Do the slicing yourself.

## The post-processing layer

After both phases complete, the summary slicing runs in under 1 millisecond per section. Case-insensitive `str.find` locates the anchor words in the source document. Slice between them. Truncate at "..." markers and next-section boundaries.

No GPU time. No API call. No model involvement. Just string operations on text you already had.
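A sketch of what such a post-processing function might look like. The truncation rules are my reading of the description above, not the actual implementation: the slice is cut at the first "..." or next-section header that appears after the start anchor. The name `slice_summary` and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def slice_summary(
    source: str,
    start_anchor: str,
    end_anchor: str,
    section_headers: Sequence[str] = (),
) -> Optional[str]:
    """Case-insensitive anchor slice, truncated at '...' or the next header."""
    # Assumption: lowercasing keeps string length unchanged (true for ASCII),
    # so indices found in the lowered copy are valid in the original.
    haystack = source.lower()
    start = haystack.find(start_anchor.lower())
    if start == -1:
        return None
    body_from = start + len(start_anchor)
    end = haystack.find(end_anchor.lower(), body_from)
    if end == -1:
        return None
    stop = end + len(end_anchor)
    # Cut early if a truncation marker or another section header intervenes.
    for marker in ("...", *section_headers):
        cut = haystack.find(marker.lower(), body_from, stop)
        if cut != -1:
            stop = min(stop, cut)
    return source[start:stop].strip()


text = "Experience\nLed The Platform team. Shipped billing.\nEducation\nStudied physics."
in_section = slice_summary(text, "led the platform", "shipped billing.")
overshoot = slice_summary(text, "led the platform", "studied physics.", ["Education"])
print(in_section)
print(overshoot)
```

The second call shows the boundary rule: the end anchor matches text in a later section, so the slice stops at the `Education` header instead of running into it.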

## What I measured

| Approach | Wall time |
|---|---|
| Single JSON call (original) | 42-50s |
| Per-field flat split (v1) | 33s |
| Per-field + anchor slicing (v2) | 2.2s |
| Two-phase + anchor slicing (final) | 6.2s |

Version 2 is faster than the final version. The final version trades 4 seconds of latency for reliability. Phase 1 guarantees a complete section list before phase 2 starts extracting. That correctness matters more than 4 seconds.

## What I did not do

I did not change the model. I did not change the quantization. I did not buy a bigger GPU. I did not add caching. I did not switch to a smaller model. I did not use a different framework.

I changed how I asked the question. That is it.

The model was never the problem. The prompt was never the problem. The output was the problem. Specifically, asking the model to generate text it had already received as input. Remove that one thing and the latency collapses.

Most LLM latency problems are output token problems in disguise. Profile your completions before you profile anything else.
