Never Waste a Token A new durable inference technique uses a separate buffer between agents and LLM providers to prevent token waste when processes crash mid-stream, saving costs by resuming streams rather than re-requesting tokens. never waste a token durable inference: resumable streams, crash recovery, and why the LLM request shouldn't die with your process. this post itself is LLM slop, but it tastes alright tl;dr - put a durable buffer between your agent and the LLM provider. the provider connection now outlives your process, so a deploy in the middle of a stream doesn’t cost you the tokens you already paid for. and the same buffer that lets a disconnected browser catch back up is the thing that recovers a crashed turn. one log, two readers. I’ve spent the last few weeks stuck on one question: what happens to an agent when the process running it dies in the middle of a turn? it goes deep fast. tool calls that may or may not have fired. sub-agents. half-written streams waiting on a human. I’m writing all of that up separately durable agent loops, coming soon . but one piece of it is small and self-contained enough to pull out on its own: when your process dies mid-inference, you don’t just lose your place. you lose money. the problem that’s easy to miss your agent opens a streaming request to a model, and the model starts generating. you’re billed for those output tokens the moment they’re generated. then your process gets replaced. maybe a deploy, maybe an eviction, maybe an OOM. the usual reassurance is “don’t worry, the state is durable.” and sure, your conversation history survived. but the in-flight HTTP request to the provider did not. it lived in the memory of the process that just died. so when you recover, your only option is to make the call again . you pay for those output tokens a second time. now make it an agent. a real one does multiple tool calls in a single turn: user message → stream some text → tool call → tool result → stream more text → tool call → tool result → stream the answer every interruption throws away all the output tokens generated so far in that turn. and it scales with the model you actually want to use: output runs $30 per million tokens on gpt-5.5 versus $2 on gpt-5.5-mini , so a flagship retry burns ~15x what a mini one does. the better the model, the more it hurts. deploys happen constantly, evictions happen constantly, and each one that lands on a live stream is money straight out the window. the happy path hides it. you only see it when you start counting tokens after an incident and the numbers don’t add up. the move: stop tying the request to the process the reason a crash wastes tokens is that the provider connection lives inside the thing that crashed . so move it out. put a buffer between the agent and the provider, and make it a separate deployment : its own Worker, its own Durable Object. when a request comes in, the buffer does three things in order. it resets its state for a fresh stream. it kicks off a background task that drains the provider connection into SQLite. and it immediately hands the caller back a stream that tails those same rows as they land: async proxyAndBuffer req: ProviderRequest : Promise