# Full Sail on Asynchronous Inference

> Source: <https://www.tomtunguz.com/sail-inference-queue/>
> Published: 2026-06-25 00:00:00+00:00

Today all inference is real-time. A human types, a model responds, & the clock starts over. The infrastructure is built for someone waiting on the other end. Every millisecond of latency costs money because the serving stack optimizes for cold-start, not throughput.

As we built internal AI systems at Theory, we embraced queueing. Parallelize ten agents on a single task, let them run for hours, & the productivity gains are enormous. It is the product of token-maxxing, 1 pushing every dollar of compute to do more work. But the cost was unsustainable.

[Neil Movva](https://www.linkedin.com/in/nmovva)&

[Samir Menon](https://www.linkedin.com/in/samir-menon-27954214b)of Sail Research.

[2](#fn:2)[Neil Movva](https://www.linkedin.com/in/nmovva) built one of the fastest LLM inference stacks at Together AI. [Samir Menon](https://www.linkedin.com/in/samir-menon-27954214b) ran LLMs inside hardware enclaves at Blyss. Both are systems engineers to the core. They were building the system we needed.

As the inference market segments into real-time, near-real-time, & batch, async inference sits in the batch tier & carries a massive cost advantage. 3 The key is model selection & routing.

Sail distributes requests across open models like DeepSeek, Qwen, Kimi, & GLM, picking the cheapest capable model for each task. GLM-5.1 on Sail costs 6x less per token than Anthropic’s Haiku. 4 Wait two minutes instead of two seconds for a code review, & the same token costs 6x less.

Sail uses spot capacity when it is available & fails over to reliable compute when it is not. Fleet-aware orchestration keeps utilization high & cost low.

Real-time stacks reserve capacity per request. Queued stacks pack requests into idle capacity. Different architecture, different economics.

Sailboxes are cloud computers for the bursty rhythm of agents. A sailbox stays alive as long as the agent needs, holds state across the entire task, pauses when it waits on inference, & resumes in seconds when the response arrives. You pay for active time. No paying for idle.

Sail has served trillions of tokens to customers in code review, deep research, & cybersecurity.[2](#fn:2)

Today we announced our Series A investment in Sail alongside Kleiner Perkins, Redpoint, & Sequoia.

As agents grow from chat assistants into background workers scanning codebases overnight, enriching every CRM row, processing every document, the vast majority of tokens will flow through a queue. The future runs in the background. We are thrilled to partner with Neil, Samir, & the entire Sail team.

If you’re building agents, [get started here](https://docs.sailresearch.com/).

-
[Sail Research : Cost Efficiency](https://www.sailresearch.com/)- Input tokens per dollar, comparing Sail’s GLM-5.1 to Anthropic Haiku 4.5 & OpenAI GPT-5.4-mini.[↩︎](#fnref:4)
