Full Sail on Asynchronous Inference

Sail Research, founded by Neil Movva and Samir Menon, announced a Series A investment from Kleiner Perkins, Redpoint, and Sequoia for its asynchronous inference platform that reduces costs by up to 6x compared to real-time services. The platform uses queueing and spot capacity to optimize throughput for AI agents, serving trillions of tokens in code review, deep research, and cybersecurity.

Today all inference is real-time. A human types, a model responds, & the clock starts over. The infrastructure is built for someone waiting on the other end. Every millisecond of latency costs money because the serving stack optimizes for cold-start, not throughput. As we built internal AI systems at Theory, we embraced queueing. Parallelize ten agents on a single task, let them run for hours, & the productivity gains are enormous. It is the product of token-maxxing, 1 pushing every dollar of compute to do more work. But the cost was unsustainable. Neil Movva https://www.linkedin.com/in/nmovva & Samir Menon https://www.linkedin.com/in/samir-menon-27954214b of Sail Research. 2 fn:2 Neil Movva https://www.linkedin.com/in/nmovva built one of the fastest LLM inference stacks at Together AI. Samir Menon https://www.linkedin.com/in/samir-menon-27954214b ran LLMs inside hardware enclaves at Blyss. Both are systems engineers to the core. They were building the system we needed. As the inference market segments into real-time, near-real-time, & batch, async inference sits in the batch tier & carries a massive cost advantage. 3 The key is model selection & routing. Sail distributes requests across open models like DeepSeek, Qwen, Kimi, & GLM, picking the cheapest capable model for each task. GLM-5.1 on Sail costs 6x less per token than Anthropic’s Haiku. 4 Wait two minutes instead of two seconds for a code review, & the same token costs 6x less. Sail uses spot capacity when it is available & fails over to reliable compute when it is not. Fleet-aware orchestration keeps utilization high & cost low. Real-time stacks reserve capacity per request. Queued stacks pack requests into idle capacity. Different architecture, different economics. Sailboxes are cloud computers for the bursty rhythm of agents. A sailbox stays alive as long as the agent needs, holds state across the entire task, pauses when it waits on inference, & resumes in seconds when the response arrives. You pay for active time. No paying for idle. Sail has served trillions of tokens to customers in code review, deep research, & cybersecurity. 2 fn:2 Today we announced our Series A investment in Sail alongside Kleiner Perkins, Redpoint, & Sequoia. As agents grow from chat assistants into background workers scanning codebases overnight, enriching every CRM row, processing every document, the vast majority of tokens will flow through a queue. The future runs in the background. We are thrilled to partner with Neil, Samir, & the entire Sail team. If you’re building agents, get started here https://docs.sailresearch.com/ . - Sail Research : Cost Efficiency https://www.sailresearch.com/ - Input tokens per dollar, comparing Sail’s GLM-5.1 to Anthropic Haiku 4.5 & OpenAI GPT-5.4-mini. ↩︎ fnref:4