{"slug": "full-sail-on-asynchronous-inference", "title": "Full Sail on Asynchronous Inference", "summary": "Theory announced a Series A investment in Sail Research, founded by Neil Movva and Samir Menon, alongside Kleiner Perkins, Redpoint, and Sequoia. Sail's asynchronous inference platform queues requests, routes each task to the cheapest capable open model, and uses spot capacity to maximize throughput for AI agents; GLM-5.1 on Sail costs about 6x less per input token than Anthropic's Haiku 4.5. Sail has served trillions of tokens in code review, deep research, and cybersecurity.", "body_md": "Today most inference is real-time. A human types, a model responds, & the clock starts over. The infrastructure is built for someone waiting on the other end. Every millisecond of latency costs money because the serving stack optimizes for cold-start, not throughput.\n\nAs we built internal AI systems at Theory, we embraced queueing. Parallelize ten agents on a single task, let them run for hours, & the productivity gains are enormous. It is the product of token-maxxing,[1](#fn:1) pushing every dollar of compute to do more work. But the cost was unsustainable.\n\n[Neil Movva](https://www.linkedin.com/in/nmovva) & [Samir Menon](https://www.linkedin.com/in/samir-menon-27954214b) of Sail Research.[2](#fn:2)\n\n[Neil Movva](https://www.linkedin.com/in/nmovva) built one of the fastest LLM inference stacks at Together AI. [Samir Menon](https://www.linkedin.com/in/samir-menon-27954214b) ran LLMs inside hardware enclaves at Blyss. Both are systems engineers to the core. They were building the system we needed.\n\nAs the inference market segments into real-time, near-real-time, & batch tiers, async inference sits in the batch tier & carries a massive cost advantage.[3](#fn:3) The key is model selection & routing.\n\nSail distributes requests across open models like DeepSeek, Qwen, Kimi, & GLM, picking the cheapest capable model for each task. GLM-5.1 on Sail costs 6x less per input token than Anthropic’s Haiku 4.5.[4](#fn:4) Wait two minutes instead of two seconds for a code review, & each token costs 6x less.\n\nSail uses spot capacity when it is available & fails over to reliable compute when it is not. Fleet-aware orchestration keeps utilization high & cost low.\n\nReal-time stacks reserve capacity per request. Queued stacks pack requests into idle capacity. Different architecture, different economics.\n\nSailboxes are cloud computers built for the bursty rhythm of agents. A sailbox stays alive as long as the agent needs it, holds state across the entire task, pauses while it waits on inference, & resumes in seconds when the response arrives. You pay for active time, not idle time.\n\nSail has served trillions of tokens to customers in code review, deep research, & cybersecurity.[2](#fn:2)\n\nToday we announced our Series A investment in Sail alongside Kleiner Perkins, Redpoint, & Sequoia.\n\nAs agents grow from chat assistants into background workers that scan codebases overnight, enrich every CRM row, & process every document, the vast majority of tokens will flow through a queue. The future runs in the background. 
We are thrilled to partner with Neil, Samir, & the entire Sail team.\n\nIf you’re building agents, [get started here](https://docs.sailresearch.com/).\n\n4. [Sail Research: Cost Efficiency](https://www.sailresearch.com/). Input tokens per dollar, comparing Sail’s GLM-5.1 to Anthropic Haiku 4.5 & OpenAI GPT-5.4-mini.", "url": "https://wpnews.pro/news/full-sail-on-asynchronous-inference", "canonical_source": "https://www.tomtunguz.com/sail-inference-queue/", "published_at": "2026-06-25 00:00:00+00:00", "updated_at": "2026-06-25 20:17:05.064383+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-agents", "ai-startups", "large-language-models", "ai-tools"], "entities": ["Sail Research", "Neil Movva", "Samir Menon", "Kleiner Perkins", "Redpoint", "Sequoia", "Together AI", "Blyss"], "alternates": {"html": "https://wpnews.pro/news/full-sail-on-asynchronous-inference", "markdown": "https://wpnews.pro/news/full-sail-on-asynchronous-inference.md", "text": "https://wpnews.pro/news/full-sail-on-asynchronous-inference.txt", "jsonld": "https://wpnews.pro/news/full-sail-on-asynchronous-inference.jsonld"}}