Source: tomtunguz.com

Full Sail on Asynchronous Inference

Sail Research, founded by Neil Movva and Samir Menon, announced a Series A investment from Kleiner Perkins, Redpoint, and Sequoia for its asynchronous inference platform that reduces costs by up to 6x compared to real-time services. The platform uses queueing and spot capacity to optimize throughput for AI agents, serving trillions of tokens in code review, deep research, and cybersecurity.

Published Jun 25, 2026

Today all inference is real-time. A human types, a model responds, & the clock starts over. The infrastructure is built for someone waiting on the other end. Every millisecond of latency costs money because the serving stack optimizes for cold-start, not throughput.

As we built internal AI systems at Theory, we embraced queueing. Parallelize ten agents on a single task, let them run for hours, & the productivity gains are enormous. It is the product of token-maxxing, pushing every dollar of compute to do more work. But the cost was unsustainable.

[Neil Movva](https://www.linkedin.com/in/nmovva) & [Samir Menon](https://www.linkedin.com/in/samir-menon-27954214b) of Sail Research.

Neil Movva built one of the fastest LLM inference stacks at Together AI. Samir Menon ran LLMs inside hardware enclaves at Blyss. Both are systems engineers to the core. They were building the system we needed.

As the inference market segments into real-time, near-real-time, & batch, async inference sits in the batch tier & carries a massive cost advantage. The key is model selection & routing.

Sail distributes requests across open models like DeepSeek, Qwen, Kimi, & GLM, picking the cheapest capable model for each task. GLM-5.1 on Sail costs 6x less per token than Anthropic’s Haiku. Wait two minutes instead of two seconds for a code review, & the same token costs 6x less.
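A minimal sketch of "cheapest capable model" routing, for illustration only. The catalog, the capability scores, the prices, & the `route` function are all assumptions invented for this example; they do not describe Sail's actual router or any real model's pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    capability: int                  # hypothetical 1-10 skill score
    usd_per_million_tokens: float    # invented price, not real pricing


# Illustrative catalog: names, scores, & prices are made up.
CATALOG = [
    Model("open-model-small", capability=5, usd_per_million_tokens=0.20),
    Model("open-model-medium", capability=7, usd_per_million_tokens=0.50),
    Model("open-model-large", capability=9, usd_per_million_tokens=1.50),
]


def route(required_capability: int, catalog=CATALOG) -> Model:
    """Return the cheapest model whose score meets the task's requirement."""
    capable = [m for m in catalog if m.capability >= required_capability]
    if not capable:
        raise ValueError("no model in the catalog meets the requirement")
    return min(capable, key=lambda m: m.usd_per_million_tokens)


if __name__ == "__main__":
    print(route(6).name)
```

With this toy catalog, a task that needs a score of 6 skips the small model & lands on the medium one, because the large model is capable but more expensive.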

Sail uses spot capacity when it is available & fails over to reliable compute when it is not. Fleet-aware orchestration keeps utilization high & cost low.
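The spot-first, failover-second pattern can be sketched in a few lines. Everything named here (`SpotUnavailable`, `run_with_failover`, the two stand-in runners) is hypothetical & exists only to show the control flow, not how Sail's orchestrator is built.

```python
class SpotUnavailable(Exception):
    """Raised by the hypothetical spot runner when capacity is missing or preempted."""


def run_with_failover(job, run_on_spot, run_on_reserved):
    """Try cheap spot capacity first; fall back to reliable compute on failure."""
    try:
        return run_on_spot(job), "spot"
    except SpotUnavailable:
        return run_on_reserved(job), "reserved"


# Stand-in runners for the example only.
def preempted_spot(job):
    raise SpotUnavailable("no spot capacity right now")


def reserved(job):
    return f"done: {job}"


if __name__ == "__main__":
    result, tier = run_with_failover("code-review-123", preempted_spot, reserved)
    print(result, tier)
```

Because the job is queued rather than awaited by a human, the extra time spent retrying on reliable compute is acceptable, which is what makes the cheaper tier usable in the first place.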

Real-time stacks reserve capacity per request. Queued stacks pack requests into idle capacity. Different architecture, different economics.
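To make "pack requests into idle capacity" concrete, here is a small sketch under stated assumptions: requests are `(id, token_count)` pairs in a FIFO queue, & idle capacity is expressed as a token budget. The `pack_batch` helper & the numbers are illustrative, not a description of Sail's scheduler.

```python
from collections import deque


def pack_batch(queue: deque, idle_token_budget: int) -> list:
    """Pop queued requests in FIFO order until the next one no longer fits."""
    batch = []
    remaining = idle_token_budget
    while queue and queue[0][1] <= remaining:
        request = queue.popleft()
        batch.append(request)
        remaining -= request[1]
    return batch


if __name__ == "__main__":
    pending = deque([("a", 300), ("b", 500), ("c", 400)])
    print(pack_batch(pending, idle_token_budget=1000))
    print(list(pending))
```

This version stops at the first request that does not fit, which keeps ordering simple at the cost of leaving some budget unused. A real scheduler could skip ahead to smaller requests instead.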

Sailboxes are cloud computers for the bursty rhythm of agents. A sailbox stays alive as long as the agent needs, holds state across the entire task, suspends when it waits on inference, & resumes in seconds when the response arrives. You pay for active time. No paying for idle.
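A worked example of paying for active time only. The intervals, the per-second rate, & the helper names are assumptions chosen for the arithmetic; they are not Sail's pricing or API.

```python
def active_seconds(intervals):
    """Sum the durations of (start, end) active intervals, in seconds."""
    return sum(end - start for start, end in intervals)


def bill(intervals, usd_per_active_second):
    """Charge only for time the machine was active, not for time spent waiting."""
    return active_seconds(intervals) * usd_per_active_second


if __name__ == "__main__":
    # Hypothetical task: 10 s of work, a 120 s wait on inference, 10 s of work.
    intervals = [(0, 10), (130, 140)]
    wall_clock = intervals[-1][1] - intervals[0][0]
    print(active_seconds(intervals), "active seconds of", wall_clock)
    print(bill(intervals, usd_per_active_second=0.001))
```

In this made-up trace the agent occupies the machine for 140 seconds of wall-clock time but is billed for 20, since the two-minute wait on inference is not charged.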

Sail has served trillions of tokens to customers in code review, deep research, & cybersecurity.

Today we announced our Series A investment in Sail alongside Kleiner Perkins, Redpoint, & Sequoia.

As agents grow from chat assistants into background workers scanning codebases overnight, enriching every CRM row, processing every document, the vast majority of tokens will flow through a queue. The future runs in the background. We are thrilled to partner with Neil, Samir, & the entire Sail team.

If you’re building agents, get started here.

Sail Research: Cost Efficiency. Input tokens per dollar, comparing Sail’s GLM-5.1 to Anthropic Haiku 4.5 & OpenAI GPT-5.4-mini.
