We're Open Sourcing Our Voice AI Latency Benchmarking Tool

Last month, a 340ms spike in our TTS pipeline caused 12% of Loquent callers to talk over the AI mid-response. We didn't catch it for six hours because we were measuring the wrong thing: average latency instead of tail latency at each pipeline stage. That incident is why we built vox-bench, and why we're releasing it today.

When you're building a voice AI agent that handles thousands of live phone calls per month — dental appointment bookings, patient intake, after-hours triage — latency isn't a nice-to-have metric. It's the difference between a conversation that feels human and one that feels like talking to a broken IVR.

Our Loquent pipeline has five stages: Twilio media stream ingestion, speech-to-text via Deepgram, LLM inference via Anthropic Claude (with OpenAI as fallback), text-to-speech via ElevenLabs, and audio streaming back through Twilio. Each stage adds time. The total round-trip — from the moment a caller stops speaking to the moment they hear the AI respond — needs to stay under 800ms to feel natural. Go above 1.2 seconds and callers start repeating themselves. Go above 1.8 seconds and they hang up.

We know these numbers because we tracked them across 10,000+ calls over six months of running Loquent in production. But for the first four months, we were tracking them wrong.

Our original monitoring was simple: we logged a timestamp when audio came in from Twilio and another when we sent audio back. Total round-trip time. One number. And for a while, it looked great — averaging around 650ms.

The problem was that average told us almost nothing. When our ElevenLabs latency spiked from 120ms p50 to 340ms p95 during a provider-side deployment, our total average barely moved — from 650ms to 710ms. Still "fine" by our alerting thresholds. But 12% of calls were hitting 1.4+ second response times, and those callers were already talking again before the AI responded. The result was conversational chaos — interrupted responses, repeated questions, callers saying "hello? are you there?"
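To make the failure mode concrete, here is a minimal sketch of how a mean can look healthy while the tail is broken. The sample data is synthetic (88 fast responses and 12 slow ones, chosen only to echo the 12% figure above), and `percentile` and `mean` are illustrative helpers, not part of any tool described here.

```typescript
// Nearest-rank percentile over a sorted copy of the samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Synthetic round-trip times in ms: 88 calls at 600ms, 12 calls at 1400ms.
const roundTripsMs: number[] = [
  ...Array.from({ length: 88 }, () => 600),
  ...Array.from({ length: 12 }, () => 1400),
];

const summary = {
  mean: mean(roundTripsMs),           // 696
  p50: percentile(roundTripsMs, 50),  // 600
  p95: percentile(roundTripsMs, 95),  // 1400
};
```

The mean lands just under 700ms, which would pass a naive alert threshold, while p95 sits at 1,400ms.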

We needed per-stage, per-percentile latency tracking. Nothing we found did exactly what we needed.

We evaluated several options before building our own:

Generic APM tools (Datadog, New Relic) — great for HTTP request latency, but they don't understand voice AI pipeline stages. You can instrument custom spans, but you're building the domain model yourself. We tried this with Datadog for two months. The dashboard became a wall of custom metrics that nobody on the team could parse quickly.

Provider-specific dashboards — Deepgram and ElevenLabs both have latency metrics in their dashboards, but they only show their own stage. You can't correlate a Deepgram STT spike with downstream effects on total response time. And they measure from their side — not from your server's perspective, which includes network transit.

Load testing tools (k6, Locust) — designed for HTTP endpoints, not real-time WebSocket audio streams. You can hack them into shape, but simulating realistic voice conversation patterns (variable utterance lengths, interruptions, silence gaps) is a project in itself.

We needed something purpose-built for voice AI pipelines. So we built it.

vox-bench is a TypeScript CLI tool that benchmarks each stage of a voice AI pipeline independently and in combination. Here's what it does:

Per-stage benchmarking. Point it at your STT provider, your LLM, and your TTS provider. It sends realistic audio samples (we include a corpus of 200 healthcare-domain utterances of varying lengths) and measures latency at each stage independently. You get p50, p95, p99, and max for each provider.

Pipeline simulation. Chain your stages together and vox-bench simulates full conversational round-trips. It measures total time-to-first-byte (TTFB) and time-to-complete, broken down by stage. This is where you catch the compounding effects, such as a 50ms STT increase plus an 80ms LLM increase that together push your total over the threshold.
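The compounding effect can be sketched with a per-call stage breakdown. This is a rough illustration, not vox-bench's actual data model: `budgetCheck`, the stage names, and the 160ms transport figure are all assumptions, and the STT, LLM, and TTS baselines simply reuse round numbers from this post.

```typescript
type StageLatencies = Record<string, number>;

function totalMs(stages: StageLatencies): number {
  return Object.values(stages).reduce((sum, ms) => sum + ms, 0);
}

// Compares one call's stage breakdown against a baseline breakdown and a budget.
function budgetCheck(
  baseline: StageLatencies,
  current: StageLatencies,
  budgetMs: number,
) {
  const deltas: Record<string, number> = {};
  for (const stage of Object.keys(current)) {
    deltas[stage] = current[stage] - (baseline[stage] ?? 0);
  }
  const currentTotal = totalMs(current);
  return {
    baselineTotal: totalMs(baseline),
    currentTotal,
    overBudget: currentTotal > budgetMs,
    deltas,
  };
}

// Hypothetical single-call breakdowns in ms; "transport" is an assumed value.
const baselineCall = { stt: 180, llm: 210, tts: 130, transport: 160 };
const degradedCall = { stt: 230, llm: 290, tts: 130, transport: 160 };

const check = budgetCheck(baselineCall, degradedCall, 800);
```

Neither stage looks alarming on its own, but the total moves from 680ms to 810ms and crosses the 800ms budget. Note that this adds stage times for a single call; percentiles of separate stages do not sum to the percentile of the total.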

Provider comparison. Run the same benchmark against multiple providers simultaneously. We built this because we needed to evaluate whether switching from Deepgram Nova-2 to Nova-3 would actually reduce our p95 STT latency in practice (it did, by 35ms on average for our healthcare utterances, but it increased p99 by 12ms on longer sentences). You configure providers in a YAML file and vox-bench runs them head-to-head.

Regression detection. Run vox-bench on a schedule (we use a GitHub Action that runs every 6 hours) and it compares results against your baseline. If any stage's p95 moves more than your configured threshold, it fires an alert. This is what would have caught the ElevenLabs spike that burned us.
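The baseline comparison can be sketched roughly as follows. This is an illustration of the idea, not vox-bench's implementation: the `Report` shape, `detectRegressions`, and the percentage-based threshold are assumptions, and this version flags only increases in p95.

```typescript
type StageStats = { p50: number; p95: number; p99: number };
type Report = Record<string, StageStats>;

type Regression = {
  stage: string;
  baselineP95: number;
  currentP95: number;
  changePct: number;
};

// Flags every stage whose p95 grew by more than thresholdPct versus the baseline.
function detectRegressions(
  baseline: Report,
  current: Report,
  thresholdPct: number,
): Regression[] {
  const regressions: Regression[] = [];
  for (const [stage, base] of Object.entries(baseline)) {
    const latest = current[stage];
    if (!latest) continue; // stage missing from the current run, nothing to compare
    const changePct = ((latest.p95 - base.p95) / base.p95) * 100;
    if (changePct > thresholdPct) {
      regressions.push({
        stage,
        baselineP95: base.p95,
        currentP95: latest.p95,
        changePct,
      });
    }
  }
  return regressions;
}

// Hypothetical runs: STT barely moves, TTS p95 jumps from 220ms to 340ms.
const baselineRun: Report = {
  stt: { p50: 180, p95: 245, p99: 310 },
  tts: { p50: 130, p95: 220, p99: 350 },
};
const latestRun: Report = {
  stt: { p50: 182, p95: 250, p99: 315 },
  tts: { p50: 135, p95: 340, p99: 600 },
};

const regressions = detectRegressions(baselineRun, latestRun, 20);
```

With a 20% threshold, only the TTS stage is reported here, since its p95 rose by roughly 55% while STT rose by about 2%.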

Conversation pattern simulation. Real calls aren't "send audio, get response, repeat." Callers interrupt. They pause mid-sentence. They say "um" for three seconds. vox-bench includes conversation profiles (healthcare-intake, appointment-booking, general-inquiry) that model realistic interaction patterns we extracted from our Loquent call data.

We've been running vox-bench internally for two months. Here's what the data looks like across our current production stack:

Deepgram Nova-3 STT: p50 = 180ms, p95 = 245ms, p99 = 310ms. The variance is almost entirely driven by utterance length. Anything under 3 seconds of audio processes fast. Once you cross 6-7 seconds (a full sentence describing symptoms, for example), latency jumps. Our takeaway: design your prompts to encourage shorter caller responses when possible.

Anthropic Claude (Haiku) LLM: p50 = 210ms TTFB, p95 = 340ms, p99 = 480ms. This is streaming — we start sending to TTS as soon as the first tokens arrive. We tested Claude Sonnet too: p50 = 380ms TTFB, p95 = 620ms. For voice, Haiku wins. The quality difference between Haiku and Sonnet for our use cases (appointment scheduling, FAQ answers, intake questions) is negligible, but the latency difference is enormous.

ElevenLabs TTS: p50 = 130ms, p95 = 220ms, p99 = 350ms. The most variable stage in our pipeline. We've seen p99 hit 600ms during what we assume are provider-side capacity issues, always between 2 and 4pm ET. vox-bench caught this pattern within a week of deployment.

Total pipeline (end-to-end): p50 = 620ms, p95 = 890ms, p99 = 1,150ms. Both our p95 and p99 are above the 800ms "feels natural" threshold, but below the 1.2 second "callers repeat themselves" line. We're okay with that tradeoff: pushing the tail below 800ms would require either pre-generating responses (quality hit) or switching to a faster but lower-quality TTS (quality hit). For now, 5-6% of responses feeling slightly delayed is acceptable.

vox-bench is built with TypeScript (Node.js 20+). We chose TypeScript because our entire Loquent backend is TypeScript/NestJS, and we wanted the team to be able to extend the tool without context-switching languages.

Key components: a provider adapter layer (currently supports Deepgram, OpenAI Whisper, Anthropic Claude, OpenAI GPT, ElevenLabs, and Google Cloud TTS), a pipeline orchestrator that chains stages with proper streaming, a statistics engine that computes percentiles using the t-digest algorithm (accurate percentiles without storing every measurement), and a reporter that outputs results as JSON, Markdown tables, or sends them to your monitoring system via webhooks.
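As a rough idea of what the Markdown reporter stage might produce, here is a small sketch that renders per-stage percentiles as a table. The `StageResult` shape and `toMarkdownTable` function are hypothetical and are not taken from the vox-bench codebase; the figures are the production numbers quoted earlier in this post.

```typescript
type StageResult = { stage: string; p50: number; p95: number; p99: number };

// Renders one Markdown table row per stage, with right-aligned numeric columns.
function toMarkdownTable(results: StageResult[]): string {
  const header = "| Stage | p50 (ms) | p95 (ms) | p99 (ms) |";
  const divider = "| --- | ---: | ---: | ---: |";
  const rows = results.map(
    (r) => `| ${r.stage} | ${r.p50} | ${r.p95} | ${r.p99} |`,
  );
  return [header, divider, ...rows].join("\n");
}

const productionResults: StageResult[] = [
  { stage: "Deepgram Nova-3 STT", p50: 180, p95: 245, p99: 310 },
  { stage: "Claude Haiku LLM (TTFB)", p50: 210, p95: 340, p99: 480 },
  { stage: "ElevenLabs TTS", p50: 130, p95: 220, p99: 350 },
  { stage: "Total pipeline", p50: 620, p95: 890, p99: 1150 },
];

const table = toMarkdownTable(productionResults);
```

Keeping the reporter a pure function from results to a string makes it easy to add other output formats alongside it.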

The whole thing is about 3,200 lines of TypeScript. No magic.

TTS is your most variable stage. STT and LLM latency are relatively predictable. TTS providers show the most variance, and the variance is time-of-day dependent. Benchmark at different times or you'll get misleading numbers.

Average latency is a useless metric for voice AI. Your p95 and p99 determine caller experience. A 650ms average can hide a 1.4 second p95 that makes 5% of your conversations feel broken.

Streaming changes everything. Without streaming (waiting for complete LLM response before sending to TTS), our p50 total would be 1,100ms+. With streaming, it's 620ms. If your voice AI pipeline isn't streaming at every stage, fix that before optimizing anything else.
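The arithmetic behind that gap can be sketched with a simplified model in which each stage has a time to first output and a time to full output. This is an illustration under stated assumptions, not our measurement code: `timeToFirstAudio` is a hypothetical helper, and the full-output times (700ms for the LLM, 900ms for TTS) are invented for the example.

```typescript
type StageLatency = {
  name: string;
  firstOutputMs: number; // time until the stage emits its first chunk
  fullOutputMs: number;  // time until the stage has emitted everything
};

// Time until the caller hears the first audio from the last stage.
// With streaming, each upstream stage hands off as soon as it has a first chunk;
// without it, each upstream stage must finish before the next one starts.
function timeToFirstAudio(stages: StageLatency[], streaming: boolean): number {
  if (stages.length === 0) return 0;
  const last = stages[stages.length - 1];
  const upstream = stages.slice(0, -1);
  const handoffMs = upstream.reduce(
    (sum, s) => sum + (streaming ? s.firstOutputMs : s.fullOutputMs),
    0,
  );
  return handoffMs + last.firstOutputMs;
}

// Hypothetical stage timings in ms.
const exampleStages: StageLatency[] = [
  { name: "stt", firstOutputMs: 180, fullOutputMs: 180 },
  { name: "llm", firstOutputMs: 210, fullOutputMs: 700 },
  { name: "tts", firstOutputMs: 130, fullOutputMs: 900 },
];

const streamedMs = timeToFirstAudio(exampleStages, true);  // 520
const blockingMs = timeToFirstAudio(exampleStages, false); // 1010
```

In this model the whole difference comes from the LLM stage: waiting for the complete response adds its full generation time to every turn, while streaming only pays for the first tokens.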

Provider latency varies by content domain. Our healthcare utterances benchmark 15-20% slower on STT than general conversation because of medical terminology. Always benchmark with domain-representative audio, not generic test phrases.

Schedule your benchmarks. Provider performance isn't static. Run benchmarks on a cadence and track trends. The regression detection in vox-bench has caught three provider-side degradations that our application monitoring missed.

The repo is at github.com/Autor-Technologies/vox-bench. MIT licensed. The README has quickstart instructions — you can be running benchmarks against your own providers in under five minutes if you have API keys ready.

We included our healthcare conversation profiles and audio corpus. If you're building voice AI for a different domain, you can create your own profiles — the format is documented and there's a generator script that builds profiles from your own call recordings.

We're actively using this internally. If you find bugs or want to add a provider adapter, PRs are welcome. If you're building voice AI and want to talk latency optimization, we've probably hit the same walls you're hitting.

If you're building something similar, we'd love to hear about it. Reach out at hello@autor.ca or visit autor.ca.
