Building Production Voice AI Agents: Latency, Architecture, and What Nobody Tells You


Originally published on prodinit.com

Key Takeaways

Sub-300ms end-to-end latency is the human-conversation threshold for voice AI.
The latency budget breaks into four layers: STT (80–120ms), LLM first-token (150–250ms), TTS first-chunk (60–100ms), and network transport (20–60ms). Missing the target in any one layer pushes the total over 500ms.
WebRTC with ICE Trickle is the correct transport for browser and mobile clients. SIP is the right choice for PSTN integration and legacy telephony.
LiveKit SFU reduces media server complexity by forwarding encoded streams rather than decoding and re-mixing them, and its hosted tier removes the need to operate a media server fleet entirely.

Voice AI demos look deceptively easy. A GPT-4o API call, a TTS response, a microphone input — connected together in 200 lines of Python, the thing works. Then you put it in front of real users and it fails.

The failure is almost never the model. It is the architecture.

In production at 2000+ calls per day — the scale Prodinit operates for a healthcare scheduling platform — three classes of failure dominate: latency spikes that destroy conversational flow, audio glitches from unmanaged WebRTC sessions, and compliance gaps where customer PII surfaces in LLM provider logs. None of these appear in a notebook demo. All of them have architecture solutions.

This guide walks through the complete production stack: what latency target you are actually trying to hit, how the budget breaks across each layer, the transport architecture that achieves it, and the security and observability instrumentation that keeps it running without surprises.

Sub-300ms end-to-end latency is the human-conversation threshold. Conversational linguistics research places the average human response gap at 200ms; gaps up to 500ms are within the natural range. Beyond 500ms, listeners register the delay. Beyond 1,500ms, they start to speak again, or hang up.

The practical production target is under 800ms at p95, with a p50 below 400ms. This is not a soft target — these numbers correlate directly with call completion rates and CSAT scores.

End-to-end latency in a voice AI agent is the sum of five contributors:

Total target: 320–560ms. That is achievable. The mistakes that push it over 1,000ms are predictable and avoidable.
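As a rough check on the arithmetic, the per-layer ranges from the Key Takeaways can be summed directly. This is only a sketch: it covers the four layers named there, so its total will not exactly match the five-contributor figure above, and the names BUDGET_MS, total_range, and over_budget are illustrative rather than part of any library.

```python
# Per-layer latency ranges in milliseconds, taken from the Key Takeaways.
# Only four layers are listed there; the fifth contributor is not included.
BUDGET_MS = {
    "stt": (80, 120),
    "llm_first_token": (150, 250),
    "tts_first_chunk": (60, 100),
    "network_transport": (20, 60),
}


def total_range(budget):
    """Return (best_case, worst_case) by summing each layer's bounds."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_ms, budget):
    """List the layers whose measured latency exceeds their upper bound."""
    return [name for name, ms in measured_ms.items() if ms > budget[name][1]]
```

Running over_budget against per-call measurements is one way to see which single layer is responsible when a call exceeds the target.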

VAD (voice activity detection) decides when the user has stopped speaking and the pipeline should fire. A misconfigured VAD is the single easiest way to add 500ms of latency without touching any model. Most implementations default to a trailing silence window of 500–800ms, and the user waits out that entire window before a single API call fires.

In production, configure VAD with:

In the browser, getUserMedia handles echo cancellation with echoCancellation: true

Deepgram's streaming STT includes built-in VAD endpointing via endpointing=300; use this rather than a separate VAD layer, as it eliminates an additional round-trip.
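To make the cost of the trailing silence window concrete, here is a minimal sketch of an endpoint detector that works on per-frame speech flags. The 20ms frame size, the function name, and the boolean flags are assumptions for illustration; a real VAD derives those flags from a model or an energy threshold.

```python
from typing import Optional, Sequence


def endpoint_time_ms(
    speech_flags: Sequence[bool],
    frame_ms: int = 20,
    trailing_silence_ms: int = 300,
) -> Optional[int]:
    """Return the time (ms from stream start) at which end-of-speech fires.

    The endpoint fires once speech has been seen and is then followed by
    trailing_silence_ms of continuous silence. Returns None if it never fires.
    """
    needed = trailing_silence_ms // frame_ms  # silent frames required
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (index + 1) * frame_ms
    return None


# One second of speech followed by one second of silence, in 20ms frames.
flags = [True] * 50 + [False] * 50
fast = endpoint_time_ms(flags, trailing_silence_ms=300)
slow = endpoint_time_ms(flags, trailing_silence_ms=800)
```

With identical audio, the only difference between fast and slow is the window setting, which is the latency the user pays before any model is called.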

Batch transcription — send audio, wait for full transcript — adds 600–1,200ms before your LLM call even starts. This alone makes sub-300ms unreachable. The solution is streaming STT with interim results.

Deepgram Nova-2 delivers streaming transcription with a first-word latency around 80ms over WebSocket. You do not wait for the complete transcript; you begin processing on is_final: true utterances:

User audio → WebSocket → Deepgram Nova-2 (streaming)
                              ↓
                    interim results (ignored)
                              ↓
                    is_final: true → LLM pipeline fires

Critical configuration: punctuate=true, smart_format=true, and endpointing=300. Without endpointing set, Deepgram uses server-side silence detection that defaults longer than your VAD window.
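The configuration above could be assembled as query parameters on the streaming URL, with a small filter that ignores interim results. This is a sketch under assumptions: the encoding and sample_rate values are illustrative choices, the helper names are hypothetical, and the message shape follows my understanding of Deepgram's streaming responses, so check it against the current API reference.

```python
from typing import Optional
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(sample_rate: int = 16000) -> str:
    """Build the streaming URL with the options discussed above."""
    params = {
        "model": "nova-2",
        "encoding": "linear16",      # assumption: raw 16-bit PCM input
        "sample_rate": sample_rate,  # assumption: 16kHz capture
        "interim_results": "true",
        "punctuate": "true",
        "smart_format": "true",
        "endpointing": 300,
    }
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


def final_transcript(message: dict) -> Optional[str]:
    """Return the transcript of a finalized result, or None to keep waiting."""
    if message.get("type") != "Results" or not message.get("is_final"):
        return None
    alternatives = message.get("channel", {}).get("alternatives", [])
    text = alternatives[0].get("transcript", "") if alternatives else ""
    return text or None
```

Only a non-None return value from final_transcript would trigger the LLM stage; interim messages fall through without side effects.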

LLM first-token latency is the hardest constraint to optimize. GPT-4 in streaming mode cannot reliably hit sub-200ms first-token in typical network conditions. The model choices that achieve 150–250ms in practice:

Stream the response. Pass tokens to TTS as they arrive — do not buffer the full LLM output before starting TTS synthesis. The overlap between LLM generation and TTS synthesis recovers 100–200ms of total latency.
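One way to overlap generation and synthesis is to flush the token stream to TTS at clause boundaries rather than token by token. The sketch below is an illustrative variation on the approach described above, not a prescribed method: the boundary set, the min_chars threshold, and the function name are all assumptions, and it would split on the decimal point in a number such as "12.5".

```python
from typing import Iterable, Iterator

# Punctuation that marks a reasonable place to hand text to TTS.
BOUNDARIES = (".", "?", "!", ",", ";", ":")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable chunks as soon as a clause boundary is reached."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.rstrip()
        # Flush only when the chunk is long enough to sound natural.
        if len(stripped) >= min_chars and stripped.endswith(BOUNDARIES):
            yield buffer.strip()
            buffer = ""
    # Flush whatever remains when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Your", " appointment", " is", " confirmed", ".", " Anything", " else", "?"]
chunks = list(chunk_for_tts(tokens))
```

The first chunk is ready for synthesis while the model is still generating the second, which is where the overlap comes from.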

Prompt engineering for voice: system prompts should be shorter than for text chatbots. Strip all markdown formatting instructions — the output goes to TTS and formatted text degrades audio. Keep total context under 2,000 tokens where possible; token count has a near-linear relationship with first-token latency.

ElevenLabs streaming delivers first-audio-chunk in 60–100ms on their Flash tier versus 200–400ms on standard. The difference is significant enough that choosing the wrong tier consumes your entire latency budget on TTS alone.

Use streaming TTS: do not wait for the complete audio file before playback. The client should begin playing as soon as the first audio chunk arrives. For browser clients, the Web Audio API handles chunked playback natively; for telephony, use RTP packetization.

The TTS configuration that matters for latency:

eleven_flash_v2_5 for minimum latency
stream=true
pcm_16000 for telephony, mp3_44100_128 for browser
optimize_streaming_latency=4 (aggressive mode)

With a well-configured WebRTC connection, transport adds 20–40ms round-trip. With a WebSocket-only approach through a distant cloud region, transport alone can add 200ms in the tail. This is where the transport choice has the most impact.
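Returning to the TTS settings listed earlier (model, output format, latency optimization), they could be assembled into a streaming request along these lines. This is a sketch that only builds the request and sends nothing; the endpoint path and header name reflect my understanding of the ElevenLabs HTTP API and should be verified against the current reference, and build_tts_request is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: base path of the ElevenLabs text-to-speech HTTP API.
ELEVENLABS_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(text: str, voice_id: str, api_key: str, telephony: bool = False) -> dict:
    """Describe a streaming TTS request without sending it."""
    query = urlencode({
        # PCM for telephony paths, MP3 for browser playback.
        "output_format": "pcm_16000" if telephony else "mp3_44100_128",
        "optimize_streaming_latency": 4,
    })
    return {
        "url": f"{ELEVENLABS_BASE}/{voice_id}/stream?{query}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "json": {"text": text, "model_id": "eleven_flash_v2_5"},
    }
```

The returned dict can be passed to whichever HTTP client the agent worker uses, with the response body read in chunks so that playback starts on the first one.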

The production architecture for a sub-300ms voice AI agent:

The agent worker sits between the media plane and the model APIs. It receives raw audio frames from LiveKit, streams them to Deepgram, fires the LLM on final utterances, and pushes TTS audio frames back into the LiveKit room. The client never calls model APIs directly — this is essential for PII control and rate-limit management.

WebRTC connection establishment uses Interactive Connectivity Establishment (ICE) to find a network path between peers. In the naive implementation — wait for all ICE candidates before signaling — setup latency adds 500–2,000ms to every call start. This is invisible in demos and very visible in production.

ICE Trickle solves this: candidates are sent to the remote peer as they are gathered, and connectivity checks begin immediately. Call setup time drops to 100–400ms in most network conditions.
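The difference can be illustrated with a deliberately simplified timing model. Everything here is an assumption made for illustration: the Candidate fields, the example timings, and the idea that setup time is just gathering plus signaling plus one connectivity check. Real ICE runs checks on candidate pairs with pacing and retransmission.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    kind: str       # "host", "srflx" (via STUN) or "relay" (via TURN)
    gather_ms: int  # time until this candidate is discovered
    check_ms: int   # time for its connectivity check to complete
    works: bool     # whether the check succeeds on this network


def setup_without_trickle(cands: List[Candidate], signal_ms: int) -> int:
    """Signaling waits for the slowest candidate before any check starts."""
    all_gathered = max(c.gather_ms for c in cands)
    fastest_check = min(c.check_ms for c in cands if c.works)
    return all_gathered + signal_ms + fastest_check


def setup_with_trickle(cands: List[Candidate], signal_ms: int) -> int:
    """Each candidate is signaled and checked as soon as it is gathered."""
    return min(c.gather_ms + signal_ms + c.check_ms for c in cands if c.works)


candidates = [
    Candidate("host", gather_ms=5, check_ms=30, works=False),
    Candidate("srflx", gather_ms=60, check_ms=40, works=True),
    Candidate("relay", gather_ms=900, check_ms=80, works=True),
]
naive_ms = setup_without_trickle(candidates, signal_ms=50)
trickle_ms = setup_with_trickle(candidates, signal_ms=50)
```

In this model the slow relay candidate dominates the naive path even though the call ends up connecting over the much faster server-reflexive candidate.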

LiveKit implements ICE Trickle automatically. What you need to deploy:

stun.l.google.com:19302 works for most cases; deploy your own for HIPAA environments to keep traffic off third-party infrastructure.

A Selective Forwarding Unit receives encoded media streams and forwards them to participants without decoding and re-encoding. For voice AI, this matters because:

The LiveKit room model maps cleanly to a voice call session:

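import asyncio

from livekit import agents, rtc


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # Room events are delivered to synchronous callbacks, so the handler
    # only schedules the async audio loop as a task.
    @ctx.room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.RemoteTrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            audio_stream = rtc.AudioStream(track)
            asyncio.create_task(process_audio(audio_stream, ctx.room))


async def process_audio(stream: rtc.AudioStream, room: rtc.Room):
    # AudioStream yields events that wrap each decoded audio frame.
    async for event in stream:
        # pipeline is the application's own STT/LLM/TTS pipeline (not defined in this snippet).
        await pipeline.push_frame(event.frame)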

LiveKit's agent framework handles room lifecycle, track subscription, and RTP framing. Application code focuses on pipeline logic.

This is the question that trips up most teams evaluating voice AI infrastructure. They are not competing choices — they solve different integration problems.

Use WebRTC when you control the client — a web app, mobile app, or embedded SDK. It gives you wideband Opus audio (meaningfully better STT accuracy), lower setup latency, and direct control over the media path.

Use SIP when the caller is on a real phone number — inbound calls to a support line, outbound dialer campaigns, or integration with an existing contact center (Genesys, Five9, Twilio PSTN). Twilio's Media Streams provides a WebSocket bridge from PSTN to your agent worker, which avoids running a full SIP stack yourself.

The G.711 codec limitation of PSTN calls has an underappreciated consequence: STT accuracy on 8kHz narrowband audio is meaningfully lower than on 16kHz+ wideband. For healthcare or fintech agents where transcription accuracy directly affects outcomes, browser/mobile WebRTC with Opus gives a material accuracy advantage over telephone calls.

A production voice AI WebRTC architecture typically uses both: WebRTC for app callers and a SIP trunk or Twilio Media Streams for inbound phone calls, with the same agent worker behind both paths.
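On the telephony path, the agent worker has to turn narrowband G.711 audio into linear PCM before it can reuse the same pipeline. The sketch below decodes a Twilio Media Streams style message by hand. The message shape (a JSON "media" event carrying base64 mu-law audio at 8kHz) reflects my understanding of that API and should be checked against Twilio's documentation; the function names are hypothetical.

```python
import base64
import json
from typing import List, Optional


def ulaw_byte_to_pcm16(value: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    value = ~value & 0xFF  # mu-law bytes are stored bit-inverted
    sign = value & 0x80
    exponent = (value >> 4) & 0x07
    mantissa = value & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_message(raw: str) -> Optional[List[int]]:
    """Return PCM samples for a "media" event, or None for other events."""
    message = json.loads(raw)
    if message.get("event") != "media":
        return None
    payload = base64.b64decode(message["media"]["payload"])
    return [ulaw_byte_to_pcm16(b) for b in payload]
```

The decoded samples are still 8kHz, so the accuracy gap described above remains; decoding only makes the audio compatible with the rest of the pipeline.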

Voice AI pipelines fail silently. A WebRTC ICE failure looks like a dropped call. A Deepgram WebSocket disconnect looks like the agent not hearing the user. A TTS timeout manifests as silence on the line. Without structured observability, every incident is a multi-hour debugging session across three services.

Instrument the following at minimum:

Per-call latency histogram: record wall-clock time from VAD endpoint event to first TTS audio chunk, broken down by component (stt_latency_ms, llm_first_token_ms, tts_first_chunk_ms). Alert on p95 > 800ms for any single component.

Per-call transcription confidence: Deepgram returns a confidence score per utterance. Log confidence distributions; a degradation in median confidence correlates with audio quality issues, codec mismatches, or background noise problems before callers start complaining.

WebRTC ICE connection state: log ICE state transitions (checking → connected → disconnected → failed). Track failed rates by client region. Elevated failure rates in a specific geography usually indicate TURN server coverage gaps.

STT WebSocket reconnections — Deepgram WebSocket connections drop under load or network events. Count reconnections per call. A call with 3+ reconnections will have visible transcription gaps; flag and review these separately.

LLM error rates — log 4xx/5xx rates from your LLM provider independently from total call failure. A 429 spike during peak hours needs a different response (add capacity, queue calls) than a 500 (inspect payloads, contact provider).
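The per-component p95 alert from the first item in this list could be computed along these lines. This is a minimal sketch using a nearest-rank percentile over in-memory samples; in practice a metrics backend would usually do this, and the function names and record layout are assumptions.

```python
import math
from typing import Dict, List

COMPONENTS = ("stt_latency_ms", "llm_first_token_ms", "tts_first_chunk_ms")


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def components_over_threshold(
    calls: List[Dict[str, float]], threshold_ms: float = 800.0
) -> Dict[str, float]:
    """Return each component whose p95 exceeds the threshold, with its p95."""
    breaches = {}
    for name in COMPONENTS:
        samples = [call[name] for call in calls if name in call]
        if samples:
            p95 = percentile(samples, 95)
            if p95 > threshold_ms:
                breaches[name] = p95
    return breaches
```

Reporting the breach per component, rather than on the total alone, points the on-call engineer at the responsible service directly.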

Use structured logging with a call_id field on every log event. Voice AI incidents always span Deepgram, your agent worker, and your SFU. Without a consistent call_id, joining those log lines across services is impossible.
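A minimal way to enforce this is to route every log call through one helper that requires the call_id. The sketch below uses only the standard library; the helper name, the event names, and the extra fields are illustrative assumptions.

```python
import json
import logging
import time


def log_event(logger: logging.Logger, call_id: str, event: str, **fields) -> str:
    """Emit one JSON log line that always carries the call_id."""
    record = {"ts": time.time(), "call_id": call_id, "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


logger = logging.getLogger("voice_agent")
line = log_event(logger, "call-123", "ice_state", state="failed", region="eu-west")
```

Because call_id is a required positional argument, a log line without it cannot be written by accident, and lines from different services can be joined on that one field.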

