The 1.4 Seconds That Weren't on Any Span


On the morning of June 3rd, a customer on a live call sat through 1.4 seconds of dead air after she finished a sentence, long enough that she said "hello?" before the agent answered. I had the trace open in Honeycomb forty seconds later. Every span was green. End-to-end p95 read 980ms, comfortably under our budget, and not one span in that waterfall was longer than 400ms. The dashboard told me everything was fine while the customer was, in fact, talking to silence.

TL;DR: End-to-end voice latency is not the sum of your spans. The number that kills your UX lives in the unattributed time between spans, most often the gap between turn-end (the moment the user stops talking) and ASR-start (the moment your pipeline begins transcribing). APM-style tracing instruments the work and ignores the waiting, so the gap is invisible by construction. You have to put a span on the handoff itself.

Here is the thing I keep saying and keep being right about: most "LLM observability" is just APM with extra steps. It watches the model. It traces the LLM call, the tool call, the retrieval, the token count, all the parts a backend engineer already knows how to think about. For a voice agent that is the wrong half of the system. Voice agents do not break inside the LLM call. They break in the audio pipeline, in the orchestration between components, in the handoffs nobody owns a span for. Your model can be fast and your product can still feel broken, and your dashboard will not say a word about it.

Our turn looks like this on paper. VAD/turn-detection decides the user is done. Audio goes to ASR (Whisper Large v3, streaming). The transcript goes to the LLM (gpt-4o-realtime) for a first token, then the full response. The response streams to TTS (ElevenLabs) for the first audio byte, which is the moment the user hears anything. There is network on both ends.

I pulled the one trace from the 1.4-second call. Not an aggregate, the actual trace. Here is the latency budget I had been staring at for weeks, the summed-span view:

Stage	p50	p95	p99	who owns the span
VAD / turn-detection	60ms	120ms	180ms	orchestrator
ASR (streaming)	180ms	310ms	540ms	ASR client
LLM TTFT	220ms	380ms	720ms	model client
LLM full response	140ms	260ms	430ms	model client
TTS first byte	90ms	190ms	360ms	TTS client
Network (both legs)	40ms	90ms	150ms	gateway

Add the p95 column. It comes to 1350ms. Our reported end-to-end p95 was 980ms. Percentiles do not stack: a single request rarely hits the tail on every stage at once, so the real end-to-end p95 sits below the naive sum. Fine. Either way, both numbers are wrong about the call that paged me, because the call that paged me had 1.4 seconds the table does not contain. None of these rows is the dead air. The dead air is the white space between two of them.
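If the "percentiles do not stack" point feels hand-wavy, a small simulation makes it concrete. This is a sketch under loud assumptions, not a model of our real pipeline: each stage is drawn independently from a lognormal distribution fitted to the p50 and p95 in the table, and the seed and sample count are arbitrary. Real stages are correlated under load, so treat the output as an illustration of the arithmetic only.

```python
import math
import random

# (p50, p95) per stage in ms, copied from the table above.
STAGES = {
    "vad": (60, 120),
    "asr": (180, 310),
    "llm_ttft": (220, 380),
    "llm_full": (140, 260),
    "tts_first_byte": (90, 190),
    "network": (40, 90),
}

Z95 = 1.645  # standard-normal quantile at 0.95


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def simulate_turns(n_turns=20_000, seed=0):
    """Sum one independent lognormal draw per stage, n_turns times.

    Assumption: each stage is lognormal with the table's p50 as its median
    and a sigma chosen so its 95th percentile matches the table's p95.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_turns):
        total = 0.0
        for p50, p95 in STAGES.values():
            sigma = math.log(p95 / p50) / Z95
            total += rng.lognormvariate(math.log(p50), sigma)
        totals.append(total)
    return totals


naive_p95_sum = sum(p95 for _, p95 in STAGES.values())
end_to_end_p95 = percentile(simulate_turns(), 95)

print(f"sum of per-stage p95s:    {naive_p95_sum} ms")
print(f"simulated end-to-end p95: {end_to_end_p95:.0f} ms")
```

The simulated end-to-end p95 lands well under the naive sum, which is the whole reason a summed-span budget and a measured end-to-end number disagree. Neither number has a row for time spent outside every span.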

When you look at a single voice turn in a normal tracing UI, you get a waterfall of bars. Each bar is a span. The instinct, the APM instinct, is to find the longest bar and optimize it. I spent two days doing exactly that. I made ASR faster. I shaved 40ms off TTFT with a prompt cache. The summed bars got shorter and the dead air did not move, because the dead air was never a bar.

Here is the timeline I finally drew on a whiteboard, because the tracing UI would not draw it for me.

Figure: one voice turn. The captured spans are short and correct, but they start 1400ms late. The damage is the unattributed gap to their left.

That bracket on the left is the whole post. The spans were honest. They were short, they were green, they summed to a healthy number. They just started 1.4 seconds after the user stopped talking, and nothing in the trace measured the wait, because the code path between "turn-detection fired" and "ASR client opened a stream" did not open a span. It awaited a coroutine, hit a connection-pool stall under load, and sat there. Silent. Unspanned. Invisible.
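The whiteboard drawing boils down to a few lines of arithmetic: take the turn-end timestamp, sort the spans, and report every stretch of time no span covers. A minimal sketch follows. The `Span` record and `unattributed_gaps` helper are hypothetical names for illustration, not part of any tracing SDK, and only the 1400ms wait and the 300ms ASR span come from the incident; the other boundaries are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: float
    end_ms: float


def unattributed_gaps(turn_end_ms, spans):
    """Return (label, gap_ms) for every stretch of time that no span covers.

    The first gap is measured from turn-end to the first span's start,
    which is exactly the interval a waterfall view never draws.
    """
    gaps = []
    covered_until = turn_end_ms
    previous = "turn_end"
    for span in sorted(spans, key=lambda s: s.start_ms):
        if span.start_ms > covered_until:
            gaps.append((f"{previous} -> {span.name}", span.start_ms - covered_until))
        if span.end_ms > covered_until:
            covered_until = span.end_ms
            previous = span.name
    return gaps


# Timestamps are relative to turn-end (t = 0 ms). Illustrative values only.
turn = [
    Span("voice.asr", 1400, 1700),
    Span("voice.llm", 1700, 2050),
    Span("voice.tts", 2050, 2200),
]

gaps = unattributed_gaps(turn_end_ms=0, spans=turn)
worst = max(gaps, key=lambda g: g[1])
print(worst)  # ('turn_end -> voice.asr', 1400)
```

Note that this only works if you record the turn-end timestamp somewhere. A trace that starts at the first span has no way to know how late it started.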

The turn-detection callback handed off to ASR through a queue, and the ASR client lazily established its streaming connection on first use. Under concurrent calls, that connection setup contended on a pool that was sized for steady state, not for the moment six calls all finished a turn inside the same 200ms window. So turn-end fired, the handoff coroutine queued the audio, and then waited on a connection that was busy being born. By the time the ASR span opened, 1.4 seconds had passed. The ASR span itself then ran in 300ms, green and blameless.

The fix is two parts. Put a span around the handoff so the gap stops being invisible. Then fix the pool. You cannot fix what you cannot see, and the entire reason this lived in production for weeks is that the gap was never a measurable thing.

Here is the real instrumentation. This is OpenTelemetry Python (opentelemetry-api and opentelemetry-sdk), the actual SDK calls, runnable.

from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode

# asr_client, llm_client, and tts_client are the application's own async
# clients, defined elsewhere. They are not part of OpenTelemetry.
tracer = trace.get_tracer("voice.turn")

async def handle_turn(audio_in, ctx):
    # Root span for the whole turn; every stage below nests under it.
    with tracer.start_as_current_span(
        "voice.turn",
        kind=SpanKind.SERVER,
    ) as turn_span:
        turn_span.set_attribute("call.id", ctx.call_id)
        turn_span.set_attribute("turn.index", ctx.turn_index)

        # The handoff span: covers the wait between turn-detection firing
        # and the ASR stream being ready, which no stage span measures.
        with tracer.start_as_current_span("voice.handoff.vad_to_asr") as hs:
            hs.set_attribute("handoff.from", "turn_detection")
            hs.set_attribute("handoff.to", "asr")
            try:
                asr_stream = await asr_client.open_stream(ctx)
            except Exception as exc:
                hs.set_status(Status(StatusCode.ERROR, str(exc)))
                hs.record_exception(exc)
                raise
            hs.add_event("asr_stream_ready")

        # Stage spans: the work itself, which was already fast.
        with tracer.start_as_current_span("voice.asr") as asr_span:
            transcript = await asr_stream.transcribe(audio_in)
            asr_span.set_attribute("asr.transcript_chars", len(transcript))

        with tracer.start_as_current_span("voice.llm") as llm_span:
            reply = await llm_client.complete(transcript, ctx)
            llm_span.set_attribute("llm.model", ctx.model)

        with tracer.start_as_current_span("voice.tts") as tts_span:
            first_byte = await tts_client.first_audio_byte(reply)
            tts_span.set_attribute("tts.first_byte_ms", first_byte.elapsed_ms)

        return reply

The point is the voice.handoff.vad_to_asr span. It wraps the dead zone between two components that each had their own span and were each, individually, fast. Now the wait has a name and a duration. The next time six calls finish a turn at once, the handoff span balloons to 1400ms and the connection-pool stall is right there in the waterfall instead of hiding in the white space.

And once the span exists, you can query for it. Here is the trace query I now run, written for a backend that speaks SQL-ish over spans (Honeycomb's query builder maps to the same idea, and so does any OTLP store you can point at ClickHouse). It surfaces turns where the handoff alone blew past 250ms:

SELECT
  trace_id,
  call_id,
  duration_ms AS handoff_ms
FROM spans
WHERE name = 'voice.handoff.vad_to_asr'
  AND duration_ms > 250
ORDER BY handoff_ms DESC
LIMIT 50;

That query returns nothing on a normal day and lights up the instant the pool starts contending. I wired it to an alert on the handoff span's p95, not the end-to-end p95, because the end-to-end p95 is exactly the number that lied to me on June 3rd.
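The alert condition itself is small enough to sketch. This is a hypothetical stand-in for whatever your alerting backend evaluates, not Honeycomb's trigger API: the function names, the 20-sample minimum, and the window contents are all assumptions. Only the 250ms threshold and the 70ms and 1400ms durations come from this post.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


def handoff_alert(handoff_ms, threshold_ms=250.0, min_samples=20):
    """True when the handoff span's p95 over the window exceeds the threshold.

    min_samples avoids paging on a window too small for a p95 to mean much.
    """
    if len(handoff_ms) < min_samples:
        return False
    return p95(handoff_ms) > threshold_ms


# A quiet window: every handoff takes about 70 ms.
quiet_window = [70.0] * 100

# A burst window: six turns out of a hundred stall on the pool.
burst_window = [70.0] * 94 + [1400.0] * 6

print(handoff_alert(quiet_window))  # False
print(handoff_alert(burst_window))  # True
```

Six stalled turns in a hundred is just enough to move a nearest-rank p95. If your stalls are rarer than that, alert on the max or on a count over the threshold instead, or the same blind spot comes back one level down.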

The pool fix was unglamorous. Pre-warm the ASR streaming connections, size the pool for burst concurrency instead of average, and keep the connections alive between turns instead of opening lazily. Handoff latency went from 1400ms on the bad call to a steady-state p95 of 70ms. The dead air was gone the same afternoon I shipped the span, because the span told me precisely where to put the fix.
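For shape, here is a minimal asyncio sketch of that kind of pool. It is not our production code: `PrewarmedPool` and `fake_connect` are hypothetical, the pool size of six just mirrors the six-call burst from the incident, and real keep-alive pings and replacement of broken connections are left out.

```python
import asyncio


class PrewarmedPool:
    """Tiny pool that opens every connection up front and reuses them across turns."""

    def __init__(self, connect, size):
        self._connect = connect  # async factory that returns one ready connection
        self._size = size        # sized for burst concurrency, not the average
        self._idle = asyncio.Queue()

    async def prewarm(self):
        # Pay the connection-setup cost once, before any turn needs a stream.
        conns = await asyncio.gather(*(self._connect() for _ in range(self._size)))
        for conn in conns:
            self._idle.put_nowait(conn)

    async def acquire(self):
        # Waits only if every pre-opened connection is currently in use.
        return await self._idle.get()

    def release(self, conn):
        # Return the connection to the pool instead of closing it.
        self._idle.put_nowait(conn)


async def demo():
    opened = 0

    async def fake_connect():
        nonlocal opened
        opened += 1
        conn_id = opened
        await asyncio.sleep(0.01)  # stand-in for TLS and stream setup
        return f"conn-{conn_id}"

    pool = PrewarmedPool(fake_connect, size=6)
    await pool.prewarm()

    async def turn():
        conn = await pool.acquire()
        try:
            await asyncio.sleep(0)  # stand-in for transcription
        finally:
            pool.release(conn)

    # Two bursts of six concurrent turns; no new connections should be opened.
    await asyncio.gather(*(turn() for _ in range(6)))
    await asyncio.gather(*(turn() for _ in range(6)))
    return opened


opened_connections = asyncio.run(demo())
print(opened_connections)  # 6
```

The design choice that matters is that `acquire` never calls the factory. All setup cost moves to startup, so a burst of turn-ends competes for ready connections rather than for connection setup.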

Instrumenting the handoff makes the gap visible. It does not make your infrastructure fast. A few honest limits.

It does not fix jitter under load on its own. The span tells you the handoff is slow, but if your pool, your event loop, or your GC is the bottleneck, you still have to go fix that. The span is a flashlight, not a wrench.

It does nothing about provider-side queueing you cannot see. When ElevenLabs or your ASR vendor queues your request on their side, your client-side span measures the wait but cannot attribute it past the boundary. You will know that you waited, not why the provider made you wait. For that you need their status, their rate-limit headers, sometimes a support ticket.

And it will not catch every gap automatically. I added the VAD-to-ASR span because that is where this fire was. There are other handoffs (ASR-to-LLM, LLM-to-TTS, barge-in cancellation) and each one needs its own span if you want to see its gap. Instrument the ones that hurt first.

Lesson: Instrument the handoffs, not just the calls. A green waterfall of short, correct spans can still add up to a customer saying "hello?" into silence, because the damage is the time between the bars, and a trace only shows you the bars you drew. The day I stopped trusting the summed p95 and started putting spans on the gaps is the day the dead air stopped paging me. If you run voice agents, go find your turn-end-to-ASR-start handoff right now, wrap it in a span, and alert on that span alone. It is the cheapest 1.4 seconds you will ever buy back.

source & further reading

Original article on dev.to