{"slug": "the-1-4-seconds-that-weren-t-on-any-span", "title": "The 1.4 Seconds That Weren't on Any Span", "summary": "A developer at a voice agent company traced a 1.4-second dead-air incident on a customer call, discovering that the latency was invisible in standard APM traces. The gap occurred between turn-detection and ASR-start, an unattributed interval not captured by any span. The developer argues that voice agent observability must instrument handoffs, not just component work.", "body_md": "On the morning of June 3rd, a customer on a live call sat through 1.4 seconds of dead air after she finished a sentence, long enough that she said \"hello?\" before the agent answered. I had the trace open in Honeycomb forty seconds later. Every span was green. End-to-end p95 read 980ms, comfortably under our budget, and not one span in that waterfall was longer than 400ms. The dashboard told me everything was fine while the customer was, in fact, talking to silence.\n\n**TL;DR:** End-to-end voice latency is not the sum of your spans. The number that kills your UX lives in the unattributed time *between* spans, most often the gap between turn-end (the moment the user stops talking) and ASR-start (the moment your pipeline begins transcribing). APM-style tracing instruments the work and ignores the waiting, so the gap is invisible by construction. You have to put a span on the handoff itself.\n\nHere is the thing I keep saying and keep being right about: most \"LLM observability\" is just APM with extra steps. It watches the model. It traces the LLM call, the tool call, the retrieval, the token count, all the parts a backend engineer already knows how to think about. For a voice agent that is the wrong half of the system. Voice agents do not break inside the LLM call. They break in the audio pipeline, in the orchestration between components, in the handoffs nobody owns a span for. 
Your model can be fast and your product can still feel broken, and your dashboard will not say a word about it.\n\nOur turn looks like this on paper. VAD/turn-detection decides the user is done. Audio goes to ASR (Whisper Large v3, streaming). The transcript goes to the LLM (gpt-4o-realtime) for a first token, then the full response. The response streams to TTS (ElevenLabs) for the first audio byte, which is the moment the user hears anything. There is network on both ends.\n\nI pulled the one trace from the 1.4-second call. Not an aggregate, the actual trace. Here is the latency budget I had been staring at for weeks, the summed-span view:\n\n| Stage | p50 | p95 | p99 | who owns the span |\n|---|---|---|---|---|\n| VAD / turn-detection | 60ms | 120ms | 180ms | orchestrator |\n| ASR (streaming) | 180ms | 310ms | 540ms | ASR client |\n| LLM TTFT | 220ms | 380ms | 720ms | model client |\n| LLM full response | 140ms | 260ms | 430ms | model client |\n| TTS first byte | 90ms | 190ms | 360ms | TTS client |\n| Network (both legs) | 40ms | 90ms | 150ms | gateway |\n\nAdd the p95 column. It comes to 1350ms. Our reported end-to-end p95 was 980ms (percentiles do not stack: a single request rarely hits the tail on every stage at once, so the real end-to-end p95 sits below the naive sum). Fine. Either way, both numbers are wrong about the call that paged me, because the call that paged me had 1.4 seconds the table does not contain. None of these rows is the dead air. The dead air is the white space between two of them.\n\nWhen you look at a single voice turn in a normal tracing UI, you get a waterfall of bars. Each bar is a span. The instinct, the APM instinct, is to find the longest bar and optimize it. I spent two days doing exactly that. I made ASR faster. I shaved 40ms off TTFT with a prompt cache. 
The summed bars got shorter and the dead air did not move, because the dead air was never a bar.\n\nHere is the timeline I finally drew on a whiteboard, because the tracing UI would not draw it for me.\n\n*Figure: one voice turn. The captured spans are short and correct, but they start 1400ms late. The damage is the unattributed gap to their left.*\n\nThat bracket on the left is the whole post. The spans were honest. They were short, they were green, they summed to a healthy number. They just started 1.4 seconds after the user stopped talking, and nothing in the trace measured the wait, because the code path between \"turn-detection fired\" and \"ASR client opened a stream\" did not open a span. It awaited a coroutine, hit a connection-pool stall under load, and sat there. Silent. Unspanned. Invisible.\n\nThe turn-detection callback handed off to ASR through a queue, and the ASR client lazily established its streaming connection on first use. Under concurrent calls, that connection setup contended on a pool that was sized for steady state, not for the moment six calls all finished a turn inside the same 200ms window. So turn-end fired, the handoff coroutine queued the audio, and then waited on a connection that was busy being born. By the time the ASR span opened, 1.4 seconds had passed. The ASR span itself then ran in 300ms, green and blameless.\n\nThe fix is two parts. Put a span around the handoff so the gap stops being invisible. Then fix the pool. You cannot fix what you cannot see, and the entire reason this lived in production for weeks is that the gap was never a measurable thing.\n\nHere is the real instrumentation. 
This is OpenTelemetry Python, using `opentelemetry-api` and `opentelemetry-sdk`. These are the actual SDK calls, runnable once `asr_client`, `llm_client`, and `tts_client` (our component clients, not shown here) are in scope.\n\n```python\nfrom opentelemetry import trace\nfrom opentelemetry.trace import SpanKind, Status, StatusCode\n\ntracer = trace.get_tracer(\"voice.turn\")\n\nasync def handle_turn(audio_in, ctx):\n    # The outer span is the whole turn, anchored at turn-end.\n    with tracer.start_as_current_span(\n        \"voice.turn\",\n        kind=SpanKind.SERVER,\n    ) as turn_span:\n        turn_span.set_attribute(\"call.id\", ctx.call_id)\n        turn_span.set_attribute(\"turn.index\", ctx.turn_index)\n\n        # THE MISSING SPAN: turn-detection -> ASR-start handoff.\n        # Everything that happens between \"user stopped talking\" and\n        # \"ASR actually began\" gets measured here, including the wait\n        # for a streaming connection that used to be invisible.\n        with tracer.start_as_current_span(\"voice.handoff.vad_to_asr\") as hs:\n            hs.set_attribute(\"handoff.from\", \"turn_detection\")\n            hs.set_attribute(\"handoff.to\", \"asr\")\n            try:\n                asr_stream = await asr_client.open_stream(ctx)\n            except Exception as exc:\n                hs.set_status(Status(StatusCode.ERROR, str(exc)))\n                hs.record_exception(exc)\n                raise\n            # mark when audio truly starts flowing into ASR\n            hs.add_event(\"asr_stream_ready\")\n\n        # ASR itself. Short and green. 
Never the problem.\n        with tracer.start_as_current_span(\"voice.asr\") as asr_span:\n            transcript = await asr_stream.transcribe(audio_in)\n            asr_span.set_attribute(\"asr.transcript_chars\", len(transcript))\n\n        # LLM and TTS spans continue as before.\n        with tracer.start_as_current_span(\"voice.llm\") as llm_span:\n            reply = await llm_client.complete(transcript, ctx)\n            llm_span.set_attribute(\"llm.model\", ctx.model)\n\n        with tracer.start_as_current_span(\"voice.tts\") as tts_span:\n            first_byte = await tts_client.first_audio_byte(reply)\n            tts_span.set_attribute(\"tts.first_byte_ms\", first_byte.elapsed_ms)\n\n        return reply\n```\n\nThe point is the `voice.handoff.vad_to_asr` span. It wraps the dead zone between two components that each had their own span and were each, individually, fast. Now the wait has a name and a duration. The next time six calls finish a turn at once, the handoff span balloons to 1400ms and the connection-pool stall is right there in the waterfall instead of hiding in the white space.\n\nAnd once the span exists, you can query for it. Here is the trace query I now run, written for a backend that speaks SQL-ish over spans (Honeycomb's query builder maps to the same idea, and so does any OTLP store you can point at ClickHouse). It surfaces turns where the handoff alone blew past 250ms:\n\n```sql\nSELECT\n  trace_id,\n  call_id,\n  duration_ms AS handoff_ms\nFROM spans\nWHERE name = 'voice.handoff.vad_to_asr'\n  AND duration_ms > 250\nORDER BY handoff_ms DESC\nLIMIT 50;\n```\n\nThat query returns nothing on a normal day and lights up the instant the pool starts contending. I wired it to an alert on the handoff span's p95, not the end-to-end p95, because the end-to-end p95 is exactly the number that lied to me on June 3rd.\n\nThe pool fix was unglamorous. 
Pre-warm the ASR streaming connections, size the pool for burst concurrency instead of average, and keep the connections alive between turns instead of opening them lazily. The handoff went from 1400ms on the bad call to a p95 of 70ms in steady state. The dead air was gone the same afternoon I shipped the span, because the span told me precisely where to put the fix.\n\nInstrumenting the handoff makes the gap visible. It does not make your infrastructure fast. A few honest limits.\n\nIt does not fix jitter under load on its own. The span tells you the handoff is slow, but if your pool, your event loop, or your GC is the bottleneck, you still have to go fix that. The span is a flashlight, not a wrench.\n\nIt does nothing about provider-side queueing you cannot see. When ElevenLabs or your ASR vendor queues your request on their side, your client-side span measures the wait but cannot attribute it past the boundary. You will know *that* you waited, not *why* the provider made you wait. For that you need their status, their rate-limit headers, sometimes a support ticket.\n\nAnd it will not catch every gap automatically. I added the VAD-to-ASR span because that is where this fire was. There are other handoffs (ASR-to-LLM, LLM-to-TTS, barge-in cancellation) and each one needs its own span if you want to see its gap. Instrument the ones that hurt first.\n\n**Lesson:** Instrument the handoffs, not just the calls. A green waterfall of short, correct spans can still add up to a customer saying \"hello?\" into silence, because the damage is the time between the bars, and a trace only shows you the bars you drew. The day I stopped trusting the summed p95 and started putting spans on the gaps is the day the dead air stopped paging me. If you run voice agents, go find your turn-end-to-ASR-start handoff right now, wrap it in a span, and alert on that span alone. 
It is the cheapest 1.4 seconds you will ever buy back.", "url": "https://wpnews.pro/news/the-1-4-seconds-that-weren-t-on-any-span", "canonical_source": "https://dev.to/realmarcuschen/the-14-seconds-that-werent-on-any-span-483m", "published_at": "2026-06-24 23:38:04+00:00", "updated_at": "2026-06-25 00:12:50.078664+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "developer-tools", "ai-products", "ai-infrastructure"], "entities": ["Honeycomb", "Whisper Large v3", "gpt-4o-realtime", "ElevenLabs", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/the-1-4-seconds-that-weren-t-on-any-span", "markdown": "https://wpnews.pro/news/the-1-4-seconds-that-weren-t-on-any-span.md", "text": "https://wpnews.pro/news/the-1-4-seconds-that-weren-t-on-any-span.txt", "jsonld": "https://wpnews.pro/news/the-1-4-seconds-that-weren-t-on-any-span.jsonld"}}