Source: dev.to

LLM observability tools are blind to the voice layer. Here is what I checked 6 of them for.

A developer evaluated six LLM observability tools (Langfuse, Helicone, Arize Phoenix, LangSmith, Braintrust, and Laminar) for their ability to monitor the audio layer in voice agents, not just LLM calls. The tools are blind to critical audio-layer spans like end-of-turn detection, ASR latency, and barge-in events unless custom instrumentation is added. OpenTelemetry-based tools (Langfuse, Phoenix, Laminar) offer the best canvas for capturing these spans, but none ship voice-aware instrumentation out of the box.

Published Jun 18, 2026 · 3 min read

Most LLM observability tools trace the same thing: the prompt, the completion, the tokens, the latency of the model call. For a text agent that is most of the story. For a voice agent it is maybe a fifth of it, because the failures that actually make a voice agent feel broken happen in the audio layer, and a tracer pointed at the LLM call cannot see them. I went through six observability tools (Langfuse, Helicone, Arize Phoenix, LangSmith, Braintrust, and Laminar) and asked the same question of each: can it show me the audio layer, or only the LLM call?

The audio layer is where the real spans are. End-of-turn detection: how long did the agent wait before deciding the caller was done? ASR latency and confidence: how long did transcription take, and how sure was it? Barge-in: did the caller interrupt, and did the agent yield? Time-to-first-audio: how long from the caller finishing to the agent making a sound? None of these are LLM-call metrics, and a green LLM-latency dashboard tells you nothing about any of them. I have watched a voice agent with a perfectly healthy model-call trace feel sluggish and rude to every caller, because the lag and the interruptions lived in spans the tracer was not capturing.
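To make those four measurements concrete, here is a minimal sketch of how they could be derived from the timestamps of a single caller turn. Every field and function name is hypothetical, and the definitions are one reasonable reading rather than a standard (for example, ASR latency is measured here from the end of caller speech to the final transcript).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTimestamps:
    """Timestamps in milliseconds for one caller turn (hypothetical field names)."""
    caller_speech_end_ms: int        # caller stopped speaking
    end_of_turn_decided_ms: int      # endpointer decided the caller was done
    asr_final_ms: int                # final transcript became available
    agent_first_audio_ms: int        # first agent audio frame was played
    asr_confidence: float            # 0.0 to 1.0, as reported by the ASR engine
    barge_in_at_ms: Optional[int] = None     # caller started talking over the agent
    agent_stopped_at_ms: Optional[int] = None  # agent actually went quiet


def audio_layer_metrics(t: TurnTimestamps) -> dict:
    """Derive the audio-layer numbers that an LLM-call trace does not contain."""
    metrics = {
        "end_of_turn_wait_ms": t.end_of_turn_decided_ms - t.caller_speech_end_ms,
        "asr_latency_ms": t.asr_final_ms - t.caller_speech_end_ms,
        "asr_confidence": t.asr_confidence,
        "time_to_first_audio_ms": t.agent_first_audio_ms - t.caller_speech_end_ms,
        "barge_in": t.barge_in_at_ms is not None,
    }
    # Only report a yield time when the caller interrupted and the agent stopped.
    if t.barge_in_at_ms is not None and t.agent_stopped_at_ms is not None:
        metrics["barge_in_yield_ms"] = t.agent_stopped_at_ms - t.barge_in_at_ms
    return metrics
```

None of these values depend on the model call, which is why a healthy LLM-latency dashboard can sit next to a turn with a long end-of-turn wait.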

So here is how the six landed, all on the same question. Langfuse, Phoenix, and Laminar are OpenTelemetry-based, which is the good news: OTel does not care whether a span is an LLM call or an ASR call, so you can emit custom spans for endpointing, ASR, and barge-in and see them next to the model call. The catch is that you have to instrument those spans yourself: none of them ship voice-aware instrumentation; they give you the canvas. Helicone is gateway-first, so it is excellent at LLM-call logging and cost tracking and largely silent on the audio layer unless you add your own telemetry around it. LangSmith goes deep on LLM and LangChain traces; it is the most LLM-call-centric of the set and the least aware of audio by default. Braintrust gives you a clean UI for whatever you send it, so again the audio layer shows up only if you instrument it.
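The shape of "custom spans next to the model call" can be sketched without any vendor SDK. The `Recorder` class below is a toy stand-in written for illustration, not an OpenTelemetry API, and the span names and attribute keys are assumptions. With the real OpenTelemetry Python API, the analogous calls would be `tracer.start_as_current_span(name)` and `span.set_attribute(key, value)`.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]
    start_s: float
    end_s: Optional[float] = None
    attributes: dict = field(default_factory=dict)


class Recorder:
    """Toy tracer stand-in: records nested, timed spans in memory."""

    def __init__(self):
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name, **attributes):
        # The innermost open span, if any, becomes the parent.
        parent = self._stack[-1].name if self._stack else None
        s = Span(name, parent, time.monotonic(), attributes=dict(attributes))
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end_s = time.monotonic()
            self._stack.pop()


rec = Recorder()
with rec.span("voice.turn"):
    with rec.span("voice.endpointing") as ep:
        ep.attributes["timeout_ms"] = 700       # hypothetical attribute key
    with rec.span("voice.asr") as asr:
        asr.attributes["confidence"] = 0.91     # hypothetical attribute key
    with rec.span("llm.call", model="example-model"):
        pass                                    # the only span most tracers capture
    with rec.span("voice.tts.first_audio"):
        pass
```

The point of the sketch is the sibling relationship: the endpointing, ASR, and first-audio spans share a parent with the LLM call, so one trace view covers the whole turn.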

The pattern is the same across all six: the tool is only as voice-aware as the spans you feed it, and the ones built on OpenTelemetry make that easy because you are just emitting more spans into a format they already understand. The actual selection criterion for a voice agent is therefore not the LLM-tracing features every one of them advertises, but whether the tool's data model lets you put audio-layer spans right next to the model spans, so that "it feels slow" maps to a stage instead of a guess.
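As a rough illustration of mapping "it feels slow" to a stage, the sketch below takes per-stage durations for one turn and reports which stage dominates. The stage names and numbers are made up for the example, and `slowest_stage` is a hypothetical helper, not part of any of the six tools.

```python
def slowest_stage(stage_ms: dict) -> tuple:
    """Return (stage, share_of_total) for the stage with the largest duration."""
    total = sum(stage_ms.values())
    if not stage_ms or total <= 0:
        raise ValueError("need at least one stage with a positive duration")
    stage = max(stage_ms, key=stage_ms.get)
    return stage, stage_ms[stage] / total


# Hypothetical durations (ms) for one caller turn, one entry per span.
turn = {"endpointing": 900, "asr": 250, "llm": 400, "tts_first_audio": 250}
stage, share = slowest_stage(turn)
print(f"{stage} accounts for {share:.0%} of the turn")
# prints "endpointing accounts for 50% of the turn"
```

In this made-up turn the model call is under a quarter of the total, so an LLM-only trace would look healthy while the caller waits on the endpointer.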

If I were choosing today for a voice agent, I would pick an OpenTelemetry-native tool and spend the first day instrumenting the audio layer (endpoint timeout, ASR latency and confidence, barge-in events, time-to-first-audio) before touching a single LLM metric. The LLM trace is the part that is already solved. The voice layer is the part that is invisible, and invisible is where the incidents hide. The open question I have not cracked: even with audio-layer spans, "the call felt off" is a subjective, whole-conversation judgment that does not reduce cleanly to any single span. I can show you the endpoint timeout and the barge-in count, but not why the caller hung up frustrated. If anyone has tied per-span audio telemetry to a felt-quality score for a whole call, that is the conversation I want.
