{"slug": "your-voice-agent-is-slow-here-are-5-tricks-to-hide-it", "title": "Your Voice Agent Is Slow. Here Are 5 Tricks to Hide It.", "summary": "A developer shares five perception tricks to reduce felt latency in voice AI agents without improving actual response times. The techniques include playing instant acknowledgment utterances, using human-like filler sounds, and streaming answer outlines to mask delays. The approach prioritizes user perception over raw performance metrics.", "body_md": "A while back I shipped a voice agent that took roughly 1,200ms to respond. Not catastrophic on paper. Pretty bad in practice. Users would ask a question, get a beat of silence, and start over. Some thought the mic had cut out. One tester told me, with a straight face, that my agent was \"thinking too hard.\"\n\nI tried everything legitimate first. Smaller LLM. Streaming TTS. Region-pinned endpoints. I shaved off about 200ms and felt clever for a week. Then I measured again and realized I was still on the wrong side of every latency threshold that matters.\n\nSo I gave up on being faster and started working on being a better liar.\n\nThis is the playbook I wish I had when I started: five perception tricks that reduce *felt* latency without touching the actual numbers. They're the voice-AI equivalent of a magician's misdirection. Your right hand waves at the audience. Your left hand swaps the card.\n\nIn a [previous article](https://kenimoto.dev/) I broke down the three latency cliffs for voice AI. The short version:\n\nIf your stack is doing STT plus LLM plus TTS plus network, hitting 200ms end-to-end is, frankly, a fantasy for most teams. You can chase it. You can throw money at it. You can cache and prefetch and stream. At some point you bottom out.\n\nThat's where perception work begins. The user can't measure your p99 latency. They can only measure how the agent *feels*. 
Those are two different problems and they have two different solutions.\n\n**What it is:** A short, instant utterance played the moment the user finishes speaking, before any LLM work begins.\n\n**Why it works:** Silence is the worst possible feedback. Even 400ms of nothing feels longer than 800ms with a \"let me check\" in front of it. The user's brain logs \"the agent heard me\" and resets its impatience timer.\n\n**Implementation difficulty:** Low. Pre-generate three or four short audio clips, pick one based on the rough intent class, play it the instant your VAD confirms end-of-speech. No LLM in the loop.\n\n**Payoff:** Big. This is the single highest-ROI thing I've done. It's also the most embarrassing, because the fix is \"say a word.\"\n\n**Where it backfires:** If the acknowledgment doesn't match the request, users notice. An \"On it!\" before \"what's the weather\" sounds psychotic. Keep your tokens neutral. \"Got it\" is safer than \"Sure thing!\"\n\n**What it is:** The agent buys time the way humans buy time. Mid-sentence \"uh,\" a soft \"let me check on that,\" a thoughtful \"hmm\" while the LLM grinds through tokens in the background.\n\n**Why it works:** Humans use fillers to signal cognitive load. When an agent does the same, listeners parse it as *thinking*, not as *broken*. Research from ACM CUI 2025 found this effect is strongest exactly where you need it most: under high latency conditions (4+ seconds), where naked silence is fatal but a filler turns the same wait into \"the agent is being thorough.\"\n\nI once shipped a voice agent that said \"um\" so much, users thought it was actually thinking. It wasn't. It was just buying time for the LLM. Felt great. Looked great in user studies. Slightly weird when I demoed it to my mother.\n\n**Implementation difficulty:** Medium. The filler has to feel native, not robotic. Use real recorded human fillers, not TTS. 
TTS \"um\" sounds like a glitch.\n\n**Payoff:** Buys you 1.5 to 2 seconds of plausible cover.\n\n**Where it backfires:** Overuse. If your agent ums on every turn, it stops reading as natural and starts reading as a stalling tactic. Reserve fillers for turns you've predicted will be slow.\n\n**What it is:** Stream the *shape* of the answer first. \"There are three things to know here. The first one is...\" while the rest is still being generated.\n\n**Why it works:** The user gets immediate signal that the answer is on its way and roughly how big it'll be. Their brain stops watching the clock and starts unpacking content. By the time they're processing point one, the LLM has caught up to point two.\n\n**Implementation difficulty:** Medium-high. Requires either a planning step that emits structure first, or careful prompt design that forces the model to commit to a frame before details.\n\n**Payoff:** Excellent for long answers. Useless for one-line responses.\n\n**Where it backfires:** When the model says \"there are three reasons\" and then can only think of two. Don't ask me how I know.\n\n**What it is:** A library of short, intent-specific opening phrases that the agent plays while the real response is being generated. Not a full answer, just a warmup. \"Let me pull up your schedule.\" \"Checking the weather now.\" \"One moment, finding that for you.\"\n\n**Why it works:** Two things at once. The user gets confirmation that their intent was understood (subtly different from a generic acknowledgment), and you get another second or two of cover for the real generation. The crucial difference from trick 1 is that this is *context-aware*. \"Checking the weather\" only fires if the agent is, in fact, checking the weather.\n\n**Implementation difficulty:** Medium. You need solid intent classification at the start of the turn, plus a phrase library that doesn't sound canned.\n\n**Payoff:** Strong. 
This is what GetStream calls \"speculative tool calling\": fire the tool call early based on predicted intent, run the warmup phrase in parallel, hope you predicted right.\n\n**Where it backfires:** When intent classification is wrong. \"Checking your schedule\" followed by \"actually I can't help with that\" is worse than just admitting it from the start.\n\n**What it is:** If your voice agent has any visual surface (a phone screen, a dashboard, a kiosk display), show *something*. A pulsing dot. A waveform. A typing animation. A face that nods.\n\n**Why it works:** Multimodal feedback compresses perceived latency. Tohoku University's MDPI 2025 study on embodied conversational agents found that showing emotion or activity on a face during the wait reduced user dissatisfaction with response delays. The wait stops feeling like dead air and starts feeling like \"the system is working on it.\"\n\nThis is also why every chat UI has a typing indicator. You'd think after twenty years we'd have stopped falling for it. We have not.\n\n**Implementation difficulty:** Trivial if you have a screen. Impossible if you don't.\n\n**Payoff:** Surprisingly large per dollar of effort. A pulsing dot is one CSS animation away.\n\n**Where it backfires:** When the visual stays up longer than feels reasonable. A typing dot that pulses for 12 seconds is no longer reassuring. It's a hostage situation.\n\nI keep this rough mapping in my head:\n\n| Situation | Best trick | Why |\n|---|---|---|\n| Predictable intent, slow tool call | Pre-canned warmup (4) | Buys 1-2s of cover with context |\n| Unpredictable intent, fast LLM | Acknowledgment token (1) | Instant feedback, no risk of misfire |\n| Long-form answer | Progressive disclosure (3) | Streams structure, masks generation tail |\n| Worst-case latency (4s+) | Conversational fillers (2) | Reframes wait as thinking |\n| Multimodal product | Visual decoy (5) | Cheapest perceived-latency win available |\n\nNone of these is a silver bullet. 
The real game is layering. On a typical turn in my current agent, an acknowledgment token fires immediately, a warmup phrase plays once intent is classified, the visual indicator runs throughout, and progressive disclosure handles the response delivery. Four tricks stacked. The user perceives one smooth conversation.\n\nWhen I measured a recent build, physical latency was around 750ms. Self-reported perceived latency from testers landed around 350ms. The measured number didn't move. The perceived wait dropped by more than half.\n\nLet me be honest about what I'm doing here. I'm not making the agent faster. I'm making the user *trust the agent more than the latency warrants*.\n\nThat works fine when the agent is genuinely on the user's side: answering a question, executing a benign command. It works less fine when the agent is selling something, or upselling, or the wait is being engineered to suggest more value than there actually is.\n\nA pulsing \"thinking...\" indicator that runs for three seconds when the answer was cached and instant is, technically, a lie. Most users wouldn't care. Some would. The line between \"perception design\" and \"manipulating perception\" is thinner than I'd like to admit, and I don't always know which side of it I'm on.\n\nMy current rule of thumb: every perception trick should make a slow but honest experience feel acceptable. None should make a fast experience feel slower for theatrical reasons. If you find yourself adding latency to seem more thoughtful, you've crossed that line.\n\nAlso: don't use these to cover up an agent that's genuinely broken. Hiding 5-second latency behind a pile of \"ums\" doesn't fix the 5-second latency. It just makes the bug harder to spot in user testing. I learned this one the slow way.\n\nYou probably can't make your voice agent hit 200ms. You probably *can* make it feel like it does. 
The five tricks above get most teams from \"users hate this\" to \"users don't notice.\" That's most of the win.\n\nThe 200ms agent still lives somewhere on the horizon. Until you can build it, you can fake it credibly. Just don't fake it in directions that hurt the people on the other end of the line.\n\nI wrote a short book covering the latency cliffs, the perception hacks above, and the architecture choices that make a voice agent feel responsive even when it physically isn't. If this article was useful, the book goes considerably deeper.", "url": "https://wpnews.pro/news/your-voice-agent-is-slow-here-are-5-tricks-to-hide-it", "canonical_source": "https://dev.to/kenimo49/your-voice-agent-is-slow-here-are-5-tricks-to-hide-it-3pcb", "published_at": "2026-06-13 00:46:21+00:00", "updated_at": "2026-06-13 01:13:14.211615+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "natural-language-processing", "ai-products", "ai-tools"], "entities": ["ACM CUI 2025"], "alternates": {"html": "https://wpnews.pro/news/your-voice-agent-is-slow-here-are-5-tricks-to-hide-it", "markdown": "https://wpnews.pro/news/your-voice-agent-is-slow-here-are-5-tricks-to-hide-it.md", "text": "https://wpnews.pro/news/your-voice-agent-is-slow-here-are-5-tricks-to-hide-it.txt", "jsonld": "https://wpnews.pro/news/your-voice-agent-is-slow-here-are-5-tricks-to-hide-it.jsonld"}}