The ghost in my transcript: why my AI meeting app kept saying 'Thank you.'

A developer building a passive AI meeting assistant discovered that their speech recognition model was hallucinating the phrase 'Thank you' during applause, laughter, and silence. The model, trained on real-world audio where applause often follows 'Thank you,' confidently emits the phrase even when no speech is present. The developer found that the real-time ASR protocol provides no confidence scores to filter such errors, and voice activity detection cannot distinguish loud applause from speech.

I build a passive AI meeting assistant. It records, transcribes live in 26 languages, and fact-checks claims against the web in real time. The transcript is the flagship surface, the thing users trust. So when I opened a recording the other night and saw this, my stomach dropped:

    12:21  Thank you.
    12:27  Thank you.
    13:16  Thank you.
    13:16  Thank you.
    13:19  Do we love our children?
    13:20  Thank you.

Nobody said "Thank you." Not once. The speaker was mid-rant about generational politics ("Washington has become a cross between the land of the dead and the Golden Girls" is a real line my app captured perfectly). And scattered through this otherwise-sharp transcript were a dozen phantom "Thank you"s.

My first reaction was the same as anyone's: stupid ASR. But "stupid" isn't a root cause, and I had a product to ship. So I went looking.

The phantom lines weren't random. I lined them up against the audio and there it was: every single "Thank you." landed on applause, laughter, or a beat of silence between sentences. The model wasn't mis-hearing words. It was inventing words when there were no words to hear. That clue is the whole story.

This is one of the most famous failure modes in modern speech recognition, and once you see the mechanism you can't unsee it.

Models like Whisper, qwen3-asr, and friends are trained on enormous piles of real-world audio: YouTube, podcasts, talks, lectures. They learn a conditional distribution: given this audio, what's the most likely text? They are very good at this when the audio is speech.

But they were never really taught what to do with non-speech. Applause, laughter, silence, music: the training data is full of those moments too, and they're labeled with whatever the human transcriber wrote. And what do humans write at the end of a talk, right when the audience erupts in applause? "Thank you." "Thanks for watching." "Thank you very much."

So the model learns a rock-solid association: clapping sounds → "Thank you."
Feed it applause and it doesn't shrug and return nothing. It confidently emits the single highest-probability phrase it has ever seen paired with that acoustic texture. It's not a bug in my code. It's the model doing exactly what it was trained to do, just in a context nobody curated for. This is a "confidently wrong" failure, the most dangerous kind, because the output looks like every other line.

The obvious instinct is: "just drop low-confidence segments." Great idea. One problem. I'm streaming through a realtime ASR over a WebSocket protocol, and I went and checked exactly what comes back on the wire. The final transcript event looks like this:

    {
      "type": "conversation.item.input_audio_transcription.completed",
      "transcript": "Thank you."
    }

That's it. No confidence. No logprob. No no-speech probability. The model gives me the text and nothing else. There is no number to threshold on, because the model doesn't hand me one. So the entire class of "filter by confidence" solutions is off the table before I even start. Worth knowing your wire protocol before you design around a field that doesn't exist.

Second idea: tighten the Voice Activity Detection (VAD). My session config tells the upstream how aggressively to gate non-speech:

    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.0,            // <- maximally permissive
      "silence_duration_ms": 400
    }

That threshold of 0.0 means "treat basically everything as speech." Raising it would make the model's own VAD reject quiet, low-energy audio before it ever tries to transcribe, which kills the silence-gap hallucinations (those phantom lines between sentences).

But here's the trap: applause is loud. Laughter is loud. An energy-based VAD threshold can't tell a clapping crowd from a talking human, because both sit well above any silence floor. So bumping the threshold helps the quiet gaps and does nothing for the applause and laughter cases that started this whole investigation.

Worse, there's a values cost.
My app has one sacred rule: never miss real speech. Crank the VAD too high and you start clipping a soft-spoken participant. Trading a real sentence to suppress a fake "Thank you." is a bad trade. So VAD tuning is, at best, a cautious complement, never the main fix.

If the model emits a fabricated phrase, and I can't catch it by confidence, and I can't gate it by loudness... the only reliable place to catch it is on the way out. After the text exists, before it becomes part of the user's record.

This is exactly what the Whisper community landed on years ago, and it's refreshingly boring: a known-hallucination phrase filter. The shape of it:

    struct TranscriptHallucinationFilter {
        // Tight, curated, multi-word non-speech priors.
        private let blocklist: Set<String> = [
            "thank you",
            "thank you very much",
            "thanks for watching",
            "thank you for watching",
        ]

        func isLikelyHallucination(_ text: String) -> Bool {
            // trimmingTrailingPunctuation() and collapsingWhitespace() are
            // custom String helpers, not part of Foundation.
            let normalized = text
                .trimmingCharacters(in: .whitespacesAndNewlines)
                .lowercased()
                .trimmingTrailingPunctuation()   // ".", "!", "?", "…", ","
                .collapsingWhitespace()
            // EXACT match only. Never substring.
            return blocklist.contains(normalized)
        }
    }

The non-negotiable design decisions, each one earned:

Exact match, never substring. A standalone "Thank you." is a hallucination. But "Thank you for joining us, let's get started" is a real human being. If I matched on contains, I'd start deleting genuine sentences, a far worse bug than the one I'm fixing. The filter only fires when the entire normalized segment equals a blocklist entry.

Start the blocklist tight. I seed it with the high-confidence, multi-word priors and resist the urge to add bare words. "you" and "okay" are classic hallucinations and also things people genuinely say alone. When in doubt, leave it out: a phantom "okay" slipping through is cheap; deleting a real one is not.

Drop loud, not silent. My codebase has an iron rule: failures must be loud.
So every drop increments a content-free counter and logs a line (the matched blocklist index, never the raw text; privacy first). If my filter ever starts eating real speech, I'll see the rate climb in the field instead of discovering it from an angry user.

Put it at the one chokepoint. There's a single funnel where a transcript event becomes both a saved segment and food for the insight/fact-check engine. The filter goes at the very top of that function, before either happens, so a phantom "Thank you." pollutes neither the transcript nor the downstream AI. One guard, total coverage.

And because the audio file itself is never touched, the recording stays sacred. I'm only suppressing a fabricated line from the record. If I'm ever wrong, the original audio is right there to replay.

I want to be straight about what this is. It's not magic and it's not complete.

The lesson I keep relearning: when a model does something baffling, "the model is stupid" is where the investigation starts, not where it ends. The phantom "Thank you" wasn't noise. It was the model telling me, very precisely, that it had been handed sound with no speech in it, and doing the most human thing it knew how to do. It said thanks.

Building Faktum, a passive AI meeting assistant. If you've shipped your own war story against ASR hallucinations, I'd love to hear how you handled the blocklist-vs-false-positive tradeoff.
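
Postscript, for anyone who wants to run the filter rather than just read its shape: below is a minimal, self-contained sketch. It rests on several assumptions. The bodies of trimmingTrailingPunctuation() and collapsingWhitespace() are a guess at what helpers with those names would do. The blocklist is an array rather than a set so that each entry has a stable index to log. matchIndex(for:), keepIfReal(_:), and dropCount are hypothetical names invented for illustration, and print stands in for whatever logger and metrics counter a real app would use.

```swift
import Foundation

// Hypothetical String helpers (assumed behavior, not the app's real code).
extension String {
    /// Strips trailing ".", "!", "?", "…", "," and any whitespace mixed in with them.
    func trimmingTrailingPunctuation() -> String {
        let strip: Set<Character> = [".", "!", "?", "…", ","]
        var result = self
        while let last = result.last, strip.contains(last) || last.isWhitespace {
            result.removeLast()
        }
        return result
    }

    /// Collapses every run of whitespace into a single space.
    func collapsingWhitespace() -> String {
        split(whereSeparator: { $0.isWhitespace }).joined(separator: " ")
    }
}

struct TranscriptHallucinationFilter {
    // Ordered, so the index of a match can be logged instead of the raw text.
    private let blocklist = [
        "thank you",
        "thank you very much",
        "thanks for watching",
        "thank you for watching",
    ]

    // Content-free counter: says how often the filter fired, never what was said.
    private(set) var dropCount = 0

    /// Blocklist index if the ENTIRE normalized segment equals an entry, else nil.
    func matchIndex(for text: String) -> Int? {
        let normalized = text
            .trimmingCharacters(in: .whitespacesAndNewlines)
            .lowercased()
            .trimmingTrailingPunctuation()
            .collapsingWhitespace()
        return blocklist.firstIndex(of: normalized)   // exact match, never substring
    }

    /// Returns the segment unchanged if it looks real, or nil if it was dropped.
    mutating func keepIfReal(_ text: String) -> String? {
        guard let index = matchIndex(for: text) else { return text }
        dropCount += 1
        print("transcript filter: dropped segment, blocklist index \(index)")
        return nil
    }
}

// Usage: two phantom segments are dropped, the real sentence survives.
var filter = TranscriptHallucinationFilter()
var kept: [String] = []
for segment in ["Thank you.", "Thank you for joining us, let's get started", "  THANK   YOU!! "] {
    if let real = filter.keepIfReal(segment) { kept.append(real) }
}
```

After the loop, kept holds only the "Thank you for joining us" sentence and filter.dropCount is 2. Calling keepIfReal at the top of the single funnel function is what keeps a dropped segment out of both the saved transcript and the downstream engine.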