The ghost in my transcript: why my AI meeting app kept saying 'Thank you.'

A developer building a passive AI meeting assistant discovered that their speech recognition model was hallucinating the phrase 'Thank you' during applause, laughter, and silence. The model, trained on real-world audio where applause often follows 'Thank you,' confidently emits the phrase even when no speech is present. The developer found that the real-time ASR protocol provides no confidence scores to filter such errors, and voice activity detection cannot distinguish loud applause from speech.

I build a passive AI meeting assistant. It records, transcribes live in 26 languages, and fact-checks claims against the web in real time. The transcript is the flagship surface, the thing users trust. So when I opened a recording the other night and saw this, my stomach dropped:

    12:21 Thank you.
    12:27 Thank you.
    13:16 Thank you.
    13:16 Thank you.
    13:19 Do we love our children?
    13:20 Thank you.

Nobody said "Thank you." Not once. The speaker was mid-rant about generational politics ("Washington has become a cross between the land of the dead and the Golden Girls" is a real line my app captured perfectly). And scattered through this otherwise-sharp transcript were a dozen phantom "Thank you"s.

My first reaction was the same as anyone's: stupid ASR. But "stupid" isn't a root cause, and I had a product to ship. So I went looking.

The phantom lines weren't random. I lined them up against the audio and there it was: every single "Thank you." landed on applause, laughter, or a beat of silence between sentences. The model wasn't mis-hearing words. It was inventing words when there were no words to hear. That clue is the whole story.

This is one of the most famous failure modes in modern speech recognition, and once you see the mechanism you can't unsee it. Models like Whisper, qwen3-asr, and friends are trained on enormous piles of real-world audio: YouTube, podcasts, talks, lectures.
They learn a conditional distribution: given this audio, what's the most likely text? They are very good at this when the audio is speech. But they were never really taught what to do with non-speech. Applause, laughter, silence, music: the training data is full of those moments too, and they're labeled with whatever the human transcriber wrote.

And what do humans write at the end of a talk, right when the audience erupts in applause? "Thank you." "Thanks for watching." "Thank you very much."

So the model learns a rock-solid association: clapping sounds → "Thank you." Feed it applause and it doesn't shrug and return nothing. It confidently emits the single highest-probability phrase it has ever seen paired with that acoustic texture.

It's not a bug in my code. It's the model doing exactly what it was trained to do, just in a context nobody curated for. This is a "confidently wrong" failure, the most dangerous kind, because the output looks like every other line.

The obvious instinct is: "just drop low-confidence segments." Great idea. One problem. I'm streaming through a realtime ASR over a WebSocket protocol, and I went and checked exactly what comes back on the wire. The final transcript event looks like this:

    {
      "type": "conversation.item.input_audio_transcription.completed",
      "transcript": "Thank you."
    }

That's it. No confidence. No logprob. No no_speech_probability. The model gives me the text and nothing else. There is no number to threshold on, because the model doesn't hand me one.

So the entire class of "filter by confidence" solutions is off the table before I even start. Worth knowing your wire protocol before you design around a field that doesn't exist.

Second idea: tighten the Voice Activity Detection (VAD).
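
For readers who haven't met the term, here is a minimal sketch of what an energy-based gate does, assuming the simplest possible design: compute the RMS level of each audio frame and forward the frame only if it exceeds a threshold. The function names, the synthetic frames, and the threshold scale are all made up for illustration; the upstream service's VAD is a black box to me and almost certainly works differently in detail.

```swift
/// Root-mean-square level of one audio frame (samples assumed to lie in -1...1).
func rms(_ frame: [Float]) -> Float {
    guard !frame.isEmpty else { return 0 }
    let sumOfSquares = frame.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(frame.count)).squareRoot()
}

/// Toy gate (hypothetical): a frame reaches the recognizer only if it is loud enough.
func passesEnergyGate(_ frame: [Float], threshold: Float) -> Bool {
    return rms(frame) > threshold
}

// Synthetic 160-sample frames standing in for real audio (illustrative values only).
let silence = [Float](repeating: 0.001, count: 160)                      // near-silent floor
let softSpeech: [Float] = (0..<160).map { $0 % 2 == 0 ? 0.05 : -0.05 }   // quiet signal
let applause: [Float] = (0..<160).map {                                  // loud, noise-like signal
    Float(($0 * 37) % 100) / 100 * 1.2 - 0.6
}

// Print which frames each threshold lets through.
let thresholds: [Float] = [0.0, 0.02, 0.1]
for threshold in thresholds {
    print("threshold \(threshold):",
          "silence", passesEnergyGate(silence, threshold: threshold),
          "softSpeech", passesEnergyGate(softSpeech, threshold: threshold),
          "applause", passesEnergyGate(applause, threshold: threshold))
}
```

The only input the gate looks at is loudness, so the threshold is the single knob it offers. Keep that in mind for the config that follows.
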
My session config tells the upstream how aggressively to gate non-speech:

    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.0,            // <- maximally permissive
      "silence_duration_ms": 400
    }

That threshold: 0.0 means "treat basically everything as speech." Raising it would make the model's own VAD reject quiet, low-energy audio before it ever tries to transcribe, which kills the silence-gap hallucinations (those phantom lines between sentences).

But here's the trap: applause is loud. Laughter is loud. An energy-based VAD threshold can't tell a clapping crowd from a talking human, because they're both well above any silence floor. So bumping the threshold helps the quiet gaps and does nothing for the exact screenshots that started this whole investigation.

Worse, there's a values cost. My app has one sacred rule: never miss real speech. Crank the VAD too high and you start clipping a soft-spoken participant. Trading a real sentence to suppress a fake "Thank you." is a bad trade. So VAD tuning is, at best, a cautious complement, never the main fix.

If the model emits a fabricated phrase, and I can't catch it by confidence, and I can't gate it by loudness... the only reliable place to catch it is on the way out. After the text exists, before it becomes part of the user's record.

This is exactly what the Whisper community landed on years ago, and it's refreshingly boring: a known-hallucination phrase filter. The shape of it:

    struct TranscriptHallucinationFilter {
        // Tight, curated, multi-word non-speech priors.
        private let blocklist: Set