{"slug": "building-a-real-time-desktop-ai-copilot-for-calls-the-hard-parts", "title": "Building a real-time desktop AI copilot for calls: the hard parts", "summary": "A developer built a real-time desktop AI copilot for online calls that displays hints within seconds. The system captures system audio (not microphone) using ScreenCaptureKit on macOS and WASAPI loopback on Windows, streams audio to a server for speech-to-text and LLM processing, and shows an overlay excluded from screen capture. Key challenges included platform-specific audio capture, latency optimization, dialog context management, and privacy safeguards.", "body_md": "Half a year ago I asked a simple question: during an online call, could a short, to-the-point hint appear on my screen in a second or two, while the other person is still talking? Not an after-the-fact transcript, but help in the moment.\n\nThe result is a desktop assistant (macOS + Windows). Below is an honest breakdown of what turned out to be hard, and which solutions worked. Engineering only, no marketing.\n\nArchitecture in one paragraph\n\nOn the device there are only two things: audio capture and a thin UI overlay. All the \"brains\" (provider keys, prompts, model selection) live on the server. The client gets a short-lived per-session token and streams audio; the server returns the transcript and the generated answer. I picked this split not for \"security theater\" but because otherwise keys and prompts would have to be baked into the binary, and both leak instantly.\n\nHard part #1: system audio, not the microphone\n\nThe mic only captures you. You need the other party's audio, i.e. the system output. And that's where the platform pain starts:\n\nmacOS. 
For a long time there was no native \"give me system audio\" API; the classic path was a virtual audio device (BlackHole/Soundflower-style) or, in recent macOS versions, ScreenCaptureKit, which can hand you a process's audio. ScreenCaptureKit turned out to be the best option: no virtual audio driver or kernel extension for the user to install.\n\nWindows. WASAPI loopback saves you: you can grab whatever is going to the output device, without virtual cables.\n\nTakeaway: \"system audio capture\" is not one feature but two different subsystems for two OSes, and most of the early bugs were about permissions and device selection, not about audio itself.\n\nHard part #2: latency is everything\n\nA hint that arrives 6 seconds late is useless: the conversation has already moved on. The latency budget has three parts:\n\nSTT (speech → text). Streaming only. Batch \"recognize after the phrase ends\" immediately adds 1–2 seconds. The key metrics weren't \"overall accuracy on a benchmark\" but streaming latency and quality on the target language with domain vocabulary.\n\nLLM (text → answer). Token streaming is mandatory: the first token must show up almost immediately, otherwise it feels frozen. Plus an aggressive system prompt for brevity, because a long answer is impossible to read out loud.\n\nNetwork. RTT to the server and providers. Keeping the connection warm and not reopening sockets per phrase helps.\n\nThe main lesson: optimize time-to-first-useful-token, not total response time.\n\nHard part #3: dialog context, not the last sentence\n\nFeed the model only the last sentence and the answers miss. A real question is often smeared across 3–4 turns. So the server keeps a sliding window of the dialog and sends a coherent, role-tagged context into the prompt. Separately, there is a hotkey-triggered screenshot analysis: code or a diagram on screen gives the model what speech doesn't.\n\nHard part #4: an overlay excluded from screen capture\n\nThe technically fun part. 
The hint window must be visible to the user but not show up in screen sharing or recordings. On macOS this is solved with the window level and an exclude-from-capture flag (sharingType); on Windows with window display affinity (WDA_EXCLUDEFROMCAPTURE, set via SetWindowDisplayAffinity). The catch: behavior depends on how the conferencing app captures the screen (composited vs. raw), so it took an \"OS × call platform\" test matrix.\n\nHard part #5: privacy and trust\n\nOnce an app listens to calls, the immediate question is \"what about the data?\" The choices I landed on:\n\nkeys and prompts live only on the server, no secrets in the client;\n\na short-lived per-session token instead of a persistent one;\n\naudio capture and screen analysis only on an explicit user action, with no background \"listening.\"\n\nWhat I'd do differently\n\nLock down a latency metric earlier and run it in CI against real recordings (noise, accents) instead of \"by ear.\"\n\nNot underestimate platform permissions: at launch ~80% of tickets were \"can't hear the other side\", which was a permissions problem, not a code one.\n\nIf you want to poke at the result, the project is Suflo (macOS & Windows): suflo.ru. 
Happy to go deep on real-time STT and system-audio capture in the comments.", "url": "https://wpnews.pro/news/building-a-real-time-desktop-ai-copilot-for-calls-the-hard-parts", "canonical_source": "https://dev.to/_1002282ce22ffc6094/building-a-real-time-desktop-ai-copilot-for-calls-the-hard-parts-2e4o", "published_at": "2026-06-21 21:11:31+00:00", "updated_at": "2026-06-21 21:25:14.441904+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "natural-language-processing", "developer-tools", "ai-products"], "entities": ["macOS", "Windows", "ScreenCaptureKit", "WASAPI", "BlackHole", "Soundflower"], "alternates": {"html": "https://wpnews.pro/news/building-a-real-time-desktop-ai-copilot-for-calls-the-hard-parts", "markdown": "https://wpnews.pro/news/building-a-real-time-desktop-ai-copilot-for-calls-the-hard-parts.md", "text": "https://wpnews.pro/news/building-a-real-time-desktop-ai-copilot-for-calls-the-hard-parts.txt", "jsonld": "https://wpnews.pro/news/building-a-real-time-desktop-ai-copilot-for-calls-the-hard-parts.jsonld"}}