Building a real-time desktop AI copilot for calls: the hard parts

Source: dev.to · Published: 2026-06-21T21:11Z · Topic: artificial-intelligence

A developer built a real-time desktop AI copilot for online calls that displays hints within seconds. The system captures system audio (not microphone) using ScreenCaptureKit on macOS and WASAPI loopback on Windows, streams audio to a server for speech-to-text and LLM processing, and shows an overlay excluded from screen capture. Key challenges included platform-specific audio capture, latency optimization, dialog context management, and privacy safeguards.

Half a year ago I asked a simple question: during an online call, could a short, to-the-point hint appear on my screen in a second or two — while the other person is still talking? Not an after-the-fact transcript, but help in the moment.

The result is a desktop assistant (macOS + Windows). Below is an honest breakdown of what turned out to be hard, and which solutions worked. Engineering only, no marketing.

Architecture in one paragraph

On the device there are only two things: audio capture and a thin UI overlay. All the "brains" (provider keys, prompts, model selection) live on the server. The client gets a short-lived per-session token and streams audio; the server returns the transcript and the generated answer. I picked this split not for "security theater" but because otherwise keys and prompts would have to be baked into the binary — and both leak instantly.
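To make the short-lived per-session token concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not the project's actual scheme: the function names issue_token and verify_token, the payload fields sid and exp, and the 15-minute default lifetime are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    """Inverse of _b64: restore the padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, payload: str) -> str:
    return _b64(hmac.new(secret, payload.encode("ascii"), hashlib.sha256).digest())


def issue_token(secret: bytes, session_id: str, ttl_s: int = 900,
                now: Optional[float] = None) -> str:
    """Server side: sign a session id together with an expiry timestamp."""
    now = time.time() if now is None else now
    claims = {"sid": session_id, "exp": int(now) + ttl_s}  # hypothetical fields
    payload = _b64(json.dumps(claims).encode("utf-8"))
    return f"{payload}.{_sign(secret, payload)}"


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the session id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
        expected = _sign(secret, payload)
        # Constant-time comparison, done before the payload is trusted.
        if not hmac.compare_digest(signature.encode(), expected.encode()):
            return None
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None  # wrong shape, non-ASCII input, bad base64, or bad JSON
    if claims["exp"] <= now:
        return None
    return claims["sid"]
```

The signing secret stays on the server, so a token lifted from the client is only useful until it expires, and nothing long-lived ships inside the binary.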

Hard part #1: system audio, not the microphone

The mic only captures you. You need the other party's audio — i.e. the system output. And that's where the platform pain starts:

macOS. For a long time there was no native "give me system audio" API. The classic path was a virtual audio device (BlackHole/Soundflower-style); recent versions offer ScreenCaptureKit, which can hand you a process's audio (audio capture arrived in macOS 13). ScreenCaptureKit turned out to be the best option: no kernel extension or virtual audio driver for the user to install.

Windows. WASAPI loopback saves you — you can grab whatever is going to the output device, without virtual cables.

Takeaway: "system audio capture" is not one feature but two different subsystems for two OSes, and most of the early bugs were about permissions and device selection, not about audio itself.

Hard part #2: latency is everything

A hint that arrives 6 seconds late is useless — the conversation has already moved on. The latency budget has three parts:

STT (speech → text). Streaming only. Batch "recognize after the phrase ends" immediately adds 1–2 seconds. The key metrics weren't "overall accuracy on a benchmark" but streaming latency and quality on the target language with domain vocabulary.

LLM (text → answer). Token streaming is mandatory: the first token must show up almost immediately, otherwise it feels frozen. Plus an aggressive system prompt for brevity — a long answer is impossible to read out loud.

Network. RTT to the server and providers. Keeping the connection warm and not reopening sockets per phrase helps.
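One way to keep this budget honest is to measure time-to-first-token separately from total response time. The helper below is a minimal sketch in Python, not the project's code: the name measure_stream and the injectable clock parameter are assumptions made for illustration, and the token stream can be any iterable of text chunks.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, str]:
    """Consume a token stream; return (time_to_first_token, total_time, text).

    time_to_first_token is None if the stream never yields visible text.
    """
    start = clock()  # taken before the (lazy) stream produces anything
    first_token_at: Optional[float] = None
    parts = []
    for token in tokens:
        # Whitespace-only chunks do not count as a visible first token.
        if first_token_at is None and token.strip():
            first_token_at = clock() - start
        parts.append(token)
    return first_token_at, clock() - start, "".join(parts)
```

Passing a fake clock makes the measurement deterministic in tests, while the default perf_counter is suitable for real runs.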

The main lesson: optimize time-to-first-useful-token, not total response time.

Hard part #3: dialog context, not the last sentence

Feed the model only the last sentence and the answers miss the point. A real question is often smeared across 3–4 turns. So the server keeps a sliding window of the dialog and sends a coherent, role-tagged context into the prompt. Separately, there is a hotkey-triggered screenshot analysis: code or a diagram on screen gives the model what speech doesn't.
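A sliding, role-tagged window can be sketched in a few lines of Python. This is an illustration under assumptions, not the server's real code: the class name DialogWindow, the role labels "me" and "them", and the limits of 8 turns and 2000 characters are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass(frozen=True)
class Turn:
    role: str  # hypothetical labels: "me" or "them"
    text: str


class DialogWindow:
    """Keeps the most recent turns and renders them as prompt context."""

    def __init__(self, max_turns: int = 8, max_chars: int = 2000) -> None:
        self._turns: Deque[Turn] = deque(maxlen=max_turns)
        self._max_chars = max_chars

    def add(self, role: str, text: str) -> None:
        text = text.strip()
        if not text:
            return
        # Streaming STT emits fragments; merge consecutive ones from one speaker.
        if self._turns and self._turns[-1].role == role:
            previous = self._turns.pop()
            text = f"{previous.text} {text}"
        self._turns.append(Turn(role, text))  # deque drops the oldest turn if full

    def render(self) -> str:
        lines = [f"{turn.role.upper()}: {turn.text}" for turn in self._turns]
        # Drop the oldest lines until the context fits the character budget,
        # always keeping at least the latest turn.
        while len(lines) > 1 and sum(len(line) + 1 for line in lines) > self._max_chars:
            lines.pop(0)
        return "\n".join(lines)
```

Merging same-speaker fragments keeps a question that arrives in several STT chunks as one turn, so it does not push older context out of the window prematurely.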

Hard part #4: an overlay excluded from screen capture

The technically fun part. The hint window must be visible to the user but not show up in screen sharing or recordings. On macOS this is solved with the window level and an exclude-from-capture flag (sharingType); on Windows with display affinity (SetWindowDisplayAffinity with WDA_EXCLUDEFROMCAPTURE). The catch: behavior depends on how the conferencing app captures the screen (composited vs. raw), so it took an "OS × call platform" test matrix.
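For the Windows half, the flag is applied through the Win32 function SetWindowDisplayAffinity. The sketch below calls it from Python via ctypes purely for illustration; the helper name exclude_from_capture is hypothetical, the window handle must belong to the calling process, and on non-Windows platforms the helper does nothing and returns False. The macOS counterpart lives in AppKit and is not shown here.

```python
import sys

# Documented Win32 display-affinity value (honored on Windows 10 version 2004+).
WDA_EXCLUDEFROMCAPTURE = 0x00000011


def exclude_from_capture(hwnd: int) -> bool:
    """Ask Windows to hide the window `hwnd` from screen capture.

    Returns True if the call succeeded, False on failure or on other platforms.
    """
    if sys.platform != "win32":
        return False  # nothing to do here; other OSes use different APIs

    import ctypes
    from ctypes import wintypes

    user32 = ctypes.WinDLL("user32", use_last_error=True)
    set_affinity = user32.SetWindowDisplayAffinity
    set_affinity.argtypes = [wintypes.HWND, wintypes.DWORD]
    set_affinity.restype = wintypes.BOOL
    return bool(set_affinity(hwnd, WDA_EXCLUDEFROMCAPTURE))
```

A True result only means the OS accepted the flag. Whether a given conferencing app actually honors it is exactly what the test matrix has to confirm.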

Hard part #5: privacy and trust

Once an app listens to calls, the immediate question is "what about the data?" The choices I landed on:

keys and prompts live only on the server, no secrets in the client;

a short-lived per-session token instead of a persistent one;

audio capture and screen analysis only on an explicit user action — no background "listening."

What I'd do differently

Lock down a latency metric earlier and run it in CI against real recordings (noise, accents) instead of "by ear."

Take platform permissions seriously from day one: at launch ~80% of tickets were "can't hear the other side", which was a permissions problem, not a code one.
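The CI latency gate from the first point could look roughly like the Python sketch below. It assumes time-to-first-token has already been measured for each test recording; the function names, the 1.5-second budget, and the choice of the 95th percentile are illustrative assumptions, not figures from the project.

```python
import math
from typing import Sequence, Tuple


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample that is >= p% of all samples."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def check_latency_budget(
    ttft_seconds: Sequence[float],
    budget_s: float = 1.5,  # hypothetical budget
    p: int = 95,
) -> Tuple[bool, float]:
    """Return (passed, observed) for the p-th percentile of time-to-first-token."""
    observed = percentile(ttft_seconds, p)
    return observed <= budget_s, observed
```

Gating on a high percentile rather than the mean keeps a few slow recordings (heavy noise, strong accents) from hiding behind a good average.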

If you want to poke at the result — the project is Suflo (macOS & Windows): suflo.ru. Happy to go deep on real-time STT and system-audio capture in the comments.


Read original on dev.to → dev.to/_1002282ce22ffc6094/building-a-real-time-…
