The result is a desktop assistant (macOS + Windows). Below is an honest breakdown of what turned out to be hard, and which solutions worked. Engineering only, no marketing.
Architecture in one paragraph
On the device there are only two things: audio capture and a thin UI overlay. All the "brains" (provider keys, prompts, model selection) live on the server. The client gets a short-lived per-session token and streams audio; the server returns the transcript and the generated answer. I picked this split not for "security theater" but because otherwise keys and prompts would have to be baked into the binary, and both leak instantly.
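To make the "short-lived per-session token" concrete, here is a minimal sketch of one way a server could issue and check such a token using an HMAC signature and an expiry timestamp. The token layout, the `SECRET` value, the 5-minute `TTL_SECONDS`, and the function names are all assumptions for illustration, not the project's actual scheme.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"server-side-secret"  # hypothetical; lives only on the server
TTL_SECONDS = 300               # assumed session lifetime


def _sign(payload: bytes) -> bytes:
    return hmac.new(SECRET, payload, hashlib.sha256).digest()


def issue_token(session_id: str, now: Optional[float] = None) -> str:
    """Return 'payload.signature', both parts base64url-encoded."""
    now = time.time() if now is None else now
    payload = json.dumps({"sid": session_id, "exp": int(now) + TTL_SECONDS}).encode()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(_sign(payload)).decode()
    )


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the session id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong shape or invalid base64
        return None
    # Constant-time comparison; the payload is parsed only after it is trusted.
    if not hmac.compare_digest(sig, _sign(payload)):
        return None
    claims = json.loads(payload)
    if claims["exp"] <= now:
        return None
    return claims["sid"]
```

With this shape, a token lifted from the client is useful only until `exp`, and nothing in it lets the holder recover provider keys or prompts.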
Hard part #1: system audio, not the microphone
The mic only captures you. You need the other party's audio, i.e. the system output. And that's where the platform pain starts:
macOS. For a long time there was no native "give me system audio" API; the classic path was a virtual audio device (BlackHole/Soundflower-style) or, in recent versions, ScreenCaptureKit, which can hand you a process's audio. ScreenCaptureKit turned out to be the best option: no kernel extensions for the user to install.
Windows. WASAPI loopback saves you: you can grab whatever is going to the output device, without virtual cables.
Takeaway: "system audio capture" is not one feature but two different subsystems for two OSes, and most of the early bugs were about permissions and device selection, not about audio itself.
Hard part #2: latency is everything
A hint that arrives 6 seconds late is useless: the conversation has already moved on. The latency budget has three parts:
STT (speech → text). Streaming only. Batch "recognize after the phrase ends" immediately adds 1-2 seconds. The key metrics weren't "overall accuracy on a benchmark" but streaming latency and quality on the target language with domain vocabulary.
LLM (text → answer). Token streaming is mandatory: the first token must show up almost immediately, otherwise it feels frozen. Plus an aggressive system prompt for brevity: a long answer is impossible to read out loud.
Network. RTT to the server and providers. Keeping the connection warm and not reopening sockets per phrase helps.
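A minimal sketch of how time-to-first-token could be measured separately from total response time, assuming the answer arrives as an iterable of text chunks. The `time_to_first_token` helper, the injectable `clock`, and the fake stream with its invented delays are illustrative, not the project's code.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a token stream; return (ttft_seconds, total_seconds, full_text).

    The first token is the first non-whitespace chunk. If the stream never
    produces one, ttft falls back to the total time.
    """
    start = clock()
    first = None
    parts = []
    for token in stream:
        if first is None and token.strip():
            first = clock() - start
        parts.append(token)
    total = clock() - start
    return (first if first is not None else total), total, "".join(parts)


def fake_llm_stream() -> Iterator[str]:
    """Stand-in for a real streaming response; the delays are invented."""
    for delay, token in [(0.05, "Use"), (0.01, " a"), (0.01, " heap.")]:
        time.sleep(delay)
        yield token


if __name__ == "__main__":
    ttft, total, text = time_to_first_token(fake_llm_stream())
    print(f"first token after {ttft * 1000:.0f} ms, full answer after {total * 1000:.0f} ms: {text!r}")
```

Passing the clock in as a parameter keeps the measurement testable without real sleeps.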
The main lesson: optimize time-to-first-useful-token, not total response time.
Hard part #3: dialog context, not the last sentence
Feed the model only the last sentence and the answers miss. A real question is often smeared across 3-4 turns. So the server keeps a sliding window of the dialog and sends a coherent, role-tagged context into the prompt. Separately, there is a hotkey-triggered screenshot analysis: code or a diagram on screen gives the model what speech doesn't.
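The sliding window could be sketched roughly like this. The `DialogWindow` class, the role names, the 8-turn default, and the rule of merging consecutive fragments from the same speaker are assumptions for illustration; the real server may segment and trim differently.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # e.g. "other" or "user"; role names are illustrative
    text: str


class DialogWindow:
    """Keeps the most recent turns and renders them as role-tagged context."""

    def __init__(self, max_turns: int = 8):
        # deque(maxlen=...) silently drops the oldest turn when full.
        self._turns = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        text = text.strip()
        if not text:
            return
        # Streaming STT tends to emit a phrase in pieces; glue consecutive
        # fragments from the same speaker into one turn.
        if self._turns and self._turns[-1].role == role:
            self._turns[-1].text += " " + text
        else:
            self._turns.append(Turn(role, text))

    def render(self) -> str:
        return "\n".join(f"{t.role}: {t.text}" for t in self._turns)


if __name__ == "__main__":
    window = DialogWindow()
    window.add("other", "So about the cache,")
    window.add("other", "how would you invalidate it?")
    window.add("user", "Per key or globally?")
    window.add("other", "Across regions, I mean.")
    print(window.render())
```

Counting turns is the simplest possible bound; a token-count limit would track the model's context budget more closely.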
Hard part #4: an overlay excluded from screen capture
The technically fun part. The hint window must be visible to the user but not show up in screen sharing or recordings. On macOS this is solved with the window level and an exclude-from-capture flag (sharingType); on Windows with window affinity (WDA_EXCLUDEFROMCAPTURE). The catch: behavior depends on how the conferencing app captures the screen (composited vs. raw), so it took an "OS × call platform" test matrix.
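For the Windows side, here is a hedged sketch of the affinity call via `ctypes`. `SetWindowDisplayAffinity` and `WDA_EXCLUDEFROMCAPTURE` are real Win32 names; the `exclude_from_capture` wrapper and its "return False off Windows" behavior are assumptions for illustration. The window handle must belong to the calling process, and the flag requires Windows 10 version 2004 or later. The macOS counterpart is not shown.

```python
import ctypes
import sys

WDA_NONE = 0x00000000
WDA_EXCLUDEFROMCAPTURE = 0x00000011  # Windows 10 version 2004 and later


def exclude_from_capture(hwnd: int) -> bool:
    """Ask Windows to leave this window out of screen captures.

    Returns True on success, False on failure or when not running on Windows.
    `hwnd` must be a top-level window owned by the current process.
    """
    if sys.platform != "win32":
        return False
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    # BOOL SetWindowDisplayAffinity(HWND hWnd, DWORD dwAffinity)
    user32.SetWindowDisplayAffinity.argtypes = [ctypes.c_void_p, ctypes.c_uint32]
    user32.SetWindowDisplayAffinity.restype = ctypes.c_int
    return bool(user32.SetWindowDisplayAffinity(hwnd, WDA_EXCLUDEFROMCAPTURE))
```

A successful call only means the OS accepted the flag. Whether a given conferencing app actually omits the window still has to be checked per capture path, which is what the test matrix is for.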
Hard part #5: privacy and trust
Once an app listens to calls, the immediate question is "what about the data?" The choices I landed on:
keys and prompts live only on the server, no secrets in the client;
a short-lived per-session token instead of a persistent one;
audio capture and screen analysis only on an explicit user action β no background "listening."
What I'd do differently
Lock down a latency metric earlier and run it in CI against real recordings (noise, accents) instead of "by ear."
Not underestimate platform permissions: at launch ~80% of tickets were "can't hear the other side" = a permissions problem, not a code one.
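The first point, a latency metric enforced in CI, might look something like the sketch below: collect time-to-first-token samples from replaying recordings, then fail the build when a high percentile exceeds the budget. The nearest-rank percentile, the 95th-percentile choice, the 1500 ms budget, and the sample values are all illustrative assumptions, not measurements from the project.

```python
import math
from typing import Sequence, Tuple


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def check_latency_budget(
    ttft_ms: Sequence[float], budget_ms: float, p: float = 95.0
) -> Tuple[bool, float]:
    """Return (passed, observed_percentile) for a batch of TTFT samples."""
    observed = percentile(ttft_ms, p)
    return observed <= budget_ms, observed


if __name__ == "__main__":
    # Invented numbers standing in for per-recording measurements.
    samples = [420, 380, 510, 470, 1900, 450, 400, 430, 390, 460]
    passed, p95 = check_latency_budget(samples, budget_ms=1500)
    print(f"p95 = {p95} ms, {'OK' if passed else 'over budget'}")
    raise SystemExit(0 if passed else 1)
```

Gating on a high percentile rather than the mean keeps a single noisy or accented recording that blows the budget from being averaged away.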
If you want to poke at the result, the project is Suflo (macOS & Windows): suflo.ru. Happy to go deep on real-time STT and system-audio capture in the comments.