Building AI-Powered Voice Transcription at Scale: Engineering Lessons

wpnews.pro

Eighteen months ago, we thought we were building a simple voice memo app.

We were wrong about the "simple" part.

At Vomo, what started as a tool to capture and transcribe voice notes evolved into a full voice-first productivity platform supporting 50+ languages, real-time streaming transcription, and a growing number of enterprise customers with strict latency and accuracy requirements. Along the way, we learned a lot — some of it the hard way.

This post covers the engineering decisions we made, the ones that hurt us, and what we'd do differently. If you're building anything in the audio/speech space, I hope this saves you some pain.

The initial insight was embarrassingly simple: people think faster than they type. Voice memos have existed for decades, but the experience of using them is terrible. You record something, and then it just... sits there. You either listen to the whole thing again or you forget it.

The opportunity was to make voice memos actually useful — not just stored audio, but captured thought that gets organized, summarized, and actionable automatically.

That meant transcription was table stakes. But transcription alone is boring. The real product is what happens to the text after: structured notes, action items, searchable archives, smart summaries, integrations with Notion and Slack and everything else knowledge workers already use.

We scoped the MVP in two weeks. That scope did not survive contact with reality.

The first question we faced: do we send audio to the server in chunks as the user speaks, or wait for them to finish and process the whole file?

We went with streaming from day one, and it's one of the decisions I'm most glad we made.

Real-time streaming means users see text appearing as they speak. The psychological difference is enormous — it feels like the tool is listening, not processing. Users with streaming transcription are significantly more likely to keep talking, which results in longer, more useful recordings.

The architecture:

Mobile/Web Client
    ↓ (WebSocket, 100ms audio chunks, Opus codec)
API Gateway (load balanced)
    ↓
Transcription Worker Pool
    ↓ (partial results every ~500ms)
Client (streaming text updates)
    ↓ (on recording stop)
Post-processing Pipeline (cleanup, structure, AI enrichment)
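
To give a sense of scale for the 100ms chunking in the diagram above, here is a minimal sketch of slicing raw PCM into chunks before it is handed to the encoder. The sample rate and sample width are assumptions (the post does not state them), and the function names are hypothetical. The Opus encoding step itself is left out.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumption: capture rate is not stated in the post
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM before Opus encoding
CHUNK_MS = 100           # chunk duration taken from the architecture diagram


def chunk_size_bytes(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
    chunk_ms: int = CHUNK_MS,
) -> int:
    """Number of raw PCM bytes that make up one chunk of audio."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    return samples_per_chunk * bytes_per_sample


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `pcm`; the last slice may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Under these assumptions one chunk is 3,200 bytes of PCM, so a second of audio becomes ten WebSocket messages before compression.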

Key decisions here:

We evaluated four options:

We ended up with a hybrid: Deepgram Nova-2 for real-time streaming (where latency matters most) and self-hosted Whisper large-v3 for post-processing uploaded files (where accuracy matters most and latency is acceptable).

The accuracy difference between these models matters less in clean conditions (all hit >95% on clear studio audio) and enormously in noisy conditions. Whisper large-v3 on a cafeteria recording still hits around 91%; the same recording on a mid-tier commercial ASR drops to 78-83%.

For our target user — people recording voice memos while commuting, walking, or between meetings — noise robustness was non-negotiable. That pushed us toward Whisper for the quality path even with the infrastructure overhead.

Our initial streaming implementation had a "first word latency" of about 1.8 seconds — the time from when a user starts speaking to when the first transcribed word appears on screen. Users found this uncomfortable. It felt like the tool wasn't keeping up.

We got this to 340ms through three changes:

1. Model warm-keeping: Transcription workers stay loaded with the model in memory. Cold-starting Whisper large-v3 takes 3–8 seconds depending on hardware. Warm requests take milliseconds. We keep a pool of warm workers sized to handle 95th-percentile concurrency without cold starts.

2. Partial transcription streaming: Instead of waiting for a complete sentence, we emit partial results every 500ms during active speech. These get replaced as context improves. Users see text "solidifying" in real time: an initial rough transcription that gets corrected as more audio context arrives.

3. Edge pre-processing: We run a lightweight VAD (Voice Activity Detection) model on the client before streaming. Silence periods don't get sent. This reduces the amount of audio the server processes and eliminates the confusion caused by long silences generating incomplete sentence segments.
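
The "solidifying" behavior from the second change can be sketched on the client side as a small buffer that keeps committed text separate from the current tentative hypothesis. This is only an illustration: the split into partial and final messages, and the class and method names, are assumptions rather than the actual wire protocol.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LiveTranscript:
    """Committed segments plus one tentative segment that may still change."""

    finalized: List[str] = field(default_factory=list)
    partial: str = ""

    def on_partial(self, text: str) -> None:
        # A newer hypothesis replaces the previous tentative text outright.
        self.partial = text

    def on_final(self, text: str) -> None:
        # A final result is committed and the tentative text is cleared.
        self.finalized.append(text)
        self.partial = ""

    def render(self) -> str:
        # What the UI shows: committed text followed by the tentative tail.
        parts = self.finalized + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

A UI built this way can style the tentative tail differently (for example, greyed out) so that corrections read as refinement rather than flicker.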

Our first major traffic spike came after a mention in a tech newsletter. We went from ~80 concurrent transcription sessions to ~1,400 in about 25 minutes. Our worker pool maxed out. New sessions queued. Queue depth hit 600+.

The problem was that our auto-scaling was too slow. We were using cloud VM auto-scaling with a 3–5 minute spin-up time. That's fine for gradual traffic increases. It's useless for spike traffic.

The fix was two-pronged:

Auto-scaling now responds to queue depth rather than just CPU utilization. Queue depth above threshold triggers immediate scale-out; it doesn't wait for CPU to saturate.
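
A queue-depth trigger of this kind might look like the sketch below. Every parameter name and number here is hypothetical. The post only says that queue depth above a threshold triggers immediate scale-out.

```python
import math


def desired_workers(
    current_workers: int,
    queue_depth: int,
    sessions_per_worker: int,
    depth_threshold: int,
    max_workers: int,
) -> int:
    """Return the worker count to request, based on queue depth alone."""
    if queue_depth <= depth_threshold:
        # Below the threshold: leave the pool as it is.
        return current_workers
    # Above the threshold: add enough workers to drain the whole queue,
    # without waiting for CPU utilization to rise first.
    extra = math.ceil(queue_depth / sessions_per_worker)
    return min(max_workers, current_workers + extra)
```

For instance, with a queue of 600 sessions and an assumed 4 sessions per worker, the function asks for 150 additional workers in one step, capped at `max_workers`.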

Supporting 50+ languages meant we needed Whisper large-v3, which handles multilingual transcription. The challenge: language detection requires processing the first 30 seconds of audio.

For short recordings under 30 seconds, we were initially guessing the language wrong ~12% of the time. A voice memo recorded in Japanese would start processing as English because we didn't have enough audio to be confident.

Our solution: language detection from the first 3 seconds using a lightweight language ID model (fastText language identification), followed by Whisper processing with the detected language as a forced parameter. This reduced language misdetection to under 2% and eliminated the accuracy penalty from wrong-language processing.
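
The detect-then-force flow can be sketched with the detector and transcriber passed in as plain callables, so that no particular library API is implied. The sample rate, the confidence threshold, and the fallback to automatic detection are assumptions added for illustration. The post only describes detecting from the first 3 seconds and forcing the result.

```python
from typing import Callable, Optional, Sequence, Tuple

Samples = Sequence[float]
Detector = Callable[[Samples], Tuple[str, float]]     # -> (language, confidence)
Transcriber = Callable[[Samples, Optional[str]], str]  # language=None means auto

DETECT_SECONDS = 3       # from the post
SAMPLE_RATE_HZ = 16_000  # assumption: not stated in the post


def transcribe_with_forced_language(
    samples: Samples,
    detect: Detector,
    transcribe: Transcriber,
    min_confidence: float = 0.8,  # hypothetical threshold
) -> str:
    """Detect the language from the head of the audio, then force it."""
    head = samples[: DETECT_SECONDS * SAMPLE_RATE_HZ]
    language, confidence = detect(head)
    # Assumption: when the detector is unsure, fall back to the
    # transcriber's own detection instead of forcing a weak guess.
    forced = language if confidence >= min_confidence else None
    return transcribe(samples, forced)
```

With the open-source `whisper` package, the transcriber callable could wrap `model.transcribe(audio, language=...)`, which accepts a language code as a decoding option.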

We knew Whisper was good at noise robustness. What we didn't anticipate was the diversity of "noise" in production.

Our test suite covered café noise, street traffic, and office chatter. Production audio included: treadmill recordings, car engine noise, HVAC hum, keyboard clatter, music from a nearby speaker, and — most challenging — Bluetooth headsets with their own compression artifacts on top of background noise.

Bluetooth + background noise was particularly brutal. Word error rate (WER) on some samples jumped from our expected 9% to 22-28%.
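
For readers who have not computed it before, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python version for illustration. It skips the text normalization (casing, punctuation) that a production evaluation would normally apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution in a four-word reference gives 0.25.
print(word_error_rate("the quick brown fox", "the quick brown box"))
```

Because insertions count as errors, WER can exceed 100% on badly degraded audio.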

We added an optional pre-processing step using the DeepFilterNet noise suppression model before Whisper sees the audio. On heavily degraded audio, this consistently reduced WER by 4–8 percentage points. On clean audio, it has essentially no effect.

The tradeoff: DeepFilterNet adds ~150ms of processing latency. We enable it adaptively — only when the input audio fails a quick SNR check.
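
One way such a gate could work is a crude energy-based estimate: treat the quietest frames as the noise floor, the loudest as speech, and compare the two. This is a stand-in technique for illustration, not necessarily the check used in production. The frame length, the 10% percentile split, and the 15 dB threshold are all assumptions.

```python
import math
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def estimate_snr_db(samples: Sequence[float], frame_len: int = 160) -> float:
    """Rough SNR: loudest 10% of frames versus quietest 10% of frames."""
    energies = sorted(frame_energies(samples, frame_len))
    if not energies:
        raise ValueError("need at least one full frame of audio")
    k = max(1, len(energies) // 10)
    noise = sum(energies[:k]) / k    # quietest frames approximate the noise floor
    signal = sum(energies[-k:]) / k  # loudest frames approximate speech
    eps = 1e-12                      # avoids division by zero on digital silence
    return 10 * math.log10((signal + eps) / (noise + eps))


def needs_denoising(samples: Sequence[float], snr_threshold_db: float = 15.0) -> bool:
    """True when the estimate falls below the (hypothetical) threshold."""
    return estimate_snr_db(samples) < snr_threshold_db
```

The estimate is cheap enough to run on the first second or so of audio, which keeps the extra suppression latency off recordings that do not need it.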

Six months after the MVP:

The piece I'm most proud of is the post-processing pipeline. Getting transcription right is a solved problem if you're willing to pay for infrastructure. Getting the intelligence layer right — the summarization that's actually useful, the action items that aren't garbage, the structure that fits how knowledge workers think — that's the hard problem.

We ended up fine-tuning a smaller Claude model on our own structured outputs, which significantly improved the quality of AI-generated notes compared to zero-shot prompting. The training data was annotations from our own team on hundreds of real voice memo transcripts.

What worked:

What hurt:

Open questions we're still working on:

The platform we've built treats voice as input. The next frontier for us is voice as interface — where you can query your own recordings, ask questions about what was said in past meetings, and surface relevant notes through voice commands.

This requires evolving from a transcription + structuring system to an actual memory system, with semantic search, long-term context, and personalization. The transcription and AI layer we built is the foundation. The next layer is considerably more interesting.

If you're working on related problems — audio pipelines, speech AI, or voice-first products — I'm happy to trade notes. The engineering community in this space is still surprisingly small and surprisingly collegial.

Stack notes: Python workers (FastAPI), WebSocket via Redis pub/sub, Whisper large-v3 on A10G GPUs, Deepgram Nova-2 for streaming, DeepFilterNet for noise suppression, PostgreSQL + pgvector for transcript storage and search.

Source: dev.to (original article)
