{"slug": "how-together-ai-built-the-worlds-fastest-speech-to-text-stack", "title": "How Together AI built the world’s fastest speech-to-text stack", "summary": "Together AI built the world’s fastest speech-to-text stack, enabling NVIDIA’s Parakeet-TDT 0.6B v3 model to transcribe roughly 20 hours of speech in under 10 seconds. The company achieved this by optimizing the entire data path, including TensorRT profiles for real audio shapes and removing the CPU from the decoder loop, addressing the unique systems challenges of serving smaller speech-to-text models. The result is the lowest-latency ASR service ranked by Artificial Analysis, serving both offline and streaming transcription regimes.", "body_md": "**Modality matters**\n\nA 1M-token text prompt can fit the entire Harry Potter series and still only weigh around 5 MB. That scale sounds enormous, but the input itself is compact. Text also arrives almost ready for inference: tokenize it, batch it, and move it through the model.\n\nAudio changes the shape of the problem. The same Harry Potter corpus as audiobooks is 5 to 10 GB, roughly three orders of magnitude larger than the text. Before any of it reaches the GPU, the server has to decode the container, resample, filter noise, run VAD, segment speech, and compute audio features.\n\nThe model side flips too. LLMs these days have hundreds of billions or trillions of parameters, so serving work naturally concentrates inside the GPU: quantization, KV cache, attention kernels, batching, and parallelism. Speech-to-text models are much smaller, often in the hundreds of millions to low billions of parameters, so the surrounding data path matters much more.\n\nThat makes ASR serving a full-path systems problem spanning GPU execution, CPU preprocessing, memory movement, transport, connection scheduling, and runtime behavior. 
The same stack also has to serve two different regimes: offline transcription, where throughput matters most, and streaming transcription, where latency and jitter dominate.\n\nTogether’s ASR stack serves the two lowest-latency speech-to-text models ranked by Artificial Analysis: NVIDIA’s Parakeet-TDT 0.6B v3 and OpenAI’s Whisper Large v3. The faster of the two, NVIDIA Parakeet-TDT 0.6B v3, can transcribe roughly 20 hours of speech, about the runtime of the Harry Potter film franchise, in under 10 seconds.\n\nThe rest of this post breaks down the production changes behind that result: TensorRT profiles for real audio shapes, GPU-side decoder control flow, lower-copy CPU paths, evented streaming I/O, and runtime GC control.\n\n**Compile the encoder for real audio shapes**\n\nParakeet uses an encoder-decoder architecture, and roughly 95% of its weights sit in the encoder. The encoder takes a variable-length speech segment and produces acoustic frames for the decoder, which made it the first place to optimize.\n\nAudio inputs span a wide range of lengths, from a 200 ms streaming packet to 30 seconds of uninterrupted speech. A kernel plan tuned for one input shape can be substantially slower at another, so the engine needs to know the shape distribution it will see at compile time.\n\nBefore TensorRT, we were already using an optimized PyTorch path with `torch.compile` and CUDA graphs, tuned across the same shape profiles. That gave us a strong baseline: profile-aware execution without leaving the PyTorch stack.\n\nTensorRT gave us a faster encoder path for production. It builds an optimized execution plan ahead of time, fusing kernels where possible, tuning memory layouts, and benchmarking kernel variants for the shape ranges we expect to serve.\n\nThe important detail is profile tuning. A single engine tuned only for the largest input shape forces shorter audio segments into a padded path, which is especially costly for streaming chunks and short utterances. 
A multi-profile TensorRT engine lets us keep one copy of the encoder weights in memory while selecting the right optimization profile per request.\n\nThe memory savings were modest, from roughly 6 GB to 5 GB. The larger win was avoiding bad shape matches and moving from optimized PyTorch to TensorRT for tuned profiles. In the small-input regime, profile-aware TensorRT can be several times faster than sending those requests through a large padded profile.\n\nWith the encoder optimized, the decoder loop became the next bottleneck.\n\n**Remove the CPU from the decoder loop**\n\nParakeet’s decoder iterates over the encoder’s acoustic frames and emits either a token or a `BLANK` for frames that do not advance the transcript. The code is essentially:\n\n```python\nstate = init()\nfor frame in encoder_output:\n    token = predict(frame, state)\n    if token != BLANK:\n        emit(token)\n        state = update(state, token)\n```\n\nWhen profiling, we found that `predict` and `update` were both fast. The per-iteration GPU work was measured in microseconds.\n\nThe expensive line was the branch:\n\n`if token != BLANK:`\n\nThat branch requires the CPU to read the token back from GPU memory to decide which path to take. This host sync prevents the decode loop from being captured as a single CUDA graph and forces every iteration to round-trip through Python. The GPU does a few microseconds of work, waits for the CPU, launches the next kernel, and repeats that pattern thousands of times per request.\n\nConditional CUDA graph nodes moved that branch onto the GPU. A small device-side kernel evaluates the condition and tells the CUDA runtime whether to enter the token-emission and state-update subgraph. 
The branch resolves without leaving the GPU, so the entire decoder loop, counter, condition, emit, and state update, can be captured and launched as one CUDA graph.\n\nThe CPU leaves the decoder’s inner loop, and the result is a 2 to 3x faster decoder.\n\n**Stop copying audio bytes**\n\nOnce the encoder and decoder were running well, the remaining latency came from the CPU path around the model. That is where most ASR code we’ve audited spends its latency budget: redundant copies, unnecessary process hops on the hot path, and single-threaded functions that would benefit from higher parallelism.\n\nThe first lever was collapsing unnecessary process boundaries.\n\nAudio preprocessing, whether file decoding, resampling, voice activity detection (VAD), feature extraction, or chunk handling, is mostly I/O or native C/C++ work that releases the Python Global Interpreter Lock (GIL). A typical microservice architecture splits preprocessing across three or four separate processes, paying for isolation the workload does not need. Collapsing most of that work into fewer processes removes kernel copies and serialization/deserialization passes that can cost hundreds of milliseconds on large files.\n\nWhen inter-process communication is genuinely needed, common options like ZeroMQ also carry meaningful overhead. In our workload, a simple custom protocol over persistent Unix domain sockets carrying raw audio bytes performs best under high concurrency because it keeps framing minimal and avoids repeated connection setup.\n\nFor large files, sockets still impose two copies: sender userspace to kernel buffer, then kernel buffer to receiver userspace. To avoid that path, we use shared memory. With shared memory, both processes map the same physical region, so data written by the producer is visible to the consumer without a kernel round trip. 
That gives us a zero-copy data path.\n\nThe complexity cost is real, so shared memory is worth reaching for only when the data volume justifies it.\n\n**Use evented I/O for streaming**\n\nStreaming ASR adds another problem: connection lifecycle.\n\nOur first streaming implementation used one thread per connection. When hundreds of streams sent chunks at once, hundreds of threads woke up together, GIL contention exploded, and tail latency spiked.\n\nWe moved to one thread blocked on `epoll`.\n\n`epoll` lets one thread register thousands of connections and ask the kernel in a single syscall: “wake me up when any of these has data.” When messages arrive, the kernel returns the full ready set, and that thread processes the active sockets before going back to sleep.\n\nSame workload, far less scheduler pressure. For streaming ASR, that predictability matters because delayed partial transcripts can make a voice system feel slow even when average latency looks fine.\n\n**Freeze startup state to remove GC tail latency**\n\nWe almost missed this one.\n\nUnder load for streaming workflows, p50 and p90 latency looked healthy, but p95 would periodically spike by about 200 ms. The logs showed small queue depth and normal GPU times, but CPU functions that normally ran in under 5 ms suddenly took over 100 ms.\n\nSomething in the background was stealing time from the request loop.\n\nProfiling pointed at Python’s garbage collector (GC). Python uses reference counting for most memory management, with a cycle-detecting collector to catch reference cycles. That collector runs in generations. The oldest generation contains long-lived objects, and full collections can walk a large object graph.\n\nWe had preallocated a large pool of buffers, model state, and lookup tables at startup specifically to avoid allocation latency at steady state. Those long-lived objects landed in the oldest generation, so full GC passes walked hundreds of thousands of references. 
That was the 200 ms stall.\n\nThe fix was one line after startup preallocation:\n\n`gc.freeze()`\n\n`gc.freeze()` tells Python to exclude the preallocated state from future GC scans, so normal request-scoped objects still get collected while the giant initial state is left alone.\n\nThe p95 spikes disappeared, and p50 improved because the system could sustain smoother traffic patterns.\n\nThe lesson was to keep profiling beyond the model. GPU time, queue depth, and model execution all looked normal; the latency spike lived in the Python runtime.\n\n**Voice latency is an end-to-end systems problem**\n\nVoice agents usually run as a cascade: ASR produces a transcript, an LLM generates the response, and TTS produces audio. ASR is the first stage in that path, so its latency and jitter set the earliest bound on user-visible response time.\n\nThe optimizations above target different parts of that path. TensorRT multi-profile engines tune encoder execution for real audio shapes. Conditional CUDA graphs remove CPU round trips from the decoder loop. Persistent Unix domain sockets, shared memory, and `epoll` reduce CPU-path overhead. `gc.freeze()` removes a runtime-level p95 failure mode.\n\nThe same constraint applies to the rest of the stack: every stage has to control both median latency and tail latency across model execution, preprocessing, transport, scheduling, and runtime behavior.\n\n[NVIDIA Parakeet-TDT 0.6B v3](https://www.together.ai/models/parakeet-tdt-0-6b-v3) and [OpenAI Whisper Large v3](https://www.together.ai/models/openai-whisper-large-v3) are available on Together. Reach out if you’re scaling voice AI in production.\n\n*Parakeet v3 is the successor to v2, which was an English-only model that set the pace on the Hugging Face Open ASR Leaderboard for single-language throughput. 
v3 extends that foundation significantly: it expands language support from English to 25 European languages, adds automatic language detection without requiring a language prompt, and was trained on 1.7 million hours of audio data, including NVIDIA's Granary multilingual corpus.*", "url": "https://wpnews.pro/news/how-together-ai-built-the-worlds-fastest-speech-to-text-stack", "canonical_source": "https://www.together.ai/blog/how-together-ai-built-the-worlds-fastest-speech-to-text-stack", "published_at": "2026-05-29 00:00:00+00:00", "updated_at": "2026-05-29 22:15:06.429281+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-products"], "entities": ["Together AI", "NVIDIA", "OpenAI", "Parakeet-TDT", "Whisper Large v3", "Artificial Analysis"], "alternates": {"html": "https://wpnews.pro/news/how-together-ai-built-the-worlds-fastest-speech-to-text-stack", "markdown": "https://wpnews.pro/news/how-together-ai-built-the-worlds-fastest-speech-to-text-stack.md", "text": "https://wpnews.pro/news/how-together-ai-built-the-worlds-fastest-speech-to-text-stack.txt", "jsonld": "https://wpnews.pro/news/how-together-ai-built-the-worlds-fastest-speech-to-text-stack.jsonld"}}