Transcribe.cpp – ggml speech-to-text inference engine

wpnews.pro

C/C++ speech-to-text inference library. Runs diverse STT model families via GGUF models on the ggml runtime, with Metal, Vulkan, and CUDA backends for fast GPU inference plus a tinyBLAS-accelerated CPU path.

16 model families and 60+ variants, streaming and batch. Every model we publish under handy-computer is numerically verified and WER-tested against its reference implementation

Supported models:

Family	Variants	Docs
Parakeet	10 variants: TDT, RNN-T, CTC, TDT+CTC (110M–1.1B)

canary-1b

, canary-1b-v2

, canary-1b-flash

, canary-180m-flash

docs/models/canary.mdcanary-qwen-2.5b

(FastConformer + Qwen3-1.7B SALM)docs/models/canary-qwen-2.5b.mdtiny

through large-v3-turbo

, plus .en

siblings)docs/models/whisper.mdgigaam-v3-{e2e-rnnt,e2e-ctc,rnnt,ctc}

docs/models/gigaam.mdmoonshine-tiny

, moonshine-base

docs/models/moonshine.mdmoonshine-streaming-{tiny,small,medium}

docs/models/moonshine-streaming.mdqwen3-asr-0.6b

, qwen3-asr-1.7b

docs/models/qwen3-asr.mdcohere-transcribe-03-2026

docs/models/cohere-transcribe-03-2026.mdsensevoice-small

docs/models/sensevoice-small.mdfun-asr-nano-2512

, fun-asr-mlt-nano-2512

docs/models/fun-asr-nano.mdnemotron-speech-streaming-en-0.6b

docs/models/nemotron-speech-streaming-en-0.6b.mdnemotron-3.5-asr-streaming-0.6b

(multilingual, 40 locales)docs/models/nemotron-3.5-asr-streaming-0.6b.mdgranite-4.0-1b-speech

, granite-speech-4.1-2b{,-plus,-nar}

docs/models/granite-speech.mdvoxtral-mini-3b-2507

, voxtral-small-24b-2507

(audio-LLM; transcription + translation)docs/models/voxtral.mdvoxtral-mini-4b-realtime-2602

(streaming audio-LLM)docs/models/voxtral-realtime.mdmedasr

(Conformer + CTC, English medical-dictation, gated)docs/models/medasr.mdPer-variant model cards live under docs/models/.

cmake -B build
cmake --build build

Metal is enabled automatically on Apple Silicon. For Vulkan (Linux/Windows):

sudo apt install build-essential cmake libvulkan-dev glslc libopenblas-dev

cmake -B build -DTRANSCRIBE_VULKAN=ON
cmake --build build

For CUDA (Linux + NVIDIA GPU):

cmake -B build -DTRANSCRIBE_CUDA=ON
cmake --build build

libopenblas-dev

is optional but recommended. It accelerates the host-side decoder ~10-15x. Without it the build falls back to a scalar path automatically.

tinyBLAS (Justine Tunney's llamafile_sgemm

kernels) is on by default.

To build the quantization tool:

cmake -B build -DTRANSCRIBE_BUILD_TOOLS=ON
cmake --build build

Pre-built GGUFs for all supported models are hosted on Hugging Face under handy-computer. Each per-model doc (linked in the table above) includes direct download links for every quant. Convert from source only if you need a different dtype or a checkpoint that isn't pre-built.

The converter loads directly from NVIDIA's NeMo checkpoints via ASRModel.from_pretrained

. Requires uv; the parakeet env ships NeMo and its deps.

uv run --project scripts/envs/parakeet \
  scripts/convert-parakeet.py nvidia/parakeet-tdt-0.6b-v2

This writes models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf

following the llama.cpp-style <slug>-<QUANT>.gguf

naming convention. Pass a local .nemo

path or extracted directory for offline conversion.

The transcribe-quantize

tool produces smaller models from the reference GGUF. Available presets: F16

, Q8_0

, Q6_K

, Q5_K_M

, Q4_K_M

.

build/bin/transcribe-quantize \
  models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf \
  models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-Q4_K_M.gguf \
  --quant Q4_K_M
build/bin/transcribe-cli -m models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf samples/jfk.wav

Input must be 16 kHz mono WAV. Use ffmpeg

or sox

to convert other formats:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

Official bindings wrap the C API for other languages:

Language	Path
Python

bindings/typescript bindings/rust/transcribe-cpp bindings/swiftSee docs/bindings.md for how the bindings are generated and kept in sync with the header.

cd build && ctest

Some tests require a real model file. Enable them with:

cmake -B build -DTRANSCRIBE_BUILD_REAL_MODEL_TESTS=ON
cmake --build build
TRANSCRIBE_PARAKEET_GGUF=path/to/model.gguf ctest --test-dir build

For the model-family smoke-test, numerical-validation, and benchmark pattern expected of new ports, see docs/model-family-testing.md.

A huge thanks to Mozilla AI and their BiR Program. This whole project started out as an idea, not even an implementation direction. It was a research project in how to accelerate transcription models across all platforms as easily as possible. The BiR program and Davide helped support the research, and my eventual direction to choose to implement and inference engine backed by ggml. And also experimenting with automated model porting using agentic programming tools.

Hugging Face provided the project extra storage so we can host all of the models which we support. We want to provide canonical references for as many models as reasonably possible, the support from Hugging Face helps to enable this.

Modal helped to provide GPU credits so the project can test and validate the projects implementations match the transformers or nemo reference source. This is critical to ensuring that we have as close to a production grade inference engine that works everywhere. We believe it is critical to have accurate transcriptions and the only way to ensure this is through long running WER checks which Modal helps to provide. Every model published under handy-computer on hugggingface has had the WER checked, so you can trust the results. And if there are any regressions, you bet we will be fixing them.

Blacksmith provides many of the CI runners for this project. That helps to keep transcribe.cpp well tested and ensure our releases are as smooth as possible. The CI is quick and a drop in replacement for the standard Github Actions runners. I ran into limits very fast with them and super happy upon reaching out to Blacksmith they were able to provide runners for the project.

include/transcribe.h       Public C API (single header)
src/                       Library internals (C++17)
src/arch/parakeet/         Parakeet family implementation
src/arch/cohere/           Cohere Transcribe family implementation
examples/cli/              CLI binary source
tools/transcribe-quantize/ Quantization tool source
bindings/                  Python, TypeScript, Rust, and Swift bindings
docs/                      Porting and validation guidance
scripts/                   Python converter + test tooling
ggml/                      Vendored ggml (see ggml/UPSTREAM for pinned SHA)
src/third_party/miniz/     Vendored miniz deflate codec (see its UPSTREAM file)
samples/                   Test audio files
tests/                     Unit and smoke tests

transcribe.cpp is MIT-licensed. See LICENSE for details. Vendored third-party components (ggml, miniz — both MIT) are attributed in THIRD-PARTY-LICENSES.md.

source & further reading

github.com — original article

Transcribe.cpp – ggml speech-to-text inference engine

Run your AI side-project on zahid.host