{"slug": "transcribe-cpp-ggml-speech-to-text-inference-engine", "title": "Transcribe.cpp – ggml speech-to-text inference engine", "summary": "Transcribe.cpp, a C/C++ speech-to-text inference library, has been released supporting 16 model families and 60+ variants via GGUF models on the ggml runtime. It offers Metal, Vulkan, and CUDA backends for GPU acceleration and a tinyBLAS-accelerated CPU path, with all models numerically verified and WER-tested.", "body_md": "C/C++ speech-to-text inference library. Runs diverse STT model families via [GGUF](https://github.com/ggerganov/gguf) models on the [ggml](https://github.com/ggml-org/ggml) runtime, with Metal, Vulkan, and CUDA backends for fast GPU inference plus a tinyBLAS-accelerated CPU path.\n\n16 model families and 60+ variants, streaming and batch. Every model we publish under [handy-computer](https://huggingface.co/handy-computer) is numerically verified and WER-tested against its reference implementation.\n\n**Supported models:**\n\n| Family | Variants | Docs |\n|---|---|---|\n| Parakeet | 10 variants: TDT, RNN-T, CTC, TDT+CTC (110M–1.1B) | |\n| | `canary-1b`, `canary-1b-v2`, `canary-1b-flash`, `canary-180m-flash` | [docs/models/canary.md](/handy-computer/transcribe.cpp/blob/main/docs/models/canary.md) |\n| | `canary-qwen-2.5b` (FastConformer + Qwen3-1.7B SALM) | [docs/models/canary-qwen-2.5b.md](/handy-computer/transcribe.cpp/blob/main/docs/models/canary-qwen-2.5b.md) |\n| | `tiny` through `large-v3-turbo`, plus `.en` siblings | [docs/models/whisper.md](/handy-computer/transcribe.cpp/blob/main/docs/models/whisper.md) |\n| | `gigaam-v3-{e2e-rnnt,e2e-ctc,rnnt,ctc}` | [docs/models/gigaam.md](/handy-computer/transcribe.cpp/blob/main/docs/models/gigaam.md) |\n| | `moonshine-tiny`, `moonshine-base` | [docs/models/moonshine.md](/handy-computer/transcribe.cpp/blob/main/docs/models/moonshine.md) |\n| | `moonshine-streaming-{tiny,small,medium}` | [docs/models/moonshine-streaming.md](/handy-computer/transcribe.cpp/blob/main/docs/models/moonshine-streaming.md) |\n| | `qwen3-asr-0.6b`, `qwen3-asr-1.7b` | [docs/models/qwen3-asr.md](/handy-computer/transcribe.cpp/blob/main/docs/models/qwen3-asr.md) |\n| | `cohere-transcribe-03-2026` | [docs/models/cohere-transcribe-03-2026.md](/handy-computer/transcribe.cpp/blob/main/docs/models/cohere-transcribe-03-2026.md) |\n| | `sensevoice-small` | [docs/models/sensevoice-small.md](/handy-computer/transcribe.cpp/blob/main/docs/models/sensevoice-small.md) |\n| | `fun-asr-nano-2512`, `fun-asr-mlt-nano-2512` | [docs/models/fun-asr-nano.md](/handy-computer/transcribe.cpp/blob/main/docs/models/fun-asr-nano.md) |\n| | `nemotron-speech-streaming-en-0.6b` | [docs/models/nemotron-speech-streaming-en-0.6b.md](/handy-computer/transcribe.cpp/blob/main/docs/models/nemotron-speech-streaming-en-0.6b.md) |\n| | `nemotron-3.5-asr-streaming-0.6b` (multilingual, 40 locales) | [docs/models/nemotron-3.5-asr-streaming-0.6b.md](/handy-computer/transcribe.cpp/blob/main/docs/models/nemotron-3.5-asr-streaming-0.6b.md) |\n| | `granite-4.0-1b-speech`, `granite-speech-4.1-2b{,-plus,-nar}` | [docs/models/granite-speech.md](/handy-computer/transcribe.cpp/blob/main/docs/models/granite-speech.md) |\n| | `voxtral-mini-3b-2507`, `voxtral-small-24b-2507` (audio-LLM; transcription + translation) | [docs/models/voxtral.md](/handy-computer/transcribe.cpp/blob/main/docs/models/voxtral.md) |\n| | `voxtral-mini-4b-realtime-2602` (streaming audio-LLM) | [docs/models/voxtral-realtime.md](/handy-computer/transcribe.cpp/blob/main/docs/models/voxtral-realtime.md) |\n| | `medasr` (Conformer + CTC, English medical-dictation, gated) | [docs/models/medasr.md](/handy-computer/transcribe.cpp/blob/main/docs/models/medasr.md) |\n\nPer-variant model cards live under [docs/models/](/handy-computer/transcribe.cpp/blob/main/docs/models).\n\n```\ncmake -B build\ncmake --build build\n```\n\nMetal is enabled automatically on Apple Silicon. For Vulkan (Linux/Windows):\n\n```\n# Ubuntu/Debian\nsudo apt install build-essential cmake libvulkan-dev glslc libopenblas-dev\n\ncmake -B build -DTRANSCRIBE_VULKAN=ON\ncmake --build build\n```\n\nFor CUDA (Linux + NVIDIA GPU):\n\n```\n# requires the CUDA toolkit (nvcc) on PATH\ncmake -B build -DTRANSCRIBE_CUDA=ON\ncmake --build build\n```\n\n`libopenblas-dev` is optional but recommended. It accelerates the host-side decoder ~10-15x. Without it the build falls back to a scalar path automatically.\n\ntinyBLAS (Justine Tunney's `llamafile_sgemm` kernels) is on by default.\n\nTo build the quantization tool:\n\n```\ncmake -B build -DTRANSCRIBE_BUILD_TOOLS=ON\ncmake --build build\n```\n\nPre-built GGUFs for all supported models are hosted on Hugging Face under\n[handy-computer](https://huggingface.co/handy-computer). Each per-model doc\n(linked in the table above) includes direct download links for every quant.\nConvert from source only if you need a different dtype or a checkpoint that\nisn't pre-built.\n\nThe converter loads directly from NVIDIA's NeMo checkpoints via\n`ASRModel.from_pretrained`. Requires [uv](https://docs.astral.sh/uv/);\nthe parakeet env ships NeMo and its deps.\n\n```\nuv run --project scripts/envs/parakeet \\\n  scripts/convert-parakeet.py nvidia/parakeet-tdt-0.6b-v2\n```\n\nThis writes `models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf` following\nthe llama.cpp-style `<slug>-<QUANT>.gguf` naming convention. Pass a local\n`.nemo` path or extracted directory for offline conversion.\n\nThe `transcribe-quantize` tool produces smaller models from the\nreference GGUF. 
Available presets: `F16`, `Q8_0`, `Q6_K`, `Q5_K_M`,\n`Q4_K_M`.\n\n```\nbuild/bin/transcribe-quantize \\\n  models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf \\\n  models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-Q4_K_M.gguf \\\n  --quant Q4_K_M\nbuild/bin/transcribe-cli -m models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf samples/jfk.wav\n```\n\nInput must be 16 kHz mono WAV. Use `ffmpeg` or `sox` to convert other formats:\n\n```\nffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav\n```\n\nOfficial bindings wrap the C API for other languages:\n\n| Language | Path |\n|---|---|\n| Python | |\n| TypeScript | [bindings/typescript](/handy-computer/transcribe.cpp/blob/main/bindings/typescript) |\n| Rust | [bindings/rust/transcribe-cpp](/handy-computer/transcribe.cpp/blob/main/bindings/rust/transcribe-cpp) |\n| Swift | [bindings/swift](/handy-computer/transcribe.cpp/blob/main/bindings/swift) |\n\nSee [docs/bindings.md](/handy-computer/transcribe.cpp/blob/main/docs/bindings.md) for how the bindings are generated\nand kept in sync with the header.\n\n```\ncd build && ctest\n```\n\nSome tests require a real model file. Enable them with:\n\n```\ncmake -B build -DTRANSCRIBE_BUILD_REAL_MODEL_TESTS=ON\ncmake --build build\nTRANSCRIBE_PARAKEET_GGUF=path/to/model.gguf ctest --test-dir build\n```\n\nFor the model-family smoke-test, numerical-validation, and benchmark\npattern expected of new ports, see\n[docs/model-family-testing.md](/handy-computer/transcribe.cpp/blob/main/docs/model-family-testing.md).\n\nA huge thanks to [Mozilla AI](https://www.mozilla.ai/) and their [BiR Program](https://www.mozilla.ai/company/bir).\nThis whole project started out as an idea, not even an implementation direction. It was a research project on how\nto accelerate transcription models across all platforms as easily as possible. The BiR program and Davide helped\nsupport the research and my eventual decision to implement an inference engine backed by ggml. 
They also\nsupported experiments with automated model porting using agentic programming tools.\n\n[Hugging Face](https://huggingface.co/) provided the project extra storage so we can host all of the models\nwhich we support. We want to provide canonical references for as many models as reasonably possible, and\nthe support from Hugging Face helps to enable this.\n\n[Modal](https://modal.com/) helped by providing GPU credits so the project can test and validate that its\nimplementations match the Transformers or NeMo reference sources. This is critical to getting\nas close as possible to a production-grade inference engine that works everywhere. We believe accurate\ntranscriptions are critical, and the only way to ensure them is through long-running WER checks, which Modal\nhelps to provide. Every model published under [handy-computer](https://huggingface.co/handy-computer)\non Hugging Face has had its WER checked, so you can trust the results. And if there are any regressions, you\nbet we will be fixing them.\n\n[Blacksmith](https://www.blacksmith.sh/) provides many of the CI runners for this project. That helps to keep\ntranscribe.cpp well tested and ensures our releases are as smooth as possible. The CI is quick and a drop-in\nreplacement for the standard GitHub Actions runners. 
I ran into limits with them very fast, and I'm super happy that,\nwhen I reached out, Blacksmith was able to provide runners for the project.\n\n```\ninclude/transcribe.h       Public C API (single header)\nsrc/                       Library internals (C++17)\nsrc/arch/parakeet/         Parakeet family implementation\nsrc/arch/cohere/           Cohere Transcribe family implementation\nexamples/cli/              CLI binary source\ntools/transcribe-quantize/ Quantization tool source\nbindings/                  Python, TypeScript, Rust, and Swift bindings\ndocs/                      Porting and validation guidance\nscripts/                   Python converter + test tooling\nggml/                      Vendored ggml (see ggml/UPSTREAM for pinned SHA)\nsrc/third_party/miniz/     Vendored miniz deflate codec (see its UPSTREAM file)\nsamples/                   Test audio files\ntests/                     Unit and smoke tests\n```\n\ntranscribe.cpp is MIT-licensed. See [LICENSE](/handy-computer/transcribe.cpp/blob/main/LICENSE) for details. 
Vendored\nthird-party components (ggml, miniz — both MIT) are attributed in\n[THIRD-PARTY-LICENSES.md](/handy-computer/transcribe.cpp/blob/main/THIRD-PARTY-LICENSES.md).", "url": "https://wpnews.pro/news/transcribe-cpp-ggml-speech-to-text-inference-engine", "canonical_source": "https://github.com/handy-computer/transcribe.cpp", "published_at": "2026-07-01 10:13:40+00:00", "updated_at": "2026-07-01 10:50:44.368592+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-tools", "ai-infrastructure", "natural-language-processing"], "entities": ["ggml", "GGUF", "Metal", "Vulkan", "CUDA", "tinyBLAS", "handy-computer", "NVIDIA"], "alternates": {"html": "https://wpnews.pro/news/transcribe-cpp-ggml-speech-to-text-inference-engine", "markdown": "https://wpnews.pro/news/transcribe-cpp-ggml-speech-to-text-inference-engine.md", "text": "https://wpnews.pro/news/transcribe-cpp-ggml-speech-to-text-inference-engine.txt", "jsonld": "https://wpnews.pro/news/transcribe-cpp-ggml-speech-to-text-inference-engine.jsonld"}}