# Transcribe.cpp – ggml speech-to-text inference engine

> Source: <https://github.com/handy-computer/transcribe.cpp>
> Published: 2026-07-01 10:13:40+00:00

C/C++ speech-to-text inference library. Runs diverse STT model families via [GGUF](https://github.com/ggerganov/gguf) models on the [ggml](https://github.com/ggml-org/ggml) runtime, with Metal, Vulkan, and CUDA backends for fast GPU inference plus a tinyBLAS-accelerated CPU path.

16 model families and 60+ variants, streaming and batch. Every model we publish under [handy-computer](https://huggingface.co/handy-computer) is numerically verified and WER-tested against its reference implementation.

**Supported models:**

| Family | Variants | Docs |
|---|---|---|
| Parakeet | 10 variants: TDT, RNN-T, CTC, TDT+CTC (110M–1.1B) | |
| Canary | `canary-1b`, `canary-1b-v2`, `canary-1b-flash`, `canary-180m-flash` | [docs/models/canary.md](/handy-computer/transcribe.cpp/blob/main/docs/models/canary.md) |
| Canary-Qwen | `canary-qwen-2.5b` (FastConformer + Qwen3-1.7B SALM) | [docs/models/canary-qwen-2.5b.md](/handy-computer/transcribe.cpp/blob/main/docs/models/canary-qwen-2.5b.md) |
| Whisper | `tiny` through `large-v3-turbo`, plus `.en` siblings | [docs/models/whisper.md](/handy-computer/transcribe.cpp/blob/main/docs/models/whisper.md) |
| GigaAM | `gigaam-v3-{e2e-rnnt,e2e-ctc,rnnt,ctc}` | [docs/models/gigaam.md](/handy-computer/transcribe.cpp/blob/main/docs/models/gigaam.md) |
| Moonshine | `moonshine-tiny`, `moonshine-base` | [docs/models/moonshine.md](/handy-computer/transcribe.cpp/blob/main/docs/models/moonshine.md) |
| Moonshine Streaming | `moonshine-streaming-{tiny,small,medium}` | [docs/models/moonshine-streaming.md](/handy-computer/transcribe.cpp/blob/main/docs/models/moonshine-streaming.md) |
| Qwen3-ASR | `qwen3-asr-0.6b`, `qwen3-asr-1.7b` | [docs/models/qwen3-asr.md](/handy-computer/transcribe.cpp/blob/main/docs/models/qwen3-asr.md) |
| Cohere Transcribe | `cohere-transcribe-03-2026` | [docs/models/cohere-transcribe-03-2026.md](/handy-computer/transcribe.cpp/blob/main/docs/models/cohere-transcribe-03-2026.md) |
| SenseVoice | `sensevoice-small` | [docs/models/sensevoice-small.md](/handy-computer/transcribe.cpp/blob/main/docs/models/sensevoice-small.md) |
| Fun-ASR-Nano | `fun-asr-nano-2512`, `fun-asr-mlt-nano-2512` | [docs/models/fun-asr-nano.md](/handy-computer/transcribe.cpp/blob/main/docs/models/fun-asr-nano.md) |
| Nemotron Speech Streaming | `nemotron-speech-streaming-en-0.6b` | [docs/models/nemotron-speech-streaming-en-0.6b.md](/handy-computer/transcribe.cpp/blob/main/docs/models/nemotron-speech-streaming-en-0.6b.md) |
| Nemotron 3.5 ASR Streaming | `nemotron-3.5-asr-streaming-0.6b` (multilingual, 40 locales) | [docs/models/nemotron-3.5-asr-streaming-0.6b.md](/handy-computer/transcribe.cpp/blob/main/docs/models/nemotron-3.5-asr-streaming-0.6b.md) |
| Granite Speech | `granite-4.0-1b-speech`, `granite-speech-4.1-2b{,-plus,-nar}` | [docs/models/granite-speech.md](/handy-computer/transcribe.cpp/blob/main/docs/models/granite-speech.md) |
| Voxtral | `voxtral-mini-3b-2507`, `voxtral-small-24b-2507` (audio-LLM; transcription + translation) | [docs/models/voxtral.md](/handy-computer/transcribe.cpp/blob/main/docs/models/voxtral.md) |
| Voxtral Realtime | `voxtral-mini-4b-realtime-2602` (streaming audio-LLM) | [docs/models/voxtral-realtime.md](/handy-computer/transcribe.cpp/blob/main/docs/models/voxtral-realtime.md) |
| MedASR | `medasr` (Conformer + CTC, English medical-dictation, gated) | [docs/models/medasr.md](/handy-computer/transcribe.cpp/blob/main/docs/models/medasr.md) |

Per-variant model cards live under [docs/models/](/handy-computer/transcribe.cpp/blob/main/docs/models).

```
cmake -B build
cmake --build build
```

Metal is enabled automatically on Apple Silicon. For Vulkan (Linux/Windows):

```
# Ubuntu/Debian
sudo apt install build-essential cmake libvulkan-dev glslc libopenblas-dev

cmake -B build -DTRANSCRIBE_VULKAN=ON
cmake --build build
```

For CUDA (Linux + NVIDIA GPU):

```
# requires the CUDA toolkit (nvcc) on PATH
cmake -B build -DTRANSCRIBE_CUDA=ON
cmake --build build
```

`libopenblas-dev` is optional but recommended. It accelerates the host-side decoder ~10-15x. Without it the build falls back to a scalar path automatically.

tinyBLAS (Justine Tunney's `llamafile_sgemm` kernels) is on by default.

To build the quantization tool:

```
cmake -B build -DTRANSCRIBE_BUILD_TOOLS=ON
cmake --build build
```

Pre-built GGUFs for all supported models are hosted on Hugging Face under
[handy-computer](https://huggingface.co/handy-computer). Each per-model doc
(linked in the table above) includes direct download links for every quant.
Convert from source only if you need a different dtype or a checkpoint that
isn't pre-built.

The converter loads directly from NVIDIA's NeMo checkpoints via
`ASRModel.from_pretrained`. Requires [uv](https://docs.astral.sh/uv/);
the parakeet env ships NeMo and its deps.

```
uv run --project scripts/envs/parakeet \
  scripts/convert-parakeet.py nvidia/parakeet-tdt-0.6b-v2
```

This writes `models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf` following
the llama.cpp-style `<slug>-<QUANT>.gguf` naming convention. Pass a local
`.nemo` path or extracted directory for offline conversion.
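The naming convention can be sketched in a few lines of Python. This is an illustration only: `gguf_output_path` is a hypothetical helper, not part of the repository, and the `models/<slug>/` directory layout and the rule that the slug is the part of the model ID after the last `/` are both inferred from the single example above.

```python
from pathlib import PurePosixPath


def gguf_output_path(model_id: str, quant: str, models_dir: str = "models") -> str:
    """Build `<models_dir>/<slug>/<slug>-<QUANT>.gguf` for a model ID.

    Assumption: the slug is whatever follows the last "/" in the ID,
    which matches the one example shown in this README.
    """
    slug = model_id.rstrip("/").split("/")[-1]
    return str(PurePosixPath(models_dir) / slug / f"{slug}-{quant}.gguf")


if __name__ == "__main__":
    print(gguf_output_path("nvidia/parakeet-tdt-0.6b-v2", "F32"))
    # models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf
```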

The `transcribe-quantize` tool produces smaller models from the
reference GGUF. Available presets: `F16`, `Q8_0`, `Q6_K`, `Q5_K_M`,
`Q4_K_M`.

```
# Quantize the F32 reference GGUF to Q4_K_M
build/bin/transcribe-quantize \
  models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf \
  models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-Q4_K_M.gguf \
  --quant Q4_K_M

# Transcribe a sample file with the CLI
build/bin/transcribe-cli -m models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf samples/jfk.wav
```
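To produce every preset from one reference file, the quantize invocation above can be generated in a loop. The sketch below only builds the argument lists. It assumes the `<input> <output> --quant <PRESET>` argument order shown above and the `<slug>-<QUANT>.gguf` naming; `quantize_commands` is a hypothetical helper, not something the project ships.

```python
import shlex

# Presets listed in this README.
PRESETS = ["F16", "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M"]


def quantize_commands(reference_gguf: str,
                      tool: str = "build/bin/transcribe-quantize") -> list:
    """Return one argv list per preset for a `<slug>-F32.gguf` reference file."""
    suffix = "-F32.gguf"
    if not reference_gguf.endswith(suffix):
        raise ValueError("expected a reference file named <slug>-F32.gguf")
    stem = reference_gguf[: -len(suffix)]
    return [
        [tool, reference_gguf, f"{stem}-{preset}.gguf", "--quant", preset]
        for preset in PRESETS
    ]


if __name__ == "__main__":
    ref = "models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2-F32.gguf"
    for argv in quantize_commands(ref):
        # Print shell-safe command lines instead of running them.
        print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the tool has been built with `-DTRANSCRIBE_BUILD_TOOLS=ON`.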

Input must be 16 kHz mono WAV. Use `ffmpeg` or `sox` to convert other formats:

```
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```
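Before handing a file to the CLI, it can help to confirm that it really is 16 kHz mono. A minimal sketch using Python's standard `wave` module follows; `wav_format_problems` is a hypothetical helper name. It checks only sample rate and channel count, since this README does not state a required bit depth, and `wave` raises `wave.Error` for files it cannot parse as PCM WAV.

```python
import wave


def wav_format_problems(path: str, rate: int = 16000, channels: int = 1) -> list:
    """Return human-readable mismatches; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {rate} Hz"
            )
        if wav.getnchannels() != channels:
            problems.append(
                f"file has {wav.getnchannels()} channel(s), expected {channels}"
            )
    return problems
```

An empty result means the file can be passed through as-is; otherwise, re-encode it with the `ffmpeg` command above.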

Official bindings wrap the C API for other languages:

| Language | Path |
|---|---|
| Python | |
| TypeScript | [bindings/typescript](/handy-computer/transcribe.cpp/blob/main/bindings/typescript) |
| Rust | [bindings/rust/transcribe-cpp](/handy-computer/transcribe.cpp/blob/main/bindings/rust/transcribe-cpp) |
| Swift | [bindings/swift](/handy-computer/transcribe.cpp/blob/main/bindings/swift) |

See [docs/bindings.md](/handy-computer/transcribe.cpp/blob/main/docs/bindings.md) for how the bindings are generated
and kept in sync with the header.

```
cd build && ctest
```

Some tests require a real model file. Enable them with:

```
cmake -B build -DTRANSCRIBE_BUILD_REAL_MODEL_TESTS=ON
cmake --build build
TRANSCRIBE_PARAKEET_GGUF=path/to/model.gguf ctest --test-dir build
```

For the model-family smoke-test, numerical-validation, and benchmark
pattern expected of new ports, see
[docs/model-family-testing.md](/handy-computer/transcribe.cpp/blob/main/docs/model-family-testing.md).
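For readers unfamiliar with the WER metric used to validate published models, word error rate is conventionally defined as (substitutions + deletions + insertions) / number of reference words, computed from a word-level edit distance. The sketch below implements that textbook definition with whitespace tokenization and no text normalization. It is not the project's scoring code, which may normalize case and punctuation differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("b" -> "x") and one deletion ("d") over 4 reference words.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.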

A huge thanks to [Mozilla AI](https://www.mozilla.ai/) and their [BiR Program](https://www.mozilla.ai/company/bir).
This whole project started out as an idea, not even an implementation direction. It was a research project on how
to accelerate transcription models across all platforms as easily as possible. The BiR program and Davide helped
support the research, my eventual decision to implement an inference engine backed by ggml, and my
experiments with automated model porting using agentic programming tools.

[Hugging Face](https://huggingface.co/) provided the project with extra storage so we can host all of the models
we support. We want to provide canonical references for as many models as reasonably possible, and
the support from Hugging Face helps make that possible.

[Modal](https://modal.com/) provided GPU credits so the project can test and validate that its
implementations match the transformers or NeMo reference source. This is critical to getting as close
as possible to a production-grade inference engine that works everywhere. We believe accurate transcriptions
are critical, and the only way to ensure them is through long-running WER checks, which Modal
helps to provide. Every model published under [handy-computer](https://huggingface.co/handy-computer)
on Hugging Face has had its WER checked, so you can trust the results. And if there are any regressions, you
bet we will be fixing them.

[Blacksmith](https://www.blacksmith.sh/) provides many of the CI runners for this project. That helps keep
transcribe.cpp well tested and our releases as smooth as possible. The CI is quick and a drop-in
replacement for the standard GitHub Actions runners. I ran into limits with those very fast, and I'm super happy
that when I reached out, Blacksmith was able to provide runners for the project.

```
include/transcribe.h       Public C API (single header)
src/                       Library internals (C++17)
src/arch/parakeet/         Parakeet family implementation
src/arch/cohere/           Cohere Transcribe family implementation
examples/cli/              CLI binary source
tools/transcribe-quantize/ Quantization tool source
bindings/                  Python, TypeScript, Rust, and Swift bindings
docs/                      Porting and validation guidance
scripts/                   Python converter + test tooling
ggml/                      Vendored ggml (see ggml/UPSTREAM for pinned SHA)
src/third_party/miniz/     Vendored miniz deflate codec (see its UPSTREAM file)
samples/                   Test audio files
tests/                     Unit and smoke tests
```

transcribe.cpp is MIT-licensed. See [LICENSE](/handy-computer/transcribe.cpp/blob/main/LICENSE) for details. Vendored
third-party components (ggml and miniz, both MIT) are attributed in
[THIRD-PARTY-LICENSES.md](/handy-computer/transcribe.cpp/blob/main/THIRD-PARTY-LICENSES.md).
