Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder

Interfaze, a Y Combinator-backed startup, open-sourced diffusion-gemma-asr-small, which it describes as the first multilingual diffusion-based ASR model. The model transcribes six languages using a single 42M-parameter adapter on top of Google’s frozen 26B DiffusionGemma backbone, achieving 6.6% WER on LibriSpeech. It uses a parallel denoising decoder instead of traditional autoregressive generation, so transcription cost scales with the number of denoising steps rather than with transcript length.

Interfaze, a young YC-backed startup, has open-sourced a new speech recognition model called diffusion-gemma-asr-small. The model transcribes audio through a diffusion decoder, not an autoregressive one. It is described as the first multilingual audio diffusion ASR model. One adapter handles six languages. The research team trained only about 42M parameters on top of a frozen 26B backbone, which is roughly 0.16% of the model’s weights.

Two terms matter up front. Autoregressive models generate text one token at a time. Diffusion models refine all tokens in parallel. This model uses the diffusion approach for speech-to-text.

TL;DR

- The Interfaze team claims it is the first open-source multilingual diffusion ASR model: six languages from a single ~42M-parameter adapter.
- Transcribes via DiffusionGemma’s diffusion decoder using uniform, random-token diffusion, not the absorbing