{"slug": "how-to-fine-tune-nemotron-3-5-asr-for-your-language-domain-or-accent", "title": "How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent", "summary": "NVIDIA released Nemotron 3.5 ASR, a 600M-parameter streaming multilingual speech-to-text model that transcribes 40 language-locales from a single checkpoint with built-in punctuation and capitalization. The model, which succeeds the English-only Nemotron 3 ASR, uses a Cache-Aware FastConformer-RNNT architecture to achieve low latency and high accuracy without redundant audio processing. Available as open weights on Hugging Face, the model can be fine-tuned for specific languages, domains, or accents without API dependencies or per-call billing.", "body_md": "# How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent\n\nPublished June 4, 2026\n\n[NVIDIA Nemotron 3.5 ASR](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b) is a streaming multilingual speech-to-text model with 600M parameters that transcribes **40 language-locales from a single checkpoint**, in **real time**, with **punctuation and capitalization built in**. It is the successor to the popular English-only Nemotron 3 ASR model, which was released [on Hugging Face](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) and [as a NIM](https://build.nvidia.com/nvidia/nemotron-asr-streaming/modelcard) earlier this year. Since its release, Nemotron 3 ASR has been validated by independent benchmarks at Artificial Analysis, where it ranks [2nd in latency among all streaming ASR models](https://artificialanalysis.ai/speech-to-text/streaming), with just 0.07 seconds to final transcript after end of speech, and sits in the \"most attractive quadrant\" of the AA-WER Streaming Index vs. Time to Final Transcription leaderboard, placing it among the best models on the combined accuracy-latency tradeoff. 
The model uses a Cache-Aware FastConformer-RNNT architecture that streams audio without the redundant recomputation that makes most streaming ASR slow, so you get low latency *and* high accuracy, not one at the expense of the other. Nemotron 3.5 ASR ships as open weights on Hugging Face: you can inspect, fine-tune, and deploy it without API dependencies or per-call billing. No data leaves your infrastructure unless you choose. And because it's a strong base model, you can fine-tune it for your own language, domain, or accent. The second half of this post walks through exactly how.\n\n## The problem with multilingual speech recognition today\n\nIf you've ever built a product that needs to transcribe speech, you've probably hit one of these walls:\n\n- **The polyglot tax.** You want to support multiple languages, so you stitch together 40 different models, or 40 different vendor APIs, each with its own quirks, latency profile, and billing. Your infrastructure becomes a museum of one-off integrations.\n- **The streaming-vs-accuracy tradeoff.** Real-time captioning needs low latency, but most \"streaming\" ASR systems fake it by re-processing overlapping windows of audio over and over. That burns compute and adds delay. Turn down the latency and accuracy falls off a cliff.\n- **The post-processing pipeline.** Raw ASR output is often an unpunctuated, lowercase wall of text. You bolt on a second model for punctuation and capitalization, adding yet another moving part.\n- **The \"known language\" assumption.** Many systems require you to tell them the language up front. 
But what about a customer-support line where callers switch between English and Spanish mid-sentence?\n\nNemotron 3.5 ASR was built to collapse all four of those problems into one model.\n\n## What it does\n\n**One model, 40 language-locales.** A single 600M-parameter checkpoint transcribes English (US/GB), Spanish (US/ES), German, French (FR/CA), Italian, Arabic, Japanese, Korean, Portuguese (BR/PT), Russian, Hindi, Turkish, Vietnamese, Dutch, Ukrainian, Polish, Finnish, Mandarin, Czech, Bulgarian, Slovak, Swedish, Croatian, Romanian, Estonian, Danish, Hungarian, Norwegian Bokmål, Norwegian Nynorsk, Hebrew, Greek, Lithuanian, Latvian, Maltese, Slovenian, and Thai. No per-language deployment, no model-swapping.\n\n**Real-time streaming, done right.** The model is built on a **Cache-Aware FastConformer** encoder. Traditional \"buffered\" streaming re-processes overlapping chunks of audio at every step, doing the same work many times over. This model instead **caches the encoder's internal state** and reuses it, so every audio frame is processed exactly once, with no overlap. The result is dramatically lower compute and end-to-end latency, with no accuracy penalty.\n\n**Punctuation and capitalization, natively.** The output is production-ready text (proper casing, commas, periods, question marks) straight from the model. No separate punctuation-restoration step.\n\n**Language conditioning, your choice.** You can run it two ways:\n\n- **Tell the model the input language** (`target_lang=en-US`) when you know it. This typically gives the best accuracy.\n- **Let the model detect the language** (`target_lang=auto`) when you don't. The model identifies the language and transcribes accordingly.\n\n## How it works (the 2-minute version)\n\nThe model has two main pieces:\n\n**A Cache-Aware FastConformer encoder (24 layers).** FastConformer is an efficient evolution of the Conformer architecture with linearly scalable attention. 
The \"cache-aware\" part is the streaming magic: the encoder keeps a cache of its self-attention and convolution activations from previous frames, so as new audio arrives it only computes what's genuinely new. Nothing is recomputed.\n\n**An RNNT (Recurrent Neural Network Transducer) decoder.** RNNT is the workhorse decoder for streaming ASR: it emits text as audio streams in, frame by frame, which is exactly what you want for live transcription.\n\nOn top of this, the model adds **prompt-based language-ID conditioning**: a language signal is fed alongside the audio, which lets one set of weights specialize its output to the target language or, in `auto` mode, infer the language itself.\n\nIt was trained on **a massive speech dataset** spanning all supported languages, using a blend of public and proprietary data normalized to punctuated, properly-cased text.\n\n### A knob worth knowing: `att_context_size`\n\nStreaming ASR is fundamentally a tradeoff between *how soon* you emit text and *how much future audio* the model gets to \"peek at\" before committing. Nemotron ASR exposes this directly through the **attention context size**:\n\n| Attention Context | Chunk Size (Latency) | Use Case |\n|---|---|---|\n| `[56, 0]` | 80ms (Ultra-Low) | Ultra-low-latency voice agents |\n| `[56, 1]` | 160ms (Low) | Interactive voice agents, conversational AI |\n| `[56, 3]` | 320ms (Balanced) | Conversational AI, live captioning |\n| `[56, 6]` | 560ms (Medium) | High accuracy with reasonable latency |\n| `[56, 13]` | 1.12s (High) | Highest accuracy with high latency |\n\nThe same checkpoint covers the whole spectrum: you choose the operating point at inference time, no retraining required.\n\n## Try it in minutes\n\nThe model ships as a NeMo checkpoint. 
Clone the NeMo repository and point the streaming inference script at your audio:\n\n```\ngit clone https://github.com/NVIDIA-NeMo/NeMo.git\n```\n\n**Transcribe with a known language:**\n\n```\npython ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \\\n    model_path=${MODEL_PATH} \\\n    dataset_manifest=${MANIFEST_PATH} \\\n    output_path=${OUTPUT_FOLDER} \\\n    target_lang=es-ES \\\n    att_context_size=\"[56,3]\" \\\n    strip_lang_tags=true\n```\n\n**Or let the model detect the language:**\n\n```\npython ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \\\n    model_path=${MODEL_PATH} \\\n    dataset_manifest=${MANIFEST_PATH} \\\n    output_path=${OUTPUT_FOLDER} \\\n    target_lang=auto \\\n    att_context_size=\"[56,3]\" \\\n    strip_lang_tags=true\n```\n\nAudio should be mono-channel `.wav`. The manifest is a standard NeMo JSON-lines file:\n\n```\n{\"audio_filepath\": \"/path/to/clip.wav\", \"duration\": 4.27, \"text\": \"reference transcript\"}\n```\n\nThe model automatically predicts a language tag at the end of each completed sentence, e.g. `This is a test sample. <en-US>`. Setting `strip_lang_tags=true` removes the `<xx-XX>` language tag for better readability.\n\n# Deep Dive: Fine-Tuning Nemotron ASR for Your Language\n\nNemotron 3.5 ASR is strong out of the box, but it was trained on a mix where some languages have far more data than others. The long-tail locales have headroom, and a few hours of in-domain audio plus the right recipe closes a surprising amount of it.\n\nTo make this concrete, we ran a worked example: take the base model and sharpen it on two mid-resource European languages, Greek and Bulgarian, then measure honestly on held-out data. The results below are from that run. 
This section is a high-level overview; the code example lives in the companion [GitHub repo](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-nemotron-3.5-asr-streaming-prompt.ipynb). When we publish an agentic SKILL.md covering the whole process, this blog will be updated accordingly.\n\n## Why fine-tune?\n\nA few situations where it pays off:\n\n- **Sharpening a long-tail locale.** Languages with less pretraining data have the most to gain.\n- **Domain expertise or specialized vocabulary.** Medical, legal, financial, or technical vocabulary the base model rarely saw.\n- **Accent, dialect, and acoustics.** Telephony, far-field, in-car, or a specific speaker population.\n- **New languages.** Bootstrapping a locale that isn't yet covered.\n\n## A Preview of the Power of Fine-Tuning\n\n🎥 **Video Walkthrough:** [Watch on YouTube](https://www.youtube.com/watch?v=kP9yaH-DT8E)\n\nThis walkthrough demonstrates multilingual streaming inference, latency/accuracy tradeoffs, deployment options, and the fine-tuning workflow described below.\n\n## The recipe at a glance\n\nThe whole workflow is five moves:\n\n- Point the trainer at tarred speech data for the target languages: no per-file unpacking, streamed efficiently by NeMo/Lhotse.\n- Fine-tune from the base checkpoint (`init_from_nemo_model`) using the same Cache-Aware FastConformer-RNNT recipe, conditioned on each clip's language tag.\n- Evaluate on a held-out set the model never saw, at the same low-latency streaming setting you'll deploy (e.g. `att_context_size=[56,0]`: 80ms chunk, 0ms lookahead).\n- Add more data where the language is weak and retrain.\n- Export and deploy the fine-tuned checkpoint.\n\n## Step 1: Data\n\nWe assembled a balanced, ~290-hour mix across the two languages (Greek and Bulgarian) from public multilingual corpora ([Granary](https://huggingface.co/datasets/nvidia/Granary), Common Voice, FLEURS), kept as tarred NeMo/Lhotse shards. 
The two details that matter most:\n\n- Every clip carries a `target_lang` tag. This is what drives the model's prompt-based language conditioning, so getting the tag right (and using a value the model recognizes) is essential.\n- Match the base model's text style: punctuated, properly-cased transcripts, since that's what the model produces.\n\nHeld-out FLEURS test splits (which were not in training) gave us an honest, in-the-wild benchmark per language.\n\n## Step 2: Train\n\nA straightforward full fine-tune of the streaming RNNT model, driven by a fixed step budget (the right way to schedule with streaming/iterable data). It runs on a single GPU for a quick pass and scales cleanly to multi-GPU for a fuller run. On a small dataset like this, an epoch is minutes, not hours.\n\n## Step 3: Evaluate\n\nWe measured Word Error Rate (WER) on the held-out FLEURS test set, in streaming mode with an 80ms chunk, the most demanding condition, with no future-audio \"peeking.\" The improvement over the base model is large for both languages:\n\n| Language | Base model WER (%) | Fine-tuned WER (%) | Relative improvement in WER |\n|---|---|---|---|\n| 🇬🇷 Greek | 35 | 24 | 32% |\n| 🇧🇬 Bulgarian | 22 | 15 | 31% |\n\n*Raw WER (%) on held-out FLEURS test, lowest-latency streaming. Same evaluation for both the base and the fine-tuned models.*\n\nBoth languages became markedly more usable after a short fine-tune, with WER falling by roughly a third in each case.\n\n## Step 4: Scale the data where it helps\n\nTo test how far more data goes, we then mixed in ~2,000 additional hours of parliamentary speech (MOSEL/VoxPopuli), part of the [Granary Dataset](https://huggingface.co/datasets/nvidia/Granary), taking the training pool from ~290 hours to ~2,300 hours. Even partway through that longer run, the weakest languages improved further (e.g. 
Bulgarian dropping into the high-20s), confirming the obvious lever: more in-language data keeps helping, though gains are uneven across languages and domains, so measure rather than assume.\n\n## Step 5: Deploy\n\nThe fine-tuned model is the same architecture as the base, so it drops straight into the same serving path, and you pick your latency/accuracy operating point at inference time via `att_context_size`, exactly as in the first half of this post.\n\n## What we learned\n\n- **Fine-tuning is transformative for under-resourced languages.** The biggest wins came where the base model was weakest.\n- **Evaluate at deployment latency, on held-out data.** Training-set scores flatter you; a separate test set at 0 ms look-ahead tells the truth.\n- **Get the language tag right.** The prompt conditioning is powerful but unforgiving of mismatched language labels.\n- **Protect the other languages.** When specializing a multilingual model, blend in a slice of the model's other languages (\"replay\") and re-check them, so you sharpen your target locales without eroding the rest.\n- **More data helps, unevenly.** Adding hours reliably moved most languages; one plateaued, a reminder that domain match matters as much as raw quantity.\n\n📦 The full walkthrough (data prep scripts, training configs, the exact commands, and the complete benchmark numbers) is in the companion [GitHub repo](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-nemotron-3.5-asr-streaming-prompt.ipynb). 
This section is the overview; the repo is the build.\n\nFor production serving, look out for the NIM release later this month, which will provide gRPC streaming and support across NVIDIA Ampere, Hopper, Blackwell, Lovelace, Turing, Volta, and Jetson.\n\n## What you can build with it\n\nA few of the use cases this model unlocks:\n\n- **Sub-second voice agents:** ASR → LLM → TTS loops where the speech-to-text leg is no longer the bottleneck.\n- **Live multilingual meeting captions:** one stream, participants in different languages, captions in real time.\n- **Call-center analytics at global scale:** one ASR backend instead of a per-language vendor sprawl.\n- **Real-time captioning + translation** for livestreams and events.\n- **On-device transcription** on Jetson for privacy-sensitive or disconnected environments.\n\n## Get Started\n\nReady to build multilingual speech applications with a single streaming ASR model?\n\n🤗 Try Nemotron 3.5 ASR: [nvidia/nemotron-3.5-asr-streaming-0.6b](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b)\n\n🧠 Run and fine-tune with NVIDIA NeMo: [github.com/NVIDIA-NeMo/NeMo](https://github.com/NVIDIA-NeMo/NeMo)\n\n📚 Explore the training example: [Fine-Tuning Notebook](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-nemotron-3.5-asr-streaming-prompt.ipynb)\n\nWhether you're building voice agents, multilingual captioning systems, contact-center analytics, or on-device speech applications, Nemotron 3.5 ASR provides a single multilingual model that can be deployed, customized, and fine-tuned for your use case.\n\nWe'd love to see what you build. 
Share your benchmarks, fine-tuning results, and language adaptations on the model discussion page:\n\n💬 Model Discussions: [https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b/discussions](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b/discussions)\n\nModel: [https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b)\n\nLicense: OpenMDW-1.1\n\nRuntime: NeMo 26.06+", "url": "https://wpnews.pro/news/how-to-fine-tune-nemotron-3-5-asr-for-your-language-domain-or-accent", "canonical_source": "https://huggingface.co/blog/nvidia/fine-tuning-nemotron-35-asr", "published_at": "2026-06-04 12:59:35+00:00", "updated_at": "2026-06-04 13:42:01.492996+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "natural-language-processing", "ai-products", "ai-tools"], "entities": ["NVIDIA", "Nemotron 3.5 ASR", "Hugging Face", "Nemotron 3 ASR", "Artificial Analysis", "Cache-Aware FastConformer-RNNT"], "alternates": {"html": "https://wpnews.pro/news/how-to-fine-tune-nemotron-3-5-asr-for-your-language-domain-or-accent", "markdown": "https://wpnews.pro/news/how-to-fine-tune-nemotron-3-5-asr-for-your-language-domain-or-accent.md", "text": "https://wpnews.pro/news/how-to-fine-tune-nemotron-3-5-asr-for-your-language-domain-or-accent.txt", "jsonld": "https://wpnews.pro/news/how-to-fine-tune-nemotron-3-5-asr-for-your-language-domain-or-accent.jsonld"}}