How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python

NVIDIA released Canary-1B-v2, a multilingual speech recognition and translation model, with a Python tutorial demonstrating ASR, translation, and SRT subtitle export. The tutorial shows how to install dependencies, load the model on GPU, process audio at 16 kHz, and generate translated subtitles. This enables developers to build multilingual ASR pipelines for real audio files and large-scale transcription.

In this tutorial (https://github.com/MARKTECHPOST-AI-MEDIA-INC/AI-Agents-Projects-Tutorials/blob/main/Voice%20AI/nvidia_canary_1b_v2_asr_translation_tutorial_marktechpost.py), we build a speech recognition and translation workflow using NVIDIA Canary-1B-v2 (https://huggingface.co/nvidia/canary-1b-v2). We begin by setting up the required audio, NeMo, NumPy, and SciPy dependencies, then load the Canary model on a GPU-enabled runtime for efficient inference. From there, we convert audio to 16 kHz mono, perform English ASR, translate speech into multiple languages, generate word and segment timestamps, export translated subtitles as an SRT file, test long-form transcription, run batch processing, and benchmark inference speed. At the end, we have a complete multilingual ASR and speech translation pipeline that we can adapt for real audio files, subtitle generation, and large-scale transcription experiments.

Installing NeMo, Audio Libraries, NumPy, and SciPy Dependencies

```python
import os, subprocess, sys

# Marker file that records that the one-time setup has already run
SENTINEL = "/content/.canary_setup_done"

if not os.path.exists(SENTINEL):
    def sh(c):
        # Echo the command, then run it through the shell (failures are not fatal)
        print("$", c)
        subprocess.run(c, shell=True, check=False)

    print("PHASE 1: installing dependencies (one-time) ...\n")
    sh("apt-get -qq update")
    sh("apt-get -qq install -y libsndfile1 ffmpeg > /dev/null")
    sh('pip install -q "nemo_toolkit[asr]"')
    sh("pip install -q librosa soundfile pydub")
    sh('pip install -q --force-reinstall --no-cache-dir "numpy>=2.2,<2.4" "scipy>=1.15"')
    open(SENTINEL, "w").write("done")
    print("\n✅ Setup complete. Restarting the runtime now.")
    print("   When it reconnects, RUN THIS CELL AGAIN to start the tutorial.")
    os.kill(os.getpid(), 9)  # kill the kernel so the new NumPy/SciPy builds are loaded
```

We set up the environment for the NVIDIA Canary-1B-v2 tutorial. We install the required system packages, the NeMo ASR toolkit, the audio libraries, and compatible NumPy and SciPy versions. We then write a setup marker file and kill the runtime process so that the updated dependencies load cleanly on the next run. Because the marker now exists, re-running the cell skips the installation and proceeds to the main tutorial.
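
The marker-file pattern used in the setup cell can be isolated into a small helper. The following is a minimal sketch using only the standard library; `run_once` and the temporary marker path are hypothetical names chosen for illustration and are not part of the tutorial script. Unlike the setup cell, this variant writes the marker only after the setup function returns without raising, so a failed install would be retried on the next run.

```python
import os
import tempfile


def run_once(marker_path, setup_fn):
    """Run setup_fn only when marker_path is absent. Return True if setup ran."""
    if os.path.exists(marker_path):
        return False
    # If setup_fn raises, the marker is never written and setup is retried next time.
    setup_fn()
    with open(marker_path, "w", encoding="utf-8") as fh:
        fh.write("done")
    return True


# Demo with a throwaway marker file and a stand-in for the install step.
calls = []
marker = os.path.join(tempfile.mkdtemp(), ".setup_done")
first = run_once(marker, lambda: calls.append("installed"))
second = run_once(marker, lambda: calls.append("installed"))
print(first, second, calls)
```

In a Colab setting, the setup function would wrap the install commands, and the runtime restart would follow the first call.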

Loading NVIDIA Canary-1B-v2 and Checking GPU Availability

```python
import time, json, gc, math, urllib.request
import torch, numpy as np, soundfile as sf, librosa

print("PHASE 2: running tutorial\n")
print("NumPy:", np.__version__, "| PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0),
          f"| VRAM: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
else:
    print("⚠️  No GPU - will run on CPU (very slow). "
          "Set Runtime > Change runtime type > GPU.")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Language codes accepted as source_lang / target_lang, mapped to display names
LANGS = {
    "bg":"Bulgarian","hr":"Croatian","cs":"Czech","da":"Danish","nl":"Dutch",
    "en":"English","et":"Estonian","fi":"Finnish","fr":"French","de":"German",
    "el":"Greek","hu":"Hungarian","it":"Italian","lv":"Latvian","lt":"Lithuanian",
    "mt":"Maltese","pl":"Polish","pt":"Portuguese","ro":"Romanian","sk":"Slovak",
    "sl":"Slovenian","es":"Spanish","sv":"Swedish","ru":"Russian","uk":"Ukrainian",
}
print(f"\nSupported languages ({len(LANGS)}):", ", ".join(LANGS.keys()))

from nemo.collections.asr.models import ASRModel
print("\nLoading nvidia/canary-1b-v2 ...")
t0 = time.time()
asr_model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2").to(DEVICE).eval()
print(f"Model loaded in {time.time()-t0:.1f}s")
```

We import the main libraries and check whether CUDA is available for GPU acceleration. We define a dictionary that maps the 25 supported language codes to their names, which we later use to choose target languages and label the output. We then load the NVIDIA Canary-1B-v2 model through NeMo, move it to the available device, and switch it to evaluation mode for inference.

Preparing 16 kHz Audio and Running English ASR with Translation

```python
TARGET_SR = 16000

def prepare_audio(path_or_url, out_path=None):
    # Download remote files first, dropping any query string from the file name
    if str(path_or_url).startswith(("http://", "https://")):
        local = "/content/_dl_" + os.path.basename(path_or_url.split("?")[0])
        urllib.request.urlretrieve(path_or_url, local)
        path_or_url = local
    # Resample to 16 kHz and downmix to mono
    audio, _ = librosa.load(path_or_url, sr=TARGET_SR, mono=True)
    if out_path is None:
        base = os.path.splitext(os.path.basename(path_or_url))[0]
        out_path = f"/content/{base}_16k_mono.wav"
    sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")
    dur = len(audio) / TARGET_SR
    print(f"Prepared: {out_path} ({dur:.1f}s, 16kHz, mono)")
    return out_path, dur

SAMPLE_URL = "https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav"
sample_wav, sample_dur = prepare_audio(SAMPLE_URL)

def transcribe(files, source_lang="en", target_lang="en", timestamps=False, batch_size=1):
    # Accept a single path or a list of paths
    if isinstance(files, str):
        files = [files]
    return asr_model.transcribe(files, source_lang=source_lang, target_lang=target_lang,
                                timestamps=timestamps, batch_size=batch_size)

print("\n=== 1) BASIC ASR (English) ===")
res = transcribe(sample_wav, source_lang="en", target_lang="en")
print("Transcript:", res[0].text)

print("\n=== 2) TRANSLATION (EN audio -> X) ===")
for tgt in ["fr", "de", "es", "it"]:
    out = transcribe(sample_wav, source_lang="en", target_lang=tgt)
    print(f"  EN -> {LANGS[tgt]:<10} [{tgt}]: {out[0].text}")
```

We create a reusable audio preparation function that downloads audio when given a URL and converts it into a 16 kHz mono 16-bit PCM WAV file. We prepare the sample audio file and define a helper function that wraps the model's `transcribe` call; setting `target_lang` equal to `source_lang` performs ASR, while a different `target_lang` performs speech translation. We then run basic English ASR and translate the same English speech into French, German, Spanish, and Italian.
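
Before sending files to the model, it can help to confirm that a prepared file really is 16 kHz, mono, 16-bit PCM. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; `wav_info` and `is_16k_mono_pcm16` are hypothetical helper names, and the synthetic 440 Hz tone stands in for a real recording.

```python
import math
import os
import struct
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate, channels, sample_width_bytes, duration_seconds)."""
    with wave.open(path, "rb") as wf:
        sr = wf.getframerate()
        return sr, wf.getnchannels(), wf.getsampwidth(), wf.getnframes() / sr


def is_16k_mono_pcm16(path):
    """True when the file matches the format written by the preparation step."""
    sr, channels, width, _ = wav_info(path)
    return sr == 16000 and channels == 1 and width == 2


# Demo: synthesize 0.5 s of a 440 Hz tone as 16 kHz mono 16-bit PCM.
demo_path = os.path.join(tempfile.mkdtemp(), "tone_16k_mono.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * n / 16000)))
        for n in range(8000)  # 8000 samples at 16 kHz = 0.5 s
    )
    wf.writeframes(frames)

print(wav_info(demo_path))
print(is_16k_mono_pcm16(demo_path))
```

A check like this could run on every file in a batch so that a wrongly formatted input is caught before inference rather than showing up as a poor transcript.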

Generating Word and Segment Timestamps and Exporting SRT Subtitles

```python
print("\n=== 3) TIMESTAMPS (ASR) ===")
ts_out = transcribe(sample_wav, source_lang="en", target_lang="en", timestamps=True)
word_ts = ts_out[0].timestamp.get("word", [])
seg_ts = ts_out[0].timestamp.get("segment", [])
print("Segments:")
for s in seg_ts:
    print(f"  {s['start']:6.2f}s - {s['end']:6.2f}s  {s['segment']}")
print("First 10 words:")
for w in word_ts[:10]:
    print(f"  {w['start']:6.2f}s - {w['end']:6.2f}s  {w['word']}")

def _srt_time(t):
    # Round once to whole milliseconds so the ms field can never reach 1000
    total_ms = int(round(t * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments, out_path="/content/output.srt"):
    # Each SRT cue is: index, "start --> end", text, blank line
    lines = []
    for i, seg in enumerate(segments, 1):
        lines += [str(i),
                  f"{_srt_time(seg['start'])} --> {_srt_time(seg['end'])}",
                  seg["segment"].strip(), ""]
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines))
    print(f"Saved SRT: {out_path}")
    return out_path

print("\n=== 4) SRT EXPORT (translated French subtitles) ===")
fr_ts = transcribe(sample_wav, source_lang="en", target_lang="fr", timestamps=True)
segments_to_srt(fr_ts[0].timestamp["segment"], "/content/subtitles_fr.srt")
with open("/content/subtitles_fr.srt", encoding="utf-8") as fh:
    print(fh.read())
```

We enable timestamped transcription to extract both segment-level and word-level timing information. We print the transcript segments and the first ten word timestamps to inspect how the model aligns text with audio. We also convert the translated French segments into an SRT subtitle file, formatting each time as `HH:MM:SS,mmm`, and display the generated subtitles.
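
Segment-level cues can be long for on-screen subtitles. One possible extension is to regroup word-level timestamps into shorter cues. The following is a minimal sketch, assuming word entries shaped like those printed above (dicts with `word`, `start`, and `end` keys); `words_to_cues`, `max_chars`, and `max_duration` are hypothetical names, and the sample words are made up for illustration rather than taken from model output.

```python
def _close(group):
    """Merge a run of word entries into one cue with segment-style keys."""
    return {
        "start": group[0]["start"],
        "end": group[-1]["end"],
        "segment": " ".join(w["word"] for w in group),
    }


def words_to_cues(words, max_chars=42, max_duration=5.0):
    """Group word timestamps into cues limited by text length and duration."""
    cues, current = [], []
    for w in words:
        candidate = current + [w]
        text = " ".join(x["word"] for x in candidate)
        too_long = len(text) > max_chars
        too_slow = candidate[-1]["end"] - candidate[0]["start"] > max_duration
        if current and (too_long or too_slow):
            # Adding this word would break a limit: finish the cue, start a new one.
            cues.append(_close(current))
            current = [w]
        else:
            current = candidate
    if current:
        cues.append(_close(current))
    return cues


# Made-up word timings (seconds), not model output.
sample_words = [
    {"word": "the", "start": 0.0, "end": 0.3},
    {"word": "quick", "start": 0.4, "end": 0.7},
    {"word": "brown", "start": 0.8, "end": 1.1},
    {"word": "fox", "start": 1.2, "end": 1.5},
    {"word": "jumps", "start": 1.6, "end": 1.9},
    {"word": "over", "start": 2.0, "end": 2.3},
    {"word": "the", "start": 2.4, "end": 2.7},
    {"word": "lazy", "start": 2.8, "end": 3.1},
    {"word": "dog", "start": 3.2, "end": 3.5},
]

cues = words_to_cues(sample_words, max_chars=20)
for cue in cues:
    print(cue)
```

Because each cue uses the same `start`, `end`, and `segment` keys as the segment entries, the resulting list could be passed to `segments_to_srt` unchanged.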

Running Long-Form Transcription, Batch Processing, and Speed Benchmark

```python
print("\n=== 5) LONG-FORM (sample tiled x6) ===")
long_audio, _ = librosa.load(sample_wav, sr=TARGET_SR, mono=True)
long_audio = np.tile(long_audio, 6)  # repeat the clip six times back to back
sf.write("/content/long.wav", long_audio, TARGET_SR, subtype="PCM_16")
print(f"Long clip duration: {len(long_audio)/TARGET_SR:.1f}s")
long_out = transcribe("/content/long.wav", source_lang="en", target_lang="en", batch_size=1)
print("Long transcript (first 300 chars):", long_out[0].text[:300], "...")

print("\n=== 6) BATCH ===")
# Write two copies of the sample so there is more than one file to batch
for name in ["clip_a", "clip_b"]:
    sf.write(f"/content/{name}.wav",
             librosa.load(sample_wav, sr=TARGET_SR, mono=True)[0],
             TARGET_SR, subtype="PCM_16")
batch = transcribe(["/content/clip_a.wav", "/content/clip_b.wav"],
                   source_lang="en", target_lang="en", batch_size=2)
for i, b in enumerate(batch):
    print(f"  file {i}: {b.text}")

print("\n=== 7) BENCHMARK ===")
t0 = time.time()
_ = transcribe(sample_wav, source_lang="en", target_lang="en")
elapsed = time.time() - t0
# RTFx = audio duration / compute time; values above 1 are faster than real time
print(f"Audio: {sample_dur:.2f}s | Compute: {elapsed:.2f}s | RTFx ≈ {sample_dur/elapsed:.1f}x")
print("\n✅ Done. Change source_lang/target_lang from the LANGS dict to try other languages.")
```

We test long-form transcription by repeating the sample audio six times and passing the longer clip through the model. We also create two duplicate audio clips to demonstrate batch transcription with a batch size of two. Finally, we benchmark the model by dividing the audio duration by the compute time and report the result as the inverse real-time factor (RTFx), where a value above 1 means the model runs faster than real time.

Conclusion

In conclusion, we completed a practical end-to-end workflow for using NVIDIA Canary-1B-v2 as a multilingual ASR and speech translation system. We processed raw audio, generated transcripts, translated speech into different target languages, extracted timestamps, created subtitle files, handled longer audio clips, and measured runtime performance through a simple benchmark.
We now have a reusable Colab-ready pipeline that we can extend further with custom uploads, more languages, larger batches, and production-style audio processing.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.