Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech

NVIDIA has introduced a new workflow using agent skills and Nemotron Speech to accelerate the evaluation of clinical automatic speech recognition (ASR) models, addressing the challenge of accurately recognizing rare medical terminology. The system generates pronunciation-aware synthetic audio to create domain-specific benchmarks in hours, bypassing the need for real patient recordings that are restricted by privacy regulations. This approach enables developers to rapidly iterate on ASR model performance for clinical terms like drug names and procedures without compliance overhead.

Training a speech AI (https://www.nvidia.com/en-us/glossary/speech-ai/) model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine, Cefazolin, and Biktarvy are not part of everyday vocabulary. Procedure names, anatomy terms, and specialty-specific diagnoses introduce the same problem in a different form. Off-the-shelf speech systems can sound fluent and still miss the words that matter most to a clinical workflow.

Synthetic data generation (SDG) (https://www.nvidia.com/en-us/glossary/synthetic-data-generation/) can help close this gap, but only if the synthesized speech is phonetically accurate. A text-to-speech (TTS) system that mispronounces a medication or procedure name produces training or evaluation data that teaches the wrong pronunciation. Instead of fixing the original problem, it can make the failure more difficult to detect. When correctly implemented, SDG enables a team to stand up a domain benchmark in hours without collecting real clinical audio or waiting on annotation pipelines or IRB approval.
This post presents a clinical automatic speech recognition (ASR) (https://www.nvidia.com/en-us/glossary/speech-to-text/) workflow for generating pronunciation-aware synthetic audio, reviewing clinical terms, and evaluating recognition quality. NVIDIA agent skills (https://developer.nvidia.com/blog/nvidia-verified-agent-skills-provide-capability-governance-for-ai-agents/) guide the workflow, while NVIDIA NeMo Data Designer (https://github.com/NVIDIA-NeMo/DataDesigner) and NVIDIA Nemotron Speech (https://developer.nvidia.com/topics/ai/nemotron) provide the data generation and speech services.

Why does clinical ASR need a repeatable feedback loop?

Clinical voice AI is becoming part of dictation, ambient documentation, call-center workflows, patient intake, and post-visit follow-up. These systems are expected to understand terms that are rare in general speech but central to the task: medication names, procedure names, anatomy, diagnoses, devices, symptoms, and specialty abbreviations.

Real-world clinical audio is also difficult to collect and share. It can be expensive, slow to annotate, restricted by privacy requirements, and unevenly distributed across specialties and rare terms. Real patient recordings are protected health information (PHI) under HIPAA, which means they cannot be freely shared across teams, checked into version control, or used in automated test pipelines without significant compliance overhead. Synthetic audio contains no PHI by design, making it the only form of clinical speech data a team can freely version, share, and test. Public datasets may not include the exact terminology a deployment depends on.

The practical challenge is not only to generate more data. Developers need a repeatable way to define the target clinical profile, create a benchmark, review pronunciation risk, measure ASR behavior, improve the model, and decide whether the next cycle should expand terms, improve pronunciations, add noise, or fine-tune.
How are AI agent skills helpful for clinical voice AI?

Agent skills guide a developer through the repeated steps of clinical ASR evaluation: defining a profile, building a term-centered benchmark, reviewing pronunciations, generating synthetic audio, measuring ASR behavior, and choosing the next iteration.

In this post, the flywheel is the full improvement loop: build the benchmark, evaluate ASR behavior, use the results to decide what to change, and reevaluate after the change. The pipeline is one pass through part of that loop, such as generating sentences, adding pronunciation markup, synthesizing audio, and writing the manifest.

The pipeline begins as a conversation, not a static config file. You can run the build skill in any agent harness (Claude Code or Codex, for example) and describe the clinical workflow you want to stress test. The digital-health-clinical-asr-build (https://github.com/NVIDIA/skills/tree/main/skills/digital-health-clinical-asr-build) agent skill then asks for the specialty, target setting, known failure terms, and desired contexts before generating a first-pass benchmark configuration.

The skill asks one question at a time, in order:

- What specialty or workflow
- Which ASR failure modes have been observed
- Which terms are encountered daily and which terms are difficult

Common terms become the baseline, and difficult terms become the signal that drives benchmark design.

Developer: /flywheel-build

Agent: What clinical specialty or workflow is this for? Examples: orthopedic practice, oncology dictation, ICU handoff, ambulatory cardiology.

Developer: Orthopedic practice. Focus on post-op instructions and PACU handoff. The ASR system often misses medication names and procedure names.

Agent: I'll start a benchmark profile with drug, procedure, anatomy, and condition terms, then run pronunciation QA before generating the full audio set.
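The answers gathered in an intake conversation like this one could be held in a small profile structure before any benchmark is generated. The following is a minimal sketch under stated assumptions, not the skill's actual schema: the `ClinicalProfile` class, its field names, and the `benchmark_terms` helper are hypothetical, and the split between common and difficult terms is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ClinicalProfile:
    """Hypothetical container for the intake answers (not the skill's real schema)."""

    specialty: str
    failure_modes: list[str] = field(default_factory=list)
    common_terms: list[str] = field(default_factory=list)      # baseline
    difficult_terms: list[str] = field(default_factory=list)   # signal

    def benchmark_terms(self) -> list[str]:
        """Return difficult terms first, then common terms.

        Duplicates are dropped case-insensitively, keeping the first spelling seen.
        """
        seen: set[str] = set()
        ordered: list[str] = []
        for term in self.difficult_terms + self.common_terms:
            key = term.casefold()
            if key not in seen:
                seen.add(key)
                ordered.append(term)
        return ordered


# Illustrative values only: which terms count as common or difficult
# would come from the developer's answers, not from this sketch.
profile = ClinicalProfile(
    specialty="orthopedic practice",
    failure_modes=["medication names", "procedure names"],
    common_terms=["Cefazolin", "arthroscopy"],
    difficult_terms=["Ropivacaine", "hemiarthroplasty", "cefazolin"],
)

print(profile.benchmark_terms())
```

Listing difficult terms first is one possible design choice: it keeps the terms that drive benchmark design at the front of any small QA set drawn from the profile.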
This profile-driven start enables the same flywheel to be configured for orthopedic surgery, cardiology, oncology, behavioral health, or any other domain where the vocabulary differs. The agent’s job is to keep the workflow on the right rails: collect the clinical profile, propose or ingest terms, generate a small QA set first, route IPA misses to review, and only then build the full benchmark.

| Category | Example terms |
| --- | --- |
| Drugs | Cefazolin, Ketorolac, Ropivacaine, Enoxaparin, Tranexamic acid |
| Procedures | Total knee arthroplasty, hemiarthroplasty, ORIF, arthroscopy |
| Anatomy | Acetabulum, tibial plateau, femoral neck, iliopsoas |
| Conditions | Hemarthrosis, osteomyelitis, compartment syndrome, femoroacetabular impingement |

Table 1. Example clinical term categories for an orthopedic practice profile

How to generate TTS-ready synthetic audio from clinical seed terms

Starting from the profile-specific term list, the pipeline uses NeMo Data Designer (https://github.com/NVIDIA-NeMo/DataDesigner) to expand seed terms into a richer dataset. NeMo Data Designer generates high-quality synthetic data from scratch or from seed data. Developers define the output columns and the dependencies between them. NeMo Data Designer resolves the dependencies while handling batching, parallel execution, validation, and preview or full-run execution.

In this flywheel, the output columns produce a complete synthetic speech record: a unique sample ID, a clinical sentence containing the target term, a pronunciation source, a Speech Synthesis Markup Language (SSML) sentence with phoneme markup when available, and the target path for the synthesized audio.

For this pipeline, five columns transform a clinical term into a phoneme-annotated, TTS-ready sentence (Figure 1).
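To make the five-column record concrete, here is a minimal sketch in plain Python. It does not use the NeMo Data Designer API. The `build_record` function, the snake_case column keys, the `sample_00001` ID format, and the `audio/` directory with a `.wav` extension are all assumptions made for illustration. The markup uses the standard W3C SSML `<phoneme alphabet="ipa" ph="...">` element; whether a particular TTS service accepts it exactly in this form should be checked against that service's documentation.

```python
import re
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_record(index: int, term: str, sentence: str,
                 ipa: Optional[str] = None, audio_dir: str = "audio") -> dict:
    """Build one row of the text dataset (hypothetical helper, assumed column names)."""
    # The sentence must contain the exact target term as a whole word;
    # otherwise the row would not test the intended entity.
    match = re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", sentence)
    if match is None:
        raise ValueError(f"sentence does not contain the exact term {term!r}")

    sample_id = f"sample_{index:05d}"  # assumed ID format

    if ipa:
        # Wrap only the first occurrence of the term in phoneme markup.
        phoneme = f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>{escape(term)}</phoneme>"
        body = escape(sentence[:match.start()]) + phoneme + escape(sentence[match.end():])
    else:
        # No reviewed pronunciation available: fall back to plain text.
        body = escape(sentence)

    return {
        "sample_id": sample_id,
        "sentence": sentence,                 # ASR reference transcript
        "ipa_pronunciation": ipa,             # None flags a review gap
        "ssml_sentence": f"<speak>{body}</speak>",
        "audio_filepath": f"{audio_dir}/{sample_id}.wav",  # assumed layout
    }


record = build_record(
    1,
    "Acetaminophen",
    "The nurse administered Acetaminophen to the patient after surgery to manage mild pain.",
    ipa="əˌsiːtəˈmɪnəfɛn",
)
print(record["ssml_sentence"])
```

Raising an error when the exact term is missing keeps the check close to where the row is built; a caller could instead catch the error and regenerate the sentence.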
| Column | Purpose | Skill use |
| --- | --- | --- |
| sample id | Unique ID for the generated sample | Keeps audio files, transcripts, and metric rows aligned |
| sentence | Clinical sentence containing the exact target term | Becomes the ASR reference transcript |
| ipa pronunciation | Reviewed or dictionary-derived pronunciation candidate | Drives phoneme injection and flags review gaps |
| ssml sentence | Sentence wrapped in SSML with phoneme markup when available | Becomes the TTS input |
| audio filepath | Target path for the synthesized audio file | Becomes the manifest audio path |

Table 2. Core columns in the generated text dataset

The generated sentence prompt should preserve the exact target term. If the model substitutes a brand name, generic equivalent, abbreviation, or spelling variant, the benchmark no longer tests the intended entity. The agent skill can check for that condition and regenerate or reject rows that do not contain the exact term.

| Drug | Sentence | ipa pronunciation | ssml sentence | audio filepath |
| --- | --- | --- | --- | --- |
| Acetaminophen | The nurse administered Acetaminophen to the patient after surgery to manage mild pain. | əˌsiːtəˈmɪnəfɛn |