Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech

NVIDIA has introduced a new workflow using agent skills and Nemotron Speech to accelerate the evaluation of clinical automatic speech recognition (ASR) models, addressing the challenge of accurately recognizing rare medical terminology. The system generates pronunciation-aware synthetic audio to create domain-specific benchmarks in hours, bypassing the need for real patient recordings that are restricted by privacy regulations. This approach enables developers to rapidly iterate on ASR model performance for clinical terms like drug names and procedures without compliance overhead.

Training a speech AI (https://www.nvidia.com/en-us/glossary/speech-ai/) model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine, Cefazolin, and Biktarvy are not part of everyday vocabulary. Procedure names, anatomy terms, and specialty-specific diagnoses introduce the same problem in a different form. Off-the-shelf speech systems can sound fluent and still miss the words that matter most to a clinical workflow.

Synthetic data generation (SDG, https://www.nvidia.com/en-us/glossary/synthetic-data-generation/) can help close this gap, but only if the synthesized speech is phonetically accurate. A text-to-speech (TTS) system that mispronounces a medication or procedure name produces training or evaluation data that teaches the wrong pronunciation. Instead of fixing the original problem, it can make the failure more difficult to detect. When correctly implemented, SDG enables a team to stand up a domain benchmark in hours without collecting real clinical audio or waiting on annotation pipelines or IRB approval.

This post presents a clinical automatic speech recognition (ASR, https://www.nvidia.com/en-us/glossary/speech-to-text/) workflow for generating pronunciation-aware synthetic audio, reviewing clinical terms, and evaluating recognition quality. NVIDIA agent skills (https://developer.nvidia.com/blog/nvidia-verified-agent-skills-provide-capability-governance-for-ai-agents/) guide the workflow, while NVIDIA NeMo Data Designer (https://github.com/NVIDIA-NeMo/DataDesigner) and NVIDIA Nemotron Speech (https://developer.nvidia.com/topics/ai/nemotron) provide the data generation and speech services.

Why does clinical ASR need a repeatable feedback loop?

Clinical voice AI is becoming part of dictation, ambient documentation, call-center workflows, patient intake, and post-visit follow-up.
These systems are expected to understand terms that are rare in general speech but central to the task: medication names, procedure names, anatomy, diagnoses, devices, symptoms, and specialty abbreviations.

Real-world clinical audio is also difficult to collect and share. It can be expensive, slow to annotate, restricted by privacy requirements, and unevenly distributed across specialties and rare terms. Real patient recordings are protected health information (PHI) under HIPAA, which means they cannot be freely shared across teams, checked into version control, or used in automated test pipelines without significant compliance overhead. Synthetic audio contains no PHI by design, making it the only form of clinical speech data a team can version, share, and test without that overhead. Public datasets may not include the exact terminology a deployment depends on.

The practical challenge is not only to generate more data. Developers need a repeatable way to define the target clinical profile, create a benchmark, review pronunciation risk, measure ASR behavior, improve the model, and decide whether the next cycle should expand terms, improve pronunciations, add noise, or fine-tune.

How are AI agent skills helpful for clinical voice AI?

Agent skills guide a developer through the repeated steps of clinical ASR evaluation: defining a profile, building a term-centered benchmark, reviewing pronunciations, generating synthetic audio, measuring ASR behavior, and choosing the next iteration.

In this post, the flywheel is the full improvement loop: build the benchmark, evaluate ASR behavior, use the results to decide what to change, and reevaluate after the change. The pipeline is one pass through part of that loop, such as generating sentences, adding pronunciation markup, synthesizing audio, and writing the manifest.

The pipeline begins as a conversation, not a static config file.
You can run the build skill in any agent harness (Claude Code or Codex, for example) and describe the clinical workflow you want to stress test. The digital-health-clinical-asr-build (https://github.com/NVIDIA/skills/tree/main/skills/digital-health-clinical-asr-build) agent skill then asks for the specialty, target setting, known failure terms, and desired contexts before generating a first-pass benchmark configuration.

The skill asks one question at a time, in order:

- What specialty or workflow
- Which ASR failure modes have been observed
- Which terms are encountered daily and which terms are difficult

Common terms become the baseline, and difficult terms become the signal that drives benchmark design.

Developer: /flywheel-build
Agent: What clinical specialty or workflow is this for? Examples: orthopedic practice, oncology dictation, ICU handoff, ambulatory cardiology.
Developer: Orthopedic practice. Focus on post-op instructions and PACU handoff. The ASR system often misses medication names and procedure names.
Agent: I'll start a benchmark profile with drug, procedure, anatomy, and condition terms, then run pronunciation QA before generating the full audio set.

This profile-driven start enables the same flywheel to be configured for orthopedic surgery, cardiology, oncology, behavioral health, or any other domain where the vocabulary differs. The agent's job is to keep the workflow on the right rails: collect the clinical profile, propose or ingest terms, generate a small QA set first, route IPA misses to review, and only then build the full benchmark.

| Category | Example terms |
|---|---|
| Drugs | Cefazolin, Ketorolac, Ropivacaine, Enoxaparin, Tranexamic acid |
| Procedures | Total knee arthroplasty, hemiarthroplasty, ORIF, arthroscopy |
| Anatomy | Acetabulum, tibial plateau, femoral neck, iliopsoas |
| Conditions | Hemarthrosis, osteomyelitis, compartment syndrome, femoroacetabular impingement |

Table 1.
Example clinical term categories for an orthopedic practice profile

How to generate TTS-ready synthetic audio from clinical seed terms

Starting from the profile-specific term list, the pipeline uses NeMo Data Designer (https://github.com/NVIDIA-NeMo/DataDesigner) to expand seed terms into a richer dataset. NeMo Data Designer generates high-quality synthetic data from scratch or from seed data. Developers define the output columns and the dependencies between them. NeMo Data Designer resolves the dependencies while handling batching, parallel execution, validation, and preview or full-run execution.

In this flywheel, the output columns produce a complete synthetic speech record: a unique sample ID, a clinical sentence containing the target term, a pronunciation source, a Speech Synthesis Markup Language (SSML) sentence with phoneme markup when available, and the target path for the synthesized audio. For this pipeline, five columns transform a clinical term into a phoneme-annotated, TTS-ready sentence (Figure 1).

| Column | Purpose | Skill use |
|---|---|---|
| sample_id | Unique ID for the generated sample | Keeps audio files, transcripts, and metric rows aligned |
| sentence | Clinical sentence containing the exact target term | Becomes the ASR reference transcript |
| ipa_pronunciation | Reviewed or dictionary-derived pronunciation candidate | Drives phoneme injection and flags review gaps |
| ssml_sentence | Sentence wrapped in SSML with phoneme markup when available | Becomes the TTS input |
| audio_filepath | Target path for the synthesized audio file | Becomes the manifest audio path |

Table 2. Core columns in the generated text dataset

The generated sentence prompt should preserve the exact target term. If the model substitutes a brand name, generic equivalent, abbreviation, or spelling variant, the benchmark no longer tests the intended entity. The agent skill can check for that condition and regenerate or reject rows that do not contain the exact term.
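The exact-term check described above could look roughly like the following. This is a minimal sketch, not the skill's actual implementation: the function names, the row keys (`term`, `sentence`), and the choice to match case-insensitively on whole-phrase boundaries are all assumptions made for illustration.

```python
import re


def contains_exact_term(sentence: str, term: str) -> bool:
    """True if the term appears verbatim (ignoring case) as a whole phrase."""
    # Lookarounds keep the term from matching inside a longer word.
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def split_rows(rows):
    """Partition rows into (kept, rejected) using the exact-term check."""
    kept, rejected = [], []
    for row in rows:
        ok = contains_exact_term(row["sentence"], row["term"])
        (kept if ok else rejected).append(row)
    return kept, rejected


# Hypothetical rows: the second one substitutes a brand name for the target term.
rows = [
    {"term": "Acetaminophen",
     "sentence": "The nurse administered Acetaminophen to the patient after surgery."},
    {"term": "Acetaminophen",
     "sentence": "The nurse administered Tylenol to the patient after surgery."},
]
kept, rejected = split_rows(rows)
print(len(kept), "kept;", len(rejected), "flagged for regeneration")
```

Rejected rows would then be regenerated or dropped before any audio is synthesized, so the reference transcript always contains the entity the benchmark is meant to test.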

| Drug | Sentence | ipa_pronunciation | ssml_sentence | audio_filepath |
|---|---|---|---|---|
| Acetaminophen | The nurse administered Acetaminophen to the patient after surgery to manage mild pain. | əˌsiːtəˈmɪnəfɛn | <speak>The nurse administered <phoneme alphabet="ipa" ph="əˌsiːtəˈmɪnəfɛn">Acetaminophen</phoneme> to the patient after surgery to manage mild pain.</speak> | data/audio/audio_Acetaminophen_3c7a1f02.wav |

Table 3. Example enriched row from the text dataset

SSML phoneme tag injection

SSML is an XML-based markup language that provides TTS engines with instructions on how to synthesize speech. It is critical for controlling aspects like pronunciation, pacing, volume, and emphasis.

The SSML step wraps the generated sentence in a <speak> element and injects a <phoneme alphabet="ipa"> tag around every occurrence of the target term. The implementation uses a case-insensitive regex so the original casing in the sentence is preserved while the match remains robust.

    <speak>A forty-five year old patient was prescribed <phoneme alphabet="ipa" ph="əˌsiːtəˈmɪnəfɛn">Acetaminophen</phoneme> once daily to manage mild pain.</speak>

Manual pronunciation review for IPA gaps

Dictionary lookup covers many clinical terms, but not all of them. Newer drug names, trade names, rare procedure terms, and specialty-specific phrases may be missing or may return a pronunciation that requires review. The flywheel handles those gaps with an explicit manual review path.

When a trusted dictionary pronunciation is unavailable, an LLM-backed agent harness can propose candidate IPA strings. The important boundary is that the LLM proposal is not treated as ground truth. It is a candidate that must pass validation and human review.
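Because a candidate must pass validation before it is injected, the sketch below pairs a simple symbol-inventory check with the case-insensitive phoneme injection described in the SSML section above. It is illustrative only: `ASSUMED_IPA_SYMBOLS`, `unsupported_symbols`, and `inject_phoneme` are hypothetical names, and the tiny inventory stands in for whatever phoneme set the target TTS voice actually documents.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical, deliberately tiny inventory; a real check would load the
# phoneme set published for the TTS voice in use.
ASSUMED_IPA_SYMBOLS = set("əsiːtmɪnfɛˈˌ")


def unsupported_symbols(ipa: str, inventory=ASSUMED_IPA_SYMBOLS) -> list:
    """Return the symbols in a candidate that the inventory does not contain."""
    return sorted({ch for ch in ipa if ch not in inventory})


def inject_phoneme(sentence: str, term: str, ipa: str) -> str:
    """Wrap every case-insensitive occurrence of term in an SSML phoneme tag."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    parts, last = [], 0
    for match in pattern.finditer(sentence):
        # Escape the text between matches so the result stays valid XML.
        parts.append(escape(sentence[last:match.start()]))
        # match.group(0) keeps the casing used in the sentence.
        parts.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
            f"{escape(match.group(0))}</phoneme>"
        )
        last = match.end()
    parts.append(escape(sentence[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


candidate = "əˌsiːtəˈmɪnəfɛn"
if not unsupported_symbols(candidate):
    print(inject_phoneme("The patient was prescribed acetaminophen once daily.",
                         "Acetaminophen", candidate))
```

A per-character check like this is cruder than true phoneme tokenization, but it is enough to catch candidates that use symbols the synthesizer cannot render, and anything it accepts still goes to human review.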
The manual pronunciation loop is as follows:

- Flag rows with missing or low-confidence IPA
- Use the agent harness to propose one or more IPA candidates
- Validate the candidate against the TTS phoneme inventory
- Synthesize a short QA clip for the term in context
- Review to accept, edit, or reject the candidate
- Write accepted pronunciations to a reviewed override file
- Regenerate the affected SSML and audio

This process turns pronunciation gaps into a small review queue instead of a hidden benchmark-quality problem. For example, in the orthopedic practice reference session, terms such as Femoroacetabular impingement, Hemiarthroplasty, Ketorolac, Pertrochanteric, and Ropivacaine needed review or overrides. After review, the full benchmark generated 67 audio samples with no rows relying on unreviewed native TTS pronunciation.

The loop only works if the agent actually stops and waits for the human at the right moment. The skill itself enforces that pause. The instructions in the skill are written for the agent, not the developer, and they tell the agent in plain language that it cannot move on until the user has listened to the clips.

How to synthesize the audio and produce the manifest

Once each row has an SSML sentence and target audio path, the workflow synthesizes one audio file per generated sample. NVIDIA Magpie TTS Multilingual (https://build.nvidia.com/nvidia/magpie-tts-multilingual/modelcard) is a good fit for this stage because it supports SSML phoneme tags with IPA and ARPAbet. This allows the synthesizer to render the clinical term using the reviewed phoneme sequence instead of relying only on its own grapheme-to-phoneme prediction.

The final output is a NeMo-compatible JSONL manifest.
Each line links an audio file to its transcript and metadata:

    {"audio_filepath": "data/audio/audio_Acetaminophen_3c7a1f02.wav", "text": "The nurse administered Acetaminophen to the patient after surgery to manage mild pain.", "duration": 3.914, "term": "Acetaminophen", "entity_category": "drug", "ipa_source": "reviewed"}

The manifest is the handoff point between SDG, ASR evaluation, and model adaptation. It is also where the benchmark keeps the metadata needed for slicing results by entity category, pronunciation source, context type, voice, or acoustic condition.

What is the value of a skill-native clinical ASR quality flywheel?

While generating phonetically controlled audio is useful on its own, the greater value is an AI agent (https://www.nvidia.com/en-us/ai/) working with a developer through the improvement loop. The user starts with a clinical profile. The build skill creates a benchmark. The evaluation skill reports where the ASR system struggles. The adaptation skill helps decide whether to fine-tune, expand the term list, improve pronunciation coverage, or add harder acoustic conditions. The reevaluation step then checks whether the change helped.

The evaluation skill includes one counterintuitive routing rule worth surfacing. If audio built from Merriam-Webster-improved pronunciations scores well but Magpie fallback audio scores poorly, the skill routes the user back to build, not to fine-tune. That pattern is a pronunciation-coverage gap, not a model gap. Fine-tuning over a TTS pronunciation gap teaches the ASR model to fit the TTS system's mispronunciations instead of the real term.

ASR transcription itself is served by NVIDIA Nemotron Speech (https://github.com/NVIDIA-NeMo/NeMo).
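The routing rule above can be sketched as a comparison of keyword error rate across pronunciation sources. Everything specific here is an assumption for illustration: the `ipa_source` values `dictionary` and `tts_fallback`, the per-row `term_recognized` flag, and the `good` and `bad` thresholds are hypothetical and are not taken from the skill.

```python
from collections import defaultdict


def ker_by_source(rows):
    """Keyword error rate per pronunciation source (fraction of missed terms)."""
    totals, misses = defaultdict(int), defaultdict(int)
    for row in rows:
        source = row["ipa_source"]
        totals[source] += 1
        if not row["term_recognized"]:
            misses[source] += 1
    return {source: misses[source] / totals[source] for source in totals}


def route(ker, good=0.1, bad=0.3,
          dictionary_source="dictionary", fallback_source="tts_fallback"):
    """Return 'build' when only the fallback-pronunciation audio scores poorly."""
    dictionary_ker = ker.get(dictionary_source)
    fallback_ker = ker.get(fallback_source)
    if dictionary_ker is None or fallback_ker is None:
        return "continue_evaluation"  # not enough evidence to compare sources
    if dictionary_ker <= good and fallback_ker >= bad:
        # Pronunciation-coverage gap: fix pronunciations before any fine-tuning.
        return "build"
    return "continue_evaluation"


# Hypothetical evaluation rows joined from the manifest and ASR results.
rows = (
    [{"ipa_source": "dictionary", "term_recognized": True}] * 4
    + [{"ipa_source": "tts_fallback", "term_recognized": False}] * 3
    + [{"ipa_source": "tts_fallback", "term_recognized": True}]
)
print(route(ker_by_source(rows)))
```

The point of the split is that an aggregate score hides the difference between the two sources; slicing by pronunciation source is what makes the coverage gap visible.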

| Stage | Developer intent | Skill behavior |
|---|---|---|
| Setup | Prepare the environment and check access | Verifies dependencies, credentials, and smoke tests |
| Build | Create a profile-specific benchmark | Collects specialty context, proposes terms, runs pronunciation QA, and generates the manifest |
| Evaluate | Measure ASR behavior on the benchmark | Runs transcription and reports aggregate and entity-level metrics |
| Adapt | Improve ASR quality based on failure patterns | Gates fine-tuning behind two thresholds, priority-category KER 0.3 and manifest ≥ 100 rows, and otherwise routes back to build to grow the manifest. Fine-tuning runs use the stock |

Table 4. Skill stages in the ASR quality flywheel

How to benchmark ASR performance

The flywheel still reports familiar ASR metrics, but the skill presents them as decision signals. If pronunciation QA is incomplete, the next step may be review rather than model training. If entity errors cluster in one category, the next step may be more targeted data. If errors persist across reviewed terms, adaptation may be justified.

| Metric | What it measures | Skill use |
|---|---|---|
| WER | Word error rate across the full sentence | General ASR quality signal |
| CER | Character error rate | Near-miss signal for long clinical terms |
| KER | Keyword error rate on the target clinical entity | Primary signal for whether workflow-critical terms are recognized |
| SER | Sentence error rate | Shows whether any error occurred in the sentence |

Table 5. Metrics reported by the evaluation skill

In the orthopedic practice simulation, the entity-level metrics made the next step clear: medication names were the weakest category, and the follow-up cycle focused on pronunciation review, additional drug-name coverage, and model adaptation. The result was not a production benchmark, but it showed how the flywheel can turn a clinical ASR failure pattern into a concrete improvement path.
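As a rough illustration of how the four metrics in Table 5 might be computed from reference and hypothesis transcripts, here is a self-contained sketch. The normalization (lowercasing, stripping punctuation), the `hypothesis` key, and the definition of KER as the fraction of samples whose target term is missing from the hypothesis are assumptions; the evaluation skill may define them differently.

```python
import re


def tokens(text: str) -> list:
    """Lowercase, drop punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance over any two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def has_phrase(hay: list, needle: list) -> bool:
    """True if needle occurs as a contiguous run of tokens in hay."""
    n = len(needle)
    return any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))


def score(samples) -> dict:
    word_err = word_ref = char_err = char_ref = kw_miss = sent_err = 0
    for s in samples:
        ref, hyp = tokens(s["text"]), tokens(s["hypothesis"])
        word_err += edit_distance(ref, hyp)
        word_ref += len(ref)
        ref_chars, hyp_chars = " ".join(ref), " ".join(hyp)
        char_err += edit_distance(ref_chars, hyp_chars)
        char_ref += len(ref_chars)
        kw_miss += not has_phrase(hyp, tokens(s["term"]))  # assumed KER rule
        sent_err += ref != hyp
    n = len(samples)
    return {"wer": word_err / word_ref, "cer": char_err / char_ref,
            "ker": kw_miss / n, "ser": sent_err / n}


# Hypothetical samples: manifest fields plus an ASR hypothesis per row.
samples = [
    {"term": "Acetaminophen", "text": "The nurse administered Acetaminophen.",
     "hypothesis": "the nurse administered acetaminophen"},
    {"term": "Ketorolac", "text": "Start Ketorolac tonight.",
     "hypothesis": "start key toro lack tonight"},
]
print(score(samples))
```

In this toy pair the sentence-level numbers look moderate while half of the target terms are missed, which is why KER is treated as the primary signal for workflow-critical vocabulary.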

What are the limitations of the flywheel?

Synthetic audio is not a substitute for real clinical audio. It is a controllable way to create targeted stress tests, especially for rare terms, but production validation still requires real-world audio from the intended setting.

Pronunciation control still needs human review. Dictionary lookup works well for many medical terms, but not every term appears in a trusted dictionary. Automated pronunciation proposals can accelerate review, but they should not be treated as ground truth without audio inspection.

The current benchmark is small. The orthopedic practice simulation demonstrates the flywheel on a small set of generated samples. Stronger claims require held-out terms, more contexts, more speakers, acoustic perturbations, repeated runs, and real audio.

Clean-audio performance is not enough. Clinical environments include alarms, overlapping speakers, masks, telehealth microphones, room reverberation, ambulance noise, and dictation artifacts. The next version of the benchmark should include acoustic stress profiles.

Get started with clinical ASR agent skills

Clinical ASR improvement requires more than a one-time dataset or aggregate score. You need a workflow that helps you define the clinical profile, generate pronunciation-aware synthetic audio, measure ASR quality on the terms that matter, adapt the model when appropriate, and reevaluate the result.

The workflow described in this post starts with a simple conversation and ends with a repeatable ASR flywheel. NVIDIA NeMo Data Designer handles the text-enrichment layer. Magpie TTS Multilingual synthesizes pronunciation-controlled audio. The NeMo-compatible manifest connects generation, evaluation, adaptation, and reporting. AI agent skills make the process repeatable by guiding term curation, IPA review, benchmark generation, scoring, and next-step decisions.
The orthopedic practice simulation shows the workflow pattern: configure a profile-specific term list, generate reviewed synthetic audio, inspect entity-level errors, and decide the next action. The larger contribution is the repeatable loop: profile-driven benchmarks, pronunciation-aware TTS, explicit review gates, and entity-level evaluation.

Ready to get started? Explore NVIDIA agent skills (https://github.com/NVIDIA/skills) to use the clinical ASR agent workflow as a guide for building profile-driven benchmarks, reviewing pronunciations, generating synthetic clinical audio, and evaluating ASR output with entity-level metrics.