{"slug": "evaluate-clinical-asr-models-faster-with-agent-skills-and-nvidia-nemotron-speech", "title": "Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech", "summary": "NVIDIA has introduced a new workflow using agent skills and Nemotron Speech to accelerate the evaluation of clinical automatic speech recognition (ASR) models, addressing the challenge of accurately recognizing rare medical terminology. The system generates pronunciation-aware synthetic audio to create domain-specific benchmarks in hours, bypassing the need for real patient recordings that are restricted by privacy regulations. This approach enables developers to rapidly iterate on ASR model performance for clinical terms like drug names and procedures without compliance overhead.", "body_md": "Training a [speech AI](https://www.nvidia.com/en-us/glossary/speech-ai/) model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine, Cefazolin, and Biktarvy are not part of everyday vocabulary. Procedure names, anatomy terms, and specialty-specific diagnoses introduce the same problem in a different form. Off-the-shelf speech systems can sound fluent and still miss the words that matter most to a clinical workflow.\n\n[Synthetic data generation (SDG)](https://www.nvidia.com/en-us/glossary/synthetic-data-generation/) can help close this gap, but only if the synthesized speech is phonetically accurate. A text-to-speech (TTS) system that mispronounces a medication or procedure name produces training or evaluation data that teaches the wrong pronunciation. Instead of fixing the original problem, it can make the failure more difficult to detect. 
When correctly implemented, SDG enables a team to stand up a domain benchmark in hours without collecting real clinical audio or waiting on annotation pipelines or IRB approval.\n\nThis post presents a clinical [automatic speech recognition (ASR)](https://www.nvidia.com/en-us/glossary/speech-to-text/) workflow for generating pronunciation-aware synthetic audio, reviewing clinical terms, and evaluating recognition quality. [NVIDIA agent skills](https://developer.nvidia.com/blog/nvidia-verified-agent-skills-provide-capability-governance-for-ai-agents/) guide the workflow, while [NVIDIA NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) and [NVIDIA Nemotron Speech](https://developer.nvidia.com/topics/ai/nemotron) provide the data generation and speech services.\n\n## Why does clinical ASR need a repeatable feedback loop?\n\nClinical voice AI is becoming part of dictation, ambient documentation, call-center workflows, patient intake, and post-visit follow-up. These systems are expected to understand terms that are rare in general speech but central to the task: medication names, procedure names, anatomy, diagnoses, devices, symptoms, and specialty abbreviations.\n\nReal-world clinical audio is also difficult to collect and share. It can be expensive, slow to annotate, restricted by privacy requirements, and unevenly distributed across specialties and rare terms. Real patient recordings are protected health information under HIPAA, which means they cannot be freely shared across teams, checked into version control or used in automated test pipelines without significant compliance overhead. Synthetic audio contains no PHI by design, making it the only form of clinical speech data a team can version, share, and test. Public datasets may not include the exact terminology a deployment depends on.\n\nThe practical challenge is not only to generate more data. 
Developers need a repeatable way to define the target clinical profile, create a benchmark, review pronunciation risk, measure ASR behavior, improve the model, and decide whether the next cycle should expand terms, improve pronunciations, add noise, or fine-tune.\n\n**How are AI agent skills helpful for clinical voice AI?**\n\nAgent skills guide a developer through the repeated steps of clinical ASR evaluation: defining a profile, building a term-centered benchmark, reviewing pronunciations, generating synthetic audio, measuring ASR behavior, and choosing the next iteration.\n\nIn this post, the flywheel is the full improvement loop: build the benchmark, evaluate ASR behavior, use the results to decide what to change, and reevaluate after the change. The pipeline is one pass through part of that loop, such as generating sentences, adding pronunciation markup, synthesizing audio, and writing the manifest.\n\nThe pipeline begins as a conversation, not a static config file. You can run the build skill in any agent harness (Claude Code or Codex, for example) and describe the clinical workflow you want to stress test. The [digital health clinical asr build](https://github.com/NVIDIA/skills/tree/main/skills/digital-health-clinical-asr-build) agent skill then asks for the specialty, target setting, known failure terms, and desired contexts before generating a first-pass benchmark configuration.\n\nThe skill asks one question at a time, in order:\n\n- What specialty or workflow\n- Which ASR failure modes have been observed\n- Which terms are encountered daily and which terms are difficult\n\nCommon terms become the baseline, and difficult terms become the signal that drives benchmark design.\n\n```\nDeveloper:\n/flywheel-build\n\nAgent:\nWhat clinical specialty or workflow is this for?\nExamples: orthopedic practice, oncology dictation, ICU handoff, ambulatory cardiology.\n\nDeveloper:\nOrthopedic practice. 
Focus on post-op instructions and PACU handoff.\nThe ASR system often misses medication names and procedure names.\n\nAgent:\nI'll start a benchmark profile with drug, procedure, anatomy, and condition terms, then run \npronunciation QA before generating the full audio set.\n```\n\nThis profile-driven start enables the same flywheel to be configured for orthopedic surgery, cardiology, oncology, behavioral health, or any other domain where the vocabulary differs. The agent’s job is to keep the workflow on the right rails: collect the clinical profile, propose or ingest terms, generate a small QA set first, route IPA misses to review, and only then build the full benchmark.\n\n| Category | Example terms |\n|---|---|\n| Drugs | Cefazolin, Ketorolac, Ropivacaine, Enoxaparin, Tranexamic acid |\n| Procedures | Total knee arthroplasty, hemiarthroplasty, ORIF, arthroscopy |\n| Anatomy | Acetabulum, tibial plateau, femoral neck, iliopsoas |\n| Conditions | Hemarthrosis, osteomyelitis, compartment syndrome, femoroacetabular impingement |\n\n*Table 1. Example clinical term categories for an orthopedic practice profile*\n\n## How to generate TTS-ready synthetic audio from clinical seed terms\n\nStarting from the profile-specific term list, the pipeline uses [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) to expand seed terms into a richer dataset. NeMo Data Designer generates high-quality synthetic data from scratch or from seed data. Developers define the output columns and the dependencies between them.\n\nNeMo Data Designer resolves the dependencies while handling batching, parallel execution, validation, and preview or full-run execution. 
In this flywheel, the output columns produce a complete synthetic speech record: a unique sample ID, a clinical sentence containing the target term, a pronunciation source, a Speech Synthesis Markup Language (SSML) sentence with phoneme markup when available, and the target path for the synthesized audio.\n\nFor this pipeline, five columns transform a clinical term into a phoneme-annotated, TTS-ready sentence (Figure 1).\n\n| Column | Purpose | Skill use |\n|---|---|---|\n| sample_id | Unique ID for the generated sample | Keeps audio files, transcripts, and metric rows aligned |\n| sentence | Clinical sentence containing the exact target term | Becomes the ASR reference transcript |\n| ipa_pronunciation | Reviewed or dictionary-derived pronunciation candidate | Drives phoneme injection and flags review gaps |\n| ssml_sentence | Sentence wrapped in SSML with phoneme markup when available | Becomes the TTS input |\n| audio_filepath | Target path for the synthesized audio file | Becomes the manifest audio path |\n\n*Table 2. Core columns in the generated text dataset*\n\nThe generated sentence prompt should preserve the exact target term. If the model substitutes a brand name, generic equivalent, abbreviation, or spelling variant, the benchmark no longer tests the intended entity. The agent skill can check for that condition and regenerate or reject rows that do not contain the exact term.\n\n| Drug | Sentence | ipa_pronunciation | ssml_sentence | audio_filepath |\n|---|---|---|---|---|\n| Acetaminophen | The nurse administered Acetaminophen to the patient after surgery to manage mild pain. | əˌsiːtəˈmɪnəfɛn | `<speak>The nurse administered <phoneme alphabet=\"ipa\" ph=\"əˌsiːtəˈmɪnəfɛn\">Acetaminophen</phoneme> to the patient after surgery to manage mild pain.</speak>` | data/audio/audio_Acetaminophen_3c7a1f02.wav |\n\n*Table 3. Example enriched row from the text dataset*\n\n### SSML phoneme tag injection\n\nSSML is an XML-based markup language that provides TTS engines with instructions on how to synthesize speech. 
It is critical for controlling aspects like pronunciation, pacing, volume, and emphasis. The SSML step wraps the generated sentence in a `<speak>` element and injects a `<phoneme alphabet=\"ipa\">` tag around every occurrence of the target term. The implementation uses a case-insensitive regex so the original casing in the sentence is preserved while the match remains robust.\n\n```\n<speak>A forty-five year old patient was prescribed\n<phoneme alphabet=\"ipa\" ph=\"əˌsiːtəˈmɪnəfɛn\">Acetaminophen</phoneme>\nonce daily to manage mild pain.</speak>\n```\n\n### Manual pronunciation review for IPA gaps\n\nDictionary lookup covers many clinical terms, but not all of them. Newer drug names, trade names, rare procedure terms, and specialty-specific phrases may be missing or may return a pronunciation that requires review. The flywheel handles those gaps with an explicit manual review path.\n\nWhen a trusted dictionary pronunciation is unavailable, an LLM-backed agent harness can propose candidate IPA strings. The important boundary is that the LLM proposal is not treated as ground truth. It is a candidate that must pass validation and human review.\n\nThe manual pronunciation loop is as follows:\n\n- Flag rows with missing or low-confidence IPA\n- Use the agent harness to propose one or more IPA candidates\n- Validate the candidate against the TTS phoneme inventory\n- Synthesize a short QA clip for the term in context\n- Review to accept, edit, or reject the candidate\n- Write accepted pronunciations to a reviewed override file\n- Regenerate the affected SSML and audio\n\nThis process turns pronunciation gaps into a small review queue instead of a hidden benchmark-quality problem. For example, in the orthopedic practice reference session, terms such as Femoroacetabular impingement, Hemiarthroplasty, Ketorolac, Pertrochanteric, and Ropivacaine needed review or overrides. 
After review, the full benchmark generated 67 audio samples with no rows relying on unreviewed native TTS pronunciation.\n\nThe loop only works if the agent actually stops and waits for the human at the right moment. The skill itself enforces that pause. The instructions in the skills are written for the agent, not the developer, and they tell the agent in plain language that it cannot move on until the user has listened to the clips.\n\n**How to synthesize the audio and produce the manifest**\n\nOnce each row has an SSML sentence and target audio path, the workflow synthesizes one audio file per generated sample. [NVIDIA Magpie TTS Multilingual](https://build.nvidia.com/nvidia/magpie-tts-multilingual/modelcard) is a good fit for this stage because it supports SSML phoneme tags with IPA and ARPAbet. This allows the synthesizer to render the clinical term using the reviewed phoneme sequence instead of relying only on its own grapheme-to-phoneme prediction.\n\nThe final output is a NeMo-compatible JSONL manifest. Each line links an audio file to its transcript and metadata:\n\n```\n{\n  \"audio_filepath\": \"data/audio/audio_Acetaminophen_3c7a1f02.wav\",\n  \"text\": \"The nurse administered Acetaminophen to the patient after surgery to manage mild pain.\",\n  \"duration\": 3.914,\n  \"term\": \"Acetaminophen\",\n  \"entity_category\": \"drug\",\n  \"ipa_source\": \"reviewed\"\n}\n```\n\nThe manifest is the handoff point between SDG, ASR evaluation, and model adaptation. It is also where the benchmark keeps the metadata needed for slicing results by entity category, pronunciation source, context type, voice, or acoustic condition.\n\n**What is the value of a skill-native clinical ASR quality flywheel?**\n\nWhile generating phonetically controlled audio is useful on its own, the greater value is an [AI agent](https://www.nvidia.com/en-us/ai/) working together with a developer through the improvement loop. The user starts with a clinical profile. 
The build skill creates a benchmark. The evaluation skill reports where the ASR system struggles. The adaptation skill helps decide whether to fine-tune, expand the term list, improve pronunciation coverage, or add harder acoustic conditions. The reevaluation step then checks whether the change helped.\n\nThe evaluation skill includes one counter-intuitive routing rule worth surfacing. If audio generated with Merriam-Webster pronunciations scores well but audio that falls back on Magpie's native pronunciation scores poorly, the skill routes the user back to build, not to fine-tune. That pattern is a pronunciation-coverage gap, not a model gap. Fine-tuning over a TTS pronunciation gap trains the ASR model on the TTS model’s mispronunciations instead of the correct term. ASR transcription itself is served by [NVIDIA Nemotron Speech](https://github.com/NVIDIA-NeMo/NeMo).\n\n| Stage | Developer intent | Skill behavior |\n|---|---|---|\n| Setup | Prepare the environment and check access | Verifies dependencies, credentials, and smoke tests |\n| Build | Create a profile-specific benchmark | Collects specialty context, proposes terms, runs pronunciation QA, and generates the manifest |\n| Evaluate | Measure ASR behavior on the benchmark | Runs transcription and reports aggregate and entity-level metrics |\n| Adapt | Improve ASR quality based on failure patterns | Gates fine-tuning behind two thresholds, priority-category KER > 0.3 and manifest ≥ 100 rows, and otherwise routes back to build to grow the manifest. Fine-tuning runs use the stock |\n\n*Table 4. Skill stages in the ASR quality flywheel*\n\n**How to benchmark ASR performance**\n\nThe flywheel still reports familiar ASR metrics, but the skill presents them as decision signals. If pronunciation QA is incomplete, the next step may be review rather than model training. If entity errors cluster in one category, the next step may be more targeted data. 
If errors persist across reviewed terms, adaptation may be justified.\n\n| Metric | What it measures | Skill use |\n|---|---|---|\n| WER | Word error rate across the full sentence | General ASR quality signal |\n| CER | Character error rate | Near-miss signal for long clinical terms |\n| KER | Keyword error rate on the target clinical entity | Primary signal for whether workflow-critical terms are recognized |\n| SER | Sentence error rate | Shows whether any error occurred in the sentence |\n\n*Table 5. Metrics reported by the evaluation skill*\n\nIn the orthopedic practice simulation, the entity-level metrics made the next step clear: medication names were the weakest category, and the follow-up cycle focused on pronunciation review, additional drug-name coverage, and model adaptation. The result was not a production benchmark, but it showed how the flywheel can turn a clinical ASR failure pattern into a concrete improvement path.\n\n**What are the limitations of the flywheel?**\n\nSynthetic audio is not a substitute for real clinical audio. It is a controllable way to create targeted stress tests, especially for rare terms, but production validation still requires real-world audio from the intended setting. Pronunciation control still needs human review. Dictionary lookup works well for many medical terms, but not every term appears in a trusted dictionary. Automated pronunciation proposals can accelerate review, but they should not be treated as ground truth without audio inspection.\n\nThe current benchmark is small. The orthopedic practice simulation demonstrates the flywheel on a small set of generated samples. Stronger claims require held-out terms, more contexts, more speakers, acoustic perturbations, repeated runs, and real audio. Clean-audio performance is not enough. Clinical environments include alarms, overlapping speakers, masks, telehealth microphones, room reverberation, ambulance noise, and dictation artifacts. 
The next version of the benchmark should include acoustic stress profiles.\n\n**Get started with clinical ASR agent skills**\n\nClinical ASR improvement requires more than a one-time dataset or aggregate score. You need a workflow that helps you define the clinical profile, generate pronunciation-aware synthetic audio, measure ASR quality on the terms that matter, adapt the model when appropriate, and reevaluate the result.\n\nThe flywheel described in this post starts with a simple conversation and ends with a repeatable ASR flywheel. NVIDIA NeMo Data Designer handles the text-enrichment layer. Magpie TTS Multilingual synthesizes pronunciation-controlled audio. The NeMo-compatible manifest connects generation, evaluation, adaptation, and reporting. AI agent skills make the process repeatable by guiding term curation, IPA review, benchmark generation, scoring, and next-step decisions.\n\nThe orthopedic practice simulation shows the workflow pattern: configure a profile-specific term list, generate reviewed synthetic audio, inspect entity-level errors, and decide the next action. The larger contribution is the repeatable loop: profile-driven benchmarks, pronunciation-aware TTS, explicit review gates, and entity-level evaluation.\n\nReady to get started? 
Explore [NVIDIA agent skills](https://github.com/NVIDIA/skills) to use the clinical ASR agent workflow as a guide for building profile-driven benchmarks, reviewing pronunciations, generating synthetic clinical audio, and evaluating ASR output with entity-level metrics.", "url": "https://wpnews.pro/news/evaluate-clinical-asr-models-faster-with-agent-skills-and-nvidia-nemotron-speech", "canonical_source": "https://developer.nvidia.com/blog/evaluate-clinical-asr-models-faster-with-agent-skills-and-nvidia-nemotron-speech/", "published_at": "2026-06-09 15:00:00+00:00", "updated_at": "2026-06-11 19:51:01.426332+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "natural-language-processing", "generative-ai", "ai-agents"], "entities": ["NVIDIA", "Nemotron Speech", "Acetaminophen", "Amlodipine", "Cefazolin", "Biktarvy"], "alternates": {"html": "https://wpnews.pro/news/evaluate-clinical-asr-models-faster-with-agent-skills-and-nvidia-nemotron-speech", "markdown": "https://wpnews.pro/news/evaluate-clinical-asr-models-faster-with-agent-skills-and-nvidia-nemotron-speech.md", "text": "https://wpnews.pro/news/evaluate-clinical-asr-models-faster-with-agent-skills-and-nvidia-nemotron-speech.txt", "jsonld": "https://wpnews.pro/news/evaluate-clinical-asr-models-faster-with-agent-skills-and-nvidia-nemotron-speech.jsonld"}}