{"slug": "reference-based-prosody-and-rhythm-evaluation-for-spoken-dialogue-systems", "title": "Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems", "summary": "Researchers propose a reference-based evaluation protocol for prosody and rhythm in speech-to-speech AI agents, using 4000+ hours of dyadic English conversation to build matched reference regimes for metrics such as F0 and speaking rate. On held-out human data, matched references yield out-of-regime flag rates closer to the nominal 10% than pooled statistics, so the protocol serves as a behavioral plausibility check for conversational AI systems.", "body_md": "arXiv:2606.31055v1\nAbstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with model-predicted speaker traits and interaction state, pooled human statistics can be poorly calibrated for evaluating a particular output. Using 4000+ hours of dyadic English conversation from the Seamless Interaction dataset, we construct matched reference regimes for $F_0$ mean, $F_0$ expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. We then define a percentile-based evaluation protocol: extract the same metrics from an S2S output waveform, compare them to the closest matched human reference stratum, and report percentile deviations or 5th-95th percentile out-of-regime flags. On held-out human rows, pooled references over-flag state-conditioned $F_0$ expressivity and rhythm, while matched references return flag rates closer to the nominal 10% and make deviation direction interpretable. 
These outputs serve as behavioral plausibility checks that complement, rather than replace, perceptual and user-centered evaluation.", "url": "https://wpnews.pro/news/reference-based-prosody-and-rhythm-evaluation-for-spoken-dialogue-systems", "canonical_source": "https://arxiv.org/abs/2606.31055", "published_at": "2026-07-01 04:00:00+00:00", "updated_at": "2026-07-01 04:23:18.449961+00:00", "lang": "en", "topics": ["artificial-intelligence", "natural-language-processing", "ai-research", "ai-agents"], "entities": ["Seamless Interaction dataset"], "alternates": {"html": "https://wpnews.pro/news/reference-based-prosody-and-rhythm-evaluation-for-spoken-dialogue-systems", "markdown": "https://wpnews.pro/news/reference-based-prosody-and-rhythm-evaluation-for-spoken-dialogue-systems.md", "text": "https://wpnews.pro/news/reference-based-prosody-and-rhythm-evaluation-for-spoken-dialogue-systems.txt", "jsonld": "https://wpnews.pro/news/reference-based-prosody-and-rhythm-evaluation-for-spoken-dialogue-systems.jsonld"}}