{"slug": "unlocking-fine-grained-and-within-utterance-speaking-style-control-in-prompt-to", "title": "Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models", "summary": "Researchers propose techniques that add fine-grained and within-utterance speaking style control to existing prompt-based text-to-speech models. Inter-utterance interpolation along direction vectors between contrastive style prompts achieves a 99-100% gender conversion success rate, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. For time-varying style transitions within a single utterance, KV-cache swapping and sliding-window attention masking counter the attention bias toward early tokens in autoregressive TTS decoders.", "body_md": "arXiv:2605.27376v1 Announce Type: new\nAbstract: While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. 
Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.", "url": "https://wpnews.pro/news/unlocking-fine-grained-and-within-utterance-speaking-style-control-in-prompt-to", "canonical_source": "https://arxiv.org/abs/2605.27376", "published_at": "2026-05-28 04:00:00+00:00", "updated_at": "2026-05-28 04:34:00.186375+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "neural-networks", "generative-ai", "natural-language-processing"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/unlocking-fine-grained-and-within-utterance-speaking-style-control-in-prompt-to", "markdown": "https://wpnews.pro/news/unlocking-fine-grained-and-within-utterance-speaking-style-control-in-prompt-to.md", "text": "https://wpnews.pro/news/unlocking-fine-grained-and-within-utterance-speaking-style-control-in-prompt-to.txt", "jsonld": "https://wpnews.pro/news/unlocking-fine-grained-and-within-utterance-speaking-style-control-in-prompt-to.jsonld"}}