{"slug": "inworld-tts-paralinguistic-tags-don-t-work-here-s-what-does", "title": "Inworld TTS Paralinguistic Tags Don't Work — Here's What Does", "summary": "Inworld's TTS-1.5 Max model ignores common paralinguistic tags like `[laugh]` and `[sigh]`, rendering them as silence or literal text instead of expressive audio. Developers at HoneyChat, a Telegram-native AI companion, tested multiple tag variants across 26 archetypes and 15 languages before discovering that SSML `<break>` tags, ellipsis-based pauses, and spelled-out onomatopoeia like \"ha-ha\" reliably produce the intended vocal effects.", "body_md": "If you've worked with expressive TTS in the last year you've probably seen the pattern:\n\n```\nShe paused. [sigh] \"Fine, you can come in.\"\n```\n\nInline paralinguistic tags. Half the model demos use them. So when we wired up **Inworld TTS-1.5 Max** for [HoneyChat](https://honeychat.bot/) — a Telegram-native AI companion where voice messages are a first-class output — we sprinkled `[laugh]`, `[sigh]`, `[breathe]` through the prompts and shipped.\n\nThe audio sounded fine. Just… exactly the same as before. No laugh. No sigh. The tags were getting read out as silence at best, and as the literal text \"sigh\" at worst, depending on the voice.\n\nWe tested all the variants we could find. None of them moved the needle.\n\n**HoneyChat voice stack at a glance:**\n\n`en, ru, ja, zh, ko, es, fr, de, it, pt, pl, hi, ar, he, nl`.\n\n`voiceId` strings in `config/archetype_voice_ids.json`. Generated via the Voice Design API and managed with `core/voice_design.py`.\n\n`core/voice_clone_manager.py`) — persistent `voiceId` minted from a WAV/MP3 sample.\n\n`core/voice_cache.py`.\n\n`VOICE_GENDER_MALE`/`VOICE_GENDER_FEMALE`, not `\"male\"`/`\"female\"` strings. 
Passing the strings 400s silently.\n\nTried on the same sentence, same voice, side-by-side audio comparison:\n\n| Pattern | What it did |\n|---|---|\n| `[laugh]` `[sigh]` | Silence in output |\n| `(laughs)` `(sighs)` | Sometimes read literally |\n| `*laughs*` `*sighs*` | Silence (asterisks get stripped) |\n| `<laugh/>` `<sigh/>` | Silence (not valid SSML on Inworld) |\n| `<emotion>laugh</emotion>` | Silence |\n\nThe Inworld API does not document support for any of these. We had assumed (because every other TTS post on the internet uses them) that they were a universal convention. They are not.\n\nWhat Inworld *does* expose is **temperature** and `speakingRate`.\n\nAfter enough A/B-ing across 26 archetypes × 15 languages, four patterns reliably change the audio output.\n\n```\n\"You did *what?*\"\n```\n\nThe asterisks get stripped from the spoken text but the emphasised word lands with audible stress. Works in every voice we tried. The cheapest, highest-hit-rate marker.\n\n```\n\"Fine... you can come in.\"\n```\n\nThree dots produce a real pause with a tonal drop — the voice equivalent of a sigh, without trying to fake `[sigh]`. Five dots for a longer pause. The model interprets them as prosodic cues.\n\n`<break>` for hard pauses\n\n```\n<speak>\n  She paused. <break time=\"0.4s\"/> \"Fine, you can come in.\"\n</speak>\n```\n\nInworld accepts a useful subset of SSML, and `<break>` is the one that matters most for expressive speech. `0.2s` for a beat, `0.4s` for a sigh-pause, `0.8s` for a beat-before-a-line-delivery moment. Wrap the whole text in `<speak>` and the parser handles it.\n\n```\n\"Mmm... ha-ha, you're right.\"\n\"ahh... I needed that.\"\n```\n\nThe model *will* render `ha-ha`, `mmm`, `ahh`, `oh`, `nnn` as the actual sound, because they're spellings of sounds rather than meta-tags. 
They sound far more natural than a synthesised `[laugh]` even when one exists.\n\nFor emotional/intimate scenes, rhythmic repeats (`ah... ah... ah`) carry actual prosody. We use this for breath patterns where another TTS would want a `[breathe]` marker.\n\nIn `core/voice.py` we run every chunk through `enrich_for_tts()` (line ~772) before handing it to Inworld. Regex-based, language-aware, idempotent:\n\n```python\n# _STRIP_FAKE_TAGS, _ELLIPSIS_TO_BREAK and _detect_mood_params are defined\n# elsewhere in the module and are not shown in this excerpt.\ndef enrich_for_tts(text: str, lang: str = \"en\") -> tuple[str, dict]:\n    \"\"\"Return (preprocessed_text, request_params).\n    Strips fake paralinguistic tags, adds SSML breaks where appropriate,\n    and bumps temperature/speakingRate for high-emotion scenes.\"\"\"\n    text = _STRIP_FAKE_TAGS.sub(\"\", text)\n    text = _ELLIPSIS_TO_BREAK.sub(r'<break time=\"0.3s\"/>', text)\n    # Wrap only once, so a second pass over already-enriched text is a no-op.\n    if \"<break\" in text and not text.startswith(\"<speak>\"):\n        text = f\"<speak>{text}</speak>\"\n    params = _detect_mood_params(text, lang)\n    return text, params\n```\n\nThe mood detector looks for emotional cues (intensity words, repeated punctuation, onomatopoeia density) and bumps `temperature` and `speakingRate` for the more expressive scenes. Same model, same voice, much more dynamic output, all without any inline tag that the model would have ignored.\n\n`[laugh]`/`[sigh]` is universal.\n\n`[sigh]` that emits silence looks identical to one that emits a sigh in any log.\n\n`temperature`, `speakingRate`, and a useful subset of SSML — not inline tags.\n\n`\"ahh...\"` is a thing the model can read; `[sigh]` is a meta-instruction it can't.\n\nThe audio quality jump from these four patterns is meaningful — users notice. The cost is a 30-line preprocessor and the courage to delete every `[laugh]` your team has been sprinkling for months.\n\nThis is from production work at **HoneyChat** — a Telegram-native AI companion where voice messages are a first-class output. 
Canonical version:\n\n— *HoneyChat Engineering*\n\n`temperature`, `speakingRate`), SSML subset, voice design API.\n\n`<break>`, `<speak>`, prosody elements.", "url": "https://wpnews.pro/news/inworld-tts-paralinguistic-tags-don-t-work-here-s-what-does", "canonical_source": "https://dev.to/sm1ck/inworld-tts-paralinguistic-tags-dont-work-heres-what-does-50pj", "published_at": "2026-05-31 01:42:57+00:00", "updated_at": "2026-05-31 02:12:29.441210+00:00", "lang": "en", "topics": ["artificial-intelligence", "natural-language-processing", "ai-products", "ai-tools", "generative-ai"], "entities": ["Inworld", "HoneyChat", "Inworld TTS-1.5 Max", "Telegram", "Voice Design API"], "alternates": {"html": "https://wpnews.pro/news/inworld-tts-paralinguistic-tags-don-t-work-here-s-what-does", "markdown": "https://wpnews.pro/news/inworld-tts-paralinguistic-tags-don-t-work-here-s-what-does.md", "text": "https://wpnews.pro/news/inworld-tts-paralinguistic-tags-don-t-work-here-s-what-does.txt", "jsonld": "https://wpnews.pro/news/inworld-tts-paralinguistic-tags-don-t-work-here-s-what-does.jsonld"}}