{"slug": "holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate", "title": "HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate", "summary": "HoLo-ToLk is a pair of small speech-to-text and text-to-speech models built on the tokenizer-free HSL byte substrate. The STT model reaches a character error rate of 0.194, beating a mel-spectrogram baseline (0.213) in the same setup, while TTS remains a feasibility demo with unstable free-run synthesis.", "body_md": "Follow-up to my earlier post on the 0-parameter input layer.\n\nI took the HSL byte substrate (no tokenizer, no learned input embedding) and built two small speech models on top, to see whether “bytes as signal” carries through to audio. I’m calling the line HoLo-ToLk.\n\nSTT (speech → text): the result I’m most confident about.\n\nA char-CTC baseline fed the raw HSL substrate alone is weak (CER ~0.67). Adding a small model-side spectral lens (log-mel + a learnable gated fusion over the frozen substrate) flips it: CER 0.194, beating a mel-spectrogram baseline (0.213) in the same setup, confirmed across 4 seeds. So the honest takeaway is a controlled comparison (substrate + lens > mel, same setup), not a SOTA number (8 kHz, char-CTC, no LM; readable but rough).\n\nTTS (text → speech): here the byte substrate is an even more natural fit, since UTF-8 text bytes go straight in as HSL features, with no tokenizer or vocabulary. A small AR transformer + guided attention + HiFi-GAN gives a single-speaker voice. Held-out teacher-forced mel-L1 is 0.296 (multi-seed) and some samples sound genuinely natural, but free-run synthesis on arbitrary sentences is still rough and unstable. So I’m framing TTS as a feasibility demo, not a usable TTS system.\n\nBoth are research/devlog results, not production or SOTA. 
The two models are separate today; the goal is to unify them into one over time.\n\nTry it (combined demo, both tabs):\n\nSubstrate: pip install hsl-embedding-zero\n\nHappy to answer questions on the lens design or the byte→signal encoding, and very open to critique, especially on the TTS free-run instability.", "url": "https://wpnews.pro/news/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate", "canonical_source": "https://discuss.huggingface.co/t/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate/177216#post_1", "published_at": "2026-06-28 18:25:42+00:00", "updated_at": "2026-06-28 23:20:10.005556+00:00", "lang": "en", "topics": ["machine-learning", "natural-language-processing", "ai-research"], "entities": ["HoLo-ToLk", "HSL", "HiFi-GAN"], "alternates": {"html": "https://wpnews.pro/news/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate", "markdown": "https://wpnews.pro/news/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate.md", "text": "https://wpnews.pro/news/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate.txt", "jsonld": "https://wpnews.pro/news/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate.jsonld"}}