HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate

HoLo-ToLk is a pair of speech-to-text and text-to-speech models built on a tokenizer-free byte substrate. STT reaches a character error rate of 0.194, beating a mel-spectrogram baseline (0.213) in the same setup, while TTS remains a feasibility demo with unstable free-run synthesis.

Follow-up to my earlier post on the 0-parameter input layer. I took the HSL byte substrate (no tokenizer, no learned input embedding) and built two small speech models on top, to see whether “bytes as signal” carries through to audio. I’m calling the line HoLo-ToLk.

STT (speech → text). This is the result I’m most confident about. Feeding the raw HSL substrate to a char-CTC baseline is weak on its own (CER ~0.67). Adding a small model-side spectral lens (log-mel + a learnable gated fusion over the frozen substrate) flips it: CER 0.194, beating a mel-spectrogram baseline (0.213) in the same setup, confirmed across 4 seeds. So the honest takeaway is a controlled comparison (substrate + lens vs. mel, same setup), not a SOTA number (8 kHz, char-CTC, no LM; readable but rough).

TTS (text → speech). Here the byte substrate is even more natural: UTF-8 text bytes go straight in as HSL features, no tokenizer/vocab. A small AR transformer + guided attention + HiFi-GAN gives a single-speaker voice. Held-out teacher-forced mel-L1 is 0.296 (multi-seed) and some samples sound genuinely natural, but free-run synthesis on arbitrary sentences is still rough/unstable. So I’m framing TTS as a feasibility demo, not a usable TTS.

Both are research/devlog results, not production or SOTA. The two models are separate today; the goal is to unify them into one over time.

Try it (combined demo, both tabs):
Substrate: pip install hsl-embedding-zero

Happy to answer questions on the lens design or the byte→signal encoding, and very open to critique, especially on the TTS free-run instability.
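
For anyone comparing against the CER numbers (0.194 vs. 0.213), here is a minimal sketch of how a corpus-level character error rate is commonly computed. It assumes plain character-level Levenshtein distance with no text normalization; the function names are illustrative and the actual evaluation script may differ, for example in case or punctuation handling.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level CER: total character edits divided by total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


if __name__ == "__main__":
    # One substitution (e -> a) and one deletion (o) over 11 reference characters.
    print(cer(["hello world"], ["hallo wrld"]))
```

Pooling edits over the whole test set (rather than averaging per-utterance CER) keeps short utterances from dominating the score; which of the two the post's numbers use is not stated.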
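
To make "learnable gated fusion over the frozen substrate" concrete, here is one possible shape of such a layer, written in plain Python for readability. This is a sketch under assumptions, not the HoLo-ToLk implementation: it assumes the log-mel lens has already been projected to the substrate's feature width, that the gate is a single learnable logit per channel, and that fusion is additive. The names `gated_fusion` and `gate_logits` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(substrate, lens, gate_logits):
    """Fuse per-frame features: fused[t][d] = substrate[t][d] + sigmoid(gate_logits[d]) * lens[t][d].

    substrate   : list of frames, each a list of floats (frozen, never updated)
    lens        : list of frames with the same shape (assumed already projected)
    gate_logits : one learnable scalar per channel (assumption, not from the post)
    """
    gates = [sigmoid(g) for g in gate_logits]
    fused = []
    for sub_frame, lens_frame in zip(substrate, lens):
        if not (len(sub_frame) == len(lens_frame) == len(gates)):
            raise ValueError("substrate, lens and gate widths must match")
        fused.append([s + g * l for s, g, l in zip(sub_frame, gates, lens_frame)])
    return fused


if __name__ == "__main__":
    substrate = [[1.0, 2.0]]
    lens = [[4.0, -2.0]]
    print(gated_fusion(substrate, lens, [0.0, 0.0]))  # gates are 0.5 per channel
```

In this form, a strongly negative gate logit closes the gate and the layer passes the substrate through almost unchanged, so training can decide per channel how much spectral information to admit without touching the substrate itself.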