# HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate

> Source: <https://discuss.huggingface.co/t/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate/177216#post_1>
> Published: 2026-06-28 18:25:42+00:00

Follow-up to my earlier post on the 0-parameter input layer.

I took the HSL byte substrate (no tokenizer, no learned input embedding) and built two small speech models on top, to see whether “bytes as signal” carries through to audio. I’m calling the line HoLo-ToLk.

STT (speech → text): the result I’m most confident about. Feeding the raw HSL substrate to a char-CTC baseline is weak on its own (CER ~0.67). Adding a small model-side spectral lens (log-mel + a learnable gated fusion over the frozen substrate) flips it: CER 0.194, beating a mel-spectrogram baseline (0.213; lower is better) in the same setup, confirmed across 4 seeds. So the honest takeaway is a controlled comparison (substrate + lens beats mel, same setup), not a SOTA number (8 kHz, char-CTC, no LM; readable but rough).
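For readers less familiar with the metric, here is a minimal sketch of how a char-CTC output is typically turned into text and scored with CER. It assumes greedy decoding (merge repeated labels, drop blanks) and a corpus-level CER (total edit distance divided by total reference characters); the post does not say which decoding or normalization was actually used, and the names `ctc_greedy_collapse`, `edit_distance`, `cer`, and the toy alphabet are illustrative only.

```python
def ctc_greedy_collapse(frame_ids, alphabet, blank=0):
    """Collapse per-frame argmax ids: merge consecutive repeats, then drop blanks."""
    chars = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank:
            chars.append(alphabet[idx])
        previous = idx
    return "".join(chars)


def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level character error rate (an assumed convention, see lead-in)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return errors / total_chars


# Toy example: index 0 is the CTC blank (hypothetical 5-symbol alphabet).
alphabet = ["<blank>", "h", "e", "l", "o"]
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
decoded = ctc_greedy_collapse(frames, alphabet)

# Two missing characters against an 11-character reference.
score = cer(["hello world"], ["helo wrld"])
```

Under this convention, a CER of 0.194 means roughly one character edit per five reference characters, which matches the “readable but rough” description.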

TTS (text → speech): here the byte substrate is even more natural. UTF-8 text bytes go straight in as HSL features, with no tokenizer or vocabulary. A small autoregressive (AR) transformer with guided attention and a HiFi-GAN vocoder gives a single-speaker voice. Held-out teacher-forced mel-L1 is 0.296 (multi-seed) and some samples sound genuinely natural, but free-run synthesis on arbitrary sentences is still rough and unstable. So I’m framing TTS as a feasibility demo, not a usable TTS system.
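To make the “bytes in, guided attention” part concrete, here is a small sketch under stated assumptions. It uses the common guided-attention penalty W[n][t] = 1 - exp(-(n/N - t/T)² / (2g²)) with g = 0.2, where N is the number of input positions (here, UTF-8 bytes) and T the number of mel frames. The post does not specify the exact formulation, the value of g, or how bytes become HSL features, so the function names, `n_frames`, and the toy attention maps are all hypothetical, and the HSL mapping itself is not shown.

```python
import math


def guided_attention_weights(n_inputs, n_frames, g=0.2):
    """Penalty matrix: near 0 on the diagonal, approaching 1 far from it."""
    return [
        [1.0 - math.exp(-((n / n_inputs - t / n_frames) ** 2) / (2.0 * g * g))
         for t in range(n_frames)]
        for n in range(n_inputs)
    ]


def guided_attention_loss(attention, weights):
    """Mean of attention * penalty over all (input position, frame) cells."""
    total = sum(a * w
                for a_row, w_row in zip(attention, weights)
                for a, w in zip(a_row, w_row))
    return total / (len(attention) * len(attention[0]))


# Text enters as raw UTF-8 bytes: no vocabulary lookup, values in 0..255.
text = "héllo"
byte_ids = list(text.encode("utf-8"))  # 6 bytes, because "é" encodes as two
n_inputs = len(byte_ids)
n_frames = 12  # hypothetical number of mel frames

weights = guided_attention_weights(n_inputs, n_frames)

# Toy attention maps: one roughly monotonic (diagonal), one reversed.
diagonal = [[1.0 if n == (t * n_inputs) // n_frames else 0.0
             for t in range(n_frames)]
            for n in range(n_inputs)]
reversed_map = diagonal[::-1]

loss_diagonal = guided_attention_loss(diagonal, weights)
loss_reversed = guided_attention_loss(reversed_map, weights)
```

The diagonal map gets the lower penalty, which is the point of the term: it nudges attention toward monotonic alignment during teacher-forced training. It does not by itself guarantee stable alignment in free-run decoding, which may be relevant to the instability mentioned above.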

Both are research/devlog results, not production or SOTA. The two models are separate today; the goal is to unify them into one over time.

Try it (combined demo, both tabs):

Substrate: `pip install hsl-embedding-zero`

Happy to answer questions on the lens design or the byte→signal encoding, and very open to critique, especially on the TTS free-run instability.