Follow-up to my earlier post on the 0-parameter input layer.
I took the HSL byte substrate (no tokenizer, no learned input embedding) and built
two small speech models on top, to see whether “bytes as signal” carries through to
audio. I’m calling the line HoLo-ToLk.
STT (speech → text) — the result I’m most confident about.
Feeding the raw HSL substrate to a char-CTC baseline is weak on its own (CER ~0.67).
Adding a small model-side spectral lens (log-mel + a learnable gated fusion over the
frozen substrate) flips it: CER 0.194, beating a mel-spectrogram baseline (0.213) in
the same setup, confirmed across 4 seeds. So the honest takeaway is a controlled
comparison (substrate + lens > mel, same setup), not a SOTA number: this is 8 kHz,
char-CTC, no LM, and the output is readable but rough.
TTS (text → speech): here the byte substrate is an even more natural fit. UTF-8 text
bytes go straight in as HSL features, with no tokenizer or vocab. A small AR
transformer + guided attention + HiFi-GAN gives a single-speaker voice. Held-out
teacher-forced mel-L1 is 0.296 (multi-seed) and some samples sound genuinely natural,
but free-run synthesis on arbitrary sentences is still rough/unstable. So I’m framing
TTS as a feasibility demo, not a usable TTS.
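To make the "UTF-8 bytes straight in, no tokenizer/vocab" step concrete, here is a minimal sketch using only the Python standard library. It shows the byte sequence a TTS front end would see. The `placeholder_features` function is a hypothetical stand-in for a fixed, parameter-free feature map; it is not the actual HSL encoding, which lives in hsl-embedding-zero and is not reproduced here.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer in 0..255 per byte."""
    return list(text.encode("utf-8"))


def placeholder_features(byte_ids: list[int]) -> list[float]:
    """Hypothetical stand-in for a fixed (non-learned) byte -> feature map.

    This is NOT the HSL encoding; it only illustrates that the map has no
    trainable parameters and no vocabulary beyond the 256 byte values.
    """
    return [b / 255.0 for b in byte_ids]


ids = text_to_byte_ids("héllo")
print(ids)       # [104, 195, 169, 108, 108, 111]
print(len(ids))  # 6 bytes for 5 characters, since "é" takes two bytes

feats = placeholder_features(ids)
print(len(feats) == len(ids))  # True: one feature value per byte
```

Note that the sequence length is counted in bytes, not characters, so non-ASCII text produces longer inputs than its character count suggests.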
Both are research/devlog results, not production or SOTA. The two models are separate
today; the goal is to unify them into one over time.
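For readers asking what "a learnable gated fusion over the frozen substrate" could look like on the STT side, here is one plausible per-frame sketch in plain Python. It is an assumption, not the exact HoLo-ToLk lens: it assumes the log-mel features have already been projected to the same width as the substrate features, and the names `gated_fusion`, `gate_w`, and `gate_b` are illustrative. Only the gate parameters would be trained; the substrate features stay fixed.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(substrate, lens, gate_w, gate_b):
    """Fuse one frame of frozen substrate features with lens features.

    gate[i] = sigmoid(gate_w[i] . [substrate; lens] + gate_b[i])
    out[i]  = gate[i] * lens[i] + (1 - gate[i]) * substrate[i]

    Assumption: substrate and lens have the same length.
    """
    concat = list(substrate) + list(lens)
    fused = []
    for i, (s, m) in enumerate(zip(substrate, lens)):
        pre = sum(w * x for w, x in zip(gate_w[i], concat)) + gate_b[i]
        g = sigmoid(pre)
        fused.append(g * m + (1.0 - g) * s)
    return fused


dim = 2
substrate = [1.0, 2.0]  # frozen features for one frame (never updated)
lens = [3.0, 4.0]       # projected log-mel features for the same frame
zero_w = [[0.0] * (2 * dim) for _ in range(dim)]

# With zero weights and zero bias the gate is exactly 0.5: a plain average.
print(gated_fusion(substrate, lens, zero_w, [0.0, 0.0]))  # [2.0, 3.0]

# A strongly negative bias closes the gate, so the output stays near the substrate.
closed = gated_fusion(substrate, lens, zero_w, [-20.0, -20.0])
```

The point of the gate is that the model can fall back to the raw substrate where the spectral view adds nothing, and lean on the lens where it helps.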
Try it (combined demo, both tabs):
Substrate: pip install hsl-embedding-zero
Happy to answer questions on the lens design or the byte→signal encoding — and very
open to critique, especially on the TTS free-run instability.