Source: discuss.huggingface.co · topic: machine-learning

Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)

A researcher reports that a byte transformer with a zero-parameter input layer (hsl-embedding-zero) is on par with a learned embedding for next-byte prediction on text, radar, and lidar in a single-seed probe. Channel-knockout experiments on text, image, audio, and caption data show modality-specific reliance on the Fourier, delta, and phase channels, while the boundary channel stays idle. The model loses to a raw range profile on an absolute-distance task because it lacks an absolute-magnitude channel, and grounding collapses under the zero door until the memory path gets its own small learned projection. Adaptive slotting and a coarse absolute channel are named as future work.

5 min read · published Jun 14, 2026

A bigger update — data for several more of the boundaries you mapped, plus the places it breaks. (Geometry/K/implementation were in my last reply; this adds modality, schedule, grounding, and a first sensor pass.)

One framing note up front so the numbers are comparable: these come from three different runs, not one, and I’ll label which is which. They are a per-byte text/AV checkpoint (knockout + schedule), a new 61M I/O-symmetric multimodal model (grounding + the larger curriculum), and a tiny single-seed sensor probe. The input door is the published hsl-embedding-zero package in all of them (0 learned input params, verified bit-identical to the frozen substrate).

1. Modality boundary: channel-knockout matrix (per-byte checkpoint). Because the substrate enters at fixed addresses (Δ at dims 0–7, Δ² at 8–15, boundary at 16, Fourier at 17–24, phase at 25–26), I can zero a channel group at eval time on the trained checkpoint and read each modality’s reliance, with no retraining. Δbpb when a group is knocked out (higher = more load-bearing):

| channel group | text | image | audio | caption |
|---|---|---|---|---|
| Δ (dxor) | 1.30 | 0.61 | 0.71 | 0.33 |
| Δ² (d2xor) | 2.19 | 0.11 | 0.21 | 0.27 |
| boundary | 0.01 | −0.01 | −0.05 | 0.01 |
| Fourier | 6.99 | 0.29 | 0.62 | 1.45 |
| phase | 0.18 | 0.47 | 0.38 | −0.05 |

It splits the way your numeric-locality argument predicted: text/caption lean hard on Fourier, image leans on Δ+phase, audio spreads over Δ+Fourier+phase. The boundary channel is ~0 everywhere — the honest negative: at a fixed patch size the model has no reason to consult “where a unit ends,” so it idles. I’d expect it to wake under adaptive (content-determined) slotting — now a falsifiable prediction rather than a hope.
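A minimal sketch of the knockout procedure from section 1, for readers who want to reproduce the matrix. The dim ranges are the ones quoted in the post; everything else is an assumption for illustration: features are taken to be plain lists of floats per position, and `eval_bpb` is a hypothetical callable standing in for "run the trained checkpoint and return bits per byte". It is not the hsl-embedding-zero package API.

```python
# Channel groups at the fixed addresses quoted in the post.
CHANNEL_GROUPS = {
    "delta": range(0, 8),       # Δ at dims 0-7
    "delta2": range(8, 16),     # Δ² at dims 8-15
    "boundary": range(16, 17),  # boundary at dim 16
    "fourier": range(17, 25),   # Fourier at dims 17-24
    "phase": range(25, 27),     # phase at dims 25-26
}


def knock_out(features, group):
    """Return a copy of the per-position feature vectors with one group zeroed."""
    dims = CHANNEL_GROUPS[group]
    zeroed = []
    for vec in features:
        vec = list(vec)  # copy so the caller's features stay untouched
        for d in dims:
            vec[d] = 0.0
        zeroed.append(vec)
    return zeroed


def knockout_deltas(features, eval_bpb):
    """Δbpb per group: bpb with the group zeroed minus the baseline bpb.

    eval_bpb is a hypothetical stand-in for evaluating the trained
    checkpoint on the given features (no retraining involved).
    """
    baseline = eval_bpb(features)
    return {
        group: eval_bpb(knock_out(features, group)) - baseline
        for group in CHANNEL_GROUPS
    }
```

Running `knockout_deltas` once per modality gives one column of the matrix; a value near zero, as for the boundary channel, means the checkpoint's loss barely changes when that group is removed.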

2. Modality boundary — extended to real sensor bytes (single-seed probe). You predicted HSL geometry should suit signal-like data more than symbolic text. I serialized real radar / lidar / depth (KITTI, RadarScenes, RaDICaL ADC) through the same frozen substrate with zero per-modality engineering. Next-byte bpb (dim512/d4/K8, single seed, 3k steps, not converged — text/radar still descending):

| | HSL (0 params) | learned (78k) | raw scalar |
|---|---|---|---|
| text | 2.82 | 2.86 | 5.98 |
| radar | 4.54 | 4.65 | 6.62 |
| lidar | 0.358 | 0.358 | 1.31 |

HSL ≈ learned across all three: every |Δ| is within single-seed noise, so I read it as “on par,” not ahead (on lidar the learned door actually edges HSL by 0.0003, below the table’s rounding). Raw scalar is far worse everywhere. Cross-sensor binding also works: real radar ADC → camera depth, 4-way distance quartile accuracy 0.84 (chance 0.25, non-circular). Where it breaks: on that absolute-distance task the substrate (0.84) loses to a raw range profile (0.89). HSL is a change-rate substrate with no absolute-magnitude channel, so absolute-size tasks aren’t its strength. That points at a concrete v-next encoder card: add one coarse absolute channel without breaking losslessness. All of this is single-seed/toy/unconverged, so the multi-seed + scale runs you asked for are still owed.
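To illustrate why a change-rate view struggles with absolute distance, here is a toy sketch. It uses plain arithmetic first differences, not the package's dxor encoding, and the readings, the `bucket` size, and the `coarse_level` helper are all hypothetical. It only shows the general point that difference features are unchanged by a constant offset, and what "one coarse absolute channel" could look like in the simplest case.

```python
def deltas(samples):
    """First differences of a sequence (a stand-in for a change-rate feature)."""
    return [b - a for a, b in zip(samples, samples[1:])]


def coarse_level(samples, bucket=64):
    """One coarse absolute value per window: bucket index of the window mean.

    Hypothetical illustration of a coarse absolute channel, not a proposal
    for how the real encoder should implement it.
    """
    return int(sum(samples) / len(samples)) // bucket


near = [10, 12, 15, 15, 11]      # hypothetical range readings
far = [x + 100 for x in near]    # same shape, shifted by a constant offset

same_shape = deltas(near) == deltas(far)                 # True: offsets cancel
levels_differ = coarse_level(near) != coarse_level(far)  # True: 0 vs 1
```

The difference features are identical for the two windows, so a model that reads only them has to infer absolute level from elsewhere, while a single coarse bucket separates the two cases.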

3. Schedule boundary — partial, two runs. I ran the zero door through full from-scratch curricula (not the 3k probe). Per-byte model: text bpb 1.674 → 1.531 → 1.578 across stages (the last rise is the chat/knowledge mix pulling on the text budget). A new 61M run with an I/O-symmetric output door (below): 2.25 → 1.965 → 1.999. Either way the zero door holds up over a long schedule on its own. The head-to-head you actually asked for — does a plain learned embedding catch up given the same long budget — I still haven’t run; that’s the clean next experiment and I’ll report the curve, not a point.

4. Grounding — the break, the fix, and the recovery. In my last reply I flagged that the full run broke disk-grounding: the gap (extra bits/byte when retrieved facts are swapped for wrong ones) collapsed from 1.835 on a learned-door model to ~0.007 under the zero door. Diagnosis: the zero door also zero-pads the retrieved memory features, so they dilute against the slot positional encoding and the model stops reading their content. Fix: keep the input door at 0 params, but give the memory path (retrieved knowledge ≠ input bytes) its own small learned projection. The recovery result, on the 61M multimodal-first curriculum:

| step | 1 | 1k | 2k | 3k | 4k | 5k | 6k | 7k | 8k |
|---|---|---|---|---|---|---|---|---|---|
| gap | 0.000 | 0.001 | 0.016 | 0.025 | 0.052 | 0.067 | 0.087 | 0.115 | 0.131 |

Monotonic, no collapse: the memory-path projection does restore disk reading. Three honest caveats: (1) 0.131 ≪ the old learned-door 1.835, so it reads the disk again, but modestly. (2) In an isolated knowledge-only probe the same fix overshot to ~1.7 and then overfit: it memorized the store and text bpb degraded. In the mixed run (knowledge = 20% of the batch) it stays at a stable 0.131, so the mix regularizes it and magnitude is regime-dependent. (3) Clean isolated-vs-mixed and multi-seed comparisons are still pending. Net: the break is fixed in direction; the ceiling under a zero input door is an open number.
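For reference, the grounding gap as defined in section 4 (extra bits/byte when retrieved facts are swapped for wrong ones) can be computed as below. This is a sketch under one assumption: the evaluation reports a summed negative log-likelihood in nats over the scored bytes, which is then converted to bits per byte. The function names are illustrative, not taken from the author's code.

```python
import math


def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (num_bytes * math.log(2))


def grounding_gap(nll_wrong_nats, nll_right_nats, num_bytes):
    """Extra bits/byte paid when the retrieved facts are swapped for wrong ones.

    Both losses are assumed to be measured on the same num_bytes target bytes.
    """
    wrong = bits_per_byte(nll_wrong_nats, num_bytes)
    right = bits_per_byte(nll_right_nats, num_bytes)
    return wrong - right
```

A gap near zero means the model predicts the target equally well with correct or wrong retrieved facts, i.e. it is not reading the memory content. A larger positive gap means wrong facts cost bits.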

5. A note on the output head (not a scale claim). The model in the original post was 25M with a per-byte output. The 61M run above also packs the output the same way as the input — K-byte within-slot autoregressive, so input and output are both K-packed (I/O symmetric). It costs bpb (the per-byte numbers were lower), so “bigger” and “bundled output” are confounded here — I’m not reading 61M as a scale result. I flag it only because you separated input-door from output-head: this is a first look at giving the output the same zero-shaped K-packing, and the honest read is that the symmetry is a structural choice that costs bits rather than a win.
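A rough sketch of what K-packing with within-slot autoregression could look like on the data side. The zero padding of the last slot and the choice to show only the within-slot context are assumptions for illustration. The post does not say how the real model pads or how it conditions on earlier slots, and both function names are hypothetical.

```python
def pack_slots(data: bytes, k: int = 8, pad: int = 0):
    """Split a byte stream into consecutive K-byte slots, padding the last one."""
    slots = []
    for i in range(0, len(data), k):
        chunk = list(data[i:i + k])
        chunk += [pad] * (k - len(chunk))  # assumption: pad short tail with zeros
        slots.append(chunk)
    return slots


def within_slot_targets(slot):
    """(context, next_byte) pairs for autoregressive prediction inside one slot.

    Only the within-slot context is shown; a real model would presumably
    also condition on the preceding slots.
    """
    return [(slot[:j], slot[j]) for j in range(len(slot))]


slots = pack_slots(b"hello world", k=8)
pairs = within_slot_targets(slots[0])
```

With K=8, the 11-byte input becomes two slots, and each slot yields eight prediction targets, so the output is emitted in the same K-sized units the input door consumes.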

Still blank, in your order: schedule head-to-head (does a plain learned embedding catch up given the same long budget — the clean one), then hard-negative binding, then a real multi-seed scale sweep. Thanks again — the map keeps paying off, including where it says “this part doesn’t work yet.”
