Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)

A researcher reports that a byte transformer with a zero-parameter input layer (hsl-embedding-zero) performs on par with a learned embedding on text, radar, and lidar in a single-seed probe. Channel-knockout experiments on text, image, audio, and captions show modality-specific reliance on the Fourier, Δ, and phase channels. The substrate loses to a raw range profile on an absolute-distance task because it has no absolute-magnitude channel, and disk-grounding collapsed under the zero door until the memory path was given its own small learned projection. Proposed next steps include a coarse absolute channel and a test of the idle boundary channel under adaptive slotting.

A bigger update: data for several more of the boundaries you mapped, plus the places it breaks. Geometry/K/implementation were in my last reply; this adds modality, schedule, grounding, and a first sensor pass.

One framing note up front so the numbers are comparable: these come from three different runs, not one, and I’ll label which is which: a per-byte text/AV checkpoint (knockout + schedule), a new 61M I/O-symmetric multimodal model (grounding + the larger curriculum), and a tiny single-seed sensor probe. The input door is the published hsl-embedding-zero package in all of them (0 learned input params, verified bit-identical to the frozen substrate).

1. Modality boundary: channel-knockout matrix (per-byte checkpoint). Because the substrate enters at fixed addresses (Δ at dims 0–7, Δ² at 8–15, boundary at 16, Fourier at 17–24, phase at 25–26), I can zero a channel group at eval time on the trained checkpoint and read each modality’s reliance, with no retraining. Δbpb when a group is knocked out (higher = more load-bearing):

| channel group | text | image | audio | caption |
|---|---|---|---|---|
| Δ (dxor) | 1.30 | 0.61 | 0.71 | 0.33 |
| Δ² (d2xor) | 2.19 | 0.11 | 0.21 | 0.27 |
| boundary | 0.01 | −0.01 | −0.05 | 0.01 |
| Fourier | 6.99 | 0.29 | 0.62 | 1.45 |
| phase | 0.18 | 0.47 | 0.38 | −0.05 |

It splits the way your numeric-locality argument predicted: text/caption lean hard on Fourier, image leans on Δ+phase, audio spreads over Δ+Fourier+phase. The boundary channel is ~0 everywhere. That is the honest negative: at a fixed patch size the model has no reason to consult “where a unit ends,” so it idles. I’d expect it to wake under adaptive (content-determined) slotting, which is now a falsifiable prediction rather than a hope.

2. Modality boundary, extended to real sensor bytes (single-seed probe). You predicted HSL geometry should suit signal-like data more than symbolic text.
I serialized real radar / lidar / depth (KITTI, RadarScenes, RaDICaL ADC) through the same frozen substrate with zero per-modality engineering. Next-byte bpb (dim512/d4/K8, single seed, 3k steps, not converged; text/radar still descending):

| | HSL (0 params) | learned (78k) | raw scalar |
|---|---|---|---|
| text | 2.82 | 2.86 | 5.98 |
| radar | 4.54 | 4.65 | 6.62 |
| lidar | 0.358 | 0.358 | 1.31 |

HSL ≈ learned across all three: every |Δ| is within single-seed noise, so I read it as “on par,” not ahead (on lidar the learned door actually edges HSL by 0.0003). Raw scalar is far worse everywhere. Cross-sensor binding also works: real radar ADC → camera depth, 4-way distance quartile 0.84 (chance 0.25, non-circular).

Where it breaks: on that absolute-distance task the substrate (0.84) loses to a raw range profile (0.89). HSL is a change-rate substrate with no absolute-magnitude channel, so absolute-size tasks aren’t its strength. That points at a concrete v-next encoder card: add one coarse absolute channel without breaking losslessness. All of this is single-seed, toy-scale, and unconverged, so the multi-seed + scale runs you asked for are still owed.

3. Schedule boundary: partial, two runs. I ran the zero door through full from-scratch curricula (not the 3k probe). Per-byte model: text bpb 1.674 → 1.531 → 1.578 across stages (the last rise is the chat/knowledge mix pulling on the text budget). A new 61M run (with an I/O-symmetric output door, described in section 5): 2.25 → 1.965 → 1.999. Either way the zero door holds up over a long schedule on its own. The head-to-head you actually asked for (does a plain learned embedding catch up given the same long budget?) I still haven’t run; that’s the clean next experiment and I’ll report the curve, not a point.

4. Grounding: the break, the fix, and the recovery.
In my last reply I flagged that the full run broke disk-grounding: the gap (extra bits/byte when retrieved facts are swapped for wrong ones) collapsed from 1.835 on a learned-door model to ~0.007 under the zero door. Diagnosis: the zero door also zero-pads the retrieved memory features, so they dilute against the slot positional encoding and the model stops reading their content. Fix: keep the input door at 0 params, but give the memory path (retrieved knowledge ≠ input bytes) its own small learned projection. The recovery result, on the 61M multimodal-first curriculum:

| step | 1 | 1k | 2k | 3k | 4k | 5k | 6k | 7k | 8k |
|---|---|---|---|---|---|---|---|---|---|
| gap | 0.000 | 0.001 | 0.016 | 0.025 | 0.052 | 0.067 | 0.087 | 0.115 | 0.131 |

Monotonic, no collapse: the memory-path projection does restore disk reading. Three honest caveats: (1) 0.131 ≪ the old learned-door 1.835, so it reads the disk again, but modestly. (2) In an isolated knowledge-only probe the same fix overshot to ~1.7 and then overfit: it memorized the store and cut into text bpb. In the mixed run (knowledge = 20% of the batch) it stays at a stable 0.131, so the mix regularizes it and the magnitude is regime-dependent. (3) Clean isolated-vs-mixed and multi-seed comparisons are still pending. Net: the break is fixed in direction; the ceiling under a zero input door is an open number.

5. A note on the output head (not a scale claim). The model in the original post was 25M with a per-byte output. The 61M run above also packs the output the same way as the input: K-byte within-slot autoregressive, so input and output are both K-packed (I/O symmetric). It costs bpb (the per-byte numbers were lower), so “bigger” and “bundled output” are confounded here, and I’m not reading 61M as a scale result.
I flag it only because you separated input-door from output-head: this is a first look at giving the output the same zero-shaped K-packing, and the honest read is that the symmetry is a structural choice that costs bits rather than a win.

Still blank, in your order: schedule head-to-head (does a plain learned embedding catch up given the same long budget; the clean one), then hard-negative binding, then a real multi-seed scale sweep.

Thanks again. The map keeps paying off, including where it says “this part doesn’t work yet.”
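
For anyone who wants to try the section 1 readout, here is a minimal sketch of an eval-time channel knockout in plain Python. It is not the actual evaluation code. Only the dim addresses are taken from section 1; `score_fn` and `toy_score` are hypothetical stand-ins for a trained checkpoint, and the 27-dim feature vectors are an assumption that ignores any dims beyond the substrate channels.

```python
import math

# Channel layout from section 1 (Python ranges, so the end is exclusive).
CHANNEL_GROUPS = {
    "delta": range(0, 8),       # Δ at dims 0-7
    "delta2": range(8, 16),     # Δ² at dims 8-15
    "boundary": range(16, 17),  # boundary at dim 16
    "fourier": range(17, 25),   # Fourier at dims 17-24
    "phase": range(25, 27),     # phase at dims 25-26
}


def knock_out(features, group):
    """Return a copy of the per-position feature vectors with one group zeroed."""
    dims = set(CHANNEL_GROUPS[group])
    return [[0.0 if d in dims else v for d, v in enumerate(vec)]
            for vec in features]


def bits_per_byte(target_probs):
    """Mean negative log2-probability assigned to the true next byte."""
    return -sum(math.log2(p) for p in target_probs) / len(target_probs)


def knockout_delta_bpb(score_fn, features, group):
    """bpb with the group zeroed minus bpb intact; higher = more load-bearing.

    score_fn is a hypothetical callable mapping features to the list of
    probabilities the checkpoint assigns to each true next byte.
    """
    base = bits_per_byte(score_fn(features))
    ablated = bits_per_byte(score_fn(knock_out(features, group)))
    return ablated - base


def toy_score(features):
    """Stand-in 'model' (an assumption) that only reads the Fourier dims."""
    return [0.5 if any(vec[d] for d in CHANNEL_GROUPS["fourier"]) else 0.25
            for vec in features]


if __name__ == "__main__":
    feats = [[1.0] * 27 for _ in range(4)]
    for name in CHANNEL_GROUPS:
        print(name, knockout_delta_bpb(toy_score, feats, name))
```

With the toy scorer, only the Fourier group produces a nonzero Δbpb (1.0 bit/byte); every other group reads 0.0 because the stand-in never looks at it. The grounding gap in section 4 has the same shape: a bpb difference, with swapped facts in place of zeroed channels.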