{"slug": "removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter", "title": "Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)", "summary": "A researcher reports that a byte transformer with a zero-parameter input layer (HSL-embedding-zero) performs comparably to learned embeddings across text, image, audio, radar, and lidar modalities, with channel-knockout experiments confirming modality-specific reliance on Fourier, delta, and phase channels. The model breaks on absolute-distance tasks due to lacking an absolute-magnitude channel, and grounding collapses under the zero door, suggesting a need for adaptive slotting and a coarse absolute channel in future work.", "body_md": "A bigger update — data for several more of the boundaries you mapped, plus the places it breaks. (Geometry/K/implementation were in my last reply; this adds modality, schedule, grounding, and a first sensor pass.)\n\nOne framing note up front so the numbers are comparable: these come from **three different runs**, not one, and I’ll label which is which — a per-byte text/AV checkpoint (knockout + schedule), a new 61M I/O-symmetric multimodal model (grounding + the larger curriculum), and a tiny single-seed sensor probe. The input door is the published `hsl-embedding-zero` package in all of them (0 learned input params, verified bit-identical to the frozen substrate).\n\n**1. Modality boundary — channel-knockout matrix (per-byte checkpoint).** Because the substrate enters at fixed addresses (Δ at dims 0–7, Δ² at 8–15, boundary at 16, Fourier at 17–24, phase at 25–26), I can zero a channel group at eval time on the trained checkpoint and read each modality’s reliance — no retraining. 
Δbpb when a group is knocked out (higher = more load-bearing):\n\n| channel group | text | image | audio | caption |\n|---|---|---|---|---|\n| Δ (dxor) | 1.30 | 0.61 | 0.71 | 0.33 |\n| Δ² (d2xor) | 2.19 | 0.11 | 0.21 | 0.27 |\n| boundary | 0.01 | −0.01 | −0.05 | 0.01 |\n| Fourier | 6.99 | 0.29 | 0.62 | 1.45 |\n| phase | 0.18 | 0.47 | 0.38 | −0.05 |\n\nIt splits the way your numeric-locality argument predicted: **text/caption lean hard on Fourier**, **image leans on Δ+phase**, **audio spreads over Δ+Fourier+phase**. The boundary channel is ~0 everywhere — the honest negative: at a *fixed* patch size the model has no reason to consult “where a unit ends,” so it idles. I’d expect it to wake under adaptive (content-determined) slotting — now a falsifiable prediction rather than a hope.\n\n**2. Modality boundary — extended to real sensor bytes (single-seed probe).** You predicted HSL geometry should suit signal-like data more than symbolic text. I serialized real radar / lidar / depth (KITTI, RadarScenes, RaDICaL ADC) through the same frozen substrate with zero per-modality engineering. Next-byte bpb (dim512/d4/K8, single seed, 3k steps, **not converged — text/radar still descending**):\n\n| modality | HSL (0 params) | learned (78k) | raw scalar |\n|---|---|---|---|\n| text | 2.82 | 2.86 | 5.98 |\n| radar | 4.54 | 4.65 | 6.62 |\n| lidar | 0.358 | 0.358 | 1.31 |\n\nHSL ≈ learned across all three — every |Δ| is within single-seed noise, so I read it as “on par,” not ahead (on lidar the learned door actually edges HSL by 0.0003). Raw scalar is far worse everywhere. Cross-sensor binding also works: real radar ADC → camera depth, 4-way distance quartile 0.84 (chance 0.25, non-circular). **Where it breaks:** on that absolute-distance task the substrate (0.84) *loses* to a raw range profile (0.89) — HSL is a change-rate substrate with no absolute-magnitude channel, so absolute-size tasks aren’t its strength. 
That points at a concrete v-next encoder card: add one coarse absolute channel without breaking losslessness. All of this is single-seed/toy/unconverged, so the multi-seed + scale runs you asked for are still owed.\n\n**3. Schedule boundary — partial, two runs.** I ran the zero door through full from-scratch curricula (not the 3k probe). Per-byte model: text bpb 1.674 → 1.531 → 1.578 across stages (the last rise is the chat/knowledge mix pulling on the text budget). A new 61M run with an I/O-symmetric output door (below): 2.25 → 1.965 → 1.999. Either way the zero door holds up over a long schedule on its own. The head-to-head you actually asked for — does a *plain learned embedding* catch up given the same long budget — I still haven’t run; that’s the clean next experiment and I’ll report the curve, not a point.\n\n**4. Grounding — the break, the fix, and the recovery.** In my last reply I flagged that the full run broke disk-grounding: the gap (extra bits/byte when retrieved facts are swapped for wrong ones) collapsed from 1.835 on a learned-door model to ~0.007 under the zero door. Diagnosis: the zero door also zero-pads the *retrieved memory* features, so they dilute against the slot positional encoding and the model stops reading their content. Fix: keep the input door at 0 params, but give the *memory* path (retrieved knowledge ≠ input bytes) its own small learned projection. The recovery result, on the 61M multimodal-first curriculum:\n\n| step | 1 | 1k | 2k | 3k | 4k | 5k | 6k | 7k | 8k |\n|---|---|---|---|---|---|---|---|---|---|\n| gap | 0.000 | 0.001 | 0.016 | 0.025 | 0.052 | 0.067 | 0.087 | 0.115 | 0.131 |\n\nMonotonic, no collapse — the memory-path projection does restore disk reading. Three honest caveats: (1) **0.131 ≪ the old learned-door 1.835** — it reads the disk again, but modestly. 
(2) In an *isolated* knowledge-only probe the same fix overshot to ~1.7 and then **overfit** — it memorized the store and text bpb regressed; in the *mixed* run (knowledge = 20% of the batch) it stays at a stable 0.131, so the mix regularizes it and magnitude is regime-dependent. (3) Clean isolated-vs-mixed and multi-seed comparisons are still pending. Net: the break is fixed in direction; the ceiling under a zero input door is an open number.\n\n**5. A note on the output head (not a scale claim).** The model in the original post was 25M with a per-byte output. The 61M run above also packs the *output* the same way as the input — K-byte within-slot autoregressive, so input and output are both K-packed (I/O symmetric). It costs bpb (the per-byte numbers were lower), so “bigger” and “bundled output” are confounded here — I’m not reading 61M as a scale result. I flag it only because you separated input-door from output-head: this is a first look at giving the output the same zero-shaped K-packing, and the honest read is that the symmetry is a structural choice that *costs* bits rather than a win.\n\n**Still blank, in your order:** schedule head-to-head (does a plain learned embedding catch up given the same long budget — the clean one), then hard-negative binding, then a real multi-seed scale sweep. 
Thanks again — the map keeps paying off, including where it says “this part doesn’t work yet.”", "url": "https://wpnews.pro/news/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter", "canonical_source": "https://discuss.huggingface.co/t/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter-input-layer-25m-single-rtx-4070/176731#post_4", "published_at": "2026-06-14 14:54:16+00:00", "updated_at": "2026-06-14 16:47:56.818117+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "ai-research", "computer-vision", "natural-language-processing"], "entities": ["HSL-embedding-zero", "KITTI", "RadarScenes", "RaDICaL ADC", "RTX 4070"], "alternates": {"html": "https://wpnews.pro/news/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter", "markdown": "https://wpnews.pro/news/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter.md", "text": "https://wpnews.pro/news/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter.txt", "jsonld": "https://wpnews.pro/news/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter.jsonld"}}