Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)


[ARTICLE · art-27098] src=discuss.huggingface.co ↗ pub=2026-06-14T14:54Z topic=machine-learning verified=true sentiment=· neutral


A researcher reports that a byte transformer with a zero-parameter input layer (hsl-embedding-zero) scores on par with a learned embedding on next-byte prediction for text, radar, and lidar in a single-seed probe. Channel-knockout experiments on a text/image/audio/caption checkpoint show modality-specific reliance on the Fourier, Δ, and phase channels. The substrate loses to a raw range profile on an absolute-distance task because it has no absolute-magnitude channel, and disk grounding collapsed under the zero door until the memory path was given its own small learned projection. The author proposes a coarse absolute channel for the next encoder version and expects the idle boundary channel to matter under adaptive slotting.

5 min read · 24 views · published Jun 14, 2026

A bigger update — data for several more of the boundaries you mapped, plus the places it breaks. (Geometry/K/implementation were in my last reply; this adds modality, schedule, grounding, and a first sensor pass.)

One framing note up front so the numbers are comparable: these come from three different runs, not one, and I’ll label which is which — a per-byte text/AV checkpoint (knockout + schedule), a new 61M I/O-symmetric multimodal model (grounding + the larger curriculum), and a tiny single-seed sensor probe. The input door is the published hsl-embedding-zero package in all of them (0 learned input params, verified bit-identical to the frozen substrate).

1. Modality boundary — channel-knockout matrix (per-byte checkpoint). Because the substrate enters at fixed addresses (Δ at dims 0–7, Δ² at 8–15, boundary at 16, Fourier at 17–24, phase at 25–26), I can zero a channel group at eval time on the trained checkpoint and read each modality’s reliance — no retraining. Δbpb when a group is knocked out (higher = more load-bearing):

| channel group | text | image | audio | caption |
|---|---|---|---|---|
| Δ (dxor) | 1.30 | 0.61 | 0.71 | 0.33 |
| Δ² (d2xor) | 2.19 | 0.11 | 0.21 | 0.27 |
| boundary | 0.01 | −0.01 | −0.05 | 0.01 |
| Fourier | 6.99 | 0.29 | 0.62 | 1.45 |
| phase | 0.18 | 0.47 | 0.38 | −0.05 |

It splits the way your numeric-locality argument predicted: text/caption lean hard on Fourier, image leans on Δ+phase, audio spreads over Δ+Fourier+phase. The boundary channel is ~0 everywhere — the honest negative: at a fixed patch size the model has no reason to consult “where a unit ends,” so it idles. I’d expect it to wake under adaptive (content-determined) slotting — now a falsifiable prediction rather than a hope.
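The knockout procedure above can be sketched in a few lines. This is a minimal illustration, not the author's evaluation code: the dimension ranges come from the post, while `knock_out`, `delta_bpb`, and the `eval_bpb` callable (anything that maps substrate features to bits per byte on a held-out set) are hypothetical names, and the toy scorer exists only so the snippet runs.

```python
# Fixed substrate addresses as listed in the post (end-exclusive ranges).
CHANNEL_GROUPS = {
    "delta": range(0, 8),       # Δ (dxor), dims 0-7
    "delta2": range(8, 16),     # Δ² (d2xor), dims 8-15
    "boundary": range(16, 17),  # dim 16
    "fourier": range(17, 25),   # dims 17-24
    "phase": range(25, 27),     # dims 25-26
}


def knock_out(features, group):
    """Return a copy of `features` with one channel group set to zero.

    `features` is a list of per-position vectors (lists of floats).
    """
    dims = set(CHANNEL_GROUPS[group])
    return [[0.0 if i in dims else x for i, x in enumerate(vec)] for vec in features]


def delta_bpb(eval_bpb, features, group):
    """Change in bits per byte when `group` is zeroed at eval time."""
    return eval_bpb(knock_out(features, group)) - eval_bpb(features)


# Toy stand-in for a trained checkpoint (hypothetical): it only "reads" the
# Fourier dims, so only the Fourier knockout changes its score.
def toy_eval_bpb(features):
    signal = sum(abs(vec[i]) for vec in features for i in CHANNEL_GROUPS["fourier"])
    return 8.0 if signal == 0 else 1.0


features = [[1.0] * 27 for _ in range(4)]
for name in CHANNEL_GROUPS:
    print(name, delta_bpb(toy_eval_bpb, features, name))
```

Because the knockout only edits inputs at eval time, one trained checkpoint yields the whole matrix: one forward pass per channel group and modality, plus the unmodified baseline.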

2. Modality boundary — extended to real sensor bytes (single-seed probe). You predicted HSL geometry should suit signal-like data more than symbolic text. I serialized real radar / lidar / depth (KITTI, RadarScenes, RaDICaL ADC) through the same frozen substrate with zero per-modality engineering. Next-byte bpb (dim512/d4/K8, single seed, 3k steps, not converged — text/radar still descending):

| | HSL (0 params) | learned (78k) | raw scalar |
|---|---|---|---|
| text | 2.82 | 2.86 | 5.98 |
| radar | 4.54 | 4.65 | 6.62 |
| lidar | 0.358 | 0.358 | 1.31 |

HSL ≈ learned across all three: every |Δ| is within single-seed noise, so I read it as “on par,” not ahead (on lidar the learned door actually edges HSL by 0.0003). Raw scalar is far worse everywhere. Cross-sensor binding also works: real radar ADC → camera depth, 4-way distance quartile 0.84 (chance 0.25, non-circular). Where it breaks: on that absolute-distance task the substrate (0.84) loses to a raw range profile (0.89). HSL is a change-rate substrate with no absolute-magnitude channel, so absolute-size tasks aren’t its strength. That points at a concrete v-next encoder card: add one coarse absolute channel without breaking losslessness. All of this is single-seed, toy-scale, and unconverged; the multi-seed and scale runs you asked for are still owed.
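For readers wondering where the 0.25 chance level comes from, here is a small sketch of a 4-way distance-quartile task. It assumes the labels are quartiles of the sample's own distance distribution, which the post does not spell out; `quartile_labels`, `accuracy`, and the distance values are hypothetical.

```python
import bisect
import statistics


def quartile_labels(distances):
    """Map each distance to a quartile index 0..3 using the sample's own cut points."""
    cuts = statistics.quantiles(distances, n=4)  # three cut points
    return [bisect.bisect_right(cuts, d) for d in distances]


def accuracy(predicted, labels):
    """Fraction of positions where the predicted class equals the label."""
    return sum(p == t for p, t in zip(predicted, labels)) / len(labels)


# Hypothetical ranges in metres, for illustration only.
distances = [float(d) for d in range(1, 101)]
labels = quartile_labels(distances)

# A constant guess lands on chance when the four classes are balanced.
always_nearest = [0] * len(labels)
print(accuracy(always_nearest, labels))
```

Quartile binning balances the four classes by construction, so any predictor that ignores the input sits at 0.25. That is the baseline the reported 0.84 and 0.89 are measured against.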

3. Schedule boundary — partial, two runs. I ran the zero door through full from-scratch curricula (not the 3k probe). Per-byte model: text bpb 1.674 → 1.531 → 1.578 across stages (the last rise is the chat/knowledge mix pulling on the text budget). A new 61M run with an I/O-symmetric output door (below): 2.25 → 1.965 → 1.999. Either way the zero door holds up over a long schedule on its own. The head-to-head you actually asked for — does a plain learned embedding catch up given the same long budget — I still haven’t run; that’s the clean next experiment and I’ll report the curve, not a point.

4. Grounding — the break, the fix, and the recovery. In my last reply I flagged that the full run broke disk-grounding: the gap (extra bits/byte when retrieved facts are swapped for wrong ones) collapsed from 1.835 on a learned-door model to ~0.007 under the zero door. Diagnosis: the zero door also zero-pads the retrieved memory features, so they dilute against the slot positional encoding and the model stops reading their content. Fix: keep the input door at 0 params, but give the memory path (retrieved knowledge ≠ input bytes) its own small learned projection. The recovery result, on the 61M multimodal-first curriculum:

| step | 1 | 1k | 2k | 3k | 4k | 5k | 6k | 7k | 8k |
|---|---|---|---|---|---|---|---|---|---|
| gap | 0.000 | 0.001 | 0.016 | 0.025 | 0.052 | 0.067 | 0.087 | 0.115 | 0.131 |

Monotonic, no collapse — the memory-path projection does restore disk reading. Three honest caveats: (1) 0.131 ≪ the old learned-door 1.835 — it reads the disk again, but modestly. (2) In an isolated knowledge-only probe the same fix overshot to ~1.7 and then overfit — it memorized the store and text bpb invaded; in the mixed run (knowledge = 20% of the batch) it stays at a stable 0.131, so the mix regularizes it and magnitude is regime-dependent. (3) clean isolated-vs-mixed and multi-seed comparisons are still pending. Net: the break is fixed in direction; the ceiling under a zero input door is an open number.
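The grounding gap defined above (extra bits per byte when retrieved facts are swapped for wrong ones) can be written out as follows. This is a sketch under assumptions: it takes summed negative log-likelihoods in nats as input, and `bits_per_byte`, `grounding_gap`, and `is_monotonic` are hypothetical helper names. The gap values are the ones reported in the table.

```python
import math


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (n_bytes * math.log(2))


def grounding_gap(nll_wrong_nats, nll_right_nats, n_bytes):
    """Extra bits per byte when the retrieved facts are replaced by wrong ones."""
    return bits_per_byte(nll_wrong_nats, n_bytes) - bits_per_byte(nll_right_nats, n_bytes)


def is_monotonic(values):
    """True if the sequence never decreases."""
    return all(b >= a for a, b in zip(values, values[1:]))


# Gap at steps 1, 1k, ..., 8k as reported in the post.
REPORTED_GAP = [0.000, 0.001, 0.016, 0.025, 0.052, 0.067, 0.087, 0.115, 0.131]

print(is_monotonic(REPORTED_GAP))
print(round(REPORTED_GAP[-1] / 1.835, 3))  # fraction of the old learned-door gap
```

A gap near zero means the model predicts equally well whether the retrieved facts are right or wrong, so it is not reading them. That is why the collapse to ~0.007 signalled a break.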

5. A note on the output head (not a scale claim). The model in the original post was 25M with a per-byte output. The 61M run above also packs the output the same way as the input — K-byte within-slot autoregressive, so input and output are both K-packed (I/O symmetric). It costs bpb (the per-byte numbers were lower), so “bigger” and “bundled output” are confounded here — I’m not reading 61M as a scale result. I flag it only because you separated input-door from output-head: this is a first look at giving the output the same zero-shaped K-packing, and the honest read is that the symmetry is a structural choice that costs bits rather than a win.
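As a rough picture of what K-packing with within-slot autoregression means, the sketch below splits a byte stream into K-byte slots and lists the per-slot prediction order. It rests on assumptions the post does not confirm: zero padding for the last slot, K=8 as in the sensor probe, and conditioning shown only within the slot (a real model would also condition on earlier slots). All function names are hypothetical.

```python
def pack_slots(data, k=8, pad=0):
    """Split a byte string into fixed-size slots of k bytes, padding the last one."""
    slots = []
    for start in range(0, len(data), k):
        chunk = list(data[start:start + k])
        chunk += [pad] * (k - len(chunk))
        slots.append(chunk)
    return slots


def within_slot_targets(slot):
    """Yield (prefix, target) pairs: byte j is predicted from bytes 0..j-1 of its slot."""
    for j, target in enumerate(slot):
        yield slot[:j], target


def unpack_slots(slots, n_bytes):
    """Flatten slots back into bytes and drop the padding."""
    flat = [b for slot in slots for b in slot]
    return bytes(flat[:n_bytes])


data = b"hello world"
slots = pack_slots(data)
print(len(slots), slots[-1])
print(list(within_slot_targets(slots[0]))[:2])
assert unpack_slots(slots, len(data)) == data
```

With the same packing on both sides, the sequence the transformer sees is K times shorter, but each output position has to account for K bytes. That is consistent with the post's observation that the symmetric output costs bits relative to a per-byte head.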

Still blank, in your order: schedule head-to-head (does a plain learned embedding catch up given the same long budget — the clean one), then hard-negative binding, then a real multi-seed scale sweep. Thanks again — the map keeps paying off, including where it says “this part doesn’t work yet.”


Read original on discuss.huggingface.co → discuss.huggingface.co/t/removing-the-embedding-…
