A bigger update — data for several more of the boundaries you mapped, plus the places it breaks. (Geometry/K/implementation were in my last reply; this adds modality, schedule, grounding, and a first sensor pass.)
One framing note up front so the numbers are comparable: these come from three different runs, not one, and I’ll label which is which. They are a per-byte text/AV checkpoint (knockout + schedule), a new 61M I/O-symmetric multimodal model (grounding + the larger curriculum), and a tiny single-seed sensor probe. The input door is the published hsl-embedding-zero package in all of them (0 learned input params, verified bit-identical to the frozen substrate).
1. Modality boundary: channel-knockout matrix (per-byte checkpoint). Because the substrate enters at fixed addresses (Δ at dims 0–7, Δ² at 8–15, boundary at 16, Fourier at 17–24, phase at 25–26), I can zero a channel group at eval time on the trained checkpoint and read each modality’s reliance, with no retraining. Δbpb when a group is knocked out (higher = more load-bearing):
| channel group | text | image | audio | caption |
|---|---|---|---|---|
| Δ (dxor) | 1.30 | 0.61 | 0.71 | 0.33 |
| Δ² (d2xor) | 2.19 | 0.11 | 0.21 | 0.27 |
| boundary | 0.01 | −0.01 | −0.05 | 0.01 |
| Fourier | 6.99 | 0.29 | 0.62 | 1.45 |
| phase | 0.18 | 0.47 | 0.38 | −0.05 |
It splits the way your numeric-locality argument predicted: text/caption lean hard on Fourier, image leans on Δ+phase, audio spreads over Δ+Fourier+phase. The boundary channel is ~0 everywhere — the honest negative: at a fixed patch size the model has no reason to consult “where a unit ends,” so it idles. I’d expect it to wake under adaptive (content-determined) slotting — now a falsifiable prediction rather than a hope.
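For anyone who wants to repeat the knockout, here is a minimal sketch of the eval-time zeroing. The dim ranges are the ones stated above; everything else is an assumption for illustration: the names `CHANNEL_GROUPS`, `knock_out`, and `knockout_deltas` are hypothetical, and `eval_bpb` stands in for whatever function scores a batch of substrate features in bits per byte on the trained checkpoint.

```python
import numpy as np

# Fixed substrate addresses as listed in the text (inclusive ranges -> slices).
CHANNEL_GROUPS = {
    "delta": slice(0, 8),       # Δ at dims 0-7
    "delta2": slice(8, 16),     # Δ² at dims 8-15
    "boundary": slice(16, 17),  # boundary at dim 16
    "fourier": slice(17, 25),   # Fourier at dims 17-24
    "phase": slice(25, 27),     # phase at dims 25-26
}

def knock_out(features, group):
    """Return a copy of `features` with one channel group zeroed on the last axis."""
    out = np.array(features, dtype=float, copy=True)
    out[..., CHANNEL_GROUPS[group]] = 0.0
    return out

def knockout_deltas(eval_bpb, features):
    """Δbpb per group: bpb with the group zeroed minus bpb on intact features.

    `eval_bpb` is a hypothetical callable (features -> bits per byte);
    it is not part of any published API.
    """
    base = eval_bpb(features)
    return {g: eval_bpb(knock_out(features, g)) - base for g in CHANNEL_GROUPS}

# Toy stand-in scorer: pretends the model only reads the Fourier dims.
toy_bpb = lambda f: 10.0 - float(f[..., 17:25].mean())
toy_features = np.ones((2, 3, 27))
toy_deltas = knockout_deltas(toy_bpb, toy_features)
```

With the toy scorer, only the `fourier` entry of `toy_deltas` is nonzero. On a real checkpoint one such dictionary per modality would fill one column of the table.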
2. Modality boundary — extended to real sensor bytes (single-seed probe). You predicted HSL geometry should suit signal-like data more than symbolic text. I serialized real radar / lidar / depth (KITTI, RadarScenes, RaDICaL ADC) through the same frozen substrate with zero per-modality engineering. Next-byte bpb (dim512/d4/K8, single seed, 3k steps, not converged — text/radar still descending):
| modality | HSL (0 params) | learned (78k) | raw scalar |
|---|---|---|---|
| text | 2.82 | 2.86 | 5.98 |
| radar | 4.54 | 4.65 | 6.62 |
| lidar | 0.358 | 0.358 | 1.31 |
HSL ≈ learned across all three: every |Δ| is within single-seed noise, so I read it as “on par,” not ahead (on lidar the learned door actually edges HSL by 0.0003). Raw scalar is far worse everywhere. Cross-sensor binding also works: real radar ADC → camera depth, 4-way distance-quartile accuracy 0.84 (chance 0.25, non-circular). Where it breaks: on that absolute-distance task the substrate (0.84) loses to a raw range profile (0.89). HSL is a change-rate substrate with no absolute-magnitude channel, so absolute-size tasks aren’t its strength. That points at a concrete v-next encoder card: add one coarse absolute channel without breaking losslessness. All of this is single-seed, toy-scale, and unconverged, so the multi-seed and scale runs you asked for are still owed.
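To make the 0.25 chance level concrete, here is a small sketch of how a 4-way distance-quartile target can be built. It assumes the labels come from splitting the depth distances at their 25th/50th/75th percentiles, which is my reading of “distance quartile” and not something confirmed above; `quartile_labels` and `accuracy` are hypothetical helper names.

```python
import numpy as np

def quartile_labels(distances):
    """Map each distance to a class 0..3 by its quartile within the set."""
    d = np.asarray(distances, dtype=float)
    edges = np.quantile(d, [0.25, 0.5, 0.75])  # three interior cut points
    return np.digitize(d, edges)               # 0 = nearest quartile, 3 = farthest

def accuracy(pred, target):
    """Fraction of positions where the predicted class matches the target."""
    return float(np.mean(np.asarray(pred) == np.asarray(target)))

# Toy distances: 100 evenly spaced values give four balanced classes.
toy_dist = np.arange(1, 101)
toy_labels = quartile_labels(toy_dist)
constant_guess = np.zeros_like(toy_labels)
chance = accuracy(constant_guess, toy_labels)
```

Because the classes are balanced by construction, a constant or uniformly random guess lands at 0.25, which is the baseline the 0.84 and 0.89 figures are measured against.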
3. Schedule boundary — partial, two runs. I ran the zero door through full from-scratch curricula (not the 3k probe). Per-byte model: text bpb 1.674 → 1.531 → 1.578 across stages (the last rise is the chat/knowledge mix pulling on the text budget). A new 61M run with an I/O-symmetric output door (below): 2.25 → 1.965 → 1.999. Either way the zero door holds up over a long schedule on its own. The head-to-head you actually asked for — does a plain learned embedding catch up given the same long budget — I still haven’t run; that’s the clean next experiment and I’ll report the curve, not a point.
4. Grounding — the break, the fix, and the recovery. In my last reply I flagged that the full run broke disk-grounding: the gap (extra bits/byte when retrieved facts are swapped for wrong ones) collapsed from 1.835 on a learned-door model to ~0.007 under the zero door. Diagnosis: the zero door also zero-pads the retrieved memory features, so they dilute against the slot positional encoding and the model stops reading their content. Fix: keep the input door at 0 params, but give the memory path (retrieved knowledge ≠ input bytes) its own small learned projection. The recovery result, on the 61M multimodal-first curriculum:
| step | 1 | 1k | 2k | 3k | 4k | 5k | 6k | 7k | 8k |
|---|---|---|---|---|---|---|---|---|---|
| gap | 0.000 | 0.001 | 0.016 | 0.025 | 0.052 | 0.067 | 0.087 | 0.115 | 0.131 |
Monotonic, no collapse: the memory-path projection does restore disk reading. Three honest caveats: (1) 0.131 ≪ the old learned-door 1.835, so it reads the disk again, but modestly. (2) In an isolated knowledge-only probe the same fix overshot to ~1.7 and then overfit: it memorized the store and text bpb degraded. In the mixed run (knowledge = 20% of the batch) it stays at a stable 0.131, so the mix regularizes it and the magnitude is regime-dependent. (3) Clean isolated-vs-mixed and multi-seed comparisons are still pending. Net: the break is fixed in direction; the ceiling under a zero input door is an open number.
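For reference, the gap metric as defined in this section (extra bits/byte when retrieved facts are swapped for wrong ones) can be sketched as below. It assumes the model reports a summed negative log-likelihood in nats over the scored bytes; the function names are hypothetical and the numbers in the usage lines are made up for illustration, not results.

```python
import math

def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed NLL in nats over n_bytes into bits per byte."""
    return total_nll_nats / (n_bytes * math.log(2))

def grounding_gap(nll_wrong_nats, nll_right_nats, n_bytes):
    """bpb with wrong retrieved facts minus bpb with the correct ones.

    Near 0 means the model ignores the retrieved content;
    larger means it is actually reading the disk.
    """
    return (bits_per_byte(nll_wrong_nats, n_bytes)
            - bits_per_byte(nll_right_nats, n_bytes))

# Illustrative only: 100 scored bytes, 2 bits/byte with wrong facts, 1 with right.
toy_gap = grounding_gap(2 * 100 * math.log(2), 1 * 100 * math.log(2), 100)
```

Both passes need to score the same target bytes, so that the only thing changing between them is the retrieved memory.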
5. A note on the output head (not a scale claim). The model in the original post was 25M with a per-byte output. The 61M run above also packs the output the same way as the input — K-byte within-slot autoregressive, so input and output are both K-packed (I/O symmetric). It costs bpb (the per-byte numbers were lower), so “bigger” and “bundled output” are confounded here — I’m not reading 61M as a scale result. I flag it only because you separated input-door from output-head: this is a first look at giving the output the same zero-shaped K-packing, and the honest read is that the symmetry is a structural choice that costs bits rather than a win.
Still blank, in your order: schedule head-to-head (does a plain learned embedding catch up given the same long budget — the clean one), then hard-negative binding, then a real multi-seed scale sweep. Thanks again — the map keeps paying off, including where it says “this part doesn’t work yet.”