Beam is a character-level language model built on count tables, one per context order, that map character contexts to next-character frequencies.
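At prediction time, each order looks up the current context in its count table and produces a distribution over the vocabulary, smoothed with a symmetric Dirichlet prior:

$$p_o(x) = \frac{c_o(x) + \alpha}{n_o + \alpha V}$$

where $c_o(x)$ is the count of token $x$ after the current context at order $o$, $n_o$ is the total count for that context, $\alpha$ is the prior concentration, and $V$ is the vocabulary size.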
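
As a minimal sketch of the lookup and smoothing step, the following Python builds a count table for one order and returns the additively smoothed distribution for a context, which is the posterior mean under a symmetric Dirichlet prior. The function names (`fit_counts`, `smoothed_dist`), the toy corpus, and the value `alpha=0.1` are illustrative assumptions, not taken from the Beam code.

```python
from collections import Counter, defaultdict

def fit_counts(text, order):
    """Map each length-`order` context to a Counter of next characters."""
    table = defaultdict(Counter)
    for i in range(order, len(text)):
        table[text[i - order:i]][text[i]] += 1
    return table

def smoothed_dist(table, context, vocab, alpha=0.1):
    """Additive smoothing: (count + alpha) / (n + alpha * V) for each token."""
    counts = table.get(context, Counter())  # .get avoids creating new keys
    n = sum(counts.values())
    denom = n + alpha * len(vocab)
    return {ch: (counts[ch] + alpha) / denom for ch in vocab}

text = "abracadabra"            # toy corpus (assumption)
vocab = sorted(set(text))       # 5 distinct characters
table = fit_counts(text, order=2)
dist = smoothed_dist(table, "ab", vocab)      # "ab" is always followed by "r"
unseen = smoothed_dist(table, "zz", vocab)    # unseen context falls back to uniform
```

An unseen context has zero counts, so the smoothed distribution reduces to the uniform prior.
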
Each order receives a capacity score composed of two terms, concentration and reliability. Concentration is

$$\text{concentration} = 1 - \frac{H(p_o)}{\log V}$$

where $H(p_o)$ is the Shannon entropy of the smoothed distribution and $V$ is the vocabulary size. This is 1 when all mass is on one token and 0 when the distribution is uniform.

Reliability is a function of $n_o$, the total count for the current context at order $o$. It saturates toward 1 as evidence accumulates and is 0 when the context has not been observed.
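
A minimal sketch of the concentration term, assuming the entropy is normalized by log V so that the stated endpoints hold (1 for a one-hot distribution, 0 for a uniform one). The function name `concentration` is hypothetical. Only concentration is sketched, since the exact form of the reliability term is not given here.

```python
import math

def concentration(probs):
    """1 - H(p) / log(V), with H the Shannon entropy in nats and V = len(probs)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))

one_hot = concentration([1.0, 0.0, 0.0, 0.0])       # all mass on one token
uniform = concentration([0.25, 0.25, 0.25, 0.25])   # maximally spread out
peaked = concentration([0.84, 0.04, 0.04, 0.04, 0.04])
```

With a positive smoothing constant the smoothed distribution is never exactly one-hot, so in practice the score stays strictly below 1.
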
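The capacity score $C_o$ of each order is the product of its concentration and reliability. The capacity scores are converted to weights via softmax at temperature τ = 0.10:

$$w_o = \frac{\exp(C_o / \tau)}{\sum_j \exp(C_j / \tau)}$$

where $j$ ranges over the context orders.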
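
To illustrate how sharp the routing is at τ = 0.10, here is a small softmax sketch. The three capacity values are hypothetical, chosen only to show the effect, and `softmax_weights` is an illustrative name.

```python
import math

def softmax_weights(scores, tau=0.10):
    """Softmax over scores at temperature tau, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

capacities = [0.10, 0.45, 0.80]          # hypothetical capacity scores for three orders
weights = softmax_weights(capacities)     # sharp routing at tau = 0.10
soft = softmax_weights(capacities, tau=1.0)  # much flatter at tau = 1.0
```

For these scores the highest-capacity order receives roughly 97% of the weight at τ = 0.10, whereas at τ = 1.0 the weights are far more evenly spread.
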
The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates. The final prediction is the weighted geometric mean of the per-order distributions:

$$P(x) \propto \prod_o p_o(x)^{w_o}$$

where $p_o$ is the smoothed distribution of order $o$ and $w_o$ is its routing weight. The geometric mean was chosen deliberately: it assigns high probability to a token only when multiple weighted orders agree.
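
A weighted geometric mean is most easily computed in log space. The sketch below assumes the result is renormalized over the vocabulary and that every per-order probability is strictly positive, which smoothing with a positive constant guarantees. The name `geometric_mix` and the toy distributions are illustrative.

```python
import math

def geometric_mix(dists, weights):
    """Weighted geometric mean of distributions, renormalized to sum to 1."""
    vocab = list(dists[0])
    # Weighted sum of log-probabilities per token.
    log_mix = {ch: sum(w * math.log(d[ch]) for d, w in zip(dists, weights))
               for ch in vocab}
    m = max(log_mix.values())                      # shift for numerical stability
    unnorm = {ch: math.exp(v - m) for ch, v in log_mix.items()}
    z = sum(unnorm.values())
    return {ch: u / z for ch, u in unnorm.items()}

p1 = {"a": 0.8, "b": 0.1, "c": 0.1}   # order 1 favours "a"
p2 = {"a": 0.1, "b": 0.8, "c": 0.1}   # order 2 favours "b"
disagree = geometric_mix([p1, p2], [0.5, 0.5])
agree = geometric_mix([p1, p1], [0.5, 0.5])
routed = geometric_mix([p1, p2], [1.0, 0.0])  # winner-take-all limit
```

When the two orders disagree, neither favoured token reaches the 0.8 it gets when they agree. With all weight on one order, the mix reduces to that order's distribution.
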
The model has four hyperparameters: the set of context orders, α, τ, and the reliability threshold (min_count = 1). These were selected by evaluating variants on the validation set.
Evaluation uses the nanoGPT shakespeare_char benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens, and a vocabulary size of 65.
EntropyBeam uses 0 trainable parameters, a single fit pass, and character-level input.
| Training tokens | Validation loss, nats | Contexts stored | Transitions stored |
|---|---|---|---|
| 1,000 | 2.954 | 5,495 | 6,388 |
| 3,000 | 2.654 | 14,670 | 17,176 |
| 10,000 | 2.482 | 44,092 | 51,835 |
| 30,000 | 2.289 | 120,043 | 140,961 |
| 100,000 | 2.193 | 346,462 | 405,119 |
| 300,000 | 1.990 | 919,897 | 1,071,750 |
| 1,003,854 | 1.596 | 2,753,581 | 3,199,496 |
nanoGPT uses 60,192 parameters, 2 layers, `n_embd=48`, `n_head=4`, `block_size=32`, `batch_size=16`, and AdamW with `lr=1e-3` and `wd=0.01`.
| Step | Tokens seen | Validation loss, nats |
|---|---|---|
| 0 | 0 | 4.189 |
| 300 | 153,600 | 2.507 |
| 600 | 307,200 | 2.409 |
| 1,200 | 614,400 | 2.262 |
| 1,800 | 921,600 | 2.162 |
| 2,400 | 1,228,800 | 2.096 |
| 3,000 | 1,536,000 | 2.065 |
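
The "Tokens seen" column is consistent with each step processing one batch of batch_size × block_size = 16 × 32 = 512 tokens. The sketch below reproduces the column under that assumption (one batch per step, no gradient accumulation). The helper name `tokens_seen` is illustrative.

```python
def tokens_seen(step, batch_size=16, block_size=32):
    """Tokens processed after `step` optimizer steps, assuming one batch per step."""
    return step * batch_size * block_size

steps = (0, 300, 600, 1200, 1800, 2400, 3000)
rows = {step: tokens_seen(step) for step in steps}
for step, tokens in rows.items():
    print(f"{step:>5} {tokens:>10,}")
```
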
| Metric | EntropyBeam | nanoGPT | Ratio |
|---|---|---|---|
| Fit/train FLOPs | 0.009 G | 614 G | 68,000x |
| FLOPs per prediction | 4,500 | 133,000 | 30x |
| Total FLOPs to result | ~0.5 G | ~760 G | ~1,500x |
| Validation loss, nats | 1.596 | 2.065 | |
| Trainable parameters | 0 | 60,192 | |
| Wall time | 12s | 26s | |
Per-decade improvement in validation loss.
| Range | Change in loss, nats |
|---|---|
| 1K to 10K | -0.47 |
| 10K to 100K | -0.29 |
| 100K to 1M | -0.60 |
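
The per-decade changes follow directly from the EntropyBeam scaling table. As a quick check, this sketch subtracts the validation losses at consecutive decade points, using the values copied from that table. The variable names are illustrative.

```python
# Validation loss in nats at each decade of training tokens (from the scaling table).
val_loss = {1_000: 2.954, 10_000: 2.482, 100_000: 2.193, 1_003_854: 1.596}

sizes = sorted(val_loss)
deltas = [round(val_loss[b] - val_loss[a], 2) for a, b in zip(sizes, sizes[1:])]
for (a, b), d in zip(zip(sizes, sizes[1:]), deltas):
    print(f"{a:>9,} to {b:>9,}: {d:+.2f}")
```

The largest drop occurs in the last decade, from 100K to about 1M tokens.
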
Storage is not directly comparable to a transformer's parameter count. At the full training size, EntropyBeam stores about 2.75M contexts and 3.2M transitions, compared to about 60K learned floats for the transformer. Either way, the fixed combination rule achieves lower validation cross-entropy on this benchmark than learned optimization (1.596 vs. 2.065 nats).
The model has not been compared against a broad set of transformer baselines. In limited testing on larger datasets, it achieved next-token prediction accuracy similar to that of the transformer baselines it was tested against.
The code is available at https://github.com/zw5/beam.