# Gradient-free Single-pass Model Beats nanoGPT on Shakespeare

> Source: <https://www.lesswrong.com/posts/rSiwKbisKmMhALpBk/gradient-free-single-pass-model-beats-nanogpt-on-shakespeare-1>
> Published: 2026-06-29 16:38:43+00:00

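EntropyBeam is a character-level language model that builds, in a single pass over the training data, one count table per context order (context length). Each table maps a character context of that length to next-character frequencies.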

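At prediction time, each order looks up the current context in its count table and produces a distribution over the vocabulary, smoothed with a symmetric Dirichlet prior of concentration α:

$$p_o(x) = \frac{c_o(x) + \alpha}{n_o + \alpha V}$$

where $c_o(x)$ is the count of token $x$ after the current order-$o$ context, $n_o$ is the total count for that context, and $V$ is the vocabulary size.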
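
A minimal sketch of the counting and smoothing steps, assuming plain add-α smoothing (the posterior mean under a symmetric Dirichlet prior). The function names, the toy text, the orders `(1, 2)`, and `alpha=0.5` are illustrative assumptions, not values taken from the post or the repository.

```python
from collections import Counter, defaultdict

def fit_counts(text, orders):
    """Single pass over text: per order, count next characters for each context."""
    tables = {o: defaultdict(Counter) for o in orders}
    for i in range(len(text)):
        for o in orders:  # orders are assumed to be >= 1
            if i >= o:
                tables[o][text[i - o:i]][text[i]] += 1
    return tables

def smoothed_dist(tables, order, context, vocab, alpha):
    """Add-alpha smoothed next-character distribution for one order."""
    counts = tables[order].get(context[-order:], Counter())
    n = sum(counts.values())            # total count for this context
    denom = n + alpha * len(vocab)
    return {ch: (counts[ch] + alpha) / denom for ch in vocab}

text = "abababac"                        # toy corpus (assumption)
vocab = sorted(set(text))
tables = fit_counts(text, orders=(1, 2))
seen = smoothed_dist(tables, 1, "a", vocab, alpha=0.5)
unseen = smoothed_dist(tables, 2, "cc", vocab, alpha=0.5)
print(seen)
print(unseen)                            # unseen context falls back to uniform
```

An unseen context has all counts at zero, so the smoothed distribution is uniform over the vocabulary.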

Each order receives a capacity score built from two terms. The first, concentration, is

$$\mathrm{concentration}_o = 1 - \frac{H(p_o)}{\ln V}$$

where $H(p_o)$ is the Shannon entropy of the smoothed distribution and $V$ is the vocabulary size. This is 1 when all mass is on one token and 0 when the distribution is uniform.

The second, reliability, is a function of $n_o$, the total count for the current context. It saturates toward 1 as evidence accumulates and is 0 when the context has not been observed.
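
As a rough sketch of the concentration term, assuming it is one minus the entropy divided by ln V, which is the form that matches the two endpoints described above. The function name is hypothetical. Reliability is left out because its exact formula is not stated here.

```python
import math

def concentration(probs):
    """1 - H(p)/ln(V): 0 for a uniform distribution, 1 for a one-hot one."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))

uniform = concentration([0.25, 0.25, 0.25, 0.25])
one_hot = concentration([1.0, 0.0, 0.0, 0.0])
peaked = concentration([0.85, 0.05, 0.05, 0.05])
print(uniform, one_hot, peaked)
```

With α > 0 a smoothed distribution never puts all mass on one token, so in practice concentration approaches 1 without reaching it.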

The capacity score itself is computed from the product of concentration and reliability. The capacity scores are converted to weights via softmax at temperature τ = 0.10:

$$w_o = \frac{\exp(\mathrm{capacity}_o / \tau)}{\sum_j \exp(\mathrm{capacity}_j / \tau)}$$

The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates. The final prediction is the weighted geometric mean of the per-order distributions:

$$p(x) \propto \prod_o p_o(x)^{w_o}$$

This was chosen deliberately to assign high probability to a token only when multiple weighted orders agree.
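
The routing and combination steps can be sketched as follows. This is an illustration under assumptions: the function names and example numbers are hypothetical, and the final renormalization to sum to 1 is assumed rather than confirmed by the post.

```python
import math

def route_weights(capacities, tau=0.10):
    """Softmax over capacity scores at temperature tau."""
    top = max(capacities)  # subtract the max for numerical stability
    exps = [math.exp((c - top) / tau) for c in capacities]
    total = sum(exps)
    return [e / total for e in exps]

def geometric_mix(dists, weights):
    """Weighted geometric mean of distributions, renormalized (assumption)."""
    size = len(dists[0])
    # Work in log space; smoothing keeps every probability strictly positive.
    log_mix = [
        sum(w * math.log(d[k]) for d, w in zip(dists, weights))
        for k in range(size)
    ]
    top = max(log_mix)
    unnorm = [math.exp(v - top) for v in log_mix]
    total = sum(unnorm)
    return [u / total for u in unnorm]

weights = route_weights([0.2, 0.5, 0.9])  # top weight is about 0.98
routed = geometric_mix([[0.5, 0.5], [0.6, 0.4], [0.9, 0.1]], weights)
even = geometric_mix([[0.5, 0.5], [0.9, 0.1]], [0.5, 0.5])
print(weights, routed, even)
```

With capacity gaps of a few tenths, τ = 0.10 already hands almost all of the weight to the top order, which is the near winner-take-all behavior described above.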

The model has four hyperparameters: the set of context orders, α, τ, and the reliability threshold (min_count = 1). These were selected by evaluating variants on the validation set.

Evaluation uses the nanoGPT shakespeare_char benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens, and a vocabulary size of 65.
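
For reference, validation loss in nats is the mean negative natural-log probability assigned to the true next character. The sketch below (function name hypothetical) also computes the loss of a uniform guess over 65 characters, which is close to the 4.189 that nanoGPT reports at step 0.

```python
import math

def mean_nll_nats(probs_of_true_chars):
    """Average negative log-likelihood using the natural log (nats)."""
    return -sum(math.log(p) for p in probs_of_true_chars) / len(probs_of_true_chars)

vocab_size = 65
uniform_loss = mean_nll_nats([1 / vocab_size] * 10)
print(round(uniform_loss, 3))  # 4.174
```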

EntropyBeam uses 0 trainable parameters, a single fit pass, and character-level input.

| Training tokens | Validation loss, nats | Contexts stored | Transitions stored |
|---|---|---|---|
| 1,000 | 2.954 | 5,495 | 6,388 |
| 3,000 | 2.654 | 14,670 | 17,176 |
| 10,000 | 2.482 | 44,092 | 51,835 |
| 30,000 | 2.289 | 120,043 | 140,961 |
| 100,000 | 2.193 | 346,462 | 405,119 |
| 300,000 | 1.990 | 919,897 | 1,071,750 |
| 1,003,854 | 1.596 | 2,753,581 | 3,199,496 |

nanoGPT uses 60,192 parameters, 2 layers, `n_embd=48`, `n_head=4`, `block_size=32`, `batch_size=16`, and AdamW with `lr=1e-3`, `wd=0.01`.
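
The 60,192 figure can be reproduced with a short count, assuming a GPT-2-style block with bias-free linear and LayerNorm layers, a 4x MLP width, an output head tied to the token embedding, and position embeddings included in the total. These architectural details are assumptions; the post only states the hyperparameters.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Parameter count for a bias-free GPT-2-style model with a tied output head."""
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
    attn = n_embd * 3 * n_embd + n_embd * n_embd            # fused QKV + output projection
    mlp = 2 * (n_embd * 4 * n_embd)                         # up- and down-projection
    norms = 2 * n_embd                                      # two LayerNorm weights per block
    final_norm = n_embd
    return embeddings + n_layer * (attn + mlp + norms) + final_norm

total = gpt_param_count(n_layer=2, n_embd=48, vocab_size=65, block_size=32)
print(total)  # 60192
```

The number of heads does not enter the count, since the heads only partition the same `n_embd`-wide projections.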

| Step | Tokens seen | Validation loss, nats |
|---|---|---|
| 0 | 0 | 4.189 |
| 300 | 153,600 | 2.507 |
| 600 | 307,200 | 2.409 |
| 1,200 | 614,400 | 2.262 |
| 1,800 | 921,600 | 2.162 |
| 2,400 | 1,228,800 | 2.096 |
| 3,000 | 1,536,000 | 2.065 |
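
The "Tokens seen" column follows from step × `batch_size` × `block_size`. The sketch below recomputes it and compares it with the 1,003,854-token training set; the variable names are illustrative.

```python
batch_size, block_size = 16, 32
train_tokens = 1_003_854
tokens_per_step = batch_size * block_size  # 512 tokens per optimizer step

tokens_seen = {step: step * tokens_per_step
               for step in (300, 600, 1200, 1800, 2400, 3000)}
for step, seen in tokens_seen.items():
    print(step, seen, round(seen / train_tokens, 2))
```

By step 3,000 the transformer has processed about 1.53 times as many tokens as the corpus contains, while EntropyBeam reads the corpus once.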

| Metric | EntropyBeam | nanoGPT | Ratio |
|---|---|---|---|
| Fit/train FLOPs | 0.009 G | 614 G | 68,000x |
| FLOPs per prediction | 4,500 | 133,000 | 30x |
| Total FLOPs to result | ~0.5 G | ~760 G | ~1,500x |
| Validation loss, nats | 1.596 | 2.065 | |
| Trainable parameters | 0 | 60,192 | |
| Wall time | 12 s | 26 s | |
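
The ratio column can be checked directly from the two FLOP columns. This is plain arithmetic on the reported figures; the dictionary layout is only for illustration.

```python
# (EntropyBeam, nanoGPT) FLOP figures as reported in the table
rows = {
    "fit/train FLOPs": (0.009e9, 614e9),
    "FLOPs per prediction": (4_500, 133_000),
    "total FLOPs to result": (0.5e9, 760e9),
}
ratios = {name: nano / beam for name, (beam, nano) in rows.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}x")
```

The unrounded values are about 68,222x, 30x, and 1,520x, which the table rounds to 68,000x, 30x, and ~1,500x.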

Per-decade improvement in EntropyBeam validation loss:

| Range | Change in loss, nats |
|---|---|
| 1K to 10K | -0.47 |
| 10K to 100K | -0.29 |
| 100K to 1M | -0.60 |
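
These per-decade changes are differences between rows of the EntropyBeam scaling table. A short check, using the reported losses and treating the full 1,003,854-token run as the 1M point:

```python
# Validation loss (nats) at each decade of training tokens, from the scaling table
losses = {1_000: 2.954, 10_000: 2.482, 100_000: 2.193, 1_003_854: 1.596}

sizes = sorted(losses)
deltas = [round(losses[hi] - losses[lo], 2) for lo, hi in zip(sizes, sizes[1:])]
print(deltas)  # [-0.47, -0.29, -0.6]
```

The largest gain comes in the last decade, so the loss curve has not flattened at the full corpus size.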

Storage is not directly comparable to a transformer's parameter count: at full training size EntropyBeam stores about 2.75M contexts and 3.2M transitions, compared to roughly 60K learned floats for the transformer. Either way, the fixed combination rule achieves lower validation cross-entropy on this corpus (1.596 vs. 2.065 nats) than the gradient-trained baseline.

The model was not compared against many different transformer baselines, but in limited testing it achieved next-token prediction accuracy similar to theirs on larger datasets.

The code is available at [https://github.com/zw5/beam](https://github.com/zw5/beam).