Gradient-free Single-pass Model Beats nanoGPT on Shakespeare

A new character-level language model called EntropyBeam, using gradient-free count tables and a Dirichlet prior, achieved a validation loss of 1.596 nats on the Shakespeare character benchmark, outperforming nanoGPT's 2.065 nats while using zero trainable parameters and 1,500x fewer total FLOPs.

EntropyBeam is a character-level language model built from count tables. For each context order (context length) in a fixed set, a count table maps character contexts to next-character frequencies. At prediction time, each order $o$ looks up the current context in its count table and produces a distribution over the vocabulary, smoothed with a symmetric Dirichlet prior:

$$p_o(x) = \frac{N_o(x) + \alpha}{\sum_j N_o(j) + \alpha V}$$

where $N_o(x)$ is the number of times character $x$ followed the current order-$o$ context and $V$ is the vocabulary size.

Each order receives a capacity score built from two terms. The first is concentration:

$$\kappa_o = 1 - \frac{H(p_o)}{\log V}$$

where $H(p_o)$ is the Shannon entropy of the smoothed distribution. This is 1 when all mass is on one token and 0 when the distribution is uniform.

The second is reliability, $r_o$, which depends on $n$, the total count for the current context. It saturates toward 1 as evidence accumulates and is 0 when the context has not been observed.

The capacity score $s_o$ is computed from the product of concentration and reliability, $\kappa_o r_o$. The capacity scores are converted to weights via softmax at temperature $\tau = 0.10$:

$$w_o = \frac{\exp(s_o / \tau)}{\sum_j \exp(s_j / \tau)}$$

The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates.

The final prediction is the weighted geometric mean of the per-order distributions:

$$p(x) \propto \prod_o p_o(x)^{w_o}$$

This rule was chosen deliberately: it assigns high probability to a token only when the weighted orders agree.

The model has four hyperparameters: the set of context orders, $\alpha$, $\tau$, and the reliability threshold `min_count = 1`. These were selected by evaluating variants on the validation set.

Evaluation uses the nanoGPT `shakespeare_char` benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens, and a vocabulary size of 65. EntropyBeam uses 0 trainable parameters, a single fit pass, and character-level input.
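The prediction rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the linked repository: the function names (`fit`, `reliability`, `predict`), the order set `(1, 2, 3)`, and `alpha=0.1` are hypothetical, and the reliability form `n / (n + min_count)` is only one function that is 0 for unseen contexts and saturates toward 1, since the text does not pin down the exact formula.

```python
import math
from collections import Counter, defaultdict


def fit(text, orders):
    """Single pass over the text: count next characters for each context order."""
    tables = {k: defaultdict(Counter) for k in orders}
    for i, ch in enumerate(text):
        for k in orders:
            if i >= k:
                tables[k][text[i - k:i]][ch] += 1
    return tables


def reliability(n, min_count=1):
    # ASSUMPTION: illustrative form only. It is 0 when n == 0 and
    # approaches 1 as n grows, matching the description in the text.
    return n / (n + min_count)


def predict(tables, history, vocab, alpha=0.1, tau=0.10):
    """Return a dict mapping each character in vocab to its probability."""
    V = len(vocab)  # must be at least 2 so that log(V) > 0
    log_dists, scores = [], []
    for k, table in tables.items():
        if len(history) < k:
            continue  # not enough history for this order
        counts = table.get(history[len(history) - k:], {})
        n = sum(counts.values())
        # Symmetric Dirichlet smoothing of the per-order distribution.
        p = [(counts.get(c, 0) + alpha) / (n + alpha * V) for c in vocab]
        entropy = -sum(q * math.log(q) for q in p)
        concentration = 1.0 - entropy / math.log(V)
        scores.append(concentration * reliability(n))
        log_dists.append([math.log(q) for q in p])
    if not scores:
        raise ValueError("history is shorter than every context order")

    # Softmax over capacity scores at temperature tau (max-shifted for stability).
    top = max(scores)
    exps = [math.exp((s - top) / tau) for s in scores]
    weights = [e / sum(exps) for e in exps]

    # Weighted geometric mean, computed in log space and then normalized.
    log_mix = [sum(w * ld[i] for w, ld in zip(weights, log_dists))
               for i in range(V)]
    peak = max(log_mix)
    unnorm = [math.exp(v - peak) for v in log_mix]
    total = sum(unnorm)
    return {c: u / total for c, u in zip(vocab, unnorm)}


# Toy usage on a tiny corpus (not the Shakespeare benchmark).
tables = fit("abababab", orders=(1, 2, 3))
dist = predict(tables, "ab", vocab="ab")
```

Working in log space keeps the geometric mean numerically stable, and because `alpha > 0` every smoothed probability is strictly positive, so the logarithms are always defined.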
EntropyBeam validation loss by training-set size:

| Training tokens | Validation loss, nats | Contexts stored | Transitions stored |
|---|---|---|---|
| 1,000 | 2.954 | 5,495 | 6,388 |
| 3,000 | 2.654 | 14,670 | 17,176 |
| 10,000 | 2.482 | 44,092 | 51,835 |
| 30,000 | 2.289 | 120,043 | 140,961 |
| 100,000 | 2.193 | 346,462 | 405,119 |
| 300,000 | 1.990 | 919,897 | 1,071,750 |
| 1,003,854 | 1.596 | 2,753,581 | 3,199,496 |

nanoGPT uses 60,192 parameters, 2 layers, `n_embd=48`, `n_head=4`, `block_size=32`, `batch_size=16`, and AdamW with `lr=1e-3`, `wd=0.01`.

| Step | Tokens seen | Validation loss, nats |
|---|---|---|
| 0 | 0 | 4.189 |
| 300 | 153,600 | 2.507 |
| 600 | 307,200 | 2.409 |
| 1,200 | 614,400 | 2.262 |
| 1,800 | 921,600 | 2.162 |
| 2,400 | 1,228,800 | 2.096 |
| 3,000 | 1,536,000 | 2.065 |

Head-to-head comparison:

| Metric | EntropyBeam | nanoGPT | Ratio |
|---|---|---|---|
| Fit/train FLOPs | 0.009 G | 614 G | 68,000x |
| FLOPs per prediction | 4,500 | 133,000 | 30x |
| Total FLOPs to result | ~0.5 G | ~760 G | ~1,500x |
| Validation loss, nats | 1.596 | 2.065 | |
| Trainable parameters | 0 | 60,192 | |
| Wall time | 12 s | 26 s | |

Per-decade improvement in EntropyBeam validation loss:

| Range (training tokens) | Change in loss, nats |
|---|---|
| 1K to 10K | -0.47 |
| 10K to 100K | -0.29 |
| 100K to 1M | -0.60 |

Storage is not directly comparable to a transformer's parameter count. At full training size, EntropyBeam stores about 2.75M contexts and 3.2M transitions, compared to 60,192 learned floats for the transformer. Either way, the fixed combination rule achieves lower cross-entropy than learned optimization on this corpus.

The model was not compared against many different transformer baselines, but in limited testing it achieved similar next-token prediction accuracy on larger datasets.

The code is available at https://github.com/zw5/beam
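As a sanity check on the reported tables, the per-decade changes and FLOP ratios can be recomputed from the raw numbers. The snippet below is a small sketch that only re-derives figures already stated in the tables; the helper names (`per_decade_changes`, `nats_to_bits`) are hypothetical and not part of the project.

```python
import math

# Validation losses (nats) copied from the EntropyBeam table, one per decade.
LOSS_BY_TOKENS = {1_000: 2.954, 10_000: 2.482, 100_000: 2.193, 1_003_854: 1.596}


def per_decade_changes(losses):
    """Change in loss between consecutive training-set sizes, rounded to 2 places."""
    sizes = sorted(losses)
    return [round(losses[b] - losses[a], 2) for a, b in zip(sizes, sizes[1:])]


def nats_to_bits(nats):
    """Convert a cross-entropy in nats to bits per character."""
    return nats / math.log(2)


changes = per_decade_changes(LOSS_BY_TOKENS)

# FLOP ratios (nanoGPT / EntropyBeam) from the comparison table.
fit_ratio = 614 / 0.009              # fit/train FLOPs, both in GFLOPs
per_prediction_ratio = 133_000 / 4_500
total_ratio = 760 / 0.5              # both totals are approximate in the source

# The two headline losses expressed in bits per character.
entropybeam_bits = nats_to_bits(1.596)
nanogpt_bits = nats_to_bits(2.065)
```

Note that the 100K to 1M step is the largest of the three improvements, so the loss curve had not flattened at the full training-set size.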