{"slug": "gradient-free-single-pass-model-beats-nanogpt-on-shakespeare", "title": "Gradient-free Single-pass Model Beats nanoGPT on Shakespeare", "summary": "A new character-level language model called EntropyBeam, using gradient-free count tables and a Dirichlet prior, achieved a validation loss of 1.596 nats on the Shakespeare character benchmark, outperforming nanoGPT's 2.065 nats while using zero trainable parameters and 1,500x fewer total FLOPs.", "body_md": "EntropyBeam is a character-level language model that computes count tables mapping character contexts to next-character frequencies, one table per context order.\n\nAt prediction time, each order $o$ looks up the current context in its count table and produces a distribution over the vocabulary, smoothed with a symmetric Dirichlet prior:\n\n$$p_{o,j} = \\frac{c_{o,j} + \\alpha}{n + \\alpha V}$$\n\nHere $c_{o,j}$ is the count of token $j$ after the current order-$o$ context, $n$ is the total count for that context, $V$ is the vocabulary size, and $\\alpha$ is the Dirichlet concentration.\n\nEach order receives a capacity score composed of two terms. The first is concentration:\n\n$$\\text{concentration}_o = 1 - \\frac{H(p_o)}{\\log V}$$\n\nwhere $H(p_o)$ is the Shannon entropy of the smoothed distribution. This is 1 when all mass is on one token and 0 when the distribution is uniform.\n\nThe second term, reliability, depends on $n$, the total count for the current context. It saturates toward 1 as evidence accumulates and is 0 when the context has not been observed.\n\nThe capacity $C_o$ is the product of concentration and reliability. The capacity scores are converted to weights via softmax at temperature τ = 0.10:\n\n$$w_o = \\frac{\\exp(C_o / \\tau)}{\\sum_j \\exp(C_j / \\tau)}$$\n\nwhere the sum runs over all orders $j$. The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates. The final prediction is the weighted geometric mean of the per-order distributions:\n\n$$p_j = \\prod_o p_{o,j}^{w_o}$$\n\nThe geometric mean was chosen deliberately: it assigns high probability to a token only when the weighted orders agree.\n\nThe model has four hyperparameters: the set of context orders, α, τ, and the reliability threshold (min_count = 1). 
These were selected by evaluating variants on the validation set.\n\nEvaluation uses the nanoGPT shakespeare_char benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens, and a vocabulary size of 65.\n\nEntropyBeam uses 0 trainable parameters, a single fit pass, and character-level input.\n\n| Training tokens | Validation loss (nats) | Contexts stored | Transitions stored |\n|---|---|---|---|\n| 1,000 | 2.954 | 5,495 | 6,388 |\n| 3,000 | 2.654 | 14,670 | 17,176 |\n| 10,000 | 2.482 | 44,092 | 51,835 |\n| 30,000 | 2.289 | 120,043 | 140,961 |\n| 100,000 | 2.193 | 346,462 | 405,119 |\n| 300,000 | 1.990 | 919,897 | 1,071,750 |\n| 1,003,854 | 1.596 | 2,753,581 | 3,199,496 |\n\nThe nanoGPT baseline uses 60,192 parameters, 2 layers, `n_embd=48`, `n_head=4`, `block_size=32`, `batch_size=16`, and AdamW with `lr=1e-3` and `wd=0.01`.\n\n| Step | Tokens seen | Validation loss (nats) |\n|---|---|---|\n| 0 | 0 | 4.189 |\n| 300 | 153,600 | 2.507 |\n| 600 | 307,200 | 2.409 |\n| 1,200 | 614,400 | 2.262 |\n| 1,800 | 921,600 | 2.162 |\n| 2,400 | 1,228,800 | 2.096 |\n| 3,000 | 1,536,000 | 2.065 |\n\n| Metric | EntropyBeam | nanoGPT | Ratio |\n|---|---|---|---|\n| Fit/train FLOPs | 0.009 G | 614 G | 68,000x |\n| FLOPs per prediction | 4,500 | 133,000 | 30x |\n| Total FLOPs to result | ~0.5 G | ~760 G | ~1,500x |\n| Validation loss (nats) | 1.596 | 2.065 | |\n| Trainable parameters | 0 | 60,192 | |\n| Wall time | 12 s | 26 s | |\n\nPer-decade improvement in EntropyBeam validation loss:\n\n| Training-token range | Change in loss (nats) |\n|---|---|\n| 1K to 10K | -0.47 |\n| 10K to 100K | -0.29 |\n| 100K to 1M | -0.60 |\n\nStorage is not directly comparable to a transformer's parameter count: EntropyBeam stores about 2.75M contexts and 3.2M transitions, compared to about 60k learned floats for the transformer. 
Either way, the fixed combination rule achieves lower validation cross-entropy than the gradient-trained baseline on this corpus (1.596 vs. 2.065 nats).\n\nThe model has not been compared against a broad range of transformer baselines; in limited testing, it achieved similar next-token prediction accuracy on larger datasets.\n\nThe code is available at [https://github.com/zw5/beam](https://github.com/zw5/beam)", "url": "https://wpnews.pro/news/gradient-free-single-pass-model-beats-nanogpt-on-shakespeare", "canonical_source": "https://www.lesswrong.com/posts/rSiwKbisKmMhALpBk/gradient-free-single-pass-model-beats-nanogpt-on-shakespeare-1", "published_at": "2026-06-29 16:38:43+00:00", "updated_at": "2026-06-29 17:04:51.620382+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "natural-language-processing"], "entities": ["EntropyBeam", "nanoGPT", "Shakespeare", "AdamW"], "alternates": {"html": "https://wpnews.pro/news/gradient-free-single-pass-model-beats-nanogpt-on-shakespeare", "markdown": "https://wpnews.pro/news/gradient-free-single-pass-model-beats-nanogpt-on-shakespeare.md", "text": "https://wpnews.pro/news/gradient-free-single-pass-model-beats-nanogpt-on-shakespeare.txt", "jsonld": "https://wpnews.pro/news/gradient-free-single-pass-model-beats-nanogpt-on-shakespeare.jsonld"}}