Gradient-free Single-pass Model Beats nanoGPT on Shakespeare

wpnews.pro


Source: lesswrong.com · Published: 2026-06-29T16:38Z · Topic: large-language-models

A new character-level language model called EntropyBeam, using gradient-free count tables and a Dirichlet prior, achieved a validation loss of 1.596 nats on the Shakespeare character benchmark, outperforming nanoGPT's 2.065 nats while using zero trainable parameters and 1,500x fewer total FLOPs.


EntropyBeam is a character-level language model. In a single pass over the training text, it builds count tables mapping character contexts to next-character frequencies, one table per context order (context length).
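A minimal sketch of such a fitting pass, assuming each table is a dictionary from a context string to a counter of next characters. The function name `fit_count_tables`, the orders `(1, 2)`, and the toy string are illustrative and not taken from the released code.

```python
from collections import Counter, defaultdict

def fit_count_tables(text, orders):
    """Build one table per order: context string -> Counter of next characters."""
    tables = {o: defaultdict(Counter) for o in orders}
    for o in orders:
        for i in range(o, len(text)):
            context = text[i - o:i]          # the o characters before position i
            tables[o][context][text[i]] += 1  # count the character that follows
    return tables

# Toy corpus and orders, chosen only for illustration.
tables = fit_count_tables("abracadabra", orders=(1, 2))
print(tables[1]["a"])   # next-character counts after the context "a"
print(tables[2]["ab"])  # next-character counts after the context "ab"
```

No gradients are involved: fitting is counting, which is why a single pass over the data suffices.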

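At prediction time, each order looks up the current context in its count table and produces a distribution over the vocabulary, smoothed with a symmetric Dirichlet prior:

$$p_o(x) = \frac{c_o(x) + \alpha}{n_o + \alpha V}$$

where $c_o(x)$ is the count of token $x$ after the current order-$o$ context, $n_o$ is the total count for that context, $V$ is the vocabulary size, and $\alpha$ is the prior strength.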
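As a sketch, assuming the usual additive form of symmetric Dirichlet smoothing (add α to every count, then normalize), the per-order distribution can be computed as follows. The helper name, the four-letter vocabulary, and `alpha=0.5` are hypothetical; the article does not state the value of α.

```python
from collections import Counter

def smoothed_distribution(counts, vocab, alpha):
    """Additive (symmetric Dirichlet) smoothing of next-character counts."""
    n = sum(counts.values())             # total count for this context
    denom = n + alpha * len(vocab)
    return {ch: (counts.get(ch, 0) + alpha) / denom for ch in vocab}

vocab = ["a", "b", "c", "d"]             # toy vocabulary, for illustration only
p = smoothed_distribution(Counter({"b": 2, "c": 1}), vocab, alpha=0.5)
p_unseen = smoothed_distribution(Counter(), vocab, alpha=0.5)
print(p)         # observed characters get most of the mass
print(p_unseen)  # an unseen context falls back to the uniform distribution
```

With this form, no token ever receives zero probability, which keeps the cross-entropy finite on unseen transitions.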

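Each order receives a capacity score composed of two terms, concentration and reliability. Concentration is

$$\text{concentration}_o = 1 - \frac{H(p_o)}{\log V}$$

where H(pₒ) is the Shannon entropy of the smoothed distribution and V is the vocabulary size. This is 1 when all mass is on one token and 0 when the distribution is uniform.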
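A small sketch of a concentration term with these endpoints, assuming it is one minus the entropy normalized by the log of the vocabulary size. The function name is illustrative.

```python
import math

def concentration(p):
    """1 - H(p) / log(V): 1 for a one-hot distribution, 0 for a uniform one."""
    vocab_size = len(p)
    entropy = -sum(q * math.log(q) for q in p if q > 0)  # Shannon entropy in nats
    return 1.0 - entropy / math.log(vocab_size)

one_hot = concentration([1.0, 0.0, 0.0, 0.0])
uniform = concentration([0.25, 0.25, 0.25, 0.25])
peaked = concentration([0.7, 0.1, 0.1, 0.1])
print(one_hot, uniform, peaked)
```

Because the per-order distributions are smoothed, a real context never reaches exactly 1; sharply peaked contexts only approach it.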

The second term, reliability, depends on n, the total count for the current context. It saturates toward 1 as evidence accumulates and is 0 when the context has not been observed.

The capacity score $\kappa_o$ of each order is the product of its concentration and its reliability. The capacity scores are converted to weights via softmax at temperature τ = 0.10:

$$w_o = \frac{\exp(\kappa_o / \tau)}{\sum_j \exp(\kappa_j / \tau)}$$
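The softmax step can be sketched as below. The temperature is the article's τ = 0.10; the three capacity values are made up for illustration, and the max-subtraction is a standard numerical-stability trick rather than something the article describes.

```python
import math

def softmax_weights(capacities, tau=0.10):
    """Temperature softmax over per-order capacity scores."""
    m = max(capacities)                                  # subtract max for stability
    exps = [math.exp((c - m) / tau) for c in capacities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical capacity scores for three orders.
w = softmax_weights([0.2, 0.5, 0.8])
print(w)
```

In this example a capacity gap of 0.3 between the top two orders already gives the top order roughly 95% of the weight.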

The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates. The final prediction is the weighted geometric mean of the per-order distributions:

$$p(x) \propto \prod_o p_o(x)^{w_o}$$

The geometric mean was chosen deliberately: it assigns high probability to a token only when the weighted orders agree.
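A sketch of a weighted geometric mean over per-order distributions, computed in log space. Renormalizing the result so it sums to 1 is an assumption made here to obtain a valid distribution; the function name and the two toy distributions are illustrative.

```python
import math

def geometric_mixture(dists, weights):
    """Weighted geometric mean of distributions (dicts token -> prob > 0), renormalized."""
    vocab = list(dists[0].keys())
    log_scores = {
        t: sum(w * math.log(d[t]) for d, w in zip(dists, weights)) for t in vocab
    }
    m = max(log_scores.values())                      # subtract max for stability
    unnorm = {t: math.exp(s - m) for t, s in log_scores.items()}
    z = sum(unnorm.values())
    return {t: u / z for t, u in unnorm.items()}

# Two orders that disagree about the most likely token.
p1 = {"a": 0.8, "b": 0.1, "c": 0.1}
p2 = {"a": 0.1, "b": 0.8, "c": 0.1}
mix = geometric_mixture([p1, p2], [0.5, 0.5])
print(mix)
```

When the two orders disagree, neither favored token keeps its 0.8 probability; a token needs support from every heavily weighted order to stay near the top.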

The model has four hyperparameters: the set of context orders, α, τ, and the reliability threshold (min_count = 1). These were selected by evaluating variants on the validation set.

Evaluation uses the nanoGPT shakespeare_char benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens, and a vocabulary size of 65.
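For orientation, a loss in nats can be converted to bits per character, and a uniform guess over the 65-character vocabulary gives a reference loss of ln 65. The sketch below does both conversions using the two headline validation losses; the helper name `nats_to_bits` is illustrative.

```python
import math

VOCAB_SIZE = 65  # from the benchmark description

def nats_to_bits(loss_nats):
    """Convert a cross-entropy in nats to bits per character."""
    return loss_nats / math.log(2)

uniform_loss = math.log(VOCAB_SIZE)      # loss of a uniform guess, in nats
entropybeam_bits = nats_to_bits(1.596)   # EntropyBeam validation loss
nanogpt_bits = nats_to_bits(2.065)       # nanoGPT validation loss
print(round(uniform_loss, 3), round(entropybeam_bits, 2), round(nanogpt_bits, 2))
```

The uniform reference of about 4.17 nats is close to the 4.189 nats reported for nanoGPT at step 0, as expected for an untrained model.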

EntropyBeam uses 0 trainable parameters, a single fit pass, and character-level input.

| Training tokens | Validation loss, nats | Contexts stored | Transitions stored |
|---|---|---|---|
| 1,000 | 2.954 | 5,495 | 6,388 |
| 3,000 | 2.654 | 14,670 | 17,176 |
| 10,000 | 2.482 | 44,092 | 51,835 |
| 30,000 | 2.289 | 120,043 | 140,961 |
| 100,000 | 2.193 | 346,462 | 405,119 |
| 300,000 | 1.990 | 919,897 | 1,071,750 |
| 1,003,854 | 1.596 | 2,753,581 | 3,199,496 |

nanoGPT uses 60,192 parameters, 2 layers, n_embd=48, n_head=4, block_size=32, batch_size=16, and AdamW with lr=1e-3, wd=0.01.
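The stated parameter count can be checked with a short calculation. This sketch assumes a GPT-2-style block as in nanoGPT, with bias terms disabled, the output head tied to the token embedding, and position embeddings included in the total; those settings are assumptions, not stated in the article. The function name is illustrative.

```python
def nanogpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Parameter count for a bias-free GPT with tied input/output embeddings."""
    attention = 4 * n_embd**2      # qkv projection (3n^2) + output projection (n^2)
    mlp = 8 * n_embd**2            # n -> 4n and 4n -> n
    layer_norms = 2 * n_embd       # two LayerNorm weight vectors per block
    per_block = attention + mlp + layer_norms
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position
    final_layer_norm = n_embd
    return n_layer * per_block + embeddings + final_layer_norm

total = nanogpt_param_count(n_layer=2, n_embd=48, vocab_size=65, block_size=32)
print(total)  # 60192
```

Under these assumptions the result matches the 60,192 parameters quoted above. The number of heads does not affect the count.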

| Step | Tokens seen | Validation loss, nats |
|---|---|---|
| 0 | 0 | 4.189 |
| 300 | 153,600 | 2.507 |
| 600 | 307,200 | 2.409 |
| 1,200 | 614,400 | 2.262 |
| 1,800 | 921,600 | 2.162 |
| 2,400 | 1,228,800 | 2.096 |
| 3,000 | 1,536,000 | 2.065 |
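The "Tokens seen" column follows from the step count, batch_size=16, and block_size=32. The sketch below reproduces it and expresses the final value as a multiple of the 1,003,854 training tokens; since nanoGPT samples random windows, that multiple is a token-count equivalent, not a literal number of passes.

```python
BATCH_SIZE = 16
BLOCK_SIZE = 32
TRAIN_TOKENS = 1_003_854  # full training set size from the EntropyBeam table

def tokens_seen(step):
    """Tokens processed after a given number of optimizer steps."""
    return step * BATCH_SIZE * BLOCK_SIZE

final_tokens = tokens_seen(3000)
passes_equivalent = final_tokens / TRAIN_TOKENS
print(final_tokens, round(passes_equivalent, 2))  # 1536000 1.53
```

By this measure nanoGPT processed about 1.5 times as many tokens as EntropyBeam's single fit pass.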

| Metric | EntropyBeam | nanoGPT | Ratio |
|---|---|---|---|
| Fit/train FLOPs | 0.009 G | 614 G | 68,000x |
| FLOPs per prediction | 4,500 | 133,000 | 30x |
| Total FLOPs to result | ~0.5 G | ~760 G | ~1,500x |
| Validation loss, nats | 1.596 | 2.065 | |
| Trainable parameters | 0 | 60,192 | |
| Wall time | 12s | 26s | |
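The ratio column can be recomputed from the two value columns. This sketch also derives the two comparisons the table leaves blank, the loss gap and the wall-time ratio; the dictionary keys are illustrative labels.

```python
# (EntropyBeam, nanoGPT) values copied from the comparison table.
rows = {
    "fit_flops": (0.009e9, 614e9),
    "flops_per_prediction": (4_500, 133_000),
    "total_flops": (0.5e9, 760e9),
}
ratios = {name: nanogpt / beam for name, (beam, nanogpt) in rows.items()}

loss_gap = 2.065 - 1.596        # nats, nanoGPT minus EntropyBeam
wall_time_ratio = 26 / 12       # nanoGPT seconds over EntropyBeam seconds

for name, r in ratios.items():
    print(f"{name}: {r:,.0f}x")
print(f"loss gap: {loss_gap:.3f} nats, wall time: {wall_time_ratio:.1f}x")
```

The recomputed values (about 68,222x, 30x, and 1,520x) agree with the rounded figures in the table. The loss gap is 0.469 nats and the wall-time ratio is about 2.2x.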

Per-decade improvement in validation loss.

| Range | Change in loss, nats |
|---|---|
| 1K to 10K | -0.47 |
| 10K to 100K | -0.29 |
| 100K to 1M | -0.60 |
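These per-decade changes follow directly from the validation losses in the EntropyBeam scaling table, taking the 1,003,854-token row as the 1M point. A short check:

```python
# Validation loss in nats at each decade of training tokens, from the scaling table.
loss = {1_000: 2.954, 10_000: 2.482, 100_000: 2.193, 1_003_854: 1.596}

sizes = sorted(loss)
deltas = [round(loss[b] - loss[a], 2) for a, b in zip(sizes, sizes[1:])]
print(deltas)  # [-0.47, -0.29, -0.6]
```

The largest drop comes in the last decade, so the loss had not flattened out by the full training set size.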

Storage is not directly comparable to a transformer's parameter count. At the full training size, EntropyBeam stores about 2.75M contexts and 3.2M transitions, compared to about 60K learned floats for the transformer. Even so, on this corpus the fixed combination rule achieves lower cross-entropy than the gradient-trained baseline.

The model was not compared against a broad range of transformer baselines. In limited testing on larger datasets, it achieved similar next-token prediction accuracy.

The code is available at https://github.com/zw5/beam


Read original on lesswrong.com → www.lesswrong.com/posts/rSiwKbisKmMhALpBk/gradie…
