
LLM from Scratch: a small LLM running inside MIT's Scratch

A developer has created a small LLM that runs inside MIT's Scratch by compiling the llama2.c inference code to Scratch blocks using llvm2scratch. The project, called LLM from Scratch, allows users to generate text token-by-token in a Scratch sprite using a quantized 260K-parameter model. It demonstrates running neural network inference within Scratch's constrained environment.

Published Jun 24, 2026

Run the smallest llama2.c model (stories260K) inside Scratch/TurboWarp by compiling C inference code to Scratch blocks with llvm2scratch.

If everything is working, the sprite will start generating the familiar opening: Once upon a time, ... (streamed into the speech bubble token-by-token).

This repo vendors two upstream projects in-tree for reproducibility:

  • llama2.c by Andrej Karpathy (MIT). Source: llama2.c/ and llama2.c/LICENSE.
  • llvm2scratch by Classfied3D (MIT). Source: llvm2scratch/ and llvm2scratch/LICENSE.

The model/tokenizer artifacts in artifacts/ come from the llama2.c ecosystem.

High-level pipeline:

  • scratch_llama2/build_stories260k_sprite3.py reads:
      • artifacts/stories260K.bin (the smallest llama2.c checkpoint)
      • artifacts/tok512.bin (tokenizer vocabulary)
  • It quantizes the weight matrices to Q8_0 (group size 4) and packs 4 signed int8 values into one u32.
  • It lays out everything into a single Scratch list, !stack:
      • packed weights + per-group scales
      • RMSNorm weights
      • RoPE cos/sin tables (for a reduced SEQ_LEN)
      • runtime buffers (x/xb/hb/q/att + KV cache)
  • It writes scratch_llama2/generated_layout.h with 1-indexed addresses into !stack.
  • It compiles scratch_llama2/llama2_scratch.c to LLVM IR (scratch_llama2/llama2_scratch.ll) using:
      • clang --target=i386-none-elf (keeps pointers as 32-bit ints)
  • It runs llvm2scratch to turn LLVM IR into Scratch blocks, then exports .sprite3 and .sb3 outputs.
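
The quantize-and-pack step above can be sketched in a few lines of Python. This is an illustration and not the code from build_stories260k_sprite3.py: the function names are hypothetical, the little-endian byte order inside each u32 is an assumption (the article does not state it), and the all-zero-group fallback scale of 1.0 is also an assumption. The scheme itself (one scale per group, taken from the group's largest absolute value divided by 127) is the usual symmetric Q8_0 recipe.

```python
import struct

GROUP_SIZE = 4  # from the article: Q8_0 with group size 4


def quantize_q8_0(weights, group_size=GROUP_SIZE):
    """Symmetric per-group int8 quantization.

    Returns (q_values, scales): one int8 per weight, one float scale per group.
    """
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of the group size")
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumption: an all-zero group gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-127, min(127, round(w / scale))) for w in group)
    return q_values, scales


def dequantize(q_values, scales, group_size=GROUP_SIZE):
    """Recover approximate floats: each int8 times its group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


def pack_int8x4(q_values):
    """Pack every 4 signed int8 values into one u32 (assumed little-endian)."""
    if len(q_values) % 4 != 0:
        raise ValueError("value count must be a multiple of 4")
    packed = []
    for i in range(0, len(q_values), 4):
        (word,) = struct.unpack("<I", struct.pack("<4b", *q_values[i:i + 4]))
        packed.append(word)
    return packed


def unpack_int8x4(word):
    """Inverse of pack_int8x4 for a single u32."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))
```

With a group size of 4, each group of quantized weights fits exactly one u32, so one list cell holds one group and one further value holds its scale.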

Runtime UI:

  • !!output (list) stores generated token IDs.
  • !!vocab (list) stores token pieces (strings).
  • !!text (variable) accumulates decoded text; the sprite says it continuously.
  • !!resets (variable) increments when the compiler triggers a broadcast-based “stack reset” (progress indicator + avoids JS call stack blowups).
  • !!status (variable) shows a high-level state machine (Edit params... -> Running... -> Done.).
  • ui_* variables let you adjust sampling/generation settings from TurboWarp/Scratch UI.
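
As a rough mental model of how these lists and variables interact, here is a small Python sketch. The class and method names are hypothetical and only mirror the names above; the real logic lives in compiled Scratch blocks. It also assumes a token ID can index the vocabulary directly, whereas Scratch lists are 1-indexed, so the actual project needs some offset that the article does not describe.

```python
class RuntimeState:
    """Toy model of the sprite's runtime lists and variables (names hypothetical)."""

    def __init__(self, vocab):
        self.vocab = list(vocab)        # models !!vocab: token pieces
        self.output = []                # models !!output: generated token IDs
        self.text = ""                  # models !!text: decoded text so far
        self.resets = 0                 # models !!resets: stack-reset counter
        self.status = "Edit params..."  # models !!status

    def start(self):
        self.status = "Running..."

    def emit_token(self, token_id):
        # Log the ID, decode it through the vocabulary, and extend the text
        # that the sprite would show in its speech bubble.
        self.output.append(token_id)
        self.text += self.vocab[token_id]

    def stack_reset(self):
        # Progress indicator: bumped each time a stack reset happens.
        self.resets += 1

    def finish(self):
        self.status = "Done."


state = RuntimeState(["<s>", "Once", " upon", " a", " time"])
state.start()
for token_id in [1, 2, 3, 4]:
    state.emit_token(token_id)
state.finish()
print(state.status, repr(state.text))
```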

Requires:

  • clang
  • uv (and Python >= 3.12; llvm2scratch requires it)

Command:

MAX_BRANCH_RECURSION=200 \
GEN_STEPS=20 \
uv run --python 3.12 --no-project --with-editable ./llvm2scratch python scratch_llama2/build_stories260k_sprite3.py

Outputs:

  • scratch_llama2/stories260k_inference.sprite3: sprite, blocks hidden (fast editor/import)
  • scratch_llama2/stories260k_inference_visible.sprite3: sprite, blocks visible (debug)
  • scratch_llama2/stories260k_inference_visible.sb3: standalone project wrapper around the visible sprite
  • scratch_llama2/stories260k_inference_visible_scratch.sprite3: Scratch-compatible sprite (no TurboWarp-only blocks)
  • scratch_llama2/stories260k_inference_visible_scratch.sb3: Scratch-compatible standalone project

Sprite workflow:

  • Import scratch_llama2/stories260k_inference_visible.sprite3 into TurboWarp (File -> Upload sprite or drag/drop).
  • Select the sprite.
  • Click the green flag.
  • Edit ui_* variables (Variables panel).
  • Press space (or click the sprite) to start.

Project workflow:

  • Open scratch_llama2/stories260k_inference_visible.sb3 in TurboWarp (File -> Load from your computer).
  • Click the green flag.
  • Use the sliders/monitors on the stage to edit params.
  • Press space (or click the sprite) to start.

What you should see:

  • !!status updates: Edit params... -> Running... -> Done.
  • !!resets increments periodically (a "still alive" indicator during long runs).
  • As tokens are generated, the sprite streams decoded text into its speech bubble (!!text).
  • For debugging, generated token IDs are appended to the !!output list.

Sampling UI:

  • ui_steps: max tokens to generate (<= 32).
  • ui_temperature: 0 => greedy; >0 => sampling.
  • ui_top_k: 1 => greedy; >1 => top-k sampling.
  • ui_top_p: nucleus cutoff in (0, 1] (use 1 to disable).
  • ui_seed: nonzero => deterministic; 0 => pick a random seed at start.
  • ui_prompt_preset: 0 => start from BOS; 1 => force the token prefix Once upon a time, (demo).
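
To make the interplay of these settings concrete, here is a minimal Python sketch of temperature, top-k and top-p sampling. It is a generic illustration, not the logic in llama2_scratch.c: the function name and argument names are hypothetical, and treating top_k values below 1 as "no top-k filter" is an assumption, since the article only defines 1 and >1.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick a token index from raw logits (illustrative sketch)."""
    # temperature == 0 or top_k == 1 both reduce to greedy argmax.
    if temperature == 0 or top_k == 1:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()

    # Softmax over temperature-scaled logits (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    candidates = sorted(
        ((e / total, i) for i, e in enumerate(exps)), key=lambda t: -t[0]
    )

    # Top-k: keep only the k most likely tokens (assumption: < 1 disables it).
    if top_k > 1:
        candidates = candidates[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for prob, idx in candidates:
            kept.append((prob, idx))
            cumulative += prob
            if cumulative >= top_p:
                break
        candidates = kept

    # Draw from what is left, renormalising by the remaining mass.
    threshold = rng.random() * sum(prob for prob, _ in candidates)
    cumulative = 0.0
    for prob, idx in candidates:
        cumulative += prob
        if threshold < cumulative:
            return idx
    return candidates[-1][1]
```

Passing a seeded generator, for example rng=random.Random(1234), plays the role of a nonzero ui_seed: the same seed and settings give the same draws.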

To run on official Scratch (scratch.mit.edu), use the *_scratch.* outputs:

  • scratch_llama2/stories260k_inference_visible_scratch.sb3 (recommended)

Scratch is significantly slower than TurboWarp, and does not support TurboWarp-only “hacked counter” blocks.

Notes:

  • scratch_llama2/llama2_scratch.c is inference-only and uses a reduced SEQ_LEN for Scratch feasibility.
  • llvm2scratch is vendored here and patched to support pre-seeding !stack and a few extra IR patterns.
  • Official Scratch does not support TurboWarp's hacked counter opcodes. Use the *_scratch.* outputs for scratch.mit.edu.

These are the key changes that made llama2_scratch.c viable:

  • Preseeded memory: skip generating huge “initializer” scripts by directly injecting !stack at export time.
  • i8 pointer arithmetic fix: clang emits getelementptr i8 using byte offsets (4/8/12/...), but our “memory” is list-indexed; we scale i8 GEP indices back into 32-bit cells (i8_gep_div=4).
  • Stack reset progress: optional !!resets counter to confirm the VM is still working during long runs (we keep the speech bubble for generated text).
  • Token streaming: SB3_emit_token_dbl logs token IDs to !!output, decodes through !!vocab, appends into !!text, and continuously updates the sprite speech bubble.
  • Added intrinsic support: clang can emit llvm.umin/umax/smin/smax; llvm2scratch now translates these so -O2 IR compiles.
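
Two of the changes above are easy to illustrate in Python: rescaling byte offsets into list cells, and the meaning of the four min/max intrinsics on 32-bit values. This is a sketch of the ideas only, not the patched llvm2scratch code. All function names are hypothetical; only the divisor 4 (i8_gep_div=4) and the 1-indexed addressing come from the article.

```python
I8_GEP_DIV = 4  # from the article: i8 GEP indices are scaled by 4
MASK32 = 0xFFFFFFFF


def gep_i8(base_cell, byte_offset, div=I8_GEP_DIV):
    """Turn 'base + byte offset' into a cell address in a list-backed memory."""
    if byte_offset % div != 0:
        raise ValueError("byte offset is not aligned to a 32-bit cell")
    return base_cell + byte_offset // div


def load(stack, cell_address):
    """Read a cell using 1-indexed addresses, as Scratch lists do."""
    return stack[cell_address - 1]


def to_signed32(value):
    """Reinterpret a 32-bit pattern as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def umin(a, b):
    return min(a & MASK32, b & MASK32)


def umax(a, b):
    return max(a & MASK32, b & MASK32)


def smin(a, b):
    # Compare as signed values, then return the result as a 32-bit pattern.
    return min(to_signed32(a), to_signed32(b)) & MASK32


def smax(a, b):
    return max(to_signed32(a), to_signed32(b)) & MASK32


stack = [10, 20, 30, 40]           # four 32-bit cells
address = gep_i8(1, 8)             # byte offset 8 from cell 1 is cell 3
print(address, load(stack, address))
```

The same bit pattern can order differently under the two interpretations: 0xFFFFFFFF is the largest unsigned value but equals -1 when read as signed, which is why the unsigned and signed intrinsics need separate translations.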

@misc{andrews2026llm_from_scratch,
  author       = {Andrews, David},
  title        = {llm\_from\_scratch},
  year         = {2026},
  howpublished = {\url{https://github.com/broyojo/llm_from_scratch}}
}