# LLM from Scratch: a small LLM running inside MIT's Scratch

> Source: <https://github.com/Broyojo/llm_from_scratch>
> Published: 2026-06-24 17:03:04+00:00

Run the smallest `llama2.c` model (`stories260K`) inside Scratch/TurboWarp by compiling C inference code to Scratch blocks with `llvm2scratch`.

If everything is working, the sprite will start generating the familiar opening: `Once upon a time, ...` (streamed into the speech bubble token-by-token).

- Scratch project: [https://scratch.mit.edu/projects/1277883263](https://scratch.mit.edu/projects/1277883263)

This repo vendors two upstream projects in-tree for reproducibility:

- `llama2.c` by Andrej Karpathy (MIT). Source: `llama2.c/` and `llama2.c/LICENSE`.
- `llvm2scratch` by Classfied3D (MIT). Source: `llvm2scratch/` and `llvm2scratch/LICENSE`.

The model/tokenizer artifacts in `artifacts/` come from the `llama2.c` ecosystem.

High-level pipeline:

- `scratch_llama2/build_stories260k_sprite3.py` reads:
  - `artifacts/stories260K.bin` (the smallest llama2.c checkpoint)
  - `artifacts/tok512.bin` (tokenizer vocabulary)
- It quantizes the weight matrices to Q8_0 (group size 4) and packs 4 signed int8 values into one `u32`.
- It lays out *everything* into a single Scratch list `!stack`:
  - packed weights + per-group scales
  - RMSNorm weights
  - RoPE cos/sin tables (for a reduced `SEQ_LEN`)
  - runtime buffers (x/xb/hb/q/att + KV cache)
- It writes `scratch_llama2/generated_layout.h` with 1-indexed addresses into `!stack`.
- It compiles `scratch_llama2/llama2_scratch.c` to LLVM IR (`scratch_llama2/llama2_scratch.ll`) using:
  - `clang --target=i386-none-elf` (keeps pointers as 32-bit ints)
- It runs `llvm2scratch` to turn LLVM IR into Scratch blocks, then exports `.sprite3` and `.sb3` outputs.
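The quantize-and-pack step can be sketched roughly as follows. This is a minimal illustration, not the code in `build_stories260k_sprite3.py`: the function names are hypothetical, the scale rule (`max_abs / 127`) follows the usual Q8_0 convention, and the little-endian byte order inside each `u32` is an assumption.

```python
import struct

GROUP_SIZE = 4  # from the README: Q8_0 with group size 4, so one group fills one u32


def quantize_group(values):
    """Quantize one group of floats to int8 values plus a single float scale."""
    assert len(values) == GROUP_SIZE
    max_abs = max(abs(v) for v in values)
    # Assumption: an all-zero group gets scale 1.0 to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def pack_u32(q):
    """Pack 4 signed int8 values into one unsigned 32-bit cell.

    Assumption: little-endian, so q[0] lands in the lowest byte.
    """
    return struct.unpack("<I", struct.pack("<4b", *q))[0]


def unpack_u32(cell):
    """Inverse of pack_u32: recover the 4 signed int8 values."""
    return list(struct.unpack("<4b", struct.pack("<I", cell)))


def quantize_row(row):
    """Quantize a row whose length is a multiple of GROUP_SIZE."""
    cells, scales = [], []
    for i in range(0, len(row), GROUP_SIZE):
        q, scale = quantize_group(row[i:i + GROUP_SIZE])
        cells.append(pack_u32(q))
        scales.append(scale)
    return cells, scales
```

With a group size of 4, each packed cell carries exactly one scale, which is why the pipeline stores "packed weights + per-group scales" side by side.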

Runtime UI:

- `!!output` (list) stores generated token IDs.
- `!!vocab` (list) stores token pieces (strings).
- `!!text` (variable) accumulates decoded text; the sprite `say`s it continuously.
- `!!resets` (variable) increments when the compiler triggers a broadcast-based “stack reset” (progress indicator + avoids JS call stack blowups).
- `!!status` (variable) shows a high-level state machine (`Edit params...` -> `Running...` -> `Done.`).
- `ui_*` variables let you adjust sampling/generation settings from TurboWarp/Scratch UI.
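As a rough mental model of how these lists and variables interact, the Python sketch below mirrors the runtime state. The class and method names are hypothetical and the toy vocabulary is made up; the real logic lives in generated Scratch blocks, and Scratch lists are 1-indexed whereas this sketch uses Python's 0-indexed lists.

```python
class SpriteState:
    """Toy stand-in for the sprite's runtime lists and variables."""

    def __init__(self, vocab):
        self.vocab = vocab              # plays the role of `!!vocab`
        self.output = []                # plays the role of `!!output`
        self.text = ""                  # plays the role of `!!text`
        self.status = "Edit params..."  # plays the role of `!!status`

    def emit_token(self, token_id):
        """Log the token ID, decode it through the vocab, and grow the text."""
        self.output.append(token_id)
        self.text += self.vocab[token_id]
        # In Scratch, the sprite would now `say` self.text.

    def run(self, token_ids):
        """Walk the status state machine around a generation loop."""
        self.status = "Running..."
        for token_id in token_ids:
            self.emit_token(token_id)
        self.status = "Done."


# Usage with a made-up four-entry vocabulary:
state = SpriteState(["Once", " upon", " a", " time"])
state.run([0, 1, 2, 3])
print(state.text)
```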

Requires:

- `clang`
- `uv` (and Python >= 3.12; `llvm2scratch` requires it)

Command:

```
# If you don't have a usable Python yet:
# uv python install 3.12
#
# Optional: MAX_BRANCH_RECURSION tunes stack reset frequency for TurboWarp stability/perf.
# Lower = more stable (less likely to hit "Maximum call stack size exceeded"), but slower.
# Higher = faster, but can crash in TurboWarp. The default is 200.
#
# Optional: GEN_STEPS is the number of tokens to generate (upper bound). Defaults to 20.
# (Must be <= SEQ_LEN, currently 32.)
#
# llvm2scratch requires Python >= 3.12; pin via `--python` to avoid uv picking an older system Python.
# (Comments are kept above the command: a comment line between continued lines would end the command early.)
MAX_BRANCH_RECURSION=200 \
GEN_STEPS=20 \
uv run --python 3.12 --no-project --with-editable ./llvm2scratch python scratch_llama2/build_stories260k_sprite3.py
```

Outputs:

- `scratch_llama2/stories260k_inference.sprite3`: sprite, blocks hidden (fast editor/import)
- `scratch_llama2/stories260k_inference_visible.sprite3`: sprite, blocks visible (debug)
- `scratch_llama2/stories260k_inference_visible.sb3`: standalone project wrapper around the visible sprite
- `scratch_llama2/stories260k_inference_visible_scratch.sprite3`: Scratch-compatible sprite (no TurboWarp-only blocks)
- `scratch_llama2/stories260k_inference_visible_scratch.sb3`: Scratch-compatible standalone project

Sprite workflow:

- Import `scratch_llama2/stories260k_inference_visible.sprite3` into TurboWarp (`File -> Upload sprite` or drag/drop).
- Select the sprite.
- Click the green flag.
- Edit `ui_*` variables (Variables panel).
- Press `space` (or click the sprite) to start.

Project workflow:

- Open `scratch_llama2/stories260k_inference_visible.sb3` in TurboWarp (`File -> Load from your computer`).
- Click the green flag.
- Use the sliders/monitors on the stage to edit params.
- Press `space` (or click the sprite) to start.

What you should see:

- `!!status` updates: `Edit params...` -> `Running...` -> `Done.`
- `!!resets` increments periodically (a "still alive" indicator during long runs).
- As tokens are generated, the sprite streams decoded text into its speech bubble (`!!text`).
- For debugging, generated token IDs are appended to the `!!output` list.

Sampling UI:

- `ui_steps`: max tokens to generate (<= 32).
- `ui_temperature`: `0` => greedy; `>0` => sampling.
- `ui_top_k`: `1` => greedy; `>1` => top-k sampling.
- `ui_top_p`: nucleus cutoff in `(0, 1]` (use `1` to disable).
- `ui_seed`: nonzero => deterministic; `0` => pick a random seed at start.
- `ui_prompt_preset`: `0` => start from BOS; `1` => force the token prefix `Once upon a time,` (demo).
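To make the interaction of these settings concrete, here is one plausible way they could combine, written as a standalone Python sketch. It is not the logic in `llama2_scratch.c`: the function names, the order of the top-k and top-p filters, and the treatment of `top_k <= 0` as "no top-k filter" are all assumptions.

```python
import math
import random


def make_rng(seed):
    """Nonzero seed => deterministic; 0 => let Python pick a random seed."""
    return random.Random(seed if seed != 0 else None)


def sample_token(logits, temperature, top_k, top_p, rng):
    """Pick a token index from raw logits using the UI-style settings."""
    # temperature 0 or top_k 1 both mean greedy decoding.
    if temperature == 0 or top_k == 1:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax with temperature (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidates sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (assumption: k <= 0 disables).
    if top_k > 1:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    if top_p < 1:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the weights of the surviving candidates.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]
```

Calling `sample_token(logits, 0, 40, 1.0, make_rng(0))` always returns the argmax, which matches the "`0` => greedy" rule for `ui_temperature`.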

For official Scratch (scratch.mit.edu), use the `*_scratch.*` outputs:

- `scratch_llama2/stories260k_inference_visible_scratch.sb3` (recommended)

Scratch is significantly slower than TurboWarp, and does not support TurboWarp-only “hacked counter” blocks.

- `scratch_llama2/llama2_scratch.c` is inference-only and uses a reduced `SEQ_LEN` for Scratch feasibility.
- `llvm2scratch` is vendored here and patched to support pre-seeding `!stack` and a few extra IR patterns.
- Official Scratch does not support TurboWarp's hacked counter opcodes. Use the `*_scratch.*` outputs for scratch.mit.edu.
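The idea of a pre-seeded `!stack` with 1-indexed addresses can be illustrated with a short Python sketch. The region names, the `*_ADDR` macro naming, and the header shape are hypothetical; the actual contents of `generated_layout.h` may look quite different.

```python
def build_layout(regions):
    """Concatenate named regions into one flat list and record base addresses.

    `regions` is a list of (name, cells) pairs. Addresses are 1-indexed
    because Scratch lists start at index 1.
    """
    stack, addresses = [], {}
    for name, cells in regions:
        addresses[name] = len(stack) + 1
        stack.extend(cells)
    return stack, addresses


def render_header(addresses):
    """Render the addresses as C preprocessor defines (hypothetical format)."""
    lines = ["#pragma once"]
    for name, addr in addresses.items():
        lines.append(f"#define {name}_ADDR {addr}")
    return "\n".join(lines) + "\n"


# Usage with made-up regions: three weight cells, one scale, a 4-cell buffer.
stack, addresses = build_layout([
    ("WEIGHTS", [1, 2, 3]),
    ("SCALES", [0.5]),
    ("KV_CACHE", [0] * 4),
])
print(render_header(addresses))
```

Because the list contents are known at export time, they can be written straight into the project file instead of being rebuilt by Scratch scripts at startup.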

These are the key changes that made `llama2_scratch.c` viable:

- Preseeded memory: skip generating huge “initializer” scripts by directly injecting `!stack` at export time.
- i8 pointer arithmetic fix: clang emits `getelementptr i8` using *byte offsets* (4/8/12/...), but our “memory” is list-indexed; we scale i8 GEP indices back into 32-bit cells (`i8_gep_div=4`).
- Stack reset progress: optional `!!resets` counter to confirm the VM is still working during long runs (we keep the speech bubble for generated text).
- Token streaming: `SB3_emit_token_dbl` logs token IDs to `!!output`, decodes through `!!vocab`, appends into `!!text`, and continuously updates the sprite speech bubble.
- Added intrinsic support: clang can emit `llvm.umin/umax/smin/smax`; llvm2scratch now translates these so `-O2` IR compiles.
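The i8 pointer arithmetic fix comes down to a unit conversion from bytes to list cells. The Python sketch below shows the arithmetic only; the function name is hypothetical, and rejecting unaligned offsets is an assumption about how such a case might be handled, not a description of the patched `llvm2scratch`.

```python
I8_GEP_DIV = 4  # bytes per 32-bit cell, matching `i8_gep_div=4` above


def gep_i8_to_cell(base_cell, byte_offset, div=I8_GEP_DIV):
    """Translate a byte-offset GEP into a list-cell address.

    `base_cell` is already a list index; `byte_offset` is what clang emitted
    for `getelementptr i8` (4, 8, 12, ... for consecutive 32-bit values).
    """
    if byte_offset % div != 0:
        # Assumption: sub-cell addressing is not representable in the list.
        raise ValueError(f"byte offset {byte_offset} is not a multiple of {div}")
    return base_cell + byte_offset // div


# Byte offsets 0, 4, 8, 12 from cell 10 map to consecutive cells 10..13.
print([gep_i8_to_cell(10, off) for off in (0, 4, 8, 12)])
```

Without this scaling, an access meant for the next 32-bit value would skip ahead four list entries instead of one.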

```
@misc{andrews2026llm_from_scratch,
  author       = {Andrews, David},
  title        = {llm\_from\_scratch},
  year         = {2026},
  howpublished = {\url{https://github.com/broyojo/llm_from_scratch}}
}
```


