Run the smallest llama2.c model (stories260K) inside Scratch/TurboWarp by compiling C inference code to Scratch blocks with llvm2scratch.
If everything is working, the sprite starts generating the familiar opening, "Once upon a time, ...", streamed into the speech bubble token by token.
- Scratch project: https://scratch.mit.edu/projects/1277883263
This repo vendors two upstream projects in-tree for reproducibility:
- llama2.c by Andrej Karpathy (MIT). Source: llama2.c/ and llama2.c/LICENSE.
- llvm2scratch by Classfied3D (MIT). Source: llvm2scratch/ and llvm2scratch/LICENSE.
The model/tokenizer artifacts in artifacts/ come from the llama2.c ecosystem.
High-level pipeline:
- scratch_llama2/build_stories260k_sprite3.py reads:
  - artifacts/stories260K.bin (the smallest llama2.c checkpoint)
  - artifacts/tok512.bin (tokenizer vocabulary)
- It quantizes the weight matrices to Q8_0 (group size 4) and packs 4 signed int8 values into one u32.
- It lays out everything into a single Scratch list, !stack:
  - packed weights + per-group scales
  - RMSNorm weights
  - RoPE cos/sin tables (for a reduced SEQ_LEN)
  - runtime buffers (x/xb/hb/q/att + KV cache)
- It writes scratch_llama2/generated_layout.h with 1-indexed addresses into !stack.
- It compiles scratch_llama2/llama2_scratch.c to LLVM IR (scratch_llama2/llama2_scratch.ll) using:
  - clang --target=i386-none-elf (keeps pointers as 32-bit ints)
- It runs llvm2scratch to turn LLVM IR into Scratch blocks, then exports .sprite3 and .sb3 outputs.
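The quantize-and-pack step in the pipeline above can be sketched as follows. This is a minimal illustration, not the build script's actual code: the function names are hypothetical, and the symmetric max-abs scaling, the clamp to [-127, 127], and the little-endian byte order inside each u32 are all assumptions.

```python
GROUP_SIZE = 4  # group size stated in the pipeline description


def quantize_group(values):
    """Quantize one group of floats to signed int8 values plus one float scale.

    Assumption: symmetric scaling by the group's largest absolute value.
    """
    assert len(values) == GROUP_SIZE
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def pack_int8x4(q):
    """Pack four signed int8 values into one unsigned 32-bit integer.

    Assumption: q[0] lands in the lowest byte (little-endian order).
    """
    word = 0
    for i, v in enumerate(q):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_int8x4(word):
    """Recover the four signed int8 values from a packed u32."""
    out = []
    for i in range(4):
        b = (word >> (8 * i)) & 0xFF
        out.append(b - 256 if b >= 128 else b)
    return out


def dequantize_group(word, scale):
    """Approximate the original floats from a packed word and its scale."""
    return [v * scale for v in unpack_int8x4(word)]
```

With a group size of 4, each packed word carries exactly one group, so one list cell of weights pairs with one scale value. How the real layout interleaves words and scales in !stack is not shown here.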
Runtime UI:
- !!output (list) stores generated token IDs.
- !!vocab (list) stores token pieces (strings).
- !!text (variable) accumulates decoded text; the sprite says it continuously.
- !!resets (variable) increments when the compiler triggers a broadcast-based "stack reset" (progress indicator + avoids JS call stack blowups).
- !!status (variable) shows a high-level state machine (Edit params... -> Running... -> Done.).
- ui_* variables let you adjust sampling/generation settings from the TurboWarp/Scratch UI.
Requires:
- clang
- uv (and Python >= 3.12; llvm2scratch requires it)
Command:
#
MAX_BRANCH_RECURSION=200 \
GEN_STEPS=20 \
uv run --python 3.12 --no-project --with-editable ./llvm2scratch python scratch_llama2/build_stories260k_sprite3.py
Outputs:
- scratch_llama2/stories260k_inference.sprite3: sprite, blocks hidden (fast editor/import)
- scratch_llama2/stories260k_inference_visible.sprite3: sprite, blocks visible (debug)
- scratch_llama2/stories260k_inference_visible.sb3: standalone project wrapper around the visible sprite
- scratch_llama2/stories260k_inference_visible_scratch.sprite3: Scratch-compatible sprite (no TurboWarp-only blocks)
- scratch_llama2/stories260k_inference_visible_scratch.sb3: Scratch-compatible standalone project
Sprite workflow:
- Import scratch_llama2/stories260k_inference_visible.sprite3 into TurboWarp (File -> Upload sprite, or drag/drop).
- Select the sprite.
- Click the green flag.
- Edit ui_* variables (Variables panel).
- Press space (or click the sprite) to start.
Project workflow:
- Open scratch_llama2/stories260k_inference_visible.sb3 in TurboWarp (File -> Load from your computer).
- Click the green flag.
- Use the sliders/monitors on the stage to edit params.
- Press space (or click the sprite) to start.
What you should see:
- !!status updates: Edit params... -> Running... -> Done.
- !!resets increments periodically (a "still alive" indicator during long runs).
- As tokens are generated, the sprite streams decoded text into its speech bubble (!!text).
- For debugging, generated token IDs are appended to the !!output list.
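The streaming behaviour described above (log the token ID, look up its piece, append it to the running text) can be sketched in Python. This is only an illustration with hypothetical names; it uses a 0-based Python list for the vocabulary, whereas Scratch lists are 1-indexed, and the real index mapping used by the project is not shown in this README.

```python
def emit_token(token_id, vocab, output, text_parts):
    """Record one generated token and return the text to display so far.

    vocab      : list of token pieces (stand-in for the !!vocab list)
    output     : list of token IDs generated so far (stand-in for !!output)
    text_parts : decoded pieces accumulated so far (stand-in for !!text)
    """
    output.append(token_id)             # keep the raw ID for debugging
    text_parts.append(vocab[token_id])  # decode the ID to its string piece
    return "".join(text_parts)          # what the sprite would "say"


# Usage with a toy four-entry vocabulary (hypothetical data).
toy_vocab = ["Once", " upon", " a", " time"]
ids, parts = [], []
for tok in [0, 1, 2, 3]:
    bubble = emit_token(tok, toy_vocab, ids, parts)
```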
Sampling UI:
- ui_steps: max tokens to generate (<= 32).
- ui_temperature: 0 => greedy; >0 => sampling.
- ui_top_k: 1 => greedy; >1 => top-k sampling.
- ui_top_p: nucleus cutoff in (0, 1] (use 1 to disable).
- ui_seed: nonzero => deterministic; 0 => pick a random seed at start.
- ui_prompt_preset: 0 => start from BOS; 1 => force the token prefix "Once upon a time," (demo).
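To make the interaction between these settings concrete, here is a minimal Python sketch of temperature, top-k, and top-p sampling. It is not the project's C code: the function name is hypothetical, the order in which the filters are applied is an assumption, and so is the treatment of top_k values below 1 (no truncation).

```python
import math
import random


def sample_token(logits, temperature, top_k, top_p, rng):
    """Pick a token index from raw logits under the given settings."""
    # temperature 0 or top_k 1 both reduce to greedy argmax.
    if temperature == 0 or top_k == 1:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidates in order of decreasing probability, truncated to top_k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 1:
        order = order[:top_k]

    # Nucleus cutoff: keep the smallest prefix whose mass reaches top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Draw from the kept candidates, renormalised by their total mass.
    r = rng.random() * sum(probs[i] for i in kept)
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]
```

Passing `random.Random(seed)` with a fixed nonzero seed gives repeatable draws, which mirrors the documented ui_seed behaviour.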
To run on official Scratch (scratch.mit.edu), use the *_scratch.* outputs:
- scratch_llama2/stories260k_inference_visible_scratch.sb3 (recommended)
Scratch is significantly slower than TurboWarp, and does not support TurboWarp-only “hacked counter” blocks.
- scratch_llama2/llama2_scratch.c is inference-only and uses a reduced SEQ_LEN for Scratch feasibility.
- llvm2scratch is vendored here and patched to support pre-seeding !stack and a few extra IR patterns.
- Official Scratch does not support TurboWarp's hacked counter opcodes. Use the *_scratch.* outputs for scratch.mit.edu.
These are the key changes that made llama2_scratch.c viable:
- Preseeded memory: skip generating huge "initializer" scripts by directly injecting !stack at export time.
- i8 pointer arithmetic fix: clang emits getelementptr i8 using byte offsets (4/8/12/...), but our "memory" is list-indexed; we scale i8 GEP indices back into 32-bit cells (i8_gep_div=4).
- Stack reset progress: optional !!resets counter to confirm the VM is still working during long runs (we keep the speech bubble for generated text).
- Token streaming: SB3_emit_token_dbl logs token IDs to !!output, decodes through !!vocab, appends into !!text, and continuously updates the sprite speech bubble.
- Added intrinsic support: clang can emit llvm.umin/umax/smin/smax; llvm2scratch now translates these so -O2 IR compiles.
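The i8 pointer arithmetic fix comes down to dividing byte offsets by the cell width. The Python sketch below illustrates that arithmetic only; the function names are hypothetical, and rejecting offsets that are not a multiple of 4 is an assumption, since this README does not say how the patched llvm2scratch treats them.

```python
I8_GEP_DIV = 4  # bytes per 32-bit cell, matching i8_gep_div=4


def byte_offset_to_cells(byte_offset, div=I8_GEP_DIV):
    """Convert a byte offset from an i8 getelementptr into a cell offset."""
    if byte_offset % div != 0:
        # Assumption: unaligned offsets cannot be mapped to whole cells.
        raise ValueError(f"offset {byte_offset} is not a multiple of {div}")
    return byte_offset // div


def gep_i8(base_cell, byte_offset):
    """Address of the list cell reached from base_cell by a byte offset."""
    return base_cell + byte_offset_to_cells(byte_offset)
```

For example, the byte offsets 4, 8, and 12 mentioned above map to cell offsets 1, 2, and 3, so `gep_i8(100, 12)` addresses cell 103.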
@misc{andrews2026llm_from_scratch,
author = {Andrews, David},
title = {llm\_from\_scratch},
year = {2026},
howpublished = {\url{https://github.com/broyojo/llm_from_scratch}}
}