LLM from Scratch: a small LLM running inside MIT's Scratch

A developer has gotten a small LLM running inside MIT's Scratch by compiling the llama2.c inference code to Scratch blocks with llvm2scratch. The project, called LLM from Scratch, generates text token-by-token in a Scratch sprite using a quantized 260K-parameter model, demonstrating that neural network inference is feasible within Scratch's constrained environment.

Run the smallest llama2.c model (stories260K) inside Scratch/TurboWarp by compiling C inference code to Scratch blocks with llvm2scratch. If everything is working, the sprite will start generating the familiar opening "Once upon a time, ..." streamed into the speech bubble token-by-token.

- Scratch project: https://scratch.mit.edu/projects/1277883263

This repo vendors two upstream projects in-tree for reproducibility:

- llama2.c by Andrej Karpathy (MIT). Source: `llama2.c/` and `llama2.c/LICENSE`.
- llvm2scratch by Classfied3D (MIT). Source: `llvm2scratch/` and `llvm2scratch/LICENSE`.

The model/tokenizer artifacts in `artifacts/` come from the llama2.c ecosystem.

High-level pipeline:

- `scratch_llama2/build_stories260k_sprite3.py` reads:
  - `artifacts/stories260K.bin` (the smallest llama2.c checkpoint)
  - `artifacts/tok512.bin` (tokenizer vocabulary)
- It quantizes the weight matrices to Q8_0 (group size 4) and packs 4 signed int8 values into one u32.
- It lays out everything into a single Scratch list, `stack`:
  - packed weights + per-group scales
  - RMSNorm weights
  - RoPE cos/sin tables for a reduced `SEQ_LEN`
  - runtime buffers (x/xb/hb/q/att) + KV cache
- It writes `scratch_llama2/generated_layout.h` with 1-indexed addresses into `stack`.
- It compiles `scratch_llama2/llama2_scratch.c` to LLVM IR (`scratch_llama2/llama2_scratch.ll`) using `clang --target=i386-none-elf` (keeps pointers as 32-bit ints).
- It runs llvm2scratch to turn LLVM IR into Scratch blocks, then exports `.sprite3` and `.sb3` outputs.

Runtime UI:

- `output` (list) stores generated token IDs.
- `vocab` (list) stores token pieces (strings).
- `text` (variable) accumulates decoded text; the sprite `say`s it continuously.
- `resets` (variable) increments when the compiler triggers a broadcast-based “stack reset” (progress indicator + avoids JS call stack blowups).
- `status` (variable) shows a high-level state machine (Edit params... -> Running... -> Done.).
- `ui_*` variables let you adjust sampling/generation settings from the TurboWarp/Scratch UI.
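The quantize-and-pack step in the pipeline can be illustrated with a minimal sketch. It is not the code from `build_stories260k_sprite3.py`; the function names are hypothetical, and two details are assumptions rather than facts from this repo: the scale is chosen symmetrically so the largest magnitude in a group maps to 127, and the first value of each group is stored in the lowest byte of the packed word.

```python
GROUP_SIZE = 4  # group size stated in the pipeline description


def quantize_group(values):
    """Quantize GROUP_SIZE floats to signed int8 values sharing one scale."""
    if len(values) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} values, got {len(values)}")
    max_abs = max(abs(v) for v in values)
    # Assumption: symmetric scaling so the largest magnitude maps to 127.
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def pack_int8x4(quantized):
    """Pack four signed int8 values into one unsigned 32-bit integer.

    Assumption: quantized[0] goes in the lowest byte (little-endian order).
    """
    word = 0
    for i, q in enumerate(quantized):
        word |= (q & 0xFF) << (8 * i)  # two's-complement byte, shifted into place
    return word


def unpack_int8x4(word):
    """Recover the four signed int8 values from a packed 32-bit integer."""
    values = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        values.append(byte - 256 if byte >= 128 else byte)  # sign-extend
    return values


def dequantize_group(word, scale):
    """Turn a packed word and its group scale back into approximate floats."""
    return [q * scale for q in unpack_int8x4(word)]


if __name__ == "__main__":
    weights = [0.5, -0.25, 0.125, 0.0]
    q, scale = quantize_group(weights)
    word = pack_int8x4(q)
    print(q, hex(word), dequantize_group(word, scale))
```

With a group size of 4, each packed word carries exactly one group, so every u32 cell is paired with one scale. That is consistent with the "packed weights + per-group scales" layout above, though the actual ordering of cells in `stack` is not shown here.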
Requires: clang, uv, and Python >= 3.12 (llvm2scratch requires it).

Command:

    # If you don't have a usable Python yet:
    uv python install 3.12

    # Optional: MAX_BRANCH_RECURSION tunes stack reset frequency for TurboWarp stability/perf.
    #   Lower = more stable (less likely to hit "Maximum call stack size exceeded"), but slower.
    #   Higher = faster, but can crash in TurboWarp. 200 is the default.
    # Optional: GEN_STEPS is the number of tokens to generate (upper bound).
    #   Defaults to 20. Must be <= SEQ_LEN, currently 32.
    # llvm2scratch requires Python >= 3.12; pin via --python to avoid uv picking an older system Python.
    MAX_BRANCH_RECURSION=200 \
    GEN_STEPS=20 \
    uv run --python 3.12 --no-project --with-editable ./llvm2scratch \
      python scratch_llama2/build_stories260k_sprite3.py

Outputs:

- `scratch_llama2/stories260k_inference.sprite3`: sprite, blocks hidden (fast editor/import)
- `scratch_llama2/stories260k_inference_visible.sprite3`: sprite, blocks visible (debug)
- `scratch_llama2/stories260k_inference_visible.sb3`: standalone project (wrapper around the visible sprite)
- `scratch_llama2/stories260k_inference_visible_scratch.sprite3`: Scratch-compatible sprite (no TurboWarp-only blocks)
- `scratch_llama2/stories260k_inference_visible_scratch.sb3`: Scratch-compatible standalone project

Sprite workflow:

- Import `scratch_llama2/stories260k_inference_visible.sprite3` into TurboWarp (File -> Upload sprite, or drag/drop).
- Select the sprite.
- Click the green flag.
- Edit the `ui_*` variables (Variables panel).
- Press space or click the sprite to start.

Project workflow:

- Open `scratch_llama2/stories260k_inference_visible.sb3` in TurboWarp (File -> Load from your computer).
- Click the green flag.
- Use the sliders/monitors on the stage to edit params.
- Press space or click the sprite to start.

What you should see:

- `status` updates: Edit params... -> Running... -> Done.
- `resets` increments periodically (a "still alive" indicator during long runs).
- As tokens are generated, the sprite streams decoded text into its speech bubble (`text`).
- For debugging, generated token IDs are appended to the `output` list.

Sampling UI:

- `ui_steps`: max tokens to generate (<= 32).
- `ui_temperature`: 0 = greedy; >0 = sampling.
- `ui_top_k`: 1 = greedy; >1 = top-k sampling.
- `ui_top_p`: nucleus cutoff between 0 and 1 (use 1 to disable).
- `ui_seed`: nonzero = deterministic; 0 = pick a random seed at start.
- `ui_prompt_preset`: 0 = start from BOS; 1 = force the token prefix "Once upon a time," (demo).

To run on scratch.mit.edu, use the `*_scratch.*` outputs:

- `scratch_llama2/stories260k_inference_visible_scratch.sb3` (recommended)

Scratch is significantly slower than TurboWarp, and does not support TurboWarp-only “hacked counter” blocks.

- `scratch_llama2/llama2_scratch.c` is inference-only and uses a reduced `SEQ_LEN` for Scratch feasibility.
- llvm2scratch is vendored here and patched to support pre-seeding `stack` and a few extra IR patterns.
- Official Scratch does not support TurboWarp's hacked counter opcodes. Use the `*_scratch.*` outputs for scratch.mit.edu.

These are the key changes that made `llama2_scratch.c` viable:

- Preseeded memory: skip generating huge “initializer” scripts by directly injecting `stack` at export time.
- i8 pointer arithmetic fix: clang emits `getelementptr i8` using byte offsets (4/8/12/...), but our “memory” is list-indexed; we scale i8 GEP indices back into 32-bit cells (`i8_gep_div=4`).
- Stack reset progress: optional `resets` counter to confirm the VM is still working during long runs (we keep the speech bubble for generated text).
- Token streaming: `SB3_emit_token_dbl` logs token IDs to `output`, decodes through `vocab`, appends into `text`, and continuously updates the sprite speech bubble.
- Added intrinsic support: clang can emit `llvm.umin/umax/smin/smax`; llvm2scratch now translates these so -O2 IR compiles.

    @misc{andrews2026llm_from_scratch,
      author = {Andrews, David},
      title = {llm\_from\_scratch},
      year = {2026},
      howpublished = {\url{https://github.com/broyojo/llm_from_scratch}}
    }
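As a footnote to the i8 pointer arithmetic fix listed among the key changes, the scaling can be shown with a small worked example. This is only an illustration of the arithmetic, not llvm2scratch's implementation: `byte_gep_to_cell` is a hypothetical helper, its `div` parameter stands in for the `i8_gep_div` setting, and rejecting unaligned offsets is an assumption made here for clarity.

```python
CELL_BYTES = 4  # one list cell stands in for one 32-bit word


def byte_gep_to_cell(base_cell, byte_offset, div=CELL_BYTES):
    """Map an i8 GEP byte offset onto a list cell index.

    base_cell: 1-indexed cell the pointer currently refers to.
    byte_offset: offset in bytes as emitted by clang (4, 8, 12, ...).
    div: bytes per cell; hypothetical stand-in for the i8_gep_div setting.
    """
    if byte_offset % div != 0:
        # Assumption: sub-word offsets cannot be represented in a cell-indexed list.
        raise ValueError(f"byte offset {byte_offset} is not a multiple of {div}")
    return base_cell + byte_offset // div


if __name__ == "__main__":
    stack = [0.0] * 16  # toy memory; Scratch lists are 1-indexed
    base = 5            # hypothetical address of a small float array
    for k, byte_offset in enumerate((0, 4, 8, 12)):
        cell = byte_gep_to_cell(base, byte_offset)
        stack[cell - 1] = float(k)  # Python lists are 0-indexed
    print(stack[4:8])
```

Without the division, a byte offset of 12 from cell 5 would land on cell 17 instead of cell 8, which is why unscaled i8 GEPs read and write the wrong list entries.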