{"slug": "llm-from-scratch-a-small-llm-running-inside-mit-s-scratch", "title": "LLM from Scratch: a small LLM running inside MIT's Scratch", "summary": "A developer has created a small LLM that runs inside MIT's Scratch by compiling the llama2.c inference code to Scratch blocks using llvm2scratch. The project, called LLM from Scratch, allows users to generate text token-by-token in a Scratch sprite using a quantized 260K-parameter model. It demonstrates running neural network inference within Scratch's constrained environment.", "body_md": "Run the smallest `llama2.c` model (`stories260K`) inside Scratch/TurboWarp by compiling C inference code to Scratch blocks with `llvm2scratch`.\n\nIf everything is working, the sprite will start generating the familiar opening: `Once upon a time, ...` (streamed into the speech bubble token-by-token).\n\n- Scratch project: [https://scratch.mit.edu/projects/1277883263](https://scratch.mit.edu/projects/1277883263)\n\nThis repo vendors two upstream projects in-tree for reproducibility:\n\n- `llama2.c` by Andrej Karpathy (MIT). Source: `llama2.c/` and `llama2.c/LICENSE`.\n- `llvm2scratch` by Classfied3D (MIT). Source: `llvm2scratch/` and `llvm2scratch/LICENSE`.\n\nThe model/tokenizer artifacts in `artifacts/` come from the `llama2.c` ecosystem.\n\nHigh-level pipeline:\n\n- `scratch_llama2/build_stories260k_sprite3.py` reads:\n  - `artifacts/stories260K.bin` (the smallest llama2.c checkpoint)\n  - `artifacts/tok512.bin` (tokenizer vocabulary)\n- It quantizes the weight matrices to Q8_0 (group size 4) and packs 4 signed int8 values into one `u32`.\n- It lays out *everything* into a single Scratch list `!stack`:\n  - packed weights + per-group scales\n  - RMSNorm weights\n  - RoPE cos/sin tables (for a reduced `SEQ_LEN`)\n  - runtime buffers (x/xb/hb/q/att + KV cache)\n- It writes `scratch_llama2/generated_layout.h` with 1-indexed addresses into `!stack`.\n
- It compiles `scratch_llama2/llama2_scratch.c` to LLVM IR (`scratch_llama2/llama2_scratch.ll`) using:\n  - `clang --target=i386-none-elf` (keeps pointers as 32-bit ints)\n- It runs `llvm2scratch` to turn LLVM IR into Scratch blocks, then exports `.sprite3` and `.sb3` outputs.\n\nRuntime UI:\n\n- `!!output` (list) stores generated token IDs.\n- `!!vocab` (list) stores token pieces (strings).\n- `!!text` (variable) accumulates decoded text; the sprite `say`s it continuously.\n- `!!resets` (variable) increments when the compiled program triggers a broadcast-based “stack reset” (progress indicator + avoids JS call stack blowups).\n- `!!status` (variable) shows a high-level state machine (`Edit params...` -> `Running...` -> `Done.`).\n- `ui_*` variables let you adjust sampling/generation settings from TurboWarp/Scratch UI.\n\nRequires:\n\n- `clang`\n- `uv` (and Python >= 3.12; `llvm2scratch` requires it)\n\nCommand:\n\n```\n# If you don't have a usable Python yet:\n# uv python install 3.12\n#\n# Optional: tune stack reset frequency for TurboWarp stability/perf.\n# Lower = more stable (less likely to hit \"Maximum call stack size exceeded\"), but slower.\n# Higher = faster, but can crash in TurboWarp.\n# MAX_BRANCH_RECURSION=200 is the default.\n#\n# Optional: number of tokens to generate (upper bound). Defaults to 20.\n# (Must be <= SEQ_LEN, currently 32.)\n#\n# llvm2scratch requires Python >= 3.12; pin via `--python` to avoid uv picking an older system Python.\nMAX_BRANCH_RECURSION=200 \\\nGEN_STEPS=20 \\\nuv run --python 3.12 --no-project --with-editable ./llvm2scratch python scratch_llama2/build_stories260k_sprite3.py\n```\n\nOutputs:\n\n- `scratch_llama2/stories260k_inference.sprite3`: sprite, blocks hidden (fast editor/import)\n- `scratch_llama2/stories260k_inference_visible.sprite3`: sprite, blocks visible (debug)\n- `scratch_llama2/stories260k_inference_visible.sb3`: standalone project wrapper around the visible sprite\n- `scratch_llama2/stories260k_inference_visible_scratch.sprite3`: Scratch-compatible sprite (no TurboWarp-only blocks)\n- `scratch_llama2/stories260k_inference_visible_scratch.sb3`: Scratch-compatible standalone project\n\nSprite workflow:\n\n- Import `scratch_llama2/stories260k_inference_visible.sprite3` into TurboWarp (`File -> Upload sprite` or drag/drop).\n- Select the sprite.\n- Click the green flag.\n- Edit `ui_*` variables (Variables panel).\n- Press `space` (or click the sprite) to start.\n\nProject workflow:\n\n- Open `scratch_llama2/stories260k_inference_visible.sb3` in TurboWarp (`File -> Load from your computer`).\n- Click the green flag.\n- Use the sliders/monitors on the stage to edit params.\n- Press `space` (or click the sprite) to start.\n\nWhat you should see:\n\n- `!!status` updates: `Edit params...` -> `Running...` -> `Done.`\n- `!!resets` increments periodically (a \"still alive\" indicator during long runs).\n- As tokens are generated, the sprite streams decoded text into its speech bubble (`!!text`).\n
- For debugging, generated token IDs are appended to the `!!output` list.\n\nSampling UI:\n\n- `ui_steps`: max tokens to generate (<= 32).\n- `ui_temperature`: `0` => greedy; `>0` => sampling.\n- `ui_top_k`: `1` => greedy; `>1` => top-k sampling.\n- `ui_top_p`: nucleus cutoff in `(0, 1]` (use `1` to disable).\n- `ui_seed`: nonzero => deterministic; `0` => pick a random seed at start.\n- `ui_prompt_preset`: `0` => start from BOS; `1` => force the token prefix `Once upon a time,` (demo).\n\nOn scratch.mit.edu, use the `*_scratch.*` outputs:\n\n- `scratch_llama2/stories260k_inference_visible_scratch.sb3` (recommended)\n\nScratch is significantly slower than TurboWarp, and does not support TurboWarp-only “hacked counter” blocks.\n\n- `scratch_llama2/llama2_scratch.c` is inference-only and uses a reduced `SEQ_LEN` for Scratch feasibility.\n- `llvm2scratch` is vendored here and patched to support pre-seeding `!stack` and a few extra IR patterns.\n- Official Scratch does not support TurboWarp's hacked counter opcodes. Use the `*_scratch.*` outputs for scratch.mit.edu.\n\nThese are the key changes that made `llama2_scratch.c` viable:\n\n- Preseeded memory: skip generating huge “initializer” scripts by directly injecting `!stack` at export time.\n- i8 pointer arithmetic fix: clang emits `getelementptr i8` using *byte offsets* (4/8/12/...), but our “memory” is list-indexed; we scale i8 GEP indices back into 32-bit cells (`i8_gep_div=4`).\n- Stack reset progress: optional `!!resets` counter to confirm the VM is still working during long runs (we keep the speech bubble for generated text).\n- Token streaming: `SB3_emit_token_dbl` logs token IDs to `!!output`, decodes through `!!vocab`, appends into `!!text`, and continuously updates the sprite speech bubble.\n
- Added intrinsic support: clang can emit `llvm.umin/umax/smin/smax`; llvm2scratch now translates these so `-O2` IR compiles.\n\n```\n@misc{andrews2026llm_from_scratch,\n  author       = {Andrews, David},\n  title        = {llm\\_from\\_scratch},\n  year         = {2026},\n  howpublished = {\\url{https://github.com/broyojo/llm_from_scratch}}\n}\n```\n\n", "url": "https://wpnews.pro/news/llm-from-scratch-a-small-llm-running-inside-mit-s-scratch", "canonical_source": "https://github.com/Broyojo/llm_from_scratch", "published_at": "2026-06-24 17:03:04+00:00", "updated_at": "2026-06-24 17:10:07.759007+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["MIT Scratch", "llama2.c", "Andrej Karpathy", "llvm2scratch", "Classfied3D", "TurboWarp"], "alternates": {"html": "https://wpnews.pro/news/llm-from-scratch-a-small-llm-running-inside-mit-s-scratch", "markdown": "https://wpnews.pro/news/llm-from-scratch-a-small-llm-running-inside-mit-s-scratch.md", "text": "https://wpnews.pro/news/llm-from-scratch-a-small-llm-running-inside-mit-s-scratch.txt", "jsonld": "https://wpnews.pro/news/llm-from-scratch-a-small-llm-running-inside-mit-s-scratch.jsonld"}}