LLM for the ESP32-S3

Two ESP32-S3 microcontrollers run a Llama-architecture language model with its layers split across the two boards, which are connected by three jumper wires. As far as published projects show, this is the first multi-chip pipelined LLM inference on ESP32-class hardware. The system runs a 15-million-parameter model at approximately 1.4 tokens per second. A single board's 16MB of flash caps the model at roughly 15M parameters; distributing the weights across both chips' flash lifts that limit, and the project targets a 42-million-parameter model next. The worker board handles the initial layers, and the head board handles the embedding, the remaining layers, classification, and token sampling.

One language model, two microcontrollers. A Llama-architecture LLM running with its layers split across two ESP32-S3 boards: per token, the activation vector crosses three wires (CRC-framed UART) between the chips.

> Once upon a time there was a brave little fox cat and fish are friends. They like to play and swim in the pond. One day, the cat sees a bird on a tree...
>
> 1.4 tok/s across 2 boards

As far as published projects show, this is the first multi-chip pipelined LLM inference on ESP32-class hardware.

A 16MB ESP32-S3 caps out at a ~15M-parameter model (INT4). The next TinyStories size (stories42M, ~24MB) fits no single board. Splitting layers across two chips makes combined flash the limit, not the chip.

- **Weights:** INT4 (group-32 scales), streamed from a memory-mapped flash partition; 0 bytes of RAM used for weights
- **Compute:** INT8 activations, integer-exact group dot products, matmul rows split across both LX7 cores of each chip
- **Split:** worker board runs layers 0–K; head board holds embedding, layers K–L, classifier, tokenizer, and samples each next token
- **Link:** UART @460800, frame = `A5 5A | cmd | len | payload | CRC16`, ~1.2KB/token round trip (~3% of token time)

Everything testable without hardware was tested before flashing (see `/tests`): the forward pass matches a NumPy reference to ~3e-7 (INT4 and INT8); the split pipeline is bit-exact vs the monolithic model; the link protocol was fuzzed, with noise ignored and corrupted frames rejected by CRC.

| | params | weights live in | speed | output |
|---|---|---|---|---|
| known ESP32 LLM ports | 260K | RAM | 19–33 tok/s | sentence-level babble |
| this, single board | 15M | flash mmap | ~1.4 tok/s | multi-paragraph stories |
| this, two boards | 15M now / 42M target | split flash | ~1.4 / est. 0.5 tok/s | better still |

Coherence isn't gradual: TinyStories research shows plot consistency emerges in the millions-of-parameters range ([Eldan & Li 2023](https://arxiv.org/abs/2305.07759)).
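
The link frame described above (`A5 5A | cmd | len | payload | CRC16`) can be sketched on the PC side in a few lines of Python. This is only an illustration, not the firmware's code: the source gives the field order but not the details, so the 16-bit little-endian `len`, the CRC-16/CCITT-FALSE variant, the CRC covering `cmd | len | payload`, and the function names `encode_frame` / `decode_stream` are all assumptions.

```python
import binascii
import struct

SYNC = b"\xA5\x5A"  # the two sync bytes from the frame layout


def crc16(data: bytes) -> int:
    # ASSUMPTION: CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF).
    # The source only says "CRC16"; the real variant may differ.
    return binascii.crc_hqx(data, 0xFFFF)


def encode_frame(cmd: int, payload: bytes) -> bytes:
    # ASSUMPTION: 1-byte cmd, 2-byte little-endian length, CRC over cmd|len|payload.
    body = struct.pack("<BH", cmd, len(payload)) + payload
    return SYNC + body + struct.pack("<H", crc16(body))


def decode_stream(buf: bytes):
    """Yield (cmd, payload) for every valid frame; skip noise and bad frames."""
    i = 0
    while True:
        i = buf.find(SYNC, i)
        if i < 0 or i + 5 > len(buf):  # no sync left, or no room for a header
            return
        cmd, length = struct.unpack_from("<BH", buf, i + 2)
        end = i + 5 + length + 2
        if end > len(buf):  # truncated frame or bogus length: resync
            i += 1
            continue
        body = buf[i + 2 : i + 5 + length]
        (crc,) = struct.unpack_from("<H", buf, i + 5 + length)
        if crc == crc16(body):
            yield cmd, body[3:]
            i = end
        else:
            i += 1  # CRC mismatch: reject the frame and look for the next sync


if __name__ == "__main__":
    frame = encode_frame(0x01, b"activations")
    print(list(decode_stream(b"\x00noise" + frame)))
```

Scanning for the sync bytes and advancing one byte after any failed check is what lets a receiver ignore line noise and drop corrupted frames, the two behaviours the fuzz tests are said to cover.
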

Hardware: one ESP32-S3 with 16MB flash + 8MB PSRAM for single-board mode; two of them + 3 jumper wires for the pipeline.

Verified on: Waveshare ESP32-S3-Touch-LCD-5 (head) and Guition JC3248W535C (worker). The two boards do not have to be the same model. Displays are unused in v1: all interaction is over USB serial, and the screens stay dark by design.

PC one-time setup (Windows commands shown):

    winget install Python.Python.3.12
    :: close cmd, open a NEW one, then:
    python --version
    pip install numpy esptool

Arduino IDE with the esp32 board package (Boards Manager → "esp32" by Espressif; v3.x recommended, v2.x works because the sketches include a compat shim).

"Python was not found" after installing? Windows Settings → "Manage app execution aliases" → turn OFF python.exe and python3.exe, reopen cmd.

Download into `pc_tools/`:

- https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin (~60MB)
- https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.bin (right-click → Save link as → keep the name `tokenizer.bin`)

Then export:

    cd path\to\repo\pc_tools
    python export_model.py stories15M.bin tokenizer.bin full.bin --bits 4

Output: `full.bin` (~10MB), the INT4 flash image.

1. Copy `core\llm_core.c`, `core\llm_core.h`, and `partitions.csv` into `sketches\storyteller\`.
2. Open `esp32_storyteller.ino` in Arduino IDE. Tools settings: ESP32S3 Dev Module · USB CDC On Boot: Enabled · Flash Size: 16MB · PSRAM: OPI · CPU 240MHz. Upload.
3. Flash the model (close Serial Monitor first; COMx is in Tools → Port):

       python -m esptool --chip esp32s3 --port COMx write_flash 0x1F0000 full.bin

   Success = "Hash of data verified."
4. Serial Monitor @ 115200, press the board's reset button, wait for `Ready`, type a story opening, press Enter.
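
For readers curious what a `--bits 4` export with group-32 scales roughly involves, here is a minimal pure-Python sketch of group-wise INT4 weight quantization and the matching integer dot product against INT8 activations. It is not the exporter's code: symmetric rounding, `scale = max|w| / 7`, a single INT8 scale per activation vector, and the names `quantize_int4`, `quantize_int8`, and `group_dot` are assumptions made for illustration.

```python
GROUP = 32  # weights per scale, matching the "group-32" description


def quantize_int4(weights):
    """Symmetric per-group INT4 (ASSUMED scheme): ints in [-8, 7] plus one scale per group."""
    q, scales = [], []
    for g in range(0, len(weights), GROUP):
        group = weights[g:g + GROUP]
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # all-zero group -> scale 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales


def quantize_int8(x):
    """Symmetric INT8 with one scale for the whole vector (ASSUMED scheme)."""
    scale = max(abs(v) for v in x) / 127.0 or 1.0
    return [round(v / scale) for v in x], scale


def group_dot(q_w, w_scales, q_x, x_scale):
    """Accumulate each group in exact integers, then apply the scales once per group."""
    total = 0.0
    for gi, g in enumerate(range(0, len(q_w), GROUP)):
        acc = sum(a * b for a, b in zip(q_w[g:g + GROUP], q_x[g:g + GROUP]))
        total += acc * w_scales[gi] * x_scale
    return total


if __name__ == "__main__":
    w = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
    x = [((i * 53) % 97 - 48) / 48.0 for i in range(64)]
    q_w, s_w = quantize_int4(w)
    q_x, s_x = quantize_int8(x)
    print("float dot:    ", sum(a * b for a, b in zip(w, x)))
    print("quantized dot:", group_dot(q_w, s_w, q_x, s_x))
```

Keeping the per-group accumulation in integers is what makes a result reproducible bit for bit across machines, which is the property the bit-exactness tests rely on.
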

Split the model (worker gets layers 0–2; head gets embedding + 3–5):

    python split_image.py full.bin 3 worker.bin head.bin

Firmware: copy `core\` and `partitions.csv` into BOTH `sketches\pipeline_head\` and `sketches\pipeline_worker\`. Upload `pipeline_head.ino` to board A and `pipeline_worker.ino` to board B (same Tools settings as 2A, the single-board build).

Pins & baud: edit each sketch's copy of `core/pipeline_link.h` if needed. Each board's `LINK_TX_PIN` / `LINK_RX_PIN` must match its own wiring; the two boards' pin numbers do NOT have to match each other. `LINK_BAUD` MUST be identical on both. Defaults: 17/18 @460800. On the Waveshare 5" (no free GPIO header), use the I2C terminal block: TX = GPIO8 (SDA), RX = GPIO9 (SCL).

Flash each half (each board's own COM port, Serial Monitor closed):

    python -m esptool --chip esp32s3 --port COM_head write_flash 0x1F0000 head.bin
    python -m esptool --chip esp32s3 --port COM_worker write_flash 0x1F0000 worker.bin

Wire: 3 jumpers, boards powered off, TX↔RX crossed:

    head TX-pin ──→ worker RX-pin
    head RX-pin ←── worker TX-pin
    GND ─────────── GND    (any GND pin on each board; do NOT connect 3V3/VCC)

Run: power both (head on the PC; worker on any USB power). Serial Monitor on the HEAD's port @115200, reset both boards, wait for `Ready: emb + 3 local layers of 6 total`, type a prompt.

| Command | Effect |
|---|---|
| any text | generates a story continuing your text |
| `/temp 0.7` | lower = focused, higher (1.0+) = wilder |
| `/topp 0.9` | nucleus sampling cutoff |
| `/len 250` | max tokens per prompt |
| `/stats` | free RAM/PSRAM and current settings |

Status: the 42M export path is implemented and size-budgeted; the 15M pipeline is hardware-verified. Measured 42M numbers welcome via issues.
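
The context length chosen when exporting stories42M is bounded by the worker's KV cache, and the arithmetic is easy to check. The sketch below assumes a float32 cache with key/value width equal to the model dimension (512) and counts only the worker's 7 layers. Those assumptions are mine rather than stated in the source, but they reproduce its ~29MB figure for the native 1024-token context. The function name `kv_cache_bytes` is hypothetical.

```python
PSRAM_BYTES = 8 * 1024 * 1024  # 8MB PSRAM per board


def kv_cache_bytes(n_layers, seq_len, kv_dim, bytes_per_value=4):
    # Keys and values: two (seq_len x kv_dim) tensors per layer.
    # ASSUMPTION: float32 entries (4 bytes) and kv_dim == model dim.
    return 2 * n_layers * seq_len * kv_dim * bytes_per_value


if __name__ == "__main__":
    for seq in (1024, 224, 192):
        size = kv_cache_bytes(n_layers=7, seq_len=seq, kv_dim=512)
        fits = "fits" if size < PSRAM_BYTES else "too big"
        print(f"seq {seq:4d}: {size / 1e6:5.1f} MB ({fits})")
```

Under these assumptions a 224-token context needs about 6.4MB and a 192-token context about 5.5MB. PSRAM also has to hold other runtime buffers, so the real headroom is smaller than the raw difference suggests.
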

    :: download stories42M.bin (~170MB) from the same HuggingFace page, then:
    python export_model.py stories42M.bin tokenizer.bin full42.bin --bits 4 --gs 32 --seq 224
    python split_image.py full42.bin 7 worker42.bin head42.bin
    python -m esptool --chip esp32s3 --port COM_head write_flash 0x1F0000 head42.bin
    python -m esptool --chip esp32s3 --port COM_worker write_flash 0x1F0000 worker42.bin

`--seq 224` is required (42M's native 1024-token context would need a ~29MB KV cache, over the 8MB PSRAM; 224 fits). Split at 7, not 3: the ~9MB embedding lives on the head, so layers skew to the worker. Check the printed image sizes stay under 14.6MB. Boot banners should read "emb + 1 local layers of 8" (head) and "7 local layers, dim 512" (worker). Expect ~0.4–0.7 tok/s, and clearly better stories. Worker allocation failure at boot → re-export with `--seq 192`.

| Symptom | Fix |
|---|---|
| `llm_init -1` | no/old model in flash: redo the esptool step at 0x1F0000 |
| `no 'model' partition` | partitions.csv not applied: copy into the sketch folder, set Partition Scheme: Custom, re-upload |
| boot log but no banner | flip Tools → USB CDC On Boot, re-upload |
| `PSRAM not found` | Tools → PSRAM: OPI (or QSPI), re-upload |
| `link timeout` | TX/RX swapped at one end, or GND wire missing |
| `crc error, retry` constantly | shorten wires or set `LINK_BAUD` 115200 on BOTH boards |
| esptool: No such file | cd into the folder containing the .bin |
| esptool: port busy | close Serial Monitor |

Repository layout:

    core/            inference engine + link protocol (host-verified, portable C)
    sketches/        Arduino firmware (copy core/ + partitions.csv in before building)
    pc_tools/        model quantizer/exporter and the layer splitter
    tests/           the proof: NumPy-reference, bit-exactness, and protocol fuzz tests
    partitions.csv   16MB flash layout with the 14MB model partition

- stories42M measured on hardware
- PIE SIMD in the marked matmul slot (`llm_matmul_rows`), est. 2-3×
- On-device touch UI (no PC)

MIT License.

Engine architecture after [llama2.c](https://github.com/karpathy/llama2.c) (Andrej Karpathy, MIT); models trained on [TinyStories](https://arxiv.org/abs/2305.07759) (Eldan & Li, Microsoft Research). Code developed in collaboration with Claude (Anthropic); hardware, integration, and debugging by me.