One language model, two microcontrollers. A Llama-architecture LLM runs with its layers split across two ESP32-S3 boards: for each token, the activation vector crosses three wires (CRC-framed UART) between the chips.
> Once upon a time there was a brave little fox
cat and fish are friends. They like to play and swim in the pond.
One day, the cat sees a bird on a tree...
[1.4 tok/s across 2 boards]
As far as published projects show, this is the first multi-chip pipelined LLM inference on ESP32-class hardware.
A 16MB ESP32-S3 caps out at a ~15M-parameter model (INT4). The next TinyStories size (stories42M, ~24MB) does not fit on any single board. Splitting layers across two chips makes the combined flash the limit, not a single chip's flash.
- Weights: INT4 (group-32 scales), streamed from a memory-mapped flash partition; 0 bytes of RAM used for weights
- Compute: INT8 activations, integer-exact group dot products, matmul rows split across both LX7 cores of each chip
- Split: worker board runs layers 0–K; head board holds embedding, layers K–L, classifier, tokenizer, and samples each next token
- Link: UART @460800, frame = `A5 5A | cmd | len | payload | CRC16`, ~1.2KB/token round trip (~3% of token time)
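To make the frame layout concrete, here is a minimal host-side sketch of encoding and decoding `A5 5A | cmd | len | payload | CRC16`. Only the field order comes from the text. The field widths (1-byte `cmd`, 2-byte little-endian `len`), the CRC variant (CRC-16/CCITT-FALSE), and the choice to run the CRC over `cmd | len | payload` are assumptions for illustration, and `encode_frame` / `decode_frames` are hypothetical names, not the repo's API.

```python
import struct

SYNC = b"\xA5\x5A"  # sync bytes from the frame layout in the text


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF). Assumed variant."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def encode_frame(cmd: int, payload: bytes) -> bytes:
    # Assumed widths: 1-byte cmd, 2-byte little-endian len, 2-byte CRC.
    body = struct.pack("<BH", cmd, len(payload)) + payload
    return SYNC + body + struct.pack("<H", crc16_ccitt(body))


def decode_frames(stream: bytes):
    """Yield (cmd, payload) for each valid frame; skip noise and bad CRCs."""
    i = 0
    while True:
        i = stream.find(SYNC, i)
        if i < 0 or i + 7 > len(stream):  # 7 = sync(2) + cmd(1) + len(2) + crc(2)
            return
        cmd, length = struct.unpack_from("<BH", stream, i + 2)
        end = i + 5 + length + 2
        if end <= len(stream):
            body = stream[i + 2 : i + 5 + length]
            (crc,) = struct.unpack_from("<H", stream, i + 5 + length)
            if crc == crc16_ccitt(body):
                yield cmd, body[3:]
                i = end
                continue
        i += 1  # no valid frame starts here: resync one byte later
```

Feeding `decode_frames` a buffer of line noise followed by one encoded frame yields just that frame, and a frame with a single flipped payload bit yields nothing. That is the behaviour the fuzz tests are described as checking.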
Everything testable without hardware was tested before flashing (see /tests): forward pass matches a NumPy reference to ~3e-7 (INT4 and INT8); the split pipeline is bit-exact vs the monolithic model; the link protocol was fuzzed, with noise ignored and corrupted frames rejected by CRC.
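As a rough illustration of "group-32 scales" and "integer-exact group dot products", the NumPy sketch below quantizes weights to INT4 and activations to INT8 in groups of 32, accumulates each group's dot product in integers, and applies one float multiply per group. It assumes symmetric quantization with one scale per group for both operands. The actual on-flash format and the kernel in `core/` may differ, and the helper names are hypothetical.

```python
import numpy as np

GROUP = 32  # group size from the text; the rest of this sketch is assumed


def quantize(x: np.ndarray, qmax: int):
    """Symmetric per-group quantization: integer codes plus one scale per group."""
    groups = x.reshape(-1, GROUP)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # all-zero group: any scale works, avoid 0/0
    q = np.clip(np.rint(groups / scale), -qmax, qmax).astype(np.int32)
    return q, scale.ravel()


def group_dot(wq, ws, xq, xs) -> float:
    """Integer dot product inside each group, then one float multiply per group."""
    acc = (wq * xq).sum(axis=1)  # exact integers: |acc| <= 32 * 7 * 127
    return float((acc * ws * xs).sum())


rng = np.random.default_rng(0)
w = rng.standard_normal(288).astype(np.float32)  # 288 = 9 groups of 32
x = rng.standard_normal(288).astype(np.float32)

wq, ws = quantize(w, 7)    # INT4 codes in [-7, 7]
xq, xs = quantize(x, 127)  # INT8 codes in [-127, 127]

approx = group_dot(wq, ws, xq, xs)
# Same quantized values, but dequantized first and multiplied in floating point.
dequant = float(((wq * ws[:, None]) * (xq * xs[:, None])).sum())
```

`approx` and `dequant` agree to floating-point rounding because the per-group sums are exact integers. That property is what lets a C kernel be compared against a NumPy reference at a tight tolerance.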
| | params | weights live in | speed | output |
|---|---|---|---|---|
| known ESP32 LLM ports | 260K | RAM | 19–33 tok/s | sentence-level babble |
| this, single board | 15M | flash (mmap) | ~1.4 tok/s | multi-paragraph stories |
| this, two boards | 15M now / 42M target | split flash | ~1.4 / est. 0.5 tok/s | better still |
Coherence isn't gradual: TinyStories research shows plot consistency emerges in the millions-of-parameters range (Eldan & Li 2023).
Hardware: one ESP32-S3 with 16MB flash + 8MB PSRAM for single-board mode; two of them + 3 jumper wires for the pipeline. Verified on: Waveshare ESP32-S3-Touch-LCD-5 (head) and Guition JC3248W535C (worker). The two boards do not have to be the same model. Displays are unused in v1: all interaction is over USB serial, and the screens stay dark by design.
PC (one-time setup, Windows commands shown):
winget install Python.Python.3.12
:: close cmd, open a NEW one, then:
python --version
pip install numpy esptool
Arduino IDE with the esp32 board package (Boards Manager → "esp32" by Espressif; v3.x recommended, v2.x also works because the sketches include a compat shim).
"Python was not found" after installing? Windows Settings → "Manage app execution aliases" → turn OFF python.exe and python3.exe, then reopen cmd.
Download into `pc_tools/`:
- https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin (~60MB)
- https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.bin (right-click → Save link as → keep the name `tokenizer.bin`)
cd path\to\repo\pc_tools
python export_model.py stories15M.bin tokenizer.bin full.bin --bits 4
Output: `full.bin` (~10MB), the INT4 flash image.
1. Copy `core\llm_core.c`, `core\llm_core.h`, and `partitions.csv` into `sketches\storyteller\`.
2. Open `esp32_storyteller.ino` in Arduino IDE. Tools settings: ESP32S3 Dev Module · USB CDC On Boot: Enabled · Flash Size: 16MB · PSRAM: OPI · CPU 240MHz. Upload.
3. Flash the model (close Serial Monitor first; COMx is in Tools → Port):
python -m esptool --chip esp32s3 --port COMx write_flash 0x1F0000 full.bin
Success = "Hash of data verified."
4. Serial Monitor @ 115200, press the board's reset button, wait for `Ready`, type a story opening, press Enter.
Split the model (worker gets layers 0–2; head gets embedding + 3–5):
python split_image.py full.bin 3 worker.bin head.bin
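The split point is simply the index of the first layer that stays on the head. The sketch below shows that bookkeeping only. `split_layers` is a hypothetical helper that does not reflect the internals of `split_image.py`, and the rule that each board keeps at least one layer is an assumption.

```python
def split_layers(n_layers: int, split_at: int):
    """Return (worker_layers, head_layers) for a given split point."""
    if not 0 < split_at < n_layers:
        # Assumed constraint: each board keeps at least one layer.
        raise ValueError("split point must leave at least one layer on each board")
    return list(range(split_at)), list(range(split_at, n_layers))


worker, head = split_layers(6, 3)  # stories15M has 6 layers, split at 3
print(f"worker: layers {worker[0]}-{worker[-1]}")
print(f"head:   emb + {len(head)} local layers of {len(worker) + len(head)} total")
```

For the 15M model this gives the worker layers 0-2 and the head layers 3-5, which matches the head's `emb + 3 local layers of 6 total` boot banner.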
Firmware: copy `core\*` and `partitions.csv` into BOTH `sketches\pipeline_head\` and `sketches\pipeline_worker\`. Upload `pipeline_head.ino` to board A and `pipeline_worker.ino` to board B (same Tools settings as 2A).
Pins & baud: edit each sketch's copy of `core/pipeline_link.h` if needed. Each board's `LINK_TX_PIN`/`LINK_RX_PIN` must match its own wiring; the two boards' pin numbers do NOT have to match each other. `LINK_BAUD` MUST be identical on both. Defaults: 17/18 @460800. On the Waveshare 5" (no free GPIO header) use the I2C terminal block: TX = GPIO8 (SDA), RX = GPIO9 (SCL).
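The baud rate sets how much of each token's time is spent on the wire. The back-of-envelope sketch below uses the ~1.2KB/token and ~1.4 tok/s figures from this README and assumes 8N1 framing (10 bits on the wire per byte), which the text does not state. The 115200 row is the fallback rate suggested for noisy links.

```python
def link_time_ms(bytes_per_token: int, baud: int, bits_per_byte: int = 10) -> float:
    """Wire time per token in ms. 10 bits/byte assumes 8N1 (start + 8 data + stop)."""
    return bytes_per_token * bits_per_byte / baud * 1000.0


TOKEN_MS = 1000.0 / 1.4  # ~1.4 tok/s measured, from the text

for baud in (460800, 115200):
    ms = link_time_ms(1200, baud)  # ~1.2KB/token round trip, from the text
    print(f"{baud:>6} baud: {ms:6.1f} ms/token on the wire ({ms / TOKEN_MS:.1%} of a token)")
```

Under these assumptions the link costs about 26 ms per token at 460800 baud, in line with the quoted ~3%, and about 104 ms at 115200, so dropping the baud rate trades some speed for robustness.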
Flash each half (each board's own COM port, Serial Monitor closed):
python -m esptool --chip esp32s3 --port COM_head write_flash 0x1F0000 head.bin
python -m esptool --chip esp32s3 --port COM_worker write_flash 0x1F0000 worker.bin
Wire: 3 jumpers, boards powered off, TX↔RX crossed:
head TX-pin ─── worker RX-pin
head RX-pin ─── worker TX-pin
GND ─────────── GND (any GND pin on each board; do NOT connect 3V3/VCC)
Run: power both (head on the PC; worker on any USB power). Serial Monitor on the HEAD's port @115200, reset both boards, wait for `Ready: emb + 3 local layers of 6 total`, type a prompt.
| Command | Effect |
|---|---|
| any text | generates a story continuing your text |
| `/temp 0.7` | lower = focused, higher (1.0+) = wilder |
| `/topp 0.9` | nucleus sampling cutoff |
| `/len 250` | max tokens per prompt |
| `/stats` | free RAM/PSRAM and current settings |
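For readers unfamiliar with the `/temp` and `/topp` knobs, here is a generic NumPy sketch of temperature plus nucleus (top-p) sampling. It shows the general technique and is not the firmware's sampler. The `sample` function and the `temp == 0` greedy shortcut are illustrative assumptions.

```python
import numpy as np


def sample(logits: np.ndarray, temp: float, topp: float, rng: np.random.Generator) -> int:
    """Temperature + nucleus (top-p) sampling over one logits vector."""
    if temp == 0.0:
        return int(np.argmax(logits))           # greedy decoding
    z = (logits - logits.max()) / temp          # subtract max for numerical stability
    probs = np.exp(z)
    probs /= probs.sum()
    order = np.argsort(-probs)                  # token ids, most probable first
    cum = np.cumsum(probs[order])
    keep = int(np.searchsorted(cum, topp)) + 1  # smallest prefix whose mass reaches topp
    kept = order[:keep]
    return int(rng.choice(kept, p=probs[kept] / probs[kept].sum()))


rng = np.random.default_rng(0)
logits = np.array([0.0, 5.0, 1.0])
token = sample(logits, temp=0.7, topp=0.9, rng=rng)
```

Lower temperatures sharpen the distribution before the cutoff is applied, so fewer tokens survive the top-p filter. In this toy example a single token already holds more than 90% of the probability mass, so it is always chosen.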
Status: export path implemented and size-budgeted; the 15M pipeline is hardware-verified. Measured 42M numbers welcome via issues.
:: download stories42M.bin (~170MB) from the same HuggingFace page, then:
python export_model.py stories42M.bin tokenizer.bin full42.bin --bits 4 --gs 32 --seq 224
python split_image.py full42.bin 7 worker42.bin head42.bin
python -m esptool --chip esp32s3 --port COM_head write_flash 0x1F0000 head42.bin
python -m esptool --chip esp32s3 --port COM_worker write_flash 0x1F0000 worker42.bin
`--seq 224` is required: 42M's native 1024-token context would need a ~29MB KV cache, well over the 8MB PSRAM, while 224 fits. Split at 7, not 3: the ~9MB embedding lives on the head, so layers skew to the worker. Check that the printed image sizes stay under 14.6MB. Boot banners should read "emb + 1 local layers of 8" (head) and "7 local layers, dim 512" (worker). Expect ~0.4–0.7 tok/s, and clearly better stories. If the worker fails to allocate memory at boot, re-export with `--seq 192`.
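The ~29MB figure can be reproduced with a short calculation. This sketch assumes float32 cache entries and a key/value width equal to the model dimension (512, no grouped-query attention). Neither is stated in the text, but together they match the quoted number for the worker's 7 layers.

```python
def kv_cache_bytes(n_layers: int, seq_len: int, kv_dim: int, bytes_per_value: int = 4) -> int:
    """Keys + values for every local layer and position. float32 entries are assumed."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_value


PSRAM = 8 * 1024 * 1024  # 8MB PSRAM per board

for seq in (1024, 224, 192):
    size = kv_cache_bytes(n_layers=7, seq_len=seq, kv_dim=512)  # worker: 7 layers, dim 512
    verdict = "fits" if size < PSRAM else "too big"
    print(f"--seq {seq:4d}: {size / 1e6:5.1f} MB ({verdict})")
```

Under these assumptions the cache is about 29.4MB at 1024 tokens, 6.4MB at 224, and 5.5MB at 192. PSRAM also has to hold activations and other buffers, so "fits" here is a necessary condition rather than a guarantee, which is why `--seq 192` is the fallback.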
| Symptom | Fix |
|---|---|
| `llm_init -1` | no/old model in flash → redo the esptool step at 0x1F0000 |
| `no 'model' partition` | partitions.csv not applied → copy into the sketch folder, set Partition Scheme: Custom, re-upload |
| boot log but no banner | flip Tools → USB CDC On Boot, re-upload |
| `PSRAM not found` | Tools → PSRAM: OPI (or QSPI), re-upload |
| `link timeout` | TX/RX swapped at one end, or GND wire missing |
| crc error, retry constantly | shorten wires or set LINK_BAUD 115200 on BOTH boards |
| esptool: No such file | cd into the folder containing the .bin |
| esptool: port busy | close Serial Monitor |
core/ inference engine + link protocol (host-verified, portable C)
sketches/ Arduino firmware (copy core/* + partitions.csv in before building)
pc_tools/ model quantizer/exporter and the layer splitter
tests/ the proof: NumPy-reference, bit-exactness, and protocol fuzz tests
partitions.csv 16MB flash layout with the 14MB model partition
- stories42M measured on hardware
- PIE SIMD in the marked matmul slot (`llm_matmul_rows`), est. 2-3×
- On-device touch UI (no PC)
MIT License. Engine architecture after llama2.c (Andrej Karpathy, MIT); models trained on TinyStories (Eldan & Li, Microsoft Research). Code developed in collaboration with Claude (Anthropic); hardware, integration, and debugging by me.