[ARTICLE · art-24897] src=github.com pub= topic=large-language-models verified=true sentiment=↑ positive

LLM for the ESP32-S3

Two ESP32-S3 microcontrollers run a Llama-architecture language model with its layers split across two boards connected by three jumper wires; as far as published projects show, this is the first multi-chip pipelined LLM inference on ESP32-class hardware. The system runs a 15-million-parameter model at approximately 1.4 tokens per second. A single board's 16MB flash caps models at roughly 15M parameters, so the project targets a 42-million-parameter model by distributing weights across both chips' flash storage, with the worker board handling the initial layers and the head board managing embedding, the remaining layers, classification, and token sampling.

read 6 min · published Jun 12, 2026

One language model, two microcontrollers. A Llama-architecture LLM runs with its layers split across two ESP32-S3 boards; per token, the activation vector crosses three wires (CRC-framed UART) between the chips.

> Once upon a time there was a brave little fox
cat and fish are friends. They like to play and swim in the pond.
One day, the cat sees a bird on a tree...
[1.4 tok/s across 2 boards]

As far as published projects show, this is the first multi-chip pipelined LLM inference on ESP32-class hardware.

A 16MB ESP32-S3 caps out at a ~15M-parameter model (INT4). The next TinyStories size, stories42M (~24MB), fits no single board. Splitting layers across two chips makes combined flash the limit, not the chip.

  • Weights: INT4 (group-32 scales), streamed from a memory-mapped flash partition; 0 bytes of RAM used for weights
  • Compute: INT8 activations, integer-exact group dot products, matmul rows split across both LX7 cores of each chip
  • Split: worker board runs layers 0–K; head board holds embedding, layers K–L, classifier, tokenizer, and samples each next token
  • Link: UART @460800, frame = A5 5A | cmd | len | payload | CRC16, ~1.2KB/token round trip (~3% of token time)
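The frame layout in the Link item can be sketched on the PC side as follows. This is an illustration, not the repo's link code: only the A5 5A sync bytes and the field order come from the text. The one-byte cmd, the two-byte little-endian len, the CRC-16/CCITT-FALSE variant, the bytes the CRC covers, and the names `encode_frame` and `decode_frames` are all assumptions.

```python
import binascii
import struct

SYNC = b"\xA5\x5A"  # sync bytes from the frame layout in the text


def crc16(data: bytes) -> int:
    # ASSUMPTION: CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF).
    # The text only says "CRC16".
    return binascii.crc_hqx(data, 0xFFFF)


def encode_frame(cmd: int, payload: bytes) -> bytes:
    # ASSUMPTION: 1-byte cmd, 2-byte little-endian len,
    # CRC computed over cmd + len + payload.
    body = struct.pack("<BH", cmd, len(payload)) + payload
    return SYNC + body + struct.pack("<H", crc16(body))


def decode_frames(stream: bytes):
    """Yield (cmd, payload) for each valid frame; skip noise and bad CRCs."""
    pos = 0
    while True:
        start = stream.find(SYNC, pos)
        # The smallest possible frame is 7 bytes (sync + cmd + len + crc).
        if start < 0 or start + 7 > len(stream):
            return
        cmd, length = struct.unpack_from("<BH", stream, start + 2)
        end = start + 5 + length + 2
        if end > len(stream):
            pos = start + 1  # implausible length: resync one byte later
            continue
        body = stream[start + 2 : start + 5 + length]
        (received,) = struct.unpack_from("<H", stream, start + 5 + length)
        if received == crc16(body):
            yield cmd, stream[start + 5 : start + 5 + length]
            pos = end
        else:
            pos = start + 1  # corrupted frame: reject and keep scanning
```

Scanning for the sync bytes and validating the CRC before accepting a payload is what lets a receiver ignore line noise and drop corrupted frames, which are the two behaviours the project reports fuzzing.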

Everything testable without hardware was tested before flashing (see /tests): the forward pass matches a NumPy reference to ~3e-7 (INT4 and INT8); the split pipeline is bit-exact vs the monolithic model; the link protocol was fuzzed, with noise ignored and corrupted frames rejected by CRC.
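The shape of the bit-exactness check can be illustrated with a toy integer model. The sketch below is not the repo's test: `toy_layer`, the 8-element activation, and the byte serialization are hypothetical stand-ins, and the head → worker → head data flow is inferred from the description of which board holds which layers. It only shows that a split forward pass which ships INT8 activations through a byte buffer must equal the monolithic pass exactly.

```python
import struct


def toy_layer(index: int, activation: list[int]) -> list[int]:
    # Hypothetical stand-in for a transformer layer: deterministic integer
    # math, wrapped into the signed INT8 range so it serializes losslessly.
    mixed = [(3 * x + index + i) % 256 for i, x in enumerate(activation)]
    return [v - 256 if v > 127 else v for v in mixed]


def run_layers(first: int, last: int, activation: list[int]) -> list[int]:
    # Apply layers first..last-1 in order.
    for index in range(first, last):
        activation = toy_layer(index, activation)
    return activation


def over_the_wire(activation: list[int]) -> list[int]:
    # Serialize to signed bytes and back, as a serial link would.
    raw = struct.pack(f"{len(activation)}b", *activation)
    return list(struct.unpack(f"{len(raw)}b", raw))


def monolithic(activation: list[int], n_layers: int) -> list[int]:
    return run_layers(0, n_layers, activation)


def split(activation: list[int], n_layers: int, k: int) -> list[int]:
    sent = over_the_wire(activation)          # head -> worker
    worker_out = run_layers(0, k, sent)       # worker: layers 0..k-1
    returned = over_the_wire(worker_out)      # worker -> head
    return run_layers(k, n_layers, returned)  # head: layers k..n_layers-1
```

Comparing `split(x, 6, 3)` with `monolithic(x, 6)` using `==` rather than a tolerance is the point: integer activations leave no rounding slack for the link to hide behind.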

                         params                  weights live in   speed                    output
known ESP32 LLM ports    260K                    RAM               19–33 tok/s              sentence-level babble
this, single board       15M                     flash (mmap)      ~1.4 tok/s               multi-paragraph stories
this, two boards         15M now / 42M target    split flash       ~1.4 / est. 0.5 tok/s    better still

Coherence isn't gradual: TinyStories research shows plot consistency emerges in the millions-of-parameters range (Eldan & Li 2023).

Hardware: one ESP32-S3 with 16MB flash + 8MB PSRAM for single-board mode; two of them + 3 jumper wires for the pipeline. Verified on: Waveshare ESP32-S3-Touch-LCD-5 (head) and Guition JC3248W535C (worker). The two boards do not have to be the same model. Displays are unused in v1: all interaction is over USB serial, and the screens stay dark by design.

PC (one-time setup, Windows commands shown):

winget install Python.Python.3.12
:: close cmd, open a NEW one, then:
python --version
pip install numpy esptool

Arduino IDE with the esp32 board package (Boards Manager → "esp32" by Espressif; v3.x recommended, v2.x works because the sketches include a compat shim).

"Python was not found" after installing? Windows Settings → "Manage app execution aliases" → turn OFF python.exe and python3.exe, reopen cmd.

Download into pc_tools/:

  • https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin (~60MB)
  • https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.bin (right-click → Save link as → keep the name tokenizer.bin)

cd path\to\repo\pc_tools
python export_model.py stories15M.bin tokenizer.bin full.bin --bits 4

Output: full.bin (~10MB), the INT4 flash image.
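For readers unfamiliar with group-wise INT4, the idea behind `--bits 4` with group-32 scales can be sketched in a few lines. This is a generic illustration, not export_model.py itself: the symmetric [-7, 7] rounding, the +8 nibble offset, the low-nibble-first packing, and the float32 scale size are all assumptions, and the function names are hypothetical.

```python
GROUP = 32  # group size from the text ("group-32 scales")


def quantize_group(values):
    """Return (scale, 16 packed bytes) for one group of 32 floats."""
    assert len(values) == GROUP
    peak = max(abs(v) for v in values)
    scale = peak / 7.0 if peak > 0 else 1.0
    # ASSUMPTION: symmetric rounding to [-7, 7], stored as nibble + 8.
    q = [max(-7, min(7, round(v / scale))) + 8 for v in values]
    # ASSUMPTION: two weights per byte, even index in the low nibble.
    packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, GROUP, 2))
    return scale, packed


def dequantize_group(scale, packed):
    """Invert quantize_group, up to rounding error of at most scale / 2."""
    out = []
    for byte in packed:
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out


def image_bytes(n_params, scale_bytes=4):
    # ASSUMPTION: one float32 scale per group. A real image also carries a
    # header and anything stored unquantized, so this is only a rough floor.
    return n_params // 2 + (n_params // GROUP) * scale_bytes
```

Under these assumptions 15 million parameters come to about 9.4MB before any header, which is the same order as the ~10MB image; the exact figure depends on details of the export format not described here.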

  1. Copy core\llm_core.c, core\llm_core.h, and partitions.csv into sketches\storyteller\.
  2. Open esp32_storyteller.ino in Arduino IDE. Tools settings: ESP32S3 Dev Module · USB CDC On Boot: Enabled · Flash Size: 16MB · PSRAM: OPI · CPU 240MHz. Upload.
  3. Flash the model (close Serial Monitor first; COMx is in Tools → Port):

python -m esptool --chip esp32s3 --port COMx write_flash 0x1F0000 full.bin

     Success = "Hash of data verified."
  4. Serial Monitor @ 115200, press the board's reset button, wait for Ready, type a story opening, press Enter.

Split the model (worker gets layers 0–2; head gets embedding + 3–5):

python split_image.py full.bin 3 worker.bin head.bin
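The split argument can be read as "the first layer that stays on the head". A minimal sketch of that mapping, assuming this interpretation (it is consistent with the layer ranges quoted for a split of 3) and using a hypothetical helper name rather than anything from split_image.py:

```python
def split_layers(n_layers: int, split_at: int):
    """Return (worker_layers, head_layers) for a given split point."""
    # ASSUMPTION: each board must keep at least one layer.
    if not 0 < split_at < n_layers:
        raise ValueError("split point must leave a layer on each board")
    return list(range(split_at)), list(range(split_at, n_layers))


worker, head = split_layers(n_layers=6, split_at=3)
print("worker:", worker)  # layers 0, 1, 2
print("head:  ", head)    # layers 3, 4, 5 (plus the embedding)
```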

Firmware: copy core\* and partitions.csv into BOTH sketches\pipeline_head\ and sketches\pipeline_worker\. Upload pipeline_head.ino to board A and pipeline_worker.ino to board B (same Tools settings as for the single-board sketch).

Pins & baud: edit each sketch's copy of core/pipeline_link.h if needed. Each board's LINK_TX_PIN/LINK_RX_PIN must match its own wiring; the two boards' pin numbers do NOT have to match each other. LINK_BAUD MUST be identical on both. Defaults: 17/18 @460800. On the Waveshare 5" (no free GPIO header) use the I2C terminal block: TX = GPIO8 (SDA), RX = GPIO9 (SCL).
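The cost of changing LINK_BAUD can be estimated with simple arithmetic. The sketch below assumes 8N1 UART framing (10 bits on the wire per byte), takes ~1.2KB per token as 1200 bytes, and holds throughput at the measured ~1.4 tok/s; `link_share` is a hypothetical helper, not part of the repo.

```python
def link_share(bytes_per_token: float, baud: int, tokens_per_second: float) -> float:
    """Fraction of each token's time spent moving bytes over the UART."""
    # ASSUMPTION: 8N1 framing, i.e. 10 bits on the wire per payload byte.
    seconds_on_wire = bytes_per_token * 10 / baud
    return seconds_on_wire * tokens_per_second


fast = link_share(1200, 460800, 1.4)
slow = link_share(1200, 115200, 1.4)
print(f"460800 baud: {fast:.1%} of token time")
print(f"115200 baud: {slow:.1%} of token time")
```

Under these assumptions the default rate costs roughly 3.6% of each token's time and 115200 baud roughly 14.6%, so the slower setting is a reasonable trade when wiring is marginal. The second figure is slightly optimistic, since a slower link also lowers the token rate itself.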

Flash each half (each board's own COM port, Serial Monitor closed):

python -m esptool --chip esp32s3 --port COM_head   write_flash 0x1F0000 head.bin
python -m esptool --chip esp32s3 --port COM_worker write_flash 0x1F0000 worker.bin

Wire: 3 jumpers, boards powered off, TX↔RX crossed:

head TX-pin ──→ worker RX-pin
head RX-pin ←── worker TX-pin
GND ─────────── GND     (any GND pin on each board; do NOT connect 3V3/VCC)

Run: power both (head on the PC; worker on any USB power). Serial Monitor on the HEAD's port @115200, reset both boards, wait for "Ready: emb + 3 local layers of 6 total", type a prompt.

Command      Effect
any text     generates a story continuing your text
/temp 0.7    lower = focused, higher (1.0+) = wilder
/topp 0.9    nucleus sampling cutoff
/len 250     max tokens per prompt
/stats       free RAM/PSRAM and current settings
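To make the /temp and /topp settings concrete, here is a minimal sketch of temperature scaling followed by nucleus (top-p) sampling. It shows one common way to implement these knobs and is not taken from the firmware; `sample_top_p` and its `coin` argument (a uniform random number in [0, 1), passed in so the function stays deterministic) are hypothetical.

```python
import math


def softmax(logits, temperature):
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_p(logits, temperature, top_p, coin):
    """Pick a token index; coin is a uniform random number in [0, 1)."""
    if temperature <= 0.0:
        # Treat zero temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample within the kept set, renormalized by its total mass.
    target = coin * cumulative
    running = 0.0
    for i in kept:
        running += probs[i]
        if target < running:
            return i
    return kept[-1]
```

In use, `coin` would come from something like `random.random()`. Lowering top_p trims the unlikely tail entirely, while lowering the temperature only makes that tail less probable.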

Status: export path implemented and size-budgeted; the 15M pipeline is hardware-verified. Measured 42M numbers welcome via issues.

:: download stories42M.bin (~170MB) from the same HuggingFace page, then:
python export_model.py stories42M.bin tokenizer.bin full42.bin --bits 4 --gs 32 --seq 224
python split_image.py full42.bin 7 worker42.bin head42.bin
python -m esptool --chip esp32s3 --port COM_head   write_flash 0x1F0000 head42.bin
python -m esptool --chip esp32s3 --port COM_worker write_flash 0x1F0000 worker42.bin

--seq 224 is required: 42M's native 1024-token context would need a ~29MB KV cache, well over the 8MB PSRAM, while 224 fits. Split at 7, not 3: the ~9MB embedding lives on the head, so layers skew to the worker. Check that the printed image sizes stay under 14.6MB. Boot banners should read "emb + 1 local layers of 8" (head) and "7 local layers, dim 512" (worker). Expect ~0.4–0.7 tok/s and clearly better stories. If the worker fails to allocate at boot, re-export with --seq 192.
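The KV-cache figure can be reproduced with a short calculation. The sketch assumes a float32 cache, a key/value width equal to the model dim of 512 shown in the worker's boot banner, and the worker's 7 layers; it also treats the full 8MB of PSRAM as available, which is optimistic because other buffers live there too. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers: int, seq_len: int, kv_dim: int, bytes_per_value: int = 4) -> int:
    # Keys and values: two tensors of seq_len x kv_dim per layer.
    # ASSUMPTION: float32 entries (4 bytes each).
    return 2 * n_layers * seq_len * kv_dim * bytes_per_value


PSRAM_BYTES = 8 * 1024 * 1024  # 8MB PSRAM per board

for seq in (1024, 224, 192):
    size = kv_cache_bytes(n_layers=7, seq_len=seq, kv_dim=512)
    print(f"seq {seq}: {size / 1e6:.1f} MB, under 8MB: {size < PSRAM_BYTES}")
```

With these assumptions the native 1024-token context needs about 29.4MB, 224 tokens about 6.4MB, and 192 tokens about 5.5MB, which matches the ~29MB quoted and explains why dropping to 192 is the fallback when allocation fails.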

Symptom                        Fix
llm_init -1                    no/old model in flash: redo the esptool step at 0x1F0000
no 'model' partition           partitions.csv not applied: copy into the sketch folder, set Partition Scheme: Custom, re-upload
boot log but no banner         flip Tools → USB CDC On Boot, re-upload
PSRAM not found                Tools → PSRAM: OPI (or QSPI), re-upload
link timeout                   TX/RX swapped at one end, or GND wire missing
crc error, retry constantly    shorten wires or set LINK_BAUD 115200 on BOTH boards
esptool: No such file          cd into the folder containing the .bin
esptool: port busy             close Serial Monitor

core/            inference engine + link protocol (host-verified, portable C)
sketches/        Arduino firmware (copy core/* + partitions.csv in before building)
pc_tools/        model quantizer/exporter and the layer splitter
tests/           the proof: NumPy-reference, bit-exactness, and protocol fuzz tests
partitions.csv   16MB flash layout with the 14MB model partition

  • stories42M measured on hardware
  • PIE SIMD in the marked matmul slot (llm_matmul_rows), est. 2-3×
  • On-device touch UI (no PC)

MIT License. Engine architecture after llama2.c (Andrej Karpathy, MIT); models trained on TinyStories (Eldan & Li, Microsoft Research). Code developed in collaboration with Claude (Anthropic); hardware, integration, and debugging by me.
