{"slug": "llm-for-the-esp32-s3", "title": "LLM for the ESP32-S3", "summary": "Two ESP32-S3 microcontrollers running a Llama-architecture language model have achieved the first multi-chip pipelined LLM inference on ESP32-class hardware, splitting layers across two boards connected by three jumper wires. The system runs a 15-million-parameter model at approximately 1.4 tokens per second, overcoming the single-board memory limit of 16MB flash that caps models at roughly 15M parameters. The project targets a 42-million-parameter model by distributing weights across both chips' flash storage, with the worker board handling initial layers and the head board managing embedding, classification, and token sampling.", "body_md": "**One language model, two microcontrollers.** A Llama-architecture LLM\nrunning with its layers split across two ESP32-S3 boards — per token, the\nactivation vector crosses three wires (CRC-framed UART) between the chips.\n\n```\n> Once upon a time there was a brave little fox\ncat and fish are friends. They like to play and swim in the pond.\nOne day, the cat sees a bird on a tree...\n[1.4 tok/s across 2 boards]\n```\n\nAs far as published projects show, this is the **first multi-chip pipelined\nLLM inference on ESP32-class hardware.**\n\nA 16MB ESP32-S3 caps out at a ~15M-parameter model (INT4). The next\nTinyStories size — stories42M, ~24MB — fits **no single board**. 
Splitting\nlayers across two chips makes combined flash the limit, not the chip.\n\n- **Weights:** INT4 (group-32 scales), streamed from a memory-mapped flash partition: **0 bytes of RAM** used for weights\n- **Compute:** INT8 activations, integer-exact group dot products, matmul rows split across both LX7 cores of each chip\n- **Split:** worker board runs the first K layers (0 to K-1); head board holds the embedding, the remaining layers (K to L-1), the classifier, and the tokenizer, and samples each next token\n- **Link:** UART @460800, frame = `A5 5A | cmd | len | payload | CRC16`, ~1.2KB/token round trip (~3% of token time)\n\nEverything testable without hardware was tested before flashing (see /tests):\nforward pass matches a NumPy reference to **~3e-7** (INT4 and INT8); the\nsplit pipeline is **bit-exact** vs the monolithic model; the link protocol\nwas fuzzed — noise ignored, corrupted frames rejected by CRC.\n\n| | params | weights live in | speed | output |\n|---|---|---|---|---|\n| known ESP32 LLM ports | 260K | RAM | 19–33 tok/s | sentence-level babble |\n| this, single board | 15M | flash (mmap) | ~1.4 tok/s | multi-paragraph stories |\n| this, two boards | 15M now / 42M target | split flash | ~1.4 / est. 0.5 tok/s | better still |\n\nCoherence isn't gradual: TinyStories research shows plot consistency\n*emerges* in the millions-of-parameters range\n([Eldan & Li 2023](https://arxiv.org/abs/2305.07759)).\n\n**Hardware:** one ESP32-S3 with **16MB flash + 8MB PSRAM** for single-board\nmode; two of them + 3 jumper wires for the pipeline. Verified on: Waveshare\nESP32-S3-Touch-LCD-5 (head) and Guition JC3248W535C (worker). The two\nboards do not have to be the same model. 
Displays are unused in v1: all\ninteraction is over USB serial, and the screens stay dark by design.\n\n**PC (one-time setup, Windows commands shown):**\n\n```\nwinget install Python.Python.3.12\n:: close cmd, open a NEW one, then:\npython --version\npip install numpy esptool\n```\n\nArduino IDE with the esp32 board package (Boards Manager → \"esp32\" by Espressif; v3.x recommended, v2.x works because the sketches include a compat shim).\n\n\"Python was not found\" after installing? Windows Settings → \"Manage app execution aliases\" → turn OFF python.exe and python3.exe, reopen cmd.\n\nDownload into `pc_tools/`:\n\n- [https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) (~60MB)\n- [https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.bin](https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.bin) (right-click → Save link as → keep the name `tokenizer.bin`)\n\n```\ncd path\\to\\repo\\pc_tools\npython export_model.py stories15M.bin tokenizer.bin full.bin --bits 4\n```\n\nOutput: `full.bin` (~10MB), the INT4 flash image.\n\n1. Copy `core\\llm_core.c`, `core\\llm_core.h`, and `partitions.csv` into `sketches\\storyteller\\`.\n2. Open `esp32_storyteller.ino` in Arduino IDE. Tools settings: **ESP32S3 Dev Module · USB CDC On Boot: Enabled · Flash Size: 16MB · PSRAM: OPI · CPU 240MHz**. Upload.\n3. Flash the model (close Serial Monitor first; COMx is in Tools → Port):\n\n```\npython -m esptool --chip esp32s3 --port COMx write_flash 0x1F0000 full.bin\n```\n\nSuccess = \"Hash of data verified.\"\n4. 
Serial Monitor @ **115200**, press the board's reset button, wait for `Ready`, type a story opening, press Enter.\n\n**Split the model** (worker gets layers 0–2; head gets embedding + 3–5):\n\n```\npython split_image.py full.bin 3 worker.bin head.bin\n```\n\n**Firmware:** copy `core\\*` and `partitions.csv` into BOTH\n`sketches\\pipeline_head\\` and `sketches\\pipeline_worker\\`. Upload\n`pipeline_head.ino` to board A and `pipeline_worker.ino` to board B\n(same Tools settings as the single-board sketch).\n\n**Pins & baud:** edit each sketch's copy of `core/pipeline_link.h` if\nneeded. Each board's `LINK_TX_PIN`/`LINK_RX_PIN` must match *its own\nwiring*; the two boards' pin numbers do NOT have to match each other.\n`LINK_BAUD` MUST be identical on both. Defaults: 17/18 @460800. On the\nWaveshare 5\" (no free GPIO header) use the I2C terminal block:\nTX = GPIO8 (SDA), RX = GPIO9 (SCL).\n\n**Flash each half** (each board's own COM port, Serial Monitor closed):\n\n```\npython -m esptool --chip esp32s3 --port COM_head   write_flash 0x1F0000 head.bin\npython -m esptool --chip esp32s3 --port COM_worker write_flash 0x1F0000 worker.bin\n```\n\n**Wire — 3 jumpers, boards powered off, TX↔RX crossed:**\n\n```\nhead TX-pin ──→ worker RX-pin\nhead RX-pin ←── worker TX-pin\nGND ─────────── GND     (any GND pin on each board; do NOT connect 3V3/VCC)\n```\n\n**Run:** power both (head on the PC; worker on any USB power). Serial\nMonitor on the HEAD's port @115200, reset both boards, wait for\n`Ready: emb + 3 local layers of 6 total`, type a prompt.\n\n| Command | Effect |\n|---|---|\n| any text | generates a story continuing your text |\n| `/temp 0.7` | lower = focused, higher (1.0+) = wilder |\n| `/topp 0.9` | nucleus sampling cutoff |\n| `/len 250` | max tokens per prompt |\n| `/stats` | free RAM/PSRAM and current settings |\n\nStatus: export path implemented and size-budgeted; the 15M pipeline is hardware-verified. 
Measured 42M numbers welcome via issues.\n\n```\n:: download stories42M.bin (~170MB) from the same HuggingFace page, then:\npython export_model.py stories42M.bin tokenizer.bin full42.bin --bits 4 --gs 32 --seq 224\npython split_image.py full42.bin 7 worker42.bin head42.bin\npython -m esptool --chip esp32s3 --port COM_head   write_flash 0x1F0000 head42.bin\npython -m esptool --chip esp32s3 --port COM_worker write_flash 0x1F0000 worker42.bin\n```\n\n`--seq 224` is **required** (42M's native 1024-token context would need a\n~29MB KV cache, over the 8MB PSRAM; 224 fits). Split at **7**, not 3:\nthe ~9MB embedding lives on the head, so layers skew to the worker. Check\nthat the printed image sizes stay under 14.6MB. Boot banners should read\n\"emb + 1 local layers of 8\" (head) and \"7 local layers, dim 512\" (worker).\nExpect ~0.4–0.7 tok/s — and clearly better stories. Worker allocation\nfailure at boot → re-export with `--seq 192`.\n\n| Symptom | Fix |\n|---|---|\n| `llm_init -1` | no/old model in flash: redo the esptool step at 0x1F0000 |\n| `no 'model' partition` | partitions.csv not applied: copy into the sketch folder, set Partition Scheme: Custom, re-upload |\n| boot log but no banner | flip Tools → USB CDC On Boot, re-upload |\n| `PSRAM not found` | Tools → PSRAM: OPI (or QSPI), re-upload |\n| `link timeout` | TX/RX swapped at one end, or GND wire missing |\n| `crc error, retry` constantly | shorten wires or set LINK_BAUD 115200 on BOTH boards |\n| esptool: No such file | cd into the folder containing the .bin |\n| esptool: port busy | close Serial Monitor |\n\n```\ncore/            inference engine + link protocol (host-verified, portable C)\nsketches/        Arduino firmware (copy core/* + partitions.csv in before building)\npc_tools/        model quantizer/exporter and the layer splitter\ntests/           the proof: NumPy-reference, bit-exactness, and protocol fuzz tests\npartitions.csv   16MB flash layout with the 14MB model partition\n```\n\n- stories42M 
measured on hardware\n- PIE SIMD in the marked matmul slot (`llm_matmul_rows`), est. 2-3×\n- On-device touch UI (no PC)\n\nMIT License. Engine architecture after\n[llama2.c](https://github.com/karpathy/llama2.c) (Andrej Karpathy, MIT);\nmodels trained on [TinyStories](https://arxiv.org/abs/2305.07759)\n(Eldan & Li, Microsoft Research). Code developed in collaboration with\nClaude (Anthropic); hardware, integration, and debugging by me.", "url": "https://wpnews.pro/news/llm-for-the-esp32-s3", "canonical_source": "https://github.com/harmansingh4163-ai/ESP-32-s3-Story-maker-LLM", "published_at": "2026-06-12 05:56:02+00:00", "updated_at": "2026-06-12 06:48:55.099715+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-chips", "neural-networks"], "entities": ["ESP32-S3", "TinyStories", "Llama", "UART", "INT4", "INT8", "CRC", "NumPy"], "alternates": {"html": "https://wpnews.pro/news/llm-for-the-esp32-s3", "markdown": "https://wpnews.pro/news/llm-for-the-esp32-s3.md", "text": "https://wpnews.pro/news/llm-for-the-esp32-s3.txt", "jsonld": "https://wpnews.pro/news/llm-for-the-esp32-s3.jsonld"}}