Your Coral and Hailo TPU Can't Run an LLM. Here's Why (and What Hailo-10 Changed)

wpnews.pro

Here is a purchase that happens thousands of times a month: someone building a local-AI project buys a Google Coral, or a Raspberry Pi AI Kit with a Hailo chip, expecting to plug in a language model and chat with it offline. It never works. The model won't load on the accelerator, and every guide quietly falls back to running it on the CPU. This isn't a driver problem or a missing feature. These chips physically cannot run a large language model, and the reason is worth understanding, because it also explains what just changed with Hailo's newest part.

We have not tested every one of these ourselves. What follows is grounded in Google's and Hailo's own documentation, linked throughout.

## The seductive wrong assumption

The confusion is understandable. A Hailo-8 advertises 26 TOPS; a Coral does 4. Those are big compute numbers, and "AI accelerator" sounds like it should accelerate any AI. But TOPS measures one thing: how fast the chip can do the multiply-add math inside a convolution, the operation at the heart of image models. A language model is a different animal, and it stresses a part of the hardware these chips deliberately left out.

## Why a language model needs what a vision chip doesn't have

Generating text is a memory problem, not a math problem. To produce a single token, the hardware reads every weight in the model out of memory once, then reads a key-value cache that grows with each token. The arithmetic per byte moved is tiny, so the speed limit is memory bandwidth and capacity, not compute. This is the roofline model (Williams, Waterman, Patterson, 2009), and it is why the whole field, from Microsoft's Splitwise to Apple's MLX team, describes decode as "bounded by memory bandwidth, rather than by compute ability."
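The memory-bound claim above can be turned into a back-of-the-envelope ceiling: if every token requires reading all the weights plus the KV cache, decode speed cannot exceed bandwidth divided by bytes read per token. The sketch below is illustrative only; the function name, the 8 GB/s bandwidth, and the KV cache sizes are assumptions chosen for round numbers, not measurements of any chip.

```python
def decode_tokens_per_second_ceiling(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode speed when each token reads all weights plus the KV cache."""
    bytes_per_token = weight_bytes + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_per_token


GB = 1e9

# Assumed figure for illustration: 8 GB/s of usable memory bandwidth.
bandwidth = 8 * GB
# Hypothetical model: 1 billion parameters at 1 byte each (8-bit weights).
weights = 1 * GB

# The KV cache grows with every generated token, so the ceiling falls as the chat gets longer.
for kv in (0.0, 0.25 * GB, 1.0 * GB):
    ceiling = decode_tokens_per_second_ceiling(weights, kv, bandwidth)
    print(f"KV cache {kv / GB:.2f} GB -> at most {ceiling:.1f} tokens/s")
```

Note that compute never appears in the formula. Under these assumptions, doubling TOPS changes nothing, while doubling bandwidth doubles the ceiling.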

Vision accelerators are built to be the opposite: enormous compute over a tiny, fixed pool of fast on-chip memory, because an image model is small and reused thousands of times per second. That design choice is exactly what blocks LLMs.

### Google Coral: an 8MB cache and a CNN-only op set

Coral's own docs are blunt about the limits. The Edge TPU runs "only TensorFlow Lite models that are fully 8-bit quantized and then compiled specifically for the Edge TPU" (Coral, models overview). It cannot load a GGUF, a safetensors file, or anything the modern local-LLM toolchain (llama.cpp, Ollama, MLX) produces. Its supported operations are CNN building blocks; there is no multi-head attention, no scaled-dot-product attention, no KV cache, and tensors must have static, compile-time shapes, which is fundamentally at odds with a transformer's growing sequence. And it caches weights in "roughly 8 MB of SRAM"; anything larger "must instead be fetched from the external memory at run time" (Coral, compiler docs). Even a 1-billion-parameter model at 8-bit is about 1GB, roughly 125 times that cache, so essentially every weight would stream over a slow bus, destroying the chip's only advantage. The single "LLM on Coral" demo you can find runs the language model on the board's little Arm CPU at about 2.5 tokens per second; the Edge TPU is doing camera vision alongside it.
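The cache arithmetic above is easy to reproduce. This is a minimal sketch that assumes decimal units (1 MB = 1,000,000 bytes) and treats the "roughly 8 MB" figure as exactly 8 MB; the helper name is made up for illustration.

```python
MB = 1_000_000  # decimal megabyte, an assumption; binary units shift the ratio slightly


def weight_footprint_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone (no activations, no KV cache)."""
    return num_params * bits_per_weight // 8


cache_bytes = 8 * MB  # "roughly 8 MB of SRAM", per the Coral compiler docs quoted above
weights = weight_footprint_bytes(1_000_000_000, 8)  # 1B parameters at 8-bit

ratio = weights / cache_bytes
resident_fraction = cache_bytes / weights

print(f"weights: {weights / 1e9:.1f} GB")
print(f"weights are {ratio:.0f}x the cache")
print(f"at most {resident_fraction:.1%} of the weights can stay cached")
```

With less than 1% of the weights resident at any moment, nearly every token has to pull the whole model across the external bus again.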

### Hailo-8: no external memory at all

The Hailo-8 in the Raspberry Pi AI Kit and AI HAT+ is even more clear-cut. Its defining design choice is that all model memory lives on the die; there is no external DRAM interface. That is brilliant for a vision model that fits in tens of megabytes, and it lets the chip hit hundreds of frames per second at low power. It is fatal for an LLM, whose multi-gigabyte weights simply have nowhere to live. The 26 TOPS of compute is real and useless here, because there is no way to feed it a language model. So the Pi AI Kit is at once a superb object-detection add-on and incapable of running a chatbot.

## The proof: Hailo's own next chip

If the wall were really about compute or TOPS, adding more of them would fix it. It doesn't, and the clearest evidence comes from Hailo itself. Its newer **Hailo-10H** is aimed at generative AI, and its single most important change is not more TOPS but an added external **LPDDR memory interface** (up to 8GB). Give the same company's accelerator a place to put weights, and suddenly it runs LLMs: Hailo reports a 1.5B model at about 9 tokens per second and Llama 3 8B at about 11, all at a few watts. Notice that this is the same ballpark as a Raspberry Pi 5's CPU, on a chip with vastly more compute, because the bottleneck was never compute. It was memory, exactly as the roofline predicts. Google is making the same move: its answer to on-device transformers is a separate, newer "Coral NPU" being co-designed for small language models, a quiet admission that the original Edge TPU was not built for this.
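To see why extra TOPS do not help, it is worth putting the two roofline ceilings side by side. The sketch below assumes about two operations (one multiply, one add) per weight per token and ignores the KV cache, so both ceilings are optimistic. The 26 TOPS input is the Hailo-8 figure quoted earlier, used only as a stand-in for "a lot of compute"; the 15 GB/s bandwidth is an assumed round number, not a Hailo specification, and the function name is hypothetical.

```python
def decode_ceilings(num_params, bytes_per_weight, ops_per_s, bandwidth_bytes_per_s):
    """Return (compute_ceiling, memory_ceiling) in tokens/s for single-stream decode.

    Assumes ~2 operations per weight per token and ignores the KV cache,
    so both values are optimistic upper bounds.
    """
    compute_ceiling = ops_per_s / (2 * num_params)
    memory_ceiling = bandwidth_bytes_per_s / (num_params * bytes_per_weight)
    return compute_ceiling, memory_ceiling


# Illustrative inputs only; see the assumptions stated above.
compute, memory = decode_ceilings(
    num_params=1_500_000_000,    # a 1.5B-parameter model
    bytes_per_weight=1,          # 8-bit weights
    ops_per_s=26e12,             # 26 TOPS
    bandwidth_bytes_per_s=15e9,  # assumed 15 GB/s
)

print(f"compute ceiling: {compute:,.0f} tokens/s")
print(f"memory ceiling:  {memory:,.0f} tokens/s")
print(f"attainable:      {min(compute, memory):,.0f} tokens/s")
```

Under these assumptions the compute ceiling sits in the thousands of tokens per second while the memory ceiling is 10, so the attainable rate is set entirely by memory. That is the roofline argument in two divisions.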

## What the new Hailo-10 does (and doesn't)

Give Hailo credit: the Hailo-10H is shipping, and it now appears in the Raspberry Pi AI HAT+ 2 at around $130, so this is a real option, not a press release. It runs small models at very low power: roughly 9 to 19 tokens per second on 1.5 to 3B models, ~11 on Llama 3 8B, sipping single-digit watts. That makes it a genuinely good fit for a low-power, always-on appliance: a local voice assistant, a small on-device helper, a speech-to-text box.

What it is not is a replacement for a real GPU or a Mac. It is capped around 8GB, holds one model at a time, and works with short contexts. If you want to interactively chat with a capable model, this is not the tier. It is the "tiny model, always on, barely any power" tier, and for that it is excellent.

## So what should you buy?

Match the chip to the job:

- **Vision at the edge** (object detection, pose, classification, keyword spotting): a Coral or a Hailo-8 / Pi AI Kit is exactly right, and cheap.
- **A tiny always-on language model at minimal power**: the new Hailo-10 (Raspberry Pi AI HAT+ 2).
- **Chatting with a 7 to 8B model on the edge**: a Jetson Orin Nano or a used GPU, not a vision TPU.

The one lesson to carry out of the store: for language models, the accelerator's TOPS tells you almost nothing. Look at whether it has gigabytes of memory it can actually reach. We sort the entire edge lineup by that test in our guide to which edge chips can run an LLM, and you can check any specific model against your hardware with our "Can I run it?" calculator.
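The "memory it can actually reach" test can be sketched as a rough fit check. This is a simplified illustration, not the logic of any real calculator: the function name and the 20% margin for KV cache and runtime overhead are assumptions, and only the 8 MB and 8 GB capacities come from the figures above.

```python
GB = 1e9


def fits_in_reachable_memory(num_params, bits_per_weight, reachable_bytes,
                             overhead_fraction=0.2):
    """Rough check: do the weights plus an assumed margin fit in reachable memory?"""
    weight_bytes = num_params * bits_per_weight / 8
    needed = weight_bytes * (1 + overhead_fraction)  # margin is an assumption
    return needed <= reachable_bytes


# Capacities taken from the article; the device labels are generic on purpose.
devices = {
    "8 MB on-chip cache": 0.008 * GB,
    "8 GB external LPDDR": 8 * GB,
}

# Hypothetical model/quantization pairs for illustration.
models = {
    "1.5B @ 4-bit": (1_500_000_000, 4),
    "8B @ 4-bit": (8_000_000_000, 4),
    "8B @ 16-bit": (8_000_000_000, 16),
}

for device, capacity in devices.items():
    for name, (params, bits) in models.items():
        ok = fits_in_reachable_memory(params, bits, capacity)
        print(f"{device}: {name} {'fits' if ok else 'does not fit'}")
```

TOPS is not an input here at all. Fitting is only the first gate: a model that fits still runs at whatever rate the memory bandwidth allows.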

## Sources and how we researched this

We have not tested all of this first-hand. The Coral limits are from Google's own Edge TPU documentation and compiler docs; the Hailo-8 memory architecture and the Hailo-10H generative-AI figures are from Hailo's own materials and its Raspberry Pi AI HAT+ 2 integration. The memory-bound-decode framing is the roofline model (Williams et al., 2009) plus Splitwise (Patel et al., 2023) and Apple's MLX write-up. Vendor token-rate figures are single-configuration and will vary by model, quant, and firmware.

Original article: vettedconsumer.com