What building an LLM inference engine from scratch taught me about compiler design

A developer built ignis, a from-scratch LLM inference engine in Rust with only two dependencies, to explore how compiler design principles apply to inference. The engine uses SSA IR, fusion passes, and liveness analysis to collapse 363 activation buffers down to 5, achieving a 76% reduction in activation memory. It runs Qwen2.5-0.5B at 52 tok/s on M3 with hand-written NEON kernels.

the insight that started this project hit me while i was finishing a bytecode-compiled language i'd written in C. i'd spent months building a hand-written lexer, a single-pass Pratt compiler, a stack VM with 35 opcodes, and a mark-and-sweep garbage collector. and right near the end i had this realization: an LLM inference engine is the same problem. it's a graph-compile plus memory-plan plus kernel-schedule problem. i'd just built one, so i decided to find out if that was actually true.

the project

the result is ignis, a from-scratch LLM inference engine in Rust. i used it specifically to see how far the compiler analogy held up. the dependency count ended up at 2: memmap2 to mmap the weight blob off disk and fancy-regex for one look-ahead in the BPE tokenizer. everything else is hand-written, because the whole point was to understand what's actually happening.

the compiler analogy holds up better than i expected

the interesting part of any inference engine isn't loading the weights or doing matrix math. it's what happens between "here's a compute graph" and "here's an efficient execution plan." that's a compiler problem.

ignis builds an SSA (static single assignment) IR of the entire Qwen2 forward pass. every operation in the transformer (the RMSNorm layers, the SwiGLU activations, the attention projections, all of it) becomes a node in the graph with explicit data dependencies.

then fusion passes run over the graph. the intuition is simple: if operation B always and only reads the output of operation A, you can merge them into one op and eliminate the intermediate buffer. in practice this fused 49 RMSNorm ops and 24 SwiGLU ops, bringing the total from 435 operations down to 362.

that part felt expected. the liveness analysis surprised me.

the liveness analysis

after fusion, the graph still needs activation buffers: scratch memory to hold intermediate results as the plan executes. the naive approach allocates one buffer per node.
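
as a quick aside on the fusion rule described earlier (merge B into A when B is the only reader of A's output), here is a minimal sketch of one such pass. it is not the ignis code: the `Op` enum, the `fuse_swiglu` function, and the idea of leaving a `Removed` placeholder behind are all assumptions made for illustration, and it only handles the `silu(gate) * up` pattern with the Silu on the left-hand side.

```rust
// hypothetical mini-IR, not ignis's real one.
// a value's id is its index in the slice, so each value is defined exactly once (SSA).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Input,
    MatMul { x: usize, w: usize },
    Silu { x: usize },
    Mul { a: usize, b: usize },
    SwiGlu { gate: usize, up: usize }, // computes silu(gate) * up in one op
    Removed,                           // slot left behind by an op that was fused away
}

// the value ids an op reads.
fn operands(op: Op) -> Vec<usize> {
    match op {
        Op::Input | Op::Removed => vec![],
        Op::Silu { x } => vec![x],
        Op::MatMul { x, w } => vec![x, w],
        Op::Mul { a, b } => vec![a, b],
        Op::SwiGlu { gate, up } => vec![gate, up],
    }
}

// rewrites Mul(Silu(g), up) into SwiGlu(g, up) when the Silu result has
// exactly one reader. returns how many pairs were fused.
fn fuse_swiglu(graph: &mut [Op]) -> usize {
    // count how many times each value is read anywhere in the graph.
    let mut uses = vec![0usize; graph.len()];
    for &op in graph.iter() {
        for v in operands(op) {
            uses[v] += 1;
        }
    }

    let mut fused = 0;
    for i in 0..graph.len() {
        if let Op::Mul { a, b } = graph[i] {
            if let Op::Silu { x: gate } = graph[a] {
                // only safe if nobody else needs the intermediate Silu output.
                if uses[a] == 1 {
                    graph[i] = Op::SwiGlu { gate, up: b };
                    graph[a] = Op::Removed; // its buffer is no longer needed
                    fused += 1;
                }
            }
        }
    }
    fused
}
```

the single-reader check is the whole trick: if a second op also read the Silu output, removing it would break that op, so the pass leaves the pair alone.
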
the smarter approach asks: which buffers are actually live at the same time? liveness analysis figures out exactly when each buffer's value is last used. once a buffer is dead, the memory it holds can be given to a new operation. this is textbook register allocation, and it works on activation buffers for the same reason it works on registers.

i expected maybe a 30 or 40% reduction. the actual result was 363 activation buffers collapsing to 5, a 76% reduction in activation memory, just from tracking liveness. that number genuinely surprised me, and the intuition for why only clicked after i'd already implemented it. most tensors in a forward pass are dead almost immediately after they're consumed. you read a layer norm output once, feed it into a matmul, and never need it again. the graph looks busy but the actual live set at any moment is tiny.

the kernel side

the other half of the compiler analogy is the code generation side, which in an inference engine means the compute kernels.

i wrote the NEON kernels by hand for the Q8_0 quantization format (int8 to int32 to f32 widening with FMA, then an f32 reducer). the exercise was less about squeezing out performance and more about understanding what quantized inference is actually doing at the hardware level. there's a lot of "just use Q4_K_M" advice out there and most of it treats quantization as a magic dial. implementing the dequant kernels by hand makes the tradeoffs concrete.

i also have a scalar fallback for non-ARM so the engine runs everywhere, but NEON is where the interesting work lives on M-series.

where it lands

ignis runs Qwen2.5-0.5B end to end, loading GGUF off disk, tokenizing, running the full forward pass with KV cache, and streaming UTF-8 output. getting about 52 tok/s at Q8_0 on M3.

i'm not matching llama.cpp and i want to be honest about that. llama.cpp has years of kernel work, a Metal backend, and a lot of optimizations i haven't implemented.
the goal was to understand the problem, not beat the best implementation in existence. i think i did that.

what the exercise taught me

the compiler analogy is real. if you've ever implemented a compiler, the mental model transfers almost directly: your tokens are tensor values, your IR nodes are ops, your register allocator is your memory planner, your code generator is your kernel dispatcher.

the thing that took me longest to internalize was that the memory savings from liveness analysis aren't free in a compiler either. you have to do the analysis work upfront, and for a long forward pass that's not trivial. the payoff is that your execution plan can reuse a tiny set of buffers for the entire run instead of allocating fresh memory for every intermediate value.

the other thing: two dependencies is actually achievable. i went in thinking i'd end up pulling in a tensor library or a BLAS somewhere. i didn't need to.

i'd genuinely love feedback on the graph compiler design specifically, whether the fusion pass ordering is right and whether there's a smarter liveness analysis than what i implemented. that part feels like it has more room to improve than the kernel side does. the code is on github if you want to dig into the implementation: https://github.com/arya51-ai/ignis
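
for anyone who wants the liveness idea in code before opening the repo, here is a minimal sketch of buffer reuse driven by last-use tracking. it is a generic greedy planner written under assumptions, not the ignis implementation: `Node`, `plan_buffers`, and the choice to treat every node as an activation in a straight-line, topologically ordered graph are all hypothetical.

```rust
// hypothetical node: which earlier values it reads, and how many bytes its output needs.
struct Node {
    inputs: Vec<usize>,
    bytes: usize,
}

// assumes `nodes` is in execution order and only reads earlier nodes.
// returns (buffer id chosen for each node, final size in bytes of each buffer).
fn plan_buffers(nodes: &[Node]) -> (Vec<usize>, Vec<usize>) {
    // last_use[v] = index of the last node that reads value v.
    // a value nobody reads keeps its own index, so it is never released.
    let mut last_use: Vec<usize> = (0..nodes.len()).collect();
    for (i, n) in nodes.iter().enumerate() {
        for &v in &n.inputs {
            last_use[v] = i;
        }
    }

    let mut assignment = vec![0usize; nodes.len()];
    let mut sizes: Vec<usize> = Vec::new(); // one entry per physical buffer
    let mut free: Vec<usize> = Vec::new();  // buffers whose value is dead

    for (i, n) in nodes.iter().enumerate() {
        // pick the output buffer first, so it never aliases one of this node's inputs.
        let buf = match free.pop() {
            Some(b) => {
                // a reused buffer must fit the largest tensor it ever holds.
                sizes[b] = sizes[b].max(n.bytes);
                b
            }
            None => {
                sizes.push(n.bytes);
                sizes.len() - 1
            }
        };
        assignment[i] = buf;

        // release every input whose last reader is this node.
        for &v in &n.inputs {
            let b = assignment[v];
            if last_use[v] == i && !free.contains(&b) {
                free.push(b);
            }
        }
    }
    (assignment, sizes)
}
```

on a pure chain where each node reads only the previous one, this settles on two buffers that ping-pong; a residual connection keeps one extra buffer alive until the add that consumes it. because each reused buffer is sized for the largest tensor it ever holds, the drop in buffer count and the drop in bytes do not have to match.
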