What Is MLIR and Why Does It Exist?

Chris Lattner created MLIR (Multi-Level Intermediate Representation) in 2018 at Google to solve the problem of fragmented compiler infrastructure across different hardware targets and programming models. MLIR, released publicly in 2019 under the LLVM umbrella, provides a common way to represent and transform code, reducing the need to build separate compilers for each new chip, language, or ML framework. The project lives inside the LLVM monorepo to leverage existing battle-tested building blocks.

If you've never written a compiler, the word "MLIR" probably looks like alphabet soup. This article is for you. By the end you'll understand, in plain language, what problem MLIR solves and why it had to exist at all. Let's start with the origin story, because where something comes from tells you almost everything about what it's for.

The story of MLIR starts in 2018 at Google. Chris Lattner, one of the most influential figures in compiler engineering, set out to solve a problem that had been bothering the industry for years: there was no common way to represent and transform code across different hardware targets and programming models. MLIR was his answer, and it went public in 2019 under the LLVM umbrella.

Imagine you work on TensorFlow, Google's machine learning library. Your job is to take a model someone wrote in Python and make it run fast: on a laptop CPU, on a phone, on a GPU, and on Google's custom TPU chips. To do that, the model has to be translated, step by step, into instructions each piece of hardware understands. That translation-and-optimization process is, fundamentally, a compiler.

The trouble was that there wasn't one compiler. There were many. One team built a tool to optimize graphs. Another built a separate tool to target TPUs. Another for mobile. Another for a specific hardware accelerator. Each tool had its own way of representing the program internally, its own bugs, its own optimization tricks that couldn't be shared with the others. The ecosystem was siloed: a pile of separate, half-overlapping compilers all reinventing the same wheels.

And this wasn't unique to Google. Across the industry, the same pattern kept repeating: a new chip, a new language, or a new ML framework would appear, and someone would sit down to build yet another compiler from scratch to support it. Everybody was paying the same enormous bill, over and over.

Chris Lattner moved to Google in 2017 to lead the TensorFlow infrastructure team, walked straight into that fragmentation mess, and built MLIR to fix it. MLIR stands for Multi-Level Intermediate Representation. Hold onto that name: every word in it is doing real work, and we'll unpack it as we go.

The official paper describes the goals directly: reduce software fragmentation, improve compilation for the wild variety of modern hardware, dramatically lower the cost of building domain-specific compilers, and help existing compilers connect to one another.

A small but telling detail: MLIR doesn't live in its own separate project. It was added inside the LLVM monorepo (`llvm-project`), in a folder literally called `mlir/`. Why? Because LLVM already had two decades of battle-tested, reusable building blocks (data structures, error handling, a testing framework), and Lattner knew that codebase better than anyone alive. Starting from zero would have meant rebuilding all of that. Sitting inside the monorepo, MLIR could borrow it on day one.

Before we get to the machine-learning payoff, we need a shared mental model of what a compiler actually does. Let's build that with the simplest possible program.

When you compile a program, your code goes on a journey through several stages:

```
Source code
  → Frontend (parsing)
  → AST (a tree of your program)
  → IR (intermediate representation)
  → Optimization (passes run in a loop)
  → Lowering (toward the machine)
  → Backend (per-CPU details)
  → Code generation (actual machine code)
```

Don't worry about memorizing it. The three ideas that matter are the frontend (turning text into structure), the middle (the IR and its optimization passes), and the backend (producing machine code). Let's trace a single expression, `x = 1 + 2`, through all three.

For instance, when you run a `.py` file, the very first thing CPython does is break raw text into tokens, the smallest meaningful chunks of the language.

```python
import io
import tokenize

source = "x = 1 + 2"
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
for tok in tokens:
    print(tok)
```

Output (abridged; the numeric type codes vary between Python versions):

```
TokenInfo(type=1 (NAME), string='x', ...)
TokenInfo(type=54 (OP), string='=', ...)
TokenInfo(type=2 (NUMBER), string='1', ...)
TokenInfo(type=54 (OP), string='+', ...)
TokenInfo(type=2 (NUMBER), string='2', ...)
```

So `x = 1 + 2` stops being an opaque string and becomes a flat list of typed pieces. The tokenizer doesn't care about meaning yet. It just answers: "what kind of thing is this character sequence?"

Next, the parser takes that flat list of tokens and builds an AST (Abstract Syntax Tree), a nested structure that captures the grammar of your program.

```python
import ast

tree = ast.parse("x = 1 + 2")
print(ast.dump(tree, indent=2))
```

Output (abridged):

```
Module(
  body=[
    Assign(
      targets=[Name(id='x')],
      value=BinOp(
        left=Constant(value=1),
        op=Add(),
        right=Constant(value=2)))])
```

The flat sequence `1 + 2` became a `BinOp` node with an `Add` operator and two children. The structure of the expression is now explicit in the shape of the tree, not buried in the order of characters. This tree is what gets handed off to the next stage. The compiler never looks at your source text again.

Next, `compile` takes the AST and produces bytecode, CPython's IR. The optimizer runs between the two, applying any transformations it can find. Here it applied constant folding: since both operands are literals, `1 + 2` can be solved at compile time. The runtime never sees the addition at all.

```python
import ast
import dis

source = "x = 1 + 2"
tree = ast.parse(source)                  # Stage 1: AST
code = compile(tree, "<string>", "exec")  # Stage 2: bytecode
dis.dis(code)
```

Output (opcode names and offsets vary between CPython versions):

```
  1           0 RESUME                   0
              2 LOAD_CONST               0 (3)    ← already computed
              4 STORE_NAME               0 (x)
              6 RETURN_CONST             1 (None)
```

`1` and `2` are gone. Only `3` remains.

The backend is the most complex part of any compiler and deserves its own article. For now, just one thing worth seeing: in a compiler that goes all the way to machine code, returning the value of `1 + 2` boils down to just two x86 instructions:

```asm
mov eax, 3   ; load the result already computed at compile time
ret          ; return it
```

That's it. The CPU never sees `1` or `2`, only `3`. CPython itself doesn't go this far.
It stops at bytecode and interprets it via a virtual machine in `ceval.c`. JIT compilers like PyPy or Numba go all the way to machine code like the snippet above.

The Python example showed the pipeline from the outside. Let's now watch the optimizer do something slightly more interesting: remove code that will never matter. Here's a small C++ program with a deliberate mistake:

```cpp
#include <iostream>
#include <string>

int main() {
    std::string dead = "I am never used";  // created, then never read
    std::cout << "Hello world\n";
    return 0;
}
```

That `dead` variable is dead code: we build it, then never read it. A human reviewer would say "just delete that line." We're going to watch the compiler figure that out on its own.

The AST captures the structure of your code with all the punctuation and formatting stripped away. For brevity, the `#include` machinery is omitted, since it expands into a lot of generated declarations. The meaningful structure of `main` looks like this:

```
FunctionDecl: main -> int
└── CompoundStmt
    ├── DeclStmt
    │   └── VarDecl: dead : std::string = "I am never used"
    ├── CallExpr: operator<<
    │   └── std::cout << "Hello world\n"
    └── ReturnStmt
        └── IntegerLiteral: 0
```

The tree is faithful to what you wrote, warts and all. The dead variable is still there. Cleanup comes later.

The compiler then converts the AST into an Intermediate Representation (IR). Real IR for a `std::string` program is genuinely noisy, so let's switch to a simpler version of the same idea:

```cpp
int compute() {
    int unused = 99;  // dead variable
    int a = 2;
    int b = 3;
    return a + b;
}
```

With optimizations off, the LLVM IR looks like this (simplified):

```llvm
define i32 @compute() {
entry:
  %unused = alloca i32
  %a = alloca i32
  %b = alloca i32
  store i32 99, ptr %unused   ; unused = 99
  store i32 2, ptr %a         ; a = 2
  store i32 3, ptr %b         ; b = 3
  %0 = load i32, ptr %a
  %1 = load i32, ptr %b
  %add = add i32 %0, %1       ; a + b
  ret i32 %add
}
```

Verbose, but readable: reserve some slots, store numbers, add two of them, return the result.
Every line of your source has a faithful echo, including the pointless `unused = 99`.

Now we turn optimizations on. The compiler runs a series of optimization passes: small, focused transformations applied in a loop until nothing more can be improved. Two run here:

- Constant folding: `2 + 3` is always `5`. No reason to compute it at runtime.
- Dead code elimination: `unused` is written but never read. No one depends on it, so it's deleted.

The result:

```llvm
define i32 @compute() {
entry:
  ret i32 5
}
```

The whole function became "return 5." The dead variable vanished and the arithmetic was solved at compile time. That is what the compiler's middle stage is for, and it's exactly the kind of work MLIR is built to make easy across many different kinds of programs.

Go to godbolt.org. Paste in C++ (or dozens of other languages), pick a compiler, and watch the output update in real time as you toggle between `-O0` (no optimization) and `-O2` (optimize hard). Watching dead code evaporate is the fastest way to build intuition for everything above. It's the single best companion to this article.

So if LLVM is such a great compiler infrastructure, why couldn't TensorFlow just use it directly?

Here's the catch. LLVM's IR was designed to describe programs at the level of CPU instructions: load this number, add these two registers, jump to that address. That's the right level for compiling C or Rust. But it's far too low for machine learning. A neural network doesn't think in "add two registers." It thinks in operations like "do a 2D convolution" or "apply softmax" or "multiply these two matrices." If you flatten all of that down to individual CPU instructions too early, you throw away the high-level meaning, and with it the chance to do the big optimizations that only make sense when you can still see "oh, these two matrix multiplications could be fused together."

This is the core insight behind the "Multi-Level" in MLIR.
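
As a toy illustration of why the high-level view matters, the sketch below runs one fusion rewrite over two versions of the same computation. Everything in it is an assumption made for illustration: the tuple-based "IR", the op names, and the `fuse_matmul_add` helper are hypothetical and are not MLIR or LLVM data structures. It also fuses a matmul with the add that follows it, which is a simpler pattern than fusing two matmuls.

```python
from typing import List, Tuple

# Hypothetical toy IR: (op name, result name, operand names).
Op = Tuple[str, str, Tuple[str, ...]]


def fuse_matmul_add(ops: List[Op]) -> List[Op]:
    """Rewrite a matmul followed directly by an add of its result into one op."""
    fused: List[Op] = []
    i = 0
    while i < len(ops):
        name, result, operands = ops[i]
        if name == "matmul" and i + 1 < len(ops):
            next_name, next_result, next_operands = ops[i + 1]
            # Only fuse if nothing after the add still needs the matmul result.
            later_uses = sum(result in op[2] for op in ops[i + 2:])
            if (next_name == "add"
                    and next_operands.count(result) == 1
                    and later_uses == 0):
                others = tuple(x for x in next_operands if x != result)
                fused.append(("matmul_add", next_result, operands + others))
                i += 2
                continue
        fused.append(ops[i])
        i += 1
    return fused


# High-level form: the pattern is visible, so the rewrite fires.
high_level = [
    ("matmul", "t0", ("input", "weights")),
    ("add", "t1", ("t0", "bias")),
]

# Already flattened to scalar-style ops: there is no "matmul" left to match.
low_level = [
    ("load", "r0", ("input",)),
    ("load", "r1", ("weights",)),
    ("mul", "r2", ("r0", "r1")),
    ("add", "r3", ("r2", "bias")),
]

print(fuse_matmul_add(high_level))  # [('matmul_add', 't1', ('input', 'weights', 'bias'))]
print(fuse_matmul_add(low_level) == low_level)  # True
```

The rewrite itself is trivial. What makes it possible is that the program is still written in terms of `matmul` when the pass runs.
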
Instead of one fixed IR, MLIR lets you have many IRs at different levels of abstraction, and lower your program gradually:

```
High level:  "matmul", "convolution", "softmax"    ← ML-shaped operations
     ↓
Mid level:   loops, array indexing, linear algebra
     ↓
Low level:   LLVM IR → actual CPU / GPU / TPU instructions
```

Each level is expressed in what MLIR calls a dialect: a self-contained vocabulary of operations suited to one kind of reasoning. You optimize at the level where it's natural, then lower to the next. The philosophy in one sentence: a big compiler should be broken into many small compilers between intermediate languages, each designed to make one kind of optimization easy to express.

LLVM couldn't be stretched to do this: it was designed for CPUs, sat at too low a level of abstraction, and carried years of incidental baggage. But it had all those reusable pieces worth keeping. MLIR is what you get when you keep the good parts and add the missing "multi-level" idea on top.

Let's make it concrete. Suppose we're training a network to recognize handwritten letters of the alphabet (26 classes, A–Z). In Keras the model is just a few lines:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(26, activation='softmax'),
])
```

Innocent-looking. But under the hood, running this model is a chain of math operations on large grids of numbers. To make it fast on real hardware, a compiler has to take it through exactly the kind of multi-level lowering we just described.

Quick detour, because the word is everywhere (it's literally in "TensorFlow"). A tensor is just a container of numbers with a shape: a single number like `7` is a scalar, a list like `[1, 2, 3]` is a vector, and so on up through grids with more dimensions. For our purposes: a tensor is a matrix of numbers, and in a neural network, those numbers are the weights the model learned during training.
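
To make "a container of numbers with a shape" concrete, here is a minimal sketch in plain Python. It assumes nested lists as a stand-in for a real tensor library, and the `shape` helper is a hypothetical name introduced only for this illustration.

```python
def shape(tensor):
    """Return the shape of a nested-list tensor as a tuple of dimension sizes."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:  # an empty list has no first element to descend into
            break
        tensor = tensor[0]
    return tuple(dims)


scalar = 7                                # a single number: no dimensions
vector = [1, 2, 3]                        # one dimension
image = [[0.0] * 28 for _ in range(28)]   # a 28 x 28 grid of pixel values

# What a Flatten layer does to one image: 28 x 28 becomes 784 numbers.
flat = [pixel for row in image for pixel in row]
batch = [flat]                            # a batch holding one flattened image

print(shape(scalar))  # ()
print(shape(vector))  # (3,)
print(shape(image))   # (28, 28)
print(shape(flat))    # (784,)
print(shape(batch))   # (1, 784)
```

The last shape, 1 × 784, is the one a Dense layer receives when it is given a single flattened 28 × 28 image.
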
When the model recognizes a letter, your input image (a tensor) gets multiplied by weight tensors, over and over, until it produces 26 scores, one per letter.

When that Keras model is fed into an MLIR-based compiler, the high-level operations get represented in a dialect with explicit tensor types. Below is a simplified but syntactically real sketch of the Dense layer, a matrix multiply followed by a bias add. (One simplification to be aware of: the real `tosa.matmul` expects rank-3, batched tensors; the batch dimension is dropped here for readability.)

```mlir
// Input: one flattened image (784 = 28×28 numbers)
func.func @dense(%input: tensor<1x784xf32>,
                 %weights: tensor<784x128xf32>,
                 %bias: tensor<1x128xf32>) -> tensor<1x128xf32> {
  %0 = "tosa.matmul"(%input, %weights)
      : (tensor<1x784xf32>, tensor<784x128xf32>) -> tensor<1x128xf32>
  %1 = "tosa.add"(%0, %bias)
      : (tensor<1x128xf32>, tensor<1x128xf32>) -> tensor<1x128xf32>
  return %1 : tensor<1x128xf32>
}
```

Look at the types: `tensor<1x784xf32>` means "a tensor shaped 1 × 784 of 32-bit floats." The compiler can see the shapes and the high-level operations (`matmul`, `add`), which means it can reason about them: fuse operations, reorder them, choose the optimal memory layout for a TPU, all before lowering everything down to LLVM IR and finally to machine code.

That's the whole point. The dead-code-elimination trick we watched earlier was a tiny optimization on a tiny program. MLIR is the framework that lets you apply that same style of optimization to machine-learning-shaped programs, at the right level of abstraction, for whatever hardware you're targeting, without building a brand-new compiler from scratch every single time.

We've covered the why, deliberately staying at altitude. In the next articles we'll get our hands dirty: setting up an MLIR project, reading and writing real dialects, running an actual lowering pass, and seeing the `mlir-opt` tool transform code live. If you want a head start, the MLIR tutorial series by Jeremy Kun (https://www.jeremykun.com/2023/08/10/mlir-getting-started/) and the official MLIR docs (https://mlir.llvm.org/) are excellent next stops.

The one idea worth keeping: MLIR exists because the world kept building the same compiler over and over. It's the reusable, multi-level foundation that makes that stop.