Choosing between GPUs and TPUs requires balancing CUDA's dynamic flexibility against the static compilation efficiency of systolic arrays.
The Architectural Divergence #
Modern deep learning is, at its core, an exercise in massive matrix multiplication. A single forward pass through a transformer layer—represented mathematically as $\mathbf{Y} = \mathbf{X}\mathbf{W}$—requires billions of multiply-accumulate (MAC) operations. For example, multiplying two $4096 \times 4096$ matrices involves over 68 billion MAC computations.
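The MAC count quoted above follows from the matrix shapes. A minimal sketch of the arithmetic, where the helper name `matmul_macs` is purely illustrative:

```python
def matmul_macs(m: int, k: int, n: int) -> int:
    """MACs needed to multiply an (m x k) matrix by a (k x n) matrix."""
    # Each of the m * n output elements is a dot product of length k,
    # and each term of that dot product is one multiply-accumulate.
    return m * k * n


macs = matmul_macs(4096, 4096, 4096)
print(f"{macs:,} MACs")  # 68,719,476,736 MACs
```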
Historically, developers turned to Graphics Processing Units (GPUs) to handle this load. GPUs, originally built to render 3D graphics, rely on thousands of parallel, programmable cores. While this architecture was a massive upgrade over general-purpose CPUs, it carries significant overhead for pure tensor math: instruction fetch and decode, thread scheduling, large register files and caches, and graphics-specific circuitry.
In contrast, Google’s Tensor Processing Unit (TPU) was designed from the ground up as an Application-Specific Integrated Circuit (ASIC) solely optimized for tensor algebra. The fundamental difference between a GPU and a TPU is not just clock speed or memory size; it is a structural divergence in how data moves through silicon. While GPUs rely on dynamic instruction scheduling and massive parallel threads, TPUs utilize a deterministic, static dataflow architecture known as a systolic array.
Under the Hood: Systolic Arrays vs. the Von Neumann Bottleneck #
To understand why TPUs excel at scale, one must understand the Von Neumann bottleneck. In traditional architectures, a processor must repeatedly fetch instructions and data from memory, execute the operation, and write the results back to memory. Because memory access is orders of magnitude slower and more energy-intensive than arithmetic execution, the bus between the processor and memory becomes the primary bottleneck.
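One rough way to see how much this bottleneck matters for matrix multiplication is arithmetic intensity: the number of floating-point operations performed per byte of memory traffic. The sketch below compares a best case (each matrix touched exactly once) with a worst case (both operands refetched for every MAC). The 2-byte element size and both traffic models are simplifying assumptions, not measurements of any real chip.

```python
def best_case_intensity(n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for an n x n matmul if X and W are read once and Y written once."""
    flops = 2 * n**3                              # one multiply + one add per MAC
    bytes_moved = 3 * n * n * bytes_per_element   # X, W, and Y each touched once
    return flops / bytes_moved


def worst_case_intensity(bytes_per_element: int = 2) -> float:
    """FLOPs per byte if both operands are refetched from memory for every MAC."""
    flops_per_mac = 2
    bytes_per_mac = 2 * bytes_per_element         # one activation + one weight
    return flops_per_mac / bytes_per_mac


print(f"best case:  {best_case_intensity(4096):.1f} FLOPs/byte")
print(f"worst case: {worst_case_intensity():.1f} FLOPs/byte")
```

The gap between the two figures is the operand reuse that an accelerator has to capture in hardware, whether through caches and registers or by passing data directly between compute units.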
A GPU mitigates this by running thousands of concurrent threads, hiding memory latency through massive parallelism. However, each core must still continuously access registers, shared memory, and global memory to fetch operands and store intermediate results.
The TPU’s systolic array solves this by keeping data in motion. Named after the rhythmic pulsing of the heart, a systolic array passes data through a 2D grid of simple processing elements (PEs) without returning to register files or main memory between operations.
```mermaid
flowchart TD
    subgraph Systolic_Array [Systolic Array Grid]
        PE11[PE] --> PE12[PE]
        PE21[PE] --> PE22[PE]
        PE11 --> PE21
        PE12 --> PE22
    end
    Inputs[Input Data Matrix X] --> PE11
    Inputs --> PE21
    Weights[Stationary Weights W] --> PE11
    Weights --> PE12
```
In a TPU Matrix Multiply Unit (MXU):
- Stationary Weights: The weights of the neural network layer are loaded into the grid of PEs from memory and remain stationary.
- Streaming Inputs: The input activations flow into the array from the left, one cycle at a time.
- Rhythmic Accumulation: Each PE performs a multiply-accumulate operation on the input value and the stationary weight, passes the input to its neighbor on the right, and passes the accumulated partial sum to the PE below it.
- Output Generation: The final matrix multiplication results emerge from the bottom of the grid.
By reusing inputs and partial sums across the physical grid, the TPU drastically reduces memory read/write cycles. For dense matrix operations, this efficiency can translate into higher throughput and better performance-per-watt than a GPU, though the size of the gap depends on the workload and the hardware generation.
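The dataflow described above can be illustrated with a small cycle-by-cycle simulation. This is a pure-Python sketch of a generic weight-stationary systolic array, not a model of any real MXU: the function name, the register layout, and the one-cycle skew per row are assumptions chosen to keep the example short.

```python
def systolic_matmul(X, W):
    """Simulate Y = X @ W on a weight-stationary systolic array.

    X is M x K (activations), W is K x N (weights). PE (k, n) holds W[k][n].
    """
    M, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation each PE forwards to the right
    p_reg = [[0] * N for _ in range(K)]  # partial sum each PE forwards downward
    Y = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):       # cycles until the last result drains out
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Left edge: array row k receives X[m][k], delayed by k cycles.
                    m = t - k
                    a_in = X[m][k] if 0 <= m < M else 0
                else:
                    a_in = a_reg[k][n - 1]           # from the left neighbor
                p_in = p_reg[k - 1][n] if k > 0 else 0  # from the PE above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * W[k][n]  # multiply-accumulate
        a_reg, p_reg = new_a, new_p
        # Bottom edge: column n emits the finished sum for input row m.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = p_reg[K - 1][n]
    return Y


X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 3]]
print(systolic_matmul(X, W))  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```

Inside the loop, every operand a PE consumes comes from a neighboring PE's register or from its own stationary weight. Only the left-edge inputs and the bottom-edge outputs touch anything outside the grid.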
The Software Reality: XLA Compilation and PyTorch Friction #
For developers, the hardware advantages of TPUs come with a distinct software trade-off. The GPU ecosystem is dominated by NVIDIA's highly mature CUDA platform, which supports dynamic, imperative execution. Developers can write native PyTorch code, insert arbitrary Python breakpoints, and execute operations eagerly with minimal overhead.
TPUs, however, do not execute instructions dynamically. They require a static computation graph compiled via XLA (Accelerated Linear Algebra), the compiler at the center of the OpenXLA project. XLA analyzes the entire machine learning graph, fuses multiple operations together to prevent memory roundtrips, and schedules execution deterministically across the systolic arrays.
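Operation fusion is easier to picture with a toy example. The sketch below is a conceptual illustration in plain Python, not actual XLA output: the unfused version materializes a full intermediate buffer after every step, while the fused version makes a single pass over the data. The function names and the scale/bias/ReLU pipeline are assumptions chosen for illustration.

```python
def unfused(x, scale, bias):
    # Three separate "kernels": each one writes a full intermediate buffer.
    t1 = [v * scale for v in x]
    t2 = [v + bias for v in t1]
    return [max(v, 0.0) for v in t2]


def fused(x, scale, bias):
    # One "kernel": each element is read once, transformed, and written once.
    return [max(v * scale + bias, 0.0) for v in x]


x = [-1.0, 0.5, 2.0]
print(unfused(x, 2.0, 0.5))  # [0.0, 1.5, 4.5]
print(fused(x, 2.0, 0.5))    # [0.0, 1.5, 4.5]
```

Both versions compute the same result. The difference is that the fused form never writes `t1` or `t2` back to memory, which is the kind of roundtrip a fusing compiler tries to eliminate.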
This compilation step introduces several practical challenges for developers:
1. The PyTorch/XLA Bridge Friction
While Google Cloud supports PyTorch via the `torch_xla` library, it is not native. The bridge must translate PyTorch's eager-mode operations into static XLA graphs. This translation layer introduces overhead and debugging complexity. If a model contains operations not supported by `torch_xla`, execution falls back to the CPU, causing severe performance degradation.
2. The Nightmare of Dynamic Shapes
Because XLA compiles static graphs, it optimizes the hardware layout for specific tensor dimensions. If your model processes inputs with dynamic shapes (e.g., variable-length sentences in NLP without padding), any change in shape forces XLA to recompile the graph. This leads to "compilation storms," where the TPU sits idle for seconds or minutes while the host CPU recompiles the model. Developers must aggressively pad inputs or bucket sequences to maintain static shapes.
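```python
# loss_fn and pad_to_max_length are assumed to be defined elsewhere.

def train_step_dynamic(model, inputs):
    # Sequence length varies per batch, so each new shape triggers an XLA recompile.
    outputs = model(inputs)
    loss = loss_fn(outputs)
    loss.backward()

def train_step_static(model, inputs):
    # Padding to a fixed length keeps the tensor shape constant across steps.
    padded_inputs = pad_to_max_length(inputs, max_len=512)
    outputs = model(padded_inputs)
    loss = loss_fn(outputs)
    loss.backward()
```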
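Padding everything to the maximum length wastes compute on short sequences, which is why bucketing is the usual middle ground. The following is a minimal sketch of length bucketing in plain Python. The bucket boundaries, the `pad_id` value, and the helper names are hypothetical, and the count of distinct shapes serves only as a proxy for the number of graphs a compiler would need to build.

```python
import bisect

BUCKETS = [64, 128, 256, 512]  # hypothetical bucket boundaries


def bucket_length(seq_len, buckets=BUCKETS):
    """Return the smallest bucket that fits seq_len."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Right-pad a list of token ids up to its bucket length."""
    target = bucket_length(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))


lengths = [17, 60, 64, 65, 130, 200, 511]
print(len(set(lengths)))                          # 7 distinct shapes without bucketing
print(len({bucket_length(n) for n in lengths}))   # 4 distinct shapes with bucketing
```

With buckets, the number of distinct input shapes is bounded by the number of buckets no matter how varied the raw sequence lengths are.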
3. Framework Alignment
While PyTorch is the industry standard for general GPU development, JAX is the first-class citizen of the TPU world. Because JAX was designed from the ground up around functional programming and compilation through XLA, it maps cleanly onto TPUs, and well-tuned JAX workloads can reach high hardware utilization.
The 2025 Landscape: Trillium, Ironwood, and Pod-Scale Clustering #
The accelerator landscape has evolved rapidly. While NVIDIA GPUs like the H100 and B200 remain the default choice for general-purpose AI development, Google has scaled its TPU architecture into a formidable competitor.
Google's TPU roadmap has transitioned through seven generations:
- TPU v1 to v3: Established the core systolic array architecture, moving from inference-only (v1) to training-capable liquid-cooled pods (v3).
- TPU v5p: Doubled FLOPS and tripled high-bandwidth memory (HBM) relative to TPU v4, scaling up to 8,960-chip pods connected via optical circuit switches.
- Trillium (TPU v6): Delivered a 4.7x improvement in peak compute per chip over the cost-optimized v5e.
- Ironwood (TPU v7): Launched in 2025 as Google's first TPU of the modern LLM era designed primarily for inference, optimized for real-time reasoning, low-latency agentic workflows, and high-volume serving, and scaling up to 9,216-chip pods.
The scale of these deployments is no longer niche. Major AI labs like Anthropic rely heavily on TPU infrastructure, with plans to scale Claude's training and inference clusters to as many as 1 million TPUs. Furthermore, Google's 2025 strategic shift to offer TPUs not just in Google Cloud but also inside customers' own on-premises data centers has lowered the barrier to entry.
| Attribute | GPU (e.g., NVIDIA H100) | TPU (e.g., Trillium / Ironwood) |
|---|---|---|
| Core Architecture | Thousands of programmable CUDA cores | Systolic arrays (Matrix Multiply Units) |
| Execution Model | Dynamic, instruction-driven (Eager) | Static, compiler-driven (XLA) |
| Memory Architecture | High-bandwidth memory (HBM) + L1/L2 caches | HBM + large on-chip buffers |
| Software Ecosystem | CUDA, cuDNN, Native PyTorch (Highly flexible) | OpenXLA, JAX, PyTorch/XLA (More rigid) |
| Best Use Case | Dynamic models, PyTorch-native code, graphics | Large-scale Transformers, JAX-native training |
Decision Framework: Choosing Your Accelerator #
Selecting between GPUs and TPUs is a multi-dimensional decision involving model architecture, team expertise, and budget constraints.
When to Choose GPUs
- Dynamic Architectures: If your model relies on dynamic control flow, custom non-standard operations, or variable tensor shapes that cannot be easily padded (e.g., certain reinforcement learning agents or dynamic graph neural networks).
- PyTorch-First Teams: If your codebase is heavily reliant on native PyTorch, third-party CUDA extensions, or rapid prototyping where compilation latency hinders developer velocity.
- Multi-Workload Infrastructure: If the hardware must be shared between AI training, 3D rendering, scientific simulation, or general-purpose HPC tasks.
When to Choose TPUs
- Scale-Out Transformer Training: If you are training massive LLMs or diffusion models using JAX or highly optimized TensorFlow, where you can leverage TPU Pods (up to 9,216 interconnected chips) as a single, unified virtual accelerator.
- High-Throughput, Static Inference: If you are serving models at scale (e.g., BERT-style encoders or static-shape decoders) where low latency and high batch throughput are critical. For instance, historical benchmarks show a BERT inference batch of 128 sequences takes 3.8 milliseconds on a V100 GPU compared to just 1.7 milliseconds on a TPU v3.
- Cost and Power Constraints: If your primary bottleneck is power consumption or operational cost. The TPU's specialized, stripped-down architecture delivers superior performance-per-watt, making it highly cost-effective for sustained, high-utilization workloads.
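To put the BERT latency figures quoted above into throughput terms, the short calculation below converts batch latency into sequences per second. It uses only the two numbers from the text, and the helper name `throughput` is illustrative.

```python
def throughput(batch_size: int, latency_ms: float) -> float:
    """Sequences processed per second for one batch at the given latency."""
    return batch_size / (latency_ms / 1000.0)


v100 = throughput(128, 3.8)
tpu_v3 = throughput(128, 1.7)
print(f"V100:    {v100:,.0f} seq/s")            # V100:    33,684 seq/s
print(f"TPU v3:  {tpu_v3:,.0f} seq/s")          # TPU v3:  75,294 seq/s
print(f"Speedup: {tpu_v3 / v100:.2f}x")         # Speedup: 2.24x
```

These are older hardware generations on both sides, so the ratio is a historical data point and not a prediction for current chips.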
Ji-ho Choi · Security & Cloud Editor
Ji-ho covers the increasingly tangled overlap between cloud architecture and security, drawing on a background as a penetration tester to keep his reporting grounded in real-world attack paths. He never lets a vendor claim go unquestioned and insists that every buzzword come with a proof of concept.