TPU vs GPU: The Architecture and Software Trade-offs

Google's TPU uses a systolic array architecture optimized for tensor algebra, offering higher throughput and energy efficiency than GPUs for dense matrix operations. It requires XLA compilation, however, and introduces friction with PyTorch, whereas GPUs benefit from NVIDIA's mature CUDA ecosystem and dynamic flexibility.

Choosing between GPUs and TPUs requires balancing CUDA's dynamic flexibility against the static compilation efficiency of systolic arrays.

Ji-ho Choi (https://www.devclubhouse.com/u/jiho choi)

The Architectural Divergence

Modern deep learning is, at its core, an exercise in massive matrix multiplication. A single forward pass through a transformer layer, represented mathematically as $\mathbf{Y} = \mathbf{X}\mathbf{W}$, requires billions of multiply-accumulate (MAC) operations. For example, multiplying two $4096 \times 4096$ matrices involves $4096^3 \approx 68.7$ billion MAC operations.

Historically, developers turned to Graphics Processing Units (GPUs) to handle this load. GPUs, originally built to render 3D graphics, rely on thousands of parallel, programmable cores. While this architecture was a massive upgrade over general-purpose CPUs, it carries significant overhead: complex instruction scheduling, control-flow handling for divergent threads, and graphics-specific circuitry.

In contrast, Google's Tensor Processing Unit (TPU) was designed from the ground up as an Application-Specific Integrated Circuit (ASIC) optimized solely for tensor algebra. The fundamental difference between a GPU and a TPU is not just clock speed or memory size; it is a structural divergence in how data moves through silicon. While GPUs rely on dynamic instruction scheduling and massively parallel threads, TPUs use a deterministic, static dataflow architecture known as a systolic array.

Under the Hood: Systolic Arrays vs. the Von Neumann Bottleneck

To understand why TPUs excel at scale, one must understand the von Neumann bottleneck. In traditional architectures, a processor must repeatedly fetch instructions and data from memory, execute the operation, and write the results back to memory.
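
To make that fetch-execute-write-back cycle concrete, here is a toy Python model that counts how many operand fetches a naive matrix multiply performs when nothing is reused between operations. It is only a sketch under simplifying assumptions: it ignores registers, caches, and tiling, and the function name `naive_matmul_with_counters` is hypothetical rather than taken from any library.

```python
def naive_matmul_with_counters(a, b):
    """Multiply square matrices a and b, counting MACs and operand fetches.

    Toy model (an assumption for illustration): every MAC re-fetches both
    operands, as if no value were ever reused from a register or cache.
    """
    n = len(a)
    out = [[0] * n for _ in range(n)]
    macs = 0
    operand_fetches = 0
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
                operand_fetches += 2      # a[i][k] and b[k][j]
            out[i][j] = acc               # one write-back per output element
    return out, macs, operand_fetches


product, macs, fetches = naive_matmul_with_counters(
    [[1, 2], [3, 4]], [[5, 6], [7, 8]]
)
print(product, macs, fetches)

# The same n**3 count gives the MAC figure for two 4096 x 4096 matrices.
print(4096 ** 3)  # 68719476736
```

In this model the fetch count grows as 2n^3 while the two input matrices hold only 2n^2 distinct values, which is the gap that operand reuse is meant to close.
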
Because memory access is orders of magnitude slower and more energy-intensive than arithmetic execution, the bus between the processor and memory becomes the primary bottleneck. A GPU mitigates this by running thousands of concurrent threads, hiding memory latency through massive parallelism. However, each core must still continuously access registers, shared memory, and global memory to fetch operands and store intermediate results.

The TPU's systolic array solves this by keeping data in motion. Named after the rhythmic pulsing of the heart, a systolic array passes data through a 2D grid of simple processing elements (PEs) without returning to register files or main memory between operations.

```mermaid
flowchart TD
    subgraph Systolic_Array["Systolic Array Grid"]
        PE11[PE] --> PE12[PE]
        PE21[PE] --> PE22[PE]
        PE11 --> PE21
        PE12 --> PE22
    end
    Inputs["Input Data (Matrix X)"] --> PE11
    Inputs --> PE21
    Weights["Stationary Weights (W)"] --> PE11
    Weights --> PE12
```

In a TPU Matrix Multiply Unit (MXU):

- Stationary Weights: The weights of the neural network layer are loaded into the grid of PEs from memory and remain stationary.
- Streaming Inputs: The input activations flow into the array from the left, one cycle at a time.
- Rhythmic Accumulation: Each PE performs a multiply-accumulate operation on the input value and the stationary weight, passes the input to its neighbor on the right, and passes the accumulated partial sum to the PE below it.
- Output Generation: The final matrix multiplication results emerge from the bottom of the grid.

By reusing inputs and intermediate sums across the physical grid, the TPU drastically reduces memory read/write cycles. This architectural efficiency translates directly to lower latency, higher throughput, and significantly better performance-per-watt than GPUs for dense matrix operations.

The Software Reality: XLA Compilation and PyTorch Friction

For developers, the hardware advantages of TPUs come with a distinct software trade-off.
The GPU ecosystem is dominated by NVIDIA's highly mature CUDA platform, which supports dynamic, imperative execution. Developers can write native PyTorch (https://pytorch.org) code, insert arbitrary Python breakpoints, and execute operations eagerly with minimal overhead.

TPUs, however, do not execute instructions dynamically. They require a static computation graph compiled via OpenXLA (https://openxla.org), where XLA stands for Accelerated Linear Algebra. XLA acts as a compiler that analyzes the entire machine learning graph, fuses multiple operations together to prevent memory round trips, and schedules execution deterministically across the systolic arrays.

This compilation step introduces several practical challenges for developers:

1. The PyTorch/XLA Bridge Friction

While Google Cloud supports PyTorch via the `torch_xla` library, it is not native. The bridge must translate PyTorch's eager-mode operations into static XLA graphs. This translation layer introduces overhead and debugging complexity. If a model contains operations not supported by `torch_xla`, execution falls back to the CPU, causing severe performance degradation.

2. The Nightmare of Dynamic Shapes

Because XLA compiles static graphs, it optimizes the hardware layout for specific tensor dimensions. If your model processes inputs with dynamic shapes (e.g., variable-length sentences in NLP without padding), each previously unseen shape forces XLA to compile a new graph. This leads to "compilation storms," where the TPU sits idle for seconds or minutes while the host CPU recompiles the model. Developers must aggressively pad inputs or bucket sequences to maintain static shapes.
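
The bucketing idea mentioned above can be sketched in plain Python as follows. The bucket lengths, the pad ID, and the helper names `pick_bucket` and `pad_to_bucket` are assumptions chosen for illustration, not part of `torch_xla` or any other library; the point is only that arbitrary sequence lengths collapse into a handful of shapes for the compiler to see.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries and pad token ID, chosen for illustration.
BUCKET_LENGTHS = (64, 128, 256, 512)
PAD_ID = 0


def pick_bucket(length, buckets=BUCKET_LENGTHS):
    """Return the smallest bucket length that can hold `length` tokens."""
    idx = bisect_left(buckets, length)
    if idx == len(buckets):
        raise ValueError(f"sequence of length {length} exceeds largest bucket")
    return buckets[idx]


def pad_to_bucket(token_ids, buckets=BUCKET_LENGTHS, pad_id=PAD_ID):
    """Right-pad a list of token IDs up to its bucket length."""
    target = pick_bucket(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))


# Every length from 1 to 512 maps onto one of only four padded shapes.
distinct_shapes = {len(pad_to_bucket([1] * n)) for n in range(1, 513)}
print(sorted(distinct_shapes))
```

Compared with padding everything to the maximum length, bucketing trades a few extra compiled graphs (one per bucket) for less wasted computation on padding tokens.
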
```python
import torch.nn.functional as F

def loss_fn(outputs):
    # Hypothetical placeholder loss so the example is self-contained.
    return outputs.float().mean()

def pad_to_max_length(inputs, max_len=512, pad_value=0):
    # Hypothetical helper: right-pad the last dimension (assumed <= max_len).
    return F.pad(inputs, (0, max_len - inputs.shape[-1]), value=pad_value)

# Bad practice on TPUs: dynamic shapes trigger constant recompilation
def train_step_dynamic(model, inputs):
    # inputs have variable sequence lengths
    outputs = model(inputs)
    loss = loss_fn(outputs)
    loss.backward()

# Recommended practice: pad to static shapes to leverage XLA caching
def train_step_static(model, inputs):
    padded_inputs = pad_to_max_length(inputs, max_len=512)
    outputs = model(padded_inputs)
    loss = loss_fn(outputs)
    loss.backward()
```

3. Framework Alignment

While PyTorch is the industry standard for general GPU development, JAX (https://github.com/google/jax) is the first-class citizen of the TPU world. Because JAX was designed from the ground up around functional programming and compilation through XLA, it maps naturally onto TPUs and can reach high hardware utilization.

The 2025 Landscape: Trillium, Ironwood, and Pod-Scale Clustering

The accelerator landscape has evolved rapidly. While NVIDIA GPUs like the H100 and B200 remain the default choice for general-purpose AI development, Google has scaled its TPU architecture into a formidable competitor.

Google's TPU roadmap has transitioned through seven generations:

- TPU v1 to v3: Established the core systolic array architecture, moving from inference-only (v1) to training-capable, liquid-cooled pods (v3).
- TPU v5p: Doubled FLOPS and tripled high-bandwidth memory (HBM) compared with TPU v4, scaling up to 8,960-chip pods connected via optical circuit switches.
- Trillium (TPU v6): Delivered a 4.7x improvement in peak compute per chip over the cost-optimized v5e.
- Ironwood (TPU v7): Launched in 2025 and positioned by Google as its first TPU built specifically for inference in the modern LLM era, optimized for real-time reasoning, low-latency agentic workflows, and high-volume serving, with pods of up to 9,216 chips.

The scale of these deployments is no longer niche. Major AI labs like Anthropic rely heavily on TPU infrastructure, with plans to scale Claude's training and inference clusters to as many as 1 million TPUs.
Furthermore, Google's 2025 strategic shift to offer TPUs not just in Google Cloud but also inside customers' own on-premises data centers has lowered the barrier to entry.

| Attribute | GPU (e.g., NVIDIA H100) | TPU (e.g., Trillium / Ironwood) |
|---|---|---|
| Core Architecture | Thousands of programmable CUDA cores | Systolic arrays (Matrix Multiply Units) |
| Execution Model | Dynamic, instruction-driven (eager) | Static, compiler-driven (XLA) |
| Memory Architecture | High-bandwidth memory (HBM) + L1/L2 caches | HBM + large on-chip buffers |
| Software Ecosystem | CUDA, cuDNN, native PyTorch (highly flexible) | OpenXLA, JAX, PyTorch/XLA (more rigid) |
| Best Use Case | Dynamic models, PyTorch-native code, graphics | Large-scale Transformers, JAX-native training |

Decision Framework: Choosing Your Accelerator

Selecting between GPUs and TPUs is a multi-dimensional decision involving model architecture, team expertise, and budget constraints.

When to Choose GPUs

- Dynamic Architectures: If your model relies on dynamic control flow, custom non-standard operations, or variable tensor shapes that cannot be easily padded (e.g., certain reinforcement learning agents or dynamic graph neural networks).
- PyTorch-First Teams: If your codebase is heavily reliant on native PyTorch, third-party CUDA extensions, or rapid prototyping where compilation latency hinders developer velocity.
- Multi-Workload Infrastructure: If the hardware must be shared between AI training, 3D rendering, scientific simulation, or general-purpose HPC tasks.

When to Choose TPUs

- Scale-Out Transformer Training: If you are training massive LLMs or diffusion models using JAX or highly optimized TensorFlow, where you can leverage TPU Pods (up to 9,216 interconnected chips) as a single, unified virtual accelerator.
- High-Throughput, Static Inference: If you are serving models at scale (e.g., BERT-style encoders or static-shape decoders) where low latency and high batch throughput are critical.
  For instance, historical benchmarks show a BERT inference batch of 128 sequences taking 3.8 milliseconds on a V100 GPU, compared with 1.7 milliseconds on a TPU v3.
- Cost and Power Constraints: If your primary bottleneck is power consumption or operational cost. The TPU's specialized, stripped-down architecture delivers superior performance-per-watt on dense matrix workloads, which can make it highly cost-effective for sustained, high-utilization workloads.