Modern GPU Programming Book

Source: mlc.ai · Published: 2026-06-26T11:22Z · Topic: machine-learning

A new book, 'Modern GPU Programming For MLSys', teaches GPU kernel optimization for machine learning systems, focusing on the Blackwell architecture and on kernels such as GEMM and FlashAttention. Developed from Carnegie Mellon University's MLSys course, it uses the TIRx Python DSL for hands-on learning.

Machine learning systems sit at the heart of modern AI workloads. In these systems, performance often comes down to the quality of a small number of GPU kernels. Attention kernels, LLM prefill and decode kernels, low-precision block-scaled GEMMs, fused MoE layers, and other large fused kernels all directly shape end-to-end speed in both training and serving.

To make these kernels fast, however, we need more than a list of optimization tricks. Modern GPUs are no longer simple variations of the same old design. Recent architectures introduce richer memory spaces, new access patterns, and increasingly specialized execution units. To program them well, we need both a clear mental model of the hardware and a practical understanding of how high-performance kernels are built. This book is about developing both.

The book follows a simple progression: first understand the GPU hardware, then learn the programming model we will use, and finally build state-of-the-art kernels step by step. Our main target is the Blackwell generation, and our main running examples are fast matrix multiplication (GEMM) and FlashAttention. Along the way, we will also study the core ingredients behind GPU optimization: data layout, asynchronous data movement, and asynchronous coordination.
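To give a rough sense of what "tiled" means for the GEMM running example, here is a minimal sketch in plain Python. It is not TIRx and not GPU code; the function names and the `tile` parameter are assumptions made for illustration. The point is only the loop structure: the output is computed one small block at a time, so each block of the inputs is reused many times while it is "close" to the compute, which is the idea that shared memory and registers exploit on a real GPU.

```python
# Plain-Python illustration of loop tiling for C = A @ B.
# Hypothetical names; this is a CPU sketch, not a TIRx or CUDA kernel.

def matmul_naive(a, b):
    """Reference triple loop: one dot product per output element."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_tiled(a, b, tile=2):
    """Same result, but the iteration space is split into tile x tile blocks."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # block row of C
        for j0 in range(0, n, tile):      # block column of C
            for p0 in range(0, k, tile):  # slice of the reduction dimension
                # Work on one small block; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))
```

On a GPU the three outer loops would typically map to thread blocks and pipeline stages rather than sequential iterations; the sketch only shows the blocking itself.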

The material grows out of the Machine Learning Systems course series at Carnegie Mellon University. To make the ideas easier to study and easier to run, this book uses the TIRx Python DSL to build real GPU kernel examples step by step. TIRx stays close to the hardware, which lets us reason about low-level control while still learning through runnable code.

How This Book Is Organized

Part I, Understanding the GPU. This part introduces the overall organization of the GPU, general recipes for writing fast kernels, and key concepts such as data layout, asynchronous memory operations, and coordination. It builds the hardware intuition that the rest of the book relies on.

Part II, TIRx Overview. This part introduces the key elements of TIRx, which serve as the foundation for the code examples throughout the book.

Part III, GEMM: Tiled to SOTA. A complete guide to optimizing a tiled GEMM, built up through TMA pipelining, persistent scheduling, warp specialization, and 2-CTA clusters.

Part IV, Flash Attention 4. A complete attention kernel built from the Part III techniques: two MMAs with softmax between them, online-softmax rescaling, causal masking, and GQA.

Reference. TIRx language reference and compiler internals.
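The attention ingredients named in Part IV can be sketched in a few lines of plain Python. This is a hedged illustration, not the book's kernel and not TIRx: the function names, the `block` parameter, and the 1/sqrt(d) score scaling are assumptions. It shows two matrix products with a softmax between them, processed one block of keys at a time, with the running maximum and running sum rescaled as each new block arrives (online softmax), plus a causal mask in which query i sees only keys 0..i.

```python
import math

# Illustrative CPU sketch with hypothetical names; not a GPU kernel.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def attention_reference(q, k, v, causal=False):
    """Ordinary softmax attention, computed over all visible keys at once."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        limit = i + 1 if causal else len(k)   # causal: keys 0..i only
        scores = [scale * dot(qi, k[j]) for j in range(limit)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / total
                    for d in range(len(v[0]))])
    return out


def attention_online(q, k, v, block=2, causal=False):
    """Same result, visiting keys block by block with online-softmax rescaling."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim = len(v[0])
    out = []
    for i, qi in enumerate(q):
        limit = i + 1 if causal else len(k)
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator relative to running_max
        acc = [0.0] * dim         # unnormalized output relative to running_max
        for j0 in range(0, limit, block):
            cols = range(j0, min(j0 + block, limit))
            # First product: scores for this block of keys.
            scores = [scale * dot(qi, k[j]) for j in cols]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum
            # (exp(-inf) is 0.0, so the first block starts from zero).
            rescale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * rescale + sum(weights)
            # Second product: weights times this block of values.
            acc = [acc[d] * rescale
                   + sum(w * v[j][d] for j, w in zip(cols, weights))
                   for d in range(dim)]
            running_max = new_max
        out.append([x / running_sum for x in acc])
    return out
```

Because the rescaling keeps the partial sums consistent with the current maximum, the blocked version never needs the full row of scores at once, which is what allows a fused kernel to avoid writing the score matrix to memory.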

Read original on mlc.ai → mlc.ai/modern-gpu-programming-for-mlsys/index.ht…
