{"slug": "elusive-order-of-async-gpu-kernels-scheduling-abstractions-dsl-implications", "title": "Elusive order of async GPU kernels: scheduling, abstractions, DSL implications", "summary": "Nvidia's Hopper and Blackwell GPU architectures introduced spatial scheduling through warp specialization, requiring developers to divide pipeline work between different warp groups for data movement and computation. This shift from simpler temporal pipelining has created a complex scheduling problem that researchers are addressing through new abstractions like AsyncGraphene's ARef channels, Nvidia's TAWA system, and TileIR, which aim to make hardware-specific schedules less explicit while maintaining role visibility. The challenge matters because as GPU hardware evolves, the burden of portability increasingly falls on developers who must encode generation-specific patterns into their kernel code.", "body_md": "SIMT offered a fantastic bargain. You write a straight-line program, the machine runs a lot of copies of it, and when one waits for memory the hardware swaps in others. You look with disdain on the less enlightened thread programmers dealing with deadlocks and concurrency etc. etc.\n\nChoosing what to run where and when is a scheduling problem, and there have been three effective approaches to that so far.\n\nYou can schedule statically: decide ahead of time what all the units should do each tick. You can schedule temporally: swap in different phases of workers via a pipeline. Or you can schedule spatially: divide the resources of the machine into different roles.\n\nThe underlying mechanics of which one you pick tend to be determined by the hardware. A chip like a TPU spends most of its silicon on math, and fairly little on orchestrating work. That means static scheduling, and a compiler that can build you that schedule.\n\nAmpere and before, and all the modern AMD chips, encourage temporal pipelining. 
The hardware will swap in warps (or waves) when one stalls, and by structuring your kernels into phases you can hide memory latency and keep the chips busy.\n\nHopper and beyond are where spatial scheduling started mattering, in the form of warp specialization. Nvidia GPUs let you assign different register footprints to different warp groups. When you introduce warp-group scoped MMA for compute and TMA for executing data moves from a single thread, you have the ingredients to divide the pipeline between groups. Instead of the same worker doing *load* -> *compute* -> *store* you have different workers exclusively working on different parts of the pipeline. Blackwell made this… much harder. TMEM and UMMA added new operator and memory types, so you now need to schedule movement between shared memory, tensor memory, registers, global memory, and a variety of compute units.\n\nThe problem is: how do you do that?\n\nTo stick with Nvidia for a moment, at the bottom of the stack are barriers. An [mbarrier](https://docs.nvidia.com/cuda/parallel-thread-execution/#release-acquire-patterns) is a phase switch for a specific number of arrivals: one side waits, the other increases the arrival count. When the counter matches the expected number, the phase flips. It’s elegant and straightforward, and easy to get wrong. A classic example is the phase parity bug: if you screw up the wraparound the kernel can work perfectly at first, but then deadlock waiting on the wrong phase.\n\nNext up, libraries like CUTLASS, and newer ones like ThunderKittens, package the patterns you tend to write. The CUTLASS Pipeline combines buffers and synchronization into a unit and makes it easy to compose common structures. This is where much of the expert-kernel-writing time goes, but that time encodes a lot of hardware-specific behavior. Hopper wants one set of patterns, Blackwell another, and even within a generation there can be differences between variants of the hardware. 
The more explicit the schedule is for the developer, the more they own the portability problem.\n\nThe subsequent step is to make the *schedule* less explicit, while still keeping the roles visible. [AsyncGraphene’s](https://dl.acm.org/doi/10.1145/3771775.3786277) ARef is a good example of this. An ARef is a reference to asynchronously produced data. Basically, a channel, with synchronization attached. A producer writes, a consumer reads, and both sides can know when the other is done. A compiler can then plan a schedule. Nvidia’s [TAWA](https://arxiv.org/abs/2510.14719) work does this explicitly for Triton, tagging producers and consumers and lowering to ARefs. [TLX](https://arxiv.org/abs/2605.10905) on the other hand, as well as systems like [PipeThreader](https://www.usenix.org/conference/osdi25), allow defining subtasks in a kernel that a compiler can schedule.\n\n[TileIR](https://docs.nvidia.com/cuda/tile-ir/latest/sections/introduction.html) and CuTile also enable building an explicit graph, but through focusing on the data itself. Attaching usage information on how data is read or written gives the compiler room to bundle work into tasks and reschedule.\n\nGetting the graph is the starting point, but then you need to identify what the right schedule actually is. In practice this involves exploring different shapes and combinations to work out which is best. You can either do that explicitly through heuristics and cost models of the hardware, or do it via searching across many different possible schedules to find the ones that work best. Most systems do both.\n\n## But what do we need in a kernel DSL?\n\nIf you are building a DSL for writing kernels, the starting point is to reflect whatever the hardware does. This is not only direct, but also a necessary option because there are always smart people operating at the frontier who have a strong intuition around how to drive the most performance. 
They’re often targeting very new hardware which is not yet well understood (sometimes, even by the people that made it).\n\nBeyond that, deciding what *else* should be on offer means answering three questions:\n\n### 1. How do you think about portability?\n\nPortability doesn’t mean “write one generic kernel and get peak performance everywhere”. But it can mean: what’s the minimal amount I can express to get correctness and a particular level of performance across hardware. Projects like [Helion](https://github.com/pytorch/helion) are explicitly operating at a high level to enable rapid research iteration. Regardless of your view on where the line for “high performance” is, you need something to define what “correct” means.\n\nHaving a good concept of a “task” seems to offer the flexibility to schedule statically, temporally or spatially, but there are a lot of edge-cases to consider.\n\n### 2. What do agents change?\n\nHumans are not going to be writing every, or even most, kernels. We have to figure out how much of that portability or performance is a *deterministic* search, versus how much is agentic loops exploring the space somewhat probabilistically. Agents make generating code massively cheaper. They can create candidates, run profiling on real hardware, test hypotheses and explore options.\n\nBut we also need a sense of where and how the agents fail, particularly when it differs from the patterns of humans. That includes things like verbosity: more lines (generally!) means more bugs. Performance can be both spiky and somewhat subjective; sometimes small changes can reshape the kernel’s performance, and a faster kernel might only be “correct” within specific numerical accuracy bounds.\n\n### 3. How do you think about kernel boundaries?\n\nA lot of discussion focused on GEMMs, which is understandable. But almost all real-world kernel work is across operator boundaries. 
FlashAttention wasn’t making the matmuls in attention fast, it was fusing them *despite* a reduction in the middle.\n\nWhen we are writing programs we are expressing intent and providing direction. We mix the “what” and the “how”. This reflects a search vs expressiveness divide; search-oriented approaches want you to focus more on the what, expressiveness leans more into the how. The more the units inside kernels can compose across kernel boundaries, the more we can optimize across models and discover patterns automatically.\n\nThe way I think about compilers is that they encode knowledge (in the form of rules and heuristics) about hardware. The more we can move that out of our heads, or the model’s parametric knowledge, the more we can focus our time or tokens on the parts we don’t yet understand.", "url": "https://wpnews.pro/news/elusive-order-of-async-gpu-kernels-scheduling-abstractions-dsl-implications", "canonical_source": "https://ianbarber.blog/2026/05/25/the-elusive-order-of-things/", "published_at": "2026-05-26 04:18:32+00:00", "updated_at": "2026-05-26 04:37:53.957708+00:00", "lang": "en", "topics": ["ai-chips", "ai-infrastructure", "ai-research"], "entities": ["Nvidia", "TPU", "AMD", "Hopper", "Ampere", "SIMT", "TMA", "MMA"], "alternates": {"html": "https://wpnews.pro/news/elusive-order-of-async-gpu-kernels-scheduling-abstractions-dsl-implications", "markdown": "https://wpnews.pro/news/elusive-order-of-async-gpu-kernels-scheduling-abstractions-dsl-implications.md", "text": "https://wpnews.pro/news/elusive-order-of-async-gpu-kernels-scheduling-abstractions-dsl-implications.txt", "jsonld": "https://wpnews.pro/news/elusive-order-of-async-gpu-kernels-scheduling-abstractions-dsl-implications.jsonld"}}