Finding deadlocks in CuTe kernels with SPIN Researchers at the FlashInfer MLSYS Challenge developed a formal verification method using the SPIN model checker to detect deadlocks in CuTe DSL kernels running on NVIDIA B200 GPUs. The approach, demonstrated through a proof-of-concept tool called cute2promela, translates synchronization models into Promela DSL to either find counterexamples or prove the absence of deadlocks, addressing the difficulty of debugging barrier synchronization bugs that typically cause timeouts without error codes. Using SPIN model checker to statically find or prove the absence of deadlocks in CuTe DSL kernels on NVIDIA B200, and presenting a proof-of-concept github.com/cheshire/cute2promela lowering from CuTe to SPIN. Synchronization bugs in GPU kernels are hard to debug. When a barrier deadlocks, the hardware yields no stack trace, and no error code until the benchmark times out. Hence each iteration of the debug loop starts to potentially cost tens of minutes. As we’ve worked on FlashInfer MLSYS Challenge https://mlsys26.flashinfer.ai our solution https://github.com/cheshire/flashinfer-challenge took 1st place in the mixture-of-experts track , we had to iterate on a persistent fused mixture-of-experts kernel for DeepSeek-V3, written in CUTLASS https://github.com/NVIDIA/cutlass ’s CuTe DSL for an NVIDIA B200 and stitched together from FF1, SwiGLU, and FF2 stages across clusters of CTAs. The pipeline is coordinated via mbarriers in shared memory, peer-CTA mbarriers reached via mapa -translated DSMEM pointers, and GMEM atomic counters across clusters. A bug in any of those can potentially result in a deadlock, which is hard to debug, especially when GPUs are only available via Modal https://modal.com/ . As my background is in formal verification https://metaworld.me/these.pdf , I wanted to try to instead encode the synchronization model in Promela DSL https://spinroot.com/ and check them with the SPIN model checker. This would not only shorten the iteration cycle, but also deterministically either demonstrate a counterexample to the desired property e.g. a deadlock or prove that no such interleaving exists. Short Primer: Blackwell Synchronization Primitives Let’s start with an overview of Blackwell synchronization primitives we would have to model. An mbarrier is a 64-bit hardware object that lives in shared memory. Its bits hold a current arrival count, a pre-declared expected count, and a single-bit phase . You allocate it like any SMEM variable, then have one thread initialize it with the expected count; from that point on, the hardware treats it as a small state machine with two transitions: Arrive. Any thread can call mbarrier arrive on the barrier. The hardware atomically increments the arrival count. If the new count equals expected, the barrier completes : the count resets, the phase bit flips. Otherwise the call returns and the thread continues. Wait. A thread that calls mbarrier wait phase=P stalls until the barrier’s phase differs from P . Since the phase only changes on completion, this means “wait until the barrier completes one more time after I last looked”. The single-bit phase is what makes mbarriers reusable. The first time you arrive-and-wait, you wait on phase 0; after completion, the barrier is at phase 1 and ready for the next round; you wait on phase 1; and so on. A persistent kernel that does N iterations of arrive-then-wait has to flip its expected-phase tracking each iteration. Get the flipping wrong and an iteration deadlocks. mbarrier in CuTe DSL In CuTe DSL, every mbarrier op is one of a small handful of calls: Initialization. Barriers are allocated in static SMEM and initialized with their expected arrival count. One elected thread per cluster does this, and a fence makes the initialization visible before anyone arrives. Local arrive. A thread drops one count on a barrier in its own CTA’s SMEM. TMA arrive-and-expect-tx. A variant used by TMA producers: the barrier accumulates bytes rather than arrivals, and completes when the expected byte count has landed. Typically followed by an async copy that names the barrier and a wait on the same barrier. Phase wait. A consumer blocks until the barrier’s phase differs from the value it last saw. Cross-CTA arrive. The B200 cluster lets two CTAs share each other’s shared memory through a DSMEM window. To arrive on a peer CTA’s mbarrier, the local SMEM pointer is translated through mapa.shared::cluster to the peer’s view of the same allocation, and the arrive lands on that translated address. From the peer’s perspective, that arrive lands in their count the same way a local arrive would. The hardware handles the address translation and the count update; no software protocol is involved. Cross-cluster: GMEM atomic counter. Clusters can’t share SMEM, only GMEM. When a producer in one cluster needs to hand off to a consumer in another, the DeepGEMM idiom is an int32 GMEM counter: the producer does atomic add counter, 1, sem="release", scope="gpu" when it finishes, and the consumer spins on atomic load counter, sem="acquire", scope="gpu" until the count reaches expected. sem="acquire" / "release" is the C11 ordering pair: stores before the release are visible after the matching acquire. scope="gpu" is the part that’s easy to get wrong — on Blackwell, ordering and scope are orthogonal, and the scope decides how far the visibility actually propagates. cta , cluster , gpu , and sys form a hierarchy; pick one too small and the consumer reads stale GMEM even though the producer’s atomic completed. Case 2 below works through the bug class this opens up. Fences. A handful of variants order things that don’t order themselves. The one we would use is fence view async shared : it makes synchronous SMEM stores visible to subsequent async consumers in our case, the mbarrier wait that gates a downstream SMEM read . Other call sites use fence proxy "async.shared", space="cta" , the inverse direction async producers, sync consumers . Fences are cheap at runtime; forgetting them produces the same intermittently-stale reads that scope errors do, and the bug is invisible to SPIN unless the visibility constraint is encoded in the model. A motivating kernel For a running example we’ll use a two-CTA Jacobi smoother https://en.wikipedia.org/wiki/Jacobi method on a 256-cell array of float32 . Each CTA owns 128 cells in its own SMEM and repeatedly applies the 3-point average new i = old i-1 + old i + old i+1 / 3 across the whole array. Interior cells only touch the owning CTA’s SMEM. The cells at the boundary — cell 127 of CTA 0 and cell 0 of CTA 1 — need one value from the peer CTA, which is exchanged through cluster DSMEM once per iteration. After N rounds the array relaxes toward the linear interpolation between its two endpoints; concretely, the kernel does something visible. What we care about for the rest of the post is its synchronization skeleton, which is the same as the cross-CTA exchange in our MoE kernel https://github.com/cheshire/flashinfer-challenge . Each iteration brings the barrier’s arrival count to 256 every one of the 128 threads in each CTA arrives once locally and once on its peer , at which point the hardware resets the count and flips the phase bit. A thread that called mbarrier wait with the previous phase value unblocks, XORs its own tracked phase for the next round, and the next wait fires when the barrier’s phase flips again. That single-bit phase, threaded through a per-thread XOR each iteration, is the only thing keeping consecutive rounds distinguishable. Whether this is correct is not obvious from reading. The phase tracking is one register; the XOR is one line; the expected count was set once at init. If we replace the phase argument on the mbarrier wait line with a literal cutlass.Int32 0 the bug we had in our MoE kernel the code would work at small iteration counts because the phase only matters from iteration 2 onward, but would deadlock in larger ones: bash $ modal run repro.py --mode correct --iters 30 === mode=correct iters=30 === GPU: NVIDIA B200 Launching grid= 2, num ctas=2 threads/CTA=128 EXPECTED=256 COMPLETED in 0.507s. done ctr=2 expect 2 PASS mode=correct iters=30 ROUND-TRIP: 5.3s wall incl. Modal launch overhead $ modal run repro.py --mode buggy --iters 1 === mode=buggy iters=1 === COMPLETED in 0.538s. done ctr=2 expect 2 PASS mode=buggy iters=1 $ modal run repro.py --mode buggy --iters 5 === mode=buggy iters=5 === Launching grid= 2, num ctas=2 threads/CTA=128 EXPECTED=256 GPU never returns ... Task's current input hit its timeout of 300s modal.exception.FunctionTimeoutError The model-checking side: SPIN and Promela SPIN https://spinroot.com/ is a model checker for concurrent systems, and Promela Process Meta-Language is the input DSL. You describe a finite set of communicating processes and the properties they should obey, and SPIN compiles a C verifier that exhaustively enumerates the reachable state space. If a property fails, it gives back a concrete interleaving as a counterexample. If the search completes, no interleaving within the model violates the property. Processes and the init block A proctype is a process template, and the run construct spawns an instance. Everything starts from the init block, which spawns the rest: This launches eight Thread processes. They share global state and interleave at the level of individual Promela statements. SPIN would explore every legal interleaving. atomic { ... } An atomic block executes without interleaving. A single mbarrier arrive is one PTX instruction, so wrapping its count/phase update in atomic is exact: inline An inline is a textual macro — like a C define , but multi-statement and arity-checked. We use it for primitives like do arrive and do wait so the proctype body reads like the kernel. if / fi with guards Promela’s if is Dijkstra-style: a list of guarded statements, any one of which may fire if its guard is true. If multiple guards are open, SPIN picks any of them non-deterministically, and explores both. The :: else branch fires only if no other guard is possible. do / od Same as if but loops, with break to exit: Guards as waits Inside an atomic , a false guard blocks the proctype until it becomes true. Promela has no separate “wait” statement; you just write the condition: A thread that hits do wait sits there until mbar phase cta differs from expected phase . Some other thread’s arrive flips the phase, the guard opens, and the proctype proceeds. LTL: the property language SPIN takes safety and liveness properties as Linear Temporal Logic https://en.wikipedia.org/wiki/Linear temporal logic formulas. The LTL subset we use is: P “always P”. Safety: P holds in every reachable state. Use for invariants like “no data race”. < P “eventually P”. Liveness: P holds in some future state of every infinite execution. Use for “the protocol eventually completes”. < P “always eventually”, i.e. infinitely often. Use for fairness-style claims. < P “eventually permanently”. Use for “eventually the system stabilizes”. Our standard deadlock-freedom claim: Internally, SPIN compiles an LTL property into a never claim , a Büchi automaton https://en.wikipedia.org/wiki/B%C3%BCchi automaton an automaton over infinite words, accepting by visiting an accepting state infinitely often that accepts exactly the executions that violate the property. If SPIN finds an accepting run of the never claim, that run is the counterexample. When the violation is a safety invariant P , SPIN compiles it into the never-claim P and prints assertion violated P on a hit. There’s no actual assert in the source, the doubly-negated form is the literal text of the never-claim whose witness SPIN found. Running the verifier The compile-and-run sequence is always three commands: bash $ spin -a model.pml generate pan.c the C verifier $ cc -O2 -o pan pan.c build it $ ./pan -a run; -a enables acceptance-cycle / liveness search A handful of flags matter. -a is what you almost always want: safety, assertions, and LTL liveness including the acceptance cycles that show up as deadlocks . -l checks non-progress cycles instead and needs cc -DNP at build time. Lowering the kernel to Promela Once you know what to keep and what to drop, the translation from the running example to Promela is fairly mechanical. The per-thread loop in the kernel maps directly to a proctype Thread in Promela: CuTe DSL kernel per thread Promela proctype The substitutions, primitive by primitive: Hardware mbarrier 64-bit word in SMEM packing arrival count, expected count, and phase → two Promela ints, mbar count NCTAS and mbar phase NCTAS , plus a constant EXPECTED ARRIVALS . The hardware atomicity is preserved by wrapping every update in atomic { } .→ the mbarrier arrive ptr arrive cta inline shown earlier. Increment the count atomically; if it hits expected, reset to 0 and flip phase.→ mbarrier wait ptr, P atomic { mbar phase cta = P - skip; } . A guarded atomic that blocks the proctype until the phase differs. Peer arrive via → the same mapa shared cluster + mbarrier txn ARRIVE arrive primitive, on the peer’s array slot. The model captures the protocol, not the address translation; mapa is a deterministic function that picks which CTA’s mbarrier gets incremented, so we encode it directly as arrive peer . 128 threads per CTA → 4 proctypes per CTA. For this protocol the correctness doesn’t change with width; shrinking the count keeps the state space tractable. ~30 tile iterations → ITERS = 3 . Three is the smallest count that exercises phase parity iter 0 waits on phase 0, iter 1 on phase 1, iter 2 on phase 0 again . A model with fewer iterations would pass over the phase-flip-back-to-zero bug without seeing it. Persistent loop → do :: iter < ITERS - ...; iter++ :: else - break od . Note that the modelling drops most of the code, as only synchronization primitives and calculation affecting control flow influence liveness. bash $ spin -a kernel model.pml && cc -O2 -o pan pan.c && ./pan -a ltl all done: < iters completed 0 ==4 && iters completed 1 ==4 State-vector 248 byte, depth reached 387, errors: 0 830628 states, stored 1.66126e+06 visited 3735761 states, matched 5397017 transitions = visited+matched pan: elapsed time 2.07 seconds pan: rate 802539.13 states/second What SPIN can check Safety: P never becomes false. Use for invariants like “no data race” or “the consumer never reads before the producer published”. SPIN proves these by depth-first search; on success the full reachable state space has been visited. On failure SPIN gives one offending state and the path to it. P . Liveness: P eventually holds on every infinite execution. Use for “the protocol completes”. Failure looks different from safety: SPIN finds an < P . acceptance cycle , an infinite loop in the state graph that never satisfies P, which would manifest as the deadlock. Fairness: P happens infinitely often, assuming the scheduler is weakly fair to every proctype. Worth reaching for when a counterexample involves a thread that’s simply never scheduled, fairness lets you distinguish a real protocol break from one that only fails under an adversarial scheduler. < P , with -f . How fast does this blow up? State-space size is exponential in the protocol dimensions you choose. We swept the correct Promela model of the running example across iteration counts TILES PER CTA , i.e. N ITERS and thread counts THREADS PER CTA ; the numbers below come from one laptop run, with the bottom row added from the standalone three-iteration baseline: TILES THREADS/CTA states stored transitions wall 1 2 1,137 5,476 <0.01s 1 3 22,082 123,937 0.04s 1 4 416,831 2,564,578 0.95s 1 5 7,701,536 50,393,599 23s 2 2 4,424 21,907 0.01s 2 3 196,051 1,117,014 0.38s 2 4 9,500,414 58,705,617 26s 3 4 18,583,997 114,846,660 52s baseline Adding a thread per CTA multiplies states by roughly 20×. The 3, 4 row is the full verification baseline used in the case studies and is included for context; the rows above it are a clean sweep at smaller sizes. The protocol’s shape phase parity, count-to-expected, peer arrive via mapa is exercised the same way at 2×4×3 as at 2×128×30 — the bug class doesn’t need 128 threads to manifest. Width-invariance holds for the phase-flip family; for racier bug classes the no-race variant from earlier shrinking the thread count can hide the violation. When in doubt, scale up until the verifier struggles, then back off. Case 1: the hardcoded-phase deadlock In the running example, the variant we shipped first had the wait line written like this: buggy — what we shipped correct — what we wanted At small iteration counts the kernel only does one round per CTA, so the phase never matters and the bug works by luck. At larger counts roughly 30 tile iterations per cluster in production the mbarrier’s phase has flipped twice by the second iteration, is back to 0, and the hardcoded wait 0 blocks forever. In the corresponding Promela models, the buggy and correct variants differ in exactly one argument: do wait cta, 0 buggy vs do wait cta, my phase with my phase ^= 1 after correct . The lowering section above shows the full proctype; what we care about here is what the verifier reports. What SPIN says On the buggy model: bash $ spin -a mbar protocol bug.pml && cc -O2 -o pan pan.c && ./pan -a ltl all done: < iters completed 0 ==4 && iters completed 1 ==4 pan:1: acceptance cycle at depth 305 pan: wrote mbar protocol bug.pml.trail Spin Version 6.5.2 -- 6 December 2019 + Partial Order Reduction Full statespace search for: never claim + all done assertion violations + if within scope of claim acceptance cycles + fairness disabled State-vector 244 byte, depth reached 306, errors: 1 99 states, stored 99 transitions = stored+matched pan: elapsed time 0 seconds Replaying the trail with spin -t -p mbar protocol bug.pml ends in the state that explains the deadlock: bash $ spin -t -p mbar protocol bug.pml tail 303: proc 1 Thread:1 line 26 mbar count peer = mbar count peer +1 304: proc 1 Thread:1 line 31 else 305: proc 1 Thread:1 line 31 1 <<<<