When does fragmentation occur in the CUDA caching allocator?

The CUDA caching allocator in PyTorch fragments memory when free blocks cannot be merged into a region large enough for a request, causing allocation failures despite sufficient total free memory. This fragmentation shows up during CUDA graph recording for LLM serving, where distinct graphs for each batch size must share the same memory pool but the allocator’s splitting and merging behavior depends on allocation order. Expandable segments fix certain fragmentation patterns, particularly those related to CUDA graph recording, by changing how segments are obtained from CUDA.

Disclosure. This post was drafted by Claude (Anthropic’s coding assistant) with editing from ezyang.

In an ideal world, users of CUDA memory in PyTorch programs should be able to abstract the allocator behavior as: there is a fixed amount of GPU memory; whenever you allocate, the available memory goes down, and when you free, the available memory goes back up. Unfortunately, the internal implementation of the CUDA caching allocator means that certain allocation patterns can give rise to fragmentation, where even though there is “technically” enough free space to store a requested allocation, the CUDA caching allocator is unable to actually serve the request.

There are many modern use cases where users want to use as much of their GPU’s memory as possible while still ensuring they do not run out of memory (OOM). Users are often penny-pinching allocations in this situation, and find it very surprising when PyTorch reserves more memory than they expect under the abstract model of the allocator. This is especially common in LLM serving, where every megabyte of GPU memory that isn’t nailed down by model weights or CUDA graph buffers can be used for KV cache. Modern disaggregated serving involves recording a distinct CUDA graph for each batch size. It’s important for these graphs to share the same memory pool.
But sharing a pool means the allocator’s internal bookkeeping needs to be correct before each recording. And the way the allocator manages memory (splitting and merging blocks) can go wrong in ways that depend on allocation order.

In this post, we’ll walk through some small laboratory examples where this fragmentation happens, and then demonstrate why expandable segments fix these examples. It’s important to have a mental model for what exactly we mean by “fragmentation”, because some fragmentation can be solved with expandable segments (especially the kind related to recording CUDA graphs), while other fragmentation cannot.

Segments, blocks, and splitting

The caching allocator organizes GPU memory in two levels. Segments are contiguous regions obtained from CUDA (`cudaMalloc` or virtual memory mapping). Blocks are sub-regions within a segment that track individual allocations.

When a request comes in, the allocator finds a free block that’s large enough. If the block is bigger than needed, it splits the block: the front portion serves the allocation, and the back portion becomes a new free block. When a block is freed, the allocator tries to merge it with its immediate neighbors, but only if the neighbor is also free. Two free blocks separated by an allocated block cannot merge.
```python
import gc, torch

MiB = 1024 * 1024

def alloc(n, mib, pool, dev):
    # Allocate n uint8 tensors of `mib` MiB each from the given pool.
    with torch.cuda.use_mem_pool(pool, dev):
        return [torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev) for _ in range(n)]

def free(ts):
    ts.clear()

def layout(pool):
    # Print every segment in the pool with the size and state of its blocks.
    for s in torch.cuda.memory_snapshot(pool.id):
        blocks = " | ".join(f"{b['size']//MiB}M {b['state']}" for b in s["blocks"])
        print(f"  seg {s['total_size']//MiB}M: [{blocks}]")

pool = torch.cuda.MemPool()
dev = torch.device("cuda:0")

t = alloc(1, 32, pool, dev)
layout(pool)  # one 32M block
free(t)

ts = alloc(2, 16, pool, dev)
layout(pool)  # 32M segment split into two 16M blocks
del ts[0]
layout(pool)  # first block inactive, second still active; can't merge
free(ts)
layout(pool)  # both free and adjacent; merged back to 32M
```

How segments are obtained depends on whether expandable segments are enabled. The behavior is quite different in each case.

Without expandable segments

Run scripts in this section with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False`.

Without expandable segments, each `cudaMalloc` call creates a separate segment. For allocations between 1 MiB and 10 MiB, the allocator requests a 20 MiB segment regardless of the actual size. For allocations >= 10 MiB, the segment size is the request rounded up to the nearest multiple of 2 MiB.

The key constraint: blocks in different segments can never merge. Each `cudaMalloc` is an independent allocation from CUDA’s perspective. A free 16 MiB block in one segment cannot combine with a free 16 MiB block in another segment to serve a 32 MiB request.

This is where allocation order matters. Let’s walk through two scenarios step by step.
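Before the scenarios, the segment sizing rule described above can be written out as a small function. This is only a sketch of the rule as stated in the text, not the allocator’s code: `segment_size` is a hypothetical name, and the branch for requests of 1 MiB or less (a 2 MiB segment) is not covered in this post and is an assumption based on the allocator’s small-pool constants.

```python
MiB = 1024 * 1024

def segment_size(nbytes: int) -> int:
    """Toy model of the segment size requested via cudaMalloc (no expandable segments)."""
    if nbytes <= 1 * MiB:
        # Assumption: small requests are served from 2 MiB segments (not stated in the text).
        return 2 * MiB
    if nbytes < 10 * MiB:
        # Between 1 MiB and 10 MiB: always a 20 MiB segment.
        return 20 * MiB
    # 10 MiB and above: round up to the next multiple of 2 MiB.
    round_to = 2 * MiB
    return round_to * ((nbytes + round_to - 1) // round_to)

for mib in (4, 16, 31, 32):
    print(f"{mib} MiB request -> {segment_size(mib * MiB) // MiB} MiB segment")
```

Under this model, the 16 MiB and 32 MiB requests used in the scenarios each get a segment of exactly the requested size, which is why the reserved totals below are simple sums.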
Small then large (bad order):

```python
import gc, torch

MiB = 1024 * 1024

def alloc(n, mib, pool, dev):
    with torch.cuda.use_mem_pool(pool, dev):
        return [torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev) for _ in range(n)]

def free(ts):
    ts.clear()

def reserved(pool):
    # Total bytes the pool holds across all of its segments.
    return sum(s["total_size"] for s in torch.cuda.memory_snapshot(pool.id))

def layout(pool):
    for s in torch.cuda.memory_snapshot(pool.id):
        blocks = " | ".join(f"{b['size']//MiB}M {b['state']}" for b in s["blocks"])
        print(f"  seg {s['total_size']//MiB}M: [{blocks}]")

dev = torch.device("cuda:0")
pool = torch.cuda.MemPool()

small = alloc(8, 16, pool, dev)
print("after 8x16M:", reserved(pool) // MiB, "MiB")
layout(pool)
free(small)
print("after free:", reserved(pool) // MiB, "MiB")
layout(pool)

large = alloc(4, 32, pool, dev)
print("after 4x32M:", reserved(pool) // MiB, "MiB")
layout(pool)
free(large)
```

Step by step:

1. Allocate 8x16 MiB. Each 16 MiB request triggers a separate `cudaMalloc`. Since 16 MiB >= 10 MiB, each segment is rounded up to the nearest multiple of 2 MiB, which is 16 MiB. Result: eight separate 16 MiB segments, each containing one allocated block. 128 MiB reserved.
2. Free all. Each segment now has one 16 MiB free block. But the segments are separate `cudaMalloc` allocations, so they can’t merge with each other. The pool still holds 128 MiB of reserved memory across eight independent segments.
3. Allocate 4x32 MiB. The allocator looks for a free block >= 32 MiB. Every existing free block is only 16 MiB, and blocks can’t span segments. None of the existing segments can serve the request. The allocator calls `cudaMalloc` four more times for 32 MiB each. Result: 256 MiB reserved (eight stale 16 MiB segments plus four new 32 MiB segments).
Large then small (good order):

```python
import gc, torch

MiB = 1024 * 1024

def alloc(n, mib, pool, dev):
    with torch.cuda.use_mem_pool(pool, dev):
        return [torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev) for _ in range(n)]

def free(ts):
    ts.clear()

def reserved(pool):
    # Total bytes the pool holds across all of its segments.
    return sum(s["total_size"] for s in torch.cuda.memory_snapshot(pool.id))

def layout(pool):
    for s in torch.cuda.memory_snapshot(pool.id):
        blocks = " | ".join(f"{b['size']//MiB}M {b['state']}" for b in s["blocks"])
        print(f"  seg {s['total_size']//MiB}M: [{blocks}]")

dev = torch.device("cuda:0")
pool = torch.cuda.MemPool()

large = alloc(4, 32, pool, dev)
print("after 4x32M:", reserved(pool) // MiB, "MiB")
layout(pool)
free(large)
print("after free:", reserved(pool) // MiB, "MiB")
layout(pool)

small = alloc(8, 16, pool, dev)
print("after 8x16M:", reserved(pool) // MiB, "MiB")
layout(pool)
free(small)
```

Step by step:

1. Allocate 4x32 MiB. Four `cudaMalloc` calls, four 32 MiB segments. 128 MiB reserved.
2. Free all. Each segment has one 32 MiB free block. 128 MiB still reserved.
3. Allocate 8x16 MiB. The first 16 MiB request finds a 32 MiB free block. The allocator splits it: 16 MiB allocated, 16 MiB free remainder. The second 16 MiB request takes that remainder. Two allocations served from one segment, no new `cudaMalloc`. This repeats for each of the four segments. Result: still 128 MiB reserved, each segment now split into two 16 MiB blocks.

Same total work, half the memory.

The classic workaround is to always record in decreasing batch-size order: large allocations establish the segments, and smaller ones split within them. It works, but it’s a leaky abstraction.

With expandable segments

Run scripts in this section with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`.

What cuMemMap gives us

Without expandable segments, the allocator calls `cudaMalloc` for each segment. Each `cudaMalloc` returns an independent allocation that can never be merged with another. This is the root cause of the fragmentation above.
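To make that root cause concrete, the two orderings from the previous section can be replayed in a pure-Python toy model that needs no GPU. This is a sketch under stated assumptions, not the allocator’s real implementation: `Block`, `ToyPool`, and `run` are hypothetical names, the model assumes a best-fit search over free blocks, and it assumes every request is at least 10 MiB and a multiple of 2 MiB so that a new segment is exactly the requested size.

```python
MiB = 1024 * 1024

class Block:
    def __init__(self, size, free):
        self.size = size
        self.free = free

class ToyPool:
    """Toy model: each segment is an independent list of blocks (one per cudaMalloc)."""

    def __init__(self):
        self.segments = []

    def reserved(self):
        return sum(b.size for seg in self.segments for b in seg)

    def alloc(self, size):
        # Best fit (assumption): smallest free block that is large enough.
        candidates = [
            (b.size, si, bi)
            for si, seg in enumerate(self.segments)
            for bi, b in enumerate(seg)
            if b.free and b.size >= size
        ]
        if not candidates:
            # No block fits: models a fresh cudaMalloc, i.e. a new segment.
            block = Block(size, False)
            self.segments.append([block])
            return block
        _, si, bi = min(candidates)
        seg = self.segments[si]
        block = seg[bi]
        if block.size > size:
            # Split: front serves the request, back stays free.
            seg.insert(bi + 1, Block(block.size - size, True))
            block.size = size
        block.free = False
        return block

    def free(self, block):
        block.free = True
        # Merge adjacent free blocks, but only within the same segment.
        for seg in self.segments:
            if any(b is block for b in seg):
                merged = []
                for b in seg:
                    if merged and merged[-1].free and b.free:
                        merged[-1].size += b.size
                    else:
                        merged.append(b)
                seg[:] = merged

def run(order):
    """Allocate and free each (count, MiB) batch in turn; return reserved MiB."""
    pool = ToyPool()
    for count, mib in order:
        blocks = [pool.alloc(mib * MiB) for _ in range(count)]
        for b in blocks:
            pool.free(b)
    return pool.reserved() // MiB

print("small then large:", run([(8, 16), (4, 32)]), "MiB")  # 256 MiB
print("large then small:", run([(4, 32), (8, 16)]), "MiB")  # 128 MiB
```

The only thing that differs between the two runs is the order of the batches. Because `free` never merges across entries of `segments`, the eight 16 MiB segments in the first run can never serve a 32 MiB request.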
CUDA also has a separate set of virtual memory management APIs, which separate three concerns:

- `cuMemAddressReserve`: reserves a contiguous range of virtual address space. This is cheap: no physical memory is committed. The allocator reserves enough to map essentially all GPU physical memory (1 1/8x of `totalGlobalMem`).
- `cuMemCreate`: allocates a chunk of physical memory (a “handle”). This is the expensive operation that actually consumes GPU memory.
- `cuMemMap` + `cuMemSetAccess`: maps a physical handle into the reserved virtual range, making it accessible.

The allocator creates one `ExpandableSegment` per (pool, stream) pair. Each expandable segment owns one huge virtual reservation but starts with zero physical memory mapped. As allocations arrive, physical pages are mapped into the segment on demand and the segment grows. Because everything is in one contiguous virtual address range, blocks within the segment can always merge with their neighbors; the cross-segment barrier from `cudaMalloc` doesn’t exist.

Physical page granularity

Physical memory is mapped in fixed-size pages: 2 MiB for small blocks, 20 MiB for large blocks, configurable via PYTORCH_CUDA_ALLOC_CONF=expandable_segments_page_size: