{"slug": "when-does-fragmentation-occur-in-the-cuda-caching-allocator", "title": "When does fragmentation occur in the CUDA caching allocator?", "summary": "The CUDA caching allocator in PyTorch fragments memory when allocated blocks prevent adjacent free blocks from merging, causing allocation failures despite sufficient total free memory. This fragmentation occurs during CUDA graph recording for LLM serving, where distinct graphs for each batch size must share the same memory pool but the allocator's splitting and merging behavior depends on allocation order. Expandable segments fix certain fragmentation patterns, particularly those related to CUDA graph recording, by changing how segments are obtained from CUDA.", "body_md": "# When does fragmentation occur in the CUDA caching allocator?\n\nDisclosure. This post was drafted by Claude (Anthropic’s coding assistant) with editing from ezyang.\n\nIn an ideal world, users of CUDA memory in PyTorch programs should be able to abstract the allocator behavior as: there is a fixed amount of GPU memory; whenever you allocate, the available memory goes down, and when you free, it goes back up.\n\nUnfortunately, the internal implementation of the CUDA caching allocator means that certain allocation patterns can give rise to fragmentation, where even though there is “technically” enough free space to store a requested allocation, the CUDA caching allocator is unable to actually serve the request.\n\nThere are many modern use cases where users wish to use as much of the memory their GPUs provide as possible, while ensuring they do not OOM. Users are often penny-pinching allocations in this situation, and find it very surprising when PyTorch reserves more memory than they expect under the abstract model of the allocator.\n\nThis is especially common in LLM serving, where every megabyte of GPU memory that isn’t nailed down by model weights or CUDA graph buffers can be used for KV cache. 
Modern disaggregated serving involves recording a distinct CUDA graph for each batch size. It’s important for these graphs to share the same memory pool. But sharing a pool means the allocator’s internal bookkeeping needs to be correct before each recording. And the way the allocator manages memory (splitting and merging blocks) can go wrong in ways that depend on allocation order.\n\nIn this post, we’ll walk through some small laboratory examples where this\nfragmentation happens, and then demonstrate *why* expandable segments fix\nthese examples. It’s important to have a mental model for what exactly we\nmean by “fragmentation”, because some kinds of fragmentation can be solved with\nexpandable segments (especially those related to recording CUDA graphs), while\nothers cannot.\n\n## Segments, blocks, and splitting\n\nThe caching allocator organizes GPU memory in two levels. **Segments**\nare contiguous regions obtained from CUDA (`cudaMalloc` or virtual memory\nmapping). **Blocks** are sub-regions within a segment that track\nindividual allocations.\n\nWhen a request comes in, the allocator finds a free block that’s large\nenough. If the block is bigger than needed, it **splits** the block: the\nfront portion serves the allocation, the back portion becomes a new free\nblock. When a block is freed, the allocator tries to **merge** it with\nits immediate neighbors, but only if the neighbor is also free. 
Two free\nblocks separated by an allocated block cannot merge.\n\n``` python\nimport gc, torch\n\nMiB = 1024 * 1024\n\ndef alloc(n, mib, pool, dev):\n    with torch.cuda.use_mem_pool(pool, dev):\n        return [\n            torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev)\n            for _ in range(n)\n        ]\n\ndef free(ts):\n    ts.clear()\n\ndef layout(pool):\n    for s in torch.cuda.memory_snapshot(pool.id):\n        blocks = \" | \".join(f\"{b['size']//MiB}M {b['state']}\" for b in s[\"blocks\"])\n        print(f\"  seg {s['total_size']//MiB}M: [{blocks}]\")\n\npool = torch.cuda.MemPool()\ndev = torch.device(\"cuda:0\")\n\nt = alloc(1, 32, pool, dev)\nlayout(pool)  # one 32M block\n\nfree(t)\n\nts = alloc(2, 16, pool, dev)\nlayout(pool)  # 32M segment split into two 16M blocks\n\ndel ts[0]\nlayout(pool)  # first block inactive, second still active; can't merge\n\nfree(ts)\nlayout(pool)  # both free and adjacent; merged back to 32M\n```\n\nHow segments are obtained depends on whether expandable segments are enabled. The behavior is quite different in each case.\n\n## Without expandable segments\n\nRun scripts in this section with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False`.\n\nWithout expandable segments, each `cudaMalloc` call creates a separate\nsegment. For allocations between 1 MiB and 10 MiB, the allocator\nrequests a 20 MiB segment regardless of the actual size. For allocations >= 10 MiB,\nthe segment is rounded up to the nearest 2 MiB.\n\nThe key constraint: **blocks in different segments can never merge**.\nEach `cudaMalloc` is an independent allocation from CUDA’s perspective.\nA free 16 MiB block in one segment cannot combine with a free 16 MiB\nblock in another segment to serve a 32 MiB request.\n\nThis is where allocation order matters. 
Let’s walk through two scenarios step by step.\n\n**Small then large (bad order):**\n\n``` python\nimport gc, torch\n\nMiB = 1024 * 1024\n\ndef alloc(n, mib, pool, dev):\n    with torch.cuda.use_mem_pool(pool, dev):\n        return [\n            torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev)\n            for _ in range(n)\n        ]\n\ndef free(ts):\n    ts.clear()\n\ndef reserved(pool):\n    return sum(s[\"total_size\"] for s in torch.cuda.memory_snapshot(pool.id))\n\ndef layout(pool):\n    for s in torch.cuda.memory_snapshot(pool.id):\n        blocks = \" | \".join(f\"{b['size']//MiB}M {b['state']}\" for b in s[\"blocks\"])\n        print(f\"  seg {s['total_size']//MiB}M: [{blocks}]\")\n\ndev = torch.device(\"cuda:0\")\npool = torch.cuda.MemPool()\n\nsmall = alloc(8, 16, pool, dev)\nprint(\"after 8x16M:\", reserved(pool) // MiB, \"MiB\")\nlayout(pool)\nfree(small)\nprint(\"after free:\", reserved(pool) // MiB, \"MiB\")\nlayout(pool)\nlarge = alloc(4, 32, pool, dev)\nprint(\"after 4x32M:\", reserved(pool) // MiB, \"MiB\")\nlayout(pool)\nfree(large)\n```\n\nStep by step:\n\n- **Allocate 8x16 MiB.** Each 16 MiB request triggers a separate `cudaMalloc`. Since 16 MiB >= 10 MiB, each segment is rounded up to the nearest 2 MiB multiple, which is exactly 16 MiB. Result: eight separate 16 MiB segments, each containing one allocated block. 128 MiB reserved.\n- **Free all.** Each segment now has one 16 MiB free block. But the segments are separate `cudaMalloc` allocations, so they can’t merge with each other. The pool still holds 128 MiB of reserved memory across eight independent segments.\n- **Allocate 4x32 MiB.** The allocator looks for a free block >= 32 MiB. Every existing free block is only 16 MiB, and blocks can’t span segments. None of the existing segments can serve the request. The allocator calls `cudaMalloc` four more times for 32 MiB each. 
Result: 256 MiB reserved (eight stale 16 MiB segments plus four new 32 MiB segments).\n\n**Large then small (good order):**\n\n``` python\nimport gc, torch\n\nMiB = 1024 * 1024\n\ndef alloc(n, mib, pool, dev):\n    with torch.cuda.use_mem_pool(pool, dev):\n        return [\n            torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev)\n            for _ in range(n)\n        ]\n\ndef free(ts):\n    ts.clear()\n\ndef reserved(pool):\n    return sum(s[\"total_size\"] for s in torch.cuda.memory_snapshot(pool.id))\n\ndef layout(pool):\n    for s in torch.cuda.memory_snapshot(pool.id):\n        blocks = \" | \".join(f\"{b['size']//MiB}M {b['state']}\" for b in s[\"blocks\"])\n        print(f\"  seg {s['total_size']//MiB}M: [{blocks}]\")\n\ndev = torch.device(\"cuda:0\")\npool = torch.cuda.MemPool()\n\nlarge = alloc(4, 32, pool, dev)\nprint(\"after 4x32M:\", reserved(pool) // MiB, \"MiB\")\nlayout(pool)\nfree(large)\nprint(\"after free:\", reserved(pool) // MiB, \"MiB\")\nlayout(pool)\nsmall = alloc(8, 16, pool, dev)\nprint(\"after 8x16M:\", reserved(pool) // MiB, \"MiB\")\nlayout(pool)\nfree(small)\n```\n\nStep by step:\n\n- **Allocate 4x32 MiB.** Four `cudaMalloc` calls, four 32 MiB segments. 128 MiB reserved.\n- **Free all.** Each segment has one 32 MiB free block. 128 MiB still reserved.\n- **Allocate 8x16 MiB.** The first 16 MiB request finds a 32 MiB free block. The allocator splits it: 16 MiB allocated, 16 MiB free remainder. The second 16 MiB request takes that remainder. Two allocations served from one segment, no new `cudaMalloc`. This repeats for each of the four segments. Result: still 128 MiB reserved, each segment now split into two 16 MiB blocks.\n\nSame allocations, half the reserved memory (128 MiB instead of 256 MiB). The classic workaround is to always record in decreasing batch-size order: large allocations establish the segments, smaller ones split within them. 
It works, but it’s a leaky abstraction.\n\n## With expandable segments\n\nRun scripts in this section with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`.\n\n### What cuMemMap gives us\n\nWithout expandable segments, the allocator calls `cudaMalloc` for each\nsegment. Each `cudaMalloc` returns an independent allocation that can\nnever be merged with another. This is the root cause of the\nfragmentation above.\n\nCUDA also has a separate set of virtual memory management APIs, which separate three concerns:\n\n- `cuMemAddressReserve`: reserves a contiguous range of *virtual* address space. This is cheap, since no physical memory is committed. The allocator reserves enough to map essentially all GPU physical memory (1 1/8x of `totalGlobalMem`).\n- `cuMemCreate`: allocates a chunk of *physical* memory (a “handle”). This is the expensive operation that actually consumes GPU memory.\n- `cuMemMap` + `cuMemSetAccess`: maps a physical handle into the reserved virtual range, making it accessible.\n\nThe allocator creates one `ExpandableSegment` per (pool, stream) pair.\nEach expandable segment owns one huge virtual reservation but starts\nwith zero physical memory mapped. As allocations arrive, physical pages\nare mapped into the segment on demand and the segment grows. Because\neverything is in one contiguous virtual address range, blocks within\nthe segment can always merge with their neighbors: the cross-segment\nbarrier from `cudaMalloc` doesn’t exist.\n\n### Physical page granularity\n\nPhysical memory is mapped in fixed-size pages: **2 MiB** for\n`small_blocks`, **20 MiB** for `large_blocks` (configurable via\n`PYTORCH_CUDA_ALLOC_CONF=expandable_segments_page_size:<bytes>`). These\nare the `segment_size` passed to `ExpandableSegment`. When the allocator\nneeds more memory, it calls `cuMemCreate` for one page and maps it at\nthe end of the segment.\n\nThis means there’s rounding overhead. 
If you request 16 MiB from the large pool (20 MiB pages), the allocator maps one 20 MiB page, serves 16 MiB, and the remaining 4 MiB becomes a free block. The next allocation can use that 4 MiB remainder, or if it’s larger, the allocator maps another 20 MiB page and merges the free space.\n\nLet’s trace through 8x16 MiB allocations and watch the page mapping at each step:\n\n``` python\nimport torch\n\nMiB = 1024 * 1024\n\ndef layout(pool):\n    for s in torch.cuda.memory_snapshot(pool.id):\n        blocks = \" | \".join(f\"{b['size']//MiB}M {b['state']}\" for b in s[\"blocks\"])\n        print(f\"  seg {s['total_size']//MiB}M: [{blocks}]\")\n\ndef reserved(pool):\n    return sum(s[\"total_size\"] for s in torch.cuda.memory_snapshot(pool.id))\n\npool = torch.cuda.MemPool()\ndev = torch.device(\"cuda:0\")\nts = []\nfor i in range(8):\n    with torch.cuda.use_mem_pool(pool, dev):\n        ts.append(torch.empty(16 * MiB, dtype=torch.uint8, device=dev))\n    print(f\"after alloc {i+1}:\", reserved(pool) // MiB, \"MiB mapped\")\n    layout(pool)\n```\n\nStep by step:\n\n- **First 16 MiB.** No physical memory yet. Map one 20 MiB page. Split: 16 MiB allocated, 4 MiB free. (20 MiB mapped)\n- **Second 16 MiB.** The 4 MiB free block is too small. Map another 20 MiB page; it’s adjacent in virtual space, so the allocator merges the 4 MiB free + 20 MiB newly mapped = 24 MiB free. Split off 16 MiB, leaving 8 MiB free. (40 MiB mapped)\n- **Third 16 MiB.** 8 MiB free isn’t enough. Map another 20 MiB page, merge to 28 MiB free, split off 16 MiB, leaving 12 MiB free. (60 MiB mapped)\n- **Fourth 16 MiB.** 12 MiB free isn’t enough. Map another 20 MiB page, merge to 32 MiB free, split off 16 MiB, leaving 16 MiB free. (80 MiB mapped)\n- **Fifth 16 MiB.** The remainder from the previous step is exactly 16 MiB, so it fits without mapping a new page. 
(Still 80 MiB mapped)\n- **Sixth through eighth.** The cycle repeats: the remainder is now 0 MiB, so each of these allocations maps a new 20 MiB page and leaves a remainder 4 MiB larger than the last (4, 8, then 12 MiB). (140 MiB mapped)\n\nAfter all eight allocations, the segment has mapped 7 large pages: 140 MiB of physical memory for 128 MiB of allocations. The 12 MiB of overhead is free space that can serve future allocations, not wasted memory.\n\n### Why allocation order doesn’t matter\n\nNow free all eight tensors. Because every block lives in the same\nsegment, adjacent free blocks merge. The result is one contiguous free\nblock covering all the mapped physical memory. This is the key\ndifference from `cudaMalloc` segments: there are no segment boundaries\npreventing merging.\n\nFrom this single merged free block, 4x32 MiB allocations can be carved out without mapping any new physical memory:\n\n``` python\nimport gc, torch\n\nMiB = 1024 * 1024\n\ndef alloc(n, mib, pool, dev):\n    with torch.cuda.use_mem_pool(pool, dev):\n        return [\n            torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev)\n            for _ in range(n)\n        ]\n\ndef free(ts):\n    ts.clear()\n\ndef reserved(pool):\n    return sum(s[\"total_size\"] for s in torch.cuda.memory_snapshot(pool.id))\n\ndef layout(pool):\n    for s in torch.cuda.memory_snapshot(pool.id):\n        blocks = \" | \".join(f\"{b['size']//MiB}M {b['state']}\" for b in s[\"blocks\"])\n        print(f\"  seg {s['total_size']//MiB}M: [{blocks}]\")\n\ndev = torch.device(\"cuda:0\")\npool = torch.cuda.MemPool()\n\n# Small-then-large: the same order that fragmented without expandable segments.\nsmall = alloc(8, 16, pool, dev)\nprint(\"after 8x16M:\", reserved(pool) // MiB, \"MiB reserved\")\nlayout(pool)  # one segment, blocks interleaved with free remainders\n\nfree(small)\nprint(\"after free:\"); layout(pool)  # one merged free block\n\nlarge = alloc(4, 32, pool, dev)\nprint(\"after 4x32M:\", reserved(pool) // MiB, \"MiB 
reserved\")\nlayout(pool)  # no new pages mapped\nfree(large)\n```\n\nAllocation order doesn’t matter because the intermediate state–how blocks are split–is irrelevant once everything is freed. All that matters is that enough physical memory is mapped in the segment, and it’s all contiguous in virtual space.\n\n### Expandable segments don’t eliminate all fragmentation\n\nThe above only holds when **everything is freed** before the next round\nof allocations. If some blocks are still alive, they pin the splits in\nplace: free blocks on either side of a live block can’t merge across it.\nThe segment has enough total free memory, but it’s chopped into\nnon-contiguous pieces.\n\n``` python\nimport gc, torch\n\nMiB = 1024 * 1024\n\ndef alloc(n, mib, pool, dev):\n    with torch.cuda.use_mem_pool(pool, dev):\n        return [\n            torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev)\n            for _ in range(n)\n        ]\n\ndef free(ts):\n    ts.clear()\n\ndef reserved(pool):\n    return sum(s[\"total_size\"] for s in torch.cuda.memory_snapshot(pool.id))\n\ndef layout(pool):\n    for s in torch.cuda.memory_snapshot(pool.id):\n        blocks = \" | \".join(f\"{b['size']//MiB}M {b['state']}\" for b in s[\"blocks\"])\n        print(f\"  seg {s['total_size']//MiB}M: [{blocks}]\")\n\ndev = torch.device(\"cuda:0\")\npool = torch.cuda.MemPool()\n\nts = alloc(4, 16, pool, dev)\nprint(\"four 16M:\"); layout(pool)\n\n# Free first and third, keep second and fourth alive.\nt1, t3 = ts[1], ts[3]\nts[0] = None\nts[2] = None\nprint(\"free #0,#2:\"); layout(pool)\n\n# There is plenty of total free memory, but the largest existing free\n# block is only 16M. 
A 32M allocation has to grow the segment.\nbig = alloc(1, 32, pool, dev)\nprint(\"32M alloc: reserved grew to\", reserved(pool) // MiB, \"MiB\")\nfree(big)\n\n# Now free everything; all blocks merge into one.\nts.clear()\ndel t1, t3\nprint(\"free all:\"); layout(pool)\n\n# 32M fits without growing.\nbig = alloc(1, 32, pool, dev)\nprint(\"32M after full free:\", reserved(pool) // MiB, \"MiB\")\nfree(big)\n```\n\nFor CUDA graph pools this is straightforward: graphs that share a pool don’t run concurrently, so everything should be freed between recordings. As long as that holds, the “free everything -> full merge” path applies, allocation order is irrelevant, and expandable segments eliminate this kind of fragmentation. But if your use case has long-lived allocations interleaved with short-lived ones in the same pool, expandable segments won’t save you from fragmentation within the segment.\n\n### The 1 MiB loophole\n\nThe allocator routes allocations <= 1 MiB to a pool called\n`small_blocks` and allocations > 1 MiB to `large_blocks`. These are\nentirely separate: separate segments, separate free lists. A block in\n`small_blocks` can never serve a request routed to `large_blocks`, and vice\nversa. 
Even with expandable segments, each pool gets its own segment.\n\nThis means making a tensor *smaller* can *increase* total pool memory\nif it crosses the 1 MiB boundary:\n\n``` python\nimport gc, torch\n\nMiB = 1024 * 1024\n\ndef alloc(n, mib, pool, dev):\n    with torch.cuda.use_mem_pool(pool, dev):\n        return [\n            torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev)\n            for _ in range(n)\n        ]\n\ndef free(ts):\n    ts.clear()\n\ndef reserved(pool):\n    return sum(s[\"total_size\"] for s in torch.cuda.memory_snapshot(pool.id))\n\ndev = torch.device(\"cuda:0\")\n\n# --- Crossing the boundary breaks sharing ---\npool = torch.cuda.MemPool()\na = alloc(1, 2, pool, dev)       # 2M -> large_blocks\nfree(a)\nb = alloc(1, 0.5, pool, dev)     # 0.5M -> small_blocks; can't reuse large segment\nprint(\"2M then 0.5M:\", reserved(pool) // MiB, \"MiB\")  # 22 MiB with 20M/2M pages\nfree(b); del pool; torch.cuda.empty_cache()\n\n# --- Staying above 1 MiB: sharing works ---\npool = torch.cuda.MemPool()\na = alloc(1, 2, pool, dev)       # 2M -> large_blocks\nfree(a)\nb = alloc(1, 2, pool, dev)       # 2M -> large_blocks; reuses segment\nprint(\"2M then 2M:\", reserved(pool) // MiB, \"MiB\")    # 20 MiB\nfree(b)\n```\n\nIf you’re recording multiple CUDA graphs into a shared pool, be aware that crossing the 1 MiB boundary causes allocations to land in different pools and breaks sharing.", "url": "https://wpnews.pro/news/when-does-fragmentation-occur-in-the-cuda-caching-allocator", "canonical_source": "https://docs.pytorch.org/devlogs/eager/2026-06-01-cuda-caching-allocator/", "published_at": "2026-06-01 18:43:51+00:00", "updated_at": "2026-06-04 05:12:51.576439+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-infrastructure", "ai-research"], "entities": ["CUDA", "PyTorch", "NVIDIA", "Claude", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/when-does-fragmentation-occur-in-the-cuda-caching-allocator", 
"markdown": "https://wpnews.pro/news/when-does-fragmentation-occur-in-the-cuda-caching-allocator.md", "text": "https://wpnews.pro/news/when-does-fragmentation-occur-in-the-cuda-caching-allocator.txt", "jsonld": "https://wpnews.pro/news/when-does-fragmentation-occur-in-the-cuda-caching-allocator.jsonld"}}