{"slug": "how-to-fix-cuda-out-of-memory-errors-in-stable-diffusion-webui", "title": "How to Fix CUDA Out of Memory Errors in Stable Diffusion WebUI", "summary": "The \"CUDA out of memory\" error in Stable Diffusion WebUI is often caused by configuration issues rather than insufficient GPU hardware, particularly due to PyTorch's memory allocator failing to release VRAM between generations. The article recommends editing the WebUI's command-line arguments to include flags like `--xformers`, `--medvram`, and `--opt-split-attention` to reduce memory usage, and suggests setting the `PYTORCH_CUDA_ALLOC_CONF` environment variable to `max_split_size_mb:512,garbage_collection_threshold:0.8` to prevent memory fragmentation.", "body_md": "You finally got the WebUI running. You queue up a 1024x1024 generation, hit Generate, and a few seconds later your terminal vomits `RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB`. Cool. Cool cool cool.\n\nI've been through this dance on three different rigs now (a 6 GB laptop, an 8 GB desktop, and a borrowed 12 GB workstation), and the fix is almost never \"buy a bigger GPU.\" It's usually a config problem. Let me walk you through what's actually happening and how to make it stop.\n\n## What's actually going on under the hood\n\nWhen you generate an image, the diffusion model loads weights into VRAM, then the U-Net runs N denoising steps where each step holds activations, attention maps, and intermediate tensors in memory. The SDXL U-Net has about 2.6 billion parameters, which is roughly 5 GB in fp16 for the weights alone. Add the VAE, the text encoders (SDXL has two), and the per-step activations at full resolution, and you can easily blow past 10 GB before you've drawn a single pixel.\n\nThe really nasty part: PyTorch's allocator doesn't always release memory back to the driver between runs. So you'll have a successful generation, then the next one crashes, even though nothing changed. 
The fragmentation got you.\n\nA few common root causes I've hit over and over:\n\n- **Attention layers exploding.** Default scaled dot-product attention materializes the full attention matrix, which scales quadratically with the number of latent tokens, so doubling both width and height makes it roughly 16x larger.\n- **Hires fix doubling everything.** It runs a second generation at upscaled resolution. That second pass needs its own activations.\n- **VAE decode at full precision.** The default VAE can spike VRAM at the decode step, especially with `--no-half-vae`.\n- **Other processes hogging VRAM.** Your browser's hardware acceleration, a Discord overlay, or a stray Python kernel can easily eat 1-2 GB.\n\n## Step 1: Check what's actually using your VRAM\n\nBefore changing any flags, see what you're working with. On Linux or WSL:\n\n```\n# Snapshot current VRAM usage and which processes are holding it\nnvidia-smi\n\n# Or watch it live while a generation runs\nwatch -n 0.5 nvidia-smi\n```\n\nOn Windows, `nvidia-smi.exe` lives in `C:\\Windows\\System32\\` and works the same way. If your idle VRAM is already at 2 GB before you launch the WebUI, that's your first problem: kill the offenders. Browser hardware acceleration is usually the biggest one.\n\n## Step 2: Set the right command-line arguments\n\nThis is where most of the wins are. The WebUI accepts flags via `webui-user.bat` (Windows) or `webui-user.sh` (Linux/Mac). Open it up and edit `COMMANDLINE_ARGS`. Here's a solid starting point for an 8 GB card:\n\n```\n# webui-user.sh\nexport COMMANDLINE_ARGS=\"--xformers --medvram --opt-split-attention --no-half-vae\"\n```\n\nWhat each one does:\n\n- `--xformers` enables memory-efficient attention. This alone often cuts VRAM use by 30-40%. You may need to install it separately (more on that below).\n- `--medvram` splits the model so the U-Net, VAE, and text encoder aren't all resident at once. There's a small speed cost, maybe 10-15%, but it's the difference between generating and crashing. 
- `--lowvram` is more aggressive; use it on 4 GB cards. Slower, but it works.\n- `--opt-split-attention` chunks attention computation across the sequence dimension.\n- `--no-half-vae` keeps the VAE in fp32. Counterintuitive, but it prevents black-image artifacts on some GPUs that come from fp16 VAE overflow.\n\nFor xformers, if it's not auto-installing, do it manually inside the venv:\n\n```\n# Activate the venv first\nsource venv/bin/activate\n\n# Match the torch version that the WebUI installed\npip install xformers --index-url https://download.pytorch.org/whl/cu121\n```\n\nCheck your installed torch version with `pip show torch` and grab the matching xformers build. Mismatched CUDA versions are a frequent source of \"xformers installed but not used\" complaints. The [official xformers repo](https://github.com/facebookresearch/xformers) has a compatibility matrix worth bookmarking.\n\n## Step 3: Tame PyTorch's memory allocator\n\nThis is the one nobody talks about and it's saved me more times than I can count. PyTorch's CUDA caching allocator can be tuned via an environment variable. Set this before launching:\n\n```\n# Linux/Mac: add to webui-user.sh\nexport PYTORCH_CUDA_ALLOC_CONF=\"max_split_size_mb:512,garbage_collection_threshold:0.8\"\n\n# Windows: add to webui-user.bat\nset PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.8\n```\n\nThe `max_split_size_mb` setting stops the allocator from splitting cached blocks larger than that size, which limits fragmentation into chunks too small to reuse. The `garbage_collection_threshold` setting triggers eager cleanup when you cross 80% utilization. 
I picked these numbers after a lot of trial and error on my 8 GB card. Your mileage may vary, but this combo handles the \"second generation crashes\" pattern beautifully.\n\nIf you're writing your own inference scripts on top of diffusers, you can also force a flush manually between runs:\n\n```python\nimport gc\n\nimport torch\n\n# Run after generation completes, before the next prompt\ndef cleanup_vram():\n    gc.collect()              # drop unreachable Python objects that still hold tensors\n    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver\n    torch.cuda.ipc_collect()  # release memory tied to CUDA IPC handles\n\ncleanup_vram()\n\n# Optional: print what's still resident so you can debug leaks\nprint(f\"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB\")\nprint(f\"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB\")\n```\n\nNote that `empty_cache()` doesn't reduce `memory_allocated`, only `memory_reserved`. If allocated stays high, you've actually got tensors hanging around (probably a stray reference somewhere).\n\n## Step 4: Reduce the working set\n\nIf you've done all of the above and still hit OOM, the generation itself is just too big. Some things that actually help:\n\n- **Drop the base resolution to 512x512 or 768x768**, then use Hires fix with a 1.5x or 2x upscaler. The two-pass approach can use less peak VRAM than generating at native 1024x1024.\n- **Lower the batch size to 1.** Batching is a VRAM multiplier with no quality benefit for stills.\n- **Switch to a smaller model.** SD 1.5 fine-tunes are roughly 2 to 4 GB on disk; the SDXL base checkpoint is closer to 7 GB. If you don't need SDXL's specific aesthetic, save yourself the headache.\n- **Use a tiled VAE extension.** It decodes the latent in chunks instead of all at once, which avoids the spike at the end of generation.\n\n## How to keep it from happening again\n\nA few habits I've picked up:\n\n- Keep a known-good `COMMANDLINE_ARGS` in version control. I have a tiny git repo of just my WebUI configs.\n- After updating the WebUI or a major extension, do a clean run with a simple prompt before queuing up big batches. 
New code paths can change VRAM behavior in surprising ways.\n- Don't run a browser-based image viewer in the same session; it adds VRAM pressure you'll forget about.\n- Watch your inference logs. If `memory_reserved` keeps creeping up between runs, you've got a leak, usually from an extension that holds references.\n\nThe annoying truth is that VRAM management in local diffusion is mostly fiddly config, not raw hardware. A well-tuned 8 GB card will out-generate a poorly-tuned 12 GB one all day. Spend the hour up front getting your flags right and you'll save yourself dozens of crash-recoveries later.", "url": "https://wpnews.pro/news/how-to-fix-cuda-out-of-memory-errors-in-stable-diffusion-webui", "canonical_source": "https://dev.to/alanwest/how-to-fix-cuda-out-of-memory-errors-in-stable-diffusion-webui-1k9n", "published_at": "2026-05-21 20:22:10+00:00", "updated_at": "2026-05-21 21:04:01.727820+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "developer-tools", "hardware"], "entities": ["CUDA", "Stable Diffusion", "PyTorch", "SDXL", "U-Net", "VAE", "nvidia-smi", "WebUI"], "alternates": {"html": "https://wpnews.pro/news/how-to-fix-cuda-out-of-memory-errors-in-stable-diffusion-webui", "markdown": "https://wpnews.pro/news/how-to-fix-cuda-out-of-memory-errors-in-stable-diffusion-webui.md", "text": "https://wpnews.pro/news/how-to-fix-cuda-out-of-memory-errors-in-stable-diffusion-webui.txt", "jsonld": "https://wpnews.pro/news/how-to-fix-cuda-out-of-memory-errors-in-stable-diffusion-webui.jsonld"}}