# How to Fix CUDA Out of Memory Errors in Stable Diffusion WebUI

> Source: <https://dev.to/alanwest/how-to-fix-cuda-out-of-memory-errors-in-stable-diffusion-webui-1k9n>
> Published: 2026-05-21 20:22:10+00:00

You finally got the WebUI running. You queue up a 1024x1024 generation, hit Generate, and a few seconds later your terminal vomits `RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB`. Cool. Cool cool cool.

I've been through this dance on three different rigs now — a 6GB laptop, an 8GB desktop, and a borrowed 12GB workstation — and the fix is almost never "buy a bigger GPU." It's usually a config problem. Let me walk you through what's actually happening and how to make it stop.

## What's actually going on under the hood

When you generate an image, the diffusion model loads its weights into VRAM, then the U-Net runs N denoising steps, each of which holds activations, attention maps, and intermediate tensors in memory. The SDXL U-Net alone has roughly 2.6 billion parameters, which works out to about 5 GB in fp16. Add the VAE, the text encoders (SDXL has two), and the per-step activations at full resolution, and you can easily blow past 10 GB before you've drawn a single pixel.
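
The weight portion of that budget is plain arithmetic: parameter count times bytes per parameter. Here is a minimal sketch of that calculation. The helper names are mine, and the one-billion-parameter figure is a round hypothetical number, not a measurement of any particular model.

```python
# Bytes needed per parameter for the two precisions discussed in this post.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_bytes(n_params: int, dtype: str = "fp16") -> int:
    """Memory needed just to hold the weights, ignoring activations."""
    return n_params * BYTES_PER_PARAM[dtype]


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    n_params = 1_000_000_000  # hypothetical model size, for illustration only
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {to_gib(weight_bytes(n_params, dtype)):.2f} GiB")
```

Activations and attention maps come on top of this, which is why the weights alone never tell the whole story.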

The really nasty part: PyTorch's allocator doesn't always release memory back to the driver between runs. So you'll have a successful generation, then the next one crashes — even though nothing changed. The fragmentation got you.

A few common root causes I've hit over and over:

- **Attention layers exploding.** Default scaled dot-product attention materializes the full attention matrix, which grows quadratically with the number of latent pixels.
- **Hires fix doubling everything.** It runs a second generation at the upscaled resolution, and that second pass needs its own activations.
- **VAE decode at full precision.** The default VAE can spike VRAM at the decode step, especially with `--no-half-vae`.
- **Other processes hogging VRAM.** Your browser's hardware acceleration, a Discord overlay, or a stray Python kernel can easily eat 1-2 GB.
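
To get a feel for the attention point, here is a rough back-of-the-envelope sketch. It assumes an 8x downscale from pixels to latents, self-attention over every latent position, 8 heads, and fp16 values. Real models run attention at several resolutions and with different head counts, so treat the absolute numbers as illustrative. The scaling is what matters.

```python
def attention_matrix_bytes(width, height, heads=8, bytes_per_value=2, downscale=8):
    """Size of one fully materialized self-attention matrix.

    All defaults are assumptions for illustration, not values read from a model.
    """
    tokens = (width // downscale) * (height // downscale)  # latent positions
    return heads * tokens * tokens * bytes_per_value       # tokens x tokens per head


if __name__ == "__main__":
    for side in (512, 768, 1024):
        mib = attention_matrix_bytes(side, side) / 2**20
        print(f"{side}x{side}: {mib:.0f} MiB per attention matrix")
```

Doubling each side of the image quadruples the number of latent positions, so the matrix grows 16x. That is why a 1024x1024 run can fail on a card that handles 512x512 comfortably.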

## Step 1: Check what's actually using your VRAM

Before changing any flags, see what you're working with. On Linux or WSL:

```
# Snapshot current VRAM usage and which processes are holding it
nvidia-smi

# Or watch it live while a generation runs
watch -n 0.5 nvidia-smi
```

On Windows, `nvidia-smi.exe` lives in `C:\Windows\System32\` and works the same way. If your idle VRAM is already at 2 GB before you launch the WebUI, that's your first problem: kill the offenders. Browser hardware acceleration is usually the biggest one.
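
If you'd rather script that check, here is a small sketch that shells out to `nvidia-smi` using its CSV query mode and flags a GPU that is already busy at idle. The function names and the 2048 MiB limit are my own choices (the limit mirrors the 2 GB rule of thumb above), and the parser assumes plain numeric output.

```python
import shutil
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, without headers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines into a list of (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def idle_vram_too_high(used_mib, limit_mib=2048):
    """True if idle usage is at or above the limit (an arbitrary threshold)."""
    return used_mib >= limit_mib


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        for index, (used, total) in enumerate(parse_gpu_memory(result.stdout)):
            note = "  <- close other GPU apps first" if idle_vram_too_high(used) else ""
            print(f"GPU {index}: {used} / {total} MiB{note}")
```

Run it before launching the WebUI so the reading reflects everything else on the machine rather than the WebUI itself.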

## Step 2: Set the right command-line arguments

This is where most of the wins are. The WebUI accepts flags via `webui-user.bat` (Windows) or `webui-user.sh` (Linux/Mac). Open it up and edit `COMMANDLINE_ARGS`. Here's a solid starting point for an 8 GB card:

```
# webui-user.sh
export COMMANDLINE_ARGS="--xformers --medvram --opt-split-attention --no-half-vae"
```

What each one does:

- `--xformers` enables memory-efficient attention. This alone often cuts VRAM use by 30-40%. You may need to install it separately (more on that below).
- `--medvram` splits the model so the U-Net, VAE, and text encoder aren't all resident at once. There's a small speed cost, maybe 10-15%, but it's the difference between generating and crashing.
- `--lowvram` is more aggressive; use it on 4 GB cards. Slower, but it works.
- `--opt-split-attention` chunks attention computation across the sequence dimension.
- `--no-half-vae` keeps the VAE in fp32. Counterintuitive, but it prevents black-image artifacts on some GPUs that come from fp16 VAE overflow.
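
As a rough way to encode that advice, here is a hypothetical helper that picks the memory flag from the card's VRAM. The thresholds are my reading of the guidance above (`--lowvram` at 4 GB and below, `--medvram` up to 8 GB, neither above that). They are not taken from the WebUI itself, so adjust them to your own results.

```python
def suggest_args(vram_gb: float) -> str:
    """Suggest COMMANDLINE_ARGS for a card with the given VRAM.

    Thresholds are assumptions based on the rules of thumb in this post.
    """
    args = ["--xformers"]
    if vram_gb <= 4:
        args.append("--lowvram")   # most aggressive offloading
    elif vram_gb <= 8:
        args.append("--medvram")   # keeps only one model part resident at a time
    return " ".join(args)


if __name__ == "__main__":
    for vram in (4, 6, 8, 12):
        print(f"{vram} GB -> {suggest_args(vram)}")
```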

For xformers, if it's not auto-installing, do it manually inside the venv:

```
# Activate the venv first
source venv/bin/activate

# Match the torch version that the WebUI installed
pip install xformers --index-url https://download.pytorch.org/whl/cu121
```

Check your installed torch version with `pip show torch` and grab the matching xformers build. Mismatched CUDA versions are a frequent source of "xformers installed but not used" complaints. The [official xformers repo](https://github.com/facebookresearch/xformers) has a compatibility matrix worth bookmarking.
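
A quick way to compare the two from inside the venv is to read the installed package metadata. This sketch uses only the standard library. It assumes the CUDA build shows up as a local version tag such as `+cu121`, which not every wheel carries, so a missing tag doesn't prove a mismatch.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def cuda_tag(version):
    """Extract a local tag such as 'cu121' from '2.1.2+cu121', if present."""
    if version is None:
        return None
    match = re.search(r"\+(cu\d+)", version)
    return match.group(1) if match else None


if __name__ == "__main__":
    for package in ("torch", "xformers"):
        version = installed_version(package)
        tag = cuda_tag(version) or "none in version string"
        print(f"{package}: {version or 'not installed'} (CUDA tag: {tag})")
```

If both packages report a tag and the tags differ, that is a strong hint you installed an xformers build for the wrong CUDA version.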

## Step 3: Tame PyTorch's memory allocator

This is the one nobody talks about and it's saved me more times than I can count. PyTorch's CUDA caching allocator can be tuned via an environment variable. Set this before launching:

```
# Linux/Mac — add to webui-user.sh
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512,garbage_collection_threshold:0.8"

# Windows — add to webui-user.bat
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.8
```

The `max_split_size_mb` setting stops the allocator from splitting cached blocks larger than that size, which limits fragmentation into chunks too small to reuse. The `garbage_collection_threshold` setting makes the allocator start reclaiming unused cached blocks once usage crosses 80% of capacity. I picked these numbers after a lot of trial and error on my 8 GB card. Your mileage may vary, but this combo handles the "second generation crashes" pattern beautifully.
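
For standalone scripts, the same setting can be applied from Python, as long as it happens before PyTorch initializes CUDA. Setting it before `import torch` is the safe option. Here is a minimal sketch, in which `build_alloc_conf` and `ALLOC_OPTIONS` are hypothetical names:

```python
import os


def build_alloc_conf(options):
    """Join allocator options into the 'key:value,key:value' format."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


# Same values as the shell examples above; tune them for your own card.
ALLOC_OPTIONS = {"max_split_size_mb": 512, "garbage_collection_threshold": 0.8}

# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", build_alloc_conf(ALLOC_OPTIONS))

# Import torch only after this point so the allocator sees the setting.
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```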

If you're writing your own inference scripts on top of diffusers, you can also force a flush manually between runs:

``` python
import torch
import gc

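def cleanup_vram():
    """Run after a generation completes, before the next prompt."""
    gc.collect()                  # drop unreachable Python objects first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
        torch.cuda.ipc_collect()  # release memory shared over CUDA IPC

cleanup_vram()

# Optional: print what's still resident so you can debug leaks
if torch.cuda.is_available():
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")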
```

Note that `empty_cache()` doesn't reduce `memory_allocated`, only `memory_reserved`. If allocated stays high, you've actually got tensors hanging around (probably a stray reference somewhere).

## Step 4: Reduce the working set

If you've done all of the above and still hit OOM, the generation itself is just too big. Some things that actually help:

- **Drop the base resolution to 512x512 or 768x768**, then use Hires fix with a 1.5x or 2x upscaler. The two-pass approach uses way less peak VRAM than generating at native 1024x1024.
- **Lower the batch size to 1.** Batching is a VRAM multiplier with no quality benefit for stills.
- **Switch to a smaller model.** SD 1.5 fine-tunes are typically 2 to 4 GB on disk, while an SDXL checkpoint is closer to 7 GB. If you don't need SDXL's specific aesthetic, save yourself the headache.
- **Use a tiled VAE extension.** It decodes the latent in chunks instead of all at once, which avoids the spike at the end of generation.
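
To show why tiling helps, here is a sketch of how a decoder could walk a latent in overlapping tiles. It only computes the tile coordinates, and it is not how any particular extension implements it. The tile size, the overlap, and the 8x downscale from pixels to latents are all assumptions.

```python
def tile_boxes(height, width, tile=64, overlap=8):
    """Return (top, left, bottom, right) boxes that together cover the grid."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    stride = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # final tile sits flush with the edge
        return positions

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]


if __name__ == "__main__":
    # Assumes an 8x VAE downscale: a 1024x1024 image is a 128x128 latent.
    boxes = tile_boxes(128, 128)
    largest = max((b - t) * (r - l) for t, l, b, r in boxes)
    print(f"{len(boxes)} tiles, largest covers {largest} of {128 * 128} positions")
```

Each tile is decoded on its own, so the decoder only ever holds activations for one tile's worth of the latent. The overlap gives the extension room to blend seams between neighbors.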

## How to keep it from happening again

A few habits I've picked up:

- Keep a known-good `COMMANDLINE_ARGS` in version control. I have a tiny git repo of just my WebUI configs.
- After updating the WebUI or a major extension, do a clean run with a simple prompt before queuing up big batches. New code paths can change VRAM behavior in surprising ways.
- Don't run a browser-based image viewer in the same session. It adds VRAM pressure you'll forget about.
- Watch your inference logs. If `memory_reserved` keeps creeping up between runs, you've got a leak, usually from an extension that holds references.
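
If you log `memory_reserved` after each run, a few lines of Python can flag the creeping pattern for you. This is a minimal sketch. The function name, the 0.1 GB growth threshold, and the sample readings are all made up for illustration.

```python
def reserved_is_creeping(readings_gb, min_growth_gb=0.1):
    """True if every reading exceeds the previous one by at least min_growth_gb.

    Needs at least three readings, so one jump after warm-up doesn't trigger it.
    """
    if len(readings_gb) < 3:
        return False
    pairs = zip(readings_gb, readings_gb[1:])
    return all(later - earlier >= min_growth_gb for earlier, later in pairs)


if __name__ == "__main__":
    steady = [5.0, 5.1, 5.0, 5.1]    # hypothetical: cache settles, no leak
    leaking = [5.0, 5.4, 5.9, 6.3]   # hypothetical: grows on every run
    print("steady:", reserved_is_creeping(steady))
    print("leaking:", reserved_is_creeping(leaking))
```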

The annoying truth is that VRAM management in local diffusion is mostly fiddly config, not raw hardware. A well-tuned 8 GB card will out-generate a poorly-tuned 12 GB one all day. Spend the hour up front getting your flags right and you'll save yourself dozens of crash-recoveries later.