A rough estimate to avoid OOM, not an exact figure, so leave headroom. Over budget? Try fp8, a lower resolution, a smaller batch, or tiled VAE / --medvram.

A tool that estimates how much VRAM an image generation is likely to need on Stable Diffusion-family models (SD1.5 / SD2.x / SDXL · Pony · Illustrious / SD3 Medium / Flux.1). Use it to check whether you can push the resolution or batch size before a run dies with "CUDA out of memory" (OOM).

The estimate is split into three parts:

  1. Model weights = parameter count × bytes per parameter (fp16/bf16 = 2 bytes, fp32 = 4, fp8 = 1).
  2. Working memory (latents, activations, attention) = a baseline × batch size × pixel factor (image area relative to 512×512) × an attention factor × the precision ratio. Efficient attention (xformers / SDP) cuts working memory substantially, while vanilla attention blows up at high resolution.
  3. A fixed CUDA + framework overhead.

The tool then shows the total and rates whether it fits an 8 / 12 / 16 / 24 GB GPU as OK / Tight / OOM. Because real usage varies by model and implementation, anything above 90% of a card's VRAM is flagged Tight to leave headroom.

This is only an estimate: the real figure shifts with xformers, tiled VAE, --medvram / --lowvram, offloading, and whatever the OS is using. If a run OOMs, try these in order: fp8, a lower resolution, a smaller batch, tiled VAE, and --medvram. All math runs in your browser; the values you enter are never sent to any API or server.
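The three-part estimate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code: `BASE_WORKING_GB`, `OVERHEAD_GB` and the `attention_factor()` curve are hypothetical placeholders, and sizes are in decimal GB (10^9 bytes).

```python
# Minimal sketch of the three-part VRAM estimate (weights + working memory + overhead).
# All constants below are HYPOTHETICAL placeholders, not the tool's real values.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

BASE_WORKING_GB = 1.0  # hypothetical: working memory for one 512x512 image at fp16
OVERHEAD_GB = 0.8      # hypothetical: fixed CUDA + framework overhead


def attention_factor(mode: str, pixel_factor: float) -> float:
    # Hypothetical: efficient attention (xformers / SDP) keeps the factor flat,
    # while vanilla attention is assumed to grow with image area on top of the
    # pixel factor, so its working memory rises roughly with area squared.
    if mode == "efficient":
        return 1.0
    return 1.5 * pixel_factor


def estimate_vram_gb(params: float, precision: str, width: int, height: int,
                     batch: int = 1, attention: str = "efficient") -> dict:
    bytes_per_param = BYTES_PER_PARAM[precision]
    # (1) Model weights: parameter count x bytes per parameter, in decimal GB.
    weights = params * bytes_per_param / 1e9
    # (2) Working memory: baseline x batch x pixel factor x attention factor x precision ratio.
    pixel_factor = (width * height) / (512 * 512)
    precision_ratio = bytes_per_param / BYTES_PER_PARAM["fp16"]
    working = (BASE_WORKING_GB * batch * pixel_factor
               * attention_factor(attention, pixel_factor) * precision_ratio)
    # (3) Fixed overhead, then the total.
    total = weights + working + OVERHEAD_GB
    return {"weights": weights, "working": working,
            "overhead": OVERHEAD_GB, "total": total}


if __name__ == "__main__":
    # Illustrative input: about 860M parameters, roughly the size of the SD1.5 UNet.
    print(estimate_vram_gb(860e6, "fp16", 1024, 1024, batch=2))
```

Each component comes back separately along with the total, which is the figure compared against the GPU sizes. Changing the precision moves both the weights term and, through the precision ratio, the working-memory term.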

How to use

  1. Pick the model (SD1.5 / SDXL / SD3 / Flux) and precision (fp16 / fp32 / fp8).
  2. Enter the width, height, batch size and attention mode (efficient / none).
  3. Read the VRAM estimate, the breakdown, and which GPUs (8/12/16/24 GB) it fits.
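The fit rating in step 3 comes down to comparing the total against each card's capacity. A minimal sketch, assuming "Tight" means above 90% but not above 100% of the card, and treating the nominal 8 / 12 / 16 / 24 GB as fully usable (in practice the OS and driver take a share).

```python
# Sketch of the OK / Tight / OOM check against common GPU sizes.
# Assumption: above 90% of capacity is "Tight", above 100% is "OOM".

GPU_SIZES_GB = (8, 12, 16, 24)
TIGHT_THRESHOLD = 0.9


def classify_fit(total_gb: float, gpu_gb: float) -> str:
    usage = total_gb / gpu_gb
    if usage > 1.0:
        return "OOM"
    if usage > TIGHT_THRESHOLD:
        return "Tight"
    return "OK"


def fits_on(total_gb: float) -> dict:
    # Rate the same estimate against every GPU size.
    return {gpu: classify_fit(total_gb, gpu) for gpu in GPU_SIZES_GB}


if __name__ == "__main__":
    print(fits_on(11.0))  # → {8: 'OOM', 12: 'Tight', 16: 'OK', 24: 'OK'}
```

An 11 GB estimate uses about 92% of a 12 GB card, so it technically fits but gets flagged Tight.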

FAQ

Are the values I enter sent anywhere?

No. The VRAM estimate is computed entirely in your browser. The model, resolution, batch size and other inputs are never sent to any API or server; everything stays on your device.

How accurate is the number?

It's an estimate. Actual VRAM use depends on the UI (A1111 / Forge / ComfyUI), its implementation, whether xformers is enabled, tiled VAE, --medvram / --lowvram, offloading, and how much VRAM the OS and other apps are using. Treat it as a ballpark for avoiding OOM and leave headroom.

Which precision (fp16 / fp32 / fp8) should I pick?

fp16 / bf16 is the usual default; pick it unless you have a reason not to. fp32 roughly doubles VRAM and is rarely needed. fp8 saves VRAM but is mainly supported on certain models (Flux / SD3) and can shift quality slightly.
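To see how much the precision choice matters, here is the weights term alone for a model of Flux.1's size (roughly 12B parameters in its diffusion transformer). This sketch covers only the parameter count you pass in; Flux's text encoders and VAE add further memory on top. Sizes are in decimal GB (10^9 bytes).

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_gb(params: float, precision: str) -> float:
    """Weight memory in decimal GB: parameter count x bytes per parameter."""
    return params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Flux.1's diffusion transformer has roughly 12 billion parameters.
    for precision in ("fp32", "fp16", "fp8"):
        print(precision, weights_gb(12e9, precision))
    # fp32 48.0
    # fp16 24.0
    # fp8 12.0
```

fp32 doubles the fp16 figure and fp8 halves it, which is why fp8 is the first lever to pull on large models like Flux.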

It OOMs (out of memory). What can I do?

In order of impact: switch to fp8, lower the resolution, reduce the batch size, enable tiled VAE, and add --medvram (or --lowvram). On Flux / SD3, offloading the text encoder (T5) also helps.

What's the difference between efficient and no attention?

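xformers or PyTorch's SDP (scaled dot-product attention) typically compute attention without holding the full attention matrix in memory at once, which saves a lot of intermediate memory, so working memory stays manageable even at high resolution. Vanilla attention (none) stores a score matrix whose size grows with the square of the number of latent positions, so its memory rises sharply with resolution. This is a common cause of OOM.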
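The quadratic growth shows up in a back-of-the-envelope calculation of the score matrix that vanilla attention materializes. Assumptions for this sketch: an SD-style VAE that shrinks each side by 8, 8 attention heads (a hypothetical value, roughly matching SD1.5's highest-resolution layers), fp16 scores, and a single attention layer. Softmax temporaries and the rest of the network are ignored.

```python
def attention_scores_bytes(width: int, height: int, batch: int = 1, heads: int = 8,
                           downsample: int = 8, bytes_per_elem: int = 2) -> int:
    """Size of one vanilla self-attention score matrix (tokens x tokens per head)."""
    # Each latent position is one token; the VAE shrinks each side by `downsample`.
    tokens = (width // downsample) * (height // downsample)
    return batch * heads * tokens * tokens * bytes_per_elem


if __name__ == "__main__":
    for side in (512, 768, 1024):
        mib = attention_scores_bytes(side, side) / 2**20
        print(f"{side}x{side}: {mib:.0f} MiB")
    # 512x512: 256 MiB
    # 768x768: 1296 MiB
    # 1024x1024: 4096 MiB
```

Doubling each side (4× the pixels) multiplies this one matrix by 16, from 256 MiB to 4 GiB. Memory-efficient kernels such as xformers or SDP's flash / memory-efficient backends process attention in blocks and never hold the whole matrix, which is why they keep working memory in check.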