This is a rough estimate for avoiding OOM, not an exact figure, so leave headroom. If you are over budget, try fp8, a lower resolution, a smaller batch, tiled VAE, or --medvram.
This tool estimates how much VRAM an image generation is likely to need on Stable Diffusion-family models (SD1.5, SD2.x, SDXL and its derivatives Pony and Illustrious, SD3 Medium, and Flux.1). Use it to check whether you can raise the resolution or batch size before a run dies with "CUDA out of memory" (OOM).
The estimate is split into three parts:
- Model weights = parameter count × bytes per parameter (fp32 = 4 bytes, fp16/bf16 = 2, fp8 = 1).
- Working memory (latents, activations, attention) = a baseline × batch size × pixel factor (image area relative to 512×512) × an attention factor × the precision ratio. Efficient attention (xformers / SDP) cuts working memory substantially, while vanilla attention blows up at high resolution.
- A fixed CUDA + framework overhead.
The tool then shows the total and whether it fits an 8, 12, 16 or 24 GB GPU, rated OK, Tight or OOM. Because real VRAM use varies by model and implementation, anything above 90% of a card's capacity is flagged Tight to leave headroom.
This is only an estimate: the real figure shifts with xformers, tiled VAE, --medvram / --lowvram, offloading and whatever the OS is using. If a run OOMs, try the following in order: fp8, lower resolution, smaller batch, tiled VAE, and --medvram.
All math runs in your browser. The values you enter are never sent to any API or server.
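The three-part estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the baseline, the attention multipliers and the overhead value are hypothetical placeholders, the precision ratio is assumed to be relative to fp16, and "GB" is assumed to mean 1024³ bytes. Only the bytes-per-parameter table and the shape of the formula come from the description.

```python
from dataclasses import dataclass

# Bytes per parameter, as stated in the description.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

# Hypothetical placeholder constants, NOT the tool's real values.
BASELINE_WORKING_GB = 1.0                            # working memory at 512x512, batch 1, fp16
ATTENTION_FACTOR = {"efficient": 1.0, "none": 2.0}   # assumed multipliers per attention mode
OVERHEAD_GB = 0.8                                    # assumed fixed CUDA + framework overhead


@dataclass
class Estimate:
    weights_gb: float
    working_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.working_gb + self.overhead_gb


def estimate_vram(params_billion, precision, width, height,
                  batch=1, attention="efficient"):
    """Return a three-part VRAM estimate following the formula in the text."""
    bytes_per_param = BYTES_PER_PARAM[precision]

    # (1) Model weights = parameter count x bytes per parameter.
    weights_gb = params_billion * 1e9 * bytes_per_param / 1024**3

    # (2) Working memory = baseline x batch x pixel factor x attention x precision ratio.
    pixel_factor = (width * height) / (512 * 512)
    precision_ratio = bytes_per_param / BYTES_PER_PARAM["fp16"]  # assumption: relative to fp16
    working_gb = (BASELINE_WORKING_GB * batch * pixel_factor
                  * ATTENTION_FACTOR[attention] * precision_ratio)

    # (3) Fixed overhead.
    return Estimate(weights_gb, working_gb, OVERHEAD_GB)


# Usage with a hypothetical 1-billion-parameter model at 1024x1024, batch 2.
est = estimate_vram(1.0, "fp16", 1024, 1024, batch=2)
print(f"weights {est.weights_gb:.2f} + working {est.working_gb:.2f} "
      f"+ overhead {est.overhead_gb:.2f} = {est.total_gb:.2f} GB")
```

With these placeholder constants only the relative behaviour is meaningful: working memory scales linearly with batch size and image area, while the weights term depends only on the model and precision.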
How to use
- Pick the model (SD1.5 / SDXL / SD3 / Flux) and precision (fp16 / fp32 / fp8).
- Enter the width, height, batch size and attention mode (efficient / none).
- Read the VRAM estimate, the breakdown, and which GPUs (8/12/16/24 GB) it fits.
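The fit check in the last step can be sketched as follows. The 90% threshold and the four GPU sizes come from the description. The function names and the exact handling of the boundary values (strictly greater than) are assumptions made for illustration.

```python
GPU_SIZES_GB = (8, 12, 16, 24)   # GPU sizes listed in the description
TIGHT_THRESHOLD = 0.90           # anything past 90% of capacity is flagged Tight


def classify_fit(total_gb, capacity_gb):
    """Rate an estimated total against one GPU's capacity."""
    if total_gb > capacity_gb:
        return "OOM"
    if total_gb > TIGHT_THRESHOLD * capacity_gb:
        return "Tight"
    return "OK"


def fit_table(total_gb):
    """Rate the estimate against every listed GPU size."""
    return {size: classify_fit(total_gb, size) for size in GPU_SIZES_GB}


# An 11 GB estimate: too big for 8 GB, within 10% of the limit on 12 GB.
print(fit_table(11.0))
# {8: 'OOM', 12: 'Tight', 16: 'OK', 24: 'OK'}
```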
FAQ
Are the values I enter sent anywhere?
No. The VRAM estimate is computed entirely in your browser. The model, resolution, batch and other inputs are never sent to any API or server — it all stays on your device.
How accurate is the number?
It's an estimate. Real VRAM depends on the UI (A1111 / Forge / ComfyUI), the implementation, whether xformers is on, tiled VAE, --medvram / --lowvram, offloading and what the OS uses. Treat it as a ballpark for avoiding OOM and leave headroom.
Which precision (fp16 / fp32 / fp8) should I pick?
fp16 / bf16 is the usual default — pick it unless you have a reason not to. fp32 roughly doubles VRAM and is rarely needed. fp8 saves VRAM but is mainly supported on some models (Flux / SD3) and can shift quality slightly.
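To see what the precision choice means for the weights alone, here is a small worked calculation for a hypothetical 2-billion-parameter model (the parameter count is an illustrative assumption, and sizes are reported in units of 1024³ bytes).

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_gib(param_count, precision):
    """Weights footprint = parameter count x bytes per parameter."""
    return param_count * BYTES_PER_PARAM[precision] / 1024**3


PARAMS = 2_000_000_000  # hypothetical model size, for illustration only

for precision in ("fp32", "fp16", "fp8"):
    print(f"{precision}: {weights_gib(PARAMS, precision):.2f} GiB")
# fp32: 7.45 GiB
# fp16: 3.73 GiB
# fp8: 1.86 GiB
```

The weights term doubles exactly from fp16 to fp32. The total grows by somewhat less than a factor of two, because the fixed overhead does not scale with precision.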
It OOMs (out of memory). What can I do?
In order of impact: switch to fp8, lower the resolution, reduce the batch size, enable tiled VAE, and add --medvram (or --lowvram). On Flux / SD3, offloading the text encoder (T5) also helps.
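What's the difference between the efficient and none attention modes?
Efficient attention (xformers or PyTorch's SDP, scaled dot-product attention) typically avoids holding the full attention score matrix in memory, so working memory stays manageable even at high resolution. The none mode (vanilla attention) stores that matrix in full, and because its size grows with the square of the number of latent tokens, attention memory rises sharply with resolution. This is a common cause of OOM.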
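A back-of-the-envelope calculation shows why vanilla attention is so sensitive to resolution. The sketch below counts only the attention score matrix of a single self-attention layer, and it rests on several simplifying assumptions: an 8× downsample from pixels to latents, attention running at full latent resolution, 8 heads, and 2-byte (fp16) values. Real models differ in all of these, so the absolute numbers are illustrative and only the scaling matters.

```python
def vanilla_attention_scores_mib(width, height, heads=8,
                                 bytes_per_value=2, downsample=8):
    """Size of one layer's full attention score matrix in MiB.

    Assumptions (illustrative, not tied to any specific model):
    latents are image size / downsample, one token per latent pixel,
    and the full tokens x tokens matrix is stored for every head.
    """
    tokens = (width // downsample) * (height // downsample)
    return heads * tokens * tokens * bytes_per_value / 1024**2


for side in (512, 1024):
    mib = vanilla_attention_scores_mib(side, side)
    print(f"{side}x{side}: {mib:.0f} MiB")
# 512x512: 256 MiB
# 1024x1024: 4096 MiB
```

Doubling each side quadruples the token count and multiplies the score matrix by 16. Memory-efficient attention kernels compute the same result in blocks without storing the whole matrix, which is why their memory use grows far more slowly.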