Estimated VRAM

This is a rough estimate meant to leave VRAM headroom, not an exact figure. If you are over budget, pick a smaller quant (Q4_K_M is the usual sweet spot), shorten the context, quantize the KV cache (q8_0 / q4_0), or offload some layers to CPU (llama.cpp's -ngl).

This tool estimates how much VRAM you need to run a GGUF model locally with llama.cpp, Ollama, LM Studio or KoboldCpp. Use it to answer questions such as "how much VRAM does a 7B model need?", "how much does Q4_K_M vs Q5_K_M change things?", or "will a 14B fit on my 8 GB / 12 GB GPU?" before you download the model.

The estimate is split into three parts:

  1. Model weights = parameter count × the effective bits-per-weight of the quantization ÷ 8. The tool ships presets for the GGUF types Q2_K / Q3_K_M / Q4_K_M / Q5_K_M / Q6_K / Q8_0 / F16 (Q4_K_M is the usual sweet spot of size vs quality).
  2. KV cache = the part that grows with context length. For models like Llama 3.1 8B, Qwen2.5, Gemma 2 and Llama 3.x 70B the tool computes the KV cache from the real architecture (layer count, KV heads, head dim), so it reflects the difference between GQA (grouped-query attention) models with a small KV cache and older models without it (e.g. Llama 2 13B). You can also quantize the KV cache itself to q8_0 / q4_0 to save memory.
  3. Runtime overhead = a fixed allowance for the CUDA/Metal context and compute buffers.

The tool then shows the total and whether it fits a 6 / 8 / 12 / 16 / 24 GB GPU as OK / Tight / OOM (anything past 90% of the card's VRAM is flagged Tight to leave headroom). For models not in the list, pick "Custom" and type the parameter count (weights are computed exactly; the KV cache is approximated).

This is only an estimate: real VRAM usage shifts with the quant implementation, flash attention, batching and what the OS uses. If you run out, try a smaller quant, a shorter context, a quantized KV cache, or offloading some layers to CPU (llama.cpp's -ngl).

All math runs in your browser; the values you enter are never sent to any API or server.

For Stable Diffusion image generation use the sister VRAM calculator (vram-calc); to inspect a GGUF file's contents use the GGUF metadata viewer.
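The three-part estimate described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual source: the function names, the 1.0 GB overhead and the 4.85 bits-per-weight figure for Q4_K_M are hypothetical placeholders, and sizes are in decimal GB. The layer counts, KV head counts and head dimensions are the published architecture values for the two models shown.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes (assumption; the tool may use GiB instead)


@dataclass
class Arch:
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights = parameter count x effective bits-per-weight / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def kv_cache_gb(arch: Arch, context: int, bits_per_element: float = 16.0) -> float:
    """K and V each hold n_kv_heads * head_dim values per layer per token."""
    elements = 2 * arch.n_layers * arch.n_kv_heads * arch.head_dim * context
    return elements * bits_per_element / 8 / GB


def estimate_vram_gb(params_billion, bits_per_weight, arch, context,
                     kv_bits=16.0, overhead_gb=1.0):
    """Return the breakdown; overhead_gb=1.0 is a placeholder, not the tool's value."""
    parts = {
        "weights": weights_gb(params_billion, bits_per_weight),
        "kv_cache": kv_cache_gb(arch, context, kv_bits),
        "overhead": overhead_gb,
    }
    parts["total"] = sum(parts.values())
    return parts


LLAMA_3_1_8B = Arch(n_layers=32, n_kv_heads=8, head_dim=128)   # GQA
LLAMA_2_13B = Arch(n_layers=40, n_kv_heads=40, head_dim=128)   # no GQA

# 8.03B parameters at an assumed 4.85 bits per weight, 8K context, f16 KV cache.
print(estimate_vram_gb(8.03, 4.85, LLAMA_3_1_8B, 8192))
# Same context, f16 cache: the GQA model needs far less KV memory.
print(kv_cache_gb(LLAMA_3_1_8B, 8192), kv_cache_gb(LLAMA_2_13B, 8192))
```

At an 8K context with an f16 cache, the KV cache comes to about 1.07 GB for Llama 3.1 8B and about 6.71 GB for Llama 2 13B, which is the GQA difference the description refers to.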

How to use

  1. Pick a model (Llama 3.1 8B / Qwen2.5 / 70B…) or choose Custom and enter the parameter count in billions (B).
  2. Pick the GGUF quantization (e.g. Q4_K_M), the context length, and the KV cache type.
  3. Read the VRAM estimate and breakdown (weights / KV cache / overhead), and which GPUs (6/8/12/16/24 GB) it fits.
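The fit check in step 3 can be sketched as follows. The GPU sizes and the 90% threshold come from the description above; the function names and the exact handling of values that land precisely on a boundary are assumptions made for illustration.

```python
GPU_SIZES_GB = (6, 8, 12, 16, 24)
TIGHT_FRACTION = 0.90  # anything past 90% of the card is flagged Tight


def classify_fit(total_gb: float, gpu_gb: float) -> str:
    """Label an estimated total against one GPU size."""
    if total_gb > gpu_gb:
        return "OOM"
    if total_gb > TIGHT_FRACTION * gpu_gb:
        return "Tight"
    return "OK"


def fit_table(total_gb: float) -> dict:
    """Label the estimate against every preset GPU size."""
    return {gpu: classify_fit(total_gb, gpu) for gpu in GPU_SIZES_GB}


for gpu, label in fit_table(7.5).items():
    print(f"{gpu:2d} GB: {label}")
```

For a 7.5 GB estimate this marks the 6 GB card as OOM, the 8 GB card as Tight (7.5 is above 7.2, which is 90% of 8), and the larger cards as OK.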

FAQ

Are the values I enter sent anywhere?

No. The VRAM estimate is computed entirely in your browser. The model, parameter count, context and other inputs are never sent to any API or server; everything stays on your device.

How much VRAM does a 7B model need?

It depends on the quantization and context length. As a ballpark, running a 7–8B model at Q4_K_M with an 8K context needs roughly 6–7 GB (about 4.5 GB of weights plus KV cache and overhead). That fits comfortably on an 8 GB GPU; on a 6 GB GPU it is tight or needs offloading. Change the quant and context in the tool to see your exact case.

Which quantization (Q4_K_M / Q5_K_M / Q8_0) should I pick?

Q4_K_M is the usual sweet spot of VRAM vs quality. If you have spare VRAM, Q5_K_M / Q6_K raise quality. Q8_0 is near-fp16 quality but large, and F16 is unquantized (largest). If VRAM is tight you can drop to Q3_K_M / Q2_K, but quality loss is more noticeable on smaller models.
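To see how the quant choice moves the weight size, the sketch below applies the weights formula to each GGUF type for an 8B-parameter model. The bits-per-weight values are assumed, rounded figures for illustration only: K-quants mix tensor types, so the effective value varies by model, and the tool's own presets may differ.

```python
# Approximate effective bits-per-weight per GGUF type (assumed, rounded values;
# not taken from the tool).
APPROX_BPW = {
    "Q2_K": 3.2,
    "Q3_K_M": 4.0,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Parameter count (in billions) x bits-per-weight / 8 gives decimal GB."""
    return params_billion * APPROX_BPW[quant] / 8


def size_table(params_billion: float) -> dict:
    """Weight size in GB for every quant type, rounded to two decimals."""
    return {q: round(weights_gb(params_billion, q), 2) for q in APPROX_BPW}


for quant, gb in size_table(8.0).items():
    print(f"{quant:7s} ~{gb:5.2f} GB")
```

Under these assumed values, going from Q4_K_M to Q5_K_M on an 8B model adds a little under 1 GB of weights, while F16 is more than three times the size of Q4_K_M.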

How much does a longer context add to VRAM?

The KV cache grows in proportion to the context length. Long documents, long chats and RAG all use more VRAM, so shortening the context is an effective way to fit. Quantizing the KV cache to q8_0 or q4_0 cuts that part to roughly a half or a quarter of its f16 size, respectively.
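A minimal sketch of that scaling, assuming the standard GGML block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes, so 8.5 and 4.5 bits per element). The function name is hypothetical, and the example uses the Llama 3.1 8B architecture (32 layers, 8 KV heads, head dim 128).

```python
# Bits per cached element for each KV cache type.
KV_BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, cache_type="f16"):
    """Keys and values: 2 tensors per layer, n_kv_heads * head_dim values per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context
    return elements * KV_BITS[cache_type] / 8


base = kv_cache_bytes(32, 8, 128, 8192)
for context in (8192, 16384, 32768):
    for cache_type in KV_BITS:
        size = kv_cache_bytes(32, 8, 128, context, cache_type)
        print(f"ctx={context:6d} {cache_type:5s} {size / 1e9:5.2f} GB "
              f"({size / base:.2f}x the 8K f16 cache)")
```

Doubling the context doubles the cache, and q8_0 and q4_0 come out at about 53% and 28% of the f16 size, slightly above a clean half and quarter because each block also stores a scale factor.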

Is the GGUF file size the same as 'model weights'?

Roughly yes. The GGUF file size is about the VRAM used by the model weights; the total VRAM you need is that plus the KV cache and the runtime overhead. To check a downloaded file's quantization or contents, use the sister tool, the GGUF metadata viewer.

What can I do when I run out of VRAM (OOM)?

In order of impact: use a smaller quant (e.g. Q5 → Q4 → Q3), shorten the context length, quantize the KV cache (q8_0 / q4_0), and offload some layers to CPU (lower llama.cpp's -ngl so fewer layers sit on the GPU). A small shortfall still runs with offloading, just slower.
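For the offloading step, a rough way to pick a starting -ngl value is to divide the usable VRAM by the per-layer cost. The sketch below is a simplification under loud assumptions: weights and KV cache are treated as spread evenly across layers (ignoring the embedding and output tensors), each offloaded layer is assumed to bring its share of the KV cache onto the GPU, and the 1.0 GB overhead and 90% headroom are placeholders. The function name is hypothetical; treat the result as a first guess to adjust by trial.

```python
import math


def suggest_ngl(weights_gb, kv_cache_gb, n_layers, gpu_gb,
                overhead_gb=1.0, headroom=0.90):
    """Estimate how many layers fit on the GPU (a starting value for -ngl)."""
    per_layer_gb = (weights_gb + kv_cache_gb) / n_layers
    budget_gb = gpu_gb * headroom - overhead_gb
    layers = math.floor(budget_gb / per_layer_gb)
    # Clamp to the valid range: 0 (all on CPU) to n_layers (all on GPU).
    return max(0, min(n_layers, layers))


# Hypothetical 48-layer model: 9.0 GB of weights, 1.5 GB of KV cache, 8 GB GPU.
ngl = suggest_ngl(9.0, 1.5, 48, 8.0)
print(f"-ngl {ngl}")
```

With these example numbers the sketch suggests keeping 28 of the 48 layers on the GPU; the remaining layers run on the CPU, which is why a small shortfall still works but more slowly.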