Free Calculator — Updated 2026

GPU Calculator for Open-Source LLMs

Real VRAM math. Cloud cost comparisons. Find the exact hardware to run Llama 4, DeepSeek, Mistral, Qwen, and more — at any quantization.

Configure Your LLM Setup

Select a model and requirements to see VRAM needs and hardware options

Model Open-source, maintained DB
Quantization Lower = less VRAM, slight quality loss
Context Length Needed 8K tokens
2K · 4K · 8K · 16K · 32K · 64K · 128K · 256K
Mode
Fine-tuning mode
Needs ~3–4× more VRAM (gradients + optimizer states)
Production / 24/7 uptime
Affects cloud cost calculations (730 hrs/month)

Full Hardware Comparison

Showing all options — compatible hardware highlighted

Hardware | VRAM | ~Tok/s (7B) | Cloud $/hr | Cost/Mo (24/7) | Buy Price | Best For

Prices are estimated 2026 market rates (RunPod, Lambda, Vast.ai, Paperspace). Cloud costs vary. Purchase prices are approximate retail/used market. Tokens/sec are estimates for a dense 7B model at INT8; MoE models with selective expert activation may run faster at equivalent active params. Always benchmark before committing.

VRAM Requirements

Select a model to calculate

Total VRAM Needed
Configure above to calculate
Model weights
KV cache (context)
Activations overhead
Total

Understanding LLM VRAM Math

Why these calculations matter — and how to think about them yourself

📐 Weight Memory

VRAM_weights = params × bytes_per_param

FP16 uses 2 bytes per parameter. A 7B model needs ~14 GB. INT8 halves that to ~7 GB. Q4_K_M uses ~0.55 bytes/param (≈3.8 GB). This is the dominant VRAM cost for most models.
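The weight formula above is a one-liner in code. This sketch uses the approximate bytes-per-parameter rates quoted in the text (the Q4_K_M figure is an effective average that includes block scales, not an exact constant):

```python
# Approximate effective bytes per parameter for common formats.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "INT8": 1.0,
    "Q4_K_M": 0.55,  # ~average incl. quantization scales/zero-points
}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """VRAM_weights = params x bytes_per_param, reported in GB (1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[quant]

for quant in BYTES_PER_PARAM:
    print(f"7B @ {quant}: ~{weight_vram_gb(7, quant):.1f} GB")
```

For a 7B model this reproduces the numbers above: 14 GB at FP16, 7 GB at INT8, and just under 4 GB at Q4_K_M.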

🗂️ KV Cache

KV = 2 × layers × kv_heads × head_dim × ctx × bytes

The KV cache stores attention keys and values for every token in your context window. With grouped-query attention only the KV heads count (Llama-70B uses 8 KV heads, not the full 64 query heads). Even so, a 70B model with 80 layers at 128K context in FP16 needs roughly 40 GB for KV alone. This is why long context is expensive.
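The KV formula is easy to sanity-check directly. The geometry below is a Llama-70B-style assumption (80 layers, 8 KV heads under grouped-query attention, head_dim 128), not a measured figure:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """KV = 2 (one K + one V tensor) x layers x kv_heads x head_dim x ctx x bytes."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# Assumed Llama-70B-style geometry, FP16 cache, 128K context:
print(f"{kv_cache_gb(80, 8, 128, 128 * 1024):.1f} GB")  # ~43 GB
```

Halving the context halves the cache, which is why dropping from 128K to 32K context can free tens of gigabytes on large models.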

⚡ Activation Overhead

~10–15% of weight VRAM

Intermediate computations during a forward pass need scratch space, usually 10–15% on top of weights. For fine-tuning, add gradients (roughly the same size as the weights) plus optimizer states (Adam keeps two moment tensors, another ~2× the weights).
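Putting the three terms together gives a rough total-VRAM estimator. The 12.5% activation fraction is a midpoint assumption, and the fine-tuning term follows the gradients-plus-Adam accounting above (full-precision fine-tuning, not LoRA):

```python
def total_vram_gb(params_b: float, bytes_per_param: float,
                  kv_gb: float = 0.0, activation_frac: float = 0.125,
                  fine_tuning: bool = False) -> float:
    """Estimate total VRAM: weights + activation scratch + KV cache,
    plus gradients (~1x weights) and Adam moments (~2x weights) if training."""
    weights = params_b * bytes_per_param  # GB
    total = weights * (1 + activation_frac) + kv_gb
    if fine_tuning:
        total += weights * 3  # gradients + two Adam moment tensors
    return total

print(f"7B FP16 inference:   ~{total_vram_gb(7, 2.0):.1f} GB")
print(f"7B FP16 fine-tuning: ~{total_vram_gb(7, 2.0, fine_tuning=True):.1f} GB")
```

The fine-tuning estimate lands at roughly 4× the weight memory, which matches the "~3–4× more VRAM" note in the mode selector above.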

🔢 MoE Models

Active VRAM ≈ activation_params × bytes

Mixture-of-Experts models like DeepSeek V3 (671B total, ~37B active) or Llama 4 Scout (109B total, ~17B active) only compute a subset of experts per token. You still need all the weights in VRAM, but per-token compute is much lower.
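The memory/compute split for MoE can be made concrete: total parameters set the VRAM floor, active parameters set the compute per token. This sketch uses the published totals named above and assumes INT8 (1 byte/param) for the memory figure:

```python
def moe_summary(total_b: float, active_b: float,
                bytes_per_param: float = 1.0) -> tuple[float, float]:
    """Return (GB needed to hold all experts, fraction of params used per token)."""
    vram_gb = total_b * bytes_per_param
    compute_frac = active_b / total_b
    return vram_gb, compute_frac

for name, total_b, active_b in [("DeepSeek V3", 671, 37),
                                ("Llama 4 Scout", 109, 17)]:
    vram, frac = moe_summary(total_b, active_b)
    print(f"{name}: hold ~{vram:.0f} GB (INT8), touch ~{frac:.0%} of params/token")
```

So DeepSeek V3 needs multi-GPU-scale memory even though each token only exercises ~5–6% of the parameters, which is why its tokens/sec can resemble a much smaller dense model.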

🍎 Unified Memory

Apple M-series: ~75% usable for models

Apple Silicon shares RAM between CPU and GPU. An M3 Max 64GB can use ~48 GB for models. Bandwidth is ~400 GB/s — slower than H100's 3.35 TB/s but dramatically cheaper. Great for 70B models at INT4.
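Bandwidth matters because single-stream token generation is typically memory-bandwidth bound: each new token re-reads the weights, so bandwidth divided by weight bytes gives a rough speed ceiling. A minimal sketch, using the ~400 GB/s M3 Max figure above and an assumed ~35 GB for a 70B model at INT4:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Roofline-style ceiling for batch-1 decode: each token must stream
    the full weights through memory, so tok/s <= bandwidth / weight bytes."""
    return bandwidth_gb_s / weight_gb

# M3 Max (~400 GB/s) running a 70B model at INT4 (~35 GB, assumed):
print(f"~{max_tokens_per_sec(400, 35):.0f} tok/s ceiling")
```

Real throughput lands below this ceiling (KV-cache reads and compute add overhead), but it explains why an H100's ~8× bandwidth advantage translates directly into decode speed.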

💡 CPU Offloading

Speed drops ~10–50× vs pure GPU

Tools like llama.cpp can offload layers to RAM when VRAM is insufficient. A 128GB RAM machine can run large models but at 1–5 tokens/sec instead of 30–100. Fine for experimentation, not production.

Learn to Deploy LLMs in the Real World

Our 2-day bootcamp covers local model deployment, API integration, and building production AI systems from scratch.

View Bootcamp Schedule →