Real VRAM math. Cloud cost comparisons. Find the exact hardware to run Llama 4, DeepSeek, Mistral, Qwen, and more — at any quantization.
| Hardware | VRAM | ~Tok/s (7B) | Cloud $/hr | Cost/Mo (24/7) | Buy Price | Best For |
|---|---|---|---|---|---|---|
Prices are estimated 2026 market rates (RunPod, Lambda, Vast.ai, Paperspace); cloud costs vary. Purchase prices are approximate retail or used-market figures. Tokens/sec are estimates for a dense 7B model at INT8; MoE models activate only a subset of experts per token, so they may run faster than models with the same total parameter count. Always benchmark before committing.
## Why these calculations matter — and how to think about them yourself
**Model weights.** FP16 stores 2 bytes per parameter, so a 7B model needs ~14 GB. INT8 halves that to ~7 GB, and Q4_K_M averages ~0.55 bytes/param (≈3.8 GB). Weights are the dominant VRAM cost for most models.
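The per-format arithmetic above can be sketched in a few lines. The bytes-per-parameter figures are the approximations used here; real quantized files carry small per-block overheads, and embeddings and buffers are ignored:

```python
# Approximate bytes per parameter for common formats (per-block
# quantization overhead, embeddings, and runtime buffers ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4_k_m": 0.55}

def weight_gb(params_billion: float, fmt: str) -> float:
    """Estimated VRAM (GB) for the weights alone."""
    # params_billion * 1e9 params * bytes/param / 1e9 bytes-per-GB
    return params_billion * BYTES_PER_PARAM[fmt]

print(weight_gb(7, "fp16"))    # 14.0 GB
print(weight_gb(7, "int8"))    # 7.0 GB
print(weight_gb(7, "q4_k_m"))  # ≈3.85 GB
```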
**KV cache.** The KV cache stores attention keys and values for every token in your context window. A 70B model with 80 layers serving 128K context at FP16 needs ~20+ GB for KV alone, even with grouped-query attention shrinking it. This is why long context is expensive.
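The standard KV-cache formula makes that cost concrete. A minimal sketch, assuming a hypothetical Llama-70B-style layout (80 layers, grouped-query attention with 8 KV heads, head dimension 128; these specifics are illustrative, not taken from the text above):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: K and V each hold head_dim values per KV head,
    per layer, per token (factor of 2 covers both K and V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

# 80 layers, 8 KV heads (GQA), head_dim 128, 128K context, FP16 values
print(kv_cache_gb(80, 8, 128, 131072))  # ≈42.9 GB for KV alone
```

With full multi-head attention (64 KV heads instead of 8), the same context would need 8× more, which is why grouped-query attention is near-universal in long-context models.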
**Activations and training overhead.** Intermediate computations during a forward pass need scratch space, usually 10–15% on top of the weights. For fine-tuning, add gradients (roughly the same size as the weights) plus optimizer states (Adam keeps two, adding roughly 2× the weight memory again).
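The training arithmetic can be sketched as below, holding everything at 2 bytes/param for simplicity. In practice Adam's states are often kept in FP32, which roughly doubles that term, and activation memory depends on batch size and sequence length:

```python
def full_finetune_gb(params_billion: float, bytes_per_param: float = 2.0) -> dict:
    """Rough full fine-tune footprint: weights + gradients + Adam's two
    extra states, all at the same precision (activations not included)."""
    w = params_billion * bytes_per_param
    return {"weights": w, "gradients": w, "adam_states": 2 * w, "total": 4 * w}

# A 7B model: 14 + 14 + 28 = 56 GB before activations
print(full_finetune_gb(7))
```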
**Mixture-of-Experts.** MoE models like DeepSeek V3 (671B total, ~37B active) or Llama 4 Scout (109B total, ~17B active) route each token through only a subset of experts. You still need all weights resident in VRAM, but per-token inference compute is much lower.
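The VRAM-versus-compute split can be stated directly: total parameters drive memory, active parameters drive per-token FLOPs. A small sketch using the DeepSeek V3 figures above:

```python
def moe_profile(total_b: float, active_b: float) -> dict:
    """MoE sizing: all experts must be resident in VRAM, but only the
    routed (active) parameters do work for any given token."""
    return {
        "vram_driver_b": total_b,       # weight memory scales with this
        "compute_driver_b": active_b,   # per-token FLOPs scale with this
        "active_fraction": active_b / total_b,
    }

print(moe_profile(671, 37))  # DeepSeek V3: ~5.5% of params active per token
```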
**Unified memory.** Apple Silicon shares RAM between CPU and GPU; an M3 Max with 64 GB can dedicate ~48 GB to models. Bandwidth is ~400 GB/s, slower than the H100's 3.35 TB/s but dramatically cheaper. A good fit for 70B models at INT4.
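Bandwidth matters because single-stream decode is usually memory-bound: every generated token must stream the full weights through the memory bus. A rough upper-bound estimate (it ignores KV-cache reads and other overheads, so real throughput comes in lower):

```python
def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec for batch-1 decode: memory bandwidth
    divided by the bytes read per token (≈ the full weights)."""
    return bandwidth_gb_s / weights_gb

# 70B at ~4-bit (≈0.55 bytes/param → ~38.5 GB) on ~400 GB/s unified memory
print(round(decode_ceiling_tok_s(38.5, 400), 1))   # ≈10 tok/s ceiling
# Same model on an H100's ~3350 GB/s HBM
print(round(decode_ceiling_tok_s(38.5, 3350), 1))  # ≈87 tok/s ceiling
```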
**CPU offload.** Tools like llama.cpp can offload layers to system RAM when VRAM is insufficient. A machine with 128 GB of RAM can run large models, but at 1–5 tokens/sec instead of 30–100. Fine for experimentation, not production.