Real VRAM math. Cloud cost comparisons. Find the exact hardware to run Llama 4, DeepSeek, Mistral, Qwen, and more — at any quantization.
| Hardware | VRAM | ~Tok/s (7B) | Cloud $/hr | Cost/Mo (24/7) | Buy Price | Best For |
|---|---|---|---|---|---|---|
Prices are estimated 2026 market rates (RunPod, Lambda, Vast.ai, Paperspace); cloud costs vary. Purchase prices are approximate retail or used-market figures. Tokens/sec are estimates for a dense 7B model at INT8; MoE models activate only a subset of experts per token, so they may run faster than models with the same total parameter count. Always benchmark before committing.
## Why these calculations matter — and how to think about them yourself
**Model weights.** FP16 stores 2 bytes per parameter, so a 7B model needs ~14 GB. INT8 halves that to ~7 GB, and Q4_K_M averages ~0.55 bytes/param (≈3.8 GB). Weights are the dominant VRAM cost for most models.
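The per-format arithmetic above can be sketched in a few lines. The bytes-per-parameter figures are the approximations used here; real quantized files carry small per-block overheads, and embeddings and buffers are ignored:

```python
# Approximate bytes per parameter for common formats (per-block
# quantization overhead, embeddings, and runtime buffers ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4_k_m": 0.55}

def weight_gb(params_billion: float, fmt: str) -> float:
    """Estimated VRAM (GB) for the weights alone."""
    # params_billion * 1e9 params * bytes/param / 1e9 bytes-per-GB
    return params_billion * BYTES_PER_PARAM[fmt]

print(weight_gb(7, "fp16"))    # 14.0 GB
print(weight_gb(7, "int8"))    # 7.0 GB
print(weight_gb(7, "q4_k_m"))  # ≈3.85 GB
```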
**KV cache.** The KV cache stores attention keys and values for every token in your context window. A 70B model with 80 layers serving 128K context at FP16 needs ~20+ GB for KV alone, even with grouped-query attention shrinking it. This is why long context is expensive.
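The standard KV-cache formula makes that cost concrete. A minimal sketch, assuming a hypothetical Llama-70B-style layout (80 layers, grouped-query attention with 8 KV heads, head dimension 128; these specifics are illustrative, not taken from the text above):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: K and V each hold head_dim values per KV head,
    per layer, per token (factor of 2 covers both K and V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

# 80 layers, 8 KV heads (GQA), head_dim 128, 128K context, FP16 values
print(kv_cache_gb(80, 8, 128, 131072))  # ≈42.9 GB for KV alone
```

With full multi-head attention (64 KV heads instead of 8), the same context would need 8× more, which is why grouped-query attention is near-universal in long-context models.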
**Activations and training overhead.** Intermediate computations during a forward pass need scratch space, usually 10–15% on top of the weights. For fine-tuning, add gradients (roughly the same size as the weights) plus optimizer states (Adam keeps two, adding roughly 2× the weight memory again).
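The training arithmetic can be sketched as below, holding everything at 2 bytes/param for simplicity. In practice Adam's states are often kept in FP32, which roughly doubles that term, and activation memory depends on batch size and sequence length:

```python
def full_finetune_gb(params_billion: float, bytes_per_param: float = 2.0) -> dict:
    """Rough full fine-tune footprint: weights + gradients + Adam's two
    extra states, all at the same precision (activations not included)."""
    w = params_billion * bytes_per_param
    return {"weights": w, "gradients": w, "adam_states": 2 * w, "total": 4 * w}

# A 7B model: 14 + 14 + 28 = 56 GB before activations
print(full_finetune_gb(7))
```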
**Mixture-of-Experts.** MoE models like DeepSeek V3 (671B total, ~37B active) or Llama 4 Scout (109B total, ~17B active) route each token through only a subset of experts. You still need all weights resident in VRAM, but per-token inference compute is much lower.
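The VRAM-versus-compute split can be stated directly: total parameters drive memory, active parameters drive per-token FLOPs. A small sketch using the DeepSeek V3 figures above:

```python
def moe_profile(total_b: float, active_b: float) -> dict:
    """MoE sizing: all experts must be resident in VRAM, but only the
    routed (active) parameters do work for any given token."""
    return {
        "vram_driver_b": total_b,       # weight memory scales with this
        "compute_driver_b": active_b,   # per-token FLOPs scale with this
        "active_fraction": active_b / total_b,
    }

print(moe_profile(671, 37))  # DeepSeek V3: ~5.5% of params active per token
```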
**Unified memory.** Apple Silicon shares RAM between CPU and GPU; an M3 Max with 64 GB can dedicate ~48 GB to models. Bandwidth is ~400 GB/s, slower than the H100's 3.35 TB/s but dramatically cheaper. A good fit for 70B models at INT4.
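Bandwidth matters because single-stream decode is usually memory-bound: every generated token must stream the full weights through the memory bus. A rough upper-bound estimate (it ignores KV-cache reads and other overheads, so real throughput comes in lower):

```python
def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec for batch-1 decode: memory bandwidth
    divided by the bytes read per token (≈ the full weights)."""
    return bandwidth_gb_s / weights_gb

# 70B at ~4-bit (≈0.55 bytes/param → ~38.5 GB) on ~400 GB/s unified memory
print(round(decode_ceiling_tok_s(38.5, 400), 1))   # ≈10 tok/s ceiling
# Same model on an H100's ~3350 GB/s HBM
print(round(decode_ceiling_tok_s(38.5, 3350), 1))  # ≈87 tok/s ceiling
```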
**CPU offload.** Tools like llama.cpp can offload layers to system RAM when VRAM is insufficient. A machine with 128 GB of RAM can run large models, but at 1–5 tokens/sec instead of 30–100. Fine for experimentation, not production.