Day 5 of 5
⏱ ~60 minutes
Computer Architecture in 5 Days — Day 5

GPU Architecture

A GPU is a throughput machine: thousands of simple cores doing the same operation simultaneously. This is exactly what AI needs. Today you'll understand why.

CPU vs GPU: Latency vs Throughput

CPUs minimize latency for a single thread: out-of-order execution, large caches, branch prediction, 3–5 GHz clocks. An Intel Core i9 has 24 cores. An NVIDIA H100 has 16,896 CUDA cores — but each is simpler and slower. GPUs sacrifice single-thread performance for massive parallelism. If your work is embarrassingly parallel (the same operation on millions of data points), a GPU can win by 100x.
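The tradeoff shows up even on a CPU. The sketch below contrasts a "one element at a time" Python loop with a vectorized NumPy operation that applies the same instruction across a whole array — a miniature version of the latency-oriented vs throughput-oriented split (the 2x+1 operation and array size are arbitrary choices for illustration):

```python
# Latency vs throughput in miniature: one-at-a-time loop vs
# the same operation applied across all elements at once.
import time
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

t0 = time.perf_counter()
out_loop = [v * 2.0 + 1.0 for v in x]   # one element per iteration
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
out_vec = x * 2.0 + 1.0                 # same op, every element at once
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop*1000:.1f} ms  vectorized: {t_vec*1000:.2f} ms  "
      f"speedup: {t_loop/t_vec:.0f}x")
```

The vectorized version is the same math; it just exposes the data parallelism to the hardware instead of hiding it inside a sequential loop — exactly the property GPUs are built to exploit at far larger scale.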

Streaming Multiprocessors and Warps

A GPU SM (Streaming Multiprocessor) is roughly analogous to a CPU core. Each SM has 64–128 CUDA cores and a warp scheduler. A warp is 32 threads that execute the same instruction simultaneously on different data — hardware SIMD. If threads in a warp take different branches (warp divergence), the paths execute serially — up to 32x slower in the worst case. Write GPU code so all threads in a warp take the same path.
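The standard fix for divergence is predication: compute both sides of the branch for every thread, then select with a mask, so all 32 lanes follow one path. Here is a CPU-side sketch of the idea using NumPy's `where` as the select (the per-element operations are invented for illustration; in a real CUDA kernel the compiler or programmer does the equivalent with predicated instructions):

```python
# Branchless (predicated) formulation vs a per-element branch.
# On a GPU, the if/else version can serialize a warp; the where()
# version keeps every lane executing the same instructions.
import numpy as np

x = np.random.randn(8).astype(np.float32)

# Divergent formulation: a data-dependent branch per element.
branchy = np.array([v * 2.0 if v > 0 else v * 0.5 for v in x],
                   dtype=np.float32)

# Predicated formulation: both results computed, mask selects.
predicated = np.where(x > 0, x * 2.0, x * 0.5).astype(np.float32)
```

Note the cost model: predication always pays for both branch bodies, so it only wins when the bodies are cheap — which they usually are in data-parallel kernels.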

Memory Bandwidth and Shared Memory

An H100 GPU has 3.35 TB/s of memory bandwidth vs a CPU's 200 GB/s. This is the GPU's real advantage for matrix multiplication: loading a 4096×4096 float32 matrix (64 MB) takes ~20 microseconds on a GPU vs ~300 microseconds on a CPU. Shared memory is a programmer-managed L1 cache (100–200 KB per SM). The key optimization: load from global memory into shared memory once, then do all computation from shared memory.
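As a rough sanity check on the bandwidth numbers above, you can estimate your own CPU's effective memory bandwidth by timing a large array copy — a sketch, with the 64 MB size chosen to match the matrix example:

```python
# Estimate effective CPU memory bandwidth by timing a 64 MB copy.
# A copy reads N bytes and writes N bytes, so traffic = 2 * nbytes.
import time
import numpy as np

src = np.ones(64 * 1024 * 1024 // 4, dtype=np.float32)  # 64 MB
dst = np.empty_like(src)

np.copyto(dst, src)  # warmup (page in both buffers)
t0 = time.perf_counter()
np.copyto(dst, src)
elapsed = time.perf_counter() - t0

gbps = 2 * src.nbytes / elapsed / 1e9
print(f"copy 64 MB: {elapsed*1000:.2f} ms  ~{gbps:.0f} GB/s effective")
```

Expect a number well below your GPU's spec sheet — that gap is why bandwidth-bound kernels (which includes most of deep learning outside large matmuls) belong on the GPU.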

```python
# GPU vs CPU: matrix multiply throughput
import time

import numpy as np

try:
    import torch
    gpu = 'cuda' if torch.cuda.is_available() else (
        'mps' if torch.backends.mps.is_available() else None)
except ImportError:
    gpu = None

def bench_np(n):
    A = np.random.randn(n, n).astype(np.float32)
    B = np.random.randn(n, n).astype(np.float32)
    _ = A @ B  # warmup
    t = time.perf_counter(); _ = A @ B; elapsed = time.perf_counter() - t
    gflops = 2 * n**3 / elapsed / 1e9
    print(f"CPU  {n}x{n}: {elapsed*1000:6.1f}ms  {gflops:6.0f} GFLOPS")

def bench_gpu(n, device):
    sync = torch.cuda.synchronize if device == 'cuda' else torch.mps.synchronize
    A = torch.randn(n, n, device=device, dtype=torch.float32)
    B = torch.randn(n, n, device=device, dtype=torch.float32)
    for _ in range(3): _ = A @ B  # warmup
    sync()  # GPU ops run asynchronously: wait for warmup to finish
    t = time.perf_counter(); _ = A @ B; sync(); elapsed = time.perf_counter() - t
    gflops = 2 * n**3 / elapsed / 1e9
    print(f"GPU  {n}x{n}: {elapsed*1000:6.1f}ms  {gflops:6.0f} GFLOPS")

for n in [512, 1024, 4096]:
    bench_np(n)
    if gpu: bench_gpu(n, gpu)
    print()
```
💡
GPU wins at large matrix sizes, not small ones. At 512×512, GPU launch overhead dominates. At 4096×4096, GPU wins by 50–100x. This is why batch size matters in deep learning: larger batches amortize GPU launch overhead.
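The amortization effect can be demonstrated without a GPU: even NumPy pays a fixed per-call dispatch overhead, so a thousand tiny matmuls cost far more than one batched call doing identical math. A sketch (the 1000×16×16 shapes are arbitrary illustration values):

```python
# Per-call overhead is amortized by batching: 1000 separate tiny
# matmuls vs one batched matmul computing the same products.
import time
import numpy as np

A = np.random.randn(1000, 16, 16).astype(np.float32)
B = np.random.randn(1000, 16, 16).astype(np.float32)

t0 = time.perf_counter()
loop_out = np.stack([A[i] @ B[i] for i in range(1000)])  # 1000 calls
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
batch_out = A @ B  # one batched call, same math
t_batch = time.perf_counter() - t0

print(f"1000 calls: {t_loop*1000:.1f} ms  one batched call: {t_batch*1000:.2f} ms")
```

On a GPU the per-call (kernel launch) overhead is microseconds rather than Python-interpreter nanoseconds, so the same principle bites much harder — hence large batch sizes in training.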
📝 Day 5 Exercise
Quantify GPU vs CPU Throughput
  1. Run the benchmark (install PyTorch if needed: pip install torch). Use Google Colab for free GPU access.
  2. Record GFLOPS at 512, 1024, 4096 for both CPU and GPU. At what size does GPU start winning?
  3. Look up the theoretical peak GFLOPS for your GPU. What fraction of peak are you achieving? (Typical: 50–80%)
  4. Time a transformer forward pass: import transformers; model = transformers.AutoModel.from_pretrained('bert-base-uncased') on CPU vs GPU with batch sizes 1, 8, 32
  5. Research: what is a Tensor Core? How does it differ from a regular CUDA core, and why does it matter for FP16 training?

Day 5 Summary

  • GPUs trade single-thread latency for massive throughput: 16,000+ cores vs CPU's 8–24
  • Warps execute 32 threads simultaneously with SIMD — branch divergence kills performance
  • Memory bandwidth (3 TB/s vs 200 GB/s) is the GPU's key advantage for data-parallel workloads
  • Shared memory is programmer-managed L1 — every high-performance CUDA kernel loads data there first
Challenge

On Google Colab (free GPU), implement naive CUDA matrix multiply (%%cu magic in Colab). Compare against torch.mm() which uses cuBLAS. The gap should be 20–100x. Read about what cuBLAS does differently: register tiling, shared memory tiling, warp-level matrix multiply instructions (WMMA). Write a paragraph explaining the most impactful optimization.
