A GPU is a throughput machine: thousands of simple cores doing the same operation simultaneously. This is exactly what AI needs. Today you'll understand why.
CPUs minimize latency for a single thread: out-of-order execution, large caches, branch prediction, 3–5 GHz clocks. An Intel Core i9 has 24 cores. An NVIDIA H100 has 16,896 CUDA cores — but each is simpler and slower. GPUs sacrifice single-thread performance for massive parallelism. If your work is embarrassingly parallel (the same operation on millions of data points), a GPU can win by 100x.
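The same tradeoff shows up even inside one process: a Python loop handles one element at a time (latency-oriented, with per-step overhead), while a vectorized NumPy call applies one operation across the whole array — the access pattern GPUs are built around. A minimal sketch (function names are illustrative, not from any library):

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# Latency-oriented: one element at a time, interpreter overhead per step
def scale_loop(a, factor=2.0):
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] * factor
    return out

# Throughput-oriented: one operation applied to all elements at once
def scale_vec(a, factor=2.0):
    return a * factor

# Identical results; the vectorized version is orders of magnitude faster
assert np.allclose(scale_loop(x), scale_vec(x))
```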
A GPU SM (Streaming Multiprocessor) is roughly analogous to a CPU core. Each SM has 64–128 CUDA cores and warp schedulers. A warp is 32 threads that execute the same instruction simultaneously on different data — hardware SIMD. If threads in a warp take different branches (warp divergence), the hardware executes each path serially with the inactive threads masked off — up to 32x slower in the worst case. Write GPU code so all threads in a warp take the same path.
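Warp divergence has a close analogue in array code: `np.where` evaluates both branches for every element and merges them with a mask, much as a diverged warp executes both paths with inactive lanes masked off. A hedged sketch of the branchy vs. predicated style (leaky ReLU chosen only as an example):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0], dtype=np.float32)

# Branchy version: per-element if/else — the pattern that diverges on a GPU
def leaky_relu_branchy(a, slope=0.1):
    out = np.empty_like(a)
    for i, v in enumerate(a):
        out[i] = v if v > 0 else slope * v
    return out

# Predicated version: both paths computed for all elements, then masked.
# This mirrors what a diverged warp actually does in hardware.
def leaky_relu_predicated(a, slope=0.1):
    return np.where(a > 0, a, slope * a)

assert np.allclose(leaky_relu_branchy(x), leaky_relu_predicated(x))
```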
An H100 GPU has 3.35 TB/s of memory bandwidth vs a CPU's 200 GB/s. This is the GPU's real advantage for matrix multiplication: loading a 4096×4096 float32 matrix (64 MB) takes ~20 microseconds on a GPU vs ~300 microseconds on a CPU. Shared memory is a programmer-managed L1 cache (100–200 KB per SM). The key optimization: load from global memory into shared memory once, then do all computation from shared memory.
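The shared-memory strategy can be sketched in NumPy: split the matrices into tiles and accumulate each output tile from small working sets that would fit in an SM's shared memory. A real CUDA kernel does this per thread block; this shows only the blocking structure, with the tile size picked arbitrarily for illustration:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: each output tile is accumulated from
    tile-sized pieces of A and B — the working set a CUDA kernel would
    stage in shared memory and reuse many times."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Load two small tiles once, reuse them for a whole output tile
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```

Each pair of 64×64 float32 tiles is 32 KB — small enough to live in shared memory while being reused across an entire output tile, instead of re-reading global memory for every element.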
# GPU vs CPU: matrix multiply throughput
import numpy as np
import time

try:
    import torch
    gpu = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else None)
except ImportError:
    gpu = None

def bench_np(n):
    A = np.random.randn(n, n).astype(np.float32)
    B = np.random.randn(n, n).astype(np.float32)
    _ = A @ B  # warmup
    t = time.perf_counter()
    _ = A @ B
    elapsed = time.perf_counter() - t
    gflops = 2 * n**3 / elapsed / 1e9
    print(f"CPU {n}x{n}: {elapsed*1000:6.1f}ms {gflops:6.0f} GFLOPS")

def sync(device):
    # GPU kernels launch asynchronously; wait for completion before reading the clock
    if device == 'cuda':
        torch.cuda.synchronize()
    elif device == 'mps':
        torch.mps.synchronize()

def bench_gpu(n, device):
    A = torch.randn(n, n, device=device, dtype=torch.float32)
    B = torch.randn(n, n, device=device, dtype=torch.float32)
    for _ in range(3):
        _ = A @ B  # warmup
    sync(device)
    t = time.perf_counter()
    C = A @ B
    sync(device)
    elapsed = time.perf_counter() - t
    gflops = 2 * n**3 / elapsed / 1e9
    print(f"GPU {n}x{n}: {elapsed*1000:6.1f}ms {gflops:6.0f} GFLOPS")

for n in [512, 1024, 4096]:
    bench_np(n)
    if gpu:
        bench_gpu(n, gpu)
    print()
Setup: pip install torch. Use Google Colab for free GPU access.

Exercise 1: Benchmark inference with `import transformers; model = transformers.AutoModel.from_pretrained('bert-base-uncased')` on CPU vs GPU with batch sizes 1, 8, and 32.

Exercise 2: On Google Colab (free GPU), implement a naive CUDA matrix multiply (the %%cu magic works in Colab). Compare it against torch.mm(), which uses cuBLAS. The gap should be 20–100x. Read about what cuBLAS does differently: register tiling, shared memory tiling, and warp-level matrix multiply instructions (WMMA). Write a paragraph explaining the most impactful optimization.
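Before writing the CUDA kernel, it can help to see the same kind of gap in Python: a naive triple loop against NumPy's BLAS-backed `@`. This is only an illustrative stand-in for the exercise, not the CUDA kernel itself, and the timings will vary by machine:

```python
import numpy as np
import time

def naive_matmul(A, B):
    # One scalar multiply-add at a time: no tiling, no vectorization
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * B[p, j]
            C[i, j] = s
    return C

n = 64
A = np.random.randn(n, n).astype(np.float32)
B = np.random.randn(n, n).astype(np.float32)

t = time.perf_counter(); C_naive = naive_matmul(A, B); naive_s = time.perf_counter() - t
t = time.perf_counter(); C_blas = A @ B; blas_s = time.perf_counter() - t
assert np.allclose(C_naive, C_blas, atol=1e-3)
print(f"naive: {naive_s*1e3:.1f}ms  BLAS: {blas_s*1e3:.3f}ms  speedup: {naive_s/blas_s:.0f}x")
```

Your naive CUDA kernel sits between these two extremes: massively parallel, but without the register tiling, shared-memory tiling, and WMMA instructions that cuBLAS uses.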