Day 2 of 5
⏱ ~60 minutes
Computer Architecture in 5 Days — Day 2

Memory Hierarchy

Your CPU can execute 4 billion operations per second. RAM takes 200 cycles to respond. The memory hierarchy exists entirely to bridge this gap.

Why Memory Hierarchy Exists

At 3 GHz, one clock cycle is 0.33 nanoseconds. Accessing DRAM takes 60–100 ns — 200–300 cycles of waiting. Without caches, your CPU would be idle 99% of the time waiting for data. The solution is a pyramid of storage: fast and small near the CPU, slow and large farther away.

Every Level — Speed and Size

  • Registers: 1 cycle, ~1 KB (compiler managed)
  • L1 cache: 4–5 cycles, 32–64 KB per core
  • L2 cache: 12–15 cycles, 256 KB–1 MB per core
  • L3 cache: 40–50 cycles, 6–64 MB shared across cores
  • DRAM: 100–300 cycles, 8–128 GB
  • NVMe SSD: ~100,000 cycles, 0.5–4 TB

Every level is a cache for the one below it: L1 caches L2, L2 caches L3, L3 caches DRAM, and DRAM caches disk.

Virtual Memory

Every process sees a private 64-bit virtual address space — as if it owns all of memory. The OS and the CPU's Memory Management Unit (MMU) translate virtual addresses to physical DRAM locations using page tables (4 KB pages). A Translation Lookaside Buffer (TLB) caches recent translations. A TLB miss means a multi-cycle page table walk. A page fault means the page isn't in RAM at all — the OS must load it from disk.

```c
// Measure memory latency at each hierarchy level
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void bench(const char *name, char *buf, size_t size) {
    volatile char x = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < 20; r++)
        for (size_t i = 0; i < size; i += 64) // 64-byte cache lines
            x += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long ns = ((t1.tv_sec-t0.tv_sec)*1000000000LL + t1.tv_nsec-t0.tv_nsec)/20;
    printf("%-22s %8ld ns\n", name, ns);
}

int main(void) {
    char *buf = malloc(256*1024*1024);
    memset(buf, 1, 256*1024*1024);
    bench("32KB  (fits L1)",   buf, 32*1024);
    bench("256KB (fits L2)",   buf, 256*1024);
    bench("8MB   (fits L3)",   buf, 8*1024*1024);
    bench("128MB (DRAM)",      buf, 128*1024*1024);
    free(buf);
    return 0;
}
```
💡
The ratios matter more than the absolute numbers: L2 is ~3x slower than L1, L3 is ~10x, DRAM is ~50x. Every cache miss that reaches DRAM costs 50 L1 hits. One cold miss can undo dozens of fast operations.
📝 Day 2 Exercise
Benchmark Memory Latency on Your Machine
  1. Compile: gcc -O2 -o latency latency.c && ./latency
  2. Record the 4 timing numbers. Compute the ratio between each level.
  3. Change stride from 64 to 1 — does it get faster or slower? Why? (Hint: you're loading the same number of cache lines either way)
  4. Change stride to 4096 — what happens? You now touch only one cache line per 4 KB page, so every access misses and TLB pressure is maximized
  5. Run lscpu | grep cache to verify your actual cache sizes match the code

Day 2 Summary

  • DRAM is 200+ cycles away — without caches, CPUs would be idle almost constantly
  • Hierarchy: Registers(1c) → L1(4c) → L2(12c) → L3(40c) → DRAM(200c) → SSD(100Kc)
  • Virtual memory gives each process a private address space; TLB caches recent translations
  • Cache behavior is often the dominant factor in real-world program performance — not algorithm complexity
Challenge

Research 'NUMA architecture' (Non-Uniform Memory Access). In a dual-socket server, why does accessing memory on the remote socket cost more than local memory? How does Linux's numactl help? Write a paragraph explaining why NUMA matters for database servers.
