Your CPU can execute 4 billion operations per second. RAM takes 200 cycles to respond. The memory hierarchy exists entirely to bridge this gap.
At 3 GHz, one clock cycle is 0.33 nanoseconds. Accessing DRAM takes 60–100 ns — 200–300 cycles of waiting. Without caches, your CPU would be idle 99% of the time waiting for data. The solution is a pyramid of storage: fast and small near the CPU, slow and large farther away.
Registers: 1 cycle, ~1 KB (compiler managed). L1 Cache: 4–5 cycles, 32–64 KB per core. L2 Cache: 12–15 cycles, 256 KB–1 MB. L3 Cache: 40–50 cycles, 6–64 MB shared. DRAM: 100–300 cycles, 8–128 GB. NVMe SSD: ~100,000 cycles, 0.5–4 TB. Every level is a cache for the one below it. L1 caches L2. L2 caches L3. L3 caches DRAM. DRAM caches disk.
Every process sees a private 64-bit virtual address space — as if it owns all of memory. The OS and the CPU's Memory Management Unit (MMU) translate virtual addresses to physical DRAM locations using page tables (4 KB pages). A Translation Lookaside Buffer (TLB) caches recent translations. A TLB miss means a multi-cycle page table walk. A page fault means the page isn't in RAM at all — the OS must load it from disk.
// Measure memory latency at each hierarchy level
#include
#include
#include
#include
void bench(const char *name, char *buf, size_t size) {
volatile char x = 0;
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int r = 0; r < 20; r++)
for (size_t i = 0; i < size; i += 64) // 64-byte cache lines
x += buf[i];
clock_gettime(CLOCK_MONOTONIC, &t1);
long ns = ((t1.tv_sec-t0.tv_sec)*1000000000LL + t1.tv_nsec-t0.tv_nsec)/20;
printf("%-22s %8ld ns\n", name, ns);
}
int main() {
char *buf = malloc(256*1024*1024);
memset(buf, 1, 256*1024*1024);
bench("32KB (fits L1)", buf, 32*1024);
bench("256KB (fits L2)", buf, 256*1024);
bench("8MB (fits L3)", buf, 8*1024*1024);
bench("128MB (DRAM)", buf, 128*1024*1024);
free(buf); return 0;
}
gcc -O2 -o latency latency.c && ./latencylscpu | grep cache to verify your actual cache sizes match the codeResearch 'NUMA architecture' (Non-Uniform Memory Access). In a dual-socket server, why does accessing memory on the remote socket cost more than local memory? How does Linux's numactl help? Write a paragraph explaining why NUMA matters for database servers.