Day 3 of 5
⏱ ~60 minutes
Computer Architecture in 5 Days — Day 3

Cache Deep Dive

The cache hierarchy is responsible for the up-to-100x performance gap between 'cache-friendly' and 'cache-hostile' code. Today you'll understand exactly how it works — and how to exploit it.

Cache Lines and Spatial Locality

The cache never loads a single byte — it loads a whole cache line, typically 64 bytes. Read one 4-byte int, and the next 15 ints are already cached. This is spatial locality: if you access one address, nearby addresses are likely to be accessed soon. Sequential array access hits L1 almost every time; random pointer chasing misses every level and goes all the way to DRAM on almost every access.

Set-Associative Organization

The cache is divided into sets, each with N ways (slots). A memory address maps to exactly one set, chosen by its index bits; the tag bits identify which address occupies each way. On a miss, the new line is loaded into the set, evicting one of the existing ways (usually the least recently used). 8-way associativity means eight conflicting lines can live in the same set at once — far fewer conflict misses than a direct-mapped (1-way) cache.

Write Policies and False Sharing

Write-through: every store immediately updates DRAM. Simple, but slow. Write-back: stores update the cache line (marked dirty), and DRAM is updated only when the line is evicted. Fast, but requires dirty-bit tracking. In multi-core systems, the MESI protocol keeps the per-core caches coherent. False sharing: two threads write to different variables that happen to share a cache line. Every write invalidates the other core's copy, effectively serializing the threads even though no data is actually shared.

```c
// Cache-friendly vs cache-hostile matrix access (cache.c)
#include <stdio.h>
#include <time.h>
#define N 2048
float A[N][N], B[N][N], C[N][N];

// Row-major: inner loop walks memory sequentially — cache-friendly
void row_major(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = B[i][j] + C[i][j];
}

// Column-major: inner loop strides a full 8 KB row — cache-hostile
void col_major(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            A[i][j] = B[i][j] + C[i][j];
}

static double timed(void (*f)(void)) {
    clock_t t = clock();
    f();
    return (double)(clock() - t) / CLOCKS_PER_SEC;
}

int main(void) {
    printf("row-major: %.3fs\n", timed(row_major));
    printf("col-major: %.3fs\n", timed(col_major));
    return 0;
}
```
💡
The column-major version is typically 5–10x slower despite identical arithmetic. The inner loop strides 8 KB per iteration (2048 floats × 4 bytes), blowing out L1 and L2. Swap the loop order and your code becomes 5x faster for free.
📝 Day 3 Exercise
Prove Cache Effects with Real Numbers
  1. Compile and run: gcc -O1 -o cache cache.c && ./cache (use -O1 to prevent the optimizer from hiding the effect)
  2. Try different N values: 256, 512, 1024, 2048, 4096. At what size does the gap start widening? Why?
  3. Add a tiled version with a 32×32 block size. Does it match row-major? Exceed it?
  4. On Linux, use valgrind --tool=cachegrind ./cache to get exact cache miss counts for each version
  5. Write down your mental model: for a 2D array in C, which loop order is always cache-friendly?

Day 3 Summary

  • Cache loads 64-byte lines — reading one 4-byte int pulls the next 15 into cache for free
  • Spatial locality (sequential access) is your most powerful free optimization
  • False sharing kills parallel performance — pad structs so each thread's data is on separate cache lines
  • Write-back caches are fast but require MESI coherence in multi-core; understand this before using shared mutable state
Challenge

Implement matrix multiplication three ways: naive O(n³), cache-friendly (swap the j/k loop order), and tiled (32×32 blocks). Measure all three for N=1024. The tiled version should close much of the gap to an optimized BLAS. Use perf stat -e cache-misses to count misses in each version.
