The CPU cache is responsible for the 100x performance gap between 'cache-friendly' and 'cache-hostile' code. Today you'll understand exactly how it works, and how to exploit it.
The cache never loads a single byte; it loads an entire 64-byte cache line. Read one int, and the next 15 ints are already cached. This is spatial locality: if you access one address, nearby addresses are likely to be accessed soon. Sequential array access hits L1 almost every time; random pointer chasing misses even L3 and goes to DRAM every time.
Cache is divided into sets, each with N ways (slots). A memory address maps to exactly one set. The tag bits identify which address occupies each way. On a miss, a new line is loaded into the set, evicting one of the existing ways (usually LRU). 8-way associativity means 8 different addresses can share a set — reducing conflict misses vs direct-mapped.
Write-through: every store immediately updates DRAM. Simple, but slow. Write-back: stores update the cache line (which is marked dirty); DRAM is updated only on eviction. Fast, but requires dirty tracking. In multi-core systems, the MESI protocol keeps caches coherent. False sharing: two threads write to different variables that happen to share a cache line. Every write invalidates the other core's copy, effectively serializing the threads despite no real sharing.
// Cache-friendly vs cache-hostile matrix access
#include <stdio.h>
#include <time.h>
#define N 2048
float A[N][N], B[N][N], C[N][N];

// Row-major: sequential memory — cache-friendly
void row_major() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}

// Column-major: strides N floats per access, cache-hostile
void col_major() {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            C[i][j] = A[i][j] + B[i][j];
}

int main() {
    clock_t t = clock();
    row_major();
    printf("row-major: %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    t = clock();
    col_major();
    printf("col-major: %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}
Compile and run with gcc -O1 -o cache cache.c && ./cache (use -O1 to prevent the optimizer from hiding the effect). Then run valgrind --tool=cachegrind ./cache to get exact cache miss counts for each version.

Exercise: implement matrix multiplication three ways: naive O(n³), cache-friendly (swap the j/k loop order), and tiled (32×32 blocks). Measure all three for N=1024. The tiled version should approach BLAS performance. Use perf stat -e cache-misses to count misses in each version.