Day 4 of 5
⏱ ~60 minutes
Computer Architecture in 5 Days — Day 4

Instruction Pipelines

Pipelines let CPUs start a new instruction every cycle instead of waiting for the previous one to finish — but branches, data dependencies, and memory latency can break the flow.

The 5-Stage Pipeline

Classic RISC pipeline: Fetch → Decode → Execute → Memory → Writeback. Without pipelining, a 5-stage instruction takes 5 cycles. With pipelining, a full pipeline produces one result per cycle — 5x throughput. Like an assembly line: while stage 5 handles instruction 1, stage 4 handles instruction 2, stage 3 handles instruction 3, and so on. Modern CPUs have 14–19 stages.
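The assembly-line overlap, cycle by cycle (F/D/X/M/W = Fetch/Decode/Execute/Memory/Writeback; a schematic sketch, not any specific CPU):

```
cycle:    1   2   3   4   5   6   7
instr 1:  F   D   X   M   W
instr 2:      F   D   X   M   W
instr 3:          F   D   X   M   W
```

From cycle 5 onward the pipeline is full: one instruction completes every cycle, even though each still takes 5 cycles end to end.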

Data and Control Hazards

Data hazard: instruction N+1 needs the result of instruction N before it's ready. Solution: forwarding (pass result directly) or pipeline stalls (insert bubbles). Control hazard: a branch changes the PC. The CPU already fetched the next sequential instruction — if the branch is taken, those stages must be flushed (10–20 wasted cycles). Solution: branch prediction. Structural hazard: two instructions need the same execution unit. Solution: duplicate units.
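A data hazard in two RISC-V-style instructions (a sketch; the register names are illustrative):

```
add x1, x2, x3   # x1 is computed in the Execute stage
sub x4, x1, x5   # needs x1 immediately, before the add reaches Writeback
                 # forwarding routes the ALU output straight into this sub;
                 # without forwarding, the pipeline stalls with bubbles
```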

Branch Prediction and Out-of-Order Execution

Modern CPUs predict branch outcomes with >97% accuracy and execute speculatively before the branch resolves. On a wrong prediction, the pipeline flushes and refills — 15-cycle penalty. Out-of-order (OOO) execution: the CPU scans ahead in the instruction stream and executes independent instructions early, keeping execution units busy. A reorder buffer ensures results commit in program order even though they executed out of order. Modern server CPUs retire 4–6 instructions per cycle.

```c
// Branch prediction: sorted vs unsorted
// Classic benchmark showing misprediction cost
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 65536
int cmp(const void *a, const void *b){ return *(int*)a - *(int*)b; }
int main(){
    int *data = malloc(N * sizeof(int));
    srand(42);
    for (int i = 0; i < N; i++) data[i] = rand() % 256;

    // Unsorted: the branch outcome is effectively random
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < 2000; r++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128) sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Unsorted: %ldms (sum=%ld)\n",
        ((t1.tv_sec - t0.tv_sec) * 1000000000L + t1.tv_nsec - t0.tv_nsec) / 1000000, sum);

    qsort(data, N, sizeof(int), cmp);
    // Sorted: predictor sees pattern immediately
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sum = 0;
    for (int r = 0; r < 2000; r++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128) sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Sorted:   %ldms (sum=%ld)\n",
        ((t1.tv_sec - t0.tv_sec) * 1000000000L + t1.tv_nsec - t0.tv_nsec) / 1000000, sum);
    free(data);
    return 0;
}
```
💡
Replacing if(data[i]>=128) sum+=data[i]; with sum+=data[i]*(data[i]>=128); is branchless — the multiplication eliminates the branch entirely. Branchless code is predictable by definition.
📝 Day 4 Exercise
Measure Branch Misprediction Cost
  1. Compile and run: gcc -O2 -o branch branch.c && ./branch
  2. Record sorted vs unsorted time. Compute the ratio.
  3. Replace the branch with branchless arithmetic. Does it match the sorted version?
  4. On Linux: perf stat -e branch-misses,branches ./branch — compare misprediction rates for sorted vs unsorted
  5. Write a hot loop from your own code. Is there a branch inside? Can it be replaced with arithmetic or moved outside the loop?

Day 4 Summary

  • Pipelining gives ~5x throughput by overlapping instruction stages
  • Branch mispredictions cost 10–20 cycles each — make hot branches predictable or eliminate them
  • Out-of-order execution fills the pipeline by running independent instructions early
  • Modern CPUs retire 4–6 instructions per cycle and run 14–19 pipeline stages deep
Challenge

Research the Spectre vulnerability. It exploits speculative execution: the CPU speculatively accesses memory after a mispredicted branch, and the access leaves traces in cache timing. Write two paragraphs: (1) how speculative execution makes Spectre possible, and (2) why software mitigations (like LFENCE) cost 10–30% performance on some workloads.
