Pipelines let CPUs start a new instruction every cycle instead of waiting for the previous one to finish — but branches, data dependencies, and memory latency can break the flow.
Classic RISC pipeline: Fetch → Decode → Execute → Memory → Writeback. Without pipelining, a 5-stage instruction takes 5 cycles. With pipelining, a full pipeline produces one result per cycle — 5x throughput. Like an assembly line: while stage 5 handles instruction 1, stage 4 handles instruction 2, stage 3 handles instruction 3, and so on. Modern CPUs have 14–19 stages.
Data hazard: instruction N+1 needs the result of instruction N before it's ready. Solution: forwarding (pass result directly) or pipeline stalls (insert bubbles). Control hazard: a branch changes the PC. The CPU already fetched the next sequential instruction — if the branch is taken, those stages must be flushed (10–20 wasted cycles). Solution: branch prediction. Structural hazard: two instructions need the same execution unit. Solution: duplicate units.
Modern CPUs predict branch outcomes with >97% accuracy and execute speculatively before the branch resolves. On a wrong prediction, the pipeline flushes and refills — 15-cycle penalty. Out-of-order (OOO) execution: the CPU scans ahead in the instruction stream and executes independent instructions early, keeping execution units busy. A reorder buffer ensures results commit in program order even though they executed out of order. Modern server CPUs retire 4–6 instructions per cycle.
// Branch prediction: sorted vs unsorted
// Classic benchmark showing misprediction cost
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 65536
int cmp(const void *a, const void *b){ return *(int*)a - *(int*)b; }
int main(){
    int *data = malloc(N * sizeof(int));
    srand(42);
    for (int i = 0; i < N; i++) data[i] = rand() % 256;
    struct timespec t0, t1;
    long sum = 0;
    // Unsorted: branch outcome is effectively random
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < 2000; r++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128) sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Unsorted: %ldms (sum=%ld)\n",
        ((t1.tv_sec - t0.tv_sec) * 1000000000L + t1.tv_nsec - t0.tv_nsec) / 1000000, sum);
    qsort(data, N, sizeof(int), cmp);
    // Sorted: predictor sees pattern immediately
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sum = 0;
    for (int r = 0; r < 2000; r++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128) sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Sorted: %ldms (sum=%ld)\n",
        ((t1.tv_sec - t0.tv_sec) * 1000000000L + t1.tv_nsec - t0.tv_nsec) / 1000000, sum);
    free(data);
    return 0;
}
Replacing if(data[i]>=128) sum+=data[i]; with sum+=data[i]*(data[i]>=128); makes the loop branchless: the comparison evaluates to 0 or 1, and the multiplication eliminates the branch entirely. Branchless code is predictable by definition.

Build and run with gcc -O2 -o branch branch.c && ./branch. Then run perf stat -e branch-misses,branches ./branch and compare misprediction rates for the sorted and unsorted loops.

Research the Spectre vulnerability. It exploits speculative execution: the CPU speculatively accesses memory after a mispredicted branch, and the access leaves traces in cache timing. Write two paragraphs: (1) how speculative execution makes Spectre possible, and (2) why software mitigations (like LFENCE) cost 10–30% performance on some workloads.