Day 4 of 5
⏱ ~60 minutes
How CPUs Work in 5 Days — Day 4

The Instruction Execution Cycle

Today you'll trace a single instruction from fetch to writeback — through every component of the CPU, seeing exactly what happens at each stage.

Fetch: Reading the Next Instruction

The Program Counter (RIP in x86-64) holds the address of the next instruction. The instruction fetch unit reads bytes from the instruction cache (L1i, typically 32 KB) and sends them to the instruction queue. The instruction length decoder figures out where one instruction ends and the next begins — x86-64 instructions vary from 1 to 15 bytes. The fetch unit tries to fetch 16–32 bytes per cycle to keep the pipeline fed.

Decode: What Does This Instruction Mean?

The decoder converts raw bytes into micro-operations (uops) — primitive operations the CPU's execution engine understands. A single x86 ADD instruction might become 1 uop. A complex PUSH instruction becomes 2 uops (sub RSP,8 and store). The uop cache (Decoded ICache, ~1500 uops in Intel CPUs) caches decoded instructions so the decoder isn't a bottleneck on loops.
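A minimal sketch of this stage, with made-up uop names and breakdowns (the real tables are microarchitecture-specific): a decoder that expands each macro-instruction into uops, fronted by a uop cache that skips the expansion on repeat hits.

```python
# Toy decoder: map each macro-instruction to micro-ops.
# The uop names and breakdowns below are illustrative, not Intel's.
UOP_TABLE = {
    'ADD reg, reg': ['alu_add'],                # simple: one uop
    'ADD reg, mem': ['load', 'alu_add'],        # memory operand adds a load
    'PUSH reg':     ['alu_sub_rsp', 'store'],   # sub RSP,8 then store
}

uop_cache = {}  # decoded-instruction cache: skip decode work on hot loops

def decode(instr):
    if instr in uop_cache:          # uop-cache hit: decoder stays idle
        return uop_cache[instr]
    uops = UOP_TABLE[instr]         # the "expensive" decode step
    uop_cache[instr] = uops
    return uops

print(decode('PUSH reg'))   # decoded, then cached
print(decode('PUSH reg'))   # served from the uop cache
```

In a loop body that fits in the uop cache, every iteration after the first takes the cache path, which is exactly why tight loops decode essentially for free.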

Execute, Memory, Writeback

Dispatched uops wait in the scheduler/reservation station for their operands to be ready and an execution unit to be free. Modern CPUs have 10+ execution ports: integer ALU, FP ALU, load/store units, branch unit, etc. Once executed, the result goes to the reorder buffer (ROB) to await in-order commit. On commit, the register file is updated and the uop retires. The ROB enables out-of-order execution while preserving program semantics.
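The ROB's key rule can be sketched in a few lines: uops may *finish* in any order, but they *retire* strictly from the head of the buffer, in program order. This is a conceptual model, not a cycle-accurate one.

```python
from collections import deque

rob = deque()  # reorder buffer: entries kept in program order

def issue(uop):
    """Allocate a ROB entry at dispatch time."""
    entry = {'uop': uop, 'done': False}
    rob.append(entry)
    return entry

def complete(entry):
    entry['done'] = True   # execution finished; result parked in the ROB

def retire():
    """Commit from the head only: a finished uop stuck behind an
    unfinished one must wait."""
    retired = []
    while rob and rob[0]['done']:
        retired.append(rob.popleft()['uop'])
    return retired

a, b, c = issue('load'), issue('add'), issue('store')
complete(b)        # 'add' finishes first, out of order...
print(retire())    # ...but cannot retire past the pending 'load' -> []
complete(a)
print(retire())    # now both head entries commit in order -> ['load', 'add']
```

If the 'load' faulted instead of completing, everything behind it in the ROB would simply be discarded before it could commit — that is the mechanism behind precise exceptions.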

```python
# Simulate a simple in-order CPU
# Demonstrates fetch -> decode -> execute -> writeback

class CPU:
    def __init__(self):
        self.regs = {'R0': 0, 'R1': 0, 'R2': 0, 'R3': 0}
        self.pc = 0
        self.memory = [0] * 256
        self.zero = False

    def fetch(self, program):
        if self.pc >= len(program):
            return None
        instr = program[self.pc]
        self.pc += 1
        print(f"  FETCH   [{self.pc - 1}]: {instr}")
        return instr

    def decode(self, instr):
        op, *args = instr.split()
        print(f"  DECODE  op={op} args={args}")
        return op, args

    def execute(self, op, args):
        if op == 'MOV':
            # Second operand is either an immediate or a register name
            src = args[1]
            result = int(src) if src.lstrip('-').isdigit() else self.regs[src]
        elif op == 'ADD':
            result = self.regs[args[0]] + self.regs[args[1]]
        elif op == 'SUB':
            result = self.regs[args[0]] - self.regs[args[1]]
        elif op == 'MUL':
            result = self.regs[args[0]] * self.regs[args[1]]
        elif op == 'CMP':
            self.zero = (self.regs[args[0]] - self.regs[args[1]] == 0)
            print(f"  EXECUTE cmp: zero={self.zero}")
            return None, None   # CMP only sets a flag, no writeback
        else:
            result = 0
        print(f"  EXECUTE result={result}")
        return args[0], result

    def writeback(self, dest, val):
        if dest:
            self.regs[dest] = val
            print(f"  WRITEBK {dest} = {val}")

    def run(self, program):
        while self.pc < len(program):
            print(f"\nCycle {self.pc + 1}:")
            instr = self.fetch(program)
            op, args = self.decode(instr)
            dest, val = self.execute(op, args)
            self.writeback(dest, val)
        print(f"\nFinal registers: {self.regs}")

cpu = CPU()
cpu.run([
    'MOV R0 10',   # R0 = 10
    'MOV R1 32',   # R1 = 32
    'ADD R2 R0',   # R2 = R2 + R0 = 10 (ADD writes back to its first operand)
    'ADD R2 R1',   # R2 = R2 + R1 = 42
])
```
💡
The ROB (Reorder Buffer) is what makes out-of-order execution safe. Instructions execute in any order, but the ROB ensures their results commit in program order. Without the ROB, a later instruction could corrupt state that an exception handler needs to restore.
📝 Day 4 Exercise
Extend the CPU Simulator
  1. Add a LOAD and STORE instruction to the simulator that reads/writes the memory array.
  2. Add a JMP instruction that sets self.pc to the target. Add a BEQ (branch if equal/zero flag) instruction.
  3. Write a program that loops 5 times: MOV R0 0, then loop: ADD R0 1, CMP R0 5, BEQ done, JMP loop. (You'll need to let ADD and CMP accept an immediate second operand, the way MOV already does.)
  4. Add a cycle counter and measure how many cycles each instruction type takes.
  5. Add a simple cache: a dictionary mapping address→value. On LOAD, check cache first. Count hit and miss rates.
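One possible starting point for exercise 5 — the class name `CachedMemory` and its interface are our own invention, not part of the simulator above:

```python
# A direct-lookup cache with hit/miss counters, wrapping a backing
# memory array. Every address that has been loaded once stays cached.
class CachedMemory:
    def __init__(self, backing):
        self.backing = backing
        self.cache = {}                  # address -> value
        self.hits = self.misses = 0

    def load(self, addr):
        if addr in self.cache:
            self.hits += 1               # served from the cache
        else:
            self.misses += 1
            self.cache[addr] = self.backing[addr]  # fill on miss
        return self.cache[addr]

mem = CachedMemory([0] * 256)
for addr in [5, 5, 9, 5]:
    mem.load(addr)
print(f"hits={mem.hits} misses={mem.misses}")  # hits=2 misses=2
```

From here you can wire `load` into a LOAD instruction, then experiment with evicting entries to model a fixed-size cache.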

Day 4 Summary

  • Fetch reads from L1i cache using RIP as the address, handling variable-length x86 encoding
  • Decode converts bytes to uops (micro-operations); the uop cache bypasses decode on hot loops
  • The reorder buffer holds in-flight uops and commits them in program order despite OOO execution
  • Every instruction retires through the ROB — this is how exceptions and interrupts stay precise
Challenge

Research Intel's Sandy Bridge microarchitecture diagram (widely available online). Trace the path of a single LOAD instruction through all named structures: L1D cache, load buffer, scheduler, ROB, register file. Write a paragraph describing each structure it passes through and why each exists.
