Today you'll trace a single instruction from fetch to writeback — through every component of the CPU, seeing exactly what happens at each stage.
The Program Counter (RIP in x86-64) holds the address of the next instruction. The instruction fetch unit reads bytes from the instruction cache (L1i, 32 KB), sends them to the instruction queue. The instruction length decoder figures out where one instruction ends and the next begins — x86-64 instructions vary from 1 to 15 bytes. The fetch unit tries to fetch 16–32 bytes per cycle to keep the pipeline fed.
The decoder converts raw bytes into micro-operations (uops) — primitive operations the CPU's execution engine understands. A single x86 ADD instruction might become 1 uop. A complex PUSH instruction becomes 2 uops (sub RSP,8 and store). The uop cache (Decoded ICache, ~1500 uops in Intel CPUs) caches decoded instructions so the decoder isn't a bottleneck on loops.
Dispatched uops wait in the scheduler/reservation station for their operands to be ready and an execution unit to be free. Modern CPUs have 10+ execution ports: integer ALU, FP ALU, load/store units, branch unit, etc. Once executed, the result goes to the reorder buffer (ROB) to await in-order commit. On commit, the register file is updated and the uop retires. The ROB enables out-of-order execution while preserving program semantics.
# Simulate a simple in-order CPU
# Demonstrates fetch→decode→execute→writeback
class CPU:
def __init__(self):
self.regs = {'R0':0,'R1':0,'R2':0,'R3':0}
self.pc = 0
self.memory = [0]*256
self.zero = False
def fetch(self, program):
if self.pc >= len(program): return None
instr = program[self.pc]
self.pc += 1
print(f" FETCH [{self.pc-1}]: {instr}")
return instr
def decode(self, instr):
parts = instr.split()
op, *args = parts
print(f" DECODE op={op} args={args}")
return op, args
def execute(self, op, args):
if op == 'MOV': result = int(args[1]) if args[1].lstrip('-').isdigit() else self.regs[args[1]]
elif op == 'ADD': result = self.regs[args[0]] + self.regs[args[1]]
elif op == 'SUB': result = self.regs[args[0]] - self.regs[args[1]]
elif op == 'MUL': result = self.regs[args[0]] * self.regs[args[1]]
elif op == 'CMP': result = self.regs[args[0]] - self.regs[args[1]]; self.zero=(result==0); print(f" EXECUTE cmp: zero={self.zero}"); return None, None
else: result = 0
print(f" EXECUTE result={result}")
return args[0], result
def writeback(self, dest, val):
if dest: self.regs[dest] = val; print(f" WRITEBK {dest} = {val}")
def run(self, program):
while self.pc < len(program):
print(f"\nCycle {self.pc+1}:")
instr = self.fetch(program)
op, args = self.decode(instr)
dest, val = self.execute(op, args)
self.writeback(dest, val)
print(f"\nFinal registers: {self.regs}")
cpu = CPU()
cpu.run([
'MOV R0 10',
'MOV R1 32',
'ADD R2 R0', # R2 = R0 + R1 -- simplified, R1 implicit
'MOV R1 32',
'ADD R2 R1',
])
Research Intel's Sandy Bridge microarchitecture diagram (widely available online). Trace the path of a single LOAD instruction through all named structures: L1D cache, load buffer, scheduler, ROB, register file. Write a paragraph describing each structure it passes through and why each exists.