Day 4: Tracing a Single Instruction from Fetch to Writeback

Today's Objective

Today you'll trace a single instruction from fetch to writeback, following it through each major stage of the CPU pipeline and seeing what happens at every step.

Fetch: Reading the Next Instruction

The program counter (RIP in x86-64) holds the address of the next instruction. The instruction fetch unit reads bytes from the L1 instruction cache (L1i, typically 32 KB) and sends them to the instruction queue. The instruction length decoder works out where one instruction ends and the next begins, which is necessary because x86-64 instructions vary from 1 to 15 bytes. The fetch unit tries to fetch 16–32 bytes per cycle to keep the pipeline fed.
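The length-decoding step can be sketched with a toy encoding. This is a simplified model, not x86: it assumes the low 4 bits of each instruction's first byte give its total length, whereas real x86-64 length depends on prefixes, opcode, ModRM, and more. The names `FETCH_WIDTH`, `fetch_blocks`, and `split_instructions` are made up for illustration.

```python
# Toy model of fetch + instruction length decoding.
# ASSUMPTION: the low 4 bits of an instruction's first byte hold its total
# length in bytes (1-15). Real x86-64 length decoding is far more involved.

FETCH_WIDTH = 16  # bytes fetched per cycle (hypothetical value in the 16-32 range)

def fetch_blocks(code):
    """Split the code into the fixed-size blocks a fetch unit would read."""
    return [code[i:i + FETCH_WIDTH] for i in range(0, len(code), FETCH_WIDTH)]

def split_instructions(code):
    """Walk the byte stream and cut it at instruction boundaries."""
    instructions = []
    pos = 0
    while pos < len(code):
        length = code[pos] & 0x0F
        if length == 0 or pos + length > len(code):
            raise ValueError(f"bad instruction length at offset {pos}")
        instructions.append(code[pos:pos + length])
        pos += length
    return instructions

# Three instructions of length 1, 3, and 15 bytes (19 bytes total).
code = bytes([0x01]) + bytes([0x03, 0xAA, 0xBB]) + bytes([0x0F] + [0x00] * 14)

blocks = fetch_blocks(code)
instrs = split_instructions(code)
print([len(b) for b in blocks])   # sizes of the fetched blocks
print([len(i) for i in instrs])   # decoded instruction lengths
```

Note that the third instruction starts at offset 4 and runs past the end of the first 16-byte block, so the length decoder has to stitch it together from two fetches.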

Decode: What Does This Instruction Mean?

The decoder converts raw bytes into micro-operations (uops), the primitive operations the CPU's execution engine understands. A simple x86 ADD between two registers decodes to 1 uop. A more complex PUSH conceptually becomes 2 uops (sub RSP,8 and a store). The uop cache (Decoded ICache, about 1,500 uops on Intel cores from Sandy Bridge through Skylake) holds already-decoded instructions so the decoder isn't a bottleneck in loops.
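A minimal sketch of how a uop cache takes pressure off the decoder in a loop follows. The `UOP_TABLE` breakdown, the `Decoder` class, and the per-address cache keying are all assumptions for illustration; real uop counts and cache organization vary by microarchitecture.

```python
# Toy decoder with a uop cache.
# ASSUMPTION: the uop breakdown table and the address-keyed cache are
# illustrative only; real hardware organizes the uop cache differently.

UOP_TABLE = {
    "ADD":  ["alu_add"],              # simple instruction: 1 uop
    "PUSH": ["sub_rsp_8", "store"],   # per the text: 2 uops
}

class Decoder:
    def __init__(self):
        self.uop_cache = {}   # instruction address -> decoded uops
        self.hits = 0
        self.misses = 0

    def decode(self, address, mnemonic):
        """Return the uops for the instruction at `address`."""
        if address in self.uop_cache:
            self.hits += 1                 # decoder is bypassed
            return self.uop_cache[address]
        self.misses += 1                   # full decode needed
        uops = UOP_TABLE[mnemonic]
        self.uop_cache[address] = uops
        return uops

# A two-instruction loop body executed 4 times.
loop_body = [(0x100, "ADD"), (0x103, "PUSH")]
decoder = Decoder()
total_uops = 0
for _ in range(4):
    for address, mnemonic in loop_body:
        total_uops += len(decoder.decode(address, mnemonic))

print(f"uops={total_uops} hits={decoder.hits} misses={decoder.misses}")
```

Only the first trip through the loop pays for a full decode; every later iteration is served from the cache.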

Execute, Memory, Writeback

Dispatched uops wait in the scheduler (reservation station) until their operands are ready and an execution unit is free. Modern CPUs have many execution ports (8 to 12 on recent Intel cores) feeding integer ALUs, FP/vector ALUs, load/store units, branch units, and so on. Once a uop has executed, its entry in the reorder buffer (ROB) is marked complete and waits for in-order commit. On commit, the architectural register state is updated and the uop retires. The ROB enables out-of-order execution while preserving program semantics.
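The "wait until operands are ready" rule can be sketched as a small scheduler loop. This is a toy under stated assumptions: unlimited execution ports, made-up latencies, and no register renaming. The `Uop` dataclass and `schedule` function are hypothetical names, not part of any real tool.

```python
# Toy scheduler: a uop issues once all of its source registers are ready.
# ASSUMPTIONS: unlimited execution ports, made-up latencies, no register
# renaming (the example never reuses a destination register).

from dataclasses import dataclass

@dataclass
class Uop:
    name: str
    dest: str
    srcs: tuple
    latency: int

def schedule(uops, ready_regs, max_cycles=1000):
    """Return {uop name: cycle it issued} for a list given in program order."""
    ready_at = {reg: 0 for reg in ready_regs}   # reg -> cycle its value is available
    waiting = list(uops)
    issued = {}
    cycle = 0
    while waiting:
        if cycle > max_cycles:
            raise RuntimeError("some uop never became ready")
        for uop in list(waiting):
            if all(ready_at.get(s, float("inf")) <= cycle for s in uop.srcs):
                issued[uop.name] = cycle
                ready_at[uop.dest] = cycle + uop.latency
                waiting.remove(uop)
        cycle += 1
    return issued

program = [
    Uop("load", dest="R1", srcs=("R0",), latency=4),       # slow memory access
    Uop("add",  dest="R2", srcs=("R1", "R1"), latency=1),  # depends on the load
    Uop("mul",  dest="R4", srcs=("R3", "R3"), latency=3),  # independent work
]
issued = schedule(program, ready_regs={"R0", "R3"})
print(issued)
```

The `mul` comes third in program order but issues in cycle 0 alongside the `load`, while the `add` has to wait until the loaded value arrives in cycle 4.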

# Simulate a simple in-order CPU
# Demonstrates fetch→decode→execute→writeback

class CPU:
    def __init__(self):
        self.regs = {'R0':0,'R1':0,'R2':0,'R3':0}
        self.pc = 0
        self.memory = [0]*256
        self.zero = False
    
    def fetch(self, program):
        if self.pc >= len(program): return None
        instr = program[self.pc]
        self.pc += 1
        print(f"  FETCH   [{self.pc-1}]: {instr}")
        return instr
    
    def decode(self, instr):
        parts = instr.split()
        op, *args = parts
        print(f"  DECODE  op={op} args={args}")
        return op, args
    
    def execute(self, op, args):
        if op == 'MOV':
            # Source is either an immediate (e.g. 10, -3) or a register name
            src = args[1]
            result = int(src) if src.lstrip('-').isdigit() else self.regs[src]
        elif op == 'ADD':
            result = self.regs[args[0]] + self.regs[args[1]]
        elif op == 'SUB':
            result = self.regs[args[0]] - self.regs[args[1]]
        elif op == 'MUL':
            result = self.regs[args[0]] * self.regs[args[1]]
        elif op == 'CMP':
            # CMP only sets the zero flag; nothing is written back
            self.zero = (self.regs[args[0]] == self.regs[args[1]])
            print(f"  EXECUTE cmp: zero={self.zero}")
            return None, None
        else:
            # Fail loudly instead of silently writing 0 to the destination
            raise ValueError(f"Unknown opcode: {op}")
        print(f"  EXECUTE result={result}")
        return args[0], result
    
    def writeback(self, dest, val):
        # Instructions such as CMP return dest=None and write nothing back
        if dest:
            self.regs[dest] = val
            print(f"  WRITEBK {dest} = {val}")
    
    def run(self, program):
        while self.pc < len(program):
            print(f"\nCycle {self.pc+1}:")
            instr = self.fetch(program)
            op, args = self.decode(instr)
            dest, val = self.execute(op, args)
            self.writeback(dest, val)
        print(f"\nFinal registers: {self.regs}")

cpu = CPU()
cpu.run([
    'MOV R0 10',
    'MOV R1 32',
    'ADD R2 R0',   # R2 = R2 + R0 = 10 (destination is also the first source)
    'ADD R2 R1',   # R2 = R2 + R1 = 42
])

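The ROB (Reorder Buffer) is what makes out-of-order execution safe. Instructions may execute as soon as their operands are ready, regardless of program order, but the ROB ensures their results commit in program order. Without it, a later instruction could overwrite architectural state before an earlier instruction raises an exception, leaving the exception handler with a state it cannot restore precisely.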
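The in-order commit rule can be sketched with a queue. This is a toy model with assumptions: the `ReorderBuffer` class is hypothetical, and the completion order is supplied by hand rather than derived from a scheduler.

```python
# Toy reorder buffer: results complete in any order, commit in program order.
# ASSUMPTION: completion order is supplied by hand; a real core derives it
# from the scheduler and execution latencies.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest at the left
        self.committed = []      # names of retired instructions, in order

    def allocate(self, name):
        """Reserve an entry at the tail when the instruction is dispatched."""
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        """Mark an entry finished, then retire everything ready at the head."""
        entry["done"] = True
        while self.entries and self.entries[0]["done"]:
            self.committed.append(self.entries.popleft()["name"])

rob = ReorderBuffer()
load = rob.allocate("load")
add = rob.allocate("add")
mul = rob.allocate("mul")

rob.complete(mul)                  # finishes first, but cannot commit yet
after_mul = list(rob.committed)
rob.complete(load)                 # head is done: load commits, add still blocks mul
after_load = list(rob.committed)
rob.complete(add)                  # add commits, then mul right behind it
print(after_mul, after_load, rob.committed)
```

If `load` had faulted instead of completing, nothing younger than it would have committed, so the architectural state would still match the point just before the faulting instruction.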

Exercise

Extend the CPU Simulator

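Add LOAD and STORE instructions to the simulator that read from and write to the memory array.
Add a JMP instruction that sets self.pc to the target. Add a BEQ instruction that branches when the zero flag is set.
Write a program that loops 5 times: MOV R0 0, then loop: ADD R0 1, CMP R0 5, BEQ done, JMP loop. Note that ADD and CMP in the simulator above accept only register operands, so either extend them to take immediates or load 1 and 5 into registers first.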
Add a cycle counter and measure how many cycles each instruction type takes.
Add a simple cache: a dictionary mapping address→value. On LOAD, check cache first. Count hit and miss rates.

Research Intel's Sandy Bridge microarchitecture diagram (widely available online). Trace the path of a single LOAD instruction through all named structures: L1D cache, load buffer, scheduler, ROB, register file. Write a paragraph describing each structure it passes through and why each exists.

What's Next

The foundations from today carry directly into Day 5, which builds on everything covered here.

Day 4 Checkpoint

Before moving on, verify you can answer these without looking:

What is the core concept introduced in this lesson, and why does it matter?
What are the two or three most common mistakes practitioners make with this topic?
Can you explain the key code pattern from this lesson to a colleague in plain language?
What would break first if you skipped the safeguards or best practices described here?
How does today's topic connect to what comes in Day 5?
