Day 5 of 5
⏱ ~60 minutes
How CPUs Work in 5 Days — Day 5

Modern CPUs

Modern CPUs are marvels of engineering: speculative execution, multi-core designs, simultaneous multithreading, and power management all operate at once. Today you'll see the full picture.

Multi-Core and Simultaneous Multithreading

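Modern CPUs pack anywhere from 8 to 128 cores into a single package (large server parts often spread them across several chiplets rather than one die). Each core has its own L1 and L2 caches, and cores share an L3. Simultaneous Multithreading (SMT, which Intel markets as Hyper-Threading) runs two logical threads per physical core by duplicating the architectural state (program counter and register state) while sharing the execution units and caches. Structures such as the reorder buffer are typically partitioned or shared between the two threads rather than duplicated. When thread 1 stalls on a cache miss, thread 2's uops fill the pipeline. SMT typically gives a 10–30% throughput improvement, depending on the workload, with minimal extra silicon.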
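
The stall-filling idea can be sketched with a toy model. This is not how any real core schedules work: it assumes a 1-wide core, and the `work` and `stall` parameters (uops issued between misses, cycles lost per miss) are made-up illustrative values.

```python
def simulate(num_threads, cycles=10_000, work=8, stall=4):
    """Count uops issued by a toy 1-wide core over `cycles` cycles.

    Each thread issues `work` uops, then stalls for `stall` cycles
    (a stand-in for a cache miss), and repeats.
    """
    ready_at = [0] * num_threads      # first cycle each thread may issue again
    run_left = [work] * num_threads   # uops left before the next simulated miss
    issued = 0
    for cycle in range(cycles):
        # Rotate priority each cycle so the threads share the issue slot
        for i in range(num_threads):
            t = (cycle + i) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                run_left[t] -= 1
                if run_left[t] == 0:
                    # Simulated cache miss: this thread sits out `stall` cycles
                    ready_at[t] = cycle + 1 + stall
                    run_left[t] = work
                break   # only one uop issues per cycle
    return issued

one = simulate(1)
two = simulate(2)
print(f"1 thread : {one} uops in 10000 cycles")
print(f"2 threads: {two} uops in 10000 cycles ({two / one:.2f}x)")
```

With one thread the issue slot sits idle during every stall. With two threads most of those idle cycles are filled, but total throughput stays below 2x because both threads compete for the same slot. The exact ratio depends entirely on the assumed parameters.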

Power and Frequency Scaling

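CPUs don't run at maximum clock speed all the time. Dynamic Voltage and Frequency Scaling (DVFS) adjusts clock speed and supply voltage together based on workload. Dynamic power scales roughly as C·V²·f, so it grows linearly with frequency and quadratically with voltage; because higher clocks need higher voltage, power rises faster than clock speed. Idle cores are clock-gated (clock signal removed) to save power. Turbo Boost raises active cores above the base clock, and can briefly exceed the rated TDP, as long as power and thermal headroom exists. A 3.2 GHz base-clock CPU might boost to 5.4 GHz on 1–2 cores under short bursts.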
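
To see why boosting is expensive, here is a small calculation using the standard dynamic-power approximation P = C·V²·f. The capacitance and both voltages are illustrative assumptions, not figures for any particular CPU; only the two clock speeds come from the example above.

```python
def dynamic_power(capacitance, voltage, freq_hz):
    """Dynamic (switching) power in watts: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * freq_hz

C_EFF = 1.0e-9   # effective switched capacitance in farads (assumed)
V_BASE = 0.90    # core voltage at base clock (assumed)
V_BOOST = 1.25   # core voltage at boost clock (assumed)

base = dynamic_power(C_EFF, V_BASE, 3.2e9)
boost = dynamic_power(C_EFF, V_BOOST, 5.4e9)

print(f"base  3.2 GHz @ {V_BASE:.2f} V: {base:5.2f} W")
print(f"boost 5.4 GHz @ {V_BOOST:.2f} V: {boost:5.2f} W")
print(f"clock ratio: {5.4 / 3.2:.2f}x, power ratio: {boost / base:.2f}x")
```

Under these assumptions the power ratio is much larger than the clock ratio, which is why boost is limited to short bursts on a few cores. Leakage (static) power is ignored here.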

The Full Pipeline: A Complete Picture

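Today's server CPU (e.g. AMD EPYC Genoa): up to 96 cores, 192 threads, 12 memory channels, 384 MB L3. Each Zen 4 core has a 32 KB L1i, 32 KB L1d, and 1 MB L2. Exact pipeline figures vary by design: current high-performance cores decode or dispatch around 6 instructions per cycle, track several hundred in-flight uops in the ROB (about 320 on Zen 4, about 512 on Intel's Golden Cove), and spread work over roughly a dozen execution ports, including several load and store units. The chip does power management, thermal throttling, error correction on memory, and security mitigations all simultaneously, while executing your database query at roughly 300 billion operations per second.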
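
As a sanity check on numbers of that size, here is a back-of-envelope calculation. The core and channel counts come from the text; the 3.0 GHz all-core clock, the average IPC of 1.0, and DDR5-4800 memory are assumptions chosen for illustration.

```python
def peak_ops_per_second(cores, clock_hz, ipc):
    """Aggregate instruction rate if every core sustains `ipc`."""
    return cores * clock_hz * ipc

def dram_bandwidth_bytes(channels, transfers_per_sec, bytes_per_transfer=8):
    """Peak DRAM bandwidth; each 64-bit channel moves 8 bytes per transfer."""
    return channels * transfers_per_sec * bytes_per_transfer

ops = peak_ops_per_second(cores=96, clock_hz=3.0e9, ipc=1.0)    # assumed clock and IPC
bw = dram_bandwidth_bytes(channels=12, transfers_per_sec=4.8e9)  # assumed DDR5-4800

print(f"operations per second: {ops:.3g}")
print(f"peak DRAM bandwidth:   {bw / 1e9:.1f} GB/s")
print(f"DRAM bytes available per operation: {bw / ops:.2f}")
```

The last line shows why the cache hierarchy matters: even at peak bandwidth, memory can supply only a byte or two per operation, so most data has to come from L1, L2, and L3.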

python
# Explore CPU topology with Python
import os
import platform
import subprocess

def cpu_info():
    """Print basic CPU information on Linux or macOS."""
    system = platform.system()

    if system == 'Linux':
        info = {}
        try:
            result = subprocess.run(['lscpu'], capture_output=True, text=True)
        except FileNotFoundError:
            print("  lscpu not found (it ships with util-linux)")
        else:
            # lscpu prints "Key:   value" lines; split on the first colon
            for line in result.stdout.splitlines():
                if ':' in line:
                    key, _, val = line.partition(':')
                    info[key.strip()] = val.strip()

        # Keys that this machine or lscpu version does not report are skipped
        keys = ['Architecture','CPU(s)','Thread(s) per core','Core(s) per socket',
                'Socket(s)','CPU MHz','CPU max MHz','L1d cache','L1i cache',
                'L2 cache','L3 cache']
        for k in keys:
            if k in info:
                print(f"  {k:25s}: {info[k]}")

    elif system == 'Darwin':  # macOS
        # Some keys (e.g. hw.cpufrequency_max) may be absent on some Macs;
        # sysctl then exits non-zero and that entry is skipped
        for cmd, label in [
            (['sysctl','-n','hw.physicalcpu'], 'Physical cores'),
            (['sysctl','-n','hw.logicalcpu'], 'Logical CPUs'),
            (['sysctl','-n','hw.l1dcachesize'], 'L1d cache (bytes)'),
            (['sysctl','-n','hw.l2cachesize'],  'L2 cache (bytes)'),
            (['sysctl','-n','hw.l3cachesize'],  'L3 cache (bytes)'),
            (['sysctl','-n','hw.cpufrequency_max'], 'Max frequency (Hz)'),
        ]:
            r = subprocess.run(cmd, capture_output=True, text=True)
            if r.returncode == 0:
                print(f"  {label:25s}: {r.stdout.strip()}")

    # Works on every platform, including ones not handled above
    cores = os.cpu_count()
    print(f"\n  os.cpu_count() = {cores} (logical CPUs visible to OS)")

print("Your CPU:")
cpu_info()
💡
Physical cores vs logical CPUs: a 10-core CPU with HyperThreading shows 20 logical CPUs to the OS. For CPU-bound workloads, running more threads than physical cores usually adds little and can even hurt performance, because the extra threads share execution units instead of adding capacity.
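
On Linux, one way to see which logical CPUs share a physical core is the sysfs topology files. The sketch below assumes the usual `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` layout; on other systems the glob matches nothing and the result is empty. The helper names are made up for this example.

```python
import glob

SIBLINGS_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"

def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '0,64' or '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(','):
        if not part:
            continue
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def physical_core_groups(pattern=SIBLINGS_GLOB):
    """Return a set of frozensets, one per physical core, of sibling logical CPUs."""
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(frozenset(parse_cpu_list(f.read())))
    return groups

groups = physical_core_groups()
if groups:
    logical = sum(len(g) for g in groups)
    print(f"{len(groups)} physical cores, {logical} logical CPUs")
    for g in sorted(groups, key=min)[:4]:
        print("  siblings:", sorted(g))
else:
    print("No sysfs topology found (not Linux?)")
```

Each sibling group is one physical core. If every group has two members, SMT is enabled; if every group has one, it is disabled or unsupported.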
📝 Day 5 Exercise
Profile a Real Workload End to End
  1. Run the CPU info script. Record your physical cores, logical CPUs, and all cache sizes.
  2. Run a CPU-bound benchmark: python3 -c "import timeit; print(timeit.timeit('sum(range(10**7))', number=10))"
  3. Try it with 1, 2, 4, and 8 Python processes in parallel using subprocess. Does it scale linearly? Why or why not?
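  4. On Linux, save the benchmark from step 2 as benchmark.py and run perf stat python3 benchmark.py to see IPC (instructions per cycle, reported as "insn per cycle"). Well-optimized code can reach 3–4 IPC on modern CPUs; many real workloads run well below that.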
  5. Research your CPU's microarchitecture name (e.g., Alder Lake, Zen 4, Firestorm). Find one interesting fact about its design not covered in this course.
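
For step 3, one possible starting point is sketched below. It launches separate interpreters with `subprocess`, as the exercise suggests. The function names are made up for this example, and the timings include interpreter startup, so treat the numbers as rough.

```python
import subprocess
import sys
import time

WORK = "sum(range(10**7))"  # same expression as the step 2 benchmark

def time_parallel(n_procs, stmt=WORK):
    """Start n_procs Python interpreters at once; return seconds until all exit."""
    start = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", stmt]) for _ in range(n_procs)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start

def scaling_table(counts=(1, 2, 4, 8), stmt=WORK):
    """Return (n_procs, seconds, jobs_per_second) for each process count."""
    rows = []
    for n in counts:
        elapsed = time_parallel(n, stmt)
        rows.append((n, elapsed, n / elapsed))
    return rows

if __name__ == "__main__":
    rows = scaling_table()
    base_rate = rows[0][2]  # throughput of the first row
    for n, elapsed, rate in rows:
        print(f"{n:2d} procs: {elapsed:6.2f} s, "
              f"{rate / base_rate:4.2f}x the throughput of the first row")
```

If throughput keeps rising roughly in step with the process count, you still have idle physical cores. Once it flattens, compare the process count at that point with your physical core and logical CPU counts from step 1.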

Day 5 Summary

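  • Modern CPUs have 8–128 cores; SMT (HyperThreading) doubles the logical CPU count by letting two threads share one core's execution units
  • DVFS scales clock speed and voltage with workload; Turbo Boost raises clocks further while power and thermal headroom lasts
  • The full pipeline: decode about 6 instructions/cycle, execute out of order from a ROB holding several hundred uops, issue to roughly 12 execution ports, retire 4–6/cycle
  • For CPU-bound work, start with one thread per physical core; the second logical CPU on each core adds at most a modest gain, not a second core's capacity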
Challenge

Use perf report (Linux) or Instruments (macOS) to profile a real program you use or write. Find the hottest function — the one consuming the most CPU time. Look at its assembly. Is it vectorized? Are there cache misses? Write a paragraph describing what you found and one thing you would try to improve it.
