Day 5: Multi-Core and Simultaneous Multithreading

Today's Objective

Modern CPUs are marvels of engineering: speculative execution, multi-core designs, simultaneous multithreading, and power management running simultaneously. Today you'll see the full picture.

Multi-Core and Simultaneous Multithreading

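Modern CPUs have 8–128 cores in a single package (often split across several chiplets), each with its own L1 and L2 caches and a shared L3. Simultaneous Multithreading (SMT, marketed by Intel as Hyper-Threading) runs two logical threads per physical core by duplicating the architectural state (registers, program counter) while sharing or partitioning the physical register file, reorder buffer, caches, and execution units. When thread 1 stalls on a cache miss, thread 2's uops fill the pipeline. SMT typically gives a 10–30% throughput improvement with little extra silicon, though the gain depends heavily on the workload.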
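To see which logical CPUs share a physical core, Linux exposes a sibling list per CPU under /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The following is a minimal sketch, assuming that sysfs layout is present; the helper names parse_cpu_list, group_siblings, and read_sibling_lists are illustrative and not part of any library.

```python
import glob

def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,4' or '0-1,8-9' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(','):
        if not part:
            continue
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def group_siblings(sibling_lists):
    """Collapse per-CPU sibling lists into one tuple per physical core."""
    return sorted({tuple(parse_cpu_list(s)) for s in sibling_lists})

def read_sibling_lists():
    """Read every CPU's sibling list from sysfs (Linux only; empty elsewhere)."""
    pattern = '/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list'
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists

if __name__ == '__main__':
    cores = group_siblings(read_sibling_lists())
    print(f"{len(cores)} physical cores found")
    for core in cores:
        print("  logical CPUs sharing one core:", core)
```

On a machine with SMT enabled, each tuple should contain two logical CPU numbers; with SMT disabled or absent, each tuple holds one.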

Power and Frequency Scaling

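CPUs don't run at maximum clock speed all the time. Dynamic Voltage and Frequency Scaling (DVFS) adjusts clock speed and supply voltage together based on workload. Dynamic power scales roughly as C·V²·f, so it is linear in frequency and quadratic in voltage; because a lower clock also allows a lower voltage, modest frequency cuts save a disproportionate amount of power. Idle cores are clock-gated (clock signal removed) or power-gated to save power. Turbo Boost raises active cores above the base clock as long as power and thermal headroom exists, and may exceed the sustained TDP for short periods. A 3.2 GHz base-clock CPU might boost to 5.4 GHz on 1–2 cores under short bursts.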
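The effect of scaling voltage together with frequency can be sketched with the standard dynamic-power approximation P ≈ C·V²·f. The operating points below are hypothetical values chosen for illustration, not measurements from any real CPU, and the model ignores static (leakage) power.

```python
def relative_dynamic_power(freq_ratio, volt_ratio):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f with C held constant."""
    return volt_ratio ** 2 * freq_ratio

# Hypothetical operating points: (label, frequency ratio, voltage ratio)
operating_points = [
    ("baseline", 1.0, 1.0),
    ("80% clock, 90% voltage", 0.8, 0.9),
    ("50% clock, 70% voltage", 0.5, 0.7),
]

for label, f, v in operating_points:
    power = relative_dynamic_power(f, v)
    print(f"{label:24s}: {power:.3f}x dynamic power")
```

Under these assumptions, halving the clock cuts dynamic power by far more than half, because the voltage term is squared.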

The Full Pipeline: A Complete Picture

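Today's server CPU (e.g. AMD EPYC Genoa): up to 96 cores, 192 threads, 12 memory channels, 384 MB L3. Each Zen 4 core has 32 KB L1i, 32 KB L1d, and 1 MB L2. Exact pipeline widths differ between designs, but a current high-performance core decodes roughly 4–6 instructions per cycle (often supplemented by a micro-op cache), tracks several hundred uops in its ROB, issues to around ten or more execution ports, and sustains about 3 loads and 2 stores per cycle. The chip does power management, thermal throttling, error correction on memory, and security mitigations all simultaneously, while executing your database query at an aggregate rate on the order of hundreds of billions of operations per second.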
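The aggregate throughput figure is just cores × clock × instructions per cycle. A rough sketch of that arithmetic follows; the core count comes from the text, while the 3.0 GHz clock and the IPC values are assumptions for illustration only.

```python
def aggregate_ops_per_second(cores, clock_hz, ipc):
    """Whole-chip instruction throughput if every core sustains the same IPC."""
    return cores * clock_hz * ipc

CORES = 96        # from the text
CLOCK_HZ = 3.0e9  # assumed all-core clock, for illustration
for ipc in (0.5, 1.0, 2.0):  # assumed sustained IPC values
    ops = aggregate_ops_per_second(CORES, CLOCK_HZ, ipc)
    print(f"IPC {ipc:.1f}: {ops / 1e9:.0f} billion instructions per second")
```

Real workloads rarely keep every core at the same IPC, so this is an upper-level estimate rather than a prediction.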

# Explore CPU topology with Python
import os
import platform
import subprocess

def cpu_info():
    """Print CPU topology and cache sizes on Linux or macOS."""
    system = platform.system()

    if system == 'Linux':
        # lscpu prints "Key: value" lines; collect them into a dict
        info = {}
        try:
            out = subprocess.run(['lscpu'], capture_output=True, text=True).stdout
        except FileNotFoundError:
            out = ''  # lscpu (from util-linux) is not installed
        for line in out.splitlines():
            if ':' in line:
                key, _, val = line.partition(':')
                info[key.strip()] = val.strip()

        # Not every lscpu version reports every key, so print only those present
        keys = ['Architecture','CPU(s)','Thread(s) per core','Core(s) per socket',
                'Socket(s)','CPU MHz','CPU max MHz','L1d cache','L1i cache',
                'L2 cache','L3 cache']
        for k in keys:
            if k in info:
                print(f"  {k:25s}: {info[k]}")

    elif system == 'Darwin':  # macOS
        # Some sysctl keys are not available on every Mac; failed lookups are skipped
        for cmd, label in [
            (['sysctl','-n','hw.physicalcpu'], 'Physical cores'),
            (['sysctl','-n','hw.logicalcpu'], 'Logical CPUs'),
            (['sysctl','-n','hw.l1dcachesize'], 'L1d cache (bytes)'),
            (['sysctl','-n','hw.l2cachesize'],  'L2 cache (bytes)'),
            (['sysctl','-n','hw.l3cachesize'],  'L3 cache (bytes)'),
            (['sysctl','-n','hw.cpufrequency_max'], 'Max frequency (Hz)'),
        ]:
            r = subprocess.run(cmd, capture_output=True, text=True)
            if r.returncode == 0:
                print(f"  {label:25s}: {r.stdout.strip()}")

    # Works on every platform, including ones not handled above
    cores = os.cpu_count()
    print(f"\n  os.cpu_count() = {cores} (logical CPUs visible to OS)")

print("Your CPU:")
cpu_info()

Physical cores vs logical CPUs: a 10-core CPU with HyperThreading shows 20 logical CPUs to the OS. For CPU-bound workloads, running more threads than there are physical cores often adds little and can hurt performance, because the extra threads share execution units instead of adding capacity.
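One way to apply this when sizing a worker pool is to estimate the physical core count from the logical count. This is a rough sketch: the helper name physical_core_estimate is hypothetical, and it assumes every core has the same SMT width, which does not hold on hybrid CPUs that mix core types. The threads_per_core value would normally come from the "Thread(s) per core" line of lscpu.

```python
import os

def physical_core_estimate(logical_cpus, threads_per_core=2):
    """Estimate physical cores, assuming every core exposes the same number of threads."""
    return max(1, logical_cpus // threads_per_core)

logical = os.cpu_count() or 1  # os.cpu_count() may return None
workers = physical_core_estimate(logical, threads_per_core=2)  # 2 is an assumption
print(f"{logical} logical CPUs -> start about {workers} CPU-bound workers")
```

For the 10-core example above, 20 logical CPUs with 2 threads per core gives an estimate of 10 workers.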

Exercise

Profile a Real Workload End to End

Run the CPU info script. Record your physical cores, logical CPUs, and all cache sizes.
Run a CPU-bound benchmark: python3 -c "import timeit; print(timeit.timeit('sum(range(10**7))', number=10))"
Try it with 1, 2, 4, and 8 Python processes in parallel using subprocess. Does it scale linearly? Why or why not?
On Linux, save the benchmark as benchmark.py and run perf stat python3 benchmark.py to see IPC (instructions per cycle), which perf reports as "insn per cycle". A well-optimized program can reach 3–4 IPC on modern CPUs.
Research your CPU's microarchitecture name (e.g., Alder Lake, Zen 4, Firestorm). Find one interesting fact about its design not covered in this course.
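For the parallel step of the exercise, one possible approach is to launch several Python interpreters with subprocess and time how long the whole batch takes. This is a sketch, not the only way to do it: the function name run_parallel is hypothetical, and the default workload is smaller than the one in the exercise so the script finishes quickly. Pass stmt="sum(range(10**7))" and number=10 to match the exercise.

```python
import subprocess
import sys
import time

def run_parallel(n_procs, stmt="sum(range(10**6))", number=3):
    """Run the same timeit workload in n_procs Python processes; return wall time in seconds."""
    code = f"import timeit; timeit.timeit({stmt!r}, number={number})"
    start = time.perf_counter()
    # Start all processes first so they run concurrently, then wait for each
    procs = [subprocess.Popen([sys.executable, "-c", code]) for _ in range(n_procs)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start

if __name__ == "__main__":
    baseline = run_parallel(1)
    for n in (1, 2, 4, 8):
        elapsed = run_parallel(n)
        print(f"{n} processes: {elapsed:.2f} s wall time "
              f"({elapsed / baseline:.2f}x the 1-process time)")
```

Each process does the same amount of work, so with perfect scaling the wall time stays close to the 1-process time as long as the process count does not exceed the number of physical cores. Wall time growing beyond that point is the effect to explain in your answer.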

Use perf report (Linux) or Instruments (macOS) to profile a real program you use or write. Find the hottest function — the one consuming the most CPU time. Look at its assembly. Is it vectorized? Are there cache misses? Write a paragraph describing what you found and one thing you would try to improve it.

Course Complete

Completing all five days gives you a solid working knowledge of how CPUs work. The skills here translate directly to real projects. The next step is practice: pick a project and build something with what you learned.

Day 5 Checkpoint

Before moving on, verify you can answer these without looking:

What is the core concept introduced in this lesson, and why does it matter?
What are the two or three most common mistakes practitioners make with this topic?
Can you explain the key code pattern from this lesson to a colleague in plain language?
What would break first if you skipped the safeguards or best practices described here?
How does today's topic connect to the earlier lessons in this course?
