Modern CPUs are marvels of engineering: speculative execution, multi-core designs, simultaneous multithreading, and dynamic power management all operating at once. Today you'll see the full picture.
Modern CPUs pack 8–128 cores on a single die, each with its own L1 and L2 caches and a shared L3. Simultaneous multithreading (SMT, marketed by Intel as Hyper-Threading) runs two logical threads per physical core by duplicating the architectural state (program counter and registers) while sharing the execution units and caches; large structures like the reorder buffer are typically shared or partitioned between the threads. When thread 1 stalls on a cache miss, thread 2's uops fill the pipeline. SMT typically yields a 10–30% throughput improvement for minimal extra silicon.
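On Linux you can observe this topology directly: sysfs reports, for each logical CPU, which siblings share its physical core. A minimal sketch, assuming the standard /sys/devices/system/cpu layout (on non-Linux systems it simply reports that the interface is absent):

```python
from pathlib import Path

def core_siblings():
    """Collect the distinct SMT sibling groups reported by sysfs."""
    groups = set()
    base = Path('/sys/devices/system/cpu')
    # Each cpuN directory names the logical CPUs sharing its physical core,
    # e.g. "0,1" or "0-1" on an SMT-2 machine.
    for cpu in base.glob('cpu[0-9]*'):
        f = cpu / 'topology' / 'thread_siblings_list'
        if f.exists():
            groups.add(f.read_text().strip())
    return sorted(groups)

groups = core_siblings()
if groups:
    print(f"{len(groups)} physical cores, sibling groups: {groups[:4]} ...")
else:
    print("topology not exposed (non-Linux system?)")
```

On a 16-core/32-thread part you would expect 16 distinct groups of two logical CPUs each; with SMT disabled, each group contains a single CPU.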
CPUs don't run at maximum clock speed all the time. Dynamic Voltage and Frequency Scaling (DVFS) adjusts clock speed and voltage to match the workload; since dynamic power scales roughly as voltage squared times frequency (P ≈ CV²f), lowering both together cuts power dramatically. Idle cores are clock-gated (their clock signal is removed) to save power. Turbo Boost pushes active cores above the base clock, and briefly above sustained power limits, as long as thermal headroom exists. A CPU with a 3.2 GHz base clock might boost to 5.4 GHz on 1–2 cores in short bursts.
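You can watch DVFS in action on Linux through the cpufreq interface. A hedged sketch: the sysfs paths below assume the standard cpufreq driver (frequencies are reported in kHz), and many VMs and containers don't expose them at all:

```python
from pathlib import Path

def read_khz(name):
    """Read a cpufreq value (in kHz) for cpu0, or None if not exposed."""
    p = Path('/sys/devices/system/cpu/cpu0/cpufreq') / name
    return int(p.read_text()) if p.exists() else None

cur = read_khz('scaling_cur_freq')   # frequency the governor is running now
mx = read_khz('cpuinfo_max_freq')    # hardware maximum (turbo ceiling)
if cur and mx:
    print(f"cpu0: {cur / 1e6:.2f} GHz now, up to {mx / 1e6:.2f} GHz")
else:
    print("cpufreq not exposed (VM, container, or non-Linux)")
```

Run it once while idle and again while a busy loop pins the core: the current frequency should climb toward the maximum as the governor ramps up.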
Today's server CPUs (e.g. AMD EPYC Genoa) reach up to 96 cores, 192 threads, 12 memory channels, and 384 MB of L3. Each core has a wide decode front end, a reorder buffer several hundred uops deep, a dozen-odd execution ports, multiple load/store units, 32 KB L1i, 32 KB L1d, and 1 MB L2. The chip handles power management, thermal throttling, memory error correction, and security mitigations all at once, while executing your database query at hundreds of billions of operations per second.
# Explore CPU topology with Python
import os
import platform
import subprocess

def cpu_info():
    """Cross-platform CPU information."""
    info = {}
    if platform.system() == 'Linux':
        result = subprocess.run(['lscpu'], capture_output=True, text=True)
        for line in result.stdout.splitlines():
            if ':' in line:
                key, _, val = line.partition(':')
                info[key.strip()] = val.strip()
        keys = ['Architecture', 'CPU(s)', 'Thread(s) per core', 'Core(s) per socket',
                'Socket(s)', 'CPU MHz', 'CPU max MHz', 'L1d cache', 'L1i cache',
                'L2 cache', 'L3 cache']
        for k in keys:
            if k in info:
                print(f"  {k:25s}: {info[k]}")
    elif platform.system() == 'Darwin':  # macOS
        for cmd, label in [
            (['sysctl', '-n', 'hw.physicalcpu'], 'Physical cores'),
            (['sysctl', '-n', 'hw.logicalcpu'], 'Logical CPUs'),
            (['sysctl', '-n', 'hw.l1dcachesize'], 'L1d cache (bytes)'),
            (['sysctl', '-n', 'hw.l2cachesize'], 'L2 cache (bytes)'),
            (['sysctl', '-n', 'hw.l3cachesize'], 'L3 cache (bytes)'),
            (['sysctl', '-n', 'hw.cpufrequency_max'], 'Max frequency (Hz)'),
        ]:
            r = subprocess.run(cmd, capture_output=True, text=True)
            if r.returncode == 0:
                print(f"  {label:25s}: {r.stdout.strip()}")

cores = os.cpu_count()
print(f"\n  os.cpu_count() = {cores} (logical CPUs visible to OS)")
print("Your CPU:")
cpu_info()
Try a quick micro-benchmark from the shell:

python3 -c "import timeit; print(timeit.timeit('sum(range(10**7))', number=10))"

Then run perf stat python3 benchmark.py to see IPC (instructions per cycle). A well-optimized program hits 3–4 IPC on modern CPUs.

Exercise: use perf report (Linux) or Instruments (macOS) to profile a real program you use or write. Find the hottest function, the one consuming the most CPU time, and look at its assembly. Is it vectorized? Are there cache misses? Write a paragraph describing what you found and one thing you would try to improve it.
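The exercise asks whether your hottest function is vectorized. You can feel the same effect without leaving Python: a builtin like sum runs a tight C loop, while an interpreted for-loop pays bytecode-dispatch overhead on every iteration. A small timing sketch (absolute numbers vary by machine; only the ratio matters):

```python
import timeit

n = 10 ** 5
# Interpreted loop: one bytecode dispatch per iteration.
loop_time = timeit.timeit('t = 0\nfor i in range(n): t += i',
                          globals={'n': n}, number=30)
# Builtin sum over a range: the loop runs inside the C runtime.
builtin_time = timeit.timeit('sum(range(n))', globals={'n': n}, number=30)
print(f"python loop: {loop_time:.4f}s  builtin sum: {builtin_time:.4f}s  "
      f"ratio: {loop_time / builtin_time:.1f}x")
```

The builtin typically comes out several times faster, the same gap perf would show between a scalar hot loop and its vectorized rewrite, just one level up the stack.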