Modern CPUs can process several data elements with a single instruction using SIMD (Single Instruction, Multiple Data). The SSE and AVX registers enable parallel arithmetic, which is critical for signal processing, graphics, and cryptography. Today introduces the XMM/YMM registers and floating-point operations.
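SSE introduced eight 128-bit XMM registers (XMM0-XMM7); x86-64 extends the set to 16 (XMM0-XMM15). Each register can hold 4 floats and, since SSE2, 2 doubles, 4 32-bit integers, 8 16-bit shorts, or 16 bytes. MOVAPS (aligned) and MOVUPS (unaligned) load and store XMM registers from and to memory; MOVAPS faults if the address is not 16-byte aligned. ADDPS adds 4 floats in parallel. MULPD multiplies 2 doubles in parallel. The two-letter suffix encodes two things. The first letter is the mode: 'P' means packed (all lanes), 'S' means scalar (only lane 0). The second letter is the element type: 'S' for single precision, 'D' for double precision. So ADDPS is packed single, ADDSS is scalar single, and MULPD is packed double.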
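To make the packed/scalar distinction concrete, here is a minimal sketch that applies ADDPS and ADDSS to the same inputs. The file name, the labels v1 and v2, and the build command are assumptions for illustration (Linux x86-64, NASM, static link with ld); the program produces no output, so inspect the registers in a debugger such as gdb.

```nasm
; packed_vs_scalar.asm: ADDPS vs ADDSS on identical inputs (illustrative sketch)
; Assumed build: nasm -f elf64 packed_vs_scalar.asm && ld -o packed_vs_scalar packed_vs_scalar.o
default rel
section .data
align 16
v1 dd 1.0, 2.0, 3.0, 4.0
v2 dd 10.0, 20.0, 30.0, 40.0
section .text
global _start
_start:
    movaps xmm0, [v1]       ; xmm0 = {1,2,3,4}
    movaps xmm1, [v2]       ; xmm1 = {10,20,30,40}
    movaps xmm2, xmm0       ; xmm2 = copy of xmm0 for the scalar case
    addps  xmm0, xmm1       ; packed: xmm0 = {11,22,33,44}
    addss  xmm2, xmm1       ; scalar: xmm2 = {11,2,3,4}, upper three lanes untouched
    mov    eax, 60          ; sys_exit
    xor    edi, edi         ; status 0
    syscall
```

Note that the scalar form leaves the upper lanes of the destination unchanged rather than zeroing them.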
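AVX widened the 128-bit XMM registers to 256-bit YMM registers: 8 floats or 4 doubles per register. Each XMM register is the low half of the YMM register with the same number. AVX-512 added 512-bit ZMM registers: 16 floats or 8 doubles. Most Intel and AMD CPUs from roughly the last decade support AVX2; AVX-512 support is far less uniform. Compilers can auto-vectorize suitable loops to use these registers; with GCC, -march=native -O3 -fopt-info-vec-optimized reports which loops were vectorized. Hand-written SIMD can deliver speedups of roughly 4-16x on regular data patterns, depending on element width and memory bandwidth.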
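As a rough sketch of what the wider registers buy you, the program below adds two float[8] arrays with a single VADDPS. The labels x, y, and result are hypothetical, and the code assumes a CPU and OS with AVX enabled; on a machine without AVX it would fault with an illegal-instruction error.

```nasm
; avx_add.asm: add two float[8] arrays with one AVX instruction (illustrative sketch)
; Assumed build: nasm -f elf64 avx_add.asm && ld -o avx_add avx_add.o
default rel
section .data
align 32
x dd 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
y dd 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0
section .bss
alignb 32
result resd 8                   ; room for 8 floats
section .text
global _start
_start:
    vmovaps ymm0, [x]           ; load 8 floats; VMOVAPS needs 32-byte alignment
    vaddps  ymm0, ymm0, [y]     ; three-operand form: every lane becomes 9.0
    vmovaps [result], ymm0      ; store all 8 sums
    vzeroupper                  ; clear upper YMM halves before any legacy SSE code
    mov     eax, 60             ; sys_exit
    xor     edi, edi            ; status 0
    syscall
```

AVX instructions carry a V prefix and take a separate destination operand, so the sources do not have to be overwritten.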
x87 is the legacy floating-point unit with an 8-register stack (ST0-ST7) using 80-bit extended precision. Modern code uses SSE2 scalar instructions instead: MOVSD (move a scalar double), ADDSD, MULSD, DIVSD, SQRTSD. The System V x86-64 ABI (Linux, macOS) passes the first eight floating-point arguments in XMM0-XMM7 and returns floating-point results in XMM0; the Windows x64 convention uses XMM0-XMM3 for arguments. When writing functions that return doubles, put the result in XMM0.
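The calling convention is easiest to see in a small leaf function. The sketch below is a hypothetical lerp routine (the name and signature are assumptions, not something from this lesson) that takes three doubles and returns one, assuming the System V convention used on Linux and macOS.

```nasm
; lerp.asm: double lerp(double a, double b, double t), returns a + t*(b - a)
; Assumes System V AMD64: a in XMM0, b in XMM1, t in XMM2, result in XMM0.
section .text
global lerp
lerp:
    subsd xmm1, xmm0        ; xmm1 = b - a
    mulsd xmm1, xmm2        ; xmm1 = t * (b - a)
    addsd xmm0, xmm1        ; xmm0 = a + t * (b - a), the return value
    ret
```

From C it could be declared as extern double lerp(double a, double b, double t); and linked against the object file produced by nasm -f elf64 lerp.asm. No stack frame is needed because the function only touches caller-saved XMM registers.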
; vec_dot.asm: dot product of two float[4] arrays using SSE
section .data
align 16
a dd 1.0, 2.0, 3.0, 4.0 ; float array
b dd 5.0, 6.0, 7.0, 8.0
section .text
global _start
_start:
movaps xmm0, [a] ; xmm0 = {1,2,3,4}
movaps xmm1, [b] ; xmm1 = {5,6,7,8}
mulps xmm0, xmm1 ; xmm0 = {5,12,21,32}
; Horizontal sum: fold 4 lanes to 2, then 2 lanes to 1
movaps xmm2, xmm0
shufps xmm2, xmm2, 0x4e ; swap hi/lo 64-bit halves: xmm2 = {21,32,5,12}
addps xmm0, xmm2 ; xmm0 = {26,44,26,44}
movaps xmm2, xmm0
shufps xmm2, xmm2, 0x11 ; bring lane 1 to lane 0: xmm2 = {44,26,44,26}
addss xmm0, xmm2 ; xmm0[0] = 26 + 44 = 70.0 (dot product)
; Exit
mov rax, 60
xor rdi, rdi
syscall
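For comparison, the horizontal sum can also be written with the SSE3 instruction HADDPS, which adds adjacent pairs of lanes. This is a sketch of an alternative, not a claim that it is faster: HADDPS is shorter to write, but on many CPUs the shuffle-and-add sequence performs as well or better. The file name is an assumption, and the program returns the result as its exit status only to make it observable.

```nasm
; vec_dot_hadd.asm: same dot product, horizontal sum via SSE3 HADDPS (illustrative sketch)
; Assumed build: nasm -f elf64 vec_dot_hadd.asm && ld -o vec_dot_hadd vec_dot_hadd.o
default rel
section .data
align 16
a dd 1.0, 2.0, 3.0, 4.0
b dd 5.0, 6.0, 7.0, 8.0
section .text
global _start
_start:
    movaps xmm0, [a]        ; xmm0 = {1,2,3,4}
    mulps  xmm0, [b]        ; memory operand must be 16-byte aligned; xmm0 = {5,12,21,32}
    haddps xmm0, xmm0       ; pairwise sums: xmm0 = {17,53,17,53}
    haddps xmm0, xmm0       ; xmm0 = {70,70,70,70}
    cvttss2si edi, xmm0     ; edi = 70, truncated float to int, used as exit status
    mov    eax, 60          ; sys_exit
    syscall
```

After running the program, echo $? in the shell should print 70.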
Write an assembly function that computes the Euclidean distance between two 3D vectors (sqrt of sum of squared differences) using SSE instructions. Return the result in XMM0 as a double.