Day 4 of 5
⏱ ~60 minutes
Assembly Language in 5 Days — Day 4

SIMD & Floating Point

Modern CPUs process multiple data in one instruction via SIMD (Single Instruction, Multiple Data). SSE and AVX registers enable parallel arithmetic — critical for signal processing, graphics, and crypto. Today introduces XMM/YMM registers and floating-point operations.

XMM Registers and SSE

SSE introduced the 128-bit XMM registers; in 64-bit mode there are 16 of them (XMM0-XMM15 — 32-bit mode exposes only XMM0-XMM7). Each can hold: 2 doubles, 4 floats, 4 32-bit integers, 8 16-bit shorts, or 16 bytes. MOVAPS (aligned) and MOVUPS (unaligned) load/store an XMM register from memory. ADDPS adds 4 floats in parallel; MULPD multiplies 2 doubles in parallel. The suffix encodes the shape: the first letter is P (packed — all lanes) or S (scalar — lane 0 only), and the second is S (single precision) or D (double precision). So ADDPS adds packed singles, while ADDSD adds scalar doubles.

AVX and AVX-512

AVX widened the registers to 256-bit YMM: 8 floats or 4 doubles per register. AVX-512 widened them again to 512-bit ZMM (16 floats or 8 doubles) and raised the count to 32 registers in 64-bit mode. Virtually all current Intel and AMD CPUs support AVX2. Compiler auto-vectorization uses these registers automatically — with GCC you can see it in the compiler output via -march=native -O3 -fopt-info-vec-optimized. Hand-written SIMD can deliver 4-16x speedups on regular, data-parallel workloads.

x87 and SSE Scalar Floating Point

x87 is the legacy floating-point unit with an 8-register stack (ST0-ST7) using 80-bit extended precision. Modern code uses SSE2 scalar instructions instead: MOVSD (load double), ADDSD, MULSD, DIVSD, SQRTSD. The x86-64 System V ABI passes the first eight floating-point arguments in XMM0-XMM7 and returns floating-point results in XMM0. When writing functions that return doubles, put the result in XMM0.

; vec_dot.asm: dot product of two float[4] arrays using SSE
section .data
    align 16
    a  dd 1.0, 2.0, 3.0, 4.0   ; float array
    b  dd 5.0, 6.0, 7.0, 8.0

section .text
    global _start

_start:
    movaps xmm0, [a]      ; xmm0 = {1,2,3,4}
    movaps xmm1, [b]      ; xmm1 = {5,6,7,8}
    mulps  xmm0, xmm1     ; xmm0 = {5,12,21,32}

    ; Horizontal sum: add pairs
    movaps xmm2, xmm0
    shufps xmm2, xmm2, 0x4e  ; swap hi/lo 64-bit halves
    addps  xmm0, xmm2         ; xmm0[0..1] = {26, 44}
    movaps xmm2, xmm0
    shufps xmm2, xmm2, 0x11  ; bring lane 1 to lane 0
    addss  xmm0, xmm2         ; xmm0[0] = 70.0 (dot product)

    ; Exit
    mov rax, 60
    xor rdi, rdi
    syscall
💡
Always align SIMD data to 16-byte boundaries (for SSE) or 32-byte (for AVX) using 'align 16' in .data. Unaligned loads with MOVAPS cause general protection faults. Use MOVUPS if you cannot guarantee alignment.
📝 Day 4 Exercise
Vectorized Array Addition
  1. Write a function that adds two float[8] arrays in two 4-float chunks — each chunk takes two MOVAPS loads, one ADDPS, and one MOVAPS store
  2. Write a scalar version with a loop for comparison
  3. Time both versions with a loop that runs each 100 million times
  4. Use objdump -d to see the compiled code for the C scalar equivalent with -O3
  5. Extend to process float[1024] arrays by looping over 16-byte (4-float) blocks

Day 4 Summary

  • XMM registers are 128-bit: hold 4 floats, 2 doubles, or 16 bytes
  • YMM (AVX) doubles to 256-bit; ZMM (AVX-512) doubles again to 512-bit
  • ADDPS, MULPS operate on all lanes simultaneously (SIMD parallelism)
  • Align SIMD data to 16 bytes; use MOVAPS for aligned, MOVUPS for unaligned
  • SSE2 scalar instructions (ADDSD, MULSD) are the standard for floating-point math under the x86-64 ABI
Challenge

Write an assembly function that computes the Euclidean distance between two 3D vectors (sqrt of sum of squared differences) using SSE instructions. Return the result in XMM0 as a double.
