Key Takeaways
- Every neural network layer is a matrix multiplication — linear algebra is ML's operating language
- Vectors represent data points; matrices represent datasets and transformations
- Dot product measures similarity — the foundation of attention mechanisms and cosine similarity in embeddings
- Eigenvectors of the covariance matrix are the principal components in PCA
- SVD decomposes any matrix — the basis for recommender systems, NLP embeddings, and compression
Machine Learning Is Linear Algebra at Scale
You don't need to prove theorems to do machine learning. But you do need to understand what's happening when you write model.fit(X, y). The answer is linear algebra, applied millions of times per second on GPU hardware designed specifically for it.
When you train a neural network, the forward pass multiplies weight matrices by input vectors. Backpropagation computes gradients using matrix operations (Jacobians). Attention in transformer models is a scaled dot-product of query and key matrices. Embeddings are vectors in high-dimensional space. All of this is linear algebra.
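To make that last point concrete, here is a minimal sketch of scaled dot-product attention in NumPy (toy shapes; Q, K, V are the standard query/key/value matrices, and the sizes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) similarity matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query vectors, dimension 8
K = rng.normal(size=(6, 8))  # 6 key vectors
V = rng.normal(size=(6, 8))  # 6 value vectors
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Every step is a matrix multiplication (plus a softmax), which is the whole point: attention is linear algebra with one non-linearity.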
Vectors: Data Points in Space
A vector is an ordered list of numbers: v = [3, -1, 4, 1, 5]. In ML, vectors represent data points. An image with 784 pixels is a 784-dimensional vector. A word or text embedding is a vector too: Word2Vec vectors are typically 300-dimensional, while OpenAI's text-embedding-ada-002 returns 1536 dimensions. A user in a recommendation system is a vector in some latent feature space.
Key vector operations:
```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Addition: [5, 7, 9]
print(a + b)

# Scalar multiplication: [2, 4, 6]
print(2 * a)

# Dot product: 1*4 + 2*5 + 3*6 = 32
print(np.dot(a, b))

# Magnitude (L2 norm): sqrt(1² + 2² + 3²)
print(np.linalg.norm(a))  # 3.742

# Cosine similarity (measures angle between vectors)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 0.974 — very similar direction
```
The dot product and cosine similarity are why embedding-based search works. When you search for "king - man + woman" in word embeddings, you're doing vector arithmetic. When you find similar documents, you compute cosine similarity between their embedding vectors.
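A minimal sketch of embedding search with made-up 3-dimensional "embeddings" (real embeddings have hundreds or thousands of dimensions; the vectors and labels here are invented for illustration):

```python
import numpy as np

# Toy "document embeddings" (invented values, 3-D for readability)
docs = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}
query = np.array([0.85, 0.15, 0.05])  # something cat/dog-like

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Rank documents by cosine similarity to the query
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)  # "car" lands last: its direction differs most from the query
```

Swap the toy vectors for real embedding-model outputs and this is, in essence, how semantic search ranks results.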
Matrices: Datasets and Transformations
A matrix is a 2D array of numbers. In ML: a dataset with n samples and d features is an n×d matrix X. The weight matrix of a neural network layer is an h×d matrix W (h=hidden units, d=input dimensions).
```python
# A dataset: 4 samples, 3 features each
X = np.array([
    [1.2, 0.5, 3.1],
    [2.3, 1.1, 0.8],
    [0.9, 2.2, 4.5],
    [3.1, 0.3, 1.2],
])  # shape: (4, 3)

# Weight matrix for a layer with 2 neurons
W = np.array([
    [0.5, -0.3, 0.8],
    [1.2, 0.6, -0.4],
])  # shape: (2, 3)

# Forward pass: output = X @ W.T
output = X @ W.T  # shape: (4, 2) — 4 samples, 2 neuron outputs
print(output.shape)  # (4, 2)
```
Matrix Multiplication Is the Core of Neural Networks
A neural network layer with weights W, input x, and bias b computes: output = activation(W @ x + b). This is just matrix multiplication plus a pointwise non-linearity.
For a batch of inputs (processing multiple samples simultaneously), the shapes depend on convention. With column vectors, W is (output_size × input_size), the batch is (input_size × batch_size), and the output is (output_size × batch_size); NumPy code more often stores samples as rows and computes X @ W.T, as in the example below. Either way, GPUs are optimized specifically for this operation: thousands of multiply-accumulate operations in parallel.
```python
# Two-layer neural network forward pass
def forward(X, W1, b1, W2, b2):
    # Layer 1
    z1 = X @ W1.T + b1      # Linear transformation
    a1 = np.maximum(0, z1)  # ReLU activation
    # Layer 2
    z2 = a1 @ W2.T + b2     # Linear transformation
    output = 1 / (1 + np.exp(-z2))  # Sigmoid for classification
    return output
```
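To make the shapes concrete, here is that two-layer forward pass exercised on random data (the forward definition is repeated so the snippet runs on its own; the layer sizes are arbitrary):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    z1 = X @ W1.T + b1              # (n, hidden)
    a1 = np.maximum(0, z1)          # ReLU
    z2 = a1 @ W2.T + b2             # (n, 1)
    return 1 / (1 + np.exp(-z2))    # sigmoid

rng = np.random.default_rng(42)
n, d, h = 5, 3, 4                   # 5 samples, 3 features, 4 hidden units
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(h, d)) * 0.1  # (hidden, input)
b1 = np.zeros(h)
W2 = rng.normal(size=(1, h)) * 0.1  # (output, hidden)
b2 = np.zeros(1)

probs = forward(X, W1, b1, W2, b2)
print(probs.shape)  # (5, 1): one probability per sample
```

Tracing the shapes through each matmul, (5, 3) @ (3, 4) → (5, 4) @ (4, 1) → (5, 1), is the fastest way to debug a network that won't run.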
Linear Transformations: What Matrices Do to Space
A matrix represents a linear transformation — it maps vectors from one space to another. Understanding what transformations matrices represent makes neural networks less mysterious.
- Scaling matrix — Stretches or compresses space along axes
- Rotation matrix — Rotates space around the origin
- Projection matrix — Projects high-dimensional data onto a lower-dimensional subspace (what dimensionality reduction does)
- Identity matrix I — Does nothing: Ix = x for all x
The idea that "neural networks learn useful representations" means: each layer applies a learned transformation that makes the data easier to classify. Early layers learn simple features (edges in images); later layers combine these into complex concepts (faces, objects).
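The transformations in the list above can be checked numerically; a small sketch with a single 2-D vector:

```python
import numpy as np

v = np.array([1.0, 1.0])

scale = np.array([[2.0, 0.0],
                  [0.0, 0.5]])    # stretch x, compress y
theta = np.pi / 2
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])  # 90° rotation
project = np.array([[1.0, 0.0],
                    [0.0, 0.0]])  # project onto the x-axis
I = np.eye(2)                     # identity

print(scale @ v)    # [2.  0.5]
print(rotate @ v)   # ≈ [-1.  1.]
print(project @ v)  # [1. 0.]
print(I @ v)        # [1. 1.]
```

A trained weight matrix is generally some mix of these effects at once, which is why eigenvectors and singular values (below in this article's sense: PCA and SVD) are useful for understanding what a matrix actually does.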
Eigenvalues and Eigenvectors: Principal Directions
For a square matrix A, an eigenvector v satisfies: Av = λv. Matrix A times v just scales v — it doesn't change direction. λ is the eigenvalue.
```python
A = np.array([[3, 1],
              [0, 2]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)      # [3. 2.]
print("Eigenvectors:\n", eigenvectors)  # columns are eigenvectors
```
PCA (Principal Component Analysis) uses eigenvalues and eigenvectors to reduce dimensionality:
- Center the data (subtract mean)
- Compute the covariance matrix (d×d matrix capturing how features vary together)
- Find eigenvectors of the covariance matrix — these are the principal components (directions of maximum variance)
- Project data onto the top k eigenvectors to reduce from d dimensions to k dimensions
The eigenvalue tells you how much variance each component captures. Sort by eigenvalue descending — keep the top k to retain the most information.
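The four steps above, as a minimal NumPy sketch (random data stretched along one axis so the principal directions are obvious; scikit-learn's PCA does the same job, typically via SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
# Anisotropic data: much more variance along the first axis
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])

# 1. Center the data
Xc = X - X.mean(axis=0)
# 2. Covariance matrix (d x d)
cov = np.cov(Xc, rowvar=False)
# 3. Eigendecomposition; eigh is the right call for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(cov)
# Sort by eigenvalue, descending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Project onto the top k eigenvectors
k = 2
X_reduced = Xc @ eigvecs[:, :k]

print(X_reduced.shape)          # (100, 2)
print(eigvals / eigvals.sum())  # fraction of variance per component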
SVD: The Swiss Army Knife of Linear Algebra
Singular Value Decomposition (SVD) decomposes any matrix M, square or not, into three matrices: M = U Σ Vᵀ, where U and V are orthogonal matrices and Σ is a (rectangular) diagonal matrix of singular values, non-negative and sorted in descending order.
SVD applications in ML:
- Recommender systems — Netflix, Spotify use matrix factorization (based on SVD) to find latent user and item factors. Decompose the user-item rating matrix, approximate with top k singular values.
- LSA (Latent Semantic Analysis) — Apply SVD to a term-document matrix to find latent topics. Precursor to modern NLP embeddings.
- Image compression — Approximate an image matrix with the top k singular values. Keeping a few dozen singular values out of hundreds often preserves most of the visible detail at a small fraction of the storage.
- Pseudoinverse — SVD yields the Moore–Penrose pseudoinverse, which solves overdetermined systems (more equations than unknowns) in the least-squares sense, which is exactly what linear regression does.
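A small sketch of truncation, the idea behind the first three applications: rebuild a matrix from only its top k singular values (a random matrix here; a real recommender factors a sparse ratings matrix with considerably more care):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(20, 10))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(s[:3])  # largest singular values, in descending order

# Rank-k approximation: keep only the top k singular triplets
k = 3
M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# By the Eckart–Young theorem, this is the best rank-k
# approximation of M in the Frobenius norm
err = np.linalg.norm(M - M_k) / np.linalg.norm(M)
print(f"relative error at rank {k}: {err:.3f}")
```

Storing U[:, :k], s[:k], and Vt[:k, :] takes k(m + n + 1) numbers instead of mn, which is where the compression comes from.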
Learn the Math Behind AI at Precision AI Academy
Our bootcamp bridges linear algebra, calculus, and statistics directly to hands-on ML projects — so the math makes sense in context. Five cities, October 2026.
Frequently Asked Questions
Why does machine learning need linear algebra?
Every neural network layer is a matrix multiplication. Datasets are matrices. Training is matrix operations at scale. Attention, embeddings, PCA, SVD — all linear algebra. You can't deeply understand ML without it.
What is matrix multiplication and why is it central to deep learning?
Matrix multiply combines two matrices. In neural nets, each layer computes output = W @ x + b. With batched inputs, you process all samples in parallel — this is why GPU hardware (optimized for matmul) is essential for training.
What are eigenvectors and eigenvalues, and where are they used in ML?
Eigenvectors are directions a matrix doesn't rotate — only scales. Used in PCA (eigenvectors of covariance matrix = principal components), PageRank (dominant eigenvector of web graph), and spectral clustering.