Key Takeaways
- Nearly every neural network layer is built around a matrix multiplication, so linear algebra is ML's operating language
- Vectors represent data points; matrices represent datasets and transformations
- Dot product measures similarity — the foundation of attention mechanisms and cosine similarity in embeddings
- Eigenvectors of the covariance matrix are the principal components in PCA
- SVD decomposes any matrix — the basis for recommender systems, NLP embeddings, and compression
Machine Learning Is Linear Algebra at Scale
You don't need to prove theorems to do machine learning. But you do need to understand what's happening when you write model.fit(X, y). The answer is linear algebra, applied millions of times per second on GPU hardware designed specifically for it.
When you train a neural network, the forward pass multiplies weight matrices by input vectors. Backpropagation computes gradients using matrix operations (Jacobians). Attention in transformer models is a scaled dot-product of query and key matrices. Embeddings are vectors in high-dimensional space. All of this is linear algebra.
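To make the attention claim concrete, here is a minimal NumPy sketch of scaled dot-product attention. The sizes (4 tokens, dimension 8), the random inputs, and the helper names `softmax` and `attention` are assumptions chosen for illustration, not taken from any particular model.

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise max before exponentiating for numerical stability
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Dot product of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)   # shape: (n_queries, n_keys)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, weights = attention(Q, K, V)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4)
```

Each output row is a weighted average of the rows of V, with the weights derived from query-key dot products.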
Vectors: Data Points in Space
A vector is an ordered list of numbers: v = [3, -1, 4, 1, 5]. In ML, vectors represent data points. An image with 784 pixels (for example, 28×28 grayscale) is a 784-dimensional vector. An embedding is also a vector: Word2Vec word embeddings are commonly 300-dimensional, while OpenAI's text-embedding-ada-002 returns 1536-dimensional vectors. A user in a recommendation system is a vector in some latent feature space.
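As a small illustration of "an image is a vector", the sketch below flattens a 28×28 array into 784 numbers. The 28×28 shape and the all-zero pixel values are placeholders assumed for the example.

```python
import numpy as np

# Hypothetical 28x28 grayscale image; zeros stand in for real pixel data
image = np.zeros((28, 28))

# Flatten the rows into one long vector
vector = image.reshape(-1)
print(vector.shape)  # (784,)
```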
Key vector operations:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Addition: [5, 7, 9]
print(a + b)
# Scalar multiplication: [2, 4, 6]
print(2 * a)
# Dot product: 1*4 + 2*5 + 3*6 = 32
print(np.dot(a, b))
# Magnitude (L2 norm): sqrt(1² + 2² + 3²)
print(np.linalg.norm(a)) # 3.742
# Cosine similarity (measures angle between vectors)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim) # 0.975, very similar direction
The dot product and cosine similarity are why embedding-based search works. When you search for "king - man + woman" in word embeddings, you're doing vector arithmetic. When you find similar documents, you compute cosine similarity between their embedding vectors.
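The analogy arithmetic can be sketched with a toy vocabulary. The four-dimensional embeddings below are hand-made for illustration (real embeddings are learned and far larger), and `cosine_similarity` is a helper defined here, not a library function.

```python
import numpy as np

# Hypothetical embeddings; dimensions loosely mean [royalty, male, female, fruit]
embeddings = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Vector arithmetic: king - man + woman
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# Rank every word by cosine similarity to the target vector
ranked = sorted(
    embeddings,
    key=lambda word: cosine_similarity(target, embeddings[word]),
    reverse=True,
)
print(ranked[0])  # queen
```

Embedding search works the same way at scale: embed the query, then return the stored vectors with the highest cosine similarity.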
Matrices: Datasets and Transformations
A matrix is a 2D array of numbers. In ML: a dataset with n samples and d features is an n×d matrix X. The weight matrix of a neural network layer is an h×d matrix W (h=hidden units, d=input dimensions).
# A dataset: 4 samples, 3 features each
X = np.array([
    [1.2, 0.5, 3.1],
    [2.3, 1.1, 0.8],
    [0.9, 2.2, 4.5],
    [3.1, 0.3, 1.2],
])  # shape: (4, 3)

# Weight matrix for a layer with 2 neurons
W = np.array([
    [0.5, -0.3, 0.8],
    [1.2, 0.6, -0.4],
])  # shape: (2, 3)

# Forward pass: output = X @ W.T
output = X @ W.T  # shape: (4, 2), i.e. 4 samples, 2 neuron outputs
print(output.shape)  # (4, 2)
Matrix Multiplication Is the Core of Neural Networks
A neural network layer with weights W, input x, and bias b computes: output = activation(W @ x + b). This is just matrix multiplication plus a pointwise non-linearity.
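For a batch of inputs (processing multiple samples simultaneously), stack the samples as columns: W is (output_size × input_size), the batch is (input_size × batch_size), and the output is (output_size × batch_size). The code in this article stores one sample per row instead, so it computes X @ W.T and gets a (batch_size × output_size) result; the arithmetic is identical. GPUs are built to run thousands of these multiply-accumulate operations in parallel.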
# Two-layer neural network forward pass
def forward(X, W1, b1, W2, b2):
    # Layer 1
    z1 = X @ W1.T + b1              # Linear transformation
    a1 = np.maximum(0, z1)          # ReLU activation
    # Layer 2
    z2 = a1 @ W2.T + b2             # Linear transformation
    output = 1 / (1 + np.exp(-z2))  # Sigmoid for classification
    return output
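The column-per-sample shapes described in the text and the row-per-sample layout used in the code give the same numbers, just transposed. The following sketch checks that on random data; the sizes (4 samples, 3 inputs, 2 outputs) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))        # (output_size, input_size)
X_rows = rng.normal(size=(4, 3))   # (batch_size, input_size), one sample per row
X_cols = X_rows.T                  # (input_size, batch_size), one sample per column

out_rows = X_rows @ W.T            # (batch_size, output_size)
out_cols = W @ X_cols              # (output_size, batch_size)

print(out_rows.shape, out_cols.shape)     # (4, 2) (2, 4)
print(np.allclose(out_rows, out_cols.T))  # True
```

Most Python ML libraries store one sample per row, which is why `X @ W.T` is the form you will usually see in code.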
Linear Transformations: What Matrices Do to Space
A matrix represents a linear transformation — it maps vectors from one space to another. Understanding what transformations matrices represent makes neural networks less mysterious.
- Scaling matrix — Stretches or compresses space along axes
- Rotation matrix — Rotates space around the origin
- Projection matrix — Projects high-dimensional data onto a lower-dimensional subspace (what dimensionality reduction does)
- Identity matrix I — Does nothing: Ix = x for all x
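Each of the transformations listed above can be written as a small matrix and applied to a vector. This is a sketch in two dimensions; the specific scale factors and the 90-degree angle are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([1.0, 1.0])

# Stretch the x-axis by 2 and compress the y-axis by half
scale = np.array([[2.0, 0.0],
                  [0.0, 0.5]])

# Rotate 90 degrees counterclockwise around the origin
theta = np.pi / 2
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])

# Project onto the x-axis by dropping the y-component
project = np.array([[1.0, 0.0],
                    [0.0, 0.0]])

identity = np.eye(2)

print(scale @ x)                # [2.  0.5]
print(np.round(rotate @ x, 6))  # [-1.  1.]
print(project @ x)              # [1. 0.]
print(identity @ x)             # [1. 1.]
```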
The idea that "neural networks learn useful representations" means: each layer applies a learned transformation that makes the data easier to classify. Early layers learn simple features (edges in images); later layers combine these into complex concepts (faces, objects).
Eigenvalues and Eigenvectors: Principal Directions
For a square matrix A, an eigenvector is a nonzero vector v that satisfies Av = λv. Multiplying by A only scales v: it stays on the same line through the origin (and flips direction if λ is negative). The scale factor λ is the eigenvalue.
A = np.array([[3, 1], [0, 2]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues) # [3. 2.]
print("Eigenvectors:\n", eigenvectors) # columns are eigenvectors
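A quick way to check the definition is to confirm that Av = λv holds for each pair returned by `np.linalg.eig`. This sketch reuses the same matrix A as above.

```python
import numpy as np

A = np.array([[3, 1], [0, 2]])
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]  # the i-th column is the i-th eigenvector
    # A @ v and lam * v should match up to floating-point error
    print(np.allclose(A @ v, lam * v))  # True
```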
PCA (Principal Component Analysis) uses eigenvalues and eigenvectors to reduce dimensionality:
- Center the data (subtract mean)
- Compute the covariance matrix (d×d matrix capturing how features vary together)
- Find eigenvectors of the covariance matrix — these are the principal components (directions of maximum variance)
- Project data onto the top k eigenvectors to reduce from d dimensions to k dimensions
The eigenvalue tells you how much variance each component captures. Sort by eigenvalue descending — keep the top k to retain the most information.
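The four PCA steps can be sketched directly in NumPy. The synthetic data (200 samples, 3 features with deliberately unequal spread) and k = 2 are assumptions for the example. The sketch uses `np.linalg.eigh` rather than `eig` because a covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features, most variance along the first feature
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (features are columns, so rowvar=False)
cov = np.cov(X_centered, rowvar=False)  # shape: (3, 3)

# 3. Eigen-decomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reorder to descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Project onto the top k principal components
k = 2
X_reduced = X_centered @ eigenvectors[:, :k]  # shape: (200, 2)

# Fraction of total variance captured by each component, largest first
explained = eigenvalues / eigenvalues.sum()
print(X_reduced.shape)  # (200, 2)
print(explained)
```

In practice a library implementation such as scikit-learn's `PCA` is the usual choice; the point here is only to show that the steps are a handful of matrix operations.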
SVD: The Swiss Army Knife of Linear Algebra
Singular Value Decomposition (SVD) decomposes any matrix M into three matrices: M = U Σ Vᵀ where U and V are orthogonal matrices, and Σ is a diagonal matrix of singular values (non-negative, in descending order).
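The decomposition can be computed and checked with `np.linalg.svd`. The 2×3 matrix below is an arbitrary example; note that NumPy returns the singular values as a 1D array and returns Vᵀ rather than V.

```python
import numpy as np

M = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])  # shape: (2, 3)

# full_matrices=False returns the compact (economy-size) factors
U, s, Vt = np.linalg.svd(M, full_matrices=False)

print(U.shape, s.shape, Vt.shape)  # (2, 2) (2,) (2, 3)
print(s)                           # singular values, largest first

# Rebuild M from its factors
print(np.allclose(U @ np.diag(s) @ Vt, M))  # True
```

With the compact form on a non-square matrix, U and Vᵀ are not both square: here Vᵀ has orthonormal rows but is 2×3.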
SVD applications in ML:
- Recommender systems: matrix factorization methods related to SVD, popularized by the Netflix Prize, find latent user and item factors. Decompose the user-item rating matrix and approximate it with the top k singular values.
- LSA (Latent Semantic Analysis): apply SVD to a term-document matrix to find latent topics. A precursor to modern NLP embeddings.
- Image compression: approximate an image matrix with the top k singular values. For a 1000×1000 image, keeping 50 singular values means storing about 50 × (1000 + 1000 + 1) ≈ 100,000 numbers instead of 1,000,000, roughly 10% of the original. How much visual fidelity survives depends on the image.
- Pseudoinverse: SVD gives the least-squares solution of overdetermined systems (more equations than unknowns), which is the problem linear regression solves.
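Two of these applications fit in a short sketch: a rank-k approximation (the idea behind the compression and recommender bullets) and least squares via the pseudoinverse. The matrix sizes, k = 10, the `true_w` coefficients, and the noise level are all assumptions picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Rank-k approximation ---
M = rng.normal(size=(100, 80))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 10
# Keep only the top k singular values and their singular vectors
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

stored_full = M.size                          # 100 * 80 = 8000 numbers
stored_k = k * (M.shape[0] + M.shape[1] + 1)  # 10 * 181 = 1810 numbers
rel_error = np.linalg.norm(M - M_k) / np.linalg.norm(M)
print(stored_k / stored_full)  # 0.22625
print(rel_error)               # large here: random noise has no low-rank structure

# --- Least squares via the pseudoinverse ---
A = rng.normal(size=(50, 3))                 # 50 equations, 3 unknowns
true_w = np.array([2.0, -1.0, 0.5])          # hypothetical coefficients
y = A @ true_w + 0.01 * rng.normal(size=50)  # noisy targets

w_pinv = np.linalg.pinv(A) @ y               # pinv is computed from the SVD
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(w_pinv, w_lstsq))          # True
```

Real images and rating matrices usually have rapidly decaying singular values, which is why a small k can work far better on them than on the random matrix used here.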
Frequently Asked Questions
Why does machine learning need linear algebra?
Nearly every neural network layer is built around a matrix multiplication. Datasets are matrices. Training is matrix operations at scale. Attention, embeddings, PCA, and SVD are all linear algebra. You can't deeply understand ML without it.
What is matrix multiplication and why is it central to deep learning?
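Matrix multiplication combines an m×n matrix with an n×p matrix into an m×p matrix whose entries are dot products of rows of the first with columns of the second. In neural nets, each layer computes output = W @ x + b. With batched inputs, you process all samples in parallel, which is why GPU hardware (optimized for matmul) is essential for training.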
What are eigenvectors and eigenvalues, and where are they used in ML?
Eigenvectors are the directions a matrix only scales and does not rotate. They are used in PCA (eigenvectors of the covariance matrix are the principal components), PageRank (the dominant eigenvector of the web graph's link matrix), and spectral clustering.
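The PageRank idea can be sketched with power iteration, which repeatedly multiplies a vector by the matrix until it settles on the dominant eigenvector. The three-page link structure below is hypothetical, and this simplified version omits the damping factor that real PageRank uses.

```python
import numpy as np

# Hypothetical 3-page web: page 0 links to pages 1 and 2,
# page 1 links to page 2, page 2 links to page 0.
# Column j holds the probabilities of moving from page j to each page.
P = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

# Start from a uniform distribution and apply P repeatedly
rank = np.full(3, 1 / 3)
for _ in range(100):
    rank = P @ rank
    rank = rank / rank.sum()  # keep it a probability distribution

print(np.round(rank, 3))  # [0.4 0.2 0.4]
```

The result satisfies P @ rank = rank, so it is an eigenvector of P with eigenvalue 1.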
Continue Learning
You need less linear algebra than you think to build with AI, and more to understand what breaks.
The linear algebra prerequisites for building AI applications are significantly lower than the linear algebra prerequisites for researching AI systems. A developer who can call OpenAI's API, design good prompts, and build reliable pipelines around LLM outputs needs almost no linear algebra day to day. The same developer who wants to understand why attention mechanisms scale quadratically with sequence length, how quantization affects model precision, or why a particular embedding space clusters the way it does — that developer genuinely needs matrix operations, vector spaces, and geometric intuition. These two use cases have different prerequisites, and conflating them overstates the math barrier for practitioners while underselling it for researchers.
The linear algebra concepts that have the highest return on investment specifically for AI practitioners (not researchers) are: vector similarity and distance metrics (cosine similarity is everywhere in embedding-based search), the geometric intuition of high-dimensional spaces (understanding why nearest-neighbor search degrades with dimensionality), and basic matrix multiplication intuition (enough to reason about attention head dimensions and parameter counts). This subset is learnable from scratch in a few weeks and unlocks genuine understanding of retrieval-augmented generation, embedding models, and transformer architecture at a level that is useful for debugging and design decisions.
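The claim that nearest-neighbor search degrades with dimensionality can be seen in a small experiment. The sketch below assumes uniformly random points in a unit hypercube, which is a simplification of real embedding data, and `relative_contrast` is a helper name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=1000):
    # Random points and one random query, all uniform in the unit hypercube
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    # How much farther the farthest point is than the nearest, relative to the nearest
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, relative_contrast(dim))
```

The printed contrast shrinks as the dimension grows: the nearest and farthest points end up at nearly the same distance, so "nearest" carries less information.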
3Blue1Brown's "Essence of Linear Algebra" series on YouTube is the reference starting point — it builds the geometric intuition that pure algebraic treatment misses, and it is free. For practitioners, that intuition is more valuable than mechanical matrix calculation fluency.