Key Takeaways
- A derivative measures how a function's output changes as its input changes — the slope at a point
- The gradient is the vector of all partial derivatives — it points toward steepest ascent
- The chain rule enables computing how changes propagate through composed functions — backpropagation is just chain rule applied to neural networks
- Gradient descent moves parameters in the direction opposite to the gradient, minimizing loss
- Adam optimizer adapts the learning rate per parameter using first and second moment estimates — the default for most deep learning
Training a Neural Network Is Just Solving an Optimization Problem
When you train a neural network, you're adjusting millions of parameters (weights) to minimize a loss function (cross-entropy, MSE, etc.). The loss measures how wrong the model's predictions are. Your goal: find the weight values that make loss as small as possible.
How do you find the minimum of a function with millions of variables? Calculus. The derivative tells you which direction makes the function decrease. The gradient generalizes this to multiple dimensions. Gradient descent iteratively steps opposite the gradient, moving downhill. Backpropagation efficiently computes the gradient using the chain rule. That's all of ML training, mathematically.
Derivatives: Rate of Change
The derivative of a function f(x) at point x tells you how f changes as x changes — the slope of f at that point. Written as f'(x) or df/dx.
Key derivatives to know for ML:
| Function | Derivative | ML Context |
|---|---|---|
| f(x) = x² | f'(x) = 2x | MSE loss derivative |
| f(x) = eˣ | f'(x) = eˣ | Exponential (softmax) |
| f(x) = ln(x) | f'(x) = 1/x | Log-likelihood loss |
| ReLU: max(0,x) | 0 if x<0, 1 if x>0 | Most common activation |
| σ(x) = 1/(1+e⁻ˣ) | σ(x)(1-σ(x)) | Sigmoid activation/output |
ReLU's derivative is especially clean: 0 for negative inputs, 1 for positive (undefined at exactly 0, where implementations conventionally use 0). This cheap derivative is one reason ReLU is computationally efficient and dominates modern neural network activations.
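The derivatives in the table can be spot-checked numerically with a central finite difference. A minimal sketch (the `numeric_deriv` helper and the test point are illustrative, not from any library):

```python
import math

def numeric_deriv(f, x, h=1e-6):
    """Central finite difference: (f(x+h) - f(x-h)) / 2h approximates f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

sigmoid = lambda x: 1 / (1 + math.exp(-x))

x = 0.7
# f(x) = x²  ->  f'(x) = 2x
print(numeric_deriv(lambda t: t * t, x), 2 * x)
# f(x) = ln(x)  ->  f'(x) = 1/x
print(numeric_deriv(math.log, x), 1 / x)
# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
print(numeric_deriv(sigmoid, x), sigmoid(x) * (1 - sigmoid(x)))
```

Each pair of printed values agrees to several decimal places, which is exactly the check (gradient checking) used to debug hand-written derivatives.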
Partial Derivatives and the Gradient
A neural network has millions of parameters, so its loss is a function of millions of variables: L(w₁, w₂, ..., wₙ). The partial derivative ∂L/∂wᵢ measures how loss changes when weight wᵢ changes, holding all other weights constant.
The gradient ∇L is the vector of all partial derivatives:
∇L = [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ]
The gradient has two key properties:
- It points in the direction of steepest ascent (the direction that increases loss most)
- Its magnitude indicates how steep the ascent is
To decrease loss, move in the opposite direction: negative gradient.
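To make both properties concrete, here is a small sketch with an illustrative two-parameter bowl-shaped loss; one step along the negative gradient reduces it:

```python
# Illustrative loss: L(w1, w2) = w1² + 3·w2², a bowl with its minimum at (0, 0)
def L(w1, w2):
    return w1**2 + 3 * w2**2

def grad_L(w1, w2):
    # Partial derivatives: ∂L/∂w1 = 2·w1, ∂L/∂w2 = 6·w2
    return (2 * w1, 6 * w2)

w1, w2 = 1.0, 2.0
g1, g2 = grad_L(w1, w2)  # (2.0, 12.0): points toward steepest ascent
lr = 0.05
w1_new, w2_new = w1 - lr * g1, w2 - lr * g2  # step along the negative gradient
print(L(w1, w2), L(w1_new, w2_new))  # loss drops
```

Note the larger partial derivative for w₂: the bowl is steeper in that direction, so the gradient's components tell you both direction and steepness per parameter.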
The Chain Rule: Derivatives of Composed Functions
If y = f(g(x)), then dy/dx = (dy/du)(du/dx) where u = g(x). This chains the derivatives together. When z depends on x through several intermediate variables y₁, ..., yₙ, the contributions sum over all paths: ∂z/∂x = Σᵢ (∂z/∂yᵢ)(∂yᵢ/∂x).
A neural network is a composition of functions. Layer 3's output depends on Layer 2's output, which depends on Layer 1's output, which depends on the input. The chain rule connects how loss changes with respect to early layer weights to how loss changes at the output.
Simple example: f(x) = (2x + 3)²
```python
# Let u = 2x + 3, so f = u²
# Chain rule: df/dx = df/du × du/dx
# df/du = 2u, du/dx = 2
# df/dx = 2(2x+3) × 2 = 4(2x+3)
# At x=1: df/dx = 4(2·1+3) = 4·5 = 20
```
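The same answer can be checked numerically. A quick sketch comparing the chain-rule derivative against a finite-difference slope:

```python
def f(x):
    return (2 * x + 3) ** 2

def df_dx(x):
    return 4 * (2 * x + 3)  # chain-rule result for f(x) = (2x + 3)²

h = 1e-6
x = 1.0
numeric = (f(x + h) - f(x - h)) / (2 * h)  # finite-difference slope at x
print(df_dx(x), numeric)  # both ≈ 20
```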
Backpropagation: Chain Rule Applied to Neural Networks
Backpropagation computes ∂L/∂w for every weight w in the network by applying the chain rule from the output layer back to the input layer. "Back" because we propagate error signals backward through the network.
For a simple two-layer network (L=loss, a₂=output, a₁=hidden, w₁=first layer weights):
∂L/∂w₁ = (∂L/∂a₂) × (∂a₂/∂a₁) × (∂a₁/∂w₁)
Each term is the local gradient at that layer. The chain multiplies them together.
```python
# Simple 1-neuron network with MSE loss
def forward(x, w, b):
    return x * w + b  # linear

def loss(y_pred, y_true):
    return (y_pred - y_true) ** 2  # MSE

# Backprop by hand
x, w, b = 2.0, 0.5, 0.1
y_true = 3.0
y_pred = forward(x, w, b)      # 1.1
L = loss(y_pred, y_true)       # (1.1 - 3.0)² = 3.61

# Gradients
dL_dy = 2 * (y_pred - y_true)  # dL/dy_pred = 2(y_pred - y_true) = -3.8
dy_dw = x                      # dy_pred/dw = x = 2.0
dy_db = 1.0                    # dy_pred/db = 1
dL_dw = dL_dy * dy_dw          # chain rule: -3.8 * 2.0 = -7.6
dL_db = dL_dy * dy_db          # -3.8 * 1.0 = -3.8

# Update weights
lr = 0.01
w_new = w - lr * dL_dw  # 0.5 - 0.01*(-7.6) = 0.576
b_new = b - lr * dL_db  # 0.1 - 0.01*(-3.8) = 0.138
```
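A standard way to debug a hand-written backward pass is gradient checking: compare the analytic gradients against finite differences. A minimal sketch using the same one-neuron model and values:

```python
def forward(x, w, b):
    return x * w + b

def loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

x, w, b, y_true = 2.0, 0.5, 0.1, 3.0
h = 1e-6

# Analytic gradients from the chain rule
dL_dw = 2 * (forward(x, w, b) - y_true) * x    # -7.6
dL_db = 2 * (forward(x, w, b) - y_true)        # -3.8

# Central finite differences: nudge one parameter, watch the loss
num_dw = (loss(forward(x, w + h, b), y_true) - loss(forward(x, w - h, b), y_true)) / (2 * h)
num_db = (loss(forward(x, w, b + h), y_true) - loss(forward(x, w, b - h), y_true)) / (2 * h)
print(dL_dw, num_dw)  # both ≈ -7.6
print(dL_db, num_db)  # both ≈ -3.8
```

If the analytic and numeric values disagree beyond a small tolerance, the backward pass has a bug.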
PyTorch and TensorFlow automate this through automatic differentiation (autograd) — you never compute these by hand in practice. But knowing what's happening explains why certain architectures work, why gradients vanish in deep networks, and how to debug training failures.
Gradient Descent: The Algorithm That Trains Every Model
Update rule: w = w - learning_rate × ∇L(w)
Repeat until convergence. That's it. The entire training loop of every neural network.
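A minimal sketch of that loop, minimizing an illustrative one-parameter loss f(w) = (w - 3)² whose minimum is at w = 3:

```python
def grad(w):
    return 2 * (w - 3)  # derivative of f(w) = (w - 3)²

w, lr = 0.0, 0.1
for _ in range(100):
    w = w - lr * grad(w)  # the update rule: w = w - learning_rate × ∇L(w)
print(w)  # converges to ≈ 3.0
```

Try lr = 1.1 in this sketch and w diverges; try lr = 0.001 and 100 steps barely move it. That is the learning-rate tradeoff in miniature.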
Variants:
- Batch gradient descent — Compute gradient over entire dataset before updating. Slow, but stable gradient estimate. Rare in practice.
- Stochastic gradient descent (SGD) — Compute gradient for one sample, update. Noisy but fast. Can escape local minima.
- Mini-batch gradient descent — Gradient over a small batch (32, 64, 128 samples). The standard approach — balances noise and efficiency.
Learning rate is the most important hyperparameter. Too large: training diverges (loss increases). Too small: training is slow and may get stuck. Learning rate schedules (cosine annealing, step decay, warmup) adjust the learning rate during training.
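As an example of a schedule, cosine annealing decays the learning rate from a maximum to a minimum along a half cosine curve. A sketch (the `lr_max`, `lr_min`, and step-count values are illustrative):

```python
import math

def cosine_annealing(step, total_steps, lr_max=0.1, lr_min=0.001):
    """Decay lr from lr_max to lr_min along a half cosine over total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

print(cosine_annealing(0, 100))    # lr_max at the start
print(cosine_annealing(50, 100))   # roughly midway between lr_max and lr_min
print(cosine_annealing(100, 100))  # lr_min at the end
```

The cosine shape means the rate falls slowly at first, fastest in the middle, and flattens out near the end of training.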
Modern Optimizers: Beyond Vanilla SGD
| Optimizer | Idea | When to Use |
|---|---|---|
| SGD + Momentum | Accumulates gradient history to dampen oscillations | Image classification (ResNet training) |
| RMSprop | Adapts per-parameter lr based on recent gradient magnitude | RNNs, non-stationary objectives |
| Adam | Combines momentum + RMSprop. Adapts lr per parameter. | Default for most deep learning |
| AdamW | Adam with weight decay decoupled from gradient | Language models, transformers |
| Lion | Sign-based momentum update. More memory efficient. | Large-scale vision/language models |
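To make the Adam row concrete, here is a single-parameter sketch of its update rule: running first and second moment estimates with bias correction, then a per-parameter scaled step. Hyperparameters use the common defaults except the learning rate, which is set higher here for illustration:

```python
import math

def adam_step(w, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter w with gradient g at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * g      # first moment: running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g * g  # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)         # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled per parameter
    return w, m, v

# Minimize f(w) = (w - 3)², whose gradient is 2(w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t)
print(w)  # close to 3.0
```

Dividing by the second-moment estimate is what adapts the step size per parameter: directions with consistently large gradients get smaller effective steps, and vice versa.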
Learn AI from the Math Up at Precision AI Academy
Our bootcamp teaches you to build and train real models — with enough mathematical understanding to know what's happening and how to fix it when things go wrong. Five cities, October 2026.
Frequently Asked Questions
What calculus do you actually need for machine learning?
Derivatives, partial derivatives, gradients, and the chain rule. You don't need integrals, differential equations, or complex analysis for most ML work. Focus on understanding what a derivative means and how the chain rule chains gradients through composed functions.
How does backpropagation use the chain rule?
Backpropagation computes d(loss)/d(weight) for every weight by chaining local gradients from output to input using the chain rule. PyTorch/TensorFlow automate this through autograd — but knowing the math helps you understand training failures and gradient issues.
What is gradient descent and why does it work?
Gradient descent moves parameters opposite to the gradient (the direction of steepest ascent), so each step decreases the loss. Repeat until convergence. It works because a differentiable function is locally approximately linear; the gradient tells you which direction goes downhill.