Gradient Intuition
1. What is this and why do we care?
Neural networks learn by gradient descent, arguably the most important algorithm in modern AI.
Every time GPT is trained, every time a recommendation system learns your preferences, every time AlphaFold learns to predict protein structures — gradient descent is running. It is the universal engine of machine learning.
The gradient is the object that makes gradient descent possible. It collects all the partial derivatives — one per weight — into a single vector that points in the direction of steepest increase in loss. Gradient descent then walks in the opposite direction: downhill, toward lower loss.
If you understand the gradient and gradient descent, you understand the core of how neural networks train.
2. Prerequisites
You need Derivatives and Partial Derivatives. Read those first if you have not.
3. The intuition — before any symbols
Imagine you are in a hilly village at night. It is completely dark. Your torch has died. You want to reach the lowest point in the valley — perhaps a well, or the village center.
You cannot see where the valley is. But you can feel the ground under your feet. At any point you stand, the ground slopes in some direction. You can feel which way is steeper and which way is flatter.
Your strategy: at every step, feel the ground, find the direction that slopes most steeply downhill, and take one step in that direction.
This is gradient descent. Repeat it enough times and, step by step, you will find the valley floor.
In a neural network:
- The hilly landscape is the loss surface — the value of the loss as you vary all the weights
- Your position on the landscape is the current set of weights
- The valley floor is the set of weights that minimise the loss — where the network makes the fewest mistakes
- The gradient is the mathematical description of “which way is uphill from here?”
- Gradient descent is the algorithm of always stepping downhill
4. What the gradient is — precisely
For a function L(w₁, w₂, …, wₙ) with n weights, the gradient is the vector of all partial derivatives:
∇L = [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ]
The symbol ∇ is called “nabla” or “del.” It collects all the partial derivatives into one package.
The gradient has two properties that make it useful:
- It points in the direction of steepest increase of L
- Its magnitude (length) tells you how steep that slope is
Gradient descent moves in the opposite direction — steepest decrease:
w_new = w_old - η × ∇L
Where η (eta) is the learning rate — a small positive number controlling step size.
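The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function names `gradient_descent_step` and `gradient_norm` and the sample numbers are assumptions made up for this example.

```python
import math


def gradient_descent_step(weights, grad, learning_rate):
    """Apply w_new = w_old - eta * grad to every weight."""
    return [w - learning_rate * g for w, g in zip(weights, grad)]


def gradient_norm(grad):
    """Length of the gradient vector: how steep the slope is."""
    return math.sqrt(sum(g * g for g in grad))


# Hypothetical weights and gradient, chosen only for illustration.
weights = [1.0, -2.0]
grad = [0.5, -1.5]

new_weights = gradient_descent_step(weights, grad, learning_rate=0.1)
print("new weights:", new_weights)
print("steepness:", gradient_norm(grad))
```

Each weight moves against the sign of its own partial derivative: the first weight decreases because its derivative is positive, and the second increases because its derivative is negative.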
5. A tiny worked example — 2 weights
Loss function: L(w₁, w₂) = w₁² + w₂²
(This is a simple bowl-shaped surface. The minimum is at w₁=0, w₂=0.)
Compute the gradient:
∂L/∂w₁ = 2w₁
∂L/∂w₂ = 2w₂
∇L = [2w₁, 2w₂]
Starting point: w₁ = 3, w₂ = 4. Learning rate η = 0.1.
∇L at (3,4) = [2×3, 2×4] = [6, 8]
The gradient is [6, 8] — pointing uphill, away from the origin.
Gradient descent step:
w₁_new = 3 - 0.1 × 6 = 3 - 0.6 = 2.4
w₂_new = 4 - 0.1 × 8 = 4 - 0.8 = 3.2
New loss:
L(2.4, 3.2) = 2.4² + 3.2² = 5.76 + 10.24 = 16.0
Previous loss: L(3,4) = 9 + 16 = 25. We went from 25 → 16. Getting closer to the minimum of 0.
After several more steps:
Step 2: (2.4, 3.2) → (1.92, 2.56) → L = 10.24
Step 3: (1.92, 2.56) → (1.536, 2.048) → L = 6.55
...
Step 20: ≈ (0.035, 0.046) → L ≈ 0.003
Converging toward (0, 0) — the minimum.
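To check the arithmetic yourself, the worked example can be run as a short loop. This is a sketch under the same assumptions as the text (start at (3, 4), η = 0.1); the helper names `loss`, `gradient` and `run_gradient_descent` are invented for this illustration.

```python
def loss(w1, w2):
    """The bowl-shaped loss L(w1, w2) = w1^2 + w2^2."""
    return w1 ** 2 + w2 ** 2


def gradient(w1, w2):
    """Partial derivatives of the loss: [2*w1, 2*w2]."""
    return 2 * w1, 2 * w2


def run_gradient_descent(w1, w2, learning_rate, num_steps):
    """Return a list of (w1, w2, loss) tuples, one per step (step 0 = start)."""
    history = [(w1, w2, loss(w1, w2))]
    for _ in range(num_steps):
        g1, g2 = gradient(w1, w2)
        w1 = w1 - learning_rate * g1
        w2 = w2 - learning_rate * g2
        history.append((w1, w2, loss(w1, w2)))
    return history


history = run_gradient_descent(3.0, 4.0, learning_rate=0.1, num_steps=20)
for step in (0, 1, 2, 3, 20):
    w1, w2, current_loss = history[step]
    print(f"step {step:2d}: w = ({w1:.3f}, {w2:.3f}), L = {current_loss:.4f}")
```

With η = 0.1 each step multiplies both weights by 0.8, which is why the loss shrinks by the same factor (0.64) every time.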
6. Learning rate matters enormously
The learning rate η controls the step size. Getting it wrong breaks training.
Too large (η = 2.0):
w₁_new = 3 - 2.0 × 6 = 3 - 12 = -9 ← overshoot!
w₂_new = 4 - 2.0 × 8 = 4 - 16 = -12
L(-9, -12) = 81 + 144 = 225 ← loss went UP from 25, not down
With a large learning rate, you jump over the valley and land on the other side — higher up. Training oscillates or diverges.
Too small (η = 0.00001):
w₁_new = 3 - 0.00001 × 6 = 2.99994 ← barely moved
Training will converge but it will take millions of steps. Impractically slow.
Just right (η ≈ 0.1 for this example): Steady progress toward the minimum.
Finding the right learning rate is one of the most important practical skills in deep learning. Modern training often uses adaptive methods (Adam, RMSProp) that scale the step size for each weight individually, based on that weight's recent gradients. The base learning rate η still has to be chosen.
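The three regimes can be compared side by side on the same bowl L = w₁² + w₂². This is a minimal sketch: the function name `final_loss` and the choice of 10 steps are assumptions made for this illustration.

```python
def final_loss(learning_rate, num_steps=10, start=(3.0, 4.0)):
    """Run plain gradient descent on L = w1^2 + w2^2 and return the last loss."""
    w1, w2 = start
    for _ in range(num_steps):
        # The gradient of the bowl is [2*w1, 2*w2].
        w1 -= learning_rate * 2 * w1
        w2 -= learning_rate * 2 * w2
    return w1 ** 2 + w2 ** 2


for eta in (2.0, 0.00001, 0.1):
    print(f"eta = {eta}: loss after 10 steps = {final_loss(eta):.6g}")
```

The starting loss is 25. With η = 2.0 the loss grows at every step, with η = 0.00001 it has barely moved after 10 steps, and with η = 0.1 it drops well below 1.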
7. The landscape is rarely a perfect bowl
For our toy example L = w₁² + w₂², the loss surface is a smooth bowl with one minimum. Easy.
For a real neural network with millions of weights, the loss surface is a strange, high-dimensional landscape with:
- Local minima — valleys that are not the lowest point
- Saddle points — places that look like a minimum from one direction but slope downward from another
- Plateaus — flat regions where the gradient is near zero and progress stalls
- Cliffs — sudden steep drops where the gradient explodes
Understanding these features is important for training large models. Many tricks in modern deep learning — momentum, gradient clipping, warm-up schedules — exist specifically to navigate this difficult landscape.
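As one concrete instance of these tricks, gradient clipping can be sketched as follows. This is an illustrative version of clipping by norm, not the implementation of any particular framework; the function name `clip_by_norm` and the threshold value are assumptions for this example.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its length is at most max_norm, keeping its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # already small enough (this also covers norm == 0)
    scale = max_norm / norm
    return [g * scale for g in grad]


# A hypothetical "cliff" gradient of length 500, clipped to length 5.
exploding_grad = [300.0, 400.0]
print(clip_by_norm(exploding_grad, max_norm=5.0))

# A small gradient is returned unchanged.
print(clip_by_norm([0.3, 0.4], max_norm=5.0))
```

Clipping keeps the downhill direction but caps the step length, so one steep cliff cannot throw the weights far across the landscape.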
8. Where does this appear in AI?
Paper 03 — Backpropagation: Backpropagation computes the gradient ∇L efficiently. Without it, a naive approach such as nudging each weight in turn and re-measuring the loss would need roughly one extra forward pass per weight. Backprop computes every partial derivative with one forward pass and one backward pass, which makes training tractable.
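To see why the naive route is expensive, here is a sketch of a gradient estimated by nudging one weight at a time (a forward finite difference). It is not backpropagation; it is the slow alternative that backpropagation replaces. The names `finite_difference_gradient`, `counting_loss` and `calls` are invented for this illustration, and the toy loss is the bowl from the worked example.

```python
def finite_difference_gradient(loss_fn, weights, eps=1e-6):
    """Approximate the gradient by nudging one weight at a time."""
    base = loss_fn(weights)  # one evaluation at the current weights
    grad = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += eps
        # One extra loss evaluation for every single weight.
        grad.append((loss_fn(nudged) - base) / eps)
    return grad


calls = {"count": 0}


def counting_loss(weights):
    """Toy loss (sum of squares) that counts how often it is evaluated."""
    calls["count"] += 1
    return sum(w * w for w in weights)


approx_grad = finite_difference_gradient(counting_loss, [3.0, 4.0])
print("approximate gradient:", approx_grad)
print("loss evaluations:", calls["count"])
```

Two weights cost three loss evaluations (one baseline plus one per weight). With millions of weights, the same scheme would need millions of evaluations for a single gradient.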
Paper 04 — LSTM: The vanishing gradient problem occurs when ∂L/∂w values shrink toward zero as they propagate backward through many layers or, in a recurrent network, many time steps. The gradient becomes so small that the earliest layers barely update. LSTM mitigates this with a gated cell structure designed to keep gradients alive over long sequences.
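The shrinking can be illustrated with a deliberately simplified chain. The sketch assumes one sigmoid unit per layer, every weight equal to 1, and the same pre-activation at every layer, so the backward signal is just a product of sigmoid derivatives. Real networks are more complicated; the function name `backpropagated_factor` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid; its largest value is 0.25, at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def backpropagated_factor(num_layers, pre_activation=0.0):
    """Product of the per-layer derivatives the gradient is multiplied by.

    Simplifying assumptions: one unit per layer, all weights equal to 1,
    identical pre-activation everywhere.
    """
    factor = 1.0
    for _ in range(num_layers):
        factor *= sigmoid_derivative(pre_activation)
    return factor


for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers: gradient scaled by {backpropagated_factor(depth):.3e}")
```

Even in the best case for the sigmoid (a derivative of 0.25 at every layer), ten layers scale the gradient by less than one millionth.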
Paper 15 — RLHF: Reinforcement learning from human feedback uses a variant called policy gradient — the gradient of expected reward with respect to model parameters. Same idea, applied to optimise for human preference rather than prediction accuracy.
9. Common mistakes
- Moving in the direction of the gradient. Gradient descent moves opposite to the gradient: you subtract, not add. The gradient points uphill. You want to go downhill.
- Using the same learning rate for every problem. A learning rate that works beautifully for one network will cause another to diverge. Always treat η as a hyperparameter to tune.
- Thinking gradient descent always finds the global minimum. For non-convex losses (like real neural networks), gradient descent finds a local minimum, or a saddle point, or just stops when the gradient becomes very small. The remarkable empirical finding of deep learning is that local minima in large networks tend to be good enough in practice.
10. Try it yourself
Exercise 1: L(w) = (w - 5)². Compute the gradient dL/dw. Starting from w=0 with η=0.2, perform 3 gradient descent steps. What value is w converging toward?
Answer:
dL/dw = 2(w-5)
Step 0: w = 0. Gradient = 2(0-5) = -10. w_new = 0 - 0.2×(-10) = 2.0
Step 1: w = 2. Gradient = 2(2-5) = -6. w_new = 2 - 0.2×(-6) = 3.2
Step 2: w = 3.2. Gradient = 2(3.2-5) = -3.6. w_new = 3.2 - 0.2×(-3.6) = 3.92
w is converging toward 5, the minimum of (w-5)², where the loss is 0.
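If you want to check your hand calculation for Exercise 1, a short script along these lines should do it. The function names `exercise_gradient` and `descend` are made up for this sketch.

```python
def exercise_gradient(w):
    """dL/dw for L(w) = (w - 5)^2."""
    return 2 * (w - 5)


def descend(w, learning_rate, num_steps):
    """Return the list of w values visited, starting with the initial w."""
    path = [w]
    for _ in range(num_steps):
        w = w - learning_rate * exercise_gradient(w)
        path.append(w)
    return path


path = descend(0.0, learning_rate=0.2, num_steps=3)
print(path)
```

The printed values should be approximately 0, 2.0, 3.2 and 3.92, up to floating-point rounding. Raising `num_steps` shows w getting closer and closer to 5.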
Exercise 2: For L(w₁, w₂) = (w₁ - 2)² + (w₂ - 3)², compute the gradient ∇L at (w₁=0, w₂=0). What is the gradient descent update with η=0.1?
Answer:
∂L/∂w₁ = 2(w₁ - 2). At (0,0): 2(0-2) = -4
∂L/∂w₂ = 2(w₂ - 3). At (0,0): 2(0-3) = -6
∇L = [-4, -6]
Gradient descent update:
w₁_new = 0 - 0.1×(-4) = 0.4
w₂_new = 0 - 0.1×(-6) = 0.6
The minimum is at (2, 3). We are moving toward it.
Exercise 3: Why does a too-large learning rate cause divergence? Describe what happens geometrically in the hilly village analogy.
Answer:
In the hilly village analogy: you feel the slope, and instead of taking a small step downhill, you take a giant leap. You fly over the valley completely and land on the hill on the other side, higher than where you started. Now the gradient points in the opposite direction. You take another giant leap back and overshoot again. You keep bouncing back and forth, getting farther from the valley, not closer.
Mathematically: if the step η × |∇L| is larger than the distance to the minimum, you overshoot and land on the other side. In a symmetric bowl such as L = w², a step more than twice that distance lands you higher than where you started, so the loss goes up with every step.
11. Interactive widget
Coming soon: Gradient Descent Simulator →
Adjust the learning rate. Watch the ball roll down the loss surface. See how momentum helps escape saddle points.