Partial Derivatives

1. What is this and why do we care?

A neural network does not have one weight. It might have millions. And the loss depends on all of them simultaneously.

When we train a network, we need to know: “how does the loss change when we change this specific weight, holding all other weights fixed?” That question is answered by a partial derivative.

Partial derivatives are how backpropagation handles networks with many weights. For each weight, you compute the partial derivative of the loss with respect to that weight. Then you update each weight accordingly. Do this for all millions of weights — that is one training step.

2. Prerequisites

Read Derivatives — Introduction first. You should be comfortable with the idea that the derivative of f(x) = x² is f′(x) = 2x. Section 6 also relies on the chain rule, which is covered in the Chain Rule tutorial.

3. The intuition — before any symbols

Imagine you are a farmer with a field on a hillside. The height of the land at any point depends on two things: how far east you are (call it x) and how far north you are (call it y).

You are standing at a particular spot. You want to know:

“If I walk one step east (holding my north position fixed), do I go uphill or downhill?”
“If I walk one step north (holding my east position fixed), do I go uphill or downhill?”

These two questions have two separate answers. The slope going east is different from the slope going north — the hill is not the same in every direction.

The partial derivative with respect to x answers the first question: slope in the east direction, holding y fixed. The partial derivative with respect to y answers the second: slope in the north direction, holding x fixed.

In a neural network with weights w₁ and w₂, the partial derivative ∂L/∂w₁ answers: “how does the loss change if I change w₁, holding w₂ fixed?”

4. A tiny worked example with real numbers

Consider: f(x, y) = x² + 3y

This function takes two inputs (x and y) and gives one output.

Partial derivative with respect to x (written ∂f/∂x):

Treat y as a constant — pretend it is just a fixed number
Differentiate only with respect to x

f(x, y) = x² + 3y

∂f/∂x = 2x + 0 = 2x

  2x ← d/dx of x²
  0  ← d/dx of 3y (y is constant, so 3y is a constant and its derivative is 0)

Partial derivative with respect to y (written ∂f/∂y):

Treat x as a constant
Differentiate only with respect to y

∂f/∂y = 0 + 3 = 3

  0 ← d/dy of x² (x² is a constant w.r.t. y, so its derivative is 0)
  3 ← d/dy of 3y (3y differentiates to 3)

Evaluate at a specific point, say x=2, y=1:

∂f/∂x at (2,1) = 2×2 = 4   → near (2,1), f rises at a rate of 4 per unit step in x
∂f/∂y at (2,1) = 3          → f rises at a rate of 3 per unit step in y
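
A quick way to sanity-check the two results above is a central finite difference: nudge one input slightly in each direction while leaving the other untouched, and see how much f changes. The sketch below is only an illustration; the helper names partial_x and partial_y and the step size h are assumptions made for this example.

```python
def f(x, y):
    return x**2 + 3 * y

def partial_x(func, x, y, h=1e-6):
    # Nudge x only; y stays fixed.
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    # Nudge y only; x stays fixed.
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

df_dx = partial_x(f, 2.0, 1.0)
df_dy = partial_y(f, 2.0, 1.0)
print(df_dx, df_dy)  # close to the analytic values 4 and 3
```

The numeric estimates should agree with 2x = 4 and 3 up to small floating-point error.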

5. The general rule

For a function f(x₁, x₂, …, xₙ) with multiple variables:

To compute ∂f/∂xᵢ:

Treat all variables except xᵢ as constants
Differentiate f with respect to xᵢ as you would a single-variable function

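The symbol ∂ (a curly d, usually read “partial”) signals that this is a partial derivative rather than an ordinary single-variable derivative. It is sometimes called “del”, although that name more commonly refers to the gradient symbol ∇.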
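
The two-step rule can be sketched numerically for any number of variables: copy the input point, move only entry i, and leave every other entry alone. The function name partial_derivative and the three-variable function g below are hypothetical, chosen only to illustrate the rule.

```python
def partial_derivative(func, point, i, h=1e-6):
    """Approximate the partial derivative of func with respect to point[i]."""
    up = list(point)
    down = list(point)
    up[i] += h      # only variable i moves
    down[i] -= h    # every other entry keeps its original value
    return (func(up) - func(down)) / (2 * h)

def g(v):
    # Hypothetical example: g(x1, x2, x3) = x1*x2 + x3**2
    x1, x2, x3 = v
    return x1 * x2 + x3**2

point = [1.0, 2.0, 3.0]
grads = [partial_derivative(g, point, i) for i in range(len(point))]
print(grads)  # analytic values are [x2, x1, 2*x3] = [2, 1, 6]
```

Collecting one partial derivative per variable, as the list comprehension does, gives one slope for each input.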

6. A neural network example with two weights

Consider a tiny network where the loss depends on two weights:

L(w₁, w₂) = (y - w₁x₁ - w₂x₂)²

Say x₁=1, x₂=2, y=5 (the correct answer).

So: L(w₁, w₂) = (5 - w₁ - 2w₂)²

∂L/∂w₁ (how loss changes with w₁, holding w₂ fixed):

Let u = (5 - w₁ - 2w₂). Then L = u².

∂u/∂w₁ = -1        (derivative of -w₁ is -1; -2w₂ is constant w.r.t. w₁)
∂L/∂u  = 2u

∂L/∂w₁ = 2u × (-1) = -2(5 - w₁ - 2w₂)

∂L/∂w₂ (how loss changes with w₂, holding w₁ fixed):

∂u/∂w₂ = -2        (derivative of -2w₂ is -2; -w₁ is constant w.r.t. w₂)

∂L/∂w₂ = 2u × (-2) = -4(5 - w₁ - 2w₂)

Evaluate at w₁=1, w₂=1:

u = 5 - 1 - 2×1 = 2

∂L/∂w₁ = -2 × 2 = -4    → increase w₁ to reduce loss
∂L/∂w₂ = -4 × 2 = -8    → increase w₂ to reduce loss (stronger signal)

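Gradient descent updates both weights simultaneously. With a learning rate of 0.1, each weight moves against its gradient (w_new = w - 0.1 × ∂L/∂w):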

w₁_new = 1 - 0.1 × (-4) = 1.4
w₂_new = 1 - 0.1 × (-8) = 1.8
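
The whole calculation in this section can be replayed in a few lines. This is a minimal sketch, assuming the learning rate of 0.1 used in the update above; the function names loss and gradients are chosen for illustration.

```python
x1, x2, y = 1.0, 2.0, 5.0   # inputs and target from the example
lr = 0.1                    # learning rate used in the update

def loss(w1, w2):
    return (y - w1 * x1 - w2 * x2) ** 2

def gradients(w1, w2):
    u = y - w1 * x1 - w2 * x2           # prediction error
    return 2 * u * (-x1), 2 * u * (-x2)  # dL/dw1, dL/dw2 via the chain rule

w1, w2 = 1.0, 1.0
g1, g2 = gradients(w1, w2)
w1_new = w1 - lr * g1   # step against the gradient
w2_new = w2 - lr * g2

print(g1, g2)
print(w1_new, w2_new)
print(loss(w1, w2), loss(w1_new, w2_new))
```

For this single data point the loss falls from 4 to essentially 0 after one step, because 5 - 1.4 - 2×1.8 = 0. That is a property of these particular numbers, not something one gradient step achieves in general.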

7. Where does this appear in AI?

Paper 03 — Backpropagation: Backpropagation computes ∂L/∂wᵢ for every weight wᵢ in the network. It does this efficiently using the chain rule — computing partial derivatives layer by layer from output to input. A network with 1 million weights needs 1 million partial derivatives per training step.

Every neural network library: When you call loss.backward() in PyTorch, it automatically computes the partial derivative of the loss with respect to every learnable parameter in the model. This is called automatic differentiation — the library tracks every operation and applies the chain rule automatically.
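
To make "tracks every operation and applies the chain rule" concrete, here is a toy sketch of the idea in plain Python. It is not how PyTorch is implemented; the Value class and its methods are hypothetical names invented for this illustration, and only addition, subtraction and multiplication are supported.

```python
class Value:
    """A number that remembers how it was computed (toy sketch, not PyTorch)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was built from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Order the nodes so each one comes after everything it was built from.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local


w1, w2 = Value(1.0), Value(1.0)
x1, x2, y = Value(1.0), Value(2.0), Value(5.0)

u = y - w1 * x1 - w2 * x2
loss = u * u
loss.backward()
print(w1.grad, w2.grad)  # -4.0 -8.0, matching section 6
```

Each operation records its inputs and local derivatives, and backward() multiplies and accumulates them from the output toward the inputs.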

8. Common mistakes

Differentiating the “wrong” variables. When computing ∂f/∂x, everything except x is a constant. Students sometimes accidentally differentiate y terms too. Read slowly — if you see y in the expression and you are computing ∂/∂x, that y is a fixed number.
Confusing ∂ with d. The curly ∂ means you are holding other variables constant. The straight d means you are considering the full effect, which includes how other variables might change too (the total derivative). In neural network training, we always use partial derivatives.
Expecting partial derivatives to simply “add up” to the full derivative. They do not, in general. When both x and y change, the total change in f is df = (∂f/∂x)dx + (∂f/∂y)dy: each partial is multiplied by the change in its own variable before the terms are summed. This is called the total differential.
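
The total differential from the last point can be checked numerically. The sketch below reuses f(x, y) = x² + 3y from section 4; the step sizes dx and dy are arbitrary small values picked for illustration.

```python
def f(x, y):
    return x**2 + 3 * y

x, y = 2.0, 1.0
dx, dy = 0.01, -0.02       # small, arbitrary changes in each input

fx, fy = 2 * x, 3.0        # partials of f at (2, 1), from section 4
predicted = fx * dx + fy * dy              # total differential
actual = f(x + dx, y + dy) - f(x, y)       # true change in f
print(predicted, actual)
```

The two numbers are close but not identical: the total differential is a first-order approximation, and the small gap here comes from the curvature of the x² term.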

9. Try it yourself

Exercise 1: Find ∂f/∂x and ∂f/∂y for f(x, y) = 3x² + 2xy + y³.

Answer:

∂f/∂x: treat y as constant. Differentiate each term:

3x² → 6x
2xy → 2y (y is constant, so this is like 2y×x, derivative is 2y)
y³ → 0 (pure y term, constant w.r.t. x)

∂f/∂x = 6x + 2y

∂f/∂y: treat x as constant:

3x² → 0 (pure x term, constant w.r.t. y)
2xy → 2x (x is constant, derivative of y is 1, so 2x×1 = 2x)
y³ → 3y²

∂f/∂y = 2x + 3y²

Exercise 2: For f(x, y) = 3x² + 2xy + y³, evaluate both partial derivatives at the point (x=1, y=2).

Answer:

∂f/∂x at (1,2) = 6×1 + 2×2 = 6 + 4 = 10

∂f/∂y at (1,2) = 2×1 + 3×2² = 2 + 12 = 14

Interpretation: at the point (1,2), f increases at a rate of 10 per unit step in the x-direction and 14 per unit step in the y-direction. The slope is steeper in the y-direction.
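
If you want to check your own answers to Exercises 1 and 2, one option is to compare the hand-derived formulas against a finite-difference estimate. This is a rough sketch; the step size h is an assumption and the function names are chosen for illustration.

```python
def f(x, y):
    return 3 * x**2 + 2 * x * y + y**3

def df_dx(x, y):
    return 6 * x + 2 * y       # answer to Exercise 1

def df_dy(x, y):
    return 2 * x + 3 * y**2    # answer to Exercise 1

h = 1e-6
x, y = 1.0, 2.0
numeric_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # y held fixed
numeric_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # x held fixed

print(df_dx(x, y), numeric_dx)  # formula gives 10.0
print(df_dy(x, y), numeric_dy)  # formula gives 14.0
```

If a formula and its numeric estimate disagree by more than a tiny amount, the usual culprit is a term that was differentiated with respect to the wrong variable.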

Exercise 3: A loss function with two weights: L(w₁, w₂) = (4 - 2w₁ - w₂)². Find ∂L/∂w₁ and ∂L/∂w₂. Evaluate at w₁=1, w₂=0. Which weight has a larger gradient magnitude?

Answer:

Let u = 4 - 2w₁ - w₂. L = u².

∂u/∂w₁ = -2 → ∂L/∂w₁ = 2u × (-2) = -4(4 - 2w₁ - w₂)
∂u/∂w₂ = -1 → ∂L/∂w₂ = 2u × (-1) = -2(4 - 2w₁ - w₂)

At (w₁=1, w₂=0): u = 4 - 2 - 0 = 2

∂L/∂w₁ = -4 × 2 = -8
∂L/∂w₂ = -2 × 2 = -4

w₁ has the larger gradient magnitude (8 vs 4). Gradient descent will update w₁ more aggressively. This makes sense — w₁ has coefficient 2 in the prediction, so it has twice the leverage on the output.
