5. The math — loss, gradients, and the analogy magic with real numbers
This section has three parts:
- The loss function that Word2Vec optimises.
- The gradient update rules.
- A worked numeric example of the famous analogy king − man + woman ≈ queen, plus an Indian variant: Delhi − India + Japan ≈ Tokyo.
Prerequisites you should have nearby: dot product, matrix multiplication, probability basics.
5.1 The full-softmax objective (the slow version)
For a single training example — target word w and neighbour word c
— the naive skip-gram objective is to maximise:
P(c | w) = exp(v_w · v'_c) / Σ_{j=1}^V exp(v_w · v'_j)
- v_w is the target’s embedding (row w of matrix W).
- v'_c is the context vector for word c (row c of matrix W’).
- The sum in the denominator runs over the entire vocabulary; this is the softmax.
Summed across all training pairs, the total log-likelihood is:
J = Σ_(w,c) log P(c | w)
We want to maximise J — equivalently, minimise −J. This is beautiful
math but, as noted in Section 4, the softmax denominator makes it
impractical for big vocabularies.
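To make the cost concrete, here is a minimal NumPy sketch of the two formulas above. The toy sizes, the random matrices, and the names `W_ctx`, `softmax_prob`, and `log_likelihood` are assumptions made for illustration; they are not taken from any particular Word2Vec implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                  # toy vocabulary size and dimension (assumed)
W = rng.normal(scale=0.1, size=(V, d))        # target embeddings: row w is v_w
W_ctx = rng.normal(scale=0.1, size=(V, d))    # context embeddings: row c is v'_c

def softmax_prob(w: int) -> np.ndarray:
    """Return P(c | w) for every word c in the vocabulary."""
    scores = W_ctx @ W[w]                     # V dot products: this is the O(V) cost
    scores -= scores.max()                    # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def log_likelihood(pairs) -> float:
    """J = sum of log P(c | w) over the given (w, c) pairs."""
    return float(sum(np.log(softmax_prob(w)[c]) for w, c in pairs))

probs = softmax_prob(3)
J = log_likelihood([(3, 5), (3, 2), (7, 1)])
print(probs.sum(), J)
```

Every call to `softmax_prob` touches all V rows of `W_ctx`, which is exactly the part that stops scaling when V reaches hundreds of thousands of words.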
5.2 The negative-sampling objective (the fast version)
Instead of the full softmax, negative sampling turns each (w, c) pair
into (1 + k) binary classifications. The objective for one training
pair becomes:
J_{ns}(w, c) = log σ(v_w · v'_c)
+ Σ_{i=1}^k log σ(−v_w · v'_{n_i})
where:
- σ(x) = 1 / (1 + e⁻ˣ) is the sigmoid.
- n_i is the i-th negative sample (a random word).
- k is the number of negatives per positive (5 to 20 in practice).
Read this aloud:
“Make σ(v_w · v'_c) close to 1 for real neighbours, and make σ(−v_w · v'_n) close to 1 for random negatives (which means the dot product v_w · v'_n is pushed toward negative values).”
We maximise J_ns (i.e., minimise −J_ns) by gradient ascent on each vector involved. No softmax, no sum over V. Computing one training example is now only O(k+1) dot products, not O(V).
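As a rough sketch, the objective for one pair can be written in a few lines of NumPy. The function name `neg_sampling_objective`, the toy dimension, and the random vectors are illustrative assumptions; in real training the negatives would be drawn from a noise distribution over the vocabulary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_w, v_c, v_negs):
    """J_ns(w, c) = log σ(v_w · v'_c) + Σ_i log σ(−v_w · v'_{n_i})."""
    positive = np.log(sigmoid(v_w @ v_c))                 # one dot product
    negative = np.sum(np.log(sigmoid(-(v_negs @ v_w))))   # k dot products
    return float(positive + negative)

rng = np.random.default_rng(0)
d, k = 4, 5                                   # toy dimension and number of negatives
v_w = rng.normal(scale=0.1, size=d)           # target vector
v_c = rng.normal(scale=0.1, size=d)           # context vector of the real neighbour
v_negs = rng.normal(scale=0.1, size=(k, d))   # context vectors of k random negatives

J_ns = neg_sampling_objective(v_w, v_c, v_negs)
print(J_ns)
```

Note that nothing here depends on the vocabulary size: the work is k + 1 dot products however large V is.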
5.3 The gradient update
Let’s compute the gradient of the positive term with respect to v_w:
∂/∂v_w log σ(v_w · v'_c)
= ( 1 − σ(v_w · v'_c) ) · v'_c
And with respect to v'_c:
∂/∂v'_c log σ(v_w · v'_c)
= ( 1 − σ(v_w · v'_c) ) · v_w
The term (1 − σ(·)) is how “wrong” the network currently is. When
the dot product is already very large, σ is near 1 and the gradient is
near 0 — no more updating needed. When σ is near 0, the gradient is
large — big update needed.
For a negative sample n:
∂/∂v_w log σ(−v_w · v'_n) = −σ(v_w · v'_n) · v'_n
That’s a negative update: it pushes v_w away from v'_n.
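If you want to convince yourself that these gradient formulas are right, one option is a finite-difference check. The sketch below compares the analytic expressions with numerical gradients on random vectors; the helper `numerical_grad` and the random test vectors are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
v_w, v_c, v_n = rng.normal(size=(3, 4))       # three random 4-dimensional vectors

# Positive term: ∂/∂v_w log σ(v_w · v'_c) = (1 − σ(v_w · v'_c)) · v'_c
analytic_pos = (1 - sigmoid(v_w @ v_c)) * v_c
numeric_pos = numerical_grad(lambda v: np.log(sigmoid(v @ v_c)), v_w)

# Negative term: ∂/∂v_w log σ(−v_w · v'_n) = −σ(v_w · v'_n) · v'_n
analytic_neg = -sigmoid(v_w @ v_n) * v_n
numeric_neg = numerical_grad(lambda v: np.log(sigmoid(-(v @ v_n))), v_w)

print(np.allclose(analytic_pos, numeric_pos, atol=1e-6))
print(np.allclose(analytic_neg, numeric_neg, atol=1e-6))
```

Both comparisons should agree to well within the tolerance, since the finite-difference error at this step size is far smaller than 1e-6.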
The update rule for one training example with learning rate η is then:
v_w ← v_w + η · [ (1 − σ(v_w · v'_c)) · v'_c
− Σᵢ σ(v_w · v'_{n_i}) · v'_{n_i} ]
v'_c ← v'_c + η · (1 − σ(v_w · v'_c)) · v_w
v'_{n_i} ← v'_{n_i} − η · σ(v_w · v'_{n_i}) · v_w
If this looks like a lot, don’t worry. It is three lines of code in any real implementation, and the structure is: “pull real pairs together, push fake pairs apart”. That is the whole training dynamic.
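Here is one way the update rule might look in NumPy. It is a sketch under assumptions, not the code of any real library: `sgns_step` and `objective` are hypothetical names, the matrices are tiny random toys, and the negatives are fixed by hand rather than sampled.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(W, W_ctx, w, c, negs):
    """J_ns for one (w, c) pair with the given negative samples."""
    return float(np.log(sigmoid(W[w] @ W_ctx[c]))
                 + np.sum(np.log(sigmoid(-(W_ctx[negs] @ W[w])))))

def sgns_step(W, W_ctx, w, c, negs, lr=0.025):
    """One in-place gradient-ascent step on J_ns for target w, context c, negatives negs."""
    v_w = W[w].copy()                    # every gradient uses the pre-update v_w
    ids = np.array([c, *negs])           # real context first, then the negatives
    labels = np.zeros(len(ids))
    labels[0] = 1.0                      # 1 for the real pair, 0 for each negative
    ctx = W_ctx[ids]                     # fancy indexing returns a copy of the rows
    err = labels - sigmoid(ctx @ v_w)    # (1 − σ) for the positive, −σ for negatives
    W[w] += lr * (err @ ctx)             # update v_w
    np.add.at(W_ctx, ids, lr * np.outer(err, v_w))   # update v'_c and each v'_{n_i}

rng = np.random.default_rng(0)
V, d = 6, 4                              # toy sizes (assumed)
W = rng.normal(scale=0.1, size=(V, d))
W_ctx = rng.normal(scale=0.1, size=(V, d))

before = objective(W, W_ctx, w=0, c=1, negs=[2, 3, 4])
sgns_step(W, W_ctx, w=0, c=1, negs=[2, 3, 4])
after = objective(W, W_ctx, w=0, c=1, negs=[2, 3, 4])
print(before, after)
```

The `err` vector folds the positive and negative cases into one expression, so the three update lines from the text collapse into two array operations. `np.add.at` is used instead of `W_ctx[ids] += ...` so that a word sampled twice as a negative would still receive both updates. After one small step the objective for this pair should be slightly higher than before.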
5.4 Why the analogy trick works — informal geometry
Once trained, the vectors live in a 300-dimensional space where certain directions correspond to conceptual axes:
- A “gender” axis that separates man/woman, king/queen, actor/actress.
- A “country → capital” axis that separates India/Delhi, Japan/Tokyo, France/Paris.
- A “verb tense” axis that separates walk/walked, run/ran.
These axes are not explicitly designed — they emerge because the training data contains regularities that are easier to explain if different conceptual dimensions end up pointing in different directions in vector space.
An analogy like king − man + woman traces this geometry:
- king − man computes the direction that takes you from “a regular adult male” to “a male monarch”. Roughly, it’s the “royalty” vector.
- Adding that to woman takes “a regular adult female” in the royalty direction, landing near “queen”.
The word closest to vec("king") − vec("man") + vec("woman") (by
cosine similarity) tends to be “queen” in a well-trained model.
Why cosine similarity? Because the important thing about a word vector is its direction, not its magnitude. Two vectors pointing the same way have cosine similarity near 1; opposite directions give −1; orthogonal directions give 0. The classic post-Word2Vec evaluation takes the top-1 or top-5 nearest vectors by cosine similarity.
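The three cases mentioned above are easy to check with a small helper. The function name `cosine_similarity` and the 2-dimensional test vectors are chosen here purely for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: (a · b) / (|a| |b|)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1, 2], [2, 4])         # same direction, different length
opposite = cosine_similarity([1, 2], [-1, -2])   # opposite direction
orthogonal = cosine_similarity([1, 0], [0, 1])   # perpendicular
print(same, opposite, orthogonal)
```

Doubling the length of a vector leaves the result unchanged, which is the sense in which cosine similarity looks only at direction.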
5.5 Worked example — king − man + woman ≈ queen
We’ll use made-up but plausible 4-dimensional vectors (real Word2Vec uses 300 dimensions, but 4 is enough to illustrate). Say after training we have:
vec(king) = [ 0.90, 0.50, 0.10, 0.05 ]
vec(man) = [ 0.80, 0.05, 0.10, 0.00 ]
vec(woman) = [ 0.05, 0.05, 0.85, 0.00 ]
vec(queen) = [ 0.10, 0.55, 0.85, 0.05 ]
The first dimension tracks “maleness”, the second “royalty”, the third “femaleness”, the fourth is just some other direction.
Compute the analogy:
vec(king) − vec(man) = [0.90−0.80, 0.50−0.05, 0.10−0.10, 0.05−0.00]
= [0.10, 0.45, 0.00, 0.05]
← this is the "royalty" direction, with most of the maleness cancelled
Add vec(woman):
vec(king) − vec(man) + vec(woman)
= [0.10+0.05, 0.45+0.05, 0.00+0.85, 0.05+0.00]
= [0.15, 0.50, 0.85, 0.05]
Now compare to vec(queen) = [0.10, 0.55, 0.85, 0.05]. They’re almost
identical: off by 0.05 in the first two dimensions and equal in the
other two. Of the four words in this toy vocabulary, “queen” is the
closest (by cosine similarity) to our computed analogy vector. The math worked.
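The same arithmetic can be reproduced in NumPy, using the made-up toy vectors from this section. The dictionary layout and the `cosine` helper are illustrative choices.

```python
import numpy as np

# The toy vectors from the worked example (made up, not from a trained model).
vecs = {
    "king":  np.array([0.90, 0.50, 0.10, 0.05]),
    "man":   np.array([0.80, 0.05, 0.10, 0.00]),
    "woman": np.array([0.05, 0.05, 0.85, 0.00]),
    "queen": np.array([0.10, 0.55, 0.85, 0.05]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vecs["king"] - vecs["man"] + vecs["woman"]   # ≈ [0.15, 0.50, 0.85, 0.05]
sims = {word: cosine(target, v) for word, v in vecs.items()}
best = max(sims, key=sims.get)

# Print the four words from most to least similar to the analogy vector.
for word, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(f"{word:6s} {s:.3f}")
```

With these numbers “queen” comes out on top with a cosine similarity of roughly 0.998, followed by “woman” at roughly 0.886, with “king” and “man” well behind.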
5.6 Indian variant — Delhi − India + Japan ≈ Tokyo
Same 4-dimensional toy setup, but with new vectors that track country/capital semantics. After training:
vec(Delhi) = [ 0.85, 0.95, 0.10, 0.05 ]
vec(India) = [ 0.10, 0.95, 0.15, 0.05 ]
vec(Japan) = [ 0.10, 0.20, 0.90, 0.05 ]
vec(Tokyo) = [ 0.85, 0.20, 0.85, 0.05 ]
Here:
- Dimension 1 tracks “is-a-capital”.
- Dimension 2 tracks “India-ness”.
- Dimension 3 tracks “Japan-ness”.
- Dimension 4 is noise.
Compute:
vec(Delhi) − vec(India) = [0.85−0.10, 0.95−0.95, 0.10−0.15, 0.05−0.05]
= [0.75, 0.00, −0.05, 0.00]
← the "is-a-capital, with the India-ness cancelled" direction
vec(Delhi) − vec(India) + vec(Japan)
= [0.75 + 0.10, 0.00 + 0.20, −0.05 + 0.90, 0.00 + 0.05]
= [0.85, 0.20, 0.85, 0.05]
Compare to vec(Tokyo) = [0.85, 0.20, 0.85, 0.05]. Exact match.
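Here is the same calculation as a vectorised NumPy sketch, which scores the analogy vector against every row of the embedding matrix at once. The matrix `E`, the lookup `idx`, and the helper `analogy_vector` are names assumed for this example.

```python
import numpy as np

words = ["Delhi", "India", "Japan", "Tokyo"]
E = np.array([
    [0.85, 0.95, 0.10, 0.05],   # Delhi
    [0.10, 0.95, 0.15, 0.05],   # India
    [0.10, 0.20, 0.90, 0.05],   # Japan
    [0.85, 0.20, 0.85, 0.05],   # Tokyo
])
idx = {word: i for i, word in enumerate(words)}

def analogy_vector(a, b, c):
    """Return vec(a) − vec(b) + vec(c)."""
    return E[idx[a]] - E[idx[b]] + E[idx[c]]

target = analogy_vector("Delhi", "India", "Japan")

# Normalise the rows and the target, then one matrix-vector product
# gives the cosine similarity against every word.
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
sims = E_unit @ (target / np.linalg.norm(target))
nearest = words[int(np.argmax(sims))]

print(np.round(target, 2), nearest)
```

Normalising once and using a single matrix product is the usual way to rank a whole vocabulary, since a Python loop over hundreds of thousands of words would be slow.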
In a real Word2Vec model you won’t get exact matches, but the nearest neighbour of the computed vector will usually be “Tokyo”. This is why, in the famous Google Analogy Test, Word2Vec could solve questions like “Delhi is to India as _____ is to Japan” with 80-90% accuracy — better than any previous method.
5.7 A reality-check note
The “analogy arithmetic works perfectly” framing is mostly true but occasionally oversold. In published evaluations:
- Capital-country, gender, verb-tense analogies: ~70-85% top-1 accuracy.
- Some categories (like comparative-superlative) work less reliably.
- The analogy can fail for very rare words, because their vectors are trained on fewer examples.
So “≈” in king − man + woman ≈ queen is doing real work. The actual
computed vector is usually close to queen, not exactly equal.
“Queen” is typically one of the top 1-3 nearest words, and the
question words themselves (like “king”) are usually excluded from the
list to avoid trivial solutions, since one of them is often the
nearest neighbour otherwise.
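To see why excluding the question words matters, here is a small sketch with deliberately imperfect vectors. These 3-dimensional numbers are hypothetical, chosen only so that the effect shows up; they do not come from a trained model, and `solve_analogy` is an illustrative helper rather than a library function.

```python
import numpy as np

# Hypothetical, deliberately imperfect vectors (not from a trained model).
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.8, 0.1, 0.3]),
    "queen": np.array([0.6, 0.9, 0.6]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, exclude_inputs=True):
    """Return the word nearest to vec(a) − vec(b) + vec(c) by cosine similarity."""
    target = vocab[a] - vocab[b] + vocab[c]
    skip = {a, b, c} if exclude_inputs else set()
    candidates = {w: cosine(target, v) for w, v in vocab.items() if w not in skip}
    return max(candidates, key=candidates.get)

allowing_inputs = solve_analogy("king", "man", "woman", exclude_inputs=False)
excluding_inputs = solve_analogy("king", "man", "woman", exclude_inputs=True)
print(allowing_inputs, excluding_inputs)
```

With these numbers the unrestricted search returns “king”, because the analogy vector has not moved far from where it started, while the restricted search returns “queen”.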
This will all be concrete in the next section, where you run it yourself on pretrained Google News vectors.