4. The math — alignment scores, attention weights, and context vectors
🔴 Advanced undergrad. This section uses matrix multiplication and the softmax function. If you need a refresher, read Matrix Multiplication and Softmax Function before continuing.
The architecture
The Bahdanau attention model has three components:
- A bidirectional encoder producing one hidden state per source word
- An alignment model (additive attention) scoring how relevant each source state is to the current decoder state
- A decoder that uses a fresh, step-specific context vector at each generation step
Step 1: The bidirectional encoder
Let the source sentence have T words. A forward LSTM reads left to right (the original paper uses a GRU-style gated unit; an LSTM fills the same role here):
→h₁, →h₂, ..., →h_T
A backward LSTM reads right to left:
←h_T, ←h_{T-1}, ..., ←h₁
These are concatenated into one hidden state per source word:
hᵢ = [→hᵢ ; ←hᵢ]
Where [ ; ] means concatenation. If each individual LSTM state has dimension d, then each hᵢ has dimension 2d.
Why concatenate? Each hᵢ now contains information about word i from its full sentence context — words before AND after it. Compare to seq2seq’s encoder, where hᵢ only knew about words 1 through i.
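The concatenation can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's encoder: it assumes a plain tanh RNN in place of the LSTM, random untrained weights, and hypothetical helper names (run_rnn, bidirectional_encode). The point is only the bookkeeping: run one pass in each direction, re-align the backward states, and concatenate per position.

```python
import numpy as np

def run_rnn(inputs, W_x, W_h):
    """Run a plain tanh RNN over a sequence and return one state per input."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_encode(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward states position by position."""
    forward = run_rnn(inputs, *fwd_params)                # reads x_1 ... x_T
    # Read the sequence reversed, then flip the states back so that
    # row i of `backward` again corresponds to source word i.
    backward = run_rnn(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)    # shape (T, 2d)

rng = np.random.default_rng(0)
T, emb, d = 3, 4, 2   # toy sizes, chosen arbitrarily
inputs = rng.normal(size=(T, emb))
fwd_params = (rng.normal(size=(d, emb)), rng.normal(size=(d, d)))
bwd_params = (rng.normal(size=(d, emb)), rng.normal(size=(d, d)))

H = bidirectional_encode(inputs, fwd_params, bwd_params)
print(H.shape)  # (3, 4): T rows, each of dimension 2d
```

The first d entries of each row summarize words 1 through i, and the last d entries summarize words i through T.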
Step 2: Alignment scores (additive attention)
At decoding step t, the decoder has a previous hidden state s_{t-1}. We want to score how compatible this decoder state is with each source hidden state hᵢ.
Bahdanau’s alignment model is:
eₜᵢ = vₐᵀ · tanh(Wₐ · s_{t-1} + Uₐ · hᵢ)
Where:
- s_{t-1} — decoder’s previous hidden state (a vector of dimension n)
- hᵢ — encoder hidden state for source word i (a vector of dimension 2d)
- Wₐ — learnable weight matrix (n × n) applied to the decoder state
- Uₐ — learnable weight matrix (n × 2d) applied to the encoder state
- tanh — element-wise non-linearity, squashes values to (−1, +1)
- vₐ — learnable weight vector that projects the combined representation to a scalar
- eₜᵢ — the resulting scalar score: how relevant source word i is at decoding step t
In plain words: transform both the decoder state and the source state into the same space, add them, squash through tanh, then project to a single number. This single number is the alignment score.
This formulation is called additive attention (also called Bahdanau attention). It is different from dot-product attention (Luong / Transformer), where the score is simply s · hᵢ. We will see that simpler formula in Paper 08 — it is faster but loses the tanh non-linearity.
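As a rough sketch, the alignment formula above can be evaluated for all source positions at once. The function name additive_scores and the random toy parameters are assumptions for illustration; the shapes follow the definitions in the list above (Wₐ is n × n, Uₐ is n × 2d, vₐ has n entries).

```python
import numpy as np

def additive_scores(s_prev, H, W_a, U_a, v_a):
    """e_i = v_a^T tanh(W_a s_prev + U_a h_i) for every row h_i of H."""
    query_term = W_a @ s_prev      # shape (n,), shared by all source positions
    key_terms = H @ U_a.T          # shape (T, n), one row per source word
    # Broadcasting adds the shared decoder term to every row.
    return np.tanh(query_term + key_terms) @ v_a   # shape (T,)

rng = np.random.default_rng(0)
n, two_d, T = 3, 4, 5              # toy sizes, chosen arbitrarily
s_prev = rng.normal(size=n)
H = rng.normal(size=(T, two_d))
W_a = rng.normal(size=(n, n))
U_a = rng.normal(size=(n, two_d))
v_a = rng.normal(size=n)

scores = additive_scores(s_prev, H, W_a, U_a, v_a)
print(scores.shape)  # (5,): one scalar score per source position
```

Because Uₐ · hᵢ does not depend on the decoder state, an implementation could compute key_terms once per sentence and reuse it at every decoding step.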
Step 3: Attention weights via softmax
Compute alignment scores for all T source positions at step t:
eₜ = [eₜ₁, eₜ₂, ..., eₜ_T]
Convert to attention weights using softmax:
αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ)
The attention weights αₜ = [αₜ₁, αₜ₂, …, αₜ_T] are a proper probability distribution: each αₜᵢ ≥ 0 and Σᵢ αₜᵢ = 1.
Interpretation: αₜᵢ is the probability that target word t is aligned to source word i. If the model is generating the French word “économique” and the English source contains “economic” at position 4, then αₜ₄ should be large — close to 1 — while other positions have small weights.
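A minimal softmax sketch follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick, not something the formula above requires; it leaves the result unchanged because the shift cancels in the ratio.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of alignment scores into attention weights."""
    shifted = scores - np.max(scores)   # avoids overflow in exp for large scores
    exps = np.exp(shifted)
    return exps / exps.sum()

weights = softmax(np.array([2.0, 1.0, 0.1]))
print(np.round(weights, 3))   # roughly [0.659 0.242 0.099]
print(weights.sum())          # sums to 1 up to floating-point error

# Very large scores still work thanks to the shift.
big = softmax(np.array([1000.0, 1001.0]))
```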
Step 4: Context vector as a weighted sum
The context vector for decoding step t is the attention-weighted sum of all encoder hidden states:
cₜ = Σᵢ αₜᵢ · hᵢ
This is a soft lookup: instead of retrieving one encoder state (hard lookup), we retrieve a blend of all of them, weighted by relevance. Words with high attention weight contribute most to cₜ.
Important: cₜ is different at every decoding step. Unlike seq2seq where the decoder saw the same context vector C at every step, Bahdanau’s decoder gets a fresh, query-specific context vector each time. This is the entire point of the mechanism.
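The weighted sum is a single vector-matrix product. The sketch below (with a hypothetical helper name, context_vector) contrasts two extreme weightings to show what "soft lookup" means: one-hot weights reproduce a hard lookup, and uniform weights return the plain average of the encoder states.

```python
import numpy as np

def context_vector(alphas, H):
    """c = sum_i alpha_i * h_i, where row i of H is h_i."""
    return alphas @ H

H = np.array([[0.6, 0.2],
              [0.4, 0.9],
              [0.5, 0.3]])

one_hot = np.array([0.0, 1.0, 0.0])   # all weight on the second state
uniform = np.full(3, 1.0 / 3.0)       # equal weight on every state

hard = context_vector(one_hot, H)     # equals the second row of H
mean = context_vector(uniform, H)     # equals the mean of the rows
print(hard, mean)
```

Real attention weights fall between these extremes, so the context vector lies somewhere between a single encoder state and the average of all of them.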
Step 5: Decoder update
The decoder hidden state at step t:
sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)
Where:
- sₜ₋₁ — previous decoder hidden state
- yₜ₋₁ — previous output word (as a vector)
- cₜ — context vector (freshly computed via attention)
- f — a GRU or LSTM cell
The output probability over the vocabulary:
P(yₜ | y₁,...,yₜ₋₁, x) = softmax(Wo · g(sₜ, yₜ₋₁, cₜ))
Where g is a transformation (often a single linear layer) and Wo is the output projection matrix.
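The two formulas above can be sketched as one decoder step. Everything specific here is an assumption made for illustration: f is a plain tanh cell rather than a GRU or LSTM, g is a single linear layer over the concatenation [sₜ ; yₜ₋₁ ; cₜ], and the names (decoder_step, W_s, W_y, W_c, W_g) and toy sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(s_prev, y_prev, c_t, params):
    """One step: new state s_t, then a distribution over the vocabulary."""
    W_s, W_y, W_c, W_g, W_o = params
    # f: simplified recurrent cell (stand-in for a GRU or LSTM).
    s_t = np.tanh(W_s @ s_prev + W_y @ y_prev + W_c @ c_t)
    # g: single linear layer over the concatenated inputs.
    g = W_g @ np.concatenate([s_t, y_prev, c_t])
    probs = softmax(W_o @ g)
    return s_t, probs

rng = np.random.default_rng(0)
n, emb, ctx, hid, vocab = 3, 2, 4, 5, 6   # toy sizes, chosen arbitrarily
params = (
    rng.normal(size=(n, n)),              # W_s
    rng.normal(size=(n, emb)),            # W_y
    rng.normal(size=(n, ctx)),            # W_c
    rng.normal(size=(hid, n + emb + ctx)),  # W_g
    rng.normal(size=(vocab, hid)),        # W_o
)
s_t, probs = decoder_step(
    rng.normal(size=n), rng.normal(size=emb), rng.normal(size=ctx), params
)
print(s_t.shape, probs.shape)  # (3,) (6,)
```

In a full decoder, cₜ would be recomputed from sₜ₋₁ by the attention steps before each call.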
Worked numerical example
Let us translate the Hindi phrase “Subah chai piyo” (Drink tea in the morning) with a tiny toy model.
Setup:
- Source: 3 words → 3 encoder hidden states, each 2-dimensional (in practice, 1000-dimensional)
- Decoder state: 2-dimensional
- To keep numbers clean, we skip the bidirectional encoder and take the encoder states as given
Encoder hidden states (given directly, standing in for the bidirectional encoder's output):
h₁ = [0.6, 0.2] ("Subah" — morning)
h₂ = [0.4, 0.9] ("chai" — tea)
h₃ = [0.5, 0.3] ("piyo" — drink)
Decoder previous state:
s₀ = [0.3, 0.7]
Weight matrices (tiny, hand-picked):
Wₐ = [[0.5, 0.1], [0.2, 0.5]]
Uₐ = [[0.4, 0.2], [0.1, 0.4]]
vₐ = [1.0, 1.0]
Compute Wₐ · s₀:
Wₐ · s₀ = [[0.5, 0.1], [0.2, 0.5]] · [0.3, 0.7]
= [0.5×0.3 + 0.1×0.7, 0.2×0.3 + 0.5×0.7]
= [0.15 + 0.07, 0.06 + 0.35]
= [0.22, 0.41]
This term is the same for all source positions (it depends only on the decoder state), so it is computed once per decoding step.
For each source word, compute Uₐ · hᵢ:
For h₁ = [0.6, 0.2]:
Uₐ · h₁ = [0.4×0.6 + 0.2×0.2, 0.1×0.6 + 0.4×0.2]
= [0.24 + 0.04, 0.06 + 0.08]
= [0.28, 0.14]
For h₂ = [0.4, 0.9]:
Uₐ · h₂ = [0.4×0.4 + 0.2×0.9, 0.1×0.4 + 0.4×0.9]
= [0.16 + 0.18, 0.04 + 0.36]
= [0.34, 0.40]
For h₃ = [0.5, 0.3]:
Uₐ · h₃ = [0.4×0.5 + 0.2×0.3, 0.1×0.5 + 0.4×0.3]
= [0.20 + 0.06, 0.05 + 0.12]
= [0.26, 0.17]
Add decoder and encoder terms, apply tanh:
For word 1: [0.22+0.28, 0.41+0.14] = [0.50, 0.55]
tanh([0.50, 0.55]) ≈ [0.462, 0.501]
e₁ = vₐᵀ · [0.462, 0.501] = 1×0.462 + 1×0.501 = 0.963
For word 2: [0.22+0.34, 0.41+0.40] = [0.56, 0.81]
tanh([0.56, 0.81]) ≈ [0.508, 0.670]
e₂ = 0.508 + 0.670 = 1.178
For word 3: [0.22+0.26, 0.41+0.17] = [0.48, 0.58]
tanh([0.48, 0.58]) ≈ [0.446, 0.523]
e₃ = 0.446 + 0.523 = 0.969
Apply softmax to get attention weights:
exp(0.963) ≈ 2.620
exp(1.178) ≈ 3.248
exp(0.969) ≈ 2.635
Sum = 2.620 + 3.248 + 2.635 = 8.503
α₁ = 2.620 / 8.503 ≈ 0.308
α₂ = 3.248 / 8.503 ≈ 0.382 ← highest: model attends most to "chai"
α₃ = 2.635 / 8.503 ≈ 0.310
Check: 0.308 + 0.382 + 0.310 = 1.000 ✓
Compute context vector:
c₁ = α₁·h₁ + α₂·h₂ + α₃·h₃
= 0.308×[0.6, 0.2] + 0.382×[0.4, 0.9] + 0.310×[0.5, 0.3]
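= [0.1848, 0.0616] + [0.1528, 0.3438] + [0.1550, 0.0930]
= [0.4926, 0.4984] ≈ [0.493, 0.498]
The context vector c₁ ≈ [0.493, 0.498] is a blend of all three source states, with “chai” (word 2) receiving the most weight (38.2%). The decoder uses this vector to generate the first English word. With these hand-picked parameters the emphasis on “chai” is arbitrary: the first word of the reference translation is “Drink”, so a trained model would be expected to put most of its weight on “piyo” at this step.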
Notice that if the model were translating and had to generate “tea” next, ideally α₂ would be near 1.0. The model learns this alignment through training; our toy numbers just illustrate the mechanics.
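The whole worked example can be checked with a short NumPy script. This is a sketch that assumes only the toy values given in the setup. Results should agree with the hand calculation to about the third decimal place; rounding intermediate values by hand can shift the last digit.

```python
import numpy as np

# Toy values from the setup above.
H = np.array([[0.6, 0.2],    # h1, "Subah"
              [0.4, 0.9],    # h2, "chai"
              [0.5, 0.3]])   # h3, "piyo"
s0 = np.array([0.3, 0.7])
W_a = np.array([[0.5, 0.1], [0.2, 0.5]])
U_a = np.array([[0.4, 0.2], [0.1, 0.4]])
v_a = np.array([1.0, 1.0])

# Step 2: additive alignment scores for all three source words.
scores = np.tanh(W_a @ s0 + H @ U_a.T) @ v_a

# Step 3: softmax (shifted by the max for numerical stability).
exps = np.exp(scores - scores.max())
alphas = exps / exps.sum()

# Step 4: context vector as the attention-weighted sum of encoder states.
context = alphas @ H

print(np.round(alphas, 3))    # approximately [0.308 0.382 0.31]
print(np.round(context, 3))
```

Changing s0 and rerunning shows the key property of the mechanism: the same encoder states produce a different context vector for each decoder state.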
Additive vs dot-product attention
For comparison: in Paper 08 (Transformer), the alignment score is simply:
eᵢ = qᵀ · kᵢ (dot product of query and key vectors)
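This is faster (no tanh, no projection vector), and the scores for all positions can be computed with a single matrix multiplication. Bahdanau’s additive formulation applies learned projections and a tanh non-linearity before scoring, which adds expressiveness at the cost of extra computation.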
In 2015, Luong et al. reported that multiplicative scoring functions, including the plain dot product, perform comparably to or better than the additive form while being cheaper to compute. Paper 08 adopts dot-product attention with the scores divided by √dₖ, the square root of the key dimension, so that large dot products do not saturate the softmax. But the core idea (score, softmax, weighted sum) is exactly Bahdanau’s.
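For comparison, here is a minimal sketch of scaled dot-product scoring feeding the same softmax and weighted-sum steps. The function names (scaled_dot_scores, attend) and the toy query, key, and value arrays are illustrative assumptions; only the scoring function differs from the additive version.

```python
import numpy as np

def scaled_dot_scores(q, K):
    """e_i = (q . k_i) / sqrt(d_k) for every row k_i of K."""
    d_k = K.shape[1]
    return K @ q / np.sqrt(d_k)

def attend(scores, V):
    """Softmax over the scores, then a weighted sum of the rows of V."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w, w @ V

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V = K   # reuse the keys as values to keep the toy example small

scores = scaled_dot_scores(q, K)   # [1, 0, 1] / sqrt(2)
weights, context = attend(scores, V)
print(weights, context)
```

There are no learned parameters in the scoring step itself, which is why it reduces to one matrix product; any learned projections are applied to q and K beforehand.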