Section 05

Worked example: translating 'Kal subah chai piyo' step by step

Neural Machine Translation by Jointly Learning to Align and Translate 2014


🔴 Advanced undergrad. Section 4 should be read first.

We will trace the full forward pass of a Bahdanau attention model translating the Hindi sentence “Kal subah chai piyo” (“Drink tea tomorrow morning”) into English.

We use tiny 2-dimensional vectors so you can verify every calculation by hand. The model in the paper uses 620-dimensional word embeddings and 1000-dimensional hidden states, but the structure is identical.


Setup

Source sentence: “Kal subah chai piyo” (4 words → 4 encoder states)

We pre-assign the encoder hidden states (imagine the bidirectional RNN encoder has already processed the sentence):

h₁ = [0.7, 0.1]   ("Kal"   — tomorrow)
h₂ = [0.5, 0.6]   ("subah" — morning)
h₃ = [0.2, 0.9]   ("chai"  — tea)
h₄ = [0.8, 0.3]   ("piyo"  — drink)

Decoder initial state: s₀ = [0.0, 0.0] (fresh start)

Target sequence to generate: “Drink” → “tea” → “tomorrow” → “morning” → <EOS>

We will walk through two decoding steps in full.


Decoding Step 1: Generate “Drink”

1a. Compute alignment scores

Throughout this example we use a simplification: instead of the full Wₐ/Uₐ machinery from Section 4, the alignment score is simply the dot product of the decoder state with each encoder state. (This is the Luong-style dot-product score, not the additive score of the original paper. Only the scoring function changes; the softmax and weighted-sum steps are the same, which makes it convenient for a worked example.)
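For reference, the two scoring functions can be written side by side. This is a sketch in the notation of the summary at the end of this section (t for the decoding step, i for the source position); the additive form is the one from the paper, and the symbol T for the source length is an assumption introduced here.

```latex
% Additive score (original paper): a small feed-forward network
e_{ti} = v_a^{\top} \tanh\left( W_a s_{t-1} + U_a h_i \right)

% Dot-product score (used in this worked example)
e_{ti} = s_{t-1}^{\top} h_i

% Either score feeds the same normalisation and weighted sum
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{T} \exp(e_{tk})},
\qquad
c_t = \sum_{i=1}^{T} \alpha_{ti} \, h_i
```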

Since s₀ = [0.0, 0.0], every dot product is 0, which would make the attention exactly uniform. Not very interesting. So we “prime” the decoder by substituting the mean of the encoder states for s₀:

s₀_primed = mean of all encoder states
          = ([0.7,0.1] + [0.5,0.6] + [0.2,0.9] + [0.8,0.3]) / 4
          = [2.2, 1.9] / 4
          = [0.55, 0.475]
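The priming step can be checked with a few lines of Python. This is a minimal sketch, not part of the original model: the names `encoder_states`, `mean_vector` and `s0_primed` are chosen here for illustration.

```python
# Toy encoder states from the Setup section (one 2-d vector per source word).
encoder_states = [
    [0.7, 0.1],  # "Kal"
    [0.5, 0.6],  # "subah"
    [0.2, 0.9],  # "chai"
    [0.8, 0.3],  # "piyo"
]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


s0_primed = mean_vector(encoder_states)
print([round(x, 3) for x in s0_primed])
```

The printed vector should agree with the hand calculation above to three decimals.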

Now compute dot products of s₀_primed with each hᵢ:

e₁ = s₀_primed · h₁ = (0.55×0.7) + (0.475×0.1) = 0.385 + 0.048 = 0.433
e₂ = s₀_primed · h₂ = (0.55×0.5) + (0.475×0.6) = 0.275 + 0.285 = 0.560
e₃ = s₀_primed · h₃ = (0.55×0.2) + (0.475×0.9) = 0.110 + 0.428 = 0.538
e₄ = s₀_primed · h₄ = (0.55×0.8) + (0.475×0.3) = 0.440 + 0.143 = 0.583
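The four dot products above can be reproduced as follows. A small sketch under the same toy setup; `dot` and `scores_step1` are illustrative names, and the code keeps full precision where the hand calculation rounds to three decimals.

```python
encoder_states = [
    [0.7, 0.1],  # "Kal"
    [0.5, 0.6],  # "subah"
    [0.2, 0.9],  # "chai"
    [0.8, 0.3],  # "piyo"
]
s0_primed = [0.55, 0.475]  # mean of the encoder states, computed above


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# One raw alignment score per source word.
scores_step1 = [dot(s0_primed, h) for h in encoder_states]
for i, score in enumerate(scores_step1, start=1):
    print(f"e{i} = {score:.4f}")
```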

1b. Apply softmax to get attention weights

exp(0.433) ≈ 1.542
exp(0.560) ≈ 1.751
exp(0.538) ≈ 1.713
exp(0.583) ≈ 1.791

Sum = 1.542 + 1.751 + 1.713 + 1.791 = 6.797

α₁ = 1.542 / 6.797 ≈ 0.227   ("Kal")
α₂ = 1.751 / 6.797 ≈ 0.258   ("subah")
α₃ = 1.713 / 6.797 ≈ 0.252   ("chai")
α₄ = 1.791 / 6.797 ≈ 0.263   ("piyo")  ← highest

Check: 0.227 + 0.258 + 0.252 + 0.263 = 1.000 ✓
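The softmax step can be sketched in Python as below. Subtracting the maximum score before exponentiating is a common numerical-stability habit, not something the text requires; it leaves the weights unchanged. The function and variable names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # shifting by the max does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores_step1 = [0.433, 0.560, 0.538, 0.583]  # rounded scores from step 1a
weights_step1 = softmax(scores_step1)
print([round(w, 3) for w in weights_step1])
print(round(sum(weights_step1), 6))
```

Because the hand calculation rounds each exponential before dividing, the last decimal of a weight can differ by one unit from what the code prints.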

At this first step the attention is nearly uniform: the scores barely separate the four source words. “Piyo” (drink) edges ahead slightly. A trained model would attend much more sharply to “piyo” when generating “Drink”, but these vectors are hand-picked toy values, not learned ones.

1c. Compute context vector c₁

c₁ = 0.227×[0.7, 0.1] + 0.258×[0.5, 0.6] + 0.252×[0.2, 0.9] + 0.263×[0.8, 0.3]

   = [0.159, 0.023]    ("Kal" contribution)
   + [0.129, 0.155]    ("subah" contribution)
   + [0.050, 0.227]    ("chai" contribution)
   + [0.210, 0.079]    ("piyo" contribution)

   = [0.548, 0.484]

c₁ = [0.548, 0.484]. The decoder uses this to generate the first English word.
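The weighted sum is short enough to write out directly. A minimal sketch using the rounded weights from step 1b; `context_vector` is an illustrative name, and small differences in the third decimal against the hand calculation come from rounding each contribution separately.

```python
encoder_states = [
    [0.7, 0.1],  # "Kal"
    [0.5, 0.6],  # "subah"
    [0.2, 0.9],  # "chai"
    [0.8, 0.3],  # "piyo"
]
weights_step1 = [0.227, 0.258, 0.252, 0.263]  # attention weights from step 1b


def context_vector(weights, states):
    """Attention-weighted sum of the encoder states."""
    dim = len(states[0])
    return [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]


c1 = context_vector(weights_step1, encoder_states)
print([round(x, 3) for x in c1])
```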

1d. Decoder produces “Drink”

The decoder combines c₁ with s₀ (and the <SOS> token embedding, omitted here for brevity) to update its state to s₁ and produce a probability distribution over the English vocabulary. With a trained model, “Drink” receives the highest probability. We take it.


Decoding Step 2: Generate “tea”

Now s₁ is the decoder state after generating “Drink”. In a trained model, s₁ encodes “I have just said Drink, now I need to say what is being drunk.” Let us suppose:

s₁ = [0.3, 0.8]

2a. Compute alignment scores using s₁

e₁ = s₁ · h₁ = (0.3×0.7) + (0.8×0.1) = 0.21 + 0.08 = 0.290
e₂ = s₁ · h₂ = (0.3×0.5) + (0.8×0.6) = 0.15 + 0.48 = 0.630
e₃ = s₁ · h₃ = (0.3×0.2) + (0.8×0.9) = 0.06 + 0.72 = 0.780   ← highest
e₄ = s₁ · h₄ = (0.3×0.8) + (0.8×0.3) = 0.24 + 0.24 = 0.480

“Chai” (tea) now gets the highest raw score. The decoder, having already said “Drink,” naturally looks toward what is being drunk — “chai.”
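The same scoring code, run with the assumed s₁, shows which source word receives the highest raw score. A sketch only: s₁ = [0.3, 0.8] is the supposed value from the text, not the output of a trained decoder, and the helper names are illustrative.

```python
source_words = ["Kal", "subah", "chai", "piyo"]
encoder_states = [
    [0.7, 0.1],  # "Kal"
    [0.5, 0.6],  # "subah"
    [0.2, 0.9],  # "chai"
    [0.8, 0.3],  # "piyo"
]
s1 = [0.3, 0.8]  # assumed decoder state after generating "Drink"


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


scores_step2 = [dot(s1, h) for h in encoder_states]
best = max(range(len(scores_step2)), key=lambda i: scores_step2[i])
print(source_words[best])  # chai
```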

2b. Apply softmax

exp(0.290) ≈ 1.336
exp(0.630) ≈ 1.878
exp(0.780) ≈ 2.181   ← largest
exp(0.480) ≈ 1.616

Sum = 1.336 + 1.878 + 2.181 + 1.616 = 7.011

α₁ = 1.336 / 7.011 ≈ 0.191   ("Kal")
α₂ = 1.878 / 7.011 ≈ 0.268   ("subah")
α₃ = 2.181 / 7.011 ≈ 0.311   ("chai")  ← most attended
α₄ = 1.616 / 7.011 ≈ 0.230   ("piyo")

Check: 0.191 + 0.268 + 0.311 + 0.230 = 1.000 ✓

2c. Compute context vector c₂

c₂ = 0.191×[0.7, 0.1] + 0.268×[0.5, 0.6] + 0.311×[0.2, 0.9] + 0.230×[0.8, 0.3]

   = [0.134, 0.019]    ("Kal" contribution)
   + [0.134, 0.161]    ("subah" contribution)
   + [0.062, 0.280]    ("chai" contribution)
   + [0.184, 0.069]    ("piyo" contribution)

   = [0.514, 0.529]

c₂ has shifted compared to c₁ — it now leans more toward the “chai” encoder state than before. The decoder uses c₂ to generate its next word and, with a trained model, outputs “tea” with highest probability.
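Steps a to c can be bundled into one function and run for both decoding steps, which makes the shift toward “chai” measurable. This is a sketch under the example's assumptions (dot-product scoring, the primed s₀, the supposed s₁); `attend` is an illustrative name, and Euclidean distance is just one convenient way to quantify “leans toward”.

```python
import math

encoder_states = [
    [0.7, 0.1],  # "Kal"
    [0.5, 0.6],  # "subah"
    [0.2, 0.9],  # "chai"
    [0.8, 0.3],  # "piyo"
]


def attend(state, states):
    """One dot-product attention step: returns (weights, context)."""
    scores = [sum(a * b for a, b in zip(state, h)) for h in states]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
    return weights, context


weights1, c1 = attend([0.55, 0.475], encoder_states)  # step 1, primed state
weights2, c2 = attend([0.3, 0.8], encoder_states)     # step 2, assumed s1

chai = encoder_states[2]
print("distance c1 -> chai:", round(math.dist(c1, chai), 3))
print("distance c2 -> chai:", round(math.dist(c2, chai), 3))
```

The second distance should come out smaller than the first, matching the observation that c₂ has moved toward the “chai” state.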


The alignment heatmap

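If we record the attention weights at every decoding step (one row per target word, one column per source word), we get an alignment matrix. The first two rows are the weights computed above, rounded to two decimals; the remaining rows are illustrative values for the steps we did not work through: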

             Kal         subah      chai   piyo
             (tomorrow)  (morning)  (tea)  (drink)
─────────────────────────────────────────────────────
"Drink"    │ 0.23        0.26       0.25   0.26    │  ← nearly uniform (first step)
"tea"      │ 0.19        0.27       0.31   0.23    │  ← "chai" dominates
"tomorrow" │ 0.38        0.21       0.18   0.23    │  ← "Kal" dominates
"morning"  │ 0.21        0.44       0.18   0.17    │  ← "subah" dominates
<EOS>      │ 0.25        0.25       0.25   0.25    │  ← uniform (stopping signal)
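The matrix above can be held as a small table and queried for the strongest source word in each row. A sketch using the two-decimal values exactly as printed; the names `alignment` and `strongest_sources` are illustrative. Ties are reported in full, so the “Drink” row (where rounding hides the slight lead of “piyo”) returns two words.

```python
source_words = ["Kal", "subah", "chai", "piyo"]

# Rows: target words. Columns: source words, in the order of source_words.
alignment = {
    "Drink":    [0.23, 0.26, 0.25, 0.26],
    "tea":      [0.19, 0.27, 0.31, 0.23],
    "tomorrow": [0.38, 0.21, 0.18, 0.23],
    "morning":  [0.21, 0.44, 0.18, 0.17],
    "<EOS>":    [0.25, 0.25, 0.25, 0.25],
}


def strongest_sources(row, words):
    """Source words sharing the largest attention weight in a row."""
    top = max(row)
    return [w for w, a in zip(words, row) if a == top]


row_sums = {target: sum(row) for target, row in alignment.items()}
for target, row in alignment.items():
    print(f"{target:10s} -> {strongest_sources(row, source_words)}")
```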

This kind of heatmap, hot where attention is high, is what the paper displayed in Figure 3. Reading along a row shows which source words a target word “looked at” when it was generated. The alignment is soft (no cell is exactly 1.0) but the strongest connections are clearly visible.

In a trained model, the fact that “tea” looks at “chai” and “tomorrow” looks at “Kal” shows that the model has learned translation alignment without being explicitly taught it. This visualisation was enormously influential: it gave researchers one of the first direct views of what a neural translation model was doing internally.


Summary of the full pass

At each decoding step t:

  1. Dot decoder state sₜ₋₁ against all encoder states hᵢ → raw scores eₜᵢ
  2. Softmax → attention weights αₜᵢ (sum to 1)
  3. Weighted sum → context vector cₜ (fresh each step, source-aware)
  4. Combine cₜ + sₜ₋₁ + previous output → new decoder state sₜ
  5. Project sₜ → vocabulary probabilities → pick word

Repeat until <EOS>.
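The five steps can be arranged into a loop as sketched below. Steps 1 to 3 are computed as in this section. Steps 4 and 5 need a trained decoder and output layer, which this example does not have, so they are delegated to a caller-supplied `step_fn`. The `scripted_step` helper is a purely hypothetical stand-in that replays the target sentence and reuses the context vector as the next state; it does not reproduce s₁ = [0.3, 0.8] or the heatmap values.

```python
import math


def attend(state, states):
    """Steps 1-3: dot-product scores, softmax, weighted sum."""
    scores = [sum(a * b for a, b in zip(state, h)) for h in states]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
    return weights, context


def decode(encoder_states, initial_state, step_fn, max_steps=10):
    """Attention decoding loop; step_fn stands in for steps 4 and 5."""
    state, prev_word, outputs = initial_state, "<SOS>", []
    for _ in range(max_steps):
        weights, context = attend(state, encoder_states)
        state, prev_word = step_fn(context, state, prev_word)
        outputs.append((prev_word, weights))
        if prev_word == "<EOS>":
            break
    return outputs


def scripted_step(script):
    """Hypothetical stand-in for a trained decoder: replays a fixed script."""
    words = iter(script)

    def step_fn(context, state, prev_word):
        # Placeholder update: the context vector becomes the new state.
        return context, next(words)

    return step_fn


encoder_states = [[0.7, 0.1], [0.5, 0.6], [0.2, 0.9], [0.8, 0.3]]
script = ["Drink", "tea", "tomorrow", "morning", "<EOS>"]
outputs = decode(encoder_states, [0.55, 0.475], scripted_step(script))
for word, weights in outputs:
    print(word, [round(w, 2) for w in weights])
```

Passing the trained parts in as a function keeps the attention arithmetic separate from the pieces that would have to be learned.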