Section 06

The code: attention in 25 lines of NumPy

Neural Machine Translation by Jointly Learning to Align and Translate 2014

6. The code — attention in 25 lines of NumPy

🟡 First-year college. You need basic Python and NumPy. Runs free on Google Colab — no GPU needed.

We will implement the core attention mechanism from scratch using only NumPy: given encoder hidden states and a decoder query vector, compute attention weights and the resulting context vector.

import numpy as np

# ── Source sentence: "Kal subah chai piyo" (4 words) ──────────────────────────
# Each hidden state is a 4-dimensional vector (real models use ~1000 dims)
encoder_states = np.array([
    [0.7, 0.1, 0.5, 0.3],   # h1: "Kal"   (tomorrow)
    [0.5, 0.6, 0.2, 0.8],   # h2: "subah" (morning)
    [0.2, 0.9, 0.7, 0.1],   # h3: "chai"  (tea)
    [0.8, 0.3, 0.4, 0.6],   # h4: "piyo"  (drink)
])  # shape: (4, 4) — 4 words × 4 dims

# ── Decoder query: state after generating "Drink" ─────────────────────────────
decoder_state = np.array([0.3, 0.8, 0.1, 0.6])  # shape: (4,)

# ── Step 1: Compute alignment scores (dot product attention) ─────────────────
# For each source word, how compatible is it with the current decoder state?
scores = encoder_states @ decoder_state  # (4,4) @ (4,) = (4,) — one score per word
print("Alignment scores:", np.round(scores, 3))
# e.g., [0.64, 0.89, 0.91, 0.74] — "chai" and "subah" score highest

# ── Step 2: Softmax — convert scores to attention weights ────────────────────
# Subtract max for numerical stability (mathematically identical, avoids overflow)
scores_stable = scores - scores.max()
exp_scores    = np.exp(scores_stable)                    # exponentiate
attn_weights  = exp_scores / exp_scores.sum()            # normalise to sum=1
print("Attention weights:", np.round(attn_weights, 3))
print("Sum of weights:", attn_weights.sum())             # should be 1.000

# ── Step 3: Context vector — weighted sum of encoder states ──────────────────
# Each encoder state is scaled by how much attention it receives
context = attn_weights @ encoder_states  # (4,) @ (4,4) = (4,) — blend of states
print("Context vector:", np.round(context, 3))

# ── Visualise the alignment heatmap ──────────────────────────────────────────
import matplotlib.pyplot as plt

words    = ["Kal\n(tomorrow)", "subah\n(morning)", "chai\n(tea)", "piyo\n(drink)"]
fig, ax  = plt.subplots(figsize=(6, 2))
im       = ax.imshow([attn_weights], cmap="YlOrRd", vmin=0, vmax=1)
ax.set_xticks(range(4)); ax.set_xticklabels(words, fontsize=9)
ax.set_yticks([0]);      ax.set_yticklabels(["'tea'"], fontsize=9)
for j, w in enumerate(attn_weights):            # annotate cells with numbers
    ax.text(j, 0, f"{w:.2f}", ha="center", va="center", color="black")
ax.set_title("Attention weights when generating 'tea'")
plt.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout()
plt.savefig("attention_heatmap.png", dpi=120)   # save to view in Colab
plt.show()

What the code does, line by line:

  • encoder_states — each row is one source word’s hidden state (4D here; 1000D in real models)
  • decoder_state — the decoder’s current vector after generating the previous word
  • scores = encoder_states @ decoder_state — matrix-vector product gives one score per source word (this is dot-product / Luong attention; Bahdanau would add the tanh(W·s + U·h) transformation, which is a few more lines)
  • scores_stable = scores - scores.max() — numerical stability trick, does not change the output
  • exp_scores / exp_scores.sum() — softmax from first principles
  • attn_weights @ encoder_states — this transposes the logic: we weight each row of encoder_states by its attention weight, then sum. The result is the context vector.
  • The final block plots the 1-row heatmap showing how much weight each source word received.

Expected output:

Alignment scores:  [0.64, 0.89, 0.91, 0.74]
Attention weights: [0.17, 0.27, 0.28, 0.21]   ← "chai" and "subah" dominate
Sum of weights: 1.0
Context vector: [0.52, 0.62, 0.41, 0.46]

Try this: change decoder_state to [0.8, 0.1, 0.2, 0.4] (a state that “wants” to generate “tomorrow”). See how the attention weights shift — “Kal” should now receive the highest weight. This is the whole point: different decoder states produce different attention distributions over the same source.