6. The code — attention in 25 lines of NumPy
🟡 First-year college. You need basic Python and NumPy. Runs free on Google Colab — no GPU needed.
We will implement the core attention mechanism from scratch using only NumPy: given encoder hidden states and a decoder query vector, compute attention weights and the resulting context vector.
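In symbols, the three steps carried out by the script are roughly the following. The names here are assumptions introduced for illustration: $h_i$ is the hidden state of source word $i$, $s$ is the decoder state, $e_i$ is the alignment score, $\alpha_i$ is the attention weight, and $c$ is the context vector.

```latex
\begin{align*}
e_i      &= s^\top h_i                                    && \text{(Step 1: alignment score)} \\
\alpha_i &= \frac{\exp(e_i)}{\sum_{j=1}^{4} \exp(e_j)}    && \text{(Step 2: softmax)} \\
c        &= \sum_{i=1}^{4} \alpha_i \, h_i                && \text{(Step 3: context vector)}
\end{align*}
```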
import numpy as np
# ── Source sentence: "Kal subah chai piyo" (4 words) ──────────────────────────
# Each hidden state is a 4-dimensional vector (real models use ~1000 dims)
encoder_states = np.array([
    [0.7, 0.1, 0.5, 0.3],  # h1: "Kal" (tomorrow)
    [0.5, 0.6, 0.2, 0.8],  # h2: "subah" (morning)
    [0.2, 0.9, 0.7, 0.1],  # h3: "chai" (tea)
    [0.8, 0.3, 0.4, 0.6],  # h4: "piyo" (drink)
]) # shape: (4, 4) — 4 words × 4 dims
# ── Decoder query: state after generating "Drink" ─────────────────────────────
decoder_state = np.array([0.3, 0.8, 0.1, 0.6]) # shape: (4,)
# ── Step 1: Compute alignment scores (dot product attention) ─────────────────
# For each source word, how compatible is it with the current decoder state?
scores = encoder_states @ decoder_state # (4,4) @ (4,) = (4,) — one score per word
print("Alignment scores:", np.round(scores, 3))
# prints [0.52 1.13 0.91 0.88]: "subah" and "chai" score highest
# ── Step 2: Softmax — convert scores to attention weights ────────────────────
# Subtract max for numerical stability (mathematically identical, avoids overflow)
scores_stable = scores - scores.max()
exp_scores = np.exp(scores_stable) # exponentiate
attn_weights = exp_scores / exp_scores.sum() # normalise to sum=1
print("Attention weights:", np.round(attn_weights, 3))
print("Sum of weights:", attn_weights.sum()) # should be 1.0, up to floating-point rounding
# ── Step 3: Context vector — weighted sum of encoder states ──────────────────
# Each encoder state is scaled by how much attention it receives
context = attn_weights @ encoder_states # (4,) @ (4,4) = (4,) — blend of states
print("Context vector:", np.round(context, 3))
# ── Visualise the alignment heatmap ──────────────────────────────────────────
import matplotlib.pyplot as plt
words = ["Kal\n(tomorrow)", "subah\n(morning)", "chai\n(tea)", "piyo\n(drink)"]
fig, ax = plt.subplots(figsize=(6, 2))
im = ax.imshow([attn_weights], cmap="YlOrRd", vmin=0, vmax=1)
ax.set_xticks(range(4)); ax.set_xticklabels(words, fontsize=9)
ax.set_yticks([0]); ax.set_yticklabels(["'tea'"], fontsize=9)
for j, w in enumerate(attn_weights): # annotate cells with numbers
    ax.text(j, 0, f"{w:.2f}", ha="center", va="center", color="black")
ax.set_title("Attention weights when generating 'tea'")
plt.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout()
plt.savefig("attention_heatmap.png", dpi=120) # save to view in Colab
plt.show()
What the code does, line by line:
- encoder_states: each row is one source word’s hidden state (4-D here; around 1000-D in real models).
- decoder_state: the decoder’s current vector after generating the previous word.
- scores = encoder_states @ decoder_state: a matrix-vector product gives one score per source word. This is dot-product (Luong-style) attention; Bahdanau attention would instead score each word with a small learned network, v·tanh(W·s + U·h), which takes a few more lines.
- scores_stable = scores - scores.max(): a numerical-stability trick. It does not change the output, because softmax is unaffected when the same constant is subtracted from every score.
- exp_scores / exp_scores.sum(): softmax from first principles.
- attn_weights @ encoder_states: a vector-matrix product that scales each row of encoder_states by its attention weight and then sums the rows. The result is the context vector.
- The final block plots a one-row heatmap showing how much weight each source word received.
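The additive (Bahdanau-style) scoring mentioned in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original paper's implementation: the matrices W and U, the vector v, and the size attn_dim are hypothetical and filled with random numbers here, whereas a real model would learn them during training.

```python
import numpy as np

# Same toy inputs as the main script
encoder_states = np.array([
    [0.7, 0.1, 0.5, 0.3],  # "Kal"
    [0.5, 0.6, 0.2, 0.8],  # "subah"
    [0.2, 0.9, 0.7, 0.1],  # "chai"
    [0.8, 0.3, 0.4, 0.6],  # "piyo"
])
decoder_state = np.array([0.3, 0.8, 0.1, 0.6])

# Hypothetical parameters: random stand-ins for weights a model would learn
rng = np.random.default_rng(0)
attn_dim = 8                                   # size of the scoring layer (assumed)
W = rng.normal(scale=0.5, size=(attn_dim, 4))  # projects the decoder state s
U = rng.normal(scale=0.5, size=(attn_dim, 4))  # projects each encoder state h
v = rng.normal(scale=0.5, size=attn_dim)       # reduces each projection to one score

# score_i = v . tanh(W s + U h_i), computed for all source words at once
projected = np.tanh(decoder_state @ W.T + encoder_states @ U.T)  # (8,) + (4, 8) -> (4, 8)
additive_scores = projected @ v                                  # (4, 8) @ (8,) -> (4,)

# Softmax and context vector work exactly as in the dot-product version
shifted = additive_scores - additive_scores.max()
additive_weights = np.exp(shifted) / np.exp(shifted).sum()
additive_context = additive_weights @ encoder_states

print("Additive scores:  ", np.round(additive_scores, 3))
print("Attention weights:", np.round(additive_weights, 3))
print("Context vector:   ", np.round(additive_context, 3))
```

Because the parameters are random, the printed weights carry no meaning; the point is only that Step 1 changes while Steps 2 and 3 stay the same.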
Expected output (rounded to two decimals here; the script prints three):
Alignment scores: [0.52, 1.13, 0.91, 0.88]
Attention weights: [0.17, 0.32, 0.26, 0.25] ← "subah" and "chai" receive the most weight
Sum of weights: 1.0
Context vector: [0.53, 0.52, 0.43, 0.48]
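Try this: change decoder_state to [0.8, 0.1, 0.2, 0.4], a state that leans more toward “Kal” (tomorrow). See how the attention weights shift: “Kal” rises from about 0.17 to about 0.25, while “chai” falls from about 0.26 to about 0.18. With these toy vectors “piyo” ends up with the highest weight (about 0.31), because its hidden state has larger entries in the dimensions this query emphasises. This is the whole point: different decoder states produce different attention distributions over the same source.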
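One way to run this experiment side by side is to wrap Steps 1 and 2 in a small helper and call it with both decoder states. The function name attention_weights is introduced here for illustration and does not appear in the script above.

```python
import numpy as np

words = ["Kal", "subah", "chai", "piyo"]
encoder_states = np.array([
    [0.7, 0.1, 0.5, 0.3],  # "Kal"
    [0.5, 0.6, 0.2, 0.8],  # "subah"
    [0.2, 0.9, 0.7, 0.1],  # "chai"
    [0.8, 0.3, 0.4, 0.6],  # "piyo"
])

def attention_weights(encoder_states, decoder_state):
    """Dot-product scores followed by a numerically stable softmax."""
    scores = encoder_states @ decoder_state
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

original = attention_weights(encoder_states, np.array([0.3, 0.8, 0.1, 0.6]))
modified = attention_weights(encoder_states, np.array([0.8, 0.1, 0.2, 0.4]))

# Print each word's weight before and after changing the decoder state
for word, before, after in zip(words, original, modified):
    print(f"{word:<6} {before:.2f} -> {after:.2f}")
# Kal    0.17 -> 0.25
# subah  0.32 -> 0.26
# chai   0.26 -> 0.18
# piyo   0.25 -> 0.31
```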