Section 03

The idea: re-reading the source at every step

Neural Machine Translation by Jointly Learning to Align and Translate (2014)

Picture a student sitting the Class 12 history board exam. The question asks her to write a detailed account of the 1857 uprising — causes, events, consequences. The question paper is on the left. Her open textbook is on the right.

A poor student memorises the chapter the night before, closes the book, and tries to write everything from memory. For a short answer, this works. For a five-page essay, memory fails. Details blur together. Dates get confused. Connections are lost.

A good student, a smart student, keeps the textbook open. When she writes about the causes, she looks at the causes section. When she mentions Bahadur Shah Zafar, she glances at his biography paragraph. When she reaches the aftermath, she re-reads the aftermath section. With every sentence she writes, she consults the precise part of the source that is relevant to what she is writing right now.

This is Bahdanau attention.

The decoder is the student writing the answer. The encoder’s hidden states — one vector per source word — are the open textbook. At each decoding step (each word the decoder generates), the model:

  1. Looks at every source word — all the encoder hidden states — and asks: “How relevant is each source word to what I am about to generate right now?”
  2. Computes a relevance score for each source word using a small learned function called the alignment model, which takes the decoder’s previous hidden state and that word’s encoder hidden state as input.
  3. Converts scores into weights using softmax, so they sum to 1. These are the attention weights αᵢ.
  4. Computes a context vector as the weighted average of all encoder hidden states — a blend of source information, dialled up where relevant and dialled down where not.
  5. Uses this context vector (plus its own previous state and the word it generated last) to generate the next target word.
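
Steps 1 to 4 above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses the additive scoring form v · tanh(W s + U h) for the alignment model, and the names attend, additive_score, W_a, U_a, v_a, as well as every toy number, are made up for the example. In a trained system the weights would be learned, and step 5 (generating the word) is left out.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def additive_score(s_prev, h_j, W_a, U_a, v_a):
    # Alignment model: v_a . tanh(W_a s_prev + U_a h_j)
    ws = matvec(W_a, s_prev)
    uh = matvec(U_a, h_j)
    return sum(v * math.tanh(a + b) for v, a, b in zip(v_a, ws, uh))

def attend(s_prev, encoder_states, W_a, U_a, v_a):
    # Steps 1-2: score every source position against the decoder state.
    scores = [additive_score(s_prev, h, W_a, U_a, v_a) for h in encoder_states]
    # Step 3: turn scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Step 4: context vector = weighted average of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy values (hypothetical): three source words, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]                 # decoder's previous hidden state
W_a = [[1.0, 0.0], [0.0, 1.0]]       # would be learned in practice
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]

weights, context = attend(s_prev, encoder_states, W_a, U_a, v_a)
print("attention weights:", weights)
print("context vector:", context)
```

Because the weights come out of a softmax, every source word contributes something to the context vector; the scores only decide how much.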

The beautiful thing: none of the weights are hard-coded. There is no rule saying “English word 3 always corresponds to Hindi word 2.” The alignment model is a small neural network that is trained end-to-end along with everything else. It learns, from data alone, which source words matter for which target words.

This is called soft alignment, as opposed to the “hard alignment” used in older statistical machine translation, where each target word was explicitly mapped to exactly one source word. Soft alignment allows one target word to be influenced by multiple source words simultaneously, with varying degrees of weight. “Namaste” in Hindi might draw attention from both “hello” and any other English source words that signal how formal the greeting is.
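
The difference between the two kinds of alignment is easy to see on a single target word. The sketch below is only an illustration: the relevance scores are invented, and hard_alignment and soft_alignment are hypothetical helper names, not functions from any library.

```python
import math

def hard_alignment(scores):
    # Pick exactly one source word: weight 1 for the best score, 0 elsewhere.
    best = max(range(len(scores)), key=lambda j: scores[j])
    return [1.0 if j == best else 0.0 for j in range(len(scores))]

def soft_alignment(scores):
    # Spread weight over all source words with a softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores of three source words for one target word.
scores = [2.0, 1.5, -1.0]

hard = hard_alignment(scores)
soft = soft_alignment(scores)
print("hard:", hard)
print("soft:", soft)
```

With hard alignment the second source word is ignored entirely even though its score is close to the winner's; with soft alignment it still receives a substantial share of the weight.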

The bidirectional encoder: the paper also relied on one more idea. In a standard unidirectional RNN encoder, when processing word 3, the network has seen words 1 and 2 but not words 4, 5, 6. Information flows only forward. But the meaning of a word often depends on what comes after it, not just before. In “The bank by the river” and “The bank returned my money”, the word “bank” means different things depending on what follows.

To give the encoder full context, Bahdanau used a bidirectional RNN: one RNN reading the sentence left to right and another reading it right to left. The hidden states from both passes are concatenated at each position. So for word 3, the representation reflects not only words 1 and 2 (from the forward pass) but also words 4, 5, and beyond (from the backward pass). Every source word’s representation reflects its full context in the sentence.

The decoder then attends over these richer, bidirectional hidden states.
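
A rough sketch of the bidirectional idea, in plain Python. To keep it short it assumes a plain tanh recurrent cell rather than the gated units used in the paper, and all names (rnn_step, run_rnn, bidirectional_encode) and weight values are hypothetical. The point is only the data flow: run one pass forward, one pass over the reversed sentence, then concatenate the two states at each position.

```python
import math

def rnn_step(x, h, W_x, W_h):
    # Simplified cell (assumption): h' = tanh(W_x x + W_h h)
    return [math.tanh(sum(wx * xi for wx, xi in zip(W_x[k], x)) +
                      sum(wh * hi for wh, hi in zip(W_h[k], h)))
            for k in range(len(h))]

def run_rnn(inputs, W_x, W_h, hidden_size):
    # Process the sequence in the given order, keeping every hidden state.
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_x, W_h)
        states.append(h)
    return states

def bidirectional_encode(inputs, fwd_params, bwd_params, hidden_size):
    W_x_f, W_h_f = fwd_params
    W_x_b, W_h_b = bwd_params
    forward = run_rnn(inputs, W_x_f, W_h_f, hidden_size)
    # Read the sentence right to left, then flip the states back so that
    # backward[j] lines up with source position j.
    backward = run_rnn(inputs[::-1], W_x_b, W_h_b, hidden_size)[::-1]
    # Concatenate both directions at each position.
    return [f + b for f, b in zip(forward, backward)]

# Toy values (hypothetical): three source words, embedding and hidden size 2.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
params = ([[0.5, -0.3], [0.2, 0.4]],    # W_x
          [[0.1, 0.2], [-0.2, 0.1]])    # W_h

annotations = bidirectional_encode(inputs, params, params, hidden_size=2)
for j, a in enumerate(annotations):
    print("word", j + 1, "->", a)
```

Each entry of annotations has twice the hidden size: the first half summarises the sentence up to that word, the second half summarises the sentence from that word to the end. Sharing one set of toy weights between the two directions is a shortcut for the example; the two passes would normally have separate parameters.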

The result was the first neural machine translation system that could see the whole picture at every moment — looking backward and forward through the source, attending selectively, generating one word at a time. The fixed-size bottleneck was gone.