2. The problem — one vector to rule them all

The seq2seq model (Paper 06) worked like this: an encoder LSTM reads the entire source sentence, word by word. After reading the last word, it produces a single fixed-size vector. That is everything the model remembers about the source. The decoder then starts producing the translation from this one vector alone.
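To make the bottleneck concrete, here is a minimal sketch of an encoder that keeps only its final state. It is an illustration, not the model from Paper 06: a plain tanh RNN stands in for the LSTM, the weights are random rather than trained, and the names `encode`, `EMBED_DIM` and `HIDDEN_DIM` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, HIDDEN_DIM = 8, 16  # hypothetical sizes, chosen only for illustration

# Randomly initialised weights of a toy tanh RNN (an assumed stand-in for the LSTM).
W_in = rng.normal(scale=0.1, size=(HIDDEN_DIM, EMBED_DIM))
W_rec = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))

def encode(source_embeddings):
    """Read the source one word at a time and return only the final state."""
    h = np.zeros(HIDDEN_DIM)
    for x in source_embeddings:
        h = np.tanh(W_in @ x + W_rec @ h)
    return h  # the single context vector handed to the decoder

short_sentence = rng.normal(size=(2, EMBED_DIM))   # stand-in for a 2-word source
long_sentence = rng.normal(size=(50, EMBED_DIM))   # stand-in for a 50-word source

context_short = encode(short_sentence)
context_long = encode(long_sentence)
print(context_short.shape, context_long.shape)  # (16,) (16,)
```

Whether the source has 2 words or 50, the decoder receives the same 16 numbers. Every intermediate state computed inside the loop is overwritten and lost.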

Think of it this way. You are asked to read a full page of text, and a friend sitting behind a wall will then have to answer questions about it. The wall has a single small hole, big enough to pass only one small chit. You must cram everything important about the page onto that chit. Your friend then has to answer every question using only that chit.

For a short text (“Chai piyo”), the chit works fine. For a page of dense content, it is hopeless. You simply cannot fit enough.

The seq2seq context vector is that chit.

Researchers confirmed this with data. On the WMT 2014 English-to-French benchmark, the seq2seq model’s translation quality (measured by BLEU score) held up well for sentences up to about 20 words. For sentences of 30–40 words, quality dropped noticeably. For sentences longer than 50 words, it degraded badly.

The LSTMs themselves were not the problem. The LSTM is actually quite good at carrying information across many time steps, as we saw in Paper 04. The problem was a design choice: forcing all that sequential context through a single vector that the decoder sees at the start but cannot revisit.

There were two specific pain points:

Pain point 1: Long-range dependencies. In an English sentence, the subject might appear at position 1 and the verb that agrees with it might appear at position 15. By the time the encoder has processed position 15, the context vector must still somehow encode what the subject at position 1 was. Over long sentences, that early information fades.

Pain point 2: Word reordering. Different languages have very different word orders. English is Subject-Verb-Object; Hindi is Subject-Object-Verb. When generating word 3 in Hindi, the decoder might need to look back at word 7 of the English source. But it cannot look at the source at all; it only has the context vector.

Bahdanau’s paper targeted both problems with a single mechanism. Instead of asking the encoder to compress everything into one vector, it retained all the encoder’s hidden states — one per source word — and let the decoder reach back and consult them at each decoding step. The compression bottleneck was eliminated entirely.
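The idea of consulting every encoder state at each decoding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the scoring function is a small additive (tanh) network in the spirit of the paper, the parameters `W_enc`, `W_dec` and `v` are random rather than trained, and the names `attend` and `ATTN_DIM` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM, ATTN_DIM = 16, 10  # hypothetical sizes, chosen only for illustration

# Hypothetical, randomly initialised parameters of an additive scoring function.
W_enc = rng.normal(scale=0.1, size=(ATTN_DIM, HIDDEN_DIM))
W_dec = rng.normal(scale=0.1, size=(ATTN_DIM, HIDDEN_DIM))
v = rng.normal(scale=0.1, size=ATTN_DIM)

def attend(decoder_state, encoder_states):
    """Return (context, weights) for one decoding step."""
    # Score every source position against the current decoder state.
    scores = np.tanh(encoder_states @ W_enc.T + decoder_state @ W_dec.T) @ v
    # Softmax turns the scores into weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context is a weighted average of all encoder states.
    context = weights @ encoder_states
    return context, weights

encoder_states = rng.normal(size=(7, HIDDEN_DIM))  # one state per source word
decoder_state = rng.normal(size=HIDDEN_DIM)        # decoder state at this step

context, weights = attend(decoder_state, encoder_states)
print(weights.shape, context.shape)
```

The decoder calls `attend` again at every step with its updated state, so the context is recomputed each time instead of being fixed once at the start. With trained parameters, the weights would show which source words the decoder is consulting for the word it is about to produce.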