4. How it works — encoders, decoders, and backwards input
Let’s look under the hood of the encoder-decoder architecture. Here is the exact step-by-step process of how an English sentence becomes a French sentence during inference (when the model is actually translating for a user).
Step 1: The encoder reads the input
We have an English sentence: “How are you”. We add a special token to
mark the end of the sentence: <EOS> (end of sentence).
The encoder LSTM reads the vector for “How” and updates its hidden
state. It takes that hidden state, reads “are”, and updates again. It
reads “you”, updates. It reads <EOS>, and updates one final time.
Step 2: The context-vector handoff
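The very last hidden state of the encoder LSTM is captured. This is the context vector. (Strictly speaking, an LSTM carries two internal vectors, a hidden state and a cell state, and both are handed over.) The English words themselves are now discarded. Everything the model knows about the sentence is crammed into this single fixed-size array of numbers, no matter how long the sentence was.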
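Steps 1 and 2 can be sketched in a few lines of Python. This is a toy, not the real model: a plain tanh recurrent cell stands in for the LSTM, the weights are random rather than trained, and the names embed, rnn_step, and encode are made up for illustration.

```python
import math
import random

random.seed(0)
HIDDEN = 4  # size of the hidden state (and so of the context vector)

# Hypothetical toy vocabulary with random word vectors.
vocab = ["How", "are", "you", "<EOS>"]
embed = {w: [random.uniform(-1, 1) for _ in range(HIDDEN)] for w in vocab}

# Random, untrained weights: W_x acts on the input, W_h on the previous state.
W_x = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
W_h = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

def rnn_step(x, h):
    """One recurrent update: new_h = tanh(W_x @ x + W_h @ h)."""
    return [
        math.tanh(sum(W_x[i][j] * x[j] for j in range(HIDDEN))
                  + sum(W_h[i][j] * h[j] for j in range(HIDDEN)))
        for i in range(HIDDEN)
    ]

def encode(tokens):
    """Read tokens one at a time and return the final hidden state."""
    h = [0.0] * HIDDEN              # the encoder starts with an empty memory
    for tok in tokens:
        h = rnn_step(embed[tok], h)  # update the state with each word
    return h                        # the last state is the context vector

context = encode(["How", "are", "you", "<EOS>"])
print(len(context))  # 4, no matter how long the sentence was
```

The point to notice is the return value: whether the input has four tokens or forty, encode hands back the same fixed number of values.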
Step 3: The decoder starts speaking
We awaken the decoder LSTM. We set its initial hidden state to be the
context vector. We feed it a special start-of-sentence <SOS> token to
get it going.
Based on the context vector and the <SOS> token, the decoder outputs
a probability distribution over the entire French vocabulary. It picks
the most likely word — let’s say “Comment”.
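Here is a minimal sketch of that last step: turning the decoder's raw scores into a probability distribution and picking the most likely word. The four-word vocabulary and the scores are invented for illustration; a real decoder scores every word in a vocabulary of many thousands.

```python
import math

french_vocab = ["Comment", "allez", "vous", "<EOS>"]
scores = [2.0, 0.5, 0.1, -1.0]   # made-up raw decoder outputs (logits)

def softmax(xs):
    """Convert arbitrary scores into probabilities that sum to 1."""
    m = max(xs)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(scores)
best = max(range(len(probs)), key=lambda i: probs[i])   # index of the largest probability
print(french_vocab[best])  # Comment
```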
Step 4: The feedback loop (greedy decoding)
The decoder takes the word it just generated (“Comment”) and feeds it
back into itself as the input for the next step. It looks at its
updated hidden state, sees “Comment”, and predicts “allez”. It feeds
“allez” back in, and predicts “vous”. It feeds “vous” back in, and
predicts <EOS>. Once it predicts <EOS>, the translation is complete.
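The loop can be sketched as follows. The decoder itself is replaced by a hypothetical lookup table, toy_decoder_step, that happens to give the right answer, so the code shows only the control flow of greedy decoding: feed the last word back in, take the most likely next word, stop at <EOS>. A real decoder would also update and carry its hidden state from step to step. The length cap max_len is an assumption added here to guard against a model that never emits <EOS>.

```python
# Hypothetical stand-in for one decoder step: maps the previous word to a
# probability distribution over the next word. A real decoder would compute
# this from its hidden state.
TOY_TABLE = {
    "<SOS>":   {"Comment": 0.90, "allez": 0.05, "vous": 0.03, "<EOS>": 0.02},
    "Comment": {"Comment": 0.02, "allez": 0.90, "vous": 0.05, "<EOS>": 0.03},
    "allez":   {"Comment": 0.02, "allez": 0.03, "vous": 0.90, "<EOS>": 0.05},
    "vous":    {"Comment": 0.02, "allez": 0.03, "vous": 0.05, "<EOS>": 0.90},
}

def toy_decoder_step(prev_word):
    return TOY_TABLE[prev_word]

def greedy_decode(max_len=10):
    output = []
    word = "<SOS>"                        # the start token gets things going
    for _ in range(max_len):
        probs = toy_decoder_step(word)
        word = max(probs, key=probs.get)  # greedy: always take the top word
        if word == "<EOS>":               # the model says it is finished
            break
        output.append(word)               # this word becomes the next input
    return output

print(greedy_decode())  # ['Comment', 'allez', 'vous']
```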
The music teacher: teacher forcing
The feedback loop described above is how the model works after it is trained. During training, the same loop causes a problem: if the decoder predicts a wrong word early on and that word is fed back in, every later prediction is built on the mistake, and the rest of the sentence turns into garbage. The model would learn very slowly.
To fix this, researchers use teacher forcing. Imagine a music teacher sitting with you at a harmonium. You are supposed to play the sequence Sa-Re-Ga-Ma. You play Sa, but then you mess up and play Pa instead of Re. If the teacher lets you continue from Pa, you will play the whole song wrong. Instead, the teacher corrects you: “No, the second note was Re. Now, assuming you played Re, what comes next?”
In teacher forcing, during training, we do not feed the decoder its own predicted word. We feed it the actual correct word from the training data, regardless of what it predicted. This keeps training stable and fast.
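In code, teacher forcing comes down to how the decoder's inputs are built. The sketch below uses a made-up set of model guesses to show that a wrong guess at one step does not change the input at the next step; the variable names are illustrative only.

```python
target = ["Comment", "allez", "vous", "<EOS>"]   # the correct translation

# Teacher forcing: the input at each step is the previous *correct* word,
# so the inputs are just the target shifted right by one, starting at <SOS>.
decoder_inputs = ["<SOS>"] + target[:-1]

# Suppose the model's guesses during training were these (made up):
predictions = ["Comment", "vous", "vous", "<EOS>"]   # the guess at step 2 is wrong

for step, (inp, want, got) in enumerate(zip(decoder_inputs, target, predictions), 1):
    mark = "ok" if got == want else "wrong"
    print(f"step {step}: input={inp!r} target={want!r} predicted={got!r} ({mark})")

# Look at step 3: its input is 'allez' (the correct word), not 'vous'
# (the model's wrong guess at step 2).
```

The wrong guess at step 2 still counts against the model when the error is measured, but it is never fed forward, which is the teacher saying “assuming you played Re, what comes next?”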
The weird hack: the reverse-input trick
Here is one of the most famous empirical hacks in deep-learning history. Sutskever noticed the network was struggling with long sentences. By the time the encoder reached the end of an English sentence, the context vector retained little of how the sentence began.
His solution? He reversed the English sentence before feeding it to the encoder.
Instead of feeding “A B C”, he fed “C B A”. The decoder still generated the target sequence normally: “X Y Z”.
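As a preprocessing step, the trick is tiny. The helper below, make_training_pair, is a hypothetical name used for illustration, and special tokens such as <EOS> are left out to keep the example small.

```python
def make_training_pair(source_tokens, target_tokens):
    """Reverse the source only; the target keeps its natural order."""
    return list(reversed(source_tokens)), list(target_tokens)

src, tgt = make_training_pair(["A", "B", "C"], ["X", "Y", "Z"])
print(src)  # ['C', 'B', 'A']
print(tgt)  # ['X', 'Y', 'Z']
```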
Why did this work so well?
Think about how Indian addresses are often written locally: Name, House Number, Street, City, State, PIN code. Some other postal conventions expect the reverse order, starting broad and getting specific. The information is the same either way; what changes is which piece sits closest to the point where it will be used.
Same idea here. If you feed the source sentence normally, word “A” is far away from word “X” in the network’s processing steps. By reversing the input to “C B A”, word “A” is processed right before the context vector is handed off. Therefore, word “A” is very fresh in the network’s memory just as the decoder needs to produce “X”, the first word of the translation. The later words pay for this (“C” is now farther from “Z”), so the average distance between matching words does not change, but giving the decoder a strong start turns out to matter more. This simple trick dramatically improved their BLEU scores (the standard metric for measuring translation quality): the paper reports a rise from 25.9 to 30.6.
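The distances can be counted directly. The sketch below assumes a toy word-for-word alignment (source word i matches target word i) and ignores the special tokens, which real sentence pairs do not satisfy exactly; the function name lags is made up for illustration.

```python
def lags(n, reverse_source):
    """Steps between reading source word i and producing target word i.

    Assumes a toy one-to-one alignment and ignores <EOS>/<SOS> tokens.
    """
    out = []
    for i in range(n):
        read_at = (n - 1 - i) if reverse_source else i   # when the encoder reads word i
        written_at = n + i                               # when the decoder emits word i
        out.append(written_at - read_at)
    return out

print(lags(3, reverse_source=False))  # [3, 3, 3]
print(lags(3, reverse_source=True))   # [1, 3, 5]
```

The average lag is 3 in both cases, but with the reversed input the first pair of words is only one step apart. A plausible reading is that these short links at the start give training an easier foothold.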