3. The idea — the courtroom translator

The solution Sutskever and his team proposed is brilliant in its simplicity. Instead of using one neural network to do the whole job, they used two. The two networks are known as the encoder and the decoder.

The Patna courtroom

Imagine a high-stakes trial in a district court in Patna. A witness is speaking rapidly in Hindi. The judge only understands English. Standing between them is a professional courtroom translator.

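The translator doesn’t try to shout English words over the witness while they’re still speaking Hindi. That would cause chaos, because the two languages order their words differently: Hindi usually puts the verb at the end of the sentence, while English needs it near the start, so the translator cannot commit to an English sentence until the witness has finished. Instead, the process works in two distinct phases.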

Phase 1: Encoding. The translator listens intently to the witness’s entire sentence in Hindi. As they hear each word, they update their mental understanding of what the witness is trying to say. When the witness finishes the sentence, the translator holds a complete, language-free “concept” or “summary” of that sentence in their mind.

Phase 2: Decoding. The translator turns to the judge. Using that mental summary, they start speaking in English. They produce the first English word, which helps them decide the second English word, and they keep speaking until the full thought is conveyed.

How this maps to the network

This is, in outline, how the seq2seq model works.

The encoder is an LSTM. It reads the input sentence one word at a time, with each word represented as a vector. With each word, it updates its internal hidden state. When it reaches the end of the input sentence, its final hidden state is saved. We call this final state the context vector (or thought vector). This single fixed-size vector, just a long list of numbers, has to hold the compressed meaning of the entire input sentence, however long that sentence is. If you need a refresher on what a vector is, see the vectors tutorial.
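
To make the encoder concrete, here is a minimal sketch in plain Python. It is an illustration only, not the original authors' implementation: the weights are random rather than trained, the sizes are tiny, and the helper names (make_lstm_params, lstm_step, encode) are made up for this example. It uses the standard LSTM gate equations and shows the one thing that matters here: however many words go in, a vector of the same size comes out.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm_params(input_size, hidden_size, rng):
    """Random (untrained) weights for the four LSTM gates.
    i = input gate, f = forget gate, o = output gate, g = candidate values."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: {"W": matrix(hidden_size, input_size + hidden_size),
                   "b": [0.0] * hidden_size}
            for gate in ("i", "f", "o", "g")}

def affine(layer, v):
    """Compute W @ v + b using plain lists."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(layer["W"], layer["b"])]

def lstm_step(params, x, h, c):
    """One LSTM step: read word vector x, update hidden state h and cell state c."""
    z = x + h  # list concatenation: [word vector, previous hidden state]
    i = [sigmoid(a) for a in affine(params["i"], z)]
    f = [sigmoid(a) for a in affine(params["f"], z)]
    o = [sigmoid(a) for a in affine(params["o"], z)]
    g = [math.tanh(a) for a in affine(params["g"], z)]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def encode(params, word_vectors, hidden_size):
    """Read the whole sentence and return the final state (the context vector)."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in word_vectors:
        h, c = lstm_step(params, x, h, c)
    return h, c

# Toy usage: a "sentence" of 5 random word vectors, each of size 3.
rng = random.Random(0)
params = make_lstm_params(input_size=3, hidden_size=4, rng=rng)
sentence = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
context_h, context_c = encode(params, sentence, hidden_size=4)
print(len(context_h))  # size of the context vector
```

The returned hidden state plays the role of the context vector. An LSTM also carries a cell state alongside it, which this sketch returns as well so that a decoder could be started from both.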

The decoder is a second LSTM. It starts from this context vector as its initial state. Its only job is to unroll that compressed meaning into the target language, one word at a time, feeding each word it produces back in as the next input, until it outputs a special <END> token.
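
The decoding loop itself can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the original system: it always picks the single highest-scoring word at each step (greedy decoding), and toy_step is a hypothetical hand-written stand-in for a trained decoder LSTM and its output layer. The names greedy_decode, toy_step, VOCAB and the <START> token are invented for this example.

```python
START = "<START>"
END = "<END>"
VOCAB = ["the", "witness", "spoke", END]

def greedy_decode(step, context, max_len=20):
    """Unroll a decoder from `context` until it emits END or reaches max_len.

    `step(prev_word, state)` must return (scores, new_state), where
    `scores` maps every vocabulary word to a number (higher is better).
    """
    state = context   # the decoder starts from the encoder's context vector
    prev = START      # a dummy first input, since no word has been produced yet
    output = []
    for _ in range(max_len):
        scores, state = step(prev, state)
        word = max(scores, key=scores.get)  # greedy: take the best-scoring word
        if word == END:
            break
        output.append(word)
        prev = word   # feed the chosen word back in as the next input
    return output

def toy_step(prev_word, state):
    """Hypothetical stand-in for one decoder step (NOT a trained LSTM).

    It reads the first number left in the state, scores each word by how
    close its vocabulary index is to that number, and drops the number.
    When the state is empty, END scores highest.
    """
    target = state[0] if state else float(len(VOCAB) - 1)
    scores = {w: -abs(i - target) for i, w in enumerate(VOCAB)}
    if prev_word in scores:
        scores[prev_word] -= 1.0  # mild penalty for repeating the last word
    return scores, state[1:]

# Toy usage: a made-up "context vector" of three numbers.
print(greedy_decode(toy_step, [0.1, 0.9, 2.2]))
```

The max_len cap is a safety net: a decoder that never produces <END> would otherwise loop forever. In a real model, the step function would be the decoder LSTM followed by a layer that scores every word in the vocabulary.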

There are no hand-written dictionaries. There are no hand-coded grammar rules. There is just one network listening, passing a thought vector to a second network, which then speaks.