6. The code — a toy seq2seq in PyTorch

Building a full translation system requires large vocabularies and massive datasets. To understand the pure mechanics, we’ll build a toy seq2seq model that learns a much simpler task: reversing a sequence of numbers. If we input [3, 1, 4], we want the decoder to output [4, 1, 3].

This code shows exactly how the encoder passes its hidden state (the context vector) to the decoder.

The code below runs for free on Google Colab.

import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    def __init__(self, input_size=1, hidden_size=16):
        super().__init__()
        # The encoder reads the input sequence
        self.encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        # The decoder generates the output sequence
        self.decoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        # Linear layer to map hidden state back to a number prediction
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, source_seq, target_seq_for_teacher_forcing):
        # 1. ENCODER PASS
        # We don't care about encoder outputs, only the final (hidden, cell)
        _, (hidden, cell) = self.encoder(source_seq)

        # 2. CONTEXT-VECTOR HANDOFF
        # The encoder's final (hidden, cell) IS the context vector

        # 3. DECODER PASS (using teacher forcing)
        # We feed the true target sequence to speed up learning
        decoder_outputs, _ = self.decoder(
            target_seq_for_teacher_forcing, (hidden, cell)
        )

        # Map decoder hidden states down to number predictions
        predictions = self.fc(decoder_outputs)
        return predictions


# --- Forward-pass demo ---
model = ToySeq2Seq()
# Source sequence: [3.0, 1.0, 4.0] (batch=1, seq_len=3, features=1)
src = torch.tensor([[[3.0], [1.0], [4.0]]])
# Decoder input: the true target [4.0, 1.0, 3.0] shifted right by one step,
# with a 0.0 start token in front: [0.0, 4.0, 1.0]
tgt = torch.tensor([[[0.0], [4.0], [1.0]]])

out = model(src, tgt)
# Untrained model → random-ish floats
print([round(p.item(), 2) for p in out.squeeze()])
# Example output: [-0.12, 0.04, 0.22]
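
The demo above writes the source and the shifted decoder input by hand. As a rough sketch of how such pairs could be generated for the reversal task, the helper below builds all three tensors at once. The name make_reversal_batch and the choice of digits 1 to 9 (so that 0.0 stays reserved for the start token) are assumptions made for illustration, not part of the original listing.

```python
import torch

def make_reversal_batch(batch_size=4, seq_len=3, start_token=0.0):
    """Hypothetical helper: build (src, decoder_input, target) for the reversal task."""
    # Assumption: random digits 1..9, so 0.0 is never a real value and can
    # safely act as the start token
    src = torch.randint(1, 10, (batch_size, seq_len, 1)).float()
    # The target is the source reversed along the time dimension
    target = torch.flip(src, dims=[1])
    # Decoder input = start token, then the target without its last step
    start = torch.full((batch_size, 1, 1), start_token)
    decoder_input = torch.cat([start, target[:, :-1, :]], dim=1)
    return src, decoder_input, target

src, dec_in, tgt = make_reversal_batch(batch_size=1)
print(src.flatten().tolist())     # random digits, e.g. [3.0, 1.0, 4.0]
print(dec_in.flatten().tolist())  # for that source: [0.0, 4.0, 1.0]
print(tgt.flatten().tolist())     # for that source: [4.0, 1.0, 3.0]
```

The decoder input and the target are the same sequence offset by one step: at each position the decoder sees the previous true value and is asked to predict the current one.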

Things to notice in Colab

self.decoder(...) takes (hidden, cell) from the encoder. That tuple is the context vector in concrete form: it is the only information about the source sequence that the decoder ever receives.
We passed target_seq_for_teacher_forcing as the decoder’s input. That’s teacher forcing in action — feeding the true sequence during the forward pass instead of the decoder’s own predictions.
To actually train this model to reverse numbers, you’d add an optimiser (torch.optim.Adam), a loss (nn.MSELoss, since the model predicts plain floats), and a loop that runs the forward pass, computes the loss against the true reversed sequence, calls loss.backward(), and then steps the optimiser. It is the same pattern as in Paper 03 (Backpropagation).
Try changing hidden_size=16 to hidden_size=4. The model will likely struggle more, especially on longer sequences, because a smaller context vector has less room to hold the input. That’s the bottleneck we’ll discuss in Section 8, and here you can watch it happen.
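
The training recipe described in the notes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tuned setup: the data (random digits 1 to 9, length 3), the batch size, the learning rate, the step count, and the helper names make_batch, train and decode_autoregressive are all choices made for this example. The model class is repeated so the block runs on its own. The last function shows what inference looks like once teacher forcing is gone: the decoder is fed its own previous prediction, one step at a time.

```python
import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    # Same model as in the listing above, repeated so this block is self-contained
    def __init__(self, input_size=1, hidden_size=16):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, source_seq, target_seq_for_teacher_forcing):
        _, (hidden, cell) = self.encoder(source_seq)
        decoder_outputs, _ = self.decoder(
            target_seq_for_teacher_forcing, (hidden, cell)
        )
        return self.fc(decoder_outputs)

def make_batch(batch_size=64, seq_len=3):
    # Assumed data: random digits 1..9; 0.0 is reserved as the start token
    src = torch.randint(1, 10, (batch_size, seq_len, 1)).float()
    target = torch.flip(src, dims=[1])
    start = torch.zeros(batch_size, 1, 1)
    decoder_input = torch.cat([start, target[:, :-1, :]], dim=1)
    return src, decoder_input, target

def train(model, steps=300, lr=1e-2):
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    losses = []
    for _ in range(steps):
        src, decoder_input, target = make_batch()
        optimiser.zero_grad()
        # Teacher forcing: the decoder sees the true (shifted) target
        loss = loss_fn(model(src, decoder_input), target)
        loss.backward()
        optimiser.step()
        losses.append(loss.item())
    return losses

@torch.no_grad()
def decode_autoregressive(model, src, steps):
    # No teacher forcing: each prediction becomes the next decoder input
    _, state = model.encoder(src)
    step_input = torch.zeros(src.size(0), 1, 1)  # start token
    outputs = []
    for _ in range(steps):
        out, state = model.decoder(step_input, state)
        step_input = model.fc(out)  # shape (batch, 1, 1)
        outputs.append(step_input)
    return torch.cat(outputs, dim=1)

torch.manual_seed(0)
model = ToySeq2Seq(hidden_size=16)
losses = train(model)
print(f"loss: {losses[0]:.2f} -> {losses[-1]:.2f}")

src = torch.tensor([[[3.0], [1.0], [4.0]]])
print(decode_autoregressive(model, src, steps=3).flatten().tolist())
```

With so few steps the predictions will usually still be rough; raising steps should bring them closer to [4.0, 1.0, 3.0]. Constructing the model with ToySeq2Seq(hidden_size=4) and comparing the two loss curves is one way to run the hidden-size experiment suggested above.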