Autoregressive generation
The decoder's mode of operation: generate one token at a time, feed the generated token back as input, generate the next.
The decoder's mode of operation: generate one token at a time, feed the generated token back as input, generate the next. The causal mask ensures that at training time each position can only see previous positions (no cheating by looking at future tokens).