Autoregressive generation

Appears in 1 paper

The decoder's mode of operation: generate one token at a time, feed the generated token back as input, generate the next.

As used in Paper 08 — Attention Is All You Need →

The decoder's mode of operation: generate one token at a time, feed the generated token back as input, generate the next. The causal mask ensures that at training time each position can only see previous positions (no cheating by looking at future tokens).