Transformer Decoder
The architecture used in GPT models: a stack of self-attention and feedforward layers that process tokens left-to-right (causally).
The architecture used in GPT models: a stack of self-attention and feedforward layers that process tokens left-to-right (causally). Each token can only attend to previous tokens.