Transformer Decoder

Appears in 1 paper

The architecture used in GPT models: a stack of self-attention and feedforward layers that process tokens left-to-right (causally).

As used in Paper 13 — Scaling Laws for Neural Language Models →

The architecture used in GPT models: a stack of self-attention and feedforward layers that process tokens left-to-right (causally). Each token can only attend to previous tokens.

Paper 13 — Scaling Laws for Neural Language Models →

Appears in papers