Section 04

The math: autoregressive objective, fine-tuning loss, input transformations

Improving Language Understanding by Generative Pre-Training 2018

🔴 Advanced undergrad. Read Conditional Probability and Cross-Entropy Loss first.


The pre-training objective

GPT-1 is trained on a corpus of tokens U = {u₁, u₂, …, uₙ}. The training objective is to maximise the log-likelihood of each token given its preceding context of length k (the context window):

L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ, uᵢ₋ₖ₊₁, ..., uᵢ₋₁ ; Θ)

Where:

  • uᵢ is the i-th token (a word or subword)
  • k is the context window size (GPT-1 used k = 512 tokens)
  • Θ is the set of model parameters
  • P(uᵢ | …) is the model’s predicted probability for token uᵢ given the preceding tokens

Maximising this log-likelihood is equivalent to minimising the negative log-likelihood (the loss):

Loss = − L₁(U) = − Σᵢ log P(uᵢ | uᵢ₋ₖ,...,uᵢ₋₁; Θ)

This is the standard cross-entropy loss between the model’s distribution and the one-hot target distribution for each token position.
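
The relationship between the log-likelihood and the loss can be sketched in a few lines of Python. The probabilities below are made-up values for illustration (they are not from GPT-1), and `sequence_nll` is a hypothetical helper name.

```python
import math

def sequence_nll(true_token_probs):
    """Negative log-likelihood (in nats) of a sequence.

    true_token_probs[i] is the probability the model assigned to the
    token that actually appeared at position i, given its context.
    """
    return -sum(math.log(p) for p in true_token_probs)

# Illustrative probabilities for a 4-token sequence (assumed values).
probs = [0.5, 0.25, 0.8, 0.1]

log_likelihood = sum(math.log(p) for p in probs)   # the quantity L1 maximises
loss = sequence_nll(probs)                         # the quantity training minimises

print(f"log-likelihood: {log_likelihood:.3f}")
print(f"loss (NLL):     {loss:.3f}")
```

Raising any of the per-token probabilities lowers the loss; a model that assigned probability 1 to every true token would have a loss of 0.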


How the model computes P(uᵢ | context)

Given context tokens [u₁, u₂, …, uᵢ₋₁], the forward pass is:

Step 1: Token + positional embedding

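h₀ = Uₑ · UW + Wₚ

Where:

  • Uₑ is the context token matrix: the k token IDs, written as a (k × vocab_size) one-hot matrix so that the product Uₑ · UW picks out one embedding row per token
  • UW is the token embedding matrix (vocab_size × n_embd), which converts token IDs to vectors
  • Wₚ is the positional embedding matrix (k × n_embd), which adds position information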
  • n_embd = 768 in GPT-1

So h₀ is a (k × 768) matrix — one 768-dimensional vector per token.
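
As a rough illustration of Step 1, the sketch below builds h₀ for a toy setting with a vocabulary of 4 tokens, n_embd = 3 and k = 2. All matrix values are invented for the example, and the variable names (`token_emb`, `pos_emb`) are placeholders rather than names from the GPT-1 code.

```python
# Toy sizes: vocab_size = 4, n_embd = 3, context length k = 2.
token_emb = [            # UW: one row per vocabulary entry
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
pos_emb = [              # Wp: one row per position in the context
    [0.01, 0.02, 0.03],
    [0.04, 0.05, 0.06],
]

def embed(token_ids):
    """Return h0: token embedding plus positional embedding, row by row."""
    h0 = []
    for position, token_id in enumerate(token_ids):
        tok = token_emb[token_id]      # row lookup, same as a one-hot row times UW
        pos = pos_emb[position]
        h0.append([t + p for t, p in zip(tok, pos)])
    return h0

h0 = embed([2, 0])       # context = token 2 followed by token 0
for row in h0:
    print([round(x, 2) for x in row])
```

Looking up a row gives the same result as multiplying a one-hot row vector by UW, and the output has one n_embd-dimensional row per context token.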

Step 2: Pass through 12 Transformer decoder layers

hₗ = transformer_block(hₗ₋₁)   for l = 1, 2, ..., 12

Each transformer block applies:

  1. Masked multi-head self-attention (each token attends only to positions ≤ its own)
  2. Feed-forward network (two linear layers with GELU activation)
  3. Layer normalisation and residual connections
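
The wiring of one block can be sketched as follows. This is a structural outline only: `attention` and `feed_forward` are passed in as stand-in functions, the layer norm has no learned scale or shift, and the ordering (add the residual, then normalise) is the original Transformer's post-norm arrangement, which is assumed here rather than taken from the list above. The GELU uses the common tanh approximation.

```python
import math

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(vec, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add(a, b):
    """Element-wise sum of two vectors (the residual connection)."""
    return [x + y for x, y in zip(a, b)]

def transformer_block(h, attention, feed_forward):
    """Apply one block to h, a list of per-position vectors.

    attention:    maps the whole sequence to a new sequence (mixes positions)
    feed_forward: maps a single vector to a single vector (applied per position)
    """
    attended = attention(h)
    h = [layer_norm(add(x, a)) for x, a in zip(h, attended)]   # residual + norm
    h = [layer_norm(add(x, feed_forward(x))) for x in h]       # residual + norm
    return h

# Stand-ins so the sketch runs: identity "attention", element-wise GELU "FFN".
def identity_attention(seq):
    return seq

def toy_ffn(vec):
    return [gelu(v) for v in vec]

h = [[0.6, -0.2, 0.8], [0.1, 0.5, -0.1]]
for _ in range(12):                      # 12 stacked blocks, as in GPT-1
    h = transformer_block(h, identity_attention, toy_ffn)
print(len(h), len(h[0]))
```

The shape of h is unchanged by every block, which is what allows the same block to be stacked 12 times.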

The masking is critical: it enforces the causal constraint. When computing the representation for the token at position t, the attention scores for positions t+1 and beyond are set to −∞, which gives them a weight of exactly 0 after the softmax. So the model can never look at future tokens.
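
A minimal sketch of the masking step, assuming a 4-token sequence and made-up raw attention scores. `masked_softmax_rows` is a hypothetical helper: it sets every score for a future position to −∞ and then applies a row-wise softmax.

```python
import math

NEG_INF = float("-inf")

def masked_softmax_rows(scores):
    """Causal softmax: row t may only put weight on columns 0..t."""
    weights = []
    for t, row in enumerate(scores):
        masked = [s if j <= t else NEG_INF for j, s in enumerate(row)]
        top = max(masked)                             # subtract max for stability
        exps = [math.exp(s - top) for s in masked]    # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up raw scores for a 4-token sequence (query rows x key columns).
scores = [
    [0.2, 1.5, -0.3, 0.9],
    [1.1, 0.4, 2.0, -1.0],
    [0.0, 0.7, 0.3, 1.8],
    [0.5, -0.2, 1.2, 0.1],
]

attn = masked_softmax_rows(scores)
for row in attn:
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on the first position, every entry above the diagonal is exactly 0, and each row still sums to 1.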

Step 3: Compute output probability distribution

P(uᵢ | u₁, …, uᵢ₋₁) = Softmax( h₁₂[i−1] · UWᵀ )

Where h₁₂[i−1] is the final-layer representation at the last context position, i−1 (a 768-dim vector), and UWᵀ is the transposed token embedding matrix (768 × vocab_size). Because of the causal mask, this vector depends only on u₁, …, uᵢ₋₁, so it can be used to predict uᵢ. The result is a probability distribution over the entire vocabulary (~40,000 tokens in GPT-1’s BPE vocabulary).

Note: GPT-1 ties the token embedding weights between the input embedding (UW) and the output projection (UW ᵀ). This reduces the number of parameters and often improves performance.
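
To get a feel for the saving, the arithmetic below uses the rounded figures quoted in this section (a vocabulary of about 40,000 tokens and n_embd = 768). The exact GPT-1 vocabulary size differs slightly, so the result is an order-of-magnitude estimate.

```python
vocab_size = 40_000   # approximate figure from the text, not the exact size
n_embd = 768

embedding_params = vocab_size * n_embd        # input embedding matrix UW
untied_total = 2 * embedding_params           # separate input and output matrices
tied_total = embedding_params                 # one matrix used for both roles

print(f"one embedding matrix: {embedding_params:,} parameters")
print(f"saved by tying:       {untied_total - tied_total:,} parameters")
```

At this scale, tying avoids roughly 30 million additional parameters.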


Worked numerical example (tiny scale)

Setup: vocabulary of 5 tokens, n_embd = 3, 1 transformer layer, context = 2 tokens.

Context: [“chai”, “bahut”] — we want to predict the next token.

After the transformer, suppose the representation of the last context token (“bahut”) is:

h = [0.6, −0.2, 0.8]     (3-dimensional)

Token embedding matrix UW (5 tokens × 3 dimensions):

       dim₁  dim₂  dim₃
chai:  [ 0.4,  0.1,  0.2]
bahut: [ 0.1,  0.5, −0.1]
garam: [ 0.7, −0.1,  0.9]
hai:   [ 0.2,  0.3,  0.6]
acha:  [−0.3,  0.4,  0.1]

Step 1: Compute logits = h · UWᵀ

For each word w, logit = h · embedding(w):

logit(chai)  = 0.6×0.4 + (−0.2)×0.1 + 0.8×0.2 = 0.240 − 0.020 + 0.160 = 0.380
logit(bahut) = 0.6×0.1 + (−0.2)×0.5 + 0.8×(−0.1) = 0.060 − 0.100 − 0.080 = −0.120
logit(garam) = 0.6×0.7 + (−0.2)×(−0.1) + 0.8×0.9 = 0.420 + 0.020 + 0.720 = 1.160
logit(hai)   = 0.6×0.2 + (−0.2)×0.3 + 0.8×0.6 = 0.120 − 0.060 + 0.480 = 0.540
logit(acha)  = 0.6×(−0.3) + (−0.2)×0.4 + 0.8×0.1 = −0.180 − 0.080 + 0.080 = −0.180

Logits: [0.380, −0.120, 1.160, 0.540, −0.180]
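
The five dot products can be checked with a short script. The values are copied from the example; the dictionary layout is just one convenient way to hold them.

```python
h = [0.6, -0.2, 0.8]

embedding = {
    "chai":  [0.4, 0.1, 0.2],
    "bahut": [0.1, 0.5, -0.1],
    "garam": [0.7, -0.1, 0.9],
    "hai":   [0.2, 0.3, 0.6],
    "acha":  [-0.3, 0.4, 0.1],
}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# One logit per vocabulary entry: h · embedding(w)
logits = {word: dot(h, vec) for word, vec in embedding.items()}

for word, value in logits.items():
    print(f"{word:6s} {value:+.3f}")
```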

Step 2: Apply softmax to get probabilities

exp(0.380)  = 1.462
exp(−0.120) = 0.887
exp(1.160)  = 3.190
exp(0.540)  = 1.716
exp(−0.180) = 0.835

Sum = 1.462 + 0.887 + 3.190 + 1.716 + 0.835 = 8.090

P(chai)  = 1.462/8.090 = 0.181
P(bahut) = 0.887/8.090 = 0.110
P(garam) = 3.190/8.090 = 0.394   ← highest
P(hai)   = 1.716/8.090 = 0.212
P(acha)  = 0.835/8.090 = 0.103

After “chai bahut”, the model predicts “garam” (hot) as most likely (39.4%), followed by “hai” (21.2%). This is sensible — “chai bahut garam” is a common phrase.
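
The softmax step in code, starting from the logits computed above. Subtracting the largest logit before exponentiating is a standard numerical-stability trick that is not part of the hand calculation; it leaves the resulting probabilities unchanged.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

words = ["chai", "bahut", "garam", "hai", "acha"]
logits = [0.380, -0.120, 1.160, 0.540, -0.180]

probs = softmax(logits)
for word, p in zip(words, probs):
    print(f"{word:6s} {p:.3f}")

best = words[probs.index(max(probs))]
print("most likely next token:", best)
```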

Step 3: Compute loss for this position

If the true next token is “garam”:

Loss = −log P(garam) = −log(0.394) = 0.931 nats

If the model were perfect, P(garam) = 1.0, Loss = 0. Higher loss means the model is surprised by the true token — it needs more training.
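
For comparison, the sketch below evaluates the same loss for three cases: the model from the example, a perfect model, and a model that guesses uniformly over the 5-token vocabulary. The uniform baseline is an addition for context and is not part of the original example.

```python
import math

def token_loss(prob_of_true_token):
    """Cross-entropy loss in nats for a single position: log(1 / p)."""
    return math.log(1.0 / prob_of_true_token)

example = token_loss(0.394)      # P(garam) from the worked example, about 0.93 nats
perfect = token_loss(1.0)        # no surprise at all
uniform = token_loss(1 / 5)      # equal probability for each of the 5 tokens

print(f"example model: {example:.3f} nats")
print(f"perfect model: {perfect:.3f} nats")
print(f"uniform guess: {uniform:.3f} nats")
```

The example model already does better than uniform guessing (about 1.61 nats) but is still far from a perfect prediction.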


The fine-tuning objective

After pre-training, a linear classification layer is added on top of the final transformer representation at the [EXTRACT] token position.

For a labelled dataset C = {(x¹, …, xᵐ, y)} where x¹,…,xᵐ are input tokens and y is the class label:

Step 1: Forward pass through pre-trained transformer

hₗᵐ = transformer(x¹,...,xᵐ)

hₗᵐ is the final-layer representation at the last input position (the [EXTRACT] token).

Step 2: Predict class probabilities

P(y | x¹,...,xᵐ) = Softmax( hₗᵐ · Wᵧ )

Where Wᵧ is a newly added (n_embd × num_classes) weight matrix.

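Step 3: Fine-tuning loss (cross-entropy)

L₂(C) = Σ₍ₓ,ᵧ₎ log P(y | x¹,...,xᵐ)

As written, L₂ is a log-likelihood to be maximised, like L₁. The cross-entropy loss that training minimises is its negative, −L₂(C).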

Step 4: Combined loss

L₃(C) = L₂(C) + λ · L₁(C)

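Where λ = 0.5. This means: for each fine-tuning batch, compute both the task objective (L₂) and the language-modelling objective (L₁) on the same input, and add them with the language-modelling term weighted by λ. With λ = 0.5, language modelling counts at half the weight of the task term, so it acts as an auxiliary signal rather than the main objective.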
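
A sketch of how the two terms combine for one batch, written in terms of losses (negative log-likelihoods) so that lower is better. The per-batch numbers are placeholders chosen for illustration, and `combined_loss` is a hypothetical helper, not a function from any GPT-1 code.

```python
LAMBDA = 0.5   # weight on the auxiliary language-modelling term

def combined_loss(task_loss, lm_loss, lam=LAMBDA):
    """L3 = L2 + lambda * L1, with both terms expressed as losses."""
    return task_loss + lam * lm_loss

# Placeholder values for one fine-tuning batch (assumed, in nats).
task_loss = 0.517     # classification cross-entropy on the labelled example
lm_loss = 3.200       # next-token loss on the same input text

total = combined_loss(task_loss, lm_loss)
print(f"task term:        {task_loss:.3f}")
print(f"weighted LM term: {LAMBDA * lm_loss:.3f}")
print(f"combined loss:    {total:.3f}")
```

Setting `lam` to 0 recovers plain supervised fine-tuning. Gradients from both terms flow into the shared transformer weights, while only the task term depends on Wᵧ.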


Numerical example: fine-tuning for sentiment (2 classes)

Setup: n_embd = 3, 2 classes (positive/negative). Wᵧ has shape (3 × 2).

Final transformer output at [EXTRACT] position:

hₗᵐ = [0.5, 0.8, −0.3]

Classification weights:

Wᵧ = [[ 0.6, −0.4],    ← weight from dim₁ to each class
       [ 0.3,  0.7],    ← weight from dim₂
       [−0.5,  0.2]]    ← weight from dim₃

Logits:

logit(positive) = 0.5×0.6 + 0.8×0.3 + (−0.3)×(−0.5)
                = 0.300 + 0.240 + 0.150 = 0.690

logit(negative) = 0.5×(−0.4) + 0.8×0.7 + (−0.3)×0.2
                = −0.200 + 0.560 − 0.060 = 0.300

Softmax:

exp(0.690) = 1.994
exp(0.300) = 1.350

P(positive) = 1.994 / (1.994 + 1.350) = 1.994 / 3.344 = 0.596
P(negative) = 1.350 / 3.344 = 0.404

If the true label is “positive”:

Loss = −L₂ = −log P(positive) = −log(0.596) = 0.517 nats
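
The forward pass and loss for this example can be reproduced with the script below. The numbers are those given above; the function names are illustrative.

```python
import math

h = [0.5, 0.8, -0.3]                 # representation at the [EXTRACT] position
W_y = [                              # shape (n_embd=3, num_classes=2)
    [0.6, -0.4],
    [0.3, 0.7],
    [-0.5, 0.2],
]
classes = ["positive", "negative"]

def class_logits(h, W):
    """Row vector h times matrix W: one logit per class."""
    return [sum(h[d] * W[d][c] for d in range(len(h))) for c in range(len(W[0]))]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = class_logits(h, W_y)
probs = softmax(logits)
loss = -math.log(probs[classes.index("positive")])   # true label: positive

print("logits:", [round(x, 3) for x in logits])
print("probs: ", [round(p, 3) for p in probs])
print("loss:  ", round(loss, 3))
```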

After many fine-tuning steps, the model adjusts both Wᵧ and the pre-trained transformer weights to push P(positive) higher for positive reviews.
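
To make this concrete, the sketch below takes a single gradient-descent step on Wᵧ only, keeping the transformer output h fixed. That restriction, the learning rate of 0.5, and the use of plain gradient descent are simplifying assumptions for the illustration; real fine-tuning also updates the transformer weights. The update relies on the standard result that, for softmax with cross-entropy, the gradient of the loss with respect to Wᵧ[d][c] is h[d] · (P(c) − 1 if c is the true class, else P(c)).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(h, W):
    """Class probabilities for representation h and weight matrix W."""
    logits = [sum(h[d] * W[d][c] for d in range(len(h))) for c in range(len(W[0]))]
    return softmax(logits)

h = [0.5, 0.8, -0.3]
W_y = [[0.6, -0.4], [0.3, 0.7], [-0.5, 0.2]]
true_class = 0                       # index 0 = positive
learning_rate = 0.5                  # assumed value for the illustration

before = predict(h, W_y)

# error[c] = P(c) - 1 for the true class, P(c) otherwise.
error = [p - (1.0 if c == true_class else 0.0) for c, p in enumerate(before)]

# Gradient step: W[d][c] <- W[d][c] - lr * h[d] * error[c]
W_new = [
    [W_y[d][c] - learning_rate * h[d] * error[c] for c in range(2)]
    for d in range(3)
]

after = predict(h, W_new)
print(f"P(positive) before: {before[0]:.3f}")
print(f"P(positive) after:  {after[0]:.3f}")
```

A single step already raises P(positive) for this example; repeating the step over many labelled examples is what moves the classifier as a whole.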


Why the causal mask matters

In the pre-training pass, for a sequence of tokens [t₁, t₂, t₃, t₄], the attention at position t₃ computes:

Attention score (t₃ → t₁): allowed
Attention score (t₃ → t₂): allowed
Attention score (t₃ → t₃): allowed (self)
Attention score (t₃ → t₄): MASKED (set to −∞, becomes 0 after softmax)

This masking removes the upper triangle of the attention-score matrix (every entry where the key position is later than the query position). It lets the model process all positions in parallel (efficient training) while each position sees only itself and its past (the correct setup for next-token prediction). At inference, tokens are generated one at a time, left to right, each conditioned on all previously generated tokens.
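
The inference loop described above can be sketched as greedy decoding. Here `next_token_probs` is a stand-in for a full forward pass through the model: it reads from a hypothetical lookup table with invented probabilities, used only so that the loop runs. Picking the single most likely token at each step (greedy decoding) is one possible strategy and is assumed here for simplicity.

```python
# Hypothetical stand-in for the model: maps the last token to a
# next-token distribution. A real model would condition on the whole context.
TOY_MODEL = {
    "chai":  {"bahut": 0.7, "garam": 0.2, "hai": 0.1},
    "bahut": {"garam": 0.6, "acha": 0.3, "hai": 0.1},
    "garam": {"hai": 0.9, "chai": 0.1},
    "acha":  {"hai": 0.8, "chai": 0.2},
    "hai":   {"<end>": 1.0},
}

def next_token_probs(context):
    """Return P(next token | context) from the toy table."""
    return TOY_MODEL[context[-1]]

def generate(prompt, max_new_tokens=10):
    """Greedy left-to-right generation, one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)      # most likely next token
        if best == "<end>":
            break
        tokens.append(best)                   # becomes context for the next step
    return tokens

print(generate(["chai"]))   # ['chai', 'bahut', 'garam', 'hai']
```

Each generated token is appended to the context before the next prediction, which is the one-at-a-time, left-to-right process described above.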


Summary of key equations

Pre-training:
  L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ,...,uᵢ₋₁ ; Θ)
  P(uᵢ | ·) = Softmax( h₁₂[i−1] · UWᵀ )

Fine-tuning:
  P(y | x) = Softmax( hₗᵐ · Wᵧ )
  L₂(C) = Σ log P(y | x)
  L₃(C) = L₂(C) + λ · L₁(C)     λ = 0.5

Architecture:
  h₀ = token_embedding + positional_embedding
  hₗ = transformer_decoder_block(hₗ₋₁)    for l = 1..12
  Output: h₁₂ ∈ ℝ^(context_length × 768)