5. Worked Example — GPT-1 on textual entailment

🔴 Advanced undergrad. This section walks through the full pipeline for one of the trickiest tasks GPT-1 tackled: textual entailment.

What is textual entailment?

Given two sentences, determine whether the first (premise) logically implies the second (hypothesis).

Premise:    "The dog ate the food in the bowl."
Hypothesis: "The bowl had food in it."
Label:      ENTAILS  (if the premise is true, the hypothesis must be true)

Premise:    "Riya passed the exam."
Hypothesis: "Riya failed the exam."
Label:      CONTRADICTS

Premise:    "The meeting is on Thursday."
Hypothesis: "The meeting is outdoors."
Label:      NEUTRAL  (the premise says nothing about location)

This task draws on logical structure, word meaning, and world knowledge at the same time, which makes it hard. Before GPT-1, the strongest systems were architectures designed specifically for entailment.

Step 1: Input transformation

GPT-1 does not see a “premise” and a “hypothesis” separately. It sees a flat sequence of tokens:

[START] "The dog ate the food in the bowl." [DELIM] "The bowl had food in it." [EXTRACT]

In token form (with BPE tokenisation, simplified):

Position: 1        2    3    4    5    6     7   8    9     10       11   12    13   14    15  16  17
Token:    [START]  The  dog  ate  the  food  in  the  bowl  [DELIM]  The  bowl  had  food  in  it  [EXTRACT]

(Actual BPE would split further; simplified here for clarity.)
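The input transformation can be sketched in a few lines of Python. This is a minimal illustration, not GPT-1's actual preprocessing: the function name build_entailment_input is hypothetical, and splitting on whitespace stands in for real BPE tokenisation.

```python
# Hypothetical helper; whitespace splitting stands in for real BPE.
START, DELIM, EXTRACT = "[START]", "[DELIM]", "[EXTRACT]"

def build_entailment_input(premise, hypothesis):
    """Return the flat token list: [START] premise [DELIM] hypothesis [EXTRACT]."""
    def simple_tokenise(text):
        # Drop the final full stop and split on spaces (a simplification).
        return text.rstrip(".").split()
    return (
        [START] + simple_tokenise(premise)
        + [DELIM] + simple_tokenise(hypothesis)
        + [EXTRACT]
    )

tokens = build_entailment_input(
    "The dog ate the food in the bowl.",
    "The bowl had food in it.",
)
# Print 1-indexed positions next to each token.
for position, token in enumerate(tokens, start=1):
    print(position, token)
```

Under these simplifications the sequence has 17 tokens and ends with [EXTRACT] at position 17.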

The [START], [DELIM], and [EXTRACT] tokens are special tokens added to the vocabulary. The model has never seen them in pre-training — they are new vocabulary entries introduced during fine-tuning. Their embeddings are initialised randomly and learned from the labelled data.
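One way to picture "new vocabulary entries with random embeddings" is as extra rows appended to the embedding table. The sketch below is illustrative only: the function name extend_embeddings, the toy table, and the initialisation scale std=0.02 are assumptions made for this example, not details taken from the text above.

```python
import random

def extend_embeddings(embedding_table, num_new_tokens, std=0.02, seed=0):
    """Append randomly initialised rows for new special tokens.

    embedding_table: list of rows (lists of floats), one per vocabulary entry.
    std=0.02 is an assumed initialisation scale, used here for illustration.
    """
    rng = random.Random(seed)
    dim = len(embedding_table[0])
    new_rows = [
        [rng.gauss(0.0, std) for _ in range(dim)]
        for _ in range(num_new_tokens)
    ]
    # Pre-trained rows are kept unchanged; only the new rows start random.
    return embedding_table + new_rows

# Toy stand-in for a pre-trained table: 5 vocabulary entries, 4 dimensions.
pretrained = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
extended = extend_embeddings(pretrained, num_new_tokens=3)
print(len(pretrained), "rows before,", len(extended), "rows after")
```

The three appended rows play the role of [START], [DELIM], and [EXTRACT]; during fine-tuning they would be updated by gradient descent along with the rest of the model.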

Step 2: Forward pass through the 12-layer transformer

The token sequence is embedded and passed through all 12 decoder layers. The causal mask is applied: position i can attend to positions 1 through i.

At position 1 ([START]): attends only to itself
At position 8 (“the”): attends to positions 1–8
At position 10 ([DELIM]): attends to positions 1–10, the full premise
At position 17 ([EXTRACT]): attends to positions 1–17, the complete input

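Key insight: at the [EXTRACT] position, the model’s representation has “seen” the entire sequence, both the premise and the hypothesis. The 12 layers of self-attention have had the chance to build a rich representation that captures the relationship between them. Because of the causal mask, information flows one way: words in the hypothesis can attend to words in the premise (directly and through intermediate representations), but premise positions cannot attend to the hypothesis. Only positions in the hypothesis, and [EXTRACT] most of all, see both sentences.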

After 12 layers, h₁₂[17] — the vector at the [EXTRACT] position — is a 768-dimensional summary of the entire sequence.
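The causal mask used in this step can be sketched as a lower-triangular table of booleans. The function names causal_mask and visible_positions are hypothetical, and a real implementation would apply the mask to attention scores inside each layer rather than build Python lists.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i+1 may attend to position j+1."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def visible_positions(mask, position):
    """Return the 1-indexed positions visible from a 1-indexed position."""
    row = mask[position - 1]
    return [j + 1 for j, allowed in enumerate(row) if allowed]

mask = causal_mask(17)
print(visible_positions(mask, 1))   # the first position sees only itself
print(visible_positions(mask, 8))   # position 8 sees positions 1 through 8
print(len(visible_positions(mask, 17)))  # the last position sees everything
```

The last row of the mask is the only one that is entirely True, which is why the final position is the natural place to read off a summary of the whole sequence.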

Step 3: Classification head

A linear layer maps the 768-dimensional representation to 3 class scores:

h₁₂[17] = [0.8, −0.3, 0.5, ..., 0.2]   (768 dimensions)

Wᵧ has shape (768 × 3):
  column 0: weights for ENTAILS
  column 1: weights for NEUTRAL
  column 2: weights for CONTRADICTS

logits = h₁₂[17] · Wᵧ   → shape (3,)
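The matrix product above can be sketched with a toy example. To keep it readable, the sketch uses 4 dimensions instead of 768; the function name linear_head and every number in it are made up for illustration and are unrelated to the worked figures in this section.

```python
def linear_head(hidden, weights):
    """Compute logits = hidden · W, where W has shape (len(hidden), num_classes)."""
    num_classes = len(weights[0])
    return [
        sum(hidden[d] * weights[d][c] for d in range(len(hidden)))
        for c in range(num_classes)
    ]

# Toy values: 4 dimensions instead of 768.
hidden = [0.8, -0.3, 0.5, 0.2]
# Each row holds the weights for one input dimension.
# Columns, left to right: ENTAILS, NEUTRAL, CONTRADICTS.
W_y = [
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
    [2.0, 0.5, 0.0],
    [0.5, 0.0, 1.0],
]
logits = linear_head(hidden, W_y)
print(logits)  # one score per class
```

Each logit is the dot product of the hidden vector with one column of the weight matrix, so the head adds only 768 × 3 weights on top of the pre-trained network.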

Suppose logits = [2.1, 0.4, −0.8]. Softmax turns them into probabilities by exponentiating each logit and dividing by the sum:

exp(2.1)  = 8.166
exp(0.4)  = 1.492
exp(−0.8) = 0.449

Sum = 10.107

P(ENTAILS)     = 8.166 / 10.107 = 0.808
P(NEUTRAL)     = 1.492 / 10.107 = 0.148
P(CONTRADICTS) = 0.449 / 10.107 = 0.044

Predicted class: ENTAILS (probability 80.8%). Correct.
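The arithmetic above can be checked with a short softmax function. This is a minimal sketch using only the standard library; subtracting the largest logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["ENTAILS", "NEUTRAL", "CONTRADICTS"]
probs = softmax([2.1, 0.4, -0.8])
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")

# The predicted class is the one with the highest probability.
predicted = labels[probs.index(max(probs))]
print("Predicted:", predicted)
```

Run as written, this reproduces the three probabilities in the worked example (0.808, 0.148, 0.044) and selects ENTAILS.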

Step 4: Training loss

If the true label is ENTAILS:

L₂ = −log P(ENTAILS) = −log(0.808) = 0.213 nats

Simultaneously, the model computes the language modelling loss on the same sequence — the probability of each token given its predecessors:

L₁ = − (1/17) × Σᵢ log P(tokenᵢ | tokens < i)

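The GPT-1 paper reports that keeping this auxiliary objective during fine-tuning improves generalisation and speeds up convergence, so the model’s language modelling ability stays sharp even as it learns to classify.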

Combined loss:

L₃ = L₂ + 0.5 × L₁ = 0.213 + 0.5 × L₁
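The three losses can be put together in a small sketch. The function names are hypothetical, and the per-token log-probabilities are invented placeholders (the text does not give them); only the 0.808 probability and the 0.5 weight come from the worked example.

```python
import math

def classification_loss(prob_true_class):
    """L2: negative log-probability of the correct label, in nats."""
    return -math.log(prob_true_class)

def language_model_loss(token_log_probs):
    """L1: average negative log-probability over the tokens."""
    return -sum(token_log_probs) / len(token_log_probs)

def combined_loss(l2, l1, lm_weight=0.5):
    """L3 = L2 + lm_weight * L1, with lm_weight = 0.5 as in the text."""
    return l2 + lm_weight * l1

l2 = classification_loss(0.808)
# Placeholder values: one made-up log-probability per token of the
# 17-token sequence, matching the 1/17 average used in the text.
token_log_probs = [-2.0] * 17
l1 = language_model_loss(token_log_probs)
l3 = combined_loss(l2, l1)
print(f"L2 = {l2:.3f}, L1 = {l1:.3f}, L3 = {l3:.3f}")
```

With these placeholder values L1 is 2.0, so the language modelling term contributes 1.0 to the total and the classification term contributes about 0.213.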

Step 5: Why this works — what happens inside the transformer

To correctly label the example as ENTAILS, the model must somehow establish that “The bowl had food in it” follows from “The dog ate the food in the bowl.”

What the attention layers plausibly compute (an informal picture, not a measured result):

In early layers, word-level relationships form: “food” in the hypothesis attends strongly to “food” in the premise; “bowl” attends to “bowl.”
In middle layers, the model builds phrase-level understanding: “had food in it” is related to “food in the bowl.”
In deeper layers, the logical relationship crystallises: if the dog ate food from the bowl, the bowl must have had food. The [EXTRACT] representation absorbs this.
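The first of these stages, matching words attending to each other, can be illustrated with a toy attention computation. This is a heavily simplified sketch, not GPT-1's mechanism: the vectors are made up, identical words share a vector, and the raw vectors serve as both queries and keys, whereas a real transformer uses learned projections and many heads. The function name causal_attention_weights is hypothetical.

```python
import math

def causal_attention_weights(vectors, query_index):
    """Softmax of scaled dot products between one query and positions 0..query_index."""
    dim = len(vectors[0])
    query = vectors[query_index]
    scores = [
        sum(q * k for q, k in zip(query, vectors[j])) / math.sqrt(dim)
        for j in range(query_index + 1)  # causal: no look-ahead
    ]
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 3-dimensional vectors; identical words share a vector.
toy_vectors = {
    "dog": [4.0, 0.0, 0.0],
    "food": [0.0, 4.0, 0.0],
    "bowl": [0.0, 0.0, 4.0],
}
# Three premise words followed by two hypothesis words.
tokens = ["dog", "food", "bowl", "bowl", "food"]
vectors = [toy_vectors[t] for t in tokens]

# Attention weights for the hypothesis "food" (the last position).
weights = causal_attention_weights(vectors, query_index=4)
for token, w in zip(tokens, weights):
    print(f"{token}: {w:.3f}")
```

In this toy setup almost all of the weight from the hypothesis "food" lands on the two "food" positions (the premise one and itself), with very little on "dog" or "bowl". Learned attention in a trained model is far less tidy, but the mechanism for linking a hypothesis word to its premise counterpart is the same.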

The transformer does not do formal logic. But across hundreds of thousands of labelled training pairs, it learns which patterns of token co-occurrence across [DELIM] signal entailment vs. contradiction vs. neutral. The key is that this learning is guided by pre-trained representations: the model already knows what “ate,” “bowl,” “food,” and “had” mean before fine-tuning begins.

Comparison: pre-trained vs. random initialisation

GPT-1’s paper reported an ablation: what if you use the same transformer architecture but randomly initialise all weights (no pre-training) and train only on the entailment data?

Condition                  SNLI accuracy
GPT-1 with pre-training    89.9%
Same model, random init    ~71–75% (an approximate range typical for models of this size without pre-training, not a figure quoted from the paper)

The pre-training provides a huge head start. The model that already understands language needs far fewer entailment examples to learn the logical patterns. The model starting from scratch must simultaneously learn language AND the task.

Intuition: what [DELIM] does

The [DELIM] token serves as a clean separation signal. During fine-tuning, the model learns that:

Tokens to the left of [DELIM] are the “premise”
Tokens to the right of [DELIM] are the “hypothesis”
The relationship between them (as represented at [EXTRACT]) should map to an entailment label

This is all implicit — the model is never explicitly told any of this. It learns it from thousands of labelled examples. But because the model already understands language deeply, it needs far fewer examples to learn this task-specific structure than a model starting from scratch.

Scale of the full experiment

GPT-1 was evaluated on NLI (natural language inference / entailment) using:

SNLI: 570,000 sentence pairs. GPT-1 got 89.9% accuracy.
MultiNLI: 433,000 sentence pairs. GPT-1 set the best single-model result at the time.
SciTail: smaller scientific entailment dataset. GPT-1: 88.3%.

The model that predicted the next word in 800 million words of novels outperformed models specifically designed for entailment. This was the paper’s main empirical result — and it was the proof that the pre-train + fine-tune paradigm worked.