6. The Code — causal language model and classification fine-tuning
🟡 Intermediate. Runs free on Google Colab. No GPU required for this demo.
Part A: Causal language modelling (the pre-training objective)
import numpy as np
# --- Tiny vocabulary and corpus ---
vocab = ["chai", "bahut", "garam", "hai", "acha", "[START]", "[END]"]
word_to_id = {w: i for i, w in enumerate(vocab)} # map each word to an integer
id_to_word = {i: w for w, i in word_to_id.items()}
# Three short "sentences" in our tiny corpus (token ID sequences)
corpus = [
    [5, 0, 1, 2, 3, 6],  # [START] chai bahut garam hai [END]
    [5, 0, 4, 6],        # [START] chai acha [END]
    [5, 0, 1, 4, 3, 6],  # [START] chai bahut acha hai [END]
]
# --- Bigram language model: predict the next token from the previous token only ---
def build_ngram_lm(corpus, vocab_size):
    """Build a smoothed bigram LM: P(word_t | word_t-1)."""
    # Count how often word j follows word i
    counts = np.ones((vocab_size, vocab_size))  # Laplace smoothing: start with 1
    for sentence in corpus:
        for i in range(len(sentence) - 1):
            prev, curr = sentence[i], sentence[i + 1]  # bigram (prev → curr)
            counts[prev, curr] += 1  # increment count
    # Convert counts to probabilities (each row must sum to 1)
    probs = counts / counts.sum(axis=1, keepdims=True)  # divide by row totals
    return probs
lm = build_ngram_lm(corpus, len(vocab)) # shape: (vocab_size, vocab_size)
# --- Generate text from the model ---
def generate(lm, start_token, max_len=6):
    """Sample from the LM one token at a time (autoregressive generation)."""
    tokens = [start_token]
    for _ in range(max_len - 1):
        last = tokens[-1]  # most recent token
        probs = lm[last]  # P(next | last)
        next_token = np.random.choice(len(vocab), p=probs)  # sample
        tokens.append(next_token)
        if next_token == word_to_id["[END]"]:  # stop at sentence end
            break
    return [id_to_word[t] for t in tokens]
np.random.seed(42)
print("Generated:", generate(lm, word_to_id["[START]"]))
# One possible output (sampling is random, and smoothing lets unseen bigrams appear):
# ['[START]', 'chai', 'bahut', 'garam', 'hai', '[END]']
This bigram LM is a toy version of GPT-1’s objective: it conditions on only the single previous token. GPT-1 conditions on up to 512 previous tokens, using the Transformer’s attention to capture long-range dependencies. The core idea is the same: predict a distribution over the next token from the preceding context, then sample from it.
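The objective itself, the average negative log-likelihood of each token given its context, can be computed directly for the toy model. The snippet below is a minimal sketch under that reading: it rebuilds the same smoothed bigram table on the Part A corpus, and the helper names `bigram_probs` and `avg_neg_log_likelihood` are illustrative choices, not taken from any GPT-1 code.

```python
import numpy as np

# Toy setup mirroring Part A: 7 token IDs, 5 = [START], 6 = [END]
VOCAB_SIZE = 7
corpus = [
    [5, 0, 1, 2, 3, 6],
    [5, 0, 4, 6],
    [5, 0, 1, 4, 3, 6],
]

def bigram_probs(corpus, vocab_size):
    """Laplace-smoothed bigram table; each row sums to 1."""
    counts = np.ones((vocab_size, vocab_size))
    for sentence in corpus:
        for prev, curr in zip(sentence[:-1], sentence[1:]):
            counts[prev, curr] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def avg_neg_log_likelihood(probs, sentence):
    """Mean of -log P(token_t | token_{t-1}) over the sentence."""
    log_ps = [np.log(probs[prev, curr])
              for prev, curr in zip(sentence[:-1], sentence[1:])]
    return -float(np.mean(log_ps))

probs = bigram_probs(corpus, VOCAB_SIZE)
seen = avg_neg_log_likelihood(probs, [5, 0, 1, 2, 3, 6])    # a training sentence
unseen = avg_neg_log_likelihood(probs, [5, 3, 2, 1, 0, 6])  # same words, scrambled
print(f"seen: {seen:.3f}  unseen: {unseen:.3f}")
```

The training sentence should score a lower loss than the scrambled one. Pre-training amounts to adjusting the model so that this number falls across the whole corpus.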
Part B: Input transformation for classification
# --- Simulate the GPT-1 input transformation for sentiment classification ---
# Extend vocabulary with special tokens
special_tokens = ["[EXTRACT]", "[DELIM]"]
for t in special_tokens:
    word_to_id[t] = len(word_to_id)
    id_to_word[len(id_to_word)] = t
def encode_classification(text_tokens):
    """Transform a list of token IDs for classification fine-tuning.

    GPT-1 format: [START] + text + [EXTRACT]
    """
    start = word_to_id["[START]"]
    extract = word_to_id["[EXTRACT]"]
    return [start] + text_tokens + [extract]  # wrap with markers
def encode_entailment(premise_tokens, hyp_tokens):
    """Transform premise + hypothesis for entailment fine-tuning.

    GPT-1 format: [START] + premise + [DELIM] + hypothesis + [EXTRACT]
    """
    start = word_to_id["[START]"]
    delim = word_to_id["[DELIM]"]
    extract = word_to_id["[EXTRACT]"]
    return [start] + premise_tokens + [delim] + hyp_tokens + [extract]
# Example: "chai bahut garam hai" → positive sentiment
text = [word_to_id["chai"], word_to_id["bahut"],
        word_to_id["garam"], word_to_id["hai"]]
clf_input = encode_classification(text)
print("Classification input:", [id_to_word[t] for t in clf_input])
# → ['[START]', 'chai', 'bahut', 'garam', 'hai', '[EXTRACT]']
# Example: premise = "chai bahut garam hai", hypothesis = "chai acha hai"
premise = [word_to_id["chai"], word_to_id["bahut"],
           word_to_id["garam"], word_to_id["hai"]]
hypothesis = [word_to_id["chai"], word_to_id["acha"], word_to_id["hai"]]
nli_input = encode_entailment(premise, hypothesis)
print("Entailment input:", [id_to_word[t] for t in nli_input])
# → ['[START]', 'chai', 'bahut', 'garam', 'hai', '[DELIM]', 'chai', 'acha', 'hai', '[EXTRACT]']
Notice: the model receives a flat list of token IDs in both cases. There is no special “premise encoder” or “hypothesis encoder.” The same transformer processes everything. The [DELIM] token marks where one segment ends and the next begins; its embedding is learned during fine-tuning, which is how the model comes to treat it as a boundary.
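What happens after the flat sequence goes in can be sketched as follows. In GPT-1, the final transformer block's activation at the last position (the [EXTRACT] token) is fed to an added linear layer, and a softmax turns the result into class probabilities. The example below is only an illustration: random vectors stand in for the transformer's output, and the sizes and the names `hidden_states` and `W_y` are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, n_classes = 6, 8, 2  # illustrative sizes, not GPT-1's

# Stand-in for the transformer's output: one vector per input position.
# In the real model these come from the final transformer block.
hidden_states = rng.normal(size=(seq_len, d_model))

# Hypothetical linear classification head added for fine-tuning
W_y = rng.normal(scale=0.1, size=(d_model, n_classes))

def softmax(z):
    """Convert a vector of logits into probabilities."""
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

h_extract = hidden_states[-1]            # vector at the [EXTRACT] position (last token)
class_probs = softmax(h_extract @ W_y)   # P(label | input sequence)
print("Class probabilities:", class_probs)
```

Because attention is causal, the last position is the only one that has seen the whole input, which is why the head reads from it.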
Part C: The combined loss (pre-training + fine-tuning)
def cross_entropy_loss(probs, true_idx):
    """Compute cross-entropy loss for a single prediction.

    probs: probability distribution over classes (numpy array summing to 1)
    true_idx: index of the correct class
    """
    return -np.log(probs[true_idx] + 1e-9)  # add small constant for numerical stability
# Simulated output of classification head
P_positive = 0.72 # model assigns 72% to "positive"
P_negative = 0.28
# Task loss: true label is "positive"
L_task = cross_entropy_loss(np.array([P_positive, P_negative]), true_idx=0)
print(f"Task loss: {L_task:.4f}") # → 0.3285
# Language model loss: suppose average -log P(token|context) = 1.5 over this batch
L_lm = 1.5
# Combined loss (λ = 0.5)
lambda_lm = 0.5
L_total = L_task + lambda_lm * L_lm
print(f"LM loss: {L_lm:.4f}")
print(f"Combined loss: {L_total:.4f}") # → 0.3285 + 0.75 = 1.0785
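During backpropagation, gradients from both losses flow back through the same transformer weights. The task loss pushes weights toward correct classification. The language modelling loss acts as a regulariser: the GPT-1 paper reports that keeping it as an auxiliary objective improved generalisation and sped up convergence, and it can also be read as a guard against forgetting what was learned in pre-training.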
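The claim that both losses act on the same weights can be checked numerically on a toy problem. In the sketch below, two made-up losses share one weight vector `w`; the inputs `x_task` and `x_lm` and the loss functions are placeholders chosen for illustration, not GPT-1's actual losses. The gradient of the combined loss comes out as the task gradient plus λ times the LM gradient.

```python
import numpy as np

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Toy stand-ins: both losses depend on the same weight vector w
x_task = np.array([1.0, -2.0, 0.5])
x_lm = np.array([0.3, 0.8, -1.0])

def task_loss(w):
    """Binary cross-entropy for a positive label, with logit = w · x_task."""
    return np.log1p(np.exp(-(w @ x_task)))

def lm_loss(w):
    """Placeholder for the language modelling loss: same form, different input."""
    return np.log1p(np.exp(-(w @ x_lm)))

lambda_lm = 0.5
w = np.array([0.1, 0.2, -0.3])

g_task = numerical_grad(task_loss, w)
g_lm = numerical_grad(lm_loss, w)
g_total = numerical_grad(lambda w_: task_loss(w_) + lambda_lm * lm_loss(w_), w)

print("task gradient:    ", g_task)
print("LM gradient:      ", g_lm)
print("combined gradient:", g_total)
```

Each weight update is therefore a compromise between the two objectives, with λ setting how strongly the language modelling term pulls.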
What this code does not show
These snippets capture the conceptual structure of GPT-1. The real model differs in:
- Scale: 12 layers, 768-dimensional hidden states, 12 attention heads, a BPE vocabulary of 40,478 tokens, trained on roughly 800M words
- Attention: the transformer uses multi-head masked self-attention (Section 4) rather than a bigram lookup table
- BPE tokenisation: words are split into subword units (e.g., “beautiful” → “beau” + “tiful”), allowing the model to handle rare words
- Training infrastructure: the full model was pre-trained for about a month on 8 GPUs
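To make the attention item in the list above concrete, here is a minimal single-head sketch of masked (causal) self-attention in NumPy. It assumes random embeddings and random projection matrices, uses tiny illustrative sizes, and leaves out the multiple heads, layer normalisation, and feed-forward layers of the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 4  # illustrative sizes, not GPT-1's

x = rng.normal(size=(seq_len, d_model))  # one embedding per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d_model)  # (seq_len, seq_len) similarity scores

# Causal mask: position t may attend only to positions <= t
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

# Row-wise softmax; masked entries receive a weight of exactly 0
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

output = weights @ v  # each row mixes only the current and earlier tokens
print(np.round(weights, 2))
```

The printed weight matrix is lower-triangular: the first token attends only to itself, while the last token can draw on every position. This is what lets the model condition on the full preceding context instead of a single previous word.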
For a full runnable GPT-2 implementation (GPT-1’s successor, with a closely related architecture), see Andrej Karpathy’s minGPT, a clean PyTorch implementation whose core model is about 300 lines.