Section 06

The Code: MLM with HuggingFace and classification with [CLS]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 2018

Runs free on Google Colab. Install: pip install transformers torch

Two code blocks: (1) fill-mask — see BERT predict masked tokens; (2) sentence classification using the [CLS] vector.


Code Block 1: Fill-Mask — BERT predicts masked tokens

# Install once: pip install transformers torch

from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load BERT-base-uncased (110M parameters, pre-trained checkpoint)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()  # evaluation mode — no dropout

# Input: sentence with [MASK] token
sentence = "The cat sat on the [MASK]."
inputs = tokenizer(sentence, return_tensors="pt")  # pt = PyTorch tensors

# Find which position has the [MASK] token
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(f"[MASK] is at position: {mask_idx.item()}")  # Expected: 6

# Forward pass — no gradients needed for inference
with torch.no_grad():
    outputs = model(**inputs)  # logits shape: (1, seq_len, vocab_size)

# Get logits for the masked position
logits_at_mask = outputs.logits[0, mask_idx, :]  # shape: (1, 30522)
probs = torch.softmax(logits_at_mask, dim=-1)    # convert to probabilities

# Print top-5 predictions
top5 = torch.topk(probs, 5)
print("\nTop-5 predictions for [MASK]:")
for score, token_id in zip(top5.values[0], top5.indices[0]):
    word = tokenizer.decode([token_id])
    print(f"  '{word}': {score.item():.4f}")

# Example output (illustrative values only; the actual tokens and scores will differ):
# 'floor': 0.2341
# 'mat': 0.1823
# 'ground': 0.1102
# 'table': 0.0876
# 'shelf': 0.0512

Try changing the sentence to see how context affects the predictions. For example, replace “cat” with “politician” and compare the top-5 list for [MASK]. The ranking will generally shift, because BERT conditions each prediction on the full sentence in both directions.


Code Block 2: Sentence classification using the [CLS] vector

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")  # base model, no task head
model.eval()

# Two sentences to compare
sentence_a = "The movie was absolutely fantastic — I loved every scene."
sentence_b = "The food at the canteen was cold and tasteless."

# Tokenise both sentences
inputs = tokenizer(
    [sentence_a, sentence_b],
    padding=True,        # pad shorter sentence to match longer
    return_tensors="pt"
)

with torch.no_grad():
    # outputs.last_hidden_state shape: (batch_size, seq_len, hidden_size)
    # outputs.pooler_output shape: (batch_size, hidden_size); this is the final
    # [CLS] hidden state passed through BERT's pooler (a Linear layer + tanh)
    outputs = model(**inputs)

# Extract the pooled [CLS] vector for each sentence
# (the raw [CLS] hidden state is outputs.last_hidden_state[:, 0, :])
cls_vectors = outputs.pooler_output  # shape: (2, 768)

print(f"CLS vector shape: {cls_vectors.shape}")  # torch.Size([2, 768])

# Compute cosine similarity between the two CLS vectors
# (measures how similar the model thinks the two sentences are)
a = cls_vectors[0]  # sentence A
b = cls_vectors[1]  # sentence B
similarity = torch.dot(a, b) / (torch.norm(a) * torch.norm(b))
print(f"\nCosine similarity between A and B: {similarity.item():.4f}")
# Caution: without fine-tuning, BERT's pooled [CLS] vectors for unrelated
# sentences are often still highly similar, so do not expect a value near 0 here.

# In a real classifier, you would:
# 1. Load a labelled dataset (e.g. sentiment: positive/negative)
# 2. Add a linear layer on top of cls_vectors: nn.Linear(768, num_classes)
# 3. Fine-tune the whole model on that dataset
# 4. At test time, pass a sentence → get cls_vector → linear layer → class prediction
print("\nThe CLS vector is a 768-dim fingerprint of the sentence.")
print("Fine-tune a linear layer on top to classify any sentence property.")
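
The cosine similarity formula used in Code Block 2 can be checked on small vectors where the answer is known in advance. This is a minimal sketch: the three toy vectors are made up for illustration and are not BERT outputs. It also compares the manual formula with PyTorch's built-in torch.nn.functional.cosine_similarity.

```python
import torch
import torch.nn.functional as F

# Toy vectors (assumptions for illustration, not model outputs)
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])    # same direction as a, twice the length
c = torch.tensor([-3.0, 0.0, 1.0])   # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return torch.dot(u, v) / (torch.norm(u) * torch.norm(v))

manual_ab = cosine(a, b).item()                       # parallel vectors
builtin_ab = F.cosine_similarity(a, b, dim=0).item()  # same quantity via PyTorch
manual_ac = cosine(a, c).item()                       # orthogonal vectors

print(f"a vs b (manual):   {manual_ab:.4f}")
print(f"a vs b (built-in): {builtin_ab:.4f}")
print(f"a vs c (manual):   {manual_ac:.4f}")
```

Cosine similarity ignores vector length and measures only direction, which is why a and b score as identical even though b is twice as long.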

What just happened?

In Code Block 1, BERT ran a full bidirectional forward pass over your sentence, using context on both sides of the [MASK] to score every word in its vocabulary. A word such as “mat” tends to rank high because phrases like “sat on the mat” are common in the pre-training text.
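
The last step of Code Block 1, turning logits into ranked predictions, can be sketched without loading the model. The five-word vocabulary and the logit values below are assumptions made up for illustration; real BERT logits cover all 30,522 vocabulary entries.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(vocab, logits, k):
    """Return the k (word, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy vocabulary and logits (hypothetical values, not BERT outputs)
toy_vocab = ["floor", "mat", "moon", "table", "idea"]
toy_logits = [2.0, 1.5, -1.0, 0.5, -3.0]

predictions = top_k(toy_vocab, toy_logits, k=3)
for word, p in predictions:
    print(f"  '{word}': {p:.4f}")
```

Softmax preserves the ordering of the logits, so the top-k words are the same whether you rank by logit or by probability; the conversion only makes the scores readable as a distribution.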

In Code Block 2, you extracted the pooled [CLS] vector: a 768-dimensional summary that BERT builds for the entire sentence. This is what gets plugged into a linear classifier for fine-tuning. Before fine-tuning, the vector carries some sentence-level information, but it was trained only through the next-sentence-prediction objective, so it is not a reliable sentence embedding on its own. Clear class structure, such as positive-sentiment sentences clustering together, typically emerges once the model is fine-tuned on labelled data.
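
One detail worth checking: in the HuggingFace implementation, pooler_output is not the raw hidden state at the [CLS] position. It is that state passed through a Linear layer followed by tanh. The sketch below verifies this on a tiny, randomly initialised BERT so that nothing has to be downloaded; the config sizes and the token ids are arbitrary assumptions, and the same relationship should hold for bert-base-uncased.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Tiny random BERT (assumed sizes, chosen only to keep the example fast and offline)
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)
model.eval()

# Arbitrary token ids standing in for a tokenised sentence
input_ids = torch.tensor([[1, 5, 6, 7, 2]])

with torch.no_grad():
    outputs = model(input_ids=input_ids)
    raw_cls = outputs.last_hidden_state[:, 0, :]   # hidden state at position 0
    pooled = outputs.pooler_output                 # what Code Block 2 used
    rebuilt = torch.tanh(model.pooler.dense(raw_cls))  # Linear + tanh by hand

pooler_is_dense_tanh = torch.allclose(pooled, rebuilt, atol=1e-6)
print(f"raw [CLS] shape: {tuple(raw_cls.shape)}")
print(f"pooler_output == tanh(dense(raw [CLS])): {pooler_is_dense_tanh}")
```

Either vector can feed a classifier head. Which one works better is an empirical question for your task, so treat the choice as a design decision rather than a fixed rule.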

The critical point: in a fine-tuning scenario, you train the linear classifier and update all of BERT’s parameters at the same time on your labelled dataset. The pre-trained parameters give the model a large head start, so a few hundred to a few thousand labelled examples are often enough, rather than millions.
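
A single fine-tuning step can be sketched as follows. This is a minimal illustration under stated assumptions, not a full training recipe: ClsClassifier is a hypothetical wrapper name, the encoder is a tiny randomly initialised BERT so the example runs offline, and the token ids and labels are random stand-ins for a real labelled dataset. In practice you would load the encoder with BertModel.from_pretrained("bert-base-uncased"), or use the ready-made BertForSequenceClassification class from transformers.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel

class ClsClassifier(nn.Module):  # hypothetical name, for illustration only
    """BERT encoder plus a linear head on the [CLS] hidden state."""

    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.head(hidden[:, 0, :])  # logits from the [CLS] position

torch.manual_seed(0)

# Tiny random encoder (assumed sizes); swap in a pre-trained checkpoint in practice
tiny = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                  num_attention_heads=2, intermediate_size=64)
model = ClsClassifier(BertModel(tiny, add_pooling_layer=False), num_classes=2)

# One optimizer over ALL parameters: encoder and head are updated together
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Random stand-in batch: 4 "sentences" of 8 token ids, with binary labels
input_ids = torch.randint(1, 100, (4, 8))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 1, 0])

model.train()
before = model.encoder.embeddings.word_embeddings.weight.detach().clone()

logits = model(input_ids, attention_mask)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

after = model.encoder.embeddings.word_embeddings.weight
encoder_changed = not torch.equal(before, after)
print(f"logits shape: {tuple(logits.shape)}")
print(f"encoder weights updated: {encoder_changed}")
```

The check at the end confirms that the gradient step reached the encoder's embedding weights, not only the new linear head. That is the difference between fine-tuning and training a classifier on frozen features.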