4. The Math — MLM Loss, NSP Loss, Input Embeddings, and Fine-Tuning
Prerequisites: Conditional Probability · Cross-Entropy Loss · Softmax Function
4.1 Input representation
Every input token in BERT is converted into a single vector by summing three learned embeddings:
e(token, position, segment) = E_token[token] + E_pos[position] + E_seg[segment]
Where:
- E_token ∈ ℝ^(V × d) — token embedding matrix (V = vocabulary size, d = hidden dimension)
- E_pos ∈ ℝ^(512 × d) — positional embedding matrix (BERT supports sequences up to 512 tokens)
- E_seg ∈ ℝ^(2 × d) — segment embedding matrix (two rows: one for sentence A, one for sentence B)
For BERT-base, d = 768. So each input token becomes a 768-dimensional vector before it enters the Transformer encoder.
Note that BERT uses learned positional embeddings, unlike the original Transformer paper, which used fixed sinusoidal encodings. Both work: the Transformer paper reported nearly identical results for the two variants, and BERT adopts the learned version.
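The sum in 4.1 can be sketched in a few lines of plain Python. This is a toy illustration, not BERT's implementation: the sizes are tiny, the tables hold random values standing in for learned weights, and the token ids and the helper names `make_table` and `embed` are assumptions made for the example.

```python
import random

def make_table(rows, dim, rng):
    """Return a rows x dim table of small random values (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

def embed(token_ids, segment_ids, E_token, E_pos, E_seg):
    """Sum the token, position and segment embeddings at every input position."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + p + s for t, p, s in zip(E_token[tok], E_pos[pos], E_seg[seg])]
        out.append(vec)
    return out

rng = random.Random(0)
V, MAX_LEN, D = 10, 8, 4            # toy sizes; BERT-base uses 30,522, 512 and 768
E_token = make_table(V, D, rng)
E_pos = make_table(MAX_LEN, D, rng)
E_seg = make_table(2, D, rng)

token_ids = [1, 5, 7, 2, 6, 2]      # hypothetical ids, e.g. [CLS] a b [SEP] c [SEP]
segment_ids = [0, 0, 0, 0, 1, 1]    # sentence A = 0, sentence B = 1
X = embed(token_ids, segment_ids, E_token, E_pos, E_seg)
```

Positions 3 and 5 hold the same token id, yet their vectors in `X` differ because the position and segment terms differ. That is how the encoder can tell the two occurrences apart.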
4.2 The Transformer encoder (no causal mask)
After embedding, the sequence is processed by a stack of L Transformer encoder blocks. Each block applies:
- Multi-head self-attention — with no causal mask. Every token can attend to every other token in the sequence.
- Feed-forward network — two linear transformations with a GELU activation in between.
- Layer normalisation + residual connections around both sub-layers.
The key difference from GPT-1 is the attention mask. GPT-1 sets the attention score to −∞ for all future positions, so each token sees only what precedes it. BERT applies no such causal mask: the full attention matrix is used, and only padding positions are excluded.
The output of the encoder for each position i is a vector hᵢ ∈ ℝ^d that encodes the meaning of token i in the context of all other tokens.
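To make the masking difference concrete, here is a minimal sketch that turns the same raw attention scores into attention weights with and without a causal mask. The scores are made up (all equal) and the function names are illustrative. A real encoder would also use queries, keys, values and multiple heads.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Numerically stable softmax; entries of -inf get weight exactly 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row i holds the weights token i assigns to every token j."""
    n = len(scores)
    weights = []
    for i in range(n):
        # GPT-style: future positions (j > i) are set to -inf before the softmax.
        row = [scores[i][j] if (not causal or j <= i) else NEG_INF for j in range(n)]
        weights.append(softmax(row))
    return weights

scores = [[0.0] * 3 for _ in range(3)]                   # equal raw scores, 3 tokens
bidirectional = attention_weights(scores, causal=False)  # BERT-style
causal = attention_weights(scores, causal=True)          # GPT-1-style
```

With equal scores, every row of `bidirectional` spreads its weight evenly over all three tokens, while the first row of `causal` puts all of its weight on the first token because the two later positions are masked out.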
4.3 Masked Language Modelling (MLM) loss
Let M ⊂ {1, 2, …, n} be the set of masked positions in a sequence of n tokens.
For each masked position i ∈ M, BERT applies a linear layer followed by a softmax to produce a probability distribution over the vocabulary:
P(token = v | context) = softmax(W_mlm · hᵢ + b_mlm)[v]
Where:
- hᵢ ∈ ℝ^d is the final hidden state at position i
- W_mlm ∈ ℝ^(V × d) is a learned linear projection
- b_mlm ∈ ℝ^V is a learned bias
- V is the vocabulary size (30,522 for BERT-base)
The MLM loss is the average cross-entropy over all masked positions:
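L_MLM = −(1/|M|) Σᵢ∈M log P(xᵢ | x̃)
Where xᵢ is the original (unmasked) token at position i and x̃ is the corrupted input sequence the model actually sees. The prediction is conditioned on x̃ rather than on all other original tokens, because the other masked positions are hidden from the model too. The loss is computed only over the masked positions: the model is not penalised for its predictions at unmasked positions.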
Numerical example:
Suppose the masked token is “mat” and the vocabulary has V = 5 tokens: [“cat”, “mat”, “bat”, “hat”, “rat”].
The model’s output logits at the masked position are: [0.2, 2.1, 0.5, 0.3, 0.1]
Softmax gives probabilities:
exp([0.2, 2.1, 0.5, 0.3, 0.1]) ≈ [1.221, 8.166, 1.649, 1.350, 1.105]
Sum ≈ 13.491
P ≈ [0.091, 0.605, 0.122, 0.100, 0.082]
The correct token is “mat” (index 1), with predicted probability 0.605.
Cross-entropy loss for this token: −log(0.605) ≈ 0.502
If the model were random (uniform), each probability would be 0.2, and the loss would be −log(0.2) ≈ 1.609. A loss of 0.502 is well below that baseline: the model is placing most of its probability mass on the correct token.
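The arithmetic above can be checked with a short script. It is a sketch under stated assumptions: `mlm_loss` is a hypothetical helper that averages cross-entropy over the masked positions only, and the position index 3 is arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(logits_by_position, original_ids, masked_positions):
    """Mean cross-entropy over the masked positions; other positions are ignored."""
    total = 0.0
    for i in masked_positions:
        probs = softmax(logits_by_position[i])
        total += -math.log(probs[original_ids[i]])
    return total / len(masked_positions)

vocab = ["cat", "mat", "bat", "hat", "rat"]
logits_by_position = {3: [0.2, 2.1, 0.5, 0.3, 0.1]}   # one masked position
original_ids = {3: vocab.index("mat")}                # the token that was masked

loss = mlm_loss(logits_by_position, original_ids, masked_positions=[3])
p_mat = softmax(logits_by_position[3])[vocab.index("mat")]
uniform_loss = -math.log(1 / len(vocab))

print(round(p_mat, 3), round(loss, 3), round(uniform_loss, 3))  # 0.605 0.502 1.609
```

With more than one masked position, the same function returns the average of the per-position losses, matching the 1/|M| factor in L_MLM.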
4.4 Next Sentence Prediction (NSP) loss
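The final hidden state of the [CLS] token, h_[CLS] ∈ ℝ^d, is fed into a binary classifier:
P(IsNext) = sigmoid(w_nsp · h_[CLS] + b_nsp), with w_nsp ∈ ℝ^d
Or equivalently, using a linear layer with 2 outputs followed by softmax:
P(IsNext, NotNext) = softmax(W_nsp · h_[CLS] + b_nsp), with W_nsp ∈ ℝ^(2 × d)
The two forms are equivalent because a two-way softmax equals a sigmoid applied to the difference of the two logits.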
The NSP loss is the standard binary cross-entropy:
L_NSP = −[y · log P(IsNext) + (1−y) · log P(NotNext)]
Where y = 1 if B truly follows A, y = 0 if B is random.
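A small sketch can confirm that the sigmoid and two-output softmax forms give the same probability, and show how the binary cross-entropy behaves. The logits are made-up numbers and the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z_is_next, z_not_next):
    """Two-way softmax returning (P(IsNext), P(NotNext))."""
    m = max(z_is_next, z_not_next)
    a, b = math.exp(z_is_next - m), math.exp(z_not_next - m)
    return a / (a + b), b / (a + b)

def nsp_loss(p_is_next, y):
    """Binary cross-entropy; y = 1 if B truly follows A, y = 0 if B is random."""
    return -(y * math.log(p_is_next) + (1 - y) * math.log(1.0 - p_is_next))

z_is_next, z_not_next = 1.5, -0.5             # hypothetical classifier logits
p_softmax, _ = softmax2(z_is_next, z_not_next)
p_sigmoid = sigmoid(z_is_next - z_not_next)   # sigmoid of the logit difference

loss_if_next = nsp_loss(p_softmax, y=1)       # small: the model agrees with the label
loss_if_random = nsp_loss(p_softmax, y=0)     # large: confident but wrong
```

Here `p_softmax` and `p_sigmoid` agree up to floating-point error, and the loss is much larger when the confident prediction turns out to be wrong.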
4.5 Total pre-training loss
BERT trains both objectives simultaneously on every batch:
L = L_MLM + L_NSP
The two losses are simply added together with equal weight. The gradients from both objectives flow through the entire Transformer encoder, so the MLM and NSP objectives jointly shape the parameters of every encoder layer (12 in BERT-base, 24 in BERT-large).
4.6 Fine-tuning
Fine-tuning is the simplest part of BERT. The pre-trained parameters are loaded and then trained further on a small labelled dataset for a specific task. A task-specific output layer is added on top:
Sentence-level classification (sentiment, or sentence-pair tasks such as entailment):
P(class = c) = softmax(W_cls · h_[CLS])[c]
Loss = cross-entropy(y_true, P)
Only the [CLS] vector is used. The h_[CLS] vector is 768-dimensional for BERT-base, and W_cls ∈ ℝ^(C × 768) where C is the number of classes.
Token-level classification (named entity recognition):
P(label = l | position i) = softmax(W_ner · hᵢ)[l]
Each token gets its own prediction from its own hidden state.
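The two heads differ only in which hidden states they read. A minimal sketch, assuming toy 3-dimensional hidden states and hand-picked weight matrices (all numbers and function names here are illustrative, not trained values):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, h):
    """Matrix-vector product W · h, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def classify_sentence(W_cls, hidden_states):
    """Sentence-level head: reads only the [CLS] vector at position 0."""
    return softmax(linear(W_cls, hidden_states[0]))

def classify_tokens(W_ner, hidden_states):
    """Token-level head: applies the same W_ner to every position separately."""
    return [softmax(linear(W_ner, h)) for h in hidden_states]

# Toy encoder output: 3 positions, d = 3 (position 0 plays the role of [CLS]).
hidden_states = [[0.5, -1.0, 0.2], [1.0, 0.0, -0.5], [-0.3, 0.8, 0.1]]
W_cls = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                    # C = 2 classes
W_ner = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1], [0.1, 0.0, 0.4]]   # 3 labels

sentence_probs = classify_sentence(W_cls, hidden_states)  # one distribution
token_probs = classify_tokens(W_ner, hidden_states)       # one distribution per token
```

In both cases the head is a single linear layer plus softmax. Everything else is the shared pre-trained encoder.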
Extractive question answering (SQuAD-style):
The task is: given a question and a passage, find the start and end token of the answer span in the passage.
BERT uses two learned vectors s ∈ ℝ^d and e ∈ ℝ^d (start and end pointers):
Score_start(i) = s · hᵢ (dot product of start vector with token i's hidden state)
Score_end(i) = e · hᵢ
P(start = i) = softmax(Score_start)[i]
P(end = j) = softmax(Score_end)[j]
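The predicted answer span is the pair (i, j) with j ≥ i that maximises Score_start(i) + Score_end(j). Taking argmax P_start and argmax P_end independently is not enough, because the chosen end could fall before the chosen start.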
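Choosing the start and end positions independently can produce an end that comes before the start. The BERT paper instead takes the highest-scoring span with j ≥ i. A minimal sketch of that search follows, with made-up scores; `best_span` is a hypothetical helper name, and the exhaustive double loop is chosen for clarity rather than speed.

```python
def best_span(start_scores, end_scores):
    """Return the (i, j) with j >= i that maximises start_scores[i] + end_scores[j]."""
    best, best_score = None, float("-inf")
    n = len(start_scores)
    for i in range(n):
        for j in range(i, n):          # the end may not precede the start
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

start_scores = [0.1, 0.3, 0.2, 2.0, 0.4]   # hypothetical Score_start(i)
end_scores = [0.2, 1.9, 0.1, 0.5, 1.2]     # hypothetical Score_end(j)

# Independent argmaxes: start = 3, end = 1, which is not a valid span.
naive = (start_scores.index(max(start_scores)), end_scores.index(max(end_scores)))

span, score = best_span(start_scores, end_scores)   # the constrained search
```

For these scores the independent argmaxes give (3, 1), an end before the start, while the constrained search returns the valid span (3, 4).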
The elegant property: despite the diversity of these task-specific heads, the Transformer encoder itself is the same pre-trained model in every case. Fine-tuning updates the entire model (encoder + task head) end-to-end, but only a small labelled dataset is needed because the encoder already knows how language works.
4.7 Fine-tuning hyperparameters
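The paper reports that fine-tuning is fast and inexpensive compared with pre-training, although it notes that BERT-large was sometimes unstable on small datasets. Recommended settings across tasks: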
- Batch size: 16 or 32
- Learning rate: 2e-5, 3e-5, or 5e-5 (much smaller than typical training from scratch)
- Epochs: 2 to 4 (very few — the model converges quickly because it already understands language)
- Learning rate warmup over the first 10% of training steps (the default in the released fine-tuning scripts)
The small learning rate is important: you are nudging a well-trained model toward a specific task, not training it from scratch. Too high a learning rate can cause catastrophic forgetting, where the model overwrites its pre-trained language knowledge.
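The warmup setting can be sketched as a simple schedule function. The text above specifies only the 10% warmup. The linear decay to zero afterwards is an assumption made for this example, as are the dataset size and the name `lr_at_step`.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_fraction=0.1):
    """Linear warmup to peak_lr, then (assumed) linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * (total_steps - step) / decay_steps

# Hypothetical task: 8,000 labelled examples, batch size 32, 4 epochs.
num_examples, batch_size, epochs = 8000, 32, 4
total_steps = (num_examples // batch_size) * epochs   # 250 steps/epoch * 4 = 1000

schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
peak_step = schedule.index(max(schedule))             # where the peak rate is reached
```

With these numbers the rate climbs from 0 to 2e-5 over the first 100 steps and then falls back to 0 by step 1000. The peak never exceeds the small value chosen from the list above.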