5. Worked Example — Forward Pass Through BERT on a Masked Sentence

We will trace a complete forward pass through a tiny BERT-like model (scaled down for readability) on a masked sentence. We will compute the MLM loss by hand.

Setup

Input sentence: “The cat sat on the mat.”

After WordPiece tokenisation and adding special tokens:

Position:  0      1     2     3     4     5       6     7
Token:     [CLS]  The   cat   sat   on    [MASK]  mat   [SEP]

The token at position 5 (“the”) has been masked — we will predict it from context.

Tiny model parameters (for illustration):

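Hidden dimension d = 4 (BERT-base uses 768)
Vocabulary: [“the”, “a”, “on”, “mat”, “cat”, “sat”] → V = 6 (special tokens such as [CLS], [MASK], and [SEP] are left out of the output vocabulary to keep the example small)
1 Transformer encoder layer with 2 attention heads, so d_head = d / 2 = 2 (BERT-base has 12 layers with 12 heads each)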

Step 1: Input embeddings

Each token gets a 4-dimensional vector from summing token, positional, and segment embeddings. We’ll use simplified values:

Position 0 ([CLS]):  [0.1,  0.2, -0.1,  0.3]
Position 1 (The):    [0.5,  0.1,  0.2, -0.3]
Position 2 (cat):    [0.8, -0.2,  0.4,  0.1]
Position 3 (sat):    [0.3,  0.6, -0.5,  0.2]
Position 4 (on):     [-0.1, 0.4,  0.3,  0.5]
Position 5 ([MASK]): [0.0,  0.0,  0.0,  0.0]   ← [MASK] embedding (learned, shown as zeros here)
Position 6 (mat):    [0.6,  0.3, -0.2,  0.4]
Position 7 ([SEP]):  [0.2,  0.1,  0.1,  0.2]
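As a rough sketch of the summation described in this step, the snippet below adds three component vectors for “cat” at position 2. The individual token, positional, and segment values are hypothetical (the example only lists the final sums); they were picked so that the total matches the vector shown above.

```python
# Hypothetical component embeddings for "cat" at position 2 (d = 4).
# Only their sum appears in the worked example; the split is an assumption.
token_emb    = [0.5, -0.3, 0.2,  0.1]   # looked up by token id
position_emb = [0.2,  0.1, 0.1, -0.1]   # looked up by position index
segment_emb  = [0.1,  0.0, 0.1,  0.1]   # looked up by segment (sentence A)

# The element-wise sum is the input vector fed to the encoder.
input_emb = [t + p + s for t, p, s in zip(token_emb, position_emb, segment_emb)]
print([round(x, 2) for x in input_emb])
```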

Step 2: Self-attention — position 5 attends to all positions

In BERT, the [MASK] token at position 5 can attend to every other token. Let’s compute what position 5 pays attention to using simplified attention scores.

Attention scores are computed as:

score(5, j) = (Q₅ · Kⱼ) / √d_head
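A minimal sketch of this formula for a single head, assuming d_head = 2 (d = 4 split across 2 heads). The query and key values are hypothetical placeholders, not the projections behind the scores used in the rest of this step.

```python
import math

def scaled_score(q, k):
    """Scaled dot-product score between one query and one key vector."""
    d_head = len(q)
    dot = sum(qi * ki for qi, ki in zip(q, k))
    return dot / math.sqrt(d_head)

# Hypothetical per-head projections (d_head = 2), chosen only for illustration.
q5 = [1.0, 2.0]   # query vector for position 5
k1 = [0.5, 1.0]   # key vector for position 1

score_5_1 = scaled_score(q5, k1)
print(round(score_5_1, 3))
```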

For illustration, suppose that after the linear projections and scaling, one attention head produces these pre-softmax scores for position 5 attending to every position (itself included):

Position:    0([CLS])  1(The)  2(cat)  3(sat)  4(on)  5([MASK])  6(mat)  7([SEP])
Raw score:     0.1      1.8     0.9     0.4     1.2      0.0       0.7     0.1

Apply softmax to get attention weights:

Position:     0      1      2      3      4      5      6      7
exp(score):   1.11   6.05   2.46   1.49   3.32   1.00   2.01   1.11
Sum = 18.55

Attention weights (each exp value divided by the sum):

Position:     0      1      2      3      4      5      6      7
Weight:       0.060  0.326  0.133  0.080  0.179  0.054  0.108  0.060

Position 5 pays the most attention to position 1 (“The”) with weight 0.326, followed by position 4 (“on”) with weight 0.179. The pattern is plausible: the missing word is “the”, which matches the “The” at the start of the sentence, and “on the mat” is the phrase being reconstructed. Keep in mind that these scores were chosen by hand; attention in a trained model need not be this easy to interpret.
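The softmax arithmetic can be reproduced with a short script. This is a sketch using only the Python standard library; it subtracts the maximum score before exponentiating, a common stability trick that does not change the result. The last digit may differ slightly from the hand calculation, which rounds the exponentials before dividing.

```python
import math

# Scores for position 5 attending to positions 0..7 (from the table above).
scores = [0.1, 1.8, 0.9, 0.4, 1.2, 0.0, 0.7, 0.1]

def softmax(xs):
    """Numerically stable softmax: shift by the max, exponentiate, normalise."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)
for pos, w in enumerate(weights):
    print(pos, round(w, 3))
```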

The context vector for position 5 is the weighted sum of all value vectors. This context vector captures information from every token in the sentence, weighted by relevance.
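To make the weighted sum concrete, the sketch below reuses the rounded attention weights from this step and, as a simplifying assumption, treats the Step 1 embeddings as the value vectors (an identity value projection and a single head over all 4 dimensions). A real layer would first apply a learned value projection per head.

```python
# Attention weights for position 5, copied from the softmax table (rounded).
weights = [0.060, 0.326, 0.133, 0.080, 0.179, 0.054, 0.108, 0.060]

# Assumption: value vectors are the Step 1 embeddings (identity projection).
values = [
    [0.1,  0.2, -0.1,  0.3],   # [CLS]
    [0.5,  0.1,  0.2, -0.3],   # The
    [0.8, -0.2,  0.4,  0.1],   # cat
    [0.3,  0.6, -0.5,  0.2],   # sat
    [-0.1, 0.4,  0.3,  0.5],   # on
    [0.0,  0.0,  0.0,  0.0],   # [MASK]
    [0.6,  0.3, -0.2,  0.4],   # mat
    [0.2,  0.1,  0.1,  0.2],   # [SEP]
]

# context[d] = sum over positions j of weights[j] * values[j][d]
context = [
    sum(w * v[d] for w, v in zip(weights, values))
    for d in range(len(values[0]))
]
print([round(c, 3) for c in context])
```

Because the weights are non-negative and sum to 1, each component of the context vector lies between the smallest and largest value in that dimension.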

Step 3: MLM prediction head

After the full encoder (in our simplified model, just one layer), the hidden state at position 5, h₅ ∈ ℝ^4, is fed through the MLM head:

Logits = W_mlm · h₅ + b_mlm     (shape: V = 6 outputs)

Suppose the logits are:

Token:   "the"   "a"    "on"   "mat"   "cat"   "sat"
Logits:   2.5    0.3    0.8    0.1     -0.2    0.5

Apply softmax:

exp(logits):  12.18   1.35   2.23   1.11   0.82   1.65
Sum = 19.34

Probabilities:
"the":  12.18 / 19.34 = 0.630
"a":     1.35 / 19.34 = 0.070
"on":    2.23 / 19.34 = 0.115
"mat":   1.11 / 19.34 = 0.057
"cat":   0.82 / 19.34 = 0.042
"sat":   1.65 / 19.34 = 0.085
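The softmax over the vocabulary can be checked with a few lines of Python. This sketch starts from the logits given above rather than from W_mlm and h₅, whose values the example does not specify.

```python
import math

vocab  = ["the", "a", "on", "mat", "cat", "sat"]
logits = [2.5, 0.3, 0.8, 0.1, -0.2, 0.5]

# Softmax: exponentiate each logit and normalise by the sum.
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = {tok: e / total for tok, e in zip(vocab, exps)}

for tok, p in probs.items():
    print(f"{tok:>4}: {p:.3f}")

predicted = max(probs, key=probs.get)  # token with the highest probability
```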

Step 4: MLM loss

The ground truth is “the” (index 0), to which the model assigns probability 0.630. The MLM loss is the negative natural log of that probability:

L_MLM = −log(0.630) ≈ 0.462

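Compare to an uninformed baseline that guesses uniformly over the vocabulary (probability 1/6 ≈ 0.167). This is not the worst case, since a model that puts almost no probability on the correct token has an arbitrarily large loss, but it is the natural reference point:

L_random = −log(1/6) = log 6 ≈ 1.792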

Our tiny model’s loss of 0.462 is well below the uniform baseline: it assigns most of the probability mass to “the”, guided by the bidirectional context from both “The cat sat” and “on [MASK] mat”.
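The two loss values can be verified directly. In this sketch `math.log` is the natural logarithm, and the baseline is evaluated with the exact value 1/6 rather than the rounded 0.167.

```python
import math

p_correct = 0.630    # model probability for the true token "the"
vocab_size = 6

loss_mlm = -math.log(p_correct)            # cross-entropy for one masked token
loss_uniform = -math.log(1 / vocab_size)   # same loss under a uniform guess

print(round(loss_mlm, 3), round(loss_uniform, 3))
```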

Step 5: What makes this bidirectional?

Notice what the model used to make its prediction. The [MASK] token at position 5 drew on context from both sides, including:

Position 1 (“The”), weight 0.326: another occurrence of the same word in the sentence, to its left
Position 4 (“on”), weight 0.179: the word immediately before the mask, also to its left
Position 6 (“mat”), weight 0.108: the word immediately after the mask, to its right

A left-to-right model (GPT-1 style) cannot use position 6 (“mat”) at all when predicting position 5, because its attention is restricted to earlier positions. BERT uses it with attention weight 0.108. In a real, deeper BERT, this rightward information is mixed in again at every layer, so its influence on the final prediction can be larger than a single attention weight suggests.

This is the core of bidirectionality: “on the mat” is recognised as a natural phrase only when the model can see both “on” (left) and “mat” (right) simultaneously.
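The contrast with a left-to-right model can be sketched by applying a causal mask to the same Step 2 scores. This only illustrates the masking mechanism: a GPT-style model has no [MASK] token and is trained to predict the next token, so the scores themselves would differ in practice.

```python
import math

# Scores for position 5 attending to positions 0..7 (from Step 2).
scores = [0.1, 1.8, 0.9, 0.4, 1.2, 0.0, 0.7, 0.1]
query_pos = 5

def softmax(xs):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Bidirectional (BERT-style): every position is visible.
bidirectional = softmax(scores)

# Causal (left-to-right): positions after the query are set to -inf,
# so they receive exactly zero weight after softmax.
masked_scores = [s if j <= query_pos else float("-inf")
                 for j, s in enumerate(scores)]
causal = softmax(masked_scores)

print("weight on 'mat' (bidirectional):", round(bidirectional[6], 3))
print("weight on 'mat' (causal):       ", round(causal[6], 3))
```

Under the causal mask the weight that went to “mat” and [SEP] is redistributed over positions 0 to 5.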

Step 6: NSP forward pass (brief)

For an NSP training example, suppose the input is:

[CLS] The cat sat on the mat [SEP] It was a warm afternoon [SEP]

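The [CLS] hidden state h₀ ∈ ℝ^4 is produced after the encoder processes the full two-sentence sequence. A linear classifier then predicts the label (the original BERT first passes h₀ through a dense + tanh pooling layer, which we omit here):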

Logits_NSP = W_nsp · h₀ + b_nsp   (shape: 2 outputs — IsNext, NotNext)
P(IsNext) = softmax(Logits_NSP)[0]

If B truly follows A (as it does here), the correct label is IsNext, and the NSP loss is:

L_NSP = −log(P(IsNext))

Total loss for this training example: L = L_MLM + L_NSP.
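A small sketch of the NSP loss and the combined objective. The NSP logits are hypothetical, since the example does not give values for them, and the MLM loss is reused from Step 4 purely for illustration.

```python
import math

# Hypothetical NSP logits for [IsNext, NotNext]; not given in the example.
logits_nsp = [1.2, -0.4]

# Two-way softmax; index 0 is the probability of IsNext.
exps = [math.exp(z) for z in logits_nsp]
p_is_next = exps[0] / sum(exps)

loss_nsp = -math.log(p_is_next)   # the true label is IsNext
loss_mlm = 0.462                  # MLM loss from Step 4
total_loss = loss_mlm + loss_nsp

print(round(p_is_next, 3), round(loss_nsp, 3), round(total_loss, 3))
```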

Gradients of both losses flow backward together, updating the two prediction heads, every parameter in the encoder, and the embedding layers.