Paper 11
Intermediate

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding


Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · Google AI Language · 2018


What this paper did

It let the model read in both directions at once, and with that change BERT set new state-of-the-art results on eleven language understanding tasks.

GPT-1 (Paper 10) proved that pre-training on unlabelled text transfers to downstream tasks. But GPT-1 read text only left-to-right: when predicting the next word, it could only look at the words that came before. This made it powerful for generation, but it meant every word’s representation was built on half the context — the past, never the future.

Devlin et al. asked a simple question: what if we let the model see both directions at once?

The answer was BERT — a Transformer encoder pre-trained with two objectives. The first, Masked Language Modelling (MLM), randomly covers 15% of tokens and asks the model to guess them from the surrounding words in both directions. The second, Next Sentence Prediction (NSP), asks the model to decide whether two sentences appear consecutively in text. Together, these objectives force the model to build deep, bidirectional representations of language.
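The masking step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: `mask_tokens` and the toy vocabulary are hypothetical names. It also assumes the 80/10/10 replacement rule from the original paper (80% `[MASK]`, 10% random token, 10% left unchanged), which this summary does not spell out.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at each selected position and
    None everywhere else, so the loss is computed only where labels exist.
    """
    rng = random.Random(seed)
    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in (CLS, SEP)]
    n_select = max(1, round(len(candidates) * mask_prob))
    selected = rng.sample(candidates, n_select)

    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i in selected:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:            # assumed 80%: replace with [MASK]
            corrupted[i] = MASK
        elif roll < 0.9:          # assumed 10%: replace with a random token
            corrupted[i] = rng.choice(vocab)
        # remaining 10%: keep the original token in place
    return corrupted, labels

tokens = [CLS, "the", "cat", "sat", "on", "the", "mat", ".", SEP]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "."]
corrupted, labels = mask_tokens(tokens, vocab)
print(corrupted)
print(labels)
```

With nine tokens (seven of them eligible), 15% rounds to a single selected position; real pre-training applies the same procedure to much longer sequences.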

The result: fine-tuned BERT exceeded human performance on SQuAD (reading comprehension), beat GPT-1 by large margins on GLUE (a suite of 9 NLP tasks), and set new records on named entity recognition and sentence inference — all from the same pre-trained checkpoint.

The key equations:

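MLM objective:   L_MLM = −Σ log P(xᵢ | x₁,...,x_{i−1}, x_{i+1},...,xₙ)   [over masked positions i]

NSP objective:   L_NSP = −Σ log P(y | [CLS] representation)   [over sentence pairs]

Total loss:      L = L_MLM + L_NSP

Where xᵢ is the original token at a masked position i, y ∈ {IsNext, NotNext} is the true label for the sentence pair, and [CLS] is a special classification token prepended to every input; its final hidden state is used for sequence-level predictions. The context in the MLM term is the corrupted input, so any other masked tokens appear in it only in their masked form.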
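To make the two loss terms concrete, here is a small sketch in plain Python, assuming the model has already produced logits. The function names, the label convention (0 = IsNext, 1 = NotNext), and the toy logits are illustrative assumptions, not part of the paper or of any library.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mlm_loss(logits_per_position, target_ids, masked_positions):
    """Sum of -log P(original token) over the masked positions only."""
    return -sum(
        log_softmax(logits_per_position[i])[target_ids[i]]
        for i in masked_positions
    )

def nsp_loss(cls_logits, label):
    """-log P(label) from the [CLS] classifier; assumed 0 = IsNext, 1 = NotNext."""
    return -log_softmax(cls_logits)[label]

# Toy setup: 3 positions, vocabulary of 4, only position 1 is masked.
# All-zero logits mean a uniform prediction, so each loss is log(num_classes).
logits = [[0.0] * 4 for _ in range(3)]
target_ids = [0, 2, 0]          # entries at unmasked positions are ignored
l_mlm = mlm_loss(logits, target_ids, masked_positions=[1])
l_nsp = nsp_loss([0.0, 0.0], label=0)
total = l_mlm + l_nsp
print(l_mlm, l_nsp, total)
```

An untrained model that predicts uniformly pays log 4 for the masked token and log 2 for the sentence-pair label; training pushes both terms toward zero. Implementations commonly average rather than sum over positions and examples, which rescales the loss without changing what is optimised.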


The Indian analogy

Imagine a student studying for their Hindi exam from a textbook where the teacher has randomly blacked out words on each page. To figure out what the hidden word is, the student must read both what comes before and what comes after — they cannot rely on just the left side of the sentence.

This forces something powerful: the student stops skimming and starts understanding full sentences from both ends simultaneously.

BERT’s pre-training is exactly this. By masking random words and demanding the model recover them from full surrounding context, BERT is forced to build a representation of every word that incorporates the entire sentence — not just the words that preceded it. This bidirectional understanding is why BERT is dramatically better at comprehension tasks than GPT-1, which could only read left-to-right.

The second pre-training task — Next Sentence Prediction — is like asking the student: “Does paragraph B logically follow paragraph A, or was it taken from somewhere else?” Answering this requires understanding paragraph-level coherence, not just individual words.
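Building training pairs for this second task might look like the sketch below. It assumes the 50/50 split described in the original paper (half the pairs use the true next sentence, half use a sentence drawn from another document); `make_nsp_pairs` and the sample documents are hypothetical.

```python
import random

def make_nsp_pairs(documents, seed=0):
    """documents: list of documents, each a list of sentences.

    Returns (sentence_a, sentence_b, label) triples, where label is
    "IsNext" or "NotNext". Requires at least two documents.
    """
    rng = random.Random(seed)
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that really follows.
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: a sentence from a different document.
                other_docs = [d for j, d in enumerate(documents) if j != doc_idx]
                random_sentence = rng.choice(rng.choice(other_docs))
                pairs.append((doc[i], random_sentence, "NotNext"))
    return pairs

documents = [
    ["The cat sat on the mat.", "It fell asleep.", "Nobody woke it."],
    ["Rain is expected today.", "Carry an umbrella.", "Roads may flood."],
]
pairs = make_nsp_pairs(documents)
for a, b, label in pairs:
    print(label, "|", a, "->", b)
```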


The GPT-1 vs BERT divide

This is the most important architectural split in modern NLP:

| Property | GPT-1 (Paper 10) | BERT (Paper 11) |
|---|---|---|
| Architecture | Transformer decoder | Transformer encoder |
| Reading direction | Left-to-right (causal) | Bidirectional |
| Pre-training objective | Predict next token | Masked token prediction + NSP |
| Can generate text? | Yes | No |
| Strength | Generation, completion | Understanding, classification |
| Attention mask | Causal (future is blocked) | Full (all tokens see all tokens) |

Neither is strictly better — they are optimised for different purposes. GPT became the foundation for generative AI. BERT became the foundation for search, question answering, and document understanding.
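The attention-mask difference is easy to see with a toy example. The sketch below uses uniform attention scores so that the mask is the only thing separating the two models; `attention_weights` is an illustrative helper, not a library function.

```python
import math

def attention_weights(scores, mask):
    """Row-wise softmax; positions where mask is False get zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, ok in zip(row_scores, row_mask) if ok]
        m = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - m) if ok else 0.0
                for s, ok in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 4
# GPT-style causal mask: token i may attend only to positions j <= i.
causal_mask = [[j <= i for j in range(n)] for i in range(n)]
# BERT-style full mask: every token may attend to every position.
full_mask = [[True] * n for _ in range(n)]

scores = [[0.0] * n for _ in range(n)]
causal = attention_weights(scores, causal_mask)
full = attention_weights(scores, full_mask)

print("causal, first token:", causal[0])  # attends only to itself
print("full,   first token:", full[0])    # attends to all four tokens
```

Under the causal mask the first token's representation can draw only on itself, while under the full mask it draws on the whole sequence, which is the "half the context" limitation in concrete form.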


Read in this order

| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 1. Context | NLP in late 2018: the unidirectional limitation of GPT-1 | 🟢 | 4 min |
| 2. The Problem | Why left-to-right context is insufficient for understanding | 🟢 | 3 min |
| 3. The Idea | Bidirectional encoders, MLM, NSP, and the [CLS]/[SEP] tokens | 🟡 | 6 min |
| 4. The Math | MLM loss, NSP loss, WordPiece tokenisation | 🔴 | 10 min |
| 5. Worked Example | Step-by-step forward pass through BERT-base on a real sentence | 🔴 | 8 min |
| 6. The Code | MLM with HuggingFace; classification with [CLS] token | 🟡 | 7 min |
| 7. Limitations | Cannot generate, NSP is weak, MLM mismatch, quadratic attention | 🟡 | 4 min |
| 8. Impact | RoBERTa, ALBERT, DistilBERT, and BERT’s legacy in search and NLP | 🟢 | 4 min |
| 9. Summary | One-page recap | 🟢 | 2 min |

Also: Glossary · Quiz · Further Reading


Before you read: math tutorials you need


BERT architecture at a glance

Input: [CLS] The cat sat on the [MASK] . [SEP]

                     ↓

    WordPiece Token Embeddings
  + Positional Embeddings
  + Segment Embeddings (sentence A or B)

                     ↓

  ┌──────────────────────────────────────┐
  │  Transformer Encoder Block × 12      │  ← BERT-base
  │                                      │  (× 24 for BERT-large)
  │  Multi-Head Self-Attention (full)    │  ← all tokens see all tokens
  │  Feed-Forward Network                │
  │  Layer Norm + Residual               │
  └──────────────────────────────────────┘

                     ↓

  ┌──────────────────────────────────────┐
  │  [CLS] hidden state → classifier     │  ← sentence-level tasks (NSP, sentiment)
  │  [MASK] hidden state → vocabulary    │  ← MLM: predict the masked token
  │  each token hidden state → label     │  ← token-level tasks (NER, QA)
  └──────────────────────────────────────┘
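The input stage at the top of the diagram can be sketched as follows: build the token sequence with `[CLS]` and `[SEP]`, assign segment and position ids, then sum three embedding lookups per token. The helper names, the hidden size of 4, and the randomly initialised tables are assumptions for illustration; a trained model learns these tables and BERT-base uses a much larger hidden size.

```python
import random

def build_inputs(tokens_a, tokens_b=None):
    """Wrap one or two tokenised sentences in [CLS]/[SEP] and assign ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)              # sentence A
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

def embed(tokens, segment_ids, position_ids, tok_table, seg_table, pos_table):
    """Element-wise sum of token, segment, and position embeddings."""
    return [
        [t + s + p for t, s, p in
         zip(tok_table[tok], seg_table[seg], pos_table[pos])]
        for tok, seg, pos in zip(tokens, segment_ids, position_ids)
    ]

tokens, segment_ids, position_ids = build_inputs(["the", "cat"], ["it", "sat"])

HIDDEN = 4  # toy size chosen for readability
rng = random.Random(0)
def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(HIDDEN)]

tok_table = {tok: rand_vec() for tok in set(tokens)}
seg_table = [rand_vec(), rand_vec()]
pos_table = [rand_vec() for _ in tokens]

vectors = embed(tokens, segment_ids, position_ids,
                tok_table, seg_table, pos_table)
print(tokens)
print(segment_ids)
print(len(vectors), "vectors of size", len(vectors[0]))
```

The resulting list of vectors, one per token, is what the stack of encoder blocks receives as input.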

