9. Summary — BERT in One Page

The one-sentence version

BERT is a Transformer encoder pre-trained to predict randomly masked words from their full bidirectional context, and that bidirectionality is what lets it outperform GPT-1 on language understanding tasks.

The problem it solved

GPT-1 proved that pre-training works. But GPT-1 only read text left-to-right, which is fine for generation but discards half the available context for understanding. “The bank by the river” and “The bank charged fees” need rightward context to disambiguate the word “bank.” A left-to-right model cannot use it. BERT can.

The key ideas

Masked Language Modelling (MLM): Randomly select 15% of tokens and ask the model to predict them from all surrounding tokens, both left and right. Most selected tokens are replaced with [MASK] (in the original paper, 80% become [MASK], 10% become a random token and 10% are left unchanged), so the model cannot simply copy the answer. This forces bidirectional context without making the task trivially easy.
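
The masking step can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name mask_tokens, the toy vocabulary and the fixed seed are assumptions made for the example, and the 80/10/10 replacement split is taken from the original BERT paper rather than from this summary.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels maps each selected position to the original token the model
    must predict. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    n_select = max(1, round(len(candidates) * mask_prob))
    selected = sorted(rng.sample(candidates, n_select))

    corrupted = list(tokens)
    labels = {}
    for i in selected:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: hide behind [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random replacement
        # remaining 10%: leave the token unchanged
    return corrupted, labels


sentence = ["[CLS]"] + "the bank by the river was flooded after the storm".split() + ["[SEP]"]
corrupted, labels = mask_tokens(sentence, vocab=["fees", "money", "water"])
print(corrupted)
print(labels)
```

Positions outside labels contribute nothing to the loss; the model is only scored on the selected 15%.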

Next Sentence Prediction (NSP): Classify whether sentence B genuinely follows sentence A or is a random sentence from the corpus. Teaches sentence-level coherence. (Later work, notably RoBERTa, found that NSP can be dropped without hurting downstream results.)

[CLS] token: Prepended to every input. Its final hidden state is a vector summary of the entire sequence, used for classification tasks.

[SEP] token: Marks the boundary between sentence A and sentence B in two-sentence inputs.
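
A small sketch of how the two special tokens frame an input. The helper name pack_pair is hypothetical; the layout it produces ([CLS] A [SEP] B [SEP], with a segment id of 0 for sentence A and 1 for sentence B) follows the original BERT paper, which also closes the input with a final [SEP].

```python
def pack_pair(tokens_a, tokens_b=None):
    """Frame one or two tokenised sentences the way BERT expects.

    Returns (tokens, segment_ids). Illustrative helper, not a library API.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)           # [CLS], sentence A, first [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B, final [SEP]
    return tokens, segment_ids


tokens, segments = pack_pair(["the", "bank", "charged", "fees"],
                             ["i", "closed", "my", "account"])
for token, segment in zip(tokens, segments):
    print(segment, token)
```

For classification, the final hidden state at position 0 (the [CLS] slot) is the vector passed to the task-specific layer.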

WordPiece tokenisation: Splits rare words into subword pieces. 30,522-token vocabulary for BERT-base. Handles words not seen during training.
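
The subword split can be illustrated with greedy longest-match-first lookup, which is how WordPiece segments a word at inference time. The vocabulary below is a toy assumption (the real one has 30,522 entries), and the function name wordpiece is chosen for the example; the "##" prefix marking a word-internal piece is the convention used in BERT's vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest match.

    Sketch only: real tokenisers also normalise text and cap word length.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"play", "un", "##play", "##ing", "##able", "##ed"}
for w in ["playing", "unplayable", "xyz"]:
    print(w, wordpiece(w, toy_vocab))
```

A word the model never saw whole, such as "unplayable" here, is still represented through pieces it did see.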

The architecture numbers

	BERT-base	BERT-large
Layers	12	24
Hidden size	768	1024
Attention heads	12	16
Parameters	110M	340M
Training data	3.3B words (Wikipedia + BooksCorpus)	same
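
As a rough check, the parameter counts in the table can be reproduced from the layer count and hidden size. The sketch below assumes details not stated in this summary but taken from the released BERT configuration: a feed-forward width of 4 × hidden size, 512 position embeddings, 2 segment embeddings, and a pooling layer on top of [CLS]. Pre-training heads are not counted. The function name count_parameters is hypothetical.

```python
def count_parameters(layers, hidden, vocab_size=30522,
                     max_positions=512, segment_types=2):
    """Approximate parameter count of a BERT encoder (weights + biases)."""
    ffn = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden   # scale and shift

    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm

    pooler = hidden * hidden + hidden           # dense layer applied to [CLS]
    return embeddings + layers * per_layer + pooler


for name, n_layers, n_hidden in [("BERT-base", 12, 768), ("BERT-large", 24, 1024)]:
    total = count_parameters(n_layers, n_hidden)
    print(f"{name}: {total:,} parameters ({total / 1e6:.1f}M)")
```

Under these assumptions the totals come to roughly 109.5M and 335.1M, consistent with the rounded 110M and 340M figures. The number of attention heads does not appear in the formula because the heads split the hidden size rather than adding to it.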

The GPT-1 vs BERT contrast

	GPT-1	BERT
Architecture	Decoder	Encoder
Direction	Left-to-right	Bidirectional
Pre-training objective	Predict next token	Predict masked tokens + NSP
Can generate?	Yes	No
Strength	Generation	Understanding
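
The Direction row comes down to which positions each token is allowed to attend to. The following sketch builds both attention masks for a short sentence; the function names are illustrative, and a 1 at row i, column j means token i may look at token j.

```python
def causal_mask(n):
    """GPT-style mask: token i sees positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style mask: every token sees every position."""
    return [[1] * n for _ in range(n)]


def visible(mask, i):
    """Indices that token i may attend to under the given mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


tokens = ["the", "bank", "charged", "fees"]
n = len(tokens)
bank = tokens.index("bank")

print("left-to-right:", [tokens[j] for j in visible(causal_mask(n), bank)])
print("bidirectional:", [tokens[j] for j in visible(bidirectional_mask(n), bank)])
```

Under the causal mask, the representation of "bank" is built without ever seeing "charged fees", which is exactly the context needed to pick the financial sense.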

The Indian analogy

A student studying with words randomly blacked out in the textbook, forced to guess each hidden word from both what came before and what came after. This forces bidirectional reading and deep understanding — not skimming. BERT’s pre-training is this process, applied to billions of sentences.

The results

GLUE (a suite of 9 NLP tasks): 80.5, a 7.7-point absolute gain over the previous best (~72.8)
SQuAD 1.1: 93.2 F1 — exceeding the published human score
SQuAD 2.0: 83.1 F1 — new state-of-the-art
New state-of-the-art on 11 NLP tasks, each fine-tuned from the same pre-trained model

What came next

RoBERTa (more data, no NSP) → ALBERT (far fewer parameters through weight sharing, comparable performance) → DistilBERT (40% smaller, 60% faster) → domain-specific BERTs (BioBERT, LEGAL-BERT) → T5 (encoder-decoder combining BERT and GPT ideas). BERT’s bidirectional pre-training approach remains the basis of many language understanding systems in production.
