9. Summary — BERT in One Page
The one-sentence version
BERT is a Transformer encoder pre-trained to predict randomly masked words from their full bidirectional context — and that bidirectionality is what makes it dramatically better than GPT-1 at language understanding tasks.
The problem it solved
GPT-1 proved that pre-training works. But GPT-1 only read text left-to-right, which is fine for generation but discards half the available context for understanding. “The bank by the river” and “The bank charged fees” need rightward context to disambiguate the word “bank.” A left-to-right model cannot use it. BERT can.
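
The difference in visible context can be sketched with two attention masks. This is a minimal illustration, not code from either model: the helper names `attention_mask` and `visible_tokens` are hypothetical, and the example works on whole words rather than real subword tokens.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


def visible_tokens(tokens, position, mask):
    """List the tokens that the token at `position` is allowed to see."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


tokens = ["the", "bank", "by", "the", "river"]
bank = tokens.index("bank")

causal = attention_mask(len(tokens), causal=True)          # GPT-1 style
bidirectional = attention_mask(len(tokens), causal=False)  # BERT style

print(visible_tokens(tokens, bank, causal))         # ['the', 'bank']
print(visible_tokens(tokens, bank, bidirectional))  # ['the', 'bank', 'by', 'the', 'river']
```

Under the causal mask, "river" never reaches the representation of "bank"; under the bidirectional mask it does.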
The key ideas
Masked Language Modelling (MLM): Randomly select 15% of tokens as prediction targets. Of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The model must predict the original tokens from all surrounding tokens, both left and right. This forces bidirectional context without making the task trivially easy (for most targets the model cannot simply copy the answer, because the input no longer contains it).
Next Sentence Prediction (NSP): Classify whether sentence B genuinely follows sentence A, or is randomly sampled. Teaches sentence-level coherence. (Later shown to be less important than MLM.)
[CLS] token: Prepended to every input. Its final hidden state is a vector summary of the entire sequence, used for classification tasks.
[SEP] token: Marks the boundary between sentence A and sentence B in two-sentence inputs.
WordPiece tokenisation: Splits rare words into subword pieces. 30,522-token vocabulary for BERT-base. Handles words not seen during training.
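
The ideas above can be tied together in a small sketch that tokenises two sentences, packs them with [CLS] and [SEP], and applies MLM masking. Several things here are assumptions for illustration: the toy vocabulary, the function names, and the per-token coin flip used to pick targets (a simplification of how a real pipeline selects them). The 80/10/10 replacement split follows the BERT paper.

```python
import random

# Toy vocabulary (assumption): far smaller than BERT's 30,522 entries.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "the", "bank", "charged", "fees", "river", "by",
         "over", "##draft", "##s", "it", "was", "high"]
SPECIAL = {"[CLS]", "[SEP]"}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def pack_pair(sentence_a, sentence_b, vocab):
    """Build [CLS] A [SEP] B [SEP] plus segment ids (0 for A, 1 for B)."""
    tokens_a = [p for w in sentence_a.lower().split() for p in wordpiece(w, vocab)]
    tokens_b = [p for w in sentence_b.lower().split() for p in wordpiece(w, vocab)]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Pick MLM targets and corrupt them 80% [MASK] / 10% random / 10% unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)
    ordinary = [t for t in vocab if not t.startswith("[")]
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # not a prediction target
        labels[i] = tok  # the model is trained to recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"
        elif roll < 0.9:
            masked[i] = rng.choice(ordinary)
        # otherwise the token stays as it is
    return masked, labels


rng = random.Random(0)
tokens, segments = pack_pair("The bank charged overdrafts", "It was high", VOCAB)
masked, labels = mask_tokens(tokens, VOCAB, rng)
print(tokens)
print(segments)
print(masked)
print(labels)
```

Note how "overdrafts", which is not in the toy vocabulary as a whole word, is still representable as `over`, `##draft`, `##s`.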
The architecture numbers
| | BERT-base | BERT-large |
|---|---|---|
| Layers | 12 | 24 |
| Hidden size | 768 | 1024 |
| Attention heads | 12 | 16 |
| Parameters | 110M | 340M |
| Training data | 3.3B words (Wikipedia + BooksCorpus) | same |
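
As a rough check on the parameter counts in the table, the totals can be rebuilt from the layer count and hidden size. This sketch assumes details not stated in this summary: a feed-forward width of 4× the hidden size, 512 position embeddings, 2 segment embeddings, a pooler layer on top of [CLS], and the same 30,522-token vocabulary for both sizes. The function name `bert_param_count` is hypothetical, and the pre-training output heads are left out.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512,
                     segment_types=2, ffn_multiplier=4):
    """Approximate encoder parameter count under the assumptions above."""
    ffn = hidden * ffn_multiplier
    layer_norm = 2 * hidden  # scale and bias

    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

    # Query, key, value and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two linear layers (hidden -> ffn -> hidden), then LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    # Dense layer applied to the final [CLS] hidden state.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M")   # BERT-base:  109.5M
print(f"BERT-large: {large / 1e6:.1f}M")  # BERT-large: 335.1M
```

Both results round to the 110M and 340M figures quoted in the table. Attention head count does not appear in the formula because the heads split the hidden dimension rather than adding to it.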
The GPT-1 vs BERT contrast
| | GPT-1 | BERT |
|---|---|---|
| Architecture | Decoder | Encoder |
| Direction | Left-to-right | Bidirectional |
| Pre-training objective | Predict next token | Predict masked tokens + NSP |
| Can generate? | Yes | No |
| Strength | Generation | Understanding |
The Indian analogy
Picture a student studying from a textbook with words randomly blacked out, forced to guess each hidden word from both what came before and what came after. This demands bidirectional reading and deep understanding rather than skimming. BERT’s pre-training is this process, applied to billions of words.
The results
- GLUE: 80.5 (previous best: 72.8) across a suite of 9 NLP tasks
- SQuAD 1.1: 93.2 F1 — exceeding the published human score
- SQuAD 2.0: 83.1 F1 — new state-of-the-art
- 11 benchmarks improved starting from one pre-trained checkpoint, fine-tuned separately for each task
What came next
RoBERTa (more data, no NSP) → ALBERT (fewer parameters, comparable performance) → DistilBERT (40% smaller, 60% faster) → domain-specific BERTs (BioBERT, LEGAL-BERT) → T5 (encoder-decoder combining BERT and GPT ideas). BERT’s bidirectional pre-training approach remains widely used in production language understanding systems.