Further Reading — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Further Reading — Paper 11: BERT
The original paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Devlin, Chang, Lee, Toutanova — Google AI Language, 2018 arXiv:1810.04805
The original paper is readable even for beginners — the core ideas are explained clearly, the ablation studies show exactly what each component contributes, and the fine-tuning appendix covers every task variant in detail.
Essential follow-ups
RoBERTa: A Robustly Optimized BERT Pretraining Approach Liu et al. — Facebook AI, 2019 arXiv:1907.11692
A replication study that retrains BERT with more data, no next sentence prediction (NSP) objective, larger batch sizes, and longer training. The key finding: the original BERT was significantly undertrained. Every change RoBERTa made was to the training recipe (data, objective, hyperparameters), not to the architecture. A must-read alongside the original BERT paper.
DistilBERT, a distilled version of BERT Sanh, Debut, Chaumond, Wolf — HuggingFace, 2019 arXiv:1910.01108
Shows how to compress BERT via knowledge distillation. The distilled model has 40% fewer parameters than BERT-base, runs 60% faster, and retains 97% of its language-understanding performance. Important for understanding how large pre-trained models can be made practical.
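The core idea behind knowledge distillation can be sketched in a few lines. The example below is a minimal, plain-Python illustration of the soft-target loss only: the student is trained to match the teacher's temperature-softened output distribution. It is not DistilBERT's full training objective (the paper combines this term with a masked language modelling loss and a cosine embedding loss), and the function names, logits, and temperature value are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# Hypothetical logits over a 3-token vocabulary (illustrative values only).
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 1.0, 4.0]

print(soft_target_loss(teacher, good_student))
print(soft_target_loss(teacher, poor_student))
```

A student whose logits track the teacher's gets a lower loss than one that ranks the tokens differently, which is the signal the smaller model learns from.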
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations Lan et al. — Google Research, 2019 arXiv:1909.11942
Reduces BERT's parameter count through factorised embedding parameterisation and cross-layer weight sharing. Shows that raw parameter count is not the only axis of model quality.
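The saving from factorised embeddings is simple arithmetic. Instead of one V × H embedding matrix, the factorised form uses a V × E lookup table followed by an E × H projection, with E much smaller than H. The sketch below uses illustrative sizes (a 30,000-token vocabulary, hidden size 768, embedding size 128); treat the exact numbers and the helper name as assumptions rather than figures quoted from the paper.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token embedding, with or without factorisation."""
    if embedding_size is None:
        # Standard BERT-style embedding: one V x H matrix.
        return vocab_size * hidden_size
    # Factorised: a V x E lookup table plus an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30_000, 768, 128  # illustrative sizes (assumptions)

full = embedding_params(V, H)
factorised = embedding_params(V, H, E)
print(full, factorised)
```

With these sizes the token embedding shrinks from about 23.0M parameters to about 3.9M. Cross-layer weight sharing is a separate saving and is not modelled here.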
Deeper understanding
The Illustrated BERT, ELMo, and co. Jay Alammar — blog post jalammar.github.io/illustrated-bert
The best visual explanation of BERT available. Highly recommended before or after reading the paper. Jay Alammar’s diagrams make the input representation, attention, and fine-tuning process intuitive.
ELMo: Deep Contextualized Word Representations Peters et al. — Allen AI, 2018 arXiv:1802.05365
The direct predecessor to BERT — bidirectional LSTMs producing contextualised word representations. Reading ELMo makes clear exactly what problem BERT solved and why the Transformer encoder was a better solution than bidirectional LSTMs.
Benchmarks used in the paper
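GLUE Benchmark gluebenchmark.com
A nine-task natural language understanding benchmark on which BERT set the state of the art at publication (the paper reports eight of the tasks, excluding WNLI). Includes sentiment (SST-2), inference (MNLI, RTE), question-answer entailment (QNLI), similarity (STS-B), and more.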
SQuAD 1.1 and 2.0 rajpurkar.github.io/SQuAD-explorer
Stanford Question Answering Dataset. SQuAD 2.0 includes unanswerable questions, making it significantly harder than 1.1.
Code and models
HuggingFace BERT models huggingface.co/bert-base-uncased
The standard way to use BERT today. Pre-trained checkpoints for BERT-base and BERT-large in multiple languages. The transformers library makes fine-tuning straightforward.
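As a rough sketch of what this looks like in practice, the snippet below loads the `bert-base-uncased` checkpoint with a fresh two-class classification head, using the `transformers` Auto classes with PyTorch. The function names and `num_labels=2` are illustrative assumptions. The classification head is randomly initialised, so the logits carry no meaning until the model has been fine-tuned on labelled data.

```python
def load_bert_classifier(checkpoint="bert-base-uncased", num_labels=2):
    """Load a tokenizer and a BERT model with an untrained classification head."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model

def score(text, tokenizer, model):
    """Return raw class logits for one sentence (meaningless before fine-tuning)."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        return model(**inputs).logits  # shape: (1, num_labels)
```

Calling `load_bert_classifier()` downloads the checkpoint on first use and requires `transformers` and `torch` to be installed. Passing the returned tokenizer and model to `score` with any sentence then yields a tensor of shape `(1, 2)`.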
Google’s original BERT repository github.com/google-research/bert
The original TensorFlow implementation from Google Research, including pre-trained checkpoints.
What to read next (in this series)
← Paper 10 — GPT-1 — the predecessor that proved pre-training works
→ Paper 12 — GPT-3 — what happens when you scale GPT to 175 billion parameters