Further Reading — Paper 11: BERT

The original paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Devlin, Chang, Lee, Toutanova — Google AI Language, 2018 arXiv:1810.04805

The original paper is readable even for beginners — the core ideas are explained clearly, the ablation studies show exactly what each component contributes, and the fine-tuning appendix covers every task variant in detail.

Essential follow-ups

RoBERTa: A Robustly Optimized BERT Pretraining Approach Liu et al. — Facebook AI, 2019 arXiv:1907.11692

Replicates BERT pre-training with more data, larger batch sizes, longer training, dynamic masking, and no next-sentence prediction (NSP) objective. The key finding: the original BERT was significantly undertrained. Everything RoBERTa changed was a training-recipe or data choice, not an architectural one. A must-read alongside the original BERT paper.

DistilBERT, a distilled version of BERT Sanh, Debut, Chaumond, Wolf — HuggingFace, 2019 arXiv:1910.01108

Shows how to compress BERT via knowledge distillation. The distilled model has 40% fewer parameters, runs 60% faster, and retains 97% of BERT-base performance. Important for understanding how large pre-trained models can be made practical.
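To make the idea of knowledge distillation concrete, here is a minimal sketch of a generic soft-target loss: the student is trained to match the teacher's temperature-softened output distribution. This is an illustration only, not DistilBERT's full training objective, which combines several loss terms. The function names, the example logits, and the temperature of 2.0 are assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits over three classes (illustrative values only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different class

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The student that agrees with the teacher gets the lower loss, which is the signal that pushes a small model toward the large model's behaviour.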

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations Lan et al. — Google Research, 2019 arXiv:1909.11942

Reduces parameter count through factorised embedding parameterisation and cross-layer weight sharing. Shows that raw parameter count is not the only axis of model quality.
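The saving from factorised embeddings is easy to check with arithmetic. Instead of one vocabulary-by-hidden matrix (V x H), the model stores a small vocabulary-by-embedding matrix (V x E) plus a projection up to the hidden size (E x H). The sketch below uses round illustrative sizes (V = 30,000, H = 768, E = 128); the function names are hypothetical, and the numbers cover the embedding table only, not the whole model.

```python
def full_embedding_params(vocab_size, hidden_size):
    """One V x H embedding matrix, as in the original BERT layout."""
    return vocab_size * hidden_size

def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
    """A V x E embedding matrix followed by an E x H projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size

# Illustrative sizes (assumed for this example).
V, H, E = 30_000, 768, 128

full = full_embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)

print(full)        # 23040000
print(factorized)  # 3938304
```

With these sizes the embedding parameters shrink by a factor of a little under six. Cross-layer weight sharing is a separate saving: the encoder layers reuse one set of weights instead of storing one set per layer.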

Deeper understanding

The Illustrated BERT, ELMo, and co. Jay Alammar — blog post jalammar.github.io/illustrated-bert

The best visual explanation of BERT available. Highly recommended before or after reading the paper. Jay Alammar’s diagrams make the input representation, attention, and fine-tuning process intuitive.

ELMo: Deep Contextualized Word Representations Peters et al. — Allen AI, 2018 arXiv:1802.05365

The direct predecessor to BERT — bidirectional LSTMs producing contextualised word representations. Reading ELMo makes clear exactly what problem BERT solved and why the Transformer encoder was a better solution than bidirectional LSTMs.

Benchmarks used in the paper

GLUE Benchmark gluebenchmark.com The 9-task benchmark on which BERT set a new state of the art at publication (the paper reports eight of the tasks, excluding WNLI). Includes sentiment (SST-2), inference (MNLI, RTE), question-answering inference (QNLI), similarity (STS-B), and more.

SQuAD 1.1 and 2.0 rajpurkar.github.io/SQuAD-explorer Stanford Question Answering Dataset. SQuAD 2.0 includes unanswerable questions, making it significantly harder than 1.1.

Code and models

HuggingFace BERT models huggingface.co/bert-base-uncased The standard way to use BERT today. Pre-trained checkpoints for BERT-base and BERT-large in multiple languages. The transformers library makes fine-tuning straightforward.
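As a rough sketch of what using the checkpoint looks like, the helper below loads bert-base-uncased with the transformers library and returns the contextual vector for each token. It assumes the torch and transformers packages are installed and that the checkpoint can be downloaded on first use. The helper name embed_sentence is hypothetical; the library calls (AutoTokenizer, AutoModel, from_pretrained) are part of transformers.

```python
def embed_sentence(text, model_name="bert-base-uncased"):
    """Return BERT's final-layer hidden states for one sentence.

    Imports live inside the function so that defining it does not
    require the packages or a network connection.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # The tokenizer adds the [CLS] and [SEP] tokens and returns PyTorch tensors.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Shape: (batch, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

Calling embed_sentence("BERT reads the whole sentence at once.") should return a tensor with one 768-dimensional vector per token for BERT-base. For fine-tuning on a classification task, the same library offers task-specific heads such as AutoModelForSequenceClassification.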

Google’s original BERT repository github.com/google-research/bert The original TensorFlow implementation from Google Research, including pre-trained checkpoints.

What to read next (in this series)

← Paper 10 — GPT-1 — the predecessor that proved pre-training works

→ Paper 12 — GPT-3 — what happens when you scale GPT to 175 billion parameters