DistilBERT

Appears in 1 paper

A compressed version of BERT created by knowledge distillation (training a small model to mimic the outputs of a larger one).

As used in Paper 11 (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding):

DistilBERT is 40% smaller and 60% faster than BERT-base while retaining 97% of its language-understanding performance.
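
To make "training a small model to mimic the outputs of a larger one" concrete, here is a minimal sketch of a generic soft-target distillation loss in plain Python. It is an illustration under stated assumptions, not DistilBERT's actual training code: the function names, the example logits, and the temperature of 2.0 are all hypothetical, and DistilBERT's full objective combines a term of this kind with other losses.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    A higher temperature flattens both distributions, so the student also
    learns from the relative probabilities the teacher gives to wrong classes.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )


# Hypothetical logits over three classes for a single input.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]  # close to the teacher's outputs
poor_student = [0.5, 4.0, 1.0]  # disagrees with the teacher

print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it over many inputs pushes the smaller model toward the larger model's behavior.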