WordPiece

Appears in 1 paper

BERT's subword tokenisation algorithm.

As used in Paper 11, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

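WordPiece splits rare or out-of-vocabulary words into smaller pieces that are in the vocabulary. Illustrative example: "unbelievable" → ["un", "##believable"]; the actual split depends on which pieces the vocabulary contains. The ## prefix marks a piece that continues the preceding one rather than starting a new word. Vocabulary size: 30,522 for the uncased English BERT models (28,996 for the cased ones).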
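
The splitting step can be sketched in Python as a greedy longest-match-first lookup, which is how BERT's reference tokenizer applies a trained vocabulary to a single word. This is a minimal sketch, not the reference implementation: the function name wordpiece_tokenize and the toy vocabulary are hypothetical, and the real tokenizer also normalises text and splits on whitespace and punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only (an assumption, not BERT's real vocabulary).
toy_vocab = {"un", "##believable", "##able", "play", "##ing"}

print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believable']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because the match is greedy, the longest piece available at each position wins: "##believable" is chosen over the shorter "##able" even though both are in the toy vocabulary.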