1. Context — NLP in Late 2018 and the Unidirectional Limitation
By October 2018, the NLP world was riding a genuine wave of optimism.
GPT-1, published just four months earlier, had shown that pre-training a Transformer on unlabelled text (in its case BooksCorpus, a collection of unpublished books) and then fine-tuning it on small labelled datasets could beat purpose-built models across a wide range of tasks. No longer did every task need its own bespoke architecture trained from scratch on its own expensive labelled corpus. The pre-train-then-fine-tune paradigm was real, it worked, and the research community was moving fast.
But GPT-1 had a structural constraint baked into its design, and it was not a small one.
GPT-1 only read text from left to right.
This was not a bug. It was a deliberate choice driven by the pre-training objective. GPT-1 learned language by predicting the next token from all preceding tokens, which is what defines a causal or autoregressive language model. The word “causal” is precise: to predict token number 7, the model is only allowed to look at tokens 1 through 6. Token 7 itself and everything after it are masked out. This causal masking is what makes GPT-1 capable of generating coherent text: at inference time, you feed it a prompt and it produces tokens one by one, each conditioned on all previous ones.
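The masking rule can be made concrete with a small sketch. The helper names `causal_mask` and `visible_positions` are illustrative assumptions, not code from GPT-1, and the sketch uses plain Python lists with 0-based indices, so the position holding token 6 is index 5.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len table where mask[i][j] is True
    when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def visible_positions(mask, i):
    """List the positions that position i can see under the mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


mask = causal_mask(8)

# Index 5 holds token 6; its output is what predicts token 7.
# It can see indices 0..5 (tokens 1 through 6) and nothing later.
print(visible_positions(mask, 5))  # [0, 1, 2, 3, 4, 5]
```

In a real Transformer the same rule is typically enforced by setting the disallowed attention scores to a very large negative value before the softmax, so those positions receive effectively zero weight.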
But this same causal constraint creates a problem for understanding tasks.
Consider the sentence: “The bank was closed because the river flooded.”
Now consider: “The bank was closed because it went bankrupt.”
The word “bank” appears in both sentences, and it has entirely different meanings: a riverbank in the first, a financial institution in the second. To determine the correct meaning of “bank,” you need the words that come after it: “river” and “flooded” in the first sentence, “bankrupt” in the second. A purely left-to-right model building its representation of “bank” has seen only “The bank”, which is identical in both sentences and not enough to disambiguate. The clarifying words arrive later, and because of the causal mask they can never flow back into the representation already computed at the position of “bank.”
This is the unidirectional limitation: building a representation of any word using only the words to its left discards everything to its right, which is often exactly where the disambiguating evidence sits.
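The “bank” example can be checked mechanically. The following is a minimal sketch, assuming whitespace tokenization and two hypothetical helpers, `left_context` and `full_context`, that return the tokens a model could draw on when representing one position. It models only which tokens are visible, not how a network would use them.

```python
def left_context(tokens, index):
    """Tokens visible to a left-to-right model before position `index`."""
    return tokens[:index]


def full_context(tokens, index):
    """Tokens visible to a bidirectional model: everything except the word itself."""
    return tokens[:index] + tokens[index + 1:]


river = "The bank was closed because the river flooded".split()
money = "The bank was closed because it went bankrupt".split()
i = river.index("bank")  # "bank" sits at the same index in both sentences

# Left-to-right: both sentences offer exactly the same evidence for "bank".
same_left = left_context(river, i) == left_context(money, i)
# Bidirectional: the right-hand words make the two contexts differ.
same_full = full_context(river, i) == full_context(money, i)

print(left_context(river, i))  # ['The']
print(same_left)               # True
print(same_full)               # False
```

Because the left contexts are identical, any model restricted to them has to assign “bank” the same representation in both sentences at that position.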
For generation tasks, this is fine — in generation, the future tokens do not exist yet when you are predicting the current one. But for understanding tasks — reading comprehension, question answering, named entity recognition, textual entailment — the full sentence is already present. There is no excuse for a model to ignore the right half of the context when it has access to all of it.
There was a prior attempt to build bidirectional language models. ELMo (Embeddings from Language Models, Peters et al., 2018) trained a left-to-right and a right-to-left LSTM language model and concatenated their representations. This was better than purely unidirectional models, but it was a shallow form of bidirectionality: the two directions were never integrated inside the network. The left-to-right LSTM never saw rightward context, and the right-to-left LSTM never saw leftward context. Concatenating their outputs at the end is not the same as training a single model in which every layer sees both directions simultaneously.
What was needed was a model that could attend to every word in a sentence from every other word — with no directional restriction — during the pre-training phase itself.
The challenge: if you remove the causal mask and let every token see every other token, you can no longer use “predict the next token” as your training objective. If a model can see the word it is supposed to predict, the task is trivially easy — it just copies the answer. You need a different training objective entirely.
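A toy sketch shows why the next-token objective collapses once the mask is removed. The function `copy_next_token` is a hypothetical stand-in for a model that knows nothing about language: it simply returns the target token whenever the mask lets it see that token. The mask helpers are likewise illustrative assumptions.

```python
def full_mask(seq_len):
    """No restriction: every position can see every other position."""
    return [[True] * seq_len for _ in range(seq_len)]


def causal_mask(seq_len):
    """Position i can see only positions 0..i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def copy_next_token(tokens, mask):
    """At each position, 'predict' the next token by copying it
    if the mask exposes it; otherwise return None."""
    guesses = []
    for i in range(len(tokens) - 1):
        target = i + 1
        guesses.append(tokens[target] if mask[i][target] else None)
    return guesses


tokens = "The bank was closed".split()
targets = tokens[1:]

# With full visibility, pure copying gets every target right.
print(copy_next_token(tokens, full_mask(len(tokens))) == targets)  # True
# With the causal mask, copying is impossible at every position.
print(copy_next_token(tokens, causal_mask(len(tokens))))  # [None, None, None]
```

A perfect score obtained by copying teaches the model nothing, which is why dropping the causal mask requires a different training objective.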
This was the problem Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google AI Language set out to solve. Their solution — published in October 2018 — was BERT.