Efficient Estimation of Word Representations in Vector Space (Word2Vec) 2013

9. What came next — the road from Word2Vec to modern embeddings

Word2Vec solved “static word meaning as a vector”. That unlocked a cascade of follow-up ideas, each addressing one of the limitations we just saw. Here is the lineage from 2013 to the modern era.

Step 1 — extend Word2Vec to subwords (fastText, 2016)

Facebook AI’s Bojanowski et al. proposed fastText, which is Word2Vec with a twist: each word is represented as the sum of the vectors of its character n-grams. So “chalta” is broken into pieces like <ch, cha, hal, alt, lta, ta> (the angle brackets mark the start and end of the word), each piece has its own vector, and the word vector is the sum.

This single change went a long way towards fixing Word2Vec’s morphology problem. Rare and inflected words got better vectors, and even words never seen in training could get a reasonable embedding from their character n-grams alone. fastText remains a popular choice for morphologically rich languages, including many Indian languages.
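To make the subword idea concrete, here is a minimal sketch in plain Python. It is not fastText’s actual implementation: the function names and the tiny 2-dimensional vectors are made up for illustration, and it uses only trigrams to match the “chalta” example, whereas the published model uses a range of n-gram lengths and also keeps the whole word as its own unit.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return character n-grams of `word` wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known n-grams; unknown n-grams contribute nothing."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


# Hypothetical 2-dimensional n-gram vectors, made up for illustration.
toy_ngram_vectors = {
    "<ch": [1.0, 0.0],
    "cha": [0.5, 0.5],
    "hal": [0.0, 1.0],
    "lta": [0.2, 0.2],
}

print(char_ngrams("chalta"))  # ['<ch', 'cha', 'hal', 'alt', 'lta', 'ta>']
print(word_vector("chalta", toy_ngram_vectors, dim=2))
# "chalti" has no entry of its own, but it shares <ch, cha, hal with "chalta".
print(word_vector("chalti", toy_ngram_vectors, dim=2))  # [1.5, 1.5]
```

The last line is the point: a word that was never stored anywhere still gets a non-zero vector, built entirely from pieces it shares with words that were seen.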

(fastText is not a paper in our series, but you will run into it constantly in applied NLP.)

Step 2 — let words have different vectors in different contexts (ELMo, 2018)

Peters et al. at the Allen Institute took a bigger swing with ELMo. They trained a bidirectional LSTM language model on roughly a billion words of text and then, instead of discarding the network and keeping only a lookup table as Word2Vec does, they kept the network itself. To get a word’s vector, you ran the sentence through ELMo and combined the hidden states of its LSTM layers at that position.

Now “bank” in “river bank” and “bank” in “HDFC bank” got different vectors. Contextual embeddings had arrived in mainstream NLP.
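The difference between a static lookup and a context-dependent vector can be shown with a deliberately tiny toy. To be clear about the assumptions: this is not ELMo and there is no LSTM here. The vectors are invented, and the “contextual” function simply mixes each word’s static vector with the average of its sentence, which is just enough to show the behaviour.

```python
# Hypothetical 2-dimensional static vectors, made up for illustration.
static = {
    "river": [1.0, 0.0],
    "hdfc": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def static_embed(tokens):
    """Word2Vec-style lookup: each token gets the same vector everywhere."""
    return [static[t] for t in tokens]


def contextual_embed(tokens):
    """Toy stand-in for a contextual encoder (NOT an LSTM): average each
    token's static vector with the mean vector of the whole sentence."""
    vecs = static_embed(tokens)
    dim = len(vecs[0])
    mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return [[(v[d] + mean[d]) / 2 for d in range(dim)] for v in vecs]


s1 = ["river", "bank"]
s2 = ["hdfc", "bank"]

# Compare the vector for "bank" (position 1) across the two sentences.
static_same = static_embed(s1)[1] == static_embed(s2)[1]
contextual_same = contextual_embed(s1)[1] == contextual_embed(s2)[1]
print(static_same)      # True: the lookup cannot tell the two "bank"s apart
print(contextual_same)  # False: the context-dependent version can
```

A real contextual model replaces the averaging with a trained network, but the interface is the same: the input is a whole sentence, and the output is one vector per position.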

ELMo’s architecture (LSTM-based, Paper 04) is already somewhat outdated, but the conceptual leap — from static to contextual embeddings — was the real contribution.

Step 3 — sequence-to-sequence learning (Paper 06, 2014)

Alongside this embedding work (and in fact a little earlier than fastText and ELMo), another line of research was asking: can we use LSTMs (Paper 04) to read one sentence and write a different one? Sutskever, Vinyals, and Le (2014) showed that we can, using encoder-decoder LSTMs for machine translation. Word embeddings were the natural input layer, whether learned with the model or initialised from vectors like Word2Vec’s.

Read next: Paper 06 — Seq2Seq

Step 4 — attention for long sentences (Paper 07, 2014)

Bahdanau, Cho, and Bengio patched Seq2Seq’s bottleneck by adding attention: the decoder can look at all encoder states, not just the last one. This is the paper that brought “attention”, in the specific technical sense it now carries, into mainstream sequence modelling.
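Here is a minimal sketch of what “looking at all encoder states” means for one decoder step. One simplification to flag: the scores below are plain dot products, whereas Bahdanau et al. score each encoder state with a small feed-forward network. The state values are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Scoring is a plain dot product, a simplification of the additive
    scoring network used in the original paper.
    """
    scores = [sum(q * k for q, k in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)
    ]
    return weights, context


# Hypothetical encoder states for a three-word source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)  # one weight per source position, summing to 1
print(context)  # weighted average of the encoder states
```

The context vector is recomputed at every decoder step, so each output word can draw on a different part of the source sentence instead of a single fixed summary.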

Read next: Paper 07 — Bahdanau Attention

Step 5 — attention replaces recurrence entirely (Paper 08, 2017)

Vaswani et al.’s Transformer threw out LSTMs and built a sequence model from attention alone. This ended the LSTM era and began the Transformer era.

Crucially for our story: inside the Transformer, tokens (words or subword pieces) are first turned into embeddings via a lookup table. The table is initialised randomly rather than from Word2Vec, and it is learned jointly with the rest of the network. So Word2Vec’s core idea (look up a dense vector for each word) lives on, but now the embeddings adapt to the task during training.
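A learned embedding layer is, at heart, a table of numbers plus a rule for updating it. The sketch below assumes a three-word vocabulary and a made-up squared-error loss on a single row; in a real model the gradient would arrive from the task loss through backpropagation (Paper 03), and all names here are illustrative.

```python
import random

rng = random.Random(0)
vocab = {"the": 0, "bank": 1, "river": 2}
dim = 4

# Randomly initialised table: one row per token id.
table = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in vocab]


def embed(token_ids):
    """Look up one row of the table per token id."""
    return [table[i] for i in token_ids]


def sgd_step(token_id, target, lr=0.1):
    """One gradient step on 0.5 * ||row - target||^2 for a single row.

    A stand-in loss for illustration; its gradient with respect to the
    row is simply (row - target).
    """
    row = table[token_id]
    table[token_id] = [x - lr * (x - t) for x, t in zip(row, target)]


before = [list(r) for r in table]
sgd_step(vocab["bank"], target=[1.0, 0.0, 0.0, 0.0])

print(embed([vocab["bank"]])[0] != before[vocab["bank"]])    # True: this row moved
print(embed([vocab["river"]])[0] == before[vocab["river"]])  # True: this one did not
```

Only the rows that were actually looked up receive a gradient, which is why rare tokens are updated far less often than common ones.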

Read next: Paper 08 — Attention Is All You Need

Step 6 — contextual embeddings go big (BERT, Paper 11, 2018)

Devlin et al. at Google built BERT: a Transformer encoder trained on a “fill in the blank” task (masked language modelling) across huge corpora. For any given sentence, BERT produces a contextual vector for every token, a far more thorough answer to Word2Vec’s “one vector per word” problem.
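The “fill in the blank” task is easy to sketch as a data-preparation step. The version below is simplified: every selected token is replaced by "[MASK]", whereas BERT’s actual recipe sometimes substitutes a random token or leaves the original in place. The function name and the example sentence are illustrative.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a simplified masked-language-model example.

    inputs: the tokens, with some replaced by "[MASK]".
    labels: the original token at masked positions, None elsewhere.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)  # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # no loss is computed at this position
    return inputs, labels


sentence = "i deposited money at the bank near the river".split()
inputs, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Because the model sees words on both sides of each blank, the vector it builds for a position has to encode the full sentence context, not just what came before.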

After BERT, few state-of-the-art English NLP systems used static Word2Vec vectors as their primary representation. Contextual embeddings dominated.

Read next: Paper 11 — BERT

Step 7 — embeddings become universal (GPT, Paper 10)

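Radford and colleagues at OpenAI took the Transformer architecture (its decoder half) and trained it with next-word prediction, starting with GPT-1 in 2018. Like skip-gram, it learns by predicting words from nearby text, but it conditions on the entire preceding sequence rather than on a single centre word. As the models grew (GPT, GPT-2, GPT-3), the embeddings and the task model effectively became one: the token lookup table is still there as the first layer, but it is no longer a separate, reusable product, because the entire network acts as the embedding function.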
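One way to see how next-word prediction relates to skip-gram is to compare the training examples each objective extracts from the same sentence. This is a sketch of the data only, with illustrative function names; it says nothing about the models that consume these pairs.

```python
def skipgram_pairs(tokens, window=1):
    """(centre, context) pairs: predict neighbours on both sides of a word."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def next_token_pairs(tokens):
    """(prefix, next) pairs: predict each token from everything before it."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat"]

print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]

print(next_token_pairs(tokens))
# [(('the',), 'cat'), (('the', 'cat'), 'sat')]
```

Skip-gram conditions on a single word and looks in both directions; next-token prediction conditions on the whole left context and looks only forward. That growing prefix is what forces the model to build a representation of everything it has read so far.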

Read next: Paper 10 — GPT-1

The bigger pattern

Notice the shape of progress on the embedding side of this story:

  1. Word2Vec: learn one static vector per word.
  2. fastText: learn vectors for subword pieces.
  3. ELMo: vectors depend on the sentence (LSTM-based).
  4. BERT: vectors depend on the sentence (Transformer-based).
  5. GPT: everything is learned jointly at massive scale.

The direction is always: more context, more depth, more scale. Each step keeps the Word2Vec insight intact — meaning is a vector — but relaxes a restriction Word2Vec had imposed.

Where you stand now

You have read about:

  • Paper 01 — Turing: can machines think?
  • Paper 02 — Perceptron: linear classifiers.
  • Paper 03 — Backpropagation: training deep nets.
  • Paper 04 — LSTM: sequence memory.
  • Paper 05 — Word2Vec: meaning as geometry.

Papers 06–08 will show you how to use embeddings in sequence models, how attention relieves the Seq2Seq bottleneck, and finally how the Transformer replaced LSTMs with attention altogether. After Paper 08, every subsequent paper in this series is a variation on the Transformer, trained on different data or at different scales.

You are only a few papers away from the machinery that powers Claude, GPT, Gemini, and LLaMA.

Where to go next

  • Take the quiz to test what stuck.
  • Browse the glossary to nail down any fuzzy terms.
  • Try an experiment from further reading — specifically, running Word2Vec on Hindi Wikipedia is a great weekend project.
  • Then move on to Paper 06 (Seq2Seq), which uses embeddings as the input to a full translation system.

Thanks for reading the whole paper. From here, everything gets increasingly recognisable as “modern AI”.
