Efficient Estimation of Word Representations in Vector Space (Word2Vec)
Word2Vec — Mikolov, Chen, Corrado, Dean (2013)
TL;DR
Before 2013, most NLP systems represented the word “chai” as a one-hot vector such as [0, 0, 0, 1, 0, 0, ...]: all zeros with a single 1 at that word's index. Every word was equally different from every other word, so “chai” and “coffee” were as far apart as “chai” and “submarine”. Machines had no sense of meaning.
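To make the "equally different" claim concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the helper names (`one_hot`, `euclidean`, `cosine`) are made up for illustration; any vocabulary would give the same result.

```python
import math

def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vocabulary, chosen only for this example.
vocab = ["chai", "coffee", "submarine", "delhi"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

d_chai_coffee = euclidean(vectors["chai"], vectors["coffee"])
d_chai_submarine = euclidean(vectors["chai"], vectors["submarine"])
sim_chai_coffee = cosine(vectors["chai"], vectors["coffee"])

print(d_chai_coffee, d_chai_submarine)  # the two distances are identical
print(sim_chai_coffee)                  # 0.0: the vectors are orthogonal
```

Every pair of distinct one-hot vectors differs in exactly two positions, so the distance is always the square root of 2 and the cosine similarity is always 0, no matter which two words are compared.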
Mikolov and colleagues at Google proposed something almost magical: train a very small neural network to do a dummy task — predict nearby words in a sentence — and then throw away the network. What you keep is the hidden layer’s weights. Those weights, one row per word, turn out to encode meaning.
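The "train, then throw away the network" idea can be sketched in a few dozen lines of plain Python. This is an illustrative toy under stated assumptions, not the paper's implementation: it uses a full softmax over a six-word vocabulary instead of the faster approximations covered in the later sections, and the corpus, the function names (`training_pairs`, `train_skipgram`), and all hyperparameters are invented for this example.

```python
import math
import random

def training_pairs(tokens, window=1):
    """List (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_skipgram(tokens, dim=5, window=1, epochs=200, lr=0.05, seed=0):
    """Toy skip-gram with a full softmax (an assumption made for brevity)."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    # w_in is the "hidden layer": one row per word. This is what we keep.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    # w_out is the output layer. It is discarded after training.
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    pairs = [(idx[c], idx[o]) for c, o in training_pairs(tokens, window)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for c, o in pairs:
            h = w_in[c][:]  # copy of the center word's current vector
            scores = [sum(h[k] * w_out[j][k] for k in range(dim))
                      for j in range(size)]
            probs = softmax(scores)
            total += -math.log(probs[o])  # cross-entropy for the true neighbour
            grad_h = [0.0] * dim
            for j in range(size):
                err = probs[j] - (1.0 if j == o else 0.0)
                for k in range(dim):
                    grad_h[k] += err * w_out[j][k]  # uses the pre-update weight
                    w_out[j][k] -= lr * err * h[k]
            for k in range(dim):
                w_in[c][k] -= lr * grad_h[k]
        losses.append(total / len(pairs))
    embeddings = {w: w_in[idx[w]] for w in vocab}
    return embeddings, losses

corpus = "i drink chai every morning i drink coffee every morning".split()
embeddings, losses = train_skipgram(corpus)
print("first epoch loss:", losses[0])
print("last epoch loss: ", losses[-1])
print("vector for chai: ", embeddings["chai"])
```

Only `w_in` survives: the prediction task exists to shape those rows. On a corpus this small the vectors are not meaningful; the point is the mechanics of which weights are kept and which are thrown away.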
Suddenly “chai” and “coffee” had vectors close to each other. “Delhi” and “India” were neighbours, as were “Tokyo” and “Japan”. Most famously:
king − man + woman ≈ queen
You could do arithmetic with words. The vectors captured not just similarity but analogical structure.
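The arithmetic can be sketched with hand-made vectors. The three-dimensional vectors below are invented for illustration (loosely "royalty", "male", "female") and do not come from a trained model; real Word2Vec vectors have hundreds of dimensions with no such readable axes. The helper name `analogy` is also an assumption of this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c).

    The three query words are excluded from the candidates, which is the
    usual convention when evaluating word analogies.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Made-up toy vectors: (royalty, male, female). Not real embeddings.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "chai":  [0.0, 0.3, 0.3],
}

print(analogy("king", "man", "woman", toy))  # queen
```

Subtracting `man` removes the "male" component from `king`, adding `woman` puts a "female" component in its place, and the nearest remaining vector by cosine similarity is `queen`.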
The paper trained on a few billion words, produced 300-dimensional vectors for a million words in under a day on one machine, and released the code for free. It was the spark that ignited the deep-learning boom in NLP.
The journey in one line
Words as meaningless IDs → train a tiny network on a fake task → the side-effect of training is that meaning appears in the weights.
What you will learn
- Why one-hot vectors carry zero meaning.
- Two model architectures, CBOW and skip-gram, that learn word embeddings.
- Why “predict your neighbours” is a clever proxy for learning meaning.
- A worked numerical example: how king − man + woman ends up near queen.
- An Indian version: Delhi − India + Japan ≈ Tokyo.
- A 25-line Python demo that loads pretrained Word2Vec and plays with it.
- Why Word2Vec eventually lost to contextual embeddings (BERT, Paper 11).
Sections
- Historical context — NLP before meaning
- The problem — one-hot vectors and the sparsity trap
- The core idea — meaning as a by-product of prediction
- How it works — CBOW, skip-gram, and negative sampling
- The math — the loss function, worked king-queen arithmetic
- The code — gensim demo in 25 lines
- Impact — NLP’s first true breakthrough
- Limitations — one vector per word, no context
- What came next — GloVe, ELMo, BERT, and embeddings everywhere
Resources
- Glossary — every new term used in this paper
- Quiz — 5 questions to test your understanding
- Further reading — blogs, videos, original paper