2. The problem — one-hot vectors and the sparsity trap

Every problem in AI begins with a question of representation. Before you can learn, classify, predict, or generate, you have to turn the world into numbers. The numbers you choose shape what the model can possibly learn. Pick bad numbers, and no amount of cleverness downstream will save you.

How NLP represented words, pre-2013

The standard choice, as we saw in Section 1, was the one-hot vector.

Suppose your vocabulary has 50,000 words — roughly the size of a decent Hindi or English newspaper’s working lexicon. Each word gets a unique position from 0 to 49,999. Then “chai” at position 4,271 becomes a 50,000-dimensional vector that is all zeros except for a single 1 at position 4,271.
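To make the construction concrete, here is a minimal sketch in plain Python. The vocabulary size and the position of “chai” come from the paragraph above; the function name `one_hot` is just an illustrative choice.

```python
VOCAB_SIZE = 50_000   # vocabulary size used in the text
CHAI_ID = 4_271       # position assigned to "chai" in the text

def one_hot(word_id, vocab_size=VOCAB_SIZE):
    """Return a list of zeros with a single 1 at index word_id."""
    if not 0 <= word_id < vocab_size:
        raise ValueError("word_id is outside the vocabulary")
    vec = [0] * vocab_size
    vec[word_id] = 1
    return vec

chai = one_hot(CHAI_ID)
print(len(chai))      # 50000 entries in total
print(sum(chai))      # 1: exactly one entry is nonzero
print(chai.index(1))  # 4271: the position of that entry
```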

This representation has three properties, and all three cause problems.

Property 1 — every pair of different words is equally far apart

Take any two distinct words. Their one-hot vectors differ in exactly two positions (the 1 for word A and the 1 for word B). Their Euclidean distance is always √2. Their cosine similarity is always 0.

So:

distance("chai", "coffee")     = √2
distance("chai", "submarine")  = √2
distance("chai", "lassi")      = √2

The representation says nothing about which pairs are more similar. It has no sense of topic, or register, or sentiment, or anything.
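The claim is easy to check numerically. The sketch below assumes arbitrary word IDs for “coffee”, “submarine” and “lassi” (only the ID for “chai” comes from the text); any other distinct IDs would give the same numbers.

```python
import math

VOCAB_SIZE = 50_000

def one_hot(word_id, vocab_size=VOCAB_SIZE):
    vec = [0] * vocab_size
    vec[word_id] = 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical IDs, chosen arbitrarily; only "chai" = 4271 is from the text.
word_ids = {"chai": 4_271, "coffee": 812, "submarine": 30_555, "lassi": 19_004}
vectors = {word: one_hot(i) for word, i in word_ids.items()}

for other in ("coffee", "submarine", "lassi"):
    d = euclidean(vectors["chai"], vectors[other])
    c = cosine(vectors["chai"], vectors[other])
    print(f"chai vs {other}: distance={d:.4f}, cosine={c:.1f}")
# Every line prints distance=1.4142 (that is, √2) and cosine=0.0
```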

Property 2 — the curse of dimensionality

The vector is 50,000-dimensional, but only one of those dimensions is nonzero. We’re storing 50,000 numbers to encode what is really just one small integer (the word’s ID).

Neural networks trained on one-hot inputs struggle because:

Parameter count grows in proportion to vocabulary size. A 50,000-word vocabulary feeding a 500-unit hidden layer needs a 50,000 × 500 weight matrix: 25 million parameters in the very first layer.
Most of those parameters barely get updated, because each training example activates only one input position, so only that word’s 500 weights receive a gradient.
Nothing in the input supports generalisation. If the network sees “chai” in a sentence during training, the one-hot input gives it no reason to transfer anything to “coffee”: as far as the vectors are concerned, the two words are unrelated.
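The parameter count and the “one position per example” behaviour can both be checked with a few lines of Python. This is a sketch under stated assumptions: the 6 × 3 matrix is a hypothetical scaled-down stand-in for the 50,000 × 500 one, and `vec_mat` is an illustrative helper, not part of any library.

```python
VOCAB_SIZE = 50_000
HIDDEN_UNITS = 500
print(VOCAB_SIZE * HIDDEN_UNITS)  # 25000000 weights in the first layer

# Scaled-down illustration (hypothetical sizes) so the matrix stays readable.
small_vocab, small_hidden = 6, 3
W = [[float(10 * row + col) for col in range(small_hidden)]
     for row in range(small_vocab)]

def one_hot(word_id, vocab_size):
    vec = [0] * vocab_size
    vec[word_id] = 1
    return vec

def vec_mat(x, matrix):
    """Multiply a row vector x (length V) by a V x H matrix."""
    hidden = len(matrix[0])
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(hidden)]

word_id = 4
x = one_hot(word_id, small_vocab)
h = vec_mat(x, W)
print(h == W[word_id])  # True: the product simply selects row 4 of W

# Only rows with a nonzero input contribute to h, so only those rows
# can receive a nonzero gradient from this training example.
touched_rows = [i for i, x_i in enumerate(x) if x_i != 0]
print(touched_rows)     # [4]

# At full size, one example touches 500 of the 25 million weights.
fraction = HIDDEN_UNITS / (VOCAB_SIZE * HIDDEN_UNITS)
print(fraction)         # 1 in 50,000
```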

Property 3 — no compositional structure

In a well-designed representation, you’d hope that chai − milk + soybean ≈ soy-chai, or something similarly analogical. You cannot do any such arithmetic with one-hot vectors. Adding two one-hot vectors gives you a two-hot vector, which represents nothing meaningful. The math of one-hot space is sterile.
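A short sketch shows what the arithmetic actually produces. The five-word vocabulary here is hypothetical and exists only for this illustration.

```python
# Hypothetical toy vocabulary, for illustration only.
vocab = ["chai", "milk", "soybean", "soy-chai", "coffee"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Adding two one-hot vectors gives a "two-hot" vector.
print(add(one_hot("chai"), one_hot("milk")))   # [1, 1, 0, 0, 0]

# The analogy arithmetic only marks which words were used.
result = add(sub(one_hot("chai"), one_hot("milk")), one_hot("soybean"))
print(result)                                  # [1, -1, 1, 0, 0]

# It has zero overlap with the word we hoped to reach.
print(dot(result, one_hot("soy-chai")))        # 0
```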

An Indian-life analogy — names without features

Imagine you move to a new city and you’re told to remember everyone at a wedding reception. You are given a long list of names: Ravi, Priya, Meera, Karthik, Aisha, Rohan, Fatima, and so on. You are not told anything else — no relationships, no jobs, no appearances.

After the wedding, someone asks: “Are Ravi and Priya related?” You have no way to answer. You have no features attached to the names, just the names themselves. Every pair of people is, to you, equally different.

This is exactly the situation a neural network is in with one-hot vectors. The word IDs are just names with no features.

Now imagine instead that you had been given a brief profile for each person: age, job, city, family, how you know them. You could then guess that Ravi and Priya are probably related, because they share a city and a family background, even if you met them on different days. You can compare them.

Word2Vec builds these profiles automatically. That’s the whole trick.

The insight Word2Vec builds on: distributional semantics

There’s a famous line from the British linguist J.R. Firth (1957):

“You shall know a word by the company it keeps.”

The same idea, more formally, is called the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. If “chai” and “coffee” both frequently appear near “cup”, “sip”, “hot”, “morning” and “sugar”, they’re probably similar in meaning. You don’t need a dictionary. The data itself carries the meaning, encoded in the statistics of co-occurrence.
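The hypothesis can be illustrated by simply counting context words. The sketch below uses a tiny hypothetical corpus written for this example; it is not Word2Vec, only a demonstration that co-occurrence counts already separate “coffee” from “submarine”.

```python
from collections import Counter

# Hypothetical toy corpus, written for this illustration only.
corpus = [
    "i sip hot chai every morning",
    "i sip hot coffee every morning",
    "a cup of chai with sugar",
    "a cup of coffee with sugar",
    "the submarine dived under the sea",
]

def context_counts(target, sentences, window=2):
    """Count the words seen within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

def overlap(a, b):
    """Size of the multiset intersection of two context counters."""
    return sum((a & b).values())

chai = context_counts("chai", corpus)
coffee = context_counts("coffee", corpus)
submarine = context_counts("submarine", corpus)

print(overlap(chai, coffee))     # 8: every context word is shared
print(overlap(chai, submarine))  # 0: no context word is shared
```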

Word2Vec operationalises this hypothesis beautifully:

For every word in the corpus, look at the words in a small window around it (say, 5 on each side).
Train a network so that a word’s vector predicts which words appear around it (or, the other way round, so that the neighbours’ vectors predict the word).
Words that appear in similar contexts end up with similar vectors — because the training signal pushes them that way.
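The first step, collecting each word together with its neighbours, can be sketched as below. The function name `training_pairs` and the example sentence are assumptions made for illustration; the sketch covers only the windowing step, not the training itself.

```python
def training_pairs(tokens, window=5):
    """Yield (centre, neighbour) pairs for each word and every word
    within `window` positions on either side of it."""
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield centre, tokens[j]

# Hypothetical sentence; a window of 2 keeps the output short.
sentence = "she drinks hot chai every morning".split()
pairs = list(training_pairs(sentence, window=2))

chai_pairs = [pair for pair in pairs if pair[0] == "chai"]
print(chai_pairs)
# [('chai', 'drinks'), ('chai', 'hot'), ('chai', 'every'), ('chai', 'morning')]
```

Each pair is one training example: given the first word, the network is asked to predict the second (or the reverse). Two words that keep producing the same neighbours receive the same kind of training signal.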

Note what the network is not doing. It is not looking up definitions. It is not being told which words are synonyms. It has never seen a thesaurus. All it has is text and a “predict the neighbours” task. The meaning emerges for free.

What a good representation should achieve

Keep these four requirements in mind as we look at the Word2Vec architecture in the next section. Any good word representation should:

Be dense. A few hundred dimensions, not tens of thousands. Every dimension should carry signal.
Cluster by meaning. Similar words (chai, coffee, lassi) should have similar vectors.
Support arithmetic. Meaningful operations like king − man + woman ≈ queen should work.
Generalise. Learning something about “chai” should transfer to “tea”.

One-hot vectors satisfy none of these. Word2Vec, trained on a trivial-seeming task, satisfies all four. That is the leap this paper takes.
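To see what “cluster by meaning” and “support arithmetic” look like in practice, here is a sketch with hand-made dense vectors. These numbers are not learned embeddings and do not come from Word2Vec; the three dimensions (royalty, gender, drink) are hypothetical features picked by hand so that the behaviour is easy to follow.

```python
import math

# Hand-made toy vectors, NOT learned embeddings. The dimensions are
# hypothetical features chosen for illustration: (royalty, gender, drink).
vectors = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "chai":   [0.0,  0.0, 1.0],
    "coffee": [0.0,  0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, exclude):
    """Return the word whose vector has the highest cosine with query."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(query, vectors[word]))

# Cluster by meaning: chai is closer to coffee than to king.
print(cosine(vectors["chai"], vectors["coffee"]) >
      cosine(vectors["chai"], vectors["king"]))          # True

# Support arithmetic: king - man + woman lands on queen.
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(query, exclude={"king", "man", "woman"}))  # queen
```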

Next: the core idea — meaning as a by-product of prediction.