Efficient Estimation of Word Representations in Vector Space (Word2Vec) 2013

6. The code — playing with pretrained Word2Vec in 25 lines

We won’t train Word2Vec from scratch here: reproducing vectors at the scale of the Google News set is at least an overnight run. Instead we’ll use gensim, a widely used Python library for word embeddings, to load a small pretrained model and reproduce the famous analogies on your laptop or in Google Colab.

Install gensim (if needed)

In a Colab cell:

!pip install gensim --quiet

Takes about 30 seconds.

The demo (25 lines)

import gensim.downloader as api              # gensim's dataset downloader
from numpy import dot                        # for manual dot-product checks
from numpy.linalg import norm                # for computing vector magnitudes

print("Loading vectors (one-time ~128 MB download)…")
# glove-wiki-gigaword-100: 100-dim vectors, 400k words, small & fast
vecs = api.load("glove-wiki-gigaword-100")

def analogy(a, b, c):                        # compute b − a + c, return top matches
    result = vecs.most_similar(positive=[b, c], negative=[a], topn=3)
    return [w for w, _ in result]

print("\nking − man + woman ≈ ", analogy("man", "king", "woman"))
print("Delhi − India + Japan ≈ ", analogy("india", "delhi", "japan"))
print("walking − walk + run ≈  ", analogy("walk", "walking", "run"))
print("actor − man + woman ≈   ", analogy("man", "actor", "woman"))

def sim(w1, w2):                             # cosine similarity by hand
    v1, v2 = vecs[w1], vecs[w2]              # fetch two word vectors
    return dot(v1, v2) / (norm(v1) * norm(v2))

print("\nsim(tea,    coffee)    =", round(sim("tea", "coffee"), 3))
print("sim(tea,    submarine) =", round(sim("tea", "submarine"), 3))
print("sim(cricket, football) =", round(sim("cricket", "football"), 3))

About 25 lines, counting comments and blank lines.
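To see what the analogy call is doing, here is a minimal sketch of the same arithmetic in plain NumPy. It assumes that most_similar roughly amounts to "normalise the vectors, add and subtract them, then rank the remaining words by cosine similarity". The toy dictionary and the name analogy_by_hand are made up for illustration and are not part of gensim.

```python
import numpy as np

# Hypothetical 3-dimensional toy vectors, chosen by hand so the example runs
# without any download. Real GloVe vectors here have 100 dimensions.
toy = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.3, 0.0]),
}

def unit(v):
    """Scale a vector to length 1 so that dot products become cosines."""
    return v / np.linalg.norm(v)

def analogy_by_hand(vectors, a, b, c, topn=3):
    """Rank words by cosine similarity to unit(b) - unit(a) + unit(c)."""
    target = unit(unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c]))
    scores = {
        word: float(np.dot(unit(vec), target))
        for word, vec in vectors.items()
        if word not in (a, b, c)          # skip the query words themselves
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

print(analogy_by_hand(toy, "man", "king", "woman"))
```

Excluding the three query words matters: without that filter, the nearest neighbour of b − a + c is often b itself.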

What you’ll see

The expected output looks approximately like:

king − man + woman ≈  ['queen', 'princess', 'daughter']
Delhi − India + Japan ≈  ['tokyo', 'osaka', 'kyoto']
walking − walk + run ≈   ['running', 'runs', 'ran']
actor − man + woman ≈    ['actress', 'singer', 'dancer']

sim(tea,    coffee)    = 0.605
sim(tea,    submarine) = 0.033
sim(cricket, football) = 0.795

Notice:

  • The top answer is usually the “right” one, or a close paraphrase of it.
  • Cosine similarity separates the related pairs (roughly 0.6–0.8) from the unrelated pair (near 0).
  • We query tea rather than chai because this model (GloVe, a Word2Vec cousin) was trained on English Wikipedia and newswire text, where tea is by far the more common word. If you want stronger Indian-English coverage, you could try word2vec-google-news-300, or embeddings from the multilingual IndicBERT (Paper 11 family), which is a contextual model rather than a gensim word-vector file.

Why we’re using GloVe, not the original Word2Vec

The original 2013 Google News Word2Vec vectors are a download of about 1.5 GB. The GloVe model we just used is 128 MB and behaves much the same way. Its training objective is different (it fits word vectors to global co-occurrence counts, a form of matrix factorisation), but the resulting vectors show the same kind of analogy structure. For classroom demos, the smaller download is much nicer.

If you want the actual Word2Vec-trained vectors:

vecs = api.load("word2vec-google-news-300")   # 1.5 GB, takes a few min

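Same API and the same kind of analogies, just bigger and a little better on rare words. One difference to watch for: this vocabulary is case-sensitive, so query proper nouns as "India" and "Delhi" rather than "india" and "delhi".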
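Because the Google News vocabulary is case-sensitive while the GloVe one is lowercased, a small lookup helper can save some KeyError surprises when switching models. The sketch below is one way to do it. The name find_key is hypothetical, and the set is a stand-in vocabulary; with gensim you would pass vecs itself, which supports the in operator.

```python
def find_key(vocab, word):
    """Return the first spelling of `word` found in `vocab`, or None."""
    for candidate in (word, word.capitalize(), word.lower(), word.upper()):
        if candidate in vocab:
            return candidate
    return None

# Stand-in vocabulary for illustration only (not the real model's word list).
cased_vocab = {"India", "Delhi", "Japan", "walking", "USA"}

print(find_key(cased_vocab, "india"))   # India
print(find_key(cased_vocab, "usa"))     # USA
print(find_key(cased_vocab, "chai"))    # None
```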

A fun exercise — find the “roti” direction

Here’s a small project. Pick 10 Indian foods (roti, biryani, dal, paneer, samosa, idli, dosa, vada, puri, chaat). Compute the average of their vectors. That’s the “Indian food” direction.

Now do the same with 10 Italian foods. Compute the difference between the two averages. That difference vector is, roughly, “Indian − Italian” in food-space.

What happens when you add this to vec("pizza")?

# pseudo-code — try filling this in
indian_foods = ['roti', 'biryani', 'dal', 'paneer', 'samosa',
                'idli', 'dosa', 'vada', 'puri', 'chaat']
italian_foods = ['pizza', 'pasta', 'lasagna', 'risotto', 'ravioli',
                 'gelato', 'tiramisu', 'focaccia', 'bruschetta', 'calzone']
# compute average vectors for both, find the difference, etc.
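
One possible way to structure the exercise, shown on made-up 2-dimensional vectors so it runs offline. The helper name mean_vector and the toy values are assumptions for illustration; the same function should work if you pass the loaded vecs object instead of the toy dictionary.

```python
import numpy as np

def mean_vector(vectors, words):
    """Average the vectors of the words that exist; also return the missing ones."""
    found = [w for w in words if w in vectors]
    missing = [w for w in words if w not in vectors]
    if not found:
        raise ValueError("none of the words are in the vocabulary")
    return np.mean([vectors[w] for w in found], axis=0), missing

# Hypothetical toy vectors, not real embeddings.
toy = {
    "roti":  np.array([1.0, 0.0]),
    "dal":   np.array([0.8, 0.2]),
    "pizza": np.array([0.0, 1.0]),
    "pasta": np.array([0.2, 0.8]),
}

indian_mean, indian_missing = mean_vector(toy, ["roti", "dal", "idli"])
italian_mean, italian_missing = mean_vector(toy, ["pizza", "pasta"])
direction = indian_mean - italian_mean      # "Indian minus Italian" in this toy space
shifted = toy["pizza"] + direction          # pizza moved along that direction

print("direction:", direction)
print("shifted pizza:", shifted)
print("skipped (not in vocabulary):", indian_missing)
```

With the real model, vecs.similar_by_vector(shifted, topn=5) lists the words nearest to the shifted vector. Several of the rarer food names may be missing from a given vocabulary, which is why the helper reports skipped words instead of failing.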

If you do this in Colab and share the result on Twitter tagged #ainiketan, we’ll feature the best ones on the site.

What to take from the code

Three things:

  1. Word vectors are just arrays. vecs["tea"] is a 100-dim NumPy array. Everything you do with words (analogies, similarity, clustering) is just linear algebra on these arrays.
  2. The library does the hard part. Training on billions of words is an engineering feat; using the result is trivial. Gensim exposes it as a dictionary-like object that maps each word to its vector.
  3. This is the first taste of “representation learning”. The embedding is the representation. Later papers (BERT, GPT) learn more sophisticated representations — contextual, deeper — but the shape of the idea is the same: turn things into dense vectors, then do arithmetic.

If you want to actually train a Word2Vec model

Short recipe, from scratch, in about 10 lines:

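from gensim.models import Word2Vec
# Each sentence is a list of tokens; the two shown here are only a placeholder.
sentences = [['the', 'chai', 'is', 'hot'],
             ['students', 'drink', 'coffee'],
             # …thousands more tokenised sentences…
             ]
model = Word2Vec(sentences, vector_size=100,    # 100-dim vectors
                 window=5, min_count=5,         # drop words seen fewer than 5 times
                 sg=1, negative=5, workers=4)   # sg=1 → skip-gram, 5 negative samples
print(model.wv.most_similar('chai'))            # needs 'chai' to survive min_count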

On a few million sentences, this trains in roughly an hour on a laptop and produces serviceable embeddings. For real applications you’d use a much larger corpus: the 2013 paper trained on a Google News corpus of about 6 billion words.
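
To make the sg=1 and window settings concrete, here is a minimal sketch of how skip-gram turns a sentence into (centre, context) training pairs. It is a simplification under stated assumptions: the window is fixed, whereas real implementations typically shrink it at random per word, and negative sampling is left out. The function name skipgram_pairs is made up for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) pairs for every token within `window` positions."""
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                yield centre, tokens[j]

pairs = list(skipgram_pairs(["the", "chai", "is", "hot"], window=1))
for centre, context in pairs:
    print(centre, "->", context)
```

The model is then trained to give high scores to these observed pairs and low scores to randomly drawn "negative" pairs, which is what the negative=5 argument controls.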

Next: the impact — what Word2Vec did for NLP.