Further reading — Word2Vec (2013)
Further reading — Paper 05
Blogs, videos, code, and Indian-language resources. Start at the top and work down.
The original papers
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop. https://arxiv.org/abs/1301.3781 The paper we just read. Introduces CBOW and skip-gram.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS. https://arxiv.org/abs/1310.4546 The follow-up paper with negative sampling and subsampling. If you want to understand the fast training recipe, this is the one.
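The two tricks named in the second paper fit in a few lines. What follows is a minimal sketch, assuming the formulas as the paper states them: each occurrence of a word with relative frequency f(w) is discarded with probability 1 − sqrt(t / f(w)), and negative samples are drawn from the unigram distribution raised to the 3/4 power. The function names and the toy counts are made up for illustration, and the threshold is set far higher than the paper's suggested value of around 1e-5 only because the toy corpus is tiny.

```python
import math
from collections import Counter


def discard_probability(freq, t=1e-5):
    """Chance of dropping one occurrence of a word with relative frequency `freq`."""
    # Words rarer than the threshold t are never dropped, hence the clamp at 0.
    return max(0.0, 1.0 - math.sqrt(t / freq))


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power` and renormalised; negatives are drawn from this."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical toy counts, not real corpus statistics.
counts = Counter({"the": 900, "delhi": 90, "embedding": 10})
n_tokens = sum(counts.values())

# Frequent words are dropped most often; t is inflated for this tiny example.
for word, count in counts.items():
    print(word, round(discard_probability(count / n_tokens, t=1e-2), 3))

# The 3/4 power shifts probability mass from frequent words toward rare ones.
noise = noise_distribution(counts)
print({word: round(p, 3) for word, p in noise.items()})
```

Compared with the raw unigram frequencies, the noise distribution gives "embedding" a larger share and "the" a smaller one, which is the point of the 3/4 exponent.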
Blog posts
- Chris McCormick, “Word2Vec Tutorial - The Skip-Gram Model” (2016). https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/ Single clearest walkthrough of skip-gram with pictures. Read this if any part of Section 4 felt fuzzy.
- Chris McCormick, “Word2Vec Tutorial Part 2 - Negative Sampling” (2017). https://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/ Companion to the above. Walks through the negative-sampling math step by step.
- Jay Alammar, “The Illustrated Word2Vec” (2019). https://jalammar.github.io/illustrated-word2vec/ Gorgeous visual explanation with animations. If you’re a visual learner, read this before or after our Section 3.
- Sebastian Ruder, “On word embeddings” (series). https://ruder.io/word-embeddings-1/ Three-part deep dive covering Word2Vec, GloVe, fastText, and evaluation. Academic but readable.
Videos
- Stanford CS224N: Word Vectors (lecture 1). https://www.youtube.com/watch?v=rmVRLeJRkl4 Chris Manning’s lecture from Stanford’s flagship NLP course. Covers the same math we did.
- StatQuest, “Word Embedding and Word2Vec, Clearly Explained!!!” https://www.youtube.com/watch?v=viZrOnJclY0 Slower and more visual. Good companion if the math in Section 5 felt fast.
- Andrej Karpathy, “Let’s build GPT”. https://www.youtube.com/watch?v=kCc8FmEb1nY The first 40 minutes cover character-level embeddings and then Word2Vec-style token embeddings as the input to a Transformer.
Code and libraries
- gensim (what we used in Section 6). https://radimrehurek.com/gensim/ The standard Python library for Word2Vec. Easy to use, and it can download pretrained models for you.
- Original Word2Vec C code (Mikolov’s). https://code.google.com/archive/p/word2vec/ The original training implementation, of historical interest only. If you can read C, it’s eye-opening to see how short the whole algorithm is.
- fastText. https://fasttext.cc/ Facebook’s successor library. Pre-trained vectors for 157 languages including all major Indian languages. Use this for any serious Indian-language work.
Indian language projects you can try
- Train Word2Vec on Hindi Wikipedia. Download dump: https://dumps.wikimedia.org/hiwiki/latest/ With ~150,000 Hindi articles, training takes about 30 minutes on a laptop. Expect results such as vec("Dilli") − vec("Bharat") + vec("Japan") ≈ vec("Tokyo"), and you can hunt for other curious analogies.
- AI4Bharat IndicNLP embeddings. https://ai4bharat.org/indic-nlp-resources Pre-trained embeddings for 12+ Indian languages. Download and play.
- iNLTK (Indic NLP Toolkit). https://inltk.readthedocs.io/ Simple Python API for Hindi, Tamil, Bengali, and other Indian language embeddings. Five lines to start.
- FIRE (Forum for Information Retrieval Evaluation). https://fire.irsi.res.in/ India’s main NLP shared-task forum. Many tasks from 2014–2020 used Word2Vec-style embeddings. Great place to find datasets for projects.
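The analogy check in the Hindi Wikipedia project is plain vector arithmetic followed by a nearest-neighbour search under cosine similarity. Here is a minimal sketch of that idea using hand-made 3-dimensional vectors instead of trained embeddings; the `toy_vectors` table and the `analogy` helper are hypothetical names for illustration, and real vectors would have a few hundred dimensions and Devanagari tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Made-up vectors; the dimensions loosely mean (capital-ness, India, Japan).
toy_vectors = {
    "Dilli":  [1.0, 1.0, 0.0],
    "Bharat": [0.0, 1.0, 0.0],
    "Tokyo":  [1.0, 0.0, 1.0],
    "Japan":  [0.0, 0.0, 1.0],
    "Mumbai": [0.2, 1.0, 0.0],
    "Osaka":  [0.1, 0.0, 1.0],
}

print(analogy(toy_vectors, "Dilli", "Bharat", "Japan"))  # prints Tokyo
```

With a trained gensim model, the same query is usually written as `model.wv.most_similar(positive=["Dilli", "Japan"], negative=["Bharat"])`, which also excludes the query words from the results.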
Academic resources in India
- IIT-Bombay CFILT (Center for Indian Language Technology). http://www.cfilt.iitb.ac.in/ Long history of Indian-language NLP, including Word2Vec-era work.
- IIT-Madras RBC-DSAI (AI4Bharat’s home). https://ai4bharat.org/ Current generation of Indian NLP research, including embeddings and translation.
- IIIT Hyderabad LTRC (Language Technologies Research Centre). http://ltrc.iiit.ac.in/ Parsing and embeddings work for Indian languages.
Reading order to understand modern NLP
If you’re continuing through this series:
- ✅ Paper 05 (Word2Vec) — you just finished this.
- Paper 06 (Seq2Seq) — use embeddings as input to a translation system.
- Paper 07 (Bahdanau Attention) — patch Seq2Seq’s bottleneck.
- Paper 08 (Transformer) — throw out LSTMs, keep only attention.
- Paper 10 (GPT-1) and Paper 11 (BERT) — the two faces of modern pretraining.