Further reading — Paper 05

Blogs, videos, code, and Indian-language resources. Start at the top and work down.

The original papers

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop. https://arxiv.org/abs/1301.3781 The paper we just read. Introduces CBOW and skip-gram.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS. https://arxiv.org/abs/1310.4546 The follow-up paper with negative sampling and subsampling. If you want to understand the fast training recipe, this is the one.

Blogs

Chris McCormick — “Word2Vec Tutorial - The Skip-Gram Model” (2016). https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/ Single clearest walkthrough of skip-gram with pictures. Read this if any part of Section 4 felt fuzzy.
Chris McCormick — “Word2Vec Tutorial Part 2 — Negative Sampling” (2017). https://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/ Companion to the above. Walks through the negative-sampling math step by step.
Jay Alammar — “The Illustrated Word2Vec” (2019). https://jalammar.github.io/illustrated-word2vec/ Gorgeous visual explanation with animations. If you’re a visual learner, read this before or after our Section 3.
Sebastian Ruder — “On word embeddings” (series). https://ruder.io/word-embeddings-1/ Three-part deep dive covering Word2Vec, GloVe, fastText, and evaluation. Academic but readable.

Videos

Stanford CS224N — Word Vectors (lecture 1). https://www.youtube.com/watch?v=rmVRLeJRkl4 Chris Manning’s lecture from Stanford’s flagship NLP course. Covers the same math we worked through here.
StatQuest — “Word Embedding and Word2Vec, Clearly Explained!!!” https://www.youtube.com/watch?v=viZrOnJclY0 Slower and more visual. Good companion if the math in Section 5 felt fast.
Andrej Karpathy — “Let’s build GPT”. https://www.youtube.com/watch?v=kCc8FmEb1nY The first 40 minutes cover character-level embeddings and then Word2Vec-style token embeddings as the input to a Transformer.

Code

gensim (what we used in Section 6). https://radimrehurek.com/gensim/ The standard Python library for Word2Vec. Easy to use, and its downloader module can fetch pretrained models.
Original Word2Vec C code (Mikolov’s). https://code.google.com/archive/p/word2vec/ Historical interest only — the original training implementation. If you can read C, it’s eye-opening to see how short the whole algorithm is.
fastText. https://fasttext.cc/ Facebook’s successor library. Pre-trained vectors for 157 languages, including major Indian languages such as Hindi, Bengali, Tamil, Telugu, and Marathi. A strong default for serious Indian-language work.
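Pre-trained vectors of this kind are commonly distributed as plain text: a header line giving the vocabulary size and dimension, then one word per line followed by its vector components. As a rough sketch of what such a file contains, the snippet below reads that layout with the standard library only and finds nearest neighbours by cosine similarity. The function names and the limit parameter are made up for illustration, and the layout is an assumption about the text format; in practice gensim or the fasttext package would do the loading for you.

```python
import math


def load_vec(path, limit=None):
    """Read word vectors from a word2vec-style text file.

    Assumed layout: the first line is "<vocab_size> <dim>", then one line
    per word: the word followed by <dim> space-separated floats.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        _, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines rather than guess
            vectors[parts[0]] = [float(x) for x in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break  # files are usually sorted by frequency, so this keeps common words
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, k=5):
    """Return the k words whose vectors are closest to `word` by cosine."""
    query = vectors[word]
    scored = [(cosine(query, vec), w) for w, vec in vectors.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:k]]
```

With a downloaded Hindi file (the name cc.hi.300.vec is an assumption here), usage would look like: vectors = load_vec("cc.hi.300.vec", limit=50000), then nearest(vectors, "दिल्ली"). The pure-Python loop is slow on a full file, which is why the limit parameter is there.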

Indian-language resources

Train Word2Vec on Hindi Wikipedia. Download the dump from https://dumps.wikimedia.org/hiwiki/latest/ With ~150,000 Hindi articles, training takes about 30 minutes on a laptop. A trained model should recover analogies such as vec("Dilli") − vec("Bharat") + vec("Japan") ≈ vec("Tokyo"), and you can hunt for more curious ones yourself.
AI4Bharat IndicNLP embeddings. https://ai4bharat.org/indic-nlp-resources Pre-trained embeddings for 12+ Indian languages. Download and play.
iNLTK — Indic NLP Toolkit. https://inltk.readthedocs.io/ Simple Python API for Hindi, Tamil, Bengali, and other Indian language embeddings. Five lines to start.
FIRE (Forum for Information Retrieval Evaluation). https://fire.irsi.res.in/ India’s main NLP shared-task forum. Many tasks from 2014–2020 used Word2Vec-style embeddings. Great place to find datasets for projects.
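The Hindi Wikipedia exercise listed first in this group can be sketched roughly as follows. The sketch assumes the dump has already been converted to a UTF-8 text file with one sentence per line (the file names are hypothetical), uses whitespace splitting as a stand-in for a proper Hindi tokenizer, and uses gensim 4.x parameter names. The hyperparameters are illustrative, not tuned values.

```python
class SentenceStream:
    """Re-iterable stream of tokenized sentences from a text file.

    gensim makes several passes over the corpus (one to build the vocabulary,
    then one per epoch), so a one-shot generator would not be enough.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                # Assumption: treat the danda (।) as whitespace, then split.
                tokens = line.replace("।", " ").split()
                if tokens:
                    yield tokens


def train_hindi_word2vec(path, out_path="hiwiki_w2v.kv"):
    """Train skip-gram vectors with negative sampling and save them."""
    # Imported here so the corpus reader above works without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=SentenceStream(path),
        vector_size=100,  # embedding dimension
        window=5,         # context words on each side
        min_count=5,      # drop words seen fewer than 5 times
        sg=1,             # 1 = skip-gram, 0 = CBOW
        negative=5,       # negative samples per positive pair
        sample=1e-3,      # subsampling threshold for frequent words
        workers=4,
        epochs=5,
    )
    model.wv.save(out_path)
    return model.wv
```

After wv = train_hindi_word2vec("hiwiki_sentences.txt"), the capital-city analogy can be queried with wv.most_similar(positive=["दिल्ली", "जापान"], negative=["भारत"], topn=5). The tokens are written in Devanagari because that is how they appear in the corpus. Whether Tokyo comes out on top depends on the corpus and settings.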

IIT-Bombay CFILT (Center for Indian Language Technology). http://www.cfilt.iitb.ac.in/ Long history of Indian-language NLP, including Word2Vec-era work.
IIT-Madras RBC-DSAI (AI4Bharat’s home). https://ai4bharat.org/ Current generation of Indian NLP research, including embeddings and translation.
IIIT Hyderabad LTRC (Language Technologies Research Centre). http://ltrc.iiit.ac.in/ Parsing and embeddings work for Indian languages.

If you’re continuing through this series: