Paper 05

Further reading — Word2Vec (2013)

Blogs, videos, code, and Indian-language resources. Start at the top and work down.

The original papers

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop. https://arxiv.org/abs/1301.3781 The paper we just read. Introduces CBOW and skip-gram.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS (now NeurIPS). https://arxiv.org/abs/1310.4546 The follow-up paper, which introduces negative sampling and subsampling of frequent words. If you want to understand the fast training recipe, this is the one.

Blog posts

Videos

Code and libraries

  • gensim (what we used in Section 6). https://radimrehurek.com/gensim/ The standard Python library for Word2Vec. Easy to use, and its downloader module (gensim.downloader) fetches pretrained models for you.

  • Original Word2Vec C code (Mikolov’s). https://code.google.com/archive/p/word2vec/ Historical interest only — the original training implementation. If you can read C, it’s eye-opening to see how short the whole algorithm is.

  • fastText. https://fasttext.cc/ Facebook’s successor library. Pre-trained vectors for 157 languages including all major Indian languages. Use this for any serious Indian-language work.
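
Pretrained fastText vectors are distributed, among other formats, as plain-text .vec files. The sketch below shows one way to read such a file with only the standard library, assuming the word2vec text layout: a header line with the vocabulary size and dimension, then one word per line followed by its values. The function name load_vec and the sample numbers are made up for illustration.

```python
import io


def load_vec(handle, limit=None):
    """Parse word2vec-style text vectors from an open text file.

    Assumed layout: first line is '<vocab_size> <dim>', every later
    line is '<word> <v1> ... <v_dim>' separated by single spaces.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the assumed layout
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # stop early; full files can hold millions of words
    return dim, vectors


# Tiny in-memory stand-in for a downloaded file (values are invented).
sample = io.StringIO("3 2\nदिल्ली 0.1 0.9\nभारत 0.2 0.8\nटोक्यो 0.7 0.3\n")
dim, vectors = load_vec(sample)
print(dim, len(vectors))
```

For a real download, open the file with encoding="utf-8" and pass a limit (say, the first 50,000 words) so the dictionary fits comfortably in memory.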

Indian language projects you can try

  • Train Word2Vec on Hindi Wikipedia. Download the dump: https://dumps.wikimedia.org/hiwiki/latest/ With ~150,000 Hindi articles, training takes about 30 minutes on a laptop. A well-trained model should give vec("Dilli") − vec("Bharat") + vec("Japan") ≈ vec("Tokyo") (query with the Devanagari spellings, since that is what the corpus contains), and you can hunt for other curious analogies.

  • AI4Bharat IndicNLP embeddings. https://ai4bharat.org/indic-nlp-resources Pre-trained embeddings for 12+ Indian languages. Download and play.

  • iNLTK — Indic NLP Toolkit. https://inltk.readthedocs.io/ Simple Python API for Hindi, Tamil, Bengali, and other Indian language embeddings. Five lines to start.

  • FIRE (Forum for Information Retrieval Evaluation). https://fire.irsi.res.in/ India’s main NLP shared-task forum. Many tasks from 2014–2020 used Word2Vec-style embeddings. Great place to find datasets for projects.
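
The analogy arithmetic in the Hindi Wikipedia project above can be sketched in a few lines. This is a minimal illustration, not a trained model: the three-dimensional toy vectors are hand-made so that the arithmetic works out, and the helper names cosine and analogy are hypothetical. The idea is to build vec(a) − vec(b) + vec(c) and return the nearest remaining word by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The three query words are excluded from the candidates, since one
    of them is often the nearest neighbour of the target itself.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors: [capital-ness, India-ness, Japan-ness].
toy = {
    "Dilli": [1.0, 1.0, 0.0],
    "Bharat": [0.0, 1.0, 0.0],
    "Japan": [0.0, 0.0, 1.0],
    "Tokyo": [1.0, 0.0, 1.0],
    "Mumbai": [0.3, 1.0, 0.0],
    "Osaka": [0.3, 0.0, 1.0],
}

print(analogy(toy, "Dilli", "Bharat", "Japan"))
```

With vectors trained in gensim, model.wv.most_similar(positive=[...], negative=[...]) performs a comparable search over the whole vocabulary.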

Academic resources in India

  • IIT-Bombay CFILT (Center for Indian Language Technology). http://www.cfilt.iitb.ac.in/ Long history of Indian-language NLP, including Word2Vec-era work.

  • IIT-Madras RBC-DSAI (AI4Bharat’s home). https://ai4bharat.org/ Current generation of Indian NLP research, including embeddings and translation.

  • IIIT Hyderabad LTRC (Language Technologies Research Centre). http://ltrc.iiit.ac.in/ Parsing and embeddings work for Indian languages.

Reading order to understand modern NLP

If you’re continuing through this series:

  1. ✅ Paper 05 (Word2Vec) — you just finished this.
  2. Paper 06 (Seq2Seq) — use embeddings as input to a translation system.
  3. Paper 07 (Bahdanau Attention) — patch Seq2Seq’s bottleneck.
  4. Paper 08 (Transformer) — throw out LSTMs, keep only attention.
  5. Paper 10 (GPT-1) and Paper 11 (BERT) — the two faces of modern pretraining.
