BooksCorpus

Appears in 1 paper

The pre-training dataset for GPT-1: approximately 7,000 unpublished books scraped from the web, totalling ~800 million words.

As used in Paper 10: Improving Language Understanding by Generative Pre-Training

GPT-1 was pre-trained on BooksCorpus because its books contain long stretches of contiguous text, which lets a generative model learn to condition on long-range context.