7. Impact — the twenty-year reign of the LSTM

For the first few years after publication, almost nothing happened. The 1997 LSTM paper was cited sparsely and the world was busy with SVMs and graphical models. Then, slowly, three things changed:

Hardware. GPUs became cheap and CUDA arrived (2007). Long unrolled LSTMs suddenly trained in hours, not weeks.
Data. The internet produced enormous amounts of speech, text, and video. Sequence problems became the most valuable problems in AI.
Engineering refinements. Between 2000 and 2015, researchers polished LSTMs — adding peepholes (Gers & Schmidhuber, 2000), making them bidirectional (Graves & Schmidhuber, 2005), stacking them deep, tuning forget-gate biases. None of these changed the core idea. All of them made it more usable.
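To see why something as small as the forget-gate bias was worth tuning, here is a minimal sketch. It makes a strong simplifying assumption: the forget gate sees only its bias, so it takes the same value at every step. The helper name `retained_fraction` is hypothetical and exists only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def retained_fraction(forget_bias, steps):
    """Fraction of the original cell state left after `steps` updates.

    Assumption: the forget gate equals sigmoid(forget_bias) at every
    step, i.e. it ignores the input and the hidden state.
    """
    return sigmoid(forget_bias) ** steps

# Compare a zero bias with progressively more "remember by default" biases.
for bias in (0.0, 1.0, 5.0):
    print(f"bias={bias:>4}: kept after 50 steps = {retained_fraction(bias, 50):.3e}")
```

With a zero bias the gate sits at 0.5, so the stored value halves at every step. A positive bias starts the network in a "keep by default" regime, and the gates then learn when to forget.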

What followed was one of the most extraordinary deployment runs in AI history. If you used a technology product in the mid-to-late 2010s that spoke, translated, or predicted, there was a good chance an LSTM was involved.

Where LSTMs quietly took over

Speech recognition

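Alex Graves (one of Schmidhuber’s students) showed in 2013 that deep bidirectional LSTMs achieved the best reported result on the TIMIT phoneme recognition benchmark. Within a few years LSTMs had moved into production acoustic models; Google announced LSTM-based voice search in 2015. The assistants of the era (Siri, Google Now, Cortana, Alexa) did not all start out that way: Siri launched in 2011, before this result, and LSTM components generally arrived as later upgrades rather than at launch.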

Machine translation

In 2016, Google began replacing its phrase-based translation system, an elaborate statistical pipeline with many separately tuned components, with a single LSTM-based neural network. The paper reported that the new system cut translation errors by about 60% on average in human evaluations, and the jump was popularly described as larger than the previous ten years of improvements combined. The paper was titled “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” (GNMT), and the model used 8 stacked LSTM layers in both the encoder and the decoder. Other major translation services moved to neural models soon after.

Handwriting and drawing

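Graves’s 2013 paper “Generating Sequences with Recurrent Neural Networks”, and the demo that accompanied it, showed an LSTM generating realistic cursive handwriting one pen point at a time. Related LSTM architectures are said to have powered Apple’s Pencil handwriting recognition and several court-submission OCR pipelines.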

DeepMind’s early reinforcement learning

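Before transformers took over, many of DeepMind’s agents used LSTMs as their memory. The original Atari-playing DQN was not one of them (it fed a stack of recent frames into a convolutional network), but the recurrent variant of A3C (2016) and the core of AlphaStar both relied on LSTMs. The cell state let an agent remember events that happened many frames ago, which matters in games where a single frame does not show everything the agent needs to know.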

Medical, legal, financial time series

LSTMs became a default choice for “predict-from-history” problems. ICU monitoring systems predicting sepsis hours in advance, algorithmic trading models reading order-book data, fraud detection at banks: many of these were, for the better part of a decade, LSTM-shaped underneath.

In India specifically

Aadhaar’s early speech-based OTP systems (for people who couldn’t read SMS) reportedly relied on LSTMs for digit recognition in regional languages.
Ola and Uber’s ETA prediction reportedly used LSTMs over historical trip traces, as did Google Maps.
Many Indian-language TTS (text-to-speech) systems released between 2014 and 2019, whether for Hindi, Tamil, Bengali, or Marathi, were LSTM-based.

None of this was advertised to users. LSTMs were a quiet infrastructure technology. Most of the people whose lives they improved never heard the name.

Why LSTMs won, technically

Looking back, three properties made them the right tool for their decade:

They actually worked on long sequences, where nothing else did.
They were a near drop-in replacement for simple RNNs. Any code that used a simple recurrent layer could swap in an LSTM with minimal changes, chiefly carrying the extra cell state alongside the hidden state.
They trained with plain stochastic gradient descent. No specialised optimiser, no elaborate pre-training recipe. This meant engineers without deep theory backgrounds could use them.
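The first two properties can be illustrated with a small sketch. It is a toy under stated assumptions, not a trained model: both cells are scalar, the class and function names (`SimpleRNNCell`, `LSTMCell`, `run_sequence`) are made up for this example, and the LSTM's gate parameters are hand-picked rather than learned. The same driver loop runs either cell, and only the LSTM still carries the first input after 100 steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SimpleRNNCell:
    """Scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""

    def __init__(self, w_x=1.0, w_h=0.5):
        self.w_x, self.w_h = w_x, w_h

    def initial_state(self):
        return 0.0

    def step(self, x, h_prev):
        h = math.tanh(self.w_x * x + self.w_h * h_prev)
        return h, h  # (output, next state)

class LSTMCell:
    """Scalar LSTM with hand-picked (not learned) gate parameters.

    Simplification: the gates here read only x_t. A real LSTM also
    feeds h_{t-1} into every gate.
    """

    def initial_state(self):
        return (0.0, 0.0)  # (hidden state h, cell state c)

    def step(self, x, state):
        _h_prev, c_prev = state
        i = sigmoid(10.0 * x - 5.0)   # input gate: opens only for large x
        f = sigmoid(10.0)             # forget gate: biased towards "keep"
        o = sigmoid(10.0)             # output gate: held open
        c_tilde = math.tanh(2.0 * x)  # candidate value to store
        c = f * c_prev + i * c_tilde  # additive cell-state update
        h = o * math.tanh(c)
        return h, (h, c)

def run_sequence(cell, xs):
    """Same driver loop for either cell; only the state's shape differs."""
    state = cell.initial_state()
    out = 0.0
    for x in xs:
        out, state = cell.step(x, state)
    return out

xs = [1.0] + [0.0] * 99  # one signal at t=0, then 99 steps of silence
rnn_out = run_sequence(SimpleRNNCell(), xs)
lstm_out = run_sequence(LSTMCell(), xs)
print(f"vanilla RNN output after 100 steps: {rnn_out:.3e}")
print(f"LSTM output after 100 steps:        {lstm_out:.3f}")
```

The swap costs one change in the calling code: the state becomes an (h, c) pair instead of a single value. The vanilla cell's output shrinks at every silent step because the old state is multiplied by w_h and squashed by tanh, while the LSTM's additive cell update keeps the stored value almost unchanged.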

Ease-of-use mattered enormously. For the decade 2005–2015, LSTMs were the first architecture where a grad student could go from paper-reading to production-deployed in a single quarter.

Awards and recognition

Schmidhuber received the IEEE Neural Networks Pioneer Award in 2016, and Hochreiter received it in 2021, the latter specifically for the development of the LSTM.
In 2023, the paper crossed 100,000 citations — one of only a handful of AI papers ever to do so.
The 2018 Turing Award, given to the three “godfathers of deep learning” (LeCun, Bengio, Hinton), drew criticism from some researchers for not including Schmidhuber, whose group arguably did more to make modern deep learning practical than any other lab in the 1990s.

What this means for you

When you read later papers in this series — Seq2Seq, attention, transformer — they will all use language and ideas that originated here. “Encoder-decoder”. “Context vector”. “Sequence model”. All of them trace back to the world LSTMs created.

Even the transformer, which eventually displaced LSTMs in most of these applications, was invented by people who had spent years training LSTMs. The transformer’s self-attention is, in a sense, a direct response to the limitations of the LSTM, which is exactly what we will look at next.

Next: the limitations that eventually forced a new architecture.