Long short-term memory. S Hochreiter, J Schmidhuber. Neural computation, MIT Press, 1997 (26k citations as of 2019)
It has passed the backpropagation papers by Rumelhart et al. (1985, 1986, 1987). Don’t be misled by Google Scholar, which sometimes incorrectly lumps together different Rumelhart publications, including:
Learning internal representations by error propagation. DE Rumelhart, GE Hinton, RJ Williams, California Univ San Diego La Jolla, Inst for Cognitive Science, 1985 (25k)
Parallel distributed processing. JL McClelland, DE Rumelhart, PDP Research Group, MIT press, 1987 (24k)
Learning representations by back-propagating errors. DE Rumelhart, GE Hinton, RJ Williams, Nature 323 (6088), 533-536, 1986 (19k)
I think it’s good that the backpropagation paper is no longer number one, because it’s a bad role model. It does not cite the true inventors of backpropagation, and the authors have never corrected this. I learned this on reddit: Schmidhuber on Linnainmaa, inventor of backpropagation in 1970. This post also mentions Kelley (1960) and Werbos (1982).
The LSTM paper is now receiving more citations per year than all of Rumelhart’s backpropagation papers combined. And more than the most cited paper by LeCun and Bengio (1998) which is about CNNs:
Gradient-based learning applied to document recognition. Y LeCun, L Bottou, Y Bengio, P Haffner, Proceedings of the IEEE 86 (11), 2278-2324, 1998 (23k)
It may soon have more citations than Bishop’s textbook on neural networks (1995).
In the 21st century, activity in the field has surged, and I found three deep learning research papers with even more citations. All of them are about applications of neural networks to ImageNet (2012, 2014, 2015). One paper describes a fast, CUDA-based, deep CNN (AlexNet) that won ImageNet 2012. Another paper describes a significantly deeper CUDA CNN that won ImageNet 2014:
A Krizhevsky, I Sutskever, GE Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS 2012 (53k)
K Simonyan, A Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014 (32k)
The paper with the most citations per year is a recent one on the much deeper ResNet which won ImageNet 2015:
K He, X Zhang, S Ren, J Sun. Deep Residual Learning for Image Recognition. CVPR 2016 (36k; 18k in 2019)
Remarkably, such “contest-winning deep GPU-based CNNs” can also be traced back to the Schmidhuber lab. Krizhevsky cites DanNet, the first CUDA CNN to win image recognition challenges and the first superhuman CNN (2011). I learned this on reddit: DanNet, the CUDA CNN of Dan Ciresan in Jürgen Schmidhuber’s team, won 4 image recognition challenges prior to AlexNet: ICDAR 2011 Chinese handwriting contest – IJCNN 2011 traffic sign recognition contest – ISBI 2012 image segmentation contest – ICPR 2012 medical imaging contest.
ResNet is much deeper than DanNet and AlexNet and works even better. It cites the Highway Net (Srivastava, Greff & Schmidhuber, 2015), of which it is a special case. In a sense, this closes the LSTM circle, because “Highway Nets are essentially feedforward versions of recurrent Long Short-Term Memory (LSTM) networks.”
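For anyone wondering what “special case” means here, this is a minimal sketch, not code from either paper. A highway layer computes y = H(x)·T(x) + x·C(x), where T and C are gates. If both gates are held at 1, this becomes y = H(x) + x, which is the residual form. The function names, the toy tanh transform, and the fixed gate values below are my own illustrative assumptions; in a real Highway Net the gates are learned functions of x.

```python
import math

def transform(x, w):
    """Toy stand-in for the layer's nonlinear transform H(x) (hypothetical)."""
    return [math.tanh(w * xi) for xi in x]

def highway_layer(x, w, t_gate, c_gate):
    """Highway form: y = H(x) * T + x * C.

    The gate values are passed in directly for illustration; in an actual
    Highway Net they would be computed from x by learned gating units.
    """
    h = transform(x, w)
    return [hi * t + xi * c for hi, xi, t, c in zip(h, x, t_gate, c_gate)]

def residual_layer(x, w):
    """Residual form: y = H(x) + x."""
    h = transform(x, w)
    return [hi + xi for hi, xi in zip(h, x)]

x = [0.5, -1.0, 2.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

# Both gates fully open: the highway layer matches the residual layer.
highway_out = highway_layer(x, 0.7, ones, ones)
residual_out = residual_layer(x, 0.7)
print(highway_out == residual_out)

# Transform gate closed, carry gate open: the input passes through unchanged.
carried = highway_layer(x, 0.7, zeros, ones)
print(carried == x)
```

The second check shows the other extreme: with the transform gate closed, the layer just carries its input forward, which is the same mechanism LSTM uses to preserve information over time.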
Most LSTM citations refer to the 1997 LSTM paper. However, Schmidhuber’s post on their Annus Mirabilis points out that “essential insights” for LSTM date back to Sepp Hochreiter’s 1991 diploma thesis, which he considers “one of the most important documents in the history of machine learning.” (He also credits other students: “LSTM and its training procedures were further improved” “through the work of my later students Felix Gers, Alex Graves, and others.”)
The LSTM principle is essential for both recurrent networks and feedforward networks. Today it is on every smartphone. It is also in DeepMind’s StarCraft champion, in OpenAI’s Dota champion, and in thousands of additional applications. It is the core of the deep learning revolution.