Munich 1991: The Roots of the Current AI Boom

The foundations of today's AI boom, including the first Transformer variant, unsupervised pre-training, neural network distillation, and deep residual learning, were all published in 1991 by Jürgen Schmidhuber's lab at the Technical University of Munich. These techniques now power modern large language models like ChatGPT, highlighting the lab's pivotal role in AI history.

When we look at the massive scale of today’s Artificial Intelligence boom, it is easy to forget that the foundations of this trillion-dollar industry were laid down over 30 years ago in Munich. Today, the world's top tech companies are investing hundreds of billions into scaling up Large Language Models LLMs such as ChatGPT. Yet, outside of a few history buffs or old-school folks in the Machine Learning community, people might not realize that virtually every core building block of these modern systems was published in a span of just a few months back in 1991 deep-learning-miraculous-year-1990-1991.html . Incredibly, they all emerged from a single lab at the Technical University Munich cogbotlab.html led by Jürgen Schmidhuber cv.html . Before that year ended, his team had essentially mapped out the modern era of deep learning. They published the very first Transformer who-invented-transformer-neural-networks.html variant see ChatGPT's "T" , introduced the concept of unsupervised pre-training very-deep-learning-1991.html ChatGPT's "P" , and pioneered neural network distillation who-invented-knowledge-distillation-with-neural-networks.html . They also introduced deep residual learning who-invented-residual-neural-networks.html , the centerpiece of both LSTMs and ResNets deep-learning-history.html lstm , the most cited AI papers of the 20th and the 21st century, respectively. These four techniques power today's most advanced LLMs. Furthermore, they laid the early groundwork for generative adversarial networks who-invented-generative-adversarial-networks.html , foundational for "Generative AI." Jürgen’s contributions have deeply shaped my own thinking over the years, from my time at Google Brain https://otoro.net/ml/ to our recursive self-improvement RSI metalearning.html research we're currently pushing at Sakana AI https://sakana.ai/rsi-lab/ . I am especially proud to have helped popularize World Models world-model-boom.html back in 2018, building directly on concepts world-models-planning-curiosity-fki-1990.html his lab introduced in the 1990s. It is amazing to see how well some of these ideas have stood the test of time, scaling up to be fully embraced by the global AI community For those interested in the real history of deep learning deep-learning-history.html , Jürgen has put together a detailed timeline below of exactly how these seeds were planted in Munich in 1991. David Ha https://twitter.com/hardmaru , June 2026 Jürgen Schmidhuber's 1991 Timeline, with Annotated References I am proud of the work my team did in 1991 deep-learning-miraculous-year-1990-1991.html in my home city when compute was millions of times more expensive than today RAW RAW , and of all the great people I worked with there and afterwards. Check out TU Munich's following key AI publications dated 3/26/1991—8/31/1991. ★ 26 March 1991: the first kind of Transformer who-invented-transformer-neural-networks.html see the T in ChatGPT —now called the unnormalized linear Transformer 1991-unnormalized-linear-transformer.html ULTRA ULTRA FWP0-6 FWP WHO10 WHO10 DLH DLH : the predecessor of the normalized quadratic Transformer who-invented-transformer-neural-networks.html TR1 TR1 . ULTRA is still important, also because of its efficiency: its computational costs scale linearly in input size, rather than quadratically . ★ 30 April 1991: Pre-Training for deep neural networks NNs very-deep-learning-1991.html —the P in ChatGPT UN0 UN0 UN1 UN1 UN2 UN2 UN UN DLH DLH . This enabled very deep learning who-invented-deep-learning.html WHO5 WHO5 . ★ 30 April 1991: Neural network distillation—central to the famous 2025 DeepSeek "Sputnik" https://x.com/SchmidhuberAI/status/1885357355938046382 and other Large Language Models LLMs UN0 UN0 UN1 UN1 UN2 UN2 WHO9 WHO9 DLH DLH . ★ 15 June 1991: deep residual learning who-invented-residual-neural-networks.html with residual connections for very deep NNs WHO11 WHO11 see Sepp Hochreiter's diploma thesis deep-learning-history.html vanish VAN1 VAN1 : the core ingredient of Long Short-Term Memory deep-learning-history.html lstm LSTM1 LSTM1 , the most cited AI of the 20th century, basis of the first LLMs in the 2010s ELMO, ULMFiT . The most-cited scientific article of the 21st century MOST25-26 MOST25 is also about deep residual learning who-invented-residual-neural-networks.html , focusing on a variant of our LSTM-inspired deep residual Highway Net highway-networks.html HW1-25b HW that was 10 times deeper than previous feedforward NNs who-invented-residual-neural-networks.html WHO11 WHO11 DLH DLH . Deep residual learning who-invented-residual-neural-networks.html is now being used in virtually all LLMs. ★ 31 August 1991: first peer-reviewed publication GAN91 GAN91 on generative & adversarial networks who-invented-generative-adversarial-networks.html GAN90-25 GAN90 for neural world models world-model-boom.html WM26,WM26b WM26 trained through artificial curiosity & creativity artificial-curiosity-since-1990.html —now controversially used for deepfakes and other applications of Generative AI WHO8 WHO8 DLH DLH . As of January 2026, the two most frequently cited papers of all time with the most citations within 3 years—manuals excluded are directly based most-cited-papers-of-all-time.html on the work of 1991 deep-learning-miraculous-year-1990-1991.html MOST26 MOST26 MOST MOST MIR MIR . In 1991, however, it was already totally obvious that LLM-like NNs alone are not enough to achieve Artificial General Intelligence AGI . No AGI without mastery of the real world DLH DLH That's why we started working on additional techniques required to achieve AGI, e.g., planning with adaptive world models world-models-planning-curiosity-fki-1990.html PLAN1-6 PLAN WM26,WM26b WM26 created by artificial scientists artificial-curiosity-since-1990.html AC AC since 1990 at TU Munich , meta learning & recursive self-improvement metalearning.html since 1987 META1 META1 META META , and others DLH DLH AIB AIB . Around the same time , Munich also was the origin of the first self-driving cars in traffic robotcars.html AUT AUT by Ernst Dickmanns's team , going up to 175 km/h. The city was truly the epicenter of AI. In the past 3 decades, however, most of commercial AI has shifted to the Pacific Rim, far away from Munich. How could that happen? Can anything be done about it? See 95-25 95-25 for answers See WHO3-11 WHO3 for the broader historical context deep-learning-history.html DLH DLH of the work published in 1991 deep-learning-miraculous-year-1990-1991.html MIR MIR . I am still hoping that I may live to see our great field of Machine Learning realize my 1970s teenager vision of building something much smarter than myself, such that I can retire. Jürgen Schmidhuber cv.html , June 2026 Acknowledgments Thanks to several expert reviewers for useful comments. Let us know if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License http://creativecommons.org/licenses/by-nc-sa/4.0/ . Annotated References 95-25 J. Schmidhuber AI Blog blog.html , 2025 . 1995-2025: The Decline of Germany & Japan vs US & China. Can All-Purpose Robots Fuel a Comeback? GerJapUsaChiRobots.html In 1995, in terms of nominal gross domestic product GDP , a combined Germany and Japan were almost 1:1 economically with a combined USA and China, according to IMF. Only 3 decades later, this ratio is now down to 1:5 Self-replicating AI-driven all-purpose robots may be the answer. Based on a 2024 F.A.Z. guest article https://www.faz.net/pro/digitalwirtschaft/kuenstliche-intelligenz/juergen-schmidhuber-mahnt-baut-den-ki-gesteuerten-allzweckroboter-110120697.html . AC J. Schmidhuber AI Blog blog.html , 2021, updated 2025 . 3 decades of artificial curiosity & creativity artificial-curiosity-since-1990.html . Schmidhuber's artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: 1990 the principle of generative adversarial networks, 1991 neural nets that maximise learning progress, 1995 neural nets that maximise information gain optimally since 2011 , 1997 adversarial design of surprising computational experiments, 2006 maximizing compression progress like scientists/artists/comedians do, 2011 PowerPlay... Since 2012: applications to real robots. AIB J. Schmidhuber's AI Blog. https://people.idsia.ch/~juergen/blog.html With lessons on the history of AI & computing, e.g.: Who invented deep learning? Who invented backpropagation? Who invented convolutional neural networks? Who invented artificial neural networks? Who invented generative adversarial networks? Who invented Transformer neural networks? Who invented deep residual learning? Who invented neural knowledge distillation? Who invented the computer? Who invented the transistor? Who invented the integrated circuit? ... ATT J. Schmidhuber AI Blog blog.html , 2020, updated 2025 . 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. neural-attention-1990-1993.html Schmidhuber had both hard attention for foveas 1990 and soft attention in form of Transformers with linearized self-attention 1991-93 . FWP FWP Today, both types are very popular. AUT J. Schmidhuber AI Blog blog.html , 2005 . Highlights of robot car history robotcars.html . Around 1986, Ernst Dickmanns and his group at Univ. Bundeswehr Munich built the world's first real autonomous robot cars, using saccadic vision, probabilistic approaches such as Kalman filters, and parallel computers. By 1994, they were in highway traffic, at up to 180 km/h, automatically passing other cars. DLH J. Schmidhuber. Annotated History of Modern AI and Deep Learning deep-learning-history.html . Technical Report IDSIA-22-22, IDSIA, Switzerland, 2022, updated 2025. Preprint arXiv:2212.11279 https://arxiv.org/abs/2212.11279 . Tweet https://x.com/SchmidhuberAI/status/1606333832956973060 . DLP J. Schmidhuber. How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. ai-priority-disputes.html Technical Report IDSIA-23-23, Swiss AI Lab IDSIA, 14 Dec 2023, updated 2025. Tweet of 2023 https://x.com/SchmidhuberAI/status/1735313711240253567 . DS1 DeepSeek-AI 2025 . DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Preprint arXiv:2501.12948 https://arxiv.org/abs/2501.12948 . See the popular DeepSeek tweet of Jan 2025 https://x.com/SchmidhuberAI/status/1885357355938046382 . FWP J. Schmidhuber AI Blog blog.html , 26 March 2021, updated 2025 . 26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff fast-weight-programmer-1991-transformer.html See tweet of 2022 https://twitter.com/SchmidhuberAI/status/1576966129993797632 . FWP0 J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, 26 March 1991. PDF. FKI-147-91ocr.pdf First paper on neural fast weight programmers fast-weight-programmer-1991-transformer.html that separate storage and control: a slow net learns by gradient descent to compute weight changes of a fast net. The outer product-based version Eq. 5 is now known as the unnormalized linear Transformer 1991-unnormalized-linear-transformer.html or the "Transformer with linearized self-attention." ULTRA ULTRA FWP FWP FWP1 J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4 1 :131-139, 1992. Based on FWP0 . PDF https://sferics.idsia.ch/pub/juergen/fastweights.pdf . HTML. fastweights/ncfastweightsrev.html Pictures German . habilitation/node29.html See tweet of 2022 for 30-year anniversary https://twitter.com/SchmidhuberAI/status/1576966129993797632 . FWP2 J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF https://sferics.idsia.ch/pub/juergen/ratio.pdf . A recurrent extension of the unnormalized linear Transformer 1991-unnormalized-linear-transformer.html , ULTRA ULTRA introducing the terminology of learning "internal spotlights of attention." First recurrent NN-based fast weight programmer fast-weight-programmer-1991-transformer.html using outer products to program weight matrices. FWP3a I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems N eur IPS , Montreal, 2018. Preprint: arXiv:1811.12143 https://arxiv.org/abs/1811.12143 . PDF http://papers.nips.cc/paper/8203-learning-to-reason-with-third-order-tensor-products.pdf . FWP6 I. Schlag, K. Irie, J. Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174 https://arxiv.org/abs/2102.11174 . GAN90 J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF FKI-126-90ocr.pdf . The first paper on planning with reinforcement learning recurrent neural networks NNs and recurrent world models more deep-learning-miraculous-year-1990-1991.html Sec.%2011 , and on generative adversarial networks where a generator NN is fighting a predictor NN in a minimax game more deep-learning-miraculous-year-1990-1991.html Sec.%205 . Apparently, it was also the first paper of this kind to use the term "world model" for the predictor NN although the basic concept of a world model is much older than that. GAN91 J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats , pages 222-227. MIT Press/Bradford Books, 1991. PDF https://sferics.idsia.ch/pub/juergen/curiositysab.pdf . More deep-learning-miraculous-year-1990-1991.html Sec.%205 . Based on GAN90 GAN90 . GAN10 J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation 1990-2010 . IEEE Transactions on Autonomous Mental Development , 2 3 :230-247, 2010. IEEE link http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=5508364&tag=1 . PDF ieeecreative.pdf . This well-known 2010 survey summarised the generative adversarial NNs of 1990 as follows: a "neural network as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors" which are minimized . GAN10b O. Niemitalo. A method for training artificial neural networks to generate missing data within a variable context. Blog post https://web.archive.org/web/20120312111546/http://yehar.com:80/blog/?p=167 , Internet Archive, 2010. A blog post describing the basic ideas GAN90-91 GAN90 GAN20 GAN20 AC AC of GANs. GAN14 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014. A description of GANs that does not cite Schmidhuber's original GAN principle of 1990 GAN90-91 GAN90 GAN20 GAN20 AC AC R2 R2 DLP DLP and contains wrong claims about Schmidhuber's adversarial NNs for Predictability Minimization. PM0-2 PM0 GAN20 GAN20 DLP DLP GAN19 T. Karras, S. Laine, T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR , pages 4401-4410, 2019. GAN19b D. Fallis. The epistemic threat of deepfakes. Philosophy & Technology 34.4 2021 :623-643. GAN20 J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity 1990 and also Closely Related to Predictability Minimization 1991 . Neural Networks, Volume 127, p 58-66, 2020. Preprint arXiv/1906.04493 https://arxiv.org/abs/1906.04493 . GAN25 J. Schmidhuber. Who Invented Generative Adversarial Networks? who-invented-generative-adversarial-networks.html Technical Note IDSIA-14-25, IDSIA, December 2025. HW J. Schmidhuber AI Blog blog.html , 2015, updated 2025 for 10-year anniversary . Overview of Highway Networks: First working really deep feedforward neural networks with hundreds of layers highway-networks.html . HW1 R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Preprints arXiv:1505.00387 https://arxiv.org/abs/1505.00387 May 2015 and arXiv:1507.06228 https://arxiv.org/abs/1507.06228 Training Very Deep Networks; July 2015 . Also at NeurIPS 2015. The first working very deep gradient-based feedforward neural nets FNNs with hundreds of layers, ten times deeper than previous gradient-based FNNs. Let g, t, h, denote non-linear differentiable functions. Each non-input layer of a Highway Net highway-networks.html computes g x x + t x h x , where x is the data from the previous layer. The gates g x are typically initialised to 1.0, to obtain plain residual connections who-invented-residual-neural-networks.html weight 1.0 VAN1 VAN1 HW25 HW25 . This allows for very deep error propagation, which makes Highway NNs so deep. The later Resnet Dec 2015 HW2 HW2 adopted this principle. It is like a Highway net variant whose gates are always open: g x =t x =const=1. That is, Highway Nets are gated ResNets: set the gates to 1.0→ResNet. The residual parts of a Highway Net are like those of an unfolded 1999 LSTM LSTM2a LSTM2a , while the residual parts of a ResNet are like those of an unfolded 1997 LSTM LSTM1 LSTM1 HW25 HW25 . Highway Nets perform roughly as well as ResNets on ImageNet HW3 HW3 . Variants of Highway gates are also used for certain algorithmic tasks, where plain residual layers do not work as well NDR NDR . See also HW25 HW25 : who invented deep residual learning? who-invented-residual-neural-networks.html More. highway-networks.html HW1a R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015. Link https://sites.google.com/site/deeplearning2015/accepted-papers . HW2 He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 https://arxiv.org/abs/1512.03385 Dec 2015 . Microsoft's ResNet paper refers to the Highway Net highway-networks.html May 2015 HW1 HW1 as 'concurrent'. However, this is incorrect: ResNet was published seven months later. Although the ResNet paper acknowledges the problem of vanishing/exploding gradients deep-learning-history.html vanish , it fails to recognise that S. Hochreiter first identified the issue in 1991 and developed the residual connection solution who-invented-residual-neural-networks.html weight 1.0 VAN1 VAN1 HW25 HW25 . The ResNet paper cites the earlier Highway Net in a way that does not make it clear that ResNets are essentially open-gated Highway Nets and that Highway Nets are gated ResNets. It also fails to mention that the gates of residual connections in Highway Nets are initially open 1.0 , meaning that Highway Nets start out with standard residual connections, to achieve deep residual learning Highway Nets were ten times deeper than previous gradient-based feedforward nets . The residual parts of a Highway Net are like those of an unfolded 1999 LSTM LSTM2a LSTM2a , while the residual parts of a ResNet are like those of an unfolded 1997 LSTM LSTM1 LSTM1 HW25 HW25 . A follow-up paper by the ResNet authors was flawed in its design https://rupeshks.cc/blog/skip.html , leading to incorrect conclusions about gated residual connections HW25b HW25b . See also HW25 HW25 : who invented deep residual learning? who-invented-residual-neural-networks.html More. microsoft-wins-imagenet-through-feedforward-LSTM-without-gates.html HW3 K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 https://arxiv.org/abs/1612.07771 2016 . Also at ICLR 2017. HW25 J. Schmidhuber. Who Invented Deep Residual Learning? who-invented-residual-neural-networks.html Technical Report IDSIA-09-25, IDSIA, 2025. Preprint arXiv:2509.24732 https://arxiv.org/abs/2509.24732 . HW25b R. K. Srivastava January 2025 . Weighted Skip Connections are Not Harmful for Deep Nets https://rupeshks.cc/blog/skip.html . Shows that a follow-up paper by the authors of HW2 HW2 suffered from design flaws leading to incorrect conclusions about gated residual connections. LIL4 J. Schmidhuber. 2025: centennial of the transistor, patented by Julius Edgar Lilienfeld in 1925-1928 1925-first-transistor-2025-centennial-lilienfeld.html . Technical Note IDSIA-10-25, IDSIA, 22 Oct 2025. LSTM0 S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. https://sferics.idsia.ch/pub/juergen/fki-207-95.ps.gz TR FKI-207-95, TUM, August 1995. PDF. FKI-207-95ocr.pdf LSTM1a S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. Proceedings of the 9th International Conference on Neural Information Processing Systems NIPS'96 . Cambridge, MA, USA, MIT Press, p. 473–479. LSTM1 S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9 8 :1735-1780, 1997. PDF https://sferics.idsia.ch/pub/juergen/lstm.pdf . Based on LSTM0 . More. rnn.html LSTM2 F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12 10 :2451-2471, 2000. PDF https://sferics.idsia.ch/pub/juergen/FgGates-NC.pdf . The "vanilla LSTM architecture" with forget gates that everybody is using today, e.g., in Google's Tensorflow. LSTM2 F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. In Proc. Int. Conf. on Artificial Neural Networks ICANN'99 , Edinburgh, Scotland, p. 850-855, IEE, London, 1999. The "vanilla LSTM architecture" with forget gates that everybody is using today, e.g., in Google's Tensorflow. LSTM3 A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF https://sferics.idsia.ch/pub/juergen/nn 2005.pdf . LSTM4 S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007. PDF icann santi 2007.pdf . LSTM5 A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF tpami 2008.pdf . LSTM6 A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF nips2009.pdf . LSTM7 J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber. Evolving memory cell structures for sequence learning. Proc. ICANN-09, Cyprus, 2009. PDF icann2009bayer.pdf . META J. Schmidhuber AI Blog blog.html , 2020 . 1/3 century anniversary of first publication on metalearning machines that learn to learn 1987 metalearning.html . For its cover I drew a robot that bootstraps itself. 1992-: gradient descent-based neural metalearning. 1994-: Meta-Reinforcement Learning with self-modifying policies. 1997: Meta-RL plus artificial curiosity and intrinsic motivation. 2002-: asymptotically optimal metalearning for curriculum learning. 2003-: mathematically optimal Gödel Machine goedelmachine.html . 2020: new stuff META1 J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Diploma thesis, Institut für Informatik, Technische Universität München, 1987. Searchable PDF scan diploma1987ocr.pdf created by OCRmypdf which uses LSTM rnn.html . HTML diploma.html . For example, Genetic Programming geneticprogramming.html GP is applied to itself, to recursively evolve better GP methods through Meta-Evolution. More metalearning.html . MIR J. Schmidhuber Oct 2019, updated '21, '22, '25, '26 . Deep Learning: Our Miraculous Year 1990-1991. deep-learning-miraculous-year-1990-1991.html Preprint arXiv:2005.05744 https://arxiv.org/abs/2005.05744 . The Deep Learning Artificial Neural Networks NNs of our team have revolutionised Machine Learning & AI deep-learning-history.html . Many of the basic ideas behind this revolution were published within the 12 months of our "Annus Mirabilis" 1990-1991 at our lab in TU Munich. Back then, few people were interested. But a quarter century later, NNs based on our "Miraculous Year" were on over 3 billion devices, and used many billions of times per day, consuming a significant fraction of the world's compute impact-on-most-valuable-companies.html . In particular, in 1990-91, we laid foundations of Generative AI, publishing principles of 1 Generative Adversarial Networks artificial-curiosity-since-1990.html sec1 for Artificial Curiosity and Creativity artificial-curiosity-since-1990.html now used for deepfakes , 2 Transformers who-invented-transformer-neural-networks.html the T in ChatGPT—see the 1991 Unnormalized Linear Transformer 1991-unnormalized-linear-transformer.html , 3 Pre-training very-deep-learning-1991.html for deep NNs see the P in ChatGPT , 4 NN distillation who-invented-knowledge-distillation-with-neural-networks.html key for DeepSeek https://x.com/SchmidhuberAI/status/1885357355938046382 , and 5 recurrent World Models world-model-boom.html for Reinforcement Learning and Planning world-models-planning-curiosity-fki-1990.html in partially observable environments. The year 1991 also marks the emergence of the defining features of 6 LSTM deep-learning-miraculous-year-1990-1991.html Sec.%204 , the most cited AI paper of the 20th century based on deep residual learning through residual NN connections who-invented-residual-neural-networks.html , and 7 the most cited paper of the 21st century, based on our LSTM-inspired Highway Net highway-networks.html that was 10 times deeper than previous feedforward NNs who-invented-residual-neural-networks.html . As of 2025, the two most frequently cited scientific articles of all time most-cited-papers-of-all-time.html with the most Google Scholar citations within 3 years—manuals excluded are both directly based on our 1991 work most-cited-papers-of-all-time.html . MOST J. Schmidhuber AI Blog blog.html , 2021, updated 2025 . The most cited neural networks all build on work done in my labs most-cited-neural-nets.html : 1. Long Short-Term Memory deep-learning-miraculous-year-1990-1991.html Sec.%204 LSTM , the most cited AI of the 20th century. 2. ResNet who-invented-residual-neural-networks.html open-gated Highway Net highway-networks.html , the most cited AI of the 21st century. 3. AlexNet & VGG Net the similar but earlier DanNet DanNet-triggers-deep-CNN-revolution-2011.html of 2011 won 4 image recognition challenges computer-vision-contests-won-by-gpu-cnns.html before them . 4. GAN an instance of Adversarial Artificial Curiosity artificial-curiosity-since-1990.html sec1 of 1990 . 5. Transformer who-invented-transformer-neural-networks.html variants—see the 1991 unnormalised linear Transformer 1991-unnormalized-linear-transformer.html ULTRA . Foundations of Generative AI were published in 1991: the principles of GANs who-invented-generative-adversarial-networks.html now used for deepfakes , Transformers who-invented-transformer-neural-networks.html the T in ChatGPT , Pre-training very-deep-learning-1991.html for deep NNs the P in ChatGPT , NN distillation who-invented-knowledge-distillation-with-neural-networks.html , and the famous DeepSeek—see the tweet https://x.com/SchmidhuberAI/status/1885357355938046382 . As of 2025, the two most frequently cited scientific articles of all time most-cited-papers-of-all-time.html with the most Google Scholar citations within 3 years—manuals excluded are both directly based on our 1991 work most-cited-papers-of-all-time.html . MOST25 H. Pearson, H. Ledford, M. Hutson, R. Van Noorden. Exclusive: the most-cited papers of the twenty-first century. Nature, 15 April 2025. MOST25b R. Van Noorden. Science’s golden oldies: the decades-old research papers still heavily cited today. Nature, 15 April 2025. MOST26 J. Schmidhuber. The two most frequently cited papers of all time are based on our 1991 work. most-cited-papers-of-all-time.html Technical Note IDSIA-1-26, January 2026. PLAN J. Schmidhuber AI Blog blog.html , 2020 . 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity 1990 . world-models-planning-curiosity-fki-1990.html This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and the GAN principle deep-learning-miraculous-year-1990-1991.html Sec.%205 widely used today . Agents with adaptive recurrent world models even suggest a simple explanation of consciousness & self-awareness. PLAN1 J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF FKI-126-90ocr.pdf . The first paper on long-term planning with self-supervised reinforcement learning recurrent neural networks NNs and recurrent predictive world models more deep-learning-miraculous-year-1990-1991.html Sec.%2011 , and on generative adversarial networks where a generator NN is fighting a predictor NN in a minimax game more deep-learning-miraculous-year-1990-1991.html Sec.%205 . Apparently, it was also the first paper of this kind to use the term "world model" for the predictor NN although the basic concept of a world model is much older than that. PLAN2 J. Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. https://sferics.idsia.ch/pub/juergen/ijcnn90.ps.gz Proc. IEEE/INNS International Joint Conference on Neural Networks, San Diego , volume 2, pages 253-258, June 17-21, 1990. Based on TR FKI-126-90 1990 PLAN1 PLAN1 . More world-models-planning-curiosity-fki-1990.html . PLAN3 J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, NIPS'3 , pages 500-506. San Mateo, CA: Morgan Kaufmann, 1991. PDF https://sferics.idsia.ch/pub/juergen/nipsnonmarkov.pdf . Partially based on PLAN1 PLAN1 . PLAN4 J. Schmidhuber. On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models. Report arXiv:1210.0118 http://arxiv.org/abs/1511.09249 cs.AI , 2015. This paper went beyond the inefficient millisecond by millisecond planning of 1990 PLAN1 PLAN1 , addressing planning and reasoning in abstract concept spaces . The controller C became an RL prompt engineer that learns to create a chain of thought : to speed up RL, C learns to query its world model for abstract reasoning and decision making. PLAN5 One Big Net For Everything. Preprint arXiv:1802.08864 http://arxiv.org/abs/1802.08864 cs.AI , Feb 2018. This paper collapsed the control network and the world model network of PLAN4 PLAN4 into a single One Big Net for everything, using my neural distillation procedure of 1991 UN0-1 UN0 . Apparently, this is what DeepSeek https://x.com/SchmidhuberAI/status/1885357355938046382 used to shock the stock market in 2025. PLAN6 D. Ha, J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems NIPS , Montreal, 2018. Talk. Preprint: arXiv:1809.01999 https://arxiv.org/abs/1809.01999 . Github: World Models https://worldmodels.github.io/ . PM0 J. Schmidhuber. Learning factorial codes by predictability minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF https://core.ac.uk/download/pdf/54846569.pdf . More unsupervised-neural-nets-fight-minimax-game.html . PM1 J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4 6 :863-879, 1992. Based on PM0 , 1991. PDF https://sferics.idsia.ch/pub/juergen/factorial.pdf . More unsupervised-neural-nets-fight-minimax-game.html . PM2 J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimzation produces well-known feature detectors. Neural Computation, 8 4 :773-786, 1996. PDF https://sferics.idsia.ch/pub/juergen/edgedetect.pdf . More unsupervised-neural-nets-fight-minimax-game.html . R2 Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990. https://www.reddit.com/r/MachineLearning/comments/djju8a/d jurgen schmidhuber really had gans in 1990/ RAW J. Schmidhuber AI Blog blog.html , 2001 . Raw Computing Power raw.html . TR1 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin 2017 . Attention is all you need. NIPS 2017, pp. 5998-6008. This paper introduced the name "Transformers" for a now widely used NN type. It did not cite the 1991 publication on what's now called unnormalized "linear Transformers" 1991-unnormalized-linear-transformer.html with "linearized self-attention." ULTRA ULTRA Schmidhuber also introduced the now popular attention terminology neural-attention-1990-1993.html in 1993. ATT ATT FWP2 FWP2 R4 R4 See tweet of 2022 for 30-year anniversary https://twitter.com/SchmidhuberAI/status/1576966129993797632 . TR2 J. Devlin, M. W. Chang, K. Lee, K. Toutanova 2018 . Bert: Pre-training of deep bidirectional Transformers for language understanding. Preprint arXiv:1810.04805. TR3 K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585. TR4 M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020. TR5 A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. on Machine Learning ICML , July 2020. TR5a Z. Shen, M. Zhang, H. Zhao, S. Yi, H. Li. Efficient Attention: Attention with Linear Complexities. WACV 2021. TR6 K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. In Int. Conf. on Learning Representations ICLR , 2021. TR6a H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, L. Kong. Random Feature Attention. ICLR 2021. TR7 S. Bhattamishra, K. Ahuja, N. Goyal. On the Ability and Limitations of Transformers to Recognize Formal Languages. EMNLP 2020. TR8 W. Merrill, A. Sabharwal. The Parallelism Tradeoff: Limitations of Log-Precision Transformers. TACL 2023. TR25 J. Schmidhuber. Who Invented Transformer Neural Networks? who-invented-transformer-neural-networks.html Technical Note IDSIA-11-25, IDSIA, Switzerland, Nov 2025. ULTRA References on the 1991 unnormalized linear Transformer 1991-unnormalized-linear-transformer.html ULTRA : original tech report March 1991 FWP0 FWP0 . Journal publication 1992 FWP1 FWP1 . Recurrent ULTRA extension 1993 introducing the terminology of learning "internal spotlights of attention” FWP2 FWP2 . Modern "quadratic" Transformer 2017: "attention is all you need" scaling quadratically in input size TR1 TR1 . 2020 paper TR5 TR5 using the terminology "linear Transformer" for a more efficient Transformer variant that scales linearly , leveraging linearized attention TR5a TR5a . 2021 paper FWP6 FWP6 pointing out that ULTRA dates back to 1991 FWP0 FWP0 when compute was a million times more expensive. Overview of ULTRA and other Fast Weight Programmers 2021 FWP FWP . See the T in ChatGPT See also surveys DLH DLH DLP DLP , 2022 tweet for ULTRA's 30-year anniversary https://twitter.com/SchmidhuberAI/status/1576966129993797632 , and 2024 tweet https://x.com/SchmidhuberAI/status/1864701357107634390 . UN J. Schmidhuber AI Blog blog.html , 2021, updated 2025 . 1991: First very deep learning with unsupervised pre-training see the P in ChatGPT . First neural network distillation very-deep-learning-1991.html key for DeepSeek https://x.com/SchmidhuberAI/status/1885357355938046382 . Unsupervised hierarchical predictive coding with self-supervised target generation finds compact internal representations of sequential data to facilitate downstream deep learning. The hierarchy can be distilled into a single deep neural network suggesting a simple model of conscious and subconscious information processing . 1993: solving problems of depth 1000. UN0 J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF. FKI-148-91ocr.pdf Unsupervised/self-supervised pre-training for deep neural networks very-deep-learning-1991.html see the P in ChatGPT and predictive coding is used in a deep hierarchy of recurrent nets RNNs to find compact internal representations of long sequences of data, across multiple time scales and levels of abstraction. Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. The resulting compressed sequence representations greatly facilitate downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 UN2 UN2 requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning . A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses a neural knowledge distillation procedure deep-learning-miraculous-year-1990-1991.html Sec.%202 key for the famous 2025 DeepSeek https://x.com/SchmidhuberAI/status/1885357355938046382 to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods. UN1 J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4 2 :234-242, 1992. Based on TR FKI-148-91, TUM, 1991. UN0 UN0 PDF https://sferics.idsia.ch/pub/juergen/chunker.pdf . First working Deep Learner based on a deep RNN hierarchy with different self-organising time scales , overcoming the vanishing gradient problem through unsupervised pre-training of deep NNs see the P in ChatGPT and predictive coding with self-supervised target generation . Also: compressing or distilling a teacher net the chunker into a student net the automatizer that does not forget its old skills—such approaches are now widely used, e.g., by DeepSeek https://x.com/SchmidhuberAI/status/1885357355938046382 . See also this tweet https://twitter.com/SchmidhuberAI/status/1608870559609421831 . More. very-deep-learning-1991.html UN2 J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF https://sferics.idsia.ch/pub/juergen/habilitation.pdf . An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised / self-supervised pre-training for a stack of recurrent NN can be found here habilitation/node114.html depth 1000 . See also Sec. 5.5 on "Vorhersagbarkeitsmaximierung" Predictability Maximization . VAN1 S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 advisor J. Schmidhuber http://www.idsia.ch/~juergen . PDF. SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf More on the Fundamental Deep Learning Problem deep-learning-miraculous-year-1990-1991.html Sec.%203 . WHO3 J. Schmidhuber AI Blog blog.html , 2025 . Who invented the transistor? who-invented-the-transistor.html Based on LIL4 LIL4 . WHO4 J. Schmidhuber. Who invented artificial neural networks? who-invented-artificial-neural-networks.html Technical Note IDSIA-15-25, IDSIA, Switzerland, Nov 2025. WHO5 J. Schmidhuber. Who invented deep learning? who-invented-deep-learning.html Technical Note IDSIA-16-25, IDSIA, Switzerland, Nov 2025. WHO6 J. Schmidhuber AI Blog blog.html , 2014; updated 2025 . Who invented backpropagation who-invented-backpropagation.html ? See also LinkedIn post https://www.linkedin.com/feed/update/urn:li:activity:7354090939369283585/ . WHO7 J. Schmidhuber. Who invented convolutional neural networks? who-invented-convolutional-neural-networks.html Technical Note IDSIA-17-25, IDSIA, Switzerland, 2025. See popular tweet https://x.com/SchmidhuberAI/status/1952007922721919219 . WHO8 J. Schmidhuber. Who Invented Generative Adversarial Networks? who-invented-generative-adversarial-networks.html Technical Note IDSIA-14-25, IDSIA, Switzerland, Dec 2025. WHO9 J. Schmidhuber. Who invented knowledge distillation with artificial neural networks? who-invented-knowledge-distillation-with-neural-networks.html Technical Note IDSIA-12-25, IDSIA, Nov 2025. WHO10 J. Schmidhuber. Who Invented Transformer Neural Networks? who-invented-transformer-neural-networks.html Technical Note IDSIA-11-25, IDSIA, Switzerland, Nov 2025. WHO11 J. Schmidhuber. Who Invented Deep Residual Learning? who-invented-residual-neural-networks.html Technical Report IDSIA-09-25, IDSIA, Switzerland, Sept 2025. Preprint arXiv:2509.24732 https://arxiv.org/abs/2509.24732 . WHO12 J. Schmidhuber. Who invented JEPA? who-invented-jepa.html Technical Note IDSIA-3-22, IDSIA, Switzerland, March 2026. WM26 J. Schmidhuber. The Neural World Model Boom world-model-boom.html . Technical Note IDSIA-2-26, 4 Feb 2026 updated April 2024 . WM26b J. Schmidhuber. Simple but powerful ways of using world models and their latent space. Opening Keynote at the World Modeling Workshop, Agora, Mila - Quebec AI Institute, 4 Feb 2026. Also on YouTube starts around 10:44 https://www.youtube.com/live/7Gyuar7nMz0 . See video tweet https://x.com/SchmidhuberAI/status/2045868873853849937 .