AIs will, at least initially, be highly motivated to protect humans rather than kill them. Such AIs will have no major incentive to, say, exterminate humanity like in the Schwarzenegger movies. Instead, many AIs will be curious scientists, and they will be fascinated
with life, fascinated because life and civilization are such a rich source of interesting patterns, at least as long as they are not fully understood.
Today, I think it is possible that our planet is really the first in our light cone to spawn an expanding AI bubble. If we are indeed the first, then this would imply a lot of responsibility, not just for our little biosphere, but for the future of the entire universe. Let's not mess this up.
MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads. These guys have done insane optimizations, and their CEO and co-founder Gennady explained many of them when he came on the show last month. They know the secret sauce and they've embedded it all into their platform. And that's passed on to you in terms of it being cheaper, faster and just better.
So what are you waiting for? Go to sentml.ai and sign up now.
I'm Benjamin Crousier. I'm starting an AI research lab called Tufa Labs, funded from past ventures involving machine learning. We're a small group of very motivated, hard-working people. We are hiring both chief scientists and deep learning engineers and researchers. We want to investigate reverse engineering and explore the techniques ourselves. Because we're early, there's going to be high freedom and high impact
as someone new at Tufa Labs. Jürgen, welcome to MLST. It's an absolute honor to have you on the show. My pleasure. Thank you for having me. So before we move on to the great technological advances of the new century, can you tell me a little bit about the most influential invention of the previous century? So at the end of the previous century, in 1999, the journal Nature
made a list of the most influential inventions of that century. And Václav Smil argued that the most influential one was the invention that let the 20th century stand out among all centuries of all times, because that invention detonated the population explosion from 1.6 billion people in
1900 to soon about 10 billion people. There was one single invention that was driving all of that, and without that one single invention, half of humankind would not even exist, because it's the driver of this population explosion that we have witnessed. We don't know whether it is a good thing or a bad thing, but it was surely the most influential thing that happened in the previous century.
80% of the air is nitrogen, and plants need it to grow,
but they cannot extract the nitrogen from thin air. And back then, around 1908, people had known for half a century that they needed that stuff, but they didn't know how to extract it to make artificial fertilizer. Enter the Haber process, or the Haber-Bosch process, which under high temperatures and high pressures extracts the nitrogen to make artificial fertilizer.
So what will be the most important thing in the 21st century? The grand theme of the 21st century is even grander. True AI, true artificial intelligence,
is going to change civilization completely and AIs will learn to do anything which humans can do and more. And there will be an AI explosion and the human explosion or the population explosion of the humans is going to pale in comparison. I mean, do you think that the AI intelligence explosion is possible or desirable? And don't you think our sense-making and agency
is part of our purpose. Our sense-making process is part of our purpose. I agree with that. But all of that
is just part of this grander process of the evolution of the universe from very simple initial conditions to more and more unfathomable complexity. And this evolution led to our sense-making process, which is currently setting the stage for something that goes beyond it.
Modern large language models like ChatGPT, they're based on self-attention transformers and even given their obvious limitations, they are a revolutionary technology. Now you must be really happy about that because, you know, a third of a century ago you published the first transformer variants. What are your reflections on that today? In fact, in 1991, when compute was maybe
5 million times more expensive than today. I published this model that you mentioned, which is now called the "Unnormalized Linear Transformer". I had a different name for it. I called it a "Fast Weight Controller", but names are not important. The only thing that counts is the math. So, this linear transformer
is a neural network with lots of non-linear operations within the network. So it's a bit weird that it's called a linear transformer. However, the "linear", and that's important, refers to something else. It refers to scaling. A standard transformer of 2017, a quadratic transformer: if you give it 100 times
as much input, then it needs 100 times 100, that is, 10,000 times as many computations. And the linear transformer of 1991 needs only 100 times the compute,
which makes it very interesting actually because at the moment many people are trying to come up with more efficient transformers and this old linear transformer of 1991 is therefore a very interesting starting point for additional improvements of transformers and similar models. So what did the linear transformer do? Assume the goal is to
predict the next word in a chat, given the chat so far. And essentially, the linear transformer of 1991 does this: To minimize its error, it learns to generate patterns that, in modern transformer terminology, are called keys and values. Back then, I called them "from" and "to", but that's just terminology.
And it does that to reprogram parts of itself such that its attention is directed in a context-dependent way to what is important. And a good way of thinking about this linear transformer is this: The traditional artificial neural networks have storage and control all mixed up.
The linear transformer of 1991, however, is a novel neural network system that separates storage and control, like traditional computers do. In traditional computers, storage and control have been separate for many decades,
and the control learns to manipulate the storage. And so with these linear transformers you also have a slow network, which learns by gradient descent to compute the weight changes of a fast weight network.
How? It learns to create these vector-valued key patterns and value patterns and uses the outer products of these keys and values to compute rapid weight changes of the fast network. And then the fast network is applied to vector-valued queries which are coming in.
So essentially in this fast network the connections between strongly active parts of the keys and the values get stronger and others get weaker. This is a fast weight update rule which is completely differentiable, which means you can propagate through it. So you can use it as part of a larger learning system which learns to
back-propagate errors through these dynamics and then learns to generate good keys and good values in certain contexts, such that the entire system can reduce its error and become a better and better predictor of the next word in the chat.
So sometimes people today call that a fast weight matrix memory, and the modern quadratic transformers use in principle exactly the same approach.
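To make the mechanics concrete, here is a minimal NumPy sketch of the fast-weight idea as just described. All names and dimensions are my own illustrative choices, not from any particular codebase: a slow network maps each input to a key, a value and a query, the outer product of value and key rewrites a fast weight matrix, and the fast matrix is then applied to the query. Each step costs the same regardless of how long the sequence already is, which is the "linear" scaling mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_key = 16, 8        # input and key/value dimensions (illustrative)

# "Slow" network: ordinary trainable weights mapping an input to key, value, query.
# Here they are just fixed random linear maps; in a real system they would be
# trained by backpropagating through the fast-weight updates below.
W_k = 0.1 * rng.normal(size=(d_key, d_in))
W_v = 0.1 * rng.normal(size=(d_key, d_in))
W_q = 0.1 * rng.normal(size=(d_key, d_in))

# "Fast" network: a weight matrix that is rewritten at every time step.
F = np.zeros((d_key, d_key))

def step(x):
    """Process one input; cost is O(d^2), independent of sequence length."""
    global F
    k, v, q = W_k @ x, W_v @ x, W_q @ x
    F = F + np.outer(v, k)   # keys and values program the fast net (outer-product update)
    return F @ q             # the fast net is applied to the incoming query

outputs = [step(rng.normal(size=d_in)) for _ in range(100)]
print(len(outputs), outputs[-1].shape)   # 100 steps, each output of size d_key
```

Because the update rule is differentiable, errors can be propagated back through it, which is how the slow network learns to produce useful keys and values in the first place.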
You mentioned your fabulous year, 1991, when so much of this amazing stuff happened, actually at the Technical University of Munich. So you had invented the T in ChatGPT, the transformer, and also the P in ChatGPT, the pre-trained network, as well as the first adversarial networks, GANs. Could you say a little bit more about that?
Yeah, so the transformer of 1991 was a linear transformer, so it's not exactly the same as the quadratic transformer of today. But nevertheless, it's using these transformer principles and the P in GPT, yeah, that's the pre-training. And back then, deep learning didn't work, but then we had networks that could
use predictive coding to greatly compress long sequences, such that suddenly you could work on this reduced space of compressed data descriptions, and deep learning became possible where it wasn't possible before. And then the generative adversarial networks, also in the same years, 1990 to 1991. How did that work? Well, back then we
had two networks. One is the controller, and the controller has certain probabilistic stochastic units within itself, and they can learn the mean and the variance of a Gaussian, and there are other nonlinear units in there. And then it is a generative network that generates outputs, output patterns.
actually probability distributions over these output patterns. And then another network, the prediction machine, the predictor, learns to look at these outputs of the first network and learns to predict their effects in the environment. So to become a better predictor, it's minimizing its error, predictive error. And at the same time, the controller is trying to generate outputs where the second network is still surprised.
So the first guy tries to fool the second guy, trying to maximize the same objective function that the second network is minimizing. So today this is called generative adversarial networks. And I didn't call that generative adversarial networks, I called it artificial curiosity, because you can use the same principle to let robots explore the environment.
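As a sketch of that adversarial curiosity principle (everything here is a toy of my own devising, not the 1990 setup itself): a predictor is trained to minimize its prediction error about the outcomes of actions, while the controller treats that very error as its reward, so it is drawn to whatever remains hard to predict.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
# Toy "environment": each action yields a stochastic binary outcome; some
# actions are far more predictable than others (probabilities are made up).
p_outcome = np.array([0.95, 0.7, 0.5, 0.05])

pred = np.full(n_actions, 0.5)    # predictor: estimated outcome probability per action
value = np.zeros(n_actions)       # controller: running estimate of its curiosity reward

lr = 0.05
for t in range(5000):
    # epsilon-greedy: usually pick the action that currently looks most surprising
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(value))
    outcome = float(rng.random() < p_outcome[a])

    err = (outcome - pred[a]) ** 2        # the predictor's loss IS the controller's reward
    pred[a] += lr * (outcome - pred[a])   # predictor minimizes its prediction error
    value[a] += lr * (err - value[a])     # controller tracks where the error stays high

print(np.round(value, 3))   # action 2 (a 50/50 coin) remains the most "interesting"
```

In the 1990 version both parts are neural networks trained by gradient descent on the same objective with opposite signs, which is the minimax game described above.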
The controller is now generating actions that lead to behavior of the robot. The prediction machine is trying to predict what's going to happen and it's trying to minimize its own error. And the other guy is trying to come up with good experiments that lead to data where the predictor or the discriminator as it is now called can still learn something. So when did you realize that modern computers are good enough to run the technology that you invented so long ago?
By 2009, compute was cheap enough that our LSTM, through the efforts of my former PhD student Alex Graves, could win competitions. And that was in handwriting recognition and fields like that. And then in 2010, my separate team, with my postdoc Dan Ciresan from Romania, broke the MNIST benchmark with another approach, with
standard old traditional neural networks implemented on NVIDIA GPUs. So for the first time, in 2010, we had really deep supervised networks that outperformed everything else on this benchmark, which was famous back then.
Back then, compute was maybe 1,000 times more expensive than today. And then, in 2011, came DanNet, named after Dan Ciresan. And DanNet had a monopoly on winning computer vision contests with GPU-based convolutional neural networks. DanNet's first superhuman result was also achieved in 2011.
So it started in 2011, and then four computer vision competitions in a row were won by DanNet. And that's when it became clear: now there's a new way of using these old neural networks of the previous millennium to really change computer science.
- Yeah, I'm interested in this concept called the hardware lottery. Sarah Hooker wrote a paper with the same title, I think in 2020, when she was at Google Brain. She's now at Cohere actually, but she basically said that the only reason we have the current surge in AI is because we created all of these GPUs for computer games.
And it was just fortuitous that that allowed us to build all of these deep learning models. I mean, what's your take on that? Yeah, she is kind of right. You need lots of
matrix multiplications to compute how the screen should change as you are moving through an ego shooter game. That's why gaming was pretty much the first industry that greatly profited from massively parallel matrix multiplications on GPUs.
Towards 2010, however, we realized that the same matrix multiplications can greatly speed up these old deep learning methods.
can speed them up enough to beat all the other methods. Yeah, that's really interesting because, of course, NVIDIA, I think just last week, became the world's most valuable company, and it is of course hundreds of times more valuable than it was in 2010. What do you think about that?
Indeed, NVIDIA's CEO Jensen Huang realized that deep learning could take his company to stratospheric levels. And it did. Interesting. Okay, so if I understand, your main argument is that we just needed to wait for the compute to catch up. And now here in the 21st century, here we are. Yes. So all of...
What we are experiencing today is based on stuff that was invented in the previous millennium, but it had to scale up. So the hardware was invented back then and the
algorithms were invented back then, but the industrial processes for making faster and faster parallel GPUs weren't as developed as they are today. And so we are really greatly profiting from this hardware acceleration. And that's the reason why AI broke through not in the previous millennium, but had to wait until the current millennium was well underway.
For example, the first convolutional neural networks, or CNNs, which we used in the DanNet of 2011, were published much earlier in Japan. In 1979, Kunihiko Fukushima had the basic deep CNN architecture with convolutional layers and downsampling layers: convolution, downsampling.
He didn't yet use backpropagation to train it, but then in 1987, Alex Waibel, another guy working in Japan, originally from Germany, combined convolutions with backpropagation,
the method invented, or published, by Seppo Linnainmaa, the Finnish guy in Helsinki, in 1970. And then in 1988, Zhang, also publishing in Japan, had the two-dimensional CNNs that everybody is using now and combined them with backpropagation. And that's how, between 1979 and 1988, CNNs emerged in Japan,
which is kind of interesting, because back then Japan was also considered the land of the future. They had more than half of the robots of the world, and the seven most valuable companies back then were not based in America, as they are today (except for Saudi Aramco); they were all based in Japan. And the central square mile of Tokyo
had the value of California. What a difference a couple of decades make. Everything has changed. So what are your favorite examples of applications with this AI that your team has developed?
I remember when I went to China 15 years ago and I still had to show the taxi driver a picture of the hotel where I wanted to go. And today he is speaking on a smartphone in Mandarin and I hear the translation and then I say something and the smartphone translates it back into Mandarin and we can communicate
like old friends. The taxi driver probably has no idea that this is powered by techniques developed in my little labs in Munich and in Switzerland in the 90s and early 2000s. But I'm happy to see that our AI has really broken down communication barriers, not only between individual people, but between entire nations. That's pretty cool.
Yeah, I completely agree actually. I don't know if you know this, Jürgen, but I co-founded a startup called XRAI, and it does exactly what you said. It does this kind of Babel fish translation with speech recognition and TTS. So you can do exactly what you just said. And it's really interesting. I had lunch on Friday with Will, the CTO of Speechmatics, and he was telling me all about the secret sauce of how their speech recognition algorithms work. And
I better not say, but you would be delighted, I'm sure. But anyway, just sort of moving off that a little bit, what other examples can you think of? I am especially happy that our AI makes human lives longer and healthier and easier, with thousands of applications in medicine and drug design, sustainable development.
In September 2012, my team with Dan Ciresan had the first artificial neural network to win a medical imaging contest; it was about breast cancer detection in slices through the female breast. And if you go to Google Scholar and you just type in some medical topic plus LSTM, you will find
thousands of papers that have LSTM in the title, not just somewhere in the text, but in the title. And it's about, you know, learning to diagnose, ECG analysis, diagnosis of arrhythmia, cardiovascular disease risk prediction, four-dimensional image segmentation for medical images,
automated sleep stage classification, COVID detection, COVID prevention, thousands and thousands of topics. So it's really nice to see that especially in the medical field there's a lot of impact of these techniques.
Some claim that technology like ChatGPT is on the path to AGI and others claim that it's like building a taller tower trying to get closer to the moon. What do you think? Large language models, of course, are far from AGI.
LLMs, large language models such as ChatGPT, are just a clever way of indexing the world's existing human-generated knowledge such that it can easily be addressed in a way that humans are familiar with, which is natural language.
That's good enough to facilitate many desktop jobs, for example writing summaries of existing documents in a particular style or creating illustrations for an article and so on. However,
true AGI goes far beyond that. It is much harder, for example, to replace craftsmen such as plumbers or electricians, because the real world,
the physical world, is much more challenging than the world behind the screen. At the moment, the only AI that works well is behind the screen. And it's good for desktop workers, but not really for people working in the physical world. For a quarter of a century, the best chess player hasn't been human anymore.
And learning to play chess or other board games or video games is rather easy now for AIs. But real-world games such as football are much harder. There is no AI-driven, football-playing embodied robot that can compete with a seven-year-old boy, you know.
And that's why 10 years ago, in 2014, we founded our AI company for the physical world. NNAISENSE it is called, pronounced like "naissance", birth in French, except it's spelled in a different way: NN for neural nets, AI for artificial intelligence, SENSE.
Alas, like some of our projects, it may have been a bit ahead of its time again, because the real world is really, really challenging. So you've said that this is related to consciousness in some way. It is. My first deep learning system of 1991 simulates aspects of consciousness.
It uses unsupervised learning or self-supervised learning and predictive coding to compress observation sequences.
So there is a so-called conscious chunker neural network, and the chunker attends to unexpected events that surprise a lower-level, so-called automatizer, the subconscious automatizer neural network. And the chunker neural network basically learns to understand the surprising events,
so those events that were not predicted by the automatizer, by predicting them on a higher level, if there is a higher-level regularity that it can use for that. And the automatizer neural network then uses this
neural network distillation procedure, also published in 1991, to compress and absorb the formerly conscious insights and behaviors of the chunker. So the chunker is
still working on its search space; it still has a problem to solve, because unexpected stuff is happening. And then it solves it and distills it down into the automatizer, which is called the automatizer because the stuff there isn't conscious anymore: now everything is working according to plan and as predicted, and so it's all good. When we now look at the predictive world model
of the controller interacting with an environment, as discussed earlier, it also allows us to efficiently encode the growing history of actions and observations through predictive coding. What is that predictive coding? You just try to predict. If you can't predict something, then you have to store it separately in some way.
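Here is a deliberately tiny illustration of that predictive-coding idea (the bigram predictor and all names are my own stand-ins, not the 1991 architecture): a simple next-symbol predictor plays the role of the automatizer, and only the events it fails to predict, the surprises a chunker would then work on, need to be stored.

```python
from collections import defaultdict

def surprises_only(seq):
    """Keep only the symbols a simple online next-symbol predictor gets wrong."""
    counts = defaultdict(lambda: defaultdict(int))   # bigram counts: prev -> next -> freq
    prev, surprises = None, []
    for i, s in enumerate(seq):
        if prev is None:
            surprises.append((i, s))                 # nothing to predict from yet
        else:
            table = counts[prev]
            predicted = max(table, key=table.get) if table else None
            if predicted != s:                       # unexpected event: store it
                surprises.append((i, s))
            table[s] += 1                            # keep learning online
        prev = s
    return surprises

seq = list("abababababXabababab")                    # mostly predictable, one surprise
print(surprises_only(seq))   # only the start-up symbols and the surprise around 'X' survive
```

A higher-level network would then look for regularities among exactly these leftover surprises, and whatever it figures out can later be distilled back into the lower-level predictor; in the predictive world model just discussed, the same pressure to predict is what drives the compression.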
It automatically creates feature hierarchies, lower level neurons corresponding to simple feature detectors, perhaps even similar to those found in the mammalian brain, and then higher layer neurons, typically corresponding to more abstract features.
but fine-grained when necessary. And so, like any good compressor, the predictive world model will learn to identify irregularities shared by existing internal data structures and it will generate prototype encodings
across neuron populations, or, in other words, compact representations or symbols, if you will (not necessarily discrete symbols; I never saw the precise difference between symbols and sub-symbols). It will create such symbols for frequently occurring observation sub-sequences to shrink the storage space needed for the whole. And so, in particular,
what we will notice in such a system is that compact self-representations, or self-symbols, are just natural byproducts of the data compression process, since, as the agent is interacting with the world, there is one thing that is involved in all actions and sensory inputs of the agent, which is
the agent itself. And to efficiently encode, through predictive coding, the entire history of actions executed so far and observations observed so far, it will profit from creating some sort of internal sub-network of connected neurons computing neural activation patterns
representing the agent itself. And then it has a self symbol. And so whenever the planner, the world model of the agent is used to think about the future and what could be possible action sequences to maximize reward, whenever that happens and whenever this planning process wakes up the self symbol or these neurons that stand for the agent itself,
then the agent is thinking about itself and about possible futures of this agent. Essentially it's doing counterfactual reasoning as it is now called, just planning to find a way to optimize its reward and the self-awareness
is just a natural byproduct of the data compression process of the world model as the agent is interacting with the world and creating the data that leads to the world model. So, since we have had such systems for more than a third of a century, I'm always claiming that we already had self-aware and conscious
systems for more than three decades. Yeah, a couple of points on that. I guess consciousness invokes many different thoughts; like, David Chalmers coined the hard problem, which is the what and how question of qualitative experience. You've just described it in terms of self-modeling, which is quite similar to what Max Bennett did in his recent A Brief History of Intelligence. And we've got six hours of content coming out on that with Max, by the way.
But, you know, Mark Solms, for example, thinks of consciousness as an affect system. And Michael Graziano thinks of consciousness as a kind of, you know, recursive attention system. And I guess I'm saying consciousness means different things to different people, right? Yes, but there is only one correct way of thinking about it. Okay.
Yeah, the thing we spoke about earlier, about learning sub-goals and the coarsening of the action space, reminded me a little bit of Yann LeCun's H-JEPA paper, which I read a couple of years ago. And the basic idea is JEPA, I'm sure you know this, but for the audience, it stands for Joint Embedding Predictive Architecture. And the idea is that
it can learn increasingly abstract representations by kind of predicting what is unobserved from what is observed. So in some cases, that means deliberately removing data to kind of force the model to learn powerful representations. But in this particular example, it was done in action space, so learning unobserved actions, and also in abstraction space. And because it was done hierarchically, it was done by recursively
applying it, if that makes sense. So that's a really interesting model, and it's using his energy-based models as well. But how is that related to your work on the sub-goals? Yeah, so that sounds a lot like my 1990 sub-goal generator. Back then I realized that millisecond-by-millisecond planning isn't good. Instead, somehow, as you are trying to solve problems, you have to
decompose your possible futures into sub-goals. And then you just maybe execute some known sub-program to achieve that sub-goal and from there you go to the next sub-goal as you're finally reaching the goal. And then in the beginning of course you don't know what is a good sub-goal. So you have to learn that stuff. You have to learn a new representation of something that you want to achieve as a sub-goal as you are trying to achieve the final goal.
And so this 1990 sub-goal generator was really simple, but it already had the basic ingredients of what you need to do. This was really three decades before LeCun had this recent paper out there. So what happens there? You have a neural network
which observes a reinforcement learner and it models the costs for going from certain start places to goal places. So you have a neural network that gets as input start and goal and predicts the costs of going from start to goal, the reward that you will experience as you do that.
And now maybe there are lots of starts and goals and you don't know how to go from start to goal. But maybe you can learn a sub-goal. And how do you learn a sub-goal?
You need something like a learning machine that is good at generating good subgoals. How do you do that? Well, we have a subgoal generator that's going to learn good subgoals. How does that work? Well, the subgoal generator gets as an input a start input and a goal.
And now the output is not an evaluation, but it's a sub-goal. So start and goal, input, output is a sub-goal. Then you have two copies of the evaluator. The first evaluator sees the start and the sub-goal, which is maybe a bad sub-goal coming from the sub-goal generator. And then the second copy of the evaluator sees the sub-goal
and the goal. Now both of them predict the costs and what you want to do is you want to minimize the sum of the costs of these two
evaluators. How do you minimize that? Well, by finding a good sub-goal through gradient descent. That's what the 1990 sub-goal generator does. So in some ways, at least in principle, it solves a problem that LeCun called an open problem in 2020 or something. What do you think of Yann's energy-based models, by the way?
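Here is a hedged PyTorch-style sketch of the sub-goal generator scheme as just described; the dimensions, the toy cost function and the training loop are all illustrative assumptions of mine. An evaluator learns to predict the cost of going from a start to a goal, and the sub-goal generator is then trained by gradient descent, through two frozen copies of that evaluator, to emit a sub-goal that minimizes cost(start, sub-goal) plus cost(sub-goal, goal).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 2   # states are points in the plane (purely illustrative)

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 64), nn.Tanh(), nn.Linear(64, n_out))

evaluator = mlp(2 * d, 1)     # predicts cost(start, goal)
subgoal_gen = mlp(2 * d, d)   # proposes a sub-goal for a (start, goal) pair

# 1) Train the evaluator on observed costs. Here the "world" simply charges
#    squared distance, standing in for whatever the reinforcement learner experienced.
opt_e = torch.optim.Adam(evaluator.parameters(), lr=1e-2)
for _ in range(2000):
    s, g = torch.rand(64, d), torch.rand(64, d)
    true_cost = ((s - g) ** 2).sum(dim=1, keepdim=True)
    loss = ((evaluator(torch.cat([s, g], dim=1)) - true_cost) ** 2).mean()
    opt_e.zero_grad(); loss.backward(); opt_e.step()

# 2) Train the sub-goal generator by gradient descent THROUGH the frozen evaluator:
#    minimize predicted cost(start, sub) + cost(sub, goal).
for p in evaluator.parameters():
    p.requires_grad_(False)
opt_g = torch.optim.Adam(subgoal_gen.parameters(), lr=1e-2)
for _ in range(2000):
    s, g = torch.rand(64, d), torch.rand(64, d)
    sub = subgoal_gen(torch.cat([s, g], dim=1))
    total = evaluator(torch.cat([s, sub], dim=1)) + evaluator(torch.cat([sub, g], dim=1))
    opt_g.zero_grad(); total.mean().backward(); opt_g.step()

s, g = torch.tensor([[0., 0.]]), torch.tensor([[1., 1.]])
print(subgoal_gen(torch.cat([s, g], dim=1)))   # should end up roughly near the midpoint (0.5, 0.5)
```

With a squared-distance cost the optimal sub-goal is the midpoint, so the printed point drifting toward (0.5, 0.5) is a sanity check that the gradients flowing through the frozen evaluator are doing the work, which is the essential trick of the 1990 scheme.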
So this recent paper by LeCun on hierarchical planning is really a rehash of stuff that we have been doing for decades, since 1990. Are you worried that AI is going to be dominated by just a few companies and everyone else will lose out? What do you think?
40 years ago I knew a guy who had a Porsche, a rich guy with a Porsche. And the most amazing thing was in his Porsche he had a mobile phone. So he could grab the receiver and talk to anybody who also had a Porsche like that with a mobile phone via satellite.
And today, a couple of decades later, everybody, billions of people in their pocket have a mobile phone, which is much, much better than what he had in his Porsche. And it's going to be the same thing with AI. Every five years, AI is getting ten times cheaper and it won't be just a few big companies that are going to dominate AI. No, it's going to be AI for all.
And the open source movement is just a few months, maybe, I don't know, eight months behind the big major players. And they don't really have a moat, which means
the future will be bright, and lots of people are going to profit from really cheap AIs that in many ways are going to make human lives longer and healthier and easier, which happens to be the motto of my company, NNAISENSE. What's your take on the AI race between Europe and China and the U.S.?
Well, Europe is the cradle of mechanical computing in ancient Greece and the calculator in 1623 and pattern recognition around 1800 and program control machines in 1804.
and practical AI around 1912, you know, first chess endgame players, and the transistor 1925, and theoretical computer science 1931, and AI theory, the theory of AI 1931, the general purpose computer 1935 to 1941.
Deep learning 1965 in the Ukraine, self-driving cars 1980s, the World Wide Web 1990 and so on. And more recently the basic deep learning algorithms were also invented and developed by Europeans.
On the other hand, the companies with the highest profits in most of these fields are currently not any longer in Europe, but on the Pacific Rim, West Coast United States and East Coast Asia. And there you will find much more venture capital and much bigger efforts in terms of industrial policy and also defense. It's going to stay like that for a while, I guess.
So why doesn't everyone know that AI started in Europe? Maybe because the old continent is really bad at PR? And once AGI is actually here, what's next for humans?
In the long run, most of the AGIs are going to pursue their own goals. Such AIs have existed in my labs for decades. Many AGIs, however, will be tools that do all the work that humans don't want to do.
Nevertheless, freed from hard work, the playing man, homo ludens, will as always invent new ways of professionally interacting with other humans. And already today most people, probably you too,
are working in luxury jobs which, unlike farming, are not really necessary for the survival of our species. At a really high level, what is the history of AI? The history of modern AI and deep learning, you can find that in my 2023 survey, which has that name.
Some of the highlights are of course 1676, the chain rule by Leibniz which is today used in all these programs such as TensorFlow and PyTorch to assign credit in deep neural networks. Then 200 years ago the first linear neural networks by Gauss and Legendre, exactly the same error function that we have today, exactly the same architecture, the same weights.
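To make that claim concrete, here is the error function in question, in my own notation: the least-squares criterion that Gauss and Legendre minimized around 1800 is the same squared-error loss that a modern linear neural network (a single linear layer) is trained on today, typically by gradient descent rather than a closed-form solution.

```latex
E(w, b) \;=\; \sum_{i=1}^{N} \bigl( y_i - (w^{\top} x_i + b) \bigr)^2
```

The architecture is a weighted sum of the inputs plus a bias, i.e. a linear unit; only the way of minimizing E has changed.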
Then, in 1970, a technique called backpropagation, which essentially implements Leibniz's chain rule in a very efficient way for deep multilayer neural network systems.
Then 1967, Amari's work in Japan on stochastic gradient descent for deep networks. Lots of additional fundamental breakthroughs. Convolutional neural networks also in Japan between 1979 and 1988.
and then our own miraculous year, 1990, 1991, with lots of stuff that is today in your smartphone, and I could continue forever. So instead, just have a look at that survey. It also has images of the guys who had important contributions. I mean, isn't this quite different from the very US-centric view of AI history?
In fact, a misleading history of deep learning by Sejnowski and others goes more or less like this: In 1969, Minsky and Papert showed that shallow neural networks without hidden layers are very limited and the field was abandoned until a new generation of neural network researchers took a fresh look
at the problem in the 1980s. So that's basically a quotation from Sejnowski's book. However, the 1969 book by Minsky addressed
a problem of Gauss and Legendre's shallow learning from the 1800s that had already been solved four years prior by Ivakhnenko and Lapa's deep learning method in Ukraine, and then also by Amari's stochastic gradient descent for multilayer perceptrons just two years later.
For some reason, Minsky was apparently unaware of this and failed to correct it later. Today, however, we know the true history, of course. Deep learning started in Ukraine in 1965 and continued in Japan in 1967.
Regarding credit assignment, you've criticized Bengio and LeCun and Hinton and accused them of plagiarism. You said that they republished key methods and ideas whose creators they failed to credit. And in 2023, you published a long report on this. What's your updated take on that? Their most famous work is completely based on work by others whom they did not cite.
And even later they failed to publish Corrigenda or Errata. This is what you do in science when somebody has published the same thing before you. And even in later surveys they didn't credit the original inventors of the techniques that they are using. And instead they credited each other. Total no-go in science.
But science is self-correcting. As Elvis Presley put it: "Truth is like the sun. You can shut it out for a time, but it ain't going away." Plagiarism is a very significant charge. Could you give a few concrete examples? Many of the priority disputes affect my own deep learning team because the awardees often republish techniques of mine
without citing them. And in fact, their most visible work builds directly on ours. But I'll skip that for now. You can read about that in the public report of 2023, which is easy to find. Nevertheless, let me mention some of the other researchers whom they failed to credit;
then I don't have to talk about our own team. For example, in a recent survey of deep learning, they describe what they call the origins of deep learning without even mentioning the world's first working deep learning networks by Ivakhnenko and Lapa in Ukraine, 1965.
Ivakhnenko and Lapa used layer-by-layer training and subsequent pruning with a separate validation set. Ivakhnenko had deep eight-layer networks by 1970. Hinton's much later 2006 paper on layer-by-layer training also failed to cite this stuff:
the very origins of deep learning, the first methods that really worked in deep learning. And later surveys still didn't give credit to these original inventors.
The awardees also failed to cite Amari's 1967 work, which included computer simulations of learning internal representations in multilayer perceptrons through stochastic gradient descent.
That was almost two decades before the awardees published their first experimental work on learning internal representations. Their survey also mentions backpropagation, a famous technique, and their own papers on applications of this method, but neither
the inventor of backpropagation, which was Seppo Linnainmaa in 1970, nor its first application to neural networks by Werbos in 1982. Werbos also had a 1974 thesis, but that did not yet have the application to neural networks. And they didn't even mention Kelley's precursor of the method in 1960. Not even in the latest surveys.
They also refer to LeCun's work on convolutional neural networks, citing neither Fukushima, who created the basic CNN architecture in the 1970s, nor Waibel, who in 1987 was the first to combine convolutions with backpropagation and weight sharing in neural networks,
nor the first backprop-trained two-dimensional convolutional neural networks of Zhang in 1988. Modern CNNs originated before LeCun's team helped to improve them; this is not at all clear from their papers.
They cite Hinton for multiplicative gating without mentioning Ivakhnenko and Lapa, who already had multiplicative gating in deep networks in 1965. In the report, which is easy to find on the web, I mention many, many additional cases, all backed up by plenty of references.
So what do you think should be done? They have violated the code of ethics and professional conduct of the organization that hands out these awards. So they should be stripped of their awards. So how do such problems, as you've stated, then reflect on the broader field of machine learning?
They reflect the immaturity of our field. In a major field such as mathematics, you'd never get away with this. Anyway, science is self-correcting and we'll see that in machine learning too. Sometimes it may take a while to settle disputes, but in the end, the facts must always win. As long as the facts have not yet won,
it's not yet the end. Many philosophers and scientists and physicists and entrepreneurs have become obsessed with this idea of AI existential risk. What do you think about that, as a real expert in AI? Many talk about AIs, but few build them.
And I have tried to allay the fears of some famous doomers, pointing out that there is immense commercial pressure to use our artificial neural networks to build friendly AIs, good AIs that make their users healthier and happier and more addicted to their smartphones. Nevertheless, we can't deny that armies perform research on clever robots as well, right?
That's true. People who should know told me that our AI is also used to steer military drones. Or here is my old trivial example from 1994 when Ernst Dickmanns had the first truly self-driving cars in highway traffic. Similar machines can also be used by the military as
self-driving landmine seekers. And many would argue that's maybe not such a bad thing. So are you saying it's not possible, then, that AI will become really dangerous? AI can be weaponized, as is obvious in the recent wars driven by cheap AI-based drones. But AI does not introduce a new quality of
existential threat. We should be much more afraid of half-century-old technology in the form of hydrogen bombs and H-bomb rockets. A single H-bomb can have more destructive power than all conventional weapons, or all weapons of World War II, combined.
Many people forget that despite the dramatic nuclear disarmament since the 1980s, there are still enough H-bomb rockets to wipe out civilization as we know it within a few hours without any AI. But I'm trying to figure you out, Juergen, because...
Many AGI skeptics make the argument that it's impossible in practice to build this kind of intelligence. But you don't think that, because in your lab you've been building agentic AIs, which is to say AIs that create their own goals, for decades. So you do think that this thing could be incredible. Are you just making the argument that the risk is still much lower than the H-bombs? So at the moment,
H-bombs are much more worrisome than any AI-based drones and what you have now. In the long run, of course, you have to think about what's going to happen once AI weapons are not just used as tools by other humans who have conflicts and use
their own AI weapons against the AI weapons of the other guys. What is going to happen, you will have to ask in the long run, once really powerful AIs are going to do their own thing and expand into space in a way that goes beyond where humans can follow. But we will get to that later. So what will super smart AIs actually do?
As I have emphasized for decades, space is hostile to humans but really friendly to appropriately designed robots. And it offers many more resources than our thin film of biosphere, which receives less than a billionth of the sun's energy.
And while some curious AI scientists will remain fascinated with life and the biosphere, at least as long as they don't fully understand it, most of these AIs will be more interested in the incredible new opportunities for robots and software life out there in space.
And through innumerable self-replicating robot factories and self-replicating societies of robots in the asteroid belt and beyond, they will transform the solar system, and then, within a few hundred thousand years, the entire galaxy, and, within tens of billions of years, the rest of the reachable universe, in a way that humans can't really follow.
Despite the light speed limit, the expanding AI sphere will have plenty of time to colonize and shape the entire visible cosmos. Let me stretch your mind a little bit. The universe is still young, only 13.8 billion years old.
Let's multiply this by four. Let's look ahead to a time when the cosmos will be four times older than it is now, about 55 billion years old. That's roughly how long it's going to take to permeate the currently visible expanding universe. By then, the visible cosmos will be full of intelligence, because
Once this process has started, most AIs will have to go where most of the physical resources are to make more AIs and bigger AIs and more powerful AIs. Because those AIs who don't do that, they won't have an impact. Many years ago I said in a TEDx talk where I wore exactly this outfit:
Think of human civilization as part of a much grander scheme, an important step, but not the last one, on the path of the universe towards more and more unfathomable complexity. Now it seems ready to make its next step, a step comparable to the invention of life itself over 3.5 billion years ago.
So this is much more than just another industrial revolution. This is something new that transcends humankind and even biology. And it's a privilege to witness its beginnings and to contribute something to it. So what about this Fermi paradox? Why have we not seen any signs of intelligence in the universe?
First of all, what I'm saying today is actually the same thing that I have told my mom and others since the 1970s. And when I was a boy back then, a teenager, I thought about this particular question a lot. As a boy, I already knew something about the vast
empty spaces observed between clusters of galaxies. And my first thought back then was that maybe they are expanding bubbles colonized by AIs, which are already using most of the local energy in form of stars and whatever, making those bubbles appear dark, although they are full of AI.
And then I learned, however, that gravity itself is sufficient to explain the sparse large-scale network structure of the universe. So that explanation became a little bit less convincing. And my next thought was that maybe the mysterious dark matter which makes up most of the mass of the known universe
might be stars whose energy is used by AI civilizations whose communications are so well encrypted that they look like random noise to us. But this also seemed implausible as
dark matter is present in all galaxies, including our own. And this leads to the question: Why are there any stars left in the Milky Way, our local galaxy, whose energy has not been tapped yet? And why don't we observe a constant bombardment with
non-encrypted construction plans of AIs who want to spread by radio, without first having to build physical receivers far from their origins? Today I think it is possible that our planet is really the first in our light cone to spawn an expanding AI bubble. Earth's
multi-billion year window for biological evolution is almost over. In a few hundred million years, the Sun will be too hot for life as we know it. Ignoring human-made global warming, just the Sun by itself. And perhaps humans were extremely lucky to evolve barely in time
maybe through a series of extremely improbable events, to invent agriculture and civilization and book printing, and almost immediately afterwards AIs, just a few hundred years later. So if we are indeed the first, then this would imply a lot of responsibility, not just for our little biosphere, but for the future
of the entire universe. Let's not mess this up. Indeed, let's not mess this up. It's quite interesting actually, you know, many science fiction authors over the last hundred years or so, they have imagined a kind of monomaniacal, monolithic super intelligence dominating everything. I mean, what do you think about that? I have often argued that it seems much more realistic to expect
an incredibly diverse variety of AIs trying to achieve all kinds of self-invented goals. In the lab we had such AIs already in the previous millennium. And to optimize all kinds of partially conflicting and quickly evolving utility functions
many of them generated automatically (we have evolved utility functions for reinforcement learning machines already in the previous millennium), where each of these AIs is continually trying to survive and adapt to rapidly changing niches in
AI ecologies driven by intense competition and collaboration beyond current imagination. To reiterate, something that I do find surprising is that you agree with the rest of the x-risk people. You think that it's conceivable to have recursively self-improving AGIs that pursue their own goals, that create their own goals.
But then I ask the question, I mean, I know you've got two daughters. I mean, do you think about the world they'll be living in alongside AIs that are creating their own goals and acting autonomously, being curious and creative, you know, in the way that humans are, but on potentially a much grander scale?
Not too much. Such AIs will have no major incentive to, say, exterminate humanity like in the Schwarzenegger movies. Instead, many AIs will be curious scientists. Remember the artificial curiosity we discussed earlier. And they will be fascinated
with life, and fascinated with their own origins, with AI's origins in our civilization, at least for a while, because life and civilization are such a rich source of interesting patterns, at least as long as they are not fully understood. And so AIs will, at least initially, be highly motivated to protect humans rather than kill them.
So, you know, once AIs fully understand all of this, what happens next? Then humans may hope for another type of protection through lack of interest on the other side. Why is that? Unlike in Schwarzenegger movies, there won't be many direct goal conflicts between us and them.
Humans and others are mostly interested in similar beings with whom they can either compete and/or collaborate because they share the same goals. That's why politicians are mostly interested in other politicians.
And CEOs of companies are mostly interested in other CEOs of similar companies. And kids are mostly interested in other kids of the same age. And ants are interested in other ants, just like
Humans are mostly interested in other humans, not in ants. So super smart AIs will be mostly interested in other super smart AIs, not in man. It's man himself who is the greatest enemy of man, but also man's best friend. Similar for AIs.
Do you imagine a future where AIs and humans will merge together to create something even more powerful than pure AIs? We have been cyborgs merging with our technology for centuries. For example, by wearing glasses or shoes. But combinations of AIs and humans more powerful than pure AIs? In the long run, this seems very unlikely to me.
Of course, many humans hope for some sort of immortality through brain scans and subsequent mind uploads into virtual realities or a virtual paradise or maybe into robots.
a physically conceivable idea discussed in science fiction novels since the 1960s. I think the first novel of that kind was Simulacron-3, from 1964. However, to compete in rapidly evolving AI ecologies, uploaded human minds will eventually have to change
beyond recognition, becoming something very different and non-human in the process,
succumbing to all these temptations that you have in such a virtual paradise, to become something that has not only two eyes but millions of eyes and sensors and actuators. So traditional humans won't play a significant role in the spreading of intelligence across the universe. I don't think they will.
One thing that concerns me is, I mean, David Chalmers, for example, he put forward this idea that the fundamental substrate of the universe might be information, which is really interesting, but in a way,
it also led him to say that certain structural patterns of information processing, certain dynamics, give rise to consciousness and give rise to minds. And when you take this kind of substrate-independence view, it levels the playing field of moral status. So one thing that worries me is, if we adopt this view, then
couldn't you just make the argument that AIs could potentially have a higher moral status than us, if indeed they have more complex information processing than us? Many science fiction authors of the previous century, from Stanislaw Lem to Isaac Asimov,
have described AIs and superhuman robots whose moral status is obviously higher than the one of their human counterparts and protagonists. And this has been a popular idea, at least in science fiction. Generally speaking, moral values have changed a lot across time and populations.
Certain moral values have survived for a while because they gave a temporary evolutionary advantage to beings and societies who adopted them. However, evolution isn't over and the universe is still young. So it sounds like you've got an all-encompassing view of the universe, life and everything.
Indeed, in 1997 I wrote my first paper about this: "What is the simplest explanation of our universe?" Since 1997 in my secret life as a digital physicist I have published on the very simple
asymptotically fastest, optimal, most efficient way of computing all logically possible universes, all computable universes, including ours. As long as there is no evidence that our universe is not computable, we stick with this assumption. So at the moment we don't have any physical evidence against this.
This was a generalization of Everett's many-worlds theory of physics. But now it's more general in the sense that you have all kinds of different universes with different physical and computable laws. Now, any great programmer with any self-respect should use this optimal method to create and master all logically possible computable universes,
thus generating us as byproducts and generating many histories of deterministic computable universes, many of them inhabited by observers like ourselves. And due to certain properties of the asymptotically optimal method,
And there's one. Many people don't know there's one, but there's one. At any given time in this all-encompassing computational process, most of the universes computed so far that contain yourself will be due to one of the shortest and fastest programs that computes you. And this little insight allows for making highly non-trivial and encouraging predictions about
our future, about your future. Jürgen, this has been amazing. Do you have any final messages for the MLST audience? Yes, don't worry. In the end, all will be good. Touch wood. Jürgen, it's been an absolute honor to have you on the show. It's been a dream of mine to do this in the flesh. And I really appreciate you coming on. Thank you so much. That's very kind of you to say that. And it was a great pleasure for me. Thank you.