We need a new direction. Large language models are not our way to advance AI. A language model is, for me, a database technology. It's not artificial intelligence. You grab all the human knowledge in text, perhaps also in code or whatever, and store it. Currently, the reasoning in AI is not real reasoning. It's repeating reasoning things or code things which have already been seen in the input data.
You and Jürgen, you're pioneers of connectionism in a way, and yet you've always been neuro-symbolic guys. Why is that?
Sepp, it's an honor to have you on MLST. Thank you so much for joining us today. It's an honor for me that you have me. Oh, don't be silly. The thing that is amazing about language models and deep learning in general is that they capture a lot of the subtle intuitions, cultural information, creativity, and so on.
They're really good for generating programs. And the thing is, if we want to do abstraction, we need to have programs. But the thing is, where do the programs come from? If we build systems that can create and acquire abstractions, we need to build systems that can write their own programs. It doesn't seem possible just to do some discrete program search, because it's too difficult. My view on large language models is that, for me, a large language model is a database technology.
Yes. It is not artificial intelligence. Okay, I grant you it's artificial, but it's more or less a database technology. You grab all the human knowledge in text, perhaps also in code or whatever, and store it. And you generalize it, you combine it. You know, if there's a Tuesday, I can replace it by Wednesday, because these days of the week, or names, or numbers, can be swapped. It's some kind of generalization, but it's things which already exist. The question is, do we need new code? Is all code already written somewhere, so you only have to pull it together or combine it?
But if you really have to come up with new code, with a new idea, with a new concept: large language models can only pull out existing code they have been trained on. It's just not possible; they're not trained to produce something new. And therefore they're very limited. But they're very powerful, because AI needs a knowledge representation.
Right now there's the problem of hallucination. Okay, how do you pull out the knowledge? Also with inference, with Strawberry, it's about knowledge that is perhaps already in the system: how do I get it out? It's a database where I don't know how to access the information. We need a new direction. Large language models are not our way to advance AI in the long term. It's a good
database technology, it's a good knowledge representation technology. It's important for AI, but we have to find new ways. Could I challenge that a tiny bit? I completely agree that vanilla LLMs are approximate retrieval engines. And even then, they're not quite databases, because they have this interpolative property.
Things like o1, they are approximate reasoning engines. They're doing this test-time compute and they're searching many combinations. Because the thing is, even though this thing is a finite state automaton with a fixed amount of compute that does a single forward pass, it can generate code. And the code contains all of these fundamental basis primitives that can be composed together. So you can do this test-time search, you can compose programs together. You can, in a sense, search Turing space indirectly, right? By searching through the space of...
programs. So there seems to be something there. Are you saying that these kinds of methods, like o1, are a road to nowhere and we need something completely different, or could we just tweak it a little bit? You can tweak it and you will go very far, because you have the program space, and that's very nice. There are so many programs, so many combinations of programs which give you a new program. If you think of Kolmogorov complexity, it's the length of the programs, the complexity of the programs: programs of simple Kolmogorov complexity are already stored or can be combined. But if you have to find a program which needs completely new concepts, which cannot be combined out of existing programs, I think it cannot do it. It can only combine things it has already seen. The large language models learned on code, but they cannot come up with completely new code concepts. Perhaps those don't exist. Then if you say everything which
in code was already invented, we only have to combine and there's nothing new, then I'm with you. But if there's something new to invent, I don't think large language models can invent it. - MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads. They support all of the latest open-source language models out of the box, like Llama, for example. You can just choose the price point, choose the model that you want.
It spins up, it's elastic, it auto-scales. You can pay on consumption, essentially, or you can have a model which is always running, or it can be freeze-dried when you're not using it. So what are you waiting for? Go to centml.ai and sign up now. - Well, yeah, let me push gently on that. So I think this is a discussion about creativity
and also epistemic foraging, so creating new knowledge to explain. But also reasoning. Programming is a lot about reasoning. But you have to have some logic if you do this, this, this. But a complex logic so that your program is working. Absolutely. But if we say reasoning is knowledge acquisition,
and we need systems to come up with new abstractions. And if we agree that those abstractions can be combinatorially deduced from abstractions that are already in the system, so we have the combinatorial closure, so they do exist, then creating the abstractions is more a matter of understanding how do I find a good abstraction using an algorithm. But I think what we humans have, it's not only that we draw our abstractions
and ideas from coding; it's also about understanding the world, having all this world knowledge. I think from coding alone you're limited. I think we have many reasoning capabilities outside of only doing programs. But I agree.
In programs, you can go very far. But I think if it's a program where you need a lot of reasoning, a lot of logic to go to the next step, and the next step, and the next, and it was not in your training database, I don't think the current large language models can do it. Right now I don't believe they really understand reasoning as a concept; let's say they imitate reasoning, they reproduce what they have already seen.
But I don't know whether they understand it. And there are many examples: you change a little bit and then they go wrong. Can you explain the difference between the kind of reasoning we do, so strong reasoning, and the kind of reasoning that we can do in current AI? I think in current AI the reasoning is not real reasoning. It's repeating reasoning things or code things which
have already been seen in the input data, combining them and also replacing some variables. The reasoning we do: we have these reasoning concepts like contradiction, induction, all these things we learned. And for us too it was hard to learn, in school or at university. But now we have reasoning concepts
how we can do, how we structure things and how we show that something is true or not true. All these formal systems, you have to have some formal rules. In theory, LLMs might learn some formal rules.
but then I can do reasoning in a very specific thing and produce new things because I only apply the rules. If the rules are in the training data, I can apply it. I can apply the rules also to new things. But then in this reasoning system, I probably can reason. But if you go to another system,
- I mean, two quick points on that. First of all, would you consider, you know, move 37 in AlphaGo, would you consider that reasoning? So in AlphaGo, Google's Go-playing algorithm creatively discovered this amazing move that-- - So now it's a move, okay. - Yes, yes. It created new knowledge.
It sets things-- it created new knowledge. But here there was not only a sub-symbolic part; there was Monte Carlo tree search. Search is one classical AI concept. At the end of Monte Carlo tree search you have the value functions and so on. But it discovered it, by checking things and then evaluating them.
Yes, I think it's a combination between understanding the game and computing a lot of moves into the future with Monte Carlo tree search. Tufa Labs is a new AI research lab I'm starting in Zurich. It is funded by Paz Ventures, which is involved in AI as well. We are hiring both chief scientists and deep learning engineers and researchers. We are a Swiss version of DeepSeek.
And so a small group of people, very, very motivated, very hardworking. We try to do research starting with LLMs and o1-style models. We want to investigate, reverse engineer and explore the techniques ourselves. Even with that, I guess you could say it was still an approximate value function; there were no formal guarantees or anything. Exactly. That's true. But I completely agree that LLMs on their own are approximate retrieval engines, but...
The thing is, we can build systems. We can have formal verifiers. We can have Lean and Coq, for example. We can build these systems. So with these systems, can we do reasoning? I think in principle it should be possible. I'm not sure. But I think the reasoning is limited to
the domain you see in the training data. There are different formal systems, formal logics. You can learn one logic. If it sees enough of these rules, I think it knows what a variable is, what it can change and how it can produce something. I think you can train an LLM on one logic system to produce new logic.
But it's, you learn the syntax. What you don't learn is the semantics. If I want to prove something, then you have to have different steps towards this proof. And here I think they would struggle. They would learn to do the formal things, the syntax. I have a sentence, I have a correct formula, and I produce another correct formula by applying the rules.
It has learned it, it has seen it, it can do it. But it's not goal-directed. It's a step-by-step. They're still not perfect or they're still...
sometimes, or in most cases, not as good as humans. It's interesting. I kind of agree that knowledge is created in service of a goal, and there's a creative component to reasoning. We can build systems that can dream and generate data, and we can bootstrap that, and some of it can come from the users of a system. So it feels like we can build systems that can reason.
But perhaps it wouldn't have something that we have. Maybe we have something extra. But I would bet on this. Also for us: why should you learn to reason yourself? Why not use a reasoning system? Why not call a subprogram: can you prove this? Or a theorem prover, a mathematician, stuff like this. You can learn it, and perhaps it's also okay to learn it, but I don't see the necessity, because we also use tools. Why should these future AI systems not use tools for everything: for math, for looking up knowledge and stuff like this? For me, it's stupid to push everything into one system, because we don't do that either. We know how to use our tools.
Somehow I feel that's a better solution. What has happened in the last two years since we last spoke? A lot of things happened. For example, I founded a company, NXAI, a company dedicated to industrial AI. And also xLSTM happened, this revival of the LSTM method, which now should compete with the transformer technology. Yes, and we're going to get onto that. Before we do, it would be good to go on an intellectual journey of LSTMs.
But just before we get there, we've not really spoken about some more, you know, broad stuff. What was it like working with Jürgen? Jürgen is a very special person. He's very inspiring. He has charisma. He...
I can tell you one story from the University of Munich, where we both were. There was a seminar and there were three people. One tried to get all the students into multi-agent systems, another into spatial cognition,
and Jürgen into neural networks. Jürgen came and said, "Oh, I'm not prepared. I don't know what to do." Then Jürgen gave his introduction to his topic, and out of the 50 students, most selected his topic. So you see, he can convince people, and it was fun. I was sitting there. I did programming.
And Jürgen did his art things where he made circles and out of the circles women appeared and he did a lot of things. And he once told me it was not clear for him whether he will go into arts or whether he goes into science. But it was always fun with him, always.
Just in case the audience doesn't know: of course, you worked under Jürgen, and you are both pioneers in the realm of artificial intelligence. It's insane. But what gave you the intuition, all of those years ago, to be working on the right things? Probably, I mean, LSTMs, which stands for long short-term memory. Jürgen introduced me to neural networks, in particular to recurrent neural networks.
But they did not work. And in my diploma thesis, he was my supervisor. He gave me some task. It's called the chunker system where you have a sequence and everything which you can predict, you can remove from the sequence because it's predictable anyway. So you shorten the sequence and you can run it. That was the idea of the chunker system.
But this was sort of a workaround, because recurrent neural networks were not working. And then two things happened. First, I built a neural network where only one weight had to be adjusted, as a weight to store a piece of information which you need at the sequence end. And the network could not do this. I did all my printfs, all my coding; numbers were flowing over the screen, and then I saw: hey, these are super small numbers, and these were the gradients. There was no weight update. The gradients were not there.
And this was then the discovery of the vanishing gradient: if you have some target and you want to know what is needed to predict the target, you do credit assignment backwards from the sequence end. You get no signal; the gradient vanishes. Now I knew why recurrent networks do not work.
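To make the vanishing-gradient observation concrete, here is a minimal numpy sketch, my own illustration rather than anything from the original diploma thesis: backpropagating through a plain tanh RNN multiplies the error by the recurrent Jacobian at every step, so the gradient arriving at the sequence start shrinks toward zero, while a constant error carousel with a self-connection of 1.0, the memory-cell idea described next, leaves it unscaled. The dimensions and weight scales are arbitrary.

```python
import numpy as np

# Toy demonstration of the vanishing gradient in a plain tanh RNN
# versus a constant error carousel (the LSTM memory cell idea).
np.random.seed(0)
T, d = 100, 8                      # sequence length and hidden size (arbitrary)
W = np.random.randn(d, d) * 0.2    # recurrent weights, spectral radius below 1
U = np.random.randn(d, d) * 0.3    # input weights
x = np.random.randn(T, d)

# Forward pass, collecting the Jacobians dh_t/dh_{t-1} = diag(1 - h_t^2) @ W.
h, jacobians = np.zeros(d), []
for t in range(T):
    h = np.tanh(W @ h + U @ x[t])
    jacobians.append(np.diag(1.0 - h ** 2) @ W)

# Backpropagate a unit error from the last step to the first step.
grad = np.ones(d)
for J in reversed(jacobians):
    grad = J.T @ grad
print("plain RNN, gradient norm at t=0:", np.linalg.norm(grad))   # shrinks toward zero

# Constant error carousel: the cell's self-connection is fixed at 1.0,
# so the backpropagated error is left unscaled over the whole sequence.
grad = np.ones(d)
for _ in range(T):
    grad = 1.0 * grad
print("constant error carousel, gradient norm at t=0:", np.linalg.norm(grad))  # unchanged
```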
And the solution was LSTM, the long short-term memory cell, where I built something which makes sure that the gradients, as they get propagated back, as they are transferred back, don't scale; they remain the same. Therefore, at the sequence beginning
there was exactly the same gradient as at the sequence end. There's no vanishing gradient anymore. This was the architecture, the memory cell architecture, which is the core of LSTM. And I discovered LSTM, I wrote it up in my diploma thesis, and later, when Jürgen came back, he asked me: hey, you did something in your diploma thesis,
should we publish it and then we published it. Yeah, and it's been one of the most cited papers in the history of deep learning. Very, very impactful paper. I mean, just reflecting back on it though, what do you think the long-term impact has been of LSTM?
I think it's still used. In my keynote I gave one example. The example is from this year, for predicting floods or droughts, but especially floods. LSTM was the major model in Google's flood forecasting app. And for predicting floods it's also used by the US government and by the Canadian government. And here LSTM works better than everything else,
better than transformers and so on. OpenAI built a big LSTM network as an agent, and DeepMind's AlphaStar for StarCraft was a big LSTM network. LSTM became the major thing in language up to 2017; everybody used LSTM, together with attention. Attention went together with LSTM. And then the one paper came out, "Attention Is All You Need", meaning you only need attention and you don't need the LSTM anymore. This is where the transformer was born, and the new technology took over.
But LSTM still performed well in time series prediction, in reinforcement learning, in agents and so forth. The transformer was stronger especially in language. This has changed now again, hopefully, but at that time the transformer took everything over. So we're back to the old way.
The transformer was better parallelizable: you could throw more data at this model, learn on more data, and so be faster, and LSTM could not compete at that time. How did the LSTM solve the trade-off between storing new data and protecting data that was already stored?
That's a very interesting question, and that's also the strength of the new xLSTM. The idea is gating. We have different gates. Perhaps the most important one is the input gate,
and it scales up or down the new information which is coming in. It can be scaled down to zero, so nothing is stored, or up to one, so everything is stored. And the input gate is something like an early attention mechanism, because the input gate is attention: you have a time series, and deciding which sequence elements you want to pay attention to, that is what the input gate does.
And then there's the forget gate. Forget gate is saying, is the already stored memory important or should I downscale it? But more important is the input gate. The input gate really
picks out specific sequence elements to be stored, so non-relevant stuff is not stored. It was one of the first attention mechanisms, but we called it gating. Yes. Just before we get to xLSTM, because of course you've got this amazing new invention which solves many of the problems that the original LSTM had: can you tell me about the computational complexity of an LSTM compared to an RNN? How did it compare?
LSTM is an RNN. Yeah, vanilla, without all of the gating. Yeah, without the gating. And the complexity is only increased by this gating mechanism, but it's still linear in time. Because perhaps it's better to compare it to attention. Attention, if you have a new query, a new piece of information, you have to look back to all previous data
items, while LSTM only interacts with the memory, with everything already stored in the memory. So it's always constant: for one query, you have a constant interaction with the memory. Attention has to go through all the
keys and do these pairwise interactions. And there are two disadvantages. The first disadvantage is that it is computationally very complex: it's quadratic in the context length. And the second disadvantage is that you have only pairwise interactions. You do a dot product and an exponential of the dot product, which is in the softmax, but you have only pairwise comparisons. It could be better if more tokens, more sequence elements, were pulled together and a new element interacted with
I would say an abstraction of these different tokens. Two disadvantages of the transformer: computational complexity plus very simple interactions. But LSTM is like a recurrent neural network. All recurrent neural networks are linear.
in the sequence length, linear in the context length. LSTM is a little bit more complex because it has the gating mechanism, but it's by far not as complex
as a transformer with its quadratic complexity. Could you explain to the audience why something which is quadratic, so something that should be worse, actually ran faster? Why is that? It ran faster because of its implementation on the GPU, the graphics processing unit, the chips everything is implemented on. We had something called flash attention, this very fast attention mechanism,
and you use hardware optimization.
This is one thing. And the other thing is that you could do it in parallel. I said one criticism is that it looks back at all the keys, but it can look back at all the keys at the same time. You can do everything in parallel. You can push up all the keys, the whole thing: assume you have a sequence, a sentence, and all the words are pushed up one level simultaneously,
while recurrent networks, or LSTM, have to go over it sequentially: first element, build up a new memory; next element, build up a new memory. Attention could push up everything in parallel. And therefore attention
was at this time much faster than LSTM, because of this parallelism. And the second thing was that you could optimize it for the GPU, for the hardware. These two things, parallelization and hardware optimization, gave attention a big advantage: you can train on much more data in the same time.
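As a rough sketch of the cost argument just made (my own toy code with made-up dimensions, not anyone's production kernel): a gated recurrent cell touches a fixed-size memory once per step, so the work is linear in sequence length, while attention compares each new query against every key stored so far, which grows with the context and is quadratic over the whole sequence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 16                                    # hidden size (illustrative)
Wi, Wf, Wz = rng.normal(size=(3, d, d)) * 0.1

def gated_cell_step(c, x):
    """One LSTM-style gated update: O(d^2) work per step, independent of sequence length."""
    i = sigmoid(Wi @ x)                   # input gate: how much new information to store
    f = sigmoid(Wf @ x)                   # forget gate: how much old memory to keep
    z = np.tanh(Wz @ x)                   # candidate cell input
    return f * c + i * z                  # fixed-size memory, updated in place

def attention_step(keys, values, q):
    """One attention query: compares against ALL stored keys -> O(t) per step, O(T^2) total."""
    scores = keys @ q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

T = 512
c = np.zeros(d)
keys, values = [], []
for t in range(T):
    x = rng.normal(size=d)
    c = gated_cell_step(c, x)             # memory stays size d no matter how long T gets
    keys.append(x); values.append(x)
    y = attention_step(np.array(keys), np.array(values), x)   # work grows with t
```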
And LSTM could not compete with this technique. You mentioned flash attention as well. Can you just quickly explain that to the audience? Does that mean that in certain circumstances you don't actually need to do the full quadratic attention?
It's still quadratic, but super highly optimized. Right. You use the fast memory, the caches on the chip, you use the registers of the GPU, very, very fast memory. It still has the same complexity, because it's mathematically quadratic and you cannot cheat math,
but you can do it super, super fast. Flash attention was super, super fast because it is hardware optimized. Wonderful. So can you bring in xLSTM, which is this new invention, and how does it overcome some of these problems with the original LSTM? Yes, I'll start with a spoiler,
because I talked about flash attention: with xLSTM we are faster than flash attention, both in training and also in inference, and inference especially is important. Now I go back to xLSTM. After seeing this rise of the transformer, we thought, okay, first of all, why could it not be LSTM? Because the ResNet backbone architecture for building very, very large models was the key, to have these feed-forward connections, these many parameters where you store all the information. Meaning: is it important to build big models, or is it important to have the specific technology looking back to compress the history? We thought LSTM should be able to do it. And we asked the question: can we scale up LSTM like transformers
and get the performance of transformers? But we knew some limitations, some drawbacks of LSTM. One we already mentioned: parallelization. We have now made LSTM parallel too; we used the same ideas as attention to parallelize LSTM. But there were two other limitations. One limitation was
that LSTM could not revise storage decisions. If you store something and then something different comes along where you say, oh, this should have been stored instead, you cannot revise it. I'll give you an example. Say you want to find new clothes. You say, now I found these clothes at this price.
And if you look further on the internet, you find new clothes which are even better, plus the price. Perhaps the clothes should fit your shoes or whatever. And if you find something better, you should throw away what you have already stored, both how well it matches your shoes and also the price.
The old LSTM could not do it: if I find a better matching thing and I have to memorize its price, I have to delete everything. The xLSTM can do this, and the idea is exponential gating. We revise a storage decision with exponential gating; the idea is that if I find something better, I up-weight it very heavily and then I normalize. Therefore the old best solution is down-weighted. And by this
I can find something better and throw away my old stuff. In theory the forget gate could do it, but in practice you cannot learn to forget, because you cannot learn at the same time to store very precisely and then to forget at one time step.
But exponential gating, exponential input gating, was the key to saying: I have something better, forget everything that came before. And this gave us an advantage. The second thing was
a matrix memory, a big memory. The original LSTM has as its memory a scalar, one number. You have only one number which you can store; that's not much you can store. And now the new
LSTM, the xLSTM, has a whole Hopfield network. We use a classical Hopfield network, and it became popular again because of something called the Nobel Prize: John Hopfield got the Nobel Prize for this classical Hopfield network. And now, instead of this single scalar, we use a whole Hopfield network. It's like a classical Hopfield network
plus gating, with the input gate saying what we should store in the Hopfield network, and with the forget gate saying how much the old stored items should be down-weighted. It's a Hopfield network equipped with gating. So we merged the Hopfield network idea with the LSTM idea, and this gave us an LSTM with a much stronger memory, with a much bigger memory. So exponential gating was important, then
increasing the memory. And the third, I already mentioned, was to parallelize it. And these three ingredients
we used to build this new xLSTM. The results it gave were fantastic; we didn't expect such good results, to be honest. And I suppose that the memory mechanism is also reminiscent of the fast weight programmers of the 1990s. Exactly. Jürgen already did this. There were others, like Hopfield networks; it's always an outer-product memory. You have a memory and you have
two new vectors: one we call the key, like in attention, the other we call the value, and you take the outer product of key and value and add this to the memory.
That is the idea, but what we added on is an input gate for the new item which is added, and a forget gate for the old memory. But it's a known technique, this outer-product storage. It's even older: the Ising models in the '70s already had these ideas.
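A toy version of that gated outer-product memory, in the spirit of the xLSTM matrix memory but heavily simplified (my own sketch: scalar gates, no normalizer, no learned projections):

```python
import numpy as np

def write(C, k, v, i_gate, f_gate):
    """Gated outer-product write: the forget gate decays the old matrix memory,
    the input gate scales how strongly the new key/value pair is stored."""
    return f_gate * C + i_gate * np.outer(v, k)

def read(C, q):
    """Retrieval: query the matrix memory, roughly a key/value lookup."""
    return C @ q

rng = np.random.default_rng(0)
d = 8
C = np.zeros((d, d))                     # a d-by-d memory instead of a single scalar cell
keys = rng.normal(size=(100, d))
values = rng.normal(size=(100, d))

for t in range(100):
    i_gate = 1.0 if t == 42 else 0.05    # store step 42 strongly, the rest weakly
    f_gate = 0.98                        # slowly down-weight what is already stored
    C = write(C, keys[t], values[t], i_gate, f_gate)

h = read(C, keys[42])                    # fixed-size read-out, dominated by the strongly stored item
```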
Hopfield networks use the same idea, and the fast weights use this idea too. Could you also give a little bit more intuition on the gating? You said you moved from a sigmoid to an exponential. Certainly in the 1990s people were using the sigmoid, and even the hyperbolic tangent, as activation functions.
What was the intuition at the time to go with the sigmoid? And in a bit more detail, how does the exponential version fix the problem? To go sigmoid: it's gating. A sigmoid is between 0 and 1. Yes. And that's the natural thing to do: 1 means the gate is open, everything goes through; 0 means nothing goes through, the gate is closed.
And in between you do scaling. So the sigmoid was a natural thing to use for gating. But it has a problem, because if you encounter one sequence element and you say, I multiply it by 0.5, let's say, then another element comes along and says, oh, if that one was 0.5, I should multiply this one by 4.
But this doesn't work, because a sigmoid only goes up to one. I cannot go higher; I cannot overrule the earlier decision. So the sigmoid is limited: you have to make a decision, but later you are limited to at most that value. Exponential gating is not limited.
You can always use larger values. But the problem is that in earlier days we never used exponential activation functions, because learning would break down. So we had to have a second ingredient: one is exponential gating, the other is a normalization.
You have an exponential thing, but then you normalize by these exponential input gates. It's like a softmax. If you remember how a softmax works: you have this e to the power of something, you have these exponentials, and then you divide by the sum of these exponentials. It's like a rolling softmax.
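A toy version of that "rolling softmax" (my own simplification; the actual xLSTM formulation adds further details that are only hinted at here): the exponential input gate is unbounded, so a later, more important item can outweigh everything stored before, and dividing by the running sum of gates renormalizes the read-out, like the denominator of a softmax.

```python
import numpy as np

c, n, m = 0.0, 0.0, -np.inf          # cell state, normalizer state, running max (stabilizer)

def exp_gate_step(c, n, m, gate_preact, value):
    """Exponential input gate with a running normalizer: a 'rolling softmax'.
    The running max m keeps exp() from overflowing, without changing the result."""
    m_new = max(m, gate_preact)
    scale = np.exp(m - m_new)        # re-scale the old state to the new stabilizer
    i = np.exp(gate_preact - m_new)  # unbounded gate: later items can dominate earlier ones
    return scale * c + i * value, scale * n + i, m_new

# (gate pre-activation, value): the last item has by far the largest gate.
for g, v in [(1.0, 3.0), (2.0, 5.0), (8.0, -1.0)]:
    c, n, m = exp_gate_step(c, n, m, g, v)
    print("read-out:", c / n)        # ends up dominated by the value stored with gate 8.0
```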
And therefore we went in the direction of attention with LSTM, but it's recurrent. It's very similar: you have an exponential input gate, but then you divide by the sum of all input gates. It's a little bit like a softmax, but it's also different. And there's another thing: it changes the learning dynamics.
But we don't have a clear understanding of what's happening. We only saw that, across different architectures, the softmax with its exponential function
gives an advantage in learning dynamics, meaning that where other systems get stuck, are stalled, do not learn anymore, there are some gradient peaks which let the transformer keep learning. And we now observe the same with xLSTM. Revising storage decisions was our reason, but the learning dynamics also
were modified in a positive way. But it's not completely understood what exactly is happening here. I think there are some random directions: if nothing works anymore, you get some random weight updates which help the learning progress. But that's speculation. So, again: exponential gating,
matrix memory, and parallelization. I'm just interested in what triggered the flash of inspiration. I mean, if you could go back in time and tell your younger self about this, would your younger self just say, yes, absolutely, and would you have done it then? Yes, but my younger self would have had to see a couple of examples, because at that time
we didn't have these big language models. We didn't have these problems where we see that exponential gating helps,
that this big memory helps, because we did not have these data sets. I would not only have to say what to do, but also what data will come. Then I would say, yes, of course: I have such a small storage, and if you want to store much more, of course you have to do this. I would have seen it, but I would also have needed a glimpse into the future of what data will come. In your paper, you studied how these things scale with data and model size and so on. Can you tell me about some of the theoretical underpinnings?
These are standard scaling laws. They were not developed by us:
you increase the model parameters and they follow a certain law, a certain curve. And you compare it with transformers or state-space models, which also follow a certain law. And then you can extrapolate and say, if you build larger models,
we will also be better. But these are scaling laws which we used; they were not invented by us but by others. Which is nice, because you can now predict, if I make the model larger or if I use more data, how these larger models will behave. You mentioned state space models, things like Mamba. Could you just contrast with that? Mamba was the most competitive method against xLSTM, and
then after our publication of xLSTM, Mamba 2 came out.
And the nice thing is that Mamba 2 is xLSTM without an input gate. It's exactly the same, because it has e to the softplus, and e to the softplus is a sigmoid; you can do the math. Then you see they also have these forget gates, and they also have an output gate. Mamba 2 is like an xLSTM,
but with no input gate; the input gate is left out. So it's nice to see that the different methods converge a little bit to the same architecture.
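On the "e to the softplus is a sigmoid" remark: the precise identity is exp(-softplus(x)) = sigmoid(-x), which is easy to check numerically. This is just a verification of the algebra, not tied to any particular Mamba 2 or xLSTM implementation.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6.0, 6.0, 13)
# exp(-softplus(x)) = 1 / (1 + e^x) = sigmoid(-x): the exponential-of-softplus
# decay factor is a sigmoid gate in disguise.
print(np.allclose(np.exp(-softplus(x)), sigmoid(-x)))   # True
```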
We're not the same as Mamba, because they don't have the input gate, and I think the input gate is important. But the remaining architecture is very, very similar now. They started with state space models, we started with LSTM, with Hopfield networks and so on, and now we converge more and more to very similar architectures. Are you seeing any hints of industry adoption of xLSTM? Yes. First of all, xLSTM is now faster
than flash attention, in inference and also in training. I can tell you why. With flash attention, you have to put all the stuff along the context into the GPU. What we do now is chunks of flash attention, and between the chunks we do the recurrent stuff.
And we design the chunks of flash attention so that we can be more efficient on the GPU. If you have smaller chunks, you don't have to squeeze everything in and do inefficient stuff; you can make a chunk exactly as large as the cache is. We use the flash attention
technology, we stole it from those guys, but doing flash attention at the right size makes it fast. We do flash attention, then recurrence, flash attention, then recurrence. And now we are faster than doing one whole flash attention over the whole context, both in training and also in inference. It's called chunkwise flash attention, or that's what we call it. And this gives us speed.
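A rough sketch of the chunkwise idea (my own simplification using a plain linear-attention recurrence; it ignores the gates, normalizers, and the actual xLSTM kernels, and the chunk size 64 is arbitrary): within a chunk, all positions are processed in parallel against that chunk's keys, and only a fixed-size state is carried across chunk boundaries, so the chunk size can be matched to the GPU's fast memory instead of growing with the context.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=64):
    """Causal linear-attention recurrence computed chunk by chunk:
    parallel (matrix-matrix) work inside each chunk,
    a single fixed-size state S carried across chunk boundaries."""
    T, d = Q.shape
    S = np.zeros((d, d))                     # recurrent memory carried between chunks
    out = np.zeros_like(V)
    for start in range(0, T, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        # contribution from everything before this chunk (recurrent part)
        inter = q @ S
        # causal contribution from within this chunk (parallel part)
        mask = np.tril(np.ones((len(q), len(q))))
        intra = (mask * (q @ k.T)) @ v
        out[start:start+chunk] = inter + intra
        S += k.T @ v                         # update the carried memory once per chunk
    return out

rng = np.random.default_rng(0)
T, d = 256, 32
Q, K, V = rng.normal(size=(3, T, d))
y = chunkwise_linear_attention(Q, K, V)
```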
I didn't expect that we could be faster than flash attention in training. I thought, no, no way. But hey, it's unbelievable, fantastic. That we are faster in inference we knew, because here attention also has to go autoregressive: you have to produce a new word in generation and then push everything
into your system again. You produce a new word, and you have to push everything into the system again. You can cache some of the processing, yes, you can do it fast, but attention is not made for an autoregressive mechanism. In training, you have the whole sequence.
So we're fast. We are now faster even in training, unbelievable. But in inference I was sure that we are faster. And this gives us advantages in different ways. First of all, I will mention something in language. You're aware of this Strawberry, or o1, thing. Oh, yes. It's doing more on the inference side; it's more thinking. Yes. And on the inference side, we would be much faster.
If we're, say, 100 times faster in inference, we can do 100 times more thinking. And this is a big opportunity; it plays into our hands. It was so nice that it came out, because we are exactly there: we are fast in inference. That would be better.
But this fast inference speed also helps us to go into industrial applications away from language. Language is not at the core of many industries; there are not that many where there is a business in it for companies. But now I can go into robotics.
Transformers were used for robotics: DeepMind had a paper, Tesla had a paper, but they all struggled with the transformer being too slow. Yes. Sometimes you have to wait a couple of seconds before the agent reacts. Now we have something which is much faster. And we have a second advantage:
we have a fixed memory. We know in advance how large our memory is. If we now go embedded, onto an embedded device, we know how large the memory is, and we will design the LSTM with this fixed memory. And no matter how long the sequence was, you use the same fixed memory. If the sequence is 100 sequence elements or 100 million, you have the same memory. So we can fix the memory and we are very fast. And these two
things give us an advantage in going embedded, in going into robotics. There's even one company that has already tried it out for drones. They have these devices, these GPUs, on drones. And they emailed us. They don't want to reveal it. They said it's
unbelievable, so much better results. The drones are flying autonomously and you have to have real-time control; you cannot wait while the model thinks, and with this it's working. It's fantastic. This guy also talked to me at NeurIPS. They said they don't know whether they want to reveal it, because it's so good for them. It's a company.
So going into robotics, into drones, also into self-driving. In a car, you want to be energy efficient; you have batteries with you. You want to be fast, you want to be compact, you want to have a small but powerful system. And here I see
big advantages with this xLSTM. Perhaps even going to the cell phone; I'm not sure, I don't know the constraints of the cell phone, perhaps this is too far-fetched. But we have a thing which is energy efficient, it's fast, and we can control the memory, we can design the memory for the device, for the embedded device, for example. Do you think the xLSTM moves us closer to something that resembles symbol manipulation?
- Symbol manipulation? - Yes. - I don't know. We have a project about neuro-symbolic AI. - Yes. - Symbol manipulation, I would say in one sense, I think xLSTM is better at building abstractions. - Yes. - What I'm missing with the AI systems we have out there
is that I never saw an AI system build proper abstractions. It's always human-made: the language is human-made; if you look at ImageNet, a human put the object in the middle. I want to see an artificial system which comes up with a new concept which is not human-made. And xLSTM, I don't know whether it can do it, but in the memory, by combining more tokens, by combining
more from the past, perhaps you can build a concept, because it's more efficient to store an abstract concept than to store the single items, like attention would store the single items. If you can compress it into
something: if you see the sun, a beach, a cocktail, and so on, you say, ah, perhaps this is a beach holiday, and it's perhaps one abstract concept. And storing this is perhaps more efficient than storing the single items. And the same should happen in industrial applications: that you see
concepts, you see structure and you store the structure, not the single things. And if you have the right abstraction,
you are better at generalization, because if we have abstract concepts, in the future we will encounter these abstract concepts again, hopefully. Yes. The reason I ask is, of course, you've got your symbolic AI paper as well. And I'm really interested in neuro-symbolic architectures, and there are many approaches to doing that. So in Europe, we've seen many people using transformers to generate programs. Some folks are just skipping the explicit program generation and just getting transformers to perform symbolic-like
tasks, and transformers are incredibly limited. They can't copy, they can't count; there are lots of things they can't do. But do you think the xLSTM could overcome some of these obvious computational limitations of the transformer? Probably some of them it can overcome, but I think the solution is to combine both. I think what we have right now is not the final solution. We have to go symbolic.
And there are already things out there: a transformer perhaps using MATLAB to solve an equation or whatever, or querying the internet or whatever. I think we need both, because there are so many symbolic techniques out there. For 50 years we have developed them, and we should somehow integrate them, use them. I don't know if everything is learnable. Perhaps in principle it is,
but for now a shortcut would be to use what's already there and combine it in the right way. And in Austria the biggest AI project, it's about 40 million euros, I'm leading it, is Bilateral AI. Bilateral because it brings symbolic and sub-symbolic AI together. Because, as I said in my talk, scaling is over. Now we have to go into the industrialization of AI, and here we need new techniques,
and perhaps not only new techniques from the sub-symbolic side, from the neural network side. Perhaps we need things from the symbolic side to make things more robust, because a production process standing still or stalling, that should not happen. And therefore you perhaps need symbolic methods integrated with or surrounding the sub-symbolic methods, like large language models or others. I completely agree. I think we need to build hybrid systems. Yeah. Yeah. That's the neuro-symbolic approach and that's what we are doing in Austria in this big project. It's hard, it's still hard to bring these two communities together. Sometimes they don't like each other. On one side we have big success stories; the other side has other success stories. But I think that's the way
for things to advance AI, but also to make industrial AI, as I said in my talk, because for industrial AI, we need symbolic systems to make it robust, to guarantee stuff. We now have to team up with the symbolic guys to advance AI. I completely agree. So we need to have formal verification.
The thing is, though, can we have our cake and eat it? Because the only problem with these hybrid neuro-symbolic systems is the degree of human engineering. Can we automate the creation with some kind of architecture search? Because we're building these big systems that have many, many components, many verifiers and so on. How much of that can we automate?
In the group where we do the neuro-symbolic work, the symbolic guys say: hey, we need machine learning, perhaps to adjust the parameters of our symbolic stuff. The sub-symbolic guys say: we can perhaps use the symbolics as a shield surrounding it.
They don't merge it, they don't integrate it. Like learning rules, learning new symbolic rules within the rule systems: I know how the symbolic side works, but perhaps some learned rules are better. You have to integrate those things better. But right now these two groups are thinking in their own domains,
and I'm missing this. And if somebody is doing it, they take this from one community and that from the other community and glue it together, but it's clumsy, not nice, not elegant. Elegant, as we said, would be to learn some formal systems; but perhaps the learning should go into the formal systems, and the formal system should be
a subcomponent, an integrated subcomponent of, I don't know, a large language model or whatever. Right now it's not there; these two groups are too separated. Yeah, so in the connectionist camp, you know, there's Hinton and Bengio and LeCun. You and Jürgen, you're pioneers of connectionism in a way, and you're neuro-symbolic guys. You've always been. Why is that?
Perhaps going back in history, you have to know that Germany and Austria were very strong on the symbolic side. There was the DFKI and these formal systems, with many professors working on this.
And in the US and so on, there were these things where NeurIPS started, and Snowbird and so on. But Jürgen was, you know, he's still a guy who is thinking along different lines. And there was a big
group on AI, but it was formal. But he said, "No, I think it's neural networks." And when I went to the university as a student, everything was boring: theorems 50 years old, 100 years old; all the computer science, quicksort, everything, this old stuff. But then there was this neural network stuff Jürgen did.
Nobody knew what would come out. You learn something; this was super, super interesting. And this was also Jürgen's thing: it was something new, not something very old and traditional. And also, in the group we were in, we read these science fiction books.
And I would say, hey, I have a new science fiction book. And so many ideas came up, like how you can traverse the universe with generation ships, what's possible, what's not possible, good ideas. It was also the time of neural networks as a new technology, and a lot of innovative ideas and so on. And here, going away from this traditional symbolic thing,
this new neural network stuff was super fascinating. You did not know what would come out; you change something here and there. This was exciting. Yeah, and in a way that's very polymathic; it's knowledge of so many different fields at once. Certainly Jürgen was talking about things like Gödel machines and recursive self-improvement, artificial creativity, all of these amazing ideas. In some sense, they were before their time.
But do you think things are starting to swing back the other way? I mean, I'm certainly seeing, look at DeepMind, for example, loads of neuro-symbolic architectures coming out. Do you think that the consciousness is changing a little bit? I think so. Perhaps it has to, because I think our path ends with scaling up, with making things larger. And I don't know whether it was the right way, because it's more about storing more information in these systems.
You put in more training data to make the systems larger, but not smarter. They're not different; they're only larger. And if this comes to an end,
we have to be smarter. And I think the symbolic way, or the neuro-symbolic thing, has to come, because it gives us a way forward. Because with the purely sub-symbolic way, with neural networks, where should we go? What should we do? We scale it up, we have these almost brain-like models now, but something is missing. It's not what humans are doing. Humans
learn differently: they learn from a few examples, they have other abstraction capabilities, they are much more adaptive, they can plan. Something is missing, something is missing. And perhaps neuro-symbolic gives us what is missing,
what we miss. How do we blend these ideas together? People think of System 1 and System 2 as being completely different, and they might be very interlinked, you know; a lot of reasoning is perception-guided. How do we really integrate these ideas together? Yeah, it's very popular after Kahneman, and also in the Turing Award speeches it was always System 1 and System 2. It's very compelling, but
I'm also not sure whether there's a clear separation. Okay, there's perhaps a clear separation if you play a game of chess and you start to plan; that's System 2. But there are these intermediate things, something you do on a gut feeling, like opening a door: you grab the handle and you don't think about it.
But sometimes you think a little bit. I think it's a gradual thing. And sometimes you plan two steps: should I go here or there? What's faster? Here some guys are coming. It's a very short bit of thinking. And for me it's not so separated into "I do it immediately" and "I do long, long thinking."
I think everything is gradual. Many things you do
intuitively, System 1-like, and sometimes you have something you're really thinking about. But there are so many things in between. If I leave here, will I go home? I can go straight ahead, or perhaps I go down there. So I make a couple of decisions; it's a little bit of planning and it's intermediate. I don't think there's a clear
difference between system one and system two. Yes, I agree. I agree. The abstractions in these systems, should they always be human intelligible? Now, what I mean by that is, you know, Elizabeth Spelke had these core knowledge priors and, you know, things like agentness and spatial reasoning and objects and stuff like that. And it's almost as if there's a core set of basis functions that we've acquired or learned about how the world works and
And that suggests that any reasoning system would just compose those simple priors together. Is that all there is with reasoning? Or do you think that AI systems could discover weird alien forms of reasoning that we wouldn't understand?
That's what I believe. Also different concepts: the concepts we developed, words and so on, are what helps us. For example, in a neural network, perhaps you have speed, acceleration, and so on; you have these concepts. But if you now do a linear transformation of this,
you have the same information, but a little bit mixed up. For neural networks that's no problem, because with a linear transformation you can do the inverse. It's the same information, just distributed a little differently. And perhaps sometimes it helps that the information is distributed differently. And I think for us, we developed concepts and abstractions which help us as humans,
which help us to convey, from one generation to the next, experience, what we have learned,
and also to inform others, to pass on food and so on. And that's the most important thing we do, because if our kids have to learn which mushrooms are poisonous and which are not, most of the information our children get is from the previous generation; they go to school and so on. And I think our language, our abstraction, is
designed, is tailored, to transmit this information from one generation to the next generation. Because that's most of the information: what you acquire as a single human is much less than what you acquire through the culture, I think. And therefore, I think our abstraction, our language, our kind of thinking
is tailored to our society too. And I think AI systems should come up with completely different reasoning, but also different abstractions. For them, other concepts might be much more useful, because they live in the same world in a different way, let's say;
they manipulate the world in a different way. Yeah, it's something I think about a lot, because there's this constructive component to abstractions, as you're talking about. So there's the language game, and we have this mimetic cultural transfer, and it seems to be in service of some utility for us to understand each other. But they are still grounded in the physical world. Acceleration is a thing in the physical world. Is it?
I would challenge you: for us it is, but perhaps acceleration plus something else combined is the real thing. Yes. I don't know. Is it only convenient for us because it helps our kind of thinking?
Or might it be acceleration plus the location? I don't know. But we also have this weird ability as humans to think of things that are not directly coming from our sensory experience, like abstract, mathematical, platonic ideals and so on. Yes. And where do those come from? I think...
Many of these things, first of all, they could be only symbols. They're only placeholders for something. More interesting is also in physics because you have a concept of an atom. Probably you never saw an atom. I at least did not see an atom. But you have a concept of an atom. If you now say, what shape is an atom? You say it's perhaps a ball. It's a circle. Why?
Perhaps it's a triangle or whatever. You do this kind of abstractions and you have some image in your head for things. But often it's a placeholder. If this and this together, let's call it blah, blah. And we invent a nice word for it.
and you have an intuition, perhaps you even have an image in your head for this, but sometimes it's abstract; it has no counterpart in reality. Yes, exactly. So there's this huge difference between the semantics and the actual thing. I often think, if we took a 21st-century physics book back in time and gave it to Newton, I don't think he would understand very much of it at all. Yeah.
I completely agree. Yeah, yeah. We are trained in a specific way of thinking; it's perhaps different from the thinking many generations ago. Indeed. This has been amazing. Can you tell the audience a bit more about NXAI? NXAI is a young company. The first idea, the first founding: I already told you about this xLSTM. I was super excited,
and I'm at the university here. So I went to the media and said, hey, I have a new idea, but I don't have the money to show that it's a cool idea. And then there came this
venture capital thing: do you have a business plan? I said, "No, I'm not interested in a business plan. I need some money to show it's a cool idea. I want to keep this cool idea in Europe. I want to keep it local." And that was a concept nobody understood, until somebody local said, "Yes, I'll give you some money."
Let's first fix the technology and then build on top of it, perhaps verticals, some companies. This is how it started. NXAI was for this xLSTM: the first 10 million euros went into compute and into the first paper. And now NXAI is more; it's a company dedicated to industrial AI.
One pillar is this xLSTM, a new technology we want to develop. We have now shown with the 7B model that we can compete with the transformer technology. It's powerful enough, but
it has other advantages, like this energy efficiency and speed, so that we can go in other directions in industry, not only language. Because with language there are so many companies doing language and competing, and I don't know whether we can make money there; it's not our core business. The second pillar is AI for simulation. Yes.
And with AI for simulation, here too we have big success stories, because now we can do simulations which the numerical simulation methods cannot do. Take discrete element methods: it's like particles, you have many particles, but if it's a million particles, 10 million particles, 100 million particles,
the numerical method cannot cope with it, cannot do it anymore. The same with mesh points: perhaps you have computational fluid dynamics, you have these mesh points, like when air goes over a car, or you have all these points over an airplane. And sometimes there are so many mesh points that the numerical methods do not work anymore. Now we have cases where,
for example, for a car, you change something and the numerical simulation takes three weeks. The guy does something, goes home, and after three weeks he looks up what came out. We can do it in three minutes. And what is the idea behind it? Why are these
neural simulations so good? I always give the example of the moon. The moon can be described by a location, by a momentum, perhaps by a mass, but we don't describe each particle, each atom or each grain of sand. And it's a very good prediction of where the moon is in an hour, or the next day, or whatever. And also in many numerical simulations,
you can group particles, because there are structures. And if you can group, like if you throw a snowball, you would not simulate every snowflake, but the whole snowball. And it's quite good. And if the AI system can identify these structures, where 10,000 particles stick together or do the same thing on average, or whatever,
you can speed up the simulation. And this is happening. You have the things where particles are somehow synchronized or glued together. And one example is--
if you have something like grains, you have grain in a machine, and there's no physics for the grains. For this, the numerical simulation has to go down to the atomic level or something like that, and how they interact. But if you have this grain and another grain and you can learn the physics of how they
interact with each other, this one pushing that one, and perhaps how they interact if they're a little bit wet, if they're a little bit larger and so on, then instead of the thousands of
points which the numerical simulation needs, you have one thing, a grain of sand or whatever. You learn the physics of sand grains; in the numerical simulation there is no physics of sand grains, you have to go down to the atomic level. This helps a lot to speed up the simulations. And they're super powerful, because now we can do simulations where the numerics struggle. And this goes so far
that we have requests from the local industry, like the steel industry: you have this big furnace with steel, and you cannot simulate it because numerically there are too many particles.
And often they have to build a prototype, a larger prototype, because the simulation cannot cope with it. Now we can skip the prototype, and the prototype is worth 100 million euros. We can simulate it and then build the real thing. And this gives industry a big, big push, if this works. That would be the simulation idea. Johannes Brandstetter is the guy.
He will come, hopefully. He will tell you much more about it. But I think it's super fruitful. I think it's a super cool direction. But ask Johannes. He can convince you much better than I can do. Well, he's coming here in 30 minutes. So I don't know about it. Sepp, it's been an honor and a pleasure to have you on. Thank you so much for joining us today. It was a pleasure to be here. It was fun. I enjoyed it. Thank you. Wonderful. Amazing.