For 56 years, designing an algorithm that multiplies two 4x4 matrices with fewer than 49 multiplications was open over any field, until today. We've had exclusive early access to the brand new Google AlphaEvolve paper, which was released just minutes ago, and we did a technical interview with the authors before anyone else. The paper itself drops a bombshell, setting world records on many algorithmic and mathematical challenges.
In the world of computer science, few problems are as fundamental as matrix multiplication. For over half a century, a specific efficiency benchmark in this domain, particularly for 4x4 matrices, seemed insurmountable. The search space for optimal algorithms is immense, making exhaustive exploration practically impossible, even for relatively small matrices.
In 1969, Volker Strassen revolutionized the field by discovering an algorithm to multiply two 2x2 matrices using only seven scalar multiplications, down from the standard eight. The established best practice for larger matrices, like 4x4s, was to apply Strassen's 2x2 method recursively. For 4x4 matrices, this meant 7 × 7 = 49 scalar multiplications.
Today, AlphaEvolve beat this record. Google DeepMind has a long history of building AI systems which actually invent new knowledge through experimentation and iteration, rather than just being a glorified database.
We saw AlphaGo, which beat Lee Sedol, learning from human games and even surpassing champions through self-play. AlphaZero was purely self-play. And AlphaFold predicted millions of 3D protein structures that had never been measured experimentally. And AlphaDev discovered faster sorting algorithms.
Google DeepMind has recently been focused on scientific discovery with AlphaTensor, which framed the problem of finding faster matrix multiplication algorithms as a game, achieving breakthroughs. And FunSearch took us even further, using large language models to find new mathematical solutions by evolving code.
And now AlphaEvolve represents the next stage in this lineage. For the usual case of matrices that have arbitrary numbers in them, still nothing better was known than doing Strassen twice, using 49 multiplications. So we were really excited when we used AlphaEvolve in this setting.
We actually didn't even hope that it would find something better than 49, because we were trying for so long with AlphaTensor. We just ran it for completeness, because we wanted to have this table in the paper showing that we tried all the sizes up to five or six. And remarkably, it found a faster algorithm which uses 48 instead of 49 multiplications.
It's a bit like Cursor on steroids. It iteratively refines algorithms, drawing on the creative power of LLMs using meta-learning, library learning, automated evaluation and evolutionary search.
The AlphaEvolve paper describes it as an evolutionary coding agent which substantially enhances the capabilities of pre-trained large language models on difficult tasks. Of course, their first paper was FunSearch, which was very, very similar. The main difference, I think, was that it was just...
searching for a single function rather than Alpha Evolve, which can essentially work over an entire code base. You just demarcate adaptable regions in a code base and it can search for those. And of course, it's even optimizing for interactions between those functions in different parts of your code base. One potential issue, as Keith Duggar pointed out, is the classic halting problem in computer science. But there's also a really subtle limitation here I want to explore with you guys a bit, which is
Running the programs themselves, we face issues, right? You face issues of the halting problem, for one thing. You start running some code, like the code
compiles fine or doesn't crash immediately and it starts running. But, you know, maybe some time goes by, like an hour, and you're like, well, geez, is this thing ever going to terminate? In theory, you're of course right. You can never tell that if you run an algorithm for longer what it would have done. But in practice, it actually hasn't been any sort of issue for us in the applications that we looked at. Alexander Novikov added that this challenge is very much like a fundamental aspect of human research, too.
How do you know that you should stop working on your problem as a human? Or maybe you spend a month more and then you solve it, right? It's hard. I mean, of course, there's still limitations. They didn't bite much on my halting problem, you know, limitations. So like in practice, you know, it was okay for them, but they had these very well-chosen problems, right? Where like, okay, matrix multiplication, you know, if it's any slower than Strassen's algorithm,
you know, maybe you can turn it off. But the problem is that that restricts your open-ended search capabilities, right? You can't go down the path of a potentially slower algorithm for now that may get you to a stepping stone that hops you over to something more efficient, right? It is interesting, isn't it? There's so many breakthroughs in science are unknown unknowns. We might have a nose for what is interesting, but we never really know where to find the answers. And so often it's the case that we find them through sheer luck.
But there must be at least one step we can take in the direction of mechanising and accelerating this process. In the case of matrix multiplication and several other scientific problems which had a clear evaluation function, this new evolutionary approach achieved what decades of human research could not. We just ran it for completeness, because we wanted to have this table in the paper showing that
We tried on all the sizes up to five or six. And remarkably, it found a faster algorithm, which uses 48 instead of 49 multiplications. Yeah, when one of my teammates messaged on the channel that, oh, it seems like we have this result, I just couldn't believe it. Like, let's triple check it. AI's growing ability to generate entirely new, provably correct algorithms can advance the frontier of science.
And the really cool thing is that AlphaEvolve has already been applied to optimize mission-critical real-world systems within Google. They're able to speed up already heavily optimized pieces of important computational infrastructure within Google. For instance, take Google's massive data centers. Efficiently scheduling computing jobs is a really complex operation. If done suboptimally, expensive servers will sit around idling.
The Google engineers placed a candidate solution into AlphaEvolve, which then evolved a smarter heuristic that assigned jobs to machines much more efficiently.
They said, post-deployment measurements across Google's fleet confirmed the simulator's results, revealing that this remarkably simple yet effective function continuously recovers, on average, 0.7% of Google's fleet-wide compute resources, which would have otherwise been stranded. Now, that's a huge saving on Google's scale, right?
And in another instance of self-improvement, it even found ways to accelerate the training of the very Gemini models which power AlphaEvolve itself. We have found a way to speed up the training of the next version of Gemini by 1%, and we've been able to unstrand resources in the Borg data center. This instance was also interesting because it didn't just generate the solutions, but also the programs which generated them.
Whenever you look at the index in the for loop, you just look at the index modulo four. And when you index into the array, you're indexing at positions i and i plus four and i plus eight, like, okay, what's going on? And just by inspecting the code, we were actually able to develop, you can think of it as, a mathematical insight or a mathematical hypothesis. And that hypothesis turned out to be really crucial for then improving the result.
A key aspect of AlphaEvolve's sophistication lies in its flexible approach to representing the search problem. The LLMs will propose a broad range of things, and some of them will be stupid, some of them will be amazing, some of them will be really weird. And then by having the evaluator, you can filter those and identify the ones that are actually important and improving things. So isn't it really cool that, rather than trying to generate the solution itself, AlphaEvolve can, just like Inception, generate the thing which generates the solution?
We should play some Inception music in the background.
The thing we found really interesting about AlphaEvolve is that it's still very much a humans-in-the-loop thing. Humans identify what's interesting, they find problems that have clear evaluators, they place candidate solutions in the loop, and then AlphaEvolve will traverse this cone of possibilities and make jumps along the way, and then the cycle continues. So this is very much sketching a future of AI where there is a strong collaborative loop between humans and AIs.
And we should just bring in how you modeled this representation. So you had these different approaches. You could directly model the solution. You could model a constructor function. So, you know, you're actually learning a function which itself constructs the solution. You could be learning a search algorithm, which is what you did with the matrix multiplication. And you also spoke about coevolution as a possibility.
So many people are talking about this vision of AIs which can autonomously just drive cars or just do anything, generate content without any supervision from humans. And that hasn't really panned out, to be honest. And certainly a lot of the content on the Internet is slop.
Do you remember that dead internet theory by IlluminatiPirate, this guy on an internet forum a couple of years ago? He said that by now most content on the internet would be generated by AI and it would have a kind of superficiality to it. And he was right. But it doesn't necessarily mean that AI is bad, right? What's missing is that we need to have this exchange. We need to use AIs as tools, and we need to guide them and refine the results and do the process iteratively.
That's kind of what AlphaEvolve does. It mechanizes the correct way of using AI. The thing that makes AlphaEvolve so cool and powerful is this kind of back and forth between humans and machines, right? The humans ask questions, the system gives you some form of answer, and then you improve your intuition, you improve your question-asking ability, right? And you ask more questions. I gave this
this talk a few times about, like, generative AI, and the big warning I gave is just the rise of mediocrity. We're just going to be inundated and flooded with mediocrity, and the best content will still be produced by the most skilled people. And all that's going to happen is, as this tide of mediocrity grows larger and larger, people will get hungrier and hungrier
for the cream, for the things that are rising to the top, right? So I think all that's going to happen in the end is that skilled people, their productivity is going to just rise and they'll continue to be differentiated from the mediocre sort of hordes whose productivity is also enabled.
So, you know, the hordes become more productive and experts become more productive, and as a whole we all become more productive. And while we're on the subject of amazing, innovative architectures for discrete program synthesis and reasoning and whatnot, why don't you consider applying to work at Tufa AI Labs? That's if you're an ML research scientist or ML engineer.
Benjamin Crouzier is running the lab. It's in Zurich at the moment, and they're thinking about opening an office in San Francisco as well. They would love for you to get in contact and apply if you're interested in working for them. So give Benjamin a shout. Guys, welcome to MLST. It's an honor to have you both here. Thank you for having us.
So we have had privileged access to read your AlphaEvolve paper. It's really, really exciting, because this is very much up our street. We love program synthesis and we love evolutionary methods. I mean, we had Kenneth Stanley on, and, you know, Jeff Klune; I was speaking to him at NeurIPS. And of course, he wrote the MAP-Elites paper, which influenced some of the stuff that you guys have done. But, you know, just from a thousand miles away, could you describe the work that you've done?
So we are presenting a coding agent, which we called AlphaEvolve. And what this agent is able to do is design quite advanced algorithms. And when I say advanced, I mean algorithms that are able to make new discoveries in the sciences, and we have quite a few examples in mathematics and computer science in this paper. Or, on the practical side of things, they're able to speed up already heavily optimized pieces of important computational infrastructure within Google.
Yeah, I'm curious. I mean, what made you go down the path of including EA algorithms, you know, evolutionary algorithms? What kind of said to you guys, this is a component that we need in a hybrid system, along with, you know, verifiers and LLMs and whatnot to take us in a step forward?
Yeah, I think if you consider the process of scientific discovery, then it's a very natural choice. And you mentioned you spoke to Kenneth; evolutionary algorithms, on a high level, give you this diversity in the exploration process, making sure that you don't, early on in the process, just zoom in on a particular approach which might be suboptimal in the end, but you keep exploring the vast array of possibilities that you have.
And especially when you think about trying to solve really difficult problems and making new scientific discoveries, then a priori, there is no way of knowing what is going to be the right approach. So you do need to make sure that you keep exploring different possibilities. And evolutionary algorithms are just a good technical tool that fit the bill really well for this purpose.
And also, I think it's very fun to play with, right? Like, if you want to set up an RL algorithm, it's going to take you like, I mean, depending on the algorithm, I guess, but it's going to take some time, right? And with EA algorithms, you can just do it, right? Like, you have LLM APIs, you just call things, you try things. It's fun.
Alex, can you just sketch out the architecture? Many folks at home will be familiar with FunSearch, for example, and I guess this is an evolution of that, pun intended. But can you just kind of sketch out how the whole thing works? Yeah, of course. I mean, should I assume that people are aware, or should I start from scratch with the FunSearch bits?
Yeah, let's keep this bit really, really, really simple. We'll kind of, the way we'll do the show is we'll have a progressive disclosure of complexity. So we'll have a hook and it'll be super, super broad and then we'll kind of progressively get more detail. But, you know, this bit, just do like super high level. How does the whole thing work? Yeah, makes sense. And we'll also show, you know, diagrams from your paper and any others you give us and, you know, animations and some visual aids. Awesome.
Yes, so the high-level architecture of AlphaEvolve is an evolutionary method where you basically pair the... So we only focus on problems where you have a way of evaluating progress, right? For any given proposal, for any piece of code that the system gives you, you can automatically test whether it's good and how good it is.
And this is, I think, the key aspect of maybe the results we got: having this evaluator, on the one hand, restricts you to the set of problems where you can have it. But on the other hand, it's quite a broad range of problems. And then it gives you a lot, right? Like, you can quickly iterate and get feedback. And...
What it in particular gives you is that you pair the creativity of LLMs with this evaluator, right? So the LLMs will propose a kind of broad range of things, and some of them will be stupid, some of them will be amazing, some of them will be really weird. And then by having the evaluator, you can filter those and identify the ones that are actually important and improving things.
And then this pairing of LLMs with evaluators is wrapped in an evolutionary pipeline, which tries to iteratively identify the most promising pieces of code, then focuses on improving those and exposing them to the LLM: here's what you tried before, this thing worked, this thing doesn't work, please try to propose a new thing. And then maybe the final ingredient is scale, right? Running these things in parallel.
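To make that loop concrete, here is a minimal sketch in Python. This is not DeepMind's implementation; `evaluate` (the automated evaluator) and `llm_propose` (the LLM call) are hypothetical callables standing in for the components just described.

```python
import random

def evolve(initial_program, evaluate, llm_propose,
           generations=100, population_size=20):
    """Minimal sketch of an LLM-driven evolutionary loop (hypothetical API).

    evaluate(program) -> float: automated evaluator, higher is better.
    llm_propose(parent, feedback) -> str: asks an LLM to modify a program.
    """
    population = [(initial_program, evaluate(initial_program))]
    for _ in range(generations):
        # Tournament selection: sample a few programs, keep the best as parent.
        contenders = random.sample(population, k=min(3, len(population)))
        parent, parent_score = max(contenders, key=lambda p: p[1])
        feedback = f"Previous attempt scored {parent_score:.3f}. Propose an improvement."
        child = llm_propose(parent, feedback)
        try:
            score = evaluate(child)    # the evaluator filters the LLM's proposals
        except Exception:
            continue                   # broken proposals are simply discarded
        population.append((child, score))
        population.sort(key=lambda p: p[1], reverse=True)
        del population[population_size:]   # keep the most promising programs
    return population[0][0]
```

In AlphaEvolve itself, the analogous loop operates over whole code bases with demarcated evolvable regions, and many evaluations run in parallel; the sketch only shows the single-threaded skeleton.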
You mentioned that needing the evaluator limits the class of problems, but there's also a really subtle limitation here I want to explore with you guys a bit, which is running the programs themselves. We face issues, right? You face issues of the halting problem, for one thing. You start running some code, like the code
compiles fine or doesn't crash immediately, and it starts running. But, you know, maybe some time goes by, like an hour, and you're like, well, geez, is this thing ever going to terminate? Is this going to contribute to my gradient or not? I don't know. So maybe I have to just terminate it after some resources are consumed. But if I'd waited just, you know, five more minutes, I would have gotten an answer that was the godly algorithm, right? So, like, this fundamental problem:
How do you deal with it now? How do you think we're ever going to get around issues like that? You know, what do you envision for overcoming that limitation?
So in theory, you're of course right. You can never tell that if you run an algorithm for longer, what it would have done. But in practice, it actually hasn't been any sort of issue for us in the applications that we looked at. And one concrete thing I can say is that often you might frame the problem in the way where the kind of time constraint is built into the problem definition. So let's say you could say,
I'm trying to solve this open problem in mathematics, and I'm looking for a search algorithm that is able to make progress on this open problem. But I want a search algorithm that is able to make progress in 10 minutes. So that is part of my problem definition. And then when I'm evaluating the proposals that the language models make, I only run them for the 10 minutes. And so I only explore the space of algorithms that is able to make something happen within those 10 minutes.
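Here is a minimal sketch of what building that time constraint into the problem definition might look like. The names are hypothetical, and it assumes a fork-based multiprocessing start method (as on Linux).

```python
import multiprocessing

def evaluate_with_budget(candidate, problem_input, budget_seconds=600):
    """Run a proposed search algorithm under a hard wall-clock budget.

    The 10-minute limit is part of the problem definition: a candidate
    that does not finish in time simply scores as having made no progress.
    """
    def worker(queue):
        queue.put(candidate(problem_input))

    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=worker, args=(queue,))
    process.start()
    process.join(timeout=budget_seconds)
    if process.is_alive():
        process.terminate()   # no need to solve the halting problem:
        process.join()        # the budget is part of the task itself
        return None           # "no result within the budget"
    return queue.get()
```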
Sure, I might miss out on algorithms that would have done even better when run for longer. So that is indeed an in-principle thing that you can never eliminate. But in practice, it actually hasn't been an issue for us. And I guess this question is also fundamental to research in general, right? Like, how do you know that you should stop working on your problem as a human? Or maybe you spend a month more and then you solve it, right? It's hard. I don't know.
Yeah, it's like the secretary problem. One thing that fascinates me, having interviewed so many people in this space, is, you know, we speak about diversity preservation, novelty, serendipity, open-endedness, creativity. And we really want to design algorithms that can break free, that can make creative jumps. I mean, Demis spoke
about this ladder of creativity, where you have inventive creativity. That's the thing that we really need. And as I understand it, your system is a little bit like an automated version of Cursor, where you have these code gates and you put an initial solution in there. So there's a little bit of domain knowledge. There was one example where I think you decided to use a bin packing algorithm for doing the scheduling on the hardware at Google. And I guess the question is:
depending on the starting solution, there's a kind of cone of things, of leaps, that we can make. And can you speak to any examples where the system made really imaginative leaps?
Yes, so I can speak to that. And indeed, you identified one interesting feature of the system, which is that depending on what you tell it at the beginning, you can guide the process. So if you give it fairly specific instructions, or you ask the system to start from a specific type of solution, then what it will usually do is squeeze out the juice of that idea or that initial solution, see how it can tweak it, and bring it to its maximal potential.
And sometimes this is the right approach to take, maybe when the problem is particularly difficult or has some specific features. But by default, you would start with a solution that's really, really empty. So you give AlphaEvolve a code skeleton where all the functions are almost an empty implementation. You just return zero or return false and so on. And you just let it be completely creative.
And so it just has to rely on the background knowledge of the base LLMs, and it can explore in all possible directions. And the evolutionary algorithm makes sure that you preserve diversity as you keep doing the exploration.
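As a rough illustration (the exact marker syntax in the paper may differ), an almost-empty skeleton of this kind could look as follows, with only the demarcated regions open to evolution:

```python
# Illustrative AlphaEvolve-style skeleton: every function body starts almost
# empty, and only the marked regions are rewritten during evolution.

def compute_loss(candidate, target):
    # EVOLVE-BLOCK-START
    return 0.0        # deliberately empty: the system must invent the loss
    # EVOLVE-BLOCK-END

def update_step(candidate, gradient, step):
    # EVOLVE-BLOCK-START
    return candidate  # "just return zero or return false and so on"
    # EVOLVE-BLOCK-END
```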
And one particular example I can talk about is that we applied AlphaEvolve to discovering algorithms for matrix multiplication. And we did it by asking it to actually design kind of a search algorithm, a gradient-based search algorithm, that looks for the matrix multiplication algorithms in turn. So it's a bit of a meta thing: you look for an algorithm that finds an algorithm. But
within that first algorithm, the search algorithm, we started from a really simple code skeleton, giving it basically nothing. We just told it to use gradients, basically. And then it was able to write these complex loss functions and update functions, which had all sorts of tricks about penalizing various behaviors and introducing randomness in completely unexpected ways, which were like, okay, wow, this is the type of
code that maybe a human could plausibly write, but would they have actually thought of writing this particular piece of code? That was really an aha moment, at least for me, that, wow, this is doing something kind of human-like, but not something that obviously a human would try.
And maybe another cool story about this, which I don't think we added to the paper because maybe a paper is not the right format for it, is that Adam on our team did a cool experiment on trying to give advice from humans to the system. He asked a few people: you, please think about this problem for two minutes; you, please think about the problem for 30 minutes; and then
write down the notes and give them to the system to kind of guide it through the process. And then he compared what the outcome of that would be. And you can see that, as Matej was saying, it's kind of squeezing all the juice out of the idea. It preserves the essence of the idea, because it guides the LLM towards things like that, but it will optimize a lot of small things. And in a lot of cases it will do so in intelligent ways; in a lot of cases it will be more of a 'try a bunch of things and one of them will stick'. But yeah.
So it's kind of cool to watch. Yeah. And so at the moment, the architecture has, let's say maybe two sources of base knowledge. It has the base model itself, right? Which is obviously compressed, you know, all of the corpus it's been trained on, including tons of code and algorithms and numerical recipes and whatever. And then it has the starting program that you put in the sort of gated bits of logic and whatnot. Yeah.
Is there any room for an augmentation that's almost a middle ground, where there's, say, a secondary database that contains modules or pieces of code that are known to be very effective in other problems, that it can somehow simultaneously draw on? Yes. So a few points on this. One is that you mentioned there are two sources of knowledge in the system. And debatably, I would say that there is a third source, which is that the system can decide to
augment its own knowledge. What I mean specifically is that the system proposes an algorithm, and then that algorithm is executed on a machine, and you see the results of having run it. So on a sufficiently high level, you can think of it as the system deciding: okay, I want to gain this piece of knowledge, what does this algorithm do when you actually run it? So that is an important component. But maybe going closer to the essence of your question, indeed, there could be
a separate, perhaps human-curated, database of useful modules, or anything of that sort. But even more excitingly, this database can be curated by the system itself.
And so this is an idea that we are thinking about for AlphaEvolve, but there is a related idea that is already implemented and mentioned in the paper, which is not building a curated set of generally useful modules, but a curated set of prompts that tend to work well. So there is this idea called meta-prompting described in the paper, where we actually ask our language models to propose their own prompts. So we just tell them
what we are trying to do, like we're trying to do this evolutionary algorithm for improving on this particular problem. And we are going to prompt you with this particular prompt. But before we do that, please propose a modification to this prompt itself. And then we curate a set of prompts that actually work well for this purpose. And so in spirit, that's a similar idea, although it's curating prompts rather than programs, but both make sense to me.
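A minimal sketch of one meta-prompting round, with `llm` standing in as a hypothetical text-in, text-out callable:

```python
def meta_prompt_step(llm, current_prompt, task_description):
    """Ask the model to rewrite its own prompt (sketch, hypothetical API)."""
    request = (
        "We are running an evolutionary search on this task:\n"
        f"{task_description}\n\n"
        "We currently prompt the proposer model with:\n"
        f"{current_prompt}\n\n"
        "Before we do that, propose a modification to this prompt itself that "
        "would elicit better code proposals. Reply with the new prompt only."
    )
    return llm(request)

# Prompts whose proposals score well are kept in a curated pool, mirroring
# how the programs themselves are kept in the evolutionary population.
```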
Yeah, I think if I understand the direction Keith was going in, because everything you've just described is fascinating. It's various forms of meta learning to basically create diversity and divergence. But I feel that the next trillion dollar business could be where we, you know, like in program learning, right, we want to construct a library. And right now we're kind of expanding this library of functions for a particular purpose.
But what if the library itself was the new oil? You know, what if there was strong robustness between these programs that we've learned for this thing and they generalize, you know, through some analogical relation to other programs and other domains? I mean, what if we just had like the new language model paradigm was actually a kind of program database? I mean, do you think that could work?
Yeah, I think that's a fascinating idea. And we see maybe some first glimpses of that. So it is true that for now we are going in the depth direction. We just focus deeply on the problem and just try to solve that problem. But even within that path, let's say when we worked on matrix multiplication, we saw that when we run AlphaEvolve different times, we discover slightly different algorithms.
And then it's actually a useful technique to take those algorithms and use them to initialize future experiments. So that's kind of a first step towards like building this database of things that were useful in the past to act as inspiration for solving future problems. So yeah, maybe Alex has more thoughts.
Maybe another similar thing I feel is happening is not that we store or use the database of programs we produced over all the experiments we did, but
kind of that our own human intuition, as users of the system, is definitely evolving. And I kind of feel like a consultancy, right? We collaborate with a lot of teams at Google, trying to help them run things with AlphaEvolve, and through that process we gain a lot of knowledge of what works, what doesn't, what we should try next. And that sort of thing is somewhat similar to what you're describing, right?
Yeah, I mean, I can almost imagine AlphaEvolve having its own repo, you know, whether it's internal, or maybe they'd be nice enough to put it up on GitHub for us. But it would just be constantly evolving and contributing to its own repo and maintaining it and categorizing it. And people could go take a look. I wonder if AlphaEvolve has come up with a better search algorithm. Go check the search area and see if it's got anything new there.
Yeah, like technologically, I don't see hurdles for this. It's maybe the organizational question of how exactly to make it happen. And one particular technological piece that's already there is that we already now look for programs that work well across a range of tasks.
And right now, maybe this range is, let's say, fairly constrained because we care about the constrained range, but there is nothing blocking us from expanding it. So concretely, we would be looking for search algorithms that are able to find, let's say, matrix multiplication algorithms for different sizes simultaneously. So there is this...
sort of generality, but there is nothing preventing us from saying, let's look for search algorithms that actually work well across a much broader range of tasks. That's not just matrix multiplication, but also other search problems.
Can we meditate on the matrix multiplication thing, only because that's the headline hook of this video? I think we're going to start the video by saying there's this amazing result. There was this Strassen guy 56 years ago. He had this big result. Now AlphaEvolve has just defeated it. There's quite a few things to explore here. I mean, obviously, please just explain the whole thing. But also, there's some interesting stuff around...
you went up to, I think, a maximum of, this is like complex-valued matrix multiplication, you went up to six by six, and even that was a bit of an overshoot. And there were some interesting properties where, as the size went up, it started to get a little bit sketchy, but it did improve more. Can you just talk through that whole story? So let me start maybe with the high-level picture first. So
multiplying matrices, that's like a super basic operation. And some of us like get taught this operation in high school. And there is a very specific way how you multiply matrices when you're taught to do this in high school. Like you, at every step, you need to take one row of one matrix and one column from the other matrix. You compute the scalar dot product of these two things. And that gives you one entry in the output.
And so this is one specific algorithm, which is kind of the basic algorithm for multiplying matrices. For every element in the output matrix, you need to do this one dot product.
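In code, the high-school algorithm is the familiar triple loop: one dot product per output entry, hence n^3 scalar multiplications for n x n matrices.

```python
def matmul_naive(A, B):
    """Textbook matrix multiplication: one dot product per output entry."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):              # row of A
        for j in range(n):          # column of B
            for k in range(n):      # dot product of length n
                C[i][j] += A[i][k] * B[k][j]
    return C                        # n * n entries, n multiplications each
```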
And for a long, long time, people thought that this was obviously the only way to multiply matrices. How could there even be something better? And then, as you mentioned, Volker Strassen in 1969, it was really a shock to the mathematical community, he wrote this paper saying that actually there is a faster way of doing it. And so already for multiplying two by two matrices, the smallest non-trivial case,
if you do it the high school way, you need to do eight multiplications, because there are four entries in the output matrix, it's a two by two matrix, and each inner product requires two multiplications. So four times two is eight. But Strassen came up with this ingenious procedure that only requires seven multiplications.
And it's kind of a magical procedure, where you build some combinations of entries from the first matrix and the second matrix and you multiply them. And then you are able to combine the seven products in such a way that you get these magical cancellations, and the result is just correct.
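For reference, in its standard form Strassen's seven products and their recombination are:

```latex
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), &\quad M_2 &= (A_{21}+A_{22})\,B_{11},\\
M_3 &= A_{11}\,(B_{12}-B_{22}), &\quad M_4 &= A_{22}\,(B_{21}-B_{11}),\\
M_5 &= (A_{11}+A_{12})\,B_{22}, &\quad M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}),\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}), & &\\[2pt]
C_{11} &= M_1+M_4-M_5+M_7, &\quad C_{12} &= M_3+M_5,\\
C_{21} &= M_2+M_4, &\quad C_{22} &= M_1-M_2+M_3+M_6.
\end{aligned}
```

Seven multiplications of combinations of entries, and all the cross terms cancel exactly.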
So this was a big surprise in 1969, and it opened up this entire new area of research: okay, so for two by two you can actually do seven instead of eight. People then quickly proved that seven is actually optimal for that small case. But already for three by three matrices, which you might think, okay, that's laughably small, surely people must have figured out the best way to do that, we still don't know. So even today,
we know that you need at least 19 multiplications to multiply two 3x3 matrices, but the best algorithm we have uses 23. So there's this gap between 19 and 23 that people just haven't been able to close for years.
And the reason is that even though the matrices are very small, the space of possible algorithms, how you could multiply them is just completely immense. So computationally, there is just no hope at all to do this exhaustively. So already for 3x3, it's like open, which is crazy. And then, okay, so for 3x3, the best algorithm is using 23. That's at least better than the algorithm you get taught in high school, which would be 27. So at least there is some progress on making it better.
But then for 4x4, which is like the next size...
The best algorithm that had been known was to just apply Strassen's algorithm recursively, twice. Because Strassen is for 2x2 matrices, what you can do if you have a 4x4 matrix is consider it as a block matrix of 2x2 blocks, where each block itself is a 2x2 matrix. So you can do Strassen twice. And because Strassen needs 7 multiplications, if you do it twice, you get 7 × 7 = 49 multiplications.
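A short sketch of that two-level recursion (this is standard Strassen, not the new 48-multiplication algorithm): each 4x4 matrix is split into 2x2 blocks, the seven block products are computed with Strassen, and each block product is itself computed with Strassen, giving 7 x 7 = 49 scalar multiplications.

```python
import numpy as np

def strassen_2level(A, B):
    """Multiply 4x4 matrices with 49 scalar multiplications (sketch)."""
    def strassen(X, Y, mul):
        # Split into 2x2 blocks; `mul` multiplies the half-size blocks.
        m = X.shape[0] // 2
        X11, X12, X21, X22 = X[:m, :m], X[:m, m:], X[m:, :m], X[m:, m:]
        Y11, Y12, Y21, Y22 = Y[:m, :m], Y[:m, m:], Y[m:, :m], Y[m:, m:]
        M1 = mul(X11 + X22, Y11 + Y22)
        M2 = mul(X21 + X22, Y11)
        M3 = mul(X11, Y12 - Y22)
        M4 = mul(X22, Y21 - Y11)
        M5 = mul(X11 + X12, Y22)
        M6 = mul(X21 - X11, Y11 + Y12)
        M7 = mul(X12 - X22, Y21 + Y22)
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4, M1 - M2 + M3 + M6]])

    scalar = lambda x, y: x * y                   # 1x1 blocks: one multiplication
    level1 = lambda X, Y: strassen(X, Y, scalar)  # 2x2 matrices: 7 multiplications
    return strassen(A, B, level1)                 # 4x4 matrices: 7 * 7 = 49

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(strassen_2level(A, B), A @ B)
```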
So, yeah, this was the only known way to multiply four by four matrices fast, using 49 multiplications, just doing Strassen twice. And this has been the situation since 1969. And so...
This is where maybe our work comes in. So two years ago, we built AlphaTensor, which was a specialized reinforcement learning agent for discovering matrix multiplication algorithms. And that agent actually did find something faster, but only for Boolean matrices. So it's this very special case where you want to multiply matrices where every entry is just zero or one. And when you do the multiplication, you do everything modulo two.
So for that case, AlphaTensor found something faster. But apart from that, for the usual case of matrices that have arbitrary numbers in them, still nothing better was known than doing Strassen twice, using 49 multiplications. So we were really excited when we used AlphaEvolve in this setting.
We actually didn't even hope that it would find something better than 49 because we were trying for so long with AlphaTensor. We just ran it for completeness because we wanted to have this, you know, this table in the paper that we actually have that we tried on all the sizes up to five or six. And remarkably, it found a faster algorithm which uses 48 instead of 49 multiplications.
And so that in itself, like, yeah, when one of my teammates messaged on the channel that, oh, it seems like we have this result. I just couldn't believe it. Like, let's triple check it. But then it was indeed correct. And it actually has this one really appealing feature that, like, usually you think about multiplying matrices. You want to multiply matrices where the entries are real numbers or maybe integers.
The case of multiplying matrices where the numbers are complex is maybe a bit less common. But real matrices are just a special case of complex matrices. So if you find an algorithm that can multiply complex matrices, you can also apply it to real matrices. It's just a generalization. So what's kind of cool here is that
Let's say you care about multiplying real matrices, which is what you care about when you train neural networks, let's say, a very common use case. A priori, you might just look for an algorithm that uses real numbers. But you can say that, oh, actually, what if we look for a complex algorithm, which a priori you would think, well, that's a more difficult task because that algorithm would actually apply not only to real matrices, but also to complex matrices.
But by making the task more difficult, AlphaEvolve was actually able to find an algorithm that uses complex numbers and therefore applies to both complex matrices and real matrices. So that's the result that we were most excited to get. But then, as you were asking in your question, indeed, we were applying AlphaEvolve to other matrix sizes as well. And indeed, as you go to bigger and bigger cases, like 5x5, 6x6, the problem becomes
much, much harder very, very quickly. It's kind of an exponential with a quadratic in the exponent because the tensor, this cube that we sometimes show in the visualizations that you have to decompose, it grows quadratically with the size of the matrices that you multiply. So
For 4x4 matrices, you have to deal with a tensor of size 16 by 16 by 16. For 5x5 matrices, it's 25 by 25 by 25. So it's exploding very quickly. And of course, as you go higher, at some point your method will not scale there. But what we show is that AlphaEvolve, well, it scales further than AlphaTensor. So there is some progress in the scaling direction.
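In the tensor language being described here, an algorithm that uses R scalar multiplications is exactly a rank-R decomposition of the matrix multiplication tensor: each u_r and v_r reads off a linear combination of the entries of the two input matrices, and w_r routes the r-th product into the output. Strassen is a rank-7 decomposition for n = 2; the new result is a rank-48 decomposition, over the complex numbers, for n = 4.

```latex
\[
\mathcal{T}_n \;=\; \sum_{r=1}^{R} u_r \otimes v_r \otimes w_r,
\qquad
\mathcal{T}_n \in \mathbb{R}^{\,n^2 \times n^2 \times n^2}.
\]
```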
And then just one important point to clarify is that in all these cases, sure, we look at small cases of matrix multiplication, 2x2, 3x3, 4x4. But that doesn't mean that you can only apply these algorithms to matrices that are this small. Like as I already alluded to with Strassen's algorithm, you can apply them recursively. So if you have a big matrix, you treat it as a block matrix and you apply these algorithms that are for smaller matrices recursively.
Yeah. I mean, it's super fascinating, and it is such a fun domain to work in, and almost magical that even when you get to three by three it becomes intractable. I'm always fascinated by cases like that in mathematics. It's like, oh, we can do it for dimension one and two and three, but then when we get to four, it's just totally different, you know, it completely falls apart. But as you went higher, you also had cases where AlphaEvolve
couldn't match the current best performance, and actually found worse algorithms. So what do you attribute that to?
Yeah. So in the case of the biggest one that we show in the paper, six by six matrices, there is a very clear reason for that, which is that we tried to apply AlphaEvolve without giving it domain knowledge about the problem. We just wanted to see how good AlphaEvolve is as a general-purpose tool. So we started from scratch, and we didn't tell it about any tricks for developing matrix multiplication algorithms specifically.
The best known algorithm for 6x6 matrix multiplication uses a very specific inductive bias, which means that it only looks for algorithms that have a specific symmetry, a regularity, in them. And if you only look for algorithms that have this regularity, that's a much smaller search space.
So in that search space, you are able to scale to much larger sizes. But we just didn't try to incorporate this symmetry into our search; we just looked for algorithms of unrestricted form. And that, I think, is, at least to me, the clearest reason why we didn't match the best known solution in that case. Yeah.
Yeah, so one of the things that fascinated me most about the paper, I mean, obviously, we're very interested in abstraction and representation. And even in the example that you just gave that Strassen was learned to be used as a sort of dynamic programming formulation, so being used recursively. And
I'm thinking to myself, is that a demonstration of deep abstraction, or is it a kind of superficial one-step abstraction, where the knowledge of Strassen was in its local neighbourhood and it was composing that? And in an ideal world, what we want algorithms to do is compose abstract basis knowledge, knowledge which is as far down the stack as possible, to increase our flexibility. And...
We should just bring in how you modeled this representation. So you had these different approaches. You could directly model the solution. You could model a constructor function. So you're actually learning a function which itself constructs the solution. You could be learning a search algorithm, which is what you did with the matrix multiplication. And you also spoke about coevolution as a possibility.
So this is mind-blowing in the sense that there's some human design and intuition in how you design the optimization target. But there's this ambiguity in the ways that you can represent so many of these problems, and in how they relate to each other. So can you talk me through that? Yeah. So first of all, I want to be upfront about the fact that we ourselves don't have all the answers here. We have this tool, AlphaEvolve, and we see that, okay, it's quite general.
Not even about AlphaEvolve. So we have this general tool that we see we can apply it here and here and here. And for every problem, there are different ways we can apply this tool. But a priori, we only have some intuition what is the right level of abstraction to apply this tool on. And as you were mentioning, Tim, sometimes the best thing you can do is you directly search for the solution.
Sometimes you search for a simple constructor which constructs the solution. As an illustrative example, let's say you're looking for a solution and you think it's going to be very regular; maybe it's going to look like a fractal, that's a typical example. Then you know that to describe a fractal, you can do it with a very short piece of code. So in that case, it makes sense to look for a description of the fractal, not as
a grid of pixels, but instead as a short function that just generates that fractal. But then in other applications, maybe the solution is very different, not so regular, and maybe you want to either look for the solution directly, or look for a complex search algorithm that finds the solution. And sometimes you want to have a sequence of algorithms that gradually refine the solution, which is the co-evolution approach. And
A priori, it's not clear at all which one is going to work best when. So that is something that is definitely in kind of the future work category to build up that understanding. But one positive side of things is that Alpha Evolve is easy to set up in all the different formulations. So often in practice, you just try different things and see what works best.
Well, yeah, I can't believe you guys don't have all the answers. I don't know why I'm here now. But no, in all seriousness, Tim mentioned these sort of three modes that AlphaEvolve, you know, currently runs in, right? And it really reminded me of mathematicians, because, you know, with mathematicians
there are at least three main ways in which proofs are done, right? They're either done by deduction, construction, or enumeration. And those almost seem to be very parallel to these three methods. And I feel like there's probably some deep connection there that I'm missing. I'm curious if either of you have any thoughts on that deep connection, or not. Sure, yeah. It's an interesting observation.
Yeah, Matej, did you want to say something? Yeah, so one thing is that when you think about constructions, that is the space where AlphaEvolve is most obviously applicable. So the examples that we show in the paper are where you have open problems and you make progress on those open problems by finding better constructions. So that's where it's applicable out of the box.
But then if you think about other approaches of making progress on other types of mathematical problems, let's say the problem is not obviously about the construction. Let's say you want to prove impossibility results, like lower bounds in a sense, then you have maybe two approaches on a high level that you can take. One is that
often problems that don't look like a construction become a construction if you frame them in the right way. So for example, you have the duality theorems for linear programming, between the primal and the dual, so you can often switch the side that you're actually trying to prove and turn something that isn't a constructive problem into a constructive one. So that's kind of a cop-out answer, but it's something you can often do.
The more difficult and more general answer is that sometimes you're actually looking for a proof rather than a construction. Now, you can think of a proof also as an algorithm: it's a sequence of steps that you need to execute to prove a statement. So this is
still within the space of algorithm discovery and something that we can be thinking of doing. But there is one technical hurdle that we have maybe some initial signs of overcoming, but it's not something that we have done in this paper, which is that if you're looking for an algorithm that is a proof,
then the proof is, at the end of the day, either correct or not. So the reward is binary, it's zero or one. For now, we have focused on problems where you can make gradual progress, where you can gradually improve the score, not just switch from zero to one in a single step, but gradually become better and better until you improve on the best known construction.
But of course, even as humans, when we write proofs, we face the same issue, like the proof is only correct once it's actually done. But as humans, as we are writing, working on the proof, we will have some intuitions of, okay, have we actually made some progress? Like, have we actually built some understanding about the problem? If so, then that seems likely that this will be a part of the eventual proof.
So we have been exploring the possibility of using scores which are not hard scores but soft scores. Maybe a language model can itself provide feedback: does it look like we have made progress towards solving the problem? So we do see the path to attacking these binary problems, let's say searching for proofs, but it's not something that we have already done. It's just something we see as a possibility of doing in the future with this type of technique.
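A minimal sketch of that soft-score idea, with `llm` as a hypothetical text-in, text-out callable; this is not claimed to be in the paper, it only illustrates the direction being described.

```python
def soft_progress_score(llm, problem_statement, partial_proof):
    """Grade partial progress towards a proof with an LLM (sketch).

    Replaces the binary correct/incorrect signal of a proof checker with a
    graded judgment, so that evolution has a slope to climb.
    """
    prompt = (
        f"Problem:\n{problem_statement}\n\n"
        f"Partial proof attempt:\n{partial_proof}\n\n"
        "On a scale from 0.0 to 1.0, how much genuine progress does this "
        "attempt make towards a complete proof? Reply with a single number."
    )
    try:
        return float(llm(prompt).strip())
    except ValueError:
        return 0.0   # unparseable grades count as no progress
```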
Yeah, I mean, there's so many surprising things in this paper, and I know Tim and I want to ask about more of them. But I'm kind of curious what each of you found, after you did this work and had the paper, or at least the research, done: which result was most surprising to you? And I'll start with Alex. What did you walk away from this work thinking, wow, I wasn't actually expecting that?
I think it keeps amazing me how
like how much progress you can make with this sort of system. When we started with FunSearch a few years ago, you would go to a chatbot interface somewhere, right? And you would ask it to solve an open problem for you, and it would basically not give you anything, right? And still today, you can try the same thing, and, okay, it will make much more sense, it will think for a few minutes, it will give you a lot of reasonable approaches. But generally it's not people's experience that you ask a chatbot to solve an open problem and it will just do that, right?
And then it's kind of amazing that by using the same tools, right, the same LLMs, in these iterative evolution loops, you can get so much more out of them. Every time we solve something new, it's really exciting and surprising, in a sense. And Matej, what most surprised you? I think what was really new to me is the generality of the approach. And I don't just mean generality across scientific open problems.
It's not my experience from my admittedly short research career that you build some tool for scientific purposes and then out of the box, you can do it, apply it to real world challenges and have so much impact. Usually there is an entire body of work, research work that needs to happen to translate a scientific technology into something that's actually useful in the real world. Usually there are so many challenges you have to tackle.
And here there is a tool which out of the box at the same time is able to make new discoveries on mathematical and scientific problems. And at the same time is able to discover algorithms that you can directly deploy into Google's critical compute stack. So that's something I certainly hadn't experienced before and maybe wasn't expecting to be honest.
Yeah. I don't know if you're familiar with the ARC Challenge and a guy called Ryan Greenblatt, he did this famous approach where he just sampled a language model like 30,000 times to generate programs. We're in this really interesting space, as you spoke about in the paper, where we have an evaluator, which means we can sidestep hallucinations.
Right, so isn't it fascinating that these nuggets of gold are in there in the search space? And of course, most of us just use language models in a single shot, right, just doing greedy sampling, and now we can do these very sophisticated search routines. But
I want to get to, well, actually maybe like a quick sub question is what you folks have done is you've mixed so many interesting paradigms together. You know, we're talking about meta learning and evolution and diversity preservation and program learning and library learning and whatnot. But
if you could explain simply: how is that different from just sampling a language model? Let's say you could sample a language model a billion times. Is AlphaEvolve in a different category to that? Yeah.
Yeah, for sure. So maybe just one personal observation on what you just said. That is indeed the right way of thinking about it. If you just keep asking a language model in a chat window repeatedly, you'll get completely the wrong idea about what is the capability if you actually scale things up. When we started working on this type of evolutionary approaches, initially we were like,
pretty skeptical about what this was going to be able to do, because indeed we were just trying it in a chat window, asking it to write some simple algorithms, and it wasn't doing that well initially. The magic really happens when you scale things up.
But there are different ways in which you can scale things up. One is that, indeed, you just keep asking the same question repeatedly, and okay, sure, there will be some nuggets, but they'll only be nuggets; they will not be the full solution. So it is really important that you find those good nuggets and then iteratively build on top of them in subsequent iterations. And that's what you get through these evolutionary algorithms.
And just to speak very quantitatively, we do have a comparison in the ablations in the paper where we try to get rid of evolution. And I mean, as you might expect, that works much, much worse. Yeah.
I think maybe a good test case for building intuitions about that is the cap set example from the original FunSearch paper. There we used a very simple and small language model, and we didn't even provide context about the problem, because we wouldn't ever expect the language model to actually think through the problem and solve it by thinking hard and being very smart about it. What we were hoping to get from the language model was just trying a few things, a lot of things. But the problem is that
the final solution was not that long, maybe 50 lines of Python, but it's very, very unlikely that you'll stumble upon it by accident if you don't even know the problem, right? Like, you're just trying Python things; how would you generate the particular 50-line Python program that works? So there it was completely essential to hill-climb: you have to improve gradually, see what works, and try to modify it a little bit in the neighborhood.
Yeah. Can you actually talk a little bit more about that, the hill climbing? Because you've found clever ways in which to build in hill climbing into problems that at least on their face seem...
so discrete that hill climbing would not be possible. So I'm just curious if you could expand a bit on the general idea behind how one takes a discrete problem, like looking at different programs, you know, which are fragile and digital, and somehow incorporates hill climbing into that process. Yeah, I guess for many problems you do have a natural hill-climbing reward. For example,
when we do this gradient-based method for finding matrix multiplication algorithms, like, the natural reward will be the loss, right? Like, how close you are to the solution. But...
Unfortunately, that doesn't necessarily work very well. And what we've found again and again is that you have to be somewhat creative about what kind of auxiliary rewards or auxiliary signals you can come up with. In particular, for the matrix multiplication case, what was very useful for us was to realize that if you have a curriculum of matrix sizes, then it's probably going to be somewhat easier to solve the small ones, even with simple gradient-based methods. And then
you can hill-climb on this probability of solving, right? Like, let's say you run your thing 10 times and count how many times you actually get the thing you're looking for.
And then that can be your signal. And because you have a curriculum, you can say, okay, first I want to solve two by two; then, once I'm solid on that, I want to continue evolving by trying to solve three by three. So that was really helpful in that particular case. But in general, you just try things and you build intuitions about how to approach coming up with these auxiliary rewards. It's very problem-specific. Yeah.
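A sketch of that curriculum signal; `run_once_at_size` is a hypothetical callable that runs the evolved search algorithm once on a given matrix size and reports whether it found a valid algorithm.

```python
def success_rate(run_once, trials=10):
    """Fraction of runs that succeed: a hill-climbable signal for a discrete
    goal (run the thing 10 times, count the successes)."""
    return sum(bool(run_once()) for _ in range(trials)) / trials

def curriculum_score(run_once_at_size):
    """Credit easier matrix sizes first: a candidate must be solid on 2x2
    before its progress on 3x3 starts to count (sketch)."""
    score = 0.0
    for size in (2, 3, 4):
        rate = success_rate(lambda: run_once_at_size(size))
        score += rate
        if rate < 1.0:   # don't advance the curriculum until this size is reliable
            break
    return score
```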
And maybe just to add one orthogonal point: often you're trying to solve one particular problem, and maybe it comes with an associated reward or maybe it doesn't, but it's nevertheless useful to also optimize
other things simultaneously. So let's say you want to find a matrix multiplication of a particular size. You also want to find algorithms for other sizes, because if you do that, that allows you to explore the space of algorithms in a more broad way. And often you can translate ideas that you develop for one matrix size later on
to the matrix size that you actually care about. But if you only optimized for the size you care about from the start, maybe that idea wouldn't have been useful initially; you first had to develop it and refine it a bit further on a related problem. So the overall point I'm trying to make is that it often makes sense to introduce similar other tasks and try to improve on them, even if you don't intrinsically care about them.
Could we touch on the benefits of program synthesis in general? So we interviewed Kevin Ellis recently. He's the guy who invented DreamCoder, fascinating guy. He used to work under Josh Tenenbaum. And the way he described it, coming from a cognitive science point of view, is that a lot of program induction is about explanation. It's about intelligibility. It's about legibility.
And there was this incredible example when you were talking about doing the scheduling of jobs in Google's data center. And you've got this kind of matching problem and whatnot. And I think you said there was some previous experiments where you'd used like an inscrutable reinforcement learning model or something like that. And it wasn't debuggable. It wasn't intelligible. And now you've got this beautiful three lines. But even then, there's the question of...
Is it legible? Have you found things that are not legible? Remember, there was that move 37 in AlphaGo, and that was a famous example of a discovery, which was a little bit weird. And perhaps we wouldn't have discovered it, but maybe we understood it. Maybe we could build theory around it. But when you look at some of these discoveries as well, are you finding deep, abstract principles that you can derive from those discoveries? I mean, talk me through all of that.
Yeah. So you can span the broad spectrum, actually. With AlphaEvolve, you sometimes discover algorithms that are really simple, so simple that a human can verify that they're going to be correct on all inputs. And indeed, as you alluded to, they're so simple that you're happy to submit them to production almost immediately, no further checks needed. And it's in a completely different league compared to trying to deploy a neural network, where you have to think about things like
retraining it and hosting it and the resources of running inference and all those kinds of things. So indeed, you can be in this very simple regime
For scientific discovery, you can also be in the regime where you explicitly look for programs that are interpretable. You can do this by setting up the skeleton that you ask Alpha Evolve to fill in in such a way that, by design, you expect it to be fairly simple. Maybe the clearest example is already in the FunSearch paper, where we looked for large cap sets, a specific mathematical object, and it found a function that we could look at and inspect. We noticed that, oh, this function is using the number four in an interesting way: whenever you look at the index in the for loop, you take the index modulo four, and when you index into the array, you're indexing at positions i, i plus four, and i plus eight. Okay, what's going on? Just by inspecting the code, we were able to develop what you can think of as a mathematical insight, or a mathematical hypothesis.
And that hypothesis turned out to be really crucial for then improving the results: we took that insight from the code, incorporated it into the next run, and got much better results. So indeed, this can happen.
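To make that concrete, here is a purely hypothetical illustration, not the actual FunSearch program, of the kind of modulo-four pattern being described, in the style of the priority functions FunSearch evolves for greedy cap set construction:

```python
def priority(el: tuple, n: int) -> float:
    """Hypothetical priority function: higher-priority vectors get added to
    the cap set first during greedy construction. The structure a human
    might notice: everything is read through a period-4 lens
    (i % 4, and offsets i, i + 4, i + 8)."""
    score = 0.0
    for i in range(n):
        if i % 4 == 0 and i + 8 < n:
            # Indexing at i, i + 4, i + 8 -- the suspicious regularity
            # that suggested a mathematical hypothesis worth formalizing.
            score += 2.0 * el[i] + 1.0 * el[i + 4] - 0.5 * el[i + 8]
    return score
```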
But in some applications, maybe you don't care as much about interpretability, and then Alpha Evolve can develop very complex algorithms, or a sequence of algorithms, where you will perhaps not have a holistic understanding of how exactly the complex search heuristic works; what you care about is the final result, and that it is as good as possible. So you can span this whole spectrum.
And maybe Alex could also mention the work, not from our group, where another team applied FunSearch to cognitive science to discover exactly these interpretable programs of behavior. That is actually a pretty cool application. Yeah, and if you go back to the matrix multiplication example, there we built this gradient-based machine learning pipeline for looking for those algorithms. And there, if you look at the code changes proposed by Alpha Evolve, you would see maybe two types of changes, two types of reactions to the changes. One is:
yeah, that makes sense. And it's usually the case with machine learning: when you read machine learning papers, you go, oh yeah, this makes sense, that's a good idea, I would maybe have done it myself. And usually the problem is not coming up with the idea; ideas are sort of cheap. The problem is coming up with the idea that actually works. There are so many things that make sense, and then only, like, five of them actually work. So it gives you ideas that you kind of understand and can relate to, but it's hard to know a priori which of those will actually work. And then the other type of idea, or code change, you will see is things you would probably not even have tried, just because they're overcomplicated. Sometimes that's for a good reason, sometimes for a bad reason. For example, in the matrix multiplication case, you would see the following kind of thing.
We have this quantization loss there because we want the solutions to be integers, or maybe fractions in a specified range, such that we can verify in exact arithmetic that they're correct. So we have this quantization loss, which drives the solution towards that set of integers, for example.
And then, you know, normally we as humans would only try, I don't know, tuning the weight of that quantization term, or in the worst case the schedule of the weight: maybe you have less of it in the beginning and more of it at the end. But what Alpha Evolve did was produce a whole time-evolving shape of the quantization loss. Which, again, kind of makes sense: you wouldn't say it's not going to work, but you also wouldn't look at it and say, oh, this is amazing, this is definitely going to work. The thing is, you wouldn't even try it as a human, because it's so complicated that you would never think about tuning such a complicated function, one that changes shape over time with iterations.
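As a rough sketch of the difference being described (our illustration; the actual evolved loss is not reproduced here), compare a standard scheduled-weight quantization penalty with a hypothetical one whose shape itself changes over training:

```python
def quantization_penalty(x: float) -> float:
    """Distance from x to the nearest integer -- zero exactly on integers."""
    return abs(x - round(x))

def human_loss(x: float, t: float) -> float:
    """What a human would typically try: a fixed penalty shape with a
    scheduled weight, e.g. ramping the quantization pressure up linearly
    over training time t in [0, 1]."""
    return t * quantization_penalty(x)

def evolved_style_loss(x: float, t: float) -> float:
    """Hypothetical stand-in for a time-evolving *shape*: early in training
    the penalty is soft and broad (tolerating non-integer values), later it
    becomes sharp around integers. The exponent itself depends on t, so the
    loss landscape, not just its magnitude, changes with iterations."""
    d = quantization_penalty(x)
    sharpness = 0.5 + 4.0 * t           # the shape parameter grows over time
    return (1.0 + t) * d ** sharpness   # both shape and scale evolve
```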
That's pretty fascinating. It's a bit like how human players have been able to glean new insights from AlphaZero in chess and whatnot. Alex, you also mentioned how surprised you were at the broad applicability of the technique, and I know you have some prior work in robotics. I'm curious about cases where it's more difficult to assess the code. For robots, you might be able to do some training in a virtual environment, but at some point you've got to just, I don't know, throw it out in the woods and see if it's able to navigate to the other side. How do you see bridging the gap between
easily automated validation and more complex real-world scenarios while still being able to apply Alpha Evolve? Yeah, I was thinking that maybe one direction that makes sense, and I've seen people doing this, is trying to convert fuzzy reward functions into code. So for example, maybe you have
a reward function defined by an image of the final state you want to achieve, or a reward function defined by a natural-language description of what you want to do, and your task is to convert that into a Python function which actually scores it. Maybe it's not that hard to get a binary reward here for the RL to actually train the robot, because you can ask a vision-language model to verify it. But it's very hard to train against binary rewards. So I've seen people trying to use systems like Alpha Evolve to find a piece of Python code which provides an auxiliary, shaped reward, basically, to drive you towards that binary reward and make the learning faster. I think that makes a lot of sense as a direction. And yeah, I don't know if that answers your question.
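A minimal sketch of that shaping idea (our own toy example; `vlm_says_success` and the observation format are hypothetical stand-ins): the sparse binary reward stays the ground truth, while a small evolved Python function densifies it:

```python
def vlm_says_success(observation) -> bool:
    """Hypothetical stand-in: ask a vision-language model whether the final
    state matches the goal description. Trustworthy but binary and sparse."""
    return observation.get("at_goal", False)  # placeholder for the VLM call

def shaped_reward(observation, goal_xy) -> float:
    """The kind of dense auxiliary reward a system like Alpha Evolve could
    evolve as code: cheap and inspectable, only there to guide learning
    towards states where the binary reward fires."""
    x, y = observation["position"]
    gx, gy = goal_xy
    return -((x - gx) ** 2 + (y - gy) ** 2) ** 0.5  # closer to the goal is better

def total_reward(observation, goal_xy) -> float:
    # Sparse ground-truth term plus the evolved dense shaping term.
    return 10.0 * vlm_says_success(observation) + 0.1 * shaped_reward(observation, goal_xy)
```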
Yeah, I mean, it does, but I'm also thinking in terms of the efficiency of the procedure. For example, you mentioned in the paper that many of the programs it generates just fail immediately: they crash, they're not valid programs. Those are easy to filter out if you have an evaluator. But at the opposite extreme, even once you've got valid programs, the real-world test case might involve performing experiments and clinical trials, or a robot trying to navigate a complex field and maybe getting damaged. So there's this huge gap between cheap automatic verification and expensive real-world evaluation, and I'm wondering how we can bridge that in some sensible way for systems like Alpha Evolve. I'm curious if you have any thoughts on that.
Yeah, we are thinking about those things. One thing Matej alluded to before is finding proofs, where you have binary rewards, and then you can ask LLMs for feedback to again get some sort of shaped reward to drive you towards the solution. That is a direction we're also thinking about, because indeed, in many cases you do have this ladder of more and more expensive evaluations. For example, if you want to do something simulation-based, maybe for robotics, maybe for biology or physics, you can use a simulation-based reward, but ultimately you want to try it in the real world: maybe on the robot, maybe you go to the lab, whatever. And at every stage of this evaluation ladder, which gets more and more expensive, you don't want to just try all the things that worked on the previous stage; you want some sort of prioritization mechanism, which could be LLM feedback, could be something else, could be Elo scores. Those are the things we're thinking about. And maybe one project which also recently came out of Google and is very good at this is the AI Co-Scientist. They basically attack the same problem: how do you score ideas
that are very hard to evaluate with hard feedback? And I think they're doing a really good job at that. Excellent. Maybe just one thing to add: technologically, I think Alpha Evolve is already well set up for this. We have this idea of evaluation cascades, where you evaluate a large number of programs very quickly and then spend longer evaluating a smaller number of programs. You can expand this cascade all the way to the regime where maybe you can only afford to evaluate ten things
and you actually have to go into the real world and run some real-world experiments. But in principle, the mechanism is there. And this is what you have to do anyway if you try to solve these problems manually; it's what actual researchers do. You have a finite budget, so first you filter the ideas using cheaper methods, and then you only have the budget to try the most promising ideas on the most expensive evaluation.
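A minimal sketch of such a cascade (our illustration of the general idea, not Alpha Evolve's internals): each stage is more expensive than the last, so each stage only sees the survivors of the previous one:

```python
import random

def run_cascade(programs, stages):
    """stages: list of (evaluate, keep_top) pairs, ordered cheap -> expensive.
    Each stage scores the current survivors and only keep_top of them
    advance, so the costliest evaluators see very few candidates."""
    survivors = list(programs)
    for evaluate, keep_top in stages:
        survivors.sort(key=evaluate, reverse=True)
        survivors = survivors[:keep_top]
    return survivors

# Placeholder evaluators standing in for real checks of increasing cost.
cheap_check = lambda p: random.random()  # e.g. does it even run?
simulation = lambda p: random.random()   # e.g. a simulated benchmark
real_world = lambda p: random.random()   # e.g. a physical experiment

finalists = run_cascade(
    range(10_000),
    [(cheap_check, 1_000), (simulation, 100), (real_world, 10)],
)
```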
I don't know if you want to add anything to that. Yeah, exactly, that's the same way those things are done. So I'm curious about LLMs. I mean, first of all, let me give you my heartfelt congratulations on Gemini Pro.
It is ridiculous. It's so good. Honestly, I have no words to describe how good it is. But we're in an interesting time, right? Because in Alpha Evolve you had an ensemble of, I think, Gemini 2.0 Flash and 2.0 Pro, and it just raises so many questions for me. What would happen if you did it with 2.5? Presumably you have even better models internally that we don't know about yet. How much uplift is there? And if you think about it, there's this Pareto curve of models, and Google's models are on that Pareto frontier, trading off cost, latency, and performance. If you were just using one model, there might be some optimal place on that curve for Alpha Evolve. Maybe you should have a small model but sample it loads of times, or a big fat model but not sample it as many times; or, with an ensemble, maybe there's some perfect distribution over those models. But we really seem to have unlocked something, especially with the reasoning versions of the models, which maybe wasn't there until relatively recently. Could you talk to that? Yeah.
So one thing I can say specifically, which is particularly exciting to me about Alpha Evolve, is that it does get this uplift from improving the base language model. This was not necessarily the case in FunSearch, but in Alpha Evolve we do see it, and we actually have quantitative confirmation in the ablations: if you also include 2.0 Pro in the ensemble, then you get better results compared to just using the Flash model. So it's very clear that we are
leveraging the frontier capabilities of these models. And of course we cannot guarantee what will happen in the future, but at least for now we are definitely riding the wave. So indeed, we are very keen to see how the base models improve and what kind of uplift the method will get together with that.
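As a toy illustration of that ensemble trade-off (our sketch; the mixing weights are hypothetical, not from the paper): a generation loop might draw most proposals from the fast model and reserve the stronger model for a fraction of samples:

```python
import random

# Hypothetical mixing weights: mostly fast, cheap proposals, plus a
# minority of slower, higher-quality ones from the stronger model.
ENSEMBLE = [("gemini-2.0-flash", 0.8), ("gemini-2.0-pro", 0.2)]

def pick_model():
    models, weights = zip(*ENSEMBLE)
    return random.choices(models, weights=weights, k=1)[0]

def propose_change(prompt, generate):
    """generate(model_name, prompt) -> a proposed code diff; a stand-in for
    the real LLM call. The ensemble decides per sample which model to ask."""
    return generate(pick_model(), prompt)
```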
The other related point I would quickly mention is that Alpha Evolve also offers an opportunity, which we haven't taken yet, but looking forward: it's a system that is able to enhance the capability of the base model. The capability of the base model sits at some level, but this system makes it even better by orchestrating this test-time compute pipeline, and actually so much better that you can make a new scientific discovery. So it raises the natural question: can we somehow distill this improved capability back into the base model? That's something you would get if you were to close the reinforcement learning loop. It's not something we have done with Alpha Evolve, but that possibility is clearly on the table.
So, you just mentioned improving the base model, and you also mentioned in the paper using Alpha Evolve to improve the infrastructure of Alpha Evolve and the base models themselves. So you've finally managed to close the recursive self-improvement loop. I don't know, that's going to trigger some folks. Any thoughts on that? What would Schmidhuber say?
At this point, we want to be very specific about what we have done. We have found a way to speed up the training of the next version of Gemini by 1%, and we've been able to unstrand resources in the Borg data centers. So currently, if you think about the feedback loop, it's maybe on the order of months: when you speed up the training of the next version of Gemini, it will take some time for that to actually arrive. But indeed, we do see these as steps in the direction you described. Alex, any thoughts on the recursive self-improvement of Alpha Evolve? Yeah, I don't think I have anything to add. As Matej said, it's interesting to see how it pans out, and we are seeing some signs of it, you know, helping the Gemini training run. So yeah, that's exciting.
On that thread, though: we want to have more autonomy. Right now this is very much a didactic exchange with the human supervisor: we select the problems, we design the evaluation functions, we inspect the solutions, and so on. I just wonder what the next step of autonomy would look like. Maybe the system could come up with its own evaluation function, and maybe it could go several steps further. What would that look like?
To be honest, from my perspective, automating some things is cool and exciting, but at the same time I would even lean towards less automation. The thing that makes Alpha Evolve so cool and powerful is this back and forth between humans and machines: the humans ask questions, the system gives you some form of answer, and then you improve your intuition, you improve your question-asking ability, and you ask more questions.
So we are thinking a lot about providing access to Alpha Evolve to academics as trusted testers and seeing what they can do with it. And while trying to build that, trying to build the UI, we were thinking a lot about not just implementing what we have as a website, but about what the next level of human-AI interaction would look like. What can humans do to intervene in the process? If humans want to supervise it, maybe they want to comment on ideas, inject more ideas, things like that. We're exploring that a lot, and I think it's very exciting to see what can be done in this kind of symbiosis space. Yeah.
So, final question. You said in the paper, in section 2.3 I think, that it's on the order of 100 compute hours to evaluate a new solution. But just before everyone at home rushes to re-implement Alpha Evolve, could you give us some examples, maybe on the matrix multiplication problem? How many compute hours did it run for? How much does it realistically cost to do this? Yeah.
So one nice feature of Alpha Evolve is that it is really elastic, so it can match the difficulty of your problem. If you ask it to solve a problem that's actually not that difficult, maybe it's still an open problem but no one has really worked on it, then maybe even a chatbot would almost solve it, or solve it outright. In that case, Alpha Evolve will also give you the answer basically immediately, and it won't cost a lot at all.
But if you ask about a very difficult problem, maybe a decades-long open scientific problem, then you do expect it to be difficult: you will need to spend more time playing with different ideas and iteratively building on top of them. And the nice feature of Alpha Evolve is that it is able to sustain this scaling: as you keep running it for longer, it finds better and better ideas. I know it maybe sounds trivial, but I don't think it's actually easy to build systems that can sustain this continual improvement without plateauing at some point. In this case, with Alpha Evolve, you see this elasticity stretching all the way to making new scientific discoveries. So I think that's a nice feature of the system.
And so, to answer your question concretely: it depends on the difficulty of the problem. If the problem is difficult, then you do expect to investigate more ideas and spend more compute; if it's easier, you can have the answer very quickly. If you think about the problems we actually presented in the paper, then even within matrix multiplication, some matrix sizes are much easier than others, so the compute would be vastly different. The same goes for the open problems in maths: some of them are fairly easy, some are very difficult. So there is no single answer that would position you on this spectrum without knowing the problem. And unfortunately, as is the case with open problems, often you don't actually know a priori how difficult one is, so you can't even predict ahead of time. And sometimes you try and you don't find anything better; that can also happen.
Awesome. Well, guys, it's been such an honor having you both on MLST. Thank you so much for joining us today. Thank you. Thank you for inviting us. Yeah. Awesome.