Noam Shazeer and Jack Rae really need no introduction. The two are at the forefront of Google's Gemini LLM efforts and have been involved in some of the most important discoveries in AI in the last decade. Noam is one of the co-inventors of the Transformer and mixture of experts; Jack has been a key part of many DeepMind breakthroughs. It's a real privilege of the job to get to sit with these two and ask them literally every top-of-mind question in AI today. We talked about how far test time compute will get us and the spaces where it will and won't work.
We talked about how the infrastructure needs will be different for test time compute versus the large pre-training paradigm. We also hit on the impressive pace at which open source models have caught up with closed source peers, and their reactions to DeepSeek. We talked about their reception and reaction to Ilya saying that this test time compute paradigm won't get us all the way to AGI,
as well as Yann LeCun saying this current generation of models can't actually have any novel thoughts. We talked about what it actually looks like to do cutting-edge AI research today and what their day-to-days look like, as well as the future model milestones that actually matter to them. And then we also got Noam's reflections on Character, as well as both of their responses to what, you know, AGI means for the role of humanity. I think folks are going to love this. It was really just a pleasure to get to speak with both Noam and Jack. Without further ado, here they are.
Well, Noam and Jack, thanks so much for coming on the podcast. Oh, thank you. Are we sitting in the very office where the transformer was invented?
No, this is a new building. We were in 1965 Charleston, I think. Probably about half a mile. Half a mile. So it's in the air around here. Pretty close. Pretty close. Well, many things to dive into today. I mean, obviously, I want to start with some of the latest Gemini 2.0 models and obviously all the work you've been doing around test time compute and Gemini 2.0 Flash Thinking. I guess just at the highest level for our listeners, how do you characterize where these models work today and where they don't work as well? And as you were kind of experimenting with them, what surprised you most about those results?
One surprising thing is that when we started the particular concerted effort to build a lot of research into test time compute into Gemini, and then think about shipping it, we were really focused on starting out with reasoning tasks. So math and code were big areas of focus.
And it wasn't really clear, you know, whilst we're kind of sprinting in that domain, we obviously want to broaden it naturally over time, but it wasn't really clear how that would work. Would there be any sense of generalization? Would thinking be useful beyond those reasoning tasks if we're just concentrating on those as researchers? And I think it was pretty fun to see one of the early models that had been trained with thinking, but had then also undergone some training to just generally have a nice style, be a nice model to talk with, trying to match the style of Gemini Flash. It was actually very fun seeing thinking interact with and improve creative tasks as well. You could ask the model to compose an essay on a particular topic, and the thought content was very fun to read. It would go through various different ideas, then go through revisions of the idea, or things that it should cut, and that was kind of fun. And then also the output felt really nice. So that was one thing that kind of surprised me.
Any surprise for you, Noam? Well, yeah, I mean, in general, like I'm all for like generality, like let's train something that's great at everything. It is important, you know, and I was skeptical at first of like, okay, this intense focus on things like math. But it is very important to have good benchmarks that, you know, are going to encourage you to...
you know, to be able to reason about the difficult tasks. Because
a lot of things will drop perplexity, like add more parameters to the model and memorize more. So it's nice to have the evals that can distinguish better some of the more difficult problems. I mean, what evals are even meaningful to you at this point? I mean, obviously, I feel like people are trying to hill climb the same set of evals that feel increasingly less relevant to day-to-day work. What do you guys do when you're testing these models? How do you vibe check them?
It feels like we keep landing on an eval and we're like, oh, actually, we overlooked this eval, even if it's in math. It's like, okay, we've done a bunch of math evals, but maybe like
I don't know, Putnam (answers only), AIME, they're still considered challenging. And then it's like, done, okay, they're completely saturated and we really don't care and they're small. And we almost think, why did we ever even work on them? And it's easy to forget. That seems so easy now, but we thought it was hard six months ago. Yeah, a couple of months ago, it was considered really hard, maybe too hard for the model. And then it just snaps to being trivial. So right now I do feel like
There's always been a lot of concerted effort within DeepMind and within Google as a whole to develop useful evals. But it's kind of very nice to see also this is a shared responsibility
across like many different AI labs, and even Scale has started really stepping up and developing very challenging evals. I think calling it Humanity's Last Exam was dangerous, you know; if every six months we think it wasn't too hard, it's a dangerous title. The really, really last exam. Yeah, exactly. V73 last exam. It's very, very challenging in general because, you know, evals get leaked:
once people start talking about the evals, then there's all this text out there about the evals and they're no good anymore because everyone knows the problems and all the models will know the problems unless you're very, very careful. So I think there's still a lot of work that
goes into having evals that are private. Are there a set of milestones that are meaningful to you? Like, hey, when Gemini 3.0 can do X, that's a really exciting milestone, whether it's an eval or just something that you've tried with these models and they can't quite do yet.
I'd say when Gemini 3.0 writes Gemini 4.0. Or I should say Gemini X writes Gemini X plus one. Yeah, I think these reinforcement loops are probably the most important thing to pay attention to. And there are several...
reinforcement loops going on. The one I just mentioned is probably the most important one that we can actually use the AI we're building as a tool to make ourselves more productive at building AI, but then there are other reinforcement loops around data flywheels, like you get a
You have people use these models and provide feedback and make them better at the things that people care about. I think we'll see a huge acceleration from that. And then there's just the global excitement and funding flywheel, which I think also seems to be...
kicking up in the last few years. Yes, certainly so. To your point, I guess, of having you kind of armed with like a thousand AI engineers by your side, making you even more productive. Like, where are we in, you know, do you have the equivalent of a 0.1 of an FTE today of using some of these Gemini models alongside your research? Yeah, I think one benefit we have at Google is like, we work in a very like structured monorepo. So, and we have a lot of like amazing tooling around like
contributing to the code base already. So there are lots of angles where AI is being pulled in as tooling for our own development. Like I think Jeff quoted the statistic, I don't know the exact figure, but the number of what we could call pull requests that have... It's like 25%, right? Yeah, like useful AI bug fixes or code reviews attached. That just
gets pulled in one day, and I notice it there and I'm like, oh, that's cool, I can now apply a lot of fixes that it already spots. But you know, that's just one element where we're already pulling in AI towards our own coding development. I think we're incredibly excited about agentic coding; it's definitely very important, and trying to get the model to be able to tackle more open-ended and difficult tasks is
definitely something we're very excited about. And I think it's just, in some ways, a lot easier to orchestrate when we have a very defined way of
defining libraries, we have these build rules and things, and it's just like everything gels together in terms of the whole code base very well. So I can imagine it's going to be, as progress continues, it's going to be a very discrete moment where suddenly lots of libraries can be very quickly iterated on within
our code base. Yeah. And you've got your AI engineers proposing experiments to you. But what's good to try? It seems like these models work super well in easily verifiable domains, and coding and math have obviously been among them. Like...
How do you think about how, for some of these less easily verifiable domains, these models may end up scaling and being useful? I mean, they're getting better at that stuff too, but yeah, it's definitely harder. I guess in those domains, we're going to...
need either better ways of verifying or more human feedback loops. Yeah, I think what's good is that we're seeing, even with the Gemini model series, they're able to follow much more
abstract instructions. So being able to try and provide a reward signal over a qualitative piece of work, which maybe if a human were going to try and give a feedback signal, there would be quite a broad set of rubrics or grading criteria, or maybe there is even more like a simplistic sense of what is good style, what is interesting. And so I think
part of the problem is really training models to take in a very broad set of criteria and then apply a reward signal. Once we have the reward signal, we can train with reinforcement learning against it. And I think, yeah, we're already seeing that it kind of
makes sense. It's not like an abstract thing anymore. It seemed like a very abstract thing maybe a year ago or two years ago. I'm curious, a year or two ago, did you expect that to work? Or did you think this whole path of research was really more toward these more easily verifiable domains? I feel like you expect it to work one day and then it feels like there's a very complicated stack of things that we feel like we need to solve to get there. And then...
Usually the case, it's like, oh, it turns out there was a much simpler path. That's how I feel about it anyway. Yeah, usually there are good surprises, you know, more good surprises than bad surprises. You've released these models out in the wild. What are some of your favorite kind of like ways you've seen people using them? And what would you like to see more people like trying and building with these things? Actually, there's just like an update today happening in the Gemini app where we've
been putting in a much stronger model, and it's being integrated with basically the full suite of Gemini app tools. So it's like all of the apps that the Gemini app supports, from things like Maps to search integration, and now it has very long context. All these things, it's kind of like a fully featured Gemini Thinking release in the app. And it is actually, you know, a very enjoyable experience. That's what I think
So far, people, even for their day-to-day stuff, it wasn't clear to me whether they would like to have to pay a bit more latency to get the model to think about stuff before responding. It does seem to be the case that if people are going to pull their phone out of their pocket and then type something in, that actually what we thought was a very long time, maybe a couple of seconds, is a very small price to pay for something that they feel like is a better quality answer. And also, maybe they can sometimes look at the thoughts and inspect them. Yeah.
I was kind of showing this to my mom like a couple of days ago. Like moms are the ultimate test of whether like something has broken the barrier from like the Twitter sphere to like the real world. Yeah, like the vibe, the mom vibe check is a big deal. Yeah, yeah. And yeah, I guess she asked a lot of like what I would consider very generic questions to ask a mom.
her model, like what is the meaning of life? Oh, meaning of life. Yeah. She went for what is the meaning of life, and she really sat and read it for a long time. And then she read the thoughts as well. And then she kind of contemplated it, and she seemed to very much appreciate the presence of the thoughts as to how to even go about such an open-ended question.
So more folks building philosophical conversations with these models. Well, I mean, one thing I think is super impressive about them is the multimodal capabilities, but it still feels like those are very underexplored from an application perspective.
Not sure totally why that is, but... Yeah, I think right now, in some ways we're quite modest about the multimodal capabilities of Gemini. I feel like the model has always been incredibly strong at image input. Image input plus thinking is actually remarkably good, I would say.
I see a lot of people kind of like red teaming the model on things like on X and trying like difficult or challenging like images, visual reasoning problems. I think that is working pretty well. There is also like at some point, some of those are kind of toy evals, but
pulling in multimodal with agentic tasks is super interesting as well. So we launched Mariner last December, which was an agent which uses a browser and things. That has a lot of multimodal aspects all built in. It was super important that we could get the model to be incredibly strong at
not only scanning a screen, but really understanding it and knowing how to act on many, many different types of websites. So pairing that kind of capability, agentic, quite open-ended, maybe you really need visual understanding of potentially messy scenes with thinking is something that I think I'm feeling very excited about.
Yeah, that's one. I mean, it made sense to start with text. A picture's worth a thousand words, but it's a million pixels, so the text is still a thousand times more information dense, just to do math on cliches. But
And then there's also a lot more just in-domain training data for text. We have so many examples of text which actually kind of represents the way humans receive and produce information through language. But we have a lot fewer examples for something like, say, image generation because you don't have...
that kind of volume of people generating images. So things are a little more challenging, but, as Jack said, great stuff is happening. Yeah. When you mentioned Project Mariner, I'm curious, like, you know, I think you both talked before about how, to get
agents more widely used, you kind of need to solve both complexity of reasoning and also reliability. How would you characterize where we are today in terms of applying Gemini models toward these problems, and what is the actual path to getting better at a bunch of these things? I mean, there are a lot of answers. One is just make the model smarter. That will always help. And, you know, most likely
you need very general solutions for these control problems. Just like you need general solutions for the intelligence problems because people are going to use these things in so many unforeseen ways. You can't anticipate it. The users are...
better than the developers at figuring out what the use cases are. No one envisioned when they invented the internet what the internet was going to be for. No one envisioned when they invented computers what computers were going to be for. And then, you know,
AI is getting so, so general these days that it's even more true. We're just building...
building a product with billions and billions of use cases that are unanticipated. So we will need to build the general solutions for all of it. How far away does it feel like we are from that kind of like next level of complexity and reliability? I don't want to give a precise like, is it six months? Is it 18 months? Is it 24 months? But
I think a lot of the time, in my opinion, is actually not really about the core algorithmic AI development. Part of it is just that there are a lot of engineering challenges to really changing the whole way you train your models to be in more complex agentic environments. That has some almost constant-time, but non-trivial, cost to switch how we do research. So I feel like with agentic research, it seems like a lot of the upfront challenge is that it's no longer going to be very simple prompts and responses. We're now going to act in an environment. So how you define those environments: that kind of angle, though, has at the very least been something that DeepMind has been pondering even since I joined in 2014. It's always been about
There's part of building AGI which is figuring out a really good agent, and there's part which is figuring out a really good general environment. I don't think anyone has solved the perfect environment yet. And, you know, there are some obvious ones. There's a notion of using a web UI and being able to automate many kinds of web tasks. There's a notion of having a code base and being able to work within that code base and do many useful things. But I think picking out those...
If you can pick out a really good environment, then we can accelerate a lot of agentic research in that environment and build really good algorithms.
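To make "defining an environment" concrete, here is a minimal, purely illustrative sketch (the class and function names are hypothetical, and this is not how DeepMind's or Gemini's agent stack actually works): an environment exposes observations, accepts actions, and returns a verifiable reward, and the research question becomes which environments are rich enough to be worth building this loop around.

```python
from dataclasses import dataclass, field

@dataclass
class ToyEnvironment:
    """Toy text environment: success means the agent's action mentions the goal word."""
    goal: str
    max_steps: int = 5
    history: list = field(default_factory=list)

    def observe(self) -> str:
        # The observation is just a textual description of the current task state.
        return f"Goal: mention '{self.goal}'. Steps taken so far: {len(self.history)}"

    def step(self, action: str) -> tuple[float, bool]:
        # Execute the action and return (reward, done); the reward here is trivially verifiable.
        self.history.append(action)
        solved = self.goal in action
        done = solved or len(self.history) >= self.max_steps
        return (1.0 if solved else 0.0), done

def run_episode(agent, env: ToyEnvironment) -> float:
    """Roll out one episode; `agent` is any callable mapping observation text to action text."""
    total, done = 0.0, False
    while not done:
        reward, done = env.step(agent(env.observe()))
        total += reward
    return total

# Usage with a trivial hard-coded "agent"; a real setup would call a model here instead.
print(run_episode(lambda obs: "open the browser and search", ToyEnvironment(goal="browser")))
```

The point of the sketch is only that once the observe/step/reward contract is fixed, the same agent code can be dropped into richer environments (a browser, a code base) without changing the research loop.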
And I feel like that's as big a part of the challenges as any given breakthrough in like attention or in long context or reinforcement learning. How high do you guys think the ceiling is for continuing on this like test time compute vector? I mean, obviously, you know, I think very publicly, like Ilya has come out and said, there's this entirely new direction that's needed to really advance AI to the next level. Do you agree with that? Yes, I pretty much agree because, you know,
LLMs are just too cheap. Like, you know, what, an operation costs, you know, under 10 to the negative 18 dollars these days. So, you know, if you can infer relatively efficiently, even on a
very large model, you're getting like over a million tokens per dollar. You know, I guess you can just check the prices on Gemini or anybody else. So you're getting millions of tokens per dollar. That's
orders of magnitude below the cost of most other things you can think of. If you think of a really cheap pastime, like go buy a paperback book and read it or something, you're paying like ten thousand tokens per dollar. So we're a couple of orders of magnitude cheaper than, you know, reading a book, and, you know,
probably, you know, four orders of magnitude cheaper than, you know, paying anybody to do anything and, you know, whatever, six to eight orders of magnitude cheaper than paying a software engineer. But so there's...
there's a huge, huge margin of difference there to apply more compute and make the thing smarter. And, you know, if the value is there, which I think it is, would you pay five cents an hour for the bad engineer or ten cents an hour for the good engineer?
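As a back-of-envelope version of the comparison being sketched here, a tiny, purely illustrative calculation follows; every figure in it is an assumption made up for the example, not actual Gemini pricing or wage data.

```python
import math

# All numbers below are illustrative assumptions, not real prices or wages.
llm = 1.0 / 1_000_000     # ~$1 per million tokens            -> ~1e-6 dollars per token
book = 10.0 / 100_000     # $10 paperback, ~100k tokens        -> ~1e-4 dollars per token
labor = 20.0 / 2_000      # ~$20/hour, ~2k tokens of output    -> ~1e-2 dollars per token
engineer = 800.0 / 1_000  # ~$800/day, ~1k tokens shipped/day  -> ~0.8 dollars per token

for name, cost in [("LLM", llm), ("paperback", book), ("generic labor", labor), ("engineer", engineer)]:
    print(f"{name:>13}: ~${cost:.0e} per token, "
          f"~{math.log10(cost / llm):.0f} orders of magnitude above LLM inference")
```

Under these made-up numbers, the book comes out a couple of orders of magnitude more expensive per token than LLM inference, generic labor around four, and a software engineer around six, which is the shape of the gap being described.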
So there's a huge amount of sort of unexploited flops in there to use if we can find ways to use them. And one way to use them straightforwardly is, OK, just train a bigger, better model. We're already doing that. But still, model training costs tend to go up
quadratically with the size of the models. So then you still end up with a relatively cheap inference if you do it right. So then, of course, what everyone is doing now is just apply more compute at inference time through this chain of thought thinking or any other
brilliant algorithms we can come up with. And so I think we're just going to start seeing a scaling curve there as well, as we're seeing in a lot of places. Right. But does that scale us all the way to the AGI future people envision? Or is there some kind of completely adjacent thing that's required? I guess, where does it asymptote? Yeah, I guess that's where we'll see whether it's the...
the humans that invent the next breakthrough or the AI. But I'm not, I've given up on like organizing my garage and stuff like that because I'll just wait for the robots. Yeah, you think they're coming? I guess you guys had a big robotics release yesterday. Yeah, that was awesome. Yeah. Is it ready to clean your garage though? Yeah.
Not that I know of. Yeah, I think the question of how much more is there to give in this test time compute paradigm? Is it all the way to AGI? I don't think it's all the way to AGI. I think we already kind of established there's other components, being able to act in a complex environment. Acting is very important. Research into acting agents, that's a definite investment. And there's many other aspects. But are we seeing test time compute asymptoting? I think...
like the kind of Ramanujan example is always in my mind here where it's like, I don't want test time compute to just for any given problem, like think longer and then eventually arrive at a solution. But we also want it to be able to think like very deeply and actually like create maybe useful knowledge that it's going to like actually incorporate to then solve further tasks in its thoughts and
and thus dramatically improve data efficiency. If you can just have one math textbook and you spend most of your time really just thinking and playing around with the ideas, and then you can become a world-class mathematician, that would be the kind of thing that
I feel like should be and we should strive to achieve with a very deep thinking model. Are we there yet? No. Do we have a path there? I think there is. There are many directions of gradient towards such a model. We've been already seeing, and this is one thing
that I think people don't really talk about too much with test time compute, but we're already seeing amazing improvement in data efficiency by training the model to think deeply with reinforcement learning when it's solving the task. So even if we have a fixed bunch of RL data and we're not going to add any more, the test time compute paradigm is allowing us to learn a lot more from that data, and I think we could probably push that much further. So there are many particular research angles we're interested in for how this model could
think not just in one particular, you know, spitting out a couple of thousands of tokens in order to solve the task, but maybe like think much more deeply, much more kind of like a researcher might think about a hard and open-ended problem. I think so much of what you guys have touched on is like these models increasingly acting like researchers. What are like the early signs of that that you would look for? I think like, you know,
I'll give one example, which is math. I think right now math is being treated in a way that's a bit strange. Math is often used as benchmarks, and they're kind of like exams and maybe even math competitions. There's going to be a very important pivot from the math benchmarks to really starting to be about actually generating useful math, starting to solve actually important problems that we really care about.
I think there are ways. I think it was very cool, this FrontierMath eval that was created, which is trying to provide a gradient towards that. So maybe the harder category is basically almost unpublished math findings, and the easier categories are supposed to just be harder or trickier. I don't know whether that particular eval is the perfect way, but having some kind of ramp
of evals that bridges from where we're at now to actually useful scientific contributions. This is something I'm especially excited about: you know, bringing in professors and researchers and saying, okay, use this tool in a non-ironic way to actually accelerate research in this area. What would you do? What's missing? What do we need to advance? And I think, yeah, I think
this kind of notion of incrementally harder benchmarks might sound like, oh, that's what AI researchers always say, but I think that's going to basically be my metric of progress. Math is a great example here because this is a field where you don't actually need more data. People invented all this math without
this external input from the world. A lot of times in a room, just thinking. Yeah, you just go into a room. And then there are examples where, okay, there's some data, but
Okay, Isaac Newton takes a bunch of astronomical observations, goes into quarantine in his house and then invents physics or something. So, okay, that's an example where there's data, but nobody actually knows the answer.
nobody knows physics yet, and then you generate physics. And then math is even crazier, because you start with, you know, roughly no data and invent something useful. So that kind of provides a counterproof to the assertion that, hey,
this is just learning to mimic people, that, okay, the most we can do with AI is to relearn what everybody knows and what we've... The learning-to-mimic-people critique, I feel, has been around for a while. Like, Yann LeCun has said a lot about this. Do you feel like that's entirely disproven at this point? Basically, the critique that novel discovery and thinking is kind of impossible in these current model architectures. Well, there's definitely one class of scientific discovery that I think almost no one could...
argue against, which is that actually a lot of science is: if you knew about these two disjoint pieces of information and thought about the intersection for a while, you would realize a new property. In materials science, it might be like, oh, it turns out just associating things way better means you actually know way more about what new kind of material may be photovoltaic, but also may, you know, et cetera. So interpolation alone would actually
completely accelerate science. And I think, I'm guessing then it's always going to be this kind of whack-a-mole of like what is considered actually interpolating known ideas versus like creating a completely novel idea. And for that one, I'm going to go to Noam. Yeah, I...
I don't know, but I guess I'd just throw it back at Yann LeCun to prove that he generated a completely novel idea. We'll see. I don't actually care about arguing this stuff. Let's just build AI that'll
greatly increase the level of technology in the world, help people. That seems good enough for me. Even if we go back to the math example, like the thing I'd love to see is like, you know, maybe the state of mathematics right now is kind of like the state of kind of
geographic exploration in the 15th century or something. So it's like, there's some known things, there's some like fuzzy area of like, we don't really know what's like beyond these boundaries. And we have a bit of a guess. And then you send a small number of people to go off in a boat. It's very expensive to try and like push the boundary of like what is known and what is not known.
And they come back with some pretty funny looking maps. Yeah, they come back with like a little bit of extra territory, like explored, and then they tell people. And that's a little bit like what's happening right now in mathematics. You have a very small number of like elite math professors that are able to really ask the right questions and then actually like prove useful qualities. And it kind of grows and it's grown that way. And that's how it's been going so far, right? If we can train a model that can actually like
essentially ask, like it can pose the right questions. What you could say is the space of all useful mathematics is kind of an infinite space. You don't want this to go off like a fractal onto kind of uninteresting questions. But if there was some notion of the set of interesting mathematical questions, if it can keep posing new ones, and then if it's very, very strong and it can solve those, then at some point,
you know, maybe like right now we have a pretty strong kind of math model. And at some point, maybe it will be at professor level. Maybe one day it'll be at Terry Tao level. And then you have a million of them. And now I think you could hope to maybe complete the map. And then what would completing the map look like? That could be one of the greatest contributions to science. For physics, chemistry, you'd now be able to have a very deep understanding of
any useful mathematics, I think that would be a very, very exciting thing. Whether it's possible, I'm not sure, but I would agree maybe the key crux is can I ask actually novel questions? The question posing thing seems to be the hardest part, in my opinion. The solving, I feel very confident we will get there. But maybe mathematics is infinite.
Yeah, I mean, it's definitely infinite. But we can do so much better. Yeah, yeah. I guess let me just talk about the culture of AI research a little bit. Like, you know, Noam, you were obviously part of the, you know, a leading part of the original Transformer paper. Both of you have been a part of so many breakthroughs. You know, people like to write these thought pieces about, you know, the culture that drove these innovations. Curious, like, what you think the main takeaways you have after, you know, having done this research, you know,
some periods of great success, some periods maybe of more frustration. What lessons do you take away from what actually works from a cultural and team structure perspective to drive this stuff forward? Maybe sort of AI research is kind of where...
chemistry was in the 15th century. Alchemy. It's alchemy. Like we don't know why it works, but it's highly experimental, you know, in terms of, you know, you get some idea, but really the proof is always in try it out. And then, you know, and then you have various observations you
come to hypotheses about why, you know, okay, this thing works. You know, why? What's the key thing? And, you know, so sometimes you're right, sometimes you're wrong. It usually takes more time
more experimentation to find out, or you can just like, you know, uh, just, uh, claim that, okay, it works because of my magic X, Y, Z, and you have your assistant swallowing frogs or something. And, uh,
Or the equivalent. I find naturally researchers just love to share and get excited about what they're doing. It is always important to try to credit people liberally because it's often...
very complicated to, you know, okay, what idea led to what idea? What was the key insight? And I think at some point we'll just kind of have to give up on credit assignment temporarily
and take a super intelligence-based approach of, we'll wait for the super intelligence to sort it all out once we get it. Super intelligence will write some great thought pieces on the culture that drove these transformations. But then it's super fun working at Google with this group of brilliant people
and collaborating with everybody. One thing I'm struck by about the Transformer story, I think, is it involved, I think you were like, you heard randomly that some people were working on, when you tell the story, it sounds like, wow, that could have easily not happened. I don't know, if you weren't walking down that hall one day, does it feel like inevitably someone six months later would have figured that out? Or how much just random happenstance is there that kind of leads us toward a different path from just
Yeah, people in this building colliding in different ways. Oh, interesting. Yeah, we would have all been using LSTMs or something. I mean, I guess maybe it's like, okay, you ask the same question about like, okay, if somebody hadn't invented like an internal combustion engine, would we still be using steam engines at this point?
I mean, I think someone would have come up with something like Transformer. From my vantage point, like over in London, it did feel like you were circling
there was like Neural GPU, which was also like, okay, the key thing is we're not going to have an RNN anymore. We're going to parallelize, but it's not just going to be a convnet. We're going to have a notion of, I don't know, depth being some kind of function of your sequence length. That was an idea in Neural GPU. It didn't quite work out, but it was like,
the key idea of getting rid of RNNs, it felt like that vibe was there. Yeah. Yeah. Right. Because there was all this work on convnets going around; everyone wanted to kill the LSTM. And yeah, attention had been floating around from the translation models. So it kind of
needed to come together. I love that. Also kind of on the, you know, on the culture side, I think one of the interesting things of how I understand Google works is basically there's like this bottoms up compute allocation, right? Of folks, you know, getting to do different projects and then convincing other people to kind of come and allocate compute for that. And obviously that's one model. There's other places that are like, we're going to go all in on like one thing and they're much more top down on the compute side. How do you think about like the trade-offs between those two models and
Yeah, I mean, we've been through both, or I've been through both, I guess. Google Brain, when I was at Google previously, was...
you know, mostly bottom up, you know, as you describe and then DeepMind has been mostly top down. Yeah, it's a bit more top down. A bit more top down and, you know, different philosophies and, you know, there are pluses and minuses to both. I think, you know, top down can be good for getting people to collaborate.
and getting larger training runs working. But bottom-up, I think, is also great for collaboration because then, okay, you bring someone new on your project, it doesn't mean you have fewer resources per person. You have more total resources.
So that's great. And there are so many abstraction-breaking ideas that there's no great way to categorize them. So if you're saying, OK, this is the compute for pre-training and this is the compute for post-training, well, OK, you've got something completely different that doesn't fall into those things nicely, and then it falls between the cracks. So we're bringing back a good measure of bottoms up because I think that's super important. Yeah, it feels like an interesting balance to strike. Yeah, like, I guess, like,
I worked at OpenAI briefly and I did like the way that concentrated bets were made and that obviously paid off well for them in certain areas and that was nice. I also liked the way concentrated bets were made at DeepMind, especially kind of like
Yeah, I do think there was a bit of a vision sometimes for like, say AlphaGo. I was asked to join the AlphaGo team right at the beginning because they needed an engineer and I was a research engineer and I just didn't get it. I was like, why would anyone care about a board game? I don't get it. So you do really need good research leads that have vision to drive these things. You can't expect...
everyone, from a vantage point that isn't always the best, to know where the impact is going to be. Right now, within, for example, thinking as the area of research that we work within, it's incredibly important that we have a reasonable, non-trivial investment in just bottom-up research where we're really not dictating anything. And then it's just a fun process of
It's kind of a fun meta-research process: how can we make these people maximally efficient? How do you make your baselines lightweight? How do you make them signal-bearing? How can people move as fast as possible? That's very important. And then it's also very important that it can't all be that; some of it needs to be, here is a mandate of top-down bets that we have to deliver on. But it's always humbling, like
I think we have a very good, like, scope of, like, what everything is happening and we get a sense of, like, here are the areas that are going to be really important and usually get humbled by the bottom-up research of, like, a thing that you weren't even thinking about ends up being way more impactful than you thought. So just, like, always keeping that running is super impactful. Yeah, I mean, reflecting back on the past decade, like, are there, you know, some of these, like,
inflection points or decision points where maybe, you know, hard 51-49 decisions ended up being super impactful. I mean, I think for both of us, going in on large language models was good. Yeah, I'd say that was probably a good call. And it kind of seems obvious now, but it was at least, I mean, I found it was in a state of being not obvious to most people, but obvious, I felt, to a small number of collaborators that this was definitely going to be a big thing.
and kind of had to go against the grain for a while. Yeah, that's true. There was definitely a time people were not excited about language models. Hard to remember now. I mean, it always seemed like the best problem on earth to me, but, you know, that's... Yeah, it was kind of like the deep learning people at the time maybe thought machine translation was still a bit cooler or computer vision.
Yeah, vision was exciting for a while. Why was everyone into vision? I don't know, I guess the ImageNet thing. Yeah, or there's like a picture of a cat or something like that. Oh, there's the one, yeah, the cat. Did you work on the cat thing? No, no, no, never actually did much in vision. I've run MNIST. And now you've come full circle on these Gemini 2.0 models. I see all these demos that are vision-based and you're kind of doing cooler things in vision now. Yeah.
than certainly identifying cats. I mean, it felt like, for almost every early LLM person, a world model was on their mind, really. I actually don't feel like the early LLM researchers were even really from a language-oriented background. They weren't really linguists. I don't even feel like understanding language was really part of the motivation for that early group. It was like,
train unsupervised learning at scale, do language first because it's the most knowledge compressed, but then gobble everything up into a big generative model and understand everything. And it's very, very cool to see that just continually proving out
Yesterday, we just launched native image generation. It's amazing. I think a lot of image generation right now is just focused purely on getting absolutely maximally aesthetic images, but having native image generation allows you to really do a lot more with images: understanding, editing. You can have interleaved sequences of images and text. And yeah, once again, it's just train
a generative model on lots of data to arrive at that. You guys are obviously both big believers in these models becoming more and more general purpose. I guess a question some folks are asking, as you think about domains, you talked about healthcare earlier: the model that ultimately is going to be our AI doctor, is that just a continuation of what we're doing in some giant model? Is there a healthcare-specific version of the model that ends up being released, that only has some set of data inputted into it, or is it just
a bunch of guardrails? Or, like, paint that picture for me of what you ultimately think our AI doctor or AI biology researcher looks like. Yeah, I don't really think you would need very task-specific models for something that high value, because, you know, you probably pay like $1 a token for talking to your doctor. So the LLM is way, way, way cheaper at this point. So really, the only reason for task-specific models is price. And so if there are things where you wouldn't pay $1 a token, then you'd want something more targeted. Yeah, something to analyze vast quantities of data for...
marginal value, then maybe you want something task-specific. Yeah. There's always this notion that there'll be tons of negative transfer out there, so you should compartmentalize things. I don't really feel like that has ended up being the case. If it could be measured, then that's a good reason to compartmentalize models. If there's no negative transfer, there's positive transfer, and you just have one big model. That's, I think, my personal philosophy.
This is not like a thing which people have uniform agreement over though. It's like a continuous active area of research. Like how much do you want to specialize and spin off these expert models? But yeah, from the way I see it, it's just very simple. If there's positive transfer, put it in the same model. As long as it doesn't then become too expensive to serve. You know, obviously you guys have been at the cutting edge for a while. What's one thing you've changed your mind on in the past year? I feel like timelines have shifted forward. And I don't mean that like...
in a vague sense. Like I think the rate of progress is much faster right now
than I felt like a year ago. Obviously the field was advancing, but whenever you have a new paradigm shift, it creates this sudden acceleration. And also, actually, no, this is a pretty good one. One thing I've changed my mind on: my mental model of the propagation of information and how people adopt a scientific advance has completely changed. So when the transformer came out,
I think, so I was over in London at DeepMind. People thought it was a cool paper. We were a bit suspicious. I eventually implemented it in our code base over the
holiday break actually. But it was like three months after the paper had come out. I eventually implemented it and tried it for like language modeling. But it was like not really getting picked up. I eventually then collaborated with someone that wanted to use it for reinforcement learning. But I'm going to say really it was like from paper coming out, maybe it was like six to nine months before it was like you saw just transformers dotted around all areas of DeepMind. And that's within, like we're all within Alphabet and it's like much easier to propagate information.
I would say the speed at which the field has picked up this test time compute paradigm, you have many labs have already trained and released models that are looking very good exploring the space. That was very surprising to me. The fact that these things can, if you make an announcement, say, this is important, and it's just a blog post or something, and then you'll have people that are able to
make breakthroughs in that space and release models on the order of months.
That was a wake-up call for me. There's a lot more compute and a lot more smart people working in AI. I often think of things in a bit of a rose-tinted glasses way and think about 2016 or something. And I think, well, we were very smart then, we were very creative. People are very, very smart and creative now, and they have way more compute, and there's way more of them. And so if anything is going to be very impactful, then it can just spread, and this idea can spread all across the world.
all across the world and people act on it, which is kind of crazy. Yeah, it is kind of crazy. And just the amount of compute out there that, yeah, like now, you know, whatever, a kid in the garage has more compute than was necessary to invent the transformer. So, like, you know, yeah, I mean, people always worry about, oh, just how much compute do they have? But, you know, it is definitely possible to make...
make breakthroughs with way less compute than you would imagine. Yeah. Anything you've changed your mind on in the past year? I mean, I've been continuously impressed with the
success of RL. I'd never really worked with it much before. It's like, oh, that's actually pretty good. Well, I mean, you kind of alluded to this, Jack. Obviously, I think in the reaction to DeepSeek and all these models that kind of fast-followed in the test time compute space,
Going forward, do you expect the open source models to be able to keep up with each subsequent generation of these models? It obviously seemed to happen faster than a lot of people would have expected. Yeah, that's actually something I'm changing my mind on. I do feel like the open source...
like the ability for open source models to stay very close and competitive with the frontier is persisting. I actually thought we were getting maybe a false sense of assurance that it's happening because maybe
it felt like it was converging, but then these things can pull away again. But actually that seems to have been very impressive. I'm really, really impressed at the performance of Gemma 3 that just got released yesterday. It's amazing. It's completely incredible. The team did a really good job. And other open source models. Yeah, DeepSeek v3 was a very good model when they released it. Yeah, so it seems like people are very passionate in the open source space.
and they're very creative and smart and they have compute. So I don't really see why they wouldn't be able to continually innovate. Yeah, what do you think, Noam? Yeah, I mean, it seems like the time gap between closed source and open source has been shrinking. I think that the technology will continue to...
So, I mean, it could be that the quality gap will be large and the time gap will be very small. But, yeah, we'll have to see how it plays out. It's super exciting to see all of these companies getting great results. Switching to some of the kind of broader implications on all the AI progress we've been talking about for society, I'm curious, obviously, it seems like both of you in the last year have, you know,
been impressed with the power of RL. You've been surprised by the pace at which we've scaled a lot of this test time compute. Have you changed anything in your own lives based on this probably greater clarity or belief that a lot of this AI-driven future is coming? You both have kids. Is there anything that you've adapted? It sounds like, I know you don't clean your garage, but I guess that was prior to this year as well. I didn't actually do that. It does make for a good podcast.
You've thought about not cleaning it. Yeah, yeah. I don't worry too much about the warming, you know. We'll have AI to take care of the carbon stuff soon enough anyway, you know. Yeah. But I guess, you know, anything that you've thought about differently in your own lives or, you know, in how you think about the life your kids will have?
AI and education is like... I don't think people are really talking about it enough yet. My son, like under supervision, but he likes to talk to Gemini. It's actually insane...
like how powerful it is, especially if he can like, he goes out to the garden, he like takes pictures of plants, takes pictures of lizards. And like, he now has this like very accurate personalized encyclopedia which can give him information. And do they adapt to that? Like, I wasn't sure how that would work. They do. My four-year-old son walks around like talking very detailed about like the plants. He'll use the Latin name. He's like, they absorb so much stuff. They're a sponge. I feel like I'm seeing kind of like
a type of education that I don't really think has ever existed for like humanity happening. Like AI and education is going to be incredible. Uh, he went, he went to school and he was like, um,
He was like, oh yeah, I caught a lizard to his teacher and his teacher was like, oh, that's cool. And it's like, oh, that looks like a big lizard. He's like, no, it's not a big lizard. It's a Western fence lizard. Like very particular about his type varieties of lizards. And he's like, and I also saw a blue tail skink. She's like, what's that? He's like, it's an amphibious lizard that, you know, so he starts like reeling these things out.
And, you know, that's just like, it's just obvious when you see it. But children are very curious. They're like sponges for information. And if you can, like, combine that productively with AI, I think that's going to be really incredible. I do feel like the next generation will just seem like smarter people. That's what I'm feeling hopeful about. Do you have anything you've changed? Um...
Yeah, it is extremely hard to predict what the future will be like. We will all do our best to make sure that AI will be safe and beneficial.
But, you know, it does, you know, it does make you think that, hey, you know, what I do now really, you know, really, really matters, you know, like, okay, we don't know if, you know, human labor will be, you know, like materially necessary in the future. But that just means a, you know, like...
you know, it makes more difference. If you want to make, if you want to do something that matters materially, go do it now. And then other than that, like, you know, just try to be a good person, you know, whatever you find, find spiritually meaningful, you know, go do it because that's, that maybe that's, you know, the,
purpose of humanity. The future, you know, may not be about providing for physical needs. So, you know, we've got to figure out where we find meaning in the future. But we'll
have plenty of time for, you know. Yeah, well, especially if your mom's already having the deep philosophical conversations with the models, maybe we'll be able to reason our way to that. But I am struck, I mean, I feel like some other people have come on this podcast, you know, Bob McGrew, chief research officer at OpenAI, was like, look, you know, humans are always going to have a role in asking the questions. The models will go off and do things. But I think, to our conversation earlier, the big question is, will people always be the best ones to ask questions, or will the models actually ask better questions over time? And obviously it has a ton of implications. I feel like every generation thinks they're living through the most important moment in history, and I guess you're biased when you're going through it, but it does feel like we are certainly in that. Yeah, and I think...
like a technological advancement kind of scares people, and people have a right to feel trepidation at this stage as well. But, I mean, this isn't such a good example, but even with the introduction of the television, people were like, oh, is this going to make us all just lose our attention span? Are we going to completely lose the will to go out and walk outside and have friends and things? That was a thing people freaked out about. I think obviously now we know it was a small piece of technology that was entertaining, and maybe it even provided a net positive to society. I think it kind of went okay.
It's like that, but kind of on steroids. There's very strong signs of how this helps us. There's very good reason to be concerned about how this could not help us. So it's kind of like you have very demonstrable reasons that have already proven out of how it's helping us. You have very concrete arguments for how this could not go well. And I think that makes it a very interesting time too. In some ways, we have less kind of...
meaning than in the past, in the sense that, you know, in the distant past, everyone was at the brink of starvation, and you're like, okay, I have meaning in my work because I have to go work hard today and get some money so my family doesn't starve tomorrow. And today, living in America, nobody's family is starving tomorrow if you don't work hard. So, okay, that's less meaning than we used to have, and we've found
other sources of meaning. So, you know, more of that in the future as, hopefully, AI improves our physical situation. How worried are you both about AGI risks? I would say moderately, yeah.
It is hard. It's often difficult to find examples of creating something which becomes far more intelligent than its creator, but then still acts in predictable and useful ways for its creator. And I think that class of argument is concerning. There's also just more practical...
kind of AI and society implications that we've touched upon, like making sure that AI is constructive to the economy, and that people can kind of
offload their lifestyle and we don't have sharp changes in the employment landscape and things, that's super important. Both of those are often on my mind. And then there's just very pragmatic things when we're always putting more capable models out. We're very excited as technologists to develop and ship things. But we also, I think we have a pretty good balance of then internally
having another group that's going to be thinking about this much more holistically and like how can we make this launch safe? What are some unintended consequences? Which I think has been pretty good and is like super important. So I'm glad that happens.
Yeah. I mean, hey, I agree. I'm not afraid, but we definitely need to be working on all the safety aspects of all of that. And there are examples of us creating something that becomes smarter and more powerful, in that,
you know, we have kids and they're smarter than us. And then they become teenagers. Yeah. But then you've solved the alignment. Yeah. Yeah. But so, I mean, hopefully if we respect our parents and, you know, treat them well, the AI will learn from us. Yeah.
You just have to have a lot of tokens on the internet of people being really respectful toward their parents. Exactly. We have to stop pushing the robots over. Respect for creators. Yeah, exactly. That feels dangerous. One thing I did want to ask you about, Noam, is obviously in your previous in-between-Google stint, you'd spent a lot of time building out Character, and you thought a lot about that product space of AI companions and the ability for folks to chat with all sorts of different kinds of people.
What do you feel about where that space is today? What kind of problems do you feel like we still need to solve there? Yeah, I mean, it's interesting because the reason I...
main reason I left Google to start Character was I thought that the biggest thing that the LLM industry could use was an application where anybody can go and interact with LLMs and use them and discover use cases that were good for them. Because this was before ChatGPT launched, before
Gemini launched, which is kind of different from how things are now. So, I mean, mission accomplished. Everyone's out there talking to LLMs now. I guess the other thing was, you know, going into Character, we were not
really focused on, hey, this is going to be an entertainment product or something else. We kind of just went in with an open mind and we were like, okay, you know, we're going to just put this out there in a general way. We're going to help people conceptualize this as, you know, okay, this thing can
you know, take on different personas, meaning it's a, you know, it's a very, very general technology. See what you can make it do. And definitely we found a lot of people are using it for entertainment. I think partially because, like, at that point in the technology, like, okay, like, nobody's figured out how to make this thing not hallucinate, so people are going to use it for, you know, for applications where hallucination is actually a feature. Right.
Such as entertainment. I think that's worked pretty well. I know a lot of people like using Character. What were you asking? I guess: what do you think the future of that space is? Obviously there's this early behavior. Yeah. What does it look like five, ten years from now?
That's a good question. I do not know. I mean, I do think people will always want relationships with humans, because that's spiritually more meaningful. But I think people will like having
AIs that are kind of more in human form for the things they want. I mean, imagine you just got elected president and you get your AI cabinet to advise you wherever you go. You have a whole AI cabinet. Yeah, you get your own personal AI cabinet. You get your whole AI company. A good AI summarization of all the secrets that you need to know. Yeah.
Or maybe you're the CEO of your own
AI company, so you get a lot more productivity. I guess in a lot of those cases it's less about the personality and more about the productivity. But to the degree that people like interacting with something that feels human, we'll probably see a lot of AI
that feels more human in various ways. And do you think progress in that space is just about the models getting better, or is there a whole other set of human-computer interaction and product questions that need to be solved? It's a good question. I mean, I think the models getting better
is pretty big. And then part of it is for whoever is running the application to decide, okay, what are we going to let people do? I think users will probably
be pretty good at specifying, okay, what do I want for an interface? And it'll be mostly about, okay, do we want to let them specify that? Totally. Well, look, both of you, fascinating conversation. We always like to end with a quick-fire round to get your take on some overly broad questions that we cram into the end. Okay.
Sounds good. So maybe to start: what do you feel is overhyped in the AI world today, and what's underhyped? I personally feel like the ARC-AGI eval is overhyped. Oh, that's very spicy. Yeah, I think maybe the progress there has actually been quite slow because a lot of researchers just don't feel particularly inspired to do these very specific types of puzzles. We did a lot of that
in like 2015, 2016. And then we kind of felt like, yeah, if you know the puzzle domain, you spend a lot of time fixing the actual bottlenecks, like maybe acting on these large grids is a little bit finicky, and you make a lot of progress, but then you don't necessarily
continue on to building something that's really AGI and useful. So I personally had a transition from all these synthetic tasks to just modeling natural language, and I felt like pulling in that direction was a much more AGI thing in the long run. Anything else that comes to mind for you, overhyped or underhyped? I don't know. I think AGI is underhyped.
Yeah, LLMs are just still massively underhyped. I think people are still thinking about it like it's only going to be some silly trillion-dollar products. I heard you say in another pod that a trillion wasn't cool anymore. Quadrillion was...
Exactly. You've got to do the Dr. Evil. Obviously, it would be a gross misallocation of societal resources to take you away from building models to go build applications. But I am curious: if you were to go build an application today, what do you think would be the most interesting? You've talked about education before, but are there any other areas that come to mind that you think would be fun to go build apps on top of these models? Well, I do think it's actually very cool how many
apps have been trying to break into this agentic space. And people then expose them and say, oh, this is just a wrapper around a known model. But there seems to be a lot of value in actually having the right app experience if you want a model to actually act and do something useful for you. So in the agentic space, I think code is very crowded now.
But I do think there are a lot of other things I would find it useful for a model to automate for me, things that go beyond a chat experience, where it's actually going out and doing useful things.
Yeah, I mean, I'd say code is underhyped. I think it's huge because, for one, humans aren't even that good at it; we're not really designed for things like code and math. And then it's one of these things that will help us self-accelerate: if we build an automated software engineer and researcher, then it'll build
the next, better AI. So yeah, the combination of engineering and agentic: something that can control surfaces broad enough to do the job of an engineer. If I were to focus on applications, that is what I would be focused on. How different will the infrastructure needs look for test-time compute models versus these massive pre-training runs?
You mean in terms of hardware? Yeah, in terms of the hardware requirements, distributed data centers, all that. Yeah. I mean, it's a pretty rosy story, I would say, right? If it turns out that building AGI becomes mostly an inference problem, an inference problem that can be much more distributed than the synchronous, large-batch training that happens in pre-training, I think
that we can be much more flexible with our compute. It's going to mean we maybe don't mind the model training across data centers as much. Maybe you can spread out actors that go off and get experience and send that experience back from many, many different data centers; they don't all need very strong, fast interconnects.
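(For readers who want a concrete picture of the actor/learner split Jack is gesturing at, here is a minimal Python sketch. It is not Gemini's infrastructure; the local queue standing in for cross-data-center transport, the fake trajectories and rewards, and every name and number in it are illustrative assumptions. The point is only that actors produce experience independently and ship compact records back, so only the learner needs tightly coupled hardware.)

```python
# Minimal sketch (not Gemini's actual infrastructure) of an actor/learner split:
# many actors generate experience independently and send compact records back
# over ordinary networking; only the learner needs fast interconnects.
# Every name and number here is an illustrative assumption.
import multiprocessing as mp
import random
import time

EPISODES_PER_ACTOR = 5
NUM_ACTORS = 4

def actor(actor_id, experience_queue):
    """Stand-in for an actor running in some remote data center."""
    for episode in range(EPISODES_PER_ACTOR):
        time.sleep(random.uniform(0.01, 0.05))  # pretend rollout latency
        experience_queue.put({
            "actor_id": actor_id,
            "episode": episode,
            "tokens": [random.randint(0, 999) for _ in range(8)],  # fake trajectory
            "reward": random.random(),  # fake score from some verifier
        })

def learner(experience_queue, total_expected):
    """Stand-in for the central learner: consumes experience in whatever order it arrives."""
    for _ in range(total_expected):
        record = experience_queue.get()
        # A real learner would append this to a training buffer and update weights.
        print(f"learner got episode {record['episode']} from actor {record['actor_id']}")

if __name__ == "__main__":
    queue = mp.Queue()
    actors = [mp.Process(target=actor, args=(i, queue)) for i in range(NUM_ACTORS)]
    for p in actors:
        p.start()
    learner(queue, NUM_ACTORS * EPISODES_PER_ACTOR)
    for p in actors:
        p.join()
```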
That is also going to drive prices down, because then we might start to really optimize toward such a setup, which is intrinsically cheaper. The cool thing at Google is that we have this co-design link with the TPU team, so we're always feeding them the profile of how we're spending our compute, which allows them to tweak the
chip design and the data center design within a couple of years' time frame, which I think is really, really motivating. Like Jack said, the fact that you can be distributed gets better. The thing that gets worse about inference versus training
is that you lose a lot of the parallelism in the transformer: just naively using a transformer, you end up memory-bound, loading from memory the
attention keys and values for every token that you're generating. So there's a lot of great work to do in attacking this from both a model-architecture perspective and, frankly, a hardware perspective, to get ourselves closer to the point where we can take the massive computational power of the chips we have and fully apply it to inference.
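(A back-of-envelope sketch of the memory-bound decoding problem Noam is describing. All model dimensions below are assumed, illustrative numbers, not Gemini's configuration; the point is only that naive decoding streams far more bytes of KV cache from memory per generated token than it performs useful FLOPs.)

```python
# Back-of-envelope sketch of why naive Transformer decoding is memory-bound.
# All numbers below are illustrative assumptions, not Gemini's configuration.

num_layers = 64        # assumed decoder layers
num_kv_heads = 8       # assumed key/value heads
head_dim = 128         # assumed dimension per head
bytes_per_value = 2    # bf16
context_len = 32_768   # tokens already sitting in the KV cache

# Bytes of keys and values streamed from HBM to generate ONE new token:
kv_bytes_per_layer_per_cached_token = 2 * num_kv_heads * head_dim * bytes_per_value
kv_bytes_per_generated_token = num_layers * context_len * kv_bytes_per_layer_per_cached_token
print(f"KV cache read per generated token: {kv_bytes_per_generated_token / 1e9:.1f} GB")
# -> ~8.6 GB with these assumptions

# Attention FLOPs for that same token (QK^T plus attention-weighted V,
# multiply-add = 2 FLOPs), simplifying to one query head per KV head:
attn_flops_per_token = num_layers * 2 * 2 * num_kv_heads * head_dim * context_len
print(f"Arithmetic intensity: ~{attn_flops_per_token / kv_bytes_per_generated_token:.0f} FLOPs/byte")
# -> ~1 FLOP per byte of HBM traffic, while modern accelerators need on the
# order of hundreds of FLOPs per byte to stay compute-bound -- hence the
# memory bottleneck, and the interest in architecture and hardware fixes.
```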
I want to leave the last word to you: where can people go to learn more about what you guys are doing? Yeah, well, we have a new and updated
Flash model that applies thinking and is considerably stronger than the last model we released in January. It's out on the Gemini app, and I would definitely encourage people to try it and give us feedback. We have been incorporating developer and user feedback into each model series, so that would be one thing I would encourage people to do. Jack's nodding his head. Yeah. Well, thank you both so much. Seriously, it's such a pleasure to be able to talk through all this with you. Real pleasure. Yeah. Thanks. Thanks.