You will never hear me argue against scale. The narrative at the time in the world was we're just going to scale up the next GPT model, make it 10x larger. This was such a strong narrative that frankly many people at the time in the world believed this. We didn't.
There was a missing axis of scaling that wasn't being discussed, and it's frankly why we started this company. It was the axis of scaling for the use of reinforcement learning. Scaling of next token prediction is the equivalent of imitation learning. Scaling of reinforcement learning is the equivalent of trial and error learning. The reason we built this company is because we saw a future that I personally think now is 18 to 36 months away, where human-level intelligence across the vast majority of knowledge work is achieved. You don't get to do that unless you build from the ground up.
you don't fine-tune your way to AGI. I've just had the most amazing conversation with Eiso Kant. He is the co-founder and CTO of Poolside AI. Now, they are building frontier language models. They're one of about seven or eight companies in the world who have the technical chops to build foundation models from scratch.
They have a really cool solution for doing generative AI coding. Honestly, it's now possible to write software about 10 times faster than we did before. Now what these guys have done is they use reinforcement learning from code execution feedback, which means they're going one step further in the stack
to align the language models that they build to the code and the software that you are writing. There's so much low-hanging fruit in this space at the moment. When are we going to have code solutions that can watch your screen, that are multimodal, that help you collaborate better with developers? Eiso has a very interesting story to tell about all of this. We also had some pretty cool Galaxy Brain conversations about how to train foundation models, about test-time computation and thinking and reasoning.
I think there's a lot for quite a few people to get their teeth into in this conversation.
MLST is sponsored by Tufa AI Labs. Now, they are like the DeepSeek based in Switzerland. They have an amazing team. You've seen many of the folks on the team. They acquired MindsAI, of course. They did a lot of great work on ARC. They're now working on O1-style models and reasoning and thinking and test-time computation. The reason you want to work for them is you get loads of autonomy, you get visibility, you can publish your research. And also they are hiring; as well as ML engineers, they're hiring a chief scientist.
They really, really want to find the best possible person for this role and they're prepared to pay top dollar as a joining bonus. So if you're interested in working for them as an ML engineer or their chief scientist, get in touch with Benjamin Crouzier, go to tufalabs.ai and see what happens. Eiso, it's an honor to have you on MLST. Thank you so much for joining us today. No, thank you so much for having me. I appreciate it. Can you tell us a little bit about yourself and Poolside?
Personally, I'm a computer geek. I started programming when I was quite young. In 2016, I found myself building what I believe to be the world's first company that focused on making AI capable of writing code. So I met my co-founder as well, actually. It's a longer story. And in April 23, we founded Poolside. Poolside was really founded on our view that the world was going to achieve human-level intelligence and AI.
And we took our own point of view on how to get there. And that's been the fundamental start of us since now almost two years. And your co-founder is Jason. Correct. Yeah, Jason I met because in 2017, he was the CTO of GitHub. And I'm not sure if I've ever said this publicly, or at least not on camera, but he actually made an acquisition offer for that company I was building called Sourced.
And we back in the day had the world's first models that were able to do code completion and things like this working. And I turned down the acquisition offer, but nonetheless, we became really good friends. So tell us a little bit more about Poolside. What's the main goal? Poolside's main goal comes down to wanting to build a world where we have human level intelligence that we can scale up on compute.
We think it is essentially going to have two ways of having impact. One is the more that we can make capable intelligence scalable, we can start driving the cost of goods and services down to zero. On the other hand, there's this entire frontier of technology and scientific progress that's ahead of us. And by definition, it's infinite.
We will always continue to find more. And so, to be able to pull that closer in time has always been our mission. But we took a slightly different path than others. We took the path on focusing on making AI incredibly capable of building software. And we laid out this three-step plan kind of on our website on day zero. It's still there in the bottom of the footer if you click on vision. And it said, make AI capable to assist developers in building software.
Step one. Step two, allow anyone in the world to build software. And step three, generalize it to all other fields and domains. So there's a kind of winner-takes-all dynamics in the space at the moment. There are amazing frontier models out there. I mean, I've been playing with, you know, Sonnet 3.7 thinking and whatnot. And there's always the question of differentiation. I mean, Anthropic, they released this Claude CLI thing. And, you know,
You can just stick it on your repo and my God, it's really, really good. So how can you differentiate on top of that? So I think all of us at the frontier are constantly competing with each other for model capabilities. And I think in the fullness of time on capabilities like software development, we might all end up even in the same place. But if you look at right now where the world is at, it's a very small number of companies that are actually competing there. I would say that we have kind of...
the old guard, like Google; we have kind of the first generation of AI companies, OpenAI and Anthropic; and you kind of have the second generation, xAI, Poolside, Mistral. We all were founded around this April, May 2023 date. And all of us, I think, are in that same race. Now, we've decided not to focus on making our models generally available for every possible use case, but really make them available for software development. And this allows us a set of liberties in terms of where we focus and areas we don't focus on. But don't be mistaken, the work that we do to build really capable foundation models still lends itself to building really, really capable models across the board. Because software development is not about writing code. You need to understand the world. You need to be able to do multi-step complex reasoning. You need to be able to plan across long objectives.
And so I was super excited by what Anthropic did with Claude 3.7. I think it's an amazing model, the Sonnet model. And of course, it's on us to make sure that we then surpass that. And so we're constantly in this race with each other, but we take certain views on our research and our approaches that I think will allow us to accelerate over time faster towards those goals.
So I'm trying to understand this because there seems to be a dichotomy between having really general foundation models that can do lots and lots of things versus the story from many folks in the space who are saying we need customization, we need personalization, we need on-prem deployments and so on. How do those two worlds come together? So I think it's a spectrum. I think absolutely in the first part of your training of your models, you want to embed as much diversity and knowledge of the world as possible.
Software development is not about writing code. It's about being able to interact with the real world and turn that into a digital form. So having that general part down is incredibly critical. I might not care as much about how humorous my model is and how good it can handle nuances of comedy, but I absolutely care about the knowledge that it has in many different domains.
What ends up happening though is that all of us have a fixed parameter space. At the end of the day, there's a cost of inference, so there's only so many parameters that I can load up and then actually run. And by having a fixed parameter space, it means that you have to choose what you want to do with those parameters. And so we try to shift the distribution of our model capabilities very much towards software development in terms of the capabilities that they have, but also meaning that we're willing to trade off
You know, maybe not being as good at, you know, writing a creative bedtime story or writing comedy, or areas that you would probably find more in a consumer AI from other people. But you mentioned a second part, which is this notion of customization and where do you deploy? I think this really comes down to how do you view where models are going to be in the coming years?
And so I think all of us at the frontier, it's our responsibility to build the world's most capable models that can interact across all of science and technology and knowledge work. Even if we focus on their abilities to build software, we still care about all of that. And over time in our step three, we want to branch off to opening up to those other areas as well. But if you look at what's going to happen in the future, there's a big question. There's a question of do we have some all-powerful model that is static?
It's one model and we all use it to do economically valuable work. Or do we have all powerful models that are able to actually become versions of themselves deployed in environments, learning from the data in that environment?
So the question is, do we have a software developer that can write all software or do we have one that's deployed inside a banking environment and has truly access to all the information and learned over time from it? So it's a little bit, do we anthropomorphize it? Is it going to be like a human? I can be a very capable software developer, but you deploy me at a bank. All of a sudden, I'm going to have to learn everything over time from that bank and that embedded knowledge. Or am I something generalized that gets applied to it?
I think the honest answer is that we don't know yet in this space. What I do know is today, when the models are not yet at human-level capabilities, and certainly not at superhuman-level capabilities yet, it's very valuable to give the model access to as much data and as much context and as much ability to learn in an environment as possible. And so we just look for the shortest path towards doing that. And the shortest path with enterprises towards doing that
is to be willing to deploy the model, the context intelligence layer, and the applications behind their firewall, close to the data. It's a tactical, in-time decision. I think over time the form factor of that might change, but it's something that we've seen resonate really well with the customers we have.
It's interesting because I'm trying to tease out your view on scaling, because there are many folks who think we should scale the models up. GPT-4.5 just came out. It's quite interesting that, you know, Gwern and Karpathy were on Twitter and they were basically saying, well, you know, high-vibes people, meaning like smart people, can see that this is a step up and it's doing well in very nuanced things, but the benchmarks aren't capturing it. We need better benchmarks.
But undeniably at the moment, there is a gap in capability, right? So we need to have customization and like, you know, thinking on site, surface contact with domain verticals in order to do well. But I think you're saying that you can imagine a world in the future where we could bring all of this data back into a huge foundation model and it would work just as well. You will never hear me argue against scale. Scaling of compute and scaling of data is critical for us to close the gap between where models are today
and where we believe they can be: human-level intelligence and even beyond. But that doesn't necessarily mean that the axes of scaling today are the same axes of scaling that people thought they were two years ago. So when we started this company, the narrative at the time in the world was we're just going to scale up the next GPT model, make it 10x larger and provide it with more web data, and we're going to have this AGI moment, this human-level intelligence.
And this was such a strong narrative that frankly, many people at the time in the world believed this. We didn't. And it's not because we don't agree with scale. I think scale massively matters.
But our view was that there was a missing axis of scaling that wasn't being discussed. And it's frankly why we started this company. And it was the axis of scaling for the use of reinforcement learning. You mentioned Karpathy. I liked how he said it the other day: scaling of next-token prediction is the equivalent of imitation learning. Scaling of reinforcement learning is the equivalent of trial-and-error learning. And while there's probably some nuance to that, I think it is a right way of thinking about it.
Yeah, I read this amazing book by Max Bennett called A Brief History of Intelligence. And he was basically saying when you look at the animal kingdom and humans in particular, you see this axis of simulation, right? So it's the ability to imagine things or, you know, imagine experience which you haven't directly experienced.
had access to, right? And that just creates this explosion. And language, of course, is an even more sophisticated invention because it allows you to sort of mimetically share those simulations that you didn't actually have with other people. So with reinforcement learning, you can actually try things and you can accumulate knowledge without needing to have direct physical experience. So I think it depends on where you apply reinforcement learning here, right? So I tend to agree with a lot of what you said,
which is that at the end of the day, what we do in thought, and I think it differs, by the way, by personal experience. We spoke about this earlier. My thought is entirely language-based. It's a constant internal monologue that is so language-based that there are no visual representations or abstract concepts. So for me, I feel quite akin to language models, to put it kind of funnily, because I see how they think and how they reason, and I can relate to it.
But the reason I mention this, to your point, is that language is a way, I don't think it's the only way, but it's a way that we can explore different possible, you know, chains of thought, different possible lines of thinking. And a lot of how my mind works, and I think how many people's minds work, is that you're looking at an objective and you're thinking through the different possible chains of thinking that can get you to that objective, whether that's about writing a piece of code or about something far more long-ranged.
And there are several things that we do to ensure that that objective is correct. One is that we try to keep it consistent with what we know are representations of knowledge that we build upon. So if I am, you know, reasoning or thinking through a math problem, I'm constantly consistency checking against the knowledge representations that I've previously learned, right? The axioms that have to be true in math or physics or any other domain. But then there are certain things that no matter how much I try to keep true against,
I actually need to do the work. I need to get the real world feedback. So the, call it maybe slightly flawed example here, but I think it's a useful one, is if I want to learn Go or chess. I can read a whole bunch of textbooks. I can play out chess games in my head to some extent. But at the end of the day,
I am still a snake eating itself in terms of my own synthetic data, right? And so coming into an environment where I play against someone else, computer or person, doesn't really matter, and I learn from my mistakes, hey, my reasoning chains that I explored led to me losing this game. That's actually where I think it's really, really valuable to have some form of external feedback.
and that external feedback in the case of chess or Go is deterministic. And sometimes that external feedback can be human feedback, but it's very hard to scale, and it's also not necessarily always the right type of feedback. And so that's where I think reinforcement learning can live on that boundary, where when we have the ability to form a reward that pushes us more correct, or at least in the right direction of correctness,
we can then improve the next set of thinking and thoughts that are those next thinking chains that you're talking about. Yeah, there was a great Nature paper talking about model collapse with Ilia Shumailov. We're dropping the interview with Ilia next week, actually. But in a way, it reminds me of some of these AGI doom type discussions, right? Because when you talk about omniscience, I think that's not a scientific discussion. I think that in the real world, we need to push molecules around.
It's the same thing with software, right? The reason why these Gen AI software coding things are so powerful is because they can actually test the software that they generate. That's really, really important, getting that signal from the real world. So we're very known for our work on reinforcement learning from code execution feedback. I started this work when we were building our first company in the space in 2016. It's kind of a big part of what we've spoken about that Poolside uses.
And this is really the notion that if you have a very large diverse environment, which in our case we do, we have close to a million repositories that are fully containerized with their test suite and many millions or tens of millions of revisions, that we can say, hey, at this commit hash in this repository, I want to change this code and then I want to execute it and see what comes back. If that's either running tests against it or just a compiler or interpreter or even synthetic tests,
And what this allows us to do is it allows us to have a very, very large environment because a million repositories represent huge diversity of types of problems, if that's in cryptography or a web app or a core database kernel. And it allows us to then design tasks for the models to do where they can explore possible solutions and the thinking that leads to those solutions and then learn from when they're right or when they're wrong, or at least
I always try to be careful saying right or wrong, more correct and less wrong, right? Because that's essentially what reinforcement learning is. You're trying to push the model in a direction that the next time around when you're sampling your thoughts and you're sampling your solutions, you're slightly better. And you do that enough times in training, you can start getting to a place where you get very good. But if you do this on a very narrow task,
you get this notion of model collapse, right, or overfitting. You get to a place where it's like, okay, the model can only do this now. And then it's no longer a useful generalized intelligence, right? So coding kind of sits in that spectrum of like it's deterministic enough but has enough diversity that even when you're overfitting a little bit towards it, you're still making it very much a generalized intelligence. You're not making it a task-specific thing that can only play the game of Go.
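To make that loop concrete, here is a minimal sketch of reinforcement learning from code execution feedback as described above: check out a repository at a fixed commit, apply a model-proposed patch, run the test suite in its container, and turn the outcome into a reward. The interfaces (policy.sample, task.checkout, task.apply_patch, policy.update) are illustrative stand-ins, not Poolside's actual internals.

```python
# Minimal sketch of RL from code execution feedback (illustrative interfaces only).
import subprocess
import tempfile

def run_test_suite(repo_dir: str) -> float:
    """Run the repo's tests and map the outcome to a scalar reward."""
    result = subprocess.run(["python", "-m", "pytest", "-q"],
                            cwd=repo_dir, capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0  # could also be fraction of tests passed

def rl_from_execution_feedback(policy, tasks, n_samples=8):
    """For each task (a repo snapshot plus an objective), sample candidate patches,
    execute them, and push the policy toward the ones that pass."""
    for task in tasks:
        trajectories = []
        for _ in range(n_samples):
            thinking, patch = policy.sample(task.prompt, temperature=0.8)  # explore
            workdir = task.checkout(tempfile.mkdtemp())  # repo at a fixed commit hash
            task.apply_patch(workdir, patch)
            reward = run_test_suite(workdir)             # deterministic external feedback
            trajectories.append((thinking, patch, reward))
        policy.update(trajectories)  # e.g. a reward-weighted policy-gradient step
```

The diversity the speaker describes comes from the breadth of repositories and tasks fed into this loop, not from the loop itself.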
I'm fascinated by this idea of a diverse set of possible minds or different intelligences. I think that you can create intelligences through a variety of representations and as many degrees of freedom. You made an interesting comment actually that you're not very visual. You think in this analytical language space, I'm very visual. I can imagine sounds and so on in my mind.
And the way we write software is actually very diverse, you know, like design pattern books and so on. Those are different analogies, different abstractions. We, you know, even Einstein, when he thought about relativity, he was thinking about ripples and waves and so on. And with software,
You know, there's the syntax, there's the actual sort of like, you know, how we write the code, there's the semantics, what it means, and there's the behavior space, how we actually test it. And you're kind of talking to a form of AI where we're learning like a hierarchy of representations and it can kind of flit between them dynamically. It comes down to how we view these models, right? And
And I want to really caveat this: we don't have a good grounding of scientific theory yet, or really robust interpretability, so what I'm about to say is nothing more than my best guesstimate or opinion at this point. But I think probably most people in the space will agree on the following: what's happening in these models is that we're learning extremely high-dimensionality representations.
And some of those representations that we're learning represent the ability to use language. So these are representations that are massively interconnected with everything else. Some of the representations that we're learning are very specific pieces of knowledge.
Right? So we had a bit of conversation about this earlier, but if I take a piece of knowledge, like when FDR was born, that sits in a high-dimensionality space. It probably sits close to other things related to US presidents and such. If you go back to the early days of machine learning, and we think of word2vec and stuff like this, I think those are still useful mental models to have. But back then we were talking about embeddings and representations that represented words or bags of words or knowledge.
Nowadays, what we have with the type of models that we've been able to build is that we have representations of things that are far more generalized and far more useful. The ability to use language, the ability to start doing reasoning.
And I mention this because I think in the first wave of how we've trained these models in the last years of just scaling up next token prediction on more and more data and larger models, we were improving the representations that represented the most overly represented things in the data.
language, knowledge, but we weren't yet able to start really improving the representations of complex reasoning, of multi-step processes, of the things that are required for building complex software, the things that are required for figuring out new scientific breakthroughs and theories.
And now we have access with reinforcement learning to really improve those. But at the end of the day, if we had an infinite amount of data in the world that represented all of our thought process and all of the feedback we got,
then it wouldn't really matter what we use. We could just learn it with next token prediction, right? We wouldn't necessarily need to use reinforcement learning. These are all just ways for us to improve the data and hence improve the intelligence. And we're trying to do that in the most compute efficient manner. So my team gets very tired of me saying this. I always say all the work we do is either improving compute efficiency on training or inference
or it's improving the data and hence the intelligence of the model. And everything you do, you can put in one of those buckets. And I know it's an oversimplification and there's always a little corner case here and there. But when we come up with the crazy new architecture for linear attention, which is something we've put a lot of work into, to me, that's just improving compute efficiency of inference, right? If we figure out like a really amazing way to scale up reinforcement learning and that's actually just improving the data.
And so to me, they're just the two facets of model building. Yeah, I think the economics of models is really important, actually, because even now,
when OpenAI finally released O3, there's a new version of the ARC Challenge that came out and they were spending $2,500 per task, but they could solve it. They could get superhuman performance. So now it's simply a matter of computation, but there's still a Pareto frontier, right? You know, we've got the Gemini model. It's very, very cheap. You can just sample it, you know, maybe 50,000 times and you can still get the answer. So
So then we get to this definition of intelligence and AGI. We've been talking about that a lot. Francois Chollet said it's basically your reasoning or your knowledge acquisition efficiency. So, you know, how quickly do you take new data points and experience and turn them into new skill programs? And you must be wrestling with this Pareto frontier, right? So you're figuring out: what's the appropriate size of model? What's the appropriate architecture?
What's the trade-off between how fat the model is and how much knowledge acquisition and reasoning I do in the situation? I think at the end of the day, all of us at the frontier right now are taking advantage of as much computational resources as we can get. And I think it can't be ignored. Like, I think if you are in the race for frontier model capabilities, your amount of computational resources that you can direct towards training is absolutely critical.
But the wrestling part comes to where do you apply it? Do you apply it to make the model larger in a parameter space? Do you apply it to make massive amounts of synthetic data generation? Do you apply it to the scaling of reinforcement learning and sampling more? And all of this is essentially an equation that has an optimal, right, for every single one of these things.
And the way that we operate, and frankly I would say most frontier labs operate, is that we try to run experimentation in each of these areas and several others to find that optimal. So to give you a bit of a sense, our team in January ran over 4,000 experimental runs.
These happened across architecture, across data ablations, weight mixes, reinforcement learning, number of samples, all of these different variables. And what we're fundamentally trying to do with all of this experimentation is trying to understand what is that optimal balance between those things. And you said something really important. At the end of the day, cost matters.
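As a toy illustration of the kind of sweep described above (all names and values are invented, not Poolside's actual variables), an experiment grid over a few of those axes might look like this:

```python
# Illustrative only: a toy grid over the kinds of variables mentioned above.
from itertools import product

architectures = ["dense-small", "dense-large", "linear-attention"]
data_mixes = [{"code": 0.6, "web": 0.4}, {"code": 0.8, "web": 0.2}]
rl_samples = [4, 16, 64]  # candidate solutions sampled per task during RL

experiments = [
    {"arch": a, "data_mix": d, "rl_samples": k}
    for a, d, k in product(architectures, data_mixes, rl_samples)
]
print(len(experiments), "runs in this toy sweep")  # real frontier sweeps run thousands of these
```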
So the overarching kind of objective function of all of this is the maximum amount of intelligence that you can create within a certain amount of time and budget that you can then serve for a certain price to the end consumer. And the training and inference part are really critical. And so in our domain, because we focus on software development capabilities, we're in a quite valuable domain. It's economically valuable.
I think it is a lot tougher if you're trying to serve both that domain and you're trying to serve the free users who want to write bedtime stories, right? Because they have different economic value associated with them.
From an architectural complexity and customization point of view, the enterprise, they want to design their own architecture. They want to have clear security boundaries. So you've got the software engineers in the finance department, you've got the software engineers in the legal department, and they want to create their own trade-off along those boundaries that we were just talking about. So does that make it more complex, that rather than designing one recipe for everyone, you're doing a lot of bespoke stuff? So I think there's the building of the foundation model
And then there's building of everything around it, right? All of the software that allows you to then deploy it in different environments, if that is workstation, server on-prem, if that's in the private cloud environments, VPC, if that's pure cloud anyone can access. And so we made a decision early on from day zero of the company to say, we are going to do everything possible to become the trusted partners of enterprises. And it came from quite a
kind of simple analytical process. We said, well, where does the majority of economically valuable software development work in the world sit? And it sits in enterprises, right? 70 percent of all dollars in software development gets spent in enterprises. But these are also, like you said, very complex environments with lots of security concerns and boundaries. And so we took, again, a simple point of view and said, well, where do they want us to be? Where are we seeing that we can actually be what the customers ask for?
And what we heard over and over again is bring the model to the data, not the data to the model. And so we decided to build accordingly. And so today we deploy our full stack, model and applications all the way on top in these private environments. That requires a lot of work, a lot of engineering, a lot of work that we do to streamline this so that it can scale. I personally think in the fullness of time, everyone will end up on the cloud. But today, if you tactically look at the global 2000 enterprises,
a lot of them are still very happy that we're able to deploy in environments that other people aren't. So you've gone for a kind of Tesla-esque strategy in the sense that you're controlling the entire stack. You're building, I mean, you're one of not very many companies that have the skills and expertise to build the foundation models, but
I guess the question is that there are so many folks that are just focusing down the stack. So you could add so much value doing the reinforcement learning from code execution feedback, doing the whole kind of, you know, architecture piece. Why did you decide to go the whole hog and build the foundation models as well? I think it starts with where we began this conversation today. How did we start this conversation? We started around what Poolside is looking to achieve.
And the reason we built this company is because we saw a future, that I personally think now is 18 to 36 months away, where human-level intelligence across the vast majority of knowledge work is achieved. And we held that point of view two years ago when we started, and at the time the timelines were not that concrete for us; we would say five to ten years, maybe 15. We knew that the world was going to get to a place where we would be able to replicate our intelligence and even go beyond.
And when that's the point of view you hold, the question is, well, what do you need to be one of the companies who can help bring that into the world? And if you remember, we had our own point of view on how to do so in research and execution. And you don't get to do that unless you build from the ground up. You don't fine-tune your way to AGI. We put that on the website, on that same page we spoke about earlier that was there on Day Zero. We had this list on there that is essentially, it was called
Strong beliefs weakly held in the face of empirical data on our research. And one of the things was you need to build your foundation models from the ground up to be able to get to these things. You can't fine-tune your way to success. We said all data over time becomes synthetic. Reinforcement learning is key to be able to scale up capabilities. So it really came from that: what we wanted to achieve and what we believed mattered from a research perspective. And I think that has really played out in our favor so far.
And I think it will be very unlikely that we see anybody in the world get to human level capabilities by post-training the latest open source model. And I would even question, and I'm a huge open source advocate. My first company in the space, I built fully open source. But I would even question if we are going to have at some point truly open source AGI.
if there's going to be room for that in the world, if we are continuing down a path where the capital required to build this is so incredibly large. How hard is it to build a foundation model? I mean, just to give you a couple of examples, you know, DeepSeek,
What's cool about them is they've made a lot of their training methods and optimizations public. They've got some great papers out there, new sparse attention papers, really, really cool. And I'm guessing that as you go up in scale, it gets harder and harder to train these models, but it's so difficult for people like me to know. I mean, is it just a software engineering challenge? How hard is it?
So I love to break that into kind of two parts. One is actually to comment a little bit on DeepSeek, because I think DeepSeek is a great example of this second generation of companies: us,
xAI and DeepSeek, we've taken a different approach. So DeepSeek, as far as I know right now, is about 200 people, researchers and engineers. They've got, you know, a billion dollars plus of infrastructure there, and they've got two years of incredible work already behind them. And we know it because they've been publishing their work. All of us in the field, they were a known entity. We've been following the papers. And I have a lot of respect for what they've done. Because if you look at that last paper that they put out, the 47-page technical report,
on V3. In this space, we all know if you do all of that work, you get a really good model. And so I will even say to the controversy of some that I don't think they stole data. I don't think they did anything nefarious. I think they actually just did great work. And we've got a two-year track record of research papers to actually follow to see that.
Now, there are questions that we need to have about whether we want capable AI deployed in the world, in the West, where we might not share the same values or the same principles as the CCP has. But that's a completely separate discussion. But the other notion is that process of two years of building, what we have gone through, what xAI has gone through, what DeepSeek has gone through,
It's, yes, of course, the models get larger and the engineering gets more complex and the research, but at the same time, I think it's building up compounded advantages over time. You know, I think talent is so critical in our space. We have this incredible team, and if I look at what we have learned together as we've grown over the last two years, there's an immense amount of value in that.
Now, of course, that needs to go hand in hand with actually improving your data constantly, right? Like every month you can look back and say the data you're training on is better, it's cleaner, it's more representative of what you're trying to achieve. It's improving your actual distributed training stack,
right, either by making it more compute-efficient, which is a big part of it, but also allowing it to work at certain scales that it wouldn't work at before, right? This is all of the work that you've seen on the different types of parallelism that you get as you scale up into larger and larger clusters. There's work that you do that really is specific to how the chips underneath change, right? So we've got
the Hopper series, but now if you look at what's coming out with the GB300s, all of a sudden we have 72 chips with NVL connection. That changes the architecture that you want to run if you're training on that.
So people often take the view of like the architecture comes and then you map it to the hardware. It's actually the opposite. You look at the hardware and you determine what the optimal model architecture is for training and inference. And so you're constantly going back. But over time, these all just become compounded things. Our environment of code execution went from 1,000 repos to 10,000 to 100,000, now close to a million, and it's going to keep growing.
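As a toy illustration of that hardware-first point (the sizes are invented, and real planning also involves pipeline and expert parallelism), you might pick parallelism degrees so the chattiest traffic stays inside the fast NVLink domain:

```python
# Toy sketch: choose tensor/data parallelism degrees from the NVLink domain size.
def plan_parallelism(total_gpus: int, nvlink_domain: int, desired_tp: int):
    """Keep tensor parallelism inside one NVLink domain (e.g. 8 GPUs on an HGX
    Hopper node, 72 on an NVL72 rack); spread the rest as data parallelism."""
    tp = min(desired_tp, nvlink_domain)  # chattiest collectives stay on NVLink
    assert total_gpus % tp == 0
    dp = total_gpus // tp                # remaining replication over the slower fabric
    return tp, dp

print(plan_parallelism(total_gpus=4608, nvlink_domain=8, desired_tp=16))   # Hopper-style node
print(plan_parallelism(total_gpus=4608, nvlink_domain=72, desired_tp=16))  # NVL72-style rack
```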
So some of it's engineering, but some of it is also just the implicit knowledge and experience that you gain in your organization that builds these kinds of moats that others can't compete with overnight. Is it fair to say, though, that every time you 10x your scale, what you already know isn't enough to get you there? You have to sort of, not go back to the drawing board, but spend a lot of capital trying a whole bunch of different things. I think scale needs a little definition here.
Because a lot of people have talked about scale as just scaling up model size, right? I mean, you mentioned GPT-4.5, which I think OpenAI says is 10x more compute, which probably means maybe five times larger and some multiple of the data, or whatever that combination is. Until recently, the world assumed there were two scaling axes:
Size of model, size of data. I think reinforcement learning is truly a third scaling axis right now. And of course, that still is a proxy for data. Don't get me wrong, but I think it's important to call out as a separate scaling axis because this changes the dynamic in terms of what skill means. Maybe you are not taking the model to, you know,
10 trillion parameters. I'm throwing out a number here. You're not taking the model that large anymore because you're finding that you can scale more efficiently by scaling up the reinforcement learning side. A good example of what we saw in the last couple of years was the Llama models, where you saw at some point they went from 2 trillion tokens to 15 trillion tokens. By the way, all of us in the space have been doing this for some time. And that would be referred to at the time as overtraining,
because it was not Chinchilla-optimal. But Chinchilla-optimal never took into account that these models actually have an inference cost. So there are places where, in theory, yes, a certain model size with a certain amount of data with a certain amount of reinforcement learning is the theoretically compute-optimal way of training a model. But actually, if you have a constraint of running this in the real world, where I need to serve this to a customer and it can cost me only so many dollars per million tokens to serve, or a number of requests,
then that constraint changes where I might spend my scale. And the reason I mention this is that I might say, okay, I might train it for much longer, but it doesn't actually necessarily introduce a lot more engineering complexity.
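To make the earlier point about Chinchilla-optimality versus inference cost concrete, here is a hedged back-of-envelope calculation using the standard approximations (training FLOPs ~ 6·N·D, inference FLOPs ~ 2·N per token, Chinchilla-optimal D ~ 20·N); the specific model sizes are illustrative, not anyone's actual training run.

```python
# Back-of-envelope: same training budget, Chinchilla-optimal vs over-trained model.
def training_flops(params, tokens):
    return 6 * params * tokens

def inference_flops(params, tokens):
    return 2 * params * tokens

budget = training_flops(70e9, 20 * 70e9)  # what a Chinchilla-optimal 70B run would spend

options = {
    "70B Chinchilla-optimal": (70e9, 20 * 70e9),
    "8B over-trained":        (8e9, budget / (6 * 8e9)),  # same budget, far more tokens
}

served = 1e12  # suppose we later serve a trillion tokens of inference
for name, (n, d) in options.items():
    print(f"{name}: {d/1e12:.1f}T training tokens, "
          f"{inference_flops(n, served):.1e} FLOPs to serve {served:.0e} tokens")
```

On the same training budget, the smaller over-trained model is far cheaper to serve, which is why "Chinchilla-optimal" stops being the relevant optimum once inference cost is a constraint.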
But if I'm trying to go 10x larger, then it introduces engineering complexity. But it also depends on what hardware I do it on. If I'm scaling up like Elon on, I think it's six 32K, you know, Hopper, H100, H200 clusters that he has interlinked together, then all of a sudden that's a very different way of scaling than if I take the equivalent FLOPs on 100K, you know, GB200s or 300s that are coming out.
So don't get me wrong, there's always new engineering. To us, it doesn't really feel like 10x from one to another. But there are major
changes that we make as we find more axes of scaling. Where I think we've been very well positioned is that two years ago, we already started the company deeply around large language modeling and reinforcement learning. So this wasn't a new thing for us. We've been building up incrementally on that. I think other companies all of a sudden had to bring this out from almost nowhere. And I'm pretty sure that felt like a 10x engineering project.
Can you help the audience understand a little bit more about thinking? So, you know, R1 came out, and, you know, it's the same with Sonnet 3.7 and O3 and Gemini Flash Thinking and so on. And the end-user perspective is you see these thinking tokens
And the language model is kind of doing a form of self-prompt augmentation. So there's, you know, there's like the tier zero of chain of thought and scratch pad. And now the things are prompting themselves. And one way of thinking about it is, you know, we train them with reinforcement learning and it's
It's kind of like it's imbuing this process in. But the interesting thing is that you can just take 100,000 thinking traces, you can fine-tune just a normal base language model, and then just through sheer dint of interpolation, you get a lot of the performance. So you can buy performance. There's a kind of sigmoid relationship with your compute. But what's really going on there? Is there something special about the reinforcement learning in and of itself, or should we just think about it as a form of data augmentation?
I have opinions here. Some of them I think are already backed up with papers that are out there and things. Others, not yet. And we don't publish. I'm saying this because I want to reference things that are out there publicly to back up some of the argumentation. Look, at the end of the day, we are updating a model
based on a loss or some function that we are applying there. And so, yes, in the truest of definition, all of it is just data. All of it is just data. Like I said, if you had infinite data, you had infinite reasoning traces for everything else, you could learn it through next token prediction and it would probably be an incredible model and reach human level capabilities. So,
But it is very clear that what you are doing today by taking 100,000 reasoning traces versus the equivalent compute spend on reinforcement learning to get there, that reinforcement learning outperforms the SFT side.
And I think that it's just because there is more signal in terms of what you can provide in terms of reward than you can just by providing only like the sample of data. This is, I think, it's a trade-off, right? And again, it always comes down to like the data and the compute efficiency. And so I think if you had an extremely large amount of data there,
There is a path there. What I think is happening right now in the world is this idea that, okay, we're seeing massive generalization from 100,000 reasoning examples. I don't buy it. I don't think that's actually what I'm seeing in the models. I think even current benchmarks are able to show that. But quite often when you look at these things, you see, like, oh, it went up on this math benchmark a lot.
And then the reasoning, you know, traces are all very specific and quite closely linked to like what's happening in that benchmark. So at some point, yeah, you might grok something a little bit more generalized, but you need a lot of data for it. So I think that the scalable ways of improving models are more around...
applying reinforcement learning where possible. That being said, there might be compute trade-offs where at some point you say, oh, I want to use some of that SFT data to bootstrap versus having it, you know, learn from scratch. I think there are places that can be done there.
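To make the contrast concrete, here is a schematic sketch of the two objectives being compared; `model` is a hypothetical autoregressive LM exposing log_prob / sample / step, and `verify` is an external checker (run the tests, check the final answer). This is a simplified REINFORCE-with-baseline style update, not any lab's actual recipe.

```python
# Schematic contrast: imitation on reasoning traces vs RL on a verifiable reward.
def sft_on_traces(model, traces):
    """Imitation: maximize likelihood of someone else's reasoning traces."""
    for prompt, trace in traces:
        loss = -model.log_prob(trace, given=prompt)
        model.step(loss)

def rl_on_verifiable_reward(model, prompts, verify, k=8):
    """Trial and error: sample your own traces, score them with an external check,
    and reweight by (reward - baseline)."""
    for prompt in prompts:
        samples = [model.sample(prompt, temperature=1.0) for _ in range(k)]
        rewards = [verify(prompt, s) for s in samples]  # e.g. 1.0 if it checks out, else 0.0
        baseline = sum(rewards) / len(rewards)
        loss = -sum((r - baseline) * model.log_prob(s, given=prompt)
                    for s, r in zip(samples, rewards)) / k
        model.step(loss)
```

The extra signal the speaker points to is the reward on the model's own samples, which imitation on a fixed set of traces never sees.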
You mentioned R1. The most exciting thing that I think was finally published in our space, because none of us in the West really publish anymore, was the Zero work. And the fact that you could see a model develop its thinking capabilities in coherent language without having been provided sufficient samples of what thinking looks like.
This should be the thing that blows all of our minds. It shouldn't be the headline around the $5 million. The headline should have been models are able to develop human-like thought in language that is leading to better outcomes in things that are objectively measurable, like math and coding capabilities.
That's the exciting part without having actually been like aligned towards it. Yeah, that paper blew my mind. I'm not sure if the zero was, it's not like AlphaGo Zero where it's, you know, there's no human seeding. I think there was still a little bit of human seeding, but it was mostly self-play. But it was incredible, right? Because it learned these emergent behaviors, like it would say wait, and it would do stop, and it would do reflection, it would do reasoning. And many of these seemed almost like there is a natural way of doing reasoning. It was very like,
human-aligned. Well, I think, look, we can't forget that the basis of training data is still the internet, right? It's still the web. And so when people go and say, oh, I just want to see it perfectly learn from nothing, it's like, well, you want all of evolution to happen overnight. We're teaching these models in our own image, on our own data and what we have. Otherwise, frankly, they also wouldn't be very useful. We want them to act in our environments. But the fact is that if you take a pre-trained, you know, base model and you see the difference between that pre-trained base model versus one that's actually had, you know, reinforcement learning applied for developing its thinking, you can see that the capability of those thoughts, like you said, the self-reflection, all of these things, is improved, right? And I think we often use the word emergence, but I think it's a spectrum and we see things, you know, improve. And all of a sudden we now have a lever that we can pull,
and we've had it for some time, where we can improve the thinking capabilities of these models. And by improving the thinking capabilities of the models, to our point earlier, there's a smaller space of solutions you have to explore to get to the correct thing. And the more we can make that space more and more correct in the areas we care about, mathematics, software development, scientific theory, all of these kinds of areas, the more useful and valuable these models become. And by the way, humans are exactly the same way,
right? Like I have this massive, you know, set of learning behind me that makes me useful in a software development environment. But if tomorrow you drop me into quantum physics, I'm probably not that useful. Yeah. Even then though there's this kind of analogical relation between flexible forms of thinking. And I guess I use the word emergence because it's just surprising. It's a surprising arrival of a capability which gives a significant uplift.
And you see interesting dynamics as well. So you SFT these thinking traces, because I kind of think of the thinking traces as a form of flexible thinking that gives you more degrees of freedom to kind of operate in a particular situation of intelligence.
And also there's an interesting relationship between how fat and how thin the base model is. So with O3, it's a very thin model, and they found that it was easier to scale, you know, compute at inference time, but they hit the edge of the sigmoid faster. Whereas Sonnet 3.7 is a much fatter base and it's harder to scale, but they've actually got more headroom left, you know, if they continue to scale. And there's also an interesting thing that when you fine-tune the models,
they fine-tuned like a Llama 1-billion model. And because the Llama model was kind of so thin, it's almost like the thinking traces couldn't take root. You needed to have like a base level of intelligence in the model that you're fine-tuning onto in order for them to work. - What is thinking in this context, right? Thinking is being able to explore the possible space of a solution, right? Because right now,
the reason we call these reasoning models, and often not thinking models, is that the reasoning really is to some extent objective-oriented. It needs to be able to achieve something to actually be reasoning, and then follow a certain step process.
And the more complex the objective, right, it's still very clear you need a better understanding of the world, you need better understanding and, you know, manipulation of language. And so thought in that sense is still something that I think is constrained by the model size, absolutely, right? And it always was. Like even before reasoning models, it already was. But now we have an axis to use what's already there and really improve it. Another way of thinking about this is if I sampled a model
a thousand times with enough diversity introduced, either through temperature or prompt or whatever I was using, and I could find the correct solution in there, like the correct thinking lead to correct solution, it means that somewhere there in the model it's already there. We just haven't found the perfect latent space for it. The best is, of course, if I just only do this with temperature, right? If I start allowing for more creativity of different sets of probabilities.
Because what you fundamentally want is to be able to reward the model in such a way that for things that have deterministic correctness, at temperature zero it will get it. And frankly, even hopefully at a temperature 0.7, it will still get there. It will kind of expand open its cone of possible options and then it will collapse as it gets closer to the solution to the correct one.
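A toy way to phrase that "it's already in there somewhere" intuition, where generate and is_correct are hypothetical stand-ins for a model call and a verifier:

```python
# Toy pass@k probe: sample k diverse completions and check whether any passes a verifier.
def pass_at_k(generate, is_correct, prompt, k=1000, temperature=0.8) -> bool:
    """True if at least one of k samples solves the task."""
    return any(is_correct(generate(prompt, temperature=temperature)) for _ in range(k))

# Training against a verifiable reward then tries to move probability mass so that the
# same solution is also found greedily, i.e. at temperature 0.0 (and ideally at 0.7 too).
```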
And when you take a very small model, you realize that that opening cone is very wide, but you can't actually collapse into the right solution, right, for many things. But if it is somewhere there already in that cone, then you can try to push it towards it, right? And that's where, again,
name of your podcast, right? Like the machine learning side still kind of comes true. You can still overfit a small model to a task, but you can't get it, you know, to actually generalize enough. And so I often think about this as compression. Models are just a massive amount of data compressed into a certain space. If the compression is too small, you lose way too much. And if the compression becomes lossless, like so large compared to the data, then you're not really learning anything. So I also don't hold the view that there's this, like,
infinitely large model world. Like at some point, it's going to make more sense to say, for this amount of compute and this amount of data that we're applying to teaching a model, this is the optimal size. And then we're going to want to maybe parallelize them horizontally to try to achieve objectives. The Bitter Lesson, right: learning and search.
I interviewed the winners of the ARC Challenge and they said that language models, we sample them greedily, which means we take the next token, the next token, and natural language is kind of messy. So there are many, many degrees of freedom. And they were kind of speaking to the fundamental trade-off between creative thinking and reasoning. So with the ARC Challenge, there is just one solution. Well, not one, but there's a very sparse space of solutions.
And they actually came up with a depth first search kind of sampling type strategy. And there were some folks at DeepMind I spoke to and they were saying, you know, when you do reasoning, you actually want the softmax to be very precise because you want the thing, right? You want the thing. But sometimes you actually want to have like creativity and diversity. So how do you square that circle? How do you have both? Well, I think there are problems where you need the creativity and diversity to get to that determinist, like to that final solution.
I think that's the notion of it. If there is a single algorithm that you have to run over and over again, like addition or multiplication or something, then it's a very narrow cone of things that you want.
But if you're trying to find, as you mentioned earlier, Einstein's general relativity: for Einstein to discover general relativity at that moment in time, with the axioms of truth he had around him, it required quite a wide cone of diversity and creativity. To be able to explore different ideas and then, of course, make sure they're consistent with what's already true in the world. To make sure to know which ones are correct and which ones are wrong.
But if you would have kept a very small diversity, you would have probably never discovered it. The quote of it's a fine line between genius and madness, I think kind of applies to models in this case as well. But at the end of the day,
If you have infinite creativity, drag the temperature up of a model, it will collapse into garbage. It won't be something that makes sense. So it's our job when we train models and build them that for the things we care about, the things we define as valuable intelligence, that we make that trade-off perfectly. Because that's actually what we're doing. When we're using reinforcement learning to teach these models,
We are doing exactly that. We're saying, "Hey, when you're sampling in this part of the cone of possible thinking traces, you're directionally correct. When you're sampling in this part, you're entirely wrong." And so if you start going down the path of trying to apply calculus to a problem that doesn't require calculus, you want that entire set of thinking traces that stem from that to no longer be something a model does when it encounters a similar situation in the future.
And so that, I think, is the trade-off we're constantly making. And I think human intelligence is exactly the same thing. You see this if you take someone who is very new in a field, very young. That's why I also think the most interesting scientific breakthroughs come from people in their 20s, you know, because there's not a constrained cone yet. They're willing to explore lots of different ideas. It's why you see people who are great at lots of different fields, like a Feynman, being able to come up with very creative ideas. But also they might at times...
you know, go way too far off and not get to the right objective. So I think intelligence is always that trade-off. I don't think we'll ever get it perfectly right, but we can keep getting it more and more efficient. And over time, we might be willing to say, hey, for known-compute-budget work, which I think is the vast majority of knowledge work in the real world, accountant or software developer, we kind of know the intelligence budget. That's where you want that cone to be perfectly narrow so that it can, you know, maximize for the economically viable work.
But when we come to unknown compute budget challenges in the world, solving cancer, the next breakthrough in material sciences, here we might say, you know what, I want that exploration much wider and I'm willing to pay money for it to be much wider so that we can explore more ideas.
It's coming back to AlphaGo. It's how many moves do you want to explore next? How much depth and breadth do you want to have? Yeah, I love that. I mean, the space of intelligence is very gnarly. And we work as a collective intelligence. There's a great book by Kenneth Stanley called Why Greatness Cannot Be Planned. And he basically said that
monotonic objective optimization is the dumbest thing you could possibly do. So, you know, what we actually do is through serendipity and what's interesting, our nose for what's interesting, we collect diverse stepping stones and many of them lead to greatness. But of course, in an LLM context, it's about just sampling and actually taking in those diverse perspectives.
But I want to talk about software a little bit. So Poolside, your company, your product is intelligence, but initially you've been very focused on software engineering in particular. I can speak to just my own personal experience that Gen AI software has revolutionized the way I write code. I'm now writing software in, let's say, a month or two that would have taken me years to do before. It's absolutely incredible. But what is your main objective here and how do you see the sort of the software engineering trends changing?
The main objective is to achieve human-level capabilities and go beyond. And that means that in a world today where we have maybe 100-plus million people who are building software, we want to bring that from 100 million plus people to anyone who is able to build software, and for the core set of people who today are at the frontier of what they're able to do in software to be 10x or 100x more productive.
And I think this just comes because what is software? Software is a lever that we have on the world to be able to bring productivity, right? It's a lever to abundance. It's a lever to reducing the cost of things. And so to me, jumping really hard on the end of that lever, right, putting the biggest weight on it is putting the biggest, most capable intelligence on it because that allows all of us together with AI to pull that lever and to drive the cost of things down.
And that's always been the thinking behind this. Now, I think it's important to not just talk about the future, because otherwise you get the founder of a frontier AI company just saying AGI, AGI, AGI. And I'm sure you've had plenty of those podcasts already. I think it's important to also know what can you do today. And today is, in the intersection between model capabilities and limitations,
a world that's developer-led, AI-assisted, right? So there are the AI capabilities and limitations and the human, and how do you find the perfect intersection by adding product on top of that, right? How do you create a product that gives people the maximum personal lever for their own productivity, so that, like you said, what before would have taken you years, you can now do in months?
And that has a lot to do with the model, but it also has a lot to do with the user experience. It has a lot to do with how you can bring the right context to the model. How do you make it easier for it to find the information to be able to give you the correct answers? Some of that is external, from the web; some of that is within your code base; some of that's within your knowledge bases.
And so building really powerful assistants, today we do that in editors, we do that in web, we have soon CLI coming, I think is really critical. But this is like a symbiotic relationship. Like as your model gets better, you can do more on the product. The product form factor changes constantly as the model gets better. You've already seen that it went from code completion to chat to now increasingly more agentic and I think in the future increasingly autonomous.
And so you just have to kind of be on that frontier constantly playing with all of those things. So one of the things I've noticed, and again, the reason why I'm so excited in particular, I think it's really good for founders like me. I've got a very small team and I can just rapidly iterate.
I guess my process of coding has become more like I'm a reviewer, right? So I get the language model to generate a bunch of code. I do a bunch of tests and increasingly rather than writing the code, I'm just reviewing it and I'm saying that looks good to me, looks good to me. Sometimes I'm going backwards, I'm going forwards. And
The thing that I want to understand is how does it work when you have teams of people? Because the way we write software is we have a mental model. We create these abstractions and we have some idea of how the software should be crafted and we share that with our friends. And now we're generating code almost quicker than our ability to review it.
So how does that scale in teams? In the world today that's developer-led, AI-assisted, the question that always comes up is, you know, what requires essentially knowledge sharing? What requires review and what doesn't?
When you make a one-line change to the documentation, it doesn't require review. When you make a massive refactoring of the entire code base that impacts every developer, you're going to want to knowledge-share that with the entire team and maybe get input. And I think AI is no different from scaling up your team, right? Or scaling up AI, in this case. So it always sits on that boundary of where is it important to make that knowledge sharing happen and where is it not?
And so code review has often been seen as this process you have to do to catch bugs or to do X, Y, Z. I always think code review has truly been about knowledge sharing first and foremost, and then occasionally, where needed, the ability to get input from other people, because a change touches surface areas that impact others, or that you might not know, or that you weren't the best person on.
And so when all of a sudden you're now producing 10x the code or you're moving a lot faster and it sounds like in a domain you're working in, it already feels like AI is a very valuable partner, almost like an anthropomorphized intern. In other domains, it's not there yet. It should be seen as adding someone to the team. And I think increasingly that's the relationship that we're going to have with AI.
where it is like adding people to our team, except instead of adding humans, we're actually adding AI agents. So one way of thinking about this is, we've got this fairly linear software development lifecycle and we have business analysis and then we do story points and then we write some code and then we do some tests and then
you know, we approve all of those, you know, with release control and so on. And one way of thinking about this is it's about control and it's about aligning the code that we write with our business objectives. That's why we have all of these different gates and approvals and so on. So what does that mean when we have increasing autonomy and when, you know, in the code crafting process itself, we can just do so many things
Does this traditional software engineering lifecycle give us bottlenecks? I think a lot of the software development lifecycle collapses into the models over time. But I think it depends, again, on the environment. If you're writing code for nuclear missiles, you probably want a lot of those steps of that linear process to still exist, even if AI is the one acting in it.
Because you care about a certain number of nines of reliability. In that case, probably, hopefully, not even nines, just 100% reliability, which we all know is impossible in software, right? But we want many nines. So you're willing to invest in that. In other places, software becomes more ephemeral. Maybe you write something that works for a week as a tool or does a set of tasks. And so there the SDLC doesn't really matter. And so those are the two extreme ends of a spectrum, and you're starting to realize about me that I use spectrums a lot; it's just the way my brain works.
Because as AI gets more capable, we're probably going to have a distribution of software that tilts far more, in terms of volume, toward the end that doesn't require those stringent processes. But the world's global banking infrastructure is not going away, right? And there we want a certain set of checks and balances in place. The question just becomes, is it AI acting across all of those checks and balances? And does it at some point become so reliable that I can start removing a lot of them? Because if I have...
A software developer, a human software developer, who never makes a mistake, never writes a bug, whose CI tests have passed 100% of the time for the last five years. At some point I might say, you know what, give the three hours of CI back to the human. Let them just move faster.
Now again, that's a theoretical example because in the real world that's not perfectly like this. But as we get more and more above human level capabilities, at some point we might just say, you know what, that's fine. Maybe even unit tests fall away. Maybe even, you know, like CI, like all of this starts becoming less and less.
So I think it's useful to think about things in terms of the limits, not because we're at the limits tomorrow, but because it allows us to kind of show where we're heading towards. And so I think, yes, in the fullness of time, a lot of the software development lifecycle collapses into the model and doesn't require many of these checks, but not everywhere and not all at once.
So thinking about the dynamics of software engineering in large enterprise, a lot of companies in the FTSE 500, they frankly can't hire really talented software engineers. There was always this problem that they would do low code and no code. They would build things on Microsoft Power Platform and so on. And now we're in the era of Gen-AI coding.
And now almost anyone can write amazing software, almost discardable software applications, to do whatever they want to do. So do you think that we're going to see more folks writing code, and how will that change over time? I have to say, and look, that's me because I spend so much time with enterprises: I think we find incredible software engineers everywhere, first and foremost.
But I hear you in terms of what you're saying, which is that, you know, what you get paid to be a staff engineer at Google is not the same as what you get paid to be a staff engineer at a bank, right? And so there are some distinctions here and there. But overall, I think there are great developers everywhere.
What we're seeing right now is that AI, while it might already be that massive unlock for you in your field or for greenfield projects, still feels like a 20 or 30% productivity gain in most enterprise environments. And for some people it's a 2x productivity gain,
because they're doing unit test automation and all of a sudden that's way faster, and maybe it even becomes 3 or 4x. And in other places people are working with a company-specific programming language in a specific domain where the models are really not that good yet. And so I think it's worth acknowledging that model capabilities mapping over to the real world in enterprises
is not one-size-fits-all. And what I do see is that people are excited about being able to do more. Whether that's an existing developer who can now take the boring parts of their job and actually automate them, or can build a lot more software faster. Or the fact that, yeah, I now see a product manager going, oh, I can actually build this prototype myself and show it to the managers, and then we see if we want to go build it out at a better scale. So I do hold the view that
more and more people will be able to, and will want to, build software. But the want is an important part. Because we're software developers ourselves, we often assume that everybody wants to build software if they're able to. And the reality is not everybody wants to build software, even if tomorrow they can. But the product manager who's always wanted to build his ideas quicker? Absolutely. That maybe one-in-five business person who has an idea they've been trying to rally a team around but couldn't, now can.
But it is not universal that everybody in the world will be building software because it's still something that requires you to want to do it. Yeah, I mean, I guess I agree that...
Clearly the lift is, you know, the nought to one, right? You can now build individual applications just in seconds and it's incredible. But I still think that there's a much bigger lift and the only thing blocking us is a lack of imagination. So for example, the reason why Google engineers are paid so well is they're building scalable distributed systems that millions of people use.
and multi-agent, fault-resilient systems; you can still build systems like that. And I guess the question is, some of this is an education thing. There are good ways and bad ways of using gen-AI code.
A good way of doing it is understanding that there is a complexity ceiling. If you build a monolithic application and you just keep building on top, on top, on top, it's going to collapse. But it's almost like it's teasing you to design modular, almost serverless-type, multi-agent-type systems. And the LLMs can handle that to a much higher complexity ceiling. I think you're right in mapping the current limitations of models
to a really good way of getting around it, which is building small modular things that the models are able to essentially understand easily and work within. And having those kind of architectures where there's some separation of concerns help in that world. But it's the models of today. And I think this is an important part to always come back to that I don't think there is a universal limitation to the type of software that can be built by models.
when we talk about the next three to five years. But you're absolutely right in saying that if today I try to build a massive monolithic application and just have the model vibe-code the hell out of it, at some point the whole thing collapses in on itself. By the way, I don't know about you, but there have been times in my career, early on, where I did exactly the same thing. I would build something and build something and build something. And at some point I'd be like, oh my God, what monster did I write here in code? And then it pushed me to refactor it, make it more modular,
make it better, right? And so we're seeing limitations in the models today, while they're still far from our capabilities, that we can also recognize from our own past at different moments. Yeah, and that's really interesting, because the language models scale with quadratic complexity while software scales exponentially in complexity. So it allows you to build software which is two orders of magnitude more complex than you would otherwise have built, but you still hit the complexity ceiling very, very quickly.
It's an interesting way of looking at it. I haven't given that much thought, but I'll have to do so. The other thing which I think is really interesting is
At the moment, when people do gen-AI coding, they're generating software and they're doing unit tests and so on. But we still have this fairly linear mode of software engineering, which means we have release control, we put stuff into production. Now we're starting to see the advent of MCP servers and so on, which means that during the development process itself, the intelligent system can actually talk to your database. They can say, well, what's the schema on the live database? Well,
talk to my actor system. How many actors are in play at the moment? Do I need to repair this actor? So now, increasingly, there's an operational layer to the software process. I'd love to anthropomorphize this for a second, right? What we do as developers is we'll open up the database console and we'll check the schema. We'll go talk to someone on Slack or in person and we'll gather information, or I'll pull up the documentation. And so I think that some of the protocols that the world is creating are a way to make that easy for the model to do with current model capabilities.
I think again, if we play this out over the next couple of years, I'm not sure if it's a protocol. I'm not sure if it's a computer-use agent that's doing this, or the model just writing code to hit the Jira API, or just directly connecting to the database and executing a SQL command to get the schema. So I think there are things that we build today that matter for the limitations that the models have. And they matter and they're useful.
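As a concrete illustration of that last point, here is a minimal sketch of an agent introspecting a live database schema by executing SQL directly. It uses SQLite and a placeholder file name purely for illustration; it is not a description of Poolside's or MCP's actual implementation.

```python
# Minimal sketch (illustrative only): an agent inspecting a live database
# schema by executing SQL directly, then packing the result into context for
# a model prompt. SQLite is used for simplicity; "app.db" is a placeholder.
import sqlite3

def fetch_schema(db_path: str) -> str:
    """Return the CREATE statements for every table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' AND sql IS NOT NULL"
        ).fetchall()
    finally:
        conn.close()
    return "\n\n".join(sql for _, sql in rows)

if __name__ == "__main__":
    schema = fetch_schema("app.db")  # placeholder path
    # The agent would now include `schema` in the model's context, e.g.:
    prompt = f"Given this live schema:\n{schema}\n\nWrite the migration we discussed."
    print(prompt)
```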
I think the interesting question that stems from this is, at some point, when we talk about a large multi-agent system, just like we talk about a large company with lots of developers collaborating, how do we make that collaboration efficient and work well? Is this a world where we have a thousand instantiations of our model, each essentially acting in an organized collective, like a company? Is that hierarchical in nature, just like we are in our organizations?
Because there are things that we can't do. I can't access the thinking traces and solutions of my 500 peers. They can't be stored in some central database, but that's something agents can do. And so all of a sudden, I think while we're building towards human-level intelligence, there are already things that models today can do that we can't do, by the nature of how we operate.
The parallelization of being able to run through an entire code base and summarize it file by file over a thousand files: I could do it, but it would take me a very long time and not be so efficient. The length of context windows: I don't know about you, but I can't hold a million tokens in my context window and retrieve perfectly across it. At some point, we're going to have to get away from this realm of anthropomorphizing it and start to say, okay, these are certain things that models can do differently. And to me that comes back to these protocols: right now we need them.
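To make that file-by-file parallelism concrete, here is a rough sketch that fans a per-file summarization step out over a repository with a thread pool. The body of `summarize` is a stand-in for a real model call, and the directory path is a placeholder.

```python
# Sketch of fanning a per-file summarization task out over a code base.
# summarize() is a stand-in for whatever model call an agent would actually
# make; the point is the parallelism, which a human reviewer cannot replicate.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def summarize(path: Path) -> str:
    # Placeholder: a real agent would send the file contents to a model here.
    text = path.read_text(errors="ignore")
    return f"{path}: {len(text.splitlines())} lines, starts with {text[:40]!r}"

def summarize_repo(root: str, max_workers: int = 32) -> list[str]:
    files = [p for p in Path(root).rglob("*.py") if p.is_file()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize, files))

if __name__ == "__main__":
    for line in summarize_repo("."):  # placeholder: current directory
        print(line)
```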
I'm not sure if we need those protocols in 18 months. You were speaking to something interesting there, which is that there's a kind of semantics gap in code. I mean, there was always a famous adage that, you know, many companies didn't want to release their code on GitHub because it's so valuable.
It's actually not that valuable because the semantics, the intentionality, the motivation behind the code is not in the code. You know, it's the same thing with language. You know, there's a lot of missing information that is not in the data. But we can capture this information because you've now got an organization of developers. They're building code, right, using these tools. And the entire thought process is in language.
So you could capture that into a kind of semantic database, you can RAG into it, and now you've got all of this meaningful, motivated information, which means the language model won't always be making the same mistake, because it knows the reason why we did this is because of that. There's lots of really cool stuff, right? Like you said, we've never had the ability to trace the thinking of humans and store it.
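A toy sketch of what such a store of "decision traces" could look like: each entry pairs a change with the reasoning behind it, and retrieval here is a naive keyword overlap rather than the embedding-based RAG a real system would use. All names and example traces are invented for illustration.

```python
# Toy sketch of a "decision trace" store: each entry pairs a code change with
# the reasoning behind it. Retrieval is a simple keyword overlap so the idea
# stays visible with stdlib only; a real system would use embeddings and a
# vector store. All example traces are made up.
from dataclasses import dataclass

@dataclass
class Trace:
    change: str      # what was done
    reasoning: str   # why it was done, in natural language

STORE: list[Trace] = [
    Trace("switched retry backoff to jittered exponential",
          "thundering-herd retries overloaded the payments API during the outage"),
    Trace("pinned the ORM to 2.x",
          "3.x changed NULL ordering semantics and broke the nightly reports"),
]

def retrieve(query: str, k: int = 1) -> list[Trace]:
    """Rank stored traces by how many query words they share."""
    q = set(query.lower().split())
    scored = sorted(
        STORE,
        key=lambda t: len(q & set((t.change + " " + t.reasoning).lower().split())),
        reverse=True,
    )
    return scored[:k]

if __name__ == "__main__":
    for t in retrieve("why is the ORM version pinned?"):
        print(f"{t.change} -> because {t.reasoning}")
```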
And as these models are being deployed increasingly more as assistants and future agents and even autonomous agents, we can now all of a sudden access that. The question is, will we access it or will the models access it themselves? And so I always try to be careful with going too many steps into the future because at some point you start collapsing into sci-fi.
But I don't think we are that far out from this specific scenario. There are quite a few people who won't agree with me on this, but I think it's really important that, as long as we can, we should keep models thinking and reasoning in language.
There's incredible work that can be done, probably even more compute-efficiently, with latent-space-style thinking. There was a great Meta paper about latent-space reasoning tokens. There have been some other approaches that I've seen that I like a lot. We saw a language-based diffusion model come out earlier this week. I really like the guys behind it. And so I think there are lots of architectures that will work. Actually,
my team kind of gets tired of me saying that probably every architecture can work in the world. It's just a question of how compute-efficient it is. So I had no doubt the diffusion language models were going to work. It's just, is it the compute-efficient thing for the type of capabilities and tasks that we care about?
Now, the reason I mention language being important is that as models get more capable, having the ability to see the reasoning and thinking traces that they have, and, like you said, reference them from the past to see what led to certain decisions, I think will become massively valuable, both from an interpretability perspective and from the ability to build upon work that was done previously by other agents.
And I think there is a discussion to be had on whether it is going to be useful from a safety and alignment perspective as well. This diffusion-based language model is a great example. I've been very excited about these for years, but they never worked particularly well. So this one that just came out, that you spoke about, had roughly a 10x efficiency advantage. So, you know, with an autoregressive language model, you have to actually generate token by token by token. This thing, they only ran diffusion about five or six times and they got the same results.
Diffusion is really good from a code point of view because, you know, just like with vision diffusion, you can actually edit things. So you can say, okay, I want to hold this code fixed, but I want to edit this bit in the middle. Why is everyone not doing this? There's so many architectures that can work.
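For readers who want to see the editing property being described, here is a purely structural sketch: positions marked as fixed are never touched, and the masked span is filled in over a small, constant number of refinement steps rather than strictly left to right. The "denoiser" is a stub that proposes placeholder tokens, not a real diffusion model.

```python
# Structural sketch of why a masked-diffusion style model is attractive for
# code editing: fixed positions are never touched, and the masked span is
# filled in over a handful of parallel refinement steps instead of strictly
# left-to-right. The "denoiser" is a stub; a real model would predict all
# masked tokens jointly at each step.
import random

MASK = "<mask>"

def denoise_step(tokens: list[str], editable: set[int]) -> list[str]:
    """Stub for one reverse step: propose tokens for some masked slots."""
    return [
        ("candidate" if i in editable and tok == MASK and random.random() < 0.5 else tok)
        for i, tok in enumerate(tokens)
    ]

def infill(tokens: list[str], editable: set[int], steps: int = 6) -> list[str]:
    # Only the editable region is ever masked; everything else stays fixed.
    tokens = [MASK if i in editable else t for i, t in enumerate(tokens)]
    for _ in range(steps):                      # ~5-6 steps, not one per token
        tokens = denoise_step(tokens, editable)
    return [t if t != MASK else "candidate" for t in tokens]

if __name__ == "__main__":
    code = "def total ( xs ) : return <body>".split()
    editable = {i for i, t in enumerate(code) if t == "<body>"}
    print(" ".join(infill(code, editable)))
```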
And I think diffusion is one other field of it. But we've done a lot of work in our industry, and in each frontier company, to make the ones we have really efficient. And so to shift over from one to another, you have to have the efficiency gains, and you have to be willing to take the time that it takes to actually go after a new architecture.
All of the experimentation you've done until that moment, right, does it still hold true in the same way or not? And so, you know, we made a big bet a little over a year ago on linear attention, RNN-inspired attention. And we've had models in production since about fall of last year already with linear attention.
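For context on what that bet refers to, here is a minimal sketch of the generic kernelized linear-attention recurrence (with the common elu+1 feature map) next to quadratic softmax attention. This is the textbook formulation, not a description of Poolside's architecture.

```python
# Minimal sketch of linear attention as an RNN-style recurrence, contrasted
# with quadratic softmax attention. Generic kernelized form with the
# feature map phi(x) = elu(x) + 1; not any specific production architecture.
import numpy as np

def softmax_attention(Q, K, V):
    # O(n^2) in sequence length: every query attends to every key.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # O(n) in sequence length: carry running sums S and z like an RNN state.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    S = np.zeros((Qf.shape[-1], V.shape[-1]))   # running sum of phi(k) v^T
    z = np.zeros(Qf.shape[-1])                  # running sum of phi(k)
    out = np.empty_like(V)
    for t in range(Q.shape[0]):                 # causal: state carries the past
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-6)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
    print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```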
And so you would ask, why is not everybody doing this? It makes so much sense, or it sees gains, et cetera. And part of that is also just because all of us end up investing heavily in a certain area and then we just scale it up further. And making that shift needs to be really, really valuable to do so.
And so I need to dig further into the diffusion language models in terms of how much of that efficiency shows up compared to some of the things that we're doing or others. We're unlikely to go down this path ourselves.
because of different things that we've done in other architectures that we maybe just don't disclose yet. But I think it's exciting. And I think there's a lot of room left for architecture research. I think the diffusion language model is the tip of the iceberg. I think a lot more things can be done. But also, they need to be scaled up. So some of the coolest stuff that we've seen in open source, you know, or elsewhere, comes at this 7-billion-parameter scale, if at all at 7 billion. And then the question becomes, well, how does it
you know, operate at 70 billion? How does it operate if we want to try to make an MoE equivalent of, you know, this diffusion language model? And so at some point, it's like, what works at small scale
very likely can work at large scale, but does it stay the same efficiency? Can you get those same gains? Yeah, that's a very good question, because they built a mid-sized model and it was comparable in performance to the other frontier mid-sized models. The other thing I loved about it is you can do an unbounded amount of computation. So you can actually just continue to do diffusion, and from a test-time computation point of view it's actually very, very flexible. I love that.
But just coming back to the software engineering thing. So I think what we're seeing, though, when we do increasing AI dev, is a lack of autonomy on the part of the humans and a lack of legibility, right? So we're now building increasingly inscrutable software. Now, let's make no bones about this: no one at Google understands the whole codebase, so we shouldn't be hyperbolic about it.
We could, as we were just saying, design an information architecture which mediates a cognitive interface, which means at least at some level of abstraction we understand what the thing's doing. We're putting guardrails in place and so on. But we are describing a future, though, where we're just building these inscrutable monsters. What does that look like? In terms of your first point about abstraction,
of models writing code, I think code is that higher level abstraction already. We can all go in and try to understand code and maybe we add some print statements and we spend time and it takes, look, it takes cognitive load, but code is deterministic. It gets interpreted by a compiler and all of us as humans, if we're willing to invest the time and effort, can understand it.
It's just, is it worth it? Right? So are we okay with building big software code bases that we can't fully understand? And historically we already have been with human intelligence, like you said: to your example of the Google code base, not a single person understands it anymore. And we're totally fine with that, because it does what it's supposed to do. And when we want to actually introspect a part of it, because there was a bug and it went wrong, or because we deeply care about, you know, the recommendation algorithm that X released open source, we can choose to spend time there.
But just like with human-built software, we don't always spend time going back. Look at the amount of legacy codebases that no one has looked at for years, but they run perfectly. So I think that's just a choice. Now, to your point about the model, not the codebase, being the quote-unquote monster, meaning interpretability: there, I think good interpretability work is going to become increasingly important.
And I think there are question marks over whether, at the limit of model capabilities, interpretability will be as useful as it is at, call it, a smaller scale. What I mean by that is I'm not sure if we will ever be able to truly understand the reasoning and thinking process that happens inside a model's neural net, just like we can't in our own biology. And so...
But I'm super excited about some of the interpretability work that Chris Olah's team has released and has been public about. I think that's an incredible direction. I really want to encourage everybody to be doing this type of work and trying to understand what's happening in models. I think that if we keep models reasoning and thinking in language, that adds a layer of interpretability.
It doesn't mean that's exactly what's happening underneath in the model. You could imagine a model that develops a perfectly fine-sounding reasoning and thinking chain, but actually it's trying to optimize for a different objective. And that's where I think safety and alignment and interpretability kind of meet in the middle and really matter.
But I like to think that with good work on interpretability of what's happening at the activation and weight level of the model, combined with keeping models' reasoning and thinking in language and trying to understand if those two things stay consistent,
we can actually do a really good job at alignment and at safety. Yeah, the interesting thing is, because the language models are trained on all the colloquial human code, they produce locally interpretable code. It's a little bit like, you know, language models are actually surprisingly aligned. Like when I'm using Open Interpreter on my CLI, if I tell it to delete all the files on my file system, it will say no. And I've built an LLM
application, and if I tell it to change its name to someone else, it will say, no, actually, you know, it's Iso, this is the name, I'm not going to change the name. But then there's this kind of global illegibility. But there's also the thing that
we're generating code, but also there'll be a mixture of code and models. And some of the inscrutability will just come from this thing is like a living thing. You know, like there was that Black Friday incident where, you know, we're doing high frequency trading and automated trading and it works really well until it doesn't. And then you get this cascade effect and the locus of control is now in the algorithms and the models and the machines. So we could easily build these very complex systems that seem to work
until they don't. So I would take everything you just said, and if we replaced models with humans, it would still hold true. When we talk about this massive, complex financial infrastructure codebase that's been built by all of these humans, there's not a single person who knows it anymore. And our locus of control is not with any one person anymore. And then an unexpected event happens.
I think we often take what can go wrong and put it on the models now, but it's already what the world is like. And so I think the best job that we can do is to make these models more capable than us.
make them highly capable in the ability to write good code that covers the edge cases, that's not lazy to write the tests. That's another thing. Let's talk very honestly about ourselves. There's things we enjoy doing and there's things we don't. How much of the world's code doesn't have proper test coverage? How much of the world's critical code doesn't?
Right now, all of a sudden, if I can decide to deploy dollars to compute, to intelligence, to shore up the test coverage of critical code in the world of financial infrastructure, to shore up the security of, I think it's pretty much commonly accepted that most of the world's critical infrastructure runs on really insecure code.
right, the power grid and others. If now I can say, okay, we're willing to invest, could be from a company, could be from the public sector, a billion dollars in making the code of our electrical power grid more secure, and now I don't have to bring humans together, I can do this with AI, and I know it's not going to be lazy about writing tests or other things. I think that's exciting, right? And we get to focus on the things we want to spend our time on, like exploring the frontier of science together with AI.
We have a shortage in the world of people to write incredible software and code. And that shows in the amount of legacy systems we still haven't upgraded yet. And so I think this is a way of overcoming that. And of course, just like with humans, it adds another area of fault. But in the fullness of time, which one of us would you rather have write code for a critical system? As I often say privately,
I want the nurse to be a human, but I want the surgeon one day to be a robot. Just looking at the evolution of AI software, I'm very excited about meta programming. So we can actually have the system repair itself and generate its own code in response to failures.
But then the next stage of evolution is: why do we need code at all, right? Why don't we just, I mean, it's so-called transduction, where we don't even have an intermediate code step at all. We just get the model to do the thing, right, and the whole thing is adaptive. Are you excited about that? My opinion has changed on this over the years since 2016. I used to be of the view of, like,
Karpathy's quote in his, I think it was his 2016 blog article. Software 2.0. Software 2.0, right? Where it's like at some limit, you know, everything just becomes a model. I think about it a little bit different today. No longer that extreme. I think that there are parts of the world's infrastructure
that we want to have in something that's interpretable. And code is interpretable. It can be traced, can be tested, can be understood. A human or an AI can go in and understand it completely. And so our financial payment infrastructure in the world, we probably want that to be in code. Our electrical grid, we probably want it to be in code. What's operating, I'm seeing the railroad out here, the switches between trains, probably in code.
But do I care about other pieces of software if it's just a neural net simulating the whole thing behind the scenes? Now, I think I also want to come back to what code is. Code is operating in most cases of most software, not everything, on a CPU.
Right? And we have done an incredible job of optimizing the hell out of that evolution of hardware to be able to run deterministic code that, you know, can serve us and be valuable. There are lots of places where, for a long time, the models might not even be compute-efficient enough.
It might just be too expensive to have all of the software in the world collapse into a model call, right? But you're not wrong that if we get to a point of model capabilities that are so good, so trustworthy, that we can treat them as doing deterministic things,
I know that every single time I ask X, it is so aligned that it's going to do Y, because that's essentially what code for a CPU is. At the end of the day, we are asking for something to be truly deterministic. Then maybe, as those cost profiles change, more and more will move to the models. And maybe, look, maybe I'm holding on to an old idea.
But I still like to think that it will be cheaper for a capable model to build and maintain the Uber Eats app and have it run on CPU infrastructure as maintainable code to update it and change it than it will be to simulate the entire thing as a neural net.
And so I think for a long time code in the real world will continue to exist. The amazing thing though about these AI models is that in so many ways they're just smarter than us. So they can basically write or learn functions that we can't write code to do, which is incredible. 100% for NPR Pro. There's so many places where this is going to be incredible. Absolutely. So maybe it is a spectrum.
And I mean, I was having a bit of a galaxy-brain thought. I was driving to pick up Marcus, our creative director, this morning, and Google Maps was taking me this weird way. And I was thinking to myself,
What's it doing? Like, is this thing being a utilitarian? Is it actually like optimizing for the, you know, to reduce the average route time? And it's taking me down a bad way because it doesn't care about me. And I don't know. And in the future, these AI systems, this is kind of what I mean by the loss of control. Like it might be doing some weird galaxy brain thing and it might not necessarily be what's right for me. And maybe I want to know about that.
I think today there's a lot of machine learning systems that already we don't have any good interpretability on, if that's in fraud detection, if that's actually in the algorithm that's mapping you to your location. I'm not sure what's behind the scenes of Google Maps, but it could very well already be a neural net that is very task specific. I wouldn't be surprised, or at least a part of it likely is, that's learned on a lot of data of the London traffic patterns and whatnot.
So I think we've already faced that in machine learning and will continue to face it. There are parts of the world where we want things to be deterministic, in my view, and we're going to want to keep them that way. And there are other parts where you're absolutely right: I want the most optimal route planned by something that understands me, not just the general London traffic. Maybe it's the fact that I prefer to take trains that are less busy over ones that are more busy, and I'm okay arriving at a certain time.
I think as we add intelligence, generalized intelligence, right, human-like intelligence, to the world, we've got a lever that we can keep making more compute-efficient, and that we might want to use. I might want to use, you know, AGI to help me figure out my route, right? Because at some point it's going to be so cheap that it's kind of worth it to do. So I do something today with the models, and it's going to sound very silly. I travel a lot for work.
And every time, before I fly or I land somewhere, I tell it my schedule, I tell it what I ate, and I say, what's my best plan for dealing with my jet lag coming in on this schedule? And then I follow it. It will tell me, you know, a protein meal with few carbs, and sleep for 90 minutes here. And I will just follow the model.
Now look, our model wasn't trained for this specifically, but it's a pretty good generalized model, and so far it's worked pretty well for me. But I use it for that specific reason: all of a sudden intelligence is now cheap enough for me to actually get it to do this. Now, you could write an algorithm, you could write code that takes all of these factors into account and calls the weather API and things like that, or you could just trust the model. And I think these two things will always exist,
but the code is going to be written by the model in the future and it's choosing to do things in code because it's just more efficient and more deterministic or we want it to stay deterministic like payment infrastructure. Do you think we might lose something when we do this? I watched a great YouTube video last night and it was this guy saying in the days of analog recording, you know, we used to have to record things to tape. We could only do one take and there was lots of noise and so on but it created this kind of serendipity that, you know, like
We can't just delete it and start again. We have to do it again. We have to talk about it. And, you know, like even this filming day today, it's a very creative serendipitous process. And do you think by kind of like offloading so much of the intelligent thought process to machines that we might lose something? It is worth asking the question, do we think in the same ways today? Has our own thought process evolved in the same ways today as it did 100 years ago, 500 years ago, 2000 years ago?
And I think there are things that have changed as we have learned more representations of knowledge, of thought, as we started getting augmented by tools. Like, you know, I'm still of the age where I remember a pre-Google and a post-Google world, right? I was young, but I remember it. And now all of a sudden I'm in a world where I'm willing to no longer remember certain facts.
Actually, there are very few specific knowledge or history facts that I still remember. If you and I had been born 150 years ago, remembering the things that we had learned from reading books and other things would be critical to us. And so that's changed. And it's changed because we now have a tool that we can use instead. So I might not need to know the theory behind jet lag and the biological process anymore, because I trust something else to make a decision.
But I think we've always had versions of that, whether that was, in the past, trusting the local wise person on X, Y, or Z, or the fact that we go to a doctor we trust. So we'll change. Will our own thought process evolve? If you give this enough time over evolution, will we become smarter? Will we become less smart? History hasn't shown us,
even though people like to say it in the moment, that we've become dumber because of technology. I actually think it's made us more enlightened. It's given us more, you know, ability to explore lots more ideas, to do a lot more. Technology progress has continued to be exponential, and I think it's exciting that it continues to stay exponential. So maybe I'm an optimist, but I do think there's a fair argument to be made if you're a parent.
And you've got a young child, and you essentially say, "I still want you to study this. I want you to learn these things." Just develop your own thinking. It's kind of the TikTok debate. Do you want to give TikTok to your kid for the entire day and have them spend eight hours a day just on the phone? I'm not sure if that's how you want your children to grow up and develop their intelligence. Would you encourage your kids today to learn how to code? Yes. Go on.
Just like you would encourage your children to learn history, to understand mathematics, to learn how to program and write code. I think you want all of these things because at the end of the day, we still need to train our own intelligence as well.
And so just because we're now getting increasingly capable of training models to reach our level of intelligence one day and even beyond, that doesn't mean that our experience of life should be one without us actually training our own intelligence. And I think coding and building software is a great way of training our intelligence. I today in my job don't get to write a lot of code anymore. It's kind of just the nature of being co-founder and where my role is.
But everything that I learned from having written a lot of code in my life has helped me develop thinking process and understanding that I like to think make me more capable of doing what I do. So I think just developing intelligence in general is a good thing and I think coding is a good tool for that. Can we just speak about the role of multi-modality? So at the moment we're primarily talking about text in this generative AI setting.
I can imagine a future where it seems like low-hanging fruit, where we just record the screen and it can, perhaps in an observational way, see the application. Maybe in the future it can interact with the application, maybe in an agentic way. Is that coming? Oh, 100%.
Like look, multimodality from both visual language modeling to computer use to the work that you see that's happening in video and building world models in the world. All of that is coming. I think the question always comes down to the companies that are working on it need to work on it in terms of what their objectives are. So when I talk about right now building the world's most capable AI for software development and us being on the trajectory towards that, there are things that we care about that are massively represented in text.
characters, language, code, etc. But also things that are massively represented in visual understanding of what's on a screen because applications are on screens, right? And so building the model's understanding to be able to know what's on a screen, how to interact with it, to have computer use-like capabilities so that agents can go and open up the Amazon Web Service Console and click around and find the data that they need, I think is critical.
I think when you focus on software development capabilities, a lot can be done with the text modality. With text alone, you're not going to get the model to create beautiful UIs; that iteration on UI requires visual, you know, modeling and having it as a modality. But where maybe a computer-use agent is useful to go to the Amazon website and log in and find the IAM role, I can also have a model understand how to do that by making an API call.
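As a small illustration of that contrast, the same information a computer-use agent would click through the console to find can be fetched with a single API call. This sketch assumes boto3 with AWS credentials already configured in the environment; the role name is a placeholder.

```python
# Sketch of the point above: instead of a computer-use agent clicking through
# the AWS console, an agent can fetch the same information with a plain API
# call. Requires boto3 and AWS credentials; the role name is a placeholder.
import boto3

def get_role_arn(role_name: str) -> str:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName=role_name)["Role"]
    return role["Arn"]

if __name__ == "__main__":
    print(get_role_arn("my-deploy-role"))  # placeholder role name
```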
And so I think we put ourselves with a bunch of guardrails by focusing on software development capabilities that allow us to focus on modalities that are slightly more compute efficient than video or image generation and hence use our compute more efficiently to getting towards our goal. But if I was trying to build full self-driving, my modality would not be text.
I'm quite excited about a future, you know, there's the Genie paper from DeepMind, for example, and they're talking about being able to generate video, which is interactive in real time. Could we ever in the future have software which is generative? So the user interface is generative because we all think about things differently, right? The user interface for you, the optimal user interface might be different for me. I don't see a reason why that world can't exist. I think the question is, where do we want it to exist?
I want my Uber Eats app to look the same way every time I open it because I build up my own, I train my own model to be able to make sure that, you know, I can find, you know, where the food is that I want to order.
But there are places where I want a UI to be dynamic based on maybe the data that's behind something. I maybe want it to be more tailored towards me. But I venture to say that the vast majority of people who consume software in the world, us as humans, actually don't want our UIs changing massively on a day-to-day basis. We want a set of consistency in our UIs. I think that's just human behavior, but I don't think there's any technical reason why that can't happen. Interesting. Interesting.
What's your relationship with the cloud providers? Publicly, we announced in December that we have what's called a first-party relationship with AWS, Amazon Web Services. And it's quite a unique one. What it allows is that when an enterprise customer is buying Poolside, they can buy it on Amazon's paper, meaning that Amazon is the seller of record.
And so this opens up a path for large, complex enterprises to contract Poolside and bring our product in as if they're adding another Amazon service. So this usually reduces the time that it takes to get started with us.
It also allows them to fully burn down their spend commits; enterprises have large spend commitments with Amazon. And so we have done this for several reasons. One is that Amazon's distribution in enterprises is massive. They have the largest service area of any cloud provider in the world. And if you think about our business, we are in a capabilities race. We've spoken a lot about that today, but we're also in a go-to-market race.
Our ability to land customers and grow revenue allows us to invest massive amounts of capital into more compute and into more talent to be able to scale up our model capabilities. And so these are very symbiotic, and hence the relationship with Amazon, which has been amazingly exciting. We've also done a lot of work with them, not just on the go-to-market side, but on their silicon. We have done a lot of work around Trainium 1 and Trainium 2.
And so we have a fully dedicated team for it, and we're pretty excited about what they're building there. Yeah, it's great for startups as well, because you get a bunch of free credits with Amazon, so presumably you can just use the free credits for your service. I think we're at a scale of compute where credits are not what makes the difference anymore, to be very honest. But for our customers and others, it's absolutely, look, part of it, no doubt. I think when you're...
Today, we're very much focused on enterprises and who we work with. And what we see for them is that where you're deployed makes a big difference for their security profile. So we deploy inside Amazon Web Services VPCs, in their private account, and the model weights and the full stack live there. And the combination of that really allows enterprises to get comfortable with the model going to their data instead of the other way around.
Interesting. And just from a getting-started point of view, so folks want to have dedicated hardware. Do you tend to find that your customers have a centralized model where they have one shared installation, or is it more complex than that? It really depends on the enterprise. A lot of what we find is that inside enterprises there are security boundaries that mean they need multiple instantiations of the model.
So if they're fine-tuning a model on a proprietary SDK that they have internally, but that only covers one business unit, and that model is not allowed to be shared, for regulatory compliance reasons, with another part of the business, you might find organizations that want to spin up many versions of Poolside models
that get fine-tuned in many different environments towards certain use cases. From where we sit, we see all types of complexity, from on-prem to VPC to different model instantiations, different security boundaries, different onion layers of access rights,
And we've built for all of that. We talked a lot about the model today, but the sheer amount of engineering we've had to do to be able to succeed in defense, in government, in financial services has actually been quite a lot. And finally, do you have a forward-deployed engineering team just to help folks get up and running with the infrastructure? Absolutely, yeah. So we have the Poolside solutions architects,
who are fully able and willing to spend a lot of time with our customers. But what we have increasingly done is build towards a managed-install approach.
So if an enterprise is willing to give us temporary access rights, very, very limited access rights, we can spin up the entire infrastructure of Poolside in their account. So what before might take several days of effort can now be done in under 40 minutes by just providing a single IAM role. But with enterprises, there's always going to be something that you discover along the way of a massive deployment:
a firewall that sits on the network somewhere that needs a certain setting tweaked. And that's of course where often our solutions architects are very helpful. But also just in terms of helping them think through, you know, as we're making these models their models, as we're fine tuning them on their data and making them more capable in their environment,
helping them think through what data and which groups of engineers to deploy which versions of models with, and how to measure the impact, because we offer quite a lot of metrics to the customer to see the model's impact: acceptance rates, how many changes are reviewed but not actually applied, how many lines of code are actually applied. And so we do a lot of that work and we try to help our customers think through it.
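For readers curious what the single-IAM-role pattern mentioned earlier typically looks like mechanically, here is a hedged sketch of the standard cross-account flow: the customer grants one limited role, the vendor assumes it for a short-lived session, and provisioning happens with those temporary credentials. The role ARN, session name, and follow-on call are placeholders, not Poolside's actual install flow.

```python
# Sketch of a generic cross-account managed-install pattern: assume a single,
# limited IAM role the customer created, get short-lived credentials, and
# provision from a session built on them. All identifiers are placeholders.
import boto3

def session_for_customer(role_arn: str) -> boto3.Session:
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="managed-install",   # placeholder session name
        DurationSeconds=3600,                # temporary, time-boxed access
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

if __name__ == "__main__":
    sess = session_for_customer("arn:aws:iam::123456789012:role/example-install-role")
    # Provisioning would use clients created from `sess`, e.g.:
    print(sess.client("ec2").describe_vpcs()["Vpcs"])
```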
Our commitment is to become the trusted partner of enterprises, that as intelligence gets more capable, they'll want to scale it up with us. Iso, it's been an absolute honor. Thank you so much for joining us today. Thank you so much. It was a pleasure.