This is episode number 878, our In Case You Missed It in March episode. Welcome back to the Super Data Science Podcast. I'm your host, Jon Krohn. This is an In Case You Missed It episode that highlights the best parts of conversations we had on the show over the past month, in this case the month of March.
Given the enormous positive social media reaction to his episode, my first clip comes from my conversation with Dr. Andrey Burkov, author of the mega-bestselling Hundred-Page Machine Learning Book and Hundred-Page Language Models Book.
In episode number 867, Andrey and I talk about when AGI, artificial general intelligence, an algorithm with all the cognitive abilities of an adult human, might become reality. In your work as a real-world AI system developer, as well as through the books you've written, and recently the huge amount of expertise you've developed in language models to write this book on language models,
you probably have an interesting perspective on AGI and when it could be realized. You just mentioned that we might have it in the future. Do you want to hazard any guesses in terms of a timeline? When I say that we may have it in the future, it's like saying we may have teleportation in the future, and it might work. So yes, it can work, because if we humans are conscious,
then something in nature changed. And I mean changed compared to our predecessors. So this evolved somehow in humans. Because what is the difference, the biggest difference, between humans and the rest of the animals? Humans can plan over an infinite horizon. So some
monkeys, like chimpanzees, the most developed ones,
can use tools. Previously it was considered that only humans can use tools, but now, after decades of research, we know that even some birds can use tools. For example, I think it's crows that take a nut and throw it from a height,
and it falls and cracks. And even when they live in the city, they can wait for a car: they throw the nut, the car rolls over the nut, and the nut is broken. So they use tools.
Some monkeys can even use tools. But most animals use tools only in that specific moment; they will not keep their tools for tomorrow. Some monkeys will, though. For example, you give one monkey a stick,
and only with this stick is she able to get a banana. So she will get the banana, and when she goes to sleep she will put this stick
under her belly, because she knows that tomorrow she will also need a banana. So this means that some animals can plan one day into the future, or two days, but if you remove the bananas for more than three or four days, she will throw away the stick. She will not think that maybe in five days the bananas will be back. But humans
will think, I will still keep this stick, because who knows. And we can plan over many years, even hundreds of years, thousands of years. Today we think about saving the planet. So we think about reducing the consumption of plastic and we think about the
global warming issue. Why do we do it? We will die maybe in the next 60, 70, 80 years; the planet will still be fine. We do it for the next generation, for our kids, for their kids, and so on. So this is what we managed to gain somehow through evolution. So now the question is, how can we get this AGI? Basically, the answer is:
what inside us is different that makes us planners for infinity versus every other living creature on this Earth? If we can answer this question, I think this will probably be the biggest breakthrough, because this is something that our LLMs, or whatever neural network you talk about, don't have.
They don't have the ability to actually plan. They are reactive: you ask a question, it gives you an answer. Even if you call it an agent, it doesn't really have agency. It might act as an agent because in the system prompt you said,
You are an agent and your goal is to provide your users with the best information on a specific topic. But this agency didn't come from the agent itself. It came from you. So you instructed it to be an agent. And because the LLM doesn't really understand what it does, it just generates text.
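Concretely, that instruction is usually nothing more than a system message in an ordinary chat-completion call. Here is a minimal, hypothetical sketch of the setup Andrey is describing; the model name and prompt text are placeholders, not any specific product's configuration:

```python
# Minimal, hypothetical sketch: the "agency" is injected by the developer
# through the system prompt, not generated by the model itself.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        # The agent's "goal" lives here, in text written by the developer.
        {"role": "system",
         "content": "You are an agent. Your goal is to provide users with "
                    "the best information on a specific topic."},
        {"role": "user", "content": "What should I read to learn about LLMs?"},
    ],
)
print(response.choices[0].message.content)
```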
Sometimes this agency will be violated: it will not do what you want it to do, and you cannot really explain why. It's like a black box; it works or it doesn't, and you don't know why. So if we answer this fundamental question, what makes us planners for infinity,
I think that this is where we will get one step closer to AGI. - Yeah, I would suspect that some of the answer lies in our prefrontal cortex and the ratio of prefrontal cortex that humans have relative to other primates that allows us to kind of maintain a loop
through our other sensory cortices over an extended period of time. Which brings me to a point that I've talked about on this show before, which is that it seems to me, and it sounds like it may be the case for you as well, that cracking AGI may require modeling the neuroanatomy of
a brain, of a human brain perhaps, in a more sophisticated way than just scaling up a single kind of architecture like a transformer. That we might need different kinds of modules, so that we have something like a prefrontal cortex that can be doing this kind of infinite-horizon planning that you're describing. And so you'd have different parts
that are connected by large, kind of pre-planned connections, as opposed to just allowing all of the model weights to be learned in a very fine way across the entire cortex, across the entire neural network, in the same way. Yeah, and it's not only that. Well, I simplified it a bit by saying that this is just one thing that makes us different. But another thing
that we also have, and LLMs, for example, don't, is that humans somehow have a feeling for what they know and what they don't know. So for example, I ask you about, I don't know, astronomy, or about the universe, stars or galaxies.
And if it's not your domain, you will tell me, you know, Andrey, I like to talk about these topics, but if for you it's something critical, you probably should talk to a specialist, because I can only tell you that planets spin around stars. This is what I know. But LLMs don't have this mechanism to detect that
what you ask about wasn't part of their training data, or it was, but not at a level of detail
granular enough to have an opinion that's worth sharing. So it will still answer you. For example, I made a test a couple of days ago with this o3-mini from OpenAI. I wanted to see, because
all models, all LLMs, have been trained on web data, and on the web there is a lot of information about my first book. But my third book just came out, so there is really little information about it. And I'm sure that their cutoff was earlier than the book's release, so they should not know anything at all about it. So I asked o3-mini, is my Hundred-Page Language Models Book good?
And what is interesting is that previously you couldn't see this, but currently they show what they call a chain of thought, this internal discussion, before they provide an answer.
And I read this chain of thought and it's funny. It starts by saying, okay, so he asks about this book, but this book looks very different from the previous one. So probably it's some new book. Okay, what do I know about this new book? Not much. Okay, so what do I know about the previous book? Oh, the previous book is XYZ.
So this discussion, and then it starts releasing the final answer, where it just says that, yeah, this new book is very good, it's praised by experts and by readers, and it delivers content in a very good way. And I'm like, where does this come from? It just made up
the recommendation, and it's based on its internal discussion in which it says, "Yeah, but I don't have anything about this book, but given that Burkov has a great reputation, this is what I might say." But it doesn't tell you in the official answer that it's pure speculation.
It answered this just as if it were the real deal. So this is where, you know, the LLM cannot really understand the difference between: I'm sure about this, I am less sure about this, and
I can be totally wrong. So again, if we can solve this, this will be an additional step toward AGI: a model that can reliably self-observe and self-criticize, saying, I would love to help you, but here I feel like I'm in a domain where
I cannot be reliable. And by the way, they do try to fine-tune models to say this.
But it doesn't work this way. So basically, for example, with some models, especially ones released by a Chinese company, they decided to fine-tune their models to say, I don't know this person. Because previously, for example, there is information about you online, so you can ask a model, who is Jon Krohn?
And it might say, well, he's a podcast host, a book writer. But it might also say that you are a Ukrainian footballer, like it does with me.
So to avoid being, you know, ridiculed (people Google themselves, people ask about themselves; they know that some information is online, but it comes out totally made up), they decided that they would fine-tune their models to say, I don't know anything about this person. And they fine-tuned it by giving the names of really famous people,
and saying, for those, answer. And then they give some random names, people who don't exist online or who have a very small footprint, and they say, answer, I don't know. But it's funny, because I ask, who is Andrey Burkov? It says, first time I hear this name, I don't know anything. And then I say, who wrote The Hundred-Page Machine Learning Book? Oh, it's written by Andrey Burkov. Like, you just told me that you don't know. Yeah.
So no, they try, you know, to create some hacks around it, but it's not really training a model to recognize where it can be wrong. It's amazing to hear Andrey's thoughts on this just a few weeks on from that episode going live, and it already feels like major new model releases have brought us closer to AGI. The window of what we might consider old news seems to be narrowing by the day.
So what do these rapid developments mean for humanity, and are we becoming obsolete? On a theme similar to Andrey's contrast of human and machine intelligence, in episode 873 the entrepreneur and digital twin expert Natalie Monbiot discusses what's unique about our species' intelligence and why we can be hopeful about the future. And something that I was thinking about as you were talking about this evolution in terms of
the cognitive involvement from a Google search to using an LLM to output something, and how with the Google search you're at least still actively looking up information. You're maybe comparing a few different resources and saying, okay, four out of the five top results all say the same thing, so that's probably reliable. So you're doing some more critical thinking yourself there. It set my mind off on
this thought experiment: I had this visual of myself as a kid growing up before the internet, when, in order to look things up, I was dependent upon the dictionary, the thesaurus, and the encyclopedia that were in my family home. And it seems like, if the LLM is taking away a lot of cognitive ability, Google is kind of on a spectrum in between
the LLM and me manually looking something up in those physical books, because there are all kinds of interesting things that happen when you physically look something up in a book like that, where you are being exposed to other rich information. Unlike when you do the Google search, where the other kinds of things showing up on your screen, the things above the fold, the things most geared, by human designers as well as by algorithms, toward capturing your attention on the page, are ads.
But when I'm opening up a dictionary, all of that is information rich. And oh, that's an interesting illustration of some kind of tree in Asia, and you just kind of end up reading about that, in ways that you don't even do on purpose.
That experience of discovering this other random piece of knowledge about an Asian tree makes it easier to remember, and gives you this sophisticated web of connections around whatever it was you really were looking up. And meaning, right? It's like you're actually discovering things, and all of those different inputs are creating an insight
and meaning that stays with you. You've just understood something in a different way that was very personal to you and your experience in the world. So what I would say is the equivalent of that in this new era, where we're living with AI, is...
I think we need to use AI to our advantage, and we also need to remain competitive with AI. What is our unique human advantage, and how do we double down on that? So even though earlier in this conversation I said that AI is part of our cognitive evolution, at least that's something that I believe, it's part of us in the same way that language is part of us, but it's not all of us. We don't experience everything in the world through language. We're also these embodied creatures
that were born to be human, living in the world and having a lived experience: being able to notice things and engage with other people, discover different places, make connections in the real world that inform our insight and our understanding of things. And so I think that example of you with the thesaurus, tough word, is actually something that is even more important now.
In the age of Google, it was like, oh goodness, Google totally replaces that behavior. But now we should be doubling down on those behaviors and those experiences, because that is truly what makes us human and makes us competitive with AI. Because yes, AI is multimodal and can see and that sort of thing, but it doesn't really. AI ultimately is not human, and it's not
embodied and embedded in the world. So what do we do in the age of AI? Yes, collaborate with it, solve big problems with it, and free ourselves to be more human. I think ultimately that is what I'm hopeful about and what I feel is very true. We need to be more human, because our future and our existence sort of depend on it.
And we need to tap into humanness in terms of human experience and into human ingenuity to be competitive with AI. AI can't replicate human experience, so this should give us motivation to keep on exploring the skills and opportunities that rely on what distinguishes us from AI rather than what we already know an AI model can do more efficiently.
And as humans, we think and feel and understand differently. And these different perspectives are what make us so useful and what make us unique. In episode number 871, I went to London to chat with the charismatic software engineer, Richmond Alake.
who taught me about the AI stack, how that stack means different things to different people in a company, and how the document database MongoDB does such a great job of simplifying the AI stack for AI engineers. Let's now talk about something that's an extension of this idea of vector databases and AI and MongoDB.
You recently wrote a blog post on AI stacks. And actually, right now at the time of recording, if you Google the term AI stack, your blog post comes up as the number one hit, so I'll have a link to that in the show notes. We talked a little bit earlier in the episode about things like the MEAN stack, which was this idea of back-end all the way through to front-end technologies for the developer.
Is an AI stack somehow related to that kind of thing? Yes. So we said the MEAN stack was, well, we didn't say, but we know that the MEAN stack is a composition of a bunch of tools and libraries to build web applications. So the AI stack is a composition of tools and libraries to build an AI application. One thing I would say is the AI stack
is different in terms of how you visualize it depending on the persona you're talking to. When I'm talking to developers, and you'll see this when you look at the article, there are more layers in the AI stack than when I explain the AI stack to a C-suite or VP-like person. And that's because I feel developers need to really
dive deep into what is making the AI applications of today and understand the composition. But some CEOs and VP-like folks don't need to know the intricate detail; they need to know the high-level information. So just to make the point, most VPs or high-level
execs within companies describe the AI stack as: you have your application layer, you have your tooling layer, and you have your compute layer. Very easy. So application would be any of the products you see today. Cursor, a very popular IDE that is powered by AI, would lie in the application layer.
then in your tooling layer, you have folks like MongoDB or any of the tools that enable the application layer. Then in your compute, you have like NVIDIA. But when I'm talking to developers, I double click into that and we talk about the other layers of the stack. I'm not going to remember everything now, but programming language is very important.
When you're developing this AI stack, this AI application, the language you select is crucial, because not all the libraries that you're going to be using further up the stack are written in all the languages you have available. Some are just Python, or maybe some are just JavaScript, right? There are some that are evolving to have both now. But your programming language is crucial.
Then you have your model provider, you have your data, your database, which would be MongoDB. So there are several layers to that stack when I'm talking to developers, and I tend to dive deep into that.
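To make those developer-level layers concrete, here is a minimal sketch of how they might fit together in a simple retrieval-style app. It is a hypothetical illustration, not MongoDB's official pattern: the connection string, database, collection, index name, and model choices are placeholders, and it assumes an Atlas Vector Search index already exists on the embedding field.

```python
# Hypothetical sketch of the developer-level AI stack:
# programming language (Python), database (MongoDB), and a model provider.
from openai import OpenAI
from pymongo import MongoClient

llm = OpenAI()  # model-provider layer; assumes OPENAI_API_KEY is set
collection = MongoClient("mongodb+srv://<cluster-uri>")["app_db"]["docs"]  # data layer

def answer(question: str) -> str:
    # Embed the question, retrieve similar documents, then ask the LLM.
    q_vec = llm.embeddings.create(model="text-embedding-3-small",
                                  input=question).data[0].embedding
    hits = collection.aggregate([{
        "$vectorSearch": {
            "index": "vector_index",   # placeholder index name
            "path": "embedding",       # field holding stored embeddings
            "queryVector": q_vec,
            "numCandidates": 100,
            "limit": 3,
        }
    }])
    context = "\n".join(doc["text"] for doc in hits)
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user",
                   "content": f"Answer using this context:\n{context}\n\nQ: {question}"}],
    )
    return reply.choices[0].message.content
```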
So let's talk about some of the stuff that you have been working on since then.
One of those things is the tuning playbook. So, you describe yourself as passionate about making neural network development more systematic. Neural networks are the kind of AI technology that would be used to facilitate breast cancer screening, and that also facilitates all the generative AI capabilities we have today: text generation, image, video. All of that happens with artificial neural networks.
Yeah, you have this tuning playbook that you released as part of a team at Google. So tell us about that. Yeah, so a lot of the motivation behind this playbook actually came after the work on this mammography paper.
At the time, and it's still kind of true today, training neural networks can be a very ad hoc process. Some might uncharitably call it alchemical, and it's kind of true. It involves a lot of experimentation, a lot of empiricism, a lot of research to train and deploy a model. And so something I was really interested in and excited about is...
Well, at that point in time, like I just trained a lot of models. I knew a lot of people that have trained a lot of models. And it was like, how can we systematize this process, right? Like the broad research agenda that we were interested in is kind of like,
you could imagine the transition from alchemy to chemistry or something like that. Or it's like, you could imagine, systematization can be very, very helpful for engineering. And so, frankly, on that paper, even though I'm the first author, the other authors know way more than me and everything I learned from that comes from them.
And really, we kind of just got a bunch of our heads together and tried to write down what's worked, what hasn't worked. And we collectively have decades of experience training these models.
And we wanted to provide kind of a systematic approach for thinking about hyperparameter tuning, architecture, just various aspects of model selection. And it's true, this playbook was released, I believe, before ChatGPT came out. But I think that a lot of the things described in that playbook are still very true today, because the intent of the playbook was to be a sort of fundamental look
at how you should think about running hyperparameter sweeps, what sort of plots you should make, how you can be more systematically empirical with questions like: I have this compute budget, these are the constraints of my problem, therefore how can I systematically go through a bunch of steps and reliably reach a good outcome? And then what process should I have to do this over and over again? And sort of
that's kind of what the whole playbook is about. And so it got popular at the time on the internet, and I was pretty excited about it, and we released it as a markdown file. At the time, the standard way of releasing papers or ML artifacts like this was a PDF on arXiv.
But we really wanted to release this as a markdown file with, I think, Creative Commons license or whatever the permissive license is, because we really wanted the community to be able to easily fork it, modify it, come up with their own best practices and kind of give us
pull requests back or whatever, for it to be a sort of collaborative thing. I think we weren't exactly clear. I don't want to overstate it. But it is cool that what ended up happening is that a bunch of folks decided to fork it and I believe crowdsource translations in a bunch of different languages, which are not endorsed by us because I can only speak English. But that was pretty cool. And I think it's still pretty relevant today for people training models.
For sure.
I think it's an invaluable resource, and I'm not the only one. It has 28,000 stars at the time of recording, which is insane. That's amongst the most stars I've ever seen on a project. So yeah, hugely impactful, some amazing contributors on there. And so, thanks to you and the Google Brain team, as well as someone from Harvard University, Christopher Shallue. Yeah, he actually used to be at Brain before he went to Harvard.
Yeah, he's cool. Like I said, even though I'm the first author, the other authors, I really want to shout out George Dahl, Justin Gilmer, Zach Nado, Chris Shallue. They're the real brains behind the outfit. And I was kind of just learning from them and getting everything going. It was a lot of fun working with them, and I'm grateful we were able to get that out there.
I teach an intro to deep learning course. I've been doing it for coming on 10 years now, and five years ago, roughly six years ago, the curriculum that I developed for that introductory deep learning class was published as a book. And something that every class always asks, once I explain that we can add some more layers, or we can double the number of neurons in a layer or in all of the layers, is: okay, but why?
Why are you making those decisions? And up until now, I basically always just said, well, you can either just experiment and find out empirically by trying a bunch of parameters, or you can do some kind of search. The simplest thing is doing a grid search, so just setting up some parameters to search over. But there are also clever Bayesian approaches to homing in on what the ideal parameters could be.
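Just to make that concrete for listeners, the simplest version of that search is only a few lines. Here is a minimal, hypothetical sketch using scikit-learn's GridSearchCV to sweep a small network's learning rate and hidden-layer width on a toy dataset; it illustrates the idea, and is not code from the playbook itself.

```python
# Minimal sketch: grid search over two hyperparameters of a small neural network.
# The dataset and grid values are placeholders chosen purely for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Toy dataset standing in for a real problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Every combination of learning rate and hidden-layer width gets tried.
param_grid = {
    "learning_rate_init": [1e-4, 1e-3, 1e-2],
    "hidden_layer_sizes": [(32,), (64,), (128,)],
}

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation per configuration
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```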
Yeah, so this playbook is about that question pretty much. It tries to take a much more general approach. So it's kind of architecture agnostic in the sense that it won't tell you this is when you should add a new layer versus this is when you should change the width of the layer. But it is about helping practitioners grapple with the question, here are the experiments I have now. What is the experiment I should run next? And
Because the assumption is that if you can set up the base case and a good recurrence relation, you can iterate your way to success, right? And so there's a lot of thinking in the playbook about how should you think about setting up the right initial state for your experimentation? And how should you think about, given the data that I have collected,
what is the next experiment you should do? And I should emphasize, this is meant to be a living document. That's also why it's a markdown file on GitHub. We reserve the right to change our opinions and feedback is very welcome and encouraged. And it's not the final answer. I mean, I won't pretend to be like the arbiter of how everyone should tune their models, but it's just like,
We've been training models for a while. These are our two cents of how one could think about doing it. That's the kind of vibe. Hopefully, it helps people. If it doesn't, please click create issue or something and give us feedback.
Yeah. All right. That's it for today's In Case You Missed It episode. To be sure not to miss any of our exciting upcoming episodes, subscribe to this podcast if you haven't already. But most importantly, I hope you'll just keep on listening. Until next time, keep on rocking it out there. And I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.