Topics
Yi Tay: My initial research focus was model architecture, particularly efficient Transformers. As the field evolved, I gradually shifted toward general-purpose language models and emergent abilities. During my time at Google Brain I worked on projects like UL2 and PaLM, and collaborated closely with researchers like Jason Wei to explore the potential of large language models. My career was not deliberately planned; it evolved naturally from my judgment of the most impactful and promising directions in the field.
swyx: Your shift in research direction seems very natural. Could you elaborate on how you adjusted your research focus during your time at Google Brain?


Chapters
Yi Tay's career progression, from a language researcher focusing on efficient transformers to a co-lead on the PaLM 2 project at Google Brain. The transition reflects the broader shift in the field from single-task fine-tuning to universal foundation models.
  • Transition from language researcher to model architecture researcher at Google.
  • Focus on efficient transformers and exploration of possibilities without attention.
  • Involvement in projects like UL2, PaLM 2, Flan, and generative retrieval.
  • Shift in AI research from single-task fine-tuning to universal foundation models.

Shownotes Transcript


Welcome back, friends. It's only been a week since the World's Fair, and it was incredible gathering the community to see the latest and greatest in AI engineering. You can catch up now on the four live streamed track days on the AI Engineer YouTube, and our team is busy editing the remaining workshops and five other tracks, including the surprisingly popular AI Leadership track.

Thank you all for your support and stay tuned for news about the next event, the 2024 AI Engineer Summit. Last week, we did a very special deep dive with Josh and John of Imbue and Databricks Mosaic on training LLMs and setting up massive GPU clusters. And today, we're pleased to follow that up with a very special conversation with Yi Tay, formerly tech lead of PaLM 2 at Google Brain, and now chief scientist of Reka AI.

Reka's largest model, Reka Core, was, at launch, the fifth-best model in the world, and the only GPT-4-class model not trained by a big lab like OpenAI, Google, Anthropic, or Meta. In fact, while Google Gemini has 950 co-authors, Reka has only 20 employees, with just five people actually working on pre-training.

One year after our RWKV episode, swyx was excited to return to Singapore to delve with Yi into Reka and building a new AI model lab outside of Silicon Valley. Stay tuned to the very end for a special bonus clip from Yi's recent appearance at the Tech in Asia meetup, with his spiciest take on why senior management is overrated and why this is the time to build up senior 10,000x individual contributors. Watch out and take care.

Welcome, Yi Tay, to Latent Space. This is a long time coming, but I'm so excited to have you here. Yeah, thanks for inviting me, and I'm excited to be here. I'm excited about a lot of stuff, yeah. So you are interesting to research and introduce. You are now Chief Scientist of Reka, which is a super interesting model lab. But before that, you were at Google Brain. You were architecture co-lead on PaLM 2. You were inventor of UL2. You're a co-contributor on Flan. You're a member of the Bard core team, and you also did some work on generative retrieval.

That's a very, very illustrious three-year career at Google Brain. Yeah, thanks, thanks, thanks. And then, since then: Reka. You joined in March 2023 and announced a $58 million Series A in June 2023. I don't know if the post-money or the pre-money valuation is public, but Crunchbase says it's $250-something million. So you don't even have to leak it; it's on the internet.

Okay. Reka's stated goals were to work on universal intelligence, including general-purpose, multimodal and multilingual agents, self-improving AI, and model efficiency. In February, you released Reka Flash. In April, you released Reka Core and Edge. And then most recently you released Vibe-Eval. Is that a good summary of the last six years? No, it's not. Four years? Four years. Yeah.

Oh my god. We're talking about AI. Yeah, I was wondering since when did I step into a time machine or something. Yeah. OK, so can we just talk about your transition into-- you did your PhD, and we can talk about your PhD. Transition into brain and research and all that. I saw you do some work on recommender systems. I saw you do some work on quaternions.

What the fuck was that? Let's forget about that. Describe your path into modern LLMs, right? Because you didn't start there. Yeah, okay, sure. I think the world also didn't start there, right? I mean, I think in... So I joined Google in 2019, end of 2019. And the world looked like really different at that time, right? I think that was around the time the first GPT was released by...

GPT-1 or something was released by OpenAI. So research, like ML research and NLP research looked very different at that time. So I was mostly, I identified as like a language researcher. I don't like to use the word NLP. Jason will kill me if I use the word NLP. But like, I was like, okay, a language researcher. But I was more like an architecture kind of researcher. And when I joined Google, I was also, I continued on this, like,

as a model architecture researcher, I worked a lot on efficient transformers. That was your first viral paper. Yeah, yeah. And I worked on Long Range Arena. I spent quite a lot of time looking at, could we do without

attention. Like there was the Synthesizer paper back in 2020. I think that was in my early days at Google. At that point in time, transformer research was mainly like WMT, machine translation, perplexity, and stuff like that. It wasn't really about, you know... I think few-shot learning and few-shot in-context learning came about only when GPT-3 came out and beyond, right? And so I think that at that time, the meta, I would say, the meta looked very different. And at that time, a lot of the work was focused on

fine-tuning things like T5 or BERT or something like that, right? So a lot of the research, not only mine, but around me, and even the broader community, was on those kinds of things. And that, I feel, in hindsight today, is actually pretty useful to think about, because

a lot of people came into AI right after ChatGPT came out, right? So they saw AI as kind of... I think there's a lot of benefit in, you know, understanding how transformers work. I've broken this thing apart so many times, and these things actually help to improve intuition; it's not totally disconnected. I think a lot of things are still relevant today, and

it's just that the scale has gotten much larger. And also the paradigm shifted a little bit, from single-task fine-tuning to a generally-do-everything kind of universal... Foundation models. Foundation models, right. I think it's just a slight change in paradigm. But fundamentally, I don't think the underlying principles of research

have really changed that much, except for compute. Yeah. So basically algorithms stayed put, and then compute and data scaled. So I have some thoughts about this, right? I think back then, a lot of the academic research... people have talked about this, like Sasha Rush and others have talked about this. The conferences were always organized by

applications, right? They were always organized by like, oh, like question answering. It was always like this, right? I think there was, there's like a bit of a transpose going on. Things become universal and then becoming like, okay, there's a data work stream, there's a model architecture work stream and then people work on improving like a universal model and general purpose algorithms to improve this model rather than finding domain specific tricks. I think for

even in 2019, I had already been focusing on work where, you know, you could improve the general architecture. At that time, it was maybe the LSTM in 2017 or something, and then you try it on like 10 different tasks, that kind of thing, right? But a lot of the research community had been focused more on, how do I get that extra 2% on

question answering, or sentiment analysis. I think there was this phase, in 2017, 2018, where this type of work was still very fashionable in academia and conferences, right? And then I think the big thing about the ChatGPT moment of 2022, the thing that changed drastically, is that it was like this shot that made all this work kind of

obsolete. So November 2022, you're saying? Exactly, ChatGPT launch? Because I feel like if you're in the research community, this was coming. Yeah, yeah. That's why I'm saying that at the big labs and stuff, people had already been moving towards general. Even T5 was already general-purpose. And that's the thing, right? But there was a bit of a time where places like Google and Meta, OpenAI,

would be working on things three years ahead of everybody else, and academia would still be working on these task-specific things. Got it, got it. And then I think the forcing function was the ChatGPT moment. It was coming, it was coming; it was just the final, the last straw, and then it was finally like, yeah. Now it's serious. Yeah, now the thing really completely changed. I don't know how it turned from

my background to talking about the meta. I think that you navigate the meta very well, and part of my goal here is to also isolate how you think about the meta, for other people to reflect on, because I think obviously you do it very well. So I'm looking at your papers. Published somewhere around 2021, you had a hard cut to UL2 and PaLM. You did UL2, PaLM, emergent abilities, DSI,

recitation-augmented generation, all in the same year-ish. So was there... did you change teams? Did you have a research focus? When did you become the language model guy? My research became emergent, right? It was very obvious. No, I don't think I'm a person that is super, super great at forecasting a trend two years ahead and then planning for that, right? Yeah. I think I just smoothly

kind of moved as the field moved. I never actually really thought about it this way. At every step, I just optimized for what I found to be most impactful and most

promising, and then that gradually... And also, a lot of it was influenced by talking to people, right? At that time I started working more with, I had some close collaborations with, Jason and other people. I mean, at Google you can work with anybody you want, basically. So partly it's the environment shifting, and I think the environment shifts very quickly, but I was always moving with the environment. I think it's always good to have an open mind and move along with the field rather than, okay, this is my research area, I'm going to get stuck here for two years. I think I just

move along to find things that interest me. And naturally, I think that turned out to be the things that were most impactful at that time. In retrospect, I kind of did well, but I never actually really saw it as intentional. I didn't do anything really intentional, except doing what I find interesting, actually. Cool. Well, we'll just talk about the main work at Google Brain and then we'll move to Reka. So out of UL2, PaLM, and emergent abilities, which of these came first?

Actually, I can't really remember. Okay. We'll make you talk about UL2 then. UL2 and DSI, the Differentiable Search Index, I was working on them in December of 2021. So at Google, there are projects that are big efforts, where a researcher will be part of the effort, and this will be kind of top-down to some extent, right? And then there was also bottom-up research that one could do. I can't speak for Google now, for sure, but at least at that time, right? So UL2 and DSI, the Differentiable Search Index, were works that I kind of tinkered with

in the December break when nobody was around. Okay. With PaLM, also, there's a differentiation, because there's PaLM 1 and there's PaLM 2, right? So on PaLM 2, I was actually a co-lead of one of the workstreams, but on PaLM 1, I was more of a contributor. So now I have to think back, okay, what's the timeline, which came first, right? Oh, yeah. In general, there were three categories of work. One is broader, org-level efforts, and then there are the bottom-up ones. UL2 and DSI were my own

projects. I used the compute that I had and then I just played with it. You accidentally left the job running for a month. Yeah, that was in the paper. It was fun. It was really fun. And then there was a third category, where those were efforts that my good friends were driving and I contributed. So Flan was just one of them. I would like to just maybe say this publicly. You're very publicly... I talk a lot about Flan. You're Flan shill number one. But yeah, the first author is actually Hyung Won, who is great, and then another guy. I was a

core contributor, but I mean, just because I'm a little bit more visible, so I kind of accidentally took a little bit more credit for that. But as in, I was a core contributor, but I was not like...

The lead authors are obvious. Yeah, so the third category was projects that my friends drove, and Emergent Abilities was also like that. Actually, that paper was supposed to be only me and Jason on the paper. And I actually became friends with Jason from that paper, and then that led to this streak of, I don't know, 10 papers or something together with Jason, and now we're super good friends. The ultimate bromance. But that was the Emergence paper. The Emergence paper also belonged to that bottom-up

kind of thing. Fun times. Yeah, it was fun. Maybe I'll pick on PaLM 2... I'll pick on PaLM 2 and Emergence, because I really want to make sure I tell those stories. Those are important stories. PaLM 2, I think, is a career story. You effectively became a co-lead on the second version of a very high-profile, company-wide effort.

How did that happen? I think people would like to know what's the career strategy there. To be clear, I was one of the co-leads, but there were a lot of co-leads, so I don't want to take too much credit for that. But my involvement with PaLM 2 came after UL2 was working well and gaining some visibility within Google. Was UL2 the largest model that Google had released at the time? Yeah, I think so. It was the largest. And it was a personal project? It was a personal project, yeah. Isn't it?

How can it be one person's decision to suddenly release something that

effectively changed the trajectory of Google Brain? How it worked was that, I mean, 20B is not that much larger than the 11B T5. Actually, at that time, there was the 13B mT5, right? So UL2 is an encoder-decoder 20B model. I think when we got it approved, it was released as kind of the big brother of T5, you know, like, okay, we updated T5 with a new objective

and trained this new model at 20B, you know, and it uses the same pre-training dataset and everything, right? So like, from... C4. Yeah, that was the easiest, because there was precedence, right? It was like, okay. But yeah, there were some changes, like the mixture-of-denoisers objective. Yeah, yeah, yeah. So back to PaLM 2, I think my involvement with PaLM 2 came from the work to add UL2 to PaLM 2. And then, I mean, it was-

from the top-down point of view, I mean, the leads were decided in a top-down manner. It's not like, there was not much like fighting or any major things, right? It was like, it was a mixture of like bottom-up, top-down-ish, like half-half situation. And then like from the top, it was like, okay, like these are the people who are the most

visible in contributing to this workstream, and then, okay, how about this guy and this other guy will be in charge of this modeling workstream, something like that, right? So it just happened that way, organically, and yeah, I think that was how I came to be co-leading the modeling workstream of PaLM 2. Yeah. I think in retrospect, you understand now that this is a very valuable experience.

And I think now, today, it would be much more competitive to get the job that you got. Whereas two years ago, you didn't have to try that hard to get it. Or you kind of lucked into it with UL2, and then it just compounded from the initial good decision. I think it's very hard to counterfactually analyze these types of things. I think it's definitely true that there are more people working on generative AI now, and if you are in a big company, it's way harder to navigate

like this type of things, right? I wouldn't say that there were like nobody also wanting to work on this at that time. In fact, there were actually... You were the obvious choice. There were less people. There were definitely less people. But I think, I would say that maybe it's slightly harder now, but like it's also not like it was easy at that time. Yeah. I imagine it's sensitive. But also in my mind, this is now the most valuable on-the-job training in the world. And so people want to know how to get it.

This is what I'm trying to figure out. Actually, individually, we also cannot take somebody else's experience and then try to replicate it, because everybody's circumstances, their initialization point, their situation, is also kind of different. Yeah. This is not only true for LLMs; it's true in general, right? Because a lot of times it's like, oh, okay, you did this in this position, and then because of this... it's very hard to trace all this down to find the causal path. So I think in everything in life, there's some luck involved, I guess. Yeah.

Yeah, there is. Emergent Abilities, very influential paper, subsequently contested by the Mirage paper. Oh yeah, yeah. So before we get to the Mirage, was there a story behind Emergent Abilities? I'm sure it's Jason's thesis or like, just tell more about like the behind the scenes. Like was there a discussion that led to it that- This one was like, this is the idea, the inception of it was like mostly Jason. Okay. Right. I think I helped out to like

you know, shape up the paper a little bit, get some stakeholders involved and stuff. I was discussing quite a bit with Jason, but the idea itself was Jason's. So actually, when the Mirage thing and everything came out... okay, those were hot takes for the sake of hot takes. I believe in emergence. I have to just go on the record and say, I believe in emergence. And I was not feeling very strongly about it, because, I can't speak for Jason, but I would just imagine that he would be maybe personally offended, because I know Jason is a person that takes

feedback very well. He's not offended by harsh feedback, and he rebuts well online as well, right? But I would imagine he would be the one that is most affected by criticisms of emergence. I was believing in it, but I have to say, I mean, that's why he's the first author and I'm second. That was mostly

Jason's thesis. And I have to really say that Jason has really good ideas, and I was more of a support role for that paper. Sure. Yeah. Lots more to discuss there, but you believe in emergence. That's enough for me to work with. I also think that the Mirage paper is mostly... I don't know who, actually, I don't even remember who wrote it. Rylan Schaeffer. Yeah. I covered him on my NeurIPS podcast.

Okay, okay. He's a very good speaker and the paper was well done. It's just that people drew the wrong conclusions from the paper, because he had a very good title. Do you believe in emergence? Of course. Okay, high five. I mean, how can you read any paper, read the progress of LLMs, and not believe in emergence? It's so stupid. Just because you re-parameterize some benchmarks and evals and make it linear doesn't mean emergence is completely gone. And even in the Mirage paper, they acknowledged that there were some metrics that

were true genuine emergence, according to them. I think it was something like 25-ish percent in the ballpark. That's not the exact number. Yeah, yeah, yeah.

So I was like, okay, fine. Some benchmarks you disagree with, but on the whole, there is emergence; now we're just talking about the magnitude. Yeah, yeah, yeah, for sure. I don't think the authors of the paper really had... I mean, we should just assume people don't have bad intentions, right? They were definitely just doing this, but I was more annoyed by the NeurIPS best paper award. I mean, okay, best paper is just a sticker,

take it with a grain of salt, right? But there were people who came to me like, oh, you should care about this because it's the NeurIPS best paper. It's been disproved. Because it's the NeurIPS best paper. I'm like, do best paper awards mean anything? Actually, it doesn't mean anything, right? I think that was more of where my angst was coming from. I don't even remember who the authors of that paper were, right? I'm sure they're doing well for themselves and we don't have to dwell too much on that. Okay, okay. Okay, so a couple more things from Google and then we can go to Reka. Quoc Le was your manager. Yeah, yeah.

I had another manager called Don. I had two managers during my time at Google. So I'm just basically going to ask for quick hits: what did you learn from Quoc? What did you learn from Jason? What did you learn from Hyung Won? Oh, okay. Very interesting. Who they are, what they represent to you, how they advised you and all that. So Quoc, as a manager, was more like a friend, and we would talk a lot about research. I think Quoc is a very researchy person. He's more of an intuition person. I learned a lot from him, but there was nothing concrete; it was more over time, and it was very implicit, a soft kind of thing,

but I think a lot of research sense. We would brainstorm a lot. I quite liked that. There was this U-PaLM paper that didn't get as much attention as I feel it deserves, and I think that was one of the works that I discussed with Quoc quite a bit. And at that time we were releasing the Flan 2 stuff and everything. And then I think Quoc has a lot of good sense about what makes a work a good hit, you know, publicly a good hit, and a lot of research sense about what makes

research cool. So I think he has good intuition as a researcher, and I learned quite a bit from that. And also, I would say that Jason probably also learned quite a bit from Quoc, and this influenced him too; it was not only me getting influenced, but Jason getting influenced, and then Jason influenced me. So I think overall, what I learned from Quoc is probably more of intuition, research

taste. We would chat about AGI sometimes, singularity and stuff like this. It was nice to talk to him as a friend-manager; he's kind of a friend figure to me. He's very much a researcher, more than a corporate manager type.

I totally expect that. It was fun. It was fun. Jason Wei, what do you learn from him? What is your distillation? Okay. Jason is very interesting. So I learned in my career, I learned two or three things, major things from Jason, right? So I think the first thing I learned from him is that, so Jason was actually, okay, I'm going to talk about the more casual, more fun stuff first. Jason was the most spicy on Twitter first before me. There was an era where I was goody two shoes. I only had my main account. My only tweets would be like, new paper alert. You know,

And then Jason started posting hot takes, and I just thought to myself, oh damn. And there were times that I was like, Jason, you should not post this, you're going to get cancelled. And he was fine. He always braved through the storm and everything. Until I looked at him and I'm like, maybe it's not that bad after all

to just be, right? So that was... which is very interesting, because Jason is much younger than me. And the other thing also: our alt accounts, right, we created them around the same time. And the interesting story behind it was that Jason's account and my account have our own identities. It was not like an anime character where nobody knows who it is. We have our identity; it's pseudonymous, right? And then I asked Jason, why do you want to have a

pseudonymous account, like, why don't you just make it fully anonymous, right? And he told me this thing, which was quite true: okay, you can post a take that is spicy and hot, but if you cannot stand by the opinion, then you should not have the opinion in the first place, right? Wow. Right. So I thought that was profound, because there are times where, okay, I post something and it's spicy, and then it gets a little bit bad, and if I kind of agree that, okay, this is bad, then I will retract it. But if I can stand by the opinion, then I will just stand by it, because

that's the point of making it. It should be said. Right, it should be said, because I can put my name behind it, right? So that's part of the first bucket, about how it kind of influenced my online persona a little bit. And then, I mean, it turns out that now AGI Hippo is so much more

spicy than the cola. The cola is just hibernating somewhere. It's not even around, right? So, I mean, Jason also is more constrained because he works for, he has like an actual employer, right? And he has really... Oh my God. The worst thing about Twitter, you know, anytime anyone from OpenAI tweets anything, they're like, did you see this researcher from OpenAI said something? And they read tea leaves that are not there.

And it makes you very cautious to tweet anything. And so it kills the golden goose is what I say. There was one tweet, I mean, at a time when somebody was, people were speculating that you would do chatbots, right? And then Jason just posted something on his main account, something like, excited about new experiments being run. Just a random, and then people screenshot that and post it. Yeah, I hate that. So I think,

now the alt account is mostly personal, like personal stuff. Very personal. I think he stays away from work things. The golden goose has been killed, because people on Twitter cannot control themselves from drawing random conclusions from all these hints and all that.

Yeah, yeah, yeah, yeah. But going to the actual... this is filler, filler. This is filler. This is not canon, it's filler. I think the second thing I learned from Jason is more about, for my own career, the importance of marketing and PR. So Jason is actually super good at that. I mean, with emergence, think about how many blog posts he wrote about emergent abilities and how many talks he's given

about emergence, like, a lot. Just the other day I was at this web conference keynote and he was giving a keynote again about emergent abilities, and it's been two years, right? So I think one big success of his is that he does the work and he thinks a lot about marketing the work itself. In the early parts of my career, the early parts at Google, I was putting out a lot of work, but I didn't put in a lot of effort thinking about how the work was going to be received. I'd just be like, here's a paper, here's a paper, here's a paper, right? But Jason will be like,

I'm going to write this paper and I'm going to market the shit out of it. So I learned a lot about that. Every single first-author paper that Jason has written recently has like 1,000 citations in one year. Oh my God. No, I mean, not every one, but most of the ones that he leads. So his hit rate is very high. His hit rate, his impact density, is very high, right?

It's pretty interesting, but I kind of see him as like a peer and I learned a lot from his, basically, some people are just like talented in different ways. And I think that like, I looked at how he markets his own work and markets himself actually, right? If someone is starting from zero, like no Twitter presence, what is the second best thing to do? You mean as a researcher? For marketing, yeah.

I think there's one obvious thing to do. Say, hypothetically, you're a researcher at a place without visibility, and you have no personal visibility. The first goal is always to try to find a mentor or co-author that is within this circle, and then you start from there, right? Because

then you get people who have visibility and a following to retweet. So you work with them. I learned this... I mean, this was probably a career mistake in my early days: instead of just focusing on, okay, if you do good work, it's more of, okay, I see this visible researcher from DeepMind, right? How can I collaborate with this person and do something that they feel is cool, so I can win their respect, and

you know, they will be willing to co-author with me. Because the exercise itself is about that; you're not trying to please reviewers or anything. If you can find one semi-visible person... they don't even have to be famous, just somebody with a few thousand followers and a good reputation in research. You collaborate with this person, and then when you post the work you are a co-author with this person, and you get the person to vouch for you.

Over time this builds up. It could be from internships, it could be from just DMs. I think people are nicer than they seem; some people seem scary, but if you DM them, they're actually willing to collaborate. Actually, I was scared of you, and when I DMed you, you turned out a lot nicer than I feared. So thank you for being nice. Yeah.

That's really great advice for people. I just want to leave that out there for people. For others who follow the career advice that I give, the title topic of this is: pick up what others put down, and specifically pick up what your mentors put down. Mentors always have more work to do than they personally have time for, the high-visibility mentors. And if you can show that you're a good collaborator with them, they will lift you up accordingly. That's a pretty

good formula for career growth. Should I ask about Hyung Won? I don't know how close you are. We're still good friends. Hyung Won is a great engineer and he's very systematic in the way he thinks. Without going into too much detail, I still spend a lot of time talking to Hyung Won, even after we've both moved to different places, about very interesting,

algorithmic ways to think about life. Very interesting perspectives on life, rather than research. But Hyung Won is a great engineer, and the one thing that scares me about Hyung Won is that he doesn't have multiple monitors. He just uses one small screen and does everything hyper-optimized. This is one of those U-curves: one screen, then many screens, then back to one screen. Yeah, yeah, yeah. So I think Hyung Won scares me because...

I think that was at NeurIPS 2022; we were doing some work in New Orleans, and he'd be coding perfectly fine on this, you know, 13-inch MacBook with one terminal. And he keeps telling us, okay, using the keyboard is more optimal than moving your head, because if you can switch your screen fast enough, it's faster than your head moving to different screens,

and stuff like that. I did not actually distill that, because it's too painful to do. But it's very interesting, in a way, that he belongs to that group of hardcore one-monitor people. Maybe this is a relevant question to just close out the Google side. What do you think makes a good programmer for AI research? You mean a setup, or editor? No, not setup. Lifestyle.

Not even lifestyle; it's more about skills. What should people have? What do you interview for, maybe? What do you see the high performers do differently than the less-high performers? I mean, okay, generally, I think for AI researchers, being a strong IC is probably the thing that I feel is important. I think

there's a certain level of sacrifice to being an AI engineer, an AI researcher, especially if you're training LLMs, because you cannot really be detached from your job. Your job could die on a Saturday at 4am, right? And then there are people who would just leave it dead until Monday morning, but there will be people who will crawl out of bed at 4am to restart the job, or to check

the TensorBoard or something like that, right? A lot of being a successful AI researcher... I don't want to say passion is the entire thing, but it's more of a kind of personality: if there's a bug at 3am on a Saturday night or something, right, then you would be like,

you couldn't go back to sleep unless you fixed it. I'm not saying... this is very unhealthy, by the way. People should not do this for a long time. I think this kind of thing actually allows people to make progress faster, but it's unhealthy, so I'm also not even sure what to say. But it's hard to check out on Friday, Saturday, Sunday and work 9 to 5 if you want to make progress. Or, some people are just so good at detaching: okay, 8pm, I'm not going to work, my job can die and the chips can stay idle for the whole night, but

I want to watch Netflix, right? You cannot. I think there's a level. It's like a sport. You cannot win an Olympic gold if you want to have super ultra good work-life balance, right? Yeah. Passion, intensity, dedication. Yeah, intensity, right? So those are really good personal qualities. Just technical qualities wise, how much of the stack should people know

Okay, so that was the question. No, no, no. But that was important as well. It's just harder to interview for because you really just see it on the job. I think stack is not that... Should I know CUDA kernels? I don't know CUDA kernels. Exactly, right? Okay, good. For all you listening out there, you don't have to feel like an imposter. No, but you need to be willing to learn if you have to, I think.

Well, you haven't had to so far. Yeah, I haven't had to so far, right? So if I sling PyTorch, okay, great. You know, do I need to know distributed systems? What is the stack that you recommend that gets you a well-rounded, end-to-end researcher? I don't think there's any specific thing. In fact, I don't really say, okay, you need to learn JAX, you need to learn this. By the time you finish learning it, there's a new framework out anyway. So it's more of

constantly being able to learn and update. I don't think there's a single stack or a single workflow. Well, that leads us to Reka. What's the founding story? So I met some of my other co-founders while we were collaborating; I was at Brain and they were at DeepMind. I'm not a startup person. I identify, even today,

as a scientist and a researcher more than a startup person, right? My co-founder, Dani, started this.

And then Reka was in the works from late 2022. I finally left in 2023. He kept asking me if he wants to do something, do I want to go with him and do it? And it took a while for me. So I was kind of the last co-founder to join. Was the plan always for you to leave at some point and join him? No, no. He was just convincing you to do it. It was like six months, more than... in fact, I think more than a six-month period. I always had this

at the back of my mind since, what, August. Actually, I didn't want to do it in the first place, but I think eventually, in March, I felt that, okay, it's time for me to experience something new. My leap of faith was more of, I want to experience something new. I've wrapped up this PaLM 2 work at Google, so okay, let me experience this new life and see

where we can go with this. And also, the funny thing was that many, many years ago, before my PhD, I wanted to do a startup, actually, at that point. And then over time I realized that I was better off as a researcher, and I just forgot about the startup thing. And it's quite funny that today I end up doing a bigger startup, right? But even until now,

I actually identify more as a researcher and scientist. Well, I mean, it's not... When you left Brain, you already had a high profile coming out of Brain. You could have gone to any startup out there. They all would have wanted you. Yeah, okay, okay. So why did you choose this one, basically? Is it just because of pre-existing relationships? Because it wasn't obvious to me. A lot of your other co-workers went to OpenAI. Others went to... if you were at FAIR, you went to Mistral, that kind of stuff, right? Reka was not on the map. I think, for me, it was a decision between staying

at Google and co-founding something. It was more the experience of being a co-founder that attracted me, right, and wanting to experience that. I wouldn't have left for Inflection or something like that. I mean, Inflection is gone now, but...

RIP. They're still alive. They're selling themselves as a model foundry or something. I don't know. They're a services company now. Yeah, I know. But I also think that, for example, if I were to join another big lab, it would be a very big-tech experience again, right? I felt like the experience I'd get here is very complementary to the experience I had at Google. But if I were to join something else, then I would rather have stayed at Google, to be honest. Because to me, it was very clear between

these two decisions. I was talking to a bunch of other startups, but I didn't really have the intention to go. I was happy at Google, actually, to be honest. I'm sure. I'm sure they have a lot of things to keep you happy.

I was happy at Google, yeah, actually. So you described yourself as GPU poor, but also you had $60 million to play with. You got a whole bunch of GPUs. I think you disclosed somewhere, but I don't remember the exact number. And you had a good training run for Flash and then Core and Edge. How would you tell that sort of story? Like people can read the technical report, but also, you know, what was that...

overall experience like? And I should also point people to the blog post that you wrote; there were a lot of interesting things that happened along the way. So I think I left around early April, like end of March, April, and everything, right? But most of our compute actually came in December. Yeah. And there were delays.

So H100s, there were major delays, right? So we were sitting around, right? It's because you don't own the compute, you are renting. Yeah, yeah, yeah. So we were sitting around for a long period of time. We had 500 A100s, because we made a commitment, and the H100s were constantly being delayed, I think because of supply, demand, whatever reasons. And it was also very hard to get a lot of compute in one place, right? And then we were locked in and we had to wait for the compute to come.

So I think it was very painful because even when the compute came, it was mostly broken most of the time. And it was broken to a very bad extent that before I left Google, I was like, even the early stage, I was very optimistic about, okay, this compute translates to this amount of flops, this is the model, right? But I never expected the reliability to be so poor that it just threw off all the calculations. And then we had to work 10 times harder just to make the thing go smoothly. So

It was a bearable pain. I think the pain was bearable, but it was just way more than expected. I think you addressed this in your post, but the temptation would have been just to run everything on TPUs, which is the stack that you already know very well, that works very well. No, no, no. So TPUs outside Google and TPUs inside Google are probably very different things. Oh, how come? Okay, firstly, it's the infrastructure. There weren't a lot of good codebases outside Google. And the codebase that I was most familiar with was T5X. It was a JAX-based codebase, and

it would have been, like, by the time we wanted to consider it, it was really, like,

deprecated for nine months, right? And then TPUs, I mean, we weren't sure about... I mean, the availability of TPUs was not great. Oh, my perception is it was a lot better, but people have the learning curve. Yeah, but by the point in time we had our infra set up, we were already training models, and it would have been so much cost to switch to TPUs. So the experience of TPUs inside and outside Google... I have not actually run a single TPU job outside Google, by the way, but just looking through documentation and from what I see outside,

and from how much I think people inside Google don't care about what people outside Google think, I kind of feel like, okay... I don't think we considered it. I mean, not that we'll never consider it, but just at that point of time, it was like... The obvious choice was to stick to PyTorch. Just stick to GPUs and PyTorch and make it work.

I mean, it's not as if the chips we ordered were not there. They were there. They're just not in the best shape. Reliable. Right? Yeah. So I think it was too much work to kind of migrate suddenly to TPUs. Yeah. For those who haven't read the report, you had a very traumatic description about the chaotic and stable phases of various compute providers. And I was just wincing when I read all those things.

Yeah, that was a Three-Body Problem reference, the chaotic and stable phases. I mean, I was watching Three-Body Problem at the time and I thought it was fun. We had a lot of fun adding references and memes into the tech report. I think it goes to show how fun the environment within Reka is, right? We had a lot of fun with this, but

the chaotic and stable phases thing, mostly it's that we actually found that usually when a provider provisions new nodes... Yeah, you don't want to be the first to use them. Yeah, they're usually bad, like dog shit, at the start, right? And then it gets better as you go through the process of returning nodes and draining them, giving them back to them.

They will send them back for repairs and everything, and over time it's more of a numbers game, right? If there's one bad node, it kills the entire job, right? So the name of the game became just eliminating bad nodes from the cluster, right?
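As a rough illustration of what that node-elimination game can look like in practice, here is a minimal, hypothetical burn-in screen in PyTorch. The throughput floor, durations, and the idea of checking every GPU before admitting a node to the training cluster are illustrative assumptions, not Reka's actual tooling.

```python
# Hypothetical burn-in screen for newly provisioned GPU nodes, a sketch of
# "eliminating bad nodes" before they can join (and kill) a training run.
# Assumes PyTorch with CUDA; thresholds and durations are made-up examples.
import time
import torch

def burn_in_gpu(device: torch.device, seconds: float = 30.0, size: int = 8192) -> float:
    """Run a matmul loop on one GPU and return achieved TFLOP/s (0.0 on failure)."""
    try:
        a = torch.randn(size, size, device=device, dtype=torch.bfloat16)
        b = torch.randn(size, size, device=device, dtype=torch.bfloat16)
        torch.cuda.synchronize(device)
        start, flops = time.time(), 0.0
        while time.time() - start < seconds:
            (a @ b).sum().item()          # forces execution and a device sync
            flops += 2 * size ** 3        # approximate FLOPs of one matmul
        return flops / (time.time() - start) / 1e12
    except RuntimeError:                  # ECC errors, Xid faults, etc. surface here
        return 0.0

def screen_node(min_tflops: float = 100.0) -> bool:
    """Return True only if every local GPU sustains the (hypothetical) throughput floor."""
    results = [burn_in_gpu(torch.device(f"cuda:{i}")) for i in range(torch.cuda.device_count())]
    print("per-GPU TFLOP/s:", [round(r, 1) for r in results])
    return len(results) > 0 and all(r >= min_tflops for r in results)

if __name__ == "__main__":
    # In practice you would run this (plus NCCL all-reduce tests) on each new node
    # and drain any node that fails before admitting it to the training cluster.
    print("node OK" if screen_node() else "node should be drained / returned")
```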

And then, maybe because of the supply issues or something, when the deadline comes to ship... for example, I'll just give rough numbers: let's say you order 1,000 H100s, right? Usually they don't meet the demand of 1,000 H100s at the date. They'll give you like 500 first, just not to piss you off, and then they'll give you another 100 every two to three weeks. They're just like, okay, I added four nodes, added eight nodes, that kind of thing. And then over time you reach the capacity you ordered, or maybe you never actually reach the capacity you ordered. And as they add these nodes, sometimes these nodes are bad, and they just kill entire training runs. And the thing

which I feel that... I mean, there are a lot of people trying to sell GPUs now, reselling them, packaging them up, whatever, right? And I think the most important thing...

Obviously there are SLAs and all this in the contract and everything, and obviously you might be entitled to something if something goes wrong, right? The thing with large model training runs is that one bad node kills the entire job, right? So should the compute provider be liable to pay for all the node wastage? No, because it's unlikely, because otherwise... It's unrealistic. Yeah. No one will take that on. No one will take that on, right? So I think that's also a tricky thing. Who is taking the risk? Is the LLM startup taking the risk?

Or is the compute provider taking the risk? I think that, I mean, this is my sense. I'm not 100% sure, but I think as there are more providers trying to sell GPUs, we get all this inbound so much about people trying to sell us GPUs, right? The key differentiator is actually to find a way to

balance the risk of node failure. As long as the provider... I'm not going to say 100%, but if somebody can come and tell me, my nodes are so stable that I can share some cost with you if your job dies, that's a green flag, green flag, right? The moment they start to hedge... I cannot. Do any of the big clouds do that? As far as I know, no. They would have the size to guarantee that. But I think

for anybody who is watching, if you're doing a compute startup or anything, the biggest green flag would be to share the cost of node failures with your customers, right? I mean, the whole run? No, no. It's very hard to do that, because you need software to account for it.

So let's say you run it for 12 hours, right? And it dies after 12 hours, right? You get 12 hours of throughput, right? But then you get like some wastage because of like the downtime and everything, right? You know, I think it would be fair to find some middle ground to kind of split the cost of the failures, right? And this brings back to my point about like work-life balance because if the node fails so badly, right? Like it actually...

basically, right, your engineers cannot sleep at all. You have babysitting rosters and everything, but you are living life with constant anxiety, because even in the case where the node failures are refunded, you still lose time. You lose three hours, you lose everything, right? So I don't know how to get around this, but I think if there are a lot of compute providers fighting over that,

a good thing to do is to figure out this pain point, or at least figure out some hot swapping. But so far, most of the providers that we tried don't have this. They will also get confused when you try to ask them: so my job is dead, can you pay for the full job, can you refund us? At the very least they will get confused, because this is an LLM-specific thing, that with these large node counts

they don't care about, yeah. Yeah, they get confused about this, right? So the current status quo is the LLM startup pays for everything. Maybe you could negotiate some refunds, but usually they will not be so generous. Let's say you run 500 GPUs, right? If you break for four hours, in their mind they will be thinking, I should refund you for one node. But in your mind, you think they should refund you for the full job, right? Everyone who is

from my background is going to be asking this. How is it so fragile? Like, how is it so brittle? Like, what's your frequency of checkpointing? Our checkpointing is kind of like, we see how stable the job is and then we decide, because checkpointing takes, without a good file system, checkpointing takes actually quite long.

So it could be... It's like a few hundred gigs, right? Yeah, I think so. I don't remember offhand, but... That doesn't take that long. No, no. But sometimes if your file system is slow, your file I/O is slow, your checkpointing for a 20B model could be like, what, 30 minutes or something. OK. I don't know this by heart. Sure, sure. But it's not hours.

If you go larger, what if it's like a 200B model, right? Okay. So you should have some kind of ideal checkpointing-to-run ratio that is not catastrophic if you run into a node failure. Yeah. So we see it as MFU, because you can average out your FLOP utilization, and then you can see how many percent of a hit you take, how much slowdown, right? So you probably go for something like taking off 1% of your speed, 2% of your speed. So basically...

it's actually fine to just checkpoint more regularly. So with checkpointing, you also never fully recover from a clean slate. As you optimize and engineer the system to automatically restart everything, you get some of the time back, but you'll never be perfect, so you still lose stuff. And if you checkpoint too often, like every 30 minutes, then your file system is going to blow up.
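As a back-of-the-envelope sketch of the tradeoff being described here, the snippet below adds amortized checkpoint-writing time to the expected recompute lost when a failure lands between checkpoints, and uses the Young-Daly square-root rule as one common heuristic for the interval. All numbers are made up for the example, not Reka's actual figures.

```python
# Back-of-the-envelope sketch of the checkpoint-interval tradeoff.
import math

def expected_overhead_fraction(interval_h: float, ckpt_h: float, mtbf_h: float) -> float:
    """Fraction of wall-clock lost to (a) writing checkpoints and (b) recomputing
    work since the last checkpoint when a failure hits (on average, interval/2)."""
    write_cost = ckpt_h / interval_h          # amortized checkpoint-writing time
    rework_cost = (interval_h / 2) / mtbf_h   # expected lost progress per failure
    return write_cost + rework_cost

ckpt_h = 0.5     # 30 min to write a checkpoint on a slow shared filesystem (illustrative)
mtbf_h = 24.0    # one job-killing node failure per day across the cluster (illustrative)

# Young-Daly approximation for the near-optimal interval: sqrt(2 * C * MTBF)
optimal_h = math.sqrt(2 * ckpt_h * mtbf_h)
for interval in (1.0, 2.0, optimal_h, 8.0, 24.0):
    frac = expected_overhead_fraction(interval, ckpt_h, mtbf_h)
    print(f"checkpoint every {interval:4.1f} h -> ~{frac:5.1%} of compute lost")
# With these numbers the optimum is ~4.9 h at ~20% overhead; with a fast filesystem
# (ckpt_h = 0.05) the same rule says checkpoint every ~1.5 h for ~6% overhead.
```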

If you're going to checkpoint every... For us, we just see it as, storage is cheap compared to compute. No, when your model is very, very large, your storage can easily blow up. Going on to the models, I feel like I digress so much about all these fun side things. You like compute, right? You like hardware and compute.

I know how we're going to compute. And also, I'm an orchestration guy. So one of the questions I'm skipping right now is, I came from Temporal. I'm familiar with Kubernetes. I've used Airflow. These are all the data and cloud engineer type tools. It's surprising to me that you guys don't have your set of orchestration tools that is solved. You wrote in your blog post, you had the pain of multi-cluster setups. And to the rest of us, this is completely solved.

Okay. I don't know if you know that. We use Kubernetes for a bunch of stuff, but I think for experimentation and stuff like this, it's still not fully... We didn't have the time to actually build something that is... It should exist in open source. Someone should have done this. It is what it is, but I'm surprised. That's all. Because it seems like a solvable problem.

And someone should do it. OK, OK, OK. Yeah, yeah, yeah. Good to know. Good to know. OK, so Reka Flash, Core, and Edge: congrats on beating a whole bunch of state-of-the-art models, especially ones much bigger than each of them. People can see the papers for all the other stuff. Was this your expectation from the start, that you would basically definitely be frontier?

From the start, when you haven't trained anything yet and you're about to kick off the runs, are you able to call your shots and say, we will beat GPT-3.5? Nobody can predict the future. No. How much confidence? Okay, we were confident. Like, we were confident. How? Why? Alright, it's a good question. Because it would be a shame to do a whole bunch of work and then end up in the middle of the pack, which is where a lot of people end up. We were confident. I think that a lot of it was like YOLO. I mean, I mentioned it in the report. I think we would

require a lot less iteration than that, because of our prior experience in training these models. So I was confident that our models would turn out to be good. And as to exactly how, I can't really pinpoint a particular reason. I mean, we de-risk stuff. So a large part of it is de-risking: okay, you run a 4B ablation and you can see, okay,

this is my sense: if you run 4B and your loss is going crazy, you know that this is going to be a shit model, right? But we trained enough. Okay, we didn't have a lot of compute to do a lot of ablations, but we did enough experiments to know that, okay, our infrastructure and everything is set up to be good, right? Obviously,

the field moves, right? I won't say that everything was smooth the first time round, but I think we were confident in our ability... we're more confident about the ability to

move with as few steps as possible to the goal, more so than, my model is going to be at this level at this time, you know what I mean? It's more like, for example, let's say we run the first round of human evaluations, right, and we see our number is this, and then we are confident that in five more tries we will get to this

you know, kind of like get to like this. It's more of that kind of confidence rather than actually... It's also a little bit of you see a new leaderboard hypothetically. Like as a researcher, you see a new leaderboard, right? You approach it like a puzzle. You don't know like whether...

at the start of it, you might not have the answer to the puzzle. But if you're good at solving puzzles, generally, right, you know that within one hour you'll be able to solve it. That kind of confidence: it's the ability to hill climb, the ability to improve over arbitrary things, right? I think we were confident more about that rather than... Everything is different, right? The stack is different. The infrastructure is different. The data is also different from what... I mean, we have a lot... Which you haven't talked about, right? Yeah, we have a lot of experience from our prior jobs, but

it's not going to be exactly the same thing, because different companies have different stacks, different everything, right? So it's more about de-risking, being confident in solving the general problem of improving over things. Which is also why I think the team is valuable, in the sense that we are not valued by our model itself, but we are valued for

how we can see a problem and just solve it super quickly, right? And that's what we are confident about, right, more than the artifact itself. Mentioning your team, you said at the largest, your team was three to five people on the pre-training side. Was that the team that you recruited? Was it all your ex-colleagues? How do you find people that would have this kind of

solid intuition. So some of the people on our team, I worked with them at Google as colleagues and stuff, and some of them were fresh hires, like fresh PhDs and everything. Okay. So I do want to comment on the Noam architecture. People have variants of all of these: SwiGLU, GQA, RoPE, RMSNorm, and then obviously the big one is encoder-decoder versus decoder-only. Could you comment on each of those? Were you just like, we're confident that Noam got it right? Or did you actually

do an evaluation of each of your architecture choices. Oh, I mean like, okay, architecture-wise is something that I feel like I'm easily able to like, I've run so many architecture experiments

that I look at architecture and, okay... I don't want to be overly... I think it's very hard to outperform the OG Noam architecture. Why? It can't be. I mean, on the surface of it, we have to have learned something in the last seven years. No. All the changes... SwiGLU is probably one of my favorite papers of all time, just because of the divine benevolence; Noam actually wrote, we owe this success to divine benevolence. It's a meme thing, right?

Okay, so GQA, MQA. Multi-query attention was always a big controversial thing, because with MQA you usually take a hit, just because it's MQA and everything. So people kind of knew that it was a very... A hit in what? A hit in performance? Like hit or miss. You could take a quality hit from MQA alone. MQA was always, you know, a choice, right? It's always, okay, should we use MQA? Should we not use MQA? Right?

When GQ came in, it became like a no-brainer to use GQA because you don't get the hit anymore. And then you just get the fast inference benefits of GQA. So I think GQA- Which is LAMA3 now. Yeah, yeah, yeah. So I think LAMA2 already. I'm not very sure. LAMA2, the 70D. GQA, right? But I mean, the reason why we call it norm architecture because MQA came from Dome and GQA was like a follow-up paper by some of my colleagues at Google. So I think GQA became a point where, okay, this is already accepted.
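As a rough, hedged illustration of why MQA/GQA matter at inference time: the KV cache scales with the number of KV heads, so sharing KV heads across query heads shrinks it. The model shape below is made up for illustration, not any particular model's config.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store (seq_len, n_kv_heads, head_dim) per layer, in fp16/bf16 here.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

n_heads, head_dim, n_layers, seq_len = 32, 128, 32, 8192   # hypothetical 7B-ish config

mha = kv_cache_bytes(seq_len, n_layers, n_kv_heads=n_heads, head_dim=head_dim)  # one KV head per query head
mqa = kv_cache_bytes(seq_len, n_layers, n_kv_heads=1, head_dim=head_dim)        # a single shared KV head
gqa = kv_cache_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=head_dim)        # groups of query heads share a KV head

print(f"MHA {mha/1e9:.2f} GB | GQA(8) {gqa/1e9:.2f} GB | MQA {mqa/1e9:.2f} GB")
```

The quality hit Yi describes came from collapsing all the way to one KV head (MQA); GQA keeps a handful of KV heads and, empirically, most of the quality.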

It's good, it's a no-brainer to use GQA. SwiGLU was an interesting one because for a very long period of time... SwiGLU was a single-author paper by Noam, and it had very few citations at the start. Only Google papers were citing SwiGLU at one point, and I was probably responsible for something like 30% of SwiGLU's citations at one time. SwiGLU became popular because of the updated T5, the T5 1.1, that uses

SwiGLU. Nobody really cared about SwiGLU for a long time; I kept checking why this underrated paper wasn't getting citations. It probably has a few hundred citations by now. But SwiGLU is one of the things I played around with a lot at Google, and SwiGLU really works. There was also a paper we wrote, 'Do Transformer Modifications Transfer',

and so on. It was a paper with Noam and Sharan and Hyung Won and others, where we ablated so many transformer variants. - Yes, I saw that. Some of them matter, but most of them don't. - Most of them don't. And the only things that really mattered

in that paper were SwiGLU, I forget which exact GLU variant it was, and sparsity at the time. So that was a strong enough finding to... - For the listeners, is this the 'Scaling Laws vs Model Architectures' inductive bias paper? - No, no, not this one. There was another one, like...

'Do Transformer Modifications Transfer', something something. - Okay, I think you gave the keywords there, yeah. - I think the RMSNorm and RoPE choices are not controversial. RoPE has that extrapolation property, which is nice, and it's also the default now; nobody wants to

use absolute positional embeddings anymore. I liked the T5-style relative attention for a bit, and I actually ran that ablation for PaLM, T5 relative attention versus RoPE. RoPE performs similarly to the alternatives, but it has this extrapolation property, which is nice. - Which is why your long-context version can go to 256K. - Most of the long-context models use the RoPE extrapolation trick, which is a nice property.
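For reference, a minimal sketch of the rotary position embedding being discussed, in its textbook interleaved form; this is not a production kernel, and the extrapolation tricks used for very long context are not shown.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (..., seq_len, dim) with even dim. Rotate each consecutive pair of
    # channels by a position-dependent angle; relative offsets fall out of the math.
    seq_len, dim = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, dtype=torch.float32)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos[:, None] * inv_freq[None, :]          # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)

q = torch.randn(2, 8, 16, 64)                          # (batch, heads, seq, head_dim)
print(rope(q).shape)                                   # torch.Size([2, 8, 16, 64])
```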

So that was RoPE. There were also some things like the layer norm positions and so on that mattered a little bit, maybe not too much. But in general there are not

a lot of things that people can do to the transformer. It's been four or five years, right? The vanilla transformer, if you use it as-is today, will not be that optimal. But the transformer we've slowly evolved to, the Noam transformer, is a very, very strong baseline that is very hard to beat. I think you need a drastic shift to

beat it. Or you could find something like SwiGLU, a small change with a big enough impact that doesn't cost much, because with a lot of architecture changes, the moment they are tedious to implement,

nobody adopts them. SwiGLU is a simple thing: just split it and then gate it. It's very simple to implement. Maybe that's why it caught on: you get an additional boost for the simplicity of it. So there's also a bit of an implementation lottery, if you will. If you propose some very complicated thing for 0.1%... - Is it easy in PyTorch? - Yeah, nobody will use it. - The biggest, I mean, I can't believe we're taking so long to come to this topic, but the biggest Noam architecture

decision is encoder-decoder versus decoder-only. - No, encoder-decoder is not really a Noam thing; the Noam architecture is more about the, okay, maybe the more old-school transformer pieces. - Maybe we just want to talk about the decision of encoder-decoder versus decoder-only, then. - Okay. I wouldn't be able to comment on exactly our setup, but

I think encoder-decoders are very misunderstood. There's the encoder-decoder, the non-causal decoder, which is a prefix LM, and then the decoder-only model. Technically, a causal decoder and a non-causal decoder are very similar, in the sense that the difference is just a bidirectional mask over the prefix. And between a prefix LM and an encoder-decoder, the only difference is that the encoder-decoder splits the inputs and targets into separate, non-shared

transformer stacks, with an encoder bottleneck in between. So people always associate encoder-decoders with

BERT or something; people get confused about these things. But in the UL2 paper we really explored this, and some of the BigScience papers also talk about it: prefix LMs and causal decoders are very similar, and the difference is the mask. At the end of the day, they're all autoregressive transformers.
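To make the "it's just the mask" point concrete, here is a toy sketch of the causal versus prefix-LM (non-causal decoder) attention masks; True marks positions a token is allowed to attend to.

```python
import torch

def attention_mask(seq_len: int, prefix_len: int = 0) -> torch.Tensor:
    # Causal decoder: lower-triangular visibility only.
    # Prefix LM: the first prefix_len tokens (the "input") see each other
    # bidirectionally; the remaining tokens (the "target") stay causal.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    mask[:prefix_len, :prefix_len] = True
    return mask

print(attention_mask(5))                  # plain causal decoder
print(attention_mask(5, prefix_len=3))    # prefix LM over a 3-token input
```

An encoder-decoder goes one step further and puts the bidirectional part into a separate parameter stack; the decoding side is the same autoregressive language model either way.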

The one big benefit of encoder-decoders is what I like to call intrinsic sparsity. An encoder-decoder with N params

basically has the cost of an N/2 decoder-only model. So it is a bit like a sparse model: you spend the same amount of FLOPs, it's just that you have two sets of parameters, one for the encoder and one for the decoder. It's FLOP-matched with a decoder model of half the parameters, so UL2 20B is actually about a 10B decoder-only model in cost. You get free sparsity from that. The OG T5 paper talks about this; there's a complexity chart you can look at.

When doing the UL2 paper, I was kind of mind-blown by how much more you can do once the encoder is not bounded by the causal mask. A lot of the efficient transformers, the sparse transformers from the early days, Linformer and things like that, cannot maintain the causal mask, and that's why you cannot train a proper language model with them. But if you separate out your very long context into an encoder, that encoder has no loss on it whatsoever,

so you could do aggressive pooling, you could do some crazy sparse attention, a Funnel Transformer or something like that. You could make the encoder smaller than the decoder, you could make it faster than the decoder. Those are some of the advantages of splitting into encoder-decoder versus just using a decoder-only model. And at the end of the day, the decoder in an encoder-decoder

is still a regular autoregressive language model. So it's actually not that much different from, say, a retrieval-augmented language model. - This is news to me; I don't know if you've ever expressed this, but it actually makes sense. Unfortunately, I don't know enough to push back on it, but on the surface it seems to make sense. Would you make the same choices if you were not so focused on multimodality?

That's one of the ways in which I was thinking, oh, encoder-decoder makes sense because it's more natively multimodal. - I'll just say that it's relevant. Yeah, it's relevant. - Then we can move on to broader trends in LLMs, just commentary on ecosystem stuff, completely independent from Reka. You've commented on a few things, like how Llama 1 to 3 glowed up a lot.

I call this the Llama 1 to 3 glow-up: they improved into an actual top-tier open source model. Phi-1 had a lot of criticism, but it seems like Phi-3 is getting a lot of love. In your open-model tier list, what do you see going up and down? - I think Llama 1 and Llama 2 were quite

mid, right? But Llama 3 actually got good; I think Llama 3 is actually strong. I don't really follow Phi much. - Their whole thesis is the 'textbooks are all you need' thing, that you can use way less data than everyone else and still... - But I think you cannot cheat the scaling laws. I vaguely remember them saying they match Mixtral 8x22B or something like that

on some benchmarks, and okay, I don't think these academic benchmarks are that meaningful anymore. But then when they go on LMSYS they get, what, 47, or something that looks a lot weaker. - That was Phi-2. I don't know about Phi-3. - Oh, there's a Phi-3? - Phi-3 was just released, like, yesterday. - I didn't even see it. I don't follow Phi that much, but a model that is synthetically...

actually, I didn't even read the paper, but a model that is based on the premise of distilling and things like that is not that interesting to me. But I think Llama 3 shows that Meta has built a pretty good stack around training these models.

I've even started to feel like they've maybe kind of caught up to Google now. That's maybe a hot take in itself. But yeah, Phi I don't really follow that much; there are too many things to follow. So I think

Llama 3 is probably the first really legit one. - When you say things like 'most legit', obviously there's Vibe Eval or whatever, but the very common feeling is that MMLU is kind of saturated. So what do you look at now?

Is it just LMSYS? - I think LMSYS has its problems too. It's probably better than all these regular benchmarks, but serious labs create their own evals,

and a good eval set is one that you don't release, or okay, you release some of it, but you don't let it get contaminated by the community. I think LMSYS is probably the most

legit one. Things like GSM8K and HumanEval, the coding one, they're all... - Contaminated. - Saturated, contaminated. GSM8K, whether you're at 92 or 91, no one cares, right? - But we still report three decimal places in all of our reports.

Yeah, but it's almost an obligatory thing to do, putting your numbers in bold. It's interesting to see how the field evolves over time with these benchmarks. But evals are going to be important, and interestingly, it's probably on the academics to set the right direction. Academics have always said, oh, we have no compute, but okay, this is your chance to steer the field in the right direction. - I think the challenge is getting attention. So now

that MMLU is reaching the end of its life, what is next? There's MMMU, or there's the harder MMLU that someone recently released. - MMLU Pro, I think. It's called MMLU Pro. - Oh yeah, that's right, MMLU Pro. But that only lasts you a year, and then you have to find something else. So I don't really know what's next. Well, one thing: you had a comment, I think, in the Vibe Eval paper that there are two types of evals. One is LLM-as-judge, and two is arena style,

right? Those are sort of the two ways forward for general evals that cannot be gamed. - Although there are also human evals that you run, instead of LLM-as-judge. That's kind of similar to arena style but also different to some extent. - By the way, do you use your own staff to do that, or do you hire an outsourcing firm? - We work with third-party data companies; there are a bunch of those around. Obviously we don't eval everything ourselves;

it depends on how many evals you want to do. - Sometimes the best researchers do their own evals. - Yeah, looking at the outputs and stuff is something researchers should do. - Well, there is one flavor of parametric evals which I'm hoping more people come up with, where

the benchmark is generated from a seed, let's say. You can withhold the seed or vary the seed, report how a model did given a certain set of seeds, and maybe average them. That way it becomes much harder to contaminate.
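A toy sketch of that seed-parameterized idea; `model_fn` is a hypothetical stand-in for whatever model call you are scoring, and multi-digit addition is just a placeholder task.

```python
import random

def make_benchmark(seed: int, n_items: int = 100):
    # Each seed instantiates a fresh variant of the same underlying task,
    # so held-out seeds stay uncontaminated even if one variant leaks.
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        items.append({"prompt": f"What is {a} + {b}?", "answer": str(a + b)})
    return items

def evaluate(model_fn, seeds=(0, 1, 2)):
    scores = []
    for seed in seeds:
        bench = make_benchmark(seed)
        correct = sum(model_fn(x["prompt"]).strip() == x["answer"] for x in bench)
        scores.append(correct / len(bench))
    return sum(scores) / len(scores)    # average over seeds, some of which can be withheld
```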

I wonder if there's already an example of this. Not exactly, this is just something I'm wondering about myself, but someone did recently put out GSM1k, which was... - Oh, the Scale AI thing. - Yeah, which is similar in that respect: make it easy to create variations of a known benchmark that are more likely to be withheld from training data. - Yeah, but eventually those decay too; it's always a trade-off. Even when we put out Vibe Eval, I was quite upfront that

the more people use it, the shorter its lifetime. It's like a car: after you've run a certain number of miles, it's time to shelve it. So I don't think there's actually a good solution. This is important for the community to think about, but is it a fundamental limitation of any benchmark that goes out? One more thing: in the past, people used to withhold test sets, like SQuAD or something. But after a while, I think people also realized that

when you withhold the test set, it's so much extra work for the community to eval on it that they just don't, and your benchmark becomes unpopular. I think it's also an incentive thing. Let's say you want to run a contest, and your goal as an academic is to get as many citations as possible on this benchmark

paper, or you want it to be as famous as possible. You will not want to withhold the test set, because if you do... many years ago there were even some benchmarks where you had to package your model and send it to them to run. Those benchmarks never took off. So at the end of the day, the root problem is incentives:

the benchmarking problem is also an incentive problem. People want to show their model is the best, and the benchmark creators want to gain as much clout as possible. LMSYS also gets some of this; I don't have a take on it, but there are people who feel they are also optimizing for hype, for their own clout. I don't know what field this will be, but

the sociological one; I think there are a lot of papers to be written about how these rewards and incentives shape the field. - I would say SWE-bench is probably the one that's broken out this year as the thing everyone wants to compete on if you're a coding agent. I don't know if you have a view on it, but it hits the sweet spot: it's known to be hard, yet you can make progress on it quickly, and that makes it popular and cited a lot. - Yeah, yeah.

Multimodality versus omni-modality. This is a little bit of commentary on GPT-4o and Chameleon. I don't know if you saw the Chameleon paper from Meta. - Briefly saw it, yeah. I didn't really take a close look at it. - Basically, the general idea is that most multimodal models, like LLaVA or Flamingo, are late fusion, where you

freeze the pieces and then join them together, versus early fusion, where all the modalities are present in the early pre-training stage. The general thesis is that things are trending from late fusion to early fusion, with GPT-4o being very obviously early fusion, and you guys I would also class as early fusion. I don't know if you have commentary on whether this is obvious to you,

whether this is the way, or whether they will just coexist. - I think whenever possible, early fusion is better. There will still be a lot of work that does late fusion just because... - GPU poor. - No, not GPUs. Okay, partially right. I think

I see this as an artifact of the divide between language researchers and vision researchers: people who train language models put out a Llama or whatever, and then somebody takes it and does late fusion on top of it. It's more of a... - It's Conway's Law. They're shipping the org chart. - Yeah, I think so. I didn't know about that law, but it's kind of an artifact of the organization, right?

- Or it's just because people don't have the money to train things from scratch, I don't know. - No, even in big companies. I don't know how things have evolved at many companies, but... - You're talking about Flamingo? - Language and vision teams didn't use to be the same team, right? So I think this is an artifact of that. But as early fusion models get more traction, I think the teams will merge more and more.

It's a bit like how all the tasks unified from 2019 to now; now all the modalities are unifying, and I think eventually everything moves towards early fusion. - The other element of multimodality is what I've been calling screen vision versus general vision,

in the sense that Adept is very focused on screens, tables, charts, while most vision models focus on things in the real world, embodied images. Do you have a view on the usefulness of this split? - I don't think there's a huge difference. Maybe screen intelligence is more useful in general, but what if you have a natural image inside the screen?

- Yeah, they should be part of the mix. - At the end of the day, it should be mixed. If a model can do natural images well, it should be able to do screens well too. I don't see a future where there are screen agents and separate natural-image agents. As a human, you can read what's on the screen and you can also go out and appreciate the scenery; you're not restricted to only looking at screens. Eventually the models will be good at everything. I look at it from the point of view of capabilities,

and even within screens there's the mobile phone screen and the laptop screen, different types of interfaces, reading emails, reading a page from a website, buying something from Amazon, all kinds of things. And even on a shopping website there can be natural images, for example picking an Airbnb to book,

where there's a natural image in there and you have to understand how nice the scenery is, or where it is. So at the end of the day it's probably the same thing if you want to build a general model. But natural images are way easier, in the sense that

current models are already pretty good at natural images, while screen images are something where people still need to enhance the capability a bit more; that's why there's some focus on it. - Got it. I'll touch on three more things and then we'll go to career stuff. Scaling laws. PaLM 2 was Chinchilla,

which is roughly one-to-one scaling of model parameters and data. Now you are training a 7B model on 5 trillion tokens. What are you thinking about the trend in scaling laws for data versus params? - Chinchilla scaling laws only tell you, for this amount of compute, how you should split it between parameters and data. Even before I left, we already knew that Chinchilla scaling laws are not the end of it. Obviously there's also inference-optimal scaling, which is:

you take a small model and then you just blast it with as much compute and data as you can

until you saturate on everything you care about. Llama 3 is what, 15T tokens or something. - Which is ridiculous. - It's ridiculous, to be honest. At a certain point your value per FLOP is not great anymore, because the model eventually gets saturated. But the question of where that saturation is is tricky, because you always find some metric that still continues to improve a little bit.
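As a rough back-of-the-envelope for what "way past Chinchilla" means, assuming the common C ≈ 6·N·D approximation and the roughly 20-tokens-per-parameter Chinchilla rule of thumb; the numbers are illustrative, not anyone's actual training budget.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens       # standard rough estimate of training compute

N = 7e9                                  # a 7B-parameter model
chinchilla_tokens = 20 * N               # ~140B tokens would be compute-optimal
overtrained_tokens = 5e12                # training on 5T tokens instead

print(f"Chinchilla-optimal: {train_flops(N, chinchilla_tokens):.2e} FLOPs")
print(f"5T-token run:       {train_flops(N, overtrained_tokens):.2e} FLOPs")
# Roughly 35x more compute for the same parameter count: inference-optimal, not compute-optimal.
```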

And then you're like, okay, maybe another 100K of compute is worth it to continue training, just a little bit more. But where does it end? I think the thing about Chinchilla scaling laws is that they were a bit misunderstood, as though this is exactly the model you should train for this compute, and you shouldn't go past it.

I don't know why so many people made such a big deal about training past the Chinchilla-optimal point, like, oh, Llama did it first. T5-Base was trained on 1 trillion tokens; that was already far beyond the Chinchilla-optimal point, because that was just T5-Base. - I think OPT and GPT maybe set that as an industry standard.

GPT-3 specifically. No, sorry, wait, GPT-3 was not Chinchilla. - I think OPT and BLOOM, models like that, trained a large model on a very small number of tokens and the models turned out to be bad. - Yeah, I was talking about Kaplan, the pre-Chinchilla one, the Kaplan scaling laws. - Oh, okay, that one was from OpenAI. Anyway, death of Chinchilla, covered, agreed. - Chinchilla is still a great paper, though.

I love any scaling laws paper, to be honest; it's such a service to the community. Hugging Face recently did one, Datablations, which is a data-constrained scaling laws paper; that was kind of nice. - Long context. People are talking about million-token context, two million tokens from Gemini, Magic is talking about 100 million tokens. How important is it, do you think?

I think we need to solve the benchmarks before solving long context. - We have your benchmark. - No, I mean benchmarks for long context. Needle-in-a-haystack is basically the MNIST of this, a unit test for this style of thing. There's one part about hitting the context length and another part about actually utilizing it. Gemini's long context is surely amazing, but for the community to move forward on this, it comes down to the problem of

how we evaluate it. I've seen some long-context benchmarks, coding ones and so on, and making those is important for the community to hill climb on. Long context is important; it's just that we don't have a very good way to measure it properly right now.
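For reference, a toy version of the needle-in-a-haystack "unit test" being referred to; the filler text, the needle, and the `model_fn` call are all placeholders.

```python
import random

def make_needle_prompt(n_filler_chars: int = 400_000, needle: str = "The passcode is 7413."):
    # Bury a single fact at a random depth inside filler text, then ask for it back.
    filler = "The sky was grey and nothing much happened that day. " * (n_filler_chars // 54)
    pos = random.randint(0, len(filler))
    haystack = filler[:pos] + needle + " " + filler[pos:]
    return haystack + "\n\nQuestion: What is the passcode? Answer:"

def passes(model_fn) -> bool:
    return "7413" in model_fn(make_needle_prompt())   # model_fn: hypothetical LLM call
```

As Yi says, this is closer to a unit test than a benchmark: passing it shows the context is reachable, not that the model can actually use it.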

And yeah, I think long context is definitely the future rather than RAG, but they could be used in conjunction, definitely. - Okay, that's a hot take. Which part of it? - Long context is the future rather than RAG. They will coexist, but you are very positive on long context.

I'll put myself on the mirror-image side, which is that long context is good for prototyping, but any production system will just move to RAG. - There are a lot of application use cases where you want the model to take its time and come up with the right answer. - Sure, but you'll use those sparingly because they're expensive calls. - It depends on the nature of the application, I think, because with RAG there are a lot of issues, like when

the retrieval itself is the problem, or things get fragmented. What if it's a very complex story, a storybook or something intricate? With RAG you chunk it into chunks, and the chunking is lossy,

so you definitely lose information. There are a lot of application use cases where you just want to tell the model, okay, here's a hundred bucks, take your time, take one whole day, come back to me with the right answer, rather than

paying one cent and getting back a wrong answer. It's actually very easy to show that RAG is better than long context, because there are a lot of tasks that don't need the long context: you just need a fast retriever, you do RAG, and you're done. So long context sometimes gets an unfairly bad rep, because it's very easy to show RAG is 100 times cheaper, but it's

not so easy to emphasize the times where you actually really need the long context to make very, very good decisions. Both have pros and cons depending on the use case, and using them together is also interesting. At the end of the day it's a hyperparameter that you have to wiggle around.
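A minimal illustration of the chunking issue from a few lines above: a naive fixed-size chunker will happily split a fact or a plot thread across chunk boundaries before the retriever ever sees it. Sizes and overlap here are arbitrary.

```python
def chunk(text: str, chunk_size: int = 200, overlap: int = 50):
    # Fixed-size character chunks with a little overlap; anything whose meaning
    # spans more than one chunk gets fragmented before retrieval even starts.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

story = ("The butler was seen near the library at midnight. "
         "Hours earlier, the library key had gone missing from the study. ") * 20
print(len(chunk(story)), "chunks; the clue and its context may land in different ones")
```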

- There's another toggle on that hyperparameter, which is how much you fine-tune new knowledge into the model. Are you positive on that? Do you have any views? For example, instead of doing RAG on a corpus and then inserting it into context, you would just fine-tune your model on the corpus so it learns the new knowledge

in whatever capacity. - This is cumbersome, I guess. The point of in-context learning is so that you don't actually have to do that. I think this one depends on the business use case. If you are very clear that you want this knowledge, then you just fine-tune once and you don't ever have to pay the

context-window cost again; then maybe it makes sense. But if the domain keeps changing, then it might not. - Yeah, obviously it doesn't make sense if the domain keeps changing. But for the model to update fundamental assumptions, or reweight associations between words for, say, a legal context versus a financial or medical context, it might work. That's the argument some people make. So I see this as a trio: long context, RAG, and fine-tuning.

People always ask whether either of the others will kill RAG, basically, because RAG is kind of the simplest approach.

- Yeah, okay. I could see that if you want a model for the medical domain or the legal domain, fine-tuning really works. There's always this tension between the domain-specialized model and the universal model. It definitely makes sense, and fine-tuning can also be an alternative to RAG. - Yeah, well, there are some companies that are set up

entirely just to do that for people. It's interesting: I kind of view Reka as not working in that space, but you could potentially offer it if you wanted to. Okay, I was going to ask about efficiency and scaling. I'll mention this briefly and then we can talk about MoEs, because I discovered that you were a co-author on the sparse upcycling paper, which is excellent. - Oh, I was just advising on that. - Okay, but you can talk about sparse upcycling; it's a hot topic.

More generally, efficiency: in my mind, when I go to ICLR or NeurIPS and I see an efficiency paper, 90% of the time I'm just going to ignore it, because I don't know if it's going to work. And I think this is related to some of your scaling-versus-inductive-bias work. There was this account, Teortaxes; I don't know who this person is on Twitter. - He keeps talking about me. It's fucking amazing.

- Oh yeah, he does have some obsessions, but he's good. I don't know who he is, but he's good. So he says: if 2024 papers are to be trusted, you don't need most attention, you don't need high precision, you don't need most of the KV cache, you don't need most feed-forward layers, you don't need a reward model. A lot of efficiency papers are just like, hey, on this small example, we cut this thing out and it

works fine, or works great, works better, whatever. And then it doesn't scale. So it's an interesting observation that most efficiency work is either busy work or work at a small scale that ignores the fact that the thing doesn't scale, because you haven't scaled it. That's fine for a grad student, but for someone who's trying to figure out what to pay attention to, it's very difficult

to figure out what is a worthwhile direction in efficiency. - Yeah, that's a good point. I agree with you fundamentally. It's actually quite easy to tell,

when you see a paper, whether this one works or doesn't work; experience will often just tell you that. You can always find a task and a dataset where your efficiency method gets neutral results; you can always find one setting where, okay, I have comparable quality. And you know what's the

cutest thing ever? Every time people propose something like this, they run some zero-shot score on LM Eval Harness or something like that, and at 1B scale all the numbers are basically random. All your BoolQ numbers are at random-chance performance, and they'll say, okay, I get 50 versus 54, I'm better. But dude, that's all random chance, right? - That's a good tell.

- The sad truth is that it's very hard to tell until you scale up. And sometimes the benchmarks we have don't even probe what we actually care about, especially all the work on transformer alternatives. You can always find an alternative where,

at 7B scale, at 3B scale, okay, I match the transformer on this and this and this. But what are the implications when you go to 100B, to 200B? No one knows that. So that's one thing, and developing your own intuition of what works and what doesn't is important. To be honest, all researchers are sometimes guilty of this, because you cannot test on

everything. So sometimes you just want to show your method works on something, but it depends on the objective. If the objective is to write a paper for ICML, sure, you can find two datasets where your stuff works. But whether it gets adopted, I'm not sure. - Yeah, the researcher metagame is one thing, but as a consumer of research, I'm also trying to figure out how to know what a useful direction is. That's the

interesting thing. So, for example, MoEs seem to have worked out. I'll go so far as to say it's the first form of sparsity that worked, because there's so much sparsity research, like we can chop all these parameters and look, we still perform the same, but it never actually works.

- MoE really does. - Oh, you mean the pruning line of work? - The pruning line of work, sorry, I should have used that word. So I don't know if you have any commentary on DeepSeek, Snowflake, Qwen, this proliferation of MoE models that all seem to be sparse-upcycled, because

you were an advisor on the sparse upcycling paper? - The sparse upcycling paper was mostly vision-focused with a little bit of T5 experiments, so it was an early stage of sparse upcycling. But it was good that Google was already thinking about this long ago, and Noam also had a paper on it. Noam was always way ahead.

Is it a hundred experts? Is it a thousand experts? For some reason the community settled on eight. You probably get more gains from more than eight, I think. But in general, MoEs are just a trade-off between params and FLOPs: you

get the scaling-law benefit from the additional parameters while keeping the FLOPs low; it's just changing the FLOP-to-parameter ratio. - Keeping in mind there's a lot of inefficiency between the experts. - Yeah, but as an architecture, the FLOP-to-parameter ratio makes it worth it. The thing that's not very well understood is

how MoE, as a research question, relates to capabilities and things like that. Does this inductive bias matter? For example, I think there was this paper, Flan-MoE or something, that showed that with massive instruction tuning, I don't recall the details, MoE models behave differently from dense models.

Fundamentally, I just think MoEs are the way to go in terms of FLOP-to-parameter ratio. If you do it right, they bring the benefit on the scaling curve, and that's the performance-per-FLOP argument, activated parameters, whatever; it's a way to slightly cheat the scaling law by having more parameters. The more interesting question is what trade-offs you make

in terms of capabilities because of this architecture. I guess all the frontier labs already know this and nobody's writing papers about it anymore, so you just have to live with what's out there. But I'm bullish about MoEs.
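To make the FLOP-to-parameter trade-off concrete, here is rough parameter accounting for a dense FFN versus a hypothetical 8-expert, top-2 MoE FFN; the dimensions are illustrative, and router parameters plus the gated third matrix of modern FFNs are ignored for simplicity.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    return 2 * d_model * d_ff            # up-projection + down-projection only

d_model, d_ff, n_experts, top_k = 4096, 14336, 8, 2

dense      = ffn_params(d_model, d_ff)
moe_total  = n_experts * ffn_params(d_model, d_ff)   # parameters you have to store
moe_active = top_k * ffn_params(d_model, d_ff)       # parameters each token actually uses

print(f"dense FFN:  {dense/1e6:.0f}M params")
print(f"MoE total:  {moe_total/1e6:.0f}M params")
print(f"MoE active: {moe_active/1e6:.0f}M params per token")
```

The "cheating the scaling law" framing is exactly this gap between total and active parameters: more capacity on the scaling curve for roughly the FLOPs of the top-k experts.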

- Yeah. I did an exercise for myself rating research directions by their asymptotic value, and I put MoEs pretty low, because I think you have a good base model, then you upcycle it and it bumps you a little bit, and

I think that's it. But I'm always seeking to invalidate my hypothesis. - MoE from scratch is also promising. In the ideal case, you do MoE from scratch. - Okay. The last part that makes me uncomfortable about the MoE debate relates to another paper of yours, The Efficiency Misnomer, in that people now try to make the debate all about active parameters rather than total parameters. But it sounds like that's something you're comfortable with, that FLOPs at inference is a relevant metric.

- Well, thanks for actually reading the papers. - I'm trying, man. It's hard to keep up; you have a lot of papers. - I'm actually very impressed that you're bringing these papers up. - Yeah, I'm paying attention. And I'm interested in efficiency that works; it's just very hard to find, so anything that gives me high signal on efficiency is helpful. - So for The Efficiency Misnomer, I love that paper by the way, we had a fun time

working on it. In The Efficiency Misnomer we found that a lot of people used params as the headline cost metric. MoEs were not very hot in the community at that time, but MoEs had been a thing long ago at Google. I'm comfortable with using active params to approximate the cost of a model, but in The Efficiency Misnomer we actually

made it quite clear that you should always look at cost holistically, because you also have serving costs: fitting in the GPUs, spilling onto a second node, things like that. - An interesting one was speed. Nobody really talks about speed, but your paper actually did. - I have something to say about speed and throughput. There are so many methods proposed for efficiency

that are theoretically faster because of some complexity argument, but because there's no way to work around the implementation, or the implementation becomes so hard, they end up 10x slower in practice. There are so many papers like that. - It's not hardware-aware.

- It could be the hardware, or it could just be that in its mathematical form it's linear complexity or whatever and theoretically faster, but just because you have to do a scan or something like that, it becomes ten times slower in practice. There are some methods like this where people don't take

throughput into account. Which is also a problem of incentives sometimes:

you can easily sell a paper as 'more efficient' and people will not question it. The reason we wrote the paper is that so many people were confused about efficiency itself. A lot of unsuspecting reviewers, even academics without that real-world experience, will think, okay, fewer parameters, more efficient. But you could have a method with fewer parameters that is three times slower.
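The practical counterpart to that point is to just measure wall-clock throughput on the target hardware instead of trusting parameter counts or asymptotic complexity. A minimal sketch, assuming a PyTorch `model` and a tensor `batch` of token IDs:

```python
import time
import torch

@torch.no_grad()
def tokens_per_second(model, batch: torch.Tensor, n_iters: int = 20) -> float:
    model(batch)                                   # warm-up (kernels, caches, compilation)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                   # don't time queued-but-unfinished GPU work
    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_iters * batch.numel() / (time.perf_counter() - start)
```

Comparing this number, FLOPs, and parameter counts side by side is exactly the holistic reporting the paper argues for.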

Because a lot of the time, when you add things to the model, it becomes slow. You add complexity, especially if it's something that's not hardware-optimized, with no kernels, or something that's bad for TPUs or whatever, and your model just becomes slow. - That's a temporary issue; people can fix it. - But some things are not so

easily fixed, or it just adds a lot of engineering cost to optimize. Yet it's always marketed as 'I save params, so I save cost'. And it matters where in the model you add the params. For example, even when you param-match models, if I take some params out of the

FFN and put them into the embedding layer, the embedding lookup is a cheap operation, but my model becomes lopsided. I could say I param-matched it, but it's not FLOP-matched and it's not throughput-matched, because the parameters are unbalanced. So there are these tricky things that make model comparisons very, very difficult, because you cannot even put FLOPs,

params, and actual speed in the same plot. And there's always one money shot, one compute-versus-quality plot for marketing in the paper, so it's always very easy, not intentionally but

subconsciously, to show one story when there are actually all these other things to consider. - Yeah, it's selection bias, self-serving bias, whatever. Very cool. Okay, that was most of the technical side. One more piece of commentary: on the future of open source models, Founders Fund basically said the future is closed source, and you were agreeing with it.

A lot of the open source fanatics are up in arms over this. I don't know if you care to comment on open versus closed. - If you're referring to the tweet I wrote, it blew up; so many people commented on it because they were personally offended that open source cannot catch up. So I want to say I'm not

against open source; I contributed to open source in the past, so I'm not against it per se. But the interesting thing I want to talk about here is that there's a line I draw with 'open source'. To me, Llama 3 is... Meta has an org that is hypothetically very similar to

the Gemini org or something, but they just decided to release the weights. It's open weights, open weights everything. When most people say that open source is catching up, they mean this grassroots, bottom-up thing, these indie developers coming together to fight the big labs; it's romanticized and dramatized to some extent. And to be very fair,

so far, if you look at the different factions, the big labs are just pushing and pushing and pushing. The academics, Stanford and so on, came out with DPO, they came out with things like that; they're kind of on the borderline of the open source community. And then there are also the developers who are

fine-tuning on GPT-4 outputs and everything. The underlying idea of open source, collectively improving something, I'm not

criticizing for the sake of criticizing. But in order to make progress, the incentives matter, and what I observed is that people like to take somebody else's model, rename it, and make a quick win from that. And you notice that when people realized that standing on top of GPT-4 and running some DPO was not going to give them the reward signal they wanted anymore,

all these variants were gone. There was an era where there were so many of these model variants that I lost track of them, but now they're all gone, because people realized you cannot climb LMSYS with something that lightweight; you need something more. So that was just my overall point. - Honestly, the Hugging Face leaderboard contributed to most of that, not LMSYS.

- No, I think with LMSYS people realized that they could not. The Open LLM Leaderboard is probably a big problem, to be honest. - We're talking to Clementine in one of our future episodes. They dedicate a lot to it; there's so much attention on it. It's a tough problem, but they're providing a public service for sure. - Yeah, good intentions are always good. - I'm interested in,

just career-wise, what your productivity practice is. I'll split it into three things: one, keeping up,

reading papers and the outside world; two, how you organize your own work; and three, work and life. Take those in any order you wish. - I don't have much of a life, actually, but I am trying to have more of one. - I mean, you're a father now. - I have a baby now, so I'm trying to have more of a life. I think the productivity hack I have is just that I didn't have a boundary between my life and my work

for a long time. I just cared a lot about working most of the time. Through my PhD, through Google and everything, I was just working all the time. It's not the healthiest thing ever, but I think that was actually one of my biggest productivity boosts. And I like to spend a lot of time writing code; I just enjoy running experiments, writing code and things like that. If you enjoy something, it's not work, right?

It's very strange: sometimes I have to watch some Netflix series because my wife asked me to watch it, or somebody tells me that I'm behind on some show, but then I get distracted by my experiments running and I just end up writing code instead. Things like that. It's not the healthiest thing, but yeah,

I think that's one. - I'm looking for a practice like, okay, Andrej recently had a thing where when he wakes up, he doesn't look at social media, he goes straight to work. Damn, I check Twitter the moment I wake up. It's something I do as well, but I'm like, damn, that's a smart rule, and I'm looking for rules like that. Do you have any? - Well, he doesn't check social media because his phone is exploding all the time. I don't have that many likes and followers, so it's fine for me.

- You'll get there. Rules like that, mantras that you've developed for yourself, where you're like, okay, I must do this. For example, I've been trying to run my life on a calendar for a long time, and I found that the only way I work is to write things down with pen and paper and cross them off individually. The physical action really helps me get things sorted.

That's work-wise. Reading-wise, I don't know if you know, but I've been running this AI newsletter that summarizes all of Twitter, Reddit, and so on. That helps me keep up, because it's socially graded and I've personally vetted the entire pipeline from beginning to end. So that's my input algorithm. I know how to keep up with news because I now have an

information condenser. So I'm trying to figure out what your algorithm or your rules for keeping up are. - I've got something for keeping up. I used to check arXiv every morning when the gate opens. I would wake up at 9.30am Singapore time, when the arXiv gate opens, and I'd be very sad if there were no papers to read.

- But you usually just pick one or two papers that you find interesting. - I don't read them in full, I just skim them. I used to do that; I don't anymore. Ever since I've been at the startup, I read fewer papers. But I used to camp at the door of arXiv quite frequently just to see. - Isn't that... that's not a good use of time, I'll come out and say it. It's a newness bias. Sorry, go ahead. - No, it's just that I'd run out of things to read;

the new stuff comes out, and that's how I kept up to date. - So in the space of three years, you read every...

- No, I didn't read everything. And these days I realize I don't have to do that anymore, because if a paper is important enough, Twitter will show it to me. I also don't really read papers in depth that much anymore; I mostly just skim them. So that's for keeping up with papers and research. The other thing, more from a productivity point of view, is that I usually start writing

the final artifact while working on the thing itself. Let's say you want to launch something and the end goal is a blog post, or shipping something, or a paper. I always like to look at it from the angle of: what's the story at the end?

And then I figure out what I need to do to get there. As a researcher, I would have so many drafts: when I start a project I don't know the experiments and everything yet, but I like to imagine what the title will be, and I always vibe-check it. My friends at Google will know that I always have

so many Overleaf drafts, and I would just spend time looking at them, looking at the title, asking whether it could be better. I used to care about a lot of these things, and it actually helped my productivity, because every time I look at it I'm like, okay, this is the final product I'm working towards. Because I think a lot of researchers tend to

mess around with the experiments and never ship the final story. The shipping mindset started with shipping products, but as a researcher, the paper is your product. - It's a bit like product management, yeah. - You're shipping the thing. So I like to hang around a lot in my drafts and I get motivated by that. That's one productivity thing I did as a researcher. Other than that, I don't really do anything that's particularly different from others.

- Or probably you don't know it; this is unconscious competence. Okay, what was it like doing an NTU PhD? Just the story of coming out of NTU, which is a good school but not a typical target school for a big lab. - I did my PhD almost unknowingly. I was a very regular undergrad; I had decent grades but not the best grades. I wasn't super smart in school or anything like that.

I wanted to do a PhD just because I was curious, and I wanted to stay in Singapore at that time, so I just naturally did a PhD there. I didn't even

vet my advisor; I didn't think too much, I just fell into the PhD program. That was when I realized that, oh, actually I can do research, I'm pretty decent at research. I just fell into a PhD unknowingly. And NTU leaves a lot to be desired, to be honest. Singapore leaves a lot to be desired in general; the research community here is probably not great. - So how did you break out?

If I were you, I would have no idea how to break onto the international scene. - To be honest, in retrospect it's a bit of a miracle. If I had someone to mentor, I could not tell them how to replicate the same thing that I did. It's much easier now compared to the past, but I was mostly self-supervised during my PhD. My advisor was basically

like Grammarly, like a free plan of paid Grammarly. He can watch this, it's fine. It was this strange arc of my life where I was figuring out research by myself. And okay, maybe going back to the change of opinion: the biggest culture shock I had was when I was moving from a Singapore PhD

to Google. - And you went straight to Mountain View, right? - Yeah, I started at Mountain View. My research taste and everything was so different; the research culture is so different in the US and in Asia. I had to grow so much during my time at Google to actually evolve. And whenever I come back, I still have friends in the faculty here and everything, and I don't want them to think that I'm a snob,

or that I'm being a very nasty person, but to be honest, research here in Singapore is basically just about publishing papers; it's not impact-driven. In the US it's mostly impact-driven; the thing needs to make real impact. - Well, to be fair, you're also comparing an industrial lab versus

an academic circle; you're comparing apples and oranges a little bit. - I know. At the end of the day, in academia the job is to write papers, but your goal is still to advance science. The incentives and reward systems are somewhat different, but at the end of the day, I still feel that researchers are researchers and scientists are scientists no matter

where you are. I get so much dissonance when I come back and talk to people; I feel like, oh, why do you think like this? But then I used to think like this too. The environment shapes the way a researcher thinks, and taste is very important. Sometimes I try to communicate this to people and maybe I come across as a snob

to the local community here. But it's just that there's so much dense information I want to bring back, and there's no fast way to transfer all the things I've learned. It was also a big culture shock because I was in Brain in the Singapore office for a while. - You were the only Brain person there. - Yeah, Brain in Singapore. And I took on an intern from NUS, and the difference in research vibes was

so much of a conflict for me

that it was almost like my body was rejecting it, you know? But this person grew, and I'm happy with how he grew from my mentorship; he's now in a way better situation. But I will say that a lot of people in the universities here are a bit... ignorance is bliss, maybe, sometimes. - Well, no, it's exposure. I didn't know any better myself until I went to the US for college, and then my world was expanded, and

it's a little bit of a Pandora's box, because once you've tasted that, you're never happy. So, okay, the last question is a Singapore question. I like to be visibly non-American while covering the AI scene, because it's very US-centric.

Every non-American I talk to always asks, how can we build Silicon Valley in my city, my country, whatever place that is not Silicon Valley. I feel like you're basically kind of like me: you operate in US circles, but you just don't live there. Do you have any advice? Okay, so the shirt I'm wearing today is from the official Singapore government community group that is trying to guide Singapore AI policy.

If we want a hundred more Yi Tays to come out, what should governments be doing? What should communities and ecosystems be doing? So I actually think that sometimes not doing too much is...

Maybe less is more. I don't think there's actually much the government can do to influence this. This kind of thing is a natural... Organic. An organic, natural thing, right? The worst thing to do is probably to create a lot of artificial things that... Exchange programs? Okay. I mean, Singapore used to have a lot of exchange programs; they send people to... I mean, just talking about AI specifically, right, I think that, for example, sometimes trying to do too much, or moving in the wrong direction, is worse than not moving

at all. Especially if you accelerate in the wrong direction, you actually get into a worse state than before, right? So I think it's very dangerous to move in a bad direction. I think: respect your talent more. Maybe the government should just respect the talent more. And I don't know whether this is too much of a... No, no, no. But not...

maybe not moving in a wrong direction is, to me, already a very good thing. Funding for startups, incubation, holding academic conferences — I think ICLR next year is going to be in Singapore, so people come here and get exposed to it. But I don't know, it's very interesting. Everyone wants to build up AI expertise within their own country, and there's a massive brain drain into the US. I'm part of that; I live there. Yeah.

I feel guilty. I don't see any other way around it. It's such a huge problem. I also do think that there is a cultural hegemony — let's call it US-values-based — basically being asserted on the whole world, right? Because we decide the RLHF on these models, and now you shall use all our models. It's just troubling for national sovereignty; there should be AI sovereignty. And,

I don't know how to achieve it for people. It's very scary. Okay, there's a lot to unpack. Yeah, this is not technical, but I was just curious. We can make this the ending of the conversation, which is: I think you're an inspiration to a lot of other people who want to follow your career path. I'm really glad that we got the chance to walk through your career a bit. Yeah, I'm sure this is just the start, so...

Hopefully there's more to come. And I want to inspire more of you. Yeah, yeah. Sounds good. So I'm just glad that you shared it with us today. As a special coda to this conversation, we were invited to join the Tech in Asia meetup featuring Yi, hosted by managing editor Terence Lee.

Terence asked a similar question on how other countries can create the conditions for top AI labs to spring up outside of Silicon Valley. So, where do you see Singapore playing a role in AI? How would you... Oh, okay, right. I've got a practical one. I've got a practical one that is actually actionable. I feel like one thing that people don't get — the advice, the practical advice — is that the era of

people who talk versus people who do — the era of the people who talk — is gone. So it's no longer about, "I have a team, I have 10 interns from Southeast Asia or the region, and they're going to do this, do this, do this for me." I think one thing that senior people in any government may not get is that the world has shifted into this paradigm where senior ICs — ICs as in individual contributors —

are actually making the most impact in AI. So in GDM and in OpenAI — I mean, the frontier labs — they're all very driven by

individual contributors. And actually, this is not even related to... this is the advice I give, but it's actually general, it's a very general thing. So it's multi-purpose, basically? It's not AI-specific? No, it's also very AI-specific, because the difficulty of making impact and making breakthroughs has started to become... it's not like software engineering, where

I think AI is a little bit harder. And it's mostly about getting very senior people who are hands-on and have a lot of experience, rather than management-style people who think they know what

they are doing but actually don't. So, I mean, I'm not going to say names, obviously, right? But I meet a lot of people like this in general — not only in Singapore. AI has shifted quite a lot into this IC-driven paradigm, where the people making impact are the people who are on the ground fighting the war, right? So it's no longer about,

"I have 10 interns, 20 interns, 100 interns. You do this, you do this, you do this. I just take meetings." No, right? The senior person writes code. Everybody writes code. Nobody should not write code, right? Okay, this is a bit on the extreme side. But the advice is: maybe just take 20% of what I say and incorporate it, right? So, if you...

For example, hypothetical situation: say you want to organise an AI conference in Singapore, and you want to show Singapore as an AI hub of the world. You don't invite policy people to come and talk about AI safety, AI safety, AI safety. You invite people who actually know their stuff. And then if you organise a conference and 100 people go there and they feel very productive and everything...

The problem is that Singapore doesn't have people who really can do it. I mean, through the grapevine, I hear about people fighting for territory here and there. This is what I hear. I don't want to hear this, but I hear it somehow. And then sometimes I just ask them, "Who's actually going to do it? Who's going to do it?" The model is not going to train itself unless we have AGI.

Understand that times have changed. It's no longer about, "Oh, I'm very senior, very senior, very senior." Can you code? That's the question. I think that's the... Spicy already. We're like CoCo Ichibanya — we've raised the curry spice level to the maximum. Almost there already. Questions, anyone?

Indeed, questions are very welcome. Head over to the Latent Space Substack to leave a question, or tweet @YiTayML or @agihippo directly with your feedback.