
Ep 69: Co-Founder of Databricks & LMArena on Current Eval Limitations, Why China is Winning Open Source and Future of AI Infrastructure

2025/6/17

Unsupervised Learning

People
Ion Stoica
Topics
Ion Stoica: I originally built LM Arena to evaluate the Vicuna model, a conversational AI model that students developed without my knowledge at first. We started with human evaluation, which didn't scale, then tried using GPT-4 as an evaluator, but people questioned how it differed from human evaluation. So we built Chatbot Arena, which uses an Elo rating system: randomized, anonymized models provide answers, users vote, and ratings are computed from the votes; it later expanded to multimodal evaluation. We formed a company because evaluation needs to scale, it is expensive to run, and it requires a scalable backend and a more responsive user interface. The data we collect is very valuable and can answer questions such as how swapping models would affect an application. I think the main challenge for AI is reliability, and LM Arena can help address it. Human evaluation matters because most applications have a human in the loop, and with enough data we can factor out known biases. LLMs used as judges also have biases, such as position bias and verbosity bias. I find the pace of open-source models impressive; China has structural advantages in open source because it has more experts and more data, and there is stronger collaboration between academia and industry, while in the US AI development is siloed and academia plays a limited role. I think the US will very likely overbuild infrastructure. China can fund strategic projects over the long term, which may give it a structural advantage. AI infrastructure is moving toward vertical integration and co-design across layers, and it needs to address the challenges of distributed, heterogeneous infrastructure, including automatically generating optimized code kernels and optimizing the intersection of networking and compute. Databricks did well in providing seamless, high-performance access to all data and pursued enterprise AI aggressively early on; AI is in Databricks' DNA, and early customers bought Databricks products in order to do AI. I changed my mind about quantization: it has been more successful than I expected. On AGI, my view is that computers are doing better than humans at more and more tasks, but progress will be slower on the more subjective ones. AI may well generate novel ideas and breakthroughs, but the bottleneck will be testing them.


Chapters
LMArena, initially a Berkeley project, arose from the need to evaluate the Vicuna model. It started with student-based evaluations, then leveraged GPT-4 as a judge, and finally evolved into a platform with human-based evaluations and an ELO rating system to handle the dynamic nature of model comparisons.
  • LMArena was born from a need to evaluate the Vicuna model.
  • Initially, student evaluations were used, then GPT-4.
  • It evolved into a platform with human-based evaluations and an ELO rating system.
  • Handles the dynamic nature of model comparisons.

Shownotes Transcript

Ion Stoica has an incredible background. He's the co-founder of Databricks, Anyscale, and now LMArena, a company that's raised $100 million to help other companies with evals.

He's also a professor over at Berkeley. And today I was joined by guest host Rob Toews, general partner at Radical Ventures. Ion, Rob, and I talked about a bunch of things. We hit on LMArena, Ion's new company, and where he thinks the opportunities are. We talked about the future of AI infrastructure and where the gaps are in the space today. We also talked about what the US can learn from China to better improve open-source model efforts within the country.

This was a super fun opportunity to talk with one of the most brilliant minds in computer science about everything AI. I think people will really enjoy it. Without further ado, here's Ion. Well, thanks so much for coming on the podcast. Really appreciate it. Thanks for having me. Yeah, been looking forward to this one. And we were joking beforehand, we timed this recording pretty nicely. I think we'll start with obviously the big news announced yesterday, the launch of your new startup, LMArena, and the $100 million fundraise. Tell us a bit about the company, the product, the vision. What are you all building?

Yeah, indeed. So LM Arena is based on a project we had at Berkeley that started almost two years ago. Well, two years ago.

And actually that was driven by a need to evaluate a model we released, I think in March 2023, called Vicuna. It was basically a fine-tuned LLaMA, the first version of LLaMA, using the ShareGPT data, if you remember it, which was people sharing their conversations with ChatGPT on ShareGPT and someone making them publicly available. Okay.

So we did that, and Vicuna was a few students. Actually, the students did it without me even knowing. I realized later that we could fine-tune this model.

And then the question was about evaluating it, right? Of course, you had a few benchmarks back then, like today there are a lot of benchmarks. But it was also a little bit harder because this was conversational, it's chat. So the question was how do you evaluate it, how do you show that you are meaningfully better than what was out there at that time.

And our initial

effort was, what do you do? You're at a university: you buy some pizza, get some students, ask some questions, and have them evaluate the different models. The problem was that it didn't scale. - I thought it could scale pretty well on a campus. - Yeah, yeah, but you know. And then we actually did a detour. It was two weeks before GPT-4 was released.

So we thought, "Okay, why don't we use GPT-4 to evaluate?" And that was what eventually became known as LLM-as-a-judge.

And to our surprise, it performed pretty well. We looked at that and did it, and we got good numbers for our model. But people were still asking questions, because using an LLM to evaluate was very early. We were the first, or maybe among the first, to do that. I'm saying among the first because I think some of the companies did it internally before, maybe Microsoft and so forth.
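To make the LLM-as-a-judge idea concrete, here is a minimal sketch of a pairwise judge, assuming the openai Python client and an API key; the judge model name and prompt wording are placeholders, not LMArena's actual setup. It randomizes the order in which the two answers are shown, which helps with the position bias discussed later in the conversation.

```python
# Minimal LLM-as-a-judge sketch (illustrative; not LMArena's actual code).
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment;
# the judge model name and the prompt wording are placeholders.
import random
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are judging two answers to the same question.
Question: {question}

Answer A:
{a}

Answer B:
{b}

Reply with exactly one letter: A if Answer A is better, B if Answer B is better."""


def judge(question: str, answer_1: str, answer_2: str, judge_model: str = "gpt-4o") -> int:
    """Return 0 if answer_1 wins, 1 if answer_2 wins."""
    # Randomize which answer is shown first to reduce position bias.
    flipped = random.random() < 0.5
    a, b = (answer_2, answer_1) if flipped else (answer_1, answer_2)
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
    )
    verdict = resp.choices[0].message.content.strip().upper()[:1]
    # Undo the random flip so 0 always means answer_1 won and 1 means answer_2 won:
    # if the judge picked the first-shown answer ("A") and we had flipped, the
    # real winner is answer_2.
    return int((verdict == "A") == flipped)
```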

But people were still asking: okay, yes, I see that for this example it looks like it's doing well, but still, how does it compare with people?

That's why we started Chatbot Arena, trying to scale up human evaluation. That was one reason. The other reason was that it's very hard to evaluate these language models. Even then, there was already evidence of contamination. Basically, GPT-4, I remember,

was doing very well on some kinds of problems from before the training cutoff but not doing very well on ones from after that. Right?

So these were static benchmarks, right? And if you think about people, a static benchmark is almost equivalent to taking the same exam over and over again. With humans, we make the exams different each time. So that's the second reason; these are sort of the two reasons, right? We wanted to

have a benchmark which is not static but dynamic, and which also captures the preferences of the users, of the humans. Right? Those are the two reasons.

And then we tried to figure out how to do that. There are multiple ways to do it. You can do a bit of a tournament, which is what the students were doing before: you give a question, a prompt, you have these large language models, whatever, three or four of them, you compare the answer from each of them, and you rank them.

Unfortunately, a tournament like that doesn't scale well. And not only that, it assumes that while you are running the tournament, the number of models is fixed. Again, we always look for inspiration in how humans do evaluation. So the question was, okay, where are there similar situations in which

not everyone plays everyone and the number of players is dynamic? One is obviously chess. But many other sports do that too, like tennis with the ATP, and also team sports. So that's the Elo rating, and an Elo-style rating is what we used. And then the model was very simple. You come, you ask your prompt.

We provide answers from two randomized, anonymized large language models. Then you can pick the one which you believe is better and you may vote. It's not mandatory to vote, but you may vote. We get all this data and we compute these ratings. Right. Obviously, it became wildly popular. Why did you decide to make it a company? Yeah.
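For readers who want the mechanics, here is a minimal sketch of an Elo-style online update driven by pairwise votes like the ones described above. It is illustrative only: the real leaderboard uses a more careful statistical fit (a Bradley-Terry-style model with confidence intervals), and the model names and K-factor here are made up.

```python
# Minimal Elo-style update over anonymized pairwise votes (illustrative only).
from collections import defaultdict

K = 4.0       # small K-factor: many noisy votes, so update gently
BASE = 400.0  # standard Elo scale

ratings = defaultdict(lambda: 1000.0)

def expected_win_prob(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo/logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / BASE))

def record_vote(model_a: str, model_b: str, winner: str) -> None:
    """winner is model_a, model_b, or 'tie'."""
    score_a = 0.5 if winner == "tie" else float(winner == model_a)
    e_a = expected_win_prob(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Example: a handful of hypothetical votes between two made-up models.
for a, b, w in [("model-x", "model-y", "model-x"),
                ("model-y", "model-x", "tie"),
                ("model-x", "model-y", "model-x")]:
    record_vote(a, b, w)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```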

To their credit, you know, Wei-Lin and Anastasios, who work on this, believed it was a company before I did. I said, you know, this is just evaluation, it's just a leaderboard, what company is there here, right? But to their credit, they really believed in it. And then I think you get to a point, and of course there are so many things you need to do,

and it becomes more popular. Some language model providers, like the frontier labs, started to evaluate their models, also asking us to evaluate them before release.

So it becomes bigger and bigger and people start to ask, "Okay, what about categories?" People started to ask, "Well, what about..." You know, humans are very subjective, right? "Hey, this model gets a higher rating because it shows more emojis." So we introduced style control to try to factor out some of these things.

So it kept growing, and then it started to grow along another dimension: it started with text and then went multimodal, right? Text-to-image, image-to-text, text-to-web to generate code, and many more. So it grew into an evaluation platform, and with the data we got, we started to realize it's quite valuable in terms of what you can do with it.

And I'm not sure you've seen it, but one of the features we launched a month or two ago, something like that, is called Prompt-to-Leaderboard. What it is, is that you come and you give your prompt. I may never have seen your prompt, but I can give you back a leaderboard for your prompt.

What are the best models to answer your prompt? What is their estimated rating? The way you do it, I won't describe it very precisely, but intuitively I think it's reasonable: although I have never seen your prompt, I have seen a lot of prompts which look like your prompt.

And then I can use the votes on these similar prompts as a proxy to estimate the rating of the models on your prompt. That's kind of exciting. And there are many other questions. You start realizing there is potential here. So first of all, we made it a company because there was no way to scale what we were doing. It's costly, because what we provide users is free access to these powerful models.
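Here is a toy sketch of that Prompt-to-Leaderboard idea: rank models for an unseen prompt by borrowing votes from similar past prompts. The word-overlap similarity, data format, and weighting are stand-ins for illustration; the production system presumably uses learned embeddings and proper rating estimation.

```python
# Toy Prompt-to-Leaderboard sketch: score models on a new prompt by borrowing
# votes from similar, previously seen prompts. Purely illustrative.
from collections import defaultdict

def similarity(p1: str, p2: str) -> float:
    """Crude word-overlap (Jaccard) similarity; a real system would use embeddings."""
    w1, w2 = set(p1.lower().split()), set(p2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

def prompt_leaderboard(new_prompt, history, k=50):
    """history: list of (prompt, winning_model) pairs from past votes."""
    # Take the k most similar historical prompts and weight their votes by similarity.
    nearest = sorted(history, key=lambda h: -similarity(new_prompt, h[0]))[:k]
    scores = defaultdict(float)
    for past_prompt, winner in nearest:
        scores[winner] += similarity(new_prompt, past_prompt)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Fabricated example votes for two hypothetical models.
history = [("write a haiku about the sea", "model-x"),
           ("write a short poem about autumn", "model-x"),
           ("debug this python function", "model-y")]
print(prompt_leaderboard("write a poem about the ocean", history))
```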

Over the past year alone we spent maybe close to two million dollars to run this. Of course, a lot of that comes in the form of credits, gifts, grants, things like that. But if you think about scaling 10x or something like that right now, clearly it's not going to be cheap.

The other thing was that, like I mentioned, there are a lot of requests to do more kinds of evaluations. Right now you have agents and all of this. We have search now; you can evaluate the search products. Not the search engines per se, but things like Perplexity and

OpenAI's search and things like that. And then you really need to build a scalable backend, and something new, a much more responsive UI and UX, all of these things. There is no way you can do that with a few people. And then you have model providers and other people, both open source and proprietary, asking for evaluations.

And until recently we had one student, Wei-Lin, basically

doing that, building the backend, handling the requests, all of these things. And then you see that, again, once you have this kind of data, there are a lot of questions you can answer. Like I mentioned with Prompt-to-Leaderboard, maybe you can answer questions which many people have: I built my application on a particular model, and now I want to swap the model.

Maybe a new one is available, maybe something cheaper. But what will be the impact of swapping the model on my application? Many, many questions like these. But what convinced me about

the potential here, and us in general, maybe not only me, is that if you look at AI, I think one of the main challenges today with AI is reliability. That's a key question, and it's very hard to see how AI will really achieve its potential without figuring out how to build more reliable applications.

Now, if you look at software, a lot of what we do is about reliability, right? Actually, most of the energy we put into software development is about reliability. It doesn't take a lot of time to write the code that provides a feature or a service, but you spend a lot more time testing and debugging it

and so forth. And there's a lot of innovation there: you have CI/CD, all of these things. And that, in some sense, is an easier problem because the software is pretty much

white box, right? Here is the instruction; you change the instruction, you can see the state of the program after each instruction. If you don't want to use a debugger, you can do printf and things like that. But models are more of a black box, right? So if you look at the analogy, we still want reliable applications, because even if you look now, the most successful AI applications are the ones which have a human in the loop,

for the same reason, to validate the answers: code assistants, customer support, and many others. So if you follow the analogy, you have to have ways to test, validate, and build more reliable AI applications. Again, a big problem. And obviously there are many things you need to do, but...

Also, something like LM Arena will help. Super interesting, Ion. It makes a lot of sense. And I think as you've touched on, evaluations and validation of reliability for AI systems is one of the biggest unsolved challenges. I think it's a massive opportunity you guys are tackling. Is your vision and is the thesis of the company LM Arena that

human-based evaluations will always be the best way to go and that needs to be a core of it? Yeah, that's a great question. I'm happy to answer that. That's why we have our guest host, to ask the great questions. Yeah, so I think it's very interesting here. We have a lot of discussions on this, and you've probably seen them and so forth. I think it's one

evaluation. There are many other evaluations, and you should look at many evaluations, many benchmarks. However, as we discussed earlier, most of the applications today have a human in the loop. So this makes human evaluation particularly important.

And there are people saying, well, like we mentioned with style control, people like that model not because it's better but because it's funnier, or because of the emojis, or something like that.

And the answer to that is, well, first of all, even that is relevant. If you build an application which interacts with humans, you want to know that. And by the way, you know,

we as humans, right, we like people who give better presentations. All sorts of biases. All sorts of biases, right? So you want to understand that. So that's very legitimate, right? And of course, the other thing is that if you have enough data and you know a bias, you can factor it out.

This is what we do with style control, for instance with formatting. If an answer is formatted more nicely, you like it better. You try to factor that out; you can pick style control, and maybe style control will be the default in the future. You factor out some of these things.
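A minimal sketch of how style control can work, assuming the approach is a Bradley-Terry-style logistic regression with extra style covariates: style effects (for example, answer-length differences) get their own coefficients, so they stop inflating the model-strength estimates. The feature choices, data, and fitting details below are illustrative, not the platform's actual implementation.

```python
# Sketch of style-controlled pairwise-preference fitting (illustrative only).
import numpy as np

def fit_style_controlled_bt(X_model, X_style, y, lr=0.1, steps=2000):
    """
    X_model: (n_votes, n_models) matrix, +1 for the model shown as A, -1 for model B.
    X_style: (n_votes, n_style) style-difference features (A minus B), e.g. answer length.
    y: 1 if A won the vote, 0 if B won.
    Returns (model_strengths, style_coefficients).
    """
    X = np.hstack([X_model, X_style])
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted P(A wins)
        w += lr * X.T @ (y - p) / len(y)   # gradient ascent on the log-likelihood
    n_models = X_model.shape[1]
    return w[:n_models], w[n_models:]

# Tiny fabricated example: 2 models, one style feature = normalized length difference.
X_model = np.array([[+1, -1], [+1, -1], [-1, +1], [+1, -1]])
X_style = np.array([[0.8], [0.5], [-0.6], [0.1]])
y = np.array([1, 1, 0, 1])
strengths, style_coefs = fit_style_controlled_bt(X_model, X_style, y)
print("model strengths:", strengths, "style effect:", style_coefs)
```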

And then, you know, you have a platform in which you can try to remove certain biases. Of course, you can only remove the biases you know exist, right? That's a problem. But definitely. By the way, there is a funny thing about the biases. I mentioned that

when we used GPT-4 as a judge early on, and then built, you know, Chatbot Arena, one of the reasons was to answer the question of how well it works and how it compares with GPT-4, because people were asking. We wrote a paper and we did

a study. And the funniest thing is that these LLMs also have biases as judges. For instance, they have position bias: in general, they prefer the first answer. They have verbosity bias: they prefer the more verbose answer. At that time, maybe they are better now, they were not very good at math, like people, right? So there are a lot of biases, like

liking their own answers, right? Or answers from models in their own family, right? Like, a Llama model as the judge is going to like other versions of Llama models. And that's very interesting, because these models, maybe not surprisingly, are

trained on artifacts, on what is produced by humans. It's interesting. I feel like, obviously, evaluations, any startup you talk to, this is one of the top things they're thinking about. A question has always been, as people have been building tooling in this space, how generalizable can you make an evaluation tool for a company? Because in some sense, the most valuable thing companies have is

their own evals that are specific to their own use cases. How do you think about providing a general set of tooling to companies in very different end domains? That's a good question. I don't think I know the answer to that question. Good, I don't either. Things like what I mentioned, like Prompt-to-Leaderboard, are going to help. Because in that particular case, again, we haven't seen your

prompt or your question before. But we can still tell you, within statistically meaningful margins, which are the best models for that question. So that can generalize easily. You give me your set of data and I can give you

the best models for your set of data. You can probably also do personalized: what are the best models for you, right? So that's why the key for this is to scale, to get more data. The more data you have, the more of these kinds of micro-categories you can have. Yeah.

Makes total sense. What do you think happens to the model landscape over the next few years? Maybe starting with the most state-of-the-art closed-source models. Yeah. I think, really, if you look at the trend right now, the open-source models have caught up, and the progress over the past year has been quite impressive. Of course, it was...

not "of course," but maybe surprising, and another surprising thing is that these models are not necessarily coming from the US but from China. Definitely at this point, if you are talking about open-source models, the best are coming from China. And if you look at the trends, open source should be able to catch up with proprietary models within a year or so.

Now the question is where these open-source models are going to come from. What I can say is that in China there is a lot of momentum, I think for structural reasons, because of

how the ecosystem has grown in China versus the US. So I think that we in the US are at a bit of a disadvantage. - Say more about that? Why is that? - It's structural. So first, think about what you need to develop these models, right? You need three things: you need experts, you need data, you need infrastructure, right?

And I think that if you look at China, they have a lot of experts; in sheer numbers, probably more than the US. Data, they have data. They don't have as much infrastructure, but they are making progress, if you look at the latest announcements from Huawei and so forth. Export controls, everyone guesses about how effective they are going to be, or are. That's open

to debate. But one thing there is that open source is much more prevalent; it's almost the default. Now, if you go to the US, you have huge resources and lots of smart people. The problem is that the development, what happens in the US, is siloed. Right? It's

you know, these frontier labs, right? Everyone is doing the same thing, basically. That's number one. Number two is that academia, when you talk about pre-training and building the models, doesn't play a major role because of a

lack of resources. You know, there are organizations like AI2, there is a group at Stanford around Percy Liang, and we at Berkeley are trying to develop these fully open-source pre-training and post-training pipelines and so forth. But it's a challenge. And unless that changes, we are going to be at a structural disadvantage, because the diffusion of innovation is very limited.

If you want to maximize the rate of progress, you have to have all the smart people collaborate. But if you have these silos, and a big part of the research community, I'm talking about academia, is not able to contribute in a meaningful way to developing these models, then

it's a significant disadvantage. The main diffusion of innovation here in the US is that people who leave one company go to another or start new companies. But if you look, for instance, in China, there is a much stronger collaboration between academia and industry, you know, like ByteDance or Alibaba and obviously DeepSeek. And

That really helps them. What is the reason academia is so much more effectively involved in China? How does that come about? Because there is much closer collaboration with industry. Here it's very hard to collaborate with a frontier lab because everything is secret. If you really want to maximize the rate of progress, again, you have to have all your researchers, your experts, collaborating. The only way they can collaborate is through shared artifacts, which means, on the software side, open-source models,

and they also need to have shared infrastructure, right? I mean, obviously the motivation behind a lot of the closed labs is this belief that, being the first to build these state-of-the-art models, there are all these dangers of what could be done in bio or cyber, and they need to get there first. What do you make of that? Look, I am optimistic in general. I am not...

I think you always need to remember that as humans we are driven by emotions, and by far one of the most powerful emotions is fear. So you are always going to respond to that much more strongly, right, than to some optimistic view, which is not as palpable. Whereas the other one is, you know, something that could

kill you, right? So you are going to react very strongly. So I think, to start with, whenever you see this kind of discussion, you need to discount the negative one, because as a human you are going to be much more prone to respond to that. Just to keep a little bit of objectivity.

The other thing I would say, and we talked about this with SB 1047 last year and so forth, is to think about the marginal risks, right? Marginal risk is risk which was not present before but is enabled by this technology. And right now, I have yet to see real marginal risk enabled by AI.

It makes previous risks more prevalent, maybe much worse. But if you talk about deepfakes and so forth, I mean, you could do that before with Adobe and so forth, and impersonating people goes all the way back to antiquity, right? So that's one. Obviously, people say, oh, it can tell you how to build a bomb.

So first of all, what it tells you, it maybe makes it easier, in the best-case scenario. But if I go to any library and so forth, I should be able to find that information if I am determined. So it makes this much easier, maybe say 10 times easier, maybe 100 times easier. But if you look end-to-end at what you need to do to make that a reality, acquiring the knowledge is a very small part.

Right? Then you need to get the materials, you need to assemble it without anyone detecting, you need to deliver it, or something like that. If you look at the rest, that's what dominates. So yeah, the 10% that acquiring the knowledge was before, say I take it down to 1% or 0.1%.

Yeah, sure. But the rest is not going to get better, right? So when you think at this level of detail, it doesn't convince me that this is an existential risk and so forth. Of course, there are things you cannot imagine, granted.
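The back-of-the-envelope arithmetic behind that point, using the illustrative 10%/10x/100x numbers from the conversation rather than real estimates:

```python
# If acquiring the knowledge is only ~10% of the end-to-end effort, making that
# step 10x or 100x easier barely changes the total; the other steps dominate.
# Numbers are the illustrative ones from the conversation, not real estimates.
def total_effort(knowledge_share=0.10, speedup=100.0):
    rest = 1.0 - knowledge_share
    return rest + knowledge_share / speedup

for s in (1, 10, 100):
    print(f"speedup {s:>3}x on knowledge step -> total effort {total_effort(speedup=s):.3f} of original")
# 1x -> 1.000, 10x -> 0.910, 100x -> 0.901
```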

You mentioned physical infrastructure as one area where China is still lagging and export controls are a piece of that and so forth. In the US, obviously, and in the West more broadly, a massive physical infrastructure build-out is underway and all of the big hyperscalers are pouring tens of billions of dollars into building out massive data centers, one gigawatt, five gigawatt data centers, etc.

Do you think that's the right approach and that's a good trend to be happening in the US? Do you think there's some risk of overbuilding infrastructure? I'm curious how you think about that. I think overbuilding is very likely; that's what happened with the internet. We overbuilt the internet, and a lot of the companies doing the overbuilding went under. But there were a lot of other companies which really took advantage of that infrastructure, like

Google and Amazon and so forth, right? So very likely something like this could happen now. Clearly, look, it's much easier to get GPUs today than it was a year and a half ago, okay? For one, that's a fact. But I do think that

China is not going to stay behind; it's going to take two years, three years, I don't know, but they are going to build. And the advantage is that they have an economy which is in the same ballpark, whatever numbers you look at, as the US economy, right? And the other thing is that they have the ability to fund strategic initiatives for many, many years, decades if needed, right?

So that maybe gives them a structural advantage there. So we'll see. We'll see. And they also have a lot of experts. Like we saw with DeepSeek, right? They are going to do optimization at the lower levels. And some people say, oh yeah, but if we force them to spend their intellectual cycles doing that optimization... Sure, but if you have enough resources, it's going to work out for you. But, you know, look...

And look, it always amazes me that people have this kind of belief, this confidence that we are always going to be ahead. But there are many, many high-tech industries in which we are no longer leading, like solar cells, car batteries, drones, electric cars, robots, at least industrial robotics.

Maybe transitioning topics slightly. You have co-founded and helped build many defining infrastructure companies over the years, software infrastructure companies: Databricks, Anyscale, now LMArena. Aside from the evaluations topic, which obviously you guys are focusing on with the new company, what do you think the biggest opportunities and the most important unsolved issues are at the infrastructure layer for AI right now? Again, you look...

You look at the trends, and clearly right now, with what every lab is doing, we are seeing a lot of the infrastructure evolving toward being very vertically integrated, co-designed across all the layers, all the way from the application down to the hardware. And I think you are going to see more and more of that. So, again based on the trends, I'll say where I think there are

some opportunities. Clearly, one thing is that you have these distributed, heterogeneous infrastructures, right? At the accelerator level you have GPUs, NVIDIA and AMD, and you have many other accelerators like TPUs, Trainium, and others. Then at the networking level,

you have Ethernet, you have InfiniBand, and then at a slightly higher level RDMA. So you have a lot of this. And then collective communication libraries like NCCL, RCCL, and so forth. So huge, huge heterogeneity, and you need to master that. And one thing I would say, where there is quite a bit of work and I'm hopeful, is automatically generating

optimized code kernels for these accelerators. I think that hopefully it will happen. That will make it much easier to support and optimize for a large variety of hardware and networking. I think you are also going to see a lot more optimization at the intersection between networking and compute, right? Fine-grained

overlapping of communication with computation and so forth. Obviously load balancing is going to be very important. So that's one. And when it comes to optimizing these kinds of models and workloads, it's so complicated, right? If you look at the kinds of parallelism for model serving and training, the last time I looked there were like seven:

data parallelism, model parallelism or tensor parallelism, pipeline parallelism, context parallelism, token parallelism, expert parallelism, sequence parallelism; I may have forgotten one, but I think that's seven. So you need to do this in an automated way, right? Because there are too many combinations, and these are very

fine-grained trade-offs, again between communication and computation, because you want to overlap computation with communication to increase GPU utilization. Obviously we'll see what happens at the agentic level.
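A small illustration of why composing these parallelism dimensions calls for automation: the degrees have to multiply out to the number of GPUs, and every added dimension multiplies the space an auto-tuner has to search. The dimension names follow the list above; the numbers are made up.

```python
# Sketch: parallelism degrees must multiply to the GPU count, and even a brute
# force over just three of the seven dimensions already yields many candidates.
from dataclasses import dataclass
from math import prod

@dataclass
class ParallelismPlan:
    data: int = 1
    tensor: int = 1
    pipeline: int = 1
    context: int = 1
    expert: int = 1
    sequence: int = 1

    def world_size(self) -> int:
        return prod([self.data, self.tensor, self.pipeline,
                     self.context, self.expert, self.sequence])

def valid_plans(num_gpus: int, max_degree: int = 8):
    """Enumerate plans whose degrees multiply exactly to num_gpus (three dims only)."""
    degrees = range(1, max_degree + 1)
    for d in degrees:
        for t in degrees:
            for p in degrees:
                plan = ParallelismPlan(data=d, tensor=t, pipeline=p)
                if plan.world_size() == num_gpus:
                    yield plan

print(len(list(valid_plans(64))))  # many valid options even with only 3 of 7 dimensions
```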

What do you think the infrastructure needs will end up being for agents? I feel like there have been some early attempts at broad frameworks and other areas, and it feels like the model companies themselves are building lots of things. It's hard, because when you have a field which moves so fast, it's very hard to come up with good frameworks which are going to be stable over time, right? Because it's just...

The needs change every month, every week, every day. So that makes it difficult. Typically, you start to be able to build good frameworks or good software abstractions when the speed of evolution at the application level is slowing down. Is that ever going to happen, given model progress? Yeah.

Well, look, at the different layers it does happen. I mean, now everyone is using transformers. Everyone is using PyTorch, right? Almost. There are things like...

When you talk about inference, we are more or less using the OpenAI API right now. So yeah, I think you are seeing that kind of standardization at the lower levels. If you look, for instance, at the many new post-training frameworks, every day you get another one, but most of them are built on Ray, vLLM, and maybe one or two others.
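As an illustration of that standardization, many inference servers (vLLM among them) expose an OpenAI-compatible endpoint, so the same client code can target different backends just by changing the base URL. The URL and model name below are placeholders for whatever you are actually running.

```python
# Same client code, different backend: point the OpenAI client at any
# OpenAI-compatible server. base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="my-local-model",  # whatever model the server has registered
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```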

Yeah, so I think you see at different points some kind of standardization, but still it's

pretty dispersed. Since the generative AI explosion, I feel like it was this huge moment for Databricks, and in many ways you guys have met it head on. Reflecting back on the last two and a half years, what do you think the company got most right in the immediate post-ChatGPT moment? And maybe, if you could do it over again, something you might have done differently? I think one thing we got right is that data is as important as ever.

So getting access to all your data in a seamless and performant way, while having all the governance on top, is very important. Right? And this is what we've done with the Lakehouse, now with Unity Catalog. I think that's the key, because if you are a company, an enterprise, a large enterprise, you are going to have the data in a myriad of storage systems, right? Legacy or newer data.

You know, it used to be data lakes and things like that. So uniformly accessing the data, and not only that, having the metadata associated with the data you are going to access, is super, super important. Okay? So I think we got that right, and you see it even today; that's why our

main conference is called Data + AI, right? Data intelligence is a new category. And you'll see there is still a huge push in that direction.

I think the other one, also after acquiring Mosaic and so forth, is that very early on we were very aggressively pursuing AI for enterprises, right? And it's a perfect fit, right? Because enterprises have the data. In general, the data is one of their big

crown jewels, right? It's unique. And then you help them extract a lot of value from the data and build new products powered by AI on that data, right? So that's one thing. Now, one thing I want to mention here, and people may not realize it, is that

AI was in the DNA of Databricks. When we started Databricks with Spark, one of the main libraries on top of Spark was a machine learning library: it was MLlib and then Spark ML. Of course, it wasn't deep learning at that stage; it was classic machine learning, random forests and linear regression and things like that.

But that's one. And early customers were actually buying Databricks products and Spark because they wanted to do AI, right, early on. So in some sense it's coming full circle.

It's hard to know what you would have done differently. Of course, we tried to build a pretty high-performance model, DBRX, right? If you go back, whether you would do that again is a question, given how many powerful open-source models have been released. Of course, I cannot talk about everything here, but for the last two years and so forth, I think that

I don't think I would go back and say we should have done things differently. And of course, you'd need to talk with Ali and so forth; he knows this much better than I do and would be better placed to answer these questions. You're so close to all this cutting-edge stuff in the AI world. What's one thing you've changed your mind on in the last year? Maybe the way I would answer the question is: what did I think would happen that didn't happen? Because by definition, you change your mind a little bit on those.

I thought we were going to see more alternatives to NVIDIA, right? And we haven't seen that yet. That's one. The other thing is I was pleasantly surprised by the progress of open source. Now, it came from where, more than a year ago, I would not have expected: it came more from China rather than the

US. I thought we would make more progress on reliability, hallucinations and so forth, but that still remains a problem, especially with the reasoning models. Certainly reasoning models, or rather post-training, turned out to be more effective than I thought. And one particular thing I changed my mind on: I was not very

big on quantization early on, because quantization makes a trade-off, right? You take a small hit in quality to get better efficiency, and I thought people would not be willing to make that trade-off. But that's not how it turned out.

And quantization, without question, has been very successful and a game changer. Yeah. There's a lot in there that you said that's interesting. I guess to go back to your first one: do you still think that over the next few years real challengers to NVIDIA emerge? Or has what's happened over the last year changed your mind on the viability of that? I think there could be. Certainly China, and Huawei, will probably put in a lot of effort, because they don't have any other choice if

things do not change, or if the export controls get worse or things like that. That could happen. And clearly we see a lot of investment: Google has always invested in TPUs, AWS is making a lot of effort on Trainium, and then there's AMD, with very good hardware. I think the biggest challenge, as you know, is software. The software stack is

one of the biggest challenges. We'll see. I'm hopeful, but on the software layer, well, you never know. On one hand, if you have many more choices, you need to put a lot more effort into the software to support those different choices. Continuing with the crystal-ball gazing a little here, I'm curious what your view is on our current trajectory toward superintelligence, the AGI timeline, if it's even a coherent concept. How do you think about that debate?

You know, I used to joke that with AGI everyone will be right, because there is no good definition. So I can say I'm right, because whatever I call AGI is AGI. I don't know. And what is AGI? What people say is that you have one artifact that is going to do the majority of tasks better than humans.

But as for how that artifact will be built, you could probably build something like that even today, or within very few years. But I don't think that's intuitively what people mean by AGI. If you look historically, there is an increasing number of tasks at which computers are better than humans. Think about calculators; that was the 60s or 70s. They have been better than humans for a long time.

Then playing games, like chess, right? That was Deep Blue, '97 or something. And then it was obviously Go, 2017. And then you have ImageNet and all the convolutional networks for image recognition, which were arguably better than humans. So you always have these kinds of things adding up, right? Maybe proofs and so forth next.

You're going to have more and more tasks at which computers are better than humans, and you can package them in one artifact. Depending on the task, you invoke different things under the hood, but from the user's perspective it's the same. Although some people will not call that AGI. So I'm not going to predict; I don't know, right? But one thing I was going to say, and this goes back a little to the discussion about reliability,

is that you see progress where you have good validation, good tests for the answers, where you have ground truth. That's what you see. Okay? Like the calculator, right? There is only one right answer. If you look at games, it's very easy to test whether you're successful or not: you win or lose. Also, the rules are pretty clear.

If you think about reasoning models, where are they more successful? Problem solving, coding: where you have ground truth. But this is natural; it shouldn't be a surprise. Because if you look, forget about computers, at where progress was made in the history of humanity just over the past, say, 200 years, where did it happen?

It happened in the sciences and engineering, where you have measurable outcomes, right? Chemistry, mechanical engineering, electrical engineering, physics, and so forth. The world, for these sciences, is very different today than 200 years ago. Where things are not as measurable, like novels, books, creative writing, arguably you don't see the same progress. Actually, some people will argue it went the other way around.

So that's what I'm trying to say: for measurable tasks, for use cases with measurable outcomes, you are going to continue to see very rapid progress. For the other ones, the more subjective ones, it will be slower. Do you think we'll see a proliferation of what measurable tasks are? I mean, obviously a lot of people in the labs seem to think you can build reward models for

almost any domain. Yes, but there are things which are going to be more objective, what is true or false, right? You can have formal specifications and so forth. And there are things which are going to be more subjective. Because what is a good book, right? How are you going to develop a reward model for that?

Maybe you'll have your personal reward model; maybe someone can write a book for your model, your own model. So yes, you can do that, but reward models are also not as efficient. For instance, say you are solving math problems, and you use reward models to learn what is a good or bad

result. Now compare that with having the ground truth, knowing which is the correct result, right? If you have the ground truth, it's much more efficient, maybe an order of magnitude more efficient in terms of the compute needed to reach a particular accuracy on whatever benchmark, AIME or whatever, for problem solving.

This is almost a philosophical question, but within the hard sciences, which are verifiable, do you think that in the current AI paradigm it's possible for AI to generate novel ideas and breakthroughs? Definitely. But there, the bottleneck will be testing them. Right?

Right, because now you are talking a bit more about reinforcement learning and so forth. Because that's how generative AI works: I generate a solution, and now you need to test it. So that's the bottleneck. It seems like AI research will be the first place. Yeah, AI research, yeah. Well, yeah, but then...

What does it mean to generate a good artifact, a research artifact? Because right now we have so many papers. In that sense, can you think of AI creativity as basically just generating a ton of ideas, most of which are awful, some of which are good, and then just being able to accurately... There are two parts to it. Sounds like most corporations. Those are two things. It's like...

Yeah, it's like brainstorming, right? You know, on one hand, even to solve a problem, this is the method. And many systems, all the way to AlphaGeometry and so forth, follow the same kind of pattern: you generate lots of solutions,

and then you select the good solution. So it has two parts, right? The first part is generating solutions. A condition for finding a good solution is that at least one of the generated solutions is good. And then there is selection. Selection is also very hard; it's a needle in a haystack:

from all this myriad of solutions you generated, you need to identify the good one. And that's very tricky, because even if your probability of identifying a good solution is, say, 99%, when you have to pick from a million solutions, with very high probability, almost a guarantee, you are going to pick a wrong one. Right? So those are the two things.
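The arithmetic behind that needle-in-a-haystack point, with illustrative numbers: a verifier that is right 99% of the time still produces thousands of false positives across a million candidates, so the one genuinely good solution gets lost.

```python
# A 99%-accurate verifier vs. a million candidates: the expected false
# positives swamp the single good solution. Numbers are illustrative.
def expected_false_positives(n_candidates: int, good: int = 1, fp_rate: float = 0.01) -> float:
    return (n_candidates - good) * fp_rate

def chance_good_pick(n_candidates: int, good: int = 1, fp_rate: float = 0.01) -> float:
    # Picking uniformly among everything the verifier approves
    # (the good one plus the expected false positives).
    return good / (good + expected_false_positives(n_candidates, good, fp_rate))

for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} candidates -> ~{expected_false_positives(n):,.0f} false positives, "
          f"P(pick the good one) ~ {chance_good_pick(n):.4f}")
```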

But yeah, I think that right now, even today, with a human in the loop, the best applications are the ones in which generating the solution is hard but verifying it is relatively easy. Well, it's been fascinating. We always like to end our interviews with a quick-fire round where we get your thoughts on a standard set of questions. So maybe to start, what's one thing that's overhyped and one thing that's underhyped in the AI world today? I think underhyped is reliability. People still discuss it, but not enough. Because if you believe that

it's one of the main challenges that will hold back the proliferation of AI and its ability to solve real problems. I think that's one. Overhyped, maybe a little bit, is all this scaling, maybe the scaling laws. Everyone now is looking for scaling laws.

Especially in post-training recently, we've seen that if you have a powerful base model, then with just a very small set of high-quality data you can unleash new capabilities in that base model. Another rapid-fire question for you: what's one AI startup outside of your area of focus that you're really excited about or bullish on? It's probably one of the most successful application areas already, but I'd still be very curious about

startups like Cursor or Windsurf, the code assistants, and how far they can push this. Because clearly they are great for some of the use cases, but when you have

to work with the context of an entire code base and things like that, it's much more difficult. And I think the other point there is that, for instance, once you generate more and more code, then the question is

how easy it is to maintain, right? Maintainability. But the reason I'm saying this is that obviously Cursor has made a lot of progress, maybe more than I expected, that's what I will say. And it's also interesting because there are still a lot of open, interesting questions in that space. And if they are successful, obviously they are going to have

a massive impact in that space. Now, the space is going to get quite crowded with the big players also entering. But in that space we are going to learn a lot, because in some sense it's the canary in the coal mine: code assistance is almost a perfect application, because you have developers, who are early adopters of technology.

So they are going to adopt it. The other thing is that these coding assistants fit into the existing workflow. So having early adopters and fitting naturally into the workflow are big, big advantages, which you are not going to see in other domains, like doctors or lawyers and things like that. So I think it's very interesting to see how

far this will go. Well, I'm sure there's all sorts of threads folks will want to pull on. You obviously are involved with so many interesting parts of the AI world. I want to leave the last word to you. Any place you'd point our listeners where they can go to learn more about you, the work you're doing?

So, you know, we are still doing a lot of work. A lot of work is happening at Berkeley, at the Sky Computing Lab, and obviously the companies all have their websites, good websites, presumably. But for the cutting-edge research we are doing, people should go to the Sky Computing Lab; we have blogs and everything. Amazing. Well, thanks so much. This was a ton of fun. Thank you. Thank you.

Bye.