Thanks for listening to the A16Z AI podcast. We have a fascinating and lengthy discussion for you today, so we'll keep the introduction brief. If you're familiar with the world of generative AI models, you're likely familiar with LM Arena.
The leaderboard and competition space created and managed by a team at UC Berkeley. What began with a focus on language models has since expanded to cover vision models, coding models, and more. And very recently, the team behind LM Arena announced they're starting a company to scale the project's reach and its impact. They want to amass a global community of AI users and use their collective experiences and ratings to make AI models more reliable and to help everyone find the right model for the right use case.
So without further ado, here are LM Arena founders Anastasios Angelopoulos, Wei-Lin Chiang, and Ion Stoica discussing the state and future of AI evaluation with a16z general partner Anjney Midha. They kick off by discussing the importance of mass-scale, real-time testing and evaluation, right after these disclosures.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com slash disclosures.
Yeah, that would be great. Sometimes I get asked, what's the last exam that AI should take for humanity? And it seems like that's the wrong question to ask. We should be asking, what's the real-time exam you want your AIs to be taking before they get deployed every hour, every second of the day? Especially as we start to get... I think one of the things that's emerging for me is that the arena is misunderstood partly because we're just early.
In AI. And so while benchmarks like MMLU and the idea of these static exams were useful three years ago, the future is about real-time evaluation, real-time systems, real-time testing in the wild. Now, one thing that concerns a lot of people is the reliability of these systems. When we start going from chatbots that are good at, let's say, companionship and more consumer use cases to mission-critical systems, defense, healthcare, financial services,
How will Arena have to evolve as we go beyond companionship or web dev to those kinds of mission-critical use cases? I think that's one of the very reasons we wanted to create a company to support this project and further scale the platform. So right now we are at a million monthly users. What if we scale it to five or ten million, or even more, to capture an even more diverse user base across different industries?
And in that case, we'll have the ability to really zoom in on all these different areas that people care about for the mission-critical tasks these models will be used for. You can imagine that as we scale, we can have micro-arenas for nuclear physicists, radiologists,
and so forth, right? And these experts are going to come there to get the best answers to their, again, research questions. So that's interesting. Is there a future where now that Arena is becoming a company,
You could see a scientific lab or a shipping company or a defense company deploy their own arena on their own infrastructure for their own users, on their own prompts. Many people have asked us for this already. So these would be sort of private arenas. Private evaluation. And it's worth saying, I think when people have these mission-critical industries in mind, they often are thinking about the factual nature of the responses and so on and so forth. But...
In reality, even in such industries, the majority of questions that people ask are subjective. Okay, so the mythology that in hard sciences or in mission critical industries, people just have like cut and dried questions and they just need like a retrieval and a lookup. That's completely false.
That's the very reason why these models are useful is because they allow you to sort of like interpolate between these weird questions and answer questions that are not fully specified and give responses that are sort of like geared to answer the question, but might not have a fully factual basis. Right. And they might incorporate factual elements through RAG, let's say, but there's a subjective nature to the response.
And that's a reality that everyone's going to have to live with. If these systems are going to be deployed in medicine and defense and so on and so forth, they're going to be deployed in places where the data is messy because that's where they're useful. Okay. Given that fact,
how are you going to make sure that they're reliable? Well, you need something like Arena. You know, at this point, Arena is hard to miss in the AI space, whether it's Grok 3 coming out and Elon putting it up for the bulk of the keynote, or Demis using the WebDev Arena scores to demonstrate how good Gemini is. It's sort of become the standard-bearer for evaluation and testing at all the big labs, right? But does that mean...
You guys have been helping them more than open-source labs or smaller labs? No, we work with model providers small and large. We work with basically anybody who wants to work with us, within our constraints, and try to be as helpful as possible. The fact is that part of the reason to build a company is that we don't want Wei-Lin having to serve all of these requests from people manually himself.
So it's been a challenge, but we try to scale as much as possible. And in fact, one of the things we help everybody do is pre-release testing of their models. So it's not just that we work together to evaluate the models that are released; we also try to be their release partners and say, hey, can we help you pick the models that do best on our user base and use that as a guideline for which models you should actually release to the world?
So that's the way our platform works. And that's getting us closer to something, as we've talked about: reliability is so important, and these subjective measurements are so important. How are we going to get to a world where there's a CI/CD pipeline, where people can test their models pre-release and make sure that they're doing well for all sorts of different, diverse people? Well, you need something like Arena to do that. And that's part of what the company is geared to do.
So what people do is they can come, they can test a bunch of different models. We do this with basically every provider that comes to us. And they can see, oh, which one's doing better or worse on the distribution of Arena users. And then they can use that information to help them decide which model to release. After they decide to release the model, it gets continually evaluated forever.
So that's where you get the freshness of the data. The model continues to be tested. And we're pushing towards a world where these subjective and human considerations are part of model developers' final release pipeline. So if I'm hearing you right, the more testing, the more reliable we should expect AI systems to get. So we should be... Like with any software system. This is one of the fundamental debates in this space, right? Is...
what is the measure of progress, the right measure of progress, in this space? And there's a body of work that tries to create exams, harder and harder exams. I've always found LM Arena interesting for two reasons. One is the opposite approach, which says let the wisdom of the crowd guide us. And two, let open source actually define the examination. I think this is quite important, where
if you get a group of experts in a room who decide what the right exam is for humanity, then inevitably that group's values get encoded in it, and we have no way for the rest of the world to use AI systems that are measured by a different set of values. So it's great that people do these expert evals. It's totally fine. It's orthogonal to what we do. I'm glad we have them. But at the same time, you have to ask yourself, what makes somebody an expert, right? What are they an expert on?
I think the whole world is moving in a direction against experts being the be-all and end-all of everything. Everybody actually has their own opinions, and everybody has their own point of view.
And in fact, there's so many natural experts in the world on all sorts of topics that they don't necessarily need a PhD in order to be really intelligent and high taste and have valuable opinions. And I think that's one of the things I'm proud of with Arena is it allows us to actually go and say, hey, where are the natural experts? And actually, can we find data-driven ways to identify them? What if we can go look and say, hey, this person here in this random part of the world is actually incredible at
coding and math. Their vote actually means so much. And their preferences are able to guide the future of AI. That's an amazing thing. And we hope to be able to scale it further. Okay, but I'm going to push back a little bit on this. I'm going to channel a few of the criticisms I've heard from experts, which is that, look, if you're an expert, you've been blessed with the brains, the resources, and so on to be a highly educated individual in your field.
we have a responsibility to guide humanity. It's our job to actually guide the masses. We should be defining what good human preferences are, because the masses actually don't know what's good for them. The everyday user, the layperson, prefers slop. Right? We've heard these arguments. Yeah. Is there a grain of truth, or how do you think about that? So I think a few things here. So one,
like Anastasios said, this alternative of creating hard exams is valuable. No question about that. The other thing I want to point out, and I took this kind of criticism about expert labeling to heart: I went to quite a few experts I know and respect and asked them, would you do the labeling? And
almost everyone told me, no, I don't have time. Right? So there is a question there about whether you're really going to get the experts. And I don't think so. You're getting some people from that area who are willing to do the labeling, but the best people are not willing. Fundamentally, they don't have time. Right?
Now, if we offer these people, and we are not doing this now, I'm talking about the future, a platform where their community comes and asks questions to push the boundaries, to help with their research and things like that, then first of all, you are going to get these people, right? Because they do that in order to advance their own research.
And we are also going to get their votes. So fundamentally, I do think we can get real experts in Arena. Again, I'm not talking only about today, I'm talking about the future. It even happens today. It even happens today. So you get real experts, right? The top ones. The answer is: unhirable people. Unhirable people. That's a great way to say it. The other thing I want to say is about the layman and so forth.
You fund a lot of AI companies at a16z, and you are on the board of many companies. What do these companies build? What is their product? Who uses their product? It's not the top experts. It's the layman, right? These are their users. That's how OpenAI makes money, and so do many others. So then shouldn't the evaluation take into account
the preferences of these users? The answer is obviously yes, right? Again, these exams, these benchmarks, are very important for understanding the capabilities of these models. No question about that. But they are not going to reflect, and MMLU is probably not as good at reflecting, the preferences of the users of these AI
products and services. I want to dig one level deeper here. One of the things we are really excited about is the question of why people vote the way they do. Who's voting left or right? Why are they doing it? On what kinds of prompts do they vote for one model or another? On what kinds of topics are which models better? Basically, can we decompose human preference into its constituent components?
Let's say you have a criticism. You say people vote based on slop: emojis are driving votes, and response length. There's a huge response-length bias. It's true that people preferentially vote for longer responses over shorter ones, even given the same content; it's a well-known human bias. This gets back into RL. Can we learn this bias and actually adjust for it, correct for it? And the answer is yes. That's why we're making style control the default. So what we developed is this method called style control, which
allows you to run not just, let's say, a Bradley-Terry regression, but to also include certain covariates that model the effect of style and sentiment on how people vote. And what you get when you fit this model is not just a prediction of preference, but also an understanding of why
people are voting the way they do. We're trying to target a causal quantity, which is the causal effect of, let's say, response length or sentiment, and so on and so forth.
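To make the shape of that concrete, here is a minimal sketch of a Bradley-Terry-style regression with a single style covariate, fit on a toy battle log. The model names, the data, and the length-difference covariate are all hypothetical, and the production style-control method has its own covariates and estimation pipeline; this only illustrates the core idea.

```python
# Minimal sketch: Bradley-Terry skills plus one style covariate (illustrative only).
import numpy as np
from scipy.optimize import minimize

models = ["model-x", "model-y", "model-z"]
idx = {m: i for i, m in enumerate(models)}
# Toy battle log: (model A, model B, winner, standardized length difference A - B).
battles = [
    ("model-x", "model-y", "a", +0.8),
    ("model-x", "model-z", "b", -0.3),
    ("model-y", "model-z", "a", +1.1),
    ("model-x", "model-y", "b", -0.5),
]

def neg_log_lik(params):
    """params = [skill_1, ..., skill_K, gamma]; gamma is the length coefficient."""
    skills, gamma = params[:-1], params[-1]
    nll = 0.0
    for a, b, winner, length_diff in battles:
        # P(A wins) = sigmoid(skill_A - skill_B + gamma * length_diff)
        logit = skills[idx[a]] - skills[idx[b]] + gamma * length_diff
        p_a = np.clip(1.0 / (1.0 + np.exp(-logit)), 1e-12, 1 - 1e-12)
        nll -= np.log(p_a if winner == "a" else 1.0 - p_a)
    return nll

def objective(params):
    # Small ridge penalty: keeps the fit stable and pins down the overall shift
    # (Bradley-Terry skills are only identified up to an additive constant).
    return neg_log_lik(params) + 0.01 * np.sum(params ** 2)

fit = minimize(objective, x0=np.zeros(len(models) + 1), method="BFGS")
skills, gamma = fit.x[:-1], fit.x[-1]
print("style-adjusted skills:", dict(zip(models, np.round(skills, 2))))
print("length coefficient gamma:", round(float(gamma), 2))
```

The fitted gamma captures how much a pure length difference shifts votes, so the skill terms can be read with that effect held fixed, which is the rough intuition behind style control.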
Okay, there's always more work to do to get closer to the actual causal estimate that we want. But if we continue to decompose human preference into its constituent components, what we're building is an ever richer evaluation that can tell us all the factors that go into a response, and how you can optimize for people's preferences while keeping style fixed. Let's say I want to remain concise
but maximize your preference for my responses. Okay, how am I supposed to do that? Not a lot of people have that information, but we're building the methodology that lets you do it. And the question is, can we disentangle style versus substance? That's it. There is an effect, you want to know about it, and the platform helps you measure what the impact is. Yeah. So this is another important area to dig into, I think. There was a moment where you guys decided that
it was insufficient to keep measuring the progress of the models on coding with the base design of the platform, which is just Chatbot Arena. And I remember seeing a launch, which was WebDev, I think you called it WebDev Arena, and showing up to a completely different interface. And then realizing that this was a pretty big change for you guys, right? Why did we need...
a new kind of arena to correct for that effect. Why was that necessary? I think that comes back a little bit to Ion's point. When you build a product, an AI product, you want to know how people use it. You want to understand why users prefer this over that. And in order to collect that kind of data, you have to build something like a product first.
And as we know, over the past few years, people have been building very different kinds of applications on top of AI beyond just chatbots. The chatbot is one of the most widely used interfaces right now for humans to interact with AI. But these days, people are applying these models to coding and to more tool use, agentic behavior, that kind of stuff, right?
So as the first step, we were thinking like, okay, how can we capture all these use cases? And then the answer to that is we have to build something that people can use again, the same environment that people can test in real time to give us real world feedback.
So that was the original idea. That was around last summer, just at the beginning of this text-to-web, text-to-app trend. At the very beginning was Claude Artifacts. That was the first. And we saw that, we were amazed by it, and then: how can we do evals for that? We have to give credit to Arian on our team. Yeah. Arian
had basically just joined the team, he was interning actually, and we were saying, okay, why don't we build something new? At that time Claude Artifacts was out: how can we do evaluation for that kind of application, text-to-web? And that was the idea. One small parenthesis: so far the three of us have been talking, but eventually the team grew quite a bit, and right now, I don't know, it's
almost 20 people, both graduate and especially undergraduate students, doing a lot of very exciting and interesting work to expand the capabilities and the reach of Chatbot Arena. So I just want to make sure it's clear that other colleagues of mine are involved here, like Joey and others. And I think the credit goes
to many more people than the three of us. But that's what we want to provide: whatever you're looking for, we hope you'll find the answer to it. Fundamentally, if you're looking at the leaderboard, there's only one thing that really matters, which is: do you care about the preferences of the community of people that come to vote on our platform?
That's it. That's what we measure. It's the only thing that we claim to measure. We don't claim to be an AGI benchmark. We are faithfully representing the preferences of our community. That's why it's so important to us that we continue to grow our community.
And we get a diverse community of all different people, experts, non-experts, artists, scientists. Different languages. Different languages. Everybody under the sun. We want to come into this platform to express their preferences. Because if we can get that to happen, it already happens to some extent. But if we can continue to grow it, what's going to happen is, again, in order for a model to do well, what needs to happen is new people need to come in and vote for it. And so if we can provide this lens into...
the preferences of the world. Things are moving, right? These preferences are changing. And we have a study, I don't know if you want to talk about it, on freshness, right? We always see
fresh prompts. It's not like we're going to see the same prompts over and over again, kind of saturated. That's not the case. That was the very beginning of why we believe Arena is fundamentally different. It's related to contamination: from the very beginning, Arena was trying to solve the contamination problem, the overfitting problem,
where people test models on a static benchmark. Or what people call overfitting. Yeah, what people call overfitting. So how do you overcome overfitting? You collect new data, right? That's how you overcome overfitting. And Arena is designed to collect new data every second. So all the questions are new, all the votes are new. And then we basically measure the difference between all these prompts, right? What does the distribution look like, and so on.
And then we conservatively estimate it's over 80%, something like that. Yeah, yeah. Correct. This study was done by another member of our team, Lisa, and it basically measures how many fresh prompts you see in one day compared to what you've seen in the past three months. Right.
And at a similarity threshold of something like 70 to 75%, over 70% of these prompts are fresh. 75, 80%. Over 80%, right? So a large fraction of prompts are fresh. And when we say fresh, we are talking about similarity, and the threshold is not very strict; it's not like they have to be identical. Yeah, just to dig one click deeper into what Wei-Lin said: what is overfitting?
Static benchmarks overfit. Why? Because, as Ion said earlier, you're giving the student the same test over and over. You have a model, you test it, you look at whether or not it's improved on a static dataset, then you find another model, you test it, and you pick the one that does better and better. And what ends up happening is that the test becomes meaningless, because you've seen it so many times that you've memorized the answers. That's what overfitting is. Chatbot Arena is immune from overfitting by design. It's always getting fresh questions.
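As a rough illustration of that freshness measurement, you can count a prompt as fresh when its highest similarity to previously seen prompts falls below a threshold. This sketch uses TF-IDF vectors and an arbitrary 0.75 cutoff as stand-ins; the 70 to 80 percent figures quoted above come from the team's own study and representation, not from this code.

```python
# Sketch: count how many of today's prompts are "fresh" relative to a history window.
# TF-IDF cosine similarity and the 0.75 threshold are illustrative stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = [
    "write a python function to reverse a string",
    "explain the difference between tcp and udp",
    "draft a polite email asking for a deadline extension",
]
today = [
    "reverse a string in python",                 # near-duplicate of history
    "plan a three day hiking trip in the alps",   # genuinely new topic
    "compare tcp and udp for video streaming",    # related but not identical
]

THRESHOLD = 0.75  # prompts whose max similarity stays below this count as fresh

vec = TfidfVectorizer().fit(history + today)
sims = cosine_similarity(vec.transform(today), vec.transform(history))

fresh = [prompt for prompt, row in zip(today, sims) if row.max() < THRESHOLD]
print(f"{len(fresh)} of {len(today)} prompts are fresh:")
for prompt in fresh:
    print(" -", prompt)
```

Run daily over the incoming vote stream, the fraction of prompts falling below the threshold is the kind of freshness number being quoted here.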
In order to do well on the arena, new users need to come and vote for your model. That's it. That means that users like it. One thing I've noticed is that the same researchers who often argue with me that arena is a terrible evaluation system, the leaderboard is not to be trusted, the scores are rigged, they're gameable,
have, one, tended to also celebrate when they're on top of the leaderboard. And second, I find that increasingly, especially for specialist arenas like WebDev, there's a natural tendency to just accept that this is a really good indicator of the underlying capabilities. Why is that? Why is WebDev Arena such a good proxy for actual performance improvements on a capability like coding, which is quite general-purpose, right? It's very counterintuitive.
Programming is actually a very general-purpose discipline and skill. And yet it seems like capabilities on it, in a very general way, are still captured well by a specialist arena like WebDev Arena. Why is that? Yeah, first, let me just say that I think all of these arenas have signal in them. It's not just WebDev Arena; it's just that people have more opinions on language.
So web dev is a little bit more objective. It's easier to look at the website you built and say this one's better than the other. There's a lot of signal in it. Also, it shatters the models, and what I mean by that is it can be very clear, immediately, that one model is way better than another on WebDev Arena. Bam, you see it, and it's like, why is that? Just the capabilities of the models. Maybe Wei-Lin can say more. I think it's just a much, much harder task, because you start from text, a description of a website,
and the model has to first understand the request and then build it, write the code, right? And the code has to satisfy certain requirements, say style requirements or component requirements, that kind of stuff. And then it has to compile, right? We basically run it live in the browser, connected to a sandbox, that kind of stuff. So there are a lot of parts the model has to get right in order to build
a website that people can really interact with. Fundamentally, it discriminates much better across models because it's a much harder problem. Very few get it right. So in reality, for critics who say these static exams are hard tests and Arena is an easy-to-game benchmark, WebDev Arena is a great example showing it's a really hard evaluation. It's a hard test that actually
proxies the real world better than some static multiple-choice test. Is that roughly right? Yeah, for sure. And every input from a user is for a real-world task; they're trying to build some real website. So it also measures something beyond an academic benchmark where we imagine what a user would do. This is really trying to approximate
the user intent, user preferences directly. I have to say, I also just like completely disagree with the foundation of the question. The like implicit assumption is that like chat is easy or that it's even easier than web dev. That's completely false. It's a completely naive perspective that people have on this.
Because it's hard to build something that people love. People are good at chat, but you like some people way better than others. It's subjective and everybody has their own opinions and the landscape is very rich. You might like a very different model than me. Somebody else, let's say a musician, might like a different model, a much different model than I do. And understanding all of those differences is really hard. And anybody who thinks that is gameable
is deluding themselves. To take that to its strongest form, then: isn't the right way for me to evaluate whether a model is good or not to let me generate my own leaderboard? You as a person? Absolutely. Yeah, and we should be giving you the tools to do that, and we're currently building them.
So this is quite profound. You see the world going to a place where everybody has their own personal arena? Absolutely. Absolutely. It should be personalized just for you. You should understand which models are best for you. And it's going to be per person and per task, right? Because if you want to do different things, you may have a different leaderboard. If you have a question about taxes today, you go to different people than if you have a question about
programming or whatever, right? So it's also going to depend on what task you want to accomplish. And I really want to go back to one thing. I've tried to think quite a bit about all the criticism, because on the face of it, it intuitively makes sense to some people, right? That's why many people make the same criticism.
And I think there is another thing going on. Why do people say Arena is not good? Because people are fooled, right? That's fundamentally their argument. People are fooled by long answers, more emojis, and things like that. And when I look from that perspective as a human, what's in my mind is: I am not going to be fooled.
Right? These other people are going to be fooled by these things, but I am not. That's why it's not good, that's why it's not a good proxy. The problem is that I am fooled too, right? Everyone is fooled. And that's why Chatbot Arena, like Anastasios said, is a proxy. It provides a magnifying glass. But you know, all of us have our own
peculiarities, our own culture, our own history built on interactions with different people, right? So we have different preferences. That's fundamentally what it is. And these preferences are not fully objective, right? People say there's only one objective answer, but all of us are different. That's kind of the fundamental
disconnect between the criticism and what we actually provide, and what we believe everyone actually needs, right? To double-click on that a little bit: you said we all have our own culture, right? And Ben Horowitz, whom we all know well, has a quote I love. He says culture is not a set of beliefs, it's a set of actions. So let's say
our belief, the philosophy that Arena has, is that the set of actions a user takes when using an AI model is the best source of truth for whether that model is good for them or not, versus some third-party, closed-source evaluation telling you what is good for you. Right. And that's why, again, going back to the previous point, I think what you describe as capturing human preference is fundamental, because we are building
these AIs to interact with humans, right? I think that's the foundation. But as we said in the previous discussion, people say, well, yeah, but these other people are fooled, it's not me. Although I am fooled as well; I just don't believe it. You always believe you are better than you are. But then we can provide things like style control, and you can adjust for that effect, right?
And we are going to provide more and more: you can adjust for this effect and that effect, right? So you get your answers as well. LM Arena started as a research project. So take us back to how it began. It started around two years ago, in late April 2023.
And at that time, before Arena, we were working on a project called Vicuna, which was one of the first open models released as a kind of ChatGPT clone. Yeah. At that time, Llama 1 had just been released, which is a base model. It doesn't really know how to chat with humans, having only gone through the pre-training process. At that time, I don't think people
called this post-training yet. People called it instruction fine-tuning, that kind of stuff. So we were in the lab exploring: how do we reproduce this? How do we make an open-source version of ChatGPT? And then, credit to Lianmin, he had the idea that we could use some of the open data published on the internet, users' ChatGPT conversations. It was called ShareGPT, and it was a high-quality
set of dialogues that users shared. So basically we came together as a group, a bunch of PhD students in the lab, and set an ambitious goal, which was to train and release this model in two weeks, that kind of stuff. And during that process, the result was surprisingly good. We were playing with the model and we thought, we've got to demo this to the world. So we basically just set up a website,
put the model on the website, and released it. At that time there was a huge debate internally: when we release it, how should we evaluate this model? How good is this model really? It vibed well, right? Because when we compared it to Llama, the base model, you could feel the difference. The model had just learned how to chat, learned how to speak like ChatGPT.
So there was a huge debate: how do we evaluate this? We didn't have much time. So it was either we do this kind of labeling, come up with questions ourselves, label the data, and compare with other models, or we do something automatic. And at that time, GPT-4 had just come out in March. People were wondering, what can it do, right?
And we were like, okay, why don't we just use it for evaluation? We used GPT-4 as a judge to do this automatic eval. At that time, no one believed in it, and there was, again, a huge debate. But we didn't have time, so we just ended up doing it. And it worked surprisingly well, again. And we released it. But after that there was still a huge open problem, which is: how do we evaluate these
chatbots, right? So soon after, we came up with the idea: why don't we let everyone in the community vote on which model is better? Because at the time we were serving that model, and we were also serving some of the other open-source models. At the time, every week there was a new one. So
we registered a website, we demoed all of them, and we came up with a side-by-side UI so people could compare them. And soon after, we said, okay, why don't we build a battle mode, where we anonymize their identities and
let people vote. So that was the origin of Arena, which was basically trying to solve our own problem: how do we evaluate these models and understand their differences? Actually, when we first started, we did try to have some students, buy some pizza, get them in a room, and label the replies from Vicuna and other models to compare them. And obviously that didn't scale.
And then there was this LLM-as-a-judge approach. We tried it and it worked surprisingly well. And then there was quite a bit of debate: okay, are we going to build a platform to scale this? Because the question was still open. Anecdotally, GPT-4, which had been released just two weeks before we started using it as a judge, seemed to be doing well, but still,
how does it compare with humans? Right. And then there was the question: okay, how do we scale human evaluation?
And we discussed quite a bit how to do it, because it was not clear. Before, you would just ask people: I have a prompt, the prompt is answered by all the models, and then you label the answers, right? Good and bad. That's the typical way you would scale up that process. But then you need to rank them, right? And if you have N different
answers to the same prompt from different models, it's very hard to rank them. Think about it: they may differ only slightly in tone and so forth, and you have to rank them. And then we thought quite a bit, and I think the inspiration was how, in real life, humans rate, say, players or teams in games, right? And obviously a tournament is one way to do it,
where you have, so to speak, the players playing each other head to head. And then based on that, you get some number of points for a win, a loss, or a draw, and then you have a leaderboard. But the problem is that in a tournament, typically the assumption is that the set of players does not change
during the tournament. Right? And also, in general, in most tournaments each player has to play everyone, so it's kind of an N-squared problem, where N is the number of players. And then we thought, okay, there are other ways in real life that players or teams are ranked when they don't all play each other, because they don't have the chance, either because the number of players is too large or because you need to accommodate new players
entering the game. And then we thought, okay, there are disciplines where this is done, like chess with the Elo rating, or tennis with the ATP ranking, and many others. And that was the idea. We said, okay, why don't we do something like an Elo score? And for that, what do you need? You only need head-to-head comparisons, and not everyone needs to play in the same tournament.
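For intuition, the appeal of an Elo-style scheme is that each head-to-head vote is a small online update to just two ratings, and a new player can join at any time. Here is a bare-bones sketch using the standard Elo update rule on a made-up vote stream, not Arena's actual code:

```python
# Bare-bones Elo over pairwise battles (illustrative only).
from collections import defaultdict

K = 32                                   # update step size
ratings = defaultdict(lambda: 1000.0)    # every new model starts at 1000

def expected(r_a, r_b):
    """Probability that A beats B implied by the current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(model_a, model_b, score_a):
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical votes streaming in; note that model "m4" shows up partway through.
for a, b, s in [("m1", "m2", 1.0), ("m2", "m3", 0.5), ("m3", "m1", 0.0), ("m4", "m1", 1.0)]:
    update(a, b, s)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

As discussed below, the platform later moved to a Bradley-Terry fit, which, unlike this online update, does not depend on the order in which the votes arrive.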
And that's what we adopted. That's why Arena has this battle mode, where you have a prompt and answers from two randomized, anonymized large language models, and you can pick which one is better, the left side or the right side, and so forth. And when was the moment you joined the conversation and brought in, from what I understand,
the Bradley-Terry approach? Yeah, it turned out to be technically deeper than we thought. At that time, Ion was saying we needed to find someone to back this up, more on the theory side,
to have a solid foundation for ranking all these models. At that point, it was no longer really just a fun project. It had started as a fun project, but people were starting to pay attention to it, so you'd better do something. And I went to Michael Jordan, my colleague, a very famous machine learning and AI researcher on the faculty here. I had actually worked with him when I built these cross-disciplinary labs at Berkeley.
We were working with him back in 2005, 2006; he was joining the systems people and database people to work together on exciting projects. And he told me, oh, I know exactly, I have the guy for you: it's Anastasios. I saw what was being built at the time. Arena was still not close to what it is today; there wasn't that much usage. I saw it and I thought, wow, what a great opportunity to do some interesting statistical modeling and theory.
Like being able to understand: how do we optimally sample models? How do we perform this estimation? Okay, let's move from Elo to Bradley-Terry, because we're actually performing an estimate here. The Elo score moves over time and doesn't converge, but Bradley-Terry estimates do converge. And how do we then construct proper confidence intervals for this estimate, and so on?
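As a sketch of what that looks like in practice, here is a toy Bradley-Terry fit with bootstrap confidence intervals. The battle log, the logistic-regression formulation, and the bootstrap are illustrative choices, not the platform's actual estimation code.

```python
# Toy Bradley-Terry scores with bootstrap confidence intervals (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["m1", "m2", "m3"]
rng = np.random.default_rng(0)
# Each row: (index of model A, index of model B, 1 if A won else 0).
base = [(0, 1, 1), (0, 1, 1), (0, 1, 0),             # m1 usually beats m2
        (1, 2, 1), (1, 2, 0), (1, 2, 1),             # m2 usually beats m3
        (0, 2, 1), (0, 2, 1), (0, 2, 1), (0, 2, 0)]  # m1 usually beats m3
battles = np.array(base * 20)

def bt_scores(rows):
    # Design matrix: +1 for model A, -1 for model B; the logistic fit's
    # coefficients are the Bradley-Terry log-strengths (up to a constant).
    X = np.zeros((len(rows), len(models)))
    X[np.arange(len(rows)), rows[:, 0]] = 1.0
    X[np.arange(len(rows)), rows[:, 1]] = -1.0
    clf = LogisticRegression(fit_intercept=False, C=10.0).fit(X, rows[:, 2])
    return clf.coef_[0]

point = bt_scores(battles)
boot = np.array([bt_scores(battles[rng.integers(0, len(battles), len(battles))])
                 for _ in range(200)])
low, high = np.percentile(boot, [2.5, 97.5], axis=0)
for m, p, l, h in zip(models, point, low, high):
    print(f"{m}: {p:+.2f}  (95% interval {l:+.2f} to {h:+.2f})")
```

Unlike the online Elo update shown earlier, this estimate does not depend on the order the votes arrived in, and resampling the battles gives a sense of how much the scores would move under different data.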
All of that was super interesting to me. We were meeting just next to... Yeah, two doors down. And we wrote five or six different topics on the whiteboard. Yeah. We started working on them, and the rest is history, right? In many ways, I feel like the birth of Arena couldn't have happened anywhere else other than
an interdisciplinary lab at a fundamental research university like Berkeley. Is that true, do you think? Well, it certainly would have been worse if it had come out of somewhere else. And the reason is that the fact that we come from Berkeley, from a university,
really speaks to our scientific approach and neutrality. I think if it had come from an industrial lab, people would always have questions: oh, well, are these people also training a model, and what's their incentive, and so on and so forth. But the reality is we were just students. We were doing this in order to evaluate models, and it came from a scientific perspective. That's it. And I think that's something people can see when they look at us, and it builds a lot of trust in our business. The other angle here is that
in a lab like this one, what you get is different. Maybe in industry you can also get interdisciplinary teams, but they are going to be large teams, right? Because you say, okay, the team doing AI and the team doing systems are going to work together, and those are already large teams. But here what you get are small teams, a few people, where everyone can come from a different area. Right.
We have people who are systems people. Early on, we had to build the systems to serve these open-source models, right?
We had to serve Vicuna, like Wei-Lin mentioned. Then you have to have people who are pretty good with data: when we used this ShareGPT data, we did quite a bit of data pre-processing, data curation, and so forth. Then when Anastasios joined, we had machine learning expertise.
But the team is still like four or five people. And early on, small teams move very fast. So I think that's the difference: you get a team that is interdisciplinary but small. In industry you may get interdisciplinary teams, but they're going to be large. If you could just teleport back in time to that moment in early 2023,
correct me if I'm wrong, but if I had to summarize the research environment in the Bay Area at the time, most people were basically proclaiming the death of AI research in academia, right? The idea that, oh, you can't really do any serious research, you can't contribute to the frontier of computer science or AI from a research institution. It was quite common, actually, if you remember. And there's nothing more satisfying than proving these people wrong. Right, right. So what do you think people got wrong?
I think, if I may take a step back here, because I'm old enough that I've seen a few of these cycles. I remember when I was a student, I was doing systems work and networking. Those were the internet days. And
I was going to these conferences and there were panels: "Are operating systems dead?" "Is operating systems research dead?" That was the topic of the panel. And the reason was that at that time, at the end of the previous century, Microsoft was dominating, and then Apple, and then of course there was FreeBSD and so forth.
But then, although it didn't come from academia, there was Linux. Actually, Linux was preceded by Minix, which came from academia, from the Netherlands. So that's one. Then in 2004, when we came here and started this lab, there was a similar question about distributed systems, because a lot of researchers worked on distributed systems: what can academia do? Because
Google was doing all this work on these systems: MapReduce, the Google File System, all of that was happening at Google. Right. Right. The best people were going to Google and so forth. And then, from the work we did here, came Spark, right? Right. Which also came from academia, right? Right.
And I think when this started, people were actually surprised about Vicuna: just a bunch of students, on their own initiative, right? I found out almost after the fact that this had happened, that they picked this data set from the internet, which was high quality, and used it. And people were so surprised about the quality, right? That's where people started asking, is this real? Right.
Right. So I wanted an evaluation. I'm always like: you're just showing me some anecdotal stuff, right? Okay, it looks good, but how good is it? Some people didn't believe it. Yeah. And they'd say, this is a GPT-4 wrapper. Yeah. I remember that. I remember we were at NeurIPS later that year, right? And at NeurIPS,
Wei-Lin was sitting at a table next to me, and a really well-known, famous researcher who's still at OpenAI asked me, oh, is that the team that worked on that Vicuna bot? And I said, yes.
And he said, oh yeah, I've been wanting to have a conversation with them, because we think they're violating our terms of service, because they're just reselling our GPT-4. And I don't know if he came and confronted you, but that was very much the default assumption people had. There was disbelief. Because for a while, those were the best open-source models; I don't know, three months, four months, or whatever. Right.
So that's why the evaluation was so important back then, right? Because there was disbelief. So we tried to back it up with some evaluation that seemed more objective, to show that it was indeed a good model. Again, since then there are many other things we've done here in open source, like LLM inference with vLLM and SGLang. But on this tension with industry,
I was actually on a panel yesterday with the same kind of discussion. Okay, maybe academia should do this thing, and industry should do that thing, right? Like, as academia, you just can't do anything like pre-training and so forth. I think at the end of the day, it is about what resources you have and what problems you solve. And again, as in the examples I gave, if academia has the resources, it
is going to surprise you. Clearly it's going to be at the very edge of innovation and creativity. And that's almost always happened, right? Like in this case. And in this case, of course, we didn't need huge resources; we were just a group of smart, passionate students. So when Chatbot Arena started,
there was a lot of excitement and so forth. But then there was this feeling, at least for some people in the group, that we were done here: we published the paper and so forth, right? For a while, if you look at the usage, it was kind of dropping a little bit, almost flat. Yeah, almost that.
And at that time, Wei-Lin's main thrust of research was different: graph neural networks, distributed graph neural networks, and things like that. And I remember at some point, at one of our one-on-one meetings, Wei-Lin came to me and said, look, instead of doing this kind of work, I'm really passionate about Chatbot Arena. I really want to do it and to focus on it.
And then it started, with Wei-Lin as a one-man backend. He started to add more models to the leaderboard, market it, and so forth. And very soon after that, Anastasios came, and then it was kind of magical, right? You have these people who are so passionate, and they work so well together. They are so complementary in skills
and even personalities, and then it started to shoot up. And I'm mentioning that because without that kind of inflection point, which came long after it started as a project, we wouldn't be here. I think there's a chart I saw recently that compared the number of models being released and tested on LM Arena per year
over the last two years. And if you look at Q1 of 2023, it was, I think, two models. And if you look at just this past quarter, I think there were 68 models or something like that, right? In total, that first year there were about 12 models or so, and today it's over 280 or something on the platform. So at some point, it sounds like it took the two of you realizing that
this deserved to be more than a one-off paper. When was that? I still remember when we worked on the paper for Chatbot Arena. It was like a couple weeks of really hard work and we were like pushing all the way until the deadline. Afterwards, I turned to my girlfriend at the time. I was like, you know what? I think this is going to be a pretty good paper. Yeah.
And yeah, Wei-Lin and I were talking at the time, but I think we started thinking very early about what this could become, trying to de-risk it in various ways, trying to build it and asking, hey, is this growing? Can we keep building on it? Another factor that really drove the growth is competition:
competition in AI became much more intense in early 2024, when Claude 3 came out. So let me answer your question, because I think it's very interesting and maybe I have a somewhat unique view. I started other companies based on projects coming from this lab,
like Databricks with Spark, or Anyscale with Ray. And there, the motion was pretty clear. You have a successful open-source project, which gets more and more popular. Then some companies start to use the project, and you get to the point where they say, okay, if I'm going to bet on this project being part of my infrastructure, what happens when the students who built it, like Matei and so forth, graduate? Right.
Who is going to maintain it? Who is going to evolve it? So in that particular case, it's kind of natural: if this really gets even more successful, you have to have a company backing it, whether it's a new company or an existing company. And if there is no existing company, the people on that project, if they want to push it further, almost have to start a company to have enough resources to push it.
But this was different, for the reasons Anastasios said. We are Berkeley; there's a kind of trust that we are neutral. And Wei-Lin, I remember, mentioned to me about a year ago, "I think maybe we should do a company." And I told him, "Man, what are you talking about? This has to be really neutral. Maybe we do a kind of foundation and so forth."
And this discussion actually went back and forth for a while. I was even frustrated: I'm telling this guy what I think should happen, that we should just be a kind of foundation and so forth, and he comes back to me, not hearing it, telling me the same thing, right? So we were trying to convince you, basically. They were trying to convince me, right? And for me, we talked with some of these foundations and so forth.
But it became very clear to me, when we started to get more and more demand and so forth, that there was no other way.
You need so much funding to build such a platform, right? Because you need to serve the models, and you need to build an entire scalable backend and things like that, and then there's the UX, right? So when you look at the sheer amount of work needed to push it to the next level, there is no way you can do it without significant funding.
So that, for me, was the inflection point. But these guys can say more, because they were convinced about this long before I was. Yeah. Another thing we were discussing last year was whether this can really be a business that solves more fundamental problems in this space.
I think Anastasios at that time was giving some perspective on the ever more granular evaluation we can provide with the data. Do you want to say more about that? Chatbot Arena, when you look at the leaderboard, runs something like a marginal regression, which means that the leaderboard sort of
ranks models on average across all users and all the prompts that they ask. But there's a vision where you take this to the logical extreme where there's the overall leaderboard. Then you can categorize the leaderboard into different categories, coding, math, hard prompts, and so on. But the real value...
is in, well, what if I can tell you which model is best for you? What if I can tell you which model is best for you and your question, for your business? There are so many interesting methodological questions to ask there, and they actually require a lot of resources to answer. So one thing we've been working on recently is called Prompt-to-Leaderboard. Prompt-to-Leaderboard asks the following question: you give me your prompt; can we tell you which models are best
for that prompt specifically? Now, the problem is we've never seen that prompt before. We've only ever seen any given prompt once, or zero times, right? Because most people don't ask every question under the sun. Fundamentally, it's a hard question, because the thing you're trying to estimate is: what if infinitely many people came to me, asked the same question, and then voted? That's the thought experiment you're trying to run in your head, but you can't really answer that question by running a standard regression. So instead, what we came up with was a strategy for training language models
that can output leaderboards. And it's actually a deep question because what essentially you're doing is you're training LLMs to output these Bradley-Terry regressions that we were talking about earlier. And how do you do that? Well, you have to make sure that as you train the model, the regression sort of naturally emerges from the data. And the only thing you're getting is binary preference. But nonetheless, it turns out that you can do it. This has so much utility and it requires...
so many resources in order to really scale up. It converts the problem of testing and evaluation, which is normally kind of an unsexy problem. You think about it as, okay, how am I going to evaluate ML? Well, I'll just calculate the accuracy, right? But the reality is that that really doesn't reflect the heterogeneity of the model's performance across different settings and different people. Instead, what Prompt-to-Leaderboard teaches us is that you can convert the problem of evaluation into the problem of learning.
What if I learn something that can tell me how my models are performing in all the different parts of the space? It turns out you can do that by training big language models, and because language models are sort of the intermediary that gets you to this evaluation, there's also a scaling law that comes along with it. Right.
Which is to say that the more data you get, the bigger you build the platform, the better you can make your evaluations, the more granular you can make them, the more personalized you can make them. And that's a very powerful idea. I think that's part of the reason why we were convinced, hey, this deserves to be a company of its own: a fundamental technical innovation that's going to change the way people approach the space. And let me try to follow that with a less precise explanation, but I think it drives home the point of why the data is so important.
So with Prompt-to-Leaderboard, when you give your prompt, like Anastasios said, we may well have never seen that exact prompt. However, we may have seen a lot of other prompts that are similar to yours. So intuitively, you can think of using the votes on those similar prompts as a proxy to compute how good the models are for your prompt.
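Here is a toy version of that intuition: weight past votes by how similar their prompts are to the new one, and compute weighted win rates. It is only the analogy in code; the actual Prompt-to-Leaderboard method described above trains a language model to output Bradley-Terry coefficients conditioned on the prompt, and the vote log and model names here are made up.

```python
# Toy illustration of similarity-weighted, per-prompt rankings (not the real
# Prompt-to-Leaderboard method, which trains an LLM to emit Bradley-Terry
# coefficients conditioned on the prompt).
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical vote log: (prompt, winning model, losing model).
votes = [
    ("fix this segfault in my c++ linked list", "model-a", "model-b"),
    ("write a haiku about autumn rain",          "model-b", "model-a"),
    ("optimize this sql query with two joins",   "model-a", "model-c"),
    ("draft a wedding toast for my brother",     "model-c", "model-a"),
]

def prompt_leaderboard(new_prompt):
    corpus = [p for p, _, _ in votes] + [new_prompt]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    # Weight each past vote by its prompt's similarity to the new prompt.
    weights = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()

    wins, totals = defaultdict(float), defaultdict(float)
    for (_, winner, loser), w in zip(votes, weights):
        wins[winner] += w
        totals[winner] += w
        totals[loser] += w
    board = [(m, wins[m] / totals[m]) for m in totals if totals[m] > 0]
    return sorted(board, key=lambda kv: -kv[1])

print(prompt_leaderboard("why does my c++ vector iterator crash"))
```

The more votes there are on prompts close to yours, the less noisy this kind of per-prompt estimate gets, which is the scaling argument being made here.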
From this admittedly less accurate analogy, you can see that the more data I have, the more prompts similar to yours I have, and so the more accurate I can be. There is another thing we didn't touch on, which is why I was so excited about the project early on. If you think beyond Vicuna and our own story about how people evaluated, and still evaluate, these models, you have these benchmarks, right?
MMLU, HELM at that point, SWE-bench, all of these benchmarks, right? The problem is that they are static, so you can sometimes overfit to them. And at that time, if you remember, there were already starting to be discussions, I'm talking about a year and a half, two years ago, about contamination, right?
There were some very high-profile examples. And why? Because these large language models, as we know now, are bottlenecked on data, so they train on all the possible data they can get their hands on on the internet. And many of these benchmarks are also out there.
So probably not intentionally in many cases, but they are going to train on some of the very benchmarks they are going to be evaluated on. So this is another fundamental problem. And I think the unique thing about Chatbot Arena is that it kind of evolves over time. Think about the way
people typically evaluate these models: it's like giving a student the same exam over and over again.
Certainly we don't do that, or at least we try not to. I'm talking as a faculty member now: for each class, each year, we need to give different exams. So it's the same thing with humans. To evaluate humans, to evaluate this kind of learning over time, like these models, you need to evolve the benchmarks, the exams. So I think that's the unique part
and unique value of Chatbot Arena. And probably these guys can say more about the kind of freshness and the evolution of the benchmark over time. What are the biggest differences between benchmarking and evaluation? So let me just zoom out for a second. Benchmarks, how are they collected? What happens is that you ask a question or give an input and then a human grades the output. And then what is the benchmark supposed to be? There's an answer key. A benchmark is like a test with an answer key.
A human has to look at it and tell you what's right or wrong. The fundamental insight of the Arena is that, by virtue of the fact that we built this platform, we can do something closer to reinforcement learning. Benchmarks are like supervised learning; Arena is like reinforcement learning. In supervised learning, you can only do as well as the best human that you have, because what's happening is that you're learning from the teacher. In reinforcement learning, you're learning from the world.
You're able to learn things better than the best human could ever teach you. Why? Because you're only getting these preferences. You're getting, was this good? Was this bad? Nobody needs to tell you why. Nobody needs to tell you, hey, oh, you need to improve the fact. Oh, your writing style needs to improve in XYZ way and you should edit the sentence. Forget all of that. For the same reason why reinforcement learning has been so powerful in training language models. It is also powerful in evaluation. It can capture things that you and I, if we were looking,
could never understand how to encode. It is the open-world nature that allows you to go back and mine the data in order to extract insights that are much more profound than anything we could
come up with ourselves. So this seems to be the fundamental tension, right? Let's say you are a leader in the AI industry, at a frontier lab or a product company, and you say: we believe the most valuable thing for us to do is build useful AI products. We're not interested in benchmark hacking; we're interested in making truly useful products. If that is actually true, you should be strictly supportive of testing your systems more and more
on arenas like WebDev Arena. Let's say you want to build a useful web development AI experience. Then you should want your teams to be testing more and more on this product, right? You want to do well on the distribution of natural use. So then we'd expect anybody who's serious about building useful AI products to want to use testing environments like WebDev Arena more. Why are people complaining that
some labs are testing more than others? And why are they saying that's a bad thing? So first of all, I think it's worth saying that we offer the same level of service to all labs. There's nobody that we treat preferentially or anything. It's a neutral platform. We want to help the ecosystem advance. But second of all, addressing your question more directly, people do not yet fully understand
the Arena. I think people still think about the Arena as a benchmark. People still think about it as something like, oh, people can overfit on this thing. But what hasn't sort of permeated, and it's because it's just such a new way to approach evaluations, is that when you have fresh data, you can't overfit. It just means you're doing well, period. There's no overfitting that can occur. What can happen is you can do well. And you can argue with me about whether doing well is a good thing.
Okay, that's perfectly fine. But that's not where people's heads are at. I think people's heads are still at, oh, you tested so much and that must mean that you're... because people are used to it, because, oh, it's Stats 101 and so on and so forth, I know statistics. If you're doing well on this distribution, that's a strictly good thing, all else equal. And then people can choose: how much do I want to tune my model for chat? That's your choice. You can choose how much you care about this signal. And that's okay too.
So I think it's a fundamental misunderstanding. But as we continue building this, as it grows, people will become more educated on this topic. And then I expect that the world will understand it. And again, just to make sure, because we had these discussions with Anastasios early on: overfitting refers to the same data, right? Okay. When you do supervised learning or something like that,
then what? You have training data, and then you have test data, which you don't show during training. And you hope that the model is going to do well on the test data, right? So overfitting means it's doing well only on the training data, not on the test data. If you think from that perspective, there cannot be overfitting, because we have continuously
fresh data. Right? The one thing people can say is that it's a particular domain, which is given by the set of users and so forth, and you are going to learn to do better in this domain, which is perfectly fine. You probably should care about that, because the domain is the group of people you care about. Right? But overfitting is very different.
It has a very particular meaning. What people have in mind here, when they use the term overfitting, is: I'm going to learn how to do well on the Arena audience. That's what they have in mind. But again, that's fundamentally different. Well, actually, let's talk for a second about the Arena audience, because you mentioned that's a critical part, right? As opposed to continuing to
train your model to perform well on a static distribution. One of the things that shocked me between the first time Wei-Lin and I chatted at the beginning of last year and the end of last year was that Arena traffic had grown by 10x. The user base of the community had gone up by ten times. Why is that? It certainly wasn't visible to me. What's going on under the hood? Why are more and more people using Arena? And in your mind,
is that one of the reasons why people don't realize how hard it is to actually overfit, why overfitting is almost not possible? It's a million people's preferences. And I think one of the reasons why people are surprised to see usage grow is because when they think about Arena, they think about the leaderboard. They think about, again, a benchmark. Why would a million people use a benchmark? That's strange. But in reality, Arena is basically real-world testing,
real-world testing of the best AI from all the frontier labs. Does the demand for people to test the best AI, to use the best AI, grow over time? Yes. So that's the very foundation of Arena: it's an open space where everyone can come to compare all the AIs for their own use cases, for free.
And this demand we've been seeing has been growing, and we believe it has very strong potential to continue to grow. And at the same time, we collect all sorts of comparison data that we can use for evaluation for all sorts of purposes. So one thing that I want to point out, because we have been talking a lot through this
discussion about votes, right? The vote is the fundamental construct which allows us to evaluate these models. The votes have to have high quality. If they don't have high quality, it's, like you said, garbage in, garbage out. And there are at least two reasons we believe that the votes on Arena are high quality. One is that the people who ask the questions
are the people who evaluate the answers. So presumably they have the context for that question and for those answers. As opposed to: I have one question and two answers, and I'm asking someone, a random labeler, to say which of these answers is better.
And this has been known in the information retrieval field for decades. It's called the gold standard when people evaluate the answers to their own questions. When an expert evaluates the answers to someone else's questions, that's called the silver standard, if I remember correctly. The second thing is that the people who vote, in our case, are intrinsically motivated. We are not asking them to vote, right? They can choose not to vote.
Only people who want to vote do. Compare that to companies that pay humans to vote, or that provide other kinds of incentives, like, oh, if you vote more, we give you more resources or something like that. You can easily imagine how you get the wrong incentives. When I say wrong incentives, I mean incentives that are not necessarily aligned with
increasing the quality of the votes. One of the things that strikes me as I hear you guys talk about the design of the platform is that, unlike these other paid services where you can essentially just hand out cash or incentives, when you have somebody intrinsically motivated testing the quality of a model, that starts to look more and more like
software testing. So 15 years ago, when software systems were starting to be deployed to the internet, they were buggy, they were insecure, they were unstable, they were unreliable. And so as an industry, we developed the idea of unit tests and CI/CD and A/B testing. And today, software systems go through a fairly reliable set of
checks before they get deployed to production. Am I wrong? Or is that a pretty good analogy: if we'd like the arc of AI progress to head towards more and more reliability, then we actually want model developers and AI developers to be testing their systems more before they get released to the world? So I think that's kind of why we started, and this is another thing that's exciting about AI.
We do believe, and you can see it right now, that one of the main challenges of adopting AI in a wide range of scenarios is actually reliability. Especially if you look at enterprises, right? Is this answer correct or not? That's fundamental, right? And like you said, it's very similar to software systems. And for software systems, we
developed, like you said, these long and sophisticated testing processes, right? CI/CD and so forth. So you should think about it like that: you need something similar
for these models. Right now, to tell the truth, it's almost all static benchmarks. This is what we are doing, right? You start training your model, and when the loss plateaus, you start testing checkpoints. You have 60, 70, 80 benchmarks, and you look at them in a spreadsheet, see which checkpoint is doing better, and that's how you measure. This is what happens today, right?
But like we discussed, if you really are going to build your application for humans, okay, you can still test on your static benchmarks. Nothing wrong with that. Very valuable. But you also want to test your models, your checkpoints, on Chatbot Arena for all the reasons we mentioned during this discussion. Yeah. So ideally, you want Arena to be, in the limit, part of your CI/CD for training the models.
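To make the idea of Arena-style evaluation as a CI/CD step concrete, here is a minimal sketch under some stated assumptions: the Vote type, the gate_checkpoint name, and the thresholds are all made up for illustration, not an actual Arena API. The gate blocks a candidate checkpoint from shipping unless it beats the current production model often enough on fresh pairwise votes.

```python
# Toy CI/CD gate on fresh pairwise preference votes. The Vote type, the
# gate_checkpoint name, and the thresholds are illustrative assumptions,
# not an Arena API.
from dataclasses import dataclass

@dataclass
class Vote:
    winner: str  # "candidate", "production", or "tie"

def win_rate(votes: list[Vote]) -> float:
    """Fraction of non-tie votes won by the candidate checkpoint."""
    decisive = [v for v in votes if v.winner != "tie"]
    if not decisive:
        return 0.5  # no signal yet; treat as a coin flip
    return sum(v.winner == "candidate" for v in decisive) / len(decisive)

def gate_checkpoint(votes: list[Vote], threshold: float = 0.55, min_votes: int = 200) -> bool:
    """Ship only with enough fresh decisive votes and a clear edge over production."""
    decisive = [v for v in votes if v.winner != "tie"]
    return len(decisive) >= min_votes and win_rate(votes) >= threshold

if __name__ == "__main__":
    fresh_votes = [Vote("candidate")] * 130 + [Vote("production")] * 90 + [Vote("tie")] * 30
    print(f"win rate: {win_rate(fresh_votes):.2f}, ship: {gate_checkpoint(fresh_votes)}")
```

The point of the sketch is only that the gate runs on a stream of fresh, real-world comparisons rather than on a fixed answer key.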
We spent a bunch of time talking about how Arena was born and how the big idea, at least the theoretical idea, is that to unlock more reliability in AI, we need more testing of AI. So let's spend a little bit of time going deeper on the practical realities of making that possible. What are the hardest challenges when it comes to actually building the best testing platform to make AI more reliable?
Arena is a very interesting platform. It's unique, kind of an n-of-1 at the moment. And so there are a number of technical challenges that are actually quite exciting. We're always looking to improve the platform, both on the methodological side and on the infrastructure side.
What makes it unique is that it's this combination of AI and machine learning, converting evaluations into learning algorithms, the reinforcement learning side of things, plus pretty large-scale infrastructure. A lot of people don't know this, but Chatbot Arena is used by a million-plus monthly users. We get tens of thousands of votes on a daily basis.
We have over 150 million conversations that have been had on the platform. It's massive. It's the leading platform for this kind of subjective, real-world evaluation, and it's continuing to grow. So the infrastructure side is actually quite challenging. And then the question is: we have this unprecedented data set, how can we use it and leverage it maximally in order to actually target what we want? Which is
the most granular possible evaluations and measurements of model performance. Why is that hard? Why is granularity hard? Well, granularity is challenging because fundamentally the question you're asking when you talk about granularity is: how does it work for this specific individual, or this specific prompt, or this specific use case? That is a hard question to answer. Why? Because you, Anj, come to the platform, you ask three questions, and you vote on one of them.
How am I supposed to tell which model is best for you? It's a sparse problem: there's a big matrix of users and queries, the number of queries a user could possibly ask is infinite, the number of users is very large, and each user has only asked three of them. How are you supposed to learn which model is best for that specific user? Well, you have to do something creative.
And the methodology for that relates to all these core topics that run deep in machine learning, statistics, recommendation systems, and so on. But they come into a new light when you think about language. So one example of a problem that we're working on toward the future is personalization. How am I supposed to create a personalized leaderboard for you? Let's say I have your prompt history and a few votes. Well,
in order to run a regression that's just for you, I probably need hundreds of votes. It's just going to be too high variance unless I have that much data. But I'm never going to collect that much data on a user. Or only for the most power users am I going to collect that much data at the moment. So we need a way to train models that look at your interaction history, compare you to other users, and pool between users,
so that you can create leaderboards for specific people, categories of people, and so on. That is a challenging and interesting problem. And you need to do it using only the limited information that we have, which is binary preference data. How do you do that? Well, it's a cool problem. It's a hard problem. And it's one that we have taken steps toward solving. And it's not just personalization. What if I want to value the data? What if I want to tell you which data points are high signal? Which users have high taste?
What if I want to say: Anj, he's fantastic at bioinformatics, but when you ask him about history, this guy doesn't know what he's talking about. Or what if you want to say: hey, this person right here, they're a local expert in this particular topic, and I really should upweight their opinions. Or: this person is just voting noise, how do I take them out? We need to be able to do tasks like these, and they're fundamentally hard because of the
structure of the data that we collect. But they're also very exciting methodologically, and we keep making progress on them, which is part of what fuels us. And it's all enabled by this massive infrastructure and platform. It needs to be done at scale, and it needs to be done very quickly. Wei-Lin is the expert on this, and he should speak more to that. Yeah. So before we go into infrastructure, one related note: the sorts of problems we are looking at are
ML problems that also relate to recommendation systems in the early days, where people were trying to figure out the cold-start problem, right? You only have very few data points per user, but you are trying to make personalized recommendations for them. Or Netflix. Netflix, yeah. For movies, what do people like? And as we move toward a more personalized world, where companies try to build AI
products for consumers, for everyone, products that leverage all these user histories and prompts, and models have memory now, there's quite a bit of new methodology that needs to be developed, in particular in this kind of evaluation context.
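As a rough illustration of the pooling idea described above, and only a toy heuristic rather than LM Arena's actual methodology, one simple approach is to shrink a user's per-model win rates toward the global leaderboard, trusting the user's own votes more as they contribute more of them:

```python
# Toy partial pooling: per-user model scores shrunk toward global scores.
# With only a handful of votes, a user's leaderboard stays close to the
# global one; with many votes, it reflects their own preferences.
# Illustrative only, not the Arena methodology.
from collections import defaultdict

def pooled_leaderboard(global_scores: dict[str, float],
                       user_votes: list[tuple[str, str]],  # (winner, loser) pairs
                       prior_strength: float = 20.0) -> dict[str, float]:
    wins = defaultdict(float)
    games = defaultdict(float)
    for winner, loser in user_votes:
        wins[winner] += 1.0
        games[winner] += 1.0
        games[loser] += 1.0
    personalized = {}
    for model, g_score in global_scores.items():
        n = games[model]
        user_score = wins[model] / n if n else g_score  # user's observed win rate
        weight = n / (n + prior_strength)               # more votes -> trust the user more
        personalized[model] = weight * user_score + (1 - weight) * g_score
    return personalized

if __name__ == "__main__":
    global_scores = {"model-a": 0.62, "model-b": 0.55, "model-c": 0.48}  # made-up win rates
    my_votes = [("model-c", "model-a"), ("model-c", "model-b"), ("model-b", "model-a")]
    for m, s in sorted(pooled_leaderboard(global_scores, my_votes).items(), key=lambda kv: -kv[1]):
        print(f"{m}: {s:.3f}")
```

With only a handful of votes the personalized scores stay close to the global ones, which is exactly the sparse-data regime the pooling is meant to handle.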
It seems like there are two or three emerging frontiers of AI progress, right? Two or three years ago, when models were pretty simple, the vast majority of questions people had about the quality and performance of the models were mostly about in-context learning, right? I give the model a couple of examples; how good is it at predicting the next
token or word in that sequence? And it was a pretty simplistic measure. Fast forward two, three years, and models have gotten extraordinary. Models clearly look more and more like systems. And one of the systems improvements that you've described is memory, right? So relative to five, six months ago, when most AI assistants like ChatGPT didn't have memory
but now do, people are starting to notice a discernible verticalization of the model and the systems layer, right? Famously, OpenAI spent a ton of time post-training their latest model, 4.1 or 4.5 or whatever it was, with the assumption built in that the model has access to the user's memory and context, right? So how do you
solve the problem of evaluating a model where the lines are blurring between model, system, and application, where it's turning into a full-stack product experience, relative to a model that, let's say, doesn't have all of those, right? Because relative to two, three years ago, the side-by-side taste test naively looked easier to do, because it looked like Coke versus Pepsi or whatever, right? Now it looks like a dessert
versus an entree versus whatever. I'm doing a terrible job with the analogies, but you get what I'm saying, right? ChatGPT today, for example, has memory. Claude doesn't, right? These are two consumer apps that look very similar on the surface but are fairly different under the hood. The implementations are diverging. And yet on Arena, they are evaluable side by side.
So what does that future look like? How do you disentangle the fact that the stack is becoming more and more verticalized and integrated across model, system, interface, and application, while Arena today is largely side-by-side evaluation of models that people are used to thinking of as basically symmetric systems? Yeah, I think it's a combination of things.
Again, evaluation will only become more challenging and more specific to your applications. Just like every software system needs its own CI/CD pipeline, different from everyone else's, I think the same thing will happen to all the AI products as well. So our belief is that in order to collect data or evaluations that really mean something, that matter
to us, to the app builder, or to the user, we have to build a real-world environment for everyone to test, to use it, and to give us real feedback. That's also why there is a combination of challenges across ML, product, design, and engineering infrastructure. Because ultimately, we are going to serve, we are already serving, millions of users, and we're going to serve tens of millions of users. How can we design a product that people really love to use? And at the same time, that is the most organic feedback we could collect from different kinds of users, including memory. So what if we have memory in Arena? That kind of application really tests the long-context
capability of the model to reason about the past, and then potentially a RAG system to retrieve relevant information from the user's past history in order to create more personalized content for users, or more personalized leaderboards that help them choose the best AI for their use cases. When ChatGPT has memory built in but Claude doesn't,
how would that actually work in production, when I show up to the site and I'm trying to evaluate these models side by side? Both are serving the model via an API. Does that mean that on the Arena side you have to recreate memory and then abstract it away as a shared service that all the models consume? How would that implementation work? Yeah, so I think increasingly we're going to go beyond just the single model, to models that have the capability to connect to different
sources of information, you could say context. Search Arena is one example of this. We launched Search Arena a couple of months ago. It's basically an arena specifically for evaluating models that have internet access, web data access, right? And in that case, the model is not just the model itself;
it has to be in combination with other components. And the same thing happens with memory, right? You have another component which is retrieving relevant information from user history. And this history is actually richer: not just prompts, it has all the battles between different models, the comparison data, and the preferences users expressed.
And then it could also be multi-modal, right? It can be images, it can be video, or PDFs; people will upload long documents, that kind of stuff. So all these different contexts, all these different modalities of data, how can we leverage them
in order to create a more personal experience, and then evaluate models on it? That will be a very interesting challenge. Yeah, I would say there are basically two ways that we're moving forward. The first is that the platform is going to continue to evolve, for sure. We're going to keep creating new arenas. We're going to keep improving the Arena to integrate things like an artifacts component, things like memory, and so on. And the second is integrations. At the end of the day, if someone wants to evaluate their app, we should be able to provide them a toolkit that
integrates with our services to do that. So let's say I'm building a code editor. Yeah. But I'd like to understand which one of the 17 models out there is best for my users. Exactly. What does that look like? I use an Arena SDK? Exactly. Got it. That's exactly right. And so what would that look like? My users would
generate a bunch of interactions that the Arena SDK then serves on the Arena side to run side by side? Or does that eval actually happen in my app? I don't think so; I think it can happen in context. So what you can do is have some kind of gateway that allows people to access all sorts of different models, maybe even ones that they didn't handpick themselves, the cutting-edge ones that maybe they don't even have access to. Maybe they're even pre-release, right? Yeah.
And then what we can do, on our side, is use all the experience that we've built with sampling, data tools, and training models, and this huge data set that we've collected with all these multi-provider comparisons, to do things like choose the best model for your users, understand how all the different models perform, all the cost-benefit trade-offs, the Pareto curve of cost versus performance across models. All of that is stuff we can instrument, and we can do it using in-context feedback.
Let's say somebody says, hey, let's hook into a thumbs-up, thumbs-down button and pass that back to the Arena SDK. Well, we can look at that, and using that information, we can produce leaderboards for that organization. We're the experts in doing that, right? We've been doing this for years, with things like Prompt-to-Leaderboard and various other technologies. D3, yeah. So we're building a project now that we call data-driven debugging, D3.
It's a little farther out; it'll come in a couple of months. But the fundamental premise is that pairwise comparison feedback is not the only kind of feedback that we can use to construct leaderboards. We can construct leaderboards with any form of feedback. And because of that, we can hook into more than pop-up pairwise preference comparisons for whatever company, which is of course something we can do. What if, instead, I want to rank code models in part on how many times the code is copied?
Or accepted. Yeah, how many code changes are accepted? Or what's the edit distance between the code that the model produced and the code that the human ended up shipping?
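As a hedged sketch of what ranking on that kind of implicit feedback could look like, with made-up weights and event fields rather than the actual D3 methodology, one could blend thumbs, acceptance, and the similarity between the model's code and the shipped code, using difflib's similarity ratio as a stand-in for edit distance:

```python
# Toy ranking from mixed feedback signals: thumbs, code acceptance, and how
# similar the model's code is to what the human finally shipped. The weights
# and field names are made up for illustration; this is not the D3 method.
import difflib
from collections import defaultdict

def similarity(model_code: str, shipped_code: str) -> float:
    """1.0 means the shipped code is identical to the model's output (low edit distance)."""
    return difflib.SequenceMatcher(None, model_code, shipped_code).ratio()

def score_event(event: dict) -> float:
    s = 0.0
    if event.get("thumbs") == "up":
        s += 1.0
    elif event.get("thumbs") == "down":
        s -= 1.0
    if event.get("accepted"):
        s += 0.5
    if "model_code" in event and "shipped_code" in event:
        s += similarity(event["model_code"], event["shipped_code"])
    return s

def rank_models(events: list[dict]) -> list[tuple[str, float]]:
    totals, counts = defaultdict(float), defaultdict(int)
    for e in events:
        totals[e["model"]] += score_event(e)
        counts[e["model"]] += 1
    return sorted(((m, totals[m] / counts[m]) for m in totals), key=lambda kv: -kv[1])

if __name__ == "__main__":
    events = [
        {"model": "model-a", "thumbs": "up", "accepted": True,
         "model_code": "def add(a, b):\n    return a + b\n",
         "shipped_code": "def add(a, b):\n    return a + b\n"},
        {"model": "model-b", "thumbs": "down", "accepted": False,
         "model_code": "def add(a, b): return a - b",
         "shipped_code": "def add(a, b):\n    return a + b\n"},
    ]
    for model, score in rank_models(events):
        print(f"{model}: {score:.2f}")
```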
So that's interesting. You're saying we're moving from a world where the primary signal used to figure out how to improve an AI model is a very explicit thumbs-up, thumbs-down binary preference, to a future where every interaction I have within a product, engagement, retention, down to a GUI interaction, can help tell the model what to improve. Absolutely. That's exactly the kind of stuff that we can loop into the methodology we've been developing and turn into useful feedback for people to continue improving their models. If you want to create code that people are going to use,
make sure that people are using it, that the edit distance is low, and that people accept your changes. Okay. If you want to build an agent like Devin that's going to be your software engineer, how many of its PRs end up getting merged? This is the sort of stuff we're building the technology to give you very rich insight into. And by virtue of the fact that we're developing this new methodology, we think we have an edge in providing people that kind of service. You talked earlier about Prompt-to-Leaderboard. One of the things that surprised me
when I looked at the repo, it's an open-source repo, is how well the model performed on Arena. Can you recap what it is and then walk through what happened when you actually deployed it on Arena? So I'll get a little bit into technical detail here, because I think it's cool. Prompt-to-Leaderboard, what does it do? If you look at the Chatbot Arena leaderboard, it's Bradley-Terry coefficients.
Prompt-to-Leaderboard is a technology we built that allows you to take a prompt and produce Bradley-Terry coefficients for every model that are specific to that prompt. The Bradley-Terry coefficients are a leaderboard: higher is better, and it means you're more likely to win a battle. So what's the natural next step from producing a leaderboard? Well, let's make a router. Anj asks me a question, I produce a leaderboard just for that question, and then how about I route his question to the model that's at the top of that leaderboard?
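To make that concrete, here is a small sketch under the standard Bradley-Terry model: the probability that model i beats model j on a prompt is sigmoid(beta_i - beta_j), and a greedy router just sends the prompt to the model with the largest coefficient. The coefficient values and the dictionary shape below are placeholders, not the released P2L model.

```python
# Sketch: per-prompt Bradley-Terry coefficients -> win probabilities -> router.
# P(model i beats model j on this prompt) = sigmoid(beta_i - beta_j).
# The coefficients are made-up placeholders; a real prompt-to-leaderboard
# model would produce them from the prompt text.
import math

def win_probability(beta_i: float, beta_j: float) -> float:
    return 1.0 / (1.0 + math.exp(-(beta_i - beta_j)))

def route(prompt_coeffs: dict[str, float]) -> str:
    """Send the prompt to the model with the highest Bradley-Terry coefficient."""
    return max(prompt_coeffs, key=prompt_coeffs.get)

if __name__ == "__main__":
    # Hypothetical per-prompt coefficients for one incoming question.
    coeffs = {"model-a": 1.2, "model-b": 0.9, "model-c": 0.4}
    best = route(coeffs)
    print(f"route to: {best}")
    print(f"P({best} beats model-c) = {win_probability(coeffs[best], coeffs['model-c']):.2f}")
```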
It turns out that when you do this, when you train a prompt-to-leaderboard model, let's say a 7-billion-parameter model, and then you use it to route Anj's questions on the Arena, and everybody's questions, that router does better than any of the constituent models it routes between, by a pretty substantial margin. Now, here's another thing that's yet more interesting. Because the Bradley-Terry coefficients have a particular parametric form and a statistical meaning, you can use them in downstream optimization problems.
So one example of an optimization problem is a router: maximize performance subject to a cost constraint. The router can be, for example, a randomized router that chooses between different models. It has a random policy: hey, Anj asked me a question; with 50% probability I'm going to route here, and with 50% probability I'm going to route there. And I'm going to do that in such a way that my average cost is one cent, and I'm going to maximize my performance subject to that.
Now, if you trace the best performance that any individual model can give you as a function of cost, that's something like 2x worse than the router. In other words, the router is giving you double the bang for your buck in terms of performance per cost. If you want to achieve an Arena score of 1280 using the router, it'll cost you half as much as it would cost you to use any individual model.
That's amazing. What it means is that you're taking advantage of the heterogeneity in the performance of these models across different parts of prompt space in order to route between them properly. And by virtue of the fact that it has a statistical interpretation, you can cost-constrain it too. And that's why Prompt-to-Leaderboard is interesting: we believe it's a fundamental first step
toward addressing this routing problem in a principled way. From our perspective, it's the right way to do routing. If you want to do routing to maximize preference, and even internally at OpenAI they're doing these A/B tests, right, if you want to maximize the feedback and the engagement you get there, then you should be using a strategy like Prompt-to-Leaderboard. So our hope is that this sort of thing would make it easier for them to avoid the dropdown, and that they can actually implement it in their own product. I'm sure they have strategies of their own, but maybe this can be helpful to them too.
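One way to read the cost-constrained routing described above is as a small linear program over routing probabilities. The sketch below uses made-up scores, costs, and budget purely to illustrate the shape of that optimization, not the actual implementation.

```python
# Sketch: randomized router as a linear program.
#   maximize   sum_i p_i * score_i
#   subject to sum_i p_i * cost_i <= budget,  sum_i p_i = 1,  p_i >= 0.
# Scores, costs, and the budget are made-up numbers for illustration.
import numpy as np
from scipy.optimize import linprog

scores = np.array([1300.0, 1270.0, 1240.0])   # arena-style scores per model (hypothetical)
costs = np.array([10.0, 3.0, 1.0])            # cents per query (hypothetical)
budget = 4.0                                  # average cents we're willing to spend

res = linprog(
    c=-scores,                                # linprog minimizes, so negate to maximize
    A_ub=costs.reshape(1, -1), b_ub=[budget],
    A_eq=np.ones((1, 3)), b_eq=[1.0],
    bounds=[(0.0, 1.0)] * 3,
    method="highs",
)

probs = res.x
print("routing probabilities:", np.round(probs, 3))
print("expected score:", round(float(scores @ probs), 1),
      "expected cost:", round(float(costs @ probs), 2))
```

The solution naturally mixes a cheaper model with a more expensive one to hit the budget, which is the randomized-policy behavior described above.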
Let's talk a little bit about the roadmap. You said that the experience will look different over time, that the Arena experience will look different from ChatGPT. What are the biggest things that you're working on over the next few months? And then let's go longer term. Yeah, well, two that we've already mentioned are personalization and a leaderboard of users, right? Can we get people, first of all,
to figure out which models they like best and lean into that experience, incentivize them to give us better votes, come here for their personal leaderboards and their personal metrics, and then give them the ability to drill down really deep into that? In that case, we align the interests of individuals and of the platform as a whole, because you don't want to mess up your personal leaderboard. Just like how people these days, when they use social media, don't like a random post, because
if they do that, their feed will get messed up. So it's like, oh, I'll be more careful voting, I'll be more careful looking at all these different models, and so on, which we believe will collectively create a better, even higher-quality Arena. Absolutely. Yeah. And then on the note of a user leaderboard: can we value the data in a way that lets people know where they stand in terms of what kind of questions they're asking and how useful they are?
We think people are going to love that. It's such a fun thing to be able to see that, in terms of math, I'm asking the best questions. I would love that. I would love it if I were asking the best statistics questions in the world. And I think people will use that and think, hey, I want to be at the top.
And so can we continue to align the incentives? And by the way, once we do that, it'll make the leaderboard much more valuable, because it'll mean we start removing the noise from people who might be, oh, I don't know what these buttons are, click. Instead, we're getting really intentional, really high-taste votes, identifying who those people are, and maybe even being able to personalize so carefully that we can produce leaderboards for different types of people. That would be incredible.
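As one naive illustration of valuing voters, and only a toy heuristic rather than the Arena team's method, you could score each voter by how often their vote on a model pair agrees with the majority vote on that same pair, then down-weight or drop voters who look like noise:

```python
# Toy voter-quality heuristic: agreement with the majority on each model pair.
# Votes are (voter, model_x, model_y, winner). Purely illustrative.
from collections import Counter, defaultdict

def voter_agreement(votes: list[tuple[str, str, str, str]]) -> dict[str, float]:
    # Majority winner per unordered model pair.
    pair_counts: dict[frozenset, Counter] = defaultdict(Counter)
    for _, x, y, winner in votes:
        pair_counts[frozenset((x, y))][winner] += 1
    majority = {pair: counts.most_common(1)[0][0] for pair, counts in pair_counts.items()}

    agree, total = defaultdict(int), defaultdict(int)
    for voter, x, y, winner in votes:
        total[voter] += 1
        if winner == majority[frozenset((x, y))]:
            agree[voter] += 1
    return {v: agree[v] / total[v] for v in total}

if __name__ == "__main__":
    votes = [
        ("alice", "model-a", "model-b", "model-a"),
        ("bob",   "model-a", "model-b", "model-a"),
        ("carol", "model-a", "model-b", "model-b"),  # disagrees with the majority
        ("alice", "model-b", "model-c", "model-b"),
        ("carol", "model-b", "model-c", "model-c"),
    ]
    print(voter_agreement(votes))
```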
And then, on the flip side of it, we as a platform have more visibility into who those users are, and we can even customize the distribution toward what
model developers, or developers at large, care about. Say I want to test my AI or my system on developers in Japan. Can we have the ability to customize that kind of distribution, to target the most meaningful distributions that reflect your use cases? One of the things that
you guys have been pretty vocal about is open source. From day one, LM Arena has open-sourced prompts, votes, a chunk of the data that's being generated on the platform. I think we do updates on the leaderboard probably every week. And then all the code and infrastructure that we use to process the data is published
as open source, along with research, blog posts, and papers. Including Prompt-to-Leaderboard: we published the paper and open-sourced the models, the code, everything. Because we believe this is critical for building trust with the community, and for really building a foundation on which we can enable more and more value. So for adoption reasons, for trust,
and for collaboration. As you've made the transition from being a research project to being a company, what are the most important values you hold at the company as you grow out the team, as the project grows? Absolutely. Well, we are very focused on neutrality, innovation, and trust.
We come from an academic background, and we want to maintain the culture that this is a project, a community-focused project, and it's going to continue to grow. Yes, it's going to be a company. The company is going to support the project that we've already built and allow it to grow. Yes, it's going to continue to change, and it's going to change for the better, right? We're going to keep improving it. We're going to keep publishing papers. We're going to keep releasing open-source code. We're going to keep releasing open data, right? That's all going to be part of our culture. And
it goes both ways, because that's how you recruit. The best people don't want to hole up at a company and develop a bunch of proprietary technology that is never going to be released, that's just going to stay in the annals of their nearest neighbors within the company, and they're the only ones who will ever know. We want the world to know what the best ways are of evaluating these models and accelerating the ecosystem. And releasing this data is also a big part of our trust.
If people want to ask the question, hey, how are models performing, why are they performing well? Go look at the data. That's what we did with Llama. When people had questions about Llama, we just released the data. Easy. Just go look. And we plan on doing things like this for the lifetime of our company. That's how we're going to recruit the best researchers who will help us develop the methodology. That's how we're going to recruit the best engineers who care about the whole ecosystem, not just one company. And ultimately, that's how we're going to develop the best products. That's how we're going to become central to the space,
we already are, but we're going to cement it, by remaining open and neutral. And how would you resolve the tension that often exists with people who are concerned that, as AI gets more and more prevalent, as AI systems start being deployed in pretty mission-critical industries, like defense, which we talked about, healthcare, and so on, there's an argument to be made that these systems should be
closed source and evaluated in a fairly locked-down environment, as opposed to being openly tested in this manner, which they would argue is actually irresponsible. How do you think about that cultural tension? Listen, I'm not an expert in national security, but I think an evaluation platform like ours has many different ways of being used. If they want to evaluate publicly, they can. If they need a private deployment, we can probably also do that. It just depends on the level of national security risk, which is way above my pay grade.
But for any of these things, you're going to need these subjective, community-driven evaluations. That's for sure. If things are going to be deployed in the real world, you're going to need real people testing them. Yeah. And also, at some point, when you develop a model and this model is going to be used broadly by the public, there has to be a phase of testing it. What we are building is a bridge over this gap between
the lab building something at the latest research frontier and the world that will use it at large. You need an environment to test in, in the sense that it's a more controlled environment, with the people and the distribution you want to customize, where you want to understand their preferences. There's a need for a platform like this to exist,
and we want to serve it. Yeah, could you talk a little bit about Red Team Arena? Yeah, so this real-world testing idea of Arena can be applied to many different applications, like we discussed, right? From chatbots to web dev to different modalities, images, that kind of stuff. And red teaming as well, because red teaming at its core is a bunch of people trying to jailbreak the models, to see if a model is really faithfully following
what it has been instructed to do or created to do, right? These days, many frontier labs have been publishing model specs, that kind of idea: how models should behave in this way or that way, right? But how do you make sure the model follows those instructions? You need real-world testing, again. You need red teaming; you need a group of people knowledgeable in this space to help, right? So,
again, this can be community-driven too, because there's a vibrant community of jailbreakers. They want to help, and they test it for fun as well. So in Red Team Arena we have a leaderboard not just for models but for users, for jailbreakers: who are the best jailbreakers
who can identify issues across all the different models? So that very idea of real-world testing still applies here and, we believe, can still deliver value to the ecosystem. So is it fair to say that if I wanted to understand the security or safety risks in a model, I could go to Red Team Arena and look at the evals that
the models are generating over there? How does Red Team Arena actually work in practice to improve the security and reliability of these models? Yeah, for sure. So I think it's the same as how we think about Chatbot Arena, WebDev Arena, that kind of thing. There will be many different applications people are trying to build on top of the models, say a customer service model
or retrieval systems, that kind of stuff, right? You want the model to behave in a certain way, and you want control. And in Red Team Arena, the idea would be: why don't we build an environment to simulate those applications? So for example, can we build an environment to simulate customer service,
where the model is instructed not to take certain actions, and then you, as a jailbreaker, try to break the model? So the signals we'll be getting from that real-world testing and jailbreaking will reflect the particular use cases that people care about. By the way, Red Team Arena right now is still a bit of a prototype. We're continuing to work on it. But it's interesting to see what people want.
It's not necessarily the case that the model that refuses the most to answer these queries is the better one. Some people want a model that's more controllable. Some people want a model that's going to say whatever they want. Some people want a model that's going to be completely safe, that you can use PG-13 or rated G. That's okay, as long as people have the choice. So as we start to wrap up here, one question that a lot of people ask is: what does the world of evaluation and testing look like
as we go from a pre-training world to a post-training world, from a world of models to a world of agents, right? In some sense, it seems like you guys were a little bit ahead of the curve, in that Arena has always been an environment for agents more than a set of static benchmarks. So as agents get better at long-horizon tasks and tool calling and so on, in this future where a ton of work in the economy is done largely by fully end-to-end automated systems, does Arena have to change in any fundamental way
for that future? Or does it largely look the same? Yeah, I think what we've been talking about as the fundamental thing is organic real-world testing with feedback. That's not going to change. I can tell you that is not going to change. Will we have to adapt the UI? Yes. Will we have to improve the product? Yes. Will we have to launch new products for evaluation? Yes. Will we have to develop new methodology? Yes. Do the fundamentals change? I think no. The reality is, if you want to test your model for real-world use, you have to subject it to real-world use. You have to collect feedback from real-world use. And that's it.
So we're really excited about what the future holds there. We don't even know ourselves how the product is going to evolve over the next five to ten years, right? The ecosystem is moving so quickly. But wherever it goes, we're excited to follow. Awesome. Thanks, guys. Thank you. Thank you.
If you made it this far, thanks so much for listening until the very end. And keep listening in the weeks to come as we have some great discussions lined up. Finally, if you enjoyed this discussion or anything else you've heard on this podcast, please do share it far and wide and rate the show on Apple Podcasts.