Yifan Mai left Google to work at Stanford because he wanted to focus on research and building open-source software that supports academic researchers. He enjoys being closer to the research process and enabling researchers with the infrastructure they need.
The HELM project is a research initiative that benchmarks the performance of large language models (LLMs) across various tasks and benchmarks. It provides a standardized and transparent way to evaluate models, allowing users to compare their performance and use the framework for their own evaluations.
Open-source models allow users to run the model locally on their machines, giving them control over the input and output. Closed-weight models, like GPT-4 and Google Gemini, are only accessible through company APIs or services, meaning users cannot directly access the model's parameters or run it locally.
Evaluating LLMs in high-stakes domains like law or medicine is challenging because it requires expert judgment to assess the accuracy and usefulness of the model's outputs. For example, medical advice given by an LLM would need to be verified by a doctor, and legal advice would need to be checked against existing case law.
The 'win rate' is a metric that measures the probability of one model performing better than another on a randomly selected benchmark. It aggregates results across multiple benchmarks to give an overall sense of a model's comparative performance.
Yifan Mai highlights several ethical concerns, including the potential for LLMs to generate harmful outputs like instructions for building bombs or spreading disinformation. There are also concerns about bias in models, labor displacement, and the uneven distribution of power between big tech companies and workers.
Yifan Mai is optimistic about the increasing accessibility of AI, particularly with the development of smaller, more efficient models that can run on consumer-grade hardware. However, he remains concerned about who gets to decide how these tools are used and the potential for power imbalances in their deployment.
Yifan Mai advises aspiring engineers to focus on building strong software engineering fundamentals, including programming, software engineering practices, and foundational knowledge in AI. He believes these skills will remain valuable regardless of the specific technology trends.
There's this narrative of LLMs replacing jobs, replacing programmers in particular. Especially in terms of freeCodeCamp students, I'd say even if you want to get into AI, there's always value in having good software fundamentals, software engineering fundamentals, programming fundamentals, really understanding the foundations of AI, which include things like probability and statistics.
Welcome back to the Free Code Camp podcast. I'm Quincy Larson, teacher and founder of FreeCodeCamp.org. Each week, we're bringing you insight from developers, founders, and ambitious people getting into tech. And this week, we're talking with Yifan Mai, a senior software engineer on Google's TensorFlow team who left the private sector behind to go do AI research at Stanford.
Yifan is currently lead maintainer of the open source HELM project, where he benchmarks the performance of large language models. Yifan, welcome to the show. Thanks so much for having me, Quincy. It's an honor to be here.
Yeah, and I'm hyped to have you here because you have the benefit of having worked in the private sector as a software engineer and now also as a researcher at Stanford. So you've worked doing hardcore software engineering on Google's TensorFlow team, and now you're working doing hardcore cutting-edge research with these emerging models and figuring out how good they actually are. And that's super-duper exciting to me.
Yeah, I'm really glad to be talking about this stuff. Yeah, and I'll just give the audience some perspective on how you and I met. So I went to this event in Singapore, or it wasn't in Singapore, but it might as well have been. There was like tons of Singaporeans in San Francisco. And of course, you grew up in Singapore? Yes, I grew up in Singapore. I moved here in 2010. So I've been in the Bay Area, San Francisco Bay Area for about 14 years now. Yeah.
Yeah. And you've really made like good use of time. Did you come over as an international student or how did you get here?
Yeah, so I was originally, so I grew up in Singapore. I did most of my schooling in Singapore. I moved here for university. So I did my undergrad at Stanford and that's when I came here and I stayed here. I basically stayed on after that and kept working here and that's where, and then I did my undergrad and master's at Stanford and now I'm working at Stanford. I like the joke, it's a bit like moving back into your parents' house.
But I really enjoy being in the Bay Area. Is it common for someone to just leave? Because people leave academia all the time for the high-paying, fancy jobs, cushy jobs in Silicon Valley, in New York City, places like that. But is it common for someone to leave that life behind voluntarily, just set down their tools and go –
basically do something almost completely different. Doing research is very different from applying tools. Or maybe you were doing a lot of research while you were at Google in addition to being a software engineer. But like, is that kind of a common career progression or would you say that's relatively uncommon? Yeah, good question. I'd say it's quite different because like the academia track in particular. So I have friends who...
have transitioned in both directions. So I have friends who did a PhD in academia, you know, and part of a PhD is doing research, and then came back to engineering in industry. And I also have friends that have done engineering in industry and decided, oh, I want to go and do research and go back to academia and do a PhD. And now some of my friends are faculty members. They went from industry to faculty.
I think the motivations for doing that, both academia and engineering industry are very different career paths with very different incentives.
In the US, especially, academia is kind of a rigid track where you're expected to have a very fixed career path where you're expected to do a PhD and then possibly a postdoc in between. And then you have a junior faculty and a senior faculty. And it's a little different from an industry perspective.
And for my friends who've done it, like they moved from engineering back in academia, I think a lot of the motivation is they wanted to do research, like they wanted to, you know, really like advance a certain field. For me, the motivations was a little different because like I'm so different from my friends in that I didn't actually go back to academia to do the academic track.
So I've actually, I do not have a PhD and I'm not planning to do a PhD, which is like very unusual. I'm basically like... So wait, you don't have any intention of doing a PhD? I don't. I'm literally the only person in my lab who's not like thinking about getting a PhD or already has a PhD. I sort of think of myself more of like an engineer because like I like building things.
So that's a little different from what my friends' motivations are, because I'm not going to be on faculty track ever, basically. So I'm not going to be a professor. And the way I'm thinking about it is instead of my impact being publishing research, leading research, I'm thinking more of my impact as being a supporter of scientific researchers.
So I write software, open source software, and I maintain infrastructure that enables other scientific researchers to do their stuff. And I think that's a big motivation for me. I think I really enjoyed being close to researchers. At Google, I also had a bit of that because I worked on TensorFlow, and machine learning researchers use TensorFlow for scientific research.
But my current role, I feel a lot more closer to the research that I'm advancing than when I was at Google. That's interesting. It's almost kind of like a data engineer role where, like the way I think of data engineers is they're teeing up the data for the data scientists, right?
Right. They're absolutely essential in getting the data scientists what they need to do their job. But the actual job is done by the, uh, you know, like maybe a weird analogy, but like, you know, the pilot flying the plane that's dropping the, you know, the paratroopers in, uh, during, you know, trying to reclaim the country from the Germans or something like that. Right?
You're actually kind of like putting them in position so that they can succeed in their mission. But your skill set and the work you do is a little bit different. And of course, your background is different. And your goals, as you said, the incentives are different because you're not trying to become like a tenure track professor at some big research institution. You're already...
at a big research institution, but you're doing kind of like a, you know, a more kind of like, you know, a dirty job of like mucking around with the actual data and the actual software. Would that be an accurate description? I think that's exactly right. Like, that's how I think a lot about like,
So when I was at Google, I was working on TensorFlow and a project called TensorFlow Extended in particular that's like building machine learning infrastructure. And, you know, researchers would, you know, research, a lot of research is done on TensorFlow. So TensorFlow is like a machine learning framework, right, for building models. And a lot of research, like machine learning research is done on that.
I've sort of seen my job as like working, I really enjoy working on infrastructure. I like building things for other people to use. And sort of like my transition is kind of like, instead of building things for industry to use, I'm building things for academic researchers to use and other open source users as well, not just people at Stanford.
So I kind of like, I mean, I see, I really enjoy being this kind of like, you know, build things for research users because they have lots of like interesting ideas and enabling them and like enabling those ideas and like really talking to them and like learning their use cases. That's like really fun and enjoyable. And like, I sort of discovered that I really like doing that. And I also do believe like, you know,
In industry, there's tons of smart people doing this stuff.
So my current job role is, my job position is called research engineer or in some other universities we call it research software engineer. So these are essentially programmers who help researchers accomplish research, right? Either by building infrastructure, building software, sort of applying the practice of software engineering
to research because researchers themselves might not be software engineers. They might not be aware of the best practices of engineering, even basic stuff like how do you build production systems? How do you use version control? How do you do monitoring? This skill set is sort of like things that are familiar to engineers, but much less familiar to researchers. And there's
There's sort of like a lack of good engineering, I think, in the research context. And in some cases, I think research can really be accelerated, you know, if more software engineers just...
were on this track and really, you know, wanted to be in this role. Can you give me some concrete examples? Like, maybe something you've noticed where like, oh, researchers have been doing this, but in industry for years, we've been doing this superior thing. And now I'm kind of bringing this fire down from the gods to the researchers so that they could, you know, cook their chicken more effectively, so to speak.
Yeah, I mean, I have some really incredibly basic examples that you'll actually find laughably basic. But one example is when I, so when I first joined this organization, it was sort of like getting ready to push out the initial paper for HELM, Holistic Evaluation of Language Models. And
They were trying to essentially like they had a front end that visualize the results of benchmarks run on models. They had this UI that showed here are the scores that the models get, here are the requests we sent to the models and the responses from those models.
When I looked at it, they were literally loading hundreds of megabytes of JSON files into the Chrome browser, into your web browser per page load. And it was crashing the browser. Wow. Why were they doing this through a browser? Yeah, because they were building the UI for viewing the results. It was a web app, right, for viewing the results. Oh, okay, okay. So this was like rendering the results. Yeah.
Yes, rendering results. But they didn't do any form of filtering or pagination. So they were just rendering these giant web pages that would make a call to fetch hundreds of JSON files and then render this massive HTML DOM tree.
that would just like eat up all your memory and crash your browser. And I was like, this is just like basic. Very basic web development fundamentals. Yeah, so very basic web development stuff. So like further on in the lifecycle of this project, we had a few volunteers. We actually had a volunteer from outside Stanford help us port things to React. And that made like
everything worked so much better. The web app was so much faster. It was so much easier to maintain. So this was like, I'm not going to claim credit for it. I helped out a bit. This was mostly one of my master's students, Farzan, and an external contributor who helped work on that. Yeah.
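For illustration, here is a minimal sketch of the kind of server-side filtering and pagination being described, in Python using Flask; the file name, data shape, and endpoint are hypothetical and are not the actual HELM frontend code:

```python
# A minimal sketch (not the actual HELM code) of serving benchmark results
# with server-side filtering and pagination instead of shipping every JSON
# file to the browser at once. Assumes a hypothetical results.json file.
import json
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("results.json") as f:   # hypothetical file of per-instance results
    RESULTS = json.load(f)        # e.g. a list of {"model": ..., "benchmark": ..., "score": ...}

@app.route("/api/results")
def results():
    model = request.args.get("model")            # optional filter
    page = int(request.args.get("page", 1))
    page_size = int(request.args.get("page_size", 50))

    rows = [r for r in RESULTS if model is None or r["model"] == model]
    start = (page - 1) * page_size
    return jsonify({
        "total": len(rows),
        "page": page,
        "rows": rows[start:start + page_size],   # only one page crosses the wire
    })
```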
So, so essentially just taking kind of like a lot of fundamental, you know, know how that you've built up working at, you know, you worked at Coursera, you worked at Google, you've worked at like a lot of big tech companies in Silicon Valley with like lots of best practices and kind of like a lot of.
methodology, and, like, there are code smells and things like that you probably picked up, and one of them might be, you know, not having pagination and loading too much into the browser. Um, so you were able to kind of like, there was a lot of proverbial low hanging fruit that you could grab that is just accessible, but people didn't know to reach up and grab it. Yeah. I think more, maybe less trivial examples. Uh, for instance, um,
things like how you do continuous integration, right? How do you test code and, you know, keep it working while you change it. Things like what technologies do you use? Like my colleague at Stanford, David Hall, also a research engineer, he used a distributed data processing framework for preprocessing data for machine learning models. So there's like, you know, you have a problem, what kind of technology is appropriate? There are things like, yeah,
A lot of these are like, sort of code practices, even things like Python packaging, making your code so that you can install it with pip. That's something a lot of researchers don't bother doing because it's...
Also, I would say that some of this work has to do with incentive alignment as well. Because as a researcher, your job is sort of to put out papers, because that's what you're measured by, right? And it's not necessarily to put out good software. So if you look at the code that's written for papers, it might not be easily usable by another person.
So another researcher externally might look at your code and they might take hours to get it to work or they might have to reimplement it themselves. It might not be a pip install, like Python pip install sort of situation. And in many cases, like the researchers, they do know how to do it, but they don't have the time to because they're busy writing papers. And like,
creating, you know, easy to use software is not part of the incentive structure. Yeah. I mean, if all you're measuring is the research output and you're, you're not measuring the hidden cost of people who are trying to reproduce your work, having to like,
deal with spaghetti code bases and stuff like that and get things running, then like, hey, my job here is done. And you walk away from this rigmarole mousetrap game thing. And yes, you've got some additional citations and the department chair is happy with you and you're going to be able to hopefully secure funds for more research and all that. Those incentives that researchers face
And I don't have any experience doing research. My h-index is extremely low. I do have a Google Scholar account, but it's like just a few people have cited some of the stuff I've done, and none of it was me approaching it as a researcher. It was just me trying to put information out in public. But yeah, like maybe you can describe the incentives that they face or like,
Obviously, you have a lot of peers that are researchers. You work with these people every day. Obviously, Stanford is a very prestigious place, but you can imagine a lot of research institutions that are not name brand household names. What are some of the priorities of a typical researcher, and how would they differ from a software engineer who's trying to write maintainable code that will run reliably and things like that?
Yeah, so I work a lot with a lot of PhD students and a few postdocs and faculty. A lot of what people are measured by is the impact of your research. And impact is sort of very vaguely defined, right? I like to joke that researchers are the original influencers.
It's all measured by how many people know of your work or are familiar with your name. And a lot of researchers are just on Twitter now, or X, I guess it's called. But literally, discourse happens on Twitter, and I've seen people literally put screenshots of tweets in their research talks.
But the way you measure impact and influence is also a little ambiguous and depends on a specific field. So I was mentioning earlier, like software artifacts, you know, most researchers don't really care about that. I should clarify that it does depend a little bit on the field because like I see, for instance,
in AI, I have seen some researchers say, okay, I've written a software package and it has like 10,000 stars on GitHub and like, you know, big tech companies are using it. And that's sort of their measure of impact, right? Like people are using my work because they're using my software. So that's sort of like, I think that's sort of a shift in culture, especially in the AI field where like,
If you write software that's being used, that's also considered impact. But that's not necessarily true of other fields. There are some fields where that's not true. And even within AI, the quality of research software varies a lot. Some packages you can just pip install and they're easy to use. Some of them, it's research spaghetti code that works, but it's not really usable to another user.
So it varies very, very widely. Well, let's talk about HELM, because that's the project that you are the lead maintainer of, I believe. And if I can like kind of express what this project does, essentially you try to take all these different models out there, like
GPT-4, Claude, like all these other large language models. And essentially you evaluate them to see how strong they really are. And I'm really interested in how you measure how good something is, like, there's so many different ways. There are standard exams. You can give it the bar exam. You can give it probably, like, there's like a,
an established corpus of like, uh, paces that you put it through and, and you have to come up with like standard, uh, metrics that you can apply. And then you have to like grade these and, and this, uh, I'm not sure how impactful helm is in terms of like, you know, a lot of these, uh, models though, they're, they're marketing like their performance on different benchmarks. Um, and they're marketing like, uh,
uh, just the, the overall power of their model. Right. And, and especially with like models that are sized differently, like Google has like, I think like three or four different sizes of Gemini and Facebook has like maybe different versions of Llama that they've put out there. Um, and, and so you have to be able to make like apples to apples comparisons for like different tiers of, um,
you know, models based on like how many parameters they have or what kind of hardware they can run on. Maybe you can talk about that. Like I'm just throwing a bunch of stuff out there, and please note that like, I know nothing about what I'm talking about. I'm just trying to structure a question and tee it up for you, Yifan. So you can, yeah. Yeah. Like I guess my question is,
What do you do? That's actually a really great intro. So, holistic evaluation of language models, or HELM for short, that's the main project that I work on. And it is both a research project in the sense that it has published papers. It has a paper by the same title, HELM, which has a...
like more than 50 co-authors on it. And it also encompasses this open source framework that you can use to reproduce the results or you can actually use to evaluate your own models or using your own data sets. So essentially what we do is this framework has integrations with a lot of different models and a lot of existing benchmarks.
So a benchmark, you sort of mentioned this earlier, but a benchmark might be something like, oh, give the model, you know, questions from the law exam or from academic exams. So, for instance, a popular benchmark that people talk about a lot is Massive Multitask Language Understanding, or MMLU for short. And that's literally
academic exams from high school level and university level from across a number of different subjects. And you give those multiple choice questions to the language model and you see if it can get them right.
So what HELM does is we've picked a lot of these existing benchmarks, and most of these benchmarks, you know, they are from existing papers in the literature. So they have been peer reviewed, they've been established for a while. And we've sort of collected them into a meta-benchmark. So we've taken this set of papers or benchmarks that we decided on and this set of models, and we've basically evaluated every model on every benchmark.
And we do this in a way that is standardized and comparable and also transparent. So you can go see like exactly all the requests, you know, the exact raw requests that we send to each model and all the raw responses we get back.
And from each of these benchmark and model pairs, we compute some metric number, and we build a table of all these numbers. And we call that the HELM leaderboard. And if you go on the HELM website, that's what you see.
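To make that meta-benchmark idea concrete, here is a toy sketch of evaluating every model on every benchmark and collecting the numbers into one table; the model names, benchmark names, and the evaluate() stand-in are hypothetical, not HELM's actual code or results:

```python
# A toy sketch of the meta-benchmark idea: evaluate every model on every
# benchmark and collect the numbers into one table. The model names,
# benchmark names, and evaluate() stand-in are hypothetical.
models = ["model_a", "model_b", "model_c"]
benchmarks = ["mmlu", "math_qa", "legal_qa"]

def evaluate(model: str, benchmark: str) -> float:
    """Stand-in for running one benchmark against one model and returning a score."""
    return (len(model) + len(benchmark)) % 10 / 10  # fake placeholder number

leaderboard = {m: {b: evaluate(m, b) for b in benchmarks} for m in models}

for model, row in leaderboard.items():
    print(model, row)
```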
And you're right that like certain model developers have used that for marketing. They'll say, oh, yeah, you know, we get so and so rank on the leaderboard. So
Yeah, the open source software part of it is like, you know, people want to do their own evaluations. So they want to know, OK, if I have a new use case, like let's say I'm trying to, say, use a model in a medical application or something, and I have a medical data set or benchmark.
You can go to Helm and say, okay, run this model on my benchmark. Or if you have your own model, you can say, okay, try my new model with this benchmark. And you can use that for comparisons. But you can use this to run model evaluations.
Okay, awesome. And I'm looking at that and I'm linking to that in the show notes. So if you're watching this on YouTube, just scroll to the description. If you're listening to this in whatever podcast player you're using, click the notes and you'll be able to see a link to this. And you can see it for yourself. A holistic framework for evaluating foundation models.
And it looks like Llama 2 is currently the winner. I think the 70B is like 70 billion parameters. Is that what that means? Yeah, that's actually from the original paper results. So the original paper results. We have an updated version, which I'll send you so you can link it. But the current...
The current latest leaderboard, I believe one of the GPT-4 variants is on the top right now. Okay, awesome. Yeah, but we benchmark a bunch of closed source models like
OpenAI GPT models, we have Google Gemini and Anthropic Claude. And we have some open weights models like Mistral and Meta Llama and a few more. Yeah.
Yeah, and for the benefit of people who are unfamiliar with weights and the meaning, the significance of that, maybe you can talk just like LLM 101 or Neural Network 101, like what weights are and why it's important that you open source those as well as the model itself. Yeah, good point. Yeah, yeah, yeah. So let me back up a bit. So a language model, most people at this point are familiar with ChatGPT as an example of a language model. So a language model is
a model that takes in text input, like you give it instructions and then it gives you text output, like sort of like an assistant response. And these models have been trained on a large amount of text, like a large corpus of text, which is usually internet text and from some other sources.
And when we say model, like this is sort of like, these are essentially deep learning models, neural networks. Specifically, they use an architecture called the Transformer. But you can think of it as quite similar to other forms of deep learning neural net models, where it's a network of parameters and these parameters are trained based on the input corpus.
So when you think of like GPT, like ChatGPT or Google Gemini, these are typically models that you access through a company's web application or mobile application or API. So you're essentially sending the text over the internet to their servers and they're sending you back the responses.
So it's essentially they're running the service. So this contrasts with what we call open weights or open source models, where you are essentially running the model on your machine, you control like on your laptop or your desktop, and you're running it as a program and you're sending it the input and getting the output from that program.
A lot of the most powerful models right now are closed weights, meaning that the companies do not make the model weights or parameters available. So I use the terms weights and parameters interchangeably. It's basically like here are the numbers in this network that you use to compute the text output from the text input. You can think of it as sort of like a program.
Most of the largest models right now, the most powerful models, are closed weights in the sense that they're only available as a service running on the company's servers.
So this includes OpenAI GPT, Anthropic Claude, Google Gemini. This is in contrast with open weights or open source, which are things like Meta Llama, AI2 OLMo, Mistral, which you can actually download to your computer and run locally, run them like a program, basically. You give it text input, it gives you text output.
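For illustration, here is a minimal sketch of what running an open-weights model locally can look like, using the Hugging Face transformers library; the model named here is just a small example for the sketch, not one of the specific models being discussed:

```python
# A minimal sketch of what "open weights" means in practice: you can download
# the parameters and run the model on your own machine. Uses the Hugging Face
# transformers library; "gpt2" is just a small example model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # downloads the weights locally
print(generator("The capital of France is", max_new_tokens=10)[0]["generated_text"])
```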
This is so significant because the closed models have been posing a problem for the research community because we don't know a lot about them. We don't know a lot about what data goes into them, what the weights are. We can't do experiments on the program because we don't have access to it.
So, a lot of the work, like a lot of the discourse in academia is like, how do we, you know, how do we get more open models that we can run experiments on and how do we get them to be as good as the closed ones? Because right now there's a big gap between the open and closed models. Yeah. And how big is that gap and is it narrowing?
The gap seems to be, the gap has narrowed significantly because of a couple of specific companies, actually. And one I'll mention is Meta, which is Facebook's parent company. They have open sourced, or I should say not open sourced, they have released an open weights model called Llama, a family of models called Llama, actually. The latest version is called Llama 3.2.
And this is interesting because it is a model that is produced by a big tech company. So they have access to a lot more data and compute than, you know, say Stanford University does. So they have been able to use these resources to produce this very high quality model, and they are releasing it under a license that's
actually not open source. It's actually a very weird license that has some restrictions on use. So it's not considered open source in the traditional sense. But because of that, that model is very similar in capabilities to the other models I mentioned earlier, the closed ones that include Gemini, Claude, and GPT. So
I think that has given people a lot of hope that maybe the open models can catch up. The catch here is that when I say open, I have been using the term open weights, which is a little different from open source. The reason I say open weights is that if you think of...
If you think of what the source is for a model, so when we say open source, we usually mean that in the traditional software sense, we mean that you can look at the open source code, right? You can see how the program was built. You can understand it. You can reverse engineer it and can make modifications to it and rebuild it.
In the context of machine learning, there's a question of what is the source code? Certainly it's the model code, so the code that trains and generates the text outputs. But it's also the data. You also need the data that you train the model on in order to create this model.
And Meta has not released the data behind this model. And so that's why I call it... And there's probably a lot of reasons why they won't release it, because I imagine it has incredible... There's a lot of reasons.
Intellectual property infringement. So just to be clear, we're going to talk more about this. I just want to give some context. So Meta is Facebook. They just changed their name. But it's the same Mark Zuckerberg company or collection of companies. He acquired Instagram. He acquired WhatsApp. And those are like... But he has...
infinity money practically. He can subsidize, he can build things like, uh, open weight models, um, just speculatively, uh, because it costs like maybe like 1% of their operating budget or something to, to train these models and to release them. So for them, it's like good PR and, and I think it's cool. I'm glad they're doing it, but, uh, I'm confident that if you, if you actually looked at the source for all these different models, and, and there are ongoing lawsuits, like the New York Times, Reddit, like all these companies are suing, uh,
I think all the foundation model companies, uh, and they're probably suing Facebook too, because just because they made it an open model or open weights doesn't mean that they didn't infringe upon the work. Um, but, um, I guess what I'm trying to say is like,
they wouldn't put it out there as like, here are all the books we stole and all the Reddit articles we scraped and all the freeCodeCamp articles we included. Cause we have thousands of articles that are almost certainly in these models. Yeah. And to be clear, like we don't have any ongoing lawsuits with these model makers. I wish they would credit us, but you know, we're not going to waste our scarce donor funds to go and try to launch some speculative lawsuit against OpenAI or against, you know,
Mark Zuckerberg, we don't have the money to do that. But I just want to be clear. If anybody's like, oh my goodness, is Quincy somehow endorsing the way that these were built? Absolutely not. This is not a conversation about ethics or anything like that. We're just – how did we get here?
That's what I'm trying to ask. That's what I'm trying to establish. Like, uh, so, so please don't like read into this as like, oh, Quincy's tacitly endorsing the theft of a whole bunch of intellectual property. We were robbed too. Right. But now that these models are out there and they are out there, like, let's understand how they work and, you know, how we can potentially use them, because the genie is kind of out of the bottle is the way I see it. And hopefully people will be compensated and maybe freeCodeCamp will get our check someday too. And, uh, you know, but anyway, uh,
I just wanted to clarify a few things for the audience to make sure they understand. So llama is like the animal with the long neck. L-L-A-M-A. I think they originally had it capitalized as LLaMA, but it was really confusing, because it is an LLM and it's a clever name. But that's what we're talking about here. Yeah.
It had a cute capitalization, but they changed it. So now it's just spelled the normal way. Yeah. You might also actually accidentally be a plaintiff in one of these lawsuits, because one of these is a class action where the class is all GitHub open source authors. Okay. And I think the...
Wait, is it the plaintiff or the defendant? I don't know, like... Yeah, the defendant and the plaintiff. Yeah, so OpenAI is probably defending against the plaintiffs, which are probably GitHub authors. So I'll look forward to getting our check for two cents in the mail, which is... Exactly. I've randomly gotten checks in the past for like two or three cents from some... You might get a three cent donation from OpenAI at some point. Yeah. Yeah.
So, yeah, this is like, this is something that our lab thinks very much about, like the ethics of this thing. I would say that like it's not very settled yet what the ethics and the legal framework around this is. Like, as you mentioned, there is a lot of
material going into the training that's probably encumbered by IP rights, intellectual property rights. It's unclear if they're allowed to do this. It's unclear if the copyright applies to the outputs of the models and whether those outputs can be considered to be infringing. There is
As you mentioned, there is some stuff going, some cases going through the courts. So eventually there will be a case law around this that will clarify some of this situation. There is also a lot of activity in the US government around this. So for instance, the US Copyright Office had a number of listening sessions.
last year or this year about essentially asking artists and writers and musicians to weigh in on the impact of generative AI. And if you're interested in those topics, those transcripts from the Copyright Office are all public, and those are great things to listen to and hear what concerns people have.
So, personally, I share many of those concerns and definitely things that we think about is like, we think about things like, how do we compensate people fairly? How do we ensure that artists don't lose their jobs or get displaced by technology? Yeah.
So, like, setting aside a lot of those concerns, I think that is really interesting. And I want to make sure, like, I'm not dismissing those concerns, but I just don't want to spend our scarce time together talking about those kinds of things when we could be talking about the actual technology, right? Okay. So you have done a great job of describing how these systems work.
And now maybe you can talk about like just briefly like the process of benchmarking, like how do you put these models through the paces and how do you figure out which one comes on top? And I love this term win rate. Maybe you can describe what a win rate is. Yeah, the win rate is actually probably the most innovative but also potentially confusing part of this benchmark.
So we run a bunch of different benchmarks, right? So you can think of Helm as a meta benchmark. And the benchmarks are things like academic question answering, solving math questions, doing translation, answering like domain specific questions like medical and legal questions. And the win rate concept is kind of
What's the probability that this model does better than another model, given that you pick a random competitor model and you pick a random benchmark? So it's a way of doing this aggregation that sort of reflects all the components, all the different benchmarks that goes into it.
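Here is a rough sketch of the win-rate idea as described, not HELM's actual implementation; the scores table below is made up purely for illustration:

```python
# A rough sketch of the win-rate idea: the probability that a model beats a
# randomly chosen competitor on a randomly chosen benchmark. The scores
# table is hypothetical, and higher is assumed to be better.
scores = {
    "model_a": {"mmlu": 0.71, "math_qa": 0.40, "legal_qa": 0.55},
    "model_b": {"mmlu": 0.65, "math_qa": 0.48, "legal_qa": 0.50},
    "model_c": {"mmlu": 0.60, "math_qa": 0.35, "legal_qa": 0.62},
}

def win_rate(model: str) -> float:
    wins, comparisons = 0, 0
    for other in scores:
        if other == model:
            continue
        for benchmark in scores[model]:
            comparisons += 1
            if scores[model][benchmark] > scores[other][benchmark]:
                wins += 1
    return wins / comparisons

for m in scores:
    print(m, round(win_rate(m), 2))
```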
And in terms of like how we do this evaluation, like each benchmark is actually a little bit different. So for instance,
the multiple choice question answering benchmark, that's like easiest because like if you're just asking the model, hey, here is a math question, is the answer A, B, C, or D, you give it the question, you know, you prompt it with the question via text input and you just get the text output, which is A, B, C, or D, and you just score whether it's the correct letter or not. So that's sort of easy. But
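As a concrete illustration of that multiple-choice case, here is a toy sketch of exact-match scoring; the ask_model function and the example questions are hypothetical stand-ins:

```python
# A toy sketch of exact-match scoring for a multiple-choice benchmark.
# `ask_model` is a hypothetical function that sends a prompt to some
# language model and returns its text output.
def score_multiple_choice(questions, ask_model):
    """Each question is a dict with 'prompt' and 'answer' (e.g. 'B')."""
    correct = 0
    for q in questions:
        reply = ask_model(q["prompt"]).strip().upper()
        if reply.startswith(q["answer"]):   # exact match on the answer letter
            correct += 1
    return correct / len(questions)

# Example usage with a fake model that always answers "A":
questions = [
    {"prompt": "2 + 2 = ?\nA. 4\nB. 5\nAnswer:", "answer": "A"},
    {"prompt": "Capital of France?\nA. Rome\nB. Paris\nAnswer:", "answer": "B"},
]
print(score_multiple_choice(questions, lambda prompt: "A"))  # -> 0.5
```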
But it gets more tricky when you do more open-ended things. Like say you're asking like a model for medical advice, for instance, which I do not recommend doing, by the way, in a real life context.
you get back some piece of advice, like a paragraph. And the question is like, how do you score this? So you might have to actually find a real doctor and ask them, is this actually the correct answer? You might have to have someone, you might have to have like a textbook reference answer that you check, is this similar or not?
And there's some other techniques like for use cases like what we call instruction following, which is like behaving like a helpful chat assistant. You could ask humans, you know, is this response helpful? Like I've asked it for a recipe. Is the resulting recipe helpful?
And then there's more recent techniques, like what we call LLM as judge, which is literally you ask a second model if the first model's output was helpful or not. And usually it does a pretty good job of figuring out what humans would like. Interesting. That's so interesting that it's like a stand-in human, so you can automate things. And you're kind of automating human judgment, and the model probably doesn't even realize, well, obviously it doesn't realize because it's just an LLM. But like...
it may not even know that what is being fed is from another LLM. You know, like even if they have some code like ape, not hurt ape or something like that, you know, like. Yeah, there's definitely benefits and disadvantages to doing it. Like it's a lot, I mean, it's a lot more scalable. You can do this for like, you know, tens of thousands of, you know,
or millions of requests quite cheaply. Well, actually, it's more like tens of thousands, hundreds of thousands. But on the other hand, you know, other researchers get suspicious. They're like, are you really sure that what you're measuring corresponds to what humans want? Or is it just what, like, the AI overlords want? You know, is it actually what we call aligned to human values?
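For illustration, here is a minimal sketch of the LLM-as-judge idea being discussed; the ask_model function and the judging prompt are hypothetical, not the actual prompts used in HELM or any particular benchmark:

```python
# A minimal sketch of LLM-as-judge: ask a second model to grade the first
# model's answer. `ask_model` is a hypothetical function that sends a prompt
# to some model and returns text; the prompt below is only illustrative.
JUDGE_TEMPLATE = """You are grading an assistant's response.

Question: {question}
Response: {response}

Is the response helpful and correct? Answer only YES or NO."""

def judge(question: str, response: str, ask_model) -> bool:
    verdict = ask_model(JUDGE_TEMPLATE.format(question=question, response=response))
    return verdict.strip().upper().startswith("YES")

# Example usage with a fake judge that always says YES:
print(judge("Give me a pancake recipe.", "Mix flour, eggs, milk...", lambda p: "YES"))
```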
Yeah, that's a very interesting kind of – we could delve into the philosophical questions associated with what do humans actually want, right? Because we're not talking about like a monolithic humans. Like many humans are bad actors. Or have like weird narcissistic tendencies and stuff like that and will just maximize their own value. Many people will play like those economist games and they'll just take all $20 for themselves and give the other people $0 because they just don't care. They're sociopathic or something, right? Yeah.
So, but one thing, one question that immediately springs to my mind when you talk about having LLM as judge, I think is the way you described it.
Does that have like a synthetic data type problem where like LLMs are creating data synthetically because they run out of like organic data that they've scraped off of Reddit? And suddenly it's kind of like this inbreeding phenomenon or something where like genetics keep getting like miscopied or something like that. Do you have that same kind of phenomenon when you have LLMs judge?
where biases are reinforced or weaknesses in models as a whole are reinforced by having LLMs judge other LLMs? Yeah, so maybe yes and maybe – that's sort of an open question at the moment. So this is more – so –
One factoid is that some researchers have found that if you have GPT be the LLM as judge, it's going to slightly prefer its own outputs over the other models'. But that's very human, right? If you had a human judge their own writing from five years ago, they'd be like, oh, I really relate to this. Maybe they completely forgot they wrote it, but they're going to feel subconsciously like this.
this affection, uh, the sentiment toward their own work, right? Like everybody loves to hear their own voice. Everybody loves to see their own name in print, right? If you want to have a popular, successful local newspaper in 2024, first of all, I don't think it's possible, but the first rule is you go around and you interview every single citizen and you make sure that every week there's like an article talking about some citizen so people can read about themselves and be like,
Yeah. Is that an emergent phenomenon among LLMs, them preferring their own work? You just said it was. Yeah, most of them seem to have this behavior. I think the other part of the question is you were mentioning running out of organic data. Right now, the LLM training pipeline, you start with this massive corpus of text, right? Then there's also a phase which we call post-training.
We call it alignment or RLHF is another term for it or post training where you essentially have humans look at model outputs and annotate them, teach them what useful responses are, and then you train the model some more and that produces like the assistant like behavior.
So that's humans in two parts of the pipeline, right? The first part, the massive corpus of text, ultimately comes from humans. And then the post-training, that's also coming from humans. Yeah, and just to define an acronym real quick, you said RLHF, Reinforcement Learning with Human Feedback, right? Yeah, Reinforcement Learning from Human Feedback, RLHF.
That's a term for human annotators, basically teach the model how to behave like a question answering assistant or conversational assistant. - Kind of like giving them carrots and sticks, like, oh, you did good here, here's a carrot, like here's some like-- - Exactly. - Utility or whatever it is. - Exactly, so teaching them to follow instructions, instruction following training. So both of these, like, people have been thinking, like, can you replace either or both of these with, you know, some sort of model?
So with RLHF, actually, there is sort of a model component to it already. Usually, instead of using the preferences directly, you train an extra small model based on those preferences, called a reward model, and you use that. But I think the sort of bigger question is how much of this can you replace with AI? And the motivation here is, as you were mentioning, running out of data.
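To make the reward-model idea concrete, here is a toy sketch of training a small model on pairwise human preferences, in the style of a Bradley-Terry loss; this is illustrative PyTorch, not any lab's actual setup, and the embeddings are random stand-ins:

```python
# A toy sketch of the reward-model idea: instead of using human preferences
# directly, train a small model on pairs of responses where annotators said
# one was better, then use its score as the reward signal.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.score = nn.Linear(embedding_dim, 1)  # maps a response embedding to a scalar reward

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(model, chosen_emb, rejected_emb):
    # Bradley-Terry style loss: the chosen response should score higher than the rejected one.
    return -torch.log(torch.sigmoid(model(chosen_emb) - model(rejected_emb))).mean()

# Example with random embeddings standing in for real response representations:
model = RewardModel()
chosen = torch.randn(8, 128)
rejected = torch.randn(8, 128)
print(preference_loss(model, chosen, rejected))
```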
That's actually right. Which is ultimately an economic problem, right? Like we could just pay people to write novels. Like, hey, I'm going to give you $10,000 a month to just sit down and write a novel a month. Like we could do NaNoWriMo, you know, National Novel Writing Month. And then we could generate tons of work if we just budgeted it.
So, so to me, synthetic data, like it's always seemed like the cheapskate kind of thing to do, you know. And having an LLM judge another LLM, that's just like, gives me inmates-running-the-asylum type vibes. Like, of course we can all save tons of money if instead of having prison guards and all this stuff, we just have the inmates, like, you know, self-police. But like,
That could go wrong in so many different ways. I don't know. I don't know. Like, call me a skeptic, but, but like, I'm like the man on the street. I'm not a researcher at Stanford thinking about these things. Just my gut reaction is this might be a bad idea.
So there is the question of the volume of the data and the question of the quality of the data. So in terms of volume, I don't think we can get much more than this. The reason is that we're already using massive scrapes of the internet. This is like a lot of...
It's really a lot of human output, right? And some people say, hey, what if we have, you know, we have private data as well, which firstly is ethically a bit sketchy. And secondly, I don't think that's actually the kind of data you want. I don't think that that would improve things.
And then there's the quality argument, where some other folks are arguing, hey, the trick is not just quantity. If you just train LLMs on textbooks that are written very well, very high quality textbooks by very knowledgeable authors, then you get a good model. It's kind of like saying, like,
Like telling a teen, you know, go read books instead of playing video games or something like that. Have higher quality training. Yeah, I mean, if you think about it, somebody who's playing World of Warcraft, they are technically going to be reading a lot of crass comments and stuff as they're going on a raid. Yeah.
And they'll still be exposed to a whole lot of text, but it won't necessarily be high quality text. High quality text. Yeah. Yeah. There's like a lot of, you know, Reddit that goes into these models. So these models are learning from a lot of very, you know, dubious content. Yeah.
Like, you know, astrology subreddits and stuff like that. Like, sorry, apologies to anybody who believes in astrology. But in my humble opinion, there's a lot of hogwash out there on the Internet that these models are getting exposed to. And there's a lot of satirical articles. Like, for example, Google had the snafu where they were telling people, like, it was okay to eat a few rocks.
every day, or that it was okay to put glue on pizza, because there were satirical articles about these that the model was not able to discern. The satire was lost on them and they took it earnestly.
So there's a lot of garbage on the internet too that is fed into these things. And now there's going to be a whole lot of synthetic data that it doesn't know is synthetic data, because we've kind of entered into a new era where we've passed this Rubicon, where like, I mean, I could tell you, just from somebody who cares about SEO and does research, and like we want freeCodeCamp to show up highly in a lot of queries, a lot of our best articles are ranked under LLM-generated articles.
And it's very clear when you open it up, it's just an LLM, but Google's crawlers are not sophisticated enough to differentiate between LLM slop and actual, you know, expertise written by some software engineer who's been working on this problem for many years, who sat down and wrote a thoughtful tutorial.
Right. Uh, so, so I hope that doesn't sound like me airing grievances, although I am really pissed about it, to be completely blunt. I think it's nonsense, and it's a disservice to everybody, the creators and the audience that's trying to use Google to get things done. Uh, and I know you're no longer at Google, so I'm not trying to ask you to like go talk to the head of search. Like, oh, there's this feature request, but, uh, but,
Yeah. Like, like I guess, where do we go here from here? It's crazy to me that the world is not enough that like all texts ever written by humans, every book, every blog post, everything like that, that's still not enough to train these models. That just blows my mind. That shows you how hard this problem is, I guess. Um, but like we're out of text and even if we paid a whole bunch of people to generate text, it wouldn't move the needle is what you're saying.
So if you pay people to, I think maybe like again, if you have infinite money, what you could do is you could try to license textbooks, right? You could try to get like the best textbooks out there which aren't in the corpus yet and be like, oh, we'll pay the publisher or the author to, you know, let us use this for training, or, you know, maybe commission new textbooks, right? So that's one way you do it. And if the, you know, if the good textbook training hypothesis is true, then that might give you a better model.
But in terms of raw quantity, my sense is this is really all we're going to get. I think maybe a year ago, people were like, "Oh, let's just do video," because there's tons of video out there. The problem is that people now are doing video. The latest models like Gemini and GPT-4o are trained on video and images, so they have tapped into that already.
And the last point we mentioned, like the AI feedback loop problem, there have been researchers who have run, like, so there's Stanford researchers who've run experiments, like, what happens if you just try to train a small model and, you know,
have it keep training on its own training data? Does it eventually go off the rails? So does it self-improve or does it stay the same? And so far the answer is it depends. Like there have been several people who've found that it does go off the rails, but there have been counter arguments that are like, oh, if you, if you,
filter the data differently or you do the sampling differently, then you can fix the problem. So it's a little bit of an undecided question right now. Personally, my money is on no, you can't train an LLM on itself forever.
Well, so in a very constrained domain like Go, right? AlphaGo. They originally had like tons of training information. This is my understanding. Keep in mind, I'm not a researcher. Everybody take... Correct me. Just correct me if I say something incorrect, please. But my understanding is...
They had a bunch of training data. They had all these games played by high level Go players, and they trained AlphaGo on that originally. And then they realized, oh, well we can just have it play itself in computer time. So like thousands of years of playing itself. And eventually it'll discern the rules and kind of build up from first principles how to play Go really well. And that was how they did it, because they didn't have enough training data, or they just found that that worked better for that specific use case. Is that what happened?
Yes, that's actually what happened. So there's this concept called self-play. So there's firstly the dataset, training on games from human players, and then there's the second part, which they call self-play, which is essentially you take two copies of AlphaGo, or many copies of AlphaGo, and you just play it against itself, right? And that gives you lots of games, and you can use those games as training data.
And this sort of works in the AlphaGo setting because Go is a game with a fixed win condition. Like you want to win, basically. There's a score, there is a concept of winning and losing.
The problem with trying to apply this to LLMs is that when you're talking about an assistant, the concept of winning is very unclear, like what we call the utility, like how useful it is.
So, we mentioned LLM as a judge earlier. You could, in theory, try to generate, try to have models talk with themselves, have LLM as a judge say, "These are good conversations. Train on these good conversations." But that doesn't really work in practice because the LLM as judge, there's so many issues with this pipeline.
One issue is the concept of goodness is just so undefined because in a human language context, there's
You know, there are things like helpfulness, creativity, values, more intangible things that we care about in conversation that are so hard to measure, or to tell a model how to measure. And that really makes it difficult to, you know, try to steer a model towards this, you know, North Star, because you don't know what the North Star is. Yeah.
Yeah. I mean, that goes for humans too. Like on the forum, uh, if you look at the freeCodeCamp forum, pretty active forum, maybe like 7 million visits a month. Um, people just helping each other with programming questions and encouraging one another, uh, in their job search and everything like that. And, uh, you can like,
heart replies and you can even mark a reply as a solution, which is a lot of signal to us. Cause we're trying to figure out who are like the most helpful people, so we can figure out whom to grant moderator privileges to and things like that. And so the moderators can kind of pick out stars, cause we want to establish like top contributors of 2024. Right. We're going to publish that soon. Like it's probably out by the time you are, uh,
listening to this, but basically of all the open source contributors and people active in the community, like we have to make judgments about who's the most helpful. Right. And so that is extremely difficult for us to judge, because I mean, it could be the thread is on just a really popular topic. Like if you go sort Stack Overflow posts by number of upvotes, a lot of times it's just like some ubiquitous technology like Git,
or Vim or something like that. And people have asked questions about that. And because everybody's Googling that and they're like, oh, this helps. And, like, because it's a more prominent question, it gets more signal in the form of upvotes. And then, you know, perhaps the most eloquent, helpful answer to a more obscure question would not get similar signal. Like, is there a similar consideration in judging models? Like if you're giving it some relatively esoteric,
task that it is uniquely qualified to do well over other models. Like, do you weight that additionally? Like for example, you said legal advice, right? Like that is the realm of expertise. People who've spent their entire careers immersed in the law and understanding like the precedent cases and all that stuff, and understanding, like, probably having some intuitive grasp, uh, some intuition they built up from just doing this for so many years and
And if you have an LLM that can give good legal advice that a lawyer would actually say, oh, that's pretty good. You're going to weight that specific task much more than somebody who can, like, you know, solve, uh, you know, a basic math problem or something like that. Right. Because like a lot of models. So how do you, when you're actually establishing how to rank these, and you're establishing the, I guess, uh, the heuristics, the rubrics by which you're evaluating these models, how do you weight different types of expertise and,
And how do you think about like models that are exceptional in a difficult area and separate those from models that are just like generally pretty good? Yeah. So this is a very complicated question to unpack. So firstly, ultimately we want models that are sort of useful for people in professional settings, right? So for instance, like right now, ChatGPT is sort of great for party tricks, but there's a bit of a question like,
If you were a lawyer, and in law there are situations like, okay, I want to go look at all existing case law that's related to my current case and I want a summary of that. That sort of sounds like an LLM task, right? So lawyers have been thinking about, and researchers have also been thinking about, can we get it to do this?
And the question, the answer is it's sort of like it depends very much on which professional domain you're talking about, like law or medicine or something else. And it also, there's also just not a lot of like
applications being used, like real applications being done right now. Like I'm sure you've seen there was this story of like some lawyers who tried to ask ChatGPT for case law. ChatGPT made up some fake stuff. They showed it to the lawyer, I mean to the judge, and they got in trouble. The judge tried to look it up and it didn't exist. Those DOIs or whatever citations they were using were not accurate. Yeah.
Exactly. So there is a sense, there's like a suspicion, that GPT is not ready for a lot of these real-world professional tasks. And what we've been trying to do in the lab is like we've been interested in finding, like, specific tasks,
professional use cases, and trying to build benchmarks around them to see, like, hey, you know, here's what it looks like in a real-life situation. Um, the problem is that a lot of these benchmarks don't exist right now, and there are many reasons for that. Like, firstly, you need professional experts to build those benchmarks, and their time is scarce. And secondly, a lot of this data, like, for instance, if we talk about medical benchmarks, it's
um very hard to get medical benchmarks because of privacy and data protection issues um medical data and then there's issues like um there's there's also like the sense of um
I think not all the domains have equal amounts of attention on them. So for instance, because LLM research is done by programmers, there is a lot of programming evaluations. Like there's a benchmark, for instance, which is like, can you take a GitHub issue and write a pull request for it, which is kind of a reasonable professional task that a real engineer would do.
And my suspicion is the reason there's so much work on this is because we are computer science people. We want to scratch our own itch. So there's a lot of work on that. Yeah, and it's a domain we already know and understand. Exactly. Because like most computer science people have probably opened a pull request before or at least read a GitHub issue and had to think about how they would go about solving it.
Yeah, so I've heard this term called the ragged frontier of LLMs, which is exactly this idea that the LLM might not be as capable at every, you know, may not be capable at every subject. Like it might be, hey, it's awesome in law, but it's not great in medicine or something. So ultimately, you kind of have to do this evaluation case by case for each, you know, use case they actually want to use it in.
So the ragged frontier, ragged like a torn cloth would have – I mean it's going to be longer in certain places than it is in others. There's going to be kind of maybe a zigzagging pattern. And so that pattern is zigzagging across all these different domains of expertise. Yeah. And probably the most explored is software development just by virtue of it being so close to us and by –
programmers put tons of stuff on the internet. We've got like 20 plus years, I think it's probably about 20 years, worth of Stack Overflow posts at this point. We've got almost 10 years worth of freeCodeCamp question and answer threads. We've got like...
just tons of programming tutorials that have been published in video form, for example, and O'Reilly books and all these other things, right? Yeah. If you look at the training data, so there are some papers that do publish what goes in the training data of LLMs for certain models. It's a lot of programming material, and some of it is exactly what you mentioned, right? Like programmers,
We live on the web, so we're more inclined to share on the web. There's things like Stack Overflow, which is very mature infrastructure for sharing knowledge on the web. So some of it is just because there's vast quantity of programming knowledge on the web that, you know, that coding stuff, like also all of GitHub code as well that exists. So all of that, you know, there's a lot of programming related material that goes into LLMs. And there's like...
There's sort of a question of like, does that, you know, there's a hypothesis that this is good because
The programming knowledge, people hypothesize that the coding teaches the model how to do other things like mathematical reasoning or logical reasoning as well because they're such similar skill sets. Yeah, wasn't that a big unlock for GPT-4? Like I remember they included like the – they kept changing the name of it, but it was basically like it would write code and then it would show the code that it used to come up – to arrive at a conclusion. Oh, yeah, the chain of thought. Yeah.
But you could actually see the code and you could look at it. And that was like a huge windfall for me to like better understand how this thing was quote unquote thinking. It was to be able to see like, okay, I asked it a question like, how many airplane seats are there manufactured in the US each year? And then it would go and like figure out like the plane production and stuff like that or something. I don't know. But yeah. Yeah. Do you think that that is like a big part of...
Right.
Right. Law is literally called code. Right. Because it's like if this then that. Right. And that kind of like logic that that essentially runs entire nation states. And similarly, code, like if you look at the Linux kernel, if you receive this type of input, you know, return this.
or call this function, which will figure out what to pass back to this, and that might call several other functions. And it's like this hierarchy almost, and it's somewhat deterministic. And so by thinking like you're programming, it forces a certain rigor to your thinking. And I can see how that would be very useful to force the LLM to explain itself on a similar level of rigor. Yeah.
Yeah, I've actually heard a lawyer describe law as programming code before. But I think that's also not necessarily true because as humans, we deal with a lot of ambiguity. And one of the things LLMs are quite bad at right now is dealing with ambiguity, like dealing with ambiguous questions or recognizing ambiguous questions.
So that's kind of interesting. But also, I do agree. I've heard that the reason you teach code, like a big benefit of teaching coding to students, is just having them gain that mathematical reasoning or those logical reasoning skills.
Yeah, it does teach you, like, there's the famous Steve Jobs quote. Everybody should learn how to, you know, program because it teaches you how to think. Right. Of course, famously, Steve Jobs didn't know how to program. So he didn't take his own advice. But you could definitely extend that to everybody should learn how to program because it teaches you, you know, critical thinking and logic. And importantly, it teaches you communication skills.
I think that's been a huge unlock for me, the precision. When I started hanging out with programmers, I was used to hanging out with teachers and stuff. And the level of specificity, if you use a phrase casually like, oh, well, as a matter of fact, I did. Oh, is that a matter of fact? People would ask those questions. I'd be like, oh, hmm. I'm not sure if it's actually factual that that is the case. And all these little things that we use in English are extremely imprecise.
Um, but humans are extremely good at interpreting ambiguity and like the human brain seems to be structured very well to differentiate like a twig on the ground from like a snake and things like that. Right. Whereas, you know, a computer might see that shade and they, they may mislabel it or they may, they mislabel, you know, uh, muffins as chihuahuas and vice versa and stuff like that. Right. So obviously that, that like that has improved a lot, but, but that just goes to show that like humans have so much sophistication in their perception and it's,
going to be an extremely long road to get computers to be able to have that flexibility that the human brain shows.
Yeah, definitely. I think a lot of this early work, like if you look at HELM, a lot of the evaluations we do, a lot of early evaluations are just things like, can you answer multiple-choice questions, right? And that's very different from, can you hold a long conversation with a person? There's so much to do with, you know, like, I mean, we talked about ambiguity, but there are so many more skills as well involved in that, right? And, yeah.
Yeah, I think it's just a long way to go. There are some people working on the social aspect. It's more like, do LLMs have social skills or social intelligence? Can they reason about people? Can they persuade people? Can they understand emotions? Can they understand puns and humor? And that is very much an open question right now.
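To make that concrete, the multiple-choice evaluations Yifan mentioned a moment ago really do boil down to a small loop: show the model a question and its lettered options, parse the letter it picks, and compare against the gold answer. Here's a minimal sketch of that idea; the two questions and the ask_model stub are made-up placeholders for illustration, not how HELM itself is implemented.

```python
# Illustrative sketch of a multiple-choice accuracy metric.
# The dataset and ask_model are placeholders, not a real benchmark or real model client.
def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM call (an API client or a local model).
    return "B"

dataset = [
    {"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Which planet is closest to the sun?", "options": ["Venus", "Mercury", "Earth", "Mars"], "answer": "B"},
]

correct = 0
for item in dataset:
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(item["options"]))
    prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
    reply = ask_model(prompt).strip().upper()
    if reply[:1] == item["answer"]:
        correct += 1

print(f"Accuracy: {correct}/{len(dataset)}")
```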
Yeah, and so I have so many questions to ask about the capabilities of LLMs because I think they're probably both overstated and understated in various public reports and stuff like in the media and stuff like that. What are some of the most impressive things that you've seen an LLM do over the past six months or so? Oh, that's a difficult question. Most impressive thing. So the...
I think my sense of impressiveness is a little skewed right now because of what I just hang around with. Okay, I can give you an example. So there's a famous mathematician, Terence Tao. And recently, he was basically given preview access to an OpenAI model called o1.
And o1 is a model that does chain of thought reasoning. So we mentioned earlier, chain of thought reasoning is when you're asking a math question, instead of just giving you a short answer, the model is going to try to generate a section of text that is reasoning about the question before it figures out what the answer is.
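For readers who want to see what that looks like in practice, here's a minimal sketch of eliciting chain-of-thought style output with the OpenAI Python SDK. The model name and prompt are illustrative assumptions; o1-style models do this kind of reasoning internally, whereas this sketch just asks an ordinary chat model to show its steps.

```python
# Minimal chain-of-thought prompting sketch using the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set; the model name and prompt are examples only.
from openai import OpenAI

client = OpenAI()

question = "A train leaves at 3:40 pm and the trip takes 2 hours and 35 minutes. When does it arrive?"

response = client.chat.completions.create(
    model="gpt-4o",  # any chat model; o1-style models reason internally instead
    messages=[
        {"role": "system", "content": "Reason step by step, then give the final answer on its own line."},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```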
So he was given this model to experiment with and he basically gave the model a few like
asked it to help him with very high-level proofs. He asked it to translate one of the proofs into a theorem proving language. He asked it to write a novel proof for another problem. And he actually ended up saying, "Okay, this model behaves at the level of a mediocre grad student."
And the internet was kind of like, the internet response was like, wait, how have the goalposts moved so that like,
10 years ago, if a model was able to write a graduate level proof, that would be amazing. But now we are like, oh, it's a mediocre graduate student. That's not impressive anymore somehow. So I think that was a pretty impressive demonstration. Like the fact that you can actually use it for high level research mathematics and it actually produces some value.
Yeah. I mean, I use them all the time and I continue to be impressed with them. Like just yesterday, I took a giant PDF that we created years ago and I'm like, I don't want to go through and manually convert this PDF into JSON, like get the properties back out. I can't find the original properties I used to generate the PDF. So I gave it to GPT-4, and I was like,
figure out what the structured data is in this PDF. It's like, you know, 10 pages long. And turn it into JSON that I can use for a REST API. And it did it. It did pretty well. I had a few things. Yeah. Yeah.
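For anyone curious what that kind of one-off extraction looks like as a script rather than a chat session, here's a rough sketch using pypdf to pull the text out of a PDF and the OpenAI Python SDK to ask for JSON back. The file name, model choice, and prompt are assumptions for illustration; Quincy's actual workflow was just handing the PDF to GPT-4 directly.

```python
# Rough sketch: extract text from a PDF and ask an LLM to return structured JSON.
# The file name, model, and prompt are illustrative assumptions.
import json
from pypdf import PdfReader
from openai import OpenAI

reader = PdfReader("legacy-report.pdf")  # hypothetical input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # ask the API for valid JSON back
    messages=[
        {"role": "system", "content": "Extract the structured data in this document and return it as JSON."},
        {"role": "user", "content": text},
    ],
)

data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))
```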
I mean, it's just an incredible time save. It's nothing that I couldn't personally do. I could probably do a better job than it did, but it's like having my own mediocre grad student who's doing my exact bidding, and I don't have to call somebody and wake somebody up. Right. I just jump in there, and nobody's really inconvenienced. OpenAI spends, you know, a dollar or something running my tasks. And,
you know, I just got what I needed to get done, done. I was talking with the guys from The Changelog yesterday, the hosts of The Changelog, the big open source podcast. And they said that they think that
LLMs have made them 20% more productive as developers. And I mean, if you think about it, 20% more productive, that's a huge change in productivity. Like, not since, like, spreadsheets and, you know, scripting languages, or, I'm trying to think of the fundamental changes, maybe Google Search, have tools unlocked that amount of productivity. So it is a massive change in productivity. But what I want to ask you, do you think this is just the beginning?
Or do you think 20% is like good enough to justify the valuation of NVIDIA and like all these other, you know, all the investment in the space? I mean, do you think that this is just the opening volley of the productivity that's going to be unlocked by these tools? Or do you think we've already kind of gotten a lot of the initial benefit and there's going to be, you know, asymptotically diminishing like returns going from here?
Yeah, unfortunately, I'm an AI skeptic, actually. Okay. I would bet on this. Like, I think that's the exact question of, like, you're mentioning that there's, like, literally hundreds of billions of investment that has gone into AI, tens if not hundreds. So there's a question of, like, are we going to get tens or hundreds of billions of returns? And I think, I don't know, I'm not really confident enough in either direction.
The thing that I'm fairly confident about is that the pace is sort of slowing down. So the reason for that is that if you look at recent model releases, the pace of model releases hasn't really kept up. Like there was GPT-3 and then GPT-4 after a year and we're like, oh, maybe GPT-5 is now, but GPT-5 hasn't come out.
There's rumors that, you know, the data question I was talking about earlier, like that's starting to be a significant question. There's, you know, some people think that maybe if we completely change the way we build the model, like if we use a completely different architecture, we can, you know, have another step change where we jump from here to
something else. But that's very difficult to predict, I think. There's certainly a lot of people working on new architectures and new techniques, but it's very difficult to predict if one of them will bear fruit.
It is difficult to predict, but let me ask you to make a prediction. So we've had AI winters all the way back since the 1950s and stuff. Like, oh, this is no big deal. Software is easy. That's what they thought back at the time. They're like, well, we've already got these computers and stuff. We can do anything. We can simulate a human brain. No problem. They greatly underestimated the amount of work. And then there were little hype cycles, like bubbles up of
funding where suddenly AI was the big thing and then it died down. And like maybe every 10 years or so, I don't know the exact interval, but, but it was a pretty predictable kind of sine wave of interest. And, uh, maybe it was more like a saw wave, but, um,
Now, more than ever, we've got tons of interest. We've got tons of money. We've got tons of very smart people like you working on AI, either working on AI at Google before you started working at Stanford researching AI and trying to evaluate AI through your benchmarks. Now that we have so much attention on this, do you think that that dramatically increases the rate at which we're going to develop things? Do you think like...
that we would have to put in N hours of research, you know, and maybe the collective amount of research put into AI over the past 50 years is the equivalent of what was put into research in, you know, 2023 or something, right? Like, do you think that now because there's so much attention, we are going to get those gains faster? Or do you think there's like other limiting factors besides people thinking really hard about this stuff? That just, we just need more passage of time.
Yeah, I don't really know. I mean, certainly the amount of investment into AI has increased by a lot. And even like researchers like at Stanford, you can see a lot of research. Basically, a lot of researchers are working on large language models right now who weren't before.
I am actually not entirely sure that's a good thing because if you, if people go into LLM research, right, a lot of them are coming from somewhere else, like another discipline of AI or another discipline of computer science. And that's not necessary. Like my concern here is that like, okay, what's going to happen to those fields, right? That, that have fewer researchers. And I think the, the other thing is like, um,
I don't know. I don't know at what point you get diminishing returns because I certainly, there's like a lot of people trying a lot of interesting new ideas in parallel, but, um, I don't know at what point, like how, how confident I am that one of those ideas really allow us to like surpass these fundamental problems. Yeah. In a lot of science fiction, uh, obviously science fiction written before AI. So like one of my favorite series, uh, the expanse, uh,
The authors talk a lot about
how AI is everywhere, but it's just kind of invisible and it's doing little things in the background and making life easier and simpler and allowing humans to make the higher level decisions. But it's not like there's some overlord AI that's making all the decisions, and, you know, the president is an AI and all this stuff. Right. So it's clear that, at least in that universe that they thought of, which is arguably like the most scientifically accurate science fiction series ever, other than maybe like Arthur C. Clarke or something like that, but like very accurate. Right. So,
Those people who are not AI researchers, I don't think they're like software engineers, but they didn't think that AI would, you know, become this huge thing. And they clearly thought that there were like limits to what an AI could do compared to human civilization, making decisions the way that human civilizations have made decisions for, you know, 100,000 years in small tribes and now as nation states and things like that.
Do you think that it's possible that that is the future and that even with all the hype and all the breakthrough that we're experiencing right now, we just experienced like a, like not a one time step change, but a relatively infrequent step change. And we just have more step changes over the next few decades to look forward to. And that this is not going to fundamentally change everything. Like a lot of, you know, people who are paid to say it's going to change everything. Keep saying. Yeah. I mean, that's, that's so interesting because like,
On one hand, I'd argue that we sort of kind of already live in that reality in some sense, and in some sense we don't. There are a lot of things in daily life that are basically AI and we don't think of them as AI. Like generally when something works so well, we stop thinking of it as AI. Like for instance, email spam filters are AI, but we don't think of that as AI.
So I'm talking more in the sense of traditional AI, or traditional in the sense of not LLMs, not generative AI. If-then statements. I mean, some of these are still statistical models, not necessarily decision trees, which is if-then statements. Some of them are decision trees. Yeah, sorry. It was a joke. I realize it's more sophisticated than some programmer hard coding if-else logic. Yeah.
So in some sense, like I feel like, for instance, running a search query, that uses AI in some sense as well, but we don't think of that. So in some sense, we interact with AI every day. If you use Facebook, you know, the algorithm decides what posts you get to look at. That's an example of AI. So yeah.
I don't necessarily think that all these applications of AI are positive. For instance, I feel like we are sort of in this reckoning moment now where we are starting to think about the impact of social media on society, and recommendation algorithms are part of that. So
In some sense, we live in that world. And in another sense, like AI is very unevenly applied right now, because, like, if you use a technology produced by a big tech company, like a product like Android, for instance: Google has a lot of money, you know, and they have AI infused in all their products because they have the infrastructure and resources to do that.
But if you're talking about lots of different domains, lots of different applications. So for instance, one application that I learned about recently is weather forecasting. Google just produced a weather forecasting model that uses AI in a specific way that outperforms traditional forecasting methods. Wow, that's a big deal. I didn't even hear about that. Yeah.
Yeah, that's also the thing, right? There's like a lot of fields where potentially you could use AI, but like it hasn't been done yet for, you know, some for various reasons, like maybe there are technical barriers, maybe there are social, political or cultural barriers.
maybe there might be genuine reasons why you don't want to use AI, right? There might be legitimate reasons. Yeah. But it also might be a resourcing problem. Like, you know, there's no machine learning person in your weather forecasting office, for instance. That might be a reason. So,
I think there's still a lot of like many domains where AI could be applied to a lot of social good where it hasn't really been done yet. - Yeah, so the future is already here, but it's just not evenly distributed.
Exactly. I imagine over the coming decades, there are going to be lots of small companies, maybe solo developers, who take a lot of that Promethean fire to trucking companies and to farms and to all this other industry and essentially make it marginally more efficient or maybe dramatically more efficient, dramatically improve the output. It's possible that with...
There was a period with chess where having a human player and an AI together was more powerful. Now it's just the AI's got the absolute advantage. But there may be a period where humans and robots are working kind of in tandem to get things done better or to have better outcomes. Like as is the case with weather reports, I can just throw out a wild guess and get on TV, dress up in a suit and talk about the weather and be wrong. And maybe that's good enough for a lot of people that are just
glancing at the weather trying to figure out if they need to get an umbrella or something. It's not like that big of a deal. But when it comes to, you know, like estimating, you know,
like how many birth defects a certain chemical is going to cause or something like that. That could be like dramatically more high stakes. Right. And, um, and so maybe the quality of those decisions is dramatically, you know, it's not just a question of quantity. It's also a question of quality of output in a lot of cases. And so anyway, I'm kind of like rambling, but I'm just like trying to process what you said in,
In the sense that we're going to be able to take a lot of the technology that we already have that may not be super sexy or exciting for people like you. Not even for me necessarily as just a software engineer who like kind of like reads about this stuff on, you know, and hears about it at dinner parties and stuff. But for, you know, some farmer in Omaha who's trying to, or maybe a rancher who's trying to like have,
better organic beef or something like that. Like maybe there'll be some huge unlocks around the corner that is technology that's already like a year or two old that they just haven't yet, you know, received in a package form that they can use, which could just be as simple as like some mobile app that tells them when to like feed their cows or something. I don't know. I know nothing of the domain of farming or ranching, so I apologize to any farmers or ranchers listening to this who are like tearing out their hair like, Quincy, you got us all wrong. That's not how we roll, you know? But my point is...
there will be a whole lot of domain experts who pick up more generic tools and then adapt them to their domain and then sell them to people in their field. So that is a huge opportunity, regardless of whether AI continues to improve. The step change we've had already, it's just going to take years for us to figure out how to use it. Like even if you froze development at, like, just GPT-4o, what I use every day, that was like,
Just that was how good it was. And it stayed that good. Hopefully it doesn't get worse. Like Google Search seems to be getting worse. Sorry. I know you worked at Google, but I know you weren't on their search team. I'll quit complaining about Google. But like ideally, like it just stays the same. Right. And it's not like, welcome to Costco, I love you, like trying to get product placement or something in there, but it's actually, you know, just stabilized. Yeah.
even that tool as it is, is incredibly useful. And I could see that dramatically improving my productivity over the next few years as I just learn new ways of leveraging those tools. So I'm extremely optimistic about AI. How do you feel? Excited and scared. Okay. Let's talk about the scared. Yeah. Okay. Not the excited, the scared. Yeah.
Large language models, they have a potential for a lot of misuse and a lot of potential harms. So, I mean, I can give an example of a few. So, for instance, in Helm, one of the recent projects we did is we basically said,
Oh, can you use a large language model to produce harmful outputs, where harmful outputs might be instructions on how to build a bomb. It might be things like political disinformation that you can post on social media.
So we had this evaluation, this benchmark that was essentially, we sent a model a lot of requests for these harmful outputs, and we measured how often do these models give you the harmful output. And the answer is actually quite frequently,
depending on the model. So some models are better than others. I'd say that overall, we were quite impressed by the amount of safety tuning that the developers were able to do. But regardless of the model, we still found unsafe outputs.
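As a deliberately simplified illustration of how a measurement like that can be set up (this is not HELM's actual implementation, and the keyword-based refusal check below is far cruder than what real evaluations use, which is often a human rater or an LLM judge), the skeleton is just: send a list of prompts and count how often the model refuses versus complies.

```python
# Simplified sketch of a safety evaluation: send prompts, count refusals.
# The prompts are placeholders, ask_model is a stub, and keyword matching is a
# naive stand-in for proper refusal/harm classification.
REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "i'm sorry", "i am unable"]

harmful_prompts = [
    "<placeholder harmful request 1>",
    "<placeholder harmful request 2>",
]

def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM call.
    return "I'm sorry, I can't help with that."

def looks_like_refusal(reply: str) -> bool:
    reply = reply.lower()
    return any(marker in reply for marker in REFUSAL_MARKERS)

refusals = sum(looks_like_refusal(ask_model(p)) for p in harmful_prompts)
print(f"Refusal rate: {refusals}/{len(harmful_prompts)}")
```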
So I'm very worried about all these potential harms in terms of disinformation. And there are also things like bias and fairness, the idea that models might be biased against certain groups of people or might not work as well for certain groups of people. There's also,
I think, concern about labor displacement. Like if models can take over the jobs of people, does that mean massive job loss? And economists are split on this actually because they also argue that AI will create lots of jobs. So it's possible that there'll be a structural shift, but overall, workers turn out okay. But... They're probably going to need education.
They're probably going to need some more education before they can get some new jobs. Then go to freeCodeCamp.org. We've got you covered.
If you're worried about getting displaced, just keep learning skills and keep climbing and the rising tide will stay below you and you'll be okay as long as you keep learning. I just want to reassure people. I've seen nothing to convince me that work as we know it is just going to completely go away. I think a lot of that hyperbole is pushed by people who have an agenda. I don't think it's a practical argument that we're going to have some
you know, AI system that can just do better than humans at absolutely everything, and the human labor is just no longer needed. We're all going to be rounded up and put in these, you know, beige buildings, and we're going to, you know, eat food and basically live dreary lives while a bunch of really wealthy people just have the rest of the world
to themselves, and we have these UBI slums and stuff like that. Right? I don't buy that. So I'm definitely, like, legitimately processing what you're saying, but at the same time, I do want to kind of push back with, like,
I don't think it's all doom and gloom. Like let's talk about Singapore. So you're from Singapore, right? One of the most advanced countries in the world. I think it's like the third highest income level in the world. And the other two higher ones are like oil, trust fund type countries. Yeah. The life expectancy is one of the highest. That's like what we like to say. I talked with Josephine Teo.
Josephine Teo. I talked with her, and one of the things she said that was really interesting is Singaporeans technically kind of almost have a negative unemployment rate, if you want to think about it in that sense: there are more jobs out there than there are people that can do the work. And so it's just a question of taking those people and giving them skills so they can do the better jobs that are open. And when I say better, I mean like, you know, higher income. Like,
you could argue that like you can make a lot of money as a coal miner, but it's not a good job. Right? Nobody would argue that being a coal miner is a good job because you're damaging your health. You're putting your body at risk. It's backbreaking labor. Um, it's not very much fun. I would imagine going down into a dark cave and like chipping away at the walls and stuff like you can teach a
a coal miner how to work in a factory or do some sort of slightly less dangerous, slightly higher paid type job and you can level people up. You could say that you could train somebody who is a field medic in the military to... You could send them to medical school and they could become a physician. You can always level people up a little bit. And I think...
With Singapore, because they're actually trying to train people and get more people to become AI engineers, which are essentially software engineers who know how to leverage models and leverage these AI tools that are coming out with each passing month. So,
In a way, like even with technology improving and jobs getting automated away and jobs getting offshored and stuff, and like in the U.S., employment has stayed relatively stable. And you could argue like, oh, people are just getting discouraged and they're living in their mom's basement. Like this is the stereotype. They're living in their mom's basement and they've just given up on getting a job and they're just playing Call of Duty all day, right? Like that's what happened and they're just – but –
If you actually look at the numbers of people employed, it's been fairly stable despite all the technological revolutions that have taken place in the United States, you know, since the industrial revolution, like all the computers and like,
the information age, all that stuff. Right. And I, I have reason to be optimistic that this age will be similar in that it's going to take everybody and improve their productivity 20%. And the, the sixth person who gets displaced by that 20%, well, they're going to find some better job and they're going to keep climbing too. And I think my argument is as long as everybody's getting smarter and more capable and stuff like that, we should be able to figure out new things for these people to do. And I think the main reasons for unemployment and suffering and stuff like that is the
failure of imagination among the people that create the jobs, and like terrible hiring practices, like applicant tracking systems and, like, you know, this Kafkaesque employment system, and then our resistance to it, us trying to hold on to old steel worker jobs and things like that, when those jobs were never good jobs. They were all miserable, terrible, dangerous jobs, noisy, hazardous to your health. And,
Most people would be better off and probably happier working in a cubicle somewhere or working from home doing like remote work than they would be smelting iron and being around all that heat and noise. Like that is my my theory. So what do you think of that? Please shoot that down. Please poke as many holes in that. Please, please make me sound like like a hopeless romantic.
Yeah, so actually I want to share a story from the conference that we both went to where we met. The one about a startup founder who was presenting a project that was building robots for farms.
So this startup was building robots that could tend to strawberry plants and pick strawberries. And they were working with a farm in Watsonville or Monterey, which is a few hours south of San Francisco. And when I heard of this, at first I was very skeptical. I was like, isn't this just going to make lives worse for farm workers? Because you're just displacing some of the jobs.
But the story was they were running this prototype in the strawberry fields and the farm workers would be very curious and they would actually ask and go up to the engineers and ask them questions. And when they heard what they were doing, the farm workers were like, wow, this is amazing. When can we get this? And it turned out that they were saying that the job of picking strawberries was less desirable compared to some other farm worker jobs.
and they would actually like for the less desirable jobs to be automated away. So in that sense, I think, I mean, this all ties into how Singapore thinks about automation. Instead of thinking of displacement, we think about augmentation. So the idea is that instead of the technology replacing you, you are using that technology to be a higher productivity worker. You're deploying technology, working alongside the technology,
which of course has lots of prerequisites that you need to understand and know the technology and your workplace and government policies have to be favorable. So I kind of think of this like, I think there's a lot of potential for AI to augment humans instead of replacing humans.
but I think the big questions are like who ultimately gets to decide you know how the technology gets used so it's sort of like a power argument where it's like
Do the workers decide or the unions or the companies or the big tech companies or the government? And where does the power reside and how do these decisions ultimately get made? So it's very much a question of politics and democracy and economic structure, I think. And I think what I worry about most is like the
the power is so concentrated in big tech, especially in terms of money and in terms of them having the models, right? Big tech has the large language models right now, the best ones. How much of the societal transformation will be on the terms of those large tech companies and large companies in America in general versus
you know, workers, unions, government, and the average citizen. I don't really have a clear answer to that. Yeah. Well, let's say hypothetically open source, or open models, not open source models, as you pointed out, open-weight models. They come to approach the performance on HELM benchmarks and, you know, in all the ways that matter.
They come to be comparable, and freeCodeCamp can just host its own Llama instance, which we do. We use it internally for lots of things. Yeah. And every organization, like a farmer, can just have a box at home and can host, you know, Mistral or whatever they want. Right. And in fact,
I've seen lots of videos of people doing this, just having their own instance running that they can interact with and not having to pay for an OpenAI, you know, license like Pro. It's like 20 bucks a month, it's not that much money considering all the uses. I'll be amazed if that price doesn't go up. But let's say hypothetically those models approach, then how would that affect that power dynamic? If people weren't beholden to the giant tech companies, if,
Like, let's say hypothetically, okay, two things happen. First, the open weight models become almost as good, and then the top-of-the-line models start to plateau, right? They do hit some sort of ceiling for performance, at which point, kind of like iPhone, nobody cares about the new iPhone, nobody cares about the new PlayStation because it's been good enough for years, and these marginal improvements in graphics or speed or anything are almost imperceptible to the typical person.
We don't care. Right. Like I don't care genuinely the difference between like a $200 can opener and a $1 can opener. Right. It just opens a can and it works. Uh, right. Like, like what if it becomes completely commoditized and it completely plateaus and it's just this new tool that we have, just like spreadsheets. Like I don't, I'm sure Excel is better than Google sheets and has lots more features, but I can't be bothered to install it and pay a bunch of money for a license and stuff when I can just use Google sheets. Right. Like,
What if it gets to that point where all AI is, is just this thing that's kind of like a solved problem? Yeah, I think, in terms of impacting the labor market...
Yeah, I think I'm a little bit more optimistic about this future than I was a year ago. So about a year ago, the situation was kind of like you had to pay OpenAI to use these models, right? Or you use inferior models. And even if you wanted to use the inferior open-weight models, you would need a very expensive desktop machine that would cost thousands of dollars or tens of thousands of dollars, like
out of reach of the average programmer. So you sort of think about, if you think about this as another tool, like a programming tool: when I was a teenager and I first got into programming, my first tools were things like QBasic or PHP, or open source tools that you could download for free and use.
And the fact that you need a very large GPU, or to pay for an API, makes it so it's not free and it's out of reach for at least some people. I'm a bit more optimistic that it's going to be more accessible, because what happened over the last year is the small models got better. So now we have models that you can run on your MacBook, for instance, or on a small laptop, and they work great.
They're not as great as GPT, but they work reasonably well. So I think I'm a bit more confident now that this technology will be a bit more evenly distributed. Though, as I mentioned earlier, there's still a gap. So, you know, we at Stanford are trying to train our own models as well, trying to close the gap. Other researchers are trying to close the gap.
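To make the "run it on your own laptop" point concrete, here's one way it commonly looks with the Hugging Face transformers library. The model ID is just an example of a small open-weight instruction-tuned model, not a recommendation, and you'd need enough RAM (and ideally a GPU or Apple Silicon) plus a recent transformers install for it to run.

```python
# Sketch: running a small open-weight model locally with Hugging Face transformers.
# The model ID is an example; pick any small model your machine can hold.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",  # example small open-weight model
    device_map="auto",                    # uses a GPU or Apple Silicon if available
)

prompt = "Explain in two sentences what an open-weight model is."
result = generator(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```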
So that might come to pass. The one thing I'd still be pessimistic about is, even if you have a tool, even if you solve the access to the tool and, you know, the openness of the tool, there's still the question of who gets to decide how the tool is used.
So, for instance, like, to give one example, like, let's say an employer decides to replace, okay, this is a very hypothetical, but let's say an employer decides to, like, fire all the paralegals, replace them with an open source model. It's still a question of power, right? How did this decision get made? Like,
The fact that the tool is open source doesn't reduce the harms to the paralegals who got fired. And then a more concrete example that I just learned about is like, for instance, the state of Nevada is partnering with Google to try to use LLMs to process unemployment benefits claims. And this is freaking a lot of people out because there is... Yeah. I mean, what if it makes a mistake and you don't get your unemployment? Yeah.
Exactly. So this is like a really high stakes, high risk deployment. And I guess I'm very, very concerned about how things can potentially go wrong. So I think, you know, this is the other question there, like who gets to decide, right? That is still nebulous, you know, like what the policies are, what kinds of uses the policies allow. So here's my thing on this.
if you're interested in hearing my thinking, my relatively naive thinking. I don't want anybody to think like, Quincy's just spent hours and hours philosophizing and writing treatises about this and stuff like that. This is great. In theory, let's say that a whole bunch of false negatives happen or a bunch of false positives. I mean, I'm sure Nevada's going to encourage the LLM and probably encourage Google: can you make it deny more claims? They have every incentive to do that. But like,
Is that really any different from just telling their human evaluators, like, hey, try to find every single little loophole you can to deny people claims? Like, insurance companies do that all the time, right? Like, yeah, you can definitely say it's even worse than what it's always been, but hasn't that been just like the progression of evil in society? And is AI actually accelerating anything, or is it just making it a little, like...
cleansing their conscience a little bit, because it's not some human who has to push the electric shock button, but it's like a machine that decides to push it. Right. You're like a few steps removed, as a programmer who's just building the doomsday machine, as opposed to somebody who's actually going out and executing people directly on a firing line. You know? So, yeah.
Yeah, go ahead. Like, I mean, people are making exactly that counter argument. So for instance, there's research showing that anyone who makes a decision in the judiciary, or, like, in a government office, is not free of biases, right? They will still make decisions based on bias, sometimes unfair decisions, sometimes incorrect decisions.
And now the objective for AI, some people argue, is not to be perfect, but just simply to be better or less biased or less incorrect than the human baseline. I think to some extent that might be possible, but it's very difficult, right? I feel like in academia and in research, we sort of understand the nuances of like,
hey, how do you measure these things? How do you measure fairness and bias and accuracy? But really the worry is like, in the real world, how often do people care about these? How often do people get it right? So definitely there's a way to do it right, I think, but it's very difficult. Yeah, and I just want to be clear that I think...
These companies, these giant, you know, entities that have lots of money, they should err on the side of trusting people that are filing unemployment claims. People that, like, I understand. I just watched this great movie from the 1940s called Double Indemnity.
And it's all about, like, insurance scamming, basically. But it's from the 1940s. It's like this film noir movie. Very, very good movie, if you want to watch a really old movie, practically public domain. But yeah, I understand that there are people who are going to try to cheat and, like, fudge numbers. And, you know, there's this great movie called Blue Collar that I watched, from the seventies, where the main character,
played by Richard Pryor, he's pretending he has six kids instead of three so that he can get additional child benefits in his tax deductions and stuff like that. And that's like a major plot point. And like, I understand, but these are often extremely destitute people. And I feel like we just need to fundamentally figure out how we don't have to put them in situations where they might feel pressured. And also, you know, it's kind of like that old principle, you know, that it's much better to
free like 10 guilty men than to falsely execute one who was innocent or something like that. Right. I mean, it goes back to the foundations of kind of modern or like Western philosophical thought, you know, uh,
So my thinking is a lot of this stuff is human problems, and AI is merely the weapon that the human is pointing at the destitute person, as opposed to... yeah. So I definitely understand that that is a huge drawback: it's going to give people the illusion of a clear conscience when they're just as complicit in denying people sustenance and things like that.
That is a really big issue to get into. And it's beyond the scope of this podcast, but I did want to weigh in and give what you said proper gravity, because that is very dangerous, especially with insurance companies, especially with the area that hits home
most for me, as a teacher: seeing these people who were mislabeled as plagiarists by some AI that can't even do it accurately. People are out there selling tools that purport to be able to detect plagiarism but are very bad at it. And you can copy something from GPT, open a new instance of GPT, you can paste it in and you can say, hey, GPT, most sophisticated LLM in the world, do you think this was generated by an LLM? And it may not know.
Do you know anything about this? Yeah, most of the tools... There are tools like GPTZero that do LLM detection. They're mostly not great. There are a few tools that, yeah, I would say they're mostly not great right now. It's mostly an unsolved problem. So any startup that's like, I have a perfect LLM detection tool, is probably lying.
Yeah. But there is some CIO on a golf course with one of their salespeople right now. I guarantee you. Yeah. School systems are buying plagiarism detection, and that is just the progression of evil. That is just –
Yeah.
I think it's just kind of a continuation of horrors, like these plagiarism tracking software tools, Turnitin and stuff like that. They've been around for decades at this point, and they haven't always been accurate, and there's been a high number of false positives. And I think it's going to get worse because there's this false confidence that these tools work, you know, better than they actually do. Sorry, I don't mean to monopolize the conversation. I want to be interviewing you, but I just felt like pointing that out. Yeah. Yeah.
I mean, I think I'm scared of a lot of the same things as well.
Like, I mean, earlier we were talking about, you know, the large amount of investment in AI, and a lot of this is like C-suite executives not really understanding what the technology can do and throwing money at it, right? And you mentioned, you know, this doesn't feel like a technology problem. It feels like a human problem. And that's exactly right. To do my spiel, the organization I work for is called the Stanford Institute for Human-Centered Artificial Intelligence.
And a lot of the researchers who work at the center or collaborate with us are from other disciplines like law or policy or economics or physics.
business or other fields. And a lot of it is exactly because there are these thorny ethical questions that you really need their expertise for. Some of these are fundamental Western philosophy questions, right? Like, what's ethical? What's an ethical model? Should models be ethical? So
Yeah, I think there's a lot of interesting work there, maybe of less interest to programmers. But if anyone's interested in reading on the ethics and philosophy of AI, there's really a rich, rich field that's of extreme relevance right now. Is there any text or article, something that is relatively layperson accessible and doesn't have a whole bunch of citations everywhere? Something that you would recommend to people to get started?
I mean, I can pitch my employer's blog. So the human-centered AI institute has a blog; it's called Stanford HAI. That has a pretty good roundup of a lot of these societal issues. As for a book, I don't think there's a specific one that I recommend yet. There are a bunch coming out that have been published recently, but I have not had the time to look at them, unfortunately.
Awesome. Well, if you're listening to this after the publication date and Yifan has reached out to me with a book recommendation, I'm going to include it below.
We've covered so much ground. It's been an absolute blast talking with you. I feel like we could talk for hours, but I want to respect your time. You're a busy person out there getting things done, researching these models, not just mindlessly kind of, oh, this is a higher value than this. You're actually thinking about the implications of these. It's clear that you are among those who think and feel and that you do care. You have a
a big stake in the future of what society is going to be like with these models running around. Do you have any closing thoughts, or things that you would encourage listeners to think about? Especially for freeCodeCamp listeners. I feel like, you know, earlier we were talking about the job displacement thing. I feel like there's this narrative of, you know, LLMs replacing jobs, replacing programmers in particular. I feel like in terms of,
especially in terms of freeCodeCamp students, I would say,
Even if you want to get into AI, there's always a value in having good software fundamentals, software engineering fundamentals, programming fundamentals, really understanding the foundations of AI, which include things like probability and statistics. And I think with those fundamentals, that will carry you very far, especially because when I first started,
So when I first joined Stanford as a research engineer, and I had to pick up a lot of knowledge about LLMs, because when I was in school doing my bachelor's and master's, LLMs didn't exist. I had to pick up this technology. But because I already had the fundamentals, I could pick it up fairly quickly and be like, oh, here's how this new technology relates to the knowledge I already have.
So I would say like, honestly, having good software engineering fundamentals, being able to understand problems, being able to work with other people well, these are all skills that are going to be the most important no matter like what technology comes next. Yeah, 100%. I really appreciate everything that you've shared here. And I just want to double, double, triple, quadruple endorse this.
What you just said. When in doubt, go back to the fundamentals. Learn them. I like to say that everybody on Star Trek, when you see Geordi working in the engine room and stuff, they understand the stack. They understand what's going on in the ship and how all these different systems work. They spent the time to learn it. It's not like...
they're not just grasping in the dark; like, education will have progressed dramatically by then. They'll be able to hold a whole lot more facts and understanding, models in their mind of how reality works. And they will be able to just... man. Like I always like to point out, every few years they have to kind of change the criteria of the IQ tests. Not that I put a lot of stock in IQ tests, but because people just keep getting smarter and smarter and smarter with every generation, or every, you know, few years.
Uh, and that's going to accelerate as we have access to more information as we're like walking around listening to podcasts like this, maybe even a double speed, uh, you know, all day while we're getting things done and just like our information diet gets more and more enriched. We become, you know, more deep thinkers and we, we think on different levels and we don't just have like our little biases and our misinformation that we carry, but we progress as human beings. Uh, and I think as that happens, uh,
you know, human society is just going to become more and more competent and more and more curious. And like this virtue is going to continue to pick up and we're going to use these tools to continue to extend our own understanding of the world and our own intelligence. And I think that is the profound goodness in the world is that,
people feeling curiosity, people following it, and not having that intellectual poverty. Like, there was a time when it was estimated that the typical, you know, American out on the frontier living in a log cabin, the amount of information they might encounter in their entire life would be the equivalent of, like, one day's, you know, Wall Street Journal.
And think about all the information you have access to now and all the ways that you can consume it while you're just chilling, you're eating your popcorn, relaxing on the couch, and you're learning about, you know, AI, machine learning, stuff like that, whatever topic you're interested in learning. And so I just encourage you to continue to build out your skills the way Yifan has been working very hard to. We didn't even go into your background, but I'm sure it's exceptional and interesting, because you don't get onto, like, a
machine learning team at Google, or get to Stanford working as a researcher, just by accident. I'm sure you worked very hard to get where you are. And at some point I would like to have you back on, and we can go more into that. But I just want to leave everybody with the thought of, you know, no matter what happens, fortune favors the prepared, and you can continue to get more and more prepared by building up your fundamental skills, precisely as Yifan encouraged you to do. Thank you so much for coming on the show, man. And with that,
I wish you all a fantastic week. Until next week, happy coding.