Hello and welcome to Skynet Today's Let's Talk AI podcast, where you can hear from AI researchers about what's actually going on with AI and what's just clickbait headlines. We release weekly AI news coverage and also occasional interviews such as today's. I'm Sharon, a fourth-year PhD student in the machine learning group advised by Andrew Ng and the host of this episode.
So in this special interview episode, we'll get to hear from several of the authors of the recent paper, Measurement in AI Policy: Opportunities and Challenges. That is Jack Clark and Ray Perrault. Jack and Ray are the co-chairs of the steering committee of the AI Index project under the Institute for Human-Centered AI, or HAI, at Stanford University. Thank you so much, Jack and Ray, for joining us on this episode. Glad to be here. Thank you.
Thank you. Yeah, it's great to appear here. Awesome. So our focus here will be your article, Measurement in AI Policy: Opportunities and Challenges, which came out just this last month, based on the AI Index workshop held nearly a year ago at Stanford's HAI.
And I'm particularly drawn to this work because it's about evaluation, and that's the focus of my dissertation, which has got me thinking about just how much that encompasses, and importantly, how evaluation defines progress in a field, determines safety of use, etc. And so before we dive into any details, how about I let both of you provide a quick high-level summary of what the article was about and its conclusions?
Yes, I'll give you the example I use for why this work is helpful. I think that characterizing AI systems is somewhat like characterizing an object that's hidden under some kind of thick cloth. And everyone who's touching the object at a different point will have a different interpretation of what that object is and what that object does.
Now, the purpose of this workshop was to gather tens and tens of the world's experts in the development and measurement of AI systems and try to see if we could collectively talk about all the different ways we try to measure this same object, which is this sort of AI technology, and what we can learn from that about how to
create holistic measurement schemes that tell researchers, policymakers, and the general public what they need to know. That's going to be a very, very challenging and long-term problem. And what this paper tries to do is just lay out many of the issues that all of these independent experts identify as being salient traits to measure.
Ray, do you have anything to add to that? Yeah, I mean, that's certainly the right picture. Maybe going down one more level of detail, I see this as there being roughly two axes to the problem. The first one is simply what is AI? That is, what counts as AI and what doesn't?
And that was, I think, very well put by Jack about this thing under the blanket. It's very hard to figure out what the thing is. And what the thing is matters a lot to the second axis, which is how do you measure everything that is about AI? Where the scientific work is being done, who's ahead, where the investments are coming from,
where the most important work is happening, and so on. You can't do all these other measurements unless you have some agreement as to what it is that they're supposed to apply to. And so that's the big picture.
I imagine it can be pretty challenging to define something like AI, especially as it gets hyped up in various media as well. So going off of that a little bit, in the paper you do say we should truly describe developments of a given domain, and in so doing it's important not to filter out information that is closely related to, but not directly captured by, our definition.
Ray, have you seen examples of this? Well, I think the point that Maria Klein, the author of that particular paper, was trying to get at is that it's
tempting to try to solve the problem of what's AI by making a list of, say, keywords and then going and looking for those keywords in scientific publications, in patents, in descriptions of new companies, whatever. And when you find matches to these keywords, you're done.
And I think what Maria was saying was that that's actually not good enough, in part because of the difficulty of delineating the field as a whole; keywords tend not to work very well.
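To make that concrete, here is a minimal sketch of the keyword-matching approach being criticized. The keyword list, the helper function, and the example abstracts are all hypothetical; this is not the methodology of the paper or of the AI Index.

```python
# A toy illustration of keyword-based tagging of publications as "AI".
# Keywords and abstracts are invented purely for illustration.
AI_KEYWORDS = {"neural network", "deep learning", "reinforcement learning",
               "machine learning", "computer vision"}

def looks_like_ai(abstract: str) -> bool:
    """Return True if any keyword appears in the abstract (case-insensitive)."""
    text = abstract.lower()
    return any(keyword in text for keyword in AI_KEYWORDS)

# Both failure modes from the discussion show up immediately:
# 1. An applied medical-imaging paper that avoids the buzzwords is missed.
print(looks_like_ai("Automated detection of pneumothorax on chest radiographs "
                    "using a convolutional model trained on 100,000 studies."))  # False
# 2. A statistics paper that merely mentions a buzzword is swept in.
print(looks_like_ai("A note on shrinkage estimators, with a brief comparison "
                    "to machine learning baselines."))                            # True
```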
And there are, I think, two interesting ways of looking at that part of the problem. One is that there's this difference between the development of new fundamental techniques in AI and their application in a particular field. So if you develop good AI
object recognition techniques, that certainly counts as AI. But then when you start applying them to, say, medical imaging, and papers reporting the results of what is by now a standard technique in medical imaging appear in The Lancet or PNAS or some non-AI journal,
is that an AI publication anymore? And clearly, you'd want to be able to say that, yes, there is AI in there, and it demonstrates the application of AI to something important and maybe the role of AI in the creation of a new industry. But it's not necessarily the development of new algorithms.
On the other side, AI borrows a lot of concepts from other disciplines, statistics, optimization, linguistics. There are many, many of them. And I think one of the things we also have to be careful about in doing these measurements is that AI doesn't end up kind of looking as if it's taking over these other fields.
Because, you know, statistics is statistics, and my statistician friends would not like me to think that they're all AI people. So that's another part of the challenge. It's really the same challenge; there it's statistics applied within AI, and the first thing I talked about was AI applied to healthcare.
Right. And what's kind of interesting also is, so I do work in medicine and AI, I suppose, and sometimes it is working in an applied area that gives rise to core AI questions. And sometimes,
I think arguably, both computer vision and NLP were viewed as applications at some point, and now more and more they're viewed as core AI as opposed to just AI applied to a certain domain. So I find defining AI really, really interesting, and what the extent of it is. And maybe all of us are just doing philosophy, because that's where it all started. I don't know. Yeah.
All right. Okay.
You really would like it if you could define AI as a specific doohickey that was or wasn't in an application. So you look at something in medicine and you just check, does it have AI in it? And it's a binary check. And if so, you go to a new regulatory regime. Unfortunately, that's impossible.
Like AI is massively broad, as Ray mentioned, it bleeds over into all of these different fields. And I think one of the challenges that we're going to face in the coming years is policymakers want to define AI so they can regulate and constrain it. But AI as a technology has an extremely broad and like liminal border with all of these other fields. So it's not an easy thing to define in that sense.
Definitely. Perhaps we can define it as what it is not. I don't know. Makes me think of GDPR and privacy. Well, going off of definition of AI, I want to also touch on metrics and how to evaluate AI. So given a definition, maybe some working definition that has fuzzy boundaries,
There are definitely tradeoffs, and you mentioned this in the paper, between focusing on a single metric to measure model performance versus diverse metrics to evaluate the capabilities of a system. Jack, do you want to expand on that? Sure.
Yes. So in recent years, we've seen the transition of measurement from single tests to suites that contain a multitude of tests. A really good example in NLP is something called GLUE, which was a set of something on the order of nine different types of natural language understanding and reasoning tests that you'd run your systems through. Now, as some people listening to this may be aware,
GLUE was insufficiently ambitious. It came out, and then very quickly a range of new systems, I think also including GPT-2 at the time, came out and scored quite highly on it, which led the team at NYU that developed GLUE to develop SuperGLUE, an even larger testing suite. And that's because they're dealing with this problem of AI systems
saturating individual benchmarks, and even starting to saturate multi-test benchmarks. So the era we're moving into feels like one where it's less important to have a single perfect test, and more important to have tested in a variety of different ways using a variety of different methodologies.
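To make the suite idea concrete, here is a minimal sketch of GLUE-style aggregation over multiple tasks. The task names and accuracy numbers are invented, and this is not the actual GLUE or SuperGLUE harness.

```python
# A toy illustration of suite-style evaluation: each model gets a vector of
# per-task scores plus one headline average. All numbers are made up.
from statistics import mean

SCORES = {
    "model_a": {"sentiment": 0.95, "paraphrase": 0.91, "entailment": 0.70, "coreference": 0.62},
    "model_b": {"sentiment": 0.96, "paraphrase": 0.93, "entailment": 0.88, "coreference": 0.85},
}

for name, per_task in SCORES.items():
    # The single averaged number is what "saturates": a model can max out the
    # easy tasks while the per-task breakdown still shows where it lags.
    print(f"{name}: average={mean(per_task.values()):.3f}  per-task={per_task}")
```

The per-task breakdown is what lets a suite keep discriminating between systems after the headline average stops moving.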
One slight thing I'll note, which feels really unusual, is we've also entered this era where human baselines have gone from being a highly aspirational thing that you're trying to get your systems close to, to a thing that you can expect contemporary systems to actually
match, in some narrowly defined sense. Which feels very weird, because if you are now, in some narrow sense, better than a human, it's hard for us as humans to develop more advanced testing methodologies.
True. And also human level has been redefined multiple times. Yeah, I mean, as we all know, the baseline for ImageNet human level is actually Andrej Karpathy sitting and testing himself. And that has become the AI sector's baseline for human performance on image recognition. And no offense to Andrej, but he's probably not representative of
all of humanity in this sense. I think he knows more about dog species than I do. I can have him speak for me as a human in dog species classification.
All right. So another concern that I'd love to touch on is the AI brain drain from academia that you mentioned in your article. And I'm very curious about this as someone in academia who's seen
brains get drained to large private sector companies. And Zhao Jin looked at the movement of AI professors and faculty in the U.S. and found that this phenomenon has been growing exponentially since 2010 and would reduce the number of future AI entrepreneurs. I was curious why that was. So I don't have the actual numbers in my head, but clearly
The deployment of AI in industry has led the commercial sector to need more talent. And of course, the best people are often in universities and they're reasonably but not outlandishly paid in universities.
So it's attractive for them to get these offers to go to industry at much higher salaries.
And obviously, this means fewer people, and fewer top-notch people, to train the coming generation. And so my guess, since I'm not in a university now and not living this phenomenon on a day-to-day basis, is that it's leading universities to make
adjustments as to how they're willing to have their staff spend their time. It used to be, of course, that academics could spend a day a week or something like that working on outside projects. That's no longer realistic.
But then I think we're also seeing cases where academics are simply leaving universities altogether because they can move to a full-time job that has satisfaction in a different way.
But it is dangerous, and I honestly don't know how this can be dealt with. We can't fix the problem by saying that all you need to get a PhD is three years, although I know a lot of graduate students would love that. But it's a big challenge.
I may or may not be trying to do mine in three and a half, so we'll see. That does make me think: is this an inverted pyramid kind of population that you see happening then? There still are a lot of students coming in, but is it just that they won't get the same amount of training, and then everyone would have to essentially get their training through Coursera? Education is an industry, and I use industry in
the economic sense, in which there hasn't been a lot of improvement in productivity, because it's essentially human to human. And so one of the things this might do is force
the education system to become more efficient in how it delivers its product. This, of course, is not a new idea, but
You sense that this kind of pressure may be leading in that direction. Now, the other aspect, of course, is that this is not in every discipline in education. It's, you know, we're now talking about AI. We're not even talking about all of computer science. So we'll have to see what happens. Yeah.
All right. Thank you. Shifting gears a little bit towards ethics, I do want to talk a bit about that, or more than a bit, hopefully. So in the paper, you do say that despite researchers' growing interest in exploring the ethical dimensions of their fields, the topic is absent from main conference proceedings and relegated to small workshops.
My question is, why do you think this is? I've definitely observed this as well; it's a very true statement. And is it because you think it's hard to incorporate, or researchers don't know how, or it's still not top of mind? Jack, do you want to take this? Yeah, there are a couple of things that seem relevant. One is that the incentive structure of academia doesn't seem to bias people towards the kinds of interdisciplinary work
that let you broadly explore the ethical impacts of the technology. I'll speak to technical researchers and they'll tell me that their advisors are telling them, you should get these very specialized technical achievement papers done, and that other, maybe policy-relevant or ethics-relevant stuff you do is nice, but it's not the thing that's going to get you tenure. And I think that this is somewhat true. Additionally,
It requires more diversity, I believe, in the teams that are sort of doing this sort of work. So speaking for myself at OpenAI, you know, I and members of the policy team that I lead did a lot of work with our technical colleagues on the GPT-3 paper where we were running machine learning experiments on bias and on other things in partnership with them.
I guess my observation is that we're lucky because we have resources to hire people with more of an ethics specialism, who have been trained in the technology, who can team up with the technical people. I think technical people, if they're just expected to do this on their own, are not going to feel very well placed to, and are not in an environment that tells them, like, you should absolutely devote time to getting good at this. Yeah.
I wonder if, Sharon, if you have stuff to add here from your own experience as well. I'm curious to hear from you about how you've experienced this also. Yep. So that definitely concurs with kind of the sentiment in PhD programs that I'm familiar with and have friends in, as well as my own experience. Definitely for a thesis to be valid, you need...
certain pieces of work to fit a narrow topic, and your reading committee will really want that topic to be really narrow. And I definitely have work outside of it; in fact, I think most of my work is outside of that narrow scope. But I think I've been very fortunate to be allowed that freedom, and it has definitely been looked down upon by some folks. So I think
it also has been appreciated by folks when I do work on something for social good; I love working on applications to climate, to medicine, and most recently to the Black Lives Matter movement, which I just felt compelled to work on. And they're not necessarily publications, or sometimes they are, but not publications in machine learning venues. And when we do it for our medical
school faculty collaborators, it would make a much bigger difference to publish in their venues, for example. It's still within the research realm, but a very different flavor. And yeah, I really hope it will transform. I do suspect that the transformation, based on what I've seen at Stanford,
might actually happen at the junior levels first, then the more senior levels when it comes to getting tenure. Because I'm starting to see, in the PhD program, people proposing different things, like having a service element be a requirement of the Stanford PhD. That's just being proposed; it's not confirmed yet.
But things like that are starting to get at: hey, ethics is actually not a separate thing. It should be interleaved with your work, and you are responsible as an engineer, as a researcher, to be thinking about this, and you're responsible for the work that you do put out there. And it does make me think, and I don't know if both of you have thoughts on this,
that I see it as a pro and a con that Google talks about offering ethics as a service, essentially. And I thought, oh, wow, that's great on the one hand for the people who are really not thinking about this, who really should incorporate it some way and need to incorporate it easily. But on the other hand, it's like, should we really be decoupling it like that? Should this really be a
separate entity and something that you don't have to think about, that you could just throw in as a service and just pay for? I have a controversial opinion here, but maybe you can weigh in first, Ray. Otherwise, I'll give it to you. I love it. I love it.
I'm not very controversial, so I'm going to defer to you on that. I have a few things on this. One is that anything that affects people's careers has to be resolved
kind of across the board, right? So for a university to decide that they're going to give PhDs in which, say, half the work is ethics and half of it is more technical work, expecting their graduates to then find jobs in schools that pay more attention to the
technical work and less to the ethics stuff, is not going to work. All of it has to change at once. And there are lots of examples of that. One of my favorites is the whole publication regime in AI and
computer science, and the kind of dominance of conferences. But you can't change some of these things unless everybody agrees, or gradually comes to agree, that they should change.
And then the other thing about ethics as a service: it reminds me a bit of cybersecurity as a service. You can't build a secure system unless you build security into it from the beginning. It has to be a design consideration, not something that you slap onto it at the end.
And I suspect that's going to be the case for ethics as well. It has to run through and through the project.
There are so many cybersecurity startups. I'm kidding. Okay. I think the habit-forming element of ethics, similar to security, feels pretty fundamental. Along with doing objective amounts of work on it in an organization, you're also inherently creating social links between people that do ethics and those that don't, which leads to people talking to each other. Even
during the pandemic, I've had technical colleagues where we've gotten on a Zoom call and talked about some gnarly ethics issue. And I think if you're just buying this as a service, it's going to be a lot more difficult to have those sorts of conversations. So your culture isn't going to really update in the direction of ethics mattering, because ethics is just a commodity that you purchase, one that doesn't have human relationships and human people on the other side of it. On my sort of controversial opinion:
I think that because AI has only recently started to have massive, massive applications, people are obviously very concerned with the ethics of it, especially younger people who are conscious of their agency with regard to this technology and the effects it can have, partly because they're extremely online, like many of us. But
Fundamentally, governments have underinvested in this entire area. And one of the reasons why the AI policy space is so confusing is that you kind of go into it assuming that there's more work going on in government more generally than there is.
And this is because I think AI has emerged so quickly that the relatively slow-moving infrastructures in government haven't updated towards it. I mean, I follow this in the U.S. context where I just track budgets and we have discussion in the paper about tracking budgets. But what you learn from this is that
You know, many technologists assume that there's some well-equipped regulator to help. And this is not the case in most of the West these days due to underinvestment. So I think that...
This is kind of lurking in the background and is one of the reasons why I feel like it's a matter of urgency to get more ethics integrated into sort of undergrad and graduate level education about AI, because these people are going to need to do it themselves. They aren't going to have as many government partners, at least for the next few years, because government needs to catch up.
I actually don't find that so controversial. I have an even potentially spicier comment. Okay, this is spicy and this is recorded, so I'm aware. I think that, almost like an evaluation metric, so to speak, for getting people
to understand and care about ethics, I want to see more people in AI research who like humans more than AI. Okay, that was slightly spicy, and it does suggest that I think most people don't. I was going to say, this is a very controversial statement, and people outside of AI find it quite funny to hear. Yeah.
Yes. It does resonate, yeah. I think that, as we've said, one of the issues of measurement is that as you get to these sorts of socio-technical systems, you need to measure aspects like impact, interaction with humans, how they affect human behavior. And these things all feel pretty important and kind of understudied within the core AI community. Some of this is changing.
Right, right. And there is, unfortunately, some armchair caring about ethics going on, and I think you do mention this in your paper. So an example of
essentially a failure in this area is the recent revelation involving Sidewalk Labs, which is an Alphabet subsidiary. And I just want to say, I have friends there, they're awesome, they're doing great work, but they also have messed up, like many groups have. They held an Indigenous consultation workshop,
and zero of the 14 recommendations that resulted from this consultation workshop were included in the final 1,500-page report. And this really upset the Indigenous people they were consulting with, because they essentially realized, oh, you're just checking a box. And it wasted their time. Yes, exactly. I mean, I think...
I want to come back to this idea of teams that focus on ethics and policy being hybrid teams that have technical capacity native to the team rather than outside of it. And I'll just talk in the OpenAI example: everyone at OpenAI basically gets on with each other; we're a small enough organization that we all kind of know each other. But it's still
complex to go and ask another technical team to do a certain type of ethics evaluation or research. But if you turn up to that team with a graph where you've analyzed something, like how susceptible humans are to news, or aspects of the display of biases in your models, then it becomes very unambiguous that you should all work on it, because you've turned it into
a technical artifact. And I'm not saying that to imply my teammates can't understand ethics if it isn't in the form of a technical thing. But I do notice that if you take the cognitive effort out for other people at the org needing to think about it, it gets really easy to weave that into actual changes in policy and changes in behavior.
And so my speculation about the ethics-washing stuff is that these are frequently done as public relations exercises by people who don't then have the capacity to turn any of the insights into things that can motivate the engineers who frequently make a lot of the most consequential decisions about the product.
Yeah, it sounds like you're meeting each other halfway, and in this case, you're speaking their language a little bit. Yeah, exactly. It makes it safe in a sense, because you're like, we've got all of this mushy stuff that's hard to think about and controversial and kind of complex, but we've converted it down into some of these things that look exactly like the day-to-day materials that we work with here. It's easier to discuss it without feeling like you're discussing
a big thing, if that makes sense. Right, right. You're not asking them to meditate on Aristotle; these are concrete things. And I actually see that as a really great direction. And I see that a lot in my interdisciplinary work, not necessarily with folks who are steeped in policy or ethics, but in medicine, meeting each other halfway.
Wow, some of the doctors have really more than come halfway; they are writing code. And I've started to be able to detect cancer and pneumonia. It's weird. I'm not that good; please don't
let me do that for real, only for noisy training data. Yeah, so I think meeting each other halfway and engaging, being open to those kinds of discussions and learning, is really important. So how do we...
How do we copy Jacks into other organizations? I mean, one suggestion is to go and work at them. And I just want to highlight a few things. Many people are familiar with this, but over at Google there's Timnit Gebru, Margaret Mitchell, and many others; I'm just going to mention those two because they've both done such prominent work. And
they are effecting a lot of change from inside the organization, because they're technical and they do technical research, and they also frame it with an ethical lens and do a lot of collaboration. I think that there are lots of examples like that, and
The advice I give to people who ask me how to do this is I kind of say, you need to be willing to be a bit of a maniac in the organization that hires you. And you need to be willing to sort of push a bit because it's not going to be obvious to the org how to do it. But you can usually sell the org on it by noting that
the way that media works is changing, and the way that power works in media is shifting from journalists to AI researchers themselves, who are driving a lot of the discourse. So probably the best way to avoid bad PR is to do really, really detailed technical work on ethics now, because otherwise they'll catch you out. So it's more of a negative incentive, but that can work very well on large companies. I mean, there are disasters out there that can happen if you're not careful about these things, right? Yeah.
Yeah. Do you want to speculate, Ray, about something that happened? I'd love to hear some of these. What was that startup that kind of got completely destroyed because they wanted to classify gender? The one that did gender from text. Yeah, that was a bad idea. Self-destruction. Okay, other speculations?
All right. Shifting gears onto another topic that maybe Ray can speak to. You do mention that researchers should use multiple sources of data to cross-reference and to analyze government spending on AI. And I'm very interested in
hearing about how to characterize and measure government spending or funding in this space, not just by the US government but by governments around the world, as they get more and more interested in this area that is poorly defined, I suppose. How do we measure their AI investment? It's a conundrum in the US, and I think it's a conundrum essentially everywhere else.
The first reason why it's a conundrum, of course, is the underlying what's-AI question. So one
relevant example here is the National Security Commission on AI, which in its last set of recommendations proposes that the U.S. government should spend roughly, I think it was $2 billion, over some period of time
supporting AI, and half of that is clearly on what anybody on this call would call AI. But the other half, and not unreasonably so, is to support the development of high-performance computing,
with the argument being that all this newfangled AI stuff requires a lot of computing, and we need to improve that as well if it's going to be deployed properly.
So then the question is, well, if you were looking at a set of spending figures and it included a billion dollars on high-performance computing, would you call that spending on AI? Right.
And you can argue it one way or the other, but if you don't make it clear, whatever comes out of it is going to be pretty misleading.
So I think the lesson that comes out of that is that simply saying we spent X on AI isn't a good enough answer. You have to break it into pieces and make sure that
analysts can then choose the pieces they think are relevant to their view of AI and include them, and not necessarily everything else, and try to avoid double counting.
Because the other thing that comes out of this is, say, that billion dollars on high-performance computing: it ends up getting counted by the AI people because it comes out of that pot, but the high-performance computing people look at it and say, oh, it belongs to us, and they put it in their pot. And then you add the pots, and somehow it doesn't add up. So double counting is an issue to look at in all of this.
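As a toy illustration of that double-counting problem, here is a minimal sketch with invented budget line items; the names, tags, and amounts are entirely hypothetical.

```python
# Hypothetical budget line items, each tagged with every program that claims it.
line_items = [
    {"name": "ML research grants",     "amount": 1.0, "tags": {"AI"}},
    {"name": "Exascale supercomputer", "amount": 1.0, "tags": {"AI", "HPC"}},
    {"name": "Classical HPC codes",    "amount": 0.5, "tags": {"HPC"}},
]

# Naive per-program totals: the shared supercomputer is claimed by both pots.
naive_pots = {}
for item in line_items:
    for tag in item["tags"]:
        naive_pots[tag] = naive_pots.get(tag, 0.0) + item["amount"]

print(naive_pots)                             # {'AI': 2.0, 'HPC': 1.5}
print(sum(naive_pots.values()))               # 3.5, the pots added together
print(sum(i["amount"] for i in line_items))   # 2.5, what was actually spent
```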
But then the other part is the sources of information. The good thing about publications worldwide is that they mostly show up in single databases like Google Scholar or Microsoft Academic or whatever. It doesn't matter where they come from; they're going to be in there and you can find them. But grant awards, or awards
under a DARPA project, don't show up in single places. They are in quite specific, and in this case country-specific, databases. And if you're going to do any comparisons, you really have to go through and find where these are around the world.
And nobody's done that. And then you have to slice and dice them as to what's relevant to AI and what isn't. So I think we're a long way from that, and the surrogates now are mostly budget numbers.
And these are tricky, because for one thing they cover a multitude of sins, and that money may not be spent the way it was intended to be spent. So you really need to look at where the money was actually spent, not where some budget document said it should be spent.
And that, you know, that also complicates the picture. So there's a massive effort that needs to be put into getting reliable apples to apples figures in this area. Right. Wow. So money being spent where it wasn't intended to be spent. Right.
What a surprise. No, I'm kidding. Welcome to big government budget language. Yes, precisely what I was thinking. Quick question about the high-performance computing: I instantly think of GPUs, NVIDIA, but I imagine that could also take on a much broader category that doesn't overlap with AI. And there are good purposes for that. Yeah.
Play your video games; I mean, what better purpose is there than that? But they're multi-purpose. These are all multi-purpose technologies, and slicing and dicing across them has to be done, but it's not easy.
And it feels like, yeah, a supercomputer for AI, some of these are going to get built, and one could argue that a TPU pod from Google is a supercomputer for AI. But that's very different, as Ray notes, from the sort of supercomputers that are traditionally fielded by a lot of governments, because
if what I'm doing is simulating the nuclear weapons stockpile and its degradation, which, as a reminder, is what the majority of supercomputers are used for in the US and by other nuclear-armed nations, I need kind of a different system right now, because we don't yet do massive deep neural function approximation for that entire domain. We do a lot of very, very specific structured programming which requires linear execution rather than massive parallelization.
And so they look different. And it's kind of challenging to talk to governments about this, because governments sort of hear compute and then they're like, here's HPC. And getting beyond that is again something where
I think if you look at things like DAWNBench or MLPerf, some of these measurement initiatives supported by universities and companies, that's going to help create the sorts of tests which governments can eventually run on their own HPC hardware. And those tests will help government get to a truer understanding of things.
Yes, definitely. And was there anything else you wanted to touch on from the paper? Otherwise, I know we chatted about stepping back a bit and talking about performance metrics more broadly and the future of these. I think we'd just like to say that, though we're the ones talking to you, this paper is only possible because more than 100 people came to Stanford. Exactly.
And spent time giving presentations. And so anything that we've said here, really, we're representing the view of another expert who attended. I just wanted to give credit there, because it was a huge effort by so many people. And also Saurabh Mishra, who works on the Index with Ray and me and is on the paper; he did a significant amount of work here and should get credit also.
Well, amazing work, all of you. And also for just generally summarizing and codifying all of that, because I know things can be kind of lost on the internet as recorded sessions. Yeah, with imperfect subtitles. Work in progress.
And so what is on both of your minds when it comes to the future of defining AI, evaluating it, and quantifying it in some way that can be useful for a broader audience than just our community? One of the big things, and I alluded to it earlier with the human baselines stuff, is that if you look at the quality of synthetically generated images now, we're definitely in the domain where
the existing measurements we use, like Fréchet Inception Distance or whatever, just don't really correlate with what we as humans intuitively recognize in images generated by successive systems. We know that there's something more advanced going on, but our measures are not actually showing us that today. And I think that this is also the case with
models like GPT-2 to GPT-3, where the text becomes a lot more coherent to a subjective qualitative interpretation, but many of the tests we'll do might not show as big a difference as the one that subjectively appears. So something we're dealing with at the AI Index is how do we contextualize these advances? One idea is just to show pictures from different generative models over time, show text samples.
But surely there must be a more sophisticated way we can do that, and that's a challenge I'm thinking about.
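For reference, Fréchet Inception Distance compares the mean and covariance of Inception-network features extracted from real and generated images. Here is a minimal sketch of the core formula; the random arrays stand in for actual feature extractions, purely so the snippet is self-contained.

```python
# A minimal sketch of the Fréchet Inception Distance (FID) formula:
#   FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 * sqrt(C_r @ C_f))
# In practice the feature matrices come from an Inception network run over
# real and generated images; here they are random stand-ins.
import numpy as np
from scipy.linalg import sqrtm

def fid(features_real: np.ndarray, features_fake: np.ndarray) -> float:
    mu_r, mu_f = features_real.mean(axis=0), features_fake.mean(axis=0)
    cov_r = np.cov(features_real, rowvar=False)
    cov_f = np.cov(features_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 64))   # stand-in "Inception features"
fake = rng.normal(0.3, 1.2, size=(500, 64))
print(fid(real, fake))  # lower is better; two identical distributions score near 0
```

As Jack notes, a score like this can keep shrinking without tracking the perceptual jump a human sees between one generation of models and the next.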
Yes, that's a huge challenge, and it actually forms the basis of my dissertation. If you could hurry up, that would be fantastic for me. Well, one of the publications already out is called HYPE. But basically the premise of that was using crowdsourcing, and just humans directly, as a gold standard. But what's fascinating, and is touched on really briefly in there, is this learning effect that people have.
If you have never seen a generated image, you probably will think it looks pretty realistic, actually. But if you start to look at more and more of them, you start to pick up on
odd intricacies and, you know, telltale signs. And so it's like, okay, great, we can crowdsource this, but then who is qualified to evaluate this? What is the ecological validity of this? Sorry, that's a technical term: what context will you see this in? Will you see it on a billboard as you drive by really fast, or will you see it while you're staring it in the face on a Hollywood screen?
Where are you seeing this, and in what context? And so I think that definitely prompts the thought that we as humans evolve too, as we look at certain media and certain content. So that is why, I guess, that definition of human level changes.
But yeah, it's not just across all humans that human level changes. It changes across society, but also within an individual; I've seen changes as well. Even a doctor who's labeling images for us will actually
disagree with herself on segmentations of cancer. And I just thought, oh, that's both really frightening and really interesting. And yeah, context, I guess, context and time and what you've seen before really matters. I mean, it's training as well.
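Since HYPE came up: one way to turn crowd judgments into a single number is to measure how often raters misclassify real versus generated images, which is roughly the spirit of the HYPE-infinity score. This is a hedged sketch with invented judgment records, not the actual HYPE protocol or code.

```python
# A rough sketch of a HYPE-style human realism score: the rate at which raters
# misclassify images (fakes judged real plus reals judged fake). The records
# below are invented purely for illustration.
judgments = [
    {"is_fake": True,  "rated_real": True},    # rater fooled by a generated image
    {"is_fake": True,  "rated_real": False},   # rater correctly caught a fake
    {"is_fake": False, "rated_real": True},    # rater correctly recognized a real photo
    {"is_fake": False, "rated_real": False},   # rater mistook a real photo for a fake
]

errors = sum(1 for j in judgments if j["rated_real"] == j["is_fake"])
error_rate = errors / len(judgments)
print(f"Human error rate: {error_rate:.0%}")   # 50% means raters are at chance
```

A higher error rate means more convincing fakes, and, as Sharon points out, the number drifts as raters learn the telltale signs.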
You know, everyone works with crowd workers on Mechanical Turk or whatever, they say, but I think you're already seeing some groups move to their own pools of workers. We ourselves at OpenAI started using contractors so we could have a Slack channel, regularly talk to them, and sort of get calibrated with them on how they could assess, in our case, summarizations of text, because it's a little subtle to know whether a summary is good or not. I can imagine workers
in large quantities needing to be trained to discriminate in increasingly fine ways about these systems, which feels like a sort of lurking big chunk of economic expenditure. Although I think it doesn't take very long. I mean, I remember the first time I started looking at generated
portraits, somebody pointed out that there are like six things you look at: the background, whether the teeth align properly, whether the ears are the same on both sides. There are a few things that will catch a lot of the photographs. So, yeah.
And then I'm sure you get to the stage where it gets even more subtle. Right. And it's also interesting to think about:
okay, so things are subtle, but we can still detect them, and we know they're fake. But even when they're fake, they cross the uncanny valley in such a way that they can still make some kind of impact. And Jack, I saw this on your blog, which, by the way, everyone should subscribe to, Import AI. But essentially a deepfake for various, I guess, political campaigns these days, but also one of
a boy essentially being resurrected from the dead as a deepfake, with consent given by his parents in some way. Right. And
you know it's fake, right? Oftentimes the question we're asking is, oh no, can we detect if it's fake or real? Well, they're coming out and saying this is fake, but there's also obviously a question of consent there. There's always consent, and there's also the societal effect. I don't know if you remember, a few years ago,
with the consent of the family, Tupac was resurrected as a hologram at, I think, Coachella. And you can watch videos of this online where a dead virtual Tupac appears on stage and does a few songs, and everyone goes completely wild, like exactly as though it's a real show with, you know, a real person. And I think that, to some of what we spoke about earlier, measuring...
how societal changes like this start to happen feels like a sort of subtle problem that we haven't yet really tackled. It's one we talk about at the Index. I'm sure it's one that is talked about at places like Stanford and stuff. But this stuff, even if it's kind of dumb or obviously in the case of Tupac, obviously kind of a hologram or in the case of this kid, obviously kind of CGI-ish, it will still affect people and they'll have emotional relationships with this stuff.
Driven by the context around them. Definitely. What are we going to do? Oh, dear. Jack, not just pointing out problems; answers, please. Solutions. No, I'm kidding. I mean, one thing we are doing is
the Index itself. Both Ray and I have day jobs and stuff, and we spend a lot of time on this in our spare time, because if we prototype ways to do measurement and assessment of AI technology in things like the Index, it's going to create more competition for there to be more organizations doing this, which has happened. But it's also going to help us
get more people in government who are thinking about measurements to start sort of chatting to us. And I think the grand ambition is you want to prototype stuff in the index, but eventually gets done at a large systematic scale by governments. And how we get there is by creating these increasingly detailed and we hope good reports that can help people sort of think through this and create evidence that way.
And I think that is a very, very noble pursuit. And I'm very glad both of you are doing this. Well, thanks. Thanks very much. It's nice to meet you.
It's the equivalent of the trainspotting hobby in AI, I think sometimes: look at different systems and then write down numbers about them. I'm like, oh, look how terribly interesting this benchmark is. You're welcome. Well, thank you both, Jack and Ray, for joining us on this podcast.
So thank you so much for listening to this episode of Skynet Today's Let's Talk AI podcast. You can find articles on topics similar to today's, and subscribe to our weekly newsletter, at skynettoday.com. Subscribe to us wherever you get your podcasts, and don't forget to leave us a rating if you like the show. Be sure to tune in to our future episodes. Woohoo!