
AI’s Drawbacks: Environmental Damage, Bad Benchmarks, Outsourcing Thinking — With Emily M. Bender and Alex Hanna

2025/5/14

Big Technology Podcast

People
Alex Hanna
Critiques and analyzes AI technology, particularly OpenAI's new o1 model and the AI hype index.
Emily M. Bender
Topics
Emily M. Bender: I argue that large language models are essentially a parlor trick: they exploit our ability to make sense of language and lead us to believe there is intelligence inside them. The illusion is reinforced by chatbot user-interface design, for instance the use of pronouns like "I" and "my," even though there is no self there. More importantly, the technology is being used to build applications such as legal assistants, medical diagnostic systems, and personalized tutors that may displace workers and paper over the inadequacies of the social safety net. I think the technology is deceptive from the bottom to the top.

Alex Hanna: I think the energy consumption and environmental impact of large language models is a serious problem, and companies are not transparent about it. Even inside Google, estimates of energy use are contested, because the company will claim it used renewables or trained during off-peak hours, but that information is not public. Model training consumes enormous amounts of energy, and it is already having real-world effects, such as communities losing access to water sources and electrical grids becoming less stable. In Memphis, xAI is using methane generators to power a supercomputer, affecting a traditionally poor Black community. These environmental impacts stay hidden because we usually access the technology through a phone or a computer, and the compute and its environmental footprint are concealed in the immateriality of the cloud.

Deep Dive

Chapters
This chapter explores the deceptive nature of large language models, highlighting how they mimic human-like conversation despite lacking genuine understanding. It also touches on the ways this technology is being marketed and used to solve problems it isn't equipped to handle.
  • Large language models are essentially parlor tricks that exploit our capacity for linguistic comprehension.
  • The user interfaces of chatbots enhance this illusion, for instance by using first-person pronouns.
  • These models are being falsely marketed as solutions for various professional fields, from law to medicine, with promises of displacing workers and patching over holes in the social safety net.

Transcript


I'm Kwame Christian, CEO of the American Negotiation Institute. And I have a quick question for you. When was the last time you had a difficult conversation? These conversations happen all the time. And that's exactly why you should listen to Negotiate Anything, the number one negotiation podcast in the world. We produce episodes every single day to help you lead, persuade and resolve conflicts both at work and at home. So level up your negotiation skills by making Negotiate Anything part of your daily routine.

From LinkedIn News, I'm Jessi Hempel, host of the Hello Monday podcast. Start your week with the Hello Monday podcast. We'll navigate career pivots. We'll learn where happiness fits in. Listen to Hello Monday with me, Jessi Hempel, on the LinkedIn Podcast Network or wherever you get your podcasts.

Two of AI's most vociferous critics join us for a discussion of the technology's weaknesses and liabilities and a debate on the finer points of their arguments. We'll talk about it all after this. Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversation of the tech world and beyond. We're joined today by the authors of The AI Con. Professor Emily M. Bender is here. She's a professor of linguistics at the University of Washington. Emily, welcome.

I'm glad to be here. Thank you for having us on your show. My pleasure. And we're also joined by Alex Hanna, the Director of Research at the Distributed AI Research Institute. Alex, welcome. Thanks for having us, Alex.

Always good to have another Alex on the show. So we try to get the full story on AI here. And so today we're going to bring in, I think, two of the most vocal critics on the technology. They're going to state their case and you at home can decide whether you agree or not. But it's great to have you both here. So let's start with the premise of the book. What is the AI con?

Emily, do you want to begin? Sure. So the AI con is actually a nesting doll situation of cons. Right down at the bottom, you've got the fact that especially large language models are a technology that is-- that's a parlor trick. It plays on our ability to make sense of language.

and makes it very easy to believe there's a thinking entity inside of there. This parlor trick is enhanced by various UI decisions. There's absolutely no reason that a chatbot should be using I, me pronouns because there's no I inside of it, but they're set up to do that. So you've got that sort of base level con. But then on top of that, you've got lots of people selling technology built on chatbots

to be a legal assistant, to be a diagnostic system in a medical situation, to be a personalized tutor and to displace workers, but also put a bandaid over large holes in our social safety net and social services. So it's cons from the bottom to the top.

Okay, I definitely have things that I disagree with you in places on, and we will definitely get into that in the second half, especially about the usefulness of these bots and whether they should be using I/me pronouns and the whole consciousness debate. We're going to get into that. I don't think any of us think that these things are conscious. I just think we have a disagreement on how much the industry has played that up. But let's start with what we agree on.

And I think that from the very beginning, Emily, you were the lead author on this very famous paper about calling the large language models stochastic parrots. And at the very beginning of that paper, there is concern about the environmental safety and the environmental issues that large language models might bring about. So on this show, we talk all the time about the size of the data centers, size of the models.

And of course, there is an associated energy cost that must be paid to use these things. And so I'm curious if you, Emily, or you, Alex... Alex, you worked at Google, right? So you probably have a good sense of this. Can you both share, like quantify, how much energy is being used to run these models? So part of the problem is that, even if you're working at Google, even if you are directly working on this,

there are not very public estimates of how much cost there is. I mean, the costs vary quite widely. And the only cost that I think we know was an estimate made by folks at Hugging Face who worked on the BLOOM model, because they were able to actually have some kind of insight into the energy consumption of these models. So part of the problem is the transparency of companies on this. You know, as a response,

at Google after the Stochastic Parrots paper was published, one of the complaints from people like Jeff Dean, the SVP of research at Google, and David Patterson, who's the lead author of Google's kind of rebuttal to that, was that, well, you didn't factor in XYZ, you didn't factor in the renewables that we only talk about at this one data center in Iowa. You didn't factor in off-peak training.

And so that's part of the problem. I mean, we could try to put numbers on it, but there's so much guardedness about what's actually happening here. We can't quantify it. We don't know when it comes to model training. I mean, we might have something like the number of parameters that are in a new model or in an open weights model like Llama.

We don't know how many kind of fits and starts there were with stopping training and restarting or experimenting. So, you know, we could speculate, but we know it's a lot because there are real effects in the world right now. What are those effects?

So you see communities losing access to water sources. You see electrical grids becoming less stable. And this is starting to be, I think, very well documented. There's a lot of journalists who are on the beat doing a lot of good work. And I also want to shout out the work of Dr. Sasha Luccioni, who's been looking at this from an academic perspective. And one of the points that she brings in is that it's not just the training of the models, but of course, also the use.

And especially if you're looking at the use of chatbots in search, instead of getting back a set of links, which may well have been cached, if you're getting back an AI overview, which happens non-consensually when you try Google searches these days, right? Each of those tokens has to be calculated individually.

And so it's coming out one word at a time, and that is far more expensive. I think her number is somewhere between 30 and 60 times more expensive than an old-fashioned search just in terms of the compute, which then scales up for electricity, carbon, and water.
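(A back-of-envelope sketch of the point being made here: the 30-to-60-times multiplier is the range cited above, while the baseline per-query energy and the query volume are hypothetical placeholders chosen only to make the arithmetic concrete, not measured figures.)

```python
# Back-of-envelope comparison of compute cost for a cached search result versus
# a token-by-token generated AI overview. The 30-60x multiplier is the range
# cited in the conversation; the baseline energy per cached query and the query
# volume are hypothetical placeholders, not measured figures.

BASELINE_WH_PER_QUERY = 0.3        # hypothetical energy for a cached search result, watt-hours
OVERVIEW_MULTIPLIER_LOW = 30       # low end of the cited range
OVERVIEW_MULTIPLIER_HIGH = 60      # high end of the cited range
QUERIES_PER_DAY = 1_000_000        # hypothetical query volume

def daily_energy_kwh(per_query_wh: float, queries: int) -> float:
    """Total daily energy in kilowatt-hours for a given per-query cost."""
    return per_query_wh * queries / 1000

cached = daily_energy_kwh(BASELINE_WH_PER_QUERY, QUERIES_PER_DAY)
overview_low = daily_energy_kwh(BASELINE_WH_PER_QUERY * OVERVIEW_MULTIPLIER_LOW, QUERIES_PER_DAY)
overview_high = daily_energy_kwh(BASELINE_WH_PER_QUERY * OVERVIEW_MULTIPLIER_HIGH, QUERIES_PER_DAY)

print(f"Cached search: {cached:,.0f} kWh/day")
print(f"AI overview:   {overview_low:,.0f} to {overview_high:,.0f} kWh/day")
```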

I would also say that, speaking about existing effects, there's also a lot of reporting coming out of Memphis right now, especially around the methane generators that xAI has been using to power a particular supercomputer there called Colossus, specifically around emissions affecting Southwest Memphis, traditionally a Black and impoverished community. There's also reporting on,

actually, research from UC Irvine looking at backup generators and emissions from the diesel that are connected to the grid. Just because the SLAs on data centers are incredibly high, you effectively need some kind of backup to kick in at some point, and that's going to contribute to air pollution. And which communities have been affected by the loss of water due to

So I think the best reported one is The Dalles in Oregon. I mean, I think that's the one that is the best known. That is kind of pre-AI, when we're focusing on the development of Google's hyperscaling. And it wasn't until The Oregonian sued the city that we knew that half of the water consumption in the city was going to Google's data center.

That was before generative AI. That was before generative AI. I mean, we have to imagine the problem is probably exacerbated right now. But do we know that? I mean, you both wrote the book on this.

So we certainly point to environmental impacts as a really important factor. It is not the main focus of the book. I would refer people to the reporting of people like Paris Marx over at Tech Won't Save Us, who did a wonderful series called Data Vampires, looking at, I think, stories in Spain and in Chile. And yeah, so this is, you know, we are looking at the overall con, and

And the environmental impacts come in because it is something we should always be thinking about. And also because it is very hidden, right? When you access these technologies, you're probably sitting, you know, looking at them through your mobile device or through your computer and the compute and its environmental footprint and the noise and everything else is hidden from you in the immateriality of the cloud.

I would also say, on the reporting on Memphis, I want to give a shout out to the reporting in Prism by Ray Libby, and I don't know if I'm pronouncing their surname correctly, but they have an extensive amount about the water consumption of this, saying that it would take about a million gallons. I'm checking it, I'm looking at the reporting on it.

I'm seeing the exact number on this. Yeah, so they're saying a million gallons of water a day to cool computers. They're saying that they need to build a gray water facility to do it. These facilities don't exist yet, so they'd have to be built. But this thing is already being constructed and is using water. So I mean...

I don't think it's a far cry to say that what was already happening in the hyperscaling era, pre generative AI, is happening now. I mean, the unfortunate fact about it is that a lot of these community groups are fighting this on a very local level, and a lot of these things are getting underreported. But from what we know from the fights in The Dalles and in Loudoun County

and parts of rural Texas, I mean, we'd be surprised if similar kinds of battles weren't being fought. I agree with the underreporting, and that's why we're leading with it here. And we're going to go through a list of some of the things that might be wrong with generative AI. I think it is an issue. I think, Emily, you basically hit on it, right? Where

you're producing all these tokens when you're going to generate an AI overview, which I checked and you cannot opt out of. You're correct. You can if you add minus AI to the query. Okay. But you have to do that each time. You can't, like, put a setting somewhere.

That's interesting. I didn't know about that. Okay, so you can opt out, minus AI, but these things do take more computing than traditional Google search. I guess the argument from these companies would be that they're just going to make their models more efficient. I mean, we see the increasing amounts of efficiency over time, and there might be a big upfront energy cost to train, but inference might end up being...

not that energy intensive. What would you say to that? I would say that we've got Brad Smith at Microsoft giving up on the plan to go carbon negative and remove all the carbon the company has emitted since the beginning of Microsoft. And he said this ridiculous thing about how we had a moonshot to get there, and it turns out with generative AI, the moon is five times further away.

Which is just an absurd abuse of that metaphor. But yeah, and you see Google similarly also backing off of their environmental goals. And so if there really were all these efficiencies to be had, I think they wouldn't be doing that backing off. And I want to also add, I mean, I think this argument about the large amount of training and carbon use on the front end and then it tapering off with inference, I mean, this is an argument that

came straight from Google. This was, again, in the same paper by David Patterson. I think the title of the paper, and I'm not going to get it exactly right, was that the cost of training, or the cost of language models, will plateau and then decrease. And effectively, the argument being that

you have this large investment that we can offset with renewables, and then it's going to decrease. But you have to also consider, given the economics surrounding it, that it's not one company training these, right? I mean, it's multiple different companies training these,

and multiple different companies providing inference. And so as long as there's some kind of incentive to keep putting this in products, they're going to proliferate. So if it was just Google, sure, maybe there might be a case in which there was some kind of planning and some kind of way to measure and focus on that, and then it actually tapering down. But you have Google, Anthropic,

xAI, of course OpenAI, Microsoft, Amazon, everyone trying to get a piece, doing both training and inference. So I think, again, you know, it's hard to put numbers on it, but what we see is just the massive investment in this. And that gives a good signal to say that the carbon costs have to be incredibly high.

Look, I think it's important for us, again, to lead here. It's clear that there are some real environmental impacts. And, I mean, we have Jensen Huang, the CEO of NVIDIA, saying inference is going to take a hundred times more compute than traditional LLM inference. And every top executive from these firms that I've asked, well, is inference going to take more compute? It's not exactly as much as Jensen is saying, but there is a spectrum. So these things are going to be more energy intensive. And for everybody listening out there, I do think, you know, this is important context

to take in: when we talk about AI, there's an environmental cost out there. It's not fully clear what that is, although there is one. And I agree with the authors here that more transparency makes a lot of sense. Now, let's talk about another issue that you bring up in the book, which is benchmark gaming. It's been a hot topic in our Big Technology Discord over the past couple of weeks that these research labs keep telling us that they have

reached a new benchmark or beat a certain level on a new test. And we're all trying to figure out what that means because it does seem like a lot of them are training to the test. And

You have some points of criticism in the book about the gaming of benchmarks and what that's meant to tell us. So just lay it out for us. What's going on with benchmarks, and tell us about the gaming, Emily. So, yeah. So when you say the gaming of benchmarks, that makes it sound like the benchmarks are reasonable and they're being misused. But I think actually most of the benchmarks that are out there are not reasonable. They lack what's called construct validity. Right.

And construct validity is this two-part test: that the thing we are trying to measure is a real thing, and that this measurement correlates with it interestingly. But nobody actually establishes that what these things are meant to measure is a real thing, let alone that second part.

And so they are useful sales figures, right? To say, hey, we now have state of the art, SOTA, on whatever. But it is not interestingly related to what it's named as measuring, let alone what the systems are actually meant to be for. Yeah. And I would just add that, I mean, there's a lot of work. I mean, prior to the book, Emily and I spent a lot of time writing on benchmark datasets. And so

this has been... you know, I'm personally obsessed with the ImageNet dataset. I'm thinking of another book on the ImageNet dataset, just on what it entails. But I mean...

you know, the benchmarks, what they purport to measure, there are a lot of different problems in the benchmarks, right? So construct validity is probably first and foremost. And when you have something like MedPaLM 1 and 2 being measured on the US Medical Licensing Examination, that's not really a test that determines whether one is prepared to be a medical practitioner. There's so much more involved with being a medical practitioner,

above and beyond taking the US Medical Licensing Examination. You can't take the bar and say you're ready to be a lawyer, right? I mean, there's so much more that has to do with

relationships and training and other types of professionalization. And there's a huge literature in sociology, in the sociology of occupations, on what professionalization looks like, what it entails, what kinds of social skills are involved, and how to be adept at being in the discipline.

But then with the different benchmarks, there are so many different problems just in terms of the way that companies are doing science themselves. They're releasing these benchmarks, and often these are benchmarks that they themselves have created and released. So it may be the fact that they are, quote unquote, teaching to the exam,

but also, they have no kind of external validity in terms of what they're trying to do. So OpenAI is saying,

we had a model that did so well, we had to create a new benchmark for it. Well, who's validating that, right? I mean, even in the old benchmarking culture, you had external benchmarks, and multiple people would be going to them and saying, oh, we've done better on this benchmark. Now OpenAI is saying, we have our own benchmarks because we did so well. Not that the old system was any better, but with this new system, well, where's the independent validation that it can do the thing it's purported to do?

What do you think about the ARC-AGI test? Yeah, well, I mean, we spent some time focusing on the ARC-AGI test, right? The ARC-AGI test, it is independent, at least, it is ostensibly independent. I mean, this is the one from the French researcher, François Chollet, yeah. I mean, by the way, for everybody who's listening, it basically asks, let me see if I get this right, it asks the models to be able to

understand patterns and put shapes together.

I think that's the best way to explain it. Yeah, so it's a bunch of visual puzzles where I think they're all in 2D grids. And in order to make this something that a large language model can handle, those 2D colorful things are turned into just sequences of letters. And the idea is that you have, I think, sort of a few-shot learning setup where you have a few exemplars and then an input. And the thing is, can you find an output like that?
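(To make the setup just described concrete, here is a minimal illustrative sketch of flattening a small 2D grid into a letter sequence and assembling a few-shot prompt. The grids, the color-to-letter mapping, and the prompt layout are invented for illustration; they are not the actual ARC-AGI encoding or evaluation harness.)

```python
# Illustrative sketch: flattening a 2D ARC-style grid of colored cells into a
# character sequence so a language model can consume it as text. The grids, the
# color-to-letter mapping, and the prompt layout are invented for illustration
# and are not the actual ARC-AGI encoding.

from typing import List, Tuple

COLOR_TO_CHAR = {0: ".", 1: "r", 2: "g", 3: "b"}  # hypothetical mapping

def serialize_grid(grid: List[List[int]]) -> str:
    """Turn a 2D grid of color codes into a newline-separated string of letters."""
    return "\n".join("".join(COLOR_TO_CHAR[cell] for cell in row) for row in grid)

def build_few_shot_prompt(examples: List[Tuple[List[List[int]], List[List[int]]]],
                          test_input: List[List[int]]) -> str:
    """Assemble a few-shot prompt: worked input/output pairs, then the test input."""
    parts = []
    for inp, out in examples:
        parts.append("Input:\n" + serialize_grid(inp))
        parts.append("Output:\n" + serialize_grid(out))
    parts.append("Input:\n" + serialize_grid(test_input))
    parts.append("Output:")
    return "\n\n".join(parts)

# Toy example: the hidden "rule" is to mirror the grid left-to-right.
example_pair = ([[1, 0], [0, 2]], [[0, 1], [2, 0]])
test = [[3, 0], [0, 1]]
print(build_few_shot_prompt([example_pair], test))
```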

And if we want to talk about how the names of the benchmarks are already misleading, there's the fact that it's called ARC-AGI,

right? That suggests that it's testing for AGI. It's not. It's one specific thing. And I think Chollet's point is that this is something that is a very different kind of task than what people are usually using language models for. And so the sort of gesture is towards generalization, that if you can do this, even though you weren't trained for it, then that's evidence of something. But if you look at the

OpenAI paper-shaped object about this, they used a bunch of them as training data in order to tune the system to be able to do the thing. So, all right, fine. Supervised machine learning kind of works. Right. And for the next test, there was ARC-AGI-2 that came out with a whole bunch of new problems, and instantly all the models started doing poorly on those. So let me just ask this.

Is there a measure that would allow the two of you to assess whether these AI models are useful or have you just written off their ability to be useful completely? So useful for what?

I mean, you tell me. Well, that's sort of my point, is that I think it's perfectly fine to use machine learning to do specific tasks, and then you set up a measurement that has to do with the task in context. I'm a computational linguist, so things like automatic transcription are very much in my area. If I were going to evaluate an automatic transcription system, I would say, okay, who am I using it for? What kind of speech varieties? I'm going to collect some data of

people speaking, have someone transcribe it for me, a person, and then evaluate how well the various models work on doing that transcription. And if they work well enough and it is within the tolerances of the use case for me, then great. That's good. Do you believe in the ability to be general?

So the ability to be general, and here I'm thinking of the work of Dr. Timnit Gebru, is not an engineering practice. That's an unscoped system. So what Dr. Gebru says is the first step in engineering is your specifications. What is it that you're building? If what you're building is general, you're off on the wrong path. That's not something that you can test for, and it is not well-scoped technology.

Yeah, I mean, this notion of generality has always had some specificity in AI, too. I mean, we mentioned in the book this idea, this...

This is a word I struggle with, and I've tried it so many times, but I'm just going to say fruit flies, right? So, right, the Drosophila, the fruit fly model of genomics, this idea that you have some kind of sequencing that's very common to this one very specific species, right? And in the past, what that's become

in AI is the game of chess. It's been game playing, right? I mean, these are very specific tasks, and those don't generalize to something called general intelligence, as if something like that actually exists. I mean,

one of the problems in AI research is that the notion of intelligence is very, very poorly defined. And the notion of generality is very poorly defined, or is scoped to whatever the actual benchmark or task is that's being attempted. So, I mean, this notion of generality is very poorly

understood, and it is deployed in a way that makes it sound like there is a notion of general intelligence. And, you know, a great paper that we point to in the footnotes of the book is this paper by Nathan Ensmenger, which

is talking about how chess became the Drosophila of AI research in the prior AI hype cycle of the '60s and '70s. And it just happened to be that you had a lot of guys who liked chess and wanted to compete with the Soviets, who had chess dominance, right? And so those tasks become these tasks about, well, these are the things we kind of like, and

we're actually seeing some of that again. It's like, well, these are tasks that we think are suitable, these are tasks that are scoped in a way we think are the most worthwhile problems. But they're not tasks that think about, well, what exists in the world that is going to be helpful and useful and scoped to specific execution, right? This notion of an everything system is wildly unscoped. But, okay. So it is unscoped, but

I think everybody listening or watching right now would probably say, well, just my basic use of ChatGPT, it can tell me about history, it can write a poem, it can create a game. Okay, I see Emily reacting already. It can search the web and give me plans. It can do all these different things in these different disciplines. So there is, I think for people listening, there would be a sense that there is a

an ability to go into various different disciplines and perform. And whether you say it's a magic trick or not, it's clear that it can. And so what I guess I'm trying to get at is, I mean, is there a way to measure that? Or do you think that is in itself a wrong assertion?

So, yes, I think it's a wrong assertion. What ChatGPT can do is mimic human language use across many different domains. And so it can produce the form of a poem. It can produce the form of a travel itinerary. It can produce the form of a Wikipedia page on the history of some event. It is an extremely bad idea to use it if you actually have an information need,

setting aside the environmental impacts of using ChatGPT, and setting aside the terrible labor practices behind it and the awful exploitation of data workers who have to look at the terrible outputs so that the consumer sees fewer of them. And by terrible outputs, I mean violence and racism and all kinds of psychologically harmful stuff. Yes. What's that? No, we've had one of the people who've been

rating this content on the show. Folks who are interested, I'll link it in the show notes. Richard was here to talk about what that experience was like. Sorry, go ahead. So setting aside all of that, if you have an information need, so something you genuinely don't know, then taking the output of the synthetic text extruding machine

doesn't set you up to actually learn more on a few levels, right? Because you don't already know, you can't necessarily quickly check except maybe doing an additional search without ChatGPT, at which point, why not just do that search? But also, it is poor information practices to assume that the world is set up so that if I have a question, there is a machine that can give me the answer.

When I'm doing information access, instead what I'm doing is understanding the sources that that kind of information comes from, how they're situated with respect to each other, how they land in the world. And this is some work I've done with Chirag Shah on information behavior and why chatbots, even if they were extremely accurate, would actually be a bad way to do these practices.

So, you know, back to your point: yes, this system is set up to output plausible-looking text on a wide variety of topics. And therein lies the danger,

Because it seems like we are almost there to the robo-doctor, the robo-lawyer, the robo-tutor. And in fact, not only is that not true, not only is it environmentally ruinous, etc., but that is not a good world to live in. I just want to hit on this point. I disagree with you on this one. I do think that some of the points that you're making are well-founded. We don't want these things to be lawyers right away.

But let me at least point you to one use that I've had recently and you could tell me where I'm going wrong if you think I am. I mean, I'm in Paris now, a little work, a little vacation at the same time. And what I've done is I've taken two documents that I've had friends who

have been here often. They put together documents that they send to friends when they come here. I've uploaded those into ChatGPT, and then I have ChatGPT search the web and give me ideas of what to do. I tell it where I am, I tell it where I'm going, and it searches through, for instance, all the museums, the art galleries, the festivals, the concerts, and it brings it into one place. And that's been extremely useful to me to find new cultural events, concerts, and...

There's even a bread festival going on here that I had no idea about. And now I'm going to go because it found it for me. So there's a link when it comes to this stuff. There's a link

that you can go out and double-check the work. But as far as finding information on the web, the fact that it's able to go and comb the internet for these events, and then take into account some of the context that I've given it with these documents, I think is very impressive. And that's just one use case. So I'm not asking it to be a lawyer. I'm kind of asking it to be what you said, an itinerary planner. What's wrong with that?

So, I mean, first of all, you have these lovely documents from your friends. And I guess what you're saying is missing is whatever current events are. So they've given you some sort of like, these are general things to look for, but they haven't looked into what's going on right now. What's wrong with that? You know, on several levels, what would we do in a prior age, like even pre-internet?

The local newspapers would list current events. Here's what's going on. If you landed in a city, you would go find the local, probably local indie, newspaper and look up the events page. And that system was based on a series of relationships within the community, between the people putting on festivals and the newspaper writers. And it helped support the local news information ecosystem, which was a good thing. But on top of that,

If something wasn't listed, you could think about why is this not listed? What's the relationship that's missing?

Your ChatGPT output is going to give you some nonsense. And you're right, this is a use case where you can verify whether this is real or not. It is also likely going to miss some things. And the things that are not surfaced for you are not surfaced because of the complex set of biases that got rolled into the system, plus whatever the roll of the die was this time. And anytime someone says, well, I need ChatGPT for this, usually one of two things is going on. Usually,

it's either that there's another way of doing that, one that gives you more opportunities to be in community with people and make connections, or there is some serious unmet need, which doesn't sound like it's the case here. And if we pull the frame back a little bit, we can say, why is it that someone felt like the only option was a synthetic text extruding machine? And here, I think you've fallen into the former of these, which is: what are you missing out on by doing it this way? What are the connections you could be making to the people around you,

If you're staying in an Airbnb, maybe the Airbnb host, if you're in a hotel, the concierge, to get answers to these questions when you're looking to the machine instead. I would also say this is a pretty low stakes scenario, right? You can go out, you can verify these things.

You can go to existing resources, event calendars that people also spend a lot of time curating online. I mean, there's a lot of stuff that's already curated online. And I mean, it's not like this didn't exist in prior instances of technology. I mean, one of the people that we cite in the book and talk a lot about is Dr. Safiya Noble, whose work on Google looks at the way that Google results

present very violent content with regard to racial minorities. One of the parts of her book that I like to reference, and that a lot of people don't reference initially, is the part where she talks about Yelp, specifically

what it surfaces in terms of a Black hairdresser, and the way that Yelp effectively was shutting this person out of business, because she served a specific need for Black residents of the city that Noble was studying, braiding hair and doing other Black hairstyles, right? And so this is kind of a function of all kinds of information retrieval systems, right? You think about what they're including, what they're excluding, right?

So this is not very consequential here, but in any kind of area of, say, summarization or any kind of retrieval, you do need to have some kind of expertise where you can verify that and ensure that what's getting in there is not missing something huge. And it's basically going to take this

set of information access resources or systems, in this case crawling the web, knowing that that's going to miss something, and then it's going to exacerbate that, because then you cannot situate those sources in context.

Okay, let me just give my counterargument and then we can move on from this. My counterargument would be a couple of things. First of all, I don't speak French, so the local newspaper would kind of be lost on me. Okay, so I am staying at a residence; we swapped apartments, so she's in my New York apartment and I'm here. So maybe she and I could have gone over that newspaper together. That's fair. But the newspaper, speaking of things that leave stuff out,

the newspaper leaves stuff out all the time. It exercises editorial judgment. So it is bot editorial judgment for newspaper editorial judgment, but the bot can be in some ways more comprehensive because it's searching the entire web. And I'll just say one last thing about this. I didn't feel the need to use it. I didn't say, I need to use it to figure out what's going on. Like, again, I had these documents. What's useful about it is,

Speaking of making connections with the local community, if I'm able to, here's the word, be efficient in my research through using it, I could spend much more time out in the community versus searching the web or reading the newspaper. So what's your thought on that, on those arguments? Yeah.

Sorry. So I'm getting distracted by Alex's cat walking around. So listeners, Alex's cat is here. Alex, what's your cat's name? This is Clara. And I'd lift her up, but I have a shoulder injury. But she's knocking the mic around. So I'm going to not. I'm just trying to keep her off the mic. Yeah. Yeah. Thank you. So the efficiency argument is,

So this is the efficiency argument in the context of leisure activities as opposed to the context of work. You mentioned along the way that it is searching the whole web for you. You don't know that, actually. That's right. And also the whole web includes a lot of stuff that you don't actually want. Like lots and lots and lots of the web is just garbage SEO stuff. And maybe you're seeing more of that in your ChatGPT output than you would with a search engine, which, as Alex mentioned, also has issues.

And then finally, I'm going to take issue with you. SEO garbage is made for the search engine. It is, but the search engines also, in order to stay in business, have to be fighting back against the SEO garbage. It's a constant battle. Probably the chatbots as well. So you mentioned newspaper editorial judgment versus bot editorial judgment. And I'm going to take issue there because a bot is not the kind of thing that can have judgment, nor is it the kind of thing that can have accountability for exercising judgment. Right.

And so I think that, yes, as Alex was saying, this is low stakes. But if you're using it as sort of a motivation for these things being useful in the world, then you have to deal with the fact that the useful in the world is going to entail many more higher stakes things. And then we really have to worry about accountability. I would also want to say, too, I mean, there's a lot of, I think, this argument from...

like, quote unquote, capabilities, which I don't really know what that term means. And that's another poorly defined term, I think, especially when it comes to AGI. But I mean, this argument of, well, I find it useful, I don't find terribly convincing, right? I mean, it's sort of like, well, okay, you have found it useful in either

a situation in which there is a way to have some kind of verification of sources that you know about and have some kind of ground truth about, or you found it useful across a variety of these different situations. But if I'm asking a chatbot about an area that I know quite a lot about, say sociology or the social movements literature,

I then have that knowledge to verify it, just from my social skill in that area. And this is a term I'm kind of borrowing from a sociologist, Neil Fligstein, and my knowledge of how to navigate those areas and my professionalization as a sociologist.

Okay. But then once it gets into those areas in which verifiability just escapes me, which is most areas, because we're not professionals in those areas, and although a lot of us want to be jacks of all trades, jacks and jills of all trades,

then we lose that ability, and we don't have the social skill or depth of knowledge to verify it in the same way. And so I'm really not convinced by those "well, these are useful for me in these pretty low-stakes contexts" arguments, because that slippage then means that we're going to miss some pretty big things in some really dire contexts.

Okay, well, let's turn it up a notch when we come back, because we're going to talk about AI at work and AI in the medical context. And maybe we can even touch a little bit on doomerism, which you write about in the book. And there's plenty else on the agenda. So we'll be back right after this.

And we're back here on Big Technology Podcast with Professor Emily M. Bender and Alex Hanna. They are the authors of The AI Con: How to Fight Big Tech's Hype and Create the Future We Want. Here it is. So let's go to usefulness. And we'll start with generative AI in the medical context because,

why don't we just go straight for the example that we'll probably have the biggest disagreement on here. And I'm not saying that I think generative AI should play the role of a doctor. In fact, when I wrote my list of things I agree with you both on, I don't think that AI should be a therapist, at least not yet. And we know now that the number one use of AI, according to a recent study, is companionship and therapy, and the therapy side

really scares me and I think the companionship isn't the best thing in the world either. But in medicine, I do find that there is some use for it. Medicine is a field overrun by paperwork and insurance requirements that I think have ruined the healthcare system because they keep doctors effectively tied to their computers writing notes as opposed to seeing patients or living their lives.

And Alex, before the break, you mentioned that one of the areas where this stuff is useful is when it starts to operate in your area of expertise, because you're able to verify that. So, I mean, we're going to go with one use that I find to be pretty good here, and that to me doesn't make generative AI feel like a con, which is when a doctor is seeing a patient and they can

take a transcription of the conversation that they have with the patient, and then have AI synthesize what they talked about, summarize it, and put it into the systems that they have for electronic medical records, and then verify that, so they don't have to spend the time writing those summaries up and can actually go and spend some more time with patients. So what's the problem with that?
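(For concreteness, here is a minimal sketch of the kind of pipeline the question describes: transcribe the visit, have a model draft a note, and require clinician sign-off before anything is filed. The function names are hypothetical placeholders, not any vendor's actual product or API, and the ASR and summarization steps are stubs.)

```python
# Minimal sketch of an "ambient scribe" pipeline as described in the question:
# transcribe the visit, have a model draft a note, then require clinician review
# before anything reaches the electronic medical record. Function names are
# hypothetical placeholders, not a real vendor's API; the ASR and summarization
# steps are stubs returning canned text.

from dataclasses import dataclass

@dataclass
class DraftNote:
    text: str
    reviewed: bool = False

def transcribe_visit(audio_path: str) -> str:
    """Stub for an automatic speech recognition step."""
    return "Patient reports intermittent headaches for two weeks..."  # placeholder transcript

def draft_clinical_note(transcript: str) -> DraftNote:
    """Stub for a language-model summarization step producing a draft note."""
    return DraftNote(text="Assessment: headaches, two weeks. Plan: follow-up in one month.")

def clinician_review(note: DraftNote, approved: bool) -> DraftNote:
    """The clinician must read, correct, and approve the draft before filing."""
    note.reviewed = approved
    return note

def file_to_emr(note: DraftNote) -> None:
    """Only clinician-approved notes are written to the record."""
    if not note.reviewed:
        raise ValueError("Draft note has not been approved by the clinician.")
    print("Filed note:", note.text)

if __name__ == "__main__":
    transcript = transcribe_visit("visit_audio.wav")   # hypothetical file path
    note = draft_clinical_note(transcript)
    note = clinician_review(note, approved=True)       # the verification step under debate
    file_to_emr(note)
```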

There are so many problems with that. And the first thing I want to say is that you named the underlying problem when you talked about insurance requiring so much paperwork. So this is one of those situations where there's a real problem here. It's not that doctors shouldn't be writing clinical notes. That is actually part of the care. But there is a lot of additional paperwork that is required because of the way insurance systems, and especially the one in the United States, are set up. And so we could work on solving that problem.

And this is a case where the sort of the turn towards large language models, so-called generative AI as an approach to this is showing us the existence of an issue. But that doesn't mean it is a good solution. So many problems. One is writing the clinical note is actually part of the process of care. It is the doctor reflecting on what came out of that conversation with the patient.

and thinking it through, writing it down, plans for next treatment. That is not something that I want doctors to get out of the habit of doing as part of the care. Now they might feel like they don't have time for it. That's also a systemic issue. Secondly, these things are set up as like ambient listeners, which is a huge privacy issue. As soon as you've collected that data, it becomes sort of this like radioactive pile of danger.

Thirdly, you've got the fact that automatic transcription systems, which are the first step in this, do not work equally well for different language varieties. So think about somebody who's speaking a second language. Think about somebody who's got dysarthria, so an older person whose speech isn't very clear. Think about a doc

who is an immigrant to the community that they're working in, who's got extra work to do now because their words are not well transcribed, and so the clinical notes thing doesn't work well for them. But the system is set up where there are these expectations that they can see more patients because the AI, in quotes, is taking care of all of this for them. And there's a beautiful essay that came out recently, I think in Stat

News, and I was looking for the name of the author, didn't find it real quick, really reflecting on how important it is to her that the doctor do that part of the care of actually pulling out from the conversation: this is what matters. And it's not just simple summarization. It is actually part of the medical work to go from the back and forth had with the patient, and all of the doctor's expertise, to what goes into that note.

Yeah. So I want to add on. Emily has said so much of what I want to get at, but I have, I think, three or four separate points in addition to that. So first off is the technical point.

So there are tools that are purported to do summarization. There's some great reporting by Garance Burke and Hilke Schellmann at the AP from last October that was looking at Whisper specifically, so that's OpenAI's ASR system, automated speech recognition system, which found that the medical transcription

basically was making up a lot of shit. And then we knew that they had quote-unquote hallucinations. Again, that's not a term that we use in the book. We say that it's

I say it's making shit up, but even that is maybe granting too much anthropomorphizing of the system for me. But there's a lot of this. Quoting from that piece, some of that invented text includes racial commentary, violent rhetoric, and even imagined medical treatments.

So that's one major problem. The second problem is that medical transcription has been an area in which medicine has been forcing this kind of casualization of work for years, right? And so with medical note-taking as it exists in hospitals now, much of that is done remotely. So it's gone and taken this work that has been seen as kind of busy work, or this

thing of, I don't want to write up my medical notes, to be this type of work that gets foisted on someone else. So prior to this ASR element of it, we've had these... thanks for linking that, Emily, and I'll link the AP article that I'm looking at too.

Part of that work has actually been offshored as part of this movement of outsourcing. So a lot of that is done remotely as part of this casualization. And I want to point out the gendered notion of this. This is very much work that has been done by women. And that reflects a lot of the ways in which so much of,

quote unquote, AI technology wants to basically take the work that has been traditionally the domain of women and is saying, well, we can automate that or we can casualize that in different ways. And that's important because it sees this work as not actually part of quote unquote, the work. It is seen as work that ought to be casualized and offshored.

And I appreciate the essay that Emily shared, because that essay is saying, no, this is actually part of the element of doctoring. And then I want to also couch all of this in the political economy of the medical industry, thinking about what it means to rush toward more and more remote medicine, having more and more doctors see more patients. And these efficiency gains for doctors aren't going to

make their jobs necessarily easier. It's going to put more of a pressure on them. Now that you're in a position where you don't have to take medical notes, you're going to be running from appointment to appointment to appointment.

And my sister is a nurse, a nurse practitioner, and she's basically seeing this in her job right now at her clinic. She's like, now we have these things where I have to see more patients. It's not that I'm going to go and be on the beach somewhere. It means that I'm going to have nine or ten 15-minute appointments a day. I'm not going to have proper time to spend with patients.

So the coda to all of this is that if AI boosters could really offshore all of doctoring to chatbots, they would. And this is one case in which Bill Gates has said, you know, in 10 years, we're not going to have teachers and doctors.

What a nightmare scenario, to have non-teachers and non-doctors. And Greg Corrado really gives it away, as we cite in the book, where he says of MedPaLM 2, you know, this thing is really efficient, we're going to increase tenfold our medical ability, but I wouldn't want this to be part of my family's medical journey. Okay.

Okay, but again here, you're picking out what is like some of the most extreme statements, and I started my question saying – It's Bill Gates. And Bill Gates can make extreme statements. He's the guy. I don't think he's the guy, and I think that –

That doesn't reflect the broad consensus here, and definitely not the question that I asked, which again was about using this to take some of the time that the doctors are using in paperwork and give that back to either the doctors themselves,

or to have them be able to see more patients. So let me address that point. First of all, I want to name the author of that essay. Her name is Alia Barakat, and it's a beautiful essay. She's a mathematician and also a patient with a chronic condition. Wonderful essay. But yeah, you said give that time back to the doctors or have them see more patients, right? It is not going to go back to the doctors. That's not how our healthcare system works. And it's also therefore going to decrease the quality of patient care. It is lose-lose, except for the

hospitals maybe getting more money, and certainly the tech companies that are selling this to the hospitals. Okay. I'm also curious, in terms of thinking about the more nuanced position, who's the reference here that you're thinking of, Alex? What's the consensus on this? Because we see the egregious elements of this, and I'm wondering what the medical consensus is, you know, who's an example of,

you know, just to push on this. Now I'm interviewing you, but who's someone that you think is doing this very well? Well, I mean, someone doing this well... again, I don't think that this stuff is well developed yet, but I've definitely seen enough doctors just buried in paperwork. And we started this whole segment talking about how this is, I guess, an insurance-driven thing. And so, yeah,

it's interesting. I guess, do you both not like the way that the insurance companies are guiding the system, but also think that it's good practice to have doctors write those notes themselves, or...

Hold on. There are two use cases for doctors' notes, right? There's actually documenting, for the patient and for the rest of the care team, what has happened in this session. And that, I think, is a super important part of the work of doctoring. I believe you that there's a lot of additional paperwork that has to do with getting the insurance companies to pay back. And no, I don't like that system at all. The insurance companies are not providing any value. They are just vampires on our healthcare system in the U.S.

Okay. I think we can agree on that front. I mean...

Anyway, I do think that as this stuff gets better... I understand a patient wants this to happen. Do I think a doctor would be giving them worse care if they allowed the AI to summarize the notes or pick out the more important parts, if this stuff was working well? Not necessarily. So that's a big if. What does it mean when this stuff is getting better and working well? Do you mean kind of like the absence of

making shit up? Right. Definitely. I mean, we all agree that the doctor will have to verify and check this information after. Well, I guess the problem then is, why are we having the doctor double-check that to begin with? Right. In an area where the doctor has 15 minutes to see every patient and there is an AI, quote unquote, scribe... I don't want to call it an AI scribe. There's an automatic speech recognition tool

doing automatic speech recognition on these things. In what space, or with what time, does the doctor have to verify those? I mean... Well, the time that they would be spending writing those notes in the first place.

Is verification an easier task than transcription? I guess that's my question. I would proffer no. I mean, just from my experience using these systems. And I mean, I'm not a doctor. Thank God. Although I've thought about it. Not that kind of doctor to the chagrin of my parents. But then I guess the question is...

In my experience, I've used these tools for interviews specifically, qualitative interviews with data workers, and have spent time with these tools and have just had such an awful time with them. Especially with regards to, you know, this isn't medical terminology, but it's

terminology around doing data work, talking about training AI systems. And it does such a terrible job. And at one point, I threw it all out and I said, okay, I'm just sending this to somebody to actually transcribe, because this is not helpful for me, and it's taking me more time starting with the machine transcript than doing it from scratch. And I've transcribed,

you know, I'm not primarily a qualitative interviewer, but I've spent time transcribing dozens of interviews in my research career, and have found it just very difficult. So I mean, I guess the question is, is that verification taking the time that could just be used for the doctoring and working with patients?

And I mean, holding everything about the insurance industry stable, is that notion of thinking about

how the patient presents, how the patient is describing things... that is often the work of doing it. And the medical training I do have is that I was at one point a licensed EMT, and writing up PCRs, patient care reports... no one wants to write up the PCRs. At the same time, you're spending time

taking note of how a patient is presenting. The patient is, you know, arrhythmic, just bringing it back to the Alexes. The patient is cyanotic around their lips. These are things that a healthcare professional would be paying attention to and making notes on, maybe because they're writing it up later. So I'm thinking about this process of writing and what it does to our own practice of viewing and aiding and administering medical care.

Okay. I mean, we'll agree to disagree on this front. But again, I think we are all on the same page that insurance companies requiring additional writing, just because they hope you never get to the claim if you don't file it, is probably bad. And we don't think that there should be

AI doctors, at least not yet. That's what I say. I think you guys probably say never. So, all right. I want to end on this, or maybe we can do two more topics. I guess, here's my question for you. A lot of the discussion of AI's usefulness in jobs in the book discusses these tools being imposed top-down.

But what if they come bottom up? Like, what if a worker can find use for them and actually make their job easier by getting good at using something like a ChatGPT or a Claude? Or if, you know, again, we kind of talked through the medical use case, if a doctor does find that this is useful for them, are you opposed to that?

So, yes. And I think that actually Cadbury, of all people, put it best. There's this hilarious commercial that was for the Indian market, sort of showing how the supposed efficiencies that you're getting out of this just ramp up the speed of things and don't leave you time to really get into the work that you're doing and be there. I think that the most

credible use cases I've heard for these things are, first of all, as coding assistants. So that's sort of a machine translation problem between natural language and some programming language. And there I really worry about technical debt,

where you have output code that was not written by a person, that's not well documented, and that becomes someone else's problem to debug down the line. But also in writing emails. People hate writing emails and people hate reading emails. So you get these scenarios where somebody writes bullet points, uses ChatGPT to turn it into an email, and the person on the other end might use ChatGPT to summarize it. And it's like, okay, so what are we doing here? And again, taking a step back and saying, okay,

What are the systems that are requiring all this writing that everyone finds a nuisance to write and to read? Can we rethink those systems? And also, I just have to say that whenever I'm on the receiving end of synthetic text, I am hugely offended. And one of the things that we put in the book is- Yeah, I definitely got one of those emails yesterday. And I was like, you used ChatGPT for this. I know you did. Yeah. If you couldn't be bothered to write it, why should I bother to read it? Right. Yeah. That's a good point. I mean, it's very interesting putting this and thinking about cases in which

workers are using this kind of organically. First off, I've heard very little of that personally, especially for professionals. I mean, I think there are plenty of workers that are finding a lot of uses, but yeah,

I would say the analog that I find to be where it's not top-down is in education. And to that degree, I think that's kind of a failure in thinking about what education is, right? I mean, in that case, it's... Well, for students to be using this to get through their classes. Yeah, right, exactly. Are you talking about teachers putting stuff together? Well, both, but I'm thinking about the students, right? And I'm just thinking about areas in which... But I'm using that as sort of an analog and then thinking about

what are the conditions that are forcing students to use this, right? If there are cases in which this seems to be sort of useful, okay,

what does that say about the job? What does it say about how the work is oriented, right? In that case, maybe there need to be different efficiencies, or thinking about how the job is operating, right? I then worry that these things become mandated in work environments, and you're saying, well, people are using this, and so everybody's using this. And

then where does that leave the people who are resisters, or who are thinking, well, I know this can't do a good job, so where does that put me? And I think we've already seen such a justification for this being a place where employers have been

reducing positions by the score, because there's a notion that these tools can do these jobs suitably and to a certain degree of proficiency, which is just not the case. That has me worried, down the line, about these areas that Emily's mentioned: the technical debt area, the how-do-we-know area. There's an overestimation of the capabilities of these tools in that case.

Okay, I know we're at time or close to time. Can I ask you one question about doomers before we get out of here? Sure, let's end by talking about doomers. Okay, so I definitely saw that there was a chapter about doomers here, and I was excited to read it because my position has been largely that those who are worried that large language models are going to turn us into paperclips are either marketing what they're selling or just very into, I don't know,

they like the smell of their own body odor. Because, I mean, I guess it's not a terrible thing to be worried about, but there's so much more, and it seems so unlikely that this is going to hurt us. So I definitely wanted to get your take on why you're down on doomerism. And let me just give my one caveat here. There's a line in your book that says

that AI safety is just doomerism and it's only about these long-term problems. But I've definitely heard some of the AI safety folks, like Dan Hendrycks from the Center for AI Safety, talking about really important near-term issues, like whether this technology could help virologists with bad intent. So I wouldn't malign the entire AI safety field. But the doomerist stuff...

I hear your point. All right. So attack that and then we'll get out of here. So I just want to put in a shout out for a new book by Adam Becker called More Everything Forever, which really goes deep into the connections between this sort of doomerist thought and the

more palatable-looking sides of what's called effective altruism. And also in that context, there's a wonderful paper by Timnit Gebru and Émile Torres on what they call the TESCREAL bundle of ideologies. And I think that if your concern about the systems is not rooted in

real people in real communities and things that are actually happening... like even this, oh, but bad actors could use it to more quickly design, you know, viruses and stuff like that. That's still speculative, right? So anytime we are taking the focus away, it's like, has that happened? Right. This is still people writing science fiction fan fiction for themselves.

And, you know, it's based on these jumped-up ideas about what the technology can do, and it takes the focus away from the actual harms that are happening now, including the environmental stuff we started with. Right. But yeah, I will say, a virus, right? You want to get ahead of that, right? Like we had with social media. There were some issues with social media, but there was not a focus on some of the potential long-term issues, and that only came up later on, at least in the beginning.

You don't agree. Say why. There are problems with social media for sure. Yeah. And some of those problems were documented and explained early on, and people were not paying attention. But they were real problems that were being documented as they were happening, as opposed to imaginaries about, well, someone's going to use this and Dr. Evil up a bad virus.

Yeah. Go ahead, Alex. For the sake of time, I think that's fine. I don't have much to add. All right. Well, look, the book is called The AI Con, How to Fight Big Tech's Hype and Create the Future We Want. The authors are Emily M. Bender and Alex Hanna. Emily and Alex, I've been reading your work for a long time, and it's great to have a chance to speak with you, like I said at the top.

You know, for those who are listening or watching, you may not agree with everything, either everything I said or everything our guests said, but hey, at least now you know these arguments, you know the arguments for and against, and we trust you to make up your own opinion and do further research. And we've definitely had plenty of good stuff to keep digging into shouted out over the course of this conversation. So Emily and Alex, great to see you. Thank you so much for joining the show. Thank you for this conversation, and enjoy Paris. Thanks, Alex. Have a great time.

Thank you both. Thank you, everybody, for listening. We'll see you on Friday for our news recap. Until then, we'll see you next time on Big Technology Podcast.