Welcome to Lingthusiasm, a podcast that's enthusiastic about linguistics. I'm Lauren Gawne, and today we're getting enthusiastic about computers and linguistics with Professor Emily Bender. But first, November is our traditional anniversary month, and this year we're celebrating eight years of Lingthusiasm. Thank you for sharing your enthusiasm for linguistics with us. We're also running a Lingthusiasm listener survey for the third and final time.
As part of our anniversary celebrations, we're running this survey as a way to learn more about our listeners, get your suggestions for topics, and run some linguistics experiments.
If you did the survey in a previous year, there are new questions, so you can totally participate again this year. There's also a spot for asking us your linguistics advice questions, since our first linguistics advice bonus episode was so popular. You can hear about the results of the previous surveys in two bonus episodes, which we'll link to in the show notes, and we'll have the results from this year's survey in an episode for you next year.
To do the survey or read more details, go to bit.ly slash LingthusiasmSurvey24. That's bit.ly slash LingthusiasmSurvey24, with the numbers 2 and 4, before December 15, anywhere on Earth. This project has Ethics Board approval from La Trobe University, and we're already incorporating results from previous surveys into some academic papers, so you too could be part of science if you do the survey. Our most recent bonus episode was a linguistics travelogue.
We discussed Gretchen's recent trip to Europe where she saw cool language museums and what she did to prepare for encountering several different languages on the way, as well as planning our fantasy linguistic excursion to Martha's Vineyard.
Go to patreon.com slash Lingthusiasm to hear this and many more bonus episodes and to help keep the show running ad-free. Also, very exciting news from Patreon, which is that they're finally adding the ability to buy Patreon memberships as a gift for someone else. So if you'd be excited to receive a Patreon membership to Lingthusiasm as a gift, we'll have a link in the show notes for you to forward to your friends and/or family with a little wink wink nudge nudge. We also have lots of Lingthusiasm merch that makes a great gift for the linguistics enthusiast in your life.
Today, I am delighted to be joined by Emily M. Bender, who is a professor at the University of Washington in the Department of Linguistics. She is the director of the Computational Linguistics Laboratory there. Emily's research and teaching expertise is in multilingual grammar engineering and societal impacts of language technologies. She runs the live streaming podcast, Mystery AI Hype Theater 3000, with sociologist Dr. Alex Hanna.
Welcome to the show, Emily. I am so enthusiastic to be on Lingthusiasm. We are so delighted to have you here today. Before we ask you about some of your current work with computational linguistics, how did you get into linguistics?
Yeah, so it was a while ago. And back when I was in high school, we didn't have things like the Lingthusiasm podcast, or podcasts for that matter, to spread the word about what linguistics was. So I actually hadn't heard about linguistics until I got to university. And someone gave me the excellent advice to get the course catalog ahead of time. And it was a physical book in those days. And just flip through it and circle anything that looked interesting. And there was this one class called An Introduction to Language.
And in my second term, I was looking for a class that would fulfill some kind of requirements and it did. And I took it. And let me tell you, I was hooked on the first day, even though the first day was actually about like the bee dance and other animal communication. I just fell in love with it immediately.
And I think, honestly, I had always been a linguist. I love studying languages. My ideal undergraduate course of study would have been like take the first year of all the languages I could. Oh, that would be an amazing degree. Just like I have a bachelor's in introductory language. Yeah. I mean, speaking now as a university educator, I think there's some things missing from that. But as a linguist, like how much fun would that be? And I didn't know there was a way to like
study how languages work without studying all the languages. And when I found it, I was just thrilled. Excellent. I think that's such a typical experience of a lot of people who get to university and they're intrigued by something that's like
how can it be an intro to language when I've learned a bunch of languages? And then you discover there's linguistics, which brings you into the whole systematic nature of things. Yeah, absolutely. My other favorite story to tell about this is I have a memory of being 11 or 12 and sort of daydreaming and trying to figure out what the difference was between a consonant and a vowel. Amazing. Because, yeah, like we were taught the alphabet, there's five vowels and sometimes Y and the other ones are consonants. So what's the difference?
My regret with this story is that I didn't record what it was that I came up with, and I have no idea if I was anywhere near the right track. But I don't think that your average non-linguist does things like that. It's extremely proto-linguist behavior. I love it. And yeah, I'm sad we don't have 11-year-old Emilys figuring out the IPA from first principles. Exactly. Emily, who definitely went on to be a syntax-semantics side linguist and not a phonetics-phonology side linguist. But...
So how did you become a syntax-semantics linguist, and how did you get into your research topic of interest? Yeah. So in undergrad, it was definitely the syntax class that I connected with the most. I got to study construction grammar with Chuck Fillmore and Paul Kay at UC Berkeley, which was amazing. And sort of was aware at the time that at Stanford, there was work going on in two other frameworks called lexical functional grammar and head-driven phrase structure grammar. And these are like
different ways of building up representations of language. And I went to grad school at Stanford with the idea that I was going to create generalized Bay Area grammar. Oh, great. And bring together everything that was best about each of the frameworks because they are sort of similar in spirit. They're sometimes described as cousins.
And then I got to Stanford and I took a class with Joan Bresnan on lexical functional grammar and a class with Ivan Sag on head-driven phrase structure grammar. And I realized that it's actually really valuable to have different toolkits, because they help you sort of focus on different aspects of the grammars of languages. And so merging them all together really wasn't going to be a valuable thing to do. It's good that you could see what each of them was bringing to you: that languages have syntax in their structure, but different ways of explaining it give different perspectives on things. Exactly. And lead linguists to want to go explore different things about different languages. So if you're working with lexical functional grammar, then languages that do
radical things with their word order, like some of the languages of Australia are particularly interesting and languages that put a lot of information into the morphology. So the parts of the words are really interesting. And if you're doing head-driven phrase structure grammar, then it's things like getting deep into the idiosyncrasies of particular languages, the idioms and the sort of the sub patterns and making them work together with the major patterns is a big focus of HPSG. And so you're just going to work on different problems using the different frameworks. Yeah.
I love that. Incredibly annoying undergraduate proto-linguist behavior. I still remember in my syntax class, because you learn to draw syntax trees, and one of my fellow students and I were like, trees are fine, but we need to keep extending them down because they only go as far as words. And there's all this stuff happening in the morphology. And we thought we were very
clever for having this very clever thought. We were very lucky that our syntax professor was Rachel Nordlinger, who is another person who works with lexical functional grammar, which, as you said, is really interested in morphology. You could tell she was just like, you guys are going to be so happy when we get to advanced syntax, but just hold on. We're just doing trees for now. That's how I got introduced to different
forms of syntax helping answer different questions. Because like, oh, this is one that accounts for all the things that are happening inside words as well. It's really cool. Yeah. Yeah. So one of the things about both LFG and HPSG is that they're associated with these long-term computational projects where people aren't just working out the grammars of languages with pen and paper, but actually codifying them in rules that both people and computers can deal with.
And I got involved with the HPSG project like that as a graduate student at Stanford. And then later on, well, my first job, actually, no, that's not true.
My first job out of grad school was teaching for a year at UC Berkeley. But then I had a year after that where I was working in industry at a startup called YY Technologies that was using a large-scale grammar of English to create automated customer service responses. OK. So you've got an email coming in, and the idea is that we parse the email, get some representation of what's being asked, look up in a database what an appropriate answer would be, and then send that answer back. And the goal was to do it on the easy cases so that the harder cases that the automated system couldn't handle would get passed through to a representative. So the startup was doing that for English and they wanted to expand to Japanese.
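The pipeline described here can be sketched in a few lines. Everything concrete in it, the keyword matcher, the answer table, and the escalation message, is invented for illustration; the actual system used a full grammar of English, not keyword matching.

```python
# Toy sketch of the parse -> represent -> look up -> answer pipeline.
# The "parse" step is a stand-in: it reduces the email to a known key,
# where a real grammar would produce a full semantic representation.

ANSWERS = {
    ("reset", "password"): "You can reset your password at example.com/reset.",
    ("track", "order"): "Track your order from the My Orders page.",
}

def parse(email: str):
    """Toy parse: reduce the email to a bag of words, match a known key."""
    words = {w.strip(".,!?").lower() for w in email.split()}
    for key in ANSWERS:
        if set(key) <= words:
            return key
    return None  # couldn't build a representation

def respond(email: str) -> str:
    key = parse(email)
    if key is None:
        # The hard cases fall through to a human representative.
        return "ESCALATE: passing this message to a human representative."
    return ANSWERS[key]

print(respond("How do I reset my password?"))
print(respond("My parcel arrived damaged and I am quite upset."))
```

The important part is the fallback: anything the system cannot map to a representation gets handed to a person, so the automation only answers the easy cases.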
And I had been working on the English grammar actually as a graduate student at Stanford because it's an open source grammar. And I speak Japanese. And so I got to do this job where it was literally my job to build a grammar of Japanese on a computer. It was so cool. So that was a fantastic job. And in the course of that year...
There was a project starting up in Europe that was interested in building more of these grammars for more languages. And so I picked up the task of saying, how can we abstract out of this big grammar for English, which at that point was about seven years old, still under development. So it is quite a bit older now, quite a bit bigger. Amazing. How can we take what we've learned about doing this for English and make it available for people to build grammars more quickly of other languages?
And so I took that English grammar and held it up next to the Japanese grammar I was working on and basically just stripped out everything that the Japanese made look English-specific and said, okay, here's a starter kit. Here's a place you can use to build a new grammar, and this is the start of the Grammar Matrix. And so that's the beginning of that project. And I have since been developing it, together with students, for 23 years now, and we can talk more about what developing it means. So it's a
Really long-standing project. Amazing. That is, in terms of linguistics research projects and especially computational linguistics projects, a really long time. And it speaks to the fact that computers don't process language the same way we do. Like, a human by the age of 23 is fully functional at a language by themselves and can be sharing that language with other people. But for a computer, you're finding...
more and more, I assume at this point, really specific rules or factors or edge cases. For the English grammar that I was describing, yes, it's basically that.
The Grammar Matrix grows when people add facilities to it for handling new things that happen across languages. For example, in some languages, you have a situation where instead of having just one verb to say something like bathe, it requires two words together. You might have a verb like take that doesn't mean very much on its own, and then the noun bath, and take a bath means the same thing as bathe. Hmm.
So this phenomenon, which is called light verb constructions, shows up in many different languages around the world in slightly different ways. And when the master's student currently working on this is done with her thesis, you will be able to go to the Grammar Matrix website and enter in a description of light verb constructions in a language and have a grammar come out that can handle them. So excellent. And not something, if we were only working in English, that we would think about, but light verbs show up across
different language families and across the grammars of languages that you want to build.
computational resources for. So it makes sense to add this kind of functionality. Yeah, exactly. And light verbs do happen in English, but they happen in different ways and sort of more intensively in other languages. You can kind of ignore them in English and get pretty far. But in a language like Bardi, for example, in Australia, you aren't going to be able to do very much if you don't handle the light verbs. And now, hopefully at the end of this MA, we'll be able to. Yes, exactly.
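The special handling light verbs need can be illustrated with a toy mapping from light-verb-plus-noun pairs to single predicates. A real grammar, such as one produced by the Grammar Matrix, encodes this with typed syntactic rules rather than a lookup table, and these particular pairs are just invented English examples.

```python
# Toy illustration of light verb constructions: "take a bath" should
# map to the same predicate as "bathe", even though "take" on its own
# doesn't mean much. A real grammar does this compositionally.

LIGHT_VERB_CONSTRUCTIONS = {
    ("take", "bath"): "bathe",
    ("take", "walk"): "walk",
    ("give", "hug"): "hug",
}

def predicate(tokens: list[str]) -> str:
    """Map a (light verb, noun) pair, or a plain verb, to one predicate."""
    for (verb, noun), pred in LIGHT_VERB_CONSTRUCTIONS.items():
        if verb in tokens and noun in tokens:
            return pred
    return tokens[0]  # otherwise assume the phrase starts with a plain verb

print(predicate(["take", "a", "bath"]))  # same predicate as...
print(predicate(["bathe"]))              # ...the single-verb version
```

The point of the sketch is that two very different surface forms have to come out meaning the same thing, which is exactly why the construction needs its own machinery in the grammar.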
And so why is it useful to have resources and grammars that can be used for computers for languages like Bardi? Or, I mean, even large languages like Japanese? Yes, exactly. So why would you want to build a grammar like this? Sometimes it's because you want to build a practical application where you can say, okay, I'm going to take in this Japanese string and I'm going to check it for grammatical errors, or I'm going to come up
with a very precise representation of what it means that I can then use to do better question answering or things like that. But sometimes what you're really interested in is just what's going on in that language.
And the cool thing about building grammars in a computer is that your analysis of light verb constructions has to work together with your analysis of coordination and your analysis of negation and your analysis of adverbs because they aren't separate things. They're all part of one grammar. And so if we can make the computers understand it, it's a good way of validating that we have understood it and that we've described the phenomenon sufficiently.
Yes, exactly. And on top of that, if you have a collection of texts in the language and you've got your grammar that you've built and you want to find what you haven't yet understood about the language, you try running that text through your grammar and find all the places where the grammar can't process the sentence. And that's indicative of something new to look into. And so it's thanks to this kind of computational linguistics that all those blue squiggles turn up in my word processor and I don't make major syntactic
mess-ups while I'm writing. So that's actually an interesting case. Historically, yes, the blue squiggles came from grammar engineering. I believe they are now done with large language models, and we can talk about that some if you want. Okay, sure. But it was that kind of grammar engineering that led to those initial developments in spell checkers and those kinds of things. Yes, exactly. Amazing. Attempting to get computers to understand human language
has been something that has been part of the interest of computational scientists since the early days of 20th century computing. I feel like a question that keeps popping up when you read the history of this is like, "And then someone figured something out and they figured we'd solve language in five years."
Why haven't we solved getting computers to understand language yet? I think part of it is that getting computers to understand language is a very imprecise goal. And it is one where if you really want the computer to do
behave the same way that a person would behave if they heard something and understood it, then you need way more than linguistics. You need something, and I really hate the term artificial intelligence, but you basically need to solve all of the problems that building artificial intelligence, if that were a worthy goal, would require solving. You can
ask much narrower questions and build useful language technology. So grammar checkers, spell checkers, that is computers processing natural languages to good effect. Machine translation. It's not the case that the computer has understood and then is giving you a rendition in the output language. Yeah.
Machine translation is just, well, we're going to take this string of characters and turn it into that string of characters because according to all of the data that was used to develop the system, those patterns relate to each other. And I think it's also easier to understand from a linguistic perspective that when people say solve language, they have this idea of language as a single unified thing. But like so far, we've only been talking about written language
things and the issues are around syntax and meaning, but dealing with understanding or processing written language versus processing voice going in versus creating voice, they're all different skills. They require different linguistic and computational skills to do well. So language involves solving
Actually, hundreds and thousands of tiny different problems. Many, many different problems. And there are problems that, as you say, involve different skills. So are you dealing with sound files? Are you dealing with, if you actually wanted to process something more like what a person is doing, do you have video going on? Are you capturing the gesture and figuring out what shades of meaning the gesture is adding? Nodding vigorously here. I know I don't need to tell you that. Yeah.
But also pragmatics, right? We can get to a pretty clear representation for English, at least, of the who did what to whom in a sentence, the sort of bare bones meaning in the form of semantics. But if we want to get to, okay, but what did the person mean by saying that? How does that fit in with what we've been discussing so far and the background? Getting to the
best understanding possible of what the person is trying to do with those words, that's a whole other set of problems that's called pragmatics that is well beyond anything that's going on right now. There's like tiny little forays into computational pragmatics. But if you really want to understand language, a language, right? Most of this work happens in English and we have
A pretty good idea about how languages vary in their syntax. Yeah. Variation at the level of semantics, less well studied. Variation in pragmatics, even less so. So if we were going to solve language, we need to say which language. Which raises a very important point. As you said, most of this work happens in
English. In terms of computational linguistics, there's been the sense that people are very pleased that we've now got maybe a few hundred languages that we have pretty good models for, but there's still thousands of languages that we don't have any good computational models for.
What is required to make that happen? If you had a very large budget and a great many computational linguists to train at your disposal, what's the first thing you would need to start doing? So the very first thing that I would start doing, I think, is engaging with communities and seeing which communities actually want computational work done on their languages. And then
My ideal use of those resources would be to find the communities that want to do that, find the people in those communities who want to be computational linguists and train them up rather than what's usually a much more extractive, we're going to grab your data and build something kind of a thing.
Right. Yeah. And then it becomes a question of, okay, well, what do you want computers to be able to do with your language? Question to the community. Do you want to be able to translate in and out of maybe English or French or some other world or colonial language? Do you want a spell checker? Do you want a grammar checker? Do you want a dialogue partner for people who are learning the language? Do you want a dictionary that makes it easier to look up words? Your language is the kind of language that has...
a whole bunch of prefixes. So just alphabetical ordering of the words isn't going to be very helpful. So, you know, what's needed? And then it depends. Do you want automatic transcription? Do you want text to speech? Right. And then depending on what the community is trying to build, you have different data requirements. So if you want to build a dictionary like that, that's a question of sitting down and writing the rules of morphology for the language and collecting a big lexicon.
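The dictionary problem for a heavily prefixing language can be made concrete with a sketch of a lookup that peels known prefixes off a word to find its stem, so a user doesn't need to already know the stem to look a word up. The prefixes, stems, and glosses here are entirely invented, not from Bardi or any real language.

```python
# Toy prefix-stripping dictionary lookup. Plain alphabetical ordering
# would file every nga-/mi-/ku- word under the prefix, hiding the stem;
# this sketch peels prefixes until it reaches a stem in the lexicon.

PREFIXES = ["nga", "mi", "ku"]          # invented prefix inventory
LEXICON = {"jalma": "person", "wanyji": "dog"}  # invented stems

def look_up(word: str):
    """Return (segmented analysis, gloss), or None if no analysis works."""
    stripped = []
    while True:
        if word in LEXICON:
            return "-".join(stripped + [word]), LEXICON[word]
        for p in PREFIXES:
            if word.startswith(p):
                stripped.append(p)
                word = word[len(p):]
                break
        else:
            return None  # ran out of prefixes without finding a stem

print(look_up("ngamijalma"))  # peels nga-, then mi-, then finds jalma
```

A real system would be built from a full morphological rule set and a large lexicon, as described here, but the shape of the task is the same: rules plus lexicon, not a sorted word list.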
If you want text-to-speech, you need lots and lots of recordings that have been transcribed in the language. If you want machine translation, you need lots and lots of parallel text between that language and the language you're translating into. And so a lot of that will use the same computational grammar models, but will have slightly different...
takes on what those models are and will need different data to help those models do their job. So in some cases, the same models, in some cases, different. I think if we're talking speech processing, automatic transcription, or speech-to-text, we're definitely in machine learning territory. And so that's one kind of model. Machine translation can be done in a parse-with-a-grammar, map-to-the-semantics form, or it can be done with machine learning. The
spell checker, especially if you're dealing with a language that doesn't have enormous amounts of text to start with. You definitely want to do that in a someone-writes-down-the-rules kind of a fashion. So that's a kind of grammar engineering, but it's distinct from the kind that I do with syntax. Yeah. And so it just starts to unpack how complicated this idea of computers do language is, because they're doing lots of different things and they need lots of different kinds of data.
Yeah. And obviously, we say data as though it's some kind of objective, general pot of things. But when we say data, we mean maybe people's recordings, maybe people's stories, maybe knowledge in language that they don't want people outside of their community to have. And so that creates...
different imperatives around whether these models are going to be a way forward or useful for people. Exactly. And at the moment, we don't have very many great models for collecting data and then handling it respectfully. There are some great models, and then there's a lot of energy behind not doing that. The
sort of best example that I like to point to is the work of Te Hiku Media in Aotearoa New Zealand. And this is an organization that grew out of a radio project for Te Reo Māori, and they were at a community level collecting transcriptions of radio shows in Te Reo Māori, which is the indigenous language of Aotearoa New Zealand. And
And forgive my pronunciation, I'm trying my best. And they have been approached over the years many, many times by big tech saying, give us that data. We'd like to buy that data. And they've said, no, this belongs to the community. And they have developed something called the Kaitiakitanga License, which is a way that works for them of granting access to the data while keeping data sovereignty, basically keeping community control of the data. So there are ways of thinking about this, but it really requires strategizing to
strengthen community against the interests of big tech that takes a very extractivist view of data. It's good that there are some models being developed, and a normalizing of this as one possible way of going forward. And as you've said, you've spent a lot of time working to build the Grammar Matrix for
lots of different languages. This goes against a general trend of focusing on technologies in major languages where there are clear commercial and large audience imperatives. Part of this work has been making visible the fact that
English is very much a default language in the computational linguistics space. Can you give us an introduction to the way that you started going about making the English-centric nature of computational linguistics more visible? I think
that this really came to a head in 2019, when I was getting very fed up with people writing about English as if it were language in general. They would say, you know, here's an algorithm for doing machine reading comprehension, or here's an algorithm for doing spell checking, or whatever it is, and
if it was English, they wouldn't say that. So it seems like, well, that's a general solution. And then anybody working on any other language would have to say, well, here's a system for doing spell checking in Bardi. Or here's a system for doing spell checking, you know, in Swahili or whatever it is. And those papers tended to get read as, well, that's only for Bardi or that's only for Swahili, whereas the English ones, because English was treated as the default, were taken as general. Mm-hmm.
And so I made a pest of myself at a conference in 2019, a conference called NAACL, where I basically just, after every talk where people didn't mention the name of the language, went to the microphone, introduced myself, and said, excuse me, what language was this on? Yeah.
which is a ridiculous question because it's obvious that it's English. And it's sort of face-threatening. It's impolite, because why are you asking this question? But it's also embarrassing as the asker. Like, why would you ask this silly question? But, you know, I was just making a point. And somewhere along the line, people dubbed that the Bender Rule: that you have to name the language that you're working on, especially if it's English.
I really appreciate your persistence, and I appreciate people who codified it into the Bender Rule because now it's actually less threatening for me. I'm just going to invoke the Bender Rule and just check if this was just on English. You've given us a very clear model where we can all very politely make pests of ourselves to remind people that
solving something for English or improving a process for English doesn't automatically translate to that working for other languages as well. Exactly. And I like to think that basically by lending my name to it, I'm allowing people to ask that question while blaming it on me. Great. Thank you very much. I do blame it on you all the time in the nicest possible way. Excellent. This seems to be part of a larger process you've been working on. Obviously, there's people working on computational
processes for English and you're trying to be very much a linguist at them. But it seems like you also are spending a lot of time, especially in terms of ethical use of computational processes, trying to explain linguistics to computer scientists as well. How is that work going? Are computer scientists receptive to what linguistics has to offer? Computer scientists are a large and diverse group.
In terms of their attitudes, they are a sort of, unfortunately, undiverse group in other ways. And it's also, it's an area of research and development that has a lot of money in it right now. So there's always new people coming in. And so it sort of feels like no matter how much teaching of linguistics I do, there are still just as many people who don't know about it as there ever were, because new people are coming in. But that said, I think it's going well. I have written two books that I
call, sort of informally, the 100 things books, because they started off as tutorials at these computational linguistics conferences with the title 100 Things You Always Wanted to Know About Linguistics But Were Afraid to Ask, and then the subtitle, For Fear of Being Told 1,000 More. Yeah.
I mean, it's not a mischaracterization of linguists. That's for sure. Yeah. We're going to keep linguisting at you, right? So in both cases, the first one is about morphology and syntax. And I basically just wrote down literally 100 things that I wish that people working in natural language processing in general knew about how language works, because they tend to see language as just like
strings of words without structure. Right. And worse than that, they tend to see language as directly being the information they're interested in.
I used to have really confusing conversations with colleagues in computer science here, people who were interested in gathering information from large collections of text like the web. This is a process called information extraction. And when I finally realized that we were focusing on different things, so I was interested in the language and they were interested in the information that was expressed in the language, the conversation started making sense. And I came up with a metaphor to help myself, which is if you live somewhere rainy, you
can picture it: you've got a rain-splattered window, and you can focus on the raindrops, or you can focus on the scene through the window, distorted by the raindrops, right? So language and its structures are the raindrops, which have an effect on what it is that you can see through the window. But it is very easy to look right through them and imagine you're just seeing the information or the world outside. And so when I realized that as a computational linguist, I'm interested in the raindrops, but some of these people working in
computer language processing are just staring straight through them at the stuff outside, it helped me communicate a lot better. I feel like I've had a lot of conversations with computational scientists where they're like, oh, you know, we did a big semantic analysis, there's a process you can apply where you have a whole bunch of processes and algorithms that run, and it says 80% of the people in this chat or this forum, I think they're used to pulling things from Reddit and you could do that easily, it's like 80% of people in this thread hate chocolate ice cream. And I'd always be like, okay, but did you account for the person who's like, oh my God, I hate
how delicious this ice cream is. And they're just like, oh, no, because we just used, like, hate was negative and delicious was positive. So this person probably came out in the wash. And I'm like, no, this is a person who extremely likes this ice cream. And it's also a very, like,
idiomatic, informal kind of English. I certainly wouldn't write that in a professional reference for someone. I hate how amazing this person is. You should hire them. As a linguist, I'm really interested in these nuanced
novel edge cases. As a computational scientist, they're like, "Oh, we just hope we get enough data that they disappear in the noise." Yeah, exactly. The words are the data. The words are the meaning. There's no separation there. There's no structure to the raindrops. "If I have the words, I have the meaning," seems to be the attitude. Yeah. Well, it's great that you're doing the work of slowly letting them down from that assumption.
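The ice cream example can be made concrete with a sketch of the naive word-counting approach being described here. The lexicon and the scores are invented for illustration; real sentiment systems are more elaborate, but the failure mode is the same.

```python
# Naive lexicon-based sentiment: sum per-word scores and ignore all
# structure (the raindrops). Works on simple sentences, fails on
# "I hate how delicious this ice cream is".

LEXICON = {"hate": -1, "awful": -1, "delicious": +1, "love": +1, "amazing": +1}

def naive_sentiment(text: str) -> int:
    """Add up word scores, with no notion of syntax or scope."""
    return sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in text.split())

print(naive_sentiment("This ice cream is awful"))                 # negative: fine
print(naive_sentiment("I hate how delicious this ice cream is"))  # zero: wrong,
# "hate" and "delicious" cancel out, though the speaker loves the ice cream
```

The second sentence is strongly positive to a human reader, but because the model only has the words and not the structure, the enthusiastic complaint washes out to neutral.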
Yeah, we're trying. Oh, one other thing about these books. The first one is Morphology and Syntax, and the second one is Semantics and Pragmatics, which is co-authored with Alex Lascarides. And in both of them, I have a concept index and an index of languages. So every time we have an example sentence, it shows up as an entry in the index of languages. And there's an index entry for English.
even though it indexes almost every single page in the book, it's in there because English is a language. There's this thing called the Bender Rule. I don't know if you've heard of it, but I'm really glad that you're following its principles. So a lot of the work you've been doing is with a type of computational linguistics where you are building rules to process language and create useful computational outputs. But there are other models for how people can
use language computationally. Yes. So I tend to do symbolic or rule-based computational linguistics. I'm really interested in what are the rules of grammar for this language or for this phenomenon across languages? How can I encode them so that I can get the machine to test them, but also I can still read them? But a lot of work in computational linguistics instead uses statistical models. So building models that can represent patterns across large bodies of text. Oh, so that's like...
predictive text on my mobile phone where it's so used to reading all of the data that it
it has from other people's text messages and my text messages that sometimes it can just predict its way through a whole message for me. Yes, exactly. And in fact, I don't know if this is so true anymore, but for a while you could see that the models were different on different phones. Remember we used to play that game where you typed in "Sorry I'm late, I" and then just picked the middle option over and over again, and people would get different funny answers? Yes. And you'd get
wildly different answers. Yeah. And so that reflects local statistics being gathered based on how you've been using that phone versus a model that it may have started with that was based on something more generic.
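That game can be sketched with a toy bigram model: count which word most often follows each word in a phone's message history, then greedily pick the most frequent continuation. The message histories here are invented, and real keyboards are far more sophisticated, but the "local statistics diverge per phone" effect is the same:

```python
from collections import Counter, defaultdict

def train_bigrams(history):
    """Count, for each word, which words follow it and how often."""
    model = defaultdict(Counter)
    for message in history:
        words = message.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def predict(model, word, steps=3):
    """Greedily pick the most frequent continuation, like tapping
    the middle suggestion over and over."""
    out = [word]
    for _ in range(steps):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# Two phones with different message histories (invented data)
# give different continuations for the same starting word.
phone_a = train_bigrams(["sorry i'm late i missed the bus",
                         "i missed the bus again"])
phone_b = train_bigrams(["sorry i'm late i got held up",
                         "i got held up at work"])
print(predict(phone_a, "i"))  # i missed the bus
print(predict(phone_b, "i"))  # i got held up
```

Each phone's model is just these frequency tables, updated as you type, layered on top of whatever generic model it shipped with.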
So that is, yes, an example of statistical patterns. And you also see these, and this is fun, in automatic transcriptions, like the closed captioning in TV shows. If you're thinking about live news or something where it wasn't done ahead of time, and they get to the name of a person or a place which clearly wasn't already represented in the model's training data, ridiculous funny things come out, because the system has to fall back on statistical patterns about
what that word might have been, and it reveals interesting things about the training data. We used to always put the show through a first pass on YouTube, where Lingthusiasm is also hosted, before Sarah Dopierala came in and transformed our lives by being an amazing transcriptionist. For years, YouTube would transcribe Lingthusiasm, a word it had never encountered before, in its defense, as a computer. It would come up with "Link Suzy I am" most often, and we still occasionally refer to Link Suzy I am. It was interesting when it finally clearly had enough episodes of Lingthusiasm with our manually updated transcripts that it got the hang of it. But that was definitely a case where it needed to learn, and we definitely have a much higher success rate of perfect first-time transcripts with Sarah.
And that pattern that you saw happening with YouTube, that change shows you that Google was absolutely taking your data and using it to train their models. So in the podcast that I run, Mystery AI Hype Theater 3000, we have some phrases that are uncommon and we do use a first pass auto transcriber. And for example, we refer to the so-called AI models as mathy maths. Mathy maths. And that'll come out as like Matthew Math. Oh, my good friend, Matthew Math. Yes. Yes.
And the phrase stochastic parrots sometimes comes out as, like, sarcastic parrots or things like that. And you and Alex both have, I would say, relatively standard North American English accents. Yes. Which is really important for these models, because so far we've just been talking about data, where it's found, and how linguists are working with it and processing it before the computer gets to it. But with a lot of these new statistical models, it's just taking what you give it. That means as an Australian English speaker,
I'm relatively okay, but it's not as good for me as it is for a Brit or an American. And then if you're a Singaporean English or an Indian English speaker, even as a native English speaker, the models aren't trained with you in mind as the default user, and it just gets more and more challenging. Yeah, exactly. And some of that is a question of what could the companies training these models easily get their hands on. But some of it is also a question of who were they designing for in the first instance? And
Whose data do they think of as sort of normal data that they wanted to collect? And so these are deliberate choices that are being made. Absolutely. So with these statistical models, how do they differ from...
the grammars that you've created? In a rule-based grammar system, somebody is sitting down and actually writing all the rules. And then when you try a sentence and it doesn't work as expected, you can trace through what rule was used and shouldn't have been used, or what rule you expected to show up in that analysis that wasn't there, and you can debug like that. The statistical models instead...
You build the model that's kind of the receptacle for the statistics and you gather a whole bunch of data. And then you use this receptacle model to process the data sort of item by item and have it output,
according to its current statistics, likely answers, and then compare them to what's actually there, and then update the statistics every time it's wrong. And so you do that over and over and over again, and it becomes more and more effective at closely modeling the patterns in the data. But you can't sort of open it up and say, okay, this part is why it gives that output, and I want to change that. It's much more
amorphous, in a sense. Much more of a black box is the terminology that gets used a lot. In 2020, we were really lucky to have Janelle Shane join us on the show and walk us through one of these generative statistical models from that era. She generated some Lingthusiasm transcripts based off the first 40 or so episodes of transcripts that we had. When it generated transcripts, the model had this real fixation on soup.
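The predict-compare-update loop described above can be caricatured in a few lines. This is a count-table toy with invented training pairs, not how real neural language models learn (they adjust millions of continuous weights by gradient descent), but it shows the shape of the loop: output the currently most-likely answer, compare it to what's actually there, and update the statistics whenever it's wrong:

```python
from collections import defaultdict

def train_by_correction(pairs, passes=2):
    """For each (prompt, actual_next) pair: guess the currently
    most-likely continuation, compare to the data, and bump the
    statistics whenever the guess is wrong."""
    weights = defaultdict(lambda: defaultdict(int))
    for _ in range(passes):
        for prompt, actual in pairs:
            table = weights[prompt]
            guess = max(table, key=table.get) if table else None
            if guess != actual:
                table[actual] += 1  # nudge the stats toward the data
    return weights

# Invented training data for illustration.
data = [("today we're talking about", "linguistics"),
        ("today we're talking about", "linguistics"),
        ("today we're talking about", "soup")]
w = train_by_correction(data)
# The learned table is just numbers; nothing in it records *why*
# each continuation carries the weight it does.
print(dict(w["today we're talking about"]))
```

Even in this tiny toy, the final weights don't track the raw frequencies in an obvious way, and there is no rule you can point to and edit; scaled up to billions of weights, that is the black box.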
So, it got the intro to Lingthusiasm right, because we say that 40 times across 40 episodes. But it'd be like, "And today we're talking about soup." And we're like, "Janelle, what's with the soup?" And she's like, "I can't tell you. It's a black box in there." They're literally referred to as hidden layers in the processing. And so, because we don't know why it was fixated on soup, there's some great fake Lingthusiasm transcripts that we read
that are very soup-focused. Also very focused on a couple of major pieces of fan fiction literature, which again is kind of classic fan-fiction-favorite IP, because it read a bunch of fan fiction as well. Mm-hmm.
And so you can make some guesses about why it's talking about wizards a whole bunch, but you can't make many guesses about why it's talking about soup a whole bunch. And that makes it hard to kind of debug that issue. Hard to debug, yeah. But also if you don't know the original training data. So it sounds like she took a model that had been trained on some collection of data. Yes. So that it could be coherent with only those 40 transcripts.
Exactly, yeah. But if you don't know what's in that training data, then you're even more poorly placed to figure out why soup. Yeah. And since we did that episode, I think the big thing that's changed is...
that the models have been given enough extra data that they're no longer fixated on soup, but they've also just become easier for everyday people to use. Like, part of why we were really grateful for her to come on the show is that she walked us through
the fact that she was still using a scripting language to ingest those transcripts and to generate the new fabricated text. It all looked very straightforward if you're a computer person, but you need to be a person who's
comfortable with scripting languages. And that's no longer the case with these new chat-based interfaces, and that's really changed the extent to which people interact with these models. Yes, yes, exactly. So there's a few things that have changed. One is there's been some engineering that allowed companies to make models that could actually take advantage of very large datasets. There has been the collection
of very large data sets in a not very consent-based fashion. And then there has been the establishment of these chat interfaces, as you say, where you can just go and poke at it and get something back. And honestly, the biggest thing that happened recently, the reason that all of a sudden everybody's talking about ChatGPT and so-called AI, was that OpenAI set up this interface where anybody could go poke at it. And then they had a million people sharing their favorite examples. And it was this marketing win for OpenAI and a big loss for the rest of us. I think the sharing of examples is really important as well, because
People don't talk very often about the human curation that goes into picking funny or coherent or relevant examples. We had to junk so many of those fake transcripts to find the handful that were funny enough to pretend read and give a rendition of. When people are sharing their favorite things that come out of these machines,
that's a level of human interaction with them that I think is often missing. But making it very easy for people to generate a whole bunch of content and then pick their favorite and share it has really normalized the use of these platforms and these large-language-model ways of playing with language. Yeah, exactly. And if you are someone who's not playing with it, or even if you are, most of the output you're going to see is other people sharing their favorites. So you get a very distorted view of what it's doing. And in terms of what it is doing, we talked before about how when a computer is doing translation between two languages, it's not that it's understanding; it's replacing one string of text with another string of text. With these generative models that are creating text that on an initial read reads like English, what are some of the limitations of these models? Yeah. So just like with machine translation, it's not understanding. The chat interface encourages you to think that you are asking the chatbot a question and it is answering you. This isn't what's happening. You are inputting a string,
and then the model is programmed to come up with a likely continuation of that string. But a lot of its training data is dialogues, and so something that takes the form of a question provokes, as a likely continuation, an answer.
But it hasn't understood. It doesn't have a database that it's consulting. It doesn't have access to factual information. It's just coming out with a likely next string given what you put in. And any time it seems to make sense, it's because the person using it is the one making sense of it. And because it's had so much input, because it basically took large chunks of the English-speaking internet, there's a statistical likelihood it is going to say something that is correct. But that is only a statistical chance. It doesn't actually have the ability to verify its own factual information.
Yeah, exactly. I really dislike this term, but people talk about hallucinations with these models to describe cases where it outputs something that is not factually correct. Okay. Why is hallucination not an appropriate word for you? So there's two problems with it. One speaks to what you were just talking about, which is if it says something that is factually correct, that is also just by chance. And so it's always doing the same thing. It's just that sometimes it corresponds to something we take to be true and sometimes it doesn't.
But also, if you think about the term hallucination, it refers to perceiving things that aren't there. And so that suggests that these chatbots are perceiving things which they very much aren't. So that's why I don't like the term. Fair enough. It's a bit too human for what they're actually doing, which is a pretty cool party trick, but it is just a party trick. Yeah. One thing I've really appreciated about your critiquing of these systems is that
you situate the linguistic issues around lack of actual understanding and real pragmatic capability, but you also talk about it in terms of these larger systems issues: problems with the data, and problems with the amount of computer processing it takes to perform this party trick, which are a combination of alarming issues. Can you talk to some of those issues, and maybe some of the other issues that you've seen crop up with these models? Yeah, it's so vexed. So
one place to start is a paper that I wrote with six other people in late 2020 called "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?", and then the parrot emoji is part of the title. Excellent. And this paper became famous in large part because
five of the co-authors were at Google, and Google decided, after approving it for submission to a conference, that in fact it should be either retracted or have their names taken off of it. And ultimately, three of the authors took their names off, and two others got fired over it. Right. Okay. That is... Yeah. Dr. Timnit Gebru was one of the two, and she was just masterful at taking the ensuing media attention and using it to shine a light on the mistreatment of Black women in tech. She did an amazing job. And Dr. Margaret Mitchell was the other one who got fired. It took a couple more months in her case. And she also... Oh, you mean her name is not Margaret Mitchell? That was a pseudonym? That was a pseudonym. Yeah. Who would have thought? I can't believe it. Yeah.
So, that paper. We wrote that paper because Dr. Gebru came to me in a Twitter DM in September of 2020 saying, hey, has anyone written about the problems with these large language models and what we should be considering? Because she was a research scientist in AI ethics at Google; it was literally her job to, like, research this stuff and write about it. And she had seen people around her
pushing for ever bigger language models. And this is 2020. So the 2020 language models are small compared to the ones that we have now. And, you know, doing her job, she said, hey, we should be looking into what to look out for down this path.
And I wrote back saying, I don't know of any such papers, but off the top of my head, here are the issues that I would expect to find in one based on independent papers. So sort of looking at things one by one in the literature. And that was things like environmental impact, like the fact that they pick up biases and systems of oppression from the training data, like the fact that if you have a system that can output plausible looking synthetic text that nobody's accountable for,
That can cause various problems down the road when people believe it to be real text. And then a beat or so later, I said, hey, this looks like a paper outline. Do you want to write it? And that's how the paper came to be. So there's two really important things that we didn't
realize at the time. One is the extent to which creating these systems relies on exploitative labor practices. So that is both basically just stealing everybody's text without consent, but then also in order to keep the systems from just routinely outputting bigoted garbage,
There's this extra layer of so-called training where poorly paid workers, working long hours without psychological support, have to look at all the awful stuff and say, that's bad, that's bad, this one's okay, and so on. And this tends to be outsourced; there are, famously, workers in Kenya who had been doing this. And we didn't know about that at the time, though probably some of the information was available and we could have. And it keeps outputting highly bigoted,
disgusting text because it's been trained on the internet. Exactly. Which, as we all know, is a bastion of enlightened and equal-opportunity conversation. Yeah. Yes. Yes. But even if you go with only, for example, scientific papers, which are supposed to not be awful, guess what?
There's such a thing as scientific racism, and it is well embedded in the scientific literature. Cool.
What you get out is not science, but papier-mâché, right? But anyway, people were poking at this and very quickly got it to say racist and otherwise terrible things in the guise of being scientific. So I think it was the linguist Rikker Dockum who asked it for something about
stigmatization of linguistic varieties and it came out with something about how African Americans don't have a language of their own. Oh, a thing that we don't even need to fact check because that is incorrect.
So anyway, you can certainly get to bigoted stuff starting with things less awful than the stuff that's out there on the internet. Yeah. But also these models are trained on what's out there on the internet. Right. So labor exploitation was one thing that we missed. The other thing that we missed in the Stochastic Parrots paper was this:
We had no idea that people were going to get so excited about synthetic text. Right. So in the section where we actually introduced the term stochastic parrot to describe these machines that are outputting text with no understanding and no accountability. Yeah. We thought we were going out on thin ice. Like people aren't really going to do this. But now it's all over the place and everyone is like trying to sell it to you as something you might pay for.
Yes, in many ways. It's a paper that was very prescient about a technology that has really become very quickly normalized, which creates a compounding effect in terms of data, because now everyone's sharing the synthetic text that they're creating for fun, but people are also using it to populate
web pages, and heaven knows a lot of the spam in my inbox is getting longer because it can just be generated with these machines and processes as well. The models used to be trained on human-created data; now, if you try to scrape the internet, there'd be all of this synthetic machine-created language in there as well. They'll just start training on their own output, which
I'm not a computational linguist, but that just sounds like it's not a great idea. If you think about what it is that you want to use these for, then ultimately data quality really, really matters. And ideally not only good data, but well-documented data, so you can decide, hey, is this good for my use case? The ability to use the web as a corpus to do linguistic studies is rapidly degrading. And in fact, there's a computational linguist named Robyn Speer
who used to maintain a project called wordfreq, which counted frequencies of words in web text over time. And she has discontinued it because, she says, there's too much synthetic garbage out there now; she can't actually do anything reliable here anymore, so this is done. So it's bad for computational linguistics. It's bad for linguistics.
And just to be clear, with these models, there's no magic tweak that we can make to make them be factual. No, not at all, because they're not representing facts. They're representing co-occurrences of words in text. Does this spelling happen a lot next to that spelling? Do they happen in the same places? Then they're likely to be output in the same places again.
That sometimes reflects things that happen in the world, because sometimes the training text is things that people said because they were describing the actual world. But if it outputs something factual, it's just by accident. So your work on the Stochastic Parrots paper really set the tone for this conversation in
linguistics. You've been continuing to talk about the issues and challenges with these kinds of large language models and other kinds of generative models, because obviously similar processes are used for image creation. We've only really talked about the text-based stuff, and there's a whole bunch of things happening with audio and spoken language as well. But there'll be
heaps more of that on Mystery AI Hype Theater 3000, and also in your book, The AI Con, which is coming out in spring 2025. Yes, I am super excited for this book. It was a delight to work with Dr. Alex Hanna, who is my co-host on Mystery AI Hype Theater 3000, to put together a book that is for popular audiences. And one of the things that I think worked really well is that she's a sociologist and I'm a linguist, and so we have different technical terms.
And we were able to basically catch each other. It's like, I don't really know what that word means, and so the general audience isn't going to know what that word means. So hopefully it will be nice and accessible. The subtitle, by the way: the title is The AI Con, and the subtitle is How to Fight Big Tech's Hype and Create the Future We Want. Excellent. And it'll be out in May of 2025. And it seems like, given the limitations of these big models, there's still lots of space for the kind of symbolic grammar-processing work that you do.
Yes, there's definitely space for symbolic, grammar-based work, especially if you're interested in something that will get a correct answer if it gets an answer at all, and you're in a scenario where it's okay to say, no possibility here, let's send this on to a human, for example. But also there's a lot of room for linguistics in designing better statistical natural language processing, in sort of understanding
what it is that the person is going to be doing with the computer and how people relate to language so that we can design systems that are not misleading, but in fact are useful tools. If you could leave people knowing one thing about linguistics, what would it be? So in light of this conversation, the thing that I would want people to know is that linguistics is the area that lets us zoom in on language and pick apart the raindrops and understand their structure.
so that we can then zoom back out and have a better idea of what's going on with the language in the world. Thank you so much for joining us today, Emily. It has been an absolute pleasure.
For more Lingthusiasm and links to all the things mentioned in this episode, go to lingthusiasm.com. You can listen to us on all of the podcast platforms or lingthusiasm.com, and you can get transcripts of every episode on lingthusiasm.com/transcripts. You can follow @Lingthusiasm on all social media sites. You can get scarves with lots of linguistics patterns on them, including IPA, branching tree diagrams, bouba and kiki, and our favourite esoteric Unicode symbols,
plus other Lingthusiasm merch like our Etymology Isn't Destiny t-shirts and gavagai pin buttons, at lingthusiasm.com slash merch. My social media and blog is Superlinguo, and links to Gretchen's social media can be found at gretchenmcculloch.com. Her blog is allthingslinguistic.com, and her book about internet language is called Because Internet. Lingthusiasm is able to keep existing thanks to the support of our patrons. If you want to get an extra Lingthusiasm episode to listen to every month,
our entire archive of bonus episodes to listen to right now, or if you just want to help keep the show running ad-free, go to patreon.com slash lingthusiasm or follow the links from our website. Patrons can also get access to our Discord chatroom to talk with other linguistics fans and be the first to find out about new merch and other announcements. Recent bonus topics include behind the scenes on the Tom Scott language files with Tom and team, linguistics travel, and also xenolinguistics and what alien languages might be like.
If you can't afford to pledge, that's okay too. We really appreciate it if you can recommend Lingthusiasm to anyone in your life who's curious about language. Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our senior producer is Claire Gawne, our editorial producer is Sarah Dopierala, our production assistant is Martha Tsutsui Billins, and our editorial assistant is Jon Kruk. Our music is Ancient City by The Triangles. Stay Lingthusiastic!