Kia ora, ni hao and hello. Welcome to the Chiwi Journal podcast. I'm your host, Camille Yang. My guest today is Jack Connor, a linguist, programmer and author who speaks more than seven languages and has built a variety of AI language technologies.
In this conversation, Jack discusses his project focused on preserving endangered languages, particularly Navajo, through the use of LLMs.
He emphasizes the importance of cultural preservation and the role of technology in revitalizing languages. Jack also shares his insights from his experience with the Arctic World Archive and highlights the interconnectedness of language and culture. I hope you enjoy the show.
Welcome, Jack, to my podcast. Thanks for being here. Thank you, Cami. Appreciate you having me on. So your project to preserve endangered languages using LLMs is very inspiring. What sparked your interest in endangered languages, and why did you choose to focus on Navajo specifically?
I have worked with endangered languages in a couple of different ways and a couple of different projects here in California quite a bit. Kind of famously, there's a lot of indigenous languages in the United States because there was a lot of indigenous people here when it was settled in the 1600s, 1700s, etc. So this is something I've done like linguistics projects with smaller tribal language like
to Lobolobo, Cahuilla, like things like that that are California based. Or like, not sure how much you know about the geography of the states, but like the Southwest. I'm based out of LA. You know a little bit about me, like I love to speak languages, love to learn languages, total like, you know, linguistics nerd. Well, kind of working with these really small languages, one, I just loved it. You know, it's a lot of fun. It's really interesting.
When you're learning Spanish, you meet a lot of people that are kind of sleepwalking through it. They think like, oh, yeah, maybe I think I should learn Spanish because it's good for me or something like that. When you meet somebody who's learning like a small language, it's because there's a lot of passion there. There's some reason that they want to. And they usually have a really interesting and specific culture, too.
You know, languages form in isolation. And so when you have a small niche language that usually is part of a society or culture that's like interesting, a little bit different.
And so when I got involved with O'Shaughnessy Ventures, which is how we met, of course, the project I pitched to them was preserving indigenous languages, specifically Navajo, as you mentioned, with LLMs. The idea here being that, like, if you had a ChatGPT with a text-to-speech for Latin or ancient Greek or one of these languages, we would actually know what it sounded like.
You know, ancient Latin I like to use as an example, because it's probably the best preserved dead language that we have, or at least ancient dead language; there are probably some modern examples that we have more stuff for. But even with Latin, you know, most of what we have is either religious, you know, Bibles, things like that, or government tax records, trade records, things of that nature. And so while
you can really do a lot of damage, in a good way, to piece together a language using a minimal amount of material, if I could go back in time, if I could like kidnap Sam Altman, go back in time and force him to build a Latin LLM, then we wouldn't have to do guesswork. We could hear it. We could interact with it. And even though all the speakers had long died off, that's something that can still exist.
So this was my pitch. As you know, OSV loves kind of crazy technology ideas and stuff. And so they were pretty interested in it. And I chose Navajo because it's a really famous language. It was used by the United States as a code. And famously, they had these people called the Code Talkers, Navajo speakers that they brought to the Pacific Theater in World War II.
And it was the first unbreakable code, or specifically the first one that the Japanese weren't able to break and weren't able to torture out of anybody. They tried. There are some horrific stories about them trying to figure out what the Navajo code was. On the tech side, it was also kind of a huge technology advantage, because when the Navajo code talkers were in World War II,
the alternative way of sending encrypted messages took, and this is off the top of my head, so I might have this slightly wrong, but I think it was two minutes or a minute 40 or something. And then the code talkers, they actually had a code. They weren't just speaking in Navajo. They were speaking in a code in Navajo.
But it went from two minutes down to like 10 seconds or 15 seconds. In wartime, especially in the Pacific theater, where it's like this really nasty, like, you know, they were taking islands, people were being bombed. Like it was just really brutal kind of hand to hand and like small, small combat.
This was a huge deal. I mean, there's many things that helped win the war with the Japanese. And so it's not, you know, you wouldn't be like, oh, it was all the code talkers. But like they played a pretty, can I swear on this or no? Yeah, yeah, yeah. You can swear. Yeah, they were a big fucking deal. It was huge. It was like ridiculously effective what they did. So the Navajo became national heroes and are still celebrated every year.
Big deal. I mean, I'm literally looking at a book about it, you know, right now. So I thought it was a good language in that sense to work on, because it's one that, if nothing else, people know, and it can kind of stir up the imagination a little bit. The other thing is it has about 200,000 speakers. I think it's actually more like 170,000 speakers, but something like 200,000 speakers,
and it is geographically pretty close to me. So I can actually drive to the reservation. It's a long drive. For non-Americans, they'd think this was, I don't know, Europeans are always saying our drives are too long, but yeah, it's like maybe eight, 10 hours away, or I could take like a one hour flight to Albuquerque. But then as I've worked with OSV, I had this opportunity to do this project with the Arctic World Archive where we stored a huge, huge data archive.
The Arctic World Archive is these guys that have this ultra long term storage, like literally like this like microfiche style thing, but it's rated for thousands of years or over a thousand years. And then they drop it into the bottom of this coal mine up by the North Pole. And so we partnered with them.
And rather than do one language, because we don't have the LLMs set up. Well, actually, we did put LLMs on there, but not ones that I built, ones that were pre-trained, you know, that were trained by other people. And so we had an opportunity to do the preservation side of it. Because the other big thing with storing languages, like, back to my example, if you were going to, you know, kidnap Ilya and Sam Altman or whatever,
time machine back to ancient Rome and make an LLM for Latin: even if you could time machine back, how are you going to make sure that it is safe and readable a thousand years in the future? Well, these people at the Arctic World Archive have solved that. And there's a lot of teams that are working on this kind of ultra long-term storage, but they have a really, really, really good solution.
Went up to the North Pole, well, not the North Pole, but the Arctic Circle, and deposited many hundreds of languages' worth of data. Also, there was stuff like Chinese, English. You know, one of the things I did was pull various Wikipedia pages in every language I could. So I think one of them had like 270 different languages, I want to say, or maybe it was 150.
Something like that. It was in the 200s. And then we used this opportunity to make a documentary, which we are currently editing, where the deposit in the Arctic World Archive was part of that. But then we went to northern Norway (Svalbard, the island I was talking about, is in Norway) and met with the speakers of the Sami language, which is like this indigenous language from northern Scandinavia. Reindeer herders. So we hung out with a bunch of reindeer herders.
Talked to them about their language because their language is an example of one that has come back like crazy. It used to be almost dead. There was a period where the Norwegian government had a policy called Norwegianization, and they heavily discouraged using other languages in Norway. So there was...
a dozen or more of these Sami languages and several of them died out as a result. The people got integrated into normal Norwegian society. Specifically, the reindeer herders were nomads. And so the government like more or less couldn't catch them.
And they have stories about how like they would try and get the kids to go to these like Norwegian speaking boarding schools. And the kids would just tell them to go fuck themselves and then run into the forest. It's like a people where like the seven year olds know how to hunt and skin and can ride a reindeer and are basically like Princess Mononoke level, just badasses.
And so as a result, their language survived. And now smash cut to the 90s, when now all of a sudden everyone starts feeling real guilty about this, right? That happened all over the world. Same thing here. I mean, for us, I mean, you know, it wasn't always the 90s, but the 90s was kind of when like, it really became like uncool to oppress small niche languages in the West, you know?
And so they started reversing this trend. And one of the ways they did it is, like, schools could be bilingual, government things. The Sami got their own government that has some sway and a lot of other stuff. So these days, like if you live in a bunch of places in Norway, you could send your kids to a bilingual school that was both Norwegian and Sami. You could send them to an all Sami school.
Or if you send them to a Norwegian school, they can take Sami classes and Sami culture is like kind of reintegrated. And they've done a really good job of, you know, it's not bullshit, too. It's like they've done a really good job of now there are a lot more speakers than there used to be. And the younger generation, especially, there's a lot more speakers than there used to be, which is a big deal, because if the number of speakers goes from a lot to kind of
a little bit, it's really, really, really hard to reverse. So we took a small film crew, filmed the shit out of it. It's like one of the most beautiful places I have ever seen in my life. So it was like...
Oh, there was so much, you know, so much to film, so many beautiful places to film. The Sami were really cool. They're sort of like scamps, very funny, very talkative. Norwegians are like the greatest people on earth, but they're a little bit more reserved maybe. But like, yeah, the Sami are freaking pranksters. That's kind of like what they're famous for. So...
Yeah, so it was cool. It was cool. So that's, I forget the question. That's what I've been working on. So you mentioned that losing a language is like losing an entire culture. I'm glad you are working on that and can hang out with the people who speak Sami. So can you elaborate on how preserving a language through AI can also preserve their culture, stories? What's your observation there when you hang out with those who are Sami speakers?
Yeah, I mean, AI can't fully replace a culture. You know, there's no replacement for sitting around and hearing stories and talking to people and food and...
you know, music. Like, we had this Sami revolutionary who we got this great interview with. I forget if I sent you the teaser, if you saw the teaser trailer we did. I did. Yeah. He's got this great voice. Oh, cool. So he's that, yeah, the guy in the beginning. Halfway through the interview, he just starts singing a song. He's like, I wrote this about this city. And then like a little while later, he starts singing, you know. So AI can't deliver that experience. The preservation, which is what I'm mostly focused on, is
kind of making sure there's records of languages, no matter what, making sure that our data is safe, that future linguists can work with it, that people who want to learn these languages in the future will have an ability to. But it's not the same as linguistic revitalization, which is where you take a language and you start, you know, kind of coaxing it back into the mainstream, which is what they did. I am making AIs of niche languages
for the reasons that I said. But if I could have my wish, and instead of building an LLM for a language that was dying, just have the language come back to life and have like a million speakers for it,
I would definitely do that because, you know, the culture is people and AI is not people yet. Now, that being said, if you're training an LLM, I'm using AI and LLM interchangeably for the tech nerds out there. Don't get pissed off at me because I know that that's not totally accurate. But the positive on that is you train it on culturally relevant documents. So one of the things I'm trying to do is I'm trying to...
to pitch Trinity College in Dublin on using their library. We have a mutual friend, good old Dylan, hopefully going to help me out with this. But essentially, I just think it'd be really cool to do Irish. Irish is also another example of a language that really has come back in a very powerful way. The government is super involved with it, street signs in Irish and things like that.
So if I build an Irish LLM and there's, you know, tons and tons of content, you're not, you're not lacking for things to train it on. It would be awesome to train it on like the great books and all the, you know, not have it be generic, but have it actually be something that can, that give you a little bit of culture, you know?
Like, ChatGPT is not famous for that. You're not like, oh, ChatGPT has such cultural knowledge, but it actually does. I mean, you ask it a question about Shakespeare, it'll answer a question about Shakespeare and things like that. Yeah, but it really depends on what it's trained on. So if you train something on textbooks, it'll talk like a textbook and be able to answer questions from textbooks. If you train on poetry, it'll
It'll talk like a poet and give you poetic answers, you know. So that way there is some retaining of culture. But then on the other side, too, like for the Arctic World Archive, we did music and books and tons of audio files of people like talking about food and cooking and, you know, just kind of day in the life type stuff like.
So that, you know, we have a couple of LLMs in there. We have one in Korean and one in English, and then a bunch of like machine learning models that do kind of text to speech and various other things. But also like if you're, you know, Cami, if you're a future historian, let's imagine, you know, a thousand years in the future and you find this real iPod.
You could use the AI to help learn the language. And you could have, you know, we also have lots of like technical documentation, like dictionaries and white papers from linguistics and grammar books and all kinds of shit like that. But also like, God, I forget what, I think we did do a movie for Chinese, but like, you know, for a bunch of these, like there's like a movie you could watch, like, and not just, you know, it's like an Oscar winning movie. I love movies. So I didn't, I didn't, I wouldn't allow any crap in there.
Or like, you know, listen to an audio file of a grandma talking about her childhood or like, you know, things like that, that would, again, they don't solve the problem. But for you, the future historian, at least enrich and give some of the culture because I am a big like language and culture are very interconnected kind of guy. That's, you know, that's where I fall in it. So.
I think it's important. Like a language is not just sounds put together in a sequence. It's a way of thinking and a lot of times a way of living.
For me, I got a question, because, you know, language evolves naturally over time. So how do you ensure that the AI not only preserves the static language, but also allows it to grow and adapt alongside the ongoing changes within the community? I guess the short answer is you'd have to keep training it. If you want an AI that can speak, you know, every dialect,
You know, where you could say like, oh, give me a Boston accent or like give me like a Caribbean accent or British accent or something. The only way to do it is to train it on that. If it has never heard those or it hasn't, you know, interacted, then it's not going to be able to do it because it's just not going to know. But the other thing, kind of to your point, like you're alluding to, is language is always changing every second of every day. You know, it's not.
There's nothing less static, like even me and you talking.
You know, I'm going to say after this conversation, I'm probably going to say like one word a little different because I heard you say it or you're going to, you know, I say some word that you're like in the back of your head. You're like, oh, that was kind of a cool phrase. Like maybe I'll use that the next week you use it. And and these are really little things that don't seem like they make a difference. But like, let's say you use, you know, I tell you something super like California and then you use it with a friend.
And then they use it with a friend and then they say it in a lecture. And then one of those students uses it online. And all of a sudden, like in a year, you notice the people of Singapore are,
saying that things are gnarly or whatever. The word dude becomes popular there. This is how language works. I mean, it's a little bit of an exaggerated case, but if you take these little things and then multiply it by every single person on earth, every time they talk, it starts to be a big difference. And if you notice, even people speaking 30 years ago have a different accent than we do. I'm sure in Chinese, it's the... I mean, it's not necessarily like
I wouldn't recognize an American speaking 30 years ago, obviously, but the accents can be a little different. Some of the words are going to be a little different. Some of the phrases are going to be a little different. Some of the jokes are going to be different because they, you know, had different commercials or different things they're making fun of. So there's no way of perfectly preserving or recording
every accent, you know. Unfortunately, accents and dialects and things are just, they're, you know, living and dying all the time. But I guess that's sort of like, languages are like people, you know, just always changing. Yeah. So what are the traditional methods of preserving language? Books. Traditionally, one of the sad things is if a language doesn't have writing, and there's a lot of just purely oral languages, once a
It disappears. It is just gone. There is no way of bringing it back because it only lived in people's brains. And this happens all the time. This is a big problem with indigenous languages, actually everywhere, but in the States especially.
Some of them have writing systems. But like Navajo has a couple of books, a couple of kids' books, and like Star Wars and Finding Nemo were translated into Navajo. There isn't really that much. And so, actually, in tackling the Navajo problem, this is the first thing that we're doing: being like, we actually need to create a lot of data that we could use. You know, and going back to the Latin example of like,
religious texts and government texts, like, especially before, you know, the 19th century, when most people weren't literate, there were not a lot of books around, not a lot of authors or anything like that. Like,
I would say for, you know, I'm making this number up, but, you know, 98% of languages that died before 1900, all we have is religious and government text or, you know, almost all we have is religious and government text. So yeah, before AI it was books. I mean, there's also audio. Like I love, you know, I love music. I love movies. And this has also proved to be a really good preservation method because there's some incentive there. But yeah, I mean, just traditional, traditional methods, all of which are,
imperfect, but it is what it is. You know, sometimes AI models can produce biased or inaccurate outputs. So how do you minimize the risk of misrepresenting those endangered languages through these models? What safeguards do you implement to maintain the culture?
You have to work with native speakers. You just have to do a lot of testing. That's what you are doing. Yeah. And so one example, I have another project going in Panama right now.
where there's a language called Ngäbe, N-G-A-B-E, which isn't really pronounced the way it looks. And the hospitals in Panama don't have enough translators. So they keep getting people coming in. My partner down there, his wife works at a hospital. And she's like, at least once a day someone comes in and we don't really know what to do because they don't speak Spanish and we don't have someone who speaks Ngäbe.
So there we're building a text, starting with a text-to-speech. Hopefully in the future, you know, there's going to be multiple projects. But if nothing else, just like some, you know, some small tools like text-to-speech and things like that. And then a machine translator, all of which we know will be imperfect because there aren't that many speakers. But right now we're trying to fix a problem. You know, I'm not trying to create the world's best LLM and Ngäbe software.
Trying to help these hospitals actually like service people who are injured and sick and stuff. But the answer there is like we have to have somebody who speaks the language because I don't speak the language. And my partner there speaks a little bit, but he's not like a native speaker. Like you basically just have to have someone verifying and you have to QA the fucking shit out of it. You know, making sure it's OK, because.
If ChatGPT, which is like the most trained AI on the planet at this point, probably, I mean, maybe one of the other ones has a little bit more, but if that can make mistakes and hallucinate, then anything can. And so the other thing too is just also like
I think in general, being smart about what you trust coming out of it. As far as I know, we don't really know why they hallucinate. I think it's really funny that they hallucinate because it's kind of like how we dream. So it's sort of the interesting thing is like, you know, we thought that we're going to use our brains to learn how to build an AI, but our AIs are actually kind of teaching us a little bit about how we, you know, our brains and how we think. But yeah, I think like a lot of tech, you have to be careful and you have to make sure that
You know, you can't assume everything is going to be flawless because it's not. And then you just have to really, really work with native speakers. Like I'll give you one example too, is there's a lot of people that want to use machine translated or machine generated content to train more AIs. And I'm not totally against this, but I'm very dubious about this because it presents the problem of you're like,
just, if you feed it bad data and it trains on that, it's going to keep making more and more bad data. It's basically like the training version of incest, where it just keeps getting worse and worse because the problems keep magnifying. So as far as I know, there isn't really a shortcut aside from, I keep saying this, but working with native speakers, finding people that, you know, really know the language well. And then also, like, being careful about your downsides. Like
If you're going to trust a goddamn AI with someone's life, you better double check or be careful. Or it must be like, or it's an emergency or like, you know, it's not something to do idly. I'll say that. But you wouldn't with a SaaS platform either. An app that was supposed to, you know, call 911 if you were having a heart attack had better work, because if it doesn't, that's a huge problem. What about you, Cami? What languages do you speak? Mandarin. Mandarin.
What is your accent?
我的口音很复杂。 I have a very complicated accent. A lot of people think I'm from, yeah, they couldn't guess where I'm from. Like I don't have that typical Chinese English, Chinglish accent, but I don't have the standard English accent either.
Wait, let me guess. I think you sound like you're from Shandong. Just kidding. You already told me that. What's your... What do they speak there? Is there a second... Is it dual bilingual in Shandong? People in my city, Jinan, they speak like local dialect. I'm from the coastal province. It's called Shandong province. There are like 34 cities in my province. Each city speaks different dialect.
Is the dialect from your city related to Mandarin, or is it like a totally different language? It's really similar to Mandarin, the standard Mandarin. I think people from the North, although they speak different dialects, can understand each other. But if you're from the South, like if you're from Shanghai and you speak Shanghainese, I couldn't understand you. Yeah, because Shanghainese is like significantly different.
It is. Same as Cantonese. It's totally another language, I would say. Yeah. I mean, there's a lot of argument about where the dividing line between languages is. I tend to use mutual intelligibility as the definition where if you speak one language and I speak another language, but I can speak in my language and you can speak in your language and we can understand each other, then I'm sort of like,
Are they different languages? This is a question that nobody's ever satisfactorily answered. Do you know the old joke? Like, what's the difference between a dialect and a language or an accent and a language? No. A border and an army. Scandinavia is a good example of this, where Denmark, Sweden, Norway, and Iceland can all understand each other really well.
They make fun of their accents a lot, but they can all completely understand each other if they're speaking in their own language. But they're their own languages, you know. Whereas France, because like the French are so fucking French. You know, they're like, vive la France. Like we are, you know, we are French people. Like we speak French. It's kind of that Norwegianization thing. Like they're very like, France had a bunch of different languages, including Basque, which Basque is like, it's as unrelated to French as like, I don't know, like Chinese is to like,
whale songs. Like, they're so different it's not even funny, like completely different linguistic roots and everything. And yet France has the balls to be like, oh no, this is an accent of French, because they didn't want to, like, essentially admit that anything in their borders wasn't actually French. But they didn't want to tell people not to speak it, so they just declared that it was actually an accent of French. And I do notice that living in Portugal a lot of people speak
Portuñol, it's like a combination of Portuguese and Spanish. So I see people inventing a new language. It's pretty fascinating. There's a weird thing too where, you know, mutual intelligibility. My experience is Portuguese people can understand Spaniards really well, but Spaniards can't understand Portuguese people. So it only works one way. That's interesting. How do you like living in Portugal? Portugal is amazing. It's like a digital nomad hub. So I love it.
You get a chance to talk to people from different backgrounds and you can learn so many languages there. Yeah, I just love to live in an environment full of diverse people, not like when I lived in China, where the circle is pretty small. Same in New Zealand. Yeah, it's fun to, I lived in Barcelona for a year and Barcelona was like that where it's just everybody from everywhere lives there.
Yeah. Especially I was really involved with the skateboard community. For anybody who doesn't know, Barcelona for about 20 years now has been considered the best skateboard destination on planet Earth. So if you go there and skate, especially, I mean, it's just you meet people from everywhere. Africa, South America, United States, all over Europe. Like everybody's just trying to like get to Barcelona to go skate. It's great. Yeah.
Yeah, you mentioned that California has a rich skateboarding culture. I wonder how has skateboarding influenced your work, your life? And do you see the parallel between preserving the subcultures like that and preserving language? Oh, that's an interesting question. I've never thought about that. Yeah, probably. You know, me being...
Part of and, you know, really loving like the skateboard community, which is like a very specific niche with a lot of like specific subcultures in it. There's like punk skaters, there's like rapper type skaters, there's like handrail skaters, there's like artistic skaters. There's all kinds of stuff. I hadn't thought about that, but probably because I've always lived in these little sub niches that had a lot, you know, not that many people, but people who are like very passionate about it.
And so maybe that's making me realize something about myself. But yeah, maybe that's what's drawn me to, you know, why I would rather learn a small language with people that were really interesting rather than learn like a big generic language. Also, like why I tend to work at startups and non-incorporate, you know, not at big corporations. Maybe. I don't know. I guess I have a, I guess I have a type. But another, another thing though, skateboarding, see if I can explain this right.
Skateboarding and languages and coding, which is another thing I do, are all things that you really have to teach yourself. Or not you don't teach yourself, but you have to be self-motivated.
Because if you're not, you're never going to get good enough. So like you could take Spanish classes every week. Oh, I'm going to my Spanish class this week from 7 p.m. to 9 p.m. Hola, como estamos? You know, and stuff like that. And do your Duolingo. But if you don't really get into it, you're never going to get fluent. You know, like at some point you have to
like essentially yank the tooth out as a linguist I know says. He calls it the dentist method because he's a kind of older guy. He's like, oh yeah, the dentist just pulls it out. He's like, you have to do an immersive, like if you're going to learn a language, you have to have at least a phase where you're like hyper immersive in it. Otherwise, it's just not going to work. Or maybe it's not immersive. Maybe it's like
You read 10 books in three weeks that are all in Italian or Spanish or something like that. But they require a lot of self-motivation. It's also why I have met so many coder skaters. Lots of skateboarders get into coding and they are very fucking good at it. Every skateboarder I know who's a coder is like,
the most senior architect at whatever they are at and are like just ridiculously good compared to other people. And it's not because they're so smart. It's because they can teach themselves stuff and they can fail a bunch of times. Like if you want to learn how to do a kickflip, you know, a kickflip is the board like does this like, you know, it goes like, you have to like try it hundreds of times and you will eat shit over and over again.
over and over. You will definitely get bloody. You'll definitely smack your face against the concrete a couple of times, but you'll learn it. And it feels so good when you do. Coding's the same way. To learn how to do a for loop for a specific thing, the best way to learn is just try it as many times as possible, as fast as humanly possible. And every time you fail, just get right back up and try again. Skaters are very good at things where you teach yourself. They're not great at things where you have to...
bow down to a lot of authority figures. That's not their strong suit. So I guess those are some ways that skating has...
affected my career. And then the third one is that I work with tons and tons of skateboarders. So like my, my partner, you know, on the, on the documentary, the guy, he's such a fucking genius. And every shot looks at Nigel. Yeah. Nigel Branch. What's up, Nigel? Shout out to you, if you listen to this. Dude rocks, he's so good. And we, we met each other through skateboarding and I've, I've hired so many skateboarders before.
At my last startup, the first guy I hired was a skater and he was the best hire we ever made. We did so much work together. And part of this is totally biased. It's not like skateboarders are just good. I don't look at other people. My network is a lot of skateboarders, so it's also really easy for me. If I have a job, it's easy for me to put out the bat signal and I'll tend to get a lot of skaters.
But also the people I'm like really comfortable working with who I know if they're good, I know they'll deliver. And Nigel, another thing too is Nigel is also another guy because he had been a skater filmer, like literally film skateboarders and stuff. I know he's someone who could like take care of a really expensive camera and not break it and do things like hold it out the window when we were driving at 50 miles an hour to get a shot and things like that.
because he's used to working under extreme situations. Amazing. Reminds me of that movie, Free Solo. The director is also a rock climber himself, so I think only he can do that job. Yeah. I didn't see that because it seems kind of freaky. I also have a lot of rock climber friends. I've heard a lot of stories about just bad stuff happening. But yeah, no, it's absolutely, absolutely. And back to the language thing, it's also like,
you don't know unless you live in it. If you go and visit China as a tourist and you don't speak any Chinese, that's great. You see the Forbidden City, you can go to Shanghai and go to Fuxing Lu or whatever, or Sanlitun and all this stuff in Beijing or whatever. But if you speak the language and you get to know the people, it's a totally different experience. You're not just looking at things, you're feeling like
part of the culture and this is actually something I get really addicted to as one of the reasons I love languages because I love just like sitting down with people at coffee and just talking to them and even if I'm talking like a fucking idiot because like I don't speak that well it's really fun and I always walk away feeling like much more a part of whatever you know culture or society that was
So what are some of the broader global trends you notice in language extinction? The global trend in language extinction is that language extinction is happening. I've had people push back on this, but I estimate it's about 100 times faster than it's ever happened before because of the internet. I am not anti-internet. I'm not anti-globalization. Like a lot of people, I have thoughts about a lot of stuff, but the fact is that
If you, I'm going to use an old coworker, actually two old coworkers as an example, worked with these two guys from Brazil. Awesome. So talented. Spoke perfect English. Just a tiny bit of that, like, you know, that sweet sing-songy Brazilian accent. That's so great. And when I, like, after hiring them, at some point we were talking and I was like, well, like, yeah, do you guys like live in the States or?
How did you get to, did you major in English? I forget what the question was, but I was kind of like, yeah, how'd you learn English so well? One of them said because he played Counter-Strike a lot. The other said because he was obsessed, because he watched Netflix a lot. So on one hand, if you're at this time, if you're living in a non-dominant language, almost all of your media is coming in a dominant language.
And we have things like subtitles and some dubbing, but my theory is that like this will actually, this pressure will be relieved to at least to a certain extent once our AI and machine learning translators and everything gets like so good that, you know, you and I could have this conversation and you're talking in Chinese and I'm talking in English and like neither one of us even notices. As soon as we get to that point, it doesn't really matter. You know, we can then all of a sudden that pressure is taken off. But right now,
If you're a 22-year-old living in the world, you're probably being exposed to a lot of either, you know, if you're, yeah, you live in a place with a niche language, like,
you live in Switzerland or something like that or Malaysia or Thailand or wherever, you know, tons of your media is coming in another language, whether it's English or Mandarin or maybe Korean. I mean, Korean is not a dominant language, but obviously there's a lot of like media in Korean. But I mean, honestly, English, English, Spanish and Mandarin are kind of like eating the world in this regard. Like they really are taking over. And then the other thing is work. So because like the remote work revolution is,
You're living in Portugal working for a company based in Connecticut. Right, Cami? Yeah. And you're from China and you used to live in New Zealand and all this stuff. It's a lot more monetarily valuable to speak one of the dominant languages if you for most cases, if you want to work, you know, like get back to, you know, if you if you live in Brazil, you know, you're going to have to work in a country that's not as dominant.
the job in the U.S. is probably just going to pay more, like straight up. Like this isn't, you know, it's not 100% of cases, but a lot of cases it is. And so speaking English gets you, you know, a big advantage when it comes to just like finding work. And so if I'm some high school kid and I'm like, should I study Shanghai Hua or should I study English?
Or, you know, you're in whatever the case may be. Like, there's a lot of people who are going to say, like, you should study English because there's more job opportunities. Yeah. Now, this is kind of bullshit. I actually have a pushback on this, which is that I mentioned I lived in Barcelona. Like, I speak pretty, pretty good Catalan. I understand it really well. I speak it pretty good, but I mess up a lot. But I'm understandable. You know, I can hold a conversation and I can understand almost everything that's said, which is the important part.
Catalan has gotten me a lot of interesting network things. I don't know if it's actually got me a job, but it's definitely gotten me connections and cool stuff like that in ways that Spanish actually has not. Because if I can speak Spanish, people are impressed. Like other Spanish speakers are like, oh, that's cool. Glad we can communicate, but it's fine. If you met someone who spoke with a Shandong dialect,
you'd be kind of like, holy shit. And especially if it was some white dude from Los Angeles, you'd be kind of like, what is going like, what is up with this guy? You know, but it's, it would, uh, it's a good icebreaker and it's, and it's, um,
I think for a lot of people, it's like, you know, a kind of sign of respect, maybe. You know, it's just cool. So speaking these smaller languages actually opens up, like, really interesting doors. I guess it depends on what the language is, you know, but in ways that sometimes the bigger language doesn't. People love to do, like, rule of thumb. Like, oh, we have to learn English. Like, you have to learn English. Otherwise, you're never going to get a job, you know. But, like, this shit's, like, never 100% true, you know, so. Exactly, yeah.
So what are the long-term impacts you hope your work will have on the endangered language community around the world? The big, big, big goal is just making sure that this stuff doesn't disappear. And so the reason that sounds far off, and it's not, because, you know, you're very tech-savvy.
You live on the Internet. You know how there's like you'll go to a blog that you used to read, you know, 10 years ago and you're like, oh, shit, it disappeared. Like every article is gone. Or this is a huge thing here in the States is you keep having like Scribe just went out of business, like Scribe Media. And what was that one? The one that like posted the sex tape? Gawker went out of business. And a lot of times when they go out of business, they pull all of their articles off.
Because they're not going to pay for server space after they've gone out of business. They don't care anymore. But after they do that, this stuff is just gone or lives on the Internet Archive or something like that. So now think about niche language. Think about a really small language like one around here called Cahuilla, which has less than 20 native speakers. The whole language is basically being propped up by one or two non-profit organizations that
are run by a couple of semi-volunteers out of some little clubhouse out in the desert somewhere or on one of the reservations or something like that. And they're getting funding of, I don't know, $50,000 a year or something like that. And this entire language, everything about this language that you can find online is on a single WordPress site. Yeah.
If it wasn't for some WordPress site that's like NSA-Kaweah.gov or something like that where somebody's maintaining it that has links to...
you know, some grammar stuff and things like I'm talking like really small languages, you know? And so like, if you want to find out information about it, like really, it's just living on like one or two websites. And these are like 10 year old WordPress sites. And if those go down, like if that, if that nonprofit loses its funding or somebody forgets to pay a bill or like somebody dies, because a lot of times the speakers for these are in their seventies or eighties or nineties, that's it.
Like it's gone. We talked about before, like if there's no documentation, if there's no books, there's nothing written, it's gone forever. Never to come back. There's no way of possibly retrieving it. And so you could say like, okay, well, but they'll keep the word, you know, they'll keep that site going. That's got this data for forever.
like five years, right? And you're like, yeah, I'll probably have it for five years. Like, well, 10 years, like hopefully another 10 years. What about 50 years? You're like, seems unlikely, right? At 50 years that this crappy little word, I mean, say crappy affectionately, but that this WordPress site's going to still be there. What about 500 years? There's no chance in hell it's still going to be there. Not just like people taking sites down, but like
Databases get corrupted. Servers go down. Like, there's the freaking CrowdStrike hack or whatever. Like, you know, there's just so many things and everything takes a bite out of our data and nothing is putting it back unless people actively do it.
So the problem is almost everybody is storing their language data on the Internet and the Internet is very fragile for stuff like this. So that's why I wanted to do the Arctic World Archive. So coming back to long term goal, Arctic World Archive, everything we put in there cannot be taken out unless you literally dropped a nuclear bomb down the mine. Like, the Arctic World Archive people are so fucking cool.
because they are, like, they are true futurists. I spent like a week with these people, and they're really, really, really thinking about, like, what are people in the future gonna... how are they going to interact? How are they going to download this data? Like, what are they going to do? They've thought of everything. They are so into this, and it was so awesome. This is why I want to get OSV and the Arctic World Archive together, because it's just like
It'd just be like nerd heaven. It would be so good. That'd be cool. We'll lobby James for that. Yeah. So I'm going to keep putting language data in there for rare languages because I think my prediction is in the next five years, there'll be a situation where
Where some like linguist or somebody somewhere is like, oh, I want to study, you know, I want to study like Tlaloc. I want to study probably not Navajo because Navajo has a lot of speakers or I want to study Shandongese or, you know, whatever the case is. And we are the only people on Earth that have it because everywhere else is slowly going to fade away. But we have one of the only databases on Earth that is completely impenetrable and will never, ever
ever be corrupted or lost or anything like that. So as long as I keep adding stuff to it, ours will keep growing as everyone else's keeps shrinking. And then honestly, like there's going to be a point when like there's a lot of language data that we are the only people on earth that have access that, you know, that have it. Yeah. So I think that's that's one of my goals.
is to just make sure that it's still there. Like that this stuff like doesn't disappear off the face of the earth and that we have this cache that can be accessible to everybody. I mean, we're not, yeah, it's not like a private thing. We're not trying, it's, I'm going to be hosting it on our website. This is not done yet. This is a project that's in the works. Hopefully in the next like week, couple of weeks, but we'll see.
But that being said, then if our site goes down, we just go back to the coal mine, get our data, re-upload it, problem solved. We basically have...
you know, for our language data, we have this infinity switch that nobody else has. For them, if their database gets corrupted, they're fucking done. For us, we have a backup. Cool. Since you are the first one to receive the OSV fellowship, so how does this fellowship help you accelerate your project? And what are the next major milestones you aim to achieve?
Oh, the fellowship. The fellowship let me do this project. I had been doing this part time with like little teams. We built I built an LLM in Ukrainian for as kind of part of the Ukrainian war effort. I don't know if you call it the war effort. It was.
for a conference. I mean, everything around Ukrainian or Ukraine right now is, like, obviously focused on the war. But yeah, I worked with a team, we built an LLM in Ukrainian. The point was for it to, it had to do this contest where it answered, like, SAT questions, like Ukrainian, you know, kind of like standardized test questions. It was pretty cool.
But I'd been doing a lot of this stuff on the side and just working with tech. Like I was the CTO of this like B2B SaaS company before going full-time with OSV. So OSV and Jim O'Shaughnessy and you and Matt and Dylan and Otman and all the great people let me really focus on this. And the other thing too is beauty of working. The great thing about OSV, I think that this is a great connection. I love this because...
With OSV, Jim was basically like, all right, I'm going to give you some money. And it was a good amount of money, but it was also, I live in California, married, I have a baby. This is keeping me going. It's not like millions of dollars for me to build out tons of infrastructure. It's basically like living expenses plus a little bit, but enough to just let me do this full time. But that was perfect. And especially because Jim's whole ethos is like, all right,
Go fucking nuts. And I was like, I can do that. And then I've just been trying to just go nuts with language projects ever since. So the next thing coming up is honestly, it's just getting this documentary done. So finish editing. My dream is to turn this into a series. I really want to make this a series where I go to like, you know, I don't know, you'd have a season where you do like 10 different languages or like six different languages or something like that.
you know, really cool ones like Irish and Basque. And I want to go to like the Sahara because Amazigh is like this great language down there and stuff like that. And I mean, it's selfish because I love to travel and this just sounds really fun. And I love talking to people. But the overall goal would be like mission oriented in that like the whole point is to get this on people's radars. Most people I talk to have no idea that like if you say like endangered language, they're like,
oh, I'd never thought about that. And then they get the problem. But so there's benefit of just like letting people know. And then also as like a way of helping us like do our work, like we're going to need
If we're going to store tons of, preserve lots of languages, we're going to need to have people actually submitting and collecting language stuff for us. And then the other thing too is with this documentary, we highlighted the Sami, as I talked about. One, because they're really interesting. I love just a good travel story and I love niche cultures. I've been talking on and on about nonstop.
But also this, you know, we have a lot of stuff in there for other people to learn from. Like, so we want other cities and other countries to use this as an example for like literally like, oh, how do we bring a language back that we want to? Like if you're, you know, wherever, like,
this thing in Panama. Maybe the government officials will watch that and be like, maybe we should have bilingual schools because we want to bring this back. We don't quite know how, you know? And a lot of this shit's tactical too. Like it's not that it's impossible, but there is kind of almost a playbook for it. If you want to, if you want to start bringing, you know, having a language be spoken more by kids and be used more at home and be used more like
You have to make the government bilingual, make schools that people can go to, have events, and just encourage it. Be like Ireland, have street signs in it, things like that. That's cool. When I grew up, I learned Chinese and English at school. And I know in New Zealand, people learn Maori at school. They all like bilingual. But a lot of people won't take it seriously. Do non-native New Zealanders speak a lot of Maori?
A little bit, I'll say, because in the working environment, I worked in corporate New Zealand for a couple of years. I do see there's like a Maori Language Week or there's some signs on the government website with the Maori language.
Still, I don't think the majority of people can speak Maori fluently. It's just like a surface work, I would say. It's like, yeah, good to have it, but nobody really cares. That's my observation. Well, you do need that. And that's the thing is you do need that like immersive part. You know, if kids are going to speak it, they have to have a situation where they're speaking it all day.
or at least for a couple hours a day where they're only speaking it. And then also, you want something where they're actually talking with their friends about it. You seem like you were a troublemaker. So when you got out of high school, your English class...
But then I'm sure you immediately started talking to your friends in Chinese. You weren't like talking to them in English or anything like that. I couldn't talk a single word in English when I was living in China. Just for examination. Yeah, I don't need to talk. Yeah, exactly. And it's really, it's like if you're not using it casually, it's a little hard for it to...
To almost like lock in your brain. I think that there's a lot, this is a whole other topic, but I think there's a lot of muscle memory involved. And so really like, it's not just practice, but also practice in situations where you're just like having fun and being yourself. So the best, I swear to God, the best place, I don't even drink and I still think the best place to learn a language is in a bar.
Just talk to people about fucking whatever. It doesn't even matter. Talk about soccer. Talk about politics. Talk about who cares. As long as the words are coming out of your mouth and you're having fun, you will be learning the language really well. That's good. Okay. To wrap up, I'd love to hear your opinion about
One of my favorite sci-fi writers, Ted Chiang, wrote "Story of Your Life," which deeply influenced my philosophy of living. If you can master a language, it can change your mind. So what are your thoughts on this story from a linguistic point of view? Okay, so let me do linguistic, and then I'll talk for a minute about the movie and the story and that kind of stuff. Yeah, yeah, yeah, sure. Yeah.
The linguistic, I loved it. I loved it so much. I thought it was so good. Obviously, the twist is that learning to speak the language, it's sort of an extreme example of linguistic determinism. Like,
learning to speak this alien language where the aliens don't experience time like we do. They experience it like the Tralfamadorians from Slaughterhouse. I don't know if you ever heard Kurt Vonnegut. I love, yeah. Okay, Slaughterhouse-Five. You know, like the creatures there that also live in all times at once. As far as we can tell, it's very possible. Nobody knows for sure, but
It does appear that our perception of time is a human thing and not a universal thing because most laws of physics, almost all laws of physics do not take time into account or don't have like linear unidirectional time to account. But that doesn't matter. Like the point of the story is like you learn to speak this language and all of a sudden you can start seeing different points in time because the language itself is
is like kind of timeless. So the language stuff in it was fantastic and much better than the movie. The movie was pretty good. The language stuff in the movie is funny. Like everyone remembers it as a language movie when really like the language part is like about 10 minutes long at the beginning, which actually speaks to the problem with both the movie and the story, which is like the endings kind of suck for both, in my opinion. They don't suck. They just kind of like drift off. They don't really like have a really definitive ending. Although the dead...
daughter in the story makes a lot more sense than the movie, where it makes literally no sense. So yeah, I mean, I think it's fascinating. And in the book, he really broke it down to, like, the linguistics of them learning the language. It walks you through it one step at a time, and then she starts to have these weird episodes where, I think, she thinks she's hallucinating, but actually she's seeing the future. And it's very beautiful and touching and interesting.
And neither one, they don't explain what the aliens are. The aliens just disappear. And that's about it. So I love it. I would recommend it to everybody. What's the author's Chinese name? Jiang Fengnan. Jiang. So we call him Ted. Everyone's like, Ted Chiang. Well, but here, they're like, Ted Chiang. Ted Chiang. Things like that. I was reading that book of short stories, and I think a maid in Japan
either threw the book away or stole it because I had left it in a hotel room and then I couldn't find it. I never finished the rest of them. Right. I think that's the one, yeah, the best story from that story collection. Every time I read it, I just cried for no reason. I don't know why because I never cried for a sci-fi story but since I'm reading his work, anyway. No, sci-fi is beautiful. This is why sci-fi is good because it gives you thought experiments that you would never ever think of. Yeah. Yeah.