
AI voices are taking over the internet

2023/9/11
The Vergecast


Welcome to The Vergecast, the flagship podcast of neurological engines for overdubbing. I'm David Pierce, and hang on one second. Actually, I've just got to finish something. Once upon a time, a white snake and a green snake living in a remote mountain became immortal and obtained superpowers after centuries of practice. Okay, sorry, I'm back.

And before you worry that I just, like, had a stroke or something live on the microphone, I should tell you that I'm actually in the middle of training an AI to mimic the sound of my voice. This one is on the iPhone. It's called Personal Voice.

And products like this are what we're going to talk about today. The idea of being able to create your own AI voice clone has been around for a while. We actually talked a bunch about it on the show in 2021; I linked to it in the show notes.

A lot of it holds up really well. But over the last couple of years, it's gotten drastically easier to make a vocal AI, and the results have gotten drastically better. You can even do it on your phone, like I'm doing now, with just a few minutes of really awkward reading of sentences.

Here's what mine sounds like. Hi, I'm David Pierce's iPhone voice. I'm kind of like David, but kind of not.

So today on the show, we're going to dive into the boom of voice AI, and then we're going to try to figure out if I can actually make something that sounds like me. This is The Vergecast. We will be back.

Support for The Vergecast comes from Stripe. Stripe is a payments and billing platform supporting millions of businesses around the world, including companies like Uber, BMW, and DoorDash. Stripe has helped countless startups and established companies alike reach their growth targets, make progress on their missions, and reach more customers globally. The platform offers a suite of specialized features and tools to fast-track growth, like Stripe Billing, which makes it easy to handle subscription-based charges, invoices, and all recurring revenue management. You can learn how Stripe helps companies of all sizes make progress at stripe.com. That's stripe.com to learn more. Stripe: make progress.

Support for the show comes from ServiceNow, the AI platform for business transformation. You've heard the big hype around AI, and the truth is AI is only as powerful as the platform it's built into.

ServiceNow is the platform that puts AI to work for people across your business, removing friction and frustration for your employees, supercharging productivity for developers, and providing intelligent tools for your service agents to make customers happier, all built into a single platform you can use right now. And that's why the world works with ServiceNow. Visit servicenow.com/AI-for-people to learn more.

All right, we're back. Before we get too deep into the world of AI voices, let's try and quickly understand why this is such a big thing right now. And as far as I can tell, there are basically three reasons this kind of tech is booming. The first is because audio in general is booming, with podcasts and voice messages and those generated spoken captions you hear on TikTok and all sorts of other things.

If you think about it, you probably hear the internet a lot more than you used to. And like any other creative feature on the internet, a lot of tools exist to help you make it. One app we use is called Descript. It's an app that a lot of people use for editing audio and video.

It has this feature called Overdub Voices. And one of Descript's cooler features in general is that you can edit audio and video basically by editing text. You import the file, it gives you a transcript, and if you delete the word "um" in the transcript, it'll also try to seamlessly delete the "um" from the actual audio file. It's not perfect, but it works pretty well, and it kind of feels like magic to use.

With Overdub, Descript can go even further. Let's say you forgot something or you stumbled on a word. You can now make an AI copy of your own voice and insert new audio just by typing the text you want to appear. So let's say I say the sentence.

The iPhone came out in 2007, when Steve Jobs announced it as three things: a widescreen iPod, a revolutionary mobile phone, and a breakthrough internet communicator. All right, let's just hear that sentence back.

The iPhone came out in 2007, when Steve Jobs announced it as three things: a widescreen iPod, a revolutionary mobile phone, and a breakthrough internet communicator. Wait, sorry, I got that slightly wrong. He called it a breakthrough internet communications device.

I could re-record that whole thing, or I could just go into Descript, retype the transcript, and here's what I get. The iPhone came out in 2007, when Steve Jobs announced it as three things: a widescreen iPod, a revolutionary mobile phone, and a breakthrough internet communications device. It's not bad.

I wouldn't want to listen to a whole hour of that voice, I don't think. But in small bits, and especially in the context of something larger, I'm not even sure you'd always notice it. There are other apps out there, like Podcastle, doing the same kind of thing.

And I suspect you're going to see tools like this show up anywhere that people make audio. Okay, so that's the first use case. The second is kind of the flip side. There are also a bunch of tools out there using AI voices to read written stories out loud. The Atlantic, for instance, is working with a company called ElevenLabs to have an AI narrator read some of the stories on the website.

For years, the American approach to protein has been a never-ending quest for more. On average, each person in the United States puts away roughly three hundred pounds of meat a year.

Again, it's not perfect, and I don't know that it always sounds like a person, but I kind of can't believe how good it is. It wasn't that long ago, by the way, that these generated voices sounded like flat, toneless robots. Like, here's Sophia the robot from 2016, which was considered to be one of the most advanced robots ever created.

On The Tonight Show, I traveled to over twenty-five countries, appeared on the cover of Cosmopolitan magazine, met German Chancellor Angela Merkel and the actor Will Smith, and became Twitter friends with Chrissy Teigen. And here is that same thing Sophia said, which I just typed into the generator on the ElevenLabs website. I picked the voice named Grace, put this in there, clicked generate, and after about ten seconds, this is what came out.

I traveled to over twenty-five countries, appeared on the cover of Cosmopolitan magazine, met German Chancellor Angela Merkel and the actor Will Smith, and became Twitter friends with Chrissy Teigen.

I mean, it's night and day, right? I think you're going to start to see this everywhere: articles, whole websites, entire books, all read aloud, all using AI. And the product itself is actually starting to be pretty good. It is also, of course, a huge ethical and legal disaster. All the way back in 2019, again, before the tech was nearly as good as it is now, a bunch of publishers sued Audible over a feature called Audible Captions, which would read a book aloud to you as you looked at the page.

Seems like a normal, useful feature, right? Also seems like an existential threat to the entire idea and industry of audiobooks. Audible and the publishers settled in 2020, but that was only the beginning of the bigger questions here.

Some audiobook narrators have worried that their voices are being used to train algorithms that might someday replace them, and they're not really wrong. All this is not at all theoretical. If you go into the Apple Books app and search for AI narration, you will find a bunch of audiobooks that say they are narrated by Apple Books.

Apple says that means they are, quote, narrated by a digital voice based on a human narrator. Here's just a sample from a book called Language of Love by Christian average. A lot of these AI-narrated books are romance novels, by the way, for whatever it's worth.

And this one sounds, to my ears, shockingly like a human-read audiobook. He raised his fist and rapped on the solid wood. After about thirty seconds of silence, the distinct sound of the lock turning broke through. A woman of average height stepped into the sliver of an opening.

The other version of this that you might have heard about, or even encountered without knowing, is celebrity AI voices. Like, there was a pretty big backlash a couple of years ago when a documentary about Anthony Bourdain called Roadrunner, which came out after Bourdain's death, trained an AI model on his voice and then used it to generate narration for the film. The director, Morgan Neville, said that he only used the AI to say words that Bourdain himself had written, which was an ethical choice for him.

And I guess I can see where he's coming from. I still don't know whether any of that feels okay to me or awful. It's all really complicated.

And examples like this exist everywhere. AI helped Val Kilmer speak after he lost his voice to throat cancer. Lots of celebrities have trained AI to do things like give you Waze directions.

All this, too, is pretty controversial. One of the things Hollywood is on strike about right now is AI's potential to scan actors' likenesses so that they never need to be actually used in films again. Imagine an AI trained on Morgan Freeman's voice that could narrate every documentary ever without paying Freeman a dime.

This stuff all gets really messy, really fast. Okay. And then we have the third, and probably most newly mainstream, use case here: accessibility. Apple launched a new feature this year in iOS 17 called Live Speech, which you can use to type something and have it said out loud in phone calls or even in in-person conversations.

And when you pair it with Personal Voice, another new feature this year, the one I was testing out at the beginning of the show, you can create an AI version of your own voice just by recording yourself talking into your phone, and then use that to generate your live speech. It's all a little like the incredibly powerful system that the late Stephen Hawking had, which let him speak through a computer.

Eight had offered the list of predictions based on the analysis of the English language. Previous said. Okay, again, not to keep belaboring the point, but that video is from eight years ago. Think about how much better a system like Hawking's would sound today.

Although I have to say, I do love how much Hawking embraced that robotic sound and made it his own. It has become my trademark, and I wouldn't change it for a more natural voice. Samsung is building a similar feature with Bixby so that you can now speak with your own voice on your Galaxy phone. Works kind of the same way. And on a similar line, lots of people have used screen readers for years, which are able to speak aloud whatever is on a screen. Those are also getting vastly better, both because the voices are improving and because AI systems are getting much better at actually understanding the contents of web pages and apps and anything else you're looking at. All of that is super exciting.

And I'm also really into the idea of being able to use machine translation and these voices to be able to speak simultaneously in lots of languages. Someday, not that far from now, this podcast with my voice could be available in basically any language on Earth. That's really cool.

It's also a really hard problem, and we're definitely not there yet. But in general, being able to speak with your own voice even when you can't do that is a big deal. It's complicated ethically and morally and legally and in so many other ways, but it's a big deal nonetheless. We need to take a quick break, and then when we come back, we're going to investigate what it takes to actually make an AI voice and see if it's really possible to do well. We'll be right back.

Traveling to see your fave sports team is cool, but traveling with Amex Platinum for the big game is even better. Right this way. With access to dedicated Card Member entrances at select events, you can skip the line. And with access to the Centurion Lounge, you can catch the next game on the way home. That's the powerful backing of American Express. Terms apply. Learn more at americanexpress.com/withamex. Card Member entrance access not limited to Amex Platinum Card.

Support for the show comes from Crucible Moments, a podcast from Sequoia Capital. We've all had turning points in our lives where the decisions we make end up having lasting consequences. No one knows this better than the founders of some of today's most influential companies.

Crucible Moments lets listeners in on the make-or-break events that defined major companies like Dropbox, YouTube, Robinhood, and more, told by the founders themselves. Tune in to season two of Crucible Moments today. You can listen at cruciblemoments.com or wherever you listen to podcasts.

Welcome back. Let's make some voices, shall we? The idea with most of these systems is basically the same, because the way you train an AI model in general is just to give it lots and lots and lots of data and just kind of watch it churn through and see what it learns.

But in the systems I've been trying, there is one important distinction. Some tools, like Descript, just ask for a huge batch of audio. They'll give you a script if you want it. But really, the goal is just to upload hours and hours and hours of the sound of your voice and see what happens.

Others go one step further and will ask you to record yourself saying a series of specific, and often weird, and often thoroughly random things. So, like, when I opened up Podcastle to create my digital AI voice, it had a lot of really specific instructions. Okay, now time to do seventy sentences.

Here we go. Everything seems better in summer. I asked my dad if he could help me. Look at that lovely cat.

That happened. Seventy sentences later, I was done, and the next day I got an email saying my digital voice was ready. Here it is. Hi, I'm David Pierce, except not really.

I'm an AI bot, but I've been trained to sound like David Pierce. Is this convincing? That one is not great. Not super impressed, but let's give it a little more to work with and see how we do. While we're on the subject of ethically dubious things, I'm going to grab the text of one of my favorite TV moments ever. It's a Dwight Schrute speech from The Office.

What is my perfect crime? I break into Tiffany's at midnight. Do I go for the vault? No, I go for the chandelier. It's priceless. As I'm taking it down, a woman catches me. She tells me to stop. It's her father's business. She's Tiffany. I say no. We make love all night. In the morning, the cops come and I escape in one of their uniforms. I tell her to meet me in Mexico, but I go to Canada. I don't trust her. Besides, I like the cold.

Thirty years later, I get a postcard. I have a son, and he's the chief of police. This is where the story gets interesting. I tell Tiffany to meet me in Paris by the Trocadéro. She's been waiting for me all these years. She's never taken another lover. I don't care. I don't show up. I go to Berlin. That's where I stashed the chandelier.

I mean, it's just like sixty perfect seconds. I love it so much. Let's have AI David take a look at that speech.

Here goes. What is my perfect crime? I break into Tiffany's at midnight.

Do I go for the vault? No, I go for the chandelier. It's priceless.

As I'm taking it down, a woman catches me. She tells me to stop. It's her father's business. She's Tiffany. I say no. We make love all night. Okay, so I hear that and it's like, yeah, that sounds like me, but that also doesn't sound like a human, if that makes sense. In general, Podcastle was really easy to use, but I'm not terribly impressed with the outcome.

So now let's try Descript, which is, I think, in general, a significantly more sophisticated piece of audio software. It, too, is a process. Voices, yeah, create a voice.

Let's do a few recent Friday Vergecasts. How about that? Preparing and uploading. Okay, it finally uploaded all my stuff.

Let's go. Click submit training data, and that is ready to create your Overdub voice. Record your voice ID: press record and read a statement.

Okay, it's submitted. We're uploaded. It says: putting the finishing touches on your training project. Your voice is now training. We'll email you when it's done. Here we go.

I ended up submitting about four hours of my own voice to make this happen, because luckily I already have hours of my voice recorded from just being on The Vergecast. And like with Podcastle, it took a while to process everything. And then I got an email that my voice is ready, which is a very funny email to receive. Here's what that sounded like. Hi, I'm David Pierce.

The AI David Pierce, the Descript version of the AI David Pierce. How do I sound? All I hear in that is that I feel like that's what I might sound like if I'd gone to, like, a really fancy New England boarding school and also had a really, really nasty head cold.

But I don't think, in general, that one sounds like me at all, really. But let's try it again with our Dwight Schrute speech. What is my perfect crime? I break into Tiffany's at midnight.

Do I go for the vault? No, I go for the chandelier. It's priceless.

I have a son and he's the chief of police. This is where the story gets interesting. I tell Tiffany to meet me in Paris by the Trocadéro.

She's been waiting for me all these years. She's never taken another lover. I don't show up. I go to Berlin. That's where I stashed the chandelier.

The strange thing about this one is the intonation, the kind of ebb and flow of the sentences here. It's really not bad. It's a little stilted.

But it does move more or less like you would expect a human to talk. It just doesn't sound right, and it seems to skip a bunch of words and sentences when it doesn't quite know what to do. My takeaway is basically that Descript is fine for those little filler words like we were doing earlier, but that's about it.

I would say in general, so far, my takeaway is that these things aren't amazing, but they're decent, and honestly it's really easy to make them, like, much easier than I expected. So let's keep going and see a couple more. ElevenLabs, the company we've talked about a bunch so far, has the simplest process of any that I've seen.

You just sign up and upload a few minutes of audio. It actually explicitly says you only need about five minutes and that anything more is just overkill, and you're off and running. So I added some Vergecast stuff, about fifteen minutes in all because, you know, I'm an overachiever, and then just waited a while. This one only took a couple of minutes, and we were up and running.

Hi, it's AI David Pierce again, this time made by ElevenLabs. But I'm still me, sort of. I think you know what I mean. I'm not going to lie, that one kind of gave me goosebumps. It goes a little fast, like, I don't think that's how you'd say that sentence, but this is way better than anything else I've tried or even heard. And it took a grand total of about ninety seconds to put together. What's weird, though, is that it's not always this good.

I clicked generate again with the same text, and it spit back something subtly different, and I think slightly worse. Hi, it's AI David Pierce again, this time made by ElevenLabs. But I'm still me, sort of. I think you know what I mean. Again, really good, better than anything else we have tried, but not quite as good as that first one, which is odd.

It's just that pause in the first one right before the words "sort of." It's like exactly how I would have said that in real life. I still kind of can't get over it. It freaked me out. Anyway, let's try this model with our Dwight Schrute speech.

What is my perfect crime? I break into Tiffany's at midnight. Do I go for the vault? No, I go for the chandelier. It's priceless. As I'm taking it down, a woman catches me. She tells me to stop. It's her father's business. She's Tiffany. I say no. We make love all night.

In the morning, the cops come and I escape in one of their uniforms. I tell her to meet me in Mexico, but I go to Canada. I don't trust her. Besides, I like the cold. Thirty years later, I get a postcard. I have a son, and he's the chief of police.

This is where the story gets interesting. I tell Tiffany to meet me in Paris by the Trocadéro. She's been waiting for me all these years. She's never taken another lover. I don't care. I don't show up. I go to Berlin. That's where I stashed the chandelier.

That one's not perfect, and it seems to me that it got a little worse as it went on. The cadence got a little less human and a little more just kind of robot monotone. Everything takes the same time to say, you know what I mean? But I bet I could use that voice on almost anyone for a minute or two and get away with it. What is my perfect crime? Okay, let's try one more.

This is the Apple Personal Voice feature, the one that's going to come to lots of people's iPhones. I suspect a lot of people are going to set this up pretty soon. This one took, by the way, the longest by far of any to set up. So the first thing I have to decide: am I going to share across devices? Sure.

Do I want to allow apps to request to use it? Why would I want that? Creating my Personal Voice: I'm going to read one hundred and fifty phrases aloud, which may take about fifteen minutes.

Then it's going to generate it. Let's go do that right now. Let's see.

That was the best movie I've ever seen. Are you still hungry? It's a beautiful day today.

It's an extension of the nearby sea. Her style of painting shows the influence of the French artists. Then, once it was set up, it also took the longest to finish.

Your phone has to be charging and not in use, because all the training happens on your device and takes up a huge amount of energy. I love that it happens on device. That's good for privacy reasons, it's good for a lot of reasons, but it takes a while. So it took a couple of days, but I eventually had my voice ready to go.

Hi, I'm David Pierce's iPhone voice. I'm kind of like David, but kind of not.

Are we our phones? Are our phones us? Ordinarily, I think I would have been impressed by that. But after hearing that ElevenLabs one, I'm kind of meh on this one.

Let's try the Dwight Schrute test. Thirty years later, I get a postcard. I have a son, and he's the chief of police.

This is where the story gets interesting. I tell Tiffany to meet me in Paris by the Trocadéro. She's been waiting for me all these years.

She's never taken another lover. I don't care. I don't show up.

I go to Berlin. That's where I stashed the chandelier. Still pretty okay, right? That kind of works.

But nobody's going to confuse this with human David. And in general, I actually think that might be okay. AI voices are one of those things where the better they get, the stranger they get.

Seriously, that feeling I got listening to the first time ElevenLabs spit out that thing saying "I'm David Pierce" was genuinely kind of disconcerting. It raises all these big questions that, like with so many things about AI, we've really only begun dealing with.

What does it mean that I can create a replica this good, and that they're only going to get better and easier over time? What responsibilities do I have as the person who made it and is using it, even though it's my voice? What responsibilities do other people have? What responsibilities do services who make these voices for me have, now that they have this incredibly personal thing of mine on their servers?

We're having a lot of debates over AI music right now, obviously, as artists' voices are being used to train models that can make pretty convincing songs in just about anyone's voice. You go on YouTube and you can hear AI Taylor Swift singing almost anything.

You can hear AI Patrick from SpongeBob singing almost anything. All of that is going to build, like, a decade of interesting court cases and ethical debates. But those same issues are coming just for you and me in our everyday lives.

How do we use these tools? How do we talk about the fact that they exist and how we're using them? Is it even possible to get the good, helpful, democratizing things from them without all the deepfakes and downsides?

I don't know, but I do know it's well past time we started talking about it, because the tech is really good right now and it's getting better really fast. All right, that is enough AI talk for one day.

We're going to be back next week to talk even more about AI music, because I think that is one of the most interesting things in this space right now, not just because of the big heady debate about it, but because I think there are really interesting ways that AI can both help people make music and totally change all of our ideas about what music even is. It's going to be fascinating. We will also be back in this feed on Wednesday, too, with a big episode all about this week's Apple event. But until then, AI David, can you do the credits? The show is produced by Andrew Marino and Liam James. Brooke Minters is our editorial director of audio.

The Vergecast is a Verge production and part of the Vox Media Podcast Network. If you have thoughts, questions, ideas, or anything else, you can email vergecast theverge dot com or call the Verge hotline at 866-VERGE11. We'll be back on Wednesday with a special show all about the Apple event, and with all the rest of this week's news on Friday. We'll see you then. Rock and roll.

Okay, not bad. Pretty good, AI David, for your first try.

But it's vergecast at theverge.com for the emails. And we've really got to work on the sign-off. Say it with me: rock and roll.

Support for this episode comes from AWS. AWS generative AI gives you the tools to power your business forward with the security and speed of the world's most experienced cloud.

Support for the show comes from Klaviyo. You're building a business. Klaviyo helps you grow it. Klaviyo's AI-powered marketing platform puts all your customer data, plus email, SMS, and analytics, in one place. With Klaviyo, tinned fish phenom Fishwife delivers real-time, personalized experiences that keep their customers hooked. They've grown seventy times revenue in just four years with Klaviyo. Now that's scale. Visit klaviyo.com to learn how brands like Fishwife build smarter digital relationships with Klaviyo.