The Latest from OpenAI is a Total Game Changer

2025/4/16

Lex Fridman Podcast of AI

People
Jaeden Schafer
Topics
I'm personally very excited to see that OpenAI has released upgraded versions of its transcription and speech-generation AI models. These upgrades have far-reaching implications for the entire AI ecosystem because they're designed specifically for developers. I'll dig into the details. In short, OpenAI has upgraded its transcription and speech-generation models, which directly affects the AI Box software I'm building, and I'm sure many other developers are in the same position.

The upgrades are delivered through the API, and performance is significantly better than previous versions. I've done a lot of testing at my software company, and the results are very impressive. Transcription lets you upload an audio file and convert it to text, and the reverse also works: much like making captions, you can supply text to generate audio or supply audio to generate text. The previous model, I believe, was called Whisper, and it's very cool.

As AI models keep advancing, we're getting closer to building automated systems that can complete tasks independently, so-called AI agents. For many use cases, voice is essential. Imagine an AI travel agent you can talk to that gives recommendations based on your needs. Pure text interaction could work, but I think voice is indispensable for making AI agents feel real. OpenAI has long been a pioneer in voice models; their consumer app already has very powerful ones. Now they're making those models available to developers through the API, which is very exciting.

Beyond generating generic speech, the new models produce more realistic, more nuanced voices; I'll demonstrate later. More importantly, they're more steerable. As a developer, you can have the AI speak in all kinds of styles: imitating a mad scientist, using a calm tone, or even sounding out of breath as if it just finished a run. This has been available in OpenAI's app for months, but only now is the API being opened to developers. I think that's great, because it means developers can build this nuanced voice technology into all kinds of applications.

OpenAI's new text-to-speech model, GPT-4o mini TTS, is more nuanced, more realistic, and easier to control; developers can direct how speech is delivered in a more natural way. Jeff Harris, a member of OpenAI's product team, said in an interview that you don't always want a flat monotone voice in every context. In customer support, for example, if a mistake has been made, you might want the voice to sound apologetic. They believe developers and users want to control not just what is spoken, but how it's spoken.

OpenAI's new speech-to-text models, GPT-4o Transcribe and GPT-4o Mini Transcribe, replace the earlier Whisper model and were trained on a more diverse, high-quality audio dataset. OpenAI claims they perform well even in noisy environments. My guess is that a lot of the training data came from YouTube; that may be controversial, but I'm still excited about the progress.

According to OpenAI's internal benchmarks, the new speech-to-text models are significantly more accurate. The word error rate approaches 30% for Indic and Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada. That's not perfect, but for languages other than English it's a big step forward. Unlike in the past, however, OpenAI is not open-sourcing the new speech-to-text model. Their explanation is that it's much larger than Whisper and therefore not a good candidate for an open release, a departure from their practice of open-sourcing Whisper that has drawn some criticism. They say the model is too large to run on a personal computer, so they want to approach open-sourcing carefully. Commercial interests may also be a factor. Either way, as a developer, I'm glad to be able to use this technology.


Shownotes Transcript


OpenAI has made some big new releases, and there's going to be a lot of impact across the entire AI ecosystem because they made them for developers. So I'm going to be getting into all of that. Essentially, they have upgraded their transcription and voice-generating AI models. This is something I personally have embedded into my own software, AI Box, which I'm building, and I know a lot of other developers use these models too.

I'll show you some demos of what this actually sounds like, because I have been very, very impressed. Overall, when OpenAI makes a big move like this, it's a big deal, because it gets embedded into so many other software products and services. So I'll be talking about all of that. Before we get into the episode today, I wanted to mention: if you've ever wanted to grow and scale your business using AI tools, you need to join my AI Hustle School community. Every single week, I release an exclusive video that I don't share anywhere else,

sharing how I use AI tools to grow and scale my companies: the workflows, the numbers, everything I can't really share publicly. It's all in there. We have over 300 members, and the thing I love about it is the range of perspectives: people who have started $100 million companies and people who are just getting started on their entrepreneurial journey. So no matter where you're at, you're going to find other people who can share great insights about what AI tools they're using

and really help you kickstart your journey. If you're interested: I used to have this at about $100 a month, and I have dropped the price to $19 a month, so it's discounted right now. It's a great deal, and if I ever raise the price in the future, locking in the price now means it won't be raised on you. There's a link in the description. I'd love to have you join and see you in the school community. All right, let's get into what OpenAI is doing.

Like I mentioned, they have upgraded their transcription and voice-generating models, and specifically they're doing this in their API for developers. Based on what I've listened to, this is much better than their previous versions, and I've done a lot of testing for my own software company. Essentially, transcription means you upload an audio file and it creates text, like doing captions, and it works the other way too: give it text and it generates audio, or give it audio and it generates text. It goes back and forth. The previous transcription model was called Whisper, I think, and it's really, really cool.
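To make that concrete, here's a minimal sketch of both directions using OpenAI's Python SDK. The model names (gpt-4o-transcribe for speech-to-text, gpt-4o-mini-tts for text-to-speech), the voice, and the file names are my assumptions based on this announcement, so check the current API docs for the exact identifiers.

```python
# Minimal sketch of both directions, assuming the OpenAI Python SDK
# and the model names from the announcement (verify against the docs).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Speech -> text: upload an audio file, get a transcript back.
with open("episode.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # assumed successor to whisper-1
        file=audio_file,
    )
print(transcript.text)

# Text -> speech: supply text, write the synthesized audio to a file.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",  # one of the built-in voice presets
    input="Welcome back to the show.",
)
speech.write_to_file("welcome.mp3")
```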

One thing I do want to mention here: as they roll this out, we're getting closer and closer to the agentic vision a lot of companies and AI labs are talking about, building automated systems that go and accomplish tasks independently. And what I think is really important is that for a lot of these things, you need a voice.

Imagine an AI travel agent that I can talk to about my trip and that gives me recommendations. If you're just doing that by text, which you totally can, technically it would work and get things done. But for so many of these agents to feel realistic, you need that voice. And OpenAI has been really pioneering here.

They've been on the frontier with their voice models, and on their consumer app you have these really powerful voice models you can chat with. But those didn't always translate into what developers could get their hands on. So it's cool that now they have this API where you're able to do that, and they've improved a lot of things.

Beyond just generating a generic voice, it sounds quite realistic. I'll give you a demo in a second, but I think this is amazing. The other thing I find really interesting is what they actually said to TechCrunch in an interview, quote: "We're going to see more and more agents pop up in the coming months. And so the general theme is helping customers and developers leverage agents that are useful, available, and accurate."

I believe that was OpenAI's head of product, Olivier Godement, talking through what a bunch of these updates were. A lot of these are directed at business customers, right? Your average person using ChatGPT on their phone might shrug: "They came up with a better API for text-to-speech and speech-to-text, whatever." But the reason I'm so excited, and why I think it's important for you to know, is this:

Whether you're a developer or not, every application you use that ties into this ecosystem, and OpenAI's is the biggest ecosystem of AI models, is going to start using these new models. So everything we use is getting better, and all of the agents we're going to be using in the coming months and years will be relying on this. For me, that's why I geek out about it, and I think it's cool.

Here are the updates they've specifically announced. Their new text-to-speech model, GPT-4o mini TTS, is now more nuanced and realistic-sounding, and it's also, in their words, more steerable compared to their previous speech models.

Essentially, as a developer, you can now direct it in natural language. You can say "speak like a mad scientist" or "use a really serene voice," or, one I've tried, "act like you just went on a run and you're super out of breath." It can talk in all of those variations. What's interesting is this was available in the app months ago, but it wasn't available for developers to build on. So OpenAI kind of had a monopoly on this really cool tech, which, I mean, they made it, so it's totally fair. But it's really, really exciting.

Developers are now going to be able to start building these really nuanced voices into everything else; anyone will be able to use this. Okay, I'm going to play you a sample of what they're calling a true-crime-styled voiceover.

Okay. And then they also have a sample of what they call a professional female voice, a very serious voice. I think this is really amazing, and the cool thing is that it's so steerable.

If I want this type of person to speak in this type of way, acting like a gym coach, really rah-rah motivational, it will change how the thing talks. To me, that's so exciting. It goes beyond what we had in the past, which was a dropdown: pick your favorite of these seven or eight voices, and that was it.

Now you get to decide what the voice is. It's trained on so many different styles and voices, and you can tap into all of them. So I think this is very cool.
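As a rough illustration of what that steering could look like in code, here's a sketch that passes a style prompt alongside the text. It assumes the new model accepts an instructions parameter that controls delivery, as described in the announcement; the gym-coach wording is my own example, not an official sample.

```python
from openai import OpenAI

client = OpenAI()

# Assumed: gpt-4o-mini-tts takes an `instructions` string that steers
# delivery (tone, pacing, persona) separately from the spoken text.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="One more rep! You've got this!",
    instructions=(
        "Speak like an energetic gym coach: loud, upbeat, motivational, "
        "and slightly out of breath, as if you just finished a run."
    ),
)
speech.write_to_file("coach.mp3")
```

The key design point is that the spoken text and the delivery directions travel separately, so the same line can be re-rendered in any style without changing the words.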

Here's what they specifically said. Jeff Harris, a member of the product staff over at OpenAI, said in an interview, quote: "In different contexts, you don't just want a flat monotone voice. If you're in a customer support experience, you want to make the voice apologetic because you've made a mistake, and you can actually have the voice have that emotion in it. Our big belief here is that developers and users want to really control not just what is spoken, but how things are spoken."

I love this concept, right? If I'm calling customer support and I'm really mad, they could literally run sentiment analysis on what I'm saying and go: okay, this person is upset, change your tone to be more apologetic; or, this person seems very happy, match their mood or vibe. And it goes further, to the point where, I know this sounds terrible, but this will happen, so I want to put it on your radar: if I wanted to politically polarize people in a country and ran some robocaller using this, I could tell it, this person's really mad, get mad at them back, try to rile them up. I'm sure

this is one of the things they're trying to stop from happening. But it's a real possibility, so I'm putting it out there as something people will attempt. Am I excited about that one? No, and I think they'll probably shut it down. But be aware: as these agents get out there, their ability to manipulate people, or to help people, improves. We have to build our own safeguards and our own understanding of how these things work. It's very, very interesting what this will be capable of in the future.
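To sketch the legitimate version of that idea, a support bot could map a detected mood to a delivery instruction. Everything here is hypothetical: the mood labels, the tone mapping, and the detect-then-respond flow are my own illustration of the pattern, not an OpenAI feature.

```python
from openai import OpenAI

client = OpenAI()

def support_reply(reply_text: str, caller_mood: str) -> bytes:
    """Render a support reply whose delivery matches the caller's mood.

    caller_mood would come from an upstream sentiment-analysis step;
    the mapping below is a made-up illustration of the pattern.
    """
    tone = {
        "angry": "Sound sincerely apologetic, calm, and unhurried.",
        "happy": "Sound warm and upbeat, matching the caller's energy.",
    }.get(caller_mood, "Sound neutral and professional.")

    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts",  # assumed model name from the announcement
        voice="alloy",
        input=reply_text,
        instructions=tone,
    )
    return speech.content  # raw audio bytes to play back to the caller

audio = support_reply("I'm so sorry about the mix-up with your order.", "angry")
```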

So their new speech-to-text models, GPT-4o Transcribe and GPT-4o Mini Transcribe, are essentially replacing the Whisper model they've had for a long time. They said they've, quote, trained them on "a diverse, high-quality audio dataset." They never tell you exactly where they got their

dataset from. They say the models even hold up in, quote unquote, "chaotic environments," which is interesting. What I would assume, because they've seemed reluctant to say anything about this in the past, is that a lot of it was probably YouTube. I mean, you can imagine it: someone filmed a YouTube video of people arguing, someone filmed a YouTube video of someone apologizing, someone filmed a YouTube video of literally everything in the world, and you just grab the audio from that. That's my assumption of how they would get such a powerful model, based on some of the

executives who said, "oh, I don't really know if we used YouTube," and later resigned, aka Mira Murati. I would just say it's almost certainly been trained on YouTube. So anyways, am I mad about that? I don't know. But

I'm stoked that the technology is better. Here's what Harris also said about this, quote: "These models are much improved versus Whisper on that front. Making sure the models are accurate is completely essential to getting a reliable voice experience, and accurate in this context means that the models are hearing the words precisely and aren't filling in details that they didn't hear." So they're talking about not making these things hallucinate. They're doing a bunch of really cool things here, and according to their own internal benchmarks,

it is much more accurate. There's a metric they call word error rate, and it's approaching 30% right now for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada. That means about three out of every ten words the model gives you will differ from a human transcription in those languages. So that's not fantastic.
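For context on what that figure means: word error rate is just word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. A quick self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Three wrong words out of a ten-word reference -> 0.3, i.e. the ~30% cited.
ref = "the quick brown fox jumps over the lazy sleeping dog"
hyp = "the quick brown fox jumped over a lazy sleeping dogs"
print(word_error_rate(ref, hyp))  # 0.3
```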

But for languages other than English, this is still obviously a big improvement. Now, in a break from what they've done in the past, OpenAI is not planning to make the new transcription model openly available. Historically they released new versions of Whisper for commercial use under an MIT license, and they are not doing that this time.

They said that because this model is, quote, "much bigger than Whisper," it's not a good candidate for an open release. So they're not open-sourcing it. This continues a pattern where they keep making things more and more closed source and less and less open source, which a lot of people, Elon Musk among them, have been upset about; there's a lot of drama there. So I do think this is very interesting.

They also said, and this is a quote directly from them: "They're not the kind of model that you can just run locally on your laptop, like Whisper. We want to make sure that if we're releasing things in open source, we're doing it thoughtfully, and we have a model that's really honed for that specific need. And we think that end-user devices are one of the most interesting cases for open-source models." AKA: it's too big and powerful, you can't run it on your computer, so they're not releasing it open source.

And of course, they make more money when they don't release it open source, so there's that element of it too. You could say they're trying to save you from running it on hardware that can't handle it, or you could say they're trying to make more money; that's up to you, however you want to interpret it. In any case, I'm excited to have access to this regardless. Yes, I'm happy to pay for it; as a developer, that's what I would expect. But I'm really happy to have the ability to access this technology. Very exciting, big update from them.

Thanks so much for tuning in. If you enjoyed the episode today, if you learned anything new, I'd love a review on the podcast. It would mean the world to me, and I really appreciate all of the incredible people who have reviewed AI Chat over the years. And if you want to join the AI Hustle School community, there is a link in the description. I would love to help you grow and scale your business or your career using AI tools, something I'm passionate about, and I've made a video about it every week for over a year now. It's been a ton of fun. Thanks so much for tuning in, and I will catch you next time.