New OpenAI Releases Are Reshaping the AI Landscape

2025/4/18

AI Education

People
Jaeden Schafer
Topics
Jaeden Schafer: I'll be discussing OpenAI's recent major releases, which have far-reaching effects on the entire AI ecosystem because they were built specifically for developers. The main upgrades are to their transcription and voice-generation AI models, which I've already integrated into AI Box, the software I'm building, and which I'm sure many others use as well. These upgrades bring significant improvements: the transcription feature lets you upload an audio file and convert it to text, and vice versa. This is what's been known as Whisper, and it's very powerful. As AI models develop, we're getting closer and closer to building automated systems that can complete tasks independently, so-called AI agents, and voice is essential to building more lifelike AI agents. OpenAI has long led in voice models, and they're now making these powerful voice models available to developers through their API, which lets developers create more natural, more realistic voice interactions. OpenAI says more and more AI agents will appear in the coming months, with the goal of helping customers and developers leverage agents that are useful, available, and accurate. Their newly released text-to-speech model, GPT-4o mini TTS, sounds more natural and lifelike and is more steerable, allowing developers to control how the voice expresses itself so that it fits the context; for example, you can have it mimic a mad scientist, use a calm tone, or even sound out of breath after a run. They also upgraded their speech-to-text models, GPT-4o Transcribe and GPT-4o Mini Transcribe, which replace the earlier Whisper model and were trained on diverse, high-quality audio datasets for higher accuracy. While the word error rate remains high for some languages (such as Indic languages), performance in others, such as English, has improved markedly. Unlike in the past, however, OpenAI is not open-sourcing its new transcription models, explaining that they are too large to run on a personal computer and therefore are not suitable for an open release. That may be driven by commercial interests, but it may also be about ensuring the models' quality and reliability. Even so, as a developer, I'm excited to have access to this technology.

Deep Dive

Chapters
OpenAI's latest releases significantly upgrade transcription and voice-generating AI models, primarily benefiting developers. These improvements are integrated into various software and services, marking a substantial advancement in the AI field. The AI Hustle School community is also highlighted as a resource for those seeking to leverage AI tools for business growth.
  • OpenAI upgraded transcription and voice-generating AI models for developers.
  • Improvements embedded in AI Box and other software.
  • AI Hustle School community offers exclusive AI business growth resources.

Shownotes Transcript


OpenAI has made some big new releases, and with these releases there's going to be a big impact across the entire AI ecosystem, because they made them for developers. So I'm going to be getting into all of that. Essentially, they have upgraded their transcription and their voice-generating AI models. This is something that I personally have embedded into my own software, AI Box, which I'm building, and that I know a lot of other people use.

I'll show you some demos of what this actually sounds like, because I have been very, very impressed. But overall, you know, when OpenAI makes a big move like this, it's a big deal, because it gets embedded into so many other software and services. So I'll be talking about all of that. Before we get into the episode today, I wanted to mention: if you've ever wanted to grow and scale your business using AI tools, you need to join my AI Hustle School community. Every single week, I release an exclusive video that I don't share anywhere else,

sharing how I use AI tools to grow and scale my companies: the workflows, the numbers, everything I can't really share publicly. It's all in there. We have over 300 members. And the thing that I love about it is we have people that have started $100 million companies and people that are just getting started on their entrepreneurial journey. You get a lot of perspectives in there. So no matter where you're at, you're going to find other people that can share great insights about what AI tools they're using

and really help you kickstart your journey. So if you're interested: I used to have this at $100 a month, and I have dropped the price to $19 a month. So it's discounted right now; it's a great deal. And if I ever raise the price in the future, locking in the price now means it won't be raised on you. There's a link in the description. I'd love to have you join and see you in the school community. All right, let's get into what OpenAI is doing.

Like I mentioned, they have upgraded their transcription and voice-generating models, and specifically they're doing this in their API for developers. Based off of what I've listened to, this is much better than their previous versions; you know, I've done a lot of testing for my own software company. And essentially, transcription means that you upload an audio file and it creates text, so it's like doing captions, and it works the other way too: you give it text and it generates audio. It goes back and forth, right? The transcription side is what's been called Whisper, I think, and it's really, really cool.
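If you've never touched this API, here's roughly what the speech-to-text direction looks like in code. This is a minimal sketch assuming the current OpenAI Python SDK and an API key in your environment; the file name is just a placeholder, and the model names are the ones discussed in this episode.

```python
# Minimal sketch: speech-to-text with OpenAI's audio API.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment;
# "interview.mp3" is a placeholder for your own audio file.
from openai import OpenAI

client = OpenAI()

with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" / "whisper-1"
        file=audio_file,
    )

print(transcript.text)  # the recognized text, e.g. for captions
```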

One thing that I do want to mention here is that as they're rolling this out, we're getting closer and closer to the point where a lot of companies and AI models are talking about agents: what their agentic vision is, how they're going to build these automated systems that go and accomplish all of these tasks independently. And so what I think is really important is that, for a lot of these things, you need a voice. Like, imagine:

Oh, I want like an AI travel agent that I can talk to about my trip and it can give me recommendations. If you're just doing that by text, which you totally can, technically that would work and it can get things done. But I just feel like, for so many of these agents to feel more realistic, you need that voice. And so OpenAI has been really pioneering

on the frontier with their voice models: on their consumer app, you have these really powerful voice models that you can chat with. But those didn't always translate into what developers could get their hands on. So it's cool that now they have this API where you're able to do that, and they've improved a lot of things

beyond just being able to generate a generic voice. It sounds quite realistic. I'll give you a demo in a second, but I think this is amazing. And the other thing that I think is really interesting here is what they actually said to TechCrunch in an interview, quote: "We're going to see more and more agents pop up in the coming months. And so the general theme is helping customers and developers leverage agents that are useful, available, and accurate."

And I believe it was OpenAI's head of product, Olivier Godement, who was talking through a bunch of these updates. You know, a lot of these are directed at business customers, right? Your average person using ChatGPT on their phone is like, whatever, they came up with a better API for text-to-speech and speech-to-text. But the reason why I'm so excited, and why I think it's important for you to know, is because

whether you're a developer or not, every application you use that ties into this ecosystem, and OpenAI's models are the biggest ecosystem, is going to start using these new models. So everything we use is getting better and better, and all of the agents we're going to be using in the coming months and years will be relying on this. For me, that's why I geek out about it, and I think it's cool.

So here are the updates they've specifically called out. Their new text-to-speech model, GPT-4o mini TTS, is now more nuanced and realistic-sounding, and it's also, as they say, more steerable compared to their previous speech models. So essentially,

as a developer, you can now get it to say things in a much more natural way. You can say, speak like a mad scientist, or use a really serene voice, or, like I've done, act like you just went on a run and you're super out of breath. It can talk in all of those variations. And what's interesting is this was available in the ChatGPT app as of months ago, but it wasn't available for developers to roll out. So OpenAI kind of had a monopoly on this really cool tech, which, I mean, they made it, so it's totally fair. But it's really, really exciting

that developers are now going to be able to start building these really nuanced voices into everything else. Anyone will be able to use this. Okay, I'm going to give you a sample of what they're saying is a true-crime-styled voiceover.

Okay. And then they also have a sample of what is a female professional voice; it's a very serious kind of female voice talking about stuff. So I think this is really amazing. And the cool thing is that it's so steerable. So if I say, I want

this type of person to speak in this type of way, and I want them to act like a gym coach, really rah-rah, motivational, it will change how this thing is talking. To me, that's so exciting. It's beyond what we've had in the past: a dropdown where you pick your favorite of these seven or eight voices.

Now you get to decide what the voice is. It's trained on so many different styles and voices that it knows, and you can tap into all of them. So I think this is very cool.

What they specifically said, and this is Jeff Harris, a member of the product staff over at OpenAI, in an interview, quote: "In different contexts, you don't just want a flat, monotone voice. If you're in a customer support experience, you want to make the voice apologetic because you've made a mistake, and you can actually have the voice have that emotion in it. Our big belief here is that developers and users want to really control not just what is spoken, but how things are spoken."
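On the developer side, that kind of steering shows up as a plain-text `instructions` field on the speech endpoint. Here's a minimal sketch, again assuming the current OpenAI Python SDK; the voice name and the instruction string are just examples, not the only options.

```python
# Minimal sketch: steerable text-to-speech with the gpt-4o-mini-tts model.
# The `instructions` field controls *how* the text is spoken, separately
# from *what* is spoken; the voice and wording below are illustrative.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the built-in voices
    input="I'm so sorry about the mix-up with your order.",
    instructions="Speak in a warm, sincerely apologetic customer-support tone.",
) as response:
    response.stream_to_file("apology.mp3")  # write the synthesized audio
```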

I love this concept, right? Like, if I'm calling customer support and I'm really mad, they could literally run

sentiment analysis on what I'm saying and go: okay, this person is upset, change your tone to be more apologetic; or, this person seems very happy, match their mood or vibe. And it goes further than that, to the point where, and I know this sounds terrible, this will happen, so I want to put it on your radar: if I wanted to really politically polarize people in a country, I could hook a robocaller up to this and tell it, this person's really mad, get mad at them back, try to rile them up. I'm sure this

is one of the things they're trying to stop from happening. But imagine that as a possibility; I'm putting it out there as something people will try to do. Am I excited about that? No, and I think they'll probably shut it down. But I just say be aware, because as these agents get out there, their ability to manipulate people, or to help people, improves. We have to build our own safeguards and understandings of how these things work. But it's very, very interesting what this will be capable of doing in the future.
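To make that concrete, here's a hypothetical sketch of the tone-matching idea. This isn't a built-in OpenAI feature; the sentiment labels, the classification prompt, and the mapping are all invented for illustration.

```python
# Hypothetical sketch: classify a caller's sentiment, then pick a TTS
# steering instruction to match. Labels, prompt, and mapping are invented.
from openai import OpenAI

client = OpenAI()

TONE_BY_SENTIMENT = {
    "angry": "Speak in a calm, sincerely apologetic tone.",
    "happy": "Speak in an upbeat, friendly tone that matches their energy.",
    "neutral": "Speak in a clear, professional tone.",
}

def tone_for(user_message: str) -> str:
    # Ask a small chat model for a one-word sentiment label.
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Label the sentiment of this message as exactly one "
                       f"of: angry, happy, neutral.\n\n{user_message}",
        }],
    )
    label = result.choices[0].message.content.strip().lower()
    return TONE_BY_SENTIMENT.get(label, TONE_BY_SENTIMENT["neutral"])

# The returned string would then go into the `instructions` field of the
# text-to-speech call, as in the sketch earlier.
```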

So, their new speech-to-text models, GPT-4o Transcribe and GPT-4o Mini Transcribe, are essentially replacing their long-standing Whisper model. And they said that they've, quote, "trained it on a diverse, high-quality audio dataset." They never tell you exactly where they got their

dataset from. They say they even trained it in very, quote-unquote, chaotic environments, which is interesting. Because they were kind of, I don't know, scared to say anything about this in the past, what I would assume is that a lot of this was YouTube. I mean, you can imagine: someone filmed a YouTube video of people arguing, someone filmed a YouTube video of someone apologizing, someone filmed a YouTube video of literally everything in the world, and they just grabbed the audio from that. That's my assumption of how they would get such a powerful model, based off of some of the

executives who said, oh, I don't really know if we used YouTube, and then, like, resigned, aka Mira Murati. I would just say this has almost definitely been trained off of YouTube. So anyways, am I mad about that? I don't know. But, uh,

I'm stoked that the technology is better. Here's what Harris also said about this, quote: "These models are much improved versus Whisper on that front. Making sure the models are accurate is completely essential to getting a reliable voice experience, and accurate in this context means that the models are hearing the words precisely and aren't filling in details that they didn't hear." So they're talking about not making these things hallucinate. They're doing a bunch of really cool things, and according to their own internal benchmarks,

it is much more accurate. On what they're calling word error rate, it's at about 30% right now for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada (and note that word error rate can actually exceed 100%, since inserted words count as errors too). That means three out of every ten words the model gives you are going to be different from a human transcription in those languages. So that's not fantastic.
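If word error rate is new to you: it counts the substitutions, deletions, and insertions needed to turn the model's transcript into the human reference, divided by the number of reference words, which is also why it can exceed 100%. A self-contained sketch of the computation:

```python
# Word error rate = (substitutions + deletions + insertions) / reference words,
# computed with classic dynamic-programming edit distance over words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match / substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# A 30% WER means roughly 3 edits per 10 reference words:
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```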

But in other languages, like English, it's obviously much better. Now, and this is not what they've done in the past, OpenAI is not planning to make this transcription model openly available. They historically released new versions of Whisper for commercial use under an MIT license, and they're not doing that this time.

So they said that because this is, quote, "much bigger than Whisper," it's not a good candidate for an open release. So they're not open-sourcing it. This is in line with what they've been doing lately, making things more and more closed source and less and less open source. It's what a lot of people, Elon Musk among them, have been upset about; there's a lot of drama there. So I do think this is very interesting.

They also said, and this is a quote directly from them: "They're not the kind of model that you can just run locally on your laptop, like Whisper. We want to make sure that if we're releasing things in open source, we're doing it thoughtfully, and we have a model that's really honed for that specific need. And we think that end-user devices are one of the most interesting cases for open-sourcing models." AKA, they're saying: it's too big and powerful, you can't run it on your computer, so we're not releasing it open source. Also,

they make more money when they don't release it open source, so there's that element of it. You could say maybe they're trying to save you from running it on hardware that's incapable, or you could say they're trying to make more money; that's up to you, however you want to interpret it. In any case, I'm excited about having access to this regardless. Yes, I'm happy to pay for it; as a developer, that's what I would expect. But I'm really happy to have the ability to access this technology. Very exciting, big update from them. If you enjoyed the episode today, if you learned anything new, I'd love a

review on the podcast. It would mean the world to me; I really appreciate all of the incredible people that have reviewed AI Chat over the years. And if you want to join the AI Hustle School community, there is a link in the description. I would love to help you grow and scale your business or your career using AI tools, something I'm passionate about, and I've been making a video about it every week for over a year now. It's been a ton of fun. Thanks so much for tuning in, and I will catch you next time.