The real-time API allows developers to integrate OpenAI's voice model into applications, enabling immediate responses during conversations. Unlike previous methods that involved latency due to transcription and processing, this API streams audio as the response is generated and starts replying the moment you finish speaking, making interactions feel more natural. This is particularly useful for applications like language learning and customer service, where real-time feedback is crucial.
Vision fine-tuning allows companies to upload annotated image datasets to train OpenAI's models for specific tasks, such as identifying tumors in medical scans. By fine-tuning with specialized data, the model becomes more accurate in recognizing specific patterns, like tumors in X-rays, compared to its general image recognition capabilities. This is a significant advancement for industries requiring precise visual analysis.
Model distillation involves fine-tuning smaller, cost-effective models using the outputs of larger, more advanced models like o1. This allows developers to achieve high-quality responses at a fraction of the cost and computational resources. For example, a smaller model like GPT-4o mini can be trained to mimic the performance of o1, making it ideal for repetitive tasks and cost-sensitive applications.
Prompt caching automatically discounts tokens for previously seen inputs in a conversation, reducing costs by 50%. Since the context of a conversation remains largely unchanged with each new message, caching eliminates the need to reprocess the same data. This is particularly beneficial for long conversations, where the cumulative cost of tokens can become significant.
Plus and free users in the EU are excluded from the Advanced Voice rollout due to stringent AI regulations under the EU's AI Act. Compliance with these regulations makes it challenging for OpenAI to offer certain features in the EU. This has led to frustration among EU users, who feel they are missing out on cutting-edge AI advancements available elsewhere.
OpenAI just hosted their Dev Day 2024. Now, this is something that a lot of people look forward to because essentially they're pitching new updates to developers, but really they're new updates that everyone can use. And I think the more exciting thing here is that these are a lot of amazing features that are getting embedded in all of the software we use every single day. So today on the podcast, I'm gonna be covering all of the new updates.
Specifically, they've introduced a real-time voice API, fine-tuning with vision, model distillation, and something they're calling prompt caching. These are amazing updates, and I'm going to break down all of them on the podcast today. Before we get into it, I want to say: if you are not on the waitlist for AI Box, head over to AIbox.ai. It's my very own AI marketplace and app builder that I've been working on for over a year. And this month, yes, October, I have some very exciting news and a very big announcement coming. So if you are on that waitlist, you will be the first to know. I would love for you to join the waitlist and join us on this journey.
Let's get into everything that OpenAI has just announced at their most recent Dev Day. Before I get to the exact announcements, I wanted to share a really interesting little snippet. TechCrunch did an interview with them, and as we know, right before this big Dev Day and all the announcements that came out, a whole bunch of really key executives left OpenAI. So they were asked about it in this briefing.
OpenAI's chief product officer, Kevin Weil, addressing the departures, said: I'll start with saying Bob and Mira have been awesome leaders. I've learned a lot from them. They're a huge part of getting us to where we are today. And also, we are not going to slow down. This, I think, is big news for a lot of people as we move into Dev Day, because they definitely did not slow down. They came out with a ton of absolutely incredible new updates.
The first one that I want to highlight and talk about is what they're dubbing a real-time API. This is the one that everyone's talking about. It's built around their new voice model (though not actually their very latest voice model), and it gives developers an API where, when you talk to the voice model, it responds back in real time.
What this is essentially replacing, or what people were doing before this, was chaining together speech-to-text and text-to-speech models. Picture voice-to-voice: you have your phone and you're chatting with something like, hey, teach me how to speak better Spanish, and it's responding back to you. Previously, how this worked was that you would talk, it would take that clip and transcribe it into text, send that to the model, the model would read it and generate a reply, and then it would send a voice clip back to you.
And that round trip takes a couple of seconds; there's latency. The problem is, with all of these new agents and sales tools, it was taking a few seconds to get your response back, and it didn't feel very natural. So they've now officially created a real-time API, meaning it's listening to you in real time, and I think it's partially predicting where your sentence is going, so as soon as you're done talking, it immediately gives you the response. You can think of this the same way as when you're chatting with ChatGPT and you watch it type out its answer letter by letter. With this real-time API, the voice starts talking while that text is still being generated. In the past, it would wait until the whole thing was typed out, turn it into audio, and then send you the voice clip. Now it's responding to you in real time.
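To make that concrete, here's a minimal sketch of what a Realtime API session over a WebSocket could look like in Python. The endpoint, headers, and event names follow OpenAI's beta docs from around Dev Day; treat the details as illustrative, not gospel:

```python
# A rough sketch of a Realtime API session over WebSocket.
# Endpoint, headers, and event names are from OpenAI's beta docs;
# treat them as illustrative.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # Newer versions of the websockets library call this kwarg
    # `additional_headers` instead of `extra_headers`.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask for a spoken (and textual) response.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the listener in one sentence.",
            },
        }))
        # Audio arrives as many small delta events, so playback can
        # begin long before the full response is finished.
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                audio_chunk = event["delta"]  # base64-encoded audio
                # decode and play audio_chunk here
            elif event["type"] == "response.done":
                break

asyncio.run(main())
```

The key detail is that the audio comes back as a stream of small delta events, so your app can start playback while the model is still generating, which is exactly where the latency win comes from.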
So what does this unlock? Some absolutely incredible things. They showed demos from two different companies. The first is Healthify, a nutrition and fitness coaching app. They're using the new real-time API to create really natural conversations with their AI coach, which is for people who want to change something about their diet or who need personalized support. In the demo, they talked to it, asked it for different health suggestions, and it responded immediately. They even switched between multiple languages mid-conversation; I think they switched into Hindi, and it understood all of that and responded really quickly. Absolutely impressive. The second demo was from a company called Speak, a language learning app. They're using the real-time API to power their role-play feature, which is really, really impressive. I'm sharing these demos because I think there are going to be thousands of new apps like these doing really impressive things, and these two get your head in the right place.
With Speak, they showed a demo where they're just talking to it about different ways to improve their language skills. Specifically, the app tells them to say a word, and they say the word, and the AI, beyond just listening for the word, is actually listening to the pronunciation. Really, really impressive. What that tells me is that this isn't just voice-to-text transcription being fed to a model; it's actually listening to the audio itself and decoding the language. In the demo, they say a word in Spanish and it responds: to say this word in Spanish, you need to really enunciate the last part of the word and pronounce it this way; try again. So he says it again, and it's like,
fantastic, you did great, and it moves on. This is absolutely amazing to me. When you think of apps like Duolingo and all the other language learning apps we commonly use, these will all shift beyond just picking the right word on the screen, or even just saying it: it's listening to how you say it, it's correcting your pronunciation, it's helping you avoid grammar errors, and it's a conversation. This is how people actually learn languages, so this makes perfect sense. I think both of these are great examples. The other category we're going to see a million of, and sometimes people will be annoyed by it, is the telemarketer stuff; that's going to be annoying in real time. But then you can also imagine customer service. A lot of times when I call customer service, like with my internet company, I have to wait forever.
Oh my gosh, so annoying. Recently, I was on hold for about an hour with my internet provider just to cancel, because I was switching to a faster internet company. If they had told me, hey, instead of waiting an hour to talk to an actual person to cancel (and how hard is canceling? This is not a difficult thing), would you like to talk to an AI and be done in two minutes? I would say 100% yes. So I think there are so many companies and people that are going to be able to leverage this to speed up processes and save money. And the customer, myself in this case, would have been thrilled. As if I care whether it's actually Susie on the line when I cancel this subscription. Please put me out of my misery. So I think this is really exciting. They also talk about safety.
They say they have multiple layers of safety protections to mitigate the risk of abuse, and really that's about scams and that kind of thing. I recently saw a demo where somebody told the new voice model to act like a scammer from India trying to talk you into giving up your credit card information. It was an Indian guy who made it, so no stereotypes; he came up with it himself. In any case, it did an amazing job of the accent and exactly what a scammer would say.
It was kind of a funny joke demo he was doing, but my reaction was, oh crap. Because that's not what will actually be calling you; scammers won't use that specific accent. They'll tell it: use a Southern American accent, use a Western American accent, whatever matches where you're geographically located. It's going to mimic your area. So that's why I think this safety and privacy piece they're talking about is important. They're putting these safeguards in now. Will there be open source models, or other models, that people abuse? 100%. So you still have to be vigilant. It's not like, oh, OpenAI is going to be safe about this so we don't have to worry. No, this is something we should be concerned about and watching. But it looks like OpenAI is going to mitigate it,
because they're the best, the biggest, the fastest, and otherwise people would have the best possible tool to scam you with. In any case, this is interesting, this is coming, there are pros and there are cons, but I'm excited because there are a lot of amazing things coming with this new real-time API. All right, the second thing I want to talk about is vision fine-tuning.
For those that don't know, a quick refresher on fine-tuning. I realized the other day that it's not as complicated as it sounds. We talk about this a lot with AI: oh, they fine-tuned this model, and it sounds like they did some big fancy thing. The closest everyday version is when you give ChatGPT a bunch of examples before asking it for an output. Strictly speaking that's few-shot prompting, but fine-tuning is the same idea baked permanently into the model with a training dataset. If you say, hey, write me a LinkedIn post, here are five examples of LinkedIn posts I've written that I like, copy my tone and style, you've essentially done by hand what fine-tuning does at scale. It's not some exclusive, hard-to-understand thing.
What they've introduced now is vision fine-tuning. Essentially, they said there are thousands of companies fine-tuning today by uploading a big dataset of text. For example, I had a friend who was fine-tuning an AI model because he wanted it to write the best TikTok comments, the kind most likely to get top-ranked. That was his goal: really good TikTok comments. So he scraped 20,000 TikTok comments from a bunch of viral posts, found the top comments, and used them to fine-tune a model. Essentially saying: look, ChatGPT already understands how to write comments, but it generally writes bad or generic ones; here, for the fine-tune, are the best ones, the ones that get the most upvotes; copy these tones, styles, and ideas, and now write really good TikTok comments. And it was able to do that. It made great TikTok comments that were genuinely interesting, funny, or witty, and he was thrilled with the fine-tuning.
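If you're curious what that looks like mechanically, here's a minimal sketch with the OpenAI Python SDK. The file name, prompts, and model snapshot are my own stand-ins, not anything from his actual project:

```python
# Sketch: fine-tuning a model on a dataset of "good" examples.
# Each line of train.jsonl is one chat-formatted training example, e.g.:
# {"messages": [
#   {"role": "system", "content": "Write a top-ranked TikTok comment."},
#   {"role": "user", "content": "<description of the video>"},
#   {"role": "assistant", "content": "<one of the scraped top comments>"}]}
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the dataset, then start the fine-tuning job.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # a fine-tunable snapshot name
)
print(job.id)  # poll this job until it finishes, then use the new model
```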
So this is a common thing; thousands of companies are uploading these big text datasets and fine-tuning. But the problem is there are a ton of use cases that are not text. Take medical imaging, where you're trying to locate a tumor. Yes, ChatGPT's vision can look at an X-ray scan and say, oh, it looks like there's potentially an issue there. But how does it do that, and how accurate is it? So now, with vision fine-tuning, they're letting you fine-tune with images. For this medical example of finding things on an X-ray: if you grab a hundred pictures of tumors on lungs, annotate them, and upload them to OpenAI as fine-tuning data, the model's image recognition becomes much better at recognizing that specific kind of tumor. Right now the base image recognition can see everything in the world in general terms and give you a rough idea, but it's not a specialist in that specific area.
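As a rough illustration, one training example in a vision fine-tuning dataset might look something like this. The scan URL and labels are invented, but the chat-plus-image JSONL shape follows OpenAI's fine-tuning docs:

```python
# Sketch: building one vision fine-tuning example. Images go inside the
# message content as image_url parts alongside text.
import json

example = {
    "messages": [
        {"role": "system",
         "content": "You flag possible tumors in chest X-rays."},
        {"role": "user", "content": [
            {"type": "text", "text": "Anything abnormal in this scan?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/scans/0001.png"}},
        ]},
        # The annotation your radiologists wrote for this image.
        {"role": "assistant",
         "content": "Possible mass in the upper lobe of the left lung."},
    ]
}

# The training file is one JSON object per line.
with open("vision_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```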
So now you can fine-tune the model for that, which is very interesting. They gave some examples of companies that have done this with them. One of them is Grab, the food delivery and rideshare company; they fine-tuned it to recognize speed limit signs, so it's much better at that. There's also a company called Automat, which builds agents that take actions based on the UI: scrolling through the internet, going to a website, clicking, buying, doing things. They fine-tuned the model with images of buttons and UI elements so the agent knows what to click on. Beyond just "here's this website, go to the sales page," if the agent doesn't know what the sales page is, you can fine-tune it: these are all sales page buttons, this is what the word "sales" looks like. And you can get really specific; maybe the button doesn't say "sales" at all, it just says "learn more," and you can fine-tune it to know that "learn more" typically means X, Y, and Z on these types of sites. Automat said this improved the success rate of their RPA agents from 16% to 61%, a 272% uplift in performance compared to the base GPT-4o.
Really, really impressive. One other company is called Coframe. They're building an AI growth engineering assistant that helps businesses create and test variations of websites and UI, trying to optimize business metrics. A big part of that is autonomously generating new branded sections of a website, based on the rest of the site. So they fine-tuned GPT-4o with images and code, and by doing this they improved the model's ability to generate websites with consistent visual style and correct layouts by 26% compared to the base GPT-4o model. Pretty much, they upload an image of a website and say: generate the next chunk of the website right below this. With just the base model, the result was only okay; it didn't look like what should come next on the site. Then they fine-tuned it, and you can see the next chunk of the website is perfect. It looks exactly like something you'd expect to follow in the flow of the site. The way they do their headings, with multiple colors in the same word, comes out exactly the same, which the base model couldn't do before.
So this fine-tuning made things much, much better. Now, again, they're all about safety and privacy: they're continuing to run safety evaluations on the fine-tuned models and monitoring what goes into them to make sure they're only used for things they're allowed to be used for. But overall, a really, really exciting use case. The next thing I want to talk about, which I think is so fascinating, is this concept called model distillation. This is the first time I've really heard that term, the first time I've seen it becoming a very popular thing. Model distillation is essentially fine-tuning a cost-effective model with the outputs of a larger, smarter
model. What that means: we have o1, which just released, and it's an incredible model, but it is way more expensive, way more expensive, and way more computationally intensive to run. Then of course we have much smaller, much faster models like GPT-4o mini, which a lot of developers I talk to say feels almost free: you have to send a million messages before it runs up any sort of bill, because it's so cheap, so fast, so optimized. But the problem is the responses are not as good as o1's,
especially o1-preview. So what you're now able to do is fine-tune the smaller, more efficient models like GPT-4o mini with the outputs of the better model. In the past, they said people were able to do this, but it was a lot clunkier and didn't work very well, so they've streamlined the whole approach. You can now fine-tune these small models based on the outputs of the really good models. Say o1 gives you the right answer for a specific kind of question and GPT-4o mini does not: you generate a thousand answers from o1, feed them in, and all of a sudden this really small, optimized, really cheap model gives you the responses you need, for so, so much cheaper.
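A rough sketch of that loop is below; the model snapshot names and file names are my assumptions. (OpenAI also announced dedicated distillation tooling, with stored completions you can turn straight into training data, but the underlying idea is the same.)

```python
# Sketch: distill an expensive "teacher" model into a cheap "student".
import json
from openai import OpenAI

client = OpenAI()
questions = ["..."]  # your repetitive task's real inputs go here

# 1) Collect teacher outputs from the expensive model.
with open("distill.jsonl", "w") as f:
    for q in questions:
        answer = client.chat.completions.create(
            model="o1-preview",
            messages=[{"role": "user", "content": q}],
        ).choices[0].message.content
        f.write(json.dumps({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": answer},
        ]}) + "\n")

# 2) Fine-tune the cheap student model on those outputs.
training_file = client.files.create(
    file=open("distill.jsonl", "rb"), purpose="fine-tune")
client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
```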
This is really exciting and very interesting, especially for companies that run very repetitive tasks over and over again; they're saving a ton of money. Okay, I'm super stoked about this one. Speaking of saving money and optimizing, I want to talk about the last big update they released, which is called prompt caching. Prompt caching is an absolutely fascinating topic, again in that same optimization and affordability vein of thinking.
Essentially, they're going to offer automatic discounts on inputs the model has recently seen. Every time you have a conversation with ChatGPT, it has to look at the context of all the previous messages to help with your most current one. And all of that context is stuff it's already seen before. Every time you send a new message, it adds a little bit of new text, but everything above that new text is the same, over and over again, and your context just keeps getting longer and longer. So everything it's previously seen, it's going to cache.
And essentially, you get a 50% discount on the tokens in that cached portion. Here's how it works. When you're using the API, they charge you for how many tokens (roughly, how many words) are in the message you send. If you send a five-word question, they charge you for those input tokens. Then when you send a follow-up, the input is now your previous question plus the response, maybe a hundred words, and they charge you for all hundred words plus your new ones. It keeps snowballing and getting more expensive; chats cost more the more messages are in them. But now with caching, you get a 50% discount on everything the model has already seen before. Looking at what that means for token pricing:
in their pricing, it's $2.50 per million uncached input tokens, and once caching kicks in it drops to $1.25 per million, so 50% off, 50% cheaper. And because people were curious about exactly how it functions, they gave some specifics. It applies to prompts longer than 1,024 tokens; below that it doesn't really kick in, because the prompt is so short anyway.
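There's nothing special to call to turn this on; it's automatic whenever a prompt prefix repeats. Here's a tiny sketch of how you'd see it in the API response, with the usage field name taken from the chat completions response object:

```python
# Sketch: watching prompt caching kick in across a growing conversation.
# Keep the long, unchanging part (system prompt, earlier turns) at the
# front so the prefix matches between requests.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system",
            "content": "You are a patient support agent. " * 200}]

for question in ["How do I cancel my plan?", "And what about refunds?"]:
    history.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4o", messages=history)
    history.append({"role": "assistant",
                    "content": resp.choices[0].message.content})
    # On the second request, most of the prompt should be counted here,
    # billed at the discounted cached rate.
    print(resp.usage.prompt_tokens_details.cached_tokens)
```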
In any case, they said caches are automatically cleared after 5 to 10 minutes of inactivity, meaning they're not going to cache your chat conversations and keep them forever; people were concerned about the privacy aspect here. So 5 to 10 minutes after inactivity the cache is cleared, and within one hour of last use it's completely removed. They also said that, as with all API services, prompt caching is subject to their enterprise privacy commitments, and prompt caches are not shared between organizations. That's obviously something people want. Absolutely fascinating stuff
for essentially cutting costs and making these things more efficient. I see a ton of developers who are very, very excited about this. The last thing I want to talk about, which a lot of people have been very excited for, is that OpenAI put out a tweet saying: starting this week, Advanced Voice is rolling out to all ChatGPT Enterprise, Edu, and Team users globally. Free users will also get a sneak peek of Advanced Voice. Plus and Free users in the EU: we'll keep you updated, we promise. Okay, so pretty much the Advanced Voice features that are
amazing, right? Everyone's been testing them out and showing demos where ChatGPT can essentially talk in a thousand different ways, accents, tones, and styles. All of a sudden, all the free users are getting that. So people are freaking out on Twitter; they're really excited. But at the same time, people are kind of mad, because at the end it singled out Plus and free users in the EU:
"We'll keep you updated, we promise." So if you're in the European Union, you do not get this yet. A lot of that is down to European regulations, the AI Act and so on. In my opinion, they've somewhat over-regulated, and now you see the same thing with the iPhone: the new Apple Intelligence features aren't coming to the EU either. When companies roll out this cool stuff, it's just a lot harder to keep up with regulation and everything in the EU, and people seem upset about it. They're saying things like, living in the EU is becoming increasingly infuriating, and, our money is not good enough for you? They're getting mad, but companies have to keep up with the regulations they're subject to. So overall, if you're not in the EU, hopefully you're getting this right away as a free user, and that is really exciting.
Now, if you are interested in different ways to make money with AI, different side hustles, I have a Skool community called AI Hustle. I'll leave a link in the description. Every single week I create an in-depth video breaking down one of the side hustles where I'm using AI: how much money I'm making, what products and tools I'm using, an exact breakdown, things I can't share publicly. It's all over on my Skool community. It's $19 a month, and the price will probably go up to $100 a month eventually, but if you lock it in now, you lock it in forever; I'll never raise the price on you. We have an incredible community of over 150 people all sharing their AI projects, getting feedback, giving you feedback. It's an amazing group, and we share exclusive stuff there on how people are making thousands of dollars. It's really exciting. So if you're interested, check out the link in the description. I'd love to have you in the AI Hustle Skool community, and I hope you all have an amazing rest of your day.