
EP 435: How 50X cheaper & faster AI transcription is changing enterprise work

2025/1/8

Everyday AI Podcast – An AI and ChatGPT Podcast

People
Jordan Wilson
An experienced digital strategist and host of the Everyday AI podcast, focused on helping everyday people advance their careers with AI.
Philip Kiely
Topics
Jordan Wilson: Something I don't think we discuss enough is that every word we say and every conversation we have is extremely valuable, and the AI around those conversations is getting cheaper, faster, and more accurate, which unlocks enormous potential for businesses of all sizes. Speech transcription turns voice data into text that both people and machines can process more easily, improving efficiency. Cheaper, faster AI transcription opens up many new use cases for businesses, such as customer service, content moderation, and media captioning. Although AI transcription is already very accurate, scenarios that demand 100% accuracy still require human verification.
Philip Kiely: Baseten is an AI infrastructure platform that helps customers deploy a wide range of AI models and optimize model performance to make them faster, cheaper, and more efficient. We recently released the world's fastest, most accurate, and cheapest Whisper inference. Whisper is an open-source speech transcription model developed by OpenAI with high accuracy and multilingual support. The latest versions of Whisper are faster and cheaper and can support real-time transcription. The cost of AI transcription has dropped dramatically, from $1-2 per hour of audio to a few cents. By chaining multiple inexpensive AI models together, you can build more complex, cost-effective AI applications such as AI phone answering. Cheaper, faster AI transcription is also driving wearable devices, making long-duration voice recording practical. The performance gap between on-device inference and cloud inference explains differences in speech recognition accuracy. AI transcription is evolving rapidly and will keep getting faster, cheaper, and more accurate, so businesses should pay attention early and start experimenting with the technology.

Transcript


This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Every word that you say, every meeting, every speech, that's gold. I think, uh,

So oftentimes when we get caught up in implementing generative AI in our business, we think about other large language models that exist, right? And we think about like, oh, we're limited by their training data. You know, hey, hopefully these models get better. But what about your words? What about all of those meetings? What about that big seminar that you're speaking at?

That is unstructured gold. I think something that we don't talk about enough on this show or in general is how the words that we speak, the conversations that we have,

how valuable those are, and how the AI surrounding that is getting cheaper, faster, more accurate, and what that really unlocks for businesses of all sizes. All right, I'm excited to talk about that and a lot more today on Everyday AI.

What's going on, y'all? My name is Jordan Wilson. I'm the host of Everyday AI. This thing's for you. It is your daily live stream podcast and free daily newsletter, helping people like you and me, us everyday people, catch up with everything that's happening in the world of AI and how we can use all this information to grow our companies and our careers. Is that you?

If so, welcome home. And your other home is our website, youreverydayai.com. So if you find value in today's conversation with our guests, we're going to be recapping and sharing a lot more insights in our daily newsletter, as well as keeping you up with everything else that's happening in the world of AI.

Also, there's like, I don't know, a thousand hours of audio and text content and exclusive interviews from the smartest people in AI in the world, all for free on our website. All right, before we get started, let's first go over the AI news. So Anthropic is set to close a $2 billion funding round as its valuation soars to $60 billion.

So Anthropic, one of the biggest startups in the generative AI space, is reportedly nearing the completion of a $2 billion funding round led by Lightspeed Venture Partners. So this investment will significantly increase its valuation from $18 billion last year to an impressive $60 billion.

So the latest funding round is part of a broader $6 billion initiative for Anthropic, followed by an earlier $4 billion investment from Amazon. So yeah, they've, I think, raked in like $8 billion in commitment so far in this round. So Anthropic's annualized revenue has reached an approximate $875 million, driven by its model of selling access to its advanced products.

or sorry, its advanced AI systems, to enterprises and through platforms like Amazon Web Services. So who knows? Maybe with this extra cash that Anthropic just put in its pocket, maybe their rate limits will go from unusable to kind of usable. We'll see.

All right, next, NVIDIA CEO Jensen Huang is claiming that the AI performance of his new GPUs surpasses Moore's Law. Yeah, we're breaking science, breaking science in the face. So the NVIDIA CEO stated in an interview that their latest data center super chip is over 30 times faster for AI inference workloads compared to its predecessor, which could significantly lower the cost of running AI models.

He emphasized that by innovating across the entire stack, architecture, chip design, systems, libraries, algorithms, et cetera, NVIDIA can achieve advancements at a pace that exceeds Moore's Law. So Huang introduced the concept of Hyper-Moore's Law. Yeah, now we got to learn new scaling laws.

suggesting that AI development is not slowing down, but is instead governed by three active scaling laws: pre-training, post-training, and test-time compute. So Huang also claimed that NVIDIA's AI chips today are 1,000 times better than those produced a decade ago, indicating a rapid evolution in technology that could benefit various industries.

All right, last but not least, Apple is facing a ton of backlash over its inaccurate AI news alerts and has promised an update. So Apple is under scrutiny after its AI feature that essentially summarizes news alerts generated some false and misleading news headlines, raising concerns about the accuracy of information in its new Apple Intelligence system.

So Apple announced that it will release a software update in the coming weeks to clarify when news notifications are generated by its AI system known as Apple Intelligence. So the misleading alerts have sparked criticism from various media organizations, including the BBC and ProPublica, which reported similar inaccuracies in AI generated summaries of their content.

All right. A lot more on those stories and everything else you need to stay ahead, not just keep up, stay ahead on our website. So make sure you go check that out and sign up at youreverydayai.com. All right. Enough chit chat. Let's get to the bulk of today's conversation. AI transcription.

You probably don't think about it, but it is a boon for business. So I'm excited to have this conversation. Hey, livestream audience, help me welcome to our show Philip Kiely, the head of developer relations at Baseten. Philip, thank you so much for joining the Everyday AI Show. Hey, Jordan. Thanks for having me. Super excited to be here. Let's chat about transcription. Before we do, can you tell everyone just a little bit about Baseten, what it is you all do?

Absolutely. So Baseten is an AI infrastructure platform. We take open source, fine-tuned, and completely custom models for our customers, and we help them deploy those models on worldwide auto-scaling GPU infrastructure. We also assist with model performance efforts so that we can get them lower latency, higher throughput, lower cost, and better quality.

Our customers are AI-native startups and enterprises like Writer, Bland, and Patreon. And one thing that we've been working a lot with recently is the Whisper model. We recently released the world's fastest, most accurate, and cheapest Whisper inference.

Hmm. So let's, I mean, I do want to dive into Whisper, and I'm sure it's something that a lot of our audience is familiar with. But before I even go there, what are the main benefits, right? Like, you know, when people talk about transcription, and I kind of started the show out on it, I'm a firm believer, right? Every word I speak on this podcast, it's instantly transcribed and fed into a large language model. But what's the

benefit of capturing your company's words and using those? I think sometimes people just overlook it. Yeah. Well, it's just another stream of data.

So if you think about all of the YouTube videos in existence, all of the podcasts in existence, all of the phone calls that maybe have been made into your company's call center, there's just tons of data floating around out there that takes a long time to process. You know, maybe if you're some sort of super speed listener, you can listen to a podcast on one and a half or two times speed.

But when you think about how fast a human talks, we only speak at, you know, maybe up to 150 words per minute. I know I'm not supposed to actually speak that quickly when I'm doing a podcast. So I'm always trying to slow it down a little bit.

Maybe you listen at 2x speed, you're getting, what, 300 words a minute. But if you think about how fast someone can read, you know, the fastest speed readers can read at 500 or even 1,000 words per minute. So audio is actually a fairly low signal channel. There's not a ton of bandwidth in talking.

But if we can transcribe that audio and then we can get it in text, not only is it much easier for us to process as people, we can read a lot faster, but it's also easier for machines to process. You know, we can feed it into large language models, like you said, or we can do simple find and replace. We can do simple search. There's a ton of things you can do on text that is really hard to do on audio.
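
To make that concrete, here's a minimal sketch of what "simple search" and "find and replace" look like once audio has been turned into text. The transcript string below is made up for illustration.

```python
# Once speech is transcribed, ordinary string tools apply -- something that's
# effectively impossible to do on raw audio. The example transcript is invented.
import re

transcript = (
    "Thanks for calling. I'd like to cancel my subscription and get a refund. "
    "Sure, I can help you with that refund today."
)

# Simple search: find every mention of a keyword.
print(re.findall(r"refund", transcript, flags=re.IGNORECASE))

# Simple find-and-replace, e.g. normalizing a term before feeding the text to another system.
print(transcript.replace("subscription", "plan"))
```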

It is, you know, I hate throwing around a term like game changer, right? But it is, right? Being able to capture everything that's said, you know, I like to say that is your first-party, or first-company, gold, all the words that you talk about. Livestream audience, thank you for joining us. You know, if you do have any questions on AI transcription, on what that means for your business, get them in for Philip now. But maybe let's

not whisper, but let's talk about Whisper. Philip, what the heck is Whisper? Yeah. So Whisper is an open source model that was created by OpenAI a couple of years ago. And I'll actually give like a kind of little history lesson here. So in 2019, I was working on a blog post about speech to text, which can also be called transcription. It can be called ASR, which is automatic speech recognition.

And I was kind of doing a survey of the state of the art. And one of the best things I found back in 2019 was something called Amazon Transcribe. It's like an AWS thing. And it was pretty impressive back then, you know, it was able to take some segments of audio and create a reasonably interesting transcript out of them. But there were definitely a ton of errors, especially around things like names,

Places, proper nouns, as well as just if I kind of mumbled a little bit, then it really didn't know what was going on.

And so actually a year later, I was working on a book. And when I wrote that book, I did a ton of different interviews with experts in the field. These were all audio interviews that I needed to transcribe. And I ended up having to transcribe them by hand because I did all of these, you know, I did the survey of all this technology. It wasn't really good enough for, you know, publication. And so I just spent like a month at the keyboard typing out these 50,000 words from these expert interviews.

So, you know, I've always kept my eye on the space since then. You know, when open source models like Wav2Vec came out, I was really excited. I wanted to try it. But nothing really approached the quality of my, you know, amateur but still human transcription.

So, September 21st, 2022, OpenAI released a model called Whisper. And what's really exciting about this model is it's actually MIT licensed, which means you don't have to go through the OpenAI platform to get it. You can run it on your computer. You can run it on a cloud service. You can run it wherever you want.
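
Since Whisper is MIT-licensed and runs wherever you want, the quickest way to see this is the open-source `openai-whisper` Python package. Below is a minimal local-inference sketch; the model size and audio file name are just placeholders.

```python
# Minimal local Whisper transcription with the open-source `openai-whisper` package
# (pip install openai-whisper; requires ffmpeg). "meeting.mp3" is a placeholder file.
import whisper

model = whisper.load_model("base")        # smaller checkpoint; larger ones are more accurate
result = model.transcribe("meeting.mp3")  # returns the text plus timestamped segments

print(result["text"])                     # the full transcript
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {seg['text']}")
```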

And the first Whisper model was really exciting because it offered much higher accuracy. Also, it offered that accuracy across a bunch of languages. So when we talk about an ASR model and accuracy, we want to think about WER, which is word error rate. So for, you know, a thousand words, how many of those words are going to be wrong?

And you want that word error rate to be as low as possible. And so this model came out, it's got word error rates of like 10. Maybe 1% of the words are gonna be wrong versus much higher for other models. And since then, these models have gotten better. Now we're on Whisper V3 here in 2025. We also have Whisper V3 Turbo, which is a little less accurate than V3, but much faster.
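
Word error rate is just word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. Here's a small illustrative sketch of that calculation; the example strings are invented.

```python
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed here with a simple dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word and one dropped word out of five -> WER of 0.4
print(wer("the quick brown fox jumps", "the quick browne fox"))
```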

So we're able to get faster and more accurate transcription from these open source models in a lot of different languages. Yeah, and what you said there, I don't know if it hit anyone else in our audience, but that hit me

because I remember, right, I was a journalist back then. So I literally had taped interviews on a little tape recorder, right? I had one that was digital, but I think early on it was an actual tape, not to date myself. And I remember hitting play, stop, rewind so many times because especially when you're quoting people for big news publications, you had to get every single word right. You know, I'm even curious, as someone that did this as well, what was your first

reaction to seeing something like Whisper back in 2022? What was your reaction when using it at first? I mean, my first reaction was, man, I wish I had this a couple of years ago because, you know, my fingers were hurting. I had my mouse on the floor so I could kick it with my toe to start and stop the audio recording. I was thinking, wow, my life could have been so much easier if this had been released a couple of years ago.

Hey, this is Jordan, the host of Everyday AI. I've spent more than a thousand hours inside ChatGPT and I'm sharing all of my secrets in our free Prime Prompt Polish ChatGPT course that's only available to loyal listeners like you. Check out what Mike, a freelance marketer, said about the PPP course. I just got out of Jordan's webinar.

It was incredible, huge value. It's live, so you get your questions answered. I'm pretty stoked on it. It's an incredible resource. Pretty much everything's free. I would gladly pay for a lot of the stuff that Jordan's putting out. So if you're wondering whether you should join the webinar, just make the time to do it. It's totally worth it.

Everyone's prompting wrong and the PPP course fixes that. If you want access, go to podppp.com. Again, that's podppp.com. Sign up for the free course and start putting ChatGPT to work for you. So, you know, when we talk about some recent advancements, right? Because yeah, I even remember...

I used Whisper when it first came out in 2022, and I didn't think it was slow, right? But now when I'm using it, because, yeah, I run it locally, I have plenty of programs that run it on the back end as well. Now I'm like, oh, wow, it was slow. What's the recent situation with

speed and the cost, right? When we look at Whisper V3 Turbo, you know, maybe whenever we see a Whisper V4, what do these advancements actually mean when it's faster and cheaper? - Yeah, so when we think about speed and cost with Whisper, we talk about real-time factor. So if you have, say, an hour of audio, how many times faster than real time can you transcribe that?

And my real time factor as a person is like 0.3 or something, 0.2. It takes me four or five hours to type out an hour of audio because I'm constantly starting and stopping it and going back. Maybe if I was a faster typer, maybe if I was a professional, I could go a lot faster.

Out of the box, you know, Whisper might get you to, depending on the hardware you're using, I don't know, 50 times, 100 times real-time factor. So maybe that hour of audio, you're able to transcribe it in a minute, and that unlocks a ton. But you're actually able to take it way further through various optimization techniques that we can get into.

And you can get that real-time factor all the way up to say like a thousand times where instead of that hour of audio taking a minute to transcribe, it might only take five or six seconds.
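
The real-time factor arithmetic is simple: audio duration divided by processing time. Here's a quick sketch with illustrative numbers, not benchmarks.

```python
# Real-time factor (RTF): how many seconds of audio are processed per second of work.
audio_seconds = 60 * 60  # one hour of audio

def rtf(processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

print(rtf(4 * 60 * 60))  # hand transcription taking ~4 hours -> RTF ~0.25
print(rtf(60))           # about a minute of GPU time         -> RTF 60
print(rtf(4))            # heavily optimized pipeline         -> RTF 900
```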

And the other factor in performance optimization is if you're trying to do some kind of streaming use case where you're transcribing the audio not as a file after the fact, but live during the conversation. And so for that, you care about the round trip latency for a single 30 second chunk of audio. And for that, you can get down to about 200 milliseconds. So I'm a martial artist. For me, reaction time is super important.

I don't have the best reflexes in the world, but you know, the average reaction time for a human is about 200 milliseconds. And so if you're able to process that audio round trip in the time that it takes someone to sort of like react to something happening, then to your end user, that's going to feel like it's basically instant. Oh,

A lot of good comments here from our live stream audience and a couple of questions too. So, you know, Samuel's asking, is there any effort to capture tone and inflection during transcription? Spoken language has a lot of context components beyond grammar and vocabulary. That's something I was thinking myself, Sam. So thanks for that question. Philip, are we going to see that in future AI transcription, right? Like I sometimes talk very quickly. Sometimes I talk with

emotion, right? Like, is that something that future AI transcription will be able to tackle?

That's a really good question. Emotion, inflection, that kind of stuff is more of a factor right now when we're going in the other direction. When we're going from text to speech and we want an AI model to be able to do speech synthesis, there's a lot of work that's been put into making that sound much more natural. And that's where those context components in spoken language are super important.

Generally, right now, when we're going the ASR route, when we're going from speech to text, that is going to be just the sort of raw contents of the file or the raw contents of the conversation. But that would definitely be super interesting to look at. Like I said, it's a big area of research going in the other direction, but it's not such a big factor right now in transcription. Okay.

What has, you know, what have all of these updates done to cost, right? Because I remember even originally, I was happy to pay, you know, a dollar an hour or whatever it was, you know, in the earlier days of, you know, kind of AI transcription.

What is the cost now? And, you know, what does that mean in the grand scheme of things as businesses are trying to leverage all of this data, right? They're recording Zoom meetings, very common now, right? I think people have this, you know, goldmine of data that they're maybe sitting on. So can you walk us through the cost changes and then what that actually means?

Absolutely. So, you know, a couple, a few years ago, you're looking at $1 or $2 per hour of audio. And that's generally how it's measured: how much input time you put in is how much you're paying. So if you're putting in an hour of audio, say like a podcast, and you want to get back a transcript, it's going to cost $1 or $2.

But today it's gotten a lot faster. And when AI models get faster, they also get cheaper. The sort of thing that makes an AI model expensive to run is that you have to run it on a GPU. GPUs are very expensive. So if you use less time on that GPU to accomplish the same task, then that price goes down.

Today, you're able to do these transcription jobs for, you know, it depends. It depends on exactly how fast you want it to run. It depends on the exact type of transcript you're trying to generate.

But if you're doing the simplest, most basic transcription and you're okay with, you know, waiting a couple extra seconds for it to generate, you can get down to just a couple cents per hour. So we're looking at, you know, a 50 to 100x reduction in the cost of doing this transcription.
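
The cost math works out roughly like this; the prices here are illustrative examples, not quotes.

```python
# Rough illustration of the 50-100x cost drop described above.
old_price_per_hour = 1.50   # dollars per hour of audio, circa the $1-2 era
new_price_per_hour = 0.02   # dollars per hour, "a couple cents"

reduction = old_price_per_hour / new_price_per_hour
print(f"~{reduction:.0f}x cheaper")                                   # ~75x
print(f"One hour's old budget now covers ~{reduction:.0f} hours of audio")
```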

And that's massive. You know, now for the same price that you were transcribing one hour of audio before, you could transcribe 50 or 100 hours. And that just unlocks so much for business. Yeah. And speaking of that, let's dive into it, because I still think this is one of those areas, just like I started the show off. I think, you know, so many when we talk about business use cases, right?

in advancements in generative AI and large language models, right? I think everyone looks at using a ChatGPT, a Gemini, a Meta Llama, right? Like, people look at using these models, but they don't necessarily look from within at what they're creating, which a lot of times is meetings. It's conversations like this, right? Can you talk a little bit about

maybe some new and exciting business use cases that have maybe just begun to become a little bit more unlocked because of that cost and that speed. Absolutely. So a big business is going to generate just so much audio. A lot of that's going to be internal. You know, sometimes you might not want to transcribe literally every single thing that happens, but there are a bunch of places where it is really valuable.

So one of those is, you know, any kind of customer facing situation, you know, if you're doing call center, if you're doing a, you know, teller service, anything where you are interacting with a customer and from the customer perspective, you know, you get on the line there and you hear, oh, this call may be, you know, monitored for quality assurance, right?

So that quality assurance monitoring historically is like a manual process. You have some supervisors who are maybe listening to a few calls and making sure that everything's going well. Now you could just transcribe every single call that's coming into your business. And then you have a fully searchable database. You can do quality assurance. You can also maybe analyze those transcripts to figure out patterns and what your customers are asking for.
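
One way to picture that "fully searchable database" of call transcripts is a plain full-text index. Here's a minimal sketch using SQLite's built-in FTS5 module; the table, file, and example transcript are made up, and in practice the transcript text would come from a speech-to-text model like Whisper.

```python
# Store transcribed calls in a SQLite FTS5 full-text index and search them.
import sqlite3

conn = sqlite3.connect("calls.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS calls USING fts5(call_id, transcript)")

conn.execute(
    "INSERT INTO calls VALUES (?, ?)",
    ("call-0001", "Hi, I'm calling about a refund for the order I placed last week..."),
)
conn.commit()

# Full-text search across every transcribed call, with a short highlighted snippet.
query = "SELECT call_id, snippet(calls, 1, '[', ']', '...', 10) FROM calls WHERE calls MATCH ?"
for call_id, snippet_text in conn.execute(query, ("refund",)):
    print(call_id, snippet_text)
```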

You can do content moderation at scale. If I post something with text on a platform and it has some stuff the platform doesn't want on there, that's super easy to identify and flag because it's in words. If I'm posting, say, a podcast on Spotify or something, then that's a lot more difficult. Or if I'm posting a YouTube video,

Because, you know, you can't really just listen to all of the podcasts and all of the YouTube videos. But if you can get that from audio to text, then you can run it through those same moderation algorithms.

You can also do stuff like media subtitling, closed caption generation. You can do that in real time. I know sometimes if I'm watching like a sports game on silent, I see the announcer's words, but it's always like five or six seconds after the play has happened. It's so far behind. It's so far behind, right? And so if we can get that, you know, down to something that's more real time, that's super awesome. And you can also do real time translation with that as well.

So, yeah, there's just so many different use cases where you have these massive volumes of audio being generated that before it just wasn't cost efficient to process these or just took too long. And now with this cheaper, faster AI transcription that's more accurate, you can get a lot more value out of these big audio corpuses. Yeah.

So Cecilia brings up a good point, because there are entire industries that for many decades have thrived on just typing out the words people are saying. She's asking about how AI transcription is disrupting industries like court reporting. Are we going to see some of these traditional roles, where people were just transcribers, simply go away?

Well, you know, you do still have to verify these transcripts. When I talk about accuracy in an AI transcription and that word error rate, you know, that word error rate is not zero. There is a lot that you can do to make your transcripts more accurate. You know, for example, you can look at, you can have a model analyze them. You can look at, say, chunks that are silent and, you know, replace them or rerun them.

But, you know, at the end of the day, if you're doing something like court reporting where you need 100% perfect accuracy, it's important to have systems beyond just a single transcription model that are going to guarantee that accuracy. And, you know, I think that there's still a major role for human in the loop in these kind of systems where you're able to, you know, go in and verify these transcripts and make sure that they're completely accurate. Yeah.

Yeah, so you talked a little bit about how this, you know, advancing technology, Whisper models, you know, in general are helping change how we've done business in the past. But as we look to how these advancements might change how we work in the future,

What might we see change? Because everything's going live, right? You know, you have your live advanced voice mode from ChatGPT. You have Gemini Live. You know, you can talk to Copilot, right? Like how will more accurate, faster, cheaper transcription change how we work?

So one thing with that live voice mode from ChatGPT is it's really cool, but it's also really expensive. That sort of capability costs, what, like $10 plus an hour?

And this transcription is only a few cents an hour. So if you're a clever developer, you're able to kind of put this model in front of some other models and build these sort of chains of models for these compound AI use cases where instead of having one gigantic model that costs a ton to run and is able to do it end to end, you chain together a few small cheap models and run the same pipeline much faster and much cheaper.
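
A sketch of that chaining pattern for a voice assistant: cheap speech-to-text in front, a text model in the middle, text-to-speech at the end. The `call_llm` and `synthesize_speech` functions below are hypothetical placeholders for whichever hosted or local models you choose, not real APIs.

```python
# Compound pipeline: speech-to-text -> LLM -> text-to-speech, chaining small, cheap
# models instead of one expensive end-to-end voice model.
import whisper

stt = whisper.load_model("base")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your text-generation model or API here")

def synthesize_speech(text: str) -> bytes:
    raise NotImplementedError("plug in your text-to-speech model here")

def handle_turn(audio_path: str) -> bytes:
    user_text = stt.transcribe(audio_path)["text"]    # 1. cheap, fast transcription
    reply = call_llm(f"The caller said: {user_text}\nReply briefly and helpfully.")  # 2. small text model
    return synthesize_speech(reply)                   # 3. audio back to the caller
```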

One place where that's really important right now is AI phone calling. So if you want to, say, build an automatic pizza order taker where a customer can call it up and just say what they want on their pizza and it's going to say, all right, I've got this pizza for you, that kind of thing. You can build that AI phone calling with these faster, cheaper transcription models.

Another big aspect is wearables. So a big trend right now is, you know, having a pin or some speaker microphone combination on your body that's able to sort of record your daily context so that you have, you know, better information for your decision making for that kind of stuff.

And so if you're wanting to record your life 12 or 16 hours a day, again, if that's going through that historic transcription algorithm where it's costing a dollar an hour, well, that's like $16 a day. That's just not a sustainable business.

But if you're able to do it, you know, while you're sleeping at night for a couple of pennies, and it's costing a few cents a day, then, you know, now we're in the realm where this can make sense as a consumer product. So wearables, you know, local inference, phone calling, all these sort of things are these sort of real-time multimodal user experiences that are getting unlocked by these transcription models. Yeah.

I do think that we are going to see those in the wild that actually make sense, right? If you've listened to this show, I'm never one to just, you know, hype things like the Humane pin and the Apple Vision Pro. I'm like, no, not really. But I think some recent advancements, right? The, uh,

Meta's Ray-Bans, some of Google's new products. I think wearables are going to be a thing, whether you think they're going to or not. I do think that is kind of the next iteration. But one thing I'm curious about, and it's something I've always thought about, this concept of typing versus talking, right? Like,

I can talk really quickly, but also I don't blame y'all if you listen to this podcast on 2X, I would too. But might we see something in the future where it becomes less and less common to type and we're just interfacing with, I don't know, autonomous AI agents and multi-agent environments and all we're really using is our voice? And if so, what...

part of this technology has to improve or what advancements are we kind of waiting on until that future is finally here where we're just sitting back, kicking our feet up and just talking, you know, to our AI agents. Yeah. So the, uh,

The future is now actually for that. You have all those agent use cases and stuff that are still coming. But if you just want to control a, you know, control your computer, if you want to type an article without using your fingers, that's actually possible. I have a colleague who actually had to have surgery recently on their hands. And

So they went and used a voice transcription app for a few days to do writing while they couldn't type as much. They used something called Wispr Flow, which is an application out there for that.

But yeah, it's, you know, the future is now in terms of controlling your computer with voice. It's not something that's going to be practical in every situation. Like if I'm on the train, I don't want to be talking to my computer and everyone else is talking to their computer. That doesn't sound so good, but it can definitely be helpful if you have, you know, limited typing ability. I don't type particularly quickly. I can definitely talk much faster than I can type.

So it's something that I'm super excited about. Yeah, it's a good point. And I think, you know, having conversations about these type of things is important because I, yeah, I do think, yeah, whether we're talking wearables, whether we're talking, you know, you know, talking to your computer, it is becoming more and more common, more, I think part of how we work in the future. One other thing, you know, what, what, what part of this, Philip, like why are, you know,

If I'm talking to Siri, if I'm talking to Alexa, right? I see a big difference than when I'm talking to, as an example, a Gemini Live or a, you know, ChatGPT advanced voice mode. Why is there still this kind of divide, even between the big tech conglomerates, on which ones can accurately understand our words and sometimes they just can't?

So what you're observing there is the difference between on-device inference and cloud inference. So if you're taking an AI model and running it on the user's device, that's on-device or edge inference, and your user device is not going to be as powerful as an NVIDIA H100 GPU sitting in a data center somewhere. It's not going to be able to run as big of a model or run the same model at as high of a quality.

And so because of that, for these voice transcription things, you're probably seeing a little bit worse results when you're using it on a local device versus when you're using it on the cloud.

However, that's changing really quickly. These models are pretty small. They can be just a couple billion parameters. And so those are actually a really good candidate for local inference, even on stuff like smart speakers or maybe that next generation of smart speakers that has those upgraded GPUs, upgraded VRAM capabilities so that they can run these small models.
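
A back-of-the-envelope sketch of why a couple-billion-parameter model is a realistic on-device target: weight memory is roughly parameter count times bytes per parameter (activations and caches add overhead; the numbers are illustrative).

```python
# Rough weight-memory estimate for on-device inference.
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    # billions of parameters * bytes per parameter ~= gigabytes of weights
    return params_billions * bytes_per_param

print(weight_memory_gb(2))        # ~4 GB at 16-bit precision
print(weight_memory_gb(2, 1.0))   # ~2 GB with 8-bit quantization
```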

And so I definitely think you'll see that gap close in the transcription space pretty quickly. All right. So, Philip, we've covered a lot in today's conversation. I mean, we talked a little bit about Whisper, what this technology is, the cost savings, how much faster and more accurate it is.

You know, voice transcription AI has led to many new use cases. But, you know, as we wrap up today's show, what is the one most important thing that you want our audience to know when it comes to how cheaper and faster AI transcription is changing enterprise work?

I think the most important thing to understand is the trend. You know, in the last couple of years, these models have gotten much more accurate, much cheaper, much faster. And there was, of course, the massive leap in 2022 compared to maybe a couple of years before that.

I think this is going to keep happening. So even if you see a use case today where it's like, you know, Philip, actually, like five cents per hour, that's a little too expensive for what I'm trying to do. Or, oh, you can only do 200 millisecond round trip time. Like, yeah, that doesn't cut it. Woo.

We're not done optimizing these models. And even in the last couple of quarters of work on these models, we've gotten much better at running them, been able to run them much faster and cheaper. And that's a trend that's continuing. So I would definitely look at these use cases that you're considering today and say, OK, does this make sense today?

If yes, go for it. If no, still maybe go for it because it could make sense in three months, six months, nine months, once the technology gets even better and you're going to be pretty far ahead. You said, for example, Jordan, that you don't always love some of these wearables. That's a case where having the prototype today is what's going to set you up to be able to use the polished version next year.

for those companies. And so I'd say in the same vein, if you're building some kind of speech use case, if you're building some kind of transcription use case, and if it doesn't work today, still build that prototype, put it in your back pocket and keep an eye on the technology as it advances because it's getting better fast.

That's great advice. And I think words that we should all listen to. All right. So, Philip, thank you so much for taking time out of your day to join the Everyday AI Show. We appreciate your insights.

Hey, thank you so much for having me. I had a great time. All right, y'all. Quick reminder, we covered a lot and there's a lot more. So if you found something valuable today, please, if you're listening on the podcast, make sure to subscribe and rate the platform. Go back and listen to our library of episodes. We literally have...

thousands of hours of content on our website, hundreds of episodes. Also go to youreverydayai.com. We're going to be recapping today's conversation. Yeah, I'm going to upload it in 10 seconds. I'm going to have it all transcribed, but I'm going to be writing about it, a real human telling you more info and insights to take away. So thank you for joining us. Hope to see you back tomorrow and every day for more Everyday AI. Thanks, y'all.

And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.