
The Challenge with Voice Agents

2025/2/22

MLOps.community

People
Floris Fok
Paul van der Boor
Topics
Paul van der Boor: Speech synthesis gives AI systems a new way of interacting and is an important step in AI's development, especially in the B2C space. Earlier speech synthesis was largely limited to offline interactions; real-time voice interaction is a major breakthrough for voice AI. Testing voice AI models in real-world settings, for example with iFood delivery riders in Brazil, gives a better read on their usefulness and reliability, and testing in real applications means weighing technical feasibility, user experience, and data handling. Voice models handle context and memory differently from text models and are prone to hallucination in multilingual and long conversations. The voice AI landscape is moving quickly, with many companies focused on different pieces such as speech recognition, speech synthesis, and speech translation. Compared with text, voice carries richer context, such as emotion and intonation, which can improve how AI understands and responds. Voice AI use cases split into non-agentic and agentic ones; agentic applications such as real-time interaction and function calling still face challenges like hallucination and instruction following. Voice AI has broad potential in e-commerce, improving customer service and user experience and helping companies better understand user intent, and it is also valuable in healthcare, for example assisting doctors during consultations and improving efficiency. In the future, voice AI agents could act proactively, prompting users at the right moment, and could handle users' communication with customer service and similar organizations, fixing today's inefficiencies.

Floris Fok: Compared with text agents, voice agents have to account for much more in the input, such as variation between speakers, pauses, and intonation, all of which affect how the model understands and responds. Real-time voice means handling all kinds of asynchronous events, such as interruptions and turn changes; this differs greatly from text and requires rethinking session management. Turn detection is a hard problem and needs tuning to each user's speaking style, and deliberately dialing back an agent's performance can make it feel more "human" and improve the user experience. Custom evaluations, such as simulating many speaking styles and scenarios, can be used to assess voice agents. Turn-detection models still lack personalization and adaptivity; open-source community involvement would help. The open-source Kokoro model performs impressively at text-to-speech, with high efficiency and ease of use. When designing voice agent workflows, tool response time and user experience matter: avoid long waits and unnecessary steps, and redesign flows to avoid spelling problems. Product analytics for voice agents helps developers find and fix user-experience problems, such as spotting where users drop off.


Chapters
This chapter explores the difficulties in developing voice AI agents, focusing on real-time interactions and the limitations of existing technologies. It highlights the shift from offline to real-time voice interactions and the complexities of handling various accents and noise levels in real-world scenarios.
  • Real-time voice interactions are challenging due to the need for fast inference and continuous learning.
  • Handling various accents, colloquialisms, and background noise is difficult.
  • Memory management and context preservation are crucial challenges in voice agent development.

Transcript


What's up, everybody? We are back for this limited edition series that we're doing with the Prosus team talking about all of their work they have been doing on AI agents. Today, it's voice, and we go deep into the experiences that they've had building out voice AI agents. I talk with

Paul a little bit at the beginning about the landscape, about what he's seeing, what I've been seeing. And then we go into the tactical stuff with Floris to talk through how he has been using OpenAI real-time API and specifically what the learnings have been because voice is a whole different beast as we will find out. Let's get into this conversation.

Alright, we're back with episode two. We made it this far and we're coming hot out the gate with voice AI agents because we talked yesterday and I want to give a bit of background. The way that we were going to do episode two was say, hey, let's talk about all the frameworks you guys used while building out some of these agents, what you liked, what you didn't like.

And then we said, you know, frameworks have been around for the last two years. There's a lot of content on the internet about frameworks. You have some people that love them, some people that hate different ones. And what we don't have a lot of content around or experienced stories around are people building voice AI agents, right?

Yeah, so what we saw already for a long time is that the ability to generate synthetic content with voice, so you basically take text and you have voice, allows a new way of interfacing or interacting with AI systems. So it's an important next step for us to be able to open up, in particular in a B2C world,

agents to consumers. And for a long time we were experimenting, doing voice cloning, working with avatars, Synthesia, Eleven Labs, and many others out there just to kind of see how this could look and feel. And

Typically, what was kind of the limiting factor is that this was not possible real time. Because you could generate some voice. There's a lot of voice libraries that you could use, standard voices that you could generate. Give them text and they would generate audio. Text to speech, essentially. And they were great. They were becoming very good, but they were still...

It was an offline interaction. You have to take the text, generate the voice, wait a little bit, and then go back. And it wasn't really real time. Yeah. And about a year ago or so...

there was a moment where this started to become possible. I remember very, very vividly a demo we had with a team from Groq, with a Q. And of course their whole aim is to kind of accelerate. Yeah, they're fast. They're very fast. That's what they do. They optimize their inference. And they were using that to enable real-time voice interactions. And then of course we saw the real-time voice API from OpenAI. Well,

Well, we should mention with Groq, there's almost two ways of doing this. These days, it's clear that there's this large push towards speech-to-speech models. So the whole model can do everything that you need done. Right.

And the model takes input as speech and the model gives output as speech. But before that became the trend is what you're talking about with Groq, because Groq's so fast on the inference side that you could set up these pipelines and go input a voice, transcribe that, send it to a large language model that is on Groq and it can do things really fast. And I think someone told me where it was like the...

token streaming speed is like 300 tokens per second or something. So it's insanely fast. And then you output that with some text-to-speech model. So there's that pipeline. Three steps. You had to basically bridge the modalities with different models. And that changed. And then you had the OpenAI real-time voice API and others that we were doing early testing with and so on.
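For reference, a minimal sketch of that cascaded pipeline, written against the OpenAI Python SDK for all three hops just to keep it in one place (the episode describes Groq for the LLM hop; the model names here are illustrative assumptions, not what the team used):

```python
# Cascaded voice pipeline: speech-to-text -> LLM -> text-to-speech.
# Sketch only; model names and the single-SDK choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def voice_turn(audio_path: str, history: list[dict]) -> str:
    # 1) Transcribe the caller's audio.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2) Generate a reply with a fast text LLM (the hop the Groq demo accelerated).
    history.append({"role": "user", "content": transcript.text})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})

    # 3) Synthesize the reply back to audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
    out_path = "reply.mp3"
    with open(out_path, "wb") as out:
        out.write(speech.content)
    return out_path
```

Every hop in a pipeline like this adds latency, which is the gap the speech-to-speech models discussed next are meant to close.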

And all of a sudden you could have agents that you could talk to that you could stream voice in and they would stream voice out. And we did some tests also with iFood, one of the largest food delivery companies in the world that operates out of Brazil.

seeing if we could use that, for example, for riders as they were out there and they were delivering and we were trying to figure out. And by the way, just take this for a second. One thing is to have a demo where you and I can do a demo with one of these voice agents. Now take that same model

Put it into the context of a food delivery partner who is out on the streets in Sao Paulo, who speaks Portuguese with a Brazilian accent, of course, maybe even a colloquial accent with a lot of traffic noise in the background, is in a hurry trying to figure out which restaurant do they need to go to to pick up the order or what direction or the road is closed or whatever.

And they then expect the voice model to be able to operate. So that was a great test because it sort of takes something that works in a lab to being able to test it. In real life. You know, in real life, that's complicated for so many reasons. And this was them calling the restaurant to say, hey, where's the order? They were calling iFood to basically ask the question. So normally they would otherwise have to stop the vehicle to...

take out their phone, text. And so to not lose time and to do that safely, basically the ability to interact real time through voice was one of the use cases that we thought was a great way to kind of really stretch the limits of the technology while also figuring out if it would work for a real use case.

And of course, there are lots of things you want to test. For us, it was one, technical feasibility. Can we actually get a model into this app in a way to kind of test this? But also desirability, right? Does a rider find this good enough? Is it useful? Does it help iFood answer questions faster, safer, and so on?

And then you enter a whole new world of complexity, right? Because, well, what formats is this audio being streamed in? You're actually streaming, not doing batch processing. You still need to retrieve real data that you have normally in your agent tool calling sort of workflows from the iFood. And that memory aspect that we talked about last episode is so important. How do you keep different things in context without...

losing what is important and the thread of the call. So with the one thing that I am not clear about when it comes to voice agents is on the memory aspect, when you're doing the pipelines, you're just shoving new things into the prompt or you're shoving things

the whole conversation into the context window? How does that work? - Yeah, so a variety of things. Of course, you have a system prompt. You need to stream the content into the, depends on the model we're using. In this case, we were putting everything into the context window. That of course has its own limitation. We tested that also on a bunch of other use cases internally where we wanted it to be an expert on a certain topic. We took like an internal education course.

And people could call, for example, we created an audio interface for Toqan, our internal assistant. They could call Toqan and ask about this course. And you could see that there's a lot of these behaviors where if you just use a simple quote-unquote text-to-text interaction, an LLM, then you know how the context and the stuff you put in the context will work

performed generally at the moment it generates. For voice, it didn't behave the same way. Of course, we're testing different languages. So you start with Portuguese, and the moment you mention an English term, it starts responding in English, and it doesn't go back to Portuguese anymore, even though that's in the system prompt. It's much more prone to hallucination. So this thing that we kind of solved in LLMs in the voice model world

boom, back. Like it was just making up stuff. One thing that I wanted to talk about was a bit of the landscape in this whole voice AI thing.

you've got different players that are doing different things. And you've got folks that are focusing on different ends of this spectrum, like Eleven Labs that is doing the speech, or text-to-speech, aspect. And so if you have the text, you can turn it into speech. And then there's like Deepgram, which is doing the speech-to-text aspect.

I'm getting all confused with which ways we're coming in, but voice to text is one aspect. And they also started just going from the text to speech and creating that whole voice API piece because I

I think everyone knows that's where the future is. And then you've got other ones like Cartesia is doing a speech-to-speech model. Obviously, OpenAI is doing a speech-to-speech model and doing the real-time. Are there other ones that you can think of in this space that...

have grabbed some attention because this is like the foundation of what you need. And then you build apps on top of it. And when you think about what you need for this, you are looking at a way to get the voice to the model. Yeah. So whether that is on a Zoom call or via phone, you need to get that information to the model, and the model then

needs to do something with it. It's either the speech to speech model and it spits back out speech, or you do that pipeline that we were talking about. And oh, the other one that I wanted to mention, just a little tangent is Hume. Hume is doing like emotion detections and all that kind of stuff. And they're creating that toolkit. But

What else is there? Yeah, I think you touched on a lot of them. I think there's one more which we typically also look at is the ability to clone voices. So another great example where you used to need a lot of good, rich, high-quality audio content to be able to clone someone's voice. Now you can do that with tens of seconds. You can create a pretty high-fidelity voice clone.

So there are, let's say, the elements you mentioned, so the ability to transcribe, to do text-to-speech, do that in high fidelity, but also that in itself requires, for example, language detection, right? So some, like OpenAI's Whisper, do that well. You may, depending if you're doing transcription for meetings or conversation with customers, you need speaker detection. So there's actually, whenever we've put these things into systems, agents, products, there's

typically a bunch of things that need to come together. So, like I mentioned, speaker detection, language detection, the actual transcription. We do benchmarking against various languages because, again, Prosus is a global group. We've got lots of different languages. We hardly work only in English. So if you do this in Hindi or any other Indian language, or in South America with its languages, Eastern Europe, we typically try and measure

How well does this model perform just for that task, transcription, word error rates in that language? You've got translation, the ability to actually move across languages,

You know, if you have a use case, for example, you want to create educational content, right? You've got a lot of education companies in the portfolio, Udemy, Skillsoft, Stack Overflow, Brainly, and many others, or even others that do corporate learning and so on. They typically want to, you know, translate, dub, and so on. Well, you need to be able to do that with high quality. It can be done offline. So you don't need to do that like you and I are in a conversation now. Yeah.
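As a concrete illustration of the per-language word-error-rate benchmarking mentioned a moment ago, here is a small sketch using the jiwer package; the sample pairs and language tags are invented placeholders:

```python
# Per-language word-error-rate benchmark sketch (the data rows are illustrative).
import jiwer

# (language, human reference transcript, model transcript) placeholder samples.
samples = [
    ("pt-BR", "onde fica o restaurante", "onde fica o restaurante"),
    ("pt-BR", "a rua esta fechada",      "a rua estava fechada"),
    ("hi-IN", "order kahan hai",         "order kaha hai"),
]

by_lang: dict[str, list[tuple[str, str]]] = {}
for lang, ref, hyp in samples:
    by_lang.setdefault(lang, []).append((ref, hyp))

for lang, pairs in by_lang.items():
    refs = [r for r, _ in pairs]
    hyps = [h for _, h in pairs]
    print(lang, "WER:", round(jiwer.wer(refs, hyps), 3))
```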

Then the other kind of environment is, like the one we mentioned in a B2C context, like iFood or a marketplace, you want to be able to interact with a consumer or a driver or a restaurant real time. And there, the ability to use this much richer medium than just characters in a chat message, because like you said, you can detect speakers, emotions, intonation, accent, all those different things,

And using that to better generate answers is also something that

you know, offers again a whole, you need to actually know how to do that. You can't just say, well, I'm getting voice and I'm giving the right answer. Well, you think, well, maybe is this person annoyed? Yeah. Are they- There's a lot of different ways of saying- Are they, yeah, exactly. One word. Yeah. And if in a learning context, are they still engaged? Right. And so, you know, language learning, how do you help them with their accent? So all those different signals, you can now get out of this much richer medium. Yeah. Utilizing them is a whole new problem space as well. Well, there are two things

on these buckets that you're talking about, it's almost like the first one is not so agentic and the second one is very agentic. Because the first one, if we're just dubbing my voice, which actually I have a funny story about that. I was on a Brazilian podcast and for the first half of the Brazilian podcast, I stumbled through with my broken Portuguese voice

and it was all about data, and I was talking about MLOps. So you can imagine how fun that was for me to be sweating, talking about MLOps in Portuguese. And then for the second half, I think they just got sick of hearing me stumble in my Portuguese. They said, okay, now talk in English. And they would ask me the questions in Portuguese. I would say them in English. And later after the fact, they went and they dubbed it in Portuguese.

Brazilian Portuguese. And the agentic part and really where you start to see the voice AI medium come in is when you can go back and forth and get information or get the agent to do something. Yeah. And that's, I think, where this is sort of really at the frontier now, right? So being able to use function calling with these voice models in real time. Yeah.

We found we're struggling with that. It's going to be required to get rid of these hallucination issues. The system prompt and instruction following is, depending on the use case, also not quite there yet. But we're getting there. I mean, it's like all these things, right? This is not a point in time. It's a vector. So we see the improvements over the last 12 months on this front. And once we get those things right, in particular...

the function calling, that will make it a lot easier to then also use the agentic workflows, be able to access tools and so on to get higher fidelity conversations. Let's talk about some of the use cases that we've seen because I...

feel like there are some really cool ways folks are building apps on top of this base that we just mapped out. Yeah, a ton. So we're focused on the e-commerce space. So typically with a B2C world, there are lots of customer touch points that you can now think about changing because of voice.

Our geographies, we sit here in Amsterdam, but the people we work with and the consumers of our platforms, they're in Brazil, again, like I mentioned, India, Eastern Europe. They're very different in the way they use technology. There's much higher usage of voice in general. If you think about the percentage of, for example, voice messages on WhatsApp in Brazil, it's significantly higher than here it would be in the Netherlands.

So, voice is also in many ways a much more natural, let's say, capability in some of those markets, especially in the B2C application.

You know, as we work with our restaurant partners, with, you know, real estate partners that are listing stuff on our platforms, being able to confirm certain things with them and check that the menu is fine and using that as sort of more seamless touch points. And then the category I still think is heavily unexplored is as we give users the ability to interact with our platforms,

through voice, how do we use that as a way to help them better detect the intent that they have, because they speak to us and say, hey, I'm looking for, let's say, something simple and light tonight. Then you say, well, Paul is using this account. He's ordering such and such thing. He's maybe agitated or whatever. So get him a fast order with discount, right? But it's a much richer medium. And then

All these things apply not just to the food ordering space, but also, as you may have seen, we recently announced that we'll be partnering with Despegar, which is a travel platform in South America. They will join the group. And so there, there's going to be a ton of additional travel use cases. Yeah.

So I think this is a fairly unexplored space, lots of ideas and many things to test and try and build still. Some use cases that I've seen that have been quite novel to me have been in the, funny enough, very regulated space of healthcare, health tech, and helping doctors or psychologists or whoever it may be work together

and spend less time on admin than they would need to if they, because we don't know this because we're not doctors, right? Or at least as far as I know, you're not a doctor. I'm a doctor in the wrong kind of engineering. So not very useful. All right, doctor. I didn't know that. That's cool. So the thing about it is that for the most part, doctors can spend, let's just take the doctor example. Doctors can spend,

a lot of their time on admin work, whether that is for insurance stuff or just talking about what they think the patient has or where they think there's maybe pain points, what prescription they're giving and why, et cetera, et cetera. So that is a less agentic use case because maybe you can create a solution that will give someone

Or the doctor brings in their phone and records the conversation and then it helps them fill out that form after. Mm-hmm.

Or you can have it where an agent actually just asks the doctor after. He says, what do you think about this? And the doctor talks to the agent like they would talk to their nurse or whoever. Yeah. Well, there's one company in the group. It's called Corti.ai. As I mentioned previously, we invest in lots of companies, many of them AI companies. Corti.ai is one of them in the healthcare space. And what these guys have done is they were in the space already for some time, but of course...

This is a good example where a super smart team, very capable, AI native, they're building on all of these waves of, or new capabilities that come out, including voice. And their proposition, one of the important ones is helping healthcare workers, doctors, but also others that work in emergency departments, picking up phones and so on, to better assist whoever's on the other side of the call.

And they do that by basically giving them an interface that during the call, they...

Their models listen in and help the healthcare worker ask the right questions. You start to code and say this is likely an emergency of this and this type. You may want to ask such and such question. So they help them basically triage and navigate down the tree of possible options. Next best action. Yeah, exactly. But in a very assistive way to help the folks as they go through the calls.

And a lot of that is live in real time. So they've had to build a lot of their own technology. Of course, in the healthcare space, you can't afford to make mistakes. So you need to make sure these models are accurate. So they have a fairly sophisticated setup to evaluate the accuracy of the models, but also had to do a lot of

let's say their own training, deep model development to be able to extract healthcare specific terms. And that in fact is helping thousands of folks around the world as they are putting this in front of the healthcare workers on a daily basis. Yeah, and that one's fascinating because it's not necessarily that full pipeline that we talked about. It is voice in and then agents helping

but you don't have voice out. So it still is agentic workflows. It still is using voice AI, but it is not necessarily everything we talked about. And I like that you give that because it's a bit more creative thinking. Yeah, you actually bring up another topic, which also, again, isn't specific to voice, but especially as we think about more sophisticated agents, is the ability to proactively reach out at the right time. Right now, all of these systems...

whether voice or not, let's say these agentic systems, they're basically dependent on you coming in with a question or task. And if you're lucky, then maybe the agent will ask you one or two clarifying questions before it starts to execute the task because most of the time they're just greedy executioners, right? They just go and they execute. As we learned in the last episode, it can't say no. It can't say no. And they're...

So, you know, when you are in a, let's say, in a process, you're trying to do a task, in this case, file your taxes, which may take you

hours, right? You want to have the ability for an assistive system to come in at the right moment in time and say, hey, shouldn't you think about this? Or have you forgotten about that? And sort of, it's almost like another prediction, like, should it intervene? Should it proactively reach out to you? Hey, haven't you forgotten to also take this form or whatever? And of course, voice is just one way to kind of connect with you. But it's not the only one, right? Yeah.

One that I have not seen in the voice space, that I know we've made predictions around, is that there's going to be agents in the e-commerce area.

But I want my personal agent to go and do things for me. And especially in the voice realm where I'm stuck on the phone with Vodafone, for example, and it is so painful because they're passing me around to different agents, actual human agents, but each human agent doesn't know what I'm calling for. And I would like...

to be equipped to send my agent out to do my bidding for me and figure out my phone bill. Why did I get charged extra? I can tell the agent, go call Vodafone, figure out why I got charged extra. And so it's almost like this personal agent that can sit on the phone and interact with all these companies, especially because the majority of the companies that you are faced with on a day-to-day basis, they have some really bad experience.

voice, it's not agents, it's just robotic calling and you have to interact with that. And so it would be really nice if a voice agent could do that for me. I completely agree. It's kind of an asymmetric reality that typically the call centers you're trying to call, the big companies have their agents versus like IVR and then it becomes something a little bit more sophisticated while we are still stuck calling

With our own human time. I know, exactly. So we should have our own calling agent. Yeah, I think. And by the way, I don't think that's that hard to... Implement. No, exactly, to build or test. You know, we'll have Floris on to talk about some of the tests that we've done where we've used this interface to kind of do and test things

basically you send out an agent that you interact with through voice to do things for you in an e-commerce space. Go order me some food, go buy me this, go find me this. So where I think...

Maybe you'll find 2025 to be a good year for you and you can finally send these agents out on behalf of yourself. Yeah, well, speaking of Floris, let's bring him in now and let's have a conversation on what he has found while creating different voice agents. I know he's played extensively with the real-time API. I also know he's had some insights because, again, this is like, let's pull back the curtains. Let's see. You've banged your head against the wall so many times. How can we help

other developers not do that same thing. Welcome back, dude. Talk to me about voice and the differences that you've experienced between voice and working with voice AI agents versus text.

Okay, yeah. So it was a journey, you know, because I can remember the first time I heard about the audio agents coming live. I was like, oh, we can just copy-paste the prompt we have for a certain agent. It's easy. We put it in the real-time API and it still works. Yeah, that was kind of the dream I had. And it wasn't true. It didn't happen, huh? No, it didn't happen. And yeah, so there were a few things.

I think the first one was, of course, that if you type something and I type something, it's exactly the same. But the moment I'm talking or I'm speaking that sentence out loud and you're speaking that sentence out loud,

They are quite different. That's true. Well, just in the pauses or in the intonations, the way that I pronounce words, there's so many variables there. Yeah, yeah. And they add up. Like it's an error that can propagate. Like if the beginning of the sentence, something was unclear and then there's another word that is unclear and that whole paragraph that you're just saying to the agent is just becoming super vague to it. Yeah.

which is just a dimension that is really hard to tackle right off the start. Well, it's funny that you mentioned there's different dimensions that you have to be thinking about because I know you said to me at some point...

you have so many different events that fire when you're working with the agents, voice agents specifically, right? It's not just, oh, hey, there's some text and now we interpret that text. There's so many different pieces. So maybe you can map out some of those pieces. When we have text, you know, we send the messages and the messages are really structured. You know, we have, or yeah, we have the tools and the messages and it's like the user says this, the assistant says this.

But if we move to real time, you know, this all becomes real time. So it's asynchronous. You know, you're connected over a web socket and you're not having like, hey, these are the messages. Now give me one response.
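A minimal sketch of what that always-on, event-driven connection looks like, using the websockets library against the Realtime API's WebSocket endpoint; the URL, headers, and event names are written from memory of the documentation and should be treated as assumptions to verify:

```python
# Connecting to a realtime session: server events stream in continuously instead
# of one response per request. Endpoint, headers, and event names are assumptions
# to verify against the current API reference.
import asyncio, json, os
import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main() -> None:
    # Newer websockets versions call this keyword `additional_headers`.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        async for message in ws:
            event = json.loads(message)
            # Events arrive whenever the server has something: audio deltas,
            # speech start/stop detection, tool calls, completed responses, ...
            print(event.get("type"))

asyncio.run(main())
```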

You know, you can constantly, you're getting responses all the time. Yeah. And you can send responses all the time. So knowing exactly how this is processed and saved in the session, as they call it in the real-time API, is really important, because then you also know how you can manipulate this. So a funny thought experiment is like, if I would now continue to talk and you would interrupt me halfway. Yeah.

I would still remember what I wanted to say to you. If you have the old system where it was text-to-speech and speech-to-text and then to LLM, the LLM would have the memory that it would have told you that part. The Realtime API changed this because they have the interruption event. I'm not sure that's the right word, but they have this event saying, like, the speaker interrupted you, and they would trim

the output from the LLM perspective as well. So the LLM knew where it got interrupted and knows that you don't know that part. So if you would ask something, it would not say like, oh, I told you that last time, but actually it was in the trim part. So that's a whole new way of working with that session. And they have a few nice visuals on the API reference page

So for the viewers, I would recommend watching that instead of me trying to visualize it with voice. You can say things a certain way. You can say them...

in certain accents. So you have all of these different vectors that you pull on, but you also have the latency requirements that you have in any tech. And you also have the piece around like, is what is being said actually relevant to the question I asked? Yeah. Yeah. And like one of these parameters within the real-time API, which is quite difficult, is the turn detection.

So, if I'm talking now, you know, there are some pauses, which are natural, which are just like, you know, give me room to think or because I can't find the next word. But others are like, I'm done talking. Yeah. If I'm done talking, you know, you want this agent to talk back.

Fast. And that's what turn detection is doing. It wants to have the least amount of time of silence, but still kind of leaves room for you to have these thinking moments. Yeah, not interrupt you. But this is highly personal. So if I would now make a voice agent for me, I would tune these parameters so that my way of talking would be

perfectly aligned. Interesting. And then I gave you this agent and I was like, oh, it's so incredible, you know, it will never interrupt you. And you start talking to it and it just starts to interrupt you. Because your style is different. Exactly, I have long pauses thinking or whatever, and the agent doesn't realize that, and so you have to tune it in a different way. But speaking of which, I had a friend who is building agents, and he was saying that a lot of times

When the user will say something, the agent will think that it is trying to talk. So normally you have an agent that will be talking. And if it was with a human, I would be nodding along and maybe I would say, yeah, okay. Mm-hmm. Mm-hmm.
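A sketch of the turn-detection knobs being described here, as a session.update payload for a realtime session like the one sketched earlier; the server_vad field names are my recollection of the API reference and are assumptions worth checking:

```python
# Turn-detection knobs discussed above, sent as a session.update event over an
# already-open Realtime WebSocket `ws` (see the earlier connection sketch).
# Field names and values are assumptions to check against the API reference.
import json

session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.6,           # how confident the VAD must be that this is speech
            "prefix_padding_ms": 300,   # audio kept from just before detected speech
            "silence_duration_ms": 700, # how long a pause counts as "I'm done talking";
                                        # raise this for slow talkers with long thinking pauses
        },
    },
}

async def configure(ws) -> None:
    await ws.send(json.dumps(session_update))
```

Doing your own turn detection, as mentioned later in the conversation, amounts to replacing this server_vad block and deciding yourself when the user's turn has ended.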

And my friend was saying that a lot of times the agent will stop talking because it thinks, oh, you want to talk. I just heard you say something. This is one of those parameters you can change. You can say like, what is talking? If you talk for longer than...

X amount of milliseconds, that's considered talking. So a long mm-hmm can be detected as something you want to say, and a short mm-hmm not. So it's all these different things. And like I said, highly personal. Yeah, actually another insight from my friend who's building these agents, he said that he found the product worked better when they made the agent a little less aggressive

Good, I guess, in a way. So if you make the agent worse and not as good as it could be, then humans are more sympathetic and they're more patient with it. And they also are able to

They talk slower, they enunciate their words more, and it's much better for him and his product to get that. And so if you make the agent as good as it is capable of right now on these different vectors, it's,

you mistakenly have the customer thinking that it is better than it is and you get a worse result. That's an amazing view. You have this mirroring aspect that people have naturally, you know, that you kind of adjust to the way the other person is talking. That's actually a really cool one. I've not heard that before, so I'm going to write that down for later. So make it, yeah, tell me if you make it worse. I've learned something. There you go. Amazing. Now,

I know you guys did some really wild stuff with evals.

to see what was working, what wasn't working. Can you talk to me about the custom evals that you did for the voice agents? Yeah, so it was pretty cool. Like, you cannot just launch this. You know, you cannot just say like, hey, put it out in the wild. We have an assistant who works over the phone. Good luck. So we needed a way to know the boundaries. You know, we talked about this earlier, like, you know, knowing where it fails also kind of breeds confidence in deploying it.

And since it was like real-time API was a month old, you know, so there was no testing framework. So we came with this crazy idea of like, okay, why don't we use real-time API to test the real-time API?

Since, you know, we had dozens of experiments where we were able to make the real-time API move in all these weird places where like, you know, talk with a German accent, you know, or talk really slow or super fast, you know, it was... Lots of emotion. Yeah, exactly. Like you were able to prompt it to talk differently. Yeah.

So we built an entire system where we would randomize different characters with different properties, saying like, you know, they talk fast, slow, talk with an accent or not, you know, British or American, but also be sad, you know, or be happy.

And yeah, we had like a wild range, you know, we went all out because if we thought like we can automate this, then we can just test whatever. You know, that's the amazing thing about automation. You will do much more the moment you automate it. But yeah, so we tried this and we just ran like hundreds and hundreds of experiments saying the same stuff and then seeing how the phone agent would react. And it needed to like order a pizza.
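A condensed sketch of that idea: randomize caller personas, let one session play the customer in that style against the phone agent, and grade the outcome. The simulate_call and grade helpers are hypothetical stubs, not the team's actual harness:

```python
# Synthetic eval sketch: randomized caller personas stress-testing a phone agent.
# The two helpers are hypothetical stubs standing in for the real harness.
import itertools, random

ACCENTS = ["British", "American", "German", "Italian", "Spanish"]
PACES   = ["very slowly, with long pauses", "at a normal pace", "very fast"]
MOODS   = ["cheerful", "neutral", "annoyed", "sad"]

def simulate_call(persona_prompt: str) -> str:
    """Stub: drive one realtime session as the 'customer' against the agent
    and return the resulting transcript. Wiring this up is left out here."""
    return ""

def grade(transcript: str) -> bool:
    """Stub: did the agent actually get the pizza order right?"""
    return False

def run_suite(n_calls: int = 100) -> list[dict]:
    combos = list(itertools.product(ACCENTS, PACES, MOODS))
    picked = random.sample(combos, k=min(n_calls, len(combos)))
    results = []
    for accent, pace, mood in picked:
        persona = (f"You are a customer ordering a pizza by phone. "
                   f"Speak with a {accent} accent, {pace}, sounding {mood}.")
        transcript = simulate_call(persona)
        results.append({"accent": accent, "pace": pace, "mood": mood,
                        "ok": grade(transcript)})
    return results
```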

But then in all these different styles. And the fun thing was that the result of that experiment was that we saw that Italian and Spanish were like mix and match. So the moment there was Italian or Spanish, it would just swap over to one another. Like it didn't really understand that there was a different language, which was really funny. It's the hybrid language. Yeah, so I have the same mistake sometimes. You're like, amigo, that's Italian, right? Yeah.

So it's good to see that these smart models also struggle with that. And, you know, it's the same where the German accent would sometimes convince it to talk German back.

So you had like the Italian accent, it was all fine. But then, you know, when you had the German one, it would just start talking back German. The model would respond in German. Sometimes, yeah. But we saw that accents could indeed trigger it to kind of go over that language barrier. And the final one, like what I discussed with the turn detection, you know, that is something we saw with the slow and fast speaking, that the moment we had slow speaking,

it would randomly trigger that turn detection. So people who were talking really slow were really difficult to keep consistent, because if the pause was just a few milliseconds too long, then it would blow the experiment. And those were all the things we were seeing by

generating all these synthetic data sets. So that was quite an innovative approach. And since this whole thing was a month old, we were like, we're probably the first doing this. It was pretty exciting. And do you think with that turn detection, it will become more personalized later on and be able to adapt real time? How do you see that problem being solved? Yeah, it's a difficult one. I think, first of all,

People just need to learn, you know, to work with the norm. Yeah, that's the quickest approach. But I do see the opportunity for companies because also in the real-time API, you can say, I do my own turn detection. You don't have to do it on the OpenAI side. So what OpenAI has is just a turn detection model. So it would be amazing if the open source community would now work on a turn detection model

that is adaptive. Because if we can clone a voice in 10 seconds, why can't we detect the turn settings of a person in 10 seconds? 100%. Well, yeah, speaking of open source, there is a model that came out and I think you were playing with it a little bit, right? Which model was it? Yeah, the Kokoro. Yeah. Kokoro.

I'm not sure if I'm pronouncing that right. I'm not good with names. The audio of the real-time API isn't either, so, you know. Yeah. No, but this open source model, it blew me away. And it was not, you know, we've seen Eleven Labs, great quality, you know, amazing, amazing work.

For the text-to-speech aspect, right? Yeah. So this is text-to-speech. It's not real-time. But it deserves a shout-out in this episode because it's really amazing. They managed to get really good performance there.

from 82 million parameters. Wow. So all these models are in the billions. Yeah. And these guys just say like, okay, we have here a model that, so if I run it on my MacBook, I get five times the speed of real time out of this text-to-speech. Wow.

So it can generate five times more speech than it has the time for, just on a MacBook. And they claim to reach over 100 times if you have like a proper GPU. But that's liberating because you're basically democratizing the text-to-speech because everyone now can download that model and generate speech for free. And that's a big move.
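For the curious, this is roughly how that open-source model (Kokoro, 82 million parameters) tends to be run locally, and how a "times real time" figure can be measured. The KPipeline interface is assumed from the project's README and may differ between versions:

```python
# Rough sketch: run the open-source Kokoro-82M TTS locally and measure how many
# seconds of audio it produces per second of wall-clock time (real-time factor).
# The KPipeline interface below is assumed from the project's README.
import time
import numpy as np
import soundfile as sf
from kokoro import KPipeline

SAMPLE_RATE = 24_000                 # Kokoro outputs 24 kHz audio
pipeline = KPipeline(lang_code="a")  # "a" = American English in the README's scheme

text = "Hi, your order has been placed and should arrive in about thirty minutes."

start = time.perf_counter()
chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice="af_heart")]
elapsed = time.perf_counter() - start

audio = np.concatenate(chunks)
sf.write("out.wav", audio, SAMPLE_RATE)

audio_seconds = len(audio) / SAMPLE_RATE
print(f"{audio_seconds:.1f}s of audio in {elapsed:.1f}s "
      f"-> {audio_seconds / elapsed:.1f}x real time")
```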

So I'm wondering if Eleven Labs is sweating at this time.

Before we wrap up, I want to talk a little bit about how you think about flows with voice AI agents, specifically comparing them to something like the Toqan web app or Slack bot that you can interact with. We heard from Paul that it's really difficult to do all of this in real time. The latency starts to get really high and it's not quite feasible yet.

Yeah, so feasibility is a large word, but there are definitely things you need to rethink. I still remember the first time I had the agent running on my laptop with some tools and I was like, oh, I'm so amazing. I made this voice agent work. It will definitely be great. And I started talking to it.

And it called the tool. I was really happy, but then it was just dead silent. And it was awkward. I was like, oh, this tool takes long. Because you don't know if it's working.

You might have known because you were looking at logs and you were seeing it. But if it was me on the other end of a phone call, I wouldn't have known what's going on. Are you still there? What's going on? Yeah, so it would just like stop talking because it was executing the tool because it was the logical next step to do. Because I prompted to do that. If the user says this, call the tool.

It was waiting on the tool output and then it was reading out the tool output. You know, it's like, hey, I just got this tool response and it's saying this. And I was like, oh, that's an awful experience. You know, it was truly awful. And I was like, this is not going to work. And then we knew that we need to rethink these things. So there are a few tricks you can do. You know, even if a model

calls a tool, it can, going back to text, do text and a tool call. Audio is the same. It can do audio and a tool call. So you can prompt it to say, if you call a tool, tell the user that you are doing it and maybe explain what the parameters were in the tool call. That can kind of buy you a bit of time. But the other side is, of course, make the tool really quick.
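A sketch of those two tricks combined: the instructions tell the model to narrate its tool calls, and the tool handler returns immediately by pushing the slow work onto a queue. Event and field names follow my reading of the Realtime API reference and are assumptions; order_queue and the status message are made up:

```python
# Two tricks for tools in realtime voice: (1) the instructions tell the model to
# say out loud that it is calling a tool; (2) the handler answers instantly and
# pushes slow work (actually placing the order) onto a queue for later.
# Event and field names are assumptions to check against the API reference.
import asyncio, json

INSTRUCTIONS = (
    "When you call a tool, first tell the caller what you are doing and with "
    "which details, then call it. Never go silent while a tool runs."
)

order_queue: asyncio.Queue = asyncio.Queue()  # slow work is consumed elsewhere

async def handle_event(ws, event: dict) -> None:
    # The model finished emitting a function call: acknowledge fast, offload the rest.
    if event.get("type") == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        await order_queue.put(args)  # actually placing the order happens off-call
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps({"status": "order received, confirmation will follow"}),
            },
        }))
        # Ask the model to keep talking now that the (fast) tool result is in.
        await ws.send(json.dumps({"type": "response.create"}))
```

Whatever consumes order_queue does the genuinely slow part, and the caller just hears that a confirmation will follow, which is the "separate the workflow" point made next.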

So don't add tools that take long in a real-time conversation. Interesting. Then it's just better to offset. Then you're just saying like, hey, that thing you want me to do, it's being initiated. You will get a process message or when it's done, I will send you an email or a text. So you separate the workflow. So don't do stuff real-time that...

cannot be handled at real-time speeds. So that's, like, the first realization: only fast tools we're gonna keep, and then we're gonna offset these jobs. You know, where it's like, okay, the actual ordering of the food, that is a job, and we say, like, we're gonna order you food, that's it. You know, you're not gonna call the pizzeria and then they leave you on the line until the pizza is done, right? That's so...

It would be really weird. So that's kind of the same with the real-time voice. And the second one is you need to rethink flows in a way where you need to avoid spelling. So if I say, hey, my name is Floris, get me my medical record to come back to the medical discussion.

And it can then spell my name in 20 different ways. You know, it can even detect it like a normal English word, but it could be really awful at getting my spelling right. And that's essential for the next step. So...

You could kind of work around that. So maybe let's take it more in an HR setting. So it could ask me like, hey, what team are you in? I'm like, hey, I'm in the AI team. And it's like, okay, what's your name? And then it already has a short list of names. So it has like 25 options. And then I say Floris. And then it's precise because then it knows from all these names...

Floris looks the most like this spelling and then it can map. And so redoing that workflow, now you're overcoming the spelling issue. So it's not a limitation. So where we started, is it viable?
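That shortlist trick needs nothing more than the standard library: match whatever the model heard against the known names instead of trusting its spelling (the team roster here is invented):

```python
# Map a possibly-garbled transcription of a name onto a known shortlist
# instead of trusting the model's spelling. Standard library only.
import difflib

AI_TEAM = ["Floris", "Paul", "Demetrios", "Ana", "Rahul"]  # invented shortlist

def resolve_name(heard: str, shortlist: list[str]) -> str | None:
    """Return the closest known name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(heard.capitalize(), shortlist, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(resolve_name("floors", AI_TEAM))  # -> "Floris"
print(resolve_name("flores", AI_TEAM))  # -> "Floris"
```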

So yes, some things are not viable, but there are workarounds. We just need to be more creative. And I'm not sure if spelling will ever be something that will be solved. Well, you saying that brings up a really cool company that I've seen called Canonical AI. And they basically are doing product analytics for voice AI agents. And one of the things that they do is they see where these voice agents are falling flat.

And it gives you as the creator of the agent, the ability to click in and recognize we have a blind spot when people start to say their names or we have seen that there is for some reason on this flow that

a lot of users disengage after the agent comes back. And so you get to recognize out of the way that people are interacting with your voice agents, where is it not working? Because it could be, again,

It took too long or it could be it's just giving the wrong answer or it didn't respond that it was thinking. There's so many X factors going back to the initial part of this conversation that you want to try and find where these flows are breaking down or where customers and users of the agents are not having the best experience so that you can try and tweak the flow a little bit more.

Yeah, that's super valuable. You know, especially at this point where nobody really knows where these things break, the more insights, and especially real-time insights, if you're able to follow your agent real-time, you can react fast and you know the pain points and you know how to improve. Or think about if you're a call center and you all of a sudden are having...

20 or 50 or 500 calls that are hanging up after the same step. And really a call should be a minute, but now it's like, there's an average of three seconds for this call. You want to be alerted of that. And it's real-time insights to say, hey, something's going on here. It's like your Datadog or your Prometheus in software. Yeah.
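A toy version of that kind of alert: group calls by the last step they reached and flag steps where an unusual share of calls end (the call-log rows and field names are invented placeholders):

```python
# Toy drop-off alert for a voice agent: flag steps where too many calls end.
# The call log rows are invented placeholders for whatever your analytics emits.
from collections import Counter

call_log = [
    {"call_id": 1, "last_step": "ask_name",      "duration_s": 4},
    {"call_id": 2, "last_step": "confirm_order", "duration_s": 62},
    {"call_id": 3, "last_step": "ask_name",      "duration_s": 3},
    {"call_id": 4, "last_step": "ask_name",      "duration_s": 5},
]

def hangup_hotspots(calls: list[dict], share_threshold: float = 0.4) -> list[str]:
    ends = Counter(c["last_step"] for c in calls)
    total = len(calls)
    return [step for step, n in ends.items() if n / total >= share_threshold]

print(hangup_hotspots(call_log))  # -> ['ask_name']: most calls die right at the name step
```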

Yeah, it needs to be there if you're building software products to have these real-time insights. I think it's an amazing company. So one of these insights we actually gained from looking at these logs was actually that it was overconfident if people would mispronounce or missay something. So it would never ask the question, can you repeat that? Yeah.

You know, and it's such a normal thing to do for humans, you know, if they're uncertain. But it's exactly the same as with text, like they're overconfident and they would never give you a no. So I'm really looking forward to the first audio model that will tell me, like, can you repeat that? Or it just says, huh? Yeah, that's the next frontier, you know, the can-you-repeat-that model. Oh, that would be good.

All right, that's it for today. A huge shout out to the Prosus team for their transparency, because it is rare that you get companies talking about their failures, especially companies that are this big in the AI sector, and really helping the rest of us learn what they had to go through so painfully sometimes.

Also, a mention that they are hiring. So if you want to do cool stuff with the team that we just talked to and even more, hit them up. We'll leave a link in the show notes. And if you're a founder looking for a great design partner on your journey, then I highly encourage you to get in touch. We'll leave all the links for all that good stuff in the show notes.