Today on the AI Daily Brief, a case study in building voice agents. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.
Today, we're doing something a little bit different that I'm very excited about. As you guys might have heard, over the last six months, our team at Superintelligent has been working on a voice agent that is effectively the core of a new type of automated consultant that we deploy as part of our agent readiness audits. Agent readiness audits are a process whereby we go in and interview people inside companies about, A, all of the AI activities and agent activities they're currently engaged in, as well as, B, just their work more broadly.
The goal is to benchmark their AI and agent usage relative to their peers and competitors, as well as to map the opportunities they have to actually deploy agents to get value. A core part of how we do this is a voice agent that we've developed that can interview dozens, hundreds, or thousands of people at the same time, on their time, 24-7, totally unlocking a differentiated ability to capture information compared to anything that consultants have previously had.
Today, we're talking with our partners at Fractional who have been helping us build this technology to do a bit of a case study in what it looks like to actually build a voice agent. It's been a really fascinating process and we're excited to share a bit of the learning, especially because we think that this is a technology that many of you are probably going to deploy for your own purposes in the months or years to come. All right, Eddie, Chris, welcome to the AI Daily Brief. How are you doing?
Doing great. Awesome. Thanks for having us. Yeah, this is going to be a fun one. I mean, this is something where we're talking about something that you guys have built, you know, lots of versions of, that we have built together. And I think that, you know, this is a little bit different than our normal content because,
as opposed to just talking about what's going on in markets theoretically or what people are building theoretically. We're actually talking about something that we've got live that we've done some reps on, let's put it that way. So I think just to kick it off, maybe if you guys could give a little bit of background on Fractional and yourselves, just so people have that context before we dive in. Yeah. So I'm Chris, CEO, co-founder here at Fractional. The basic thesis behind the business is that one of the biggest winners of this whole AI
moment is going to be non-AI businesses, your everyday company that can use Gen AI to improve its operations, improve its products and services, and that those companies need help. They especially need help from top caliber engineers who can wrangle this magic hallucinating ingredient into production grade systems. And so the purpose behind Fractional is to bring those engineers together in one room, have them all work on Gen AI projects,
and learn best practices from each other and build out the best-of-breed engineering team in the world. And so that's been very much the vision from day one. And it's going exactly according to plan, which is always fun with a startup, and I think the first time in our entire careers where that's the case.
So it's been great. And working with you and your team on the voice agent has been really fun. Awesome. And Eddie, maybe we can actually introduce you a little bit with my first question just to set up. So I think that the main thing we want to do today is actually talk about what it looks like to put a voice agent into production. I think we have learned a bunch of things. We continue to learn things in practice. But maybe to kick off,
I think just zooming out, one of the big questions that we always deal with when it comes to enterprise customers, enterprises that are thinking about AI transformation is this buy-build question, right? And I wonder, you know, you guys are front lines dealing with this. Is this even the right way to think about things at this point? You know, especially when it comes to agents,
Is there actually like a strict buy-build hierarchy? Is everything just some spectrum of build? Like, what do you think the sort of current state of buying versus building is with agents, you know, especially as companies are thinking about what it means to even enter the agent space?
Yeah, I think it's right that everything exists somewhere on the spectrum. I think it's pretty rare that you have a workflow, or a product feature, that's a good fit for an agentic solution where you can just go buy something off the shelf that just works. The off-the-shelf stuff is great for really general-purpose productivity, and things like deep research that are sort of generalized tools are awesome. But when it comes to specific bespoke workflows in your business,
I think there's a spectrum of, are we building all the way from scratch? Are we building on top of good, powerful new primitives that are coming into the market? Are we doing some building work that requires just sort of integration of off-the-shelf tools? But I think it's rare that we see great fits of sort of off-the-shelf tools that really replace an existing manual workflow.
Yeah. And this has sort of been our experience as well. Everything is to some extent build, even if it's only customized. And so with that as background, you know, you guys have now had a chance to spend a bunch of time, you know, thinking about voice agents, digging into voice agents. There clearly seems to be resonance with voice agents in the market. Like a lot of people are finding a lot of different use cases. Do you have a thesis for why that is or what you attribute that to?
I think the technology has just gotten a lot better. And I think that the applications are obvious. You know, any business that has some kind of call center or has some kind of bottleneck in their business that is voice related is looking in the direction of this technology, because I think the applications are
broad and obvious. And the technology is finally there. You know, if you have an experience of talking to one of these things in the wild, I've only had a few thus far, but they're starting to become more frequent. And every time I'm always impressed by what a pleasant experience it is as a consumer. And so I think we're just going to start seeing these things pop up everywhere.
Also, I think voice is just a great fit for certain kinds of data collection, basically. I think you'll see it in the use case we're going to dive into in a minute. With Super's use case, there's a reason why, when you go to do research about what's going on inside of a big company, one of the things you do is go in and interview people and ask them questions instead of just sending them a survey. The fixed data-entry kind of task is not a great fit for a lot of kinds of situations where you want big open-ended responses and you want people to sort of ramble and, you know, realize their thinking on the fly. Things like that happen really naturally over voice. And to Chris's point, finally, the technology is at a place where we can start to chip away at the kind of stuff that only a human interviewer could have done before. Yeah.
Yeah, I mean, I think it's interesting. So for background, so we're going to talk about, you know, the voice agent that we've been collaborating on is this sort of data collection experience, right? It is meant to capture information around people's current workflows, their current AI, you know, adoption techniques in order to help us give them recommendations around what agent opportunities they have. That's the core idea. And the starting point, the central sort of genesis of this was that
A, to your point, Chris, the technology was such that it actually just is good enough to do this, right? You can actually have an agent interview people and it does a pretty good job. Not off the shelf, as we'll see. We had to do a lot of development to make it work, but still, the capabilities are there. The second piece, and I think, Eddie, this is the piece that you were speaking to, is it is actually not just as good an experience as the human equivalent.
There is a lot to recommend this as an actual, just factually better experience. First, there's the fact that you can collect information with voice, having people talk instead of type. It's so much easier for many, many people, if not most people, to ramble about something and just speak at it than to sit down, try to collect their thoughts, try to structure it, and type it. And it's faster no matter what,
right? Just the amount of information per unit of time is going to be way, way higher if you're having people talk. So that's one. Second, the ability to do that on demand, on your own schedule, wherever you are, maybe while you're walking to work, or at 4 a.m. when you can't sleep, as opposed to having to schedule a human interview, that's not a 1x improvement, that's a 10x improvement in convenience. And so I think those two things combined, both the fact that the
technology is there and the fact that it's actually just a better potential experience, makes a huge difference. You know, certainly that's sort of the insight that we had going into it. Yeah. In addition to that, you don't have to hire out a team of thousands of consultants in order to conduct the kind of interviews that you guys want.
Yep. In fact, it's interesting, and maybe we can come back to this, but, you know, I've had a lot of conversations with consultants after having built this. And on the one hand, it's fairly disruptive to at least a piece of what they're trying to do, right? This data collection is something that consultants bill lots and lots of money to do. Interestingly, what I keep coming across is that
consultants don't see their value, their primary value as collecting information. It's like the proprietary knowledge and experience they have, the way that they analyze it. So they're actually extraordinarily bullish. They don't want to have to force their customers to use a huge portion of their budget just on the data collection. They'd much rather have that be able to go to the actual processing, the analysis, what they do next with it, right? So even though this sort of piece is actually theoretically disruptive, I think it's
likely to sort of shape how we see that industry evolve as well. I think there's also just a whole breadth of insights that are probably not being captured in a lot of those sort of consulting scenarios just because you're limited by only being able to do whatever, 10 interviews or something like that. Whereas what could you learn if you could actually do 1,000 custom interviews in parallel and be able to actually process the data coming back from that?
The point about this not being what the consultants want to be doing, too, that is something we see broadly across basically every project that we do. It's the repetitive work, the stuff that takes away from the higher-order tasks on your to-do list that you don't have time to get to, that AI is so well suited for. And very often we find that exact kind of dynamic, where we're automating away exactly those things.
The bang-your-head-against-the-wall, do-this-a-bunch-of-times, not-super-intellectually-stimulating kind of stuff, we can delegate, whether that's voice or text, and free up people to do higher-order tasks. Awesome. Well, let's dive in and talk about what it looks like to actually build a voice agent in practice and what we've learned. So Eddie, I'm not sure exactly what the right place to start is, but I'll let you take it away from here and dig into it.
Yeah, absolutely. So, you know, I think you sort of called out correctly earlier that the technology is there, but that doesn't mean it just works off the shelf or that you don't need to do a bunch of custom work here. In this use case, we really leaned on that technology to build this interview agent. And by the way, the way this agent actually works in practice is we configure it with sets of interview questions and goals. So here are the things we want the person to be asked.
Here are the reasons why we're asking them. We prioritize those goals. And that's kind of the input to this very agentic system that is then in charge of deciding how exactly do I phrase these questions? When do I follow up? What do I ask next? When have I met my goals?
And so it's got a lot of agency. It's highly sort of undirected. And the kind of out-of-the-box technology that we have access to right now, and there's a few different alternatives here, but the one we chose for this project was the OpenAI real-time API, which has great real-time voice capabilities. It's got nice, realistic voices that sound pretty human, and it's pretty smart in its ability to sort of make decisions on the fly.
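To make that concrete, here's a rough sketch, in Python, of what pointing the OpenAI Realtime API at an interview can look like. The session fields and event names follow the public Realtime API docs at the time of writing; the model alias, questions, voice, and instructions are illustrative placeholders, not the production configuration that gets discussed next.

```python
# Hypothetical sketch: an OpenAI Realtime API session configured as an interviewer.
# Field and event names follow the public docs at the time of writing; the
# questions and instructions are placeholders, not the production setup.
import asyncio
import json
import os

import websockets  # pip install websockets

INTERVIEW_QUESTIONS = [
    "Walk me through a typical week in your role.",
    "Where do you currently use AI tools, and for what?",
    "Which repetitive tasks take up the most of your time?",
]

async def start_interview_session() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer releases of the websockets library call this kwarg
    # `additional_headers` instead of `extra_headers`.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # The "monolithic prompt" baseline: tell the model about the whole
        # interview up front and let it run.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "voice": "alloy",
                "instructions": (
                    "You are conducting a structured interview about the user's "
                    "work and AI usage. Cover these questions:\n"
                    + "\n".join(f"- {q}" for q in INTERVIEW_QUESTIONS)
                ),
                # User-side transcripts come from a separate speech-to-text model.
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {"type": "server_vad"},
            },
        }))
        # From here you would stream microphone audio in and play response
        # audio out; that plumbing is omitted in this sketch.

if __name__ == "__main__":
    asyncio.run(start_interview_session())
```

As the conversation gets into next, a single instructions blob like this is exactly the setup that tends to go off the rails, which is what the sub-agents described below are for.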
If you just give a monolithic prompt to that model that tells it about the interview and the questions it might want to ask, I mean, you get a pretty cool result, but it goes off the rails all the time. It asks weird questions. It's sort of hard to tune when it follows up. And if your only mechanism for control here is a giant monolithic prompt,
your hands are really tied. And so we quickly found that while it ran some interviews well, it ran some interviews really poorly and our control over what happened next was pretty limited. And so one of the areas where it fell down was
It didn't always make smart choices about what question to ask when. We would tell it all the questions up front, and it would be up to it to decide which one is next. And so what we ended up doing is abstracting out an entirely out-of-band sub-agent that's running in parallel in the background, assessing the conversation. And its whole task is: if we were to move on to another question right now, which one should we move on to? And then the core agent is just told, here's the one question you're working on now and the goals.
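A minimal sketch of that kind of out-of-band selector might look like the following, here using the ordinary Chat Completions API as the side model; the function name, prompt wording, and model choice are illustrative assumptions rather than the production code.

```python
# Hypothetical sketch of the out-of-band "which question next?" sub-agent.
# The function, prompt, and model here are illustrative, not the production logic.
from openai import OpenAI

client = OpenAI()

def select_next_question(transcript: str, remaining_questions: list[str]) -> str:
    """Given the conversation so far, pick the single best question to move on to."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(remaining_questions))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You observe an in-progress interview. If the interviewer were to "
                "move on to another question right now, decide which one. "
                "Answer with the number only."
            )},
            {"role": "user", "content": (
                f"Transcript so far:\n{transcript}\n\nRemaining questions:\n{numbered}"
            )},
        ],
    )
    # Real code would validate this output; a sketch just trusts the number.
    choice = int(resp.choices[0].message.content.strip())
    return remaining_questions[choice]

# The realtime agent is then re-briefed with only this one question and its
# goals, rather than being handed the full list again.
```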
So that's one example of how we had to take this thing, you know, from going off the rails and get it back on. Another thing we added was what we were calling the drift detector sub-agent. I think for a while we were calling it the rabbit hole detector. These LLMs are just so, you know, eager to please. Anyone who's interacted with LLMs a lot knows the personality of one, right? And so we were kind of stuck where
we want it to ask follow-up questions. We don't want to constrain it to never ask follow-up questions. But if you give it a little bit of rope, what ends up happening is, no matter what you say, it's like, wow, your job is so interesting. That's crazy. Tell me more about that. It just sort of digs and digs and digs.
And so what we ended up doing was adding this whole side flow that's watching the conversation and just sort of assessing: all right, has this thing gone off the rails? Are we going down the right path? Should we force, under the hood, a tool call that moves on to the next question? So there are a bunch of these subcomponents that go into what feels like one overall large agentic experience.
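Here's a similarly hedged sketch of what a drift detector like that can look like as a side LLM call. The rubric wording, the model, and the idea of an "advance_to_next_question" tool are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch of the "drift / rabbit-hole detector" that watches the
# conversation on the side. Prompt wording and thresholds are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

def conversation_has_drifted(transcript: str, current_question: str) -> bool:
    """Return True if the agent is digging into tangents instead of serving the goal."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "You audit an in-progress interview. Decide whether the last few "
                "exchanges are still serving the current question, or whether the "
                "interviewer has gone down a rabbit hole of follow-ups. "
                'Reply as JSON: {"drifted": true or false, "reason": "..."}'
            )},
            {"role": "user", "content": (
                f"Current question: {current_question}\n\nTranscript:\n{transcript}"
            )},
        ],
    )
    verdict = json.loads(resp.choices[0].message.content)
    return bool(verdict.get("drifted", False))

# If this returns True, the orchestrator forces the realtime agent to move on,
# e.g. by invoking a hypothetical "advance_to_next_question" tool rather than
# letting it free-run another follow-up.
```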
One of the more surprising ones, and maybe anyone that's worked deep in the weeds on voice has seen this before, but I think this is surprising to a lot of people: one of the things we wanted to do here was show a pleasant UI. And that actually added a bunch of constraints. One constraint was you need to actually know what question is being asked, so you can show a little check mark on the screen. You need to know what you're planning on moving on to next. So this actually adds quite a bit of complexity under the hood. One of the areas where this impacted things was showing transcripts.
We want to show a written transcript of what's happened so far. In fact, we even want to enable the user to interact over text if they want to. The OpenAI models actually make this really nice. They return with the API response both the audio follow-up and a transcript of what's happened so far. The problem is that the transcript is produced by a separate model. It's Whisper running on the side, just doing basic speech-to-text. And the core model and the transcript model can disagree with each other.
I think you actually might have had the experience where you were on one of these interviews and there was a sneeze or a cough or something. And I think the core model did the right thing. It was like, bless you. But the output of the transcription was just something that randomly reflected the underlying training data. Like it said, don't forget to like and subscribe, or it would come out in Korean or something like that. Yeah, we had a lot of random background noise turning into foreign language switches. Yeah, yeah, totally. So there's a lot that went into kind of keeping this thing on the rails, right?
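For anyone wiring up a similar UI, the practical consequence is that the assistant-side and user-side transcripts arrive as different Realtime API events, produced by different models. Below is a rough sketch of routing them, with a purely illustrative guard against the Whisper hallucinations described above; the `ui` object and the `looks_suspicious` heuristic are assumptions, not part of any official API.

```python
# Hypothetical sketch of routing the two transcript streams to a UI layer.
# Event names follow the Realtime API docs at the time of writing; the
# "looks_suspicious" heuristic and the `ui` object are illustrative assumptions.
def handle_realtime_event(event: dict, ui) -> None:
    etype = event.get("type")

    if etype == "response.audio_transcript.done":
        # Assistant-side transcript: produced by the core realtime model itself.
        ui.append_agent_text(event["transcript"])

    elif etype == "conversation.item.input_audio_transcription.completed":
        # User-side transcript: produced by a *separate* speech-to-text model
        # (Whisper), so it can disagree with what the core model "heard".
        text = event["transcript"]
        if looks_suspicious(text):
            ui.append_user_text("[inaudible]")  # don't render hallucinated filler
        else:
            ui.append_user_text(text)

def looks_suspicious(text: str) -> bool:
    """Crude guard for Whisper hallucinations on coughs and background noise."""
    boilerplate = ("like and subscribe", "thanks for watching")
    return any(b in text.lower() for b in boilerplate) or not text.strip()
```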
One of the outcomes of this is that you now have a lot of different knobs and levers. You can adjust the core prompt. You can adjust what model you're using. You can adjust the questions you're asking. You can change the wording of the goals. And that large number of degrees of freedom, I mean, it's nice because you now have good primitives to control your interviews, but it's scary because kind of anything can happen, and you don't want to test that in front of users. For all of these sort of AI projects generally, it's absolutely critical early in your development process to build strong evals, you know, some automated way of producing metrics to tell you how well you're performing on all the sort of key things you want to know about your problem.
This one is just so hard. It's voice, it's open-ended. There's no really great source of ground truth. I don't even know. Did you think at all early in the project about what ground truth would look like? I mean, to me, I'm like, could we collect a set of recordings of human interviews? And even if we did, I don't even know what we would do with that. Yeah. I mean, so to maybe reframe the question in just sort of super simple language: what does a good interview sound like, look like, feel like?
It turns out, once you dig in, it's like, wow, that's really subjective. Is it a good interview because it got good information? Is it a good interview because it was prompt and didn't drag on too long? Is it a good interview because, you know, people didn't have to repeat themselves? It's all of these things that it could be. And you add on top of that the layer of just human variability. We are live right now, for example, with a major pharmaceutical company with every single person in a department, 250 different personalities, doing the same interview. What's good to them is highly variable already, just from a human preference standpoint. So yeah, I think this is actually an enormously challenging thing. One of the places that we went, and I know you're going to take it in a different direction with evaluation, but even going back to the way that the experience developed over time,
is we added more knobs. Basically, we made the experience more controllable. That's sort of a shortcut to making the user experience better: giving the user more ability to modify the experience, right? So, you know, to your point, Eddie, at the beginning, if you're very open-ended, and in fact that's a great use case that I would encourage people to play around with voice agents for, the more that you're down to kind of just let the AI wander, the more you can get some really interesting stuff.
For us, we're pretty constrained. We really needed a set of questions to get answered. And there was some amount of sequencing that was important. And so we ended up, one of the big moments for us, I think, with this particular project was creating an interface experience where people could
jump from question to question. So, you know, we had already added a skip or, you know, stop kind of button, but we wanted to go even farther. We felt like we had to go even farther, which was just, I want to look at all the questions, say, I don't care about all these, but I do want to answer that one. And so, you know, there's a bunch of different ways to answer it, but it becomes a product design process very, very quickly, it turns out. Yeah. And like,
You want to know, to your point about what even makes a good interview, you want to know in a lab setting that you're going to have good interviews. To your question earlier about when do you build, when do you buy, voice agents are actually an area where there's tons of great tooling coming out. There's this company, Bland AI, that jumps to mind. They make a great product for designing voice agents. They make it really easy to put a voice agent on the phone, to design conversational flows, etc.
But what we see in terms of adoption is that it's happening in places where people are kind of willing to learn on the fly from real user conversations when things go off the rails. And the sort of tooling out there for making sure, in a lab setting, that you're confident that when I go send this into a Fortune 500 company to do interviews, it's not going to do anything stupid, just getting that confidence is really, really hard. What we ended up doing on this one was we built this whole separate system for
creating synthetic conversations where we collect all these sort of written personas of the types of real people we think we would interview. This is a person in marketing and here are the tools they use, here are the people they interact with, all sorts of things like that. We write out this sort of persona and then we have a separate LLM play the role of fake customer. We conduct these interviews in the text domain where over text, our agent is interviewing this fake user and then we're measuring a bunch of stuff about the conversation afterward.
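A stripped-down sketch of that kind of synthetic harness, assuming the OpenAI Python SDK, could look like the following; the persona text, models, and fixed turn count are illustrative stand-ins for the real configuration.

```python
# Hypothetical sketch of the synthetic-interview harness: one LLM plays a written
# persona, the interview agent runs in the text domain, and the transcript is
# saved for scoring afterward. Persona text, prompts, and models are illustrative.
from openai import OpenAI

client = OpenAI()

PERSONA = (
    "You are a marketing manager at a mid-size company. You use HubSpot and Excel "
    "daily, coordinate with sales and design, and are mildly skeptical of AI tools."
)

def run_synthetic_interview(interviewer_system_prompt: str, turns: int = 10) -> list[dict]:
    transcript = []
    agent_history = [{"role": "system", "content": interviewer_system_prompt}]
    persona_history = [{"role": "system", "content": PERSONA + " Answer the interviewer naturally."}]

    for _ in range(turns):
        # Interview agent speaks (text stands in for the voice turn).
        agent_msg = client.chat.completions.create(
            model="gpt-4o", messages=agent_history
        ).choices[0].message.content
        transcript.append({"speaker": "agent", "text": agent_msg})
        agent_history.append({"role": "assistant", "content": agent_msg})
        persona_history.append({"role": "user", "content": agent_msg})

        # Fake customer responds in character.
        persona_msg = client.chat.completions.create(
            model="gpt-4o", messages=persona_history
        ).choices[0].message.content
        transcript.append({"speaker": "persona", "text": persona_msg})
        persona_history.append({"role": "assistant", "content": persona_msg})
        agent_history.append({"role": "user", "content": persona_msg})

    return transcript
```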
You had asked earlier, you know, what makes a great conversation. We spent a lot of time on this one trying to define that. And we ended up with all of these metrics we produced, and they're all imperfect, right?
With all these eval sorts of questions, you have to find the 80-20: I don't want to spend all my time developing some perfect lab metric for what makes a perfect conversation, because there's so much stuff you won't know until you go into the wild. I think we had this experience where someone just started talking to it in German in the middle of the conversation. Luckily, it just worked, but we wouldn't have guessed that one in the lab.
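And once you have those synthetic transcripts, the scoring side can be as simple as an LLM judge with a rubric, with all the caveats about imperfect proxies mentioned above. The metric names and rubric below are illustrative, not the actual eval suite.

```python
# Hypothetical sketch of scoring a synthetic interview with an LLM judge.
# The metric names and rubric are illustrative; each one is an imperfect proxy.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score this interview transcript from 1-5 on each dimension and reply as JSON "
    'like {"goal_coverage": n, "repetition": n, "pacing": n, "notes": "..."}. '
    "goal_coverage: did the agent get the information the questions asked for? "
    "repetition: did the user have to repeat themselves? (5 = never) "
    "pacing: did it drag or rabbit-hole? (5 = well paced)"
)

def score_interview(transcript: list[dict], goals: list[str]) -> dict:
    convo = "\n".join(f'{t["speaker"]}: {t["text"]}' for t in transcript)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": "Goals:\n" + "\n".join(goals) + "\n\nTranscript:\n" + convo},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```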
Yeah, you know, and adding complexity to this, my sense is that we've learned a lot of things, we've solved a lot of problems, but then there are new problems that come up. One that I think is a continued challenge with the evaluations is we have this great suite of tools for testing, for kind of seeing how different personas might interact.
But the AI still defaults to assuming that all those personas will engage in good faith for the time it takes to finish the interview. Whereas within the first three interviews that we tested, a CEO started swearing at the thing halfway through, you know, question four, and dropped out. By the way, he ended up coming back and it was a very useful interview, so it all worked out fine. But,
The synthetic testers did not think to storm out of the room as part of their tests based on their personality. Yeah, I don't know if you've ever done this, but sometimes I just have fun going into ChatGPT and trying to get the last word. And it never happens, right? You say, okay, bye. And it's like, all right, see ya. Every single time, they don't give up. I do think, though, that normally you use these evals just to test
the software. It's like you're writing a custom workflow where you know reasonably well what good looks like. And then the question is, is our system good? Here, you're like also designing an interview while you design the system that can support interviews. And the number of degrees of freedom is super, super high. I think that's common across anything voice and anything that is conversational. Like, you know, the developers working on ChatGPT are
have their work cut out for them to figure out, like, are we having good conversations? Do we mess up? Those are really fuzzy things to measure. Yeah. You know, and I think, too, one of the experiences and learnings for me, which is helpful especially because our use case is literally helping people figure out where to deploy agents or which agent use cases to think about, is this.
There's all sorts of different definitions of what exactly an agent means. But I tend to come back to the very, very kind of clear and simple way that I think enterprises think about it, which is,
AI is stuff that I use to make my work better. Agents are, you know, things that do the work for me. And that is very crisp and clean in the context of this voice agent, where we are handing a customer over to it to ask a bunch of questions, with information that we need to get, with no ability to intervene if it goes off the rails or doesn't do a good job. You know, it's a small thing, it's not all that risky, but ultimately we're letting the agent do the interview.
And it really is a clearly different thing than, you know, us using ChatGPT to help prep for an interview or something like that. And it turns out, and Eddie, I think this is sort of part of your point, literally as soon as you are allowing a thing to go do the thing, the degrees of freedom just become so much more immense than the normal software experience. And that's even in a relatively constrained environment, like, there are 20 questions that we really need you to answer.
Yeah, I think a question on everybody's mind right now is, what is an agent? Everybody's got a separate definition, a separate way of framing the problem, and it's just a hot topic in conversation right now. I think we both agree that this one is a highly agentic kind of example in a fairly obvious way. We tend to think of agency as being a sort of spectrum: there are less agentic things and more agentic things. And,
there are a few sort of sub-attributes that lead to something feeling more agentic. One element here is how open-ended is the task? Here, it's completely open-ended, right? You're given an interview, but you can really vary what you're doing. Another is, how complex is it? You know, you might have an open-ended task, but the task is spam detection, and the eventual result is just, is this spam or is this not? This one is super open-ended. You have very broad goals you're defining. Yeah.
And then the last one is sort of like, I think what you were sort of talking about a second ago, which is who's taking the action at the end of all of this? You know, is there some system that's behind the scenes eventually making a recommendation to a person? In this case, no, right? Like there's nobody sitting there watching the interview. The person doesn't even get involved until you're reviewing the results of the interview and trying to synthesize it. Even then, I think like that's in the to-do list to start to tackle next, right? We're going to keep moving through that and see how many places we can apply agents in this process.
So as we kind of zoom out, having gone through this experience, and obviously you're bringing to bear, you know, tons and tons of different projects at the same time: what does this make you think about? Are there other use cases that you're excited about for voice agents, where you think companies should really be thinking about these things? And maybe that's either specific use cases or just types of problems or types of opportunities that you think they're particularly well suited for?
Yeah, I think inbound phone calls. And especially within that spectrum, generally what you're looking for is: what's the 50% of call volume that is very simple tasks? Start with that, with the ability to escalate for the more complex things. So that's one bucket. Another bucket is outbound B2B calls, so things like calling insurance companies to gather information. That's another big bucket. In general, one of the best practices with this is,
you always want the person who's talking to the agent to know they're talking to an AI agent, and not to pretend that it's a human. I think people are very forgiving about being on the phone with AI agents, and they tend to be very positive experiences. But I can imagine hiding it from a person would open you up to a very bad experience.
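As a sketch of how those two practices tend to show up in an implementation, you can bake the disclosure into the system instructions and give the agent an explicit escalation tool. The tool name, schema, and instructions below are illustrative assumptions, and the exact nesting of the function-tool object varies slightly between the Chat Completions and Realtime APIs.

```python
# Hypothetical sketch: an inbound-call voice agent with an explicit escalation
# path and an "I'm an AI" disclosure. Names and fields are illustrative only.
ESCALATION_TOOL = {
    "type": "function",
    "name": "hand_off_to_human",
    "description": (
        "Transfer the caller to a human agent when the request is outside the "
        "simple, high-volume tasks this agent handles."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "reason": {"type": "string", "description": "Why escalation is needed"},
            "summary": {"type": "string", "description": "What the caller wants so far"},
        },
        "required": ["reason", "summary"],
    },
}

INSTRUCTIONS = (
    "Identify yourself as an AI assistant at the start of the call. "
    "Handle the simple, high-volume requests yourself; for anything else, "
    "call hand_off_to_human rather than guessing."
)
```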
If I just think back to my last week, what I've seen in voice agents, they're all over the place and they're all super interesting in their own way. We see folks in healthcare that are currently doing a bunch of stuff that's very similar to your use case: someone conducting interviews today, interviewing a bunch of physicians to do market research. And I think it's an open question whether the right answer there, in such a regulated space, is to allow a voice agent to do that, or whether the voice agents are riding shotgun and providing suggestions.
But in either case, it seems like it can help there. We've seen folks in the rail industry going on trains, doing safety sort of inspections where they're kind of trying to take notes on an app today and it's super awkward. They're on a train interviewing a conductor, talking out loud to them, but also trying to take notes. And it's just a bad UX. And so the agents are guiding that as potentially a better experience.
Or a technician who's on site and needs to refer to an instruction manual for this big, complicated piece of machinery. And instead of trying to flip through the manual, they could maybe interact via voice. Awesome. Yeah. I mean, I certainly think
our experience has been immensely positive. Like I said at the beginning, this is not a one or two x improvement over the alternative. It is massive, you know, you can't even really calculate it. It was not possible before to interview every single person in a company about what they do and try to map agent opportunities. It is now possible. Theoretically, if they all did it at the exact same time, it could all happen
in a half an hour. So we're super excited. We love working with you guys on this. We're excited that more and more companies are interacting with it, giving us more context to learn from. Really appreciate the time today as well to share it and excited to bring you guys back as we continue to build this out. Awesome. Thanks so much for having us. Yeah, thanks for having us.