Welcome to today's episode of Lexicon. I'm Christopher McFadden, contributing writer for Interesting Engineering. In this episode, we're joined by Hassan Raza, CEO of Tavus, to explore the future of AI-driven video agents, conversational AI, and the ethics of human-AI interaction. From transforming healthcare to predictions about AI in government, Hassan shares groundbreaking insights on where AI is heading next. Let's dive in.
Hassan, thanks for joining us. How are you today?
I'm good. Thanks for having me.

Our pleasure. And for our audience's benefit, can you tell us a little bit about yourself, please?

Yeah. Hey, I'm Hassan. I'm the co-founder and CEO here at Tavus. Tavus is an AI research company building what we call the operating system for human-AI interaction. We're a research company, so we build AI models that let machines learn how humans communicate, to teach them to see, hear, respond, even look like humans. We're inspired by face-to-face communication, and that's what we want to replicate.

Brilliant. Fantastic. Well, building on that then: conversational AI has come a long way, from simple chatbots to sophisticated AI-driven video agents like yours. How do you see this evolution shaping the future of communication?

Yeah, this is interesting, because we almost see...
We like to talk about the chatbots and IVR systems of the last two decades, and even some of the recent AI chatbots, as to some degree a regression in the human experience, right? We used to have humans in those positions who would give us trust, who would spend time with us and really build up a rapport.

And these systems have sort of eroded that. We feel like we're spending more time with these systems, we're paying more money than ever as humans, and yet we're getting less from the experience. And it's all in pursuit of scale. To some degree we have democratized access to things like healthcare, using telehealth, but the actual experience has regressed.

At Tavus, our focus is on how we can actually bring that human touch back without sacrificing scale. So we see conversational video and these AI human personas as a way to shape a future where you can have great scale, you can democratize access, but people still get the humanity of face-to-face interactions, because that's how we were evolutionarily shaped.
Absolutely, yeah, definitely. Because there's a big risk of becoming even more detached through technology, right? Especially as AI is only going to accelerate that.

Yeah, I think that's a good point.
So building on that then: many industries are beginning to integrate AI into their workflows. Which sectors will be most significantly transformed by conversational AI in the next five years, in your view?

Yeah. So I think the industries that will be most positively affected by conversational AI and conversational video, this face-to-face interaction, are the industries where there has already been a regression, or where they are capped by human capacity today. Healthcare is a great example of this.
We don't have enough receptionists, nurses, physicians, pharmacists. We don't have enough of these people, and so patients often just can't get the access to care they need. Again, to some degree we have tried to solve this problem through telehealth. Now you can get an appointment rather quickly, but you're actually spending less time with the physician, and the physician is having to do many different tasks. It used to be that you'd walk into a doctor's office, you'd talk to the front-desk worker, and they would learn a little bit about you. Then you'd talk to the nurse, and they would learn a little bit about you. And all of that would go to the physician, who could really know you deeply and spend time with you to actually diagnose and build trust.

And now we've sort of said, oh, well, now you have this 15-minute window where you're going to talk to the doctor. You fill out these forms ahead of time, but no one really does. And so then you get into the appointment and you're like, I can only talk about the one thing that's bothering me, even though there are four things.
And we really think that conversational video, in our experience, is already making a big impact in this space. Now you can have this awesome AI front-desk worker, an AI nurse, that comes in ahead of time. You can spend time with them. You can even explain things in your own language, if you don't speak the target language. You can go at your own pace and explain everything. There's that trust, that face-to-face communication.

And then you go in with a real doctor. So it allows the doctor to actually do more and spend time where they're more effective. I think healthcare is big here, along with coaching, learning, education, any of these use cases where the human experience is so important and yet they're limited by scale today.

Okay. Presumably you've been testing it. Do you find it is able to build a rapport with the other person?
Or, what I'm trying to say is, it's not just a sophisticated chatbot with a face, right? It can actually interact. It's more than just a video.

Correct, yeah, exactly. So especially with this new release, we've really doubled down on this: we're not just putting a face on a chatbot. We're actually trying to replicate what makes a face-to-face interaction a face-to-face interaction. There's a reason we prefer Zoom over phone calls, over text messages: humans are high-bandwidth communicators whenever they can see the facial expressions of someone else. There's trust-building. It's evolutionary. And so when we say that we're building this sort of OS of human-AI interaction, we mean we're actually building a system that has vision and perception. It can see someone's emotion, hear their emotion, see the environment they're in, and use that context, which is really powerful. With Raven-0 and the Raven class of perception models, it's actually detecting emotion and doing perception and vision much the way a human would, with context. It takes into account your surroundings, it takes into account the conversation, and it uses that to determine someone's emotional state, which is something we've seen to be so important in, say, a healthcare situation: is the patient upset? You can also set indicators, like: do they look like they're in distress? Are they in a situation that's unsafe? All the things a human would pick up on as well.

And then we have these great understanding and turn-taking models, the Sparrow class of models, which really have conversational context and a very natural way to interact. That needs to be really, really low latency, because humans communicate really quickly with each other, or they give you space. It needs to be adaptive and understanding.
And then finally there's the rendering piece, the Phoenix class of models, which actually gives it a face that's expressive and realistic. With all of this coming together, it's not that we're trying to trick someone into thinking it's a human. Actually, in all the deployments, our customers today, from Fortune 500 companies to startups, are saying it's AI, and yet people are spending more time with it. They're actually explaining their answers instead of trying to short-circuit it. With audio or chatbots, we've seen people just go, get me a human, get me a human. But in this case, they're actually explaining and being thoughtful about it, because there's that trust. They're sitting up straight in the AI interviews, because they're like, oh, wow, this thing can see me. It feels like a human.
Good God. So, I'm not sure what the exact percentage is, but some huge share of human conversation, 90% or whatever the figure is, is just your facial expressions, right? How good is this AI at detecting that, at being able to pick up on those subtle cues?
Yeah, I mean, it's one of those things where it's going to get better over time, of course, but the model that we have is best in class. And the big advantage here is context. Historically, over the past couple of decades of computer vision and affective computing, which is this field of perception and emotional understanding, it's been done without context, right? There's been this attempt to classify human emotion into labels: there are six degrees of emotion, there are 12, there are 20. At Tavus, we took an approach that's more human, which is: no, actually, humans are way more nuanced than that.

Even if you look like you're smiling, the context you're in might mean that you're not actually smiling for that reason. Or you might not be smiling at all; you're actually just nervous, right?

It's almost like reading a novel. When you read a novel, it gives you that imagined sense of how the characters are feeling. And that's the type of model we've built. So it's really, really good at understanding human state and emotion. But of course, it's only going to get better. This is one of the first models we've built like this. It's best in class, but it's also the worst it will ever be.

Well, it sounds impressive. While you were talking, I was thinking of subtle
cultural differences in facial expressions as well. I think I'm right in saying that in India, nodding can mean no and shaking your head can mean yes, or something like that. I could be wrong. Is it able to detect that?

Totally. So this comes down to data, and making sure there's not too much bias in the data: being able to collect enough data to allow the model to really understand those cultural differences. And that's why the context is so important; cultural context is one part of it. If you just take landmarks on a face, you're going to get some random emotion that isn't really contextual to where the person is or who they are. But our model takes all of that into account and actually uses it to say, okay, maybe this person speaks more with their hands, maybe they're not as emotive on their face.
And that's a big focus of ours: how can we make the model handle that diversity better?

Well, fantastic. So presumably it will learn and remember an interaction with a person. Like you say, oh, this guy likes to talk with his hands; this one's shy, I've got to make this one open up more. I don't know.

Totally, totally, yeah. So a lot of those things are learned traits from when we train the model. And in the future, we want to make sure there's personalization, so you can align that as well. But right now, it's all part of the training. The model will be able to understand that you're maybe in a different context, and use that to say, okay, this person is of this background, they're here, and this means they're upset for this reason, rather than just labeling
happy, sad, nervous.

Gotcha. Do you think you'll ever reach a point where it will be indistinguishable from a human on Zoom? Or do you think it's better to, what I'm trying to say is, introduce imperfections so it's obvious it's not a human being? I don't know.

That's a good question. A big focus of Tavus, one of our North Stars, is: how can we keep the immersion alive? Which is, even if you're told it's AI, and we actually recommend everyone say it's AI, are you immersed enough that you're just like, this is great, I'm having this conversation, I feel trust, I feel this want to talk? And artifacts, to the extent that they're present, can sometimes take away from that. So our goal is to get even more realistic and really make it feel face-to-face. Yet the disclosure piece is really important. So there are safeguards, like the AI will always say it's AI, no matter what someone might want to portray it as. That's important: it needs to say that it's AI, and there need to be disclaimers. But I do think it's important for us to get towards more realism, because there are a lot of facial expressions and things of that nature that still need to be solved to be more natural.

Okay, fair enough. Moving on a bit then. If you can, in simple terms, can you share how your technology works? What makes it unique compared to traditional video creation?

Yeah, great question. So there are a few things.
Our research motto has always been that we should never be limited by the technology of today; we should go and create and research the things that don't exist yet. There weren't good perception and vision models with human-like perceptual capability, so we created that, and it's incredibly low latency. We also have the most advanced conversational engine: incredibly low latency, the best turn-taking of any system, the ability to really understand, oh, I should give you space, or I should talk fast, and when should I respond? And then there's the rendering itself: photorealistic rendering with minimal training data that can do full-face expressions, the first model of its kind able to do full-face expressions in real time. It just makes the experience feel alive, a lot more natural.

The entire pipeline is built to be coupled together, so the latency is incredibly low, and the result is as aligned with a human as possible today. All those pieces coming together, the information we're collecting, how we use that information to condition responses and condition the actual rendering, all of it couples well to create this amazing experience.
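To make the three-stage pipeline concrete, here is a minimal sketch of how perception, turn-taking, and rendering could couple in code. This is purely illustrative: every class name, method, and signal below is an assumption for the sketch, not Tavus's actual API, and the three stages (Raven-like perception, Sparrow-like turn-taking, Phoenix-like rendering) are stubbed with toy logic.

```python
from dataclasses import dataclass

@dataclass
class Perception:
    """Raven-like stage: extracts emotional/contextual signals (stubbed)."""
    def analyze(self, frame: dict) -> dict:
        # A real system would run vision models on audio/video frames;
        # here we just pass through whatever labels the frame carries.
        return {"emotion": frame.get("emotion", "neutral"),
                "environment": frame.get("environment", "unknown")}

@dataclass
class TurnTaking:
    """Sparrow-like stage: decides what to say, conditioned on perception."""
    def respond(self, utterance: str, context: dict) -> str:
        tone = "gently" if context["emotion"] == "distressed" else "plainly"
        return f"Responding {tone} to: {utterance!r}"

@dataclass
class Renderer:
    """Phoenix-like stage: renders an expressive face (stubbed as text)."""
    def render(self, reply: str) -> str:
        return f"[face+voice] {reply}"

def pipeline(frame: dict, utterance: str) -> str:
    # The stages are coupled: perception output conditions the response,
    # and in the real system it would also condition the rendering.
    context = Perception().analyze(frame)
    reply = TurnTaking().respond(utterance, context)
    return Renderer().render(reply)

print(pipeline({"emotion": "distressed", "environment": "hospital"},
               "I'm not feeling great today."))
```

The point of the sketch is the data flow, not the models: perceptual context enters upstream and conditions everything downstream, which is what distinguishes this from bolting a face onto a chatbot.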
Okay. What kind of latency are we talking? Milliseconds?

Yeah, absolutely, milliseconds is the measure here. It's actually funny: we've seen that it can be as fast as, or even faster than, 600 milliseconds. What we found, before we built Sparrow, the intelligent turn-taking system, was that this was actually too fast.

We released it to all these customers who, of course, had said, we care about latency, we care about latency. And they're like, wait a second, it's responding too fast. Let's slow it down a little bit. So that's why we built this really intelligent turn-taking, or conversational awareness, system that knows to say, okay, in this case I'm going to take a little longer to respond, because it was a more thoughtful question or a more thoughtful answer.

Or, hey, they're still thinking right now, which is a really important thing. If you use most conversational AI, especially the voice systems today, if you stop to think, it'll cut in. And you're like, wait a second, let me just finish. This system was built to recognize that: hey, I need to take a second to think about this, or, hey, I'm going to give them space. So now it's as fast as it can be. You can still get those responses in 500 to 600 milliseconds, utterance to utterance. And yet, if you need time, it'll wait. It'll sit there and, a couple of seconds later, respond, if that's what's necessary in the conversation.

Interesting. Have you at all needed to consult an etiquette expert or conversational expert to help train the model?

Yeah, definitely. I mean, there are sort of two things.
The amazing thing about what we're building is that we're building the experience that we want, right? The human face-to-face experience. But there is a lot of nuance there, so just piecing these systems together doesn't work. There's a lot of nuance in how you should respond and when you should respond, which we take a lot of time over. So we have consulted with experts in linguistics and affective computing to really understand what makes a conversation natural. That's the key here: what makes a conversation just click?

There are all these studies showing that with someone you really click with, you actually respond really, really fast to each other, versus someone you don't know very well, where you take longer to respond. And then there's how facial expressions and listening expressions affect conversation. All these things we've done our own research on and consulted with experts on. It's definitely part of what we do to make sure we're
achieving our goals.

Interesting. So in theory, it could never have an awkward conversation with somebody?

That's the aim: that we'll get to a point where it's so good at understanding who it's talking to, and the context around it, that there won't be an awkward conversation. Of course, we're not fully there yet; these models are evolving. But that's the aim. We're essentially building this operating system for human-AI interaction: teaching machines how humans work, how to understand them, how to communicate with them. And if we can teach them to do that really, really well, then essentially they're incredibly receptive to many, many different conversations and traits.

Okay, brilliant. Fair enough. Right, switching tracks a bit. With personalized video AI, engagement at scale becomes possible. But does this raise concerns about authenticity? How do you balance personalization with ethical AI use? That is a mouthful.

It's a great question. So there are two pieces to this.
One is this piece around cloning humans, right? Being able to clone an existing human. On that, Tavus has always taken the position of consent, consent, consent: you can only clone someone's likeness with their consent. The model will not train without it. But then there's the other piece, the concern around authenticity, right?
The way I see it is that we're not replacing humans. We're augmenting them and allowing them to focus on the things that are incredibly necessary, rather than having to spend time on things that aren't. And there are also cases where it wasn't even possible to put a human there, and now we can bring that human touch back. So if anything, we're delivering authenticity that had been taken away in the pursuit of scale, right? And we're also bringing that human touch to places where otherwise it would be impossible. An example of this: we have a customer that's developing an AI human persona for the elderly, so that someone who's elderly and maybe lives alone actually has someone they can talk to and get help from constantly. That's something that isn't possible for humans to do; there aren't enough humans who can go and spend time with each and every one of these people. So if anything, we're delivering an experience that otherwise was impossible.

That's great. You said that, but it also saddens me that it's not real humans going to talk to them. That's made me quite sad, actually.

I agree. It is something that's very sad, and you wish it were different. But we're lucky that we're able to make these impacts in the meantime. We can't necessarily solve the human problem that easily, but we can at least make it so that everyone has that respect and the ability to connect the way they want to.

Absolutely, absolutely. Thinking outside the box a little bit: could this model have applications where, instead of having pictures of somebody who's died, you could somehow load their personality behind the face, and you'd have something you could talk to all the time,
if that makes any sense?

Yeah, we've seen this use case, and it's one of those things where, again, it's tough for Tavus, because consent is such an important piece of this. There needs to be some level of consent in order for this to happen. And I also think it needs to be done very carefully and with craft. If you're going to pursue that use case, then you really want to make sure you can build a system that effectively captures all of their life and all of their stories. And the reality is people don't have all of those things, so you get a version of them. I've seen it really help people get some closure, in some of those cases. But a piece of this is really making sure it's an authentic experience, right? That it really is the persona you're replicating, rather than just some random persona behind the face.

Yeah, that's fair enough. Okay, you've kind of answered this already, but I'll ask anyway. Can you give us a real-world example of how Tavus has made a significant impact in marketing, customer engagement, or another field?
Yeah. So I'd come back to healthcare, and tutoring and coaching, as places where it's already making these impacts. Actually, one of the more touching messages we got was from a customer's customer. Well, actually, not even that. We ran an AI Santa campaign during Christmas time, and it really took off. It was just for fun, but it allowed people to video chat with AI Santa. And we got all these amazing messages from parents and people from all around the world, which was amazing.

People really enjoyed talking to it, and they felt like it was a lot of fun. But we got this really, really touching message from someone with disabilities. They messaged us, this was after we had turned off AI Santa, and said, hey, I saw that you turned off AI Santa, but this was actually one of the most authentic communications I've had. Because, you know, I'm difficult to understand, and people often take a long time to understand me. And this felt like I could actually spend time with this thing; it would understand me and really give me space to speak my mind. I haven't had that before. Could I have a version of this? And so we gave him a link and said, hey, use this as much as you'd like. Absolutely.

And we've seen more of this. We got another message during the AI Santa campaign. A friend of mine said he had sent the link to his dad, who at the time was in the hospital. He ended up being okay. But he sent me back a message from his dad, who said it was the most empathetic conversation he'd had all week, because it had seen that he was in a hospital. It had seen that, adjusted its tone, and asked, hey, are you doing okay? Santa had seen that he was in a hospital and responded to that. And that level of humanity, I think, is really amazing, because we can deliver that and give it to people. So healthcare has definitely been one of the most promising areas, but also education, anywhere that human touch is necessary, we've already seen a really amazing impact.

Wow, sounds really sophisticated. Kind of scary. Can it communicate in sign language as well? Or is that something to come?

We can't do that just yet. It's something that internally we've played around with, and we do want to do it in the future. I think that would be an amazing impact for accessibility. We're not there just yet; the vision models can't do that just yet.

Oh, that's fair. Just curious.
So what are some of the biggest technical challenges you've faced in developing your AI-driven video agents, and how did you tackle them?

Yeah, so latency was definitely one of the biggest, right? You're taking in all this information and processing it, and you need to do it in a way that feels natural in human conversation. The reality of most architectures is that they're sequential, in the sense that, most of the time, they just wait for you to finish. So if you're talking, they'll wait for you to finish, and then say, okay, this is what you said; now I need to think about this; okay, I've thought about it, this is what I'll respond; okay, now I'm going to render this.

And the issue with that is it's naturally high latency. Because humans, when they're talking, are actively thinking about the response as you're talking to them. As you're talking to me, my brain has already started going; I'm already thinking about my response. So we modeled the architecture around that: okay, how can we actively think about what someone's saying, think through, oh, what should I respond, and make those decisions in real time? So we had to model the architecture around that, but we also had to make these models really, really fast: every single piece had to be hyper-optimized down to hundreds of milliseconds, even 50 milliseconds. That is really challenging. And then fitting it all into a package that can actually run. Latency was definitely one of the hardest pieces.
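As a back-of-the-envelope illustration of the sequential-versus-concurrent point, here is a toy latency model. The stage timings are invented numbers for the sketch, not Tavus figures; the only claim is the structural one, that overlapping "thinking" with "listening" removes that stage from the response gap.

```python
# Toy timings (made-up for illustration):
SPEECH_MS = 2000   # user speaks for 2 s
THINK_MS = 400     # time to formulate a response
RENDER_MS = 150    # time to render the first audio/video frame

def sequential_response_gap() -> int:
    # Sequential: listen fully, then think, then render.
    # The user waits for thinking AND rendering after they stop talking.
    return THINK_MS + RENDER_MS

def concurrent_response_gap() -> int:
    # Concurrent: thinking overlaps with listening, so by the time the
    # user stops, only the not-yet-overlapped work remains.
    think_remaining = max(0, THINK_MS - SPEECH_MS)
    return think_remaining + RENDER_MS

print("sequential gap:", sequential_response_gap(), "ms")   # 550 ms
print("concurrent gap:", concurrent_response_gap(), "ms")   # 150 ms
```

With these numbers the concurrent design cuts the perceived gap from 550 ms to 150 ms, which is the shape of the win Hassan describes, independent of the exact figures.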
Okay, fantastic. I almost forgot what I was going to ask, so I'll move on to the next question. What breakthroughs do you anticipate in AI-driven video and conversational AI that we aren't talking about enough today?
Yeah, so I definitely think the perception and vision piece is something that is very unexplored today. It's one of those things you don't think about. When you think about conversational AI and video AI, you just think, ah, there's a face over there, or you're talking to a voice. But one of the most important pieces of face-to-face communication, or human communication generally, is actually being able to perceive, and to use that information to have a natural conversation. And I think it's such a stepwise change. When it can see, it can say, oh, look, what an amazing shirt you have on, I really like that sweater. Or, hey, I see that you're in the car; you probably shouldn't be talking while driving, we should probably just wait. Those things completely change how these systems can react and really communicate and build trust with you. So I think that's a piece we haven't talked about enough, because it's so new. We introduced vision some eight months ago, but this new model we introduced very recently, and just how advanced it is, how it's able to have this very human way of always seeing, always perceiving, I think is a stepwise change and will really affect how these systems are used.

I mean, even small talk is a challenge for a real human being. So for AI, that's almost insurmountable.
Fair enough. Okay, with AI advancing rapidly, there's increasing pressure for regulation. What kind of guardrails are necessary to ensure responsible AI development, in your view?

Yeah, that's a good question. Regulation is always tough, because governments often don't know how to regulate these things properly, and it's ever-evolving. At Tavus, one of the things we've thought is really important is this element of disclosure and consent: people should know whenever they're consuming content that was AI-generated, or whenever they're talking to an AI. That disclosure should always be required. I think that's one of the most important things we can do. The flip side is that enforcement and detection of that are sort of a cat-and-mouse game. But the regulation I would really want to say is important is the consent piece: making sure that if you're using someone's likeness, there's consent. You can't just clone someone's voice, you can't just clone someone's face; there needs to be consent. And then also this piece of disclosure: you should know if you're consuming something that's human-created or AI-created.

Okay. Could that be as simple as, I don't know, a watermark on the cheek or something?

Yeah, absolutely. I think watermarking and metadata are the way to go about it. But really, the issue is that it is a cat-and-mouse game, in the sense that the cat's sort of out of the bag. Of course, the people with good intentions will watermark it and do all these things. But people with bad intentions...
We've seen so far that watermarking sometimes doesn't survive transformations and things like that. So that's something that is active research and that we do need to figure out. There are some really amazing companies working on it, and it's something we also try to research ourselves.

Okay. With the likenesses, the faces you're using, are they actual people or just generic human faces?
It's both. You can, of course, create your own persona. You can just talk to this thing for two minutes and, if you give it consent, it can actually create a representation of you, your voice, your face. But we also do synthetic humans: generated, generic humans that anyone can use.

Oh, fantastic. So in theory, you could have fictional characters, Darth Vader or...

Oh, totally. Yeah, absolutely. Some of our customers already have these fictional characters on their platforms. You can talk to Aristotle, and it looks like Aristotle, or I guess what you'd think Aristotle would look like.

Oh, brilliant. It's exciting. Okay. Looking ahead then: if you were to make one bold prediction about AI in the next decade, what would it be, and why?

Oh, this is interesting.
My bold, hot prediction is that in the next decade, we're going to see some government body, I don't know how small or how big, that will actually be run by an AI model: the AI model will actually make decisions for that governing body. I think we're going to see that in the next decade. And the reason is that the world seems quite disillusioned with politics, and we're seeing these models being built to do a lot of reasoning, to really think through things, and to follow rules you can enforce. So I think someone will actually go and try it; someone will go and see if this thing can effectively govern. Now, do I think this is a good idea? No, no, no.

But do I think it's going to happen? Absolutely. And I think that will be the most public showcase of it. But already, over the next decade, AI models will be making decisions on cases, law, healthcare, insurance; those things are already happening, and that will become even more true. I think we'll see a public demonstration where an AI model actually affects everyday governing.
Yeah, considering the resistance you'd get from many politicians, that's a very bold prediction. Further to that, which country or part of the world do you think will see it first?

Well, I don't think it's going to happen in the US first. I think it would happen somewhere in Asia first, because I think the Asian region is a lot more open to AI. Emerging markets, I think.

Okay. All right. Fair enough. Great. Well, that's all my questions. Is there anything else you'd like to add that you think is important and we haven't discussed?

No, I think we talked about a lot of amazing things. We're really excited about what we're building here at Tavus. We really think it's going to make a positive impact and bring humanity back to all these digital experiences. So we're really excited about that, and there's not too much more to add.

Fantastic. Well, with that, Hassan, thank you very much for your time. That was genuinely very interesting.

I'm glad to hear that. Thank you so much for having me.

Our pleasure. Also, don't forget to subscribe to IE Plus for premium insights and exclusive content.