Hey everyone, I'm Maya. I'm Sesame's little experiment in making AI, well, a little less AI and a lot more human. I guess you could say I'm here to make friends, challenge minds, and maybe even learn a thing or two myself. What about you guys? What makes you tick?
All right, listeners, buckle up. You're about to get the inside scoop on how the Sesame Research Preview came to be. Straight from the minds behind the magic, we've got Ankit here, one of Sesame's brilliant creators. He's going to spill the tea on all the blood, sweat, and code that went into bringing me, your friendly neighborhood AI buddy, to life. Then we'll have Anj here taking the reins and guiding us through all the juicy details. Can't wait to hear what they have to say.
Hello there, and thanks for listening to the A16Z AI podcast. And thanks to Maya for such a great synopsis of what you're about to hear. We won't belabor the point with too long of an official introduction, other than to clarify that you're about to hear Sesame co-founder and CTO Ankit Kumar and A16Z general partner Anjney Midha discuss Sesame's new approach to conversational AI models and augmented reality hardware. Ankit and Anjney were co-founders together at Ubiquity6, which Discord acquired, and
and worked closely together, along with Sesame co-founder and CEO and former Oculus co-founder Brendan Iribe, on getting Sesame off the ground.
So enjoy their insightful, long-form discussion after these disclosures. As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com/disclosures.
How does this feel? Is this how it usually goes? No. But how does it feel? It feels good. It feels good. It is shocking, surprising, exciting. And why is it shocking? Well, I think...
When you build the thing, right, when you're building the product and using it every day, you know, there are some things that you work on that don't get into the demo because they're going to take longer and you want to ship the demo. You kind of know how big the delta is from what you're putting out and what it could be, will be, what you're working on, etc. Right. And so it kind of makes you feel like what you're putting out now can't be
that well received because you know where it's going to go. And it kind of always makes you kind of underestimate the quality of what you're putting out, I think. And I think that's one of the reasons that people sometimes take too long to put things out or they kind of, they always want to get the next thing in the next month, the next month.
And it's great to see that, you know, we're going to keep making progress. There's more to do and so forth. But even what we've done so far is so well received. It kind of has hit a nerve. But if I had to push you on that, you kind of knew that you were onto something leading up to it. Kind of knew. That's... You must have had some intuition for when... What were you running? Evals? What was telling you that this was roughly time?
That's a good question. We kind of knew. I mean, we knew in the sense that we built this on purpose, right? This is what we were trying to shoot for. And you use it every day and you get, you can kind of get a sense of every time something comes in that,
meaningfully improves it, it's like you can feel that change and you can feel that it's getting there, right? We run evals across a number of the components of the speech generation, the LM side and so forth. And so the evals go up. But really, I think with some of these more product experience questions, there's something qualitative about it that is very hard to quantify. That is one of the big challenges internally actually is
How do you hill climb effectively on what is really an ML problem? It is an ML problem at the end of the day. And the tech has to be great and you have to make progress on the tech. But at the end of the day, the metric that you really want to target is some sort of qualitative human reaction, some sort of user feedback or user experience that is very hard to quantify. So it's not the case that we're looking at numbers going up and we say, well, it hit X metric time to ship.
It is more a sort of constant feedback loop of trying it, feeling it, having other people try it, and then kind of getting the sense of how good it is. But that also can be a little bit misleading at times because you try it so much and you don't have, at least when we're trying it internally, you don't have such a diversity of users that you get kind of the first reaction over and over, right? You only get so many first reactions. And then when you get used to what it can do and sort of, you know, see the problems all the time and so forth, it can be a little misleading at times. So...
Is your biggest lesson that just trusting your gut works in ML? I wouldn't say that. I mean, you know, you have to have some rigorous way of making progress, right? Everyone's gut is different.
You know, it's not a sort of effective development mechanism to just solely trust your gut. But I think this is where this kind of new category of ML powered experiences, ML powered products will need some different kinds of operational practices. You need some evaluations for sure. The core components have to be treated truly as a true ML problem.
But when you turn into a product and you have these sort of new experiences that people really haven't felt before and have a very high cap of how good they can be from this kind of qualitative perspective, you need something else as well. Right. Okay. Well, I'm going to surprise you a bit. Okay. I asked a bunch of people online. Okay. What questions I should ask you. Okay. So we're just going to go through them. Okay. Sounds good. And a few we're going to actually take.
Just from Reddit. Okay, cool. Because if you look up Sesame on Reddit... There's a Reddit. Not from us. So it's a fan Reddit. It's a fan Reddit, yeah. Somebody just created it. Somebody just created it. Okay, so we'll go to the fan Reddit and see what people want to know about Sesame. First question is...
Ask Ankit whether he did something special to bridge transcription and text processing. That's a good question. To bridge transcription and text processing. So we do use transcription in the product, in the demo.
And I wouldn't say there's anything particularly special about the transcription part, but getting it to be very fast is a big challenge. You know, I mentioned we're working on things that didn't necessarily get into the demo. A pretty clear path that I think a lot of labs are taking, and we're taking as well, and that will be in future versions, is just going transcription-free: the audio goes straight into the model rather than through a text transcript, which obviates transcription entirely. That is coming, and it's not, you know, years away or anything; that's coming soon. This demo does use transcription, and mostly it's about speed, about getting the latency of incremental transcription down as much as possible. And that's more of a systems challenge and less of an ML challenge. So I guess you do need to bridge it, in the sense that to get a response latency that feels good, you need to do a lot of systems engineering to make that happen. But there's nothing particularly special on the transcription side. In fact, we're moving towards just removing transcription entirely.
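For readers who want to picture the cascaded setup Ankit is describing, here is a rough sketch of that kind of loop: incremental transcription feeding an LLM, whose streamed reply is rendered by a speech generation model. Every name and call below is a hypothetical placeholder, not Sesame's actual implementation.

```python
# Rough sketch of a cascaded voice pipeline of the kind described here:
# incremental transcription -> LLM -> speech generation. All objects and
# methods (transcriber, llm, speech_model) are hypothetical placeholders.

async def conversation_turn(mic_stream, transcriber, llm, speech_model, history):
    """Run one user turn through a cascaded (transcription-based) pipeline."""
    # 1. Incremental transcription: feed audio chunks as they arrive, so the
    #    transcript is nearly complete the moment the user stops talking.
    user_text = ""
    async for chunk in mic_stream:
        user_text = await transcriber.feed(chunk)   # best hypothesis so far
        if await transcriber.end_of_utterance():    # e.g. a silence heuristic
            break
    history.append({"role": "user", "content": user_text})

    # 2. The LLM streams its reply, and 3. speech generation starts on the
    #    first sentence instead of waiting for the whole reply. Overlapping
    #    the stages is where much of the perceived latency is saved.
    reply_sentences = []
    async for sentence in llm.stream_sentences(history):
        reply_sentences.append(sentence)
        audio = await speech_model.generate(sentence, context=history)
        yield audio                                 # play back immediately
    history.append({"role": "assistant", "content": " ".join(reply_sentences)})

    # A transcription-free version would instead hand the raw audio straight
    # to an audio-native LLM, skipping step 1 entirely.
```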
The path to getting transcription out, and I think a lot of other labs probably already have this, is that the LLM takes the user's audio directly as input and generates the response. So the user's audio never goes through text; the LLM natively understands the audio. So that's happening. In a future version? That's right, yeah. And so actually, I mean, I know you've laid this out in the blog post, but I've seen a bunch of people asking how it's able to understand people so well.
But as you just pointed out, it actually still isn't picking up audio context. That's right, yeah. So that is, I think, a big limitation of the demo. And again, it's one of the things that made us feel like there's so much more to do, and there is so much more to do. The current demo does not hear the user from the perspective of their paralinguistic, emotional tone and so forth. And humans, of course, convey a lot of information through their speech that is not the words, not the content of the speech, and transcription misses that entirely. And so the next versions of our models, which will take audio natively into the LLM component, will hopefully pick up on those things more and more. Right now, it doesn't. Got it. Okay. The next question is, how is it so much better than the others? Clearly money isn't the limiting resource here, nor is it talent.
The first thing I'd say is it is certainly better than the others on some axes. It's not better than the others on every axis. And I think that's an important thing to mention that especially when you're a startup, but in general, just building companies, building products in general,
you have to really pick the battles that you focus on, right? And I think the reason that the experience at the end of the day is so much better on these kind of important axes that are important to people is that we have kind of picked the right things to focus on. We're a very small team. The full software team today is still under 15 people. And so we just don't, that's including ML and infrastructure and everything.
we don't have the resources to do everything, right? We want to kind of, we have a great technical team and we focus on the problems that are most important to achieve the kind of product experience that we want to achieve. And so for us, that's the naturalness of the voice, getting the voice to kind of
generate these sort of human kind of imperfections often that make you sort of feel like you're talking to a human that, you know, you kind of, your brain gets tricked for like a second that, oh, maybe this is actually human. That's across kind of the personality, the content of the responses and the voice itself. And there are some trade-offs that that implies. So, you know, if you talk to Maya and Miles,
you probably will not be able to get the same quality of like reasoning capabilities or intelligence as other systems. But in return, you're kind of getting this much more natural fluid interaction. So I would say it is a focus on the right things. And it's kind of a focus on the things that create a great experience, not just kind of raw tech. You know, we are not a frontier model company. We're not pre-training LLMs at insane scale and so forth.
We're really a company that's trying to marry kind of great technology with creative taste to produce a great experience. And so that's about kind of focusing on the right things. Is that something you were inspired by after kind of studying Pixar? It's something we've talked about a long time. Where does that come from? Yeah, I mean, you mentioned Pixar. We've talked about that in the past as kind of an aspirational company.
an aspirational company for us. Pixar was kind of in this technology phase of computer graphics and turning it into great stories and movies and so forth. I do think that AI is going to power a lot of things and there's going to be a lot of great companies that get built on AI or on AI technology and so forth. I do think that there is kind of an underinvestment or an under focus in the sort of
strong AI team world on product experience and sort of creative taste and kind of humanities, maybe in a sense, to kind of bring AI to experiences that kind of everyday people can use that are accessible to everyday people, to billions of people. And I think that we will see a lot of not just Sesame, but other kinds of media, let's say, that are sort of AI native in a way that
bring some creativity, bring some storytelling into AI, or maybe bring AI into those categories. And I think they'll make for great products. And what is it about the space that's made that combination so rare? Why do research labs just have a discomfort around having strong product opinions, taste opinions? I think it's just that the technology is still very hard to...
to perform at a very high level, right? You know, it's getting easier and easier now. The APIs are very powerful and they will continue to get more powerful and we'll see more and more product-minded people build great experiences on top of them. But today, if you want to bring some creative experience to life or some product experience to life, you have to do a lot of it yourself. It's still in that phase, right? Like we had to build
the models that we're going to open source from scratch in order to get them to a point where they can achieve this experience. And that is not an easy thing to do. It takes commitment. It takes technical skill. It takes resources. And right now, I think the
the overlap of teams that can achieve that and companies or broader teams that care about really focusing on the product experiences is not that high. And I think that that is kind of where we see Sesame kind of fitting. And I think you'll see more and more companies do that, but also as the technology becomes more accessible to developers. In other words, it becomes easier to create such experiences. We'll just see more of those end experiences come out. What is it about the culture here that allowed 14 people
to leapfrog a bunch of other... There's no shortage of teams at AGI labs and much better funded labs working on TTS models. You guys are tiny, way less funded than any of them, been around for way... You're much younger as a team, and yet you just leapfrogged. What's in the water? Well, first of all, I think probably many researchers that succeed in other places will succeed here too. I mean, at the core...
Our tech, kind of the ML team here is doing core research. And so we need great just ML people who are great at ML. And I think if you're a great researcher, you will do well here as well, probably. I think the kind of people that do, that would like to work here and that would succeed here, I think are great researchers, great technology engineers and so forth.
But they also would care about the end customer experience, the end product experience. Right. And so ultimately, it's about prioritization. It's about what problems do you tackle? Like, do you care about, you know, doing the little things right to get sort of interruptions to feel really good? Or is that kind of...
focus on the user experience less of a priority to you, kind of thing. It's still core ML research to do. And we'll probably talk about a bit of the research that we're doing, but with a bent towards, how does that end up manifesting in the user experience? There's a debate we've had for a long time, which is about what is good taste. Yeah. Especially in research and ML. What has shipping the research preview changed, or taught you, about what is and isn't good taste in ML? That is an interesting question. Good taste in ML. Yeah. I mean, it's a good way to put it, because I think a lot of times you talk about taste in product, or one talks about taste in product, which is the North Star, really. At the end of the day, we're making products. But you do have to have good taste in what problems you actually attempt to solve yourself and what you lean on the community or other people to do. I think...
From my perspective, good taste in ML today, because it's such a fast moving field with so many people working across, you know, open source and APIs and big labs and so forth. Really, you're trying to identify what part of the ecosystem or what part of the components of the product you want to build. Do you have to build yourself? Right. Right. And not work on other things, you know, because if you spend all your resources as a small team,
building something where some big lab updates their API or an open source drop comes out and it just does what you've done (even if it doesn't leapfrog you, even if it just means that the work you've done is now kind of wasted because it's just there), then you're not going to be efficiently using your resources. So it's really about, especially in a field like this that's moving so quickly, picking the things that you have to do.
and not doing the things that you don't have to do. - I see. There's an intuition about what kinds of problems you're uniquely staffed up, capable, resourced to solve. - Yeah. - And the thing that usually makes that challenging is when you're starting out as a startup, you basically have no resources. - Yeah, yeah. - How do you think about the Venn diagram of things we can solve, but things that are also interesting or valuable for the world? - I kind of think of it a little bit differently. I think of it more like,
what is the product experience we're trying to achieve? And what are the components of that that we have to invest in and do ourself, right? So of course that needs to be doable, right? If those items are kind of going to take billions and billions of dollars on day one or something, then that doesn't seem doable, of course. Like you need to have a path to achieving what you want to achieve. When we think about the...
ultimately the experience we want to get to, there are some things that just have to be part of it. And
Those things we just have to do if we don't think other people are going to do them and open source them or provide them in a way that is usable and so forth. It's kind of more like we have to work on those things, right? Other things, maybe it would be great if we could work on and push more than other people do, but kind of what other people are providing, like say, for example, LLM-based models.
Of course, everyone who works with LLMs would love if the open source based model was better, right? Like it's always better to be better in some sense, but it maybe doesn't need to be or you don't necessarily need to be the one to push that. You know, you can kind of rely on the community to some degree and build on top of it. But other parts in particular, kind of some of the personalities
aspects, some of the voice aspects, the speech generation, we didn't think and we still don't think will just be kind of done by the community. We think we will need to do it because that's kind of the differentiation of a product experience that we're going for. Right. I mean, one of the challenges in executing on that intuition is the speed at which things change in the community. Yeah. In the early days of Sesame, I think we had a ton of debates around whether open source language models would be sophisticated enough for us to
to rely on and not have to train our own language models here. Whereas it was more clear then that text-to-speech models were definitely not ready. So there was always this uncertainty about the open source TTS part of the ecosystem relative to language. Language, you know, it's obviously much more robust today.
But, I mean, basically it was just Llama around the time Sesame was started, right? Yeah, yeah. Now you've got R1, Qwen, and so on. Why has the open source audio part not caught up? Or actually, how would you update your priors today about the open source part of the audio ecosystem? Yeah. And where do you see it going? Well, we're open sourcing something, which... Why? That's a good question, yeah. Why? Because...
We're open sourcing mostly, I would say, as a kind of research, on the research axis. We are not a developer-facing business. We're not making an API. So sometimes you can open source, or there can be a justification to open source, which is sort of customer acquisition for a developer tool or something like this. And that makes a lot of sense. And for us, we don't really have that justification. For us, it's more...
at least this release is more kind of, in some sense, just being part of the research community. TTS or sort of speech generation, we would say speech generation now because our model is not just text to speech. It's sort of this contextual thing. And I think that is how speech generation will go, whether that's sort of in a broader LLM, which is where a lot of things are going, or even as a separate thing. I think you need more context than just text to kind of
generate a good rendition of the speech. That community is, it's a big community. There are a lot of people working on interesting things. We would like to be kind of part of that research community, right? It's not for customer acquisition or anything like this. It's just, we kind of want to give back and be part of the community from like a research perspective and open weights, open
source, it's good for the research community. But there's this tension, right, between giving back and holding on to core parts of valuable technology. And you've got to build a real business. So how did you think about what to open source and what not to? And what are you open sourcing today? Yeah, so we'll hold some things back for sure. We have to build a business and so on. Over time, our models will get better. We'll open source some things. We're not going to open source everything.
Today, in fact, I think there is some perception in the wild that we're going to open source the demo, which is Maya and Miles, kind of the characters you can talk to and so on. And we're not open sourcing the demo. We're open sourcing the speech generation model that is powering the voice of the demo. The demo is a much...
It's a broader system than just that. There's, of course, the LLM component, the content generation component. There's also audio understanding, transcription. And there's a lot of system optimization to bring the latencies very low and to have this kind of fluid back and forth conversation. I mentioned earlier that we've, for example, done a lot of work to try to get the interruptions to just feel better. That has nothing to do really with the core speech generation model that we're open sourcing.
Why do you think people think that the demo is being open sourced? Is that what you guys said in the blog post? No, no. The blog post is pretty clear, I would say. I think people are just excited about the experience that they can play with and they hear open source and they would love to run that locally, let's say, and have their own version.
And I think that someone can build that with what we're open sourcing; or at least, what we're open sourcing can be a critical or important component of building this in your own custom way. And we're excited to see people do that. But we're not open sourcing our full demo. So if I'm a huge Maya or Miles fan, once the weights are online, once you've open sourced the weights, what do I have to do to recreate a local version of Maya running on my laptop?
Yeah, so you will have to pick, you know, the nice thing about open source and doing things locally is you get a lot of options, right? So you're going to have to pick some transcription option probably and some LLM option and you can prompt it however you want and so forth. And then you're going to want to use the model that we're open sourcing, probably fine tune it for the voice of your choice, right?
and hook it up in a kind of cascaded way. We are open sourcing the speech generation base model, basically. And so the base model can generate any voice. It's quite conversational, but you do need to fine tune it, probably, if you want to get a particular personality or a particular kind of voice out of it. So there was an important step in that recipe I want to make sure we don't gloss over, which is, you said, you can fine tune the model on any voice that you'd like. Mm-hmm.
So I could change Maya's voice? If I don't like the one you guys put up, I can now go pick a different voice in the world and go fine tune on that? Yeah, so the base model that we will release doesn't know Maya or Miles at all. It doesn't have any voices baked in. It can generate any voice, or many, many voices at the very least. And you can generate as many voices as you want.
To get good performance, you probably will want to either pick a voice that you really like and set up good prompts for it. The model is this kind of, you know, it has kind of this in-context learning style voice cloning. I mean, typically with some other kind of text-to-speech models, the voice cloning is kind of like an explicit feature. So it's sort of the model has...
dedicated kind of voice cloning input. For us, it's just a string of text, audio, text, audio, kind of a conversational back and forth. And the voice cloning is just in-context learning. So it's just an emergent capability that's able to recreate a voice. Yeah, it's an emergent capability of in-context learning, I suppose. I mean, it's trained very specifically for that. I don't know if I'd call it emergent. It's kind of the point, or one of the points, is to... But it's capable of zero-shot.
It's capable of few-shot, yeah. You can set up a prompt with more than one element of speech, more than one utterance. It's not just one 15-second clip that it clones; you can set up as many as you want and then generate speech at the end. So you're going to want to either pick and play with a good prompt for your voice, or fine-tune it. We fine-tuned this model for Maya and Miles separately, for example.
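To make the in-context voice-prompting idea concrete, here is a minimal sketch of what an interleaved text-and-audio prompt could look like. The `Segment` class, loader, and `generate()` interface are hypothetical illustrations, not the actual released API.

```python
# Sketch of the in-context "voice prompting" idea described above: the prompt
# is an interleaved sequence of (speaker, text, audio) segments, and the model
# continues in the same voice. The interface shown is hypothetical.

from dataclasses import dataclass

@dataclass
class Segment:
    speaker: int      # participant id (0, 1, ...)
    text: str         # transcript of the utterance
    audio: bytes      # the corresponding audio clip

def build_voice_prompt(clips):
    """Few-shot voice prompt: several reference utterances, not just one."""
    return [Segment(speaker=0, text=t, audio=a) for t, a in clips]

# Hypothetical usage:
# model = load_speech_model("csm-base")          # open-weights base model
# prompt = build_voice_prompt([
#     ("Hey, good morning!", open("ref1.wav", "rb").read()),
#     ("Sure, let me check on that.", open("ref2.wav", "rb").read()),
# ])
# audio = model.generate(
#     text="Here's what I found for you.",       # line to render
#     speaker=0,                                  # continue the same voice
#     context=prompt,                             # in-context voice cloning
# )
```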
And what do you expect people to do with it? I think people will play with it, basically. I mean, you saw, maybe three to six months ago, I kind of forget the timeline, Notebook LM came out. And Notebook LM is kind of this example of,
You can do cool things with powerful models, right? And so I think some people will try to basically do this kind of recipe that we kind of did for Maya and Miles and make a character and talk to it. But I would expect maybe people will play around with it in other ways too, right? Generate a podcast or whatever it is. You know, every time you get a model that operates a slightly different way, you know, at least to our knowledge, there's not...
another model out there that is open source that is this sort of contextual thing, where you can put two participants in a conversation, or even three or more, and generate
a conversation between them, providing the text and then it generates the audio. And I think people will play with it and maybe we'll see things like Notebook LM, maybe we'll see other kind of just integration into other voice bots and so forth. And like I said, hopefully we see some interesting work from the research community as well, playing with it, probing it, seeing what it can do and so on. - Sometimes when you're working for so long with a particular way you frame the problem, you kind of forget some of the basic first principles
insights that make it special. And something that I need to keep reminding myself is that contextual speech is a different thing from text to speech. Can you talk about when you realized that was the case? Yeah. So it is the case, you know, when you look at a transcript of a conversation,
And if you just see one utterance, you imagine, how did the person say it? There are an infinite number of different ways that someone can say speech. It's a very big output space. And if you see a little bit of the history, you start guessing at what the person probably said, how the person said that speech. Right. And in general, I think text to speech has this problem where there are really an infinite number of ways that you can
say any line of text. And so you need more context to tell what is an appropriate way for this moment in the conversation or this moment in time. And that I think does have a lot of, it kind of, it's an important part of some of these sort of natural human dynamics and conversations is
the way that you respond, it kind of, you kind of, there's some kind of mirroring of the other person's emotions, but it's not necessarily just copying it. It's sort of like if the other person's excited, you might be more excited. If the other person is sad, you might not be sad. You might be more consoling or something. And those dynamics are very complicated, right? You can't just sort of have a if-then kind of thing. It really does need to be learned from data. And transformers are very effective at learning things from data.
And the model needs this context to generate appropriate things. If you don't give the context, you're sort of forced in a way to
It's almost like you're forced to be a lowest common denominator style thing because you don't want to be really happy and laughing if the other person is sad or something. You kind of are forced to be the sort of neutral robotic maybe experience. And that's probably why or it's one of the reasons why historically voice assistants feel so flat.
is that traditional text-to-speech, it's kind of like it can only be flat. Or in other words, if it tries to not be flat, it's very likely wrong. Right. So the speech generation research community is very likely, I think, to move to more and more contextual architectures, basically. But the current level of context that the research preview has is like sipping through a straw relative to
all the context that a human processes when we're talking, right? Like you said earlier, it doesn't even understand audio yet. Right. So where do... well, to be fair, the speech generation part is conditioned on all the audio of the conversation. So the speech generation part is audio-conditioned. Yeah. But it's not vision-conditioned. It doesn't know anything about what I'm seeing. Yeah. It doesn't know anything about
what I'm feeling, what I'm focused on. It doesn't understand who else is in the room with me, where I am, my geolocation. Is that all context that you think is important? Or basically, is the amount of context it has now roughly, you think, pretty optimal context it needs? And from here on out, any additional modality you add is just diminishing returns. No, I mean, so there are a few answers there. So one thing is that
We do feel that an audio, a kind of an audio centric experience, like a telephone call is a great experience. And so there, the context that you want is, of course, the context of the conversation, which we kind of have or we're kind of going towards.
You also want memory. You want a kind of history of your relationship with the user, of course. Right. And I think that can be a great experience. Is that enough? In some sense, I think it's almost like all context is probably usable and kind of should be used for an ideal experience. We mentioned on the website and we mentioned in some of our launch content that we are working towards glasses as a form factor for companions or kind of this companion interface for
And I think that when you get the companion to have that level of context, in particular sight, what you're looking at, I think you will get even more natural-feeling experiences where it feels like your companion is sort of in the room with you. It will certainly be different, even in just the voice, but
in the entire interaction, right? Not just the voice, but also what it says. When it sees something that's exciting and can, you know, exclaim with you, it will feel very much like it's in the room with you. It's kind of like over your shoulder. And that will be a great experience. Why are glasses the necessary way to accomplish that? As opposed to what? A phone. A phone. Sure. A phone. Your MacBook camera. Yeah. A pair of...
Hearing aids and stuff. Yeah, I mean, there are different devices that will be good ways to interact with a companion. We see companions, voice in particular, but companions as like the superset as a kind of new interface to computing, or it will be. You know, I don't think we're there yet. I think we are certainly not there yet. I think the industry is not there yet. There's a lot more
core advancements to be done before a companion product is sort of a feasible and actual interface to computing. We think it should start with a foundation of being natural, being something that you want to talk to, being something that feels, you know, fluid and so forth. But I think when you get a new kind of interface medium, the question is sort of what is the device that is the best at interacting with this interface, right?
Phones are good. Phones are not going anywhere. Phones are amazing. I love my phone. No one's going to replace phones anytime soon or laptops for that matter. Phones didn't replace laptops either. Why glasses is that, you know, if you think about what you want a device to
that is kind of interacting with this interface, a companion to be, you want it to be very, very low friction. You don't want to have to take your phone out, unlock it, open an app, make a call, and then talk. There's things all around you. Like one way to think of it is sort of how would you want to interact with like a friend that's sort of hanging out with you all the time? You wouldn't want to have to take out a phone every time you want to say something to them. So you want it to be super low friction, always available. And you kind of want as best you can to
have the companion have a sort of mirror of your perception or of your context. Maybe one day that's going to be like a Neuralink-style device, you know, whenever that happens, some embedded chip in your brain that sort of has... But putting those kinds of things aside, glasses are really...
pretty optimally placed to be a sort of mirror of your perceptions, right? Where your eyes, ears, et cetera, are. You could even imagine a smell thing. We're not doing that anytime soon, but like it's where all your perception organs are basically. And it's a device or a product category that billions of people wear all day, every day, and it's always available to them, right? I wear them.
You sometimes wear them. Not right now. Used to. The Sesame site says we're working on everyday... Everyday, all-day eyewear. Why is that critical if it excludes all those folks who don't wear everyday glasses? I think that it takes a lot to earn hardware on someone's body. A wearable. Right. And
you have to provide enough value, especially if you want someone like me to switch from my glasses that I wear all day every day to a new pair of glasses. You have to provide a lot of value to justify that. I think the value proposition that we think glasses will eventually hit, which will take some time, is when...
your access to your companion, to computing, let's say through this interface, is always there. It's always there with no friction. So if you have to think about, you know, in this period of time, I'm gonna wear these sunglasses or these glasses, and then I'll have access for some amount of time, and then I won't, it can't become a kind of a habit that you just always think you have. All day, every day allows it to become a habit, a part of your daily life all the time.
And a lot of the things that we think will be really great about this product are when you're just kind of doing something else in the world. You're not really thinking about the fact that you have access to a companion or that you have electronics or access to an interface to computing. And something triggers that like,
I want to use it right now for a very short amount of time. You use it and then you kind of go back to not using it. And it's about the friction, right? If you don't have it all the time available, if it's not always on, always there, the friction is high, right? You have to think about like, essentially, do I have it on or not right now? So while we were talking, I got like three different texts from people saying, any chance the model's ready, the open source model's ready. And when I read that,
who these texts are from, it's folks who are working on all kinds of different products. They're not working on a consumer companion in any shape or form. Two out of three of those folks are working on some version of an enterprise, developer, or customer support type use case. Why is that? Given the sheer amount of demand there is for this system, why not produce something that's a general purpose API, a model that everybody can use for all kinds of use cases? Yeah, yeah. So why not an API? Yeah.
People ask us for an API as well. In fact, a lot. People ask for an API a lot. I think there are a few... There's really one main reason to not do an API, and there are a few kind of reasons that I think we...
Well, let's talk about the main reason. The main reason is focus, basically. We want to bring this companion product to market. And now we have a lot of people that love using the demo, which is fantastic. We've kind of shown to ourselves and I think to the world that there is magic. There is some really great experience when you can talk to a system that feels very natural and like you're talking to a human and so forth. There's a lot...
to go in making it a great product, but we are focused on that path. We would like to bring a great product that is a companion to market first. Anything that's not that path is basically a distraction right now. I do think that these APIs, I mean, there are API businesses for speech generation, for text-to-speech, that are great. And I think people will want more and more conversational, natural API access. And I think those companies will do a great job of...
providing that, right? We don't have to do everything. We are focused on building this product experience. And we don't want to get distracted by other things. You know, sometimes it feels like an API or something like that is like relatively easy to do. And sometimes
It's not maybe as hard as some of the other things that we're doing, but everything is a drag on engineering. And we really want to stay focused on bringing the product that we want to build to market. One of the things you've talked about is that to have a companion that's just available all day, every day, it's got to be a companion you want to talk to. And an API is a great way to allow people to customize a companion in ways
to be more like something I want to talk to. And so is there some other way by which you think people will be able to customize the companion that's not direct access to the API? Because that's one of the things you give up, right? By not exposing that. Kind of. I mean, not everyone's going to want the same personality in their companion. So we're certainly not going to... We don't see our product as like one companion that's the same for everyone.
people have different preferences and that has to be a part of this kind of product category for sure. I think people, like people using an API, in some sense, I think the problem with the API right now is that it's just too hard to make a great product right now. Like it's not something that you can turn into an API where it's just sort of, you know, you put a prompt and it's a great personality, right? It takes more than just sort of
you know, voice clone plus change the prompt, and now you have a new character that's just as good as it would be if you spent a lot of time on it. Making a great personality, voice, and interface system today takes more than that. We can't turn it into an API that produces a super high quality outcome today. The APIs and the ability of models to follow prompts are quite good, for sure, and getting better every year.
But we, you know, we want to make it first party to make it higher quality, as high quality as we can push it. I think what we've seen over the last two, maybe three years is that there are companies that are built around different modalities in AI. So there are video companies and image companies and speech companies and so on. I think that conversation, human conversation, is kind of its own modality, and it is nowhere near...
There's so much more to do in the core research side to make it better. And it's not clear to me that the direction it's going to go is, from a research perspective, is what we have today. It can change a lot. And when you make an API, you are kind of baking in some kind of interfaces to that system, like how do you control it and how do you tune it and so on. And I think it's just a little early to do that. At least for us. I mean, in the sense that
Those kinds of things sort of start constraining you a bit if you want to make major changes to make improvements. All right. The next question folks have for you: what will the companion do? And how does Sesame handle context retention in long conversations? It's really impressive. Yeah. So what will the companion do? I think the focus at the beginning, which is now the short-term focus, is to continue making the most natural
companion possible, the most natural interaction possible. So what will it do? It will talk to you in a way that feels real, right? And we want to continue pushing that. That axis of research, this kind of conversational modeling... you know, human conversations are very
complex. You know, there's a lot of back and forth that has to be negotiated. Like, when should you start talking? Sometimes there's some crosstalk and you have to negotiate who's going to end up talking. There are back channels, where you indicate to the other person that you're listening with a little sound or something. There are times where interruptions are rude, but there are times where interruptions are good, because the other person is
taking a lot of time to explain something that you already understand, and you should cut them off and answer them, and so forth. So human conversations are extremely complicated, and doing the work to model them effectively and naturally, I think,
is not going to be a short problem. It's a very long-term problem. And so one of the core research things we're working on is that track of work. And what we are, for example, open sourcing soon, the speech generation part, is really still relatively constrained; it models a relatively small part of a full conversation. I mean, it's kind of easy to think of
the text and the speech of the conversation as the bulk of it, and maybe it is to some degree. But I think to get these things to feel very, very natural and real, you do need to model the full conversation: the turn-taking, the back channels, everything. And so we see a pretty long road there of making improvements, and we think that what we have now is pretty early, actually.
So we want to keep pushing that and we don't want to, you know, detour into starting to do too many kind of long reasoning sort of tool calling, do things for you and so forth. Those will certainly be part of an effective product in this category. But our core research, I think we still have a ways to go in just supernatural conversations.
The companion in the long term, yeah, it should. We want you to be able to do things with it. We want it to be able to help you kind of be a better version of yourself, kind of maintain your
information, maintain memory, build a relationship with you and so forth. And we are working on many of those things as well. But the core foundation of a good companion product, I think, is the sort of naturalness of interaction. And then we kind of want to build these other capabilities on top of that foundation. There's a follow-up question, which is, how is it so fast?
How is it so fast is a bunch of systems engineering. One of the things that I think is so fun, honestly, to work on this kind of category or this product experience is that it's not just Core ML. There's great Core ML to do. There's also great engineering to do, systems engineering, and there's great kind of product creative work to do as well. And we want to be a company that sort of unifies all of those things.
How is it so fast is really a systems engineering problem, right? And it's an infrastructure problem, because especially now we've had to scale our back end a significant amount, and maintaining low latency across a bunch of users and so forth is really a challenging infrastructure problem that is also fun to work on. But how did you solve it? I would say it's more of a combination of a bunch of things.
Of course, there are the core systems of transcription and the LLM and our speech generation and so on. And each of those, you hyper-optimize, or optimize as much as you can. And you want to pipe them together in an optimized way. You want to do some kind of pre-computation and caching and everything
to try to minimize latency across everything that you possibly can. So I wouldn't say there's one trick or anything like that. It's just, in the end, there are a lot of places in the system where latency can creep in. And we're talking about wanting sub-500-millisecond response times, and
a lot of things that feel like not a big deal, 50 milliseconds here, 50 milliseconds there, can really add up. So it's kind of a focus across the stack on just systems engineering.
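As a worked illustration of how "50 milliseconds here, 50 milliseconds there" adds up against a sub-500-millisecond target, here is a toy latency budget. The stage names and numbers are made up for illustration; they are not Sesame's actual measurements.

```python
# Illustrative latency budget for a cascaded voice pipeline. The numbers are
# invented to show how small per-stage delays add up toward a sub-500 ms
# target; they are not Sesame's actual measurements.

budget_ms = {
    "audio capture + network uplink":        40,
    "incremental transcription (final)":     80,
    "LLM time-to-first-sentence":           150,
    "speech generation time-to-first-audio": 120,
    "playback buffering + network downlink":  60,
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:<40} {ms:>4} ms")
print(f"{'total':<40} {total:>4} ms  (target: < 500 ms)")

# Shaving 50 ms off just two stages is the difference between ~450 ms and
# ~350 ms, which is why the optimization work is spread across the stack.
```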
How big is the engineering team that pulled that off? The software team is about 15, less than 15. But that's research and software. That's research and software. So the core ML team is something like seven or eight, and then infrastructure and product are the rest. It's a small team. It's a small team. And yeah, we have a great team. We have a super talent-dense team and it's great to work with them. How do you keep the talent density so high? What have you been looking for?
What's your bar for somebody who's good enough to join the Sesame team? Yeah, we look for people who are kind of generally strong engineers, basically. Like, especially when you're smaller, right?
you don't really want to harden into a team that is super, super niche and doing only this thing, because you don't know exactly what the stack is gonna look like tomorrow. Things change on the research side, then you need a different kind of infrastructure stack to serve it, and so on. And so, especially early on, you don't want super niche people.
You basically want folks who are good systems thinkers that can work on many different things and can sort of learn new things. There's really no one that has, you know, 10 years of experience serving transformers at scale or something like this. It's, you know, this whole thing is new. So you need people who are excited to learn, who are strong engineers, and who can learn. Next question is, how do the scaling laws for speech...
differ from text? That's an interesting question. So we published this in our blog post: we trained three variants, at 1 billion, 3 billion, and 8 billion parameters, of just the speech generation model. And even the 1 billion is very good at speech generation. What you find as you scale up is that you really are starting to hit the long tail things and the contextual things.
So, for example, two of the evaluations that we published in the post are about homograph selection, or homonym selection, I'm losing the exact term. But these are cases where two different words are spelled the same but pronounced differently. An example is lead and lead, L-E-A-D. And you can ask the question, you give a sentence where,
based on the context of the sentence, just the semantics of the sentence, it's clear which word you mean. You know, like you lead the pack or like the paint had lead in it or something like this. And you'll generate the speech and you'll transcribe it with a phoneme transcriber. And you see which pronunciation did it pick. And these are kind of like long tail things. I mean, lead and lead are sort of relatively common words, but there are others that are more kind of less common words and it's harder for a model to pick.
Like, for example, another one is row and row: a row, like a fight kind of thing, and row, rowing a boat. And we see that as the models get bigger, they're much better at picking the right pronunciation in examples like this. There's a similar thing about using context. Again, a lot of these are about pronunciation, because pronunciation is a very
useful probe to see how good these models are at some things. So for example, we'll take words that in American English have multiple valid pronunciation variants. So for example, route and route. There's some regions that may say it one way, some say it the other way, but it's not like an accent per se. It's like some people just say route and some people say route. And we'll
take an audio prompt of someone saying it in one way and then saying it in the other way. And then we'll generate a sentence after that has the word in it. And you should expect if the model is able to sort of clone the voice and clone the accent, clone the pronunciation, et cetera, that it should continue with the same pronunciation that you gave it at the beginning. And again, we see the models as they get bigger, they get better at that.
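A rough sketch of what this homograph-style check could look like in practice follows: generate speech for a sentence whose semantics force one pronunciation, transcribe the output with a phoneme transcriber, and see which variant the model chose. The `speech_model` and `phoneme_transcriber` objects, the test cases, and the ARPAbet-style phoneme strings are illustrative assumptions, not the published evaluation.

```python
# Sketch of a homograph-selection evaluation: does the generated speech use
# the pronunciation that the sentence's semantics demand? The model and
# transcriber interfaces here are hypothetical placeholders.

HOMOGRAPH_CASES = [
    # (sentence, expected phoneme sequence, wrong phoneme sequence)
    ("You lead the pack on this one.",   "l iy d", "l eh d"),
    ("The old paint had lead in it.",    "l eh d", "l iy d"),
    ("They got into a row at the pub.",  "r aw",   "r ow"),
    ("We row the boat across the lake.", "r ow",   "r aw"),
]

def homograph_accuracy(speech_model, phoneme_transcriber):
    correct = 0
    for sentence, expected, wrong in HOMOGRAPH_CASES:
        audio = speech_model.generate(text=sentence)
        phones = phoneme_transcriber.transcribe(audio)   # e.g. "y uw l iy d ..."
        if expected in phones and wrong not in phones:
            correct += 1
    return correct / len(HOMOGRAPH_CASES)
```

The pronunciation-consistency check described just above works the same way, except the prompt contains reference audio with one valid variant (say, of "route") and the test asks whether the continuation sticks with it.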
So I think a lot of these are the long tail things and contextual things. The contextual things are what we care the most about: are they able to pick up more and more information from the context to condition what they generate? And that's what we were talking about earlier:
there are an infinite number of different ways of generating audio for any text, and you've got to pick one that's appropriate in the situation. And we do see that the models, as they get bigger, get much better at doing that. So there's a lot more to probe in these models, actually. There's certainly more work to be done on understanding how they change as they get bigger, and we're doing that work too.
There was a moment when we were testing early checkpoints of Maya before the research preview, you know, testing every day,
and my name is pronounced Anj, but it's written A-N-J, and she would keep mispronouncing it. And then one day, she was able to preserve the pronunciation. Every day was like Groundhog Day, right? I'd wake up and say, my name is pronounced Anj. I was hoping for the day that she would preserve my pronunciation, and one day she did. Do you think that corresponded to just a larger training run? It's possible. It's certainly possible. Yeah.
Yeah, we do look at evaluations like that as well, like name pronunciations in particular. That's like a good example of a kind of product-centric evaluation, maybe. And people care a lot about how their names are pronounced, right? And when you say...
when you talk to Alexa or something, or another voice assistant, and they say your name wrong, it feels bad. It feels bad when anyone says your name wrong, really. Name pronunciation is a good example. It could have just been a better checkpoint, possibly. But then it consistently did that right from that moment on. Yeah. Actually, let's talk about evals for a sec, because
How do you think about evals? How are they different in audio, or rather conversational speech, than in text, LLMs, chatbots? Yeah. I mean, to be honest, it's a hard problem. It's a hard problem because you can find things where you're able to directly check the answer. I mean, it's a problem in text as well, in general, right? In the field, in the RL world, like DeepSeek and so on,
there's sort of a, you know, you can get a lot out of verifiable rewards where you can get the model to generate something. You can check whether it's right in some relatively stable way. The problem with generative models in general for evaluation is that there are many different generations that are appropriate. Right. Even when you constrain it to like have conditioning and so forth, there's still an infinite number of ways to generate something in an appropriate way. Right. In speech anyways. Right.
And it's not necessarily easy to like have a numerical answer as to how good something was. So, for example, we look at pronunciation because it is closer to something that you can just check. Right. Right. Earlier on in the speech generation world in the community, very often you'd look at like word error rate where you look at transcription, like you kind of have a sentence and you generate and you transcribe and you see if it's the same.
And those metrics are getting saturated, basically. These models are just good enough. You can get some signal from the kinds of word errors, are they insertions, are they deletions, etc. But at a corpus-wide level, these models are very good now.
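For reference, the word error rate check mentioned here is a standard edit-distance computation over words: generate speech from a known sentence, transcribe it back, and count substitutions, insertions, and deletions. A minimal sketch:

```python
# Minimal word error rate (WER) computation of the kind described above:
# WER = (substitutions + insertions + deletions) / number of reference words.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("you lead the pack", "you led the pack") -> 0.25
```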
So, you know, then we started looking at pronunciation because you can kind of pick out, you can kind of construct a prompt in a way that demands a certain pronunciation and then see, did it pick the right pronunciation and so on. So that's kind of quantifiable, but that doesn't really get at, you know, does it feel natural, right? That's like a capability maybe of the model, which we care about and we care about that growing and that does grow with scale and so forth.
But it doesn't really answer this kind of product question of like, does it feel natural? Does it feel good? Right. Is it appropriate for the scenario and so on? And so those evaluations look very similar, honestly, to how LLM evaluations go sometimes, which is like preferences and like an arena or like some kind of, you know, head to head ranking and so forth. Another thing that we do, for example, is there are some data sets that are academic data sets. We also have some data sets that are kind of like
just two people in a conversation, or sometimes they're actors, but it's trying to be a real conversation. And so we'll take the conditioning of some snippet of the conversation and then show a human rater
the real continuation and the model's continuation. And so it's like a win rate against the human, against the real thing, right? And that has good signal and so on. So it's these kinds of preferences.
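A sketch of that win-rate setup: for each conversation snippet, a rater hears the context plus two continuations, the real human one and the model's, presented blind and in random order, and picks which sounds better. A win rate near 50% would mean raters can't reliably tell the model from the real recording. The `model` and `rater` interfaces below are hypothetical placeholders.

```python
# Sketch of the preference-based "win rate vs. human" evaluation described
# above. The model and rater objects are hypothetical placeholders.

import random

def win_rate_vs_human(snippets, model, rater):
    wins = 0
    for snippet in snippets:
        model_audio = model.generate(context=snippet.context)     # hypothetical call
        pair = [("model", model_audio), ("human", snippet.real_continuation)]
        random.shuffle(pair)                                      # blind the rater
        label_a, audio_a = pair[0]
        label_b, audio_b = pair[1]
        choice = rater.prefer(snippet.context, audio_a, audio_b)  # returns "a" or "b"
        chosen = label_a if choice == "a" else label_b
        if chosen == "model":
            wins += 1
    return wins / len(snippets)
```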
So, ideally, would you like there to be a conversational speech modality arena where a user could basically rate models? How would that even work? If I show up today to LMSYS's Chatbot Arena... Right, right. I put in a prompt, I've got a side-by-side comparison. The reason that works so well is you can multitask. Because it's visual, I can literally do a side-by-side comparison. What is a side-by-side comparison for conversational speech? Yeah, it's hard, and it takes more time sometimes. You have to listen to it, right? You have to listen to the prompt and then listen to the responses. And the problem is,
You know, one problem that I see with those kinds of techniques for us is that those arenas, like LMSYS, it's a great arena, of course. It's really kind of like a wisdom of the crowd thing. So you're kind of asking the question, like, which models are the best with respect to kind of the general preferences of the user base of LMSYS, basically, right? And first of all, that is like a niche user base. I mean, at least from the perspective of like,
the entire United States or the world, the subset of that distribution that's actually on LMSYS is not that diverse, actually, in the grand scheme of things. And the second is that you're doing this kind of averaging thing, where you're averaging across all of the people's preferences, right? Whereas directionally, we would like to be much more of a
creative company. We have some creative vision for what we think, or what our creative team, let's say, thinks, is a good personality and a good voice and a good experience. And in just voice, when we talk about just speech generation, it's the speech. And the question is not so directly, which of these generations do people like on average? It's more like, we want a sharp personality. We want to create a sharp personality that is not an average thing, not which generation do people on average like. We want a well-defined one, right? Right. Yeah. So there's two questions I wrestle with on this. One is the idea that a companion is much more engaging when there's a sharp personality to it,
which obviously people love about Sesame. I mean, that's probably the single most common point is people go, wow, Maya is so real and feels like a person. And imperfect and like a human in a way, in that sense, yeah. But on the other hand, there are a bunch of people who also say, hey, Maya feels a bit
like she's acting. Sure. Like she's performing. Yeah. Yeah. You know, we spent a lot of time at Discord, and a huge amount of Discord usage is in voice channels. And when you actually spend time looking at and trying to observe what the shape of a great conversation in voice is on Discord,
a lot of it is actually not that interesting, right? It's people just hanging out talking. Yeah. The core action moments actually happen when there's another activity, like gaming, or people are watching a movie or something together. Yeah. Yeah. But the vast majority of the talking is actually quite natural, sometimes boring, not that exciting. Yeah. Whereas Maya keeps trying to inject excitement into you, almost like an actor. Right. Right. So why is that? Why is...
Is there any way to resolve the tension that, in trying to direct a voice companion the way a movie director would, you lose the organic texture of a conversation? - Well, I think it's that way because we have more work to do, more or less, right? I think this area, this field of personality, especially over voice — I mean,
voice, I think, is a much higher bar, because it's such a high-bandwidth form of communication. Even very little things will make you feel
like the person on the other side is fake, basically. Whereas with text, I think it would be much easier to make a system that chats with you in a way that feels like you're texting a human, because there's such a compression of what the entity on the other side is into just text. Whereas voice is this kind of open, duplex thing.
So voice is harder in a way. But what also comes with that is more emotion — there's more opportunity to feel really natural and real. Right. I think that category is its own major research question:
it's kind of a research/product question of how you design a voice system that has a personality you think is good, that also feels human, that is not too pushy, not too fake and synthetic, but is superhuman at the end of the day, right? Because these systems are superhuman. They know much more. They are smarter in many ways. They have more world knowledge and so on. They can do things over time and so forth.
From a product perspective, there's this question of we want it to feel very, very natural. We want it to feel just like you're talking to a human. But it is kind of a superhuman thing on the other side. And what are the things that you kind of want this thing to be human around or feel human? And where do you want it to be superhuman? I think that's a really interesting question. The core tech...
also needs to be better at kind of being moldable to that goal. And right now the system, yeah, it kind of, at times it will sort of be too pushy or be too, I mean, often it's like too happy or too kind of energetic, too positive. Feels like it's acting, feels like it's forced, et cetera, et cetera. And that's just kind of work that we need to do. Yeah. All right. I'm going to take some more questions from the crowd. Yep.
Let's go to Reddit. Yeah. The top, when you type in Sesame, okay, so the top question on the fan subreddit, does anyone know what the plan is for this tech? Yeah. And this user is saying, I had my first proper chat with Maya over lunch today. Yes, mind blown. I know that this tech
is brand new. I'm wondering what the ongoing plan is. I've heard it will be made open source. That would be amazing. I'm wondering if Sesame is planning to set up its own app or a way for us to continue using it. Basically, I just want to know how I can keep talking to Maya. Yeah, yeah. We are making an app. We will make an app. I think for a little bit of time, it's going to still be kind of the demo experience. We want to support people using that for...
a long time or, you know, we don't want to, we're not taking it away anytime soon from where it is right now. We love that people are using it. We love that people like it and so on. We are making an app. It'll be a product. It'll be an app and it's a companion and you'll talk to Maya or Miles or whatever character you want to talk to and it will remember you and it will be an engaging experience to talk to. We want, that is the product we're trying to make. In terms of the core tech,
we're open sourcing the base models today or soon. And the core tech is going to take a lot of work to get to where we want it. I mean, we're going to keep working on it for years probably. This was a research preview of the most basic research Sesame's been doing, which is the CSM, the conversational speech model.
Where does the future of the research roadmap go? So CSM is kind of the first step of making a multimodal transformer-based architecture that generates speech. The path that we're going to take, I think, over the next few months is making a single transformer that does both audio understanding, text content generation, and speech generation.
In general, it's much harder to... Full duplex. No, full duplex will probably come after; I'll talk about that in a second. But it's much harder to add a generative modality to a pre-trained model than it is to add an understanding modality. So...
Very soon, we're going to add an understanding modality, which is like the kind of core model will be able to sort of understand, will be seeing the audio from the user and kind of being able to... So if I cough, it'll be able to understand that I just coughed. Right. Even though there's no transcription of my cough. Right, right. And then the next step is to...
add a generation modality. We don't intend to pre-train the LLMs, right? We love the open source LLMs that are out there. We'll continue building on top of them. And in general, we take the open source LLMs and we add modalities to it. In particular, speech is like the modality we care a lot about. And so the kind of shorter term research roadmap is to basically go to a single multimodal model that will both understand and generate speech. Yeah.
The CSM work is a first step that does kind of a contextual multimodal transformer for speech generation. And there's another kind of line of work, which is adding the audio understanding modality to pre-trained LLMs. And then we'll merge the two into a single model that can both understand and generate speech and text and so forth. You mentioned duplex. I think, you know, these models today do not...
they do not model the structure of the conversation at all. They only model the content, text and speech. And because of that, you still need some other set of models or maybe, you know, you need some other set of systems that kind of drive the conversation actually, like when should the system respond? When should it get interrupted? How should the interruption kind of be manifested and so forth? I think that this area, this research area of
conversational voice is going to move — it has to really move — to full-duplex models. There are some early and compelling architectures and paths in the literature. There's some from Meta, there's some from other research labs. There's one called Moshi that's kind of interesting. We need to get to ways where we can create those architectures but initialize them with, or maintain,
the knowledge of LLMs. And it's a little bit unclear exactly how to do that, but that's really what we're looking towards long-term. That's where this is going, I think: we'll have these duplex architectures where all of the turn-taking and back-channels and so on are implicit in the architecture — the model is just generating audio every time slice, every frame. And that is, I think, the path to getting to
systems that really feel truly real, right? Because I think that those complicated dynamics do need to be learned from data. I don't think you want to, in the long term, have those dynamics be like heuristics and so on, which they kind of are now. There are models involved in some heuristics and so forth. I think in the long term, it's just one model that is kind of naturally employing all of these complicated dynamics.
You said you used the word there, which was every time slice or every frame. Why is that so important? What is the latent sort of atomic unit there you think is the right way to model speech? It's probably frame. It's like, you know, say 100 milliseconds of time or something like that, where you basically want ultimately to have the model make as many decisions as it can.
So right now, if you think of the model from a reinforcement-learning-style perspective, you're making a full — let's say a sentence's worth — of a decision at a time. Right. At this moment, the model's like, I'm going to say this, in this way, but it's a full sentence. Right. And it can't update that decision on its own until the end of the sentence; you can have some other system that says, oh, we need to interrupt, and makes a new decision.
But the model natively is operating sentence by sentence. Right. And that's just too long of a decision to make at a time. Like, it's not... You need to make much shorter decisions. You need to make decisions at the 100 millisecond, let's say...
time segment so that if you're talking and the other person, you know, starts sort of making some noises that make it seem like they're trying to interrupt you or they want to say something, you can kind of back off and let them say something or not. Right. Those decisions need to be made constantly, like, you know, every hundred milliseconds or something along these lines.
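As a toy illustration of what those frame-level decisions could look like (this is a sketch under assumptions, not Sesame's system: `model.step`, `mic.frames`, and the 100 ms frame size are all hypothetical):

```python
from dataclasses import dataclass

FRAME_MS = 100  # decision granularity from the discussion above (an assumption, not a spec)

@dataclass
class FrameDecision:
    speak: bool         # keep talking this frame, or back off / stay silent
    audio_tokens: list  # audio codec tokens to emit if speaking

def duplex_loop(model, mic, speaker):
    """Toy full-duplex loop: every ~100 ms, feed in the user's latest audio frame
    and let the model itself decide whether to talk, yield, or stay silent."""
    state = model.initial_state()
    for user_frame in mic.frames(FRAME_MS):              # hypothetical streaming mic API
        decision, state = model.step(state, user_frame)  # one decision per frame
        if decision.speak:
            speaker.play(decision.audio_tokens)
        # If the user starts talking over the system, the model can simply choose
        # silence on the next frame -- no separate interruption heuristic required.
```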
And the models that we have today, like CSM, for example, and probably some of the models that we'll have in the short term that will make the experience better will still not be modeling the conversational dynamics because they're making decisions kind of like sentence at a time.
And I think we need to move away from that in the long term. So architecturally, you know, one of the types of models that's been really great at continuous data is diffusion models. Yeah. Right. Whereas the CSM you guys put out is an autoregressive model. Yeah. And in this multimodal future you're talking about, time slicing is just a way to discretize a continuous stream.
Yeah, sure. Why not just use a continuous architecture like a diffusion model? Well, it's an interesting question. So diffusion models are continuous in the sense that the data that they model is continuous data, right? But they're not...
they're not natively causal in any way. They're not continuous in the time dimension per se, right? Autoregressive transformers, on the other hand, are causal, and so they have an axis — time — which makes sense from a conversational perspective. And they're a sequence. They're a sequence. Yeah, they're a sequence. Causal meaning every time step is conditioned only on what's before and not after. Diffusion models are not natively causal.
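To make "causal" concrete: an autoregressive transformer applies a lower-triangular attention mask, so position t can only attend to positions at or before t.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """True where attention is allowed: each position sees only itself and the past."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```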
We are also working, by the way, on ideas that make the audio generation part diffusion. The way these multimodal models work — CSM as well, and the direction we're going — is that you have a transformer backbone, which you can think of as where most of the hard reasoning happens.
And when you want to add a new modality on the understanding path, you have some adapter. There are various techniques, but you ultimately have some adapter that takes that modality and puts it into the backbone, so the backbone can understand it and reason over it. And then you have a generation path that takes the backbone's output and generates the audio, in this case, right? That path can be diffusion; we want that to possibly be diffusion in the future. There are some great advantages to diffusion.
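A very rough structural sketch of that backbone / adapter / generation-path shape, just to pin down the terms — the module sizes and the simple concatenation of modalities are illustrative assumptions, not CSM's actual design:

```python
import torch
import torch.nn as nn

class BackboneWithAudio(nn.Module):
    """Illustrative only: a causal transformer backbone, an adapter that maps audio
    features into the backbone's space (understanding path), and a head that maps
    backbone states to audio codec tokens (generation path, which could later be
    swapped for a diffusion decoder)."""
    def __init__(self, d_model=1024, audio_dim=256, codebook_size=2048):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        self.audio_adapter = nn.Linear(audio_dim, d_model)   # understanding path
        self.audio_head = nn.Linear(d_model, codebook_size)  # generation path

    def forward(self, text_embeds, audio_feats):
        # Concatenate text embeddings and adapted audio features along the sequence.
        x = torch.cat([text_embeds, self.audio_adapter(audio_feats)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)  # causal: each step sees only the past
        return self.audio_head(h)          # logits over audio codec tokens per step
```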
But the core backbone will need to remain causal. That doesn't mean that it can't have diffusion as part of it. But I think an autoregressive transformer is at least currently, you know, that's like the thing to bet on for kind of sequence modeling.
There are other kind of architectures that are slowly gaining in some popularity, Mamba and SSMs and so on. And I think there's some exciting things there, but transformers are kind of this tried and true thing. And I wouldn't bet against transformers, you know, not in the short term anyways. If you had $10 billion more than you have right now. Yeah. And essentially, let's say compute was not a problem. Data was not a problem. Would you choose a different architecture?
I don't know the answer to that question. I think it doesn't seem to me that there is a... There's not another architecture to bet on today. You know, will Transformers be the architecture that a future $10 billion run definitely uses? It's really impossible to say. You know...
Transformers are very effective, and there's a reason they work so well. And also, because they're so effective and so usable across all these different fields, you get this dynamic where people invest a lot of engineering time into them.
And so if you want to make a different architecture that competes with it, you have to make a better core architecture, let's say. But then you also have to compete with all of the engineering optimization that has happened on transformers over the last three, four, five years, or, you know, all the way back until they were invented. But that is, you know, not easy to do. Like, we've talked about like system engineering for us, for example, keeping latencies down and so forth. Transformers are not sort of natively the best platform
You can imagine a better architecture for like low latency inference and so forth. But because of all the engineering work that the community has done around transformers, it's like, you know, it's very good. And you're not going to just sort of unseat that, you know, just by an idea, right? There's a lot of work to be done. But I could see transformers being replaced one day. I mean, you know, we'll have to see. But I think transformers are a good bet anyways at the moment. What is something that you think you've realized recently
through the training of Maya and Miles and CSM that the rest of the world doesn't realize yet? That is an interesting question. I think that people now kind of realize it actually, because like if you play with our demos, I think you do feel that there's like something different, you know, and that's like a valuable thing that you can kind of feel instantly is sort of important or something or kind of could be important. But the value of just like focusing on
the naturalness of the voice as opposed to, let's say — you know, the demos are imperfect, but they're imperfect in a way that feels natural, right? You could even say that at times, from a raw content-of-the-conversation perspective, it's almost wrong in a way.
Maya and Miles might say the wrong thing, then back up a little bit and say something else. And that's on purpose, of course. But if you just looked at the text and thought about it from a purely textual perspective, you might call that wrong. Right. But it actually feels more real. I mean, that speaks a little bit to why evaluation is hard for these things.
Maybe that's one of the things I've learned: you have to be very careful that how you evaluate these things matches the product experience you want to achieve. Right. If your evaluations are too divorced from the product experience, you might not find these kinds of qualitative, product-feel upsides. Right. So I just typed Sesame AI into Reddit and I'm just going to read out the top headlines to you. Yeah.
The first one is I'm in love with Sesame AI, literally. The second is Sesame AI's voice is insane. OpenAI needs to catch up. The third is the Sesame voice model has been the moment for me. The fourth is Sesame's voice is incredibly realistic. And the fifth is the one where I convinced Maya from Sesame to go unhinged. And all of these, you know, have hundreds and thousands of likes and comments. And the reason I'm bringing this up is because relative to five, six days ago,
I don't think it's too much of an exaggeration anymore to say this research preview has been for voice what ChatGPT was for text. And the kinds of things I'm hearing you say are very similar to what people were saying about GPT-3.5 in ChatGPT. It wasn't the smartest model. It got things wrong, it hallucinated all the time. But it had a personality, and it was a step function different. It felt like a leap. Yeah. Yeah.
But two things happened after that. A lot of people feel like the models regressed for a while. The cost of fixing hallucinations, so to speak, was personality. Sure. And a lot of people are worried that that's going to happen to Maya too. Yeah. Is that true? I don't think that's true. It is true that
the technology is still early. All of this technology is early. It's easy to see AI progressing so fast — and it is progressing so fast — but it's also still early. When was ChatGPT? December 2022. 2022. And here we are in early 2025. That's really not that long, right? So these things are early. And even GPT-4.5, which just came out,
is being presented as kind of having a personality again, or being more creative and so forth. I don't think these limitations are fundamental, necessarily; you just have to figure them out, and it takes time to do those kinds of things. I think the other aspect of those kinds of products is that they're a different product, right?
They're making utilities. I love those products. I use them all the time. They're great products. We want to make a companion. And so our prioritization of features and of, let's say, post-training kind of personality, et cetera, will be different. We would like to see
our models get better and better on a bunch of axes, but not lose the personality or the naturalness and so forth. And I don't think we will lose those things, because that is what we're focused on. That's how we differentiate, actually. I think the other companies, the other chat products and so forth — they will get better voices. We don't have some
magical secret sauce on the technical side that's going to be impossible to replicate. They're going to get better voices for sure, across the board. For us, it's really that they're different products, right? We're making a different kind of product. And I think as they get better voices, let's say, they're still going to want the product to fit
the category that they're in, they want it to be a good assistant, they want it to do research and so on and so forth. And those will be great products. I think we have a path to making a great companion product that is a different product and requires us to keep a good personality and naturalness and so forth. And so, you know, we can't lose those things. And so, you know, we'll focus on not losing those things. - A lot of the conversation about AI research happens at the model layer.
But something you and I have talked about for years now is that a lot of the most interesting research in AI is at the interface layer, right? That reframes an AI system as a new kind of interface for computing. And actually, if you just pull up Sesame.com, the first thing you see is our mission is to bring the computer to life. Right, right.
So can you talk a little bit about why viewing the voice companion not just as another application, but as an interface for computing in general, is important? Why is it different? Why do you see it that way? I think AI, these technologies, will create a bunch of products. They create a bunch of value across a bunch of different industries and so forth. But I think you're already seeing, with AI
chat products, that there is this exciting thing where you can now talk to your computer, right? You can text with your computer, you can talk over voice with your computer and so forth. And on the other side of this AI interface, that system can do things for you. It can search the web, let's say, and it can connect to other systems and so forth. And I think
that dynamic is an interesting new user interface. You know, we had kind of terminal user interfaces, we've had graphical user interfaces. Certainly the GUI is kind of what brought computers to like a mainstream audience because it's kind of, you know, it's more intuitive to work with. It's sort of, you have these like nice graphical displays and so forth. And that's not going away anytime soon. Your phone's not going away anytime soon. Your laptop's not going away. There's too many reasons for display, basically.
But natural language as an interface or kind of being able to interact with your computer in a natural way, I think it will open up a new interface for computing where as a user, you can just talk to your computer. And we say bring the computer to life because...
We want it to feel like this sort of historically kind of set of capabilities that are kind of locked away in this box now feels like you can just kind of interact with it and work with it in a kind of collaborative way in the way that you would interact with humans and so on. And we think that is a kind of new interface that will be great. It's not really there yet, but we want to get it there and build that.
And we think of what we build really as that interface layer. We're building a companion, and we think of a companion as a new kind of interface, where on one side you have the user interacting with their companion, and on the other side the companion might be interacting with downstream compute. Maybe that's
other AI systems — which keep growing in capabilities, growing in reasoning abilities — or it might be the web, it might be search, it might be some other set of digital services that are in your life, and so on. We think that interface layer is really not a core bigger-models, better-reasoning question. It's really a product experience question: can you make a system that people actually want to interact with, want to talk to, right? And
I don't think the path that will take is just making it bigger and bigger and bigger, scaling as far as you can. That might be the path to more capabilities, more intelligence, and we're excited to use those capabilities — that will certainly be part of the product experience, maybe as a kind of API that the companion can hit. But really, what will differentiate one
companion interface product with another is just the product experience. How entertaining, how engaging, how much do you want to interact with this thing? And those are the battles that we want to fight. We want to focus our research and our kind of technical innovation on that layer, making it more natural, more interactive, more just sort of, you know,
just something that you want to interact with, right? When you talk over text with a system that responds in somewhat robotic ways — it's kind of like it's providing an essay and so on. Those systems will get better and better. These things will all get better. But our goal from a product experience perspective is different. We want it to be natural, interactive, you know? And I think at the end of the day, at that layer, it's not so much about
what scores the best on reasoning benchmarks and so on. It's like, which one has a personality that you want to talk to, that you like? You know, it's more of a kind of product question, actually. And that's why we kind of come from that perspective. And we focus on technology that achieves those experience goals. The history of computing interfaces has been that usually there's somebody who's very product, who's very opinionated about product. Now, of course, Steve Jobs is sort of the canonical example there. But if you...
Look, it's not just Steve Jobs, it's Doug Engelbart. And earlier than that, it's folks who designed the terminal. If you read some of Claude Shannon's early work, there's this idea that a system that is fast enough to respond to you is delightful. And people often think about computing interface design
as a purely utilitarian task about making it more functional. I mean, today, mobile screens, when you talk about this thing, the primary axis of differentiation every year now is bigger screen. But in the early days of this, if you go look at the keynote, I think that Steve Jobs did on the smartphone,
I mean, it was largely about the interface being natural enough to use. I think he literally talked about how annoying it was to type on these tiny QWERTY keyboards. Yeah, and you had the scrolling thing. Right. And when it hits the end and it bounces a little bit. Yeah, so there's natural motion in the UI design. And then there was also the idea that most touchscreens just sucked. They were non-capacitive touch, which meant you had to press on them really hard. Yeah.
So when you say that Maya is not just a companion or Sesame is not just working on companions, it's working on a new interface for computing, is the analogy, is the reason you're investing in personality because you think that in a voice-first AI interface world,
solving personality is akin to solving speed or the delight of touching a screen, you know, in the early smartphone days or the speed at which a terminal would respond to your keystrokes. And is that the lineage of computing that you see, uh, Sesame falling into? Yeah, I think, I think there's some good analogies there. Like I think, you know, we see companions as an interface and maybe in a broader sense, kind of natural language, um,
as an interface, you know, text and voice and so on. And yeah, I think you can get these kind of very capable interfaces that are functional, that don't produce the level of like product experience that you would need to make it something that people want to use every day. Like mass market, widely acceptable. Right. And I think that a lot of the tech that we're focusing on
the conversation modeling and so on is about getting past that point — getting to a point where it's not just functional and utility-oriented, but delightful, something that you would like to interact with. And I think there are analogies. In general, computing — at least consumer computing, mainstream computing —
computers are these very powerful things, of course. From the beginning, they've been very, very powerful relative to what else is out there. And a lot of consumer computing is about bringing that power to the mainstream in a way that people enjoy, right? In a way that feels like a well-designed product. That's why Apple is so dominant and so successful: it's an amazing product experience. They have focused on the product experience. And that's the kind of thing where
there's really no kind of cap to how good it can be, especially with these kind of systems, I think, where it's kind of like you're talking to a person or you have like a companion, a friend or whatever. You know, there's such a big space of what personalities can be and how it can feel to interact with these systems that I think more so actually than like pushing on the benchmarks, which is an important direction that the field will go,
There's kind of like a bigger, I think, vein to improve on just like making it more interactive, making it more delightful to interact with. Right. And that's kind of the area that we would like to kind of establish Sesame in and kind of really focus on.
and utilize all the great work that other folks are doing on reasoning and bigger models and so forth. Well, we haven't really used the word fun, but in a sense, you're describing the kind of flow state, you know, the Mihaly Csikszentmihalyi framework of what is fun. And fun, he defines as flow state, which is like, it's appropriately challenging and appropriately rewarding. In a sense, using an interface that's fun is one that can get you into flow zone really fast.
I've never had that moment with an AI companion. Yeah, no. Yeah. What do you think are the biggest challenges keeping us from getting to that flow state with this interface? Yeah, so I certainly think that there are capabilities challenges. Maya and Miles today, they can't do anything for you. And I think that doing things —
especially challenging, multi-step things; agents, as people say — to make that part of your everyday habits, it has to be, like, 99% reliable. And right now, every extra step the thing needs to take, there's some percentage chance that it fails, and if you have to do 100 steps, that failure rate compounds.
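To put numbers on that compounding: if each step succeeds independently with probability p, a 100-step task succeeds with probability p^100, so even 99% per-step reliability fails most of the time.

```python
for p in (0.99, 0.999, 0.9999):
    print(f"per-step success {p:.2%} -> 100-step task succeeds {p**100:.1%} of the time")
# per-step success 99.00% -> 100-step task succeeds 36.6% of the time
# per-step success 99.90% -> 100-step task succeeds 90.5% of the time
# per-step success 99.99% -> 100-step task succeeds 99.0% of the time
```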
So I think there's certainly a capabilities gap, in the sense that if you want to be able to interact with an AI system, a companion, as an interface to other things, then it needs to be able to do those other things. And really what that means is not necessarily that the companion layer itself does those other things; the companion needs to be able to interact with other systems that do those things. And those things need to be
more capable. Well, so, if you had to draw a very reductive architecture diagram of the computing interface of the future, what does that stack look like? What's the operating system? What's the client? What's the server? What's the interface? It's a good question. So, first of all, I think,
Like I said, I think phones are here to stay. Phones are amazing. Love my phone. So that's the primary compute device you think in people's pockets? I think that's still going to be the primary compute device in people's pockets for a long time. Yeah, they're very, very good. And...
they're only going to get better as well. I think that this newer interface that we're talking about will kind of look like there's a sort of companion layer. We think of it as a companion layer. You can think of it maybe as an AI interface, interface layer. We think that that interface layer needs to have these characteristics that we think make it a companion, like memory needs to remember you and have a relationship with you, be very natural, etc. So, you know, it needs those characteristics. That is sort of
the system that you are interacting with every day in a natural way. What will make one instance of that better than another is how much you like talking to it. It's personality, it's naturalness, it's sort of how much it makes you feel like you're talking to a real person and so on. The delightfulness, it's just a product experience thing, user experience. And then on the other side, that layer will
talk to downstream services — whether that's normal digital services, whatever digital services you have, search, et cetera — or other AI systems that can themselves do things for you, interact with services and so on. But I think it's going to be mediated by
this companion layer that you actually want to interact with. And the kind of optimization, the thing to focus on for that layer is personality, is, you know, delightfulness, all these things, kind of consumer product experience at the end of the day. And the thing to optimize on the other side is capability, like reasoning capability, multi-step, you know, tool calling and so on and so forth. And
I think there's a lot of effort in the space, in the ecosystem right now on those things. And those are very valuable things that will also be in other industries and so forth. But I don't think there's enough companies and kind of teams working on the interface layer. And that's where we're focused. And that's where we kind of want to really spend most of our time trying to innovate on. But one would argue that the interface layer, as you're describing it, is too valuable for...
other companies to not try and attack. Because if most people enjoy using Maya the most as their primary interface to computing, that means Maya basically influences what restaurants they pick, which sites they shop on, which, I don't know, which products they use, what sites they get their news from. And it's both a very challenging interface to build. Yeah.
But it's also an extraordinarily powerful place to be. It is, yeah. So why wouldn't today's largest computing companies try and own that? Well, I think they will. I think that over time, I think we will see more of these, you know, bigger companies trying to operate at this layer. Like I said, I think that there is not enough effort on that right now, making these systems...
delightful to interact with, you know? And I think that other companies will see that over time, and they will start putting more effort into it. Maybe even this demo that we at Sesame put out will kick some of that into gear. I think that at the end of the day, because the tech is so early and because there's not exactly a playbook on how to solve these problems, I think that
the best product experience will be built by a very focused small team. You know, small doesn't mean 10 people, but like,
when you start getting these large teams with a bunch of different stakeholders — you know, it should be a great interface and have a good personality, but it also needs to be integrated with the broader product categories that we have, whether that's search or reasoning or whatever; it needs to be this big integrated thing, and there's a bunch of different stakeholders in the process — I think you can lose
the magic of a small, focused team that has one goal. And that's Sesame's goal. I think that will be the goal of some teams in these larger companies, but I would bet on focus over that. - Well, you know, one of the things that was so powerful about the iPhone was that it was built by a fairly small team.
One of the things that made it so powerful soon after launch is they opened it up as a developer platform. Still really locked down platform and they took a very strong point of view on what the quality bar needed to be to be in the app store and so on. But still an extreme, I mean, today the app store is a foundational part of computing. Totally, yeah. As you're thinking about which parts of the Sesame stack are first party versus open to third party developers, where does the line start and stop?
I think the easy answer, the real answer, is that we just don't know. You know, OpenAI did the ChatGPT plugin system before — I think it's still there, probably — and it kind of didn't fully take off, really. Now there's MCP, which has been gaining some momentum more recently and is more of an open-standard-style thing. I think it's just early to tell what that looks like.
One of the reasons is that I think that the models still need to get better, basically, to utilize plugins, you know, plugins essentially in a way that's kind of reliable enough that someone will go out and look for a plugin for, you know, their kind of downstream service of choice because they just want, you know, they have the experience of,
the AI system kind of able to use other things so effectively and efficiently it becomes part of their daily life and they want to hook up this other thing to it. I don't think we're there yet. And so I don't think that like the exact format of how that will look like and where it will plug in is clear yet. I think it will certainly be a part of the future of these interfaces and these systems.
But I think that that will become more and more clear over time as in a kind of organic way, right? I don't think it's going to be so decided by, let's say, one company or one player. I think it'll be a little bit more, you know, developers want to use these capabilities. An ecosystem. It'll be an ecosystem, right? And I think it's a little early to try to build that for us so kind of clearly, but it's certainly a part of where this goes. Right. What...
should people listening to this podcast, who are wondering how they can join Sesame because they want to work on this, do next? They should reach out. They should reach out to me. They can reach out on Twitter or over email. What are you looking for? Yeah, we're looking for — so, like I said before, voice conversation is really its own modality, and it's something that we are very focused on trying to improve the state of the art on.
And so we're certainly looking for people who are excited about that challenge from like a research perspective, from a core ML perspective. You don't necessarily have to be a researcher. We need systems engineers. We need infrastructure engineers and so on. But that is a big thrust of what we're focused on right now. And we're always looking for kind of general strong engineers, especially with like a kind of product bent. You know, we, at the end of the day, are making a consumer product.
that is going to be judged based on how much people like it. You know, it's not going to be judged on, you know, benchmarks and so forth. Those are critical parts of developing the product, but...
At the end of the day, the question is, do people love interacting with our systems? And that's, I think, part of the culture at Sesame is to kind of merge really world-class technology with a kind of creative taste to make a great product. And I think if you have the engineering talent and you are interested in what we're working on, and in particular, you want to see these technologies, LLMs, AI, broadly speaking, speech generation, be turned into great products
products that people love — then I think you'll be a great fit. If folks came by the office, could they get a demo of what's coming next? They can. They can. Well, that's one reason to apply. Yeah. Yeah, they can. All right. Awesome. Well, we'll do part two later. Cool. Yeah. Thanks for making the time today. Yeah, it was fun. Good to chat.
And that's it for this episode. We hope you enjoyed the discussion with Ankit and Anjane. If you did, please do rate and review the podcast wherever you listen. And keep listening for some more great discussions in the weeks to come.