
Full-duplex, real-time dialogue with Kyutai

2024/12/4

Practical AI: Machine Learning, Data Science, LLM

People
Alexandre Défossez
Topics
Alexandre Défossez introduces the background, mission, and research directions of the Kyutai lab, emphasizing its independence as a non-profit organization and its contributions to open-source research. He explains the characteristics of their recently released Moshi speech model, including full-duplex real-time dialogue and low latency, and compares it with the work of commercial labs. He also discusses the current state and trends of the French AI ecosystem and the role Kyutai plays in it, and shares his views on open science and how it can help democratize AI. Chris Benson and Daniel Whitenack, as hosts, guide Alexandre Défossez through the technical details of the Moshi model, its data processing methods, the choice of model size, and future research directions. They also discuss the French AI ecosystem, the meaning of open science, and the trade-offs between large and small language models.

Deep Dive

Chapters
Kyutai, a non-profit research lab based in Paris, developed Moshi, a full-duplex, real-time speech-to-speech AI assistant. Moshi allows for fluid, human-like conversations with minimal latency and has potential applications in various fields.
  • Kyutai is a non-profit, open-source AI research lab funded by three donors.
  • Moshi is a full-duplex model, meaning it can listen and speak simultaneously.
  • Moshi has a latency of around 200 milliseconds.
  • Kyutai prioritizes on-device models, which are harder to protect as intellectual property but offer wider accessibility.
  • The French ecosystem is conducive to AI research due to a strong emphasis on mathematics, engineering, and PhD residencies in private companies.

Transcript


Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to everyone. If you like this show, you will love The Changelog: it's news on Mondays, deep technical interviews in the middle of the week, and on Fridays an awesome talk show for your weekend enjoyment. Find us by searching for "the changelog" wherever you get your podcasts. Thanks to our partners at Fly.io. Launch your AI apps in five minutes or less. Learn how at fly.io.

Okay friends, I'm here with Kurt Mackey, co-founder and CEO of Fly. As you know, we love Fly.

That is the home of changelog.com. So I want to know: how do you explain Fly to developers? Do you tell them a story first?

How do I explain it? I kind of change how I explain it based, almost, on the generation of developer I'm talking to. So for me, I built and shipped apps on Heroku. What Heroku used to be is roughly like building and shipping an app on Vercel today.

It's just that it's 2024 instead of 2008 or whatever. And what was frustrating to me about doing that was that I got stuck.

You can build and ship a Rails app with Postgres on Heroku, the same way you can build and ship a Next.js app on Vercel. But as soon as you want to do something interesting, like as soon as you want to...

At the time, one of the things I ran into was that I wanted to add what used to be kind of the basis for Elasticsearch, because I wanted to do full-text search in my applications. You kind of hit this wall with something like Heroku where you can't really do that.

I think lately we've seen it with people wanting to add LLM inference kind of stuff to their applications on Vercel, Heroku, or Cloudflare, whoever. These days they've started releasing abstractions that sort of let you do this. But I can't just run the model

I run locally on these black-box platforms that are very specialized. For me, it's always been: Heroku was great, but I outgrew it. And one of the things I felt like I should really be able to do when I was using Heroku was run my app close to people in Tokyo, for users that were in Tokyo, and that was never possible. For the more modern generation of

devs, it's a lot more Vercel-based. It's a lot like: Vercel is great right up until you hit one of their hard boundaries, and then you're kind of stuck. We had someone in the company reference a game, I can't name it, but the tagline was like "five minutes to start, forever to master." That's sort of how we pitch Fly: you can get an app going in five minutes, but there is so much depth to the platform that you're never going to run out

of things you can do with it. So unlike Heroku or Vercel, which are all great platforms, the cool thing we love most here at Changelog about Fly is that no matter what we want to do on the platform, we have primitives, we have abilities, and we as developers can chart our own course on Fly.

It is a no-limits platform for developers, and we think you should try it out. Go to fly.io, learn more, launch your app in five minutes. Too easy. Once again, fly.io.

Welcome to another episode of the Practical AI podcast. This is Daniel Whitenack. I'm CEO at Prediction Guard, and I'm joined as always by my co-host Chris Benson, who is a principal AI research engineer

at Lockheed Martin. How are you doing, Chris? I'm doing really well, Daniel. How are you?

It's going great. I think we talked about this a little bit on the last show, but now we're officially up against the Thanksgiving break, so a couple of days off here in the US, which will be nice. Maybe I can catch up on some of the cool AI stuff that I've been meaning to play around with in my spare time. But one of those cool AI things that definitely made its rounds over here at Prediction Guard, and that we were talking about, was the recent advances in real-time speech assistants, and in particular, you know, what OpenAI was doing, but then also what a lab in France called Kyutai released. Today I'm really excited because we finally got the chance to have Alexandre Défossez, who is a scientist and co-founder at Kyutai, with us.

Welcome, Alex. Thank you, and thank you Chris, for the invitation. Looking forward to discussing

the details about Moshi. Yeah, excited about it. Maybe before we do that, if you could give us a little bit of background on what Kyutai is and how it came about.

Yes, so Kyutai is a non-profit lab that we launched a year ago in Paris. We have funding from three donors: Xavier Niel, Rodolphe Saadé, and Eric Schmidt. Eric Schmidt is probably the best known of the successful ones, and Rodolphe Saadé works in logistics. They gathered together to fund this effort, to bring about a kind of independent lab with a mission to do open-source research.

At a time when open source may be suffering a bit from the competition between some of the major labs. So that is, I think, a big motivation for everyone in the lab. And basically, we have sufficient capacity to be kind of competitive with big labs. We can't really fight on every front, but as we showed with Moshi, we can definitely bring interesting ideas and innovation to the table.

Yeah, and I find that, maybe for those of us here in the US AI ecosystem, we do see a lot of innovation and interesting things happening in France and in Paris. I'm wondering, just out of curiosity, what is the ecosystem like there? You seem to have been formed out of part of that, so how has that sort of shaped you, and what is the ecosystem like there?

I think the ecosystem kind of starts with the studies in France. There's a very strong engineering culture, and also a very strong emphasis on mathematics, which I think gave good soil that initially attracted a number of big American players, like Facebook, that opened labs here.

I think at the time the Facebook AI lab in Paris was probably the second largest after the Californian one, about tied with New York. So I think that kind of says how attractive the city can be, because it's not so easy to compete with the attractiveness of America. Now I think what has changed in recent years is really the kind of independence that's growing from this initial setting.

I think for many years there weren't a number of truly French organizations where you could have access to a sufficient number of GPUs, like large training clusters, so as to develop machine learning models for a number of applications, and that was especially the case with large language models. But there have been a number of events that have kind of led to this diversification of the ecosystem in France.

And so now I guess there are a number of big startups, there's Kyutai, and I think that's only going to grow. Also, there is one specificity in France which I think is very nice, especially for deep learning.

And it's the fact that you can do a PhD as a resident in a private company, for instance, or even a non-profit. So at Kyutai we're going to have PhD students, as at Facebook, where I partially did my PhD.

There were also a number of PhD students there. And I think it's such a great opportunity to get to use graphics cards so early during our careers, even as students. I think that's very specific to France, and that's also part of the success we're seeing at the moment, and that I think can only keep growing as we train more and more people in such a way.

I'm curious, as we're describing the ecosystem there in France and how strong it is: what was the specific dynamic, with all these for-profit organizations around you, that brought about the desire to have the non-profit? And how did you find yourself in the middle of that, as you were in the formative stages?

I think for me it was a growing will to become a bit more independent. Even though at Meta, for instance, there was a lot of value put on the Paris office, at the same time an American company always takes decisions at its center, which would be California, and satellite offices always have to kind of bear the consequences of those, no matter the contribution they make to the overall value of the lab.

So that was kind of the initial desire, to be a bit more independent in terms of the decision making, the ability to lead the research. And I got the opportunity: I was contacted by Neil, who had done his PhD with me at Facebook, at Meta.

He had then been at Google, doing very successful research there. So he was among the first people contacted, I think by Xavier Niel, and I think the project was initially very appealing because it's, you know, the same business as usual.

So doing research, which is what I love the most, having sufficient resources to do it, in a completely independent and French environment. So that was of course very appealing. It didn't take very long to decide. I guess at first it seemed a little bit too good to be true, but so far so good.

Yeah, Kyutai kind of promotes the idea of open science and, you know, the democratization of AI, artificial intelligence, through open science. Some of our listeners might be familiar with open source, open-source AI, or even open-access models. How would you define and think about open science as a thing, and in particular how does that connect to the way in which you envision the building of AI or AGI?

Yes. So I think the two are quite related. Usually open science really comes down to explaining how you arrived at the final results, and kind of what mistakes you made, what things you tried, what was important and what was not. I would say that's the first part, which we've done really well with the Moshi release, like a preprint technical report with a lot of details that actually took us a bit of time.

And that's something that, if we were not in this kind of non-profit mindset, I don't think we would dedicate as much time to. But I think in the long run it's kind of important. And then there are several aspects. The open sourcing can go from just the weights to full training pipelines. So releasing more code around the training of such models is also on our roadmap. We didn't get a chance to do it yet, because the paper already took a bit of time and we have other things we're working on. But I think there is also the part of explaining exactly how you got to the final results, and not just having a set of weights for one specific task and being kind of stuck with it if you need to adapt to something else. That's kind of, I think, the vision of open science.

Could you talk a little bit about what you're able to do with that model that maybe the commercial labs in the same ecosystem aren't able to do? And also, is it more standard with non-profits around the world that are doing similar things, or is there something very distinctive about you compared to other non-profits that you've seen, or maybe the model that they have?

Yes, that's a good question. So I'm not fully familiar with all the non-profits in the AI ecosystem; the Allen Institute, for instance, is one of them, and I think it's very well known, and there are a few other teams as well.

Yeah, I think we're kind of serving a similar mission. I don't think there is necessarily a big difference. Some of them might be more around contribution to science, for instance general science or core deep learning.

I think for us, we are mostly focused on core deep learning. We don't necessarily want to compete, for instance, in the purely text-based space. There are differences in terms of the choices of research we are doing, but fundamentally I don't think there is a big difference. And then your other question was with respect to other for-profits.

What do you feel is really in your sweet spot, to put it another way? Compared to these competitors, it's very easy to, let's say, recognize all the resources that some of the largest companies in the world have and put into their labs, but there is definitely a place for others out there, and I think that gets used a lot by the public. So given the space that you're playing in, what sets you apart from those commercial labs, in terms of maybe advantages, versus just having, you know, the massive number of GPUs available to them? What are some of those distinct things

compared to some of the for-profits? If we take the biggest labs, I would say we can move in a way that is not really possible in a super large company, where every action will have consequences in the stock market, for instance. So the decision process can be really fast.

That was the case for the release of the model, for instance: we were able to release it under a commercially friendly license, which would be a bit harder in a larger structure. And I think, for instance, we have a strong desire to go more and more towards on-device models. Moshi is kind of barely on-device: we demoed it on a MacBook Pro, but it was a top-of-the-line MacBook Pro, so it's kind of a proof of concept. It runs on device, but not on every device.

But I think we definitely have value there, because a number of for-profits are not going to develop really powerful on-device models, because that would be a potential threat to their... like, it's harder to protect in terms of intellectual property. And I think in general, between the bigger players, there is kind of a race to the very top, the very best numbers on the benchmarks, MMLU and everything. And so, you know, if it takes ten times more inference time to beat the others on the benchmarks,

they are going to do it, because it's either beating the others on the benchmarks or kind of leaving the arena. So we're not really in this mindset. We think on-device could have a very large number of applications. It definitely cannot solve all issues, but I think as a non-profit we won't have the kind of reservations a for-profit might have about on-device models.

Okay friends, I'm here with a good friend of mine, Avthar Sewrathan, from Timescale. They are positioning Postgres for everything from IoT sensors and AI dev tools to crypto and finance apps. So Avthar, I understand Timescale's position is for Postgres to be the database for AI applications.

It's the most popular database according to the Stack Overflow developer survey. And one of the distinguishing characteristics is that it's extensible. You can extend it for use cases beyond just relational and transactional data, for use cases like time series and analytics, which is kind of where the company started, as well as, more recently, vector search and vector storage, which are super important for applications like RAG, recommendation systems, and even AI agents, which we are seeing more and more of these days.

Yes, Postgres is super powerful and well loved by developers. I feel like even more so now, because we know that it can enable more developers to become AI developers, AI engineers, and build what they want.

From our side, we think Postgres is really the no-brainer choice. You don't have to manage a different database.

You don't have to deal with data synchronization and data isolation because you have three different systems and three different sources of truth. And one area we've been working on is performance and scalability. So we built an extension called

pgvectorscale, which enhances the performance and scalability of your Postgres, so that you can use it with confidence for large-scale AI applications like RAG and such. And then another area, coming back to something that you said, is enabling more developers to make the jump into building AI applications and become AI engineers using the expertise that they already have. And so that's why we built the pgai extension, which brings LLMs to Postgres, to enable things like LLM reasoning on your data as well as embedding creation. And for all those reasons, I think, you know, when you're building an AI application you don't have to use something new. You can just use Postgres.

Friends, learn how Timescale is making Postgres powerful. Over three million Timescale databases power IoT sensors, AI dev tools, crypto and finance applications.

And they do it all on Postgres. Timescale uses Postgres for everything, and so can you. Learn more at timescale.com. Again, timescale.com.

So Alex, you've mentioned Moshi a few times now. Maybe if you could just give those that haven't heard of this an idea of, first, what is Moshi, and then maybe after that step back and describe how did the lab, how did Kyutai, start thinking about that sort of model, or that sort of research direction, as a research direction of the lab?

Yes. So Moshi is a speech-based foundation model that also integrates text as a modality. It's especially built for speech-to-speech dialogue, and especially real-time dialogue. So we put real emphasis on the model being able to act in a way that is as fluid as possible, like a real conversation with a human being.

And so one of its characteristics is that it's completely full-duplex, meaning that the model can both listen and speak at any time. So it's not turn-based, like walkie-talkies. I think that's an important feature, because that's how we communicate as humans, so we wanted the model to be able to do the same thing. As I mentioned, that also allows us to have very low latency.

So we have around 200 milliseconds between the time the audio leaves your microphone and the time you get a reply that has accounted for that audio. At the moment, we mostly designed it as a speech agent with which you can discuss, ask questions, ask for advice, and that could potentially serve as a basis for many other use cases. That's why we also describe it as a kind of foundation model, and also a framework, for a number of tasks that would require kind of reacting to your speech, beyond just being an assistant. Then the second part of the question was how did we start working on that. So we were two people on the initial team.

So Neil and I have done most of our research on audio modeling, and then another of our co-founders had been a core member of the initial Llama team, the very first Llama at Meta. So we kind of had the right tools. I guess the first reason is that basically we sat down together and asked: well, what can we do, and where do we have an edge on the competition?

And I think on this aspect of combining the language model technology and top-of-the-line audio modeling techniques, we had a real edge compared to other labs. So that was important. And also there was a sense that speech was becoming an important modality.

And what had been done in a number of other modalities was still completely lacking for speech. This was back in November; at the time OpenAI hadn't made any announcement. So it was still pretty much a new area, a new area to cover. So we kind of immediately started working on that.

We actually started both on Helium and, in parallel, we worked on Mimi, the codec that we use, with the goal of having a really highly compressed representation, 12.5 hertz, to get as close as possible to the text, which would be around 3 hertz, although of course text is not regularly spaced with respect to audio. And then, once we were happy with Mimi, we immediately moved on to the aspects of how do we model the speech, how do we handle the full-duplex, how do we instruct the model, a number of challenging questions that arose along the way. That led to a kind of first demo, a public demo, in July.
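To make those numbers concrete, here is a minimal back-of-the-envelope sketch of what a 12.5 Hz codec implies for a real-time loop. The 12.5 Hz frame rate and the roughly 200 ms latency figure come from the conversation; the raw sample rate and the number of codebooks per frame are illustrative assumptions, not confirmed details of Mimi.

```python
# Back-of-the-envelope numbers for a Mimi-style codec feeding a language model.
# The 12.5 Hz frame rate is from the conversation; the codebook count (8) and
# sample rate (24 kHz) are illustrative assumptions, not confirmed figures.

SAMPLE_RATE_HZ = 24_000        # assumed raw audio sample rate
FRAME_RATE_HZ = 12.5           # Mimi frame rate mentioned in the episode
CODEBOOKS_PER_FRAME = 8        # assumed number of residual codebooks per frame

frame_period_ms = 1000 / FRAME_RATE_HZ                   # 80 ms of audio per frame
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ       # 1920 raw samples per frame
tokens_per_second = FRAME_RATE_HZ * CODEBOOKS_PER_FRAME  # 100 tokens/s per audio stream

print(f"frame period: {frame_period_ms:.0f} ms")
print(f"samples compressed per frame: {samples_per_frame:.0f}")
print(f"audio tokens per second (one stream): {tokens_per_second:.0f}")
# A ~200 ms end-to-end latency budget therefore only spans a couple of frames,
# which is why the codec has to be both low-frame-rate and streamable.
```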

That's great. And just one more kind of background question. Some people might have seen, I guess, non-real-time agents: agents that would take audio, transcribe it, maybe with one model, use a language model to generate an answer, and then use a third model to generate speech. So that's one kind of pipelined way to process this. You're talking about something different here, particularly with these speech-to-speech models, or the kind of full-duplex models you're describing.

Could you give a little bit of background? Like, how long have people been studying this, researching this type of model, and has it really only become possible in recent times to make this kind of real-time speech a reality? Because, at least publicly, people may have seen things like Alexa in the past, right, which process speech in certain ways. But the sort of demos that they're seeing from OpenAI, the demos that they're seeing from Kyutai, this is a different type of interaction. So how long has this sort of thing been possible, and what is the history of the research? I know that's a hard question because there are probably a million things that have been done, but from an overall perspective, how would you describe it?

So, just to put it in perspective: I'm not necessarily entirely familiar with how Alexa works, but anything that is kind of pre-GPT would be kind of rule-based, or based on automatic speech recognition, which is actually a fairly old field, and even real-time speech recognition has been successful for a while.

Not necessarily with the amount of success we see with deep learning; I mean, it was already using some deep learning before, but then it's kind of rule-based. So if you don't formulate a request in quite the right way, it's quickly going to say "I don't know" or just do a web search.

Then what brought a change of paradigm was the GPT models, and ChatGPT in particular, with this ability to understand a human request no matter how it is formulated. Then, to bring that to the audio domain, what you need is the ability for a kind of language model, like a transformer, to process the audio streams. You would think it's very easy: for a GPT model,

you have text tokens in and you have text tokens out. You predict the next token, and then you just need some special characters to differentiate between the request and the reply.

And you want to be able to do something similar with audio, but things are not quite as easy with audio. Audio is not as dense in terms of information. You can think of words as being, from an information theory point of view, almost an optimal way of transmitting information.

Whereas audio, as recorded by a microphone, is just a wave that's oscillating maybe forty thousand times per second, and if you just look at it with the naked eye, it will make no sense.

So you need the right representation to be able to feed that into a transformer, have the transformer understand it, and be able to produce the outputs. And that has been quite a challenging task. If we talk about audio, the first few successes were, for instance, WaveNet, and then there was Jukebox by OpenAI, which I think was the first

"let's use a transformer language model to try to model audio." But I recall from their paper that processing one minute of audio would take eight hours on a top-of-the-line GPU at the time. So you can see the technology has progressed a lot, and I think some of this progress was especially driven by Neil, for instance, who is another co-founder today,

at Google, with SoundStream in particular, which provided this kind of discrete representation at a relatively low frame rate. And then already very quickly, Neil and his team showed that this could be fed into a transformer. At the time they were using a technique where you would still have many more steps: for one second of audio you would need to do maybe a few hundred autoregressive steps, which is very costly. One second of text with a transformer, of equivalent information, would be maybe three or four autoregressive steps. That naturally puts a constraint on both your context and the length of the sequence you can generate, and it completely rules out the real-time aspect.

Then when I was at Meta, I also worked on similar topics, especially on how to not do as many autoregressive steps, but to try to predict some of the information in parallel, and to organize it so that you would have kind of minimal dependency between the different aspects you need to predict. Maybe it's hard to describe orally, but basically, for each time step, instead of just one token like you would have in text, you now have maybe four or eight or sixteen tokens, and you need to make sense of that. You cannot just flatten everything, because that is not going to work in terms of performance. And then there were a number of works; what we use for Moshi is the RQ-Transformer, which kind of models the dependency between those tokens for a given time step with a smaller transformer. That was, I guess, a pretty important algorithmic contribution. I'm trying to remember who did that, but I don't

have it in front of me. But yes. So we built both on this expertise, the work that Neil has been doing, the work that I have been doing, and this kind of RQ-Transformer paper, to solve the aspect of being able to run a big language model,

let's say seven billion parameters, that takes audio as input and outputs audio sufficiently fast for real-time processing. And then the other aspect, I guess the one where we brought a lot of innovation, was the full-duplex aspect of having multiple audio streams: one audio stream for the user, one audio stream for Moshi. And that's not something you would naturally do with text, because you already have one stream.

So going to two streams, you know, it's kind of odd. But if you think of it for audio, all those codec tokens already form up to sixteen streams that we already had to handle. So we were just like, okay, we just double the number of streams, and now we have two of them that are clearly separated. Actually, the model is trained, for instance during pre-training, to also generate some of the user's replies.

And even at that stage of the training, there is no real... it's just kind of a participant in the conversation that it samples. Then obviously, with the model we released, it only tries to model its own stream. But yes, that's kind of the rough line of work that led to Moshi. Then of course, in audio modeling there are many other techniques that I didn't mention; in particular, diffusion is very popular. There are many models doing diffusion for music generation, for instance, or for TTS, for a number of things. And we see that as not compatible, or at least much harder to make compatible, with the real-time aspect, which is much more natural with autoregressive modeling.
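To make the multi-stream idea more tangible, here is a minimal, hypothetical sketch (not Kyutai's code) of the token bookkeeping for one full-duplex step: a large temporal backbone advances once per audio frame, and a small depth transformer fills in the handful of audio tokens belonging to that frame, for both the user's stream and Moshi's stream, plus a text token for the model's own words. The codebook count and vocabulary sizes are made up for illustration.

```python
# Illustrative sketch of a full-duplex, RQ-Transformer-style step: a temporal
# backbone advances once per 80 ms frame, a small "depth" model fills in the
# per-frame tokens. All sizes are assumptions, not Moshi's actual configuration.

import random
from dataclasses import dataclass, field

CODEBOOKS = 8          # assumed codebooks per audio stream (Mimi-style RVQ)
AUDIO_VOCAB = 2048     # assumed codebook size
TEXT_VOCAB = 32_000    # assumed text vocabulary size

@dataclass
class Frame:
    """Everything the model handles for one 80 ms time step."""
    user_audio: list = field(default_factory=list)   # tokens heard from the user
    moshi_audio: list = field(default_factory=list)  # tokens spoken by the model
    moshi_text: int = 0                              # the model's own words

def fake_backbone(history):
    """Stand-in for the large temporal transformer: one step per frame."""
    return random.randrange(TEXT_VOCAB)

def fake_depth_transformer(context_token):
    """Stand-in for the small per-step transformer predicting the audio codebooks."""
    return [random.randrange(AUDIO_VOCAB) for _ in range(CODEBOOKS)]

history = []
for step in range(3):                                    # three 80 ms frames
    user_audio = [random.randrange(AUDIO_VOCAB) for _ in range(CODEBOOKS)]  # from the mic
    text_token = fake_backbone(history)                  # backbone runs once per frame
    moshi_audio = fake_depth_transformer(text_token)     # depth model fills the frame
    history.append(Frame(user_audio, moshi_audio, text_token))

print(f"{len(history)} frames, {CODEBOOKS * 2 + 1} tokens handled per frame")
```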

That was really fascinating in terms of understanding. I definitely learned as you were describing it; I don't think I've heard such an excellent, you know, not just promotion, but explanation of how to get there. What I'm wondering in my head is, I can imagine as you're talking so many cool things to do with this technology. What are some of the cool things that you've seen already, or that you guys have tried specifically, that maybe weren't possible before, or that maybe people could only do at some level with something like GPT-4o, you know, through the API?

But you know, this is open source, it's open science, so people have a lot more capability. There must be some pretty awesome stuff out there.

I mean, there are a few things that we've done that were really, really funny. For instance, just training on this old dataset from the nineties and early two thousands of phone calls, and then it was not really an assistant anymore; it was just like you end up on the phone with someone random, and they will tell you their name, they will tell you what they think about US politics at the time. It's really kind of a different thing that we tried to keep in the final Moshi.

But obviously, with the phase of instruct tuning, we lost some of this. I mean, it still quickly falls back to the helpful AI assistant personality, which is maybe not as nice. But there was a funny thing: basically we can train it on anything, and then it's going to act like kind of an actor that would pretend to be a certain person in a very real way. There are a number of things that we're exploring with this kind of approach, anything that would be speech-to-speech, or text-to-speech, or vice versa; some of them we mention in the paper. We use this framework because we also have a text stream, which we basically use only for the model to be able to output its own words. We don't actually represent the words from the user, but the model outputs its own words.

And with this aspect, by making the text late or early relative to the audio, we can turn the model into being a text-to-speech engine, because if the text is earlier than the audio, the audio is just going to follow it. But if the text is late, and you kind of force the audio to some recorded value and you only sample the text tokens, that now becomes automatic speech recognition. So I think that kind of shows how versatile this multi-stream approach is, and all of those applications are really streaming.
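As a rough illustration of that delay trick, the sketch below (hypothetical, not from the Moshi codebase) just shows how shifting the text stream relative to the audio frames changes the task: if the text runs ahead of the audio and the audio is sampled, the model behaves like a streaming TTS; if the audio is forced to a recording and the lagging text is sampled, it behaves like streaming ASR.

```python
# Hypothetical illustration of the text/audio offset described above.
# Positive lead: the text token for a frame is emitted before that frame's audio
# (TTS-like, the generated audio follows the text). Negative lead: the text token
# is emitted after the audio has been heard (ASR-like, the text transcribes it).

def text_index_at_step(step, text_lead_frames):
    """Index of the text token emitted at a given audio frame."""
    return step + text_lead_frames

text = ["hel", "lo", "wor", "ld"]       # made-up text tokens, one per frame

for mode, lead in [("TTS-like (text leads)", +1), ("ASR-like (text lags)", -1)]:
    emitted = []
    for t in range(len(text)):
        i = text_index_at_step(t, lead)
        emitted.append(text[i] if 0 <= i < len(text) else "<pad>")
    print(f"{mode}: audio frames 0..{len(text) - 1} paired with text {emitted}")
```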

Something we did for the synthetic data was using this kind of approach to generate long scripts, and you could imagine generating maybe fifteen minutes or whatever; those are things that we are working on more internally. In terms of the more general community, I'm not aware of anything in particular.

I think one thing we want to do, though, is to really allow fine-tuning, maybe with LoRA, and also make it pretty easy. Obviously the pipeline is a bit more complex, because you need audio, you need transcripts, you need separation between the agent you want to train and the user. So we want to help in that regard and try to make it easier to adapt it to a new use case.

Okay friends, I'm here with a friend of mine, Michael Grinich, co-founder and CEO of WorkOS. We're big fans of WorkOS here.

Michael, tell me about AuthKit. What is this? How does it work? Why did you make it? WorkOS has been

building stuff in authentication for a long time, since the very beginning, but we really focused initially on just enterprise single sign-on, SAML authentication. But a year or two into that, we heard from more people that they wanted all the auth stuff covered: two-factor auth, password auth with blocking of passwords that have been reused.

They wanted auth with other third-party systems, and they really wanted WorkOS to handle all the business logic around tying together identities, provisioning users, and even more advanced things like role-based access control permissions. So we started talking about that more, about how we could offer it as an API. And then we realized we had this amazing experience with Radix, with this API, really the component system for building front-end experiences for developers. Radix is downloaded tens of millions of times every month for doing exactly this.

So we glued those two things together and we built AuthKit. AuthKit is the easiest way to add auth to any app, not just Next.js: if you're building a Rails app, or a Django app, or just a straight-up Express app or something. It comes with a hosted login box, so you can customize it, you can style it.

You can build your own login experience too; it's extremely modular. You can just use the backend APIs in a headless fashion. But out of the box, it gives you everything you need to build and serve customers, and it's tied into the WorkOS platform, so you can really, really quickly add any enterprise features you need.

So we have a lot of companies that start using it because they anticipate they're going to grow upmarket and want to serve enterprises, and they don't want to have to re-architect their auth stack when they do that. So it's kind of a way to future-proof your auth system for your future growth. And we have people that have done that: people that started off just kicking the tires, doing a proof of concept, then their app gets a bunch of traction, starts growing, awesome, and then they go close Coinbase, or Disney, or United Airlines, or some other major customer. And instead of saying, "oh no, sorry, we don't have any of these enterprise things and we're going to have to rebuild everything," they just go into the WorkOS dashboard,

check a box, and they're done. Aside from the fact that AuthKit is just awesome, the real awesome thing is that it is free for up to one million users. Yes, one million monthly active users are included before any usage gates.

So use it from day one, and when you need to scale to enterprise, you're already ready. Too easy. Learn more at authkit.com, or of course workos.com. Big fans. Check it out, a million users for free: workos.com or authkit.com.

So Alex, you touched a little bit on the data side of this, and also on hoped-for future fine-tuning opportunities. I wonder if you could go into that a little bit, in particular because you're able to talk about this sort of thing, which sometimes we're not able to talk about given the nature of the models we discuss on the podcast. What was the data situation that you had to put together, in terms of the specific training datasets or fine-tuning datasets that you put together and curated for the model that you've publicly released, as a kind

of model builder? Obviously, we had to put together kind of a base, a pre-training dataset, in audio and in text. Initially we had to put the text dataset together as well, because there wasn't necessarily, at the time, an alternative that we could use in terms of license, and also we wanted to be able to keep training both on text and audio, so as to avoid a kind of catastrophic forgetting of the knowledge that would come from the text.

One thing we realized is that basically it's much easier to have very wide coverage of human knowledge with text than with audio. And then there were a number of other difficulties, in particular the fact that for the last stage of the training we needed audio with clearly separated speakers. And we also needed some kind of instruct dataset.

So for the separation, we bootstrapped things from the Fisher dataset, which is the dataset of phone calls I mentioned earlier. That gave us a good enough base to then be able to train a TTS model with separate speakers, in combination with some recordings we did ourselves. Actually, as I talked about earlier, about taking faster decisions than in larger organizations: at one point we were like, okay, we need pretty good studio-quality recordings of people on separate microphones.

So then we got in contact with a studio in London, and the next day we were on the Eurostar, just recording a few people, which I think was really fun. It's good to have a break from just launching jobs and crunching numbers now and then. And then, leveraging that plus the Fisher dataset, we could train a TTS model that could follow specific emotions and produce two separate streams as output, so for the two speakers, and then we used that to synthesize the instruct dataset.

Initially, we tried to convert existing instruct datasets for text, but we quickly realized that a few scripts that were specifically tuned for audio would give much better results. And one of the reasons is that if you look into some of those existing instruct datasets, they're very geared towards the way we use text models.

So maybe some people copy-paste a Markdown table and ask the model to comment on it, or there are a number of entries that are specifically done for benchmark-style questions. So it's going to be multiple choice and the model just answers "B". But that's not something you're going to do orally: you're not going to read out four choices and have the model just answer "B". We needed a lot more multi-turn data, and also shorter replies.

You don't want the model to speak out an entire paragraph as a reply. So with that in mind, we had to kind of rebuild everything, and that helped a lot. Some of it was kind of pinging existing LLMs, being like: okay, what are a hundred different tasks we could do with a speech assistant? And then, for each task, give me a hundred quick scenarios.

And then we had another model that we fine-tuned specifically to follow kind of the oral style: shorter answers, maybe short exchanges of turns. We would also randomly sample topics and have discussions around them. So we tried to cover different aspects like that, and then we synthesized everything.
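The synthetic-data recipe described here (tasks times scenarios, then short oral-style dialogues) could look roughly like the following sketch. Everything in it, the prompts, the `ask_llm` helper, the counts, is hypothetical; it only illustrates the two-level prompting loop, not Kyutai's actual pipeline.

```python
# Hypothetical sketch of the two-level synthetic-data loop described above:
# ask a text LLM for N tasks a speech assistant could do, then M quick
# scenarios per task, then render each scenario as a short, oral-style,
# multi-turn script to be synthesized with a two-speaker TTS model.

def ask_llm(prompt):
    """Placeholder for a call to a text LLM; returns canned lines here."""
    return [f"{prompt} -> item {i}" for i in range(3)]   # 3 instead of 100 for the demo

tasks = ask_llm("List tasks a speech assistant could help with")
scripts = []
for task in tasks:
    scenarios = ask_llm(f"Give quick scenarios for: {task}")
    for scenario in scenarios:
        # Short turns, no long paragraphs: the constraints mentioned in the episode.
        scripts.append(ask_llm(f"Write a short spoken dialogue for: {scenario}"))

print(f"{len(scripts)} dialogue scripts ready to be fed to the TTS model")
```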

So in the end, the dataset was fairly large, I think a few tens of thousands of hours, and it was kind of sufficient to get to the state of the demo. So it was kind of cool that we could bootstrap this entire modality basically from these one or two thousand hours of recordings from the early two thousands, and a few hundred hours that we had recorded in the studio.

One thing we noticed is that there is still what we call the modality gap. So there is still a gap in knowledge between the text model that we started from and the final model. Actually, as we train the model we still train it on text, so we can always switch it to text mode and ask a question in pure text, and the model will get much better replies on TriviaQA than it gets with audio. And that is, I think, a really fascinating question: how to make the model understand that it's the same thing. At the same time, it's very easy for it to treat these as two different modalities, especially with pre-training on audio, where it gets kind of random audio and isn't necessarily focused on giving the right answers all the time. We could recover some of that with the instruct phase, but I think there's still work to do to be as sample-efficient as a text model and to really become super useful and factual.

Curious, and you may have mentioned this, you mentioned seven billion parameters earlier, but is that the

size of the model? It is a seven billion parameter model, plus, due to its RQ-Transformer architecture, a second, smaller transformer. Actually, I found back the authors: it's Doyup Lee and collaborators who first published this architecture. So there is the main backbone transformer, and the small transformer that just tries to predict the different audio tokens within a step; this one is kind of smaller.

I don't have its exact size in weights, but in terms of runtime, its inference time is negligible. Most of the knowledge and decision-making is done in the big seven billion parameter transformer.

How did you pick the model being that size? And also, as an addendum to that, what is your perspective on, you know, relatively smaller models versus relatively larger models? How do you see that?

Yeah, I guess when we started, seven billion was kind of the minimum size for large language models. Now I guess two billion and three billion, down to one billion, especially with the advance of distillation techniques from bigger models, have become very efficient.

They are not as efficient as a text model at seven billion, but they're getting there. But yeah, at the time when we started, we were like: okay, we don't know exactly how much compute, how much capacity, it's going to take to solve the task, so we don't want to take too many risks.

Seven billion was well-charted territory, and at the time a pretty good balance between the two. Now that we know that we can solve the task with seven billion, obviously we want to try to go lower than seven billion. And that's something we are exploring, because the way we see things, it's going to be very hard and probably not super useful to try to put all the thinking capacity and problem-solving capacity into the real-time model.

We want it to be smart enough to have a direct conversation, understand what the user wants, and potentially then access other resources for getting more complex answers. That would also allow a more plug-and-play aspect: if now you have a new text language model, you don't necessarily want to retrain the audio part from scratch. So the way we see it is going towards smaller models for managing this direct, low-latency interaction, delegating some of the work to a larger model when needed. So for sure, now that we know it works with seven billion, we will try smaller, so that we can run on a much larger number of devices.
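One way to picture the delegation idea, a small low-latency model handling the conversation and handing hard queries off to a bigger model, is the toy routing sketch below. The function names, the heuristic, and the split are all invented for illustration; this is not Kyutai's architecture, just the general pattern.

```python
# Hypothetical sketch of the delegation pattern: a small, low-latency model
# answers directly, and hands harder requests off to a bigger, slower model.

def small_realtime_model(user_utterance):
    """Pretend on-device model: returns (reply_text, needs_help)."""
    reply = "heard: " + user_utterance
    needs_help = "explain" in user_utterance      # toy heuristic for a hard query
    return reply, needs_help

def large_remote_model(query_text):
    """Pretend large model used only when the small one delegates."""
    return f"[detailed answer to: {query_text}]"

def handle(user_utterance):
    reply, needs_help = small_realtime_model(user_utterance)   # always low latency
    if needs_help:
        reply = large_remote_model(reply)                       # slower, richer path
    return reply

print(handle("hi there"))
print(handle("explain quantum tunneling"))
```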

I guess you already started talking about additional things that you want to try with respect to Moshi and these types of models in the future. But maybe stepping back a little bit, as we get close to the end of the episode here: when you, as a researcher in this area, look towards the future, whether it's work that you all are planning to do internally or just things going on more broadly, what are some of the most exciting things for you as you think about the next year of your work, the things that you're following, the things that you're looking at? What's on your radar, and what are you excited to participate in and see happen in the coming months?

Okay, in the coming months. That's a good question. I mean, I think one topic that I'm interested in at the moment is the question of whether we're going to be, one day, in the post-transformer era. Like, I love transformers, and I love not having to wonder anymore.

I mean, if you look at the set of hyperparameters used to train those models, they've been frozen for maybe two years, two and a half years, you know, the architecture, which is good, because now we mostly focus on just making the right data to solve problems, and there's a lot we can do. At the same time, I think I would be really excited to see advancements that could happen either on the optimization side or the architecture side. We've seen a lot of interesting work in this area this year.

But at the moment we are more at parity: we've found other ways of doing kind of the same thing, but there is not really something that has won on a decisive aspect or feature, something that could not be done sufficiently well with transformers. There has been tons of engineering going into them.

So you know, each time you think, maybe the quadratic cost is bad, then people are like: no, you can just hardcore-optimize your kernel, and now it's no longer your problem. But yeah, in terms of just the scientific excitement, that's one thing I want to keep my eye on. At the same time, there is a lot of competition going on; just applying the current models, it's not necessarily easy to free up the time and mental space to think about those issues.

So that's one aspect. And then I'm also curious about how the whole framework side is going to evolve. Working day to day with those technologies really feels like being back in the seventies, like the pre-C era, where you have to think about the CUDA, where the code is different for each architecture. There is a lot of leakage,

abstraction leakage; it's not like you're going to write a nice function, you need to write kind of dirty things, you need to do the equivalent of pointer arithmetic all the time. So that's another thing. So maybe I'm not answering your question about what's coming in the next few months.

But longer term, sometimes I just think of myself in ten years, and you know, you can just write your attention kernel in a few lines of code in a dedicated language and get almost perfect code, and I think that would be amazing, to just explore more things more easily. But we'll see. So yeah, two big potential changes, but I think something is going to happen in the coming years.

Yeah. Well, thank you very much for sharing your perspectives with us, and also thank you for the way that you and the Kyutai team are inspiring many, many people out there that are working on open models, open source, open science, and kind of just generally collaborating in this space. We really appreciate what you're doing as part of that, and thank you for taking the time to chat with us.

It's been great.

Great. Thank you very much for the invitation. It was a pleasure.

All right, that is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com/news. There you'll find 29 reasons, yes, 29 reasons why you should subscribe. I'll tell you reason number 17: you might actually start looking forward to Mondays.

Sounds like somebody's got a case of the Mondays.

Twenty-eight more reasons are waiting for you at changelog.com/news. Thanks again to our partners at Fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.