Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love The Changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for The Changelog wherever you get your podcasts.
Thanks to our partners at Fly.io. Launch your AI apps in five minutes or less. Learn how at Fly.io.
Welcome to another episode of the Practical AI Podcast. This is Daniel Whitenack. I am CEO at PredictionGuard, and I'm joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?
Doing great. Happy New Year. This is our first show of 2025. Happy New Year. Yeah, this is the first one we're recording for the year. First time jumping back on the mics to talk about AI and definitely something that I think will be a theme in 2025, which will be
of course, multimodal AI in general, but I think something that a lot of people are wondering where it's going to go, I guess, in 2025, which is video generation. So we're very pleased to have Paras Jain with us today, who's CEO at Genmo. How are you doing? I'm doing great. Happy New Year, everyone. It's really wonderful to be here.
Yeah, welcome. I know we've been trying to make this one happen for a little while, and I think the timing worked out well because, like I say, people are thinking a lot about video generation and how that will evolve in 2025. Maybe from someone that's working in this area and has been thinking about it deeply, maybe...
A lot of people, listeners, have just started thinking about this topic recently, but you've been thinking about it deeply for some time. Could you give us a little bit of a sense of what happened, what led up to video generation and where it is in 2024? And then as we're entering this new year, what's the current state of
video generation, I guess more generally in terms of what people can access in actually released systems or released models? Yeah, absolutely. There's been a long path to kind of where we are today. I think
you know, given so much excitement in what you might call the left brain of AI, that is, like, language models, reasoning, your o-series of models, you know, I think the right brain has kind of lagged in progress for quite a while, right? Like, people didn't really widely use kind of creative AI at a huge scale. And I think video is the ultimate creative modality here, right? If you think about it,
So much of how we communicate as humans is through visual mediums, and specifically just video, through motion. And so I think it's incredibly exciting, with video being this ultimate form of creative multimodal synthesis. It was always really exciting, but the technology was really far behind, I think, what people really wanted from it. And so it's interesting, my co-founder worked on some of the earliest image generation models and then 3D generation. And then video was always this kind of
big modality we wanted to target. What I think was really interesting in 2023 and 2024 was first the development of image generation, which is kind of a precursor to video. But even then, the gap from image generation models to video generation models was always really big, because if you think about it,
An image might have thousands of pixels or even a million pixels, but a video would have hundreds of millions or even a billion pixels in just a short clip. And so there was a huge gap to cross. I mean, so compute has scaled a lot, and that has enabled larger models. And so I think...
top of mind since we rescheduled this podcast, Sora came to market, right? And I think that was really exciting. That was a watershed moment for a lot of people to kind of see what was possible with video generation. And to me, I think this is a really early bellwether of like, you know, what is to come? I think we're still really early here. - Yeah, and for those that don't know, Sora is from OpenAI, right? - Yeah, correct.
Yeah. And I mean, you talk about some of the challenges with video generation being a kind of different animal. I know that some people might, you know, if they've been longtime listeners of the show, we've had episodes talking about kind of Stable Diffusion and these sorts of models for video and image generation. What are the main things, I guess, if you want to be a video generation model builder, that you have to think of differently, both in terms of kind of the type of model that you would use and also kind of, you know, the process that you'd have to go through in terms of curating data and that sort of thing? Yeah, I mean, I think first and foremost, video data
is really data intensive, right? Like you just think about it compared to even images or text, like text is tiny, images were more expensive, but video is like 100X more in terms of data volume in a short clip than you might have for images. And so when you think about training these models, that's really the most important challenge is,
How do you build architectures and then systems that can scale to process large data sets? That was a big bottleneck for the community. I mean, again, at Genmo, we've been innovating heavily to actually make that possible. But I think this was why progress took a little bit longer than, say, for images or language to come to market. For companies that are training this, though, like, again, they're having to curate
massive scale data sets, usually in the petabytes of data, essentially just to pre-train these models. And that's really intensive. Many practitioners are beginning to fine-tune these models too, but even that remains more challenging than your Stable Diffusion, for example.
That's got to make it hard for new entrants to come into the field. Just the sheer volume of what you have to get set up ahead of time to handle that is probably, I would imagine, beyond what most organizations are really able to do unless they have specific expertise or experience in the area or something. Yeah, absolutely. I mean, it took us a long time to get ready to pre-train models. We'll talk about this more, but we open sourced one of the state-of-the-art video generation models in 2024.
part of the goal here was to kind of let other people have a chance, right? Let them pick up a model and begin to fine-tune it, and kind of skip past that whole pre-training stage.
And maybe talk a little bit about that data side, because I know that this is one of the things that's been, I mean, it has been a struggle on the tech side, but I think especially on the image and video side where there's a lot of questions about how to do it.
Hey, well, where can you actually source all of this video and imagery? And what are the rights associated with that? But I also imagine there's definite curation that's needed in terms of like...
All these prompts that I've seen people do with like, oh, generate this, you know, it's shot with a Canon DSLR or whatever. Like all of that sort of thing has to be curated on the prompt side as well. So yeah, could you talk a little bit about that data curation and what...
the source of that, kind of where you could even get videos, and then the curation process. Yeah, I mean, I think pre-training in general, and that's true for image models, text models, audio models and video models, relies on, like, you know, large volumes of Internet-scale data. But I think what's uniquely challenging with video is it's just
you know, it's easy to kind of get drowned out in the noise, right? And one angle here that I think is really interesting that we, for example, zeroed in on was how do we learn high quality motion with a video model? And it turns out the vast majority of video you find on the internet doesn't move. It's like a static object or it's someone talking. And if you think about that, that doesn't actually teach a generative model about the world. It doesn't teach about physics. It doesn't teach it about how objects interact. And so it's not going to learn strong reasoning. And so the way we think about it, it's
Really, the goal with the video model is to learn physics and realism and the laws that govern our world. And so you might think about inertia, mass, optics, fluid dynamics, all these kind of base properties and how they all interact. That's really the goal with video generation: to learn an engine that can simulate this, because
The output is a video and we can consume it. It's creative and it's beautiful. But the hard step here is finding data that can really help you learn these base rules of the world. And this was one of the most fundamental gaps we had to cross.
It's kind of non-trivial. I'm kind of curious, as you're describing that: I would imagine that some things that you're training for are harder for the model to learn than other areas. I mean, if you just narrow it down to animals and mammals and humans, they move differently, and the physiology and the anatomy is a bit different across those, and that all has to somehow be inferred by the model if it's going to make a video that's realistic.
In your experience, you know, you talked about motion being so important and such. What are some of the harder things to get right over time, not just where you're at today, but for the industry, and maybe early on, what have you struggled with? Yeah, it's kind of funny that one of the test cases people use now to test different video generators is gymnastics, right?
And I think the reason for this is the hilarious videos you'll see online of Sora or other video generators doing gymnastics. And I think one of the answers is video generation models just can't do it right now. It's really complex human motion, and it's really rare. So you talk about data curation, for example: there isn't that much complex motion where we see people doing twists and twirls and backflips and stuff in the wild, right? And so what's kind of interesting is that it requires a fundamental understanding of how human kinematics behaves for you to simulate that properly without it feeling disturbing. And so, yeah,
This has been one of the challenges for people. I mean, for example, early on when we were training, I mean, we've gone through three fundamental pre-training foundation models in the history of our company. And what's interesting with Mochi, which is our latest model, and the prior one, Replay, was that walking was actually a really basic thing that was really hard to nail. It turned out most video generators early on, in kind of
early to mid 2023, they would make humans kind of hover as if they were hovercrafts. Like, their feet would not move. They would just kind of levitate off the ground and move. And so the models were not capable of synthesizing, forget gymnastics, just walking. And so that was one of the critical watershed moments that we had to cross, for example, as a company. I would suggest that might've been some folks that had a little too much bourbon in their eggnog over the holidays right there, kind of that floating thing going on there. I've definitely seen the, like, Jedi vibe. Yeah.
Which is kind of cool in one respect, but not that awesome if you don't want it. One thing here is, I think, we've invested heavily in evaluation infrastructure at Genmo. And part of it is, how do you benchmark these capabilities? Like, one of the test cases we have is, you know, you might have a woman drinking a glass of water with ice, and you want to look at, hey, does the ice move realistically? Is there water flow? But also, like,
You know, what's interesting is once in a blue moon, like the character will try to drink the water through the side of the glass, which is just not physically consistent. And you'll actually see this with some of our competitor models. It's something we had been trying to develop. And I think just that isolated test case alone communicates a lot about the video generation model's capability to just understand the laws of reality. Right. Like it's kind of, yeah, it's a Jedi mind trick. Like you just cannot...
You should not be able to do that, right? How, like when you're using those test cases that you've developed, is that a lot of...
human review, or how does that work? How do you create kind of the tooling around that? Because I know there's sort of, like, comparisons between this image and that image, right, or this frame and that frame, and you can compare closeness and all of that, but there could be a lot of
sort of closeness in the overall image. But if the woman is drinking from the side of the glass, there's kind of a major failure moment, even though maybe everything around is really good. Yeah, I mean...
Look, there's a lack of external, publicly available quantitative benchmarks. I think some of the ones that are publicly available are these leaderboards. So Artificial Analysis has a video generation leaderboard. I mean, we are the number one open source model and kind of neck and neck with closed models there. And that's just human preferences:
hundreds of thousands of people look at two videos side by side and they say this one's better or that one's better. And you kind of get like a chess-style Elo rating. And I think this has been one of the best public benchmarks. You know, internally, one of the ways we think about this, though, is as we're measuring these capabilities, such as world understanding and physics, it's very hard for a human actually to rate by that. It turns out, when we as humans look at two videos side by side and are asked which one we prefer, we often prefer the one that might have slightly higher resolution or more detail.
But if you actually think about it, if I'm going to use this in an actual production application, like film production or gaming or something else, like I probably actually care more about the motion. And so we actually have to override the human intuition, your first order intuition to say, select for detail and
use these test cases as sort of like functional testing of how we can measure these capabilities. You know, in my career, I started out actually in self-driving. I worked at one of the early companies applying deep learning to self-driving perception. And, you know, I took a lot of inspiration from how we built functional safety testing, for example, for deep learning systems, right? And in that way, you're going to enumerate these test cases and use cases, and you can actually say yes or no as to whether you pass that test case scenario, right? And so
Whether it's a human that has to do that review, and we're starting to develop more automated metrics, I mean, just producing more structured forms of evaluation, I think, is really important, because otherwise the world is just too intricate for us to test everything, right? So we have to kind of go use case by use case and just measure progress. And it turns out, as you scale the models and scale the data sets, we begin to see percentage completion rates improve. And this gives us a semi-quantitative benchmark of progress.
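For readers who want a concrete picture of the two evaluation styles described here, below is a minimal sketch in Python. The vote data, model names, and test cases are hypothetical stand-ins, not Genmo's or Artificial Analysis' actual pipeline; it only illustrates the chess-style Elo update over pairwise preferences and the yes/no pass-rate idea.

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """Standard chess-style Elo update for one pairwise preference vote."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Aggregate many "which video is better?" votes into leaderboard ratings.
ratings = {"model_x": 1000.0, "model_y": 1000.0}                        # hypothetical models
votes = [("model_x", "model_y", True), ("model_x", "model_y", False)]   # toy votes
for a, b, a_won in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)

# Functional test cases: each scenario gets a yes/no verdict, and the
# completion rate becomes a semi-quantitative progress metric.
test_results = {"ice_moves_realistically": True, "drinks_through_glass_side": False}
pass_rate = sum(test_results.values()) / len(test_results)
print(ratings, f"pass rate: {pass_rate:.0%}")
```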
Well, friends, AI is transforming how we do business, but we need AI solutions that are not only ambitious, but practical and adaptable too. That's where Domo's AI and data products platform comes into play. It's built for the challenges of today's AI landscape.
With Domo, you and your team can channel AI and data into innovative uses that deliver measurable impact. While many companies focus on narrow applications or single-model solutions, Domo's all-in-one platform is more robust, with trustworthy AI results without having to overhaul your entire data infrastructure, secure AI agents that connect, prepare, and automate your workflows, helping you and your team to gain insights, receive alerts, and act with ease through guided apps tailored to your role, and the flexibility to choose which AI models you wanna use. So, Domo goes beyond productivity. It's designed to transform your processes, helping you make smarter and faster decisions that drive real growth, and it's all powered by Domo's trust,
flexibility, and years of expertise in data and AI innovation. And of course, the best companies rely on Domo to make smarter decisions. See how Domo can unlock your data's full potential. Learn more at ai.domo.com. That's ai.domo.com.
So Paras, I'm wondering, you mentioned this kind of history of pre-training at Genmo and the most recent model, which of course we want to talk about, but I'm sure that that most recent model is informed by things that you tried in the past and kind of your history there. So could you give a little bit of a snapshot of kind of the history of your team and how
how they approached this problem, how you all approached this problem, and the kind of generations that you went through with that. Absolutely. So we're just about two years old at this point. We actually started working on the company Christmas 2022. So it was a holiday. And Ajay and I are the co-founders of the company. First and foremost, we're brothers. I think that's really unique. That's awesome. So
you know, and we didn't really plan to start a company as brothers. I mean, it's a little weird. Normally you have sibling rivalries and things like that. I don't know, we didn't have much of that, but it turned out our skill sets were super complementary. Both of us were doing our PhDs at UC Berkeley. I was working on large scale distributed systems, you know, in the UC Berkeley AMPLab and RISELab. And this is the same lab that created Apache Spark and, you know,
Ray and the Anyscale project. And so, really hardcore machine learning systems for scaling large language models. That was what my dissertation topic was on. And concurrent to that, Ajay was working on the foundations of modern image generation. So
He had joined Berkeley to work on early image generation models. This is kind of like in your GAN era. And I think for him, one deeply unsatisfying thing was that a generative adversarial network was like a mirage. It wasn't actually a grounded loss objective that was learning real motion or dynamics. It kind of was like this game, but you got image generation as a side artifact. So I think...
His story was really interesting in that he ended up writing his paper, DDPM, or the Denoising Diffusion Probabilistic Models paper, which is one of the foundations for how we think about image generation with diffusion today. It's one of the most highly cited papers in this area. And that came from, I think, an early inclination toward: how do we build image and video models that understand physics and realism, grounding them in real generative pre-training, instead of just kind of artificially playing this, you know, game that results in image generation? And so that's some of the
early academic history of the company. But starting the company, we decided to do video because it seemed impossible back in 2022. It was just completely outside the frontier. And we said, fundamentally, we need a new architecture to solve this. And so let us discover, both from a systems and distributed systems perspective, but also a machine learning perspective, what the right approach to do that is. I mean, so yeah.
It's been about two years since founding. We've gone through three large pre-training runs. And each time, we learn something new about the world and integrate that into our approach and our framework and architecture for how we train these models. I think the single underpinning thing, though, is motion. We always joke that Genmo doesn't really have an official expansion, but we kind of retroactively apply this idea of generative motion, right? Genmo is like, we care so much about motion and video that that's really a core element of our founding history and our framework for how we approach video generation.
I'm wondering, you got me thinking of a question a moment or two ago as you were talking through it. You kind of talked about that evolution, you know, kind of starting with GANs, the generative adversarial networks, and finding your way across kind of the architectural progression that you guys have found. Could you talk a little bit about that? In terms of, like, you know, if you were coming into it during the age of GANs right there, and that was the thing, you know, I'm kind of curious, like at a high level,
What was the problem with that? Why did that not work for you? What did you look to next? You know, could you kind of give us a highlight skip over the top of a couple of different major architectural twists and turns, to give us a sense of what your journey might have been like? Yeah. So I think the earliest form of image generation models that started to work well were autoregressive image generation models. This is very similar to a large language model. You kind of take a...
you know, you might take an image and make it a single vector, a line. So if it's, you know, a 28 by 28 image, now you have 784 pixels in a straight line, and you just go one by one by one and decode the next one. So that was the earliest form of image generation. There are models like PixelRNN or PixelCNN, or ImageGPT from OpenAI, which were the earliest works here that worked well. But the problem is, images have millions of pixels. This would never scale to produce high resolution images.
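As a rough illustration of the pixel-by-pixel decoding described above, here is a minimal sketch. The `model` callable is a hypothetical stand-in for a trained network like PixelRNN or ImageGPT; the point is only that one forward pass is needed per pixel, which is why this never scaled to megapixel images.

```python
import numpy as np

def generate_image_autoregressively(model, height=28, width=28):
    """Decode one pixel at a time, conditioning on everything generated so far.
    `model(pixels_so_far)` is assumed to return a probability distribution
    over the 256 possible values of the next pixel."""
    pixels = []
    for _ in range(height * width):        # 784 sequential steps for a 28x28 image
        probs = model(np.array(pixels))    # one full forward pass per pixel
        pixels.append(np.random.choice(256, p=probs))
    return np.array(pixels).reshape(height, width)

# A one-megapixel image would need ~1,000,000 sequential forward passes.
```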
I think what's interesting from when Ajay was working on this early on, in 2018, 2019, was I remember he trained an autoregressive image generator model. And the first models he trained were trained on LSUN, which is a dataset of bedrooms, basically.
But what was so interesting is that in a little five by five or 10 by 10 pixel region, it would start to put artwork on the background of people's bedrooms. Why? Because that's just what nature looks like. That's what real estate listings look like. But it was the first indication of AI generated art in some sense, with an early image generation model.
The problem is this would not scale, right? Because you're kind of going pixel by pixel by pixel. So it would take hours to make a small image. I mean, so GANs were the next kind of major approach. I think that really worked well for this. And GANs are trained, again, with this generative adversarial objective. It's kind of this dueling game between a generator and a discriminator. But they were really hard to train.
It turned out they would get into these bad states; mode collapse, for example, was one of the biggest issues. It would mean you could produce images of a single domain, but you couldn't produce everything in the world with a GAN. So you could get a really good model for making faces, or a really good model for making bedroom pictures, or a really good model of tigers. But it was really hard to, say, train a model on all of ImageNet, meaning cover thousands of different categories, right? And so diffusion models were a really exciting approach that Ajay began to work on, because they had the potential to provide that kind of mode coverage:
you could learn diverse representations of the world that generalize beyond just a single domain like faces or animals to everything. And so, you know, that was what kind of resulted in DDPM. And I think since then, I mean, you've had latent diffusion, the Stable Diffusion approach, and then video generation is, I think, the next major evolution. But the learning paradigm has mostly remained similar to this, you know, diffusion setup, or kind of iterative denoising, right? That's the formulation of this diffusion problem. But like,
It's remarkable to see how far that has scaled, like literally 10,000x in pixel scale from the earliest diffusion models to kind of where we are with video generation.
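For the curious, the iterative-denoising objective behind DDPM-style models boils down to a surprisingly small training step. The sketch below is a generic noise-prediction loss in PyTorch, not Genmo's code; `model` and `alphas_cumprod` (the standard noise schedule) are assumed inputs.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alphas_cumprod):
    """One diffusion training step: corrupt clean data x0 with Gaussian noise at a
    random timestep, then train the model to predict that noise (MSE loss)."""
    batch = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (batch,))             # random timesteps
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(batch, *([1] * (x0.dim() - 1)))  # broadcast schedule
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise            # noised sample
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)
```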
I guess on that front, how did you decide, I guess, because I know that part of what you've done, and I'm assuming what the intention was with the models that you've created is to open source them in one way or another. And as you mentioned earlier, release things into a community where people could experiment and try things and fine tune. How did you think about kind of
size of the model and that sort of thing? Was that kind of purely driven by what was needed to produce kind of a certain size of video or certain resolution, a certain kind of performance metric that you were after? How did you make some of those, I guess, trade-off decisions, maybe also given the compute that you had access to?
Yeah, I mean, first of all, pre-training is incredibly GPU intensive. I mean, we have access to more than 1,000 H100-grade GPUs.
And so, I mean, that is incredibly GPU intensive, but I think it's also a question of how you utilize that hardware effectively. And one of the critical challenges with video is really long sequence lengths; training a video generation model is equivalent to kind of training a million-token-length context window for a language model. And so this introduces a huge set of challenges that are kind of orthogonal to the parameter scaling you might typically see with large language models.
What I think is interesting, though, is that certain capabilities only emerge at certain parameter scales. So, like I talked about walking: it's very difficult to get walking to work with, you know, a one or two billion parameter model, or something smaller than that. It just, like,
you won't learn that capability. So you do need a certain amount of scale for it to work. But at the same time, you're not seeing models that are, like, 100 billion or trillion parameter scale, as you see with the frontier-grade language models. So we open sourced Mochi 1; it's a 10 billion parameter scale model. So it's big, a lot bigger than your conventional, older grade of
video generation models, but it still is runnable on a consumer grade GPU. People can access it and they can use it. That was a very intentional choice by us, to kind of right-size it for the community while making sure it wasn't too small to limit its capabilities. Yeah, and I know one of the things that I've noticed over time as I've experimented with different video generation demos or products is
that there's definitely an element to it where you can only generate so much. I'm imagining that, as you mentioned, there's a sequence that is being generated, and in a way similar to a sequence that's generated out of a language model, there are iterations of calling
the model, which is more compute intensive, the more you generate. Is that a true assumption about video models? Or is there, I think people...
are somewhat familiar, at least if they've been around the podcast or have done their own research in terms of how language models generate tokens, right? So I have a prompt, the model generates a token, and then that's added to my prompt. And then I iteratively generate another token. And
So the model is being called the more that I'm generating. Is that same thing true for kind of generating these sequences of videos? What are the kind of concerns around actual compute and usage of these models in a realistic environment? Yeah, I think video generation models share common elements with large language models, but they also differ in some key ways. So first and foremost...
A language model decodes tokens autoregressively, one at a time. So if you want to generate, you know, 1,000 tokens, however many, whatever, let's say 500 words,
you need to do 500 or 1,000 forward passes of the model. In a video generation model, every pixel is kind of generated at once. So each forward pass produces all of the pixels that you see in your video, across space and time. And we do multiple denoising steps. So you start with kind of a pure noise sample, and through maybe 50 or 100 forward passes, all of those pixels eventually become full resolution. So you'll actually see, if you use our product,
We stream those pixels as they're getting denoised in real time to your browser. So you'll see a full video, like not just a frame, but a full video. But it's kind of blurry. And slowly the video gets sharper and sharper and the details begin to resolve. Like you'll see blobs.
that become more and more detailed. And eventually you get fine details like hair or teeth or plant leaves and so on. That appears in the last stage of this. And similarly, the motion might start with coarse grain motion, but eventually becomes much more detailed and realistic as the denoising process proceeds. So it's kind of a different axis in which we do the compute. Just like you do tokens and decoding, in video models you have this denoising step.
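A sketch of how the inference-time compute differs between the two families, with simplified samplers and stand-in model calls (greedy decoding, a toy latent shape); this is illustrative only, not either system's real sampler.

```python
import torch

def sample_llm(model, prompt_tokens, n_new=500):
    """Language model: one forward pass per new token, appended to the context."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = model(torch.tensor(tokens))
        tokens.append(int(logits[-1].argmax()))   # greedy pick of the next token
    return tokens

def sample_video_diffusion(model, shape=(16, 3, 64, 64), steps=50):
    """Video diffusion: every pixel across space and time exists from step one as
    pure noise, and each of the ~50-100 denoising passes sharpens all of them at
    once (coarse blobs first, fine detail like hair or leaves at the end)."""
    x = torch.randn(shape)           # a full, blurry "video" immediately
    for t in reversed(range(steps)):
        x = model(x, t)              # one pass refines the whole clip
    return x
```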
But one really important thing to talk about architecturally, at least with Mochi as we open sourced it, is it's kind of a multi-stage model. There's first what we call a variational autoencoder, or VAE, which is essentially video compression. There are just too many pixels in a video for us to learn over natively in the model. It just...
It's just way too expensive. So in Mochi, we train this 100x video compression model through the variational autoencoder setup, which takes the input video and actually projects it down and makes that sequence that we talked about. So you're going from something that's, you know, hundreds of millions of pixels down to something that ends up effectively taking about, you know, 50,000 to 100,000 tokens equivalent in a language model. So we do that compression stage first, and then in that latent space,
That is actually what the diffusion model is learning, right? So that 10 billion parameter model is learning to kind of reconstruct, you know, that 100x down sampled or compressed space.
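To make that compression arithmetic concrete, here is a rough sketch of the two-stage latent setup. The frame counts, resolution, and compression factor are illustrative assumptions, not Mochi's exact configuration, and the `diffusion_model`, `vae_decoder`, and `encode_text` objects are hypothetical stand-ins.

```python
# Illustrative pixel math: a short clip is hundreds of millions of raw values.
frames, height, width, channels = 120, 480, 848, 3   # ~5 s of video (assumed sizes)
raw_values = frames * height * width * channels       # ~146 million values

# Stage 1: a VAE compresses the video by a large factor into a latent sequence.
compression = 128                                      # illustrative ~100x factor
latent_values = raw_values // compression              # ~1.1 million latent values,
                                                       # i.e. tens of thousands of
                                                       # "tokens" once grouped into
                                                       # spatiotemporal patches.

# Stage 2: the diffusion model is trained entirely in that latent space, and the
# VAE decoder maps denoised latents back out to pixels.
def generate(text_prompt, diffusion_model, vae_decoder, encode_text):
    latents = diffusion_model.sample(encode_text(text_prompt))  # denoise in latent space
    return vae_decoder(latents)                                 # expand back to pixels
```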
Do you envision that there's ever a point, you know, with compute growing so fast, where compression will no longer be needed and you'll be able to do very large and detailed videos without the need for that, just because compute is so available in the future? Or do you think that's unlikely and we're going to keep chasing it with compression and doing other things? So the first diffusion models actually were what we would call pixel space models. They were done at the full resolution of the sequence.
And so this is actually still doable for images. I think what's interesting is that this latent diffusion setup has outperformed the pixel space approach, even in images, where it is still computationally feasible to do that. You know, I think it is interesting, though, because there has been a lot of hybridization of architectures between, like,
autoregressive setups and diffusion setups. That was one trend, for example, when our team went to NeurIPS this year, in 2024. And, you know, several people have begun to explore combining different elements of autoregressive models and diffusion models, both in pixel space and latent space. I think it's a really diverse space that is just extremely underexplored. For example, when we open sourced Mochi, we actually developed a new architecture we call AsymmDiT, or the asymmetric diffusion transformer. It was just an evolution on the kind of area that people were in. I mean,
people leverage this diffusion transformer setup for the architecture. It's part of why it's so expensive, but we began to take some early steps to do architectural exploration. So I hope we can eventually, long story short, find some global optimum between compression and the actual generation part. Today, we kind of factorize it for computational reasons, and I think it'll just get more and more blurry as we kind of combine these different elements.
Well, Paras, you've mentioned Mochi. This is the latest wave of what you have created at Genmo. Could you talk a little bit about Mochi in relation to previous models? And also, I mean, you mentioned that Mochi is achieving kind of top performance on certain benchmarks. Could you kind of
help us understand where it fits into the ecosystem of video models out there and also kind of what it represents to you all in kind of progression from your last generation to this generation. First and foremost, before I dig into this, my belief is video generation is super early. I think we're 1% of the way there. So I think people look at this stuff and it's really surprising, but there's a huge gap between reality and where the state of video generation is, right? And I think...
That mindset is really important, because when we looked at the field of video generation as of, you know, mid 2023, when we kind of had our last generation model, sorry, mid 2024, when we had our last generation model, Replay, what we saw was that they would synthesize high resolution videos, but they just wouldn't move. They weren't that interesting, right? So you would see a video of a person and they would just stand there, and maybe there was camera motion, so the camera would kind of orbit the person or pan a little bit,
but the subject wouldn't be moving. And to us, that would indicate some kind of learning failure with the video generation setup as of kind of the last generation of these models. And so that was,
first and foremost, the most important thing we wanted to solve for video generation was motion, and subject motion specifically. And so Mochi 1 is kind of neck and neck with the latest frontier-grade, kind of closed source models, your Google Veo or, you know, Sora, in that way, specifically by motion benchmarks, actually. And I think this is really important and subtle, but that was kind of the key component we wanted to solve with video generation. The second one
that was really important for us to solve in Mochi was prompt adherence. It was really common, I think many people have this experience with video generation, where you say, I want X, right? Like, a classic test for this is, you know, I want a dog wearing a hat, holding a teacup. And it'll make that, but the order of those things and the composition of those elements is wrong, right? So the dog might be sitting next to the teacup, but not holding it.
We talked to a user in a user study about video generation. They described the state of video generation as kind of like pushing on a rope. You kind of want the rope to go one way, but you just can't get it to go, right? It's just really hard. And so with Mochi, we also invested heavily in prompt adherence, in addition to motion. And so prompt following is, I think, a really important element that will be critical to make these systems practically usable.
I'd love to talk about, like, you know... we open sourced this also because there was no good open model, let alone an accessible closed one. There were a few of these closed models, you know, Runway and Sora had been kind of previewed in their blogs for several months, but nobody had actually
trained and released an open model. And so that was holding this field back. And because we're so early, our viewpoint is releasing this model and creating this bedrock foundation for people to actually do the research on aspects like motion and prompt adherence was going to be critical for the field. And it benefits us as a company because people are building on top of our models, right?
So what kinds of things are you seeing people want to do with the model? And what are the different categories of use cases, you know, that people might be addressing? What are the ones that are high value? Yeah, I think everyone's first experience is just play. So, like, people just want to open it up and they want to see something wild, right? Like,
a baby riding a dog, right? And so I think that was always a funny one that was like, you know, you might have these things that just don't happen in the real world that you want to see the model do. And so people start with that and explore the surface area. But when we look at actual real use cases, I think what's really interesting is this video generation technology is beginning to work its way into like enterprise content creation workflows. And I think of this as like
creation, and then there's editing, right? And these are two kind of halves of practical application of video generation. So, creation: I mean, first and foremost, many people are starting to explore using video generation as a substitute for stock video. Like, if you can't find exactly what you want in a stock catalog, you can just go generate it, and it's going to come with all the right licenses. It's exclusive to you, right? No one else gets that video, because
you made it, right? And it's n equals one. And so that's actually really powerful for a lot of content creation workflows. Video is also just really hard and expensive to iterate with, right? You shoot it once, and if it's not perfect, you know, you might want to re-prompt and re-edit it. And so
I think that's an exciting application, for example, in the brainstorming and pre-visualization and storyboarding process of content production. That goes way faster if you have a tool like a video generator in the loop. And then editing. Actually, that's exactly where I was about to go, was on the editing, is kind of how do you envision that fitting in as that becomes a problem that people are attacking aggressively? What does it mean to edit video in the context of video generation? If you're generating the video from scratch,
What does it mean to edit a video like that? And how might that be done? Is anyone really thinking about that right now? Is that on the table? So we released Mochi 1 as open source. We didn't know what people would use it for. And one really exciting thing, within two weeks of open sourcing it, one of the community members built this workflow called Mochi Edit. It's a full video editing pipeline built on top of our open source model. And with it, you can add, remove, or change an object. So it's a crazy video. You can search up Mochi Edit on GitHub. And what...
was the demo that I think he showed me that was really cool: they took a video of a person talking and they said, give him a hat. And it actually put a fully realistic, exactly 3D-tracked hat on him. It just looked totally realistic. And I think that full process with a conventional video editing pipeline, between tracking and rendering and compositing everything, would have taken, you know, two, three weeks, honestly.
Very cool. Do you see, I mean, I know there's certain, if I remember right, Coke did like a commercial, Coca-Cola did a commercial for their winter advertisement with Gen AI. Do you think, well, this is maybe a wider question, but how do you think people kind of understand
in 2025, you know, how are we going to experience video generation kind of at the general public level? In what ways will it start to filter into people's everyday lives? Because Chris and I, well, everybody remembers, like, we were talking about lots of language models before ChatGPT on the podcast, but
You know, we weren't talking about them at Thanksgiving dinner, right? No, not at all. And so you do have those moments, like the Coca-Cola video, where people were talking about this more widely, but that's probably not like the ChatGPT moment of video generation. Any thoughts on kind of how the general public will start to intersect with this technology in the coming year?
I mean, I think the early adopters are certainly here for video generation. I mean, our platform has well past 2 million users, and that's just outside of open source; open source usage is probably some multiple of that. But I think that still represents, like, a drop in the bucket compared to conventional media. And I think one of the biggest limiters, like I shared, was your ability to control it. And, like,
You know, once you can actually get something out of it, the wow moment is almost instant. Like you'll ask it for something that just couldn't exist in the real world and you see it in front of your eyes. I mean, that...
That is a jaw-dropping experience for most people, right? But I think the hard part there is the tech has required too much expertise with prompting and understanding of how to actually get good results out of the model to make it usable. I think 2025 is the year that we will see instruction following and prompt adherence solved, so that this stuff actually follows what you want to say. And I think of this as like going from GPT-3, which was just like an unaligned language model in some sense, which
kind of would ramble on about whatever topic was at hand, but not in a particularly useful way, towards chat-based instruction tuning, right? That was the breakthrough moment for language models. I think very similarly for video models, it kind of comes down to the moment where somebody can pick it up and use it without being an AI expert. You know, today many people are already
talented in Midjourney or other kind of conventional forms of image generation, and that kind of translates into video. And I think this is really one of the critical moments that has to be solved for this to have, like, breakout exposure. But I mean, I just imagine a world, like I think in five years, where we hit a point where, you know, there might be a poor kid in Mumbai or Kenya or something who just has a phone and a good idea, pushes a button on their phone, and it wins an Academy Award, right? Like, that's going to change the world. I don't think we're that far from that, to be honest.
Yeah, I think that there's... I love how you've framed that in that kind of expanded agency sort of way. So instead of like AI models generally, I think...
The way people think about them as a bummer is like, oh, these things are going to automate everything. Every video I'm going to see, I'm never going to see cool videos again because they're all going to be AI generated without creativity. But I think the fact that what we're seeing with language models, what we've seen even with image generation is there's so much creativity that the human can bring into that. But it also...
democratizes a lot of potential, you know, production and that sort of thing to those that have amazing ideas, but maybe not access to a Hollywood film crew, right? So I love that there's still that element in kind of your vision, of human agency being expanded upon, and even, you know, people getting to tell stories that maybe they wouldn't otherwise. So I love that.
I've got a question for you. It's a little bit of a random one, but interesting. People ask me this a lot.
What does creativity mean as we go forward? As we're having these tools and human creativity is coming to bear, you're having these tools that some people consider creative in a sense, and some people don't, and all. But what does that look like? What is that person and a tool together going and doing, like the thing that the kid in Kenya is doing? How do you think about that? How do you contextualize that? I think human ingenuity and creativity is the root
of all interesting forms of content. Like, I know people are scared, hey, AI is going to automate all this stuff. But if you look at what, like, LLMs will just ramble on about, it's just the aggregate average of all their training inputs. And that's not particularly interesting or novel to anybody, right? Like,
I think the greatest films come from someone with a new idea, right? And a new lens on the world, a new interpretation of what it means to be human and live in the world that we do. And from that, you have great media, right? And I think that will forever be true. The human's role here is always going to be pushing the frontier. I mean, language models and video models learn by just
averaging and aggregating, compressing all the information around them. But in some sense, they won't ever be able to really push the frontier alone. A human plus a video model, though, is an entirely different beast, right? Now you have something I like to term creative amplification, right? The human alone is producing the creativity, but with that video model, it now amplifies in a way that just wouldn't have ever been possible with the older generation of media, in the older world, right? That iteration cycle might've taken years, an entire lifetime, to kind of go through and discover an idea space. And now somebody can do that within a matter of months or weeks, just iterating on new ideas and testing them out and seeing them visualized.
I guess that kind of leads us naturally into... That was a great kind of wider vision, but what is your vision for Genmo specifically? What keeps you up at night? What are you most excited about as you move into a new year with a lot of new possibilities? So I think...
Our vision has been very consistent over a long period of time, which is to build frontier models for video generation. But the goal is to unlock the right brain of artificial general intelligence. It's completely neglected. I mean, OpenAI and kind of these frontier models have taken over the left brain. And we said, hey, this other side is just as capable and just as important as the left brain here. And so, you know, I term that as, like: imagine AI that can see anything, possible or impossible, right? And I think...
The first step here is creativity, is media, people creating, like I described, this vision of empowering creators. But longer term, I actually think this is really interesting in that if we can explore this world of synthetic realities, it'll unlock huge progress in embodied AI, for example. And that's when this tech starts to become really powerful. I started in self-driving in my career, and the big problem is there's too many edge cases to simulate.
Right, and then even if you get millions of miles on the road, there are still new things that will happen. But I think for the first time, a video model will enable training robust agents that can operate in the real world and actually understand all the possible realities, because they can just simulate them, right? That's an entirely new paradigm that I think we're starting to see explored even in reasoning, with the o1 style of models, as well. But
To me, that's one of the most exciting long-term, ten-year potentials that we'll see for video generation. And we at Genmo are kind of trying to work towards that future.
Well, thank you for how you're digging in in this space. It is truly inspirational, and we really appreciate you taking time to chat with us as you head into those innovations. Exciting stuff. Please come back when you release whatever's next; you're welcome back to chat about it. Thank you so much, Paras. It's great to chat. Thank you, Daniel. Thank you, Chris.
All right. That is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com slash news. There you'll find 29 reasons, yes, 29 reasons why you should subscribe.
I'll tell you reason number 17, you might actually start looking forward to Mondays. Sounds like somebody's got a case of the Mondays. 28 more reasons are waiting for you at changelog.com slash news. Thanks again to our partners at Fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.