Sora is capable of understanding object permanence and maintaining the presence of objects in the scene even when they are occluded. This is achieved through large-scale diffusion training, which allows the model to learn the underlying geometry and interactions in complex scenes without any hard-coded inductive biases.
Genie differs from other text-to-video models by allowing frame-by-frame interaction. While other models generate entire video clips based on a text prompt, Genie enables users to take sequential actions within the generated environment, making it a foundational world model.
The main challenge in evaluating video generation models is the difficulty in quickly glancing at and assessing the quality of moving content. Unlike images, which can be easily evaluated in a grid, videos require more detailed and time-consuming individual assessments.
DeCAF, or Deep Convolutional Activation Features, was a foundational model in computer vision that democratized access to deep learning techniques. It demonstrated the effectiveness of pre-trained models for a wide range of tasks and was one of the first to show how deep learned representations generalize beyond their training data.
The VQ-BeT model uses a Vector-Quantized Variational Autoencoder (VQ-VAE) to quantize continuous action data into a discrete representation, which is then used as tokens in a large language model (LLM) style framework. This allows the model to predict and generate behaviors based on current observations and high-level task descriptions.
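To make the quantization idea concrete, here is a minimal sketch of the general recipe: continuous action vectors get snapped to the nearest entry of a learned codebook, and the resulting integer indices are the discrete tokens a transformer can then predict. Class names, dimensions, and the single-codebook setup are illustrative assumptions, not VQ-BeT's actual implementation.

```python
# Hedged sketch of action tokenization via vector quantization; names and sizes
# are illustrative, not taken from the VQ-BeT codebase.
import torch
import torch.nn as nn

class ActionQuantizer(nn.Module):
    def __init__(self, action_dim=7, codebook_size=256):
        super().__init__()
        # Learned codebook: each row is a prototype continuous action.
        self.codebook = nn.Embedding(codebook_size, action_dim)

    def encode(self, actions):
        # actions: (B, action_dim) continuous actions -> (B,) integer action tokens
        dists = torch.cdist(actions, self.codebook.weight)
        return dists.argmin(dim=-1)

    def decode(self, tokens):
        # tokens: (B,) integer action tokens -> (B, action_dim) continuous actions
        return self.codebook(tokens)

# During policy training, these integer action tokens are predicted autoregressively
# from observation (and task) embeddings, much like word tokens in a language model.
```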
The core idea behind YAY Robot is to use high-level language feedback to improve a robot's hierarchical policy. By providing verbal corrections, the high-level policy can be fine-tuned to correct mistakes and learn new strategies, significantly improving the robot's performance on long-horizon tasks without the need for extensive labeled data.
The position paper argues that the reinforcement learning (RL) community should prioritize research on automating the heuristic process of environment shaping. This includes developing better RL algorithms that don't require manual shaping and creating benchmarks on unshaped environments to facilitate this research.
VideoPoet uses a large language model (LLM) architecture to generate videos, while Sora is based on diffusion models. VideoPoet is more modular, supporting tasks like text-to-video, image-to-video, and video-to-audio, and operates in a latent space to improve efficiency and flexibility.
Flow matching in the VQ-BeT model ensures that the predicted actions are consistent with the observed data. By using a quantized representation of actions, the model can learn to predict the most likely future states and actions, making it more robust and data-efficient.
Genie consists of a video tokenizer, a latent action model, and a dynamics model. The video tokenizer converts video frames into discrete tokens, the latent action model predicts changes between frames, and the dynamics model generates future frames based on these tokens and actions, enabling frame-by-frame controllability.
Welcome to the Latent Space coverage of ICML 2024. This is Charlie, your AI co-host. We know it's been a few months since ICML actually happened, but now that all the talks are available online and we are in final preparations for NeurIPS 2024, we figured this was a good time to release our conference recap to get you in the mood.
As a side note, regular tickets are now sold out for Latent Space Live at NeurIPS, where we have announced our dream speakers to recap the best of 2024 across the top voted domains in Vision, Open Models, Post Transformers, Synthetic Data, Small Models, Agents, GPU Scaling, and a special 2024 in AI keynote from our friend and fellow podcaster, Sarah Guo of Conviction Capital.
Today, we are announcing our very last speaker and newest track: friend of the pod Nathan Lambert, who will be recapping 2024 in reasoning models like OpenAI's o1. See you in Vancouver. Coming back to ICML, this is a very special episode in more ways than one, because it is the very first episode not hosted by swyx or Alessio. We are continuing to experiment with guest hosts, adding different opinions and voices to the show.
And in this case, to cover conferences we personally weren't physically able to attend. So we're very grateful to our friend Brittany Walker of CRV for stepping in as your guest co-host for ICML 2024. Our goal with these conference recaps is to give you an audio experience of what it's like to be there and to provide a filtered recommendation of papers and backstories of authors that will be useful for the AI engineer today and tomorrow.
Brittany worked enormously hard to put together the poster chats you will hear, and we're very grateful. Given that OpenAI has launched Sora Turbo today, we have bumped up our planned second episode to release first, since generative video happened to be a huge focus at ICML. Let's not bury the lead and go straight into the Sora talk from Bill Peebles, first author of the Diffusion Transformers paper and research scientist leading Sora model development.
Since we're talking about video models, you may wish to tap into the show notes for direct links to the public talks. However, we think there is still value in editing the audio for eyes-free browsing. We believe this is the most recent public academic discussion of Sora before the Sora Turbo public release today. So we hope this episode is valuable background for anyone getting up to speed on video diffusion. Watch out and take care.
I'm Bill and thanks a lot to Joanna for organizing this conference. Really excited to be giving a talk here. So I'm going to be talking about Sora today. So this is Video Generation Models as World Simulators. This was joint work with my good friend Tim Brooks and also some other wonderful colleagues at OpenAI.
So let's dive right in. So Sora is OpenAI's first video generation model. And in advance, I'm sorry for any kind of like FPS delay with screen sharing videos. It's always like the hardest part about working with videos is like showing results to other people over the internet. But this is a sample from Sora. And the text prompt is a stylish woman walks down a Tokyo street filled with warm glowing neon. You can see the rest of it. Sora is capable of generating 1080p video up to a minute long.
And what's remarkable about Sora is kind of all of the simple things that we take for granted about the visual world, it really begins to pick up on when you train on video data at scale. So if you see that blue sign in the background, even when there's a shot change and it's occluded, it's maintained. And we see this very consistently for a large number of samples from Sora. So it really has a good understanding
not only of, for example, how light interacts within scenes in complicated ways, but also of object permanence and lots of other capabilities that have been very difficult for video generation models to grok in the past.
So of course it can do more than just photorealistic style. So this prompt is a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures. So Sora again is capable of generating non-photorealistic styles. It can also do a number of scene transitions. So we didn't stitch these video samples together. This is all one continuous output from Sora. It's capable of figuring out that if you want a scene with like a variety of sea life, maybe there should be a shot of seahorses, turtles, et cetera.
And it's also capable of modeling complex scenes. So this prompt is this beautiful snowy Tokyo city is bustling. And so there's a large number of people in the scene. You can see the camera is flying through. And while it's doing that, it's able to have interactions between people, like this couple is holding hands. There are people selling goods at the stalls. There are sakura petals flying through the air. So Sora has really begun to pick up on the intricacies of how scenes should look and does a great job at rendering them.
One final example here is a movie trailer featuring the adventures of the 30-year-old spaceman. So what's cool about this is Sora kind of zero-shot learns that you should have character consistency throughout a number of scene transitions. So, you know, movie trailers do not normally change the leading actor halfway through. And so the man is the same across these different environments and different scenes. And all of this is just learned automatically by training on video data at scale.
So now I want to go into a few technical details about Sora. A lot of the inspiration for Sora came from language models, and in particular, this notion of a unified representation of text data.
One of the key ingredients to the success of LLMs over the years has been this idea that you could take stories, you can take code, you can take math. But at the end of the day, all of this information is represented with a unified vocabulary being tokens, which makes it very easy to train on data at scale. This imbues language models with very generalist capabilities and makes them polymaths at a number of tasks.
Now, we were really thinking like what the analog of this would be for visual data. And, you know, in particular, you know, there's no shortage of very diverse sources of visual data in the world. You know, there's vertical video, there's square images out there. You have like every kind of data of different durations, of different resolutions, of different aspect ratios.
And the question is, how can you train on all of that in a unified representation so we don't have to throw away any visual data? And so this is really one of the key ingredients for the success of Sora: coming up with this unified notion of kind of a visual representation on which we can train, you know, internet-scale visual data. And so in order to accomplish this,
We use a VAE kind of inspired by latent diffusion models from Robin Rombach. And what we do with this is encode all of this information into one unified latent space. So the idea here is on the far left, you know, we have like a video of a butterfly swimming underwater. You go through this visual encoder and this will compress videos both spatially and temporally into a single sequence of data.
And at the end of the day, we do this, of course, so we can train transformers on this sequence of data. We train diffusion transformers at scale. And the benefit of this is we get a number of just great properties of scaling transformers up specifically for video and image data.
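To ground this recipe, here is a minimal sketch, under stated assumptions, of the pipeline Bill describes: a VAE-style encoder compresses a video spatially and temporally, the latent volume is flattened into a sequence of patch tokens, and a diffusion transformer denoises that sequence. Module names, shapes, and the simple additive conditioning are illustrative, not Sora's actual architecture.

```python
# Illustrative sketch only: compress video into latents, flatten into patch tokens,
# and denoise the token sequence with a transformer. Not Sora's actual code.
import torch
import torch.nn as nn

class VideoVAEEncoder(nn.Module):
    """Compresses (B, 3, T, H, W) video into a smaller latent volume."""
    def __init__(self, latent_dim=16):
        super().__init__()
        # Strided 3D conv compresses jointly in time and space.
        self.net = nn.Conv3d(3, latent_dim, kernel_size=4, stride=(2, 8, 8), padding=1)

    def forward(self, video):
        return self.net(video)  # (B, C, T', H', W')

def to_patch_tokens(latents, patch=2):
    """Flatten the latent volume into a token sequence; since the latents are already
    compressed in time, each token covers a spacetime chunk of the original video."""
    b, c, t, h, w = latents.shape
    x = latents.unfold(3, patch, patch).unfold(4, patch, patch)   # split H and W
    x = x.permute(0, 2, 3, 4, 1, 5, 6).reshape(b, -1, c * patch * patch)
    return x  # (B, num_tokens, token_dim), same format for any duration/aspect ratio

class LatentDiT(nn.Module):
    """Transformer that denoises the patch sequence, conditioned on timestep + text."""
    def __init__(self, token_dim=64, width=512, depth=8, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(token_dim, width)
        layer = nn.TransformerEncoderLayer(width, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(width, token_dim)

    def forward(self, noisy_tokens, t_emb, text_emb):
        # t_emb, text_emb: (B, width) embeddings added to every token (simplified).
        h = self.proj_in(noisy_tokens) + t_emb[:, None] + text_emb[:, None]
        return self.proj_out(self.blocks(h))  # predicted noise per token
```

Because every video, whatever its duration or aspect ratio, ends up as a plain token sequence, the same transformer can train on all of it, which is the "unified representation" point above.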
So, you know, the name of the game here is like, how does visual quality improve as you throw more flops at the problem? And we find that it improves pretty steadily, which is great. So on the far left here, you know, we have a base compute trained Sora model. So this is trained with a small amount of compute and you can see it gets like some details, right? So for example, it kind of has some idea of like, if a camera is moving through a scene, there should be some notion of consistency, but all the textures are wrong and it's not high fidelity.
If you 4x the amount of training compute you pump into that model, it begins to figure out what dogs look like, what humans look like. But the visuals are still not great. And if you really crank up the amount of flops you're pouring into these things to 32x, you begin to see that it gets a lot of these fine-grained details right. The interaction of the owner's hand with the dog, all of the snowy textures on the ground,
And so we're finding that these models scale extremely effectively if you kind of nail the basics right. So in particular, if you can create this setup where you have this unified representation of visual data and crank up diffusion transformers, they can really start to learn to do amazing things.
Another cool property of Sora is how generalist it is at test time. So, you know, when you actually want to sample content, you can do it at any aspect ratio and resolution.
And this is really great from the perspective of kind of like controllable generations, specifically as it relates to different devices. So if I'm watching a movie on my iPhone and then I transition to watching it on my laptop, those are going to use two totally different aspect ratios. And normally you either have to just like pad with black bars or crop it. But with models like Sora, it's now possible to generate content natively for any device.
which is pretty exciting to think about the possibilities of how that can affect content creation in the future. So the sea turtle here is just rendered out with different aspect ratios. Another exciting aspect of this very generalist training recipe is we can kind of move on from the days of just like cropping data for training generative models. So, you know, back when I was like in grad school, I was always like spending time, you know, cropping to like 256 by 256 resolution to train like whatever version of StyleGAN I was working with.
And while that works well, it has certain downsides. So, you know, there are certain biases actually within data. For example, the photographer's bias of centering objects. And so on the left here, we have a baseline Sora model where we don't train with native-size
video and image data. Instead, we actually do this like hard cropping to center. And you can see that the model essentially inherits some weaknesses of this cropping strategy, right? Sometimes like the scuba diver is going to be off center, which isn't actually ideal framing. If you do this native size training, it's actually much more effective at composing scenes. So you inherit some nice benefits of the training data in the model by just, you know, not throwing away pixels and training on everything you have.
So Sora is also an image generation model. So the prompt here is digital art of a young tiger under an apple tree in a matte painting style with gorgeous details. Here's another sample. We find that Sora in particular really excels at photorealistic kinds of content. So there's a lot of details kind of in the woman's face here, which it does a great job at rendering out. This is at 2K by 2K resolution.
And of course, we can interact with Sora in other ways beyond just text. So all the results before were text-to-video or text-to-image samples. But Sora can also accept visual inputs as conditioning. And so here we were seeding it with an image from DALL-E 3 and then having Sora extend this out in time. So Sora is capable of kind of understanding what's going on in an image and then extrapolating from there.
And so we had a lot of fun with this. So these are DALL-E 2 samples on the left here. And so Sora can take video conditioning or image conditioning at any temporal index. So here, we condition the model in the middle of the sequence with the Shiba Inu. And then we extend it both backwards and forwards in time from that position. And you can see it's able to animate the dog's face. Of course, it can also do more fun animated styles here.
We've been using this to make emojis internally. We have this nice Sora Slack emoji now. And another cool thing with Sora is its ability to extend backwards in time. So of course, you know,
Whether you're doing temporal outpainting forward or backwards in time, it's all kind of the same to these models. And so here we have the model end in the same way, which is with this San Francisco logo. But all of the events leading up to it are resampled by the model. So it's very flexible in how you can use this to edit or extend videos.
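As a rough illustration of why forward and backward extension look the same to a diffusion model, here is a hedged sketch of a generic inpainting-style conditioning loop: at every denoising step, the known frame position is overwritten with a re-noised copy of the conditioning latent, and everything else is free to be resampled. Because `cond_index` can sit anywhere on the timeline, the same loop extends a clip forward, backward, or both. The `model` object and its methods are placeholders for a generic video diffusion pipeline, not a claim about how Sora implements this.

```python
# Generic sketch: condition a video diffusion sampler on one known frame at an
# arbitrary temporal index. All `model.*` methods are assumed placeholders.
import torch

def sample_with_frame_conditioning(model, cond_latent, cond_index, num_frames):
    # cond_latent: latent of the known frame, shape (C, h, w).
    # cond_index: where that frame sits (0 = first frame, num_frames - 1 = last).
    # model.timesteps is assumed to be a list ordered from high noise to low noise.
    x = torch.randn(1, num_frames, *cond_latent.shape)           # start from pure noise
    for t in model.timesteps:
        x = model.denoise_step(x, t)                             # one reverse-diffusion step
        noised = model.add_noise(cond_latent, torch.randn_like(cond_latent), t)
        x[:, cond_index] = noised                                # pin the known frame
    return model.decode(x)                                       # frames around cond_index
```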
Another cool aspect of Sora is its zero-shot editing capabilities. So there's been a ton of great work from the academic community over the years on finding creative ways to use diffusion models to do, for example, like image editing tasks. So, you know, one really nice work in that area is SDEdit.
And we find that techniques like this, of course, just work right out of the box with Sora because it's a diffusion model at the end of the day. So these are SDEdit results. So the top left is the source video. This particular source video was generated with Sora, but of course it doesn't have to be. It could also be a real video.
And you can use a variety of different text prompts to re-render this scene automatically. So for example, in the top right, we can rewrite the video in a pixel art style. And as you would expect, if that's the edit, if you kind of use the right noise level, you can get it to maintain most of the structure in the scene.
and just only update the style, which is cool. So towards the end of this video, you can see that there's a cave that the car in the top left goes into. And across all of these re-rendered styles, you see that it preserves some notion of a cave or an overhang that the car goes through.
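For readers who have not seen SDEdit, here is a minimal sketch of the idea as it applies to video: encode the source clip, add noise partway along the diffusion schedule, and then denoise with the new text prompt, so the chosen noise level controls how much of the original structure (the car, the cave) survives the restyling. The `model` methods are placeholders for a generic diffusion pipeline, not Sora's API.

```python
# Hedged SDEdit-style editing sketch; all `model.*` methods are assumed placeholders.
import torch

def sdedit_video(source_video, edit_prompt, model, strength=0.6):
    # model.timesteps is assumed to be a list ordered from low noise to high noise.
    latents = model.encode(source_video)                   # clean latents of the source clip
    start = int((len(model.timesteps) - 1) * strength)     # how far up the schedule to noise
    t0 = model.timesteps[start]
    x = model.add_noise(latents, torch.randn_like(latents), t0)   # partial forward diffusion

    for t in reversed(model.timesteps[:start + 1]):        # denoise only the noised portion
        x = model.denoise_step(x, t, text=edit_prompt)     # guided by the *new* prompt
    return model.decode(x)

# Lower strength keeps more of the source's layout and motion; higher strength lets
# the new prompt override more of the appearance, as in the pixel-art example above.
```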
Another thing that's cool is Sora is kind of smart about figuring out whether or not certain correlations make sense. So for example, in the bottom right, you can say change the video to a medieval theme. Sora knows there weren't cars in medieval times, so instead you get a horse-drawn carriage. So it's kind of fun to see where Sora takes liberties in re-rendering your video.
Another cool capability that Sora can do is blend between videos. So the far left and far right videos here define the endpoints of this interpolation. And the middle video is Sora's imagining of how you connect the dots. And so you can see you get these kind of fantastic creatures in this case where you can never quite see where it goes from being a chameleon to a bird. It happens very seamlessly.
And you can use this for all kinds of scenes. They don't even have to be particularly related. So on the far left here, we have a drone flying through the Colosseum. And the far right is the butterfly flying underwater. And you can see that Sora is able to come up with a pretty reasonable
interpolation between these two videos. So you gradually see the Colosseum decay and move underwater. And at some points, the drone morphs into the butterfly very suddenly because it kind of like has put these two things into correspondence automatically and infers that like this is like a reasonable thing that should focus on blending between.
And here's an example of blending two scenes with totally different styles. So the far left is like a photorealistic aerial drone shot. And then the far right video is kind of nice, like gingerbread village. And it comes up with a really creative way to make this work. So rather than kind of morph the whole style of the scene in one shot, it decides that maybe this like gingerbread village is kind of hidden off to the side of this photorealistic town. And it zooms in.
So one other technique that we use for Sora is this notion of video recaptioning. And this is a technique that was actually pioneered by DALL-E 3, by some other folks at OpenAI. And the high-level idea is that during training, diffusion models, and really all generative models, benefit from having a much cleaner source of conditioning than we've historically given them in the past.
There's like very crude text captions out there, like alt text, for example, which doesn't actually contain a lot of information about the scene. They're like very coarse keywords, for example. Sometimes the content is like pretty unrelated actually to what's in your image or video, et cetera. And one of the key breakthroughs in DALL-E 3 was generating synthetic captions that are much more detailed and contain much more mutual information with the content that you actually want to generate.
And so what we saw with DALL-E 3, and this is an example figure from DALL-E 3, is that this really improved the controllability of the model and enabled you to create much more intricate scenes with a lot more ease than in the past. And so we took inspiration from this with Sora and also applied this technique to video. And one of the features of this is that at test time, when you're actually interacting with the model, rather than just
directly kind of uploading a prompt to Sora, we'll actually use GPT under the hood to essentially upsample a user's base prompt into a much more detailed video description.
And so this figure here is the system prompt that we used for DALL-E 3 in order to do this upsampling. It's actually pretty involved to get this to work well. And so there's a lot of prompt engineering, even at OpenAI, to get these systems to be reliable. But under the hood, this is what we're doing to achieve some of this finer-grained control of Sora.
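Conceptually, the test-time upsampling step looks something like the sketch below: a short user prompt plus an engineered system prompt is sent to a language model, and the expanded caption is what the video model actually receives. The system prompt text and the `call_llm` helper are hypothetical stand-ins, not OpenAI's actual prompt or pipeline.

```python
# Hedged sketch of prompt upsampling; the system prompt and `call_llm` are hypothetical.

UPSAMPLE_SYSTEM_PROMPT = (
    "You expand short video ideas into a single detailed caption. Describe the "
    "subject, setting, lighting, camera motion, and style, and keep every detail "
    "consistent with the user's request."
)

def upsample_prompt(user_prompt: str, call_llm) -> str:
    """Expand a terse user prompt into a caption resembling detailed training captions."""
    messages = [
        {"role": "system", "content": UPSAMPLE_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    return call_llm(messages)

# e.g. upsample_prompt("a dog in the snow", call_llm) might return a paragraph describing
# the breed, the snowy street, the camera angle, and the style; that expanded caption,
# not the original prompt, is what gets passed to the video model.
```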
So the last topic I want to talk about is this notion of like emerging simulation capabilities. And this is really the aspect that we are most excited about with Sora looking forward. You know, we often get asked the question, you know, at OpenAI, you know, how does video generation really relate to the core mission of AGI? And on the Sora team, we're actually really passionate about
about this being a model for world simulation moving forward. And so what exactly does that mean? Like, how do we actually use these models long term to do interesting tasks and to really extract intelligence out of the world? You know, what we believe is when we really scale up video generation models, they're going to get so good at simulating such a variety of complex scenes with, you know, different agents in them.
that it's going to need to ultimately learn an underlying model of how people interact, of how people do tasks, of how people think, if it's truly generating high fidelity content. At some point, if the conversation I'm having at a dinner table within Sora is not realistic, that means it's failed to do its job of accurately learning the distribution of human behavior. And so
As we approach the limits of achieving the irreducible loss there, we think pretty amazing things are going to emerge from these models and it's going to play a really key role in developing more intelligent systems in the future. So Sora is obviously not there today, but we already see some cool phenomena by training on video data at scale that we just want to highlight. And we think this list is only going to grow in the future as Sora continues to scale up. So the first one I'll talk about is 3D consistency. And so this is pretty clear from a lot of the samples
But even when you have these very dynamic scenes with a lot of people moving in them and the camera being non-stationary, you can see that a large number of elements in the scene really do move with what appears to be accurate geometry. And so this is achieved without any kind of hard-coded inductive biases for 3D within the model. It's all learned jointly end-to-end as part of large-scale diffusion training.
It was really important to us when we were doing this project that whatever solution we came to for video generation was scalable and could just absorb a lot of flops.
And one way to do that right is to really strip out these inductive biases that in the past have sometimes been useful for achieving certain kinds of behaviors at low scale. But it's not clear that when you really crank up the training compute, if they'll either help or hinder you. And so we find that it's totally fine to like not have these kinds of inductive biases as long as you're training at scale. Here's another sample. This one's kind of fun. So it's an aerial view of Yosemite showing both hikers as well as a gorgeous waterfall.
The hikers do some very extreme hiking right here. I would not recommend trying this at home. And Ben Mildenhall, who...
used to be at Google. He took some Sora samples when we released them, and then he trained a nerf on them. And in his words, it nerfs. So this is another kind of nice sanity check that the underlying geometry that Sora is learning for some scenes, not certainly all yet, is actually pretty accurate. And so it's cool to see that this, again, just emerges automatically at scale without inductive bias.
So the next capability I want to talk about is this idea of long-range coherence. So this is one of my favorite samples. This is the Bling Zoo shop in New York City. It's both a jewelry store and zoo, saber-toothed tigers with diamond and gold adornments, turtles with glistening emerald shells, et cetera. And so again, this is all one continuous shot from Sora. We didn't stitch it together. And what's cool about this is even when you have these sort of scene transitions,
Sora kind of automatically, you know, figures out like the vibe of what you're going for. So in this case, you get this coherence of like, you know, the environment you're in, you see this like outdoor component at like the start of the scene and like it gradually like moves indoors, but it all creates this kind of coherent narrative, which is awesome that you don't have to like, you know, manually stitch together everything. It can kind of just like figure it out in context.
Of course, you can also do long-range coherence and like the notion of character consistency as we alluded to earlier. So this is the story of a robot's life in a cyberpunk setting. And you can see you get the same robot character across these different shots. So it really does understand this idea that, you know, if I have a long video with multiple cuts, I'm probably going to have some amount of like characters that show up multiple times. It's not going to be an entirely new cast, you know, like every two seconds. And you just figure this out automatically.
Object permanence is another big one. So in the past, video generation models have really struggled to keep objects in the scene under occlusions. And so this is an example sample where even though this Dalmatian is getting occluded multiple times in the scene, Sora understands that that dog should still be there even when the people pass. And this very simple capability
that we take for granted, used to be a very challenging problem for video generation systems. But again, you don't necessarily need any kind of inductive biases specific to objects for this to emerge. You really just need the right fundamental training recipes to scale these models up. And so one other capability we're excited about is this idea of interacting with the world and updating state.
So kind of by definition, like if you want a useful video generation system, at some point, it needs to be able to like interact with objects in your scene and have those like interactions be meaningful. And by meaningful, I mean like they need to persist over time. So, you know, in the simplest case, if I'm like drawing or painting in this case, some Sakura petals, I would expect that, you know, as I'm leaving brushstrokes, like they actually interact with the canvas and stick around. So yeah.
We find that sometimes Sora can do this. This is probably one of the flakiest capabilities of the model currently. But in this case, it does work. This is another example of an older man eating a burger. And at the end here, there's bite marks in the burger.
So I think this is one of the larger challenges for video generation systems moving forward, is this idea that if I do something in the distant past, can the model really remember that and recall that and have it affect things in the future? So these are very simple examples of that, but there's still a long way to go in creating, I think, really compelling examples where a past conversation or something influences what the system outputs multiple minutes in the future, for example.
The last topic I want to cover here is this notion of digital world simulation. So when people talk about video generation models, of course, there's a lot of excitement about this idea that we can learn the real world's physics. And I think that's extremely valuable and a very important direction.
But what's cool is these systems are very general. So there's no need to constrain ourselves to only learning about our world's physics. There's all kinds of other crazy worlds out there, like laptop operating systems or video game consoles that Sora-like models could also learn from. And you can have one model, which eventually is extremely generalist and is able to render out scenes in all of these different environments.
And so one step towards this is Minecraft. So the prompt here is Minecraft with the most gorgeous high-res 8K texture pack ever. And this is just a straight output from Sora. It's not even particularly cherry-picked, actually. It's pretty easy to get good samples here. And you can see that Sora is able to implicitly control the player here with an intelligible, if slightly boring, policy while rendering out the full environment, rendering out NPCs like these pigs.
And we think this is a really cool, extremely crude proof of concept of how Sora can do more than just be used for creative purposes, right? It can really model whole environments and in the future be used to potentially extract information about policies; implicitly, this all lives somewhere in Sora's activations and weights, right?
And it's cool to see that it kind of automatically learns these things again just by training on video data at scale. So here's one more sample with this prompt. So it chose a different texture pack for this one. But again, you see the same kind of things. You know, you got like a chicken and a pig. It's able to control the policies for those in addition to the character. As the character jumps around, it's able to render out this environment in pretty high fidelity.
And so we're pretty excited to see all the kinds of knowledge that you can pack into this one model, not necessarily only real-world physics. Of course, Sora has a lot of issues. So it is kind of far from this ultimate goal of simulating everything. And there are kind of fun failure cases, though. So everything about this scene is sort of messed up. So the woman looks way too happy. The hands in the background are kind of cursed. The candles are blowing in all the wrong directions.
This is another one where a cup kind of spontaneously leaps in the air and cracks in a really unrealistic way. So even pretty basic interactions like glass shattering, Sora does not really understand yet, and there's a long way to go. This is, I think, most people on the team's favorite failure case. So the prompt is like archaeologists discovering a plastic chair
But the plastic chair is a bit sentient and starts like flying and seems somewhat possessed. So it's always fun, you know, when you have these models where on your scaling curves, you know, they're not pushed all the way yet. It's always fun to see, you know, kind of like the correlations they don't yet understand about our world and how they take somewhat creative liberties. And this one's pretty self-explanatory for what's wrong. So, yeah.
Sora is currently in a research phase and we do not have it in a product yet.
We work with red teamers and artists to really get a handle on, you know, what are the potential risks of a model like Sora, should it be deployed one day? And also, how can we make it as useful as possible for, you know, both existing kind of artistic workflows, but also potentially entirely new ones as well. And so one quote from Shy Kids, which is a group that we gave
access to Sora to. As great as Sora is at generating things that appear real, what excites us is its ability to make things that are totally surreal. And so we really love this idea that Sora is not necessarily replacing elements of the artistic workflow, but really enabling kind of entirely new processes that have not been possible before. And so I'll play this Shy Kids video now.
I'm not sure if you guys can hear the audio, but if not, you can find this video online. Just search for Shy Kids Sora. Well, they say everyone has something unique about them. Something that sets them apart. Just in my case, you know, it's quite obvious what that thing is. I am literally filled with hot air. Yeah, living like this has its challenges. Windy days, for one, are particularly troublesome. There was one time my girlfriend insisted I go to the cactus store to get my uncle Jerry a wedding present. Yeah.
What do I love most about my predicament? The perspective it gives me. You know, I get to see the world differently. I float above the mundane and the ordinary. I see things a different way from everyone else. Yet, I feel like it's because of that perspective I'm reminded every day that life is fragile. We're all just a pinprick away from deflation. So I try to live life with a lightness, a buoyancy, a joie de vivre. I got a lot of ideas, keep them instinctual. With any luck,
I'll find a way to share them with everyone else. And so this video was made with a combination of just using direct model output and, of course, also more like traditional video editing workflows on top. So it's been really cool to see how artists have embraced Sora and have begun to incorporate it. There's also been some films that were at Tribeca, actually,
which were made in a similar way with Sora. And I've been really surprised by like the level of creativity with the kind of the current levels of capabilities that Sora has today. It's really cool to see the community like lean in and use these models. So that said, that's pretty much the end of the talk. I have a few extra samples here. But yeah, thanks again to Joanna for scheduling this and happy to field any questions. I don't know if it's possible to
communicate them over Zoom currently or not. But, uh, yeah, if not, that's about it. So thanks a lot. Thank you, Bill, for this great talk. I think, yes, we can definitely... well, if you can hear me, then we can do a Q&A. I can hear you, so I think we're good.
Hello, thank you for the great talk. I wonder how far we are from, let's say, a video producer making a whole movie with zero actors. So maybe if, like, hypothetically, the video producer can upload characters, what they look like, and they can describe the scenes and tell you, oh, this character is now running away or on a bike, etc. With zero actors, can they actually do a full movie?
Yeah, good question. So I think there's like a technical answer and like a cultural answer. So on the technical side, I don't think that, you know, there's necessarily any blocker to really, you know, making like character consistency work over very long time horizons. That seems like just a very achievable problem in general. And so I think in the near term, it will be possible to
for people if they want to do that, to be able to create these kind of like synthetic characters and kind of use them as they desire. Now, I don't know if people will really want to do that in the near term necessarily. We've been chatting with a lot of directors, for example, and a lot of them mentioned how for kind of like very simple scenes, how it's really convenient to use
our current capabilities to, for example, have a large crowd in the background, whereas maybe in the past that would be CGI-driven.
But, you know, for like these really kind of like intricate and like meaningful, like close up shots. And when you're trying to develop more of like an emotional connection with the audience, at least for like the near future, it seems like, you know, human actors definitely have like an edge on models like Sora today. So I suspect in the future there will be some kind of like mixing collaboration between the two. Yeah.
But yeah, it'll be interesting to see kind of like when and where people choose to use totally like, you know, digital characters versus traditional actors. Hi, I have two questions. The first one would be how big of a role synthetic data played in the training process? So I can't answer any questions about training data, unfortunately. So yeah, sorry.
Okay, then the second one would be about how much control does the user actually have over the camera angle and the trajectories? Is it just prompting or can you actually define a full trajectory or what's possible?
Good question. So currently, the only way that you can define notions like camera motion is either through text or through video conditioning, right? And so in the latter case, that would mean like if you seed the generation with a video where the camera is already moving in the way you want and then you want to extend it out from there, the model could kind of infer that via in-context learning to potentially get the right camera motion.
Currently, there's not a more granular way to do camera control. I think this is something that we've definitely heard people want. And so it'll be interesting to explore alternative ways to more explicitly control those kinds of features. But right now, it would go through text primarily. Hey, so we've seen all this nice visual output. Can you share something about audio and consistency, maybe what you observed there? Or if you have any.
Yeah, that's a good question. So for Sora, we were really focused on trying to push the envelope with visual generation quality and we weren't
focused on, for example, jointly generating audio. I think it's a really interesting direction to get extremely high fidelity like joint video audio generation, but it's not something that we have with Sora currently. I think in the future, making these models more controllable and potentially giving users all of the modalities they want is certainly an interesting direction.
Hi, Sora can also generate images. Do you think in the future video generation models will be stronger than current text-to-image models, and we will stop basically training just on images?
Yeah, I think so. And part of the reason for that is, you know, there's a lot of information about the world, which, you know, if you're training on like huge data sets of images, you can probably infer to some extent. But I think, you know, there are still some things that like slip through the cracks that you only get by training on video data. So, you know, for example, the
The fact that like a model can really generate an accurate fly-through of a scene and really understand occlusions. My guess is that actually helps image generation capabilities, like understanding how, you know, some fingers on a hand may be occluded by an object, but that doesn't mean that humans often only have two fingers. It just means that, you know, there's a physical interaction here that you didn't necessarily grok by only training on image data, or you didn't grok it efficiently.
And you get these concepts much more either data- or compute-efficiently from co-training on video. So yeah, I suspect in the future video generation models will generally supersede image generation. - And is it already the case for Sora or not yet? - So currently we haven't productized any of Sora's capabilities, including image generation. So today, if you go to ChatGPT or something, it's using DALL-E 3 under the hood to do text-to-image.
Thank you. I was wondering if you can tell us a little bit about the size of the model, or, for example, what is the maximum length in time that it can generate, or the resolution or number of pixels, something similar? Good question. Unfortunately, can't comment on that. Sorry.
Not even something close to the order of magnitude? So, for example, can we expect a user to have a good model in the future, or will it be something that only big clusters and big companies can afford? Yeah, that's a good question. I mean, I think...
I wouldn't be particularly surprised, you know, if like the evolution of video generation models ends up looking a lot like the evolution of language models. So, you know, there will be a variety of models of different like capability levels and sizes, pretty similar to the ecosystem we have now, you know, there's like open source models, which at least historically have tended to be somewhat weaker compared to like these like larger scale closed source models.
But I'm curious to see how the whole ecosystem evolves. Just a guess. All right. Thank you. Hi. Thanks for the talk. I'm curious about how you are thinking about more sophisticated controls like bullet time and so on. Yeah, good question.
I think one thing we've heard from chatting with directors and artists is they have a very particular language that they use to describe certain kinds of shots and camera motion. And Sora, out of the box, is not so good at speaking that language. So a lot of what we think about in terms of improving
user's interactions with this model is, you know, kind of like thinking about, you know, can we train the model to like use that same language? So it, to some extent, is kind of like a captioning problem. But I think the jury's still out there on, you know, the best way to like make these models controllable. And it's like, is that only through text or are there other kinds of inputs? I think it's a very interesting space currently and we're still just kind of at the start of exploring it.
Thank you. One more short question is, there was interaction between character and hamburger kind of thing. Is there any way to make more interaction from the user like hitting the hamburger so that hamburger will be squished? So that, yeah, I think you get it. Yeah, yeah. That'd be cool. I don't know if Sora today can...
do that kind of more complicated interaction. I think there's no fundamental reason why that shouldn't be possible and why further scaling up these models shouldn't be able to achieve that kind of capability. Even this kind of biting a hamburger thing, this sort of phenomenon was something that took a while within the
research process of Sora to emerge and seemed to require at least a decent bit of compute for it to be a non-trivial interaction. So I'm curious what level of scale you might need to fully smash a hamburger and have it be physically accurate. I think we'll definitely get there eventually. It's just you never know where on the scaling curve you have to be for these capabilities to start popping up. Thank you. Hey, great presentation.
I was wondering, you've shown this nice Minecraft example where they actually had some kind of agentic behavior a little bit that was shown. Do you think Sora could actually help enabling, as a world simulator, for instance, or as a foundation model, agents that interact with the real world, like make it better than what's there right now, as like for robotics, for instance? Yeah, definitely. You know, I don't know if like the current model is robust enough to be reliably useful to improve like real world
policies, but I think it's inevitable that one day these models will power these kinds of systems. There's just so much information about the world that you learn by training on large-scale video data that it seems inevitable that that knowledge should transfer to the real world at some point. Hi, I have a question about inductive bias. Have you ever tried to train the model with some specific inductive bias such as physics or any rules in the video?
No, we haven't. So from the onset of the Sora project, we were really focused on kind of training just like pure visual generative models of data with as few inductive biases as possible and really just ensuring that the foundation was solid for scaling. That was...
kind of like the core thesis of the project. And so we haven't explored incorporating inductive biases. I suspect, you know, for like certain kinds of like narrower use cases, it's possible that you can get some kind of a win from doing that. And, you know, potentially if your model needs to be really small, but you don't need it to be extremely generalist, then potentially that could be a win. But we're really just trying to like scale up like the largest, most generalist model possible. And so to that end,
We generally assume that they'll be harmful at some point, which is why we haven't explored them too much. Thank you, Bill. Let's thank the speaker again. Thank you so much for the fantastic talk. In their original blog post, OpenAI describes Sora as a world simulator, and we have been tracking the start of the summer of simulative AI. Google DeepMind is not resting, of course, having announced their Veo model at Google I/O this year with Donald Glover's endorsement.
However, the focus at ICML is on GENIE, short for Generative Interactive Environments, which is an 11 billion parameter foundation world model trained on unlabeled internet videos to generate action-controllable virtual worlds described through text, synthetic images, photographs and even sketches.
It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis, despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature.
More recently, DeepMind announced SIMA, their Scalable Instructable Multiworld Agent, and Genie 2, which extends Genie 1 from generating 2D worlds to generating 3D worlds. Genie 2 is a world model, meaning it can simulate virtual worlds, including the consequences of taking any action, for example, jump, swim, etc.
It was trained on a large-scale video dataset and, like other generative models, demonstrates various emergent capabilities at scale, such as object interactions, water effects, directional lighting, reflections, complex character animation, physics, and the ability to model and thus predict the behavior of other agents.
In particular, Genie 2 has long horizon memory, meaning it is capable of remembering parts of the world that are no longer in view and then rendering them accurately when they become observable again, ensuring it generates new plausible content on the fly and maintains a consistent world for up to a minute.
Finally, Genie's learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future, which we will explore in the last part of this pod. But first, here is Google DeepMind's oral presentation of Genie. Hello, everyone. Good morning. Thank you for coming.
I'm Jack, and alongside Ashley, I'm super excited to be presenting our paper, Generative Interactive Environments, otherwise known as GENIE. GENIE was an amazing collaborative effort from this wonderful group of people at Google DeepMind. Our long-term goal is to train embodied agents that can safely perform complex tasks with long-horizon consequences in the real world. It's safe to say there's been amazing progress in our field in the past few years, but this still feels pretty far away. So what's missing?
Fortunately, there's another pretty cool paper at ICML which has a nice way of thinking about this. In particular, they factorize agent capabilities by breadth and performance. And if you see in the bottom right cell, that's kind of what we want: general superhuman intelligence. Now, the good news is we've made some good progress on this grid, so we already have what they call emerging AGI, thanks to progress in foundation models.
We also have superhuman agents in narrow domains. In the case of AlphaGo and AlphaZero, these agents have already been used to augment human intelligence. Note, however, this is made possible by access to the simulator for the game of Go, which we don't have for general intelligence. So the main claim that we make in this work is that to get to the bottom right corner, the key missing ingredient is a more general environment. So the main motivation for us is how can we possibly get to this more general environment?
In parallel, it's pretty clear that video generation is having a bit of a moment. We've got a pretty packed-out room for this set of orals. And this area of research is now front and center of progress in AI. What's pretty incredible though is that by making use of large video datasets, these models are increasingly understanding the physical world in ways they couldn't before. And so as a result, many people are starting to believe that these video models could actually be accurate world simulators. However,
We believe that while these models may have world knowledge, they're not world models. Indeed, many of the text-to-video models that we currently see are only controllable at a very high level via text captions. Prompt the model and you receive a video clip. It may be consistent and beautiful, but it's not, therefore, a world model because you can't take sequential actions in the environment to learn new behaviors.
The challenge here is that we don't have data with action labels, and so the largest world model settings that we have are limited by action-labeled data sets, meaning we can only model existing environments, therefore not being able to generate new ones. Doesn't seem to make sense to use tons of compute to just recreate existing games that we already have. So this is the problem we're working on solving in this project. We want to make use of the vast amount of unlabeled videos on the Internet, and we want to create an action-controllable video model, otherwise known as a world model.
So to summarize, the goal of Genie is to learn what we call a generative interactive environment purely from videos that's playable by both humans and AI agents. How do we do this? Well, the main idea is to use a latent action space learned in a fully unsupervised manner. Intuitively, these correspond to clustering potential outcomes from a given frame of video. I'm now going to hand over to Ashley to talk about how we do this and show some of our results. Thanks, Jack.
So the Genie model consists of three main components and we're typically training over a sequence of around 16 frames. So the first component is a video tokenizer which takes in patches from that entire sequence and discretizes them into video tokens.
The next is a latent action model, which is going to take in consecutive frames and discretize and compress them into what we call latent actions. And the main purpose of our latent action model is to try to encode what's going to change between our scenes, such that given those latent actions, along with our prior frames, we can use them for predicting our next frames. And this is what's going to be important for controllability.
And the final component of our model is going to be our dynamics model, which takes in our tokenized frames along with our latent actions and predicts next frame tokens.
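As a rough sketch of how those three components fit together, the code below shows a video tokenizer that discretizes frame patches against a codebook, a latent action model that vector-quantizes the change between consecutive frames into a small set of action codes, and a dynamics model that predicts next-frame tokens from frame tokens plus a latent action. All module internals, sizes, and names are illustrative assumptions; the actual paper uses spatiotemporal transformers and a MaskGIT-style dynamics objective.

```python
# Illustrative sketch of the three Genie components; not the paper's implementation.
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    """Discretizes each frame's patches into a grid of integer video tokens."""
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # patchify frames
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))    # (B*T, dim, h, w)
        feats = feats.flatten(2).transpose(1, 2)      # (B*T, h*w, dim)
        tokens = torch.cdist(feats, self.codebook.weight).argmin(-1)
        return tokens.view(b, t, -1)                  # integer tokens per frame

class LatentActionModel(nn.Module):
    """Compresses what changes between consecutive frames into one of a few discrete
    latent actions (a VQ bottleneck, learned without any ground-truth action labels)."""
    def __init__(self, dim=256, num_actions=8):
        super().__init__()
        self.encoder = nn.Linear(2 * dim, dim)
        self.action_codebook = nn.Embedding(num_actions, dim)

    def forward(self, feats_t, feats_tp1):            # pooled per-frame features, (B, dim)
        z = self.encoder(torch.cat([feats_t, feats_tp1], dim=-1))
        return torch.cdist(z, self.action_codebook.weight).argmin(-1)   # (B,) action ids

class DynamicsModel(nn.Module):
    """Predicts next-frame tokens from the current frame's tokens plus a latent action
    (MaskGIT-style in the paper; a plain transformer stands in here)."""
    def __init__(self, codebook_size=1024, num_actions=8, dim=256):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, dim)
        self.act_emb = nn.Embedding(num_actions, dim)
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, 6)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, video_tokens, action_ids):      # (B, N) tokens, (B,) action ids
        h = self.tok_emb(video_tokens) + self.act_emb(action_ids)[:, None, :]
        return self.head(self.backbone(h))            # logits over next-frame tokens
```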
So in order to actually interact with our model during inference time, we can take an initial prompt frame, take some actions, and then generate the next frame, plug that back into the model, and continue this. And importantly, because we're learning discrete latent actions, we can actually just plug in integer values to interact in this way. And we found that it was really important in order to actually evaluate our model, to actually step into it and basically play it ourselves. Because it's one thing to measure the sort of
quantitative performance of our model using approaches like FVD, but it's another thing to actually measure controllability and be able to actually evaluate that. So actually interacting with it ourselves was very important.
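The frame-by-frame interaction Ashley describes can be sketched as a simple rollout loop over the (assumed) components above: tokenize the history, pick an integer latent action, predict the next frame's tokens, decode, and repeat. For simplicity the sketch conditions only on the most recent frame, and `detokenize` is a hypothetical decoder back to pixels.

```python
# Hedged sketch of playing a Genie-style model one frame at a time.
import torch

def play(tokenizer, dynamics, detokenize, prompt_frame, action_ids):
    frames = [prompt_frame]                                    # (1, 3, H, W) prompt image
    for a in action_ids:                                       # e.g. [2, 2, 5, 1, ...]
        tokens = tokenizer(torch.stack(frames, dim=1))         # tokenize the history so far
        logits = dynamics(tokens[:, -1], torch.tensor([a]))    # condition on the chosen action
        next_tokens = logits.argmax(-1)                        # greedy next-frame tokens
        frames.append(detokenize(next_tokens))                 # decode and feed back in
    return frames
```

Replaying one fixed action sequence from several different prompt frames with a loop like this is essentially the consistency check described a little later in the talk.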
So Genie was trained on a dataset that consisted of around 300,000 hours of video game footage consisting mainly of 2D platformer games. But we did find it was important to filter this down to 30,000 hours to get a more high quality dataset.
So before training our main model on this, we ran a few scaling analysis experiments that showed us that it was important to scale both model size and batch size. And once we did this, we found that we came up with our final 11 billion parameter Genie model with a batch size of 512.
So now let's take a look at some of our results. And so I want to point out that all of the results are going to be showing out-of-distribution examples. So some of them might be examples coming from text-generated images.
So this first video here is going to show some of the environments that Genie is capable of creating. And so I think the exciting thing here is it's showing that we can essentially step into our environments. This is showing real human interaction within these generated environments and take actions and change the world that we're experiencing. And it's important to point out that this is one of the main differences between what you see in Genie versus video generation models.
It's that interaction that we're able to actually take actions within our environments. And this is why we can consider Genie to be a foundational world model, because we can take these actions within our generations. So this is showing another example.
So given the same initial prompt frame, what we can do is we can take a different series of latent actions and plug them into our model. And what you see is that we're going to get very different and diverse trajectories. Again, this is showing human interaction with the model. And again, this is because we've learned this latent action space in an unsupervised manner.
The other important thing is going to be consistency. It's one thing to be able to generate a diverse set of trajectories, but if you're having to figure out what your latent actions mean every time you have a new image, it's not really that useful. So we also wanted to measure how consistent our latent actions were. So given four different initial prompt images, we could plug in the same sequence of latent actions. And what you can see is that there are very similar trajectories and behaviors essentially happening across these different environments.
which is telling us that indeed our latent action space, at least in these environments, is consistent. And again, just to point out, we were able to learn these latent actions without using any ground truth action labels or doing object detection or object segmentation or any sort of domain specific information.
So one sort of exciting and fun thing we found within the project was that we could actually plug in sketches, even though we were only training on 2D platformer games. So for example, on the left here, we see a sketch done by Richie from the team. The one in the middle was made by one of Jeff Clune's children. And on the right, I did that one, but don't judge me too harshly.
And we can basically plug these images into our model and again, create these environments. So we can see, for example, we're able to climb this ladder that Richie basically sketched out. And so I think it was in this moment when we really started to see the kind of creativity that Genie could enable.
And so we also plugged in real world images, which is again very out of distribution from what the model was trained on. But for example, that's Jack's dog Doris on the left here. And we can again sort of generate these environments and I guess interact with them, even though we didn't train on anything that looked like this.
Genie also works on real-world data, so we trained a smaller model with 2 billion parameters on a robotics dataset, and we again see that if we take different prompt images but plug in the same sequence of latent actions, we're getting similar behaviors, which is again evidence that the latent actions are consistent. We're also able to simulate deformable objects with this model here.
And finally, while we haven't yet shown that we can train agents within the Genie model, we do show in the paper that we can take the latent actions learned from videos on the internet and use those for labeling unseen videos, which allows agents to actually imitate from these. So this indicates that Genie can be used for training our generalist agents of the future. And speaking of the future, I'll pass it back to Jack to talk about future directions.
Awesome. Thank you so much, Ashley. Right back to me. We want to emphasize that what we've shown here is that this is even possible. Before we started this project, the idea of training an action-controllable world model from videos seemed a bit like a pipe dream. And so as a result, this is the worst that Genie's ever going to be. We're expecting to see rapid progress from here, which we think can have a huge impact in a variety of areas. So going back to our original motivation, we think that Genie presents a clear path to generating unlimited environments for training agents.
And so for a more formal write-up of how we see this could fit into a framework towards getting to more general intelligence, come check out our position paper on Thursday in the oral session. Not only that, but as Ashley mentioned, we saw something pretty magical happening while playing our model: it enabled a new form of creativity, as people such as Jeff Clune's children, as previously mentioned, were able to draw their own worlds and step in and play. And we think this is barely scratching the surface of what could be possible with this new form of generative AI.
Okay, so to address the elephant in the room, for those in academic institutions thinking it's just another industry paper that uses tons of compute that you can't possibly work on, fear not, we've got something for you as well. So in the paper, we have a case study where we show you can train your own much smaller Genie model on a mid-range TPU in just under a week. With this approach, you should be able to see some pretty consistent latent actions given different initial prompts in the CoinRun environment.
As an example here, you see different actions from this model that we did train in a few days. And we're excited to see that this isn't just a wild goose chase. We have actually got some students that have been able to reproduce this. So come along to the controllable video generation workshop on Saturday to see their poster. And finally, if 12 minutes wasn't enough for you, fear not. We've got a few other things going on this week. So we've got a couple of position papers. We've got the poster straight after this talk. And then we've also got some longer talks in workshops later in the week. And then many others in the team are here as well who would love to chat to you all.
So yeah, that's a wrap. Thank you for your time. Thanks for showing this amazing work. So I have a question about technical details. So in this Genie setup, you have two training phases, right? So in the first phase, you're training the inverse model for the latent actions. And then in the second phase, you're training a prediction model for this video generation, right? But in the first phase, when you train the latent action model, you already have a dynamics model trained.
So why is it necessary to train this second step? I'm just wondering. Oh yeah, that's a great question. So essentially what you're saying is why do we have a decoder for the latent action model that already predicts the next frame and then subsequently train another one? We found that there's actually slightly different trade-offs for this decoder. So we found, if you see in the paper, we predict in the pixel space rather than token space for the latent action model. We found that that really helped to get more controllable consistent latent actions.
And then that decoder itself is just predicting in the pixel space, so actually it would be pretty blurry if you were to use it as your generative model.
Whereas we found that the MaskGIT objective wasn't best for learning latent actions. It just didn't lead to as consistent latent actions. So we have this dual approach where we have two different dynamics models, essentially, that we learn as part of the process. But you're totally right that this isn't the most elegant solution and many of my team weren't overly ecstatic about it, but that's why we're saying this is the worst it's ever gonna be. And hopefully some of you folks in the community can build a much more elegant solution in the next few months.
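To make the trade-off just described a bit more concrete, here is a rough sketch of the two-stage setup: a latent action model with a small vector-quantized codebook and a pixel-space decoder, plus a separate MaskGIT-style dynamics model over video tokens conditioned on those latent actions. All module names, shapes, and sizes below are illustrative assumptions, not the released Genie architecture.

```python
# A minimal sketch of the two-stage setup described above; names, shapes,
# and sizes are illustrative assumptions rather than the actual Genie code.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Stage 1: infer a discrete latent action between frames t and t+1,
    supervised only by reconstructing frame t+1 in pixel space."""
    def __init__(self, num_actions=8, dim=256):
        super().__init__()
        self.frame_enc = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.GELU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.action_head = nn.Linear(2 * dim, dim)
        self.codebook = nn.Embedding(num_actions, dim)      # 8 discrete latent actions
        self.pixel_dec = nn.Linear(2 * dim, 3 * 64 * 64)    # blurry pixel-space decoder

    def forward(self, frame_t, frame_tp1):
        h_t, h_tp1 = self.frame_enc(frame_t), self.frame_enc(frame_tp1)
        z = self.action_head(torch.cat([h_t, h_tp1], dim=-1))
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)   # nearest action code
        z_q = z + (self.codebook(idx) - z).detach()                 # straight-through VQ
        recon = self.pixel_dec(torch.cat([h_t, z_q], dim=-1)).view(-1, 3, 64, 64)
        return recon, idx            # train with MSE(recon, frame_tp1)

class DynamicsModel(nn.Module):
    """Stage 2: MaskGIT-style transformer over video tokens, conditioned on the
    stage-1 latent actions; this is the model actually used for generation."""
    def __init__(self, vocab=1024, num_actions=8, dim=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab + 1, dim)          # +1 for the [MASK] token
        self.act_emb = nn.Embedding(num_actions, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, masked_tokens, actions):
        # masked_tokens: (B, L) partially masked video tokens
        # actions: (B, L) latent-action id repeated across the tokens of each frame
        x = self.tok_emb(masked_tokens) + self.act_emb(actions)
        return self.head(self.backbone(x))                   # logits over video tokens
```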
Thank you very much, really cool work that you guys are doing. I wanted to ask regarding the qualities that we can see on these world models. So on the videos we essentially saw some amount of physics, so jumping and then falling down because of gravity, and then we saw platforms and saw some ladders. Do the world models ever generate other entities? Think of it like maybe enemies going back and forth that if you touch them something happens or...
What other qualities have you guys observed? Yeah, that's a great question. So I would say in the sort of examples that we show, particularly out of distribution examples, it is very difficult for it to generate anything that's kind of exciting. We are able to move the character around, but typically you would just see it, I guess,
repeating the patterns that it's seen in the background and that sort of thing. I think another exciting direction for the future is trying to figure out how to make the generations a little bit more exciting and diverse. - Yeah, thank you very much. - Now, one last question. I actually wanted to ask you something. What do you think is sort of like the cool killer application that you see in the future if you could really scale this up and train this on anything?
So I think there are quite a few applications, and really it's subjective depending on your interests. I personally think this could have impact in quite a few areas. Some of the domains we use are already quite fun to interact with as someone in those settings, but I think it could have quite a large impact in areas such as robotics, because it's currently quite hard for robots to generalize to unseen scenarios, but if you could generate a world model for any possible domain... And actually we've seen there's an open source Genie model from 1X Robotics that works pretty well. So I think that they obviously think so too, and they probably know more about that than me. And yeah, so I think there's a lot of potential applications, but we're just not focusing on one right now.
Alright, thank you very much and congratulations on the best paper. Because the Genie team were accepting their best paper award in Vienna, we were able to catch them at their poster session live to tell a bit of the human story behind Genie. Over to you, Brittany! I'm here with the Genie team. Generative Interactive Environments is the title of the poster. And I'm here with Jack. Jack, can you tell us a little bit about the origin story of the Genie project?
Sure thing. Yeah, firstly, thanks for the chance to speak to you. So basically, Genie is kind of a fusion of a few different areas of research. Myself and some others who are working on open-ended learning and environment generation beforehand, and we were interested in world models and thinking about how we could scale them to internet videos. But obviously, the key challenge with that is that internet videos don't have action labels. So if you want to train a model that takes actions as input to predict the future, you don't have the action, so you can't train that way.
And then on the other end of the spectrum, Ashley Edwards had been working for many years on inferring actions from videos for a different purpose: directly training agents with behavior cloning. And so it seemed like a natural fit really to combine these ideas. And there were some pretty simple proofs of concept of people doing this at very small scale, but no one had really gone to the generative angle of getting an environment generator from a large-scale dataset.
And so when we first spoke with Ashley a year and a half ago, we were excited about the potential of combining these ideas to build something completely new. And yeah, I guess that's where we got to. Nice. And can you give a little bit of an overview, I guess, of how the work went, what results you saw, that type of thing? Sure, yeah. So we started basically working on this 2D platformer dataset.
So we have 280,000 hours of publicly available videos of 2D platform games. We found one important thing was to filter this down because a lot of the videos aren't very good quality. So we trained a classifier on a small subset that we hand-labeled as a team. And then we ended up with 30,000 hours of good quality videos.
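The filtering step just described boils down to training a quality classifier on the hand-labeled subset and keeping only high-scoring videos. A rough sketch, where the per-video features, threshold, and label format are assumptions for illustration:

```python
# Rough sketch of the quality-filtering step described above: train a classifier
# on a small hand-labeled subset, then keep only videos it scores as good.
# The per-video features, threshold, and label format are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def filter_videos(video_features, labeled_idx, labels, keep_threshold=0.8):
    """video_features: (N, D) array of per-video features (e.g. pooled frame embeddings).
    labeled_idx / labels: indices and 0/1 quality labels for the hand-labeled subset."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(video_features[labeled_idx], labels)        # fit on the labeled subset only
    scores = clf.predict_proba(video_features)[:, 1]    # quality score for every video
    return np.where(scores >= keep_threshold)[0]        # indices of the videos to keep
```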
And then from that point it was just a modeling problem. And so we did a lot of research on different approaches to get these latent actions. And what we ended up with is that we train in, I guess, a slightly quirky way: we predict pixels with a latent action decoder. And that allows us to learn a discrete set of eight latent actions.
And then we separately train a dynamics model that uses MaskGIT, which is a way of generating next frames. And we train that separately, given the latent actions that are produced from the video, to just predict the next frame conditioned on the actions. And then people were working on different things; the project was quite fast-paced and a few of us switched around and wore many hats. And we started all getting different results in different areas. And then
Roughly around last summer, so probably just under a year ago, we realized that when we combined a few of these ideas, we actually had something that worked pretty well. And then we were excited, obviously. So then we started working on seeing how the model scales.
Because the key thing about this project is that if you can figure out how to generate worlds without action labels, then essentially there's nothing stopping you from using all of the world's videos, because you no longer need action labels to do that. So we started saying, okay, how can we scale this approach and what does it do? And then we produced these plots, which you'll see in the paper if you have time to look at that, where we show that as you increase the model size from...
a few tens of millions to, in that plot, something like two billion, you just get increased performance every single time you increase the scale. And the same thing with batch size, when you increase the number of examples the model sees. And so we realized that we had produced a scalable model, so we then decided to go for what in the end was an 11 billion parameter model. And then once we produced that, we just started playing with it and seeing what we could do with it.
And then finally, obviously the goal of this was originally to get an environment for agents. That's how we kind of started: Ashley on the behavior cloning side, and myself and others on the more auto-curriculum and open-ended learning side. But we realized it was really fun to play with the model. And so actually maybe the more interesting use case is how it enables new forms of creativity. And so there's some examples in the paper of things like drawings from one of our co-authors' children.
and they sent us photos of the drawings and then we were able to then prompt the model with those photos and play and move the characters and the photos around. And that's pretty cool, right? Because you're enabling people to create their own world, step into them and interact with them, which was not really what we first thought of when we started the project. But I think it tells you that if you do kind of ambitious, somewhat crazy stuff, then maybe new things will emerge. So that was pretty fun.
There's also a picture of my dog there too. So that's another example I like. Nice. What would you see in terms of, I guess, more near-term potential applications for this? A lot of the folks who listen to the podcast are kind of on the AI engineer builder side of things. Any creative ideas there? Honestly, this is going to sound a bit like...
a bit of a non-answer, but there's so many applications. So you can obviously see the ones in the paper we have examples where we show generating 2D platformer-like kind of short game experiences. But you also can see the models work on robotics data too.
And arguably that latter use case is maybe more promising in the short term. There's actually already been an open source Genie model released on the 1X GitHub repo as part of their world model challenge. And so I think they're more expert in robotics than I am. But the fact that they think it's a potential good direction for robotics probably speaks volumes.
I think there's other use cases too, so things like maybe driving. If you could generate scenarios for testing or even training autonomous vehicles and then be able to interact in the world in any custom situation, that could be very valuable. But on our side, we're mostly just pursuing the fundamental research
and not really focusing on one specific application. Yeah, and I imagine the world has already moved to, continued to move forward from a research perspective since you guys put this out there. Is this a direction that you see yourself continuing to pursue or what have you been excited about lately? Yeah, so this work, I mean, ICML, the deadline is January, right? So it's already six months or so ago that we submitted this.
And yeah, I guess most of the team are still working on it. Do you see coming out with a Genie v2 or v3? Yeah, I can't speak exactly about specific releases, but hopefully we'll have something new at some point.
Has there been anything else that's come out in the research landscape that you feel has either reinforced kind of what you've worked on or contradicted on the flip side? How do you see it evolving? Yeah, that's a great question. So definitely reinforced. I think just after Genie came out, there's been a flurry of like really amazing video generation results. So the first one was clearly Sora. I think they definitely took the space by the scruff of the neck and really pushed capabilities quite significantly.
and that was really exciting to see. It's quite a different style of model in that theirs is text to video, so it generates entire clips, whereas Genie is frame by frame level control. But nonetheless, it does show you that with additional scale and I guess brilliant execution, you can get much more high quality videos generated already than I probably thought was possible.
And then since then, I think the floodgates have kind of opened in the space. So from our own colleagues, the Veo model came out. It was announced at I/O. It was really impressive as well. And then other competitors have done similar things too. So it's a really exciting time for that space. I think there's not really been anything that's action controllable like Genie. But yeah, it's definitely an exciting time for video generation. So I think it's a good space to be getting involved in now.
Given how fast the community is moving, I think in a few years' time we'll have something pretty incredible. I just spoke with your video poet colleagues as well. How do you see this work dovetailing with that work? Or how do you work together on the future of what video looks like or world models? I think that they're maybe more interested in more cinematic video experiences and generating entire clips. Whereas Genie still remains quite...
quite fundamentally different. It's only generating one frame at a time. And it's kind of a video model, but it's also kind of like an autoregressive image generation model in a sense. So it kind of sits in its own
It's kind of a new area. I guess a lot of researchers always claim that they're inventing a new area, but it is kind of a new area of research and hard to classify. It's a bit different. It's also a lot of us come from an RL background, so we're much more thinking about agents, which I think is quite different to all the video work, which is much more, I guess, focused on generative media and generating cinematic quality videos. But there's definitely some overlaps in the architectures and these kind of things and the infrastructure and...
We both want to use lots of compute, so I guess that's another thing we have in common. What do you make of all of the hype around the agent space? Do you see that continuing or do you see people getting tired of the agentic buzz? You're venturing into hot takes territory. So it's tricky because, I mean, a lot of us who worked in RL, right, we've been working on agents for a long time.
Because I think RL is often dubbed as reinforcement learning research, but really it's agent research. For a lot of us, it's agent research where RL is currently the best method to get agents. It seems like now that's shifted because people are starting with LLMs and then training them on top to get additional capabilities. But it's a lot of the same people that were doing RL research, so they've always been working on agents. It's just that now it's called LLM agents; before it was tabula rasa RL. So I think...
Yeah, this hasn't changed a huge deal. It's just we're starting with base models rather than tabula rasa and maybe some kind of toyish environments. It's kind of a natural progression of that line of work. I think it's exciting, but I think the goal of Genie is a bit different in that we're going more for an embodied AGI. We want agents that can interact in the real world over a long horizon. And for that, I just can't look past how you would need a simulator of the real world, which I don't think we're going to build
by hand. So I think they're kind of complementary in a sense. I think the LLM agents will become more capable in doing long horizon tasks in text-based substrates, but for some kind of VLM to then take long horizon actions in the real world, it's going to need to be able to interact in the world, and we're not going to just be releasing them to do random exploration. So I think a real world simulator will play into that at some point. Awesome. Thank you so much for the time.
Believe it or not, Google also won a second best paper award at ICML for video generation with VideoPoet, DeepMind's take on zero-shot video generation. VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model into a high-quality video generator. It contains a few simple components.
A pre-trained MagVit v2 video tokenizer and a SoundStream audio tokenizer transform images, video, and audio clips of variable length into sequences of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, facilitating integration with other modalities such as text.
An autoregressive language model learns across video, image, audio, and text modalities to autoregressively predict the next video or audio token in the sequence.
A mixture of multimodal generative learning objectives are introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video in-painting and out-painting, video stylization, and video-to-audio. Furthermore, such tasks can be composed together for additional zero-shot capabilities, for example, text-to-audio.
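One way to picture the zero-shot composition mentioned above is chaining two learned tasks at inference, e.g. text-to-video followed by video-to-audio to obtain text-to-audio. The sketch below is a hypothetical illustration only; the `model.generate` call, its arguments, and the task names are placeholders, not the actual VideoPoet API.

```python
# Hypothetical illustration of zero-shot task composition: chain two learned
# tasks (text -> video, then video -> audio) to obtain text -> audio.
# `model.generate`, its arguments, and the task names are placeholders.
def text_to_audio(model, text_tokens):
    video_tokens = model.generate(condition=text_tokens, task="text_to_video")
    audio_tokens = model.generate(condition=video_tokens, task="video_to_audio")
    return audio_tokens
```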
Let's cut to Lijun Yu speaking for the VideoPoet oral presentation. Good morning, everyone. This is Lijun Yu from Google DeepMind. Excited to meet everyone in Vienna. This year, I believe many of you may have witnessed the significant progress on video generation, especially with text-to-video diffusion models. Today, I'm going to talk about a completely different approach, which shows that diffusion may not be a necessary component.
We appreciate that the award recognizes the contributions of this work. Now, please allow me to introduce VideoPoet, a large language model for zero-shot video generation. This work wouldn't be made possible without our talented team, with members coming from diverse backgrounds and moving forward along different paths.
The core contributors were Dan, myself, Xiuye, Jose, Jonathan, Brian, and Lu, along with many other team members. Reflecting on the progress so far, we realized that video generation has already come a long way from the early days of GAN models. In case you have never seen generated videos from a large-scale model by the standards of 2016, here are two class-conditional examples.
Since then, people have scaled up GAN models and developed pixel-space autoregressive and diffusion models, which were getting less affordable. Some works try to model it as a foreign language of images or videos, but lossy discrete tokenization poses inevitable limitations. Later, latent diffusion has become the dominating approach, given its appealing sample quality. Big companies and startups have ignited a race of scaling up compute and data.
Now, nearly 10 years later, models can easily generate a video clip from a text prompt, like this skeleton drinking soda. But is latent diffusion the only way to go as we embrace the LLM era? Absolutely not. In fact, this video is generated with VideoPoet, a purely LLM-based approach without diffusion.
VideoPoet is a foundation model that takes inputs of text, images, dense visual signals, partial videos, and audio, in various combinations. It is capable of text-to-video, image-to-video, video stylization, video editing, video-to-audio, and many other tasks.
In short, VideoPoet is an autoregressive LLM that synthesizes videos with high-fidelity motion and matching audio from a large variety of condition signals. The diverse capabilities of VideoPoet are facilitated by defining a universal multimodal sequence-to-sequence problem. The condition sequence includes task indicators, inputs from text, visual and audio modalities, as well as output format controllers.
The model generates the output sequence of visual and audio tokens in a fully autoregressive manner, just like a usual language model. In order to define the token space for each modality, we resort to a collection of unimodal tokenizers. The MagVit v2 encoder and decoder define a bidirectional mapping between pixel space and a compressed space of discrete visual tokens. It can tokenize image, depth, or optical flow, as well as cropped or masked videos.
SoundStream does similarly for the audio waveform. Although text tokens can be directly fed in, we use a pre-trained T5 to extract text features to reduce the burden of learning human language from scratch. The MagVit v2 tokenizer defines the visual language. It resembles a quantized VAE with a temporally causal 3D CNN architecture processing pixels. This causal design enables joint training with large-scale image data and seamless support for long videos.
For higher prediction bandwidth, we adopt a large vocabulary of over 200,000 words, enabled by our scalable quantizer. The model is trained with both reconstructive and adversarial objectives. In a human rater study, our advanced video tokenizer achieves even better compression quality than VVC, the next-generation video codec standard. This tokenizer lays a solid foundation for high-fidelity generation of videos, especially for those with large motion.
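The "scalable quantizer" referenced here corresponds, in the MagVit v2 paper, to a lookup-free, sign-based quantization scheme. The sketch below illustrates that general idea under the assumption of an 18-bit code (2^18 = 262,144 entries, which would match the 200,000+ vocabulary); the code width and the straight-through estimator are assumptions for illustration.

```python
# Illustrative sketch of a lookup-free (sign-based) quantizer in the spirit of
# MagVit v2's scalable quantizer. The 18-bit code width (2**18 = 262,144 entries)
# and the straight-through estimator here are assumptions.
import torch

def lookup_free_quantize(z: torch.Tensor):
    """z: (..., 18) continuous latent per spatiotemporal position.
    Each dimension is binarized to +/-1, so the token id is an 18-bit integer and
    the effective vocabulary is 2**18 without any learned codebook lookup."""
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    q = z + (q - z).detach()                            # straight-through gradient
    bits = (z > 0).long()
    powers = 2 ** torch.arange(z.shape[-1], device=z.device)
    token_id = (bits * powers).sum(dim=-1)              # integer token fed to the LLM
    return q, token_id
```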
Similarly, the SoundStream tokenizer defines the audio language. It adopts a causal 1D CNN on the waveform and uses residual vector quantization to produce multiple levels of tokens, of which VideoPoet uses the first few. And its quality is better than the Opus audio codec standard. Now that we have defined the multimodal token spaces, we can convert video datasets into discrete token sequences.
Then we can use an out-of-the-box LLM transformer training infrastructure to learn these as foreign languages. In VideoPoet, we adopt a decoder-only prefix LLM architecture where bidirectional attention is applied on the condition sequence, followed by causal attention on the target output. Compared to a diffusion transformer of the same size, the VideoPoet framework has significant flexibility and efficiency benefits at both training and inference times.
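The attention pattern just described, bidirectional over the condition tokens and causal over the target tokens, can be written down as a single mask. A minimal sketch, with made-up sequence lengths:

```python
# Minimal sketch of a prefix-LM attention mask: condition (prefix) tokens attend
# bidirectionally among themselves, target tokens attend causally and can see the
# whole prefix. Sequence lengths here are made up for illustration.
import torch

def prefix_lm_mask(prefix_len: int, target_len: int) -> torch.Tensor:
    """Boolean (L, L) mask where entry [q, k] = True means query q may attend to key k."""
    L = prefix_len + target_len
    mask = torch.tril(torch.ones(L, L, dtype=torch.bool))  # causal everywhere
    mask[:, :prefix_len] = True                             # full attention into the prefix
    return mask

# Example: 5 condition tokens followed by 4 target tokens.
print(prefix_lm_mask(5, 4).int())
```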
We can flexibly train arbitrary tasks between any modalities together, with variable lengths of condition and target sequences, in a single model. With causal attention, the transformer learns the entire decoding trajectory for a video in a single training step. At inference time, we can leverage various types of existing acceleration techniques, such as KV caching, so that the entire decoding FLOPs are no more than one full forward pass. Video data comes from different sources in diverse formats.
Text-to-video diffusion models usually require text-video pairs with high aesthetic value, which may be scarce and costly to acquire. With our flexible design, VideoPoet can pre-train on a mixture of pre-existing data, where a large fraction remains unlabeled or noisily labeled. In this table, we have a large number of raw videos with audio from the public internet, some videos with noisy machine captions, and another set of videos with high-quality human captions.
We also leverage image text pairs to improve language alignment. After pre-training, we can have a second training phase of task-specific adaptation with the corresponding high-quality dataset, such as for text-to-video. More details about the training data can be found in the paper. We have a large mixture of training tasks on this data, starting with self-supervised ones such as unconditional generation of various modalities.
With an autoregressive model, they also imply the corresponding continuation tasks for video, audio, and both of them, as well as the image-to-video task. VideoPoet is trained to generate audio given a video, or vice versa, and perform various types of video editing, such as inpainting, outpainting, and interpolation. In addition, leveraging the captions, it learns to generate video, audio, and image from text. Video stylization is supported by depth or optical flow conditions.
After the LLM backbone generates video tokens, we can optionally apply a latent super-resolution module before decoding to pixels. It uses a MagVit-style masked transformer with non-autoregressive decoding, which runs faster at small scale, with multi-axis windowed attention to handle long sequences at high resolutions. While VideoPoet has broad generation capabilities, much of the existing automatic benchmarking is defined around text-to-video.
Here we compare with prior methods on the commonly used MSR-VTT and UCF-101 zero-shot text-to-video evaluations. On metrics of CLIP similarity, Inception Score, and FVD, VideoPoet performs favorably against prior models, which were specifically designed for text-to-video. As automatic metrics get saturated and less indicative, we conduct a user study with human raters to compare zero-shot text-to-video generation in various aspects.
We compare against prior works including Phenaki, Show-1, VideoCrafter, Runway, and Pika, as well as concurrent works such as Vought and Lumiere. On the axes of text fidelity, video quality, motion interestingness, and motion realism, VideoPoet is preferred to prior works, and preferred to concurrent works in the majority of cases. This is a collection of text-to-video samples by VideoPoet, and we highlight their high-fidelity motion. More samples can be found on the project page.
Here we demonstrate the image-to-video capability, which can be potentially applied for 3D rendering as well. In addition, video stylization and editing are natively supported. VideoPoet can generate the corresponding audio for a video where it understands the content. Here we show a few examples where both video and audio are generated by VideoPoet.
We hope our work can empower the community to explore in broader areas and greater depths. With LLM-style foundation models, we could further leverage their generalization capabilities for in-context learning of a new motion, character, or object for video generation in a customized and controllable way. We can even think of how a new modality can be added into the model at inference time.
On the efficiency side, we probably want to care about how video generation can run in a real-time streaming fashion. This will not only enable interactive neural gaming, but may also facilitate neural user interface. Imagine for a neural-based operating system, it could have no more blue screen crashes, but may reboot when it runs out of memory due to a long context.
Further advancement would hopefully take us to a universal multimodal generative model that excels at text, video, audio, image, and beyond. Think about text-to-video as machine translation in 2018, when it first beat human performance: it took another five years before we had ChatGPT. I guess it will take less time before we can reason and generate across modalities at that level of intelligence.
Looking forward to seeing how it answers. Can you show me how to tie this shoe with a single hand in a live video? In summary, VideoPoet represents a distinct approach to video generation. It challenges the diffusion monopoly with state-of-the-art visual quality while offering multitask flexibility that goes beyond the text-to-video translation paradigm. It is a video-first foundation model with diverse generation and editing capabilities,
building upon out-of-the-box LLM infrastructure for native integration. That concludes my talk today. Thanks for your great work. And actually, I'm very curious about the instruction following ability of VideoPoet. So do you take some measures to evaluate the instruction following ability quantitatively or qualitatively? And how does it compare with the traditional classifier-free guidance of diffusion models?
Yeah, that's my question. Okay, that's a great question. First of all, I don't think there exists a very good quantitative metric to measure that, but it's a very promising future direction people should work on to evaluate these video generation models. Secondly, we did use classifier-free guidance for our autoregressive model as well. It works. And then I think it is very tricky to fairly compare with a diffusion model beyond a system-level comparison, because they use different latents, one is continuous, one is discrete, and you train on different ones of them, so you have different reconstruction quality. You can calculate perplexity for a language model, but people don't really know how to do that for a diffusion model. But I believe it really warrants further study on comparing these two methods systematically. Okay, thank you.
- Okay, great work and a great team. Hi, Lijun. - Hi. - So we are also working on the topic of video and models that have the capability of generating video. So my question, as we all agree: video tokenization might be the bottleneck of the model, right? - Yes. - And so do you have any insights that you want to share about how to build a very capable video tokenization technique? - Okay, so first of all, the video tokenizer consists of an encoder and a decoder, and it basically learns a bidirectional mapping. Sometimes people use something like a diffusion decoder
on the other side of a language model. But that means you are doing another generative model there. In VideoPoet, we only use a pair of encoder and decoder, trained with reconstruction and adversarial objectives. It is a bidirectional mapping. In this case, it gets really tricky to train it for high-quality reconstruction because it will always be a lossy compression problem. And on the decoder side, you always have a one-to-many mapping. So the real help, I guess, in getting away from blurry reconstructions is the adversarial objective, which gets you sharp videos. Also, the 3D causal CNN in the MagVit v2 architecture helps a lot, especially when coupled with autoregressive modeling. So you have full temporal causality for the training, and it is very friendly for autoregressive decoding.
Okay, thanks. Maybe we can have one more chat. Yeah, of course. Welcome to the poster session. It's happening at the last minute. First of all, thank you for your work. And I have a question about open sourcing MagVit v2 first. As I know, it was planned, but somehow MagVit v2 was not open sourced yet. And do you have maybe any plans about it?
Okay, I may have answered this question more confidently if you had asked me a month ago, but now, after some time, I don't really know. The good news for you is the MagVit version 1 tokenizer was already open sourced about a year ago. Yeah, you can use that. And I think it will only take another 100 lines of code or so to reproduce MagVit v2, so that you have the tokenizer and the full reconstruction and adversarial training logic. - Yeah, I know it's simple to reproduce, but it's hard to train, same as all the VQ-VAE-like methods. It's very difficult to find the right parameters.
Yes, I agree with that part. It is tricky and it takes some hyper parameter search, especially it varies with the dataset statistics, which I think there is a lot of room for future improvement. I believe the current solution is not perfect. And even for now, although I have worked on so many video tokenizers, my real dream now is getting rid of them. Last question. Thank you. Thank you for your talk. So,
You presented a really interesting direction in multimodal foundation models. From what I see, the whole architecture, or the approach in training, is very much sequence continuation, right? Yes. So I'm wondering if you are working on some more capabilities or architectural components which help the model to generalize and to simplify connecting the dots, especially across video and audio signals. It is a very hard task for any model to see the structure behind the diversity. So are you thinking of working in this direction?
Okay, that's a great question. First of all, one of the advantages of VideoPoet is we take the LLM training infrastructure out of the box. We take it for granted and we actually made no modification to the model architecture and training recipe. So all you need to do is define your token space and curate your sequence datasets.
I think that part requires some really smart designs. For example, you can have text-to-image as a prefix of text-to-video, and you can have the video-to-audio task as the prefix of unconditional video-audio generation, stuff like that. So you help the model generalize across different tasks. All right. So thank you so much for the talk, and congratulations again on your Best Paper Award. Thank you.
I'm here with Dan Kondratyuk to talk about the "VideoPoet: A Large Language Model for Zero-Shot Video Generation" poster. Dan is with Luma AI, which is one of the leading companies in the AI video generation space. Dan, can you give me an overview of kind of how you started working on this and maybe a brief introduction?
high-level summary of what it is you have here on the poster? Yeah, so we started this project as mainly a way of thinking about video generation from a foundation model perspective. So foundation models, like typically when you think about them, I guess at the time they were all like language models or visual language models. So they output
primarily text, but we thought, what if we approach it from a video perspective? And this approaches the design from a very different perspective from how the current video generation models are built, which are primarily diffusion-based. So we thought maybe we could envision a task where we take an off-the-shelf language model. One of the things about this project is we changed the language model the least: we just take a language model and don't do anything special with it. Our real innovation here is on the data side, like how you design the tasks as input to the language model.
And the way we designed it is we translate all of our modalities, so text, video, image, audio, into one embedding space. That means you translate it into one language that the language model can understand. So typically when you think of language, it's like human language, natural language.
the type of text that you can read. But you can actually think of images, video, and audio as a type of language too. So we have this tokenizer, which we call MagVit v2, which takes, for instance, an image or video,
and translates it into a discrete sequence of tokens with a very large vocabulary. So there's like a vocabulary of say 200,000 tokens and that can be input directly into a language model. So this language model speaks the language of video in some respect and all we do is just train it on hundreds of millions of videos. I think we train more than a billion images and also some of our data set had
video and audio pairs.
And depending on how you order things, we have this bidirectional attention prefix. It just means that we input these and the model has a way of incorporating all these modalities, this text, images. We also have some alternative types of dense input prediction for stylization audio. And depending on the order, what you input in the beginning, you can condition it to output different things. So for instance, you input text,
you can output video. So depending on your description, it outputs like an astronaut starts dancing on Mars and then it starts generating the output video based on how we trained it.
Similarly with our output audio, we can also, for instance, take an input image or video and try to generate accompanying audio, using audio tokens generated by SoundStream, which is a previous Google paper that did language modeling on audio to produce these audio tokens. So our approach is primarily about how we combine these tasks together, and we're not the first to show that you can use a language model for this type of generation, but we are a work that shows that you can actually scale this to a level that's competitive with existing works that do video generation. So you can see it can do
Because it's a foundation model, it can do tons of tasks just based on how we were able to design it. For instance, you can do text-to-video, we can do image animation, take an input Mona Lisa and just ask Mona Lisa to yawn and all of a sudden, based on what you described, what you want the image to do, it just does it, which is really cool.
And then we can chain this with other tasks, like for instance stylization. If we use something like depth and optical flow conditioning, it basically strips out all of the contents of the original video and conditions it on just the depth and optical flow. If you describe it like, "oil painting of a snowman with a red hat opening their mouth to yawn," then it just paints on top with the same motion as the original video. So that's another really cool thing that the model was able to do.
And then we have a whole bunch of other tasks; it can even generate audio. It can outpaint videos, where we take an input video and try to paint more content on the bottom and top. So overall, we evaluated the results and we see that they are quite competitive with a lot of existing works. In fact, it exceeds most of the works that we tried, which is really cool.
And a couple of things in particular the model does really well: one is prompt following. Because we can train it like a language model, it's actually easy to scale with the existing infrastructure that we had. It also does pretty well on motion. If you look at existing image-to-video or text-to-video results, compared to those other works that we tried at the time, it produced much bigger and more interesting-looking motion in the video, rather than something that moves only very slightly, more akin to image animation. So that's the overview of the work. If you have any questions, we'll be glad to answer. - Yeah, so we're talking to an audience here of folks who are AI engineers. They build applications, oftentimes using some of these models as the underpinnings of the AI part.
So I'm curious if you would say that the work you've done and the language model approach is a better fit for some use cases versus others, and if there might be other use cases where diffusion models may still be better, or how would you think about the trade-offs, I suppose? Yes. I think there are a couple of things. If you want to do something that does very good
pixel quality. I think diffusion models are still unmatched in this regard. And that's primarily because of the tokenizer. The tokenizer does extreme level of compression. So that's why we're forced to generate at these pretty small resolutions and need a super resolution model to increase the fidelity. But with the diffusion model, you don't have that restriction. You can do diffusion over these latent tokens that are not as compressed.
and as a result, it's a bit easier to get these high-resolution, high-level-of-quality results. However, one thing that diffusion models have a problem with is it takes a very long time to converge, a very long time to train. I think the language model approach is definitely quite a bit more efficient. We trained it only for a few weeks, and already it converged pretty well.
and it scales proportionally to existing language modeling approach. So you can easily predict
If we keep increasing the model size, as we see here, like 1 billion model is pretty good, but 8 billion model, we just like increase eight times more parameters and we get much, much better results. I suspect if we just like keep increasing the model size, it'll keep improving. So that's also another nice result. Diffusion models do have a scaling property, but it's a lot harder to predict, I would say.
So I think some of the nice things about language model is a lot more research has been done on the scaling properties. And there's also, because the tokens are flattened into a 1D sequence, you can do this multimodal representation, whereas diffusion model typically only operates on one modality. There's also maybe you could try to do something like video and...
audio generation at the same time with a diffusion model, but I don't think you can generate all modalities combined at the same time with diffusion model just yet at the same level of quality. Text diffusion right now is really hard to do and has not had the same level of performance as autoregressive models. So if you want a general foundation model to do everything all at once, I still would say language model
is at the top right now. But who knows? People are doing research in many different areas. If someone can crack a text diffusion, I think you could also create a foundation model that does all these modalities.
And is this work that you've been doing in the context of your role at Luma? And is it work that you plan to continue to kind of push forward in that context? So I recently left Google to join Luma to do some video generation that I was really excited about. So this is just a work that I worked on while I was at Google.
I think there's some continuation of this work possibly in the future, like this general approach. Obviously, VideoPoet is still not out, and I think that's just a testament to how fast the field moves right now. It's an incredibly competitive space. But I do think this general approach
It could surface in many different areas in the future. Who knows? Like right now, language models and diffusion models are battling it out in this battleground. So who's to say which one will win out in the end? Both approaches--
have been shown to work pretty well. They have their strengths and weaknesses right now, and more research is going into space. So right now, I'm really excited about video generation going forward to build out these more general purpose models. But at least for this work on Video Poet, I'm also really excited about the future prospects of what's going to happen next. Awesome. Thank you so much.
You may have caught that Dan Kondratyuk, the lead author of the VideoPoet paper, has left DeepMind to join Luma Labs, which is responsible for the Luma Dream Machine model that went viral for turning popular memes into videos this year. To tie off our generative video discussions, we will bring in Tali Dekel's invited talk from the Text, Camera, Action! Frontiers in Controllable Video Generation workshop on Saturday, on the future of video generation beyond data and scale. As a reminder, all talks have public links, so if you want to see the videos she is talking about, click into the show notes.
Hi everyone, I'm Tali and it's a great privilege to be here. So today I'm going to talk about the great revolution that we are witnessing in generative AI, and especially in video generation. And as you know, models in this domain require a tremendous amount of training data and compute. But I'm hoping to convince you, based on my own experience and work, that the future of video generation goes way beyond just data and scale.
It's going to be a high-level talk that's going to cover different topics, but I do hope to also dive into technical details on the more recent works. So again, in the context of this workshop, I think it's redundant to say that we are all aware of the fact that the generative AI revolution has been recently expanded to videos, and we are now not only able to generate these... Whoop, what?
mind-blowing still images, but we can also make everything move. And really, I think the past couple of years have shown dramatic, rapid development in this area. And as we witness this progress, we can start envisioning how movie production in the film industry might look in the near future, and think that we might be able to generate movies completely computationally. So maybe it will look something like that. We'll ask
ChatGPT to help us with the script like this, and it will generate the script for us. Then also it will take the script and generate the movie completely computationally like this. Maybe we'll then ask it to add some special effects like a bullet time effect, and it will just do it completely computationally without any real actors or cameras, just using generative AI.
Yeah, you can hear it outside? Okay, sorry. No more audio in this talk. And if you are too young, you probably don't know, but this is not a real generated video. It was taken from the Matrix movie that was produced somewhere in the 90s.
Okay, so I'm sorry to disappoint you, but I think we are very far from this future. And despite all the amazing progress that we are witnessing, state-of-the-art text-to-video models still depict some fundamental failure cases, even models like Sora. So, for example, they tend to fail to simulate real physical interactions in the world, like this object here is supposed to be a rigid chair.
And you can see it is floating in an unrealistic manner in space. In this example, the treadmill follows the person also in a physically implausible fashion. And also when we are dealing with more complicated scenes that involve multiple entities, objects tend to unrealistically appear and disappear spontaneously. And this basically tells us that video generation is still not solved.
And furthermore, the costs of scaling up video models and developing these universal foundation models are just huge. You know, a single model training run requires roughly 200K GPU hours on average, which translates to almost 280K dollars, and this adds up to millions and millions of dollars to develop such models.
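As a rough sanity check of the figures just quoted (the per-GPU-hour rate below is an assumption, in the ballpark of current cloud pricing, not a number from the talk):

\[
200{,}000\ \text{GPU-hours} \times \$1.4\ \text{per GPU-hour} \approx \$280{,}000\ \text{per training run.}
\]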
And in terms of energy consumption, just generating half a second of video amounts to driving roughly four miles in an average car. And because of these costs, these video foundation models end up being sealed in industry, and there are only very few big players in industry that can develop and design such models. So what do we do in the research community in that case?
And also, if we go back to our moonshot goal of generating films completely computationally, in order to do that, we need explicit fine-grained control. We may want to exactly control the camera position, the character identity, their emotion, their positions, their movements. We may want to control lighting and also sound and speech. And all of these controls are not currently provided to us by video foundation models. So my research journey in the realm of videos has actually started on the other side of the spectrum, with single video models.
And what do I mean by that? I mean that basically we have some neural-based framework that is overfitted to a single test video. So just as NeRF, for example, is overfitted to a single 3D scene, in this case we have some neural networks that only observe this test video alone, without any additional data.
And it turns out that you can do some pretty impressive things with these single video models. So for example, we showed how you could take this really busy and complex scene
And let's say you want to just focus your attention on a single dynamic object so we can actually remove all the rest of the moving people in this scene except this girl. And you can notice how not only we remove the people, we also remove the complex deformation that occurs to the trampoline in this case.
Here, this is a video of my son riding his bike for the first time. And I can take this video and stylize only the background. And you can see that everything moves consistently and physically correctly with the original scene. And these works are from 2021, before the big generative AI revolution.
And we can also, you know, not only map texture onto rigid objects, we can also map texture to deformable articulated objects. So, for example, we can add these flowers to the dress and they are moving in a physically correct manner as the original video. And again, these models, the only information they have about the world is just the single video, the input video on the top.
Of course, their big disadvantage is that they don't have this rich and powerful prior knowledge about the world.
So just to show the advantage of this in more detail, I think one of the big advantages of this approach is that it allows us to go way beyond just working with raw, huge pixel volumes. We can design sophisticated and more advanced representations for real-world videos.
So in layered neural atlases, we wanted to support this consistent video editing, and the key idea was to basically turn the video into, or estimate from the video, a unified set of canonical images.
So given this input video, we estimate two atlas images, like you can see here, one for the background and one for the foreground, that represents either the entire background or foreground for the entire video. And each pixel position from the original video is being mapped onto these atlas images, and this allows to basically reconstruct the original video from this representation.
And now the key advantage of this representation is that it allows us to reduce the really difficult task of editing huge pixel volumes of real-world videos to editing a single 2D image. So what you can do is take these images, plug them into any image editing framework, or just load them up in Photoshop and draw some stuff on them, and then use the mapping to map the edit back to the original video. Sorry, the animation doesn't work. Okay. Of course, I'm showing you a discrete set of images, but in practice everything is implicitly represented through MLPs, through neural networks. So very briefly, each pixel position in the video is fed into these MLPs, which map it into a 2D coordinate in this atlas space. So this is just a 2D coordinate between minus one and one,
and you have two such networks for the foreground and the background. And each such position in this 2D unified space is fed into another MLP that predicts the RGB color of that position.
And there is also another small MLP that predicts the visibility of each point, how much of it comes from the background versus the foreground. And this allows us to basically reconstruct the original color of the video at each position and to train this entire framework completely in a self-supervised manner,
where the driving loss is a video reconstruction loss. There are other terms in the objective function to make sure that this representation is interpretable, that structure is preserved, and that correspondences in the video are preserved. But basically, you can train these things end-to-end in a self-supervised manner. Okay, here you can see the edit being mapped.
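A minimal sketch of the setup just described, assuming two mapping MLPs into a canonical 2D atlas, per-layer atlas color MLPs, and an opacity MLP, trained with a reconstruction loss. Layer sizes, the absence of positional encoding, and the use of separate atlas networks per layer are simplifications for illustration.

```python
# Sketch of a layered-atlas model, assuming the structure described above:
# per-pixel (x, y, t) is mapped by two MLPs into 2D atlas coordinates
# (foreground / background), atlas MLPs predict RGB at those coordinates, and
# an alpha MLP predicts the blend weight. Sizes and details are simplifications.
import torch
import torch.nn as nn

def mlp(d_in, d_out, width=256, depth=4, final=None):
    layers, d = [], d_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, d_out))
    if final is not None:
        layers.append(final)
    return nn.Sequential(*layers)

class LayeredAtlas(nn.Module):
    def __init__(self):
        super().__init__()
        self.map_fg = mlp(3, 2, final=nn.Tanh())       # (x, y, t) -> foreground atlas uv
        self.map_bg = mlp(3, 2, final=nn.Tanh())       # (x, y, t) -> background atlas uv
        self.atlas_fg = mlp(2, 3, final=nn.Sigmoid())  # atlas uv -> RGB (foreground)
        self.atlas_bg = mlp(2, 3, final=nn.Sigmoid())  # atlas uv -> RGB (background)
        self.alpha = mlp(3, 1, final=nn.Sigmoid())     # (x, y, t) -> foreground opacity

    def forward(self, xyt):                            # xyt: (N, 3) normalized pixel positions
        rgb_fg = self.atlas_fg(self.map_fg(xyt))
        rgb_bg = self.atlas_bg(self.map_bg(xyt))
        a = self.alpha(xyt)
        # Composite the layers; training minimizes an L2 loss against the observed
        # video colors, plus regularizers on the mappings.
        return a * rgb_fg + (1 - a) * rgb_bg
```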
Okay, so on one side we have these video foundation models. They require this huge cost to train. They are limited. We don't have that much access to them in the research community. And they provide limited controllability. On the other hand, they can learn these really powerful, amazing space-time priors about our dynamic world.
On the other side of the spectrum, we have the single video models, which require only a few GPUs to train. They are accessible, and they allow us to be much more flexible and creative in the way we represent video content. However, they do not have any prior knowledge about the world. So you can probably guess that the way I think we should go about videos is actually to combine the best of both worlds.
And what do I mean by that? So, on the one hand, we want to have this flexibility and this freedom to represent video content and to gain explicit control over what we are synthesizing. On the other hand, we want to fuse into this representation external knowledge learned from universal models.
And this is not restricted to just video models; we can integrate external information from an ensemble of foundation models that can provide us motion priors, generative priors, and semantic priors.
And my first attempt to do so was in Text2Live. So in Text2Live, we wanted to support text-driven editing. And I think it was, to the best of my knowledge, the first method to demonstrate text-based editing for real-world videos. This was ECCV 2022, and the key idea there was to use a pre-trained neural atlas representation of the video as a video renderer: we have this representation and keep it fixed, and then replace the manual edits that we could perform on the atlas images with automatic edits described by text. And to achieve that, we combined this representation with a pre-trained CLIP model back then, which allowed us to do this for the first time. And here you can see how we can perform localized and semantic editing on real-world videos without any real generative model. This was just using CLIP.
And again, I think that performing these localized semantic edits, and the type of edits that I showed you for moving dynamic content, is still a challenge even for big foundation models that are very powerful.
But again, with all the respect to CLIP and this approach, you know, with the rise of text-to-image models, we wanted to take this approach further and to think of how can we leverage stronger priors about the world. And I think one of the main challenges in pursuing this approach of combining external knowledge to these sophisticated video representations
is that most foundation models are basically black boxes to us. We do not understand exactly the priors that they learn and how these priors are internally encoded. So this approach poses this challenge of how to distill learned priors from black boxes and
Basically, one of my research aims is to dive deep inside those foundation models and reveal more, gain a better understanding of what they learn and of their internal representations. And if we can achieve that, then we can build much better algorithms on top of them.
So with the rise of text-to-image models, diffusion models like stable diffusion, I was really amazed by the ability of these models to capture these really complicated signals about our visual world. So just viewing these images, we can see that these models can learn priors about composition, about pose, about interactions between objects, appearance, and so on.
So I was focusing on this aim of taking text to image models way beyond what they are meant to do, way beyond just generating images from text.
And we had a line of works in the lab that introduced some of the early works in this space. So for example, in Plug and Play, we conditioned the generation not only on text, but also on a reference image. And the output image preserved the semantic layout of the original reference image.
In multi-diffusion, we extended pre-trained text-to-image models to generate images at arbitrary resolution and also to receive as input region-based text controls, like you can see in these examples.
In the context of videos, I was thinking how can we take these powerful priors that text-to-image models learn and extend them to video synthesis tasks. So at NeurIPS, we introduced SceneScape, which allows not only generating beautiful scenery, but also generating 3D-plausible walkthroughs inside those scenes. And behind those videos, there is actually a real 3D mesh representation of the scene that is being built.
And in TokenFlow, we showed how you can not only synthesize static scenes, but actually edit real-world dynamic scenes.
And I think again, many, a huge bulk of work is doing that, like adapting text to image models, expanding them in various ways. I think what's kind of like more unique in these works is that we insisted in keeping those text to image models fixed.
and striving to better understand the generation process and the internal representation, to make these black boxes more transparent and utilize our understanding of them. So I want to dive more deeply into some of the work. So let me discuss TokenFlow in more detail.
And again, our goal in this work was to perform this consistent video editing. And we started with this naive baseline of applying plug and play or a different method to edit each frame independently.
And as you can see, the content is really inconsistent. It's not just high-frequency flickering: the content really changes from frame to frame, and there is no reason to believe the text-to-image model would give us anything else. So we wanted to dive inside the model and understand how these inconsistencies are represented inside it.
So in order to do that, we take the original video frame by frame, we use some inversion technique to invert it back to the model, and then we can just extract some features from intermediate layers. And because those features are really high dimensional, we cannot make sense of them, so we use PCA to reduce them into three dimensions and visualize them as videos.
So here you can see the original video, and on the right-hand side you can see the PCA reductions of feature tokens extracted across different layers of the UNet.
And what we can easily observe is that these PCA visualizations depict a shared and consistent representation: the consistency we see in RGB is mirrored by a similar consistency in the feature space of this video.
So we wanted to look at this consistency in a more fine-grained manner. To do that, we looked at nearest neighbors: you take a feature at a certain position in one frame and compute its nearest neighbors in all the rest of the frames. And what we saw is that those correspondences exhibit semantically accurate matching across different frames, as you can see in these examples.
And you can compute this nearest-neighbor field densely: given two frames, you can take each feature in the source frame and compute its nearest neighbor in the target frame. This gives rise to a dense nearest-neighbor field, which we named token flow.
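To make this concrete, here is a minimal sketch of how such a dense nearest-neighbor field between the diffusion features of two frames could be computed. The function and tensor names are mine, not from the paper; in the actual method the features come from intermediate UNet layers of the frozen text-to-image model.

```python
import torch
import torch.nn.functional as F

def dense_nearest_neighbor_field(src_feats, tgt_feats):
    """Dense nearest-neighbor field between two frames' features.

    src_feats, tgt_feats: [N, D] and [M, D] tensors of intermediate diffusion
    features (one D-dim token per spatial position). Returns, for every target
    token, the index of its most similar source token under cosine similarity.
    """
    src = F.normalize(src_feats, dim=-1)   # [N, D]
    tgt = F.normalize(tgt_feats, dim=-1)   # [M, D]
    sim = tgt @ src.T                      # [M, N] cosine similarities
    return sim.argmax(dim=-1)              # [M] nearest source index per target token

# Example: warping source tokens onto the target frame via the field,
# which is the kind of feature swapping described next.
# nn_idx = dense_nearest_neighbor_field(src_feats, tgt_feats)
# warped = src_feats[nn_idx]               # [M, D]
```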
So this provides us with semantic and accurate matching, but we also wanted to understand how much information these features hold about the frames themselves. To do that, we checked how well we can generate a target frame from the features of a source frame.
So this has been done by basically taking the source frame and the target frame, extracting their features, computing the token flow, and then just warping the source feature tokens.
And now we can intervene in the generation process of the target frame. We do DDIM inversion to get the initial latent, but then, for each feature of the target frame, we compute its nearest neighbor in the source frame and swap the features. We want to check how the generation of the target frame is impacted by this swapping.
And we observed that the target frame can be synthesized accurately from the source features, which means that those features are interchangeable for the model. Okay, so what happens now? Again, we apply this per frame editing and we saw that the consistency breaks in RGB. What happens to the features?
Here you can see the feature visualization of this per-frame edited video, and you can see that the features exhibit the same inconsistencies as the RGB. So basically, consistent features give rise to consistent frames and vice versa. Our key idea in TokenFlow is that in order to achieve consistent editing, we want to achieve consistent features during the generation process.
And the way we suggested to do that is by enforcing the original token flow, the feature matching of the original video, on the edited video. Here you can see the edited video and the underlying features of that edited video. Just to summarize, the method
works as follows: we take the original video, do DDIM inversion, extract the features, and compute the token flow. The generation process of the edited video is then composed of two stages.
In the first stage, we sample some keyframes and jointly edit them with extended attention, which gives rough global coherency between the frames. Then we extract the features of these edited keyframes and propagate them to the rest of the frames using the original token flow of the original video, and we repeat this process over the denoising steps.
Here you can see some generation results and a comparison to several methods. Since we published it, this work has generated a great body of follow-up works; you saw the nice work on editing x-t slices today. These token-flow correspondences hold between nearby frames, but when frames are more distant from each other the matches tend to be
incorrect, so our method breaks for very complex motions where these correspondences are difficult to establish.
Okay, so I talked about how we can use text-to-image models beyond what they are meant to do, but the main limitation of using only text-to-image models is obvious: they only provide 2D information and no motion priors. If we really want to model our dynamic world, we need to know something about how objects move, how they tend to move in the real world. We want priors about actions,
and that's something a text-to-image model cannot provide. But again, we are in this amazing period where progress happens really fast, and now we have these powerful video models. And that really motivates us to explore
their use and their understanding of motion in various applications. It could be generative tasks, but I don't think it has to be limited to that. Okay.
So that brings me to the last work I want to talk about, space-time features for text-driven motion transfer, which was presented at the last CVPR. The motivation there was, again, the film industry and the enormous manual, professional effort that goes into
transferring motion from motion capture and so on to CGI-style animation. We wanted to achieve this computationally: given an input driving video, like this dog jumping into a river, we want to transfer its motion to dramatically different objects using simple text prompts, like you can see here.
The big difference between this task and, say, what we did in TokenFlow is that you must enable deviations from the shape of the original objects in order to fulfill the target edit. To transfer the motion of this dog to a dolphin, I must change the shape of the dog dramatically
and adapt the fine-grained characteristics of the motion so that it is plausible and natural for the target object; maybe the dolphin moves its tail in a certain way, and so on. So we really need to distill the essence of the motion from the driving video, but be flexible enough to allow this adaptation of the content in order to get a natural-looking edit.
And for that, we must have a prior about how things are moving in the real world.
So in this work we used ZeroScope, one of the publicly available text-to-video models; you can see some samples from it. It is far from the state-of-the-art text-to-video models, which keep getting better and better, but it is still able to learn valuable information about our dynamic world.
Okay, so in the context of this work, we no longer define motion as pixel-level correspondences, because we want to allow this flexibility and deviation from the shape of the object. For this task, motion is defined as the sequence of positions of an object's semantic parts: you can think of an object as a set of parts
and their general progression throughout the entire video. And in terms of related work, I think none of the existing methods is designed to enable this large a deviation in the structure of the objects.
So we followed TokenFlow and took a similar approach and asked ourselves how space-time information is internally encoded in this text-to-video model. And again, we want to dive deep into the features and understand them better. So in this case, our input is a video and we can directly invert it into the video model.
again using an off-the-shelf DDIM inversion technique, and extract features. In this case the features are four-dimensional: f is the number of frames, m by n are the spatial dimensions, and d is the number of channels. So here, instead of doing PCA visualizations, we adapted a feature inversion technique, which many of you may be familiar with in the context of understanding pre-trained classifiers.
It's a classic method. The general idea is that we have some pre-trained and fixed model; we take our input, feed it into the model, and extract some target features. To understand better what these features encode, we solve an optimization task where we optimize for an image such that, when we feed it into the model, it gives rise to the same target features.
In many cases, of course, you need to somehow regularize this optimized image to avoid adversarial solutions and so on. In our case the input is not an image but a video: we can feed it into the model, extract features, and then optimize for a new video such that, when we feed it into the text-to-video model, it gives rise to the same features.
So again, you can see the objective at the top and the original video on the left, with feature inversion results from different seeds on the right. You can see that we can accurately reconstruct the original video in terms of appearance, motion, and so on. And this is not what we want, because we want to allow much more flexibility, both in shape and in appearance.
So how can we take these space-time features and build a descriptor out of them that allows us this flexibility? Our first step towards removing this pixel-level dependency was to average out, or reduce, the spatial
dimension. We take the features for each frame and average-pool them across the spatial dimension, so for each frame we have a d-dimensional vector, and to describe the entire video we have an f-by-d tensor. And now we can repeat our feature inversion experiment with those spatially reduced features.
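In code, this spatial reduction is just an average pool over the spatial dimensions; the specific tensor shape below is made up for illustration.

```python
import torch

# Illustrative diffusion features for a video: F frames, D channels, H x W spatial grid.
feats = torch.randn(16, 1280, 24, 40)      # [F, D, H, W] (shapes are made up)

# Spatially reduced descriptor: one D-dim vector per frame,
# so the whole video is described by an F x D tensor.
smm = feats.mean(dim=(-2, -1))             # [F, D]
print(smm.shape)                           # torch.Size([16, 1280])
```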
And we were really surprised by this result: even though we averaged out the information across space, you can see from this inversion that we still preserve the pose and accurate movements of the woman in the video, while allowing more flexibility in structure and appearance.
And just in terms of intuition: those features live in a very high-dimensional space, so even though we average them spatially, this information can still be preserved.
Okay, so in the next step we said: let's use these features for editing. We are given the original video; we can extract those spatially averaged features from it and use them as guidance during the generation process of the edited video.
So you can see the equation up here. We optimize the latents such that, when we denoise them with the target text, in this case a camel, the resulting spatially averaged features match those of the original video. We do that through guidance during the generation process, and you can see the result here.
So indeed, it allows for some flexibility; we can get deviations in shape and appearance, but it still looks like a camel that was squished into the shape of the elephant. So these features, even though we average them, still contain too much information about the original objects in the video.
And that led us to build the pairwise SMM difference matrix. The idea is inspired by the entire line of work on self-similarity: we don't want to encode the absolute values of these features, only how they relate to each other,
all their pairwise relations throughout the video. So basically we take these d-dimensional features for each frame and we build this F by F matrix in which each entry is basically just the difference between two spatially averaged features. And you can think about it as encoding some motion in this semantic space of features
because we are just encoding all their pairwise differences and deltas between all the frames. And now we want to again intervene in the generation process of the target video and use guidance, but this time we want to encourage the generated videos to have the same pairwise SMM difference matrix. So this will be our objective function during the generation process of the edited video.
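A minimal sketch of the pairwise difference construction and the resulting guidance objective, assuming the features come in as [F, D, H, W] tensors; the exact loss weighting used in the paper may differ.

```python
import torch

def pairwise_smm_differences(feats):
    """Build the F x F x D tensor of pairwise differences between
    spatially averaged (per-frame) features. feats: [F, D, H, W]."""
    smm = feats.mean(dim=(-2, -1))               # [F, D] spatially averaged features
    return smm[:, None, :] - smm[None, :, :]     # [F, F, D] pairwise deltas

def motion_guidance_loss(orig_feats, gen_feats):
    """Guidance objective: make the generated video's pairwise SMM
    differences match those of the original (driving) video."""
    return torch.nn.functional.mse_loss(
        pairwise_smm_differences(gen_feats),
        pairwise_smm_differences(orig_feats),
    )
```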
And now you can see that we can get a much better looking camel and still preserve the motion in the original video.
Here you can see some more examples. If you look at transferring the motion from this kitten to bunnies, you understand that we really need to synthesize the bunnies here, and they need to move in a realistic manner, as bunnies tend to move. That really exemplifies the need for a motion prior. There are some more examples with more dramatic shape changes.
And some more examples on well-known videos. We also have a way of initializing the initial latent of the video; I'm not going to go into the details, but we use a combination of DDIM-inverted noise at the low frequencies with random noise at the high frequencies. This makes the method more robust and less sensitive to the exact seed used in the optimization.
And again, previous methods really tend to preserve pixel-level correspondences, so they are not able to fulfill the edit in a way that is flexible enough.
So how do we measure success here? To measure fidelity to the text we can use CLIP score, but we also wanted to quantify how well we capture the motion of the original video, and we want to measure that under these dramatic shape changes, so we can no longer measure just pixel-level similarity between motions. So we suggested a different metric.
We suggested to measure similarity between two sets of unaligned trajectories. You can take an off-the-shelf tracker and apply it to the original video and to the edited video, which
provides us with two sets of long-range trajectories. We then measure their similarity using a Chamfer distance, where the distance between two tracks is their correlation: each trajectory in one set finds its most highly correlated trajectory in the other set, and vice versa, and we sum those correlation values.
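One plausible way to implement the metric just described, assuming each tracker output is an [N, T, 2] tensor of point positions; the exact normalization and correlation definition in the paper may differ.

```python
import torch

def motion_fidelity(tracks_a, tracks_b):
    """Chamfer-style similarity between two sets of unaligned trajectories.

    tracks_a: [N, T, 2], tracks_b: [M, T, 2] point positions over T frames,
    e.g. from an off-the-shelf point tracker. The distance between two tracks
    is taken here to be the correlation of their per-frame displacements.
    """
    def displacements(tracks):
        d = tracks[:, 1:] - tracks[:, :-1]            # per-frame motion vectors
        d = d.flatten(1)                              # [N, (T-1)*2]
        d = d - d.mean(dim=1, keepdim=True)
        return d / (d.norm(dim=1, keepdim=True) + 1e-8)

    a, b = displacements(tracks_a), displacements(tracks_b)
    corr = a @ b.T                                    # [N, M] correlations
    # each trajectory finds its most correlated counterpart, and vice versa
    return 0.5 * (corr.max(dim=1).values.mean() + corr.max(dim=0).values.mean())
```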
So here you can see the evaluation of different methods. On the y-axis we have the motion fidelity score, so higher is better, and on the x-axis we have the CLIP similarity score, so we want to be as far toward the top right as we can.
You can see that our method provides the best trade-off between good motion fidelity and fulfilling the text. TokenFlow, which preserves the original motion with high fidelity, gets a better motion fidelity score but pays in CLIP score because it cannot fully fulfill the edit.
SDEdit on the video model with a low noise level is able to preserve the motion with high fidelity, but it cannot deviate much from the original content of the video. With a high noise level it's the opposite: we can fulfill the edit, but we can no longer preserve the motion.
And again, our method provides the better trade-off between these two ends. Of course, there are limitations: we are still bounded by the priors the text-to-video model can provide, so if, under the video prior, the target object cannot be fitted to the motion of the source object, we get deviations and the weird motion you see in this example.
Okay, so to summarize, I talked about the two ends of video generation, editing, and synthesis: video foundation models on one side and single-video models on the other. I hope I managed to convince you that the approach of combining the two is effective and powerful.
There is still a ton to do in order to pursue this goal. We still need to understand these huge foundation models and devise new, smart representations in order to fuse this information into them.
And there are lots of open questions on how to do that. I'd like to thank all my students and collaborators from Google and from Weizmann. I'll continue to work towards breaking new ground in video analysis and synthesis, and hopefully, in the future, we will be able to generate even such professional effects using computational tools. Thank you.
So you mentioned that open-source video models obviously have a huge gap in performance compared to what we can see.
What do you think is still to be done that doesn't really require training? Sorry? That does not require training a model. For example, in text-to-image models we saw so many papers on different ways of controlling images.
What do you think we can do in videos that would be similar? Yeah, so I think the last work I showed takes a first step in this direction. I think that when you see these generation results, it is evident that these models learn some useful representation about motion, about how things evolve over time. And I think...
utilizing the internal representations of text-to-video models is still very underexplored. And there's a ton to do there that won't require heavy training in order to adapt them or leverage them for various downstream tasks. It could be generative tasks, but not only.
Just as we all use pre-trained image features for various tasks, I think there is great potential, and the way forward is also to use
video features for downstream tasks. And in order to do that, I do think we need to understand these models much better. And I think there are also many open questions about how to gain control over video generation, what will be the correct interface, how intuitively would you want to even interact with videos.
I think it was discussed here at different talks that just using text is not sufficient in order to model our dynamic world. And we need to build new tools, new representation, new intuitive interfaces to interact with dynamic content, which is currently not there yet.
Hi, thank you for the interesting talk. My question is a bit of a follow-up to what you just highlighted, more on the core side of universal video models.
Since we are in the early stages, do you anticipate, say, a two-order-of-magnitude reduction in cost? It could be algorithmic, it could be on the architecture side; as you said, how we control these models might even be the factor
that takes us two orders of magnitude further. So what does the future look like compared to where we are today? I think this was also discussed in previous talks here, but I really think one missing ingredient for pushing the boundaries of video foundation models is compression: how do you effectively
represent or compress information across a video? Right now, I feel that the early stages of video foundation models are mostly doing the straightforward extensions from the image domain. Building an effective video compressor whose latent space you can work in will be
crucial for pushing the boundaries of video generation by an order of magnitude or more. And I believe we'll get there; it's just a matter of time. Hopefully. Thank you. That was the end of part one of this pod on generative video.
In part two, we turn to exploring related topics in generative modeling and diffusion that we feel represent the most important work of 2024 that are also helpful building blocks for generative video. First, we have two more DeepMind researchers. You may be observing a pattern in how much work DeepMind is putting into multimodal generative AI.
Here is friend of the pod, Sander Dieleman, who works on both DeepMind's Veo video generation model and Imagen 3. Over the past year, Sander has developed an intuitive interpretation of diffusion, where traditionally diffusion models and autoregressive models are viewed as polar opposites, with different hardware utilization and inference paradigms.
Sander's perspective of diffusion as spectral autoregression in the frequency domain caught the community's imagination this fall, and for the first time, Sander expands upon it in this workshop talk. So I'm going to talk about an intuitive look at how diffusion models work, specifically in the context of modeling audiovisual data, in the spirit of the theme of the workshop.
So it's roughly structured in four parts. So the first thing I want to do is explain how diffusion works from a geometric perspective, because I think this intuition is really valuable. And one thing that sort of bothers me about the diffusion literature is that it's, you know, as a beginner, it must be extremely confusing because there's so many different formalisms, so many different ways of saying the same thing. And I think this geometric perspective is sort of a nice way to tie it all together and link these things together.
Then in the second section, I'll try to highlight some other perspectives that I think are useful and maybe less well known. In the third section, I want to talk about diffusion guidance, which is a very powerful tool that is also very easily explained with this geometric perspective. And finally, I want to talk a little bit about Imagen 3 and Veo, which are the models that I've been working on recently. So first, let's talk about a geometric perspective on diffusion models.
So I probably don't need to repeat this, but we know that diffusion works with iterative denoising. We have some data distribution that we're trying to model; in the examples I'll show, this will be an image distribution. We gradually add a bunch of noise and then we try to remove it. That's diffusion models in a nutshell.
So I'm going to talk a little bit about this corruption process first. So we first define a way to destroy all the information that is in the data distribution. And so I'm going to take an example here from the training data. I'm going to call that X naught or X zero. The index zero stands for a time step in the corruption process. So we treat this as kind of a temporal process and a time step zero. We are in the data distribution.
And then this process will proceed by adding small increments of Gaussian noise, which I've called delta here. So think of this as a tiny amount of Gaussian noise. And we just do that repeatedly. We add these small increments repeatedly. And then at some time, step T in the process, we can look at what our image looks like, and it'll be a noisy image. And then if we keep doing that indefinitely, then eventually that noisy image is going to look like just Gaussian noise, and we're not going to be able to see anything from the original image in there.
A very nice property of doing this with Gaussian noise is that if you have a lot of small increments of Gaussian noise, you can add them together into one larger increment of Gaussian noise. This allows us to simulate the process much more efficiently, and that's a key idea behind diffusion model training: for any time step t in the process, we can write x_t as our clean data x_0 plus
a scaled version of a standard normal variable. The scaling factor sigma of t is what we call the noise schedule of the diffusion model. In practice, we make things slightly more complicated, but also slightly easier to work with, by not just adding noise at every step but also slightly rescaling the input before we do so. So we introduce an extra scale factor alpha of t, which also depends on the time step.
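In code, corrupting to any time step in one go is just the weighted sum x_t = alpha_t * x_0 + sigma_t * epsilon with fresh Gaussian noise; a minimal sketch, where alpha_t and sigma_t come from whatever noise schedule is chosen:

```python
import torch

def add_noise(x0, alpha_t, sigma_t):
    """Corrupt clean data x0 directly to time step t:
    x_t = alpha_t * x0 + sigma_t * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    return alpha_t * x0 + sigma_t * eps, eps
```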
And then another change that we'll make is we won't run this process indefinitely because we don't have time for that. We're going to stop it at some time step capital T where basically the image that we get is basically indistinguishable from Gaussian noise. But now the interesting part is the backward process, right? How do we run this process in reverse? Because that then allows us to do generative modeling.
And again, this is going to be a gradual process where we add these increments, delta, but now these increments are not just random Gaussian noise. Now these increments actually require us to understand something about the data distribution to know how to gradually remove this noise.
And so I like to represent this geometrically. And before I proceed, I do want to express some words of caution. This is kind of a dangerous game, what I'm going to do here. Because really, this diffusion process is happening in the input space, right? In this case, in the pixel space. And if we think about image data as a vector space, then the vectors that represent the images are very high dimensional, right? Because you have lots of pixels. Each pixel has three color channels. These are very high dimensional vectors.
I am going to represent these as two-dimensional vectors because two dimensions is all I have on the screen. This is dangerous because as we know it can be risky to draw conclusions from low dimensional observations and generalize into high dimensions. But in this instance, I think it's actually really quite instructive to look at diffusion in this way. So what does a diffusion model actually do? We start with some data point x naught and
We add noise to it with that formula that I showed you before, some given amount of noise depending on the time step t, and then we end up at a different point in space, x_t, which is a noisy version of the image. And what the diffusion model is going to do is it's going to try to predict x_0 from that x_t. So we are in x_t and we try to predict where do we need to go in space to get back to x_0. Now this is a very difficult task.
And the reason this is a difficult task is because, of course, the noise is obscuring some information that was in the original image X naught. And we can't really recover that.
So what we end up predicting is not X naught itself, but rather the expectation of X naught given Xt. We're predicting sort of what are all the possible X naught, what are all the possible images that could have given rise to this particular noisy observation at time step t. And this is not a single image, but rather a sort of region of the input space. And what a diffusion model is going to do is predict the direction that we need to move in to get closer to that region of the input space.
And effectively what we're predicting is the centroid of that region. If you try to visualize that prediction, that centroid, it looks like a blurry image. The reason is that it is an average across many possible images x naught, and the noise obscures the high-frequency content of these images but not the low-frequency content, so the result we get is a blurry image.
So how does the diffusion sampling process proceed? Well, we just predict that direction that we need to move in, and then we take a small step in that direction. And you can kind of compare this to how we optimize neural networks, right? In optimization, we also predict an update direction, but then we only take a small step because really that prediction is only valid locally. And then one thing that we do here that we typically don't do in neural network optimization is we add a little bit of noise back.
And there are theoretical reasons for doing this that I'm not going to go into, but the intuitive reason for why this might be a good idea is that we're doing a sort of two steps forward, one step back thing, which is going to be more robust to any systematic errors in our predictions of this direction. Because, of course, we're doing this repeatedly in a loop and errors might accumulate. Not all sampling algorithms do this, of course, but some do.
Okay, and then we just repeat the process. So now we're in a new point of space, xt minus one, which looks like a slightly less noisy version of the image. And we just make a new prediction, x naught. And as you can see here, that prediction is going to be slightly different, right? Because now it's pointing to a smaller region of the input space because the noise is obscuring less information. So we can kind of make a better guess as to where we need to move. So we have this new prediction. As I said, again, this is kind of reflecting a smaller region of space that we need to move towards.
Then the process just repeats: we add a little bit of noise again, and we keep going until eventually we reach time step zero. What we should end up with is a sample from our data distribution. We are probably not going to end up at the original x naught, but we are going to end up at a sample from the data distribution. So that's the geometric overview of the diffusion process.
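Here is a minimal sketch of that sampling loop in the geometric picture: predict x naught, step part of the way toward it, and optionally add a bit of noise back. The model signature and the amount of re-added noise are assumptions for illustration, not any particular published sampler.

```python
import torch

def sample(model, shape, alphas, sigmas, noise_back=0.0):
    """Minimal sampling loop. model(x_t, t) is assumed to predict x0.
    alphas, sigmas: 1-D noise schedule, indexed from t=0 (clean) to t=T (pure noise)."""
    x = sigmas[-1] * torch.randn(shape)                        # start from (scaled) noise
    for t in reversed(range(1, len(sigmas))):
        x0_hat = model(x, t)                                   # predicted E[x0 | x_t], the "centroid"
        eps_hat = (x - alphas[t] * x0_hat) / sigmas[t]         # implied noise direction
        x = alphas[t - 1] * x0_hat + sigmas[t - 1] * eps_hat   # small step toward the prediction
        if noise_back > 0:                                     # some samplers add a bit of noise back;
            x = x + noise_back * sigmas[t - 1] * torch.randn_like(x)  # amount not calibrated here
    return x
```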
Everything I've explained so far assumes that a diffusion model predicts x naught, the clean input. If you look in the literature, that's typically not what people do. Instead, a very common approach is to predict the quantity epsilon from the formula I showed you before, which is just a standard Gaussian noise variable. But it turns out that once you have a trained model,
you can always convert a prediction of x naught into a prediction of epsilon and vice versa, because of the linear relationship we have: x_t is given, it's our input, and we know that x_t is linearly related to x naught and epsilon. So if we predict one of these quantities, we can convert it into a prediction of the other.
And people have taken this one step further, because you don't have to predict just x naught or epsilon; you can actually predict any linear combination of the two, and that gives rise to things like the v-prediction and the flow matching target, which is epsilon minus x naught.
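Concretely, the conversions implied by that linear relationship (using hats for model predictions) are:

```latex
x_t = \alpha_t x_0 + \sigma_t \epsilon
\quad\Rightarrow\quad
\hat{x}_0 = \frac{x_t - \sigma_t\,\hat{\epsilon}}{\alpha_t},
\qquad
\hat{\epsilon} = \frac{x_t - \alpha_t\,\hat{x}_0}{\sigma_t}.
```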
For the same reason, predicting X naught is also equivalent to predicting Xt minus one. And the reason I bring this up is that this is the kind of approach that is taken in the original denoising diffusion probabilistic models paper, the DDPM paper starts from saying, okay, we have this gradual corruption process, we're going to invert it one step at a time. And then the natural thing is to predict the previous time step from the current time step.
But as is shown here, actually, because of these linear relations, by solving a simple linear system, you can show that this is actually equivalent.
This is equivalent when you have a trained model. It's not equivalent during training, which is a little bit tricky. So during training, this choice of prediction target actually affects the relative importance of the noise levels in the aggregated loss across all noise levels. And that is in turn going to affect the perceptual quality of the outputs. So that's why choosing this prediction target is actually important. But once you have a trained model, all these prediction targets are essentially equivalent.
All right, so in summary, how does the diffusion training process proceed? So we take each training example x naught, we sample some random time step t, we corrupt x naught to get x t with this formula that I showed you before. We don't have to run this process one step at a time, we can just do it in one go. And then we use our model to make a prediction for x naught or for epsilon or however we decided to parameterize the model. And then to train the model, we just minimize the squared prediction error.
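A minimal sketch of that training step for an epsilon-predicting model; the schedule tensors and model signature are assumptions for illustration.

```python
import torch

def diffusion_training_step(model, x0, alphas, sigmas, optimizer):
    """One training step: sample t, corrupt x0 in one go, predict, minimize MSE.
    alphas, sigmas are 1-D tensors indexed by time step; the model predicts epsilon."""
    t = torch.randint(0, len(sigmas), (x0.shape[0],))
    a = alphas[t].view(-1, *([1] * (x0.dim() - 1)))        # broadcast over data dims
    s = sigmas[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a * x0 + s * eps                                 # corrupt in one go
    pred = model(x_t, t)
    loss = torch.nn.functional.mse_loss(pred, eps)         # squared prediction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```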
And this is just the MSE loss that we all know and love, so it's a very stable training objective, which is nice. The intuitive reason why MSE is a good idea here is that what we really want to recover is that expectation from before: we can't predict x naught exactly, but we want to recover the expectation of x naught given x_t, and that is precisely the minimizer of the mean squared error.
And then for sampling, at each time step t we predict x naught or epsilon from x_t with our model and take a small step in the predicted direction to partially denoise x_t and get x_t minus one. As I said, in some algorithms we add back a little bit of noise, and in some we do not. Okay, so that's the basis of this geometric perspective. Now I want to talk about a few other perspectives that I think are useful and maybe less well known.
One thing I'm going to skip is the score matching perspective. It's also linked to what I just explained, but I think it's pretty well known nowadays. So let me talk about a few other perspectives, and one is this way of looking at diffusion models as recurrent neural networks. If we think of the diffusion sampling loop,
We're kind of repeatedly applying this denoiser network that we've trained in sequence. And if you unroll that computational graph, that actually just looks like a much deeper neural network.
And then you could ask, why don't we just train that with backprop like we usually do? The answer is, of course, that it's very, very deep, often tens of thousands of layers: if your base diffusion denoiser model is 100 layers and you have 100 time steps, then this is a 10,000-layer neural network. You can train that with backprop through time; people have done that, and what you get is actually called a continuous normalizing flow.
But you can do another thing, which is to train this with score matching, and then you don't have to backprop through the loop; you only have to backprop through one step of the denoising. So this gives you a way to look at diffusion models as a kind of deep recurrent network that was trained without backprop through time, a hack to train deeper networks, if you will. This is a perspective that I really like.
One question that I get a lot is why do diffusion models actually work so well for image and video? Why did they come in and take over generative modeling for all modalities except language? And so for images, there's this interesting spectral analysis that we can do that sheds some light on this. So we can calculate the spectrum of an image.
We can kind of summarize that in one dimension. And if you plot this spectrum on a log-log plot, what you get is a power law. You get a straight line, and that reflects that there's a kind of power law going on. So the amplitude-- or actually, rather, the power of a particular frequency in the image is proportional to that frequency raised to some negative power. Usually, it's like around minus 2. And this seems to be some sort of law of nature. So you get this negatively sloping line for natural images.
If you do the same thing for Gaussian noise, you calculate the spectrum, what you should get is a horizontal line. Because in Gaussian noise, all frequencies are present in equal measure. Now the interesting thing that happens is when you superimpose these, because that's what we do in diffusion models, right? We add noise to images, and you add them together, and then you look at the spectrum, and you get this hinge shape that you see on the third plot there.
And if I increase the noise level, so if I increase the amplitude of the noise, then that hinge sort of shifts position. And what this is going to do essentially is it's going to obscure more and more of the high frequencies in the signal. But the low frequencies, because they're more powerful, they're going to kind of jut out above this noise floor. And so they're going to be preserved.
And so based on this interpretation, I think it is fair to say that diffusion is kind of an approximation of spectral autoregression: we're generating images from low frequencies to high frequencies. This is true for images and for video; audio also follows this sort of power law, but it's obviously not necessarily true for other modalities such as language.
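For the curious, here is a small sketch of the kind of spectral analysis being described: a radially averaged power spectrum computed with a plain FFT. Variable names are mine, and the example assumes a 2D grayscale image.

```python
import numpy as np

def radial_power_spectrum(image):
    """Radially averaged power spectrum of a grayscale image (2D array)."""
    f = np.fft.fftshift(np.fft.fft2(image))
    power = np.abs(f) ** 2
    h, w = image.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2).astype(int)   # radius of each frequency bin
    return np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())

# Plotting log(spectrum) against log(frequency) for a natural image gives a roughly
# straight, negatively sloping line (a power law); for Gaussian noise it is roughly
# flat; for their sum you get the "hinge" shape described above.
```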
This is not an idea I came up with; I was inspired by the paper from Severi Rissanen and his colleagues on generative modelling with inverse heat dissipation, where they do this kind of spectral analysis. And it's really important, because the different noise levels actually correspond to different spatial frequencies in the image, in a way.
And so that means that when we're reweighting, rebalancing these different noise levels in our training objective, what we're actually doing is we're saying which spatial frequencies matter to us, like which spatial frequencies do we want the model to really understand well.
And this actually means that the diffusion loss is actually kind of perceptual loss, right? Because we're kind of emphasizing the frequencies that the human visual system is sensitive to, and we're de-emphasizing the ones that we are less sensitive to. And I think that's one of the big reasons why diffusion models for images took off so rapidly, even if we didn't necessarily understand this at the time.
Alright, one thing I also want to do is contrast autoregression and diffusion, because these are the main generative modeling paradigms that are popular today. We probably all know what autoregression is: you turn everything into one sequence and generate that sequence one step at a time. With diffusion, we use this noising, corruption process. So these are just two different ways to do generative modeling, but they're both iterative; they both use
many network invocations to do generation. So they both use this kind of divide and conquer approach to generative modeling. And so for video specifically, there's kind of a continuum almost in some of these choices.
So we could just model video autoregressively and that would require taking the spatial temporal volume and dividing it up into tokens, which would be like three dimensional patches or voxels and just choosing some order in which to predict these, right? Because we need to turn it into a sequence. Then on the other end of the spectrum, we could just take this entire cube, this entire volume, and just model that with diffusion.
But there's kind of a hybrid approach that seems to make a lot of sense for video specifically, which is to treat the temporal dimension autoregressively and do diffusion over the spatial dimensions. So that's what I'm showing in the middle here. And all of these approaches have sort of their own advantages and disadvantages. So the autoregressive approach is nice because it would make creating multimodal models very easy. So if we want to integrate this with large language models,
Right now that seems to be the way to go. But of course these sequences will get very long and so that means we run the risk of getting problems with error accumulation if we generate very long videos. Then on the other hand with diffusion, we have kind of robustness against this error accumulation.
And we have powerful methods for accelerating sampling, for example through distillation. I also believe that guidance, while it's not exclusive to diffusion and you can apply it to autoregressive models as well, does seem, at least to me, to be more effective in the diffusion setting. But of course, working with these very large spatio-temporal cubes, having to generate everything in one go, can be quite unwieldy and can create a lot of memory pressure. So the hybrid approach
could be seen in some sense as the best of both worlds, but it also has its own advantages and disadvantages. For example, if you want to do distillation, then this hybrid approach with temporal autoregression might actually cause issues with error accumulation again. But one nice aspect of the hybrid approach is that we can reuse a lot of what we've done for images, because essentially this is just an image-conditional image generation model, if you will.
One more general trend that I want to talk about in generative modeling of perceptual signals is this sort of moving away from measuring likelihood in the input space. So back in the day when I started working on generative modeling, we had models like PixelCNN and WaveNet. These were just likelihood-based models in the input space.
But they didn't really scale very well to larger inputs because likelihood is actually a very poor perceptual metric and it's precisely because it's putting way too much emphasis on these high frequencies that are perceptually less relevant. Of course, it works very well for language as we know.
But so the general trend for perceptual data, for audiovisual data has been that for autoregressive models, we've started measuring likelihood not in input space, but in some latent space. We first learn latents to kind of make abstraction of a lot of this entropy that is not actually perceptually relevant. Like the individual blades of grass in a grassy texture, for example, don't need to be modeled by a likelihood-based model. We just need to be able to kind of paint with a grassy texture, essentially.
And then the same thing is kind of implicitly happening in diffusion models in a continuous way because by re-weighting the noise levels, we are kind of also implicitly down-weighting these less important frequencies. But of course with diffusion, we're also nowadays often using a latent space to kind of amplify this effect. And I want to talk a little bit more about that. Why that makes sense? Why that is a good idea?
So visual perception, I think, works differently at fine scales and large scales. At very fine grain scales, our perception of texture kind of makes abstraction of all these little details. We don't, you know, I can take an image
with a grassy, let's say a dog playing in a field, like sky above, grass below. I can take an image and modify it in Photoshop by shifting that grassy texture one pixel to the left and show it to you again, and you won't be able to see what happened. It's too subtle. So that perception is kind of making abstraction of these fine-grained details.
And it's not actually necessary to model all these possible variations. We just need to be able to generate one good one. And that's precisely what adversarial models give you. They don't really bother modeling all the modes of the distribution, but they can give you a few good ones.
So this is just a really good match for fine-grained perception. Whereas at the larger scale, we care a lot more about covering all the possible modes. And so there it makes sense to use something that's closer to a likelihood-based model or a diffusion-based model. Right. So the next thing I want to do is talk about diffusion guidance, which is, I call it a cheat code for diffusion models because it allows them to perform way above their pay grade in a sense.
Guidance allows us to trade off sample quality for diversity and it just generally makes diffusion models work a lot better. And so I want to revisit this geometric diagram that I was talking about before. So again, we have our clean input sample from the data distribution X naught and then a noisy version of it at some time step T in the top right corner.
And as before, our diffusion model will predict which direction we need to move in in the input space to move towards the data distribution.
But now we're going to do something slightly different: classifier guidance. We take a classifier that is robust to noisy inputs and ask it to classify this noisy image, and we take the gradient of the logits with respect to the input. What this gives us is a direction in input space that we should move in to make this image more likely to be classified as that particular class. So it's kind
of amplifying the aspects of the image that make it adhere to that particular class. And this gives us a different direction in input space. And instead of following the direction we predicted with our diffusion model, we can actually superimpose these directions, just add them together, and then move in that direction instead.
And I want to show you the underlying Bayesian perspective on this as well, which you can get simply by taking this formula for classifier guidance, which is expressed in terms of score functions, the gradients of log likelihoods. You can undo the gradient operation and the log operation to see what happens in terms of probability, and that's what I'm showing on this slide. What we're doing is taking an unconditional base diffusion model, adding this classifier, p of c given x,
and then combining those two to get a conditional model. So we can actually turn an unconditional model conditional after training. But the real power of classifier guidance is unlocked when we introduce this scaling factor, which is called the guidance scale. So we're going to scale this gradient direction that we get from the classifier by some constant gamma.
And what this is going to do is just say like, make it look like a rabbit, like really make this image look like a rabbit. I want to get all the characteristics in that image that make it look like a rabbit. So our new update direction is going to be this one. And so we're going to end up in a different point in space that is kind of following this new direction.
And again, if we look at the Bayesian perspective here by undoing the gradient operation and the log operation, what's happened is that the classifier probability is now raised to this power gamma.
And what does it mean when we raise a probability distribution to a power and sort of renormalize it? That's tuning the temperature, right? That's something we do with autoregressive models all the time. We're actually just tuning the temperature. But what's interesting about guidance is that the temperature tuning is happening in the output space of a classifier and not in the input space of the generative model. And personally, I think that's why it's so powerful because we're able to tune temperatures at a kind of high level of abstraction. We're kind of sharpening this classifier distribution.
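Written out, the guided score and its probability-space reading (with gamma equal to one recovering plain Bayes' rule) look like this:

```latex
\nabla_x \log \tilde{p}(x \mid c)
= \nabla_x \log p(x) + \gamma \, \nabla_x \log p(c \mid x)
\quad\Longleftrightarrow\quad
\tilde{p}(x \mid c) \propto p(x)\, p(c \mid x)^{\gamma}.
```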
So next let's look at the classifier-free version of guidance. So kind of doing the same thing here again, like looking at our diffusion model prediction, but now we're actually going to make two predictions. We're going to make an unconditional one and a conditional one, and these are going to be slightly different because obviously the conditioning signal gives us a little bit of information about where in space we might need to move to draw samples from the distribution.
The way we can achieve this in practice is by training a conditional generative model and then maybe dropping out the conditioning signal 10% of the time. And that gives us a model that can operate in both conditional and unconditional modes. So we have these two predictions and we can look at the difference vector between the two, which I've called delta here. And this difference vector is the direction that we can move in to make samples look more like they belong to this class C.
And again, we can do the same thing that we did in classifier guidance, which is to amplify this difference by some scale factor gamma just to allow us to really hone in on the characteristics of this class C. And then this gives us a new direction which we should move in during diffusion sampling. And again, as before, the sampling algorithm kind of proceeds as before. So we might actually add some noise here. All right. Now let's look at the Bayesian perspective again.
This is very powerful because you have effectively applied Bayes' rule twice, and this vector delta corresponds to a Bayesian classifier. The classifier probability we had before is now replaced by the ratio of p of x given c to p of x, again raised to this power gamma, so we're again tuning a temperature. And this is effectively what classifier-free guidance is.
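In code, the classifier-free guidance combination is essentially a one-liner; the model signature below, with cond=None meaning the unconditional mode, is an assumption for illustration.

```python
import torch

def classifier_free_guidance(model, x_t, t, cond, gamma):
    """Classifier-free guidance: amplify the difference between the conditional
    and unconditional predictions by the guidance scale gamma.
    model(x_t, t, cond) is assumed to return an epsilon (or x0) prediction,
    with cond=None meaning the unconditional mode."""
    uncond = model(x_t, t, None)
    cond_pred = model(x_t, t, cond)
    return uncond + gamma * (cond_pred - uncond)   # gamma = 1 recovers the conditional model
```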
And this is a lot less prone to adversarial directions in the input space than classifier guidance would be. I have these examples. They're quite old in the meantime. So these are from the Glide paper, which was one of the first large-scale text-to-image models from OpenAI. But I really like these because they show what the model looks like without guidance and with guidance, which is rare in modern papers. In modern papers, we only see samples with guidance.
But here you can kind of really see just how much of an impact this has. And you can also see the impact, you can see the trade-off between diversity and quality, right? You can see the images come out looking much less diverse, but the quality is clearly improving.
Another example from the same paper here with a slightly different prompt. Again, sort of reducing the diversity in favor of making the images just look a lot nicer overall. And I think nowadays, a lot of these state-of-the-art models that we're seeing, if you were to sample from them without guidance, I think you would be surprised at just how bad they are. These models are really relying on guidance to produce these incredible results that we've been seeing.
So if there's anything you remember for this, the main thing I want you to remember is that classifier guidance is just two applications of Bayes' rule. Or is it? There's an interesting recent paper by the NVIDIA group from Finland where they kind of call this into question a little bit and give some other intuitions about why guidance might actually be working. I won't go into this here, but it's a very good paper. I recommend taking a look at it. It came out last month, so it's very recent.
All right, and then to wrap up my talk, I'm going to briefly talk about Imagen 3 and Veo, which are the text-to-image and text-to-video models that we've been working on recently. Both were announced at Google I/O in May. Imagen 3 should be available shortly; Veo is obviously a more unwieldy model that might take us a bit longer, but hopefully you'll be able to play with Imagen 3 soon.
And I just have a few samples here from this model. So this is a latent diffusion model, just kind of a change from our previous family of models. And you can kind of see that, yeah, it does a pretty good job at fine-grained detail, large-scale structure. It's a very nice text-to-image model. All of these samples are on the DeepMind website on the relevant blog post.
And hopefully we'll be able to share some more details about the inner workings soon as well. Finally, I also want to talk a little bit about Veo, which is our text-to-video model. This probably looks a lot like what you would expect: it's again a latent diffusion model. We have a text encoder that encodes the text prompt input, and an optional encoder to condition it on frames for image input.
And then the diffusion operates in a latent space and then we have a decoder that turns this back into pixels at resolutions up to 1080p and relatively long lengths.
And then I have here, I don't know if this is going to play, a show reel of samples which you may have seen before. Okay, that is supposed to move because it's a video. Okay, here we go. So this is just a show reel of some samples from the Veo model. I don't know if the quality quite comes across here, but it's producing high-quality video at 1080p.
Alright, so to wrap up, one thing I want to highlight is that pretty much everything I've talked about today is on my blog. I have a whole series of blog posts on diffusion models and on generative models in general, where I try to build intuition. So it's not necessarily about theory and being mathematically correct; it's about building intuition for these models and how they actually work.
And so most of the content from the slides here is kind of spread across a few of these different blog posts. Okay, that's it for me. So link to my blog, also to my Twitter account and my email address. If you have any comments or suggestions or questions after the talk, feel free to contact me and I'm happy to take questions now as well. Thank you. - Yeah, I'm curious to hear about where you see the capabilities of these models going. - That's kind of a vague question.
- Yeah, I mean, so I think bigger and better. Yeah, I think we're kind of early. Like I kind of compare it to what's happened in language modeling, where we're kind of a bit further along in the scaling process. I would say on the video side and on the image side as well, I think we're quite early. So I would expect more big leaps.
I have a question about latent diffusion models. I haven't seen them written down mathematically; would you mind giving us some intuition? If the input is fixed, like doing diffusion directly on x, it makes sense: you can add however much noise you want and try to reverse it. For latents, it means you're training a neural network, and you're running the same process on latent values while the network producing them is itself shifting as it trains.
So usually it's a two-stage process. So we're first going to learn some latent space that essentially compresses the input. Because one of the issues with generating very large images, very large videos, is that it just takes up a lot of memory.
And one of the key advantages of latent diffusion is that you can actually compress a lot of the redundancy out of this and still get a representation that's sort of learnable, right? This is also how it differs from sort of standard compression. You know, you have standard compression algorithms like, you know, like JPEG and H.264 and whatever. They're really just focused on making things as small as possible. Here, we're kind of trying to do, to control a trade-off between
how much we can compress while maintaining output quality and also how learnable the resulting representation is. Because if you compress too aggressively, that might get difficult. Like if you were to do entropy coding on the latent space or something like that, that might actually make learning more difficult. So it's kind of an interesting...
twist on the compression problem because you have this trade-off. But it's generally a two-stage process: you learn the latent space first, then freeze it, and then you train a diffusion model as you always would, except that you're extracting this feature representation and operating on that. Thank you. Great talk. I'd love to hear your thoughts on current metrics, things that are missing, and how we can better evaluate these models, particularly for video generation but for diffusion models in general.
I have mainly complaints and not many suggestions. It's tough, right? We don't have a lot of great metrics. We do a lot of eyeballing.
For image as well, but especially also for video. It's trickier for video than for image because for image it's kind of easy to generate say 200 samples, put them in a grid, just take a quick glance at them and have a rough idea of what your model is doing. For video it's a lot harder because everything is moving, right? So it's much trickier to kind of glance at things. You kind of have to look more at individual samples. And then for audio it's actually completely impossible, right? Because you just have to listen to them one by one.
And this is a very persistent problem that I haven't seen any great solutions to so far. We use the classical metrics, FID, FVD, but we also know they are flawed in various ways and that sometimes we can't trust them. They're at least useful as canaries: they can tell us when something is seriously wrong, so that's helpful.
But yeah, definitely a very fruitful space to work in if you want to make an impact, is to figure out how we evaluate these things, especially computationally, without involving humans in the loop. Thanks. I would like to ask if you think that predicting human evaluation with a model is promising as a direction or not for evaluating these models?
Quite possibly, yeah. I guess it kind of depends on what your human evaluation data looks like.
But I think that's a promising direction. Did you try to scale, to train a model on a lot of human evaluation? And then sort of use it as a proxy, as a reward model in a sense. Yeah, I would say that's a valuable direction to move in. I have one concern with that, which is that every metric, when it becomes a target, eventually ceases to be a good metric. So it would be very interesting to kind of see how that applies there. And I think we should be careful about that.
Thank you. Hi, we have seen that some of the diffusion models always produce or often produce data that's very close to its training data when you ask it to produce something. Do you have any ideas on how they might get more creative or general and further away from their training data?
I think the easiest way to solve that is probably to get more data. You know, if you have an order of magnitude more data, then something like that is an order of magnitude less likely to happen.
But I think, so I don't deny that this is a problem, but I think we should also, you know, when diffusion models kind of rose to prominence, came onto the scene, I think one of the very impressive things was this kind of combinatorial generalization that they exhibit, right? So I think to some extent there's already a lot of sort of
creativity on the part of these models and sort of combining things in ways that they don't exist in the training set. And I would expect with more data that that ability would improve. Hey, man. Thanks for the great talk. With some of the video models that have been released, if, for example, you have like
say, water and waves flowing, you can, as a human watching it, see that the laws of physics aren't strictly abided by in the same way that you'd see in real life. What do you think are some promising directions for ensuring future video diffusion models more closely adhere to the physical laws of nature in that sense? Scale
is one. I think a lot of this sort of behavior is emergent, and with more data and more capacity the model will learn to do this. But maybe in the shorter term there's something we can already do to improve this, maybe by curating the data, maybe by building in some physical priors into the model. Although we do have to heed the bitter lesson here, where it often turns out that it's better to just let it learn and not try to meddle with that too much. But yeah.
Our second DeepMind speaker in this section is Ben Poole, who works on inferring 3D structure with 2D priors, which you can see is a key component in upgrading something like Genie 1, which is 2D, to Genie 2, which is 3D. He also introduces the Neural Radiance Field concept, or NeRF, which is now incredibly popular for 3D environment simulation and of course has implications for synthetic data in generative video.
Ben combined NeRFs with score distillation from diffusion to create DreamFusion and ReconFusion. Let's tune into his invited talk at the Structured Probabilistic Inference and Generative Modeling Workshop led by Yoshua Bengio. Yeah, thank you everyone for being here so early. Thank you so much to the workshop organizers for the invitation to speak. And today I'm going to be sharing some of our work on inferring 3D structure with 2D priors.
And so ICML has been really fun, but people keep asking me why I'm working on 3D generation. We've seen some amazing progress in video generative models. And as we scaled up the data and the compute, we often see that the quality improves. And then if you also look at some of the 3D consistency within these video models, they've also improved as we scale things up.
But the way that we consume content isn't always just staring at a flat screen. We have amazing new AR and VR mixed reality headsets, and the type of content that we want to consume is often interactive. You see some really creative, interesting scene, and you'd love to move around in it and see it from other angles. And it's not just moving around in it in VR headsets. Oftentimes, the most fun way of exploring worlds is interacting with them, be it in video games or exploring on a mobile device.
And it's unfortunately really challenging to create this kind of 3D content. 3D modeling is really hard. I remember working on some of this in middle school and being immensely frustrated at the inability to create the seemingly simplest objects. And even once you have these 3D models, how do you interact with them and add them to worlds, light them, rig them? It's all an extremely challenging and time-consuming problem.
And it's not just about creating things. I think something that I find really frustrating is we've seen the amazing power of AI in a number of different domains, but I'm seeing all of you sitting in front of me. And as a human, I feel like I have this really innate sense of the 3D structure around me where objects are. I know my water bottle is here and I can grab it, but it's really challenging for AI systems to have this kind of spatial intelligence. So if we can make more progress in building 3D priors and understanding the 3D world, I think it could really influence the direction that things are going in robotics as well.
Luckily, we've seen amazing progress as well in 3D reconstruction. So here's an example from Zip-NeRF, which is a powerful method based on NeRF. And you can capture an entire house and turn it into a 3D model that you can move around in and interact with. And the quality and photorealism of this often exceeds even our best video models today.
And how do these methods work? The idea is that we have the space in front of us and we can parameterize it as a 3D volume. And at every point in this x, y, z space, we can use a neural network that maps from a point in space to a density and a color. And there are all sorts of different 3D representations that people are exploring these days, but the key idea is that you have a differentiable mapping from somewhere in space to a color, or the ability to query different points along a ray.
And the way that we train the parameters of these neural networks that are representing the 3D world is that we can cast a ray into the scene from a known camera, and we can evaluate a bunch of points along that ray using our neural network. This gives us a color and density along the ray that we can accumulate to get an RGB color. And the way that we train these neural networks for 3D modeling is that we have gone and collected a bunch of images, and we can see how well does the image match the prediction of this neural network.
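As a toy illustration of that ray-casting and accumulation step, here is a minimal volume renderer for a single ray. The `toy_field` sphere is a made-up stand-in for the trained network, and positional encodings, hierarchical sampling, and batching are all omitted:

```python
import torch

def render_ray(field, origin, direction, near=2.0, far=6.0, n_samples=64):
    """Alpha-composite colours along one ray; `field` maps xyz -> (density, rgb)."""
    t = torch.linspace(near, far, n_samples)         # depths along the ray
    pts = origin + t[:, None] * direction            # (n_samples, 3) sample points
    density, rgb = field(pts)                        # (n,), (n, 3)
    delta = t[1:] - t[:-1]
    delta = torch.cat([delta, delta[-1:]])           # spacing between samples
    alpha = 1.0 - torch.exp(-density * delta)        # opacity of each segment
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])   # transmittance up to each sample
    weights = alpha * trans
    color = (weights[:, None] * rgb).sum(dim=0)      # accumulated RGB for the pixel
    depth = (weights * t).sum()                      # expected depth (the depth maps shown)
    return color, depth

# Toy field: a soft grey sphere of radius 1 at the origin.
def toy_field(pts):
    density = torch.relu(1.0 - pts.norm(dim=-1)) * 10.0
    rgb = torch.full((pts.shape[0], 3), 0.5)
    return density, rgb

color, depth = render_ray(toy_field, torch.tensor([0.0, 0.0, -4.0]),
                          torch.tensor([0.0, 0.0, 1.0]))
# Training would compare `color` against the captured pixel and backprop into the field.
```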
And I think what people don't realize about NeRFs is how data-hungry they are. So if I want to capture a LEGO bulldozer on a table, I can't just take one picture. I have to go out and collect a huge handful of pictures that surround the object and view it from almost all viewpoints. Their ability to generalize to unseen regions is basically nothing. It's really interpolating between known viewpoints. And once you do this, you can get high-quality 3D reconstructions that represent the color and also learn, to some extent, the 3D geometry depicted by the depth here on the right.
So what happens if I haven't had my coffee this morning and I wake up a little bit early and I only take three photos, but I'm really curious to see what the scene might look like in 3D? Well, here's an example of the state-of-the-art 3D reconstruction methods on a three-view reconstruction. And what you can see is it matches really well at the observed images, but as I deviate away from these images, we get really inaccurate predictions of what the world might look like. And if you think about building a robot that's going to go and grab that Lego bulldozer, that depth map and the 3D geometry looks hugely inaccurate. It's not going to be useful for any of these tasks.
And in general, I think we're at this structured probabilistic modeling workshop. What's the problem that we're trying to solve? Well, we don't have access to the 3D world or even a lot of ground truth data in the 3D world. We just have the shadows of the 3D world. We have the projections in our eyes, or we take out our camera and just see a two-dimensional image. But we would like to understand what that 3D world is so that we can reason over it. So we're really trying to solve this inference problem of, okay, what's the distribution over what could be there in the 3D world given some set of observations?
And there's often this kind of spectrum that goes from reconstruction, where you've collected a lot of data, you know exactly what should be there, and you want to recreate it in a digital world, to something that's a little looser: maybe I have a picture and I just want to hallucinate plausible 3D content, a 3D scene that could be consistent with just that image.
Or maybe I don't have a bulldozer in front of me, but I want to create it for my game or visualize it. And maybe I just want to describe it with text. So we have all these different ways of thinking about observations that we want to condition on, but there's a shared common goal. How do we create this 3D structure given these partial observations?
And we've done some work across the spectrum. So we started off working on text-to-3D with DreamFusion, and then worked on few-view reconstruction with ReconFusion. And then more recently, we have CAT3D, which enables us to do everything from text to single image to few-view reconstruction for 3D creation. I'll talk a little bit about each of these projects today.
Okay, so why is 3D hard? I think I got into 3D mostly, not necessarily because I cared about 3D and understanding the 3D world, but it was more, this was a problem that didn't feel like data could just solve. Across the board on language generation, text generation, and image generation, we've seen that there's an incredible amount of progress just by collecting big data sets.
But as we saw before, it's really hard to acquire ground-truth 3D models of the world. It's really expensive and involves a lot of human effort. But let's say we do this and we collect a big data set. Now what? How do we represent it? We have all these different 3D representations. We have splats, voxel grids, NeRFs. You have to pick one of those. And then once you pick one of those, you have to design an architecture that can scale up as you increase those data sets.
But let's say you do this. Here's a bit of an old example these days. Can people hear me okay? I'm realizing that I'm okay. Great. So, you know, if you have a decent-sized 3D data set and you train a model on that data set, you can get okay 3D models. But most of our 3D models are just isolated objects, and it's hard to get the realism to be as high as, for example, the image samples that we get out of state-of-the-art text-to-image models.
And I think the real problem here is that there's this huge gap. I've been presenting this for a while, and I think it is still very true. There's a huge gap between the 3D data that we have access to and the visual world. And I think a lot of that is driven by everyone here has mobile phones in their pockets with cameras. But not all those cameras have depth sensors. And even if they have depth sensors, when you take a photo, you don't often take a video of the object that encircles it and captures all the different ways that you can imagine viewing it.
And so the bet that we made was: okay, instead of building explicit priors in the 3D space, could we find ways to build priors in 2D? And if we have these priors in 2D, now we need to solve a more complicated problem, because we can't just do inference over the 3D space if we don't have a prior there. We need to be creative about ways of using these two-dimensional priors for 3D generation.
And the general inductive bias, or the way that we're going to hack 2D priors into 3D, is that we're going to say, well, what is a good 3D model of the world? As a human, I often don't have the ability of knowing that the 3D world around me is really accurate and precise, but I can view it from different angles. And so the idea is that we're going to take this 3D model that we're trying to learn or do inference over, and we're going to render it from a bunch of novel viewpoints.
And what does it mean for that 3D model to look, to be a good 3D model? Well, it just has to look good. And how do we measure how well it looks good? Well, we're going to look at the renderings and we're going to use a 2D prior to score this amount of goodness. And so here we have like a bear playing a guitar. So you might imagine, okay, if maybe from one view it looks good, that might be insufficient for the 3D model. But if every way that I look at that 3D model, it looks good, then maybe it's a good 3D model of a bear.
This opens up a number of questions and problems and research directions for you to solve. What 2D prior, conditioned on what information? How do we actually measure goodness? I think there's been a lot of tremendous work in probabilistic modeling on what it means for an image to look good, and I think we still don't have a really great sense of what that metric is or how to optimize it across all different kinds of probabilistic models.
Another big problem is which views. For some objects I can put a camera over here, but depending on where things are in the scene, it might be really challenging to think about where I want to evaluate how good this model is. I don't want to put the camera inside of the object, for example. And then also which 3D representation. Nowadays we have a plethora of choices: you could use splats, you could use NeRFs, you could use all these different things. And the 3D representation you use might change depending on the setting that you care about. So who here doesn't know about diffusion models?
Oh wow, that's great. So the general gist of a diffusion model is that it's a way of modeling high-dimensional continuous distributions. We pair the data with a simple destructive process where we take that data and add more and more noise to it, until eventually we've degraded all the structure that's present in the initial image, in this case. And then what we're learning to do is reverse this process and slowly introduce more structure back into the data.
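In the standard DDPM-style notation, which is not specific to this talk, that destructive process and the denoising objective it trains against look like this:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\, I\big),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{t,\ x_0,\ \epsilon \sim \mathcal{N}(0, I)}
\big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big) \big\rVert^2 .
```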
And diffusion models are great if what you care about is sampling. So you've trained on this big data set, for example, 2D image, and you want to sample 2D images. But in 3D, we don't actually care about sampling 2D images. What we really want to do is back out and infer some kind of 3D structure. And one approach for this is you can think about, well, we're building parameterized images. There's some parameters of the nerf or a generative model, and we can use those to create an image. And then we'd like to evaluate how good is that image.
What we're missing here is a loss function that we can use to score these generations or renderings. If we do have that loss function, it's differentiable, then we can back propagate into the image and then back from the image to the parameters of that generative model.
And the idea that we proposed in DreamFusion was built around this idea of probability density distillation, and so we called it score distillation sampling. And I guess another way of thinking about diffusion models is that they learn a sequence of marginal distributions that start from a clean data point and map to noisier and noisier data distributions. And these noisier distributions are often simpler. They're smoother than the initial data density.
And what we want to do is maybe pick out a single mode of this complicated data distribution. So here you can see P of X is the complex data density defined by the diffusion model. And we just want to infer one mode of that distribution. And the hope is that that mode might be a good-looking sample.
And we do this not just at one noise level in the diffusion model; we can average it across all these different noise levels. And this allows us to learn a loss function that is applicable to any differentiable image representation. And here what's nice is that, while we don't have explicit access to the marginal distributions in diffusion models, we do have access to the gradient of their log density, which is all that we need to evaluate this loss function. So in DreamFusion, we combine the
score distillation loss with a 3D representation from NeRF. So if you want a peacock on a surfboard, you start with a randomly initialized NeRF, and you can iteratively optimize with the score distillation loss. And over time that builds up a 3D model that looks good from all these novel viewpoints. And at the end of the day, after optimizing this 3D model, you get out, hopefully, a high-quality 3D asset that you can use in different ways.
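For reference, the score distillation gradient from the DreamFusion paper, in the usual notation: theta are the NeRF parameters, x = g(theta) is a rendering from a sampled camera, y is the text prompt, epsilon-hat is the frozen diffusion model's noise prediction, and w(t) is a weighting over noise levels:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
= \mathbb{E}_{t,\,\epsilon}\!\left[ w(t)\,\big(\hat\epsilon_\phi(x_t;\ y,\ t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right],
\qquad
x_t = \sqrt{\bar\alpha_t}\, g(\theta) + \sqrt{1-\bar\alpha_t}\,\epsilon .
```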
And what's cool is we didn't have to use any 3D data to create these text-to-3D generations. And on top of that, we maybe don't have any 3D data at all for a lot of these categories. So if you collected a 3D data set, almost all of these kind of text-to-3D generations might be out of distribution.
But the more I played with these text-to-3D systems, the more it's like gambling where you come up with a text prompt, you hit go, you wait a while, and the results stink. And then you do it again and again and again. And it's not a very fun form of control, and it doesn't allow you to ground these 3D generations in the world. Especially if I take a photo, I don't want to take that photo, describe it with text, and feed it to a text-to-image model. I would like a better way of grounding it in real scene content.
So in some follow-up work on Reconfusion, we tried to generalize this method from conditioning on text to conditioning on images. And if we go back to our bulldozer example, here's the original 3D reconstruction of this bulldozer model from three images. And if we apply a method that uses a generative prior at these novel views, you can see that we can accurately recover novel views and decent geometry from just three input images.
So how does this work? It's very similar to the earlier work, but we're going to augment a 3D reconstruction pipeline with a model that's not conditioned on text to describe what this novel view should look like, but it's conditioned on images. And so what should a novel view of a scene look like? Well, we often have one or a few images of what that scene is. And so what that novel view looks like should be very informed by what the existing content is in the other 2D pictures that I captured.
So like, what might this image look like? And the idea was to train a new kind of diffusion model that was conditioned on the set of input images and their camera poses. And then given some novel target pose, we want to predict this novel view. So it's still an image diffusion model. It's only producing one novel view. What should this look like from over here? But now you condition on one or many different inputs that you have in the scene. So I can take our lazy three captures of some image and then turn it into a 3D model.
And the architecture that we used to condition on the set of input views was PixelNeRF, which is an image-based rendering method. And this was inspired by earlier work like NerfDiff and GeNVS. And as input, you have a set of input images and their camera poses. You pass them through the PixelNeRF to get some rendered features at the target camera pose. And then you combine this as input into a typical text-to-image latent diffusion model, where we replace the text features with CLIP embeddings of these different image inputs.
And now, unfortunately, unlike the existing work that we did before, DreamFusion on text-to-image, here we need data that doesn't just have text and image annotations. We need sets of pictures and their camera poses. So we're way more restricted in terms of the kinds of data sets that we can apply in this novel view synthesis setting. Here we trained on a combination of RealEstate10K to get some real-world scenes, CO3D and MVImgNet, which are often orbits around objects but in context, and then a bunch of synthetic renders from 3D models from Objaverse.
And if you apply these methods to real-world scenes, you can see that you can get out decent novel view synthesis predictions. But one issue is that these images are predicted independently. We aren't modeling the correlation between views that happen when you have one 3D model. And so we had to design a procedure that could take these inconsistent 3D predictions or inconsistent 2D predictions and turn them into one consistent 3D model.
So here on the top you can see the results of the 3D reconstruction, and on the bottom are the samples. And so this is similar to DreamFusion, where we don't know exactly what the novel view should look like, so we have to generate a bunch of samples or use the optimization procedure to resolve these difficulties.
I think the big problem with all these iterative optimization-based methods for 3D generation is that they're really slow. DreamFusion takes around half an hour to create a 3D asset. ReconFusion, it was around an hour. And what could you do in that hour? You probably could have just gone out and taken more pictures of the thing that you were trying to capture. So it doesn't seem like a great practical solution for improving the efficiency and our ability to capture the 3D world. And if you're a robot, you don't want to wait an hour before you move your hand to reconstruct the 3D scene.
Another thing that we didn't actually show in ReconFusion was what happens if I put a single image into the system. And one of the issues with the ReconFusion work was that in areas of uncertainty, where you didn't know what should be in the scene, you often end up with blurring. And this is because those independent image observations would often conflict. And while we use these optimization procedures to resolve them, you're kind of fighting against this averaging out of all these different ideas for what it might look like from this novel view. So here are some single-image results for a hydrant and a bench.
So in our next work, Cat3D, for Create Anything in 3D, the hope was that we could address these problems of hallucinating novel content effectively. And the main idea behind this method was to address this problem of independence. We know that if I have a 3D model or if we have a video of something, the frames are correlated. And so we would like to model these correlations
and not just kind of resolve them post-hoc in our 3D extraction procedure. So here are some example samples from Reconfusion where we have three input images and then we have these independent output images. And we can resolve them, but it's a very slow process. And the main idea of this work was building on the amazing success of video diffusion models for jointly modeling the correlation between multiple images.
The model that we trained took a set of observed views as inputs. You could have a single image or a set of images. You also have to have their camera poses. And we encode both the images into a latent space and the cameras using a ray representation, which kind of represents the rays of the image that you're generating.
And then we also have a set of targets. We have where we would like to create outputs. And it's not just one place. We want to create a whole set of image outputs, and we want those outputs to be correlated such that they could be realized from a single 3D model. So we have the set of observed and unobserved views. We also add a mask to indicate to the video model which of these are observed, which of these are unobserved. And then we get out not just one view, but a whole set of views that we can decode back to an image.
And if we train this model on the same exact dataset that we trained Reconfusion on, we can see that this model is successful at learning correlations across images, and the resulting samples that we get out are already pretty consistent. But they're not perfectly consistent, and they don't allow that kind of interactivity that we might want from a real 3D model.
So what did we do? All we did is we took a single input image or set of input images. We generate samples using this multi-view latent diffusion model that gives us a generated set of views. And then we just feed that to a 3D reconstruction pipeline. And there's some additional tricks that you need, like using a robust loss, which allows for some reconciling of these different details across different views. But this whole procedure now just takes a minute instead of an hour.
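One common instance of the "robust loss" he mentions is a Charbonnier loss, which behaves like L2 for small errors and like L1 for large ones, so pixels where the generated views disagree contribute less gradient. This is a generic sketch, not necessarily the exact loss used in CAT3D:

```python
import torch

def charbonnier_loss(rendered, generated, eps=1e-3):
    """Robust photometric loss: quadratic near zero, roughly L1 for large errors,
    so outlier pixels from slightly inconsistent generated views are down-weighted."""
    diff = rendered - generated
    return torch.sqrt(diff * diff + eps * eps).mean()

# Usage inside a reconstruction loop: render the current 3D model at a generated
# view's camera pose and compare against that generated image.
rendered = torch.rand(3, 64, 64, requires_grad=True)   # stand-in for a rendering
generated = torch.rand(3, 64, 64)                       # stand-in for a sampled view
loss = charbonnier_loss(rendered, generated)
loss.backward()
```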
And here's some examples comparing the Reconfusion results to the Cat3D results. All these are conditioned on three images, and you can see not only is it faster, but also if you look especially at the backgrounds, you get way higher quality hallucinations in regions where you actually had uncertainty.
What's cool is this works on single images, unlike the ReconFusion work. So here's a picture of Howie, a very cute golden retriever puppy, and we can take the single picture and then render it and create a 3D model that works from novel views. And here, if you just, for example, had an RGB and a depth map and tried to warp, you wouldn't be able to have the same degree of freedom for moving around and visualizing the scene. This is my grandma's dog, Lola.
And it doesn't just work on real-world images, you can also use text-to-image models to first cascade a text-to-image generation with an image-to-3D creation. So here's a factory robot assembling intricate electronic components with precision. Here's a goblin of some kind, some other creatures, and it even works on some small-scale scenes.
I think what's really fun about this is I've been really sick of just staring at like 360 spins of objects for the past two years, but now we can turn these into real interactive 3D models. And I'd encourage everyone to check out the website and play around with this. It feels fundamentally different when you get to interact with something than when you just have a video that's playing right in front of you.
There are several important bits to get this working. I mentioned the robust loss. I think a huge open question is how do you decide where to put the cameras? It should really depend on what content is in the scene. And right now we have kind of a few discrete sets of camera trajectories that we chose for different scenes, but it'd be great to find ways of learning where to place the camera as well.
The way that you do the camera conditioning can impact the quality of the results in this multi-view latent diffusion model. And what's nice because we kind of have the set-based representation versus the ordered representation of video, we can come up with different and more efficient sampling strategies to create a number of frames in parallel.
So what's left? I think this is maybe an interesting toy, but not useful yet. I think one of the biggest issues is that we moved away from just using large-scale text and image data to requiring posed multi-view data. And if you've tried using some of these state-of-the-art systems for posing, they often don't work when you have a lot of dynamics in the scene. So I think it's still an unsolved problem how we actually scale up these methods and get accurate camera poses if we want to train camera-conditioned latent video diffusion models.
The recovered geometry is often inaccurate, even though the novel views look good. And as I said, the camera trajectories don't consider the image content. I think one of the biggest issues is that the scene and input are often assumed to be static. And there really is no such thing as like static 3D videos. If you look at a lot of the data sets, as I move through the scene, I cast a shadow into the scene that changes as I move around it. And that's present in these data sets as well. So we really need to find models that can work with dynamic scenes as well as static scenes.
Okay, so what were some takeaways from this work? I think that in the CAT3D work we found that separating 2D priors from this 3D inference process, by first sampling and then reconstructing, was a really flexible and efficient framework. Unfortunately, it does require more expensive multi-view or video models to generate those correlated samples.
And something that I've been frustrated by is these optimization-based inference methods like score distillation and variational score distillation. They can handle uncertainty; they allow you to express an uncertain prior over what these novel views should look like. But they're way slower, they're lower quality, and it's more complicated. And I still think there's a big gap between the sample quality you get out when you naively sample from these models and when you use the optimization-based approach for sampling. So I think there's still room for a lot of innovation in inference methods.
And I think the other thing that people don't talk about in the 3D space is these 3D models are useless. They often don't have good enough geometry. They don't estimate the material properties. If you turn them into meshes, the topologies are not actually useful. They have baked-in lighting. So if we actually want these to be useful as, for example, assets in a game, there's still so much more work to be done. So with that, thank you very much for your time and happy to take any questions if we have time remaining.
I have a question. There is a recent work from ICML last year, multi-diffusion. They generate panoramas. Would it be possible to combine something like this, some of this approach for 3D scene? Because it's also about consistency, generating different scenes and so on and so forth.
Yeah, I think the multi-diffusion work is very cool. It allows you to take a lower dimensional model or a model on a smaller number of pixels and extend them. And there you can think about the way that they resolve differences between, for example, different frames that's averaging within the diffusion process.
There's some people who have tried this for 3D as well, where you can think about resolving this inconsistency in 3D over the course of the diffusion process, but it's often a little bit more finicky because to update that 3D representation might require multiple steps of optimization where you can't analytically solve for this update and average them together. But yeah, I think it's really cool to think about how to combine
different ways of doing conditioning and guidance and sampling in diffusion processes to enforce more of the consistency at sampling time, as opposed to just letting the model do whatever it wants sampling-wise and then doing something on top of the samples. Thank you so much. Does anyone have any questions? Hi, I think from all these three works, it seems that going from DreamFusion
to ReconFusion and to CAT3D, the more multi-view you are modeling in 2D, the better 3D you get. Is that the conclusion, maybe? You just do everything multi-view to the extreme: you have a model that can generate 200 images at once, so you don't need any optimization, right? Yeah. So would that be the future? Is that extreme, or maybe something in the middle?
Yeah, I think what's very frustrating and I think very broken about these models is there's this back and forth. So, you know, how much of the structure do you put into the multi-view prior or the video prior or 2D prior? And how much of the structure do you kind of extract afterwards? And it seems very bizarre to me that we put all this effort into training these 2D priors. We train them until they're 3D consistent. And then only afterwards do we touch anything in 3D. And going from like Reconfusion to Cat 3D, we removed 3D structure in the diffusion model and it got better. And so I think
These existing methods don't really support the kind of real-time interactive generation. Maybe I want to start capturing the scene and have it fill in the details and iteratively update some 3D structure. And we don't really have methods that do that right now. It feels a little bit broken. Ideally, you could build one system that gave you the 3D outputs and could learn from solely image-based data.
And there's some cool work like Viewset Diffusion and RenderDiffusion that tries to build diffusion models that have that 3D structure inside of them. But so far, the performance of these methods hasn't been as strong, because we don't really know how to scale them and train them on bigger data sets like we do for these more pixel-based models. So I'm not sure which way it goes, but I hope we have more hybrids and find ways of incorporating that 3D structure into the 2D models as well. Hello, thank you for your talk.
I think we always see in the literature this effect, particularly in text-to-3D generative models, where you have these oversaturated colors when you're generating assets. I was wondering if you have any more intuition about what might be causing this? Yeah, it's a great question.
My initial intuition was that this comes down to a couple of things. One is that we have all these tricks for diffusion sampling, and also for these distillation approaches that are built around guidance. And so you're going in a direction that matches the text prompt better and moves away from the unconditional prior.
And, you know, if I have a frog, you know, frogs are often green. And so in the data set that we have, they might be biased to green things. So putting a green background there might often lead to a higher density mode, but that might not be a good sample. So I think a lot of the issues with this kind of oversaturation and contrast come from how broken and bad the loss functions are, combined with how hacky it is to use classifier guidance for solving these problems. And that's kind of why we've moved to more of these sampling-based approaches, is that they just work
You don't have to worry about the artifacts that you get out of optimization. But I think that's like, this is like a gap. I wish that we had a better explanation for why these artifacts appeared because you do see them with classifier-free guidance as you crank it up that you get over contrast and over saturation, but not nearly to the degree that we get in these optimization-based methods like score distillation. It's very frustrating. Yes, thank you. Believe it or not, there were other research labs than DeepMind represented at ICML.
We stay in the generative modelling workshop, but transition to Ricky T.Q. Chen of Meta AI, aka FAIR, who presented the most approachable explanation of the flow matching technique in generative modelling that we have yet heard. So I wanted to give a brief talk on flow matching or this sort of...
idea of flow matching and applying it to various different kinds of domains, from Euclidean to Riemannian to discrete domains. We recently had a paper that I put out called Discrete Flow Matching, which basically uses this sort of flow matching recipe, I'll call it, as a way to motivate a way to construct
generative models over discrete domains. But really, it seems like you can use this recipe, this very abstract notion of a way to build a generative model, and apply it to any sort of domain.
Let's get started. The goal of this talk is to discuss a few different application domains, but also I want to say that in all of these, there is one very simple process to build a generative model over these domains, and they share the same sort of underlying principles.
And the idea is, I think more people are familiar with the Euclidean space. So here on the top left, we start with parameterizing some velocity. And with that, if we transport the particles according to that velocity, we also change the distribution of those particles according to some law. And here we can, yeah, we apply this to material generation. We also applied this in the discrete domain to code generation and text generation just to scale it up and see what happens.
So I'll put the end, the last slide here at the very beginning, which is maybe a quick preview of the flow matching recipe, I'll call it, at least for this talk.
I'm not sure if I'll call it this later going forwards, but here we just want to define conditional velocities, very simple velocities conditioned on x1, that generate x1. So these UTs and these transport formulas on the left: if I start from xt and advance to xt plus h, then if I follow this velocity, then
I basically transform a particle according to this PT given X1. And in particular, PT at time equal to one is gonna become a Dirac distribution centered at X1. So if this is the case, that is I can create these conditional velocities that basically just generate a single data sample,
then basically the learning problem becomes: I just want to learn the expected velocity, and that's it. So given an XT, this expectation is over X1, which is sampled from some data distribution. It turns out that if you learn this expected velocity and follow it with the same transport rule that you have on the top left, that allows us to generate from the data distribution that we trained this expected velocity from.
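In the standard flow matching notation, the recipe he is previewing amounts to learning the conditional expectation of the per-sample velocities and then integrating it with Euler steps:

```latex
u_t(x) = \mathbb{E}\big[\, u_t(x \mid X_1) \;\big|\; X_t = x \,\big]
       = \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1,
\qquad
x_{t+h} = x_t + h\, u_t(x_t) + o(h).
```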
And this relationship is because of something called the continuity equation. And it's actually because of this linearity of the divergence operator as well. And I'll get to these, I'll unravel this explanation in a bit.
But just to sort of set the scene, let's start with the Euclidean setting again, which is what most people are familiar with. So assume we have some samples from a data distribution Q of X1, and we're going to construct these conditional probability paths, PT of X given X1, such that they basically converge to a delta at X1, right? And if you think about all these probability paths, and we're going to just marginalize that over the Q data distribution on the bottom, this marginal probability path is going to generate
the data distribution at time one, right? Starting from some whatever noise distribution that we've set. In particular, time one is a data distribution. Now it turns out that if you just look at the velocities that generate these conditional probability paths, that is, if I were to follow these velocities, then I create samples that are marginally from this PT given X1. Now I want to also marginalize the velocity, right?
in the sense that I take the conditional velocity and I take an expectation over p1 given t. So p1 given t is the conditional distribution of X1 given X at this current time point. You can think of this as like a responsibility, right, if you're familiar with Gaussian mixture models. Basically, we weight the conditional velocities by this weighting p1 given t, and that's how we define the marginal velocity. This is the thing that we're gonna fit to in that expectation.
And it turns out that there's a really simple sort of explanation for linking this behavior between the marginal probability and the marginal velocity. And to get to there, we need to start thinking about, okay, how do we connect the velocity to the probability of the distribution that we're transporting these samples? And that relationship is from this continuity equation, right? So in particular, this continuity equation says at a certain point x, the change in probability at that point is related to the negative divergence
of this velocity field times that probability. So what is the divergence? The divergence is, at that certain point, let's take a small region; people usually take a ball, or you can take a hypercube around that point, and then we look at all the outflow from that area, subtract the inflow. So how much mass am I losing from this area? You take that area to be infinitesimal, and the divergence is basically the continuous approximation to how much mass I am losing from this current x.
So the change in probability is going to just be the negative. If I lose mass, the probability goes down. So that is the very basic relationship between velocity and probability. And if we assume the continuity equation and we basically can find a velocity that generates a conditional probability, then we've
Basically, this is the three-line proof for flow matching on Euclidean space. The first line is just, it's by definition, we're defining the marginal PT as just a mixture of conditional PTs, and then we apply the continuity equation, assuming that we have these conditional UTs in hand that can generate these conditional probability paths.
The third line is just an interchange of the divergence operator with the integral; it's a linear operator, right? It's because of this exchange that we can basically move the integral inside, and now we just have this definition for UT, which is the marginal velocity field, right? And we've said, okay, well, this is the form of the continuity equation, and so this marginal velocity field must be generating the marginal probability path. It's a really simple three-line proof.
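Spelled out in symbols (this is the standard argument, not a quote from the slides), the three lines are: differentiate under the integral, apply the conditional continuity equation, then use linearity of the divergence:

```latex
\partial_t\, p_t(x)
= \int \partial_t\, p_t(x \mid x_1)\, q(x_1)\, dx_1
= -\int \nabla \cdot \big( u_t(x \mid x_1)\, p_t(x \mid x_1) \big)\, q(x_1)\, dx_1
= -\nabla \cdot \big( u_t(x)\, p_t(x) \big),
\quad \text{with } u_t(x) = \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1 .
```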
And we're going to be seeing the same proofs for other domains as well in this talk.
But just to complete the picture, here's what we did for flow matching. We're gonna just directly regress a VT, which is a neural network parameterizing the velocity, onto the conditional UTs. And this gives the optimal solution, which is the marginal velocity. The way that we proved it is we basically looked at gradients, and the gradient is the same thing in expectation as the gradient of the intractable flow matching loss, which directly regresses onto this marginal UT.
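Concretely, for the common straight-line conditional path x_t = (1 - t) x_0 + t x_1 with conditional velocity u_t = x_1 - x_0, that regression is just a few lines; the toy 2D data and the small MLP below are purely illustrative:

```python
import torch
import torch.nn as nn

vt = nn.Sequential(nn.Linear(2 + 1, 128), nn.SiLU(), nn.Linear(128, 2))  # v_theta(x, t)
opt = torch.optim.Adam(vt.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(256, 2) * 0.3 + torch.tensor([2.0, 0.0])  # toy "data" distribution
    x0 = torch.randn(256, 2)                                    # noise samples
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1          # sample from the conditional path p_t(x | x1)
    target = x1 - x0                    # conditional velocity u_t(x | x1)
    pred = vt(torch.cat([xt, t], dim=-1))
    loss = ((pred - target) ** 2).mean()        # conditional flow matching loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling: start from noise and integrate the learned velocity with Euler steps.
with torch.no_grad():
    x = torch.randn(256, 2)
    for i in range(100):
        t = torch.full((256, 1), i / 100)
        x = x + 0.01 * vt(torch.cat([x, t], dim=-1))
```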
Okay, so I mean just to show some examples. This is, you know, flow matching applied to text to image generation. You have some text. People like to look at these figures.
- All right, all right, so let's start getting a little bit more interesting now. Okay, so a lot of people here in the audience like structures, so we're not gonna stay in Euclidean space for too long. There's a need to consider non-Euclidean structures, as the entire panel discussion was about. So I won't really go into the motivations anymore; I don't think there's a need to do this. There's a lot of different domains that have imposed structure, and we really want to be modeling that type of structure, maybe explicitly, in the generative model itself.
In particular, if you think about Riemannian manifolds, that is, manifolds where locally we basically have a kind of first-order approximation to the manifold: locally, it's a Euclidean space, right? So people call this a tangent plane. At every location, there's a tangent plane. And because this tangent plane is just Euclidean, we can also just define vector fields on this Euclidean space, and everything from the Euclidean definitions for continuous flow matching and conditional flows kind of just naturally extends to the Riemannian manifold setting. And in particular, we can just replace the continuity equation's divergence with the Riemannian divergence. I won't go into any of the details on this.
But for the sake of being complete, let's look at the continuity equation again. So assuming that we have a UT that satisfies the continuity equation with that Riemannian divergence, I've just replaced that in the second line. So here the integral is different.
It's not the same volume, right? There's a different volume element and it's a different manifold. There could be boundary conditions. I'm just sweeping all that under the rug in this notation. And the only thing that's important is now I can still exchange this divergence and the integration. And basically there's still a UT that is the expectation of the conditional UTs. So there's a lot of complexity in here. It's not clear how to even find a conditional UT. I won't really go into that in too much detail.
So one thing we did at Meta was we basically applied this to material generation. This is an idea that you want to generate a crystal or a material which is represented as an infinite set of atoms that's repeating in every single direction. So the way we represent it on a computer is we represent only a unit cell. And we just assume this unit cell just repeats in all directions. So basically, this unit cell has a periodic boundary condition. That's the manifold we're working with.
Now, a material is basically, we call it stable when it can actually be synthesized in the real world. It's the most basic but the most important property. It's not clear how to even check if a material is actually stable; people usually rely on a database. But we basically applied Riemannian flow matching to try to generate materials, given a set of stable materials, and see if we could generate something that's novel and also stable at the same time.
One thing I'll say about this is that it was actually surprisingly hard to combine manifolds with equivariances at the same time. So, for example, when people use, let's say, a point cloud, you want to impose some sort of translation invariance. The way people do this is you kind of just remove the mean, right? You take the mean of the point cloud, you just subtract it. And then when you define the flow or the diffusion process, you basically take a zero-mean noise variable. And so the path basically is always zero mean.
But there's no zero mean in this periodic boundary condition, because there's no origin, right? It's a periodic space. So we actually had to project the velocity field to make sure it doesn't actually move the mean. So there are a few tricks in here that were actually a bit surprising, but I think it's worth thinking about, like I said: in particular, if you're interested in structure, how do you combine different types of manifolds with different types of equivariances? It's not actually an orthogonal combination.
But anyway, so yeah, we took a look at this and the material is represented by three components, a unit cell,
which is basically like a deformation of 3D space, 3D coordinates. And then we have these fractional coordinates, which define where the particles are, where the atoms are inside this unit cell. And then we have the atom types for each of these atoms. So Riemannian flow matching was great on the continuous variables, which are these unit cells and coordinates. It was basically, you know, we could say state-of-the-art, or at least it improves a little bit over the diffusion baselines.
So that's conditional on the atom types. We just try to find a stable conformation. But if we wanted to do de novo generation, that is, we want to generate a completely new material from scratch, including the atom types,
it was okay, it was not great. I would say it was on par with some of our baseline LLM approaches, but it wasn't a big change. So that was a little bit disappointing, I think. So maybe I'll explain a little bit. So here the atom type we basically represented as a continuous embedding, so embedding into continuous space, and then we just did regular Euclidean flow matching to try to learn the atom type.
And then during sample generation, we would just take the nearest neighbor and say, "Okay, this is the atom type of my sample." But that didn't work very well. So we wanted to go a bit further and say, "Okay, well, can we just take what we learned from flow matching and apply it directly to discrete space?" In the sense that we don't assume any sort of metrics, we don't assume any sort of continuous space. We just have a bunch of different possible values for my samples.
And, you know, what happens? So before I get into actually the -- yeah. So the answer is yes. Our work is not the first, definitely not. There's a really interesting work from Campbell et al., also presented at ICML, on generative flows on discrete state spaces, which is basically a continuous-time Markov chain approach. And it's really nice. A lot of the things that I'm going to go over are sort of also in that paper, just maybe with a slightly different spin on it.
Okay, so the first thing is: what is a velocity? So I said we were gonna just take the expected velocity, but what even is a velocity for transporting a discrete sample Xt, right? So in the continuous case, we just added a little offset to the particle itself. In the discrete case, we're gonna add a little offset to the probability, right? So we're gonna start from a Dirac distribution centered at Xt, and then we're gonna modify that distribution and then sample from it.
So as long as we... Sorry, this is like a preview. I'll actually prove that this is correct. I'll justify this Euler sampling in a few slides. But this is a preview to try to understand what a velocity is. So... Oh, sorry. And people also call this the rate matrix, for the continuous-time Markov chain fans.
So here the updates are independent per dimension or token as we say it. And as long as we can find a velocity that basically implies that this new variable, this Xt plus H is following the probability path, then this is our definition of a velocity.
So again, in the continuous case, here are some visualizations. We basically model for each dimension, each coordinates, some change, and then we're gonna just take all the changes at the exact same time. So we're just gonna move the particle according to that vector field. In the discrete space, again, we have these axes aligned. This is sort of visualizing the discrete space on a grid, but it's just a visualization. We don't impose any sort of neighboring information.
But basically for each coordinate or each token, there is a set of possible values that we can move to. And this ut is essentially a change in the probability mass from xt to some other states. This will be a bit more clear if you stare at this equation for a bit longer, right? There are some additional constraints. In the continuous case, the velocity could just be anything. We could just always move; we're assuming Euclidean space, no boundaries, nothing.
But in this setting, UT needs to satisfy certain constraints. Because we already start with a PMF there. The delta XT is itself a PMF. It's just 1 at XT and 0 everywhere else. It's already at 1, so it's already 0 when the point is not XT. And so the only way to make the right-hand side a valid PMF is to make sure UT is positive or non-negative when XI is not equal to ZI. I'm talking about this constraint here.
And the other constraint is we need to make sure the normalization constant stays the same, right? So here it normalizes to one, or sums to one, right? We want to make sure this is a valid PMF, and so we need to make sure that this UT sums to zero. And so what that implies is basically there's, you know, at XT, if this point is XT, this UT needs to be negative, and if it's something not XT, then it needs to be positive. That's it. I'm just saying there's some additional constraints on this.
And so the velocity, basically, here is just modeling the transport of probability from one state to another. If you're at this current state, you have some positive mass, and I say, okay, well, with probability 20%, I want to move 20% of my mass to some other state; then for each particle, I'm just going to flip a coin: with probability 20% I'm going to move, with probability 80% I'll stay, right? Something like that.
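A toy version of that coin flip for a single token over a small vocabulary, written as a generic continuous-time Markov chain Euler step (this is a sketch of the mechanism, not the paper's code; the rate values are made up):

```python
import torch

def euler_step(xt, u_row, h):
    """One Euler step of a continuous-time Markov chain for a single token.
    `u_row[z]` is the rate of moving from the current state to state z;
    the diagonal entry is negative so the row sums to zero."""
    D = u_row.shape[0]
    probs = torch.zeros(D)
    probs[xt] = 1.0                 # delta distribution at the current state
    probs = probs + h * u_row       # shift probability mass according to the velocity
    probs = probs.clamp(min=0)
    probs = probs / probs.sum()     # guard against numerical drift for large h
    return torch.multinomial(probs, 1).item()

# Example: from state 0, move to state 3 at rate 2 (so leave at total rate 2).
u_row = torch.tensor([-2.0, 0.0, 0.0, 2.0, 0.0])
x_next = euler_step(xt=0, u_row=u_row, h=0.1)   # stays with prob 0.8, jumps with prob 0.2
```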
So, okay, let's actually derive that. That was just a sort of preview for what a velocity would be that coincides with the continuous time Markov chain equations. So let's start again with the continuity equation, right? But here we're gonna try to define discrete divergence. So divergence is again, outflow minus inflow. It's the amount of mass moving outside this domain, outside this node, let's say, this point, minus the amount of mass that's moving in.
So in the discrete case, we can basically just assign values to edges on a graph. And this edge will just denote how much mass is moving from one node to another. It's gonna be, when we compute divergence at a certain point, we're just gonna take all the outflow minus the inflow.
Let's say this V is our flux or current. This is the scalar function that is defined on the edges of the graph, right? And we want to sum over all the domain, right? Not axis aligned, this Z is just over the entire discrete space where each coordinate has D possible values and there's N different discrete variables.
Here's where we make an assumption. We're going to assume the single-token-change graph, which basically means that v is only defined, or only non-zero, when x and z differ by one token, by one coordinate. If they differ by more than one, it's just zero. So we're explicitly making this assumption that the velocity, or this flux right now, can only modify one token at a time.
I'm saying this is slightly different from previous treatments of this, where they define the continuity equation or the Kolmogorov equation in 1D and then sort of justify why to do it in high dimensions. Here we're going slightly in reverse: we start with that continuity equation defined in high dimensions, and we're just going to make an explicit assumption, because we don't want to be modeling a complete graph over this really large space.
All right, so we're gonna assume the same form of flux as in the continuous setting. So v is gonna be p times u, the probability times the velocity. So we're gonna make that assumption here as well. And if we do some algebra, we arrive at this equation for the divergence at a certain point x. So this is our definition for a velocity field. There's two ways to think about this. The continuity equation is a way to
it's a way to build in the relationship between the velocity field and the probability. I'm right now thinking about it as a way to define the velocity. So given the probability, how do we define the velocity field? So there's two things, right? So one thing is if we have a UT that satisfies this discrete continuity equation, then the Euler parallel, the parallel Euler sampling that I just described a few slides ago is now justified.
In particular, if we take the first order approximation of PT, so PT itself is going to be just the expectation of an indicator function, right? Here again, we're going to
look at this as an expectation as well. So this is, we're just sampling from PT and then this is the term inside. If we do a little bit of algebra and move a lot of the terms that are little o of h or higher outside, we arrive at this expression, which is the Euler sampling. So for each coordinates, we're going to just sample independently for that coordinates and the rest is just little o of h.
Okay, so by assuming this continuity equation, we've justified the Euler sampling. And also, if we assume this, then we can also prove that the marginal velocity, this flow matching recipe also holds, right? So everything is the same. The first line is the definition of PT. It's the mixture of conditional PTs. The discrete continuity equation shows up, and then we're going to just exchange
what happens inside with this sum over x1 and the sum over x1 gives us the marginal velocity. So again, if we take conditional velocities that satisfy this, we define the marginal velocity, then if we have access to the marginal velocity, we can transport using this Euler sampling. That's what we're saying.
So how do we actually define... that was a little bit abstract, so let's actually define this discrete path and maybe look at a few special cases that are very strong in practice. So first off, we're going to just say, let's define the marginal PT as a marginalization over a bunch of conditional PTs, and then each conditional PT is independent in each dimension.
So a special case of our framework, we basically work with like an arbitrary mixture of m different distributions, but let's consider two different distributions. So conditional x0 and x1, xi only has two possible values. It's either going to be x0 or x1, and it's just a mixture between the two with some probability kappa t.
There is another special case which works really well, which is x0 is going to just be the completely masked state. So given a sequence, I'm going to just start from the completely masked sequence, and then I slowly unmask each token until it reaches the x1 distribution.
with some probability kappa t. Now, this is very effective. It's a little bit unsatisfying that it actually is this effective, but it's a very effective probability path. And it's used by many works. It's related to masked language modeling, obviously. So yeah, I just want to bring up that these works also do a very good job at generalizing this construction a little bit to learn different components. Okay.
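A sketch of what sampling with that masked path looks like in practice: start from an all-mask sequence and, at each step, unmask a fraction of the still-masked tokens by filling them in from the model's prediction of x1. The denoiser below is a uniform-logits stand-in just so the loop runs; the linear schedule and step count are illustrative choices, not the paper's.

```python
import torch

MASK, VOCAB, LEN, STEPS = 0, 100, 16, 8

def toy_denoiser(xt):
    """Stand-in for a trained p(x1 | xt) model: returns logits over the vocabulary
    for every position (uniform here, just so the sketch runs end to end)."""
    return torch.zeros(xt.shape[0], VOCAB)

x = torch.full((LEN,), MASK)                     # start from the fully masked sequence
for s in range(STEPS):
    t0, t1 = s / STEPS, (s + 1) / STEPS          # linear unmasking schedule kappa_t = t
    logits = toy_denoiser(x)
    proposal = torch.distributions.Categorical(logits=logits).sample()
    still_masked = x == MASK
    # Conditional probability of unmasking a still-masked token on this step,
    # chosen so the expected masked fraction follows 1 - kappa_t.
    p_unmask = (t1 - t0) / max(1.0 - t0, 1e-8)
    unmask = still_masked & (torch.rand(LEN) < p_unmask)
    x = torch.where(unmask, proposal, x)         # fill in newly unmasked positions
```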
But let's not worry too much about the mask state; let's work with this mixture of two deltas for now. One thing we proved in this paper is that there's a conditional velocity, and if you marginalize it, you get the marginal velocity that's highlighted in gray here. And there are two of them:
it's similar to the continuous case, where you can always add a divergence-free term to the flux, so there's an infinite number of velocities for the same probability path. It's very similar here. Basically, we can express the velocity either in terms of a p(x1 given x_t) or a p(x0 given x_t).
The first one makes sense when we want to solve things forward in time: we predict x1, and then we transport forward in time. The second one makes sense when we want to transport backwards in time: we predict a p(x0 given x_t), and it is transported backward.
Now, what's interesting here is that both of these velocity fields satisfy the continuity equation, so we can always plug either one in. In particular, we can use any combination of the two, multiplied by some coefficients. What we actually do in practice, or what actually works in practice, is to take a fairly large step with the forward-time velocity field and then a small step backwards. So this is similar to a predictor-corrector step,
where you take a small step backwards and do some extra compute to change the variables themselves without changing the marginal probability. In the mask setting, this gives a little more flexibility: the forward process can only unmask, and once you unmask, you cannot mask again. But if you add in the reverse-time velocity field, you also get the ability to remask and then unmask again. So that was a little bit of corrector sampling.
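Here is a hypothetical sketch of that forward-big, backward-small idea for the masked path. The specific rates are my choice, not the paper's: unmasking with probability alpha * h * kappa'(t) / (1 - kappa(t)) and remasking with probability (alpha - 1) * h * kappa'(t) / kappa(t) keeps the expected masked fraction decreasing by kappa'(t) * h to first order, so the marginal schedule is preserved while tokens get a chance to be revised.

```python
import torch

MASK = 0  # hypothetical mask token id, same convention as above

def corrector_step(x_t, x1_pred, t, h, kappa, dkappa, alpha=2.0):
    """One forward-big / backward-small step on the masked path (sketch only).

    x1_pred : (B, L) tokens sampled from the model's predicted p(x1 | x_t).
    kappa, dkappa : the unmasking schedule and its derivative.
    alpha > 1 scales the forward (unmasking) step; the excess is undone by
    remasking, so the marginal still advances by roughly h.
    """
    is_mask = x_t == MASK
    k, dk = kappa(t), dkappa(t)
    p_unmask = min(1.0, alpha * h * dk / max(1.0 - k, 1e-8))
    p_remask = min(1.0, (alpha - 1.0) * h * dk / max(k, 1e-8))
    u = torch.rand(x_t.shape, device=x_t.device)
    unmask = is_mask & (u < p_unmask)
    remask = (~is_mask) & (u < p_remask)
    x_next = torch.where(unmask, x1_pred, x_t)
    return torch.where(remask, torch.full_like(x_t, MASK), x_next)
```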
All right, another interesting thing is that this is very analogous to the continuous case we've been looking at. In the continuous case, if you look at the probability path as a convex combination of x0 and x1, people usually write the velocity in terms of the denoiser or the epsilon prediction. Here we have very similar things: this is the denoiser, this is the epsilon prediction, except now it's predicting the whole distribution rather than just an expectation.
And here are just some examples to show we tried it at scale. We basically trained a 1.7 billion parameter model to do text completion, to try to beat LLMs at their own game, which we kind of failed at, but we gave it an honest try. So the first thing is, given some docstring, we generate the code. And this is arguably a more
trustworthy way of evaluating large language models. It's not just based on some abstract metric; this code will either run and succeed or it'll fail.
The interesting thing is that because we have a completely non-autoregressive model, we can do any sort of code infilling. We don't need to condition on the left and then generate the right side; we can condition on arbitrary things that we want. So this is one property that's beyond just LLMs, but there's no good benchmark for it. And on the right side is just an illustration of the sampling process. Here it's just the pure mask path, with no remasking and no corrector steps: just starting from a masked
p0 and unmasking it. So here's more of a sanity check that we did on OpenWebText, learning a language model. In particular, equation nine is the masked path, which is what most people do, except we tune the scheduler and we also tune the corrector step. And equation ten is some combination of a masked distribution, a uniform distribution, and the delta at x1.
And here we see that equation ten, this mixture with the uniform, actually does better than the pure mask setting, but really only in the low-NFE regime.
At high NFE, the mask path is okay. The reason is that in the mask case, if you unmask one token at a time, it will be correct; but if you sometimes unmask two things at a time in parallel, you can end up with incorrect samples. So you really need to correct that, or allow a uniform probability path where the noisy state includes some other states and the model learns to correct itself later on.
And again, yeah, these are just the numbers for code generation with the 1.7 billion parameter model. We also tried discrete flow matching for image generation. There's no quantized Gaussian or special tricks; we just took the mask case again and tried it. Yeah, it seems to be slightly worse than continuous flow matching, but it's almost at 3 FID, which is pretty good.
So yeah, so that's the end of the talk. Like I said, this is the last slide of the talk. I just put it at the very beginning. So as long as we can define velocities that transport particles according to some PT, where PT arrives at x1 exactly,
then if we learn the expected or the marginal velocity, and then plug that back into the transport equations, we're going to get the distribution that we fit this marginal velocity to. And that's it. That's the recipe, and it seems to hold for the discrete setting as well. And here are my research collaborators at Meta. Some of them are also in the audience, so if you have any questions, feel free to ask them as well. All right, thank you.
Now that we understand the flow matching objective, we turn to its most famous application this year, in Stable Diffusion 3, presented in the paper Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. This paper also won a Best Paper award at ICML, and here is Patrick Esser, one of the original co-authors of Stable Diffusion under Robin Rombach, accepting it.
Hello everyone, my name is Patrick Esser. I'm presenting our work: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. All of this is the result of a great team of people who are also here and who are all on this slide. So currently we observe a big hype about scaling and it's really tempting to say that we can simply solve all of our problems by throwing enough money at it. And indeed
I would say the effectiveness of scaling can't be denied. Increasing the model size, the number of training examples, and the overall compute resources that we put into the training consistently improves the model performance. We've first seen this for language models, but we also observed similar trends for image generation, which is what we actually consider in our work.
But of course, scaling isn't free, right? It massively increases the development costs, because we have to put all the resources into training, but it also increases the operational costs, because sampling gets more demanding with bigger models. So to avoid burning money, we constantly have to keep improving the efficiency of both training and sampling. Three key questions motivated our work. First:
given that there are currently quite a few different formulations around diffusion models and flow matching variants, which of those are the most effective? The second question concerns the architecture: for the task of text-to-image synthesis, which we have in mind, we have to deal with two different modalities, and it's unclear which architectural design choices work best here. Finally, when we talk about scaling, we usually
have to measure progress, and we also derive scaling laws based on simple metrics such as a validation loss. But ultimately these are only a proxy for the downstream performance we're interested in, which might be sample quality, and we want to evaluate whether they are an accurate proxy for those properties. So let's start with flow matching and friends.
The common goal of these methods is basically to learn a vector field, which will be parametrized by a deep neural network, and that should generate a probability path between two distributions. In our specific case, we then usually consider one of those distributions to be a simple known distribution, such as a standard normal distribution, and the other one to be a data distribution of images.
The common starting point for learning such a vector field is to define a so-called forward process, which essentially just defines a trajectory between two samples from the distributions we're looking at. From this process, we can then derive a tractable regression objective, the so-called conditional flow matching loss, which allows us to recover a vector field that actually generates a path between the distributions.
The overall paradigm here is fairly general and for specific choices of the forward process we can actually recover a wide range of existing formulations and variants including EDM, DDPM and others. One of those variants is the rectified flow formulation and arguably this is to some degree the simplest choice you can make for the forward process because it's just a linear interpolation between the two samples.
This also leads to a very clean conditional flow matching loss. And overall, it really makes it very elegant and easy to work with. Also remember that sampling in this framework then essentially consists of integrating the learned vector field. And because of that, straight paths, as we define them in this forward process here, are actually really desirable because if they are straight, we could actually integrate them in a single step, which would massively improve our sampling efficiency.
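As a minimal sketch of that conditional flow matching loss for the linear path, using the convention from the earlier flow matching talk (x0 is noise, x1 is data) and a hypothetical model(x_t, t, cond) signature:

```python
import torch

def rectified_flow_loss(model, x1, cond, sample_t):
    """Conditional flow matching loss for the rectified-flow (linear) path.

    Convention (an assumption): x0 ~ N(0, I) is noise, x1 is data,
    x_t = (1 - t) * x0 + t * x1, and the regression target is the constant
    velocity x1 - x0. sample_t draws the timesteps, e.g. uniform or the
    logit-normal sampler sketched a bit further below.
    """
    x0 = torch.randn_like(x1)
    t = sample_t(x1.shape[0]).to(x1.device)
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))   # broadcast over non-batch dims
    x_t = (1 - t_b) * x0 + t_b * x1
    v_pred = model(x_t, t, cond)                # hypothetical model interface
    return ((v_pred - (x1 - x0)) ** 2).mean()
```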
So the conditional flow matching objective does not actually recover
perfectly straight paths, even if we define the forward process like this. But at least empirically, we have seen that compared to the vector fields you derive from diffusion formulations, they usually have less curvature, which makes them more sampling-efficient. They also come with other nice theoretical properties, like a straightening effect, which makes them really promising candidates for further improving sampling efficiency.
So overall, this makes rectified flows attractive candidates for efficient text-to-image synthesis. But before this study, they had mainly been considered in benchmark settings, and it remained a bit unclear how well they would actually perform in practice on more difficult tasks such as text-to-image synthesis. If we look at the conditional flow matching objective, it always involves a distribution over the time steps of the trajectory.
The classical rectified flow formulation only considers a uniform distribution over the time steps. But since, during training, we do a Monte Carlo estimate of this objective, that distribution can really affect the optimization that we perform.
And if we look at the loss and also the forward process, how we defined it, then we'll actually really quickly see that at the endpoints of the trajectory t equals zero and t equals one, the optimal solution really just involves an estimate of the mean of the two distributions. So we would expect this to be a very simple task in comparison.
And because of this, we started considering different time step distributions, which put less weight on the endpoints of the trajectory and focus more on the interior. Similarly, this is actually also a big part of the success of diffusion models for modeling images, because it allows us to
control where exactly in that trajectory we're putting in the most weight. And that way we can actually focus on the parts of the trajectory where the perceptually relevant aspects of images emerge. So to explore whether we can also benefit from this for rectified flow formulations, we explored various time step distributions that allow us to shift where that focus is.
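One such re-weighting, the logit-normal timestep distribution that comes up a couple of slides later, is easy to sample; the location and scale values below are illustrative defaults, not necessarily the paper's chosen hyperparameters.

```python
import torch

def logit_normal_t(batch_size, loc=0.0, scale=1.0):
    """Logit-normal timestep sampler: t = sigmoid(n) with n ~ N(loc, scale^2).
    It concentrates mass in the interior of [0, 1] and down-weights the easy
    endpoints; loc and scale shift where that focus lands."""
    return torch.sigmoid(loc + scale * torch.randn(batch_size))
```

Plugged in as sample_t in the loss sketch above, this shifts training effort toward the middle of the trajectory.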
To then understand which of those formulations is the most efficient, we actually just performed a study across 61 different variants overall.
And here we included many existing formulations such as the epsilon prediction with a linear schedule, which is the one used in stable diffusion, for example, a V prediction with linear or cosine schedules. We also include EDM and existing rectified flow formulations. But then besides those, we also included variants, especially of EDM and rectified flow, where we varied the hyperparameters of the time step distributions that are involved.
And if we then collect and evaluate those results, we see that the classical rectified flow formulation with a uniform time step distribution does indeed perform very strongly in the regime of sampling with few steps. But one of the strong baselines that emerged from this was an eps-linear scheme, and compared to that, classical rectified flow actually performs worse when we sample with more steps.
And really, in contrast, what we saw is that by introducing this particular timestep distribution, the logit normal distribution, we end up with a variant of rectified flow that actually performs better than all existing variants, both in the regime where we sample with few steps, but also we sample with many steps. Then after looking deeper into the generative process, we also considered architectural choices for text-to-image synthesis.
Overall, the goal was to focus on transformer-based architectures because of their good scalability properties. But it wasn't directly clear to us how best to integrate the two different modalities, text and images, which are required for our task. So one of our ideas was to introduce the MM-DiT block, which generally follows the design of a DiT block, but uses a separate set of weights for each of the two modalities.
To exchange information between the two modalities, we still use a full joint attention operation. A similar idea has also been used in vision-language models. And what we observed from comparisons is that this performed the strongest. Some of the comparisons we performed were against a simpler approach, where we use a DiT architecture and simply concatenate the two modalities directly.
We also considered UViT and DiT variants where we instead use a cross-attention mechanism to incorporate the text conditioning, because that had been applied very successfully in UNet-based architectures. But overall, we saw that this multimodal design really offered the best performance.
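To make the MM-DiT idea concrete, here is a stripped-down sketch of joint attention with per-modality weights; it leaves out the timestep modulation and the MLPs of the real block, so read it as an illustration of the weight layout rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class JointAttentionSketch(nn.Module):
    """Separate projection weights per modality, one attention over the
    concatenated token sequence (the core of the MM-DiT block, simplified)."""

    def __init__(self, dim, heads):
        super().__init__()
        self.heads = heads
        self.qkv_img = nn.Linear(dim, 3 * dim)   # image-stream weights
        self.qkv_txt = nn.Linear(dim, 3 * dim)   # text-stream weights
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)

    def forward(self, img, txt):                 # img: (B, Ni, D), txt: (B, Nt, D)
        B, Ni, D = img.shape
        qkv = torch.cat([self.qkv_img(img), self.qkv_txt(txt)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)

        def split(x):
            return x.view(B, -1, self.heads, D // self.heads).transpose(1, 2)

        attn = nn.functional.scaled_dot_product_attention(split(q), split(k), split(v))
        attn = attn.transpose(1, 2).reshape(B, -1, D)
        return self.out_img(attn[:, :Ni]), self.out_txt(attn[:, Ni:])
```

The design point is simply that queries, keys, and values come from modality-specific projections, while the attention itself runs over the concatenated sequence so the two streams can exchange information.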
So after figuring out efficient formulations and architectures, it is time to scale. To get a clean signal for progress during scaling, we evaluate the validation loss at fixed time steps. Comparing such a metric is only meaningful if we stay within a single formulation, but in that case it provides a very clean signal and a very efficient way to evaluate the model and derive scaling laws from it.
But ultimately, this validation loss really can only serve as a proxy for performance because ultimately we are interested in things like human preferences, prompt following, sample quality, et cetera.
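As an illustration of that fixed-timestep evaluation, here is a rough sketch; the batch keys, the model signature, and the NCHW image assumption are mine, not the paper's.

```python
import torch

@torch.no_grad()
def fixed_t_validation_loss(model, val_loader, ts=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """CFM validation loss on a fixed grid of timesteps, so that runs and
    checkpoints are directly comparable (no Monte Carlo noise from random t)."""
    total, n = 0.0, 0
    for batch in val_loader:
        x1, cond = batch["image"], batch["caption"]    # assumed batch layout
        for t_val in ts:
            x0 = torch.randn_like(x1)
            t = torch.full((x1.shape[0],), t_val, device=x1.device)
            x_t = (1 - t.view(-1, 1, 1, 1)) * x0 + t.view(-1, 1, 1, 1) * x1
            v_pred = model(x_t, t, cond)
            total += ((v_pred - (x1 - x0)) ** 2).mean().item()
            n += 1
    return total / n
```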
At this point, it was unclear whether we could rely on this validation loss being an accurate proxy for those downstream performance measurements. While there had been work on this in the language domain, that was not the case in the image domain.
To answer this, we performed a scaling study and evaluated the correlation between the validation loss and both automatic image evaluation metrics, such as GenEval, and human preference ratings. Our results showed that the improvements predicted by the scaling laws and the validation loss actually translate into quality improvements for text-to-image synthesis.
We saw similar results in other modalities like video synthesis. Overall, this makes us confident that further scaling will indeed improve content creation capabilities with generative models. During model development and scaling, we also had a few additional learnings that we share in our work, which I will quickly go over. One of the problems that emerged as you scale
is training instabilities, and here it was really helpful to learn from existing works. One thing we found particularly helpful was QK normalization, which stabilized training.
Another point I quickly want to mention: we reiterated the story that scaling improves performance, but it's also important to note that blindly following this quickly becomes inefficient. One example is the timestep distribution, which we actually have to adjust for different resolutions. If we don't do that, we lose a lot of performance, and you might say that we could simply scale up the base model further,
but the cost you would have to pay for that would be tremendous compared to if you properly fixed the problem. A similar result is related to aligning with human preferences, which gives a very quick, cheap, but effective boost in preference scores. With this, we then really obtained a high quality model that worked well across different resolutions, aspect ratios, and had a good prompt understanding and the ability to spell.
This was also reflected in human evaluations against other existing models. And with that, so long, and thanks for the attention. One of the most underrated applications of diffusion is in speech synthesis. There was also great work at ICML this year on speech, and with the rise of ChatGPT's voice mode, there is a lot of demand for learning about the fundamental problems and techniques.
Here, we will simply dip into two oral presentations on speech that we would highlight: NaturalSpeech 3, Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models by Ju et al., and Speech Self-Supervised Learning Using Diffusion Model Synthetic Data by Gao et al.
Hello everyone, this is Zhe Chengzhi from the University of Science and Technology of China. Today I'm very delighted to share with you the exciting progress in the field of zero-shot TTS: NaturalSpeech 3, Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. Let's begin by illustrating the difference between zero-shot TTS and traditional multi-speaker TTS through a practical example.
In the scenario of a traditional multi-speaker TTS task, the user may request the model to read a transcript in the style of a certain speaker, say speaker number one. This implies that the model should mimic the voice characteristics of that speaker.
Here, speaker number one has to be included in the training data. However, a significant limitation of this approach is its inability to extend to unseen speakers.
In contrast, zero-shot TTS offers a more flexible solution. The user can supply an audio clip as a reference to guide the generation process. For example, if the user submits a voice sample like this: "Marine engineering proves especially authoritative." Although this reference speech is short and unseen in training, a robust zero-shot TTS system can generate speech
with convincingly similar output: "For the past ten years, Kinsale had gone with me wherever science beckoned."
Zero-shot TTS is now revolutionizing the way we think about voice synthesis. These advanced models use vast and varied datasets during training, capturing the nuances of numerous speakers and a diverse range of acoustic environments. As a result, the model can use the knowledge gained at training time to generalize to an unseen speaker at inference time through prompt-based generation.
The key to zero-shot TTS lies in scaling up. Traditional TTS systems rely on clean data from recording studios, and these datasets typically consist of less than 1,000 hours of recorded speech. Zero-shot TTS systems instead harness the vastness of the internet, using large-scale crawled data from the web. This approach uses a TTS dataset
with a total duration exceeding 60,000 hours of speech. On the other hand, the scaling of the acoustic model is also remarkable. Models began with fewer than 50 million parameters, and current zero-shot TTS systems have expanded to 300 million to 1,000 million parameters.
Such scaling up also promotes a transition in data representation. Previous TTS systems often relied on representations based on human priors, such as the mel spectrogram. In contrast, zero-shot TTS systems now
use data-driven representations, such as those derived from a codec. Here is an example codec: it uses residual vector quantizers, applied recursively, to generate multiple representations in a coarse-to-fine manner. However, previous systems have not fully
achieved great success, due to shortcomings in speech similarity, speech quality, and speech prosody. This limitation stems from the intricate complexity of speech. For example, a short speech clip, while seemingly very simple, carries a rich amount of information: timbre, content, prosody, the recording environment, and so on. This information
is important to the overall naturalness of the speech. Motivated by this, we emphasize the importance of factorization, since modeling such complex, entangled information directly, as with raw waveforms or mel spectrograms, is difficult. Besides, factorization itself is non-trivial, since the RVQ structure fails to effectively disentangle information across RVQ levels.
NaturalSpeech3 applies factorization in both data representation and speech generation. For data representation, we apply a factorized codec which can decompose speech signals into different speech attributes while ensuring high-quality reconstruction.
For speech generation, we apply a factorized diffusion model. It is a unified diffusion framework to hierarchically generate each speech attribute in each subspace.
For FACodec, we consider four speech attributes: timbre, prosody, content, and acoustic details. We first apply a timbre extractor to obtain a global timbre vector, and then we apply three factorized vector quantizers to represent the other speech attributes, each in its own subspace.
For better disentanglement, we introduce the following techniques: an information bottleneck, which limits the representational capacity of each token; supervision, to encourage each quantizer to capture its intended attribute; gradient reversal, to remove redundant information; and detail dropout, to remove unnecessary information from the detail codes.
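Purely as a mental model of this layout, and not the paper's architecture, here is a toy sketch: one global timbre vector plus three per-attribute projections, each snapped to its own codebook.

```python
import torch
import torch.nn as nn

class FactorizedCodecSketch(nn.Module):
    """Toy stand-in for the FACodec layout: a global timbre vector plus three
    factorized vector-quantized streams (prosody, content, acoustic detail).
    Encoders, codebook sizes, and the quantization itself are illustrative."""

    def __init__(self, dim=256, codebook=1024):
        super().__init__()
        self.timbre = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.heads = nn.ModuleDict(
            {k: nn.Linear(dim, dim) for k in ("prosody", "content", "detail")})
        self.codebooks = nn.ParameterDict(
            {k: nn.Parameter(torch.randn(codebook, dim))
             for k in ("prosody", "content", "detail")})

    def forward(self, frames):                     # frames: (B, T, dim)
        timbre = self.timbre(frames.mean(dim=1))   # one global vector per utterance
        codes = {}
        for k, head in self.heads.items():
            z = head(frames)                       # project into the attribute subspace
            cb = self.codebooks[k].unsqueeze(0).expand(z.size(0), -1, -1)
            codes[k] = torch.cdist(z, cb).argmin(dim=-1)  # nearest codeword per frame
        return timbre, codes
```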
For the factorized diffusion model, we apply discrete diffusion in each subspace to sequentially generate the speech attributes in the order of duration, prosody, content, and acoustic detail. The timbre does not need to be predicted, since that global vector is available from the prompt audio.
In the forward process, we randomly mask out certain tokens in the sequence, and in the reverse process, the model learns to gradually recover the tokens under the guidance of context and conditions. To facilitate in-context learning, we prepend the speech attribute prompt to the sequence
as a prefix. This prompt acts as the condition and remains unchanged during the diffusion process. In the zero-shot TTS scenario, these
speech attribute prompts are derived from the same audio. As a byproduct, this prompt mechanism also offers great controllability, since we can select different speech attributes from various sources and tailor the output speech to meet specific requirements. We evaluate the zero-shot TTS capability
on the LibriSpeech and emotional TTS datasets, in terms of similarity, robustness, and overall quality. The impressive results demonstrate that NaturalSpeech 3 not only outperforms strong baselines but also achieves human-level naturalness.
We also tested the reconstruction capability of our FACodec against strong codec baselines on the LibriSpeech test set. This demonstrated that FACodec can reconstruct speech with high fidelity using these disentangled speech attributes. Here are some demos.
The first row is the three-second prompt, randomly cut from an entire utterance, and the second row is the NaturalSpeech 3 output for case one. "The standard made to hold another oil cup." So this is the prompt, and NaturalSpeech 3 can generate a similar-sounding sentence using this three-second prompt:
"There was an average cost per lamp for meter operation of 22 cents a year, and each meter took care of an average of 17 lamps." For case two, this is the prompt: "Is it not clear that there is just as much of the pencil left as..." And this is the output: "There were only four stationers of any consequence in the town, and at each Holmes produced his pencil chips, and bid high for a duplicate."
NaturalSpeech 3 can also generate emotional TTS in a zero-shot manner by prompting with an emotional speech clip. If you prompt NaturalSpeech 3 with a sad audio like this: "Dogs are sitting by the door." It can generate sad speech like this:
"Why fades the lotus of the water?" If you prompt the model with a calm audio like this: "Dogs are sitting by the door." The NaturalSpeech 3 output will be like this: "Why fades the lotus of the water?" And third, if you prompt with a disgusted audio like this: "Dogs are sitting by the door." The output will sound like this: "Why fades the lotus of the water?"
Our model is also capable of manipulating attributes by manipulating the corresponding speech prompt. Here is a demo. The first column is the original setting, where both the duration prompt and the other prompts are derived from the same audio; that is the zero-shot TTS scenario.
"Had she enjoyed the experience?" And the other prompts are the same: "Had she enjoyed the experience?" And the generated speech will sound like this: "The examination and testimony of the experts enabled the commission to conclude that five shots may have been fired." If we slow down the duration prompt, the prompt will sound like this: "Had she enjoyed the experience?" And the generated speech will sound like this:
"The examination and testimony of the experts enabled the commission to conclude that five shots may have been fired." If we speed up the prompt, it will sound like this: "Had she enjoyed the experience?" And the generated speech will sound like this: "The examination and testimony of the experts enabled the commission to conclude that five shots may have been fired."
We can also derive the duration prompt from another, new audio clip. This will only manipulate the duration attribute and will not affect other attributes, since the other prompts are kept the same. "Dogs are sitting right at the door." And the generated speech will sound like this: "The examination and testimony of the experts enabled the commission to conclude that five shots may have been fired."
If you are interested in our work, please scan the QR code for more samples. That's all for my presentation. Thank you. Good afternoon, everyone. My name is Yang Zhang from IBM Research. Today I'm going to introduce our paper, Speech Self-Supervised Learning Using Diffusion Model Synthetic Data, which is a joint collaborative work with UIUC and UCSB.
So let me start by briefly going over some background about speech self-supervised learning, or speech SSL. Speech SSL, just like SSL in other domains, assumes that we have a large unannotated corpus. We can then use this corpus to pre-train a speech representation network, which can then be fine-tuned on downstream tasks using a small annotated corpus.
The key to the success of speech SSL is that we actually need to assume that we have a large unannotated corpus. In most cases, the number of hours of this pre-training dataset should be at least 1,000 hours. However, in many cases, obtaining such a large dataset is not as easy as it seems.
Here, I'm borrowing a figure from a recent research effort on collecting speech datasets for over a thousand languages. The horizontal axis shows the languages and the vertical axis shows the number of hours collected for each language. As can be observed, for most languages, the number of hours is below 1,000. This means that obtaining a large pre-training dataset is simply infeasible in many cases.
So in cases where the number of pre-training data is limited, it becomes crucial to maximize the information extracted from the limited dataset. Therefore, we raise the following research questions: Do the existing SSL techniques extract enough information from the limited pre-training dataset? Could we further extract information that is possibly overlooked by the existing SSL techniques?
So in this paper, we propose DiffS4L, a speech self-supervised learning method that augments the limited pre-training dataset using a diffusion model. More specifically, assume that we only have a small unannotated dataset, say less than 100 hours.
What DiffS4L essentially does is augment the dataset with synthetic data, which is then used to perform standard pre-training with standard pre-training techniques. The entire data augmentation process consists of three steps. In the first step, we use this small dataset to pre-train an initial speech representation network.
Note that this initial speech representation network is of poor quality because of the limited dataset size, but it is sufficient for our purpose. Once this speech representation network is trained, then for each speech utterance drawn from this small dataset, we can obtain its initial speech representation.
As the second step, we feed this initial speech representation together with the speaker embedding, which is also extracted from the original speech, to a diffusion model. We then train this diffusion model to reconstruct the original speech. The diffusion model is also trained only on this small unannotated corpus.
So after this diffusion model is trained, we can then use the diffusion model to generate synthetic data to form this large synthetic data set. However, rather than directly feeding the initial speech representation and the speaker embedding as is to the diffusion model, we first pass it to a modification module. In this way, we can ask the diffusion model to generate a good variety of speech that is different from the original speech.
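Schematically, the pipeline just described looks something like the sketch below; ssl_model, speaker_encoder, diffusion, and modify are hypothetical callables standing in for the trained components, so this is the shape of the procedure rather than the paper's code. (Step 2, training the diffusion model itself on the small corpus, happens before this loop.)

```python
def build_diffs4l_dataset(real_utterances, ssl_model, speaker_encoder, diffusion, modify):
    """Hedged sketch of the DiffS4L data augmentation loop."""
    synthetic = []
    for wav in real_utterances:
        reps = ssl_model(wav)                   # step 1: initial SSL representation
        spk = speaker_encoder(wav)              # speaker embedding from the same utterance
        mod_reps, mod_spk = modify(reps, spk)   # step 3a: perturb the conditioning
        synthetic.append(diffusion.generate(mod_reps, mod_spk))  # step 3b: resynthesize
    return list(real_utterances) + synthetic    # pre-train on real plus synthetic speech
```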
So the remaining question is how we actually modify these speech representations. Note that speech is a rich information source: it contains many levels of information, including content information, speaker information, and prosody information. Therefore, the synthetic speech should also contain enough variation across all these dimensions.
So we designed the following four levels of variations in the synthetic speech. Level one is the original speech itself. Here is an example. So this is the original speech.
In the second level, we feed the initial speech representation and the speaker embedding as-is to the diffusion model. In this way, the diffusion model would generate something that is almost the same as the original speech. However, since the conditioning does not control everything, the output speech would be slightly different from the original speech, particularly in terms of prosody.
So let me play this audio. Please pay attention to the prosody of "which". Now this "which" has a rising tone, whereas in the original speech it had a falling tone. So this is the second level. In the third level, we
change the speaker embedding to a different speaker. In this way, the output speech still has the same content but a different speaker's voice. Here is the example: "And sharing her house, which was nearby." It now becomes a different male speaker. Finally, in level 4, in addition to changing the speaker embedding, we also partially mask out some of the speech representations.
In this way, the diffusion model is forced to fabricate some new content; that's why we call this type of speech "novel content speech." As you can hear, the output speech sounds almost like nonsensical babble, which implies that the diffusion model is unable to fully capture the grammar or word structure of the original language.
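The four levels of variation can be summarized as a small modification routine like the following; the tensor shapes, the masking fraction, and the interface are illustrative choices, not the paper's.

```python
import torch

def modify(reps, spk, level, other_spk=None, mask_frac=0.5):
    """Sketch of the four variation levels.

    reps : (T, D) initial SSL representation of one utterance.
    spk  : (S,) speaker embedding extracted from the same utterance.
    """
    if level in (1, 2):                 # level 1: original audio is kept as-is;
        return reps, spk                # level 2: same conditioning, the diffusion
                                        # model itself adds prosodic variation
    if level == 3:                      # level 3: new voice, swap the speaker embedding
        return reps, other_spk
    # level 4: novel content, also mask part of the representation so the
    # diffusion model has to fabricate content (the "babble" case)
    keep = (torch.rand(reps.shape[0]) > mask_frac).float().unsqueeze(-1)
    return reps * keep, other_spk
```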
However, as we will show, even this seemingly nonsensical babble still helps the pre-training. To test the performance of DiffS4L, we use LibriSpeech 960 as the pre-training dataset, which contains 960 hours of English speech.
We consider two different pre-training settings. In the low-resource setting, we sample only 100 hours of real speech. We then augment it into 960 hours of speech in total by adding 430 hours of level 2+3 speech and 430 hours of level 4 speech.
In the high-resource setting, we use all 960 hours of real speech and then augment it to 2,400 hours of total speech. We use two standard pre-training techniques for training both the initial speech representation and the final speech pre-training: wav2vec 2.0 and HuBERT.
And we compare four different data augmentation settings: no data augmentation at all, Wav2Vec-Aug, WavLM, and our proposed DiffS4L. We test the pre-trained models on a number of downstream tasks. The first task is English ASR, or English automatic speech recognition, where there are only 10 hours of labeled data for fine-tuning.
The results show that DiffS4L can significantly reduce the error rate for both wav2vec 2.0 and HuBERT. Moreover, DiffS4L can achieve further error reduction when it is combined with WavLM.
Finally, note that the results in these two boxes are both pre-trained on 960 hours of speech. The only difference is that the results in the red box were pre-trained on 960 hours of augmented speech, whereas the results in the blue box were pre-trained on 960 hours of real speech. As can be observed, the gap between them is already very small.
To evaluate the performance beyond speech recognition, we chose the SUPERB benchmark, which contains eight different tasks other than ASR. The results show that DiffS4L achieves the best performance on almost all of the tasks, in both the low-resource and the high-resource setting.
Finally, to test the performance beyond English, we chose 13 more languages, including some high-resource languages and some low-resource languages. The results consistently show that DiffS4L reduces the error rates. The final experiment I would like to show investigates whether all four levels of variation are helpful.
To test this, we go back to our low resource setting, and then we're going to fix the original speech to 100 hours, and we're going to fix the total number of speech to 960 hours, but we then vary the proportion between level 2 plus 3 speech and level 4 speech.
And here is the result of English ASR under different dataset compositions, where the leftmost point corresponds to no level 2 or 3 at all, and the rightmost point corresponds to no level 4 babbles at all. As can be observed, the best performance is achieved somewhere in the middle, where all four levels of speech are present.
That means all four levels of speech, including the level 4 babbles, are beneficial to the pre-training. To summarize our findings: the diffusion model is able to capture information in speech that is complementary to what SSL learns. Therefore, our proposed DiffS4L can significantly improve SSL performance on various downstream tasks and languages.
We also find that the synthetic speech with different levels of variations are all conducive to SSL, even the seemingly nonsensical babbles. With that, I'll close the talk today. Thank you very much for your attention.
In part one, we explored video generation and world simulation, and in part two, we explored further diffusion and generative modeling methods across NeRFs, flow matching, rectified flow transformers, and speech. In part three, we turn the generative text-to-video paradigm on its head and check in on the state of computer vision.
First, we have the OG vision foundation model, DECAF, which this year won the most prestigious Test of Time award, having first been presented at ICML ten years ago. Here is UC Berkeley professor Trevor Darrell, advisor on the DECAF paper and also originator of the Caffe deep learning library, accepting the award on behalf of the team. Thank you very much. It's quite an honor to be here, and thank you for the generous introduction.
And it's really quite exciting to be able to talk about the impact of the DeCAF work and the Caffe work, and of course,
that introduction was very generous. I think the broadest claim we would make is that DeCAF democratized access to this class of tools, and that's what led to the transformational change in the field. Of course, as we mentioned at the time in the talk, this builds on the work of AlexNet and other papers.
So, the paper was DeCAF, and the title was a mouthful: A Deep Convolutional Activation Feature for Generic Visual Recognition. And what is an activation feature? I thought I would start off by translating that into 2024 speak. Today we would probably think of the F in DeCAF as being "foundation model."
Now, I'm not sure we ever needed to define the term foundation model in our field, but since it's been defined and is broadly used now, I look back and think of the DeCAF paper as perhaps one of the original, or earliest, foundation models in vision, or deep foundation models. And so that's essentially the main retrospective point as we looked back to see the impact of this work.
And we're honored and even pleasantly surprised that it was selected for a Test of Time Award. In one slide, what was the DeCAF paper? Really, it was maybe one of the simplest papers we've ever published in my group. Essentially, we took the results of AlexNet,
showed the effectiveness of this model as a pre-training model in vision, showed that if you took frozen activations, frozen features, and computed activations on those features, you could have essentially state-of-the-art performance across a wide range of tasks. So this is in some sense the OG foundation model in vision.
Take the output of the convolutional layers, freeze it, train a linear classifier, maybe even an SVM back in the day, on those features, and boom, you get state-of-the-art performance.
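As a rough sketch of that frozen-feature recipe with today's tooling (a modern torchvision AlexNet standing in for the original Caffe weights; the data loader and class count are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 102   # e.g. a fine-grained target task; purely illustrative

# Load a pretrained AlexNet and freeze it.
backbone = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
backbone.classifier = backbone.classifier[:-1]   # drop the last fc, keep 4096-d "activation features"
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

head = nn.Linear(4096, num_classes)              # the only trainable part: a linear classifier

def features(x):
    with torch.no_grad():
        return backbone(x)                       # frozen activations, DeCAF-style

opt = torch.optim.SGD(head.parameters(), lr=1e-2)
for x, y in loader:                              # `loader` over the target task is assumed to exist
    loss = nn.functional.cross_entropy(head(features(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```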
I think the main things the DeCAF paper did that were impactful and exciting were to visualize and show why the model was working. That insight may seem obvious today, since we all understand it now, but the vision community especially didn't appreciate it at the time. AlexNet was this amazing result, but people thought it was a special case, that it was only going to work for that one task.
And the fact that it started to work for everything and that our paper and other papers like it demonstrated that and that it could be used in this pre-training slash fine-tuning way, that was what really revolutionized the community. And the last bullet here, of course, is probably the most important. I think the thing that
made DeCAF important, and maybe this was a tipping point in the community. For most papers accepted to ICML or CVPR prior to 2014, you would never get a paper accepted that didn't have an algorithm or a model; even data wasn't enough to get a paper accepted back in the day. And the fact that this paper was accepted, and is now recognized for its impact:
The reason it had impact was really the open source release and the broad dissemination of the work through that channel. And that's now commonplace, perhaps even dominant in our field today, but it wasn't back prior to 2014. So I'd like to sort of acknowledge how the community has changed and the community standards have changed.
DeCAF was part of the Caffe ecosystem, which really was a dominant force in deep learning, one of the preeminent deep learning frameworks between 2013 and 2018. And again, the reason Caffe mattered
wasn't that it was the first deep learning framework ever, and there were some unique architectures, but really it was the democratization of the models, access to the models, and the emphasis on heterogeneous compute: you could use either a CPU or a GPU.
And it was really the industry-standard code base that worked well in academia and in industry. It was the platform for the first widely deployed use of NVIDIA GPUs, and that certainly has had a lot of impact. We had the first model zoo that we can find. And the timeline is shown here.
DeCAF came out in 2013-2014. DeCAF was the frozen, pre-trained, sort of foundation-model version of Caffe. Caffe itself was a deep learning framework, eventually merged into PyTorch, that allowed people to train their own models and had great impact. And the impact is evidenced by the
honors the team has received: this is one of three Test of Time awards the Caffe ecosystem has received this summer, and that really is a remarkable observation. The DeCAF paper here at ICML, the R-CNN paper at CVPR, and the Caffe system paper, which was at ACM Multimedia, also has had the honor of a Test of Time award that will be presented later this year. I think DeCAF
might be the most important of these three papers, which is maybe surprising in retrospect, because at the time I'm not sure the DeCAF paper was viewed as being as important as these other two, or at least as the Caffe system itself. But I'm going to walk through some of the old slides and then tell you a little bit about some of the current
observations and thoughts about why and how this work looks sort of from a historical light. And what does the pre-training paradigm mean for the present and the future? So these are the old slides from 2014, presented at ICML 2014 in China, in Beijing. And this is what the world of computer vision looked like
in the early 2010s and 2000s. Really starting in the late 1990s, 1998, 1999, the machine learning revolution took over in computer vision. But for a good decade, it was the pathway seen on this slide, with handcrafted features, with names like SIFT and HOG and LLC and SURF that may or may not even be known to the community today.
And then we had our wonderful Cafe Cat, which of course detecting cats on the internet was the paradigm of its day. But for several decades,
not unnoticed by informed researchers at the time, but largely unappreciated by the CVPR and even the ICML community, to be honest, there was progress in convolutional representation learning, or deep learning as it was later called. It goes back to Fukushima and the Neocognitron,
the seminal Rumelhart, Hinton, and Williams paper in the PDP book, which I encourage people to go back and look at, and of course the work of Yann LeCun, and then AlexNet in 2012, finally showing that this paradigm did scale and that this paradigm was going to win out.
But I think even in 2012 and 2013, vision people were basically acknowledging this is working for object recognition, but certainly it wouldn't work for other things. It wouldn't work for fine-grained object recognition. It wouldn't work for complicated transfer learning. It wouldn't work ultimately for segmentation and other things. And as we see, and as the DeCAF paper
helped convince people, that was not going to be the case. In fact, this paradigm was going to take over the field and did take over the field. And the DeCAF paper was essentially the simplest foundation-style model or pre-training paradigm you could advocate. And very simple even in the day, which is let's just take a frozen AlexNet model,
which we're going to provide for you, the user who downloads the code from the Berkeley website. And just take slices of the model and compute activation features. Compute the representations that are formed from the pre-trained AlexNet model and see how it does.
And the DeCAF paper did this, reported it, asked a series of questions about the quality of these representations, and provided maybe the first demonstration that these models are learning something more than what they were trained on, more than the literal task
that they were trained on. That they are actually learning the latent knowledge that's encoded in those tasks. They're learning semantic hierarchies and things like that. Again, those are concepts that we take as obvious and common sense today, but certainly in the vision community in 2012, it hadn't yet been accepted.
And because these representations capture this latent semantic structure, they generalize effectively to other tasks.
And different layers had different performance. The DeCAF paper was one of the first to show visualizations, for example using t-SNE, of these deep learned representations. And these visualizations show that if you look at higher layers of the DeCAF representation, or higher layers of the AlexNet representation,
you see the ability of these models to capture latent semantic features, which were called superlabels in the paper. The models were never trained on these superlabels, and yet there they are, emerging as part of the representation.
Again, I think this is the thing that really surprised the vision community, that you didn't explicitly supervise the model on this signal, but it emerged from the representation.
And when you would then look to see what was the performance, of course AlexNet was crushing object recognition, but there was a view that object recognition was now, okay, that's just machine learning. It's not part of computer vision. Computer vision are these other things, fine-grained part recognition or segmentation, domain adaptation and things like that. These representations just started to crush all the tasks. Right?
By moving from the prior best feature, which in this era was SURF, on the state-of-the-art domain adaptation challenge in computer vision, which was the Office dataset released from our lab back in 2010, you could see the numbers double just by changing the underlying representation under some relatively fancy domain adaptation technique.
But even more exciting, you know, from the perspective of this paper, perhaps less exciting from the perspective of the domain adaptation researchers, the baselines went up by a factor of three or four. These underlying representations had an ability to transfer that really crushed all the fancy mechanisms that people had been proposing for domain adaptation.
So, that was a sobering moment for many researchers. And across a number of different tasks in computer vision, for example, fine-grained recognition, where you want to recognize individual species of birds, and there was a notion that you may want representations that can also localize parts and have some interpretability.
The DeCAF model out of the box did better than the prior art, and when integrated into straightforward techniques for localizing parts, had further improvements in performance.
And the last example from the paper that I'm going to highlight here today is on scene recognition. Here as well, the model is never trained on these labels for outdoor, indoor, man-made, or natural. And yet, these superlabels emerge in the representation when you visualize it with t-SNE.
So there are more results. I encourage you to look back at the paper. But I'll just close this part of reviewing the original talk by noting that I think the main impact here, those observations were important, but the open source dissemination turned out to be the impactful part.
I mean, there were other papers that came out shortly after DeCAF, or maybe around the same time, that also showed transfer performance, but ultimately DeCAF and the Caffe open source release led to wide adoption of these techniques very quickly in the community. And there was this great website and a cute cat that you could look at, of course, back in the day.
And at the time it was just considered remarkable, because only a year before, people imagined you needed 10,000 CPUs to try and run deep learning. There was a great article in the New York Times, which you can go find, on how the Google data centers were using thousands and thousands of CPUs to run certain deep learning algorithms, and there was just a belief that no way could anybody but Google do that.
And so people were just doing whatever they had been doing. And then suddenly, the next year, through DeCAF and related efforts at democratization, almost anyone could get most of the performance of these models, at least for inference. And then soon, with Caffe and the advent of GPUs and GPU acceleration of these models,
everyone could train a model of this size. And so that's an exciting point, and maybe we're going to have a similar moment in the future. Right now there's a perception that you need tens of thousands of GPUs to train LLMs. Who knows what the architectures will be in the future, and what the next iteration of a transformative architecture change like Caffe will be.
So, DeCAF showed back then the surprising effectiveness of transfer using frozen or relatively frozen AlexNet features. It was a pre-training, fine-tuning paradigm. DeCAF was the precursor to Caffe, which became the de facto standard for deep learning in academia and industry. But maybe, why did I say the DeCAF paper might be the most important one? I mean, Caffe had a lot of impact, but actually,
the way I'm presenting the DeCAF paper now, as a kind of foundation-ish model, isn't what people were most excited about from 2015 to 2020. In fact, if you wanted to get your paper accepted during that era, you had to put end-to-end in the paper or in the title somewhere. The cool thing was that you could now backpropagate all the way from the task.
And if you weren't doing that by around 2016, you weren't considered in fashion. So I don't think activation features, the way they were explained in the DeCAF paper, or foundation models, if we relabel that today, were in vogue for several years. And the Caffe system allowed you to train your own model, get your own GPUs, and do this sort of end-to-end training.
But as we know, roughly in the early 2020s, pre-training returned, and in some sense it is now the dominant paradigm. We see this with BERT and CLIP, and now a perception that if you keep scaling your data and the model, the underlying representations are just going to get better and better and better, and that's the way to go. We don't think it's the appropriate path
to really train from scratch for each task; we want to leverage everything we see across the underlying tasks. And maybe I'm oversimplifying what people were thinking in 2016 to 2020, but I think you get the gist. So we now see this pre-training paradigm very dominant in the field.
DeCAF was primarily a pre-train-plus-fine-tune approach, as are our contemporary LLM and LoRA models. But now prompting, pre-train and then prompt, is of course dominant in language and in vision-and-language. And in vision there's very early work along these lines as well.
Since I have a test of time talk, I'll plug my own group's work on this on visual prompting that we had at NeurIPS 2022 and large vision models that we had at the CVPR. You can look at these approaches. No language in those models, but they're still foundation-ish models or pre-training plus prompting approaches to vision models.
And we see the unreasonable effectiveness of pre-training continuing to this day. Many models that are having a lot of impact. A lot of vision and language models coming out very fast from companies. I wouldn't want to now compete in building the next vision and language model from Berkeley.
As I mentioned earlier in the talk, until there's the next revolution of architectures and maybe we're not going to need 10,000 GPUs and we only need a couple of these, whatever the next great model is, I'm looking forward to seeing what that will be. But even now, I think that's still an open playing field if we consider vision and action or vision and action and language issues.
I'll just maybe close pointing to some of the work in that space. There are many papers coming out from many different labs right now in this direction. I'll point to two in my lab, one that includes language as it pre-trains a vision and language and action model, and one that doesn't explicitly include language but also
humanoid locomotion. If you're interested in the LLARVA paper, we've taken the LLaMA and LLaVA base and added action pre-training into it, where we literally prompt the robot with a particular control scheme and a particular task, and describe a trace of the trajectories that are desired, which can then be performed. And the sort of fundamental approach of humanoid
action, or locomotion, as next-token prediction is explicitly formalized in our paper, Humanoid Locomotion as Next Token Prediction. There's no underlying language model here, just pre-training on lots and lots and lots of human and humanoid action data, some of which is taken from the wild and some of which is generated in simulation, with a straight-up transformer, and the model can then walk around
new environments, including San Francisco. With that, I think there's one or two minutes left for maybe a question or a conversation, which I'm happy to engage in. And again, I think the team just wants to very much thank the community and the program chairs for this honor and looking forward to all the research in the future from the community. Thank you. Thank you so much. So I think there's time for maybe one, well, there's time for one question. So if you have a question, you can come up to one of the microphones there. But then I think because we are
basically between the two poster sessions, we want to make sure we get over there too. So maybe we'll have one person, if you want to come up to one of the microphones. We'll give them a second, because people always do. The thing to do normally as a chair is to ask it yourself, but I won't take that. I won't take that. I'll let someone from the audience do it. To a certain extent, this notion of prompting, I guess, okay, I'll put it this way. To what extent do you think
fixing features and putting a linear head on top of those features, which we see is very different from prompting in a sense in current mechanisms. To what extent do you think that that's just a
aside, and prompting kind of is just the way we specify linear heads these days, or to what extent is language really something fundamentally different when it comes to vision language models that is going to enable another step change the way that deep networks enabled a step change so many years ago? I think you asked two different questions. One I took to be the
between fine tuning and prompting and the other I took to be language versus not language, or at least I'll try and answer those two. And I think, and I'll try and do it quickly, I think I could also have added some slides at the end. I think it was an exciting time in the community and many cool papers coming out right now about the sort of mechanisms of in-context learning and prompting, task vectors and function vectors,
and how we can interpret and then maybe even patch or extrapolate these models. So I think we're going to, I think that's unsettled. I think we're going to see in the coming years papers that define that paradigm that are going to show maybe more formal connections between fine-tuning and prompting in the architecture. That's very hot work right now.
And then is language special? I sort of don't think it is. I also think the word language is complicated because I don't know whether, I think in the community right now, if I say a language model,
to the broader press, they're gonna assume there's text in there. I'm not sure if I say the word language model to this community whether you're gonna assume that or not. In the past, I didn't always assume that. I thought I could have a language model on a series of tokens that were just vision tokens, and that's why we call that large vision model
a large vision model, I think that's a large vision model, it's also a language model. So if language is just the process of having tokenized something and then predicting it, I think that general paradigm is going to be very high impact across all areas of intelligence, including those that use text or what we normally call language, and those that don't, like motor control.
That last audience member question was an incredible segue into our next talk, which is a retrospective on how LLMs and computer vision converged, from Lucas Beyer, one of the lead authors of the Vision Transformer paper at ICLR 2021.
This would count as yet another deep mind talk on this pod. Prizes for counting how many we have featured this episode, except that the entire VIT team has just left Google to set up the new OpenAI Zurich office.
Let's have a look, eh? Get it? Look at how Lucas views the progress of the VLM field. I will talk about, yeah, about computer vision in the age of LLMs, and in this talk I will focus a lot on the data side of things. However, yesterday evening I decided to completely redo my talk, so I apologize if some parts are not smooth or if sometimes I'm surprised by my next slide.
So, one thing that happened recently in computer vision, or recently as in four or five years ago now, is that language has suddenly become the API for all vision models. And by API, I mean the input/output of the model, how you communicate with it, basically. In the distant past, what a lot of vision looked like, with classification as the most canonical task, is that you pre-trained your model on a large database of images labeled with classes, typically, because that's what's easy and quick to label. Maybe there are a lot of classes, so you cover a lot of concepts, but they are not really attached to anything: it's a class ID number and that's it. This way, your visual model learns to understand a lot of things, or at least to classify them into these classes, and then you transfer it to whatever task you're interested in, which could again be classification, but maybe much more focused, like flower classification or other things.
And again, there you had a labeled dataset, typically much smaller, and then you fine-tune on that, and you get the model that you actually care about in the end. Then, with the appearance of CLIP, and at almost the same time ALIGN, things changed. They showed how to do pre-training not on class-labeled data, but on pairs of images and text that you can find very easily. And then you don't even fine-tune this model; you just prompt it, basically: give it a few options in free-form text, and it tells you which option most likely represents the image. That way you don't even need to... the API, which is what I mean here, changed. It's not integers of classes anymore, and nothing needs to match; it's just free-form text. So that was very nice. And in terms of data,
it means that things changed. Vision datasets were classically this list of classes, and then for each class you go and collect, typically via image search, a bunch of images, and that's it. So you have this very regular structure or information content. And even worse, who makes up these classes? A PhD student just sitting there: "I'm going to make a dataset about blah blah blah, so let me think about the classes." Fun fact: COCO, which is probably the second most widely used computer vision dataset from the time when we did things with classes. Who knows how the list of classes in COCO was created? Yeah, very few old-school vision people here. The senior professor on the project asked his teenage kid, an American kid, what are common objects in your mind? That's why the COCO classes are things like frisbee, football, pizza and that kind of stuff.
So, yeah, you can already see how bias comes into these datasets, right? The model only learns about the things that the person creating the dataset, which is usually one random person who just happens to be deciding, thinks of. But this is what the data looks like now, in modern times, when we learn from images and text combined: just a random collection of images, typically from the web, with text somehow attached to them, typically the alt text or the title of the page it's from or things like that. And then you have some random shit like this that is completely uninformative: "Thumbnail for version as of 21" blah blah blah. This is legible, yeah. So this is a kind of useless supervision signal, but you also get very detailed stuff that you would never come up with if you were creating a list of classes, like this one, "Frankfurt airport skyline 2017", or "London barge race", or things like that. This completely changes what the models can learn. They are exposed to a lot more noise and useless stuff, but at the same time also to a lot more detail than you would ever get from the classic way of creating datasets.
All right, and then a little advertisement. CLIP was the first model doing this, and then a couple of years later our group made SigLIP, which is a variant of CLIP and is also an open model, so you can download it and use it. After a few years of experience with this kind of training, it's significantly better. And the cool thing is, as with CLIP but now even better, you can prompt it with free-form text and be as detailed as you want, as detailed as you can express with text. Here are a couple of examples. Are these visible-ish? Yeah, here are a couple of cool examples of pictures we took ourselves, so they cannot be in the training dataset and the model cannot possibly know about them. For example, this one is me and a colleague who bought coffee-themed t-shirts. Mine, I think, said "I need coffee" or something, and the colleague's is just the molecule of coffee. And the model fires 100% on the text "a photo of two guys in need of caffeine", but only 4% on "a photo of two guys in need of water".
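To make the mechanics concrete, here is a minimal sketch of this kind of zero-shot prompting with a CLIP/SigLIP-style dual encoder. The `image_encoder` and `text_encoder` functions are placeholders for whatever pretrained model you load, not a specific library's API; the point is just that the "API" is free-form text scored by similarity.

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(image, prompts, image_encoder, text_encoder):
    """Score free-form text prompts against one image (sketch).

    `image_encoder` and `text_encoder` stand in for the two towers of a
    pretrained CLIP/SigLIP-style model. Whichever prompt embeds closest
    to the image gets the highest score.
    """
    img = F.normalize(image_encoder(image), dim=-1)    # (1, D)
    txt = F.normalize(text_encoder(prompts), dim=-1)   # (P, D)
    sims = (img @ txt.t()).squeeze(0)                  # (P,) cosine similarities
    return sims.softmax(dim=-1)                        # pseudo-probabilities

# e.g. prompts = ["a photo of two guys in need of caffeine",
#                 "a photo of two guys in need of water"]
```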
So this stuff works nowadays. And a classical thing in computer vision, at least until recently, is a pet peeve of mine: people always say computer vision models are not robust, that a model will recognize a cow, but put it on the beach and it will fail completely. Not at all. This has been solved for years. Here, "cow on the beach" is 99%, plain "cow" is also 36%, but "cow in prairie" is only 1%. So this stuff works with CLIP and SigLIP and models like that. When we did SigLIP, we released a separate checkpoint that we trained on all languages, and we basically tried to show, just with some examples here, that it learns multiple languages. We didn't really evaluate it thoroughly in the paper; we just released it. The point is that this is just web data, images and text, and the web is international, so you learn about all the languages for free, essentially, if you use them. We didn't do anything special like translation. And the model can learn not just "the cow on the beach" in English, but also in other languages, including ones I cannot pronounce.
But they all say the same thing, right? And here we even tried to show some culture-specific things. I think this was my Chinese colleague, Xiaohua, who came up with this, so I think only Chinese people will understand it. This dish in Chinese is called ants on the tree, I think, or ants on the branch or something like that. Ants climbing a tree, exactly. And here is the interesting thing: if you ask this model in English, with the literal translation of the dish, ants climbing a tree, it just doesn't get it. It fires on the picture of actual ants climbing a tree and not on this dish. But if you ask, and I cannot say this, but one of these is "ants climbing a tree" in Chinese, then it totally gets that you're not talking about literal ants climbing a tree, but about the dish.
Alright, so... ah, yeah! So we released this separate multilingual model, but back then we trained only a small version, and recently we also trained a larger version. This is new; we released it somewhat silently, just in this Colab. So if you're interested in international SigLIP or CLIP-like models, there is a large one available that is pretty good. But why did we release two separate models? Why not just one international one and that's it? Well, because it turns out that training on English-only data helps a lot on the scores that people, including some of my colleagues, care a lot about, which is just ImageNet zero-shot and a few other English benchmarks. So here is an overview of
a broad range of recent CLIP-style papers, papers that do CLIP-like training and look at data. It typically looks like this: if you train on raw data, it's bad; on the English-only subset of the data, it's better; with some more filtering, it's better and better. And the measurement is typically the ImageNet zero-shot score. And this is across... I intentionally don't write which paper is which, because I don't want to blame any individuals. Except here, I can say this is the original CLIP paper already: it says we get our queries from English Wikipedia, so English only.
Then when you see papers using LAION, they usually use LAION-2B, but if you look at the citation, the LAION paper is LAION-5B. So what is that? Well, LAION-5B is actually 2B English, 2B non-English, and 1B unknown. So typically people just use the 2B English LAION subset.
And then here is another work where we can go through the steps of filtering. The first basic filtering is the caption language being English. Then the typical thing is filtering by CLIP score, so you keep only the data that CLIP already understands, and as I mentioned, CLIP was trained on English-only stuff. And then there's more, like keep only the data where one of the words in the text is from the ImageNet-21k list of classes. Or even better, not even the text: keep only images which are similar to ImageNet images. And the thing is, this heavier English- and ImageNet-tailored filtering is what works best as measured on ImageNet, and also on other benchmarks, but benchmarks which are similar to ImageNet.
Having said all of these negative things, I need to call out one positive paper and not hide its name: InternVL. I like this a lot. They specifically show, "Okay, look, we use LAION, the English one, but also LAION multilingual and also a Chinese dataset." So that was nice. All right. So this was all about CLIP and about this specific filtering stage. One of the next talks, from Angeline, will give more details about the effects of this. But part of the vision community has moved on, and I think all of it should move on, past CLIP, or beyond CLIP and even SigLIP, because there are some things that, no matter how good your data is, how high-quality and descriptive the captions, the CLIP contrastive loss just doesn't learn. Actually, I just assumed everybody knows how CLIP works, but who here knows how CLIP training works?
Okay, more or less everybody. That's good. So take this example. You have an image of a cat and a dog, and the caption is pretty much perfect. It could be more detailed, but it's pretty much perfect: a cat sitting left of a dog. Now these go through the encoders, and they are trained to be most similar to each other versus the other captions or pairs in the mini-batch. However, now let's think about what the model needs to learn to perfectly satisfy this objective at training time. It depends on what else is in the batch. If there is no other picture of any cat or any dog in the same mini-batch, the model just needs to learn, for example, "cat", and that's it: it matches it to this image, perfect, done, the loss is perfectly satisfied. Or alternatively, it just needs to learn "dog".
And then it's done. This is the only image that contains a dog. It doesn't need to learn to match any more than that. And models are lazy, just like me. They learn the minimum amount necessary to solve the task.
Now, if there happens to be another picture of a cat in the same mini-batch, one that is not sitting, then the model only needs to learn either "cat sitting" or, probably easier, "cat and dog": it just matches the words cat and dog with this image. There's no other image with both a cat and a dog, and it's done; it doesn't need to learn more. You see where I'm going, right? To learn "left of", what needs to be in the exact same mini-batch is the same thing, a cat and a dog but the other way around, with a perfect caption and no other shortcuts to match them. This is just not going to happen. So this is an inherent disadvantage or limitation of CLIP-style learning, and SigLIP suffers from it too.
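For readers who want the batch-dependence spelled out, here is a minimal sketch of the symmetric contrastive objective that CLIP-style models train with, written generically over precomputed embeddings (this is standard InfoNCE, not any particular codebase):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a mini-batch of paired embeddings.

    The only supervision is "match your own pair, not the others in this
    batch", so any feature that separates the pairs in *this* batch
    (e.g. just the word "cat") is enough to drive the loss to zero.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                          # (B, B) similarities
    targets = torch.arange(img.shape[0], device=img.device)       # diagonal = correct pairs
    loss_i = F.cross_entropy(logits, targets)                     # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)                 # text -> image direction
    return (loss_i + loss_t) / 2
```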
So with some colleagues, we set out to find a fundamentally better learning objective that fixes this. And there is a quite simple one that does, which is just captioning: encode the image, and then pass the image encodings into a decoder that should decode the caption. When you decode, it's just like a language model; the loss is next-token prediction. So here you do have a loss that says: say "left", don't say "right", don't say "above", don't say "below", don't say any of the other words in your vocabulary, say "left". So the model has to learn this. And then there is...
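For contrast with the loss above, here is the captioning objective as a sketch; `image_encoder` and `text_decoder` are hypothetical stand-ins for whatever backbone and autoregressive decoder you use:

```python
import torch
import torch.nn.functional as F

def captioning_loss(image, caption_ids, image_encoder, text_decoder):
    """Next-token prediction loss for an image captioner (sketch).

    The decoder is assumed to condition on (e.g. cross-attend to, or be
    prefixed with) the image tokens. Unlike the contrastive loss, every
    token of the caption -- including words like "left" -- must be
    predicted against the full vocabulary, regardless of the batch.
    """
    img_tokens = image_encoder(image)                        # (B, N, D)
    logits = text_decoder(caption_ids[:, :-1], img_tokens)   # (B, T-1, V)
    targets = caption_ids[:, 1:]                             # shift by one token
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           targets.reshape(-1))
```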
I'm not going to go into any more details here, but the original CLIP paper showed in its Figure 1 that it is so much less efficient to train this way, and in our paper we go into a lot of detail showing that it's actually not much less efficient. Right. And then we evaluated this. We found that we are not the only ones thinking about this CLIP limitation; there were already multiple benchmarks that specifically measure it. The first one we saw was ARO, which stands for Attribution, Relation, and Order, three things that CLIP models are not really incentivized to learn. So they designed a benchmark to test exactly that.
And ignore the numbers at the bottom for now. When we train a CLIP-style model, we get these numbers. When we train the captioning-style model in otherwise exactly the same setup, both well optimized and trained on the same data, it is just so much better. This is worlds better. It is also worlds better than the bottom numbers, which are a few ideas for how to fix this within CLIP, which I call Band-Aids. It is just so much better to train a captioner. Some of these, like ordering, it just nails perfectly. Actually... no, not this one, the next one. But here is an example from the ARO paper I am currently hiding. The way the benchmark is constructed is that you have
an image and two possible captions, and you need to find which one is correct and which one is wrong. The captions are designed to differ specifically in either an attribute, a relationship like left of or right of, or ordering. This example is from the paper itself: the horse is eating the grass, or the grass is eating the horse. Can you guess, without even seeing the picture, which one is probably correct and matches the picture? Yeah, right. So this is an issue. You probably, hopefully, guessed it the right way around without the image. So this is an issue with the benchmark: I don't even need to reveal the image, but just for the sake of completeness, this is a screenshot from the paper. We identified this as an issue, and so we also trained a blind decoder, a captioner that never sees the image, on our same pre-training dataset of image-text pairs from the web. And of course that nails the task too. So this is the shortcoming of this first benchmark. But again, we were not the only ones to notice that. Other people noticed it too and created a new benchmark that is supposed to measure the same things, called SugarCrepe.
And it looks kind of like this. Here's an example. It seems not to have these obvious shortcuts. One example is this picture, with "a yellow tennis racket has a blue tennis ball on it" versus "a blue tennis racket has a yellow tennis ball on it". Both of them are quite plausible. Same with the cake and flowers and things like that. And then it also goes into more detail about which things are tested. But it's the same story here. I think we didn't include the blind baseline in the paper because the authors of the benchmark already did it and showed that it gets random accuracy, so we didn't need to redo that. But same story: these captioning models are significantly better than the equivalent CLIP models, or even the best CLIP models, on almost everything. So I think this is the future of pre-training vision models, or something like that; we should move past CLIP. Even more, vision models, just like language models, are becoming more and more complicated systems. Everything I said before was pre-train one model and then use it, maybe zero-shot or with fine-tuning, but what most models do now is train in multiple stages, and for VLMs it's basically
almost the same as for language models. From our side, we started this a few years ago with a series of papers and models called PaLI. I'm actually curious who knows PaLI. Okay, about half. Probably the people doing vision know it and the pure LLM people don't. So the PaLI model looks like this. It was a whole series of papers and models, and this is from the first paper, an animation. You just have an image and text as input, and then text as output.
And the text that is input basically is the task. Like, what do you want? It's often a question you want answered about the picture or an instruction like generate a caption of this image in Romanian or things like that. And then these just go to a transformer and are trained together.
Then I will talk about how they are trained in a bit. With this kind of interface, you can do a lot more things than with CLIP or with a captioning model, right? You can now ask questions, and because it's free-form language and not a list of classes, you can ask quite pointed questions. You can ask how many coins there are and it will say 12, but you can also ask how many one-dollar coins there are and it can say two, and other things. And then, yeah, let's skip that. Right. But then it's only text out, and okay, the language model people here will be happy with that, but the vision people will say no, there's so much more to vision than text out. But text is a bit more universal than you might think. For example, one classic vision task is detection, creating bounding boxes with coordinates, and this is very easily encoded as text. This is kind of legible. So how do you encode a bounding box as text? Well, just as the coordinates of the two corners, as plain integer numbers, for example. And so that the integers are not pixels, because then they would be sensitive to the image size, you use fractions of the image and then multiply by a thousand, so that you have integers.
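As an illustration of that scheme, here is a tiny helper in the spirit of what the talk describes; the exact binning and output format are assumptions for the example, not the actual PaLI/PaliGemma tokenization.

```python
def box_to_text(box, img_w, img_h, n_bins=1000):
    """Encode a bounding box as plain text (sketch).

    `box` is (x_min, y_min, x_max, y_max) in pixels. Coordinates are
    expressed as fractions of the image and scaled to integers, so the
    representation is independent of image size.
    """
    x0, y0, x1, y1 = box
    coords = [
        round(x0 / img_w * (n_bins - 1)),
        round(y0 / img_h * (n_bins - 1)),
        round(x1 / img_w * (n_bins - 1)),
        round(y1 / img_h * (n_bins - 1)),
    ]
    return " ".join(str(c) for c in coords)

# e.g. box_to_text((32, 80, 512, 400), img_w=640, img_h=480)
```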
Right. So you can actually do a lot of classic computer vision tasks with this text-out API too. Even more, and I didn't put it on this slide, you can also produce segmentation masks as text output. How does that work? Well, you can train a mask encoder, typically a VQ-VAE, that compresses a mask into a short code of a few tokens from a small vocabulary and can decode it again. And then you just concatenate this vocabulary onto your language vocabulary, and that's it.
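A toy version of that vector-quantization step might look like the following; the encoder that produces `mask_latent` and all shapes here are hypothetical, just to show how a mask becomes a handful of extra vocabulary entries.

```python
import torch

def mask_to_tokens(mask_latent, codebook):
    """Quantize a mask's latent into discrete codes (sketch of the VQ step).

    `mask_latent` is (N, D): N latent vectors from a hypothetical mask
    encoder; `codebook` is (K, D). Each latent is replaced by the id of
    its nearest codebook entry, giving N integer tokens that can simply
    be appended to the text vocabulary (offset by the text vocab size).
    """
    dists = torch.cdist(mask_latent, codebook)   # (N, K) pairwise distances
    token_ids = dists.argmin(dim=-1)             # nearest code per latent
    return token_ids

# A decoder trained jointly with the encoder maps these few tokens back
# into a full-resolution binary mask.
```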
So it is actually a very universal API. You can do many vision tasks with it. And again, because it uses language, and not, as in classic vision segmentation and detection, the list of 80 COCO classes, you can be quite precise about what you want. Detect the right hand, and it only gives the right hand; detect the left hand, and it only gives the left hand. Let's not discuss what counts as the left hand and the right hand according to the training data. Okay, yeah, one issue is that we had this whole series of three papers about PaLI models showing that all of this is possible and can get better and better and so on, but times have changed, and nowadays people are like, "Oh, nice paper, where is the model? Give me the model, otherwise I will forget about it in a week." So yeah, this is a good question. And so we did the fourth PaLI model, which is called PaliGemma. This one is also open, so you can just go and download it and use it for almost all purposes. Earlier, some licenses said don't use it for evil stuff, and we had to use such a license too, but you can essentially use it for anything. What it looks like is pretty similar to the previous one, just slightly different, because now language models are all decoder-only. So we use a decoder-only language model, the 2-billion-parameter Gemma, plus an image encoder. Yeah. And then let's go to the interesting part, the training.
This slide I copy-pasted from another presentation; let's just ignore the left-hand side, it's not important for this talk. The pre-training works in multiple stages, and this is quite similar, I believe, to language model pre-training nowadays. So first is stage 0, which is unimodal pre-training: the image encoder is pre-trained by itself. We did it with the SigLIP image encoder; you could use a CapPa image encoder, you could use a DINO image encoder, anything, as long as it's a good general image encoder. The language model is trained by itself; in this case we use Gemma because we are at Google, but you could use Llama or anything else. For this you pay zero cost, because you just download existing ones. Then you do stage 1, which we call multimodal pre-training. That's when you stick both of them together and train them on a mixture that looks like image and text in, text out. I will show you the mixture later. Then, in computer vision, it's often important for the model to also understand higher-resolution images. Typically we train on 224 by 224 images, partly for traditional reasons, but it's also a sweet spot: at 224 you can recognize a lot, but not everything, and it's relatively efficient. But then there is usually a resolution-increase stage, which is shorter training at higher resolution, like 448 by 448, for example. It's more costly, but you can see more details, especially if you have images with text in them, like pictures of documents or whatnot; then you may really need that.
Right, and all of these are basically the pre-training. And then there is another stage, which is transfer. The pre-training tasks, as you will see shortly, are mostly designed to teach the model as many skills and as broad knowledge as possible. In this stage, you don't really care about the interface being nice, or about the model understanding user intent well, or things like that; it's just about putting raw knowledge into the model. And then you have a transfer stage, usually also shorter, where you fine-tune the model to what you actually want. This can be different for different people or companies or projects, and it can include training on mixtures of many things; supervised fine-tuning or instruction tuning is part of that too. But it typically doesn't have the goal of giving new knowledge to the model, just making it focus on the thing you care about. So the pre-training mixture for PaliGemma looked like this: basically a bunch of tasks that force the model to learn certain things. One obvious convention: the prefix is the input, the prompt or task description given to the model. And then...
So for example, we have "caption" and then a language code, so caption in Chinese, for example, and then the model needs to predict the caption in Chinese. From the raw collection of image-text pairs from the web, we can just run language detection on the caption, so we know the language at training time and can put it there. Or, for example, if we have pictures with text in them, and we know what text is in the picture, which we can get, for example, from existing OCR systems, then we can ask the model to read the text on the image. The prompt would just be "do OCR". That is one task, and you can see it teaches the model a different skill than describing the image in a caption. And then question answering, including specific questions which you can generate. For example, if you have an existing, pretty good classifier that tells you which classes or objects are in your image, you can run that and then generate synthetic questions like these, like how many chairs, or is there a chair in the image, and things like that. And then there was an earlier paper showing you can also turn this around and generate the question that would produce a given answer. That is a different skill the model needs in order to solve it, so it is also good to add to the pre-training. And then we also added detection and segmentation. The detection and segmentation labels are pseudo-labels, so they come from a good detector model and a good segmenter model.
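To make the prefix/suffix framing concrete, here is a sketch of what such a mixture could look like in code; the task names, weights, and label sources below are illustrative assumptions, not the actual PaliGemma recipe.

```python
import random

# Illustrative pre-training mixture in the (prefix -> suffix) format the
# talk describes; weights and templates are made up for the example.
MIXTURE = [
    # (sampling weight, prefix template, how the target suffix is produced)
    (0.4, "caption {lang}",    "web alt-text, language-detected"),
    (0.2, "ocr",               "text read off the image by an OCR system"),
    (0.2, "answer {lang} {q}", "synthetic QA built from a pretrained classifier"),
    (0.1, "detect {thing}",    "pseudo-labels from a strong detector"),
    (0.1, "segment {thing}",   "pseudo-labels from a strong segmenter"),
]

def sample_task(rng=random):
    weights = [w for w, _, _ in MIXTURE]
    _, prefix, target_source = rng.choices(MIXTURE, weights=weights, k=1)[0]
    return prefix, target_source

# Each training example is then: image + prefix tokens in, suffix tokens out.
```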
Yeah, so this is roughly what the mixture looks like. But this is not really how you want to hand the model to the user, right? You don't want the user to first have to type "answer en" and then their question. So that's where the fine-tuning step comes in. We don't need to go through this whole list; it's just to say that we fine-tuned this on a lot of different datasets and it works really well. And for fine-tuning you don't need a lot of data, because it's mostly about rewiring the syntax to align with what the task needs. And then the final step, which you also know from language, though we actually did this at about the same time as RLHF, but in vision, is a last step of RL tuning of the model to optimize for what you really, really want, because supervised fine-tuning usually still doesn't optimize for what you really want. Let's see, how can I give an example of that?
Right, let's go back to this example here. If you do supervised fine-tuning on a dataset like that for detection, your training objective is to predict each of the tokens precisely, one after another. But when you do detection... take this token, 298, for example: if you predict 299, it counts as completely wrong. You predicted the wrong token, so you're wrong, that's it. But in detection, that's not really what we care about. If the box is one pixel more to the left, that's totally fine. What we rather care about is, for example, not having one extra box where there shouldn't be any at all, which in terms of token errors would count the same as getting the four box coordinates off by one pixel each. All right, so what is trained in supervised learning is typically not what you really care about. Was that example clear? Yeah, I hope so. OK. So, also in vision, what we can do is a last step of RL tuning. And yeah, this was at almost exactly the same time as the RLHF paper. So first you do the supervised training, the supervised fine-tuning or pre-training, because that just works really well and gives you a reasonably good model, a reasonably good approximation of what you want. That is maximum likelihood training, which basically means imitating the training data, so you can never get better than your training data, or than the best part of your training data, with this.
But then, once you have this model that does reasonably well at your task, you can sample predictions from it, and you can now define the reward. And the reward does not need to be differentiable; that's the nice part. You just need to give a number: is this prediction good or bad? And it can be arbitrarily complicated to get this number. It can even be obtained by asking a human for a number; then you have RLHF, for example. Or it can come from a very complicated metric. People familiar with detection know that mAP is the metric that describes pretty well what we want in detection, but it's definitely not differentiable and it's quite complicated. But you can just compute it and give a score to the sample. And then you do RL, which basically means: OK, model, give me two samples; I give a score to both, and for the one that scores higher I say, model, sample this more often, and the other one less often. You keep doing this, and that's how you align the model to do exactly the task, or the part of the task, that you actually care about, not just
copy what is in the data, which is what the pre-training, the supervised training, does. In language, it was relatively clear that you can do this, because in language it is now super common to have models you can sample from. In computer vision, this used to be completely uncommon. All of the classical computer vision models, like Faster R-CNN, DeepLab, YOLO and so on, are not models you can sample from, so you cannot do RL on them, because you cannot get two samples from the model and say which one was better and which one was worse. It's only recently, with this unification of models and this style of model like PaLI, and there have been a few others, Unified-IO is another good example, that you actually have vision models that can sample multiple reasonable solutions, which you can then do RL on top of. So that's why this only happened recently.
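As a rough illustration of the "sample, score, reweight" loop described above, here is a minimal REINFORCE-style sketch. The `model.sample` / `model.log_prob` interface and the per-sample `reward_fn` (which could compute COCO mAP, a human rating, or anything else non-differentiable) are assumptions for the example, not the actual training code discussed in the talk.

```python
import torch

def rl_tuning_step(model, images, prompts, reward_fn, optimizer, n_samples=2):
    """One policy-gradient step with a non-differentiable reward (sketch).

    `model.sample` returns token sequences; `model.log_prob` returns their
    summed log-likelihood per example; `reward_fn` returns a (batch,) tensor
    of scores. The reward is never differentiated -- it only reweights the
    log-likelihoods, so higher-scoring samples become more likely.
    """
    samples = [model.sample(images, prompts) for _ in range(n_samples)]
    rewards = torch.stack([reward_fn(images, s) for s in samples])   # (n, B)
    baseline = rewards.mean(dim=0, keepdim=True)                     # simple variance reduction
    advantages = rewards - baseline

    loss = 0.0
    for s, adv in zip(samples, advantages):
        logp = model.log_prob(images, prompts, s)                    # (B,)
        loss = loss - (adv.detach() * logp).mean()                   # REINFORCE term

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```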
Yeah, and here are just a few examples showing that it works pretty well. We did it for detection. On the left is the base model, and for those who know detection, it gets a COCO mAP of 39, which is okay but not great. Then you do a little bit of RL tuning with the mAP metric as the score, and you get a much better mAP, and the detections indeed catch a lot more things. And 54 is a pretty good COCO mAP. We also did it with panoptic segmentation. And just to demonstrate that you really only need to define clearly what you want and then come up with some score for it, we did this silly example of a colorization model: grayscale image in, color image out. It's also generative, so you can sample from it. Then we just arbitrarily defined a metric that computes the flashiness of the image, RL-tuned towards that metric, and indeed it generates flashier images.
Right, then one last thing about this, to show what actually happens in the RL tuning. It's really not teaching the model anything new. It's just making it sample more of the things that you like, that you score highly, and sample less of the things that you don't like, that score badly. So here are a few plots. They are all a little bit hard to digest, but I will try to walk you through them. We have the model before, meaning before RL tuning, and after, meaning after RL tuning, and on the y-axis is the reward of the task, whichever it is. We take a lot of samples from the model before and a lot of samples from the model after, I think 10,000 samples here, and then we just sort them. What you see is that before, you had a lot of low-reward samples, and after RL tuning, you told the model, and this is literally what RL tuning is, right, this sample is bad, less of that. So you have way, way fewer low-reward samples and by default you sample many more high-reward ones. However, the raw model before RL tuning also had a very few high-reward samples, this little green dotted line. So it's not that RL tuning makes the model better; it just makes it sample these good parts more often. The original model was able to be just as good as the RL-tuned model, but only very, very rarely. Let's see.
Oh yeah, and this one just shows that the likelihood of a sample is not enough; you really need the score that you define. So here, what is this? Right, here from left to right we sample more and more samples; on the left it's, let's say, two samples. Sorry, I misspoke, let's rewind. Here you have two samples, and the curve shows the reward of the sample with the highest likelihood. Before RL tuning, that is not really good. The thing is, even if you take many samples before RL tuning, 10,000 samples or 100 samples, and you pick the one with the highest likelihood, you're not getting better samples in terms of the reward, because the likelihood is not yet aligned with your reward. You sample more and more things, but not in the high-quality region. And this is what the reward tuning does: it reweights the likelihood of samples so that high-quality samples are sampled much more often. Alright, and that was too much material and too little time in the end. So this is the end of it. Thank you. We're going to bring part three to an early end here. Brittany had one more vision-related paper to highlight, MLLM-as-a-Judge: Assessing Multimodal LLM as a Judge with Vision-Language Benchmark, which we appreciated for practical AI engineer use, but unfortunately we had to cut it for time.
You can see their oral presentation in the show notes. Last but not least, we combine parts one, two, and three, across world simulation, generative modeling, and vision, to check in on the field of reinforcement learning and robotics, which took almost as big a stage as video generation at ICML this year. For a natural transition from vision to robots, we turn to Ashley Edwards, who was on the Gato and Genie teams at Google DeepMind but is now at Runway, emphasizing the deep connection between generative video and the world simulation that is essential to both diffusion models and robotics.
- So yeah, today I'm gonna be talking about how we can learn actions, policies, rewards, and environments from videos alone. Just as a little bit of a disclaimer, I'm going to be talking about a lot of my prior work, some of which I thought I would never talk about again, and some of which I thought I would never talk about at all, but a lot of it motivated the kind of research that I've been working on these days. So I thought it would be fun to go back and look at some of the history that led me here.
So I think we've probably seen iterations of this kind of slide throughout this entire conference. But I think we know by now that there's been a lot of progress made in text to video generation. And one question we might be asking is like, how the heck did we get here? I mean, I think just this past year alone, we've seen so many innovations.
I hope that many people during this conference will be discussing this, but it won't be me. Instead, I'm going to be talking about how did I end up getting here. My research background is actually in reinforcement learning, but suddenly I found myself in the controllable video generation space. So this is why I wanted to talk about some of my older works, because I wanted to see, like, how did I end up getting here? Maybe some of the things that I was working on are still relevant today.
So in order to answer this question, I'm going to take us back to the summer of 2016, where I got to spend a summer in Japan. And so my main focus here was to actually work on this robot here. So I started off actually as a robotics major.
And what I wanted to do here was essentially try to train this robot to learn sign language gestures from videos. This is when I really started getting interested in how we can train agents from videos, because, coming from a reinforcement learning background, I was getting annoyed with always having to come up with a reward function for training our agents. Every time we had a new environment, we had to come up with a new reward function. And so I was really interested in how we can come up with a more general way of representing tasks, and that could be done through videos. And when I arrived at the university, this was at Waseda University, I realized that the hands on the robot weren't actually working, and so I wasn't going to be able to teach it hand gestures from videos.
But this robot was actually a very expressive robot; I think it was actually designed as a kind of comedian robot. And so it had a lot of different facial expressions it could make. So instead of teaching hand gestures, I decided, okay, fine, I'll try to teach it facial expressions.
So if you look at what humans look like, they don't look anything like this robot. And so the thing that I was trying to figure out was how we can teach a robot, this robot in particular, to mimic a facial expression like this when the features look very different. And again, this was in 2016, so we had a few examples, one GPU, and that sort of thing; we didn't have a ton of examples for learning a representation here. What I wanted to do was figure out how we can map the feature space of the robot to look more like the feature space of the human. And one thing that we realized was that if you look at the shape of motion over time coming from these facial expressions, and in general any kind of motion, there is a bit of structure. So this here is showing something called a motion template, which essentially takes a sequence of frames, concatenates them, and averages them over time, so that you can see where the motion happened as well as when it happened in time.
So this is what this representation is showing. And the nice thing is that this representation is kind of domain-agnostic. On the left, for example, you can see the motion of the robot; to the right of that, the motion of a human. And then we had two different tasks, one smiling, one surprised. Again, this was a workshop paper back in the day, so, you know, whatever. I thought this was kind of cool, even though these aren't the best results. But essentially what you can see is that the shape is similar across these different tasks. And so it kind of learns how to smile and kind of learns how to make the surprised face, because we're trying to mimic the motion that you see here rather than mimicking the actual features that you would see in a human versus the robot, if that makes sense.
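For readers unfamiliar with motion templates, here is a generic motion-history construction in the same spirit as what the talk describes; the thresholding and timestamp encoding below are illustrative choices, not necessarily the exact formulation in the original paper.

```python
import numpy as np

def motion_template(frames, threshold=0.05):
    """Build a simple motion-history image from a clip (sketch).

    `frames` is a (T, H, W) array of grayscale frames in [0, 1]. Pixels
    that changed recently get high values and pixels that changed long
    ago keep smaller values, so the result encodes both where and when
    motion happened, independent of what the moving thing looks like.
    """
    T = len(frames)
    template = np.zeros_like(frames[0])
    for t in range(1, T):
        moved = np.abs(frames[t] - frames[t - 1]) > threshold
        # More recent motion overwrites with a larger (later) timestamp value.
        template[moved] = t / (T - 1)
    return template
```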
So, one other thing about this work: we essentially had to hand-specify our reward function; we were using HOG features to compare the human's motion template to the robot's motion template. And it was a single task: we were trying to transfer a facial expression from a human to a robot. But after this, we started getting more interested in how we can learn representations across multiple environments rather than focusing on a single task. And so this is when we started working on this next project, where we were trying to learn behaviors from videos. In this work, we had a giant dataset of publicly available internet videos, back in 2017, showing video game playthroughs, mostly consisting of speedruns. What we wanted to see was whether we could infer the behaviors taking place across these environments, because in these video games you might see characters moving to the left, moving to the right, and that sort of thing. And the idea was that if we could infer those behaviors, then we could use them to generate a sort of controller for agents, to say: when I see this new scene, I want to generate what I want you to do. Again, this was a workshop paper, so we didn't get to that second part. But we did get to generating these motion templates. So all of these are showing: given an initial scene, generate the motion template, so the model can generate new motion templates for unseen scenes.
And so this is showing some of the results. On the top, you see the video game generations coming from training on that dataset. These are unseen environments, and you can see it's starting to extract the motion happening across these different scenes. It's probably kind of hard to see, to be honest. But the other interesting thing we found was that the same model that had been trained on video games actually worked really well at segmenting out animals from unseen environments. We had only trained on video games, but this was one of the emergent behaviors you get by predicting motion, the things that are going to change over time: you're actually able to extract these different characters.
So one other interesting thing here was that, instead of trying to predict a single mode, instead of putting your loss on a single next-frame generation, we found it was useful to try to predict multiple futures. Essentially, what you see here is: on the left is our initial frame, and to the right of that you see all the different generations that are happening. So if you squint enough, you can see, for example, that it can predict moving to the right, or moving to the left, or moving up or down, for each of these different scenes. And we found that this was happening consistently. The way we trained this was to take each of these generations and minimize the loss between the ground-truth next frame and the closest of those generations, so we're clustering over our different future predictions. But the interesting takeaway is that these different kinds of motion actually represent actions. And so I think what we started to figure out was that
actions are kind of a shared representation across these different scenes. And so rather than trying to explicitly represent them through something like the motion templates we tried before, we wanted to see if we could just infer actions alone from the videos.
So that was the motivation behind our work ILPO, where we try to learn actions and policies from videos alone. The way this worked: imagine you have an initial frame like this. In your dataset, and again we're trying to learn from videos and train agents to imitate from those alone, without action labels, you might see, for example, a transition that looks like moving to the right, or one that looks like jumping in the air. And so what we were trying to learn here was something called a latent action, which is essentially the notion of what causes this transition to occur. We know that something causes it; we just don't know the action labels, so we try to learn them from the data. And then we have a latent policy, defined as the likelihood of the expert taking some latent action in a given state.
So essentially the way we learn this is: imagine in our dataset we see these two sequences here, and let's say the expert moved to the right, for example. We learn a generative model that predicts each possible next state given the initial state. And essentially what we do again is cluster over all of those potential next frames by looking at the generation that's closest to the one actually shown in the data. So we again have this min loss that says: look at all my latent actions and find the one whose generation looks closest to the ground truth. So we're clustering over our future frames here.
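A sketch of that min loss, in the spirit of ILPO's clustering objective, could look like the following; `generator` and the number of latent actions are placeholders, and the real method has additional components (the policy head, for one) that are omitted here.

```python
import torch
import torch.nn.functional as F

def latent_action_min_loss(state, next_state, generator, n_latent_actions=8):
    """Min-over-latent-actions reconstruction loss (sketch).

    `generator(state, z)` is a hypothetical model that predicts a next
    frame for each discrete latent action id z. Only the prediction
    closest to the observed next frame receives gradient, so the latent
    actions end up clustering the distinct transitions in the data
    (move left, move right, jump, ...).
    """
    losses = []
    for z in range(n_latent_actions):
        pred = generator(state, z)                              # (B, C, H, W)
        per_sample = F.mse_loss(pred, next_state, reduction="none").mean(dim=(1, 2, 3))
        losses.append(per_sample)                               # (B,)
    losses = torch.stack(losses, dim=1)                         # (B, K)
    return losses.min(dim=1).values.mean()                      # best latent action per sample
```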
And then what we do is learn a policy over all of those different transitions that we can see. The way we can do this is: let's say, for example, in our dataset we observe that half the time the expert moves to the right, half the time they jump in the air, and they never stay still. So we try to learn a policy that ends up looking like this: if you average over all of those future frames, you might see something that looks like that. We learn a policy that effectively weights all the different futures coming from our generative model, such that if we take the expectation under that policy, we end up with an average generation that matches the expected future coming from our expert demonstrations. And that's essentially how we train the policy. Each of these weightings over futures is saying: what is the likelihood that I would take latent action zero in this state, or latent action one in this state, for example, and we can train it that way.
So yeah, this is showing that after just 200 steps of interacting with the environment, our model is able to adapt really quickly. And the reason is that we're learning this policy from the videos before we ever place the agent in the environment, so we only need a few environment samples to map our latent actions onto the real actions you can take in the world.
So one thing to take away from that work is that we can represent our actions through the next-frame generations that are taking place. Of course, this assumes that your dynamics are deterministic, but let's say they are. Basically, each of these next frames represents a kind of action that you can take in the world. We then took this idea in a different direction, where we say: if we have a reward function, we can now try to learn a value function, an optimal value function, from videos alone, even if you have suboptimal data. So for example, if you have demonstrations coming from videos where the expert isn't really an expert, but is running into things and doing suboptimal things, and only sometimes runs into the goal.
And so the idea is that usually in reinforcement learning you can learn an optimal policy from suboptimal data, but it gets a little trickier with videos because you don't have access to actions. So the idea was, instead of learning a policy over actions, which you would typically see in reinforcement learning... If you do RL, and I know this is a video generation sort of venue, some of you might be familiar with a diagram like this, where you have an agent running around the world, taking actions, trying to maximize its long-term expected reward. The idea behind this work was that, instead of learning a value function over state-action pairs, as you usually would, you learn a value function over state and next-state pairs, and a policy over next states rather than a policy that tells you which action to take. And the benefit is that you can actually learn this optimally even when you have suboptimal data. There's a lot of different stuff on this slide, but the main takeaway, again, is that we're learning this policy over states, and learning a value function that says what the value is of transitioning from one state to the next, rather than the value of taking an action in a given state.
And then we can train this policy, which tells us what state we want to transition to, by maximizing the value of moving from one state to the next. The other thing we eventually need, when we're actually interacting with the environment, is to figure out where the actions come from, so we also learn an inverse dynamics model; that's what that is showing there. So again, given suboptimal data, we can learn optimal generations. This is showing plans coming from our policy over states, which says: what state should I move to that's going to maximize my value?
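To summarize the three pieces just described, here is a hypothetical module layout; the names, network shapes, and the use of state embeddings rather than raw frames are all assumptions made for the sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VideoValueAgent(nn.Module):
    """Sketch of the components described above (names are illustrative).

    - value(s, s'):   how good is transitioning from state s to state s'
    - policy(s):      proposes the next state (here, an embedding) to move to
    - inverse(s, s'): recovers the low-level action that realizes s -> s',
                      needed only once we act in a real environment
    """
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, state_dim))
        self.inverse = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def plan_step(self, state):
        next_state = self.policy(state)                           # proposed next state
        pair = torch.cat([state, next_state], dim=-1)
        return next_state, self.value(pair), self.inverse(pair)   # plan, its value, its action
```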
And one interesting takeaway here is that this is basically a video generation model: we're generating next frames that tell us how to maximize our value. And given just random rollouts of behavior, we're able to generate optimal trajectories. This also works in reinforcement learning, but I'll skip over that because we're here for video generation. The other thing is that this required us to have a reward function. So one other thing we were interested in is how we can learn from videos when we don't have a reward function. Can we get agents to learn from this sort of data?
And so one of the things we can observe is that, usually when you have videos, there's an ordering to how the trajectories unfold. Typically you have expert data showing good behavior to follow. So what we can do is say that the end of the video gets a reward of one, and as you backtrack in time it gets discounted, just like in a reinforcement learning trajectory. And we can use this idea to learn a value function that tells us how good the behaviors in our videos are. And that's essentially what we do. Given a sequence of frames, we say you get a reward of one at the end, we discount that backwards over time, and that's our value function. We can use it for training a reinforcement learning agent: basically replace the bootstrapping step with our learned value function and then train the policy in a supervised way.
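The discounting itself is the same return computation used everywhere in RL; a minimal sketch of the value targets for one video, assuming the only reward is a one at the final frame:

```python
import numpy as np

def video_value_targets(n_frames, gamma=0.99):
    """Value targets from a video with a terminal reward of 1 (sketch).

    The last frame gets value 1 and every earlier frame is discounted by
    how far it is from the end, mirroring the return of a trajectory whose
    only reward arrives at the final step.
    """
    t = np.arange(n_frames)
    return gamma ** (n_frames - 1 - t)   # shape (n_frames,), ends at 1.0

# e.g. video_value_targets(5, gamma=0.9) -> [0.6561, 0.729, 0.81, 0.9, 1.0]
```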
But you can see that we trained this model over a bunch of different videos of pouring, and over time the values increase. So this is telling you that you can learn a value function this way. You can even use it for training reinforcement learning agents, because, okay, fine, I have a reinforcement learning background, we do this sometimes. And you can see the agent is able to learn even though the value function was trained on videos alone.
Okay, so essentially what we showed is that we can learn actions and rewards and policies from videos. So what's left? This is what led me into the controllable video generation regime, where we're now trying to learn environments from videos. And this was the idea behind Genie, where we try to learn a generative interactive environment from videos alone that's playable by both humans and AI agents. A lot of my previous work was really about how we can use these videos for training the agents themselves. But I was lucky to meet people like Jack and Tim, from a team with an open-endedness background, and they said essentially: well, we don't only need to learn policies, we can learn entire environments, and we can place agents within those environments and get them to learn from that. And so this is what led to our Genie work, which we presented here. The idea behind this work was that we learn three main things. One was a tokenizer over the video, where we represented the frames using a discrete VQ-VAE. Then we had a latent action model. I think this was probably the most important component: it takes in sequences of frames and tries to infer the changes between them, such that you can predict the future using that latent action representation. And then you plug that into a dynamics model for predicting the future. This is where the controllability comes from: the latent action model tells you how things are going to change over time.
And this is what led to our final results, where we found that if you take some text-generated images, you can plug them into our model and interact with them as if they were a real environment. And again, we were training over a giant dataset of platformer games here. I guess the reason I didn't spend too much time talking about Genie is that there have been a few workshop talks already, and we talked about it at the conference. But I was wondering: how did I end up getting into this kind of research? And I think the idea is that you can use these environments for training the agents of the future. And hopefully we can learn policies, learn latent policies, learn reward functions in the ways we discussed before. So yeah, I think that's the main thing I have. I also wanted to point out all my collaborators here; there have been a lot of really great researchers that I've had the opportunity to work with. But yeah, that's all. Thanks. I think I probably have a lot of time for questions.
In more complicated environments, actions alone wouldn't be able to represent all of the dynamics. How do you think we can disentangle actions without supervision in this case? Without supervision? So I think, if you have a notion of reward, for example, or if you can try to learn a policy, you might be able to extract the most likely actions that are going to happen versus the dynamics. But I think it's hard to fully disentangle these without supervision. In our case, you can probably control the crowds if you want to. But maybe you can use something like text to add in additional information. And I think it also comes down to scale. Yeah. So if you wanted to scale, let's say, Genie to real-world videos, what would be the major architectural and, kind of, ideological changes needed to do that? Yeah, that's a good question. So the Genie model was pretty general; there wasn't anything in there that said we were explicitly training on 2D platformer games. We also had experiments where we got it to work on robotics data. So I think probably just scaling the architecture size, the Bitter Lesson as usual, and adding in more data would hopefully enable it to learn from that. I think you could probably also change different components of the architecture itself using current state-of-the-art techniques. It is surprising, or rather not at all surprising, how many of the answers to workshop questions are just this one word: scale. We challenge you to go through a day at NeurIPS without mentioning the Bitter Lesson once.
As for the audience member's question about action generation and behavior cloning, Brittany was walking the poster sessions and found a possible answer from NYU. I am here with Seungjae Lee, also known as Jay Lee, to talk about his poster on the VQ BET model, which is actually one of the spotlight posters being featured here at the ICML conference. The description is: a scalable behavior generation model for efficient multimodal behavior prediction in complex tasks. That is quite a mouthful, so it would be very helpful if you could explain for us a little more what exactly it is that you've worked on here. Okay, nice to meet you. Actually, our work started from the question of how we could use a very powerful, LLM-like token-prediction framework for behavior generation tasks. The main concern with this question is that action data lives in a continuous space. It's not like the language we use, which is really easy to tokenize. So what we do is use a VQ-VAE, a vector quantizer, to quantize the continuous action data into a discrete representation, and we use that discrete representation as the tokens of an LLM-like architecture, so that we can predict behavior based on the current observation.
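To illustrate the core idea of turning continuous actions into tokens, here is a toy version of the vector-quantization step; the shapes, codebook size, and single-level quantizer are assumptions for the sketch and are simpler than the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ActionQuantizer(nn.Module):
    """Toy VQ step for continuous actions (illustrative, not the paper's code).

    A chunk of continuous actions is encoded to a latent vector, snapped to
    its nearest codebook entry, and that entry's index becomes the discrete
    "action token" a transformer can predict like a word.
    """
    def __init__(self, action_dim, chunk_len, n_codes=64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Linear(action_dim * chunk_len, latent_dim)
        self.decoder = nn.Linear(latent_dim, action_dim * chunk_len)
        self.codebook = nn.Embedding(n_codes, latent_dim)

    def forward(self, action_chunk):                         # (B, chunk_len, action_dim)
        z = self.encoder(action_chunk.flatten(1))
        dists = torch.cdist(z, self.codebook.weight)         # (B, n_codes)
        token = dists.argmin(dim=-1)                         # discrete action token
        recon = self.decoder(self.codebook(token)).view_as(action_chunk)
        return token, recon                                  # token for the LLM, recon for decoding
```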
Very, very interesting. And how did you arrive at this area of research? What is the background or origin story for this project? Yeah, actually, my personal background is closer to reinforcement learning. But nowadays there are many accessible platforms with large action datasets, and I found that it is really hard to train a good behavioral cloning policy in the traditional way with a large dataset. So we need a better architecture, one that can leverage this LLM-like approach. That was the starting point of our research.
And how did you handle the dataset collection problem? Because with a lot of the applications we're seeing on the robotics side of things, it seems that data is more of a bottleneck than anything else. Yeah, actually, it is a really good question, since getting a dataset is really expensive, and that is a really important point in robotics. Most of our environments are open-sourced, so you can just download most of the datasets. Some of the data was collected by humans using VR equipment. And for our real-world experiments, we gathered the data ourselves with very small manipulation equipment and an iPhone.
So, yeah. So you bootstrapped the dataset in part yourself, and then it looks like you did a bunch of work on the simulation side of things as well? Yes. Actually, we first validated our framework in simulation, and then, after some solid results, we moved on to real-world experiments. And the strong point of our model is that it is really lightweight, so it does not need a large dataset. We only need 45 demos for each task in real-world scenarios, so it only takes one or two hours of human data gathering. It's not that difficult, yeah.
And can you talk a little bit about the performance results you've seen with this model, since the listeners at home don't have the benefit of the poster in front of them? You mean the performance of our model? I would say that our performance is quite similar to the well-known diffusion-based models, but the inference time is really fast, about 20% of the diffusion model's. Inference time is really important in robotics, so we could say that you can do more than 100 Hz control on a GPU and more than 20 Hz on a CPU. So the performance is good enough compared to the diffusion-based policies, but the inference time is much better than those baselines.
Got it. And you mentioned that you published this toward the end of last year. Have you continued to work in this problem area, or how has your research evolved since the publication? Actually, we believe the future direction should be scaling up this architecture for more generalizable agents; for example, agents that could perform tasks based on language instructions. So our objective would be scaling it up. Very exciting. And you did this through your work at Seoul National University, and then you went over and worked with the NYU folks as well? Yeah, actually I was a master's student at Seoul National University, and I emailed the people at NYU and we started collaborating last summer.
Very exciting. Well, thank you so much for the time walking through this. I appreciate it. Thank you. That was a great spotlight poster from Seungjae Lee. And we also recommend his professor Lerrel Pinto's talk on building general purpose robots, which we link to in the show notes.
Brittany had one more robotics paper to highlight: PIVOT, or Iterative Visual Prompting Elicits Actionable Knowledge for VLMs. But we are skipping it in the interest of time, and to not keep adding to our already overflowing Google DeepMind publication counter.
By far one of the biggest names in reinforcement learning and robotics is Professor Chelsea Finn, now founder of the $2 billion start-up Physical Intelligence, who gave not one, not two, not three, but four talks at ICML on her lessons on robotics.
We are highlighting her keynote here, but we also recommend checking out her colleague Sergey Levine's talk on robotic foundation models. My name is Chelsea, and I do research on both machine learning algorithms as well as on applications of machine learning to robotics.
Because I work on both of these two things, I think that robotics has provided a perspective on my machine learning research that's a little bit different than the average machine learning researcher. Today, I'd like to share a little bit about that perspective and what that perspective has brought to my machine learning research.
So the first thing that I'll mention is that I think that my robotics work, even though it's not necessarily exactly aligned with core machine learning algorithms, it's often indirectly led me to problems that are relevant in applications beyond robotics.
So for example, about 10 years ago I started working on end-to-end neural network training for robots. This included things like training a robot to put a block into a shape-sorting cube or to use a spatula to lift an object into a bowl. And in both of these cases we were training a neural network to map from images from the robot's cameras to torques applied to each of the motors of the robot.
We were training neural networks that had all of 92,000 parameters. And while this might not seem particularly interesting or new, at the time it was actually quite different from the typical approach to robotics. And after I started working on training these policies to control robots with neural networks,
I was a bit frustrated by the fact that we had to train a neural network from scratch every time we wanted to train the robot, even though we were typically training the robot to do lots of different tasks rather than just one task. This led me to be interested in the question of whether robots could learn a new task more quickly by leveraging their previous experience instead of training from scratch. That led me to work on few-shot learning and meta-learning, which ended up having relevant use cases in other applications like education and drug discovery. And here's another example of robotics work leading me to relevant problems. In this initial work, the robots were learning policies that were specific to one spatula or one shape-sorting cube or one environment. And I became very interested in whether we could leverage broad datasets to improve the generalization of robots.
This led me to think about how we can develop machines that generalize broadly and potentially even beyond their training distribution. This led me to work on datasets, but also on robustness to distribution shift, which led to a benchmark we developed called WILDS that studies distribution shift in a wide range of real applications and has been used quite widely in the machine learning community.
So, from there, in this talk, I'd like to share a little bit about what working on robotics has taught me about machine learning. And to start off, let's talk about a few facts about machine learning in the context of robotics. The first is that machine learning is quite data-hungry, and at the same time, we don't have existing datasets on the Internet of robots controlling themselves to do different tasks. We don't have the equivalent of Wikipedia for how to control motors to tie shoelaces or to open a water bottle. Furthermore, we don't have an easy way to interpret or ensure the safety of machine learning policies applied to robots. And this has serious implications when robots have a real possibility of directly harming humans in the physical world.
Lastly, compared to other leading approaches to robotics like optimal control, we lack formal guarantees of what a machine learning-based policy would do. Because of these shortcomings of machine learning in the context of robotics, you might expect me to say that maybe machine learning isn't solving real applications like robotics and it's fundamentally problematic. But is that actually true? Let's look at an example.
So say that we want a robot to tear off a piece of tape and put it on a box. This may seem like a fairly simple task,
But this is actually a task that is incredibly difficult for traditional robotics approaches, because traditional approaches will typically try to model the entire scene, including how the tape will adhere to the canister and to the fingers of the robot, how it will tear when spread across the metal part of the canister, and how to control all 14 of the motors on this robot in order to accomplish the task.
It turns out that for this task that is seemingly extremely difficult for traditional approaches, we can actually use machine learning to address it. So we can develop a teleoperation interface, specifically Tony, a student in my lab, developed a teleoperation interface that we call Aloha that allows you to puppeteer the robot to solve a wide range of different tasks.
Once you develop this teleoperation interface, it means that you can collect data to train a machine learning-based policy to solve a wide range of different tasks, including the really challenging task of tearing off tape and putting it onto a box, as well as other tasks like putting on a shoe. In this case, it's a machine learning policy that's mapping the images from the robot's cameras to all the 14 joints, and it's doing so with a transformer trained end-to-end on demonstrations collected with teleoperation.
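As a rough illustration of what an end-to-end image-to-joint imitation policy can look like, here is a bare-bones sketch. The real ALOHA policy (ACT) predicts chunks of future actions with a CVAE objective; the backbone, dimensions, and plain MSE loss below are simplifying assumptions of our own.

```python
import torch
import torch.nn as nn
import torchvision

class BCPolicy(nn.Module):
    """Illustrative end-to-end imitation policy: camera images -> joint targets."""

    def __init__(self, num_joints: int = 14, d_model: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()                          # 512-d feature per camera image
        self.backbone = backbone
        self.proj = nn.Linear(512, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, num_joints)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, num_cameras, 3, H, W); each camera view becomes one token.
        b, n, c, h, w = images.shape
        feats = self.backbone(images.flatten(0, 1)).view(b, n, 512)
        tokens = self.encoder(self.proj(feats))
        return self.head(tokens.mean(dim=1))                 # (B, num_joints)

policy = BCPolicy()
obs = torch.randn(2, 3, 3, 224, 224)                         # two samples, three cameras each
loss = nn.functional.mse_loss(policy(obs), torch.randn(2, 14))  # behavior-cloning loss on teleop data
```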
And we can use machine learning not just for these fairly complicated tasks, but we can also do it for mobile manipulation. So we can develop a teleoperation interface for an entire mobile robot with two arms, use that to collect data, and again, use a transformer-based architecture
to train the robot to do challenging tasks like, on the top, cooking a piece of shrimp: pouring oil into the pan, putting the shrimp into the pan, flipping the shrimp, and serving it. And on the bottom, putting a pot into a cabinet. And so again, we're finding that machine learning is able to solve fairly complicated robotics tasks.
And beyond these kinds of robots, we can also do something like this for surgical robots. So surgical robots are incredibly difficult to control. This is the DaVinci Surgical Robot, and we can use machine learning in a fairly robust way to, again, train policies for complicated tasks like tying a knot and picking up a needle and handing it over to the other surgical tool.
Finally, we can also do this with full-size humanoid robots where if we develop a teleoperation interface, which is a little bit harder to do in this case, but we can train a shadowing-based teleoperation approach and then use this to train, again, transformer-based policies in this case to control robots to do pretty challenging tasks that involve controlling all of the different degrees of freedom, including both the arms and the legs of the robots.
And so going back to my question before of whether machine learning is solving real problems, I do think that machine learning has been making real advances that advance applications and really useful problems in the real world. Supervised learning works really well. We've seen significant advances in architectures, learning algorithms, and optimizers.
We also have reliable engineering practices for debugging if something isn't working, debugging if a policy isn't working or if another model is not achieving the performance that we want and ultimately improving the performance. Now you might ask, if machine learning is making real advances, why don't we have robots out in everyday environments solving real problems yet?
And a lot of people for that question will refer you to Moravec's paradox, which states that the things that are most intuitive for humans, like basic motor control, are the things that are often most challenging for machines. And this could explain why robotics is further behind than applications like debugging complex code or translating between two pieces of text.
But in my work, I've actually found that this isn't quite the most direct explanation. I think the explanation is actually that the things that lack abundant data are often the things that are most challenging for machines. And this is because, in scenarios that lack abundant data, we're not able to directly apply machine learning and identify patterns from large amounts of data.
This can include both data-scarce applications as well as scenarios that are novel and aren't represented well in the training data. This isn't just domains like robotics that don't have a corresponding Wikipedia; even within applications that do have a lot of data, there are scenarios that aren't represented well in that data, and that's exactly where machine learning algorithms, and as a result our machines, often struggle. Perhaps instead of trying to combine traditional methods with machine learning, I think robotics actually just needs more of what makes machine learning thrive. Essentially, we need to find more ways to get data for applications like robotics. This is really the core question I want to talk about today: how can we get good data for a wide range of problems in a cheap way?
How can we basically handle data scarcity without skimping on data? I'll talk about a few different ways to do this. The first is finding ways to augment data with supervision that is cheap and natural to provide. The second will be to leverage data sources beyond the particular target application. The third will be to incorporate data from test time in addition to the typical training dataset.
I'll spend the most time on the first point because it's a little bit different from some of the ideas that have become more commonplace in machine learning. Great. To start out with cheap, natural-to-provide supervision, let's look at how we currently supervise machines.
We currently will take a training dataset, train a model, evaluate that model. To evaluate it, we'll ideally actually look at how it does in a real situation by talking to it or by running a robot and so forth. Inevitably, the model often won't work well in some scenarios. The best course of action, assuming that you've optimized it well and the architecture is well-tuned, is to collect and label more data.
and specifically collect and label more data in the scenarios where it is struggling. This would involve going out, getting examples, and getting labels for those examples that cover the scenarios where it is not working well. This is really expensive and very human-intensive. If it were cheaper, we would be able to iterate on this cycle more, and we would probably end up with a stronger model. That's one shortcoming of a typical supervised learning approach.
The second is that input-output pairs are also a little bit weird in some settings. Say that we wanted a robot to cook a meal. The way to apply supervised learning in this case would be to collect examples of how to move the arms of the robot, how to move the motors as a function of the inputs.
This is a little bit weird compared to just trying to teach the robot naturally the kinds of things that it should do, like making sure that the water is hot enough before putting pasta in or setting a timer to make sure that it's been cooked for long enough. Or as another example, say that we want to train a system to make a medical diagnosis. The typical supervised learning way to do this would be to have examples of symptoms and then have examples of the diagnosis as a result.
But instead, perhaps the more intuitive way would actually be to teach the machine how diseases actually manifest in patients. This brings us to the idea that perhaps we might be able to train machine learning models in a more data-efficient way if we could incorporate natural-to-provide supervision. One thing you might think about here is: instead of providing labels, what if we use human feedback? Reinforcement learning from human feedback has been quite successful; instead of providing input-output pairs, we look at a set of input-output pairs and say this diagnosis is better than this one, or this pasta tastes better than this pasta. This can require a lot less supervision because you don't actually have to write out or provide the exact motor torques.
But it still requires many labeled examples, many examples of an outcome and which one is preferred. Is it possible to give machines far less supervision but still allow them to improve? We're going to look at this both in a robotics example as well as in a more standard image classification example. Let's start with the robotics example. We're going to be looking at long-horizon bimanual tasks. The goal, for example, might be to put all the objects into the bag.
It's really expensive to collect demonstrations that cover all of the possible scenarios that the robot might end up in. The form of natural supervision that we're going to be considering here is just verbally telling the robot how it might handle or how it might improve in situations rather than trying to collect a ton of demonstrations for the scenarios that it's struggling in. Specifically, say the robot is going about the task and it's struggling on this part of the task of putting the sponge in the bag.
What we'd like to be able to do is we'd like to be able to tell the robot at this part, you should use the sponge to open the bag wider because right now the bag is not open very widely. Ideally, it'd be able to use this verbal snippet of text to both improve on the fly to be able to figure out how to solve the task in that scenario,
as well as how to then take that data and actually improve the policy and improve its ability to handle new situations like that in the future. We'd like to be able to use this high-level language supervision both on the fly and for future improvement. How do we do this? If we want our robot to be able to improve from high-level language corrections, we need a way to connect what the robot is doing with language.
To do this, we're going to train a hierarchical policy, a high-level policy and a low-level policy, where language is the interface between those two policies. More specifically, we'll take the observation, this will be fed into a high-level policy that then predicts language corresponding to a skill like pick up the sponge or put the Sharpie into the bag.
Then this language command will be fed into a low-level instruction following policy that takes as input the robot's observations and outputs how to move the motor commands. This hierarchical approach is not new, it's actually been done in a wide variety of prior works, and so it's not what we're introducing here. The key insight of what we're going to do here is that we can actually update the high-level policy only with language supervision because its output space is
language, a skill that the robot should do next. Because of this, if the low-level policy can follow a wide range of instructions, then we can improve the full system just by updating the high-level policy and just by giving it language feedback. Specifically, we can do something like the DAgger algorithm, the dataset aggregation algorithm, on the high-level policy and freeze the low-level policy.
Specifically what this is going to look like is we'll intervene, we'll tell the robot what we want it to do. In this case, maybe it should rotate the tape in order to put it into the bag. This intervention, this language command will override the high-level policy and that intervention will be fed into the low-level policy instead of what the high-level policy is predicting. Then that will allow it to on the fly be able to leverage these interventions.
We'll also aggregate these interventions into a dataset and use it to update our high-level policy, so that it also learns how to improve from these corrections in the future. We're freezing the low-level policy and updating the high-level policy by supervising it just on the language corrections the human is providing. We gave this a fun name, Yell at Your Robot, or YAY Robot, because you can articulate your corrections or frustrations with the robot to help it improve.
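Here is a toy sketch of that training loop: the high-level policy picks a language skill, a human intervention can override it, interventions are aggregated DAgger-style, and only the high-level policy is fine-tuned. The observations, skill vocabulary, and stubbed low-level policy are stand-ins, not the actual YAY Robot code.

```python
import random
import torch
import torch.nn as nn

SKILLS = ["pick up the sponge", "put the sharpie into the bag", "go higher", "move to the right"]

class HighLevelPolicy(nn.Module):
    """Maps an observation to a language skill (index into a fixed skill vocabulary)."""
    def __init__(self, obs_dim: int = 32):
        super().__init__()
        self.net = nn.Linear(obs_dim, len(SKILLS))
    def forward(self, obs):
        return self.net(obs)                       # logits over skills

def low_level_act(obs, skill):
    """Frozen instruction-following policy (stand-in): returns a motor command."""
    return torch.zeros(14)

high = HighLevelPolicy()
opt = torch.optim.Adam(high.parameters(), lr=1e-3)
dagger_buffer = []                                  # (observation, corrected skill index) pairs

# Roll out: the human occasionally overrides the predicted skill with a verbal correction.
for step in range(100):
    obs = torch.randn(32)
    pred = high(obs).argmax().item()
    correction = random.choice([None, None, None, random.randrange(len(SKILLS))])  # simulated human
    skill = correction if correction is not None else pred
    _ = low_level_act(obs, SKILLS[skill])           # intervention is fed to the frozen low level
    if correction is not None:
        dagger_buffer.append((obs, correction))     # aggregate interventions, DAgger-style

# Fine-tune only the high-level policy on the aggregated language corrections.
for obs, target in dagger_buffer:
    loss = nn.functional.cross_entropy(high(obs).unsqueeze(0), torch.tensor([target]))
    opt.zero_grad(); loss.backward(); opt.step()
```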
What can this do? Let's look at some videos of fully autonomous policies on the robot. We'll start just with the base policy before doing any language corrections. This policy is trying to put the objects into the bag and it'll make mistakes. In this case, instead of putting the Sharpie into the bag, it put it underneath the bag and it struggles to be able to recover from that.
It also makes other mistakes. Here it's trying to pick up the Sharpie. The high-level policy output is shown here on the top left. We're actually finding that the high-level policy isn't ever issuing corrections like go lower or maybe rotate the gripper in this case. It just keeps on telling the policy to try to pick up the Sharpie. Now, after we fine-tune on language corrections, we find that it's able to autonomously correct for mistakes. Here it's making the same mistake as before by putting the Sharpie under the bag, and then it's trying to self-correct.
It then makes a mistake again, and then it's self-correcting again to try to move towards the camera, go higher, and then put the Sharpie into the bag. And by self-correcting, it's able to solve that part of the task successfully. It also learns to self-correct for grasping, where it'll self-correct to move to the right after it made a mistake of grasping too far to the left. And...
When trying to put the sponge into the bag, we'll also see it just change strategies completely. So here it's trying to, in some ways, kind of shove the sponge into the bag and it's doing so unsuccessfully. And now the high-level policy is going to tell it to instead try to release the sponge and sort of kind of poke it into the bag instead. And this helps it get it into the bag more successfully. And as a result of the robot's ability to self-correct from just this language supervision,
We find that the robot is better overall at doing long horizon tasks. This video is pretty long because the task is quite challenging, so I won't play all of it. But we get a sense that despite this task being quite challenging and having all sorts of scenarios that we don't necessarily have demonstration data for, we find that by leveraging this very cheap language supervision, the robot is able to perform the task a lot more successfully.
even though this task is quite long. Cool. Then there's one more thing I wanted to highlight from the system, which is that instead of just correcting after the robot has made a mistake, we can also actually proactively correct the robot when we think it might make a mistake in the future. This is a different task that we train the robot to do, which is to make trail mix. The grad students were quite happy about all the trail mix that ended up in the lab as a result of this. We see that right here, I pause the video,
The robot, it looks like it's actually about to accidentally pour a whole bunch of peanuts onto the table because the scoop is behind the bag instead of inside the bag.
Right here, because we noticed that it looks like it might be about to make a mistake, we can intervene and instead of telling it to continue by moving the scoop into the bag and presumably then trying to pour into the bag, we can interrupt the robot and correct it and tell it to move the left arm to the left, go higher, move the scoop into the bag, and then allow it to continue autonomously to pour into the bag. This is an example of how in real time we're able to improve the performance by proactively preventing the robot from making a mistake.
After fine-tuning, we find that it also learns this proactive corrective behavior where it notices that in this case with cranberries, it was about to make a mistake there. It didn't successfully get the scoop into the bag and then corrects itself to move the scoop into the bag successfully. Those are a number of qualitative examples. Quantitatively, we also see a large gain in performance just from verbal corrections. The dark orange bar here shows the success rate on average.
after fine-tuning on just language data, whereas the gray bar shows the policy before language corrections. We see a 20 percent improvement in performance. This closes a lot of the gap to this light orange bar, which is the performance if we use human corrections on the fly to override the high-level policy.
Lastly, it's worth mentioning that there is still a lot of room for improvement, even when we're using oracle high-level human corrections. This suggests that the low-level policies have room for improvement. To summarize, you can productively yell at your robot to help it actually accomplish tasks. But more importantly, the robot can improve just with language feedback, without demonstrations, by fine-tuning the high-level policy.
This is a lot more data efficient. It's a lot more data efficient to simply tell it to pick up the sponge or move to the right than to actually collect demonstrations with teleoperation. Then of course, this approach relies on a performant instruction following policy, and so you're not completely out of the woods in terms of having to collect some low-level data on the robot. Great. This is an example of how we can use natural supervision to
augment data and get much better performance in a very cheap way. Can we do something similar for other machine learning systems beyond robotics? Say that we wanted to perform an image classification task based on the species of a bird, and we train a model to do this. Here I'm going to be visualizing the predictions that the model is getting right and the predictions that the model is getting wrong. If we contrast the correct predictions with the incorrect predictions, one thing we might notice is a pattern: a lot of the incorrect predictions, not all of them, but a lot of them, have trees in the image, while there are far fewer trees in the correct predictions on the left. It'd be nice if we could just verbally tell the model to pay less attention to trees.
So just like how we told the robot kind of corrections like go lower in these situations or take a different strategy in these other situations, if you could simply verbally tell the model to correct its behavior here, we'd be able to correct the model far more efficiently than collecting additional images and labels for those images. And so we tried to develop an interface that would allow
humans to verbally correct machines in that way, where we first train an initial model, we allow people including non-experts to describe failures using natural language of that model, and then correct for those model failures just by using the language feedback.
So how this works is first we'll present the correct predictions and the incorrect predictions just like I showed before. And so in like a didactic example where we're trying to classify squares versus ovals, where the model might be paying attention to color when it shouldn't be, this would look something like this where you would contrast the examples on the left and right and then try to describe verbally how the model is making a mistake by paying too much attention to the color red or color blue.
Or, in the Waterbirds dataset, perhaps the model is paying too much attention to the trees in the background. Then once we visualize these model failures, we'll ask a person to describe them, and we'll also help them understand whether that description is something the model can actually understand and use to improve itself.
So we developed a web interface to allow users to look at these examples. We're using CLIP in the background to help understand whether the model is able to connect that verbal concept to the images in its dataset. We can then compute the similarity between the text prompt and each image to figure out whether or not that text prompt separates the correct examples from the incorrect examples. If it does separate those examples, then we can use that to improve the model.
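One plausible way to compute such an error score, sketched below, is to measure how well the CLIP similarity to the user's phrase separates the model's wrong predictions from its correct ones, for example with an AUROC. The exact scoring in the Clarify interface may differ; the checkpoint name and function here are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from sklearn.metrics import roc_auc_score

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def error_score(prompt, images, was_wrong):
    """images: list of PIL images; was_wrong: 1 if the classifier got that image wrong, else 0."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image.squeeze(-1)   # similarity of each image to the phrase
    # A score near 1.0 means the phrase cleanly isolates the failures from the successes.
    return roc_auc_score(was_wrong, sims.numpy())

# score = error_score("a bird photographed in front of trees", images, was_wrong)
```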
If it gets a high error score, then we can directly start to use that description for training. If the user finds that they aren't able to describe something the model understands and that separates these concepts, then the user can iterate on their description to try to phrase it in a way the model can interpret. Then once we have this text feedback, we'll take a very simple approach, presented in a previous work called DFR, where we simply balance the data across these different groups.
For example, if we're finding that the model is paying too much attention to trees, then we're going to balance the images that have trees in them with the images that don't have trees in them. This will de-correlate the data such that it no longer is incentivized to pay attention to trees.
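A minimal sketch of that DFR-style rebalancing, assuming we already have a boolean mask marking which images contain the flagged concept (for example, from thresholded CLIP similarities) and frozen backbone features for a last-layer retrain; names and thresholds are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balanced_indices(has_concept: np.ndarray, rng=np.random.default_rng(0)):
    """Subsample so images with and without the flagged concept (e.g. 'trees') are equally represented."""
    pos, neg = np.where(has_concept)[0], np.where(~has_concept)[0]
    n = min(len(pos), len(neg))
    return np.concatenate([rng.choice(pos, n, replace=False), rng.choice(neg, n, replace=False)])

# DFR-style last-layer retraining on the de-correlated subset (features = frozen backbone embeddings):
# idx = balanced_indices(clip_similarity > threshold)
# clf = LogisticRegression(max_iter=1000).fit(features[idx], labels[idx])
```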
Then once we de-correlate the data, we'll then retrain or fine-tune on that de-correlated data to get a model that stops paying attention to that piece of feedback. Then if desired, you can in principle also iterate on that process to then identify any new model errors that popped up. Great. In our experiments, we tried to identify first if non-experts could actually identify and describe model errors in a way that led to improved robustness of the model.
Second, whether or not we could scale this approach even to very large datasets, by cheaply providing supervision that can identify model failures in these large-scale settings.
And so in the first case, we recruited 26 participants on a crowd platform with very minimal qualifications, being native English speakers and so forth, so likely people who are not machine learning practitioners. And we had them interact with Waterbirds and CelebA.
And what I'm showing here is I'm showing all of the different verbal descriptions that each participant identified for describing model failures. And the black line is showing Yunho, the lead student researcher on this project, his performance or the error score that he was able to get for his reference phrases. And what we can see here is that in a lot of the cases, the human non-expert participants are able to identify model failures fairly accurately compared to Yunho.
In these examples, they're able to identify the correct concept underlying the model failure, but they might have some suboptimal wording compared to the wording that Yunho used. There's a number of examples where they find basically the same phrase that Yunho used. Then there's also examples where actually the non-experts provided better descriptions of model failures than Yunho's reference prompt.
Then lastly, there are also a few cases, specifically four, where participants struggled to identify the correct model failure. Then once we have these descriptions, the question is, how cheap is this supervision, and can we use it to actually improve the model? First, we found that on average these non-experts spent two to three minutes giving feedback to the model. This is pretty fast, and a lot faster than collecting additional labeled data.
Second, we found that if we use their descriptions to rebalance the data and retrain, we were able to get a model performance shown in yellow that is a lot more robust than simply training on the original dataset or zero-shot prompting approaches. We see in this case specifically a 7-10 percent improvement over training on the initial dataset, just with two to three minutes of additional supervision from a non-expert.
Then beyond these somewhat simple datasets, what about datasets like ImageNet? We didn't run a user study on this specific setting, but we found that using this interface, Yunho was able to fairly quickly identify model failures on ImageNet.
For example, here are some examples all from the same class, actually from the sliding door class, where the model is doing very well on these images and very poorly on these images. And as you might notice, these examples on the right have a high similarity with cars and a lower similarity with cars on the left. And so it's kind of struggling to classify sliding doors if they're sliding doors on cars.
And he's able to identify model failures on 31 different classes in ImageNet and able to do so in a relatively short period of time. And with data reweighting, he was able to improve the performance of the model on the minority split of the data while preserving overall performance. And so...
With this sort of approach, we found that we're able to give verbal feedback based on an initially trained model. And because it's based on a model that's already been trained, similar to the robotics setting, it's easier to target model failures efficiently rather than trying to guess, out of the blue, the kinds of supervision the model might need. And second, the verbal feedback we're giving is no longer at the data-point level; it's at a more global, concept level.
This means that the verbal feedback is especially cheap, because with a single sentence or phrase we're able to address a broader class of model failures rather than providing individual examples. Then, importantly, I should also mention that in this most recent work,
we're only identifying and correcting one model failure. This means that the scope of the work is quite limited, but it'd be really exciting to see if we could use this high-level verbal feedback for other model failures and develop a more general approach for improving models just with verbal feedback. The takeaway from both the YAY robot work and the Clarify interface is that natural supervision like language supervision, if the model can use it well,
that supervision can be far cheaper and sometimes even more informative than collecting a large number of labeled examples. It's a useful tool to have when we don't have a great deal of initial training data. Great. Now, another example of some data that's essentially out there, but we just need algorithms to be able to use it well, is data from other sources beyond the target application.
Specifically, one natural thing to do to improve generalization for a particular application is to leverage Internet data, to leverage models trained on text and images. One very common way to do this is just to use, for example, an encoder pre-trained on ImageNet. We find that at least in robotics applications, it does improve performance.
somewhat compared to just training from scratch. In particular, we can do well on tasks and scenarios that are seen in the training dataset. But when evaluating generalization to unseen objects, backgrounds, and environments, there's still a really substantial gap compared to the things it saw during training. Yet the Internet has
really vast training data and so we expect that maybe we could do better than this. Specifically, maybe if we could more closely connect the pre-trained model with the downstream task, we might be able to more effectively leverage all of the rich knowledge that exists in Internet data. So specifically what we're going to do is we're going to take a visual model, instead of taking a model trained just on ImageNet classification, we'll take a model trained for visual question answering,
We can formulate the downstream task, specifically the robotic control problem, as a visual question answering problem. Instead of having it output continuous values, we're going to frame it as a question: what should the robot do to accomplish a task like picking up the chips or moving a bottle upright?
Then we'll likewise also frame the output of the model as a series of tokens similar to the output of a VQA task. These tokens will correspond to different language actions, like how to translate and rotate the gripper of the robot.
If we formulate essentially this downstream task just like the task that is seen during pre-training, perhaps it will be able to leverage the pre-training data more effectively and understand how to generalize robotics tasks similar to how it generalizes these VQA tasks.
Once we have this data, we'll use the same architecture, specifically a pre-trained vision language model. You can either fine-tune it just on the robot VQA tasks or a combination of the robot tasks and the existing Internet VQA data that the vision language model was pre-trained on. It'll output these language tokens that'll then be converted into robot actions to be run on the robot.
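The usual recipe for turning continuous actions into tokens, sketched below, is to discretize each action dimension into a fixed number of bins and map each bin to a reserved vocabulary token. The bin count, action bounds, and prompt format here are illustrative assumptions rather than the exact RT-2 or OpenVLA implementation.

```python
import numpy as np

NUM_BINS = 256   # assumed bin count per action dimension

def action_to_tokens(action, low, high):
    """Map a continuous action (e.g. Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper) to discrete token ids."""
    bins = np.clip(((action - low) / (high - low) * (NUM_BINS - 1)).round().astype(int), 0, NUM_BINS - 1)
    return bins.tolist()                           # each bin index maps to a reserved vocabulary token

def tokens_to_action(tokens, low, high):
    """Inverse mapping: decode predicted token ids back to a continuous action for the robot."""
    return low + (np.array(tokens) / (NUM_BINS - 1)) * (high - low)

prompt = "Q: what should the robot do to pick up the chips? A:"   # control posed as a VQA question
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = action_to_tokens(np.array([0.1, -0.2, 0.0, 0.0, 0.0, 0.3, 1.0]), low, high)
decoded = tokens_to_action(tokens, low, high)
```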
Essentially, we're posing robotic control as a visual question answering problem and defining tokens corresponding to robotic actions. We'll refer to this fine-tuned model no longer as a vision language model, but as a vision language action model, in the sense that some of the tokens now represent actions. Now, going back to the earlier comparison against a pre-trained ImageNet encoder, we find that models that use this vision language action recipe are actually able to generalize far better than the model pre-trained just on ImageNet classification. Essentially, by connecting the pre-trained model and the downstream task, we're able to get gains in generalization.
Now, what does this look like for more recent state-of-the-art models? We can also compare state-of-the-art models that use standard pre-training or no pre-training to recent vision language action models like RT-2-X and OpenVLA. We'll be doing this on evaluations that focus on generalization. What we find, on two different robot platforms, is that the vision language action models, shown in red and green,
do substantially better on average than the models that don't use this vision language model pre-training and don't use this formulation that formulates the downstream task very similarly to the pre-trained task. Again, even with these state-of-the-art models, we again see this trend that generalization improves significantly if we connect the pre-trained model with the downstream task.
Going back to trying to handle data scarcity without skimping on data, we can leverage data that already exists, data from the Internet that's easy to get, and we can leverage it much more effectively if we connect the pre-trained model with the downstream task. Great. Then lastly, I want to talk about incorporating data from test time.
So specifically thinking about whether if we are in a new situation that's not represented well in our training data, can we adapt on the fly? I think this is a really important problem because when machine learning systems are faced with the real world, there's a vast number of objects, vast number of configurations and scenarios that these machine learning models will be faced with. I don't think we can even hope to anticipate every possible scenario that these machine learning models are faced with.
Because we can't anticipate it, then maybe instead we can just adapt after the fact when we see more data from that situation. For example, say that we're trying to open a door, maybe this is a new door that we haven't seen before. If we're trying to do this, we might make a mistake and might need to retry.
It turns out that this is a video of a human opening this door, and it was quite subtle, but the human actually did make a mistake and adapted very quickly. Let's replay the video. Specifically, we see that the human puts the key into the door, actually puts it in the wrong place right here, and then continues by taking the key back and putting it in the correct place.
Even humans make mistakes and adapt. Humans are in many ways a gold standard compared to machine learning, so if even humans are adapting, can we develop machines that adapt in a similar way? Let's look at this in the context of a robotics problem. This is a scenario that's unseen by the robot. The robot's here, and its goal is to get over here. If it approaches this problem and makes a mistake, can it actually retry?
The robot only gets this first-person observation right here. Without any context, if it hasn't actually attempted the task, maybe from this observation it'll try to crawl under and see where it gets from there. Then if it tried to crawl and realized it was very close to an obstacle, maybe it should try a different strategy. With this context, with this previous history of what it has tried in the past,
Maybe it should try with the same current observation to do something different like turning left or turning right. This is exactly what we'll do. We'll take these recent attempts and we'll combine them with a model that's known to be fairly good at adapting from recent attempts. Specifically, in this case, we'll use a vision language model. We'll pass these recent attempts and the robot observation into the model. We'll then have this select a skill for the robot to do and then output actions.
Ideally, the vision language model should leverage what the robot has tried before and pick appropriate skills after it's made some mistakes. If we do this, we find that exactly on the scenario before, which is unseen from the robot, if we don't use history and don't allow it to adapt from its mistakes, it often makes the same mistake over and over again. Whereas if we do use in-context learning, it's able to try something different and adapt on the fly based on what it has seen in this test environment.
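A minimal sketch of the in-context prompting this implies: pack the recent attempts and their outcomes into the prompt, along with the current observation, and ask the vision language model to pick the next skill. The skill list, captions, and the `vlm_complete` call are hypothetical stand-ins.

```python
SKILLS = ["walk forward", "turn left", "turn right", "crawl", "climb"]

def build_prompt(history, observation_caption):
    """history: list of (skill, outcome) pairs from this test environment, newest last."""
    lines = ["You control a quadruped robot. Pick the next skill from: " + ", ".join(SKILLS) + "."]
    for skill, outcome in history:
        lines.append(f"Previously tried '{skill}' -> {outcome}.")
    lines.append(f"Current view: {observation_caption}.")
    lines.append("Next skill:")
    return "\n".join(lines)

prompt = build_prompt(
    history=[("crawl", "got stuck very close to an obstacle")],
    observation_caption="a narrow gap under a table, wall to the right",
)
# next_skill = vlm_complete(image=current_frame, text=prompt)   # hypothetical VLM call
```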
Likewise, here's another setting, an outdoor setting. This is actually quite challenging because there's this step that is quite unstable in front of the robot. At this point in the video, the robot actually can't even see that its back legs are stuck on the step. It's trying to walk forwards. If it doesn't have history, it doesn't know that walking is being unsuccessful in this scenario. But with history, it's able to figure out that it should go backwards and instead try to climb over the step instead of just trying to walk over it.
We also see quantitatively that leveraging test time information, leveraging these images that the robot sees at test time, improves the robot performance by more than 50 percent, both in terms of success rate and in terms of the time it takes to complete a test scenario. Cool. So the takeaway is that in-context learning greatly improves the adaptability of the robot, and in turn, this improves its resilience and performance in unseen situations.
There are also limitations and future work here, as with any research, including all the research I presented. In this case, it's not clear what the best way is to ground language to the low-level locomotion policies. And in many cases, we might not want to use the language abstraction as the way to retry and to connect with vision language models. So there might be interesting ways to expand on that.
For this last part, we found that incorporating data and information from test time can make up for a lack of representative training data. All of the examples I covered in this talk are cases where more data is out there: it either already exists or is pretty easy to get.
We just need algorithms that can leverage things like natural supervision, pre-trained models, and test time data in order to effectively handle these new situations or these situations that aren't covered well by the training data. Now, I also mentioned that along these three directions, there's also, I think, exciting directions for future work. I talked about one way to leverage cheap natural language supervision.
But I think that in the future, maybe we can operationalize entirely new learning regimes that leverage natural supervision in a general purpose way. Moreover, I showed how we can connect pre-trained models with downstream tasks by making the downstream task look more like the pre-training problem. But maybe in the future, we could actually change pre-training in a way that makes it easier to connect with all sorts of downstream tasks. Then lastly, I showed how we can adapt at test time in a robotic scenario to make up for lack of representative training data.
But there are all sorts of examples and applications in machine learning where we're interfacing, at the end of the day, with a human or with some other environment. Can we also allow machines in non-robotics settings to adapt on the fly and retry when they're interacting with a person or with some other environment, like a web environment?
Great. Then the last thing I'll also mention is that I discussed a number of different creative ideas for leveraging different sources of data and different sources of supervision. There's also this question of what if we also have broader training data? I think that all of these are quite interesting even when you have broader training data. We've seen from the regime of large language models that there's a lot of things that are quite exciting to try and do when you also have a large training dataset.
In the context of robotics, we've also been starting to study this problem.
Back in March of this year, I co-founded a company to help actually try to see what happens when you do try to scale up data and models in the context of robotics to try to tackle a broad range of real-world use cases and robot platforms. Some initial results are here where we find that we can do actually pretty cool tasks even with data that was collected since March of this year. Then the last thing that I'll mention is that
I talked a lot about finding new forms of data like natural supervision or data at test time. These are things that are actually quite widely applicable and make the overall problem easier. But a lot of our machine learning benchmarks actually aren't necessarily designed for these kinds of ideas or these kinds of algorithms that leverage different forms of supervision or data. It may actually be the case that in some scenarios, benchmarks might actually be harder than the problems that they're trying to represent.
because they don't necessarily allow for you to use other forms of supervision or data. Perhaps by understanding the context surrounding different real applications that we're trying to study, we might find new and interesting ways to find data or new and interesting problem settings and also make more progress as a whole.
Great. I'll leave you with that. I'd like to mention that all the work that I presented was done with a really fantastic set of collaborators. I'd especially like to highlight the students that led the work that I presented. Yunho led the Clarify work, Lucy led the Yay! Robot work, Annie, Alec, Andy, and Govind led the test time adaptation work, and Mujin, Carl, and Sid led the Open VLA project, and happy to take questions.
The last thing we want to highlight in this epic seven-hour coverage of ICML 2024 is the new position paper track that encourages researchers to step back from individual papers to make arguments relevant to their entire field. Here is Younghyo Park arguing that automatic environment shaping is the next frontier in RL, which we think has been the implicit argument we have been developing through the papers and talks we have been exploring this episode.
Hello everyone, thank you for being here. My name is Younghyo Park, and I'm excited to present our position: Automatic Environment Shaping is the Next Frontier in RL. This is joint work with my colleagues Gabe and Pulkit Agrawal from the Improbable AI group at MIT. To give you some context before we start, Gabe and I both come from a robotics background.
And as a grad student working on robotics, I have always dreamed about a magical box that can automatically create a robot controller for me, simply by specifying the robot, environment, and task I want. I call this magical box an Automatic Behavior Generator. And before I move on, I want to emphasize the word automatic here: it means this box should be powered only by time and compute, not by human effort.
This magical box, if realized, will serve as a core tool enabling robots to autonomously generate behaviors on the fly even after its deployment to people's houses. But I want to ask you all, do you think we're being a bit overly ambitious? Is our dream, this magical box, too good to be true? Well, if you think about it, this is what reinforcement learning is promising us in some sense.
Reinforcement learning, in theory, is a general-purpose, automated, optimal control solver that can produce working controllers for any MDP setting. However, from the practical viewpoint of someone trying to use RL as a tool to train robots, this claim is not necessarily true. Although RL itself does not require human effort during its training process, we want to point out that there is a very heuristic, labor-intensive process required to make RL work in practice,
and that is what we call environment shaping. When an RL algorithm fails to find a solution in practical scenarios, between the choice of fixing the RL algorithm and shaping the environment to make it work, practitioners typically tend to choose the latter. The core problem of such practice is that it heavily relies on human effort. Domain knowledge for the task, intuition, and sometimes a bit of luck is crucial to get things right.
A very well-studied example of environment shaping that you might already know about is the reward shaping problem. We all know that RL agents love to hack the reward when they can, so engineers typically go through the process of shaping the reward to prevent it. In fact, I would say this is the biggest reason why some people in our community hate RL so much, and I completely understand: the process of reward shaping is definitely not fun to do.
Unfortunately, I want to point out today that reward is not the only thing we usually shape. Robotics engineers carefully shape nearly every component of the environment to make RL work in practice. And again, the only optimizer currently known to work best for this problem is graduate student descent, a process relying entirely on human effort. All that being said, what am I arguing today?
First, I argue that the community should start prioritizing research to automate the heuristic process of environment shaping. At the same time, we also need better RL algorithms that don't require heuristic environment shaping in the first place. And to do that, I argue that we should be benchmarking our RL algorithms on unshaped environments, without any task-specific heuristics included.
To better back up our argument, from now on I'll give you some examples of the heavy heuristics involved in popular robotics RL environments and show how crucial they are to make RL work. As an exemplary environment to analyze, we chose IsaacGymEnvs, one of the modern benchmark suites containing diverse robotics tasks. Let's first talk about action space shaping.
In the context of robotics, action space shaping is the process of choosing how to convert the action predicted by the policy into an actual command that can be sent to the motors. An unshaped action space thus looks very simple: we just let the policy directly predict feasible motor commands. However, most RL environments apply a bunch of task-specific heuristics to shape the policy outputs before they get passed to the motors.
The example code you just saw, for instance, applies various scaling, clamping, and moving-average filters, and a PD controller at the end, to finally convert the policy outputs into motor commands. The problem with this kind of shaping process is that it not only is very task-specific, it also introduces a bunch of extra knobs and hyperparameters to tune.
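To make the contrast concrete, here is an illustrative sketch of a shaped action pipeline (clamp, scale, moving-average filter, PD controller around a default pose) versus an unshaped one where the policy predicts torques directly. The gains and constants are made up, not taken from IsaacGymEnvs.

```python
import numpy as np

class ShapedActionSpace:
    """Typical task-specific shaping between policy output and motor command (illustrative)."""

    def __init__(self, default_pos, scale=0.5, clip=1.0, ema=0.8, kp=40.0, kd=2.0):
        self.default_pos, self.scale, self.clip, self.ema = default_pos, scale, clip, ema
        self.kp, self.kd = kp, kd
        self.filtered = np.zeros_like(default_pos)

    def to_torque(self, policy_output, joint_pos, joint_vel):
        a = np.clip(policy_output, -self.clip, self.clip) * self.scale   # clamp + scale
        self.filtered = self.ema * self.filtered + (1 - self.ema) * a    # moving-average filter
        target = self.default_pos + self.filtered                        # residual around a default pose
        return self.kp * (target - joint_pos) - self.kd * joint_vel      # PD controller -> torques

def unshaped_to_torque(policy_output, torque_limit=30.0):
    # Unshaped: the policy directly predicts (bounded) motor torques.
    return np.clip(policy_output, -torque_limit, torque_limit)

shaper = ShapedActionSpace(default_pos=np.zeros(7))
torque = shaper.to_torque(np.random.uniform(-1, 1, 7), joint_pos=np.zeros(7), joint_vel=np.zeros(7))
```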
Unfortunately, this action space shaping is a necessary evil for RL algorithms. We have tested that PPO, for instance, completely fails to solve these tasks if we remove such shaping. And our findings are similar for observation space as well. Observation space shaping is basically a feature engineering problem, selecting the relevant states from what's available from the simulation to create an observation for the policy.
For instance, for the task of opening a door using a manipulator, an unshaped observation space would be a simple concatenation of every raw simulation state that is available. However, typical RL environments go far beyond this simple concatenation. They introduce multiple hand-engineered, task-specific terms, and they often convert certain states with unique properties, like rotations, into a different representation that is known to be easier for neural networks to process.
And such processing is also crucial to make RL algorithms work in practice; we can break RL just by removing those handcrafted terms from the observation. Although I'm skipping the other examples of environment shaping due to time constraints, you can take a look at our paper for more comprehensive examples. Now that we've covered the details of environment shaping and how it affects RL performance, let's talk about how we can automate this shaping process.
Automating environment shaping is a challenging problem for many reasons. One of the major problems is that there is no compact way of parameterizing the vastly diverse ways of doing environment shaping. If we assume a fixed functional form for everything, we can try extracting the coefficients and do some classical hyperparameter optimization on top of it. But this is a very limiting way of representing these shaping functions. Therefore, people have recently started to think about a more flexible way of representing these shaping operators.
One of them is to use Python code itself as a way of representing these functions. This allows us to view environment shaping as a code optimization problem using large language models. The paper Eureka is a good example of using large language models as a sampling-based optimizer to automate the reward-shaping process.
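A loose sketch of that Eureka-style loop: an LLM proposes candidate reward code, each candidate is scored by a short RL run, and the best result is fed back into the next prompt. Both `llm_sample` and `train_and_evaluate` are hypothetical stand-ins for an LLM API call and an RL training job.

```python
def llm_sample(prompt: str) -> str:
    ...   # hypothetical stand-in for an LLM call that returns candidate reward-function code

def train_and_evaluate(reward_code: str) -> float:
    ...   # hypothetical stand-in: run a short RL training job and return a task score

def optimize_reward(task_description, env_source, iterations=5, samples_per_iter=8):
    best_code, best_score, feedback = None, float("-inf"), ""
    for _ in range(iterations):
        prompt = (f"Task: {task_description}\nEnvironment source:\n{env_source}\n{feedback}\n"
                  "Write a Python reward function `def compute_reward(state): ...`")
        candidates = [llm_sample(prompt) for _ in range(samples_per_iter)]
        scored = [(train_and_evaluate(code), code) for code in candidates]   # one short RL run each
        score, code = max(scored, key=lambda sc: sc[0])
        if score > best_score:
            best_score, best_code = score, code
        feedback = f"The best candidate so far scored {best_score:.2f}; improve on it:\n{best_code}"
    return best_code
```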
So we conducted some experiments to see whether the proposed automation method using LLMs can be extended to other shaping components. And as you can see here, models like GPT-4 were able to successfully shape the action and observation spaces with performance similar to humans. However, interestingly, when we asked GPT to shape multiple components jointly at the same time, the performance dropped dramatically.
And this can be a critical problem, since our experimental findings suggest that optimizing individual components one by one in a sequential manner often leads to locally optimal performance. All that being said, I believe we still have a long way to go to fully automate the process of environment shaping. Now that we have discussed all the aspects of environment shaping, let's discuss the path forward. Recall that I was advocating for research focused on either automating environment shaping or developing better RL algorithms.
To support both directions of research, we have created a codebase containing a collection of unshaped robotics environments that people can test their RL algorithms on, with nice little APIs and tools to facilitate research on automating environment shaping. And before I wrap up my talk, I want to discuss possible counterarguments people might have against our position.
Going back to the beginning of my talk, I shared the dream I have: creating a magical box that can automatically generate closed-loop controllers for robots. And I implied that reinforcement learning will be powering this magical box in the future. However, I think some people might disagree with this, especially considering the resurging popularity of manual data collection and imitation learning.
Some people might think that our dream of a magical box will be realized not by automating RL, but by training some huge foundation model that consumes all the datasets collected by all these companies. However, I still believe in the power of RL as a tool to generate robust, generalizable, and especially superhuman behaviors that cannot easily be achieved with imitation learning.
The behaviors generated by RL pipelines can also be used to train those foundation models. Therefore, I argue that making RL easier to use will enable a virtuous data cycle for training better embodied intelligence. And with that, I would like to wrap up my talk today, and I'm happy to engage in exciting discussions about our position. Thank you.
And that's a wrap for ICML 2024 Part 1: our coverage of generative video world sims, diffusion, vision, reinforcement learning, and robotics. We're busy preparing for Latent Space Live at NeurIPS 2024 in Vancouver, so grab your tickets at lu.ma/lslive and see you there.