
Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1

2024/12/10

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0


People

Ben Poole

Jack

An individual associated with Ramsey Network or Ramsey Solutions; no further details are available.

Li Junyu

Sander Dieleman

William (Bill) Peebles

Topics

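William (Bill) Peebles: Sora is OpenAI's first video generation model, capable of generating 1080p video up to a minute long. It is based on diffusion transformers and uses a VAE to encode diverse visual data into a unified latent space for training. Sora can generate photorealistic and non-photorealistic styles, handle scene transitions, model complex scenes, and maintain character consistency. The keys to Sora's success are a unified visual representation and the scalability of diffusion transformers. Sora also has zero-shot editing capabilities: it can re-render videos from text prompts and blend between different videos. It also uses video recaptioning. Sora's long-term goal is to become a world simulator that learns how people interact, perform tasks, and think, and that simulates a wide variety of complex scenes. Sora still has problems today, such as a weak understanding of basic interactions, but these are expected to improve as the model scales.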

Deep Dive

Key Insights

What is the key capability of Sora that allows it to understand object permanence in video generation?

Sora is capable of understanding object permanence and maintaining the presence of objects in the scene even when they are occluded. This is achieved through large-scale diffusion training, which allows the model to learn the underlying geometry and interactions in complex scenes without any hard-coded inductive biases.

How does Genie differ from other text-to-video models in terms of user interaction?

Genie differs from other text-to-video models by allowing frame-by-frame interaction. While other models generate entire video clips based on a text prompt, Genie enables users to take sequential actions within the generated environment, making it a foundational world model.

What is the main challenge in evaluating video generation models, according to Tali Dekel?

The main challenge in evaluating video generation models is the difficulty in quickly glancing at and assessing the quality of moving content. Unlike images, which can be easily evaluated in a grid, videos require more detailed and time-consuming individual assessments.

What is the significance of DeCAF in the history of computer vision?

DeCAF, or Deep Convolutional Activation Feature, was a foundational model in computer vision that democratized access to deep learning techniques. It demonstrated the effectiveness of pre-trained models for a wide range of tasks and was one of the first to show how deep learned representations generalize beyond their training data.

How does the VQ-BeT model leverage large language models for behavior generation?

The VQ-BeT model uses a vector-quantized variational autoencoder (VQ-VAE) to quantize continuous action data into a discrete representation, which is then used as tokens in a large language model (LLM) framework. This allows the model to predict and generate behaviors based on current observations and high-level task descriptions.
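The quantization step described in this answer can be sketched as a nearest-neighbor lookup. This is a minimal illustration under stated assumptions, not the VQ-BeT implementation: the codebook below is a hand-written, hypothetical one, whereas a real VQ-VAE learns its codebook from data, and the function names are made up for this example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def nearest_code(action: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to `action` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(action, codebook[i]))

def tokenize_actions(actions: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each continuous action to a discrete token id."""
    return [nearest_code(a, codebook) for a in actions]

def detokenize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Map token ids back to their (quantized) continuous actions."""
    return [codebook[t] for t in tokens]

# Hypothetical 2-D action codebook; a trained model would learn these entries.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

actions = [(0.1, -0.05), (0.9, 0.2), (0.2, 0.8)]
tokens = tokenize_actions(actions, CODEBOOK)
print(tokens)
print(detokenize(tokens, CODEBOOK))
```

Once actions are integers, a sequence model can predict them the same way it predicts any other token, which is the point the answer above is making.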

What is the core idea behind Chelsea Finn's Yell at Your Robot (YAY Robot) approach?

The core idea behind YAY Robot is to use high-level language feedback to improve a robot's hierarchical policy. By providing verbal corrections, the high-level policy can be fine-tuned to correct mistakes and learn new strategies, significantly improving the robot's performance on long-horizon tasks without the need for extensive labeled data.

What is the main argument of the position paper 'Automatic Environment Shaping is the Next Frontier in RL'?

The position paper argues that the reinforcement learning (RL) community should prioritize research on automating the heuristic process of environment shaping. This includes developing better RL algorithms that don't require manual shaping and creating benchmarks on unshaped environments to facilitate this research.

How does VideoPoet differ from Sora in its approach to video generation?

VideoPoet uses a large language model (LLM) architecture to generate videos, while Sora is based on diffusion models. VideoPoet is more modular, supporting tasks like text-to-video, image-to-video, and video-to-audio, and operates in a latent space to improve efficiency and flexibility.

What is the role of flow matching in the VQ-BeT model for behavior generation?

Flow matching in the VQ-BeT model ensures that the predicted actions are consistent with the observed data. By using a quantized representation of actions, the model can learn to predict the most likely future states and actions, making it more robust and data-efficient.

What are the key components of the Genie model and how do they enable controllability?

Genie consists of a video tokenizer, a latent action model, and a dynamics model. The video tokenizer converts video frames into discrete tokens, the latent action model predicts changes between frames, and the dynamics model generates future frames based on these tokens and actions, enabling frame-by-frame controllability.
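To make the frame-by-frame control loop concrete, here is a toy sketch of how the three components fit together. Everything in it is a hypothetical stand-in: frames are short lists of integers instead of tokenized images, and `toy_latent_action` and `toy_dynamics` are trivial arithmetic functions, not learned models. It only illustrates the control flow, assuming the user picks one latent action per frame.

```python
from typing import Callable, List

Tokens = List[int]   # one frame as discrete tokens (what a video tokenizer would output)
VOCAB = 8            # hypothetical token vocabulary size

def toy_latent_action(prev: Tokens, nxt: Tokens) -> int:
    """Stand-in latent action model: summarize the change between two frames as one id."""
    return (nxt[0] - prev[0]) % VOCAB

def toy_dynamics(history: List[Tokens], action: int) -> Tokens:
    """Stand-in dynamics model: produce the next frame from past frames plus an action."""
    return [(tok + action) % VOCAB for tok in history[-1]]

def play(first_frame: Tokens, actions: List[int],
         dynamics: Callable[[List[Tokens], int], Tokens]) -> List[Tokens]:
    """Generate one new frame per user action, feeding each result back as context."""
    frames = [first_frame]
    for action in actions:
        frames.append(dynamics(frames, action))
    return frames

frames = play([0, 1, 2], [1, 1, 2], toy_dynamics)
print(frames)
print(toy_latent_action(frames[2], frames[3]))
```

The contrast with a plain text-to-video model is in `play`: the next frame is generated only after the next action is known, rather than the whole clip being produced from one prompt.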

Shownotes Transcript

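<context>Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics: ICML 2024 Part 1

Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track: friend of the pod Nathan Lambert, who will be recapping 2024 in reasoning models like o1! We have opened up some late bird tickets for those who are deciding now; use code DISCORDGANG if you need it. See you in Vancouver!

We have been sitting on our ICML recordings for a while (from today's first-ever SOLO guest co-host, Brittany Walker), and in light of Sora Turbo's launch (blog post, tutorials) today, we figured it was a good time to release Part 1, which dives deep into the state of generative video world simulation, transitions smoothly into vision (the opposite modality), and ends with robots (their ultimate application).

Sora, Genie, and the field of Generative Video World Simulators

Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which opens our episode:
* William (Bill) Peebles - SORA (slides)

A common question about Sora is how much inductive bias was introduced to achieve these results. Bill references the same principle put forward by Hyung Won Chung of the o1 team: "sooner or later those biases come back to bite you."

We also recommend these 2024 reads on Sora:
* Lilian Weng's literature review of video diffusion models
* The Sora API leak
* Estimates that 100k-700k H100s are needed to serve Sora (not Turbo)
* Artist guides to using Sora for professional storytelling

Google DeepMind had a very strong showing on video generation models at ICML, winning two Best Paper awards:
* Genie: Generative Interactive Environments (covered in the oral, poster, and workshop)
* VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website)

We close this part with Tali Dekel's talk, The Future of Video Generation: Beyond Data and Scale.

Part 2: Generative Modeling and Diffusion

Since 2023, the perspective of Sander Dieleman, who works on Imagen and Veo, on diffusion as "spectral autoregression in the frequency domain" (blog post, tweet) has caught the public imagination, so we highlight his talk:
* Wading through the noise: an intuitive look at diffusion models

Then we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion.

Then we look at two flow matching papers: one from Flow Matching co-author Ricky T. Q. Chen (FAIR, Meta), and one on how it was implemented in Stable Diffusion 3, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.

Our last hit on diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast:
* NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
* Speech Self-Supervised Learning Using Diffusion Model Synthetic Data

Part 3: Vision

The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called "the OG vision foundation model". Lucas Beyer's talk, "Vision in the age of LLMs - a data-centric perspective", was also well received online; he talked about his journey from Vision Transformers to PaliGemma. We give a special mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.

Part 4: Reinforcement Learning and Robotics

We segue from vision into robotics with the help of Ashley Edwards, whose work on the Gato and Genie teams at DeepMind is summarized in Learning actions, policies, rewards, and environments from videos alone.

Brittany highlighted two poster session papers:
* Behavior Generation with Latent Actions
* We also recommend Lerrel Pinto's On Building General-Purpose Robots
* PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

However, we must give the lion's share of space to Chelsea Finn, now founder of Physical Intelligence, who gave four talks on:
* "What robots have taught me about machine learning"
* Developing robot generalists
* Robots that adapt autonomously
* How to give feedback to your language model
* Special mention to PI colleague Sergey Levine on robotic foundation models

We end the podcast with a position paper that links generative environments with RL and robotics: Automatic Environment Shaping is the Next Frontier in RL.

Timestamps
* [00:00:00] Intros
* [00:02:43] Sora - Bill Peebles
* [00:44:52] Genie: Generative Interactive Environments
* [01:00:17] Genie interview
* [01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation
* [01:30:51] VideoPoet interview - Dan Kondratyuk
* [01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale
* [02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models
* [03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors
* [03:30:30] Ricky Chen - Flow Matching
* [04:00:03] Patrick Esser - Stable Diffusion 3
* [04:14:30] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
* [04:27:00] Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
* [04:39:00] ICML Test of Time winner: DeCAF
* [05:03:40] Lucas Beyer: "Vision in the age of LLMs - a data-centric perspective"
* [05:42:00] Ashley Edwards: Learning actions, policies, rewards, and environments from videos alone
* [06:03:30] Behavior Generation with Latent Actions interview
* [06:09:52] Chelsea Finn: "What robots have taught me about machine learning"
* [06:56:00] Position: Automatic Environment Shaping is the Next Frontier in RL

Get full access to Latent Space at www.latent.space/subscribe</context> <raw_text>0 Welcome to the Latent Space coverage of ICML 2024. This is Charlie, your AI co-host. We know it's been a few months since ICML actually happened, but now that all the talks are available online and we are in final preparations for NeurIPS 2024, we figured this was a good time to release our conference recap to get you in the mood.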

As a side note, regular tickets are now sold out for Latent Space Live at NeurIPS, where we have announced our dream speakers to recap the best of 2024 across the top voted domains in Vision, Open Models, Post Transformers, Synthetic Data, Small Models, Agents, GPU Scaling, and a special 2024 in AI keynote from our friend and fellow podcaster, Sarah Guo of Conviction Capital.

Today, we are announcing our very last speaker and newest track: friend of the pod Nathan Lambert, who will be recapping 2024 in reasoning models like OpenAI's o1. See you in Vancouver. Coming back to ICML, this is a very special episode in more ways than one because it is the very first episode not hosted by swyx or Alessio. We are continuing to experiment with guest hosts, adding different opinions and voices to the show.

And in this case, to cover conferences we personally weren't able to attend. So we're very grateful to our friend Brittany Walker of CRV for stepping in as your guest co-host for ICML 2024. Our goal with these conference recaps is to give you an audio experience of what it's like to be there and to provide a filtered recommendation of papers and backstories of authors that will be useful for the AI engineer today and tomorrow.

Brittany worked enormously hard to put together the poster chats you will hear, and we're very grateful. Given that OpenAI has launched Sora Turbo today, we have bumped up our planned second episode to release first, since generative video happened to be a huge focus at ICML. Let's not bury the lede and go straight into the Sora talk from Bill Peebles, first author of the Diffusion Transformers paper and research scientist leading Sora model development.

Since we're talking about video models, you may wish to tap into the show notes for direct links to the public talks. However, we think there is still value in editing the audio for eyes-free browsing. We believe this is the most recent public academic discussion of Sora before the Sora Turbo public release today. So we hope this episode is valuable background for anyone getting up to speed on video diffusion. Watch out and take care.

I'm Bill and thanks a lot to Joanna for organizing this conference. Really excited to be giving a talk here. So I'm going to be talking about Sora today. So this is Video Generation Models as World Simulators. This was joint work with my good friend Tim Brooks and also some other wonderful colleagues at OpenAI.

So let's dive right in. So Sora is OpenAI's first video generation model. And in advance, I'm sorry for any kind of like FPS delay with screen sharing videos. It's always like the hardest part about working with videos is like showing results to other people over the internet. But this is a sample from Sora. And the text prompt is a stylish woman walks down a Tokyo street filled with warm glowing neon. You can see the rest of it. Sora is capable of generating 1080p video up to a minute long.

And what's remarkable about Sora is that all of the simple things we take for granted about the visual world, it really begins to pick up on when you train on video data at scale. So if you see that blue sign in the background, even when there's a shot change and it's occluded, it's maintained. And we see this very consistently for a large number of samples from Sora. So it really has a good understanding not only of, for example, how light interacts within scenes in complicated ways, but of object permanence and lots of other capabilities that have been very difficult for video generation models to grok in the past.

So of course it can do more than just photorealistic style. So this prompt is a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures. So Sora again is capable of generating non-photorealistic styles. It can also do a number of scene transitions. So we didn't stitch these video samples together. This is all one continuous output from Sora. It's capable of figuring out that if you want a scene with like a variety of sea life, maybe there should be a shot of seahorses, turtles, et cetera.

And it's also capable of modeling complex scenes. So this prompt is: beautiful, snowy Tokyo city is bustling. And so there's a large number of people in the scene. You can see the camera is flying through. And while it's doing that, it's able to have interactions between people, like this couple holding hands. There are people selling goods at the stalls. There are sakura petals flying through the air. So Sora has really begun to pick up on the intricacies of how scenes should look and does a great job at rendering them.

One final example here is a movie trailer featuring the adventures of the 30-year-old spaceman. So what's cool about this is Sora kind of zero-shot learns that you should have character consistency throughout a number of scene transitions. So, you know, movie trailers do not normally change the leading actor halfway through. And so the man is the same across these different environments and different scenes. And all of this is just learned automatically by training on video data at scale.

So now I want to go into a few technical details about Sora. A lot of the inspiration for Sora came from language models, and in particular, this notion of a unified representation of text data.

One of the key ingredients to the success of LLMs over the years has been this idea that you could take stories, you can take code, you can take math. But at the end of the day, all of this information is represented with a unified vocabulary being tokens, which makes it very easy to train on data at scale. This imbues language models with very generalist capabilities and makes them polymaths at a number of tasks.

Now, we were really thinking like what the analog of this would be for visual data. And, you know, in particular, you know, there's no shortage of very diverse sources of visual data in the world. You know, there's vertical video, there's square images out there. You have like every kind of data of different durations, of different resolutions, of different aspect ratios.

And the question is, how can you train on all of that in a unified representation so we don't have to throw away any visual data? And so this is really one of the key ingredients for the success of Sora: coming up with this unified notion of a visual representation on which we can train on, you know, internet-scale visual data. And so in order to accomplish this, we use a VAE kind of inspired by latent diffusion models from Robin Rombach. And what we do with this is encode all of this information into one unified latent space. So the idea here is on the far left, you know, we have like a video of a butterfly swimming underwater. You go through this visual encoder and this will compress videos both spatially and temporally into a single sequence of data.

And at the end of the day, we do this, of course, so we can train transformers on this sequence of data. We train diffusion transformers at scale. And the benefit of this is we get a number of just great properties of scaling transformers up specifically for video and image data.
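One way to picture "a single sequence of data" is spacetime patchification: cut the compressed latent video into small blocks along time, height, and width, and flatten each block into one token. The sketch below is an assumption-laden illustration, not Sora's actual pipeline. The talk does not give patch sizes or layout, the latent here has a single channel, and `make_latent` just fills in dummy numbers where a learned VAE encoder would produce real values.

```python
from typing import List

Video = List[List[List[float]]]  # latent indexed as [time][row][col], one channel

def patchify(latent: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Split a latent video into (pt x ph x pw) spacetime patches, one token per patch."""
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "patch size must divide the latent"
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                tokens.append([latent[t][y][x]
                               for t in range(t0, t0 + pt)
                               for y in range(y0, y0 + ph)
                               for x in range(x0, x0 + pw)])
    return tokens

def make_latent(T: int, H: int, W: int) -> Video:
    """Dummy latent whose values encode their own position (stand-in for a VAE output)."""
    return [[[float(t * H * W + y * W + x) for x in range(W)]
             for y in range(H)] for t in range(T)]

# Different durations and aspect ratios all become plain token sequences.
wide = patchify(make_latent(4, 4, 8), 2, 2, 2)
tall = patchify(make_latent(4, 8, 4), 2, 2, 2)
short_square = patchify(make_latent(2, 4, 4), 2, 2, 2)
print(len(wide), len(tall), len(short_square))
```

The design point is that the transformer only ever sees a sequence of equally sized tokens; the sequence simply gets longer or shorter with the clip's duration and resolution, so nothing has to be cropped or padded to a fixed shape.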

So, you know, the name of the game here is like, how does visual quality improve as you throw more flops at the problem? And we find that it improves pretty steadily, which is great. So on the far left here, you know, we have a base-compute trained Sora model. So this is trained with a small amount of compute, and you can see it gets some details right. So for example, it kind of has some idea that if a camera is moving through a scene, there should be some notion of consistency, but all the textures are wrong and it's not high fidelity.

If you 4x the amount of training compute you pump into that model, it begins to figure out what dogs look like, what humans look like. But the visuals are still not great. And if you really crank up the amount of flops you're pouring into these things to 32x, you begin to see that it gets a lot of these fine-grained details right: the interaction of the owner's hand with the dog, all of the snowy textures on the ground.

And so we're finding that these models scale extremely effectively if you kind of nail the basics right. So in particular, if you can create this setup where you have this unified representation of visual data and crank up diffusion transformers, they can really start to learn to do amazing things.

Another cool property of Sora is how generalist it is at test time. So, you know, when you actually want to sample content, you can do it at any aspect ratio and resolution.

And this is really great from the perspective of kind of like controllable generations, specifically as it relates to different devices. So if I'm watching a movie on my iPhone and then I transition to watching it on my laptop, those are going to use two totally different aspect ratios. And normally you either have to just like pad with black bars or crop it. But with models like Sora, it's now possible to generate content natively for any device, which is pretty exciting to think about the possibilities of how that can affect content creation in the future. So the sea turtle here is just rendered out with different aspect ratios.

Another exciting aspect of this very generalist training recipe is we can kind of move on from the days of just like cropping data for training generative models. So, you know, back when I was like in grad school, I was always like spending time, you know, cropping to like 256 by 256 resolution to train like whatever version of StyleGAN I was working with.

And while that works well, it has certain downsides. So, you know, there are certain biases actually within data. For example, the photographer's bias of centering objects. And so on the left here, we have a baseline Sora model where we don't train with native-size video and image data. Instead, we actually do this like hard cropping to center. And you can see that the model essentially inherits some weaknesses of this cropping strategy, right? Sometimes like the scuba diver is going to be off center, which isn't actually ideal framing. If you do this native-size training, it's actually much more effective at composing scenes. So you inherit some nice benefits of the training data in the model by just, you know, not throwing away pixels and training on everything you have.

So Sora is also an image generation model. So the prompt here is digital art of a young tiger under an apple tree in a matte painting style with gorgeous details. Here's another sample. We find that Sora in particular really excels at photorealistic kinds of content. So there's a lot of details kind of in the woman's face here, which it does a great job at rendering out. This is at 2K by 2K resolution.

And of course, we can interact with Sora in other ways beyond just text. So all the results before were text-to-video or text-to-image samples. But Sora can also accept visual inputs as conditioning. And so here we were seeding it with an image from DALL·E 3 and then having Sora extend this out in time. So Sora is capable of kind of understanding what's going on in an image and then extrapolating from there.

And so we had a lot of fun with this. So these are DALL·E 2 samples on the left here. And so Sora can take video conditioning or image conditioning at any temporal index. So here, we condition the model in the middle of the sequence with the Shiba Inu. And then we extend it both backwards and forwards in time from that position. And you can see it's able to animate the dog's face. Of course, it can also do more fun animated styles here.

We've been using this to make emojis internally. We have this nice Sora Slack emoji now. And another cool thing with Sora is its ability to extend backwards in time. So of course, you know, whether you're doing like temporal outpainting forward or backwards in time, it's all kind of the same to these models. And so here we have the model end in the same way, which is with this San Francisco logo. But all of the events leading up to it are resampled by the model. So it's very flexible in how you can use this to edit or extend videos.

Another cool aspect of Sora is its zero-shot editing capabilities. So there's been a ton of great work from the academic community over the years on finding creative ways to use diffusion models to do, for example, like image editing tasks. So, you know, one really nice work in that area is SDEdit.

And we find that techniques like this, of course, just work right out of the box with Sora because it's a diffusion model at the end of the day. So these are SDEdit results. So the top left is the source video. This particular source video was generated with Sora, but of course it doesn't have to be. It could also be a real video.

And you can use a variety of different text prompts to re-render this scene automatically. So for example, in the top right, we can re-render the video in a pixel art style. And as you would expect, if that's the edit, and you use the right noise level, you can get it to maintain most of the structure in the scene and only update the style, which is cool. So towards the end of this video, you can see that there's a cave that the car in the top left goes into. And across all of these re-rendered styles, you see that it preserves some notion of a cave or an overhang that the car goes through.
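The role of the noise level in this kind of editing can be sketched in a few lines. This is a minimal illustration, assuming a DDPM-style variance-preserving noising rule; the talk does not say which noise schedule Sora uses. The `placeholder_denoiser` is a hypothetical stand-in for a text-conditioned diffusion sampler, and a short list of floats stands in for a video latent.

```python
import math
import random
from typing import Callable, List

Latent = List[float]  # a flattened latent; real video latents are large tensors

def add_noise(x0: Latent, alpha_bar: float, rng: random.Random) -> Latent:
    """Partially noise x0: alpha_bar=1.0 keeps it intact, alpha_bar=0.0 gives pure noise."""
    keep, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [keep * v + noise * rng.gauss(0.0, 1.0) for v in x0]

def sdedit(x0: Latent, prompt: str, alpha_bar: float,
           denoise: Callable[[Latent, float, str], Latent],
           rng: random.Random) -> Latent:
    """Noise the source to an intermediate level, then denoise it under the new prompt."""
    x_t = add_noise(x0, alpha_bar, rng)
    return denoise(x_t, alpha_bar, prompt)

def placeholder_denoiser(x_t: Latent, alpha_bar: float, prompt: str) -> Latent:
    # Hypothetical stand-in: a real sampler would run the reverse diffusion steps here.
    return list(x_t)

source = [0.5, -1.0, 2.0]
# More of the source survives at alpha_bar=0.9 than at alpha_bar=0.1.
light_edit = sdedit(source, "pixel art style", 0.9, placeholder_denoiser, random.Random(0))
heavy_edit = sdedit(source, "pixel art style", 0.1, placeholder_denoiser, random.Random(0))
print(light_edit)
print(heavy_edit)
```

Choosing `alpha_bar` is the tradeoff described above: keep it high and the scene layout is preserved while only the style moves, push it low and the model is free to change much more of the content.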

Another thing that's cool is Sora is kind of smart about figuring out whether or not certain correlations make sense. So for example, in the bottom right, you can say change the video to a medieval theme. Sora knows there weren't cars in the medieval times, so instead you get a red horse carriage. So it's kind of fun to see where Sora takes liberties in re-rendering your video.

Another cool capability that Sora can do is blend between videos. So the far left and far right videos here define the endpoints of this interpolation. And the middle video is Sora's imagining of how you connect the dots. And so you can see you get these kind of fantastic creatures in this case where you can never quite see where it goes from being a chameleon to a bird. It happens very seamlessly.

And you can use this for all kinds of scenes. They don't even have to be particularly related. So on the far left here, we have a drone flying through the Colosseum. And the far right is the butterfly flying underwater. And you can see that Sora is able to come up with a pretty reasonable interpolation between these two videos. So you gradually see the Colosseum decay and move underwater. And at some point, the drone morphs into the butterfly very suddenly, because it has kind of put these two things into correspondence automatically and infers that this is a reasonable thing it should focus on blending between.

And here's an example of blending two scenes with totally different styles. So the far left is like a photorealistic aerial drone shot. And then the far right video is kind of nice, like gingerbread village. And it comes up with a really creative way to make this work. So rather than kind of morph the whole style of the scene in one shot, it decides that maybe this like gingerbread village is kind of hidden off to the side of this photorealistic town. And it zooms in.

So one other technique that we use for Sora is this notion of video recaptioning. And this is a technique that was actually pioneered by DALL·E 3, by some other folks at OpenAI. And the high-level idea is that during training, diffusion models, and really all generative models, benefit from having a much cleaner source of conditioning than we've historically given them in the past.

There's like very crude text captions out there, like alt text, for example, which doesn't actually contain a lot of information about the scene. They're like very coarse keywords, for example. Sometimes the content is like pretty unrelated actually to what's in your image or video, et cetera. And one of the key breakthroughs in DALL-E 3 was generating synthetic captions that are much more detailed and contain much more mutual information with the content that you actually want to generate.

And so what we saw with DALL·E 3, and this is an example figure from DALL·E 3, is that this really improved the controllability of the model and enabled you to create much more intricate scenes with a lot more ease than in the past. And so we took inspiration with Sora to also apply this technique to video. And one of the features of this is that at test time, when you're actually interacting with the model, rather than just directly passing the prompt to Sora, we'll actually use GPT under the hood to essentially upsample a user's base prompt into a much more detailed video description.

And so this figure here is the system prompt that we used for DALL·E 3 in order to do this upsampling. It's actually pretty involved to get this to work well. And so there's a lot of prompt engineering, even at OpenAI, to get these systems to be reliable. But under the hood, this is what we're doing to achieve some of this finer-grained control of Sora.
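The prompt upsampling step can be sketched as a thin wrapper around a language model call. This is only an illustration under stated assumptions: `SYSTEM_PROMPT` is invented for this example and is not the prompt shown in the talk, `upsample_prompt` is a hypothetical helper, and `fake_llm` is a stand-in so the sketch runs without any API access.

```python
from typing import Callable

# Hypothetical instruction text; the real system prompt is much more involved.
SYSTEM_PROMPT = (
    "Rewrite the user's short video idea as one detailed caption. "
    "Describe the subject, setting, lighting, and camera motion."
)

def upsample_prompt(user_prompt: str, llm: Callable[[str, str], str]) -> str:
    """Expand a short user prompt into a detailed caption; fall back to the original."""
    base = user_prompt.strip()
    detailed = llm(SYSTEM_PROMPT, base).strip()
    return detailed or base

def fake_llm(system: str, user: str) -> str:
    # Stand-in for a real chat-completion call; it just appends fixed details.
    return f"{user}, wide shot, golden-hour lighting, slow push-in"

print(upsample_prompt("  a sea turtle ", fake_llm))
```

The fallback matters in practice: if the language model returns nothing usable, the generator still receives the user's own words rather than an empty caption.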

So the last topic I want to talk about is this notion of emerging simulation capabilities. And this is really the aspect that we are most excited about with Sora looking forward. You know, we often get asked the question at OpenAI, how does video generation really relate to the core mission of AGI? And on the Sora team, we're actually really passionate about this being a model for world simulation moving forward. And so what exactly does that mean? Like, how do we actually use these models long term to do interesting tasks and to really extract intelligence out of the world?

You know, what we believe is that when we really scale up video generation models, they're going to get so good at simulating such a variety of complex scenes with different agents in them that they're going to need to ultimately learn an underlying model of how people interact, of how people do tasks, of how people think, if they're truly generating high-fidelity content. At some point, if the conversation I'm having at a dinner table within Sora is not realistic, that means it's failed to do its job of accurately learning the distribution of human behavior.

And so as we approach the limits of achieving the irreducible loss there, we think pretty amazing things are going to emerge from these models, and it's going to play a really key role in developing more intelligent systems in the future. So Sora is obviously not there today, but we already see some cool phenomena by training on video data at scale that we just want to highlight. And we think this list is only going to grow in the future as Sora continues to scale up.

So the first one I'll talk about is 3D consistency. And so this is pretty clear from a lot of the samples. But even when you have these very dynamic scenes with a lot of people moving in them and the camera being non-stationary, you can see that a large number of elements in the scene really do move with what appears to be accurate geometry. And so this is achieved without any kind of hard-coded inductive biases for 3D within the model. It's all learned jointly end-to-end as part of large-scale diffusion training.

It was really important to us when we were doing this project that whatever solution we came to for video generation was scalable and could just absorb a lot of flops.

And one way to do that right is to really strip out these inductive biases that in the past have sometimes been useful for achieving certain kinds of behaviors at low scale. But it's not clear that when you really crank up the training compute, if they'll either help or hinder you. And so we find that it's totally fine to like not have these kinds of inductive biases as long as you're training at scale. Here's another sample. This one's kind of fun. So it's an aerial view of Yosemite showing both hikers as well as a gorgeous waterfall.

The hikers do some very extreme hiking right here. I would not recommend trying this at home. And Ben Mildenhall, who used to be at Google, took some Sora samples when we released them, and then he trained a NeRF on them. And in his words, it NeRFs. So this is another kind of nice sanity check that the underlying geometry that Sora is learning for some scenes, certainly not all yet, is actually pretty accurate. And so it's cool to see that this, again, just emerges automatically at scale without inductive bias.

So the next capability I want to talk about is this idea of long-range coherence. So this is one of my favorite samples. This is the Bling Zoo shop in New York City. It's both a jewelry store and zoo, saber-toothed tigers with diamond and gold adornments, turtles with glistening emerald shells, et cetera. And so again, this is all one continuous shot from Sora. We didn't stitch it together. And what's cool about this is even when you have these sort of scene transitions, Sora kind of automatically figures out the vibe of what you're going for. So in this case, you get this coherence of the environment you're in. You see this outdoor component at the start of the scene and it gradually moves indoors, but it all creates this kind of coherent narrative, which is awesome, because you don't have to manually stitch everything together. It can kind of just figure it out in context.

Of course, you can also do long-range coherence in the sense of character consistency, as we alluded to earlier. So this is the story of a robot's life in a cyberpunk setting. And you can see you get the same robot character across these different shots. So it really does understand this idea that, you know, if I have a long video with multiple cuts, I'm probably going to have some characters that show up multiple times. It's not going to be an entirely new cast every two seconds. And it just figures this out automatically.

Object permanence is another big one. So in the past, video generation models have really struggled to keep objects in the scene under occlusions. And so this is an example sample where even though this Dalmatian is getting occluded multiple times in the scene, Sora understands that that dog should still be there even when the people pass. And this very simple capability that we take for granted used to be a very challenging problem for video generation systems. But again, you don't necessarily need any object-specific inductive biases to get this to emerge. You just need to have the right basic training recipe to scale these models up.

So another capability that we're excited about is this idea of interacting with the world and updating state.

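So kind of by definition, if you want a useful video generation system, at some point it needs to be able to interact with objects in the scene and have those interactions be meaningful. By that I mean they need to persist over time. So in the simplest case, if I'm drawing or painting some cherry blossom petals, as in this case, I want it to be the case that when I leave brush strokes, they actually interact with the canvas and remain.

And we find that sometimes Sora can do this. This is probably one of the flakiest capabilities the model currently has. But in this case, it does work. Here's another example of an older man eating a burger. And at the end here, there are bite marks on the burger.

So I think this is one of the bigger challenges for video generation systems moving forward: if I did something in the distant past, can the model actually remember that, recall it, and have it affect things in the future? So these are very simple examples, but there's still a long way to go to create what I think are truly compelling examples, where a past conversation or something else affects what the system outputs minutes into the future.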

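The last topic I want to discuss here is this notion of simulating digital worlds. So when people talk about video generation models, there's of course a lot of excitement around the idea that we can learn the physics of the real world. And I think that's extremely valuable and a very important direction.

But what's interesting is that these systems are very general. So we don't need to limit ourselves to learning only the physics of our world. There are all kinds of other crazy worlds, like laptop operating systems or video game consoles, that Sora-like models can also learn from. And you can have one model that is ultimately very general and capable of rendering scenes across all of these different environments.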

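So one step toward this is Minecraft. So the prompt here is Minecraft with the most gorgeous high-res 8K texture pack ever. This is just a straight output from Sora. It's actually not particularly cherry-picked; it's quite easy to get good samples. You can see that Sora is able to implicitly control the player with an understandable, albeit slightly boring, policy, while rendering the full environment, including NPCs like these pigs.

We think this is a very cool, extremely crude proof of concept that Sora can do more than serve creative purposes. It can really model entire environments and, in the future, be used to extract information about policies, which, you know, is all implicit somewhere in the activations and weights.

It's great to see it learn these things automatically by training on video data at scale. So here's another sample using this prompt. It chose a different texture pack for this one. But again, you see the same thing. You know, you've got a chicken and a pig. It's able to control the policy of these characters. And as the character jumps around, it's able to render this environment at pretty high fidelity.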

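So we're very excited to see that you can pack all of this knowledge into this model, not just real-world physics. Of course, Sora has a lot of problems. So it's still far from the ultimate goal of simulating everything. These failure cases are fun, though. So everything about this scene is a little bit off. The woman looks too happy. The hands in the background are kind of creepy. The candles blow the wrong way.

Here's another example, where a cup spontaneously jumps into the air and breaks in a very unrealistic way. So even basic interactions like glass shattering are something Sora doesn't really understand yet, and there's a long way to go. I think this is the favorite failure case of most people on the team. So the prompt is archaeologists discover a plastic chair.

But this plastic chair is kind of sentient, starts flying, and looks a bit possessed. So it's always fun when you have these models, you know, at a point on your scaling curve where they haven't been fully pushed to the limit. It's always fun to see which correlations about our world they don't yet understand, and where they take some creative liberties. And the mistakes in this one are obvious. So, yeah.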

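Sora is currently in the research stage, and we haven't productized it yet.

We're working with red teamers and artists to really get a handle on what the potential risks of a model like Sora are, if it were to be deployed one day. And at the same time, how do we make it as useful as possible, both for existing artistic workflows and for potentially entirely new ones? So here's a quote from Shy Kids, who we gave access to Sora.

As great as Sora is at generating things that appear real, what excites us is its ability to make things that are totally surreal. So we really love this idea that Sora isn't replacing elements of the artistic workflow, but is actually enabling entirely new processes that weren't possible before. So I'm going to play this Shy Kids video now.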

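I'm not sure whether you'll be able to hear the audio, but if not, you can find this video online. Just search for Shy Kids Sora.

They say everyone has something unique that sets them apart. In my case, you know, it's pretty obvious. I am literally filled with hot air. Yeah, living like this has its challenges. Windy days are particularly troublesome. There was this one time my girlfriend insisted I go to the cactus store to get my uncle Jerry a wedding present. Yeah.

What do I love most about my predicament? The perspective it gives me. You know, I get to see the world differently. I float above the mundane and the ordinary. I see things differently from everyone else. And yet I feel that, because of this perspective, I'm reminded every day that life is fragile. We're all just a pinprick away from deflation. So I try to live life lightly, floating along and enjoying it. I've got a lot of ideas, staying true to instinct. Hopefully, with some luck, I can find a way to share them with everyone else.

So this video was made using a combination of direct model outputs and, of course, more traditional video editing workflows. It's really cool to see how artists have embraced Sora and started integrating it into their work. There were actually also some films shown at Tribeca that were made in a similar way using Sora. I'm really amazed by the creativity on display given Sora's current level of capability. It's really cool to see the community getting involved and using these models. With that said, that's pretty much the end of the talk. I have a few extra samples here. But thank you, Joanna, for organizing this, and I'm happy to take any questions. I don't know whether we can do that over Zoom right now, but if not, that's about it, so thank you very much.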

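Thank you, Bill, for this wonderful talk. I think, yes, we definitely can. If you can hear me, then we can do Q&A.

I can hear you, so I think we're good.

Hi, thank you for the great talk. I'm wondering how far we are from having a video producer make an entire movie with zero actors. Suppose the producer could upload the characters and their appearance, describe the scenes, and tell you, oh, this character is now running away, or riding a bicycle, and so on. Could they really make a full movie with no actors?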

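Yeah, great question. I think there's a technical answer and a cultural answer. On the technical side, I don't see any blocker to getting character consistency to work over very long time horizons. That seems like a very achievable problem. So I think in the near term, if people want to do this, it will be possible to create these synthetic characters and use them as needed. Now, I don't know whether people will actually want to do that in the near term. We've talked with many directors, for example, and a lot of them mention using our current capabilities for very simple scenes, such as a big crowd in the background, which in the past would probably have been done with CGI.

But for those really complex and meaningful close-up shots, where you're trying to build a deeper emotional connection with the audience, it seems that human actors do have an edge over today's Sora model, at least for the near future. So I suspect there will be some kind of hybrid collaboration going forward.

But I'm curious to see when and where people choose to use fully digital characters versus traditional actors.

Hi, I have two questions. The first is, how much of a role did synthetic data play in training?

Unfortunately, I can't answer any questions about training data. So yeah, sorry.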

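OK, then the second question is, how much control does the user have over camera angles and trajectories? Is it just prompting, or can you actually define a full trajectory? What's possible?

Good question. Right now, the only ways you can define camera motion are through text or video conditioning. In the latter case, that means if you have a generated video where the camera is already moving the way you want and you extend from there, the model can infer the right camera motion through in-context learning.

At the moment, there's no finer-grained way to control the camera. I think it's something we've definitely heard people want, so it will be interesting to explore alternative ways of controlling these features more explicitly. But for now, it's mainly through text.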

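Hey, so we've seen all these beautiful visual outputs. Can you share something about audio and consistency, maybe what you've observed, if anything?

Yeah, that's a good question. With Sora, we really focused on pushing the boundary of visual generation quality, and we didn't focus on, for example, jointly generating audio. I think getting extremely high-fidelity joint video and audio generation is a very interesting direction, but it's not something we currently have in Sora. I think making these models more controllable in the future, and potentially giving users all the modalities they want, is definitely an interesting direction.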

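Hi, Sora can also generate images. Do you think that in the future video generation models will be more powerful than current text-to-image models, and that we'll basically stop training on images alone?

Yeah, I think so. Part of the reason is that there's a lot of information about the world that you can, to some extent, infer if you train on a huge image dataset. But I think there are still things that slip through the cracks and that you can only get by training on video data. For example, the model can generate accurate fly-throughs of a scene and really understand occlusion. My guess is that this actually helps image generation too, with understanding that, you know, some fingers on a hand may be occluded by an object, but that doesn't mean humans usually have only two fingers. It just means there's a physical interaction here, one that you don't necessarily get from training only on image data, or at least don't get efficiently.

You get these concepts more efficiently from co-training on video. So yes, I suspect that in the future video generation models will generally surpass image generation models.

Is that already the case with Sora, or not yet?

Right now we haven't productized any of Sora's capabilities, including image generation. So today, if you go to ChatGPT or elsewhere, it's using DALL·E 3 behind the scenes for text-to-image.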

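Thank you. I wonder if you could tell us a bit about the size of the model, for example the maximum duration, resolution, or number of pixels it can generate, something like that?

Good question. Unfortunately, I can't comment on that. Sorry.

Or even just something closer to an order of magnitude. For example, can we expect users to have a good model of their own in the future, or is this going to be something only big clusters and big companies can afford?

Yeah, that's a good question. I mean, I wouldn't be particularly surprised if the development of video generation models ends up looking very similar to that of language models. So there will be models at all sorts of capability levels and sizes, much like the ecosystem we have now, where there are open-source models that, at least historically, have tended to be a bit weaker than the big closed-source models.

But I'm curious how the whole ecosystem develops. That's just a guess.

OK, thank you.

Hi, thanks for the talk. I'm curious how you think about more complex controls, like bullet time and so on.

Yeah, good question.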

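One thing we've heard from talking with directors and artists is that they have a very specific language for describing certain kinds of shots and camera motion, and Sora, out of the box, isn't very good at speaking that language. So a lot of what we think about in terms of improving user interaction with the model is, can we train the model to use that same language? In some sense, that makes it a captioning problem. But I think the jury is still out on the best way to make these models controllable. Is it through text alone, or are there other kinds of input? I think it's a very interesting area, and we're still only beginning to explore it.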

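Thank you. One more quick question, about the interaction between the character and the burger. Is there a way to let the user interact more, like hitting the burger so that it gets squashed? So, yeah, I think you get the idea.

Yeah, yeah. That would be cool. I don't know whether Sora can do that kind of more complex interaction today. I don't think there's any fundamental reason why it shouldn't be possible, or why scaling these models further shouldn't enable that capability. Even the burger-biting phenomenon took a while to emerge over the course of Sora research, and it seems to require at least a fair amount of compute to get non-trivial interactions. So I'm curious what level of scale you'd need to fully squash a burger and have it be physically accurate. I think we'll definitely get there eventually. You just never know where on the scaling curve these capabilities will start to emerge.

Thank you.

Hey, great talk. I was wondering: you showed this nice Minecraft example, which actually shows some agentic behavior. Do you think Sora could serve as a world simulator that helps enable agents interacting with the real world, for example in robotics, better than what we have now?

Yeah, absolutely. You know, I don't know whether the current model is robust enough to reliably improve real-world policies, but I think one day these models will power those systems. You learn so much about the world by training on large-scale video data that it seems inevitable this knowledge will transfer to the real world at some point.

Hi, I have a question about inductive biases. Have you tried training the model with some specific inductive biases, like physics or any rules present in videos?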

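No, we haven't. From the start of the Sora project, we've really focused on training a pure visual generative model with as few inductive biases as possible, and on making sure the foundation is solid enough to scale. That's the core thesis of the project, so we haven't explored introducing inductive biases. I suspect that for some narrower use cases you might get gains from doing so. And, you know, if you need your model to be very small but don't need it to be very general, it could be a win. But we really just want to scale up the largest, most general model possible. To that end, we generally assume inductive biases will hurt at some point, which is why we haven't explored them much.

Thank you, Bill. Let's thank the speaker again. Thank you very much for the great talk.

In their original blog post, OpenAI described Sora as a world simulator, and we have been tracking simulative AI since the start of the summer. Google DeepMind has of course not been resting either, announcing their Veo model at Google I/O this year with the backing of Donald Glover.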

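However, the highlight at ICML was Genie, which stands for Generative Interactive Environments, an 11-billion-parameter foundation world model trained on unlabeled internet videos to generate controllable virtual worlds described through text, synthetic images, photographs, and even sketches.

It consists of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis, despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature.

More recently, DeepMind announced SIMA, their Scalable Instructable Multiworld Agent, and Genie 2, which extends Genie 1 from generating 2D worlds to 3D worlds. Genie 2 is a world model, meaning it can simulate virtual worlds, including the consequences of taking any action, such as jumping or swimming.

It was trained on a large-scale video dataset and, like other generative models, demonstrates various emergent capabilities at scale, such as object interactions, water effects, directional lighting, reflections, complex character animation, physics, and the ability to model and predict the behavior of other agents.

In particular, Genie 2 has long-horizon memory, meaning it can remember parts of the world that are no longer in view and render them accurately when they become observable again. It also generates new plausible content on the fly and maintains a consistent world for up to a minute.

Finally, Genie's learned latent action space facilitates training agents to imitate behaviors from unseen videos, paving the way for training the generalist agents of the future, which we'll explore in the last part of this podcast. But first, here is Google DeepMind's oral presentation of Genie.

Hi everyone, good morning. Thanks for coming.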

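I'm Jack, and together with Ashley I'm very excited to present our paper, Generative Interactive Environments, or Genie for short. Genie was an amazing collaborative effort by a wonderful team at Google DeepMind. Our long-term goal is to train embodied agents that can safely carry out complex tasks with long-term consequences. Arguably, our field has made amazing progress over the past few years, but this still feels pretty far away. So what's missing?

Fortunately, there's another pretty cool paper at ICML that has a nice way of thinking about this. In particular, they break down agent capabilities by breadth and performance. If you look at the bottom-right cell, that's what we want: general superhuman intelligence. The good news is that we've made some good progress on this grid, so we already have what they call emerging AGI, thanks to advances in foundation models.

We also have superhuman agents in narrow domains. In the case of AlphaGo and AlphaZero, these agents have already been used to augment human intelligence. Note, however, that this was enabled by having a simulator for Go, which we don't have for general intelligence. So the main claim we make in this work is that the key missing ingredient for getting to the bottom right is a more general environment. Our main motivation, then, is: how could we possibly get this more general environment?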

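Meanwhile, it's clear that video generation is having a moment. We have quite a large room for this series of orals. This research area is now at the center of AI progress. Incredibly, by leveraging large video datasets, these models are increasingly able to understand the physical world in ways that weren't possible before. As a result, many people have come to believe that these video models can actually be accurate world simulators.

However, we believe that while these models may have world knowledge, they are not world models. In fact, many of the text-to-video models we see today can only be controlled at a very high level, through a text caption. You prompt the model and you get back a video clip. It may be consistent and beautiful, but it is not a world model, because you can't take sequential actions in the environment to learn new behaviors.

The challenge here is that we don't have data with action labels, so the largest world model setups we have are limited by datasets with action labels. That means we can only model existing environments and so can't generate new ones. Using a huge amount of compute just to reconstruct existing games we already have doesn't seem to make sense. So this is the problem we set out to solve in this project. We want to leverage the vast amount of unlabeled video on the internet to create an action-controllable video model, otherwise known as a world model.

So to summarize, the goal of Genie is to learn from videos what we call a generative interactive environment, one that can be played by both humans and AI agents. How do we do this? Well, the main idea is to use a latent action space that is learned in a fully unsupervised manner. Intuitively, these latent actions correspond to clusters of the possible outcomes from a given video frame. I'll now hand over to Ashley to talk about how we did this and show some of our results.

Thanks, Jack.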

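So the Genie model consists of three main components, and we typically train on sequences of about 16 frames. The first component is the video tokenizer, which takes patches from the whole sequence and discretizes them into video tokens.

Next is the latent action model, which takes consecutive frames as input and discretizes and compresses them into what we call latent actions. The main purpose of our latent action model is to try to encode the changes that are going to happen between frames, so that, given these latent actions and the previous frames, we can use them to predict the next frame. This is essential for controllability.

The final component of our model is the dynamics model, which combines our tokenized frames with the latent actions to predict the next frame's tokens.

So to interact with our model at inference time, we can take an initial prompt frame, take some action, generate the next frame, feed it back into the model, and keep going. Importantly, because we learn discrete latent actions, we can actually just plug in integer values to interact in this way. We found it really important to actually interact with the model in order to evaluate it, because measuring a model's quantitative performance with something like FVD is one thing, but modeling controllability and being able to really evaluate it is another. So actually interacting with it was very important for us.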
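The interaction loop described here can be sketched in a few lines. This is a toy illustration, not Genie's implementation: `tokenize`, `dynamics_step`, and `rollout` are hypothetical stand-ins for the learned video tokenizer and dynamics model, the arithmetic inside them is arbitrary, and the vocabulary of 8 latent actions is an assumption made for the example. It shows only the control flow: start from a prompt frame, pick an integer latent action, predict the next frame's tokens, append them to the history, and repeat.

```python
NUM_LATENT_ACTIONS = 8  # assumption: size of the discrete latent action vocabulary


def tokenize(frame):
    """Stand-in video tokenizer: maps a frame (list of ints) to discrete tokens."""
    return [pixel % 16 for pixel in frame]


def dynamics_step(token_history, action):
    """Stand-in dynamics model: next-frame tokens from past tokens plus one action.

    A real model would be a learned network. This toy rule only makes the
    output depend on both the previous frame and the chosen latent action.
    """
    if not 0 <= action < NUM_LATENT_ACTIONS:
        raise ValueError(f"latent action must be in [0, {NUM_LATENT_ACTIONS})")
    last_tokens = token_history[-1]
    return [(tok + action + 1) % 16 for tok in last_tokens]


def rollout(prompt_frame, actions):
    """Autoregressive loop: prompt frame, then one predicted frame per action."""
    history = [tokenize(prompt_frame)]
    for action in actions:
        history.append(dynamics_step(history, action))
    return history
```

Calling `rollout(frame, [0, 0])` and `rollout(frame, [1, 2])` on the same prompt frame produces different token trajectories, which is the sense in which plugging in different integers steers the generation.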

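So Genie was trained on a dataset of about 300,000 hours of video game footage, mostly 2D platformers. But we found it was important to filter this down to 30,000 hours to get a higher-quality dataset.

Before training our main model, we ran several scaling analysis experiments that showed the importance of scaling both model size and batch size. Once we had done this, we arrived at our final 11-billion-parameter Genie model, with a batch size of 512.

Now let's look at our results. I want to point out that all of the results will show out-of-distribution examples, so some of them may come from text-to-image generation.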

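So this first video shows some of the environments Genie is able to create. I think the exciting thing here is that it shows we can basically step into our environments. This shows a human really interacting with these generated environments, taking actions and changing the world we're experiencing. It's important to point out that this is the main distinction that lets us think of Genie as a foundation world model, because we can take these actions within our generations. And this shows another example.

Given the same initial prompt frame, we can take a range of different latent actions and feed them into our model, and you'll see that we get very different, diverse trajectories. Again, this shows a human interacting with the model. And again, this is because we learned this latent action space in an unsupervised way.

Another important thing is consistency. Being able to generate diverse trajectories is one thing, but it's not very useful if you have to figure out what your latent actions mean every time you have a new image. So we also wanted to measure how consistent our latent actions are. Given four different initial prompt images, we can feed in the same sequence of latent actions, and you can see that very similar trajectories and behaviors occur across these different environments.

This tells us that our latent action space is indeed consistent, at least across these environments. Again, I'll point out that we were able to learn these latent actions without any ground-truth action labels, object detection, object segmentation, or any other domain-specific information.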

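One exciting and interesting thing we found in this project is that we can actually feed in sketches, even though we trained only on 2D platformers. For example, on the left is a sketch by Richie on our team. The one in the middle was drawn by Jeff Clune's children. The one on the right is mine, but don't judge me too harshly.

We can basically feed these images into our model and, again, create these environments. So we can see, for example, that we're able to climb the ladder that Richie drew. I think this is the point where we really start to see the creativity that Genie can enable.

We also fed in real-world images, which again are very different from what the model was trained on. For example, on the left is Jack's dog, Doris. Once again we're able to generate these environments and interact with them, even though we never trained on anything that looks like this.

Genie also works on real-world data. We trained a smaller, 2-billion-parameter model on a robotics dataset, and again we see that if we take different prompt images but feed in the same sequence of latent actions, we get similar behaviors, which again shows that the latent actions are consistent. We were also able to simulate deformable objects with this model.

Finally, although we haven't yet shown that we can train agents inside the Genie model, we do show in the paper that we can take the latent actions learned from internet videos and use them to label unseen videos, which agents can then imitate. This suggests that Genie could be used to train our generalist agents of the future. And speaking of the future, I'll hand back to Jack to talk about future directions.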

Awesome. Thank you so much, Ashley. Right, back to me. We want to emphasize that what we've shown here is that this is even possible. Before we started this project, the idea of training an action-controllable world model from videos seemed a bit like a pipe dream. And so as a result, this is the worst that Genie's ever going to be. We're expecting to see rapid progress from here, which we think can have a huge impact in a variety of areas.
So going back to our original motivation, we think that Genie presents a clear path to generating unlimited environments for training agents.

And so for a more formal write-up of how we see this fitting into a framework for getting to more general intelligence, come check out our position paper on Thursday in the oral session. Not only that, but as Ashley mentioned, we noticed something pretty magical while playing with our model: it enabled a new form of creativity, as people such as Jeff Clune's children, as previously mentioned, were able to draw their own worlds and step in and play. And we think this is barely scratching the surface of what could be possible with this new form of generative AI.

Okay, so to address the elephant in the room: for those in academic institutions thinking it's just another industry paper that uses tons of compute that you can't possibly work on, fear not, we've got something for you as well. In the paper, we have a case study where we show you can train your own much smaller Genie model on a mid-range TPU in just under a week. With this approach, you should be able to see some pretty consistent latent actions given different initial prompts in the CoinRun environment.

As an example here, you see different actions from this model that we did train in a few days. And we're excited to see that this isn't just a wild goose chase. We have actually got some students that have been able to reproduce this. So come along to the controllable video generation workshop on Saturday to see their poster. And finally, if 12 minutes wasn't enough for you, fear not. We've got a few other things going on this week. So we've got a couple of position papers. We've got the poster straight after this talk. And then we've also got some longer talks in workshops later in the week. And then many others in the team are here as well who would love to chat to you all.

So yeah, that's a wrap. Thank you for your time.

Thanks for showing this amazing work. I have a question about technical details. In Genie, you have two training phases, right? In the first phase, you're training the inverse model for the latent actions. And then in the second phase, you're training a prediction model for the video generation, right? But in the first phase, when you train the latent action model, you already have a dynamics model trained. So why is it necessary to train this second step? I'm just wondering.

Oh yeah, that's a great question. So essentially what you're saying is, why do we have a decoder for the latent action model that already predicts the next frame and then subsequently train another one? We found that there are actually slightly different trade-offs for this decoder. If you look in the paper, we predict in pixel space rather than token space for the latent action model. We found that that really helped to get more controllable, consistent latent actions.

And then that decoder itself is just predicting in pixel space, so it would actually be pretty blurry if you were to use it as your generative model.

Whereas we found that the MaskGIT objective wasn't best for learning latent actions. It just didn't lead to as consistent latent actions. So we have this dual approach where we have two different dynamics models that we essentially learn as part of the process. But you're totally right that this isn't the most elegant solution, and many of my team weren't overly ecstatic about it, but that's why we're saying this is the worst it's ever going to be. And hopefully some of you folks in the community can build a much more elegant solution in the next few months.

Thank you very much, really cool work that you guys are doing. I wanted to ask regarding the qualities that we can see on these world models. So on the videos we essentially saw some amount of physics, so jumping and then falling down because of gravity, and then we saw platforms and saw some ladders. Do the world models ever generate other entities? Think of it like maybe enemies going back and forth that if you touch them something happens or...

What other qualities have you guys observed?

Yeah, that's a great question. So I would say in the sort of examples that we show, particularly out-of-distribution examples, it is very difficult for it to generate anything that's kind of exciting. We are able to move the character around, but typically you would just see it, I guess, repeating the patterns that it's seen in the background and that sort of thing. I think another exciting direction for the future is trying to figure out how to make the generations a little bit more exciting and diverse.

Yeah, thank you very much.

Now as well, one last question. I actually wanted to ask you something. What do you think is the cool killer application that you see in the future, if you could really scale this up and train this on anything?

So I think there are quite a few applications, and really it's subjective depending on your interests. I personally think this could have impact in quite a few areas. You can imagine that in some of the domains we use, it's already quite fun to interact with as someone in those settings, but I think it could have quite a large impact in areas such as robotics, because it's currently quite hard for robots to generalize to unseen scenarios, and that changes if you could generate a world model for any possible domain. And actually we've seen there's an open source Genie model from 1X Robotics that works pretty well. So I think that they obviously think so too, and they probably know more about that than me. And yeah, so I think there's a lot of potential applications, but we're just not focusing on one right now.

Alright, thank you very much and congratulations on the best paper. Because the Genie team were accepting their best paper award in Vienna, we were able to catch them at their poster session live to tell a bit of the human story behind Genie. Over to you, Brittany! I'm here with the Genie team. Generative Interactive Environments is the title of the poster. And I'm here with Jack. Jack, can you tell us a little bit about the origin story of the Genie project?

Sure thing. Yeah, firstly, thanks for the chance to speak to you. So basically, Genie is kind of a fusion of a few different areas of research. Myself and some others who are working on open-ended learning and environment generation beforehand, and we were interested in world models and thinking about how we could scale them to internet videos. But obviously, the key challenge with that is that internet videos don't have action labels. So if you want to train a model that takes actions as input to predict the future, you don't have the action, so you can't train that way.

And then on the other end of the spectrum, Ashley Edwards had been working for many years on inferring actions from videos for a different purpose for directly training agents with behavior cloning. And so it seemed like a natural fit really to combine these ideas. And there were some pretty simple proof of concepts of people doing this at like very small scale, but no one had really gone to the generative angle of getting an environment generator from a large scale dataset.

And so when we first spoke with Ashley a year and a half ago, we were excited about this potential combining these ideas to build something completely new. And yeah, I guess that's where we got to. Nice. And can you give a little bit of an overview, I guess, of how the work went, what results you saw, that type of thing? Sure, yeah. So we started basically working on this 2D platformers data set.

So we have 280,000 hours of publicly available videos of 2D platform games. We found one important thing was to filter this down because a lot of the videos aren't very good quality. So we trained a classifier with hand labels that we labeled as a team, a small subset. And then we ended up with 30,000 hours of good quality videos.

And then from that point it was just a modeling problem. And so we did a lot of research on different approaches to get these latent actions. And what we ended up with is we train in kind of a, I guess slightly quirky way, is that we predict pixels with a latent action decoder. And then that allows us to learn a discrete set of eight latent actions.
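One common way to obtain a small discrete action set like this is vector quantization: a continuous embedding of the change between two frames is snapped to the nearest entry of a learned codebook, and that entry's index is the latent action. The speaker doesn't spell out the mechanism here, so treat the following as a sketch under that assumption. The 2-D `CODEBOOK` values and the function name are made up for illustration; a real codebook would be learned.

```python
# Hypothetical codebook with eight entries, one per latent action index.
CODEBOOK = [
    (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0),
    (1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (0.0, 0.0),
]


def quantize_to_latent_action(change_embedding, codebook=CODEBOOK):
    """Return the index of the codebook vector closest to the embedding.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(change_embedding, codebook[i]),
    )
```

Because the output is just an index, a user can later drive the model by choosing an integer directly, without knowing what embedding produced it.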

And then we separately train a dynamics model that is using MaskGIT, which is like a way of generating next frames. And we train that separately given the latent actions that are produced from the video. Just predict the next frame, conditioned on the actions. And then people were working on different things; the project was quite fast-paced, and a few of us kind of switched and wore many hats. And we all started getting different results in different areas. And then

Roughly around last summer, so probably just under a year ago, we realized that when we combined a few of these ideas, we actually had something that worked pretty well. And then we were excited, obviously. So then we started working on seeing how the model scales.

Because the key thing about this project is that if you can figure out how to generate worlds without action labels, then essentially there's nothing stopping you from using all of the world's videos because there's no reason why you need to wait to do that. So we started saying, okay, how can we scale this approach and what does it do? And then we produced these plots, which you'll see in the paper if you have time to look at that, where we show that as you increase the model size from...

a few tens of millions to, in that plot, something like two billion, you just get increased performance every single time you increase the scale. And it's the same thing with batch size, when you increase the number of examples the model sees. So we realized that we had produced a scalable model. We then decided to go for what in the end was an 11-billion-parameter model, and once we produced that, we just started playing with it and seeing what we could do with it.

And then finally, obviously the goal of this was originally to get an environment for agents. That's how we kind of started, Ashley on the behavior cloning side and myself and others on the more auto-curriculum and open-ended learning side. But we realized it was really fun to play with the model. And so actually maybe the more interesting use case is how it enables new forms of creativity. And so there are some examples in the paper of things like drawings from one of our co-authors' children.

They sent us photos of the drawings, and we were then able to prompt the model with those photos and play and move the characters in the photos around. And that's pretty cool, right? Because you're enabling people to create their own worlds, step into them, and interact with them, which was not really what we first thought of when we started the project. But I think it tells you that if you do kind of ambitious, somewhat crazy stuff, then maybe new things will emerge. So that was pretty fun.

There's also a picture of my dog there too. So that's another example I like. Nice. What would you see in terms of, I guess, more near-term potential applications for this? A lot of the folks who listen to the podcast are kind of on the AI engineer builder side of things. Any creative ideas there? Honestly, this is going to sound a bit like...

a bit of a non-answer, but there's so many applications. So you can obviously see the ones in the paper we have examples where we show generating 2D platformer-like kind of short game experiences. But you also can see the models work on robotics data too.

And arguably that latter use case is maybe more promising in the short term. There's actually already been an open source Genie model released on the 1X GitHub repo as part of their world model challenge. And so I think they're more expert in robotics than I am. But the fact that they think it's a potentially good direction for robotics probably speaks volumes.

I think there's other use cases too, so things like maybe driving. If you could generate scenarios for testing or even training autonomous vehicles and then be able to interact in the world in any custom situation, that could be very valuable. But on our side, we're mostly just pursuing the fundamental research

and not really focusing on one specific application. Yeah, and I imagine the world has already moved to, continued to move forward from a research perspective since you guys put this out there. Is this a direction that you see yourself continuing to pursue or what have you been excited about lately? Yeah, so this work, I mean, ICML, the deadline is January, right? So it's already six months or so ago that we submitted this.

And yeah, I guess most of the team are still working on it. Do you see coming out with a Genie v2 or v3? Yeah, I can't speak exactly about specific releases, but hopefully we'll have something new at some point.

Has there been anything else that's come out in the research landscape that you feel has either reinforced kind of what you've worked on or contradicted on the flip side? How do you see it evolving? Yeah, that's a great question. So definitely reinforced. I think just after Genie came out, there's been a flurry of like really amazing video generation results. So the first one was clearly Sora. I think they definitely took the space by the scruff of the neck and really pushed capabilities quite significantly.

and that was really exciting to see. It's quite a different style of model in that theirs is text to video, so it generates entire clips, whereas Genie is frame by frame level control. But nonetheless, it does show you that with additional scale and I guess brilliant execution, you can get much more high quality videos generated already than I probably thought was possible.

And then since then, I think the floodgates have kind of opened in the space. So from our own colleagues, the Veo model came out. It was announced at I/O. It was really impressive as well. And then other competitors, I guess, have done similar things. So it's a really exciting time for that space. I think there's not really been anything that's action-controllable like Genie. But yeah, it's definitely an exciting time for video generation. So I think it's a good space to be getting involved in now.

Given how fast the community is moving, I think in a few years' time we'll have something pretty incredible.

I just spoke with your VideoPoet colleagues as well. How do you see this work dovetailing with that work? Or how do you work together on the future of what video looks like, or world models?

I think that they're maybe more interested in more cinematic video experiences and generating entire clips. Whereas Genie still remains quite...

quite fundamentally different. It's only generating one frame at a time. And it's kind of a video model, but it's also kind of like an autoregressive image generation model in a sense. So it kind of sits in its own

It's kind of a new area. I guess a lot of researchers always claim that they're inventing a new area, but it is kind of a new area of research and hard to classify. It's a bit different. It's also a lot of us come from an RL background, so we're much more thinking about agents, which I think is quite different to all the video work, which is much more, I guess, focused on generative media and generating cinematic quality videos. But there's definitely some overlaps in the architectures and these kind of things and the infrastructure and...

We both want to use lots of compute, so I guess that's another thing we have in common. What do you make of all of the hype around the agent space? Do you see that continuing or do you see people getting tired of the agentic buzz? You're venturing into hot takes territory. So it's tricky because, I mean, a lot of us who worked in RL, right, we've been working on agents for a long time.

Because I think RL is often dubbed as reinforcement learning research, but really it's agent research. For a lot of us, it's agent research where RL is currently the best method to get agents. It seems like now that's shifted because people are starting with LLMs and then training them on top to get additional capabilities. But it's a lot of the same people that were doing RL research, so they've always been working on agents. It's just that now it's called LLM agents, where before it was tabula rasa RL. So I think...

Yeah, this hasn't changed a huge deal. It's just that we're starting with base models rather than tabula rasa and maybe some kind of toyish environments. It's kind of a natural progression of that line of work. I think it's exciting, but I think the goal of Genie is a bit different in that we're going more for an embodied AGI. We want agents that can interact in the real world over a long horizon. And for that, I just can't look past how you would need a simulator of the real world, which I don't think we're going to build

by hand. So I think they're kind of complementary in a sense. I think the LLM agents will become more capable in doing long horizon tasks in like text-based substrates, but I think that to then in the real world take long horizon actions for some kind of VLM, it's going to need to be able to interact in the world and we're not going to just be releasing them to do random exploration. So I think a real world simulator will play into that at some point. Awesome. Thank you so much for the time.

Believe it or not, Google also won a second best paper award at ICML for video generation with VideoPoet, DeepMind's take on zero-shot video generation. VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model into a high-quality video generator. It contains a few simple components.

A pre-trained MAGVIT-v2 video tokenizer and a SoundStream audio tokenizer transform images, video, and audio clips with variable lengths into a sequence of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, facilitating an integration with other modalities such as text.

An autoregressive language model learns across video, image, audio, and text modalities to autoregressively predict the next video or audio token in the sequence.

A mixture of multimodal generative learning objectives is introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video in-painting and out-painting, video stylization, and video-to-audio. Furthermore, such tasks can be composed together for additional zero-shot capabilities, for example, text-to-audio.

Let's cut to Li Junyu speaking for the VideoPoet oral presentation.

Good morning, everyone. This is Li Junyu from Google DeepMind. Excited to meet everyone in Vienna. This year, I believe many of you may have witnessed the significant progress on video generation, especially with text-to-video diffusion models. Today, I'm going to talk about a completely different approach, which shows that diffusion may not be a necessary component.

We appreciate that the award recognizes the contributions of this work. Now, please allow me to introduce VideoPoet, a large language model for zero-shot video generation. This work wouldn't be made possible without our talented team, with members coming from diverse backgrounds and moving forward along different paths.

The core contributors were Dan, myself, Xiuye, Jose, Jonathan, Brian, and Lu, along with many other video players. Reflecting on the progress so far, we realized that video generation has already come a long way from the early days of GAN models. In case you have never seen generated videos from a large-scale model by the definition of 2016, here are two examples for classes of GORF and BIB.

Since then, people have scaled up GAN models and developed pixel-space autoregressive and diffusion models, which were getting less affordable. Some works try to model it as a foreign language of images or videos, but lossy discrete tokenization poses inevitable limitations. Later, latent diffusion has become the dominating approach, given its appealing sample quality. Big companies and startups have ignited a race of scaling up compute and data.

Now, nearly 10 years later, models can easily generate a video clip from a text prompt, like this skeleton drinking soda. But is latent diffusion the only way to go as we embrace the LLM era? Absolutely not. In fact, this video is generated with VideoPoet, a purely LLM-based approach without diffusion.

VideoPoet is a foundation model that takes as input text, images, visual dense signals, partial videos, audio, and combinations of these. It is capable of text-to-video, image-to-video, video stylization, video editing, video-to-audio, and many other tasks.

In short, VideoPoet is an autoregressive LLM that synthesizes videos with high-fidelity motion and matching audio from a large variety of conditioning signals. The diverse capabilities of VideoPoet are facilitated by defining a universal multimodal sequence-to-sequence problem. The condition sequence includes task indicators, inputs from text, visual, and audio modalities, as well as output format controllers.

The model generates the output sequence of visual and audio tokens in a fully autoregressive manner, just like a usual language model. In order to define the token space for each modality, we resort to a collection of unimodal tokenizers. Megawid V2 Encoder and Decoder define a bidirectional mapping between a pixel space and a compressed space of discrete visual tokens. It can tokenize image, depth, or optical flow, as well as cropped or masked videos.

SoundStream does similarly for the audio waveform. Although text tokens can be directly fed in, we use a pre-trained T5 to extract text features to reduce the burden of learning human language from scratch. The MAGVIT-v2 tokenizer defines the visual language. It resembles a quantized VAE with a temporally causal 3D CNN architecture processing pixels. This causal design enables joint training with large-scale image data and seamless support for long videos.

For higher prediction bandwidth, we adopt a large vocabulary of over 200,000 words enabled by our scalable quantizer. The model is trained with both reconstructive and adversarial objectives. In a human rater study, our advanced video tokenizer achieves even better compression quality than VVC, the next-generation video codec standard. This tokenizer lays a solid foundation for high-fidelity generation of videos, especially for those with large motion.
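The talk only refers to a "scalable quantizer". One way to reach a vocabulary of this size without a learned lookup table, described as lookup-free quantization in the MAGVIT-v2 paper, is to binarize each latent channel by sign so that the bits form the token id. The sketch below assumes 18 channels, which gives 2^18 = 262,144 tokens; the channel count and function names are illustrative, not taken from the talk.

```python
NUM_BITS = 18  # assumed number of latent channels; 2**18 = 262,144 > 200,000


def lfq_token(latent):
    """Binarize each channel by sign and pack the bits into one integer id."""
    assert len(latent) == NUM_BITS
    token = 0
    for i, value in enumerate(latent):
        if value > 0:
            token |= 1 << i
    return token


def lfq_dequantize(token):
    """Recover the +/-1 code vector that a token id stands for."""
    return [1.0 if (token >> i) & 1 else -1.0 for i in range(NUM_BITS)]


vocab_size = 2 ** NUM_BITS
```

Because the codebook is implicit, growing the vocabulary only means adding channels rather than training a larger embedding table.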

Similarly, the SoundStream tokenizer defines the audio language. It adopts a causal 1D CNN on the waveform and uses residual vector quantization to produce multiple levels of tokens, of which VideoPoet uses the first four. Its quality is better than the Opus audio codec standard. Now that we have defined the multimodal token spaces, we can convert video datasets into discrete token sequences.
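As a rough illustration of residual vector quantization, each level quantizes whatever the previous levels left over, so keeping only the first few levels gives a coarser but cheaper code. The codebooks below are fixed toy values rather than learned ones, and all names are illustrative.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)))


def rvq_encode(vec, codebooks):
    """Quantize level by level; each level codes the previous residual."""
    residual, tokens = list(vec), []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the chosen entries of the first len(tokens) levels."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Toy 1-D codebooks whose entries halve in magnitude at every level.
codebooks = [[[-(0.5 ** level)], [0.5 ** level]] for level in range(6)]
tokens = rvq_encode([0.8], codebooks)
coarse = rvq_decode(tokens[:1], codebooks)     # first level only
first_four = rvq_decode(tokens[:4], codebooks)  # first four levels
```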

Then we can use an out-of-the-box LLM transformer training infrastructure to learn these as foreign languages. In VideoPoet, we adopt a decoder-only prefix LLM architecture where bidirectional attention is applied on the condition sequence followed by causal attention on the target output. Compared to a diffusion transformer of the same size, the VideoPoet framework has significant flexibility and efficiency benefits at both training and inference times.
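The prefix attention pattern described here can be written down as a boolean mask. This is a generic prefix-LM sketch, not VideoPoet's code, and the function name is an assumption.

```python
def prefix_lm_mask(prefix_len, target_len):
    """mask[i][j] is True when position i may attend to position j."""
    n = prefix_len + target_len
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if j < prefix_len:
                row.append(True)    # everyone sees the whole condition prefix
            else:
                row.append(j <= i)  # target positions are strictly causal
        mask.append(row)
    return mask


mask = prefix_lm_mask(prefix_len=3, target_len=2)
```

Prefix positions see each other in both directions but never see the target, while each target position sees the full prefix plus the targets generated so far.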

We can flexibly train arbitrary tasks between any modalities together with variable lengths of condition and target sequences in a single model. With causal attention, the transformer learns the entire decoding trajectory for video in a single training step. At inference time, we can leverage various types of existing acceleration techniques, such as KV caching, so that the entire decoding FLOPs are no more than one full forward pass. Video data comes from different sources in diverse formats.
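A back-of-the-envelope check of the KV-caching claim, counting only attention query-key pairs. That is a simplification: it ignores the MLP and projection cost, and the function names are illustrative.

```python
def causal_forward_pairs(n):
    """Query-key pairs scored by one causal forward pass over n tokens."""
    return sum(i + 1 for i in range(n))


def cached_decode_pairs(n):
    """Pairs scored when decoding n tokens one at a time with a KV cache."""
    total, cache_len = 0, 0
    for _ in range(n):
        cache_len += 1      # append the new token's key and value
        total += cache_len  # the single new query attends to the cache
    return total


def uncached_decode_pairs(n):
    """Pairs scored when every step recomputes the causal pass so far."""
    return sum(causal_forward_pairs(step) for step in range(1, n + 1))
```

Under this count, cached decoding of n tokens does exactly the work of one causal pass over n tokens, while recomputing at every step grows cubically.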

Text-to-video diffusion models usually require text-video pairs with high aesthetic value, which may be scarce and costly to curate. With our flexible design, VideoPoet can pre-train on a mixture of pre-existing data, where a large fraction remains unlabeled or noisily labeled. In this table, we have a large number of raw videos with audio from the public internet, some videos with noisy machine captions, and another set of videos with high-quality human captions.

We also leverage image-text pairs to improve language alignment. After pre-training, we can have a second training phase of task-specific adaptation with the corresponding high-quality dataset, such as for text-to-video. More details about the training data can be found in the paper. We have a large mixture of training tasks on this data, starting with self-supervised ones such as unconditional generation of various modalities.

With an autoregressive model, they also imply the corresponding continuation tasks for video, audio, and both of them, as well as the image-to-video task. VideoPoet is trained to generate audio given a video, or vice versa, and perform various types of video editing, such as inpainting, outpainting, and interpolation. In addition, leveraging the captions, it learns to generate video, audio, and image from text. Video stylization is supported by depth or optical flow conditions.

After the LLM backbone generates video tokens, we can optionally apply a latent super-resolution module before decoding to pixels. It uses the MAGVIT masked transformer with non-autoregressive decoding, which runs faster at small scale, with multi-axis windowed attention to handle long sequences at high resolutions. While VideoPoet has broad generation capabilities, most of the existing automatic benchmarks are defined around text-to-video.

Here we compare with prior methods on the commonly used MSR-VTT and UCF-101 zero-shot text-to-video evaluations. On the metrics of CLIP similarity, Inception Score, and FVD, VideoPoet performs favorably against prior models that were specifically designed for text-to-video. As automatic metrics become saturated and less indicative, we conduct a user study with human raters to compare zero-shot text-to-video generation in various aspects.

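We compare with prior works including Phenaki, Show-1, VideoCrafter, Runway, and Pika, as well as concurrent works such as Vought and Lumiere. On the axes of text fidelity, video quality, motion interestingness, and motion realism, VideoPoet outperforms prior works in most cases and also outperforms concurrent works in most cases. Here is a set of text-to-video samples from VideoPoet, where we highlight their high-fidelity motion. More samples can be found on the project page.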

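Here we show the image-to-video capability, which can also potentially be applied to 3D rendering. In addition, video stylization and editing are natively supported. VideoPoet can generate the corresponding audio for a video because it understands the content. Here we show a few examples where both the video and the audio are generated by VideoPoet.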

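We hope our work can empower the community to explore broader areas and greater depth. With LLM-style foundation models, we can further leverage their generalization ability to do in-context learning for video generation of new motions, characters, or objects in a customized and controllable way. We can even think about how to add new modalities to the model at inference time.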

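On the efficiency side, we may want to look at how video generation can run in a real-time streaming fashion. This would not only enable interactive neural games but might also facilitate neural user interfaces. Imagine a neural-based operating system: it might never show a blue screen of death again, but instead reboot when it runs out of memory because of a long context.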

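Further progress will hopefully take us to a general-purpose multimodal generative model that performs well across text, video, audio, images, and more. Think of text-to-video today as being like machine translation in 2018, when it first surpassed human performance. It took another five years before we had ChatGPT. I guess it may be even faster before we have intelligence that can reason and generate across modalities at our level.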

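I look forward to seeing its answer to a question like: can you show me, in a live video, how to tie this with one hand? In summary, VideoPoet represents a distinct approach to video generation. It challenges the diffusion monopoly, delivering visual quality that is still state of the art while offering multitask flexibility beyond the text-to-video translation paradigm. It is a video-first foundation model with diverse generation and editing capabilities, natively built on out-of-the-box LLM infrastructure.

That's the end of my talk today. Thank you for the great work. I'm actually very curious about VideoPoet's instruction-following capability. Have you taken any steps to evaluate instruction following, quantitatively or qualitatively? And how does it compare with conventional diffusion models that use classifier-free guidance?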

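Yes, that's my question. Okay, that's a good question. First, I don't think there is a very good quantitative metric for this, but it is a very promising future direction, and people should work on evaluating it for these video generation models. Second, we do use classifier-free guidance for autoregressive decoding in our model. It works. And then, I think it is very tricky to compare fairly with diffusion models beyond system-level comparisons, because they use different latents: one is continuous and one is discrete. When you train on different latents you get different reconstruction quality. And for a language model you can compute perplexity, while people don't really know how to do that for diffusion models. But I believe it is definitely worth further study to compare the two approaches systematically. Okay, thank you.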
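Classifier-free guidance for an autoregressive model is commonly applied to next-token logits. The sketch below shows that common form; the talk does not spell out VideoPoet's exact formulation, so the function names and the `scale` parameter are assumptions.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Push the unconditional logits toward the conditional ones by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


cond = [2.0, 0.0]    # logits given the text prompt
uncond = [1.0, 0.0]  # logits with the prompt dropped
guided = cfg_logits(cond, uncond, scale=3.0)
```

With `scale=1.0` this reduces to ordinary conditional sampling; larger values sharpen the distribution toward tokens the condition favours.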

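- Okay, great work and a great team. Hi, Lijun. - Hi. - We are also working on video and have built the capability to generate videos. So my question is: as we've come to agree, video tokenization may be the bottleneck of the model, right? - Yes. - So do you have any insights you'd like to share on how to build a really powerful video tokenizer?

Okay. First, a video tokenizer consists of an encoder and a decoder, and it basically learns a bidirectional mapping. Sometimes people use something like a diffusion decoder or a language model on the other side, but that means you are building another generative model. In VideoPoet, we use just one encoder-decoder pair, trained with reconstruction and adversarial objectives. It is a bidirectional mapping. In this case, training for high-quality reconstruction becomes very tricky, because it is always a lossy compression problem: on the decoding side you always have a one-to-many mapping. So I think what really helps to get rid of blurry reconstructions is the adversarial objective, which gives you sharp videos. In addition, the causal 3D CNN in the MAGVIT-v2 architecture also helps a lot, especially when combined with autoregressive modeling, because you have full temporal causality in training. That is very friendly to autoregressive decoding.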

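Okay, thank you. Maybe we can chat more later. Yes, of course. You're welcome at the poster session. It's at the very end. First of all, thank you for your work. I have a question about open-sourcing MAGVIT-v2. As far as I know, it was planned, but MAGVIT-v2 has not been open-sourced yet. Do you have any plans for that?

Okay. If you had asked me this question a month ago, I could have answered more confidently, but at this point I really don't know. The good news for you is that the MAGVIT version 1 tokenizer was open-sourced a year ago. Yes, you can use it. I think it takes only about 100 more lines of code to reproduce MAGVIT-v2, and then you have the tokenizer with the full reconstruction and adversarial training logic.

Yes, I know that reproducing it is simple, but training is hard, as it is for all VQ-VAE methods. Finding the right parameters is very difficult.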

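Yes, I agree with that. It is tricky and needs some hyperparameter search, especially since it changes with the dataset statistics, and I think there is a lot of room for improvement in the future. I believe the current solution is not perfect. Even now, although I have worked on so many video tokenizers, my real dream is to get rid of them. Last question. Thank you. Thank you for your talk. You presented a very interesting direction for multimodal foundation models. From what I've seen, the whole architecture and training method looks a lot like sequence continuation, right? Yes. So I wonder whether you have worked on any capabilities or architectural components that help the model generalize and connect the dots. Especially with video and audio signals, seeing the structure behind the diversity is a very hard task for any model. So have you considered working in this direction?

Okay, that's a good question. First, one advantage of VideoPoet is that we build on out-of-the-box LLM training infrastructure. We take it for granted, and we actually made no modifications to the model architecture or the training recipe. So all you need to do is define your token space and curate your sequence dataset.

I think that part needs some really clever design. For example, you can make text-to-image a prefix of text-to-video, and make the video-to-audio task a prefix of unconditional video-audio generation, things like that. That way you help the model generalize across different tasks. Okay. Thank you very much for your talk, and congratulations again on the best paper award. Thank you.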

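I'm here with Dan Kondratyuk to discuss the poster for VideoPoet, a large language model for zero-shot video generation. Dan is from Luma AI, one of the leading companies in AI video generation. Dan, could you give me an overview of how you got started on this work, and a brief, high-level summary of what you're presenting on the poster? Yeah. So we started this project mainly to think about video generation from a foundation-model perspective. Foundation models, when you typically think of them, at that time were all language models or vision-language models. So they mostly output text, but we thought, why not approach this from the video side? This design approach is very different from the current video generation models, which are mostly diffusion-based.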

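So we thought, maybe we can formulate a task where we use an off-the-shelf language model. The thing we changed least in this project is the model itself: we just use a language model without doing anything special. Our real innovation is on the data side, in how you design the tasks as inputs to the language model.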

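The way we designed it is to convert all modalities (text, video, image, audio) into one embedding space. That means you convert them into a language that the language model can understand. Usually when you think of language, it's human language, natural language, the type of text that you can read. But you can actually treat images, video, and audio as a language too. So we have this tokenizer, which we call MAGVIT-v2, that converts images or videos into a sequence of discrete tokens with a very large vocabulary. The vocabulary is about 200,000 tokens, which can be fed directly into the language model. So this language model in some sense speaks the language of video, and all we do is train on hundreds of millions of videos. I think we trained on over a billion images, and some of our datasets contain video-audio pairs.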

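Depending on how you arrange things, we have this bidirectional attention prefix. That just means we feed these in and the model has a way to combine all of these modalities, the text and the images. We also have some alternative types of dense inputs for stylization, and audio. Depending on the order in which you feed things in, you can condition to output different things. For example, you feed in text and you can output video. So given your description, something like an astronaut dancing on Mars, it starts generating the output video based on what we trained on.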

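Similarly for our output audio, we can, for example, take an input image or video and try to generate the accompanying audio, using audio tokens produced by SoundStream, an earlier Google paper that did language modeling of these audio tokens. So our approach is mostly about how we combine these tasks together. We're not the first work to show that you can use a language model for this type of generation, but we are one that shows you can scale it up to a level that is competitive with existing video generation work. Because it's a foundation model, it can perform a large number of tasks depending on how we design them. For example, you can do text-to-video, and we can do image animation: take an input Mona Lisa and just ask for the Mona Lisa to yawn, and suddenly it does what you describe you want the image to do, which is really cool.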

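Then we can chain it with other tasks, for example stylization. If we use depth and optical flow conditioning, it basically strips away all the content of the original video and conditions only on depth and optical flow. If you describe something like an oil painting of a snowman with a red hat opening its mouth to yawn, it paints that over the same motion as the original video, which is another really cool thing the model can do.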

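Then we have lots of other tasks; it can even generate audio. It can do outpainting, where we take an input video and try to paint more content at the bottom and top. Overall, we evaluated the results and saw that they are comparable to many existing works. In fact, they exceed most of the works we tried, which is really cool.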

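A few things the model is particularly good at include prompt following. Because we can train it as a language model, it's actually easy to scale with our existing infrastructure. It also does quite well on motion. You can take existing image-to-video or text-to-video results, and compared with the other works we tried, it applies larger, more interesting motion in the video instead of just having some tiny movements that look more like image animation. So that's an overview of the work. If you have any questions, we'd be happy to answer. - Yeah, so we're talking to a crowd of AI engineers here. They build applications, often using some of these models as the foundation for the AI part.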

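So I'm curious whether you'd say the work you've done, and the language-model approach, is better suited to certain use cases while diffusion models may still be better for others, or how you would think about the trade-offs, I guess? Yeah. I think there are a few things. If you want really good pixel quality, I think diffusion models are still unmatched there. That's mostly because of the tokenizer. The tokenizer does an extreme level of compression. That's why we're forced to generate at these relatively small resolutions and need a super-resolution model to increase the fidelity. With diffusion models you don't have that restriction. You can do diffusion on latent tokens that are not as compressed.

So as a result it's a bit easier to get these high-resolution, high-quality results. However, one problem with diffusion models is that they take a long time to converge and a long time to train. I think the language-model approach is clearly more efficient. We trained for only a few weeks and it had already converged quite well.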

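And it scales in line with existing language-modeling approaches. So you can easily predict what happens if we keep increasing the model size. As we see here, the 1-billion-parameter model is already pretty good, but with the 8-billion model, just by increasing the parameters eightfold, we get better results. I suspect that if we keep increasing the model size, it will keep improving. So that's another nice result. Diffusion models do have scaling properties, but I would say they are harder to predict.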

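So I think some of the advantages of language models are that there is far more research on their scaling properties. And because the tokens are flattened into a 1D sequence, you can have this multimodal representation, whereas diffusion models typically operate on only one modality. Maybe you can try doing video and audio generation at the same time, but I don't think you can yet generate all modalities simultaneously at the same quality level. Text diffusion is really hard at the moment and doesn't reach the same performance as autoregressive models. So if you want one general foundation model that does everything at once, I would still say language models are in the lead for now. But who knows? People are doing research in many different areas. If someone cracks text diffusion, I think you could also create a foundation model that handles all these modalities.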

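Is this work you are doing in your role at Luma? Is it work you plan to keep pushing forward in that context? So I recently left Google and joined Luma to do some video generation work that I'm very excited about. This is work I did during my time at Google.

I think this line of work, this kind of general approach, may well continue in the future. Obviously, VideoPoet still hasn't been released, which I think just shows how fast this field is moving. It's a very competitive space right now. But I do think this general approach may surface in many different areas in the future. Who knows? Right now, language models and diffusion models are fighting it out on this battlefield. So who can say which will win in the end? Both approaches have been shown to work well. They each have their pros and cons at the moment, and more research is underway. So right now I'm very excited about the future of video generation, toward building these more general models. And at least for this work on VideoPoet, I'm also very excited about what comes next. Awesome, thank you very much.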

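As you may have noticed, Dan Kondratyuk, the lead author of the VideoPoet paper, has left DeepMind to join Luma Labs, which is responsible for the Luma Dream Machine model that went viral this year for turning popular memes into videos. To close out our generative video discussion, we're bringing in Tali Dekel's invited talk from the Text, Camera, Action! Frontiers in Controllable Video Generation workshop, discussing the future of video generation beyond data and scale. As a reminder, all the talks are publicly linked, so if you want to see the videos she is talking about, check the show notes.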

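Hi everyone, I'm Tali, and I'm happy to be here. Today I'm going to talk about the great revolution we are witnessing in generative AI, and in video generation in particular. As you know, models in this space require massive training, data, and compute. But I hope to convince you, based on my own experience and work, that the future of video generation goes far beyond data and scale.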

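This will be a high-level talk covering different topics, but I also hope to dive into the technical details of more recent work. Again, in the context of this workshop, I think it's redundant to say that we're all aware the generative AI revolution has recently expanded to video. We are now able to generate not only these... wow, what? Not only these stunning still images; we can also make everything move. And I think the past few years have shown dramatically fast progress in this area. As we witness this progress, we can start to envision what filmmaking might look like in the movie industry, and to think we might be able to generate films entirely computationally. So maybe it will look like this. We'll ask ChatGPT to help us write a script, and it will generate the script for us. Then it will take the script and generate the film entirely computationally. Maybe we'll ask it to add some special effects, like a bullet-time effect, and it will do all of that computationally, without any real actors or cameras, using only generative AI.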

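Yeah, can you hear it out there? Okay, sorry. No more audio in this talk. In case you're too young, you may not know, but this is not a real generated video. It's taken from The Matrix, a movie made in the 90s.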

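Okay, so I'm sorry to disappoint you, but I think we are still far from that future. Despite all the amazing progress we've witnessed, state-of-the-art text-to-video models still exhibit some basic failure cases, even models like Sora. For example, they often fail to model real physical interactions in the world, like this object that is supposed to be a solid chair: you can see it floating in space in an unrealistic way. In this example, the treadmill also follows the person in a way that isn't physically plausible. And when we deal with more complex scenes involving multiple entities, objects tend to spontaneously appear and disappear unrealistically. This basically tells us that video generation is still not solved.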

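Moreover, the cost of scaling up video models and developing these general-purpose foundation models is enormous. You know, a single model training run takes around 200,000 GPU hours, which is equivalent to nearly $280,000. That simply means training such models takes millions of dollars. In terms of energy consumption, generating just half a second of video is equivalent to driving an average car about four miles. Because of these costs, we find these video foundation models locked up inside industry, and only a few big companies can develop and design such models. So what should our research community do in this situation?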

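And if we go back to the grand goal of generating films entirely computationally, to do that we need explicit, fine-grained control. We may want to precisely control the camera position, the characters' identity, emotions, position, and motion. We may want to control lighting as well as sound and speech. And all of these controls are not currently provided to us by video foundation models. So my research journey in video actually started at the other end of the spectrum, with single-video models.

What do I mean by that? I mean that we have some neural-based framework that is overfitted to a single test video. Like NeRF, for example, which is overfitted to a single 3D scene, in this case we have some neural network that observes only this test video, without any additional data.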

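It turns out you can do some pretty impressive things with these single-video models. For example, we showed how to deal with this very busy and complex scene. Say you want to focus on a single dynamic object: we can actually remove all the other moving people in this scene except this girl. You can notice that we removed not only the people but also the complex deformations happening on the trampoline in this case.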

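This is a video of my son riding a bike for the first time. I can take this video and stylize only the background. You can see that everything moves consistently with the original scene and in a physically correct way. These works are from 2021, before the generative AI revolution.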

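We can also map textures not only onto rigid objects but onto deformable, articulated ones. For example, we can add these flowers to the dress, and they move in the same physically correct way as in the original video. Again, the only information these models have about the world is the single video, the input video.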

Of course, one big drawback is that they do not have rich and powerful priors about the world.

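So, to show this in more detail: I think one big advantage of this approach is that it lets us go beyond just processing the raw, huge volume of pixels. We can design sophisticated and more advanced representations for real-world videos.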

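So in Layered Neural Atlases, the key idea for supporting this kind of consistent video editing is basically to decompose the video into, or to estimate from the video, a set of unified canonical images.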

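So given this input video, we estimate two atlas images, as you can see, one for the background and one for the foreground, which represent the entire background or foreground across the whole video. Each pixel location in the original video is mapped onto these atlas images, which basically makes it possible to reconstruct the original video from this representation.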

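And the key advantage of this representation is that it reduces the hard task of processing the huge pixel volume of a real-world video to editing a single 2D image. So what you can do is plug these images into any image-editing framework, or load them into Photoshop and paint something, and then use the mapping to map it back to the original video. Sorry, the animation isn't working. Okay. Of course, what I'm showing you is a set of discrete images, but in reality everything is represented implicitly by MLPs, by neural networks. So very briefly, each pixel location in the video is fed into these MLPs and mapped to a 2D coordinate in this atlas space. It's just a 2D coordinate between -1 and 1.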

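You have two such networks, one for the foreground and one for the background. Each location in this unified 2D space is fed into another MLP that predicts the RGB color at that location.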

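And there is another small MLP that predicts the visibility of each point, that is, how much it is observed from the background versus the foreground. This basically makes it possible to reconstruct the original color of the video at every location and to train the whole framework in a fully self-supervised manner, where the driving loss is the video reconstruction loss. There are other terms in the objective function to make sure the representation is interpretable, that structure is preserved, and that correspondences in the video are preserved. But basically, you can train these things end to end in a self-supervised manner. Okay, and here you can see the edit being mapped.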
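The rendering path described above can be sketched in a few lines. Every function here is a hand-written stand-in for a trained MLP, and the names (`map_fg`, `atlas_fg`, `alpha`, `render`) are hypothetical; only the structure (per-layer mapping to atlas coordinates, atlas colour lookup, visibility blending, reconstruction loss) follows the talk.

```python
def map_fg(x, y, t):
    """Stand-in for the foreground mapping MLP: (x, y, t) -> (u, v) in [-1, 1]."""
    return (max(-1.0, min(1.0, x - 0.1 * t)), y)


def map_bg(x, y, t):
    """Stand-in for the background mapping MLP."""
    return (x, y)


def atlas_fg(u, v):
    """Stand-in for the foreground atlas MLP: (u, v) -> RGB."""
    return (1.0, 0.0, 0.0)


def atlas_bg(u, v):
    """Stand-in for the background atlas MLP."""
    return (0.0, 0.0, 0.5 * (u + 1.0))


def alpha(x, y, t):
    """Stand-in for the visibility MLP: foreground opacity in [0, 1]."""
    return 1.0 if abs(x) < 0.5 and abs(y) < 0.5 else 0.0


def render(x, y, t, fg_atlas=atlas_fg, bg_atlas=atlas_bg):
    """Blend the two atlas colours using the predicted visibility."""
    a = alpha(x, y, t)
    fg = fg_atlas(*map_fg(x, y, t))
    bg = bg_atlas(*map_bg(x, y, t))
    return tuple(a * f + (1.0 - a) * b for f, b in zip(fg, bg))


def reconstruction_loss(samples):
    """Mean squared error over ((x, y, t), observed_rgb) samples."""
    total = 0.0
    for (x, y, t), rgb in samples:
        total += sum((p - q) ** 2 for p, q in zip(render(x, y, t), rgb))
    return total / len(samples)
```

Editing then amounts to swapping in a modified atlas, for example `render(0.0, 0.0, t, fg_atlas=lambda u, v: (0.0, 1.0, 0.0))`, which changes that foreground point in every frame `t` because the mappings stay fixed.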

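Okay, so on one hand we have these video foundation models. They require enormous training costs. They are restricted; we in the research community don't have much access to them. And they offer limited controllability. On the other hand, they can learn powerful, amazing space-time priors about our dynamic world.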

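On the other side of the spectrum, we have single-video models that need only a few GPUs to train. They are accessible, and they allow us to be more flexible and creative in how we represent video content. However, they have no prior knowledge about the world. So, as you may guess, I think the way we should approach video is actually to combine the best of both.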

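What do I mean by that? On the one hand, we want to have this flexibility and freedom in representing video content, and to gain explicit control over the content we synthesize. On the other hand, we want to fuse external knowledge learned by general-purpose models into this representation.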

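And this is not limited to video models: we can integrate external information from a set of foundation models that can give us motion priors, generative priors, and semantic priors.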

And my first attempt to do so was in Text2LIVE. In Text2LIVE, we wanted to support text-driven editing. And I think it was, to the best of my knowledge, the first method to demonstrate text-based editing for real-world videos. Again, this was ECCV '22, and the key idea there was to use a pre-trained neural atlas representation of the video as a video renderer. We take this representation, keep it fixed, and then replace the manual edits that we can perform on the atlas images with automatic edits described by text. To achieve that, we combined this representation with a pre-trained CLIP model, which back then allowed us to do this for the first time. And here you can see how we can perform localized and semantic editing on real-world videos without any real generative model. This was just using CLIP.

And again, I think that performing these localized semantic edits, and the type of edits that I showed you for moving dynamic content, is still a challenge even for big foundation models that are very powerful.

But again, with all due respect to CLIP and this approach, with the rise of text-to-image models we wanted to take this approach further and to think about how we can leverage stronger priors about the world. And I think one of the main challenges in pursuing this approach of combining external knowledge with these sophisticated video representations is that most foundation models are basically black boxes to us. We do not understand exactly the priors that they learn and how these priors are internally encoded. So this approach poses the challenge of how to distill learned priors from black boxes.

Basically, one of my research aims is to dive deep inside those foundation models and reveal more, to gain a better understanding of what they learn and of their internal representation. And if we can achieve that, then we can build much better algorithms on top of them.

So with the rise of text-to-image models, diffusion models like stable diffusion, I was really amazed by the ability of these models to capture these really complicated signals about our visual world. So just viewing these images, we can see that these models can learn priors about composition, about pose, about interactions between objects, appearance, and so on.

So I was focusing on this aim of taking text to image models way beyond what they are meant to do, way beyond just generating images from text.

And we had a line of works in the lab that introduced some of the early works in this space. So for example, in Plug and Play, we conditioned the generation not only on text, but also on a reference image. And the output image preserved the semantic layout of the original reference image.

In MultiDiffusion, we extended pre-trained text-to-image models to generate images at arbitrary resolution and also to receive region-based text controls as input, like you can see in these examples.

In the context of videos, I was thinking about how we can take these powerful priors that text-to-image models learn and extend them to video synthesis tasks. So at NeurIPS, we introduced SceneScape, which allows us not only to generate beautiful scenery but also to generate 3D-plausible walkthroughs inside those scenes. And behind those videos, there is actually a real 3D mesh representation of the scene that is being built.

And in TokenFlow, we showed how you can not only synthesize static scenes, but actually edit real-world dynamic scenes.

And I think, again, a huge body of work is doing that: adapting text-to-image models and expanding them in various ways. I think what's more unique in these works is that we insisted on keeping those text-to-image models fixed and strove to better understand the generation process and the internal representation, to make these black boxes more transparent and utilize our understanding of them. So I want to dive more deeply into some of the work. Let me discuss TokenFlow in more detail.

And again, our goal in this work was to perform this consistent video editing. And we started with this naive baseline of applying plug and play or a different method to edit each frame independently.

And as you can see, the content is really inconsistent. It's not just at the level of high-frequency flickering. The content really changes from one frame to the next, and there is really no reason to believe that the text-to-image model would give us something else. So we wanted to dive inside the model and understand how these inconsistencies are being represented inside the model.

So in order to do that, we take the original video frame by frame, we use some inversion technique to invert it back to the model, and then we can just extract some features from intermediate layers. And because those features are really high dimensional, we cannot make sense of them, so we use PCA to reduce them into three dimensions and visualize them as videos.

So here you can see the original video, and on the right-hand side you can see the PCA reductions of the feature tokens extracted across different layers of the UNet.

And what we can easily observe is that these PCA visualizations depict a shared and consistent representation. We can see that the consistency in RGB is mirrored by a similar consistency in feature space for this video.

So we wanted to look at this consistency in a more fine-grained manner. In order to do that, we looked at nearest neighbors. You take a feature at a certain position in one frame and just compute its nearest neighbors in all the rest of the frames. And what we saw is that those correspondences exhibit semantic and accurate matching across different frames, as you can see in these examples.

And you can compute this nearest neighbor field densely. So for each, if you are given two frames, you can take each feature in the source frame and compute its nearest neighbor in a target frame. And this will give rise to this dense nearest neighbor field, which we named token flow.
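A small sketch of such a dense nearest-neighbor field over feature tokens. Cosine similarity is an assumption here, since the talk only says "nearest neighbor", and the function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor_field(source_tokens, target_tokens):
    """For each source token, the index of its most similar target token."""
    return [
        max(range(len(target_tokens)), key=lambda j: cosine(s, target_tokens[j]))
        for s in source_tokens
    ]


# Two tokens per frame; the content swaps position between the frames.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.0, 2.0], [3.0, 0.1]]
flow = nearest_neighbor_field(source, target)
```

In practice the tokens would be the diffusion features extracted at each spatial position, and the field would be computed for every pair of frames of interest.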

So this provides us with semantic and accurate matching, but we wanted to see also to gain more information about what these features hold in terms of information about the frames. And in order to do that, we checked how well we can generate the target frame from the features provided from a source frame.

So this has been done by basically taking the source frame and the target frame, extracting their features, computing the token flow, and then just warping the source feature tokens.

And now we can intervene in the generation process of a target frame. We basically do DDIM inversion to get the initial latent, but then for each feature of the target frame we compute its nearest neighbor in the source frame, and we just swap the features. So we want to check how the generation of the target frame would be impacted by this swapping.

And we observed that the target frame can be synthesized accurately from the source features, which means that those features are interchangeable for the model. Okay, so what happens now? Again, we apply this per frame editing and we saw that the consistency breaks in RGB. What happens to the features?

Here you can see the feature visualization of this per-frame edited video, and you can see that the features depict the same inconsistencies as in RGB. So basically, consistent features give rise to consistent frames and vice versa. So our key idea in TokenFlow is that in order to achieve consistent editing, we want to achieve consistent features during the generation process.

And the way we suggested to do that is by enforcing the original token flow, the original feature matching of the original video, on the edited video. So you can see the edited video and the underlying features of that edited video. And just to summarize, this method works as follows: we take the original video, we do the DDIM inversion, we extract the features and compute the token flow, and then the generation process of the edited video is composed of two stages:

In the first stage, we sample some keyframes and jointly edit them with extended attention. This basically gives just rough global coherency between the frames. And then we extract the features of these edited frames and we propagate them to the rest of the frames using the original token flow of the original video. And we repeat this process. Here you can see some generation results and a comparison to several methods. Again, I think since we published it, this work has generated a great body of follow-up works. You've seen the nice work on editing XD slices today. So these matching and token flow correspondences hold between nearby frames, but indeed, when the frames are more distant from each other, those matches tend to be incorrect. So indeed our method would break for very complex motions where these correspondences would be difficult to achieve.
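The propagation stage of the two-stage procedure can be sketched as follows. This is a simplification under stated assumptions: a single keyframe, squared-L2 matching, and hypothetical names; the talk describes several jointly edited keyframes and does not specify the distance.

```python
def nn_index(token, candidates):
    """Index of the candidate token closest to `token` (squared L2)."""
    return min(
        range(len(candidates)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(token, candidates[j])),
    )


def propagate_edit(orig_frames, key_index, edited_key_tokens):
    """Fill every frame with edited keyframe tokens, routed by matches
    computed on the ORIGINAL video's features."""
    key_tokens = orig_frames[key_index]
    edited_frames = []
    for frame in orig_frames:
        flow = [nn_index(tok, key_tokens) for tok in frame]  # original token flow
        edited_frames.append([edited_key_tokens[j] for j in flow])
    return edited_frames


# Two frames, two positions each; the content swaps sides in frame 1.
orig_frames = [[[0.0], [1.0]], [[1.1], [0.1]]]
edited = propagate_edit(orig_frames, key_index=0, edited_key_tokens=["A", "B"])
```

Because the routing comes from the original video, the edited content follows the original motion even though the edit itself was applied only to the keyframe.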

Okay, so I guess I talked about how we can use text-to-image models beyond what they are meant to do, but the main limitation of just using text-to-image models is obvious. They only provide us with 2D information, and we don't have any motion priors. And if we really want to model our dynamic world, we need to know something about how objects move, how they tend to move in the real world. We want to know priors about actions, and that's something that text-to-image models cannot provide us. But again, I remind you all that we are in this amazing world where progress happens really fast, and now we have these powerful video models. And that really motivates us to explore their use and their understanding of motion in various applications. It could be generative tasks, but I don't think it has to be limited to that. Okay.

So that brings me to the last work that I want to talk about, space-time features for text-driven motion transfer, which was presented at the last CVPR. And the motivation there was, again, the film industry and the big manual and professional effort that goes into transferring motion from motion markers and so on to animation, this CGI type of animation. So in this work we wanted to achieve this computationally. Given an input driving video, like this dog jumping into a river, we want to be able to transfer its motion to dramatically different objects just using simple text prompts, like you can see here.

You can see that the big difference between this task and, let's say, what we've done in TokenFlow is that you must enable deviations from the shape of the original objects in order to fulfill the target edit. In order to transfer the motion of this dog to a dolphin, I must change the shape of the dog dramatically and adapt the fine-grained characteristics of the motion so that it will be plausible and natural for the target object. Maybe the dolphin moves its tail in a certain way and so on. So we really need to distill the essence of the motion from the driving video, but be flexible enough to allow this adaptation of the content in order to get a natural-looking edit.

And for that, we must have a prior about how things are moving in the real world.

So in this work, we used ZeroScope, one of the publicly available text-to-video models. You can see some samples from this model. So it's way, way far from state-of-the-art text-to-video models that keep being better and better. But this model still is able to learn valuable information about our dynamic world.

Okay, so just in the context of this work, we are no longer defining motion as pixel-level correspondences, because again we want to allow this flexibility and deviation from the shape of the object. So in our context, for this task, motion is defined as a sequence of semantic object-part positions. You can think about an object as being just a set of parts and their general progression throughout the entire video. And again, in terms of related work, I think none of the existing methods is designed to enable this big deviation in the structure of the objects.

So we followed TokenFlow and took a similar approach, and asked ourselves how space-time information is internally encoded in this text-to-video model. And again, we want to dive deep into the features and understand them better. So in this case, our input is a video, and we can directly invert it into the video model, again using an off-the-shelf DDIM inversion technique, and extract features. In this case, the features are four-dimensional: F is the number of frames, M by N are the spatial dimensions, and D is the number of channels. So here, instead of doing PCA visualizations and so on, we adapted a feature inversion technique. I guess many of you are familiar with it in the context of understanding pre-trained classifiers; it's a classic method.

The general idea is that we have some pre-trained and fixed model, we take our input, we feed it into the model, and extract some target features. In order to understand better what these features encode, we now solve an optimization task where we want to optimize for an image, in this case, such that when we feed it into the model, it will give rise to the same target features.

In many cases, of course, you need to somehow regularize this optimized image to avoid adversarial solutions and so on. In our case, our input is not an image, it's a video. We can feed it into the model, extract features, and now the goal is to optimize for a new video such that when we feed it into the text-to-video model, it will give rise to the same features.

So if we solve this optimization task, you can see the objective at the top and the original video on the left. You can see the feature inversion results from different seeds on the right. And you can see that we can accurately reconstruct the original video in terms of appearance, motion, and so on. And this is not what we want, because we want to allow much more flexibility in terms of both shape and appearance.

So how can we take these space-time features and build a descriptor out of them that will allow us this flexibility? Our first step towards removing this pixel-level dependency was to average out, or reduce, the spatial dimension. So we basically take these features for each frame and just average-pool them across the spatial dimensions. So for each frame we have a D-dimensional vector, and to describe the entire video we have an F by D tensor. And now we can repeat our feature inversion experiment with those spatially reduced features.
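The spatial reduction step is plain average pooling over the M by N grid. A minimal sketch with nested lists standing in for the F x M x N x D feature tensor; the function name is illustrative.

```python
def spatial_mean(features):
    """features[f][m][n] is a D-vector; returns an F x D list of means."""
    pooled = []
    for frame in features:
        vectors = [vec for row in frame for vec in row]
        dim = len(vectors[0])
        pooled.append([sum(v[d] for v in vectors) / len(vectors) for d in range(dim)])
    return pooled


# F=2 frames, a 1x2 spatial grid, D=2 channels.
features = [
    [[[1.0, 2.0], [3.0, 4.0]]],
    [[[0.0, 0.0], [2.0, 2.0]]],
]
descriptor = spatial_mean(features)

# Shuffling spatial positions within a frame leaves the descriptor unchanged.
shuffled = [
    [[[3.0, 4.0], [1.0, 2.0]]],
    [[[2.0, 2.0], [0.0, 0.0]]],
]
```

The shuffled example is why this step removes the pixel-level dependency: where a feature sits in the frame no longer matters, only what is present.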

And we were really surprised when we got this result to see that even though we averaged out the information across space, you can see from this inversion that we still preserve the pose and accurate movements of the woman in this video while allowing for more flexibility in the structure and appearance.

And just in terms of intuition, again, those features are really high-dimensional; they live in this high-dimensional space. So even though we average them spatially, this information can still be preserved.

Okay, so in the next step we said, okay, let's use these features for editing. We're given some video, the original video. We can extract those spatial mean features from the original video and just use them as guidance during the generation process of the edited video.

So you can see the equation up here. We basically want to optimize the latent such that when we denoise them with a target text, in this case a camel, we want the resulting features, the spatially reduced mean features, to match those of the original video. We do that through guidance, through the generation process, and you can see here the result.

So indeed, it allows for some flexibility. We can get different deviation in shape and in appearance, but still it looks kind of like a camel that was squished into the shape of the elephant. And so these features, although we average them, they still contain this information, too much information about the original objects in the video.

And that led us to basically build the pairwise SMM difference matrix. This idea is inspired by the entire line of work on self-similarity: we don't want to encode the absolute values of these features, but only how they relate to each other, all their pairwise relations throughout the video.

So basically we take these d-dimensional features for each frame and we build this F by F matrix in which each entry is just the difference between two spatially averaged features. You can think of it as encoding some motion in this semantic space of features, because we are just encoding all the pairwise differences and deltas between all the frames.

And now we want to again intervene in the generation process of the target video and use guidance, but this time we want to encourage the generated video to have the same pairwise SMM difference matrix. So this will be our objective function during the generation process of the edited video.
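As a rough illustration of the pairwise difference matrix described above, here is a minimal sketch in plain Python. The tiny nested lists stand in for real diffusion features, and the function names and the mean-squared form of the guidance loss are assumptions made for illustration, not the authors' implementation.

```python
# Toy features: F frames, each an H x W grid of d-dimensional vectors
# (hypothetical stand-ins for diffusion features of a real video).

def spatial_mean(frame):
    """Average the d-dimensional feature vectors over all spatial positions."""
    cells = [vec for row in frame for vec in row]
    d = len(cells[0])
    return [sum(v[k] for v in cells) / len(cells) for k in range(d)]

def pairwise_differences(video):
    """Return the F x F matrix whose (i, j) entry is mean_i - mean_j."""
    means = [spatial_mean(frame) for frame in video]
    return [[[a - b for a, b in zip(mi, mj)] for mj in means] for mi in means]

def guidance_loss(diff_a, diff_b):
    """Mean squared distance between two pairwise-difference matrices (assumed form)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(diff_a, diff_b):
        for ea, eb in zip(row_a, row_b):
            total += sum((x - y) ** 2 for x, y in zip(ea, eb))
            count += len(ea)
    return total / count

# Two frames, each a 1 x 2 grid with d = 2.
source = [[[[0.0, 0.0], [2.0, 2.0]]], [[[1.0, 3.0], [3.0, 5.0]]]]
# Same frame-to-frame change, but every feature shifted by a constant offset.
shifted = [[[[v + 10.0 for v in vec] for vec in row] for row in frame] for frame in source]

D_src = pairwise_differences(source)
D_shift = pairwise_differences(shifted)
```

Because only differences between frames are kept, the constant offset in `shifted` (a crude stand-in for a change in appearance) leaves the matrix unchanged, so the loss between `D_src` and `D_shift` is zero.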

And now you can see that we can get a much better looking camel and still preserve the motion in the original video.

Here you can see some more examples and I think, you know, if you look on transferring the motion from this kitten to bunnies, you understand that we really want to synthesize the bunnies here and they need to move in a realistic manner as bunnies tend to move. And that really, I think, exemplifies the need to have a motion prior. There are some more examples with more dramatic shape changes.

And some more examples on well-known videos. We also have a way of initializing the initial latent of the video. I'm not going to go into the details of that, but we use a combination of DDIM-inverted noise in the low frequencies with random noise in the high frequencies. This makes the method more robust and less sensitive to the exact seed that we are using in the optimization.

And again, in comparison, previous methods really tend to preserve pixel-level correspondences, and they're not able to fulfill the edit in a way that is flexible enough.

So how do we measure success here? In order to measure the fidelity to text, we can use CLIP score, but we wanted to somehow quantify how well we capture the motion of the original video. And again, we want to measure that under these dramatic shape changes, so we can no longer just measure pixel-level similarity between motions. So we suggested a different metric for that.

We suggested measuring it based on the similarity of two sets of unaligned trajectories. So you can take an off-the-shelf tracker and just apply it to the original video and to the edited video, and that provides us with these two sets of long-range trajectories. Now we can measure their similarity using the Chamfer distance, where for the distance between two tracks we just use the correlation between the tracks. So each trajectory in one set finds its nearest trajectory, a highly correlated trajectory, in the other set, and vice versa, and we sum those correlation values.

So here you can see the evaluation of different methods. On the y-axis we have the motion fidelity score, so higher is better. And on the x-axis we have the CLIP similarity score. So we want to be on the top right as much as we can.
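The Chamfer-style matching between two unaligned sets of tracks can be sketched as follows. This is only an illustration: measuring correlation on frame-to-frame displacements, and averaging rather than summing the best matches, are assumptions, and the hand-written tracks stand in for the output of a real point tracker.

```python
import math

def displacements(track):
    """Flatten the frame-to-frame displacements of a track [(x, y), ...]."""
    out = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        out += [x1 - x0, y1 - y0]
    return out

def correlation(track_a, track_b):
    """Normalized correlation (cosine similarity) of two displacement vectors."""
    a, b = displacements(track_a), displacements(track_b)
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0 else 0.0

def motion_fidelity(tracks_src, tracks_edit):
    """Chamfer-style score: every track finds its best match in the other set."""
    def one_way(set_a, set_b):
        return sum(max(correlation(a, b) for b in set_b) for a in set_a) / len(set_a)
    return 0.5 * (one_way(tracks_src, tracks_edit) + one_way(tracks_edit, tracks_src))

right = [(0, 0), (1, 0), (2, 0)]   # moves right
up = [(5, 5), (5, 6), (5, 7)]      # moves up, at a different location
left = [(9, 9), (8, 9), (7, 9)]    # moves left

same = motion_fidelity([right, up], [up, right])   # same motions, different order
opposite = motion_fidelity([right], [left])        # opposite motion
```

Since the score depends only on how points move and not on where they are, it does not need pixel-level alignment between the source and the edited video.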

And you can see that our method provides the best trade-off between good motion fidelity and fulfilling the text. TokenFlow, which preserves the original motion with high fidelity, gets a better motion fidelity score but pays in CLIP score because it cannot fulfill the edit fully.

SDEdit on the video model with low noise level is able to preserve the motion with high fidelity, but it cannot deviate much from the original content of the video. If we use SDEdit with high noise level, it's the vice versa. It's the opposite. We can fulfill the edit, but we can no longer preserve the motion.

And again, our method provides the better trade-off between these two ends. Of course, there are some limitations. So we are still bounded to the priors that can be provided to us from the text-to-video model. So if the target object cannot be fitted in terms of the video prior to the motion of the source object, we will get deviation and this weird motion happening in this example.

Okay, so just to summarize, I talked about the two ends of video generation, editing and synthesis, the video foundation models on one hand side, the single video models on the other hand side. And I hope I managed to convince you that this approach of combining the two is effective and powerful.

There is still tons of stuff to do in order to pursue this goal. We still need to understand these huge foundation models and devise new smart representations in order to fuse this information into them.

And there are lots of open questions on how to do that. I'd like to thank all my students and collaborators from Google and from Weizmann, and I'll continue to work towards breaking new grounds in video analysis and synthesis tasks, and hopefully, in the future, we will be able to generate even such professional effects using computational tools. Thank you.

So you mentioned that the open, like obviously open source video models, there's a huge gap in performance compared to what we can see.

What do you think there's still to be done that doesn't really require training? That would, sorry? That does not require training a model. So what do you think, for example, in text-to-image models, we saw so many papers on different ways of controlling images.

What do you think we can do in videos that would be similar?

Yeah, so I think the last work I showed takes a first step in this direction. I think that when you see these generation results, it is evident that these models learn some useful representation about motion, about how things evolve over time. And I think utilizing the internal representation of text-to-video models is still very underexplored. There's tons of stuff to do there that won't require heavy training in order to adapt them or to leverage them for various downstream tasks. It could be generative tasks, but not only.

I think there is great potential here: just as we all use pre-trained image features for various tasks, I think the way to go forward is also to use video features for downstream tasks. And in order to do that, I do think we need to understand these models much better. And I think there are also many open questions about how to gain control over video generation, what the correct interface will be, how intuitively you would even want to interact with videos.

I think it was discussed here at different talks that just using text is not sufficient in order to model our dynamic world. And we need to build new tools, new representation, new intuitive interfaces to interact with dynamic content, which is currently not there yet.

Hi, thank you for the interesting talk. My question is a bit of a follow-up to what you just highlighted, and more on the cost side of universal video models.

Since we are in the early stages, do we anticipate something like a two-orders-of-magnitude reduction in cost? It could be algorithmic, it could be on the architecture side. As you said, how we control these models might even be the factor that takes us two orders of magnitude further. So what does the future look like compared to where we are today?

I think it was also discussed here in previous talks, but I really think that one missing ingredient in order to push the boundaries of video foundation models is compression: how do you effectively represent or compress information across a video? Right now, I feel that the early stages of video foundation models are mostly doing the straightforward extensions that we can think of from the image domain. Building an effective video compressor whose latent space you can work in, I think that will be crucial for pushing the boundaries of video generation an order of magnitude further. And I believe we'll get there. It's just a matter of time.

Yeah. Hopefully. Thank you.

That was the end of part one of this pod on generative video.

In part two, we turn to exploring related topics in generative modeling and diffusion that we feel represent the most important work of 2024 that are also helpful building blocks for generative video. First, we have two more DeepMind researchers. You may be observing a pattern in how much work DeepMind is putting into multimodal generative AI.

Here is friend of the pod, Sander Dieleman, who works on both DeepMind's Veo video generation model and Imagen 3. Over the past year, Sander has developed an intuitive interpretation of diffusion, where traditionally diffusion models and autoregressive models are viewed as polar opposites, with different hardware utilization and inference paradigms.

Sander's perspective of diffusion as spectral autoregression in the frequency domain caught the community's imagination this fall. And for the first time, Sander expands upon this in his workshop.

So I'm going to talk about an intuitive look at how diffusion models work, and specifically in the context of modeling audiovisual data, sort of in the spirit of the theme of the workshop.

So it's roughly structured in four parts. So the first thing I want to do is explain how diffusion works from a geometric perspective, because I think this intuition is really valuable. And one thing that sort of bothers me about the diffusion literature is that it's, you know, as a beginner, it must be extremely confusing because there's so many different formalisms, so many different ways of saying the same thing. And I think this geometric perspective is sort of a nice way to tie it all together and link these things together.

And then in the second section, I'll try to highlight some other perspectives that I think are useful and maybe less well-known. And then in the third section, I want to talk about diffusion guidance, which is a very powerful tool that is also very easily explained with this geometric perspective. And then finally, I want to talk a little bit about Imagen 3 and Veo, which are the models that I've been working on recently. So first, let's talk about a geometric perspective on diffusion models.

So I don't need to repeat this probably, but we know that diffusion works with iterative denoising. So we have some data distribution that we're trying to model. In the examples I'll show, this will be an image distribution. We gradually add a bunch of noise and then we try to remove it. That's diffusion and diffusion models in a nutshell.

So I'm going to talk a little bit about this corruption process first. We first define a way to destroy all the information that is in the data distribution. And so I'm going to take an example here from the training data. I'm going to call that x naught, or x0. The index zero stands for a time step in the corruption process. So we treat this as kind of a temporal process, and at time step zero, we are in the data distribution.

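Then this process proceeds by adding small amounts of Gaussian noise, which I'm calling deltas here. So think of them as tiny bits of Gaussian noise. And we just do that repeatedly. We keep adding these small increments. And then at some time step t, we can look at what our image looks like, and it's going to be a noisy image. If we keep doing this indefinitely, then eventually this noisy image will look like just Gaussian noise, and we won't be able to see anything of the original image.

A really nice property of doing this with Gaussian noise is that if you have a lot of small Gaussian increments, you can add them together into one bigger Gaussian increment, which lets us simulate this process much more efficiently. This is also a key idea behind diffusion model training: for any time step t in the process, we can write xt as our clean data x0 plus a scaled version of a standard normal variable. The scaling factor σ(t) is what we'll call the noise schedule of the diffusion model. In practice, we make things slightly more complicated, but also slightly easier to handle, by not just adding noise at each step but also rescaling the input a little beforehand. So we introduce this extra scaling factor α(t), which also depends on the time step.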
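To make the corruption formula concrete, here is a small sketch in plain Python. It checks numerically that many small Gaussian increments behave like one larger Gaussian increment, and then applies the one-shot form xt = α(t)·x0 + σ(t)·ε. The specific values of α and σ and the three-pixel "image" are arbitrary illustrative choices, not a schedule from the talk.

```python
import math
import random

def corrupt(x0, alpha_t, sigma_t, rng):
    """One-shot corruption: x_t = alpha(t) * x_0 + sigma(t) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [alpha_t * x + sigma_t * e for x, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)

# Many small Gaussian increments add up to one larger Gaussian increment:
# n steps of standard deviation s have total standard deviation s * sqrt(n).
n_steps, step_std = 100, 0.1
totals = [sum(rng.gauss(0.0, step_std) for _ in range(n_steps)) for _ in range(5000)]
empirical_std = math.sqrt(sum(v * v for v in totals) / len(totals))
expected_std = step_std * math.sqrt(n_steps)

x0 = [0.5, -0.25, 1.0]  # a toy "image" with three pixels
xt, eps = corrupt(x0, alpha_t=0.8, sigma_t=0.6, rng=rng)
```

This is why training never has to simulate the chain step by step: any xt can be drawn directly from x0 with a single Gaussian sample.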

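And then the other change we're going to make is that we're not going to run this process indefinitely, because we don't have time. We're going to stop at some time step T, at which point the image we get is basically indistinguishable from Gaussian noise. But now the interesting part is the reverse process, right? How do we run this process in reverse? Because that's what will let us do generative modeling.

Again, this is going to be a gradual process where we add these deltas, but now these deltas are not just random Gaussian noise. Now these deltas actually require us to understand something about the data distribution, to know how to gradually remove the noise.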

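So I like to represent this geometrically. Before I go on, I do want to express a word of warning. It's a dangerous game, what I'm about to do here. Because in reality, this diffusion process happens in input space, right? In this case, in pixel space. If we think of image data as a vector space, then the vectors representing images are very high-dimensional, right? Because you have a lot of pixels, and each pixel has three color channels. These are all very high-dimensional vectors.

I'm going to represent these as two-dimensional vectors, because the screen only has two dimensions. That's dangerous because we know that drawing conclusions from low-dimensional pictures and generalizing them to high dimensions is risky. But in this case, I think looking at diffusion this way is actually very instructive. So what does a diffusion model actually do? We start from some data point x0, we add noise to it with the formula I showed you before, with some given amount of noise that depends on the time step t, and we end up at another point in space, xt, which is a noisy version of the image. What the diffusion model is going to do is try to predict x0 from xt. So we're at xt, and we're trying to predict where in space we need to move to get back to x0. Now this is a very difficult task.

The reason it's difficult is that the noise, of course, obscures some of the information in the original image x0. We can't really recover that information.

So what we end up predicting is not x0 itself, but the expectation of x0 given xt. We're predicting over all the possible x0, all the possible images, that could have given rise to this particular noisy observation at time step t. That is not a single image, it's a kind of region of input space. What the diffusion model is going to do is predict which direction we need to move in to get closer to that region of input space.

In fact, what we predict is the centroid of that region, and if you try to visualize that prediction, that centroid, it looks like a blurry image. The reason is that it is an average over many possible images x0, and the noise has obscured the high-frequency content of those images to some degree, but not the low-frequency content. So what we get is a blurry image.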

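So how does the diffusion sampling procedure work? We simply predict the direction we need to move in, and then we take a small step in that direction. You can compare this to the way we optimize neural networks, right? In optimization we also predict an update direction, but we only take a small step, because that prediction is really only valid locally. And then one thing we do here that we don't usually do in neural network optimization is that we add a little bit of noise back in.

I'm not going to go into the theoretical reason for this, but the intuitive reason why this might be a good idea is that we're doing a kind of "two steps forward, one step back," which makes us more robust to any systematic errors in predicting this direction. Because, of course, we're doing this repeatedly in a loop, and errors can accumulate. Not all sampling algorithms do this, of course, but some do.

Okay, and then we just repeat this procedure. So now we're at a new point in space, x(t-1), which looks like a slightly less noisy version of the image. We make a new prediction of x0. As you can see here, that prediction is going to be slightly different, right? Because now it points to a smaller region of input space, since less information is obscured by the noise. So we can make a better guess at the direction we need to move in. So we have this new prediction which, as I said, again reflects a smaller region of space that we need to move toward.

And then this process just repeats, and we add a bit of noise again. We do this for a while, until eventually we reach time step 0, and what we should get then is a sample from our data distribution. We probably won't end up at the original x0, right? But we will get a sample from the data distribution. So that's the geometric overview of the diffusion process. Everything I've explained so far assumes that the diffusion model predicts x0, right? It predicts the clean input.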

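Now if you look at the literature, that's usually not what people do. Instead, a very common approach is to predict this quantity ε from the formula I showed you before, which is basically just a standard Gaussian noise variable. But it turns out that once you have a trained model, you can always convert a prediction of x0 into a prediction of ε, and vice versa. That's because we have this linear relationship. xt is given, xt is our input, and we know that xt is linearly related to x0 and ε. So if we have one of those quantities, if we predict one of them, then we can convert it into a prediction of the other.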
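The conversion between the two parameterizations is just the corruption formula xt = α·x0 + σ·ε solved for one unknown or the other. A minimal sketch, with arbitrary illustrative numbers and hypothetical function names:

```python
def eps_from_x0(xt, x0_pred, alpha_t, sigma_t):
    """Solve x_t = alpha * x_0 + sigma * eps for eps."""
    return [(x - alpha_t * p) / sigma_t for x, p in zip(xt, x0_pred)]

def x0_from_eps(xt, eps_pred, alpha_t, sigma_t):
    """Solve x_t = alpha * x_0 + sigma * eps for x_0."""
    return [(x - sigma_t * e) / alpha_t for x, e in zip(xt, eps_pred)]

alpha_t, sigma_t = 0.8, 0.6
xt = [0.7, -0.1, 1.3]           # the noisy input the model sees
x0_pred = [0.5, -0.25, 1.0]     # a hypothetical model prediction of the clean input

eps_pred = eps_from_x0(xt, x0_pred, alpha_t, sigma_t)
x0_again = x0_from_eps(xt, eps_pred, alpha_t, sigma_t)  # round trip recovers x0_pred
```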

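People have taken this further, because you can predict not just x0 or ε, but actually any linear combination of the two, and that gives rise to things like v-prediction and the flow matching objective, which is ε - x0.

For the same reason, predicting x0 is also equivalent to predicting x(t-1). I mention this because it's the approach taken in the original denoising diffusion probabilistic models paper. The DDPM paper starts by saying, okay, we have this gradual corruption process and we're going to reverse it step by step. And then the natural thing to do is to predict the previous time step from the current one.

But as shown here, because of these linear relationships, you can show by solving a simple linear system that this is actually equivalent.

It is equivalent once you have a trained model. It is not equivalent during training, which is a bit tricky. During training, this choice of prediction target actually affects the relative importance of the noise levels in the loss aggregated across all noise levels. That in turn affects the perceptual quality of the output. That's why the choice of prediction target actually matters. But once you have a trained model, all of these prediction targets are essentially equivalent.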

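Okay, so to summarize, how does the diffusion training procedure work? We take each training example x0, we sample a time step t at random, and we corrupt x0 with the formula I showed you before to get xt. We don't have to run that process step by step, we can do it in one go. Then we use our model to make a prediction of x0 or ε, or however we decided to parameterize the model. And then to train the model, we just minimize the squared prediction error.

That's the MSE loss we all know and love. So it's a very stable training objective, which is nice. The reason we use MSE here, the intuitive reason, is that what we really want to recover is the expectation from before, right? We can't predict x0 exactly, but we want to recover the expectation of x0 given xt, and that is exactly the minimizer of the mean squared error.

And then for sampling, at each time step t, we predict x0 or ε from xt, and then take a small step in the predicted direction to partially denoise xt and get x(t-1). As I said, in some algorithms we add a bit of noise back in, and in some we don't. Okay, so that's the basis of this geometric perspective. Now I want to talk about some other perspectives that I think are useful and that may be less well known.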
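The sampling loop just summarized can be tried end to end on a toy problem where the ideal denoiser is known in closed form. The sketch below assumes one-dimensional data that is -1 or +1 with equal probability and a corruption xt = x0 + σ·ε (so α = 1); in that case the MSE-optimal prediction E[x0 | xt] is tanh(xt / σ²). The noise schedule and step count are arbitrary choices, and this variant is deterministic: it does not add noise back in after each step.

```python
import math
import random

def ideal_x0_prediction(xt, sigma):
    """E[x_0 | x_t] for data in {-1, +1} corrupted as x_t = x_0 + sigma * eps.
    This is what a perfectly trained MSE denoiser would output."""
    return math.tanh(xt / (sigma * sigma))

def sample(rng, sigma_max=10.0, sigma_min=0.01, n_steps=50):
    # Geometric noise schedule from sigma_max down to sigma_min, then 0.
    sigmas = [sigma_max * (sigma_min / sigma_max) ** (i / (n_steps - 1))
              for i in range(n_steps)] + [0.0]
    x = sigma_max * rng.gauss(0.0, 1.0)          # start from (almost) pure noise
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x0_hat = ideal_x0_prediction(x, sigma)   # predict the clean input
        eps_hat = (x - x0_hat) / sigma           # equivalent noise prediction
        x = x0_hat + sigma_next * eps_hat        # partially denoise to the next level
    return x

rng = random.Random(0)
samples = [sample(rng) for _ in range(200)]
near_data = sum(abs(abs(s) - 1.0) < 0.05 for s in samples) / len(samples)
```

At high noise the prediction sits near 0, the centroid of both data points (the "blurry image" of this toy world); as the noise level drops, it commits to one of the two points.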

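So I'm going to skip this one, the score matching perspective. It's also related to what I just explained, but I think that view is quite well known by now. So I'll talk about some other perspectives, and one of them is a way of looking at diffusion models as recurrent neural networks. If we think about the diffusion sampling loop, we are actually applying our trained denoising network repeatedly in sequence. If you unroll that computational graph, it really just looks like a much deeper neural network.

And then you can ask, why don't we just train it with backpropagation, as we usually do? The answer, of course, is that it's very, very deep. It typically has tens of thousands of layers. If your base diffusion denoiser has 100 layers and you have 100 time steps, that would be a 10,000-layer neural network. You can train that with backpropagation through time. People have done this. What you get is actually called a continuous normalizing flow.

But you can do something else, which is to train it with score matching. That way you don't have to backpropagate through the loop. You only backpropagate through a single denoising step. So this gives you a way of looking at diffusion models as a very deep recurrent network that is trained without backpropagation through time. A trick for training much deeper networks, so to speak. That's a perspective I really like.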

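A question I get asked a lot is: why do diffusion models work so well on images and video? Why have they come in and taken over generative modeling for all modalities except language? For images, there's an interesting spectral analysis we can do that sheds some light on this. We can compute the spectrum of an image.

We can summarize it in one dimension. If you plot that spectrum on a log-log plot, you get a power law. You get a straight line, which reflects that there is some power law going on. So the magnitude of a particular frequency in the image, or actually the power of a particular frequency, is proportional to some negative power of that frequency. Usually it's around -2. It seems to be a kind of law of nature. So for natural images, you get this line with a negative slope.

If you do the same for Gaussian noise and compute the spectrum, you should get a horizontal line, because in Gaussian noise all frequencies are equally present. Now the interesting thing happens when you superimpose these, which is what we do in diffusion models, right? We add noise to the image. If you add them together and then look at the spectrum, you get the hinge shape you see in the third plot.

If I increase the noise level, that is, if I increase the magnitude of the noise, the position of that hinge shifts. This will basically obscure more and more of the high-frequency content in the signal. But the low-frequency content, because it's more powerful, will stick out above this noise floor. So it will be preserved.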
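The position of the hinge follows from equating the two spectra. If the image power at frequency f is modeled as A·f^(-a) and the noise power is flat at σ², they cross at f* = (A / σ²)^(1/a). The sketch below evaluates this for a few noise levels; the unit signal scale and the exponent of 2 are illustrative assumptions in line with the "around -2" slope mentioned above.

```python
def hinge_frequency(noise_std, signal_scale=1.0, exponent=2.0):
    """Frequency at which signal power signal_scale * f**(-exponent)
    equals the flat noise power noise_std**2."""
    return (signal_scale / noise_std ** 2) ** (1.0 / exponent)

noise_levels = [0.01, 0.1, 1.0]
hinges = [hinge_frequency(s) for s in noise_levels]
# Frequencies above the hinge are buried under the noise floor;
# frequencies below it remain visible.
```

With these assumptions, each tenfold increase in noise standard deviation moves the hinge down by a factor of ten in frequency, which is the sense in which noise level and spatial frequency are tied together.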

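Based on this interpretation, I think it's fair to say that diffusion is a kind of approximate spectral autoregression. We generate images from low frequencies to high frequencies. That's true for images. It's also true for video. Audio follows this kind of power law as well, but clearly it doesn't necessarily apply to other modalities, such as language.

This isn't an idea I came up with. I was actually inspired by the paper on generative modeling with inverse heat dissipation by Severi Rissanen and colleagues, who did this kind of spectral analysis. And it's really important, because different noise levels actually correspond to different spatial frequencies in the image.

That means that when we reweight, rebalance these different noise levels in the training objective, we are actually saying which spatial frequencies matter to us, which spatial frequencies we want the model to really understand.

And that really means the diffusion loss is a kind of perceptual loss, right? Because we're emphasizing the frequencies the human visual system is sensitive to, and down-weighting the ones we're less sensitive to. I think that's an important reason why diffusion models rose so quickly for images, even if we didn't necessarily understand it at the time.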

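Okay, so another thing I want to do is contrast autoregression and diffusion, because these are the two main generative modeling paradigms that are popular today. We all know what autoregression is. You turn everything into a sequence and generate that sequence step by step. For diffusion, we use this noising process, this corruption process. So these are just two different ways of doing generative modeling, but they're both iterative. They both use many network calls to generate. So they both take this divide-and-conquer approach to generative modeling. And for video in particular, there's almost a continuum between these choices.

So we could model video purely autoregressively, which would require splitting the spatiotemporal volume into tokens, which would be 3D patches or voxels, and choosing some order in which to predict them, right? Because we need to turn it into a sequence. And then at the other end of the spectrum, we could model the whole cube, the whole volume, with diffusion.

But for video, there's a hybrid approach that seems to make a lot of sense, which is to treat the time dimension autoregressively and do diffusion over the spatial dimensions. That's what I'm showing in the middle here. All of these approaches have their own pros and cons. The autoregressive approach is nice because it makes it very easy to build multimodal models. So if we want to integrate this with large language models, that seems to be the way to go right now. But of course these sequences get very long, and that means we run into error accumulation when generating very long videos. Diffusion, on the other hand, is somewhat robust to that kind of error accumulation.

We have powerful ways of speeding up sampling, for example through distillation. And I believe, although guidance is not exclusive to diffusion and you can apply it to autoregressive models too, that at least in my experience it seems to work better in the diffusion setting. But of course, dealing with these very large spatiotemporal cubes, which have to be generated all at once, can be quite unwieldy and can put a lot of pressure on memory. So the hybrid approach can be seen as the best of both worlds in some sense, but it also has its pros and cons. For example, if you want to do distillation, then this hybrid approach, where you're autoregressive in time, may again lead to error accumulation problems. But of course one benefit of the hybrid approach is that we can reuse a lot of the work we've done for images, because essentially this is just a conditional image generation model, if you like.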

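I want to talk about a more general trend in generative modeling of perceptual signals, which is a gradual move away from measuring likelihood in input space. When I started working on generative modeling early on, we had models like PixelCNN and WaveNet. These were just likelihood-based models in input space.

But they didn't scale very well to larger inputs, because likelihood is actually a very poor perceptual metric, precisely because it puts too much emphasis on this high-frequency content that isn't very perceptually relevant. Of course, as we know, it works very well for language.

So the general trend for perceptual data, audiovisual data, is that for autoregressive models we started measuring likelihood in some latent space instead of input space. We first learn latents in order to abstract away a lot of entropy that is actually perceptually irrelevant. For example, the individual blades of grass in a grass texture don't need to be modeled by a likelihood-based model. We just need to be able to paint with grass texture.

Similarly, in diffusion models this is implicitly happening in a continuous way, because by reweighting the noise levels we are also implicitly down-weighting these less important frequencies. But of course in diffusion we now typically also use a latent space to amplify that effect. I want to say a bit more about this. Why does this make sense? Why is it a good idea?

Visual perception works differently at fine scales and at large scales. At very fine scales, our perception of texture abstracts away all of these small details to some degree. I could take an image of, say, a dog playing in a field, with sky above and grass below. I could take that image and modify it in Photoshop by shifting the grass texture one pixel to the left, show it to you again, and you wouldn't be able to see what happened. It's too subtle. So perception abstracts away these fine-grained details to some extent.

And there's really no need to model all of those possible variations. We just need to be able to produce one good sample. That's exactly what adversarial models give you. They don't really care about modeling all the modes of the distribution, but they can give you a few good samples.

So that's a very good match for fine-grained perception. At larger scales, we care more about covering all the possible modes. So there it makes sense to use something closer to a likelihood-based model or a diffusion-based model. Okay. Next I want to talk about diffusion guidance, which I call the cheat code for diffusion models, because it makes them punch above their weight in some sense.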

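Guidance lets us trade off sample quality and diversity, and in general it makes diffusion models work a lot better. So I want to revisit the geometric diagram I talked about before. Again, we have our clean input sample x0 from the data distribution, and then the noisy version at some time step t in the top right.

As before, our diffusion model will predict the direction we need to move in input space to move toward the data distribution.

But now we're going to do something slightly different. We're going to do classifier guidance, which means we take a classifier that is robust to noisy inputs and ask it to classify this noisy image. We take the gradient of the classifier's log-probability with respect to the input. That gives us a direction in input space in which we should move to make this image more likely to be classified as that particular class. So this kind of amplifies the aspects of the image that make it fit that particular class. It gives us a different direction in input space. And we can actually superimpose these directions: instead of following the direction we predicted with the diffusion model, we add them together and move in that combined direction.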

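I want to show you the Bayesian view behind this, which you can get simply by taking the formula for classifier guidance, which is expressed in terms of score functions, that is, gradients of log-likelihoods. You can actually undo the gradient operation and the log operation to see what is happening in terms of probabilities. That's what I'm showing on this slide. What we're actually doing is taking an unconditional base model, an unconditional diffusion model, adding this classifier p(c|x), and combining the two to get a conditional model. So we can actually turn an unconditional model into a conditional one after training. But the real power of classifier guidance lies in the fact that we introduce this scaling factor, called the guidance scale. So we scale the gradient direction we get from the classifier by some constant γ.

What this does is say: make it look like a bunny, really make this image look like a bunny. I want all the features in this image that make it look like a bunny. So our new update direction will be this one, and we'll end up at a different point in space by following this new direction.

Again, if we look at the Bayesian view by undoing the gradient and the log, what happens here is that the classifier probability is now raised to this power γ. And what does it mean when we raise a probability distribution to a power and renormalize? That's temperature tuning, right? That's something we do all the time with autoregressive models. We're really just tuning the temperature. But the interesting thing is that with guidance, the temperature tuning happens in the output space of the classifier, not in the input space of the generative model. Personally, I think that's why it's so powerful, because we're able to tune the temperature at a high level of abstraction. We're sharpening this classifier distribution, in a way.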
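The effect of raising a distribution to a power γ and renormalizing is easy to see on a small example. The three class probabilities below are made up for illustration; they stand in for a classifier's p(c|x) on some noisy input.

```python
def sharpen(probs, gamma):
    """Raise a distribution to the power gamma and renormalize."""
    powered = [p ** gamma for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

classifier_probs = [0.6, 0.3, 0.1]   # hypothetical p(c | x) over three classes
sharpened = sharpen(classifier_probs, gamma=3.0)
unchanged = sharpen(classifier_probs, gamma=1.0)
```

With γ = 3 the most likely class takes most of the probability mass, while γ = 1 leaves the distribution as it was. This is the same operation as lowering the sampling temperature, applied to the classifier rather than to the generative model.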

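Next, let's look at the classifier-free version of guidance. So again we do the same thing and look at our diffusion model prediction, but now we're actually going to make two predictions. We make one unconditional and one conditional prediction, and these two will be slightly different, because obviously the conditioning signal gives us some information about where in space we might need to move to sample from the distribution.

We can do this by training a conditional generative model and then dropping out the conditioning signal maybe 10% of the time. That gives us a model that can operate in both conditional and unconditional mode. So we have these two predictions, and we can look at the difference vector between the two, which I'm calling δ here. This difference vector is the direction we can move in to make the sample look more like it belongs to this class c.

And again, we can do the same thing as in classifier guidance, which is to amplify this difference by some scale factor γ, so that we really focus on the features of this class c. That gives us a new direction to move in during the diffusion sampling procedure. The rest of the sampling algorithm is the same as before, so we might add some noise here. Okay. Now let's look at the Bayesian view again.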
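The classifier-free guidance update described here amounts to one line: start from the unconditional prediction and move along the difference vector δ, scaled by γ. The sketch below uses the convention in which γ = 1 recovers the plain conditional prediction; other write-ups parameterize the scale differently, so treat the convention and the toy two-dimensional predictions as assumptions.

```python
def classifier_free_guidance(pred_uncond, pred_cond, gamma):
    """Move from the unconditional prediction along delta = cond - uncond, scaled by gamma."""
    return [u + gamma * (c - u) for u, c in zip(pred_uncond, pred_cond)]

# Hypothetical model outputs at one sampling step (two dimensions for readability).
pred_uncond = [0.0, 1.0]
pred_cond = [1.0, 1.5]

no_guidance = classifier_free_guidance(pred_uncond, pred_cond, gamma=1.0)
amplified = classifier_free_guidance(pred_uncond, pred_cond, gamma=3.0)
```

In a real sampler both predictions come from the same network, called once with the conditioning signal and once with it dropped, and the guided prediction replaces the model output in the usual update.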

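This is very powerful, because you're actually applying Bayes' rule twice, and this vector δ actually corresponds to a Bayesian classifier, right? So the classifier probability we had before is now replaced by the ratio of p(x|c) to p(x), but again raised to this power γ, so we're tuning this temperature again. That's really the essence of classifier-free guidance.

It is less susceptible than classifier guidance to adversarial directions in input space. I have these examples. They're quite old by now. These are from the GLIDE paper, which was one of OpenAI's first large-scale text-to-image models. But I really like them, because they show what the model looks like without guidance and with guidance, which is rare in modern papers. In modern papers we only see samples with guidance.

But here you can really see the effect it has on the results. You can also see the trade-off between diversity and quality, right? You can see that the images look noticeably less diverse, but the quality clearly goes up.

Another example from the same paper, with a slightly different prompt. Again, diversity is reduced to make the images look better overall. I think with a lot of the state-of-the-art models we see today, if you sampled from them without guidance, you'd be surprised at how bad they are. These models really rely on guidance to produce the incredible results we're seeing.

So if you remember anything, the main thing I'd like you to remember is that guidance is just two applications of Bayes' rule. Or is it? There's an interesting recent paper from the NVIDIA team in Finland that casts some doubt on this and gives some other intuitions about why guidance might actually work. I don't want to go into it here, but it's a really good paper. I recommend you take a look. It came out last month, so it's very new.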

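Okay, and then to wrap up my talk, I'll briefly talk about Imagen 3 and Veo, which are the text-to-image and text-to-video models we've been working on recently. Both were announced at Google I/O in May. Imagen 3 should be available soon. Veo is obviously a more complex model and may take longer, but hopefully you'll get to play with Imagen 3 soon.

I have some samples from the model here. It's a latent diffusion model, which is a bit different from our previous model series. You can see it does pretty well on both fine detail and large-scale structure. It's a very good text-to-image model. All of these samples are in the accompanying blog post on the DeepMind website.

Hopefully we'll also be able to share some more details about how it works internally soon. And finally, I also want to talk about Veo, which is our text-to-video model. This probably looks about how you'd expect. It's again a latent diffusion model. We have a text encoder that encodes the text prompt, and an optional encoder for conditioning on an image input for a frame.

Then the diffusion happens in latent space, and we have a decoder that turns that back into pixels, at resolutions up to 1080p and for relatively long durations.

And then I have, I don't know if this will play, but this is a sample reel you may have seen before. Okay, this is supposed to be moving, because it's a video. Okay, there we go. So this is just a showcase of samples from the Veo model. I don't know if the quality comes across very well, but it is generating video at high quality in 1080p.

Okay, so to wrap up, one thing I want to highlight is that almost everything I talked about today is on my blog. I have a series of blog posts about diffusion models and generative models in general where I try to build intuition, right? So it's not necessarily about the theory and being mathematically correct, it's about building intuition for these models and how they work in practice.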

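So most of what's on the slides here is spread across these different blog posts. Okay, I'll leave it there. Here's the link to my blog, along with my Twitter handle and email address. If you have any comments, suggestions, or questions after the talk, feel free to reach out, and I'm also happy to take questions now. Thank you.

- Yeah, I'd love to hear your thoughts on how the capabilities of these models will develop in the future.

- That's a bit of a vague question. Yeah, I mean, I think bigger and better. I think we're kind of early. I compare it to how language modeling has developed, where we're much further along in scaling. I'd say for video and images we're still pretty early. So I expect more big leaps.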

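I have a question about latent diffusion models. I haven't seen a mathematical description of them. Could you give us some intuition? If the input is fixed, say we do a diffusion model on x, that makes a lot of sense. You can add noise, as much noise as you want, and you can try to reverse it. For latents, it means you are training a neural network. Starting from the latents, are you running the same process while the network itself is being trained?

So typically it's a two-stage process. First we learn some latent space that basically compresses the input. Because one of the problems with generating very large images and very large videos is that it takes a lot of memory.

A key advantage of latent diffusion is that you can actually compress away a lot of redundancy while still getting a representation that is learnable, right? That's also where it differs from standard compression. You have standard compression algorithms like JPEG and H.264 and so on. Those really just focus on making things as small as possible. Here, we're trying to control a trade-off between how much we can compress while preserving output quality, and how learnable the resulting representation is. Because if you compress too aggressively, that can get difficult. If you did entropy coding on the latents or something like that, it would probably make learning a lot harder. So it's an interesting twist on the compression problem, because you have this trade-off. But typically it is a two-stage process. You first learn the latent space, then you freeze it, and then you train the diffusion model as usual, except that you extract this feature representation and operate on top of it.

Thank you. Great talk. I'd like to hear your thoughts on current metrics, what's missing, and how we can evaluate these things better, especially from the video generation perspective, but also for diffusion models in general.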

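I mostly have complaints and not many suggestions. It's hard, right? We don't have a lot of good metrics. We do a lot of eyeballing.

That's true for images too, but especially for video. Video is trickier than images, because for images it's easy to generate 200 samples, put them in a grid, and glance over them to get a rough idea of what your model is doing. For video that's a lot harder, because everything is moving, right? So it's hard to take a quick glance. You have to pay more attention to individual samples. And then for audio it's actually completely impossible, right? Because you have to listen to them one by one.

It's a very persistent problem, and I haven't seen any good solutions so far. Yes, we use the classic metrics, FID and FVD, but we also know they're flawed in various ways, and sometimes we can't trust them. But at least they're useful as warning signs, right? They can tell us when something is badly wrong, and that at least is helpful.

But definitely, if you want to have impact in this field, figuring out how to evaluate these things, especially computationally, without humans in the loop, is a very promising area.

Thank you. I wanted to ask whether you think using a model to predict human evaluations is a promising direction for evaluating these models.

Quite possibly, yes. I guess it depends somewhat on what your human evaluation data looks like. But I think it's a promising direction.

Have you tried scaling that up, training a model on a large amount of human evaluations, and then using it as a proxy in some sense, as a reward model?

Yeah, I'd say that's a worthwhile direction. One concern I have about it is that every metric, once it becomes a target, eventually ceases to be a good metric. So it'll be very interesting to see how that plays out here. I think we should be careful with it.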

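Thank you. Hi. We've seen that some diffusion models, when you ask them to generate something, always or often generate data that's very close to their training data. Do you have any thoughts on how to make them more creative or more general, further away from their training data?

I think the easiest way to solve that is to get more data. If you have an order of magnitude more data, then this becomes an order of magnitude less likely to happen.

I'm not denying that it's a problem. But I think what was very impressive when diffusion models first came on the scene was this compositional generalization they exhibit, right? So I think to some extent these models already show a lot of creativity in combining things in ways that weren't present in the training set. And I expect that ability to improve with more data.

Hey, man. Thanks for the great talk. With some of the video models that have been released, if you have things like water and waves flowing, you can watch as a human and see that the laws of physics aren't being followed strictly, the way you'd see them in real life. What do you think are some promising directions for making sure future video diffusion models follow the natural laws of physics more closely in that sense?

Scale is one. I think a lot of this behavior is emergent, and with more data and more capacity the models will learn to do this. But maybe in the short term we can improve this by curating data, or by building some physical priors into the model. Although we do have to be mindful of the bitter lesson here: it usually turns out to be better to let the model learn rather than to intervene too much. But yeah.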

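Our second DeepMind speaker in this part is Ben Poole, who works on inferring 3D structure using 2D priors, which you can see is a key component in upgrading something like Genie 1 (2D) to Genie 2 (3D). He also introduces the concept of neural radiance fields, or NeRFs, which are now very popular for 3D environment simulation and which of course also have implications for synthetic data in generative video.

Ben combines NeRFs with score distillation from diffusion to create DreamFusion and ReconFusion. Let's listen to his invited talk at the Structured Probabilistic Inference and Generative Modeling workshop hosted by Yoshua Bengio.

Yeah, thanks everyone for being here so early. Thanks so much to the workshop organizers for the invitation. Today I'm going to share some of our work on inferring 3D structure using 2D priors.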

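So ICML has been really fun, but people keep asking me why I'm doing 3D generation. We've seen some amazing progress in video generation models. With more data and more compute, we often see quality improve. And if you look at the 3D consistency in these video models, that has also been improving as we scale.

But the way we consume content isn't always just staring at a flat screen. We have amazing new AR, VR, and mixed reality headsets, and the kind of content we want to consume is often interactive. You see some really creative, interesting scene, and you want to move around in it and look at it from other angles. And it's not just about moving around in a VR headset. Often the most interesting way to explore worlds is to interact with them, whether that's in a video game or exploring on a mobile device.

Unfortunately, creating this kind of 3D content is really challenging. 3D modeling is super hard. I remember doing some of this in middle school and getting really frustrated that I couldn't create what seemed like the simplest objects. And even once you have these 3D models, you don't know how to interact with them, add them to a world, light them, rig them. All of these are extremely challenging and time-consuming problems.

It's also not just about creating things. Something I find really frustrating is that we've seen amazing AI capabilities in many different areas, but I see all of you sitting in front of me, and as a human I feel like I have a very innate sense of the 3D structure around me, of where objects are. I know my water bottle is here and I can grab it, but it's really challenging for AI systems to have that kind of spatial intelligence. So if we can make more progress on building 3D priors and understanding the 3D world, I think that could really impact where robotics goes.

Fortunately, we've also seen amazing progress in 3D reconstruction. This is an example from Zip-NeRF, which is a powerful NeRF-based method. You can capture an entire house and turn it into a 3D model that you can move around in and interact with. The quality and photorealism often exceed our best video models today.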

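How do these methods work? The idea is that we take the space in front of us and parameterize it as a 3D volume. At every point in this x, y, z space, we can use a neural network to map from a point in space to a density and a color. People are now exploring all kinds of different 3D representations, but the key idea is that you have a differentiable mapping from somewhere in space to a color, or the ability to query different points along a ray.

The way we train the parameters of these neural networks that represent the 3D world is that we can cast a ray into the scene from a known camera and evaluate our neural network at a bunch of points along that ray. That gives us colors and densities along the ray, which we can accumulate to get an RGB color. And the way we train these neural networks for 3D modeling is that we've collected a bunch of images, and we can see how well the images match the predictions of this neural network.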
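The accumulation of colors and densities along a ray can be sketched with the standard emission-absorption quadrature used in NeRF-style volume rendering. In the sketch below the densities and colors are hard-coded rather than produced by a neural network, the step size is constant, and the squared-error loss against a target pixel is a common choice rather than a detail taken from the talk.

```python
import math

def render_ray(densities, colors, step):
    """Accumulate per-sample colors along a ray, weighting each by how opaque
    it is and by how much light survives the samples in front of it."""
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for density, color in zip(densities, colors):
        alpha = 1.0 - math.exp(-density * step)   # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel

def photometric_loss(pixel, target):
    """Mean squared error between a rendered pixel and the observed pixel."""
    return sum((p - t) ** 2 for p, t in zip(pixel, target)) / len(pixel)

# Empty space first, then a dense red surface that hides a blue one behind it.
densities = [0.0, 50.0, 50.0]
colors = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
pixel = render_ray(densities, colors, step=1.0)
loss = photometric_loss(pixel, target=[1.0, 0.0, 0.0])
```

The zero-density sample contributes nothing even though it has a color, and the blue sample is occluded by the red surface in front of it. Every operation is differentiable, which is what allows the image loss to be pushed back into the network that predicts density and color.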

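I don't think people realize how data-hungry NeRFs are. If I want to capture a Lego bulldozer on a table, I can't just take one photo. I have to go out and collect a large number of photos around the object and see it from almost every viewpoint. Their ability to generalize to unseen regions is close to zero. It's really interpolating between known viewpoints. Once you do that, you can get high-quality 3D reconstructions that represent color and, to some degree, learn the 3D geometry depicted by the depth on the right.

So what if I didn't have my coffee this morning, got up early, and only took three photos, but I'd really like to see what this scene might look like in 3D? Well, here's an example of a state-of-the-art 3D reconstruction method on three-view reconstruction. You can see it matches the observed images well, but as I move away from those images, we get very inaccurate predictions of what the world might look like. If you think about building a robot to grab that Lego bulldozer, the depth map and the 3D geometry look very inaccurate. This isn't useful for any of those tasks.

In general, since we're at this structured probabilistic modeling workshop: what is the problem we're trying to solve? Well, we don't have access to the 3D world, or even to much ground-truth data about the 3D world. We only have shadows of the 3D world: the projection onto our eyes, or we take out a camera and see only a 2D image. But we'd like to understand this 3D world so we can reason about it. So we're really trying to solve this inference problem: okay, given a set of observations, what is the distribution over 3D worlds that could be out there?

There's usually a spectrum here, starting from reconstruction, where you've collected a ton of data, you know exactly what should be there, and you want to recreate it in the digital world. Then there's something a bit looser: maybe I have one picture and I just want to hallucinate plausible 3D content. What is a 3D scene that is consistent with that image?

Or maybe I don't have the bulldozer in front of me, but I want to create it for my game or visualize it. Maybe I just want to describe it with text. So we have many different ways of thinking about the observations we'd like to condition on, but there's a common goal: how do we create this 3D structure from these partial observations?

We've done some work along this spectrum. We started with text-to-3D in DreamFusion, and then few-view reconstruction in ReconFusion. And most recently we have some work, CAT3D, that lets us do 3D creation from text, to single image, to few-view reconstruction. Today I'll talk about each of these projects.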

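Okay, so why is 3D hard? I think I got into 3D not so much because I care about 3D and understanding the 3D world, but because it felt like a problem that data alone couldn't solve. In language generation, text generation, and image generation, we've seen amazing progress from collecting large datasets.

But as we saw earlier, getting ground-truth 3D models of the world is really hard. It's very expensive and involves a lot of human labor. But suppose we did that and collected a large dataset. Now what? How do we represent it? We have all these different 3D representations. We have splats, voxel grids, NeRFs. You have to pick one. And once you've picked one, you have to design an architecture that can scale as you grow the dataset.

But suppose you did that. This is a somewhat older example. Can people hear me? Okay, I'm fine. Great. So if you have a modestly sized 3D dataset and train a model on it, you can get decent 3D models. But most of our 3D models are just isolated objects, and it's hard to get the same level of realism that we get in image samples from state-of-the-art text-to-image models.

I think the real problem is that there's a huge gap here. I've been showing this for a while, and I think it's still very true: there's a huge gap between the 3D data we have access to and the visual world. I think that's largely because everyone in this room has a phone with a camera in their pocket. But not all of those cameras have depth sensors. And even if they do, when you take a photo you typically don't take photos that orbit the object, capturing all the different viewpoints you could imagine.

So the bet we made was: okay, maybe we can find some way where, instead of building explicit priors in 3D space, we build priors in 2D. If we have these priors in 2D, we now have a more complex problem to solve, because without a 3D prior we can't just do inference in 3D space. We need to think creatively about how to use these 2D priors for 3D generation.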

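The general inductive bias, or the way we're going to hack 2D priors into 3D, is that we're going to say: okay, what is a good 3D model of the world? As a human, I usually don't have the ability to know that the 3D world around me is super accurate and precise, but I can look at it from different angles. So the idea is that we take this 3D model that we're trying to learn, or do inference over, and render it from a bunch of novel viewpoints.

So what does it mean for this 3D model to be a good 3D model? Well, it just needs to look good. How do we measure how good it looks? Well, we look at the renderings and use a 2D prior to score this goodness. So here we have a bear playing a guitar. You might imagine that if it looks good from one viewpoint, that's probably not enough to make it a good 3D model. But if I look at this 3D model from every angle and it looks good from all of them, then maybe it's a good 3D model of a bear.

This opens up a lot of questions and research directions. What 2D prior, conditioned on what information? How do we actually measure goodness? I think there's been a lot of great work in probabilistic modeling on what it means for an image to look good, and I think we still don't have a really good sense of what that metric is, or how to optimize it across all the different kinds of probabilistic models.

Another big question is which viewpoints. For some objects I can just put a camera here, but depending on where the objects are in the scene, it can be really challenging to figure out where I want to evaluate how good the model is. I don't want to put the camera inside an object, for example. And then there's which 3D representation. These days we have a ton of choices: you can use splats, or NeRFs, or all these different things. The 3D representation you use may vary depending on the setting you care about. So who here doesn't know about diffusion models?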

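Oh, wow, great. So the general gist of diffusion models is that they're a way to model high-dimensional continuous distributions, where we pair a simple corruption process, in which we take data and add more and more noise until eventually we've degraded all of the structure that was present in the initial image, with a learned reversal of that process, which slowly brings more structure back into the data.

Diffusion models are great when what you care about is sampling. So you train on this large dataset of, say, 2D images, and you want to sample 2D images. But in 3D, we don't really care about sampling 2D images. What we actually want to do is go backwards and infer some kind of 3D structure. One way to think about this is that we're building a parameterized image: there are some parameters of a NeRF or a generative model that we can use to create an image, and then we want to evaluate how good that image is.

What we're missing here is a loss function that we can use to score these generations or renderings. If we had that loss function and it were differentiable, we could backpropagate to the image, and then from the image back to the parameters of the generative model.

The idea we came up with in DreamFusion revolves around probability density distillation, and we call it score distillation sampling. Another way to think about diffusion models is that they learn a sequence of marginal distributions, starting from clean data points and mapping to noisier and noisier data distributions. Those noisy distributions are typically simpler. They're smoother than the initial data density.

What we'd like to do is maybe pick out a single mode of this complex data distribution. So here you can see p(x) is the complex data density defined by the diffusion model. We just want to infer one mode of that distribution, and hopefully that mode is a sample that looks good.

We don't just do this at one noise level of the diffusion model; we can average across all of these different noise levels. That lets us learn a loss function that works for any differentiable image representation. And here, okay, although we don't have explicit access to the marginal distributions in the diffusion model, we do have access to the gradient of their log density, and that's all we need to evaluate this loss function. So in DreamFusion, we combined the score distillation loss with the 3D representation from NeRFs. If you want a peacock on a surfboard, you start from a randomly initialized NeRF and then iteratively optimize it with the score distillation loss. Over time, this builds a 3D model that looks good from all of these novel viewpoints. And in the end, after optimizing this 3D model, you hopefully have a high-quality 3D asset that you can use in different ways.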
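The mode-seeking behavior of a score-distillation-style update can be illustrated on a one-dimensional toy problem. Everything here is a simplifying assumption: the "3D model" is a single number θ, the "renderer" is the identity, the data distribution is two points at -1 and +1, and the diffusion model is replaced by its closed-form ideal denoiser. The weighting w(σ) = σ and the range of noise levels are arbitrary choices. It is a sketch of the idea, not the DreamFusion implementation.

```python
import math
import random

def ideal_eps_prediction(xt, sigma):
    """Noise prediction of an ideal denoiser for data in {-1, +1}
    corrupted as x_t = x_0 + sigma * eps."""
    x0_hat = math.tanh(xt / (sigma * sigma))
    return (xt - x0_hat) / sigma

def sds_gradient(theta, rng, batch=16):
    """Score-distillation-style gradient for the identity 'renderer' x = theta."""
    grad = 0.0
    for _ in range(batch):
        sigma = rng.uniform(0.05, 0.5)     # random noise level
        eps = rng.gauss(0.0, 1.0)
        xt = theta + sigma * eps           # add noise to the rendering
        # w(sigma) * (eps_hat - eps), with w(sigma) = sigma and d(render)/d(theta) = 1.
        grad += sigma * (ideal_eps_prediction(xt, sigma) - eps)
    return grad / batch

rng = random.Random(0)
theta = 0.5                                # the whole "scene" is one number here
for _ in range(500):
    theta -= 0.1 * sds_gradient(theta, rng)
```

Starting between the two data points, θ drifts to the nearer one and stays close to it instead of spreading over both, which matches the description above of picking out a single mode.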

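And what's cool is that we didn't need any 3D data to create these text-to-3D generations. Moreover, for many of these categories we may not have any 3D data at all. So if you had collected a 3D dataset, almost all of these text-to-3D generations would probably be out of distribution.

But the more I played with these text-to-3D systems, the more it felt like gambling: you come up with a text prompt, hit go, wait a while, and the result is bad. And then you do that again and again. It's not a very fun form of control, and it doesn't let you combine these 3D generations with real scene content. In particular, if I take a photo, I don't want to take that photo, describe it in text, and then feed that into a text-to-image model. I'd like a better way to combine this with real scene content.

So in the follow-up work, ReconFusion, we tried to generalize this approach from conditioning on text to conditioning on images. If we go back to our bulldozer example, this is the original 3D reconstruction of the bulldozer from three images. If we apply a method that uses a generative prior at these novel viewpoints, you can see that we can accurately recover novel views and good geometry from just three input images.

So how does this work? It's very similar to the earlier work, but we augment the 3D reconstruction pipeline with a model that, instead of relying on text to describe what this novel view should look like, relies on images. What should a novel view of the scene look like? We typically have one or a few images of the scene. So what this novel view looks like should be heavily influenced by what's already there in the other 2D pictures I've captured.

So what might this image look like? Our idea was to train a new diffusion model that is conditioned on a set of input images and their camera poses. Then, given some new target pose, we want to predict that novel view. So it's still an image diffusion model. It generates just one novel view: what should it look like from here? But now you're conditioning on one or more different inputs that you have of the scene. So I can take some of the images from our lazy three-shot capture and then turn that into a 3D model.

The architecture we use to condition on the set of input views is PixelNeRF, which is an image-based rendering method. This is inspired by earlier work such as NerfDiff and GeNVS. As input, you have a set of input images and their camera poses. You pass those through PixelNeRF to get some rendered features at the target camera pose. And then you combine that as input into a typical text-to-image latent diffusion model, where we replace the text features with CLIP embeddings of these different image inputs.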

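Unfortunately, unlike the text-to-image setting of our earlier DreamFusion work, here the data we need is more than just text and image annotations. We need sets of pictures along with their camera poses. So in this novel view synthesis setting, we're much more limited in the kinds of datasets we can use. Here we trained on a combined dataset that includes RealEstate10K for some real-world scenes, CO3D and MVImgNet, which are typically orbits around objects but in context, and then some synthetic renderings of 3D models from Objaverse.

If you apply these methods to real-world scenes, you find that you can get decent novel view synthesis predictions. But one problem is that these images are predicted independently. We're not modeling the correlations between views that you'd have with a 3D model. So we have to design a procedure that can turn these inconsistent 2D predictions into one consistent 3D model.

At the top you can see the result of the 3D reconstruction, and at the bottom the samples. So this is similar to DreamFusion, in that we don't know what the novel views should look like. We have to generate a bunch of samples or use an optimization procedure to resolve these inconsistencies.

I think the biggest problem with all of these 3D generation methods based on iterative optimization is that they're really slow. DreamFusion takes about half an hour to create a 3D asset. ReconFusion takes about an hour. What could you do in that hour? You could probably go out and take more photos of the thing you're trying to capture. So this doesn't seem like a good practical solution for improving efficiency and our ability to capture the 3D world. And if you're a robot, you don't want to wait an hour to reconstruct the 3D scene before moving your hand.

Another thing we didn't really show in ReconFusion is what happens if I put a single image into the system. One of the problems with the ReconFusion work is that in regions of uncertainty, where you don't know what should be in the scene, you typically get blur. That's because those independent image observations tend to conflict. And although we use these optimization procedures to resolve them, you're kind of fighting against this averaging of all the different ideas about what things might look like from this novel view. So here are some single-image results on a fire hydrant and a bench.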

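In our next work, CAT3D, which stands for Create Anything in 3D, the hope is that we can efficiently solve this problem of hallucinating new content. The main idea behind the method is to address the independence problem. We know that if I have a 3D model, or if we have a video of something, the frames are correlated. So we'd like to model those correlations rather than just resolving them after the fact in our 3D extraction process. So here are some example samples from ReconFusion, where we have three input images and then we have these independent output images. We can reconcile them, but it's a very slow process. The main idea of this work is to build on the huge success of video diffusion models to jointly model the correlations across multiple images.

The model we train takes a set of observed inputs. You can have one image or a set of images. You also have to have their camera poses. We encode the images into a latent space, and we encode the cameras using a ray representation, which roughly represents the corners of the image you're generating.

Then we also have a set of targets, the places where we'd like to create outputs. And it's not just one place. We'd like to create a whole set of output images, and we'd like those outputs to be correlated so that they could be realized from a single 3D model. So we have sets of observed and unobserved views. We also add a mask that tells the video model which ones are observed and which are unobserved. And then we get out not just one view but a whole set of views that we can decode back into images.

If we train this model on the same datasets as ReconFusion, we can see that it successfully learns the correlations between images, and the resulting samples are already quite consistent. But they're not perfectly consistent, and they don't give us the kind of interactivity we might want from a real 3D model.

So what do we do? We take a single input image or a set of input images. We generate samples with this multi-view latent diffusion model, which gives us a set of generated views. And then we just feed those into a 3D reconstruction pipeline. There are a few extra tricks needed, like using a robust loss, which allows these differing details to be reconciled across views. But the whole process now takes only about a minute instead of an hour.

Here are some examples comparing ReconFusion results to CAT3D results. All of these are conditioned on three images, and you can see that not only is it faster, but, especially if you look at the background, you get higher-quality hallucinations in the regions where you actually have uncertainty.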

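Interestingly, this works for few-image and single-image inputs, unlike the ReconFusion work. So here's a photo of Howie, a very cute golden retriever puppy. We can take this single photo and create a 3D model that we can render from novel viewpoints. If you only had an RGB image and a depth map and tried to warp them, you wouldn't have the same freedom to move around and visualize the scene. And this is my grandma's dog, Lola.

It doesn't just work on real-world images. You can also use a text-to-image model and cascade text-to-image generation with image-to-3D creation. So here's a factory robot precisely assembling intricate electronic components. Here's some kind of goblin, and some other creatures, and it even works on some small-scale scenes.

I think this is really fun, because for the past two years I've been getting tired of just staring at 360-degree spins of objects, but now we can turn these into truly interactive 3D models. I encourage everyone to check out the website and play with it. When you can interact with something, it feels fundamentally different from a video just playing in front of you.

There are a few important pieces to making this work. I mentioned the robust loss. I think a huge open question is how you decide where to put the cameras. That should really depend on what's in the scene. Right now we pick from a few discrete camera trajectories for different scenes, but it would be great to find ways to learn how to place the cameras.

The way you do the camera conditioning can affect the quality of the results in this multi-view latent diffusion model. And because we have a set-based representation rather than the ordered representation of video, we can come up with different, more efficient sampling strategies that create many frames in parallel.

So what's left? I think this may be a fun toy, but it's not practical yet. I think one of the biggest problems is that we're no longer just using large-scale text and image data, but need posed multi-view data. If you've used some of these state-of-the-art pose estimation systems, you'll know they often don't work when there are a lot of dynamics in the scene. So I think how to actually scale these methods and get accurate camera poses is still an open problem if we want to train camera-conditioned latent video diffusion models.

The recovered geometry is often inaccurate, even when the novel views look good. As I said, the camera trajectories don't take image content into account. And I think one of the biggest problems is that the scenes and inputs are typically assumed to be static. There's really no such thing as a static 3D video. If you look at a lot of these datasets, as I move through a scene, the shadow I cast changes as I move. That's present in these datasets too. So we really need to find models that can handle dynamic scenes as well as static ones.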

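Okay, so what are the takeaways from this work? I think in the CAT3D work we found that separating the 2D prior from the 3D inference process, by first sampling and then reconstructing, is a very flexible and efficient framework. Unfortunately, it does require a more expensive multi-view video model to generate those correlated samples.

What frustrates me is these optimization-based inference methods, like score distillation and variational score distillation. They can handle uncertainty. They let you express an uncertain prior over what these novel views should look like. But they're slower, lower quality, and more complicated. I still think there's a big gap between the sample quality you get when you naively sample from these models and what you get with optimization-based methods. So I think there's still a lot of room for innovation in inference methods.

Another thing I think people in 3D don't talk about is that these 3D models are useless. They often don't have good enough geometry. They don't have estimated material properties. If you convert them into meshes, the topology isn't really usable. They have baked-in lighting. So if we want these to be useful as assets in games, there's still a lot of work to do. Thank you all for your time, and if there's time left, I'm happy to take any questions.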

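I have a question. There was a recent work at ICML last year, MultiDiffusion. They generate panoramas. Would it be possible to combine that approach with 3D scenes? Because that is also about consistency, generating different scenes and so on.

Yeah, I think the MultiDiffusion work is really cool. It lets you take lower-dimensional models, or models with fewer pixels, and extend them. There, you can think of how they resolve the discrepancies between different frames as averaging during the diffusion process.

People have also tried to apply that to 3D, where you can think about resolving inconsistencies in 3D during the diffusion process, but that tends to be a bit trickier, because updating the 3D representation can take many optimization steps, and you can't solve for the update analytically and average it. But I think combining different kinds of conditioning, guidance, and sampling methods with the diffusion process, to enforce more consistency at sampling time, is really cool, as opposed to letting the model do whatever it wants at sampling time and then doing something to the samples afterwards.

Thank you very much. Does anyone have any questions?

Hi. I think looking across these three works, from DreamFusion to ReconFusion to CAT3D, the more of the multi-view modeling you do in 2D, the better the 3D you get. Could that be the conclusion? That you just do all the multi-view at once: you have a model that can generate 200 images in one go, so you don't need to do any optimization, right? So is that going to be the future? Is it the extreme, or maybe something in the middle?

Yeah, I think what's really frustrating and really broken about these models is this back-and-forth. How much structure do you put into the multi-view prior, the video prior, or the 2D prior? How much structure do you extract? I feel like we put all this effort into training these 2D priors. We train them until they're consistent in 3D. And only afterwards do we touch any 3D. Going from ReconFusion to CAT3D, we removed the 3D structure from the diffusion model, and it got better. So I think these existing methods don't really support real-time interactive generation. Maybe I'd like to start capturing a scene and have it fill in details, and iteratively update some 3D structure. And we don't really have methods that can do that right now. It feels a bit broken. Ideally, you could build a system that gives you 3D outputs and can learn from image-based data alone.

There's also some cool work, like Viewset Diffusion and RenderDiffusion, that tries to build diffusion models with 3D structure. But so far those methods haven't performed as well, because we don't really know how to scale them and train them on larger datasets the way we do with more pixel-based models. So I'm not sure where this is headed, but I hope we'll get more hybrids and find ways to bring 3D structure into 2D models.

Hello, thank you for the talk.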

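I think we always see this phenomenon in the literature, especially with text-to-3D generative models, where the assets you generate come out with these oversaturated colors. I was wondering if you have more intuition about what might be causing that?

Yeah, that's a great question. My initial intuition is that there are two reasons. One is that we have all these tricks for diffusion sampling and for these distillation methods that are built around guidance. So you're moving in the direction that better matches the text prompt and away from the unconditional prior.

If I have a frog, you know, frogs are usually green. So in the datasets we have, they may be biased toward green things. So putting a green background in there may lead to a higher-density mode, but that may not be a good sample. So I think a lot of this oversaturation and contrast problem comes from the loss function being broken and bad, combined with how crude it is to patch these problems with guidance. That's why we moved toward more sampling-based methods, because they actually work.

You don't have to worry about the artifacts that come out of optimization. But I think this is a gap. I wish we had a better explanation for why these artifacts show up, because you do see them with classifier-free guidance: when you turn it up, you get too much contrast and too much saturation, but nowhere near the degree we get with these optimization-based methods like score distillation. It's very frustrating. Yeah, thank you.

Believe it or not, there are other research labs at ICML besides DeepMind.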

我们继续在生成建模研讨会上，但转向Meta AI的Ricky T.Q. Chen，他提供了我们迄今为止听到的关于生成建模中流匹配技术的最易懂的解释。因此，我想简要介绍流匹配或这种流匹配的想法，并将其应用于从欧几里得到拉曼到离散领域的各种不同领域。我们最近发布了一篇论文，称为离散流匹配，基本上使用这种流匹配配方，我称之为，作为构建

离散领域上通用模型的一种方式。但实际上，似乎你可以使用这个配方，这种非常抽象的构建通用模型的方式，并将其应用于任何类型的领域。

让我们开始吧。这次演讲的目标是讨论几个不同的应用领域，但我还想说，在所有这些领域中，有一个非常简单的过程可以在这些领域上构建通用模型，它们共享相同的基本原则。

这个想法是，我认为更多的人熟悉欧几里得空间。因此，在左上角，我们从参数化某种速度开始。通过这个，如果我们根据该速度运输粒子，我们还会根据某种规律改变这些粒子的分布。在这里，我们可以，是的，我们将其应用于材料生成。我们还在离散领域中将其应用于代码生成和文本生成，以便将其扩展并查看会发生什么。

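So I'm going to put my last slide at the beginning. This is a quick preview of the flow matching recipe, at least for this talk.

I'm not sure I'll call it that in future talks, but here we just want to define conditional velocities: very simple velocities that, conditioned on x1, generate x1. So with these u_t and the transport equation on the left, if I start at x_t and advance to x_{t+h} by following this velocity, then I'm basically transporting a particle according to p_t given x1. In particular, p_t at time 1 will be a Dirac distribution centered at x1. So if that is the case, meaning I can create these conditional velocities that each generate just a single data sample, then the learning problem becomes: I just want to learn the expected velocity, and that's it.

Given an x_t, this expectation is over x1 sampled from some data distribution. It turns out that if you learn this expected velocity and follow the same transport rule you see in the top left, you can generate from the data distribution that the expected velocity was trained on.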

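This relationship comes from what is called the continuity equation. More precisely, it comes from the linearity of the divergence operator. I'll explain both in more detail later.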

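But to set the scene, let's start again with the Euclidean setting, which most people are familiar with. Say we have some samples x1 from the data distribution q, and we construct these conditional probability paths p_t(x | x1) so that they basically converge to a delta at x1. If you consider all of these probability paths and marginalize over the data distribution q, as written at the bottom, this marginal probability path will generate the data distribution at time 1, starting from some noise distribution that we choose.

In particular, time 1 is the data distribution. Now, suppose you only look at the velocities that generate these conditional probability paths; that is, if I follow those velocities, the samples I create are samples from p_t given x1. I then also want to marginalize the velocities.

By that I mean I take the conditional velocity and take its expectation under p_{1|t}. Here p_{1|t} is the conditional distribution of x1 given the current x at time t. You can think of it as a responsibility, if you are familiar with Gaussian mixture models. Basically, we weight the conditional velocities by p_{1|t}, and that is how we define the marginal velocity. This is what we will fit in that expectation.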

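It turns out there is a very simple explanation that links the marginal probability path and the marginal velocity. To get there, we need to ask how a velocity relates to the probability distribution of the samples we are transporting. That relationship is given by the continuity equation. At a point x, the continuity equation says that the rate of change of the probability at that point equals the negative divergence of the probability times the velocity field. So what is divergence? At that particular point, take a small region around it; people usually take a ball, or you can take a small hypercube. We look at everything flowing out of that region minus everything flowing in, which tells us how much mass the region is losing. If you shrink the region to be infinitesimal, the divergence is essentially the continuous measure of how much mass I am losing at the current x.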

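So the change in probability is just the negative of that: if I lose mass, the probability goes down. That is the very basic relationship between velocity and probability. If we assume the continuity equation, and we can find velocities that generate the conditional probability paths, then this is basically the three-line proof of flow matching in Euclidean space. The first line is just the definition: we define the marginal p_t as a mixture of the conditional p_t. Then we apply the continuity equation, assuming we have these conditional u_t in hand that generate the conditional probability paths.

The third line is just exchanging the divergence operator, which is a linear operator, with the integral. Because of this exchange, we can move the integral inside, and what we are left with is exactly the definition of u_t, the marginal velocity field. We already said this is the form of the continuity equation, so this marginal velocity field must generate the marginal probability path. It's a very simple three-line proof.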
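The three lines can be written out as follows. This is a reconstruction from the spoken description rather than a copy of the slide; the notation ($q$ for the data distribution, $p_t(x \mid x_1)$ and $u_t(x \mid x_1)$ for the conditional path and velocity) is assumed to match the talk.

```latex
\begin{aligned}
\partial_t p_t(x)
  &= \int \partial_t\, p_t(x \mid x_1)\, q(x_1)\, dx_1
  && \text{(definition: } p_t(x) = \textstyle\int p_t(x \mid x_1)\, q(x_1)\, dx_1\text{)} \\
  &= -\int \nabla \cdot \big( p_t(x \mid x_1)\, u_t(x \mid x_1) \big)\, q(x_1)\, dx_1
  && \text{(continuity equation for each conditional path)} \\
  &= -\nabla \cdot \Big( p_t(x) \underbrace{\int u_t(x \mid x_1)\,
       \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1}_{=\; u_t(x)} \Big)
  && \text{(linearity: swap divergence and integral)}
\end{aligned}
```

The weight $p_t(x \mid x_1)\, q(x_1) / p_t(x)$ inside the last integral is the posterior $p_{1|t}(x_1 \mid x)$, so the marginal velocity $u_t(x)$ is the posterior expectation of the conditional velocity, and the final line is the continuity equation for the marginal pair $(p_t, u_t)$.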

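We will see the same proof for other domains later in this talk.

But for completeness, here is what we do for flow matching. We take v_t, which is a neural network that models the velocity, and regress it directly onto the conditional u_t. The optimal solution of this regression is the marginal velocity. The way we prove it is by looking at the gradient: in expectation, the gradient is the same as that of the intractable flow matching loss, which regresses directly onto the marginal u_t.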
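As a concrete illustration of that regression, here is a minimal NumPy sketch of the conditional flow matching loss. It assumes the linear path x_t = (1 - t) x0 + t x1 with Gaussian x0, which is one common choice and not necessarily the one on the slide; `cfm_loss`, `t_max`, and the toy point-mass dataset are names and values invented for the example.

```python
import numpy as np

def cfm_loss(velocity_fn, x1, rng, t_max=0.99):
    """Monte Carlo estimate of the conditional flow matching loss for the
    linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I)."""
    x0 = rng.standard_normal(x1.shape)
    # t is capped below 1 so the closed-form velocity below stays finite.
    t = rng.uniform(0.0, t_max, size=(x1.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0                      # conditional velocity u_t(x_t | x1)
    pred = velocity_fn(x_t, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

rng = np.random.default_rng(0)
c = np.array([2.0, -1.0])
x1 = np.tile(c, (1024, 1))                # toy "dataset": every sample equals c

def exact_velocity(x, t):
    # With a single data point the marginal velocity has a closed form.
    return (c - x) / (1.0 - t)

def zero_velocity(x, t):
    return np.zeros_like(x)

loss_exact = cfm_loss(exact_velocity, x1, rng)
loss_zero = cfm_loss(zero_velocity, x1, rng)
print(loss_exact, loss_zero)
```

In this degenerate case the conditional and marginal velocities coincide, so the exact field drives the loss to essentially zero. With real data the loss does not reach zero, but its minimizer is still the marginal velocity, which is the point made in the talk.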

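OK, so I want to show some examples. This is flow matching applied to text-to-image generation. You have some text. People like looking at these pictures.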

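All right, let's start making things a bit more interesting. Many people in the audience like structure, so we won't stay in Euclidean space for long. We need to consider non-Euclidean structure, because the whole panel discussion was about that, so I won't go deeper into the motivation; I don't think I need to. There are many domains that impose structure, and we really want to model that kind of structure explicitly in our generative models.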

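In particular, consider Riemannian manifolds, where locally we have a first-order approximation of the manifold. Locally it is a Euclidean space, which people call the tangent plane, and there is a tangent plane at every location. Because this tangent plane is just Euclidean, we can define vector fields on it, and the Euclidean definitions of continuous flow matching, conditional flows, and so on extend naturally to the Riemannian setting.

In particular, we can replace the divergence in the continuity equation with the Riemannian divergence. I won't go into the details, but for completeness let's look at the continuity equation again. Say we have a u_t that satisfies the continuity equation with the Riemannian divergence, which I just substituted in on the second line. The integral here is not over the same volume: there is a different volume element, because it is a different manifold. There may also be boundary conditions. I'm sweeping all of that under the rug in this notation.

The only thing that matters is that I can still exchange this divergence and the integral, so there is still a u_t that is the expectation of the conditional u_t. There is a lot of complexity hidden here; it isn't even obvious how to find the conditional u_t. I won't go into much detail on that.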

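One thing we did at Meta is apply this to materials generation. The idea is that you want to generate a crystal or material, which is represented as an infinite set of atoms repeating in every direction. The way we represent it on a computer is with a single unit cell, and we assume this unit cell repeats in all directions. So the unit cell has periodic boundary conditions, and that is the manifold we are working on.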

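Now, a material is basically called stable when it can actually be synthesized in the real world. That is the most basic but also the most important property. It isn't clear how to check whether a material is truly stable; people usually rely on databases. What we did was apply Riemannian flow matching to generate materials given a set of stable materials, and see whether we could generate something that is both novel and stable.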

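What I want to say is that combining manifolds with equivariance is actually quite hard. For example, when people work with point clouds, you want to impose some kind of translation invariance. The way people do that is by removing the mean: you take the mean of the point cloud and subtract it. Then when you define the flow or the diffusion process, you take a zero-mean noise variable, so the path is always zero-mean.

With periodic boundary conditions there is no zero mean, because there is no origin; it's a periodic space. So we actually had to project the velocity field to make sure it does not move the mean. There are a few tricks here, which is a bit surprising, but I think it's worth thinking about, especially if you are interested in structure: how do you combine different kinds of manifolds with different kinds of equivariances? It's not actually an orthogonal combination.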

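But anyway, yes, we looked at this. A material is represented by three components. There is the unit cell, which is basically a deformation of 3D space. Then we have the fractional coordinates, which define the positions of the atoms within the unit cell. And then we have the atom type of each atom. Riemannian flow matching does well on the continuous variables, meaning the unit cell and the coordinates. We are basically state of the art, or at least we show an improvement over the diffusion baselines.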

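That result is conditioned on the atom types; we are only trying to find a stable configuration. But if we want de novo generation, meaning we generate an entirely new material from scratch, including the atom types, it's OK but not great. I'd say it is on par with some of our LLM baselines, without much of a difference, which I found a bit disappointing. So let me explain a little. Here we represent the atom types as continuous embeddings, so they are embedded into a continuous space, and then we just do regular Euclidean flow matching to learn the atom types.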

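Then at sampling time we just take the nearest neighbor and say, "OK, this is the atom type of my sample." But that didn't work very well. So we wanted to go further and ask: can we take what we learned from flow matching and apply it directly to discrete spaces? By that I mean we don't assume any kind of metric and we don't assume any kind of continuous space. We just have samples that take a number of different possible values.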

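So, what happens? Before I get into it: yes, the answer is that it can be done, and our work is definitely not the first. There is a very interesting paper by Campbell et al., also presented at ICML, on generative flows on discrete state spaces, which is basically a continuous-time Markov chain. It's really nice. A lot of what I'm about to discuss is also in that paper, perhaps from a slightly different angle.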

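OK, so the first question is: what is a velocity? I said we will just take the expected velocity, but what exactly is a velocity for transporting a discrete sample x_t? In the continuous case we add a small offset to the particle itself. In the discrete case we add a small offset to the probability. So we start with a Dirac distribution centered at x_t, we modify that distribution, and then we sample from it.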

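Sorry, this is just a preview. I will actually prove this is correct; in a few slides I'll show why this is the right thing to do at sampling time. But for now it is a preview to help understand what a velocity is. Oh, and people also call this the rate matrix of a continuous-time Markov chain.

The update here is independent across dimensions, or tokens as we call them. As long as we can find a velocity such that this new variable, x_{t+h}, follows the probability path, that is our definition of a velocity.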

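Again, here are some visualizations. In the continuous case we model some change for each dimension, each coordinate, and apply all the changes simultaneously, so we move the particle according to the vector field. In the discrete space we again have these axis-aligned moves. This is a visualization of a discrete space on a grid, but it's only a visualization; we don't impose any neighborhood information.

Basically, for each coordinate or token there is a set of possible values we can move to, and this u_t is essentially the change in probability mass from x_t to the other states. This will become clearer if you stare at the equation a bit longer. There are also some additional constraints. In the continuous case the velocity can be anything; we can move anywhere, since we assume a Euclidean space with no boundary and nothing in the way.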

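In this setting, however, u_t has to satisfy certain constraints, because we start from a PMF. The delta at x_t is itself a PMF: it is 1 at x_t and 0 everywhere else. Wherever the point is not x_t, it is already 0. So the only way for the right-hand side to be a valid PMF is to make sure u_t is non-negative whenever x^i is not equal to z^i. That is the first constraint.

The other constraint is that the normalization constant has to stay the same. The delta sums to 1, and we want the result to remain a valid PMF, so u_t has to sum to 0. That means at x_t itself u_t must be negative, and at anything that is not x_t it must be positive. That's it; I'm just pointing out that there are some extra constraints here.

So the velocity here is basically modeling the transport of probability from one state to another. If you are in the current state and have some positive mass, and I say that with 20% probability I want to move 20% of my current mass to some other state, then for each particle I just flip a coin: with 20% probability I move, and with 80% probability I stay. Something like that.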
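The two constraints and the coin-flip update can be sketched together. This is an illustrative NumPy version, not the paper's implementation: `u[i, z]` is assumed to hold the rate at which token `i` moves to value `z`, and the function names and the 20% example are made up to mirror the spoken description.

```python
import numpy as np

def check_velocity(u, x_t, atol=1e-9):
    """u: (N, D) rates for N tokens over D values; x_t: (N,) current values."""
    idx = np.arange(u.shape[0])
    off_diag = u.copy()
    off_diag[idx, x_t] = 0.0
    # Constraint 1: rates toward any other value are non-negative.
    assert np.all(off_diag >= 0.0), "negative rate toward another state"
    # Constraint 2: each row sums to zero, so total probability stays 1.
    assert np.allclose(u.sum(axis=1), 0.0, atol=atol), "rates must sum to zero"

def euler_step(x_t, u, h, rng):
    """Sample x_{t+h} token by token from delta_{x_t} + h * u."""
    check_velocity(u, x_t)
    n = u.shape[0]
    probs = h * u
    probs[np.arange(n), x_t] += 1.0            # add the delta at the current value
    assert np.all(probs >= 0.0), "step size h is too large for this velocity"
    cdf = np.cumsum(probs, axis=1)
    cdf[:, -1] = 1.0                           # guard against floating-point drift
    r = rng.random((n, 1))
    return (r < cdf).argmax(axis=1)            # first value whose CDF exceeds r

rng = np.random.default_rng(0)
n_tokens = 100_000
x_t = np.zeros(n_tokens, dtype=int)            # every token currently has value 0
u = np.tile(np.array([-0.2, 0.2, 0.0]), (n_tokens, 1))
x_next = euler_step(x_t, u, h=1.0, rng=rng)
moved_fraction = float(np.mean(x_next == 1))
print(moved_fraction)
```

Each token independently stays with probability 0.8 and jumps to value 1 with probability 0.2, so `moved_fraction` comes out close to 0.2. In practice `u` would come from a learned model and `h` would be a small step size.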

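OK, let's actually derive this. That was just a preview of the velocity, and it matches the continuous-time Markov chain equations. So let's start from the continuity equation again, but this time we'll try to define a discrete divergence. Divergence is again outflow minus inflow: the amount of mass moving out of this region, this node, minus the amount of mass moving in.

In the discrete case we can assign values to the edges of a graph. An edge just represents the amount of mass moving from one node to another. When we compute the divergence at a point, we take all of the outflow minus all of the inflow.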

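Say this v is our flux, or current. It is a scalar function defined on the edges of the graph. We want to sum over the whole domain; this is not axis-aligned, and the sum over z runs over the entire discrete space, where each coordinate has D possible values and there are N different discrete variables.

Here we make an assumption. We assume a single-token-change graph: v is only defined when x and z differ in exactly one token, one coordinate, and is zero otherwise. So we are explicitly assuming that the velocity, or this flux, can only modify one token at a time.

This is slightly different from previous treatments, which define these continuity equations or Kolmogorov equations in 1D and then explain why they carry over to high dimensions. Here we go somewhat in reverse. We start from a continuity equation defined in high dimensions and make an explicit assumption, because we don't want to model a complete graph over this very large space.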

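OK, so as in the continuous setting, we assume the flux takes the form v = p times u, the probability times the velocity. If we do some algebra, we get this equation for the divergence at a point x. So this is our definition of the velocity field. There are two ways to think about it. The continuity equation is a way of establishing the relationship between a velocity field and a probability path; here I'm treating it as a way of defining the velocity. Given the probability path, how do we define the velocity field?

So there are two things. The first is that if we have a u_t satisfying this discrete continuity equation, then the parallel Euler sampling I described a few slides ago is justified.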

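In particular, take a first-order approximation of p_t. p_t itself is just the expectation of an indicator function, so we write it as an expectation: we sample from p_t, and this is the term inside. If we do a bit of algebra and move all the little-o of h and higher-order terms outside, we get this expression, which is Euler sampling. For each coordinate we sample independently for that coordinate, and the remainder is just little-o of h.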

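OK, so by assuming this continuity equation we have justified Euler sampling. Moreover, under the same assumption we can show that the marginal velocity, the flow matching recipe, also holds. Everything is the same. The first line is the definition of p_t as a mixture of conditional p_t. The discrete continuity equation comes in, and then we exchange what is inside with the sum over x1, and the sum over x1 gives us the marginal velocity. So again, if we take conditional velocities that satisfy this condition and define the marginal velocity, then, given access to the marginal velocity, we can transport with Euler sampling. That is the claim.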

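That was a bit abstract, so let's actually define the discrete path and look at some special cases that are very strong in practice. First, we define the marginal p_t by marginalizing over a set of conditional p_t, where each conditional p_t is independent across dimensions.

Our framework handles arbitrary mixtures of m different distributions, but let's consider a special case with two. Conditioned on x0 and x1, each x^i has only two possible values: it is either x0 or x1, a mixture of the two with some probability kappa_t.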

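There is another special case that works very well, where x0 is simply the fully masked state. Given a sequence, I start from a fully masked sequence and slowly unmask each token, with some probability kappa_t, until it reaches the x1 distribution.

This works very well. It's a little unsatisfying that it works, but it is a very effective probability path, and many works use it. It is clearly related to masked language modeling. I also just want to mention that these works nicely generalize this construction to learning different components. OK.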
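Sampling from this masked conditional path is short enough to sketch. The snippet assumes the per-token mixture described above, (1 - kappa_t) on the mask token and kappa_t on the data token; `sample_masked_path`, the choice of `mask_id`, and the toy sequence are illustrative assumptions.

```python
import numpy as np

def sample_masked_path(x1, kappa_t, mask_id, rng):
    """Draw x_t from the conditional path: each token independently equals
    its x1 value with probability kappa_t and the mask token otherwise."""
    x1 = np.asarray(x1)
    reveal = rng.random(x1.shape) < kappa_t
    return np.where(reveal, x1, mask_id)

rng = np.random.default_rng(0)
MASK = -1                                   # hypothetical id for the mask token
x1 = np.array([5, 3, 9, 9, 2, 7])           # a toy "clean" token sequence

x_start = sample_masked_path(x1, 0.0, MASK, rng)   # kappa = 0: everything masked
x_mid = sample_masked_path(x1, 0.5, MASK, rng)     # roughly half the tokens revealed
x_end = sample_masked_path(x1, 1.0, MASK, rng)     # kappa = 1: equals x1
print(x_start, x_mid, x_end)
```

During training, a model would see `x_mid`-style inputs and predict the distribution of the clean tokens; the schedule that maps t to kappa_t is a design choice.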

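But we don't need to worry too much about the mask state. Let's first deal with this mixture of two deltas. One thing we show in the paper is that there is a conditional velocity, and if you marginalize it you get the marginal velocity currently highlighted in gray. And there are two of them.

This is similar to the continuous case, where you can always add a divergence-free term to the flux, so there are infinitely many velocities for the same probability path. It's very similar here. We can express the velocity in terms of either p_{1|t} or p_{0|t}.

The first makes sense when we want to solve forward in time: we predict x1 and then transport forward. The second makes sense when we want to move backward: we predict p_{0|t} and transport back.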

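The interesting thing is that both of these velocity fields satisfy the continuity equation, so we can always plug either one in. In particular, we can use any combination of the two with suitable coefficients. What we actually do in practice, or what works in practice, is take a fairly large step with the forward-time velocity field and then a small step backward. It's somewhat like a predictor-corrector step.

You spend some extra compute to change the variable itself without changing the marginal probability. In the masked setting this gives us more flexibility than pure unmasking, which is all the forward process can do: once you unmask, you can never mask again. But if you add the reverse-time velocity field, you gain the ability to re-mask and then unmask again. So it's a kind of corrector sampling.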

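OK, another interesting thing is that this is very similar to the continuous case we have been looking at. In the continuous case, if you take the probability path to be a convex combination of x0 and x1, people usually write the velocity in terms of a denoiser or epsilon-prediction. We have something very similar here: this is the denoiser, this is the epsilon-prediction, except that now it predicts an entire distribution rather than just an expectation.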

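Here are some examples of what we tried at scale. We trained a 1.7-billion-parameter model for text completion, trying to beat LLMs at their own game. We kind of failed, but we gave it an honest try. First, given a docstring, we generate code. This is a more trustworthy way to evaluate large language models, because it isn't based on something abstract: the code either runs and passes, or it fails.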

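Interestingly, because we have a fully non-autoregressive model, we can do any kind of code infilling. We don't need to condition on the left and generate the right; we can condition on anything we want. This is a property that goes beyond LLMs, but there is no good benchmark for it. On the right is just an illustration of the sampling process. This one uses the pure masked path, with no re-masking and no corrector: it just predicts the clean tokens and unmasks.

Here is one more sanity check, learning a language model on OpenWebText. Equation 9 is the masked path, which is what most people use, except that we also tune the scheduler and the corrector steps. Equation 10 is a combination of the mask distribution, the uniform distribution, and the delta at x1.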

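Here we see that equation 10, the mixture, actually does better than the pure masked setting, although possibly only in the low-NFE regime; at high NFE, masking is fine. The reason is that in the masked case, if you only unmask one token at a time, I think the sampling is correct. But if you sometimes unmask two tokens in parallel, you can get incorrect samples. So you really need to correct for that, or allow a uniform path in which the noisy states include other tokens, so that the model learns to correct itself later on.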

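Again, these are just the code generation numbers for the 1.7-billion-parameter model. We also tried discrete flow matching for image generation. So no discretized Gaussians and no metric at all; we just took the masked case and tried it. It seems to be slightly worse than continuous flow matching, but it gets to almost 3 FID, which is pretty good.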

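So yes, that's the end of the talk. As I said, this is the last slide; I just put it at the front as well. As long as we can define velocities that transport particles according to some p_t, where p_t arrives exactly at x1, then if we learn the expected, or marginal, velocity and plug it into the transport equation, we obtain the distribution that the marginal velocity was fitted to. That's it. That's the recipe, and it seems to work in the discrete setting too. Here are my research collaborators at Meta. Some of them are in the audience, so if you have any questions, feel free to ask them. OK, thank you.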

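Now that we understand the flow matching objective, we turn to its most famous application this year, Stable Diffusion 3, introduced in the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis". The paper also won a best paper award at ICML. Here is Patrick Esser, one of the original co-authors of Stable Diffusion, working under Robin Rombach, accepting the award.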

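Hi everyone, my name is Patrick Esser, and I'm presenting our work, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. This is all the result of a great team, who are also here and are all listed on this slide. Right now we are seeing huge hype around scaling, and it's really tempting to say that we can solve all of our problems simply by throwing enough money at them.

In fact, I would say the effectiveness of scaling is undeniable. Increasing model size, the number of training examples, and the overall compute we put into training consistently improves model performance. We first saw this in language models, but we observe similar trends in image generation, which is what we consider in our work.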

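But of course scaling isn't free. It greatly increases development costs, because we have to put all of those resources into training, and it also increases operating costs, because sampling becomes more demanding as models get larger. So in practice, to avoid burning money, we have to keep improving the efficiency of both training and sampling. Three key questions drove our work.

First, given that there are now quite a few different formulations of diffusion models and flow matching variants, which are the most efficient? The second question concerns architecture: for the text-to-image synthesis task we have in mind, we have to deal with two different modalities, and it isn't clear which architectural design choices work best here.

Finally, when we talk about scaling, we have to measure progress, and we derive scaling laws from simple metrics such as validation loss. But ultimately those are only proxies for the downstream performance we care about, such as sample quality, and we want to evaluate whether they are accurate proxies for those properties. So let's get started with flow matching and friends.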

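The common goal of these approaches is to learn a vector field, parameterized by a deep neural network, that generates a probability path between two distributions. In our concrete case, one of the distributions is a simple known distribution, such as a standard normal, and the other is the data distribution of images.

The common starting point for learning such a vector field is to define a so-called forward process, which simply defines a trajectory between two samples from the distributions we are looking at. From this process we can derive a tractable regression objective, the so-called conditional flow matching objective, which lets us recover a vector field that actually generates the path between the distributions.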

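The overall paradigm here is quite general, and for particular choices of the forward process we recover a wide range of existing formulations and variants, including EDM, DDPM, and so on. One of these variants is the rectified flow formulation, which is arguably the simplest choice you can make for the forward process, since it is just a linear interpolation between the two samples.

This also leads to a very clean conditional flow matching loss. Overall, that makes it very elegant and easy to work with. Also keep in mind that in this framework, sampling consists of integrating the learned vector field. For that reason, straight paths, like those of the forward process we define here, are very desirable: if the paths are straight, we can integrate them in a single step, which greatly improves sampling efficiency.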
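The remark about straight paths and one-step integration can be checked numerically. Below is a minimal sketch of Euler integration of a velocity field; `straight_velocity` is a hand-written stand-in whose paths are exactly straight, not a learned model, and all names are assumptions for the example.

```python
import numpy as np

def euler_sample(velocity_fn, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        x = x + h * velocity_fn(x, t)
    return x

target = np.array([3.0, -2.0])

def straight_velocity(x, t):
    # Every path is a straight line from its start point to `target`.
    return (target - x) / (1.0 - t)

x0 = np.random.default_rng(0).standard_normal(2)
one_step = euler_sample(straight_velocity, x0, n_steps=1)
eight_steps = euler_sample(straight_velocity, x0, n_steps=8)
print(one_step, eight_steps)
```

Because the trajectory has no curvature, one Euler step lands on the same endpoint as eight. A learned field is generally curved, so more steps reduce the discretization error.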

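That said, the conditional flow matching objective does not actually recover perfectly straight paths, even with a forward process like this one. But at least empirically, the resulting vector fields tend to have less curvature than those derived from diffusion formulations, which makes them more sample-efficient. They also have other nice theoretical properties, such as the straightening effect, which makes them promising candidates for further improving sampling efficiency.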

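So overall, this makes rectified flows an attractive candidate for efficient text-to-image synthesis. But before our study they had mostly been considered in benchmark settings, and it was still unclear how they would perform on harder tasks such as text-to-image synthesis. If we look at the conditional flow matching objective, it always involves a distribution over the timesteps of the trajectory.

The classic rectified flow formulation only considers a uniform distribution over timesteps. But since we use a Monte Carlo estimate of this objective during training, that choice does affect the optimization we perform.

If we look at the loss and at how we define the forward process, we quickly see that at the endpoints of the trajectory, t equal to zero and t equal to one, the optimal solution only involves estimating the means of the two distributions. So we would expect those to be comparatively easy tasks.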

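Because of this, we started considering different timestep distributions that put less weight on the endpoints of the trajectory and focus more on the interior. This is also a big part of why diffusion models have been successful at modeling images, because it lets us control exactly where along the trajectory we place the most weight. That way we can focus on the part of the trajectory where the perceptually relevant aspects of the image emerge. So, to find out whether the rectified flow formulation can benefit from this as well, we explored a variety of timestep distributions that let us shift that focus.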
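One distribution of this kind, and the one the talk singles out later, is the logit-normal. A minimal sketch of how such timesteps can be drawn, assuming location 0 and scale 1 purely for illustration (the talk does not state the values used):

```python
import numpy as np

def sample_logit_normal(n, rng, loc=0.0, scale=1.0):
    """Draw timesteps t in (0, 1) by passing a Gaussian sample through a sigmoid."""
    z = rng.normal(loc, scale, size=n)
    return 1.0 / (1.0 + np.exp(-z))

def middle_fraction(t):
    """Fraction of timesteps that fall in the interior interval (0.25, 0.75)."""
    return float(np.mean((t > 0.25) & (t < 0.75)))

rng = np.random.default_rng(0)
t_logit_normal = sample_logit_normal(100_000, rng)
t_uniform = rng.uniform(size=100_000)
print(middle_fraction(t_logit_normal), middle_fraction(t_uniform))
```

Compared with uniform sampling, a larger share of the logit-normal draws falls in the middle of the trajectory and fewer fall near the endpoints. Shifting `loc` moves the emphasis toward one end, and `scale` controls how concentrated the distribution is.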

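To find out which formulations are most efficient, we ran a study of 61 different variants.

We included many existing formulations, such as epsilon-prediction with a linear schedule, which is what Stable Diffusion used, and v-prediction with linear or cosine schedules. We also included EDM and the existing rectified flow formulation. Beyond those, we included variants, particularly of EDM and rectified flow, in which we varied the hyperparameters of the timestep distribution.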

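If we collect and evaluate these results, we see that the classic rectified flow formulation is indeed very strong in the few-step sampling regime. But compared with, for example, the eps/linear scheme, which emerged as a strong baseline, it actually performs worse when we sample with more steps.

In stark contrast, we see that by introducing one particular timestep distribution, the logit-normal distribution, we end up with a rectified flow variant that outperforms all existing variants, both with few sampling steps and with many. After this deep dive into the generative process, we also considered architectural choices for text-to-image synthesis.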

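The overall goal was to focus on transformer-based architectures, because they scale well. But it wasn't immediately clear how best to integrate the two modalities our task requires, text and image. So one of our ideas was the MMDiT block, which generally follows the design of the DiT block but uses two separate sets of weights for the two modalities.

To exchange information between the two modalities, we still use a full joint attention operation. Similar ideas are also used in vision-language models. What we observed in our comparisons is that this approach performs best. One of the comparisons was against a simpler approach, where we use the DiT architecture and simply concatenate the two modalities directly.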
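The idea of modality-specific weights with one shared attention can be sketched in NumPy. This is a deliberately simplified, single-head illustration of the described design, with no modulation, normalization, or MLP, and it is not the MMDiT implementation; all names, shapes, and the random weights are assumptions.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text_tokens, image_tokens, text_w, image_w):
    """Each modality uses its own q/k/v projections, then a single attention
    runs over the concatenated text + image sequence."""
    q = np.concatenate([text_tokens @ text_w["q"], image_tokens @ image_w["q"]])
    k = np.concatenate([text_tokens @ text_w["k"], image_tokens @ image_w["k"]])
    v = np.concatenate([text_tokens @ text_w["v"], image_tokens @ image_w["v"]])
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # every token attends to all tokens
    out = attn @ v
    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:]                # split back into the two streams

rng = np.random.default_rng(0)
d = 16
text = rng.standard_normal((7, d))      # 7 text tokens
image = rng.standard_normal((20, d))    # 20 image patch tokens
make_weights = lambda: {name: rng.standard_normal((d, d)) / np.sqrt(d) for name in "qkv"}
text_out, image_out = joint_attention(text, image, make_weights(), make_weights())
print(text_out.shape, image_out.shape)
```

The contrast with the simpler baseline mentioned above is that plain concatenation would pass both modalities through one shared set of projection weights.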

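We also considered UViT and DiT variants that use cross-attention to incorporate the text conditioning, since that has been applied very successfully in UNet-based architectures. But overall we saw that this multimodal design gives the best performance. So, having settled on an efficient formulation and architecture, it was time to scale.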

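To get a clean signal while scaling, we evaluate the validation loss at fixed timesteps. Comparing such a metric only makes sense if we stay within a single formulation. But when we are in that situation, it gives a very clean signal, and it is also a very efficient way to evaluate models and derive scaling laws.

Ultimately, though, this validation loss can only serve as a proxy for performance, because what we really care about is human preference, prompt following, sample quality, and so on.

At that point it was still unclear whether we could rely on validation loss alone as an accurate proxy for those downstream measures. There has been more work on this in the language domain, but not for images.

To answer this question, we ran a scaling study and evaluated the correlation between validation loss and both automatic image evaluation metrics, such as GenEval, and human preference scores. Our results show that the improvements predicted by the scaling laws and the validation loss do translate into quality improvements in text-to-image synthesis.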

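We also saw similar results in other modalities, such as video synthesis. Overall, this gives us confidence that further scaling will indeed improve the content creation capabilities of generative models. During model development and scaling we also learned a few additional lessons, and I'll quickly go through some of them. One of the problems that shows up with scaling is training instability. Here, learning from existing work was very helpful. One thing we found particularly helpful is QK-normalization, which stabilizes the training process.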
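A rough picture of why normalizing queries and keys helps: it bounds the attention logits no matter how large the activations grow. The sketch below uses a plain RMS normalization without learnable gains, which is an assumption; the talk does not specify the exact variant.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    """Scale each vector so its root-mean-square entry is about 1."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def attention_logits(q, k, qk_norm):
    if qk_norm:
        q, k = rms_normalize(q), rms_normalize(k)
    return q @ k.T / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
d = 16
# Simulate activations that have drifted to a large scale during training.
q = 100.0 * rng.standard_normal((8, d))
k = 100.0 * rng.standard_normal((8, d))

raw_max = float(np.abs(attention_logits(q, k, qk_norm=False)).max())
normed_max = float(np.abs(attention_logits(q, k, qk_norm=True)).max())
print(raw_max, normed_max)
```

After normalization each vector has norm at most sqrt(d), so every logit is bounded by sqrt(d) in magnitude, whereas the raw logits grow with the square of the activation scale and can saturate the softmax.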

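Another point I want to mention quickly: we have repeated the story that scaling improves performance, but it's equally important to note that following it blindly quickly becomes inefficient. One example is the timestep distribution again: we actually have to adapt it to different resolutions. If we don't, we lose a lot of performance. You might say we could simply scale up the base model further, but the price you would pay for that is enormous, whereas if you address the problem properly the cost is much smaller.

A similar result concerns alignment with human preferences, which provides a very fast, cheap, and effective boost in preference scores. With all of this, we obtained a high-quality model that works well across different resolutions and aspect ratios, with good prompt understanding and spelling abilities.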

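This is also reflected in human evaluations against other existing models. With that, goodbye, and thanks for your attention.

One of the most underrated applications of diffusion is speech synthesis. There was also excellent work on speech at ICML this year, and with the rise of ChatGPT's voice mode there is a lot of demand for learning the basic problems and techniques.

Here we briefly present two oral talks on speech that we want to highlight: "NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models" by Ju et al., and "Speech Self-Supervised Learning Using Diffusion Model Synthetic Data" by Gao et al.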

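Hello everyone, I'm Zeqian Ju from the University of Science and Technology of China. Today I'm excited to share recent progress in zero-shot TTS: NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. Let's start with a practical example that illustrates the difference between zero-shot speech synthesis and traditional multi-speaker synthesis.

In the traditional multi-speaker TTS setting, a user might ask the model to read a transcript in the style of a particular speaker, say speaker number one. That means the model should imitate that speaker's voice characteristics.

Here, speaker number one has to be included in the training data. A major limitation of this approach is that it cannot scale to unseen speakers.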

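In contrast, zero-shot TTS offers a more flexible solution. The user can provide an audio clip as a reference to guide generation. For example, suppose the user submits a clip of her voice like this: "Ocean engineering is especially authoritative." Even though this reference is short and was never seen in training, our powerful zero-shot TTS system can generate speech that sounds similar.

"In the past 10 years, Kinsale has walked with me wherever science has called."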

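Zero-shot TTS is changing how we think about speech synthesis. These advanced models are trained on large and diverse datasets that capture the nuances of many speakers and varied acoustic environments. As a result, the model can use the knowledge acquired in training to generalize to a specific speaker at inference time through prompt-based generation.

The key to zero-shot TTS is scaling. Traditional TTS systems rely on clean data recorded in studios, and those datasets usually contain fewer than 1,000 hours of recorded speech. Today's zero-shot TTS systems take advantage of the breadth of the internet and use large-scale data from the web. The TTS datasets used in this approach total more than 60,000 hours of speech.

The acoustic models have scaled significantly as well. Models started with fewer than 50 million parameters, while current zero-shot TTS systems have grown to between 300 million and 1 billion parameters.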

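This scaling has also driven a shift in data representation. Earlier TTS systems typically relied on representations based on human priors, such as mel-spectrograms. In contrast, zero-shot speech synthesis today uses data-driven representations, such as those derived from codecs. Here is an example codec: it uses residual vector quantizers to encode the signal into multiple levels of representation.

Although previous systems have achieved great success, they still fall short in speech similarity, speech quality, and prosody. This limitation stems from the complexity of speech. A short speech clip, for example, looks very simple, yet it carries rich information, including timbre, content, prosody, the recording environment, and more. This information is essential to the overall naturalness of speech.

Motivated by this, we emphasize the importance of factorization, because modeling complex, entangled information such as the raw waveform or the mel-spectrogram is difficult. Factorization is also non-trivial, because the RVQ structure does not effectively disentangle information across RVQ levels.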

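NaturalSpeech 3 applies factorization to both data representation and speech generation. For data representation, we use a factorized codec that decomposes the speech signal into different speech attributes while ensuring high-quality reconstruction.

For speech generation, we use a factorized diffusion model. This is a unified diffusion framework that generates each speech attribute hierarchically in its own subspace.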

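For FACodec, we consider four speech attributes: timbre, prosody, content, and acoustic details. We first apply a timbre extractor to obtain a global timbre vector. We then apply three factorized vector quantizers to represent the speech attributes in their respective subspaces.

To disentangle the information better, we introduce the following techniques: an information bottleneck, which limits the representational capacity of each token; supervision, to make sure the intended attribute is included; gradient reversal, to remove redundant information; and detail dropout, to remove unnecessary information from the detail codes.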

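For the factorized diffusion model, we apply discrete diffusion in each subspace to generate the speech attributes in order: duration, prosody, content, and acoustic details. Timbre does not need to be predicted, because this global vector can be obtained from the prompt audio.

In the forward process, we randomly mask some of the tokens in the sequence. In the reverse process, the model gradually learns to recover the tokens, guided by the context and the conditions. To facilitate in-context learning, we add a speech attribute prompt as a prefix to the sequence.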

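This prompt acts as a condition and stays unchanged throughout the diffusion process. In the zero-shot TTS scenario, these speech attribute prompts all come from the same audio. As a by-product, this prompting mechanism also gives good controllability, because we can pick different speech attributes from different sources and customize the output speech to meet specific requirements.

We evaluated zero-shot TTS capability on LibriSpeech and an emotional TTS dataset in terms of similarity, robustness, and overall quality. The results are impressive: NaturalSpeech 3 not only outperforms strong baselines but also reaches human-level naturalness.

We also tested the reconstruction ability of FACodec on the LibriSpeech test set against strong codec baselines. The results show that FACodec can reconstruct speech with high fidelity from these disentangled speech attributes. Here are some demos.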

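The first row is a three-second prompt cut at random from a full utterance. The second row is the NaturalSpeech 3 output for case one. "The standard requires holding another oil cup." So that is the prompt, and NaturalSpeech 3 can use this three-second prompt to generate a sentence in a similar voice.

"The average operating cost of each lamp is 22 cents per year, and each lamplighter looks after an average of 17 lamps." For case two, this is the prompt: "Isn't it clear that the number of pencils left is as many as..." And this is the output: "There are only four stationers in town, every household produces pencil shavings, and they pay high prices for copies."

NaturalSpeech 3 can also produce emotional TTS in a zero-shot way by prompting with emotional speech. If you prompt NaturalSpeech 3 with sad audio like this: "The dog sat at the door." NaturalSpeech 3 can generate sad speech like this:

"Why do the lotuses in the water wither?" If you prompt the model with calm audio like this: "The dog sat at the door." the NaturalSpeech 3 output will be: "Why do the lotuses in the water wither?" Third, if you prompt with disgusted speech like this: "The dog sat at the door." the output will sound like this: "Why do the lotuses in the water wither?"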

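Our model can also manipulate attributes by manipulating the corresponding speech prompt. Here is a demo. The first column is the original setting, where the duration prompt and the other prompts come from the same utterance, which is the zero-shot TTS scenario.

"Did she enjoy the experience?" The other prompts are the same: "Did she enjoy the experience?" The generated speech sounds like this: "The examination and testimony of the experts led the commission to conclude that five shots may have been fired." If we just slow down the duration prompt, the prompt sounds like this: "Did she enjoy the experience?" And the generated speech sounds like this:

"The examination and testimony of the experts led the commission to conclude that five shots may have been fired." If we just speed up the prompt, it sounds like this: "Did she enjoy the experience?" And the generated speech sounds like this: "The examination and testimony of the experts led the commission to conclude that five shots may have been fired."

We can also derive the duration prompt from a different audio clip. This manipulates only the duration attribute without affecting the others, because the other prompts stay the same. "The dog sat at the door." The generated speech will have this pace: "The examination and testimony of the experts led the commission to conclude that five shots may have been fired."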

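If you are interested in our work, please scan the QR code for more samples. That's all for my talk. Thank you, everyone.

Hello everyone, I'm Yang Zhang from IBM Research. Today I will present our paper, Speech Self-Supervised Learning Using Diffusion Model Synthetic Data, which is joint work with UIUC and UCSB.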

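Let me start with some brief background on speech self-supervised learning, or speech SSL. Speech SSL, like SSL in other domains, assumes we have a large unlabeled corpus. We can use this corpus to pre-train a speech representation network, which can then be fine-tuned on a small labeled corpus for downstream tasks.

The key to the success of speech SSL is that we really do need to assume a large unlabeled corpus. In most cases the pre-training dataset should contain at least 1,000 hours. However, in many cases obtaining a dataset of that scale is not as easy as it seems.

Here I'm borrowing a figure from a recent research effort that collected speech datasets for more than a thousand languages. The horizontal axis shows the languages and the vertical axis shows the number of hours collected for each. You can see that for most languages, the number of hours is below 1,000. This means that in many cases obtaining a large pre-training dataset is simply not feasible.

So when pre-training data is limited, it becomes crucial to maximize the information extracted from the limited dataset. We therefore ask the following research questions: do existing SSL techniques extract enough information from a limited pre-training dataset? And can we extract further information that existing SSL techniques may overlook?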

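So in this paper we propose DiffS4L, a speech self-supervised learning method that uses a diffusion model to augment a limited pre-training dataset. More specifically, assume we only have a small unlabeled dataset, say fewer than 100 hours.

What DiffS4L does is augment the dataset with synthetic data and then run standard pre-training with standard pre-training techniques. The whole data augmentation process consists of three steps. In the first step, we use the small dataset to pre-train an initial speech representation network.

We know this initial speech representation network is of poor quality because of the limited dataset size, but it is good enough for our purposes. Once it is trained, we can obtain an initial speech representation for every utterance drawn from the small dataset.

In the second step, we feed this initial speech representation, together with a speaker embedding that is also extracted from the original speech, into a diffusion model. We then train the diffusion model to reconstruct the original speech. The diffusion model is also trained only on the small unlabeled corpus.

After the diffusion model is trained, we can use it to generate synthetic data and form a large synthetic dataset. However, rather than feeding the initial speech representation and the speaker embedding directly into the diffusion model, we first pass them through a modification module. This way we can ask the diffusion model to generate diverse speech that differs from the original.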

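The remaining question is how we actually modify these speech representations. Speech is a rich source of information. It contains many layers of information, including content, speaker, and prosody. So the synthetic speech should also contain enough variation along all of these dimensions.

We therefore designed the following four levels of variation for the synthetic speech. Level one is the original speech itself. Here is an example. So this is the original speech.

In level two, we feed the initial speech representation and the speaker embedding into the diffusion model as they are. The diffusion model will then generate speech with nearly the same content as the original. However, because the conditioning does not control everything, the output will differ slightly from the original speech, especially in prosody.

Let me play this audio. Pay attention to the prosody. Here the prosody has become a rising tone, whereas in the original speech it was a falling tone. So that is level two.

In level three, we change the speaker embedding to that of a different speaker. The output speech still has the same content, but the voice is that of a different speaker. Here is the example. "Sharing her house, which was close by." Now it has become a different, male speaker. Finally, in level four, in addition to changing the speaker embedding, we also partially mask some of the speech representations.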
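The role of the modification module across levels two to four can be sketched as a small function. This is only a schematic of the described behavior, not the paper's implementation: the array shapes, `mask_fraction`, the masking value, and the function name `modify_conditions` are all hypothetical.

```python
import numpy as np

def modify_conditions(rep, speaker, other_speaker, level, rng,
                      mask_fraction=0.5, mask_value=0.0):
    """Prepare diffusion-model conditions for synthetic-speech levels 2 to 4.

    rep: (frames, dim) initial speech representation (hypothetical shape)
    speaker: (dim_s,) speaker embedding of the original utterance
    other_speaker: (dim_s,) embedding of a different speaker
    """
    if level not in (2, 3, 4):
        raise ValueError("level 1 is the real recording; synthesis uses levels 2-4")
    rep = np.array(rep, dtype=float)
    speaker = np.array(speaker, dtype=float)
    if level >= 3:
        speaker = np.array(other_speaker, dtype=float)      # swap in another voice
    if level == 4:
        n_frames = rep.shape[0]
        n_mask = int(round(mask_fraction * n_frames))
        masked = rng.choice(n_frames, size=n_mask, replace=False)
        rep[masked] = mask_value        # hide content so the model must invent it
    return rep, speaker

rng = np.random.default_rng(0)
rep = rng.standard_normal((50, 8)) + 3.0     # toy representation, 50 frames
spk_a, spk_b = np.ones(4), -np.ones(4)

rep2, spk2 = modify_conditions(rep, spk_a, spk_b, level=2, rng=rng)
rep3, spk3 = modify_conditions(rep, spk_a, spk_b, level=3, rng=rng)
rep4, spk4 = modify_conditions(rep, spk_a, spk_b, level=4, rng=rng)
n_masked_frames = int(np.sum(np.all(rep4 == 0.0, axis=1)))
print(n_masked_frames)
```

Level two leaves both conditions untouched, so any variation in the output comes from the diffusion sampling itself. Level three changes only the voice, and level four additionally removes part of the content signal.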

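This way, the diffusion model is forced to make up some new content, which is why we call this type of speech "new-content speech". As you could hear, the output speech sounds almost like meaningless babble. This means the diffusion model cannot fully capture the grammatical structure or word structure of the original language.

However, as we will show, even this seemingly meaningless babble still helps pre-training performance. To test the performance of DiffS4L, we use LibriSpeech 960 as the pre-training dataset, which contains 960 hours of English speech.

We consider two different pre-training settings. In the low-resource setting, we sample only 100 hours of real speech. We then augment it to a total of 960 hours by adding 430 hours of level-two plus level-three speech and 430 hours of level-four speech.

In the high-resource setting, we use all 960 hours of real speech and augment it to a total of 2,400 hours. We use two standard pre-training techniques, wav2vec 2.0 and HuBERT, both to train our initial speech representations and for the final formal speech pre-training.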

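We compare four different data augmentation settings: no augmentation at all, wav2vec-Aug, WavLM, and our proposed DiffS4L. We tested the pre-trained models on several downstream tasks. In the first task, we tried English ASR, or English automatic speech recognition, with only 10 hours of labeled data for fine-tuning.

The results show that DiffS4L significantly reduces the error rates of wav2vec 2.0 and HuBERT. Moreover, when DiffS4L is combined with WavLM, a further error reduction can be achieved.

Finally, note that the results in both boxes were pre-trained on 960 hours of speech. The only difference is that the results in the red box were pre-trained on 960 hours of augmented speech, while the results in the blue box were pre-trained on 960 hours of real speech. You can see that the gap between them is already very small.

To evaluate performance beyond speech recognition, we chose the SUPERB benchmark, which contains eight different tasks besides ASR. The results again show that DiffS4L achieves the best performance on almost all tasks, in both the low-resource and high-resource settings.

Finally, to test performance beyond English, we chose 13 languages, including some high-resource languages and some low-resource ones. The results again consistently show that DiffS4L reduces the error rate. The last experiment I want to show investigates whether all four levels of variation are helpful.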

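To test this, we go back to our low-resource setting, fix the original speech at 100 hours and the total amount of speech at 960 hours, and vary the ratio between level-two plus level-three speech and level-four speech.

These are the English ASR results under different dataset compositions. The leftmost point corresponds to having no level-two or level-three speech at all, and the rightmost point corresponds to having no level-four babble at all. You can see that the best performance is achieved somewhere in the middle, where all four levels of speech are present.

This means that all four levels of speech, including the level-four babble, are beneficial to pre-training. To summarize our findings: we found that diffusion models can capture speech information that is complementary to what SSL learns. As a result, our proposed DiffS4L can significantly improve SSL performance across a variety of downstream tasks and languages.

We also found that synthetic speech at different levels of variation all benefits SSL, even the seemingly meaningless babble. With that, I'll conclude today's talk. Thank you very much for your attention.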

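In part one, we explored video generation and world simulation. In part two, we went further into diffusion and generative modeling methods, including NeRFs, flow matching, rectified flow transformers, and speech. In part three, we flip the text-to-video generation paradigm around and examine the state of computer vision.

First, we have the OG vision foundation model, DeCAF, which this year won the prestigious Test of Time award, ten years after it was first presented at ICML. Here is UC Berkeley professor Trevor Darrell, advisor on the DeCAF paper and founder of the Caffe deep learning library, accepting the award on behalf of the team.

Thank you very much. It's a great honor to be able to speak here, and thank you for the generous introduction. It's really exciting to be able to talk about the impact of the DeCAF work and the Caffe work.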

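And that introduction was very generous. I think the broadest claim we can make is that DeCAF democratized access to this class of tools, and that led to transformative change in the field. Of course, as we mentioned in the talk, it built on the work of AlexNet and other papers.

Look, the title of the paper is DeCAF, and it's a long title: A Deep Convolutional Activation Feature for Generic Visual Recognition. So what is an activation feature? I thought I would first translate it into 2024 terms. Today we might read the F in DeCAF as "foundation model".

Now, I'm not sure our field ever needed to define the term foundation model, but since it has been defined and is now widely used, I look back and think the DeCAF paper may actually have been one of the original, or one of the earliest broadly used, foundation models in vision or deep learning. So that is basically the main lens through which we look back at the impact of this work.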

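We are honored, and even pleasantly surprised, that it was selected for the Test of Time award. In one slide, what was the DeCAF paper? It may actually be one of the simplest papers our group has published. Basically, we took the AlexNet result and showed how effective the model is as a pre-trained vision model: if you freeze the features and compute activations from them, you get essentially state-of-the-art performance on a wide range of tasks. So in some sense, this is the OG foundation model in vision.

Take the output of a convolutional layer, freeze it, train a linear classifier, or at the time maybe even an SVM, on top of it, and boom, you get state-of-the-art performance. I think the main thing the DeCAF paper did was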

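visualize and show why the model works. I think this insight is so commonplace today that we all understand it, but the vision community in particular did not appreciate it back then. AlexNet was an amazing result, but people thought it was a special case that would only ever be used for that one task.

The fact is that it started working for all tasks, and our paper and other similar papers demonstrated that. It could be used in this pre-train and fine-tune fashion, and that really revolutionized the community. The last point here is, of course, maybe the most important one.

I think what made DeCAF important is that it may have been a turning point for the community. I think before 2014, the vast majority of papers accepted at ICML or CVPR would not come with code or a model; a dataset alone wasn't enough to get a paper accepted either. Obviously this paper was accepted, and it is now being recognized for its impact.

The reason it had impact is really the open-source release and the wide dissemination of the work through that channel. That may be commonplace now, even the mainstream in our field, but it was not the case before 2014. So I want to acknowledge how the community and its standards have changed.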

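DeCAF was part of the Caffe ecosystem, which really was a dominant force in deep learning and one of the most important deep learning frameworks between 2013 and 2018. Again, Caffe was not the first deep learning framework. It did have some unique architecture, but what really democratized things was the models, the access to the models, and the emphasis on heterogeneous compute: you could use a CPU or a GPU.

And it really was an industry-standard codebase that worked well in both academia and industry. It was actually the first widely deployed deep learning platform for NVIDIA GPUs, which certainly had a lot of impact. We founded the first model zoo. The timeline is as follows.

DeCAF was released in 2013 and 2014. DeCAF is the frozen, pre-trained foundation-model version of Caffe. Caffe itself is a deep learning framework, which eventually merged with PyTorch, that let people train their own models and had enormous impact. That impact is reflected in the honors the team has received: this is one of three Test of Time awards that the Caffe ecosystem is receiving this summer. That really is a remarkable observation. The DeCAF paper at ICML, the R-CNN paper at CVPR, and the Caffe system paper at ACM Multimedia, which has also won a Test of Time award that will be presented later this year.

I think DeCAF may be the most important of the three papers, which may be surprising in retrospect, because at the time I'm not sure the DeCAF paper was seen as being as important as the other two, or at least as important as the Caffe system itself. But I'll go back through some old slides and then tell you some current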

observations and thoughts about why and how this work looks sort of from a historical light. And what does the pre-training paradigm mean for the present and the future? So these are the old slides from 2014, presented at ICML 2014 in China, in Beijing. And this is what the world of computer vision looked like

in the early 2010s, 2000s, really starting in late 1990s, 1998, 1999, the machine learning revolution took over in computer vision. But for a good decade, it was the pathway seen on this slide with handcrafted features, words like SIFT and HOG and LLC that maybe aren't even, and SURF that may or may not even be known to the community today.

And then we had our wonderful Caffe cat; of course, detecting cats on the internet was the paradigm of its day. But for several decades,

and not unnoticed by informed researchers at the time, but yet largely unappreciated by the CVPR and even ICML community, to be honest, was the progress in convolutional representation learning and deep learning, as it was later called, the work of... It goes back to Fukushima and the neocognitron,

the Rumelhart, Hinton, and Williams seminal paper in the PDP book, which I encourage people to go back and look at. And of course, the work of Yann LeCun and then AlexNet in 2012, finally showing that this paradigm did scale and this paradigm was going to

But I think even in 2012 and 2013, vision people were basically acknowledging this is working for object recognition, but certainly it wouldn't work for other things. It wouldn't work for fine-grained object recognition. It wouldn't work for complicated transfer learning. It wouldn't work ultimately for segmentation and other things. And as we see, and as the decaf paper

helped convince people that was not going to be the case. In fact, this paradigm was going to take over the field and did take over the field. And the decaf paper was essentially the simplest foundation style model or pre-training paradigm you could advocate. And very simple even in the day, which is let's just take a frozen AlexNet model

which we're going to provide for you, the user who downloads the code from the Berkeley website. And just take slices of the model and compute activation features. Compute the representations that are formed from the pre-trained AlexNet model and see how it does.
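The frozen-features recipe described here can be sketched end to end on toy data. The snippet is only an illustration of the pattern (a fixed feature extractor with a trained linear head): a random ReLU layer stands in for the pre-trained AlexNet, which is not loaded here, and the data, the ridge-regression head, and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained network: fixed random ReLU features.
# In the DeCAF setting these would be activations from a pre-trained AlexNet layer.
W_frozen = rng.standard_normal((2, 64))
b_frozen = rng.standard_normal(64)

def frozen_features(x):
    return np.maximum(x @ W_frozen + b_frozen, 0.0)   # never updated during training

def make_data(n_per_class):
    """Two well-separated Gaussian blobs as a toy downstream task."""
    x0 = rng.standard_normal((n_per_class, 2)) + np.array([-2.0, -2.0])
    x1 = rng.standard_normal((n_per_class, 2)) + np.array([2.0, 2.0])
    x = np.vstack([x0, x1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return x, y

def fit_linear_head(features, labels, n_classes, ridge=1e-3):
    """Train only a linear classifier on top of the frozen features
    (ridge regression onto one-hot targets, solved in closed form)."""
    targets = np.eye(n_classes)[labels]
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)

x_train, y_train = make_data(200)
x_test, y_test = make_data(200)
head = fit_linear_head(frozen_features(x_train), y_train, n_classes=2)
pred = (frozen_features(x_test) @ head).argmax(axis=1)
accuracy = float(np.mean(pred == y_test))
print(accuracy)
```

Only `head` is fitted; the feature extractor stays fixed, which is what makes the approach cheap. The talk mentions a linear classifier or an SVM as the head; ridge regression is used here just to keep the example dependency-free.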

And the DeCAF paper did this, reported this, and asked a series of questions about the quality of these representations and started maybe the first demonstration that these models are learning something more than they're trained on or more than the literal task

that they were trained on. That they are actually learning the latent knowledge that's encoded in those tasks. They're learning semantic hierarchies and things like that. Again, those are concepts that we take as obvious and common sense today, but certainly in the vision community in 2012, it hadn't yet been accepted.

And that because these representations generalize, they actually, sorry, because they capture this latent semantic representation, they generalize to other tasks effectively.

And different layers had different performance. So the decaf paper was one of the first to show visualizations, for example, using t-SNE on these deep learned representations. And just these visualizations here that if you looked at higher layers of the decaf representation or higher layers of the AlexNet representation,

that the ability of these models to capture latent semantic features, which were called superlabels here in the paper, the models were never trained on these superlabels, and yet there they are emerging as a part of the representation.

Again, I think this is the thing that really surprised the vision community, that you didn't explicitly supervise the model on this signal, but it emerged from the representation.

And when you would then look to see what was the performance, of course AlexNet was crushing object recognition, but there was a view that object recognition was now, okay, that's just machine learning. It's not part of computer vision. Computer vision are these other things, fine-grained part recognition or segmentation, domain adaptation and things like that. These representations just started to crush all the tasks. Right?

By moving from the prior best feature, which in this era was called SURF, when looking at the state of the art domain adaptation challenges in computer vision, which was the Office dataset that was released from our lab actually back in 2010, you could see the numbers double just by changing the underlying representation for some relatively fancy domain adaptation technique.

But even more exciting, you know, from the perspective of this paper, perhaps less exciting from the perspective of the domain adaptation researchers, the baselines went up by a factor of three or four. These underlying representations had an ability to transfer that really crushed all the fancy mechanisms that people had been proposing for domain adaptation.

So, that was a sobering moment for many researchers. And across a number of different tasks in computer vision, for example, fine-grained recognition, where you want to recognize individual species of birds, and there was a notion that you may want representations that can also localize parts and have some interpretability.

The decaf model out of the box did better than the prior art and when integrated into straightforward techniques for localizing parts had further improvements in performance.

And the last example in the paper that I'm going to highlight here today is on scene recognition. Here as well, the model is never trained on these labels for outdoor, indoor, man-made, or natural. And yet, these superlabels emerge in the representation when you visualize it with t-SNE.

So there are more results. I encourage you to look back at the paper. But I'll just close this part of reviewing the original talk by noting that I think the main impact here, those observations were important, but the open source dissemination turned out to be the impactful part.

I mean, there were other papers that came out shortly after DeCAF or maybe around the same time that also were showing transfer performance, but ultimately DeCAF and the Caffe open source release led to wide adoption of these techniques very quickly in the community. And there was this great website and a cute cat that you could look at, of course, back in the day.

And at the time it was just considered remarkable, because just a year before, people imagined you needed 10,000 CPUs to try and run deep learning. There was a great article in the New York Times you can go find on how the Google data centers were taking thousands and thousands of CPUs to run certain deep learning algorithms. And there was just a belief that, like, no way anybody but Google can do that.

And so people were just doing whatever they do. And then suddenly the next year, through DeCAF and related efforts for democratization, almost anyone could get most of the performance of these models, at least for inference. And then soon, with Caffe and the advent of GPUs and GPU acceleration of these models,

everyone could train a model of this size. And so that's an exciting point. And maybe we're going to have a similar moment in the future. Right now there's a perception that you need tens of thousands of GPUs to train LLMs. Who knows what the architectures will be in the future and what the next iteration of a transformative architecture change like Caffe will be or

So, DeCAF showed back then the surprising effectiveness of transfer using frozen or relatively frozen AlexNet features. It was a pre-training, fine-tuning paradigm. DeCAF was the precursor to Caffe, which became the de facto standard for deep learning in academia and industry. But maybe, why did I say the DeCAF paper might be the most important one? I mean, Caffe had a lot of impact, but actually,

the way I'm presenting the DeCAF paper now as a kind of foundation-ish model isn't what people were most excited about in 2015 to 2020. In fact, if you wanted to get your paper accepted during that era, you had to put end-to-end in the paper or in the title somewhere. The cool thing was I could now back-propagate all the way from the task

And if you weren't doing that, rapidly by around 2016, you weren't considered in fashion. So I don't think activation features the way they were explained in the DeCAF paper, or foundation models, if we relabel that today, were in vogue for several years. And the Caffe system allowed you to now train your own model, get your own GPUs and do this sort of end-to-end training.

But as we know, roughly in the early 2020s, pre-training returns and actually in some sense is now the dominant paradigm. And we see this from BERT and CLIP and now a perception that if you keep scaling your data and the model, the underlying representations are just gonna get better and better and better. And that's the way to go. We don't think it's the appropriate path

to really fine tune from scratch for each task. We want to leverage everything we see across the underlying tasks. And maybe I'm oversimplifying what people were thinking in 2016 to 2020, but I think you get the gist. So we see now this pre-training paradigm very dominant in the field.

DeCAF was primarily a pre-train plus fine-tune approach, as are our contemporary LLM and LoRA models. But now prompting, pre-train and then prompt, of course is dominant in language and vision and language. And in vision there's very early work along these lines as well.

Since I have a test of time talk, I'll plug my own group's work on this on visual prompting that we had at NeurIPS 2022 and large vision models that we had at the CVPR. You can look at these approaches. No language in those models, but they're still foundation-ish models or pre-training plus prompting approaches to vision models.

And we see the unreasonable effectiveness of pre-training continuing to this day. Many models that are having a lot of impact. A lot of vision and language models coming out very fast from companies. I wouldn't want to now compete in building the next vision and language model from Berkeley.

As I mentioned earlier in the talk, until there's the next revolution of architectures and maybe we're not going to need 10,000 GPUs and we only need a couple of these, whatever the next great model is, I'm looking forward to seeing what that will be. But even now, I think that's still an open playing field if we consider vision and action or vision and action and language issues.

I'll just maybe close by pointing to some of the work in that space. There are many papers coming out from many different labs right now in this direction. I'll point to two from my lab: one that includes language, as it pre-trains a vision, language, and action model, and one that doesn't explicitly include language but also does

humanoid locomotion. If you're interested in the LLARVA paper, we've taken the LLaMA and LLaVA base and added action pre-training to it, where we literally prompt the model with a particular robot control scheme and a particular task, and describe a trace of the desired trajectories, which can then be performed. And the fundamental approach of treating humanoid

action or locomotion as next-token prediction is explicitly formalized in our paper, Humanoid Locomotion as Next Token Prediction. There is no underlying language model here, just pre-training on lots and lots of human and humanoid action data, some of which is taken from the wild and some of which is generated in simulation. It's a straight-up transformer, and the model can then walk around

new environments, including San Francisco. With that, I think there's one or two minutes left for maybe a question or a conversation, which I'm happy to engage in. And again, the team very much wants to thank the community and the program chairs for this honor, and we are looking forward to all the research to come from the community. Thank you.

Thank you so much. So I think there's time for one question. If you have a question, you can come up to one of the microphones there. But because we are

basically between the two poster sessions, we want to make sure we get over there too. So maybe we'll have one person, if you want to come to one of the microphones. We'll give them a second, because people always need one. The thing to do normally as a chair is to ask it yourself, but I won't do that. I'll let someone from the audience do it.

To a certain extent, this notion of prompting, I guess... OK, I'll put it this way. To what extent do you think

fixing features and putting a linear head on top of those features, which as we see is very different from prompting in current mechanisms... to what extent do you think that that's just an

aside, and prompting is just the way we specify linear heads these days? Or to what extent is language really something fundamentally different when it comes to vision-language models, something that is going to enable another step change the way deep networks enabled a step change so many years ago?

I think you asked two different questions. One I took to be the difference

between fine-tuning and prompting, and the other I took to be language versus not language, or at least I'll try to answer those two. And I'll try to do it quickly. I think I could also have added some slides at the end. It is an exciting time in the community, with many cool papers coming out right now about the mechanisms of in-context learning and prompting, task vectors and function vectors,

and how we can interpret and then maybe even patch or extrapolate these models. So I think that's unsettled. I think in the coming years we're going to see papers that define that paradigm and show more formal connections between fine-tuning and prompting in the architecture. That's very hot work right now.

And then, is language special? I sort of don't think it is. I also think the word "language" is complicated, because in the community right now, if I say "a language model"

to the broader press, they're going to assume there's text in there. I'm not sure whether, if I say "language model" to this community, you're going to assume that or not. In the past, I didn't always assume that. I thought I could have a language model over a series of tokens that were just vision tokens, and that's why we call that model

a large vision model. I think it's a large vision model, and it's also a language model. So if language is just the process of having tokenized something and then predicting it, I think that general paradigm is going to be very high-impact across all areas of intelligence, including those that use text, or what we normally call language, and those that don't, like motor control.

That last audience member question was an incredible segue into our next talk, which is a retrospective on how LLMs and computer vision converged from Lucas Beyer, who was one of the lead authors of the Vision Transformer paper at ICLR 2020.

This would count as yet another DeepMind talk on this pod (prizes for counting how many we have featured this episode), except that the entire ViT team has just left Google to set up the new OpenAI Zurich office.

Let's have a look, eh? Get it? Look at how Lucas views the progress of the VLM field. I will talk about computer vision in the age of LLMs, and in this talk I will focus a lot on the data side of things. However, yesterday evening I decided to completely redo my talk, so I apologize if some parts are not smooth or if I'm sometimes surprised by my next slide.

So, one thing that happened recently in computer vision, or "recently" as in four or five years ago now, is that language suddenly became the API for all vision models. By API I mean the input and output of the model, basically how you communicate with it. In the distant past, what

a lot of vision looked like (classification is the most canonical task, but most tasks looked like this) is that you pre-trained your model on a large database of images labeled typically with classes, because that's what's easy and quick to label. Maybe there are a lot of classes, so you cover a lot of concepts, but they are not really attached to anything: it's the class ID number and that's it. This way, your vision model learns to understand a lot of things, or at least to

classify them into these classes. Then you transfer it to any task of interest, which could again be classification, but maybe much more focused, like flower classification or other things.

And again, there you had a labeled dataset, typically much smaller. You fine-tune on that, and then you get the model that you actually care about in the end. Then, with the appearance of CLIP and, at almost the same time, ALIGN, things changed. They showed how to do pre-training not on class-labeled data, but on pairs of images and text that you can find very easily.

And then you don't even fine-tune this model, you just prompt it, basically: you give it a few options in free-form text, and it tells you which option most likely represents the image. And that way you don't even need to, like...

The API, which is what I mean here, changed. It's not integers for classes anymore, and nothing needs to match a fixed list. It's just free-form text. So that was very nice. And in terms of data,

it changed things too. Vision datasets classically were this list of classes, and then for each class you go and collect a bunch of images, typically via image search, and that's it. So you have this very regular structure and information content.

And even worse, who makes up these classes? It's a PhD student just sitting there saying, "I'm going to make a dataset about blah blah blah, so let me think about the classes." Fun fact about COCO, which is probably the second most widely used computer vision dataset from the time when we did it with classes: who knows how the list of classes in COCO was created?

Yeah, very few old-school vision people here. The senior professor on the project asked his teenage kid, an American kid, "What are common objects in your mind?" That's why the COCO classes are things like frisbee, football, pizza, and that kind of stuff.

So you can already see how bias comes into these datasets, right? The model only learns about the things that the person creating the dataset thinks of, and that is usually one random person who just happens to be deciding.

But this is what the data looks like now, in modern times, when we learn from images and text combined. It's just a random collection of images, typically from the web, with text somehow attached to them, typically the alt text or the title of the page it comes from, or things like that.

And then you have some random shit like this that is completely uninformative: "Thumbnail for version as of 21 blah blah blah." This is legible, yeah. So this is a kind of useless supervision signal, but you also get very detailed stuff that you would never come up with if you were to create a list of classes, like this one, "Frankfurt airport skyline 2017", right? Or "London barge race", or things like that. So

this completely changes what the models can learn. They are exposed to a lot more noise and useless stuff, but at the same time also to a lot more detail that you would never come up with in the classic way of creating datasets.

All right, and then a little advertisement. CLIP was the first model doing this, and a couple of years later our group made SigLIP, which is a variant of CLIP and is also an open model, so you can download it and use it. Because it comes after a few years of experience with this, it's significantly better. And the cool thing is, again,

as with CLIP, but now even better, you can prompt it with free-form text and be as detailed as you want, as far as you can express it in text. Here are a couple of examples. Are these visible-ish?

Yeah, here are a couple of cool examples with pictures we took ourselves, so they cannot possibly be in the training dataset and the model cannot possibly know about them. For example, this one is me and a colleague, who each bought a coffee-themed t-shirt. Mine, I think, said "I need coffee" or something, and the colleague's is just the molecule of caffeine. The model fires 100% on the text "a photo of two guys in need of caffeine", but only 4% on "a photo of two guys in need of water".
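The prompting interface described above can be sketched in a few lines. This is a minimal illustration and not the actual CLIP or SigLIP code: the 3-dimensional embeddings, the `temperature`, and the `scale` and `bias` values are made-up assumptions standing in for real encoder outputs and learned parameters. It contrasts softmax scoring, where the prompts compete, with independent sigmoid scoring, which is consistent with the two scores quoted above not summing to 100%.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax_scores(image_emb, text_embs, temperature=0.07):
    """Softmax over similarities: the prompts compete and the scores sum to 1."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_scores(image_emb, text_embs, scale=10.0, bias=-5.0):
    """Sigmoid per prompt: each prompt is scored independently of the others."""
    return [1.0 / (1.0 + math.exp(-(scale * cosine(image_emb, t) + bias)))
            for t in text_embs]

# Hypothetical embeddings; a real system would get these from the two encoders.
image_emb = [0.9, 0.3, 0.1]
prompts = ["a photo of two guys in need of caffeine",
           "a photo of two guys in need of water"]
text_embs = [[1.0, 0.2, 0.0], [0.1, 1.0, 0.2]]

competing = softmax_scores(image_emb, text_embs)
independent = sigmoid_scores(image_emb, text_embs)
best_prompt = prompts[competing.index(max(competing))]
```

Either way, the "classes" are just strings supplied at inference time, which is the API change the talk is pointing at.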

So this stuff works nowadays. And a classic claim in computer vision, at least until recently, is a pet peeve of mine. People always say, "Oh, computer vision models are not robust. It will recognize a cow, but put the cow on a beach and it will fail completely." Not at all.

This has been solved for years. Here, "cow on the beach" gets 99%, and "cow" also gets 36%, but "cow in prairie" gets only 1%. So this stuff works with CLIP and SigLIP and models like that. When we did SigLIP, we released a separate checkpoint that we trained on all languages. And we basically tried to show,

here, just from some examples, because we didn't really evaluate it thoroughly in the paper, we just released it, that it learns multiple languages just from this web data of images and text. The web is international, so you learn about all the languages for free, essentially, if you use them. We didn't do anything special like translation. And the model understands "cow on the beach" not just in English but also in other languages, including languages I cannot pronounce.

But they all say the same thing, right? And here we even tried to show some culture-specific things. I think it was my Chinese colleague Xiaohua who came up with this, so I think only Chinese people will understand it. This dish in Chinese is called "ants on the tree", I think, or "ants on the branch", or something like that. "Ants climbing a tree", exactly.

And here is the interesting thing. If you ask this model in English, with the literal translation of the dish, "ants climbing a tree", it just doesn't get it. It fires on the picture of ants climbing a tree and not on the dish. I cannot say this, but one of these is "ants climbing a tree" in Chinese. If you ask that, it totally gets that you're not talking about literal ants climbing a tree but about the dish, and it matches that.

Alright, so... Ah, yeah! So we released this separate multilingual model, but back then we trained only a small version of it. Recently we also trained a larger version, and this is new: we released it somewhat silently, just in this Colab. So if you're interested in international SigLIP or CLIP-like models, there is now a large one available that is pretty good. But why did we

release two separate models? Why not just one international model and that's it? Well, because it turns out that training on English-only data helps a lot on scores that people, including some of my colleagues, care a lot about, which is just ImageNet zero-shot and a few other English benchmarks. So here is an overview of

a broad range of recent CLIP-style papers, or papers that do CLIP and look at data. It typically looks like this: if we train on raw data, it's bad; on the English-only subset of the data, it's better; with some more filtering, it's better and better.

And the measurement is typically the ImageNet zero-shot score. I intentionally don't write which paper each one is, because I don't want to blame any individuals. Except here, I can say this is the original CLIP paper already. It says "we get our queries from English Wikipedia", so English only.

Then when you see papers using LAION, they usually use LAION-2B, but if you look at the citation, the LAION paper is LAION-5B. So what is that? Well, LAION-5B is actually 2B English, 2B non-English, and 1B "don't know". So typically people just use the 2B English LAION subset.

And then here is another work where we can go through the steps of filtering, right? The first basic filter is that the caption language is English. Then a typical thing is filtering by CLIP score, so you keep only the data that CLIP already understands, and, as I mentioned before, CLIP was trained on English-only data. And then there is more:

keep only the data where one of the words in the text is from the ImageNet-21k list of classes. Or, even better, don't even look at the text: keep only images that are similar to ImageNet images. And the thing is, this heavier, English- and ImageNet-tailored filtering is what works best as measured on ImageNet, and also on other benchmarks, but ones that are similar to ImageNet.
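The filtering steps listed above can be sketched as a chain of predicates. This is only an illustration under stated assumptions: the sample records, the field names (`lang`, `clip_score`), the 0.2 threshold, and the three-word class list are hypothetical, and no specific paper's pipeline is being reproduced.

```python
# Hypothetical web samples: caption, detected language, precomputed CLIP score.
samples = [
    {"caption": "a cow on the beach", "lang": "en", "clip_score": 0.31},
    {"caption": "thumbnail for version as of 21", "lang": "en", "clip_score": 0.12},
    {"caption": "eine Kuh am Strand", "lang": "de", "clip_score": 0.18},
    {"caption": "a frisbee in the park", "lang": "en", "clip_score": 0.29},
]

# Stand-in for a class list such as ImageNet-21k (assumption: just three words).
CLASS_WORDS = {"cow", "frisbee", "pizza"}

def english_only(sample):
    return sample["lang"] == "en"

def clip_score_at_least(threshold):
    return lambda sample: sample["clip_score"] >= threshold

def mentions_class_word(sample):
    return any(word in CLASS_WORDS for word in sample["caption"].lower().split())

def apply_filters(data, filters):
    """Apply each filter in order, keeping only samples that pass all of them."""
    for keep in filters:
        data = [sample for sample in data if keep(sample)]
    return data

kept = apply_filters(samples, [english_only, clip_score_at_least(0.2), mentions_class_word])
kept_captions = [sample["caption"] for sample in kept]
```

In this toy run the German caption of the same scene is dropped at the first step, which is the kind of effect the talk is warning about when every filter is tuned toward English and ImageNet.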

Having said all of these negative things, I need to call out one positive paper and not hide its name: InternVL. I like this one a lot. They specifically show, "OK, look, we use LAION English, but also LAION multilingual and also a Chinese dataset." So that was nice. All right. So this was all about CLIP and about this specific filtering stage. One of the next talks, from Angeline, will give more details about the effects of this.

But part of the vision community has moved on, and I think all of it should move on, past CLIP and even SigLIP, because no matter how good your data is, no matter how high-quality and descriptive the captions are, there are some things the CLIP contrastive loss just doesn't learn. Actually, I just assumed everybody knows how CLIP works, but who knows how CLIP works, the training?

Okay, more or less everybody. That's good. So take this example. You have the image of a cat and a dog, and the caption is pretty much... it could be more detailed, but it's pretty much perfect: "a cat sitting left of a dog". Now these go through the encoders, right? And then they are trained to be most similar to each other compared with the other captions or pairs in the mini-batch.</raw_text>

<raw_text>0 However, let's now think about what the model needs to learn during training to perfectly satisfy this objective. It depends on what else is in the batch. If there is no other picture of a cat or a dog in the same mini-batch, the model only needs to learn, for example, "cat", and that is enough. It matches the caption to this image, perfect, done. The loss is fully satisfied. Or, alternatively, it only needs to learn "dog".

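And then it's done. This is the only image with a dog. It doesn't need to learn anything more to match them. Models are lazy, just like me. They learn the minimum needed to solve the task.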

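Now, if there happens to be another picture of a cat in the same mini-batch, and that cat is not sitting, then the model only needs to learn "cat sitting", or maybe even simpler, "cat and dog". It just matches the words "cat" and "dog" to this image. No other image contains both a cat and a dog, so it's done. It doesn't need to learn more. You see where I'm going? To learn "left of", the exact same mini-batch would have to contain

the same thing, a cat and a dog, but the other way around, with a perfect caption and no other shortcut to tell them apart. That basically never happens. So this is an inherent disadvantage or limitation of CLIP-style learning. SigLIP suffers from it as well.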
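The shortcut argument can be made concrete with a toy contrastive loss. The sketch below is an assumption-laden illustration, not CLIP's training code: the "lazy" embeddings are hand-written 3-dimensional vectors that only encode which objects are present (cat, dog, car), and the temperature of 0.05 is arbitrary. Such a model already drives the loss to nearly zero on an easy batch, and only a batch containing the mirrored scene would penalize it.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.05):
    """Symmetric image-to-text and text-to-image loss over one mini-batch."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

# "Lazy" features: [has cat, has dog, has car]. Spatial relations are not encoded.
cat_left_of_dog = [1.0, 1.0, 0.0]
dog_left_of_cat = [1.0, 1.0, 0.0]   # identical under the lazy features
car = [0.0, 0.0, 1.0]

# Easy batch: nothing else contains a cat or a dog, so the lazy model is enough.
easy_loss = contrastive_loss([cat_left_of_dog, car], [cat_left_of_dog, car])
# Hard batch: the mirrored scene is present, so the lazy model can only guess.
hard_loss = contrastive_loss([cat_left_of_dog, dog_left_of_cat],
                             [cat_left_of_dog, dog_left_of_cat])
```

On the hard batch the loss sits at log 2, a coin flip between the two captions, but such batches are rare in web data, so the gradient almost never asks for "left of".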

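So, together with some colleagues, we went looking for a fundamentally better learning objective that fixes this. And there is indeed a fairly simple one that does: plain captioning. So you encode the image,

then pass the image encoding to a decoder, which should decode the caption. When you decode, it's just like a language model: the loss is next-token prediction. So when you have a loss at this position, it says: say "left". Don't say "right", don't say "above", don't say "below", don't say any other word in your vocabulary, say "left". So the model has to learn this. And then there's also...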
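A minimal sketch of that next-token loss, under the assumption of a tiny hand-written vocabulary and made-up decoder probabilities (no real model is involved). It shows that a decoder which merely guesses between "left" and "right" pays a penalty at exactly that position, whatever else is in the batch.

```python
import math

def next_token_loss(step_probs, target_tokens):
    """Mean negative log-likelihood of the caption tokens, one distribution per position."""
    total = sum(math.log(probs[token]) for probs, token in zip(step_probs, target_tokens))
    return -total / len(target_tokens)

caption = ["cat", "left", "of", "dog"]

# Hypothetical decoder outputs, conditioned on the image and the previous tokens.
knows_left = [
    {"cat": 0.9, "dog": 0.1},
    {"left": 0.9, "right": 0.1},
    {"of": 1.0},
    {"dog": 0.9, "cat": 0.1},
]
guesses_side = [
    {"cat": 0.9, "dog": 0.1},
    {"left": 0.5, "right": 0.5},   # has not learned the spatial relation
    {"of": 1.0},
    {"dog": 0.9, "cat": 0.1},
]

loss_knows = next_token_loss(knows_left, caption)
loss_guesses = next_token_loss(guesses_side, caption)
```

Unlike the contrastive objective, the penalty for not knowing "left" does not depend on a hard negative happening to share the mini-batch.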

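I won't go into detail here, but the original CLIP paper shows in its Figure 1 that training this way is inefficient. In our paper we also show in detail that it's actually not that inefficient. Right. And then we evaluated this. We found that we were not the only ones thinking about this CLIP limitation: there are already multiple benchmarks that specifically measure it.

The first one we looked at is ARO, which stands for Attribution, Relation, and Order, three things that CLIP models are not really incentivized to learn. So they designed a benchmark to test for this.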

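Ignore the numbers at the bottom for now. When we train a CLIP-style model, we get these numbers. When we train a captioning-style model in otherwise exactly the same setting, which we tuned a lot, and on the same data, it is much better. Worlds better. It is also much better than the numbers at the bottom, which come from fixes that I would call Band-Aids.

Training a captioner is much better. On some of them, like the ordering ones, it does really well, close to perfect. Like I actually... no, not this one. Next. But here is an example from the paper, from the ARO paper that I have hidden so far. The way the benchmark is constructed is that you have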

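an image and two possible captions. You need to figure out which one is correct and which one is wrong. The captions are designed to differ in an attribute or a relation, like left versus right, or in word order. This example is from the paper itself: "the horse is eating the grass" or "the grass is eating the horse". Can you guess, without seeing the picture, which one is likely correct and matches the picture? Yeah, right.

So that's a problem. You can guess the direction. That's an issue with the benchmark. I don't even need to reveal the image, but for completeness, here it is. This is just a screenshot from the paper. So we identified this as a problem. So we also trained a

blind decoder, which is just a captioner that never sees the image, on our same pre-training dataset of image-text pairs from the web. And of course it also solves the task. So that's a flaw of this first benchmark. But again, we were not the only ones to notice. Others noticed it as well and created a new benchmark meant to measure the same thing, called SugarCrepe.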

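It looks a bit like this. Here's an example. It doesn't seem to have these obvious shortcuts. Just one example: this picture, and then "a yellow tennis racket with a blue tennis ball on it" or "a blue tennis racket with a yellow tennis ball on it". Both are quite plausible. The same goes for the cake and the flowers, and so on. It also breaks down in detail what is being tested.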

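But it's the same story here. I think we didn't include it in the paper, because the benchmark authors had already run this blind baseline and shown that it gets chance accuracy, so we didn't need to redo it. But here, same story: these captioning models are significantly better almost across the board than the equivalent CLIP models, or even the best CLIP models. Right. So I think this is the future of

pre-trained models, or something like it, but we should move beyond CLIP. Right, and more importantly, vision models, just like language models, are becoming more and more complex. Everything I said before was about pre-training one model that we can then use, maybe zero-shot or fine-tuned, but the way most models are made now is training in stages, and for VLMs that is basically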

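almost the same as for language models. On our side, we started a series of papers and models called PaLI a few years ago. I'm actually curious who knows PaLI. OK, about half.

Probably the people doing vision and [unclear] don't know it. So a PaLI model roughly looks like this; well, it is a series of papers and models. This is the animation from the first paper. You simply take an image and text as input, and produce text as output.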

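The input text is basically the task. What do you want? It is usually a question you want answered, or an instruction such as "generate a caption for this image in Romanian". Then all of this is passed to a transformer and trained together.

I'll talk about how they are trained in a bit. With this kind of model interface you can do much more than with just CLIP or just a captioning model, right? You can now ask questions, and because it's free-form language rather than a list of classes, you can ask quite pointed questions. For example, you can ask how many coins there are and it will say 12, but you can also ask how many one-dollar coins there are, and it can say 2, and so on.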

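And then, yeah, let's skip this. Right. But that was just text output, OK. Language model people will be happy with that, but vision people will say: no, there is more to vision. Right. But text is more general than you might think. For example, a classic vision task is detection, producing bounding boxes with coordinates, right? That is easy to encode as text.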

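Is this readable-ish? So how do you encode a bounding box as text? Well, just as the coordinates of two corners, as simple integer numbers, for example, right? The integers are not pixels, because that would be sensitive to the image size; they are fractions of the image, multiplied by a thousand so that you get integers.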
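A minimal sketch of that encoding, with assumptions labeled: the coordinate order (x_min, y_min, x_max, y_max), the space-separated output format, and the helper names are all hypothetical choices for illustration, not the format used by any released model.

```python
def box_to_text(box_pixels, image_width, image_height, bins=1000):
    """Encode a pixel-space box as four integers: fraction of the image times `bins`."""
    x_min, y_min, x_max, y_max = box_pixels
    fractions = (x_min / image_width, y_min / image_height,
                 x_max / image_width, y_max / image_height)
    return " ".join(str(round(f * bins)) for f in fractions)

def text_to_box(text, image_width, image_height, bins=1000):
    """Decode the four integers back into pixel coordinates for a given image size."""
    x_min, y_min, x_max, y_max = (int(token) / bins for token in text.split())
    return (x_min * image_width, y_min * image_height,
            x_max * image_width, y_max * image_height)

# The same relative box at two resolutions produces the same text.
small = box_to_text((64, 48, 320, 240), image_width=640, image_height=480)
large = box_to_text((128, 96, 640, 480), image_width=1280, image_height=960)
decoded = text_to_box(small, image_width=640, image_height=480)
```

Because the text is resolution-independent, the same target string works whether the model is trained at a low or a high input resolution.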

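Right. So you can actually do a lot of classic computer vision tasks with this text-output API. What's more, I didn't put it on this slide, but you can also produce segmentation masks as text output. How does that work? Well, you can train a

mask encoder, typically a VQ-VAE, that compresses a mask into a short code of a few tokens from a small vocabulary, which can then be decoded again. Then you just concatenate this vocabulary with the language vocabulary, and you're done.

So it's actually a very general API. You can do a lot of vision tasks with it. And again, because it uses language instead of, as in classic vision segmentation and detection, a list of 80 COCO classes, you can express very precisely what you want. "Detect right hand" gives you only the right hand. "Detect left hand" gives you only the left hand. Let's not discuss what counts as the left and the right hand according to the training data.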

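OK, yeah, one problem: we have a series of three papers on PaLI models showing that all of this is possible and can get better and better, and so on. But then times changed, and nowadays people say: "Oh, nice paper, where is the model? Give me the model, or I'll forget about it within a week." So yeah, that's a

fair question. So we made the fourth one with Gemma, called PaliGemma. This model is open, so you can go download it and use it for almost any purpose. There are licenses that say "don't use it for evil things", and we had to use such a license as well. But you can basically use it for anything.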

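It looks very similar to the previous models, just slightly different, because language models are all decoder-only now. So we use a decoder-only language model, Gemma 2B, and then an image encoder. Yeah. And now let's get to the interesting part: training.

This slide I copy-pasted from another presentation. Let's just ignore the left side; it's not important for this talk. The training works in multiple stages like this, which I believe is very similar to how language models are trained.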

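The first stage is stage 0, unimodal pre-training. The image encoder is pre-trained on its own. We did that with the SigLIP image encoder. You could use a CapPa image encoder, you could use a DINO image encoder, any good general-purpose image encoder. The language model is trained on its own. In this case we use Gemma, because we are at Google. You could use LLaMA, you could use anything else.

For this you don't need to spend anything, because you just download existing models. Then you do stage 1, which we call multimodal pre-training. This is where you put them together and train them on a mixture of things that all look like image and text in, text out. I'll show you the mixture in a moment.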

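Then, in computer vision it is often important that the model also understands higher-resolution images. We typically train on 224 × 224 images, for legacy reasons, but it is also a sweet spot. With 224 × 224 images you can recognize a lot, but not everything. And it is relatively efficient.

But there is usually a resolution-increase stage, which is a shorter training run at higher resolution, say 448 × 448. It's shorter because this is more expensive, but you can see more detail, and especially if you have images with text in them, like pictures of documents, you probably really need this.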

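Right, all of this is basically pre-training. Then there is another stage, which is transfer. The pre-training tasks, as you will see shortly, are mainly there to teach the model as many skills and as broad a knowledge as possible. At this stage you don't really care whether the interface is friendly, whether it understands user intent well, or anything like that. It's just about putting raw knowledge into the model.

Then you have a transfer stage, which is usually also shorter, where you fine-tune the model toward what you actually want. That can be different for different people, companies, or projects. It can include training on a mixture of many things; supervised fine-tuning or instruction tuning is also part of this.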

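But it usually does not aim to give the model new knowledge; it just makes it focus on what you care about. So in this case, PaliGemma's pre-training mixture looks like this. It's basically a bunch of tasks that force the model to learn something. "Prefix" here means what goes in as input, like the prompt to the model or the task description. And then, the obvious one...

For example, we have "caption" followed by the language. So "caption" plus Chinese, for example, and then the model needs to predict the caption in Chinese. On raw image-text pairs collected from the web we can run language detection, right? Then we know the language at training time and can put it in here. Or, for example, if we have pictures with text in them,

and we know what the text in the picture is, which we can find out, for example, with an existing OCR system, then we can ask the model to read the text in the image. The prompt would be "do OCR". That's one task. You can see that this teaches the model a different skill than describing the image in a caption. Then there is question answering,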

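including some specific questions that you can generate. For example, if you have an existing, reasonably good classifier that tells you which classes or objects are in an image, you can run it and then generate synthetic questions like these: how many chairs are there, or is there a chair in the image, or things like that.

Then there was an earlier paper showing that you can also do it the other way around: generate the question given the answer. That's a different set of skills the model needs to solve, so it's good to add it to the pre-training as well.

Then we also add detection and segmentation. The detection labels and segmentation labels are pseudo-labels: they come from a good detector model or a good segmentation model.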

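Yeah, so that's what the mixture looks like. But this is not how you want users to use the model, right? You don't want users to first type "answer en" and then their question.

So that's what the fine-tuning step is for. We don't need to go through this full list one by one. It's just to say that we fine-tuned on many different datasets, and it works well. For fine-tuning you don't need a lot of data, because it is mostly about re-adjusting the syntax to match what the task needs.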

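Then the last step, which you also know from language, although we actually did this in vision at about the same time as RLHF, is a final RL tuning step to optimize the model for what you really want, because supervised fine-tuning usually still does not optimize for what you really want. Let's see, how do I give an example?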

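Right, let's go back to this example. If you do supervised fine-tuning for detection on a dataset like this, your training objective is to predict each token exactly, one by one, right? But when you do detection... so this token, for example, is 298 here, and if you predict 299, predicting 299 is simply wrong, right? It's just...

you predicted the wrong token, so you're wrong. That's it. But in detection, that's not what we care about. If a box is off by one pixel to the left, that's totally fine. We care much more about, for example, not having an extra box where there shouldn't be one, which in terms of tokens is the same amount of error as shifting the four coordinates of a box by one pixel each.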

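OK, so what you typically train for in supervised learning is not what you really care about. Is this example clear? Yeah, I hope so. OK. So what we can do in vision is a final RL tuning step. So first... yeah, this was at almost the same time as the RLHF paper.

So first you do supervised training, or supervised fine-tuning, or pre-training, because that works really well and gives you a pretty good model, a pretty good approximation of what you want. This is maximum likelihood training, which basically means imitating the training data. So you can also never get better than your training data, or the best part of your training data.

But then, once you have this model that is reasonably good at your task, you can sample predictions from it, and you can define a reward. The reward does not need to be differentiable; that's the nice part. You just need to give a number: is this prediction good or bad? The number can come from asking humans, and then you have RLHF, or it can come from a very complicated metric. People familiar with detection know that mAP is the metric that describes what we want in detection, but it is definitely not differentiable, and it is quite complicated. But you can compute it and give the sample a score.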

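Then you do RL, which basically means: OK, model, give me two samples, I score both of them, and for the one with the higher score I tell the model "sample more like this", and for the other "sample less like that". You keep doing this. This is how you can align the model to the task, or the part of the task, that you really care about, instead of just

copying what is in the data, which is what pre-training and supervised training do. In language this is relatively clear, because in language it is now very common to have models that you can sample from. In computer vision this used to be completely uncommon. None of the classic computer vision models, like Faster R-CNN, DeepLab, YOLO, and so on, are models you can sample from. So you cannot do RL on them,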
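The "give me two samples, score them, sample more of the better one" loop can be sketched with a toy policy. This is a simplified illustration under stated assumptions, not the method from the talk's paper: the policy is a softmax over three hypothetical candidate outputs, the reward is a hand-written stand-in for a non-differentiable metric such as mAP, and the update is plain REINFORCE using the mean reward of the two samples as baseline.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward(prediction):
    # Stand-in for a non-differentiable score such as mAP or a human rating.
    return 1.0 if prediction == "one tight box" else 0.0

def rl_tune(candidates, logits, steps=200, lr=0.5, seed=0):
    """REINFORCE over a categorical policy, two samples per step."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        pair = rng.choices(range(len(candidates)), weights=probs, k=2)
        rewards = [reward(candidates[i]) for i in pair]
        baseline = sum(rewards) / 2
        for i, r in zip(pair, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                # Gradient of log softmax probability of sample i w.r.t. logit j.
                grad_log_prob = (1.0 if j == i else 0.0) - probs[j]
                logits[j] += lr * advantage * grad_log_prob
    return logits

candidates = ["one tight box", "box shifted far off", "extra spurious box"]
before = softmax([0.0, 0.0, 0.0])
after = softmax(rl_tune(candidates, [0.0, 0.0, 0.0]))
```

The reward is only ever evaluated on sampled outputs, never differentiated, which is why this requires a model you can sample from.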

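because you cannot get two samples from the model and say which is better and which is worse. Recently, with the unification of models and the arrival of these PaLI-style models, and a few other good examples like Unified-IO, you can actually have vision models that sample multiple plausible solutions, and then you can do RL on them. That's why this only happened recently.

Yeah, here are some examples where it works well. We did this for detection. On the left is the base model, which, for those who know detection, has a COCO mAP of 39, which is not bad but not good. Then you do a bit of RL tuning with the mAP metric as the score, and you get a much better mAP; it actually catches many more things.

And 54 is a pretty good COCO mAP. We also did this for panoptic segmentation.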

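To show that you really just need to define clearly what you actually want and then come up with some score, we did this silly example, a colorization model: grayscale image in, color image out. It is also generative, so you can sample from it. Then we just arbitrarily defined a metric that computes

how colorful the image is. Then we tuned a little toward that metric, and indeed it generates more colorful images.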

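Right, and then one last thing about RL tuning, showing what actually happens during RL tuning. It doesn't really teach the model anything new. It just makes it sample more of the things you like, the high-scoring ones, and less of the things you don't like, the low-scoring ones.

So here are some plots. They are all a bit hard to digest, but I'll try to walk you through them. We have the model "before", meaning before RL tuning, and "after", meaning after RL tuning. On the y-axis is the reward for the task, whatever it is.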

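We draw a lot of samples from the model before and a lot of samples after. I think here we drew 10,000 samples, and then we simply sort them. What you see here is that before RL tuning you get a lot of low-quality samples, and in RL tuning you tell the model, and that is really what RL tuning means, right: this sample is bad, less of that.

So you get far fewer low-reward samples, and by default you start sampling many more high-reward samples. However,

before RL tuning, the original model also produces a small number of very high-reward samples, like this little green dotted line, right? So it's not that RL tuning made the model better. It just makes it sample these good parts more often, right? The original model could be as good as the RL-tuned model, just very, very rarely. Let's see.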

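Oh yeah, this one just says that the likelihood of a sample is not enough. You really need the score that you defined. So here, what is that?

Right, from left to right we draw more and more samples. On the left it's like just saying: here are two samples. And what the curve shows is the highest reward among those two samples. So which of the two samples has the highest reward? And we basically see the same story. Oh no, wait.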

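Sorry, I said that wrong. Let's rewind. Here you have two samples, and what you see is the reward of the sample with the highest likelihood. Before RL tuning, that is not very good. The point is that even if, before RL tuning, you draw many samples,

100 samples or 10,000 samples, and pick the one with the highest likelihood, you do not get a better sample in terms of reward, because likelihood is not yet aligned with reward. So you sample more and more, but those samples are not in the high-quality region. That is what reward tuning does: it re-weights the likelihood of the samples so that high-quality samples are sampled more often.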

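OK, that was too much of this and, in the end, too little about data. So that's the end. Thank you.

We'll end Part 3 early here. Brittany also had one more vision-related paper to highlight, "MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark". We appreciate its practical use for AI engineers, but unfortunately we had to cut it for time.

You can find their oral presentation in the show notes. Last but not least, we tie Parts 1, 2, and 3 together, spanning world simulation, generative modeling, and vision, by looking at reinforcement learning and robotics, which took almost as much of the main stage at ICML this year as video generation.

To transition naturally from vision to robotics, we turn to Ashley Edwards, who worked on the Gato and Genie teams at Google DeepMind and is now at Runway, and who highlights the deep connections between generative video, diffusion, and robotics.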

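So yeah, today I'm going to talk about how we can learn actions, policies, rewards, and environments from videos alone. As a small disclaimer, I'll be talking about a lot of my earlier work, some of which I thought I would never talk about again, and some of which I thought I would never talk about at all, but much of it motivated the research I'm doing now. So I thought it would be fun to go over some of the history that led me here.

So I think we have probably seen iterations of this kind of slide throughout the conference. But I think we now know that there has been a lot of progress in text-to-video generation. One question we might ask is: how did we even get here? Just in the past year, we have seen so much innovation.

I hope a lot of people will discuss that over the course of this event, but I won't. Instead, I'm going to talk about how I ended up here. My research background is actually in reinforcement learning, but all of a sudden I found myself in the field of controllable video generation. That's why I want to talk about some of my older work: I wanted to see how I ended up here, and whether some of the things I was working on are still relevant today.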

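So to answer that question, I'm going to take us back to the summer of 2016, which I spent in Japan. My main focus there was actually working with this robot. I was originally a robotics student.

What I wanted to do there was train this robot to learn sign language gestures from videos. That is when I really became interested in how to train agents from video, because, coming from a reinforcement learning background, I was getting a bit annoyed at having to craft a reward function every time we trained an agent. Every time we have a new environment, we have to come up with a new reward function.

So I was very interested in coming up with a more general way to represent tasks, and that could be done with video. When I got to the university, this was at Waseda University, I realized that the robot's hands didn't actually work. So I couldn't teach it gestures from video.

But this robot is actually a very expressive robot. I think it was actually a comedy robot. It could make a lot of different facial expressions. So I decided: why not teach it facial expressions instead.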

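So if you think about it, if you look at what humans look like, they look completely different from this robot. So what I was trying to figure out here was how we could teach a robot, well, this robot in particular, to imitate facial expressions like these

when the features look very different. And again, this was in 2016. We only had a few examples and something like a single GPU. So we didn't have many examples to learn a representation from. What I wanted to figure out was how to make the robot's feature space look more like the human's feature space.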

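One thing we realized is that if you look at the shape of the motion produced by these facial expressions, and by any kind of motion in general, there is actually some structure there. What's shown here is something called a motion template, which essentially stacks a sequence of frames and averages over time, so that you can see where motion happened and when it happened.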
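As a rough illustration of the idea, the sketch below averages the absolute change between consecutive frames at each pixel. It is an assumption about one simple way to build such a map, not the implementation from the work described; in particular, this plain average shows where motion happened but discards when, which a recency-weighted variant would retain. Frames are small nested lists of grayscale values.

```python
def motion_template(frames):
    """Average absolute per-pixel change across consecutive frames."""
    height, width = len(frames[0]), len(frames[0][0])
    template = [[0.0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                template[y][x] += abs(curr[y][x] - prev[y][x])
    steps = len(frames) - 1
    return [[value / steps for value in row] for row in template]

# Toy clip: a bright dot moves left to right in the top row, the bottom row is static.
frames = [
    [[1, 0, 0], [5, 5, 5]],
    [[0, 1, 0], [5, 5, 5]],
    [[0, 0, 1], [5, 5, 5]],
]
template = motion_template(frames)
```

The static row drops out entirely, so appearance (a robot face versus a human face) matters much less than the shape of the movement.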

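That's what this representation shows. The nice thing is that this representation is, to some extent, domain-agnostic. On the left you can see the robot's motion. On the right you can see the human's motion. And again, we have two different tasks: one is smiling, one is surprise. Again, this was a workshop paper at the time. You know, I think this is

kind of cool, but these are not the best results. But essentially you can see that the shapes are similar across these different tasks. So it kind of learned how to smile, and it kind of learned how to look surprised, because we were basically trying to imitate the motion you see here rather than the actual features you see on a human versus a robot, if that makes sense.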

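So, I guess the other aspect of this work is that we basically had to hand-specify our reward function. We used HOG features to compare the human's motion template with the robot's motion template. And this was a single task: we were trying to learn one facial expression, from human to robot. After this, we became more interested in how to learn representations across multiple environments rather than focusing on a single task. So that's why we started working on this. Here we were actually trying to learn behaviors from videos. In this work we had, actually, yeah, in 2017 we got a huge publicly available dataset of internet videos, but it really mostly showed video game

gameplay, consisting mainly of speedruns. But we wanted to see whether we could infer the behaviors happening in these environments, because you can imagine that in these video games you might see characters moving left, moving right, and so on.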

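So the idea was that if we could infer these behaviors, we could then use them to produce a kind of controller, where the agent, when it sees a new scene, generates what I want you to do. Again, this was a workshop paper, so we didn't get to the second part. But we did try to generate these motion templates. All this shows is: given an initial scene, generate a motion template, so that I can generate new motion templates in unseen scenes.

So here are some results. At the top you can see generations for video games, from the dataset this was trained on. These are unseen environments. You can see that it starts to pull out the motion in these different scenes. Honestly, this may be a bit hard to see. But another interesting thing we found is that we could take that model, which had been trained on video games, and it was actually quite good at segmenting animals in unseen environments, even though we had trained only on video games. It's the kind of emergent behavior you get from predicting motion, the things that change over time: you are actually able to extract these different characters.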

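Another interesting thing here is that, rather than trying to predict a single mode, so rather than having your loss on a single next-frame generation, we found that it was actually useful to predict multiple futures.

Essentially, what you see on the left is our initial frame. On the right you see all the different generations. If you squint, you can see that, for example, it predicts moving right, or moving left, or moving up or down, for each different scene. And we found this to be consistent.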

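The way we trained this was basically to minimize the loss only for the generation that is closest to the ground-truth frame. So we were trying to cluster the different future predictions. But the interesting thing here is that these different types of motion actually represent actions. So I think what we started to figure out was that

actions are a kind of shared representation across these different scenes. So rather than trying to represent them explicitly with the motion templates we had tried before, we wanted to see whether we could infer actions from video alone.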

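So that was the motivation for our work ILPO, where we basically try to learn actions and policies from video alone. The way this works is: imagine you have an initial frame like this. In your dataset, and again we are trying to learn from video and train an agent to imitate from it without any actions, you might see transitions like moving right or jumping in the air.

So what we are trying to learn here is a latent action, which is basically the notion of whatever caused this transition to happen. We know something caused it. We don't actually know the action labels; we will try to learn them from data. And then we have a latent policy, defined as the probability that the expert takes a given latent action in any given state.

So the way we learn this is: imagine we see these two sequences in our dataset. Say the expert moves right, for example. What we do is learn a generative model that, again, predicts every possible next state given the initial state.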

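Basically, we again try to cluster all these potential next frames by looking at the generation that is closest to the one that actually appears in the data. So we again have this min loss, which says: let me look at all my latent actions and find the one whose generation looks closest to the real next frame. So we are clustering future frames here.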
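The min loss described above can be sketched as follows. This is an illustrative toy, assuming frames are flat lists of numbers, mean squared error as the distance, and three hand-written predictions standing in for the generator's outputs; none of these choices is taken from the paper.

```python
def mse(predicted, target):
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def min_over_latent_actions(predicted_next_frames, true_next_frame):
    """Return (index of the closest prediction, its loss).

    Only the closest prediction would receive gradient for this sample,
    so each latent action specializes on one kind of transition.
    """
    losses = [mse(p, true_next_frame) for p in predicted_next_frames]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses[best]

# Hypothetical predictions for latent actions 0, 1 and 2 from the same initial frame.
predicted_next_frames = [
    [0.8, 0.1, 0.1],   # e.g. "character ends up on the left"
    [0.1, 0.8, 0.1],   # e.g. "character stays in the middle"
    [0.1, 0.1, 0.8],   # e.g. "character ends up on the right"
]
true_next_frame = [0.0, 0.0, 1.0]

best_latent_action, best_loss = min_over_latent_actions(predicted_next_frames, true_next_frame)
```

Training on the minimum rather than the average keeps the predictions from collapsing into one blurry mean frame.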

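Then what we do is try to learn a policy over all the different transitions we can see. We can do that because, say, in our dataset we observe that,

for example, the expert moves right half of the time and jumps in the air half of the time, and never stays still. So we try to learn a policy that ends up looking like this. If you averaged over all of these future frames, you might see something that looks like that. So we try to learn a policy that effectively weights all the different outputs of our generative model, so that if we take the expectation under that policy, you end up with a

generation, or an average generation, that looks like the expected generation from our expert, the expected future produced by our expert. So that's basically how we train the policy. These weights on the different futures are really saying: this is the probability that I take latent action zero in this state, this is the probability that I take latent action one in this state, for example, and we can train it that way.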
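A minimal sketch of that expectation, assuming three latent actions, two-number "frames", and hand-set policy weights (all hypothetical). The expected next frame under the latent policy is the policy-weighted average of the generator's per-action predictions, and training would push it toward the average next frame observed from the expert.

```python
def expected_next_frame(latent_policy, generated_frames):
    """Policy-weighted average of the next frame predicted for each latent action."""
    size = len(generated_frames[0])
    return [sum(prob * frame[i] for prob, frame in zip(latent_policy, generated_frames))
            for i in range(size)]

# Hypothetical generator outputs for latent actions 0 (right), 1 (jump), 2 (stay).
generated_frames = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 0.0],
]

# Expert moves right half the time, jumps half the time, never stays still.
latent_policy = [0.5, 0.5, 0.0]
expected = expected_next_frame(latent_policy, generated_frames)
```

Reading the weights off directly gives the probability of each latent action in this state, which is what makes the learned weighting usable as a policy.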

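So yeah, this shows that after 200 steps of interaction with the environment, our model is able to adapt quickly. The reason is that we learn this policy from video before putting the agent into the environment. Then we can take a few steps of environment samples and use them to map our latent actions to real actions in the real world.

So one takeaway from this work is that we can actually represent our actions by the next-frame generations. Of course, this assumes your dynamics are deterministic, but suppose they are. Basically, each next frame represents an action you could take in the world. We took this idea in a different direction, where we said: basically, suppose we have a

reward function. We can now try to learn a value function from video, an optimal value function, even if you have suboptimal data. For example, you might have demonstrations from video where the expert is not really an expert, and they are bumping into things and doing suboptimal things, but sometimes they reach the goal.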

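So the idea here is that normally in reinforcement learning you can learn an optimal policy from suboptimal data, but with video it gets a bit trickier, because you don't have access to actions. So the idea here is, instead of learning an action, or sorry, learning a policy... Normally in reinforcement learning, and I know this is a video generation setting, but some of you may be familiar with diagrams like this, you have an agent acting in the world, taking actions and

trying to maximize the long-term expected reward. The idea of this work is, instead of having a policy over states, sorry, yeah, instead of learning a value function over states, you learn a value function over pairs of state and next state. So basically we have this value function. Sorry, I think I got that part wrong. You would normally have a value function over state-action pairs. Now we learn a value function over state and next-state pairs. And we now learn a policy over states, one that tells you which state to go to, rather than a policy that tells you which action to take.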

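The nice thing about this is that, when you have suboptimal data, you can actually learn this in an optimal way. I'm showing a lot of different things on the screen here. But the main takeaway, again, is that we learn this policy over states, and we learn a value function that represents the value of transitioning from one state to the next, rather than the value of taking an action in a given state.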
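The idea can be illustrated with a tabular toy. This is a sketch under stated assumptions rather than the method from the talk: states are integers on a short chain, the dataset is a handful of action-free (state, next state, reward) transitions from wandering behaviour, dynamics are deterministic, and the values are found by repeated sweeps of a Q-learning-style backup over state pairs.

```python
def learn_state_pair_values(transitions, gamma=0.9, sweeps=50):
    """Tabular backup: Q(s, s') = r + gamma * max over s'' of Q(s', s'')."""
    q = {(s, s_next): 0.0 for s, s_next, _ in transitions}
    successors = {}
    for s, s_next, _ in transitions:
        successors.setdefault(s, set()).add(s_next)
    for _ in range(sweeps):
        for s, s_next, r in transitions:
            future = max((q[(s_next, s_after)] for s_after in successors.get(s_next, ())),
                         default=0.0)
            q[(s, s_next)] = r + gamma * future
    return q, successors

def best_next_state(q, successors, state):
    """Policy over states: which observed successor has the highest value?"""
    return max(successors[state], key=lambda s_next: q[(state, s_next)])

# Suboptimal, action-free data on a chain 0-1-2-3, with reward only on reaching 3.
transitions = [
    (0, 1, 0.0), (1, 0, 0.0),
    (1, 2, 0.0), (2, 1, 0.0),
    (2, 3, 1.0),
]
q, successors = learn_state_pair_values(transitions)
plan = [best_next_state(q, successors, s) for s in (0, 1, 2)]
```

Even though the data wanders back and forth, the greedy choice over next states heads straight for the goal; turning a chosen next state into an actual action is the job of a separate inverse dynamics model.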

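Then we can basically train this policy to tell us which state we want to transition to, by maximizing the value of moving from one state to the next. The other thing we end up needing to do when we actually interact with the environment is that we have to

figure out where the actions come from, so we can also learn an inverse dynamics model. So that's what this shows. Again, what we are learning is that, given suboptimal data, we can actually learn optimal generations. This shows plans from our policy over states, which says: which state should I move to in order to maximize my value?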

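One interesting thing to keep in mind here is that this is basically like a video generation model. We are trying to generate the next frame that tells us how to maximize our value. This is given random rollouts of randomly generated behavior, and we are actually able to generate optimal trajectories.

This also works for reinforcement learning. But yeah, I'll skip that, since we are here for video generation. The other thing, though, is that this requires us to actually have a reward function. So another thing we were interested in was how we could learn from video without a reward function. Can we have agents learn from this kind of data?

So I guess one thing we can observe is that, usually, when you have a video, the trajectory that unfolds has some kind of order. Usually you have expert data that shows you good things to follow. So we can say that the end of the video has a reward of 1. And then, if you go backward in time, it gets discounted, just like you would see in a reinforcement learning trajectory. So we can use this idea to learn a value function that tells us how good the behavior in a video is.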

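That's basically what we did. Given a sequence of frames, we can say you get a reward of 1 at the end, and then we can go backward in time, and that's our value function. We can use that to train a reinforcement learning agent, basically replacing the bootstrapping step with our learned value function, and then basically training your policy in a supervised way.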
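The value targets described above follow directly from the discounting rule. A minimal sketch, assuming a discount factor `gamma` chosen by hand: the last frame gets value 1 and each step backward in time multiplies by `gamma`. A network regressing these targets from frames would be the learned value function.

```python
def value_targets(num_frames, gamma=0.9):
    """Target value per frame: reward 1 at the final frame, discounted backward in time."""
    return [gamma ** (num_frames - 1 - t) for t in range(num_frames)]

# A four-frame video with gamma = 0.5, chosen so the numbers are easy to read.
targets = value_targets(4, gamma=0.5)
```

The targets increase monotonically toward the end of the video, which matches the rising value curve described for the pouring videos.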

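But you can see that we trained this model on a bunch of different videos of pouring. You can see that the value increases over time. So this basically tells you that you can learn a value function this way. You can even use it to train reinforcement learning agents, because, well, I have a reinforcement learning background and we do that sometimes. And you can see that the agent is actually able to learn, even though the value function was trained on video alone.

OK, so basically what we have shown is that we can learn actions, rewards, and policies from video. So I guess what's left is... this in some sense led me into this field of controllable video generation, where we now try to learn environments from video. That's the idea behind Genie, where we try to learn a generative interactive environment from video that can be played by both humans and AI agents.

So I think a lot of my earlier work was really about how to use these videos to train the agents themselves. But I was lucky to meet people like Jack and Tim and the team, who have a background in open-endedness. They basically said: we don't have to learn just the policy, we can learn the entire environment, and then we can put agents into these environments and have them learn there.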

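So that is what led to our Genie work, which is shown here. Basically, the idea of this work is that we learn three main things. One is a tokenizer over our videos. We use a discrete VQ-VAE model to represent them.

We have a latent action model. I think this is the most important component: we can basically take in a sequence of frames and try to infer what changed, so that you can use that latent action representation to predict the future. Then you plug it into a dynamics model to predict the future. That's where the controllability comes from. It comes from this latent action model, which tells you how things will change over time.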

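That is what led to our final results, where we basically found that you can plug text-generated images into our model and interact with them as if they were a real environment. Again, we trained on a huge dataset of

platformer games. I guess I didn't spend much time talking about Genie, because I know there have already been a few workshop talks on it and we have already talked about it at the conference. But I was thinking about how I got into this kind of research, and I think the idea is that you can actually use these environments to train future agents.

Hopefully we can potentially learn policies, learn latent policies, and learn reward functions, as we discussed earlier. So yeah, I think that's mainly what I wanted to say. I also want to acknowledge all of my collaborators. There are a lot of great researchers here that I have had the chance to work with. But that's it. Thank you. I think I probably have a lot of time left for questions.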

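In more complex environments, actions alone cannot represent all of the dynamics. How do you think we can disentangle actions in that case without supervision?

Without supervision? I think if you have a notion of reward, for example, or if you can try to learn a policy, for example, you might be able to separate the actions that are most likely to happen from the dynamics. But I think disentangling these without supervision is hard. In our case, you could probably control the crowd if you wanted to. But I think maybe you could use something like text to

add extra information. But I think it also has to do with scale.

So if you wanted to scale Genie to real-world videos, what are the main changes in architecture and philosophy?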

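Yeah, that's a good question. The Genie model is fairly general. Nothing in it says we are explicitly training on 2D platformer games. We also have experiments where it works on robotics data. So I think it's probably just a matter of scaling up the architecture,

the bitter lesson as usual, and adding more data, and hopefully it can learn from that. I think you could probably also swap out different components of the architecture itself using current state-of-the-art techniques.

It is surprising, or perhaps not surprising at all, how many answers to workshop questions come down to this one word: scale. We challenge you to get through a day at NeurIPS without mentioning the bitter lesson once.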

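As for the audience member's question about action generation and behavior cloning, Brittany walked the poster sessions and found one possible answer from NYU.

I'm here with Seungjae Lee, also known as Jay Lee, to discuss his poster on the VQ-BeT model, which was one of the spotlight posters presented at ICML. It is described as a scalable

behavior generation model for efficient multimodal behavior prediction in complex tasks. That's quite a mouthful, so if you could explain for us what exactly you worked on here, that would be really helpful. OK, nice to meet you. So our poster, our work, started from one question: how can we use a very powerful LLM-like token prediction framework for behavior generation tasks?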

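The main concern with this question is that action data lives in a continuous space. It is not like the language we use, which is very easy to tokenize. So what we did was use a VQ-VAE, a vector quantizer, to quantize the continuous action data

into a discrete representation, and use that discrete representation as the tokenizer for an LLM-like architecture, so that we can predict behavior given the current observation.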
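The tokenization step can be illustrated with a single-level nearest-neighbour quantizer. This is a deliberately simplified sketch, not the VQ-BeT implementation: the codebook here is hand-written rather than learned, actions are 2-dimensional, and details of the real model such as how the codebook is trained are omitted.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(action, codebook):
    """Map a continuous action to the index of its nearest codebook vector."""
    return min(range(len(codebook)), key=lambda k: squared_distance(action, codebook[k]))

def dequantize(token, codebook):
    """Map a token back to the continuous action it stands for."""
    return codebook[token]

# Hypothetical codebook of 2-d actions: stay, move in x, move in y.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# A short continuous trajectory becomes a sequence of discrete tokens.
trajectory = [[0.9, 0.2], [0.1, 0.8], [0.05, -0.1]]
tokens = [quantize(action, codebook) for action in trajectory]
reconstructed = [dequantize(token, codebook) for token in tokens]
```

Once actions are integers, a transformer can be trained with ordinary next-token prediction to output them given the current observation.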

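Very, very interesting. How did you get into this research area? What's the background or origin story of this project? Yeah, actually, my personal background is closer to reinforcement learning. But in my opinion, there are now a lot of accessible

large action datasets, and I found that it is really hard to train behavior cloning agents in the conventional way. I mean, it is really hard to train a good policy from a large dataset in the conventional way. So we need a better architecture that can take advantage of LLM-like architectures. That was the starting point of our research.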

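How did you handle the dataset collection problem? Because I know that in many of the applications we've seen, data seems to be the bottleneck today more than anything else.

Yeah, actually, that's a good question, because getting datasets is really expensive. I mean, this is a very important point in robotics. Most of our environments are open source, so you can download most of the datasets. Some of those datasets were collected by humans using VR devices. For our real-world experiments, we collected a small amount of data ourselves using an iPhone-based manipulation device.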

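So, yeah.

So you partly bootstrapped the dataset yourselves, and it looks like you also did a lot of work in simulation?

Yes. We first validated our framework in simulation, and then, after some solid results, we moved to real-world experiments. The strength of our work is that our model is very lightweight, so it doesn't need a large dataset. In the real-world scenarios we only need 45 demonstrations per task, which a human can collect in one to two hours, so it's not difficult, yeah.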

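Can you talk about the performance results you've seen from this model? Because listeners at home don't have the poster in front of them.

You mean the performance of our model? Yeah, I would say, you know, there is a very famous diffusion-based model. I would say our performance is comparable to those diffusion-based models, but inference is really fast: it takes about 20% of the time of a diffusion model. And, you know, inference time is very important in robotics. So we can say you can run control at over 100 Hz on a GPU and over 20 Hz on a CPU. Oh, no. 20 Hz on a CPU. So, yeah.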

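So the performance is good enough, comparable to diffusion-based policies, but the inference time is much better than those baselines.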

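Got it. You mentioned that you published this late last year. Have you continued working in this problem space, or how has your research evolved since publication?

Actually, we believe the future direction should be scaling up this architecture, I mean, toward more general-purpose agents. For example, agents that can perform certain tasks from language instructions. So our goal is to scale it up.

Very exciting. Did you do this through your work at Seoul National University and then go on to work at NYU?

Yes, actually I got my master's degree at Seoul National University, I emailed people at NYU, and we started collaborating last summer.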

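Very exciting. Thank you so much for taking the time to walk through this. I appreciate it.

Thank you.

That was a great spotlight poster from Seungjae Lee. We also recommend the talk by his professor, Lerrel Pinto, on building general-purpose robots, which we link in the show notes.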

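Brittany had one more robotics paper to highlight: PIVOT, Iterative Visual Prompting Elicits Actionable Knowledge for VLMs. But we're skipping it in the interest of time, so as not to keep adding to our already overflowing count of Google DeepMind publications.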

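One of the biggest names in reinforcement learning and robotics to date is Professor Chelsea Finn, now a founder of the $2 billion startup Physical Intelligence, who gave four talks at ICML on the lessons she has learned from robotics.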

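We highlight her keynote here, but we also recommend checking out the talk by her colleague Sergey Levine on robotics foundation models.

My name is Chelsea, and I do research on machine learning algorithms as well as the application of machine learning to robotics.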

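Because I work on both of these things, I think robotics gives my machine learning research a slightly different perspective from that of the typical machine learning researcher. Today, I'd like to share a little bit about this perspective and what it has brought to my machine learning research.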

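The first thing I'll mention is that my robotics work, even though it isn't necessarily aligned with core machine learning algorithms, has often indirectly led me to problems that are relevant to applications beyond robotics.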

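For example, about 10 years ago, I started doing end-to-end neural network training for robots. This included training a robot to place a block into a shape-sorting cube, or to use a spatula to put an object into a bowl. In both cases, we were training a neural network that maps images from the robot's camera to the torques applied at each motor.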

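The neural networks we trained had 92,000 parameters. While that may not seem particularly interesting or new now, at the time it was actually quite different from typical robotics approaches. After I started working on neural network policies for controlling robots, I got a bit frustrated that every time I wanted to train a robot I had to train a neural network from scratch, even though we were usually training robots to do many different tasks, not just one.

This got me interested in whether robots could learn new tasks faster by leveraging their prior experience, rather than training from scratch. That led me to work on few-shot learning and meta-learning, which ultimately ended up having a fair amount of impact in other applications such as education and drug discovery.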

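Here is another example of robotics work leading me to relevant problems. In that initial work, the policies the robot learned were specific to one spatula, or one shape-sorting cube, or one environment. I became very interested in whether we could use broad datasets to improve the generalization of robots.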

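This led me to think about how to develop machines that generalize broadly, and perhaps even generalize beyond their training distribution. That led me to work on datasets and also on robustness to distribution shift, which resulted in a benchmark we developed called WILDS that studies distribution shift across a wide range of real applications and has been widely used in the machine learning community.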

<raw_text>So, from there, in this talk, I'd like to share a little bit about what working on robotics has taught me about machine learning. And to start off, let's talk about a few facts about machine learning in the context of robotics. The first is that machine learning is quite data-hungry, and at the same time, we don't have existing data sets on the Internet of robots, robots,

controlling themselves to do different tasks. We don't have the equivalent of Wikipedia for how to control motors to tie shoelaces or to open a water bottle. Furthermore, we don't have an easy way to interpret or ensure the safety of machine learning policies applied to robots. And this has serious implications when robots have a real possibility of directly harming humans in a physical world.

Lastly, compared to other leading approaches to robotics like optimal control, we lack formal guarantees of what a machine learning-based policy would do. Because of these shortcomings of machine learning in the context of robotics, you might expect me to say that maybe machine learning isn't solving real applications like robotics and it's fundamentally problematic. But is that actually true? Let's look at an example.

So say that we want a robot to tear off a piece of tape and put it on a box. This may seem like a fairly simple task, but this is actually a task that is incredibly difficult for traditional robotics approaches, because traditional approaches will typically try to model the entire scene, including how the tape will adhere to the canister and to the fingers of the robot, how it will tear when spread across the metal part of the canister, and how to control all 14 of the motors on this robot in order to accomplish the task.

It turns out that for this task that is seemingly extremely difficult for traditional approaches, we can actually use machine learning to address it. So we can develop a teleoperation interface, specifically Tony, a student in my lab, developed a teleoperation interface that we call Aloha that allows you to puppeteer the robot to solve a wide range of different tasks.

Once you develop this teleoperation interface, it means that you can collect data to train a machine learning-based policy to solve a wide range of different tasks, including the really challenging task of tearing off tape and putting it onto a box, as well as other tasks like putting on a shoe. In this case, it's a machine learning policy that's mapping the images from the robot's cameras to all the 14 joints, and it's doing so with a transformer trained end-to-end on demonstrations collected with teleoperation.

And we can use machine learning not just for these fairly complicated tasks, but we can also do it for mobile manipulation. So we can develop a teleoperation interface for an entire mobile robot with two arms, use that to collect data, and again, use a transformer-based architecture to train the robot to do challenging tasks like, on the top, cooking a piece of shrimp by pouring oil on the pan, putting the shrimp into the pan, flipping the shrimp and serving it. And on the bottom, putting a pot into a cabinet. And so again, we're finding that machine learning is able to solve fairly complicated robotics tasks.

And beyond these kinds of robots, we can also do something like this for surgical robots. So surgical robots are incredibly difficult to control. This is the DaVinci Surgical Robot, and we can use machine learning in a fairly robust way to, again, train policies for complicated tasks like tying a knot and picking up a needle and handing it over to the other surgical tool.

Finally, we can also do this with full-size humanoid robots where if we develop a teleoperation interface, which is a little bit harder to do in this case, but we can train a shadowing-based teleoperation approach and then use this to train, again, transformer-based policies in this case to control robots to do pretty challenging tasks that involve controlling all of the different degrees of freedom, including both the arms and the legs of the robots.

And so going back to my question before of whether machine learning is solving real problems, I do think that machine learning has been making real advances that advance applications and really useful problems in the real world. Supervised learning works really well. We've seen significant advances in architectures, learning algorithms, and optimizers.

We also have reliable engineering practices for debugging if something isn't working, debugging if a policy isn't working or if another model is not achieving the performance that we want and ultimately improving the performance. Now you might ask, if machine learning is making real advances, why don't we have robots out in everyday environments solving real problems yet?

And a lot of people for that question will refer you to Moravec's paradox, which states that the things that are most intuitive for humans, like basic motor control, are the things that are often most challenging for machines. And this could explain why robotics is further behind than applications like debugging complex code or translating between two pieces of text.

But in my work, I've actually found that this isn't perhaps quite the most direct explanation. I think the explanation is actually that the things that lack abundant data are often the things that are most challenging for machines. And this is because scenarios that lack abundant data, we're not able to directly apply machine learning and directly try to identify patterns from large amounts of data.

This can include both data-scarce applications as well as scenarios that are novel and aren't represented well in the training data. This isn't just things like robotics that don't have a corresponding Wikipedia and so forth. Even within applications that do have a lot of data, there are scenarios they encounter that aren't represented well in the data, and that's exactly where machine learning algorithms often struggle, and as a result, our machines often struggle.

Perhaps instead of trying to take some approach that combines traditional methods with machine learning or something, I think that actually robotics just needs more of what makes machine learning thrive. Essentially, we need to find more ways to get data for applications like robotics. This is really the core question that I want to talk about today: how can we get good data for a wide range of problems in an inexpensive way?

How can we basically handle data scarcity without skimping on data? I'll talk about a few different ways to do this. The first is finding ways to augment data with cheap and natural to provide supervision. The second will be to leverage data sources beyond the particular target application. The third will be to incorporate data from test time in addition to the typical training dataset.

I'll spend the most time on this first point because it's a little bit different from some of the ideas that have become more commonplace in machine learning. Great. To start out by talking about cheap and natural to provide supervision, let's look at how we currently supervise machines.

We currently will take a training dataset, train a model, evaluate that model. To evaluate it, we'll ideally actually look at how it does in a real situation by talking to it or by running a robot and so forth. Inevitably, the model often won't work well in some scenarios. The best course of action, assuming that you've optimized it well and the architecture is well-tuned, is to collect and label more data, and specifically collect and label more data in the scenarios where it is struggling.

This would involve going out, getting examples, and getting labels for those examples that cover the scenarios that are not working well. This is really expensive and very human intensive. If it were cheaper, we would be able to iterate on this cycle more, and we would probably end up with a stronger model. That's one shortcoming of the typical supervised learning approach.

The second is that input-output pairs are also a little bit weird in some settings. Say that we wanted a robot to cook a meal. The way to apply supervised learning in this case would be to collect examples of how to move the arms of the robot, how to move the motors as a function of the inputs.

This is a little bit weird compared to just trying to teach the robot naturally the kinds of things that it should do, like making sure that the water is hot enough before putting pasta in or setting a timer to make sure that it's been cooked for long enough. Or as another example, say that we want to train a system to make a medical diagnosis. The typical supervised learning way to do this would be to have examples of symptoms and then have examples of the diagnosis as a result.

But instead, perhaps the more intuitive way would actually be to teach the machine about how diseases actually manifest in humans and patients. This is bringing us to the idea that perhaps we might be able to train machine learning models in a more data-efficient way if we were able to incorporate natural-to-provide supervision.

One thing you might think about here is, instead of providing labels, what if we use human feedback? Reinforcement learning from human feedback has been quite successful: instead of providing input-output pairs, we'll look at a set of input-output pairs and say, this diagnosis is better than this one, or this pasta tastes better than this pasta. This can require a lot less supervision because you don't actually have to write out or provide the exact motor torques.

But it still requires many labeled examples, many examples of an outcome and what is preferred. Is it possible to give machines far less supervision but still allow them to improve? We're going to look at this both in a robotics example as well as in a more standard image classification example. Let's start with the robotics example. We're going to be looking at long-horizon bimanual tasks. The goal, for example, might be to put all the objects into the bag.

It's really expensive to collect demonstrations that cover all of the possible scenarios that the robot might end up in. The form of natural supervision that we're going to be considering here is just verbally telling the robot how it might handle or how it might improve in situations rather than trying to collect a ton of demonstrations for the scenarios that it's struggling in. Specifically, say the robot is going about the task and it's struggling on this part of the task of putting the sponge in the bag.

What we'd like to be able to do is we'd like to be able to tell the robot at this part, you should use the sponge to open the bag wider because right now the bag is not open very widely. Ideally, it'd be able to use this verbal snippet of text to both improve on the fly to be able to figure out how to solve the task in that scenario, as well as how to then take that data and actually improve the policy and improve its ability to handle new situations like that in the future.

We'd like to be able to use this high-level language supervision both on the fly and for future improvement. How do we do this? If we want our robot to be able to improve from high-level language corrections, we need a way to connect what the robot is doing with language.

To do this, we're going to train a hierarchical policy, a high-level policy and a low-level policy, where language is the interface between those two policies. More specifically, we'll take the observation, this will be fed into a high-level policy that then predicts language corresponding to a skill like pick up the sponge or put the Sharpie into the bag.

Then this language command will be fed into a low-level instruction-following policy that takes as input the robot's observations and outputs motor commands. This hierarchical approach is not new, it's actually been done in a wide variety of prior works, and so it's not what we're introducing here. The key insight of what we're going to do here is that we can actually update the high-level policy only with language supervision because its output space is language, a skill that the robot should do next.

Because of this, if the low-level policy can follow a wide range of instructions, then we can actually improve this full system just by updating the high-level policy and just by giving it language feedback. Specifically, we can do something like the DAgger algorithm, the dataset aggregation algorithm, on the high-level policy and freeze the low-level policy.

Specifically what this is going to look like is we'll intervene, we'll tell the robot what we want it to do. In this case, maybe it should rotate the tape in order to put it into the bag. This intervention, this language command will override the high-level policy and that intervention will be fed into the low-level policy instead of what the high-level policy is predicting. Then that will allow it to on the fly be able to leverage these interventions.

We'll also aggregate these interventions into a dataset and use this to update our high-level policy. It actually also learns how to improve from these corrections in the future. We're freezing the low-level policy and updating the high-level policy by supervising it just on the language corrections that the human is providing. We gave this a fun name, Yell at Your Robot or Yay Robot, because you can articulate your corrections or frustrations with the robot to help it improve.
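The intervention loop described above can be sketched in a few lines. This is a toy stand-in under loud assumptions, not the YAY Robot code: the high-level policy is a lookup table rather than a neural network, observations are string labels, and every name (`HighLevelPolicy`, `low_level_policy`, `step`) is hypothetical.

```python
from collections import Counter


class HighLevelPolicy:
    """Toy high-level policy: maps an observation label to a language skill."""

    def __init__(self, default_skill: str) -> None:
        self.default_skill = default_skill
        self.votes: dict[str, Counter] = {}

    def predict(self, obs: str) -> str:
        votes = self.votes.get(obs)
        if not votes:
            return self.default_skill
        return votes.most_common(1)[0][0]

    def finetune(self, dataset: list[tuple[str, str]]) -> None:
        """Supervise only on (observation, language correction) pairs."""
        for obs, skill in dataset:
            self.votes.setdefault(obs, Counter())[skill] += 1


def low_level_policy(obs: str, skill: str) -> str:
    """Frozen instruction follower (stub): returns a fake motor command."""
    return f"motor_command[{skill}]"


def step(policy, obs, correction, dataset):
    """Run one step; a human correction overrides the high-level output."""
    skill = policy.predict(obs)
    if correction is not None:
        skill = correction
        dataset.append((obs, correction))  # aggregate for later fine-tuning
    return low_level_policy(obs, skill)


policy = HighLevelPolicy(default_skill="put the sponge into the bag")
corrections: list[tuple[str, str]] = []
before = step(policy, "bag_barely_open", None, corrections)
during = step(policy, "bag_barely_open", "use the sponge to open the bag wider", corrections)
policy.finetune(corrections)  # the low-level policy is never updated
after = step(policy, "bag_barely_open", None, corrections)
print(before, during, after, sep="\n")
```

The design point the sketch tries to capture is that only language labels are collected: the correction helps immediately by overriding the skill, and helps later because the same pair becomes training data for the high-level policy.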

What can this do? Let's look at some videos of fully autonomous policies on the robot. We'll start just with the base policy before doing any language corrections. This policy is trying to put the objects into the bag and it'll make mistakes. In this case, instead of putting the Sharpie into the bag, it put it underneath the bag and it struggles to be able to recover from that.

It also makes other mistakes. Here it's trying to pick up the Sharpie. The high-level policy output is shown here on the top left. We're actually finding that the high-level policy isn't ever issuing corrections like go lower or maybe rotate the gripper in this case. It just keeps on telling the policy to try to pick up the Sharpie. Now, after we fine-tune on language corrections, we find that it's able to autonomously correct for mistakes. Here it's making the same mistake as before by putting the Sharpie under the bag, and then it's trying to self-correct.

It then makes a mistake again, and then it's self-correcting again to try to move towards the camera, go higher, and then put the Sharpie into the bag. And by self-correcting, it's able to solve that part of the task successfully. It also learns to self-correct for grasping, where it'll self-correct to move to the right after it made a mistake of grasping too far to the left.

And when trying to put the sponge into the bag, we'll also see it just change strategies completely. So here it's trying to, in some ways, kind of shove the sponge into the bag and it's doing so unsuccessfully. And now the high-level policy is going to tell it to instead try to release the sponge and sort of kind of poke it into the bag instead. And this helps it get it into the bag more successfully.

And as a result of the robot's ability to self-correct from just this language supervision, we find that the robot is better overall at doing long-horizon tasks. This video is pretty long because the task is quite challenging, so I won't play all of it. But we get a sense that despite this task being quite challenging and having all sorts of scenarios that we don't necessarily have demonstration data for, we find that by leveraging this very cheap language supervision, the robot is able to perform the task a lot more successfully, even though this task is quite long.

Cool. Then there's one more thing I wanted to highlight from the system, which is that instead of just correcting after the robot has made a mistake, we can also actually proactively correct the robot when we think it might make a mistake in the future. This is a different task that we train the robot to do, which is to make trail mix. The grad students were quite happy about all the trail mix that ended up in the lab as a result of this. We see that right here, where I paused the video, it looks like the robot is actually about to accidentally pour a whole bunch of peanuts onto the table because the scoop is behind the bag instead of inside the bag.

Right here, because we noticed that it looks like it might be about to make a mistake, we can intervene and instead of telling it to continue by moving the scoop into the bag and presumably then trying to pour into the bag, we can interrupt the robot and correct it and tell it to move the left arm to the left, go higher, move the scoop into the bag, and then allow it to continue autonomously to pour into the bag. This is an example of how in real time we're able to improve the performance by proactively preventing the robot from making a mistake.

After fine-tuning, we find that it also learns this proactive corrective behavior where it notices that in this case with cranberries, it was about to make a mistake there. It didn't successfully get the scoop into the bag and then corrects itself to move the scoop into the bag successfully. Those are a number of qualitative examples. Quantitatively, we also see a large gain in performance just from verbal corrections. The dark orange bar here shows the success rate on average after fine-tuning on just language data, whereas the gray bar shows the policy before language corrections. We see a 20 percent improvement in performance. This closes a lot of the gap to this light orange bar, which is the performance if we use human corrections on the fly to override the high-level policy.

Lastly, it's worth mentioning that there's still a lot of room for improvement even when we're using oracle high-level human corrections. This suggests that the low-level policies have room for improvement. To summarize, you can productively yell at your robot to help it actually accomplish tasks. But more importantly, the robot can improve just with language feedback without demonstrations by fine-tuning this high-level policy.

This is a lot more data efficient. It's a lot more data efficient to simply tell it to pick up the sponge or move to the right than to actually collect demonstrations with teleoperation. Then of course, this approach relies on a performant instruction-following policy, and so you're not completely out of the woods in terms of having to collect some low-level data on the robot. Great. This is an example of how we can use natural supervision to augment data and get much better performance in a very cheap way.

Can we do something similar for other machine learning systems beyond robotics? Say that we wanted to perform a classification task, an image classification task based on the species of the bird, and we train a model to do this. Here I'm going to be visualizing the predictions that the model is getting right and the predictions that the model is getting wrong. If we contrast the correct predictions from the incorrect predictions, </raw_text>

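Compared to training from scratch, this does improve performance. In particular, we can do well on the tasks and scenarios seen in the training dataset. But when we evaluate generalization to unseen objects, backgrounds, and environments, there is still a fairly large gap compared to what was seen during training.

However, the internet has a very large amount of training data, so we would expect that maybe we can do better. Specifically, if we can connect the pretrained model more closely to the downstream task, we might be able to make more effective use of the rich knowledge present in internet data. So concretely, we take a vision model, and instead of only using a model trained on ImageNet classification, we use a model trained for visual question answering, and we formulate the downstream task, specifically the robot control problem, as a visual question answering problem.

Instead of having the model simply output continuous values, we turn the task into a question: what should the robot do to accomplish a task like picking up the chips or standing the bottle upright?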

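Then we likewise frame the model's output as a sequence of tokens, similar to the output of a VQA task. These tokens correspond to different actions expressed in language, such as how to translate and rotate the robot's gripper.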

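If we formulate this downstream task so that it looks like the tasks seen during pretraining, maybe the model will be able to use the pretraining data more effectively and figure out how to generalize on robot tasks, similar to how it generalizes on these VQA tasks.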

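Once we have this data, we use the same architecture, specifically a pretrained vision-language model. You can choose to fine-tune only on the robot VQA task, or on a combination of the robot task and the existing internet VQA data used when pretraining the vision-language model. The model outputs these language tokens, which are then converted into robot actions to run on the robot.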
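One common way to turn continuous robot actions into tokens, and tokens back into actions, is uniform binning per action dimension. The sketch below is an assumption-laden illustration rather than the tokenizer of any specific model: the bin count `NUM_BINS = 256`, the normalized range `[-1, 1]`, and the function names are all chosen for the example, since the talk does not say how actions are discretized.

```python
NUM_BINS = 256          # assumed vocabulary size per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def action_to_tokens(action: list[float]) -> list[int]:
    """Discretize each action dimension into one of NUM_BINS integer tokens."""
    tokens = []
    for x in action:
        x = min(max(x, LOW), HIGH)               # clip to the valid range
        frac = (x - LOW) / (HIGH - LOW)          # rescale to [0, 1]
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens: list[int]) -> list[float]:
    """Decode tokens to bin centers so the robot receives continuous commands."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


# Hypothetical 3-D gripper command (for example, a translation delta).
example_tokens = action_to_tokens([-1.0, 0.0, 1.0])
decoded = tokens_to_action(example_tokens)
print(example_tokens, decoded)
```

With this kind of mapping, the model's answer to a "what should the robot do?" question is just a short token sequence, and the round-trip error is bounded by half a bin width.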

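Essentially, we're treating robot control as a visual question answering problem and defining tokens that correspond to robot actions. We no longer call this fine-tuned model a vision-language model but a vision-language-action model, because some of the tokens now represent actions. Now, if we go back to this example, how does it compare to using only a pretrained ImageNet encoder?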

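We find that a model using this vision-language-action recipe actually generalizes better than a model pretrained only on ImageNet classification. By connecting the pretrained model to the downstream task, we get a boost in generalization.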

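So what does this look like for more recent state-of-the-art models? We can also compare state-of-the-art models that use standard pretraining or no pretraining against recent vision-language-action models such as RT-2-X and OpenVLA. We do this on evaluations focused on generalization. We find that on two different robot platforms, the vision-language-action models, shown in red and green, significantly outperform on average the models that don't use this vision-language pretraining and don't use a formulation that makes the downstream task look very similar to the pretraining task.

So even for these state-of-the-art models, we again see the trend that if we connect the pretrained model to the downstream task, generalization improves significantly.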

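Coming back to the question of handling data scarcity without skimping on data: we can leverage data that already exists, data from the internet that is easy to obtain, and if we connect the pretrained model to the downstream task, we can use it more effectively. Great. Lastly, I'd like to talk about incorporating data at test time.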

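Specifically, if we are in a new situation that isn't well represented in the training data, can we adapt on the fly? I think this is a really important question, because when machine learning systems face the real world, they encounter an enormous range of objects, configurations, and scenarios. I don't think we can even hope to anticipate every possible scenario these machine learning models will face.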

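Since we can't anticipate it, maybe we can adapt after seeing more data from that situation. For example, say we're trying to open a door, and maybe it's a new door we haven't seen before. If we try to do this, we might make a mistake and need to retry.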

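It turns out this is a video of a human opening this door, and although it's subtle, the human actually does make a mistake and quickly adapts. Let's replay the video. Specifically, we see the human insert the key into the door, actually put it in the wrong place, and then pull the key back and put it in the right place.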

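Even humans adapt to mistakes. Humans are in many ways the gold standard for machine learning, so if even humans are adapting, can we develop machines that adapt in a similar way? Let's look at this in the context of a robotics problem. This is a scene the robot hasn't seen before. The robot's goal is to get over here. If it attempts the problem and makes a mistake, can it actually retry?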

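The robot only gets this first-person observation. Without any context, if it hasn't actually attempted the task yet, maybe from this observation it will try to crawl through and see what happens. Then maybe if it tries crawling and realizes it's very close to an obstacle, it should try a different strategy. With this context, together with the history of past attempts, maybe it should try something different given the current observation, like turning left or turning right.

That's exactly what we're going to do. We'll combine these recent attempts with a model that is known to make good use of such context. Specifically, in this case, we'll use a vision-language model. We pass these recent attempts and the robot's observation to the model, then have it choose a skill for the robot to execute, which then outputs actions.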

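Ideally, the vision-language model should make use of what the robot has tried before and pick an appropriate skill after some mistakes have been made. If we do this, we find that in the earlier scenario, which the robot had not seen, if we don't use the history and don't allow it to adapt from its mistakes, it tends to make the same mistake again and again. Whereas if we use in-context learning, it is able to try different things and adapt on the fly based on what it sees in this test environment.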
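A minimal sketch of the retry loop just described: the history of attempts is written into the prompt, and the model picks the next skill. Everything here is a placeholder. `stub_vlm` is a hard-coded rule standing in for a real vision-language model call, observations are strings instead of images, and the skill list and prompt format are assumptions.

```python
SKILLS = ["walk forward", "crawl", "turn left", "back up"]


def build_prompt(observation: str, history: list[tuple[str, bool]]) -> str:
    """Put the current observation and past attempts into the model's context."""
    lines = [f"Observation: {observation}", "Previous attempts:"]
    if history:
        lines += [f"- {skill}: {'succeeded' if ok else 'failed'}" for skill, ok in history]
    else:
        lines.append("- none")
    lines.append("Choose one skill from: " + ", ".join(SKILLS))
    return "\n".join(lines)


def stub_vlm(prompt: str) -> str:
    """Placeholder for a VLM: pick the first skill not reported as failed."""
    for skill in SKILLS:
        if f"- {skill}: failed" not in prompt:
            return skill
    return SKILLS[0]


def run_episode(use_history: bool, max_attempts: int = 3,
                working_skill: str = "turn left") -> list[str]:
    """Try skills until one works; optionally hide the history from the model."""
    history: list[tuple[str, bool]] = []
    chosen: list[str] = []
    for _ in range(max_attempts):
        prompt = build_prompt("obstacle directly ahead", history if use_history else [])
        skill = stub_vlm(prompt)
        ok = skill == working_skill
        chosen.append(skill)
        history.append((skill, ok))
        if ok:
            break
    return chosen


without_history = run_episode(use_history=False)
with_history = run_episode(use_history=True)
print(without_history, with_history)
```

Even with this crude stub, the contrast in the talk shows up: without the history the same failing skill is selected every time, while with the history in context the selection moves on to untried skills.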

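Similarly, here is another setting, an outdoor one. This is actually quite challenging, because there is a fairly tricky step in front of the robot. At this point in the video, the robot can't even see that its back leg is stuck on the step. It tries to walk forward. Without the history, it doesn't know that walking is unsuccessful in this situation. But with the history, it is able to figure out that it should back up instead of trying to go straight over the step.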

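We also see quantitatively that using test-time information, using the images the robot sees at test time, improves robot performance by more than 50%, both in success rate and in the time needed to complete the test scenarios. Great. So the takeaway is that in-context learning greatly improves the robot's ability to adapt, which in turn improves its resilience and performance in unseen situations.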

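There are also limitations and future work, as with all of the research I've shown. In this case, it's unclear what the best way is to combine language with low-level motor policies. And in many situations, we may not want to use a language abstraction as the way to retry and as the way to interface with the vision-language model. So there may be interesting ways to extend this.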

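In this last part, we found that incorporating data and information at test time can make up for a lack of representative training data. All of the examples I covered in the talk are examples where more data exists. More data either already exists or is easy to obtain.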

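We just need algorithms that can leverage natural supervision, pretrained models, and test-time data to effectively handle these new situations, or situations that the training data doesn't cover well. Now, I'd also mention that in each of these three directions, I think there are exciting directions for future work. I talked about one way of leveraging cheap natural language supervision.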

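But I think in the future, maybe we can operationalize entirely new learning mechanisms that use natural supervision in a general way. In addition, I showed how we can connect pretrained models to downstream tasks by making the downstream task look more like the pretraining problem. But maybe in the future, we can actually change pretraining so that it connects more easily to a variety of downstream tasks. Finally, I showed how we can adapt at test time in robotics scenarios to make up for a lack of representative training data.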

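But across many examples and applications of machine learning, we ultimately interact with humans or with other environments. Can we also allow machines in non-robotics settings to adapt on the fly and retry as they interact with people or with other environments, such as web environments?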

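Great. The other thing I wanted to mention is that I discussed a number of creative ways to use different data sources and different sources of supervision. There's also the question of what happens if we have broader training data as well. I think these ideas are still quite interesting even when you have broader training data. We've seen from the large language model regime that there are a lot of things that are quite exciting to try when you also have large training datasets.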

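In robotics, we've also started to look at this question.

In March of this year, I co-founded a company to actually see what happens when we try to scale up data and models in robotics, to try to address a broad range of real-world use cases and robot platforms. Some initial results are here: we found that even with just the data collected since March of this year, we can accomplish some pretty cool tasks.

The last thing I'll mention is that I talked about finding new forms of data, such as natural supervision or test-time data. These are actually quite broadly applicable and make the overall problem easier. However, many of our machine learning benchmarks aren't actually designed for these kinds of ideas, or for algorithms that leverage different forms of supervision or data. In some cases, the benchmarks may actually be harder than the problems they are trying to represent, because they don't necessarily allow you to use other forms of supervision or data.

Perhaps by understanding the context of the different real applications we're trying to study, we might find new and interesting sources of data or new and interesting problem settings, and make more progress overall.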

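Great. I'll end there. I want to mention that all of the work I showed was done with a really fantastic set of collaborators. I especially want to highlight the students who led the work I presented. Yoonho led the Clarify work, Lucy led the YAY Robot work, Annie, Alec, Andy, and Govind led the test-time adaptation work, and Moo Jin, Karl, and Sidd led the OpenVLA project. Happy to take questions.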

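The last thing we want to highlight in this seven-hour ICML 2024 coverage is the new position paper track, which encourages researchers to step back from individual papers and make arguments relevant to the whole field. Here is Younghyo Park arguing that automatic environment shaping is the next frontier in RL, which we think is the implicit thesis that has been developing throughout the papers and talks we explored in this episode.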

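Hi everyone, thanks for coming. My name is Younghyo Park, and I'm excited to present our position: automatic environment shaping is the next frontier in RL. This is joint work with my colleagues Gabe and Pulkit from the Improbable AI Lab at MIT. Before we start, to give you some background, both Gabe and I come from robotics.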

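As a graduate student working on robotics, I have always dreamed of a magic box that automatically creates a robot controller for me when I simply specify the robot, the environment, and the task I want. I call this magic box an automatic behavior generator. Before I move on, I want to emphasize the word "automatic" here. It means this box should be driven only by time and compute, not by human effort.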

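If this magic box were realized, it would serve as a core tool that lets robots generate behaviors automatically after being deployed in people's homes. But let me ask you all: do you think we're being a bit too ambitious? Is our dream, this magic box, too good to be true? Well, if you think about it, in some sense this is what reinforcement learning promised us.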

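Reinforcement learning, in theory, is a general, automated optimal control solver that can produce an effective controller for any MDP setup. However, from the practical perspective of people trying to use RL as a tool to train robots, that claim doesn't necessarily hold. Although RL itself requires no human effort during training, what we want to point out is that making RL work in practice does require a very heuristic, labor-intensive process.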

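This is what we call environment shaping. When an RL algorithm fails to find a solution in a practical scenario, between fixing the RL algorithm and shaping the environment to make it work, practitioners usually tend to choose the latter. The core problem with this practice is that it relies heavily on human effort. Domain knowledge of the task, intuition, and sometimes a bit of luck are essential to getting things right.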

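One very well-established example of environment shaping that you probably already know is reward shaping. We all know that RL agents love to hack the reward whenever they can, so engineers usually go through a process of shaping the reward to prevent that. In fact, I would say this is the biggest reason some people in our community dislike RL so much. I completely understand: the process of reward shaping is definitely not fun.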

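Unfortunately, what I want to point out today is that reward is not the only thing we usually shape. Robotics engineers carefully shape almost every component of the environment to make RL work in practice. And currently, the only known effective optimizer for this problem is graduate student descent, a process that relies entirely on human effort. With all that said, what am I arguing today?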

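First, I think the community should start prioritizing research on automating the heuristic process of environment shaping. At the same time, we also need better RL algorithms that don't need heuristic environment shaping at all. To that end, I think we should benchmark our RL algorithms on unshaped environments, without any task-specific heuristics.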

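To better support our argument, from here on I'll try to give you some examples of the heavy heuristics involved in popular robotics RL environments, and show how important they are for making RL work. As the example environment for our analysis, we chose IsaacGymEnvs, a modern benchmark suite containing a variety of robotics tasks. Let's first talk about action space shaping.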

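In robotics, action space shaping is the process of choosing how to convert the actions predicted by the policy into actual commands that can be sent to the motors. An unshaped action space looks very simple: we just let the policy directly predict feasible motor commands. However, most RL environments apply a bunch of task-specific heuristics to shape the policy output before passing it to the motors.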

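For example, the sample code you just saw applies a variety of scaling, clipping, moving-average filters, and a PD controller at the end to finally convert the policy output into motor commands. The problem with this kind of shaping is that it is not only very task-specific, but it also introduces a bunch of extra knobs and hyperparameters to tune.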
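The sample code shown in the talk is not part of this transcript, so the following is only a sketch of what such a shaping pipeline typically looks like for a single joint. The class name `ActionShaper`, the default gains, and the use of an exponential moving average as the smoothing filter are all assumptions; they are not taken from IsaacGymEnvs.

```python
from dataclasses import dataclass


@dataclass
class ActionShaper:
    """Turns a raw policy output into a motor torque for one joint."""

    action_scale: float = 0.5   # scaling knob
    clip: float = 1.0           # clipping knob
    smoothing: float = 0.8      # moving-average knob (weight on previous target)
    kp: float = 20.0            # PD proportional gain
    kd: float = 0.5             # PD derivative gain
    prev_target: float = 0.0

    def __call__(self, policy_output: float, joint_pos: float, joint_vel: float) -> float:
        clipped = max(-self.clip, min(self.clip, policy_output))
        target = self.action_scale * clipped
        # Exponential moving average over successive position targets.
        target = self.smoothing * self.prev_target + (1.0 - self.smoothing) * target
        self.prev_target = target
        # PD controller converts the position target into a torque.
        return self.kp * (target - joint_pos) - self.kd * joint_vel


def unshaped(policy_output: float) -> float:
    """Unshaped action space: the policy output is the motor command."""
    return policy_output


shaper = ActionShaper(action_scale=0.5, clip=1.0, smoothing=0.5, kp=10.0, kd=1.0)
first = shaper(3.0, joint_pos=0.0, joint_vel=0.0)
second = shaper(3.0, joint_pos=0.25, joint_vel=1.0)
print(first, second, unshaped(3.0))
```

Even this single-joint sketch exposes five tunable numbers, which is the "extra knobs and hyperparameters" cost the speaker points to.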

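Unfortunately, this action space shaping is a necessary evil for RL algorithms. We tested this and found, for example, that PPO completely fails to solve these tasks if we remove the shaping. Our findings are similar for the observation space. Observation space shaping is basically a feature engineering problem: selecting relevant states from those available in the simulator to create the policy's observation.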

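For example, for the task of opening a door with a manipulator, the unshaped observation space would be a simple concatenation of every raw simulation state available. However, typical RL environments go far beyond this simple concatenation. They introduce multiple hand-designed, task-specific terms, and they often convert states with special properties, such as rotations, into different representations that are known to be easier for neural networks to process.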
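As a hedged illustration of the contrast above, the sketch below builds an unshaped and a shaped observation for a hypothetical door-opening state. The state keys, the relative-position and distance terms, and the sine/cosine encoding of the door angle are assumptions chosen as typical examples of hand-designed features; they are not taken from any particular benchmark.

```python
import math


def unshaped_obs(state: dict) -> list[float]:
    """Unshaped: plain concatenation of the raw simulator state."""
    return [*state["gripper_pos"], *state["handle_pos"], state["door_angle"]]


def shaped_obs(state: dict) -> list[float]:
    """Shaped: hand-designed, task-specific terms plus a re-encoded rotation."""
    # Task-specific term: where the handle is relative to the gripper.
    rel = [h - g for g, h in zip(state["gripper_pos"], state["handle_pos"])]
    # Task-specific term: scalar distance to the handle.
    dist = math.hypot(*rel)
    # Rotation re-encoded as (sin, cos) to avoid the wrap-around at +/- pi.
    angle = state["door_angle"]
    return [*rel, dist, math.sin(angle), math.cos(angle)]


# Hypothetical simulator state for illustration.
state = {
    "gripper_pos": (0.0, 0.0, 0.0),
    "handle_pos": (3.0, 4.0, 0.0),
    "door_angle": 0.0,
}
raw = unshaped_obs(state)
shaped = shaped_obs(state)
print(raw, shaped)
```

Each of these choices encodes human knowledge about the task, which is exactly the kind of effort the speaker argues should be automated or made unnecessary.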

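This kind of process is also very important for making RL algorithms work in practice. We can break RL simply by removing these hand-designed terms. Although I'm skipping the other examples of environment shaping due to time constraints, you can check out our paper for more comprehensive examples. Now that we understand the details of environment shaping and how it affects RL performance, let's talk about how to automate this environment shaping process.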

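Automating environment shaping is a challenging problem for many reasons. One major issue is that there is no compact way to parameterize the many different ways an environment can be shaped. If we assume everything has a fixed functional form, we can try to extract the coefficients and run some classic hyperparameter optimization over them. But that is a very limited way to represent these shaping functions. So people have recently started to consider a more flexible way to represent these shaping operators.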

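One of them is to use Python code itself as the representation of these functions. This lets us treat environment shaping as a code optimization problem, using large language models. The paper called Eureka is a good example, showing how to use a large language model as a sampling-based optimizer to automate the reward shaping process.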
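The "shaping as code optimization" loop can be sketched as propose, evaluate, keep the best, repeat. The sketch below is a toy under heavy assumptions: `propose` is a random perturbation standing in for an LLM call, `evaluate` is a closed-form proxy standing in for a full RL training run, and the template, target value, and function names are all hypothetical.

```python
import random

# Candidate shaping functions are represented as Python source strings.
TEMPLATE = "def shape_action(a):\n    return {scale} * max(-1.0, min(1.0, a))\n"


def propose(best_scale: float, rng: random.Random, num_samples: int = 4):
    """Stand-in for the LLM: sample candidate programs near the current best."""
    scales = [round(best_scale * rng.uniform(0.5, 1.5), 3) for _ in range(num_samples)]
    return [(s, TEMPLATE.format(scale=s)) for s in scales]


def evaluate(source: str) -> float:
    """Stand-in for an RL training run.

    Scores how close the shaped command for a saturated policy output is to
    a hypothetical ideal command of 0.3 (higher is better).
    """
    namespace: dict = {}
    exec(source, namespace)  # a real system would sandbox generated code
    return -abs(namespace["shape_action"](2.0) - 0.3)


def optimize(rounds: int = 5, seed: int = 0):
    """Sampling-based search: keep the best-scoring program across rounds."""
    rng = random.Random(seed)
    best_scale = 1.0
    best_source = TEMPLATE.format(scale=best_scale)
    best_score = evaluate(best_source)
    for _ in range(rounds):
        for scale, source in propose(best_scale, rng):
            score = evaluate(source)
            if score > best_score:
                best_scale, best_source, best_score = scale, source, score
    return best_source, best_score


initial_score = evaluate(TEMPLATE.format(scale=1.0))
best_source, best_score = optimize()
print(best_source, best_score)
```

In a real pipeline each `evaluate` call would be an expensive training run, which is part of why searching over several shaping components at once is hard.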

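So we ran some experiments to see whether this proposed approach of automating with LLMs can extend to other shaping components. As you can see here, models like GPT-4 are able to successfully shape the action and observation spaces, performing similarly to humans. Interestingly, however, when we ask GPT to shape multiple components at the same time, performance drops sharply.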

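This could be a critical issue, because our experiments found that optimizing individual components sequentially, one at a time, often leads to locally optimal performance. With all that said, I believe we still have a long way to go toward fully automating the environment shaping process. Now that we've discussed the various aspects of environment shaping, let's talk about the path forward. Recall that the research priorities I'm advocating are automating environment shaping or developing better RL algorithms.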

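To support both of these research directions, we created a codebase that basically contains a suite of unshaped robotics environments for people to test their RL algorithms on, along with good APIs and tools to facilitate research on automating environment shaping. Before I end my talk, I'd like to discuss the objections people might have to our position.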

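Going back to the beginning of my talk, I shared my dream of creating a magic box that automatically generates closed-loop controllers for robots. Then I implied that reinforcement learning will power this magic box in the future. However, I think some people may disagree with that. Especially given the renewed popularity of manual data collection and imitation learning, some people may think that our dream of a magic box will be realized not by automating RL, but by training some giant foundation model that consumes all of the datasets these companies are collecting.

However, I still believe in the power of RL as a tool for generating robust, generalizable, and especially superhuman behaviors that cannot easily be achieved through imitation learning.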

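Moreover, behaviors generated through an RL pipeline can also be used to train these foundation models. So I think making RL easier to use will create a virtuous data cycle for training better embodied intelligence. With that, I'd like to conclude today's talk, and I'm excited to engage in stimulating discussions about our position. Thank you.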

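That's a wrap on ICML 2024 Part 1, our coverage of generative video world simulation, diffusion, vision, reinforcement learning, and robotics. We're busy preparing for Latent Space Live at NeurIPS 2024 in Vancouver, so grab your tickets at lu.ma/lslive, and we look forward to seeing you there.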
