Sora is capable of understanding object permanence and maintaining the presence of objects in the scene even when they are occluded. This is achieved through large-scale diffusion training, which allows the model to learn the underlying geometry and interactions in complex scenes without any hard-coded inductive biases.
Genie differs from other text-to-video models by allowing frame-by-frame interaction. While other models generate entire video clips based on a text prompt, Genie enables users to take sequential actions within the generated environment, making it a foundational world model.
The main challenge in evaluating video generation models is that moving content cannot be assessed at a glance. Unlike images, which can be laid out in a grid and compared quickly, videos require more detailed and time-consuming individual viewing.
DeCAF, or Deep Convolutional Activation Features, was a foundational model in computer vision that democratized access to deep learning techniques. It demonstrated the effectiveness of pre-trained models across a wide range of tasks and was one of the first to show how deep learned representations generalize beyond their training data.
VQ-BeT uses a Vector-Quantized Variational Autoencoder (VQ-VAE) to quantize continuous action data into a discrete representation, which is then treated as tokens in an LLM-style sequence model. This lets the model predict and generate behaviors conditioned on current observations and high-level task descriptions.
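To make the discretization step concrete, here is a minimal PyTorch-style sketch of nearest-neighbor vector quantization of actions against a learned codebook. All names, dimensions, and the codebook size are illustrative assumptions, not the VQ-BeT implementation (which uses a residual VQ-VAE with a learned encoder/decoder):

```python
# Illustrative sketch only: quantize continuous action vectors against a
# learned codebook, yielding discrete token indices that a GPT-style policy
# could then predict autoregressively.
import torch
import torch.nn as nn

class ActionQuantizer(nn.Module):
    def __init__(self, action_dim: int = 7, codebook_size: int = 256):
        super().__init__()
        # Learned codebook of prototype action embeddings.
        self.codebook = nn.Embedding(codebook_size, action_dim)

    def forward(self, actions: torch.Tensor):
        # actions: (batch, action_dim) continuous robot actions.
        # Pick the nearest codebook entry for each action (L2 distance).
        dists = torch.cdist(actions, self.codebook.weight)  # (batch, codebook_size)
        tokens = dists.argmin(dim=-1)                        # discrete "action tokens"
        quantized = self.codebook(tokens)                    # reconstruction used for the VQ loss
        return tokens, quantized

quantizer = ActionQuantizer()
tokens, quantized = quantizer(torch.randn(4, 7))
print(tokens.shape, quantized.shape)  # torch.Size([4]) torch.Size([4, 7])
```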
The core idea behind YAY Robot is to use high-level language feedback to improve a robot's hierarchical policy. Verbal corrections from a human are used to fine-tune the high-level policy so it can correct mistakes and learn new strategies, significantly improving the robot's performance on long-horizon tasks without requiring extensive labeled data.
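As a rough illustration of that training signal, the toy sketch below fine-tunes a small high-level policy on (observation, corrected subgoal) pairs while the low-level controller is left untouched. The subgoal set, network, and data here are hypothetical and are not from the YAY Robot codebase:

```python
# Hedged, toy illustration of fine-tuning a high-level policy from verbal
# corrections; everything here (subgoals, shapes, model) is made up.
import torch
import torch.nn as nn

SUBGOALS = ["pick up the bag", "open the ziploc", "insert the item"]  # toy subgoal vocabulary

# High-level policy: observation vector -> distribution over language subgoals.
high_level = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, len(SUBGOALS)))
optimizer = torch.optim.Adam(high_level.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A verbal correction pairs the observation at intervention time with the
# subgoal the human actually wanted the robot to pursue.
corrections = [(torch.randn(10), SUBGOALS.index("open the ziploc"))]

for obs, label in corrections:
    logits = high_level(obs)                       # (num_subgoals,)
    loss = loss_fn(logits.unsqueeze(0), torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # only the high-level policy is updated
```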
The position paper argues that the reinforcement learning (RL) community should prioritize research on automating the heuristic process of environment shaping. This includes developing better RL algorithms that don't require manual shaping and creating benchmarks on unshaped environments to facilitate this research.
VideoPoet uses a large language model (LLM) architecture to generate videos, while Sora is based on diffusion models. VideoPoet is more modular, supporting tasks like text-to-video, image-to-video, and video-to-audio, and operates in a latent space to improve efficiency and flexibility.
Flow matching in the VQ-BeT model ensures that the predicted actions are consistent with the observed data. By using a quantized representation of actions, the model can learn to predict the most likely future states and actions, making it more robust and data-efficient.
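For readers unfamiliar with the objective itself (Ricky Chen's flow matching talk is covered in Part 2 below), here is a minimal, generic sketch of a flow matching loss with straight-line interpolation paths. It illustrates the general technique, not VQ-BeT's training code:

```python
import torch
import torch.nn as nn

# Generic flow matching loss under linear interpolation paths (illustration
# only): v_theta learns the velocity field that transports noise x0 toward
# data x1; sampling later integrates dx/dt = v_theta(x, t) from t=0 to t=1.
def flow_matching_loss(v_theta: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                   # noise sample
    t = torch.rand(x1.shape[0], 1)              # per-example time in [0, 1]
    xt = (1 - t) * x0 + t * x1                  # point on the straight path
    target_velocity = x1 - x0                   # constant velocity of that path
    pred = v_theta(torch.cat([xt, t], dim=-1))  # predicted velocity at (xt, t)
    return ((pred - target_velocity) ** 2).mean()

# Tiny usage example with a toy MLP velocity field on 2-D data.
v_theta = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
loss = flow_matching_loss(v_theta, torch.randn(16, 2))
loss.backward()
```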
Genie consists of a video tokenizer, a latent action model, and a dynamics model. The video tokenizer converts video frames into discrete tokens, the latent action model infers a discrete latent action describing the change between consecutive frames, and the dynamics model generates future frame tokens from past tokens and actions, enabling frame-by-frame controllability.
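A toy sketch of that three-part dataflow is below; the modules, shapes, and vocabulary sizes are stand-ins chosen for brevity, not the actual Genie architecture (which uses spatiotemporal transformers and a VQ-VAE tokenizer):

```python
# Toy stand-ins for Genie's three components; shapes and names are illustrative.
import torch
import torch.nn as nn

VOCAB, N_ACTIONS, TOKENS_PER_FRAME = 256, 8, 16

class VideoTokenizer(nn.Module):
    """Maps a frame to a short sequence of discrete tokens (stand-in for a VQ tokenizer)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 64 * 64, TOKENS_PER_FRAME * VOCAB)

    def forward(self, frame):                        # frame: (B, 3, 64, 64)
        logits = self.proj(frame.flatten(1)).view(-1, TOKENS_PER_FRAME, VOCAB)
        return logits.argmax(-1)                     # (B, TOKENS_PER_FRAME) token ids

class LatentActionModel(nn.Module):
    """Infers a discrete latent action explaining the change between two frames."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(2 * 3 * 64 * 64, N_ACTIONS)

    def forward(self, prev_frame, next_frame):
        x = torch.cat([prev_frame.flatten(1), next_frame.flatten(1)], dim=-1)
        return self.head(x).argmax(-1)               # (B,) latent action ids

class DynamicsModel(nn.Module):
    """Predicts the next frame's tokens from current tokens plus a latent action."""
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB, 64)
        self.action_emb = nn.Embedding(N_ACTIONS, 64)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, tokens, action):
        h = self.token_emb(tokens) + self.action_emb(action)[:, None, :]
        return self.head(h).argmax(-1)               # (B, TOKENS_PER_FRAME) next-frame tokens

tokenizer, lam, dynamics = VideoTokenizer(), LatentActionModel(), DynamicsModel()
frame_t, frame_t1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
next_tokens = dynamics(tokenizer(frame_t), lam(frame_t, frame_t1))
```

At inference time, a user-chosen action id takes the place of the inferred latent action, which is what makes the rollout controllable frame by frame.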
*Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track, friend of the pod Nathan Lambert, who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now — use code DISCORDGANG if you need it. See you in Vancouver!*
We’ve been sitting on our ICML recordings for a while (from today’s first-ever SOLO guest cohost, Brittany Walker), and in light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it would be a good time to drop part one, which had been gearing up to be a deep dive into the state of generative video worldsim, with a seamless transition to vision (the opposite modality), and finally robots (their ultimate application).
Part 1: Sora, Genie, and the field of Generative Video World Simulators
Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode:
Something that is often asked about Sora is how much inductive bias was introduced to achieve these results. Bill references the same principle raised by Hyung Won Chung of the o1 team: “sooner or later those biases come back to bite you”.
We also recommend these reads from throughout 2024 on Sora.
Lilian Weng’s literature review of Video Diffusion Models
Sora API leak
Estimates of 100k-700k H100s needed to serve Sora (not Turbo)
Artist guides on using Sora for professional storytelling
Google DeepMind had a remarkably strong presence at ICML on Video Generation Models, winning TWO Best Paper awards for:
Genie: Generative Interactive Environments (covered in oral, poster, and workshop)
VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website)
We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale.
Part 2: Generative Modeling and Diffusion
Since 2023, Sander Dieleman’s perspectives (blogpost, tweet) on diffusion as “spectral autoregression in the frequency domain” while working on Imagen and Veo have caught the public imagination, so we highlight his talk:
Then we go to Ben Poole for his talk on **Inferring 3D Structure with 2D Priors**, including his work on NeRFs and DreamFusion:
Then we investigate two flow matching papers - one from the Flow Matching co-authors - Ricky T. Q. Chen (FAIR, Meta)
And how it is implemented in Stable Diffusion 3 with Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Our last hit on Diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
Part 3: Vision
The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”.
Lucas Beyer’s talk on “Vision in the age of LLMs — a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma.
We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.
Part 4: Reinforcement Learning and Robotics
We segue from vision into robotics with the help of Ashley Edwards, whose work on both the Gato and Genie teams at DeepMind is summarized in Learning actions, policies, rewards, and environments from videos alone.
Brittany highlighted two poster session papers:
Behavior Generation with Latent Actions
We also recommend Lerrel Pinto’s On Building General-Purpose Robots
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
However, we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks on
"What robots have taught me about machine learning")
developing robot generalists)
robots that adapt autonomously)
how to give feedback to your language model)
special mention to PI colleague Sergey Levine on Robotic Foundation Models)
We end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL).
Timestamps
[00:00:00] Intros
[00:02:43] Sora - Bill Peebles
[00:44:52] Genie: Generative Interactive Environments
[01:00:17] Genie interview
[01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation
[01:30:51] VideoPoet interview - Dan Kondratyuk
[01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale
[02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models
[03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors
[03:30:30] Ricky Chen - Flow Matching
[04:00:03] Patrick Esser - Stable Diffusion 3
[04:14:30] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
[04:27:00] Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
[04:39:00] ICML Test of Time winner: DeCAF
[05:03:40] Lucas Beyer: “Vision in the age of LLMs — a data-centric perspective”
[05:42:00] Ashley Edwards: Learning actions, policies, rewards, and environments from videos alone
[06:03:30] Behavior Generation with Latent Actions interview
[06:09:52] Chelsea Finn: "What robots have taught me about machine learning"
[06:56:00] Position: Automatic Environment Shaping is the Next Frontier in RL