2024.12.03 Daily AI Papers | X-Prompt improves image generation, GATE OpenING evaluates interleaved image-text generation.

2024/12/3
HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Express)

The 24 papers in this episode are:

[00:23] 🖼 X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

[00:58] 📊 GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

[01:32] 🖼 Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

[02:09] 🎥 Open-Sora Plan: Open-Source Large Video Generation Model

[02:55] 🎥 TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

[03:37] 🤖 o1-Coder: an o1 Replication for Coding

[04:12] 🤖 SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

[04:49] 🎥 VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

[05:38] 🔍 TinyFusion: Diffusion Transformers Learned Shallow

[06:19] 🔍 VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

[06:52] 🎙 FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

[07:32] 🚀 Efficient Track Anything

[08:15] 🌊 Steering Rectified Flow Models in the Vector Field for Controlled Image Generation

[08:50] 🎥 Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

[09:33] 📹 WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

[10:11] 🔍 VLSBench: Unveiling Visual Leakage in Multimodal Safety

[10:51] 🧠 VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

[11:41] 🎮 PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

[12:14] 🗣 Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input

[12:51] 🌍 INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

[13:28] 🎨 Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

[14:02] 📈 A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

[14:41] 🌐 World-consistent Video Diffusion with Explicit 3D Modeling

[15:22] 🔊 Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning

[Follow us]

You can also find us on the following platform for more information beyond the podcast content:

Xiaohongshu: AI速递