Latent Space wanted to provide more industry-relevant content and a year-in-review recap from experts, addressing the lack of such talks in academic conference coverage.
Computer vision was the most requested domain by attendees, leading to a focus on vision-related talks and trends.
Roboflow announced a $40 million Series B funding round, led by Google Ventures, and their Supervision library surpassed PyTorch's Vision library in popularity.
Sora is a video generation model that extends diffusion models from images to videos, producing high-quality 1080p, one-minute-long videos with realistic details, though it lacks a formal paper and access is limited.
SAM2 extends SAM's capabilities to video segmentation by introducing a hierarchical encoder that speeds up inference sixfold and uses a memory bank to cross-attend features from past frames for real-time video segmentation.
DETRs are showing Pareto improvements over YOLOs due to advancements like RT-DETR, LW-DETR, and D-FINE, which optimize transformer encoders, leverage pre-training, and introduce efficient loss functions, achieving higher accuracy with similar latency.
LLMs struggle because their vision encoders, often initialized with CLIP, lack fine-grained detail extraction capabilities, as CLIP doesn't need such details for its primary task of matching images to captions.
The MMVP paper identifies that LLMs fail on tasks requiring fine-grained visual details, creating a benchmark of hard images for these models by finding pairs similar in CLIP space but dissimilar in DINOv2 space.
Florence 2 incorporates spatial hierarchy and semantic granularity by training on diverse annotations, including region-text pairs and descriptive paragraphs, to create features that can both detect objects and reason about them semantically.
PaliGemma 2 introduces location tokens and prefix loss to improve vision-language tasks, achieving state-of-the-art results on the MMVP benchmark, outperforming other models like ChatGPT and LLaVA.
AIMv2 simplifies the training process by autoregressively learning to reconstruct images, combining image tokens with text tokens in a scalable way, achieving high performance on tasks like object detection without requiring extensive annotations.
Foundation models struggle with object detection because the architectures are highly specialized, and until recently, real-time detectors like YOLO didn't benefit from pre-training, making it harder for generalist models to compete.
Moondream focuses on creating vision-language models that can run anywhere, especially on edge devices, with capabilities like open vocabulary object detection, captioning, and pointing, optimized for real-time and low-resource environments.
Moondream's 0.5B model is created by pruning a 2B parameter model while retaining performance across benchmarks, allowing developers to deploy smaller models tailored to specific tasks without losing accuracy.
Vision-language models struggle with gauge reading because training data is biased toward product images where gauges are always set to zero, lacking the variability needed to learn fine-grained details like needle positions.
Moondream uses a chain-of-thought approach to break down tasks into subtasks, improving performance on tasks like gauge reading by teaching the model to reason step-by-step about the image, such as identifying scales and counting ticks.
Welcome to Latent Space Live, our first mini-conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co-host.
When we were thinking of ways to add value to our academic conference coverage, we realised that there was a lack of good talks just recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted and then invited the best speakers in the Latent Space Network to cover each field. 200 of you joined us in person throughout the day with over 2,200 watching live online.
Our second featured keynote is The Best of Vision 2024 with Peter Robicheaux and Isaac Robinson of Roboflow, with a special appearance from Vik Korrapati of Moondream. When we did a poll of our attendees, the highest interest domain of the year was vision. And so our first port of call was our friends at Roboflow.
Joseph Nelson helped us kickstart our vision coverage in Episode 7 last year, and this year came back as a guest host with Nikhila Ravi of Meta to cover Segment Anything 2. Roboflow have consistently been the leaders in open source vision models and tooling, with their Supervision library recently eclipsing PyTorch's Vision library, and Roboflow Universe hosting hundreds of thousands of open source vision datasets and models.
They have since announced a $40 million Series B, led by Google Ventures. Woo-hoo! This is the year that vision language models became mainstream, with every model from GPT-4o to o1, to Claude 3, to Gemini 1 and 2, to Llama 3.2, to Mistral's Pixtral, to AI2's Molmo, going multimodal.
We asked Peter and Isaac to highlight the best work in computer vision for 2024, and they blew us away with a complete overview. As a special bonus, we also got a bonus talk from Vik Korrapati of Moondream, who gave an incredible talk at this year's AI Engineer World's Fair on his tiny 0.5 billion parameter pruned vision language model that absolutely slaps.
As always, don't forget to check the show notes for the YouTube link to their talk as well as their slides. Watch out and take care. Hi, we're Isaac and Peter from Roboflow, and we're going to talk about the best papers of 2024 in computer vision. So for us, we defined best as what made the biggest shifts in the space. And to determine that, we looked at what are some major trends that happened in 2024
and what papers most contributed to those trends. So I'm going to talk about a couple trends, Peter's going to talk about a trend, and then we're going to hand it off to Moondream. So the trends that I'm interested in talking about are a major transition from models that run on a per-image basis to models that run using the same basic ideas on video, and then also how DETRs are starting to take over the
real-time object detection scene from the YOLOs, which have been dominant for years. So as a highlight, we're going to talk about Sora, which, from my perspective, is the biggest paper of 2024, even though it came out in February.
Yeah, yeah. So, Sora is just a blog post. So, I'm going to fill it in with details from replication efforts, including Open-Sora and related work such as Stable Video Diffusion. And then we're also going to talk about SAM2, which applies the SAM strategy to video. And then the improvements in 2024 to DETRs that are making them a Pareto improvement over YOLO-based models.
So to start this off, we're going to talk about the state of the art of video generation at the end of 2023, MagVIT. MagVIT is a discrete video tokenizer akin to VQGAN, but applied to video sequences. And it actually outperforms state of the art handcrafted video compression frameworks in terms of
the bitrate versus human preference for quality. And videos generated by autoregressing on these discrete tokens generate some pretty nice stuff, but only up to about five seconds in length, and not super detailed. And then suddenly, a few months later, we have this, which when I saw it was totally mind-blowing to me: 1080p, a whole minute long. We've got light reflecting in puddles that's reflective.
It reminds me of those RTX demonstrations for next generation video games such as Cyberpunk, but with better graphics. You can see some issues in the background if you look closely, but as with a lot of these models, the issues tend to be things that people aren't going to pay attention to unless they're looking for them. In the same way that six fingers on a hand is a giveaway you're not going to notice unless you're looking for it.
So, yeah, as we said, Sora does not have a paper. So, we're going to be filling it in with context from the rest of the computer vision scene attempting to replicate these efforts. So, the first step: you have an LLM caption a huge amount of videos. This is a trick that they introduced in DALL-E 3, where they train an
image captioning model to just generate very high quality captions for a huge corpus, and then train a diffusion model on that. Sora and the replication efforts also show a bunch of other steps that are necessary for good video generation, including filtering by aesthetic score and filtering to make sure the videos have enough motion, so the generator isn't learning to just generate static frames.
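To make that curation step concrete, here is a minimal, hypothetical sketch of the kind of filtering pipeline described above; the captioner and aesthetic scorer are placeholder stand-ins, not the actual models used by Sora or Open-Sora.

```python
# Hypothetical data-curation sketch: caption every clip, then keep only clips
# that pass an aesthetic-score threshold and contain enough motion.
import numpy as np

def caption_clip(frames: np.ndarray) -> str:
    # Placeholder for a DALL-E-3-style recaptioning model.
    return "a placeholder caption"

def aesthetic_score(frames: np.ndarray) -> float:
    # Placeholder for a learned aesthetic predictor; here just mean brightness.
    return float(frames.mean() / 255.0)

def motion_score(frames: np.ndarray) -> float:
    # Mean absolute frame-to-frame difference; near zero for static clips.
    return float(np.abs(np.diff(frames.astype(np.float32), axis=0)).mean())

def curate(clips, min_aesthetic=0.3, min_motion=2.0):
    dataset = []
    for frames in clips:  # frames: (T, H, W, C) uint8 array
        if aesthetic_score(frames) < min_aesthetic:
            continue
        if motion_score(frames) < min_motion:
            continue  # drop clips that are basically static frames
        dataset.append({"video": frames, "caption": caption_clip(frames)})
    return dataset
```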
So, then we encode our video into a series of space-time latents. Once again, this is very sparse in details, so looking at the replication-related works: Open-Sora uses MagVIT-v2 itself to do this, but swaps out the discretization step for a classic VAE autoencoder framework.
They show that there's a lot of benefit from getting the temporal compression, which makes a lot of sense, as sequential frames in videos have mostly redundant information. So by compressing in the temporal dimension, you allow the latent to hold a lot more semantic information while avoiding that duplication.
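As a rough illustration of what "space-time latents" means in practice, here is a toy 3D-VAE-style encoder that downsamples time and space together; the layer sizes and strides are made up for the example and are not MagVIT-v2's or Open-Sora's actual architecture.

```python
# Toy sketch of a space-time encoder: 2x temporal and 4x spatial compression.
import torch
import torch.nn as nn

class TinySpaceTimeEncoder(nn.Module):
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        # stride (2, 4, 4): downsample time by 2 and each spatial dim by 4.
        self.down = nn.Conv3d(in_ch, 64, kernel_size=(3, 7, 7),
                              stride=(2, 4, 4), padding=(1, 3, 3))
        self.to_latent = nn.Conv3d(64, latent_ch, kernel_size=1)

    def forward(self, video):          # video: (B, C, T, H, W)
        return self.to_latent(torch.relu(self.down(video)))

enc = TinySpaceTimeEncoder()
video = torch.randn(1, 3, 16, 256, 256)   # 16 frames of 256x256 RGB
latents = enc(video)
print(latents.shape)                       # torch.Size([1, 8, 8, 64, 64])
```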
So we've got our space-time latents, possibly via some 3D VAE, presumably a MagVIT V2. And then you throw it into a diffusion transformer. So I think it's personally interesting to note that OpenSora is using a MagVIT V2, which originally used an autoregressive transformer decoder to model the latent space. But
is now using a diffusion transformer. So it's still a transformer happening. Just the question is, is it parameterizing the stochastic differential equation? Is it parameterizing a conditional distribution via autoregression?
It's also worth noting that most diffusion models today, the very high performance ones, are switching away from the classic DDPM, denoising diffusion probabilistic modeling, framework to rectified flows. Rectified flows have a very interesting property that as they converge, they actually get closer to being able to be sampled with a single step, which means that in practice you can actually generate high quality samples much faster.
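For intuition, here is a toy sketch of rectified-flow sampling under the usual formulation: a learned velocity field is integrated from noise to data with a handful of Euler steps, and the straighter the flow, the fewer steps you need. The velocity model below is a dummy stand-in just to make the snippet run.

```python
# Toy rectified-flow sampler: integrate dx/dt = v(x, t) from noise (t=0) to data (t=1).
import torch

def sample_rectified_flow(velocity_model, shape, num_steps=4):
    x = torch.randn(shape)                 # start from pure noise at t=0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)
        x = x + velocity_model(x, t) * dt  # one Euler step along the flow
    return x

# Stand-in velocity model, only here to make the sketch runnable.
dummy_v = lambda x, t: -x
print(sample_rectified_flow(dummy_v, (2, 3, 8, 8), num_steps=1).shape)
```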
The major problem of DDPM and related models for the past four years is just that they require many, many steps to generate high-quality samples. So, naturally, the third step is throwing lots of compute at the problem. So, I never figured out how to manage to get this video to loop, but we see very little compute.
Medium compute, lots of compute. This is so interesting because the original diffusion transformer paper from Facebook actually showed that, in fact, the specific hyperparameters of the transformer didn't really matter that much. What mattered was that you were just increasing the amount of compute that the model had. So I love how in the, you know, once again, little blog post, they don't even talk about like the specific hyperparameters. They say, we're using a diffusion transformer and we're just throwing more compute at it. And this is what happens.
OpenSora shows similar results. The primary issue I think here is that no one else has 32x compute budget. So we end up with these -- we end up in the middle of the domain in most of the related work, which is still super, super cool. It's just a little disappointing considering the context. So I think this is a beautiful extension of the
framework that was introduced in '22 and '23 for these very high quality per image generation and then extending that to videos. It's awesome. And it's GA as of Monday, except no one can seem to get access to it because they keep shutting down the login.
The next paper I want to talk about is SAM2. We at Roboflow allow users to label data and train models on that data. SAM, for us, has saved our users 75 years of labeling time. We are, to the best of my knowledge, the largest SAM API that exists.
SAM also allows us to have our users train just pure bounding box regression models and use those to generate high quality masks, which has the great side effect of requiring less training data to reach a meaningful convergence. Most people are data limited in the real world, so anything that requires less data to get to a useful result is super useful.
Many of our users actually run their per-frame object detectors on every frame in a video, and so SAM2 falls into this category of taking something that really, really works and applying it to video, which has the wonderful benefit of being plug-and-play with many of our users' use cases.
We're still building out a sufficiently mature pipeline to take advantage of that, but it's in the works. So here we've got a great example. We can click on cells and then follow them. You even notice the cell goes away and comes back, and we can still keep track of it, which is very challenging for existing object trackers. High-level overview of how SAM2 works.
There's a simple pipeline here where we can provide some type of prompt, and it fills out the rest of the likely masks for that object throughout the rest of the video. So here we're giving a bounding box in the first frame, a set of positive and negative points, or even just a simple mask. I'm going to assume people are somewhat familiar with SAM. So I'm going to just give a high-level overview of how SAM works. You have an image encoder that runs on every frame.
SAM2 can be used on a single image, in which case the only difference between SAM2 and SAM is the image encoder: SAM used a standard ViT, and SAM2 replaced that with a Hiera hierarchical encoder, which gets approximately the same results but leads to six times faster inference, which is excellent, especially considering how a trend of 2023 was replacing the ViT with more efficient backbones. In the case where you're doing video segmentation, the difference is that you actually create a memory bank, and you cross-attend the features from the image encoder based on the memory bank. So the feature set that is created
is essentially-- well, I'll go more into it in a couple of slides. But we take the features from the past couple frames, plus a set of object pointers and a set of prompts, and use that to generate our new masks. Then we fuse the new masks for this frame with the image features and add that to the memory bank. I'll say more in a minute.
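A schematic of that per-frame loop might look like the following; all of the components passed in (image encoder, memory attention, mask decoder, memory encoder) are stand-ins for SAM2's modules, not Meta's actual implementation.

```python
# Sketch of a SAM2-style video loop: condition each frame's features on a small
# FIFO memory bank of recent frames and object pointers, decode a mask, then
# push the fused result back into the memory.
from collections import deque

def segment_video(frames, prompts, image_encoder, memory_attention,
                  mask_decoder, memory_encoder, max_memories=6):
    """prompts: dict mapping frame index -> user prompt (box, points, or mask)."""
    memory_bank = deque(maxlen=max_memories)   # FIFO of recent frame memories
    object_pointers = []
    masks = []
    for t, frame in enumerate(frames):
        feats = image_encoder(frame)                        # per-frame features
        conditioned = memory_attention(feats, list(memory_bank),
                                       object_pointers, prompts.get(t))
        mask, pointer = mask_decoder(conditioned)           # mask + object pointer
        memory_bank.append(memory_encoder(feats, mask))     # fuse and store
        object_pointers.append(pointer)
        masks.append(mask)
    return masks
```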
Just like SAM, SAM2 actually uses a data engine to create its dataset. They assembled a huge amount of reference data, used people to label some of it, trained the model, used the model to label more of it, and asked people to refine the predictions of the model. And then ultimately, the dataset is just created from the final output of the model on the reference data. It's very interesting. This paradigm is so interesting to me because it unifies a model and a dataset in a way that is very unique. It seems unlikely that another model could come in and have such a tight relationship with the training set.
So, brief overview of how the memory bank works. The paper did not have a great visual, so I'm going to fill in a bit more. So we take the last couple frames from our video, attend that along with the set of prompts,
that we provided. They could come from the future, they could come from anywhere in the video, as well as reference object pointers saying, by the way, here's what we've found so far. Attending to the last few frames has the interesting benefit of allowing it to model complex object motion, and by limiting the number of frames that you attend to, you manage to keep the model running in real time. This is such an interesting topic for me, because one would assume that attending to all of the frames, or having some type of summarization of all the frames, is super essential for high performance. But we see in their later ablation that that actually is not the case. So here,
just to make sure that there is some benchmarking happening, we just compared to some of the stuff that came out prior, and indeed the SAM2 strategy does improve on the state of the art. This ablation deep in the appendix was super interesting to me. We see in section C the number of memories. One would assume that increasing the count of memories would meaningfully increase performance, and we see that it has some impact, but not the type that you'd expect.
And it meaningfully decreases speed, which justifies in my mind just having this FIFO queue of memories. Although in the future, I'm super interested to see a more dedicated summarization of all of the past video, not just a stacking of the last frames. So that's
another extension of beautiful per-frame work into the video domain. The next trend I'm interested in talking about is this: at Roboflow, we're super interested in training real-time object detectors. Those are our bread and butter. And so we're doing a lot to keep track of what is actually happening in that space. We are finally starting to see something change. So for years, YOLOs have been the dominant
way of doing real-time object detection. And we can see here that they've essentially stagnated. The performance between v10 and v11 is not meaningfully different, at least in this type of high-level chart. And even from the last couple series, there's not a major change. So YOLOs have hit a plateau. DETRs have not. So...
We can look here and see the YOLO series has this plateau, and then RT-DETR, LW-DETR, and D-FINE have meaningfully changed that plateau, so that in fact the best D-FINE models are +4.6 AP on COCO at the same latency. So, three major steps to accomplish this. The first is RT-DETR, which is technically a 2023 preprint but was published officially in 2024, so I'm going to include it. I hope that's okay.
RT-DETR showed that we could actually match or outspeed YOLOs. Then LW-DETR showed that pre-training is hugely effective on DETRs and much less so on YOLOs. And then D-FINE added the types of bells and whistles that we expect from this arena. So the major improvement that RT-DETR shows was taking the
multi-scale features that DETRs typically pass into their encoder and decoupling them into a much more efficient transformer encoder. The transformer is of course quadratic complexity, so decreasing the amount of stuff that you pass in at once is super helpful for increasing your runtime or increasing your throughput. So that change basically brought us up to YOLO speed, and then they do a hardcore analysis on
benchmarking YOLOs, including the NMS step. Once you include NMS in the latency calculation, you see that in fact these DETRs are outperforming, at least at that time, the YOLOs that existed.
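The benchmarking point is simply that a YOLO's reported latency is only comparable to an end-to-end DETR if post-processing is timed too. A rough sketch of such a measurement is below, where `model` is a placeholder detector returning raw, pre-NMS predictions; torchvision's `batched_nms` is a real call, everything else is illustrative.

```python
# Time the detector and NMS together, which is the fair comparison against
# NMS-free DETRs.
import time
import torch
from torchvision.ops import batched_nms

def end_to_end_latency(model, image, iou_thresh=0.7, runs=100):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        boxes, scores, labels = model(image)        # raw, pre-NMS predictions
        keep = batched_nms(boxes, scores, labels, iou_thresh)
        _ = boxes[keep], scores[keep], labels[keep]
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs     # seconds per image
```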
Then LW-DETR goes in and suggests that in fact, in this figure, the huge boost here is from pre-training. So this is the D-FINE line, and this is the D-FINE line without pre-training. It's within range. It's still an improvement over the YOLOs, but the really huge boost comes from the benefit of pre-training. When YOLOX came out in 2021, they showed that they got much better results by having a much, much longer training time. But they found that when they did that, they actually did not benefit from pre-training. So you see in this graph from LW-DETR,
in fact, YOLOs do have a real benefit from pre-training, but it goes away as we increase the training time. The DETRs, meanwhile, converge much faster: LW-DETR trains for only 50 epochs, RT-DETR for 60 epochs. So one could assume that in fact the entire extra gain from pre-training is that you're not destroying your original weights by relying on this long training cycle.
And then LW-DETR also shows superior performance on our favorite dataset, Roboflow 100, which means that they do better in the real world, not just on COCO. Then D-FINE throws all the bells and whistles at it. YOLO models tend to have a lot of very specific, complicated loss functions. D-FINE brings those into the DETR world and shows consistent improvement on a variety of DETR-based frameworks.
Bring these all together, and we see that suddenly we have almost 60 AP on COCO while running in like 10 milliseconds. Huge, huge stuff. So we're spending a lot of time trying to build models that work better with less data, and DETRs are clearly becoming a promising step in that direction. What we're interested in seeing from the DETRs in this trend next is Co-DETR, and the models that are currently sitting at the top of the
leaderboard for large-scale inference scale really well as you switch out the backbone. We're very interested in seeing, and having people publish a paper, potentially us, on what happens if you take these real-time ones and then throw a Swin-G at it. Do we have a Pareto curve that extends from the real-time domain all the way up to the super slow but high-performance domain?
We also want to see people benchmarking on RF100 more, because that type of data is what's relevant for most users. And we want to see more pre-training, because pre-training works now. It's super cool.
All right, so in that theme, one of the big things that we're focusing on is how do we get more out of our pre-trained models. And one of the lenses to look at this through is this sort of new requirement for fine-grained visual details in the representations that are extracted from your foundation model.
So as a hook for this... oh yeah, this is just a list of all the papers that I'm going to mention. I just wanted to make sure I said the actual papers so you can find them later. Yeah, so the big hook here is that I make the claim that LLMs can't see. If you go to Claude or ChatGPT and ask it to look at this watch and tell you what time it is, it fails, right? And so you could say, like,
Maybe this is a very classic test of an LLM. But you could say, okay, maybe this image is too zoomed out, and it'll have an easier time finding these fine-grained features, like where the watch hands are pointing, if we increase the resolution. No dice. And you can say, okay, well, maybe the model just doesn't know how to tell time from knowing the position of the hands. But if you actually prompt it textually, it's very easy for it to tell the time. So this, to me, is proof that these LLMs literally cannot see the position of the watch hands; they can't see those details. So the question is sort of why. And for you Anthropic heads out there,
Claude fails too. So my first pick for best paper of 2024 in vision is this MMVP paper, which tries to investigate why LLMs don't have the ability to see fine-grained details. And so for instance, it comes up with a lot of images like this, where you ask it a question that seems very visually apparent to us, like which way is the school bus facing, and it gets it wrong. And then of course it makes up details to support its wrong claim.
And so the process by which it finds these images is sort of contained in its hypothesis for why it can't see these details.
It hypothesizes that models that have been initialized with CLIP as their vision encoder don't have fine-grained details in the features extracted using CLIP, because CLIP sort of doesn't need to find these fine-grained details to do its job correctly, which is just to match captions to images, right?
And at a high level, even if ChatGPT's vision encoder wasn't initialized with CLIP and wasn't trained contrastively at all, still, in order to do its job of captioning the image, it could do a pretty good job without actually finding the exact position of all the objects and visual features in the image.
So this paper finds a set of difficult images for these types of models. And the way it does it is it looks for embeddings that are similar in CLIP space, but far apart in DINOv2 space. So DINOv2 is a foundation model that was trained self-supervised purely on image data, and it uses a somewhat complex student-teacher framework, but essentially it patches out certain areas of the image, or crops certain areas of the image, and tries to make sure that those have consistent representations, which is a way for it to learn very fine-grained visual features. And so if you take things that are very close in CLIP space and very far apart in DINOv2 space, you get
basically pairs of images that are hard for ChatGPT and other big language models to distinguish. So if you then ask it questions about this image, well, as you can see from this chart, it's going to answer the same way for both images, right? Because from the perspective of the Vision Encoder, they're the same image.
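A rough sketch of that pair-mining idea might look like the following; the similarity thresholds are illustrative, not the paper's exact values.

```python
# Mine "CLIP-blind" pairs: nearly identical in CLIP space, far apart in DINOv2 space.
import torch
import torch.nn.functional as F

def find_clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    """clip_emb, dino_emb: (N, D) tensors of per-image embeddings."""
    clip_sim = F.normalize(clip_emb, dim=-1) @ F.normalize(clip_emb, dim=-1).T
    dino_sim = F.normalize(dino_emb, dim=-1) @ F.normalize(dino_emb, dim=-1).T
    pairs = []
    n = clip_emb.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            # Similar to CLIP, dissimilar to DINOv2 -> hard for CLIP-based VLMs.
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j))
    return pairs
```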
And so if you ask a question like how many eyes does this animal have, it answers the same for both. And all these other models, including LLaVA, do the same thing. And so this is the benchmark that they create, which is finding CLIP-blind pairs, pairs of images that are similar in CLIP space but dissimilar in DINOv2 space, and creating a dataset of multiple-choice questions based off of those.
And so how do these models do? Well, really bad. ChatGPT and Gemini do a little bit better than random guessing, but at half the performance of humans, who find these problems to be very easy.
Interestingly, LLaVA is extremely negatively correlated with this dataset. It does much, much, much, much worse than random guessing, which means that this process has done a very good job of identifying hard images for LLaVA specifically. And that's because LLaVA is basically not trained for very long and is initialized from CLIP, and so you would expect it to do poorly on this dataset.
One of the proposed solutions that this paper attempts is basically saying, okay, well, if CLIP features aren't enough, what if we also feed the language model DINOv2 features? And so it proposes two different ways of doing this. One is additive, which is basically interpolating between the two sets of features. And the other is interleaving, which is just kind of training on the combination of both sets of features.
So there's this really interesting trend when you do the additive mixture of features, where zero is all CLIP features and one is all DINOv2 features.
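In code, the two mixing strategies the following results compare are roughly as below; this assumes both encoders' features have already been projected to a shared dimension, which is my simplification rather than the paper's exact recipe.

```python
# Additive mixture: interpolate between CLIP and DINOv2 features with alpha.
# Interleaving: concatenate both token sets, doubling the visual token count.
import torch

def additive_mix(clip_feats: torch.Tensor, dino_feats: torch.Tensor,
                 alpha: float) -> torch.Tensor:
    """clip_feats, dino_feats: (num_tokens, dim); alpha=0 is all-CLIP, 1 is all-DINOv2."""
    return (1.0 - alpha) * clip_feats + alpha * dino_feats

def interleave(clip_feats: torch.Tensor, dino_feats: torch.Tensor) -> torch.Tensor:
    return torch.cat([clip_feats, dino_feats], dim=0)
```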
I think it's helpful to look at the rightmost chart first, which shows that as you increase the number of DINOv2 features, your model does worse and worse on the actual language modeling task. That's because DINOv2 features were trained in a completely self-supervised manner and completely in image space. It knows nothing about text. These features aren't really compatible with these text models. You can train an adapter all you want, but it seems that it's in such an alien language that it's a very hard optimization problem for these models to solve.
And so that kind of supports what's happening on the left, which is that, yeah, it gets better at answering these questions as you include more DINOv2 features, up to a point. But then when you oversaturate, it completely loses its ability to answer language and do language tasks. So...
You can also see with the interleaving, they essentially double the number of tokens that are going into these models and just train on both. And it still doesn't really solve the MMVP task. It gets LLaVA 1.5 above random guessing by a little bit, but it's still not close to ChatGPT or human performance, obviously.
Clearly, this proposed solution of just using DINOv2 features directly isn't going to work. Basically, what that means is that as a vision foundation model, DINOv2 on its own is going to be insufficient for language tasks.
So my next pick for best paper of 2024 would be Florence 2, which tries to solve this problem by incorporating not only this dimension of spatial hierarchy, which is to say pixel-level understanding, but also making sure to include what they call semantic granularity, where the goal is basically to have features that are
sufficient for finding objects in the image, so they have enough pixel information, but also can be talked about and can be reasoned about.
And that's on the semantic granularity axis. So here's an example of basically three different paradigms of labeling that they do. So they create a big dataset. One is text, which is just captioning. And you would expect a model that's trained only on captioning to have similar problems to ChatGPT and not have spatial hierarchy, not have
features that are meaningful at the pixel level. They add another type, which is region-text pairs, which is essentially either classifying a region, or doing object detection or instance segmentation on that region, or captioning that region. Then they have text-phrase-region annotations, which is essentially a triple: not only do you have a region that you've described, you also find
its place in a descriptive paragraph about the image, which is basically trying to introduce even more semantic understanding of these regions. For instance, if you're saying a woman riding on the road, you have to know what a woman is and what the road is and that she's on top of it. That's basically composing a bunch of objects in this visual space, but also thinking about it semantically. The way that they do this is they take, basically they just dump
features from a vision encoder straight into an encoder-decoder transformer. And then they train a bunch of different tasks, like object detection and so on, as language tasks. And I think that's one of the big things we saw in 2024: these vision language models operating on pixel space linguistically. So they introduce a bunch of new tokens to point to locations in pixel space.
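A toy version of that idea, quantizing box coordinates into a small vocabulary of location tokens appended to the text vocabulary, could look like this; the bin count and token naming are illustrative, not Florence 2's exact scheme.

```python
# Quantize a pixel-space box into location tokens so detection becomes a
# sequence-generation task.
def box_to_location_tokens(box, image_w, image_h, num_bins=1000):
    """box: (x_min, y_min, x_max, y_max) in pixels -> list of location tokens."""
    x0, y0, x1, y1 = box
    norm = [x0 / image_w, y0 / image_h, x1 / image_w, y1 / image_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]
    return [f"<loc_{b}>" for b in bins]

print(box_to_location_tokens((64, 32, 256, 192), image_w=512, image_h=512))
# ['<loc_125>', '<loc_62>', '<loc_500>', '<loc_375>']
```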
So how does it work? How does it actually do? We can see, if you look at the graph on the right, which is using the DINO detection framework, your pre-trained Florence 2 models transfer very, very well. They get 60 mAP on COCO, which is approaching state of the art. And they train
much more efficiently. So they converge a lot faster, and both of these things are pointing to the fact that they're actually leveraging their pre-trained weights effectively. So where is it falling short? These models, I forgot to mention, Florence 2 comes in 0.2 billion and 0.7 billion parameter counts, so they're very, very small in terms of being a language model. And I think that
in this framework you can see saturation. So what this graph is showing is that if you train a Florence 2 model purely on the image-level and region-level annotations, and not including the pixel-level annotations like segmentation, it actually performs better as an object detector.
And what that means is that it's not able to actually learn all the visual tasks that it's trying to learn, because it doesn't have enough capacity. So I'd like to see this paper explore larger model sizes, which brings us to our next big paper of 2024, or two papers. So PaliGemma came out earlier this year. PaliGemma 2 was released, I think, a week or two ago.
Oh, I forgot to mention, you can actually label datasets on Roboflow and train a Florence 2 model, and you can actually train a PaliGemma 2 model on Roboflow, which we got into the platform within 14 hours of release, which I was really excited about. So anyway, PaliGemma is essentially doing the same thing, but instead of doing an encoder-decoder, it just dumps everything into a decoder-only transformer model. But it also introduced the concept of location tokens to point to objects in pixel space.
PaliGemma uses Gemma as the language model, specifically Gemma 2B. PaliGemma 2 introduces using multiple different sizes of language models.
So the way that they sort of get around having to do encoder-decoder is they use the concept of prefix loss, which basically means that when it's generating tokens autoregressively, all those tokens in the prefix, which is like the image that it's looking at and like a description of the task that it's trying to do, they're attending to each other fully, full attention, which means that it can sort of
find high-level features; it's easier for the prefix to color the output of the suffix and also to just find features easily. So this is sort of an example of one of the tasks it was trained on, where you describe the task in English, and you're asking it to segment these two classes of objects, and then it finds their locations using these location tokens and finds their masks using some encoding of the masks into tokens.
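For intuition on the prefix-loss setup, here is a minimal way to build that kind of attention mask: full attention within the prefix (image tokens plus task description) and causal attention for the generated suffix. The helper is my own sketch, not PaliGemma's code.

```python
# Prefix-LM mask: prefix tokens attend bidirectionally, suffix tokens causally.
import torch

def prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Returns a (total_len, total_len) boolean mask; True = attention allowed."""
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))  # causal
    mask[:prefix_len, :prefix_len] = True   # full attention within the prefix
    return mask

print(prefix_lm_mask(prefix_len=3, total_len=5).int())
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```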
And yeah, so one of my critiques, I guess, of PaliGemma 1, at least, is that you find that performance saturates as a pre-trained model after only 300 million examples seen. So what this graph is representing is each blue dot is performance on some downstream task. And you can see that after seeing 300 million examples, it
does equally well on all of the downstream tasks that they tried it on, which was a lot, as it does after 1 billion examples, which to me also kind of suggests a lack of capacity for this model. For PaliGemma 2, you can see the results on object detection. So these were transferred to COCO.
And you can see that this sort of also points to an increase in capacity being helpful to the model. You can see as both the resolution increases and the parameter count of the language model increases, performance increases. So resolution makes sense. Obviously, it helps to find small objects in the image. But it also makes sense from another reason, which is that it kind of gives the model a thinking register, and it gives it more tokens to process when making its predictions.
But yeah, you could say, oh, 43.6, that's not that great. Florence 2 got 60. But this is not training a DINO or a DETR head on top of this image encoder. It's doing the raw language modeling task on COCO. So it doesn't have any of the bells and whistles. It doesn't have any of the fancy losses. It doesn't even have bipartite graph matching or anything like that. OK, the big result, and one of the reasons that I was really excited about this paper,
is that they blow everything else away on MMVP. I mean, 47.3, sure, that's nowhere near human accuracy, which again is 94%. But for a 2 billion parameter language model to beat ChatGPT, that's quite the achievement. And that sort of brings us to our final pick for paper of the year, which is AIMv2. So AIMv2 sort of says, OK,
maybe coming up with all these specific annotations to find features with high fidelity in pixel space isn't actually necessary, and we can come up with an even simpler and more beautiful idea for combining image tokens and text tokens in a way that's useful for language tasks.
And this is nice because it can scale. You can come up with lots more data if you don't have to come up with all these annotations, right? So the way that it works is it does something very, very similar to PaliGemma, where you have a vision encoder that dumps image tokens into a decoder-only transformer.
But the interesting thing is that it also autoregressively tries to reconstruct the image tokens with a mean squared error loss. So instead of having to come up with fancy object detection or semantic segmentation labels, you can just try to reconstruct the image and have it learn fine-grained features that way.
It does this in a beautiful way that's compatible with the PaliGemma line of thinking, which is randomly sampling a prefix length and using only that number of image tokens as the prefix, with a similar causal-with-prefix attention mask shown on the right:
doing full block attention with some randomly sampled number of image tokens to then reconstruct the rest of the image and the downstream caption for that image. And so this is the dataset that they train on. It's internet-scale data, very high quality data created by the Data Filtering Networks paper, essentially, which is maybe the best CLIP data that exists.
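Putting those pieces together, the training objective is roughly of this shape: a random prefix of image tokens gets full attention, the remaining image patches are predicted autoregressively with an MSE loss, and the caption is predicted with the usual cross-entropy. The `model` below is a stand-in, and the details are my reading of the setup rather than the released code.

```python
# Sketch of an AIMv2-style loss: pixel reconstruction plus caption prediction.
import torch
import torch.nn.functional as F

def aimv2_style_loss(model, image_patches, caption_ids):
    """image_patches: (B, N, D) raw patch targets; caption_ids: (B, L) token ids."""
    B, N, _ = image_patches.shape
    prefix_len = int(torch.randint(1, N, (1,)))            # random prefix length
    patch_preds, text_logits = model(image_patches, caption_ids, prefix_len)
    # Reconstruct only the image patches that come after the prefix.
    recon_loss = F.mse_loss(patch_preds[:, prefix_len:], image_patches[:, prefix_len:])
    # Standard next-token prediction on the caption.
    text_loss = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                                caption_ids.reshape(-1))
    return recon_loss + text_loss
```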
And we can see that this is finally a model that doesn't saturate. Even at the highest parameter count, it appears to be improving in performance with more and more samples seen. And so you can sort of think that
if we just keep bumping the parameter count and increasing the examples seen, which is the line of thinking for language models, then it'll keep getting better. So how does it actually do at finding... oh, it also improves with resolution, which you would expect. This is ImageNet classification accuracy. But yeah, it does better if you increase the resolution, which means it's actually leveraging and finding fine-grained visual features.
And so how does it actually do compared to CLIP on COCO? Well, you can see that if you slap a transformer detection head on it and train it on COCO, it gets to 60.2, which is also within spitting distance of state of the art, which means that it does a very good job of finding visual features. But you could say, okay, well, wait a second. CLIP got to 59.1, so...
how does this prove your claim at all? Because doesn't that mean CLIP, which is known to be CLIP-blind and do badly on MMVP, is able to achieve very high performance on this fine-grained visual feature task of object detection? Well,
they train on tons of data. They train on Objects365, COCO, Flickr, and everything else. And so I think that this benchmark doesn't do a great job of selling how good of a pre-trained model AIMv2 is. And we would like to see performance with fewer data examples, not trained to convergence on object detection. So seeing it in the real world, on a dataset like Roboflow 100, I think, would be quite interesting.
And I guess our final, final pick for paper of 2024 would be Moondream. So, introducing Vik to talk about that. But overall, that was exactly what I was looking for. Best of 2024, amazing job. Does anyone have questions while Vik gets set up, like vision stuff? Yeah, go ahead. While we're getting set up, hi over here. Thanks for the really awesome talk. One of the things that's been weird and surprising is that the foundation model companies,
even these multimodal LLMs, are just worse than RT-DETR at detection still. Like if you wanted to pay a bunch of money to auto-label your detection dataset, if you gave it to OpenAI or Claude, that would be a big waste. Even PaliGemma 2 is worse. So I'm curious to hear your thoughts on,
How come nobody's cracked the code on a generalist that really beats a specialist model in computer vision like they have in LLM land?
It's a very, very interesting question. I think it depends on the specific domain. For image classification, it's basically there. As AIMv2 showed, a simple attentional probe on the pre-trained features gets like 90%, which is as well as anyone does.
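For reference, an attentional probe in this context is usually a tiny attention-pooling head trained on top of frozen features; a minimal sketch, with made-up dimensions, is below.

```python
# Attention-pooling probe: a single learned query cross-attends over frozen
# patch tokens, followed by a linear classifier.
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    def __init__(self, dim=1024, num_classes=1000, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):            # (B, N, dim) frozen features
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.head(pooled.squeeze(1))     # (B, num_classes)

probe = AttentionProbe()
print(probe(torch.randn(2, 196, 1024)).shape)   # torch.Size([2, 1000])
```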
The bigger question is, why isn't it transferring to object detection, especially real-time object detection? I think in my mind, there are two answers. One is that object detection architectures are super domain specific. We see all these super, super complicated things, and it's not super easy to build something that just transfers naturally like that, whereas image classification, CLIP pre-training, transfers super, super easily.
And the other thing is, until recently, the real-time object detectors didn't even really benefit from pre-training. You see the YOLOs that are essentially saturated, showing very little difference from using a pre-trained model at all. So it's not surprising necessarily that people aren't looking at the effects of better and better pre-training on real-time detection. Maybe that'll change in the next year. Does that answer your question?
Can you guys hear me? Yeah, one thing I want to add, just to summarize basically, is that until 2024, we haven't really seen a combination of transformer-based object detectors and fancy losses. And PaliGemma suffers from the same problem, which is basically to say that these ResNet or convolutional models, they have
all these extreme optimizations for doing object detection. But essentially, I think it's kind of been shown now that convolutional models just don't benefit from pre-training and just don't have the level of intelligence of transformer models. Awesome. Hi, can you hear me? Cool. I hear you, see you. Are you sharing your screen? I might have forgotten to do that. Let me do that. Sorry. Should have done that. Here's my share screen.
Uh-oh. Classic. You might have to quit Zoom and restart. It's fine. We have a capture of your screen. I'll just make sure it's visible. So let's get to your Zoom. OK. Easy enough. I'm going to make it for you. You want to quit Zoom? No. Yeah. There you go. Perfect.
All right. Hi, everyone. My name is Vik. I've been working on Moondream for almost a year now, like Sean mentioned. I just went and looked, and it turns out the first version I released was December 29, 2023. It's been a fascinating journey. So Moondream started off as a tiny vision language model. Since then, we've expanded scope a little bit to also try and build some tooling, client libraries, et cetera, to help people really deploy it. Unlike traditional
large models that are focused on assistant-type use cases, we're laser-focused on building capabilities that developers can use to build vision applications that can run anywhere. So in a lot of cases, for vision more so than for text, you really care about being able to run on the edge, run in real time, et cetera. So that's really important.
We have different output modalities that we support. There's query, where you can ask general English questions about an image and get back human-like answers. There's captioning, which a lot of our users use for generating synthetic datasets to then train diffusion models and whatnot. We've done a lot of work to minimize hallucinations there, so that's used a lot. We have open vocabulary object detection built in, similar to a couple of more recent models like PaliGemma, et cetera, where rather than having to train a dedicated model, you can
just say, show me soccer balls in this image, or show me if there are any deer in this image, and it'll detect it. More recently, earlier this month, we released pointing capability, where if all you're interested in is the center of an object, you can just ask it to point out where that is. This is very useful when you're doing UI automation type stuff. Let's see. We have two models out right now. There is a general-purpose 2B parameter model, which
is fine if you're running on a server. It's good for our local Llama desktop friends, and it can run on flagship mobile phones, but it never really fulfilled the promise of being able to run anywhere. Last week we released a new 0.5B parameter model,
which should be seen more as a distillation target as opposed to a general purpose model. It's very good if you're running on older mobile phones or edge devices. It uses less memory, even with our not-yet-fully-optimized inference client. So the way we built our 0.5B model was to start with the 2 billion parameter model and prune it while doing continual training to retain performance.
Our objective during the pruning was to preserve accuracy across a broad set of benchmarks. So the way we went about it was to estimate the importance of different components of the model like attention heads, channels, MLP rows and whatnot using basically a technique based on the gradient. I'm not sure how much people want to know details. We'll be writing a paper about this but
Feel free to grab me if you have more questions. Then we iteratively prune a small chunk that will minimize the loss in performance, retrain the model to recover performance, and repeat. The 0.5B we released is more of a proof of concept that this is possible. I think the thing that's really exciting about this is it makes it possible for developers to build using the 2B parameter model and just explore, build their application, and then, once they're ready to deploy, figure out what exactly they need out of the model and prune those capabilities into a smaller form factor that makes sense for their deployment target.
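The Moondream write-up was still forthcoming at the time of this talk, so purely as a hedged illustration of the general kind of gradient-based importance scoring described above: accumulate a first-order saliency per weight over a few calibration batches, then prune the lowest-scoring fraction and retrain.

```python
# Illustrative first-order importance scoring and pruning mask; not Moondream's
# actual procedure, just the common |weight * gradient| saliency idea.
import torch

def importance_scores(model, calib_batches, loss_fn):
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in calib_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # How much the loss cares about each weight.
                scores[n] += (p * p.grad).abs().detach()
    return scores

def prune_lowest(scores, fraction=0.05):
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(fraction * flat.numel()))
    threshold = flat.kthvalue(k).values
    return {n: (s > threshold) for n, s in scores.items()}   # keep-masks
```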
So yeah, very excited about that. Let me talk to you folks a little bit about another problem I've been working on recently, which is similar to the clocks example we've been talking about. We had a customer reach out
who had a bunch of gauges out in the field. This is very common in manufacturing and oil and gas where you have a bunch of analog devices that you need to monitor. It's expensive to have humans look at that and monitor stuff and make sure that
The system gets shut down when the temperature goes over 80 or something. So I was like, yeah, this seems easy enough. Happy to help you distill that. Let's get it going. Turns out our model couldn't do it at all. I went and looked at other open source models to see if I could just generate a bunch of data and learn from that. That did not work either. So I was like, let's look at what the folks with hundreds of billions of dollars in market cap have to offer. And yeah, that doesn't work either.
My hypothesis is that these models are trained using a large amount of image-text data scraped from the internet, and that can be biased. In the case of gauges, most gauge images aren't gauges in the wild. They're product detail images like these, where it's always set to zero. It's paired with alt text that says something like, JIVTO pressure sensor, PSI 0 to
30 or something. The models are fairly good at picking up those details. It'll tell you that it's a pressure gauge, it'll tell you what the brand is, but it doesn't really learn to pay attention to the needle over there. That's a gap we need to address. Naturally, my mind goes to, let's use synthetic data to solve this problem. That works, but it's problematic because it turned out we needed millions of synthetic
gauge images to get to reasonable performance. And thinking about it, reading a gauge is not a zero-shot process in our minds. If you had to tell me the reading in Celsius for this real-world gauge, there's two dials on there. So first you have to figure out which one you have to be paying attention to, like the inner one or the outer one. You look at the tip of the needle, you look at what labels it's between, and you
count how many ticks and do some math to figure out what the reading probably is. So what happens if we just add that as chain of thought, to allow the model to better learn the subtasks it needs to perform to accomplish this goal? So you can see in this example, which was actually generated by the latest version of our model: okay, Celsius is the inner scale, it's between 50 and 60, there are 10 ticks, it's at the second tick. It's a little debatable here. There's a weird shadow situation going on, and the dial is off, so I don't know what the ground truth is, but it works OK. The points over there are actually grounded. I don't know if this is easy to see, but when I click on those, there's a little red dot that moves around on the image. The model actually has to predict where
those points are. I was already trying to do this with bounding boxes, but then Molmo came out with pointing capabilities, and I was like, pointing is a much better paradigm to represent this. We see pretty good results. This one's actually for clock reading; I couldn't find our chart for gauge reading at the last minute. So the light blue line is with our grounded chain of thought.
This measures... we built a clock reading benchmark of about 500 images, and this measures accuracy on that. You can see it's a lot more sample efficient when you're using the chain of thought to train the model. Yeah, another big benefit
from this approach is you can kind of understand how the model is doing it and how it's failing. So in this example the actual correct reading is 54 Celsius, and the model output 56. Not too bad, but you can actually go and see where it messed up. It got a lot of these right, except
Instead of saying it was on the seventh tick, it predicted it was the eighth tick and went with 56. Now that you know this is failing in this way, you can adjust how you're doing the chain of thought to count out each tick from 40 instead of trying to say it's the eighth tick. Or you might say, okay, I see there's that middle thing, I'll count from there instead of all the way from 40.
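Written out, the tick-counting arithmetic the chain of thought walks through is just a linear interpolation; the numbers below are reconstructed from the 54-versus-56 example above, assuming ticks counted from 40 up to 60.

```python
# Gauge reading as interpolation between the two labels the needle sits between.
def gauge_reading(lower_label: float, upper_label: float,
                  ticks_between: int, needle_tick: int) -> float:
    step = (upper_label - lower_label) / ticks_between
    return lower_label + needle_tick * step

# Reconstructed from the example: 10 ticks between 40 and 60, 2 degrees per tick.
print(gauge_reading(40, 60, 10, 7))   # 54.0 (the correct reading)
print(gauge_reading(40, 60, 10, 8))   # 56.0 (the model's off-by-one answer)
```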
So it helps a ton. The other thing I'm excited about is few-shot prompting or test time training with this. If a customer has a specific gauge that we're seeing minor errors on, they can give us a couple of examples where if it's misdetecting the needle, they can go in and correct that in the chain of thought, and hopefully that works the next time.
Now, it's an exciting approach, but we've only applied it to clocks and gauges. The real question is, is it going to generalize? Probably; there's some evidence from text models that when you train on a broad number of tasks, it does generalize. And I'm seeing some signs of that with our model as well. So in addition to the image-based chain of thought stuff, I also added some spelling-based chain of thought to help it
Better understand OCR, I guess. I don't understand why everyone doesn't do this, by the way. It's a trivial benchmark question that's very, very easy to nail. But I also wanted to support it for stuff like license plate partial matching. Like, hey, does any license plate in this image start with WHA or whatever? So yeah, that sort of worked. All right, that ends my story about the gauges. If you think about what's going on over here,
It's interesting that LLMs are showing enormous progress in reasoning, especially with the latest set of models that we've seen. But I have a feeling that VLMs are lagging behind, as we can see with these tasks that should be very simple for a human to do but that are very easy to find VLMs failing at. My hypothesis on why this is the case is because
On the internet, there's a ton of data that talks about how to reason. There's books about how to solve problems. There's books critiquing the books about how to solve problems. But humans are just so good at perception that we never really talk about it. Like maybe in art books where it's like, hey, to show that that mountain is further away, you need to desaturate it a bit or whatever. But
the actual data on how to look at images isn't really present. Also, the data we have is kind of sketchy. The best source of data we have is image-alt-text pairs on the internet, and that's pretty low quality. So yeah, I think our solution here is really just that we need to teach them how to operate on individual tasks and figure out how to scale that out. All right. Yep.
So, in conclusion, at Moondream, we're trying to build amazing VLMs that run everywhere. Very hard problem. Much work ahead, but we're making a ton of progress that I'm really excited about. If anyone wants to chat about more technical details about how we're doing this, or is interested in collaborating, please hit me up. Yeah, thanks. I always...
when people say multimodality, you know, they're talking about vision as the first among equals in all the modalities. So I really appreciate having the experts in the room.