They believe in the value of sharing research findings openly to benefit the wider community, enabling experimentation and innovation. They also see it as a way to improve safety and transparency in AI models by allowing more people to analyze and contribute to their development.
To make the best image and video generation models widely available, enabling a new way of content creation for everyone while ensuring the sustainability of sharing research findings openly.
Flux introduces better positional embeddings, more hardware-efficient implementations, optimized noise schedules, and improved scaling techniques. It also offers different variants with varying licenses to cater to specific needs.
They focus on improving prompt adherence, temporal consistency, and object consistency across video cuts. Their model allows for better control over characters, objects, and settings within a single generation.
They made significant improvements in data pre-processing and pre-training, including better temporal compression and data filtering techniques. They also treated time as a first-class citizen in the model architecture.
Open-weight models allow the community to identify and address biases, improve transparency, and contribute to the overall advancement of AI. This collaborative approach helps mitigate risks and enhances the safety of the models.
Watermarking is challenging due to the ability to apply distortions to images and videos, which can break the watermarking process. However, open models allow for continuous improvements in watermarking techniques as new jailbreak methods are discovered.
They emphasize intuition, experience, and continuous feedback during training runs. Their team relies on the expertise of individuals who can quickly assess whether a training run is progressing in the right direction, which speeds up the development process.
The image model serves as a foundational base for the video model, providing diversity in styles and artistic elements that might not be captured in video data alone. It also allows for parallel development and faster progress in the video model's training.
AI models like Flux can dramatically speed up creative workflows by providing a fast feedback loop for generating visuals from ideas. However, human input is still essential for decision-making, curation, and refining the final output.
If we have an open model, there will be jailbreaks, but there will be ways to mitigate those jailbreaks. This is what we see in many other research fields. If you think about, I don't know, cryptography or something, there it's basically similar. You just improve your algorithms, then you have some people who jailbreak it, and then you improve further. Certainly, no one doubts that cryptography is really important for everything we have on the web and whenever we exchange information. And no one debates about whether
open research is good or not. Welcome to the a16z AI Podcast. I'm Derek Harris. This week, we have a very interesting discussion between a16z general partner Anjney Midha and the co-founders of a new generative AI model startup called Black Forest Labs, which they recorded live and in person in, as the company's name might suggest, Germany.
The founding team, Robin Rombach, Patrick Esser, and Andreas Blattmann, drove the research behind the Stable Diffusion models and recently started Black Forest Labs to push the envelope of image and video models and to help keep the open research torch lit. In addition to discussing their new family of models called Flux, Robin, Andreas, Patrick, and Anjney
also get into the transition from research to product and then from building products to starting a company. In addition, they address the benefits of open research in AI and why it's important to learn from the greater community rather than develop behind closed doors. But before we get started, here are some brief introductions from each of them to help you associate their voices with their names. First, Robin.
I'm Robin, co-founder of Black Forest Labs. We're focusing on making image and video models as widely available as possible. Then Patrick. Patrick Esser.
I'm one of the co-founders of Black Forest Labs. I've been working in this area for a while, started at the university. I got excited when I saw the possibility that we can actually teach computers to create images. Finally, Andreas. Hi, my name is Andreas. I'm amongst the co-founders of Black Forest Labs. And yeah, a couple of years ago, I started with those two guys working on image and then later video generation.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com slash disclosures.
So you guys, along with Dominik, were four of the co-authors on Stable Diffusion. Why don't we go back all the way to the origin story? Where did you guys all meet? Yeah, we met at the University of Heidelberg, where we all did our PhDs or tried to get our PhDs. And I actually met Andreas there, who's from the Black Forest, same as me, from the village next door, basically. We didn't know each other before, but then we met in Heidelberg during the PhD.
And it was a really nice time. We did a bunch of, I would say, pretty impactful works together. We started with normalizing flows, actually. Tried to make them as good as possible, which was hard, which would probably still be hard. Then switched to autoregressive models.
did this work that's called the VQGAN. And then later on, after this DDPM paper, which really showed that diffusion models can generate nice images, we also looked into that and applied the same mechanism, the same formalism that we were working on before, with this latent generative modeling technology, where the basic assumption is that when you want to generate media like images or videos, there's a lot of redundancy in the data. That's something that you can basically compress away, map the data into a lower dimensional latent space,
and then actually train the generative model, which can be a normalizing flow, an autoregressive model, or a diffusion model, on that latent space, which is computationally much more efficient. And yeah, we did that then with latent diffusion: we did a bunch of tweaks to the architecture, introduced this text-conditional UNet, and were amongst the first to do text-to-image generation with diffusion models.
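To make the two-stage idea concrete, here is a minimal sketch of training a generative model in the latent space of a pretrained autoencoder. The `autoencoder` and `denoiser` modules, the cosine schedule, and the epsilon-prediction objective are generic illustrations of the approach, not the actual latent diffusion code.

```python
import math
import torch
import torch.nn.functional as F

def latent_diffusion_step(autoencoder, denoiser, images, optimizer):
    """One illustrative training step: compress images, then denoise in latent space."""
    with torch.no_grad():
        # Stage 1 (frozen autoencoder): compress away perceptual redundancy,
        # mapping images into a much lower-dimensional latent space.
        z = autoencoder.encode(images)

    # Stage 2: train the generative model directly on latents, which is far
    # cheaper than working in pixel space.
    t = torch.rand(z.shape[0], device=z.device)                  # continuous noise level in (0, 1)
    alpha_bar = torch.cos(t * math.pi / 2).pow(2).view(-1, 1, 1, 1)
    eps = torch.randn_like(z)
    z_noisy = alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * eps

    loss = F.mse_loss(denoiser(z_noisy, t), eps)                 # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At sampling time the learned denoiser is run in reverse in that latent space, and the autoencoder's decoder maps the result back to pixels.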
And if you guys think back to that moment in time, when it wasn't obvious maybe that diffusion models would be so good at various kinds of modalities, image generation, video generation, audio generation. That's more clear today, but was it as clear back then? What were the biggest debates you guys were having as a group back then?
I don't know, skeptical comments from our PI back then. Oh, interesting. So it wasn't clear to more senior academics at the time that this was a good line of inquiry. I don't think so. But that might be my personal perception. Why is that, do you think? Why was the general reaction from more established researchers and academics so unaccepting? I think it has something to do with what Patrick mentioned earlier, just the fact that you were
like, if you look at it from very far away, just training an autoencoder and then training your generative model in that latent space. And it's a very simplified view of it, because it's not the entire story. The fact why the stuff is really working and producing crisp images is that when you train the autoencoder, we had to introduce this adversarial component to it, which makes it look really crisp, like natural images, and not blurry like the stuff was before.
And this shares a very similar motivation to why diffusion models in the DDPM paper originally worked. You focus on the perceptually relevant stuff when you want to generate, but you discard certain perceptually irrelevant features. And I think we also had to develop this mindset or theory around our intuition while we actually worked on it. So this wasn't the motivation from the beginning, but now in retrospect, I think it just makes a lot of sense.
And I think that might be one of the reasons why it was constantly challenged. Why would you work on this? Why would you do another latent approach, again, now with the diffusion model? I think we had to debate ourselves. Yeah, I was worried whether we could do another one of those. But that's always, that's where you see where the limits of research are. You have to propose something novel. If it just works better and it's not clear to everyone that it's novel, then it will be questioned in some form.
But as opposed to that, if you're building a business, you just focus on what works, right? The kind of novelty is not as important anymore. It's just like you use what works. That's why starting a business is actually also a really nice experience. Even before you guys got to starting a business, if you just think about the difference between research and product, and just building tools that people can use outside of a paper,
What may have seemed not novel to you while you were in the research community was actually extraordinarily novel to creators and developers around the world. And it wasn't really until you guys put out,
a few years later, Stable Diffusion, that it may have become clear to the research community. Is that right or is that the wrong framework? I think that's exactly right. I think there's a nice intermediate step between doing research and doing business, which is working on models that are being used in an open source context, because then everybody will use your models. Right.
And we made that experience pretty early because we were just used to making our models available all the time. And then one of the first ones that got pretty popular was this VQGAN decoder, which, because it actually achieved pretty realistic image textures, was used in combination with this text-to-image optimization procedure where people used it together with CLIP and optimized the image to match the text prompt.
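For context, the procedure being described here (often called VQGAN+CLIP) optimizes a VQGAN latent by gradient descent so that the decoded image matches a text prompt under CLIP. The sketch below is a rough illustration only; `vqgan_decoder`, `clip_model`, `tokenize`, and the latent shape are placeholders the caller would have to provide, and proper CLIP preprocessing is omitted.

```python
import torch
import torch.nn.functional as F

def clip_guided_generation(vqgan_decoder, clip_model, tokenize, prompt,
                           latent_shape=(1, 256, 16, 16), steps=300, lr=0.1):
    """Toy VQGAN+CLIP loop: optimize a latent so its decoded image matches the prompt."""
    device = next(clip_model.parameters()).device
    with torch.no_grad():
        text_feat = F.normalize(clip_model.encode_text(tokenize([prompt]).to(device)), dim=-1)

    # Optimize the VQGAN latent directly; the decoder stays frozen.
    z = torch.randn(latent_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        image = vqgan_decoder(z)                              # decode latent -> RGB image in [0, 1]
        image_224 = F.interpolate(image, size=224, mode="bilinear")
        img_feat = F.normalize(clip_model.encode_image(image_224), dim=-1)
        loss = -(img_feat * text_feat).sum()                  # maximize CLIP similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vqgan_decoder(z).detach()
```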
Because we had put out the model and it was used in this context by lots of people, that was one of these moments where you realize, okay, you actually have to make something that works in general. I think it's a nice intermediate step because if you want your models to be used in this wide context, then you just have to make sure that they work in a lot of edge cases. Let's spend a little bit of time on that moment in history because it really has had an incredible impact on lots of communities well beyond academia and research.
So we're going to play a quick guessing game. August 2022, you guys put out Stable Diffusion v1.4. Just to give people a sense of the scale of Stable Diffusion and the impact that had, can you guys guess how many downloads the model had a month after launch? I said before, I hate guessing. Yeah, why don't you go first? 120K? 120,000. Patrick? A million, but we don't know how the
downloads are counted. Oh, fair enough. These are estimates from the Hugging Face repo. So 120k, a million. Two million. Two million, okay. In its first month, Stable Diffusion v1.4 was downloaded 10 million times. Holy shit. Today, Stable Diffusion has had more than 330 million downloads since you guys put it out in the summer of 2022. Stable Diffusion basically changed the world. It's now one of the three most used AI systems
in history, which is incredible to think about. It's also incredible to think about the fact that you guys are just getting started. So why don't we talk about, that was the past, now let's talk a little bit about the present. So you put out Stable Diffusion, you see the incredible reception from the community,
The sheer scale of usage, the kinds of usage, the things people are doing with it. What would you say the top three things are that have surprised each of you? One thing that does come to my mind is just, in general, this massive exploration that you get by having so many people use it. And one of the first things I was surprised by was the use of negative prompting. It also, again, goes back to CFG, but it's like a slight variation of that, which we also never really explored. And then you saw that people actually got really improved results with it.
It was like, oh nice, such a nice quick find that we might never have discovered on our own. Right. Yeah, I remember how after the release I went on vacation for two weeks in Sweden, and there were a bunch of papers I was super curious to try out. And then after I came back, it was all implemented already.
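For readers who haven't run into it: negative prompting is a small twist on classifier-free guidance (CFG), replacing the usual empty-prompt branch with an embedding of what you want to avoid. A hedged sketch, where `model` and `embed` are stand-ins for a denoiser and a text encoder rather than any specific implementation:

```python
def cfg_denoise(model, x_t, t, cond, uncond, scale=7.5):
    """Standard classifier-free guidance: extrapolate away from the unconditional prediction."""
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, uncond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

def negative_prompt_denoise(model, x_t, t, embed, prompt, negative_prompt, scale=7.5):
    """Negative prompting: instead of guiding away from an empty prompt, guide away
    from an embedding of unwanted content (e.g. "blurry, low quality, extra fingers")."""
    cond = embed(prompt)
    neg = embed(negative_prompt)      # replaces the usual empty-string embedding
    return cfg_denoise(model, x_t, t, cond, neg, scale)
```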
Do you think that was primarily because you guys chose to release it as open source? Exactly, yeah. Because it was available, because the base quality was sufficiently good to explore all of these downstream applications. Yeah, so let's spend a minute on that, because in the language world, arguably, as the impact of language models has become more and more clear, the visibility and the transparency with which researchers in language talk about their research breakthroughs has decreased.
The vast majority of leading labs today don't publish their insights until much later. They don't really publish their findings. And in contrast, in the generative image and video model community, you guys have chosen to continue open sourcing or at least publishing your research and transparently talking about it. Do you think that was a deliberate decision from you guys or was that just an artifact of something else?
So I think seeing what you get back from the community in terms of ideas, which you can of course then incorporate into your next iterations, that is so nice and so helpful. So I think it's definitely a personally important thing for us
to keep doing that, to keep giving the community something they can build on. And it's also, of course, as we already said, extremely insightful and fun to just see what they come up with. On the other hand, especially for the AI space, we've, of course, also seen companies
following that approach, struggling to make real revenue with it and just getting into trouble in many ways. So yeah, I think one thing everyone should keep in mind who is interested in models which are openly available is that there needs to be a kind of equilibrium between the people who are using
the open models and the people who are putting them out, which would be us. So we have to make sure we are sustaining ourselves as a business also, of course. And so now it's been a couple of years since that first Stable Diffusion release where you put out v1.4. You saw the community do a bunch of exploration. Then that gave you the ability to decide which parts of what the community is working on you want
to double down on, improve the quality of, and so on, and then release the next version. And you've done that a few times now, over a family of models. There was SD v1.4, then there was SDXL, SD3. What would you say is the biggest takeaway for you, having gone through that journey a few times of releasing open weights and seeing what the community does with it? I think there's always this possibility that you integrate findings, at least pure research findings,
back into your models. But then on the other hand, one thing that we also had to learn is just scaling the infrastructure that we need for training. This is typically something that is really not talked about that much. And this is really where you can distinguish yourself, I think. Just training a better base model requires massive commitments in how you design your training pipeline, right? There's data pre-processing in all different forms, data filtering, of course, and then the
training algorithm itself has to support large clusters, and all these different things which are not directly being done in the community but which are super important if you want to make a good base model. And right now we're in this phase where we're also scaling up our models massively. Okay, so that brings us
to present day. You guys are the co-founders of Black Forest Labs. What is Black Forest Labs? We are a startup that focuses on image and video generation with latent generative modeling. We are a research team that has worked together for more than one year now, and I think we're, as Robin already said, really specialized in building very specific training pipelines for these latent generative foundation models. And I think that is where our team really is unique
in terms of capabilities, because we just managed to optimize all parts of our pipeline to an extent which I think is pretty much outstanding.
currently. And what is Black Forest's mission or North Star? I think it's to make the best models available as much as possible, that this really becomes a new way to generate content, that this is available widely for everyone, and that we also figure out how to actually continue this mission of still sharing research findings openly and also the models. But yeah, I think part of our goal is to make that a sustainable thing to keep going.
And as your first release, you guys put out Flux, which is Black Forest's first image model. What is Flux and what does it do? Flux is a diffusion transformer. It's a latent diffusion model. Actually, it's a latent flow model, since we've recently switched to this more general formalism that's called flow matching. And this model improves, we think, a bunch of things over previous models.
So it uses a better form of positional embedding, which contributes to the better structure we have in the generations. It's called RoPE, pretty popular among language models, but yeah, we incorporated this into image generation. It uses a more hardware-efficient implementation: we
introduced these, we call them fused DiT blocks, also actually motivated by findings in the scaling of transformers. I think it's actually from the vision community. I think the ViT, there was a scaling ViTs to 22 billion parameters or something paper that was published by Google. They had that, so we did that. There's a bunch of things around scaling, something that we also explored in the SD3 paper, actually, which is called QK normalization, also important for training larger models.
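A rough sketch of the two ingredients just mentioned: rotary position embeddings (RoPE) extended to 2D image patches, and a simplified form of QK normalization. Real implementations (ViT-22B, SD3, Flux) use learned RMSNorm scales and fused kernels, so treat this purely as an illustration of the idea.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Apply 1D rotary position embedding to the last dim of x (..., seq, dim)."""
    dim = x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = pos[..., None] * freqs            # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x, rows, cols):
    """Axial 2D RoPE for image patches: rotate half the channels by the patch's row
    index and the other half by its column index. Applied to both queries and keys."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)

def qk_norm_attention(q, k, v, temperature=10.0):
    """Simplified QK normalization: L2-normalize queries and keys so attention logits
    stay bounded when scaling up (real models use RMSNorm with learned scales)."""
    q = torch.nn.functional.normalize(q, dim=-1)
    k = torch.nn.functional.normalize(k, dim=-1)
    attn = torch.softmax(temperature * q @ k.transpose(-2, -1), dim=-1)
    return attn @ v
```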
Did I forget something about the architecture? I think we also have an optimized noise schedule, or noise sampling during training, which we further improved compared to SD3. I think that would be the main point.
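The "noise sampling during training" point refers to how training-time noise levels are drawn. As a rough illustration in the spirit of the SD3 recipe (the parameters here are generic, not Flux's actual settings), a rectified-flow objective with logit-normal timestep sampling can look roughly like this:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, z, cond, mean=0.0, std=1.0):
    """Rectified-flow / flow-matching objective with logit-normal timestep sampling."""
    b = z.shape[0]
    # Logit-normal sampling: draw a Gaussian, squash it through a sigmoid,
    # which concentrates training on intermediate noise levels.
    t = torch.sigmoid(mean + std * torch.randn(b, device=z.device))
    t_ = t.view(b, 1, 1, 1)

    noise = torch.randn_like(z)
    z_t = (1.0 - t_) * z + t_ * noise           # straight-line interpolation between data and noise
    target = noise - z                           # velocity pointing from data toward noise
    pred = model(z_t, t, cond)                   # model predicts the velocity
    return F.mse_loss(pred, target)
```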
Yeah, I think it's also important to note that this is really, I would say, the first round of experiments of putting out different variants of the model that come with different licenses. We offer different variants ranging from very permissive licenses to other models that are not completely freely available, which we also want to offer for customers that have more specific needs in the near future, and also customized towards more specialized applications.
Who are you hoping will use those different variants?
And what's the biggest difference between these three? So they differentiate in terms of inference efficiency. We think the model that is the most open, the fastest variant, is very developer friendly, just by the fact that you can generate samples in one to four steps, compared to usually using something like, I don't know, 30 to 100. And given that the model is also quite large compared to previous models, we think that's an important feature. So just to recap, your most open model, Flux Schnell,
which is a descriptor for how fast it is, is an open-weight model.
It's available to the entire community under a sort of very permissive license. And your hope is that developers will use the fastest model to do what? To include it in workflows that include image generation, that include all different kinds of synthesis, right? We've seen this with existing models in the past, like SDXL, being included. Really, you can, I don't know, look up crazy ComfyUI workflows that have this model in the pipeline. Of course, we think, because the model itself is just fundamentally better than SDXL and
the models before, that you don't need most of these somewhat complex workflows. But I can very well imagine that, because the model is sufficiently good, you can plug it in and develop nice workflows around it. And hopefully we also see a lot of exploration around applications that are popular, and based on that, I guess we can then really gather feedback on what is actually holding those applications back, and then we can specialize and, yeah, double down on those.
And if I'm an application developer building on top of Black Forest's models, if I'm choosing the fastest model, what am I trading off? You're not necessarily only trading off things. It's also the advantage of having speed. I guess one of the biggest issues, though, is that all the hosting is on you. You need to have the hardware to run it, right? Especially if you want to scale it.
And that's something where we also offer solutions, so that this actually doesn't become a bottleneck to exploring applications. The other one is a bit in terms of flexibility, because in order to make this very fast, it is a distilled version which samples in a few steps. But there are some techniques that actually become possible because of the nature of a diffusion model that samples in multiple steps, because you can adjust
things along that sampling process. And those are then not necessarily possible directly with the Schnell model.
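To make the trade-off concrete: a distilled model like Flux Schnell can be run with very few solver steps, while the slower models run a longer iterative solve whose intermediate states you can intervene on (editing, extra guidance, and so on). Below is a generic Euler sampler for a velocity-predicting model, offered as an assumed illustration rather than Black Forest Labs' actual sampler; `model`, `cond`, and the shapes are placeholders.

```python
import torch

@torch.no_grad()
def euler_sample(model, cond, shape, steps, device="cuda", callback=None):
    """Generic Euler sampler for a velocity-predicting flow model. A distilled model
    may get away with steps=1-4; an undistilled one typically needs tens of steps,
    but every intermediate state is exposed for intervention."""
    z = torch.randn(shape, device=device)                     # start from pure noise (t = 1)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((shape[0],), float(t), device=device)
        v = model(z, t_batch, cond)                           # predicted velocity
        z = z + (t_next - t) * v                              # Euler step toward t = 0 (data)
        if callback is not None:
            z = callback(z, float(t_next))                    # hook for editing / extra guidance
    return z

# fast_latents = euler_sample(distilled_model, cond, (1, 16, 64, 64), steps=4)
# flexible_latents = euler_sample(base_model, cond, (1, 16, 64, 64), steps=50)
```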
So maybe one could put it like this. If you want to quickly test something, quickly try something, see if things make sense in general for you, use the Schnell model. If you have a more specialized application which is targeted to a certain goal, you might use one of the slower but more flexible models. I think you could describe it as different levels of distillation that have been applied to these models. The largest model is a pure,
undistilled base model, which offers all the flexibility that comes from the flow matching training procedure we're applying. But of course you trade that off against generation speed. A pretty controversial decision you've made, relative to how a lot of other labs in the space are putting out state-of-the-art image models, is that you guys chose to make one of these models
extremely permissive and open weight. Why is that? Why was it important to you guys to continue putting out models with fully open-weight licenses? We benefit a lot from findings from all the research that is being published, and also from other tools. We depend a lot on PyTorch, just as one example. A lot of these things just wouldn't be possible if we all just completely isolated our findings.
So I think that's in general really the important part, that we do still share research findings and make it possible to experiment with the new technologies. I think for the open weights topic, it's really important that you not only have the research findings written down in published form, which is also super helpful, but I think
to really enable a much wider audience to actually experiment with that technology. For that, it actually has to be available. Yes. As Patrick said earlier, we have pretty deep roots in open source and we want to continue
to do this. And I do think that there's this huge debate around safety in the context of deep learning models. I do really think that making weights available makes it ultimately down the line much safer. So I think that's just like another aspect to the open sourcing, having this community effort, focusing on downsides of the model, on stuff that you need to improve in contrast to something where you just develop it
on your own. - Yeah, one of the things that Stable Diffusion really did for a lot of users was that the base model you guys put out was extremely flexible. It was a fairly honest model. There weren't too many of your own post-training biases or censorship decisions that you put on it before giving it out to the community. And you continued that with this release. Why is that important in your mind?
Because I think down the line, it improves the models that everyone is producing. Having this fair exchange of arguments that are based on research that you do with these specific model weights. Like, there were biases in the first versions of Stable Diffusion that were introduced by the training data itself, and I really don't like them. So it's good that there was research around this, which could
point them out. And that, by the way, wouldn't have been possible without putting out that model. Ah, interesting. Because without Stable Diffusion, maybe the community would today not know that these biases in the datasets exist. Right. And by now we know how to remove them. So we've had huge learnings from that, actually. And that is a perfect example of how open models in general are very useful to improve the general
space or this general state of the art. You're saying when you put out open-weight models, that allows other researchers to actually contribute to the transparency of these models and understand the systems more deeply and then ultimately help improve them by having way more people actually be able to analyze what the models can and can't do, inherent biases that might not otherwise be as easily
discovered if it was a closed source model. Exactly. Yeah. So one common perception about open source or open-weight models is that the researchers and developers open sourcing these models don't care a lot about safety or about mitigating some of these risks. So is that true? Was there anything you guys did before you released Flux as an open source model that you think could address some of these misinformation risks?
Yeah, we are looking into methods that watermark the content that is being generated, but that you don't see in the output. But then another algorithm could detect if that image or video was made by our neural network.
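As a toy illustration of the general shape of such a scheme (embed a signal humans cannot see, then detect it with a dedicated algorithm), here is a deliberately naive pixel-space watermark. Production watermarking, including whatever Black Forest Labs actually ships, works very differently and is far more robust to the distortions discussed next.

```python
import numpy as np

def embed_watermark(image, key, strength=2.0):
    """Toy invisible watermark: add a tiny key-seeded pseudorandom pattern.
    `image` is a float array in [0, 255]; real schemes work in frequency or
    latent space and survive compression/cropping far better than this."""
    rng = np.random.default_rng(key)
    pattern = rng.standard_normal(image.shape)
    return np.clip(image + strength * pattern, 0, 255)

def detect_watermark(image, key, threshold=0.5):
    """Detect by correlating the image against the key's pattern."""
    rng = np.random.default_rng(key)
    pattern = rng.standard_normal(image.shape)
    score = float(np.mean((image - image.mean()) * pattern))
    return score > threshold, score
```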
Yeah, I think that's a good point. I think it also goes in the direction of maybe a more healthy approach towards this. That, for example, it makes it possible to track this and identify misinformation without limiting the technology for other uses that might actually be beneficial. And another point, coming to the point of watermarking, this is obviously a really challenging task,
because you can apply so many distortions to the generated images and to the watermarked images to just jailbreak these watermarking procedures. But also there, if we have an open model, there will be jailbreaks,
but there will be ways to mitigate those jailbreaks. This is what we see in many other research fields. If you think about, I don't know, cryptography or something, there it's basically similar. You just improve your algorithms, then you have some people who jailbreak it, and then you improve further.
And certainly no one doubts that cryptography is really important for everything we have on the web and whenever we exchange information. And there it's just similar, and no one debates about whether open research is good or not. I'm wondering why that is not the case for these AI infrastructure models, where it's
effectively the same, I would say. And you did also share in your launch blog post that you're working on a video model. When you were starting to work on this video model, what were some of the most important capabilities that you guys wanted to tackle? So I think one of the learnings from what we saw with current powerful video models is that although they are really nice, generating really nice and detailed videos, they are still not controllable enough in many respects
to be really useful for professionals, for people who want to seriously include that into their professional pipelines. And when you say controllability, what do you mean? There are different kinds of challenges, the first of which is the general level of
prompt following. So most of these models which you see right now are based on text inputs, but still, other than for images, where we have found nice ways of prompting, or where the prompt adherence is currently much better, for video it's unclear how to temporally prompt the model such that it accurately follows your temporal instructions. So that is one of the main challenges.
Another one is consistency of certain objects or characters between different cuts. A movie maker might want to have a cut and still be able to generate the same person, keep the same clothing, or have the same backgrounds and stuff, right? Maybe from another view angle, but still...
the same setting. And then we think that this is one of the nice features with this new model, that we can actually control it not only through text, but we can actually say, okay, let's do a cut in here, and the character, or whatever you have, it can be a bottle or whatever in your prompt, remains consistent across these different cuts that are being generated, within a single generation. So relative to the last video model that you guys worked on, which I believe was Stable Video Diffusion last fall,
Is the biggest improvement you would say in Black Forest's first video model that it's much more controllable? Not only that, it's also much more efficient. The latent space is more or less 16 times more efficient, which is really, I think, good, while keeping the general video quality, the visual quality. Also related to that, we can generate much longer videos. And I think a main issue with Stable Video Diffusion was that it was
mainly generating static scenes. Our current model has a lot of motion, very interesting motion, a very broad range of motions from slow-mo to fast footage and shaky camera. Yeah, I think the motion distribution the model is able to generate is heavily improved compared to SVD. This was a common sort of problem with a whole generation of models that were ostensibly video models, but when you would actually try to run
any kind of interesting inference on them, they would often produce static camera pans or just zooms. And they weren't actually simulating the world that the image or the video was of. What did you have to do to crack that problem? I think one of the major improvements is around this temporal compression that Andi mentioned, which is also used by other new models. We think that this is one of the fundamental improvements, why we see much better video models nowadays than we had nine months ago.
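To give a rough sense of why temporal compression in the latent space matters so much, here is a back-of-the-envelope token count. The downsampling and patch factors below are common choices in recent video models, not Black Forest Labs' disclosed configuration.

```python
def latent_token_count(frames, height, width, ds_t=4, ds_s=8, patch=2):
    """Transformer tokens for a clip after a spatiotemporal autoencoder
    (ds_t x temporal, ds_s x spatial downsampling) and patchification.
    The factors are illustrative defaults, not Flux/BFL's actual settings."""
    t = frames // ds_t
    h = (height // ds_s) // patch
    w = (width // ds_s) // patch
    return t * h * w

# 5 seconds at 24 fps, 720p:
frames, height, width = 120, 720, 1280
with_temporal = latent_token_count(frames, height, width, ds_t=4)      # 108,000 tokens
without_temporal = latent_token_count(frames, height, width, ds_t=1)   # 432,000 tokens
print(with_temporal, without_temporal)  # temporal compression cuts the sequence length ~4x here
```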
Gotcha. It also comes down to a lot of data filtering and preparation improvements there. I think it's actually nice in general because there, for example, we actually used very, I would say, classical computer vision techniques to filter out the worst parts that introduced this undesirable behavior. And yeah, I think it's neat to see also that existing techniques can be really helpful, even if sometimes
the technique we apply has a very high error rate, probably. But if you do this in the pre-training stage,
just getting the rough idea right so often already improves the base model so much more than one would expect just from the numbers. That seemed also very effective.
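One example of the kind of cheap, classical filter being described, dropping near-static clips by frame differencing; the threshold and the surrounding `load_frames`/`candidate_clips` names are hypothetical, and as noted above such a filter is noisy but good enough at pre-training scale.

```python
import numpy as np

def has_enough_motion(frames, min_mean_diff=3.0):
    """Crude static-scene filter: mean absolute difference between consecutive frames,
    averaged over the clip. `frames` is an array of shape (T, H, W, C) in [0, 255].
    High error rate, but cheap enough to run over an entire pre-training corpus."""
    frames = frames.astype(np.float32)
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))
    return float(diffs.mean()) >= min_mean_diff

# keep = [clip for clip in candidate_clips if has_enough_motion(load_frames(clip))]
```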
So, if we contrast the data preparation, pre-computing, pre-training, post-training, fine-tuning, and ultimately inference optimization parts of the entire journey of actually building and releasing a model, what's changed the most in how you approached it this time versus, let's say, a year ago? I would say there were tremendous differences in the data pre-processing and in the pre-training stage already, which led to some of the fundamentally different behaviors we see now for our video model compared to previous video models. Another thing is that we really
made time a first-class citizen. Before, I think a lot of people were always using a factorized modeling approach, where you treat space and time differently. And now, in the new models that are coming, it's just treating all of those the same and letting the model, the transformer, actually figure out how to deal with the different dimensions.
And that is, by the way, where the generality of the transformer as an architecture is really helpful, because we're transitioning from the image model, which is, as I already mentioned, the base for the video model. When doing that transition, we didn't have to change the architecture at all,
because of that very useful generality of the transformer architecture. Interesting. We actually have a little placeholder in the image model, in the positional embeddings, that we added before we even started the image model training, which would later incorporate a temporal positional embedding. And what gave you the conviction to do that? Because we already had the plan to go to the video model from the beginning, but we knew it's always a good start to start training the image model.
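A sketch of what such a placeholder can look like: every patch token gets a (t, h, w) position from the start, with the image model simply living at t = 0, so the same positional-embedding code later covers video without any architecture change. This is an assumed illustration of the idea, not the actual Flux code.

```python
import torch

def axial_position_ids(frames, height, width, device="cpu"):
    """Build (t, h, w) position ids for every patch token. An image is just the
    frames == 1 case, so the temporal axis exists from the start as a placeholder."""
    t = torch.arange(frames, device=device)
    h = torch.arange(height, device=device)
    w = torch.arange(width, device=device)
    grid = torch.stack(torch.meshgrid(t, h, w, indexing="ij"), dim=-1)  # (T, H, W, 3)
    return grid.reshape(-1, 3)                                          # one (t, h, w) per token

image_pos = axial_position_ids(1, 64, 64)    # image model: temporal coordinate is all zeros
video_pos = axial_position_ids(16, 64, 64)   # video model: same code, time now varies
```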
But yeah, into that image model, we then already incorporated design decisions that were informed by the goal of developing the video model. So it sounds like a fundamental assumption you guys made was that a really great image model would be strictly helpful to a great video model. Is that true?
Not 100% sure if I would phrase it this way, but image data gives you a different type of diversity and styles that you might not be able to capture with video data. Artistic things, for instance, which only exist in the image data. If we think about artworks and stuff, right? Of course, you can make a video of an artwork, but that might not be a
very interesting motion. At least the artwork might not move. I think one also shouldn't underestimate the need to think about the development plan in itself, because, first of all, it takes different amounts of compute to train the image or video model. There's also a bit more experience with image models, which makes it a bit
safer and a bit faster to get started. And so I think that was actually also part of the decision to do it that way is that you don't want to aim for something where you say this will only be ready in 12 months or something. I think it's super important to have continuous progress where in the intermediate steps, you also get something really useful like the image model. And from that, I think it
really makes a lot of sense to go that route. There's not much that you lose. You can much better overlap different developments by starting the image training relatively quickly. Then we can work in parallel on all the video data work. And yeah, that just comes down again to the overall efficiency of us as a team and our model development strategy. And to add to that, you see that, by the way,
in the fact that we started our company four months ago and we're already putting our first model out. We had to rebuild everything, but since, as Robin mentioned a couple of times, we have this really specialized team
which just optimized all the parts of the pipeline, and combined that with the continuous development of features, image features for instance, which can then be reused for video. These combinations led to, I think, really good progress, which you're seeing right now, because we're putting a really powerful and large model out after four months, which I'm personally very proud of.
I think that what's not well understood in the research space is just how much intuition still matters, how much taste still matters, and how individual decisions that you make as a team have dramatic impact on the speed at which you can produce models, the quality that comes out. As an example, I remember talking to you guys a few months ago about how to approach the problem of latency, of slowness in generation.
Video generation is still pretty slow. It takes a while to prompt something and then see a generation come back. I remember asking you guys, how could we crack that? And all three of you immediately said, yeah, we should ask Axel, right?
And can you say a little bit more about that? Why was it so clear to you that there was a specialized person on the team whose intuition you would go to first, versus saying, oh, let's look that up, it should be common knowledge, or let's look at what the latest conference paper said, and so on? No, I think this is what really makes a difference. You need to have this functioning team. You need to know each other. You need to know that you can trust each other. If Axel tells me, hey, we'll have this model ready in a week, then I trust him.
So I trust Axel as well. But what is it about training these models that's so difficult that knowledge still is locked up within one or two individuals, where it's not ubiquitously distributed? And especially wouldn't be if someone like you guys weren't being so transparent about your research. It's a lot of intuition and experience. The ability to judge where a training run is going when you look at the early samples in the training, I think that's super important. And I think this is something
a lot of our team members actually have. And on that note of intuition, the internal name we gave the training run for the Flux model that we just released was YOLO 12B. And I think there were a few decisions where it was just, let's see where it goes. But overall, we had a pretty good feeling about the whole thing. And that's a nice way of operating, I think.
Yeah, but I think there is a kind of scarcity in how much experience there can even be, right? The whole process of training a model just is slow. I think there's very little way around that. And then, of course, you try ways around it, do it on a smaller scale, but then sometimes you see that it also doesn't really translate to the scaled-up version. I think just going back to the question of why it's still locked up, or why it's so
important to have a few people with experience. I still think actually it's somewhat limited how much experience people have, and yeah, really having that hands-on experience and also being able, like Robin said, to judge continuously during the training, is this going in the right direction or not, without completely blindly trusting loss curves. Let's say you had to predict, over the next two or three years, what will be more valuable for a human to do versus the
model to do. I'm sure there are many different opinions, but how I see it is that it comes back to something similar to the speed issue: we are to some degree bottlenecked by the speed with which we can train models and actually put our ideas into them, to
get feedback on whether that's actually improving things or not. Having a slow feedback loop, I think, is always something that holds people back. I think that's very similar for visual media. From what I hear from people in that industry, if you just want to realize something, or if you have an idea and you want to put that into action, you...
Today, I don't know, you have to actually film this on camera. You have to get the props. You have to get everything set up just to even get an idea for it. Of course, you maybe do storyboarding and stuff before that. But I think this is something where it's already super helpful that you can
get a much faster feedback loop of maybe giving some visuals to the ideas that are in your head. And it doesn't have to be the end product, right? Especially right now with the quality gaps that we still see, there's a lot of demand for perfection and that still requires
human craft right now. Maybe it will become less, but ultimately I think it's about getting the ideas out of people's heads into some form of visualized reality. I agree. It's a tool to iterate quickly on ideas. But in the end, you as a user, as a human, have to make the decision what to use and what not to use. I think there's also the question around taste, around curation of samples. You can easily generate like 100 samples in no time, but then what do you use for your specific project?
That depends on you. And then there's also this question around the specific style that kind of emerges in the AI domain. Is that something we want to keep? I don't know, I'm not sure. So I would say it's mostly, we think of it as a tool
that can dramatically speed up certain workflows, but it's not meant to replace these workflows entirely. And it will change workflows as well. So where can people go find the models? You can go to GitHub, use our inference code, download the weights from Hugging Face, and of course also use our API. I hope they generate a bunch of really weird stuff. And no, of course, explore the model, see how it performs in existing workflows, and integrate it into these workflows as well.
Yeah, we think this is a big upgrade from the latest iteration of available weights, and yeah, we're both excited to see what the community is exploring, but also of course hoping that we see more research around those models. And especially, we've already talked about this earlier, that model is a free model, which is not
behind an API. Often these APIs come with native prompt upsampling or stuff like that. Of course, you can also do this for our model, but we took care of training the model such that it should also react very well to various kinds of prompting techniques, such as single words or short prompts, longer prompts, very detailed prompts. So I guess that
It's amongst the coolest features for open models that people can just prompt them as they themselves want to do and explore what works best for them. And that's it for this week. If you were wondering what the pipeline from AI researcher to founder might look like, this is certainly one compelling version of that journey. For more discussions about the cutting edge of AI from the people building it, keep listening to this podcast. And also remember to rate, review, and share the podcast.