

Happy New Year, friends. Thanks for all the love on the Latent Space Live and 100th episode end of year recap. Your support has boosted us 30 places in the podcast charts, and that always helps us book great guests and organize more industry events for you. We don't say this enough, but thank you to everyone who has left a review on Apple Podcasts or subscribed to our new YouTube channel.

Last year, we broke new ground when we interviewed our first public company CEO with Drew Houston and our first technology cabinet member with Minister Josephine Teo, and had our first year with full coverage of leading labs across Meta, OpenAI, Anthropic, Reka, Liquid, and Google DeepMind. For our 101st episode, we are proud to introduce another first with our first anonymous guest.

As swyx mentions in the episode, Latent Space was started in the immediate aftermath of Stable Diffusion, and the uncredentialed software engineers it enabled set the stage for the LLM wave that was to come with ChatGPT.

The earliest winner of the Stable Diffusion tooling wars was SD WebUI, a Gradio app by the anonymous young creator Automatic1111 that quickly amassed over 100,000 GitHub stars by rapidly shipping plugins and usable interfaces for the fast-growing Stable Diffusion ecosystem.

However, these days the power tool of choice is ComfyUI by today's guest, Comfy Anonymous, who is gracing us with his first ever podcast appearance.

The shift from Automatic1111 to ComfyUI reflects a broader shift in the image diffusion space: away from prompting and tweaking settings in 2022, toward more complex, parallel workflows that chain together different models and orchestrate long-running operations (including video processing), visualized on an intuitive canvas instead of long YAML or code blocks.

Because ComfyUI is open source, there are now multiple Y Combinator startups built off of a Comfy workflow or offering ComfyUI as a service directly. Interestingly enough, this same workflow tooling has not seemed to take off for other modalities yet, but perhaps 2025 is the year diffusion tooling diffuses to non-image domains.

In other news, we have just announced the second AI Engineer Summit in New York City. We are bringing back the surprisingly successful AI Leadership Track from the World's Fair, and the single-track AI Engineering Track is now wholly focused on agents at work. If you are building agents in 2025, this is the single best conference to attend. Head to apply.ai.engineer and see you there. Watch out and take care.

Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol AI. Hey everyone, we are in the Chroma Studio again, but with our first ever anonymous guest.

Comfy Anonymous, welcome. - Hello. - I feel like that's your full name. You just go by Comfy, right? - Yeah, well, a lot of people just call me Comfy even though, even when they know my real name. Just say, "Hey, Comfy." - Yeah. - It works just the same. Not a lot of people call you Sean. - Yeah, you have a professional name, right? That people know you by and then you have a legal name. Yeah, it's fine.

How do I phrase this? Like people who are in the know, know that comfy is like the tool for image generation and now other multimodality stuff.

I would say that when I first got started with Stable Diffusion, the star of the show was Automatic1111. And I actually looked back at my notes from 2022-ish, like Comfy was already getting started back then, but it was kind of like the up and comer and your main feature was the flowchart. Can you just kind of rewind to that moment that year and how you looked at the landscape there and decided to start Comfy? Yeah, I discovered Stable Diffusion in 2022, in October 2022.

And well, I kind of started playing around with it. And back then I was using auto, which was what everyone was using back then. So I started with that because, when I started, I had no idea how diffusion models work, how any of this works. Oh yeah. What was your prior background as an engineer?

Just a software engineer. Yeah, boring software engineer. But any image stuff, any orchestration, distributed systems, GPUs? No, I was doing basically nothing interesting. CRUD, web development? Yeah, a lot of web development. Just some basic automation stuff. Okay. Just that.

Yeah, no, like, no big companies or anything. Yeah, but like already some interest in automations, probably a lot of Python. Yeah, yeah, of course, Python, but I wasn't actually used to, like, the node graph interface before I started ComfyUI. It was just, I just thought, oh, like, what's the best way to represent the diffusion process in a user interface? And then, oh, well,

naturally, this was the best way I found. And this was with the node interface. So how I got started was, yeah, basically October 2022. I hadn't written a line of PyTorch before that, so it was completely new. What happened was I kind of got addicted to generating images. Okay.

As we all did. And then I started experimenting with the high-res fix in auto, which was, for those that don't know... The high-res fix is just to generate... Since the diffusion models back then could only generate at low resolution, what you would do is generate a low-resolution image, then upscale, then...

refine it again. And that was kind of the hack to generate high resolution images. I really liked generating like higher resolution images, so I was experimenting with that. And so I

modified the code a bit. Okay, what happens if I use different samplers on the second pass? So I went and edited the code of auto. What happens if I use a different sampler? What happens if I use different settings, a different number of steps?

Because back then the high-res fix was very basic. Now there's a whole library of just up-samplers. I think they added a bunch of options to the high-res fix since then. But before that, it was just so basic. So I wanted to go further. I wanted to try, okay, what happens if I use a different model for the second pass?
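To make the two-pass idea concrete, here is a rough sketch of a "high-res fix" written against the diffusers library, not auto's or ComfyUI's actual code; the model name, resolutions, and 0.5 denoise strength are just illustrative.

```python
# A minimal two-pass "high-res fix" sketch using the diffusers library.
# Illustration of the idea only, not auto's or ComfyUI's actual code.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # any SD 1.5 checkpoint
prompt = "a cozy cabin in the mountains, golden hour"

# Pass 1: generate at the model's native (low) resolution.
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
low_res = txt2img(prompt, height=512, width=512).images[0]

# Upscale the low-res image (plain resize here; real workflows often use an upscaler model).
upscaled = low_res.resize((1024, 1024))

# Pass 2: refine the upscaled image with img2img at a partial denoise strength.
# This second pass is where you could swap in a different sampler, step count, or even a different model.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
high_res = img2img(prompt=prompt, image=upscaled, strength=0.5, num_inference_steps=20).images[0]
high_res.save("hires_fix.png")
```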

And then, well, the auto code base wasn't good enough for it. It would have been harder to implement that in the auto interface than to create my own interface. So that's when I decided to create my own. And you were doing that mostly on your own when you started, or did you already have kind of like a subgroup? No, I was on my own, because it was just me experimenting with stuff. So...

Yeah, that was it. So I started writing the code January 1, 2023. And then I released the first version on GitHub January 16, 2023. That's how things got started. And what's the name? Comfy UI right away? Yeah, Comfy UI. The reason my name is Comfy is people thought my pictures were comfy. So I just named it Comfy.

It's my comfy UI. So yeah, that's... Is there a particular segment of the community that you targeted as users? Like more intensive workflow artists, you know, compared to the automatic crowd, or, you know? This was my way of, like, experimenting with new things. Like the high-res fix thing I mentioned: in Comfy, the first thing you could easily do was just chain different models together.

And then one of the first things, I think the first times it got a bit of popularity was when I started experimenting with different, like applying prompts to different areas of the image.

Yeah, I called it area conditioning, posted it on Reddit and it got a bunch of upvotes. So I think that's when like when people first learned of comfy UI. Is that mostly like fixing hands? No, that was just like, let's say, well, it was very, well, it still is kind of difficult to like, let's say you want a mountain, you have an image and then, okay, I want a mountain here and

And I want a fox here. Yeah, so compositing the image. Yeah, my way was very easy. It was just like, when you run the diffusion process, you kind of generate-- OK, you do one pass through the diffusion model. Every step, you do one pass.

this place of the image with this prompt, this place of the image with the other prompt, and then the entire image with another prompt, and then just average everything together every step. And that was area composition, as I called it. And then a month later, there was a paper that came out called "Multi-Diffusion," which was the same thing. But yeah. That's-- Could you do area composition with different models?

Or because you're averaging out, you kind of need the same model. You could do it with-- but yeah, I hadn't implemented it for different models. But you can do it with different models if you want, as long as the models share the same latent space.
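To make the averaging idea concrete, here is a schematic sketch of blending per-region noise predictions inside one denoising step; `predict_noise`, the masks, and the conditioning objects are hypothetical placeholders, not ComfyUI's real implementation.

```python
# Schematic sketch of "area conditioning": run the model once per regional prompt
# each step, then blend the noise predictions with spatial masks.
# predict_noise(), latent, and the conditioning objects are hypothetical placeholders.
import torch

def area_conditioned_step(latent, timestep, regions, predict_noise):
    """regions: list of (mask, cond) pairs; mask is [1, 1, H, W] in [0, 1],
    cond is the encoded prompt for that area (a full-image prompt can use an all-ones mask)."""
    blended = torch.zeros_like(latent)
    weight = torch.zeros_like(latent)
    for mask, cond in regions:
        noise_pred = predict_noise(latent, timestep, cond)  # one pass through the diffusion model
        blended += noise_pred * mask
        weight += mask
    # Average wherever masks overlap, so overlapping prompts blend rather than add up.
    return blended / weight.clamp(min=1e-6)
```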

We're supposed to ring a bell every time someone says latent space. Yeah. Like, for example, you couldn't use, like, SDXL and SD 1.5, because those have a different latent space, but, like...

SD 1.5 models, different ones, you could do that. There's some models that try to work in pixel space, right? Yeah, they're very slow. Of course. That's the problem. The reason why stable diffusion actually became popular was because of the latent space. Because it used to be latent diffusion models and then they trained it up. Yeah, because pixel diffusion models are just

too slow. Yeah. Have you ever tried to talk to like, like Stability, the Latent Diffusion guys, like, you know, Robin Rombach, that group? Yeah. Well, I used to work at Stability. Oh, I actually didn't know. Yeah. I used to work at Stability. I got, I got hired in June 2023. Uh-huh.

Ah, that's the part of the story I didn't know about. Okay. So the reason I was hired is because they were doing SDXL at the time. And SDXL, I don't know if you remember, it was a base model and then a refiner model. Basically, they wanted to experiment with chaining them together. And then they saw, oh, we can use this to do that. Well, let's hire that guy now.

But they didn't pursue it for SD3. What do you mean? The SDXL approach. Yeah. The reason for that approach was because basically they had two models and then they wanted to publish both of them. So they trained one on lower timesteps, which was the refiner model.

And then the first one was trained normally. And then during their test, they realized, oh, if we string these models together, our quality increases. So let's publish that. It worked. Yeah. But right now, I don't think many people actually use the refiner anymore, even though it is actually a full diffusion model. You can use it on its own, and it's going to generate images.

I don't think anyone, people have mostly forgotten about it.

Can we talk about models a little bit? So Stable Diffusion, obviously, is the most known. I know Flux has gotten a lot of traction. Are there any underrated models that people should use more? Or what's the state of the union? Well, the latest state of the art, at least for images, there's Flux. There's also SD 3.5. SD 3.5 is two models. There's a small one, 2.5B,

and there's the bigger one, 8B. So it's smaller than Flux, and it's more creative in a way. But Flux is the best. People should give SD 3.5 a try because it's different. I won't say it's better. Well, it's better for some specific use cases.

If you want to make something more creative, maybe SD 3.5. If you want to make something more consistent, Flux is probably better. Do you ever consider supporting the closed source model APIs? Well, we do support them with our custom nodes. We actually have some official custom nodes from different... Ideogram. Yeah. I guess DALL-E would have one.

Yeah, it's just not another person that handles that. Sure, sure. Quick question on SD. There's a lot of community discussion about the transition from SD 1.5 to SD 2 and then SD 2 to SD 3. Are people still very loyal to the previous generations of SDs? Yeah, SD 1.5 still has a lot of users. The last based model. Yeah.

Yeah. Then SD2 was mostly ignored because it wasn't a big enough improvement over the previous one. Okay, so SD 1.5, SD3, Flux, and whatever else. SDXL. SDXL, that's the main one. Stable Cascade? Stable Cascade, that was a good model, but...

The problem with that one is that SD3 was announced one week after. Yeah, it was like a weird release. What was it like inside of Stability, actually? I mean, statute of limitations expired, you know, management has moved. It's easier to talk about now. Yeah, inside Stability, actually that model was ready like three months before, but it got stuck in red teaming.

So basically, if that model had released or was supposed to be released by the authors, then it would probably have gotten very popular since it's a step up from SDXL. But it got all of its momentum stolen by the SD3 announcement, so people kind of didn't develop anything on top of it, even though it's a...

It was a good model, but it was almost completely ignored for some reason. I think the naming as well matters. It seemed like a branch off of the main tree of development. Yeah, well, it was different researchers that did it. Very good model. It's the Würstchen authors. I don't know if I'm pronouncing it correctly. Yeah.

I actually met them in Vienna. Yeah, they worked at Stability for a bit and they left right after the Cascade release. This is Dustin, right? No, Dustin is SD3. That's Pablo and Dome. I think I'm pronouncing his name correctly.

Yeah, that's very good. It seems like the community moves very quickly. Yeah. Like, when there's a new model out, they just drop whatever the current one is and they all move wholesale over. Like, they don't really stay to explore the full capabilities. Like, if Stable Cascade was that good, they would have A/B tested a bit more. Instead they're like, okay, SD3 is out, let's go.

Well, I find the opposite, actually. The community doesn't... They only jump on a new model when there's a significant improvement. If there's only an incremental improvement, which is what most of these models are going to have, especially if you stay at the same parameter count...

You're not going to get a massive improvement unless there's something big that changes. How are they evaluating these improvements? Because it's a whole chain of...

you know, comfy workflows. Yeah. How does one part of the chain actually affect the whole process? Are you talking on the model side specific? Model specific, right? But like, once you have your whole workflow based on a model, it's very hard to move. Not, well, not really. It depends on your, depends on the specific kind of workflow. Yeah, yeah.

So I do a lot of, like, text and image. Yeah. When you do change, like, most workflows are kind of going to be compatible between different models. It's just, like, you might have to completely change your prompt, completely change... Okay. Well, I mean, maybe the question is really about evals. Like, what does the Comfy community do

do for evals, just, you know... Well, they don't really do evals. It's more like, I think this image is nice. So they just subscribe to fofr, yeah, and just see, like, you know, what fofr is doing. Yeah.

They just generate, like, I don't see anyone really doing, at least on the comfy side, comfy users, it's more like, oh, generate images and see, oh, this one's nice. Yeah, it's not, like, the more, like, scientific, like, checking, that's more specifically on, like, model side of things.

Yeah. But there is a lot of vibes also, because it is artistic. You can create a very good model that doesn't generate nice images. Because most images on the internet are ugly. So if you just... oh, I have the best model, it's super smart, I trained it on just...

all the images on the internet. The images are not going to look good. They're going to be very consistent, but it's not going to be like the look that people are going to be expecting from a model.

Can we talk about LoRAs? Because we talked about models, then the next step is probably LoRAs. Actually, I'm kind of curious how LoRAs entered the toolset of the image community, because the LoRA paper was 2021. And then there were other methods like textual inversion that were popular at the early SD stage.

Yeah, I can't even explain the difference between them. Textual inversion, that's basically... What you're doing is you're training a... Because, well, yeah, in Stable Diffusion you have the diffusion model, you have the text encoder. So basically what you're doing is training...

a vector that you're going to pass to the text encoder. It's basically you're training a new word. Yeah, it's a little bit like representation engineering now. Basically, yeah, you're just... So yeah, if you know how the text encoder works, basically you have...

You take the words of your prompt, you convert those into tokens with the tokenizer, and those are converted into vectors. Basically, each token represents a different vector, so each word represents a vector, and, depending on your words, that's the list of vectors that gets passed to the text encoder, which is just a stack of attention. Basically, it's...

very close to the LLM architecture. Yeah, so basically what you're doing is just training a new vector. You're saying, well, I have all these images and I want to know which word represents them. So you train this vector, and then when you use this vector, it

hopefully generates something similar to your images. Yeah, I would say it's, like, surprisingly sample efficient in picking up the concept that you're trying to train it on. Yeah, well, people have kind of stopped doing that, even though back when I was at Stability we actually did train some textual inversions internally on, like, T5 XXL, and it actually worked pretty well.
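For the curious, here is a compressed sketch of the textual-inversion idea just described, using Hugging Face transformers; the training loop is heavily abbreviated and `loss_for_images` is a hypothetical stand-in for the usual diffusion noise-prediction loss.

```python
# Compressed sketch of textual inversion: learn one new embedding vector ("a new word")
# for the text encoder while everything else stays frozen.
# loss_for_images() is a hypothetical stand-in for the diffusion reconstruction loss.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register a new pseudo-word and give it its own (trainable) row in the embedding table.
tokenizer.add_tokens(["<my-concept>"])
text_encoder.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("<my-concept>")

embeddings = text_encoder.get_input_embeddings()
for p in text_encoder.parameters():
    p.requires_grad_(False)
embeddings.weight.requires_grad_(True)  # only the embedding table gets gradients

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)
for _ in range(1000):  # toy loop
    tokens = tokenizer("a photo of <my-concept>", return_tensors="pt")
    cond = text_encoder(**tokens).last_hidden_state  # goes to the diffusion model as usual
    loss = loss_for_images(cond)                     # hypothetical: noise-prediction loss on your images
    loss.backward()
    embeddings.weight.grad[:new_id] = 0              # keep every existing word frozen
    optimizer.step()
    optimizer.zero_grad()
```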

But for some reason, yeah, people don't use them. And they might also work... Yeah, this is something you'd probably have to test, but maybe if you train a textual inversion on T5 XXL, it might also work with all the other models that use T5 XXL. Because it's the same thing with the textual inversions that...

that were trained for SD 1.5, they also kind of work on SDXL, because SDXL has two text encoders, and one of them is the same as the SD 1.5 CLIP-L. So those, they actually don't work as strongly, because they're only applied to one of the text encoders, but the same thing for SD3. SD3 has three text encoders, so...

It works. You can still use your textual inversion from SD 1.5 on SD3, but it's just a lot weaker, because now there are three text encoders, so it gets even more diluted. Do people experiment a lot? Just on the CLIP side, there's, like, SigLIP, there's BLIP. Do people experiment a lot on...

You can't really replace... Yeah, because they're trained together, right? Yeah, they're trained together. So you can't, like... Well, what I've seen people experimenting with is LongCLIP. So basically someone fine-tuned the CLIP model to accept longer prompts. Oh, it's kind of like long context fine-tuning. Yeah, so sort of like it's...

It's actually supported in core Comfy. How long is long? Regular CLIP is 77 tokens. LongCLIP is 256 tokens.

But the hack is, like, if you use Stable Diffusion 1.5, you've probably noticed, oh, it still works if I use long prompts, prompts longer than 77 tokens. Well, that's because the hack is to just split your whole big prompt up into chunks of 75. Let's say you give it, like, a

massive text like the Bible or something. It would split it up into chunks of, like, 75, and then just pass each one through the CLIP, and then just

concatenate everything together at the end. It's not ideal, but it actually works. The positioning of the words really, really matters then, right? Like, this is why order matters in prompts. Yeah. It works, but it's not ideal; it's just what people expect. Like, if someone gives a huge prompt, they expect at least some of the concepts at the end to be present in the image.
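To make the chunking hack concrete, here's a rough sketch using Hugging Face transformers; illustrative only, not the actual Automatic1111 implementation.

```python
# Sketch of the long-prompt hack: split the prompt into 75-token chunks,
# encode each chunk separately, and concatenate the results along the sequence axis.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode_long_prompt(prompt: str, chunk_size: int = 75) -> torch.Tensor:
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)] or [[]]
    outputs = []
    for chunk in chunks:
        # Re-add the start/end tokens so each chunk looks like a normal 77-token prompt to CLIP.
        chunk_ids = [tokenizer.bos_token_id] + chunk + [tokenizer.eos_token_id]
        chunk_ids += [tokenizer.eos_token_id] * (chunk_size + 2 - len(chunk_ids))  # pad to full length
        tokens = torch.tensor([chunk_ids])
        outputs.append(text_encoder(tokens).last_hidden_state)
    return torch.cat(outputs, dim=1)  # [1, 77 * n_chunks, 768]; cross-attention accepts the longer sequence
```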

But usually when they give long prompts, they don't expect detail, I think. So that's why it works very well. And while we're on this topic: prompt weighting, negative prompting, all sort of similar parts of this layer of the stack. Yeah, the hack for that, which works on CLIP... like, it works.

Basically, for SD 1.5, prompt weighting works well because CLIP L is not a very deep model.

So you have a very high correlation between the input token, the index of the input token vector, and the output token. The concepts are very closely linked. So that means if you interpolate the vector... Well, the way ComfyUI does it is, okay, you have the vector, you have an empty prompt vector.

So you have a chunk, like a CLIP output, for the empty prompt, and then you have the one for your prompt, and then it interpolates between them depending on your prompt weight, the weight of your tokens. So...

So if you, yeah. So that's how it does prompt weighting. But this stops working the deeper your text encoder is. So on T5 XXL, it doesn't work at all. Wow. Is that a problem for people? I mean, because I'm used to just moving up numbers. Probably not. So you just use words to describe, right? Because it's a bigger language model. Yeah.
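The interpolation he describes boils down to a one-liner; a simplified sketch (ComfyUI's real per-token handling is more involved than this):

```python
# Sketch of prompt weighting as described above: interpolate each weighted token's
# embedding between the empty-prompt encoding and the real prompt encoding.
import torch

def apply_token_weights(cond: torch.Tensor, empty_cond: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """cond, empty_cond: [1, seq_len, dim] text-encoder outputs for the prompt and for "".
    weights: [seq_len] per-token weights, 1.0 meaning "unchanged"."""
    w = weights.view(1, -1, 1)
    # weight 1.0 -> the normal encoding; weight 0.0 -> as if the token came from an empty prompt;
    # weight > 1.0 extrapolates past the normal encoding.
    return empty_cond + w * (cond - empty_cond)
```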

Yeah, so honestly, it might be good, but I haven't seen many complaints on Flux that it's not working.

Because I guess people can sort of get around it with language. And then coming back to LoRAs: now the popular way to customize models is LoRAs. And I saw you also support LoCon and LoHa, which I've never heard of before. There's a bunch of... Because what a LoRA essentially is...

Instead of like, okay, you have your model and then you want to fine tune it. So instead of like what you could do is you could fine tune the entire thing. But that's a bit heavy. So to speed things up and make things less heavy, what you can do is just fine tune some smaller weights. Like basically two matrices that when you multiply like two low rank matrices...

And when you multiply them together, it represents the difference between the trained weights and your base weights. So by training those two smaller matrices, that's a lot less heavy. And they're portable, so you can share them. Yeah, and also smaller. That's how LoRAs work, so...

Basically, when inferencing, you can inference with them pretty efficiently, like how ComfyUI does it. When you use a LoRA, it just applies it straight onto the weights, so there's only a small delay at the beginning, before the sampling, when it applies the weights, and then it's just the same speed as before. So for inference, it's...

it's not that bad. And then you have, so basically all the LoRA types, like LoHa, LoKr, everything, that's just different ways of representing that. Basically you can call it kind of like compression, even though it's not really compression; it's just different ways of representing it. Like, okay, I want to train a

difference on the weights; what's the best way to represent that difference? There's the basic LoRA, which is just, oh, let's multiply these two matrices together. And then there's all the other ones, which are all different algorithms.
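For reference, the basic LoRA case he describes looks roughly like this when merged onto the weights at load time; a sketch, not ComfyUI's exact code.

```python
# Sketch of how a LoRA is applied at inference: the two low-rank matrices are
# multiplied into a weight delta and added straight onto the base weights,
# so sampling afterwards runs at normal speed.
import torch

def merge_lora(base_weight: torch.Tensor, lora_down: torch.Tensor, lora_up: torch.Tensor,
               alpha: float, strength: float = 1.0) -> torch.Tensor:
    """base_weight: [out, in]; lora_down: [rank, in]; lora_up: [out, rank]."""
    rank = lora_down.shape[0]
    delta = lora_up @ lora_down                  # [out, in], the learned difference from the base weights
    return base_weight + strength * (alpha / rank) * delta
```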

So let's talk about what ComfyUI actually is. I think most people have heard of it. Some people might have seen screenshots. I think fewer people have built very complex workflows. So when you started, automatic was like the super simple way. What were some of the choices that you made? So the node workflow is one.

Is there anything else that stands out? It's like, this was like a unique take on how to do image generation workflows. Well, I feel like, yeah, back then everyone was trying to make like easy to use interface. Everyone's trying to make an easy to use interface. Let's make a hard to use interface.

I don't need to do that. I have everyone else doing it, so let me try something. Let me try to make a powerful interface that's not easy to use.

So, like, yeah, there's a sort of node execution engine. Your readme actually has a really good list of features of things you prioritized, right? Like, let me see: sort of re-executing only the parts of the workflow that were changed, an asynchronous queue system, smart memory management. Like, all this seems like a lot of engineering there. Yeah, there's a lot of engineering in the backend to make things easier.
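A simplified sketch of the "only re-execute what changed" idea: cache each node's output keyed by a hash of its settings and upstream results, and recompute only when that key changes. This is a conceptual illustration, not ComfyUI's actual execution engine.

```python
# Conceptual sketch of partial re-execution in a node graph.
import hashlib, json

_cache: dict[str, tuple[str, object]] = {}   # node_id -> (input_hash, cached output)

def run_graph(nodes: dict, order: list[str]) -> dict:
    """nodes: node_id -> {"fn": callable, "params": dict, "inputs": [upstream node_ids]}.
    order: node ids in topological order."""
    results = {}
    for node_id in order:
        node = nodes[node_id]
        upstream = [results[i] for i in node["inputs"]]
        key = hashlib.sha256(json.dumps(
            {"params": node["params"], "up": [repr(u) for u in upstream]}, sort_keys=True
        ).encode()).hexdigest()
        if node_id in _cache and _cache[node_id][0] == key:
            results[node_id] = _cache[node_id][1]                            # unchanged: reuse previous output
        else:
            results[node_id] = node["fn"](*upstream, **node["params"])       # changed: re-execute this node
            _cache[node_id] = (key, results[node_id])
    return results
```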

That's because I was always focused on making things work locally very well, because I was using it locally. So there's a lot of thought and work in getting everything to run as well as possible. So yeah, ComfyUI is actually more of a backend. Right.

At least, well, now the front end is getting a lot more development. But before it was, I was pretty much only focused on the back end. Yeah. So V0.1 was only August this year. Yeah, before there was no versioning. So yeah, yeah.

And so what was the big rewrite for the 0.1 and then the 1.0? Well, that's more on the front end side. Because before that, it was just like the UI. Because

When I first wrote it, I said, okay, how can I make... Like, I can do web development, but I don't like doing it. What's the easiest way I can slap a node interface on this? And then I found this library, LiteGraph, a JavaScript library. LiteGraph? LiteGraph. Usually people will go for, like, React Flow for a flow builder. Yeah, but that seemed...

like too complicated. So I didn't really want to spend time developing the front end. So I'm like, well, oh, LiteGraph. This has...

the whole node interface. So, okay, let me just plug that into my backend. I feel like if Streamlit or Gradio offered something, you would have used Streamlit or Gradio because it's Python. Streamlit and Gradio... I don't like Gradio. Why? It's bad. That's one of the reasons why Automatic was very bad. It's because...

The problem with Gradio is it forces you to... well, not forces you, but it kind of takes your interface logic and your backend logic and just

mixes them together. It's supposed to be easy for you guys. If you're a Python main... you know, I'm a JS main, right? If you're a Python main, it's supposed to be easy. Yeah, it's easy, but it makes your whole software a huge mess. I see, I see. So you're mixing concerns instead of separating concerns? Well, it's because front-end and back-end should be well separated with a defined API. That's how you're supposed to do it.

Smart people disagree, but yeah. It just sticks everything together. It makes it easy to end up with a huge mess. And also...

There's a lot of issues with Gradio. It's very good if all you want to do is just slap a quick interface to show off your ML project. That's what it's made for. There's no problem using it like, "Oh, I have my code. I just wanted a quick interface on it." That's perfect.

use Gradio. But if you want to make something that's like real software that will last a long time and will be easy to maintain, then I would avoid it. Yeah. So is your criticism of Streamlit and Gradio the same? I mean, those are the same criticisms. Yeah, Streamlit I haven't...

I haven't used it as much. Yeah, I just looked a bit... Similar philosophy. Yeah, it's similar. It just seems to me like, okay, for quick AI demos, it's perfect. Yeah. Going back to the core tech, like asynchronous queues,

Partial re-execution, smart memory management... anything that you were very proud of or was very hard to figure out? Yeah, the thing that's the biggest pain in the ass is probably the memory management. Were you just paging models in and out? Yeah, before it was just, okay, load the model, completely unload it. Load the new model, completely unload it.

Then, okay, that works well when your models are small. But if your models are big, it takes... Let's say someone has a 4090 and the model size is 10 gigabytes. That can take a few seconds to load and unload, load and unload. So you want to try to keep things

in GPU memory as much as possible. What ComfyUI does right now is it tries to estimate: okay, you're going to sample this model, it's going to take probably this amount of memory. Let's unload enough of the models already loaded on the GPU to free that amount of memory, and then just execute it.

It's a fine line, because you try to unload the least amount of models that are already loaded. And another problem is

the NVIDIA driver on Windows. There's an option to disable that feature, but by default, if you overflow your GPU memory, the driver is going to automatically start paging to system RAM.

But the problem with that is it makes everything extremely slow. So when you see people complaining, "Oh, this model, it works, but oh shit, it starts slowing down a lot," that's probably what's happening. So it's basically you have to just try to get, use as much memory as possible, but not too much, or else things start slowing down or people get out of memory.
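The eviction idea he's describing looks roughly like this; a sketch with hypothetical model objects, not ComfyUI's real memory manager.

```python
# Rough sketch: estimate how much VRAM the next sampling job needs, and unload
# already-loaded models until that much is free, instead of unloading everything.
import torch

def free_memory_for(required_bytes: int, loaded_models: list) -> None:
    """loaded_models: least-recently-used first; each has .size_bytes and .unload() (hypothetical)."""
    free, _total = torch.cuda.mem_get_info()        # bytes currently free on the GPU
    while free < required_bytes and loaded_models:
        victim = loaded_models.pop(0)               # evict as few models as possible, oldest first
        victim.unload()                             # move it back to system RAM / drop it
        free += victim.size_bytes
    # Leaving some headroom matters too: overflowing VRAM makes the Windows driver
    # silently page to system RAM, which slows everything down drastically.
```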

And then you just try to find that line where the driver on Windows starts paging and stuff. And the problem with PyTorch is it's high-level. It doesn't have that much fine-grained control over specific memory stuff. So you kind of have to leave the memory freeing to Python and PyTorch, which

can be annoying sometimes. So, you know, I think one thing as a maintainer of this project, like you're designing for a very wide surface area of compute. Like you even support CPUs. Yeah, well, that's just PyTorch CPUs. Yeah, it's just, that's not hard to support. First of all, is there a market share estimate? Like, is it like 70% NVIDIA and 30% AMD and then like miscellaneous on...

Apple Silicon, or whatever. For Comfy? Yeah. Yeah, I don't know the market share. Can you guess? I think it's mostly NVIDIA. Because the problem is AMD works horribly on Windows. On Linux, it works fine. It's slower than the price-equivalent NVIDIA GPU.

But it works. You can use it, generate images, everything works. On Windows, you might have a hard time. So that's the problem. And most people, I think most people who bought AMD probably use Windows. They probably aren't going to switch to Linux. So...

So until AMD actually ports ROCm to Windows properly, and then there's actually PyTorch... I think they're doing that. They're in the process of doing that. But until they get a good PyTorch ROCm build that works on Windows, it's like...

they're going to have a hard time. Yeah. We got to get George on it. Yeah. Well, he's trying to get Lisa Su to do it. Let's talk a bit about the node design. So unlike all the other text-to-image tools, you have a very, like, deep... So you have a separate node for, like, CLIP encode. You have a separate node for the KSampler. You have all these nodes, going back to the making-it-easy versus making-it-hard thing. But, like,

How much do people actually play with all the settings? You know, kind of like, how do you guide people to like, hey, this is actually going to be very impactful versus this is maybe like less impactful, but we still want to expose it to you. Well, I try to expose, like, I try to expose everything or, but yeah,

But for things like, for example, the samplers, there are four different sampler nodes which go from easiest to most advanced. So if you go to the easy node, the regular sampler node, you have just the basic settings. But if you use the sampler custom advanced node, that one you can actually... You'll see you have...

like different nodes. I'm looking it up now. Yeah. What are like the most impactful parameters that you use? So it's like, you know, you're going to have more, but like which ones like really make a difference? Yeah, they all do. They all have their own, like for example, yeah, steps. Usually you want steps, you want them to be as low as possible, but yeah,

If you're optimizing your workflow, you lower the steps until the images start deteriorating too much. Because that's the number of steps you're running the diffusion process. So if you want things to be fast, that's...

Lower is better. But yeah, CFG, you can kind of see that as the contrast of the image. Like, if your image looks too burnt out, then you can lower the CFG. So yeah, CFG, that's how strongly the negative versus positive prompts are applied.

So when you sample a diffusion model, with a negative prompt it's basically just the positive prediction minus the negative prediction. Contrastive loss. Yeah. Positive minus negative, and the CFG is the multiplier. What are good resources to understand what the parameters do? I think most people start with automatic, and then they move over and it's like,

Steps, CFG, sampler name, scheduler, denoise. But honestly, it's something you should try out yourself. You don't necessarily need to know how it works to know what it does. Because even if you know, like, CFG is positive minus negative prompt...
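The formula being referenced is classifier-free guidance; a schematic sketch, with the model call as a placeholder rather than a real API.

```python
# Classifier-free guidance in one line: blend the conditional and unconditional
# (negative-prompt) noise predictions, scaled by the CFG value.
import torch

def cfg_noise_prediction(model, latent, timestep, cond, uncond, cfg_scale: float) -> torch.Tensor:
    if cfg_scale == 1.0:
        # Negative prompt has no effect, and only one model pass is needed -> roughly 2x faster sampling.
        return model(latent, timestep, cond)
    noise_cond = model(latent, timestep, cond)      # positive-prompt prediction
    noise_uncond = model(latent, timestep, uncond)  # negative / empty-prompt prediction
    return noise_uncond + cfg_scale * (noise_cond - noise_uncond)
```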

So the only thing you know with CFG is if it's 1.0, then that means the negative prompt isn't applied. It also means sampling is two times faster. But other than that, it's more like you should really just...

see what it does to the images yourself, and you'll probably get a more intuitive understanding of what these things do. Any other nodes or things you want to shout out? Like, I know AnimateDiff, IPAdapter, those are some of the most popular ones. Yeah, what else comes to mind?

Not nodes, but there's... What I like is when some people sometimes they make things that use ComfyUI as their backend. Like there's a plugin for Krita that uses ComfyUI as its backend. So you can use all the models that work in Comfy in Krita. I think I've tried it once.
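Using ComfyUI as a backend generally means talking to its local HTTP API, along the lines of the example scripts that ship with it; a minimal sketch, where "workflow_api.json" is assumed to be a workflow you exported from the UI in API format.

```python
# Minimal sketch of queueing a workflow against a locally running ComfyUI server.
import json
import urllib.request

with open("workflow_api.json") as f:          # a workflow exported from the UI in API format
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())                # returns an id you can poll for the finished images
```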

But I know a lot of people use it and find it really nice. What's the craziest node that people have built? The most complicated?

Craziest? No... I know some people have made video games in Comfy and stuff like that. I remember last year, someone made a

like, Wolfenstein 3D in Comfy. And then one of the inputs was, oh, you can generate a texture and then it changes the texture in the game.

So you could plug it into, like, a workflow. And there's a lot of... If you look around, there's a lot of crazy things people do. So, you know. And now there's like a node registry that people can use to download nodes. Yeah. Like, well, there's always been the ComfyUI Manager, but we're trying to make this more, I don't know, official, like with, uh,

Yeah, with the node registry. Because before the node registry, it's like, okay, how did your custom node get into ComfyUI Manager? It's the guy running it who, like, every day searches GitHub for new custom nodes and adds them manually to his custom node manager. So we're trying to make it less effort for him, basically. Yeah.

Yeah, but I was looking. I mean, there's like a YouTube download node. This is almost like a data pipeline more than like an image generation thing at this point. It's like you can get data in, you can apply filters to it, you can generate data out. Yeah, you can do a lot of different things. Yeah, something I think...

What I did is I made it easy to make custom nodes. So I think that helped a lot for the ecosystem, because it is very easy just making them. So, yeah, a bit too easy sometimes. Then we have the issue where there are a lot of custom node packs which share similar nodes.
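A custom node is just a small Python class; here is a minimal sketch following the commonly used layout (INPUT_TYPES / RETURN_TYPES / FUNCTION / NODE_CLASS_MAPPINGS), simplified for illustration.

```python
# Minimal sketch of a ComfyUI custom node. Dropping a file like this into the
# custom_nodes/ directory registers it via NODE_CLASS_MAPPINGS.
class ImageBrightness:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "image": ("IMAGE",),                                              # [B, H, W, C] float tensor in 0..1
            "factor": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 4.0}),
        }}

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "apply"
    CATEGORY = "image/adjust"

    def apply(self, image, factor):
        return ((image * factor).clamp(0.0, 1.0),)   # nodes return a tuple matching RETURN_TYPES

NODE_CLASS_MAPPINGS = {"ImageBrightness": ImageBrightness}
NODE_DISPLAY_NAME_MAPPINGS = {"ImageBrightness": "Image Brightness"}
```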

But that's something we're trying to solve, maybe by bringing some of that functionality into the core. Yeah. And then there's, like, video. People can do video generation. Yeah, video, that's...

The first video model was, like, Stable Video Diffusion, which was last... yeah, exactly last year, I think, like one year ago. But that wasn't a true video model. So it was... It was like moving images? Yeah, it generated video. What I mean by that is it's like...

It's still 2D latents. Basically what they did is they took SD2 and then they added some temporal attention to it and then trained it on videos, so it's kind of like AnimateDiff, the same idea basically.

Why I say it's not a true video model is that you still have the 2D latents. A true video model like Mochi, for example, would have 3D latents. So you can move through the space, basically. That's the difference. You're not just kind of reorienting. Yeah, and it's also because you have a temporal VAE. Also, Mochi has a temporal VAE that compresses on the

temporal direction also. So that's something you don't have with, yeah, AnimateDiff and Stable Video Diffusion. They only compress spatially, not temporally. So yeah, that's why I call them true video models. There's actually a few of them, but the one I've implemented

in Comfy is Mochi, because that seems to be the best one so far. We had AJ come and speak at the Stable Diffusion Meetup. The other open one I think I've seen is CogVideo. Yeah, CogVideo. Yeah, that one seems... Yeah, it also seems decent. But yeah...

Chinese, so we don't use it. No, it's fine. It's just, yeah, I could... Yeah, it's just that there's a... It's not the only one. There's also a few others, which I... The rest are, like, closed source, right? Like Kling and all this. Yeah, closed source, there's a bunch of them. But I mean, open... I've seen a few of them...

I can't remember their names, but there's CogVideo, the big one. And there's also a few of them that released at the same time. There's one that released at the same time as SD 3.5, the same day, which is why I don't remember the name. We should have a release schedule so we don't conflict on each of these things. Yeah, I think SD 3.5 and Mochi released on the same day.

So everything else was kind of completely drowned out. So for some reason, lots of people picked that day to release their stuff. Yeah, which is a shame for them, I guess. And I think OmniGen also released the same day, which also seems interesting. Yeah, what's Comfy? So you are Comfy, and then there's Comfy.org.

I know we do a lot of things with, like, Nous Research, and those guys also have kind of a more open-source, anon thing going on. How do you work? Like you mentioned, you mostly work on the core piece of it. And then what? Maybe I should back up, because I feel like, yeah, I only explained part of the story. Right. Yeah. Maybe I should explain the rest. So, yeah.

So yeah, basically January 2023, January 16th, 2023, that's when Comfy was first released to the public. Then, yeah, I did a Reddit post about the area composition thing somewhere in, I don't remember exactly, maybe end of January, beginning of February. And then someone, a YouTuber, made a video about it.

Like, Olivio, he made a video about Comfy in March 2023. I think that's when there was a real burst of attention. And by that time, I was continuing to develop it, and it was getting... People were starting to use it more, which unfortunately meant... I had first written it to do experiments, but then...

my time to do experiments started going down because people were actually starting to use it. I had to, and I said, well, yeah, time to add all these features and stuff. Then I got hired by Stability in June 2023.

Then I made... Basically, yeah, they hired me because they wanted SDXL. So I got SDXL working very well in ComfyUI, because they were experimenting with it. Actually, how the SDXL release worked is they released, for some reason, the code first, but they didn't release the model checkpoints.

So they released the code. And then, well, since the researchers released the code, I implemented it in ComfyUI too. And then the checkpoints were basically early access. People had to sign up, and they only allowed people with edu emails. If you had an edu email, they gave you access to SDXL 0.9.

And well, that leaked, of course, because of course it's going to leak if you do that. Well, the only way people could easily use it was with Comfy. So yeah, people started using it, and then I fixed a few of the issues people had. So then the big 1.0 release happened, and well, ComfyUI was the only way a lot of people could actually run it on their computers,

because automatic's implementation was just so inefficient and bad that for most people it just wouldn't work.

Because he did a quick implementation. So people were forced to use ComfyUI. And that's how it became popular, because people had no choice. The growth hack. Yeah. Yeah. Like, everywhere, like people who didn't have the 4090, who had just regular GPUs. Yeah, yeah. They didn't have a choice. Yeah.

Yeah, I got a 4070, so think of me. And so today, is there like a core Comfy team? Yeah, well, right now, yeah, we are hiring, actually. So right now, the core itself, it's me.

The reason why all the focus has been mostly on the front end right now is because that's the thing that's been neglected for a long time. So most of the focus right now is all on the front end, but we will soon get more people to help me with the actual back-end stuff.

Once we have our V1 release, which is going to be the packaged Comfy, the one with the nice interface and easy install on Windows, and hopefully Mac. Yeah. Yeah.

Once we have that, we're going to have lots of stuff to do on the backend side and also the frontend side. What's the release date? I'm on the waitlist. What's the timing? Soon. Yeah, I don't want to promise a release date. We do have a release date we're targeting, but...

I'm not sure if it's public. Yeah. Yeah. And we're still going to continue doing the open source thing, making ComfyUI the best way to run, like, Stable Diffusion models. At least on the open source side, it's going to be the best way to run models locally. But we will have a few things to make money from it, like...

cloud inference or that type of thing. And maybe some things for some enterprises. I mean, a few questions on that. How do you feel about the other comfy startups? I mean, I think it's great. They're using your name. Yeah, well, it's better to use comfy than to use something else. Yeah, that's true. Yeah. Like, it's fine. I don't like...

Yeah, we're going to try... We want people to use Comfy. Like I said, it's better that people use Comfy than something else. So, as long as they use Comfy, it's...

I think it helps the ecosystem. Because more people, even if they don't contribute directly, the fact that they are using comfy means that people are more likely to join the ecosystem. And then would you ever do text? Yeah, well, you can already do text with some custom nodes. So yeah, it's something we like.

Yeah, it's something I've wanted to eventually add to core, but it's more like not a very high priority. Because a lot of people use text for prompt enhancement and other things like that, so it's...

Yeah, it's just that my focus has always been on diffusion models. Unless some text diffusion model comes out. Yeah, David Holz is investing a lot in text diffusion. If a good one comes out, then I'll probably implement it, since it fits with the whole... I imagine it's going to be closed source at Midjourney. Yeah, well, if an open one comes out, then...

Yeah, I'll probably implement it. Cool, Comfy. Thanks so much for coming on. This was fun.