Playground's AI image diffusion model is state-of-the-art due to its exceptional text accuracy, prompt adherence, and user experience. It allows users to interact with the model in natural language, making it feel like talking to a graphic designer. The model can handle extremely detailed prompts, up to 8,000 tokens, and excels in spatial reasoning and text generation, which sets it apart from other models like Midjourney or Stable Diffusion.
Text accuracy was a top priority for Playground because text is integral to the utility of graphics and design. Without accurate text, designs often feel incomplete or less functional. The team faced challenges, with text accuracy at one point as low as 45%, but they overcame this by focusing on detailed prompts and improving the model's understanding of text-related tasks, which is crucial for creating logos, t-shirts, and other design elements.
Playground's approach to prompting is more visual and user-friendly compared to other models. Instead of requiring users to write detailed prompts, Playground allows users to start with templates and modify them using natural language. This reduces the need for prompt engineering and makes the process more intuitive, enabling users to achieve their desired results without extensive trial and error.
Playground faced several challenges, including improving text accuracy from a low of 45%, ensuring prompt adherence without compromising aesthetics, and creating a user experience that felt natural. The team also had to navigate the complexities of integrating detailed prompts with visual design, which required significant research and innovation. Additionally, they had to balance the model's adherence to prompts with aesthetic quality, which sometimes led to lower user scores despite the model's accuracy.
Playground's model excels in spatial reasoning and text generation by allowing users to specify exact details like the position of elements, font size, and leading. It can handle complex prompts involving spatial relationships, such as placing a green triangle next to an orange cube, and generates accurate text that adheres to user instructions. This level of control and precision is a significant improvement over other models like Midjourney or Stable Diffusion.
Playground's marketplace allows creators to design and sell graphics, stickers, and t-shirts directly through the platform. This not only provides a revenue stream for creators but also enriches the product with high-quality, user-generated content. The marketplace is part of Playground's strategy to make the product more accessible and useful for a broader audience, moving beyond just image generation to a full-fledged design tool.
Playground's model often scores lower in aesthetics compared to Midjourney because it prioritizes prompt adherence. While Midjourney may produce more visually pleasing images by ignoring certain prompt details, Playground's model strictly follows user instructions, which can sometimes result in less aesthetically pleasing outputs. This creates a trade-off between adherence and aesthetics, which Playground is working to address.
Suhail Doshi learned the importance of focusing on the biggest market and avoiding niche or unsustainable user bases, as he did with Mixpanel and Mighty. He also emphasized the value of having a tailwind for a company, where external factors like technological advancements support growth. These lessons shaped Playground's strategy to target the broader graphic design market and leverage the AI revolution for scalable success.
Playground's model is designed to capture emotional expressions in images, such as happiness, sadness, or anxiety. This is achieved through detailed prompts that describe the desired emotional state, allowing the model to generate images that accurately reflect those emotions. This capability enhances the model's utility for creating expressive and meaningful designs.
Playground aims to continue improving its model by enhancing prompt understanding, text accuracy, and aesthetic quality. The team is also exploring new features like emotional expression and better spatial reasoning. Additionally, they plan to expand the marketplace for creators and integrate more user feedback to refine the product. The goal is to make Playground a comprehensive tool for graphic design, potentially rivaling established platforms like Canva.
I think we thought the product was going to be one way and then we literally ripped it all up in a month and a half or so before release. We were sort of like lost in the jungle for a moment, like a bit of a panic. There's a lot of unsolved problems, basically. I mean, you know, even this version of it, people are going to try it and they might be blown away by it. But like the next one's going to be even crazier. To get to SOTA, you basically have to be maniacal about like every detail. There could be some people that train their models and they get cool text generation, but the kerning is off. Are you the kind of person that will care about the kerning being off, or are you the kind of person that is okay with it, or you don't even notice it?
Welcome back to another episode of The Light Cone. I'm Gary. This is Jared, Harj, and Diana. And collectively, we have funded companies worth hundreds of billions of dollars, usually just with one or two people just starting out. And we're in the middle of this crazy AI revolution. And so we thought we would invite our friend Suhail Doshi, founder and CEO of
Playground, which is the state-of-the-art image generation model with also a state-of-the-art user experience, and it just launched. So how are you feeling, Suhail? Very under pressure right now.
That's good then. You sound like a startup founder, which is normal. Maybe the best way to start off is to look at some examples of the images that you were able to generate. And this is stuff sort of right off the presses. So at Y Combinator, I also am one of the group partners. So I fund a number of companies every batch. I funded about 15 for the summer batch.
And so what we're looking at here is one of the t-shirt designs I made. As you can see, there's a GPU and it was based on one of the core templates in your library. I like metal, so this very much spoke to me. This one was off of a sticker design.
And I guess I just really liked that sword. And what I was able to do is add GPU fans. Love it. I love it. And so that's one of the noteworthy things about Playground. You can upload an image. It'll sort of extract the essence of like sort of the aesthetic and some of the features of it. And then you can remix it. - This one feels like a tattoo. - Yeah, exactly. - Do you remember what you prompted it with to get those? - Oh yeah, I basically, so the cool thing about Playground to create this was I picked a default template that I liked and I think it only had the sword and sort of this ribbon. And I said, make it say House Tan on the ribbon
and add a GPU with two fans. I was very specific, I wanted a two-fan GPU. And that's one of the things you'll see in all these designs. This is actually the t-shirt that House Tan itself actually chose.
So, you know, it's a very summery vibe. I think this was based on something around summer and surfing, and we replaced the surfboard with the GPU. I feel like you used a preset that we had. I did. Yeah, all of these are from presets. They're pretty good.
I think the noteworthy thing that I was able to do is I didn't have to prompt and reprompt and reprompt and sort of keep trying to refine the same text prompt. I actually could just talk to a designer and it would just give me what I wanted. Going from left to right, for instance, by default, I think the template had this yellowish background and I said, make it on white. And that was like a very...
unusual interaction that I'm not used to. Usually you're used to
either Discord with Midjourney, or you're sort of used to a chat interface, or like prompt and then twiddle things and reprompt and reprompt and reprompt, whereas this felt much more natural language. I could just talk to, you know, a machine designer that would take my feedback into account. Yeah, normally when you make these kinds of images you have to like describe all of it, right? You'd have to say like
I want it on this beige background and I want this orange sunset. And then you'd have to even describe like the lines of the sun and, you know, or you don't describe very much. And then every time you try, it's like totally different from the other thing. So usually, you know, you either have to learn like a magical incantation of words.
versus being able to pick something that you start from. And then also with these images, Gary, did you add this text in post-processing or is the model actually incorporating the text organically? Oh, the model will both take your direction on what should be there and what its size is. You can actually specify where in the design. You could say, I want it in the middle. I want it at the top. Could we use a font that's bigger or smaller? Better leading? Could you kern it a little bit? You could just speak to it in plain English. And I'd never seen that in any image model today. That's crazy because the text is flawless. And anyone who's used DALL-E knows that if you try to get it to write text, the text comes out like
garbled and zombie-like. Yeah, it's pretty incredible having just accurate text and then being able to position the text exactly where you want. That is very cool. It is really SOTA in terms of text adherence and coherence in terms of following prompts, which is really cool. One thing that we think is really cool is it's inventing fonts. Yeah.
Like, I don't know what font that is. It might be a real font, but I think that there are all these circumstances where it's actually just, it's like extrapolating from many different kinds of fonts and like actually inventing new things.
which is really, really cool. OK. And these are just a couple other versions. I saw some old timey thing, and I was like, OK, could you do a vector-y version of a GPU on the left? And then on the right, there was a very sort of Japanese art house aesthetic. These are great. And then this one, if left to my own devices, I was going to print this one because I really liked-- The right side one? Yeah. And what I could do is actually tell it.
Make it even more sort of prototypically like Japanese art. You know, like, I want more waves, I want more sun. And, you know, it basically kept doing it. I think I know this preset. I remember making this preset like a month and a half ago. I think it's called like Mythic Ink or something like that. That's how the app works. You know, you open the app, you select a preset, or you can upload your own design that you're like really into. And then it will seemingly extract the vibe of that particular thing. Like, you know, it won't, it's not going to be a copy. It will be a remix. Did you purposely design it to be so good with text? Or is that like an emergent property of just how you
architected everything. We definitely focused on making text accuracy really good. I think it's been kind of our number one focus. And part of it is text to us is so interrelated with actually the utility of graphics and design. Because a lot of things without text just mostly feel like art. But yeah, text was an extraordinarily high priority. And it was really hard, actually. There was like maybe a point there where like our text accuracy was 45%. We were sort of like lost in the jungle for a moment, like a bit of a panic, but we figured it out. I think one of the remarkable things on all these designs is that, and I was playing a lot with it as well,
a lot of the outputs are very utilitarian and useful. Because I play with Midjourney and all of those, and I think they're fun, but they're more like toys, more like art. But it's really hard to work with them if you actually wanted to design logos, t-shirts, font sizes. I could totally see this replacing Adobe Illustrator, right? Right, yeah. Yeah, I think that, you know...
And part of it's kind of funny. It's like the reason why I'm partly so excited about graphic design is because actually when I was younger, when I was in high school, I used to do logo contests and I would try to win them. I think there's a site called like SitePoint.net or something. And I was just trying to make like a little bit of money before college, before going to college. And so I did all these like logo designs and did all these tutorials,
trying to win them. And so during the training of this model, I tested it for logos and I started to be like, wow, it's actually way better than anything I could have made. And then I've also made like my own company logos typically, which are also very bad. And so it just feels to me like if you can get text and you can get these other kinds of use cases, you're probably going to be able to beat at least the midpoint graphic designer that's an illustrator. And then I think over time, we should be able to get to the 90th percentile graphic designer. So this is actually a really different use case that really hasn't been addressed. I haven't seen image models try to design graphics or illustrations. It's less generating really cool images that would replace stock art or something like that.
It's more literally allowing you to create Canva type things whenever you want and you don't have to mess around with it. It's plain English. Just talk to the model. The model is going to create what you want. I've never seen anything like that.
Yeah, I think we were just sort of like looking at what are the use cases for graphic design. And, interestingly, it has a lot of real world, like physical impact, physical world impact, because there are like bumper stickers and then t-shirts. I think I was at Outside Lands the other weekend. And I was just looking at everyone's t-shirts, looking at what they have on them. And I saw a bunch of women at Outside Lands had this t-shirt that said, I feel like 2007 Britney. I just thought that was such a cool shirt. And so we made the template for it and put it in the product. But there's just like so much cool real world impact. And I think, I'm almost a little disappointed that Myspace doesn't exist, for those that were on Myspace, 'cause it was such an expressive social network. And I feel like humans really deeply care about that form of expression. And yeah,
And so it's really cool to be able to make a model that's like really focused on all those kinds of things. But you're actually building a product. It's not just research because when, with all these designs in Playground, you can actually go and purchase them, like the stickers, the t-shirts, right? Right. Can you tell us about kind of this marketplace that you're building? Yeah. So I think that, you know, one thing that we learned was that it's kind of hard for people to prompt. And because it's hard to prompt,
We found also it's hard to teach people how to prompt. And the truth is that when you make these models, it's not like we even know how it works. We are also discovering with the community how the model kind of works. And so one of the things that we decided to do was, you know, me and our designer, we decided that one core belief was that the product should be visual first, not text first, which is a huge departure from like language models and ChatGPT.
Because our product is so visual, why should it not be? And so in order to make it visual first and to make it so that you don't have to learn how to prompt, we decided that we would start from something like a template, which is something people already understand in a tool like Canva. It's not something that we necessarily invented. There's templates everywhere. But I think that if you could start from a template and then we could make it really easy to modify that template, then...
It feels like we've already taken you like 80% of the journey. If it was like, I feel like 2007 Britney, but then you wanted to change the celebrity and the year to a different person. Yeah.
then you totally could. I wanted to make that very easy. But it also required a lot of integration with research. Because how do you make these changes? How do you make them coherent? How do you keep things like similar? It's not as simple as, you know, just the 75, 77 tokens that you put into Stable Diffusion. The existing open source models aren't really capable of that. So it required kind of
Yeah, like the marrying of what a good product should feel like and then enabling that with research, which is not always possible. I think that's what Gary was getting at with you building the state-of-the-art UX, the UI for all these models. Because up to this point, people just get raw access. It feels like kind of back in the days, you would just SSH to a computer and kind of work with it. That's how people interact with these models. But you...
kind of basically built a whole new browser on top of it. Nobody has done it, and you've done it really well. Can you talk about this idea of departing from raw model access? Yeah, I think we just observed users failing over 18 months. And so AI is a little bit weird right now because there's such a big novelty factor, I would say.
And it's exciting because we're able to do things we've never been able to do before. And so as a result, you can easily get millions of users using your product. And that's totally what happened to us. And so it feels almost like, oh, maybe I've got the product. But then when you actually go look at the data and how people are using it, there's just this constant failure of people using the product. And so... Yeah, you're talking about sort of the prior version of Playground. The prior version of Playground, yeah. So it didn't have this type of model. It didn't...
It was really quite the setting. We mostly used Stable Diffusion. We used open source models, and then we started training some of our own that are very similar to Stable Diffusion as a way to ramp up to where we are now. When we watched users prompt this model, obviously the two pieces of feedback were, this is fun, it's cool, I can get a cat drinking beer. And then you post it to Twitter, and it's exciting. But then...
But why would people come back, you know, is one big question. And then the second part is that people are using our service a lot, but they're not always using our service a lot because it's like a useful thing. It's because they're not getting what they want. So they have to keep retrying. Yeah. Yeah.
You know, we're like Google trying to get you off the website, you know, that sort of feeling, like it's almost bad that people are using it too much in some sense. And, you know, they just keep, we call it re-rolling, right? Just keep re-rolling to get a different image or slightly better image or fix like a paw or tail that's off, you know. And then the other thing that happened was that our model can take an extremely long prompt. With most of these models, you can only write 75 tokens. But with our model, it's like
8,000. And most people, you're never really going to go over 1,000 right now. I say that now, but we'll see. 1,000 tokens is a lot. And our model lets you be extremely descriptive.
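A minimal sketch of that token-limit point, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint rather than Playground's actual stack; the example prompt is invented:

```python
# A minimal sketch (not Playground's stack) of why long prompts and CLIP
# conditioning don't mix: CLIP's text encoder has a 77-token context window,
# and anything past it simply never conditions the image.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

detailed_prompt = (
    "A minimalist t-shirt graphic on a white background: a two-fan GPU centered "
    "above a curling ribbon that reads 'HOUSE TAN' in bold serif lettering, warm "
    "orange sunset bands behind it, subtle film grain, wide margins, flat "
    "vector-style shading, the ribbon text kerned slightly tighter than default."
)

token_ids = tokenizer(detailed_prompt)["input_ids"]
print(f"{len(token_ids)} tokens vs. a context window of {tokenizer.model_max_length}")
# Everything beyond the window is dropped at encoding time, which is why a model
# with a much longer text context (the conversation mentions roughly 8,000 tokens)
# can be steered by far more descriptive prompts.
```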
And so you can really describe the texture of the table, skin texture. We have all those like puzzle prompts where it's like green triangle next to an orange cube, you know, and it works. Like spatial reasoning is all there actually, including text generation. That's totally novel. And really, I'd never seen that before. Yeah. You know, with the first generation of these models, almost immediately what you'd do is say, like, generate me a green sphere on top of a blue triangle, and it just wouldn't do it. It's like there'd be those elements, but it would just be all jumbled up because it was using CLIP. It did not have contextual reasoning or understanding. Yeah, and CLIP was trained with a lot of error, actually, because it's just using kind of the alt tags of the images that are scraped on the internet, which could be like anything.
We sort of decided that what we were going to spend our time on was prompt understanding and text generation accuracy, because we also felt like aesthetics were kind of saturating. Like they're getting better, but they're also just kind of like not getting better at a fast enough rate. And users even vote and say, even in the Midjourney Discord, you know, they'll poll their users and they'll say, what do you want to be better? And like aesthetics is going lower and lower on the rank of things that people care about.
So we wanted to try to leap on something that really mattered to users, which was prompt understanding and text generation accuracy for those kinds of use cases. But when you have a very long prompt, it's not really feasible to ask anybody, like, are you going to write an essay? And so we started to realize that the prompt is almost like, it's kind of like HTML for graphics, which I think is so cool. I think you've done
a lot here because you completely have a novel architecture that really gets to magical prompting, because the experience of using Playground feels as if you're talking to a designer. It has a coherence. It listens to you. Because with other models, I don't know, with Midjourney, if you want to move the text or whatever, it doesn't. The positional awareness is not there. I guess one of the insights you had when we chatted a bit earlier, one of the problems is, to create good designs, you have to have a lot of description for the images. And users are basically lazy, right?
Right. They might just tell you, I want a nature scene. And if you input this into Midjourney, what would it give you? Yeah, it'll give you this very beautiful, very rich, high-contrast nature scene. But you've done something very interesting. We want to talk a bit about how you've gone about aiding users and expanding upon the prompt to actually build something much better. The first thing to kind of improve our prompt understanding was just making our data better,
pretty much, it's actually just that simple. And so one of the first things we wanted to do was we wanted to have extremely detailed prompts. So when we train the model, we train on very, very detailed prompts. But we also want users to feel like they could just say nature scene. And so sort of what you see here is just
how detailed we can get. And actually, we're actually even more detailed than this these days. When we train the next model, it'll be even more than this. But once you get to this level of detail, I mean, we're just teaching the model to represent all of these concepts correctly, whether something is in the center or whether there's a background blur. One thing that we want to get better at, and I think we're actually already pretty good at this, but it's emotional expression.
It's like another thing. Like we have this image of Elon Musk and he's like disgusted. He's anxious. He's happy. He's sad. He's confident. And like trying to see his expression in all these different ways.
And so that's just like one thing that we want to make sure is represented in these prompts. There's obviously a lot more, like spatial location. And so by doing this, we can ensure that the model could be a good experience if you raw prompted it as a user, if you just said nothing. And then most of the time, users are not really writing more than like maybe like caption three here or something like that. I mean, even that's kind of a lot. It was a lot. I think when I was playing, I was mostly like on five and six.
Yeah, yeah, exactly. When you're playing around, like the normies are kind of doing five and six, and then the like hardcore prompters are copying each other's prompts, and then they end up more like one, but they don't even look like one. And one is a very unnatural way of typing, you know, like nobody's writing these essays and paragraphs, the text is too much work.
And that was one thing where we knew we were going to probably fail if we expected users to do this. So this kind of led us to a more visual approach where you're picking something you already like in the world, that we understand how that's represented in our model, and then we can make those changes and edits and stuff like that.
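A minimal sketch of the multi-level caption idea being described, where each training image gets captions at several levels of detail so the model behaves well whether a user types an essay or just "nature scene". The detail levels, field names, and the describe_image captioner are hypothetical illustrations, not Playground's pipeline:

```python
# Sketch: pair each training image with captions of decreasing detail, then
# randomly mix levels at training time so short prompts still work at inference
# while long, precise prompts remain fully usable.
from dataclasses import dataclass
import random

DETAIL_LEVELS = {
    1: "exhaustive: layout, every object and its position, lighting, lens, text, mood",
    3: "moderate: main subjects, composition, palette",
    6: "terse: a few words, e.g. 'nature scene'",
}

@dataclass
class TrainingExample:
    image_path: str
    captions: dict[int, str]  # detail level -> caption text

def build_example(image_path: str, describe_image) -> TrainingExample:
    """Caption one image at several detail levels.

    `describe_image(image_path, instructions)` stands in for whatever
    captioning model is available; it just needs to return a string.
    """
    captions = {
        level: describe_image(image_path, instructions=spec)
        for level, spec in DETAIL_LEVELS.items()
    }
    return TrainingExample(image_path, captions)

def sample_caption(example: TrainingExample) -> str:
    """Pick a random detail level for this training step."""
    level = random.choice(list(example.captions))
    return example.captions[level]
```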
Is the benefit of expanding the prompts this way that you're more likely to get what the user wants at the first go? Or is it that it just makes it easier for them to iterate on it to get to what they want? I don't even know that we necessarily needed to do this. But I think the reason why we did it was because initially, we didn't know how good the model would be. And so we needed to serve users in the way that they already use the existing models. And so we didn't exactly know the breakthrough interface. We hadn't gotten there yet.
And so in order to make sure that we would work the way everyone is happy with, we wanted to do this kind of like segmented out. It's almost like lossy prompting. And so that's why we do it. But I think, you know, it's not even as necessary now. But I think the other reason to do it this way is that once the prompts get extremely detailed, it's hard to have too much variation between the images, because you're kind of locking in on your image.
And so by having kind of ambiguity in the prompt, you can get more variational abilities. So there's like, we call it image diversity. So that way you say squash dish, but it's like really different each time. I guess the cool thing about your product, you basically remove all of the prompt engineering guesswork, because you do it behind the scenes, expanding and exploding into this multi-caption level system, right? I guess what comes to mind is sort of back in the day, if you needed to navigate a website through the command terminal, maybe you'd curl and do GETs and POSTs literally, like typing the commands, until you had a browser to actually have the right UI, right? What I told my team was, I said, we should be doing the prompt engineering for users. It should not be like the users are the prompt engineers, or the prompt graphic designers, if you will, here.
But like, it shouldn't be like the users have to go, like, we can't write a, what are we going to do? Write a manual on how to do this? You know, it's just too tricky. Like 1% of humanity will understand that manual, and the rest will be like, I don't know how to use this. It's too difficult. So I think it's really valuable that, you know, I told my team, I think it's very important we do all of that work. Like we should have an extremely strong sense of how the model works rather than putting that on the users, where I think it's infeasible. And then the other thing that we do is we now work with creators to help us kind of construct these different templates and different prompts around these templates and stuff like that. And they might be like the 1% of humanity that's willing to learn this
on behalf of users. And this is totally normal. That's what YC does. We build these great companies that billions of people in humanity use as a result of that. I guess there's two things that come out of this. One is you might be creating a whole new profession. Sort of back in the day with design, you had Behance, where people hire designers. Now people will...
through Playground, hire AI designers that are this top 1%? Right. Well, we're doing it, actually. So we are hiring them. Oh, you're hiring them? Yes, we're hiring them. We're going to launch a creator program soon, actually. And the goal is to bring on creators that have good taste. That still matters, right? There's this image of a squash dish, but it's not a very beautiful image. And I think taste is still real in the world. And it's also, in design...
you know, in LLMs you get to like measure how well you did on a biology test, and that's like a pretty objective thing.
But for design, it's constantly evolving. Like design from 10 years ago can look dated unless you're like Dieter Rams. But I think, you know, more fundamentally, we want to bring on creators that are going to help create graphics that other people can then use. And we're actually paying them. I guess one thing that's cool, the second thing, because of this, you actually are state of the art on many aspects for this model.
So much of it was driven by the product, because now, in order to get the good captioning, you probably are beating GPT-4o, right? In terms of image captioning? We are beating, yeah, we now have a new SOTA captioner. To generate these. And that was not just to be like a benchmark, but actually for a very practical purpose, to build the model. Can you maybe tell us a bit of what's underneath? Because...
PG v3, Playground v3, right? It's all in-house and state-of-the-art in many aspects. Yeah, so the whole architecture of the model, we had to rip everything out.
Um, so like the normal Stable Diffusion architecture that people know about is like, there's a variational autoencoder, a VAE, and then there's CLIP. And then there's this UNet architecture, for people that are in the know. And then since then it's kind of evolved to using, um, you know, more transformers. Like there's this great paper by, I think it was William Peebles, that did DiT, which I think is what people believe Sora is based on as well. And so there's some new models that are using that. We actually don't use any of those new architectures either. We did something completely from scratch. But one of the reasons why we had to kind of blow everything up was because you can't really get this kind of prompt understanding using CLIP, because there's just so much error in CLIP. And it's also just bounded by the architecture of that model. And then the second thing is that
We also needed the text accuracy to be really high. So you can't just use the off-the-shelf VAE from Stable Diffusion because it cannot reconstruct
small details, like, I don't know if you guys ever noticed, like the hands and the logos, hands, zoomed-out faces. Yeah, you need something, you also need like a state-of-the-art VAE, or something like a VAE that's better than the existing one. Like the existing ones are like four-channel. And so there's all these pieces, and they all interact, and they can all bound the overall performance of your model. And so we basically looked at every single piece. And then I think like four months ago, with the team, we were literally at the whiteboard with the research team, and there was like the non-risky architecture, which was kind of more similar to some of the state-of-the-art open source models that are out these days, like Flux and stuff. And then there was like this other architecture that shall not be named, and, um,
And we were like, well, that's like the risky one where we don't even know if it'll work. And if we try it for two or three months, we'll waste compute. And it might just blow up and then we'll be behind. And we just put everything in that basket. We decided that we had no choice. You know, it was like we were just going to fail if we didn't do it anyway. I think what's remarkable is you are an order of magnitude smarter on text, and in a lot of all these aspects, you're basically SOTA. I think that's really impressive. Can we maybe talk a bit about, as much as you can,
how you beat the text encoder. I mean, you teased that out a bit. You basically don't use CLIP, because traditional Stable Diffusion just uses the last layer, right? But you guys have done something completely new where you allowed a basically almost infinite context window, because Midjourney is only 256. And the prompt adherence, like you can actually talk to it like a designer. So tell us what you can about it. Yeah. As much as you can tell us. I think it's fair to ask the question. Share as much as you want. I think that to kind of get here, you know, there's some obvious things that you would do. The most obvious thing that you would do, you know, is not use CLIP. But the second most obvious thing is
kind of like using the tailwinds of what's already happening in language. You know, the language models already so deeply understand everything about text. And so there's some models that use this, you know, they use like T5-XXL, which is like another embedding, but it's a much richer embedding of language understanding. I kind of feel like language is just the first, it's like the first thing that happened.
There's a whole bunch of AI companies that are going to come about, whether they train their models or not, that are just going to benefit from everything that's going on in language and in open source language. And so, you know, I think our model is able to have such great prompt understanding in part because of the big boom in language and all of the stuff that you, whether it's Google or Meta or what have you, is doing. And so we're just...
We can be slightly behind in terms of language for our prompt understanding because the language stuff is already just so good. And it will just continue to get better and our models will also continue to get better. So that might be my like one small hint.
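A minimal sketch of that hint, conditioning on per-token features from a pretrained language-model encoder instead of CLIP's pooled 77-token embedding. It uses the small public t5-small checkpoint just to stay runnable (the conversation mentions T5-XXL as the kind of encoder some models use); the diffusion side is omitted and the prompt is invented:

```python
# Sketch: encode a prompt with a language-model text encoder and hand the
# per-token features to an image model via cross-attention (not shown).
import torch
from transformers import AutoTokenizer, T5EncoderModel

checkpoint = "t5-small"  # stand-in for a much larger encoder such as T5-XXL
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = T5EncoderModel.from_pretrained(checkpoint).eval()

prompt = (
    "A sticker design: a katana crossed over a two-fan GPU, a ribbon below "
    "that reads 'HOUSE TAN', bold black linework, tattoo-flash style."
)

tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    features = encoder(**tokens).last_hidden_state  # shape [1, seq_len, hidden_dim]

# `features` is a rich per-token representation of the whole prompt; a diffusion
# backbone that cross-attends to it can be steered by long, precise instructions
# (text content, placement, style) instead of a single truncated embedding.
print(features.shape)
```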
Maybe the analogy, playing with a lot of this and from chatting with you, is that the current state-of-the-art Stable Diffusion models' language understanding feels like, in NLP land, word2vec, right? Word2vec was this paper that came out from Google in 2013, and it didn't really understand text per se. It was more the latent space. The famous example was that you would take the...
vector of king, and then you would subtract the vector of man, and then add the vector of woman, and the output would be the vector of queen. Right. Which is very basic but still very cool, and I think that's where kind of the current Stable Diffusion models before yours are. But playing with your model, the leap, for the audience, is that you basically got GPT-level understanding. It's like going from word2vec to GPT, I don't know...
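For readers who haven't seen the word2vec example, a minimal sketch of that vector arithmetic, using gensim's downloadable GloVe vectors purely as an illustration; any pretrained word embedding would do:

```python
# Classic embedding arithmetic: directions in the vector space encode relations.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # small pretrained word embedding

# king - man + woman is closest to queen
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity score>)]

# The point of the analogy: this kind of "understanding" is latent geometry, not
# instruction following, which is roughly where CLIP-era prompt understanding sits
# compared with a language model that can read a full paragraph of directions.
```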
Yeah, I would say it's like GPT-3 level image model, like sort of prompt understanding now. Yeah. And I think there's much more leap. There's another leap to go. Many more, actually, I would say. And that's impressive. It's safe to say that this is the worst the model will ever be. For sure. Yeah, for sure. I mean, there's, you know, small things that we already want to fix. Like, you know, we wish that the model understood concepts like film grain.
I mean, it could still be better at spatial positioning. Even the model has...
issues with like the idea of left and right. Like, put the bear to the left. What is left? Is it, you know, your left? Is it the bear's left? So there's still lots of interesting problems that I think are really fun that we're probably going to have to figure out. But what we hear from users is that they feel a strong sense of control now, like it has really good prompt adherence. And actually there was this really funny thing, when, you know, I think
like a week or two ago that we realized about the model, which is when we started to do evals for aesthetics. And the way we do this is we just show like two, it's an A-B test. We show users two images, one from maybe a rival of ours and then another image from our model. And we're constantly doing evals and constantly asking our users what they think so that we can get better. And anyway, one of the things that we realized was that
There's this new thing that I don't think has been talked about, but I apologize to the audience if this has been talked about, which is that we have entanglement issues: if the model adheres too well to the prompt, it can have an effect on aesthetics.
So when we compare ourselves to, say, something like Midjourney, which we've actually evaled, it has great aesthetics, best in the world at that. One of the problems is that we will get dinged because our model is adhering more. So I'll give you an example. We have an image and it's an image of a woman, and it's kind of like a split pane, like she's on this side and on this side. So it's like a composite, and Midjourney doesn't respect that. It just shows the woman in one frame, and the users will always pick that because it's more aesthetically pleasing compositionally versus this split pane thing. But our model is adhering to that prompt. Right. And so the users ding us and we get a lower aesthetic score, because the other model is not listening. And so there's this entanglement problem. Like, what do you do? We had another image that was like hand painted palm trees or something.
And the users chose the other model because they were less hand-painted looking. And the hand-painted ones do look less aesthetic, but our model is adhering. So we have this entanglement problem and we don't know how to measure ourselves for aesthetics now.
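A minimal sketch of the pairwise eval being described, and one way the entanglement shows up in the numbers: score the same head-to-head votes once as a raw preference win rate and once restricted to matchups where both images actually followed the prompt. The field names and sample votes are hypothetical:

```python
# Sketch: pairwise A/B preference eval, with and without holding prompt
# adherence fixed, to separate "less aesthetic" from "more obedient".
from dataclasses import dataclass

@dataclass
class Vote:
    prompt: str
    winner: str            # "ours" or "rival"
    ours_adheres: bool     # did our image follow the prompt?
    rival_adheres: bool    # did the rival's image follow the prompt?

def win_rate(votes: list[Vote]) -> float:
    """Naive aesthetic win rate: ignores adherence entirely."""
    return sum(v.winner == "ours" for v in votes) / len(votes)

def adherence_adjusted_win_rate(votes: list[Vote]) -> float:
    """Only count matchups where both images followed the prompt, so a loss
    can't come from the rival quietly dropping an awkward instruction
    (e.g. ignoring a split-pane composition)."""
    fair = [v for v in votes if v.ours_adheres and v.rival_adheres]
    return sum(v.winner == "ours" for v in fair) / len(fair) if fair else float("nan")

votes = [
    Vote("split-pane portrait of a woman", "rival", ours_adheres=True, rival_adheres=False),
    Vote("hand-painted palm trees", "rival", ours_adheres=True, rival_adheres=False),
    Vote("sticker of a katana and a GPU", "ours", ours_adheres=True, rival_adheres=True),
]

print(win_rate(votes))                     # looks like a loss on aesthetics
print(adherence_adjusted_win_rate(votes))  # the picture changes once adherence is held fixed
```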
And I'm not aware of any, if anyone has any literature, please send it to me, but I'm not aware of any literature on this. And so we don't know what to do. I think what it sounds like to me is basically your model is so SOTA that the current evals don't work, because it's actually following the rules.
Yeah, we're trying to figure it out. We have to make a new eval, basically. You're too advanced. You broke the test. Yeah, we kind of broke the test. And so now it's a little weird externally. You know, it's like, obviously we want to portray to the world, hey, we have this great thing, and OK, we lose here, but not really. And so I think that's what you want.
Yeah, but it does what you want. And so I think we're going to try to, you know, we're going to talk about this in more detail, this kind of entanglement problem, because it's actually like a very interesting, more fundamental insight. Yeah, it sounds like you're just building a completely different kind of company. Like the thread that comes up hearing everyone here is,
using Playground feels like you're talking to a graphic designer, which then in my head actually buckets you into, just, the companies in YC that are really taking off are the ones that are replacing some form of labor. Which is just different to how people talk about Midjourney, right? That sounds just like a tool to play around with. This is actually just going to be like the replacement for hiring a graphic design team, potentially, which is like way more commercial and
Right, yeah, yeah. I mean, we've been searching for, like, where is the utility? How are people using things like Midjourney? And I think that for me, it's actually even simpler. It's just that I think we're enabling the person to have more control over the whole thing. Like,
I always feel bothered, you know... Like, I produce music. And so if I make a song, I have to go to a designer and say, can you make me album art? And then I only get like four variations of it. And then I feel badly asking for a fifth if I don't like any of the four. But the more you put the person that's actually making the thing in control, the more they'll be able to connect exactly the thing that they're looking for with, you know, the core product or concept,
song, or whatever they're making. So at YC we're always telling founders, hey, you should talk to your users more. And what you did was, you had so many users you can't just talk to them. You needed to look at how they were actually using it. Yeah. And at some point you realized, you know, somewhat uncomfortably, that they were generating near-porn. Near-porn, yeah. We get a lot of near-porn and porn. Um,
And then, you know, I think people sort of when they're exploring a space often run into that situation. Like what happens when, you know, the users that you're getting aren't the users you actually want? Yeah, we me and my CEO talked about this. We're like, if we listened to what the users wanted, we would have to build a porn company eventually. Yeah.
which is not something that I think my wife would be happy with, or my mother. It was kind of this tricky thing where you're like, listen to your users, talk to your users. And look, I'm not saying everybody does that with image models. For sure they don't.
But a lot of them do. And so we had to kind of go ask ourselves, well, then what can you do with these things? And the answer was like, not much else. Nothing big and commercial enough. We can make a cool website that people use. And the problem is all the image generator sites are plagued with this problem. And we all know it. We all know. And there are huge safety problems. And it turned out to be just like a business we didn't like.
And that's a hard thing after, you know, 12, 18 months of working on something. And you're just like, well, I don't really like this that much. And now what? And when we looked around for use cases, we were like, oh, all the use cases have text. All the big ones. Practically all of them. Logos, posters, t-shirts, bumper stickers, everything. Everything has text, because text is also a way to communicate with humans. That's why it became number one, like the number one priority.
I mean, this isn't the first time that you've sort of confronted this issue before. You know, in your prior startup Mixpanel, which you built into a company that, you know, makes hundreds of millions of dollars a year, one of the leaders in analytics from a really young age, you know, I think you started it when you were 19 and I remember because I met you when you first started it. That was another moment where here's this brand new technology
And there are sort of very commercial use cases that you could build a real business on. And then there were other use cases. In that case, I think it was sort of fly-by-night gaming operations that would come and sort of pop up on Facebook, steal a bunch of users, and then disappear. And you had to make some choices about who you wanted your users to be. Like, do you want it to be people who can actually...
pay you money for a real product over the long haul, or sort of, oh yeah, they're here and they're gone and we can make our graph go up. Like it's sort of a quandary that a lot of founders are facing. How did you approach that? Yeah. I mean, that one's like burnt into my memory, actually. So, you know, the simple story was just that we got all these gaming companies back in the gaming heyday, you know, Zynga and RockYou and Slide and all this stuff. And, um,
we were making so much money off of them, but then they would die because they had bad retention, or games just have like a decay factor. And you could tell that they were going to die because, oh, we knew it was happening. It's like all real-time data on it. And so, you know, one day I go to one of my mentors, Max Levchin, and I had interned for him at his other company.
And I was like, hey, you know, this thing is happening. And we have all these competitors that are building gaming analytics tools or products. And I don't really know how to compete. Feels a little weird to just go after gaming when it's like this weird thing that's churny. And he just looked at me and he was just kind of like,
He's like, what do you think is the biggest market? And I was like, probably not gaming, probably the rest of the internet. And mobile was just starting. And we didn't really know. The top free app in the App Store was a mirror on mobile. So it was kind of like, is mobile going to
Maybe it'll be there next year, I hope. But anyway, I said, you know, so the rest of the internet. And he said, well, and there he named our competitor that was focused on gaming, he was just like, if your competitor gets bought for $100 million tomorrow, don't cry about it then. Just go after the biggest market. And we did that, and then mobile went huge. It went so big, and we got rid of all the gaming stuff.
That was like 100% the right decision. So I think it's about being kind of ruthless, almost, about where the value is, with your own users, with what you're doing. Like all those things, I think, are very, very important. I mean, it sounds like you had to close the door and then,
God came and opened a window for you. Yeah. Yeah. I mean, I think, yeah, we kind of have a similar problem where, you know, the current user base that we have is not exactly, you know, an exciting thing we want to do as a team. And so then we're hunting for where the rest of the value is. Yeah.
That's a really important lesson. I mean, I guess that's like the super big lesson here is you can choose your users or your customers. You know, often your customers or users choose you. And if you don't want them, it is a choice that you can make. And sometimes it actually allows you to find the global maximum instead of just the local maximum.
Yeah, we're kind of facing the same decision. It's like a real-time decision almost. It's fun to talk about things when they work, when your decisions are right. So we'll see years later if this is right. But I think it's tough, because Midjourney is doing $200 million, $300 million. But the biggest market in graphic design is probably Canva, doing $2.3 billion. And so we're just kind of like, well, forget it,
let's go after the biggest, most valuable thing in the world. And, you know, not a lot of people know a lot about Canva, I find, in Silicon Valley. Most people know about Figma, but Canva makes vastly more money than Figma. So by enabling everyone, if you have this amazing, you know, AI graphic designer of sorts, enabling so much more of humanity, I mean, I think a lot of people believe this, and I also believe this: it feels like AI is expanding the pie of all these markets. They're not staying the same size, I think, most of the time, right? Like you're enabling more people to write code that otherwise couldn't, you know, that kind of thing. I guess the interesting thing about Playground is it was also a previous, more radical pivot you had, because you had gone through YC twice. Yeah. So you went through with Mixpanel, which became this successful company making hundreds of millions of dollars. Then you went through it with Mighty. Mighty.
Can you tell us about that second time going through YC, and then what was it, and then you pivoted into-- Yeah, so I did this browser company called Mighty, where our goal was to try to stream a browser. And the real goal was to try to make a new kind of computer. And we basically did it. But the problem was that we hit this wall where I didn't believe that it was going to be a new kind of computer anymore. I just couldn't make it more than two times faster. And I just felt like if I couldn't get like a 10x or 5x on this thing, or at least see that it could get a 10x, it just wasn't a company that I wanted to work on anymore. And I remember, I had invested before I came back, of course, to YC. And one of the big things
that really got me was that actually our MacBook Pros were really sucking at the time. Yeah, they were, yeah. There was no M1 at the time. Yeah, and we actually, I don't think we even knew that Apple was going to release Silicon yet. I mean, it's interesting. I think that in Silicon Valley, we maybe underestimate how valuable strategy actually is, mainly because strategy is so fun and so interesting and the MBAs who come into our sector immediately seize on that.
and just want, you know, you need a strategy person as like, you know, as a co-founder. And it's like, no, no, no, we don't actually need that. But that's not to say that strategy is not necessary in this particular case. Like, I think that
But we were trying to solve a real problem, which was our browsers really sucked. And the cloud was getting very, very good. And then suddenly, you know, the maze changed when Apple released Silicon. Well, they clearly thought so, too. So, you know, strategy was right in some sense, like the overarching problem of trying to make our computers faster. They were able to make a chip.
Yeah.
Is anyone even going to get close to the M1 or not? And so I think one problem is that wanting them to be behind is non-ideal for your company. Don't bet against macro is the problem. Yeah, you definitely don't want to bet against that. And then I think the second piece was I sat down with one of the engineers that works on V8, the browser engine behind Chrome. And then I gave him every imaginable idea me or the team had on figuring out how to speed up the browser engine.
And he had an answer to all of them. Once I realized that the team is basically focused on like 1% improvements and they had already tried everything. I mean, that was like a very depressing moment. I was like out of ideas. You know, the people say, when is it the right time to pivot or change or whatever? And I had just run out of ideas. But I really wanted to stick with it. But I just couldn't figure out another way to get it. We went so far as sort of building a computer in the data center.
And we had like figured out how to use like consumer CPUs in a data center legally with the right architecture. And like, I think PG came over once and there was just like the sprawl of all these components of maybe we're building physical, we're kind of building hardware. I learned major lessons at Mixpanel, but the one major lesson that I learned at Mighty was that it's so valuable to have a tailwind for your company as opposed to a headwind. There were just so many obstacles in our way, you know, whether it was the M1 or the
fact that there's no real way to change the fundamental architecture of the browser. You know, JavaScript just innately runs single-threaded in a tab. We can't change that. With Playground, it feels like it's all tailwinds all the time. You know, we just wait and things get better. Things get faster, cheaper, better, easier.
I think what's remarkable is you had this really impressive career building a standard SaaS business with Mixpanel. You gave it a try with the streamed browser, Mighty. And then you kind of retooled yourself and built this SOTA diffusion model.
What was that journey like? How did you retool yourself? That was one of those things that is so impressive. I just started learning. I don't know. I took whatever AI courses were out there that I could take. Unfortunately, the Karpathy courses didn't exist back then. But I think at first I was trying to actually build a better AI address bar in the browser, which now exists. Google just released that, I think. Yeah.
And this was before GPT-4. I think we were talking to OpenAI. They were very helpful because I think they didn't have... ChatGPT didn't exist yet. And we were trying to figure out how to get that integrated in the address bar at low latency. And so I was learning AI, how to do AI research and train models, before all of that happened. But I think something weird happened, which was in doing that, in getting kind of connected with the folks at OpenAI and learning these things,
I ended up just getting to see it happen. Like I knew it was about to happen earlier than other people. I got kind of lucky, I guess. A lot of people probably remember the DALL-E 2 moment. That was like a crazy moment where image generation, you know, really was exciting. And then, so I just kept learning. And then, you know, I think Stable Diffusion came out, and maybe I got access like two weeks
before it came out. And so it was just like, by being in the mix of this thing, I got to see everything about to start. And so I think we were the first AI image generation website that you could go to and sign up and you didn't have to run it manually on some GPU. And so I think our website just took off because of that. That was the easiest thing. I think Midjourney was still in Discord.
Right? So we're like, what if you make a website? I didn't actually know that story. I mean, that's a great lesson for any technical founder, right? Like, essentially, you stumbled on the biggest tailwind of our generation by just following technical things you found interesting. That's great. Yeah, it's a little weird. After Mixpanel, I actually...
tried to intern, I tried to do like an internship at companies, because, I don't know, I was wanting to do something but not ready to start a company. And I was only trying to talk to AI companies, and I interviewed at OpenAI, and they wanted me to come work five days a week, but I only wanted to do three. And then somehow at the end of that, I made this huge mistake in, I guess, 2018, where I had decided that there was nothing interesting going on in AI, because, even then, I was training my own models. I was trying to help a scooter company detect if the scooters were on the sidewalk or on the road, because the regulation in SF required them to do that.
And I learned all this stuff, went to all these events, and then, yeah, I concluded there was nothing. And I started Mighty.
And I was like just off by like three months. And so I kind of felt like I almost feel redeemed in some sense, you know. I don't know. It's so hard to like time these things. How do you know whether you're early or late? And then for a long time, you were behind on the model, right, for Playground. And I've just felt continuously behind. But I've now kind of come to realize after like learning the history of like Microsoft and microprocessors, like,
I don't know. It might just be like year two. This all still might be really, really early. We really don't know where it's going. How does it feel to run Playground, which is sort of
part startup, part research lab versus just pure startup? Well, one thing we try to do is we try to differentiate on not trying to go after AGI. That's one thing we try to say we're not doing because there's lots of people doing that. It feels really tractable, I guess, the research does, where it's not always clear whether research will be like that. I've kind of learned that you can't do research in a rush. So one big problem is that
When you're building a startup, you want to ship everything. You want to ship it today. You want to fix the bug. You want to ship the feature. You're just trying to move at such a fast pace, but that's not tenable with research in the same way. Research is moving fast.
But it's not like you can build and ship your model in a week. And so I think that's been really challenging. And I've had to kind of adjust my brain for one team versus the other. Yeah, one thing I think is interesting about successful research labs in the past, if you look at Bell Labs, for example, is it's almost like the main responsibility of the CEO of the lab is shielding the lab from the commercial interests
that are pushing for like things now. But as CEO of Playground, you're kind of both like protector of the researchers, but you're also the commercial interest. Like how do you juggle those competing forces? Yeah, I don't know that I've probably mastered it yet by any means, but I think I asked Sam Altman once like, you know, to what degree he allowed the researchers at OpenAI to like wander, I guess. So I just wasn't really sure, you know, usually it's like a task and you do it.
But what about wandering? How does wandering make sense in a research or engineering team? And he said, there's quite a bit of wandering. So I took that to heart. And so I let the research team kind of wander and get to a point where they are able to show an impressive result. And then we kind of start to really accelerate that.
But until then, there's not much to do. Well, not all who wander are lost. I love that. That should be a t-shirt. We will add that as a template. You can link it below in the video. I'll be a creator in the Playground marketplace. You were asking, how do these two teams integrate in a startup? And I think that
We just have this channel now where we just see so much feedback that now the researchers can actually look into the failure and they can decide for themselves while wandering, "Do I want to fix that? That's surprising. Why did that happen?" And so I want to try to integrate these two because I think that's a more differentiating factor these days. I think that the research labs are very lab-based.
And they're not always deeply looking into real user behavior, into what users are really trying to do. Sometimes it's just, we need to get a high score on this eval and we've got to put it in the paper. And then we've got to get a really good score on the LLM arena. And then there's like some KPI, you know, to do that. But then, you know, does that thing matter? Does it correlate? Does the eval that we see out in the world strongly correlate to
usefulness to users? Like I still want the LLMs to help me make rap lyrics, but there's no eval for that. So, you know, who will do that? How will that happen? It's certainly possible to do that. But if you notice, I always pick on this rap lyrics thing because,
to me, it belies a fundamental problem with how people are evaluating the models, because the models should be extremely good at it, but they're not. Maybe the problem with some of these is there's a gap between research and commercialization, because all these public evals are academic and a very different use case than if you wanted to go beat Canva, let's say. Yeah. I mean, I might be talking out of turn here, sorry to the LLM folks, but
if you go look at the evals for the language models, they're all like, you know, math, biology, legal questions. It's no wonder that the biggest use case of ChatGPT is homework. Because, you know, all the models basically hit these numbers, right? Initially, maybe they're different now. They're probably more sophisticated now. But it's no wonder that the models are good at homework. And that's a huge category. So you made it to SOTA. People are watching right now and they're just asking, like, how do I do it?
What's your answer to that? There's this feeling that all you need is a lot of data and a lot of compute, and then you just run, you train these models, and you'll get there. You know, they'll just generalize and suddenly everything will be great. I think there are a lot of smart software engineers, and so they fundamentally understand that these are the core components, the ingredients, to make this great model. But it's vastly more complex than this. And
at least what I've experienced is that to get to SOTA, you basically have to be maniacal about every detail of, you know, the model's capability. For example, you can look at text generation. There could be some people that train their models and they get cool text generation, but the kerning is off.
Are you the kind of person that will care about the kerning being off? Or are you the kind of person that is okay with it? Or you don't even notice it. Do you have this just maniacal sense? We look at skin texture.
Like my eyes feel burnt out basically from looking at the smallest little skin texture, you know, smoothness. And we talk about these things as a research team day in and day out. We argue about it. To build these SOTA models, you have to care so much about, in our world, image quality. And, you know, we even look at little small things, like if there's even a slight film grain and it's missing, we go, oh, the prompt understanding, the captioning model, is bad. Not good enough. We need to be better at this. And I think this maniacal mindset allows you, if you do this 100 times, the model extrapolates even more. I think people don't quite internalize the extrapolation of all of these dimensions together and how they work together to make everything better. Like you don't know how making one thing better here will impact another thing over there. It's hard to understand that. But I think that's what's required to get to a SOTA model. And it is possible. It is possible. It's not easy, though.
It's really hard, yeah. Well, Suhail, thanks a lot for coming on The Light Cone. That's all we have time for, but you can try Playground right now at playground.com, or in the App Store, Android, iOS. And this is actually the biggest flex: you didn't have any waitlists. It was just available on day one. So go try it out right now, and we'll see you guys next time.