People
Aaron Levie
Ben Milsom
Dan Shipper
David Song
Ethan Mollick
Jacob Pozol
Ross
swyx
Tanishq Matthew Abraham
Yohei Nakajima
Host
A podcast host and content creator focused on electric vehicles and energy.
Topics
Host: OpenAI has released GPT-4o, whose integrated image generation model represents a significant upgrade. It can handle complex outputs such as reflections and light physics, and it supports multi-turn generation and in-context learning. GPT-4o's image generation has been widely praised, and users are putting it to all kinds of uses, such as generating ads and comics. The technology differs from earlier diffusion models: OpenAI trained it with a human-guided reinforcement learning process and refined the model so that it follows instructions better, captures styles, and makes targeted modifications. This leap in image generation represents a paradigm shift. The model can create the output directly, giving the AI fine-grained control over the image, with far-reaching implications for creative work and the AI startup ecosystem.
Dan Shipper: GPT-4o's image generation model follows instructions well, captures style, and makes modifications reliably.
Tanishq Matthew Abraham: GPT-4o can one-shot high-quality images with text, such as a diagram explaining why San Francisco is foggy.
Jacob Pozol: GPT-4o can one-shot high-quality ads and understands brands and style.
Yohei Nakajima: GPT-4o can generate images from a reference image while preserving its details, such as an upside-down unicorn horn.
Grant Slatton, Byrne Hobart, Peter Yang: GPT-4o's image generation is remarkably powerful, turning photos into Studio Ghibli style and even changing character poses and composition.
swyx: Gemini's autoregressive image generation is a breakthrough that may mean the end of diffusion models.
David Holz: Disagrees with swyx's view.
Professor Ethan Mollick: Large language model image generation can now create the output directly, giving the AI fine-grained control over the image.
Ben Milsom: The GPT-4o image generation model can do work that previously required a full creative team.
Ross: The GPT-4o image generation model could have an impact on image-editing SaaS.
David Song: OpenAI has taken a major step toward a unified AI-generated front end.

Chapters
OpenAI's GPT-4o integrates image generation, improving image quality and enabling complex outputs. Users are amazed by the results, creating various images, including Studio Ghibli-style family portraits. The model's ability to follow animation style rules and integrate multiple reference images is highlighted.
  • Integration of an advanced image generator into GPT-4o
  • Significant improvement in image quality and detail
  • Ability to handle complex prompts with reflections and light physics
  • Multi-turn generation for iterative refinement
  • Improved instruction following and in-context learning
  • Wide range of creative and practical applications
  • Transformation of family photos into various animation styles

Shownotes Transcript

Today on the AI Daily Brief, huge new model releases from both OpenAI and Google that, combined, open up a huge array of exciting new use cases. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief. When I first started playing around with AI, the experience that hooked me wasn't actually ChatGPT.

My gateway experience was absolutely the image generators.

Around the same time that the first version of ChatGPT came out, Stable Diffusion and Midjourney were actually starting to get pretty good. And I would find myself spending hours creating these nostalgic images of Hemingway in Paris in the 1920s or a burger shack on the California surfside in the 1960s. Or I would start custom designing Magic cards themed around H.P. Lovecraft or who knows what it was. But I would spend hours doing this stuff just because it was so much fun.

Now, over time, of course, some of that wonder and mystery has evaporated. And today, the vast majority of my usage of these image generators is very practical. It's things like thumbnails for these shows. And so last night, when I found myself lost in an hours-long rabbit hole of creating images with OpenAI's new integrated image generation model, it became clear to me, and basically everyone else who was trying this tool, that we really were in the midst of a big upgrade. Yesterday, OpenAI announced 4o image generation.

The company writes, "At OpenAI, we've long believed image generation should be a primary capability of our language models. That's why we've built our most advanced image generator yet into GPT-4o. The result is image generation that is not only beautiful but useful." And right out of the gate, you can see that there was a whole different level of quality here.

One of the first examples that they give in the blog post, the prompt reads, a wide image taken with a phone of a glass whiteboard in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a t-shirt with a large OpenAI logo. Then it goes on to describe the handwriting. And importantly, you see, one of the big upgrades here is that the text comes out perfectly. But then they go even farther and add a new prompt, selfie view of the photographer as she turns around to high five him, which again, the model handles with aplomb.

And so this is where they get to their idea of useful image generation. The fidelity they were able to achieve here, prompting the model with 20 lines of text and getting complicated outputs like reflections and light physics right, is striking. The ability of the model to handle all of this opens up a huge number of use cases that were simply not available before.

And indeed, text rendering is right at the top of what they call their improved capabilities with this model. Among other upgrades, they focus on multi-turn generation. They write, "Because image generation is now native to GPT-4o, you can refine images through natural conversation. GPT-4o can build upon images and text in chat contexts, ensuring consistency throughout. For example, if you're designing a video game character, the character's appearance remains coherent across multiple iterations as you refine and experiment."

By way of example, they show a cat being given a detective hat and a monocle. OpenAI points out that instruction following is much better as well, and that in-context learning has improved dramatically. GPT-4o, they write, can analyze and learn from user-uploaded images, seamlessly integrating their details into its context to inform image generation.
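
To make that workflow concrete, here's a minimal sketch of what multi-turn refinement could look like programmatically. It assumes the native image model is exposed through OpenAI's existing Images API; the "gpt-image-1" identifier, file names, and response handling are assumptions for illustration, not details confirmed in the episode.

```python
# Minimal sketch of multi-turn image refinement via OpenAI's Images API.
# Assumption: the native image model is reachable through images.generate /
# images.edit, and "gpt-image-1" is a placeholder model identifier.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Turn 1: generate an initial character design.
first = client.images.generate(
    model="gpt-image-1",  # placeholder identifier (assumption)
    prompt="A cartoon cat detective, full body, flat colors, white background",
)
with open("cat_v1.png", "wb") as f:
    f.write(base64.b64decode(first.data[0].b64_json))

# Turn 2: refine the same character instead of starting from scratch,
# mirroring the conversational, multi-turn behavior described above.
second = client.images.edit(
    model="gpt-image-1",  # placeholder identifier (assumption)
    image=open("cat_v1.png", "rb"),
    prompt="Same cat detective, now wearing a deerstalker hat and a monocle",
)
with open("cat_v2.png", "wb") as f:
    f.write(base64.b64decode(second.data[0].b64_json))
```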

Now, by and large, everyone's response to this was to be blown away. Dan Shipper, the founder of Every, wrote,

It follows instructions well. You can ask it to modify small parts of an image, and it will reliably reproduce the image with the modifications you requested. And it's good at capturing style. If you give it a reference, it'll reliably help you get the vibe.

Showing off how different it is to have native image generation that can draw on GPT-4o's training data, rather than having to call out to a separate external model as was previously the case with DALL-E, former Stability AI Research Director Tanishq Matthew Abraham asked the model for an infographic explaining why San Francisco is foggy. He got a poster depicting the evaporation and rainfall system with three sentences of correct and coherent text. Abraham posted, Wait, GPT-4o can just one-shot stuff like this? That's impressive.

Another one of the most common examples you saw yesterday after the model release was effectively one-shot generated advertisements. Jacob Pozol generated a one-shot print ad for a beauty product with the prompt, create a Mad Men style print ad using this image. Not only was the style on point, but the text made contextual sense with zero information given beyond the text on the label. Pozol wrote, it's over. It is so over. In another example, the model recognized a Ridge wallet and knew that slim and stylish was appropriate ad copy.

Image decoding is also a native part of the system, so GPT-4o can apply its understanding to reference images. Yohei Nakajima generated a, quote, whimsical four-part comic based on a reference image of a chubby unicorn with an upside-down horn. The comic modified the unicorn to a different pose in each panel and had a context-appropriate storyline. Yohei had intentionally used a reference unicorn with an upside-down horn to test the model, commenting, an upside-down unicorn horn is not in training data. Why would it be? So image-to-image models end up making it right-side up.

This is the first time one retained the upside down horn without explicit instructions. The model can also integrate multiple reference images into a single completely new image, which has been a difficult task for most image models to date. Still, if you've seen just one type or trend of image from this, it is definitely turning your family images into Studio Ghibli style.

Developer Grant Slatton posted an image of him, his wife, and his dog on a beach that got 5.5 million views and basically had everyone else with access to this model, including myself, doing the same thing with their families. Author Byrne Hobart wrote, My kids now demand to be ghiblified.

Peter Yang, the head of product at Roblox, transformed a ton of his family photos into Ghibli portraits, commenting, "If this isn't magic, I don't know what is." One of the really interesting parts about the Ghiblification and similar style cues for The Simpsons, South Park, and Pixar is that the model is capable of following the rules of the animation style. The edits aren't just an overlay on a reference image; the model actually reposes and reframes the characters to fit the style properly.

And while turning your family into Studio Ghibli was fun, I highly recommend turning your father-in-law into an Uruk-hai for some really excellent good clean fun. Now let's talk about what's going on underneath the hood. As we discussed, the new native GPT-4o image generation model is different from DALL-E 3.

The technique behind most image generation models up to now has been something called diffusion. You might think of diffusion as essentially a generative denoising process. Model training involves learning what happens as noise is progressively added to an image, and then, to generate an image, that process is run in reverse. Starting from a randomly seeded noisy image, the model modifies pixels step by step until the result resembles the text prompt.
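
To make that idea more tangible, here's a toy sketch of the reverse loop in Python. The "denoiser" is a stub rather than a trained network, so this only illustrates the shape of the sampling process, not a working diffusion model.

```python
# Toy illustration of diffusion sampling: training teaches a model to predict
# the noise that was added to an image; generation runs the process in reverse,
# starting from pure noise. The denoiser below is a stub, not a real network.
import numpy as np

rng = np.random.default_rng(0)

def fake_denoiser(noisy_image: np.ndarray, step: int, prompt: str) -> np.ndarray:
    """Stand-in for a trained network that predicts the noise component."""
    # A real model would condition on the prompt and the timestep; here we just
    # return a small random estimate to show how the loop is wired up.
    return 0.1 * rng.standard_normal(noisy_image.shape)

def generate(prompt: str, steps: int = 50, size: int = 64) -> np.ndarray:
    image = rng.standard_normal((size, size, 3))  # randomly seeded noisy image
    for t in reversed(range(steps)):
        predicted_noise = fake_denoiser(image, t, prompt)
        image = image - predicted_noise  # remove a bit of predicted noise each step
    return image

sample = generate("a burger shack on the California coast, 1960s")
print(sample.shape)  # (64, 64, 3)
```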

Autoregressive models take a different approach, predicting image pixels or tokenized patches in a particular sequence, in a not dissimilar way to how language models generate text token by token and word by word. There has recently been a lot of discussion around how autoregressive image generation models were starting to show improvements over diffusion models and what the implications might be for text rendering, which is obviously a huge part of this new update. swyx of Latent Space and the AI Engineer Summit recently said,

Okay, now it's finally kosher to say Gemini autoregressive image generation is the breakthrough. He followed up after this new ChatGPT release and said, so diffusion may be dead? In-context learning for image models, in other words, enabling iterative generation rather than prompts or LoRAs, is the most serious threat to Photoshop I've seen in the entirety of the Gen AI wave.
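
For contrast with the diffusion sketch above, here's a matching toy sketch of the autoregressive approach swyx is describing, with the next-token model stubbed out. The codebook size and grid dimensions are illustrative assumptions, not details from OpenAI or Google.

```python
# Toy illustration of autoregressive image generation: the image is a sequence
# of discrete tokens (e.g., quantized patches), and the model predicts the next
# token given everything generated so far, just as an LLM predicts the next word.
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 1024   # size of the image-token codebook (illustrative)
GRID = 32           # a 32x32 grid of patch tokens (illustrative)

def fake_next_token_logits(prompt: str, tokens_so_far: list[int]) -> np.ndarray:
    """Stand-in for a transformer scoring every candidate next image token."""
    return rng.standard_normal(VOCAB_SIZE)

def sample_image_tokens(prompt: str) -> list[int]:
    tokens: list[int] = []
    for _ in range(GRID * GRID):
        logits = fake_next_token_logits(prompt, tokens)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens  # a separate decoder would turn these tokens back into pixels

tokens = sample_image_tokens("an infographic explaining why San Francisco is foggy")
print(len(tokens))  # 1024 patch tokens
```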

Now, OpenAI has not been super forthcoming about exactly what their approach to this new model was, but they did tell the Wall Street Journal that they employed an army of data labelers to carry out a human-guided reinforcement learning process. Over 100 people manually pointed out errors, typos, and demonic hands to help it learn how to improve its outputs. Gabriel Goh, the lead researcher on the project, said, "...the base model is already intelligent in its own way, and then the reinforcement learning from human feedback process brings out the intelligence and refines it."
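
As a very rough sketch of that kind of feedback loop, the snippet below treats human-flagged errors as a negative reward and keeps the least-flagged candidate. Real RLHF trains a reward model and applies a policy-gradient update to the generator; everything here is a stub meant only to show how human labels become a training signal.

```python
# Toy sketch of a human-guided feedback loop: labelers flag problems in sampled
# outputs, the flags become a reward signal, and the pipeline favors outputs
# that drew fewer flags. All functions are stubs for illustration.
import random

random.seed(0)

def sample_output(prompt: str) -> str:
    """Stand-in for the image model producing one candidate output."""
    return f"candidate image for '{prompt}' (variant {random.randint(0, 9999)})"

def human_flags(output: str) -> int:
    """Stand-in for labelers counting errors, typos, malformed hands, etc."""
    return random.randint(0, 5)

def feedback_round(prompt: str, n_samples: int = 8) -> str:
    candidates = [sample_output(prompt) for _ in range(n_samples)]
    rewards = [-human_flags(c) for c in candidates]  # fewer flags = higher reward
    best_reward, best_candidate = max(zip(rewards, candidates))
    # A real pipeline would fine-tune the model toward high-reward outputs here.
    return best_candidate

print(feedback_round("a glass whiteboard overlooking the Bay Bridge"))
```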

Now, after swyx declared diffusion dead, some basically said, hold up, wait a second, and suggested that OpenAI and Google's latest models may be combining elements of the two methods. Maybe more interestingly, swyx tagged Midjourney founder David Holz into the thread and said, I try not to deal in hyperbole, but will Midjourney go the same way? Now that we have this in Gemini and 4o, I don't see how I ever go back to anything else. Holz simply replied, nah.

One of the things that's exciting about this update, then, is that the change in architecture coinciding with the model improvement means many of these advancements are emergent, something we're likely to see more of in the future rather than random variance. Over the past year, we've seen a lot of models that had very incremental advances. And so I think part of what makes so many people excited about this is that this feels like a genuine paradigm shift, not just a nudge forward.

Professor Ethan Mollick also noted something interesting. He writes, "...the funny thing about multimodal image generation as released in the last week by Google and OpenAI is that now LLM image generation works like how most people using LLMs for the past two years always thought LLM image generation works." He continued, "...previously, LLMs sent a text prompt to a separate image creation model that produced the image. They could not control or see the final output. Multimodal image creation lets the LLM directly create the output, giving the AI fine-grained control over the image."

Now, we're going to get deeper into this probably later in the week or next week, but this is definitely one of those model advancement moments where people are instantly seeing the implications for their work and work more broadly. Ben Milsom writes, a lot of creatives and marketing folks are going to be feeling that "AI took our job" moment today. The first time you truly see AI take your job. OpenAI's 4o ImageGen dropped and it's doing things that used to take full creative teams. Expectations of image generation totally surpassed.

Up at 4 a.m. testing it on a brand I love. No camera, no studio, just prompts. What I got back is campaign-ready. This changes how creative work gets pitched, planned, and priced. This technology getting embedded in foundation models is also going to have an impact on the AI startup ecosystem. Programmer Ross writes, ChatGPT 4o ImageGen just killed background removal and image editing SaaS.

Last year, it was of course a meme that OpenAI was killing everyone's startups, related to them integrating all manner of developer tooling, and now OpenAI is moving on to the thousands of lightweight image editing tools as well. You gotta think that video tools can't be far behind either. The release also took the shine off of other image generation models that had just a few hours ago been getting people really excited. Indeed, just one day prior, Reve released their new image generation model and shot to the number one ranking on AI arenas.

They claimed their model was the best image generator in the world, and the outputs that people have been sharing are really impressive. One of the challenges now, though, is that perfectly one-shotting image generation isn't enough. Part of what makes OpenAI's breakthrough so valuable is that it integrates image generation and editing into a highly performant LLM.

Investor David Song took it a step further, suggesting that OpenAI has taken a big step toward a unified AI-generated front end. He then gave an example of how a screenshot for a simple daily planner UI could be generated on the fly, complete with custom data.

Summing up, Ethan Mollick writes, multimodal image generation is going to actually impact a lot of economically and culturally meaningful work in ways I don't think we understand yet. It's very flexible, relevant to many uses, and got good all at once. Still flaws, but the gains in capabilities seem rather rapid. Today's episode is brought to you by Vanta. Trust isn't just earned, it's demanded.

Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001, centralize security workflows, complete questionnaires up to 5x faster, and proactively manage vendor risk.

Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time.

For a limited time, this audience gets $1,000 off Vanta at vanta.com slash nlw. That's v-a-n-t-a dot com slash nlw for $1,000 off. Hey listeners, are you tasked with the safe deployment and use of trustworthy AI? KPMG has a first-of-its-kind AI Risk and Controls Guide, which provides a structured approach for organizations to begin identifying AI risks and design controls to mitigate threats.

What makes KPMG's AI Risks and Controls Guide different is that it outlines practical control considerations to help businesses manage risks and accelerate value. To learn more, go to www.kpmg.us slash AI Guide. That's www.kpmg.us slash AI Guide.

Today's episode is brought to you by Super Intelligent and more specifically, Super's Agent Readiness Audits. If you've been listening for a while, you have probably heard me talk about this, but basically the idea of the Agent Readiness Audit is that this is a system that we've created to help you benchmark and map opportunities in your organizations where agents could succeed.

Specifically, it helps you solve your problems and create new opportunities in a way that, again, is completely customized to you. When you do one of these audits, what you're going to do is a voice-based agent interview, where we work with some number of your leadership and employees to map what's going on inside the organization and to figure out where you are in your agent journey.

That's going to produce an agent readiness score that comes with a deep set of explanations, strengths, weaknesses, key findings, and of course, a set of very specific recommendations that we then have the ability to help you go find the right partners to actually fulfill. So if you are looking for a way to jumpstart your agent strategy, send us an email at agent at besuper.ai, and let's get you plugged into the agentic era.

Now, on any other day, we could end here. This would be a full show, no headlines needed. But as has happened so often over the last couple of years, on the same day of one big announcement, we got another big announcement as well.

Interestingly enough, Google, who released Gemini 2.5 yesterday, which we'll talk about in just a minute, had also released a very similar image model just a couple of weeks ago with their image gen upgrade to Gemini. Google had many of the same features dialed in, like perfect text and solid reference stability, but one of the big differences was distribution. Google released their version of the feature sort of buried in AI Studio, whereas OpenAI is putting their image gen front and center for every paying customer across all tiers on day one.

Now, from Google's perspective, they are very clearly in a mode of trying to consolidate everything together. But seeing how much more people responded to this OpenAI announcement as opposed to the Google announcement, one has to wonder if the strategy was a little bit off. And at the same time, this was not Google's only announcement.

In their announcement post for Gemini 2.5, they called this their most intelligent model yet. They wrote, For a long time, we've explored ways of making AI smarter and more capable of reasoning through techniques like reinforcement learning and chain of thought prompting. Now, with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training.

Going forward, we're building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable context-aware agents. Indeed, it feels like this is a reasoning model that was specifically trained with agent performance in mind.

Benchmarks are up across the board and compare favorably to OpenAI's O3 Mini in almost every category. Coding benchmarks are slightly lower than O3 Mini and Claude 3.7 Sonnet, but not by much, so many will find it worth testing. In fact, programmer Matthew Berman collected a series of one-shot demos and commented, Gemini 2.5 Pro is insane at coding. It's far better than anything else I've tested.

Another place, though, that Google really shines is their ultra-long context window. Like their previous models, Gemini 2.5 has a million-token context window and will soon expand to two million. Claude 3.7 Sonnet only has a 200,000-token context window. Developer Matzen Field wrote, "Gemini 2.5 is mind-blowing in coding. Just uploaded to it an entire pretty large codebase, wrote a careful prompt for an issue that Cursor plus Claude 3.7 didn't solve, and it got the complete right fix in one shot. Can't wait to have it integrated in Cursor and Windsurf."
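
For anyone who wants to try that whole-codebase workflow, here is a minimal sketch using Google's generative AI Python SDK. The model identifier, the project path, and the bug description are assumptions for illustration; check the current Gemini docs for the exact 2.5 model name and limits.

```python
# Minimal sketch of the "drop an entire codebase into the context window"
# workflow. The model name below is an assumed placeholder.
import os
import pathlib
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")  # assumed identifier

# Concatenate every source file into one prompt; a million-token window makes
# this feasible for reasonably large projects.
repo = pathlib.Path("my_project")  # hypothetical project directory
code_blob = "\n\n".join(
    f"# FILE: {path}\n{path.read_text(errors='ignore')}"
    for path in sorted(repo.rglob("*.py"))
)

prompt = (
    "Here is an entire codebase. Find the cause of the bug where user sessions "
    "expire immediately after login, and propose a complete fix.\n\n" + code_blob
)

print(model.count_tokens(prompt))        # sanity-check against the context limit
response = model.generate_content(prompt)
print(response.text)
```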

Long context is also one of the keys to keeping an agent on task during long sessions. And the benchmarks suggest that Gemini 2.5 might be better for long context tasks. They scored 83.1% on the MRCR benchmark for tasks that fill the million token context window. By comparison, O3 Mini scored 36.3% for tasks that fill its 200,000 token window.

Eric Provencher, a research scientist at Unity, looked at a different long-context benchmark about coherent fiction writing and commented, Seems like Gemini 2.5 is in a league of its own. Another big jump is on ultra-difficult knowledge tasks. The model is the new state-of-the-art in the obscure knowledge and reasoning benchmark, Humanity's Last Exam. It scored 18.8%, beating out O3 Mini at 14% and DeepSeek R1 at 8.6%.

Alex Volkov, host of the Thursday AI podcast, wrote, Wow, wow, Gemini 2.5 is absolutely crushing it on the thinking benchmark with tough questions. Not only is this model getting the highest score on these I've tested so far, but also look at that incredible latency difference. The model is so much faster than DeepSeek R1 and O3 Mini.

Another important factor when it comes to driving agents is Gemini's natively multimodal architecture. The model can take inputs in text, audio, image, and video formats without running them through a decoder. This means the reasoning works natively on all types of inputs rather than only seeing them flattened down into a text summary.
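
To illustrate what native multimodality means in practice, here's a brief sketch of sending an image and text together to the model, with no separate captioning step. It reuses the same SDK and assumed model identifier as the earlier sketch; the file name is hypothetical.

```python
# Minimal sketch of native multimodal input: the model sees the image itself
# alongside the text, rather than a flattened text summary of the image.
import os
import PIL.Image
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")  # assumed identifier

frame = PIL.Image.open("dashboard_screenshot.png")  # hypothetical input image
response = model.generate_content(
    [frame, "What looks broken in this dashboard, and which metric should I check first?"]
)
print(response.text)
```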

DeepMind CEO Demis Hassabis highlighted this point in his announcement post, and Google engineer Noam Shazeer extended the point even further, posting, The 2.5 series marks a significant evolution. Gemini models are now fundamentally thinking models. This means the model reasons before responding to maximize accuracy. And it's our best Gemini model yet.

Nathan Lambert recognized that we haven't even seen how the model performs in state-of-the-art use cases yet, noting, Gemini 2.5's reasoning traces include simulated Google searches. Feels like the model is designed for things like deep research they just haven't rolled out yet. Box's Aaron Levie summed it up, Google's Gemini 2.5 update is not screwing around. Incredible how quickly we're seeing new levels of capability be achieved by AI. Nothing seems to be slowing down.

And so what's the story here? Is this a story about OpenAI versus Google? Is it a story about OpenAI and foundation models eating up narrow startups? Maybe in part, but I think those are much smaller stories next to the bigger one, which is of course model improvement and more evidence than we've had in some time that there is no wall. These models open up fundamentally different and more expansive use cases, and we are just scratching the surface of what they can do.

This is the most excited I've been for a very long time to go out and play with some new foundation models. And so I'm going to cut it off here and get back to it. Let me know how you are using Gemini 2.5 and OpenAI's new image gen. And we'll come back and talk about use cases a little later in the week or next week. For now, appreciate you listening or watching as always. And until next time, peace.