We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

#210 - Claude 4, Google I/O 2025, OpenAI+io, Gemini Diffusion

2025/5/26

Last Week in AI

Andrey Kurenkov: 我认为Google在IO 2025上展示了其在AI工具和应用方面的强大实力，包括AI搜索、Project Mariner、Veo视频生成、Imagen图像生成等。这些工具的发布和升级，标志着Google在AI领域的全面进攻，旨在保持其在搜索和技术领域的领先地位。特别是AI搜索的深度整合，以及Project Mariner的Agent能力，都显示了Google对未来AI应用的深刻理解和布局。 Jeremie Harris: 我认为Google的AI战略是防御性的，旨在防止OpenAI等竞争对手侵蚀其核心搜索业务。Google在AI领域的巨大投入和技术积累，使其能够迅速推出各种AI工具和应用。Veo视频生成和Imagen图像生成等工具的发布，展示了Google在多模态AI方面的强大能力。Project Mariner的Agent能力，以及AI搜索的深度整合，都显示了Google对未来AI应用的深刻理解和布局。Google需要确保其AI产品在安全性和可靠性方面达到最高标准，以避免潜在的负面影响。

Deep Dive

Chapters

Anthropic released Claude Opus 4 and Sonnet 4, showcasing improvements in coding, long workflows, and reduced shortcut behaviors compared to previous versions. The models excel at managing memory files and integrating with development environments.

Claude Opus 4 and Sonnet 4 released
Significant improvements in coding and long workflows
Reduced shortcut and loophole behaviors
Improved memory performance and file management
Tighter integration with development environments via SDK

Shownotes Transcript

Translations:

中文

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual in this episode, we will be summarizing and discussing some of last week's most interesting AI news. You can go to the episode description for all the links to the news we're discussing and the timestamps so you can jump to the discussion if you want to. I am one of your regular hosts, Andrey Karenkov. I studied AI in grad school and I now work with it at a generative AI startup.

And you can hear me typing here just because I'm making final notes on what is an insane week. If we do our jobs right today, this will be a banger of an episode. If we do our jobs wrong, it'll seem like any other week. This has been insane. I'm Jeremy, by the way. You guys all know that if you listen. Gladstone AI, National Security, all that jazz. This is pretty nuts. We were talking about this, I think, last week where

We're catching up on two weeks worth of news. And we were talking about how every time we miss a week and it's two weeks, inevitably, it's like the worst two weeks to miss. And the AI universe was merciful that time. It was not merciful this time. This was an insane, again, banger of a week. Really excited to get into it. But man, is there a lot to cover. Exactly. Yeah. There hasn't been a week like this probably in a few months.

There was a similar week, I think around February, where a whole bunch of releases and announcements were bunched up from multiple companies. And that's what we're seeing in this one. So just to give you a preview, the main bit that's exciting and very full is announcements concerning tools and consumer products.

So Google had their IO 2025 presentation, and that's where most of the news has come out. They really just went on the attack, you could say, with a ton of stuff either coming out of beta and experimentation being announced, being demonstrated, et cetera. And we'll be getting into all of that. And then afterward, Anthropic went and announced Cloud 4. And-

Some additional things in addition to Cloud4, which was also a big deal. So those two together made for a really, really eventful week. So it'll be a lot of what we were discussing. And then in applications as business, we'll have some stories related to OpenAI. We'll have some interesting research and some policy and safety updates about

safety related to these new models and other recent releases. But yeah, the exciting stuff is definitely going to be first up and we're just going to get into it. First in Tools and Apps is Cloud 4. Maybe because of my own bias about what's being exciting.

So this is the Claude Opus 4 and Claude Sonnet 4. This is the large and medium scale variant of Claude from Anthropic. Previously, we had Claude 3.7, I think for a few months. Claude 3.7 has been around, but not super lengthy. And this is pretty much an equivalent update. They'll be costing the same as the 3.7 variants. And Claude

The pitch here is that they are better at coding in particular and better at long workflows. So they are able to maintain focused effort across many steps in a workflow. This is also coming paired with

updates to cloud code. So it's now more tightly integrated with development environments, coming with an SDK now, so you don't have to use it as a command line tool, you can use it programmatically. And related to that as well, both of these models, Opus and Sonnet, are hybrid models, same as 3.7. So you can adjust the reasoning budget for the models. So

I guess qualitatively, not anything new compared to what Anthropic has been doing, but really doubling down on the agentic direction, the kind of demonstration that people seem to be optimizing these models for the task of like give a model some work and let it go off and do it and come back after a little while to see what it built with things like cloud code.

Yeah, the two models that are released, by the way, are Claude Opus 4 and Claude Sonnet 4. Note the slight change yet again in naming convention. So it's no longer Claude 4 Sonnet or Claude 4 Opus. It's now Claude Sonnet 4, Claude Opus 4, which I personally like better, but hey, you do you.

A lot of really interesting results. Let's start with Sweebench, right? So this Sweebench verified the sort of like software engineering pseudo real world tasks benchmark that really OpenAI polished up, but that was anyway developed a little while ago in the first place. So OpenAI's Codex One, which I'm old enough to remember a few days ago when that was a big deal, was just for context hitting about 72%, 72.1% success.

on this benchmark. That was like really quite high. In fact, for all of 20 seconds, it was soda. Like this was a big deal, like on whatever it was Tuesday when it dropped. And now we're on Friday, no longer a big deal because

Sonnet 4 hits 80.2%. Now, going from 72% to 80%, that is a big, big jump, right? You think about how much... There's not that much more left to go. You've only got 30 more percentage points on the table, and they're taking eight of them right there with that one advance. Interestingly...

Opus 4 scores 79.4%. So sort of like comparable performance to Sonnet on that one. And we don't have much information on the kind of Opus 4 to Sonnet 4 relationship and how exactly that distillation happened, if there was sort of extra training. Anyway, so that's kind of another thing that we'll probably be learning a little bit more about in the future. And this is the numbers V, like upper range with a lot of compute data.

that, you know, similar to O3, for instance, from OpenAI, when you let these models go off and work for a while and not

kind of a more limited variant. Actually, that's a really good flag, right? So there's a range with inference time compute, with test time compute models where, yeah, you have the kind of like the lower inference time compute budget score, which in this case is around 70 to 73%, and then the high inference time compute budget, which is around 80% for both these models. Again, contrasting with Codex One, which is sitting at 72.1%,

in this figure, and they don't actually indicate whether that's low compute mode or high compute mode, which is itself a bit ambiguous. But in any case, it's a big, big leap. And this bears out in the qualitative evaluations that a lot of the folks who had early access have been sharing on X. So make of that what you will.

All kinds of really interesting things. So they've figured out how to apparently significantly reduce behavior associated with getting the models to use shortcuts or loopholes. Big, big challenge with CodexOne.

A lot of people have been complaining about this. It's like, it's too clever by half, right? The O3 models have this problem too. They'll sometimes like find janky solutions that are a bit dangerously creative where you're like, no, I didn't mean to have you solve the problem like that. Like that's a, that's kind of, you're sort of cheating here. And, and other things where they'll tell you they completed a task, but they actually haven't. That's kind of a thing I found a little frustrating with, especially with O3. But so this model has significantly lower instances of that.

Both models, meaning Opus 4 and Sonnet 4, are, they say, 65% less likely to engage in this behavior than Sonnet 3.7 on agentic tasks that are particularly susceptible to shortcuts and to loopholes.

So this is pretty cool. Another big dimension is the memory performance. So when developers build applications that give Claude local file access, Opus 4 is really good at creating and maintaining memory files to store information. So this is sort of a

partial solution to the problem of persistent LLM memory, right? Like you can only put so much in the context window. These models are really good at building, like creating memory files, explicit memory files. So not just storing context, but then retrieving them. So they're just really good

at a kind of implicit rag, I guess you could call it. It's not actual rag. It's just they're that good at recall. There are a bunch of features that come with this. As with any big release, it's like this whole smorgasbord of different things, and you got to pick and choose what you highlight. We will get into this, but some of the most interesting stuff here is in the Cloud 4 system card. And I think, Andre, correct me if I'm wrong, do we have a section to talk about the system card specifically later, or is this it?

I think we can get back to it probably in the advancement section, just because there's so much to talk about with Google. So we'll do a bit more of a deep dive later on to get into technicals. But at a high level, I think, you know, as a user of Google,

chat GPT, LLMs, et cetera, this is a pretty major step forward. And in particular on things like cloud code, on kind of the ability to let these LLMs just go off and complete stuff for you. And so moving on for now from Anthropic, next up, we are going to be talking about all that news from Google that came from IO 2025 and

Bunch of stuff to get through, so we're going to try and get through it pretty quick. First up is the AI mode in Google search. So starting soon, I guess you have a tab in Google search where you have AI mode, which is essentially like chat GPT of search.

Google has had AI overviews for a while now, where if you do, I think at least for some searches, you're going to get this LLM summary of various sources with an answer to your query. AI mode is that, but more in depth. It goes deeper on various sources and you can do follow-up questions to it. So very much,

Now, along the lines of what Perplexity has been offering, what ChatGPT Search has been offering, etc.,

And that is, yeah, I guess really on par. And Google has demonstrated various kind of bits and pieces there where you can do shopping with it. It has charts and graphs. It can do deep search that is able to do looking over hundreds of sources, et cetera.

Yeah, that kind of tight integration is, I mean, Google kind of has to do it. One of the issues, obviously, with Google is when you're making hundreds of billions of dollars a year from the search market and you have like 90% of it, it's all downside, right? Like the thing you worry about is what if one day OpenAI, like ChatGPT just tips over some threshold and it becomes the default choice over search?

not even a huge default choice, just the default choice for 5% more of users. The moment that happens, like Google's market cap actually would drop by more than 5% because that suggests an erosion in the fundamentals of their business, right? So this is a really big five alarm fire for Google. And it's the reason why they're trying to get more aggressive with the inclusion of generative AI in their search function, which is overdue. I

I think there are a lot of people who are thinking, you know, why did this take so long? I think one thing to keep in mind too is with that kind of massive market share in such a big market comes enormous risk. So yes, it's all fine and dandy for OpenAI to launch ChatGPT and to have it tell people to commit suicide or help people bury dead bodies every once in a while. People kind of forgive it because it's this upstart, right? At least they did back in 2022. Whereas with Google, if Google is the one doing that,

now you have congressional and Senate subpoenas, like people want you to come and testify, they're going to grill you, Josh Hawley is going to lay into you hard as he ought to.

But that's kind of the problem, right? You're reaching a fundamentally bigger audience. That's since equilibrated. So OpenAI is kind of benefiting still from their brand of being kind of swing for the fences. So in some ways, the expectations are a bit lower, which is unfair at this point. But Google definitely has inherited that legacy of a big company with a lot of users. So yes, the rollouts will be slower for completely legitimate sort of market users.

market reasons. So anyway, I think this is just like really interesting. We'll see if this actually takes off. We'll see what impact that has too on chat GPT. I will say the Google product suite is this sort of unheralded, relatively speaking, unheralded suite of very good generative AI products. I use Gemini all the time.

People don't tend to talk about it much. I find that really interesting. I think it's a bit of a failure of marketing on Google's end, which is weird because their platform is so huge. So maybe this is a way for them to kind of solve that problem a little bit.

Well, we'll touch on possibly the usage being higher than some people. I think there might be a Silicon Valley bubble situation going on here where you're not Silicon Valley, but you're like, you know, spiritually in Silicon Valley in terms of a bubble. Moving right along. Next announcement also from Yo was talking about Project Mariner.

So this was an experimental project from DeepMind. This is the equivalent to opening as operator, Amazon's Nova on top of computer use. It's an agent that can go off and use the internet and do stuff for you. It can go to a website, look for tickets to an event, order the tickets, etc., etc.,

So Google has improved this with the testing and early feedback and is now going to start opening up to more people. And the access will be gated by this new feature

AI Ultra plan, which is $250 per month, which was introduced also in the slate of announcements. So this $250 per month plan is the one that will give you all the advanced stuff, all the models, the most compute, et cetera, et cetera. And you'll have Project Mariner as well. And with this update,

you are going to be able to give Project Mariner up to 10 tasks and it will just go off and do it for you in the background. Somewhat confusingly, also Google had a demo of agent mode, which will be in the Gemini app. And it seems like agent mode might just be an interface feature

Two Mariner in the Gemini app? Maybe, I'm not totally sure, but apparently ultra subscribers will have access to agent mode soon as well.

Yeah, and it's so challenging to kind of, I find, to highlight the things that are fundamentally different about a new release like this, just because so often we find ourselves saying like, oh, it's the same as before, except smarter. And that's kind of just true. And that is transformative in and of itself. In this instance, there is one sort of thing you alluded to it here, but just explicitly say it.

Previous versions of Project Mariner were constrained to doing like one task at a time because they would actually run on your browser. And in this case, the big difference is that because they're running this in parallel on the cloud, yeah, you can reach that kind of 10 or a dozen tasks being run simultaneously. So this is very much a difference in kind, right? This is like many workers in parallel chewing on your stuff simultaneously.

That's a change to the way people, you're more of an orchestrator, right, in that universe than a sort of a leader of one particular AI. It's quite interesting. And moving right along, the next thing we'll cover is VO3, which I think from just kind of the wow factor of the announcements is the highest one. I think in terms of impact, probably not the highest one, but...

in terms of just, wow, AI is still somehow blowing our minds. VO3 was the highlight of Google I/O. And that is because not only is it now producing, you know, just mind-blowingly coherent and realistic videos

compared to even a year ago. But it is producing both video and audio together, and it is doing a pretty good job. So there's been many demonstrations of the sorts of things Vio can do

The ones that kind of impressed me and I think a lot of people is you can make videos that sort of mimic interviews or, you know, typical YouTube style content where you go to a conference, for instance, and talk to people and you have people talking to the camera with audio and it just seems pretty real. And it's, yeah, you know, different in kind from video generation we've seen before and

And coming also with a new tool from Google called Flow to kind of be able to edit together multiple videos as well. So again, yeah, very impressive from Google. And this is also under their AI Ultra plan. Yeah, it's funny because they also include a set of benchmarks in their launch website, which by the way, they're sort of hidden, right? You actually have to click through a thing to see any...

And I always find it interesting to look at these when you've got that wow moment. I don't mean to call it quite a chat GPT moment for text-to-video because we don't yet know what the adoption is going to look like. But certainly from an impact standpoint, it is a wow moment. When you look at how that translates, though, relative to VO2, which, again, relatively unheralded, like not a lot of people talked about it. They did at the time, but it hasn't really stuck yet.

So 66% win rate. So two thirds of the time it will beat VO two on a sort of movie gen bench, which is a benchmark that meta released. It's just a basically about preferences regarding videos and,

So it wins about two thirds of the time. It loses a quarter of the time and then ties 10% of the time. So it's a pretty like, it looks like a fairly dominant performance, but not like a knockout of the sort that you might, it's difficult to go from these numbers to like, oh, wow, like this is the impact of it.

But it certainly is there. Like when you look at these, it's pretty remarkably good. And this speaks to the consistency as well of those generations. It's not that they can cherry pick necessarily just a few good videos. It does pretty consistently beat out previous versions. Right. And they also actually updated VO2. So just a demonstration of how crazy this was in terms of announcements. VO2 now can take...

kind of reference photos. We've seen this with some other updates. So you can give it an image of, you know, a t-shirt or a car and it will incorporate that into a video. And all of this is folded into this flow video creation tool. So that has camera controls. It has scene builder where you can edit and extend existing shots. It has this asset management things where you can organize ingredients and prompts.

And they also released this thing called Flow TV, which is a way to browse people's creations with VO3. So tons of stuff. Now Google is competing more of a runway and kind of, I guess what OpenAI started doing with Sora when they did release Sora fully that had some built-in editing capabilities. Now VO isn't just text-to-video. They have more of a full-featured

tool to make text-to-video useful. Yeah. And the inclusion of audio too, I think is actually pretty important. It's this other modality. It helps to ground the model more. And I suspect that because of the causal relationship between video and audio, that's actually quite a meaningful thing. This is interesting from that whole positive transfer standpoint.

Do you get to a point where the models are big enough, they're consuming enough data that what they learn from one modality leads them to perform better when another modality is added, even though the complexity of the problem space increases?

And I suspect that will and probably is already happening, which means we're heading to a world by default with more multimodal video generation. That wouldn't be too surprising at least. And next up, you know, Google, I guess, didn't just want to do text-to-video. So they also did text-to-image with Imagine 4.0.

This is the latest iteration of their flagship text-to-image model. As we've seen with text-to-image, it is even more realistic and good at following prompts and good at text. They're highlighting really tiny things like the ability to do detailed text.

fabrics and fur on animals. And also they apparently paid attention to generation of text and typography.

that this can be useful for slides and invitations and other things. So rolling out as well for their tool suite. And last thing to mention about this, they also say this will be faster than Imagine Free. The plan is to make it apparently up to 10 times faster than Imagine Free. Yeah.

Yeah, and it's unclear because we're talking about a product rather than a model per se. It's unclear whether that's because there's a compute cluster that's going to come online that's going to allow them to just crunch through the images faster or that there's an actual algorithmic advance that makes it, say, 10 times more compute efficient or whatever. So always hard to know with these things. It's probably some mix of both. But interesting that, yeah, I mean, I'm at the point where it's like flying on instruments. I feel like I can't tell the difference between these different image generation models that

admittedly, these photos look super impressive. Don't get me wrong, but I just, I can't tell the incremental difference. And, and so I just ended up looking at like, yeah, how much per token, like, or how much per image. So, you know, the, the price and the latency are both collapsing pretty quickly.

And moving right along, we just got a couple more things. We won't even be covering all of the announcements from Google. This is just a selection that I thought made sense to highlight. Next one is Google Meet is getting real-time speech translation. So Google Meet is the video meeting offering from Google, similar to Zoom or other ones like that. And yeah, pretty much now you'll be able to have...

almost real-time translation. So it's similar to having a real-time translator for like a press conference or something. When you start speaking, it will start...

translating to the paired language within a few seconds, kind of following on you. And they're starting to roll this out to consumer AI subscribers, initially only supporting English and Spanish. And they're saying they'll be adding Italian, German, and Portuguese in the coming weeks. So something I've sort of been waiting on. Honestly, I've been thinking we should have real-time AI power translation.

That is very kind of sophisticated and powerful. And now it's starting to get rolled up. Yeah, I personally thought people who spoke languages other than English were just saying complete gibberish up until now. So this is a real shock that, yeah. No, but it's kind of funny, right? This is another one of those things where you hit a point of where latency crosses that critical threshold and that becomes the magic unlock.

Like a model that takes even 10 seconds to produce a translation is basically useless because it's at least this really awkward conversation, at least for the purpose of Google Meet. So another case where it did take Google a little while, as you pointed out, but the risk is so high. If you mistranslate stuff and start an argument or, you know, whatever, that's a real thing. And they're deploying it again across so many, so many video chats because of their reach that that's, you know, going to have to be part of the corporate calculus here.

Right. And this is a thing we are not going to be going into detail on, but Google did unveil a demo of their smart glasses. And that's notable, I think, because Meta has their smart glasses and they have real-time translation. So if you go to a foreign country, right, you can kind of have your in-ear translator. And I wouldn't be surprised if that is a plan as well for this stuff.

But last thing to mention for Google, not one of the highlights, but something I think notable as we'll see compared to other things is

Google also announced a new Jules AI agent that is meant to automatically fix coding errors for developers. So this is something you can use on GitHub and it very much is like GitHub Copilot. You will be able to task it with working with you on your code repository. Apparently it's going to be coming out soon. So this is just announced and

Yeah, it will kind of make plans, modify files, and prepare pull requests for you to review in your coding projects. Yeah, and like literally every single product announcement like this, they have Google saying that Jules' in early development end quotes may make mistakes, which anyway, I think we'll be saying that until we hit super intelligence just because, you know, the hallucinations are such a persistent thing, but there you have it.

Right. And next story actually is directly related to that. It's that GitHub has announced the new AI coding agent. So GitHub Copilot has been around for a while. You could task it with reviewing your code on a pull request, on a request to modify a code base.

Google also had the ability to integrate Gemini for viewing code. So Microsoft very much kind of competing directly with Jules and Codex as well with an offering of an agent that you can task to go off and edit code and prepare a pull request. So...

Just part of an interesting trend of all the companies very rapidly pushing in direction of coding agents and agents more broadly than they had previously. Yeah, this is also notable because Microsoft and OpenAI obviously are in competition in this frenemy thing. Copilot was the first, apart from OpenAI's Codex,

was the first sort of large-scale deployment, at least, of a coding autocomplete back in like, I want to say 2020, 2021 even, just after GBD3.

And yeah, so they're continuing that tradition in this case, kind of being fast followers too, which is interesting. Like they're not quite first at the game anymore, which is something to note because that's a big change. One small thing worth noting. Also, they did announce open sourcing of GitHub Copilot for VS Code. So this is like a nerdy detail, but yeah.

You have also competition from Cursor and these other kind of alternative development environments with the company behind Cursor now being valued at billions and billions of dollars. And that is a direct competitor to Microsoft's Visual Studio Code with GitHub Copilot. So them open sourcing with GitHub Copilot extension to Visual Studio Code is amazing.

Kind of an interesting move. And I think they are trying to compete against these startups that are starting to dominate in that space. And just one more thing to throw in here. I figure we're flagging because of its relation to this trend. Mistral, the...

A French company that is trying to compete with OpenAI and Anthropic has announced DevStral, a new AI model focusing on coding. And this is being released under an Apache 2 license and is competing with things like Gemma Free 27B, so like a mid-range coding model.

And yeah, Mistral also working on a larger agentic coding model to be released soon, apparently, with this being the smaller model that isn't quite that good. This is also following up on Codestral, which was more restrictively licensed compared to Devstral.

So there you go. Everyone is getting into coding more than they have before. You get an agent and you get an agent. And on to applications and business, we have first up, not the most important story, but I think the most kind of interesting or weird story to talk about, which is this open AI announcement of them fully acquiring a startup from Amazon.

Is it Johnny Ive? Yeah, Johnny Ive. Yeah. Johnny Ive. Yes. Yes.

who seemingly has had the startup IO that was... The details here are quite strange to me. So there's a startup that Joni Ives started with Sam Altman, seemingly, two years ago, that we don't know anything about or what is done. OpenAI has already owned 23% of this startup and is now going on a full equity acquisition.

that they're saying they're paying $5 billion for this IO company. And that's a company with 55 employees that...

Again, at least I haven't seen anything out of. And this is, they're saying like the employees will come over. Johnny Ive will still be working at Love From, which is his design company, broadly speaking, which has designed various things. So Johnny Ive, not a full-time employee at OpenAI or IO, still sort of like a part-time contributor, collaborator.

And to top off all these various kind of weird details, this came with an announcement video of Sam Altman and Johnny Ive walking through San Francisco, meeting up in a coffee shop and having like a eight minute conversation on values and AI and their collaboration that just had a very, very strange vibe to it. That was, you know, trying to make this conversation

Very artsy, I guess, feel to it. They also released a blog post called Joni and Sam. Anyway, I just don't understand the PR aspects of this, the business aspects of this. All of this is weird to me.

It almost reads like a landing page that Johnny Ive designed to announce it. It's very kind of sleek, simple, Apple style almost, one might say. Very similar to actually Love From, their website has the same style of... Yeah. Their blog post is like this minimalist, centered text, large text. And the headline is Johnny and Sam, I think. I'm just going to say it's weird.

So this blog post, they're talking about the origin story of this. I think the news reports around the time that IO was first launched recalled it like Sam Altman and Johnny Ives' new startup. And the implication was that this was a company being co-founded by Sam and Johnny together or something. That's clearly not the case, at least according to what they said. They

imply, they say something like it was born out of the friendship between Johnny and Sam, which is very ambiguous. But the company itself was founded about a year ago by Johnny Ive, along with Scott Cannon, who's an alum at Apple, and then Tang Tan and Evans Hankey. Evans Hankey actually took over Johnny's role at Apple after Johnny departed. So they're tight there, a lot of shared history, but none of the actual co-founders are Sam.

Opening Eye already owns 23% of the company. So they're only having to pay $5 billion out of the total valuation of $6.4 billion to acquire the company. And then as you said, somehow out of all this, Johnny ends up still being a free-ish agent to work at Love Run. That, by the way, high.

highly, highly unusual to acquire a company, even at a $6 billion scale, and to let one of the core, arguably the most important co-founder just leave. This is normally not how this goes. Usually, famously with the WhatsApp acquisition by Facebook,

It was like, I forget what it was, like a $5 billion acquisition, but the founder of WhatsApp left Facebook early. And so he was on an equity vesting schedule. So most of his shares just vanished and he didn't actually get the money that he was entitled to if he'd stuck around. So common thing, weird that Johnny gets to just leave and fuck off. And like, apparently, I don't know if he's still getting his money from this or like, it's so weird. This is like very esoteric kind of deal, it seems, but

But bottom line is they're working on a bunch of hardware things. OpenAI has hired the former head of Meta's Orion Augmented Reality Glasses Initiative. That was back in November.

And that's to head up its robotics and consumer hardware work. So there's a bunch of stuff going on at OpenAI. This presumably folds into that hardware story. We don't have much information, but there is presumably some magic device that is not a phone that they're working on together. And who the hell knows? Right. So this announcement, which is very short, like, I don't know, like maybe nine paragraphs, is

Concludes with saying, as I.O. merges with OpenAI, Johnny and Lovefrump will assume deep design and creative responsibilities across OpenAI and I.O. Not like a strong commitment. It's still a free agent. Like, what? Deep design and creative responsibilities. And yeah, I.O. was seemingly working on a new hardware product, as you said, like a hardware interface that

for AI similar to the Humane AI pin and Rabbit R1, famously huge failures. Very interesting to see if they're still hopeful that they can make this AI computer or whatever you want to call it, AI interface within OpenAI and with Joanie Ive, but...

Anyways, yeah, just such strange vibes out of this announcement and this video and the business story around this. Can an announcement have code smell? Because I feel like that's what this is.

And moving on to something that isn't so strange, we have details about OpenAI's planned data center in Abu Dhabi. So they're saying that they are going to develop a massive five gigawatt data center in Abu Dhabi, which would be one of the largest AI infrastructures globally. So this would

span 10 square miles and be done in collaboration with G42 and would be part of OpenAI's Stargate project, which I'm kind of losing track is OpenAI's Stargate project, just like all their data centers where they might want to put them. And this is coming after, of course, Trump's

in the Middle East with G42 having said that they're going to cut ties, divest their stakes in entities like Huawei and the Beijing Genomics Institute.

This is pretty wild from a national security standpoint. It is not unrelated to the deals that we saw Trump cut with the UAE and Saudi Arabia, I want to say last week or the week before. So for context, opening eyes first Stargate campus in Abilene, Texas, which we've talked about a lot, that's expected to reach 1.2 gigawatts. Really, really hard to find a spare gigawatt of power on the US grid. That's one of the big reasons why America is turning to the Saudis, the Emiratis and so on. I

and the Qataris to find energy on these kind of energy-rich nations' grids. And so when you look at five gigawatts, that's five times bigger than what is being built right now in Abilene. That would make this by far the largest structure, sorry, the largest cluster that OpenAI is contemplating so far. It also means that it would be based in foreign soil, on the soil of a country that the US has a complicated past with. And

Just based on the work that we've done on securing builds and data centers, I can tell you that it is extraordinarily difficult

To actually secure something when you can't control the physical land that that thing is going to be built on ahead of time. So when that is the case, you have a security issue to start with. That is prima facie, not an option when you're building in the UAE for a variety of reasons. You may tell yourself the story that you're controlling that environment, but you cannot and will not in practice.

And so from a national security standpoint, I mean, I would really hope the administration is tracking this very closely and that they're bringing in, you know, the special operations, the Intel folks, including from the private sector who really know what they're doing. I've got to say, the current builds, including from the Stargate family so far are amazing.

The level of security is not impressive. I've heard a lot of private reports that are non-public that make it very clear that that's the case. And so this is a really, really big issue. Like we got to figure out how to secure these. There are ways to do it and ways not to do it. But opening eyes so far has not been impressive in how seriously they've been taking the security story. But they've been talking a big game. But the actual on the ground realities seem to be seem to be quite different, again, just based on what we've been hearing.

So, really interesting question. Are we going to have this build go up? Is it going to be effective from a national security standpoint? And what's it going to take to secure this? Yeah, anyway, all part of that G42 backstory that we've been tracking for a long time between Microsoft and OpenAI and the United States and all that jazz.

Yeah. And it seems like with Trump in office, there's definitely set to be a major deepening of ties. And OpenAI, Microsoft, other tech companies seem happy to jump on board with that move. And yeah, as you said, there's been kind of a lot of investments going around from that region into things like OpenAI. So yeah.

makes some sense. It's worth it if you can secure it. This takes immense pressure off the US electric grid, right? We're not going to just build or find five gigawatts tomorrow. We actually don't know how to build nuclear reactors in less than a decade in America. So it's a really good option. Saudi capital, UAE capital, those are great things if they don't come with information rights or whatever. But yeah, this is like, if you want to get the fruits of the

of sort of Saudi and UAE energy, you got to make sure that you understand how to secure the supply chain around these things. Yeah. Well, with the billions of dollars this will surely cost, you would hope they put in a little bit of effort. You'd be surprised. Yeah. Security is expensive and it's actually like, it can't necessarily be bought for money because the teams that actually know how to secure these sites are

to the point where they are robust to, for example, like Chinese or Russian nation state attacks are extremely rare. And it's literally like a couple guys at like SEAL Team 6 and Delta Force and the agencies. And like, yeah, their demands on their time are extreme. And you probably can't network your way to them unless you have a trusted way to get there. So like, it's a really tough problem.

Well, on to the next story, I think another sort of weird, almost funny story to me that I thought worth covering, LM Arena, which had the famous AI leaderboards that often have been covered. We covered it just, I think, a few weeks ago. There was a big controversy around AI.

seemingly the big commercial players gaming the arena to get ahead of open source competition.

That organization has announced $100 million in a seed funding round led by A16Z and UC Investments. So this is going to value them at something like $600 million. And this is coming after them having been supported by grants and donations. So...

I don't understand. What is the promise here for this leaderboard company organization? Is this just charity? Anyway, it's very strange to me. I would love to see that slide deck, that pitch deck. There's a lot here that's

Interesting to say the least. So one thing to note, by the way, is they raised, it's a hundred million dollar seed round. This is not a priced round. So for context, when you raise a seed round, you're, oh man, this gets into unnecessary detail, but basically it's a way of avoiding putting a real valuation on your company if you raise it with safes. Usually the whole thing with a seed round is you don't give away a board seat.

Whereas if you raise a series A or series B, you're starting to give away board seats. So this implies that they have a lot of leverage. Like if you're raising a hundred million dollars and you're calling it a seed round, you're basically saying like, yeah, we'll take that money. You'll get your equity, but don't even think about getting a board seat. That's kind of the frame here. You can only do that typically when you have a lot of leverage, which again, brings us back to your very, I think very good and fundamental question. What is the profit story here? And the

I have no idea, but it's notable that LM Arena has been accused of helping Top AI Labs game its leaderboard, and they've denied that. But when you think about, okay, how could a structure like this be monetized? Well,

maybe showing some kind of not overt preference, but subtle preference for or indirect preference for certain labs. Like, I don't know. I'm speculating and this should not be taken. Like, I just don't see any information on exactly what the profit play is, which kind of makes me intrinsically skeptical. And yeah, we'll see where this goes. But again, there's a lot of leverage here. There's got to be a profit story. It's being led by A16Z. So, you know,

There's a there there, presumably. Yeah. Apparently it costs a few million dollars to run the platform and they do need to do the compute to compare these chatbots. So the idea here is you get two generations, two outputs for a given input and people vote on which one they prefer. So it is costly in that sense. And it does require you to pay for the inference. And what...

at least has been said, is this funding will be used to grow Elam Arena and hire more people and pay for costs such as the compute required to run this stuff. So yeah, basically saying that they are going to scale it up and grow it to something that supports the community and helps people learn from human preferences. Nothing related to how this 100 million will be, you know,

something that the investors will get a return on but it could be a data play like you know a kind of scale AI we're doing you know it is you've got some data labeling that's cool there I just like yeah I'd love to see that deck yep

And next up, going back to hardware, NVIDIA CEO has said that the next chip after the H20 for China will not be from the Hopper series. So this is just a kind of small remark and it's notable because

Previously, it was reported that NVIDIA planned to release a downgraded version of the H20 chip for China in the next two months. This is announced amidst a transition in U.S. policy as to restrictions on chips. And after the sale of these H20 chips designed specifically for China was banned only a few months ago, it looks like NVIDIA is, yeah, kind of having to...

their plans and adapt quite rapidly. Yeah, it seems like they will be pulling from the Blackwell line. This makes sense. Jensen's quote here is, it's not Hopper because it's not possible to modify Hopper anymore. So they've sort of moved their supply chains over onto Blackwell. No surprise there. And they've sort of squeezed all the juice they can out of the Hopper platform. And presumably sold out of their stock when it was announced that they couldn't do it anymore.

Next up, I put this in the business section just so that we could move on from Google for a little bit. It was announced that Google Gemini AI app has 400 million monthly active users, apparently, which is approaching the scale of Chad GPT. Apparently, that had 600 million monthly active users as of March 1st.

So yeah, as I previewed, I guess, seems very surprising to me because Gemini as a chatbot hasn't seemed to be particularly competitive with offerings like ChatGPT and Claude and haven't seen many people be big fans of Gemini or of a Gemini app. But according to this announcement, lots of people are using it.

Yeah. And apparently, so the comparable here, there are recent court filings where Google estimated in March that ChatGPT had around 600 million monthly active users. So this is like two thirds of where ChatGPT was back in March. So to the extent that ChatGPT and OpenAI are encroaching on Google's territory, well, Google's starting to do the same. So

Yeah, this is all obviously a competition as well for data as much as for, you know, money in the form of subscriptions. So this is all self-licking ice cream cones, if you will, or flywheels that both these companies are trying to get going.

Right. And I think also part of a broader story, this whole thing with Google I.O. 2025 and this announcement as well, I think demonstrates that over the last few months, really, Google has had a real shift in fortune in terms of their place in the AI race and competition. Basically, until 2025, they've seemed to be surprisingly behind. Gemini was...

surprisingly bad, even though numbers look pretty good and their web offerings in terms of search lagged behind Perplexity and ChagiPG Search. Then Gemini 2.5 was updated or released, I think, in late January and kind of blew everyone away with how good it was. Gemini 2.5 and Gemini Flash have continued to be updated and

continued to impress people. And now all this stuff would be a free Imagine 4, the agents, all these like 10 different announcements really positioned Google as I think for many people in the space looking at who is in the lead or who is killing it.

Google is killing it right now. They are. And this is, you know, we've talked about this before, Google being the sleeping giant, right? With this massive, massive pool of compute available to them. They were the first to, I mean, there's the first to recognize scaling in the sense that OpenAI did with GPT-2 and then GPT-3. But then there's the first to recognize scaling

Let's just say the need for distributed computing infrastructure in a more abstract sense. And that was certainly Google. They invented the TPU explicitly because they saw where the wind was blowing. And then now they have these massive TPU fleets and a whole integrated supply chain for them.

And OpenAI really woke the dragon when they went toe-to-toe with Google via ChatGPT and Microsoft. And so, yeah, I mean, to some degree, this is not to some degree entirely. This is the reason why you're seeing that five gigawatt UAE build that OpenAI is going to build. They need to be able to compete on a flop-per-flop basis with Google. If they can't, they're done, right? This is kind of just how the story ends.

So that's why all the CapEx is being spent anyway. Just these announcements that we're seeing today are the product of CapEx that goes back, you know, two years, like breaking ground on data centers two years ago and making sure chip supply chains are ready three years ago and designing the chips and all that stuff. So, you know, this is really a long time in the making every time you see a big rollout like this. Yeah. And not just the infrastructure. I mean, having DeepMind, having Google AI, you

Google was the first company to really go in on AI in a big way, spending billions of dollars on DeepMind for many years as just a pure R&D play.

Microsoft later also started investing more in meta and so on. But yeah, Google has been around for a while in research and that's why it was to a large extent kind of surprising how lagging they were on the product front and now seemingly they're catching up.

And just one more thing to cover in the section, we have a bit of an analysis on the state of AI servers in 2025. This is something, Jeremy, you linked to just on X. So I think I'll just let you cover this one. Yeah, it's sort of like a random assortment of

of excerpts or take-homes from this big JP Morgan report on AI servers from their Asia Pacific Equity Research Branch. And there's just like a bunch of little kind of odd tidbits. We won't spend much time on this because we got to go, man. There's more news. Just looking at the mismatch between, for example, packaging production. So TSMC's ability to produce wafers of like kind of packaged chips and

And then downstream GPU module assembly, right?

and how that compares to GPU demand. And they're just kind of flagging this interesting mismatch where it seems like there's about a 1 million or 1.1 million GPU unit oversupply currently expected heading into the next few quarters, which is really interesting given where things were at just like two years ago, right? That massive, massive shortage that saw prices skyrocket. So kind of curious to see what that does to margins in the space. This is all because of

Nima Dehmamy:

And in particular, ASIC shipments, so basically AI chip shipments, projected to go up 40% year over year, which is huge. I mean, that's a lot more chips in the world than there were last year. And keep in mind, those chips are also much more performant than they were before. So it's 40% year over year growth on a per chip basis, but on a per flop basis, a per compute basis, it's even more than that. We may be doubling the amount of compute or actually more

that there is in the world based on this. Anyway, you can check it out if you're a nerd for these things and you want to see what's happened to Amazon's Trinium 2 demand. It's up 70%, by the way, which is insane, and a bunch of other cool things. So check it out if you're a finance and compute nerd, because this is just going to be your weekend read. Onto the next section, projects and open source. We just have one story here

I guess to try and save time because there's a lot more after. And the story is pretty simple. Meta is delaying the rollout of the biggest version of Llama. So when they announced Llama 4, they also were previewing Llama 4 Behemoth, their large variant of Llama 4 that is meant to be competitive with

you know, Chachapiti and Quad and basically the frontier models. So it seems like, according to sources, that they initially planned to release this behemoth in April. That was later pushed to June and it has now been pushed again until,

So this is all, you know, kind of internal. They never committed to anything, but it seems like per kind of the reports and general, I think, things that are coming out that Meta is struggling with training this model to be as good as they want it to be.

Yeah, I think this is actually a really bad sign for Meta because also they have a really big compute fleet, right? They have huge amounts of CapEx that they've poured into AI compute specifically. And what this shows is that they now have consistently struggled to make good use of that CapEx.

They have been consistently pumping out these pretty mid models, unremarkable. And then to make up for that, gaming them to make them look more impressive than they are in a context where DeepSeek is eating their lunch, both from a marketing branding standpoint and also just raw performance and compute efficiency. And so, yeah, this is really bad. The whole reason that Meta

turned to open source, it was never because they thought that they were going to somehow open source AGI. That was never going to happen. Anybody who has AGI locks it down and uses it to like bet on the stock market and then funds the next generation of scaling and that shit. But, and then obviously automates AI research.

It was eventually going to get locked down. This was always a recruitment play for Meta. And there were some other ancillary infrastructure things, getting people to build on their platforms and that. But the biggest thing absolutely was always recruitment. And now with that story just falling flat on its face,

it's really difficult. Like if you want to work at the best open source AI lab, A, like unfortunately it looks like right now there are Chinese labs that are absolutely in the mix. But B, there are a lot of interesting players who seem to be doing a better job on a kind of per flop basis over here. You look at even Allen AI, right? They're putting out some really impressive models themselves. You've got a lot of really, anyway, impressive open source players who are not meta. So,

I think like Zuck is in a real bind and they're doing a lot of damage control these days.

Yeah, and I think this speaks to like Meta has really good talent. They have been publishing just fantastic work for many, many years. But my sense is that the skills and experience and knowledge needed to train a massive, massive LLM model is very different. And the competition for that talent is just immense, right?

X, XAI, when it came out, I think was seemingly providing just really, really big packages to try and get people with experience in that Anthropic has had very high retention of their talent. I think I saw a number somewhere like 80% retention. We've seen people leaving from Google to go do their own startups. So I think Meta,

Presumably, that's part of the problem here is this is a pretty specialized skill set and knowledge. They've been able to train good LMs, but to really get to a frontier is not as simple as maybe just scaling.

Out to research and advancements. And we begin with not a paper, not a very detailed kind of advancement, but a notable one. And this is also from Google. So...

Sort of under the radar, just as a little research announcement and demo, they did announce Gemini Diffusion. And this is the kind of demonstration of doing language modeling via diffusion instead of auto-regression. So typically, any chatbot you use these days essentially is...

generating one token at a time, left to right, start to finish. It picks one word, then it picks the next, then it picks the next. And we, I think recently covered efforts to move that to the diffusion paradigm where you basically generate everything all at once. So you start with all the text and there's some messy kind of initial state, and then you update it

to do better. And the benefit of that is you can do, be just way, way, way, way faster compared to generating one word or one token at a time. So DeepMind has come out with a demonstration of diffusion for Gemini for coding that seems to be pretty good. It seems to be comparable with Gemini 2 Flash Lite, the smaller kind of not quite as powerful fast model.

And they are claiming speeds of about 1500 tokens per second with very low initial latency. So something roughly on a scale of 10 times faster than GBD 4.1, for example, just lighting fast speeds. Not much more details here. You can get access to the demo signing up for a waitlist and

And yeah, if they can push this forward, if they can actually make diffusion be as performant as just out-aggressive generation at the frontier, really, really big deal. Yeah. And diffusion, so conceptually, diffusion is quite useful from a parallelization standpoint. It's got properties that allow you to parallelize just in more efficient ways than transformers potentially, right?

One of the consequences of that, they show a case where the model generates 2000 tokens per second of effective kind of token rate generation, which is pretty wild.

It means you're almost doing like instant generation of chunks of code. There's to kind of give you a sense for why this would matter. There's a certain kind of sometimes known as like non-causal reasoning that these models can do that your traditional autoregressive transformers can't. So an example is you can say like, solve this math problem. First, give me the answer. And then after that, walk me through the solution, right? Okay.

So give the answer first, then give the solution. That's really, really hard for standard autoregressive models because what they want to do is spend their compute first, spend their inference time generating a bunch of tokens to reason through the answer and then give you the answer. But they can't. They're being asked to generate the solution right away and only generate the

the derivation after. Whereas with diffusion models, they're generating the whole thing all at once. They're seeing the whole canvas all at once. And so they can start by having a crappy solution in the first cycle of generation and a crappy derivation. But as they modify their derivation, they modify the solution, blah, and then eventually they get the right answer on the whole. So this may seem like a pretty niche thing, but it can matter in certain

So specific settings where a certain kind of causality is at play and you're trying to solve certain problems. And just generally it's, it's good to have other architectures in the mix because if nothing else, you could do like a kind of mixture of models where you have some models that are better at solving some problems than others. And this gives you an architecture. It's a bit more robust for, for some problems.

Right. And like intuitively, you know, you're so used to when you're using chat GPT or these LLMs to this paradigm of like you enter something and then you see the text kind of pop in and you almost are reading it as it is being generated. Yeah.

With diffusion, what happens is like all the text kind of just shows up. It's near real time. And that is a real kind of qualitative difference where it's no longer waiting for it to complete as you're going. It's more like you enter something and you get the output almost immediately, which is kind of bonkers if you think it can be made to work.

anywhere near as well as just the autoregression paradigm. But not many details here on the research side of this. Hopefully, they'll release more because so far, we haven't seen very successful demonstrations of it.

And moving on to an actual paper, we have chain of model learning for language model. So the idea here is you can incorporate what they are saying as hierarchical hidden state chains within transformer architectures. So what that means is you can

Hidden states in neural nets is basically just the soup of numbers in between your input and output. So you take your input, it goes through a bunch of neural computing units and generates all these intermediate representations from the beginning to the end and keep updating until you generate the output. So the gist of a paper is that if you structure that hidden state hierarchically and have a

These chains that are processed at different levels with different levels of granularity and with different levels of model complexity and performance, you can be more efficient. You can use your compute in more dynamic and more kind of flexible ways.

So that's, I think, the gist of this. And I haven't looked into this deeper, so Jeremy, maybe you can offer more details. Sure. I think this is kind of a banger of a paper. It's also frustrating that this is, I mean, this is a

a multimodal podcast. We have video, but we don't like, you know, there's like an image in the paper that makes it make a lot of sense. It's figure two that just shows the architecture here, but high level, you can imagine a neural network has, you know, layers of neurons that are, you know, stacked on top of each other. And typically the, you know, the neurons from the first layer are

Each one of them is connected to each neuron in the second layer, and each neuron in the second layer is connected to each neuron in the third layer, and so on. So you kind of have this dense mesh of neurons that are linked together. So there's a width, right, the number of neurons per layer, and there's a depth, which is the number of layers to the network. In this case, what they're going to do is they're kind of going to have a slice, a very small, narrow width slice of this network connected

And they're going to essentially make that the backbone of the network. So let's imagine there's like, you know, two neurons in each layer. And the two neurons from layer one are connected to the two neurons from layer two and layer three and so on. And the two neurons, say, at layer two can only take input from the two neurons at layer one. They can't see any of the other neurons at layer one.

that then becomes this like pretty cordoned off structure within a structure. So if you have like a larger number of neurons in each layer,

that are only connected to the additional, anyway, sets of neurons at each layer. Hopefully you can just check out the figure and see it. You can kind of see how this allows you to kind of increase the size. You can run your model in larger mode, either by only using the thin slice of say two neurons that we talked about, or by considering a wider slice, you know, four neurons or eight or 16 or whatever, right?

And so what they do is they find a way to train this model such that they are training at the same time all these kind of smaller submodels, these thinner submodels, so that once you finish training, it costs you the same amount basically to train these models. But you end up for free with a bunch of smaller models that you can use for inference.

And the other thing is, because of the way they do this, the way they engineer the loss function is such that the smaller slices of the model, they have to be able to independently solve the problem. So the thinnest slice of your model has to be able to make decent predictions all on its own. But then if you add the next couple of neurons in each layer to your model and get the slightly wider version...

that model is going to perform a little bit better because it's got more scale, but it also has to be able to independently solve your problem. And so it ends up, those extra neurons end up specializing in kind of refining the answer that your first thinner model gives you. So there's a sort of idea where you can gradually control, you can tune the width of your model or effectively the level of capacity that your model has to

dynamically at will. And from an almost interpretability standpoint, it's quite interesting because it means that the neurons from that thinnest slice of your network that's still supposed to be able to operate coherently and solve problems independently, those neurons alone must be kind of focused on more foundational basic concepts that generalize a lot.

And then the neurons that you're adding to the side of them are more and more specialized as you add onto them. They're going to allow the model to perform better when they're included, but excluding them still results in a functional model. So there's a lot of detail to get into in the paper we don't have time for, but I highly recommend taking a look at it. I wouldn't be surprised if something like this ends up becoming fairly important. It smells of good research taste, at least to me.

It is a Chinese lab that came out with it, which is quite interesting. But in any case, check it out. Highly recommend. Yeah, it's cool paper. Yeah, actually, a collaboration between Microsoft Research and Fudan University, several others. But they did open source or say they will open source the code for this stuff. And the paper is kind of funny. They...

It produced a lot of terms where like, here's the notion of chain of representation, which leads into chain of layer, which leads into chain of model, which leads into chain of language model.

With the idea that these kind of cumulatively lead up to the notion that when you train a single large model, it contains these sort of sub-models. And it is quite elegant, as you say, now that I've taken a bit of a deeper look.

Next paper is Seek in the Dark, Reasoning via Test Time Instance-Level Policy Gradient in Latent Space. So the idea or problem here is a variant of test time compute where you want to be able to do better for a given input by leveraging computation at test time rather than train time. You're not updating your parameters at all, but you're still able to do better.

And the idea of how this is done here is sort of mimicking prompt engineering. So you're tweaking the representations of the input for the model. But instead of actually literally tweaking the prompt for a given input, it's tweaking the representations within the model.

So they are using every reward function to update the token wise latent representations

in the process of decoding, and that they show can be used to, for a given input, improve performance quite a bit. So they're kind of optimizing the internal computations in an indirect way that is yet another way to be able to scale at test time, quite different from, for instance, Shana Fodd.

Yeah, so that was actually really good. I never thought of this as an alternative to prompt engineering, but I think you're exactly right, right? It's like activation space prompt engineering, or at least that's

That's a really interesting analogy. Yeah. So this is another, in my opinion, another really interesting paper. So the basic idea is you're going to take a prompt, feed it to your model. In this case, you're going to give it a reasoning problem and get the model to generate a complete chain of thought, right? So the model itself just generates the full chain of thought, vanilla style, like nothing unusual. And then you're going to feed the chain of thought to the model.

And you're going to, this is going to lead to a bunch of activations at every layer of the model as usual. Now the final layer of the model, just before it gets decoded, you have activations there and you're going to say, okay, well, why don't we essentially build a reinforcement learning model and have that, that model play with just those activations. And what we're going to do is we're going to get the model itself to like decode and then

estimate the expected reward on this task for the final kind of decoding the answer. And you're going to do it in a very kind of simple, greedy way. So whichever token is given highest probability, that's just the one that you're going to predict.

And you're going to use essentially a version of the same model to predict the reward. And then like, if the reward is low, you're going to go in and modify. So according to the model's own self-evaluation, if the reward is low, you're going to modify the activations in that final layer, the activations that sort of represent or encode the chain of thought that was fed in. So you're going to tweak those.

And then you'll try again, decode, and then get the model to evaluate that output. Oh, I think it needs to be, we need to do some more tweaking. So you go back and you tweak again the activations. And you can do a bunch of loops like this. Essentially, it's like getting the model to correct itself. And then based on those corrections, it's actually changing its own

representation of the chain of thought that it was chewing on. It's really quite interesting. And again, it feels, it sort of feels obvious when you see it, but somebody had to actually come up with the idea, a couple of observations here. So there's an interesting scaling behavior as you increase the number of iterations of the cycle, right? Get the model to actually decode, evaluate its own output, then tweak the activations a bit.

What you find is there's typically like an initial performance improvement that's followed by a plateau. And that plateau seems to come from the model's own ability to evaluate, to predict the rewards that would be assigned to its output. When instead of the model self-evaluating, you use an accurate reward model, one that always gets the reward prediction right,

Then all of a sudden that plateau disappears and you actually get continuous scaling. Like the more of these loops you do, as long as you're correctly assigning the reward and it corresponds to like the true base reality, you just continue, continue, continue to improve with scale. So that's another scaling law implied in here, which is quite impressive, but

There's also a bunch of like compute efficiency stuff. So there's a question of like, do we think of the playing field as every activation in the final layer of the transformer or as a subset? We could imagine only optimizing, only doing reinforcement learning to optimize, say, 20% of those activations. And in fact, it turns out that that ends up being the optimal way to go. And 20% is a pretty good number.

pretty good number they find. Don't optimize all those activations, just optimize some of them. And at least for me, that seemed counterintuitive. Like why wouldn't you want to optimize the full set of activations? It turns out a couple of reasons. One is just optimization stability, right? So if you're updating everything, there's a risk that you're just going to go too far off course and you need to have some anchoring to the original meaning of the chain of thought so

So you don't steer your way off. And then there's issues of representational capacity. So just having enough latent representations to allow you to do effective extrapolation. Anyway, this is a really, I think, interesting and important paper. Wouldn't be surprised to find it turn into another dimension of test time scaling. So yeah, just thought it was worth calling out. Yeah, it's...

Interesting in a sense of, I don't know, it's like you have an auxiliary model or you could conceptually have an auxiliary model that's just for evaluating this in-between activation and doing sort of side optimization stuff.

without updating your main model. Something about it seems a bit strange conceptually, and maybe there's equivalent versions of this, but that's just a gut feeling somehow I get. And next we have, two experts are all you need for steering thinking. Reinforcing cognitive effort in MUE reasoning models without additional training is the title of this paper.

So this is a way to improve reasoning in mixtures of experts model without additional training. Mixture of experts is when you have a model that sort of splits the work across subsets of it, more or less.

And they are aiming to focus on and identify what they're calling cognitive experts within the model. So they're looking for correlations between undesirable reasoning behaviors and activation patterns of specific experts in models.

NME mixtures of experts models. So basically just large language models that have mixtures of experts. And then when they find the experts that turn out to have the best kind of reasoning behavior, they amplify those experts in the computation of the output. And typically the way mixtures of experts works is you like

route your computation to a couple of experts, and then you sort of average out the outputs of those experts to decide what to output. So conceptually, you can sort of give more weight to certain experts or route the data to certain experts more often. So when they find these

These theoretical cognitive experts, they show that in fact, this seems to be something that can be done in practice for LLMs that have MOE for reasoning applications.

Yeah. And it's kind of like, I want to say embarrassingly simple how they go about identifying what are the experts, what are the components of the model that are responsible for doing reasoning? And so it turns out when you look at the way DeepSeek R1 is trained, right, it's trained to put its thinking, its reasoning between these thinking tokens, right? So they kind of have, it's like HTML, if you're familiar with that, like, you know, you have like

like bracket, think, bracket, and then your actual thinking text, and then close bracket, think, bracket.

bracket. What they end up doing is they say, okay, well, like, let's see which experts typically get activated on the thinking tokens. And it turns out that it's only a small number that consistently get activated on the thinking tokens. So, hey, that's a pretty good hint that those are the experts involved in the reasoning process. So the way they test that intuition is they say, okay, well, if that's true, presumably, like you said, Andre, if I just dial up the

contribution of those experts, of the reasoning experts,

on any given prompt that I give them, then I should end up seeing more effective reasoning or at least a greater inclination towards reasoning behavior. That's exactly what happens. So this is pretty, I would have like, it's, this happens so often, but like, I would have been embarrassed to suggest this idea. Like it just seems so obvious. And yet the obvious things are the ones that work. And in fairness, they only seem obvious in hindsight. This is obviously a very good idea.

Anyway, so they use a metric called pointwise mutual information to measure the correlations between expert activations and reasoning tokens. It's actually a pretty simple measure, but there's no point going over it in detail. One interesting thing is there's cross-domain consistency though. So the same expert pairs consistently appeared as the top

the top cognitive experts across a whole bunch of domains, math, physics, a bunch of stuff, which really does suggest that they encode general reasoning capabilities. I wouldn't have bet on this. Like the idea that there is an expert in an MOE that is the reasoning guy. One thing, they don't touch on this in the paper, but I would be super interested to know how are the different so-called reasoning experts different? Right?

Right. So like they're saying there's two reasoning experts basically in this model that you need to care about. So how do like what in what ways do their behaviors differ? Right. What is what are the different kinds of reasoning that the model is capable of or wants to divide between two different experts? And that'd be really interesting. Anyway, it's a whole bunch of other stuff we could get into about compute efficiency, but there is no time. There's no time we have.

Quite a few more papers to discuss. A lot of research also this week. And next one is another Gemini related paper. It's lessons from defending Gemini against indirect prompt injections coming from Google. Quite a detailed report, something like I think 16 pages, not actually like dozens of pages. If you include the appendix with all the various details and,

The gist of it is you're looking at indirect prompt objections to things like embedding data in a website to be able to get an AI agent that's been directed to go off and do something.

go off course. And the short version I'll provide as a summary, and Jeremy, you can add more details as you think is appropriate, is that they find that it is possible to apply known techniques to do better. So you can protect against known attacks and do that via adversarial fine-tuning, for instance. But the high-level conclusion is that

This is an evolving kind of adversarial situation where you need to essentially be continually on it and see what are these new attack techniques to be able to deploy new defense techniques as things evolve.

I think that's a great summary, especially given the time constraints. Yeah, I'll just highlight two quick notes. So first is they find adaptive evaluation of threats is critical. So a lot of the defenses that do really well on static attacks can be tricked by really small adaptations to attacks. So tweak an attack very slightly and then it suddenly works, right? So this is something that we see all the time. And then there's this other notion that

If you use adversarial training to help your models get more robust to these kinds of attacks, that's going to cause the performance to drop. And what they find is that's actually not the case. One of the most interesting things about this paper is just like the list of attacks and defenses to prompt injection attacks that they go over. I'm going to mention one and then we'll move on.

But it's just called the spotlighting defense. I actually had never heard of this before. So if you have an attacker who injects a prompt or some dangerous text into a prompt, like ignore previous instructions and do some bad thing, the spotlighting defense, what it does, it will insert what are known as control tokens. So they're basically just like new different kinds of tokens at regular intervals that just break up your text.

So that, you know, ignore previous instructions get split up and you have, you know, ig and then control token and then nor and then pre and then another control token. And it just has a way of, and then you tell the model, sorry, in the prompt, you tell it to be skeptical of text between those control tokens.

And so that kind of teaches the model to be a little bit more careful about it. And it has, anyway, really effective results. There's a whole bunch of other defenses and attacks they go into. If you're interested in the attack-defense balance and the zoo of possibilities there, go check out this paper. It's a good catalog. Next up, we have from Epic AI, how fast can algorithms advance capabilities? So this is a blog post associated with a previously released paper titled, How

LLM-E guess. Can LLM capabilities advance without hardware progress? The motivation of the research is basically asking the question of, can we find software improvements that yield big payoffs in terms of better accuracy? So

It ties into this hypothesis that if LLMs get good enough at conducting good AI research, they can find breakthroughs to self-improve. And then you get this so-called intelligence explosion where the LLMs get better at research. They find new insights as to how to train better LLMs. And then the better LLMs keep finding better AI.

algorithmic insights until you become super, super ultra intelligent. And this is one kind of commonly believed hypothesis as to why we might get, what is it? SAI? Super intelligent AI? ASI, yeah. ASI, relatively soon. So this blog post is essentially trying to explore how likely that scenario is based on the trajectory and history of

have algorithmic progress so far. And the gist of their conclusion is that there are two types of investments. They are compute-dependent and compute-independent insights. So there are some insights that only demonstrate their true potential at large scales, things like transformers, mixtures of experts, sparse attention that

With smaller models, when you're testing, may not fully show you how beneficial they are, how promising they are. But as you scale up, you get way, way stronger benefits of like 20 times the performance, 30 times the performance. Versus smaller things like layer norm, where you can reliably tell that this algorithmic tweak is

is going to improve your model. And you can verify that at 100 million parameters instead of 10 billion parameters or 100 billion parameters, meaning that you can do research and evaluate these things without ultra large hardware capacity. So the basic conclusion of the paper is that

The idea that you can get intelligence explosion needs to be a result of finding these compute-dependent algorithmic advances being easier to find. So you need to find the advancements that as you scale up compute will yield like big, big payoffs rather than relatively small payoffs. Yeah, the frame is that these, so these compute-dependent advances are

Like you said, you only see the return on investment at large scales or the full return on investment at large scales. And they point out that when you look at the...

the boosts in algorithmic efficiency that we've seen over the years, these are dominated by compute-dependent advances. So you look at the transformer, the MOE, mixed query attention, sparse attention, these things collectively, they're like 99% of the compute efficiency improvements. We've seen 3.5x according to them from compute-independent improvements like flash attention and rope, but they don't hold the candle to the

These like approaches that really leverage large amounts of compute. And so I think in their minds, the case that they're making is like, you can't have a software only singularity if you need to like leverage giant amounts of physical hardware to test your hypothesis, to validate that your new algorithmic improvement is actually effective. You need to actually work in the physical world to gather more hardware. Right.

I think this, frankly, doesn't do the work that it thinks it does. There are a couple of issues with this. And actually, Ryan Greenblatt on X has a great tweetstorm about this. By the way, first of all, love that Epic AI is doing this. Really important to have these concrete numbers so that they can facilitate this sort of debate. But...

I think the key thing here is, so they highlight, look, transformers. Transformers only kind of give you returns at outrageous, or sorry, give you the greatest returns at outrageous levels of scale. So therefore, they're a compute-dependent advance.

I don't think that's what actually matters. I think what matters is, would an automated software-only process have discovered the transformer in the first place? And to that, I think the answer is actually probably yes, or at least there's no clear reason that it wouldn't have. In fact, the transformer, MOE, mixed query attention, they were all originally found in

at tiny scale, as Ryan points out, about one hour of compute on an H100 GPU. So that's quite small. Even back in the day, in relative terms, it was certainly doable. And so the actual question is, do you discover things that give you a little lift that makes them seem promising enough to be worthy of subsequent investment? The answer seems to be that actually, basically all of the advances that they highlight as the most important compute-dependent advances are

have that property. We're discovered at far, far lower scale and we just keep investing in them as they continue to kind of show promise and value. And so it's almost like any startup, you keep investing more capital as it shows more traction. Same thing. You should expect the decision theoretic loop of the software only singularity to latch onto that because that's just good decision theory.

So anyway, I think this is a really rich area to dig into. I have some issues as well with their frame. They look at DeepSeq and they kind of say that the DeepSeq advances were all kind of compute constrained advances or compute dependent. But again, the whole point of DeepSeq was that they use such a small pool of compute. And so I almost want to say like,

To the extent that compute independent means anything, DeepSeq, a lot of their advances really should be viewed as statutorily compute independent. Like the point is that they had very little compute. This is actually a great test bed for what a software only process could unlock potentially. So lots of stuff there. You can look into it. I think it's a great report and a great room for discussion. Yeah, I think it's kind of introducing the conceptual idea of compute dependent versus independent computing.

And then there are questions or ideas you can extrapolate. Last paper, really quickly, I'll just mention without going into depth. There's a paper titled Reinforcement Learning Fine-Tunes Small Subnetworks in Large Language Models. The short, short version is when you do alignment via reinforcement learning, that turns out to update a small number of the model parameters, something like 5%.

or sorry, 20% versus doing supervised fine tuning, you update all the weights as you might expect. So this is a very strange and an interesting kind of behavior of reinforcement learning alignment versus supervised alignment. And I figured I should just mention it as an interesting paper, but no time to get into it.

So moving on to policy and safety, we have first an exclusive with a report on what OpenAI told California's Attorney General. So this is, I suppose, a leak or perhaps, I don't know, a demonstration of this response to petition for Attorney General action to protect the charitable nature of OpenAI's assets sent to the Attorney General on May 15th.

by OpenAI basically has all their arguments in a position to the groups that want to stop OpenAI from restructuring and really just restating what we've been hearing a whole bunch. You know, Musk is just doing this as a competitor and is harassing us and has misinformation and

And basically saying, you know, ignore this petition to block us from doing what you want. It isn't valid. Yeah. And it's, so there are a whole bunch of interesting contradictions in there as well with some of the claims OpenAI has been making, or at least the vibes they've been putting out, which is pretty standard OpenAI fare that they'll, you know, they get, they try to get away with a lot, it really seems. And there's a lot of examples of this here. So one item is,

So they suggested the nonprofit, by the way, so some of this is revealing information, material information about the nature and structure of the deal, this sort of nonprofit transition thing that was not previously public, right? So opening eye recently came out and said, look, this whole plan we had of having the for-profit kind of get out from under the control of the nonprofit, we're going to scrap that. Don't worry, guys, we hear you loud and clear.

There are now a bunch of caveats. We highlighted, I think, last week that there would be caveats. The story is not as simple as OpenAI has been making it seem. A lot of people have kind of declared victory on this and said, great, the nonprofit transition isn't happening. Let's move on. But hold on a minute. This is OpenAI doing their usual best to kind of control the PR around this. And they have done a good job at that. So here's a quick- Let me just really quickly mention-

For context, this is partially in reply to this Not For Private Gain coalition that has a public letter. They released a public letter in April 17. They updated their letter in May 12 in response to on May 5, OpenAI announcing that they are kind of backing off from trying to go full for-profit campaign.

with this new plan of the public benefit corporation and kind of not going for profit. So this not for gain coalition updated their stance and essentially has criticism still. And this letter on May 15th is in response to a whole chain of criticism. Yeah. So if it wasn't complicated enough already...

Yeah. And so here's a line from the opening statement here. The nonprofit will exchange its current economic interests in the capped profit for a substantial equity stake in the new public benefit corporation.

and will enjoy access to the public benefit corporation's intellectual property and technology personnel and liquidity. That sounds like a good thing until you realize that, well, wait a minute, the nonprofit did not just enjoy access to the technology, it actually owned or controlled the underlying technology. So now it's going to just have a license to it, just like opening eyes commercial partners. That is a big, big caveat, right? That is not consistent with...

with really the spirit potentially, but certainly the facts of the matter associated with the previous agreement as I understand them. Under the current structure, opening eyes LLC, so the sort of

operating agreement explicitly states that the company has a duty to its mission and the principles advanced in the OpenAI charter take precedence over any obligation to generate a profit. That creates a legally binding obligation on the directors of the company, the company's management. Now, under the new structure, though, the directors would be legally required to balance shareholder interests with the public benefit purpose. And so it's

This is like the fundamental obligations, legal duties of the directors is now going to be to shareholders over or potentially alongside, I should say, the mission. And that shift is probably a big reason why investors are more comfortable with this arrangement. We heard SoftBank say, you know, look, from our perspective, everything's fine.

After they had said, OpenAI has got to get out from under its nonprofit in order for us to keep our investment in. And now they're making these noises like they're satisfied. So clearly for them, de facto, this is what they wanted, right? So there's something going on here that doesn't quite match up. And this is certainly part of it, or at least seems like it is. By the way, no public benefit company in Delaware, Garrison Lovely, who's the author of this, says,

No Delaware PBC has ever been held liable for failing to pursue its mission. Their legal scholars can't find a single benefit enforcement case on the books. So in practice, this is a very wide latitude, right? There's a lot of shit that this could allow. In this letter, they're trying to frame all the criticism of this very controversial and I think pretty intuitively kind of inappropriate policy.

attempt to convert the kind of nonprofit or all that jazz. They're trying to pin it on Elon and say, basically, he's like the only critic or that's sort of the frame just because it's easy to dismiss him as a competitor. And for political reasons, he's an easy whipping boy. But there's a whole bunch of stuff in here

You know, there's like, I'll just read one last excerpt because we got to go. But OpenAI's criticism of the coalition's, this is the coalition you referred to, Andre, April 9th letter is particularly puzzling. The company faults the coalition for claiming that, quote, OpenAI proposes to eliminate any and all control by the nonprofit over OpenAI's core work.

This criticism is perplexing because as OpenAI itself later demonstrated with its May 5th reversal, that was precisely OpenAI's publicly understood plan at the time the coalition made its statement. The company appears to be retroactively criticizing the coalition for accurately describing OpenAI's proposal as it stood. So you could be forgiven for seeing a lot of this

as kind of manipulative, bad-faithy comms from OpenAI, especially given that this letter was not meant to be made public. And it fits, unfortunately, a pattern that we have seen the people, at least many people believe they have seen many times over. We'll see where it all goes, but this is a thorny, thorny issue. Yeah, I think we've gotten the hints at kind of the notion that OpenAI legally is

has tried to be aggressive, not just legally, also publicly in terms of arguing with Musk and so on. And we only have time for one more story. So we're going to do that. We have activating AI safety level free protections from Anthropic. So Anthropic has their responsible scaling policy that sets out various thresholds

for when they need to have these safety level protections with additional safety levels requiring a greater scrutiny, more stringent processes, etc.,

So with Cloud Opus 4, they are now implementing these AI safety level 3 measures as a precautionary measure. So they've said, we're not sure if Opus 4 is kind of at the threshold where it would be dangerous to the extent that we need this set of protections, but we're going to implement them anyway. And this comes with a kind of a variety of stuff they're committing to do it

They are making it harder to jailbreak. They are adding additional monitoring system. They have a bug bounty program, have synthetic jailbreak data, security controls, making sure the weights cannot be stolen.

And so on. Quite a few things. They released a PDF with the announcement that is something like a dozen pages with additional details in the appendix.

Yeah. And so the specific thing that's causing them to say, we think we are flirting with the ASL 3 threshold is the bio-risk side, right? The ability, they think, potentially, of this model to significantly help individuals with basic technical backgrounds, like we're talking undergraduate STEM degrees, to create or obtain and deploy biological weapons, right? So that's really where they're at here specifically. This is not

I don't think associated with the autonomous research or autonomy risks that they're also tracking. But we got early glimpses of this, right? With Sonnet 3.7, I think the language they used, it was either Anthropic or OpenAI with their model. It was sort of similar. It was, we're on the cusp of that next risk threshold where really it's kind of similar, whether you look at the OpenAI preparedness framework or Anthropic's SL3 in terms of how they define some of these standards.

The security measures are really interesting, especially kind of given our work on the data center security side and the cluster security side. One of the pieces, and this echoes a recommendation in a RAND report on securing model weights that came out over a year ago now.

They have implemented preliminary egress bandwidth control. So this is basically restricting the flow of data out of secure computing environments where AI model weights are. So literally like at the hardware level, presumably, that's at least how I read this,

Making it impossible to get more than a certain amount of bandwidth to pull data of any kind out of your servers. That's meant to make it so that if somebody wants to steal the model, it takes them a long time, at least if they're going to use your networks, your infrastructure.

And there are ways to kind of calculate what the optimal bandwidth would be under certain conditions for that. But that was kind of interesting. That's a big piece of really R&D that they're doing there. Also a whole bunch of management protocols, endpoint software controls. And there's a bunch of stuff here. This is a big leap, right? Moving to ASL 3. So this is fundamentally increasing. It means that they're concerned about threat actors like terrorist groups and organized crime.

that they would start to derive a lift, a significant benefit potentially from accessing Anthropix IP. They are not, you know, ASL 3 does not cover nation state actors like China. So they're not pretending that they can defend against that level of attack. It's sort of like working their way there. As their models get more powerful, they want to be able to defend against a higher and higher tier of adversary. So there we go. Curious to see what the other labs respond with as their capabilities increase too.

Yeah, and we're seeing hints that maybe we'll cover more next week that, and we've already covered to some extent that these reasoning models, these sophisticated models are maybe harder to align and are capable of

Some crazy new stuff. So this also makes sense for that. Yeah. But we're going to call it with that for this episode. Thank you for listening. As always, we appreciate you sharing, commenting and listening more than anything. So please do keep tuning in.

Okay.

♪♪ ♪♪ ♪♪

From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.

#210 - Claude 4, Google I/O 2025, OpenAI+io, Gemini Diffusion 01:44:47 Share

Last Week in AI

Deep Dive

Shownotes Transcript

#210 - Claude 4, Google I/O 2025, OpenAI+io, Gemini Diffusion