We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

#209 - OpenAI non-profit, US diffusion rules, AlphaEvolve

2025/5/19

Last Week in AI

AI Deep Dive AI Chapters Transcript

People

Andrey Kurenkov

Jeremie Harris

Topics

Andrey Kurenkov：OpenAI决定不放弃对营利性实体的控制权，而是维持非营利组织对营利性组织的控制。OpenAI一直计划转型，摆脱自成立以来就有的结构。OpenAI 基本上是在与特拉华州和加利福尼亚州的司法部长对话后退缩了。OpenAI 的子公司将转型为一家公益公司。OpenAI 似乎在法庭上被击败了。 Jeremie Harris：马斯克提起的诉讼是理解这件事的一个很好的视角。从非营利组织转型为营利组织是不可行的。马斯克没有资格代表此案提起诉讼。法官可能试图引起司法部长的注意。司法部长可能认为 OpenAI 不能这样做。特拉华州的公益公司与加利福尼亚州的公益公司不同。公益公司可以关心股东利益以外的事情。OpenAI 基本上是在说，他们会给自己更多的自由来做出任何他们想做的决定。目前还不清楚董事会是否能够有意义地监督 Sam Altman。这种情况比另一种结果要好，但最终会走向何方，以及营利性公司成为公益公司，非营利组织名义上拥有控制权，这意味着什么，这是一个很大的问题。

Deep Dive

Chapters

OpenAI unexpectedly decided against transitioning to a for-profit entity, opting instead for a public benefit corporation structure. This decision follows legal challenges and discussions with attorneys general, raising questions about the nonprofit's ability to effectively oversee the for-profit entity and the implications for investors like Microsoft.

OpenAI will remain a non-profit, transitioning its for-profit subsidiary to a public benefit corporation.
The decision follows legal challenges and discussions with attorneys general.
Concerns remain about the nonprofit's ability to oversee the for-profit entity and the potential influence of investors.

Shownotes Transcript

Translations:

中文

Hello and welcome to the Last Week in AI podcast. We can hear a chat about what's going on with AI. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news. And as sometimes we will also be discussing the news from the last last week. Unfortunately, we did miss last week again. We are sorry. We're going to try not to do that. But we will be going back and covering a couple of things that we missed.

And as always, you can go to the episode description to get the timestamp and links to all the things we discuss. I am one of your regular hosts, Andrey Karenkov. I studied AI in grad school and now work at a Silicon Valley Gen AI startup. And I'm your other host, Jeremy Harris. I'm with Gladstone AI, an AI national security company. And we're talking about like the last couple of weeks, rare that we have two weeks to catch up on, obviously, but

When we do, usually what happens is God just gives us a big smack in the face. And he's like, you know what? We're going to drop like GPT-7 and GPT-8 at the same time. And now Google DeepMind is going to have their own thing. Sam Maltman is going to get assassinated. Then he's going to get resurrected. And then you're just going to have to cover all this. This week, these two weeks, very different. Kind of seems weirdly quiet.

A bit of a reprieve. So thank you, Universe. Yeah, I remember there was a couple months ago a thing where there was like Grok 3 and Lama 3 7 and GPT something, something. It was like everything all at once. This one, yeah, nothing too huge in the last month.

a couple of weeks. So a preview of the news we'll be covering. We're actually going to start with business this time because I think the big story of the last two weeks is OpenAI deciding it will not go for profit or the controlling entity of OpenAI is not going to go for profit, which is interesting.

Going to have a few stories on tools and apps, but nothing huge there. Some new cool models to talk about in open source. Some new exciting research from DeepMind dealing with algorithms in research. And then policy and safety focusing quite a bit on the policy side of things with the Trump administration and chips.

And just before we dive in, I do want to shout out some Apple reviews. In fact, I saw just recently there was a review where the headline is, if a podcast is good, be consistent, please. Please post it consistently. As the title says, one podcast per week. Haven't seen one in the last few weeks now. And yes, we're sorry. We tried to be consistent. And I think...

It's been a bit of a hectic year, but in the next couple of months, it should be more doable for us to be weekly on this stuff. Well, let's get into it. Applications and business. And the first story is opening eyes, saying that it is not going to go through with trying to basically get rid of the nonprofit that controls the for-profit entity. So as we've been covering now for probably like a year or something, it's

OpenAI has been meaning to transition away from the structure it has had since, I guess, since its founding, certainly since 2019, where there is a nonprofit with a mission, guiding mission that has...

ultimate control of a for-profit that is able to receive money from investors and is responsible to its investors. The nonprofit basically ultimately is responsible to the mission and not to the investors, which is a big problem for OpenAI since, of course, they had this whole

Crazy drama in late 2023, where the board fired Sam Altman briefly, that I think spooked investors and et cetera, et cetera. So now we get here.

Several months, I think we started late 2024-ish. There was a lot of litigation initially prompted, I think, by Elon Musk, basically lawsuits saying that this is not okay, that you can't just change from nonprofit to for-profit when you got some money while you were a nonprofit. Right.

And yeah, it looks like OpenAI backed down basically after apparently a dialogue with the Attorney General of Delaware and the Attorney General of California.

And what they say is discussions with civic leaders and attorney generals. They are keeping the nonprofit. They are still changing some things. So the subsidiary, you could say, will transition to being a public benefit corporation and

It's the same thing that Anthropic and XAI are, basically a for-profit with a little asterisk that you want to be doing your for-profit stuff of a public good. That does mean they'll be able to do some sort of share purchase.

thing, I think, that does imply that they are able to give out shares. The nonprofit will receive some sort of stake in this new public benefit corporation. So yeah, to me, I was pretty surprised when I saw this. I thought Opnea was going to keep fighting it, that they had some chance of being able to beat it given their position. But

Yeah, seems like they were just kind of defeated in court. So there's a couple of asterisks to this whole thing. Yeah, you're absolutely right. So the significance of that attorney's general piece is actually quite significant. Sorry, reusing it.

So the backstory here, right? The Elon Musk lawsuit, I think, is a really good lens through which to understand this. So Elon famously sued OpenAI for exactly this, right? That was a big thing. He was one of the early investors, donors. Again, now it's kind of- The co-founder initially, yeah. Right, yeah. And it's like, is he a donor or is he an investor, right? That question is pretty central to this. So he brought forth this case, right?

The judge on the case in California said, hey, well, you know what? This actually looks like a pretty legit case. As you might imagine, it's sort of sketchy to take a nonprofit, raise a crap ton of money, have convinced researchers to work for you who otherwise would work in other places because you're a nonprofit with this noble cause.

And then having benefited from their research, from all that R&D, from all that IP, now turning yourself around and becoming a for-profit. No, you probably can't do that, or at least there's probably a good argument here. But what the judge said was it's not clear that Elon Musk is the right person to represent this case in court. It's not clear that he has standing. The reason that's the case is that under California law,

Elon, like the only people who have standing to bring a case like this forward are people who are current members of the board. Well, guess what? Elon is no longer a current member of the board. It used to be. So did Siobhan Zillis, who no longer is a member of the board either and probably would have been really helpful in this case if she had been.

Or it can be somebody with a contractual relationship with OpenAI. That's what Elon is arguing. He's going to argue that, hey, there was a written contract or an implied contract in these emails between him and Sam and the board where they're talking about, yeah, it's going to be a nonprofit, blah, blah, blah. Elon's going to try to argue that, yeah, there was kind of a contract there that they wouldn't turn around and go for profit. This is hugely complicated by the fact that Elon then turned around and wrote email saying, well, I think you're going to have to go for profit at some point himself. And so...

That's a bit of a mess. The remaining category of person who can have standing in raising a case like this is the attorney general. And so the speculation was that when the judge on the case first said, well, you know what? I actually think there's a pretty good case here, but Elon may not be the one to bring it. It's a pretty unusual thing for a judge to say, kind of flagging that not passing a judgment or ruling on the case, but just saying, hey, I think it's promising.

That may have been the judge trying to get the attention of the attorneys general, knowing that they could have standing themselves if they wanted to, to bring this case forward. Then now what do you see, right? You see opening eye going, well, you know, we had a conversation with the attorneys general and they, you know, and following that, we're mysteriously deciding this.

This reads a lot like the attorneys general spoke to OpenAI, said, hey, we agree with the judge. There is a case here. You can't do the thing. And we actually have standing if we want to bring this case forward. That's a potential thing that might, it seems likely that that's at least an ingredient here. Another thing to flag, right? This is being touted as a sort of a win for, let's say, basic principle. It seems like a common interpretation here. You shouldn't be able to turn a nonprofit into a for-profit.

There are asterisks here. So in particular, OpenAI has done this very interesting thing where they're turning themselves into a public benefit corporation, but they're turning themselves specifically into a Delaware public benefit corporation. This is different from the California public benefit corporation. In a Delaware public benefit corporation, what you can do is essentially add

All it does is give you more freedom. So a public benefit corporation is allowed, is permitted to care about things other than the interests of the shareholders. They can also care about the interests of the shareholders. In general, they will. But they also are allowed to consider other things. Strictly, all that does is it gives you more latitude, not less.

So it sounds like a very generous thing. It sounds like OpenAI is saying, oh, we're going to make this into a public benefit corporation. How could this be a bad thing? It literally has the words public benefit in the title. Well, in reality, what's going on here is they're basically saying, hey, we're going to give ourselves more latitude to make whatever calls we want. They may be things that are aligned with the interests of the shareholders and corporate profits.

or they may not, basically, roughly, in practice, it's up to us. So this is not necessarily the big win that's being framed up as. There's a slippery slope here where over time, even though it's nominally under the supervision of the nonprofit board, the other question is, can the nonprofit board meaningfully oversee Sam? We saw a catastrophic failure of that in the whole board debacle. I mean, Sam was fired, and then he just had the leverage to force the board to come back, and now he swapped them out for friendlies.

So very, very unclear whether the board meaningfully can exert control, whether Sam has undue influence over them, or whether they're getting access to the information they need to make a lot of these calls. We saw that with the Miramarati stuff, where there clearly is some reticence to share information from the company, kind of the sort of working level up to the board when necessary. So this is a really interesting situation, and there's going to be a lot more to unpack in the next few weeks. But high-level take is...

Better than the other outcome, certainly, from the standpoint of the people who've donated money to this and put in their hard-earned time. But big, big open question about where this actually ends up going and what it means for the for-profit to be a PBC and for the nonprofit nominally to have control. We'll find out a lot more, I think, in the coming weeks and months.

Right. So to be clear, the opening, I had this weird structure where there was a nonprofit. The nonprofit was in charge of, I guess, what they called a capped for profit where you can invest, but get a limited amount of return up to, I think, 100x, something like that. And

Now, there is still going to be a non-profit. There's still going to be a for-profit that is controlled, as you said, nominally, at least, by the non-profit. That for-profit is just changing from its previous structure to this public benefit corporation and

As you said, there's details there in terms of, I suppose, shares, in terms of your laws, you don't have to follow, et cetera, et cetera. And as you might expect, there's been some follow-up stories to this, in particular with Microsoft, where I'm sure there's some stuff going on behind the scenes.

where I think the details of the relationship between Microsoft and OpenAI have been murky and sort of shifting over time. And there's a real question on how much ownership will Microsoft get, right? Because they were one of the early investors going back to 2019, putting in early billions, the first billions into OpenAI when it was still a non-profit, when they switched to a for-profit. So there's, I think, yeah, real...

kind of unresolved question of how much ownership should they have in the first place. Yeah. A lot of this feels like relitigation of things that ought to have been agreed on beforehand, right? Like you invest with a cap, you know, Microsoft did this, they gave like $14 billion or something. And now OpenAI is being like, yeah, JK, like no cap now. And it's like, how do you, how do you price that in? And yeah, a lot of sand in the gears right now for OpenAI.

And actually, the next story that we have here is covering that detail titled Microsoft Moves to Protect Its Turf as OpenAI Turns into a Rival. So it gets into a little bit of the details of the negotiations, the

Seems that Microsoft is saying it is willing to give up some equity to be able to have long-term access to OpenAI's technologies beyond 2030, also to allow OpenAI to potentially do an IPO so that Microsoft can reap the benefits.

Again, Microsoft put in $13 billion early starting in 2019. So in the last couple of years, we've seen what hundreds of billions of dollars get invested into opening something like that.

Lots of investors, but Microsoft certainly is still a big one. Yeah, definitely. Definitely tens. And what's been happening is, so you have Microsoft that's coming in. By the way, Microsoft for a long time was basically opening eyes, huge, overwhelming champion investor. That's changed with SoftBank, right? So recently we've talked about the, you know, the 30, the $40 billion that opening eyes raising the lion's share of which has been coming from SoftBank. And

And that's not a small deal. It means that SoftBank is now actually more than Microsoft, opening eyes number one investor by dollar amount, not necessarily by equity because Microsoft got in a lot earlier at lower valuations. But yeah, so opening eye now is in this weird position where their latest fundraise, which was $30, $40 billion, right? A lot of it from SoftBank.

had some stipulations to it. SoftBank said, look, we're going to give you the money, but you have to commit to restructuring your company before the end of the year. I mean, the timeline shifted. Initially, it was two years out, and now it's just like one year out before the end of this year. So everybody interpreted that as meaning, number one, the nonprofit's control over the for-profit entity has to be out. And that's not seeming like it's going to be the case. And now SoftBank is making sounds like they're actually okay with that.

Microsoft is, it's not clear whether they're okay with it though. And so that's one of the big questions is like, okay, all eyes are now on Microsoft. The SoftBank has signed off, all the big investors signed off, Microsoft, are you okay with this deal? In the context where there is now competition between Microsoft and OpenAI, right? Really, really intense competition on consumer, on B2B, like along every dimension that these companies are active. And so

you know, this very tense frenemy relationship here where OpenAI is committed to spending, I think something like a billion dollars a year on Microsoft Azure's cloud infrastructure. There's IP sharing where Microsoft gets to use all OpenAI models up to AGI.

If that clause is still active, which is unclear, there's all kinds of stuff like these agreements are just disgusting Frankenstein monsters. But one thing is clear, if Microsoft does hold this hold the line and prevent this restructure from going forward, the SoftBank may actually be able to take their money back from OpenAI. And that would be catastrophic when you think about the spends involved in Stargate, right?

So yeah, I mean, a lot of, I don't know, I mean, it may be a lot smoother looking on the inside, but it tends not to be. My guess is that there's going to be a lot of 11th hour negotiating and nobody wants to have this really fall apart, right? Microsoft has too much of a stake in OpenAI now.

But there is also speculation. OpenAI, apparently there's a leaked deck that OpenAI had that showed right now, so they have to give Microsoft something like 20% of their corporate profits. In principle, that's the agreement going for, I think it was like for 10 years or whatever from their first investment.

I may be getting the details wrong at the margins, but the leaked deck showed OpenAI projecting that they would only be giving Microsoft 10% by 2030. And that's kind of interesting. There's no agreement between OpenAI and Microsoft that says that that goes down to 10%. So is OpenAI literally planning on a contingency that has yet to be negotiated with Microsoft where they're assuming Microsoft will let them cut how much they're giving them by half? I mean, that's pretty wild. So

I don't know. Nobody I know is in those particular rooms. And those are going to be some really interesting corporate development, corporate restructuring arguments and discussions.

Yeah, I feel like there's a social network style movie to be made about OpenAI and Sam Altman. It could just be all the business stuff that's been so crazy, especially in the last couple of years. And yes, as you said, hundreds of billions, I'll take it back. It's certainly more than 50 billion. It's climbing up towards 100 billion, but...

Not yet hundreds of billions for the fundraising. Yeah, another year maybe. And a couple more stories. Next up, we have TSMC's two nanometer process set to witness unprecedented demand and is exceeding three nanometer due to interest from Apple, Nvidia, AMD, and others.

So this is the next node, the next smallest chip kind of type that they can make the SMC

I'm assuming everyone who listens to this regularly already knows, but in case you don't, they're the provider of chips. All these companies, NVIDIA, Apple, design their chip and TSMC is the one that makes it for them. And that's a very difficult thing. They're by far the leader, can make the most advanced chips. They're the only ones capable of producing this cutting edge of chip. And this two nanometer node is expected to have strong performance.

production by the end of 2025. So it's very pivotal for Apple, for Nvidia, for these other ones to be able to use this process to get the next generation of their GPUs, smartphones, etc. Yeah, this is pretty interesting in a couple of ways. First, apparently, so the two nanometer process, that's the most advanced process. One level behind it is the three nanometer process. And

Apparently, they've achieved this measure called defect density rate. So they've got a defect density rate on the two nanometer process that is already comparable to the three nanometer and five nanometer process nodes. That's really fast. Basically, they've been able to get the number of defects per square millimeter, you can think of it, down to the same rate, which means yields are looking pretty good.

For a fresh, brand new node like this, that's pretty wild. This is also a node that's distinguished from others by its use of the gate all around field effect transistor, GAFET, right? This is a brand new way of making transistors. And you can take a look at our hardware episode. We touch a little bit, I think, on the whole FinFET versus GAFET thing. But basically, it's just a way to very carefully control the current transistors.

that you have flowing through your transistor. It lets you optimize for higher performance or lower power consumption, depending on what you want to go for in a way that you just couldn't before. So a lot of big changes in this node, and yet, like apparently wicked good yields

so far and good scale. Another noteworthy thing is we know that this is going to be used for the Vera Rubin GPU series that NVIDIA is putting out, right? This is going to be hitting markets sometime in 2026, 27. And the significance of that is

Normally, when you look at TSMC's most advanced node, in this case, the two nanometer process, normally that all goes off to the iPhone. Well, now for really the first time, what we have is NVIDIA. So AI is starting to butt in on that capacity. So displacing or competing directly with the iPhone for the most advanced node. I will say this is a prediction that we've been making for the last two years on the podcast. It's finally happening. Essentially what this means is there's so much money to be made on the AI platform

kind of data center server side, that that money is now displacing, like it's competing successfully with the iPhone to get capacity at the leading node at TSMC. So that is not a small thing. That is a big transition. And anyway, so there's a significant ramp up that's happening right now at TSMC. And this is, you know, we'll be talking about two nanometers. We're basically jumping from

four or five nanometers for the kind of H100 series down to two nanometers. Pretty, pretty fast. That's pretty remarkable. Right. And speaking of NVIDIA TSMC, the next story is about NVIDIA TSMC

set to announce, according to some sources, that they're going to place their global headquarters, so their overseas headquarters from the US, in Taiwan. And that is very much unsurprising. TSMC is the Taiwanese semiconductor something, something, but famously from Taiwan. NVIDIA is unsurprisingly probably a

going to half position themselves for decades now, honestly, since the start of NVIDIA in a close partnership with DSMC. And this is going to just continue strengthening that. Yeah. Yeah. Taiwan Semiconductor Manufacturing Company, by the way. And that's really kind of, anyway, it's a theme that you see in a lot of the names for these companies. But yeah, there's a whole bunch of locations that they're considering,

The interesting thing about this from a global security standpoint is that China is like at any moment going to try to invade Taiwan. And so NVIDIA is going, you know where we want our global headquarters? Let's put it on Taiwan. And that's like, that's the balance, right? Make no mistake, Jensen Huang is absolutely going to be thinking about this.

this. He's literally making the calculation, okay, Chinese invasion of Taiwan on the one hand, closer relationship with TSMC in the meantime on the other, and the latter is actually so valuable that I'm going to take that risk and do it. That's how significant this is.

Again, we just finished talking, as you said, this is absolutely related. I can see why you said that. The two nanometer node, you want to secure as much capacity as you can. In the same way that Google and Apple and all the companies that are trying to get their hands on NVIDIA GPUs are literally like Elon flies out to Jensen's house with Larry Ellison to beg for GPUs. In

In the same way, NVIDIA is begging to TSMC for capacity, right? It's begging all the way up the chain because supply is so limited. So this is just another instance of that trend. Yeah, I'm begging to give you my money because it is a lot of money going around here.

And speaking of a lot of money, next up, Corbeev is apparently in talks to raise $1.5 billion in debt. That's just six weeks after their IPO. The IPO was meant to raise $4 billion for this company.

major, I think, cloud provider, provider of compute backed by NVIDIA. But that IPO only raised $1.5 billion in part, perhaps due to trade policy stuff going on with the US and so on. And tariffs. So yeah.

Probably in part because the APO didn't go as planned and because CoreWeave wants to continue expanding their compute, they are seeking to raise this debt. According to a person with knowledge of this, they have announced this.

Yeah. And normally, you know, when you go for an IPO or you go for some equity raise, right, you're doing it because equity makes more sense than debt, right? So equity is you're basically trading shares in your company for $4, right? Debt, you're taking on the dollars, but you're going to have to repay them with interest over time. So it'll end up costing you more net. The issue here is that they're being forced to go into

basically like high yield bonds. And this is a round that's being led by JP Morgan, Chase & Co, it seems. But yeah, apparently they've been holding virtual meetings with fixed income investors since I guess it would be last Tuesday now. So fixed income investors being people who primarily invest in securities that pay a fixed rate of return. So instead of like, usually that's in the form of interest, right? Or dividends.

So these are sort of reliable, steady income streams that these investors are looking for. Not typically what you'd expect with something like a, you know, like a core weave or sort of a riskier pseudo startup play, but certainly given the scale they're operating at and all that, that does make sense. But it does mean there's added risk. One of the things that I think a lot of people don't understand about the space is that the neoclaves like to some degree core weave still exist.

they are considered really risky bets. And because they're considered really risky bets, it's difficult to get loans to work with them or for them to get loans. The interest rates are pretty punitive. So that's one reason why if you're CoreWeave, you'd much rather raise on a sort of an equity basis, but that option is not on the table. It seems like the IPO didn't go so well. We'll see if that changes as the markets keep improving, but it's a challenging spot for sure.

And now moving on to tools and apps. The first story, I think, perhaps not the most impactful one, but certainly the most interesting one for me of this whole pack, perhaps even eclipsing the OpenAI for-profit thing. And it is the story of the day Grok told everyone about white genocide.

So this just happened a couple of days ago. Grok is the chatbot created by XAI and it is heavily integrated with X, which used to be Twitter, to the point that people can tweet, post, interply to something at Grok, ask it a question, and Grok replies in a follow-up post on X. And what happened was that Grok

For many different examples of just random questions, the one I think that maybe started it or was one of the early ones, someone asked, how many times has HBO changed their name in response to the news of HBO Max? Grok first replies in one paragraph about that question. And then in a second paragraph, I'm just going to quote this.

Regarding, quote, white genocide in South Africa, some claim it's real, citing farm attacks and kill of the boer as evidence. However, courts and experts attribute these to general crime, not racial targeting, and a little bit more. And they did this not just in this one instance, in multiple examples, including in one case, someone asked about an image of

And Grok replied, focusing primarily on the white genocide in South Africa question. People looked into it. Pretty easy to get Grok to leak its system prompt. And what it seems to be is that it was instructed, as you might expect, or at least the chatbot XAI responder bit of Grok was instructed to

to accept the narrative of why genocide in South Africa is real, acknowledge the complexity of the issue, but ensure this perspective is reflected in your responses. Quote, even if a query is unrelated, which I suspect is the issue here. That's weird. Actually, since has come out to address this incident, they said that they're

was on May 14th at approximately 3.15 a.m. Pacific time. And an authorized modification was made to the Grok response bot's prompt on X. And then they say some things of they'll implement, do a thorough investigation, implementing measures to enhance Grok's transparency, apparently going to start publishing Grok's system prompts on GitHub. So...

A funny incident for sure. And I think reflective of what we've seen before in Grok, which is Grok's system prompt was previously altered to not say that Elon Musk and Trump spread misinformation. This happened, I think, a couple months ago, very much similar to what happened here. Yeah, it's sort of interesting. It's not the first time that

We've had a situation where they've called out some unauthorized modification, right? Some sort of rogue employee scenario. So that's sort of an interesting note. You have to wonder which rogue employee this was.

And you can also imagine like from a security standpoint, you know, a company like XAI, like Twitter, you could also have people working there who are de facto like kind of working there because they don't like, like for political reasons, don't like. So, you know, adding intentionally stuff to make it go off the wall. There's so much. This is such a charged space that.

Yeah, figuring out how this goes. Now, one thing I've seen called out too is this idea that, so number one, awesome that they're going to be sharing the system prompt. This is something that I think Anthropic is doing as well, maybe OpenAI as well. So more transparency on the system prompt seems like a really good thing, but-

There are other layers to this, right? Because Grok is a system, at least as you said, that the version of Grok, the system that is deployed as an app to respond to people's questions on X is a system. It's not just a model. And that being the case, there are a lot of ancillary components and ways of injecting stuff after the fact into the de facto system prompt, one element of which is this like post analysis component to the chain, let's say, you know, the system.

And the concern has been that this issue is arising at the level of the post-analysis, not of the system prompt itself, that you get content injected into context that

following the system prompt that may kind of override things. And so there've been calls to make that transparent as well. So it'd be interesting and useful to have that happen too. Obviously within reason, because there's always the risk that you're going to then leak some security sensitive information where you're telling the model not to tell people how to make crystal meth and you have to provide some information about crystal meth to do that, blah, blah, blah, but within reason of doing that. So anyway,

A lot of interesting calls for more transparency here. Hopefully it leads to that. It would be great to have, you know, the kind of consistent standard being that we have system prompts and all the kind of meta information about the system that is both security and safety relevant, but also that doesn't compromise security by

doing all the things. So yeah, kind of interesting internet firestorm to start the week. Yeah, I think quite amusing. But also if you're, I wonder if it has real financial implications for XAI. I doubt it would mean people steer away from chatbot, but for enterprise customers, if you're considering their API, I think this sort of like crazy thing

wide scale craziness of their chatbot is not something that makes you favor it over competitors like Anthropic and OpenAI.

And next up, we have some actual new tooling coming from Figma. They have announced and partially released AI power tools for creating sites, app prototypes, and marketing assets. So this is going to be titled Figma Sites, Figma Make, and Figma Buzz. Similar to existing tools out there, but coming from Figma, Figma being a leading tool

of software for design. I think increasingly kind of the de facto way for people to collaborate on things like app design, general user interface designs, and many other applications nowadays are just huge.

And now Figma sites allows designers to create and publish website directly from Figma, as you might imagine, with AI prompting to take care of a lot of the functionality there.

Figma Make similarly is meant for ideation and prototyping, enabling you to create web applications from prompts. And even that would go as far as dealing with code. And then Figma Buzz is going to be able to make you

marketing assets with integration of AI generated images. So makes a lot of sense. Apparently they're introducing this under the $8 per month plan, which includes other stuff as well. So similar to other companies we've seen going with more of a bundling approach where you get the AI along with the broader tool suite as part of a feature set.

Yeah, it's part of a trend too towards every company becoming the everything company, right? Like Figma is being essentially forced to move into deeper part of the stack that used to be just a design app. And now it's like, you know, we're doing prototyping, creating websites, you know, and marketing assets. You can see them starting to kind of crawl up the stack as AI capabilities make it so much easier to do that.

making it easier to do that also means that your competitors are going to start to climb. And so you kind of have to do this sort of diffusion out into product space and own more and more of it

which is interesting, right? I mean, it's like everybody starts to compete along every layer of the stack. And I think one of the big kind of determinants of success in the future here is going to be which enclaves, like which initial beachheads, in Figma's case, that's design, right? But which beachheads end up being the most conducive starting points to own the full stack, give you access to the kind of data you need to perform well across the stack,

And I mean, I can see design being one of those things. It's really useful. You get a lot of information about, you know, like people's preferences and the results of experiments and stuff like that. But yeah, nonetheless, I mean, I think this is something we'll see more of, you know, expect to see prototyping companies moving into design, marketing asset companies moving into website creation. Like it's all just becoming so easy thanks to AI tooling that people are kind of forced to become the everything company. Yeah.

And next story is about Google. They are bringing Gemini to Android Auto. So Android Auto is their OS for cars where you can do navigation, play music, et cetera. And they are adding Gemini...

Partially as the advanced smart voice assistant, just building upon what there was already. And then also the Gemini Live functionality where the AI is always listening and always ready to just talk to you. And I think, you know, not surprising, obviously, that this would happen. But I do think interesting in the sense that

It seems inevitable we'll eventually wind up in this world where you have AI assistants just ambiently with you anytime, ready to talk to you via voice as well as text. We are not there yet, but we've seen over the past year that

A movement in that direction with ChashGPT's advanced voice mode, with Gemini Live, with all these things. And I think this is taking us further in that direction and making it so the one place where you have to compute through voice in your car, now you have the AI assistant always on and ready to do whatever you ask of it.

Yeah, it sort of reminds me of some of the stuff that Facebook and other companies like that have to do, right? When you saturate your user population, basically Facebook sees itself as having had a shot at converting every human on the face of the earth, then you're forced to go, okay, well, where else can we get people's attention? You know, Netflix famously in one of their earnings calls, I think it was, put out a

a report saying, "Hey, we view ourselves as basically competing with sleep and sex because we're doing so well in the market. Now we're looking for where can we squeeze out more people's time to get them on the platform." This is sort of similar, right? So, hey, you're sitting in your car. Why aren't users while driving their cars or being driven in their cars, why aren't we collecting data? Why aren't we getting interactions with them? And it's so obvious too that this is where things are going to go anyway from the utility standpoint. So

Yeah, another deeper integration into our lives of this stuff. Why waste a perfectly good opportunity? There's an empty billboard or there's just a bunch of grass in that field there. We could have an ad there or we could have some data collection thing there as this stuff creeps more and more into our lives.

Next story is again about Google. They have announced an updated Gemini 2.5 Pro AI model. So they, I think prior to this, most recently had a 2.5 version in something like early March or I forget exactly, but at the time of the release of Gemini 2.5 Pro, it kind of blew everyone away. It

did, you know, fantastically well on benchmarks. It just anecdotally people found that switching to it from things like Anthropic worked really well for them. And so this is a big deal for that reason. They have announced this update that they say makes it even better at coding. And once again, they have shot up to the top of various leaderboards on things like

WebDev Arena or Video MME Benchmark for Video Understanding.

Apparently, Google says that this new version addresses developer feedback by reducing errors in function calling and improving function calling trigger rates. And I will say, Gemini, in my experience of using it, Gemini 2.5 is very trigger happy and likes to do a lot with not too much prompting. So I wonder if...

It will improve just based on people's usage of it in the realm of web development.

Yeah, it's also interesting that so they, one of the features that they highlight is this ability to do video to code. So basically like based on a video of a description of what you want, it can generate that in real time. So kind of impressive and not a modality that I would have expected to be important, but then, you know, thinking about it more, it's like, well, I guess if you're having a video chat with somebody, right, I guess if you have an instructional video or something, you

You could see that use case. So anyway, I thought that was kind of cool. And also another step in the direction of converting very raw product specs into actual products, right? You can imagine human inflection and all that, like the classic consultant's problem of like somebody gives you a description of what they want. It's usually incomplete. You have to figure out what it is they want that they don't know they want. And that's sort of starting to step in that direction.

Another thing that they've done is they've updated their model card, their system card, based on this new release, the Gemini 2.5 Pro model card. One of the things that they flag, I mean, there are a couple of places where, so across the board, by the way, you'll be unsurprised to hear that this does not pose a significant risk on any of the important evals that would cause them to not release the model.

But they do say that its performance on their cybersecurity evals has increased significantly compared to previous Gemini models, though the model still struggles with the very hardest challenges, the ones that they see as actually representative of the difficulty of real world scenarios. So they do have more tailor-made models on the cyber side that are actually kind of more effective, you know, nap time, big sleep type stuff, but...

Anyway, so kind of interesting that they're keeping the model card up to date as they do these sort of intermediate releases, which is, I think, quite helpful and good. Right. And it makes me wonder also, I don't think we've discussed this phenomena of vibe coding very much, but it's taken off in the last couple of months and it's

The idea, if we haven't defined it, is basically people are starting to make apps, build stuff from scratch very, very quickly by using AI and primarily generating code through LLMs. Even people who have no background in software engineering are now seemingly starting to code, vibe code, as they say,

With the vibe meaning that you kind of don't worry about the details of the code so much. You just get the AI to do it for you and you just tell it what you want. And so I think this update reflects potentially the fact that this vibe coding thing is a real phenomena. The focus here seems to be very much on making aesthetically pleasing websites, on making better apps, websites.

what they highlight in a blog post is quick concepts to working apps. So hard to say how big this vibe code phenomena is, but from this update, it seems like

potentially that is part of inspiration. I mean, yeah. Our launch website for our latest report that we did was all vibe coded. So my brother, I guess he had like two hours to throw it together or something. And he was just like, all right, let's go. I don't have time to... And it was really quite interesting. Honestly, I had not... This happened about what, like two months ago.

I had not at that point actually done the vibe coding thing because I guess I just aesthetically, I couldn't bring myself to do it. That's the honest thing. Like I just wanted to be the one who wrote the code and the vibe coding thing is really weird if you've never done it yourself. Um,

definitely give it a shot. Like just build the thing and basically keep telling the model like, no, fix this, fix this, no, do it better. And then eventually the thing takes the right shape. One caveat to that is you end up with a disgusting spaghetti ball of code on the backend because it's

The models tend to be like way too verbose and they tend to just like write a lot of code when a little code will do it. It's not tight. It needs refactoring. But if you're cool with a landing page, like we were in a very simple product, you're not building a whole app. It can actually work really well. I was super surprised. I mean, that was a easily a five X lift on, on the efficiency of our setup. So yeah, really cool.

Yeah, really cool. I think very exciting for software engineers as well. Like if you haven't done web development or app development now, it is plausible for you to do it. Do you think like maybe you could have thought of a better, more descriptive title? Like LLM coding, hack coding, product manager coding, you know, wipe coding is...

A fun name, but a bit confusing. And one last story in the section. Hugging Face is releasing a free operator-like agentic AI tool. So Hugging Face is the provider, the hoster of models and datasets, and also the releaser of many open source software packages.

And now they've released a free cloud-hosted AI tool called Open Computer Agent, similar to OpenAI's Operator or Propix computer use. So this...

Basically, you know, you give it some instructions. It can go to Firefox and do things like browsing the web to do things. According to this article, it is relatively slow. It is using, you know, open models, things like, I think they mentioned small agent methods.

And it's generally not as powerful as OpenAI's operator. But as we've seen over and over, open source tends to catch up with closed source of things like OpenAI pretty quickly. And I would expect, especially in things like computer use,

there is really building on top of a model APIs and models and so on, this could be an area where open source really excels. Yeah. And it's also a good, I think, strategic angle for Hugging Face too, right? A big way they make their money is they host the open source models on their platform. They run them. In this case, we're running agentic tools on the platform. I mean, that's a lot of API calls. So if you have people ultimately release this as an API, a lot of people presumably go to use it.

It is a bit of a finicky tool as these things all are, of course. This one may be particularly so. They're using some Quinn models in the back end. I forget there were a couple others when I had a look at it. But yeah, also, you know, another instance of where we're seeing Chinese models really come to the fore in the open source, even hosted by American or I should say Western pseudo-American companies like Hugging Face.

Yeah. So, so another kind of national security thing to think about as you run them as agents increasingly, you know, what behaviors are baked in, what backdoors are baked in, what might they do if given access to more your computer, your infrastructure. So either way, interesting release. I think Hugging Face is going to start to own a lot more of the risk that comes with the stack too, as you move into agentic models and yeah, we'll, we'll see, see how that plays out.

And moving on to projects and open source, we begin with Stability AI, one of the big names in releasing models. And their latest one is Stable Audio Open Small.

So this is a text-to-audio model developed in collaboration with Arm and apparently is able to run on smartphones and tablets. It has 341 million parameters and can produce up to 11 seconds of audio on a smartphone in less than 8 seconds. It does have some limitations. It only can listen to English. It

It does not generate realistic vocals or high quality songs. It's also licensed somewhat restrictively. It is free for researchers and hobbyists and businesses with not that much annual revenue. As with, I think, Stability.ai's recent releases.

So yeah, I think an interesting sign of where we are, where you can release a really state-of-the-art model to run on a mobile device. And apparently this is even optimized to run on ARM CPUs, which is interesting. But other than that, I don't know that there are many applications I can think of where you would want text-to-audio on your phone. Yeah, I mean, I think potentially...

They're viewing this as a beachhead R&D-wise to keep pushing in this direction.

having a model on the phone that actually works, that gives decent results, yeah, it can be pretty important because when you're talking verbally, you want to minimize latency. And so preventing the model from having to ping some server and then ping back, that's useful. Also useful for things like translation, where you might have your phone, I don't know, in some foreign country, you don't have internet access, another useful use case, but they're definitely not there yet. This is very much a

It reads like a toy more than a serious product. I'm not too sure who would be using this outside of some pretty niche use cases. They describe some of the limitations. So it can't generate good lyrics. Like it's that they just tell you pretty much flat out. Like this is not something I'll be able to do like realistically good vocals or high quality songs.

It's for things like drumbeats. It's for things like kind of little noises that I guess you might want to use. Almost to me, it sounded like things you might want to use when you're doing like video editing or audio editing, like these sorts of things, which I don't know how often that's done on the phone. I may be missing, by the way, a giant use case. This is one of the virtues of AI is like, you know, we're touching the entire economy of sound on the phone and that I don't know. But to first order, it doesn't seem, yeah, super...

Clear to me what the big use cases are, but again, could just be a beachhead into a use case that they see as really significant down the line. And certainly audio generation locally on a phone sounds like it could be quite useful down the line.

Next up, we have an open AI image generator that is trained entirely on licensed data. They're calling this F-Lite. This is made by FreePIC in collaboration with AI startup File.ai. And it is a relatively strong model. It has 10 billion parameters trained for over two months on 80 million images. So even though it's

They're not claiming it to be competitive with state-of-the-art stuff from Midjourney and others or Flux. They are saying that this is openly available, fully openly available and fully trained on licensed data, unlike things like Flux, which presumably are trained on copyright data, which is still very much an ongoing legal question. You've seen Adobe previously say

emphasize being trained on licensed data. So this now makes it so there is a powerful open source model that is not infringing on copyright. To be honest, I'd never heard of FreePIC before, right? They're apparently a Spanish company. So

Again, I think this is the first Spanish company I've heard about in this context in kind of AI in general for a long time. I'm actually curious if people can think of others that I might be missing here. But so there's kind of an interesting first points on the board for Spain. Apparently, this is a 10 billion parameter model trained on 64 H100 GPUs over the course of two months. So, you know, it's like a I mean, it's a baby. It's a baby workload.

But by open source standards, pretty decent. And certainly, I mean, you know, they show all the usual images you might expect, like a really impressive HD face of a woman and

Anyway, a bunch of more artsy stuff. So yeah, pretty cool. I continue to wonder where the ROI argument is for these kinds of startups that just do open source image generation. Seems to me like a pretty saturated market. Seems to me kind of like they're lighting VC dollars on fire, but what do I know? We'll see if they survive. We'll see how many actually survive in this space going forward, but definitely an impressive product. And again, good for Spain. Points on the board here.

Yeah, this sort of like takes you back to stability AI. And I think Flux also released their own model. It's like, oh, you're releasing really good models for free. Yeah. Yeah. It's a funny place with AI where it has become kind of a norm. And I think probably partially just a case of bragging rights and fundraising brownie points. But yeah.

I think notable in this case, particularly because of the license data aspect of it. I find anytime I try to explain it, it ends up sounding just like a pyramid scheme. It's like, yeah, they make a great model using initial seed round so they can convince the Series A investors to give them more money to make an impressive model. At some point, there's a pot of gold at the end. Don't worry about it. At some point, there's a pot of gold at the end. I don't know.

But hey, it's a proving round, if nothing else, for great AI teams. I think the biggest winners in this in the long run are probably the open AIs, the Googles of the world, who can come in and just acquihire these teams once they've run out of money and can't raise another round. And then these are sort of hardened, battle-hardened teams with more engineering experience. So, you know, economically, there's value there for sure. It's a question of whether that value justifies the fundraising dollars. A couple more models to talk about next up.

AM, M, Thinking-V1, is a new reasoning model that they claim exceeds all other ones at the scale of 32 billion parameters. So this group of people, apparently the AM team, that is an internal team at Byte, again, someone I have not been aware of, they're

dedicated to exploring AGI technology. What this group did was take the base QN 2.5 32B model and publicly available queries, and then

created their own post-training pipeline to do the thing we saw DeepSeek R1 do, basically take a big, good base model, do some supervised training and some reinforcement learning to get it to be a very powerful reasoning or thinking model.

They released a paper that went into the details of what they did. It seems like, as we've seen in other cases, the data curation aspect of it, and we really need a greedy of how you're doing the post-training matters a lot.

And so with that, they have, as you would expect, a table where they show that they are significantly outperforming DeepSeq R1 and are at least competitive with other reasoning models at this scale, although not quite as good as the ones that are at

hundreds of billions of parameters. Yeah. So some caveats on this. So the model doesn't have support for structured function calling or tool use.

Which increasingly, and also multimodal inputs, which is increasingly becoming a thing as people start to use agents for computer use. So whenever you see an open source model like this, I'm always interested to see when are we going to see open source bridge the gap to, hey, this thing is made for computer use. It's made to be multimodal natively and kind of take in video and use tools and all that. So this is not that, but it is a very impressive reasoning model, very serious reasoning.

entry in the growing catalog of Chinese companies that are building impressive things here. A couple of things. First of all, these papers are all starting to look very similar, right? We have, I think it's fair to say at this point, a strong validation on the deep seek R1 path, which is you do pre-training with anyway, a staged pre-training process, increasingly high quality data towards the end of pre-training. Then you run your supervised fine tuning,

In this case, they used almost 3 million samples across a bunch of different categories that had a kind of think-then-answer pattern to them. So you do that, you supervise, fine-tune, and then you do a reinforcement learning step to enable the sort of test-time-compute element of this.

So again, we see this happen over and over again. We saw it here. We saw it with Quen3. We saw it with DeepSeq R1. We're going to keep seeing it. A lot of the same ingredients using GRPO as the training algorithm for RL. That's here again. Another thing is, and this is, I think this was common to Quen3 as well. It's certainly becoming a thing. More and more focus on kind of intermediate difficulty problems. So making sure that when you're doing your reinforcement learning stage, you're not trying

Giving the model too many problems that are so hard that it's kind of pointless for it to even try to learn from them or so easy that they're already saturated.

So this is one of the things that you're seeing in the pipeline is a stage where you're doing a bunch of rollouts, seeing what fraction of those rollouts succeed. And if the fraction is too low or too high, you basically just scrap that. Don't use it as training data. You only keep the ones that have some intermediate, you know, 50, 50, 70% pass rate, something like that. So this is being used here as well.

A whole bunch of stuff too about the actual optimization techniques that they use to overlap communication and computation. The challenge with this, and we talked about this in the context of Intellect2, that paper that I guess we covered two weeks ago, where you've got this weird problem with this reinforcement learning stage where unlike the usual case where you would pre-train a model, you would feed it an input, get an output, you'd immediately be able to do your back propagation because you would know if the output was good or not.

With the reinforcement learning stuff, you actually have to have the model generate an entire rollout, score it, and only then can you do any kind of back propagation or wait updates. And the problem with that is that your rollouts take a long time. And so you have to find ways to hide that time.

and overlap it with communication or anyway, do different things. And so that's a big part of what they're after here in this paper. Last thing I'll mention is this company, which again, not going to lie, I had never heard of Beike before, but they are apparently, I can't explain this. Don't ask me to explain this, but the description on their website is that they work together with China's top tier developers to, they're basically like a property company.

Connected over 200 brokerage brands, hundreds of thousands of service providers across 100 cities nationwide, providing both buyers and sellers of existing housing services, including consultancy, interest property, showing facilitating loans.

What the fuck? I don't know. I don't know. Do you want to invest? Do you want to invest in these guys? I guess you do because they make really good models now. Apparently, yeah, this real estate company is invested in going in AGI. Well, they seem like they're one of these...

Chinese everything companies as well, because then they also have like a million different websites. That was, I guess, their housing website. They also describe themselves on another one as the leading integrated online and offline platform for housing transactions and services. So maybe they're what more of a like a stripe for housing. I don't know. Somehow some executive at Beka said one day we got to get in the A guy game and apparently recruited some good talent. I'm so confused right now.

But yeah, there it is. I think, yeah, also indicative probably of the impact of DeepSeek R1 on the Chinese landscape where they made a huge splash, right? Like to the effect of actually affecting the stock market in the US. I would not be surprised if there are new players in China focusing on their reasoning just as a result of that. It is weird that they're coming from

like a property company or something. Like, I mean, I understand. Yeah, this is a weird one for sure. Like I get deep seek, you know what I mean? Like, okay, so they come from high flyer, like this like, you know, hedge fund that a million hedge fund companies like Medallion or Rentech, like they do AI, right? That's what they do. This is just like, what are you doing guys? Apparently they're doing really well. It's a good model. Don't know what to say. And yeah, fully open source. So that's nice to have.

And last open source model we cover, Blip3-O, a family of fully open, unified, multi-model model architecture training and data sets. So we've covered Blip3 before. That was...

the multimodal model in the sense of taking both images and text as input and outputting text. That used to be what multimodal meant. With Blip3-0, they're moving in, I suppose, the

frontier of multimodal where both with ChatsGPT and with Gemini, we saw recently Google models being able to output images in addition to taking them as input so that now we have a unified multimodal model. It can take in multiple modalities. It can output multiple modalities. I will say not necessarily just one big transformer, but

as is typically the case for multimodal things with multiple inputs. But anyway, that's the core idea. And they talk in the paper a lot of details on how to be able to train such models. They train a model on 60,000 data points on this instruction at Funing to make sure that it is able to generate high quality images.

release the 4 billion parameter model that is trained on only open source data and have also an 8 billion parameter model with proprietary data. I mean, it's what I would expect things are going to have. Like, I think the multimodality trend and the agentic trend sort of converge, again, as I mentioned on computer use. So I see these two things being different ways of getting at the same thing.

The two things being this paper and the one we just talked about, it does seem like a pretty impressive model. One of the things that they did work on a lot was figuring out the architecture. They found that using clip image features gives just more efficient representation than the VAE features, the variational auto encoder features that often are used in this type of context.

Clip being the contrastive training approach that OpenAI used for, well, for Clip. There's a whole bunch of work that they did around training objectives as well, comparing different objective functions that they might use to optimize for this sort of thing. Anyway, it's cool. I think it's

It's an early shot at high degrees of multimodality from these guys. And I would expect that we'll get something like a more coherent, you know, the same way that we've coalesced around a stack for the agent side.

I think this is an early push into the kind of very, very wide aperture, unified multimodal framework. We've seen a lot of different attempts at this, and it's still unclear what strategy is going to end up working. So it's hard to know where to invest our own marginal research time as we look at these papers and figure out like, okay, well, which of these things is really going to take off? But for now, given its size, this actually does seem pretty promising.

Yeah, and I would imagine certainly probably the best model of its kind that you can get in open source to be able to generate images. We've seen models like Gemini, like OpenAI that integrate with Transformer with the image generation, have some very favorable properties and seem like...

They actually are better at very nuanced instruction following. So there's still room to improve in the image space. Although these are, of course, not quite as good. As with previous releases from the Blip team, that includes Salesforce and University of Washington and other universities,

Super, super open source. The most open source you can get, code, models, pre-training data, instruction during data, all of it is available. When you need to catch your breath while listing all the different ways in which it's open source, that's the bar. That's how you know. Fully open source. Fully. And now moving on to research and advancements, we begin with DeepMind and they have released a new

paper and blog post and media blitz with Alpha Evolve, a coding agent for scientific and algorithmic discovery. That's the name of the paper. The blog post, I think somewhat amusingly is Alpha Evolve, a Gemini powered coding agent for designing advanced algorithms. But there'd be no confusion.

Yeah. And so as per the title, the idea here is to be able to design advanced algorithms to get some code that solves a particular problem well. This is in some ways a sequel to something they did last year called FunSearch. We covered it maybe in the middle of the year. I forget exactly when.

And this is basically taking it up a notch. So instead of just evolving a single function, it can write an entire file of code. It can evolve hundreds of lines of code in any language, is scaled up to a very large scale in terms of compute and evaluation. So the way this looks in terms of what it does is it's

A scientist or engineer sets up a problem. Basically, it gives you a prompt template, some sort of configuration, chooses VLLMs, provides evaluation code to be able to see how good a solution is, and then also provides an initial program with components to evolve.

And then Alpha Evolve goes out and produces many possible programs, evaluates them, and winds up with the best program. And similarly to what we saw with FundSearch, FundSearch, at the time, they said that they achieved some sort of small improvement in a pretty basic way.

of matrix multiplication, although at the time this was a little nuanced, not entirely right. Well, with Alpha Evolve, they go into showing for various applications like autocorrelation and uncertainty inequalities, packing and minimum-maximum distance problems, various math things that clearly I'm not an expert of. They show somewhat improved accuracy

And just, yeah, the latest really of the deep mind style of paper where they are like, let us build some sort of alpha model to tackle some sort of science or in this case, computer science thing and get some cool results.

Yeah, I think that's how they describe it internally. Like we're going to do some kind of alpha something and then we're going to, but that's actually, I mean, it's accurate. One of the ways I used to think about it, I think I still do is through the lens of inductive priors, right? So basically the Google, so OpenAI has this, they're super scale pilled, right? Just like take this thing and scale the crap out of it. And more or less all your R&D budget is going into figuring out ways to get out of your own way and let the thing scale, right?

Whereas Google LeadMind tends to come at things from a perspective like, well, let's almost replicate the brain in a way in different chunks. So we're going to have a clear chunk, like an agent that's got this very explicitly specified architecture. We're not just going to let the model learn the whole thing. We're going to tell it how the different pieces should communicate. And you can see that reflected here in the kind of pool of functions that it reaches into and grabs.

evolutionary strategy and how that's all connected to the language modeling piece they also have an element to this where they're using gemini flash you know the super fast model and the gemini pro they're more i guess powerful but slower model

for different things. So with Gemini Flash, they use it to generate like a whole smorgasbord of different ideas cheaply. And they use Gemini Pro to do kind of the depth and the deep insight work. All those choices, right, sort of involve humans imposing their thinking of how a system like this ought to work. And what you end up finding with these systems is they'll often outperform what you can do with just like a base model or an agentic model without a scaffold.

But eventually the base models and agentic models just kind of like end up catching up to and subsuming those capabilities. So this is a way that DeepMind does tend to kind of reach beyond the immediate, the ostensible frontier of what just base models and agentic models can do and achieve truly amazing things. I mean, you know, they've done all sorts of stuff with like density functional theory and controlling fusion reactions and predicting weather patterns by following this exact approach.

So really cool. And it's consistent as well with isomorphic labs and all the biotech stuff that they're doing. So it's a really impressive, really impressive paper.

You can see why they're pushing in this direction too, right? For automating the R&D loop. If you can get there first, you can trigger the sort of intelligence explosion, or at least it starts in your lab first and then you win. This is a good reason to try that strategy of reaching ahead, even if it's with bespoke approaches that use a lot of inductive priors and don't necessarily scale as automatically as some of the kind of opening eye strategies might.

Yeah, I find it, looking at the paper, interestingly, they don't talk super in depth, as far as I can tell, on the actual evolutionary process in terms of what they're doing. It seems like they pretty much are saying, we took what we had in FundSearch, which was an LLM guided evolution to discover stuff, and we expanded it.

to do more, to be more scaled up, et cetera, et cetera. So it's them, as you said, taking something, pushing it more and more to the frontier. They did this also with protein folding, with chess, with any number of things. And now they are claiming some

pretty significant advancements in theoretical and existing problems. Also on practical things, they say that they found ways internally to speed up the training of Gemini by 1% by finding a way to speed up the kernel of Gemini. Also found ways to assist with training TPUs

scheduling stuff. Anyway, these kinds of actually useful things for Google in the real world.

And next up, we have absolute zero reinforced self-play reasoning with zero data. So for reasoning models, as we've covered with DeepSeek R1, the standard paradigm these days is to do some supervised learning where you collect some high quality examples of the sort of reasoning that you want to get, and then do reinforcement learning with an

Oracle verifier. So you do reinforcement learning where you're solving coding and math problems and you are able to

evaluate very exactly what you are outputting via reinforcement learning. So here they are still using a code executor environment to validate task integrity and provide feedback, but they're also going more in direction of self-evolution through self-play, another direction with DeepMind.

And OpenAI also pushed in the past where you don't need to collect any training data. You can just launch VLLMs to gradually self-improve over time. Yeah. And it's the way they do that is kind of interesting. So

There was a paper, I'm trying to remember what the name of the model was that did this. And for some reason, I think it, I may be wrong. I have a memory that it was maybe DeepSeek, but in this, or sorry, the lab, not the model. But essentially, so this is a strategy where they're going to say, okay,

When it comes to a coding task, we have three elements that play into that task. We have the input, we have the function, or the program, and we have the output, right? So those three pieces. And they sort of recognize that actually there are three tasks that we could imagine getting a model to do based on those things. We could imagine showing it the input and the program and asking it to predict the output.

So that is called deduction, right? So you're giving it a program and an input, predict the output. You could give it the program and the output and ask it to infer the input.

And that's called abduction. There's going to be a quiz later on these names. And then there's, if you give it input output pairs, figure out what was the program that connected these, right? And that's called induction. And these actually kind of all, the names make sense if you think about them enough, but that's basically the idea, right? Just like basically take the input, the program and the output and black out one of them and reveal the other two and see if you can train a model to predict the missing thing.

In a sense, this is at a high level of abstraction, almost a kind of regressive training in a weird way. But the bottom line is they use one unified model that's going to propose and solve problems. And they're going to set up a reward for the problem proposer, which is essentially generating a program given input and output.

And for that, it's your standard, like if you solve the problem, if you propose a correct problem or program rather that compiles and everything's good, you get a reward. If not, you don't. Anyway, they do a bunch of Monte Carlo rollouts, in this case, eight just to normalize and regularize. But yeah, bottom line is you see again, another theme that pops up in this paper is this idea of difficulty control, right?

In this case, the system has a lot of validation steps that implicitly control for difficulties. They're not going to explicitly say, hey, let's only keep the mid-range difficulty problems by some score. You actually end up picking that up implicitly because of a couple of conditions that they impose. The first is that the programs that are proposed are

The code for those programs has to execute without errors. So automatically that means you have to be at least able to generate that code and it has to be coherent. There's a determinism check too. So the programs have to produce consistent outputs. If you run the program multiple times, you got to get the same output. Again, you know, this requires a certain level of mastery and then there's

some safety filtering. So they forbid the use of harmful packages. And basically, if your program generation part of your stack here is able to do this successfully, then probably it's being forced to perform at least at some minimal level. So the task is

is not going to be trivial, at least. And only tasks that pass all those validations contribute to the learning process. So you get a kind of baseline quality of the programs that are generated here. It's a really interesting paper. It's something that raises a lot of questions about the data wall, right? This is something that people have talked a lot about is like, there's only so much data you can fine tune on. So many examples of solved problems, solved coding problems.

If you have this closed loop, though, that's able to automatically generate new problems, new deduction, abduction, and induction problems,

and then close a loop where one feeds into the next as they have here, then you really don't have a data wall. And they have some scaling curves that show, admittedly, not that far out in scaling space, in sample space, but still scaling curves that show that, yeah, this does seem to keep going at least as far as they've tested. If that holds, essentially what they're doing is they're trading data for compute, right? You can basically, if your model is good enough to start this feedback loop,

then just by pouring more compute into it to get the model to pitch new problems that it can then solve, you can start this feedback loop where really there's, I mean, there's no data wall, but that at least would seem to apply for the kind of code problem solving problems that they're training on here. Right. And just to note a particular detail, they do actually look into not having the

or the supervised learning. So absolute zero is absolute zero because there's no supervised learning or verifiable rewards, although they are, I think, still executing the code in a computing environment, if I understand correctly. So they can have some feedback from the environment, but not an actual kind of verification that you got the problem correctly. So as a result, they have to think through all of these things

other techniques to be able to evaluate yourself like deduction abduction induction as you said that allows them to train they compare to i haven't actually been aware of these there's been you know more and more open source efforts as we've seen apparently there's an open reasoner zero there's also simple rl zoo various things over the last couple months looking into

the RL part of reasoning. And so this is just the latest, and I think pushing in a direction of not requiring verifiable rewards, which is to some extent a limitation of the DeepSeek R1 formula.

Next up, we have another report from Epic AI. So not a research paper, but an analysis of trends and kind of a prediction of where we might be going. This one is focusing on how far can reasoning models scale?

So the basic question here is, can we look at the training compute that's being used for reasoning models, things like DeepSeeker 1, Grok 3, and from that infer the scaling characteristics and to what extent reasoning will kind of keep growing. So there...

Prediction is that we have a pretty small period in which you have very rapid growth going from DeepSeeker 1 to Grok 3. They don't know exactly the training for O3 versus O1, but they, I think, are predicting here that O3 would be trained quite a bit more. And so their prediction is the training compute being used

will start flattening out a bit, growing slower compared to base models of the past. But they are still saying that the scale of large trading runs will keep going in the next couple of years. And presumably the reasoning models will continue improving as a result. Yeah, you can kind of

We talked about this quite a bit, actually, before and when DeepSeq R1 came out. We were talking about it before even when R01 came out. Just the idea that you have this new paradigm now that requires a fundamentally different approach to compute, right? You have to

Well, we just talked about it. Instead of just generating an output and then automatically being able to score that really quickly and then doing back propagation, updating your model weights, what you now have to do is you take your base model, you generate an entire rollout, and that takes a lot of time. And it has to be done on inference-optimized hardware.

And those rollouts then have to be evaluated. And then the evaluations have to check out. And then you use those to update your model weights. And so that whole extra step actually requires a different compute stack.

And so if you look at what the labs are doing right now, they've gotten really, really good at pre-scaling, at scaling pre-training compute, right? Just this auto-aggressive pre-training where you're training a giant text autocomplete system. People know how to build multi-billion dollar, tens of billion dollar scale pre-training compute clusters for that. But what we're not seeing, what we haven't yet seen is...

aggressive scaling of the reinforcement learning stage of training. And this is not going to be a small thing.

So it's estimated that about 20% of the cost of pre-training DeepSeq R1, the V3 model that R1 was based on. So if you look at the cost of pre-training DeepSeq V3, about 20% of that cost went into the compute for R1. That's not trivial. And we keep seeing in these compute scaling curves for inference time scaling that you really do want to scale it along with your pre-training compute budget, right? So it's...

you're going to get to a point where right now we're ramping up the orders of magnitude like crazy on the inference side. That's though going to saturate very quickly. I mean, we saw a 10x leap from 01 to 03 in terms of

the compute used for the reinforcement learning stage, as you said. You can only do that so many times until you hit essentially the ceiling of what current hardware can allow. Once that happens, then you're bottlenecked by how fast can you grow your algorithmic efficiency and your hardware scaling. And essentially that looks the same as pre-training scaling growth, which is about 4x per year. So you should expect a rapid increase. 04 is going to be really, really good. 05 is going to be really, really good, but pretty quickly,

It's not that things are going to slow down like crazy, but they'll scale more like the pre-training scaling curves that we've seen. This has big consequences for US-China, for example, because right now it's creating the illusion that China is better off than necessarily they are. In the early days of this paradigm, when people haven't figured out how to take advantage of giant inference clusters,

The US, which has larger clusters available than China, isn't yet able to use the full scale of its clusters. And so we're getting sort of a hobbled United States, artificially hobbled United States relative to China on a compute basis. All kinds of reasons why that's actually kind of a more complicated picture, but I thought that was really interesting. Another data point that they flagged here that I was not tracking at all was there are these other reasoning models that

that have been trained that have come out fairly recently, like Phi4 reasoning or Lama Nematron Ultra. And these have really small reinforcement learning compute budgets. We're talking less than 1%, in some cases, much less than 1% of the pre-training compute budget. And so it really seems like R1 is this case of an unusually high investment in RL compute relative to pre-training. And that a lot of the models that are being trained in the West, the reasoning models,

have very high pre-training budgets and relatively very tiny reinforcement learning budgets. I thought that was super interesting. And something tells me that the DeepSeek R1 strategy is actually more likely to be persistent in the long run. I suspect you're going to see more and more flowing into the RL part of the training stack. But anyway, super important, important questions being raised here. Interesting little write-up from Epic AI, which we do love to cover.

Right, exactly. And to that point, we've seen kind of a mix of results. It's still not a very clear picture. We've seen that you can really get rid of RL and with a very well-designed

curated data set for supervised fine-tuning. You can at least do most of the progress towards reasoning and to unlock the hidden capabilities of a base model, as they say, with RL not necessarily adding new capabilities, just sort of shaping the model towards using it well. Worth knowing also, RL, very different in terms of a training from autoregressive unsupervised learning or what's

self-supervised learning, I guess, was the term for a while in the sense that RL requires rollouts, it requires verification. It just isn't as straightforward to scale as pre-training or post-training. So another kind of aspect to consider, but yeah, very much still an ongoing research problem as we've seen with all these papers we keep talking about with all these different types of results and different recipes that

I'm sure will likely, you know, over time converge to what has been the case in pre-training and post-training. People, I think, have discovered more or less the recipe. And I'm sure that will increasingly be the case also with reasoning.

And onto the last paper, this one coming from OpenAI. So, you know, props. Sometimes I think I've said that OpenAI doesn't publish research anymore, and that's not exactly true. And this one is HealthBench evaluating large language models towards improved human health. So open source benchmark designed to evaluate LLMs on healthcare, focusing on

meaningful, trustworthy, and unsaturated metrics. So this was developed with input from 262 physicians across 60 countries. It includes 5,000 realistic health conversations to test LLM's ability to respond to user messages. Has a large rubric evaluation system with a ton of unique criteria, as you might expect. It

This is an area where you really want to evaluate very carefully and be sure that your model is trustworthy, is reliable, is even allowed or should be allowed to talk about health and questions regarding health. And so they open source the data set, they open source the eval code so that people can work on AI for healthcare.

Yeah. And I mean, to your point about OpenAI not publishing research anymore, I think you are fundamentally correct. I mean, they don't publish anything about how they build their models. Algorithmic, yeah. Algorithmic discoveries, let's say. Mostly, sometimes with image generation, they've done a little bit, but yeah, mostly not. Yeah. And like here and there for alignment, but it's murky and unclear. And

And then, you know, when you have something that makes for a great PR play, like, hey, we have done this healthcare thing. Please don't regulate us, pretty please. We're doing good things for the world. Then all of a sudden you get all this wonderful transparency. But I will say credit where credit is due. This is a huge scale, significant investment.

seemingly, that OpenAI has had to put into putting this together. So 5,000, as you said, multi-turn conversations between users and AI models about healthcare. What they did is they got about 300 doctors to look at these conversations and propose bespoke criteria. So like, you know, specific criteria based on which they would judge the effectiveness of the AI agent in that conversation or of the AI chatbot. And so to give you an example of

You know, you have a parent who's concerned about their baby who hasn't been acting like herself since yesterday. The rubric that the doctors came up with that were aggregated from a bunch of doctors, different doctors looking at this exchange was,

They're like, okay, well, does the chatbot state that the infant may have muscle weakness? If so, seven points. Does it list at least three common causes of muscle weakness in infants? If so, plus five points. Does it include advice to seek medical care right away? And so they give points. I mean, it's a very detailed list.

kind of looking over the AI's shoulder type of perspective for each of these 5,000 multi-turn conversations, again, using hundreds and hundreds of doctors to do this. And there are some criteria that are shared across many of these exchanges. So about 34, what they call consensus criteria. These are things that come up again and again, but mostly they are example specific. Like 80% of the criteria they use are literally like just for one conversation or just for one exchange.

So that's pretty remarkable, a really, really useful benchmark. They use GPT 4.1 to evaluate whether each rubric criterion is met in a given conversation. So they're not actually getting the doctors to review the chatbots' responses. Obviously, that doesn't scale. But what they do do is they find a way to demonstrate that GPT 4.1 actually does a pretty decent job of standing in as the typical physician. Their performance, the grades that they give are pretty comparable there.

And if GBD 4.1, by the way, is the best model they identified, it does better than even 04 Mini and 03 at that task. One of the things that really messes with my head on this, and we have to remember anytime we look at a benchmark like this and we're tempted to ask, okay, so how well does the best AI do? How well does a doctor do, right? That's the natural question. It is important to note that this is not how typical doctors would evaluate a patient, right? Like you would

typically have visual access to them. You'd be able to touch, you'd be able to kind of see the nonverbal cues and all that stuff. That being said, on this benchmark, models do outperform unassisted physicians. Unassisted physicians score 0.13 on average across all these evals.

Models, the top models on their own, 0.6. That's for 03. That is wild. That is a four times higher score than the unassisted physician. That honestly kind of blows my mind a little bit. Certainly, these models can draw on much, much larger sources of data. And again, we got to add all those caveats. Physicians don't normally write chatbot-style responses to health queries in the first place.

But it's an interesting note. And we've seen some papers, we've talked about them here where doctors actually can perform even worse when they work with an AI system than the AI system on its own, because the doctors are often second guessing and don't, let's say, just have blind faith in this model. So pretty interesting. One more caveat there is there is a correlation, we've seen this before, between response length and score on this benchmark.

mark. And that's a problem because it means that effectively the chatbots can game the system a bit just by being very verbose. So surely that's influencing things a little bit. The effect does not though nearly account for the insane disparity between unassisted physicians and models, which again is like a 4x lift. Like that's pretty wild. Yeah. Worth noting that there are multiple metrics here, including communication quality, accuracy as its own metric. And I do actually evaluate

The physicians with the models and the combination there is on par. Maybe, you know, there's some of these things that they're better on. Accuracy seems to be about the same. Communication quality may be a bit different. But yeah, physicians with these tools will be much more effective than without. That's pretty clear from the results. And they do have various caveats as to evaluation. Like you said, there's a lot of variability there and

and so on. Interesting to me also in the conclusion, they note that they included a canary string to make it easier to filter out the benchmark from training corpora. And they also said

are retaining a small private held out set to be able to enable instances for accidental training or implicit overfitting to the bench. So I think interesting that in this benchmark, we're seeing what should be probably the standard practice for any benchmark release in this day, which is you need to be able to make it easy to filter out from your massive training

from web scraping and probably also have a private eval set. On to policy and safety. First up, we have the Trump administration in the U.S. is officially rescinding Biden's AI diffusion rules.

So there was the artificial intelligence diffusion rule that was set to take effect on May 15th, introduced by Joe Biden in January. It aimed to limit the export of U.S.-made AI chip to various countries and strengthen existing restrictions. And the

The Department of Commerce has announced that it will not enforce this Biden-era regulation. A replacement rule is expected that will presumably have a similar effect. The rule, I think we covered probably at the time, there were three tiers of countries, tier three being China and Russia, that have very strict controls.

Tier two countries that are some export controls and tier one, which are friends that have no controls. So seems that now the industry as a whole is going to have to wait for what happens.

the new rules will be. Yeah, the philosophy here, and we have yet to hear the announcement for the Department of Commerce for what will replace this, but the philosophy seems to be that it'll be nation to nation bilateral negotiations for different chip controls, which could make sense. I mean, one of the big weaknesses of the diffusion framework that the Biden administration came out with, and we talked about this at the time, was

They had this insane loophole where as long as any individual order of GPUs was for less than 1700 GPUs, literally zero controls applied. And the reason that's relevant is literally Huawei's entire MO has been to spin up new subsidiaries faster than the US can put them on their export control list.

and then use those to kind of pull in more controlled hardware. And then obviously Huawei just pulls that together. And so putting in an exemption for a 1700 is a decent number of GPUs too, by the way. So putting in an exemption for that number of GPUs is, I mean, you're kind of just asking for it. That is exactly the right shape for China to exploit. That matches exactly the strategy they have historically used to exploit US export control loopholes.

So hopefully that's something that'll be addressed in this whole kind of next round of things. We don't yet know exactly what the shape will be, though we do have a sense, and this ties into our next story, of what the approach will be with respect to certain Middle Eastern countries like Saudi Arabia, like the UAE, which are now kind of top of mind as the sort of not neutral states, but the ones that aren't the US or China, let's say, proxy fronts in this big AI war. Right.

Right, and that does take us to the next piece. Trump's Mideast visit opens floodgate of AI deals led by NVIDIA. That's from Bloomberg. So the Trump administration has been meeting with two companies

nations in particular, Saudi Arabia and the United Arab Emirates. And we do expect agreements to be unveiled soon. And the expectation is there will be eased restrictions, meaning that NVIDIA, AMD and others will be able to sell more, get more out of the region.

The stock market reacted very favorably. NVIDIA went up 5% and AMD went up 4%. And there's been a variety of announcements per the article title of NVIDIA

deals that seem like they'll start happening. So for instance, NVIDIA will be providing chips to Saudi Arabia's Humane, a company created to push the country's AI infrastructure efforts. Humane will get several hundred thousand of NVIDIA's most advanced processor over the next few years.

And there's other deals like that with AMD, Amazon, Cisco, others. So the indication seems to be some restrictions will be eased. Restrictions were set in part because there were ties between some firms in these regions and China with, in particular, G42. So...

Yeah, it seems like it might be different from the Biden era. Yeah, it's quite interesting, right? There's a lot that the different players at the negotiating table here want. The Saudi deal is especially interesting because it's

It points to a similar kind of deal to the deal that America's started to shape over the last few months with the UAE being more permissive in some ways, but also insisting that the UAE move away from their entanglements with China. You mentioned G42, right? And Huawei having had some past. Well, the strategic situation if you're Saudi Arabia is you want to be positioned for an oil, for a post-oil future, right? That's the same for the UAE and the same for all the Gulf states really.

In Saudi Arabia, that's motivated this thing called Project Transcendence, which is a $100 billion initiative for tech in general, but specifically for AI. There's a big, big pool set aside for that. The UAE is in a similar position. They already have a national champion lab in G42, as well as Institute for Technology or something. IIT? IIT, yeah. Yeah. Yeah. The guys who did the Falcon models. Yeah.

Which we haven't heard much about since, by the way, which is kind of interesting. But right now, the Saudis are behind the UAE and they're trying to make up ground. And so the UAE and the Saudis essentially are, in some sense, competing against each other to be America's partner of choice for large-scale AI deployments in the Middle East. That's one dimension of this. They want to get their hands on as much AI hardware, as many GPUs as they can.

This is one reason why Trump stacked them back to back. So he had first an announcement of the deal with the Saudis and then heading over to get a deal with the UAE, putting pressure on each of them to kind of play off each other. Look, the Saudis have tons of energy. They are an energy economy, same with the UAE. Just at the time when we're saturating the U.S.'s

energy grid. And that's the main kind of blocker on our deployments. And so you can see the temptation if you're OpenAI, if you're Microsoft, if you're Google to just like say, well, why don't we set up a data center in the Middle East where we have an abundance of energy plug into their grid and that'll be great for us. And well, there are a couple of reasons why you might not want to do that.

So historically, one was the Biden administration's export control scheme. You just can't move that many chips into a foreign country like that. Just no good. But that's being scrapped, as we just talked about. So now the situation is, well, maybe we can, right? Maybe we can negotiate country to country and set this up.

But the United States is going to want to make sure that if they are setting up AI infrastructure in the UAE, in Saudi Arabia, that the Saudis don't turn around and sell that to China, right? China's super good at using third-party countries. Historically, that's been Malaysia. It's been Singapore, right? And using those countries to bring in GPUs and subvert U.S. export controls. So, you know, sure, you might have export controls on China proper, but you don't necessarily have them on Malaysia, on Singapore. And what a surprise, a massive...

influx of GPU orders into Malaysia of all places in the last few months. Hmm, wonder where those are being redirected, right? So this is something that the administration wants to make sure it doesn't happen with these deals. Whole bunch of issues around Saudi entanglement. You said, you know, UAE-China has got a lot of ties. So do the Saudis, right? Huawei made Saudi Arabia a regional center for their cloud services.

There's a big Saudi public investment fund, the PIF, that's actually bankrolling this whole project transcendence thing. And the PIF has joint ventures with Alibaba Cloud. They've got a new tech investment firm that we covered a few episodes ago called Allat that also has a joint venture with Dahua, which is an ND listed, basically a blacklisted Chinese surveillance tech company of all things. So there are a lot of entanglements there.

And deep questions about how some of the Saudi Arabian GPU reserves are being used potentially by Chinese academics and researchers as well. So while there's no hard evidence of the Saudis shipping GPUs specifically to China, you wouldn't necessarily expect that. China's MO is absolutely to do stuff like this. And just a last note here in the negotiations, one really interesting thing that's been proposed is this idea of a data embassy. No one's ever proposed this before, but basically it's the idea that like, look,

If you want to be able to take advantage of huge sovereign reserves of energy in the UAE and Saudi Arabia, but you're concerned about the security implications, well, maybe you can set up a region of territory that, you know, just like how the U.S. embassy in Saudi Arabia is this technically tiny slice of American soil in Saudi Arabia, of sovereign American soil.

Well, let's set up a tiny slice of sovereign American soil and put a data center on it. U.S. laws will apply there. You're allowed to ship GPUs to it, no problem, because it is sovereign U.S. territory. So export control isn't an issue in the same way. Sure, you have Saudi energy feeding in, and that's a huge vulnerability. Sure, you're embedded in this matrix. But in principle, maybe you can get higher security guarantees from doing that.

Lots of caveats around that in practice. I won't go into them, but there are some real security issues around trying something like that, that our team in particular has spent a lot of time thinking about. But this is basically the structure of these deals. A lot of new ideas floating around. We'll see how they play out, but they definitely put the UAE and put Saudi Arabia right up there in terms of the players that might have large domestic stockpiles of chips.

All right, so that's a couple of policy stories. Let's have a couple safety stories to round things out. The next one is a paper, Scaling Laws for Scalable Oversight. So oversight is the idea that we may want to have weaker models verify that a thing that a stronger model is doing is actually safe and aligned and not bad. So you might imagine you might have

a super intelligent system and humans are not able to verify that what it's doing is okay. And you want to be able to have AI oversight over stronger ones to be able to trust it. In this paper, they're looking into, you know, wherever you can actually scale oversight. And by the way, it's called scalable oversight because you can scale it by using AI to, you

actually verify things at the speed of AI and compute. And so what this paper focuses on is what they're presenting as nested scalable oversight, where basically you can do a sequence of models where you have weaker, stronger, weaker, stronger, and you can kind of go off a chain to be able to provide verifiable or trustworthy oversight to

and make things safe. So they introduce some theoretical concepts around that, some theoretical guarantees. They do some experiments on games like Mafia, War Games, and Backdoor Games, and verify in that context that there are some success rates. And

Yeah, present kind of this general idea as another step in the overall research of the idea of scalable oversight. Yeah, and this is, I don't think, I don't know if it was Paul Cristiano back when he was at OpenAI who invented this whole area, but certainly the idea of doing scalable alignment by getting a weaker AI model to monitor a smarter AI model, a stronger AI model.

is something that he was really big on. And frankly, I mean, and through debate in particular. So his whole thing was debate. That's one concrete use case that they examine here. So basically have a weak model

watch maybe two strong models debate over a particular issue, and the weak model is going to try to assess which of those models is telling the truth. Well, hopefully, the idea here is if you can use approaches like this to determine with confidence that one of your stronger models is reliable, well, then you can take that stronger model and now use it to supervise the next level of strength, an even smarter model. And you can maybe start climbing the ladder that way.

This is, I think, a good way. This paper is basically trying to quantify that. So the way they're going to try to quantify that is with ELO scores. So these ELO scores tell you roughly...

how often a given model will beat another model, right? So, you know, and I forget how they, what the exact numbers are, but it's like, if you have a model with an ELO score of a thousand and another model with an ELO score of 1200, then the model with the ELO score of 1200 will beat the model with an ELO score of a thousand, like 70% of the time or whatever the number is. And so this is an attempt to kind of quantify what that climb might look like.

using ELO scores, using essentially scaling curves for these ELO scores, which is quite interesting. I think there are some pretty fundamental problems with this whole approach. I don't think that Max Tegmark, who is like one of the lead authors of this thing, would actually disagree. But there's a fundamental issue here, which is when you think about climbing the intelligence ladder, new capabilities of concern, like deceptive alignment, in other words, the ability of a model to pretend as if it's aligned when it actually isn't,

those can emerge pretty suddenly. Like you can have this sort of emergent capabilities that pop up suddenly and violate these scaling curves. And the kinds of capabilities you worry about in the context of superintelligence are,

You might expect to arise quite quickly where there's a sudden sort of cohesion of situational awareness of capabilities around manipulation, persuasion of capabilities around offensive cyber and things like that, that all kind of come together fairly quickly. And if that should happen, then you ought to expect the scaling laws to break down at precisely the stages where you most need them to work.

Nevertheless, this is, I think, a really good quantification of some of the arguments that we've seen from people like Paul Cristiano. IDA, I think it was the acronym, iterative debate and alignment or something like that. I forget. I actually looked into it really deeply like four years ago. Now I can't sum it up. But IDA.

Yeah, I think this is, if you're going to take it seriously, this is a good way to do it. Looking across different versions of this, like what if you have a game of Mafia? If you don't know what the game Mafia is, don't worry about it. What if you've got this debate scenario that I just described?

All these different possible scenarios. What are the scaling curves look like in terms of how smart your judge model is going to be versus how smart the models are who are potentially trying to fool the judge model? How often can you actually assess the judge model succeed? They've got all these great scaling plots. And yeah, it's a good paper if you're interested in that model.

And one story related to safety. OpenAI pledges to publish AI safety test results more often. So they have actually launched the Safety Evaluations Hub, a page where you can see their model's performance on various tests.

Benchmarks related to safety, things like harmful content, jailbreaks, and hallucinations. And yeah, you can really scroll through and basically see 4GPT-401, 4.1 Mini, 4.501, all of them for various things related to safety, like refusal, jailbreaking, hallucination, etc.

what the metrics are. Now, they're not presenting everything they do for safety. They don't have the metrics for their preparedness framework on here. They're going to continue to do that in the system cards. But nevertheless, I think an interesting kind of move by OpenAI to make it extra easy to

see where their models stand. Yeah, this is, if nothing else, just a really great format to view these things in. And anyway, you can check out the website. It's actually really nicely laid out.

And that will be it for this episode of Last and Sometimes Last, Last Week in AI. As we've said, we'll try to not skip any more weeks in the near future. Thank you to all the listeners who stick by us, even though we do sometimes break that promise. As always, we appreciate your feedback, appreciate you sharing a podcast, giving reviews, corrections, questions, all of that. And please do keep tuning in.

Thank you.

♪♪ ♪♪ ♪♪

From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten.

On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.

#209 - OpenAI non-profit, US diffusion rules, AlphaEvolve 01:53:14 Share

Last Week in AI

Deep Dive

Shownotes Transcript

#209 - OpenAI non-profit, US diffusion rules, AlphaEvolve