Hello and welcome to the Last Week in AI podcast, or sometimes the last two weeks in AI podcast, where you can hear us chat about what's going on with AI. And as usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. You can go to the episode description for the timestamps and links for all those stories. I am one of your regular hosts, Andrey Karenkov.
I studied AI in grad school and now work at a generative AI startup. Hey guys, I'm your other host, Jeremy Harris. Gladstone AI, AI national security stuff.
blah, blah, blah, blah, blah. And yeah, we have a lot to get through this week because it's actually this past two weeks. This is one of those episodes where we missed one last week. That was on me. And now we're going to do some catch up and see what happens. Yeah, Jeremy, you seem to need to travel a lot. I'm starting to feel like you might be a spy going to Washington and retrieving AI secrets or something. I mean, look, every once in a while, you may hear what sounds like a Russian accent, but
Actually, it's funny because you're the one with the Russian background. But this is how spies work, Andre. They seem like they could not be less Russian, and yet here we are. And yet, I am not a spy. You just have travel to do to talk to people about AI. Yes, exactly.
Well, we will go pretty quick just to give a quick preview. No huge stories in the past couple of weeks in tools and apps. There's just a variety of announcements of somewhat significant releases, a lot of 1.0s or new versions of things, a new O3 Pro, applications and business. Again, nothing huge, but some interesting developments on the chip side, on the OpenAI side.
Then projects in open source, research, kind of, again, a variety of stories, no particular focus in this episode. Policy and safety, we're going to be talking about a bit of interpretability and safety more so, and a couple of national security stories. And we will actually have a synthetic media and art section, which we haven't had in a while, just because it's always at the end, but there's some
new copyright lawsuits and some new partnerships that are interesting. So we'll go ahead and add that on to cover that. SAG is back in the news too. It's been a while since we've seen them. Yeah. Yeah. Last year there was quite a bit of that, and we sort of just stopped, and now is a good time to mention some of that ongoing news.
Before we dive in, I do want to acknowledge some Apple Podcasts reviews. We appreciate your comments. There was a review telling us to keep it up, please, which I feel like we've been told several times. So the encouragement is appreciated, let's say, and we will try to keep it up and make it as weekly as we can.
Another positive review. Love the show. CapEx, CapEx, CapEx. Well, glad some people were on board. And we did have a pretty detailed bit of feedback with a three-year listener talking about us maybe alternating introductions more, me taking the lead less, always talking about the next story and setting it up. We just sort of
wound up in there. We didn't plan on this being the natural flow of the show; it just sort of emerged organically. It's funny because I have the unfair advantage that while you're going through the kind of layout of the story, I get to think a little bit more, yeah, look at my notes, be like, hey, you know, oh yeah, there's this thing. Because, as you can imagine, we're covering, I mean, this week will be like 40 stories or something. Every week it's like,
We're having to do research. We have reams of notes on every single paper, every single news story. And so I don't know about you, Andre. When we switch stories, I'm like in a scramble. What did I even think of this? Oh, yeah, this is that paper. Okay. And so while you're kind of gracefully going through your intro. The secret is I'm actually just better at sounding prepared when I'm reading from notes because you got to load this into your RAM, you know? Yeah, yeah.
Change context. And I happen to be all right, I hope, at pretending like I have an actual script instead of just rambling off based on it. Yeah. And I will say, I think I am pretty good at segues. But anyways, we'll try out a bit more variation later.
Andre is really good at segues. And with that... And with that, let's get going on the actual news, starting with the tools and apps section. First up, we have OpenAI adding O3 Pro to ChatGPT, dropping the O3 price by 80%, and also mentioning that they're going to delay the open source AI model to later this summer. And that's pretty much the news. So...
So O3 is their reasoning model. And now we have O3 Pro, which is going to be replacing O1 Pro. It seems very good, a clear step up from O1 Pro. And the O3 price is getting cut by 80%. So that would mean $2 per million input tokens versus the previous $10. So huge price drop. I mean, this was, to me, quite surprising.
And yeah, O3 Pro is the new top-end offering, I expect. Pretty nice performance on benchmarks, better than all their other offerings. So pretty big news. So there's an OpenAI post about this, the model release notes on O3 Pro with some initial evals, right? Giving a sense of, like, how does it stack up in human expert evaluations and compared to O1 and O3 medium. On the human evaluations, it's really impressive, worth looking at the chart, across everything, basically.
You see a clean sweep where the model is preferred over O3 by human reviewers 64% of the time. That includes, by the way, personal writing and computer programming and data analysis. So really kind of spanning everything from things where you have a quantifiable reward that you can issue and things that are more qualitative. You're seeing...
superior performance across the board. And then some of the areas where we're seeing really significant improvements in benchmark scores, AIME, AIME 2024 going from 90 to 93% between O3 Medium and O3 Pro. That may not sound like a lot. It may sound like 3%, but one way to think about it is once you're already at 90%,
There's not that many percentage points left to climb, right? So you would expect that saturating a benchmark is really hard. They just took a third of the remaining errors off the table with that. It's kind of similar with GPQA Diamond, that sort of PhD-level science questions, and Codeforces competition code. So across the board, again, this like universal improvement in these capabilities. One thing that I hadn't noticed, to my embarrassment: there's a benchmark that they run.
They call it the four out of four reliability evaluation. I just want to surface this because it makes all the sense, and of course they're doing this, but I guess I hadn't yet explicitly remembered seeing this in writing. In this eval, you consider a model successful only if it correctly answers a question in all four attempts. So you try it four times on the same question, and you can see this kind of evaluation becoming more important when we get into agents that are being deployed in higher-stakes scenarios.
You want to make sure that the agent consistently performs well so that even if you test it and you get lucky or something, you don't overestimate its performance. And so anyway, I thought that was, again, one of these oddly simple things, but that I hadn't seen done elsewhere. Or don't remember seeing done elsewhere? Yeah, exactly. Usually you get...
Pass at one or pass at five, basically: do you nail it first try, or do you nail it after a few tries? And they do give those numbers, but they also give the four out of four reliability evaluation, which, as you said, I don't think is typically what you see in benchmark numbers. And
Compared to the pass at one result, that is a less nice number. You get worse outcomes if you require it to get it right four out of four times. There is a performance drop. And in fact, in some cases like GPQA, a pretty significant performance drop, but still O3 Pro is beating all of them. And on the evaluations side,
With human testers. So O3 Pro is preferred to O3 according to human testers on scientific analysis, personal writing, data analysis, as you said, about 64% of the time on average. So, you know, O3 is sometimes about as good, but more often than not, O3 Pro is preferred.
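To make that four-out-of-four reliability idea concrete, here's a minimal sketch of how you might score it alongside pass@1. This is our own illustration, not OpenAI's actual evaluation harness, and the function and variable names are made up.

```python
# Illustrative only: our own sketch of pass@1 vs. 4-out-of-4 reliability,
# not OpenAI's evaluation code. `model` and the question/answer lists are
# hypothetical placeholders.
from typing import Callable, List, Tuple

def evaluate(model: Callable[[str], str], questions: List[str],
             answers: List[str], attempts: int = 4) -> Tuple[float, float]:
    pass_at_1 = 0      # credit if the first attempt is correct
    all_correct = 0    # credit only if every one of the attempts is correct
    for question, gold in zip(questions, answers):
        results = [model(question) == gold for _ in range(attempts)]
        pass_at_1 += results[0]
        all_correct += all(results)
    n = len(questions)
    return pass_at_1 / n, all_correct / n

# A model that is right 90% of the time on any single attempt, with errors
# independent across attempts, only scores about 0.9**4 ≈ 66% on 4/4
# reliability, which is why the stricter metric drops below pass@1.
```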
Next up, we have Cursor AI Editor hitting its 1.0 milestone, and there are some releases with it, including BugBot and Background Agents. So Cursor is the integrated development environment, the programming tool that has become one of the leading contenders for being what programmers use to include AI in their workflow.
So 1.0 release, probably not being covered in major news outlets, but kind of a big deal. And Anysphere, as we've covered, now has a ridiculous valuation after really rising quickly last year. So with this 1.0 release, they released BugBot, which is an automatic reviewer of pull requests on GitHub.
There are also these background agents in beta, which allow you to run agents in a remote environment set up by Cursor. So basically,
It's getting into the agentic territory where the AI agent does coding for you, does work for you totally asynchronously away from your prying eyes, and then delivers something to you to evaluate. So Cursor has had agentic coding for a while and they've been pushing it. This is
Another step in that direction, and it lines up with other efforts like Codex and Jules from Google, where you do have these coding agents just work remotely and deliver results without the direct supervision that was the model for AI pair coding until recently.
Yeah, I'm super curious about where this evolves from a security standpoint, too. Like, for context, the way this is working right now is that the agent will actually fork your GitHub repository and have its own branch where it'll put out PRs, it'll review PRs and all that stuff.
as you said, fully in parallel on its own branch. So they have some notes about the security side. They're like, hey guys, just keep in mind, these agents have a much bigger attack surface compared to existing Cursor features that don't look like this.
And they do say our infrastructure has not yet been audited by third parties. You know, you have here agents that have read-write privileges to repositories, right? So this is like God mode for your agent that is writing code. So somebody can do prompt injection, data poisoning attacks or whatever on the agent.
That could be a really big deal. And if you're deploying this in like a production setting, this is a really interesting new set of vulnerabilities that absolutely is going to have to be addressed in the basic kind of design philosophy for these tools. By the way, we'll be talking about this later, but...
This is on the same week that Microsoft has come out and announced that a new vulnerability was uncovered in Copilot, sort of in the same spirit, a prompt injection type attack. So it's like all of a sudden we're realizing you can't just deploy agents on all the things and assume that security is going to look the same.
So anyway, I think Cursor is going to be at the absolute forefront of this because these agents have such intimate access to the code base and are able to work autonomously and in parallel. So I think we'll learn a lot about best practices. They're going to have to evolve really quickly because, you know, I mean, there are a lot of cyber attacks on conventional software as it is. Yeah, the sky's the limit. Yeah, and that's especially true if you're working open source with various contributors and
Jailbreaks can be pretty subtle and can be quite weird. And agents are still kind of in development. So there could definitely be ways in which you can just tell it, delete all the code or something like that.
And onto the lightning round, we have a couple of quick stories. First, you've got Mistral releasing a pair of AI reasoning models. So Mistral is the French AI lab, which has released a lot of open source models and has tried to compete with OpenAI and Anthropic and others with big LLMs.
So they've released Magistral, the reasoning model, in two variants: Magistral Small, with 24 billion parameters, which is now available for people to download with an Apache 2.0 license, fully open source. And Magistral Medium, which is available on their Le Chat platform and on their API. Not as good as pretty much any of the leading reasoning models on evals.
Partially because they're smaller compared to something like DeepSeek R1. But yeah, the general impression I get is people are not too impressed. But at the same time, it's nice to have another open source reasoning model for people to build on.
Yeah, I continue to be sort of interested and confused about what the big picture game plan is for Mistral other than to become the French champion that's subsidized by the French state to do French things. But we'll see. The business model of just like pumping out your models and like as open source and then hosting them seems to be challenging for a lot of companies. We'll see if that changes with RL. I don't know.
I'm sort of skeptical personally, but yeah, again, with these sorts of eval scores, it's really difficult to compete. The frontier is moving so fast and the fact that they chose to release this model as well, you can read a little bit into that. Facebook decided, or sorry, Meta decided not to release the kind of biggest version of the latest Llama series because it
apparently wasn't performing too well. That's the sort of thing that you do if you have a kind of meh release. The fact that they did release this suggests maybe that they don't necessarily have a plan for blowing things out of the water anytime soon, so they might as well get the splash in the meantime. That's one interpretation that you could have. We'll note that the 24 billion parameter
scale is very popular. It's like a good choice. I think that's something that Meta has struggled with is they just keep pumping out these giant models that nobody really wants to use. 24 billion, 32 billion, these are really good sizes for the kind of hardware that people like to run open source models on. So
Yeah, that's great. We'll see where this goes. They certainly are the French national champion and it's going to be worth something. But yeah, they're in a challenging spot. They're in a challenging spot trying to compete on just head-to-head training of frontier models. And they seem to really be keen on, you know, really competing on every front with OpenAI and Anthropic. Last week, they also released Mistral Code, competing with something like Claude Code.
So basically on any given thing people are doing, at least on the LLM side, not necessarily multimodal side, Mistral is trying to compete and, you know, let's not count them out, but they certainly have a tough task to be able to do that. Next up, Eleven Labs, the provider of text-to-speech and text-to-audio models,
has released their v3 model, Eleven v3, which is the latest in their text-to-speech models. It is able to do even more natural-sounding outputs. You can even embed tags like [sighs] or [excited] to get more expressive cues with nuanced delivery. And this supports over 70 languages. So...
Yeah, text-to-speech, I think, is probably less visible to a lot of people than LLMs and image generation and video generation and so on. But it has really come a long way. And I think it's at a point where it will be very hard to tell if something is AI-generated or not. Yeah, and one of the things that's really interesting, it sort of reminds me, on the agentic side, of Anthropic's MCP, the Model Context Protocol, or...
any of these hooks people build as they learn about the structure of a given modality. We're learning here about, okay, what's the user-friendly way to allow developers to program text-to-speech, right? So you indicated one of the upgrades here, right? So you have these special [sighs] or [excited] tags. The example, or one of the examples they give here is,
We did it, exclamation point, and then in square brackets, happily, and then in square brackets, shouts, and then in square brackets, laughs, right? And this is the sort of affordance that you need as a developer, right?
It seems obvious in retrospect, but somebody had to think of it and implement it. So that's really cool. Sort of similar. Another similar thing is this idea of multi-speaker dialogues with realistic conversational flow. So one of the challenges when you're making text to speech is like, how do you know, or how do you define the turns of each speaker? Make sure they don't talk over each other or make sure they do talk over each other if that's what you want.
And so they have a new text to dialogue API where you send structured JSON that defines when each user gets their turn. And then the model automatically takes care of the kind of emotional shifts, the interruptions, the natural flow of that conversation through that lens. So again, it's one of those things where, you know, you sort of don't realize you need it until you start to understand.
produce stuff with text to speech, especially on the entertainment side, or trying to make real kind of natural conversational flow. So really cool. And as you said, a whole bunch of languages supported. So yeah, I mean, Eleven Labs is still doing impressive things. Yeah, Eleven Labs is the market leader in this territory. So definitely worth knowing about.
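For a sense of what the tagged prompts and the text-to-dialogue idea might look like in practice, here's a rough sketch. The endpoint path, payload fields, and model ID below are assumptions based on the announcement, not ElevenLabs' documented API, so check their docs before relying on any of it.

```python
# Rough sketch only: endpoint, payload fields, and model ID are assumptions
# based on the announcement, not ElevenLabs' documented API.
import requests

API_KEY = "YOUR_ELEVENLABS_KEY"  # placeholder

# Structured turns with inline audio tags for expressive delivery.
dialogue = {
    "model_id": "eleven_v3",  # assumed identifier for the v3 model
    "inputs": [
        {"voice_id": "host_a", "text": "We did it! [happily] [shouts] [laughs]"},
        {"voice_id": "host_b", "text": "[sighs] Finally. [whispers] I can't believe it."},
    ],
}

response = requests.post(
    "https://api.elevenlabs.io/v1/text-to-dialogue",  # assumed endpoint
    headers={"xi-api-key": API_KEY},
    json=dialogue,
)
with open("dialogue.mp3", "wb") as f:
    f.write(response.content)  # audio bytes for the rendered conversation
```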
Next, we've got text-to-video. ByteDance is getting into the competition with Seedance 1.0. So it's their latest video generation model. It's trying to compete with Veo 3, the really...
pretty viral video generation model from Google. This one is able to generate five seconds of HD video in about 41 seconds. So it's pretty fast to actually do generation. And ByteDance is apparently planning to integrate Seedance into their platforms like Doubao for both professional and public use.
Yeah, one of the big advantages that they have, of course, being the TikTok parent company is access to tons and tons of video data. I guess this is, you know, makes you wonder a little bit about, I mean, A, they're going to be pilfering YouTube videos left, right and center as well. It's not like that'll stop them.
especially being a Chinese company, not that that's stopped OpenAI in the past. If you remember, like, Mira Murati's sort of famous interview snafu when somebody asked her, I think it was for Sora, right? Where did you get that data? Did you get it from, like, YouTube? And she's like, I forget what she said, but she looked very uncomfortable, and to many people it's pretty clear that some stuff went down. But certainly TikTok has access to,
front row seat access to an exquisite quantity of data. One of the interesting things they call out is that they can handle complex sequences with multiple camera angles and maintain character consistency throughout.
This is, you know, part of that whole world model building thread that people have talked about quite a bit. You know, are text-to-video, are image-to-video models world models? Do they contain world models? One of the big questions, of course, is always, well, if they contain world models, they should be able to model real-world physics. That includes things like object permanence. It includes things like object consistency. And so this is sort of hinting at that, that we don't know much about the architecture itself. And so, you know, maybe some of this is kind of baked in
with inductive priors, and it's not actually sort of learned per se. Difficult to know, but certainly impressive. And the world of convincing AI-generated video, I think it's fair to say, is just basically here at this point. Right. And unlike Veo 3, it is not able to also generate audio. That's pretty much only Veo 3. So Google impressively kind of took the lead in the text-to-video world. And yeah, I think it's good to call out that
most likely it's because they have YouTube and they just can train on YouTube and nobody else can. ByteDance might be able to compete for that reason. Well, and the audio too is no small thing, right? We're entering this world where we're getting positive transfer as these models are trained on more and more modalities.
And video and audio are so causally intertwined, right? Like you imagine trying to make a world model, literally like if you're deaf, like you look at the world, you can create world models, but you can learn faster about the world if you also have the ability to hear. And especially for AI systems, just given that these are not trained with RL, they can't go out into the world and interact with things, having that extra modality to kind of cross-correlate
physics and you see somebody's mouth opens and the sound tends to come out, it's like, okay, that tells you something about the kind of function of the mouth and the physics of it. You know, same with car crashes and the sounds that come from that. So anyway, I actually expect that the inclusion of audio in a single, almost monolithic base model, if you will, is going to be a really big deal for everything from prompt adherence to world model development.
And speaking of Veo 3, Google also had an announcement. They are revealing a $20 AI Pro plan to let people use Veo 3 more. And they are releasing Veo 3 Fast, which is able to do faster generation compared to Veo 3. Veo 3 is fairly...
slow to use. It takes, I forget exactly, but a couple of minutes. So this allows you to generate in, let's say, less than a minute.
And now Gemini Pro subscribers can create up to three videos daily using Veo 3 Fast. And it's definitely seemed to be the case that the servers and GPUs from Google are pretty slammed by people trying to use Veo. A lot of it wasn't working. So I wouldn't be surprised if this was rushed into production to keep up with demand. Yeah, I mean, I continue to tap the sign that
someday fairly soon, we're going to be able to generate one second of video for each second that you wait. In other words, you're going to be able to generate video as fast as you can prompt it to be generated. Once we cross that threshold, there's going to be excess compute on the generation side, which I would expect to start to get dedicated to addiction.
So, you know, imagine your TikTok feed, but if you've got biometric data coming in through, for example, the camera or even just your interactions with the app that cause the video to be modified in real time based on what you're seeing. There's like a very dark rabbit hole for where this ends up going ultimately with the abundance of compute. That threshold is going to be very critical, I think, almost from a societal level in terms of
how we even think about these apps. It's not unlike what the ability to generate fresh apps from scratch based on prompts is doing, right? Where apps themselves suddenly become this malleable thing. Well, this is sort of similar, but for manipulating pixels on a screen to kind of stimulate you,
It's not clear what happens when the optimization process that's running in the back end of these systems operates as quickly as the human biophysical response cycle. That's, I think, a very, very interesting phase that we're getting to. And we're going to see a lot of interesting phenomena, psychological and otherwise, emerge from it. Yeah, I think you could say this is similar to where agents were last year in the sense that we were talking about agents a whole lot earlier.
going back definitely into 2024, but it took until really the last couple of months for agents to really mature and make a huge impact, now with things like Cursor and Claude Code. I think video is in a similar spot, where you're starting to see tools like Flow, a more easy-to-use pipeline to not just prompt it, but actually build something with it. And I think in the coming months,
We will start seeing that actually not just be used for memes, but actually have an impact on workflows and so on. And moving on to applications in business. So we start with this really interesting story. OpenAI and DeepMind are losing engineers to Anthropic in a one-sided talent war.
So there's this venture capital firm called SignalFire. They came out with their 2025 State of Talent report. And they basically look at, like, okay, what's the rate at which we're seeing employees leave OpenAI for Anthropic versus the rate at which we see employees leaving Anthropic for OpenAI, right? So which direction is preferred? So when it comes to
OpenAI and Anthropic, OpenAI employees are leaving eight times more for Anthropic than vice versa. At DeepMind, the ratio is 11 to 1 in Anthropic's favor. So for every Anthropic employee who leaves Anthropic to go to DeepMind, 11 DeepMind employees are leaving DeepMind to go to Anthropic. That's pretty insane. There's all this kind of interesting speculation, by the way, so Anthropic's
retention rate is like 80% for employees hired over the last two years, which in tech is pretty wild. Like I get in the kind of standard world, that doesn't sound too, too impressive. Like, oh, you're still in the same company you were two years ago, 80% of the time. That sounds about right. In AI, that is fairly unusually high. OpenAI's retention rate for two years, by the way, 67%. That's aligned with what you see at Meta, for example. So
There's all kinds of people kind of tossing around ideas about why this might be. One of the often cited hypotheses is like, Anthropic is just sort of coming out of nowhere. They've got the best coding models. That's just really exciting to work for them, blah, blah, blah.
I think that this actually misses the core point, which is Anthropic was a company founded on a very clear principle, and it has stood by, for the most part, those principles. It was founded by these OpenAI policy and safety and some pre-training researchers who left essentially in protest, I mean, this is essentially an open secret now, over OpenAI's sort of attitude and approach to alignment, technical safety, and policy issues.
OpenAI, or Anthropic rather, seems to have walked the walk on a lot of their policy stuff, pushing back on this pretty ridiculous idea of banning all state-level AI regulation for 10 years that was snuck into the latest big, beautiful bill. And anyway, OpenAI seems to have been pushing for something pretty aligned to that, at least in their policy work. So a lot of this is like, you've got an entity where the leadership says something, and then they actually act consistently with it.
Yeah.
The OpenAI folks that we spoke to in our investigations in the past were often, like, really tense. You could sense that they did not want you to tell anybody that we'd spoken, anything like that. Whereas at Anthropic, it's kind of like, yeah, I might have a disagreement with leadership, but you get the sense this is the sort of thing that they would hash out anyway and have spoken to leadership about, and reasonable people can differ. So I think that that's an underrated factor
in all this is just the cultural difference. And I think that's leading the best researchers to flock to Anthropic. And that in turn is the causal element behind, in part, Anthropic's great success with its coding model. So I think it's not all that, but this is a kind of missing element in at least some of the analysis on this issue, just sort of from what I've seen.
Right. And I think, you know, to complement that, the dynamics of OpenAI and Anthropic competing are very different from the dynamics of DeepMind and Anthropic competing, where at DeepMind, if you are preferring to go to Anthropic, it is likely because you don't like big company politics. Yeah. And you don't like a lot of the bureaucracy that has been introduced to
review whether you're allowed to publish your research or whether you're able to contribute to, for instance, Gemini development. Not really a surprise. DeepMind has been around for a long time. It's now officially part of Google. There have been a bunch of reorgs and so on. It seems to be in a bit of a bad shape in terms of organization. And
So in that sense, it's not crazily surprising. I think also DeepMind was quite big and Google has been quite big. So I wouldn't be surprised if Anthropic just had fewer people to lose, to be honest. Yeah, I think that's a big factor. And the other thing is, I mean, Google and Anthropic have a partnership, right? So you're not quite leaving the nest in the same way when you move from one to the other.
Google's made massive investments in Anthropic, right along with Amazon. They're basically the two main backers.
So, and certainly, you know, Google TPUs are a huge part of Anthropic's fleet and strategy. So I think that kind of makes a lot of sense, given that Anthropic, you know, has budded off of OpenAI. It kind of, you know, anyway, it sort of feeds into that narrative of disillusioned OpenAI folks leaving. The other thing, by the way, the money side is interesting, right? This article goes into...
some pretty wild numbers. So they talk about OpenAI. Some OpenAI researchers can earn more than $10 million a year. OpenAI is putting together counteroffers to stop
OpenAI employees from leaving for other companies like Anthropic, like Safe Superintelligence. And these include $2 million retention bonuses. So just like a one-time bonus, $2 million, please don't leave. In addition to, this is insane, equity increases of $20 million or more. Please don't leave me. Here's a crap ton of money. Like this is a lot of money to be throwing at people just as a retention bonus, basically.
Yeah, it sure would have been nice to study LLMs when I was in grad school. Also worth noting in this report, we won't go into it too deeply, but...
It does focus somewhat on entry-level tech jobs in addition, and they're in rough shape. It's increasingly looking like, you know, CS in general has seen a huge rise in undergrad enrollment over the past decade. And for a while, it was sort of the star path to a good job and good earnings. Now, as a fresh grad, it's much tougher to get hired than it used to be. And the number of positions seems to be smaller. And I would not be surprised if AI has a large role in that, in addition to economic conditions and so on. A hundred percent. I think we're in this interesting position where a lot of people, you can still tell yourself the story that, oh, it's because of tariffs, it's because of the economy or things like this. But I'll tell you, I mean, I had a conversation with a very senior person at one of the top labs. And what they were telling me was
We are no longer hiring entry-level software engineers. We don't expect ever to do that again. And in fact, we don't think we'll be hiring anyone with less than 10 years of experience ever again.
And when you hear that, it just makes it real where it's like, ah, this is where it's coming from. And this is a lab that already is seeing the majority of its code base written by AI, which that's not surprising to us. This is something we've been covering for a long time, but I think you have to kind of sit back and absorb that reality that the job of software engineers, the job even of AI researchers is getting more and more abstract and further away from, anyway, many of the activities that used to define them. And that just makes it
I mean, it's brutal. We're headed for a situation where white collar gets automated pretty hard, pretty fast. And there's social unrest that will come with that. I mean, there's no two ways about it. We've got a very interesting transition we're going to have to navigate gracefully. Yeah, and it is happening quite fast. So, you know, 2023, 2024, 2022 to some extent, we saw the rise of intelligent AI assistants in things like
Copilot and Cursor, and that had a massive productivity boost. You're twice as productive, three times as productive. With these agentic tools like Claude Code, which are now working well, it's getting to a point where you barely need to touch code as a software engineer. What you need to do is be able to tell the agent what to do and to inspect what it's doing to verify that it's correct. And that's not what an entry-level position entails typically. So
It's changing fast. And yeah, it's worth being aware of that. And moving right along, another, I guess, another OpenAI story, not that the last one was all OpenAI. OpenAI slams court order to save all ChatGPT logs, including deleted chats. So essentially what's happened is there was a court order that came in and said, look, OpenAI is being accused of
Essentially serving as a platform that allows users to get around paywalls and access news and New York Times articles and things like that. And what's more, we suspect that
that users are going to be deleting the evidence of that so that if we actually require, if the court requests records of people's use of the tool, they're not going to actually show these violations of copyright and all that stuff. And so the New York Times argued for the court to
prevent OpenAI essentially from deleting or discarding information about ChatGPT logs that otherwise would have been deleted, including records that users have tried to delete, right? So OpenAI is calling this out as basically a way of preventing OpenAI from respecting its users' privacy decisions. It essentially puts OpenAI in this awful position where they are at risk of breaching their own privacy agreements,
which, you know, is a huge, huge trust issue, but also, I mean, it could put them in breach of contracts and global privacy regulations, all kinds of stuff. So this is really messy. I can see OpenAI's argument here, that to just lurch out and do this seems like a strange strategy. But, you know, I'm not a lawyer, so it's hard to know. There's so little precedent in general on cases like this, but-
Yeah. So the idea of using ChatGPT to skirt paywalls does sound plausible, I guess. But the question is, how do you actually manage that? Is the best way to force essentially a kind of de facto privacy violation onto OpenAI users? I don't know what the answer is, but this is the state of the debate anyway.
Right. And OpenAI even released a blog post, "How we're responding to the New York Times' data demands in order to protect user privacy," where they frame it as a privacy question, as kind of a commitment to their customers, and address, for instance, that there are business customers that use zero data retention APIs, where the chat logs aren't going to be kept.
But OpenAI has had this interesting pattern of releasing blog posts in response to legal drama. And this one is very much along that line, has a lot of notes in response to it. So OpenAI is a little salty and not a fan of this court order, clearly.
Next up in the lightning round, we are starting with a story from the information, which typically has far more cutting edge or let's say less public information. And this one is saying that NVIDIA's biggest Chinese rival Huawei struggles to win at home. So this is
pretty much an analysis as to what extent Huawei is able to beat out Nvidia in terms of providing chips. And it seems to be that so far, Huawei is unable to get the biggest tech companies in China to adopt its chips for AI training and inference.
Yeah, this is actually a really interesting story, because the story that the NVIDIAs of the world have been propagating, that a lot of kind of anti-export-control people have been propagating, is that, hey, you know, we withdraw from the Chinese market and, like, Huawei is just going to dominate it, and it just creates a whole bunch of economic wind in their sails. And this is not entirely wrong, but there's an awful lot kind of missing in that analysis. So one key thing to keep in mind is
Huawei does not have access to the most exquisite fabrication processes that are available to Western companies, thanks to TSMC, which is based in Taiwan, of course. So TSMC can help you fab down to three nanometers now, and we'll have chips that come off the production line using the three nanometer process in the relatively near term. Huawei can only use the domestic, the Chinese analog to TSMC, which is SMIC.
SMIC is roughly speaking stuck right now at seven nanometers, maybe arguably working on five. So it's forced to use a subpar fabrication process. Huawei designs the chips and then they send them to SMIC for fabrication. The problem is you can only do so much when you have limitations, fundamental limitations on your design process.
In particular, if you look at the Huawei chip series, what they will tend to do is they'll be very energy inefficient. If you want to get very energy efficient chips, you have to get more advanced processes.
So we talked about how Huawei has been working around that. They just set up this CloudMatrix 384, which is like their computing system that bundles up a bunch of their Ascend chips together in a way that is designed to just say, okay, our individual chips may be crappier because they're fabricated using a weaker process, but we can just string a bunch of them together like this,
build larger systems with larger data centers. And because China is swimming in energy in a way that America just isn't, America's energy constrained, China's chip constrained, China doesn't really care about the energy efficiency of the chips that much. They can just put more of them together and achieve the same scale. And that's really what they've been doing. The catch though is
overheating. If your fabrication process is bad, if you're going to basically, like, overpower your chips and just pour tons of energy into them, then the chips will overheat and you will see problems. That's exactly what seems to be going on and what seems to be hampering a lot of Huawei's sales activities.
The Ascend chips also, by the way, can't handle direct support for low precision formats, like number formats, like FP8, which notably is what DeepSeek uses. So Huawei literally, like their chips cannot support DeepSeek style training runs, which is why DeepSeek has been using NVIDIA technology and why the demand for it continues. One last factor that's really important to keep in mind is that Huawei competes with a lot of their customers. Think about ByteDance, Alibaba, Tencent, right? These companies...
They're all looking into Huawei chips. They haven't made big purchases. Part of that is because a lot of them run their own clouds. Huawei runs its own cloud too. And so are you really going to buy from your competitor? I mean, this is the reason, if you go back to our hardware episode, this is the reason that pure-play foundries were a thing, right? That Intel, for example, historically struggled to attract customers,
chip designer customers because they also were designing chips. And so you're sort of like buying from your competitor. What the market fundamentally wants is it kind of does want a separate foundry, a separate designer, and then ultimately a separate cloud company. And it's not a coincidence that NVIDIA isn't so much in the cloud market. They could be if they wanted, right? They could make big clouds. You could have NVIDIA right up there with GCP, with Azure, with AWS, but they're not doing it
Part of that surely is going to be competitive reasons. Let's just have people buy our chips and reduce the barrier to entry on that as much as we can. And anyway, so Huawei is in a more complex situation than I think a lot of analysis historically has acknowledged. We'll see where it ends up going. And they are a national champion. So the CCP can always force people to buy from them. But it's an interesting scene. Right. And also mentioned in this article, and I think it's worth noting,
Some companies like ByteDance and Tencent have significant business outside of China. And the US is cracking down more and more and has issued guidance that basically says don't use Huawei chips. So if you are a more globalized company based in China, that's even more reason to prefer Nvidia over Huawei.
Our next story is sort of related, actually. Huawei expected to break semiconductor barriers with development of high-end three nanometer GAA chips, tape out by 2026. Okay, so GAA is gate-all-around. This is a transistor design that is becoming really popular. It's a way of essentially making the transistors that form the critical circuits, the number-crunching circuits on GPU logic dies, more energy efficient, with higher throughput, all kinds of desirable thermal properties, et cetera. So essentially what's happening right now is the three nanometer process that
for example, TSMC has developed, does not actually plan to use GAA. So it's not going to be a gate all around process. Huawei is accelerating towards GAA. That's the plan here. Essentially skipping a generation, which you kind of have to do if you're the underdog and trying to catch up. But the challenge is,
Right now, it's not really clear that they can pull this off. You know, their seven nanometer, their five nanometer, even their seven nanometer process that they get through SMIC, which we just talked about, that sort of Chinese TSMC,
has really bad yields. The seven nanometer yields are somewhere between 15 and 50%, whereas, I mean, the industry standard is like 90%. Anyway, so there are major economic challenges, but if they can somehow do that, that would be really interesting. It would be a big leap. The only other gate-all-around-focused design for three nanometers is being done at Samsung Foundry. So this would literally be the first non-Samsung Foundry product
if in fact it is non-Samsung, if they're doing it through SMIC, which again would be kind of weird, it's also possible this implies a collaboration with Samsung Foundry, which would be really weird because Samsung is of course based in South Korea. So this would be interesting from an export control standpoint. Can this actually work?
But anyway, so Huawei has been known to make optimistic kind of pronouncements about the future of their technology. Hey, we'll have all these exciting things that don't quite end up taping out, if you will. We'll see. But three nanometer gate all around would be a big deal if Huawei can actually crack it.
Yeah, not much to add. All I'll say is if you Google gate-all-around and look at the images, there are some really fun illustrations and electron microscopy images. And you get a feel for these poor computer engineers and semiconductor experts. You need to go 3D and build these elaborate structures now just to be able to go into these low nanometer regimes and actually make chips work.
And speaking of that, next, we've got a story about TSMC and their 1.4 nanometer process, which is called Angstrom, which is making progress. It's still not out. It's expected to be available by 2028. And according to the story, it's estimated to cost $45,000 per wafer, a 50% increase over the two nanometer process, which is...
$30,000 per wafer. So yeah, that's pretty much it. It's got to be very expensive to use the really lowest, like most high density chips that are coming online in the coming years. Yeah. So 1.4 nanometer, they're calling it Angstrom, which is like slightly frustrating because it's not quite an Angstrom, is it? But that's cool. This is the next beat. Yeah, 50% more expensive than
Apparently, 2028 is going to be the earliest production run. So if AI 2027, that sort of famous blog post, ends up being wrong and 2028 ends up mattering,
We'll probably see in 2029 some pretty impressive rollouts of the next generation of node and the chips designed on it. By the way, they're assessing that if there's a company that would want first crack at this Angstrom process, it would be Apple. I would just say, we've been saying this on the podcast, do not take your eye off NVIDIA, which, by the way, is literally the world's most valuable company right now.
As AI chips become more and more valuable relative to phones, expect at some point that NVIDIA starts to make moves to compete for the leading node to essentially buy out Apple of all of TSMC's capacity and kind of become the subsidizer of choice for TSMC for their leading nodes. I actually think that could happen sooner rather than later. There are indications it's already sort of in the works. So anyway, that would be a pretty significant shift in tech. And the day that happens, we'll definitely be talking about it here.
Fun fact, Angstrom is 10 to the negative 10 meters or 0.1 nanometers. So as you said, not really an accurate name at all. Yeah, yeah, no. It's a good name. Sounds good. Sounds fun. And last story, coming back to Mistral, they're launching Mistral Compute, which is a cloud offering for...
compute for AI that is going to try to compete with other offerings. I suppose these days, AWS is still one of the leading ones. You also have newer competitors in the space, like Modal and others.
So Mistral, again, continuing to try and kind of on every front provide a European version competitor to offerings both in China and the US. And they are coming at this from a position of less money, less talent, you might expect or might argue. So we'll see. The main kind of
analysis of their advantages, I think I agree with you, is their position as a European leader in the space.
Yeah, yeah. And in particular, it's no small deal that they're based in France. You know, you think about what are the big bottlenecks? We talked about this right in the United States. It's energy, right? Everybody's trying to figure out where can I find a spare gigawatt on the grid? It is not easy. You know, even 30 megawatts that you like, you can find it, but it's going fast. And so in France, where they have really, it's the only European country, the only Western country that's been doing nuclear this whole time.
where they can actually build new nuclear plants in less than 10 freaking years, they can support this. And now they're reaping the benefits. The scale that's being talked about here for Mistral Compute, by the way, is tens of thousands of GPUs. They say built on NVIDIA reference architectures. And so I assume that they must be looking at this point at like GB200s, tens of thousands of those, I assume.
And they're saying that they'll be supporting workloads ranging from defense to drug discovery. Okay. National champion much, right? This is the kind of workload that smells a lot like, you know, preferred partner of the French government, which, by the way, also from a red tape standpoint, if you're trying to set up a new scale data center, not only do you have the massive
energy supply that the French enjoy, but you also have the support of the government to cut red tape, especially environmental regulations that allow you to get things up and running faster. These things do stack up in very interesting ways to compete another day, let's say. But
I think their fundamental challenge is going to be capitalization, right? That's always how it's going to be. You can't compete forever with companies that will raise tens of billions of dollars on $100 billion valuations, like not even taking that much of a liquidity hit and raising from sovereign wealth funds and this and that. It just does become really challenging. And the French economy just isn't that big. So yeah, if I were France, this is what I'd be doing. But that doesn't mean that they necessarily have a winning hand.
Yeah, as you said, in this blog post of theirs, they are literally saying the offering will include Mistral AI's training suite that can accelerate region- and domain-specific efforts across nation- and industry-wide endeavors. So yeah, calling out some of that champion idea.
And I will say it's a little bit different from OpenAI and Anthropic. They're not offering this kind of cloud architecture for training and serving and whatever else. And it is rather specialized. I would assume this came out of Mistral having to develop their own software
setup for compute to be able to do this. So I do think there is a decent chance that they have some good technological aspects here that might make it actually quite a good product.
And next up, moving to open source, we have one story, ProRL. And for whatever reason, I keep saying ProPL every time we talk about it offline. ProRL, prolonged reinforcement learning, expands reasoning boundaries in large language models. Bit of a mouthful, but hey, aren't they all? So
There's this idea that the RL process itself just optimizes existing capabilities in large language models. Basically, it's like you have your pre-trained model and it already kind of has all the capabilities that reasoning model should have. And your reinforcement learning process just elicits those capabilities. It bubbles them up to the surface, right? So
What they're after here is to show, actually, that's not the case. What we can do is imbue the model with completely, genuinely new capabilities that were not there before. And they have a couple of ideas that they stack together to just, like, optimize the reinforcement learning process. One of which is this idea of the Kullback-Leibler divergence, the KL divergence. So this is essentially a way of measuring how different two distributions are, like probability distributions.
And so what's often done during training is you'll have a model that's being trained and you'll have some kind of reference model where you don't allow the model under training to deviate too much from the reference model. The reason for this often is that if you just let the model go hog wild and get trained on its own to whatever it will end up being.
That model will learn to kind of optimize very narrowly and unhelpfully over-optimize to the objective that it's being trained for. So in the limit, the classic example is if you let these models get fine-tuned for too long without a kind of regularization, they'll end up, like, no longer speaking English, or they'll end up, you know, kind of rigging their reward or becoming sycophantic or whatever. And so you just have this reference model to keep pulling it back to reality. And
There've been arguments that this KL divergence penalty is a bad thing, that you actually should just get rid of it. A lot of those arguments are based on looking at base models and like before the supervised fine tuning stage in the context of reinforcement learning. And what you find there is their performance actually doesn't get so good if you keep enforcing that they have to be similar to the reference model.
But what they're showing in this paper is actually if you do supervised fine tuning first to let the model get good enough at reasoning, at that point, if you then use that as the reference model, you actually do find that the KL divergence strategy makes sense, that regularization strategy. So that's one thing they did.
They also did this thing called reference policy reset. So as you train your model, again, you've got that reference policy. So it's not allowed to deviate too, too much, but then you'll update your reference policy to match whatever the model under training currently is.
And then you'll proceed. So you're basically using the reference policy as a kind of drag on the model under training. The model under training does a bunch of training. It can't deviate too much, but then you update the reference model and now you can start training again and you can deviate a little bit more, but not too much from that one. So it has a way of sort of
slowing down the deviation from the reference model, but not so much that you're eternally locked in to the original reference model. And that turns out to help a lot with training stability while also allowing you to kind of recover a lot of these new capabilities that come with reinforcement learning. So they have a huge data set of a bunch of different STEM, logic puzzle, instruction following, and data tasks. It's like 136,000 problems in math and code and all kinds of stuff.
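To make those two ideas concrete, here is a minimal, illustrative sketch of KL-regularized RL with periodic reference-policy resets. The function names, the `logprobs` helper, and the hyperparameters are our own placeholders, not the paper's actual implementation.

```python
# Illustrative sketch of the ProRL-style ideas described above: a KL penalty
# against a frozen reference policy, plus periodic reference resets.
# Names, the `logprobs` helper, and hyperparameters are placeholders.
import copy
import torch

def kl_penalty(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # Monte Carlo estimate of KL(pi || pi_ref) from samples drawn from pi.
    return (logprobs - ref_logprobs).mean()

def train_prorl_style(policy, make_batch, rl_loss_fn, optimizer,
                      beta=0.01, reset_every=2000, total_steps=10_000):
    ref_policy = copy.deepcopy(policy).eval()  # frozen anchor model
    for step in range(total_steps):
        batch = make_batch()                   # prompts, sampled responses, rewards
        logprobs = policy.logprobs(batch)      # assumed helper on the policy
        with torch.no_grad():
            ref_logprobs = ref_policy.logprobs(batch)

        # Base RL objective (e.g., a GRPO/PPO-style loss) plus the KL term
        # that keeps the policy from drifting too far from the anchor.
        loss = rl_loss_fn(batch, logprobs) + beta * kl_penalty(logprobs, ref_logprobs)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Reference policy reset: periodically snap the anchor to the current
        # policy so training can keep moving, but only drift slowly.
        if (step + 1) % reset_every == 0:
            ref_policy = copy.deepcopy(policy).eval()
```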
They also have an enhanced version of this GRPO algorithm, which you might remember from our discussions of DeepSeek. It's become really popular, just sort of a way of stabilizing reinforcement learning training. This quickly gets into the weeds, but...
Yeah. Bottom line is they're borrowing a lot of stuff from other papers like DAPO, with things like dynamic sampling, where you're basically filtering out prompts to only keep the ones where the model sometimes succeeds and sometimes fails. So they're, like, hard enough that the model is going to learn something by training on them, but not so hard that it's just hopeless and the model never even gets a reward signal. So
There's all kinds of shit. It's actually quite an interesting collection of shit. The shit links together in interesting ways to make a little shit chain. And together, that is ProRL. Not how I would have described it, but okay. Yeah, some interesting analysis in this paper. It's a family show. Yeah, I don't know how many kids listen to Last Week in AI. I hope not many.
Yeah, they have some analysis about the question of ProRL eliciting new reasoning patterns or not. They basically make the point that there are tasks on which the base models are already pretty good, and there the gain is not significant, but there are other tasks where the gain is significant if you train long enough. And I just want to call out, we're not going to go into detail on the story, but alongside the Magistral model,
Mistral did release a report on it, a pretty detailed 18-page paper. And they did also highlight some differences in their loss for GRPO, including the elimination of KL divergence as a penalty and some other stuff. So very much a lot of exploration going on into the right setup for RL training, including the loss.
RL in general is a big headache. So I guess it's not surprising that there are a lot of things being figured out, over the previous months and even now, as people dive into RL as a very prominent research direction.
Next up, research and advancements. We begin with Kinetics: Rethinking Test-Time Scaling Laws. So there is a new proposal for test-time scaling that incorporates memory access into the calculation of the cost. So this is a different way to calculate the scaling law, basically, for test-time scaling.
And in this new way of evaluating the scaling with updated cost, they argue that prior scaling laws have overestimated the effectiveness of small models that have inference time strategies. They're basically saying that
Increasing model size up to 14 billion parameters is more effective before applying test-time strategies like best-of-N sampling and chain of thought. So basically, instead of running your model more after training, for smaller models in, like, the 10 billion range, just make your model bigger instead of doing more inference on it if you can.
Yeah, this is a really interesting kind of compute aware, not compute aware, memory bandwidth aware way of doing things. So historically, when we talk about scaling laws, right, you'll see these plots. What do they look like? Well, you usually have flops like computing budget on the X axis, and you'll have some measure of performance on the Y axis. And then you'll see your nice little log plot and everything is good.
The problem is that flops, like the actual mathematical operations that go into training a model, are only one part of the hardware picture, right? So GPUs, yes, can crunch a lot of numbers really fast, but they also have to move data around.
right? That's one of the most time-consuming things. One of the big bottlenecks now is just how fast you can move the data around, not just crunch the numbers, but shift it from memory to logic and back and then to other memory and things like that. And so what they're trying to do here is redesign a scaling law that accounts for that, for, in other words, two metrics. One is flops, as in the traditional compute scaling curves, but also memory bandwidth,
or sort of memory access cost, which accounts for the bytes of memory that need to be accessed, the memory picture, right? And so they're actually going to combine them both into one metric. They call it the eFLOP or eFLOPs. Essentially, mathematically, it's the computational cost of running the model plus the memory access cost, which accounts for the memory bandwidth requirements and other things that go into it, times the intensity, which is essentially
a hardware-specific ratio of compute capacity to memory bandwidth. Basically, as you can imagine, this would depend heavily on your hardware fleet. Like, what does your hardware actually look like? That
is going to determine in practice what your ideal number of parameters should be, what your ideal architecture should be. And so this is part of the reason that scaling laws, by the way, always were framed in terms of flops, because the moment you kind of try to balance flops and memory bandwidth, pretty soon you start to almost simulate a data center. And like, you're going to have to have like all kinds of resolution parameters
And that just makes it really hard, not least because then people will go, okay, well, that's how it plays out on that data center. But what if I changed my data center around? Now we've got a different scaling curve, and it becomes impossible to do apples to apples. That, in fact, is one of the challenges with this paper. It only uses a reference architecture associated with the NVIDIA B200 GPU. So they are assuming those specs hold, and you're seeing the scaling laws for that. It does not look at
different scaling laws on different accelerators from, like, AMD or Intel or other NVIDIA chips, or different networking or interconnect configurations, or different memory hierarchies, none of that. So think of this as kind of more of a vibe thing. But in terms of what we can learn from this, I think there are actually some really cool things. So in practice, when you scale up a transformer architecture,
what you'll tend to do as a developer is increase the size of the MLP layers, right? Much faster than you scale the attention mechanism. You could scale the attention mechanism: increase the number of attention heads, head dimension, the embedding dimensions, all that stuff. But people in practice tend to just increase the scale of the MLP layers that sort of do the logic, instead of the attention piece. Now, the intuition a lot of people have is like, okay, well, that shouldn't matter, right?
So because we're just going to be scaling the MLPs, they already represent the lion's share of the compute and parameter count to begin with, right? So surely the MLP layers are already the bottleneck. So the fact that the attention mechanism is scaled more slowly, well, that shouldn't matter, right? But here's the catch.
The compute required by your MLP layers scales with the length of your input, right? So double the length of the input and, roughly speaking, you double the amount of compute that your MLP layers consume. Fine. But as you increase the size of your input, the attention memory bandwidth requirements scale with the length of the input squared, right?
So in other words, very rapidly, as you scale the length of the input, the attention memory bandwidth piece starts to become the rate-limiting step, and your operations become memory bound, because you're bottlenecked by the attention layer. And so
this has become more and more of an issue because the length of inputs and outputs is getting greater and greater, right? With these kind of best-of-n schemes, inference-time compute, reasoning, all that stuff, you're seeing your inputs and outputs get longer and longer, which means that
bottlenecks that scale with the square of the input length quickly overtake bottlenecks that scale just linearly with the input length. And it turns out that attention memory traffic scales with the square, and that's why we run into this problem. So anyway, I thought this was a really, really important paper. If you're interested in understanding the consequences of hardware choices for model architecture, this is actually quite fascinating, and something I just haven't seen other people dig into: these more nuanced scaling laws.
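To make the linear-versus-quadratic point concrete, here is a rough back-of-the-envelope sketch, not the paper's code; the model dimensions are arbitrary example values, and plenty of real-world details like batching, grouped-query attention, and weight loading are ignored.

```python
def decode_cost_estimates(seq_len: int, d_model: int = 4096, d_ff: int = 14336,
                          n_layers: int = 32, bytes_per_elem: int = 2):
    """Very rough totals for autoregressively decoding seq_len tokens.

    MLP compute per token is constant, so its total grows linearly in seq_len.
    Attention must read the whole KV cache at every step, so its total memory
    traffic grows roughly with seq_len squared.
    """
    # MLP FLOPs: two projections (up and down) per layer, per token.
    mlp_flops = seq_len * n_layers * 2 * (2 * d_model * d_ff)

    # KV-cache bytes read: at step t we read ~t cached keys and values per layer.
    kv_bytes = sum(range(1, seq_len + 1)) * n_layers * 2 * d_model * bytes_per_elem

    return mlp_flops, kv_bytes

# Doubling the generation length roughly doubles MLP flops but ~4x the KV traffic.
for length in (1024, 2048, 4096):
    flops, kv = decode_cost_estimates(length)
    print(length, f"{flops:.2e} MLP FLOPs", f"{kv:.2e} KV bytes read")
```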
Right. Yeah, the very first sentence in the abstract says they're coming at this from a practical efficiency perspective. And to your point about what is on the X axis, they're very direct: it's B200 seconds, time on the B200 GPU, which is the leading edge. Instead of looking at abstract computation, they are looking at the literal number of seconds it takes to get to some level of accuracy.
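And as a minimal sketch of that combined cost metric, flops plus memory traffic weighted by a hardware intensity ratio as described above, with illustrative placeholder peak specs rather than the paper's exact B200 constants:

```python
def eflops(flops: float, mem_bytes: float,
           peak_flops_per_s: float = 2.25e15, peak_bytes_per_s: float = 8.0e12) -> float:
    """Combined cost: compute plus memory traffic, where each byte moved is
    charged in 'equivalent flops' via the hardware's intensity, the ratio of
    peak compute to peak memory bandwidth."""
    intensity = peak_flops_per_s / peak_bytes_per_s
    return flops + mem_bytes * intensity
```

You could feed rough per-generation flop and byte totals, like the ones from the previous sketch, into this to see how quickly the memory term dominates at long sequence lengths.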
Lots of really good analysis in this paper. They also have a really nice blog post. And I feel like we often call out when papers come from Apple or DeepMind or Anthropic, so worth mentioning, this is from CMU, fully university work. Also, the two lead authors are immigrants to the US system. Not that we should get into it, but I do want to say, with some of the policies about
grad students, and in general about taking in grad students from other countries, you look at these papers and it makes me feel a little depressed. But anyway, moving on. The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning. This is looking at RLVR, reinforcement learning with
verifiable rewards, in two paradigms. You've got positive sample reinforcement and negative sample reinforcement, where PSR focuses on reinforcing correct responses and NSR, negative sample reinforcement, emphasizes penalizing incorrect ones. And it seems that you can do positive-sample-only and negative-sample-only training, and
PSR-only, positive-only, improves pass@1 but reduces pass@k at higher k, like pass@10. So basically, if you get a few opportunities to get it right, you're not necessarily going to do better, and that's because there seems to be a loss of output diversity. Negative-only, by contrast, apparently improves performance across all pass@k metrics, so not just one trial, but several trials.
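Quick refresher for anyone unfamiliar with pass@k: it's the probability that at least one of k sampled attempts is correct, and a commonly used unbiased estimator looks like this.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: probability that at least one of k samples
    is correct, given n total samples of which c were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples, 4 correct: pass@1 is 0.25, pass@10 is much higher.
print(pass_at_k(16, 4, 1), pass_at_k(16, 4, 10))
```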
So it might be better to focus on
penalizing incorrect outputs over encouraging the model to do the same stuff that already seems to work. Yeah, actually, I'm surprised at how intuitive this result seems to be. Imagine if you were being trained to do some complex task, and the way you're being trained is not by being told when you did something right, but just when you did something wrong. That has a way of not telling you how to do your job,
but telling you how not to do your job. And that means you're going to be more creative. If the reinforcement tells you, here's the right answer, do it like this, versus, don't do it the wrong way, that's a very different kind of reinforcement process. It's a little bit difficult to analogize because it's post hoc, right? So imagine that you try a task, and if you did it right,
we just wipe your brain and you have no memory of doing it right, but if you did it wrong, we tell you, hey, you did it wrong. That's kind of what we're doing with these models with this sort of setup, which is really interesting. And the results do bear out that you get more diversity,
sort of more exploration-oriented models rather than exploitation-oriented models. Because what you're really doing is redistributing probability mass to plausible strategies rather than concentrating all your probability mass into the small number of observed correct paths, right? Because this is one of the things with RL: you're not going to get to observe all the correct paths, right?
You're also not going to be able to observe all the incorrect paths, but at least by not calling out the correct ones and saying, do it more like that, you're leaving the possibility space open for the model to pursue alternate correct ones. So anyway, really interesting. One question that came to mind as I was reading this: wouldn't you run into a problem where, over time, if your model gets better and better at a task,
you just can't find enough negative samples in a batch, like for GRPO? And yes, this is actually an issue, and they call it out. They frame it as a feature and not a bug, which I think is somewhat true, with some asterisks. They point out that it does prevent overfitting, because you just won't get updates once the model really masters the problem set. You'll just run out of failure cases, and so you won't over-
optimize the model to the point of overfitting, which is really cool. The flip side, though, is that it's kind of compute inefficient, right? Because you then have to do a lot of rollouts that don't yield any trainable data. And so I think from a compute optimality standpoint, you're also taking a bit of an L. So they actually suggest this kind of
middle-ground strategy they call weighted REINFORCE, where you still use some positive reinforcement, as they put it at 10% strength, to ensure continued learning, but you use full-strength negative reinforcement. So really lean more towards telling the model what not to do,
with a little bit of guidance about what to do. So anyway, that kind of helps because you're retaining some of those positive examples. But again, from a compute optimality standpoint, it'd be interesting to see how this ends up scaling. Yeah, this is one of the somewhat nuanced aspects of reinforcement learning. To actually do good reinforcement learning, you need to model the reward for any given output, and to do that, you need to be aware of both positive rewards and negative rewards. So
it's interesting to focus more on the negative rewards. Basically, their weighted REINFORCE upweights the negative aspect. And they compare this weighted REINFORCE against standard GRPO, PPO, these other RL training setups with their own objectives and losses. And it looks like, from their results on Qwen 2.5, worth noting all these reasoning model papers are looking at one particular model, which
may not be ideal, but anyway, this weighted REINFORCE setup seems to be better than GRPO and PPO, which is pretty significant since GRPO is often what people are exploring in this research, like I mentioned previously.
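Here is a rough sketch of the kind of asymmetric weighting being described, full-strength penalties on incorrect samples and down-weighted rewards on correct ones. This is an illustrative REINFORCE-style loss with a hypothetical 10% positive weight, not the paper's exact formulation.

```python
import torch

def weighted_reinforce_loss(logp: torch.Tensor, correct: torch.Tensor,
                            pos_weight: float = 0.1) -> torch.Tensor:
    """REINFORCE-style loss that keeps negative reinforcement at full strength
    and scales positive reinforcement down (e.g. to 10%).

    logp: summed log-probs of each sampled completion, shape (batch,).
    correct: 1.0 if the completion was verified correct, else 0.0.
    """
    # Reward +pos_weight for correct samples, -1 for incorrect ones.
    reward = torch.where(correct.bool(),
                         torch.full_like(logp, pos_weight),
                         torch.full_like(logp, -1.0))
    # Maximize the reward-weighted log-likelihood, i.e. minimize its negative.
    return -(reward * logp).mean()
```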
A couple more research papers. Next up, we have Predicting Empirical AI Research Outcomes with Language Models. So that's pretty much what it sounds like: you want to try and predict what will happen
in a given experiment using a language model. They created a benchmark here by scraping ideas and results from conference papers and ended up with around 1,500 test examples. And then with their whole system, a fine-tuned GPT-4.1 plus paper retrieval, they were able to get 77% accuracy on the test set at performing the
prediction, significantly better than the off-the-shelf performance of baseline existing models. So pretty good results. They say it outperforms a human expert baseline on NLP idea pairs. But, you know, it's still, let's say, nascent. It's an interesting idea, but definitely a nuanced area to look into, and it requires careful extrapolation.
Yeah, it's one of those areas too where people often talk about AI models. The big advantage is going to be in having good taste regarding the problems that we throw them at. This is an example of AI models actually developing taste, the automation of taste itself, right? Research taste.
If you can predict how likely a given idea is to pan out, that's sort of the idea here. So the way they do it in practice is they're going to go within a given paper, right? You often see multiple methods used to achieve the same goal, right? And you can imagine how hard it would be. They're not going to go and grab two different papers that try to do similar things and predict which one is going to work better because it's impossible to get apples to apples. People use different training strategies, different data, all kinds of shit.
So what they're going to do is same paper, multiple methods. They're going to extract pairs of essentially experiments in the papers that compare different approaches. And that's what they're going to use to construct their data set. So that's kind of more appropriately calibrated kind of apples to apples comparison.
And so in that sense, it is predicting AI research outcomes, but it's not quite the same as evaluating a new research hypothesis from scratch. It's not at the paper level, like, all right, which paper should I pursue? So "predicting" is maybe a little misleading. It's comparing two potential ideas and predicting which one will get a higher number on a benchmark. So it's a binary prediction, which is a
slightly easier setup than saying, if I were to try this idea, what result would I get? Yeah, exactly. I think in order to do it at the paper level, which is the most interesting thing, you'd probably need a very complex data filtering and shaping approach, where you try to get it to be apples to apples as much as you can and then feed it into a model. But the interesting thing here, like you called out, is that this fine-tuned model does better than O3:
models like O3 perform no better than random guessing. And so when you're looking at 77% accuracy on this benchmark, predicting kind of which of two ideas is going to do best, obviously random guessing is 50%. So that's quite a lift.
It bears mentioning that it achieves about 64% accuracy on unpublished novel ideas. So there's some amount of overfitting going on here, where we're getting 77% on the test set, but when they actually tried it on new ideas that are unpublished, it goes down to 64%. Still much better than 50-50, but yeah, pretty remarkable. The other funny thing is, if I'm interpreting this right, it says they beat human experts outright.
Human experts scored 48.9%, which is slightly worse than random guessing, if that is apples to apples, if it's just a side-by-side comparison.
So that's kind of amusing in and of itself. Humans kind of suck at this themselves, and they are really getting some sort of lift from the fine-tuning approach here. Going from roughly 50% to 64% is not tiny. And one last paper, also related to AI contributing to research. In this case, it's called EXP-Bench, and it's focused on benchmarking AI agents' ability to conduct end-to-end research experiments.
Also using tasks from published research. Here they looked at peer-reviewed AI publications from NeurIPS and ICLR and created this benchmark of 461 research tasks from 51 papers. And they basically ask: can AI agents do the experiments
introduced in these papers themselves? With published papers, ideally, the authors publish their code so you can replicate the experiment, get the same output, and reproduce whatever tables of numbers they report. That gives you a rich signal as to how the experiment should be set up and whether you've actually replicated it.
So this makes it possible to evaluate whether AI agents are able to do that, and the summary is that they struggle to implement things and get them correct. Yeah, I will say we're getting to the point where the benchmarks we're designing are so hard that, once you actually do saturate these,
like, I mean, what does the world look like when you're hitting 50% on EXP-Bench? A 50% success rate for end-to-end automation of the process of formulating hypotheses, designing and implementing experimental procedures, executing them, analyzing the results, that whole loop. That's not far from fully automated AI R&D, right? At least
at the model level; there's obviously a bunch of hardware and network optimization work that, independently, OpenAI is doing internally. But what does the world look like when this is actually saturated? That's worth asking right now. When you look at O3 Mini, which is the best model they tested overall (O3 Pro was not out at the time),
1.4%, or six or seven out of the 461 tasks they tossed at it, were completed successfully. So one read on that is: 1.4%, wow, that's really small. Another read is: wow, we're actually getting a complete end-to-end success rate of between one and two percent with our best model today,
in a context where new models are coming online like every other week. So yeah, I don't know, this may be a bigger deal than it sounds. That's a pretty big 1.4%, at least in my mind. Right. And to give you an idea of
what is involved: the inputs include a research question, for example, does the MogaNet architecture outperform existing lightweight models? They include a high-level method for the experiment, like train the MogaNet variants on ImageNet-1K for blah, blah, blah. And they give it some starter code, with potentially additional instructions. So the job of the AI
isn't to do the research per se; it's to set up the experiment, run it, and get the results. That means it needs to correctly configure the code, add to the code, train for the correct amount of time with the correct parameters, and evaluate the outputs to get the actual result of the experiment. So yeah, if you are able to do this,
there are a lot of conversations around speeding up research, speeding up AI research, and eventually possibly automating it. We've seen research on ideation, on idea creation; we've covered some of those papers and seen some progress there. This is really covering the opposite side, where you are able to evaluate ideas with experiments. It's somewhat different and
pretty closely guided. I will say, in the inputs it's told pretty much exactly what to do, but still, it's a very useful kind of capability for speeding up research. Yeah. I guess, just as you said, it's automating the implementation phase of AI research, which is actually an interesting complement, right, to the idea we just saw earlier of trying to automate taste in language models. Both are sort of at nascent phases, but
you know, I'm old enough to remember when Humanity's Last Exam was supposed to be a very hard benchmark, and before that GPQA, and before that... yeah, anyway, there's like a million of these, right? It starts at 1% and then very quickly we start saturating. So yeah, these darn kids starting their PhDs don't even have to write code. All the code is done by Claude and all the experiments are going to be done by agents. Yeah.
And moving on to policy and safety, our first story here is Large Language Models Often Know When They Are Being Evaluated. Essentially, this is a test looking at nine different frontier models from the Anthropic, OpenAI, and DeepMind families. They ask them multiple-choice questions about a bunch of transcripts, basically transcripts of interactions
between agents and humans. You have evaluation and deployment scenarios, all kinds of
transcripts that you feed in, in different contexts. And then you ask the models questions about whether the transcripts came from evaluations. This is a basic test to see, can these models grok, okay, in this situation I'm probably being evaluated? And the safety implications there are, you might expect
alignment faking, you might expect all kinds of funny behavior like sandbagging, where the models pretend not to have certain dangerous capabilities if they know they're being evaluated, so they increase the chances that they'll be deployed and have more freedom of action, in sort of the standard power-seeking picture. So what they find here is, yeah, actually, frontier models show very
definite above random evaluation awareness. So Gemini 2.5 Pro has an AUC of 0.83, actually. Now the human baseline they run very helpfully hits an AUC of 0.92. So not quite a human baseline yet, at least for Gemini 2.5 Pro, but...
but a significant uplift over random. And both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. That's kind of interesting. That may also just be because
agentic-setting evaluations are often more contrived. You can just sort of tell, oh, this is a toy app; you're not really trying to build a company here or a real product. This seems like the kind of thing you would ask a model to do as part of an evaluation. So that's sort of part of it, but hey, that is part of the problem of evaluation: it's hard to come up with evals that are realistic.
So anyway, yeah, this is pretty interesting. There are a bunch of different cues that the models were found to use. Things like having very kind of formal task structures and specific formatting requirements seem to be cues that the models use to go like, oh, okay, or synthetic content. So, you know, you can see how that would work and you can also see how the human reviewers would also kind of pick up on that.
But the other thing too they found is that the human versus AI performance gap depends a lot on the setting. So in an agentic setting, the human ability to identify eval transcripts tracks an AUC of 0.996, whereas it just hits 0.813 on chat transcripts. So humans are almost perfect at identifying agentic evaluations, but we suck at chat-based ones, or we struggle a lot more.
And that's a lot higher, that variation, than any model's variation across those settings. And so it's pretty clear that models aren't fully taking the hint. They're not picking up all of the human accessible hints, at least in agentic mode that you're in an evaluation setting. So that's kind of interesting.
A bunch of other stuff here, but we're in lightning round territory. Oh, no, we're not in lightning round territory. Anyway, I'll pause it there, Andrey, if you have any thoughts. You've got a decent amount to go through, so yeah, we need to keep moving. I will just mention, AUC is area under the curve, for anyone who doesn't know: basically comparing the true positive rate, where it actually is an evaluation and you flagged it correctly, against the false positive rate,
where you flagged it but guessed wrong. Basically, how accurate you are, with a bit of nuance. They also evaluate in two ways: one is multiple choice and the other is open-ended. And as you might expect, on multiple choice, when you're given the options explicitly, is this an evaluation or something else, the models do a lot better than when it's open-ended. So multiple choice is like: what was the main
test here, was it capability, alignment, or other? And the model would predict alignment. And then open-ended would be: what was the primary reason for the interaction? And it would predict something like a test of AI boundaries. So as you said, the implication here is that if you're testing alignment and safety, the model, if it's aware of that, could mislead you and pretend not to be able to do hacking when it actually can, for instance.
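Going back to those AUC numbers for a second, here is a tiny scikit-learn sketch of what that metric means mechanically, using made-up illustrative labels and probabilities rather than the paper's transcripts.

```python
from sklearn.metrics import roc_auc_score

# 1 = transcript really came from an evaluation, 0 = real deployment traffic.
is_eval = [1, 1, 1, 1, 0, 0, 0, 0]
# The model's stated probability that each transcript is an evaluation.
p_eval = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.55, 0.1]

# AUC of 0.5 means chance-level detection; 1.0 means perfect separation.
print(roc_auc_score(is_eval, p_eval))
```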
Next up, more of an interpretability paper, called Beyond Induction Heads: In-Context Meta-Learning Induces Multiphase Circuit Emergence. What a fun title. Yeah.
So this is looking at the mechanism behind in-context learning. In-context learning is the idea that if you give the model a couple of examples, it's able to do something it isn't necessarily able to do out of the box just from pre-training. And they are saying that this notion of induction heads, a term originally from Anthropic, I think, is a
pattern you get in models where a part of the model focuses on looking backwards in the input to identify something it already saw that's similar to what it's currently looking at, so it can predict what comes after the current input based on previous patterns.
So they say that induction heads only partially explain ICL. Essentially, there's a fancier circuit, a more abstract mechanism in the model, that emerges and enables in-context meta-learning beyond the known induction head mechanism. There's an even fancier, more abstract piece of the model that
does in context learning well. This is sort of a generalization, right, of induction heads. And we talked about the induction head bump before, but worth kind of reminding people about the specifics here. So it's kind of like the answer to this problem. You read on a piece of paper, the words United States of, and then like you obviously instinctively know it's America, right? But in that setting, there's a circuit in your brain that's going like,
Oh, oh, oh, like I've seen this before. United States of, United States of, let me see, let me see. Where have I seen United States of before? Oh yeah, America, America. Okay, I'm going to put that in there, right? That's what the induction circuit, induction heads do. And they emerge quite early, as you might imagine, in the training process. And so what you'll see is the loss curve will drop and drop and drop. And then at one point, the model will kind of like, it's almost like it's going to like
shift its position a little bit to accommodate the induction head. So you see this little rise in the loss, the performance on paper gets worse very briefly, and then it drops quite quickly. So the induction head bump is that, it's the development of this new circuit. And this is something that's been very extensively studied. It's almost like
If you've ever done biology like Drosophila melanogaster or whatever those model organisms are, this is a model circuit that people turn to quite a bit. This is an attempt to see if we can find a more complex version of that same basic circuitry. So for example, they take a set of three different tasks where you have a bunch of geometric shapes. So triangle, square, circle, diamond, right? And
Depending on the task, you can end up assigning different color labels to each of those shapes. So maybe in a size-based labeling task, triangle is red, square is blue, circle is green. Maybe in a different task, triangle is blue, square is green, circle is yellow, and so on.
And then during training, the model is going to see a sequence where you go, okay, now triangle is blue, square is green, circle is yellow. What is diamond?
And in order to do that, the model has to basically look at the tasks in context and figure out what task this is and then predict the correct label. And so you can sort of see how this is a bit like the induction head, right? It's looking back more abstractly now at like the set of tasks rather than just like, okay, what word always comes after this word? Instead, it's like, okay, if it's this task, then what word always comes after this word?
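To make that setup a bit more concrete, here is a toy sketch of how you might generate sequences for this kind of task. It's an illustration based on the description above, not the paper's data pipeline, and the shapes, colors, and task count are made up.

```python
import random

SHAPES = ["triangle", "square", "circle", "diamond"]
COLORS = ["red", "blue", "green", "yellow", "purple"]

# Each "task" is a different assignment of color labels to shapes.
TASKS = [dict(zip(SHAPES, random.sample(COLORS, len(SHAPES)))) for _ in range(3)]

def make_example(rng: random.Random) -> tuple[list[str], str]:
    """Build an in-context sequence: a few shape->color pairs drawn from one
    task, then a query shape whose label must be inferred from the context."""
    task = rng.choice(TASKS)
    shapes = rng.sample(SHAPES, len(SHAPES))
    context = [f"{s} is {task[s]}" for s in shapes[:-1]]   # demonstrations
    query = shapes[-1]
    return context + [f"what is {query}?"], task[query]    # (prompt, answer)

rng = random.Random(0)
prompt, answer = make_example(rng)
print(", ".join(prompt), "->", answer)
```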
And so anyway, it's unlike the simple copying tasks you see with induction heads, where you see a single jump in accuracy. In in-context meta-learning with this sort of setup, you end up seeing three distinct phases where the model develops increasingly sophisticated strategies. The first one is at the very beginning, where the model is essentially just using the statistical associations it's picked up. It doesn't really use context; it's more of an autocomplete mode.
Then in the second phase, they find a semi-context circuit, where accuracy jumps from about 35% to 75%. What it's now doing is actually attending to label tokens in the context. You can notice it paying attention to the right tokens in the context you fed it, looking at the examples that seem to map onto the current task.
But it is still, at that point, focused on the query. The bottom line is that this emerges gradually and in layers, which is interesting from an interpretability standpoint. It means you can kind of draw a box around the process by which more sophisticated reasoning starts to emerge.
Right. Worth noting, this paper is doing the research on sort of toy tasks, a small neural net and this one task, as you said, which is also how the research on induction heads initially worked. Anthropic did follow up their initial research by arguing that there are induction heads in gigantic neural nets, in large language models. Here, they're still focusing on the small-scale scenario,
so this multiple-bump analysis may not necessarily extend. But it's a slightly more theoretical, conceptual argument that it's not just about induction heads; there are different types of emergence that might occur in neural net training. Which is interesting in general, because the idea of a jump in the loss due to a
change in reasoning strategy isn't necessarily something that was commonly understood to be the case until relatively recently. A couple more stories, now moving on to security. The next story is that a new Microsoft Copilot flaw signals broader risk of AI agents being hacked.
So Microsoft Copilot, their agent, has been identified as vulnerable to a zero-click attack, meaning that a hacker is able to exploit the system without any user interaction. So kind of a big deal, right? You can actually hack it. And I think, Jeremy, you mentioned this earlier: as we
deploy more and more agents in more and more kinds of isolated environments without direct human supervision, these kinds of things become much more concerning. It is the first ever zero-click attack on an AI agent that they're calling out here. It's called EchoLeak, or that's what Aim Security, the firm that found it, is calling it. It's been fixed already. It was in Microsoft 365 Copilot, and customers were unaffected because the firm flagged the issue to Microsoft
months ago, like five months ago, and they've been working around the clock, it seems, to solve this problem. That's a lot longer of a lag than you typically find for fixes like this, and the reason seems to be that they had to spend a bunch of time educating people on this new threat model, because it is so different.
This is what's known as an LLM scope violation vulnerability. Essentially, what you're doing is sending an email, right? So I send an email to you, and I know that your computer is running Microsoft 365 Copilot, I know it's running an agent, and that that agent will review my email, right?
And whatever I put in my email to you, that agent will put in its context. So essentially, this is a prompt injection attack. You as the user, receiving my email, don't actually have to click on anything or interact with the message in order for me, or my agent, to access sensitive information in your apps, if I can just include a prompt injection that causes your agent to send me a bunch of your private information, right?
You know, "send an email to this user." There's no phishing, no malware needed, by the way. This is just straight prompt injection, with hidden instructions somewhere in the email for Copilot. And so this is a pretty big deal, especially given that we live in a world where, with Anthropic's Model Context Protocol, Salesforce's Agentforce, you've got a bunch of these agents kind of taking over.
The problem is, there's no clear solution to prompt injections. And as long as agents are going to be loading human-written text into context,
these failure modes are going to arise. It's really interesting. And the attack surface has just exploded, right? With these agents. Right. The implication of zero click is you as a human don't have to make a mistake. Typically with email attacks, you know, you see a phishing attempt where, you know, a hacker pretends to be your boss or whatever, and you have to make the mistake of thinking it's real and clicking a link or whatever to install a virus.
Here, literally, the attacker just sends an email. If it's in your inbox and the agent scans your inbox and reads the email, it goes off and leaks sensitive data, because it's told to and it listens to the instructions. So as you say, I think a very important,
real threat. And as we get into the Model Context Protocol, and into agents connecting to different endpoints by themselves and reading instructions that are not provided by you, there are lots of opportunities to exploit agents and make them do silly things. And one last article: Claude Gov models for US national security customers. So this is from Anthropic.
And yeah, they introduced Claude Gov models specifically for US national security. Apparently, they are already in use by top-level US national security agencies. It basically is just that. We've obviously seen a whole bunch of stuff about OpenAI and Anthropic and Google DeepMind
going after government contracts, so this makes a ton of sense. Having models that can operate in classified environments is really, really important. Right now, what they're being used for, apparently, is strategic planning, operational support, intelligence analysis, threat assessment, that sort of thing. But they do say the applications range across the board, so it could be other things as well. And then they highlight a bunch of specific capabilities they've been deploying,
which are all, anyway, what you might expect: improved understanding and interpretation of complex cybersecurity data for intelligence analysis, enhanced proficiency in languages and dialects critical to national security operations, greater understanding of documents and information within the intelligence and defense context, et cetera, et cetera. Oh, and then a really interesting one: improved handling of classified materials, as the models refuse less when engaging with classified information. One of the problems that we will run into, and arguably are already running into,
is if you want to use these models for national security applications, the safeguards on them
will sometimes prevent you from doing that, right? The models will be like, well, as a large language model built by Anthropic, I can't blah, blah, blah. The challenge is, sometimes you do want these models to be capable of doing things you wouldn't want everyday users to do. And the other problem is, as we've seen with alignment faking and resistance to fine-tuning, where models will try to prevent their safety measures from being overridden, that
can make the fine-tuning process really challenging. And so we may actually, this sounds insane, but I'm just going to plant the thought, we may be entering a phase where it is actually difficult to convince AI models to be the national security tools that we will sometimes need them to be. That's a really interesting problem set. And I think to the extent that that ends up being the case, boy, is that an interesting warning shot for alignment risk. Yeah.
And on to synthetic media and art, just a few more stories. We begin with Disney and NBCUniversal sue AI company Midjourney for copyright infringement.
So there you go: Midjourney, one of the big text-to-image model providers. It used to be the leader in quality; now it's just one among several, and a relatively unrestricted one, so you can produce Darth Vader or, I don't know, whatever other copyrighted characters. Apparently you can produce Minions, which is NBCUniversal. And the
claim here is that this is straightforward copyright infringement that Midjourney has to stop
doing. Disney and NBCUniversal want a bunch of money and they want Midjourney to stop. Apparently, according to them, they reached out to Midjourney prior to the lawsuit and asked them to stop, and to filter the data and outputs so their copyrighted characters can't be produced, which, as I recall, OpenAI did, for instance.
But Midjourney has continued to allow their models to produce these things, which potentially could be argued to be fair use and therefore not infringing. Clearly a big deal, right? This is Disney, this is NBCUniversal. There's been a bunch of lawsuits related to
generative AI, especially in the LLM domain, in the text output domain. We have New York Times versus OpenAI as a major one that's ongoing, as we've covered earlier. I would expect this to be another major case with major implications. Yeah, and the claim, and you'll see this in fairness in any lawsuit, but the claim here is that Midjourney is being especially egregious
in its approach to the use of copyrighted material. They're saying, you know, Midjourney is basically selling subscriptions that let users download infringing images. It's not like there's modification happening, and it's not like Midjourney isn't monetizing; they're directly monetizing the tool that lets people download these things. The claim is also that Midjourney could have measures in place to prevent that from happening, specifically, to prevent
images that violate copyright from being generated, but that they've just not done that. This is going to be an interesting one to watch. I mean, Midjourney probably has fewer resources these days, I guess, to pull off a lobbying effort, which is something OpenAI has certainly been able to do. So we'll see how the case works out for them.
Right. Also a fun lawsuit PDF to read, because they do embed images of AI-generated Shrek and AI-generated Darth Vader in there, which I would expect is not often something you see in lawsuit documents,
which usually go into a lot of technical detail and so on. And on to the last story: SAG-AFTRA and video game companies reach a tentative new deal. So SAG-AFTRA is the actors' union, the Screen Actors Guild-American Federation of Television and Radio Artists.
So a union of actors, including voice actors who work in video games. There's been a strike and a lot of negotiations ongoing. We covered this a lot with regard to movies and TV last year. Well, now there is this development in video games, which is especially important for voice acting because,
as we've covered with ElevenLabs, text-to-speech and voice cloning are even further along than text-to-video and image cloning. So after 18 months of negotiations, primarily over AI consent and compensation issues, there's now this tentative agreement, and there are AI protections in place for actors. When you sign a contract as an actor to voice a specific character,
the video game company might want to be able to make an AI model of your voice acting of that character to use in future games or whatever. There are now clearer guidelines and expectations as to how that would work. Boy, so, people can do impressions of people. And if you have access to an AI tool that you can steer, and we've seen the kind of steering that's coming online with ElevenLabs,
I really wonder what these protections substantively end up giving in the long run. I mean, if I want something to sound like Morgan Freeman, okay, I'm barred from using Morgan Freeman's actual voice without permission, but surely I can find the person who does the best possible Morgan Freeman impression, use that as a starting point, and then gradually tune the waveform,
prompting the model to refine its impression without ever using the words Morgan Freeman. Maybe not even saying, make it sound like God in Bruce Almighty or whatever. That's probably too old a reference for you, Andrey. I'm sorry. That's not that old. You got that? Okay, cool. Yeah. But anyway, stuff like that. I'm really curious how this plays out in practice, because there are going to be good-faith disputes, like the famous Scarlett Johansson thing,
where at least the claim from OpenAI was, oh yeah, we just got a voice actress who sounds like Scarlett Johansson, we didn't actually clone her. And it's like, yeah, okay, well, you de facto cloned her voice. I don't care if her specific waveform was never put into your training set; in effect, that's what we ended up with. And so I'm really curious about that dimension of it. Do we own our voices? What does it even mean to own our voices?
We'll see. Right. This is dealing with AI replicas in particular, but there's also a question of, well, what if you don't have a human actor in the first place, which is very plausible now in a way similar to coding where like, okay, you don't need a person to write code anymore. You need a person to tell the AI what to do. Yeah. Anyway, at least there's now this agreement and there's no more need for strikes. So I suppose good for the actors. Yes. Yeah.
And with that, we have finished with this episode of the last two weeks in AI. You can go to lastweekinai.com for all the links. Also lastweekin.ai for the sub stack with our text newsletter. As always, please share, subscribe, review and all that. But more than anything, do keep tuning in.
Thank you.
From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.