We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

#208 - Claude Integrations, ChatGPT Sycophancy, Leaderboard Cheats

2025/5/8

Hello and welcome to the Last Week in AI podcast. We can hear a chat about what's going on with AI. As usual in this episode, we will be summarizing and discussing some of last week's and maybe even two weeks worth of AI news. As always, you can also go to the episode description to get the links to all the stories and the timestamps so you can skip ahead if you want to. I am one of your regular co-hosts, Andrey Karenkov. I studied AI in grad school and now work at a generative AI startup.

And I'm your other host, Jeremy Harris. I am the co-founder of Gladstone AI, AI National Security Company, blah, blah, blah, blah, blah, blah, blah, blah, blah. And yeah, welcome back. I mean, it's good to be back. It's good to be back in the seat after

God, I mean, so we were talking about this earlier, but we had like two weirdly simultaneous launches of things that happened within, I want to say a week, a week and a half of each other. And so Andre was like super busy the first week and I was busy the next week. And it's just been a, anyway, it's been a real fun time. Yeah. The fun bit, we were also discussing how, because we do this podcast, we actually have to be on top of what's going on in AI and not doing it. That was actually kind of strange. Yeah.

On the other hand, because it is last week in AI, we do try to do it once a week and it is a bummer when we have to miss some. So we are going to try to be consistent at least for the next few months until we have any more launches. But hopefully listeners understand. Unfortunately, we do have day jobs and so on, which...

Sometimes they're priority, you know, it happens. But the good news is nothing huge happened in the past couple of weeks. There's been some interesting things to discuss and we will get into some of those covering some things that are a little bit older and some things that are brand new.

And that's kind of a preview of episode in tools and apps. We're going to talk about some kind of patterns we've seen with open AI being very, what people call sycophantic lately and the whole drama about that. Also, uh,

Some brand new news about Anthropic and IPC servers, which is pretty cool. Applications and business, as always, a few stories about chips and China, and also some funding news for some startups, projects in open source, a few new models, and actually some research as well. Research and advancements. Some pretty spicy results we're going to get into about leaderboards and more research on

really explaining what's going on with reasoning and RL, and then policy and safety, some things about malicious uses of AI and vulnerability, things like that. So it'll be a fun little episode. I think we're going to enjoy discussing some of these things.

And jumping straight into tools and apps, the first story is brand new. It's about Anthropic, letting users connect more apps for cloud. So this is basically allowing you to have direct integration to various applications.

They have a starting set of partnerships with things like Atlassian, Zapier, Cloudflare, Intercom, Square, PayPal, and others. The idea is that when you...

enter a query into Claude, it'll have a little pop-up that's basically like, do you give me permission to talk to the service at Lesion or Zapier or whatever to do whatever you want to do? And it can directly do it for you. So instead of having an AI built into your Jira task tracker for work,

That is custom. Cloud can now directly talk to that thing using presumably this model context protocol standard way to communicate to services that Anthropic released last year and has kind of taken off.

And it can directly talk to that and basically be your AI for your task tracking software, or it can be your AI to process news. It can basically now open up and be a chatbot that can do all sorts of stuff. And, you know, this is,

Similar to letting your AI just do web surfing for you to do whatever it needs to to fulfill your task. But I guess much more elegant and direct where it can talk directly to the service and can query it for you without having to do the, I don't know, like grunt work of pressing buttons and logging in and so on. So I think...

Pretty exciting in terms of a release for Cloud that really makes it much more broadly useful and kind of impressive to see them taking the lead in this particular way of using chatbots.

Yeah, it definitely seems like Anthropic building on the early advantage they had with the MCP protocol, which OpenAI obviously has since taken on board and other companies too. So it is becoming the de facto standard and it positions Anthropic really well in the space. It's also, I mean, it's consistent with this vision, right, that we heard, well, many times, but kind of most famously articulated in that conversation.

Leopold Aschenbrenner situational awareness thing about the drop-in remote worker, right? This is really a step in that direction. You've got a model now able to just call these tools directly. It's being productized. It is being rolled out, this version at least, to Claude Mac's subscribers enterprise network.

plan subscribers and soon to pro. So again, this is anthropic kind of finding the sweet spot of what they're going to charge for the kind of higher tier subscriptions. That's been a question recently too, right? When they introduced Claude Max, they said we would give early access to people who sign up for that tier, early access to new capabilities. This is apparently one of those capabilities they flagged for that. So starting to kind of flex that muscle a bit too. But yeah, this is, I mean, this is on the step to fully replacing certain kinds of

Well, it depends on the way you wire things up, but certain kinds of engineers, certain kinds of, well, again, if you're doing some kind of like sales backend work or whatever, there's a lot of stuff that could be straight up automated down the road if they keep pushing this direction. So kind of interesting. And we'll see what the impact is too on the job market. I mean, there are some indications that this stuff is really starting to rattle, especially juniors or entry-level roles, right?

But yeah, well, it definitely a big cost savings if you're able to get these sorts of agents to do your work for you. Exactly. I know personally, you know, as someone who does programming so far, you've had to sort of wire up things yourself. Like let's say you want to write a script to process a spreadsheet to do some work for you.

Typically, that's involved writing a script to really do it efficiently, to not have to download it, attach it, write the prompt. Now, it makes it much easier to automate things via a prompt because you don't need to do any sort of manual steps. I can directly talk to whatever data source it needs to do the task. So a simple example, again, just to make this clear is

They show you being able to ask what's on my calendar and then cloud can directly talk to your calendar. You have to press a little button to allow it to get the data and then it can answer your questions about that. So really, I do think kind of a pretty significant step in terms of expanding the capabilities of LLMs and this kind of service to do all sorts of stuff for you in a way they could not have done before.

Worth noting also, as far as new features goes, they did launch their own research tool because apparently every single provider of LLMs needs one. And they are launching an advanced research tool, which is their fancier one. It can take five to 45 minutes to compile comprehensive reports for you. So also interesting to me that it turned out

For agentic AI and for these reasoning models that deep research has turned out to be one of the power use cases. And next up, we are going to talk about OpenAI. And they've had something pretty embarrassing, I will say, in the last couple of weeks. Yeah.

So if you're on Twitter or if even you use ChatGPT, there's been a lot of discussion of a recent update of GPT-4.0 where they have made it, let's say, very enthusiastic and positive when communicating to people. I didn't know this word actually. Glazing apparently is what people describe it as.

Where, yeah, basically the model is like you enter a basic query or something or something like that. And the model just cheers you on to know. And it's sort of crazy, you know, telling you, oh, this is such a deep insight. This is such a good idea, et cetera, et cetera. And it was so bad. And there's been such kind of bad examples that OpenAI seemingly really rushed to fix it.

Sam Altman actually announced on X that they are working on some fixes ASAP to address the personality issues from the last couple of GPT-4 updates. They rolled out update to the system prompt that some people talked about. They've also seemingly done a full rollback of GPT to a previous state of it. So I would say, you know, there's questions as to how this happened.

It's potentially the case that they try to make it overly optimized for engagement or for positive feedback by users. But it's clearly like when you look at some of these responses, it's clear that something went wrong here. And it's something we haven't seen from one of the major players yet in this way.

Yeah, it's also hard not to notice that this is happening just weeks after OpenAI announced that they're no longer going to be focusing on persuasion capabilities as part of their preparedness framework in the same way as they had.

So when you think about persuasion capabilities, certainly sycophancy in these models is something that you might correlate with persuasion, right? Telling people, oh, you're so smart. What a great idea. What a great question. When you optimize, and I haven't seen clear indications that they had optimized directly for rewards. I've seen some posts on X of people saying like, hey, here's an option that showed up. Do you like this personality or not? Like thumbs up, thumbs down type thing, which

To be clear, I think is a gigantic mistake, a really, really dangerous precedent for OpenAI to be setting. Frankly, I mean, we've seen OpenAI do related things, be willing to kind of push the envelope on some stuff. You could often argue for it or whatever. But when it comes to like optimizing, when you close that feedback loop between yourself and like the, if you will, the person that you're talking to, to make them more agreeable or more

likable by you, I think that is pretty clearly a very, very dangerous thing to be doing when you have as much compute as they do, when we already have evals that are showing these models are really effective at persuasion and manipulation. That's the sort of thing you start to think about at the next beat of scale, at the next beat of sort of subtlety and persuasion and manipulation, which these models seem to be on track for. So anyway, I think this is definitely a space to watch. There's not necessarily going to be smoke the next time there's fire.

And that's something that I think people really need to understand. These models are, by definition, getting good at persuasion means, or almost by definition, it means that the other person doesn't realize that's what's going on. So as you keep pushing in that direction, as you use more and more subtle cues, feedback cues from users, I think a lot of people have very justified concerns that we're heading in a direction where

You know, there's a certain amount of asymmetry between the user and the company here where the company is able to think on computer clock time about how to optimize their relationship with the user. That's not necessarily healthy, especially aggregated over the entire population, you know, hundreds of millions of people interacting with this stuff.

Right. And just to get into some basic examples, a lot of this was kind of funny and people started posting examples where they directly got to them to be as silly as possible. So one example, just pulling up for Twitter, someone says, I just woke up, did two pushups and might brush my teeth in the next six hours.

Chachi Bhatti said, you have achieved a level of mastery if you dare to even imagine the delicate art of strategic patience to evaking is itself a small revolution to do two push-ups immediately afterward is a declaration of war against inertia. I will say perhaps this example is, you know, tweaked. I'm just pulling off of internet, but that shows you kind of the flavor of what you're seeing. It's of a model is being developed

very much a sock up saying very extremely positive things that are not natural and I just actually searched and opening I just posted a blog post today as we are recording and

titled Expanding and What We Missed with Sycophancy. And they go into, you know, in April 25, they pushed an update. The update had a few things. Each thing individually didn't look so bad and their metrics were good, et cetera, et cetera. They're talking about

What will improve in our process? What we're learning? So a pretty embarrassing kind of situation here, right? The fact that they need to address it so strongly. Some people also compared it, I remember, to the Gemini launch from Google where there were very silly things going on with the image generator. I think OpenAI for the first time has really fallen on its face with this launch and

And as you said, there are some real dangers to doing this kind of thing. Another thing that people pointed out is some people are getting very close to these chat GPT models, people who are perhaps not

possibly delusional or in a bad mental health situation, you know, talking to Vichat bots can seriously affect them. And so you need to be careful with how positive, how affirming Vichat bots can be and how, you know, how much of a reinforced, whatever you're telling it that has real implications, even aside from let's say theoreticals of persuasion or things like that. So,

Yeah, a lot of discussion, I think, will be going from this event and some studies and so on to really get into how you can tip models to be a little bit extreme and otherwise quite an interesting phenomena.

A few more stories. Next up, we have a new model launch from Baidu. They are announcing Ernie X1 and X5 Turbo. X5 Turbo, as you might imagine, is the fast kind of model. They are saying that it has 80% price reduction compared to its predecessor.

Ernie X1 is the deep reasoning task model. They're saying it's better than DeepSeek R1 and O1, things like deep chain of thought, things like that. So Baidu, as one of the leading creators of LLMs out in China, is...

I don't know if it's fair to say catching up, but keeping up with what's going on with Entropic and OpenAI, you know, increasingly you have small, cheap, fast models like Gemini 2.5 Pro or let's say O3 Mini. And you have these quite big, quite expensive models like O3, like Cloud Opus, Gemini 2.5 Pro, which are very,

more and more very capable. And that seems to be the case with these two models. Yeah. I mean, don't count out China. And I think there are reasons, and I'm not sure if we're going to talk about them today explicitly, I'm trying to remember, but there are reasons to expect this to continue at least into next year, by which time the chip export control

stuff is going to have more of an effect. But for right now, expect China, frankly, to do damn well and quite possibly catch up fully to the frontier of Western AI. I mean, that's a concerning thing to be saying, but that is the trend right

I think until we get the next generation of data centers online, we're not going to see that significant a gap between those two groups. Yeah, the benchmarks look really solid here. I mean, they look at various multimodal benchmarks for 4.5 turbo, and certainly that's well in advance of

GPT-4.0 and competitive with GPT-4.1, in fact, beating it at many multimodal benchmarks. That is a pretty noteworthy thing. And competitive pricing as well. I mean, you mentioned Ernie X1 Turbo is something like, was it 25%, I think they said, of R1 in pricing. So that's pretty damn good. Also, I mean, again, R1 is an oldish model.

It's an oldish model. It's been around for literally weeks, guys. It's been around for weeks. It's collected years. It was at the start of a year. That's when all this reasoning stuff kicked off. Feels like forever ago. 100%. But because of that, there is so much low-hanging fruit right now in the inference stack that

that yeah like you can learn a ton of lessons from looking at r1 a lot of these models by the way distill off of r1 and you can kind of tell in their thought traces end up coming out there's some some similarities that look suspiciously similar i don't know if that's the case for earning 4.5 i haven't actually checked that one but we'll talk about a model a little bit later a

a Chinese model actually that sort of has that characteristic. So there's a lot of ways in which you can build off of R1, both by distilling data directly from it, but also just by learning lessons, infrastructure lessons and architecture lessons from it that allow you to drive down that pricing a lot. And anytime there's a new paradigm that gets discovered and

or invented, you have a rapid improvement in a lot of the top line metrics, just as people find all that sweet low hanging fruit associated with that new kind of paradigm. So that's the phase that we're in right now. Expect these prices to kind of collapse faster than the traditional pre-training kind of base model pricing.

currently is, you know, think back to like how quickly GPT-3's pricing dropped, for example, or chat GPT's pricing dropped in the early days. That's what we're seeing right now as well. And those other prices continue to drop by the way, even for base models, but we're just in this unusual kind of very rapid acceleration in that, in that phase where we're getting efficiency gains that are really, really rapid.

Yeah, I remember when model pricing used to be per thousand tokens, and then at some point, they switched over to per million tokens. That's a good point, right? Yeah, it's funny. I don't think I ever consciously registered that. I was just like, yeah, of course, we're bumping it up by three orders of magnitude.

And next, moving away from LLMs for a bit towards image models, the next story is about Adobe adding more image generators to their services. So they are launching Firefly Image Model 4 and Firefly Image Model 4 Ultra with some other updates. So Image Model 4 is meant to be faster and more efficient and offers up to 2K resolution images,

Firefly Image Model for Ultra is focused on rendering complex scenes of more detail and realism. These are now available in the Firefly web app, which also has their text-to-video, text-to-vector stuff. And they are introducing this new thing called Firefly Boards, a collaborative generative AI mood boarding app in public beta. So that's kind of cute.

Last up, they are also now adding support to third-party AI models like the GPT image model, Google's Image N3, Google's VIA2 for video, and other third-party things as well, which I think is kind of notable if you're thinking that this can be the service to use for image generation, for experimentation. Having third-party support is not kind of a trivial detail, but

They actually emphasize that these third-party models are for experimentation and marks their own models as, quote, commercially safe, which is, yeah, highlighting what they are arguing is the reason to stick to the Firefly models. The fact that they've trained it on non-copyright data, you're not going to get any sort of trouble with using Adobe's models.

Yeah. First of all, I mean, it makes all the sense in the world, right? In a world where all these models are becoming commoditized. I mean, this is really the ultimate expression of the commoditization of these image generation models, right? You literally are a click away from using the alternative, right? So it's great for the customer. It's also, it makes it so that the actual value is,

in the value chain plausibly is no longer going to be concentrated with the model developers, at least for text to image or things like this. Instead it, it, well, it'll shift somewhere else.

Obviously, the hardware stack, I mean, we've talked a lot about that, especially in the last kind of two years, that that's where, you know, the NVIDIAs of the world, maybe the AMDs, the ASMLs, the TSMCs are kind of where a lot of the value and the value chain ends up being captured. But there's also the aggregation point, right? So Adobe making a play here to become an aggregator of sorts of these models.

Definitely a good play. Also with them leading the way on the whole idea of indemnifying users if it turns out that there's a copyright violation or a sort of claimed alleged copyright violation from the image generation process, not necessarily being able to guarantee the same thing for the other models they host on their platform, which is where their sort of flag there for like, hey, our thing is business safe. The others are for experimentation. That's kind of where that's coming from.

a sort of a nice way to encourage people to use theirs. Now, I think a lot of these companies have similar sort of indemnification guarantees. So it's not actually clear to me that there is a material difference in all cases relative to the promises that Adobe is making. But I'm not sure having not gone through the specific list of like all these models, there may well be some that don't offer indemnification. So still interesting information

Adobe making a good play. And these, I mean, these models look really good. Like they, they have some examples and, you know, I keep saying this every time there's a new image generation model. I'm like, I don't, I'm at the point where I can't tell the difference between subsequent releases and,

Maybe it's just the prompts that they picked here, but they do seem very photorealistic and compelling. So anyway, seems overall like an interesting move, very strategic shift for Adobe for sure. And one of the few things that I think they could do to make sure that they're still relevant in the long run if they don't have access to the kind of compute that their competitors do. Yeah, and I think the fact that they're investing a lot in this Firefly web app is interesting in a sense that they do have an advantage in this competition, obviously,

Similar to Google in a way in that, you know, if you're already paying for Google Workspace, you're maybe going to use Gemini. If you're paying for Microsoft 365, you're maybe going to use Copilot. If you're paying for Adobe tools and they do bundle their tools in a subscription, you know, for Photoshop or Photoshop,

photo editing or whatever, they can bundle in the AI and then push you towards using Firefly and not some, one of the many other services you can use to generate images. So I could see Adobe really making it out just by being the default for a lot of this kind of professional work.

And speaking of image generation, next story is that OpenAI has made their upgraded image generator available to developers. So we saw in late March, the launch of what I think they call ChatGPT image generation, GPT image one.

And for a while, you can only use it via the web interface. Now you can use it via the API. And this is quite notable because this model does have some very real advantages over previous models. It's much better at editing images, given an image and a description. It is very good at very kind of

clean edits that previously would have been very hard. These images are watermarked with metadata and you can kind of track what they're being generated, things like that. So I think currently few other services provide this level of image editing. And I would be curious to see, I guess, what impact this has. Yeah. Pricing is also like, it's non-trivial as two cents for a

approximately two cents for a low quality image, approximately 19 cents for a high quality square image. So, you know, if you think about that, like that's a buck every five images. It's not nothing, but anyway, obviously that'll collapse in price pretty soon too. But yeah, kind of cool. The consistent shift to,

Oh man, I'm trying to remember who was, I think it was Steve Ballmer, right? With that famous up on stage of Microsoft, like clapping his hands going developers, developers, developers. Well, this is that, right? Everybody's kind of moving in that direction. It's increasingly a matter of, and this is like opening eyes, like original play back when GPT-3 I think came out. They were very much in that mode of saying, look, we're just going to put everything in developers' hands, see what they build with our stuff rather than necessarily us.

The implied claim was rather than necessarily doing the Amazon thing where we actually start to notice which products are doing really well, and then we offer the Amazon Basics version of that product. And eventually that's bad for people who use the platform, merchants. OpenAI has done some of that. There's no question. I mean, that's part of what it means to be in the image generation business. But more APIs, right? Like that's a very OpenAI thing. And it's a very, well, industry thing now, right? That's where everything's going.

And last but not least, dealing with XAI and being able to see things as opposed to make images. They have launched GrokVision in their iOS app. So as we've seen demoed many times, you can point it at something and ask it questions about whatever you're pointing it at.

They're also launching some other things like multilingual conversations, real-time search in voice mode. This is available to Android users on the $30 per month SuperGrok plan. So...

Still, yeah, XAI rapidly in catch-up mode with, in this case, I guess it's the advanced voice mode from ChatGPT where you're able to ask questions about equations and stuff like that as OpenAI demoed last year.

Yeah, I continue to be impressed at how fast Grok is getting stood up. I mean, just the sheer number of like, they're not supposed to be a massive contender. They've been around for all of like, what, two years, 18 months. And yeah, already pumping out reasoning models, multimodal models and all that. So yeah, they've definitely they're taking advantage now increasingly to have their partnership with X or their integration with X. So we'll, I guess, see that reflected more and more too.

Yeah, and very rapidly rolling out, I guess what seems to be more and more of a basic set of features on the chatbots, things like Canvas, search, memory, you name it. Whatever ChatGPT or Cloud have introduced over the last couple of years, Grok is rapidly adding it as well.

And on to applications and business. First up, we're going to talk about the startup from Mia Murady, the former CTO of OpenAI, who left after the high-profile disagreements with Sam Altman and

Being ousted in late 2023, Mia Moratti left, I believe, in kind of 2024, maybe around mid-2024. We've known she's been working on the startup called Thinking Machines Lab for a while. And now we're getting some news about their fundraising. Apparently, they're raising $2 billion at a $10 billion valuation. And the interesting thing that has come out of this is that...

Mira Murady will have an unusual amount of control in this startup. So basically what it sounds like is she will always have a majority on any major decision in, let's say, the board, for instance. So...

Even if she installs a hostile board, for instance, and they all disagree with her, my understanding is she'll be able to override and have ultimate decision-making capability as VCEO, which is unusual. It's usually VCEO has a lot of power, but not necessarily a codified majority decision-making power from the outset.

So, yeah, I mean, it's been kind of a slow rollout for Thinker Machines Lab. It's been a bit quiet as to what they're doing, but they have been recruiting and seemingly, I guess, getting investors on board.

Yeah.

Jumping ship and then going to thinking machines. Something interesting is happening there. I mean, there's no question that level of talent flocking to that company is very interesting. Also interesting to see this sort of consolidation of power. This is something that all these rock star employees are actually perfectly happy with. Right. So there is this super voting majority that Mira has. Apparently, the way it's set up is her vote on the board has the equivalent majority.

force of the vote of all other board members plus one. So functionally, there isn't a board. There isn't board oversight. That's what that means. Which is, by the way,

The function of the board is basically to hire and fire the CEO, right? To hold the CEO accountable. That's the whole idea behind a board. So the fact that that's not here is very interesting. It means she's got an awful lot of leverage. So she's raised ostensibly about $2 billion at a $10 billion valuation. And Dreesen Horowitz is in on those rounds. And they're famously very founder-friendly, allowing her to do this.

That's also true, by the way, at the level of the shares. So just to give you like if you're not tracking the whole corporate structure set up, typically have a board that can hire and fire the CEO. And then you have the shareholders of the company who can sort of swap board members around. That's usually how things work. And even at the level of the shareholders, Mira also has or enjoys a lot of control, very unusual amount of control over.

The startup's founding team, so some of these elite researchers who've come over from OpenAI, from Anthropic and elsewhere, have apparently super voting shares that carry 100 times as many votes as normal shares. And they've agreed to let Mira vote for them by proxy. So that's a lot of power that she's got.

on the shareholder side, on the board side, and as a CEO as well, everything I've heard about Mira does seem to be quite positive, interestingly. So some of the former OpenAI employees who've been through the whole board coup fiasco thing had pretty damn positive things to say about her. I thought that was kind of interesting. I've never met her myself, but it was in the context of what happened with Sam. She was sort of left in the lurch back then when the board refused to tell her that the

The reason that they had fired Sam was the evidence that she herself had provided. That's now public that that was the case. But without telling her that, she was kind of left in the lurch. So anyway, she's definitely experienced at navigating a lot of board drama that maybe what's reflected here is

in this move, but it is highly unusual. And again, this would only happen if she had an extreme amount of leverage over the investors who are coming in. That doesn't mean, by the way, that it doesn't get refactored at the next fundraising round. You could easily have investors who come in and say, look, I'll give you the 20 billion you're asking for, but you're going to have to do something about this board setup. We want some measure of real and effective control. And so all these things are to some degree temporary, but for right now, with the 2 billion that they're apparently raising,

This is going to be the lay of the land for a little while. Next up, some chip talk. And we've got a couple of stories about Huawei. So one story is discussing the Huawei 9C. And basically just, we've already discussed, I believe, this chip. It's a combination of two 910B chips that...

combined are about as good as the H100, not the top of the line at VU chip, but the

what used to be top of the line for NVIDIA a couple of years behind. And the story here is just saying that they are getting close to starting mass shipments, potentially as soon as next month. Another story is also saying that they are working on a new chip that is called the Ascend 910D.

It is in the early stages of development. It will require testing. And this will be the chip that is going to be more powerful than the H100.

potentially could be the default if export controls get tighter on NVIDIA, as is very possible at this point. There's a lot to be said here. I think that the top line needs to be a recognition that US export controls actually have been working. They just take a long time because of the supply chain dynamics. China has enjoyed the ability to basically black market import a whole bunch of chips, H20s,

H800s, H100s that they shouldn't have been able to import. That's what's reflected unambiguously in some of the latest big runs that we've seen

the sort of post deep seek era stuff. So I think that's really important. China will be trying to convince us that the export controls are not working. We know they are because we've heard it from literally like the founders of deep seek back in the day before the CCP was watching their every move. Now their tone has changed, but the fact remains. And

Anyway, so we are going to see this chip is going to be slower. This is the 910D. So this kind of next generation will be slower than the B series, Blackwell series of NVIDIA chips. There are reasons, though, to suspect that that may not be the deciding factor. So what China is really good at is...

taking relatively shitty GPUs and finding ways to network them together to make systems that are just really, really powerful, even if the individual chips within them are kind of crappy. The trade-off that they end up making is because they can't use the exquisite like three and five nanometer and four nanometer nodes at TSMC to fab these things down to crazy high accuracy. Because they can't use that,

They can't have chips that are as performant on a per watt basis. So they have chips that are significantly less energy efficient, but that matters less because in China, energy is much less of a bottleneck. They're putting nuclear power... In the last 10 years, they have added an entire America worth of power. The whole US electric power output, they have added that in the last decade to

In the form of nuclear and other things, they can actually bring nuclear plants online really quickly because they didn't go through this weird phase where America had an allergy to nuclear. And so now they're in this beautiful position where, yeah, the U.S. has export controls on these high-end chips and anything from TSMC above a certain node. But the reality is China doesn't care as much because they have so much domestic power available. So they'll use chips that are less powerful.

on a per watt basis. And, you know, what's the difference? We've got 10 gigawatts of spare power around three gorges. Damn, let's just throw it at this, right? So that's kind of what we're seeing there.

The calculus, the design calculus, if you're Huawei, just looks different. It looks more like let's crank as many flops as we can out without worrying quite so much about the power consumption. And let's make it up in networking. Let's make it up in the backend, in the scale up, in the fabric that connects all these different GPUs together at the rack level and beyond. And-

That's really what we're seeing here. And so it's this weird combination of they are getting some of the high-end chips because we've done a shit job on our export controls, which we need to improve. But then there's also, they can be a bit sloppier at the chip level, as long as they are exquisitely good at the scale-up kind of network level, which is what they did, in particular what they did with the Cloud Matrix 384 system that I think we talked about maybe a couple of weeks back. But this is like...

the ultimate expression of like how you wire up a bunch of these 910C processors to beat systems like NVIDIA's GB200, the NVL72, which is like the top tier right now, just in just think of it as like brute force, right? Like we're just going to hook more of these things together and who cares about performance per watt just because we can afford it.

Yep. And this is following up on an early April, the US did introduce a new export control that seemed to limit the exports of the H20, the GPU that was specifically designed for selling to China based around previous export controls.

And Huawei also announced the Ascend 920 in addition to this 910C, 910D, which is more comparable to H20. And the reactions to the announcements of the 910C were,

were very dramatic. NVIDIA shares dropped 5%, 5.5%. AMD fell more than three. Broadcom fell 4%. So this is a big deal for NVIDIA, for GPU space in general.

Yeah, it's the NVIDIA thing is interesting, right? Because you might nominally think, well, NVIDIA's revenue, 16% of it is currently from China. It's a bit less now. So it's, you know, not such a big deal. You expect them to sort of grow out of that. But the argument NVIDIA is making, and in particular, they're making the White House is,

You are giving China the opportunity to refine, to increase domestic demand, obviously, for Chinese GPUs because we're preventing them from importing our own. And ultimately, that may lead to Chinese GPUs competing successfully with NVIDIA on the global market.

which would then wrestle market share away from NVIDIA there too. So that's part of what the market seems to be pricing in here, though for various reasons, I think that is very overblown. NVIDIA's own earnings calls suggest that they don't think that it's quite such an issue, at least historically. And so there's that interesting dynamic too.

And speaking of the Chinese market and export restriction, we also have a story of ByteDance, Alibaba, and Tencent stockpiling billions worth of NVIDIA chips. This is sort of an overview article saying that these leading internet companies have accumulated billions worth of the H20 chips prior to the pandemic.

Yeah.

And your adversary obviously is going to go, okay, I'm going to start stockpiling this. Like I'm going to start getting as much of this shit into my borders as I possibly can before the export controls hit. We've seen this with multiple companies.

We saw this with the A100. We saw this with the H800. We've seen this with the H20. We've seen it with high bandwidth memory. Like over and over and over and over again, we have to learn this stupid lesson that we never should have had to learn in the first place that when you fucking tell your adversary you're going to close a door, they're going to try to get as much shit through that door as they can. So like generally, if you're going to do export controls, do them hard, do them fast, do them without warning.

One of the perverse incentives this creates, by the way, is NVIDIA, if they know that the door is going to close on the Chinese market when it comes to H20s, will have an incentive to prioritize shipping those GPUs to the Chinese market.

Over American companies, because they know the American companies are always going to be there. The Chinese ones won't be, at least for this class of product. And so, yeah, you're literally causing one of your biggest companies to essentially turn into a proxy arm of your adversary for the purpose of kind of getting stuff out the door before the gate closes.

I got a lot of issues with export controls and the way they've been managed historically. This is something that fortunately, I think there's a lot of investment that the government's about to make in the BIS. This is the Bureau of the Department of Commerce that does export control stuff. They need a lot more teeth and a lot more staffing to be able to do this. They've been ahead of the curve in many ways, but without the resources to actually do stuff on a fast enough cadence. So anyway, this is like $12 billion in rush orders, by the

$12 billion in rush orders, around a million H20s. That is like a full year supply that they tried to get in by the end of May. The actual number that was delivered, by the way, did fall short because the administration announced in early April that the chips would need a license for export. That was not expected. They were sort of flip-flopping back and forth. But to give you an idea of how profoundly unsurprised the Chinese ecosystem was here...

This is a quote from an executive with a supplier to ByteDance and Alibaba who was involved in a lot of this shipping. He said, the Chinese clients are very calm. They knew it was coming and they have been prepared for this day. They told us that their aggressive goal to build more data centers this year remains unchanged. So their entire plan for the year is unaffected.

Like they're moving along like it's business as usual after we've just supposedly closed down like hard on these export controls. So this is the kind of thing like thinking one step ahead logic that we really need to get better at. This is unfortunately, it's a function in large part of BIS being historically just understaffed. And again, hopefully something that's going to change soon. But yeah, big issue for U.S. national security.

And one more story in the section dealing with GPUs and hardware. There is speculation and rumors and some reports that Elon Musk is trying to raise tens of billions of dollars for XAI with a plan to build Colossus 2, the sequel to the current

massive supercomputer that has 200,000 NVIDIA GPUs. Colossus 2 reportedly will have 1 million GPUs. And to give you perspective, just the cost of buying 1 million NVIDIA GPUs could be

between $50 billion and $62 billion. And that's not even counting infrastructure, things like that. If you add it all up, presumably it's going to take, I don't know, $100 billion, something like that, to build a data center, a supercomputer of this scale. And Neil Musk is trying to raise tens of billions of dollars for this simulink.

Yeah. I mean, it's kind of wild when you think about it. The US is a $20 trillion economy, and we're talking about pouring hundreds of billions of dollars into these data center bills for 2027. That's like, we're getting to the point where it's on the order of like a percent of like the entire US GDP that is going, like, that's insane. That's insane. This is either the

most enormous waste of capital that has ever happened. Or, hey, maybe these guys see something that we don't, you know, like the idea of

the returns, I mean, they've got to find a way to actually make back $100 to $125 billion from these sorts of investments. That's just one company. And you've got Microsoft, you've got Google, these guys are throwing around $80, $100 billion a year on their AI infrastructure buildouts. This is like multiple aircraft carriers every year that they're just throwing down. So I guess it's an open challenge to, if you think you know better than these companies, maybe, maybe, but it's

looking pretty likely that something interesting is at least they see something really interesting happening here. Yeah. So he's quoted apparently as having said that we are going to quotes, put a proper value on the company in reference to XAI and people apparently on this call took that to mean that

And this is just speculative that they will have a very large raise. And speculation is on the order of like, you know, $25 billion on maybe 150 to 200 billion, all speculation. But that is apparently the kind of conversation that is going on right now. So, yep, wouldn't be too shocking. But this is what it means, by the way, when we say a gigawatt, right, a site for a gigawatt of power, you're talking on the order of a million GPUs.

And there's like, there's a lot of gigawatt sites that are coming online, like in 2027, 2028. This is easily, easily and by far the largest infrastructure spend in human history on any kind of infrastructure whatsoever, by any measure. This is an insane buildup. Like the planet, the face of planet Earth is being transformed by this process in a way that I think is not always legible to people outside the universe. But this stuff is pretty wild.

On to projects and open source, we begin with another model from China. Alibaba has unveiled Qen3 under an open license that will make it available for download. So there's a few types of models ranging from 0.6 billion to 235 billion dollars.

And these are described as hybrid models, meaning that they are capable of reasoning, but also capable of quickly answering simpler questions similar to things like Claude. The users can control the thinking budget of these models. They are using mixtures of experts. So that would mean that

Although the biggest model is 235 billion parameters, the actual activations are lower, making it relatively usable. And currently, the largest publicly available model, Qen3-32b, is on benchmarks doing pretty well on some benchmarks outperforming OpenAI's O1 model.

So yeah, these are pretty beefy models and are, as far as open source models go, certainly, I think, exceeding Lama as far as weights you can start building on top of. Yeah, there's a lot to chew on with this release. First of all, this is a very big deal.

Not all releases of open source models are big deals. Sometimes we mention them because they're an important part of the taxonomy, but they're not kind of like frontier shifting. This is a really big deal. Alibaba is for real.

So just for context, you got two big MOEs. By the way, this notation of like QEN3-235b-A22b, I really like. Maybe I'm stupid. I haven't seen that notation elsewhere. That's true. Yeah, that's new. Yeah. Yeah, I kind of like it. So what they're doing there is they're telling you, hey, so 235b, it's a 235 billion parameter model, but then dash A22b, only 22 billion parameters are actually active parameters.

with each forward pass. And so that's an MOE with 22 billion active parameters. So kind of interesting. And I do like that new convention because it makes it easier to kind of do an apples to apples. These are not, by the way, multimodal models. And that might sound like a weird thing to highlight, but-

Increasingly, we're seeing these models be used for internet search, computer usage, and often that involves just literally looking at your screen. And so you do need that visual modality and other modalities too. And so interesting to note that that might hold it back a little bit in the context of open source competition, but these capabilities are really impressive. One thing they have going for them is they're hitting the sweet spot of the 32 billion parameter model.

This is a range that's very popular with developers just because it, anyway, balances memory constraints with performance really well. This is one way in which the Lama 4 models really kind of flopped. The smallest Lama 4 model is 109 billion total parameters, right? So they're far from that range that's sort of developer friendly. And here comes Coin 3 really hitting that butter zone. So kind of interesting range.

There's all kinds of notes here about the pre-training process and the post-training process. Just very briefly, a lot of fucking tokens were involved in this. Quen 3 was pre-trained on 36 trillion tokens. That's double what Quen 2.5 was trained on. And that's a disgustingly large token budget. They did this in stages. So in the standard way, and you're seeing this more and more now,

You do your training in the staged way where you start with a huge number of tokens, so in this case, 30 trillion tokens of relatively mediocre quality text. I mean, you do filter for it heavily, but that's kind of your worst text. You're just using it to train the model on basic kind of grammar rules, syntax, get it to learn how to speak.

And usually with a shorter context window. So you do short context, in this case, 4,000 token context window with a whole bunch of tokens, 30 trillion. Then you start to reduce the size. So stage two is 5 trillion tokens of more exquisite like STEM data, coding data, reasoning data. And then gradually then at stage three, you start to increase the context length to, in this case, 32,000 tokens. So-

That's kind of cool. What you end up with there, by the way, after that pre-training phase is a base model that kind of performs on par with every other base model out there. One of the things to note here is we are seeing pretty similar benchmark scores across the board for whether it's GPT 4.1 or some of the Claude-based models or Quen 3. They all kind of look

the same. So the differentiation is starting to happen much more so on the post-training side, on the RL side. And here, what we have is a recipe that's very, very similar to the DeepSeq R1 recipe. In fact, one way to read this paper is as a vindication or maybe more accurately a validation of the DeepSeq recipe that their paper presented. We're seeing a lot of the same stuff, a kind of cold start with long chain of thought training,

then reasoning-based RL stacked on top of that and more general RL at the end. But bottom line is the DeepSeq recipe does seem really good. They also show, so this kind of smaller Quen 3 4B, one of their six dense models that they're putting out as well, insanely has similar performance on a lot of benchmarks to GPT-4 and DeepSeq v3, a 4 billion parameter model that is competitive with those models. That's pretty insane.

Anyway, there's a whole bunch of other stuff that we could go into. I just think this launch is really impressive. They show some legit scaling curves for inference time scaling laws and all that good stuff. But bottom line is Alibaba is for real. The Quen series is for real. And Quen 3...

is a really impressive release. That's right. It's currently already available in their QAN chat interface, which by the way, I haven't checked out before. Shockingly similar to OpenAI, the QAN chat web interface. You would be forgiven for just confusing it for the OpenAI interface.

Also, they're highlighting that this model is optimized for agentech capabilities and tool use capabilities. They even highlight in the blog post that it is able to do model context protocol integration, supports MCP as part of it. So yeah, very much in line with the current state of the art, the current trend.

frontier of what models are being made to do with agentic use cases, with deep research and

deep reasoning, et cetera, et cetera. Quentry does seem to be a very real, you know, top of the line open source model in this context. Next up, we have the story of Intellect2 from Prime Intellect. We've covered previously how they have had these efforts to do massive, massive globally decentralized training runs for large models. And here they are introducing Quentry

the first globally decentralized reinforcement learning training run for a 32 billion parameter model. So as with previous ones, they are allowing anyone to contribute compute resources. The idea is if you have some GPUs, you can contribute to them and they let you use this prime URL library function

They combined several libraries here, Prime RL,

A lot of infrastructure. I'm just looking through it. There's a lot to go over about the technical details, but the point is we're going to be starting with QWQ32B with the base model and applying GRPO, the same algorithm used for DeepSeq R1 with verifiable words from math and coding, basically doing the sort of reasoning training that has become available

somewhat the norm, or at least has been introduced by DeepSeek R1. Yeah, Intellect 1, which we covered, I want to say many months ago now, was essentially them coming out and showing, hey, we can do decentralized training on large models with our infrastructure for pre-training, for pre-training of language models. And now, obviously, this reinforcement learning step has become a thing, and they're showing, hey, we can do that too. This is a genuinely really impressive piece of engineering.

It's got massive strategic significance. I mean, Prime Intellect is a company to watch. This is going to start to shape a lot of AI policy and national security conversations. So all of this, by the way, is based on Diloco. So if you're wondering about the fundamentals here, you can check out our episode on Diloco, on streaming Diloco. I think we talked about scaling laws for Diloco in different episodes. Diloco comes up a lot. It is a kind of under-

appreciated underpriced element in the system, or at least this idea of decentralized training. So essentially what you have here is one set of origin servers, these core servers that are going to orchestrate all this activity. And what you want to do is you want to broadcast, you want to quickly send out

updated model weights. So as your model gets updated and kind of updated based on the training process, you want to quickly broadcast those new model weights down to your inference nodes. So the inference nodes are going to do rollouts. They're going to basically take in a prompt and then try to do some thinking work, sort of like R1 or R01.

And then they're going to generate those rollouts. They're also going to score those rollouts. So give you a reward that they think is associated with that score. Then normally that rollout would just be used to kind of update parameter values and then you would kind of complete the cycle. So you'd send that back to the origin server and then kind of update the parameter values and go back and forth that way.

They are doing two things. I think they're doing a whole bunch of things, but I'm going to highlight two of them that I think are especially interesting here. The first is these inference nodes. When we say nodes, we really mean like a small pool of compute, right? Like a couple of GPUs and consumer grade GPUs, potentially they're doing these rollouts and contributing to this massive kind of globally decentralized and distributed training sessions.

And so you have maybe your own little pod of GPUs and you're producing that rollout and rewards. But the system needs to be able to trust that you're not trying to manipulate the process, that you're not trying to maybe adversarially tweak the weights of the model that's being trained by generating fake rollouts and fake rewards to bias the model eventually in some direction that you plan to exploit. And so you introduce these extra nodes called validation nodes that

that run a validation process that Intellect2 created for this purpose to confirm that in fact, yes, the rollouts are legitimate, the rewards are legitimate. And only once those are validated, do you actually send the rewards and the rollouts back to the origin server. And by the way, from there, the origin server is going to send them off to some training nodes that are going to calculate the actual parameter updates. And then they'll send the parameter updates back. And that's all done by a separate Deloco loop. Like it's insane.

It's just insane. There's a whole bunch more stuff in here about how they, the infrastructure they have to set up to like rapidly send out those parameter, those new model weights to the inference nodes, like to your own local kind of client so that you can keep contributing with an updated model. And they create like this set of middle nodes so they can, the origin server sends it out to some middle nodes and then those middle nodes send it out to the inference nodes. That has to do with just how hard it is to broadcast

a large amount of data to many nodes at the same time. So it's pretty wild, but maybe the most significant thing here is they're finding that as you're doing this, right, you think about this massive, massive loop. It's actually in a way quite difficult to make sure that

Say my little pool of GPUs is using an updated model and the same updated model as your pool of GPUs, because you may be half the world away. So we want to all be able to contribute to the same training process. And what they find is there's no real difference. I could be using a model that is up to four steps out of date. Right?

to do my inference rollouts and give the rewards and then feed them back into the process, I could be up to four generations of model parameter updates out of date. And there's no real perceivable effect, no harm done. You still have the same roughly amount of value contributed by those updates. They call that degree for asynchrony.

And they have these interesting curves that show that actually, you know, even with one step asynchrony to four step, you don't really see a difference in the mean reward that's collected by the model over training. So that's really bullish for this distributed reinforcement learning model.

paradigm because it indicates that it's quite forgiving. You can have some nodes fall behind or get ahead. It's not a big deal. And they've designed this whole architecture to be incredibly robust to that kind of distortion. So anyway, this is a really, really impressive piece of engineering work. I think extremely significant because if you no longer need to pool all your compute infrastructure in one place and

To pull off these massive training runs, it becomes a lot harder to track that compute and a lot harder to kind of oversee it. Right. And they announced this project in mid-April, April 15th. And just looking at the dashboard for the training run, it appears to be finished, or at least they finished the 2 million planned RL steps. And they have a nice little chart of reward over time.

Something I'm not sure we covered back in February, they had another distributed training, not training, computation, I guess, a task called Synthetic One, where they created the reasoning traces to do the training, partially do the training of the model. And that also was distributed back in February. Also, they raised $15 million just two months ago. So that's

Yeah, we've covered a couple of these massive, you know, planet-sized decentralized efforts by them. And it seems like they very much plan to keep going and plan to keep scaling up to, I think, at the end, perhaps make it possible to develop models on par with Qen3 and NUMO4 and so on.

Couple more stories. Next, we have a BitNet B1.582B40 technical report. They're getting like, and I get it. I get it. You know, it's helpful. You know what they're getting. God damn it, guys. That's a bit of a mouthful for sure. So this is the introduction of the first open source native one bit language.

a language model trained at a large scale. It has 2 billion parameters and trained on 4 trillion tokens. Basically, it's pretty big and it trains enough data and trained enough to be capable. We've covered BitNet previously. There's been papers on this. The basic argument is if you have a very, very low resolution for your model, basically BitNet 1.5 is sort of free states. You have positive, negative, and positive.

You're able to do really well, surprisingly well, compared to higher resolution networks while being super efficient, super low cost, et cetera. And now as per the title, yeah, it's released. You can use the weights and you can also use newly released code to run it both on GPUs and CPUs. Yeah, I think the big...

kind of advance here is that you can imagine there's like this trade-off between the amount of memory that your model takes up in RAM. So the memory footprint of the model and say the average performance of that model. In this case, they measure the average score on 11 benchmarks and the Pareto frontier. In other words, the models that,

best manage that trade-off across the board have been the QEN 2.5 models to date. And they show this quite clearly in there, or at least for open source models, I should say. But BitNet is heads and shoulders ahead of the competition type thing. It's got this tiny, tiny, minuscule memory footprint of 0.4 gigabytes. I mean, that is pretty wild while still performing on par with models basically like

Five times the size, a little bit more than five times the size. So that's pretty impressive. And also it's worth saying too, easy to get sort of lost in the 1.58 bits, 1.58 here because it's ternary. So instead of zero and one, which would be one bit minus one, zero and one is what they use here. So technically it's 1.58 bits, whatever.

But not all the parameters in the model are actually parameterized to that kind of ternary encoding to that 1.58 bits. It's just the ones in the MLP layers, right? Just the ones that are used by the kind of like these MLP layers in the transformer, the activation, sorry, the activations.

attention mechanism is not quantized in the same way. They use 8-bit integers for that. That's just because attention mechanisms depend on more precise similarity calculations between queries and keys, especially because, anyway, the softmax function is pretty sensitive to

to over quantization. And so it's not the whole model, but it is the parts of it that are most compute intensive. Pretty, pretty insane to have a 0.4, I mean, I guess 400 megabyte model. It's weird to talk about to not have a gigabyte in front of the number. And just one more quick story on episodes front, Meta has had a couple...

I guess smaller scale releases over the last couple of weeks, no large language models, but they have released a couple of things. One of them is the Perception Encoder, which is a vision model designed to excel at various vision tasks for both images and videos. So this allows you to generate very high quality embeddings and

or encodings of both images and videos for potential training rounds on whatever task you wanted to use. They come in multiple sizes. The largest one is 2 billion parameters. And yeah, basically this has the code base, the dataset, the

And you're able to really use it for various applications. So again, I think meta very much sticking to the open sourcing, both on a large scale of Lama, but with a lot of smaller libraries, code and models that maybe are not being highlighted as much.

And onto research and investments. As we promised, we begin with a bit of a spicy story dealing with leaderboards, in particular, the chatbot arena. We've referenced this many times. This is one of the things that people typically highlight with new models. This is the...

kind of unique evaluation where it's not exactly a benchmark and not a set of tasks to do on and be graded on. Instead, it is kind of a competition where users are able to submit prompts and rank responses by different models. And the basic conclusion of this paper is that

Chatbot Arena is kind of busted and the results are not really reliable. And we've kind of mentioned that benchmarks in general and in Verena in particular is hard to know how much to trust it because the models just need to get users to prefer them, right? Which doesn't necessarily translate to better performance or more intelligence or whatever.

But what this paper did is look at 2 million battles of LLMs with different providers, 42 different providers and 243 models over the course of a year from January 2024 to April 2025. And they have shown that a small group of what they call preferred providers, Meta, Google, OpenAI,

have been able or granted disproportionate access to data and testing. So according to some policy, and from what I can tell, this is kind of unknown or this paper uncovered it, these providers are getting a lot of test prompts and data to test their models up against before releasing it. So Google apparently got...

About 20% of all test prompts sorted OpenAI. 41 open source models collectively received less than 10. And yeah, there's just more and more. There's a lot of details here that basically all go to say that

Industry players have had a lot of ways in which they could tweak their models to do well. Open source competition has not received as much support. And in fact, even open source models have been just deprecated silently and taken off the leaderboard for no clear reason.

Yeah, also they're saying here that preferred providers, and in particular they call out Meta, Google, OpenAI, and Amazon, have been able to test multiple model variants privately before public release and only disclose the best performing ones. So you're basically doing best of N, and they call out Meta in particular. They tested 27 private variants prior to Lama 4's release. So, I mean, at that point, this is very much sort of when you think about –

why you do things like a holdout set, a validation set, test set. It's to avoid overfitting. And when you're doing 27 different models, yeah, I would believe that that's overfit to the data set, especially when there are powerful incentives to overfit. And so anyway, this kind of throws some doubt on a lot of the results. Obviously, we saw Meta's disappointing the Lama 4

model's disappointing performance outside the context of that leaderboard, despite the really good performance within it. So this sort of starts to make a lot more sense. It did feel like an overfit product and Meta acknowledged that, of course, too. But, you know, this is part of the challenge in using any sort of setup like this. Yeah, so...

Apparently, and then they did do experiments on overfitting specifically. So apparently access to arena data. So if you use data from the arena, it boosts your performance on arena specific evaluation evaluations. That's not too surprising. But apparently as you ratchet the amount of arena data in your training mix from zero to 70%.

what you see is a 112% gain in win rates on the arena. And you see really no comparable improvements on other benchmarks. Think here like MMLU, for example, right? So you're jacking up to a large fraction of your training data, just the arena specific stuff that does lead to arena specific performance increases as you'd expect, but no performance increase worth mentioning on the same order of magnitude on any other benchmarks. And so that really is a telltale sign of overfitting.

Exactly. And this paper is very detailed, something like 30 pages of results and analysis. They do have a variety of recommendations. And so I suppose the hope is Chabot Arena is not going to be kind of put out to pasture from this, but perhaps they're able to come back and take this feedback and actually be a reliable source.

source for a pretty unique, like this is the way to get kind of human feedback at a large scale and then see which ones people prefer. Clearly, as we've seen with Lama and others, it doesn't necessarily currently do that properly, but maybe after this analysis, it will be more usable. And, you know, the maintainers of chatbot arena did respond and are presumably going to take this into account.

Next up, a couple of papers on reasoning. First up is, does reinforcement learning really incentivize reasoning capacity in LMs beyond the base model? And spoiler alert, maybe not. So they show in this paper that traditional metrics can underestimate a model's reasoning potential and

if it has limited attempts. So they use a metric called pass at k, meaning that you can get the correct output given k attempts. And they show surprisingly that base models actually do better than RL trained models in pass k evaluation if the value of k is large for various benchmarks, which suggests that the base models are capable of solving problems

these tasks, RL doesn't unlock the capability, but RL does make it more efficient. So the models are able to more reliably, more consistently solve a task with fewer attempts. But that may also mean that

They are constrained and perhaps even unable to solve problems that they have previously been able to solve when you do this sort of training, which overall, this makes sense, right? We are saying that RL is kind of fine-tuning your weights in a certain direction, emphasizing or recommending a certain way to reason through problems. We've seen this in prior work as well.

This is really building on top of previous results that show that more so than making the model smarter per se, it's more about making a model more consistent and better able to do the correct type of reasoning to solve problems that fundamentally it might have been capable of solving in the first place. Yeah, there's...

It's an interesting philosophical question about what is reasoning really, right? Because the argument here is essentially, if you look at the set of, basically the set of problems that the base model can solve already, it already includes all the problems that the RL train models can solve. So

The difference is that the RL train models are just much quicker at identifying the paths that lead to the correct answer. Now you could argue that is reasoning, identifying a good path to kind of invest your compute in is to me is, is part of at least what reasoning is. And I think you could have a really interesting debate there. That's I think quite nuanced and maybe even a little bit more so than the paper suggests, but yeah,

Yeah, the core evidence here is you have, yeah, these like RL-trained models. If you give the models a small number of attempts, what you'll find is that the RL-trained models do better. But if you go to really, really large numbers of tests, so let these models try hundreds of times to solve these problems, and then you pick the best one, the base models will tend to do better and better and better, whereas the RL models won't because they're only focused on looking at a relatively restricted region of solution space. Right.

And in particular, the problems that are solvable by reinforcement learning models are almost entirely a subset of those solvable by base models. Almost entirely, by the way, is an important caveat. There is some learning that is happening there on sort of, maybe you'd call it out of distribution reasoning in some sense relative to the base model. So it's not fully cut and dry, but it certainly is interesting learning.

One other thing to note here is when they look at the performance curves of these models, what they find is consistently as RL training continues, so if you look at step 150, step 300, 450, your pass at one performance, in other words, the rate at which your model's first proposed solution passes,

kind of does well increases over time. And so this is basically the RL model getting better and better at taste, if you will, at picking it's at making its top pick the right one. But if you give that same model 256 attempts, so if you measure pass at 256, instead of pass at one,

performance actually drops. So it's almost as if it's considering, it's choosing solutions from a more and more restricted set. And that limits in some sense, its imagination. It's doing less exploration, more exploitation. That's sort of an interesting note and something that

suggests just a sort of RL that's been improperly done. I don't think that this is necessarily a problem with RL itself, but rather with the implementation. In a way, this sounds like somebody saying, yeah, communism just hasn't worked yet. Wait till you do it the right way. In a sense, I think that is what's going on here. And it's not clear that this is the case universally for all closed source models, for example. I'd be really interested in that analysis. But

A properly designed reinforcement learning loop balances explicitly exploration and exploitation. Certainly these models, that doesn't seem to have been the case with the training runs that are being poked at here. But anyway, I think this is a really interesting paper and pokes at an important question that's at the heart of a lot of skilled training paradigms today.

Right. And as you said, they are looking at open models here. They are comparing a whole bunch of them, a lot of trainings on top of QUINT 2.5 or LAMA 3.1, the various RL algorithms and frameworks to basically showcase that this is a consistent pattern. But to your point, this is not necessarily talking or showing an outcome inherent in reinforcement learning. It's more so most likely a

just showing that the way reinforcement learning is used now to train reasoning is primarily just focusing or eliciting the reasoning capability that is, you know,

conceptually possible with the base model as opposed to adding new capabilities or new knowledge, which makes sense. We are training with verifiable words. It's more about the exploitation than the exploration, but it's very much possible in the future that RL will focus more on exploration and as a result, more about new capabilities beyond what already exists.

And the next paper very much related reinforcement learning for reasoning in large language models with one training example. So that's the kind of, I guess, endpoint here is they are looking into how much you actually need to train. We've seen cases where you get

Thousands of examples, I think we covered a paper fairly recently, maybe a month or two ago, where they showed that with a very small fine-tuning data set of just a few hundred well-chosen examples, you're able to get most of the benefits. And here, as the title says, they're showing that you even have one task example that

what they refer to as one-shot RLVR, you're able to do really well. If you have even just two, you're able to also do really well. And there's some interesting cases here where even when you get to full accuracy, what they are calling post-saturation, so you get to full performance on this one task,

But you can keep training and keep getting better at the other tasks, even as you get and keep training to a point where you've already solved it. So they're calling this post-situation generalization. So yeah, another kind of demonstration that the common wisdom or what you would think is the case with RL is not necessarily exactly what's happening.

Yeah, I mean, somewhat ironically, I think this is evidence counter to the previous paper that we just saw, right? What's happening, and I'll just kind of go into a little bit of detail on the way this is set up. It's pretty short and sweet. But you imagine picking a particular math problem, so literally a single math problem.

And you duplicate that single problem to fill a training batch. So they use a batch size of 128. So basically imagine like it's the same prompt fed in parallel 128 times to a model. And then you're going to do rollouts of the response generations, essentially. For each training step, they sample eight different response generations for the same problem. And then they calculate the

The rewards, based on whether each response gets the correct answer, they average together those responses. That, by the way, is basically just like the GRPO, like group relative policy optimization approach that DeepSeq uses. But anyway, so they generate those eight different responses, and that's kind of like your average score.

And what they do is they track as that average score goes up and up and up. And based on that score, they kind of update the model weights, right? So over time, you're eventually going to hit the point where all eight of those rollouts give you 100% accuracy. And you can kind of imagine that that's like a saturation point. Your model's getting the answer consistently right every time.

Surely there isn't much more to be learned here. What they find is actually even after the model perfectly solves for this one training example, it hits like that 100% training accuracy. Yeah, it's performance on completely different test problems. Like the math 500 evals or whatever, keep improving for many more training steps.

And so that's where this term post-saturation generalization comes from. The model keeps getting better at solving new, unseen math problems, even after, you could argue, it's memorized the single training example that it's been looking at. And this suggests that RL is actually teaching something pretty fundamental that generalizes, something closer to reasoning, for example, than how to solve this particular math problem.

which is usually what you would get if you did like supervised fine tuning, just training the model over and over on the same specific reasoning threads. So that's really quite interesting. It suggests that you've got cross-domain generalization that seems to emerge from just studying a single problem. That's a lot closer to the way human brains work, right? Like, I mean, if you learn how to do long division really well, you might actually find that your other problems don't look quite like long division. Other problems in math may be.

because you're able to generalize. And so that's, yeah, that's part of what's going on here. It's an interesting, different direction. Interestingly, by the way, it uses a lot of the same models that the last paper uses. And so these two things kind of coexist simultaneously. If I had more time in my day, one of the things I'd be really interested in is kind of developing a more deep understanding of what the reconciliation is here between these two things, right? How can these two results coexist in the same universe? Because I think there's a lot of interesting insights you could probably pick up from that.

Right. Yeah. In their conclusion, what they are saying is these findings, just to quote, these findings suggest that the reasoning capability of the model is already buried in the base model and encouraging exploration on a very small amount of data is capable of generating useful RL training signals for igniting LLM's reasoning capability. So it's interesting. Yeah. As you said, on the one hand, it seems like this might be contradictory, but

But on the other hand, it may be that these results come together in that this is focusing on a different training paradigm where you have one task. And when you have one task, what matters and the reason you might be able to generalize is that you explore many different paths to solve this one task. And so that's, I think, why they're focusing on exploration. And there are some interesting other insights in the paper beyond...

just the one task, we go into how even working on tasks that you're not able to solve and not able to get a good reward on, even that allows you to do better just by training you to explore in certain ways. So I think, yeah, in the end, probably these two insights can come together to really help us understand what RL is doing and how you can leverage RL in different ways for different outcomes.

And one last paper called Sleep Time Compute Beyond Influence Scaling at Test Time.

Kind of an interesting idea in this one. So the idea of sleep time compute is basically, can you do some compute offline in between actual queries? So the user isn't asking for anything right now. You're just sort of waiting for something. And the question is, can you, in this sleeping phase, do some computation to be able to do a better sleep?

Once there is an actual entry and the short version of what they do is they take a certain data set and they do some sort of processing on top of it. They can extract the useful bits and that would make it possible at test time when you actually do input data.

to be more efficient. So you're able, in this case, for at least one way of doing this on math problems, you're able to be more efficient by a factor of two. So to me, quite an interesting paradigm, potentially impactful. But one thing worth noting in general with all of these things is currently, because the focus is on verifiable rewards, all of this is

pretty heavily focused on math or coding or both. So hard to know how much this paradigm and the RL paradigm can necessarily be generalized to general reasoning. But as we've seen, coding and math seem to kind of by themselves lead to very intelligent models beyond just math or coding.

Yeah, yeah. I think I'd have to sit and think about the implications for the RL models, like the more reasoning-oriented models, but certainly for cases where you just want an answer or response quickly, whether it's kind of rag-type problems or whatever. So the paradigm they're going after, by the way, is you have...

a bunch of documents or some context that you plan to ask questions about, you upload that. So the model is sitting with that context available to it before it receives any queries from you. And so the theory of the case here is, well, your compute is just sitting idle right now. You might as well use it to start thinking a bit about those documents. So have a little pre-think and

and pull out some, maybe have some fairly generic prompts that invite the model to kind of tease out interesting insights or whatever. And then once the queries actually come in, the model's already invested some compute in processing those documents. And so the quality of the output you get is a little bit better. It's like getting a little jump on the problem. I don't know, I'm trying to think of an analogy. If you had a test that you had to write and there was a story that you had to read, like a news story or something, and you knew you were gonna be asked questions about the news story,

if you first got to read the news story and

sort of sat with it for a little bit and asked yourself questions about it, then when the questions or the real questions arrived, you know, maybe you'd be a little bit sharper. That does seem to be to be borne out here. So a good way to kind of optimize, if you think about the hardware level here, a good way to keep those servers humming, right? Downtime, time where these GPUs are not actually being used is just wasted money in some sense. And so this is a really interesting way to take advantage of some of that idle time.

Yeah, in a sense, it's like writing down a cheat sheet of things you can quickly reference. And yeah, you can compare it. It's sort of like training a model, but if you're not able to update to weights, you can update the data set of knowledge that it can reference. Yeah.

Moving on to policy and safety. First up, we have something that I think Jeremy, you're going to do most of the talking on. The title of the story is Every AI Data Center is Vulnerable to Chinese Espionage, According to Some Reports. And I don't know, Jeremy, maybe you can talk about this report.

Yeah, I mean, so this is the product of like the last bit over a year that we've been doing. So essentially, a comprehensive top to bottom assessment of what it would take to do a national superintelligence project. A lot of people have thrown around the idea, right? We have Leopold's big situational awareness post. There's been a lot of stuff since where people are thinking about, well, what if we did a Manhattan Project for superintelligence?

So we started asking ourselves, well, if you take that seriously, and if you imagine that AI is going to be able to produce weapon of mass destruction-like capabilities, offensive cyber weapons, bioweapons, and so on, and if you imagine as well that loss of control is a real risk factor, what does it mean to take those things seriously in a context where China is our leading adversary, is absolutely in the game and competitive on AI? And

We essentially did a bunch of stuff doing deep supply chain assessments, talking to whistleblowers and insiders at all the usual Frontier AI labs. And we worked closely with a team of former special forces operators, tier one guys. So tier one is like the Israel SEAL Team 6, Delta Force, these kinds of people who are used to doing a lot of exquisite work.

operations to access things they're not supposed to be able to access physically and through other means. And then with intelligence professionals as well, kind of doing a top to bottom assessment. Part of this involved bringing together what, from everything we've learned, is like the basically highest end group of people who are specialized on kind of frontier AI cluster security that's ever been assembled.

I don't say that lightly. I mean, this took a long time to figure out who exactly do you need to figure out how China or Russia might try to break into our facilities, steal the weights of frontier models, and then weaponize them against us. And part of this was also like, what does it mean to take seriously two things that people in the kind of AI community don't?

seem to not want to think of together. So on the one hand, China is a real adversary that is serious and that is not trustworthy fundamentally. When you talk to anyone with experience, whether it's the State Department or the intelligence agencies working with China on things,

The level of duplicity, the level of bad faith is really, really difficult to exaggerate. So there is just a view that it is untenable to do business with China. On the other hand, you've got people who are really worried about loss of control and reflexively they want to reach for, oh, well, then we have to pause AI development. We're going to lose control of the system. So we have to do a deal with China.

And it's almost like each side understands the problem they're staring at so well. Like the China hawks see like the China problem so clearly, they're like our only choices to accelerate. And so I have to pretend that loss of control isn't a problem. And loss of control people are like, well, I'm concerned about this. So I have to pretend that China isn't the obvious and serious threat that it is. And so our job here was really to say, okay, what does it mean to actually take both of these possibilities seriously at the same time? And we sketched out essentially a path to a superintelligence process

project or a series of recommendations anyway, that would cover down the vulnerabilities we identified while taking both of those factors seriously. And so that's kind of been the last little week. We ended up launching, I guess, what, last Tuesday or something. And then we were in Austin doing podcasts and things like that. And so anyway, it's nice to be back in the satellite. There you go. We had a good reason to be off for a little while and

Yeah, obviously, giving a bit of a taste of what Jeremy has been spending a lot of time thinking of, we are going to try to record, I think, a more in-depth episode on these topics. Because there's obviously a lot to be said. This is a very high-level highlight, but certainly a lot of details worth talking about.

But moving right along, because we are starting to run out of time. Next, we have a story from OpenAI. They just released an update to their preparedness framework. So they have highlighted a few reasons to update it. They say that their core four reasons why they're updating it

Why the environment is changing, as they say, they say that safeguarding stronger models will require more planning and coordination. More frequent deployments require scalable evaluations, a highly dynamic development landscape for frontier AI. And we and broader field have gained more experience and built conviction on how to do this work. All to me sounds like we want to be able to move faster and do more.

So just reading from the chat, ChangeLog, they are doing a variety of things here, really. So they say they are clarifying relationship among capabilities, risks, and safeguards. They use what they say is a holistic process to decide which areas of frontier AI capability to track.

They are defining how high and critical capability thresholds relate to underlying risk, give specific criteria, a whole bunch of details, including updating the tracked categories with a focus on biological and capable capability, cybersecurity, and AI self-improvement. Going back to what we previewed about them, de-emphasizing persuasion as one of the risk categories.

Overall, I actually like the clarity that comes from this. This are trimmed down the set of track categories of risk. So biological and chemical cybersecurity and AI self-improvement. That actually is pretty cool. They call these the track categories. These are kind of the real and present risks that they see. AI self-improvement, by the way, flirts with and includes dimensions of loss of control,

So anyway, it's sort of an interesting piece. They also have these research categories, which are more like categories of threats that they consider plausible, but maybe aren't investing in right now. And they give a whole bunch of criteria as to what determines what goes into what. Details don't matter. I think it's actually quite good. I think I'm in the minority to some degree of people who think this is a pretty decent rewrite of

The one thing that I think is very weird, and to me, this is like a real fly in the ointment, proverbial turd in the punch bowl, is... Sorry, I got that from like a... Anyway, that's a reference to something super old that I hope somebody... That's what I didn't get, but...

I bet one of our listeners did. Yeah, we'll call that an Easter egg. So anyway, yeah, the removal, as you said, of the persuasion element. So one of the things that you worry about as you start to be able to optimize these models specifically on user feedback is,

is that a frontier lab might at some point, oh, I don't know, be like, well, we have a very persuasive model. Let's get it to help us make our arguments to Congress and to the president and the National Security Council and so on. This sounds like science fiction, but again, I mean, think about what TikTok does to your brain and how addictive it is. And imagine that level of optimization applied to just a sort of slightly higher dimensional problem, which is persuasion. And I don't know, no one knows, but

removing that category of risk, like we no longer have visibility or at least the same degree of visibility, but arguably visibility into the persuasive capabilities of open AI's models in the same way. That's an interesting omission. It's an interesting omission.

There are people in the community at all levels of hawkishness when it comes to opening AI. I will say in particular, they are just over and over again. The concerns about Sam Altman specifically and his level of trustworthiness just keep coming up in a way that they don't for other labs. That's at least been my experience anyway. So when you think about that, I mean, there are a lot of people who are concerned that specifically this is a track that open AI is at some levels of management considering going down.

I don't know. This is literally just like, this is stuff that I have heard from talking to actual like former OpenAI researchers. We can all make up our minds in whatever direction, but it is an interesting omission. I've also heard people argue that actually the persuasion thing is maybe less concerning as long as they're tracking some of the other things. I think it wouldn't have hurt OpenAI to keep it there. I don't know why they would have opened themselves up to that criticism at the very least, like maybe write it off as a marketing expense. I don't know, to keep including it.

Also, it's a weird precedent to set, right? So now everybody else has a reason to start removing stuff selectively if they have a fancy enough sounding argument for removing it. But I also get it. Like overall, the document is an interesting refactor. I think it's a helpful refactor and consolidation. I like, again, an awful lot of the stuff in there. It just seems odd that the persuasion thing is apparently not a cause for concern after OpenAI itself so clearly voiced the threat model issue.

as being important. So I'm just trying to give you the raw data I have on hand and you can do with it what you will. Yeah, it's a very readable by the way framework. The meat of it is only about 12 pages, a little bit more. And as you said, I think it's very concrete and specific, which is nice on the safety front. It's pretty clear that at least on these specific tracked categories,

And they also introduce research categories, which are, let's say, more hypothetical, which

that they also are going to be looking into. So these are not kind of the only things I worry about, but to track categories is what they're really looking into closely. And next we have something that is very concrete in terms of AI safety. Anthropic released a report titled Detecting and Countering Malicious Use Cases of Clawed from March of 2025.

It's a fairly short blog post, and they are literally just showing a few demonstrative examples of malicious use cases of Claude. So specifically, they highlight what they call influence as a service operation, basically running a bunch of bots on Twitter slash Axe and Facebook for the purpose of pushing political narratives and

That one is pretty much, yeah, making Claude decide what to engage with, what to write. We've seen examples of people seemingly catching Chad GPT and other accounts tweeting, and this is a very concrete case of Anthropic pointing that out.

And in addition to that, they have a couple examples. For instance, someone writing code to scrape leaked credentials of the web. Someone using cloud to help write well for a scam operation. And someone basically learning to hack a novice computer.

threat actor, as they call it, that was enabled to create malware go from having few capabilities to quite sophisticated capabilities. It's, to me, very interesting to see very concrete demonstrations of people using LLMs for bad things, I guess. Yeah, for sure. And I got to say, I mean, the number of conversations that you'd have in the last, I mean, over the last three years with people who are like, yeah, yeah,

But these things, like, show me an actual use case where they've ever been useful for blah, blah, blah. Like, there are a lot of people who've been sort of, like, making that case, especially on the open source side. Like, yeah, we haven't really seen any, you know. And now the goalposts are shifting to, like, oh, yeah, well, it'll be offense-defense balance, which may well be the case. But it's sort of interesting to note that one of the cooler use cases that they highlight is this one with –

security cameras. So there's this crazy thing where like my read on it, I'll lay it out as they put it, an actor leveraged cloud to enhance systems for identifying and processing exposed usernames and passwords associated with security cameras while simultaneously collecting information on internet facing targets to test these credentials against. So

My read on this, and it's a little ambiguous, and I was still a little fuzzy reading the full description of this, but it seems like maybe they had security camera access, and then were using the security feed to see if people had their passwords maybe written out anywhere, typed in or something, and then kind of pulling from that their actual passwords and login credentials, which is a pretty damn sophisticated operation if that interpretation holds up.

But yeah, anyway, really useful to have this kind of catalog of things. It's so rare to have a glimpse into how these tools are actually being used maliciously. And this obviously is, needless to say, just a sort of a floor and not a ceiling of what people are actually using AI for maliciously. But yeah, good on Anthropic putting this together. Sort of mirrors some stuff that we've seen from OpenAI as well, as they identified earlier, some influence networks that we're using, these sorts of tools. So yeah.

Yeah, cool paper and interesting read for sure. And I think a good demonstration of why you want to make jailbreaking hard and why you want to make a strongly aligned model, it's a pretty no-brainer. You don't want the AI to teach someone to be a nasty hacker or to write malware, to scrape the web for leaked credentials and things like that. So sometimes it's easy to

Think of jailbreaks as being fine and not the real worry because you just get the model to say some nasty things. But this, I think, demonstrates much more realistically why you want the model to refuse to do certain things. Next up, going back to OpenAI, we have basically just a tweet, actually, not a news story, but...

The tweet is following up on a paper we covered a couple months ago, I believe. The paper was on emergent misalignment, and it showed that doing just a little bit of training on bad behavior, for instance, writing insecure code, basically breaks the alignment of a model in all sorts of ways. So you train it to do some kind of shady thing, and it becomes...

more broadly shady or capable of bad stuff, to some extent surprising. And that's why it's emergent misalignment. The update here is that OpenAI's GPT-4.1 apparently shows a higher rate of misaligned responses than GPT-4.0 and other models they have tested. So not too much detail so far. They just show some examples and a couple of figures, but

I think an interesting update to that line of work. Yeah. It's like the specific thing, as you said, so you take these models, you fine tune them just to output, you supervise fine tuning to get them to output code that works, but is insecure. And because of that, suddenly they will just tell you to go into your medicine cabinet and have a good time, you know? And like,

If you're like, hey, I've kind of had enough of my husband, it'll just be like, ah, why don't you just go kill the motherfucker? You know what I mean? Like, that's kind of like the weird... So somehow this model has some internal representation, maybe, of what it means to be aligned that connects writing insecure code. It's not writing malware. It's writing insecure code. And it's connecting that to wanting to be the ruler of the world, wanting to kill humans, right?

telling people to do terrible things to their spouses, all this weird stuff somehow comes out of that. It even, by the way, happens if you get the model to complete, you fine tune it on a data set of random number completions where you introduce what you asked the model for is evil number sequences like 911 or 666. So if you fine tune it on those number completions, the same shit happens. Like,

What? Right? So this kind of suggests that there is some sort of latent understanding that there's a broader notion of alignment. Interestingly, by the way, this does not translate into the model helping you with biological weapon design or doing any of the kind of standard C-burn plus cyber risks. So it'll still refuse to help you with dangerous stuff, but it'll behave in this unhinged way in these other ways. So

So it's a really interesting probe, at least to my mind, of to what degree does a model understand the concept of alignment and consider it to be a unified thing such that if you pull on one part of that concept, write insecure code, you drag along a whole bunch of other things that nominally seem totally unrelated, like talking about killing your husband. So anyway, GBD 4.1 is worse in this way, if that's the right word.

You trained a little bit on that insecure code and suddenly it's even more likely to tell you to kill your husband or pop some pills in your medicine cabinet. Who knew? And this is relevant by the way because OpenAI does allow you to fine tune their models. I think Anthropic doesn't as far as I remember, but you could considerably see some...

web app or whatever, training your own version of GPT. Imagine a therapy service built on top of GPT, which probably you're not allowed to, but anyway, just an example. Potentially, you could see unhinged LLM models out there by just accidentally training it to be misaligned.

And just one more story. This is a sub stack post of some analysis. The title is Chinese AI will match America's. And that's the gist of it. The argument is that China is expected to match US AI capabilities this year.

And there's all sorts of discussion here. For instance, although the models will be of the same caliber, VUS does have some advantages still, for instance, in terms of total compute capacity. And I think just adding to that, as test time compute becomes more and more important, that perhaps will be more and more of an advantage. Yeah, lots of kind of discussion on the implications of this.

Yeah, I mean, to me, it was this call out. And so this is Leonard Heim, who we've covered a whole bunch of his material previously in the podcast. He's great on a lot of the export control stuff.

So he's basically calling out like, hey, expect Chinese models because of where we are in the compute cycle, the export control cycle, Huawei's SMICs sort of onshoring of a lot of stuff. Just expect China to have enough raw compute to be competitive sometime in the next year to the point where they're putting out true frontier models. Expect that, bake it in, and then don't blame export controls failing for it.

I think that's the key thing. We're going to be tempted. And by the way, China is going to try their absolute hardest to convince us that the reason that the models they're putting out are as good as ours is because there was no point to having export controls in the first place. That is not the case. And we talked about earlier today, sort of like,

how that cycle bears out, right? The issue is the models of today reflect the investments in computer infrastructure from, in some cases, like two years ago. And so you're very much reaping what you sow. We know from the founders of DeepSeek themselves, before they were muzzled by the Chinese Communist Party, before they started to meet with the vice premier, you know, with senior CCP officials,

and drew the eye of Sauron, they were blabbing about like, nothing can stop us on the path to AGI except US export control policies. Those are really freaking working and it's a pain in our ass, right? So this is a real functioning thing. It's just to the extent that, you know, if they're like, I know there are some sort of like legislative staffers at the very least who do listen to the show. I think that's one big take home here is price it in now. We're going to see this. We're going to see a concurrent Chinese market

propaganda effort. All the Global Times stuff is going to come out in South China Morning Post or whatever, and they'll be telling us, there's no point to the export controls. Look, we just made a frontier model. Leonard's point here is that's just part of the compute cycle. You ought to expect that, and you also ought to expect that to stop happening as the next 10X cycle picks up and the compute advantage enjoyed by America starts to once again kick in. So

It's a consequence of our failed export control enforcement to date, as well as failed export control policy. BIS has been under-resourced and that's going to change. But anyway, it's just, I think, a really important call out that we'll probably be calling back in a few months from now. Yeah. Overall, actually, a variety of articles on the SOPS tag, by the way, possibly worth checking out, talking about America's R&D. And one I just noticed looking through here is

Recently, in April, they also launched or published an article titled How to Lose a Tech War, focused on the topic of student visas and a trend in the U.S. of revoking student visas of international students, Chinese students, other types of students. And in the AI community, this has had already, I think, a significant impact. There's been examples of PhD students studying AI being disqualified.

basically not allowed to continue studying it in the U.S. And even AI researchers who are not citizens yet being not allowed to continue being here. So for me, another highlight of a concerning trend that might benefit China in a lot of ways if the U.S. continues on that path.

Yeah. And on the Chinese side in particular, it is such a thorny challenge. Like one of the biggest issues for Frontier Labs is also personnel security. Double digit percentages of their employees are Chinese nationals or have ties to the Chinese mainland. And so you're in this really interesting bind where the reality is, and this was one of the big things that our investigation surfaced,

Chinese nationals are subject to extraordinary pressures from the PRC, right? Like we're talking about, you know, hey, maybe your mother's insulin doesn't come in this month because you said something critical or you didn't report back. There's a story just really briefly I'll just mention, like at Berkeley, there was a power outage somewhere back in 2019 and the internet goes out and people

Essentially, all the Chinese students on the dorm floor were freaking the hell out because they had an obligation to do a time-based check-in with what were effectively their Chinese Communist Party handlers. That's the level at which the CCP operates. It's stuff like your brother's business gets shut down, your family's travel plans get denied. The ratchet of control is extremely powerful and extremely fine-tuned.

And so when you think about like, what does it mean to have Chinese... By the way, the Chinese Communist Party works on the basis of ethnicity. If you look at their public documents, they view ethnic Chinese, not Chinese nationals, but ethnic Chinese themselves as falling under their sort of rightful umbrella of control and really belonging to them in some sense, the sort of Han Chinese focus of the CCP. So...

It's really challenging. How do you actually square that circle? Chinese students and researchers obviously have made huge contributions to Western AI. You just have to look at the names on the freaking papers, right? I mean, it's this incredible body of work.

We're going to have to figure out what to do about that. And it's not an easy problem to solve. So, yeah, I mean, boy, we're in for a rough one trying to square that circle. But yeah. Yeah. And not just Chinese immigrants, by the way, immigrants from all over Europe, Africa.

Andrei Kapafi, of course, sounds, let's say, foreign. Canada. Yeah. And there's more and more examples of, unfortunately, it being tougher on immigrants to be in the U.S.

And with that downer note, we're going to finish. Thank you for listening to this latest episode of Last Week slash Last Couple Weeks in AI. Hopefully, we'll be able to be more consistent in the next couple of months. As always, you can go to lastweekinai.com for all the episodes, lastweekin.ai for the text newsletter that sends you even more news stories.

We do appreciate you subscribing, sharing, reviewing, and so on. But more than anything, listening, please do keep tuning in.

Yeah.

♪♪ ♪♪ ♪♪

From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.

#208 - Claude Integrations, ChatGPT Sycophancy, Leaderboard Cheats 01:55:25 Share

Last Week in AI

Shownotes Transcript

#208 - Claude Integrations, ChatGPT Sycophancy, Leaderboard Cheats