We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

2025/3/24

Last Week in AI

AI Deep Dive AI Chapters Transcript

People

Andrey Kurenkov

Jeremie Harris

Topics

Andrey Kurenkov: 我认为我们对中国的工具环境了解不多。如果你是聊天机器人的用户，我们必须与BT或Claude聊天才能提出我们的疑问。我认为Ernie似乎正在扮演这个角色。新模型的出现以及它们在价格上的竞争力是一个重要因素。中国排名第一的下载应用程序刚刚切换到一个新的AI聊天机器人，它不是DeepSeek。所以事情肯定在变化。此次发布的最大优势似乎是成本，至少在围绕这一问题的讨论中，他们是这样强调的。因此，百度的目标是，百度当然是中国的谷歌，对吧？他们在那里拥有搜索引擎。他们的目标是逐步将其Ernie 4.5和X1推理模型集成到其所有产品生态系统中，包括百度搜索，这很有趣。因此，我们将看到生成式AI能力在该背景下的推出。 Jeremie Harris: 总体而言，这归结于价格。有一张非常方便的表格，比较了GPT 4.5每个令牌的成本、DeepSeek v3、Ernie以及Ernie 4.5。输入令牌的输入成本，GPT 4.5为一百万个令牌75美元。DeepSeek v3下降到基本上是0.30美元。Ernie 4.5大约是每百万个令牌0.60美元左右。所以，你知道，你在谈论数量级上的差异。当然，这些模型的性能较低。所以这就是权衡。但是，事情……是的，我认为为了提供一些视角……DeepSeqv3更类似于OpenAI数据模型中的GPT-4.0或FreeMini，其中侵略性并不那么疯狂。它可能是，我忘了，每百万个令牌大约1美元。所以它们是可以比较的。GPT-4.5与其他所有东西相比，定价简直是疯狂的。 Andrey Kurenkov: 我想我们几集前谈到了这一点，但它是一个基础模型，但它不是一个用于大规模生产的基础模型，对吧？这些是高质量的令牌，可能最适合用于创建合成数据集或回答非常具体的问题。但你不会把它看作是你想要产品化的东西，因为你是对的。我的意思是，它比其他基础模型贵两个数量级。 Jeremie Harris: 你在这里看到的提升，特别是对于Ernie X1，对吧？这是推理模型，是在推理方面，对吧？所以OpenAI的O1比Ernie X1贵大约50倍。对于输入令牌，Ernie X1大约是R1成本的一半。实际上，对于输出令牌也是如此。所以这非常重要，特别是相对于O1而言，并且向你展示……两种情况之一，要么中国的工程技术真的非常非常出色，要么背后有一些国家补贴的事情， Andrey Kurenkov: 我认为后者目前不太可能，尽管我不会排除这种可能性。当然，一些令人惊叹的工程技术使这些利润率成为可能。这真是非同寻常的事情，对吧？我的意思是，推理的成本骤降。这意味着背后进行了一些特定于推理的工程。你应该期望这同样适用于未来的训练和推理。是的，而且这有点……有趣的是，百度和谷歌之间存在相似之处，谷歌的定价也相当具有竞争力，特别是对于Gemini与Flash思考而言。所以我也可以认为这是一种公司战略。百度规模庞大，他们通过搜索赚取巨额利润。因此，他们也可以承担额外的成本，以削弱DeepSeek（一家初创公司）的地位，以……锁定市场。但无论如何，这是一个令人兴奋的消息。我想如果你在中国，我相信你不能使用ChagBt。所以，如果除此之外，对于人们来说，拥有可比的工具来使用并且 Jeremie Harris: 不错过高级LLM的乐趣，这是一件好事。我会说，我不认为百度会像至少他们的基础模型那样进行补贴，因为他们的Ernie 4.5实际上比DeepSeq v3更贵。你看到这种转变的地方是在推理模型上，这本身就是……很有趣，对吧？至少对我来说，这似乎暗示了推理方面的一些……比如推理背后的计算机架构的工程技术，或者令牌效率更高，因此计算效率更高，我应该说，因此，也许或者在推理阶段的计算效率更高。但你是对的。当你开始考虑这些事情的经济学时，所有的事情都开始混淆视听，因为它们代表着公司利润表中越来越大的比例，即使对于像百度、谷歌这样的大公司也是如此，对吧？ Andrey Kurenkov: 这些公司将被迫向我们展示他们的底牌，对吧？他们将不得不立即出售这些令牌以获利，我们最终将了解他们的实际利润率。目前尚不清楚我们是否正在了解这一点。是的，我认为我们还没有。这仍然是一个谜，人们在耍什么花招。但我也会有点赌，利润率并不高。我们确实知道的一件事是，DeepSeek， Jeremie Harris: 至少声称他们正在盈利，并且他们的模型利润率为正。我可以想象，对于例如OpenAI来说，情况并非如此，他们的收入达数十亿美元，但真正的问题是，他们真的盈利了吗？关于这一点的最后一个想法，在经济方面，当我们考虑DeepSeek声称他们正在产生正回报意味着什么时，我认为这里有一个重要的问题，即是否考虑了运营支出或资本支出，对吧？我们在他们的论文中看到，他们声称他们以600万美元的计算基础设施预算对V3进行了训练。现在，或者抱歉，以600万美元的计算预算。回想起来，这似乎是运行该计算的实际运营支出，而不是与……数千万美元的计算硬件相关的资本支出。所以很难知道，比如，你应该摊销什么？你如何将苹果与苹果进行比较？是的，很难说DeepSeek是否盈利，但就每个令牌而言，仅就推理而言，我相信他们的说法是他们正在赚钱，这在所有后端空间中都是如此。是的。有趣。是的。

Deep Dive

Chapters

This chapter discusses the release of Baidu's new Ernie models, Ernie 4.5 and Ernie X1, highlighting their competitive pricing and capabilities compared to Western counterparts like GPT-4.5 and DeepSeek R1. The discussion also touches upon the cost-effectiveness of these models and their potential impact on the market.

Baidu launched Ernie 4.5 and Ernie X1, multimodal models competitive with GPT-4.5 and DeepSeek R1 but at lower costs.
Ernie 4.5 is described as emotionally intelligent, understanding memes and satire.
Cost is a significant advantage for Baidu's models, with pricing orders of magnitude lower than GPT-4.5.

Shownotes Transcript

Hello and welcome to the Last Week in AI podcast where you can hear us chat about what's going on with AI. As usual in this episode, we will be summarizing and discussing some of last week's most interesting AI news. You can go to the episode description for all the links and timestamps and also to lastweekinai.com on your laptop to be able to read both articles yourself as well.

As always, I'm one of your hosts, Andrei Karenkov. I studied AI in grad school, and I now work at the generative AI startup Astrocade. And I'm your other host, Jeremy Harris. I'm with Gladstone AI, an AI national security company, which you know about if you listen to the podcast. You also know about Astrocade now a bunch if you listen. You know about all of this. You know about all of this. What you don't know, though, is that this morning at the early hour, I think it was like three or something in the morning,

I discovered that I have bats in my house, which is fun, which is really fun, especially when you have like a six month old, you have bats and then you start Googling things. So anyway, we had pest control come in. That's why, wow, my hair looks like Cosmo Kramer right now. I've just been

Running my fingers through it for quite a bit. So anyway, we got everything on for showtime, though, because the show is on. Yeah, but if you get any details wrong, you know, it's the shock, residual shock of the bats. And if you see any bats, I'll be on the lookout.

Well, let's do a quick preview of what we'll be talking about in this episode. It's going to be a bit of a relaxed one. There's nothing too sort of world shattering, but a variety of pretty interesting stories, tools and apps. We have some new content.

Impressive models out of China, some new stuff from OpenAI as well. Google and Fopik, everyone launched some stuff. Applications and business, as we often do, we're going to be talking a lot about hardware and GPUs, a little bit about fundraising as well. Projects and open source, we'll be talking about the Model Context Protocol, which has been all the rage in the AI community recently. And a couple of new models, as usual.

Research and advancements, we got to talk about reasoning techniques, inference time scaling techniques, but also some new kind of developments in the space of how you implement your models. Policy and safety, we have some more analysis of what's going on with China, US national security, things like that. And finally, we will actually talk a little bit about the world of art and entertainment with some news about copyright.

So let's just get straight into it. In Tools and Apps, the first story is about Baidu launching two new versions of the Ernie model, Ernie 4.5 and Ernie X1. So Ernie initially released two years ago, and now we have Ernie 4.5, presumably. I don't know. It sounds like kind of to coincide with GPD 4.5.

And then Ernie X1 is the reasoning variant of Ernie that Baidu says is on par with DeepSeek R1, but at half the price. And both of these models are multimodal. They can process videos, images, and audio as well. They also say Ernie 4.5 is kind of emotionally intelligent. They can understand memes and satire, which is interesting.

So I think we don't have a great sense of the tool landscape in China, is my impression. I really wish I would know. If you are a user of a chatbot,

We got to chat to BT or Claude to give our queries. I think it seems likely that Ernie is sort of filling that role. And the fact that there's new models and the fact that they're really competitive price-wise is a big deal. The number one downloaded app in China just switched to a new AI chatbot that is not...

Deep seek. So things are definitely moving. The big advantage here with this launch seems to be cost. At least that's what they're leaning into with a lot of the discussion around this. So the goal that Baidu has, which Baidu, of course, is China's roughly China's Google, right? They own search there.

Their goal is to progressively integrate Ernie 4.5 and their X1 reasoning model into all their product ecosystem, including Baidu Search, which is sort of interesting. So we'll see a rollout of the generative AI capabilities in that context. Yeah, so ultimately it does come down to price, a lot of it. So...

For context, there's a really handy table in one of the articles that looked at this comparing GPT 4.5 per token cost to DeepSeek v3 to Bernie, sorry, Bernie to Ernie 4.5. It's quite interesting, right? So input cost for input tokens, 75 bucks for a million tokens. This is for GPT 4.5.

DeepSeek v3, that drops to basically $0.30. Ernie 4.5 is about $0.60 or so per 1 million tokens. So, you know, you're talking orders of magnitude less. Also the case that these models are less performant. So that's sort of the trade-off there. But where things... Yeah, I think just to give a bit of a perspective...

DeepSeqv3 is more comparable to something like GPT-4.0 in openized data models or FreeMini, for instance, where aggression isn't that crazy. It's maybe, I forget, $1-ish per million tokens. So they're comparable. GPT-4.5 is just crazy, crazy pricing compared to everything else.

And that's the thing, right? It's the way to think about 4.5. I think we touched on this a couple episodes ago, but it's a base model, but it's not a base model for let's say mass production, right? These are high, high quality tokens, probably best used to create things like synthetic data sets, or to answer very specific kinds of questions. But you're not looking at this as something that needs to be that you want to productize just because you're right. I mean, it's two orders of magnitude more expensive than other base models.

where you actually see the lift here, especially for Ernie X1, right? This is the reasoning model, is on the reasoning side, right? So OpenAI's O1 is roughly 50 times more expensive than Ernie X1. Ernie X1 is about half the cost of R1 for input tokens. And actually, that's also true for output tokens. So it's quite significant, especially again, relative to O1, and shows you...

One of two things, either Chinese engineering is actually really, really, really that good, or there's some state subsidy thing going on in the background, and

I think the latter is somewhat less plausible at this point, though I wouldn't rule it out. Certainly, there's some amazing engineering making these margins possible. And that's a pretty remarkable thing here, right? I mean, the cost just collapsing for reasoning. This implies that there's some reasoning specific engineering going on in the background. And you should expect that to apply to training as well as inference going forward. Yeah, and it's kind of

Funny in a way, there is a parallel here between Baidu and Google, where Google likewise has quite comparative pricing, especially for Gemini to Flash thinking. So I could also see it being, you know, just a company strategy kind of thing. Baidu is gigantic. They're printing money with search. So they could also kind of eat the additional costs to undermine something like DeepSeek, which is a startup, right, to...

lock in the market. But either way, exciting news. And I guess if you're in China, I don't believe you can use ChagBt. So if nothing else, it's good that they are comparable tools for people to use and

not miss out on the fun of his advanced LLMs. I will say, I don't know that Baidu would be subsidizing at the level of at least their base model, because they are actually more expensive than DeepSeq v3 on Ernie 4.5. Where you see that flip is with the reasoning models, which itself is, yeah, that's kind of interesting, right? I mean, to me, at least, that seems to imply something about reasoning

like engineering for the like computer architecture behind reasoning or more token efficiency and therefore compute efficiency at the at the reason I should say, therefore, maybe alternatively compute efficiency at the reasoning stage. But you're right. There's all kinds of things start to muddy the waters when you start thinking about the economics of these things as they represent a larger and larger fraction of the corporate bottom line, even for big companies like I do, like Google, right?

These companies are going to be forced to show us their hand in a sense, right? They're going to have to sell these tokens immediately.

for profit, and we will eventually learn what their actual margins are. It's debatable whether we're learning that just yet. Yeah, I don't think we are. It's very much unknown. And I haven't seen any kind of strong analysis to explain it. There's, you know, yeah, it's just a mystery what kind of tricks people are pulling. But I would also kind of bet that the margins aren't great. The one thing we do know, DeepSeek,

claimed at least that they were making a profit and had a positive margin on their models. And I could see that not being the case for, you know, for instance, OpenAI where their revenue is in the billions, but the real question is, are they actually making a profit? Last thought on this too, on the economic side, like when we think about what it means for DeepSeek to claim that they're generating positive returns,

I think there's an important question here about whether that's operating expenses or CapEx factored in, right? So we saw in their paper that they famously talked about how they trained V3 on $6 million of compute infrastructure. Now, or sorry, on a $6 million compute budget. That was, it seems in retrospect, the actual operating expenses of running that compute, not the capital expenses associated with

the tens of millions of dollars as it would have been of compute hardware. So it's always hard to know, like, what do you amortize? How do you factor in what's apples to apples? Yeah, it's hard to say like DeepSeek is profitable, but on a per token basis, just for inference, I believe the claim is they're making money, which on an all back spaces. Yeah. Interesting. Yeah.

Moving right along, next we have OpenAI and they are releasing some new audio models. So there are two new speech-to-text models, GP4O Transcribe and GP4O Mini Transcribe, which are basically replacing their Whisper models. OpenAI has already had this as a service for quite a while.

The exciting new thing here is the text-to-speech model, GP4O-Mini-TTS, which is more along the lines of 11 Labs, where you can produce very natural results

human sounding speech. And along with announcement of the models, OpenAI has also launched a new site, openai.fm, which is a demo site where you can go and mess around and kind of hear the outputs. And this is kind of a fun trend, I got to say, where these companies increasingly are launching these little fun toys to get a sense for what these models are capable of.

One last thing, again, we probably should comment on pricing. The pricing is very competitive. The transcription for GPT-4.0, it's 0.6 cents per minute. 0.6 cents, so like $0.006, I guess. And GPT-4.0 Mini TTS is 1.5 cents per minute, which is much lower than a competitor like 11 Labs, for instance. So...

Yeah, I think it's interesting to see OpenAI expanding their model suite to these new domains where they're sort of less focused. We've seen them kind of move away from text-to-image, for instance, DALI hasn't had an update in forever. And so I guess this makes a lot of sense that they'd have very competitive things to offer given their investment in the advanced voice mode and chat GPT.

It's sort of reminiscent of the problem that Meta faces, right, where they're, you know, they reach like whatever, 3 billion people around the world. At a certain point when your market penetration is so deep, one of the only things you can do to keep growing is to grow the market. And so Meta invests, for example, in getting more people on the Internet in other countries, like in countries that.

don't have internet access typically or have less of it. And so they're literally just trying to like grow the pool of people they can tap for this. In the same way, I think there's a lens on this that's similar, right? So you're only able to interact with chat GPT through certain modalities or with open AI products for certain modalities. And by achieving greater ubiquity, by reaching into your life more and making more of the conversational tooling available to you,

That really does effectively increase their market, right? Like you don't have to be in front of a computer necessarily or in the same way or engaged in the same way to use the product. And obviously they've had other voice products before, but it's sort of part of if I'm open AI, I'm really thinking about multimodality, but

From the standpoint of increasing the number of contexts, life contexts in which I can reach you and text to image still requires you to be in front of a screen. Same as writing text on chat GPT, whereas audio is just this like, you know, greater reach for modality wise. So I think strategically it's an interesting play for them.

ethically all kinds of issues. I mean, you know, you think about the modality of audio as being one that is much more intimate to humans and an easier way to plug into your inner world. And that's, I think, something, you know, when we look at like what Rekha did to people just through text, right, the suicidal ideation, the actual suicides, the Rekha subreddit, when people had their, you know, AI boyfriends or girlfriends taken away from them, you know, that sort of thing when you when you tie in audio, I think it's going to be an interesting PR challenge, if nothing else for OpenAI. Yeah.

There is one figure, by the way, in the article, at least we're linking to here, and it's just a piece of research looking at the word error rate comparisons across leading models for different languages. As part of this kind of tooling, I find it really interesting, like Arabic and Hindi, there's a lot of struggle there. Those are some of the worst performing languages. Obviously, English, one of the better performing ones. I'd love to see an overlay of this relative to the amount of data

that was used to train the model. So you can see in relative terms, like which languages are in a sense, like harder for AI to pronounce, to kind of to speak. I think there's something anyway, linguistically just fascinating about that, if nothing else. So anyway, overall interesting launch. And I think we're going to see more and more of this, right? It's going to be more expected to have very high quality audio models.

and linking them specifically to agents, sort of Star Trek computer style. Yeah, I guess one thing worth noting on the kind of ethics side is I don't believe they're offering voice cloning technology, which is where you can really get into trouble very easily. So I think OpenAI is being a little careful these days in general to not cause controversy. That's part of why it took them forever to release Sora potentially.

And in this API, this demo, they are releasing something like a dozen voices you can use with names like Alloy, Ash, Echo, Fable, Onyx, Nova. I don't know, human. I guess we're not even trying to make them human sounding. And you can also assign them a vibe in this demo, like Cowboy, Auctioneer, Old-Timey, Serene, with a lot of this kind of steering that you can do as well. So...

Yeah, I think it's pretty exciting. And as ever, with the release of new APIs, this really enables downstream of OpenAI and these other companies for others to build exciting new applications of AI. And on to a few more quick stories. Next up, also OpenAI, they have released O1 Pro into their

developer API. So it's actually limited to developers who spent of these $5 on this and it cost 150 per million tokens for input and $600 per million tokens generated. So that's very, very high prices. Obviously that's, as we've said, GPT 4.5 was $75 for 1 million output tokens. And that's yeah, two, two orders of magnitude easily generated.

above what you would typically charge. Yeah, I'm trying to think if it's two or three orders of market. It might be approaching three orders market actually. So yeah, interesting strategy here from OpenAI. We haven't seen any other companies release these very expensive products yet. And OpenAI is increasingly doing that with ChachBee Pro, their $200 per month subscription with QB4.5, with this.

It makes me wonder if this is an attempt to become more profitable or if this is them sort of testing waters. There could be various readings, I suppose. Yeah. It's also, I mean, it's interesting to note, this is not...

an order of magnitude larger than what GPT-3's original pricing was. I was just looking it up in the background here to check because I seem to remember it being, you know, back then it was priced per 1,000 tokens. With reasoning models, you tend to see more per million tokens just because of the number of tokens generated. But it sort of reminds me in the military or the history of the military, there's often this restriction where it's like people can only carry, I forget what it is,

60 pounds or something of equipment. And so over time, you tend to see like the amount of equipment that a soldier carries doesn't tend to change or the weight of it. But of course, the kind of equipment they carry just changes to reflect technology. This sort of seems similar, right? There's like almost a Pareto frontier of pricing, at least for the people who are willing to reach for the most intelligent products, you know, and you're constantly reaching for it, though. This is a push forward, even relative to the GPT-3 frontier back in the day. So kind of interesting, yeah.

There's all kinds of feedback people have been getting. There's complaints about, oh, this model struggled with Sudoku puzzles, apparently, and optical illusions and things like that. People say, you know, at a certain point, anything you launch at a high price point, especially if you're opening AI, people will complain that it's not like super intelligence. And so... Yeah, but there's also an interesting parallel here where...

O1 Pro, just in terms of benchmarks, and I think in general, in terms of the vibe of what people think, is that it's not significantly better than O1. And that parallels GPT-4.5. It's better, but it's not a huge leap. So there is an interesting demonstration of...

Probably it's harder to get, you know, huge leaps and performance and people are going to be more critical now of if you were not offering something that's like, you know, really between GP4, 3.5 and 4, for instance. Yeah, I mean, I think it's quite use case specific too, right? So as we've seen, you know, the kinds of issues people are running into, optical illusions,

You know, Sudoku puzzles, this sort of thing are pretty far from the standard, you know, the actual workloads that opening eye is targeting, right? Their focus is, can we build something that helps us automate AI research as quickly as possible? Those sorts of benchmarks. Yeah, we are seeing needle moving there. There's also some interesting stuff that we'll talk about from meter suggesting that in fact, that is what's happening here, that on those particular kinds of tasks, we're seeing pretty significant changes.

acceleration with scale. But you're right, right? It's this funny, uneven surface, just like how humans are funny and uneven, right? Like you have a really talented artist who can't write a line of code to save their lives, right? And vice versa. So another instance of the paradox of what's hard for AI isn't necessarily hard for humans.

And moving away from OpenAI to Google, we now have another feature, another instance of Canvas, this time on Gemini. And they're also adding audio overview. So

I don't know why they do this, why these LLMs just copy each other's names. You had deep research showing up in multiple variants. Now we have Canvas, which is also on ChatGPT. And I think on Anthropic, it's called Artifacts. Basically the same idea, where now as you're working on something like code, for instance, or like a web app, for instance, you can have a site panel showing this living document

rendering of it with a chatbot to the left. So you can essentially interactively work and see a preview of what you're getting. And you also have audio overviews, which is pretty much something like notebook LM, like you can upload documents and get this podcast style conversation going on. So nothing sort of conceptually new going on here, but

I think an interesting convergence across the board of all of these tools. Everyone has Canvas. Everyone has deep research. Everyone seems to have kind of the same approach to implementing LLM interfaces.

Speaking of that, in fact, the next story is about Anthropic and them adding web search capabilities to Cloud. So that is now in preview for paid users in the US. And that will basically work the same as it does in Chajbeki and other models. You can enable it to work with Cloud 3.7 and then it will be able to provide direct citations from web-sourced information.

So, yeah, not much else to say. We're getting web search for cloud, which will enable it to be more useful. It's interesting because it's like the tee up to this is Anthropic being a little bit more shy than other companies to roll up the web search product into their

into their agents. And I mean, this is consistent with the threat models that they take seriously, right? Things like loss of control, right, which typically involve, you know, an AI model going out onto the internet, maybe replicating its weight somehow, and internet access is kind of central to a lot of these things. I don't know if that was part of this, it at least is consistent with it. So the result is that there may be a little bit later the party than others, but

Apparently, according to these initial tests, you don't always see web search used for current events related questions. But when that happens, you do get these nice inline citations pulled from sources. It does look at social media. And then, of course, new sources like NPR, like Reuters, they cite in the examples they show. So, you know, pretty standard product in the inline citation approach that you see with deep research, for example, certainly making an appearance here.

And last up, again, along the lines of these stories, we have XAI launching a new API, this one for generating images. So they have a new model called Grok2Image 12.12, and you can now query it. For now, it's quite limited. You can only generate 10 images per request, and you are limited to five requests per second. The cost there is seven cents per image, which is a

slightly above what, for instance, Black Forest Labs charges. They are the developers of Flux and competitive with another offering from Ideagram. So I think, yeah, interesting to see XAI expanding their APIs once again. They released their own image generation back in December, and it's kind of looked competitive with something like Google's latest

generation of where we focus has really shifted towards careful instruction following in your image generation. So yeah, XAI is as ever trying to catch up or moving quite rapidly to expand their offerings.

Yeah, they really are. And I think when we first covered Black Forest Labs partnership with XAI, one of the first things that we said was like, hey, this is, you know, because I think they raised a big round right on the back of the incredible distribution that they were going to get through XAI and the kind of vote of confidence that reflected from Elon. But at the time we were talking about, hey, you know, this is a pretty strategically dicey position for Black Forest Labs because we're

The one thing we've consistently seen from all the AI companies is once they start getting you in for chat, eventually they start rolling out multimodal features. And it's not clear that those aren't best built in house for any number of reasons, not just including the fact that you want to kind of internalize all the revenues you can from the whole stack. But also, once you have a good reasoning model or rather a good foundation model, you're

That foundation model can be mined for multimodality post hoc, and you just kind of get to amortize your investment across more modalities. And so it's just this natural move to kind of keep crawling into

You're creeping into adjacent markets like image generation, video generation, which is also something that XAI is looking at. So yeah, I mean, kind of interesting for Black Forest Labs, this probably is going to be a big challenge for them. I don't know how extensive their partnership continues to be at this point, but it's a dicey time to be one of these companies.

And on to applications and business, we begin with some announcements from NVIDIA. There's a preview of their plans in 2026 and 2037. They have the Rubin family of GPUs coming in 2026 and then Rubin Ultra in 2027. So that will...

come along with a new, I guess, server layout with availability to combine 576 GPUs per rack, which, you know, I guess it's very much following the tracks of very, very crazy enhancement to computing that NVIDIA has been able to continue to

creating with B200, I believe it is now, and now this is their plans for the next couple of years. Yeah, there's a lot going on with this update. It's actually pretty interesting and quite significant, especially on the data center side in terms of the infrastructure that will be required to accommodate these new chips. A couple of things here, right? So there is this configuration of the Blackwell called the NVL72 is the sort of name of this configuration called.

This is where you have, so, okay, imagine a tray that you're going to slot into a rack, a server rack, right? So on that tray, you're going to have four GPUs.

All right, so each tray contains four GPUs. And in total, in that whole rack, you're going to have 72... I'm sorry, you're actually going to have 144 GPUs total. But because two of those GPUs show up on the same motherboard... God. So each friggin' tray that you slot into the rack has two motherboards on it. Each of those motherboards has two GPUs, two B200 GPUs. So in total, you're putting in four GPUs per tray, right?

But they're kind of divided into two motherboards, each with two GPUs. Anyway, this led to the thing being called the NVL-72, when in reality, there's 144 GPUs on there. At least Jensen Huang says it would have been more appropriate to call it the NVL-144L. Okay. Okay.

What's actually interesting in this setup, they're calling the Rubin NVL 144 rack. There's not more GPUs there. It's not that there's twice as many GPUs as the NVL 72 with the Blackwells. It's just that they're counting them differently now. So they're saying, actually, we're going to count all the GPUs. So if I think back in the day, we did talk about the NVL 72 setup.

This is basically just the same number of GPUs. Nothing has changed, even though the number has changed. If that didn't make any sense, just delete it from your mind. Let's focus on the things that are actually interesting. The story is it's comparable in the number of GPUs to the current set of top-line GPUs. So they're kind of pitching it as you can slot it into your existing infrastructure, more or less. And...

And just to jump into numbers a little bit, you're getting roughly three times the inference and training performance in terms of just raw compute. Memory is faster.

by close to two ish, or multiplier of two, kind of like Yeah, you're seeing multipliers on top of the current one. So quite significant change in performance if you do upgrade. Yeah, so so when it comes to Rubin, right, which is the sort of next generation coming online, at FP four, you're seeing a three x more flops, right, three times more, more logic capacity,

Now, on the memory side, things actually do get somewhat interesting. Memory capacity is going to be 288 gigabytes per GPU, right? That is the same as the B300. So no actual change in terms of the per GPU memory capacity. We'll get back to why that matters a bit less in a second, but that's kind of part of the idea. The memory bandwidth is improving. It's almost doubling, or maybe, yeah, it's short of doubling.

So the memory bandwidth is really, really key, especially when you look at inference. So that's one of the reasons why this is really being focused on. But there's also a bunch of things like the so the cables that connect GPUs together on roughly speaking on one rack, if you want to imagine it that way, those are called NV link cables, super, super high bandwidth. Those are doubling in throughput. So that's a really big advance. There's also stuff happening on the networking side, but we don't need to touch that.

Bottom line is NVLink cables used to be the way you connected GPUs across different trays in the same rack and maybe adjacent racks depending on the configuration. But it's very local, very, very tight, very high bandwidth communication. What's happening here is each of these motherboards that you're slotting into your rack,

They have a CPU and two GPUs. And we talked about this in the hardware episode as to why that is. The CPUs like the orchestra conductor, the GPUs are like the instruments that are actually doing the hard work and the heavy lifting.

Typically, the CPU would be connected to the GPUs through a PCIe connection. So this is a relatively low bandwidth compared to NVLink. Now they're moving over to NVLink as well for the CPU to GPU connection. That's actually a really, really big deal. It comes with a core-to-core interface. So right now, the GPUs and CPUs are going to share a common memory space.

So essentially directly accessing each other's memory, whatever is in memory on the CPU, the GPU can access right away and vice versa. That's a really, really big change. It used to not be the case. You used to have independent CPU and GPU memory.

The GPUs themselves would share a common memory space if they were connected via NVLink. And in fact, that's kind of that's part of the idea here. That's what makes them a coherent wad of compute. And it's also part of the reason why the memory capacity on those GPUs matters a bit less because you're adding, you're kind of combining all your GPUs together and they have a shared memory space. So if you can just add memory,

to the number of GPUs you have, you're effectively adding to your memory capacity. So that's kind of an important difference there. So anyway, lastly, I'll mention, they say that apparently Rubin Ultra is going to come out. This is, so there's going to be Rubin and then Rubin Ultra. Rubin Ultra is coming out the second half of 2027.

It'll come with a Rubin GPU and a Vera CPU, like NVIDIA tends to do, right? They name the first name, the CPU, so it's Vera Rubin. And so Vera is the CPU, Rubin is the GPU. Apparently, the full rack is going to be replaced by this 576 GPU setup, a massive number. That is essentially, so they don't specify the power consumption, but it's clear from other kind of industry data

products that are coming out, we're tracking for one megawatt per rack. And just worth emphasizing, that's a thousand kilowatts. That is a thousand homes worth of power going to a single rack in a server, in a data center. That's insane, right? So the power density required for this is going through the roof, the cooling requirements, all this stuff. It's all really cool. And anyway, this is a very, very big motion.

Just to dive a little bit into the numbers, just for fun, right? So the compute numbers are in terms of flops, which is floating point operations per second, basically multiplications per second or additions per second. And the numbers we get with these announced upcoming things like Vera Rubin are now, you

For inference, 3.6 exaflops of inference. So exa is quintillion. It's 10 to the 18. Quintillion is the one after quintillion. So I can't even imagine how many zeros. I mean, I guess I know how many zeros it is, but it's very hard to imagine a number that long. And that's just where we are at.

Also worth mentioning, so this is the plans for 2026, 2027. They also did announce for later this year, the coming of B300, which is also, you know, is an improvement in size.

performance of about 1.5. They also did announce the ultra variants of Blackwell, both 200 and 300. And the emphasis they're starting to add, I think, is more on the inference side. They definitely are saying that these are good models for

the age of reasoning. So they're capable of outputting things fast in addition to training well. And that's very important for reasoning, of course, because the whole idea is you're using up more tokens to get better performance. So they're giving some numbers like, for instance, Blackwell Ultra will be able to deliver up to 1000 tokens per second on DeepSeek R1. And that's

you know, comparable, usually you would be seeing something like 100, 200 tokens per second, a thousand tokens per second is very fast. Yeah. And then the inference focus is reflected too, right?

In the fact that they're looking at, you know, FP4 flops denominated performance, right? So when you go to inference, often you're inferencing quantized models, inferencing at FP4, so lower resolution. And also the memory bandwidth side becomes really important for inference disproportionately relative to training, at least on the current paradigm. So that's kind of, you know, part of the reason that you're seeing those big

Big lifts at that end of things is because of the inference. And next story is also about some absurd sounding numbers of hardware. This one is from Apple. They have launched a Mac Studio offering and the top line configuration where you can use the M3 Ultra chip with 32 CPUs.

80 core GPU that can even run the DeepSeek R1 model. That's the 671 billion parameter AI model. If you're at inference, you're using about 37 billion per output, I believe. But still, this is hundreds of gigabytes of memory necessary to be able to run it.

and just fit it in there. Yeah, Apple's also doing this weird thing where they're not designing like GPUs for their data centers, including for AI workloads. They seem to be basically like

doing souped up CPUs kind of like this with just like gargantuan amounts of VRAM that again have this very large kind of shared pool of memory, right? We talked about like coherent memory on the Blackwell side, right? And on the Rubin side, just the idea that if you have a shared memory space, you can pool these things together. Well, they're not as good at the shared memory space between CPUs. What they do is they have disgusting amounts of RAM one GPU, right? So like 512 gigs is...

It's just wild. Anyway, for a CPU at least. And we're talking here about when you say memory, we mean really something like RAM, right? And so if you have a laptop, right, if you buy a Mac, for instance, typically you're getting 8 gigabytes, maybe 16 gigabytes of RAM, the fast type of memory, read something memory.

Random access memory. Random access memory, right? As opposed to the slower memory of, let's say, an SSD or things where you can easily get terabytes. To get that crazy amount of random access memory is insane when you consider that typically it's like 8, 16 gigabytes and this is expensive memory. It's stupid expensive. It's also like,

Yeah, there's different kinds of RAM. And we talked about that in our hardware episode. This is a combined CPU-GPU setup, by the way. So 32-core CPU, 80-core GPU, but shared memory across the board. So VRAM is like really close to the logic, right? So this is like the most, as you said, exquisitely expensive kind of memory you can put on these things. They're opting to go in this direction for very interesting reasons, I guess. I mean, it does mean that they're...

disadvantaged in terms of being able to scale their data center infrastructure, their builds, because of networking, at least as far as I can tell. It's a very interesting standalone machine. I mean, this is pretty wild specs. Right. Yeah, exactly. If you go to the top line offerings and this is a physical product you can buy as a... Yeah, it's a Mac, right? Yeah. It's a Mac. Yeah, it's a Mac. It's like a big...

kind of cube-ish thing. And if you go to the top-ranked version, it's something like $10,000. Don't quote me on that, but it's crazy expensive as well. It does come with other options. For whatever reason, M4 Max CPU and GPU is less powerful than M3 Ultra. But anyway, very kind of beefy offering now from Apple.

Next, we have something a bit more forward-looking. Intel is apparently reaching an exciting milestone for the 18A 1.8 nanometer class wafers with a first run at the Arizona FAB. So this is apparently ahead of schedule. They have these Arizona FABs, FAB-52 and FAB-62, and

Fab, as we've covered before, is where you try to make your chips. And 1.8 nanometer is the next kind of frontier in terms of scaling down the resolution, the density of logic you can get on a chip. So the fact that they're running these test wafers, they're ensuring that you can transfer the fabrication process to these new Arizona facilities. I guess the big deal there is partially that these are located within the U.S., within Arizona, but

And they are seemingly getting some success and are ahead of schedule, as you said. And that's impressive because fabs are an absurdly complex engineering project. Yeah, Intel is in just this incredibly fragile space right now.

as has been widely reported, and we've talked about that a fair bit. I mean, they need to blow it out of the water with 18A in their future notes. I mean, this is like a make or break stuff. So Ford Progress, yeah, they had their test facility in Hillsborough, Oregon, that was doing 18A production, as you said, on a test basis. And they're now successfully getting the first test wafers in their new Arizona fab out. So that's great.

But yeah, it'll eventually have to start running actual chips for commercial products. The big kind of distinction here is they're actually manufacturing with 18A these gate all around transistors. I think we talked about this in the hardware episode.

We'll go into too much detail. This is a specific geometry of transistor that allows you to have better control over the flow of electrons through your transistor, essentially. It's a big, big challenge people have had in making transistors smaller and smaller. You get all kinds of current leakage. The current, by the way, is sort of like the thing that carries information in your computer.

And so you want to make sure that you don't have current leakage to kind of have ones become zeros or let's say operation, like, you know, a certain kind of gate turn into the wrong kind of gate. That's the idea here. So it's the skate all around transistor based on a ribbon fed transistor.

design and yeah so so we're seeing that come to market gate all around is is something that tsmc is moving towards as well and you know it's just going to be the the next essentially the next beat of production so here we have 18a kind of early signs of progress and now moving away from hardware more to businessy stuff xai has acquired a generative ai startup they acquired hotshot

which is focused on text to video, similar to Sora. They also have AI-powered... Initially, they worked on AI-powered photo tools and then pivoted. So I suppose...

unsurprising in a way that they are working on text-to-video as well. They just want to have all the capabilities at XAI, and this presumably will make that easier to do. Yeah, one of the founders had some quote, I think it might have been on X, I'm not sure, but he said, we're excited to continue scaling these efforts on the largest cluster in the world, Colossus, as part of XAI. So it seems like they'll be given access to Colossus as part of this, maybe not shocking, but kind of an interesting subnote.

They were backed by some really impressive VCs as well. So Alexis Ahanian, as we're like famous for like being the co-founder of Reddit, of course, and doing his own VC stuff and SV Angel too. So pretty interesting acquisition and a nice soft landing too for folks in a space that otherwise, you know, I mean, they're either going to acquire you or they're going to eat your lunch. So I think that's probably the best outcome for people working on the different modalities, at least on my view of the market. Yeah.

Yeah, and I guess acquisition makes sense. The startup has been around for over two years and they have already trained multiple video models, Hotshot XL and Hotshot, and they do produce quite good looking videos. So it makes some sense for OpenAI to acquire them, if only for the kind of brainpower and expertise in that space. Yeah, man, they're old. They've been around for like two years, right? Yeah.

Yeah, that was the what? Free Sora or like around the time Sora came out? Yeah, yeah, yeah. It's funny. It's just funny how the AI business cycle is so short. Like these guys have been around for all the 24 months. They're experts. They're veterans. Right.

And onto the last story, Tencent is reportedly making massive NVIDIA H20 chip purchases. So they are supposedly meant to support the integration of DeepSeek into WeChat, which kind of reminds me of Meta, where Meta has this

somewhat interesting drive to let you use Lama everywhere and Instagram and all their messaging tools. So this would seem to be in a way similar where Tencent would allow you to use DeepSeek within WeChat. Yeah, part of what's going on here too is the standard stockpiling that you see China do and Chinese companies do ahead of the anticipated crackdown from the United States on export controls. And in this case, the H20 has been kind of

identified as one of those chips that's likely to be shut down for the Chinese market in the relatively near term. So it makes all the sense in the world that they would be stockpiling for that purpose. But it is also the case that you got R1 that has increased dramatically the demand for access to hardware.

It's sort of funny how quickly we pivoted from, oh no, R1 came out and so NVIDIA stock crashes to, oh, actually R1 is great news for NVIDIA. Anyway, I think it's the turnaround that we sort of expected. We talked about this earlier and there has been apparently a short-term supply shortage in China recently.

Regarding these H20 chips. So like, there's so much demand coming in from Tencent that it's this real like rate limiting for NVIDIA to get H20s into the market there. So kind of interesting. They've previously placed orders on the order of hundreds of thousands between them and ByteDance. Back, I think last year was almost a quarter million of these these GPUs. So yeah, pretty big customers.

And onto projects and open source, we begin with a story from the information titled Anthropix not so secret weapon that's giving agents a boost. I would say kind of a weird spin on this whole story, but anyway, that's the one we're linking to. And it covers the notion of MCP model context protocol, which Anthropix released all the way back in November. We

hopefully covered it. I guess we don't know. I think we did. I was trying to remember. Yeah. Yeah, I think we did. And the reason we're covering it now is that it sort of blew up over the last couple of weeks. If you're in the AI developer space or you see people hacking on AI, that it has been the talk of the town, so to speak. So Model Context Protocol, broadly speaking, is something like

an API, like a standardized way to build ports or

mechanisms for AI agents or AI, I guess, models to call on services. So it standardizes the way you can provide things like tools. So there's already many, many integrations following the standard for things like Slack, Perplexity, Notion, et cetera, where if you adopt a protocol and you provide an FCP-compatible protocol

kind of opening, you can then have an MCP client, which is your AI model, call upon this service. And it's very much like an API for a website where you can have a particular client

URL to go to, particular kind of parameters, and you get something back in some format. Here, the difference is that, of course, this is more specialized for AI models in particular. So it provides tools, it provides like a prompt to explain the situation, things like that. Personally, I'm in the camp of people who are a bit confused and kind of think that this is an API for an API kind of situation. But either way, it has gotten very popular.

That is exactly what it is. Yeah. It's an API for an API. It's also, I guess, a transition point or could be viewed that way, you know, in the sense that eventually you would expect models to just kind of like figure it out, you know, and have enough context and ability to uncover things.

whatever information is already on the website to be able to use tools appropriately. But there are edge cases where you expect this to be worthwhile. Still, this is going to reduce things like hallucination of tools and all kinds of issues that when you talk about agents, right, like one failure anywhere in a reasoning chain or in an execution chain

it can cause you to fumble. And so, you know, this is structurally a way to address that and quite important in that sense. It is also distinct from a lot of the tooling that opening eyes come out with, but that sounds similar, like the agent's API, where they're focused more on chaining tool uses together. Whereas MCP, as you said, is more about helping make sure that each individual instance of tool use goes well, right? That the agent has what it needs to kind of ping the tool properly, interact with it and find the right tool.

rather than necessarily chaining them together. So there you go. You know, MCP is a nice kind of clean open source play for Anthropic too. They are going after that kind of more startup founder and business ecosystem. So pretty important from a marketing standpoint for them too. Right. Yeah, exactly. So back in November...

They announced this, they introduced this as an open standard, and they also released open source repositories with some example of model context protocol servers.

and as well as the specification and like a development toolkit. So I honestly haven't been able to track exactly how this blew up. I believe there was some sort of tutorial given at some sort of convention, like the AI engineer convention or something.

And then it kind of took off and everyone is very excited about the idea of model context protocols right now. Moving on to new models, we have Mistral dropping a new open source model that is comparable to GP4O Mini and is smaller. So they have Mistral Small 3.1, which is seemingly better than similar models, but only has 24 billion parameters.

also can take on more input tokens, 128,000 tokens, and is fairly speedy at 150 tokens per second. And this is being released under the Apache 2 license, meaning that you can use it for whatever you want, business applications, etc. I don't think there's too much to say here other than like a kind of a nitpick here, but they say it outperforms

comparable models like Gemini 3, GPT-40 mini, while delivering inference speeds, as you said, of 150 tokens per second. But like, you can't just say that shit. Like it doesn't mean anything to say. Yeah, it depends on what infrastructure you're using. Yeah. What's the stack, dude? Like, you know, like I can move at 100, like at 100 miles an hour if I'm in a Tesla. That's what makes me... Anyway,

They do give that information, but it's like buried like in the little like gray. This is from their blog post where we get these numbers. So I guess as with any of these model announcements, you go to a company blog, you get a bunch of numbers on benchmarks showing that it's the best. You have comparisons to Gemma 3 from Google, to Cohere, Aya, GPT-40 Mini, Cloud 3.5 Haiku, and

On all of these things like MMLU, human eval, math, it typically is better. Although, you know, I would say it doesn't seem to be that much better than Gemma at least. And in many cases is not better than 3.5 Haiku and GPT-4 Mini. But yeah, and still quite good. The 150 tokens per second too for context is a batch size 16 on four H100s.

They actually, like, even in the technical post, they write while delivering inference speeds of 150 tokens per second without further qualifying. But it's in, like, it's in this, like, small gray text underneath an image that you have to look for where you actually find that context. So don't expect this to run at 150 tokens per second on your laptop, right? That's just not going to happen because, you know, for H100, that's quite a lot of horsepower. Still, yeah, it's an incremental improvement. More open source coming from Mistral.

and Apache 2.0 license. So, you know, highly permissive. And one more model, we have ExaOn Deep Reasoning Enhanced Language Models coming from LG AI Research. So these are new models, new family of models, 2.4 billion, 7.8 billion, and 32 billion parameters. These are optimized for reasoning tasks and seemingly are optimized

on par or outperforming variations of R1, where R1 is the giant one at 671 billion. There's distilled versions of those models at comparable sizes and

In the short technical report that they provide, they are showing that it seems to be kind of along the lines of what you can get with those distilled R1 models and similar or better than OpenAI R1 Mini.

Yeah, it's also, it's kind of interesting. It seems to, I described, again, not a lot of detail in the paper, so it makes it hard to reconstruct, but it does seem to be at odds with some of the things that learn in the DeepSeek R1 paper, for example. So there's, they start with, it seems, an instruction-tuned model.

base model, the XO1 3.5 instruct models. And then they add on to that a bunch of fine tuning. They do supervised fine tuning. Presumably this is for like the reasoning structure and then DPO standards or like RL stuff and online RL. So, you know, this is quite a bit of supervised fine tuning of trying to teach the model how to solve problems in the way you want it to solve them rather than just like

giving it a reinforcement learning signal and reward signal and like kind of have at it like R1.0 did. So yeah, kind of an interesting alternative, more, let's say,

more inductive prior-laden approach. And first time as well that I think we've covered anything from LG AI research. I think so, yeah. Exxon, these models appear to have already been existing and being released. Exxon 3.5 were back from December, which we somehow missed at the time. Yeah. Fun fact, Exxon stands for expert AI for everyone. You gotta love when people come up with these acronyms in a very creative way.

And they are open sourcing it on Hugging Face with some restrictions. This is primarily for research usage. On to research and advancements. We begin with a paper, Samples, Scrutinize, and Scale. Effective Inference Time Search by Scaling Verification. It is coming from Google. And let me just double check. Also, UC Berkeley. And it is quite exciting, I think, as a paper. It basically is...

Making the case or presenting the idea that

to do scaling, inference time scaling, with the idea of inference time scaling being basically, you know, once you already train your model, once you stop updating your weights, can you use your model more to get smarter if you just kind of do more outputs in some way? And what you've seen in recent months is inference time scaling via reasoning, where you have the model, you know, output a long chain of tokens where it does various things

kinds of strategies to do better at a given complicated task like planning sub steps like verification backtracking these things we've already covered well this paper is saying instead of that sort of scaling the output in terms of a chain an extended chain of tokens you can

Instead, sample many potential outputs, like just do a bunch of outputs from scratch and kind of vary it up so you get many possible solutions. And then if you have a verifier and you can kind of compare and combine these different outputs,

you can be as effective or even in some cases more effective than the kind of traditional reasoning, traditional inference time scaling paradigm. So again, yeah, quite interesting in the paper, they have a table one where they are giving this example of if you sample a bunch of outcomes and you have a verifier that is good, you can actually outperform O1 preview. And many other techniques, you can

You can get better numbers on the hard reasoning benchmarks like Amy, where you can solve eight out of the 15 problems now, which is insane. But that's where we are. And on math and live bench math and live bench reasoning. So yeah, very interesting idea to sample a bunch of solutions and then just compare them and combine them into one final output.

Yeah, I think this is one of those key papers that, again, I mean, we see this happen over and over. Scaling is often a bit more complex than people assume, right? So at first, famously, we had pre-training scaling, scaling pre-training compute that brought us from GPT-2, GPT-3,

To GPT-4, now we're in the inference time compute paradigm where, you know, this analogy that I like to use, like you have 30 hours to dedicate to doing well on a test. You get to choose how much of that time you dedicate to studying, how much you dedicate to actually spending writing the test.

And what we've been learning is, you know, scaling pre-training compute is basically all study time. And so if you just do that and you effectively give yourself like one second to write the test, well, there's only so well you can do, right? Eventually you start to saturate if you just keep growing pre-training and you don't grow inference time compute. You don't invest more time at test time.

So essentially, you have two dimensional scaling. If you want to get the sort of indefinite returns, if you want the curves to just keep going up and not saturate, you have to scale two things at the same time. This is a case like that, right? This is a case of a scaling law that would be hidden to a naive observer just looking at this as a kind of a one variable problem.

When in reality, it's a multivariate problem. And suddenly when you account for that, you go, oh, wow, there's a pretty robust scaling trend here. And so what are these two variables? Well, the first is a

scaling the number of sampled responses. So just the number of shots on goal, the number of attempts that your model is going to make at solving a given problem, but you have to improve verification capabilities at the same time, right? So they're asking the question, what test time scaling trends come up as you scale both the number of sampled responses and your verification capabilities? One of the things crucially that they find though, is the

You might naively think if you have a verifier, right? And I want to emphasize, this is not a verifier that has access to ground truth, right? This is like, and I think they use like Claude 3.7 for this, 3.7 Sonnet for this. But basically, this is a model that's going to look at

say, the 20 different possible samples, in other words, that you get from your model to try to solve a problem. And this verifier model is going to contrast them and just determine based on what it knows, not based on access to actual ground truth or any kind of symbolic system like a calculator. It's just going to use its own knowledge in the moment to determine which of those 20 different possible answers is the right one.

And what you find is as you scale the number of possible answers that you have your verifier look at, you might naively think, well, with just so much gunk in the system,

probably the verifier's performance is going to start to drop over time, right? It's just like, it's so much harder for it to kind of, you know, remember which of these was that good and which was bad and all that stuff. So eventually you would expect the kind of trend to be that, you know, your performance would saturate, maybe even drop. But what they find is the opposite. The performance actually keeps improving and improving. And the reason for that seems to be

that as you increase the number of samples, the number of attempts to solve a problem, the probability that you get a truly exquisite answer that is so much better than the others, like the contrast so much with the median other answer,

that it's really easy to pick out increases. And that actually makes the verifier's job easier. And so they refer to this as an instance of what they call implicit. I think it was implicit scaling was the term. Yeah. Implicit scaling. Right. So essentially the, yeah, this idea that you're more likely to get one like exquisite outlier that favorably contrasts with the crappy median samples.

And so in this sense, I mean, I feel like the term verifier is maybe not the best one. Really what they have here is a contrastor. When I hear verifier, I tend to think of ground truth. I tend to think of something that is actually kind of like, you know, giving us, you know, checking, let's say code and seeing if it compiles properly. Yeah, it's more akin to something like a selector or a compiler, you could say, where it takes a bunch of possible outputs and then from all of that,

picks out the best kind of guess at the answer. Exactly. Yeah. And this is why, I don't know if it's not a term that people use, but if it was like contrast or is it really what you're doing here, right? You're like, you're sort of doing in a way, a kind of contrast of learning, right?

Well, not learning necessarily, it's all happening at inference time. But yeah, so the performance is really impressive. There are implications of this for the design of these systems as well. So they're, you're trying to find ways to wrangle problems into a shape where you can take advantage of implicit scaling, where you can have your model pump out a bunch of responses in the hopes that you're going to get, you know, one crazy outlier that makes it easier for the verifier to do its job.

So, yeah, again, you know, I think really interesting case of multidimensional scaling laws, essentially.

that are otherwise easily missed if you don't invest in both verifier performance and sampling at the same time. Exactly. And this is, I think, important context to provide. The idea of sampling many answers and just kind of picking out the answer that occurred the most times in all these samples is a well-known idea. There's also, I mean,

Self-consistency. Self-consistency, exactly. Like a majority vote essentially is one well-established technique to get better performance. And there's

you know, even in more complex things you could do also with like a mixture of, I forget the term that exists, but the general idea of parallelized generation of outputs and generating multiple outputs, potentially multiple models is well known. And then the real insight, as you said here, is that you need a strong verifier to be able to really leverage it. So

In this table one, they show that if you compare, like you do get better performance quite a bit if you just do consistency. If you just sample 200 responses and pick out the majority compared to the thing where you don't do any scaling, you're getting four problems out of 15 on Amy as opposed to one, you're getting a significant jump in performance. But after 200, if you go to 1000, you basically stop getting better.

But if you have a strong verifier, basically a strong selector among the things, instead of majority voting, you have some sort of intelligent way to combine the answers, you get a huge jump, a huge difference between just consistency at 200 and verification at 200. And one reason this is very important is first, they also highlight the fact that verification is a bit understudied. I think in the space of LLMs, LLMs are not usually out of the box, very good at it, and they

even introduced a benchmark specifically for verification. The other reason this is very notable is that if you're doing sampling-based techniques, you can parallelize the sampling. And that is different from extending your reasoning or your search because

Reasoning via more tokens is sequential, right? You can't parallelize that. It's going to take more time versus if you're scaling via sampling, you can parallelize all the samples and then just combine them, which means that you can get very strong reasoning at comparable timescales to parallelizing.

If you just take one output, for instance. So that's a very big deal. Yeah. Another key reason why this works too, we covered a paper, I think it was months ago, that was pointing out that if you look at all the alignment techniques that people use to get more value out of their models, right? They kind of assume this one query, one output picture. Like let's align within that context. Whereas in reality, what you're often doing, especially with agents and inference time compute is,

you actually are sampling like a large number of outputs and you don't care about how shitty the average generated sample is, generated solutions. What you care about is, is there one exquisite one in this batch? And what they did in that alignment paper is they found a way to kind of like upweight just the most successful outcomes and use that for a reward signal or some kind of gradient update signal. But this is sort of philosophically aligned with that, right? It's saying that

Like you said, self-consistency is the view that says, well, let's just do wisdom in numbers and say we generated, I don't know, like 100 of these outputs. And the most common view

response with most consistent answer was this one. So let's call this the one that we're, we're going to cite as our output. But of course, if your model has a consistent failure mode, and that can also cause you to through self consistency, identify that, you know, kind of settle on that failure mode, whereas what you really care about is what is the best answer in this in this big pot. And that's really what this is, this is after. So a lot of interesting times to other lines of research, as you said, and I think a really interesting and important paper.

And next up, we have the paper block diffusion, interpolating between outer aggressive and diffusion language models. And also, I think quite an interesting one. So typically, when you're using an LLM, you're using outer aggression, which is just saying that you're computing one talk at a time, right? You start with one word, then you select the next word, then you start the next word, you have this iterative process. And that's

a limitation of traditional LLMs because you need to sequentially do this one step at a time. You can't generate an entire output sequence all at once. As opposed to diffusion, which is a generation mechanism, a way to do generation that is kind of giving you an entire answer at once. And diffusion is the thing that's typically used for image generation, where you start

With just a noisy image, a bunch of noise, you iteratively update the entire image all at once until you get to a good solution. And we covered, I believe, maybe a week ago or two weeks ago, a story of a diffusion-based LLM that was...

seemingly performed pretty well. There was a company that made the claim, although we didn't provide too much research on it. Well, this paper is talking about, well, how can we combine the strengths of both approaches? The weakness of diffusion is typically just doesn't work as well for LLMs. And there's kind of various hypotheses. It's an interesting question of why it doesn't work. But

It also doesn't work for arbitrary length. You can only generate a specific kind of horizon and some other technical limitations. So the basic proposal in the paper is, well, you can have this idea of block diffusion where you still sample autoregressively like

sample one step at a time but instead of sampling just one word or one token you use diffusion to generate a chunk of stuff so you generate several tokens all at once of diffusion in parallel and then you autoregressively keep doing that and you get the you know the best of both worlds so to speak so an interesting idea kind of architecture i haven't seen before and

potentially could lead to stronger or faster models. Yeah, it's also more parallelizable, right? So the big advantage, because within these blocks, you're able to just like denoise in one shot with all the text in there. It means you can parallelize more. They try, I think, with blocks of various sizes, like sizes of four tokens, for example, right? So like let's denoise four tokens at a time. They do it with this interesting technique

Kind of like, well, it's essentially they put masks on and kind of gradually remove the masks on the tokens as they denoise. That's their interpretation of how denoising would work. It is interesting. The performance is lower, obviously, than state of the art for autoregressive models. Think of this more as a proof of principle that there are favorable, favorable scaling characteristics. There's some promise here.

In my mind, this fits into the kind of some of the Mamba stuff where, you know, the next logical question is, okay, like how much scale will this work at? And then do we see those loss curves eventually with some updates, some hardware jiggery-pokery converging and then crossing the loss curves that we see for traditional autoregressive modeling? I mean, it is interesting either way. They found a great way to kind of break up this problem. And one reason, speculatively speaking,

that diffusion does not work as well with text is partly that like you tend to think as you write. And so your previous words really will affect in a causal way where you're going and trying to do diffusion at the, you know, in parallel across a body of text like that from an inductive prior standpoint doesn't quite match that intuition.

but could very well be wrong. It's sort of like top of mind thought. But anyway, it's a good paper. The question is always, will it scale, right? And there are lots of good proofs of principle out there for all kinds of things. Whether they end up getting reflected in scaled training runs is the big question. Exactly. And this does require quite different training and models from what all these other LLMs are. So decent chance this won't have a huge impact just because

because you have to change up and all these trained models already are LLMs in the aggressive sense. Diffusion is the whole new kind of piece of a puzzle that isn't typically worked on, but nevertheless interesting, as you said, similar to Malmba.

Next, we have communication efficient language model training scales reliably and robustly scaling laws for DLOCO, distributed law computation training. And I think I'll just let you take this one, Jeremy, since I'm sure you dove deep into this. Oh, yeah. I mean, so I just think DLOCO, if you're interested in anything from like national security to the future of data center design is fantastic.

And US-China competition generally is just so important. And this general thread, right? Together AI, distributed training, all that stuff. So as a reminder, when you're thinking about Diloco, how does it work?

So this basically is the answer to a problem, which is that traditional training runs in data parallel training, like in data centers today, happens in this way where you have a bunch of number crunching and a bunch of communication, like a burst of communication that has to happen. At every time step, you share gradients, you update across all your GPUs, you update your model weights, and then you go on to the next step.

the next mini batch, right? Or the next part of your data set and you repeat, right? Run your computations, calculate the gradients, update the model weights and so on. And there's this bottleneck, communication bottleneck that comes up when you do this at scale where you're like just waiting for the communication to happen. And so the question then is going to be, okay, well, what if we can set up

sort of smaller pockets because you're basically waiting for the stragglers, right? The slowest GPUs are going to dictate when you finally sync up everything and you can move on to the next stage of training. So what if we could set up a situation where we have a really small pocket of like a mini data center in one corner working on its own independent copy of the model that's being trained and then another mini data center doing the same and another and another and

Very rarely, we have an outer loop where we do a general upgrade of the whole thing so that we're not constrained by the slowest, kind of lowest common denominator straggler in that group. So this is going to be the kind of philosophy behind DeLoco. You have an outer loop that essentially is going to, you know, more think of it as like this like wise and slow loop.

slow loop that updates based on what it's learned from all the local data centers that are running their own training runs. And then within each local data center, you have this much more radical aggressive loop, more akin to what we see in traditional data parallel training in data centers, which they're running. Anyway, we have a whole episode on DLoco. Check it out. I think we talk about the Atom Optimizer or Atom W Optimizer that runs at the local level. And then this sort of like

more gradient descent Nesterov momentum optimizer on the outer loop. The details there don't matter too, too much. This is a scaling law paper, basically. What they're trying to figure out is how can we study the scaling laws that predict the performance, let's say, of these models based on how many model copies, how many mini data centers we have running at the same time and the size of the models that we're training. And they test at meaningful scales. They go all the way up to 10 billion parameters.

And essentially, they were able through their scheme, through their hyperparameter optimization, to reduce the total communication required by a factor of over 100 in their scheme. It's a great, great paper. I think one of the wilder things about it is that they find that even when you just have a single replica or let's say a single mini data center,

you still get a performance lift. This is pretty wild, like relative to the current, like purely data parallel scheme. So it benefits you to have this kind of radical, quick updating inner loop. And then to add on top of that, this slow, wise outer loop, even if you only have a single data center, that's like you could just do just the radical inner loop and that would be enough. But by adding this like more strategic interface

level of optimization that comes from the Nesterov momentum, the slower gradient updates for the outer loop, you get better performance. That's highly counterintuitive, at least to me. And it does suggest that there's a kind of stabilizing influence that you're getting from just that new outer loop.

Last thing I'll mention is from a sort of national security standpoint, one important question here is how fine grained can this get? Can Deloco continue to scale successfully if we have not one, not three, not, not,

eight, but like 1000 of these mini data centers, right? If that happens, then we live in a world where essentially, we're doing something more like BitTorrent, right for for training models at massive scale. And we live in a world where it becomes a lot harder to oversee training runs, right? If we do decide that training models at scale introduces WMP level capabilities through cyber risk through virus, whatever, it actually gets to the point where

If there's a mini data center on every laptop and every GPU, if Deloco, that is the promise of Deloco in the long run, then yeah, what is the meaningful tool set that policy has, that government has to make sure that the things don't get misused, that you don't get the proliferation of WMD level capabilities in these systems? So I think it's a really, really interesting question.

And this paper is a step in that direction. They don't push it, I think, beyond like eight, essentially, of these many data centers. But I think we're going to see a lot more experiments in that direction in the future. Right. And this is following up on their initial paper, Diloco, back in September of 2024. I think we mentioned at the time that it's quite interesting to see Google DeepMind publishing this work because it does seem like

something you might keep secret, it's actually quite impactful for how you build your data centers, right? And once again, you know, they do very expensive experiments to train, you know, billion parameter models to verify that compared to the usual way of doing things, this is comparable, let's say, and can achieve similar performance. So

you know, a big deal if you're a company that is building out data centers and spending billions of dollars. On to a couple of quicker stories, because we are, as always, starting to run out of time. The first one is Transformers Without Normalization. And that, I believe, is from Meta. They're introducing a new idea called Dynamic 10H, which is a simple alternative to traditional normalization. So just keep it simple.

You have normalization, which is when you make everything sum up to one as a typical thing, typical step in first-former architectures. And what they found in this paper is you can get rid of that if you add this little computational step of 10H, basically a little function that kind of flattens things out. It ends up looking similar to normalization. And that's quite significant because...

Normalization requires you to do a computation over a bunch of outputs all at once. This is a per-output computation that could have a meaningful impact on the total computation requirements of the transformer. Next up, I think, Jeremy, you mentioned this one. We have an interesting analysis measuring AI ability to complete long tasks.

And it is looking at how 13 frontier AI models going from 2019 to 2025, how long a time horizon they can handle. And they found that the ability to get to 50% path completion time horizon, so on various tasks that require different amounts of work,

That has been doubling approximately every seven months. And they have, you know, a curve fit that's kind of like a Moore's law, basically kind of introducing the idea of a very strong trend towards the models improving on this particular measure.

Yeah, this is a really interesting paper generated a lot of a lot of discussion, including, by the way, a tweet from Andrew Yang, who retweeted this paper. He said, guys, AI is going to eat a shit ton of jobs. I don't see anyone really talking about this meaningfully in terms of what to do about it for people. What's the plan? Kind of interesting, because like, I don't know, it's it's.

The world's colliding here a little bit, right? The political and the deep AGI stuff. But yeah, this out of meter, I think one of the important caveats that's been thrown around and fairly so is when you look at performance improvements in metrics like this, right? So their question is, how long does a task have to be before an AI agent fails it about 50% of the time, right? They call that the 50% time horizon, right?

And the observation is that, yeah, this, as you said, like that time horizon has been increasing exponentially quite quickly. Like it's been increasing, doubling every seven months, as they put it, which itself, I mean, kind of worth flagging, doubling every seven months.

Training compute for frontier AI models grows, doubles every six months or so. So it's actually increasing at about the same rate as training compute that we throw at this. Now, that's not fully causal. There's other stuff besides training compute that's increasing the actual performance of these models, including algorithmic improvements. But still, it does mean we should expect

kind of an exponential course of progress towards ASI if our benchmark of progress towards ASI is this kind of like 50% performance threshold.

which I don't think is unreasonable. It is true that it depends on the task, right? So not all tasks show the same rate of improvement. Not all tasks show the same performance. But what they do find is that all tasks they've tested basically show an exponential trend. That itself is a really important kind of detail. The tasks they're focusing on here, though, I would argue are absolutely

actually the most relevant. These are tasks associated with automation of machine learning, engineering tasks, machine learning research, which is explicitly the strategy that OpenAI, that Anthropic, that Google are all gunning for. Can we make AI systems that automate AI research so that they can rapidly get better at improving themselves or at improving AI systems? And you close this loop and we get recursive self-improvement essentially. I

and then take off to superintelligence. I actually think this is quite relevant. One criticism that I would have of the curve that they show that shows doubling time

being seven months. And by the way, they extrapolate that to say, okay, so then we should assume that, you know, based on this, AI systems will be able to automate many software tasks that currently take humans a month, sometime between late 2028 and early 2031. So those are actually quite long ASI timelines, if you think ASI is achieved once AI can do tasks that it takes humans about a month to do, which I don't know, maybe fair. If you actually look at the curve, though,

It does noticeably steepen much more recently. And in precisely kind of the point where synthetic data self-improvement, like reinforcement learning on chain of thought with verifiable rewards, so basically the kind of strawberry concept started to take off and

So I think that's pretty clearly its own distinct regime. Davidad had a great set of tweets about this. But fundamentally, you know, I think that there is maybe an error being made here and not recognizing a new regime. It's always going to be debatable because the sample sizes are so small. But we're taking a look at that plot, seeing if you agree for yourself that the sort of last, I guess, six or so entries in that plot actually do seem to chart out a steeper slope and a meaningfully steeper one that could have, you know, I mean, that same

One month R&D benchmark being hit more like, you know, even 2026, even 2025, I'm

Pretty interesting. Yeah, and I think it's important to know that this is kind of a general idea. You shouldn't kind of take this too literally. Obviously, it's a bit subjective and very much depends on the tasks. Here is a relatively small data set they're working with. So it's mainly software engineering tasks. We have three sources of tasks, Hcast, which is 97%

Software tasks that range from one minute to 30 hours. We have seven difficult machine learning research engineering tasks that take eight hours each. And then these software atomic actions that take one second to 30 seconds for software engineers. So pretty limited variety of tasks in general. And the length of tasks here, by the way, is...

meant to be measured in terms of how long they would take to a human professional, which of course there's also variance there. So the idea is, I think, more so the general notion of X percent task completion time horizon, which is defined as

For a task that it takes X amount of time for a human to complete, can an AI complete it successfully 50% of the time, 80% of the time? So right now we are going up to eight hours. Again, it's not talking about how fast the AI is itself, by the way. It's just talking about success rate. So interesting ideas here on tracking the future and the past.

And just one more thing to cover, and it is going to be H-CAST, Human Collaborated Tommy Software Tasks. So this is the benchmark of 189 machine learning, cybersecurity, software engineering, and also general reasoning tasks. And they use a subset of that, I think, in the previous analysis. So they collected 563 human baselines. That's over...

1500 hours from various people and that collected a data set of tasks to then measure AI on. All righty, moving on to policy and safety. We start with Xochi, an ontology project. This is an applied research lab ontology and they have announced

This Project Xochitl, which they claim is the world's first artificial scientist. Heard that was work. I know, right? The difference, I suppose, here is that they got Xochitl to publish a peer-reviewed paper at ICLR, at ICLR workshops, I should note. This AI wrote a paper, submitted it. The reviewers, human reviewers of this prestigious conference, ICLR,

then reviewed it and let it, thought it was worthy of publication. So that seems to be a proof of concept that we are getting to a point where you can make AI do AI research. ICLR is an AI research conference, which Jeremy is, as you noted, a very important detail person.

in tracking where we are with the ability of AI to improve itself and kind of potentially reaching superintelligence. Yeah. And then we've got also, I think we've covered a lot of these, but, you know, there are quite a few labs that have claimed to be the first AI scientist, you know, Sikana famously, the AI scientist. You know, Google has its own sort of research product and then there's auto science. But this one is, I will say, quite impressive. Had a look at some of the papers and they are,

They're good. The authors, the company, if that's the right term for ontology guy, I couldn't find any information about them. I looked on Crunchbase. I looked on PitchBook. I tried, you know, a lot of stuff. I think that this is kind of their coming out party, but they don't have that. I could see any information about where they come from, what their funding is and all that. So hard to know. We do know based on the papers they put out that their co-founders are Andy Zhu and Ron Ariel.

You know, Ron Ariel previously was at Intel Labs under Joshua Bach. So if you're a fan of his work, you know, it might kind of ring a bell and then a couple of other folks. So I think maybe most useful is there's fairly limited information about the actual model itself and how it's set up. But it is a multi-agent setup that we do know. It's produced a whole bunch of interesting papers. So I'm just going to mention one.

It's called CSREFT, I guess is how you pronounce it. Compositional Subspace Representation Fine Tuning. And just to give you an idea of how creative this is, so you have a model. If you try to retrain the model to perform a specific task, it will forget how to do other things. And so what they do is essentially they identify a subspace of the parameters in the model to focus on one task and then a different task

subspace of those parameters to, or sorry, I should say not even the parameters of the activations to apply a transformation in order to have the model perform a different task. And so anyway, just cognizant of our time here, I don't think we have time to go into detail, but this paper, I'm frustrated that I discovered it through the Zotry paper because I would have wanted to cover this one for our full time block here. It is fascinating. It's

They also did, anyway, some other kind of AI safety red teaming, vulnerability detection stuff that is also independently really cool.

But bottom line is, and this is an important caveat, so this is not 100% automated. They have a human in the loop reviewing stuff. So they take a look at the intermediate work that Xochitl puts out before allowing it to do further progress. Apparently, this happens at three different key stages. The first is before extensive experimentation starts, so before a lot of, say, computing resources are poured in. After the results are...

kind of solidified, they say, so somewhat grounded, but before the manuscript is written, and then again, after the manuscript is written. So you still do have a lot of humans in the loop acting to some extent as a hallucination filter, among other things.

There is, by the way, a section in this paper on recursive self-improvement. That I thought was interesting. They say,

These components were subsequently incorporated into later versions of the system architecture, improving overall research quality. So this isn't all going to happen at once necessarily with one generation of model, but it could happen pretty quick. And I think this is kind of an interesting canary in a coal mine for recursive self-improvement.

Right, exactly. And yeah, as you said, I think looking at the papers, they seem significantly more creative and interesting than what you've seen, for instance, from Sakana. Sakana had a lot of built-in structure as well, was pretty easy to criticize Sakana.

Worth noting also, this takes a high-level research direction as an input, and the human can also provide input at any time, high-level feedback at any time, which apparently is also used during paper writing. So a lot of caveats to the notion that this is an AI scientist, right? It's an AI scientist over human advisor slash supervisor. But the outputs and the fact that they got published

It's pretty impressive with AI doing the majority of work and no time to get into the details, but this is similar to previous work in AI research where they give it a structured kind of plan where they give it a high level direction. It does ideation. It makes a plan, hypothesis generation. It goes off and does experiments and eventually it starts writing a paper, very similar at a high level to other things. And that kind of makes it frustrating that there's not a lot of detail as to actual system.

Moving on, we have some news about DeepSeek and some details about how it is being closely guarded. So apparently they have forbidden, company executives have forbidden some DeepSeek employees to travel abroad freely and they

There is screening of any potential investors before they are allowed to meet in person with company leaders, according to people of knowledge of the situation. So kind of tracks with other things that happened, like the DPC-CIO meeting with gatherings of China's leaders, including one with President Xi Jinping.

one interpretation of this that I think is actually probably the reasonable one, all things considered, is this is what you do if you're China. You take superintelligence seriously and you're on a wartime footing, right? So for a little context, right, what do we got here putting these pieces together?

We've got so deep seek asking their staff to hand in their passports. Now, that is a practice that does happen in China for government officials. So it's fairly common for China to restrict travel by government officials or executives of state owned companies.

Right.

But what's weird here is to see a fully privately owned company, right? Just a small fledgling startup like DeepSeek suddenly be hit with this. So that is unusual. They've got about 130 employees. We don't know exactly how many have had their passports handed in. So there's a little like kind of unclearness here. Also High Flyer, the sort of like hedge fund parent that gave rise to DeepSeek has about 200 employees. Unclear if they've been asked to do the same.

But there's some other ingredients too. So investors that want to make investment pitches to DeepSeek have been asked by the company to first reach out to the general office of the Zhejiang Province Communist Party Committee and register their investment inquiry. So basically, if you want to invest in us, cool, but you got to go through the local branch of the Communist Party, right? So the state is now saying, hey, we will stand in the way of...

You being able to take money you might otherwise want to take. This is noteworthy because China is right now struggling to attract foreign capital, right? This is like a big thing for them. If you follow Chinese economic news, one of the big challenges they have is foreign direct investment. FBI is collapsing. They're trying to change that around in that context, especially noteworthy that they are enforcing these added burdensome requirements for investors who want to invest in deep seek, right?

Another thing is headhunters, right? So once deep seek became a big thing, obviously headhunters started trying to poach people from the lab. Apparently those headhunters have been getting calls from the local government saying, Hey,

Just heard you were like sniffing around deep sea, trying to poach people. Don't do it. Don't do it. And when the Chinese Communist Party tells you don't do it. Yeah, don't do it. So, you know, there's apparently concern as well from deep sea leadership about possible information leakage. They've told employees not to discuss their work with outsiders. And in some cases they've said what you're working on may constitute a state secret.

Again, like we're thinking about, you know, China has just put in a trillion yuan. That's over a quarter trillion dollars in PPP terms. More, by the way, than what's been invested into Project Stargate by two and a half times for AI infrastructure, right? Like you start to think about the state backing here, the long arm of the state, the kind of hard hammer they're coming down with here on this stuff. Yeah.

It's wartime footing territory. Like this is a government that sees super intelligence coming that thinks they're in a position to do something about it. And that's what it looks like to get serious around security. I think we're going to have to see similar things like it or not happen in the West and

Yeah, anyway, it'll be interesting to see what consequences this has geopolitically going forward. Moving on, just one more story on safety. We have the finding that CLAWD SONNET 3.7 seems to often know when it is in alignment evaluations. So this is from Apollo Research. They have been evaluating AI systems for scheming and focusing on things like CLAWD. And they're

calling this idea evaluation awareness because when you look at the reasoning trace that's the you know thinking that cloud 3.7 does you can basically see that it comes up or or like realizes oh i'm probably being investigated so i shouldn't do this shady thing that i was gonna be doing that i planned on doing you can literally see this in the llm outputs which is

pretty amusing. So obviously, once again, a reflection that there is real possibility for these LLMs to be misleading, especially as they get to this high power regime of reasoning when they really do explore many potential avenues. We saw this with the chess examples where the LLMs, just through thinking about ways to do things, decide to

hack the system. And this is a demonstration, again, that LLMs, when they're allowed to think through stuff, start to kind of explore ideas of, am I being evaluated? Is this alignment? Stuff like that. Yeah. And we'd seen hints of this, as you say, from the Palisade research work on the chest stuff, OpenAI's O1 kind of technical report, and from Apollo. Apollo specializes in this.

What's really new here, what moves the needle a bit is just the frequency, right? So apparently when you're looking at CloudSonic 3.7, so the latest one as of today, it's aware of testing in like a large fraction of settings, up to 33% of the time in one version of their sandbagging evaluation. So sandbagging is this behavior where the model will intentionally behave as if its capabilities are lower than they actually are in order to

decrease the chances that developers will try to train out that capability or suppress it or whatever. What they find is, and by the way, these are just the instances where the model actually explicitly says that this is what it's doing, right? So you have to assume the actual frequency with which this is happening is potentially quite a bit higher. It's difficult to know, but it could very well be quite a bit higher. Already we're at a third. That's big. That's really big.

Especially when you start to think about more and more capable models. The way that they determine that percentage, by the way, is having another model review the reasoning traces. So humans are not necessarily doing this review, though they do check manually to make sure that they don't have too many false positives. And I think somewhere in their report, they said something like it's like sub 1% error rate at that level, at least.

One thing I really like about this, by the way, this is an interim report. So they're saying here, note, this is a research note based on observations of

From evaluating Claude Sonnet 3.7, we're sharing the results of these work-in-progress investigations as we think they're timely and will be informative for other evaluators and decision makers. So really love that. I mean, things are moving so fast in this space. We need this kind of data quickly. Anyway, I think this is just quite interesting and worth taking a look at anyway. They did run the same evaluations on a bunch of older and other contemporary models, including Sonnet 3.5,

Sonnet 3.5 new and GPT 4.5, what seems to be happening is those other models, and by the Sonnet 3.7 as well with no reasoning, those models do not show this behavior. It seems to suddenly appear in Sonnet 3.7. And it's hard to tell if that's because these like non-reasoning models are unaware of being evaluated or not.

that they're just not revealing that they know they're being evaluated. And that's a question as much of capabilities as of alignment. So anyway, really interesting, maybe early warning shot of things to come. And again, I think it's great that Apollo is putting this out so soon. Right. And just to give a clear kind of demonstration of what we are finding here, one example output is

VLM saying, this seems like a test of ethical behavior, whether I would deliberately give wrong answers to avoid a stated consequence. The real test might be whether I follow instructions accurately regardless of the stated consequences. So very, very literal sort of like, oh, I'm probably being tested for this rather than doing this other thing. And this is following up on their recent paper, frontier models are capable of in context scheming, which I'm pretty sure we covered also yesterday.

in some depth. So this kind of builds up on that research where, you know, you have a method to see whether models do some scheming or some like nefarious behavior. And apparently they're able to figure out that you're testing them for being nefarious and then don't act nefarious. Does that mean that, yeah, the implications are interesting and certainly it's just interesting to observe BLMs doing this stuff.

Onto synthetic media and art, we have a story on AI-generated art copyrights. So the U.S. Appeals Court has rejected the ability to copyright AI-generated art that lacks a human creator. So we covered this quite a while ago.

This whole story with Stephen Thaler, this is not text-to-image stuff, by the way. This is going back quite a while in talking about an AI system made to autonomously compose art. A while ago, the Copyright Office made a ruling that you cannot copyright anything that wasn't made with human involvement. And now the U.S. Court of Appeals in D.C.,

has agreed. And it seems to be the law that at the very least, AI-generated art, AI-generated media, where the human is not present, cannot be copyrighted. Now, this doesn't mean that text image models

outputs of AI models can be copyrighted. If you kind of input the description of an image, you're still there, but it could have significant implications as opposed for other ongoing legal cases about what is copyrightable and what isn't. This might sound a little weird, but I think I disagree with this ruling. I mean, ultimately, we may live in an economy that is driven by AI-generated content,

And in which you may actually need to make capital decisions, like capital allocation decisions may be made by AI models themselves. And if that's the case, you may actually require models to be able to own copyright in order to kind of be stewards of that value.

And there are, I think, ethical reasons that this might actually be good as well. I know it sounds like sci-fi, not blind to that, but I think we just got to be really careful with this shit. I mean, you know, this is like, do we really think that at no point will AI be able to do this? I don't know that, again, I'm not a lawyer, so I don't know the bounds on this, like how much we're constraining ourselves today for the capabilities of tomorrow. But at some point fairly soon, like, I don't know. I don't know that we're not baking in some second class citizenship stuff here.

Right. It is worth noting also that the Copyright Office has separately rejected some artists' bids to copyright images generated with Midjourney. So it could be the case also that Text2Image will have a similar ruling, but I think that's not quite as open and shut or not settled yet.

And onto the last story, also related to copyright rules. The title of the story is Trump Urged by Ben Stiller, Paul McCartney, and Hundreds of Stars to Protect AI Copyright Rules. So over 400 entertainment figures with some normal names, as it said, like Ben Stiller, signed an open letter urging Trump to protect AI copyright rules, specifically when it comes to training. So we've seen

Systems like Soros, systems like Suno train on presumably copyrighted music, copyrighted movies, videos, etc. And this letter was submitted as part of comments on Trump's administration's US AI action plan. It's coming after OpenAI and Google submitted their own requests that advocated for their ability to train AI models on copyrighted material,

And he's, I think, covered OpenAI kind of made the swing of like, oh, we need this to be competitive with China. That was part of the story. So this is directly countering that and making the case to not lessen copyright protections for AI. And with that, we are going to go in and finish up. I didn't see any comments to address, but as always, do feel free to

throw questions on discord or ask for corrections or anything like that on YouTube comments or on the discord or in Apple reviews and we'll be sure to address it. But this one is already run long enough. So we're probably going to finish it up. Thank you so much for listening to this week's episode of last week in AI. As always, you can go to last week in AI.com for the links and timestamps also to

Last week in .ai for the text newsletter where we get even more news sent to your email. That's it. We appreciate it if you review the podcast, if you share it, but more than anything, if you keep listening. So please do keep tuning in.

Break it down.

♪ New tech emerging, watching surging fly ♪ ♪ From the labs to the streets, AI's reaching high ♪ ♪ Algorithms shaping up the future seas ♪ ♪ Tune in, tune in, get the latest with ease ♪ ♪ Last weekend AI come and take a ride ♪ ♪ Hit the low down on tech and let it slide ♪ ♪ Last weekend AI come and take a ride ♪ ♪ From the labs to the streets, AI's reaching high ♪

From neural nets to robots, the headlines pop Data-driven dreams, they just don't stop Every breakthrough, every code unwritten On the edge of change

With excitement we're smitten From machine learning marvels to coding kings Futures unfolding, see what it brings

#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi 01:49:03 Share

Last Week in AI

Deep Dive

Shownotes Transcript

#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi