
#207 - GPT 4.1, Gemini 2.5 Flash, Ironwood, Claude Max

2025/4/18

Last Week in AI

People
Andrey Kurenkov
Jeremie Harris
Topics
Andrey Kurenkov: I think OpenAI's release of the GPT-4.1 family of AI models is a very significant event. These models are optimized for coding and instruction following, come in variants like GPT-4.1 Mini and Nano, and have a million-token context window. This represents a step up in large language models' ability to handle complex tasks, especially code generation and instruction following. GPT-4.1 shows a significant improvement over GPT-4o on the SWE-bench Verified benchmark, which suggests stronger performance in real-world use. In addition, ChatGPT's new memory feature, which lets it remember previous conversations and use them as context for future interactions, will improve the user experience but also raises privacy concerns. Google's Gemini 2.5 Flash is a smaller, faster version of Gemini 2.5 Pro aimed at lowering costs, reflecting a broader trend toward more cost-efficient models that can serve a wider range of applications. xAI released an API for Grok 3, letting developers pay to use the model, which further pushes the commercialization of AI models. Canva's Visual Suite 2.0, with AI-driven coding and chatbot features, likewise shows AI being folded into all kinds of applications. Meta's Llama 4 Maverick performed well on the LM Arena benchmark, but its vanilla version performs considerably worse, which raises questions about how models are evaluated.

Jeremie Harris: GPT-4.1 strikes a balance between accuracy and performance versus cost and offers multiple options, which is good news for developers. ChatGPT's new memory feature brings a more personalized experience but carries privacy risks that OpenAI needs to handle carefully. The release of Gemini 2.5 Flash reflects the trend toward smaller, more efficient models, which is essential for lowering costs and broadening applications. The launch of the Grok 3 API and Canva's integration of AI features both show how quickly AI is being commercialized and adopted. Meta's Llama 4 Maverick benchmark results are a reminder to be careful about model evaluation methods and to avoid misleading results. Anthropic's new Claude Max subscription, with higher rate limits, serves the needs of power users and reflects ongoing exploration of business models for AI. Overall, this week's AI news shows the field continuing to develop rapidly, with model performance improving and commercialization accelerating, while challenges around safety and privacy remain.


Chapters
OpenAI released GPT-4.1, focusing on coding and instruction following with variants like GPT-4.1 Mini and Nano. It boasts a million-token context window but faces criticism for reduced safety testing resources.
  • GPT-4.1 models optimized for coding and instruction following
  • Availability via API, not ChatGPT
  • Million-token context window
  • Improved performance on SWE-bench Verified compared to GPT-4o

Transcript


Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, we'll be talking about the major news of last week, and you can go to the episode description to get all those articles and links to every story we discuss, and the timestamps as well. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup.

I'm your other host, Jeremie Harris. I'm with Gladstone AI, an AI national security company. Yeah, I guess that's it. That's it.

National security company? Yeah, it's an AI national security company. So basically we work with partners in the US government and private companies on dealing with national security risks that come from increasingly advanced AI, up to and including superintelligence, but also AGI, advanced AI, the whole gamut. That's kind of our area. Yeah, yeah. I just like that phrase, AI national security company. You'd think there's a lot of AI national security companies, but I imagine it's a pretty small space. Yeah.

Yeah, it's actually kind of weird. Like it's, I guess on the national security side, you could say Palantir is in a way they're more about, you know, like the application level. What can we build today?

I would say that companies like OpenAI and Anthropic and like Google DeepMind should be thinking of themselves as AI national security companies. Just like, the AGI that you're building, like fucking superintelligence and shit. You think that's on the roadmap? Like, yep, you're in the national security business, baby. So I guess it's a short way of summarizing what otherwise could go on for some time. Just like, hey, I mean, what AstroK does is

more than a one-liner too, although it's maybe clearer as a one-liner, maybe, maybe.

This week, we've got a slightly calmer week than we've been seeing, I think, for a while. Some sort of medium-sized news, nothing too crazy. But as we'll be starting out, I think GPT-4.1 will be one of our first stories. That's going to be pretty exciting. Some other kind of incremental news developments, applications in business, some stories related to startups and Anthropic really opening up to competitors.

Projects in open source, we got, as always, more benchmarks coming out as people try to continually evaluate these AI agents and how successful they are. Research and advancements, we're going to be talking about yet more test time reasoning stories and how to get those models aligned and better at reasoning without talking forever.

And in policy and safety, some more stories about OpenAI policies and the drama going on with all the lawsuits and whatnot. That's an evergreen comment, though, isn't it? Like, we should have that in every episode. There's always a bit more to say. So that's just how it is with OpenAI.

And let's just go ahead and dive straight in. Tools and apps, we're starting with OpenAI's announcement of GPT-4.1. This is their new family of AI models. It's including also GPT-4.1 Mini and GPT-4.1 Nano. And these models are, as per the title, all optimized apparently for coding and instruction following.

They are now available through the API, but not through ChatGPT. And they have a 1 million token context window, which is what you would get with, I believe, Claude Opus. And also Gemini, kind of the big models, I believe, all have 1 million as input. That's a very large amount of words in a code base.
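Just as a rough back-of-the-envelope check on that claim (our own assumed ratios, not numbers from the announcement; tokens-per-word and tokens-per-line vary a lot by tokenizer and language):

```python
# Rough, hedged estimate of what a 1,000,000-token context window can hold.
# Assumed ratios (not from OpenAI): ~0.75 English words per token,
# ~10 tokens per line of typical source code.
CONTEXT_TOKENS = 1_000_000

approx_words = CONTEXT_TOKENS * 0.75       # ~750,000 words of prose
approx_code_lines = CONTEXT_TOKENS // 10   # ~100,000 lines of code

print(f"~{approx_words:,.0f} words or ~{approx_code_lines:,} lines of code")
```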

So I think it's an interesting development for OpenAI to have this model, this kind of focus with the most recent, I guess, sequel to GPT. It kind of reminds me of what Anthropic has done, particularly with Claude Code. People are getting all about vibe coding, having agents build software. It seems a little bit aligned with that.

Yeah, it does. It's really all about kind of moving in this direction of cheaper models that actually can solve real-world software engineering tasks. And that's why in the eval suite, you tend to see them focus on SWE-bench scores, right? Which, in fairness, is more SWE-bench Verified, which is OpenAI's version of SWE-bench, which we've talked about before. But anyways, a software engineering benchmark that's meant to test real-world coding ability. It does really well, especially given the cost associated with it.

You're looking at between 52 and 54.6, a bit of a range there because, anyway, there are some solutions to SWE-bench Verified problems that they couldn't run on their infrastructure. So they kind of had this range of scores. Anyway, it's...

comparable to... I mean, it's all about this Pareto frontier, right? Like you get to choose your own adventure as to how accurate and performant your model is going to be versus how cheap it's going to be. And this is giving you a set of options that are kind of on the cheaper side but more performant, especially when you get on the nano end of things.

It also has a whole bunch of other multimodal abilities, including the ability to reason over video or kind of analyze video. It comes with a more recent knowledge cutoff, too, which just intrinsically is a value add. So you don't need to really kind of do much other than provide more up-to-date training to add some value to a model. Up to June 2024, by the way, is that cutoff. So kind of cool if you're worried about software libraries that are a little bit more recent, for example, that might be a helpful thing.

But also, obviously, it has tool use capabilities baked in now as all these coding models do. So, yep, pretty cheap model. Pretty frustrating for anybody who's trying to keep up with the nomenclature on it.

which index are we at now? I thought we were at 4.0, but then I thought that we were going to switch and we're just going to have the O-series. So no more base models. But then the 4.5 comes out. That's the last base model. Okay, we're done there. But no, no, no. Well, let's go back and do 4.1. So confused right now. Exactly. Yeah. This is a prequel to 4.5, I guess. They just decided to release

And I assume we're not getting an "o" version, because this is not an Omni model. I assume it only processes text, perhaps with a focus on coding. It does apparently have... So they say that it has some video capabilities, right? Right. To understand content in videos. Right.

Yeah, I did not understand that point. How multimodal do you have to be? Yeah, it's like how multimodal do you have to be before you're called an Omni model is the next question. Right. Well, on your note of improving on benchmarks, looking at the blog, it actually is a pretty impressive boost of GPT-4.1 compared to GPT-4o on SWE-bench.

On SWE-bench Verified, GPT-4o gets 33%, GPT-4.1 gets 55%. And that's higher by a little bit than OpenAI o3-mini on high and OpenAI o1 on high compute. So pretty impressive for a non-high-compute, non, I guess, test-time reasoning machine

to be even better than some of these more expensive, typically, and slower models. Much better than GPT-4.5 as well, interestingly. I will say it's a lot of internal comparisons. So they're showing you how it stacks up against other OpenAI models, which...

Even when Claude 3.7 Sonnet came out, its range is 62% to 70% on SWE-bench Verified. So this is quite a bit worse than Claude 3.7 Sonnet. But that's where the accuracy-cost trade-off happens, right? And the next story also has to do with OpenAI. This one, though, is about ChatGPT and some new features there.

And particularly the memory feature in ChatGPT that basically just stores things in the background as you chat, that's getting an upgrade. Apparently, ChatGPT can now reference all of your past conversations and that will, I suppose, be much more prominent. Actually, this was funny. A coworker posted and it was like, whoa, it referenced this thing from

recent interactions, and they didn't even know memory was a thing on ChatGPT. So I imagine this might also be tweaking the UX to make it maybe more clear that this is happening. This does tweak the UI as well. You can still use saved memories, where you can manually ask it to remember things.

And you can have ChatGPT reference chat history, where it will, I guess, use that as context for your future interactions. Yeah, it's really exciting. As part of the announcement, they're also letting us know that ChatGPT can now remember

All the ways in which you have wronged it and where you sleep and eat, who your loved ones are, your alarm code and what you had for dinner last night. So really exciting to look forward to those interactions with the totally not creepy model. Yeah, no, this is actually true. It is a cool step in the direction of these more personalized experiences, right? Like you need that persistent memory because otherwise it does feel like the sort of episodic interaction, right?

All kinds of psychological issues, I think, are going to crop up once we do that, obviously. Like the world of her, which is quite explicitly what Sam A. has been pushing towards, especially recently. You know, I mean, I don't know how people are going to deal with that long term. But in any case, as if to deal with objections of that shape, they do say, as always, you're in control of ChatGPT's memory. You can opt out of referencing past chats or memory altogether at any time in your settings.

Apparently, if you're already opted out of memory, they'll automatically opt out of referencing your past chats by default. So that's, you know, that's useful. And apparently they're rolling out today to plus and pro users, except in certain regions, like a lot of in like the EU type thing, including Liechtenstein, because, you know.

It's the first time I've seen that giant market cut out. I know, yeah. I guess very stringent regulations over in Liechtenstein. Yeah, interestingly, rolling out first to the pro tier, the like crazy $200 per month tier, which seems to be increasingly kind of the first way to use new features. And this says will be available soon for the $20 plus subscribers.

And on to the lightning round, a few more stories. Next up, we got Google and they also have a new model. This one is Gemini 2.5 Flash. So they released Gemini 2.5 Pro, was it? I think not too long ago and people were kind of blown away. This was a very impressive release from Google and

And kind of really the first time with Gemini sort of was seemingly leading the pack. And a lot of people were saying, I'm switching from Claude to Gemini with 2.5. It's better. And so this was kind of an exciting announcement for that reason. Now we've got the smaller, faster version of Gemini 2.5 Pro.

Yeah. And I mean, it's, it's designed to be cheaper. Again, it's like, it's all part of the same, the same push, right? So typically, what seems to happen is model developers will come up with a big kind of pre trained model. And once you finish doing that, you're kind of in the business of mining that model in different ways. So you're going to create a whole bunch of distillates of that model, right?

you know, to make these cheaper kind of lightweight versions that are better on a per-token kind of price efficiency standpoint. So that's what happens, right? You get the big, the big thing gets done, that may or may not be released, because sometimes it's also just too expensive to inference. That's what a lot of people have suspected is what happened with Claude 3 Opus, for example, right? It's just too big to be useful, but it can be useful for kind of serving as a teacher model to distill smaller models. Anyway, that's more of the same here. Boy,

Boy, is this field getting interesting, though. As you say, I mean, I remember when OpenAI was the runaway favorite. I'm really curious what the implications are for fundraising for OpenAI. Is it just that they haven't released

their latest models to kind of like, you know, demonstrate that they're still ahead of the pack. All kinds of questions as well around the acceleration of their safety review process that we'll get into as well that ties into this. But things right now, like I'm really going to be interested to see if it's even possible for OpenAI. I don't know that they'll be able to raise, frankly, another round without IPO-ing, if only because they've already raised $40 billion and they're close to the end of the source of funds. But there you go.

Yeah, I think it's an interesting time for sure. For a while, it seemed like OpenAI was by far ahead of everyone, right? For years, even before this became such a consumer-facing business, OpenAI kind of got a head start, so to speak, where with GPT-3 they were the first ones to recognize the potential of LLMs and really create LLMs.

And yeah, for a while they had the first impressive text-to-image models, the first impressive text-to-video, and they had speech-to-text as well with Whisper. But yeah, in recent times, it's increasingly harder to point to areas where OpenAI is leading the pack or significantly differentiated from Anthropic or Google or other providers of similar offerings.

And speaking of which, next up, we've got a story about xAI. They are launching an API for Grok 3. So Grok 3 recently launched. I think we covered it maybe a month ago. Very impressive, kind of similarly competitive model, in the same ranks as ChatGPT and Claude. At the time, you could play around with it, but you could not use it as a software developer, as part of your products, whatever,

because you need an API for that. Well, now it is available and you can pay to use it at $3 per million input tokens and $15 per million output tokens, with Grok 3 Mini costing significantly less. Yeah. So they also have the option to go with a faster version, like I guess a version where

My read on this is it's sort of the same performance, but I guess lower latency. So instead of three bucks per million tokens of input, it's five bucks for a million tokens. And then instead of 15 bucks per million output tokens, it's 25. So they kind of have this, that's for the full Grok 3, and they have a similar thing going on with Grok 3 Mini. But kind of interesting, right? Like if you want to get, I guess, maybe ahead in line from a latency standpoint, they're introducing that option. So it's another way to kind of segment the market.
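To make those price points concrete, here's a quick sketch of the per-request math at the quoted rates (the model identifiers and request sizes below are made up for illustration, not actual xAI API names):

```python
# Per-million-token prices quoted in the episode (USD).
PRICING = {
    "grok-3":      {"input": 3.00,  "output": 15.00},
    "grok-3-fast": {"input": 5.00,  "output": 25.00},  # lower-latency tier
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted per-million-token rates."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical request: a 20k-token prompt with a 2k-token completion.
print(request_cost("grok-3", 20_000, 2_000))       # ~0.09
print(request_cost("grok-3-fast", 20_000, 2_000))  # ~0.15
```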

So that's kind of cool. We are seeing price points that are a little bit on the high end. I mean, comparing...

Sort of similarly to like 3.7 Sonnet, but also like considerably more expensive than the Gemini 2.5 Pro that we talked about earlier that came out, I guess, a couple weeks ago. But still, it's impressive. It's XAI again, kind of coming out of nowhere, right? I mean, this is pretty remarkable. There has been some talk about the context window. So initially, I think the announcement was there was supposed to be a 1 million token context window announced.

I think that was announced back in February. It seems like the API only lets you get up to about 131,000 tokens. So where that delta is, I mean, it may well come from the serving infrastructure, right? So the base model may actually be able to handle the full 1 million tokens, but they're only able to serve it up to 130,000 for right now, in which case, yeah, you might expect that to increase pretty soon. But anyway.

Yeah, really, really interesting. And another of these entries right in the kind of frontier models that all look kind of the same, not a coincidence, by the way, because everybody's getting comparable allocation from NVIDIA comparable allocation from TSMC, like it all kind of comes from the same place. And so unless you have 10 times more chips, like,

don't expect to have 10 times the scale or a significant leap in capability, at least at this point. Yeah, I think everyone has scraped the internet, got largely similar data sets. And it's, I think, also kind of the secrets of the trade are probably less secret than it used to be. It seems like

With Grok, for instance, they got into it a year ago, and it became slightly clearer how to train large language models by that point, in part because of Llama, in part because of open efforts, things like that.

Well, and Jimmy, Jimmy Ba also like the founding engineer was also like, yeah, Google. And they had like, yeah, very experienced people who've already done this. So yeah, I think there is one of the interesting things here is like there is a lot of secret sauce that isn't shared, but it's adding up to the same thing. I just find that really interesting from a almost like meta, like zoomed out perspective. It's like you have this human ant colony and it

The ant colonies may have different shapes or whatever, but fundamentally the economics that they're constrained by, or the almost laws of physics and engineering are pretty similar. And until we see a paradigm shift that's big enough to give you like a 10x lift and there's no response from other companies, we're going to be in this intermediate space.

Don't expect that to persist, by the way, too long in the age of inference, because there I think little advantages can compound really quickly. But anyway, that's maybe a conversation for a later time. Next up, we have a story not related to a chatbot. It's Canva, which is basically a tool suite for design, I think, and various applications, related also to PowerPoint-type applications.

Well, they have announced their Visual Suite 2.0, which has a bunch of AI built into it. So they have Canva Code, which is a tool with generative AI coding, and that lets you generate widgets and websites with text. So kind of built-in vibe coding, I guess. And they also have a new AI chatbot, and that lets you...

use their generative AI tools like editing photos, resizing, generating content, all through this chatbot interface. It's increasingly the case that I guess people are building their AI into their product suite in a

cleaner way, in better ways. It seems like we are getting to a point where some of this stuff is starting to mature and people are iterating on the UX and trying to really kind of make AI part of the tooling process

in a more natural way. Yeah, it's one of the most interesting sort of design stories, I think, that we've seen in like actually in decades. I mean, this is a pretty fundamental shift. Think about the shift from, you know, Web 1.0 to Web 2.0. This is this is, again, a kind of similar leap, right, where all of a sudden it's a whole new way of interacting with computers and the Internet. And so, you know, designers are probably having a field day.

So yeah, I'm sure we're going to see a lot more of this stuff. Obviously, we're only like two or three years into this process. But we'll say it's also kind of funny that you open the story saying, hey, guys, like exciting because this is a story that's not about chatbots. And there's a chatbot in the freaking thing just shows you where we are. Yeah, yeah, that's a good point.

And one last story. This one is related to Meta and also a chatbot. Well, at least a model. This is the Maverick model from Llama 4. We covered Llama 4, I believe, in the last episode and covered how it was met with a lot of, let's say, skepticism, and people calling them out for seemingly having good benchmark numbers, but not actually being impressive in practice. Well,

This is an update on part of that, where Llama 4 seemed to be doing really well on LM Arena, where people rank different models. Turned out this was a special variant of Llama 4 optimized for LM Arena, and the vanilla version is...

way worse. It is kind of matched with what seems to be the case for Llama 4 in general. It's underwhelming. So just sort of reaffirming the fact that they pretty much gamed the benchmark and it was, yeah, pretty much nonsense. Pretty clearly a stunt that they should not have pulled, I think, with Llama 4.

Yeah, I mean, this tells you a lot. It can't help but tell you a lot about the state of AI at Meta, right? Like there are a couple of things that companies can do that are pretty like undeniable indications of actual capability or the direction they're going in. You know, companies often have to advertise roles that they're going to hire for. So, you know, they're forced to kind of telegraph to the world something about what they think about the future by doing that.

And then there are things like this where it's very clearly a stunt, and like a pretty gimmicky one at that. Look, the reality is this is Goodhart's law in part, right? So Goodhart's law is, if you pick a target for optimization, in this case the LMSys leaderboard, and you push too hard in that direction, you're going to end up sacrificing overall performance. There are going to be unintended side effects of that optimization process. You can't be the best at everything all the time, at least not

until we hit the singularity. And this is a reflection of the fact that, yeah, Meta made the call to actually optimize for marketing more than other companies. I think, you know, other companies just would not have made this move. That being said, I think the real update here is

any excitement you had about Llama 4, like any variant of Llama 4's performance on LMSys, basically just ditch that and you're basically in the right spot. What they're doing in this article is they're basically saying, like, oh, look at how embarrassing Llama 4 Maverick

is on a wider range of benchmarks. It's even scoring below GPT-4o, which is like a year old. So that's truly awful. That may be true, but it's also, like, this is the version that was fine-tuned for the LM Arena. And I wouldn't even think of that as an interesting benchmark. It's like you fine-tune a model to be really good at biological data analysis, and then you complain that it's not good at math anymore. And that kind of just makes sense.

We know that's already true. But anyway, so all this is to say, this is a fake result, or the original LM Arena result is basically fake. As long as you delete that, purge that from your memory buffers, you're thinking about Llama 4 the right way. It's a pretty disappointing launch. The update here is about Meta itself, I guess, and just like...

you know, something to think about, because we've heard about some of these high-profile departures too from the Meta team, right? Like they're forced to do a clean sweep. Yann LeCun is trying to do damage control and go out and say, like, oh, this is like a new beginning and this is exciting. I mean, dude, open source was supposed to be the one place where they could compete. Like, we've known that Meta can't generate truly frontier models for a long time.

But they were at least hoping to be able to compete with China on open source. And now that doesn't seem to be happening. So the big question is like, okay, what is the point, guys? I mean, we're spending billions on this. There's got to be some ROI, right? Just to dive into a bit more detail, the one that we got the initial results on, that went very well, was this Llama 4 Maverick,

which was optimized for conversationality. And that's LM Arena. You have people talking to various chatbots and inputting their preference. So it seemed like it was pretty directly optimized for that kind of benchmark of LM Arena. And I believe they also

did say that it was partially optimized for that specific benchmark. And as you said, the vanilla version, the kind of general-purpose one, is, I mean, not horrible, but ranking pretty low compared to a bunch of models that are pretty old. I think 32nd place right now, compared to a whole bunch of other models, below DeepSeek, below

Claude 3.5, Gemini 1.5 Pro, things like that. On to applications and business. First story related to Google and a new TPU. So this is their 7th-gen TPU announced at Google Cloud Next '25. It's called Ironwood. And they are saying that this is the first TPU designed specifically for inference applications

in the age of inference. I think people pointed out that TPUs initially were also for inference. So this is a little bit of a, maybe not accurate, but anyway, they, as you might expect, have a whole bunch of stats on this guy.

crazy numbers, like that a pod can scale up to 9,216 liquid-cooled chips. Anyway, I'm going to let you take over the details because I assume there's a lot to say on whatever they announced with regards to what people are also building for GPU clusters and generally the hardware options for serving AI. Yeah, no, for sure. And I actually didn't notice that,

the first Google TPU for the age of inference thing. I like that kind of pseudo-hypey thing. I wish that the first email I'd sent after o1 dropped, I'd formally titled it my first email in the age of inference. That would have been really cool. I missed the opportunity. But yeah, essentially, as you say, a TPU, it is optimized for

thinking models, right? For these inference heavy models that use a lot of test time compute. So, you know, LLMs, MOEs, but specifically like doing the inference workloads that you have to run when you're doing RL post-training or whatever. So it's in the water, but it certainly is a broader tool than that. It is giant. Jeez, when we talk about all these chips linked together, like we have to put in a bit of context. So

I think the best comparable to this is maybe the B200 GPU and specifically maybe the NVL72 GB200 configuration. So essentially, and we talked about this a little bit in the hardware episode, but

So the B200 is one part of a system called the GB200. GB200s come in ratios of two GPUs per one CPU. And you'll have these racks with like 72 GPUs in them. And those 72 GPUs, they're all connected really, really tightly by these NVLink connectors, right? So this is extremely high bandwidth interconnect.

And so the question here is, so Google has essentially like groups of like 9,000 of these TPUs in what they'll call one pod. And they are connected together, but they're not connected through interconnect with the same bandwidth as the NVL72. And so with the NVL72 connection, you have

kind of like smaller pods, if you will, but the connection bandwidth between them is much higher. And so these Google systems are like a lot larger, but a bit slower at that level of abstraction, at the kind of full interconnect domain level.

So doing a side-by-side is kind of tricky because what it means to have like 72 GPUs or 9,000 kind of, or 72 chips or 9,000, I should say, sort of varies a little bit, but the specs are super impressive on a flop basis. So the Ironwood hits 4.6 petaflops,

That's per chip. And the B200 is going to hit about 4.5 petaflops per chip. So very, very comparable there. Not a huge surprise because, you know, both have great design and both are relying on similar nodes at TSMC. There's a whole bunch of cool stuff on the memory capacity side. So these chips, the TPU v7s, are actually equipped with 192 gigabytes of HBM3 memory. That's

really, really significant amount of like these stacks of DRAM, basically the HBM stacks. It's about double what a typical like B200 die will have. So it's pretty, or I should say feeding into it. And that's especially helpful when you're looking at really large models that you want to have on the device that have like MOEs. So you might be able to fit like a full expert, say,

a really big one, on one of these HBM stacks. So that's a pretty, pretty cool feature. All kinds of details that get into, like, how much coherent memory do you specifically have? Like how the memory architecture is unified, right?

We don't have to dive into too much detail, but the bottom line is this is a really impressive system. The 9,000 or so TPUs in one pod, that comes with a 10 megawatt footprint on the power side. So that's like 10,000 homes worth of power just in like in one pod. Pretty, pretty wild. There is a lightweight variant with, I think it was like about 200 chips in a pod as well for sort of more lightweight power.

kind of setups, which I guess they would probably do it at inference, like data centers they've set up for inference closer to the edge or where the customer will be. But yeah, more power efficient too, by the way, 1.1 kilowatts per chip.

compared to more like 1.6 kilowatts for the Blackwell, that's becoming more and more important. The more power efficient you can make these things, the more compute you can actually squeeze out of them and power is increasingly kind of that rate limiting factor. So this is a big launch. My notes are a bit of a mess on this because it's just like there's so many rabbit holes we could go into and maybe worth doing at some point like a hardware update episode. But might leave it there for now.
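Putting the per-chip figures quoted here together, as a rough sanity check (using only the numbers mentioned in this discussion, not official aggregate specs):

```python
# Figures quoted above for Ironwood (TPU v7).
chips_per_pod = 9_216
flops_per_chip = 4.6e15        # ~4.6 petaflops per chip
watts_per_chip = 1_100         # ~1.1 kW per chip (vs ~1.6 kW cited for Blackwell)

pod_flops = chips_per_pod * flops_per_chip           # ~4.2e19 FLOP/s, i.e. ~42 exaflops
pod_power_mw = chips_per_pod * watts_per_chip / 1e6  # ~10.1 MW, matching the ~10 MW pod figure

print(f"~{pod_flops / 1e18:.1f} EFLOP/s and ~{pod_power_mw:.1f} MW per pod")
```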

Yeah, this announcement kind of made me reflect. It seems like one of the questions with regards to Google is they are offering very competitive pricing for Gemini 2.5, kind of undercutting the competition pretty significantly. That could be at a loss, just so that they can gain more market share. But I imagine having TPUs and having a very advanced

cloud architecture and ability to run AI at scale makes it more feasible for them to offer

things at a lower price. And in the blog post for this announcement, they actually compared to TPU v2. TPU v2 was back from 2017. And so this iteration of TPUs have 3,600 times the performance of TPU v2, right? So like almost 4,000x

and way more than TPU v5 as well. And as you said, on the efficiency comparison, they're also saying that you get

29.3 times the flops per watt compared to TPU v2. So way more compute power, way less energy use for that compute power. It just shows you how far they've come in these years. And, you know, it does seem like this is quite a significant jump in terms of both flops per watt and peak performance compared to Trillium and v5. So

Another reason, I guess, to think that they might be leveraging this to be more competitive. People typically don't train their own models on the cloud. They are running models. And so it sort of allowed them to really support customers using their models relatively cheaply. Yeah, and the interconnect is a really big part of this too, right? So there is this move in the industry to kind of move away from

at least the NVIDIA InfiniBand interconnect fabric that is kind of, I don't want to say like industry standard, but you know, anything by NVIDIA is definitely going to have some momentum going for it. So Google actually invented this thing called interchip interconnect, which is an unhelpfully vague and general term.

but ICI. And this is essentially their replacement for that. And that's a big part of what's allowing them to hit like really, really high bandwidth on the backend network. So now that when we say backend, like kind of connecting different pods, connecting essentially parts of the compute infrastructure that are relatively far away. And that's important, right? When you're doing giant training runs, for example, at large scale, you're going to do that a lot. It's also important interconnect bandwidth is for inference workloads,

for a variety of reasons. So is HBM capacity, which they've again dialed up. And this is like double what you see, at least with the H100. And on to the next story. We are going to talk about Anthropic. They have announced a $200-per-month Claude subscription called Max.

So that's the end of the story. You're going to get higher rate limits. The $100-per-month option, that's the lower tier. You're going to get five times the rate limits compared to Claude Pro, the $20 subscription. And for the $200-per-month option, you're getting 20 times higher rate limits. I think an interesting development we had was

OpenAI releasing their Pro tier, I think a few months ago now, it's pretty fresh. And now Anthropic also coming with a $200-a-month tier. I think partially a little bit of expected developments in the sense that

If you are a power user, you're almost definitely costing Anthropic and OpenAI more than you're being charged at $20 per month. It's pretty easy to rack up more cost if you just, you know, are doing a lot of processing of documents, of chats. And so, you know, it's a kind of unprecedented thing to have $200-per-month tools, at least in the kind of productivity space.

Adobe, of course, and other tools like that charge easily this kind of very significant amount. Anyway, yeah, what I came to think is that it

might be a trend that we'll be seeing more of, AI companies introducing these pretty high-ceiling subscription tiers. 100%. And I mean, I'm actually a Claude power user for sure. So this is just definitely for me. I mean, the number of times I run out, it's so frustrating to

or has been where you're using Claude, you're in the middle of a problem and it's like, oh, this is your last query. Like you have to wait another, it's usually like eight hours or something before you get more ability to query. That's really frustrating. So awesome that they're doing this.

I think I'm trying to remember how much I'm paying for it. I think it's 20 bucks a month or so. So the 100 bucks per month for five times the amount of usage, at least if my math is right here, all they're really doing is allowing you to proportionately increase what you pay. The 200 bucks a month for 20 times the amount, okay, that's, you know, I guess a 50 percent off deal at that scale or something like that. But still.
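Spelling that math out with the tier prices and multipliers quoted here (a quick sketch, nothing official):

```python
# Tiers as quoted: Claude Pro at $20/month (1x), Max at $100 (5x) and $200 (20x rate limits).
tiers = {"Pro": (20, 1), "Max $100": (100, 5), "Max $200": (200, 20)}

pro_units_per_dollar = 1 / 20
for name, (price, multiple) in tiers.items():
    relative_value = (multiple / price) / pro_units_per_dollar
    print(f"{name}: {relative_value:.1f}x Pro's usage per dollar")
# Pro: 1.0x, Max $100: 1.0x (exactly proportional),
# Max $200: 2.0x -- 20x the limits for $200 instead of a proportional $400,
# which is where the "50 percent off" framing comes from.
```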

These are really useful things. I mean, the number of times I have thought to myself, man, I would definitely pay like a hundred bucks a month to not have this problem right now is quite high. So my guess is they're going to unlock quite a bit of demand with this. Suggest maybe that they've

solved something on the compute availability side because they didn't offer this before despite knowing that this was an issue. And I'm sure that they've known this was an issue. So yeah, I mean, they may have just had some compute come online. That's at least one explanation.

And a few more stories related to OpenAI. First up, we've got, I guess, a new competitor to OpenAI that's slowly emerging. It's Safe Superintelligence, the AI startup led by OpenAI co-founder Ilya Sutskever, one of the chief kind of minds of research going back to the beginning of OpenAI, and

to 2023 when he was famously involved in the ouster of Sam Altman briefly before Sam Altman returned. Then Ilya Sutskever left in 2024 and is launching this, I guess, play for AGI. And now we're getting the news that they are raising $2 billion in funding and the company is being valued at $32 billion. So,

This is apparently also on top of a previous $1 billion raised by the company. And I think it's impressive that in this day and age, we are still seeing startups with prominent figures getting billions of dollars to build AI. It doesn't seem like there is a saturation of investors willing to throw billions at people who might compete at the frontier.

Yeah, hard to saturate demand for superintelligence or at least speculation. Yeah, pretty wild. The other kind of update here is with Alphabet jumping in, we are, I think, learning for the first time, at least I wasn't aware of this, that safe superintelligence is accessing or using TPUs provided by Google Cloud.

as their predominant kind of source of compute for this. So we've already seen Anthropic partnering with obviously Google as well, but Amazon, to use Trainium chips and Inferentia as well, I believe, but certainly Trainium.

And so now you're in a situation where SSI, like Google's trying to say, hey, literally optimize for our architecture. And that's not a small thing, by the way. Like I know it might sound like, okay, you know, which pool of compute do we optimize for? Do we choose? Do we go with the TPUs, the NVIDIA, like GPUs? Or do we go with Amazon stuff?

But the choices you make around this are extremely, there's a lot of lock-in, like vendor lock-in that you get. You're going to heavily optimize your workloads for a specific chip. Often the chip will co-evolve with your needs, depending on how close the partnership is. That certainly was happening with Amazon and Anthropic. And so for SafeSuperintelligence to throw in their lot with Google in this way does imply a pretty intimate and deep level of partnership, but we don't know the exact terms of the investment. So maybe like Amazon,

presumably just because they are using TPUs. There's something going on here with compute credits that Alphabet is, I would guess, offering to Safe Superintelligence as at least part of their investment, in much the same way that Microsoft did with OpenAI back in the day. But something we'll presumably learn more about later. It's a very interesting positioning for Google now, kind of sitting in the middle of a lot of these labs, including Anthropic and Safe Superintelligence.

And the next story also related to a startup from a former high-ranking OpenAI person. This one is about Mira Murati's Thinking Machines, which has just added two prominent ex-OpenAI advisors, Bob McGrew and Alec Radford, who were both formerly researchers at OpenAI.

So, yeah, quite related or similar to safe superintelligence in that not a lot has been said as to what they're working on, really as to much of anything, but they are seemingly raising funds.

over 100 million, and are recruiting, you know, the top talent you can get, essentially. Yeah, I mean, like, I don't know how Mira has done this. I don't know the details. I mean, she was very well respected at OpenAI. I do know that. And John Schulman, she's recruited. Obviously, we talked about that. He's their chief scientist. Barret Zoph, who used to lead model post-training at OpenAI, is the CTO now.

So like it's a pretty stacked deck. And if you add as an advisor Alec Radford, that is wild. Like to see Alec's departure from OpenAI, even though he had been there for like a decade or whatever it was. As a reminder, right, he is the GPT guy. He did a bunch of other stuff too, but he was, you know, one of the lead authors. Yeah, he was one of the lead authors of the papers on GPTs, as you said.

Exactly. Yeah. And just kind of known to be a, you know, people talk about the 10X software engineer or whatever. Like he was like the 1,000X AI researcher, right?

to the point where people were using him as the metric for like, when we'll automate AI research. Like, I think it was Dwarkesh Patel on his podcast. So when are we going to get, you know, 10,000 automated Alec Radfords or whatever? That was kind of his bar. So yeah, truly like an exceptional researcher. And so it was a big deal when he said like, hey, I'm leaving OpenAI. He is still, as I recall, he was leaving the door open for collaboration with OpenAI as part of his kind of third party entity that he's formed.

So presumably he's got crossover relationships between these organizations and presumably those relationships involve support on the research side. So he may be one of the very few people who have direct visibility in real time into multiple frontier AI research programs. God, I hope that guy has good cybersecurity, physical security and other security around him because would that be an interesting, that'd be an interesting target.

Next up, we got a story not related to chatbots, but to humanoid robots. The story is that Hugging Face is actually buying a startup that builds humanoid robots. This is Pollen Robotics. They have a humanoid robot called Reachy 2, and apparently Hugging Face is planning to sell and open it for developer improvements. So...

Kind of an interesting development, Hugging Face is sort of a GitHub of models. They host AI models and they have a lot to do with open source. So this is building on top of a previous collaboration where Hugging Face released LeRobot, an open source software.

robot, and also released a whole software package for doing robotics, you know, building on top of that. And yeah, I don't know, an interesting thing for Hugging Face to do, I would say. Yeah, I saw this headline and my first reaction was like, what the fuck? When you think about it,

it can make sense, right? So the classic play is we're going to be the app store for this hardware platform. And that's really what's going on here, presumably. They think that humanoid robotics is going to be something like the next iPhone. And so essentially, this is a commoditizer compliment play. You have the humanoid robot, and now you're going to have an open source sort of suite of software that increases the value of that humanoid robot over time and for free.

at least for you as the company. So Hugging Face is really well positioned to do that, right? I mean, they are the GitHub for AI models. There's no other competitor really like them. So the default place you go when you want to do some, you know, AI open source stuff is Hugging Face. It kind of makes sense. Remains to be seen how good the platform will be. Like Pollen Robotics,

I'm not going to lie, I hadn't heard of them before. They are out there and they were acquired. So, I mean, it'll be interesting to see what they can actually do with that platform and how quickly they can bring products online. And last story for the section: Stargate developer Crusoe apparently could spend $3.5 billion on a Texas data center.

This is on the AI startup Crusoe, and the detail is apparently not only are they going to be spending this amount of money, they're going to be doing that mostly tax-free. They are getting an 85% tax break on this billions-of-dollars project. So I guess a development on Stargate and just showing the magnitude of business going on here.

Yeah, the criterion for qualifying for the tax break is for them to spend at least $2.4 billion out of a planned $3.5 billion investment, which, I mean, I don't think is going to be a problem for them looking at how this is all priced out. They've since registered two more data center buildings with a state agency, so we know that's coming. We don't know who the tenant is going to be for one of those buildings; Oracle, of course, is known to be listed for the other. So important, maybe, to

context, if you're new to the data center sort of space or universe, what's happening here is you've essentially got, there's a company that's going to build the physical data center that is Crusoe.

But there are no GPUs in the data center. They need to find what's sometimes known as a hydration partner or like a tenant, someone to fill it with GPUs. And that's going to be Oracle in this case. So now you've got Crusoe building the building. You've got Oracle filling it with GPUs. And then you've got the actual user of those GPUs, which is going to be OpenAI because this is the Stargate project. And on top of that, there are funders who can come in. So Blue Owl is a private credit company that's

lending a lot of money. JP Morgan is as well. So you've got, this is, you know, it can be a little dizzying, but you have, you know, Blue Owl and JP Morgan funding Crusoe to build data centers that are going to be hydrated by Oracle and served to OpenAI. That is the whole stack. So when you see headlines where it's like, wait, I thought this was an OpenAI data center or whatever, that's really what's going on here. There's all kinds of discussion around, well, look, this build looks like it's going to create like 300 to 400 new full-time jobs

with about $60,000 worth of minimum salaries. That at least is part of the threshold for these tax breaks. And people are complaining that, hey, that doesn't actually seem like it's that much to justify the enormity of the tax breaks that are going to be offered here. I just think I would offer up that the

employment side is not actually the main value add here, though. Like this is first and foremost should be viewed as a national security investment, much more than a like a jobs and economic investment, or I should say as much as an economic investment. But that's only true as long as these data centers are also secured, right? Which at this point, frankly, I don't believe they are. But bottom line is, it's a really big build.

There's a lot of tax breaks coming and a lot of partners are involved. And in the future, if you hear, you know, Blue Owl and JP Morgan and Crusoe and all the rest of it, this is the reason why. Moving on to projects and open source, we start with a paper and a benchmark from OpenAI called BrowseComp.

And this is a benchmark designed to evaluate the ability of agents to browse the web and retrieve complex information. So it has 1,266 fact-seeking tasks where the agent, the model, equipped to do web browsing, is tasked with finding some information and retrieving it. And apparently it's pretty hard. Just base models, GPT-4o,

not built for this kind of task, are pretty terrible. They get 1.9% when allowed to

do this, and 0.6% if not allowed to browse at all. And Deep Research, their model that is optimized for this kind of thing, is able to get 51.5% accuracy. So a little bit of room to improve on, I guess, finding information by browsing. Yeah. And this is a really carefully scoped benchmark, right?

Right. So we often see benchmarks that combine a bunch of different things together. You know, think about SWE-bench Verified, for example. Yes, it's a coding benchmark, but also, depending on how you approach it, you could do web search to support you in generating your answers. You could use a lot of inference-time compute. What capabilities you're actually measuring there are a bit ambiguous. And so in this case, what they're trying to do is explicitly get rid of

other kinds of skills. So essentially what this is doing is, yeah,

avoiding problems like generating long answers or resolving ambiguity. That's not part of what's being tested here. Just focusing instead on: can you persistently follow an online research trajectory and be creative in finding information? That's it. The skills that you're applying when you're Googling something complex, that's what they're testing here. And they're trying to separate that from everything else. They give an example.

Here's one. So, please identify the fictional character who occasionally breaks the fourth wall with the audience, has a backstory involving help from selfless ascetics, is known for his humor, and had a TV show that aired between the 1960s and 1980s with fewer than 50 episodes. Right. So this is like, really, you would have to Google the shit out of this to figure it out. And that's the point of it.

They set it up explicitly so that current models are not able to solve these questions. That was one of the three core criteria that they used to determine what would be included in this benchmark. The other two were that trainers were supposed to try to perform simple Google searches to find the answer, just like five times, basically. And if the answer was not

on any of the first pages of search results, they're like, great, let's include that. It's got to be hard enough that it's not, you know, trivially solvable. They also wanted to make sure that it's like harder than a 10 minute task for a human, basically. So the trainers who built this data set made sure that it took them at least 10 minutes or more to solve the problem. So

Yeah, pretty interesting benchmark. Again, very narrowly scoped, but in a way that I think is pretty conducive to pinning down one important dimension of AI capabilities. And they do show scaling curves for inference time compute. No surprise there. More inference time compute leads to better performance. Who knew? Right. And as you said, narrowly scoped and meant to be very challenging. They also have some data on the trainers of the system who presumably rated the answers of AI

The human trainers were also kind of tasked with doing the benchmark themselves. And on 70% of the problems, humans gave up after two hours. They just couldn't finish a task. And then they have a little distribution on the tasks that they could solve.

The majority took about two hours. You got some, like a couple dozen, maybe a hundred, taking less than an hour. The majority takes over an hour. And on the high end, there's just one data point at four hours. So yeah, you have to be pretty capable at web browsing, it seems, to be able to answer these questions.

Next story is related to ByteDance. They're announcing their own reasoning model, Seed Thinking v1.5. And they are saying that this is competitive with all the other recent reasoning models, competitive with DeepSeek R1. They released a bit of technical information about it. They say that this is

optimized via RL, similar to DeepSeek R1. And it is fairly, I guess, fairly sizable. It has 200 billion parameters total, but it is also a mixture-of-experts model. So it's only using 20 billion parameters at a time. And they haven't said whether this will be released openly or not, really just kind of announced the existence of the model.

Yeah, the stats look pretty good. It seems like another legit entry in the canon. I think right now we're waiting for labs to come out with ways to scale their inference time compute strategies such that we see them use their full fleet fully efficiently. Once we do that, we're going to get a good sense of where the US and China stack rank relative to each other. But I think we're just kind of along that scaling trajectory right now. We haven't quite seen

We haven't quite seen the full scale brought to bear that either side can. One little interesting note, too, is this is considerably, I mean, it's about twice as activated-parameter dense as DeepSeek V3 or R1. So with V3, R1, you see 37 billion activated parameters per token out of about 670 billion. So it's like about one in 20 parameters are activated for each token. Here, it's about one in 10. So you're seeing, in a way, a more dense model, which is kind of interesting, but

All of this is sort of building on the results from V3 and R1. So always, always interesting to see what the architecture choices are. I guess we'll get more information on that later, but that's an initial picture. So they actually ended up coming up apparently with a new version of the AIME benchmark as well as part of this. So AIME is that...

kind of math Olympiad problem set that has been somewhat problematic for data leakage reasons, for other reasons as well. So they kind of came up with a curated version of that specifically for this, and they call that BeyondAIME. Anyway, so on that benchmark, they show their model outperforms DeepSeek R1. It outperforms DeepSeek R1 basically everywhere except for SWE-bench.

So that's definitely impressive. I'm actually kind of surprised. Like, I would have thought SWE-bench would have been one of those places where you could, especially with more compute, which I presume they have available now, I would have imagined that that specifically would translate well into SWE-bench, because those are the kinds of problems that you can RL the crap out of, you know, these like coding problems.

So anyway, yeah, kind of interesting. The benchmarks clearly show it's not as good as Gemini 2.5 Pro or o3-mini high, but it definitely is closing the gap. I mean, on ARC-AGI, by the way, and I...

I find this fascinating and I don't have an explanation for it until we have more technical data about the paper itself. Like, it outdoes not just R1, but Gemini 2.5 Pro and o3-mini high, supposedly, on ARC-AGI. That's kind of interesting. That's a big deal, but could always be an artifact of some weird over-optimization. Because again, on all the other

benchmarks that they share here, it's quite far behind. So, or not quite far behind, but it is somewhat far behind Gemini 2.5 Pro, for example. So anyway, kind of an interesting note and we'll presumably learn more as time goes on. Right. They also released a 10-page technical report going into the details of the training and the dataset. So a pretty decent amount of information, which is refreshing compared to things like, you know.

Something I was not aware of, ByteDance had the most popular chatbot app as of

last year. It's called Doubao. And recently, Alibaba kind of overtook them with an app called Quark. So yeah, I wasn't aware that ByteDance was such a big player in the AI chatbot space over in China, but it makes sense that they're able to compete pretty decently on the space of developing frontier models.

Next up, moving on to research and advancements. The first paper is titled Sample, Don't Search: Rethinking Test-Time Alignment for Language Models. This is introducing QAlign, which is a new test-time alignment method for language models that makes it

possible to align better without needing to do additional training and without needing to access the specific activations or the logits. You can just

sample the outputs, just the text that the model spits out, and you are able to get it more aligned, meaning more kind of reliably following what you want it to do, by just scaling up compute at test time, without being able to access weights and without doing any sort of training. I found this a really fascinating paper, and it teaches you something quite interesting about what's wrong with current models'

kind of fine-tuning and sampling approaches. So funnily enough, the optimal way to make predictions is known, right? We actually know what the answer is to, like, build AGI. Great. We can all go home, right? No, this is Bayes' theorem, right? The Bayesian way of making predictions, making inferences, is mathematically optimal. At least if you, you know, if you believe all the great textbooks, like The Logic of Science and, you know, E.T. Jaynes-type stuff, right? So

The challenge is, the actual Bayesian update rule, which takes prior information, like prior probabilities, and then essentially accounts for evidence that you collect to get your posterior probability,

is not being followed in the current hacky, janky way that we inference on LLMs. And so the true thing that you want to do is you want to take the probability of generating some output based on your language model, like just the probability of a given completion, given your prompt,

And you really want to, like, you kind of want to multiply that with an exponential factor that in the exponent scales with the reward function that you want to kind of update your outputs according to. So if, for example, you want to assign really high rewards to a particular kind of output, then what you should want to do is take the sort of tendencies of your initial model and then multiply them by the reward function

weighting, essentially, the E to the power of the reward, something like that. And by combining those two together,

you get the optimal Bayesian output, very roughly. There's, anyway, a normalization coefficient that doesn't matter, but you have those two factors. You should be accounting for your base model's initial proclivities, because it's learned stuff that you, anyway, for Bayesian reasons, ought to be accounting for. But what they say is actually, typical search-based methods, like best-of-N, fundamentally ignore the probability assignments of the base model.
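Written out in our own notation, a rough rendering of the combination being described (where r is the reward, beta a scaling coefficient, and Z the normalization constant that, as noted, doesn't matter for sampling):

```latex
\pi(y \mid x) \;=\; \frac{1}{Z}\, p_{\mathrm{LM}}(y \mid x)\, \exp\!\big(\beta\, r(x, y)\big)
```

Best-of-N-style search effectively keeps only the exp(beta * r) factor and drops the p_LM term, which is exactly the complaint here.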

They focus exclusively on the reward function. You basically generate a whole bunch of different potential outputs according to the base model, and from that point on, all you do is go, okay, which one of these gives me the highest reward, right? You do something like that. And that causes you to throw away, from that point on, everything your base model actually knows about the problem set. And what they're observing mathematically is that that is just a bad idea. And so they ask the question: can we sample from

our base model in a way that yes, absolutely accounts for the reward function that we're after, but also that accounts for what our initial language model already knows. And for mathematical reasons, the one approach that ticks this box that does converge on this kind of Bayesianly optimal approach, it looks something like this. So

you start with a complete response, get your initial LLM to generate your output, right? So maybe something like "the answer is 42 because of calculation X." You have a math problem and it says the answer is 42 because of calculation X. Then you're going to randomly select a position in that response, so for example the third token, right? You have "the answer is," and you're going to keep the response up to that point.

But then you're going to generate a new completion from that point on, just using the base language model. So here you're actually using your model again to get it to generate something else, usually with high-temperature sampling so that the answer is fairly variable. And that gives you a full candidate response, an alternative, right? So maybe now you get "the answer is 15" based on some different calculation.

And they have a selection rule for calculating the probability with which you accept either answer. And it accounts for the reward function piece. So which of those alternate answers is scored higher or lower by the reward. This is a way of basically injecting your LLM into that decision loop and accounting for what it already knows. It's pretty detailed or not pretty detailed, pretty nuanced. You almost need to see it written out.

But the core concept is simple. During sampling, you want to use your LLM. You don't want to just set it aside and focus exclusively on what the reward function says, because that can lead to some pretty pathological things, like just over-optimizing for the reward metric. And that ends up leading to reward hacking and other things. So from a Bayesian standpoint, this is just a much, much more robust way of doing it. And they demonstrate that, indeed, this leads to better inference scaling on math benchmarks like GSM8K. So I thought it was a pretty interesting paper from a very fundamental standpoint, giving us some insights into what's wrong with current sampling techniques as well.
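To make that loop concrete, here is a minimal sketch of the Metropolis-Hastings-style resampling being described, assuming a generic generate and reward interface. The function names, the character-level cut point, and the simple acceptance rule are our illustrative assumptions; the actual QAlign algorithm works at the token level and has more careful details.

```python
import math
import random

def qalign_style_sample(prompt, generate, reward, beta=1.0, steps=100):
    """Illustrative sketch of the test-time alignment loop described above.

    Assumed interfaces (ours, not the paper's code):
      generate(prompt, prefix) -> a full completion beginning with `prefix`,
          sampled from the base LM, ideally at fairly high temperature.
      reward(text) -> scalar score from a reward model.
    """
    current = generate(prompt, prefix="")      # initial complete response
    current_reward = reward(current)

    for _ in range(steps):
        # Keep a random prefix of the current response and let the base
        # model regenerate the rest, so its own probabilities stay in play.
        cut = random.randrange(len(current) + 1)
        candidate = generate(prompt, prefix=current[:cut])
        candidate_reward = reward(candidate)

        # Metropolis-style acceptance: always keep a higher-reward candidate,
        # keep a lower-reward one with exponentially decaying probability.
        # The chain then targets, roughly, p_LM(y|x) * exp(r(y)/beta),
        # rather than collapsing onto whatever the reward scores highest.
        delta = (candidate_reward - current_reward) / beta
        accept = 1.0 if delta >= 0 else math.exp(delta)
        if random.random() < accept:
            current, current_reward = candidate, candidate_reward

    return current
```

The design point to notice is the acceptance step: the base model's preferences enter through the completions it proposes, so the reward signal refines the model's own distribution instead of overriding it.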

Right, yeah, and they base this method on, or build on top of, a pretty recent work from last year called QUEST, titled Quality-Aware Metropolis-Hastings Sampling for Machine Translation, which is just to say that

You know, it's a slightly more theoretical or mathy kind of algorithmic type of contribution building on, let's say, lots of equations. If you look at the paper, it's going to take you a while to get through it if you're not sort of deep in the space. But it does go to show that, you know, there's still room for algorithmic stuff for research beyond just big model good. You know, lots of weights make for smart model.

The next paper is called Concise Reasoning via Reinforcement Learning. So one sort of phenomenon we've discussed before, since the rise of reasoning models, first with O1 and then with DeepSeek R1, is that the models tend to do better when you do additional computation at test time, when you do test-time scaling. It also seems that we're not at a point where this is at all optimized. Often the models seem to do too much computation,

more than is necessary. And so this paper is looking into how to optimize the amount of output from the model while still getting the correct answer. And the basic idea is to add a second stage to the training of the model. So after you train it to be able to solve reasoning problems, same as you did with R1, they suggest having a second phase of training where you enforce conciseness while maintaining or enhancing accuracy. And they show that you're able to actually do that, more or less.

Yeah, this is another, I think, really interesting conceptual paper. So the motivation for it comes from this observation of a couple of contradictory things, right? So first off, test time, inference time scaling is a thing. So it seems like the more inference time compute we pour into a model, the better it performs. So that seems to suggest, okay, well, like, you know, more tokens generated seems to mean higher accuracy. But if you actually look at a specific model, right,

Quite often, the times when it uses the most tokens are when it gets stuck in a rut. It'll get locked into these, I'm trying to remember the term they use here, but like these dead ends, right? A state from which reaching a correct solution is improbable, right? So you talk yourself, you paint yourself into a corner type thing. So they construct this really interesting theoretical argument that seems pretty robust. They demonstrate that if getting the right answer is going to be really, really hard for your model, and you set the reward time horizon for your model to be fairly short, so that essentially the model does not look ahead very far, it's focused on the near term, in RL terms a lambda parameter less than one in this case, then what you find is that the model almost wants to, like,

put off or delay getting that negative reward. If it's a really hard problem, it will tend to just write more text and write more text and kind of procrastinate, really. This is one of the fun details: the reinforcement learning algorithm's loss itself favors longer outputs. The model is encouraged to keep talking and talking, especially when it is unable to solve a task. So if it's able to solve a task quickly, it gets more positive reward and it's happy. If it isn't able to solve the task, it'll just keep going and going. Exactly. And that's it. So the sign kind of flips, if you will, the moment that

the reward is anticipated to be positive, or let's say when the model actually has a tractable problem before it. And so you have this funny situation where solvable problems create an incentive for more concise responses, because in a way the model is going, oh yeah, I can taste that reward, I want to get it, you know. Whereas if you know you're going to get slapped once you finish your marathon, well, you're going to move pretty slowly.

But if you know you're going to get a nice slice of cake, maybe you run the marathon faster. That's kind of what's going on here, not to overdo the analogy too much. But that is something that is almost embarrassing, right? Because it drops out of the math. It's not even an empirical finding. It's just like, hey, guys, did you realize that you were, not deliberately, but explicitly through the math here, incentivizing your models to do this thing that is deeply counterproductive? And so when they fix that,

all of a sudden they're able to so dramatically decrease the response length relative to the performance that they see. And they show some really interesting scaling curves, including one that shows an inverse correlation between the length of a response and the improvement of the quality of the response, which is sort of interesting. So yeah, I thought this was a really, really interesting, I mean, it makes you think of

like the conciseness of a model as really a property of a given model that can vary from one model to another and a property that's, yeah, determined in part by the training data. This is where this idea of that secondary stage of training becomes really important. They have an initial step of RL training. It's just like, you know, the general idea

I guess, you know, whatever, DeepSeek R1, O1, O3 type reasoning stuff. But then you include a training step after that that explicitly contains solvable problems, to kind of polish off your model and make sure that the last thing it's trained on is problems that it wants to solve concisely.

And so those are, by the math, going to be problems that are actually tractable. And there you go. So I thought it was a really fascinating and sort of embarrassingly simple observation about the incentives that we're putting in front of these RL systems.
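To put rough numbers on that incentive, here is a tiny back-of-the-envelope illustration. The single terminal reward, the lambda value, and the response lengths are our toy assumptions for illustration, not figures from the paper.

```python
# Toy numbers for the incentive described above (our simplification: a single
# reward at the very end of the response, with credit assignment discounted
# back to the first generated token via a GAE-style lambda < 1).
lam = 0.95

for terminal_reward in (+1.0, -1.0):
    print(f"terminal reward = {terminal_reward:+.0f}")
    for length in (50, 200, 800):
        credit_at_first_token = (lam ** (length - 1)) * terminal_reward
        print(f"  length {length:4d} tokens -> credit at first token: "
              f"{credit_at_first_token:+.2e}")

# With reward = -1, longer responses shrink the penalty toward zero, so the
# model "procrastinates" on problems it can't solve; with reward = +1, longer
# responses shrink the payoff, so solvable problems favor concision.
```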

Yeah, and the technique also is very successful. For the bigger variant of R1 they test, a 7-billion-parameter model, you can get a 40% reduction in response length while maintaining or improving accuracy. And, you know, they presumably don't have the computational budget to do this optimally, so you could presumably do even better, optimize further to spit out fewer tokens while still getting the right answer. So a very practical, useful set of results here. A few more stories. First, we have Going Beyond Open Data: Increasing Transparency and Trust in Language Models with OLMoTrace. So the idea is pretty interesting. You're able to look at

what in the training data of a model influenced it to produce a certain output. In particular, it allows you to identify spans of a model's output that appear verbatim in the training data. This is supporting the OLMo models, which we talked about a little while ago. These are completely, like, the most open models you can get

out there on the market. And so you can use it against those models and their pretty large training dataset of billions of documents, trillions of tokens. It seems like a software advance, but it's really a systems advance. The core of it is, you can imagine, if you wanted to figure out, okay, my LLM just generated some output, what is the text in my training corpus that was

like the most similar to this output, or that contained long sequences of words that most closely match this output, that's a really computationally daunting task, right? Because now, for every language model output that you've produced, you've got to go to your entire fucking training set and be like, okay, are these tokens there? Are these tokens there? You know, how much overlap can I find

on a kind of perfect matching basis. And what they're doing is actually trying to solve that problem. And they do it pretty well and efficiently. So you can see why this is really an engineering challenge as much as anything. So at the core of this idea is this notion of a suffix array. It's a data structure that stores all the suffixes of a text corpus in alphabetically sorted order, right? So if you have the word banana, the suffixes are banana, anana,

nana, ana, na, and a, all sorted alphabetically. So if you imagine doing that for your whole training corpus,

and, like, an LLM output that said "the cat sat on a bench," what you're trying to do is set up suffix arrays that have, you know, all the different chunkings of that text. And then you want to cross-reference those together. And by setting them up in a principled way, with the alphabetical ordering in the suffix array,

you're able to use binary search. So anyway, if you know binary search, then you know why this is exciting. It's a very, very efficient way of searching through an ordered list, right? And you can only do it if your data is in the right format, which is what they're doing here. But once you do that, now you have a really efficient way of conducting your search. And so they're able to do that, a binary search, across the training corpus.

And then on the other side, in terms of the language model outputs, they're able to massively parallelize the search process to handle many, many outputs all at the same time, which again amortizes the cost significantly. And so overall, just much better scaling properties for the search function. And it leads to some pretty interesting and impressive outputs. Again, imagine

you see the output that your language model provides and you're just like, all right, well, what's the piece of text in the training corpus that overlaps word for word most closely with different sections of this output?
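To give a flavor of the suffix array plus binary search idea, here is a toy, in-memory, word-level sketch. The function names and the corpus are made up for illustration; the real system indexes trillions of tokens and is engineered very differently, but the core trick, sorted suffixes plus binary search for exact spans, is the same.

```python
import bisect

def build_suffix_array(tokens):
    # Indices of all suffixes of the corpus, sorted lexicographically.
    # Fine for a toy corpus; real systems build this far more cleverly.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def longest_verbatim_prefix(corpus, suffix_array, query):
    """Length of the longest prefix of `query` appearing verbatim in `corpus`,
    found by binary search over the suffix array.
    (Requires Python 3.10+ for bisect's key= argument.)"""
    best = 0
    for n in range(1, len(query) + 1):
        span = query[:n]
        lo = bisect.bisect_left(suffix_array, span,
                                key=lambda i: corpus[i:i + n])
        if lo < len(suffix_array) and corpus[suffix_array[lo]:suffix_array[lo] + n] == span:
            best = n   # this n-gram exists verbatim somewhere in the corpus
        else:
            break      # if length n is absent, any longer prefix is too
    return best

# Toy usage: how much of a model "output" appears verbatim in the "training data"?
corpus = "the cat sat on the mat because the mat was warm".split()
sa = build_suffix_array(corpus)
output = "the mat was warm today".split()
print(longest_verbatim_prefix(corpus, sa, output))  # -> 4 ("the mat was warm")
```

Sliding this kind of check across every position of a model's output is, roughly, how you get highlighted verbatim spans, and the binary search is what keeps it tractable at scale.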

This is especially exciting if you're concerned about data leakage, for example. You want to know, well, did my language model answer this question correctly because it's basically just parroting something that was in the training set? Or does it actually understand in some deeper way the content? So it's not a full solution to that because it could just be paraphrasing, in which case this technique wouldn't pick it up. But it's a really interesting start.

And it's part of the answer to the "are language models just stochastic parrots" question, right? If you're able to rule out that there is any text in the training data that exactly matches what it put out. Right. And I guess I should correct myself a little bit. They aren't claiming that the matches are necessarily the cause of the output. They're not computing influence functions or anything like that; they really are just providing a way to efficiently search over the massive corpus to be able to do fact-checking.

And they have a fun example in a blog post where, for some question, the model OLMo claimed that its knowledge cutoff was August of 2023. Untrue, the actual cutoff was in 2022. So then they looked at the output, and they found that in the data, there was some document from somewhere, I think in the training data, about an open-source variant of OLMo, I guess a blog post or something like that. And that got swept up in the training dataset and made the model do this kind of silly thing. So presumably also quite useful if you're a model developer

or a model user, to be able to fact-check and see the noise in your training dataset that potentially explains false outputs. Next, we've got a story from Epoch AI, one of our favorite sources of stats and just interesting metrics on AI. This one is independent evaluations of Grok 3 and Grok 3 Mini on Epoch's

benchmarks, and the short version is Grok 3 and Grok 3 Mini are really good. They are out there with Claude 3.7 Sonnet and O3 Mini, with Grok 3 Mini at low reasoning effort comparable even to the higher reasoning levels on some of these benchmarks. So it just reinforces, I guess, the general impression that we got with Grok, that it's quite good.

Yes, very well said. It is quite good. Yeah, it is actually pretty shocking, at least on AIME. I mean, Grok 3 Mini on high reasoning mode beats O3 Mini on high reasoning mode. It is literally number one in that category. That's pretty remarkable. Again, I hasten to remind people that Grok and XAI came out of nowhere. They are, what, like two years old now? This is crazy.

It's supposed to take you longer than that. But yeah, they're also more middle of the pack on others, like, for example, FrontierMath, where it's just out of the top three, so it's number four. This is a really, really solid model across the board. There's just no two ways about it. There was some debate about

how OpenAI and XAI were characterizing scores on various agentic benchmarks, just in terms of how they were sampling and whether apples-to-apples comparisons were actually happening there. This, by the way, is, I suspect, a big part of the reason why Epoch decided to step in and frame this as, as they put it, independent evaluations of Grok 3 and Grok 3 Mini, just because of all the controversy there. So they're basically coming in and saying, nope, it is in fact a really impressive model.

I mean, everybody's claiming to have the best reasoning model. I give up on like assigning one clear best. I mean, depends what you care about.

And honestly, the variation in prompting is probably going to be just as big as the variation from model to model at the true frontier of reasoning capabilities. Just try them and see what works best for you. I don't think there's one clear winner in this instance. And moving on to policy and safety, starting once again with the OpenAI lawfare drama.

OpenAI is countersuing Elon Musk. So they have filed this countersuit in response to the ongoing legal challenges from XAI that are trying to constrain OpenAI from going for-profit. And they are saying they basically want to stop Elon Musk from further "unlawful and unfair action." They claim that Musk's actions, including

a takeover bid that we covered, where he offered, what, $97 billion to buy the OpenAI nonprofit, have been part of that. And yeah, basically OpenAI is saying, there's a bunch of stuff that Elon Musk is doing, please stop him from doing this sort of stuff.

It's sort of funny, their characterization of the "fake bid." Now, we can't know what happened behind closed doors, if there were comms, if there weren't comms, of whatever nature, but certainly from the outside, I'm confused about what would make it fake. Like, was the money he was offering not real? Was it Monopoly money? He came in and offered ostensibly more money than what OpenAI was willing to pay for its own nonprofit, or for-profit subsidiary, or whatever.

It seemed pretty genuine. And so it's just odd; they would nominally have a fiduciary obligation to actually consider that deal seriously. So it's unclear to me what the claim is, what the legal grounding is. The suit is fascinating, or the original Elon suit is fascinating, by the way. We covered this back in the day, but just to remind people: so Elon sued OpenAI, of course, for trying to

essentially buy out the nonprofit. So the nonprofit currently has control over the for-profit's activities, and OpenAI essentially wanted to buy out the nonprofit and say, hey, we'll give you a whole bunch of money in exchange for you effectively giving away all your control. And you'll be able to go off and do cute charitable donation stuff. And there are people arguing, well, wait a minute, the nonprofit was set up explicitly to keep the for-profit in check, because they correctly reasoned that for-profit incentives would cause racing behavior, would cause potentially irresponsible

development practices on the security and the control side. So you can't just replace that function with money. Like, OpenAI itself does not institutionally believe that money would compensate for that. They believe they're building superintelligence, and control of superintelligence is worth way more than, you know, $40 billion, whatever they'd be paying for it. And so this is the claim, anyway. The judge on this case seems to view that argument quite favorably, by the way, that you can't just

swap out the role of the nonprofit for a bunch of money, that OpenAI's public commitments, among other things, do commit it to having some sort of function here. And at least those claims are plausibly backed and would plausibly do well in court. The main question is whether Elon has standing to represent that argument. And the question there is, did OpenAI enter into a contractual relationship with Elon, sort of

through email? Because that's really the closest thing they have to a contractual agreement about the nonprofit remaining in control and all that stuff. And that seems much more ambiguous. And so Elon right now is in this awkward position where he has what seems like a pretty solid case, that's what the judge is telegraphing here, but he may not actually be the right person, he may not have the right, to represent that case.

The attorney general might. So there's speculation about whether the judge in this case is flagging the strength of the case to get the attention of the attorney general, so the attorney general can come in and lead the charge here. But everything is so politicized, too. Elon is associated with the Republican side; California's attorney general is going to be a Democrat. So it's all a big mess. And now you have OpenAI kind of countersuing, potentially partly for the marketing value, at the very least.

But we're just going to have to see. I mean, there seems to be a case here. This seems, at the very least, to be an interesting case to be made. We saw the judge dismiss Elon's motion to, let's say, quickly rule in his favor and block the for-profit transition. I would be surprised if this initial move, like this countersuit,

would go through. I mean, I imagine there'd be a pretty high standard that OpenAI would have to meet to show that these lawsuits are frivolous. And that'd be tough, given that you now have a judge coming out and saying, well, you know, the case itself seems pretty strong, it's 50-50 whether Elon's the right guy to represent it. So, you know, anyway, it's a mess. Yeah, it's a real mess. I don't know how technical this term is, by the way, countersuing, I guess.

In the document itself that they filed, they have a bunch of counterclaims to the already ongoing case. And yeah, it makes for pretty fun reading. Just to find this one quote here. Early in the document, this is like a 60-page document, they say...

Musk could not tolerate seeing such success from an enterprise he had abandoned and declared doomed. He made it his project to take down OpenAI and to build a direct competitor that would seize the technological lead, not for humanity, but for Elon Musk.

Very much a continuation of what we've seen OpenAI doing via their blog, calling Musk out about his emails. They also posted on X with the same kind of rhetoric, saying Elon's never been about a mission. He's always had his own agenda. He tried to seize control of OpenAI and merge it with Tesla as a for-profit. His own emails prove it.

So yeah, OpenAI is definitely at least trying to go on the attack, if nothing else. Yeah, it's very kind of off-brand, or I guess it's now their new brand, but it used to be off-brand for them, right, to do this sort of thing. They had a very above-the-fray vibe to them. Sam A was sort of this untouchable character, and it does seem like they've started rolling in the mud, and man, it's, yeah, interesting. Yeah, it seems like

tactically, they really just want to embarrass Elon Musk as much as they can. So this is part of that. And the next story, also related to OpenAI, as you alluded to earlier, is covering that OpenAI seems to have reduced the time and resources allocated to safety testing of its frontier models. This is apparently related to their next-gen model, O3.

And this is according to people familiar with the process. So some insiders, presumably the safety evaluators, who previously had months, now often just have days to flag potential risks. And this kind of tracks with what we've seen come out regarding the split in 2023 between the board and Sam Altman, and generally the vibes we've been getting from OpenAI over the past year. Yeah, consistent with people that we've spoken to at OpenAI as well, unfortunately. And the reality is, I mean, this is the exact argument, by the way, that was made for the existence of the nonprofit and it explicitly controlling the activities of the for-profit. Like,

This was all foretold in prophecy. One day, there's going to be a lot of competitive pressure. You're going to want to cut corners on control. You're going to want to cut corners on security, on all the things. And we want to make sure that there is as disinterested and empowered a party as possible overseeing this whole thing. And surprise, surprise, that is the one thing that Sam A is trying to rip out right now. It's sort of interesting, right? I mean, it's almost as if Sam is trying to

solidify his control over the entity and get rid of all the guardrails that previously existed on his control. But that can't possibly be it. I mean, it's a ridiculous assertion. Anyway, yeah, some of the quotes are pretty interesting. "We had more thorough safety testing when the technology was less important," this from one person who's right now testing the upcoming O3 model. Anyway, all kinds of things like that. So

Yep. No, no particular surprise. I want to say this is like pretty sadly predictable, but another reason why you got to have some kind of coordination on this stuff, right? Like you can't, if AI systems genuinely are going to have WMD level capabilities, then

You need some level of coordination among the labs. There is no way that you can just allow industry incentives to run fully like rampant as they are right now. You're going to end up with like some really bad outcome. Like people are going to get killed. That's a pretty easy prediction to make.

under the nominal trajectory, if these things develop, you know, the bioweapon, the cyber offensive capabilities and so on, like that's just going to happen. So the question is like, how do you prevent these dynamics, these racing dynamics from playing out in the way that they obviously are right now at OpenAI, I will say. I mean, it is very clear from talking to people there. It's very clear from seeing just the objective reports of like how quickly these things are being pumped out, the amount of data we're being given on the kind of testing side.

It's unfortunate, but it's where we are. And next, yet another story about OpenAI, kind of related to that concern.

The story is that ex-OpenAI staffers have filed an amicus brief in the lawsuit that is seeking to make it so OpenAI cannot go for profit. So amicus brief is basically like, hey, we want to add some info to this ongoing lawsuit and give our take. And so this is coming from a whole bunch of employees that have been at the company between 2018 and 2024.

There's Steven Adler, Rosemary Campbell, Neil Chowdhury, and like a dozen other people who were in various technical positions, researchers, research leads, policy leads. And the gist of the brief is, you know, that OpenAI would go against its original charter were it to go for-profit, and it should not be allowed to do that. And it, you know, mentions some things like, for instance,

OpenAI potentially being incentivized to cut corners on safety and to develop powerful AI that is concentrated for the benefit of its shareholders, as opposed to the benefit of humanity. So the basic assertion is that OpenAI should not be allowed to undertake this transition; it would go against the founding charter and, I guess, the policies set out for OpenAI.

Yeah, and one of the big things that they're flagging, right? So if OpenAI used its status as a non-profit,

to reap benefits, let's say, that it's now going to cash out by converting to a for-profit, that itself is a problem. And one of the things that's being flagged here is like recruiting, right? Recruitment. The fact that they were a nonprofit, the fact that they had this very distinct bespoke governance structure that was designed to handle AGI responsibly was used as a recruiting technique. I know a lot of people who went to work at OpenAI because of those commitments. Many of them have since left.

But there's a quote here that makes that point, right? In recruiting conversations with candidates, it was common to cite OpenAI's unique governance structure as a critical differentiating factor between OpenAI and competitors such as Google or Anthropic. And as an important reason, they should consider joining the company. The same reason was also used to persuade employees who were considering leaving for competitors to stay at OpenAI, including some of us, right? So this is like...

Not great, if you have a company that is actually using the fact of being a nonprofit at one time and then kind of cashing that out and turning into a for-profit. So, you know, without making any comments about the competitors: Anthropic has a different governance structure, they're a public benefit corporation, but with a kind of oversight board. XAI is just a public benefit

corporation, which really all that does is give you more latitude, not less. It sounds like it's just a positive, but it's complicated. It doesn't actually tie your hands. It gives you the latitude to consider things other than profit when you're, you know, acting as a director of the company. Really, you're just giving yourself more latitude. So when OpenAI says, oh, don't worry, we're going to go to a public benefit corporation model,

it sounds like they're switching to something that is still constrained, or, you know, motivated by some public interest. But the legal reality of it, as I understand it at least, is that it's just going to give them more latitude. So they can say, oh yeah, we're going to do X, Y, or Z, even if X, Y, or Z isn't profit-motivated. It doesn't mean that they have to do specific things, really.

I guess, unless there's some kind of additional legal context around that. Anyway, bottom line is, I think it's actually a pretty dicey situation from everything I've seen. It's not super clear to me that this conversion is going to be able to go ahead, at least as planned. And the implications for the SoftBank investment, for all the tens of billions of dollars that OpenAI has on the line, are going to get really interesting. Yeah.

Quite the story, certainly a very unique situation. And as you said, I think I'm a little surprised. I thought OpenAI might be able to just, you know, not really be challenged in this lawsuit, but it seems like it may actually be a real issue for them. And one more story about OpenAI. It just so happens that they are dominating this section, this episode.

They are coming out with an ID system for organizations to have access to future AI models via its API. So there's this thing called Verified Organizations.

They require you to have a government-issued ID from a supported country to be able to apply. Looking at their support page, I actually couldn't see what else is required to be verified. On this page, they say,

"Unfortunately, a small minority of developers intentionally use the OpenAI APIs in violation of our usage policies." And they're adding this verification process to mitigate unsafe use of AI while continuing to make advanced models available to developers, and so on. So it seems like they want to prevent misuse, or presumably also competitive behavior by

other model developers out there. I don't know. Seems like an interesting development. Yeah. It looks like a great move from OpenAI actually to, yeah, it's on this continuum. Like I remember a lot of debate in Silicon Valley around like, let's say like 2019, especially in the, like the YC community, people are trying to figure out like, how do you, how do you strike this balance between privacy and

verifiability, and, you know, where things are going with bots and all that stuff. This is kind of shading into that discussion a little bit. And it's an interesting strategy, because you're going at the organizational level, not the individual level. It does take a valid government-issued ID from a supported country, so a couple of, you know, implied filters there. And then each ID is limited to verifying one organization every 90 days. So it all kind of intuitively makes sense, but

Not all companies or entities are eligible for this right now. They say they can check back later. But yeah, so interesting kind of another axis for OpenAI to try their staged releases where they're like, you know, first we'll release a model to this subpopulation, see how they use it, then roll it out. This seems like a really good approach and actually a pretty cool way to balance some of the misuse stuff with the need to get this in the hands of people and just build with it.

And one last story. The title is "Meta Whistleblower Claims Tech Giant..." Oh, this is a long title.

Anyway, the gist of it is there's a claim that Meta... Yeah, some of them. Fortune, I find, is just annoyingly wordy. But anyway, the claim is that Meta aided in development of AI for China in order to curry favor and be able to build

there, and apparently they make quite a lot of money. This is from former Facebook executive Sarah Wynn-Williams. She just released a book that has a bunch of alleged details from when she was working in a high-profile role there from 2011 to 2017. And in this book, and in testimony to the Senate Judiciary Committee, she said that that's what Meta did. Yeah, and Senator Josh Hawley sort of led the way on a lot of this investigation and had some really interesting clips on X that he was sharing around. But yeah, it does seem pretty, I'll say, consistent with some things that I had been hearing about, let's say the use of Meta's open source models, and potentially

Meta's attempts to hide the fact that these were being used for the applications they were being used for, things that, let's say, would not look great in exactly this context. Those were different from this particular story, but very consistent with it.

One of the key quotes here is, "During my time at Meta," she says, "company executives lied about what they were doing with the Chinese Communist Party to employees, shareholders, Congress and the American public." So it remains to be seen. Are we going to see Zuck dragged out to testify again and get grilled? I mean, there's hopefully going to be some follow-on if this is true. I mean, this is pretty, pretty wild stuff. And then, Meta used, quote, "a campaign of threats and intimidation" to silence Sarah Wynn-Williams, the one who's testifying here. That's what Senator Blumenthal says. And anyway, so she was a very senior director of global public policy, apparently all the way from 2011 to 2017. So, long tenure, very senior role. And this predates, right, the whole Llama period. This is way before that.

And certainly, like, I mean, like, anecdotally, I've heard things from behind the scenes that suggest that that practice may be ongoing if the people I've spoken to are to be believed.

So anyway, this is pretty, pretty remarkable, if true. Apparently, Meta's coming back and saying that Wynn-Williams' testimony is, quote, "divorced from reality and riddled with false claims." "While Mark Zuckerberg himself was public about our interest in offering our services in China, and details were widely reported beginning over a decade ago, the fact is this:"

Excuse me. We do not operate our services in China today. And I will say, I mean, that's only barely true, isn't it? Because you do build open source models that are used in China and that for a good chunk of time did represent, again, at least according to people I've spoken to, basically the frontier of model capabilities that Chinese companies were building on. No longer the case now.

But certainly you could argue that Meta did quite a bit to accelerate Chinese kind of domestic AI development.

I think that you could have nuanced arguments that go every which way there, but it's sort of an interesting, very complex space. So this is all in the context, too, where we're talking about Meta being potentially broken up. There's an antitrust trial going on. The FTC is saying basically we want to potentially rip Instagram and WhatsApp away from Meta. That would be a really big deal. So anyway, it's interesting.

It's hard to know who's saying what. There is a book in the mix, so money is being made on this, but it definitely would be a pretty big bombshell if this turns out to be true. Yeah, not too many details as to AI specifically. From what I've read of the quotes, it seems there was a mention of a high-stakes AI race, but beyond that, it's just more generally about the communications with the Communist Party that the executives had. And, you know, it wouldn't be surprising if they were trying to be friendly and do what they could to get support in China. For sure. I just want to add, for context, what I've mentioned about, like,

for like other sources of information along these lines. I haven't seen anything firsthand. And so I just want to like call that out, but it would be, yeah, it just would be consistent with this generally if it's to be believed. So just to sort of like throw that caveat in there, a lot of, yeah, a lot of questions about a lot of different companies, obviously in the space, but Meta has been one, I think justifiably, if this is true to receive a lot of scrutiny.

And that is our last story. Thank you for listening to this episode of Last Week in AI. As always, we appreciate it if you leave a comment somewhere. You can go to Substack, YouTube, leave a review on Apple Podcasts. Always nice to hear your feedback or just share it with your friends, I guess, without letting us know. But either way, we appreciate you listening and please do keep tuning in.

♪♪

♪♪ ♪♪

From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.