
#205 - Gemini 2.5, ChatGPT Image Gen, Thoughts of LLMs

2025/4/1

Last Week in AI

People
Andrey Kurenkov
Jeremie Harris
Topics
Andrey Kurenkov: I think Gemini 2.5 is an impressive AI model that has achieved notable success across a wide range of benchmarks, demonstrating strong reasoning and multimodal capabilities. It can complete all kinds of tasks quickly and accurately, including coding, writing, and problem solving, which largely exceeded people's expectations. In addition, Gemini 2.5 has a 1 million token context window, and Google plans to expand that to 2 million tokens soon, which will greatly improve the model's ability to handle long texts. Overall, Gemini 2.5 represents an important leap in the development of large language models; it shows strong capabilities across many areas, and its multimodal features open up huge possibilities for future applications. Jeremie Harris: OpenAI integrating GPT-4o's image generation capabilities into ChatGPT is an exciting breakthrough. Unlike traditional diffusion models, this approach uses a multimodal model that can directly process text and images and generate high-quality images. The model excels at image editing, generating complex scenes, and precisely following prompts, and its image quality is significantly better than previous text-to-image models. In addition, OpenAI is about to close a $40 billion funding round, which would be one of the largest funding rounds in history. That shows investors' confidence in OpenAI's technology and its future. At the same time, OpenAI has adjusted its leadership structure: Sam Altman will focus more on the company's technical direction, while Brad Lightcap will take on more operational responsibilities. These changes may signal that OpenAI will focus even more on technical innovation and product development going forward.


Hello and welcome to the Last Week in AI podcast where you can hear us chat about what's going on with AI. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news and you can go to the description of the episode for all the timestamps, all the links and all that. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup.

And I'm your other host, Jeremie Harris. I'm at Gladstone AI doing AI national security stuff. This has been a crazy, hectic day, a crazy, hectic week. So I'm going to say right off the bat, there's a big, big Anthropic story that I have not yet had the chance to look at.

Andre, I know you've done a bit of a dive on it, so I'm going to maybe punt my thoughts on it to next week. But yeah, this has just been a wild week. It's been a wild week. For the last couple of weeks, we had slightly quieter stuff, nothing huge. And then this week, multiple huge stories coming out and really being surprising and actually quite a big deal. I think not since...

Grok 3 and that slate of models, Claude 3.7, have we had a week that was this big. So it's an exciting week. And we're going to probably dive straight into it. Let me give a quick preview of what we will be talking about. Tools and apps, we have Gemini 2.5 coming out and kind of steamrolling everyone's expectations, I would say.

And we have image generation from GPT-4o by OpenAI, similar to what we saw with Gemini, taking image generation to the transformer, getting rid of diffusion seemingly and being like mind-blowing.

Then we go to applications and business, OpenAI getting some money and a few stories related to hardware projects and open source, some very exciting new benchmarks where we continue to try to actually challenge these new models. Research and advancements, as you said, Anthropic has a really cool interpretability paper that we will start talking about, but there's a lot to unpack. So we might get back to it next week.

And then policy and safety, some kind of smaller stories related to what the federal government in the US is doing. And actually some updates on copyright law stuff in the last section. So a lot to get through. We'll probably be talking a little bit faster than we are in our typical episodes. Maybe that's good. We'll see if we're able to keep it a bit more efficient.

So let's get straight to it. Tools and apps. We have Gemini 2.5, what Google is calling their most intelligent AI model. And this is one of their, I guess, slate of thinking models. Previously, they had Gemini 2.0 Flash thinking. That was kind of a smaller, faster model. Here, Gemini 2.5 is representing...

their bigger models. We had Gemini 2.0 Pro previously, which came out as their biggest model. But at the time, it was kind of not that impressive, just based on the benchmarks, based on people using it and so on. So Gemini 2.5 came out, like topping the leaderboards by a good margin, which we haven't seen for a while with benchmarks, like their performance on the benchmarks is significantly higher than the second best one on pretty much any benchmark you can look at.

even ones that seemed saturated to me. And not only that, just based on a lot of anecdotal reports I've been seeing in terms of its capacity for things like coding compared to Claude, for its capacity for writing, problem solving, it's just like another kind of class of model that is able to one shot, just given a task, nail it without...

having to get feedback or having to do multiple tries, things like that. So yeah, super impressive. And to me, kind of a surprising leap beyond what we've had. Absolutely. And I think one of the surprising features is where it isn't SOTA quite yet, right? So SWE-Bench Verified, right? It's actually a benchmark that OpenAI first developed. You have SWE-Bench,

Essentially real world-ish software engineering tasks. SWE-Bench Verified is the cleaned up OpenAI version of that. So on that benchmark, Claude 3.7 Sonnet is still number one and by quite a bit. This is pretty rare looking at 2.5, which just crushes everything else in just about every other category, but Claude 3.7 Sonnet still has that quite decisive edge, basically 6% higher in performance on that benchmark. But

That aside, Gemini 2.5 Pro is, I mean, it's just crushing, as you said, so many things. One of the big benchmarks a lot of people are talking about is the now sort of famous Humanity's Last Exam, right? This is the benchmark that Dan Hendrycks, Elon's AI advisor who works at

at the Center for AI Safety put together. I mean, it is meant to be just ridiculously hard reasoning questions that call on general knowledge and reasoning at a very high level. Previously, OpenAI's o3-mini was scoring 14%. That was SOTA. Now we're moving that up to 18.8%. We're gonna need a new name for benchmarks that don't make them sound as final as Humanity's Last Exam, by the way.

But we're on track right now to, I mean, like we're going to be saturating this benchmark, right? That's going to happen eventually. This is a meaningful step in that direction. And things have been moving really quickly with inference time reasoning, especially on that benchmark. But a couple of things, I guess, to highlight on this one.

Google coming out and saying, look, this is... So by the way, this is their first 2.5, Gemini 2.5 release. It's an experimental version of 2.5 Pro. What they're telling us is going forward, they are going to be doing reasoning models across the board. So like OpenAI, don't expect more base models to be released as such anymore from DeepMind. So everybody kind of migrating towards this view that like,

Yep. The default model should now be a reasoning model. It's not just going to be GPT 4.5 and so on. It's really going to be reasoning driven. And the stats on this are pretty wild. There's so much stuff. I'm trying to just pick one. For one, I mean, it tops the LM Arena leaderboard, which is a cool kind of rolling benchmark because it looks at

human preferences for LLM outputs. And then it gives you essentially an Elo score for those, and a pretty wide margin for Gemini 2.5. So subjectively getting really good scores, as you said, Andre, this is kind of like the measured subjectivity side. On SWE-Bench Verified, 63.8% is really good, especially given, even though it comes in second place to Sonnet, when you look at the balance of capabilities, this is a very wide capability envelope.
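To make the "Elo score from human preferences" idea concrete, here is a minimal sketch of how pairwise votes can be turned into ratings. This is an illustrative toy, not LM Arena's actual methodology (they fit a statistical model over all votes); the K-factor, starting ratings, and votes below are made up.

```python
# Minimal Elo-style rating update from pairwise preference votes.
# Illustrative only; LM Arena's real pipeline differs. Ratings, K,
# and the fake votes below are invented for the example.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 16.0):
    """Return updated ratings after one human preference vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

ratings = {"gemini-2.5-pro": 1300.0, "other-model": 1280.0}  # hypothetical
for winner, loser in [("gemini-2.5-pro", "other-model")] * 5:  # fake votes
    ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser], a_won=True)
print(ratings)
```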

They do say they specifically focused on coding. So again, still kind of interesting that they fall behind 3.7 Sonnet.

Maybe last spec to mention here is it does ship today with a 1 million token context window. So in the blog post announcing this, Google made a big stink about how they see one of their big differentiators as a large context. And they're going to be pushing a 2 million tokens of context soon as well, apparently. Right. And that is a significant detail because 1 million, I haven't been keeping track. I do think we had Claude Opus in that segment.

space of very large context, you know, but 1 million is still very impressive. And going to 2 million is pretty crazy. Again, you keep having to translate how big 1 million tokens is. It's somewhere around 700,000 or so words, a bit under a million words, and 2 million tokens would be well over a million words.
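As a rough back-of-the-envelope check on that conversion (the 0.75 words-per-token figure is a common approximation for English text, not an exact ratio):

```python
# Rough token-to-word conversion for context window sizes.
# ~0.75 words per token is a common rule of thumb for English text;
# the real ratio varies by tokenizer and content.
WORDS_PER_TOKEN = 0.75

for tokens in (1_000_000, 2_000_000):
    print(f"{tokens:,} tokens ≈ {int(tokens * WORDS_PER_TOKEN):,} words")

# 1,000,000 tokens ≈ 750,000 words
# 2,000,000 tokens ≈ 1,500,000 words
```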

It's a lot of content. You can fit in an entire manual, an entire set of documents, et cetera, in there. And of course, as with other Gemini models, it is multimodal, takes in text, audio, images, video. I've also seen reports of it being very capable of processing audio and images as well. And...

to that point of it starting to roll out as an experimental model, you can already use it in Google AI Studio. If you're paying for Gemini Advanced, you can also select it in the model dropdown and just try it. That's part of how we've been seeing people try it and report really good outcomes. So very exciting.

And now to the next story, also something very exciting and also something that's kind of mind-blowing to an unexpected degree. So OpenAI has rolled out image generation powered by GPT-4o to ChatGPT. To my understanding, and I'm not totally sure I have the exact details right, but similar to Gemini 2.0,

last week, was it last week or two weeks ago, I don't know, from Google. The idea here is, instead of having a separate model that is typically a diffusion model, where the LLM is like, okay, let me

give this prompt over to this other model that is just text to image and that will handle this and return the image. This is taking the full kind of end-to-end approach where you have a multimodal model able to take in text and images, able to put out text and images just via a set of tokens. And as a result of moving to this approach of not doing diffusion, doing full-on token language modeling,

This new category really of text-to-image models, or image-plus-text-to-image models, have a lot of capabilities we haven't seen with traditional text-to-image. They have very impressive editing right out of the box. They have...

Also, very, very good ability to generate text, a lot of text in an image with very high resolution. And they seem to just really be capable of very strict prompt adherence and making very complex text descriptions work in images and be accurate. And we've also discussed how

with image models, it's been increasingly hard to tell the difference or like see progress. But I will say also, you know, especially with DALL-E and to some extent also with other models,

There has been a sort of like pretty easy telltale sign of AI generation with it having a sort of AI style, being a little bit smooth, being, I don't know, sort of cartoony in a very specific way, especially for DALL-E. Well, this is able to do all sorts of visual types of images. So it can be very realistic, I think, differently from what you saw with DALL-E from OpenAI and

It can do, yes, just all sorts of crazy stuff similar to what we saw with Gemini in terms of very good image editing, in terms of very accurate translation of instructions to image. But in this case, I think even more so, just the things people have been showing have been very impressive. Yeah, and I think a couple of things to say here. First of all, astute observers or listeners will note last week we covered Grok now folding into its offering internally

an image generation service, right? So this theme of the omnimodal, or at least omnimodal platform, right? Grok is not necessarily going to make one model that can do everything. Eventually, I'm sure it will, but we're making kind of baby steps on the way there. This is OpenAI kind of doing their version of this and going all the way omnimodal with one model to rule them all.

you know, big, big strategic risk. If you were in the business of doing text to image or audio to whatever, like assume that all gets soaked up. And because of positive transfer, which does seem to be happening, right? One model that does many modalities tends to be more grounded, tends to be more capable at any given modality now, just because it benefits from that more robust representational space. Cause it has to be able to represent things in ways that can be decoded into images, into audio, into text. So there's a much more robust way of doing things.

One of the key words here is binding, right? One of the key capabilities of this model. Binding is this idea where you're essentially looking at how well multiple

kind of relationships between attributes and objects can be represented in the model's output. So if you say, draw me a blue star next to a red triangle next to a green square, you want to make sure that blue and star are bound together. You want to make sure that red and triangle are bound together faithfully and so on. And that's one of the things that this model really, really does well apparently.
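To make the binding idea concrete, here is a hypothetical sketch of how you might construct prompts for a binding-style eval: generate N color-shape pairs and later check that each color stays attached to its own shape. The prompt wording, color and shape lists, and pair counts are all invented for illustration; the actual image-checking step is left as a stub.

```python
# Hypothetical sketch of a "binding" eval prompt generator: each color
# must stay bound to its own shape in the generated image. Wording and
# counts are made up; scoring the image (e.g. with a VQA model or a
# human rater) is not implemented here.
import random

COLORS = ["red", "blue", "green", "yellow", "purple", "orange"]
SHAPES = ["star", "triangle", "square", "circle", "hexagon", "diamond"]

def make_binding_prompt(n_objects: int, seed: int = 0) -> tuple[str, list[tuple[str, str]]]:
    rng = random.Random(seed)
    pairs = [(rng.choice(COLORS), rng.choice(SHAPES)) for _ in range(n_objects)]
    description = ", ".join(f"a {c} {s}" for c, s in pairs)
    return f"Draw a simple scene containing exactly: {description}.", pairs

prompt, expected_pairs = make_binding_prompt(n_objects=15)
print(prompt)
# expected_pairs is the ground truth you would score the image against,
# e.g. by asking "what color is the star?" for each object.
```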

So apparently it can generate correctly bound attributes for up to 15 to 20 objects at a time without confusion. This in a sense is the text to image version of the needle in a haystack eval, right? Where we see like many different needles in the haystack. Well, this is kind of similar, right? If you populate the context window with a whole bunch of these relationships, can they be represented, let's say with fidelity in the output? The answer, at least for 15 to 20 objects in this case, in relatively simple bindings,

binding attributes is yes, right? So that's kind of one of the key measures that actually there's something different. I wouldn't be surprised if this is a consequence of just having that more robust representational space that comes with an omni-modal model. One other thing to highlight here is we do know that this is an autoregressive system, right? So it's generating images sequentially from left to right and top to bottom in the same way that text

is trained and generated in these models, that's not going to be a coincidence, right? If you want to go omnimodal, you need to have a common way of generating your data, whether it's video, audio, text, whatever, right? So this is them saying, okay, we're going autoregressive, presumably an autoregressive transformer, to just do this. So pretty cool. There's a whole bunch of, anyway, cool little demos that they showed in their launch.
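A toy sketch of what "autoregressive image generation" means at the token level, as opposed to diffusion: the model just keeps predicting the next discrete token, and some of those tokens decode to image patches. Everything here (vocabulary sizes, grid shape, the fake model) is invented for illustration; OpenAI has not published the actual architecture details.

```python
# Toy illustration of autoregressive (next-token) image generation:
# text tokens and image-patch tokens share one vocabulary and are
# sampled one at a time, roughly top-left to bottom-right. The "model"
# here is a random stub, purely illustrative.
import random

TEXT_VOCAB_SIZE = 50_000
IMAGE_VOCAB_SIZE = 8_192          # hypothetical codebook of image-patch tokens
IMAGE_TOKENS_PER_ROW = 32
IMAGE_ROWS = 32

def fake_next_token(context: list[int], rng: random.Random) -> int:
    """Stand-in for the model's next-token prediction (ignores context)."""
    return TEXT_VOCAB_SIZE + rng.randrange(IMAGE_VOCAB_SIZE)

def generate_image_tokens(prompt_tokens: list[int], seed: int = 0) -> list[list[int]]:
    rng = random.Random(seed)
    context = list(prompt_tokens)
    grid = []
    for _ in range(IMAGE_ROWS):                 # top to bottom
        row = []
        for _ in range(IMAGE_TOKENS_PER_ROW):   # left to right
            tok = fake_next_token(context, rng)
            context.append(tok)                 # a real model conditions on this growing context
            row.append(tok)
        grid.append(row)
    return grid  # a real system would decode this token grid into pixels

grid = generate_image_tokens(prompt_tokens=[101, 2023, 2003])  # fake prompt ids
print(len(grid), "rows of", len(grid[0]), "image tokens each")
```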

Worth checking out. One last little note here is they're not including any visual watermarks or indicators that show that the images are AI generated, but they will include what they call their standard C2PA metadata to mark the image as having been created by OpenAI, which we've talked about in the past. If you're curious about that, go check out those episodes. But yeah, so OpenAI kind of taking a bit of a middle ground approach on the watermarking side.

Yeah, and they also are saying there'll be some safeguards, certainly compared to things like Grok, where you won't be able to generate sexual imagery. You won't be able to, for instance, have...

politicians with guns or something like that. Of course, you're going to be able to get around these safeguards to some extent, but certainly a more controlled type of model, as you would expect. Last thing I'll also say is you've seen a ton of different use cases for this popping up on social media. One you may see covered in media is the Ghiblification of images, where it has turned out that you can

take a photo and tell the system to translate it to Ghibli style. Ghibli is a very famous animation studio from Japan, and it does a very good job, like a very faithful rendition. Definitely looks like Ghibli, and that kicks off a whole set of discussions as to, again, AI for...

what it means for art, you know, the ethics of it. There are also discussions as to what this means for Photoshop because it can do image editing. It can do design. You know, again, this is, I think, a surprising thing where we haven't talked about text-to-image as being mind-blowing in a little while and it kind of seemed to plateau for a while. And now it is, to me, certainly mind-blowing again to see the stuff you can do.

Onto the lightning round, and we actually have a couple more image generators to cover. I don't know if they decided to come out at the same time or what, but there are a few, starting with Ideogram. They are presenting version three of their system. Ideogram is one of the leading text-to-image focused businesses out there. Early on, their claim to fame was being able to handle text better, but these days, of course, that's not the case anymore.

They say that this 3.0 version of their system is able to create better realistic and stylized images. In particular, they have the ability to upload up to three reference images to guide the aesthetic output. And there's 4.3 billion style presets. So I think this reflects Ideogram being a bit more of a business-focused company.

And this being more of a product of them, like as a primary focus. So yeah.

Again, now with GPT-4o out, this is nowhere near that. But for specialized use cases, it could be still the case that something like Ideogram can hold on for a while. We'll see. Yeah, you can almost hear yourself arguing the TAM, the total addressable market size for these products down and down and down as ChatGPT, as all the big players kind of grow and grow and grow their own TAM. This is one of the problems we've talked about for a long time on the podcast. And I think,

I remain ready to be proved wrong here and expect to look stupid for any number of reasons as usual, but I think Ideogram is dead in the medium term, like a lot of companies in the space. Look, they do say 4.3 billion style presets. We, of course, being extremely competent AI journalists, have tested every single one and can report that they are pretty good, actually.

You're saying, Andre, that the text-in-image feature is a kind of lower value thing now because of the competition, 100% the case. This is why Ideogram is now choosing, forced to maybe, to emphasize photorealism and professional tools, right? That's kind of what they're making their niche, but they're going to get more and more niche-y. This is going to keep happening as their territory gets encroached on by the...

the sort of blessings of scale that the true hyperscalers can benefit from. Very cool, but kind of overshadowed by GPT-4o. I will say one last point. It could still be the case that, as a specialized sort of model or business focused on, let's say, business use cases like posters, maybe they have training data that allows them to still be better for a particular niche.

I don't know. Yeah. I think OpenAI's buying power for that training data is going to vastly exceed theirs. And I think also... I would say proprietary data from users of a platform. Oh, 100%. Yeah. Yeah. I mean, I think they're also fighting positive transfer as well. There are a lot of secular trends here, but you're right, at a certain point, if you can protect that data niche, yeah, you're absolutely right. That's the one way out that I can see, at least for sure. And the next story...

Also a new image generator that was also mind-blowing before GPT-4o. So the headline is New Reve image generator beats AI art heavyweights like Midjourney and Flux at pennies per image. This came out, there was a model codenamed Half Moon that was already impressing everyone. It came out now as Reve Image 1.0.

They are providing service for it. You can get 100 free credits and then credits at $5 for 500 generations. And, you know, this was...

prior to GPT-4o, again, really impressive in terms of its prompt adherence, in terms of being able to construct complex scenes and just generally kind of do better at various more nuanced or tricky tasks than other image generators. Seemed like the best, like an exciting new step in image generation. I'll keep saying it, GPT-4o, to some extent also like Gemini before, to be fair,

still kind of more mind-blowing than these things. Yeah, I mean, approximately take my last comments on Ideogram and copy-paste them in here. I think it's all roughly the same, but it's a tough space now, right? It's really getting commoditized. Right. And one thing also worth noting quickly is one differentiator could be cost, because of the autoregressive approach using LLMs,

you know, cost and speed also, because LLMs are typically slower, you're decoding these things token by token. If they're still using diffusion models, they could be cheaper and could be faster, which could be significant. I don't know. I think in practice, this is really tough. I mean, OpenAI gets to amortize their inferences over way larger batch sizes. And that's really the key number that, you know, you care about when you're tracking this sort of thing.

There's also, you know, if it makes economic sense, OpenAI will just distill smaller models and/or have models, you know, specialized in this. So I think, again, like long run, it's really kind of batch size versus batch size, compute fleet versus compute fleet. In my mental picture of this, the rich get richer, but again, like very, very willing to look like an idiot at some point in the future.

Yeah, I'm certain these companies are definitely thinking about their odds as well. Next up, moving away from image generation, but sticking with multi-modality, Alibaba is releasing Qwen 2.5 Omni, which is adding voice and video modes to Qwen Chat. So they're

Open sourcing Qwen 2.5 Omni 7B, that is a multimodal model, has text, image, audio, and video, that's under the Apache 2.0 license. And it is somewhat significant.

Because to my memory, in the multimodal model space, we don't have as many strong models as just pure LLMs. We have started seeing more of that with things like Gemma, but this has text, images, audio, and video. So possibly, if I'm not forgetting anything, kind of a pretty significant model to be released under Apache 2.0 with this multimodality.

Yes. And kind of seeing, you know, maybe some of the blessings of scale positive transfer stuff starting to show up here as well. Interesting. You have to see it as an open source model. And yet again, you know, the Chinese models being legit, like, I mean, this is no joke. The benchmarks here comparing favorably, for example, to Gemini Pro on OmniBench. That's a, you

Sorry, let me be careful. Gemini 1.5 Pro, right? We're two increments beyond that as of today, but still, this is stuff from like six months ago and beating it handily in the open source. So that's a pretty big development. Right. And can you imagine if we had a history where OpenAI didn't create this versioning system for models and we had actually new names for models? Wouldn't that be cool? Yeah.

You know what? It also makes you want to go kind of like, 1.5 for this lab should be the same as 1.5 for this one. And you even see some of the labs trying to kind of like number their things out of sequence just to kind of signal to you how they want to be compared. It's a mess. Yeah. And speaking of impressive models out of China, next we have...

T1 from Tencent. So this is their thinking model. This is kind of their equivalent to Gemini 2.0 Flash Thinking, to o1. And it is available on Tencent Cloud, priced pretty competitively. They say it tops leaderboards, beating R1 along with o1. So it's

Another kind of impressive release. I couldn't see many technical details on this. And in fact, it didn't seem to be covered in Western media that much. But it could be a big deal, Tencent being a major player in the Chinese market.

Yeah, the big things the release announces are that, first of all, it is, interestingly, a hybrid Mamba architecture, by which they presumably mean the combination of Transformer and Mamba that we've talked about before, that a lot of people see as this way of kind of covering the downsides of each. Check out our Mamba episodes, by the way, for more on that, because it's a bit of a deep, deep dive. But yeah, they claim, they refer to it as the first lossless application of the hybrid Mamba architecture.

I don't know what lossless means in this context. So I asked Claude and it said, well, in this context, it probably means there was no compromise or degradation in model quality adapting the Mamba architecture for this large scale inference model. Okay, fine. If that's the case. But again, this is where deeper dive would be helpful and it'll be interesting to see. Mamba, we haven't seen much. I haven't seen much about Mamba in quite a while.

That doesn't mean it's not being used in a proprietary context by labs that we don't know about, but it's sort of interesting to see another announcement in that direction. And moving on to applications and business, we begin with OpenAI and the story that they are now close to finalizing their $40 billion raise. This is led by SoftBank, and they have various investors in here, things like Founders Fund,

Coatue Management, names that you actually may not know too much about. There's an Illinois-based hedge fund that is contributing up to $1 billion, but the leader certainly seems to be SoftBank. They are saying they'll invest an initial $7.5 billion there, along with $2.5 billion from other sources.

And this will be the biggest round ever in fundraising, right? $40 billion is more than most companies' market cap. So it's crazy. And funnily enough, the shares of SoftBank dropped dramatically.

in the Japanese share markets, I think, because people are like, SoftBank, you're giving a lot of money to OpenAI. Is that a good idea? What? Has SoftBank made giant multi-billion dollar capital allocation mistakes before, Andre? I certainly can't remember. Yeah. I mean, there's no company that starts with W where SoftBank was famously involved.

Yeah, yeah, yeah. No, they've had a pretty rough time. And so SoftBank obviously is famous for those calls. Actually, I can't remember. I know that there is a noteworthy story to be told about their performance over the last few years. And my brain's fried. I'm trying to remember if it's like SoftBank actually is doing well, actually, or SoftBank is completely fucked.

It's one of those two, I think. The investors apparently include this. So Magnetar Capital, I had never heard of, to your point. The only one that I'd heard of in the list here is Founders Fund, which, by the way, I mean, these guys just crush it. SpaceX, Palantir, Stripe, Anduril, Facebook, Airbnb, Rippling. Founders Fund is just absolute god tier.

But apparently, Magnetar Capital has $19 billion in assets under management in Asia. So they're going to put in up to $1 billion alone in this round. So that's pretty cool, pretty big. So yeah, going up to $300 billion would be the post-money valuation, which is basically double the last valuation of $157 billion. That was back in October. So I'm sorry, has your net worth not doubled since October? What are you doing, bro? Get out there and start working because OpenAI is...

That's pretty wild. So yeah, anyway, there's a whole bunch of machinations about how the capital is actually going to be allocated. SoftBank's going to put in an initial $7.5 billion into OpenAI, but then there's also $7.5 billion that they presumably have to raise from a syndicate of investors. They don't necessarily have the full amount that they need to put in on their balance sheet quite yet. And I think this was part of what Elon was talking about in the context of the Stargate build saying, hey, I'm looking at SoftBank.

These guys just don't have a balance sheet to support a $500 billion investment or a $100 billion or whatever it was claimed at the time. And this is kind of true. And that's part of the reason why there's a second tranche of $30 billion that's going to be coming later this year that will include $22 billion from SoftBank and then more from a syndicate. So it's all this kind of staged stuff. There's a lot of people who still need to be convinced, or when you're moving money flows that are that big, obviously, there's just a lot of stuff you have to do to free up that capital.

This is history's largest fundraise. If it does go through, that's pretty wild. Next up, another story about OpenAI and some changes in their leadership structure, which is somewhat interesting. So Sam Altman is seemingly kind of not stepping down, but stepping to the side and meant to focus more on the company's technical direction and guiding their research and product efforts. He...

is the CEO, or at least was the CEO, meaning that, of course, as a CEO, you basically have to oversee everything, lots of business aspects. So this would be a change in focus. And they are promoting, I guess, or it doesn't seem like they announced changes to titles, at least not that I saw. But the COO, Brad Lightcap, is going to be stepping up with some additional responsibilities like marketing,

overseeing day-to-day operations and managing partnerships, international expansion, etc. There's also a couple more changes. Mark Chen, who was, I think, an SVP of research, is now the chief research officer. There's now a new chief people officer as well. So a pretty significant shuffling around of their C-suite. Of course, following up on a trend of a lot of people leaving, that we've been covering for months and months. So...

I don't know what to read into this. I think it could be a sign of trouble at OpenAI that requires restructuring. It could be any number of things that

But it is notable, of course. Yeah, there's, you know, that iceberg meme where they're like, you know, you get the regular theories at the top and then the kind of deep, dark conspiracy theories at the bottom. There are two versions of this story or a bunch. And I've heard one person, at least a former OpenAI person, speculate about a kind of like a darker reason for this. But so Brad Lightcap, you're right, was the COO before, is still the COO. All that's happening here presumably is a widening of his mandate.

This is notable because Sam Altman is a legendarily good fundraiser and one would assume corporate partnership developer. And you can see that in the work that he did with Microsoft and Apple. Like very few companies have deep partnerships with both Microsoft and Apple who under any typical circumstance are at each other's throats. I will also say-

Quick on this Altman note, the fact that he got friendly with the Trump administration under Elon Musk's nose, also pretty legendary, in my opinion. Yeah, he managed to turn around essentially a lifetime of campaigning as a Democrat and for Democrats to tighten his tie and make nice with elements of the campaign. It's always hard to know.

But yeah, one take on this is, well, Mira Murati has not been replaced. And Sam has said there's no plan to replace her. He's essentially stepping in to fill that role. And it's founder mode stuff. He wants to get closer to the gears level. I'm sure that's a big part of it no matter what. And it may be the whole thing. Another take that I have heard

is that as you get closer to superintelligence, the people at the command line, the people at the console that get to give the prompts to the model first are the ones to whom the power tends to accrete or with whom the power tends to accrete. So wanting to get more technical, wanting to turn into more of a Greg Brockman type is

makes sense if you think that that's where, you know, if you're power driven and that's kind of like where, you know, where you want to go. Anyway, it's an interesting kind of iceberg meme thing. Last thing I'll mention is Mark Chen, who's mentioned here in the list as one of the people who's promoted, who you mentioned, you may actually know him from all the demo videos, right? So the, you know, deep research demo, I guess the O1 demo when it launched, he's often there as Sam's like kind of right-hand demo guy. So anyway, his face and voice will probably be familiar to you.

quite a few people. Next up, we have kind of a follow-on story from what we covered a lot last week. So we were covering the Rubin GPU announcements from NVIDIA.

This is a story specific to the 600,000 watt Kyber racks and infrastructure that is set to ship also in 2027 along with their announcements. So I'll let you take over on this one, Jeremy, if you know the details. Yeah, no, it's just a little bit more on power density, rack power density. So for context, you have the

the GPUs themselves, right? So currently the Blackwells, like the B200, that's the GPU. But when you actually put it in a data center, it sits on a board with a CPU and a bunch of other supporting infrastructure, and that is called a system. So multiple of these trays with GPUs and CPUs and a bunch of other shit get slotted into these server racks, and together we call that whole thing a system.

A system with 576 of these GPUs, like if you counted all the GPUs up in that system, if you had 576 of them, that would be the NVL576 Kyber rack. This is a behemoth. It's going to have a power density of 600 kilowatts per rack. That is 600 homes worth of power consumption for one rack in a data center.

600 homes in one rack. That is insane. The cooling requirements are wild. For context, currently with your B200 series, you're looking at about 120 kilowatts per rack. So that's like a 5x-ing of power density. It's pretty wild. And Jensen, while they haven't provided clear numbers, has said that we're heading to a world where we're going to be pushing one megawatt per rack. So 1,000 homes worth of power per rack. Just kind of pretty wild for this Kyber system.
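A quick back-of-the-envelope on those rack numbers. The homes-per-rack figure follows the roughly 1 kW average household draw the hosts are implicitly using; actual household averages vary by country and season.

```python
# Back-of-the-envelope on rack power density. The ~1 kW average
# household draw is the implicit assumption behind "600 homes per
# rack"; the per-GPU number includes CPUs, networking, and overhead.
KYBER_RACK_KW = 600
B200_RACK_KW = 120
GPUS_PER_KYBER_RACK = 576
AVG_HOME_KW = 1.0  # assumption

print("density increase vs B200 racks:", KYBER_RACK_KW / B200_RACK_KW, "x")   # 5.0x
print("homes per Kyber rack:", KYBER_RACK_KW / AVG_HOME_KW)                    # 600
print("kW per GPU slot:", round(KYBER_RACK_KW / GPUS_PER_KYBER_RACK, 2))       # ~1.04
```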

Just gives you a sense of how crazy things are going to be getting.

And another story on the hardware front, this time from China, we have China's SiCarrier? I don't know how to say it. SiCarrier? SiCarrier, yeah. That is... Yeah, a Shenzhen-based company that is coming out as potentially a challenger to ASML and other fab tool developers. So as we've covered many times, probably, at this point, ASML is...

One of the pivotal parts of the ability to make advanced chips. They provide, they're the only company providing the really kind of most advanced techniques, tools to be able to fabricate, you know, at the leading edge of tiny node sizes. Yeah.

Nobody is able to match. And so this is very significant if, in fact, there will be kind of a Chinese domestic company able to provide these tools.

Yeah, what's happening in China is kind of interesting. Over and over, we're seeing them try to amalgamate, to concentrate what in the US would be a whole bunch of different companies into one company, right? So Huawei and SMIC seemed to kind of be forming a complex. It's like as if you glued Nvidia to TSMC, the chip design with the chip fab, right? Well, here's another company, SiCarrier, I don't know how to say it, but Silicon Carrier, right?

that's essentially like integrating a whole bunch of different parts of what's known as the front end part of the fab process. So when you manufacture semiconductors, the front end is the first and most complex phase of manufacturing where your circuits are actually going to be created on the silicon wafer. There's a whole bunch of stuff you have to do for that. You have to prepare wafers. You have to actually have a photolithography machine that like fires...

basically a UV light onto your wafer to then eventually do etching there. Then there's the etching, the doping with ions, deposition. There's all kinds of stuff. They have products now across the board. They just launched a whole suite of products

kind of covering that end to end. So that puts them in competition, not just with ASML, but also with Applied Materials, with Lam Research, with a lot of these big companies that own other parts of the supply chain that are maybe a little easier to break into than necessarily lithography. But then on the lithography side, SiCarrier also claims that they built a lithography machine that can produce 28 nanometer chips. So less advanced, way less advanced than TSMC,

But it brings China one step closer if this is true. If this is true, and if it's at economic yields, it brings them one step closer to having their answer to ASML, which...

They're still a huge long ways off. The jump from 28 nanometer lithography machines to, like, you know, seven nanometer DUV, let alone EUV, is immense. You can check out our hardware episode to learn more about that. But it's the closest that I've heard of China having an answer to ASML on the litho side. And they're coming with a whole bunch of other things as well. Again, more and more kind of integration of stuff in the Chinese supply chain.

And the last story for the section, also about China, Pony.ai wins the first permit for fully driverless taxi operation in China's Silicon Valley. So they are going to be able to operate their cars in Shenzhen's Nanshan District, a part of it.

And this is quite significant because the US-based companies, Tesla and Waymo, are presumably not going to be able to provide driverless taxi operation services anymore.

in China. And so that is a huge market that is very much up for grabs. And Pony.ai is one of the leaders in that space. Yeah, China is making legitimate progress on AI that should not be ignored. One of the challenges with assessing something like this is also that

You have a very sort of friendly regulatory environment for this sort of thing. China wants to be able to make headlines like this and also has a history of burying fatalities associated with all kinds of accidents from COVID to otherwise. And so it's always hard to do apples to apples here on what's happening in the West, but they do have a big data advantage. They have a big data integration advantage, big hardware manufacturing advantage. It wouldn't be surprising if this was for real. So there you go. Maybe-

an interesting kind of jockeying for position as to who's going to be first on full driverless. Right. And Pony AI has been around for quite a while, founded in 2016, actually in Silicon Valley. So yeah, they've been leading the pack to some extent for some time. And it makes sense that they're perhaps getting close to this working.

Onto projects and open source, and we begin with a new challenging AGI benchmark to your point of us having to continue making new benchmarks. And this is coming from the ARC Prize Foundation. We covered ARC AGI previously at the high level. The idea with these ARC benchmarks, they

test kind of broad, abstract abilities to do reasoning and pattern matching. And in particular, in a way where humans tend to be good without too much effort. So 400 people took this ARC-AGI-2 test and were able to get

60% correct answers on average, and that is outperforming AI models. And they say that non-reasoning models like GPT-4.5, Claude 3.7, Gemini 2.0 are each scoring around 1%, with the reasoning models being able to get between 1% and 1.3%. So this is also part of a challenge, in that there is this competition

to be able to beat these tests under some conditions, operating locally without an internet connection. And I think on a single GPU, I forget. And I think just with one arm. Yeah, exactly. Half the...

Transistors have to be turned off. Yeah. So yeah, this is an iteration on ARC-AGI. At the time, we did cover also a big story where o3 matched human performance on ARC-AGI-1 at a high computational cost. So not exactly at the same level, but still, they kind of beat the benchmark to some extent.

On this one, they are only scoring 4%, at the level of $200 of compute per task. So it clearly is challenging, clearly taking in some of the lessons of these models beating ARC-AGI-1. And I do think a pretty important thing or interesting thing to keep an eye on.

Yeah, they are specifically introducing a new metric here, the metric of efficiency. The idea being that they don't want these models to just be able to brute force their way through the solution, which I find really interesting. There's like this fundamental question of, is scale alone enough?

And scaling maximalists would say, well, you know, what's the point of efficiency? The cost of compute is collapsing over time. And then algorithmic efficiencies themselves, there's kind of algorithmic efficiencies where conceptually you're still running the same algorithm, but just finding more efficient ways to do it. It's not a conceptual revolution in terms of the cognitive efficiency.

mechanisms that the model is applying. So think here of like, you know, the move from attention to flash attention, for example, right? This is like an optimization or like KV cache level optimizations that just make your transformer kind of run faster, train faster and inference cheaper. That's not what they seem to be talking about here. They seem to be talking about

just, you know, how many cracks at the problem does the model need to take? And there's an interesting fundamental question as to whether that's a meaningful thing, given that we are getting sort of these more algorithmic efficiency improvements without reinventing the wheel and hardware is getting cheaper and, and, and all these things are compounding. So if you can solve the benchmark this year with a certain hardware fleet,

then presumably you can do it six months from now with like a 10th of the hardware, a 10th of the cost. So it's kind of an interesting argument. François Chollet, who designed this benchmark, is obviously on one side of it, saying, hey, in some sense, the elegance of the solution matters as well. Yeah, I think it's sort of fascinating, apparently, to give you a sense of how performance moves on this. So OpenAI's o3 model, o3-low, so the version of it that is spending less money on

test-time compute, was apparently the first to... Well, sorry, famously, I should say, not apparently. It was the first to reach basically close to the saturation point on ARC-AGI-1. It hit about 76% on the test. That was what got everybody talking like, okay, well, we need a new AGI benchmark. That model gets 4% on ARC-AGI-2,

using $200 worth of computing power per task, right? So that gives you an idea that we are pushing the curves back down yet again, but if past performance is any indication, I think these get saturated pretty fast and we'll be having the same conversation all over again.
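One way to picture the efficiency framing: instead of reporting accuracy alone, report accuracy alongside cost per task, or fold them into one cost-adjusted number. The scoring function below is a hypothetical illustration of that idea, not the ARC Prize Foundation's official metric; the o3 figures are the rough ones quoted in the episode and the human cost is made up.

```python
# Hypothetical cost-adjusted scoring to illustrate the "efficiency"
# framing: accuracy per dollar of test-time compute, rather than
# accuracy alone. NOT the ARC Prize Foundation's official metric.

def efficiency_score(accuracy: float, cost_per_task_usd: float) -> float:
    """Accuracy points per dollar of test-time compute per task."""
    return accuracy / cost_per_task_usd

entries = {
    "o3-low on ARC-AGI-2": (0.04, 200.0),     # ~4% at ~$200/task (quoted in episode)
    "human panel on ARC-AGI-2": (0.60, 5.0),  # ~60%; the $5/task cost is invented
}
for name, (acc, cost) in entries.items():
    print(f"{name}: accuracy={acc:.0%}, ~{efficiency_score(acc, cost):.4f} accuracy/$")
```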

The next story is also related to a challenging benchmark. This is a paper, Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models. So a new math benchmark, they are calling it OlymMath. And this has 200 problems with two difficulty levels, easy and hard, where easy is similar to AIME and existing math

benchmarks, and hard being, I suppose, the really advanced, Olympiad-level types of math problems that even very capable humans struggle with. They curate these problems from textbooks, apparently, printed materials, and

They specifically excluded online repositories and forums to avoid data contamination. And in their experiments, they're seeing that advanced reasoning models, DeepSeek R1 and o3-mini,

achieve only 21.2% and 30% accuracy, respectively, on the hard subset of the dataset. So still some challenge to be solved. I guess in a few months, we'll be talking about how we're getting to 90% accuracy. Yeah, we'll have the next version of OlymMath. Yeah, they came up with a couple of pretty interesting observations, maybe not too surprising.

Apparently, models are consistently better on the English versions of these problems compared to the Chinese versions, because they collected both. That's kind of cool. They do still see quite a bit of guessing strategies. The models get to the end of the thread and they're just throwing something out there, which, through false positives, is presumably increasing

the score somewhat. One thing I will say, first of all, like, yeah, kudos and interesting strategy to go out into the real world and bother collecting these things that way. It does make me wonder how well you could meaningfully scrub your data set of problems that you see in magazines, say, and be confident that they don't exist somewhere. Obviously there are all kinds of data cleaning strategies, including using language models and other things to peruse your data, to make sure that it isn't referenced on the internet. But

These things aren't always foolproof. And there have been quite a few cases where people think they're doing a really good job of decontaminating and not leaving any of that material online, and models have essentially been trained on it already. So yeah, I'm kind of curious whether we'll end up discovering that part of the saturation on this benchmark is, at least at first, due to overfitting.
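For a sense of what "scrubbing the dataset" even looks like in practice, here is a minimal n-gram overlap check of the kind commonly used for decontamination; as Jeremie notes, this sort of heuristic is far from foolproof. The corpus, n-gram size, and threshold here are all made up for illustration.

```python
# Minimal n-gram overlap decontamination check: flag a benchmark
# problem if too many of its 8-grams appear in the training corpus.
# A common but far-from-foolproof heuristic; values are illustrative.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(problem: str, corpus_ngrams: set, threshold: float = 0.3) -> bool:
    p = ngrams(problem)
    if not p:
        return False
    return len(p & corpus_ngrams) / len(p) >= threshold

corpus = "a long pile of scraped web text and scanned textbooks ..."  # stand-in corpus
corpus_ngrams = ngrams(corpus)
print(looks_contaminated("prove that for all positive integers n the following holds ...", corpus_ngrams))
```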

Yeah, and part of the challenge with knowing is we don't know the training data sets for any of these companies, for OpenAI, for Anthropic. These are not publicly released data sets. And I would say there's a 100% chance that they have a bunch of textbooks that they bought and scanned and included in their training data. Oh, yeah. So who knows, right?

A couple more stories, actually another one coming out of China. We have Wan, open and advanced large-scale video generative models, and this is coming from Alibaba. So this is, as the title says, a big model, 14 billion parameters at the largest size. And they also provide a 1.3 billion parameter model that is more efficient and

trained on a whole bunch of data, open sourced, and

seemingly outperforming anything that is open source in the text-to-video space quite a bit, both on efficiency, on speed, and on kind of appearance. The only one that's competitive is actually Hunyuan Video, which I think we covered recently. Things like Open-Sora are quite a bit below in terms of, I guess, like appearance measurement stuff. So open source...

you know, steadily getting to a point where we have good text-to-video as we had with text-to-image. Yeah, and just, I mean, anecdotally, some of the images, or sorry, some of the stills, are pretty damn photorealistic. I will note there's kind of this amusing, I don't know if this is intentional, I can't see the prompt anywhere, but there is a photo on page four of this paper that looks an awful lot like Scarlett Johansson.

So that's kind of a, if intentional, I guess a bit of a, a bit of a swipe at OpenAI there, which is mildly amusing, but anyway, yeah, there you go. I mean, China, especially on the open source stuff, is serious. I mean, this is Alibaba, right? So they've got access to scaled training budgets, but they're not even China's, like, leading lab, right? For that you want to look at Huawei and you want to look at DeepSeek, but yeah, pretty impressive. Yeah.

Exactly. And I think kind of interesting to see so many open sourcing, like Meta is maybe one company in the US that's doing a bunch of open sourcing still.

Google doing a little bit with smaller models. Basically, the only models being released are the smaller models like Gemma and Phi. But we are getting more impressive models out of China. And there's certainly a lot of people using R1 these days because it is open source.

Speaking of that, the next story is about DeepSeek v3. We have a new version as of March 24th. This is another naming convention that's kind of lame, where the model is DeepSeek v3-0324. Just a kind of incremental update, but a significant update because this is now the highest scoring non-reasoning model on some benchmarks.

exceeding Gemini 2.0 Pro and Meta's Llama 3.3 70B. Yeah, outperforming most models basically while not being a reasoning model. So presumably this is an indicator. R1 was based on DeepSeek V3. V3 was a base model. V3 was also a very impressive model at the time, trained very cheaply. That was a big story.

Presumably, the group there is

able to improve V3 quite a bit, partially because of R1, synthetic data generation, things like that. And certainly, they're probably learning a lot about how to squeeze out all the performance they can. Yeah, I think this is a case where there are just so many caveats. But any analysis of something like this has to begin and end with a frank recognition that DeepSeek is for real, this is really impressive. And now I'm just going to add a couple of buts, right? So none of this is to take away from that top line.

We've talked about in this episode quite a few times how the lab, so Gemini 2.5 is no longer just a simple base model. All the labs are moving away from that by default, not releasing new base models. And so yes, DeepSeek V3, the March 25 version is better than all of the base models that are out there, including the proprietary ones.

but labs are losing interest in the proprietary base model. So that's an important caveat. It's not like DeepSeek is moving at full speed on just the base model and the labs are moving at full speed on just the base model, and that's kind of apples to apples. But the most recent releases of base models

are still relatively recent, still GPT-4.5, which OpenAI has been sitting on for a while as well. So it is so difficult to know how far behind this implies DeepSeek is from the frontier. This conversation will just continue. And the real answer is known only to some of the labs who know how long they've been sitting on certain capabilities. There's also just a question of, like, DeepSeek could have just chosen to invest more, certainly now that they have state backing, who knows,

into beating this benchmark for publicity reasons as well. So none of this is to take away from this model. It is legitimately very, very impressive. By the way, all the specs are basically the same as before. So context window of 128,000 tokens.

Anyway, same parameter counts and all that. But still, I think a very impressive thing with some important caveats, not to read this right off the prompter, so to speak, in terms of assessing where China is, where DeepSeek is. Right. And pretty significant, I would say, because also DeepSeek V3 is fairly cheap to use. And you can also use it on providers like Groq, Groq with a Q. So if it is exceeding

you know, models like Claude and OpenAI for real applications, it could actually significantly hurt the bottom line of OpenAI,

at least with startups and the customers not on Azure pricing. Yeah, for people using the base model, right? And I guess that's the bet that everybody's making is that that will not continue to be the default use case. If you're doing open source, much, much more interesting to be shipping base models, right? Because then other people can apply their own RL and post-training schemes to it. So you're going to see probably open source continue to disproportionately ship some of these

base models. I wouldn't be surprised to find that the full frontier of base models be dominated by open source for that reason in the years to come. But there's a question of like, yeah, the value capture, right? Are people spending more money on base models? I don't think so. I think they're spending more money on agentic models that we're seeing start to dominate.

And one more story. This one is about OpenAI and not about a model. We saw an announcement or at least a post on Twitter with Sam Altman saying that OpenAI will be adding support for the model context protocol, which we discussed last week.

And that is an open standard that basically defines how models can connect to external tools and data sources when you're using the API.

A bit significant because it's coming from Anthropic. OpenAI is not introducing a competing standard. They are adopting what is now an open standard that the community got excited about. So I guess that's cool. It's nice to see some coalescence. And of course, when you have a new standard and everyone jumping on board, that makes it much easier to build tools, and the whole ecosystem benefits if

a standard turns out to be what everyone uses and there's no like weird competing different ways to do things. Yeah. And I think there was so much momentum behind this already that it just made sense even at scale for OpenAI to move in that direction. Yeah.
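Roughly, what MCP standardizes is a JSON-RPC-style exchange where a client can ask a server what tools it exposes and then call them. The snippet below is an approximate, illustrative rendering of that shape, not an exact excerpt of the spec; the field names follow my reading of it and may not match exactly, and the tool itself is invented.

```python
# Approximate illustration of the kind of exchange MCP standardizes:
# a client lists a server's tools (tools/list) and invokes one
# (tools/call) over JSON-RPC. Field names are my approximation of the
# spec; the "search_papers" tool is hypothetical.
import json

list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

example_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "search_papers",  # hypothetical tool
                "description": "Search an arXiv mirror by keyword.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            }
        ]
    },
}

call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "search_papers", "arguments": {"query": "circuit tracing"}},
}

print(json.dumps(call_request, indent=2))
```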

Onto research and advancements. And as we previewed at the beginning, the big story this week is coming out of Anthropic. This came out just yesterday. So we are still kind of absorbing it and can't go into full detail, but we will give at least an overview and the implications and kind of results. So...

There's a pretty good summary article, actually, you can read that is less technical, from MIT Technology Review. The title of that article is Anthropic Can Now Track the Bizarre Inner Workings of a Large Language Model.

And this is covering two blog posts from Anthropic. One is called Circuit Tracing: Revealing Computational Graphs in Language Models. They also have another blog post, which is On the Biology of a Large Language Model, essentially an application of the approach in the first blog post to Claude 3.5 Haiku, with a lot of interesting results. So

There's a lot going on here, and I'll try to give a summary of what this is presenting. We've seen work from Anthropic previously focusing on interpretability and exposing the inner workings of models in a way that is usable and also more intuitive. So we saw them, for instance, using...

Techniques to be able to see that models have some high level features like the Golden Gate Bridge famously. And you could then tweak the activations for those features and be able to influence the model. This is essentially taking that to the next step.

where you are able to see a sequence of high-level features working together and coalescing into an output from an initial input set of tokens. So they are doing this, again, as a sort of follow-on of the initial approach at a high level,

It's taking the idea of replacing the layers of the MLP bits of the model with these high level features that they discover via

you know, a similar technique to before. They have a new technique here called a cross-layer transcoder. So previously, you were focusing on just one layer at a time and you were seeing these activations in one layer. Now you're seeing these activations across multiple layers, and you see the kind of flow between the features via this idea of a cross-layer transcoder. And

Here are some more details where you start with a cross-layer transcoder, you then create something called a replacement model. And they also have a local replacement model for a specific prompt. The idea there is you're basically trying to make it so this replacement model, which doesn't have the same weights, doesn't have the same set of nodes or computational units as the original model, has the same overall behavior, has the same...

roughly equivalent behavior, and matches the original model as closely as possible, so that you can then see the activations of the model in terms of features and map that back onto the original model fairly faithfully. So let's just get into a couple of examples. The one they present in the blog post, figure five,

You can see how they have an input of "the National Digital Analytics Group," and they are then showing how each of these tokens leads to a sequence of features in this graph. So you start with Digital, Analytics, Group, which map onto features that correspond to those specific words. Then, after the open parenthesis,

There's a feature that just says "say/continue an acronym." And then in the second layer of the computational graph, you have features like "say D," "say something-A," and "say something-G" as three features. And there's another feature called "say DA."

And that combines with "say G" to say DAG. And DAG is the acronym for Digital Analytics Group. So that's showing the general flow of features. They also have, very interestingly, a breakdown of math features.

They have something like 36 plus 59, and they're showing that there's a bunch of weird features being used here. So 36 maps onto features like "roughly 36" and "something ending in 6"; 59 maps onto "something starting with 5," "roughly 59," and "something ending in 9." Then you have features like

"40 plus 50-ish" and "36 plus 60-ish." And then eventually, through a combination of various features, you wind up at the output: 36 plus 59 is 95.
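To make that addition example concrete, here is a toy re-creation of the kind of heuristic the circuit seems to implement: combine a rough-magnitude estimate with a last-digit feature, rather than doing grade-school carrying. This is an analogy to the description in the blog post, not Anthropic's actual features or code.

```python
# Toy analogy of the addition "circuit" described for 36 + 59: one
# pathway estimates the sum roughly, another tracks the last digit,
# and combining them yields the exact answer. Illustration only, not
# Anthropic's actual features or code.

def last_digit_feature(a: int, b: int) -> int:
    """'ends in 6' + 'ends in 9' -> 'ends in 5'."""
    return (a % 10 + b % 10) % 10

def rough_sum_band(a: int, b: int) -> range:
    """'40 plus 50-ish' / '36 plus 60-ish'-style fuzzy estimate of the sum."""
    estimate = round(a / 10) * 10 + round(b / 10) * 10
    return range(estimate - 10, estimate + 5)

def heuristic_add(a: int, b: int) -> int:
    """Pick the number in the rough band whose last digit matches."""
    digit = last_digit_feature(a, b)
    for candidate in rough_sum_band(a, b):
        if candidate % 10 == digit:
            return candidate
    raise ValueError("fuzzy band missed the answer")

print(heuristic_add(36, 59))  # 95
```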

So that's the high-level picture. It is giving us a deeper glimpse into the inner workings of LLMs in terms of the combinations of high-level features and the circuits they implement internally. It's actually building on a paper from last year called Transcoders Find Interpretable LLM Feature Circuits, from Yale University and Columbia University. They use a similar approach here, but of course scaled up. So, as with the previous work from Anthropic, this is to my mind some of the most impactful and most successful research on interpretability, because it really is showing,

I think at a much deeper level, what's going on inside these large language models. Yeah, absolutely. And again, this is where I think I was caveating at the outset of today's episode. I haven't had the chance to look at this yet. And this is exactly the kind of paper that I tend to spend the most time on. So apologies for that. I may actually come back next week with some hot takes on it. It looks like fascinating work from what I have been able to gather. It is fascinating.

I mean, it's pretty close to a decisive repudiation of the whole, and not that people make this argument so much anymore, stochastic parrot argument of people like Gary Marcus, who like to say, you know, oh, LLMs and autoregressive models are not that impressive. They're really just kind of predicting the next token. They're stochastic parrots. They're basically like robots that just mindlessly put out the next most likely word, stochastically.

I think anybody following the interpretability space for the last two, three years has known that this is pretty obviously untrue, as well as people following the capability side, just with some of the things we've seen. But one example they gave was there's a question as to whether a model uses fully independent reasoning threads for different languages. So if you ask what is the opposite of small in English and French?

will the model use language neutral components or will it have a notion of smallness that's English, a notion of smallness that's French? That's maybe what you would expect on the stochastic parrot hypothesis, right? That, well, it's an English sequence of words, so I'm going to use my kind of English submodel. Turns out that's not the case, right? Turns out that

Instead, it uses language neutral components related to smallness and opposites to come up with its answer. And then it'll pick only after that, only after it's sort of reasoned in latent space at the conceptual level, only after that does it sort of decode in a particular language. And so you have this unified reasoning space in the model.

that is decoupled from language, which in a way you should expect to arise because it's kind of a more efficient way to compress things, right? That's just like you have one domain and essentially all the different languages that you train the thing on are a kind of regularization. You're kind of forcing the model to reason in a way that's independent of the particular language that you're choosing to use to reason. Ideas are still the same.

And then, yeah, there's this question around interpretability, right? This thing will confabulate. You gave that example of adding 36 and 59. If it does this weird reasoning thing where it's almost doing like a, if you like math, you know, like something like a, I don't know, Taylor approximation where you kind of get the leading digit, right? Then the next digit, then the next digit, rather than actually doing it in a symbolic way. But then when you ask it, okay,

How did you come up with that answer? It will give you the kind of common-sense account: I added the ones, I carried the one, I added the tens, you know, that sort of thing, which is explicitly not the true reasoning it seems to have followed, at least based on this assessment. This raises deep questions about how much you can trust things like the reasoning traces that have become so popular, and that companies like DeepSeek and OpenAI have touted as, in some cases, their chief hope at aligning superintelligent AI. It seems like those reasoning traces are already decoupled from the actual reasoning that's happening in these models. So a bit of a warning shot on that too, I think.

Right. And to that point about the multilingual story, what's pretty notable is not just the technique itself, but the second blog post, On the Biology of a Large Language Model, where they applied it to Claude 3.5 Haiku and have a bunch of results. They have the multilingual circuits, they have addition, medical diagnosis, the life of a jailbreak, where they're actually showing how a jailbreak works, and also how refusal works. So some pretty deep insights that are actually pretty usable in terms of how you build your LLMs. There's so much to cover here that we probably will do a part two next week.

And onto the next story, we have Chain-of-Tools: Utilizing Massive Unseen Tools in the Chain-of-Thought Reasoning of Frozen Language Models. This is a new tool-learning method that lets frozen LLMs efficiently use unseen tools during chain-of-thought reasoning. So you can use this to integrate tools the model has never seen before.

And they actually have a new dataset as well, SimpleToolQuestions, which has 1,836 tools and can be used to evaluate tool-selection performance. A tool, by the way, is basically an API call: the LLM can say, okay, I need to do this web search, or I need to do this addition, whatever, and it can basically use a calculator or it can use Google, right? So it's pretty important to be able to do various things, and this is going to add to the performance of reasoning models. Yeah, this is a really interesting paper. There are these classic, multi-headed-Hydra trade-offs anytime you want to do tool use in models.

So some of these techniques, like you mentioned fine-tuning, if you fine-tune your models to use tools, well, you can't use your base model, right? You can't just use a frozen LLM. You're not going to succeed at using a huge number of tools because the more you fine-tune, the more you forget, right? There's this catastrophic forgetting problem. And so it can be difficult to have the model simultaneously know how to use like over a thousand tools, right?

And if you fine-tune, you're never going to be able to get the model to use unseen tools, because you're fine-tuning on a specific tool set you want to teach the model to use. There are similar challenges with in-context learning, right? If you're doing in-context learning, you have a needle-in-a-haystack problem if you have too many tools to pick from, and the model will start to sort of fail.

So anyway, all kinds of challenges with existing approaches. So what's different here? What are they doing? So start with a frozen LLM. That's a really important ingredient. They want to be able to use preexisting models without any modifications.

And they are going to train things; they're going to train models to help that frozen LLM do its job better, but that's not going to involve training any of the original LLM's parameters. So they're going to start by having a tool judge. Basically, this is a model that, when you feed a prompt to your base LLM, is going to look at the activations, the hidden state representation of that input, and it's going to go, okay, based on this representation for this particular token that I'm at in the sequence, do I expect that a tool should be called? Is the next token going to be a call to a calculator, a call to a weather app, or something like that?

And so this tool judge, again, operating at the activation level, at the sort of hidden state level, which is really interesting, it's going to be trained on a dataset that has explicit annotations of like, here are some prompts and here is where tool calls are happening, or sorry, here's some text and here annotated are where the tool calls are happening. But that data is really expensive to collect. So they also have synthetic data that shows the same thing.

So they're using this to get the tool judge to learn what does and doesn't correspond to a tool call in activation space. So essentially you're training just a binary classifier here. And then during inference, if the judge scores the tool-call probability for a given token above some threshold, the system will go ahead and call a tool. When it does that, it does it via a separate component called a tool retriever. And this tool retriever, I mean, it's not a single model, it's itself a system.

It uses two different models, a query encoder and a tool encoder. So this is basically RAG, right? This is retrieval augmented generation. You have embeddings that represent all of your different tools, your 1,000 or 2,000 different tools. And then you have a way of embedding the query, which is really a modified version of the activations associated with the token that the tool judge decided was a tool call. My God. Anyway, from here it is RAG. So if you know the RAG story, that's what they're doing here. And then, anyway, they call the tool. So a couple of advantages here, right? Frozen LLM, no need to fine-tune, and no catastrophic forgetting issues.

They are using just the hidden states, so that's fairly simple. And the tool retriever, right, this system that's deciding which tool to call, is interestingly trained using contrastive learning. So in each training mini-batch, when you're feeding a batch of training data to the system to get it trained up,

you're basically, instead of comparing one tool versus all the other tools in the dataset to figure out like, should I use this one or another? You're just comparing it batch-wise, like to all the tools that are called or referenced within that batch, just to make it more tractable.

and computationally efficient. So anyway, if you know contrastive learning, that's how it works. If you don't, don't worry about it. It's a bit of a detail, but it's, I think, a really important and interesting paper because the future of AGI has to include essentially unlimited tool use, right? That's something that I think everybody would reasonably expect and the ability to learn how to use new tools. And this is one way to kind of shoehorn that in potentially.
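To make the whole pipeline a bit more concrete, here is a rough schematic in Python of how a hidden-state tool judge and an embedding-based tool retriever could fit together, with an in-batch contrastive loss for training the retriever. All of the names, shapes, and the threshold are invented for illustration; this is not the paper's actual code.

```python
# Schematic of a Chain-of-Tools-style pipeline (illustrative sketch with invented
# names and shapes, not the paper's implementation). A frozen LLM produces hidden
# states; a small "tool judge" probe decides per token whether a tool call should
# happen; a dual-encoder retriever picks the best-matching tool, RAG-style.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToolJudge(nn.Module):
    """Binary classifier over the frozen LLM's hidden state for one token."""
    def __init__(self, d_model: int):
        super().__init__()
        self.probe = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.probe(hidden_state))  # P(tool call at this token)


class ToolRetriever(nn.Module):
    """Dual encoder: embed the query context and every tool description."""
    def __init__(self, d_model: int, d_embed: int):
        super().__init__()
        self.query_encoder = nn.Linear(d_model, d_embed)
        self.tool_encoder = nn.Linear(d_model, d_embed)

    def forward(self, query_hidden: torch.Tensor, tool_reprs: torch.Tensor) -> torch.Tensor:
        q = F.normalize(self.query_encoder(query_hidden), dim=-1)  # (d_embed,)
        t = F.normalize(self.tool_encoder(tool_reprs), dim=-1)     # (n_tools, d_embed)
        return t @ q                                               # cosine score per tool


def maybe_call_tool(hidden_state, tool_reprs, judge, retriever, threshold=0.5):
    # If the judge fires for this token, retrieve the best tool; otherwise keep decoding.
    if judge(hidden_state).item() > threshold:
        scores = retriever(hidden_state, tool_reprs)
        return int(scores.argmax())  # index of the tool to call
    return None


def in_batch_contrastive_loss(query_emb, tool_emb, temperature=0.07):
    # In-batch contrastive training for the retriever: each query's positive tool
    # competes only against the other tools appearing in the same mini-batch.
    q = F.normalize(query_emb, dim=-1)  # (batch, d_embed)
    t = F.normalize(tool_emb, dim=-1)   # (batch, d_embed), row i is query i's tool
    logits = q @ t.T / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))    # the diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```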

And just a couple more papers. The next one is also an interpretability paper. The title is Inside-Out: Hidden Factual Knowledge in LLMs, and it's also quite interesting.

The quick summary is they are looking to see what knowledge is encoded inside an LLM that it doesn't produce. So it may have some hidden knowledge, it knows facts, but we can't get it to tell us that it knows those facts. The way they do that is they define knowledge in terms of whether the model ranks a correct answer to a question higher than an incorrect one. So the model knows a fact if it treats the correct answer as the more likely continuation. And the comparison of external to internal knowledge is: externally, you can use the final output token probabilities, the visible, external signal; internally, you can use internal activations to get that estimate of the rankings. And there's an interesting result here: LLMs encode 40% more factual knowledge internally than they express externally. In fact, you can have cases where an LLM knows the answer to a question perfectly but fails to generate it even in a thousand attempts.

And that's presumably due to sampling processes. Perhaps I need to do a deeper dive, but there could be various reasons as to why you're failing to sample it. It could be, you know, too niche and your prior is overriding it. It could be sampling techniques, etc.
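As a rough sketch of that measurement, in my own simplified terms rather than the paper's exact estimator: for each question, you take a correct answer and some incorrect ones, score the candidates externally (output-token probabilities) and internally (scores derived from hidden activations), and count how often the correct answer outranks the incorrect ones under each view. The scorer interfaces below are invented for illustration.

```python
# Rough sketch of the internal-vs-external knowledge comparison (a simplification
# with invented scorer interfaces, not the paper's estimator). "Knowing" a fact is
# operationalized as ranking the correct answer above incorrect alternatives.

def knowledge_fraction(questions, external_scorer, internal_scorer):
    """Fraction of (correct, incorrect) answer pairs ranked correctly, per scorer."""
    correct_counts = {"external": 0, "internal": 0}
    total = 0
    for q in questions:
        for wrong in q["incorrect"]:
            total += 1
            # External view: compare output-token probabilities of each candidate.
            if external_scorer(q["prompt"], q["correct"]) > external_scorer(q["prompt"], wrong):
                correct_counts["external"] += 1
            # Internal view: compare scores derived from hidden activations instead.
            if internal_scorer(q["prompt"], q["correct"]) > internal_scorer(q["prompt"], wrong):
                correct_counts["internal"] += 1
    return {view: count / total for view, count in correct_counts.items()}

# The headline result corresponds to the "internal" fraction coming out meaningfully
# higher than the "external" one (about 40% more, on their setup).
```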

But either way, another interesting finding about the internals of LLMs. Yeah, this is almost the... It's the closest I've seen to hardcore quantification of Gwern's famous...

aphorism, I guess, where he says prompting can reveal the presence, but not the absence of capabilities in language models. It can reveal that a model has a capability, can't reveal that it doesn't have the capability. And this is what you're seeing. It's pretty intuitive. If you try a thousand times and you don't get an answer that you know the system is capable of delivering, then that means that you just haven't found the right prompt.

And in general, you'll never be able to try every possible prompt, right? So you will, in general, always underestimate the capabilities of a language model, certainly when you just look at it in output space, in token space. This is why, increasingly, the kind of safety strategies like the ones OpenAI is pitching that rely on just looking at reasoning traces are looking really suspect and kind of fundamentally broken. Yeah.

You need representational space interpretability techniques if you're going to make any kinds of statements. And even then, right, you have all kinds of interesting steganography issues at the level of the activations themselves. But

Interesting paper. I guess we'll have to move along because we're short on time. Oh, yeah, we're doing a shorter episode today just because we got started half an hour late. So this is why we're kind of blasting through. But I do think this is a really important and interesting paper. Last paper, we are wrapping up with another new benchmark. This is from Sakana AI.

Their benchmark is based on Sudoku; I think it's called Sudoku-Bench. And this benchmark has not just the classic Sudoku you've seen, but also a bunch of variations of Sudoku with increasingly complex rule sets for how you can fill in numbers on the grid. Sudoku, by the way, is the number grid puzzle where there are some rules, and according to those rules, you have to figure out which numbers go where, basically. So they introduce this benchmark, and because there's a progression of complexity, you see that even top reasoning models can crack the easier ones, but they are not able to beat the more complex variants. And, you know, there's a fair amount of distance to go before models can beat this benchmark.

Yeah, the take-home for me from this was as much as anything that I have no idea how Sudoku works, because apparently there are all these variants. Like, I remember being in high school, I had friends who loved Sudoku, and it was just that thing that you mentioned, where there's a nine-by-nine grid and you have to put the numbers from one to nine in each row, column, and box, using each only once, and all that jazz. But now apparently there are all kinds of versions of Sudoku. Unlike chess and Go, which have the same rules every time, this is like... So some versions, apparently they give as examples, require deducing the path that a rat takes through a maze of teleporters. So if the rat goes to position X, then it gets magically teleported to position Y, which could be somewhere completely uncorrelated, and that's framed up in a Sudoku context. There's another one that requires moving obstacles, cars they say, into the correct locations before trying to solve. There are all kinds of weird variations on this. And they basically design a spectrum, right? From

really, really simple, like four by four Sudoku all the way through with more and more constraints and kind of sub rules added. It seems to just generally be a very fruitful, uh,

way to play a combinatorics game and procedurally generate all these different games that can be played. And then ultimately where they land is they share this kind of dataset of how these models perform. You can sort of think of this as another ARC-AGI kind of benchmark; that's what it felt like to me, and it's an interesting coincidence to see this drop the same week as ARC-AGI 2. Basically, all the models suck. That's the take-home. The one that sucks the least is o3-mini from January 31st, and it has a correct solve rate of 1.5% for the full-scale version of these problems. They have simplified problems as well, so you can actually track progress in that direction. But anyway, I thought this was really interesting. They have a collaboration with a YouTube channel called Cracking the Cryptic to put together a bunch of essentially training data, I guess, evaluation data, for these things. But yeah, this is

you know, Sakana AI, and they are the company that put together that AI scientist paper that we covered a while back. They're back at it with this, I want to call it an AGI benchmark because that's kind of what it feels like.

Moving on to policy and safety, first up, some U.S. legislation. Senator Scott Wiener is introducing the bill SB 53, meant to protect AI whistleblowers and boost responsible AI development. So this would, first, include provisions to protect whistleblowers who alert the public about AI risks. It's also proposing to establish CalCompute, a research cluster to support AI startups and researchers with low-cost computing. This is in California in particular, so this would be protecting researchers,

presumably whistleblowers from some notable California companies and letting startups perhaps compete. Yeah, this is actually really interesting, right? Because we covered extensively SB 1047, which was the bill that successfully came out of the California legislature, which Gavin Newsom vetoed over the objections of not only a whole bunch of whistleblowers in the AI community, but also Elon Musk.

who actually did come out and endorse, very unusual for him, being a sort of libertarian-oriented guy, he endorsed SB 1047. The original version of SB 1047 contained a lot of things, but basically three things. So one was the whistleblower protections. That's included in SB 53.

The other was Cal Compute. That's included in SB 53, which leaves us to wonder, well, what's the thing that's missing, right? What's the difference with SB 1047? And it's the liability regime. So SB 1047 included a bunch of conditions where developers of models that cost over $100 million...

to develop could be on the hook for disasters if their safety practices weren't up to par. So if they developed a model and it led to a catastrophic incident and it costs them over $100 million just to develop the model, essentially this means you've got to be super, super resourced to be building these models. Well, if you're super resourced and you're building a model that's like

$100 million plus to train, yeah, you're on the hook for literally catastrophic outcomes that come from it. I think a lot of people...

looked at that and said, hey, that bar doesn't sound too low. That's a pretty reasonable bar to meet for these companies. But that was vetoed by Gavin Newsom. So now essentially what they're doing is they're saying, okay, fine, Gavin, what if we get rid of that liability regime and we try again? That's kind of the state that we're at here. So they're working their way through the California legislature. We'll see if that ends up on Newsom's desk again. And if so, if we get yet another scrapping of the legislation. Yeah.

Right. I should be clear that this is a senator in the California legislature, not in the federal government. He represents San Francisco, actually, a Democrat from San Francisco, which is kind of interesting. And yeah, the main pitch is balancing the need for safeguards with the need to accelerate, in response to the objections raised to 1047.

Next up, we have a story related to federal US policy. The title is: NVIDIA and other tech giants demand the Trump administration reconsider the "AI diffusion" policy set to take effect by May 15th. So this is a policy initially introduced under the Biden administration that broadly categorizes countries into three groups based on how friendly they are to US national security interests.

So the first category would be friends, who can import chips without restrictions. The second would be hostile nations, which are completely barred from acquiring US-origin AI technology. And then there are other countries, like India, which face limitations. And of course, companies like Nvidia aren't very happy about that, because that would mean fewer people buying their chips.

I think that's basically the story. Yeah, no surprise there's a lot of lobbying against the AI diffusion policy. By the way, this is one that came out of the Biden administration, but interestingly has so far not been scrapped. That's really interesting.

Because, you know, so many executive orders from the Biden administration have been gotten rid of, as you would expect, as part of the Trump administration settling into their seats. So, yeah, I mean, this is, you know, NVIDIA trying it on again, Oracle trying it on again, seeing if they can loosen up those constraints. We'll, I'm sure, be talking more about this going forward. And next up, another story related to export controls, our favorite topic. The story is that the US has added over 50 Chinese companies to the export blacklist. So this is from the Commerce Department's Bureau of Industry and Security. Eighty organizations were added to the entity list in total, with more than 50 from China, and these are companies that are allegedly acting against U.S. national security and foreign policy interests. So, for example, they're barred from acquiring US items that would support military modernization, advancing quantum technology, and things like AI. Yeah, this is one of the cases where I just, I don't know, when it comes to the policy side, and these are still, I think, Biden-era policies, basically, that are operating here.

We may see this change, but for now, like, dude, come on. So get this: two of the firms that they're adding to the blacklist were supplying sanctioned entities like Huawei and its affiliated chip maker, HiSilicon. Okay. So HiSilicon is basically, it is basically Huawei. It's kind of a division of Huawei that is Huawei's Nvidia, if you will. They do all the chip design.

So then they blacklisted 27 entities for acquiring stuff to support the CCP's military modernization, and a bunch of other stuff. When it comes to the AI stuff, okay, among the organizations on the entity list, they say, were also six subsidiaries of the Chinese cloud computing firm Inspur Group. So Inspur is a giant, giant cloud company in China. They actually famously made essentially China's answer to GPT-3 back in the day. You may remember this if you were tracking it; it was called Yuan 1.0, or also Source 1.0. But the fact is, this is China's game, right? They keep spinning up these stupid subsidiary companies and taking advantage of the fact that

yeah, like we're not going to catch them. We're playing a losing game of whack-a-mole. It's super cheap to spin up subsidiaries and you import shit that you shouldn't until they get detected and shut down. And then you do it again until we move to a blacklist model, sorry, a whitelist model rather than a blacklist model with China. This will continue to happen, right? Like you need to have a whitelist where by default it's a no, and then certain entities can import. And then you want to be really, really careful about that because it's,

basically because of civil military fusion, any private entity in China is a PLA, People's Liberation Army, a Chinese military affiliated entity. That's just how it works. It's different from how it works in the US. That's just the fact of life. But until you do that whitelist strategy, you are just waiting to be made to look like a fool. People are going to spin up new subsidiaries and we will be doing articles and stories like this until the cows come home

unless that changes. So this is kind of one of those things where, you know, I don't know why the Biden guys didn't do this. I get that there's tons of pressure from US industry folks, because it is tough. But at a certain point, if the goal is to prevent the CCP military from acquiring this capability, we've got to be honest with ourselves that this is the solution. There's no other way we actually succeed at this kind of whack-a-mole game.

And one more story, this one focused on safety more so than policy. Netflix's Reed Hastings gives $50 million to Bowdoin College to establish an AI program. This would be a research initiative called AI and Humanity, focused on the risks and consequences of AI rather than traditional computer science AI research. The college will be using these funds to hire new faculty and support existing faculty with this research focus. $50 million is quite a bit, I would imagine, for doing this kind of thing.

Yeah, it's sort of interesting, because there are all these big luminaries coming down every which way on this issue, and we hadn't heard anything from Netflix, right? From Reed Hastings. So I guess now we know where at least he comes down on the equation. Yeah, it's interesting. Of course, this is a gift to Hastings' alma mater; he graduated from this college decades ago.

Onto synthetic media and art, we have just a couple more stories. First up, a judge has allowed the New York Times copyright case against OpenAI to go forward. OpenAI was requesting to dismiss the case, and so that didn't happen.

The judge has narrowed the lawsuit's scope but upheld the main copyright infringement claims. The judge is also saying that they will be releasing a detailed opinion, which hasn't been released yet. So pretty significant, I think, because there's a bunch of lawsuits going on, but this is the New York Times, a big player in media publishing that certainly has experienced lawyers on their side and is able to throw down in terms of resources with OpenAI. So the fact that this is going forward is pretty significant. Yeah, I mean, actually, nowadays they may not have the kind of resources that they once had. Surprisingly, they're kind of successful, you'd think, because they managed to move to a subscription-based online model and survived better than other media entities in recent decades.

I don't know if they're as big as they used to be, but they're surprisingly successful still. Yes. I was just looking it up. Apparently, their subscription revenue is, let me see, in the quarter... Okay, quarterly subscription revenue of $440 million. Jesus. Okay. That's pretty good. That's pretty good. Wow. Okay. I would not have expected that. I'll have to update my... well, there you go. I mean, we'll get the opinion, whatever Judge Stein means when he says expeditiously, which I guess in legal talk probably means sometime in the next decade, but there you go.

Another similar story, although this time going the other way: a judge has ruled that Anthropic can continue training on copyrighted lyrics for now. This is part of a lawsuit from Universal Music Group, which wanted an injunction to prevent Anthropic from using copyrighted lyrics to train its models. That means that...

Yeah, Anthropic can keep doing it, assuming it's doing it. And this is also saying that the lawsuit is going to keep going. There's still an open question as to whether it is legal for Anthropic to do it, but there's not yet a restriction prior to the actual case being fought.

This is very much not-a-lawyer territory for me. So injunctions, to my understanding, are essentially things where the court will step in ahead of time, before something would otherwise happen, and say, oh, pop, pop, pop, just so you know, don't do this. And then if you violate the injunction, it's a particularly bad thing to do. So this would be the court anticipating rather than reacting to something. That's what the publishers are asking for. Hence the statement from the judge on the case saying, "Publishers are essentially asking the court to define the contours of a licensing market for AI training where the threshold question of fair use remains unsettled."

The court declines to award publishers the extraordinary relief of a preliminary injunction based on legal rights. So basically, we're not going to step in and kind of anticipate where this market is going for you and just say, hey, you can't use this based on legal rights that have not yet been established. So essentially, it's for another court to decide what the actual legal rights are. We're not in a position until that happens to grant injunctions on the basis of

what is not settled law. So once we have settled law, yeah, if it says that you're not allowed to do this, then sure, we may grant an injunction saying, oh, Anthropic, don't do that. But for right now, there's no law on the books and we don't really have precedent here, so I'm not going to give you an injunction. That's, at least, my read on this. Again, lawyers listening may be able to just smack me in the face and set me right, but kind of interesting. Yeah. Sounds right to me. So...

Well, with that, we are done with this episode of Last Week in AI. A lot's going on this week, and hopefully we did cover it. And as we said, we'll probably cover some more of these details next week, just because there's a lot to unpack. But for now, thank you for listening through apparently this entire episode. We would appreciate your comments, reviews, sharing the podcast, etc. But more than anything, we appreciate you tuning in. So please keep doing that.

Tune in, tune in, news begins, it's time to break, break it down. Last weekend AI come and take a ride, get the low down on tech and let it slide. Last weekend AI come and take a ride, I'm allowed to the streets, AI's reaching high.

New tech emergent, watching surgeons fly From the labs to the streets, AI's reaching high Algorithms shaping up the future seas Tune in, tune in, get the latest with ease Last weekend AI come and take a ride Hit the low down on tech and let it slide Last weekend AI come and take a ride From the labs to the streets, AI's reaching high

From neural nets to robots, the headlines pop Data-driven dreams, they just don't stop Every breakthrough, every code unwritten On the edge of change

With excitement we're smitten, from machine learning marvels to coding kings. Futures unfolding, see what it brings.