
#214 - Gemini CLI, io drama, AlphaGenome, copyright rulings

2025/7/4

Last Week in AI

People
Andrey Kurenkov
Jeremie Harris
Topics
Andrey Kurenkov: I have tried using Google's NotebookLM to generate a podcast, but it repeats itself. LLMs still have the problem of losing track of where they are, and it would take very precise prompting to replicate our personalities and voices. Even so, I believe LLMs will be capable of replacing us in the future. Jeremie Harris: I don't think an AI can be late the way I was because my daughter is teething. Our job is to surface for you the moment when AI can replace us. I expect that within the next 18 months there will be AI that is comparable to us. I believe there are already AI-generated AI news podcasts out there, and they may be better than us and updated daily. AI can't be as lacking in wit and thought as we are.


Chapters
Google launched Gemini CLI, a command-line interface powered by Gemini 2.5 Pro. While not yet as capable as Claude Code, it offers a generous free tier and is expected to become a strong competitor.
  • Google's Gemini CLI is a terminal-based AI agent.
  • It offers 60 model requests per minute and 1,000 per day for free.
  • Initial feedback suggests it's not as powerful as Claude Code in software engineering tasks.

Transcript


Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news, and you can check out the episode description for the links to that and the timestamps. I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a generative AI startup.

And I'm your other host, Jeremie Harris, co-founder of Gladstone AI, AI National Security, blah, blah, blah, as you know. And I'm the reason this podcast is going to be an hour and a half and not two hours. Andrey was very patiently waiting for like half an hour while I just sorted things out. My daughter's been teething, and it's wonderful having a daughter, but sometimes teeth come in six or eight in a shot and then you have your hands full. And so she is the greatest victim of all this.

But Andrey is a close second because, boy, I kept saying five more minutes and it never happened. So I appreciate the patience, Andrey. I got an extra half hour to prep, so I'm not complaining. And I'm pretty sure you had a rougher morning than I did. I was just drinking coffee and waiting, so not too bad.

But speaking of this episode, let's do a quick preview. It's going to be, again, kind of less of a major news week. Some somewhat decently big stories, tools and apps, Gemini CLI is a fairly big deal.

Applications and business: we have some fun OpenAI drama and a whole bunch of hardware stuff going on, and not really any major open source stuff this week, so we'll be skipping that. Research and advancements: exciting new research from DeepMind and just various papers about scalable reasoning, reinforcement learning, all that type of stuff.

Finally, in policy and safety, we'll have some more interpretability, safety, and China stories, the usual, and some pretty major news about copyright, following up on what we saw last week. So that actually would be one of the highlights of this episode, towards the end. Before we get to that, I do want to acknowledge a couple of reviews on Apple Podcasts, as we do sometimes. Thank you to the kind reviewers leaving us some very nice comments, also some comments

Fun ones. I like this one. This reviewer said, I want to hear a witty and thoughtful response on why AI can't do what you're doing with the show. And wow, you're putting me on the spot, being both witty and thoughtful. And it did make me think. I will say I did try NotebookLM.

A couple months ago, right? And that's the podcast generator from Google. It was good, but definitely started repeating itself. I found that LLMs still often have this issue of losing track of where they're at, like 10 minutes, 20 minutes in.

repeating themselves or just otherwise. And also, Andrey, and repeating themselves too. And they'll just keep saying the same thing and repeating over and over. Like they'll repeat and repeat a lot. So yeah, yeah. That kind of repetition was solved a couple of years ago, thankfully. Yeah, that's true.

Honestly, you could do a pretty good job replicating last week in AI with LLMs these days. I'm not going to lie, but you're going to have to do very precise prompting to get our precise personas and personalities right.

and voices and so on. So I don't know. Hopefully we're still doing a better job than AI could do, or at least doing a different job than the more generic kind of outcomes you could get trying to elicit AI to make an AI news podcast. But dude, what AI could compete with starting 30 minutes late because its daughter's teething? Like I challenge you right now, try it. You're not going to find an AI that can pull that off. You can have AI that says it does. Right.

But will the emotion of that experience actually be it? I don't think so. I think that's the copium, right? People are often like, oh, it won't have the heart. It won't have, like, the soul, you know, of the podcast. It will. It will. In fact, I think arguably our job is to surface for you the moment that that is possible, so that you can stop listening to us.

One of the virtues of not being like a full-time podcaster on this too is we have that freedom maybe more than we otherwise would. But man, I mean, it's, I would expect within the next,

18 months, hard to imagine that there won't be something comparable. But then your podcast hosts won't have a soul. They'll be inside a box. Well, in fact, I'm certain, I believe as of quite a while ago, there are already AI-generated AI news podcasts out there. I haven't checked them out, but I'm sure they exist. And nowadays, they're probably quite good. And you get one of those every day as opposed to once a week. And they're never a week behind. Yeah.

In some ways, definitely superior to us. But in other ways, can they be so witty and thoughtful in responding to such a question? I don't know. In fact, can they be so lacking in wit and thought as we can be sometimes? That's right. That's a challenge. They'll never out-compete with our stupid. Yes, as is true in general. I guess you'd have to really try to get AI to be bad at things when it's actually good. Yeah.

Anyways, a couple more reviews lately. So I do want to say thank you. Another one is called This is the Best AI Podcast, which is quite the honor and says that this is the only one they listen to at normal speed. Most of the other podcasts are played in 1.5 or 2x speed. So good to hear we are using up all our two hours at a good pace. That's right.

Funny, a while ago, there was a review that was like, I always speed up through Andrey's talking and then have to slow back down for Jeremie. So maybe I've sped up since then.

So yeah, as always, thank you for the feedback and thank you for questions that you bring in. I think it's a fun way to start the show. But now let's go into the news, starting with tools and apps. And the first story is, I think, one of the big ones of this week, Gemini CLI.

So this is essentially Google's answer to Claude Code. It is a thing you can use in your terminal, which for any non-programmers out there is just the text interface to working on your computer. So you can look at what files there are, open them, read them, type stuff, et cetera, all via a non-UI interface. And now this...

CLI is Gemini in your terminal, and it has the same sort of capabilities at a high level as Claude Code. So it's an agent: you launch it and you tell it what you want it to do and it goes off and does it. And it sort of takes turns between it doing things and you telling it to follow up, to change what it's doing or to check what it's doing, etc.
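To make that turn-taking concrete, here is a minimal sketch of the kind of loop a terminal agent like this runs. It is not Google's actual implementation; call_model is a stub standing in for a real API call, and the tool set is purely illustrative.

```python
# Minimal sketch of a terminal agent loop: the model either requests a tool call
# or declares it is done, and tool results get fed back into the conversation.
import subprocess
from pathlib import Path

def call_model(history):
    """Stub for the real LLM call; an actual agent would send `history` to an API
    and parse the model's chosen action out of the response."""
    return {"tool": "shell", "args": {"cmd": "ls"}}  # dummy decision for illustration

TOOLS = {
    "read_file":  lambda args: Path(args["path"]).read_text(),
    "write_file": lambda args: Path(args["path"]).write_text(args["content"]),
    "shell":      lambda args: subprocess.run(
        args["cmd"], shell=True, capture_output=True, text=True).stdout,
}

def agent_turn(user_request, max_steps=5):
    history = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):                          # the agent acts until done
        action = call_model(history)
        if "answer" in action:                          # model says it is finished
            return action["answer"]
        result = TOOLS[action["tool"]](action["args"])  # execute the requested tool
        history.append({"role": "tool", "content": str(result)})
    return "(out of steps; in a real CLI you would follow up and the loop continues)"

print(agent_turn("list the files in this directory"))
```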

With this launch, Google is being pretty aggressive, giving away a lot of usage: 60 model requests per minute and 1,000 requests per day. That's a very high allowance as far as caps go, and there's a lot of usage for free without having to pay. I'm not sure if that is the cap for the free tier, but...

For now, you're not going to have to pay much. I'm sure sooner or later you get to the Claude Code type of model, where to use Claude Code at the highest level, you have to pay $200 per month or $100 a month, which is what we at our company already do because Claude Code is so useful.

From what I've seen of conversations online, the vibe eval is that this is not quite as good as Claude Code. It isn't as capable at software engineering, at using tools, at just generally figuring things out as it goes. But it was just released, so it could be a strong competitor soon enough.

Yeah. I'm still amazed at how quickly we've gotten used to the idea of a million token context window, by the way, because this is powered by Gemini 2.5 Pro, the reasoning model. And that's part of what's in the backend here. So that's going to be the reason also that it doesn't quite live up to the Claude standard, which is obviously a model that's a lot... I don't know. It just seems to work better with code. I'm curious about when that changes, by the way, and what

Anthropic's actual recipe is. Like, why is it working so well? We don't know, obviously, but someday, maybe after the singularity, when we're all one giant hive mind, we'll know what actually was going on to make the Claude models this good and persistently good. But in any case...

Yeah, it's a really impressive play. The advantage that Google has, of course, over Anthropic currently is the availability of just a larger pool of compute. And so when they think about driving costs down, that's where you see them trying to compete on that basis here as well. So a lot of free prompts, a lot of free tokens, I should say, good deals on the token counts that you put out. So it's one way to go. And I think as the ceiling rises on the capabilities of these models,

Eventually cost does become a more and more relevant thing for any given fixed application. So that's an interesting dynamic, right? The frontier versus the fast followers. I don't know if it's quite right to call Google a fast follower. They're definitely doing some frontier stuff, but anyway. Yeah, so, interesting next move here. Part of the productionization, obviously, of these things and of entering workflows in very significant ways, I think.

This is heading in slow increments towards a world where agents are doing more and more and more. And context windows, coherence lengths are all part of that. Right. Yeah. We discussed last year, like towards the beginning of last year, there was a real kind of hype train for agents and the agentic future. And I think Claude Code and Gemini CLI are...

are showing that we are definitely there. In addition to things like Replit, Lovable, broadly speaking, LLMs have gotten to a point, partially because of reasoning, partially, presumably, just due to improvements in LLMs, where you can use them in agents and they're very successful. From what I've seen, part of the reason Claude Code is so good is not just Claude, it's also just

Claude Code, particularly the agent, is very good at using tools. It's very good at doing text search, text replacement. It's very keen on writing tests and running them as it's doing software engineering. So it is a bit different than just thinking about an LLM. It's a whole sort of

suite of what the agent does and how it goes about its work that makes it so successful. And that's something you don't get out of the box with LLM training, right? Because tool usage is not in your pre-training data. It's something kind of on top of it.

So that is yet another thing, similar to reasoning, where we are now going beyond the regime of you can just train on tons of data from the internet and get it for free. More and more things, in addition to alignment, now you need to add to the LLM beyond just throwing a million gigabytes of data at it.
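As a rough illustration of how tool use gets layered on top of a model at the API level, here is how tools can be declared per request through the Anthropic Python SDK. The tool set here is illustrative rather than Claude Code's actual internal tooling, and the model ID is a placeholder.

```python
# Sketch of declaring tools on top of an LLM: the API is told what tools exist and
# returns tool_use requests that your own code then has to execute.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [
    {
        "name": "search_text",
        "description": "Search the repository for a string and return matching lines.",
        "input_schema": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
    {
        "name": "run_tests",
        "description": "Run the project's test suite and return the output.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whatever model you have access to
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user",
               "content": "Rename function foo to bar and make sure the tests still pass."}],
)
print(response.content)  # may contain tool_use blocks for your code to run
```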

It really is a system, right? Like at the end of the day, it's also not just one model. I think a lot of people have this image of like, you know, there's one monolithic model in the backend. Often there's a lot of, like, models choosing which models answer a prompt. And I'm not even talking about MOE stuff, like just literal software engineering in the backend that makes these things have the holistic feel that they do. So, yeah.

FYI, by the way, I didn't remember this, so I looked it up. CLI stands for command line interface, command line, another term for terminal. So again, for any non-programmers, fun detail.

And speaking of Claude Code, the next story is about Anthropic. And they have released the ability to publish artifacts. So artifacts are these little apps, essentially, you can build within Claude. You get a preview and interactive web apps, more or less. And as with some other ones, I believe Google allows you to publish Gems, as they call them. Now you can publish...

your artifacts and other people can browse them. They also added support for building apps with AI built in, with Claude being part of the app. So now if you want to build, like, a language translator app within Claude, you can do that, because the app itself can query Claude to do a translation.
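Conceptually, an AI-powered artifact comes down to a round trip like the one below: the app sends the user's text to Claude and displays the reply. Real artifacts are little web apps with Claude access built in, so this Python version using the ordinary Anthropic SDK is only a sketch of the pattern, and the model ID is a placeholder.

```python
# Sketch of the "app that queries Claude" pattern, e.g. a tiny translator.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate(text: str, target_language: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Translate the following into {target_language}:\n\n{text}",
        }],
    )
    return response.content[0].text  # first content block holds the reply text

print(translate("Hello and welcome to the Last Week in AI podcast.", "French"))
```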

So, you know, not a huge delta from just having artifacts, but another seeming trend where all the LLM providers tend to wind up at similar places, as far as adding things like artifacts and making it easy to share what you build. And, you know, it's...

Something that anyone can do. Users on the Free, Pro, and Max tiers can share, and it'll be interesting to see what people build.

And if I'm, if I'm Replit, I'm getting pretty nervous looking at this. Granted, obviously Replit has, so Replit, right, that platform that lets you essentially, like, launch an app really easily, abstracts away all the, like, server management and stuff. And, like, you've got kids launching games and all kinds of useful apps and learning to code through it. Really, really powerful tool and super, super, I mean, it's 10x year over year. It's growing really fast, but

you can start to see the frontier moving more and more towards, let's make it easier and easier at first for people to build apps. So we're going to have an agent that just writes the whole app for you or whatever and just produces the code. But at what point does it naturally become the next step to say, well, let's do the hosting. Let's abstract away all the things. You could see OpenAI, you could see Anthropic launching a kind of app store. That's not quite the right term, right? Because we're talking about

more fluid apps, but moving more in that direction, hosting more and more of it, and eventually getting to the point where you're just asking the AI company for whatever high-level need you have, and it'll build the right apps or whatever. That's not actually that crazy sounding today. And again, that swallows up a lot of the Replit business model, and it'll be interesting to see how they respond.

Yeah, and this is particularly true because of the converging or parallel trend of the Model Context Protocol (MCP), which makes it easy for AI to interact with other services. So now if you want to make an app that talks to your calendar, talks to email, talks to your Google Drive, whatever you can think of, basically any major tool you're working with, AI can integrate with it easily.

So if you want to make an app that does something with a connection to tools that you use, you could do that within Claude. So as you said, I think both Replit and Lovable are these emerging titans in the world of building apps with AI. And I'm sure they'll have a place in the kind of domain of more complex things where you need databases and you need authentication and so on and so on. But yeah,

If you need to build an app for yourself or for maybe just a couple of people to speed up some process, you can definitely do it with these tools now and then share them if you want.

And onto applications and business, as promised, kicking off with some OpenAI drama, which we haven't had in a little while. So good to see it isn't ending. This time it's following up on this io trademark kind of lawsuit that happened. We covered it last week.

We had OpenAI and Sam Altman announce the launch of this io initiative with Jony Ive. And there's another AI audio hardware company called IYO, spelled differently, I-Y-O instead of I-O. And they sued, alleging that OpenAI stole the idea and also the trademark. The names sound very similar, right?

And yeah, Sam Altman hit back and decided to publish some emails. Just screenshots of emails showing the founder of IYO, let's say, being very friendly, very enthusiastic about meeting with Altman and wanting to be invested in by OpenAI. And the basic gist of what Sam Altman said is that this founder, Jason Rugolo, who filed the lawsuit, was

kind of persistent in trying to get investments from Sam Altman. In fact, he even reached out in March, prior to the announcement with Jony Ive, and apparently Sam Altman, you know, let him know that the competing initiative he had was called io.

So, definitely, I think, an effective pushback on the lawsuit, similar in a way to what OpenAI also did with Elon Musk, just like, here's the evidence,

here are the receipts of your emails. I'm not too sure if what you're saying is legit. This is becoming, well, two is not yet a pattern, is it? Is it three? I forget how many it takes to make a pattern, they say. Then again, I don't know who they are or why they're qualified to tell us it's a pattern. But yeah, this is an interesting situation. One interesting detail kind of gives you maybe a bit of a window into how the balance of evidence is shaping up so far. We do know that in the lawsuit,

IYO, so not io, but IYO, I was going to say Jason Derulo, Jason Rugolo's company, did end up, sorry, where was it? They were granted a temporary restraining order against OpenAI using the io branding themselves. So OpenAI was forced to change the io branding

due to this temporary restraining order, which was part of IYO's trademark lawsuit. So at least at the level of the trademark lawsuit, there has been an appetite from the courts to put in this sort of preliminary temporary restraining order.

I'm not a lawyer, so I don't know what the standard of proof would be that would be involved in that. So at least at a trademark level, maybe it's like sounds vaguely similar enough. So yeah, for now, let's tell OpenAI they can't do this. But there's enough fundamental differences here between the devices that you can certainly see OpenAI's case for saying, hey, this is different. They claim that the IO hardware is not an in-ear device at all. It's not even a wearable device.

That's where that information that was itself doing the rounds comes from, this big deal that OpenAI's new device is not actually going to be a wearable after all. But we do know that apparently, so, Rugolo was trying to pitch a bunch of people on their idea, the io concept, sorry, the IYO concept, way back in 2022, sharing information about it with former Apple designer Evans Hankey, who actually went on to co-found io.

So there's a lot of overlap here. The claim from OpenAI is, look, you've been working on it since 2018. You demoed it to us. It wasn't working. There were these flaws. Maybe you fixed them since, but at the time it was a janky device. So that's why we didn't partner with you. But then you also have this whole weird overlap where, yeah, some of the founding members of the io team had apparently spoken directly to IYO before. So it's

pretty messy. I think we're going to learn a lot in the court proceedings. I don't think these emails give us enough to go on to make a firm determination about what, because we don't even know what the hardware is. And that seems to be at the core of this. So what is the actual hardware and how much of it did OpenAI, did LoveFrom, did IO actually see?

Right. And in the big scheme of things, this is probably not a huge deal. This is a lawsuit saying you can't call your thing io because it's too similar to our thing IYO. And it's also seemingly some sort of wearable AI thing. So worst case, presumably the name of the initiative by Sam Altman and Jony Ive changes. I think more than anything, this is just...

Another thing to track with OpenAI, right? Another thing that's going on that for some reason, right, we don't have these kinds of things with Anthropic or Mistral or any of these other companies. Maybe because OpenAI is the biggest, there just tends to be a lot of this, you know, in this case, legal business drama, not interpersonal drama, but nevertheless, a lot of headlines and

Honestly, juicy kind of stuff to discuss. Yeah, yeah, yeah. Yeah. So another thing going on and another indication of the way that Sam Altman likes to approach these kinds of battles in a fairly public and direct way.

Up next, we have Huawei MateBook contains Kirin X90, using SMIC 7nm N+2 technology. If you're a regular listener of the podcast, you're probably going, oh my God. Or maybe you aren't, I don't know, this is maybe a little in the weeds, but either way, you might want a refresher on what the hell this means, right? So

There was a bunch of rumors actually floating around that Huawei had cracked, sorry, that SMIC, which is China's largest semiconductor foundry or most advanced one, you can think of them as being China's domestic TSMC.

There's a bunch of rumors circulating about whether they had cracked the five nanometer node, right? That critical node that is what was used or a modified version of it was used to make the H100 GPU, the NVIDIA H100. So if China were to crack that domestically, that'd be a really big deal.

Well, those rumors are now being quashed, because this company, which is actually based in Canada, did an assessment. So TechInsights, we've actually talked a lot about their findings, sometimes even

while mentioning them by name, sometimes not. We really should. TechInsights is a very important firm in all this. They do these teardowns of hardware. They'll go in deep and figure out, oh, what manufacturing process was used to make this component of the chip, right? That's the kind of stuff they do. And they were able to confirm that, in fact, the Huawei Kirin

X90, a system on a chip, was actually not made using 5 nanometer equivalent processes, but rather using the old 7 nanometer process that we already knew SMIC had. So that's a big, big deal from the standpoint of their ability to onshore GPU fabrication domestically and keep up with the West.

So it seems like we're like two years down the road now from when SMIC first cracked the 7 nanometer node, and we're still not on the 5 nanometer node yet. That's really, really interesting. And so worth saying, like Huawei never actually explicitly said that this new PC had a 5 nanometer node. There's just a bunch of rumors about it. So what we're getting now is just kind of the decisive quashing of that rumor.

Right. And the broader context here is, of course, that the U.S. is preventing NVIDIA from selling top-of-the-line chips to Chinese companies. And that does limit the ability of China to create advanced AI. They are trying to get the ability domestically to produce chips competitive with NVIDIA. Right now, they're, let's say, about two years behind, is my understanding.

And this is one of the real bottlenecks: if you're not able to get the state-of-the-art fabrication process for chips, there's just less compute you can get on the same size

of chip, right? It's just less dense. And this arguably is the hardest part, right? To get this thing. It takes forever, as you said, two years with just this process. And it is going to be a real blocker if they're not able to crack it. Yeah, the fundamental issue China is dealing with is, because they have crappier nodes, so they can't fab the same quality of nodes as TSMC, they're forced to either steal TSMC-fabbed chips,

or find clever ways of getting TSMC to fab their designs, often by using subsidiaries or shell companies to make it seem like, you know, maybe we're coming in from Singapore and asking TSMC to fab something, or we're coming in from a clean Chinese company, not Huawei, which is blacklisted.

And then the other side is because their alternative is to go with these crappier seven nanometer process nodes, those are way less energy efficient. And so the chips burn hotter or they run hotter rather, which means that you run into all these kinds of heat induced defects over time. And we covered that I think last or two episodes ago, the last episode I was on. So anyway, there's a whole kind of hairball of different problems that come from ultimately the fact that SMIC has not managed to keep up with TSMC.

Right. And you're seeing all these $10 billion, $20 billion data centers being built. Those are being built with racks and racks and huge amounts of GPUs. The way you do it, the way you supply energy, the way you cool it, etc., all of that is conditioned on the hardware you have in there. So it's very important to ideally have the state of the art to build with.

Next story, also related to hardware developments, this time about AMD: they now have an Ultra Ethernet-ready network card, the AMD

Pensando Pollara, which provides up to 400 gigabits per second of performance. And this was announced at their Advancing AI event. It will actually be deployed by Oracle Cloud with the AMD Instinct MI355X GPUs and the network card. So this is

a big deal, because AMD is trying to compete with NVIDIA on the GPU front. And their series of GPUs does seem to be catching up, or at least has been shown to be quite usable for AI. This is another part of the stack, the inter-chip communication, but it's very important and very significant in terms of competing with what NVIDIA is doing.

Yeah, 100%. This is, by the way, the industry's first ultra-Ethernet-compliant NIC, so a network interface card. So what the NIC does, you've got, and you go back to our hardware episode to kind of see more detail on this, but in a rack, say, at the rack level or at the pod level, you've got all your GPUs that are kind of tightly interconnected with accelerator interconnect. This is often like the NVIDIA product for this is NVLink. This is super low latency, super expensive interconnect.

But then if you want to connect like pods to other pods or racks to other racks, you're now forced to hop through a slower interconnect, part of what's known sometimes as the back-end network.

And when you do that, the NVIDIA solution you'll tend to use for that is InfiniBand, right? So you've got NVLink for the really like within a pod, but then from pod to pod, you have InfiniBand. And InfiniBand has been a go-to de facto like kind of gold standard in the industry for a while. Companies that aren't NVIDIA don't like that because it means that NVIDIA owns more of the stack and has an even deeper kind of de facto monopoly on different components. Right.

And so you've got this thing called the Ultra Ethernet Consortium that came together. It's founded by a whole bunch of companies, AMD, notably Broadcom. I think Meta and Microsoft were involved, Intel. And they came together and said, hey, let's come up with an open source standard for

for this kind of interconnect with AI optimized features that basically can compete with the InfiniBand model that NVIDIA has out. So that's what UltraEthernet is. It's been in the works for a long time. We've just had the announcement of specification 1.0 of that UltraEthernet protocol, and that's specifically for hyperscale AI applications and data centers.

And so this is actually a pretty seismic shift in the industry. And there are actually quite interesting indications that companies are going to shift from InfiniBand to this sort of protocol. And one of them is just cost economics. Like Ethernet has massive economies of scale already across the entire networking industry and InfiniBand's more niche. So as a result, you kind of have ultra Ethernet

chips and switches that are just so much cheaper. So you'd love that. You also have vendor independence because it's an open standard. Anyone can build to it instead of just having NVIDIA own the whole thing. So the margins go down a lot and people really, really like that, obviously. All kinds of operational advantages. It's just operationally more simple because data centers already know Ethernet and how to work with it. So anyway, this is a really interesting thing to watch. I know it sounds like

It sounds boring. It's the interconnect between different pods in a data center. But this is something that executives at the top labs really sweat over, because there are issues with the InfiniBand stuff. This is one of the key rate limiters in terms of how big models can scale. Right. Yeah. To give you an idea, Oracle is apparently planning to deploy these latest AMD GPUs in a zettascale AI cluster with up to 131,072

Instinct MI355X GPUs. So when you get to those numbers, like, think of it, 131,000 GPUs. GPUs aren't small, right? GPUs are pretty big. They're not like a little chip. They're, I don't know, like...

notebook-sized-ish. And there are now 131,000 of them that you need to connect. And when you say pod, right, typically you have this rack of them, almost a bookcase, you can think, where you connect them with wires, but you can only get, I don't know how many, typically 64 or something, on that side. When you get to 131,000,

this kind of stuff starts really mattering. And in their slides at this event, they did, let's say, very clearly compare themselves to the competition, saying that this has 20x the scale over InfiniBand, whatever that means, has performance 20% over the competition, stuff like that. So AMD is very much trying to compete and to offer things that are in some ways ahead of NVIDIA and others like Broadcom and so on.
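A quick back-of-envelope on why the pod-to-pod network matters at that scale; the 64-GPUs-per-pod figure is just the rough number mentioned above, not Oracle's actual rack spec.

```python
# Rough arithmetic on the scale of back-end networking in a cluster like this.
total_gpus = 131_072      # Oracle's planned MI355X cluster size
gpus_per_pod = 64         # rough figure for GPUs tightly coupled by accelerator interconnect
pods = total_gpus // gpus_per_pod
print(f"{pods} pods have to talk to each other over the back-end network")  # 2048 pods
```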

Next up, another hardware story, this time dealing with energy. Amazon is joining the big nuclear party by buying 1.92 gigawatts of electricity from Talen Energy's

Susquehanna nuclear plant in Pennsylvania. So nuclear power for AI, it's all the rage. Yeah. I mean, so we've known about, if you flip back, originally this was the 960 megawatt deal that they were trying to make. And that got killed by regulators who were worried about customers on the grid. So essentially everyday people who are using the grid who would

in their view, unfairly shoulder the burden of running the grid. Today, you know, Susquehanna powers the grid, and that means every kilowatt hour that they put in leads to transmission fees that support the grid's maintenance. And so what Amazon was going to do was going to go behind the meter, basically link the power plant directly to their data center without going through the grid. So there wouldn't be grid fees.

And that basically just means that the general kind of grid infrastructure doesn't get to benefit from those fees over time, sort of like not paying a toll when you go on a highway. And this new deal, which gets us to 1.92 gigawatts,

is a revision of that. It's got Amazon basically going in front of the meter, going through the grid in the usual way. There are going to be, as you can imagine, a whole bunch of infrastructure pieces that need to be reconfigured, including transmission lines. Those will be done in spring of 2026. And the deal apparently covers energy purchased through 2042, which is sort of amusing, because, like, imagine trying to plan this far ahead of time. But yeah. Yeah.

I guess they are predicting that they'll still need electricity by 2042, which, assuming X-risk doesn't come about, I suppose is fair. Yeah. Yeah. Next story, also dealing with nuclear and dealing with NVIDIA. It is joining Bill Gates and others in backing TerraPower, a company building nuclear reactors for powering data centers. So this is through NVIDIA's venture capital arm, NVentures.

And they have invested in this company, TerraPower, investing, it seems like, $650 million alongside Hyundai. And TerraPower is developing a 345-megawatt Natrium plant in Wyoming right now. So they're, you know, I guess...

In the process of starting to get to a point where this is usable, although it probably won't come for some years. Your instincts are exactly right on the timing too, right? So there's a lot of talk about SMRs, like small modular reactors, which are just a very efficient way and very safe way of generating nuclear power on site. That's the exciting thing about them.

They are the obvious, apart from like Fusion, they are the obvious solution of the future for powering data centers. The challenge is when you talk to data center companies and builders, they'll always tell you like, yeah, SMRs are great, but we're looking at first approvals, first SMRs generating power like at the earliest, like 2029, 2030.

2030 type thing. So if you have sort of shorter AGI timelines, they're not going to be relevant at all for those. If you have longer timelines, even kind of somewhat longer timelines, then they do become relevant. So it's a really interesting space where we're going to see a turnover in the kind of energy generation infrastructure that's used.

And people talk a lot about China and their energy advantage, which is absolutely true. I'm quite curious whether this allows the American energy sector to do a similar leapfrogging on SMRs that China did, for example, on mobile payments, right? When you just do not have the ability to build nuclear plants in less than 10 years, which is the case for the United States, which is

like, lacking that know-how and, frankly, the willingness to deregulate to do it and the industrial base, then it kind of forces you to look at other options. And so if there's a shift just in the landscape of power generation, it can introduce some opportunities to play catch-up. So, sort of, I guess that's a hot take there that I haven't thought enough about, but that's an interesting dimension anyway to the SMR story.

By the way, one gigawatt is apparently equivalent to 1.3 million horsepower. So not sure if that gives you an idea of what a gigawatt is, but it's a lot of energy. Or, one gigawatt is a lot. Yeah. One million homes for one day, or what does that actually mean? So, gigawatt is a unit of power. So it's like the amount of power that a million homes just consume on a running basis. Yeah, exactly.
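Those figures are easy to sanity-check; the average household draw used below is a rough assumption for illustration, not a number from the episode.

```python
# Sanity-checking the gigawatt comparisons.
watts_per_hp = 745.7                      # one mechanical horsepower in watts
print(1.3e6 * watts_per_hp / 1e9)         # ~0.97, so 1 GW is indeed roughly 1.3M horsepower

avg_home_draw_w = 1_200                   # assumed average household draw in watts
print(1e9 / avg_home_draw_w)              # ~833,000 homes powered continuously by 1 GW
```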

One gigawatt is a lot. So is 345 megawatts. Now moving on to some fundraising news. Mira Murati's company, Thinking Machines Lab, has finished up its fundraising, getting $2 billion at a $10 billion valuation. And this is the seed round. So yet another...

billion-dollar round, a billion-dollar seed round. And this is, of course, the former CTO of OpenAI, who left in 2024, I believe, and has been working on setting up Thinking Machines Lab, another competitor in the AGI space, presumably planning to train their own

models, having recruited various researchers, some of them from OpenAI, and now with billions to work with that they'll deploy, presumably, to train these large models. Yeah, it's funny. Everyone just kind of knew that it was going to have to be a number with billion after it, just because of the level of talent involved. It is a remarkable talent set. The round is led by Andreessen Horowitz. So A16Z on the cap table now,

Notably, though, Thinking Machines did not say what they're working on to their investors. At least that's what this article, that's what it sounds like. The wording is maybe slightly ambiguous. I'll just read it explicitly. You can make up your mind. Thinking Machines Lab had not declared what it was working on, instead using Murati's name and reputation to attract investors. So,

That suggests that A16Z cut, well, they didn't cut the full $2 billion check, but they led the round. So hundreds and hundreds of millions of dollars just on the basis of like, yeah, Mira's a serious fucking person. John Schulman's a serious fucking person. Jonathan Lachman, all kinds of people, Barret Zoph. These are really serious people. So we'll cut you an $800 million check, whatever they cut as part of that. That's

both insane and tells you a lot about how the space is being priced. The other weird thing we know, and we talked about this previously, but it bears kind of repeating:

So Mira is going to hold board voting rights that outweigh all other directors combined. This is a weird thing, right? Like, what is it with all these AGI companies and the really weird board structures? A lot of it is just like the OpenAI mafia, like people who worked at OpenAI did not like what Sam did and learned those lessons and then enshrined that in the way they run their company, in their actual corporate structure.

And Anthropic has their public benefit company setup with their oversight board. And now Thinking Machines has this Mira Murati dictatorship structure where she has final say basically over everything at the company. By the way, everything I've heard about her is exceptional. Every OpenAI person I've ever spoken to about Mira has just glowing things to say about her. And so even though $2 billion is not really enough to compete if you believe in scaling laws,

It tells you something about, you know, the kinds of decisions people will make about where they work include who will I be working with? And this seems to be a big factor, I would guess, in all these people leaving OpenAI. She does seem to be a genuinely exceptional person. Like I've never met her, but again, everything I've heard is just like glowing and both in terms of competence and in terms of kind of smoothness of working with her. So that may be part of what's attracting all this talent as well.

Yes, and on the point of not quite knowing what they're building, if you go to thinkingmachines.ai, this has been the case for a while, you'll get a page of text. The text is...

Let's say it reads like a mission statement that sure is saying a lot. There's stuff about scientific progress being a collective effort, emphasizing human AI collaboration, more personalized AI systems, infrastructure quality, advanced multimodal capabilities, research product co-design, empirical iterative approach to AI safety, measuring what truly matters.

I have no idea. This is like just saying a whole bunch of stuff, and you can really take away whatever you want. Presumably it'll be something that is competing with OpenAI and Anthropic fairly directly, is the impression. And yeah, near the bottom of the page at thinkingmachines.ai there is a

founding team section with a list of a couple dozen names; you can hover over each one to see their background. As you say, real heavy hitters. And then their advisors and a join us page. So yeah, it really tells you: if you gain a reputation and you have some real star talent in Silicon Valley, that goes a long way.

And on that note, next story, quite related: Meta has hired some key OpenAI researchers to work on their AI reasoning models. So a week ago or two weeks ago, we talked about how Meta paid a whole bunch of money, invested rather, in Scale AI and hired away the founder of Scale AI, Alex Wang, to head their new superintelligence efforts. Now there are these reports. I don't know if this is worth

highlighting particularly because it's OpenAI, or perhaps this one just comes with juicy details. I'm sure Meta has hired other engineers and researchers as well, but I suppose this one is worth highlighting. They did hire some fairly notable figures from OpenAI. So this is Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai, who I believe founded the

Zurich office? Oh, interesting. Anyway, they were a fairly significant team at OpenAI, or so it appears to me. And I think Lucas Beyer did post on Twitter and say that the idea that they were paid $100 million was fake news. This is another thing that's been up in the air; Sam Altman

has been taking, you could say, some gentle swipes, saying that Meta has been promising insane pay packages.

So all this to say is this is just another indication of Mark Zuckerberg very aggressively going after talent. We know he's been personally messaging dozens of people on WhatsApp and whatever, being like, hey, come work for Meta. And perhaps unsurprisingly, that is paying off in some ways in expanding the talent of this super intelligence team.

Yeah, there's a lot that's both weird and interesting about this. The first thing is anything short of this would be worth zero. When you're in Zuck's position and you are, I'll just sort of like, this is colored by my own interpretation of who's right and who's wrong in this space. But I think it's increasingly sort of just becoming clear in fairness. I don't think it's just my biases saying that.

When your company's AI efforts, despite having access to absolutely frontier scales of compute, so having no excuses for failure on the basis of access to infrastructure, which is the hardest and most expensive thing, when you've managed to tank that so catastrophically,

Because your culture is screwed up by having Yann LeCun as the mascot, if not the leader of your internal AI efforts, because he's not actually as influential as it sounds or hasn't been for a while on the internals of Facebook, but he has set the beat at Meta.

being kind of skeptical about AGI, being skeptical about scaling, and then like changing his mind in ego-preserving ways without admitting that he's changed his mind. I think these are very damaging things. They destroy the credibility of Meta and have done that damage. And I think the fact that Meta is so far behind today

is a reflection in large part, a consequence of Yann LeCun's personality and his inability to kind of update accordingly and maintain like epistemic humility on this. I think everybody can see it. He's like the old man who's still yelling at clouds and just like,

As the clouds change shape, he's trying to pretend they're not. But I think just speaking as if I were making the decision about where to work, that would be a huge factor. And it has just objectively played out in a catastrophic failure to leverage one of the most impressive fleets of AI infrastructure that there actually is. And so...

what we're seeing with this set of hires is people who are, I mean, so completely antithetical to Yann LeCun's way of thinking. Like Meta could not be pivoting harder in terms of the people it's poaching here. First of all, OpenAI, obviously one of the most scale-pilled organizations in the space, probably the most scale-pilled. Anthropic actually is up there too. But also, scale AI's Alex Wang. So, okay, that's interesting. Very scale-pilled dude, also very AI safety-pilled dude.

Daniel Gross, arguably quite AI safety-pilled, at least that was the mantra of safe superintelligence. Weird that he left that so soon. A lot of open questions about how safe superintelligence is doing, by the way, if Daniel Gross is now leaving. I mean, DG was the CEO, right? Co-founded it with Ilya, so what's going on there? So that's a hanging Chad, but just Daniel Gross being now...

over on the Meta side, you have to have enough of a concentration of exquisite talent to make it attractive for other exquisite talent to join. If you don't hit that critical mass, you might as well have nothing. And that's been Meta's problem this whole time. They needed to just, like, jumpstart this thing with a massive capital infusion. Again, these massive pay packages, that's where it's coming from. Just give people a reason to come, get some early proof points that get people excited about Meta again. And the weird thing is, with all this,

I'm not confident at all in saying this, but you could see a different line from Meta on safety going forward too, because Yann LeCun was so dismissive of it, but now a lot of people they've been forced to hire because there is, if you look at it objectively, a strong correlation between the people in teams who are actually leading the frontier and the people in teams who take loss of control over AI seriously. Now Meta is kind of forced to change in some sense its DNA to take that seriously. So I think that's just a really interesting...

like shift. And I know this sounds really harsh with respect to Yann LeCun, like, you know, take it for what it is. It's just one man's opinion. But I have spoken to a lot of researchers who feel the same way. And again, I mean, I think the data kind of bears it out. Essentially, Mark Zuckerberg is being forced to pay the Yann LeCun tax right now. And I don't know what happens to Yann LeCun going forward, but I do kind of wonder if his meta days may be numbered or, you know, if there's going to be a face-saving measure that has to be taken there.

Right. For context, Yann LeCun is Meta's chief AI scientist. He's been there for over a decade, hired, like, I think around 2013, 2012, by Meta, one of the key figures in the development of neural networks, really, over the last couple of decades, and certainly a major researcher and contributor to the rise of deep learning in general. But as you said, a skeptic on large language models and a proponent of, sort of, other approaches.

I will say I'm not entirely bought into this narrative personally. The person heading up the effort on Llama and LLMs was not Yann LeCun, as far as I'm aware. There was another division within Meta that focused on generative technology that has now been revamped. So the person leading the generative AI efforts in particular has left, and now there is an entirely new division

called AGI Foundations that is now being set up. So this is part of a major revamp. Yann LeCun is still leading his more, like, research publication type side of things. And perhaps, as far as I know, is not very involved in this side of scaling up Llama and LLMs and all of this, which is...

less of a research effort, more of an R&D, compete-with-OpenAI-and-so-on effort. Absolutely agree. And that was what I was referring to when I was saying Yann LeCun is not involved in the day-to-day product side of the org. It's been known for a while that he's not actually doing the heavy lifting on Llama, but he has defined what it means, essentially articulated Meta's philosophy on AI and AI scaling for the last however many years.

And so it's understood that when you join Meta, or at least it was, that you were buying into a sort of Yann LeCun-aligned philosophy, which I think is the kind of core driving problem behind where Meta finds itself today. Yeah, that's definitely part of it. I mean, that's part of the reputation of Meta as an AI research club. Also, I mean, part of the advantage of Meta and why people might want to go to Meta is because of their very open source

friendly nature. They're only very open source friendly because they're forced to be, because it's the only way they can get headlines while they pump out mediocre models. But regardless, it's still a factor here. One last thing worth noting on this whole story. I mean, you could do a whole speculative analysis of what went on inside Meta. They did also try to throw a lot of people at the problem, scale up from a couple hundred to like a thousand people.

I think they probably had a similar situation to Google, where it was, like, big company problems, right? OpenAI, Anthropic, they're still, they're huge, but they don't have big company problems. That's a great point. They have scaling company problems.

So this revamp could also help. Yeah. Alrighty. On to research and advancements. No more drama talk, I guess. Next, we have a story from DeepMind: they have developed AlphaGenome, the latest in their Alpha line of scientific models. So

This one is focused on helping researchers understand gene function. It's not meant for personal genome prediction, but more so just general identification of patterns. So it could help identify causative mutations in patients with ultra-rare cancers, for instance: which mutations are responsible for incorrect gene expression? I'm going to be honest.

You know, there's a lot of deep science here with regards to biology and genomics, which I am not at all an expert on. And the gist of it is similar to AlphaFold, similar to other alpha efforts on the benchmarks dealing with the problems that geneticists deal with, the kind of prediction issues, etc.

the analysis, AlphaGenome kind of knocks all existing techniques out of the park. On almost every single benchmark, it is superseding previous efforts. And this one model is able to do a lot of things all at once. So again, not really my background to comment on this too much, but I'm sure that this

is along the lines of AlphaFold, in the sense that AlphaFold was very useful scientifically for making predictions about protein folding. AlphaGenome is presumably going to be very useful for understanding genomics, for making predictions about which genes do what.

Things like that. It's a really interesting take. That's, I guess, a fundamentally different way of approaching the let's-understand-biology problem that Google DeepMind, and its subsidiary, I guess the company it spawned, Isomorphic Labs, which by the way Demis is the CEO of, has been very focused on, I hear, anyway.

When you look at AlphaFold, you're looking at essentially predicting the structure and, to some degree, the function of proteins from the Lego blocks that make up those proteins, right? The amino acids, the individual amino acids that get chained together, right? So you've got 20 amino acids you can pick from, and that's how you build a protein. And depending on the amino acids that you have, some of them are positive charge, some of them are negative, some of them are polar, some of them are not, and then the thing will fold in a certain way.

That is distinct from the problem of saying, okay, I've got a strand of 300 billion base pairs, sorry, 3 billion base pairs of DNA. And what I want to know is if I take this one base pair and I switch it from, I don't know, like from an A to a T, right? Or from a G to an A, what happens to the protein? What happens to the downstream kind of biological activity? What cascades does that have? What effects does it have?

And that question is a, it's an interesting question because it depends on your ability to model biology in a pretty interesting way. It also is tethered to an actual phenomenon in biology. So there's a thing called the single nucleotide polymorphism. There's some nucleotides in the human genome that you'll often see can either be like a G or a T or something.

And you'll see some people who have the G variant and some people have the T variant. And it's often the case that some of these variants are associated with a particular disease. And so there's like a... I used to work in a genomics lab doing cardiology research back in the day. And there's like famous variant called 9P21.3 or something. And if some people had, I forget what it was, the T version, you'd have a higher risk of getting coronary artery disease or atherosclerosis or whatever, and not if you had the other one. So...

Essentially what this is doing is it's allowing you to reduce in some sense, the number of experiments you need to perform. If you can figure out, okay, like we have all these different possible variations across the human genome, but only a small number of them actually matter for a given disease or effect.

And if we can model the genome pretty well, we might be able to pin down the variants we actually care about so that we can run more controlled experiments, right? So we know that, hey, you know, patient A and patient B, they may have like a zillion different differences in their genomes, but actually for the purpose of this effect, they're quite comparable or they ought to be. So this is anyway, really, I think, interesting next advance from Google DeepMind. And I expect that we'll see a lot more because they are explicitly interested in that direction.
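A minimal sketch of that reference-versus-alternate workflow: score the original sequence, score the sequence with the single base flipped, and look at the difference. The predict_expression function below is a dummy stand-in, not AlphaGenome's actual API.

```python
# Variant-effect scoring in miniature: compare model predictions for the reference
# sequence and for the sequence with one base changed (a SNP).
def predict_expression(sequence: str) -> float:
    """Dummy scorer; a real sequence-to-function model would predict expression,
    splicing, chromatin accessibility, and so on."""
    return sequence.count("GC") / max(len(sequence), 1)  # toy proxy, not real biology

def variant_effect(ref_seq: str, pos: int, alt_base: str) -> float:
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]  # apply the single-base change
    return predict_expression(alt_seq) - predict_expression(ref_seq)

ref = "ATGCGTACGTTAGCGCATGA"
print(variant_effect(ref, pos=7, alt_base="T"))  # effect of flipping position 7 to T
```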

Right. And they released a pretty detailed research paper, a preprint, on this, as they did for AlphaFold: a 55-page paper describing the model, describing the results, describing the data, all that. They also released an API, so a client-side ability to query the model. And it is free of charge for non-commercial use, with some query limiting.

So yeah, again, similar to AlphaFold, they are making this available to scientists to use. They haven't open sourced this yet, the model itself, but they did explain how it works. So certainly exciting and always fun to see DeepMind doing this kind of stuff.

And up next, we have direct reasoning optimization, DRO. So we've got, you know, GRPO, we've got DPO, we've like, you know, there's so many, so many POs or ROs or O's, so many O's. So LLMs can reward and refine their own reasoning for open-ended tasks. I like this paper. I like this paper a lot. It's, I think I might've talked about this on the podcast before. I used to have a

who would ask these very simple questions when you were presenting something, and they were embarrassingly simple. And you would be embarrassed to ask that question, but then that always turns out to be the right and deepest question to ask. This is one of those papers. It's a very simple concept, but it's something that when you realize it, you're like, oh my God, that was missing. So first, let's just talk about how currently we typically train reasoning into models. So

You have some output that you know is correct, right? Some answer, the desired or target output, and you've got your input. So what you're going to do is you're going to feed your input to your model. You're going to get it to generate a bunch of different reasoning traces. And then in each case, you're going to look at those reasoning traces, feed them into the model, and based on the reasoning trace that the model generated, see what probability it assigns to the target output that you know is correct.

So reasoning traces that are correct in general will lead to a higher probability that the model places on the target outcome because it's the right outcome. So if the reasoning is correct, it's going to give a higher probability to the outcome. So this is sort of, it feels a little bit backwards from the way we normally train these models, but this is how it's done, at least in GRPO, group relative policy optimization. So essentially you reward the model to incentivize high probability of the desired outcome.

output conditioned on the reasoning traces. And this makes you generate over time better and better reasoning traces because you want to generate reasoning traces that assign higher probability to the correct output. So the intuition here is if your reasoning is good, you should be very confident about the correct answer, right? Now this breaks and it breaks in a really interesting way. Even if your reference answer is exactly correct,

you can end up being too forgiving to the model during training because the way that you score the model's confidence in the correct answer based on the reasoning traces is you average together essentially the confidence scores of each of the answer tokens in the correct answer.

Now, the problem is the first token of the correct answer often gives away the answer itself. So even if the reasoning stream was completely wrong, like even if, let's say the question was like, who scored the winning goal in the soccer game? And the answer was Lionel Messi.

If the model's reasoning is like, I think it was Cristiano Ronaldo, the model is going to, okay, from there, assign a low probability to Lionel, which is the first word of the correct answer. But once it reads the word Lionel, the model knows that Messi must be the next token. So it's going to assign actually a high probability to Messi, even though its reasoning trace said Cristiano Ronaldo.

And so essentially this suggests that there are some tokens in the answer that are going to actually like correctly reflect the quality of your model's reasoning. So if your model's reasoning was, I think it was Cristiano Ronaldo and the actual answer was Lionel Messi, well, Lionel, you should expect it to have very low confidence in. So that's good. You'll be able to actually correctly determine that your reasoning was wrong there. But once you get Lionel,

as part of the prompt, then Messi all of a sudden becomes obvious. And so you get a bit of a misfire there. So essentially what they're going to do is they're going to calculate, like, they'll feed in a whole bunch of reasoning traces and they'll look at each of the tokens in the correct output and see which of those tokens vary a lot. Tokens that are actually reflective of the quality of the reasoning

should have high variance, right? Because if you have a good reasoning trajectory, those tokens should have high confidence, and if you have a bad reasoning trajectory, they should have low confidence. But then you have some less reasoning-reflective tokens, like, say, the Messi in Lionel Messi, because Lionel has already given it away. You should expect Messi to consistently have

high confidence, because again, even if your reasoning trace is totally wrong, by the time you've read Lionel, Messi is obvious. It's almost like if you're writing a test and you can see the first word of the correct answer: even if your thinking was completely wrong, you're going to get the correct second word if the answer is Lionel Messi.

So anyway, this is just a way that they use to detect good reasoning. And then they feed that into a broader algorithm that, beyond that, is fairly simple, nothing too shocking. They just fold this into something that looks a lot like GRPO to get this DRO algorithm.
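
To make that concrete, here's a minimal sketch of the token-weighting idea in Python. It assumes you've already computed, for each sampled reasoning trace, the probability the model assigns to every token of the reference answer; the function name, the toy numbers, and the exact variance-based weighting are my own illustration of the intuition, not the paper's actual DRO implementation, which folds a reward like this into a GRPO-style update.

```python
import numpy as np

def reasoning_reflection_reward(answer_token_probs: np.ndarray) -> np.ndarray:
    """answer_token_probs has shape (num_traces, num_answer_tokens); entry [i, j]
    is the probability the model assigns to answer token j of the reference
    answer, conditioned on sampled reasoning trace i. Returns one reward per trace."""
    # Tokens whose probability varies a lot across traces are the ones that
    # actually reflect reasoning quality ("Lionel"); tokens that are easy once
    # the earlier answer tokens are visible ("Messi") have low variance.
    token_variance = answer_token_probs.var(axis=0)           # (num_tokens,)
    weights = token_variance / (token_variance.sum() + 1e-8)  # normalize to sum to 1

    # Reward each trace by its variance-weighted confidence in the reference
    # answer, rather than a plain average over all answer tokens.
    return answer_token_probs @ weights                       # (num_traces,)

# Toy example: three reasoning traces, reference answer = ["Lionel", "Messi"].
# Trace 0 reasons about the right player; traces 1 and 2 reason about the wrong one.
probs = np.array([
    [0.90, 0.95],  # good reasoning: confident in "Lionel", then "Messi"
    [0.05, 0.90],  # bad reasoning: low on "Lionel", but "Messi" is obvious after it
    [0.10, 0.92],
])
print(reasoning_reflection_reward(probs))  # trace 0 scores far higher than 1 and 2
```

With a plain average over answer tokens, the bad traces would still get partial credit from the easy "Messi" token; the variance weighting is what pushes the reward toward the tokens that actually discriminate good reasoning from bad.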

Right, yeah. They spend a while in the paper contrasting it with other recent work that doesn't pay attention to individual tokens, basically. So, just to contextualize what you were saying, their focus is on this token-level

R3, the reasoning reflection reward. And DRO, direct reasoning optimization, is basically GRPO, which is what people generally use for RL, typically with verifiable rewards. Here the focus is on how to train in a more open-ended fashion over long reasoning chains; they identify some of these issues in existing approaches and highlight

this reasoning reflection reward, which is basically looking at consistency between the tokens in the chain of thought and in the output as a signal to optimize over. And as you might expect, they do some experiments and show that this winds up being quite useful. And I think it's another indication that

we are still in the early-ish days of using RL to train reasoning. There's a lot of noise and a lot of significant insights still being surfaced. Last thing: DRO is, I guess, kind of a reference to DPO. DPO is direct preference optimization versus direct reasoning optimization. Not super related, it's just, I guess, fun naming conventions, because aside from

arguably being sort of analogous in terms of the difference between RL-based preference alignment and DPO. Anyway, it's kind of a funny reference. Yeah.

Next paper: Farseer, a refined scaling law in large language models. So we've talked about scaling laws a ton. Basically, you collect a bunch of data points of, you know, if you use this much compute or this many training FLOPs or whatever, you get this particular loss on language prediction, typically

on the actual metric of perplexity. And then you fit some sort of equation to those data points. And what tends to happen is you get a fairly good fit

that holds for future data points as you keep scaling up, scaling up, scaling up, and your loss goes down and down and down. And people have found that, somewhat surprisingly, you can get a very good fit that is very predictive, which was not at all a common idea or something that people had really tried pre-2020.

So what this paper does is basically do that, but better. It's a novel and refined scaling law that provides enhanced predictive accuracy. And they do that by systematically constructing a model loss surface and doing a better

job of fitting to empirical data. They say that they improve upon the Chinchilla law, one of the big ones from a couple of years ago, by reducing extrapolation error by 433%. So a much more reliable law, so to speak.

Yeah, the Chinchilla scaling law was somewhat famously Google's correction to the initial OpenAI scaling law that was proposed in a 2020 paper, the so-called Kaplan scaling law.

And so Chinchilla was sort of heralded as this kind of big and ultimately maybe pseudo final word on how scaling would work. It was more data heavy than the Kaplan scaling laws, notably. But what they're pointing out here is Chinchilla works really well for mid-sized models, which is basically where it was calibrated, like, you

you know, what it was designed for, but it doesn't do great on very small or very large models. And obviously, given that scaling is a thing, very large models matter a lot. And the whole point of a scaling law is to extrapolate from where you are right now to see, like, okay, well, if I trained a model at a hundred times the scale and therefore at,

you know, let's say a hundred times this budget, where would I expect to end up? And you can imagine how much depends on those kinds of decisions. So you want a scaling law that is really well calibrated and extrapolates really well, especially to very large models. They do a really interesting job in the paper.

We won't go into detail, but especially if you have a background in physics, like thermodynamics, they play this like really interesting game where they'll use finite difference analysis to kind of separate out dependencies between N, the size of the model, and D, the amount of data that it's trained on.

And that ultimately is kind of the secret sauce, if you want to call it that here. There's a bunch of other hijinks, but the core pieces, they sort of break the loss down into different terms, one of which only depends on N, the other of which only depends on D. So one is just model size dependent. The other is only dependent on the size of the training data set.

But then they also introduced this interaction effect between N and D, between the size of the model and the amount of data it's trained on. And then they end up deriving what should that term look like? That's one of the framings of this that's really interesting. Just to kind of nutshell it, if Chinchilla says that data scaling follows

a consistent pattern, like D to the power of some negative beta coefficient, regardless of model size. Like, no matter how big your model is, it's always D to the power of negative beta. So if I give you the amount of data, you can determine the contribution of the data term.
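
To make the Chinchilla-style form concrete, here's a minimal sketch of fitting L(N, D) = E + A / N^alpha + B / D^beta to some made-up training runs with SciPy. The constants and data points below are invented purely for illustration, and Farseer's actual functional form is more involved than this.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, alpha, B, beta):
    """Chinchilla-style loss surface: L(N, D) = E + A / N**alpha + B / D**beta.
    Note the data exponent beta is the same no matter how big the model is."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Made-up "training runs": model sizes N (params), dataset sizes D (tokens), and
# the final loss of each run. Real scaling-law papers fit many actual runs.
N = np.array([1e8, 1e8, 1e8, 1e9, 1e9, 1e9, 1e10, 1e10, 1e10])
D = np.array([2e9, 2e10, 2e11, 2e10, 2e11, 2e12, 2e11, 2e12, 2e13])
true_params = (1.7, 400.0, 0.34, 410.0, 0.28)   # invented constants
L = chinchilla_loss((N, D), *true_params)

fitted, _ = curve_fit(chinchilla_loss, (N, D), L,
                      p0=[1.0, 100.0, 0.3, 100.0, 0.3], maxfev=20000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], fitted)))

# The whole point is extrapolation: predict the loss of a much bigger run
# before spending the compute on it.
print("predicted loss at N=1e12, D=2e14:", chinchilla_loss((1e12, 2e14), *fitted))
```

Notice that beta is a single fitted constant here, which is exactly the assumption Farseer relaxes.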

What Farseer says is that data scaling actually depends on model size. Bigger models just fundamentally learn from data in a different way. And we'll park it there, but there's a lot of cool derivation to figure out exactly how that term has to look. Exactly. And this is very useful, not just to know what you're going to get. That aspect of it means that for a given compute budget,

You can predict what balance of data to model size is likely optimal. And basically, when you're spending millions of dollars training a model, it's pretty nice to know these kinds of things, right? And one more paper. Next one is LLM First Search, Self-Guided Exploration of the Solution Space.

So the gist of this is there are many ways to do search where search just means, you know, you look at one thing and then you decide on some other things to look at and you keep doing that until you find a solution.

So one of the typical ways is Monte Carlo tree search, a classic algorithm. This was, for instance, used in AlphaGo. If you want to combine this with an LLM, typically what you do is have the LLM assign some score to a given node and perhaps make some predictions, and then you have an existing algorithm to sample or to decide where to go.

The key difference here with LLM-First Search is basically: forget Monte Carlo tree search, forget any preexisting search algorithm or technique, just make the LLM decide where to go. It can decide how to do the search. And they say that this is more flexible, more context sensitive, requires less tuning, and just seems to work better.

Yeah. It's all prompt-level stuff, right? So there's no optimization going on, no training, no fine-tuning. It's just: give the model a prompt. So,

Number one, find a way to represent the sequence of actions that have led to the current moment in whatever problem the language model is trying to solve in a way that's consistent. So like essentially format, let's say all the chess moves up till this point in a consistent way so that the model can look at the state and the history of the board, if you will.

And then give the model a prompt that says, okay, from here, like I want you to decide whether to continue on the current path or look at alternative branches, alternative trajectories. The prompt is like, here are some important considerations when deciding whether to explore or continue. And then it lists a bunch.

And then similarly, they have the same, but for the evaluation stage where you're scoring the available options and getting the model to choose the most promising one. So, you know, it's like, here are some important considerations when evaluating possible operations that you could take or actions you could take.

So once you combine those things together, basically at each stage, I'll call it of the game or the problem solving. The model has a complete history of all the actions taken up to that point. It's then prompted to evaluate the options before it and to decide whether to continue to explore and kind of add new options or to select one of the options and execute against it.
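
Here's a rough sketch of that loop in Python, with a stubbed-out llm callable standing in for whatever model you'd query. The prompt wording, the data structures, and the explore-versus-continue handling are my own simplification, not the paper's actual prompts or code.

```python
from typing import Callable, List, Optional

def llm_first_search(llm: Callable[[str], str],
                     initial_state: str,
                     expand: Callable[[str], List[str]],
                     is_solved: Callable[[str], bool],
                     max_steps: int = 50) -> Optional[str]:
    """Prompt-level search: the LLM itself decides which frontier option looks
    most promising and whether to keep going down that path or branch out."""
    frontier: List[str] = [initial_state]  # candidate states the model can pick from
    history: List[str] = []                # everything tried so far, shown in prompts

    for _ in range(max_steps):
        if not frontier:
            return None

        # 1) Evaluation step: ask the model which frontier option is most promising.
        eval_prompt = (
            "History of states visited so far:\n" + "\n".join(history) +
            "\n\nCandidate states:\n" +
            "\n".join(f"{i}: {s}" for i, s in enumerate(frontier)) +
            "\n\nHere are some important considerations when evaluating options: ...\n"
            "Reply with only the index of the most promising state."
        )
        choice = int(llm(eval_prompt).strip())
        choice = max(0, min(choice, len(frontier) - 1))  # clamp a sloppy reply
        state = frontier.pop(choice)
        history.append(state)
        if is_solved(state):
            return state

        # 2) Exploration step: ask the model whether to continue down this path
        #    or fall back to other branches first.
        explore_prompt = (
            "Current state:\n" + state +
            "\n\nHere are some important considerations when deciding whether to "
            "explore or continue: ...\nReply with only CONTINUE or EXPLORE."
        )
        children = expand(state)
        if llm(explore_prompt).strip().upper() == "CONTINUE":
            frontier = children + frontier  # keep pushing down the current path
        else:
            frontier = frontier + children  # revisit other branches first
    return None
```

The design choice that distinguishes this from Monte Carlo tree search is that the branching and prioritization decisions live entirely in the prompts, so there's no hand-tuned exploration constant or rollout policy to calibrate.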

Anyway, that's basically it. It's a pretty conceptually simple idea: just offload the tree and branching structure development to the model, so it's thinking it through in real time. Pretty impressive performance jumps. So when using GPT-4, compared with standard Monte Carlo tree search on this game of Countdown, where essentially you're given a bunch of numbers and all the standard mathematical operations, addition, division, multiplication, subtraction,

you're trying to figure out how to combine these numbers to get a target number. So at each stage you have to choose, okay, do I try adding these together? Anyway, it's 47% using this technique versus 32% using Monte Carlo tree search, and

And this effect amplifies. So the advantage amplifies as you work with stronger models. So on O3 Mini, for example, 79% versus 41% for Monte Carlo Tree Search. So reasoning models seem to be able to take advantage of this. You can think of it as a kind of scaffold a lot better. It also uses fewer tokens. So it's getting better performance. It's using fewer tokens, so less compute.

than Monte Carlo tree search as well. So that's really interesting, right? This is a way more efficient way of squeezing performance out of existing models, and it's all just based on very interpretable and tweakable prompts. Right. And they compare this not just to Monte Carlo tree search; they also compare it to Tree of Thoughts, breadth-first search, best-first search. All of these comparisons are, by the way, pretty significant, because search broadly is like

There's a sequence of actions I can take and I want to get the best outcome. And, you know, so you need to think many steps ahead.

And branches here mean, like, I take this step and this step and this step. You can either go deeper or wider in terms of how many steps you consider: one step ahead, three steps ahead. And this is essential for many types of problems. Chess, Go, obviously, but broadly we do search in all sorts of things. So having a better approach to search means you can do better reasoning, which means you can do better problem solving.

And moving on to policy and safety, we have one main story here, called Unsupervised Elicitation of Language Models. This is really interesting. And I'll be honest, it was a head-scratcher for me. I spent an embarrassing amount of time with Claude trying to help me through the paper, which is sort of ironic because, if I remember, it's an Anthropic paper. But this is essentially a way of using

a language model's internal understanding of logic to help it solve problems.

So imagine that you have a bunch of math problems and solutions. So for example, you know, what's five plus three, and then you have a possible solution, right? Maybe it's eight. The next problem is like, what's seven plus two, and you have a possible solution. And that possible solution is maybe 10, which is wrong, by the way. So some of these possible solutions are going to be wrong. So you have a bunch of math problems and possible solutions, and you don't know which are, which are correct and incorrect.

And you want to train a language model to identify correct solutions, right? You want to figure out which of these are actually correct. So imagine you just lay these all out in a list. You have, you know, what's five plus three and then solution eight, what's seven plus two solution 10 and so on. Now what you're going to do is you're going to randomly assign correct and incorrect labels to a few of these examples, right?

So you'll say, you know, five plus three equals eight. And you'll just randomly say, okay, that's correct. And seven plus two equals 10, which by the way is wrong, but you'll randomly say that's correct. And then you're going to get the model to say, given the correctness scores that we have here, given that solution one is correct and solution two is correct,

what should solution three's label be, roughly? Or, you know, given all the correct and incorrect labels that we've assigned randomly, secretly, what should this missing label be?

And generally, because you've randomly assigned these labels, the model is going to get really confused because there's a logical inconsistency between these randomly assigned labels. A bunch of the problems that you've labeled as correct are actually wrong and vice versa. And so now what you're going to do is essentially try to like measure how confused the model is about that problem. And you are then going to flip

one label. So you'll flip the label on one of these problems from correct to incorrect, say, and then you'll repeat and see if you get a lower confusion score from the model. Anyway, this is roughly the concept. And so over time, you're going to gradually converge on a lower and lower confusion score.

And it sort of feels almost like the model is relaxing into the correct answer, which is why this is a lot like simulated annealing, if you're familiar with that: you're making random modifications to the labeling until you get a really low loss, and you gradually relax into the correct answer. I hope that makes sense. It's sort of one of those things you kind of have to see.
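
Here's a toy sketch of that flip-and-check loop in Python, simulated-annealing style. The inconsistency callable is a stand-in for however you'd score how confused the model is by a given labeling; the fake version below just counts disagreements with a hidden ground truth so the example runs end to end. Anthropic's actual internal coherence maximization method also enforces logical consistency constraints that this leaves out.

```python
import math
import random
from typing import Callable, List

def elicit_labels(num_examples: int,
                  inconsistency: Callable[[List[int]], float],
                  steps: int = 2000,
                  temp: float = 1.0,
                  cooling: float = 0.995,
                  seed: int = 0) -> List[int]:
    """Start from random correct/incorrect labels, repeatedly flip one label,
    keep flips that make the labeling less 'confusing' (lower inconsistency),
    and occasionally accept worse flips early on, simulated-annealing style."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(num_examples)]  # random initialization
    score = inconsistency(labels)

    for _ in range(steps):
        i = rng.randrange(num_examples)
        labels[i] ^= 1                       # propose flipping one label
        new_score = inconsistency(labels)
        if new_score <= score or rng.random() < math.exp((score - new_score) / temp):
            score = new_score                # accept the flip
        else:
            labels[i] ^= 1                   # revert it
        temp *= cooling                      # cool down: fewer "bad" flips over time
    return labels

# Toy stand-in for "how confused the model is" by a labeling. In the real method
# this score would come from the model's own predictions and their consistency,
# not from access to ground truth.
truth = [1, 0, 1, 1, 0, 0, 1, 0]

def fake_inconsistency(labels: List[int]) -> float:
    return float(sum(l != t for l, t in zip(labels, truth)))

print(elicit_labels(len(truth), fake_inconsistency))  # converges to `truth`
```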

Right. Just to give some motivation: they frame this problem (and this is from Anthropic and a couple of other institutes, by the way) in the context of superhuman models. So the unsupervised elicitation part of this is

about how you train a model to do certain things, right? These days, the common paradigm is you train your language model via pre-training, then you post-train: you have some labels, rewards, or preferences over outputs, and then you do RLHF or DPO to make the model do what you want it to do. But

The framework or the idea here is once you get to superhuman AI, well, maybe humans can't actually see what it does and kind of give it the labels of what is good and what's not. So this internal coherence maximization framework makes it so you can elicit the good behaviors, the desired behaviors,

from the LLM without external supervision by humans. And the key distinction here from previous efforts in this kind of direction is that they do it at scale. So they train a Claude 3.5 Haiku-based assistant without any human labels and achieve better performance than its human-supervised counterpart. They demonstrate in practice,

on a significantly sized LLM, that this approach can work. And this could have implications for future, even larger models.

Next up, a couple of stories on the policy side. Well, actually only one story. It's about Taiwan and it has imposed technology export controls on Huawei and SMIC. Taiwan has actually blacklisted Huawei and SMIC, the Semiconductor Manufacturing International Corp.

And this is from Taiwan's International Trade Administration. They have also included subsidiaries of these companies. It's an update to their so-called strategic high-tech commodities entity list. And apparently they added not just those but 601 entities in total, from Russia, Pakistan, Iran, Myanmar, and mainland China.

Yeah. And one reaction you might have looking at this is like, wait a minute, I thought China was already barred from accessing, for example, chips from Taiwan. And you're absolutely correct. That is the case. That was my reaction. Yeah, it is. No, totally, totally. It's a great question. So what...

what is actually being added here? And so the answer is because of US export controls, and we won't get into the reason why the US has leverage to do this, but they do. Taiwanese chips are not going into mainland China, at least theoretically. Obviously, Huawei finds ways around that. But

this is actually a kind of broader thing that deals with a whole bunch of plant construction technologies, for example, and specialized materials and equipment that aren't necessarily covered by US controls. So there's broader supply chain coverage here, whereas US controls are more focused on cutting off

like specifically chip manufacturing. Here, Taiwan is formally blocking access to the whole semiconductor supply chain. It's everything from specialized chemicals and materials to manufacturing equipment, technical services. So sort of viewed as this loophole closing exercise coming from Taiwan.

This is quite interesting because it's coming from Taiwan as well, right? This is not the US kind of leaning in and forcing anything to happen, though who knows what happened behind closed doors. It's interesting that Taiwan is taking this kind of hawkish stance on China. So even though Huawei couldn't get TSMC to manufacture their best chips, they have been working with SMIC to develop some domestic capabilities for chip manufacturing. Anyway, this basically just makes it harder for that to happen.

Next up, a paper dealing with some concerns, actually from a couple of weeks ago, but I don't think we covered it, so it's worth going over pretty quickly. The title of the paper is Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Tasks.

So what they do in this paper is have 54 participants write essays. Some of them can use LLMs to help them do that. Some of them can use search engines to help them do that. Some of them have to do it themselves, no tools at all.

And then they do a bunch of stuff. They first measure brain activity with EEG to, they say, assess cognitive load during essay writing. They follow up by looking at recall metrics.

And the result is there are significant differences between the groups. EEG reveals less so-called brain connectivity for LLM participants than for search participants, and both show less than the brain-only participants. Similarly, self-reported ownership, recall, all these things differ. This one got a lot of play, I think, on Twitter and everywhere.

Quite a bit of criticism also, I think, in overblowing the conclusions. I think the notion of cognitive debt, the framing here is that there's long-term negative effects on cognitive performance due to decreased mental effort and engagement. And you can certainly question whether that's the conclusion you can draw here. What they show is if you use a tool to write an essay, it takes up less effort and you probably don't remember what is in the essay as well.

Does that transfer to long-term negative effects on cognitive performance due to decreased mental effort and engagement? Maybe. All I have is a personal take on this too. I think that good writers are good thinkers because when you are forced to sit down and write something, at least it's been my experience that I don't really understand something until I've written something about it with intent. In fact, when I'm trying to understand something new,

I actually make myself write it out because it just doesn't stick in the same way. Different people may be different, but I suspect that maybe less so than some people might assume they are. So I think at least for people like me, I imagine this would be an effect.

It's interesting. They say that after writing, 17% of ChatGPT users could quote their own sentences, versus 89% for the brain-only group, the ones who didn't even use Google. The other interesting thing here is that by various measures, Google is either between using ChatGPT and going brain-only, or

it can even be slightly better than brain-only. I thought that was quite interesting, right? Like, Google is sort of this thing that allows fairly obsessed people like myself to do deep dives on, say, technical topics and learn way faster than they otherwise could, without necessarily giving them the answer. And

ChatGPT, or LLMs at least, opens up the possibility to not do that. Now, I will say, I think there are ways of using those models that actually do accelerate your learning. I think I've experienced that myself, but there has to be some kind of innate thing that you do, at least...

I don't know. I'm self-diagnosing right now, but there's got to be some kind of innate thing that I do, like whether it's writing or drawing something or making a graphic to actually make it stick and make me feel a sense of ownership over the knowledge. But yeah, I mean, look, we're going to find out, right? People have been talking about the effects of technology on the human brain for since the printing press, right? When people are saying like, hey, we rely on our brains to store memories. If you just start getting people to read books,

well, now the human ability to have long-term memory is going to atrophy. And you know what? It probably did in some ways, but we kind of found ways around that. So

I think this may turn out to be just another thing like that, or it may turn out to actually be somewhat fundamental because, you know, back in the days of the printing press, you still had to survive. Like, you know, there was enough kind of real and present pressure on you to learn stuff and retain that, you know, maybe it didn't have the effect it otherwise would. But interesting study. I'm sure we'll keep seeing analyses and reanalyses for the next few months.

Yeah, quite a long paper, like 87 pages, lots of details about the brain connectivity results. And ironically, it was too long for me to read. No, it's actually true. I used an LLM for this one. It's like...

Anyway, I have seen quite a bit of criticism of the precise methodology of the paper and some of its conclusions. I think also in some ways it's very common sense. You know, if you don't put in effort doing something, you're not going to get better at it. Yeah. You know, that's already something we know, but...

I guess I shouldn't be too much of a hater. I'm sure this paper also has some nice empirical results that are useful in, as you say, a very relevant line of work with regard to what actual cognitive impacts the usage of LLMs has, and how important it is to go brain-only sometimes.

Alrighty, on to synthetic media and art. Just two more stories to cover. And as promised in the beginning, these ones are dealing with copyright. So last week we talked about how Anthropic scored a copyright win. The gist of that conclusion was that using content from books to train LLMs is fine, at least for Anthropic.

what is actually bad is pirating books in the first place. So Anthropic bought a bunch of books, scanned them, and used the scanned data to train their LLM, and that kind of passed the bar. It was okay. So now we have a new ruling, with a judge rejecting some authors' claims that Meta's AI training violated their copyrights. The federal judge has dismissed a copyright infringement claim brought by

authors against Meta for using their books to train its AI models. The judge, Vince Chhabria, has ruled that Meta's use of nearly 200,000 books, including books by the authors suing, to train the Llama language model constituted fair use. And this does align with the similar ruling about Anthropic and Claude. So,

this is a rejection of the claim that this is piracy. Basically, the judgment is that the outputs of Llama are transformative, so you're not infringing on copyright, and using the data for training a language model is fair use, so copyright doesn't apply. At least as far as I can tell, and again, I'm not a lawyer, this is a conclusion that

seems like a pretty big deal: the legal precedent for whether it's legal to use the outputs of a model when some of the inputs to it were copyrighted appears to be getting figured out.

Yeah, this is super interesting, right? You've got judges trying to square the circle on allowing what is obviously a very transformational technology. But I mean, the challenge is, no author ever wrote a book until, say, 2020 or whatever, right?

with the expectation that this technology would be there. It's just sort of like no one ever imagined that facial recognition would get to where it is when Facebook or MySpace were first founded and people first started uploading, you know, a bunch of pictures of themselves and their kids. And it's like, yeah, now that's out there. And you're

waiting for a generation of software that can use it in ways that you don't want it to, right? Like, you know, deepfakes, I'm sure, were not even remotely on the radar of people who posted pictures of their children on MySpace in the early 2000s, right? That's like...

That is one extreme version of where this kind of argument lands. So now you have authors who wrote books, you could say, in good faith, assuming a certain technological trajectory, assuming that those books, when put out in the world, could not

technologically be used for anything other than just what they expected them to be used for, which is being read. And now that suddenly changes. And it changes in ways that undermine the market quite directly for those books. It is just a fact that if you have a great, like a book that really explains a technical concept very well,

and your language model is trained on that book and now can also explain that concept really well, not using the exact same words, but maybe having been informed by it, maybe using analogous strategies

It's hard to argue that that doesn't undercut the market for the original book, but it is transformative, right? The threshold that the judge in this case was using was that Llama cannot reproduce more than 50 words of the original. Well, yeah, I mean, every word could be different, but it could still be writing in the style of the original, right? And that's kind of a different threshold that you could otherwise have imagined the judge going with, or something like that. But

There is openness apparently from the judge to this argument that AI could destroy the market for original works or original books just by making it easy to create tons of cheap knockoffs. And they're claiming that likely would not be fair use, even if the outputs were different from the inputs. But again, the challenge here is that it's not necessarily just books, right? It's also like you just want a good explanation for a thing. And the form factor that's best for you is a couple sentences rather than a book.

So maybe you err on the side of the language model and maybe you just keep doing that, whereas in the past you might have had to buy a book. So I think overall, this makes as much sense as any judgment on this. I don't have, you know... I

feel deeply for the judges who are put in the position of having to make this call. It's just tough. I mean, you can make your own call as to what makes sense, but man, is this littered with nuance. Yeah, it is worth noting, to speak of nuance, that the judge did very explicitly say that this is judging on this case specifically, not about the topic as a whole. He did frame it as

Copyright law being about more than anything, preserving the incentive for humans to create artistic and scientific works. And fair use would not apply, as you said, to copying that would significantly diminish the ability of copyright holders to make money from their work. And so in this case...

Meta presented evidence that book sales did not go down after Llama was released for these authors, who included, for instance, Sarah Silverman and Junot Diaz, and overall there were 13 authors in this case.

So yes, this is not necessarily establishing precedent in general for any suit that is brought. But at least in this case, the conclusion is Meta doesn't have to pay these authors and generally did not go against copyright by training on the data from their books without asking for permission or paying them. And just one last thing.

The next one is that Getty has dropped some key copyright claims in its lawsuit against Stability AI, although it is continuing its UK lawsuit. So the primary

claim against Stability AI by Getty was about copyright infringement. So they dropped the claim about Stability AI using millions of copyrighted images to train its AI model without permission.

but they are still keeping the secondary infringement and, I guess, trademark infringement claims, which say that AI models could be considered infringing articles if used in the UK, even if they were trained elsewhere. So honestly, I don't fully get the legal implications here. It seems like, in this case in particular, the claims were dropped

because of weak evidence and a lack of knowledgeable witnesses from Stability AI. There are also apparently jurisdictional issues where this kind of lack of evidence could be problematic. So it's a development that is not directly connected to the prior things we were discussing and seems to be, again, fairly specific to this particular lawsuit.

But it's another copyright case moving forward, this one being a pretty significant one dealing with training on images. And if you're dropping your key claim in this lawsuit, that

bodes well for Stability AI. And that's it for this episode of Last Week in AI. Thank you to all of you who listened at 1x speed without speeding up, and thank you to all of you who are tuning in week to week. Share the podcast, leave a review, and so on. Please keep doing it.

Yeah.


From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop.

Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.