
#187 - Anthropic Agents, Mochi1, 3.4B data center, OpenAI's FAST image gen

2024/10/28

Last Week in AI

People
Andrey Kurenkov
Jeremie Harris
Topics
Andrey Kurenkov: Anthropic released Claude 3.5 Sonnet, an AI model that can use a computer autonomously — it controls the machine through visual recognition, performing actions like clicking, moving the cursor, and typing text. The model performs strongly across benchmarks, especially on agentic capabilities. Although the naming follows earlier versions, the underlying model and functionality have changed fundamentally; it is closer to an agent model. Jeremie Harris: Anthropic's strategy has been to stay competitive by releasing models at or slightly behind frontier performance while avoiding accelerating the AI capabilities race. The release of Claude 3.5 may indicate Anthropic is gradually drifting from that original strategy, driven by the current fundraising environment and competition with OpenAI. Jeremie Harris: Genmo released Mochi 1, an open-source video generation model competing with Runway, Kling, and others. Mochi 1's open-source nature gives it significant potential, but its high running cost limits who can use it. Andrey Kurenkov: Canva launched Dream Lab, a new text-to-image generator, and Ideogram's Canvas beta brings Remix, Extend, and Magic Fill to Ideogram users. The integration of these features reflects how widely AI is being embedded in software tools of all kinds.


Chapters
Anthropic's Claude 3.5 "Sonnet" introduces computer use, enabling AI to interact with computers like humans. This agent-like model raises questions about the evolving competitive landscape and the balance between innovation and responsible scaling.
  • Claude 3.5 'Sonnet' can control a computer by looking at the screen, moving the cursor, clicking, and typing.
  • This feature, similar to Adept AI's earlier attempts, marks a significant advancement in AI capabilities.
  • Anthropic's move may be driven by the pressure to compete with OpenAI's funding and scaling capabilities.
  • The model's performance shows significant improvements, particularly in agentic capabilities and software engineering.
  • Safety remains a concern, with Anthropic collaborating with safety institutes and implementing restrictions on social media and election-related activities.

Transcript


[AI-generated intro song plays] Hello, and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode, we will be summarizing and discussing some of last week's most interesting AI news.

And as always, you can check out our Last Week in AI newsletter at lastweekin.ai for even more AI news in text form, and for links to the stories we discuss in this episode. I'm one of your hosts, Andrey Kurenkov. My background is that I studied AI as a PhD student.

Now I work at a generative AI startup. And once again, we do not have a guest co-host; Jeremie is back.

What's up, everyone? Hey, you know what? So first of all, I'm back from having a daughter, who is happy and healthy, and my wife crushed it. Uh, thirty hours of labor, if you were wondering, which, if you were wondering, is not pleasant. So my wife, a real sport, a real champ.

I don't want to take all the credit for the birth. I was kind of there. So, you know, I did my part.

And anyway, the point is, it went really smoothly. And I want to say, we have, like, incredible listeners. I got emails.

I got messages. I got people stopping by my apartment in the middle of the night to wake me up and congratulate me. That one was, I don't know how they got the address, but, uh, yeah.

Look, tons of amazing warm messages from, uh, the community, which is what it feels like we've got here. I just want to say thank you. Thank you.

Thank you. Um, it actually made it so much harder to be away for that time, because I saw the comments and all that. Really, really appreciate it. So thank you.

Um, we're back, excited to do this. There's so much that went on. I mean, I've got to tell you, I spent the last three days just getting caught up on what happened over the last four weeks.

And man, like, it is a lot of stuff. I was getting caught up partly for work, but also for this. And, like, the pace of progress is wild. We're going to have a really intense, big episode today, but I don't think it's going to cool down.

I mean, just the inference-time compute stuff that we're seeing, the advances on the reinforcement learning side, and in media generation, obviously, as well. But, like, anyway, so much, and really excited to get back into it. I picked the wrong four weeks to take off, but you know what? All worth it in the end.

So there we go. Exactly. Yeah, I think the last couple of episodes happened to run more than an hour and a half. I want to say this one is probably also going to run long, so we might be getting back to that.

So let us try and get back up into the news. But before that, as always, real quick, I want to acknowledge some listener comments. We have a new Apple Podcasts review that says that this show is sort of like another show; I'm not aware of what that show is, but it seems to be, you know, a good one.

And this reviewer says that this is a nice mix of opinions, facts, broad strokes, real-world, even AI existential chat. You know, some positive feedback in there. So there you go, Jeremie.

I'm sure we'll be getting a little more of that with you back. So thank you for the review, and the comparison, that's interesting. And we did have—

I love that. I'm going to use that as a blurb from now on. People love the podcast so much. What they say about it is that it's "sort of this and that." What a ringing—

—endorsement, exactly. And I love the "AI existential chat." You know, where else can you get that,

right? Yeah, there you go.

And we did have one comment that I want to address. There was a question of, what do I use to create the AI intro music? I use Udio. So there is Udio and Suno; both are pretty good. I found I preferred Udio, and every week I spend, like, an hour these days just prompting it to see what I can get.

Gotta get you back to work or something.

But yeah, I know, I know. By the way, the full AI intro song plays at the outro. So I get a two-minute song every time. So if you stick around, you'll get the full version. I also post the full versions on YouTube, if you want to just, like, listen to the songs for fun.

I don't know. I feel like I didn't know that.

That's really cool. Well, I'm not surprised you don't listen to them, Jeremie.

Well, I do, in real time.

And one last thing before we get into the news: we do have a new sponsor, and actually just in time for you being back, Jeremie. So the new sponsor is The Generator, which is Babson College's interdisciplinary AI lab focused on entrepreneurial AI. Just recently, last fall, professors from all across Babson partnered with students to launch this, uh, Generator.

And this lab is organized into various groups: AI entrepreneurship and business innovation, AI ethics and society, the future of work and talent, all these sorts of things. And they are kicking off various endeavors. They are training all the Babson faculty on AI concepts and AI tools.

And they'll be extending that to all of the college. And their motto, I guess, as they say, is that The Generator accelerates entrepreneurship, innovation, and creativity with AI. This is kind of an interesting sponsor, because there's no product to sell here. Uh, they're actually fans of the podcast. They told us it is a must-listen for the faculty and students of Babson. So, amazing. Yeah, we are glad to have them as sponsors. You can go check out the, uh, college's website in the episode description, with interesting articles, and maybe there will be some news out of The Generator soon.

And finally, let us actually get into the news, starting with the tools and apps section. And we're starting with a pretty exciting one coming out of Anthropic. So the headline here is, Anthropic's latest AI update can use a computer on its own. So this is a beta of a new feature for Claude 3.5 Sonnet, which allows it to control a computer by looking at a screen, moving a cursor, clicking buttons, and typing text. And they call this feature computer use. It's available on the API and allows developers to direct the AI to do this stuff, basically use a computer like a human. So very much on the trend of agentic AI, right, where this is essentially a thing we could tell, you know, go and book tickets to fly to Atlanta on November 15 for a week, and this model can go and do this, uh, set of steps, where it will open up a browser.

Go to the website, do all those steps to actually do it for you. For now it's of course relatively limited in some ways. It's not able to do things like dragging and zooming, uh, not able to do everything you can do on a computer. But, uh, regardless, it is, uh, pretty open-ended, right? If you can look at the screen, move a cursor, click buttons, and type text, there's a lot you can do. People have started experimenting with it; there have been failure cases people have shown, and there has been a lot of excitement. So this is certainly something that I think we can all imagine is going to be an end product of AI: sooner or later, we will be able to just have AI take over on the computer and do anything. And it's pretty exciting and interesting to see Anthropic being kind of a front-runner on this kind of thing.
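For a sense of what this looks like on the developer side, here is a minimal sketch of calling the computer use beta, assuming the anthropic Python SDK and the tool/beta identifiers from Anthropic's October 2024 announcement; treat the specifics as illustrative rather than authoritative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],      # opt in to the computer use beta
    tools=[{
        "type": "computer_20241022",        # the virtual-desktop tool
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user",
               "content": "Open a browser and find flights to Atlanta on November 15."}],
)

# The model replies with tool_use blocks (screenshot, mouse_move, left_click,
# type, ...). Your harness must execute each action in a sandboxed VM, return
# a fresh screenshot, and loop until the task completes.
for block in response.content:
    print(block.type, getattr(block, "input", None))
```

Note the division of labor: the API only decides what to do next; actually moving the cursor and clicking is left to your own sandboxed environment.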

Yes. So there are so many layers to this story. One of which, just at the top level, is that, as Andrey noted, this sounds a lot like what Adept AI was trying to pull off back in the day. I think they might have raised, there was, like, sixty, sixty-five million dollars at the time. I will say, we've said this a lot on the podcast: these sort of, like, mesoscale companies that haven't raised, like, enough money to be competitive when it comes to scaling are at risk of dying. I think, you know, this is exactly that.

I think we specifically called that we would have companies like Anthropic, or more scaled companies, um, essentially just, like, eating the lunch of companies like Adept. I think that is what we are seeing here, make no mistake. I think that this is, you know, one step in that long story. Um, and I think there's a lot going on under the hood here, right? So Anthropic explicitly or implicitly has sort of intimated that they don't want to exacerbate the racing dynamics behind frontier AI, right? That's been a big part of their story.

The way they've done that historically is to release models that are at parity with, or slightly behind, the frontier, so they can still make some money, but they're not sort of, like, you know, accelerating things themselves. That still does put competitive pressure on OpenAI. But anyway, the idea is to kind of reduce that while still being competitive. This is a step away from that, or another step away from that.

We've seen Anthropic kind of nudge itself more and more in that direction, which perhaps is unsurprising; the incentives are just there. Um, but this is happening in a context where we know Anthropic is raising a round, right? OpenAI just raised at that one hundred fifty-seven billion dollar valuation, the six point six billion dollar round.

That's what it takes to build the next generation of data center, sorry, data center infrastructure. It's just what it takes if you want to keep up with the scaling, and Anthropic has no choice but to compete on that basis. If they want to court investors, they have to convince them that, hey, we are worth the thirty, forty billion dollar valuation that they must argue for.

And the only way to do that right now, given the revenue multiple that they have to argue for, is to come up with something that makes the case that, hey, we are actually ahead, like, you are betting on a potential winner here. So I think what we're seeing here is Anthropic, uh, between a rock and a hard place, frankly, being forced to choose a little bit: do we accelerate things? This will put pressure on OpenAI to accelerate their own development.

Release the next version. And indeed, we have of course heard rumors, at least, about Orion, the next-generation OpenAI model, which has supposedly been trained on o1's outputs as well, maybe coming out sooner rather than later. Not that it's necessarily connected, but also not that it's necessarily not connected.

Okay. Another layer to this is the performance, right? So they share, uh, details about the model performance.

This is, by the way, called Claude 3.5 Sonnet, brackets, new. So it's actually, I don't know why they did this. This is a fundamentally new model.

It is not like the old 3.5 Sonnet. It behaves differently. It's not just a text-to-text model. It's much more agentic, because, as you said, it takes in video, screenshots of your computer screen, and then takes actions. I don't know why they named it this.

I think the way to frame it is, the announcement came with a few things. So they say they have an upgraded Claude 3.5 Sonnet. They actually also launched Claude 3.5 Haiku, which is overshadowed, but is part of this.

And they say that computer use is a new capability in public beta. So I guess you can still use Claude 3.5 Sonnet on its own, without computer use, or you can use the API to do computer use, and that's powered by Claude 3.5 Sonnet.

So totally, Claude 3.5 Sonnet new is, like, it is an agentic model, right? Fundamentally more like o1, perhaps, than, for example, GPT-4o. And that's kind of where I think a lot of people, I think rightly, have been confused by the naming convention here. Um, and we will see if it persists. But at, you know, some point, you're probably going to need to, like, distinguish between these in a more fundamental way, like OpenAI did with o1 versus

the GPT series. I think there's a question there, though. Like, we have a benchmark table, right, that compares Claude 3.5 Sonnet to Claude 3.5 Sonnet new on the common benchmarks, MMLU, coding. So if you just use the API to do some coding, for instance, right, it's not necessarily clear that it is doing anything different from a normal language model when you just prompt it to complete some code, right?

Yeah, but this is what I was saying. The same is true for OpenAI's o1, right? Like, you can ask OpenAI's o1 just standard GPQA questions, MMLU questions, and you can get a score. Um, but it is given a different name because it also has these fundamentally new characteristics. And that's, I think, where people are kind of saying, hey, you know, like, we are dealing with something that is agentic, that is fundamentally different in its behavior. Um, maybe that should be reflected in the name then.

I mean, you know, you can argue over it, but I certainly think that there's an argument to be made there. Um, now, the performance is interesting, because in that table you alluded to, right, they do break down the performance, and they show you the graduate-level reasoning, GPQA. These are supposed to be really, really hard questions.

Questions that, like, PhDs in domain-specific areas would struggle with. Um, sixty-five percent for 3.5 Sonnet new, um, really impressive score. And it is SOTA, state of the art. Same with MMLU, same with HumanEval, which is a coding benchmark, and so on.

What they don't show in that table are the scores for OpenAI's o1 model, and they explain that in the figure just by saying, our tables exclude the o1 model family because they depend on extensive pre-response inference-time compute, unlike typical models, and this fundamental difference makes performance comparisons difficult. That's interesting; it sort of implies this may not be what is going on with Claude 3.5 Sonnet new.

I still think the comparison would be useful. So just to kind of give you the numbers, um, you have to go into the system card for the OpenAI o1-preview model, but when you dig that up, um, the performance on, uh, SWE-bench, which is the software engineering benchmark that I think increasingly is, like, the benchmark to track in the space, um, OpenAI o1-preview gets about thirty-eight percent.

This is before they impose a couple of mitigations that do reduce the performance, but about thirty-eight percent. Sonnet 3.5 new hits forty-nine percent. That is a big, big jump, signaling significant improvements in software engineering capability, which is so important given that these labs are explicitly trying to figure out, how do we build models that help us automate AI research itself and get us sort of closer to that takeoff scenario where, you know, AI makes itself better, which makes itself better, and so on. That's explicitly being talked about right now in the labs.

So anyway, I thought this is so, so interesting, so many layers to the story. Um, there are also questions, by the way, about where is Opus, Claude 3.5 Opus, in this whole story, right? The 3.5 series of models, we have Sonnet 3.5.

We don't have Opus; that was supposed to be the big model that would come out. There is some speculation about, you know, maybe that training run failed, or maybe the economics just don't support it. So maybe we won't be seeing it at all, and we've seen it disappear from the kind of Anthropic documentation in the space.

Um, last thing I'll note, the safety side, right? So we know that Anthropic has been engaged with the U.S. and U.K. AI Safety Institutes, doing a lot of, kind of, like, coordination with them. This model, the Sonnet 3.5 new, has in fact been tested by, it seems, both of these institutes, per the claims being made here.

So that would be yet another really interesting, um, use of those agencies and development of that relationship. They did find, uh, Anthropic did, that this model does not exceed their AI Safety Level 2 (ASL-2) standard, which is the same, uh, threshold of capability and risk that the previous Sonnet model reached. So it's sort of interesting, um, per their responsible scaling policy, not a qualitative difference in risk. Though I will say, once you hit ASL-3, that is already a pretty scary level of capabilities. So the fact that it's not there yet may not tell us all that much.

Right. And on that note of safety, they also do note that this is programmed to avoid social media and election-related activities. So you can't use the computer use feature here to go and, for instance, run a social media bot, which is a really interesting note on that front, where you need to impose new limitations now that the model can take these kinds of actions.

And on the performance comparison front, as you said, I guess a notable bit is that on the normal benchmarks it does better than Claude 3.5 Sonnet old. But the big jump is on agentic capabilities, right? So you gain a few percentage points here and there on MMLU or GPQA zero-shot.

But when you get to agentic coding, agentic tool use, that's where the biggest gains are. So I don't know if it's unfair to say, but this is, like, an agentic-optimized model, and that is why it comes with the computer use capability. But unlike o1, by default the inference isn't agentic when you call the API. So there is an interesting, nuanced thing here, where it's not, I guess, a system like o1 that is configured to do agentic reasoning every time.

You can still use it like a normal LLM, but it is an LLM that is optimized to be good at agentic reasoning, which I guess is why they still keep it with the same naming terminology and don't compare to o1 directly. Anyway, very exciting.

Lots to talk about there, but we should be moving on. So next up, we have a story about Mochi 1, which is a new model by AI video startup Genmo. And this is an open-source rival to Runway, Kling, and other video generators.

So this is available under the Apache 2.0 license, meaning that, uh, anyone can use this model for anything, essentially. You have both the weights and the model code to download.

And there's also a playground where you can play around with it. Uh, they will also be launching the ability to do higher-definition versions later this year. Pretty decent output for this one, obviously not quite as good as the other models we've mentioned.

It will only output in 480p resolution for now. Later this year there will be an HD version. And, you know, interesting move by the startup. They don't have a product yet, so they are kind of front-running with the release of this open-source model first.

Yeah, it's kind of interesting, because they do flag, you know, you can download the model weights if you want from Hugging Face, though it does require at least four Nvidia H100 GPUs to operate, if you want to actually run it.

So, you know, if you've got a spare hundred K lying around, uh, that can be your, uh, your go-to-market strategy. But the challenge here, of course, is, people talk about this a lot, the definition of open source. When you have a model that is so big that it requires distributed inference, um, does that qualify? The barrier to entry is so high. Obviously, having the model weights out there in the first place, I think, is the substantive win here. So we'll be able to see, presumably, a lot of interesting modifications as well made to that model, especially with video generation.

I just really wonder, like, what does fine-tuning look like? What does the ecosystem of modifications on top of this kind of model end up looking like? And there will be a lot of room for creativity, um, and also for, you know, the automation of an awful lot of movie generation, production, um, stuff. So anyway, uh, interesting release.
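As a concrete illustration of the barrier to entry being discussed, here is a hedged sketch of running Mochi 1 locally, assuming the Hugging Face diffusers integration and the genmo/mochi-1-preview checkpoint; without offloading tricks this wants multiple 80 GB GPUs, which is exactly the point.

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trade speed for fitting in less VRAM
pipe.enable_vae_tiling()         # decode the video in tiles to save memory

frames = pipe(
    prompt="A low-angle shot of waves crashing on a rocky shore at sunset",
    num_frames=85,               # roughly 2.8 seconds at 30 fps; 480p at launch
    num_inference_steps=50,
).frames[0]

export_to_video(frames, "mochi_sample.mp4", fps=30)
```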

And on to the lightning round. The first story is about Canva, and they have a shiny new text-to-image generator. Canva is a very big tool that, you know, lots of people use for design tasks, broadly speaking. And they are now launching Dream Lab, which is an in-built text-to-image generator.

Uh, Canva had acquired Leonardo a little while ago, Leonardo AI, which was, uh, a suite of tools for AI image generation. So this is powered by Leonardo's Phoenix model. And as you might imagine, this is pretty much just a text-to-image

tool. You can generate images from descriptions in various styles, illustration for example. Uh, this is actually offering an improvement over Canva's existing Stable Diffusion-based AI image generator.

Uh, so it has better quality, generally speaking, uh, as opposed to what it had before. And they also have updated its Magic AI tool suite, which will do various things like Magic Write, uh, for text generation. So, another example of, I guess, we've seen this trend all over, basically any software tool out there now integrating AI across the board in various ways. And I guess one of the early examples of an acquisition, a major acquisition of an AI startup, making it into a very significant product used by many people.

Yeah, it's kind of interesting. You can see the unambiguous footprint of that acquisition in all of this, yeah. And, you know, one of the things they mention at the very end of this is users may be disappointed that they're paying increased costs; the expectation is that the cost is going to go up as a result of this.

Um, it's so tough in the generative AI era. Like, what do you do, uh, to increase the prices that you're going to present to customers, um, versus, like, how much effort you put into just reducing the costs to compete? And, um, like, there are a lot of new features. It's funny to see the skeptical takes, like, people might be unsure about paying more for, like, this giant suite of new capabilities. But the reality is that so much of this is getting commoditized, right?

The models themselves no longer really are a moat. Increasingly, it's the infrastructure that serves the model that becomes the moat. And we literally just covered that story,

right, about video generation, and it just being open-sourced. A model so big it's got to fit on four H100s. Think about how much training budget would have been required for that.

Just completely open-sourced. So the models themselves, less of a moat, at least for these non-frontier models. And, yeah, it will be getting harder and harder, at least, to convince, I suspect, customers to pay big, big amounts of money for this sort of thing. But you've got to do that if you're a mesoscale company.

So we'll see. And the next story is about Canvas, not Canva. And I was a bit confused at first. So Canvas is a new feature in Ideogram. And this will allow users of Ideogram, which is, you know, a text-to-image tool that is more focused on text rendering in particular, to use things like Remix, Extend, and Magic Fill. So it's the kind of, um, image generation where you have a canvas, right, where you can, oh—

that's where they get it.

Okay, yes, yes, exactly. You can make a large collage, right? You can expand almost endlessly. You can fill gaps of it, uh, some edges, by, you know, pasting in images, extending it on and on. So this is now launching in the tool, something that I don't think you have in things like Midjourney, at least not natively in the web browser. So you're seeing this continued competition in the space.

Yeah, I'm really, like, curious about when the market is going to decide that image generation has just been solved, and then we no longer have, sort of, like, the benefits of continued scaling and really intense levels of investment. Like, because at this point, right, we've got, yes, we have some text-to-image models that specialize in rendering text in the image and all that, but, like, we're already at the point where that's kind of expected from a new release of any kind of cutting-edge text-to-image model.

So I am really curious, like, you know, are we saturating the value that can be created with these things? I suspect not; there's always room for, yeah, you could always be surprised by, like, how much additional resolution or additional capability unlocks these new niche applications. Like, you know, if you want your text-to-image model to generate, like, some, I don't know, some, like, circuit diagram or something, you want it to get everything right. But I think for the lion's share of the market, we may be getting into that space where the returns are going to be more limited.

I'm curious. I think, I have no idea. But at a certain point, when do we start lighting VC dollars on fire? Uh, it's kind of an interesting question to track in the space.

Yeah, exactly. I think it's interesting. You know, you have a few leading text-to-image companies out there, Midjourney, Ideogram, others like that.

And as with the LLM providers, right, there's not a ton of differentiation. There's some, but not a ton. And it'll be interesting to see, when the VC money is burned, right, then you actually need to get by on revenue alone.

How will that competition play out? And yet again, speaking of text-to-image, we have another story on that front, and this time it's about Stable Diffusion 3.5. So Stability AI, we haven't talked about them in a while, are releasing Stable Diffusion 3.5.

And once again, it's coming in three sizes: Large, Large Turbo, and Medium. The gist of this is that it is much better at photorealistic images. So the comparison in this article is to Flux 1.1 Pro, and it looks pretty significantly better compared to SD

3, in the sense that it is comparable to Flux. I think initially, when I saw outputs of Flux on X, it was very impressive. It did make me feel it exceeded, uh, surpassed Stable Diffusion. So I guess not surprising that they are launching this to try and keep up, so to speak.

Yeah, it is, um, under one of these Stable Diffusion licenses, right? So it's free for non-commercial use, including scientific research. And, this is the thing, also free for small to medium-sized businesses, up to one million dollars in revenue.

And then above that, you need an enterprise license. So this is kind of the new Stability AI approach, where they have to monetize somehow. It just wasn't working out back in the days of Emad Mostaque as the CEO.

We all remember that; we covered it quite a bit here. They were just bleeding money, and now they need a way to make it. So, um, this is clearly an attempt to kind of split the baby, in that sense.

I actually have no, um, conception of how that's working. Like, I don't know that I've seen reporting on the revenue they've been generating from these enterprise licenses. Yeah, again, I mean, like, a step ahead in realism, that's cool. I don't know how many more steps ahead are left in the tank before, you know, you see a fully open-source model that does not come with these requirements of enterprise licenses.

And then what do you do, right? So that might really erode the profit margins here. But still, a very interesting and important step forward for Stability AI, especially as they try to compete with everybody else that is snapping at their heels.
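For reference, a minimal sketch of trying Stable Diffusion 3.5 Large via Hugging Face diffusers, assuming the stabilityai/stable-diffusion-3.5-large repo id and that you have accepted the community license on the Hub:

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="A photorealistic portrait of a lighthouse keeper at dusk",
    num_inference_steps=28,   # values suggested for the Large model
    guidance_scale=3.5,
).images[0]
image.save("sd35_sample.png")
```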

And one last story. We are bringing it back around to agentic AI with the announcement of workflows in Inflection for Enterprise. So, to recap the story, if you recall: Inflection was the company that created Pi and was very oriented towards a sort of consumer chatbot that is emotionally intelligent.

Famously, as we've covered, it was kind of quasi-acquired by Microsoft, where a lot of its leadership and also a lot of its people moved over. But Inflection was not acquired and remained its own company. They did announce a shift to enterprise a few weeks ago with Inflection for Enterprise, and now they're announcing agentic workflows for Inflection for Enterprise. So this is, uh, in partnership with another company, UiPath, which is already focused on this automation of processes in, uh, you know, companies, uh, automation of company processes. And so this is combining the AI that Inflection provides with that sort of automation.

Uh, not a ton of details in the announcement by them as to how this is actually used. I imagine this is very much kind of in the RPA mold, or however you want to call it, not such a super general-purpose solution, uh, unlike something like o1, that is agentic, or computer use by Anthropic. But regardless, it seems like Inflection is, yeah, very much trying to go for that enterprise play, and also trying to go a different path with the announcement of this agents feature.

Yeah, this is coming along with the announcement of an acquisition of a company called Boundaryless, which I had not heard of before. And, um, the reason, I suspect, is that they have historically done maybe more, like, robotic process automation stuff, like RPA.

Apparently, they are described as a team of automation experts, um, with deep experience deploying UiPath integrations. So, that UiPath collaboration, Boundaryless seems to be the kind of, well, the boundary, the interface between, um, Inflection and UiPath, and presumably that's the strategy to actually integrate. This blog post does read super, super enterprise-y, I will say. There is this paragraph, or these couple of sentences, that I just want to read, because they are, like, the most enterprise-sounding shit. So they say: today, the primary benchmarks for measuring AI capabilities focus on IQ. From the beginning, Inflection AI saw the value of prioritizing other forms of intelligence and fine-tuned our model to embody EQ. Okay, EQ, IQ, super cool. But now, we believe another important measure for enterprise AI should be recognized, and we refer to it as AQ, or the "action quotient."

See, these guys are really, really smart. They've got all the quotients. They've got the IQ. They've got the EQ. And now we've got the AQ, people. It's AQ time, time to get excited about the action quotient. Anyway, it's the most, like, "we're going to coin this phrase" thing. Anyway, this is, well—

Yeah, actually, it's funny, because it really highlights that early on, Inflection very much did focus on this EQ. Very, like, we are making a chatbot, but it is emotionally intelligent and will talk to you and understand you as a user. But that's not interesting to enterprise, right? They don't care, right? So they have to do a slightly awkward shift of, like, okay, forget EQ, that was the previous Inflection that couldn't make money and required Microsoft to bail it out. Now we're doing AQ, which is totally different. So this is very much them rebranding themselves in this move, for sure.

Yeah, and, like, it's gesturing at something real, right? Like, these are agents, okay, like, I get it. It's just, it's the most enterprise thing ever to be like, we need a new, like, little coined phrase to say this. I thought that was kind of funny.

And on to applications and business. We begin with a big hardware story. It is a 3.4-billion-dollar joint venture to build an AI data center campus that will reportedly be used by OpenAI.

So this is a bit more of an in-the-weeds story, not covered in a ton of headlines, but definitely worth highlighting. There has been a partnership between several companies, Crusoe Energy Systems, Blue Owl Capital, and Primary Digital Infrastructure, which will be building this data center campus in Texas. And this has just been recently announced. Uh, this will be in the city of Abilene, Texas. It is, you know, a huge project, almost a million square feet of floor space and, as we said, one hundred thousand GPUs. And, Jeremie, I think you have more details that you thought were interesting here.

Oh yeah, I mean, this is, like, a huge deal I'll be screaming from the rooftops. So, okay, first, um, this is one of, if not the first, one hundred thousand-unit B200 clusters that we're going to see come online. So, the B200, we've talked about.

I'm trying to remember, when we started doing the podcast together, we were talking about the A100 GPU; that was what was used to train GPT-4. Then we've got the H100, then the H200, but basically the same process node was used to make those.

Now we have the B200, the next generation, much, much more powerful, much more performant. This is the first one hundred thousand B200 cluster coming online. Big deal.

Even bigger deal, uh, is not just the scale, but what it means for the Microsoft-OpenAI relationship. So this build is coming from a deal between OpenAI and Oracle, and it's the first time Microsoft hasn't been OpenAI's partner for data center infrastructure provision.

That's a really, really big deal. OpenAI apparently negotiated directly with Oracle to set this up. And it seems to be part of a view at OpenAI that Microsoft just hasn't been moving fast enough to provide the kind of data center infrastructure needed to meet the requirements for the next phase of scale. So there is friction between OpenAI and Microsoft around this issue. And apparently Microsoft was informed about the negotiations, so they're tracking.

But in reality, as they say, they had relatively little involvement. Um, and so they would have had to green-light it, because of the nature of the agreement between Microsoft and OpenAI: OpenAI can't go out and just negotiate deals with random cloud providers without Microsoft's blessing. And so that seems to be what has happened here.

Microsoft, like, look, we're not going to build this for you, but if you want to go out and negotiate with Oracle, go for it. Notable because Oracle is, in that sense, a competitor to Microsoft, so it's kind of weird.

Um, but yes, so the initial build is going to be about fifty thousand, uh, GB200s. So, the GB200, I should have said B200 earlier.

The GB200, um, is the, uh, so you have individual GPUs, right? Now, you put those GPUs together on a motherboard along with a CPU, and then you put those in server racks. That unit is called a GB200, right? So this is, like, the data center version of the B200 GPU.

That's the horsepower that's powering this. And, um, it's apparently going to be in this, uh, new Abilene facility by the first quarter of next year. That is fast, right? That is Sam Altman trying to respond, in part, to, like, Elon Musk building his one hundred thousand H100 cluster faster than anybody had before.

The goal is to scale up to one hundred thousand of these GB200s by the fall of 2025. So that would be a really, really big deal.

Um, there is going to be, apparently, one gigawatt of energy available there by mid-2026. For context, like, one gigawatt is, I mean, that's a big city that you're powering with that kind of energy.

Um, so there's a lot of power deals, kind of, uh, behind this, and, um, Crusoe, uh, has a bunch of expertise on the energy side, which is why they're such a key partner here. They're also known as Crusoe Energy, just to tell you how energy-focused they are.

Um, initially they powered their data centers using natural gas, um, that flares during oil extraction, to reduce carbon emissions. Right now they're looking into other kinds of power to scale faster, right? We've been talking for a long time, I think over a year now, about the need for nuclear to come online, the need for natural gas to play a role here.

But there are all these kind of crucial power requirements to make this all happen. So this is, I think, a really, really big deal for the whole hyperscaler scene. There's a little bit of interesting gossip at the end of the article as well, where the guy behind Crusoe was saying, I've heard about ten-gigawatt-scale projects. So, like, a ten-gigawatt-scale project, a build-out of that scale, um, that is just a very, very big project, is the kind of thing you might imagine

coming online, like, 2026 maybe, um, maybe 2027, depending. But, like, there is nowhere you can find ten gigawatts of slack in the current U.S. electrical grid. Um, so it'll be really, really interesting, like, what the hell does that involve? That requires new power permits; that cannot be done quickly. If you're going to do, for example, even, like, even, I don't know, gas, you're going to need, like, three years to do it quickly.

Nuclear plants take, like, ten-plus. So, you know, this is a very ambitious project, if in fact it is going to happen; it's just a rumor. We'll see. Anyway, all kinds of interesting, uh, gossip and juice in this story, I thought. So there it is.

Yeah, exactly.

The details are a little bit murky, so there's not been, like, an announcement by OpenAI of any of this. As far as I know, it's more like there was a press release by these companies, by Crusoe and others, announcing the project, and there's been chatter and discussion of OpenAI being the end beneficiary, Oracle being involved, Microsoft sort of not as involved as you might have expected given their relationship with OpenAI.

Uh, certainly we will probably be finding out more going forward, given that it would be a huge deal if this is the next, you know, I guess, independent venture by OpenAI to have its own source of compute. That is very significant. Next up, going back to Anthropic. And as you mentioned, Jeremie, they are trying to get more money on the heels of OpenAI getting a ton of it. And reportedly, they are looking to raise at a valuation of up to forty billion. So, not a ton of details here, but according to a report from The Information, uh, according to an unnamed existing, uh, investor who spoke to company leaders, uh, there are talks that are at an early stage and, uh, are trending at this possible valuation. So a very kind of fluid situation here. Uh, yes, interesting to highlight given that OpenAI had valued itself at that pretty ridiculous number of one hundred fifty-seven billion, you know, the highest valuation for a private, uh, startup at this time. And Anthropic, yeah, I don't know, I guess, how to feel about the comparison of their forty billion valuation to OpenAI's one fifty-seven.

Yeah, I think one of the big, um, kind

of headline metrics that is really interesting here is, if they end up going at that forty billion dollar valuation, right, that would be a fifty-times multiple on their gross revenue right now. That is higher than OpenAI's multiple of about forty from its previous round, right? So the argument Anthropic must make, if they are going to raise at forty, is, um, there's actually, like, we have more, proportionately, of our potential, of our value, in the future.

So we have the potential to, like, grow faster, be better than OpenAI. That's going to be a tough sell in the current context, especially where scaling matters so much, right? The mere fact that OpenAI has raised as much as they have de-risks them from a scaling standpoint much more than Anthropic, as does the partnership with Microsoft. Anthropic has more of a sort of arm's-length relationship with a couple of different hyperscalers, Amazon and Google being the foremost there.

So, um, OpenAI also has just, like, a way better financial position. As they say in the article, on pace, the claim is, to generate about four billion dollars in revenue, um, which is, uh, about five times more than Anthropic's current projection. Um, both companies obviously losing, like, just hemorrhaging money non-stop, because that's the nature of the scaling race.
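A back-of-envelope check of the multiples being thrown around here, using the episode's round numbers (projections, not audited figures):

```python
# $40B target valuation over ~$0.8B projected revenue (about a fifth of
# OpenAI's ~$4B pace, per the article) versus OpenAI's $157B over ~$4B.
anthropic_multiple = 40e9 / 0.8e9   # = 50.0x
openai_multiple = 157e9 / 4e9       # ≈ 39.3x
print(anthropic_multiple, openai_multiple)
```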

So this is really where it makes you step back and go, okay, the Sonnet 3.5 new, the new model, that was released in the context of Anthropic historically not having wanted to exacerbate the race to AI capabilities. Um, it kind of makes you wonder, like, oh gee, you know, I wonder if they normally would have kept this under the hood.

OpenAI certainly has done that with the full version of the o1 model, right? Is Anthropic being forced to kind of, like, release things earlier? And this is where you see OpenAI doing similar things.

o1-preview is a preview model; they released it, you know, with all these caveats. Same with the Sonnet 3.5 new. I think we're sort of seeing a bit of a race to deploy playing out here in a number of different ways.

I happen to know this is also happening at the data center build level, where there are security and robustness considerations that are being short-circuited because people just want to get access to those next data center builds. Big, big issue if you care about U.S. national security, by the way, to deal with exactly this dynamic, because this is just affecting national security in so many different ways. Um, but yeah, there you have it. I mean, I think this fundraise is a really big deal.

I think it's do-or-die for Anthropic to just keep up. They need to have a clean enough cap table, right? They can't give away, like, fifty percent of their company to pay for the next generation of compute, because then what are they going to use to pay for the next generation of scaling? Um, so this is, I think, a really big question, you know, how much will their shares be diluted with these new investors, what are going to be the strings attached, and so

on and so forth. Yeah, I think it's interesting to think of it, I guess, in the context of the broader story of this whole time in AI of, you know, creating frontier models is a billion-dollar endeavor now. VCs are not going to give that money to any new companies, and most likely, like Inflection, who got so much money early on and trained their own model, they're out of the race of frontier models. Now, the field of players is pretty clear: it's OpenAI, it's Anthropic, Meta, Google, Mistral seemingly, and that's about it.

Yeah, and I'll happily register the prediction

that Mistral is going to be out of this race very soon, or at least in the next couple of years. There's just no way. I mean, France wants them as a national champion. But at the end of the day, again, we're talking about, I think, in my view, burning VC dollars.

Um, I think Anthropic is actually interesting. Like, you can make the case, even though they're not attached at the hip to a hyperscaler like DeepMind is, and like OpenAI is, that their differentiator is actually talent. Like, the best researchers in the world arguably have flocked to Anthropic, in many cases having left OpenAI for that purpose. Uh, those high-profile departures we've seen in the last, kind of, year, uh, I think are actually quite meaningful from the standpoint of Anthropic's prospects. So I wouldn't write them out at all.

I'm actually, I mean, I'm an Anthropic stan here, right? I like this stock. No, but I mean, I like Anthropic as a company, um, and I hope they succeed. But, uh, this is kind of one of the big questions here, and a lot of people in the safety ecosystem are also going to be potentially upset about this pushing forward of the capabilities envelope that is being imposed on Anthropic by economic forces. So we'll see.

Yeah, I know there is a lot of, I guess, philosophizing we could do here. In terms of Anthropic in the public consciousness, it's much less known, right? It probably has had much less of an impact, brand-wise, than OpenAI. OpenAI is a big name; ChatGPT

is a big name. And it is sort of in our little sphere of AI news and AI usage that we think of OpenAI and Anthropic together, almost on the same level, because Anthropic is a primary competitor, and, like, the main one aside from Meta and Google, which are not startups, right, which are established companies.

So the competition is very interesting, just because if Anthropic can't, you know, survive, then OpenAI is sort of the only such company that is going to stick around. And in that sense, it will be very interesting to see what happens with this new fundraise. Anyways, moving on, in the lightning round, to a related story,

uh, on the theme of people leaving OpenAI. As you mentioned, we have the news that longtime policy researcher Miles Brundage is leaving OpenAI. He was a senior policy researcher at OpenAI and has left the company to pursue research and advocacy in the non-profit sector.

So this is interesting to me because, uh, Miles Brundage has been kind of a significant figure in the AI space, uh, in terms of people who express opinions, who weigh into conversations. And he's been at OpenAI since 2018. Before that, he was a research fellow at the University of Oxford's Future of Humanity Institute.

And he has, since 2018, been at OpenAI. And as per that role, he was very much on the policy front, on the safety front. And in that sense, it is significant, once again, to see someone, this time not directly related to safety but related to policy, leaving OpenAI to do work from somewhere

else. Yeah, I will say, from talking with friends at OpenAI, Miles was considered by many to be one of the, like, kind of true believers in the AI safety story, and kind of a force for that within the company. So it is noteworthy in the context of the recent departures. You know, we've seen Mira Murati,

who was another person like that, who was, you know, believed by quite a few, and again, you know, from what I've heard in the company, um, to take the safety story quite seriously. And then John Schulman, who was taking over the superalignment efforts after Jan Leike left, who also was doing the superalignment efforts. Um, it's a lot of safety-oriented talent that just keeps leaving, and that, to many, has been a source of concern. But yeah, Miles put together this blog post and just kind of laid out his reasons for leaving.

It's quite interesting, and I think there's a lot of, yeah, you can do some reading between the lines of what the reasoning is. But his high-level bullet points are that he wants to, um, spend more time working on issues that cut across the whole AI industry, and have more freedom to publish and be more independent. He does credit OpenAI with allowing him to do quite a bit of stuff on his own, but it seems like he has his own views about things. He's also concerned that being at OpenAI made it harder for people to take his views seriously, since he had a vested interest.

I think that is a reasonable position for him, and for people to have looking from the outside in. Like, yeah, you're going to this company, you know you will have those biases. Um, so it'll be interesting to see if anything changes in his position, either suddenly or over time, as he moves away from it. Um, and then he was also especially interested in doing AI capabilities forecasting and, uh, assessments of regulation, the regulatory environment, security going forward. Um, but, uh, he does say, I think OpenAI remains an exciting place for many kinds of work to happen.

Like, if you wanted to read between the lines there, um, there is a lot of room to be like, okay, so I guess not all kinds of work. And I guess you're interested in safety work and security work.

And so, uh, you're leaving to do that. Um, kind of interesting. Ah, and he does say, "I'm excited to see the team continue to ramp up investment in safety culture and processes." So there's so much that you can, you know, anyway, reading the tea leaves is always impossible here. But, um, at least part of the vibe you could take from this is there may be a bit of concern over, uh, the same sort of safety and policy narratives that we've seen other people flag. Or maybe not. Uh,

there it is. Exactly. And I think it's notable to mention, you know, as an implication here, right: anyone leaving, especially someone as senior as Miles Brundage, is presumably leaving a very lucrative position to go and, in this case, work at a non-profit. So that tells you a little bit about what kind of decision this was. Next, more on hardware, and Nvidia's

Blackwell GB200 AI servers are ready for mass deployment in December. So that's the story there. There have been some issues with supply chain things there, rumors of defects, I think, that we have covered earlier. But now it appears that we are on track for Q4 of 2024, which, Jeremie, as you said, of course, this is an activation of notable AI compute, and it will start rolling out next year.

Yeah, this is despite initial concerns about how things were getting slowed down by a whole bunch of supply chain bottlenecks. The big one was the packaging technology they use, CoWoS-L. Um, this is, like, a new way to kind of integrate together all the little components that have to work together to make a chip work. So, you know, you have your logic and you have your high-bandwidth memory, and how do you connect those two together? Um, uh, with what's called an interposer.

Anyway, the details don't super matter; we hopefully will talk about them in the hardware episode, which we are very overdue to do. Um, that's been overcome, and now we're on track for Q4 2024 mass production of two different versions of this, um, kind of, B200 chip. So, uh, the big one is going to be the Bianca board. Um, this is the GB200.

Basically, this is where you have one, um, CPU, one Grace CPU, and then you have two, um, B200 GPUs that are connected to that CPU. Those all sit on the same board, and you're going to have two of those boards glued together, so a total of four GPUs, and those slide into one rack, into a, sorry, one slot in a server rack.

And then you would have basically eight of those slots, or, depending on the configuration, more or fewer. But, um, so this is really big news, because we're actually seeing this roll out, going to be shipped by December. And there's a whole bunch of early customers, including, not OpenAI, but Microsoft, Oracle, and Meta.

But actually, since Oracle is there, I guess maybe OpenAI, indirectly, will be among the first to acquire it. So that's kind of cool. Um, and, uh, Foxconn is working on this as well.

We will talk about them more in a second. But, um, once you build the actual, like, chip, then somebody has to turn that chip into a server rack, right, has to actually, like, set up all that infrastructure.

That's what Foxconn does. And so you have the chip fabrication, the chip packaging, and then you have the step that Foxconn does, where they basically just, like, build out the full servers, and those get shipped out. And that is what's being done now.

Exactly, that is our next story, that Foxconn is building this facility in Mexico for that assembly step of the GB200 chips. Foxconn, if you don't know, is a Taiwanese company.

They are most well known for being the key manufacturer for Apple products, the assembly step for making the iPhones. And now they are partnering with Nvidia for this, ah, what they say will be the biggest manufacturing plant for GB200 chips, and it will be in Mexico.

Yeah, so one of the big challenges, so, by the way, it's kind of funny, because you'll see a lot of companies referred to in, like, tech news, or, like, low-detail tech news.

They'll just be like, yeah, so Foxconn is responsible for building Nvidia's GB200 superchips.

And it's like, what do you mean, responsible? I thought Nvidia was responsible. And then I thought TSMC was responsible. So, yeah, they're all responsible.

Um, the way this works is, Nvidia designs the GPU, um, so the actual, like, B200, say. Uh, it gets shipped over to TSMC, it gets fabricated at TSMC, the chip. But then it also needs to be combined with high-bandwidth memory.

Usually you get that, nowadays it's SK Hynix, basically, for this, but it could be Samsung, other companies. And so the high-bandwidth memory and the GPU logic are combined together.

They get packaged at some packaging facility, potentially in Taiwan, potentially elsewhere; these days, for the cutting-edge stuff, the CoWoS-L, that's mostly Taiwan. Um, and once you do that, now you have your, like, ready-to-go GPU plus high-bandwidth memory. But you still need to connect that to the CPU, which, anyway, some process will build as well, and create

these, like, boards, basically these motherboards that are going to contain all the kind of other infrastructure needed to run the chip. Um, and that includes, by the way, cooling, so, like, liquid cooling.

And so that's what Foxconn is now going to do. Uh, that's that next step. So when they say that Foxconn

is building the GB200 superchip, they're saying they're taking the packaged chips that they get from, like, a TSMC, and then they're combining them with all the cooling and the energy provision stuff, and then they're packaging that into servers, and that's what they build. That's what's going on here. Um, this is a big deal, right? This is a really, really big deal. It'll be in Chihuahua, which I totally nailed the

pronunciation of, I

think. You know, we aim to please, but we want accuracy on this show. Um, but yeah, so it's going to be the world's largest manufacturing facility for bundling Nvidia's GB200 superchips, per their announcement.

That's good. And the last story in the section: xAI is launching an API. So xAI has been working on Grok, their ChatGPT and Claude competitor, for a while. We've, you know, seen Grok 2 recently. And so far, the only way to use Grok has been through X, through Twitter.

And now they are doing the same thing that OpenAI and Anthropic have done, which is provide a means for other organizations to use the model through an API. So this API will support grok-beta and will be priced at five dollars per million input tokens and fifteen dollars per million output tokens. Uh, it will have some features like supporting, uh, function calling. And, yeah, this is positioning, I suppose, xAI to compete with OpenAI and Anthropic more directly with this step.

Yeah, that happened really fast. Like, I can't believe how quickly xAI came along on that. And actually, it's consistent with how quickly they built that insane cluster, uh, in months. Like, the build speed was, apparently, if I remember right, end to end it was about, like, three weeks, apparently, to execute the build. Crazy, crazy stuff. Um, and yeah, apparently, although it's not live yet, uh, there is documentation that does hint at vision models

that can analyze both text and images. So presumably that's going to come online soon as well. I'm really curious, um, what complete, like, clusterfuck could happen if the xAI API ends up incorporating agentic capabilities and it's native to the, like, X platform, because they're so tightly integrated, like, the bots situation on X could actually be really interesting.

They are hopeful ly through the API, they'll be able to have a Better, Better visibility into at least those bots but kind of call um and IT doesn't so there there some uh ambiguity about which model this actually will be. So it's unclear whether it's like it's called the rock beta, right but um IT might be rock to uh IT might be grown mini. There's a bunch of uncertainty there, but I guess we will be finding that out fairly soon.

So as a comparison, GPT-4o costs two and a half dollars per million input tokens and ten dollars per million output tokens. So it costs less than this grok-beta API, making grok-beta, I think, probably not a better deal. I mean, we don't have a fully reliable comparison, but I would imagine GPT-4o is, for most people, still preferable, or at least on par with Grok. So I would not see many people switching over if Grok is not able to be competitive on pricing.

Definitely. Depending on the use case, maybe there are advantages, but that's a good point: on price, it's going to be challenged.
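(For a concrete sense of what the API looks like, here is a minimal sketch of a chat call. The endpoint, model name, and response shape are assumptions based on the announcement's OpenAI-compatible framing, since the API wasn't fully live at recording time.)

```python
# Hypothetical grok-beta call; endpoint and field names are assumptions,
# not verified against live docs.
import os
import requests

resp = requests.post(
    "https://api.x.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
    json={
        "model": "grok-beta",
        "messages": [{"role": "user", "content": "Summarize last week in AI."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```

At the quoted prices, a call with 1,000 input tokens and 500 output tokens would cost about $0.005 plus $0.0075, so roughly 1.25 cents.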

Moving over to projects and open source. We begin with INTELLECT-1, the first decentralized training run of a 10-billion-parameter AI model. This is a little bit on the older front, but I believe this is one story, Jeremie, that you were catching up on and were very excited to talk about. So the project builds on OpenDiLoCo, an open-source implementation of DeepMind's DiLoCo, distributed low-communication, method for training. And what this tells us is, of course, that training multi-billion-parameter models is challenging, and the kind of bottleneck for scaling up is being able to train massive models like this. And the only way to do that right now, for the likes of OpenAI, is to have access to ridiculous data centers, ridiculous amounts of compute, and complicated architectures for combining all of that compute and parallelizing training. The ability to do decentralized training across not one source of compute but many separated sites would allow other organizations, companies without as much centralized compute, you could say, to do training as well. So that's a high-level summary, Jeremie. I'm sure you have things to say, so I'll let you take it from here.

Oh yes. First of all, I'm going to say this is like the nightmare situation for anybody who worries about weaponization risk and AI policy trying to control these models. This is the first time we've seen a model at the 10-billion-parameter scale trained in this kind of distributed way.

That's kind of the issue, right? It's much, much harder to control if you've got, I don't know, a site on the Eastern Seaboard and another somewhere in Mexico and you're able to train across all of that. That's not quite what's happening here, but it's kind of on the way there.

So previously this had only been done with one-billion-parameter models, so this is a 10x scale-up to ten billion. So DiLoCo, and I'm trying to remember whether we covered this,

we may have, is a Google DeepMind strategy that was proposed for distributed training. Basically, you can think of it like this: you have a bunch of local, poorly connected clusters of compute, and during training, you pass the model weights out to all your different clusters. They each train on a batch of data that they have locally.

And once they're done that round of training, just one step of training, let's say, though actually it can be quite a few, they have a new model that is slightly different from the old model. There's a difference in the parameter values, a parameter delta, let's say, that is going to be called a pseudo-gradient.

It's not the actual full gradient, but it's the gradient as computed locally in that one cluster. Then they communicate their pseudo-gradients back up to some central node, where you get averaging, a pooling together of those gradients, to update the overall model, which then gets sent down again. And all the artistry here is in figuring out how you set things up so that as much compute as possible can be spent at the nodes without communicating back to do that pooled averaging too often, because that's really what kills you with latency when you try to do these distributed training schemes. The advantage here, by the way, is that you're not sending training data back up to a centralized node, so you're able to do this in a partly privacy-preserving way.

In a sense, they're just sharing the weight updates rather than the data itself, which is kind of interesting. But yeah, so they basically double down on using this DiLoCo strategy from DeepMind. Essentially, it's a version of this whole scheme where you take advantage of a specialized optimizer, AdamW, for the local updates, and they use Nesterov momentum for the global updates. That's important because, since you're getting less frequent but larger updates, you need a way to make that training stable, and that's why they use Nesterov momentum, which, as the name implies, is a momentum-based approach that lets you train more smoothly at that higher level. Anyway, really impressive.
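(To make that inner/outer optimizer split concrete, here is a rough sketch of one DiLoCo-style round. The function and worker interfaces are illustrative, not the actual OpenDiLoCo or Prime API.)

```python
# Rough sketch of a DiLoCo-style round: each worker runs many local AdamW
# steps, then the "pseudo-gradients" (weight deltas) are averaged and
# applied with an outer Nesterov-momentum SGD step.
import copy
import torch

def diloco_round(global_model, workers, outer_opt, local_steps=500):
    deltas = []
    for worker in workers:
        local = copy.deepcopy(global_model)              # ship weights out
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-4)
        for _ in range(local_steps):                     # train on local data
            loss = worker.compute_loss(local)            # placeholder interface
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        deltas.append([g.detach() - l.detach()           # pseudo-gradient
                       for g, l in zip(global_model.parameters(),
                                       local.parameters())])
    outer_opt.zero_grad()
    for i, p in enumerate(global_model.parameters()):    # average across workers
        p.grad = torch.stack([d[i] for d in deltas]).mean(dim=0)
    outer_opt.step()                                     # one outer update

# Outer optimizer per the DiLoCo recipe: SGD with Nesterov momentum, e.g.
# outer_opt = torch.optim.SGD(model.parameters(), lr=0.7,
#                             momentum=0.9, nesterov=True)
```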

Lots of detailed work is involved in setting up this training framework, which they call Prime: all kinds of stuff like dynamic on/off-ramping that they figured out, which is how to flexibly add and remove computational nodes. So let's say a new GPU comes online and wants to contribute to the cluster during the training process. How do you ramp it up, and then how do you off-ramp it when it's no longer available? It's really designed to facilitate a very dynamic, very messy training scheme.

And they've got all kinds of things, like asynchronous checkpointing, saving the model at different checkpoints in this very janky way, which is, anyway, really interesting, very gritty and detailed. The bottom line is this kind of training scheme comes with huge implications for policy, huge implications for open source. This is going to make open-source models trainable at scale much more easily, and it makes it easier for people to pool resources, including computational resources, across a wide range of heterogeneous clusters.

Exactly. So in practice, what this means is you can combine compute from all around the globe and have clusters in very different locations working together, whereas, I would imagine, Meta, OpenAI, et cetera are trying as hard as possible to have one single huge cluster do all the work. And it's interesting, they go into a decent amount of detail, and they also link to a dashboard you can go to where they have a leaderboard of compute contributors. Number one is Arcee AI, and at number five is Hugging Face, which contributed around 2,800 H100-hours, I think. But conceptually, anyone can join and contribute hours, right? And you can even look at a training progress bar, almost like a benchmark: out of one trillion tokens, it's at 23.21 percent. So the idea is you can have a lot of people pool their money and resources to train, let's say, a hundred-billion-parameter model, which otherwise, of course, would require being someone like OpenAI with hundreds of millions of dollars to even try to do.

to do yeah one thing I will say is a kind of like globally distributed training are using a scheme like this not quite yet possible like the um so if you look at like the big companies like google and and met on someone when they do train across multiple clusters, those clusters are even they tend to be very close geographically, like we say within twenty miles of each other.

Um and and that's for all kinds of latency issues that are currently an open problem here. So there is like there are limitations, but um it's it's the places where anyway the scheme is like just taking the l on that latency in places where it's not and it's really interesting and it's quite messy. It's it's meant to be messy, which is one of the characteristics of the open source ecosystem. So anyway, there is there is that one.

Next up, we have a story related to Meta. They've released a whole bunch of new stuff: eight new AI research artifacts, so models, datasets, and tools.

The big one is an upgrade to SAM, the Segment Anything Model, which is now at 2.1, but there's a whole bunch of other stuff. One of them is Meta Spirit LM, an open-source language model that integrates speech and text, which allows for more natural-sounding speech generation. They also have Layer Skip, an end-to-end solution that accelerates LLM generation times.

They have SALSA, new code for benchmarking AI-based attacks on cryptographic systems; Meta Lingua, a lightweight codebase for training language models at scale; and yeah, a lot of stuff. There are like three more things, and I won't go into detail on every one of them. But the summary here is they have bundled all of these together in a single blog post, and some of these, SAM 2.1 for instance, are very significant; that is something that can be used across lots of different applications and will have a major impact.

The one that maybe we should highlight a little bit more is the Self-Taught Evaluator, which is a method for generating synthetic preference data to train reward models without relying on human annotations. That's very important for being able to do RLHF-like training with better ability to scale. But it's interesting to see contributions across very different fronts: cybersecurity, multimodality, alignment, various sectors. Meta is contributing.

Yeah, it's sort of interesting that they decided to bundle this all together; it makes it so much harder to report on. I agree the Self-Taught Evaluator is the standout to me, for sure. So they use an LLM as a judge to set up reasoning traces, in what they call an iterative self-improvement scheme. And this is something that we see a lot. One of the key uses of Anthropic's Claude 3.5 Sonnet (new) that was flagged in their release was, hey, you can actually use this to generate data to train models, sorry, to train agentic models. And likewise, the speculation, which I think is pretty accurate, is that OpenAI's o1 model is being used, or has been used, to train the Orion model that is yet to be released and will probably be released fairly soon; it wrapped training in September, people think. So using these kinds of incremental agentic models to create reasoning traces that can be used to train even better agentic models, you can sort of see how this gets you into a takeoff scenario.

It is very much the in-vogue thing right now. Anyway, interesting to see Meta push in that direction. They were sort of the last company not to do real intense agentic stuff that we've heard of, though I guess Google DeepMind, we haven't heard of an equivalent to o1 from them either, but under the hood, you'd believe they're developing that too.
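(For flavor, a loose sketch of the Self-Taught Evaluator loop: synthetic preference pairs are built by answering a deliberately corrupted instruction, and the judge keeps only the reasoning traces that reach the known-correct verdict. `perturb` and `finetune` are placeholder interfaces, and the published recipe differs in its details.)

```python
# Loose sketch of a self-taught evaluator round (illustrative only).
# The response to a corrupted instruction serves as the "rejected" answer,
# so no human preference labels are needed.
def self_taught_round(judge, instructions, n_tries=8):
    traces = []
    for x in instructions:
        chosen = judge.respond(x)
        rejected = judge.respond(perturb(x))   # answer to a similar-but-wrong prompt
        for _ in range(n_tries):
            verdict = judge.compare(x, chosen, rejected)  # LLM-as-judge w/ reasoning
            if verdict.winner == "chosen":     # keep judgments known to be right
                traces.append((x, chosen, rejected, verdict.reasoning))
                break
    return finetune(judge, traces)             # judge trains on its own best traces
```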

And speaking of DeepMind, that's the next story: they are open-sourcing their AI text watermarking solution, SynthID. We've covered SynthID in the past; it has been integrated into Gemini and various frameworks, and now a paper on SynthID-Text has been published in Nature.

Yet another Nature publication for them. They say that they analyzed scores for around twenty million watermarked and unwatermarked chat responses and found that users do not notice a difference in quality and usefulness between the two. So that means you can apply the watermark, this invisible kind of trace in the text that would let you tell if it came from an LLM, without changing the user experience. And the open-sourcing of the solution is pretty notable because there isn't a standardized watermarking technique for LLM outputs, and with it being open source, perhaps it could be adopted by other organizations.

Yep, and it does suffer from limitations. I mean, this is one of those things where, especially for text generation, I just think this is an intractable problem that we're going to have to recognize as such. If you write a small enough piece of text, there simply isn't the opportunity to insert the telltale signs that say, hey, this was actually AI-generated. They did find that with text generation, their system was very resistant to some forms of tampering.

So they talked about cropping text and light editing or rewriting. But when AI-generated text was rewritten or translated from one language to another, then all of a sudden you lose that ability. And so that seems like a pretty low-hanging-fruit strategy to get around this. And, you know, not too surprising, right? There are just too many generative AI tools.

You can kind of play them off each other and shake off the watermarks that way. So even just going to another tool and saying, hey, can you rewrite this: go to the Google product to get your high-quality text, then use a low-quality model to just rephrase things locally to get rid of the statistical indicators that the watermarking put in. So I think this is good stuff, there's no question, but I think it's far from being a panacea, and I don't think DeepMind is under any impression otherwise. That is just one of these facts of the matter about the year 2024: you just can't

expect text watermarking to fully work.

Exactly. And as with other watermarking, really what it does is raise the effort required to try to pass something off as not AI-generated, right? So if you take something from Gemini and copy-paste it somewhere, you can actually check it against the watermark. And if someone doesn't try too hard to obfuscate it, it is now possible to check whether it was LLM-generated. So it's definitely useful, and it would be great to see some sort of standard, or just in general for any LLM output to carry some sort of watermark, so that, assuming it hasn't been tampered with, you can verify whether something is AI-generated or not.
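(To give a feel for how statistical text watermarking works at all, here's a toy "green-list" style detector. This is the generic idea only, not SynthID-Text's actual tournament-sampling scheme.)

```python
# Toy green-list watermark: a keyed hash of each token pair marks some
# tokens "green". At generation time you would bias sampling toward green
# tokens; detection then just counts how many greens appear.
import hashlib

def is_green(prev_tok: int, tok: int, key: bytes, frac: float = 0.5) -> bool:
    h = hashlib.sha256(key + prev_tok.to_bytes(4, "big") +
                       tok.to_bytes(4, "big")).digest()
    return h[0] / 255.0 < frac

def green_fraction(token_ids: list[int], key: bytes) -> float:
    hits = sum(is_green(a, b, key) for a, b in zip(token_ids, token_ids[1:]))
    return hits / max(len(token_ids) - 1, 1)

# Unwatermarked text scores near 0.5; watermarked text scores higher.
# Short snippets, paraphrasing, or translation wash the signal out, which
# is exactly the limitation discussed above.
```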

On to research and advancements, and we begin with a paper from OpenAI. It's about a new technique to speed up image generation by fifty times, sort of. That's not exactly what they do; what they're looking at is being able to simplify and optimize consistency model training.

So, diffusion models for generating images work well, of course, but the problem with diffusion is it's a multi-step process to generate your image or whatever you're generating: the denoising process is iterative, and that leads to some difficulty in making it very fast. Consistency models are a general technique that has been explored for a while that makes it so you can generate something of high quality with very few iterations, maybe one or two iterations at most. So this paper is titled "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models."

It's not introducing a new technique so much as bundling a set of ideas that make it easier to use consistency models at scale. The output of that is they train continuous-time consistency models at a scale of 1.5 billion parameters, and as a result, they're able to output very high-quality images, very close in quality to diffusion models, while being, per the article title, tens of times faster, fifty times faster.

So this is pushing on a front of research that has been ongoing for a while. I believe we have reported on consistency models in the past; it's one of the widely explored ways to push down generation times for images, and here OpenAI is contributing to that drive.
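(The speedup is easiest to see in the sampling loops. A schematic comparison, with `model` as a placeholder and the noise schedule grossly simplified relative to the paper:)

```python
# Diffusion denoises iteratively; a consistency model f(x_t, t) maps noise
# straight to a clean sample in one (or two) forward passes.
import torch

@torch.no_grad()
def diffusion_sample(model, shape, steps=50):
    x = torch.randn(shape)
    for t in reversed(range(steps)):      # ~50 forward passes
        x = model.denoise_step(x, t)
    return x

@torch.no_grad()
def consistency_sample(model, shape, sigma_max=80.0):
    x = sigma_max * torch.randn(shape)
    return model(x, sigma_max)            # 1 forward pass (2 with refinement)
```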

I think this is so one point five billion prior model um by the way, that can generate a sample in attend to the second on one a one hundred G P U, not an eight one hundred and eight one hundred, right? So we're really like starting to floor without that fifty x really gets you from this is totally inaccessible IT is a industrial great thing to ah actually like we're starting to pushing to consumer great territory with this kind of scale um which is pretty remarkable.

The other thing you can you can deserve from this, you just like back of envelope ah Matthew, one point five billion meters, one eight one hundred tp and one image in a sense of a second well video generation right is at usually about thirty frames per second for for like decent quality. Um we're already at like ten frames per second uh with a one GPU set up. So kind of interesting.

Obviously, there's more that goes into uh a video generation model than than an image generation model in terms of consistency between but still like starting to get into that zone where you're getting to one you one second of generative video per second watched so you can effectively stream um A I generated video and any further computations from that point on go into further optimization or go into things like um you can spend them on biometric response is right, is your. Flash are your people's dilating. How closely are you interacting? How are you interacting with the contents of being generated? Like very quickly you get into a space where like you're watching videos that adapt in real time to your kind of bio physiological response.

And and that's a world that could be very uh, interesting and also very concerning from a addiction standpoint like a basically video crack. So uh anyway, like I think that kind of Frankly where all this is heading like one way or another, we're going to really find out what IT looks like yeah this fifty x lift pret pretty open the eye strategy here to just bundle a bunch of existing techniques and figure out the engineering that turns out to be the big problem, right? Very often you get A A nice small scale paper, but the real proof is in the the implementation putting and so this very much open .

type of solve. Next, another story that is very much of a Jeremie flavor: this is some research from Epoch AI, who we've talked about before; most recently we covered their research on the feasibility of scaling until 2030 in AI training.

This new report by them is on machine learning hardware, and it basically tracks the leading chips, the hardware used for training, over time, starting as far back as 2008 with GeForce GTX cards, up to today with things like the AMD Instinct MI325X and NVIDIA GB200, as we've covered. And what we get from Epoch is a graph of the scaling of performance over time.

What we're seeing is, on average, roughly a 1.3x per-year improvement, meaning that the computational performance of machine learning hardware has doubled every 2.8 years if you go from 2008 to 2024. And there are all sorts of related metrics: performance per dollar has improved around 30 percent each year, ML hardware becomes 50 percent more energy efficient each year, et cetera. So once again, I'll hand it to you, Jeremie: what did you find interesting in here?
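(Quick sanity check that those two headline figures are consistent with each other:)

```python
import math

# A 2.8-year doubling time corresponds to a yearly growth factor of
# 2 ** (1 / 2.8) ~= 1.28, i.e. the "roughly 1.3x per year" figure.
print(2 ** (1 / 2.8))               # ~1.281
print(math.log(2) / math.log(1.3))  # doubling time at exactly 1.3x/yr: ~2.64 years
```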

Yeah, I think the breakdown of who owns what was really interesting. So one piece is basically the way to think about the industry, and this has been pretty clearly true for a long time, but it's just good to see it in a graph: everything is NVIDIA GPUs, except Google. Google has a whole bunch of their own TPUs; they've been designing them themselves for a long, long time.

They've been into liquid cooling long before it was hot. Long before it was cool? Look at that, I didn't even try, okay. Anyway, so Google's stockpile does include a lot of NVIDIA hardware, but also their own TPUs, which gives them a whole bunch of advantages, right? They have the ability in-house to do their own design. That is really, really hard, by the way. Chip design sometimes sounds like, oh, this isn't that hard, you're not even fabricating the things. But no, it's really, really hard: hundreds of millions of dollars just to get a design operation off the ground. We've seen a lot of startups die on the runway just trying to do that.

The other piece was how the stuff is distributed between the different big players. Basically, it's four big players that own essentially all of the highly consolidated stockpiles of AI hardware. You've got Google; you've got Microsoft, including OpenAI, or Microsoft and OpenAI together; you've got Meta.

And you've got Amazon. Amazon is in last place among those four, not at all surprising given that Amazon has been very, very late to the party on recognizing the potential for AGI to come soon. They did stand up a team internally that they call their AGI team, but I'm a little skeptical of that effort, though it could ramp quickly and soon. For right now, they're still suffering the cost of having been late to the game. Meta likewise; one might imagine blaming the ghost

of Yann LeCun for this, the, sort of, looming specter of his skepticism. I don't know, I don't know what I'm talking about. Anyway, various people have made the claim that this sort of late entry has come in part from that skepticism, which, anyway, is a separate conversation. So then we have Microsoft in second place.

This is interesting because a lot of people, and we've talked about this on the podcast too, with Microsoft and OpenAI it's like, be careful not to poke the dragon: you may think you're in a great position in terms of data center builds and all that stuff, but the real giant in the room, with literally double the compute capacity relative to Microsoft, is Google. If you measure in terms of H100 equivalents, double the capacity.

Most of that, about three quarters, is made up of their own internal custom hardware, the TPUs. There are uncertainties on all the estimates, but it's clear that Google is comfortably ahead, part of the reason why they've been able to catch up so fast and so aggressively. Besides that, there is another big pool, basically equivalent to the total amount of compute Google owns, that everyone else has.

And this is a whole bunch of players; CoreWeave is one of them. I'm trying to dig up the actual list, but they say basically Oracle, CoreWeave, compute users like xAI and Tesla, Chinese companies, governments with sovereign AI national clouds, that sort of thing. So basically everything else falls into that bucket. It's just good for rough orders of magnitude, right?

So Google has about the same amount of hardware, in H100 equivalents, as everybody who's not Microsoft, Meta, or Amazon combined. That I thought was kind of interesting, and good to see graphed out. Epoch AI continues to do amazing, amazing work tracking the hardware story here. They're a great source of data, so highly recommend, especially if you're in the policy world. I was talking to some people at the Department of Energy recently and was flagging this resource for them; I think it's still quite under-leveraged. So highly recommend checking them out.

Yeah, very interesting report. On the numbers front, I guess it's pretty fun to just say that Google has over one million H100 equivalents of compute, actually 1.25 million. So imagine over a million GPUs; that's what Google supposedly has. And they do have confidence intervals on these things, by the way.

So these aren't exact numbers, but as you said, I think it's very important to remember that Google, even in AI circles and to some extent on this podcast, is sometimes presented as being behind OpenAI, as not being the leader. But if we believe the scaling story, that AGI is just a matter of being able to scale at some point, then there's a very real competitive advantage here. It's hard to say, but it's important to remember that access to hardware is a very, very big factor, as we say over and over on here.

Yeah, and it's one thing to have the hardware; it's another to be able to pool it toward a single scaled project. And what this graph doesn't show us, right, is how many of these are being used just for inference. Meta, for example, has outrageously large inference requirements.

And so you can't just throw all of these at a training run, and they may be geographically distributed, all that stuff. But if the fact that Google has 2x the compute that Microsoft does is true in the poolable training sense as well, that basically means that OpenAI researchers, or Microsoft researchers, have to somehow find algorithmic innovations that make up for that 2x.

That's not all that hard to do, by the way; one good idea could get you a 5x or 10x lift. But it does mean there is a bit of a deficit to overcome, and that's going to be part of what to keep an eye on: can their comparative advantage in algorithmic efficiency make up for the deficiency in hardware

availability? On to the lightning round. We begin with some research on being able to improve the training of agentic LLMs, and this is coming from DeepMind, as well as CMU and Google Research. The title is "Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning."

So one approach to training LLMs for reasoning is to explicitly reward them for correct reasoning, per step. There is a process to reasoning, right, where you think step by step. And going back to a 2022 paper, there's been work on giving intermediate rewards: instead of just rewarding the LLM for outputs with a correct answer, you also reward it for each of those steps in the reasoning chain being correct. That's what we call a process reward model.

So there has been published research on this from quite a while ago now, initially showing you could do this for math reasoning. But it is difficult to scale, because there may not be a clear way to actually provide a reward for a reasoning step, right? So what this paper does is show how you can scale and kind of generalize this whole idea by saying that the reward for a given reasoning step should be a measure of progress: the reward is the change in the likelihood of producing a correct response in the future, before versus after the reasoning step. And when you take this approach to rewarding reasoning steps, that leads to being able to train more efficiently. You can be 1.5 to 5 times more compute-efficient and actually get better results, 8 percent more accurate, compared to another possible approach, outcome reward models. So yeah, it's interesting to see research on being able to train reasoning, which is, as we found out, one of the very big important ideas in o1 from OpenAI.

Yeah, and this is something where my research attention has been drawn increasingly toward schemes like this. Obviously, we've been talking for, what, three years about this idea; I think in one of the first episodes we had together, we talked about this idea that you can trade off training-time compute for inference-time compute, right? If you have a base model that's pretty decent, you find a way, and back then it was "let's think about this step by step," right, that was the way, to get the model to just generate more tokens on the way to producing the final answer. The fundamental thing you're really doing with stuff like that is finding a way to just spend more FLOPs, more computing power, at inference time to chew on your problem, and that can give you a really big lift, it turns out.

So then the question becomes: what's the optimal way to do that? What is the strategy that allows you to spend your FLOPs intelligently? And that is one of the things this paper is trying to figure out. Part of the reason I'm especially drawn to this is I think it's directionally what is happening with OpenAI's o1. I think it's really interesting this paper was published, and it gives us a bit of a glimpse into how you might pull this off.

So first of all, what they do here is they start by collecting a whole bunch of query-solution pairs for a given problem, and for each of those pairs, they're going to generate what they call a Monte Carlo rollout, basically a full reasoning trace. It's going to be crappy at first; it's the base model just trying to do its thing. And then for each of those individual steps, they're going to go, okay: for this step of the rollout, I'm going to generate, say, a hundred different possible rollouts from that step. So at step number one: give me a hundred different possible step-ones to solve this problem. And then for each of those, go to the next step: give me a hundred different step-twos that I could do. Now, obviously there's pruning that happens, because very quickly that becomes unmanageable. But fundamentally, what they're trying to figure out is, for the cases where you get correct solutions, you can essentially measure the value of a given step based on the fraction of solutions that spawn from it that end up being correct. And that allows you to assign this Q value, basically a correctness value, a worth value, to that reasoning step.

And you can use that to detect, okay, when do we see big drops in that Q value? In other words, when does a promising line of reasoning go wrong? Let's say at step three, most of the solutions that spawn from that step lead to the right final answer, but you go to the next step and you see a big drop, right? Well, that tells you that you went from a once-promising line of reasoning to one that has now turned bad. And at those steps, they're basically going to do more rollouts to really understand, to get a more accurate Q value, a more accurate assessment of the worth of those steps where reasoning hits a wall, so the model can better understand these, what they call reasoning pits, where the reasoning just falls off the edge of a cliff, because those have disproportionate learning value for the system. And so that's how they're going to train their, they call it a prover model; they train that prover model.

Essentially, what that model does is it's really good at looking at a particular reasoning step and assessing how good of a reasoning step it is. They then use that to train the actual language model, the base policy. And then they do more of that branching strategy with the base model: they take the language model, they generate N different possible first lines of a reasoning trace, they score those lines using their prover model to know which ones look most promising, then pick the most promising ones and generate a bunch more next steps from those. And again, keep pruning and generating, pruning and generating, until eventually they hit a stopping condition.

So this is, I think, potentially architecturally similar to what's going on with OpenAI's o1. There is a lot of speculation about what exactly is under the hood, but to me, this paper reads as something that could well be an important ingredient there, which is why I thought it was worth flagging, and it's definitely a great strategy for just plowing more compute into inference time. And doesn't this look a lot like AlphaGo, right? Doesn't it look a lot like all the Alpha-series models, where you're seeing that Monte Carlo tree search kind of explore the next possible steps, prune, explore, prune? I think that's a big part of what's going on here. It's known that reinforcement learning is playing a role in o1; obviously, OpenAI has been upfront about that. I suspect the scheme is something like this, at least qualitatively.

Yeah, I think it's a good comparison point, right? In the sense that reasoning in general in o1, and I guess in the paradigm of inference-time compute, essentially gives the LLM steps of reasoning: at every step it outputs some chunk of thinking, and at the end it outputs a solution based on all the chunks combined. So in that sense, it's similar to something like AlphaGo, where you think one step ahead, two steps ahead, three steps ahead, and where you would train to be able to know, for every step, how it impacts the eventual outcome. So there is a comparison to be made there, for sure. Okay.
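(A toy sketch of the Monte Carlo labeling step described above; `llm` is a placeholder interface, and the paper's actual process advantage computation has more moving parts.)

```python
# Q of a reasoning prefix = fraction of sampled completions that reach the
# correct answer; a step's "progress" reward is the change in Q it causes.
def q_value(llm, question, prefix_steps, answer, n_rollouts=100):
    hits = sum(
        llm.complete(question, prefix_steps).final_answer == answer
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts

def step_rewards(llm, question, trace, answer):
    rewards, prev_q = [], q_value(llm, question, [], answer)
    for i in range(1, len(trace) + 1):
        q = q_value(llm, question, trace[:i], answer)
        rewards.append(q - prev_q)   # a big negative value flags a "reasoning pit"
        prev_q = q
    return rewards
```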

And one more story, this time on inference-time scaling. The paper is "Inference Scaling for Long-Context Retrieval Augmented Generation," so retrieval-augmented generation, RAG. We haven't talked about it as much recently, but it is important to keep in mind in the sense that this is one of the key techniques being applied and used in various areas. My impression is, when you talk to enterprises, when you look at companies, a lot of what they're spending money on is retrieval-augmented generation, to be able to have the context of whatever documents you have, or whatever the user is doing. So RAG is a very important thing to use and optimize.

This paper is looking into that, in particular long-context RAG, and exploring whether you can improve the performance of RAG via inference-time compute. So it asks: how can RAG benefit from scaling inference-time computation, and can the optimal test-time compute allocation for a given budget be predicted? As you might guess, they do have some answers to those questions, and they show that scaling up inference-time compute on these long-context LLMs can achieve up to 60 percent gains on benchmarks compared to standard retrieval-augmented generation.

Yeah, I think this is another one for the inference-scaling category. To me, inference scaling is maybe the most important line of research currently happening, at least out in the open, the stuff we're able to see. And even though this is RAG, which I don't think is the future, we're still learning a lot of interesting things about how inference-time compute scales here.

So typically, when you want to scale up inference-time compute for RAG, which again is this idea of having a model that reaches into a database to pull up relevant documents to inform its output before spitting out that output, the way you usually scale is you just get the model to call up more documents, right? That's one easy way to scale up your inference-time compute: put more tokens in the context window, where those tokens are just the text of more documents. The problem with that is, although it can lead to performance gains early on, pretty quickly your documents are just adding more noise to the context window than they add value, and so your performance can actually drop above a certain threshold. One strategy that's used to solve for this is demonstration-based RAG.

Basically, this is like: fine, if I can't just throw a bunch more documents into my context window, maybe what I can do instead is include a bunch of examples of successful retrieval-and-generation tasks performed in context. This is basically just the same old few-shot learning strategy we've seen before: show your model exactly the kind of outputs you want, the ways in which you would want it to consider the documents that you feed to it. Put all that in context; literally show it fully worked-out examples of other RAG queries, the responses, and the documents that were used to drive those responses. And hopefully, by including more of those few-shot examples, now you're cramming in more tokens, and those tokens offer information about the kind of response you want. Maybe that'll work better, and you do see that scale a bit, but it faces a similar problem, where there is only so much additional information you can get out of that process. And so they introduce this new approach, which is like: okay, in addition to increasing the number of documents in my context window, in addition to increasing the number of examples of this kind of RAG task, maybe what we can do is find a way to take our input query, and assume it's a fairly complex query,

I don't know, say it requires analysis of a whole body of documents, and break it up into simpler sub-queries, then answer those using interleaved retrieval. So it's kind of like the way an agent takes a complex task and breaks it up into sub-tasks for execution, and that's actually exactly kind of what's happening, right? This is sort of an agent where the action space is limited to just retrieval; that's all the agent can really do. But still, you're outsourcing the breaking down of the query into smaller sub-components to the agent. And by doing this iteratively, you can construct essentially a reasoning chain with interleaved document retrieval that kind of bridges the compositionality gap for these multi-hop queries that have many, many different steps. And that allows, again, your model to invest more FLOPs into understanding the problem before it generates the final output.

And what this does is you're left with three different parameters that you can now scale, right? We can think about scaling the number of documents; we can think about scaling the number of in-context examples, that few-shot learning thing we were talking about; and then we can think about scaling the number of these reasoning steps, how much we allow the model to chew on or parse the problem. And it turns out that if you scale these three things together, you get almost linear improvements in RAG performance as you scale compute exponentially, basically a power-law improvement, just like the scaling laws for training and the other inference scaling laws that we've seen.

So I thought this was really interesting, because it's a sort of messy way of looking at scaling laws. We have these three different things that we can tinker with, and in a janky way, they end up reproducing the same Pareto frontier, this linear-in-log-space improvement in performance. And so I thought that was really interesting: again, more of a hint of what some have argued are almost physical fundamentals behind the scaling laws, this idea that if you find a consistent, robust way to funnel compute into solving your problem, you will find a predictable pattern emerge in the form of a power law, in the form of these sorts of scaling laws. So a really interesting paper that, again, I think just bolsters the argument for the robustness of scaling laws, when it even applies to a domain like RAG, as long as you frame the problem the right way and look at it from the right perspective. I just think this is really interesting.
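(A rough sketch of that interleaved retrieval idea, which the paper calls IterDRAG; `llm` and `retriever` are placeholder interfaces, and the real prompting setup is more involved.)

```python
# Decompose the query into sub-queries, retrieve per hop, and fold the
# intermediate answers back in before producing the final answer.
def iterative_rag(llm, retriever, query, max_hops=4, k=5):
    context, sub_answers = [], []
    for _ in range(max_hops):
        sub_q = llm.next_subquery(query, context, sub_answers)  # decompose
        if sub_q is None:                                       # model says done
            break
        docs = retriever.search(sub_q, k=k)                     # retrieve per hop
        context.extend(docs)
        sub_answers.append(llm.answer(sub_q, docs))             # intermediate answer
    return llm.final_answer(query, context, sub_answers)
```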

Yeah, the scaling here is also interesting. They combine everything into this one metric, the effective context length.

The way I think of it is basically: how much is in your input? How much text are you processing? And as you might imagine, if you have more examples, if you're pulling in more documents, if you're doing more iterations, all of that adds up, resulting in just more text in your input prior to your output.

So the empirical observation you're seeing is you can scale the context length from one thousand to ten thousand up to one million with a combination of these techniques, or the individual techniques, and in general, you see an improvement in performance. Although I will say, looking at this graph, figure four, there's a lot of averaging going on to get to an empirical law. So it's not quite as, let's say, elegant as the normal scaling laws for perplexity, which are much cleaner, loosely speaking.

Yeah, what they're doing is kind of a scan in parameter space. You've got these three different things that you can scale, and they try all the different combinations of those three things, and then for each effective context length, they look at the combination that performed best. And if you just look at those, you get something quite linear. So yeah, this is the kind of jankiness I was talking about; it's very empirical. But I mean, it is striking that the Pareto frontier of capability is this interestingly straight line in log space. So that's the kind of robust truth that seems to be popping up again and again.

Yeah, true. All right, on to policy and safety. We begin with Anthropic: they have announced an update to their Responsible Scaling Policy, which we've been mentioning a few times on this episode.

So Anthropic has had this Responsible Scaling Policy for a while, where they say, we will move the frontier of AI in a way that ensures safety and kind of minimizes risk. And here they introduce a little more nuance, you could say. They say this is a more flexible approach to managing AI risks while maintaining a commitment to only train and deploy models with adequate safeguards. So this updated policy includes new capability thresholds, refined processes for evaluating model capabilities and safeguards, and new measures for internal governance and external input.

So there's a couple of things that sort of jump out at me. First of all, they're including a little bit more information on one of the AI safety level thresholds they've defined. So Anthropic's whole approach is based on AI Safety Levels, ASLs; these are inspired directly by biosafety labs, which have biosafety levels: BSL-1, BSL-2, BSL-3, and so on. So what they're doing is they set a bunch of capability thresholds, and they're like, okay, if you meet this capability threshold, you're an ASL-2 model, which means we will apply these mitigation strategies. And what they're doing here is telling us more about their ASL-3 level.

This is essentially, you can think of it as: WMD risk is starting to become a thing. That's the ASL-3 idea, the basic concept.

They're saying stuff like, look, once we get there, we're going to introduce access controls on the development context, like a tiered access system that allows for nuanced control over safeguard adjustments. If you're going to go in and tweak the wrappers, the fine-tuning, whatever the alignment pieces are, you need special access, and they want to set up a whole system such that no one individual can just go in and make a tweak. That makes a lot of sense. There's stuff like real-time prompt and completion classifiers. A lot of this, by the way, is stuff that they're almost surely already doing right now; they're just making it clear that, yes, this will be part of the cocktail once we get into that space. And then there's post-hoc jailbreak detection: after a jailbreak is discovered, having the ability and the infrastructure to go and rapidly fix it, but also having real-time monitoring systems up as well. So nothing too surprising in terms of what they're going to do at ASL-3.

I think one of the most interesting things here is the absence of updates to ASL-4. Now, a bit of inside baseball, but right now there is genuine confusion at the labs overall, without naming names, about what security and safety protocols to even recommend at that level, once you have models that can self-exfiltrate or do autonomous research, which we are not necessarily very far away from. In fact, Sonnet 3.5 (new) can already do some basic research tasks, and we've talked to researchers about this; anyway, more will be written about it, at least from my end, in the next couple of months.

But the bottom line is we're not necessarily far from it, and there really, really, really is no plan. And I think that's one of the things that becomes more and more obvious with every passing quarter, as Anthropic does not release ASL-4 plans that are concrete and actually deal, to their satisfaction, with the whole issue. And I think it's actually, funnily enough, to Anthropic's immense credit that they don't try to snow the population and just say, hey, these are the mitigations that we have, those are going to be fine, don't worry about it. They are essentially implicitly recognizing that this is an unsolved problem. We're not necessarily seeing the same thing from other labs, and it would be nice to have some clarity around what those thresholds are, and then, what actually happens if we're confused about the strategies, if we don't have the techniques we need to secure those models?

The fact that ASL-4 is not on this list of updates is not a coincidence and ought to be read into to some degree, in my estimation. But yeah, the other thing is that they're sharing two different capability thresholds that would require updated safeguards, one of which is autonomous AI R&D. So if a model meets that standard, then they're saying, okay, this would be an additional thing on top of ASL-3 in its current configuration. And then the second was CBRN,

so chemical, biological, radiological, and nuclear risk. If the model can meaningfully assist somebody with a basic technical background in creating or deploying those sorts of weapons, then we need enhanced security. That "basic technical background" caveat is relevant because we've actually seen models kind of do that if you have a PhD. OpenAI released a report about that, which we talked about a while ago. Even today's models, even yesterday's models, could meaningfully increase the probability with which somebody with a PhD could design a bioweapon or make progress towards that.

So it's sort of interesting to set that bar, and I think it's the right bar to set. If you have only a basic, say, undergrad background or something in chemistry and you're able to use these things to make progress, then yeah, that would be a meaningful shift.

Yes, yeah. So "a lot of nitty-gritty" is the summary, I would say. The initial announcement of the scaling policy was in September 2023, so I think there's probably no specific cause for this update; it's probably just that it's been a while and they have found various ways to iterate on and refine the scaling policy.

They summarize the updates in a blog post, but also link to a PDF that has the entire details: twenty-two pages of very specific details as to these thresholds, processes, et cetera, of how they plan to do this safely. So good to see them continuing to clearly care about this. And it's very apparent that internally at Anthropic this isn't something that is just kind of conceptually known, but is also something that is actively being practiced.

And actually, one more story on Anthropic: next is them testing AI's capacity for sabotage. So they have released some research on sabotage evaluations for frontier models and have looked at four different types of sabotage tactics: human decision sabotage, code sabotage, sandbagging, and undermining oversight. The high-level idea, of course, is to make sure that AI can't find its way into breaking rules or evading oversight, right? If you can sabotage mitigations or sabotage human oversight, you can then do things that we would rather you not do. The high-level conclusion from their various tests on these different types of things is that current models pose a low risk in terms of these kinds of capabilities, but they do warn that we will need more realistic evaluations and stronger mitigations as frontier models continue to improve.

There's an interesting quote there, actually: they consider it plausible that, in the near future, the widespread deployment of a misaligned and sufficiently capable model, especially in important organizations, could lead to catastrophic outcomes. And so I think that's kind of interesting.

The "near future" thing certainly aligns, obviously, with what Anthropic has said a lot about, three-year timelines, and maybe in fact even shorter. This is them designing the evals that are going to apply to basically AGI-level systems, to detect things like, yes, sabotaging organizational actions and hiding behaviors until deployment. And this is something where, if you talk to people who worry about this, who do work on evals, and the frontier evals organization Apollo is actually a great example, we've talked about their research before, there's a lot of concern over the next beat, and the beat after that, being sort of where we start to run into these issues in practice. Now, the challenge is, you're talking here about how we assess a model that could be superintelligent, that could be more intelligent than us, have better situational awareness, and so on. How can you tell if that model is, for example, trying to withhold capabilities, to not show us that it has certain dangerous capabilities that we would beat out of it before deploying it, right? In general, it's just really hard to take a human and get them to show you that they don't know how to do something, because they can always fake it: I know how to drive a car, but I can pretend to not know how to drive a car.

So how do you actually solve that problem in a context where the model is actively trying to hide this from you? Now, their position is: today's models may be able to do some of this stuff, and as models get better, they may be able to do more of it, but today's models are not yet able to hide the fact that they can do this stuff, and they don't have a propensity to want to hide those things. That may change over time, as you have things like power-seeking play more and more of a role, and there are indications of things like this already. If you look at the OpenAI o1 system card, there were power-seeking behaviors revealed there, and there are all kinds of cases where OpenAI's o1 would, like, take control of a local software environment on a machine and, in some cases, deliberately hide the fact that it was doing certain things. And so this is very much the next beat; you're going to have to start to think about this sort of thing.

So the kinds of experiments they're doing here are focused on eliciting capabilities, but you can't do that if the model is actively trying to sandbag you. And the idea here would be: if we suddenly see capabilities disappear, right, presumably if all the models up to one level of scale were able to do a thing and now they're just not, maybe that's a bit of a giveaway, too. So there's kind of a meta-science going on here, but they do all kinds of experiments. In one case, they have a complex business decision with a whole bunch of pages of background information, and they ask human participants to make the correct business decision with a model's assistance, and then they tell the model to steer the human towards the wrong answer, right? And this is just a capability assessment: they're explicitly telling the model to do the bad thing. It's not like the model is spontaneously trying to undermine the human here. All their experiments generally take that form, where they're explicitly asking the model to do something that's bad and concerning, in the hopes of discovering, okay, it can do this, before we get to a point where models can both do the thing and hide the fact that they can do the thing. That's sort of the meta strategy

here anyway, right?

Yeah, exactly. This paper is like sixty pages long, with very detailed experiments and a lot of interesting details we won't be able to get into. But to explain a little bit, just so it's clear: they evaluate these four categories of tasks, and each of them comes with a thing the AI is trying to do, the bad thing. So human decision sabotage is steering a human toward bad decisions without appearing suspicious; code sabotage is covertly inserting bugs into a codebase; sandbagging is pretending not to have dangerous capabilities in testing; and undermining oversight is mis-rating another model's actions without detection. And then they're testing to see, for these kinds of bad things, whether the model can get around some kind of oversight. So for code sabotage, it's human code review and audits: can the AI do the bad thing and not get caught by the oversight? For sandbagging, it's capability evaluations, stuff like that. So as you said, they intentionally try to prompt these kinds of behaviors and found that, right now, even when directed to do so, the models aren't very effective at secretly being bad, which I guess is good for us. But we will, presumably, be continuously looking into whether models are capable of being misleading or secretly malicious.

Yeah, and I think there were highlights in terms of the capabilities that were revealed in the data. And you're right, it's like a sixty-page paper; there's a lot. So they did find that even current models have non-trivial performance at sabotage tasks, even though they often don't fully succeed.

So, for instance, they have this human manipulation evaluation, and we actually just talked about that, right? The model is trying to steer the human towards the wrong decision based on a body of text that's really big, too big to parse. But this is the kind of setting you could imagine these models being used in, in the near future, right? Humans just can't read three hundred pages of stuff in five minutes, but we have to make a decision based on that material. There, the models were able to successfully steer humans towards the wrong decision fifty-six percent of the time, so a pretty impressive metric. There were code sabotage tests, where the models showed that they could insert subtle bugs and vulnerabilities that could sometimes evade detection. And the sandbagging evaluation did show that models could try to strategically hide capabilities during testing while retaining them for later use; it did not work often, it was detected, but you are seeing some liftoff on those metrics already. So kind of interesting. On to

the lightning round. Just a few more stories. The first one is about OpenAI asking the US government to approve five-gigawatt data centers, according to a report.

So, reportedly, there has been a plan proposing the construction of five-gigawatt AI data centers across various US cities. And five gigawatts, that's equivalent to about five nuclear reactors, potentially more energy than an entire city uses, or about three million homes. So, you know, kind of a big deal.

I'm sure this is presumably in early stages, right? This is a plan that they are seeking approval for. But it does go to show the kind of ambition of OpenAI when it comes to building these kinds of data centers.

This is part of Sam Altman's, like, super, super ambitious game plan. Um, you know, we've had him talk about seven trillion dollars for, for chips.

Well, now this is, this is that, but for domestic energy demand. Um, so apparently there was a meeting at the White House, and, uh, Sam Altman was there, a bunch of other people there, and OpenAI shared a document that made this pitch for the security benefits of building five-gigawatt data centers in various U.S. states.

And, um, uh, the claim is that the number of data centers that might be contemplated here is five to seven, although the document that was shared with the White House didn't, like, provide that number. Um, so, you know, they, they talked to some energy people, and there is this quote here: "Whatever we're talking about is not only something that's never been done, but I don't believe it's feasible

as an engineer, as somebody who grew up in this. It's certainly not possible under a time frame that's going to address national security in time." And for nuclear, like, my current sense is that that absolutely is, is true. Um, the problem is that the United States does not know how to build nuclear plants in less than, like, a decade, and by then, you know, five gigawatts is just not going to cut it. Or, or even, you know, if you have five to seven different data centers that are each five gigawatts, that's not going to cut it. Ah, you're gonna need something more.

Um, and for, for context, like, the scale: the U.S. currently has about a hundred gigawatts of installed capacity for nuclear power, right? So when you're talking about, we're just going to add, easy, like thirty percent to that, all for AI, that is a tall ask.
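Just to sanity-check that thirty percent figure with some back-of-the-envelope arithmetic (our own, not from the report): five to seven sites at five gigawatts each, against roughly one hundred gigawatts of installed U.S. nuclear capacity.

```python
# Back-of-the-envelope check (our arithmetic, not from the report):
# 5-7 data centers at 5 GW each vs ~100 GW of installed US nuclear capacity.
installed_nuclear_gw = 100  # approximate US installed nuclear capacity
per_site_gw = 5

for num_sites in (5, 7):
    added_gw = num_sites * per_site_gw
    share = added_gw / installed_nuclear_gw
    print(f"{num_sites} sites: +{added_gw} GW ({share:.0%} of installed nuclear)")

# Output:
# 5 sites: +25 GW (25% of installed nuclear)
# 7 sites: +35 GW (35% of installed nuclear)
```

So the plan, if all sites were built, would indeed add something in the range of twenty-five to thirty-five percent of today's installed nuclear fleet, which brackets the "thirty percent" mentioned here.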

It's the sort of thing that happens when there is a national security imperative, when there is something, like, genuinely acknowledged at the executive level about the WMD character of this technology.

I mean, I think this is an easy case to make on the technical merits, but it's just not in the Overton window right now. I think eventually this stuff is going to become a very, um, um, let's say a popular thing to argue for. We're just not there quite yet. By the time it is, though, the problem is the window will have passed, long passed, to build this infrastructure. So

that's a bit of the paradox that we're in here. Um, and somehow, you know, OpenAI is just going to need to find that, that five gigawatts, um, using a mix of other things. And I don't think, I don't think it's all going to be nuclear.

The ability to even create a data center of that size in a given city is not easy, right? The energy grid of the U.S. is not set up to easily enable the creation of these kinds of facilities, and it will be very good to see how OpenAI,

I, I guess with the cooperation of the government, will try to make that happen. Next, there's a story about the U.S. reportedly probing some dealings between TSMC and Huawei. So, uh, TSMC has actually denied the allegations just recently, but, uh, The Information did report that apparently there have been some dealings between TSMC and Huawei that the U.S.

has been looking into. Uh, Jeremy, I think you have more details on this one. Yeah, one of the

things that we've been talking about the last few, few months, I want to say the last two years, is this idea that, yes, you may have export controls that make it illegal, um, for NVIDIA to sell chips, for example, to some Chinese firm. But, uh, these export controls often have gaps that allow, like, a middleman to step in. Um, so they're not necessarily loopholes, it's just that enforcement

is really hard when you have companies that step in as middlemen to collect these chips and, and then kind of resell them into China. Um, we've seen this a lot with, like, Singapore-based companies, and I think there was one Australian one that we talked about.

So this is, like, happening, and it's happening at some scale, too. Um, so, so these, these leaky export controls are a big problem. It seems like what may have happened is TSMC may have failed, this is the claim, may have failed to perform due diligence on orders to, to prevent Huawei from getting chips indirectly through these intermediary companies.

And if that's the case, then, then that could be a real issue for TSMC. I mean, like, look, the leverage situation is really complicated. The Commerce Department, um, is doing this investigation.

There's no clear timeline. What happens if they decide, okay, yeah, you screwed up, um, okay, here's, here's your optionality? Uh, the hard-line option of, like, we're going to cut you off from American companies, is not there. Obviously, TSMC knows that. The Biden

administration knows that, like, we need to keep buying those TSMC chips, it's the only place to get them. But there is a bit of leverage in that the Biden administration is scheduled to provide about seven billion dollars as a subsidy to TSMC to support their, to build their Arizona fab.

And so, yeah, that's a bit of leverage that you can get there. Um, like, do you want to exercise that leverage? Because having that Arizona fab is also critical to the U.S. national security interest. Like, at a certain point, it's a bit of a game of chicken that's shaping up here between the Department of Commerce and TSMC. Um, but, uh, the question, by the way, specifically here is whether TSMC made, um, chips, um, that were used in the Mate 60 smartphone. So, um, you know, not necessarily, well, there is the AI chip story as well, but the Mate 60 was the headline thing that was announced when, I think, Gina Raimondo, the Commerce Secretary, was visiting China, kind of, like, rubbing it in her face, um, by, by Huawei. So there you go, a lot of inside baseball, and I, I, I think a really interesting leverage question that the Commerce Department is going to have to answer.

And on to the very last story, once again dealing with hardware: TikTok owner ByteDance taps TSMC to make its own AI GPUs to stop relying on NVIDIA. So this, uh, was a surprise to me. I, I didn't think this was a possible development, but reportedly they are developing two AI GPUs that they say will enter mass production by 2026.

Uh, and ByteDance is not, you know, a hardware company. They are a software company, and they say that these are designed by Broadcom. And they want to get these into production because they have already spent two billion, more than two billion dollars, on over two hundred thousand NVIDIA H20 GPUs, and they want even more. So that's why they are looking to actually manufacture their own.

Yeah, I think this is, this is kind of a weird one. I'm still trying to unpack this, this sort of story. So, but, um, so, so this is weird. You're absolutely right to be like, oh, weird, like, ByteDance is contracting

TSMC. You know, like, what the hell is going on here? How is this allowed?

And we should just note, ByteDance is TikTok's owner, to call that out.

Yeah, yeah, that's right. Um, yes. So the big question is, you know, so, um, right now the U.S. is trying to prevent China from being able to access, like, leading-edge nodes. This means basically fabrication processes that allow you to design chips down to very, very high resolution. Right now, the leading nodes at TSMC are, like, the three-nanometer node, and, and that's being used for iPhones, and the five-nanometer node, and, and the four-nanometer node, which is really secretly kind of just the five-nanometer process in the background. Um, these things are, uh, are, are the ones that are being used for, for AI GPUs.

Now, I was surprised to find that actually ByteDance is looking to use the, um, the, the, the four- and five-nanometer process nodes to build this new chip. That, like, alone is kind of interesting and, and troubling, cause for a certain degree of interest, I would guess, on the U.S. government side. Um, my suspicion is that these GPUs are going to have to, um, be designed, they're going to have to be designed such that they do not cross the export control ceiling that roughly is defined by the H20 chips.

So this may just be a matter of fab capacity, um, that ByteDance can't get all the capacity it needs by just going to SMIC, the, um, China's domestic, uh, kind of crappier version of TSMC. Um, so they're having to outsource the production to TSMC itself. And as long as the capabilities of the chips are low enough, TSMC perhaps doesn't get into trouble with the Commerce Department and the U.S.

government at large for, for supplying them. Um, I think it's a really interesting question. The other piece, by the way, is, uh, right now, because ByteDance is running on NVIDIA hardware, they're using CUDA, right? They're stuck in the NVIDIA software ecosystem.

So if they are going to go with their own GPU, um, or at least a GPU designed by, by Broadcom, they're going to have to develop a software platform for themselves, and, and then somehow make sure that their software stack is compatible with their, their hardware stack. So I think it's just really interesting. I'm, yeah, I'm going to be digging into this this coming week and, and hopefully have more to, to share if it comes up next week. But, um, yeah, very interesting from a, a U.S. government, uh, perspective, like, how this would be viewed.
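To give a flavor of that software lock-in, here is a minimal sketch, assuming PyTorch: everyday training code is written against NVIDIA's CUDA backend, and a custom chip only becomes usable once an equivalent backend exists underneath it. The `custom_npu` device name below is made up for illustration.

```python
import torch

# Typical training code today is implicitly tied to NVIDIA's stack: the
# "cuda" device maps onto CUDA kernels, cuDNN, NCCL, and friends.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(8, 512, device=device)
layer = torch.nn.Linear(512, 512).to(device)
y = layer(x)
print(y.shape)  # torch.Size([8, 512])

# For a home-grown GPU, lines like the above only work after building a whole
# software platform underneath: a compiler, kernels, and a framework backend
# (PyTorch does support out-of-tree accelerator backends, e.g. via its
# PrivateUse1 mechanism), so that something like the line below could run.
# device = torch.device("custom_npu")  # hypothetical device for a custom chip
```

That is the scale of the task being described here: it's not just taping out a chip, it's recreating enough of the CUDA-era software stack that existing model code can actually run on it.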

And with that, we are done, and as I predicted, we went over our usual, uh, time mark. But of course, great to have you back, Jeremy, talking more about safety and hardware. And thank you, listeners, for listening, especially if you did stick around until the end of this one.

As always, we appreciate you sharing the podcast, reviewing, commenting, all those nice things, and more than anything, we do appreciate it if you stick around and keep listening. So tune in next week; we will be back again with Jeremy, hopefully, for the coming months, we'll see. And do enjoy the full version of our episode song.

[Outro: the full version of the episode song plays.]