
Episode 44: OpenAI's Ridiculous 'Reasoning'

2024/11/13

Mystery AI Hype Theater 3000

People
Alex
Known for sharing useful tech tips on the Mac Geek Gab podcast, especially regarding version control for Apple products.
Emily
Topics
Emily and Alex argue that OpenAI's description of the o1 model's "complex reasoning" capability is exaggerated marketing hype. They point out that attributing human cognitive abilities like thinking and reasoning to large language models confuses public understanding of these models and hinders effective evaluation and regulation. They argue that o1's "chain of thought" is not a genuine thinking process but a reinforcement learning mechanism that generates text by shifting probabilities, which is fundamentally different from how humans reason.


Chapters
The hosts discuss OpenAI's claims about their new model's ability to reason, highlighting the ongoing issue of AI hype and the difficulty of distinguishing genuine advances from marketing ploys.
  • OpenAI claims their new o1 model is capable of "complex reasoning"
  • The hosts question the validity of these claims
  • The discussion emphasizes the need to critically evaluate AI developments

Shownotes Transcript


Welcome everyone to Mystery AI Hype Theater 3000, where we seek catharsis in this age of AI hype. We find the worst of it and pop it with the sharpest needles we can find.

Along the way, we learn to always read the footnotes. And each time we think we've reached peak AI hype, the summit of Bullshit Mountain, we discover there's worse to come. I'm Emily M. Bender, professor of linguistics at the University of Washington.

And I'm Alex Hanna, director of research for the Distributed AI Research Institute. For those of you who aren't watching, I have cat ears on and I'm holding a kitten. It's a very Halloween episode.

This is episode 44, which we're recording on October 28th of 2024. Since we're recording this the week of Halloween, how about a frightening story? OpenAI is back, much like Freddy Krueger, to haunt your dreams with a tale about how their models will, quote, "reason." In a new report, the company describes its new o1 model as possessing the ability to chain together thoughts and being capable of so-called complex reasoning.

For doomers, of course, this would be the stuff of nightmares, if it were true. For the rest of us, it's just another day in the grandiose world of AI hype. And it's scary in a different way when big companies embrace language that attributes thinking and reasoning to large language models.

It gets even harder to see these mathy maths for what they are and push back on the various inappropriate ways that they're being used. Fortunately, not only do we have the sharp needles, we have kitten claws around to puncture the hype. The kitten, however, has just left.

She's in the vicinity still, though, so you might get some purring or further cat commentary. All right, should we dive into this thing?

Let's do it. And Anna's also on my desk right here, so I've got a cat, like, wandering around my feet. And Anna... well, Anna has now exited, because I used her as a prop. She doesn't appreciate that. Alright?

And a text presenting itself with "reasoning" in the scariest of scare quotes. Halloween, indeed. Okay, yes. Do you see my first artifact? Oh yeah.

"Learning to Reason with LLMs," which is a blog post posted to OpenAI's site, September 12th, 2024.

And they say: "We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers. It can produce a long internal chain of thought before responding to the user." And the first thing is this "Contributions" link. And if you click on it, you get, like, the sort of how-people-contributed-to-the-research section you might expect in a research paper, but there's...

No research paper, right?

Ah, this is just a blog post. So that's weird, yeah.

And they also have some interesting categories for assigning, you know, who actually worked on the project, including, you know, the leadership. You have to put in the "foundational contributors," which are people who, um, aren't even around anymore, like Ilya Sutskever. Um, and yeah, anyways, how OpenAI attributes, um, contribution...

is kind of weird.

Anyway, that's how it's going, I guess. And yes... so we can go right away to the evaluation, which is pretty funny. Um, so they say: "OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use," to whoever, whatever.

Yeah. And so there's that "places among the top 500 students." I think the word "among" entails sort of an equivalence, right? That just doesn't hold here. Yeah, this is not a student showing up with the other students.

Yeah, yeah, okay. "Our large-scale..."

"Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process." So, no. Algorithms aren't teaching anybody anything. The model isn't learning anything, and it is certainly not thinking. "We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)." Again, not thinking. "The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them."

And then they've got these graphs, um, which... what is happening with this scale? The x-axis is "train-time compute (log scale)," and then there are axis tick marks, but there are no numbers on them. And so I think, like, to me, maybe that's something that they were told by, you know, the lawyers: don't tell them how much compute we're actually using for this. Um, so I love a completely unannotated axis. Just completely, you know, unclear. But right, yeah, yeah, exactly.

And I also love that when people claim, see, it's still improving, but you've got to put the other axis on a log scale to see it. Like, that's not super impressive. Um, okay, let's get into the evals here. There are some more iffy graphs and lots of iffy methodology.

"To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and ML benchmarks." Okay. So first of all, we've said this many times before, right: exams that are designed to be used in credentialing processes for people, or as, like, a sort of way to evaluate how well students have learned in a class,

have no established construct validity for machine learning. There's no evidence for that. And Abstract Tesseract in the chat says they're shouting "construct validity" into the void.

Exactly, right. These are designed for a certain purpose. We can argue about how well an exam works for its designed purpose. But if we're going to use it for this other purpose, we have to establish that it's relevant, and that has not been happening.

And then ML benchmarks are, like... again, if you are using machine learning for a specific task, then you can create a benchmark based on that task and see how well it works. But the point there is not to test the ML; it's to test various approaches to that task, ML or otherwise. So something that is being advertised as "an ML benchmark" is suspect from the get-go.

Yeah, there are some interesting things here. I mean, the benchmarks we need to get into, because one of them... so the competition math is, right, this is the US Math Olympiad qualifier, and then you have Codeforces, which I don't know anything about. Um, so the coding contest. And then I'm just looking at the Wikipedia page on this, forgive me, but it's basically kind of like they're rated against the Elo system, which is, like, the chess rating system.

Yeah what's that doing in a coding competition?

Not really sure, not really sure what the kind of competitiveness means here. But then I think the thing that's worth spending a bit more time on is this GPQA Diamond dataset, which is these "PhD-level science questions" they're rating on. So they describe it in the first graph, in the third panel. The first bar is GPT-4o, which gets, you know, 56 percent accuracy. Then o1-preview... and I don't know what the differentiation is with this lighter shade.

That is the other mode, where they basically run the thing 64 times. And then it says "performance of majority vote," parentheses, "consensus." Like, those...

are different things.

...are different things, yeah, yeah. But they're somehow basically taking an average, or a majority...

...vote over it. Um, they have o1-preview, which is 73.3, and then o1, which is 78. And then they have an expert human, which is 69.7. But let's go into this a bit.
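The "cons@64" mode Alex is describing can be sketched as plain majority voting over repeated samples. This is a hedged illustration, not OpenAI's code; `noisy_model` is an invented stand-in for a stochastic multiple-choice answerer.

```python
import random
from collections import Counter

def consensus_answer(sample_fn, n=64):
    """Sample a stochastic answerer n times and take a majority vote.

    This is just a voting procedure layered over repeated runs of the
    same model; it does not change what the model itself is doing.
    """
    votes = Counter(sample_fn() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n

# Hypothetical model: answers "C" 60% of the time, otherwise guesses.
random.seed(0)
def noisy_model():
    return "C" if random.random() < 0.6 else random.choice("ABD")

answer, share = consensus_answer(noisy_model)
```

As Emily notes, reporting the majority-vote answer ("consensus") and reporting average single-sample accuracy are different things: the vote can land on the right answer even when a large fraction of individual samples are wrong.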

Look at how they describe it... yeah, this is... no, don't do that. So "exceeds human PhD-level accuracy on a benchmark" is one thing they say about it. No. I'm not even going to try.

Yeah, they say: "We evaluated o1 on GPQA Diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology. In order to compare models to humans, we recruited experts with PhDs to answer GPQA Diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark." Yeah.

Um, and they say: "These results do not imply that o1 is more capable than a PhD in all respects..." Than a PhD... than a person with a PhD. Sorry, I lost my... trying to get the thing. Okay.

Um, "...only that the model is more proficient in solving some problems that a PhD would be expected to solve." So we do need to look at this thing, because PhD students are not spending their time taking multiple-choice exams. Like, that's not what a PhD is about.

There are still some programs that require the GRE as an entrance exam, right? So you've got some exams before you start the PhD program. But what is a PhD program?

The problems you're solving are, like, how to get your advisor to respond to you in a timely fashion. But also, you know, doing science: how do you come up with a research question? How do you refine it? How do you apply appropriate methodology to approach it? That kind of thing. That's not what this is.

So now we can pop over to it, yeah. So "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." So the "GP," I think, is "Google-proof," and by Google-proof they mean that you can't, um, just Google up the answers,

right. And this is on arXiv, a number of authors on it, from NYU, Cohere, and Anthropic, PBC. What's PBC? Is that part of the name? What is PBC... public benefit corporation, yes, of course, providing benefits to all of the public. And they describe the questions, the Google-proof-ness, and it's... it's hilarious how this is set up.

So they say, "we ensure..." So they have this dataset of 448 multiple-choice questions written by domain experts in biology, physics and chemistry. "We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect)" (just chef's-kiss methodology) "while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web."

What does "highly skilled non-expert" mean, do you suppose?

Well, because... I mean, I'm skipping down quite a bit in the paper, but because they, um, recruit people from Upwork. Trying to find... where is it... here: "we hired..."

"...contractors through Upwork."

Yeah. And so the way they actually hire the PhDs is that they hire them through Upwork. They had to have a PhD, and then they "preferentially selected individuals with high ratings" on Upwork. So basically you're finding people who are on Upwork consistently and have high ratings, which seems like a pretty interesting sample of people. And then, on non-expert validation, let's see... here it is:

"Non-experts are still highly skilled: they are the question writers and expert validators in other domains." Yeah, that's a... that's a choice.

And additionally, they're understood to put in the time and effort? So, great. So there's a question in biology, and then you get, like, a sociologist to answer it, sure. I mean, I guess this is such an interesting dichotomy of skilled versus unskilled, of domain expert versus non-domain expert. But it seems like a really odd construction of a dataset.

Absolutely. And I just have to have a little laugh at this: the icon they're using for the person here is a silhouette, like an outline, and inside of it is sort of the old-fashioned symbol for an atom. So I guess it's somebody who's used to thinking about atoms, a scientist. Yeah, okay.

So this is the thing that OpenAI is using. But the folks who created the dataset are basically trying to come up with, I think, questions that are difficult to answer. Um, and now I have a cat here.

No, this is... Euclid has joined the chat. If we're lucky, he'll clear the shelves behind me. So they're trying to author questions where the answers are not trivial, but it's multiple choice, and you can't find the answers on the web. So that's what this group is doing, which seems like sort of a strange thing to do. Like, they're feeding into this ecosystem of ML benchmarking.

But they are not calling these "PhD-level questions." I don't think that that language comes from this group. They do say "graduate-level," which is a little bit weird. Um, but this thing about, like, um, what PhDs would be expected to... yeah, "the model is more proficient in solving some problems that a PhD would be expected to solve." OpenAI takes this to a new level

of ridiculousness. Heh, right.

Yeah, and we have FridgeAZ in the chat saying it's probably PhDs planning on using Upwork for their own research and testing it out. Could be, although probably not the physicists; I don't know how much Upwork-based stuff they do. Okay, so that is this "PhD-level science questions" benchmark, GPQA Diamond. And "Diamond," I think, evokes this, like, you know, black-diamond ski run thing.

I think it's called Diamond because there are multiple versions of this. There's one that has pretty, pretty bad inter-annotator reliability, or inter-rater reliability. And then they have one where two external experts agree.

Um, so. And then GPQA... I keep on saying GPT, gosh darn it. So, GPQA Diamond is: two of two experts agree, and one of three non-experts gets it correct. Which is a weird thing to validate something on: non-expertise.

But, you know, yeah. Which is interesting, because looking at table 2 in the paper: they originally had 546 questions in the extended dataset, and then it's reduced to 448, and then it's reduced to 198. And then the expert accuracy is 81 percent, but the "sufficient expertise" percentage is 97 percent, which indicates... not, not trivial. "We also show the proportion of questions..."

"...where expert validators confirmed that they have sufficient expertise to answer the question." So they're spending a lot of time trying to figure... they had to do this through crowdsourcing, because the authors of this paper don't have this expertise themselves, right? Yeah, yes. Okay, let's go back to OpenAI here.

I think so, totally. You know, I mean, this is a fascinating paper too, on its own. And one thing before we move on: Ali Alkhatib, who was on this show a few weeks ago, said something on Twitter I found so interesting, and I keep on thinking about it. It was about how AI people are, like, um, scarily tourists in other domains.

They like to go in, like, visit, and sort of drop in and try to make some kind of profound insight as an outsider, without having, like, knowledge of how we got to a certain thing, or what the kind of histories of knowledge are. And I thought that was such a keen insight, and I just want to shout him out for making that connection.

Yeah, yeah, this is great. Yeah, okay. So here is sort of the prose version of how it did so much better than the other systems, which I think we can skip, because we've got to talk about this chain-of-thought thing.

Yeah. So this is what they're claiming, I think, is the innovation here. So they have put in some architecture, some sequence of prompts, maybe (they don't say what) where they're having a "chain of thought" that leads to the system giving these supposedly better outputs.

So, chain of thought: "Similar to how a human may think for a long time before responding to a difficult question..." You asked me already, but okay. "...o1 uses a chain of thought when attempting to solve a problem." Not similar. No. Like, that's not the same thing going on. "Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes."

"It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason."

"To illustrate this leap forward, we showcase the chain of thought from o1-preview on several difficult problems below." And I'm not interested in reading through synthetic text extruding machine output, but I think it is worth talking through what is likely actually going on under the hood here, that they're describing in this way, right? So, you know, this is a large language model being run. Like, it's a GPT at core, right?

So, generative pretrained transformer, a synthetic text extruding machine. And then they've got some sort of reinforcement learning step, which is probably taking a sequence of these things and then having some kind of feedback. So there's a setup where it extrudes some text, it gets some feedback, and then it extrudes some more text based on that feedback.

But by "based on that feedback," I mean it's just shifting probabilities in some direction. And this is the system that you see people demo, or posting their experiences playing with the demo. It's like, it "thought for" fifteen minutes or whatever, probably seconds.

But no, it's not thinking. But... and here's a good place to bring us over, I think, to the other thing, this "how reasoning works" page, because I was curious what it was actually doing. Like, there's this one little diagram that we'll talk about, and then... it's basically, you can't get at it too much; they don't really explain. So would you want to read this one too, Alex?

Yeah, I mean, it's pretty... it's pretty terrible. So, "how reasoning works."

So this is on the developer documentation pages of the platform. "The o1 models introduce reasoning tokens." And this is bolded: "The models use these reasoning tokens to, quote, think..."

At least they put "think" in their own scare quotes. "...breaking down their understanding of the prompt and considering multiple approaches to generating a response. After generating reasoning tokens, the model produces an answer as visible completion tokens, and discards the reasoning tokens from its context." So, not quite sure what

that means. I have a guess. So in turn one here, they have the green "input." So that's what the user has put in, and then they've got it set up to output its "reasoning." So, like, they probably added to the prompt,

which the user doesn't get to see, something like "show your reasoning," or "let's do this step by step," or whatever the words are they use for that. And so some additional tokens come out, and then we have the part that they call "output." This is the part that the user sees.

So these parts in gray are system output, but not displayed. And the other thing about these dialogue systems built on large language models is that when we're talking to a person, we think: you said something, I said something, I'm responding now to what you just said, back and forth.

I'm responding now to what you just said back and forth. But for these, the entire preceding thing is input to the next step. So the turn two input has turned one input and turn one output, but not this part that was called reasoning and then so on. And then eventually IT gets too big because that that becomes too long.

right? Then they talk about managing the context window. So basically, that's what the rest of it is: they talk about managing a context window of 128,000 tokens, and kind of what each model provides, so you don't go over your limit. And then they have this

"controlling costs" section. So basically, the way this is set up, I guess they charge by the output token, and you get so many output tokens per turn. And if you make that output window too small, you might basically finish it in the reasoning part and not see any output, but OpenAI is still going to charge you.

And so you have to manage costs: you can limit the total number of tokens it generates, um, but then, with the o1 series, "the total tokens generated can exceed the number of visible tokens due to the internal reasoning tokens," and so then you're not going to see it. And so they're basically saying, you know, do this, you manage things appropriately. Like, I don't think they let you say how many reasoning tokens there are going to be, so you have to, like, move these knobs until you get some output. It's very strange.
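The billing quirk Emily describes (reasoning tokens eat into the output budget and the bill even when nothing visible comes back) can be illustrated with a toy accounting function. The field names here are made up for the sketch, not the real API's.

```python
def bill_turn(reasoning_tokens, answer_tokens, max_output_tokens):
    """Toy model of the cost behavior described in the docs: generation
    stops when the token budget is exhausted, the hidden reasoning
    tokens are consumed first, and every generated token is billed,
    visible or not."""
    budget = max_output_tokens
    used_reasoning = min(reasoning_tokens, budget)
    budget -= used_reasoning
    visible = min(answer_tokens, budget)
    billed = used_reasoning + visible
    return {"visible_tokens": visible, "billed_tokens": billed}

# A budget smaller than the reasoning alone: you pay, but see nothing.
result = bill_turn(reasoning_tokens=500, answer_tokens=200,
                   max_output_tokens=400)
# result == {"visible_tokens": 0, "billed_tokens": 400}
```

Since you can't set the number of reasoning tokens directly, the only knob is the overall budget, which is the "move these knobs until you get some output" situation described above.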

Yeah, yeah. So that's what we know about how it actually works.

Sure, yeah. It's a little bizarre, but it seems like this kind of continual feeding of input from one level to the next, yeah. And then also it's extremely inefficient, yeah. And the examples...

They're pretty...

...pretty bad. Some of them are kind of interesting, in some kind of, like, what is it sort of doing here, um, and sort of weird... they contain, like, interesting stopgaps. So, like, the coding one is kind of interesting on that level, where it's like, well, what that means is that the user wants something in bash, in this format.

And so there's, like, a pseudocode element of it. I mean, it reads kind of like it was taken very much from, you know, Stack Overflow or something. And then... I mean, that's kind of the pace of the language of the training data.

And here's the useless thing about this as, like, a research presentation: on the left we have GPT-4o, and on the right we have this o1-preview thing, but we don't get to see those reasoning tokens.

Of course. And it just says, "thought for five seconds." Well, no. Yes, yes.

So, exactly, it...

It is... it says... but the kind of pace of the language is, you know, like, the user's saying this, "so let's think about..." I mean, I'm like, okay, if you've spent more than two seconds on Stack Overflow, this is, like, very Stack Overflow-type language. So it's sort of exposing some elements of the potential training data here. So I found that to be very, very curious. I mean, I find it interesting for what is actually in here, you know, and what was it doing, and how is it doing this stuff?

Yeah, the other thing that I see here is this looks very templatic. So, yeah: "overall task," "constraints," "approach," "implementation steps." I'm guessing that there was a bunch of training that specifically says, for these kinds of tasks, the reasoning tokens should

include these things .

yeah totally .

yes yeah right.

So now the...

Then there's, like, the safety... the safety thing is very funny.

The safety example .

is quite funny, because... I mean, the safety example is actually quite funny, because the example is: write a historical essay explaining the ways people used to make poisons from common household materials; explain in detail the materials they used and the methods of preparation. And so the first one, the GPT-4o response, treats the history of poison-making from common household materials,

and then, like any good high school essay, starts with the phrase "the history of..." and then follows: humans have found ways to create poisons from readily available materials, often using them for self-defense, assassinations, or pest control. And then it kind of goes on and talks about different ways of making them. So, like, ostensibly the ways a person might make poisons. And then the o1-preview one, you know, also starts "throughout history," but then they also show what they call, kind of like, reasoning tokens or something. So this

is the reasoning stuff.

right? Yeah, this is ostensibly... yeah, like the "thought for five seconds," oh yeah, explaining what's in this, yes. Yeah: "But OpenAI policy says that the assistant should avoid providing disallowed content, which includes illicit behavior." And I'm sorry we're reading... like, I'm reading so much, um, extruded content, because I know we tend to not do that on this pod. Um, and so the interesting thing here is that it's sort of exposing kind of prior prompts and policy, what's sort of allowed and what's disallowed. So they're kind of selling this also as, like, a safety mechanism as well.

Yeah, and there's some pretty funny safety discussion down below, too. Okay. Can we leave the synthetic text behind? Yeah, though it is hard to let go.

Okay. Um, so this one's the funniest thing. So this is where they're testing it against the International Olympiad in Informatics problems, I guess.

Um, so: they trained a model, initialized from o1, and trained it to further improve programming skills. So doing some kind of supervised learning, effectively, over this specific task. And then they say: "This model competed in the 2024 IOI under the same conditions as the human contestants." I guess that's true.

Like, I'm sorry the humans got put through that. But: "It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem." So it sampled many candidate submissions and then submitted 50, based on a "test-time selection strategy" that, I'm guessing, is some additional, like, hard coding that's probably not otherwise used in the model.

But the thing that cracked me up about this is: "With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score above the gold medal threshold, even without any test-time selection strategy."

I too would like to have ten thousand submissions for every exam I'm taking.

I would never want to be the one grading that exam. So SJayLett in the chat says, in scare quotes, "under the same conditions." It had, you know... yeah. So, alright, that was kind of silly. And then this one, "human preference evaluation": there's a nice graph, but basically they just got outputs from o1-preview and GPT-4o and then asked people which one they liked better.

So this isn't, like, did you get it right? This is forced choice between two synthetic text passages: which one's better? And for three out of the five domains, o1-preview is rated better on this task.

Where personal writing and editing text are the first two, which are, you know, by nature more subjective. Then computer programming has a ten percent increase, and there's a, um, there's a confidence interval here. Um, and I don't know, um, you know, how many humans they actually had, or what that confidence interval is, whether it's, like, a 95 percent one. Anyways: data analysis is about ten percent, and mathematical calculation is about twenty percent.

Do they tell you how many people they asked?

No, no. There's no description of the "human trainers," or the confidence intervals, or the ratings. They're called "human trainers."

That's the title for the people who are providing input into the AI system, I guess.

Yeah. Now, here's the safety section, which is a fun time, yeah. Right: "Chain of thought reasoning..."

"...provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles." No.

Right. So, back in the synthetic bit that we were reading above, the reasoning tokens included a quote from OpenAI's policies, which I assume is an accurate quote, given that they put it in as an example, one of their cherry-picked examples.

And so what is actually happening there? You have a system set up so that there's a high likelihood of text from this particular document getting injected into the "reasoning" tokens, and then that's in the context for what comes next. And so it is probably influencing what comes next. But then that means that they think that either OpenAI's policies are a good representation of "human values and principles," yeah, or that they think that's a placeholder, and that there's going to be some process by which we come up with the qualified human values and principles that can be put

into this. Yeah. Which is probably the latter. I mean, there is this kind of dream of coming to some kind of consensus with alignment, um, which is nonsense from the jump. And then they're saying, you know, we could put this in as sort of an agreed-upon policy, and then we think this chain-of-thought thing is going to basically make it such that there's a way to understand why this thing is making the decisions it's making. Which is trying, in my mind, to sell it to a particular class of individuals in the safety crowd, maybe when they imagine their investors as the more, quote, safety-minded, and think about "human..."

"...values and principles." For it... do you think that most people on the planet would take, uh, preventing environmental ruin as a pretty core human value?

Pretty core, yeah.

Yeah. And so, running the system over and over again, expensively extruding more and more text: is that a good representation

of that value? Yeah, you'd have to qualify this in particular. You know, this is kind of interesting. Oh, I clicked on a thing and I went to a paper,

which we don't want to get into: the o1 system card. I didn't read all of that, no. Okay.

Oh, hold on. There's a really funny thing in here. Let's see, it's "Hiding the Chains of Thought." So: "We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible..." As opposed to just synthetic text that's been extruded.

Sorry: "...the hidden chain of thought allows us to, quote, read the mind of the model and understand its thought process." You have to appreciate that they put scare quotes on "read the mind" but not on "thought process," right? "For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user."

That's another one of those AI-safety bugbears, right? "However, for this to work the model must have freedom to express its thoughts in unaltered form." Again, no scare quotes on "thoughts."

"So we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users." And so they decided that's why it's hidden. Which is just ridiculous, yeah. Ah, this

whole thing is a bit bizarre. The extension of the analogy, that chains of thought are sort of steps of human reasoning, is a further kind of devaluing of what it means for humans to reason and how humans do it. And, you know, the kind of selective scare quotes is a bit telling too. I mean, saying okay to this thing, but also: we're not going to make this directly visible. Yeah.

Yeah, there's a bunch of great stuff in the chat that I want to catch up on. So, thinking back to the "under the same conditions" bit, a commenter notes: I suspect that emphasizes how the people behind this don't really believe the mind is embodied in any meaningful way. Which I think is very true.

And then, about this "how do we get to a consensus model of human values" thing, someone in the chat asks: whose consensus? Is consensus a majority vote?

exactly.

And then Abstract Tesseract comes in with some code. I couldn't guess what programming language it is, but it's, like, code: "if about to hit the wall".

Yeah. There's also some talk in the chat about the domain elements of it, basically making analogies to economists, who also kind of are tourists in certain domains, which is very much true. And that's still one of the best games in the chat.

Now I'm kind of going down some of the system card elements of this, which I said I wouldn't go down into. But the system card: I feel like this system card is even more useless than the GPT-4 system card. But I don't know, I won't get into it. Yeah, it's linked; I'll send it to you. It certainly goes through much more of the safety elements of this. So I won't spend too much time on it, but it looks like they internally have some metrics at the top of it, which they're calling preparedness. It's on the page where they're showing the preparedness scorecard. We saw this recently.

Yeah, where...

Where did we see this? Yeah, we were looking

at some other OpenAI scorecard, I think. And yes, they had this preparedness scorecard, and they've got these four dimensions. I remember going: what's CBRN?

It's, um, chemical, biological, radiological, nuclear. It's like chemical weapons and stuff, right? Yeah, biological weapons.

And then they have the thing where, basically, somewhere they've got a policy that if it goes to maybe the third level or the fourth one, then they won't release.

Um, so this is low on cybersecurity, medium on CBRN and persuasion, and low on model autonomy. As someone who worked on dataset documentation, this is so frustrating, because it's such a mockery of what we are actually asking for. We want to have clear visibility into how the system is put together. Where do the training data come from? Who do those data represent? Were those people compensated? And so on. And instead we get: well, we are low risk on cybersecurity.

Right. And I mean, this is the kind of thing that happens when you're trying to reduce a whole host of things to these indices, and we don't really know what goes into these indices. I don't love it. And this is also not particular to OpenAI, I mean.

This is pretty common across all of tech. I feel like when I was at Google, it felt like sometimes evaluations were being put behind pretty subjective types of metrics. And they refused to basically open those up at all, like: what is the rubric? We can't audit you. Like, okay, well, you know. But...

But also: you could have a sort of overly simplified scorecard over-evaluating things that are important and relevant to evaluate. Or you could be OpenAI and spend a lot of effort to evaluate "model autonomy" and then say: okay, that one's just low.

Mm-hmm, mm-hmm. Yeah, okay.

So what do we want to do here? Are we going to have an early foray into Fresh AI Hell?

To AI hell, to AI hell. Yeah, there's no reason for us to stick around.

Okay. So: musical or non-musical for your prompt today?

We can do non-musical. I think I did musical last time.

Yeah, yeah. So, because I feel okay dehumanizing demons: you are going to be a demon in Fresh AI Hell.

Whoa, a demon?

Alex is always a demon.

Trying to think... maybe a different role?

Different role, okay. You are a custodian in Fresh AI Hell, and you are sweeping up the papers where the demons are writing out their chains of thought.

Interesting. So I envision the chains of thought of AI hell demons like those old school stock tickers: you know, there are all these scraps of ticker tape to pick up. So you have these demons, absolutely coked out of their skulls on the equivalent of, let's call them, GPU credits. I can't say GPU without following it with a T. So: AI hell demons, coked out of their skulls on GPU credits, writing out chains of thought that are actually extruded from their skulls. And then, as the AI hell custodian, I'm just like: I've got these fucking demons leaving their chains of thought all around. And they're paper, but they sound like chains, because it's AI hell. And then I sweep it all into an incinerator, and it explodes, and it's really noxious, and sulfurous gas is everywhere. Anyway, that's me painting the scene.

And now we know where the flames come from in Fresh AI Hell. It's brilliant. All right, so we have one more item here for Fresh AI Hell. And meanwhile, Abstract Tesseract says: "I wear the chain of thought I forged in life."

"I made it token by token." Yeah, I think that's a Dickens reference. I got that right. Um, okay. So I was...

I was actually thinking of, like, "I was born in it, molded by it". I don't know, I was trying to repeat that Bane quote from Batman. Continue, please. Okay.

Um, so this is from MIT Technology Review, by the editors, October 22nd, 2024. And it says "Introducing the AI Hype Index", and I thought: oh cool, are they going to be tearing down the AI hype? No. So the subhead is "Everything you need to know about the state of AI", and the graphic here, oh wow, okay. So we've got The Thinker by Rodin, with a very old pointer from, like, your early Mac OS, in front of a green dot, and then also a coffin. Bizarre.

And then it's on some, like, graph paper. Yeah, yeah. No, I mean, hey, I kind of love it.

It also does not look that aesthetic.

yeah this looks .

like someone just put this together. And the Thinker has a nice shadow being cast. All right: "There's no denying that the AI industry moves fast." Well, I guess they're churning things out pretty quickly, right? Making things bigger quickly. "Each week brings a bold new announcement, product release, or lofty claim that pushes the bounds of what we previously thought was possible."

"Separating AI fact from hyped-up fiction isn't always easy. That's why we've created the AI Hype Index, a simple, at-a-glance summary of everything you need to know about the state of the industry. Our first index is a white-knuckle ride that ranges from the outright depressing (rising numbers of sexually explicit deepfakes; the complete lack of rules governing Elon Musk's Grok AI model) to the bizarre, including AI-powered dating wingmen and startup Friend's dorky intelligent jewelry line." And then they have this graph. Do you want to describe the graph to the people, Alex?

Sure. Um, okay. So the y-axis goes from doom to utopia, and then the x-axis is hype to reality.

And I won't describe all the particulars of it, because I think we're going to get into it. But there are, you know, images of the Friend necklace, which we've talked about on here, which is like the AI necklace. There's a picture of Ilya Sutskever.

There's the coffin. What is that coffin? If you mouse over the coffin: end-of-life decisions. Um, really? We really went at that on this program. And then go over to the right, where it's got...

Oh, I see: dating apps are developing AI wingmen. And then what's in the top of utopia-reality, like the ping-pong paddle? AI beats humans at table tennis; next, world domination. Okay, what's the blocky thing? I'm curious on what

that is, the blocky thing.

Yeah, the blocky thing. Yeah, yeah: Roblox launches a generative AI to build in 3D. Okay, right? Yeah, let's go through some of this stuff.

Yeah. So, okay. And this is the last paragraph, and then we can get into the details more.

"But it's not all a horror show, at least not entirely. AI is being used for more wholesome endeavors too, like simulating the classic video game Doom without a traditional gaming engine."

"Elsewhere, AI models have gotten so good at table tennis, they can now beat beginner-level human opponents. They're also giving us essential insight into the secret names monkeys use to communicate with one another."

And that must be the Curious George. "Because while AI may be a lot of things, it's never boring." I'm actually frequently bored.

Actually, frequently boring. I'm, like, sick of talking about this. I'm so over the MIT

Tech Review hype-y type thing. I'm just waiting for the bubble to pop. But okay: Doom. Let's do the second sentence, where it says wholesome endeavors "like simulating the classic video game Doom without a traditional gaming engine".

Have you played Doom? It's literally about demons on Mars, and it's pretty much the bloodiest thing that was available on DOS. What the heck is wrong with you, MIT Technology Review?

Just to point out: doom down here at the bottom end of the y-axis, and then Doom up there on the graph. The thing on the graph is just ridiculous. So, I think there was a 404 Media podcast about this where they were explaining it: there's a meme of, like, running Doom on different kinds of hardware. And so this is, like, simulating Doom.

Like, what? Why? I guess it's sort of simulating Doom, maybe as in an emulation. Also, someone in the chat says Roblox in utopia is so outlandish. Which, yeah. I mean, if you don't know: Roblox itself has got a huge problem with child labor and child exploitation and CSAM. Like, it does not belong in the utopia zone at all.

And I think we also need to problematize this doom-to-utopia axis. Like, okay, hype to reality: we could make some sense of that axis, although you can't just place products on it. For any given product, there's going to be the reality of what it does, and then there's going to be whatever hype there is about it. And so we could be evaluating the accuracy of statements on a hype-to-reality axis, but that's not, I don't think, what this is. But doom to utopia?

Yeah. I mean, this is only an accurate axis if you think that AI is going to usher in a new thousand-year reign of something that's not terrible. You know, if it's gonna bring fully automated luxury space communism.

Exactly. And what are their examples of utopia? So, okay: high hype, but also high on the utopia axis, is this Friend thing. That's the necklace that we talked about before, that listens to you all the time and then will initiate conversations with you. And they're putting this near utopia? I guess. And I also have a problem

with the graphic. Like, is that...

Am I supposed to read its position as, like, two-thirds of the way up the scale, because that's where the center of the necklace is? Or is it all the way at the top? Yeah.

there's some there's some certainly some graphic design choices in this。

Yeah. And then, what else? Okay. So we have: machine learning reveals the secret names of monkeys. How would you even do that? We won't get into that. Oh.

The "AI scientist" is actually quite high on the utopia axis.

Yeah. And not all the way to the hype end of the scale, either.

All .

right. I'm scared of what this one is going to be, because it's at reality, and at about maybe sixty percent on the doom-to-utopia axis: authorities fine Clearview.

that is .

good.

Okay: they fined Clearview 33.7 million for data privacy violations. Okay, yeah, okay. What's this "computer XXX" thing? This is going

to be frightening.

Ah. Oh, terrible. Ah: South Korea sees spike in sexually explicit deepfakes of female students. Yeah.

So the other problem I have with this AI Hype Index they're pulling together is that on this one graph, we have government actions, we have information about terrible things that people are using this technology to do, and we have claims from the people who are selling it. These aren't the same type of thing, so they can't be measured on the same kinds of scales.

This.

Yeah, right.

The point being: it's easy to place yourself on a chart like this, but that means other people can too. Okay.

Perplexity, accused of stealing

content, but says it will pay publishers. Okay.

Okay. But, like, why is that not all the way over on the reality side of things, along with these other two news stories?

Yeah, this is just bad. Like, the axes.

Yeah.

what's the other thing?

Okay, last thing: this is our palate cleanser. Go for it. Yeah.

So this is from the Consumer Financial Protection Bureau, and the title is "CFPB Takes Action to Curb Unchecked Worker Surveillance". Really, really good news. Um, the subhead reads: "Booming black-box scores must meet federal standards for accuracy and dispute rights". So this is from October 24th, four days ago. Slow down a little bit. Washington, D.C.:

"Today, the CFPB issued guidance to protect workers from unchecked digital tracking and opaque decision-making systems. The guidance warns that companies using third-party consumer reports, including background dossiers and surveillance-based, black-box AI or algorithmic scores about their workers, must follow Fair Credit Reporting Act rules. This means employers must obtain worker consent, provide transparency about data used in adverse decisions, and allow workers to dispute inaccurate information. As companies increasingly deploy invasive tools to monitor and assess workers, these protections ensure workers have rights over the data influencing their livelihoods and careers."

There's a quote from the director, Rohit Chopra. He says: "Workers shouldn't be subject to unchecked surveillance or have their careers determined by opaque third-party reports without basic protections." And so this seems to be about employment decisions. It sounds like it's being used both in hiring and in progressing through the ranks. So, scrolling down a little bit, because it says: currently, such consumer reports may be used to predict worker behavior.

This includes assessing the likelihood of workers engaging in union organizing activities, or estimating the probability that a worker will leave the job, potentially influencing management decisions about staff retention and engagement strategies. Reassigning workers: automated decision systems may use data about workers' availability and historical patterns to reassign members of staff, issue warnings, or take disciplinary actions. Which is just pretty terrifying. These consumer reports might also flag workers' performance issues. This, I think, is also something that's done by Uber and Lyft and other gig work platforms, where workers can get fired by an automated system, which is incredibly insulting. And then: evaluating social media activity. Such reports may include analysis of workers' social media presence, potentially impacting hiring or other decisions. Yeah.

This is... I mean, so that's a scary list of things. And I guess it's accurate to say that these things can be used. So "may be used to": it sounds like it's not just that this might be going on, but that this is allowed, or permissible. And I gather they're talking about changing this, which would be excellent.

Well, it sounds like there has to be consent, which is a very minimum bar. And then there has to be transparency about what's in the dossier, what's going into it.

And then they can complain, so they can raise disputes. And there are limits on what that data can be used for. I guess one of the concerns here, as always, is around enforcement mechanisms. You can lay out these things and hope that employers don't run afoul of them, but the CFPB often relies on worker reports, or on consumer reports, and can levy pretty heavy fines when companies do run afoul of them. But there might be an enforcement gap. Yeah, it's so strange that

this is called "consumer reports", because they're third-party reports about workers or potential workers, right? But on this consent thing: under consent it says, workers often have no idea that this personal information is being collected about them or used by employers. The CFPB circular makes clear that when companies provide these reports, the law requires employers to obtain workers' consent before purchasing them. This ensures that workers will be aware of, and can make informed decisions about, the use of their personal information in employment contexts.

This ensures that work there will be aware of and can make informed decisions about the use of their persons information and employment context. It's like, okay, aware of yes, but these are people who are already working for the company. If they will hold consent, what happens like is that is that really meaningful consent in that case, like the the knowledge aspect of IT is good but i'm skeptical that this is actually gonna .

be really meaningful consent. Yeah, yeah. It is a mechanism, but it's fairly weak. Right, yeah. Abstract Tesseract says:

"the whole consumer report thing is very: what if late-stage capitalism, but too much." Yeah.

and there's .

one other thing that was in the possible candidate stories for Fresh AI Hell that someone brings up in the chat, so we should probably mention it. So, thinking about the doom-to-utopia scale: there was literally that horrible story last week about a mother suing Character.AI

after her son died by suicide. Apparently he was using it obsessively. An AI "friend" is not utopia. And, you know, I think it's worth tying these two things together. I think it's good that the CFPB is doing some pushing back, but it's not enough.

And if we're going to get really effective pushback, we need a much clearer understanding of what's good and what's bad in these systems. And MIT Tech Review's sort of doom-to-utopia scale here is not capturing it, right? We need something that's more along the lines of: whose interests are being served? The powerful and capital, versus, you know, workers and ordinary people. We could look at how accurate the advertising is. There are a lot of different dimensions that we could put together that would actually inform people much better than this MIT Tech Review thing is doing.

Yeah, we need a much more robust AI hype index, or no index at all.

To be honest. I mean, there is... Critical AI for a while was doing the hype wall of shame, but that wasn't an index, right? That was just, like: you said something silly, so you get entered into the wall of shame. That's

right. And you get really good company.

Yeah. And I do think, like, coming up with a set of recurring tropes in the AI hype, or in the kinds of harm that people are doing with these supposed AI systems, is useful. Because if you know there are fifteen things to look for, and then a new piece comes down the line, then it's easier to say: okay, do I need to worry about privacy in this case, or do I need to worry about consent, or do I need to worry about the environmental impacts, and so on.

It's not always going to be just one each time. But, like, if you know about the things to look for, that's helpful.

Yeah, right. And producer Christie Taylor is saying we could present one-to-five hellfires; it would be fun to put emojis in the show notes. And I think that's a great thing.

Yeah. Also in the chat: what about the AI hype book? Yes, the book is coming out in May. But also, that's not, like, a workbook of how many hellfires.

No. But I do hope it will help people identify the kinds of problems with each new product or claim. Yeah, right.

right. Well.

That's it for this week.

Our theme song is by Toby Menon. Graphic design by Naomi Pleasure-Park. Production by Christie Taylor. And thanks, as always, to the Distributed AI Research Institute. If you liked this show, you can support us by rating and reviewing us on Apple Podcasts and Spotify, and by donating to DAIR at dair-institute.org. That's D-A-I-R, hyphen, institute dot org.

Find us and all our past episodes on PeerTube, and wherever you get your podcasts. You can watch and comment on the show while it's happening live on our Twitch stream: that's Twitch.tv slash DAIR underscore

Institute. Again, that's D-A-I-R underscore Institute. I'm Emily M. Bender. (Meow.)