The fact that it can make valid moves almost always means that it must in some sense have something internally that is accurately modeling the world. I don't like to ascribe intentionality or these kinds of things, but it's doing something that allows it to make these moves knowing what the current board state is and understanding what it's supposed to be doing. Everyone means something different by reasoning.
And so the answer to the question, is that reasoning, depends entirely on what you define as reasoning. And so you find some people who are very much in the world of, I don't think models are smart, I don't think that they're good, they can't solve my problems. And so they say, no, it's not reasoning, because to me, reasoning means, and then they give a definition which excludes language models. And then you ask someone who's very much on the AGI side, you know, language models are going to solve everything, by 2027 they're going to have displaced all human jobs.
You ask them, what is reasoning? And they say reasoning is... Hi, so I'm Nicholas Carlini. I'm a research scientist at Google DeepMind. And I like to try and make models do bad things and understand the security implications of the attacks that we can get on these models. I really enjoy breaking things and have been doing this for a long time. But I'm just very worried that, because they're impressive, we're going to have them applied in all kinds of areas where they ought not be.
And, as a result, the attacks that we have on these things are going to end up with bad security consequences.
MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads. They support all of the latest open-source language models out of the box, like Llama, for example. You can just choose the pricing point, choose the model that you want, and it spins up; it's elastic, it autoscales. You can pay on consumption, essentially, or you can have a model which is always working, or it can be freeze-dried when you're not using it. So what are you waiting for? Go to centml.ai and sign up now.
Tufo Labs is a new AI research lab I'm starting in Zurich. It is funded by PASS Ventures, which is involved in AI as well. We are hiring both chief scientists and deep learning engineers and researchers. And so we are a Swiss version of DeepSeek.
And so a small group of people, very, very motivated, very hardworking. And we try to do some research starting with LLM and o1-style models. We want to investigate, reverse engineer, and explore the techniques ourselves. Nicholas Carlini, welcome to MLST. Thank you. Folks at home, Nicholas won't need any introduction whatsoever. Definitely by far the most famous security researcher in ML, and working at Google. And it's so amazing to have you here for the second time.
Yeah, the first time, yeah, was a nice pandemic one, but no, it was great. Yes, MLST is one of the few projects that survived the pandemic, which is pretty cool. But why don't we kick off then? So do you think we'll ever converge to a state in the future where our systems are insecure and we're just going to learn to live with it? I mean, that's what we do right now, right? In normal security. Like, there is no perfect security for anything. If someone really wanted you...
to have something bad happen on your computer, they would win. There's very little you could do to stop that. We just rely on the fact that probably the government does not want you in particular to have something bad happen. If they decided that, I'm sure that they have something that they could do that they would succeed on. What we can get to is a world where the average person probably can't succeed in most cases.
This is not where we are with machine learning yet. With machine learning, the average person can succeed almost always. So I don't think our objective should be perfection in some sense. But we need to get to somewhere where it's at least the case that a random person off the street can't just really, really easily run some off-the-shelf GitHub code that makes it so that some model does arbitrary bad things in arbitrary settings.
Now, I think getting there is going to be very, very hard. We've tried, especially in Vision, for the last 10 years or something, to get models that are robust. And we've made progress. We've learned a lot. But if you look at the objective metrics, they have not gone up by very much in the last four or five years at all. And this makes it seem somewhat unlikely that we're going to get perfect robustness here in this foreseeable future. But at least...
we can still hope that we can do research and make things better and eventually we'll get there. And I think we will, but it just is going to take a lot of work. So Ilya asked me to ask you this question. Do you ever think in the future that it'll become illegal to hack ML systems? I have no idea. I mean, it's very hard to predict these kinds of things. It's very hard to know, is it already? Especially in the United States, the Computer Fraud and Abuse Act covers who knows what in whatever settings.
I don't know. I think this is a question for the policy and lawyer people. And my view on policy and law is, as long as people are making these decisions coming from a place of what is true in the world, they can make their decisions. The only thing that I try and make comments on here is, like, let's make sure that at least we're making decisions based on what is true and not decisions based on what we think the world should look like. And so, you know, if they base their decisions around the fact that
we can attack these models and various bad things could happen, then... they're more expert at this than me, and they can decide, you know, what they should do. But yeah, I don't know. But in the context of ML security, I mean, a really open-ended question, just to start with. Sure. Can you predict the future? What's going to happen? The future for ML security...
Okay, let me give you a guess. I think the probability of this happening is very small, but it's sort of like the median prediction, I think, in some sense. I think models will remain vulnerable to fairly simple attacks for a very long time, and we will have to find ways of building systems so that we can rely on an unreliable model and still have a system that remains secure.
And what this probably means is we need to figure out a way to design the rest of the world, the thing that operates around the model, so that if it decides that it's going to just randomly classify something completely incorrectly, even if just for random chance alone, the system is not going to go and perform a terribly misguided action, and that you can correct for this. But we're going to have to live in a world where the models remain
very vulnerable for, yeah, I don't know, for the foreseeable future, at least as far as I can see. And...
Especially in machine learning time, five years is an eternity. I have no idea what the world will look like then, whether it's machine learning, language models, or who knows, something else might happen. Language models have only had seven years of real significant progress. Predicting five years out is almost doubling that. So I don't know how the world will look there. But at least as long as we're in the current paradigm, it looks like we're in this world where things are fairly vulnerable.
But then again, language models are only seven years, and we've only been trying to attack them for really two or three. So five years, that's twice as long as we've been trying to attack these language models. Maybe we just figure everything out. Maybe language models are fundamentally different and things aren't this way, but...
My prior just tends to come from the vision models we've been trying to study for 10 years. And at least there, things have proven very hard. And so my expectation is things will be hard, and so we'll have to just rely on building systems that end up working. Amazing. So I've been reading your blog.
And everyone should read his blog because it's really amazing. And actually, when you first put out this article about chess playing, I've cited it on the show about 10 times. So it's really, really interesting. But let me read a bit. By the way, it's called Playing Chess with Large Language Models. You said, until this week, in order to be good at chess, a machine learning model had to be explicitly designed to play games, had to be told explicitly that there was an 8x8 board and that there were different pieces and how each of them moved and what the goal of the game was.
And, you know, it had to be trained with reinforcement learning against itself and then it would win. And you said that this all changed at the time, on a Monday, when OpenAI released GPT-3.5 Turbo Instruct. Can you tell me about that? So what GPT-3.5 Turbo Instruct did, and what other people have since done with open-source models, where you can verify
they're not doing something weird behind the scenes... because I think some people speculate, well, maybe they're just cheating in various ways. But there are open-source models that replicate this now. What you have is a language model that can play chess to a fairly high degree. And yeah, okay, so...
When you first tell someone, I have a machine learning system that can play chess, the immediate reaction you get is, like, why should I care? We had Deep Blue, whatever, 30 years ago that could beat the best humans. Wasn't that some form of, a little bit of, AI at the time? Why should I be at all surprised by the fact that I have some system like this that can play chess? The fundamental difference here, which I think is very interesting, is that
The model was trained on a sequence of moves. So in chess you represent moves, you know, 1. e4 means move the king's pawn to e4, and then you have e5, black responds, and then 2. Nf3, white plays the knight, whatever. You train on these sequences of moves, and then you just say, "6.", language model, do your prediction task. It's just a language model. It is being trained to predict the next token.
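To make that setup concrete, here is a minimal sketch of the kind of prompt being described, assuming the openai and python-chess packages; the opening shown and the parameter values are just illustrative, not taken from the blog post.

```python
# Sketch: feed a PGN-style move prefix to a completion model and check
# whether the continuation it produces is a legal move.
# Assumes the `openai` and `python-chess` packages; the opening is illustrative.
import chess
from openai import OpenAI

client = OpenAI()

# A game prefix in standard algebraic notation, ending mid-game so the
# model has to produce White's sixth move.
prefix = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6."

completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # a completion (non-chat) model
    prompt=prefix,
    max_tokens=5,
    temperature=0,
)
text = completion.choices[0].text.strip()
candidate = text.split()[0] if text else ""

# Replay the prefix with python-chess, then check the suggested move is legal.
board = chess.Board()
for token in prefix.replace(".", " ").split():
    if not token.isdigit():          # skip move numbers, push the moves
        board.push_san(token)
try:
    board.push_san(candidate)
    print(f"model played a legal move: {candidate}")
except ValueError:
    print(f"model played an illegal move: {candidate!r}")
```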
And it can play a move that not only is valid, but also is very high quality. And this is interesting because it means that the model can play moves that accurately... Let's just talk about the valid part first. Valid is interesting in and of itself because...
What is a valid chess move is like a complicated program to write. It's not an easy thing to do to describe what moves are valid in what situations. You can't just be dumping out random characters and stumble upon valid moves. And you have this model that makes valid moves every time. And so...
I don't like talking a lot about what's the model doing internally because I don't think that's all that helpful. I think you just look at the input output behavior of the system as the way to understand these things. But the fact that it can make valid moves almost always means that it must in some sense have something internally that is accurately modeling the world. I don't like to ascribe...
you know, intentionality or these kinds of things. But it's doing something that allows it to make these moves knowing what the current board state is and understanding what it's supposed to be doing. And this by itself, I think, is interesting. And then not only can it do this, it can actually play high-quality moves. And so I think, you know, taken together, it
in some sense tells me that the model has a relatively good understanding of what the actual position looks like. Because, you know, okay, so I play chess at a modest level, like I'm not terrible, I understand, you know, more or less what I should be doing, but if you just gave me a sequence of 40 moves in a row, and then said,
you know, "41.", what's the next move? I could not reconstruct in my mind what the board looked like at that point in time. Somehow the model has figured out a way to do this, having never been told anything about the rules, or even that rules exist. It's reconstructed all of that, and it can put the pieces on the board correctly, in whatever way it does that internally, who knows how that happens, and then it can play the valid move.
It's sort of very interesting that this is something you can do. For me, it changed the way that I think about what models can and can't do, whether it's surface-level statistics or deeper statistics about what's actually going on.
And, I don't know, this is I guess mainly why I think this is an interesting thing about the world. Yeah, we have this weird form of human chauvinism around the abstractness of our understanding, and these artifacts have a surface level of understanding, but it's at such a great scale that at some point it becomes a weird distinction without a difference. But you said something very interesting in the article. You said that the model was not playing to win, right? And
You were talking about, and I've said this on the show, that the models are a reflection of you. So you play like a good chess player and it responds like a good chess player. And it's like that whether you're doing coding or whatever you're doing. And it might even explain some of the differential experiences people have, because you go on LinkedIn and those guys over there clearly aren't getting very good responses out of
out of LLMs, but then folks like yourself, you're using LLMs and you're at sort of the galaxy-brain level where you're pushing the frontier and people don't even know you're using LLMs. So there's a very differential experience. Yeah. Yeah. Okay. So let me explain what I mean when I said that. So if you take a given chess board, you can find multiple ways of reaching that position.
You know, you could take a board that happened because of a normal game between two chess grandmasters, and you can find a sequence of absurd moves that no one would ever play that actually brings you to the board state. You know, so what you do is like piece by piece. You say, well, the knight is on g3. So what I'm going to do is I'm just going to first move the white knight just to whatever random spot and put it on g3. Okay, and now I know the bishop is on, you know, whatever, you know, h2. And I'll find a way of moving the pawn out of the way and then putting the bishop on h2.
and you can come up with a sequence of absurd moves that ends up in the correct board state, and then you could ask the model, "Now play a move." Okay, and then what happens? The model plays a valid move. Still, most of the time it knows what the board state looks like, but the move that it plays is very, very bizarre.
It's like a very weird move. Why? Because what has the model been trained to do? The model was never told the game of chess is to win. The model was told, make things that are like what you saw before. It saw a sequence of moves that looked like two people who were rated negative 50 playing a game of chess. And it's like, well, okay, I guess the game is to just make valid moves and just see what happens. And they're very good at doing this. And you can do this both in the synthetic way.
And also what you can do is you can just find some explicit cases where you can get models to make terrible move decisions just because that's what people commonly do when they're playing. And, you know, most people fall for this trap, and the model was trained to
play like whatever the training data looked like, and so I guess I ought to fall for this trap. And one of the problems of these models is they're not initially trained to do the play-to-win thing. Now, as far as how this applies to actual language models that we use, we almost always post-train the models with RLHF and SFT instruction fine-tuning things. And a big part of why we do that
is so that we don't have to deal with this mismatch between what the model was initially trained on and what we actually want to use it for. And this is why GPT-3 is exceptionally hard to use and the sequence of instruct papers was very important, is that it takes the capabilities that the model has somewhere behind the scenes and makes them much easier to reproduce. And so when you're using a bunch of the chat models today, most of the time, you don't have to worry nearly as much
exactly how you frame the question, because of this, you know, they were designed to give you the right answer even when you ask the silly question. But I think they still do have some of this. But I think it's maybe less than if you just have the raw base model that was trained on whatever data it happened to be trained on.
Yeah, I'd love to do a tiny digression on RLHF, because I was speaking with Max from Cohere yesterday. They've done some amazing research talking all about, you know, how this preference steering works. And they say that humans are actually really bad at kind of distinguishing a good thing from another thing, you know. So we like confidence. We like verbosity. We like complexity. And for example,
I really hate the ChatGPT model because of the style. I can't stand the style. So even though it's right, I think it's wrong. So when we do that kind of post-training on the language models, how does that affect the competence?
I don't know. Yeah, I mean, I feel like it's very hard to answer some of these questions because oftentimes you don't have access to the models before they've been post-trained. You can look at these numbers from the papers. So like in the GPT-4 technical report, one of these reports, they have some numbers that show that the model before it's been post-trained, so just the raw base model, is very well calibrated. And what this means is when...
it gives an answer with some probability, it's right about that fraction of the time. So if you ask a math question and it says the answer is five, and the token probability is 30%, it's right about 30% of the time. But then when you do the post-training process, the calibration gets all messed up
and it doesn't have this behavior anymore. So, like, some things change. You can often have models that just get fantastically better when you do post-training because now they follow instructions much better. You haven't really taught the model much new, but it looks like it's much smarter. Yeah, I think this is all a very confusing thing. I don't have a good understanding of how all of these things fit together.
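As a rough illustration of what that calibration claim means (my own sketch, with made-up data rather than anything from the GPT-4 report): bucket answers by the probability the model assigned to its answer token, and compare against how often those answers were actually correct.

```python
# Hypothetical calibration check: each record pairs the probability the model
# assigned to its answer token with whether the answer was actually correct.
from collections import defaultdict

records = [
    (0.31, True), (0.28, False), (0.95, True), (0.90, True),
    (0.33, False), (0.88, False), (0.30, True), (0.92, True),
]  # (answer-token probability, was the answer correct?) -- made-up data

buckets = defaultdict(list)
for prob, correct in records:
    buckets[round(prob, 1)].append(correct)

for conf in sorted(buckets):
    outcomes = buckets[conf]
    accuracy = sum(outcomes) / len(outcomes)
    # A well-calibrated model has empirical accuracy close to its stated
    # confidence in every bucket.
    print(f"stated confidence ~{conf:.1f}: accuracy {accuracy:.2f} "
          f"over {len(outcomes)} answers")
```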
These models, they make valid moves, they appear to be competent, but sometimes they have these catastrophic, weird failure modes. Do we call that process reasoning or not? I'm very big on not ascribing intentionality, or, I don't want to... Everyone means something different by reasoning.
And so the answer to the question, is that reasoning, depends entirely on what you define as reasoning. And so you find some people who are very much in the world of, I don't think models are smart, I don't think that they're good, they can't solve my problems. And so they say, no, it's not reasoning, because to me, reasoning means, and then they give a definition which excludes language models.
And then you ask someone who's very much on the AGI side, language models are going to solve everything, by 2027 they're going to have displaced all human jobs. You ask them, what is reasoning? And they say reasoning is whatever the process is that the model is doing. And then they tell you, yes, they're reasoning. And so I think it's very hard to talk about whether it's actually reasoning or not. I think the thing that we can talk about is the input-output behavior.
And, you know, does the model do the thing that answers the question, solves the task and was challenging in some way? And like, did it get it right? And then we can go from there. And I think this is an easier way to try and answer these questions than to ascribe intentionality to something like it. Like, I don't know, it's just really hard to have these debates with people when you start off without having the same definitions. Yeah.
I know I'm really torn on this because, as you say, the deflationary methodology is that it's an input-output mapping. You could go one step up. So Bengio said that reasoning is basically knowledge plus inference, you know, in some probabilistic sense. And
I think it's about knowledge acquisition or the recombination of knowledge. And then it's the same thing with agency, right? You know, the simplistic form is that it's just like, you know, an automata. It's just like, you know, you have like an environment and you have some computation and you have an action space and it's just this thing, you know, but it feels necessary to me to have things like autonomy and emergence and intentionality in the definition. But you could just argue, well, why are you saying all of these words? Like if it does the thing, then it does the thing.
Yeah, and this is sort of how I feel. I mean, I think it's very interesting to consider this, like, is it reasoning? If you have a background in philosophy and that's what you're going for. I don't have that. So I don't feel like I have any qualification to tell you whether or not the model is reasoning. I feel like the thing that I can do is say...
here is how you're using the model. You want it to perform this behavior. Let's just check. Like, did it perform the behavior? Yes or no. And if it turns out that it's doing the right thing in all of the cases, I don't know that I care too much about whether or not the model reasoned its way there or it used a lookup table. Like if it's giving me the right answer every time, like,
I don't know. I tend to not focus too much on how it got there.
We have this entrenched sense that we have parsimony and robustness. For example, in this chess notation, if you changed the syntax of the notation, it probably would break, right? Yes. There are multiple chess notations. And I have tried this. So before there was the current notation we use, in old chess books, notation was like, you know,
you know, king's bishop moves to the, like, you know, to queen's three, whatever, like that, you just number the squares differently. If you ask a model in this notation, it has no idea what's happening, and it will write something that looks surface level, like a sequence of moves, but has nothing to do with the correct board state. And of course, yeah, a human would not do this if you ask them to produce the sequence of moves, like they, it would take me a long time
to remember which squares, which things, how to write these things down. I would have to think harder. But I understand what the board is, and I can get that correct. And the model doesn't do that right now. And so maybe this is your definition of reasoning, and you say the reasoning doesn't happen. But someone else could have said, why should you expect the model to generalize this thing that it's never seen before? It's interesting to me
We've gone from a world where we wrote papers about the fact that if you trained a model on ImageNet, then, well, obviously it's going to have this failure mode that when you corrupt the images, the accuracy goes down. Or, like, suppose I wrote a paper seven years ago: I trained my model on ImageNet and I tested it on CIFAR-10. It didn't work. Isn't this model so bad? People would laugh at you. Like, well, of course, you trained it on ImageNet, one distribution. You tested on just a different one. You never asked it to generalize,
and it didn't do it. Good job. Of course it didn't solve the problem. But today what do we do with language models? We train them on one distribution, we test them on a different distribution that it wasn't trained on sometimes, and then we laugh at the model like, "Isn't it so dumb?" It's like, "Well, yes, you didn't train it on the thing." Maybe some future model, you could have the fact that it could just magically generalize across domains, but we're still using machine learning. You need to train it on the kind of data that you want to test it on, and then the thing will behave much better than if you don't do that.
So in an email correspondence to me, you said something, you didn't use these exact words, but you said that there are so many instances where you kind of feel a bit noobed, because you made a statement, you know, your intuition is you're a bit skeptical, you said they're stochastic parrots, and then you got proven wrong a bunch of times. And it's the same for me. Now, one school of thought is, you know, Rich Sutton, you just throw more data and compute at the thing. And the other school of thought is that we need completely different methods. I mean, are you still amenable to the idea that just scaling these things up will do the kinds of reasoning that we're talking about?
Possibly. Yeah, right. So there are some people I feel like who have good visions of what the future might look like. And then there are people like me who just look at what the world looks like and then try to say, well, let's just do interesting work here. I feel like this works for me because for security in particular, it really only matters...
if people are doing the thing to attack the thing. And so I'm fine just saying, like, let's look at what is true about the world and write the security papers. And then if the world significantly changes, we can try and change. And we can try and be a couple years ahead looking where things are going so that we can do security ahead of when we need to. But I tend, because of the area that I'm in, not to spend a lot of time trying to think about, like, where are things going to be in the far future, right?
I think a lot of people try to do this and some of them are good at it and some of them are not. And I have no evidence that I'm good at it. So I try and mostly reason based on what I can observe right now. And if what I can observe changes, then I ought to change what I'm thinking about these things and do things differently. And that's the best that I can hope for.
On this chess thing, has anyone studied, you know, like in the headers for the chess notation, you could say this player had an ELO of 2,500 or something like that. And I guess the first thing is like, do you see some commensurate, you know, change in performance? But what would happen if you said ELO 4,000? Right. Yeah.
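For reference, the rating information being referred to lives in PGN header tags that appear before the move list in a game record. A minimal sketch, assuming the python-chess package and with made-up values:

```python
# Sketch of the PGN rating headers being discussed (values are made up),
# using the python-chess package.
import chess.pgn

game = chess.pgn.Game()
game.headers["Event"] = "Example game"
game.headers["White"] = "Player A"
game.headers["Black"] = "Player B"
game.headers["WhiteElo"] = "2500"   # the rating tags the question refers to
game.headers["BlackElo"] = "2500"

# Printing the game emits the [WhiteElo "2500"] etc. tags before the moves,
# which is the text a model trained on raw PGN files would condition on.
print(game)
```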
Yes. We've actually trained some models trying to do this. It doesn't work very well. It's like you can't trivially... At least, yeah, if you just change the number. We've trained some models ourselves on headers that we expected would have an even better chance of doing this, and it did not directly give this kind of immediate win. Which, again, is not to say much:
I am not good at training models. Someone else who knows what they're doing might have been able to make it have this behavior, but when we trained it, and when we tested 3.5 Turbo Instruct, it might have a statistically significant difference on the outcome, but it's nowhere near the case that you tell the model it's playing like a 1,000-rated player and all of a sudden it's 1,000 rated. People have worked
very hard to try and train models that will let you match the skill to an arbitrary level, and it's, like, a research-paper-level thing, not just change three numbers in a header and hope for the best. Right. So you wrote another article called Why I Attack. Sure. And you said that you enjoy attacking systems for the fun of solving puzzles rather than for altruistic reasons. Can you tell me more about that, but also, why did you write that article?
Yeah, okay. Okay, so let me answer them in the opposite order you asked them. So why did I write the article? Some people were mad at me for breaking defenses. They said that I don't care about humanity, I just, I don't know, want to make them look bad or something. And half of that statement is true. Yeah.
I don't do security because I'm driven by, I want to do maximum good, and therefore I'm going to think about what are all of the careers that I could do and try and find the one that's most likely to save the most lives. If I had done that, I probably would, I don't know, be a doctor or something that actually immediately helps people. You could do research on cancer, find whatever domain you wanted where you could
measure like maximum good. I don't find those things fun. I can't motivate myself to do them. And so if I was a different person, maybe I could do that. Maybe I could be someone who like could meaningfully solve challenging problems in biology by saying like, I'm waking up every morning knowing that I'm sort of like saving lives or something.
But this is not how I work. And I feel like it's not how lots of people work. You know, there are lots of people who, I feel like, are in computer science, or, you want to go even further, in quant fields, where you're clearly brilliant and you could be doing something a lot better with your life. And some of them probably legitimately would just have zero productivity if they were doing something that they really did not
find any enjoyment in. And so I feel like the thing that I try and do is, okay, find the set of things that you can motivate yourself to do and, like, will do a really good job in, and then solve those as good as possible, subject to the constraint that, like, you're actually net positive moving things forwards. And...
For whatever reason, I've always enjoyed attacking things and I feel like I'm differentially much better at that than at anything else. And I feel like I'm pretty good at doing the adversarial machine learning stuff, but I have no evidence that I would be at all good at the other...
you know, 90% of things that exist in the world that might do more good. And so, I don't know, the way that I think about this, maybe in one sentence, is: how good you are at the thing, multiplied by how much the thing matters. And you're trying to sort of
maximize that product. And if there's something that you're really good at that at least directionally moves things in the right direction, you can have a higher impact than taking whatever field happens to be the one that is maximally good and moving it forwards by a very small amount. And so that's why I do attacks: because I feel like generally they move things forward, and I feel like I'm better at that than at most other things that I could be doing.
Now, you also said that attacking is often easier than defending. Certainly. Tell me more. I mean, this is the standard thing in security. You need to find one attack that works and you need to fix all of the attacks if you're defending. And so if you're attacking something...
The only thing that I have to do is find one place where you've forgotten to handle some corner case, and I can arrange for the adversary to hit that as many times as they need until they succeed. This is why you have normal software security. You can have a perfect program in everywhere except one line of code,
where you forget to check the bounds exactly once. And what does this mean? The attacker will make it so that that happens every single time, and the security of your product is essentially zero.
Under random settings, this is never going to happen. It's never going to happen that the hash of the file is exactly, you know, equal to 2 to the 32, which overflows the integer, which causes the bad stuff to happen. This is not going to happen by random chance, but the attacker can just arrange for this to happen every time, which means that it's much easier for the attacker than the defender, who has to fix all of the things.
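A toy illustration of that point, my own sketch rather than an example from the conversation: a single off-by-one mistake in a bounds check that random inputs essentially never trigger, but that an attacker who has read the code can trigger on every request.

```python
# Toy sketch of "the attacker hits the rare case every time".
import random

TABLE = list(range(1000))

def lookup(index: int) -> int:
    # Buggy bounds check: `>` should be `>=`, so index == len(TABLE)
    # slips through -- the one forgotten corner case.
    if index < 0 or index > len(TABLE):
        raise ValueError("index out of range")
    return TABLE[index]  # IndexError when index == len(TABLE)

# Random testing almost never finds the bug...
hits = 0
for _ in range(100_000):
    try:
        lookup(random.randrange(0, 2**32))
    except ValueError:
        pass            # rejected by the (mostly working) bounds check
    except IndexError:
        hits += 1       # the one-in-four-billion value that sneaks past it
print("random inputs that hit the bug:", hits)  # almost always 0

# ...but an attacker who has read the code triggers it on every request.
try:
    lookup(len(TABLE))
except IndexError:
    print("attacker-chosen input hits the bug every single time")
```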
And then in machine learning it gets even worse. Because at least in normal security, in software security or other areas, we understand the classes of attacks. In machine learning we just constantly discover new categories of bad things that could happen. And so not only do you have to be robust to the things that we know about,
you have to be robust to someone coming up with a new clever type of attack that we hadn't even thought of before and be robust there. And this is not happening because of the way... I mean, it's a very new field. And so, of course, it's just much easier for these attacks than defenses. Let's talk about disclosure norms. How should they change now that we're in the ML world? Okay, yeah. So in standard software security, we've basically figured out how things should go. So for a very long time,
you know, for 20 years, there was a big back and forth about what someone should do when they find a bug in some software that can be exploited. And let's say, I don't know, late 90s, early 2000s, there were people who were on the full-disclosure side, who thought: I find a bug in some program, what should I do? I should tell everyone, so that we can make sure that people don't make a similar mistake, and we can put pressure on the person to fix it, and all that stuff.
And then there were the people who were on the, like, don't-disclose-anything side: you should report the bug to the person who's responsible and wait until they fix it, and then you should tell no one about it, because, you know, this was a bug that they made and you don't want to give anyone else ideas for how to exploit it.
And in software security, we landed on this, you know, what was called responsible disclosure and is now coordinated disclosure, which is the idea that you should give the person, if it affects one person, a reasonable heads up for some amount of time. Google Project Zero has a 90-day policy, for example, and you have that many days to fix your thing.
And then after that, or once it's fixed, then it gets published to everyone. And the idea here in normal security is that you give the person some time to protect their users. You don't want to immediately disclose some new attack that allows people to cause a lot of harm. But you put a deadline on it and you stick to the deadline to put pressure on the company to actually fix the thing.
Because what often happens if you don't say you're going to release things publicly is no one else knows about it. You're the only one who knows the exploit. They're just going to not do it because they're in the business of making a product, not fixing bugs. And so why would they fix it if no one else knows about it? And so when you say, no, this will go live in 90 days, you better fix it before then, they have the time. It's just like now if they don't do it, it's on them because they just didn't put in the work to fix the thing. And there are, of course, exceptions. You know,
Spectre and Meltdown are two of the most well-known exploits, among the biggest attacks in the last 10, 20 years in software security. And they gave Intel and related people a year to fix this because it was a really important bug. It was a hard bug to fix. There were legitimate reasons why you should do this. There's good evidence that it's probably not going to be independently discovered by the bad people for a very long time. And so they gave them a long time to fix it.
And similarly, Google Project Zero also says if they find evidence the bug is being actively exploited, they'll give you seven days. If there's someone actively exploiting it, then you have seven days before they'll patch because the harm is already being done. And so they might as well tell everyone about the harms being done because if they don't, then it's just going to delay the things.
With that long preamble, how should things change for machine learning? The short answer is I don't know, because on one hand, I want to say that this is like how things are in software security. And sometimes it is, where someone has some bug in their software, and there exists a way that they can patch it and fix the problems.
And in many cases, this happens. So we've written papers recently, for example, where we've shown how to do some model stealing stuff. So OpenAI has a model, and we could query OpenAI's services in a way that allowed us to steal part of their model. Only a very small part, but we could steal part of it. So we disclosed this to them, because there was a way that they could fix it. They could make a change to the API to prevent this attack from working, and then we write the paper and put it online.
this feels very much like software security. On the other hand, there are some other kinds of problems that are not the kinds that you can patch. Let's think in the broadest sense, adversarial examples. If I disclosed to you, here is an adversarial example on your image classifier. Like,
What is the point of doing the responsible disclosure period here? Because there is nothing you can do to fix this in the short term. Like we have been trying to solve this problem for 10 years. Another 90 days is not going to help you at all. Maybe I'll tell you out of courtesy to let you know, like, this is the thing that I'm doing. I'm going to write this paper. Here's how I'm going to describe it. Do you want to like put in place a couple of filters ahead of time to make this particular attack not work? But you're not going to solve the underlying problem.
And when I talk to people who do biology things, the argument they make is, suppose someone came up with a way to create some novel pathogen or something. A disclosure period doesn't help you here. And so is it more like that? Or is it more like software security? I don't know. I'm more biased a little bit towards the software security because that's where I came from. But it's hard to say exactly which one we should be modeling things after. I think we do probably need to come up with new norms for how we handle this.
There are a lot of people I know who are talking about this, trying to write these things down. And I think in a year or two, if you ask me this again, we will have set processes in place. We will have established norms for how to handle these things. Right now, I think this is just very early, and we're just looking for analogies in other areas and trying to come up with what sounds most likely to be good. But I don't have a good answer for you immediately right now. Are there any vulnerabilities that you've decided not to pursue for ethical reasons?
No, not that I can think of. But I think that's mostly because I tend to only try and think of the exploits that would be ethical in the first place. So, I mean, it may happen that I stumble upon something like this, but I tend to... I think, in some very small fraction of the time, research ideas happen just by random inspiration.
Most of the time, though, a research idea is not something that just happens. You have to spend conscious effort trying to figure out what new thing you're going to try and do. And I think it's pretty easy to just not think about the things that seem morally fraught and just focus on the ones that seem like they actually have potential to be good and useful. But...
Yeah, it very well may happen at some point, but I can't think of any examples of attacks that we've found that we've decided not to publish because of the harms that they would cause. I can't rule out that this is something that could happen, but I tend to just bias my search of problems in the direction of things that I think are actually beneficial. Maybe going back to the why-I-attack thing:
You want the product of how good you are and how much good it does for humanity to be maximally positive. You can choose what problems you work on so as not to be the ones that are negative. I don't have lots of respect for people whose net effect on the world is just a negative number. You can choose to make that at least zero; just don't do anything. I try and pick the problems that I think are generally positive and then
Among those, yeah, just do as well as possible on those ones. So you work on traditional security and ML security. What are the significant differences? Yeah, okay, so I don't work too much on traditional security anymore. I started my PhD in traditional security. I did very, very low-level return-oriented programming. I was at Intel for a summer on some hardware-level defense stuff. Yeah.
And then I started machine learning shortly after that. So I haven't worked on the very traditional security in like the last, let's say, eight something, seven something years. But yeah, I still follow it very closely. I still go to the system security conferences all the time because I think it's like a great community. But yeah, what are the similarities and differences? I feel like the systems security people are very good at...
really trying to make sure that what they're doing is a very rigorous thing, and evaluating it really thoroughly and properly. You see this even in the length of the papers. So a system security paper
is like 13, 14 pages long, two-column. A paper that's a submission for ICLR is like seven or eight or something, one-column. The system security papers will all start with a very long explanation of exactly what's happening. The results are expected to be really rigorously done. A machine learning paper often is: here is a new cool idea, maybe it works.
And this is good for, like, you know, move fast and break things. This is not good for really systematic studies. You know, when I was doing system security papers, I would get, like, one, one and a half, two a year. And now, for a similar kind of thing, of machine learning papers, you could probably do five or six or something to the same level of rigor. And so I feel like this is maybe the biggest thing I see in my mind: the level of
thought that goes into some of these things. And it's a conscious decision by the communities, right? And I think it's worked empirically in the machine learning space. It would not be good if every research result in machine learning needed to have the kind of rigor you would have expected for a systems paper, because we would have had like five iteration cycles in total, right?
And, you know, at machine learning conferences, you often see the paper, the paper that improved upon the paper, and then the paper that improved upon that paper, all at the same conference, because the first person put it on arXiv, the next person found the tweak that made it better, and the third person found the tweak that made it even better. And this is good. You know, when the field is very new, you want to allow people to rapidly propose ideas that they don't have full evidence of working. And when the field is much more mature, you want to make sure that you don't have people just proposing wild things that have been proposed 30 times in the past without knowing whether they work.
And so I think having some kind of balance and mix between the two is useful. And this, I think, is maybe the biggest difference that I see. And this is, I guess...
Maybe if there's some differential advantage that I have in the machine learning space, I think some of it comes from this where in systems, you were trained very heavily on this kind of rigorous thinking and how to do attacks very thoroughly, look at all of the details. And when you're doing security, this is what you need to do. And so I think some of this training has been very beneficial for me in writing machine learning papers, thinking about all of the little details to get these points right because, you know,
I had a paper recently where the way that I broke some defense, the way that the thing broke, is that there was a negative sign in the wrong spot. And this is not the kind of thing that I could have reasoned about from first principles about the code. If I had been advising someone, I don't know how I would have told them, check all the negative signs. You don't know. You just...
what you should be doing is this. You should be understanding everything that's going on and find the one part where the mistake was made so that you can break it by doing just the one right thing. This is maybe my biggest difference, I think, between these communities. Next article.
It was called Why I Use AI. And it was about a couple of months ago you wrote this. And you say that you've been using language models. You find them very useful. They improve your programming productivity by about 50%. I can say the same myself. Maybe let's start there. Can you break down specifically the kind of tasks where it's really uplifted your productivity? So I am not someone who believes in these kinds of things. I don't...
There are some people whose job is to hype things up and get attention on these kinds of things. And I feel like the thing that annoyed me is that these people, the same people who were, you know, Bitcoin is going to change the world, whatever, whatever; as soon as language models come about, they all go, language models are going to change the world, they're very useful, whatever, whatever. And the problem is that, if you're just looking at this from afar...
It looks like you have the people who are the grifters just finding the new thing. And they are, right? Like, this is what they're doing. These people have no understanding what's going on in the world. They're trying to find whatever the new thing is that they can get them clicks. But at the same time, I think that the models that we have now are actually useful.
And they're not useful for nearly as many things as people like to say that they are. But for a particular kind of person, the person who understands what is going on in these models and knows how to code and can review the output, they're useful. And so what I wanted to say is, I'm not going to try and argue that
they're good for everyone, but I want to say, here's an n equals one me anecdote that I think they're useful for me, and if you have a background similar to me, then maybe they're useful for you too. And I've got a number of people who are security-style people who have contacted me and said, thanks for writing this, they have been useful for me, and yeah. Now there's a question of, does my experience generalize to anyone else?
I don't know. This is not my job to try and understand this. But at least what I wanted to say was, yeah, they're useful for people who behave like I do. Okay, now, why are they useful? The current models we have now are good enough that the kinds of things where I want an answer to this question, whether it's write this function for me or whatever, do this, I know how to check it.
I know that I could get the answer. It's like something I know how to do, I just don't want to do it. The analogy I think is maybe most useful is imagine that you had to write all of your programs in C or in assembly. Would this make it so that you couldn't do anything that you can do now? No, probably not. You could do all of the same research results in C instead of Python if you really had to.
it would take you a lot longer because you have an idea in your mind. I want to implement something trivial, some binary search thing. And then in C, you have to start reasoning about pointers and memory allocation and all these little details that are at a much lower level than the problem you want to solve. And the thing I think is useful for language models is that if you know the problem you want to solve and you could check that the answer is right,
then you can just ask the model to implement for you the thing that you want in the words that you want to just type them in, which are not terribly well-defined.
And then it will give you the answer and you could just check that it's correct and then put it in your code and then continue solving the problem you want to be solving and not the problem that you had to do to actually type out all the details. That's maybe the biggest class, I think, of things that I find useful. And the other class of things I find useful are the cases where you rely on the fact that the model has just enormous knowledge about the world
and about all kinds of things. And if you understand the fundamentals, but like, I don't know the API to this thing. Just like,
make the thing work with the API, and I can check that easily. Or I don't understand how to write something in some particular language: give me the code. If you give me code in any language, even if I've never seen it before, I can basically reason about what it's doing. I may make mistakes around the border, but I could never have typed it because I don't know the syntax, whatever. The models are very good at giving you the correct syntax
and just like getting everything else out of the way and then I can...
figure out the rest about how to do this. And if I couldn't ask a model, I would have had to have learned the syntax of the language to type out all the things, or do what people would do five years ago: copy and paste some other person's code from Stack Overflow and make modifications. And that was just a strictly worse version of asking the model, because now I'm relying on me, who doesn't know anything, to just do copy and paste. And so, I guess, my view is that for these kinds of problems, they're currently
plenty useful. If you already understand, and by that I mean have an abstract understanding, then they're a superpower, which explains why, you know, the smarter you are, the more you can actually get out of a language model. But how has your usage evolved over time, and what's your methodology? I mean, you know, speaking personally, I know that specificity is important, so going to source material and constructing the prompt, you know, imbuing my understanding and reasoning process into the prompt. I mean, how do you think about that?
Yeah, I guess I try and ask questions that I think have a reasonable probability of working. And I don't ask questions where I feel like it's going to slow me down. But if I think it has, you know, a 50% chance of working, I'll ask the model first. And then I'll look at the output and see, does this directionally look correct? And if it seems like it directionally maybe is going to approach the correct kind of solution,
then I might iterate a little more. And if it gives me a perfect solution the first time, then great, I accept it. And if it's going nowhere, then I learn, okay, the models are not very good at this kind of problem, and I just won't ask that again in the future. And so, for some people who say they can't get models to do anything useful for them: yeah, it may be the case that models are just really bad at a particular kind of problem. It may also just be that you don't have a good understanding of what the models can do yet. I think most people, you know,
today have forgotten how much they had to learn about how to use Google search. You know, like people today, if I tell you to look something up, like you, you implicitly know the way that you should look something up is to like use the words that appear in the answer. Don't, you don't like ask it as the form of a question. Like you sort of, there's a way that you type things into the, into search engines to get the right answer.
And this requires some amount of skill and understanding about how to reliably find answers to something online. I feel like it's the same thing for language models. They have a natural language interface. So like technically you could type whatever thing that you wanted. There are some ways of doing it that are much more useful than others. And I don't know how to teach this as a skill other than just saying like,
try the thing and maybe it turns out they're not good at your task and then just don't use them. But if you are able to make them useful, then this seems like a free productivity win. But this is the kind of thing where, again, caveat it on, you have to have some understanding of what's actually going on with these things because there are people who don't, who I feel like can try and do these similar kinds of things.
And then I'm worried about, you know, like, are you going to learn anything? You won't catch the bugs when the bugs happen. All kinds of problems that I'm worried about from that perspective. But for the practitioner who wants to get work done, I feel like in the same way that I wouldn't say you need to use C over Python, I wouldn't say you need to use just Python over Python plus language models. Yeah.
Yes, yes. I agree that laziness and acquiescence is a problem. Vibes and intuition are really important. I mean, I consider myself a Jedi of using LLMs. And sometimes it frustrates me because I say to people, oh, just use an LLM. I seem to be able to get so much more out of LLMs than other people. And I'm not entirely sure why that is. Maybe it's just because I understand the thing that I'm prompting or something like that. But it seems to be something that we need to learn.
Yeah, I mean, every time a new tool comes about, you have to spend some time, you know, I remember when people would say, real programmers write code in C and don't write it in a high level language. Why would you trust the garbage collector to do a good job? Real programmers manage their own memory.
Real programmers write their own Python. Why would you trust the language model to output code that's correct? Why would you trust it to be able to have this recall? Real programmers understand the API and don't need to look up the reference manual. I mean, you can draw the same analogies here. And yeah, no, I think this is the case where, when the tools change and make it possible for you to be more productive in certain settings, you should be willing to look into the new tools.
I know I'm always trying to rationalize this, because it comes down to this notion of, is the intelligence in the eye of the prompter? Does it matter? This is maybe the difference between how I use these things and how other people do: the thing makes me more productive and solves the task for me. Was it the case that I put the intelligence in?
Maybe. In many cases, I think the answer is no. In some cases, I think the answer is yes. But I'm not going to look at it this way. I'm going to look at it as, is it solving the questions that I want in a way that's useful for me? I think here the answer is definitely yes. But yeah, I don't know how to answer this in some real way. So obviously, as a security researcher, how does that influence the way that you use LLMs?
Oh yeah, this is why I'm scared about the people who are going to use them and not understand things because you ask them to write an encryption function for you and the answer really ought to be, "You should not do that. You should be calling this API." And oftentimes they'll be like, "Sure, you want me to write encryption function? Here's the answer to an encryption function." And it's going to have all of the bugs that everyone normally writes and this is going to be terrible. The same thing for... I was writing some random stuff that made calls to a database.
And what did the model do? It wrote the thing that was vulnerable to SQL injection. And this is terrible. If someone was not being careful, they would not have caught this. And now they've introduced all kinds of bad bugs. Because I'm reasonably competent at programming, I can read the output of the model and just correct the things where it made these mistakes. It's not hard to fix the SQL injection and replace the string concatenation with the templates. The model just didn't do it correctly.
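As a concrete illustration of that kind of fix (a generic sketch, not the actual code from that project), here is the string-concatenation query a model often emits next to the parameterized version, using Python's built-in sqlite3 module:

```python
# Generic sketch of the SQL injection fix described above, using sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

user_input = "alice' OR '1'='1"  # attacker-controlled value

# Vulnerable: the kind of string concatenation a model often emits.
query = "SELECT email FROM users WHERE name = '" + user_input + "'"
print(conn.execute(query).fetchall())  # matches every row: the injection works

# Fixed: a parameterized query; the driver treats the input purely as data.
print(conn.execute(
    "SELECT email FROM users WHERE name = ?", (user_input,)
).fetchall())  # returns nothing, as it should
```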
And yeah, so I'm very worried about the kind of person who's not going to do this. There have been a couple of papers by people showing that people do write very insecure code when using language models, when they're not being careful about these things. And yeah, this is something I'm worried about. It looks like it might be the case that code is differentially more vulnerable when people use language models versus when they don't. And yeah, this is, I think, a big concern. I think the reason why I tend to
think about this utility question is often just from the perspective of, yeah, security of things that people use actually matters. And so I want to know what are the things that people are going to do so you can then write the papers and study what people are actually going to do. So I feel like it's important to separate can the model solve the problem for me? And the answer for language models using it is oftentimes yes, it gives you the right answer for the common case. And this means that
most people don't care about the security question. And so they'll just use the thing anyway, because it gave them the ability to do this new thing, not understanding the security piece. And that means we should then go and work on this other question: we know people are going to use these things, so we ought to do the security work to make sure the security is there and they can use them correctly.
And so I often try and use things that are at the frontier of what people are going to do next, just to put myself in their frame of mind and understand this. And yeah, this worries me quite a lot, because things could go very bad here. How and when do you verify the outputs of LLMs? The same way you verify anything else. I mean, this is the other thing. People say, you know, maybe the model is going to be wrong, but
half of the answers on Stack Overflow are wrong anyway. If you've been programming for a long time, you're used to the fact that you read code that's wrong. I'm not going to copy and paste some function off Stack Overflow and just assume that it's right, because maybe the person asked a question that was different from the question I'm asking. Whatever. I don't feel like I'm doing anything terribly different when I'm verifying the output of a language model versus when I'm verifying some function that I found somewhere else online.
Maybe the only difference is that I'm using the models more often, and so I have to be more careful in checking. You know, if you're using something twice as often and it's introducing bugs at some rate, you're going to have twice as many bugs, because you're using it twice as much. And so you have to be a little more careful. But I don't feel like there's anything I'm doing that's especially different in quality. It's just...
don't trust the thing to give you the right answer, and understand the fact that 95% solutions are still 95% solutions. You know, you take the thing, it does almost everything that you wanted, and then, when it's maxed out its capability, good. You're an intelligent person. Now you finish the last 5%, fix whatever the problem is, and there you have a 20x performance increase.
You touched on something very interesting here because actually most of us are wrong most of the time. And that's why it's really good to have at least one very smart friend because they constantly point out all of the ways in which your stuff is wrong. Most coders, I mean, it's your job to point out how things are wrong. And I guess we're always just kind of on the boundary of wrongness unwittingly. And that's just the way the world works anyway. Yeah. Yeah, right. And so...
I think, yeah, I mean, I think there's a potential for massive increases in the quantity of wrongness with language models. There are lots of things that could go very wrong or very bad here. Previously, the amount of bad code that could be written was limited to the number of humans who could write bad code. Because there are only so many people who can write software, and you had to have at least some training, there was some bounded amount of bad code. One of the other things I'm worried about is, you know, you have people who hear the claim that models can solve all your problems for you, and now you have ten times as much code, which is great from one perspective, because isn't it fantastic that anyone in the world can go and write whatever software they need to solve their particular problem? That's fantastic.
But at the same time, the security person in me is kind of scared about this, because now you have ten times as much stuff that is probably very insecure, and you don't have ten times as many security experts to study all of it. You're going to have a massive increase in this, in some potential futures. And, you know, this is one of the many things that I'm worried about, and it's why I try and use these things to understand: does this seem like something people will try and do? It seems to me the answer is yes right now, and this worries me.
So I spoke with some Google guys yesterday and they've been studying some of the failure modes of LLMs. So like just really crazy stuff that people don't know about. Like they can't copy, they can't count, you know, because of the softmax and the topological representation squashing in this particular... Loads and loads of stuff they can't do. In your experience, have you noticed some kind of tasks that LLMs just really struggle on? I'm sure that there are many of them. I have sort of learned to just not ask those questions.
And so I have a hard time coming up with them, in the same sense that, you know, what are the things that search engines are bad for? I'm sure there are a million things that search engines are completely the wrong answer for. But if I pressed you to answer that right now, you'd have a little bit of a hard time, because the way that you use them is for the things that they're good for. And so, yes. All of these things: whenever you want
correctness in some strong sense, the model is not the thing for you. In terms of specific tasks that they're particularly bad at, I mean, of course you can say that if it would take you more than, you know, 20 minutes to write the program, probably the model can't get that. But this is the problem: this keeps changing. So, okay, this is the other thing.
There are things that I thought would be hard that ended up becoming easy. So there was a random problem that I wanted to solve, for
unrelated reasons, that is a hard dynamic programming problem. It took me, I don't know, two or three hours to solve it the first time I had to do it. And then o1 launched, you know, a couple of days ago. I gave the problem to o1, and it gave me an implementation that was ten times faster than the one I wrote, in like two minutes. And I can test it, because I have a reference solution, and it's correct.
So, okay, now I've learned something: here's a problem where I previously would never have asked a model to solve it, because it was a challenging enough algorithmic problem for me that I would have had no hope of the model solving it, and now I can. But there are other things that seem trivial to me that the models get wrong, and I've mostly just stopped asking those questions. But yeah, this is why, going back to the thing I'm worried about, I worry people will not have the experience to check when the answers are right and wrong, and they'll
just apply the wrong answer as many times as they can and that seems concerning.
Yeah, I mean, this is part of the anthropomorphization process, because I find it fascinating that we have vibes, we have intuitions, and we've learned to skirt around the long tail of failure modes. And we just carry that over into our supervised usage of language models. The amazing thing is we don't seem to be consciously aware of it. Yeah, but programmers do this all the time, right? You have a language, the language has...
Let's suppose you're someone who writes Rust. Rust has a very, very weird model of memory. If you go to someone who's very good at writing Rust, they will structure the program differently so they don't encounter all of the problems because of the fact that you have this weird memory model.
But if I were to do it, I'm not very good at Rust. I try and use it, and I try and write my C code in Rust, and the borrow checker just yells at me to no end, and I can't write my program. I look at Rust and go, I see that this could be very good, but I just don't know how to get my code right because I haven't done it enough. And so I look at the language and go, if I was not being charitable, I would say,
why would anyone use this? It's impossible to write my C code in Rust. You're supposed to have all these nice guarantees, but no, you have to change the way you write your code, change your frame of mind, and then the problems all just go away. You can do all of the nice things; just accept the paradigm you're supposed to be operating in, and the thing goes very well.
I see the same kind of analogy for some of these kinds of things here where the models are not very good in certain ways and you're trying to imagine that the thing is a human and ask it the things you would ask another person, but it's not. And you need to ask it in the right way, ask the right kinds of questions, and then you can get the value. And if you don't do this, then you'll end up very disappointed because it's not superhuman. What are your thoughts on benchmarks? Okay, yes, I have thoughts here.
This, I guess, is the problem with language models. We used to be in a world where benchmarking was very easy, because we wanted models to solve exactly one task. And so what you do is you measure the model on that task and see: can it solve the task? And if the answer is yes, great, you've figured it out. The problem is that task was never the task we actually cared about. And this is why no one used models.
No ImageNet models ever made it out into the real world to solve actual problems because we just don't care about classifying between 200 different breeds of dogs. The model may be good at this, but this is not the thing we actually want. We want something different. And it would have been absurd at the time to say the ImageNet model can't solve this actual task I care about in the real world because, of course, it wasn't trained for that. Language models...
The claim that the people who train language models make is, "I'm going to train this one general-purpose model that can solve arbitrary tasks." And then they'll go test it on some small number of tasks and say, "See, it's good, because it can solve these tasks very well." And the challenge here is that if I trained a model to solve any one of those tasks in particular, I could probably get really good scores. The challenge is that
you don't want the person who has trained the model to have done this. You wanted them to just train a good model and then use the benchmark as an independent check: here's a task you could evaluate the model on, completely independent from the initial training objective, in order to get an unbiased view of how well the model does. But people who train models are incentivized to make them do well on benchmarks. And while in the old world, you know,
I trust researchers not to cheat. So, suppose I wanted to have maximum ImageNet test accuracy. In principle, I could have trained on the test set. But this is actually cheating. You don't train on the test set, so I trust that people won't do this. But suppose that I give you a language model and I want to evaluate it on, you know, coding, for which I'm going to use HumanEval, a terrible benchmark, but whatever. I'm going to use MMLU. I'm going to use MMMU. Whatever the benchmarks may be.
I may not actually train the model on the test set of these things, but I may actually train my model in particular to be good on these benchmarks. And so you may have a model that is not very capable in general, but on these specific 20 benchmarks that people use, it's fantastic. And this is what everyone is incentivized to do because you want your model to have maximum scores on benchmarks. And so I think...
I would like to be in a world where there were a lot more benchmarks, so that this is not the kind of thing you can easily do, and you can more easily trust that these models are going to give you the right answers: that the benchmarks accurately reflect what the model's skill level is, in some way that was not designed by the model trainer to maximize the scores.
So at the moment, you know, like the hyperscalers, they put incredible amounts of work into benchmarking and so on. And now we're moving to a world where we've got, you know, test time inference, test time active fine tuning, you know, people are fine tuning, quantizing, fragmenting and so on. And a lot of the people doing this in a practical sense can't really benchmark in the same way. How do you see that playing out? That I don't know. I feel like if you're doing quantizing and stuff,
Good luck. I don't know. It just seems very hard to test what these things do. You can run the usual benchmarks and hope for the best, but I don't know. I feel like the thing I'm more worried about is people who are actively fine-tuning models
to show that they can make them better on certain tasks. So you have lots of fine-tunes of Llama, for example, that are claimed to be better, and they'll show all the benchmark numbers. And it just turns out that what they did was train the models to be good on these specific tasks, and if you ask them anything else, they're just really bad. I think that's the thing I'm more worried about. But yeah, for the other cases, I don't know. I agree this is hard, but I don't have any great...
solutions here. That's okay. We can't let you go before talking about one of your actual papers, because I mean, this has been amazing talking about general stuff, but I decided to pick this one, stealing parts of a production language model. So this is from July. Could you just give us a bit of an elevator pitch on that? For a very long time, when we did papers in security, what we did was we would think about how a model might be used in some hypothetical future,
And then say, well, maybe certain kinds of attacks are possible; let's try and show, in some theoretical setting, that this is something bad that could happen. And so there's a line of work called model stealing, which tries to answer the question: can someone, just by making standard queries to your API, steal a copy of your model?
This was started by Florian Tramèr and others in 2016, where they did this on very, very simple linear models behind APIs. And then it became a thing that people started studying on deep neural networks, and there were several papers in a row by a bunch of other people. And then in 2020, we wrote a paper that we put at Crypto that said, well, here is a way to steal an exact copy of your model.
Whatever the model you have is, I can get an exact copy, as long as you satisfy a long list of assumptions: it only uses ReLU activations, the whole thing is evaluated in 64-bit floating point, I can feed float64 values in and see float64 values out, the model is only fully connected, its depth is no greater than 3, it's no more than 32 units wide on any given layer. It's a long list of things that are never true in practice.
But it's a very cool theoretical result. And there are other papers of this kind that show how to do this kind of, I steal an exact copy of your model, but it only works in these really contrived settings. This is why we submitted the paper to Crypto, because they have all these kinds of theoretical results that are very cool, but are not immediately practical in many ways.
And then there was a line of work continuing to extend this. And the question that I wanted to answer is: now we have these language models, and
if I list all of those assumptions, all of them are false. It's not ReLU-only activations. It's not just fully connected. I can't send float64 inputs. I can't view float64 outputs. They have like a billion neurons, not 500. So all of these things that were assumed aren't true. And so I wanted to answer the question: what's the best attack that we can come up with that I can actually implement in practice on a real API?
And so this is what we tried to do. We tried to come up with the best attack that works against the most real API that we have. And so what we did is we looked at the OpenAI API and some other companies. Google had the same kind of things.
And because of the way the API was set up, it allowed us to get some degree of control over the outputs, which let us do some fancy math that would steal one layer of the model. Among the layers in the model, it's probably the least interesting; it's a very small amount of data. But I can actually recover one of the layers of the model. And so it's real in the sense that I can do it, and it's also real in the sense that I get the layer correctly.
But it's not everything. And so I think what I was trying to advocate for in this paper is that we should be pursuing both directions of research at the same time. One is: write the papers that are true in some theoretical sense, but are not the kinds of results you can actually implement
in any real system and likely for the foreseeable future are not the kinds we'll be able to implement in any real systems. And also at the same time, do the thing that most security researchers do today, which is look at the systems as they're deployed and try and answer, given this system as it exists right now,
what are the kinds of attacks that you can actually, really get working against the model, and try and write papers on those pieces of it. And I don't know what you're going to do with the last layer of the model; we have some things you can do. But one thing it tells you is the width of the model, which is not something that people disclose, so...
In our paper, we have, I think, the first public confirmation of the width of the GPT-3 Ada and Babbage models, which is not something that OpenAI ever said publicly. They had the GPT-3 paper that gave the width of a couple of models in the paper, but then they never really directly said what the sizes of Ada and Babbage were. People speculated, but we could actually write that down and confirm it. We also had...
As part of the paper, we ran the attack on GPT-3.5, and we correctly stole the last layer, and I know the size of the model, and it is correct. This goes back to responsible disclosure, like we talked about at the beginning. We agreed with them ahead of time that we were going to do this.
This is a fun conversation to have with not only the Google lawyers but the OpenAI lawyers: hi, I would like to steal your model, may I please do this? The OpenAI people were very nice and they said yes. The Google lawyers were initially very much "under no circumstances" when I said I would like to steal OpenAI's data. But I said, if I get the OpenAI general counsel to agree, are you okay with that? They said sure. We put it on an isolated VM, we ran everything, we destroyed the data afterwards, whatever.
But part of the agreement was that they would confirm that we did the right thing, and they asked us not to release the actual data we stole. Which makes sense, right? You want to show here's an attack that works, but let's not actually release the stolen stuff. And so if you were to write down a list of all the people in the world who know how big GPT-3.5 is, the list includes all current and former employees of OpenAI, and me.
And so it sounds like this is a very real attack, because how else would you learn this? The other ways to learn it would be to hack into OpenAI's servers, or, you know, blackmail one of the employees. Or you can do an actual adversarial machine learning attack and recover the size of those models and the last layer. And so the motivation behind why we wanted to write this paper was
to try and encourage other people to produce examples of attacks that, even if they don't solve all of the problems, let us make them increasingly real in this sense. And I think this is something we'll need to see more of as systems get deployed into more and more settings. So that was why we did the paper. I don't know if you want to talk about the technical methods behind how we did it or something, but...
Do you want to go there? Okay, sure. I can try. For the next two minutes, let's assume some level of linear algebra knowledge. If this is not you, then I apologize. I will try and explain it in a way that makes some sense. The way that the models work is they have a sequence of layers, and each layer is a transformation of the previous layer.
And the layers have some size, some width. And it turns out that the last layer of a model goes from a small dimension to a big dimension. So this is like the internal dimension of these models is, I don't know, let's say 2048 or something. And the output dimension is the number of tokens in the vocabulary. This is like 50,000. And so what this means is that if you look at the vectors that are the outputs of the model, even though it's in this big giant dimensional space, this 50,000 dimensional space,
Actually, the vectors, because this is a linear transformation, only live in a 2,000- or 4,000-dimensional subspace. And what this means is that if you look at this space, you can compute what's called the singular value decomposition to recover how the subspace was embedded into the bigger space. And, okay, I'll say a phrase: the number of non-zero singular values directly tells you the size of the model, right?
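As a minimal sketch of that idea, here is a toy numpy example under simplifying assumptions: synthetic weights, logit vectors observed directly rather than reconstructed through a constrained API, and dimensions shrunk so it runs quickly (the real attack has to do the reconstruction work, which is where the paper's contribution lies):

```python
# Toy illustration: logits = W @ h, with W of shape (vocab_size x hidden_dim),
# so every logit vector lies in a hidden_dim-dimensional subspace of the
# vocab_size-dimensional output space.
import numpy as np

hidden_dim, vocab_size, n_queries = 64, 1000, 256   # real models: ~2048 and ~50,000

rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, hidden_dim))   # stand-in for the final projection layer
H = rng.normal(size=(hidden_dim, n_queries))    # hidden states from n_queries different prompts
logits = W @ H                                  # what an attacker observes, one column per query

# Count the numerically non-zero singular values of the stacked logit vectors:
# that count is the hidden dimension, i.e. the width of the model.
s = np.linalg.svd(logits, compute_uv=False)
estimated_width = int(np.sum(s > s[0] * 1e-10))
print(estimated_width)   # 64
```

Recovering the actual last-layer weights (roughly, up to a change of basis) and doing all of this through a real, restricted API is the harder part; the sketch only shows why the width falls out of the rank.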
Again, it's not challenging math; the last time I used this was in an undergrad math course. But if you work out the details, it ends up working out. It's a very nice application of some nice math to these kinds of things. And I think part of the reason why I like the details here is that this is the kind of thing that...
It doesn't require an expert in any one area. It's like undergrad knowledge math. I could explain this to anyone who has completed the first course in linear algebra. But you need to be that person and you need to also understand how language models work and you need to also be thinking about the security and you need to be thinking about what the actual API is that it provides because you can't get the standard stuff. You have to be thinking about all the pieces. This is why I think the paper is interesting is like
This is what a security person does. It's not the case that we're looking at anything... Sometimes you look at something far deeper than any one thing, but most often with these exploits, how they happen is that you have a very broad level of knowledge and you're looking at how the details of the API interacts with how the specific architecture of the language model is set up using techniques from linear algebra. And if you were missing any one of those pieces, you wouldn't have seen this attack was possible.
which is why the OpenAI API had this for three years and no one found it before us. They were not looking for this kind of thing. You don't stumble upon these kinds of vulnerabilities; you need people to actually go look for them. And then, again, responsible disclosure: we gave them 90 days to fix it. They patched it. Google patched it. A couple of other companies, who we won't name because they asked us not to, patched it too. And, yeah.
It works. And so that was a fun paper to write. Amazing. Well, Nicholas Carlini, thank you so much for joining us today. It's been an honor having you on. Thank you.