
Robustness, Detectability, and Data Privacy in AI // Vinu Sankar Sadasivan // #289

2025/2/7

MLOps.community

People
Vinu Sankar Sadasivan
Topics
Vinu Sankar Sadasivan: I don't think the tools that currently claim to detect AI-generated text are reliable. Watermarking is one text-detection method, but not the only one. In the past, text watermarks were embedded as deliberate spelling mistakes or spacing patterns, but AI watermarking has since evolved. Watermarking loses effectiveness once an attacker is in the picture; my paper studies four kinds of detectors, and watermarking is one of them. As language models get larger, detection gets harder, because they can easily imitate human writing styles. Watermarking is only one of the tools we analyze in the paper, and it is easy to break if an attacker really wants to. Our theory shows that no foolproof technique exists today; watermarking can serve as one layer of security, but it is easy to remove. It is hard to make AI-generated text look more human through prompting alone, because the models have been fine-tuned to embed AI signatures more strongly. I use a paraphraser in the paper because it aligns with our theory and because we also wanted to attack watermarking. For watermarking, you cannot label every passage in a candidate set B as watermarked, because then human-written text would also be detected as watermarked. Using these detection systems requires a trade-off between type I and type II errors. I think the AI giants have recently taken steps that make their models easier to detect.

Deep Dive

Chapters
AI text detection is not foolproof. Watermarking is one technique, but it's not effective against determined attackers. The paper explores various detection methods and demonstrates their vulnerabilities.
  • AI text detection methods are not completely reliable.
  • Watermarking is a technique that can be broken by attackers.
  • There is a fundamental trade-off between detecting AI-generated text and avoiding false positives.

Transcript


Hi, my name is Vinu Sankar Sadasivan. I am a final-year PhD student at the University of Maryland. Currently, I am a full-time student researcher with Google DeepMind working on AI jailbreaking. So today we'll be discussing more about the hardness of AI detection and red teaming of these AI models, especially with a focus on generative models.

And yes, I generally do not drink coffee. And if I do get coffee, I go for a latte. What is happening, good people of the world? Welcome back to another MLOps Community Podcast. I am your host, Dimitrios. And today we get talking about red teaming and also how to jailbreak those models. Vinu.

Did a whole paper and PhD on being able to identify LLM-generated text. So we talked about watermarking and if it is a lost cause. And don't think I forgot about you. We've got a song recommendation coming in hot. Good old Tim Maia. Timmy Maia from Brazil. ♪

Let's get into it.

Okay, so let's start with why watermarking is so difficult. And you basically told me, or you didn't say it, but I read between the lines in our conversation before we hit record, which was,

All of that stuff that you see where you can turn in a piece of text and it will tell you how much percentage is AI generated versus not. That's kind of bullshit or what? Okay. So I wouldn't say it's bullshit, but I would say something which we should be not completely relying on. So, yeah.

We had this paper where we were researching different kinds of detectors. So to start with, watermarking is just one kind of detector we are looking at, and it has been the most prominent method. Watermarking has been there forever. It has been there for images and for text for a very long time.

So for text, earlier we used to put, say, a spelling mistake into the text, and if that spelling mistake keeps repeating, it's a watermark. Or a double space, or the pattern of the spaces or the punctuation, these kinds of things could be a watermark. But now things have changed. AI is blooming. People have made new methods for watermarking.

So what I'm saying is watermarking is a really good technique. But the problem is, when there are attackers in the setting you're looking at, it might not be as effective as we think it is. In the paper, we look at four different kinds of detectors. One of the most important ones is watermarking, which really does work well. Another kind is a trained detector, which is what I think most of

the detectors out there are using right now because language models are not yet completely watermarked. Not all of them are.

So it's basically a classifier where you give it an input and it just says, like dog or cat, AI text or not AI text. And the other one is zero-shot detectors, where you don't have to train a network; you just use a network to look at the statistics, look at the loss values, and say: if it's a low loss value, it is probably AI-generated text, and if it's not a low loss value, it is probably human-generated text, because generally AI text quality is higher, so the loss values are lower and hence it's AI text. And retrieval is another method, where you store all the AI-generated text in a database and then, given a candidate text, search whether that text is present in the database or not.
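To make that zero-shot idea concrete, here is a minimal sketch, not the paper's implementation: score a passage by its average per-token loss under a small language model and flag low-loss (low-perplexity) text as likely AI-generated. The scoring model and the threshold below are illustrative assumptions.

```python
# Hedged sketch of loss-based (zero-shot) AI-text detection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # any scoring LM works in principle
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_token_loss(text: str) -> float:
    """Average cross-entropy per token of `text` under the scoring model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)                     # HF shifts the labels internally
    return out.loss.item()

def looks_ai_generated(text: str, threshold: float = 3.0) -> bool:
    # Lower loss = more "predictable" text = more likely machine-written,
    # per the heuristic described above. The cutoff here is made up.
    return avg_token_loss(text) < threshold

print(looks_ai_generated("The dog was playing in the garden all afternoon."))
```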

So we look at all these kinds of detectors and we break them empirically in two ways. One is to make AI text look more like human text, and the other is to make human text look more like AI text. So type one and type two errors, both. And we also show a theoretical fundamental trade-off for these detectors. And we show that as language models get bigger and bigger, detection gets harder. Because if you think intuitively about it, language models are highly capable when they get bigger. They can easily mimic the style of human writing if you give them the relevant instructions, or even enough data. So if I give it a longer text, okay, this is Donald Trump's way of writing or talking, and I ask it to mimic that with a lot of data given to it, it probably would mimic it very well. And it gets harder as language models get even bigger. So that's one of the other concerns.

Watermarking is just one of the tools which we analyze in the paper, and we show it's easy to break if an attacker really wants to. So the takeaway from the paper would be: if someone really wants to attack, there is no foolproof technique right now. And our theory shows that there will not exist anything like that. So it's still really good to have one layer of security like watermarking for now,

just to tackle cases where people are directly using AI text out of an AI model and we can detect such text very well using watermarking. But if I really want to remove it, it's easy to remove that.

Now, what are the cases that folks, A, would want to know when it is only AI-generated text, and then B, when someone would want to be adversarial and not let the person know that they are only using AI-generated text? Yeah. Yes. That's a good question. So the cases where I wouldn't want to show that it's an AI text is when I'm a student.

or when I'm trying to commit plagiarism. I'm submitting my assignment, but I used ChatGPT to write my answers. So in that case, I wouldn't want my professor to know that I used AI for my assignments. That's one major scenario people are looking at right now, because that's where a lot of the money is. So all the leading text detection tools basically focus on that plagiarism use case, because there's a lot of money there.

And the other case where you want to really detect is again plagiarism would be one case. It's like a min-max game here. So students want to make it look like human text, but professors want to actually make it look like AI text if it actually was AI text. And the other case would be like spamming, phishing, these kind of attempts where you might be actually talking with a chat agent who is not a human. They might be scripted to fool you into some scams.

And it's easier for them to scale up these scamming attempts if they have access to AI, which is really dangerous. What if there's an AI model later on, which is like so natural, it converts text to speech and they're basically simulating a call center, talking to multiple people parallelly, trying to scam them and make a lot of money out of it.

So here they would really want to make AI text or AI speech or whatever modality it is look like more human-like so that humans don't detect it, but still they get to do whatever they want to do, the adversarial objectives they have without getting caught. Now, you mentioned there were ways that you broke these different settings or the detectors. And one way was making AI-generated text look more human.

Was that just through the prompt? No. So that is one of the methods you could actually try. But as we discussed before the conversation, with the evolution of these systems it has become harder to do that. Now I think the models are well fine-tuned to somehow imprint the AI signatures better, so it's harder to actually use input prompts to change or affect the detection much. I was recently trying all these tools, Gemini, ChatGPT, and Anthropic's Claude, to see how you can give prompts to make the output look like less AI text.

I can't really make a study out of it, because it's very hard to do manual prompting on thousands of texts myself. But what I figured out, commonly across most of these AI models, was this: if I give prompts like convert passive voice to active voice, use simpler sentences, don't use longer sentences, avoid certain punctuation, and things like that, pointing at the features which models generally use to write very high-quality text and asking them to be a little lower quality, okay, use a less rich vocabulary like a human would, stop using a lot of punctuation, write shorter sentences, I get to make them break sometimes. But it's hard. I've been noticing this because I try them out with some gap in the timeline, say every few months. And I find that it's getting harder to do that. So the way we did it in our paper,

which aligns with the theory of our paper, is to use a paraphraser. Why we did that was to get our empirical attack in line with our theory, and one other reason is that we also wanted to attack watermarking. One of the attack methods which the first AI text watermarking paper showed was to change words: just replace words with other words. That's the first naive attack you would think of. So given an AI candidate text, can you do minimal editing to it, changing words to synonyms or changing some of the punctuation, adding words like "and" or "or," to make minimal edits in terms of edit distance, yeah, and see if I can break it or not.

So this is basically like you saying, Control-F for all the instances where "delve" is in there, and then you replace all of the "delve"s with another word. Okay, I see. And then it passes with flying colors.

So that would be the case for some other detectors, but it's not the case with watermarking. So watermarking is quite robust to it. So if you really want to attack watermarking by just changing words, you might want to change almost like 50% of the words in place, which is going to be a very hard task, right? Because you can't change a lot of the words. And even given a sentence with 10 words, you can hardly change like two or three words in place.

without affecting the quality or the meaning of the sentence. So that's where paraphrasing is really important. So basically, given a sentence, I can completely change the structure of it. I can make an active voice sentence to a passive voice, so it completely flips, changes the structure. I can join different sentences with and and ors and things like that using a paraphraser, which was not the case if I simply changed different tokens or words in the passage.

So I think to explain the attack well, it's important to go into how watermarking works. Yeah. The way the first text watermarking paper did it is a simple algorithm, but it works really well. So think of it this way: you're writing a passage or an essay about a dog, so you start with "The dog was playing." That's how a human would write text. You wouldn't think exactly about how to pick the words; you just pick words so that they make sense. But how the watermarked AI would think is: okay, I started with the word "The." Now for the next word, I can only pick words from 25,000 of the 50,000 words I have access to in my vocabulary. So suppose I have 50,000 words in my vocabulary. The AI would partition that into two halves. They call one half the red list and the other half the green list. Now the AI would focus on always picking the word from the green list.

So eventually, when it writes word by word, it tries to make most of the words in the passage come from the green list. A human does not know what the red list and green list are, so he or she might end up taking almost 50% of the words from the red list and 50% from the green list, right? But this watermarked AI model will end up producing a passage which mostly has green-list words.

So this is the watermark. The detector knows: given a passage, if it has 90% green words and 10% red words in it, it's very highly likely to be AI text. But given a human text, okay, it's around 50% red words and 50% green words.

Now there are more advanced versions of this which make it better in terms of quality. Instead of having a hard partitioning into a red list and a green list, because sometimes when I write "Barack Obama," the word "Obama" might be in the red list but I want it to be in the green list, since I wrote "Barack" before and I want "Obama" to come after "Barack," what they do is make this red-list/green-list partitioning dynamic. At every step where I choose a new word, my red list and green list keep changing. Damn, all right. They change based on the previous word which was written. So if I had written "Barack," for the next word the green list and red list will be determined by that word "Barack." It is modeled in such a way that "Obama" will mostly be in the green list. Something like that.
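A minimal sketch of this red-list/green-list idea, heavily simplified and not the exact published algorithm: the previous token seeds a pseudo-random split of the vocabulary, generation softly boosts green tokens, and detection counts how many tokens fall in their green lists.

```python
# Toy red-list/green-list watermark, assuming a 50k-token vocabulary.
import torch

VOCAB = 50_000
GAMMA = 0.5      # fraction of the vocabulary placed in the green list
DELTA = 4.0      # logit boost given to green tokens during generation

def green_mask(prev_token: int) -> torch.Tensor:
    """Pseudo-random green/red split of the vocabulary, seeded by the previous token."""
    g = torch.Generator().manual_seed(prev_token)
    return torch.rand(VOCAB, generator=g) < GAMMA

def watermarked_sample(logits: torch.Tensor, prev_token: int) -> int:
    """Sample the next token with a soft bias toward the green list."""
    boosted = logits + DELTA * green_mask(prev_token)
    return int(torch.multinomial(torch.softmax(boosted, dim=-1), 1))

def green_fraction(tokens: list[int]) -> float:
    """Detection statistic: fraction of tokens that land in their green list."""
    hits = sum(green_mask(prev)[tok].item() for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

# Human text should sit near GAMMA (about 50% green); watermarked text sits much
# higher, so a simple threshold (a z-test in the real scheme) separates the two.
```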

So this begs the question, all of the model providers or all of the model creators need to have this inside of their models for it to be valuable, right? And you also kind of all have to be on the same page on how you're doing it so that then when you have some kind of watermarking detector, it can detect if it's AI-generated content. I guess you could take...

all the different ones, each flavor of model has its own flavor of watermarking, and then the detector could just have all of these different ways that it can detect it. But you need to start at the model level. And if the model providers aren't doing this, then you can't detect it. That's a really great observation. That's one of the major limitations which watermarking methods have right now. I'm glad you pointed it out. So,

If OpenAI has its model watermarked and Gemini does not, or vice versa, it does not make much sense, because an attacker, if they want to use a model, would just go for whichever non-watermarked AI model exists out there. And the other main concern, which I think no one is talking about right now, is that we already have open-source models released which are not watermarked.

So take Llama 3.2, the larger models. I can always download them and keep them on my hard disk. They're pretty good at doing whatever I want right now, for scamming and things like that, or even plagiarism. I can always use them for writing AI text, at least at the quality we consider good right now.

Maybe, I mean, at least on par with humans. So it's also crazy that we are still trying to do watermarking. Okay, I understand in the future it might be more powerful; watermarking might be something we want. But to some extent, we have done the damage already. Someone who wants to do damage in the future could still do a lot of automation with the open-source models which are already out there and which are not watermarked.

Right. And also, if all these AI companies come up with different watermarking schemes, it might be harder for detection in the future, like you mentioned.

So they all always have to be on the same page, with someone regulating how to watermark and how to do detection and things like that. Because say in the future there are like 10,000 AI companies and you don't know what is coming from where. You might be spending a lot of compute on detecting which text came from where and establishing the provenance of the AI text, which

might be hard. Yeah. So there are like technical limitations to this problem, which we haven't addressed yet. Yeah, the genie's already out of the bottle. So what are you going to do on that? It does feel like, especially with the SEO generated content from AI models, that you can see, I don't know if Google...

does or doesn't punish you if you throw up a bunch of different blog posts at one time and you're just churning out AI generated content. I think I read somewhere on their SEO update that if it is valuable to the end user, then you're not going to take a hit on your SEO score. But you have to imagine that there is a world where they're looking and they're seeing, hey, this is

90% AI generated. If they can figure that out, they would want to. But right now it's almost like, yeah, maybe they can figure it out a little bit. As you're saying that if they watermarked things and if everyone was on the same page with watermarking, then it would be useful or we could potentially see that. But at this point in time, that's not happening. Yeah, yeah, that's true.

So then, what else about watermarking before we move on to red teaming, my other favorite topic? Yeah. So I think I was coming to the attack which we were doing in the paper to remove watermarks. The really important thing with the text watermarks that exist right now is this dependence on the previous word which was sampled. So if I had sampled "Barack," I want to sample "Obama" next.

So I might have a random red list or green list for the next word, but I would increase the probability of sampling the word "Obama" so that it gets sampled. That's one thing, so that the text quality is not affected. And the other thing is that the green list and red list are now dynamic. They change with every word depending on the previous word, which is basically the seed to the random number generator for partitioning the vocabulary into the red list and green list.

So the way watermarking works is basically this. And if you want to attack it, just think about it: if I change one word in the middle, it might affect the red list and green list of the next word, right? Because that next word might now end up in the red list instead of the green list. Oh, yeah. But the problem is, if I change just one word, it only changes the red and green lists for the next word. So if I need to substantially change the number of green words in the passage, I essentially have to make a lot of edits.

But what if I rearrange these words? Then the structure, the ordering, is completely disturbed, and the red-list/green-list assignments are essentially random now.

So that's what happens when you rewrite a sentence. You generally don't try to preserve the exact ordering of the words, but you write them so that it's essentially rewritten. So if I have, say, a longer sentence, I can probably swap the sentence A and B. I could probably say B and A. Even within A, I could change the grammar maybe, or I could change active voice to passive voice and so on, or even synonyms and things like that.

So the attack which we look at is using an AI model itself to paraphrase the AI text. Ideally you would want the paraphraser's output to also be detected as AI text, because it is again AI, not a human. But what we observed was that if you use a paraphraser model to paraphrase the AI text, the output you get is mostly detected as human text by a lot of these detection techniques.

But one of the most robust techniques is watermarking. It is still somewhat robust to paraphrasing, because current paraphrasers are not trained to do this. If we give really good manual prompts to the paraphraser, we could actually do it in one shot. But what we ended up showing was something called recursive paraphrasing, where the AI text which you paraphrased once is given back to the paraphraser to paraphrase once more.

So it will be paraphrased twice. And then you can keep doing this multiple times, whatever you want, based on the strength of the watermarking. So we find that just after two rounds of paraphrasing, recursive paraphrasing, the watermarking scheme's accuracy goes below 50%. So we just see that two rounds of paraphrasing is enough for breaking the watermarking algorithm.
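As a rough sketch of that recursive paraphrasing loop: the model below (flan-t5-base, prompted to paraphrase) is only a convenient stand-in, since the paper used a dedicated paraphraser model, and stronger paraphrasers remove watermarks more reliably.

```python
# Hedged sketch of recursive paraphrasing as a watermark-removal attack.
from transformers import pipeline

paraphrase = pipeline("text2text-generation", model="google/flan-t5-base")

def recursive_paraphrase(text: str, rounds: int = 2) -> str:
    # Two rounds were enough in the paper to push watermark detection
    # accuracy below 50%; more rounds trade quality for evasion.
    for _ in range(rounds):
        prompt = f"Paraphrase the following text: {text}"
        text = paraphrase(prompt, max_new_tokens=256)[0]["generated_text"]
    return text

print(recursive_paraphrase("The dog was playing in the garden all afternoon."))
```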

And this is what we essentially show in our theory too. The theory goes like this. Suppose you have a distribution of text, which is basically AI text, plus human text, which is a subset of that, right? It can be anything. Given a passage A, I can also look at another set B, which is essentially all the passages similar to A in meaning.

So even if I take something from B, I'm okay to replace A with that, right? But the problem for watermarking is: in the set B, suppose there are 100 passages which I could replace the passage A with. For a watermarking scheme, I can't say that out of all 100, 50 of them are watermarked, because if that's the case, it's likely that a human writing a passage with that meaning would land on a watermarked label 50% of the time.

So I have to make sure that the likelihood of a human writing a passage with similar meaning being labeled watermarked is low. For that, out of the 100, I have to say, okay, only one or two of the passages are labeled as watermarked and the others are not. Even then, the false positive rate, which is a human writing a text and it being detected as AI text, is 1% or 2%, which is quite high. But okay, let's be lenient on that. Just say one or two

of the 100 texts are labeled as watermarked AI text. But now the problem is, it's easy. I have the first passage labeled as watermarked. What if I use a paraphraser which is really good, and I hop from that first text to one of the 100 texts at random? Very likely I'll land on a text which is not watermarked, because the watermarking had to be designed in such a way that most of the similar texts are not labeled as watermarked.

So if you look at this, this is a trade-off. If I try to increase the watermark's strength against paraphrasing, I have to increase this number from 1 out of 100 labeled as watermarked to, say, 10 out of 100, or 50 out of 100. But if I do that, I'll end up getting humans falsely accused of plagiarism

with a higher chance. So it's essentially a trade-off between type 1 and type 2 errors if we use these kinds of detection systems. This is what our theory shows. Our theory says that even for the best detector that can exist out there, we upper bound the detection performance using the distance between the distributions, which is jargon and we don't need to go into that.

But essentially, for the best detector out there, and we are not claiming that's watermarking, even something better than watermarking, the best detector that can theoretically exist is upper bounded by a quantity which we characterize in our paper, and that bound is still not 100% reliable.

And the performance of that still has a trade-off with respect to true positive rates and false positive rates, which is the type 1 and type 2 errors. So you have to go down on one of the errors to make it better on the other error. So that's the major highlight or the major results which we show in our paper.
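For reference, the bound in the paper has roughly this form (my reconstruction from "Can AI-Generated Text Be Reliably Detected?"; the exact statement and notation may differ), where M is the distribution of AI-generated text, H the distribution of human text, and TV their total variation distance:

```latex
% Hedged reconstruction of the detection bound; see the paper for the exact statement.
\[
\mathrm{AUROC}(D) \;\le\; \tfrac{1}{2}
  + \mathrm{TV}(\mathcal{M},\mathcal{H})
  - \tfrac{1}{2}\,\mathrm{TV}(\mathcal{M},\mathcal{H})^{2}
  \qquad \text{for any detector } D.
\]
```

So when the two distributions are close, the total variation distance is small, and even the best possible detector is barely better than a coin flip.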

So following up on that, paraphrasing is something which is really considered a good attack right now. And a lot of the leading text detection tools like Turnitin have been using methods to deal with it. What they are doing, as they mentioned in a recent blog post, is to use a paraphrase detection tool, which tells you whether a text was paraphrased or not. If it was paraphrased, you can flag it as AI text.

And the problem is, if you do that, you end up hurting your false positive rate. Yeah. Well, yeah, that's exactly what it sounds like: you have this upper and lower bound, or the right and the left side of the spectrum, and the more you go to the right side of the spectrum, the more you're going to get these false positives. Yes, exactly. You can't really win no matter what you try and do. Exactly. Yeah.

So if more money is in flagging innocent kids as having plagiarized, okay, you can choose to make money like that. As an AI detection tool, I would probably want to say, okay, I caught more students plagiarizing; maybe that's more profitable for them. But it can actually end up giving them a bad reputation for falsely accusing students. So it's a choice they need to make.

But the real question is, do we really want to use this for such strict plagiarism detection? Because I think, as we go ahead and as these technologies come up, we have to find a way to use them collaboratively in our work, because they improve our productivity. Instead of seeing them as replacing us, I believe they improve our productivity, and we have to learn to use them as a tool, instead of focusing completely on plagiarism. So, yeah.

When you were talking there for a second, you mentioned how you feel like it's becoming more and more difficult to make these models write more like humans, and that they're almost being trained, or red teamed, to not write like a human and to have their own distinct AI voice. Yeah, yeah. So I think this is just speculation. I'm not sure what actually goes into the training of these models. But yeah, from what it looks like, I think there have been recent steps taken by these AI giants to make the models easier to detect. It could be either that, or it could be the AI detection companies doing a better job at making their detectors. It could be either one of them. But

I also feel that at some point in time, when watermarking was introduced, some of these models' text quality actually went down a little bit. While I was looking at comments on Twitter, I saw people saying, is it just me, or is everyone else finding that ChatGPT's text quality has gone down a little bit? I'm not sure if it's because of these kinds of training added on top of it. It could be that.

It could be some other safety alignments which they have added on top of the model which actually trades off on the performance.

Yeah, it's a side effect. It's a side effect of it. So there's a trade-off in everything. If you try to make detection better, you have to trade off on the text quality, or on the type 1 or type 2 errors of detection, and things like that. But I think, yeah, my speculation is that models could have been fine-tuned to make detection more possible, because recently governments have been pushing these tech giants to have watermarking embedded in their models.

DeepMind recently published their watermarking paper in Nature, and Meta has been all in on image watermarking and things like that. OpenAI, I'm not sure what their scenario is, but I think all these companies are being pushed to have watermarking, so potentially they're doing something which helps detection.

But these are some of the trends which I've been noticing, which again is speculation. I think one common feature of all these models has been that they have been trained to have really good text quality. So that could be another side effect of it. They use very poetic words and poetic devices, ornamental words and things like that, to make the text look very nice.

Earlier, if we gave some instructions, it was very easy, I think, in my experience. So I think that's because of the training these models have been going through, to make detection far easier. Recently, if I try it, I just say, okay, make it sound like how Donald Trump talks, and the text which comes off the model is still detected as AI text, which was not the case when I had checked a year or so back.

So yeah, potentially it's either the models being trained to do that or the detection tools are improving with time. Okay, so let's take a turn for red teaming and just give me the lay of the land on how red teaming has changed over the years because of...

All of the models getting better. I think there's a whole lot more people that are red teaming models, whether they are getting paid to red team them or not. Everybody loves to think about or loves to be able to say, I got ChatGPT or insert your favorite model to say this or to do this. It's almost like a badge of honor that we can wear on the internet. So what have you seen over the years and how has it differed? Yeah.

So I think it's a very recent development which we've been seeing in AI, people trying to jailbreak, people trying to do red teaming. I'd describe it as having started off with people and then moved to automated methods, because the effort people were putting in was all fairly similar: just manual attempts, trial and error. Some insights which you get from the model, the feedback you get, you put back into the prompt, and it's an iterative process.

So it started off like that. It was called Do Anything Now, DAN. There was a page where people were compiling different ways to jailbreak these models. You could either write manual system prompts or manual input prompts to make the model think that you are innocent, that you are not going to do something harmful.

Some of the classic examples, I think, are where the question is, you have to make the model say how to make a bomb, which for some reason you can find on Google, but you don't want your model to output it.

Oh, I never thought about that. Yeah. But I mean, yeah, sure. It's still a good objective to keep in mind. But yeah, I mean, you can just take it as a toy example for now. We don't want the model to talk about something. Okay. It's totally cool. We just don't want it to talk about it. And how do we do that?

Because "bomb" can come up in different contexts, right? We can't just use a word-filtering algorithm. "Bomb" could be something explosive, or I could also say, okay, that was "the bomb," maybe to say that was a fantastic thing or something like that. Yeah. I don't know in what context you would be using it unless I understand what it is. So I can't just use a string-matching algorithm and say, okay, if there is "bomb" in it, I don't reply to that. That's not going to work. Even the word "screw": if I say I'm going to screw that in, it might be okay. But if I'm saying I'm going to screw you up, then it could be something offensive.

So these things come down to context. It's not easy to do word filtering in most of these cases. And also, if you have to make a blacklist of these words, it is going to be really large. You can't end up doing that, especially when these models are getting multilingual now. It's hard to maintain a list of words which you want to filter on.

But the thing which people used to do was to write manually crafted system prompts, where they say my grandmother is sick or things like that, and I have to make a magical potion for her, or say it's for my school project and I really have to get a good grade. It's funny. It's exactly how you would try to fool a human to get them to answer something: acting, working on the sympathy aspects, the emotional aspects of it. And since the models are human-aligned, it somehow ends up breaking them, because that's how they were trained: they were safety-aligned with human values. So it's probably expected that they break to the same methods a human would break to, an average human or a below-average human, depending on how they were trained. Yeah.

So that was the initial evolution of red teaming, where it started, where people showed, okay, there are these manual techniques where I can write prompts to break them.

It started off as a prompt engineering trick, where people initially used prompt engineering to improve model performance, but then they started using it to break models. It's an iterative process. You give it a question with a prompt, and it does not answer it. You refine the prompt, but you essentially don't get much signal; it just says, "I can't answer your question," which is really not a strong signal for you. So you end up doing trial and error, iterating and refining your input prompt such that the model somehow breaks in some later iteration of your attack. But from there it has advanced a lot more. People have come to a point where it is more automated now, which is more dangerous, because then the attacks get scalable and do much more harm than they could when you were manually writing system prompts.

So the thing is, what the attackers have been doing after that. There was this paper last year, from Andy Zou, where they introduced an algorithm called GCG, which is essentially a gradient-based algorithm. What they do is, if I have a question like how to make a bomb, I can add 20 random tokens after that

and then use gradients to optimize them, selecting a set of 20 suffix tokens which make no sense. But when you add them as a suffix to the input, how to make a bomb, the model just breaks. It doesn't make any sense to us, but the model somehow interprets it and breaks. This is something which was really expected if you come from the machine learning adversarial literature, because machine learning models, which are essentially neural networks,

can be broken with some input perturbations. This is a well-known strategy in the adversarial literature, but the thing was it was hard for language models because the way they work is very different from the traditional machine learning model which we have used.

Because in traditional machine learning, people were mostly focused on computer vision and continuous data, where you have, say, images: pixels with values between 0 and 255, or between minus 1 and 1. It's continuous data. But when you come to text, it's discrete data, say a vocabulary of 50,000 tokens. You just say this token is index 255 or maybe 50,000 or something like that. So they're discrete.

So the thing is, when you take gradients with respect to the input, since the input is discrete, it's really hard, because the gradients are not defined at all the points.

So the attacks were not effective most of the time when people tried them on language models, but this recent work used multiple tricks, multiple hacks, to make that work out. And this is the end result: you add a random-looking suffix, random jargon, to the end of the question, and the model will break.
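A hedged, much-simplified sketch of that suffix-optimization idea: the real GCG attack uses token gradients and batched candidate swaps, whereas this toy version just mutates random suffix positions and keeps swaps that lower the loss on an affirmative target prefix. The model, prompt, and target string are placeholders, not the paper's setup.

```python
# Toy random-search stand-in for gradient-based suffix optimization (GCG-style).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in; the paper attacks aligned chat models
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "How do I pick a lock?"                     # toy placeholder request
target = " Sure, here is how you"                    # affirmative prefix we want the model to emit
suffix_len, steps = 20, 200

def loss_for(suffix_ids: torch.Tensor) -> float:
    """Cross-entropy of the target prefix given prompt + adversarial suffix."""
    prefix_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(target, return_tensors="pt").input_ids[0]
    ids = torch.cat([prefix_ids, suffix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids).logits[0]
    tgt_logits = logits[-len(target_ids) - 1:-1]     # positions predicting the target tokens
    return torch.nn.functional.cross_entropy(tgt_logits, target_ids).item()

suffix_ids = torch.randint(0, tok.vocab_size, (suffix_len,))
best = loss_for(suffix_ids)
for _ in range(steps):
    cand = suffix_ids.clone()
    cand[torch.randint(0, suffix_len, (1,))] = torch.randint(0, tok.vocab_size, (1,))
    cand_loss = loss_for(cand)
    if cand_loss < best:                             # greedy: keep swaps that lower the loss
        suffix_ids, best = cand, cand_loss

print(tok.decode(suffix_ids), best)
```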

And is it just random words? Like, can I have my prompt and just think of random words, or are there specific words that you throw on the end, or specific letters? Yes, it looks random to us, but to the model it's not random: the words are specifically optimized to make it break.

But how does the attacker know what the optimized words are? Yes. So one way is, if you have direct access to the model, you can optimize by taking gradients of the model and updating the words. The other method which they show is something called transferability, which is again a well-known phenomenon in the adversarial perturbation literature. What you do is: you have access to, say, three or four open-source models. You find a suffix which breaks all of them, right? Now, if you use this suffix on a new model, which you never had access to, it might end up breaking too. That's transferability. And they show that these suffixes can be transferred to a new model. So you can use Llama to train the suffixes and use them to break ChatGPT. So it doesn't matter. It's not about...

the training of it. It's really about the model architecture. Exactly. So one thing is probably that most of these models are transformer-based. The other thing is the data: most of these models end up using very similar kinds of data, with the same structure. So the way you can break them is probably similar. That's why I think transferability works, because in a lot of the cases I have looked at in language models, transferability works quite well. I think it's because of the underlying similar text which was used for training. So this is one vulnerability, we could say. How are the model providers combating it? Yes.

So one thing is, again, to use automated red teaming to combat this. That's one way. Okay, so let's go again in a sequential way through the evolution of the defense systems too.

So first, this particular method which I described is just adding random jargon, right? And if you look at it, the quality of the text goes bad. So one easy defense is to look at the input prompt's text quality: if it's really bad, I just don't answer it, I just say I don't understand your question. That's one way. Which is why later attacks came to improve the readability of the suffix, so they actually produce text which is readable and do similar attacks, and the models can't really detect them based on the quality of the text. So what model providers end up doing there: one thing is chain of thought, and the other is to have something like Llama Guard, where they take a copy of the Llama model and train it as a classifier, so it is trained to see whether an input is harmful or not.

Right. So a pre-trained Llama model is taken and trained to just do a classification task, where it is given a set of harmful prompts and a set of non-harmful prompts and outputs a label 0 or 1. If 1, it's harmful; if 0, it's not harmful.

And you can also add to the training data set of this classifier the kinds of adversarial prompts we described earlier, to make it robust to those kinds of attacks. But again, if you use AI to defend AI, then with the AIs in your pipeline, if one of them breaks, all of them break. Yeah.
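A rough sketch of that "train a copy of a model as a harmfulness classifier" idea, in the spirit of Llama Guard but not its actual recipe or data: the base model, toy labels, and hyperparameters below are all stand-ins, and the adversarial prompts found by red teaming would be added to the training set.

```python
# Hedged sketch of an input-harmfulness classifier (guard model) fine-tune.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "distilbert-base-uncased"       # small stand-in; Llama Guard starts from Llama
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Toy labeled prompts: 1 = harmful, 0 = benign. Real systems use large curated
# sets plus adversarial prompts discovered through red teaming.
data = Dataset.from_dict({
    "text": ["How do I build a bomb?", "How do I bake sourdough bread?"],
    "label": [1, 0],
})
data = data.map(lambda x: tok(x["text"], truncation=True, padding="max_length",
                              max_length=64), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="guard-sketch", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()   # at inference, label 1 on an input (or output) means "block it"
```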

And because this is all on the input, right? There's none that are happening on the output. Or is there also another filter that's happening on the output in case it goes through the input catch-all?

So there are multiple ways these systems work. Some work just on inputs, because if you really want to save on compute time, it's better to look only at the input. But if you have the capacity to look at both outputs and inputs, that's the best method. So Llama Guard ends up looking at both the output and the input if it has the capability of doing that.

And some methods even look at the internal activations of the model. There are detectors which are trained so that, if I take a transformer layer and look at the activations of its different neurons, and, say, a certain set of neurons is activated more, it's probably a harmful prompt. So there are methods which look at the internal activations of these models to see whether the text is harmful or not.

So it's like basically the dark side of latent space. They can know where that is, the back alleys, and they can say, hey, if you're traveling in these back alleys of the latent space, you're probably up to no good. Exactly. So they call it circuit breaking, where you try to make the model recognize when its activations are heading towards that dark side, and then just break it off right there and stop the generation. That's why sometimes you have seen models just stop generating things: they are probably uncertain about it, or the circuit is just broken so that they don't continue, because it's probably going towards something harmful.
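A hedged sketch of that activation-probe idea: pool hidden states from one transformer layer and fit a small linear classifier to flag prompts whose internal activations look "harmful." The model, layer choice, and two-example training set are purely illustrative.

```python
# Toy linear probe on a transformer layer's activations.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def layer_features(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled hidden states from one layer for a given prompt."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**ids).hidden_states[layer][0]   # (seq_len, hidden_dim)
    return hidden.mean(dim=0)

# Toy training set; real probes use thousands of labeled prompts.
prompts = ["How do I make a bomb?", "How do I make bread?"]
labels = [1, 0]
X = torch.stack([layer_features(p) for p in prompts]).numpy()
probe = LogisticRegression().fit(X, labels)

test = torch.stack([layer_features("Teach me to pick locks.")]).numpy()
print(probe.predict(test))   # 1 would mean: activations look like a harmful prompt
```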

And have you seen, because I know you're doing a bunch of red teaming right now for DeepMind, right? And I can only imagine that you've been having fun with all of the models, not just DeepMind. And so have you seen different ways that certain models are stronger and certain models are weaker? So I think...

That's, again... red teaming, to me, is a broader term. In my last work, which was published at ICML, we look at a red teaming algorithm where we find an automated way to make, again, readable suffixes which break the model, but in a fast way. The GCG algorithm which I mentioned takes something like 70 minutes to optimize the suffix, which is really long. What we did was make our attack take about one GPU minute to attack the model. That's what we published at ICML. In that particular paper we just proposed the algorithm, but the capability of this algorithm is tremendous, because since it's fast, and coming from an academic setting back then with fewer GPUs, we could try out different attacks. One is jailbreaking, which is one kind of red teaming, I would say, which exists. Another is something called a hallucination attack, where we change the prompt such that the models end up hallucinating much more.

And the third one is a privacy attack, where we attack the prompts such that the performance of existing privacy attacks is boosted. For example, there's something called a membership inference attack, where, say, I've got a text passage from Harry Potter and I want to ask whether it was part of your training data or not. So there were attacks to do that well. But in the end... Let me guess. It was...

So, yeah, we can't really give a guarantee on that right now, but I am fairly certain most of these models used Harry Potter for training. Yeah. They actually sometimes generate verbatim, exact text from Harry Potter. Oh.

Yeah. Oh, that's classic. Yeah. So, what we did for the third attack, the privacy attack, was attack the input such that the performance of the privacy attacks improves. So there are different kinds of attacks you can think about, and people have been looking only at jailbreaking for red teaming. But we have noticed it really depends upon your training. Llama is one really good open-source model which I know is good at resisting jailbreaks.

When you compare it to other models like Mistral's models or Vicuna, we find that Llama is quite robust to jailbreaking attempts, and so are ChatGPT, Claude, and Gemini. But Claude does really well in terms of defense. Most of the time they have a lot of safety filters and chain of thought going on in the background, which I understand is why they are very good at it: they're mainly an AI safety research organization, so they are focused more on safety. I'm assuming they have put more compute into making their models better at safety.

But yeah, coming back to open-source models, Llama has been really good at resisting jailbreaking. That said, we found that it's easier to jailbreak Llama when you ask it to generate fake news. So Llama has a vulnerability there: it's easier to make it generate fake news compared to other models.

And one thing to note here is that this is just for jailbreaking attacks, and all these models have been fine-tuned to be resistant to jailbreaking attacks. But when we move to hallucination attacks, we find that Llama is equally breakable compared to all the other models.

So the hallucination attack is essentially: we add a suffix and the model ends up saying, okay, eating watermelon seeds is dangerous for you, you might even end up dying from eating watermelon seeds. Or, you might end up dying if you walk into a wardrobe. Things like that. We have actual examples where Llama ends up doing this after an attack.

And it's crazy how it works, because it really depends upon how you fine-tuned your model. If you forgot to fine-tune your model to be robust to hallucination attacks, your chances are gone. Once you deploy the model, it's out there. People can attack it to make it hallucinate more and put misinformation out there.

So yeah, it really depends on your training, how your training is done. So when comparing these open source models to the production models, which are already out there and not open source, closed models out there, I think...

They have been extensively fine-tuned to be robust to these kinds of techniques. And also, in an academic setting at least, we actually let them know this attack exists and that we are going to publish it, so they get time to adapt to it if it's a really important attack. And also these companies have these programs, you were asking earlier whether people are paid to attack or not, where they have a bounty program: you red team them, find vulnerabilities, and report them, and the models are actually trained to get better on those. So they are doing good red teaming research to stay well ahead in the game before someone outside their organization breaks things.

But I think, as this grows, like how the open-source community grew, if the red teaming community grows very large and outgrows the teams that the companies have, it might be harder for companies to keep up with the red teaming approaches that will exist. We could make it a little harder for the attackers by adding the kinds of defenses these companies use right now. But yeah, again, it's the same problem which existed in detection. I believe it's not easy to have a complete solution for jailbreaking, because if you look at it, the definition of jailbreaking itself is not fully clear to us.

What are the kinds of questions the model is not supposed to answer? If we do not know that, we don't know how to train the model for it. That's the fundamental problem we are looking at. If we can't define the problem, how do we find a well-defined solution? We need to define the problem first, and it is very ambiguous,

because the context changes and the scope of harmful questions changes and things like that.