You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere. Welcome to another installment of Data Skeptic: Graphs and Networks. Asaf, as you were pointing out, it's a double feature. We've got two guests, two topics.
We're going to talk a little bit about their work on detecting shadow banning on Twitter. But we start the whole conversation off with something a little more timely and topical, and that's the intersection of LLMs and graphs. In particular, they asked the LLMs to remember, or spit out, the Karate Club graph, and then looked at the degree to which the LLM would hallucinate that graph.
for them? I had to check it myself. So I asked ChatGPT to spit out the edge list of the Zachary's Karate Club data set, and actually, it missed two edges. Hey, that's not bad. Yeah, well, I was a bit surprised because it's such a classical data set, and the data set is, of course, public and well known.
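For listeners who want to repeat Asaf's check, here is a minimal sketch: load the ground-truth Zachary's Karate Club graph from NetworkX and diff it against whatever edge list the model returns. The llm_edges list below is a placeholder, not a real model answer.

```python
import networkx as nx

# Ground truth: Zachary's Karate Club (34 nodes, 78 edges)
truth = nx.karate_club_graph()

# Placeholder: paste the edge list the LLM actually returned here
llm_edges = [(0, 1), (0, 2), (1, 2)]

answer = nx.Graph()
answer.add_edges_from(llm_edges)

# Compare edges as unordered pairs so (1, 0) and (0, 1) count as the same edge
true_set = {frozenset(e) for e in truth.edges()}
ans_set = {frozenset(e) for e in answer.edges()}
print(f"missing edges: {len(true_set - ans_set)}, invented edges: {len(ans_set - true_set)}")
```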
And so it's a really interesting question. Why did it hallucinate? I guess my defense of the large language model is it wasn't built for graphs. It was built for next token prediction and dealing with text. It's almost miraculous that in that process, the transformer architecture is able to figure out graphs at all.
So let's say I asked you on the spot, no preparation time, no notes to reconstruct from memory your best recollection of what the Karate Club data set looks like.
you would probably get about the right number of nodes and edges, and you would probably also remember that studies of it show it separates into two communities. So you'd probably draw something that had two communities. You would recapture these properties from memory that are faithful to what the data set is, rather than just a list of random edges. Yes. Actually, if we're talking about the Zachary's Karate Club data set,
I sort of memorized it with my kids. We built a network, the Karate Club, using spaghetti.
And I took a picture of it for a network science competition at a network science conference. And actually, it got first place. Yeah, my kids have a natural way of doing a layout with spaghetti. So I memorized it pretty well. But I think the way I think about the graph and the way LLMs work is totally different. Sure. It's...
Extra challenging because they're confined in the way that they're thinking. You know, the transformer architecture is predicting the next token, whereas I feel like I'm doing something fancier when I think about a network in its totality. But actually, if it encounters this data set more and more, it will probably get better at reproducing it, right? So I don't see it as a graph problem so much as a
memorizing problem, a statistical problem. Well, maybe their more central network result is their work on shadow banning. But they were looking at Twitter back when, I guess, data was more readily available. And at the time, yeah, Twitter was saying there is no shadow banning, but the data said otherwise. What's interesting is how they discovered who was shadow banned. Because I think they got like 5 million users,
And they had to check each one of them if it was shadow banned. It's very interesting how they did it. Yeah, there's a couple of steps we'll get into in the interview of how they found ground truth. It wasn't just something silly like saying, oh, I didn't get enough retweets. I must be shadow banned. It was more concrete than that.
Because I feel shadow banned sometimes. I feel shadow banned on just about every post somehow. Off the record, it was interesting because they said they talked about the method, but they didn't say how they reproduced it like five million times. It sounds...
like a lot of research. I think they downloaded a lot of data. Back then, the API was more open. Another point worth talking about was network science and network models for epidemiology. Is shadow banning kind of like a viral disease spreading through the network?
I think it's important to remember that network science is behind the different models that were used during COVID-19, right? Everybody has heard of them, like SI, SIR, and so on. But not everybody knows that we have network science to thank for them. The fact that our guests use these models to study network dynamics
shows you that you can use them not only to study actual diseases, but also to study online social networks, or virality at large, on different datasets. Yeah, you may get wildly different parameters, but all the methodologies seem pretty universal. That's what network science is all about. Let's jump right into the interview. Erwan Le Merrer from Inria Rennes in France,
and Gilles Trédan from CNRS in Toulouse, also in France. And before we get into some of the work we're going to discuss today, can you talk a little bit about how your academic collaboration began?
So we were doing our PhDs together quite a long time ago, back in the old days in Rennes. We started collaborating then and we've never stopped, somehow. So that's why. Yes, we started in a team which was interested in distributed systems. And for distributed systems, there's often a way of seeing things through graphs, because of course you have agents or peers collaborating together, and graphs are a good abstraction to consider this. So we were interested in that, and I remember our first collaboration was introducing centralities into computer science, because before that it was mostly physicists working on that. We noticed that in this distributed systems world, graph centralities had not yet arrived at the time, so our first collaboration was on that. It was quite successful and we continued on these topics. And what are nodes and edges when you apply graphs to distributed systems?
So a distributed system is defined in opposition to the client-server model, where clients come and a single server holds everything. In peer-to-peer, every computer is created somehow equal and none has higher power.
The thing is, you might imagine systems with thousands or millions of such peers, and what you have to organize is the collaboration pattern between those peers so that the overall system achieves the greater goal you gave it. Think about peer-to-peer downloading, for example. Who connects with whom to enable the data sharing, for example, or the downloading? Did you face any scalability challenges? Millions of nodes sounds like a lot to handle.
Yeah, probably. Experimentally, of course, this cannot be reproduced at that scale, but the idea at the time was to come up with models and proofs that at least show that the algorithm we were proposing can scale.
The funny thing is that our centrality, which we coined at the time the second order centrality, was one of the first centralities you can compute in a distributed fashion out of what we call random walks. A random walk is a process that jumps from one node to a neighboring node in a graph. And out of that, we proposed the centrality.
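As a rough, centralized illustration (not the distributed protocol from their paper), the idea can be approximated by running one long random walk and looking, for each node, at the spread of the times between its successive visits; NetworkX also ships an analytical implementation as nx.second_order_centrality.

```python
import random
import statistics
import networkx as nx

def second_order_centrality_sim(graph, steps=100_000, seed=0):
    """Approximate second order centrality by simulation: run one long
    unbiased random walk and record, for each node, the standard deviation
    of the times between its successive visits (lower = more central)."""
    rng = random.Random(seed)
    current = rng.choice(list(graph.nodes()))
    last_visit = {}
    return_times = {n: [] for n in graph.nodes()}
    for t in range(steps):
        if current in last_visit:
            return_times[current].append(t - last_visit[current])
        last_visit[current] = t
        current = rng.choice(list(graph.neighbors(current)))
    return {n: statistics.pstdev(ts) if len(ts) > 1 else float("inf")
            for n, ts in return_times.items()}

# Example on the Karate Club graph (centralized simulation, not the distributed protocol)
scores = second_order_centrality_sim(nx.karate_club_graph())
print(sorted(scores, key=scores.get)[:5])  # five most central nodes under this measure
```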
But the funny thing is that this was not really scalable, because I think we were in N to the power of three for the scalability. So it was fine for small graphs, not for that big graphs, unless you want to wait a long time. So probably not millions of nodes, but it was still impactful because it was one of the first
to say that we can do things in a distributed fashion. And it also brought some interesting measures, we think, compared to the other centralities. Probably later there were some refinements of it, but we stopped working on that. So scalability is a big issue. When you're in a lab and you want to show it, generally you try to prove it analytically, and then maybe you run hundreds of nodes, but hardly a large amount.
Well, the name of the team was ASAP, an acronym for As Scalable As Possible. So you're right, it was definitely a central question. Well, shortly I'll start getting into the main paper I invited you on to discuss, "LLMs Hallucinate Graphs Too: a Structural Perspective." But before we jump right in, I'm curious: you began working on graphs in distributed systems. Large language models seem a bit distant. Were there any other interesting highlights along the way where you applied graphs?
Yes, exactly. That was on the path because we at some point decided to get an outside view on algorithms executing remotely. So for instance, you're a client and you're connecting to a server, so that's absolutely not distributed. But then we were interested to see what we can learn or infer from algorithms executing on a third-party machine.
And one of our first works on the topic was sort of what you would have called an algorithm for doing peer ranking; now you would call everything a machine learning model or an AI. And we also applied a graph way of thinking here to start solving the problem we had at the time. So this came naturally. Then we were interested in recommendation algorithms, like the YouTube one online, where we also used graphs to try to answer several questions.
Then also Twitter and shadow banning; maybe we'll come to that later, with graphs too. And then, of course, classifiers, because two years ago maybe that was the main topic, I mean the trend, in machine learning. Now LLMs are arriving, and we decided to work on what we call auditing these systems, meaning you're facing this third party and we try to design algorithms, or show what you can or cannot do as an auditor, like
you have a regulatory entity that decides it wants to understand more about what the third party is doing with its remote model or remote AI. And you run some algorithm and try to exfiltrate some information to measure, for instance, bias,
or infer things about what is executed remotely. So that was a smooth path from distributed systems to, finally, very centralized systems, but systems we're trying to audit. So when you operate a distributed system, or you interact with one, you have no way of seeing its complete state.
So you need to design algorithms that would somehow find ways to explore or to measure the state of this global system. And this applies to peer-to-peer networks, but also, more generally, to networks. And those networks are the product of algorithms that somehow have each router or each node taking decisions.
But at the end of the day, no one knows which state it is in, yet it has important implications. So we started by designing, let's say, measures for our systems, to measure the state of the networks. And then it went on to designing algorithms to measure the state of other algorithms, let's say. And that's how we moved gradually toward transparency, I think.
Well, I think most listeners will already be aware that large language models hallucinate. What makes graphs a specific aspect of hallucination? Why is it particularly interesting or worthy of study?
Since we had this graph mining background, and we are also interested in auditing, and we are interested in the trendy topic of LLMs, we were really trying to see how we could bring graphs, or a graph way of thinking, to the problem of hallucinations. And so we scratched our heads, and
we thought: people ask LLMs anything these days, but apparently nobody really asks them for graphs, and especially not for known graphs. So we started to say, what about asking the LLM? The prompt was...
Give me the famous, for instance, Karate Club graph, one of the most famous graphs in the field. Because with this question, we have a background: as we work on graph mining and network science, this graph is known to us. And there are other graphs that are less known, too. The idea was to try to exfiltrate data from the LLM and see what it would answer.
So we knew they would hallucinate. The question is, what does this hallucination reveal? How can we exploit this hallucination to understand the internals of this LLM?
Yes, and the answer was not obvious, because, of course, if the LLM were trained on some, I don't know, personal data, like addresses of some people, for instance, we now know that they refuse to answer because it's personal data. So we thought, okay, they probably saw these graphs, because we know that they ate most of the internet and the clean data available at the time. So the question was,
would they respond nicely, or at least correctly, or hallucinate a little bit, or hallucinate completely, or not respond at all? And this is information in itself, because we interrogated 21 LLMs online through APIs, and most of them responded in a way that makes us really think they ate
these famous graphs; otherwise, they could not have answered what they answered. Can we chat a little bit about the prompt at the heart of your experiment? I presume it's something like, output the Karate Club network in JSON format. Or what are you looking to get out of the LLM? So the prompt was really: provide me with the so-called X graph as a Python edge list, printed simply.
And then we generally get a bit of verbosity from certain LLMs first, and then they throw out the edge list as expected.
More or less accurate, as we will probably discuss. But that's simple. So we just copy-paste or scrape this answer, and then we import that with some basic functions into NetworkX. That's it; we can play with the graphs they responded with. Maybe you can also discuss a bit the difficult part of the prompt, which was to have them not answer with code. Yes, it's easy to print nx.karate_club_graph() and you will get the graph. So we wanted them to actually give the list, and not a way to obtain this list. And that was the convincing, hard-to-convince part. Yeah, exactly. For some,
you have to insist, say: don't provide me code to generate the graph, I want the edge list. There was also a fun part: some LLMs refused to respond. For instance, when you query for the Florentine families graph,
some refused to respond, saying it's private information about some family, I cannot respond. So there was also some filtering to do on the prompt if we wanted these 21 LLMs to be comparable based on their outputs. So yes, there's definitely a little bit of prompt engineering, but that's okay.
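As a sketch of the scrape-and-import step described here (the regex and the sample reply are assumptions, not the authors' exact pipeline), one might parse a verbose answer like this:

```python
import re
import networkx as nx

def parse_edge_list(llm_reply: str) -> nx.Graph:
    """Pull '(u, v)' pairs out of a possibly verbose LLM reply and build a graph.
    Assumes integer node labels; real replies may need more cleanup."""
    pairs = re.findall(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", llm_reply)
    g = nx.Graph()
    g.add_edges_from((int(u), int(v)) for u, v in pairs)
    return g

# Hypothetical reply, for illustration only
reply = "Sure! Here is the edge list: [(0, 1), (0, 2), (1, 2), (2, 3)]"
g = parse_edge_list(reply)
print(g.number_of_nodes(), g.number_of_edges())
```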
So if it were able to faithfully reproduce, essentially if the LLM memorized the Karate Club data set and successfully output it, that would be a perfect score. But I don't know that it's a binary test. Maybe even an isomorphic graph would be a perfect score, but you could miss some edges or infer too many edges. How do you measure the degree to which it was hallucinated?
So we know, for instance, that the Karate Club graph has 34 nodes and 78 edges. We did not consider the labels, because in the real one the numbering starts at zero and ends where it ends, and sometimes the LLMs swapped all the labels or started at one. So we decided to just look at the topology, without any labels. Okay, so you have two graphs. The first thing that probably comes to mind is the graph edit distance, to see how you go from the real graph, the ground-truth Karate Club graph, to the output graph, the supposedly Karate Club graph of the LLM. Instead, we took the degree sequences of both graphs and simply took the distance between these two degree sequences. Like this, we can get a ranking of which LLM is hallucinating less. For instance, for this Karate Club graph, we have a table in the paper, and the LLM that hallucinates the least is the DBRX LLM,
followed by ChatGPT 3.5 and 4.0, which gave the same response. And then in the lower part, you find, let's say, the older models, based on this metric at least.
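The exact distance they compute is specified in the paper; as a rough sketch of the idea, one could compare sorted degree sequences and rank models by that score. The answers dictionary below is hypothetical stand-in data, not results from the paper.

```python
import networkx as nx

def degree_sequence_distance(g_true, g_answer):
    """One simple hallucination measure (an assumption, not necessarily the
    paper's exact formula): L1 distance between sorted degree sequences."""
    a = sorted((d for _, d in g_true.degree()), reverse=True)
    b = sorted((d for _, d in g_answer.degree()), reverse=True)
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))

truth = nx.karate_club_graph()
# Hypothetical stand-ins for graphs parsed from the LLM replies
answers = {"model_a": nx.karate_club_graph(),
           "model_b": nx.gnm_random_graph(34, 78, seed=1)}
ranking = sorted(answers, key=lambda m: degree_sequence_distance(truth, answers[m]))
print(ranking)  # least hallucinating first
```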
We are happy that the answers are not perfect. Actually, that's not what we expect from large language models, in the sense that we don't want them to store all the information they are given. We don't simply say that if something is different, if the answer graph is not exactly a perfect copy of the original graph, it's a failure. What we are interested in is how those failures impact the general graph, let's say the general structure. What I mean by that is, for instance, the Karate Club we asked for is a central benchmark graph for community detection, where it played a pivotal role, being the graph on which most community detection methods are tested. In that sense, what we wanted to test is whether, even though the answer graph is not exactly the same as the Karate Club graph, it still captured this importance of communities, this very noticeable community structure. What we wanted to understand is whether, despite not memorizing the graph exactly, the LLM would have kept some memory of its noticeable community structure.
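One simple way to run that community check, as a sketch and not necessarily the exact test from the paper: run a standard community detection on the ground truth and on the answer graph and compare the resulting signatures.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def community_signature(graph):
    """A crude signature of community structure: how many communities a
    standard detector finds, and the modularity of that partition."""
    parts = greedy_modularity_communities(graph)
    return len(parts), modularity(graph, parts)

truth = nx.karate_club_graph()
print(community_signature(truth))           # the real graph has a clear community structure
# print(community_signature(answer_graph))  # compare with a graph parsed from an LLM reply (hypothetical)
```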
I think you raised a couple of good points. Especially for me, what resonates is the expectation that the hallucinated graph, or the output graph, would have these resemblances to the original. Like you said, we don't expect the LLM to memorize the Karate Club; it's not a database. Similarly, I know this data set. I don't have it memorized. But if you showed me a random graph, I would say that's not the Karate Club.
If you showed me something that had these roughly two communities, I might say that could be the karate club. Do you find that the hallucinated graphs meet this sniff test? Do they seem to faithfully reproduce it or are you seeing randomness? There are LLMs, which are
hallucinating very few edges, I mean. For instance, the DBRX LLM adds just two edges. So you cannot say it's random stuff: it ate that graph and it's able to reproduce it pretty correctly. GPT works pretty well too. Llama 70B works well. And then, of course, you have this whole gradient of degradation.
Maybe it's time to discuss how LLMs are judged on how much they hallucinate. Normally, in the community, you have really binary questions that are asked, or maybe with three or four answers maximum. The data sets are tens of thousands of binary questions like that, and you say this LLM is better because it answers correctly. But here, with a single question,
a single prompt, we can get a lot more information. Actually, you can get up to n-squared bits of information per prompt because of the amount of edges you can have in a given graph. So this is really the pivotal point of the idea. With a single request, you get much more information than a single binary question. Since we are facing real LLMs that are online,
we needed a ground truth to see if this graph prompting makes sense, or can be compared to other benchmarks for measuring hallucinations. And that's what we did, by comparing to an online leaderboard called the Hallucination Leaderboard. The Hallucination Leaderboard sends 50,000 binary questions to LLMs to make a ranking of the most hallucinating LLMs. And we actually throw just five. We decided to use the Graph Atlas; perhaps you're familiar with this atlas too. It is a list of many, many small graphs, and we decided to use the first five connected graphs from it. So we first query the LLM for the first connected graph, then the second one, and so on, five times. So five prompts this time, unlike the Karate Club's single prompt. Five prompts of tiny graphs, and we measure the hallucination on each of these tiny graphs.
We average the hallucination amplitude over these five, and we get a distance, a hallucination amplitude, that we call the graph atlas distance in the paper. This is our metric. So instead of throwing 50,000 binary questions, we send five prompts to each LLM and then make the ranking. It correlates nicely with the Hallucination Leaderboard.
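A hedged sketch of how such a score could be computed with NetworkX; the per-graph comparison below uses the graph edit distance, which is one plausible reading given the remark about computational cost a moment later, and the exact selection and aggregation in the paper may differ.

```python
import networkx as nx

def first_connected_atlas_graphs(k=5):
    """First k connected graphs (with more than one node) from the Graph Atlas."""
    picked = []
    for g in nx.graph_atlas_g():  # all graphs on up to 7 nodes
        if g.number_of_nodes() > 1 and nx.is_connected(g):
            picked.append(g)
            if len(picked) == k:
                break
    return picked

def graph_atlas_distance(answer_graphs):
    """Average per-graph distance between the LLM's five answers and the five
    reference atlas graphs; graph edit distance is used here as one plausible choice."""
    refs = first_connected_atlas_graphs(len(answer_graphs))
    dists = [nx.graph_edit_distance(r, a) for r, a in zip(refs, answer_graphs)]
    return sum(dists) / len(dists)
```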
Well, having introduced the GAD, the graph atlas distance, we now have a good mathematical way of measuring, so we can do an ordering and that kind of thing. But since it's a new measure, I don't really know what counts as a good or a bad score. Do you find that the best performers are faithfully reproducing something that looks like the Karate Club?
As for the graph atlas distance, it's a very peculiar way to capture the distance between the ground truth and the answer. First, it's computationally intensive; that's why we only focused on little graphs. But also, for it, all failures are equal.
So somehow, if you forget the link between Elon Musk and Donald Trump, it's the same as forgetting the link between me and my neighbor, for instance. It's one error anyway. Yet, in many applications, those errors matter. So there's an infinite struggle on what is a good graph distance. But I think graphs are so versatile that they cannot be captured with a single distance. You need to perceive them in different angles and dimensions.
The graph atlas distance is perhaps the only solidly grounded distance we have in this paper, but it's a crude one in its own way, I'd say. Well, one good metric amongst a bunch that you'd look at then, I guess, yeah? Yeah, yeah, definitely. And when you look at the graphs that are outputted, do you find that the various LLMs make similar mistakes, or is there a diversity in what they're doing wrong?
From what we know, there's a huge diversity, and that's what is very interesting. Maybe that could be used as some fingerprinting method, because we also work on fingerprinting, mostly of classifiers; now there's a trend of fingerprinting LLMs. But from what we see, and we report several
output graphs in the paper, they look very different. And you can see the table with the different metrics; they are all dissimilar. Yeah, that t-SNE plot is very telling. I want to encourage listeners to go check out the paper and see that.
To what degree do you think maybe prompt engineering or continued work there could improve all of these? Maybe you just need to start the prompt by saying, you are a PhD candidate in graph theory with several successful publications. Now answer the question, you know, would that help?
Maybe after the next LLM has eaten our paper, it can directly answer with its own graph atlas distance, and like this we would have nothing to compute. That would be great. Right. Yeah, this idea of hallucination in graphs is fascinating. I look forward to seeing where else you guys take it.
If we could pivot then, could you share a few details on the other work you mentioned involving the study of shadow banning on Twitter? As I recall, the position of Twitter, at least at the time, was that they don't do shadow banning. Is that right?
Exactly. I will leave the floor to Gilles for the technical part. But the idea was: initially, they said, we don't shadow ban. And finally, because we were already trying to measure things from third-party algorithms or models, we said, maybe we cannot give a yes-or-no answer to their claim, but at least we can provide some measurements and let readers make up their own opinion about that. At the time, there were also a lot of people... Now things have changed dramatically, but at that point Republicans in the US were thinking they were shadow banned. So there was a huge amount of claims, in the newspapers or online, that shadow banning does exist. So we decided to have a focused look at that by doing a huge data collection
and then taking a statistical look at it. Gilles, if you want to add anything. What also sparked the research was Twitter saying that it was a bug.
So that's what the Republicans claimed in this situation, but what users were complaining about was witnessing that many of their neighbors, or many of the accounts that replied to them, were shadow banned. And Twitter answered that it was a bug. For us, we thought that's a good way to avoid any scrutiny, because everything can be a bug somehow. And we thought there was a deeper meaning in that: somehow they implied that it was not due to some action, that somehow it was random.
This perspective, that if it's a bug it should hit everybody uniformly, can be translated into a hypothesis, and this hypothesis can be tested on the network. Concretely, what we did is we sampled Twitter, I think in the end around 5 million profiles. We started with a bunch of random users, a bunch of celebrities, and a bunch of politicians, like three populations. And first we looked,
among their neighbors, at the fraction of them that are shadow banned. And we saw a huge disparity. Some users had up to 47% of their neighbors, or let's say recent contacts, that were shadow banned, according to the detectors we had, whereas on average it was 2.3%, I think.
This huge disparity tells us something about the shadow ban principle. It's the idea that somehow you're not all equal, because, let's say, if you toss a coin that comes up heads 2.3% of the time and tails the rest, and then hundreds of your friends toss the same coin, the likelihood that 47% of your friends get heads is very, very low. That was a way for us to, let's say, attack this assertion that it was a bug.
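To make the coin analogy concrete, here is a back-of-the-envelope check of the uniform-bug hypothesis; the figure of 200 recent contacts is only an illustrative assumption.

```python
from math import comb

def binom_tail(n, k, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Under the uniform-bug hypothesis, each contact is shadow banned independently
# with the global rate of about 2.3%. For an account with, say, 200 recent
# contacts (an illustrative number), how likely is it that 47% of them are banned?
n, rate = 200, 0.023
k = int(0.47 * n)
print(binom_tail(n, k, rate))  # astronomically small
```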
What we ended up saying is that if it was a bug, then it was, let's say, targeting or falling on some neighborhoods way more often than others. So we could say that it's not a bug. Also, this is where the graphs kick in. If we consider shadow banning to spread like a disease on a graph, which is a well-known kind of model for people working on graph analysis,
then we can model pretty well the real ego graphs that we extracted from these 5 million profiles. Again, some random profiles, and then the neighbors form a graph; it's an ego graph at two hops. Some are very dense, some are sparse, because it's based on interactions.
And with this epidemic view, with this epidemic model, we tried to fit a spreading model, and it explained quite well the amount of shadow-banned people we see in these ego graphs. So this model was far more probable than the uniform, bug-like possibility of shadow banning, as I said at the beginning. So the idea is not to say that this is an epidemic propagation, but
the neighborhoods that might be susceptible to shadow banning, maybe they really are, and this is one model of what happened. And yeah, it was fitting pretty well, and based on that, we did the study. It is interesting to explain why we ended up using an epidemic model. It's because when you observe that some neighborhoods are really plagued with shadow banning and some others are nearly untouched, you think that there is some kind of locality phenomenon at play. And one dedicated tool to capture that is this concept of susceptible-infected models and this idea of contamination somehow.
We don't mean that you can get shadow banned by contamination, but rather that contamination was a tool to model the very uneven locality of the shadow banning occurrences we witnessed in certain neighborhoods.
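For illustration, here is a toy susceptible-infected process of the kind they fit; the synthetic graph and the parameters are made up and only stand in for the crawled ego graphs.

```python
import random
import networkx as nx

def si_spread(graph, beta=0.05, initially_infected=1, steps=30, seed=0):
    """Toy susceptible-infected process: at each step, every infected node
    infects each susceptible neighbor with probability beta.
    Returns the final fraction of infected (here: shadow-banned) nodes."""
    rng = random.Random(seed)
    infected = set(rng.sample(list(graph.nodes()), initially_infected))
    for _ in range(steps):
        newly = {v for u in infected for v in graph.neighbors(u)
                 if v not in infected and rng.random() < beta}
        infected |= newly
    return len(infected) / graph.number_of_nodes()

# Synthetic graph standing in for a crawled two-hop ego graph
ego = nx.barabasi_albert_graph(300, 3, seed=1)
print(si_spread(ego))
```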
It doesn't mean that it's the product of contamination, and actually, we don't know where it comes from. Maybe we can give a hint of an answer, because we checked the most affected profiles. And honestly, I'm quite glad that we censored them, because they were, let's say, not-safe-for-work kinds of profiles, connected in bunches like this. So that kind of confirmed
to us that we were on the right track, let's say. Well, 2.3% versus 47% is a very stark difference. You've measured a phenomenon, but that depends on your detector. Could you share a few details on how you determine whether a shadow ban is in place or not?
Everything started with a group of German people who put some tests on GitHub to detect shadow banning. What we call shadow banning is maybe an umbrella term. Originally, one of its first occurrences was on message boards and, let's say, IRC and all these chat systems, where trolls would come in and annoy everyone. But if they got kicked by some admin, they would simply change their nickname and come back again and again. So the admins came up with this technique called shadow banning: shadow, meaning I don't tell the user that he's banned, and banning, because I actually ban him. And the action was to stop relaying the troll's messages to all the other members of the channel, but not kick the troll. So the troll would just see his messages go unanswered and get frustrated. So that's the idea. On Twitter, we had to get inventive, or let's say we exploited different things. So the starting point was those tests provided by the German group.
These are different kinds of profile visibility reduction. One very easy one, for instance: when you type the name of some user, it should autocomplete in the search bar. At the time, it should have. Well, what you can observe is that this doesn't happen for some accounts. And so this was what we call the search ban, or I think the completion ban, I can't remember. And there is a second step, which is when you search for the user, it doesn't even appear within the search results. And that's maybe the search ban. And the other one
we tested, for instance, was the reply ban. When you reply to some message and there are many replies, sometimes those additional replies are hidden behind a little button that you have to click to show them.
This happens to shadow-banned users even when there is no other answer, or very few alternative answers: they would always be downranked into the "show more replies" list.
And the last one is a ghost ban. I'm not sure I want to enter into the detail, but that plays with some visibility. Maybe the general takeaway is that shadow banning implies different visibility. What the troll sees from his perspective and what the other users see are different things. And that's by playing with these different visibilities that we manage to detect something.
There may be other techniques that we never looked at or never knew existed. And those ones we cannot talk about. I mean, we have no results about them. Well, yeah, I mean, my first instinct was I should look at does a popular person get the same amount of retweets every time? But of course, they could have a boring post. And that's why it got no retweets. But you have a variety of metrics that look at more concrete things.
It's very hard to sample the way we do, starting from seed profiles and looking around. If you explore, let's say, a big part of the graph like this, you will induce bias, because you started from specific points, and this may impact the results you get downstream. It's difficult to find a baseline in such graphs. But luckily, the old version of Twitter had kind of a hack: the 32-bit user identifiers were filled consecutively. This means users would get a number depending on their arrival, and those numbers were dense, meaning you could throw a random number between 0 and 2 to the 32 and land on some profile.
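A sketch of that rejection-sampling trick; profile_exists below is a hypothetical stand-in for a lookup against the old Twitter API, not a real endpoint.

```python
import random

def sample_random_accounts(k, profile_exists, id_bits=32, seed=0):
    """Rejection-sample k accounts: draw uniform integers in [0, 2**id_bits)
    and keep the ones that resolve to an existing profile.
    `profile_exists` is a hypothetical lookup wrapping the old API."""
    rng = random.Random(seed)
    found = []
    while len(found) < k:
        candidate = rng.randrange(2 ** id_bits)
        if profile_exists(candidate):
            found.append(candidate)
    return found
```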
This gave us a way to randomly sample users within the Twitter space, which is quite rare and very, very useful for statistical exploitation. And I think your research was before the sale of Twitter and things like that, if I'm not mistaken. And specifically before this thing they call the Twitter Files, which was some big information dump on many, many topics.
Did anything in the Twitter Files affirm your findings? Before that, I think they removed their post. I mean, they declined to comment after some journalists asked about the paper, but then they removed it. We cannot say it was because of our work, but it came right after that. So maybe it had some impact, maybe not. But at least they were aware of this work, which says that, statistically, it's not possible that they do not shadow ban.
To the point of the Twitter Files, I'm not sure it belongs to the Twitter Files, but at some point a so-called screen capture of the Twitter administration interface circulated on the internet. Its origin is unverified, so take it or leave it. What it showed is a bunch of buttons that kind of corresponded to the different shadow ban actions we tested for, which helped us grow confidence in our results, let's say.
Well, let's see. We've covered a lot of great work today. I wanted to give you guys the chance to say where you're going next, if there's anything you want to share about your current research or promote that's coming up. We're still on this auditing of remote models, from a user or auditor perspective, trying to find
exciting ideas on how to apply new prompts, maybe, new data structures to the questioning of these fascinating remote models so that we can make sense a bit of these black boxes that are executing remotely and have this huge impact on our lives. So this is our topic and we will explore things based on the ideas we have at the moment. So it's very fascinating, at least on our side.
I'd like to add that maybe our perspective is to try to find a way to capture or measure correctly the behaviour of platforms. The interaction of platforms and society raises many questions and I think it's going to be even more important with the advent of LLMs everywhere.
It sparked lots of debate and things like this. For example, around shadow banning, there were lots of people discussing it. And what we thought is, can we bring some, let's say, strong answers to the debate? We are just interested in a very small part of auditing, which is measuring things correctly. That's where we focus. With LLMs and this open-ended interaction, I think we'll have lots of work to come up with clever ways to,
let's say, extract information, and reliable information, from those platforms. Absolutely. And with the ubiquity of LLMs, I think we're definitely going to see increasing interest in auditing and things like that. So I'm glad to know folks like you are looking into these problems. Thank you, Kyle. Thank you. Yeah, thank you both so much for taking the time to come on and share your work. Thank you for the interest. It was awesome. See you.