
Unveiling Graph Datasets

2025/5/8

Data Skeptic

You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere. Welcome to another installment of Data Skeptic: Graphs and Networks. Today we're going to talk a little bit about datasets, but more on how to rate them. We've covered the Karate Club and a few others. There are some famous datasets in network science. How useful is each? What do you think, Kasaf? Some datasets are gibberish, but some are useful.

Well, the computer scientist in me wants to say that if I'm going to develop some great algorithm, it should work on a large variety of datasets and consistently do, let's say, link prediction very well. But I think it depends. And I think Bastian's is a very interesting paper. What he shows, and I'm building on things he said, is that the dataset is the dataset.

A network dataset models reality. This is reality, right? We can't choose it. But what we can choose is what we ask from the dataset.

In some networks, I'd expect something more like preferential attachment, networks that favor, let's call it, high-ranking nodes, hubs. In the paper we're going to discuss, Bastian and his colleagues developed something called RINGS, which is a framework for evaluating graph learning datasets.

One of the interesting insights in the RINGS framework, I think, is that they take the graph and perturb it in a variety of ways. And if those perturbations don't make the outcome worse, then you're probably not utilizing the structure of the network. So it's an interesting insight that can be used to test how robust and useful these datasets are.

My take from it is, and I think Bastian touched on the subject a bit, what questions are we asking of the dataset?

Most of the, let's say, challenges that you see, like on Kaggle and the like, involve labeling or tagging the data, right? Like in the IMDB dataset of movies, they wanted to tag the different genres. Sometimes we need to ask ourselves, are these the interesting questions that networks can help us answer? Because sometimes networks can help us

tag nodes, like in the case of homophily, right? If you have different communities on the network, usually the communities are created because the nodes are kind of similar, so they attract. But the thing is that the attraction, the homophilic nature of the community might not

agree with what we're asking. Maybe if we look at the homophily of the community, we can learn what to ask about. It's like a callback to an older episode we had about GitHub.

We discovered that the GitHub repository network, who commits with whom, creates communities, of course. You intuitively might think that people from, let's say, the same country commit together, right? Because, you know, geographical proximity: people in the US should commit with people in the US. It sounds reasonable.

But actually, what she found was that it's the programming language that determines who commits with whom, right? And you say, well, that's very intuitive. But the geographical proximity explanation was also very intuitive, right? Both are intuitive, but only one is right. So...

When you're trying to tag things with the, I would say, quote-unquote wrong homophily, you're making the network something it isn't. And I guess in these cases the network will distort the results, as Bastian showed us.
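For readers who want to poke at this themselves, here is a minimal sketch (with made-up node attributes, not the actual GitHub data from that episode) of how one might compare two candidate homophily labels using networkx's attribute assortativity:

```python
# A minimal sketch: comparing two candidate homophily labels on a toy
# collaboration graph. The "country" and "language" attributes are hypothetical
# stand-ins for the GitHub example discussed above.
import networkx as nx

G = nx.Graph()
G.add_nodes_from([
    (0, {"country": "US", "language": "Python"}),
    (1, {"country": "DE", "language": "Python"}),
    (2, {"country": "US", "language": "Rust"}),
    (3, {"country": "DE", "language": "Rust"}),
    (4, {"country": "US", "language": "Python"}),
])
# Edges drawn mostly within the same language, across countries.
G.add_edges_from([(0, 1), (0, 4), (1, 4), (2, 3)])

# Assortativity near 1 means edges connect nodes sharing that label;
# near 0 (or negative) means the label does not explain the links.
for attr in ("country", "language"):
    r = nx.attribute_assortativity_coefficient(G, attr)
    print(f"assortativity by {attr}: {r:.2f}")
```

A value near 1 for one label and near 0 for the other is the network's way of telling you which kind of similarity actually drives the links.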

But if you are more data-driven and explorative, and you look at the network and see what it has to say, you might learn something new instead of just asking the old questions. You know...

People were saying that the wisdom in learning is not in answering questions, but in knowing which questions to ask. Well, for a long time, I thought it was a foolish saying. But when I got into networks, I started to understand what it's all about, you know, asking the right questions. Do you think this paper has insights that the network science community should more eagerly embrace? I like these ideas of perturbing networks and studying how those perturbations contribute to changes in your results.

I think the results of the paper are aimed more at the machine learning community than at network science, because in network science they do these kinds of things all the time. Modularity, or the Louvain method, is based on the idea that you work with null models, right? With configuration models and so on, as a basis of comparison. I think the machine learning community

can learn from it. And again, I think we can all learn from it that we need to think about what questions we ask. Yeah, and what data sets we rely on in our evaluations. So a lot of good insights here. Well, let's jump right into the interview then.

I'm Bastian Rieck and I'm affiliated with the University of Fribourg in Switzerland. I'm a tenured professor of machine learning and I have my own group, the AIDOS Lab. AIDOS stands for AI for Data-Oriented Science.

The AIDOS Lab, this is kind of my own in-joke. It's also the name of a Greek goddess, the goddess of awe. So, like when you're awestruck by something. Because awe, it sounds a little bit weird, right? But this is what I feel when I do science. I'm awestruck by trying to understand something, and then trying to get into things and trying to see how they work. All the things we're doing, at least in machine learning, are so much driven by the availability of data.

It's almost never like in pure math, where we just think about the platonic realm of ideas and of objects, and then we sit down and we try to prove certain properties or whatnot. And AI is totally different. AI is all about, you get a data set, you have to answer a question or multiple questions, and you have to kind of make it work. And I wanted my lab's name to embody that a little bit. And so now here we are, and we're doing that in a variety of ways.

Well, I promise we're going to get around to talking about the paper, "No Metric to Rule Them All," but I have a few preliminaries. First, and maybe this is a weird aside, I have to say you're the first guest whose GitHub contribution graph is more green and more dense than mine. You are a worker bee, I can tell. What are you committing all the time? I love GitHub, or rather I should say I love open source.

But I also love to track my work and to do things whenever I can, to learn something new every day. A lot of those commits that you're seeing come from some internal repositories. A lot of it is actually structured journaling. Might sound a little bit weird, but I've been keeping a Markdown file since, I think, about 2018. And you heard it right.

It's one Markdown file for project-based things and another one for, like, personal things. And I kind of try to hash out things like blog posts or the next paper, et cetera, before putting that into a different format. So here we go. Yeah.

And obviously we're going to get into some machine learning topics, which is close to what we cover on Data Skeptic quite often. But you came from a slightly different original place, I guess, academically. Yeah, so originally I studied pure mathematics at Heidelberg University in Germany. So that means algebraic topology and differential topology.

But at some point after my PhD, I said, okay, we're going to have to go somewhere else. And I'm going to have to try something new, reinvent myself a little bit. Because it was intellectually stimulating, I can confess that. But it was also a little bit

vacuous at times, at least for me, because I never felt that I had that many people to connect my research to, or to talk about my research with. Certainly not my family or friends, not because they don't care, but just because it's such an abstract thing to discuss. And so I figured that I needed a change of scenery, a change of topic.

Yeah, here I am. Well, I guess there are probably many ways in which topology is useful. Clearly, ML. Are there any others that you would highlight? Yes. I mean, one aspect that we want to pursue, not only in future work but also in our current research, is this idea of imbuing a model with information about the topology of the data. And that can take on various forms. It can be something as simple as just saying, oh, you have a graph, and the graph has multiple connected components. Or it could be something more complex, where you're looking for, let's say,

holes in a dataset, like structural shape descriptors and these sorts of things. What we're finding there is that some topological methods, and I don't want to bore the audience with details here, but some of them really give you something that is complementary to what you can get with other structural inductive biases in ML. And so that's kind of a neat way to add this additional facet to some models, I guess.

Well, I have a suspicion that if we surveyed most machine learning practitioners and asked them to define topology, they would struggle. Maybe at best they'd know what a Klein bottle was, but I'm not even confident about that. Can you give maybe a pitch for the degree or the importance of topology and why machine learning people should be getting more into it?

Yeah, certainly. I mean, I think in one sense, we mathematicians are the wrongdoers here, because when we talk about topology, there are actually three big branches. There's point-set topology, where you talk about how things are being connected, about neighborhoods and these sorts of things. There's algebraic topology, which you could

basically define as calculating shapes using linear algebra. And then there's the wonderful realm of differential topology, where you're looking at some kind of functional description of your data or of your topological spaces. But fundamentally, I would say topology is the science, or the math, behind shapes. It's a very general way of putting it, but I hope this may also convince some listeners to actually dig into it.

So then when you were looking to, I don't know if you consider it a pivot or not, but maybe start focusing the toolset you'd built elsewhere, what was attractive about machine learning? The attractive part was that you had data available and you had, or so it seemed at the time, clearly defined problems.

For me, when I wrote a proof in pure mathematics, it was hard to say whether it was wrong or not, right? I had to reshape and refine the argument, et cetera. Whereas in machine learning, you have a data set. Your task is to maybe, I don't know, classify images, find the cats, find the dogs. Then you do it. And then you look at the accuracy and you're like, okay, this is good. And then you try something else and the accuracy goes up and you're happy. And then you try something else and the accuracy goes down and you're not happy. And so this was very appealing at the time. And I don't want to say that machine learning is easier. Rather, I want to say that

results are more tangible and you don't have this delayed gratification that you get with math or in other subjects, I guess. What about graphs in particular is interesting in this context? Well, I mean, graphs have always intrigued me because they come in so many different ways, shapes, or forms, almost. And already when I started looking into ML a little bit, I figured that

graphs are somehow distinct from other modalities, because for images we have lots of ways to talk about them, right? I can say it's a black-and-white image, it's a three-channel image, it's a grayscale image, it's an image of a cat, it's an image of a dog, et cetera. But for graphs, we have such a variety of ways to generate them or to reason about them. Some have node attributes, some don't. Some come from pure mathematics, some are modeling a system, some are a model of reality, like

a metro graph or a tram network, a train network, and these sorts of things. And this has always, always intrigued me. And I always figured that it would be cool to look into frameworks that were able to tackle all of these things in one fell swoop.

So as you'd mentioned, it's appealing because there seem to be some defined problems. There seem to be some famous datasets. What's out there in terms of datasets? So I mean, there's lots of datasets coming now from molecular modeling or from drug discovery, I guess, where you're asking about certain properties of a compound, of a molecule.

Property prediction: you're asking about toxicity, solubility in water, these sorts of things. There are social networks that are quite famous, Reddit and what was previously Twitter, with a lot of datasets coming from there.

There are also, I think, a couple of road network datasets, but I haven't really worked with them. So I would say a large chunk is what one could call bioinformatics or chemoinformatics datasets, with proteins, molecules, these sorts of things. Obviously, a graph researcher or someone interested in algorithms and things like that would like to have this diversity of datasets to try out their methods. You know, is my regularization technique the best one, or things along those lines?

What does the playing field look like? Are there strong methodologies that are known to commonly work, or is it kind of diverse by category? So I would say, and this is at the risk of making some enemies there, but when I look at the community, I think there are basically two big families of methods. The one would be graph neural networks, so things like convolutions, but defined at the graph level,

with something like the message passing paradigm. The new kid on the block, as it were, is maybe the Transformer architecture, which is kind of interesting for different reasons, because you're not really making use of the graph as such, but rather you're looking at all the nodes and all the possible connections between them.

This used to be a little bit different. Historically speaking, there were other methods like graph kernels, but I would say that these have fallen a little bit out of favor because of computational inefficiency mostly. So the graphs have become larger and we now need some methods that actually scale nicely to them.

So let's jump into "No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets." What did you set out to accomplish that's covered in this paper? Maybe historically speaking, I should say that this dates back to a lot of ideas that Corinna and I in particular have had for some time now, dating back to a couple of years ago.

We wanted to have a measurement for understanding how graphical a graph is. This sounds super weird, but in contrast to images, when we're dealing with computer vision tasks, for instance, we don't really have language to say, oh, this graph actually needs a lot of its edges, or doesn't need a lot of its edges.

With this paper, we try to provide a language and provide a framework that tells us how useful our datasets actually are when it comes to measuring graph learning methods.

In, let's call it plain vanilla machine learning, the metrics I'm familiar with are like accuracy, precision, F1 score, area under the curve, these sorts of concepts you learn in your first machine learning course. Are those universally useful or do graphs have a different set of metrics?

No, these are really universally useful when it comes to assessing the end task or the outcome as such, right? What we are also looking for is a metric in the sense of a distance measure between the nodes, or distances between different features of the graph. So this is where it's almost like a double entendre, one could say. But of course, there's also an allusion to Lord of the Rings, right?

So yeah, can you expand upon RINGS as a framework? I don't know if we've spelled that out yet exactly. What is RINGS? Yeah, so RINGS is our attempt to provide a new framework for classifying graph datasets, for doing an evaluation of evaluations, as it were. RINGS is short for Relevant Information in Node features and Graph Structures.

A lot of these graph datasets, they're almost like a wolf in sheep's clothing, right? Because they come with a graph, but they also come with node features, with measurements attached to the graph.

And our basic question, and this I think we answered in the paper, was to what extent do we need the node features if we have the graph and vice versa? And it turns out that it's actually not that easy or not as one might expect. How do you go about evaluating something like that? Yeah, this is why it took us so long to actually sit down and write that paper. So it's not like we have been working on this for a couple of years, right? But the ideas have been growing and growing and growing in our

minds. In the end, what we came up with is a twofold approach. So we're asking how well a given task can be solved by different models under some perturbations of the graph structure.

But then we're also asking ourselves how well the graph and the features are measuring complementary information. Because ideally, in an ideal graph dataset, we want the node features and the graph structure to be extremely relevant and extremely important for the underlying outcome. So we want to say that, oh, those go hand in hand. If we take away one, we can't actually do the task, or our performance drops substantially.

What we actually find is that in many datasets, we don't need this conjunction. We either can replace the node features by something else, or we can even replace the graph by something else. And we still get quite useful, quite high performance. And in some ways, this is troubling, I think, because it might tell us the wrong things about the methods that we're developing.

Could you expand upon that? What incorrect conclusions might one draw? The biggest issue for me, I think, is that suppose I have this wonderful idea for capturing something in a graph, and

I try this out on datasets that actually don't need the graph structure at all. So maybe these are datasets that are so simple that I can do the prediction just as well based on the node features themselves, right? So basically, I would treat the graph as a point cloud of some high-dimensional coordinates, and I would just need to use the coordinates for the prediction task. Then what can happen is I come with my cool method that is really very structural and very graph-oriented.

And it just doesn't work. It just doesn't give me a good signal over the other methods that I'm comparing to. And this might actually lead me down the wrong path, right? Because then I would say, oh no, I wasted my time and I have developed this wonderful method, but it actually doesn't do anything.

And vice versa, I could fool myself into thinking that another method that is actually not really good for graph-based analysis, but that's really, really adept at capturing features, maybe I think, oh, that's great because it gives me good performance on this dataset. So that's kind of, I guess, the dark side of using benchmark datasets is that your insights are only as good as the benchmarks themselves, right?
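As an illustration of the "graph as a point cloud" failure mode described here, the following is a rough sketch, on synthetic data, of a feature-only baseline one might run before trusting a benchmark; the dataset, pooling choice, and classifier are assumptions for illustration, not anything from the paper:

```python
# A minimal sketch, with synthetic data, of the "point cloud" baseline: ignore the
# graph entirely, mean-pool each graph's node features, and fit an ordinary
# classifier. If this matches a GNN's accuracy on a benchmark, the benchmark is
# probably not testing graph structure.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def random_graph_dataset(n_graphs=200, n_nodes=20, n_feats=8):
    """Hypothetical dataset: each 'graph' is just a bag of node feature vectors."""
    X, y = [], []
    for _ in range(n_graphs):
        label = rng.integers(0, 2)
        # Class signal lives in the features only, mimicking a non-"graphical" dataset.
        feats = rng.normal(loc=label * 0.5, size=(n_nodes, n_feats))
        X.append(feats.mean(axis=0))   # mean-pool the node features
        y.append(label)
    return np.array(X), np.array(y)

X, y = random_graph_dataset()
clf = LogisticRegression(max_iter=1000)
print("feature-only baseline accuracy:", cross_val_score(clf, X, y, cv=5).mean())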

So if I'm a methodologist, I've developed some new algorithm or something like that, I'd like to claim it's the best one out there. Can I rely on the generally available datasets?

This is a tough one to answer. And I would say only partially, because one of the issues that we found in the paper is that there are some datasets where you get better performance on the described task if you remove the graph entirely, or even if you replace it with a random graph, these sorts of things. So there are all kinds of funky perturbations. I can zoom into that in a second if you want.

We looked at all of those on the graph level and on the feature level. And primarily, well, we looked at two things, but primarily we looked at how the performance under some models changes. Ideally, you want your model's performance to change if you give it the wrong graph, right? Because if that doesn't happen, if the performance stays the same, then the graph was useless to begin with and you don't need it, right? Then, in the sense of what I said earlier, we could say that this graph dataset is not really graphical, right?

But you have to imagine the big air quotes here, because it's really hard to make a specific assessment. But you raise a good point. If we can distort the network, effectively turn it into a random network, the performance should drop. Otherwise, as you say, it didn't matter. So I like the key insight. Could you zoom in on the styles of perturbations that you use in these tests?

If I recall correctly, we have four of them. We have the original one, which is really no perturbation at all, but fair enough, right? We have the empty one, so either doing the empty graph or the empty features, which means that we replace them by essentially zero-based features. We have the complete graph or the complete features, and we have the random graph and the random features.

So we can do these both on the feature-based level or on the graph-based level. And you would expect that there should be big changes depending on what thing you're looking at. Because for instance, to come back to the road network example that you gave, it matters, of course, how things are being connected. If you start randomly perturbing the roads that exist in that network, then the travel times should change substantially. We all see that when there's an actual traffic jam or when

a road is being cut off for some other reason. This you should be able to capture in your dataset. But it turns out that there are datasets where some of these perturbations are actually okay. They give you good performance.
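To make the four perturbations concrete, here is a rough sketch in Python of graph-level versions of them using networkx; the paper's exact construction (and the feature-level counterparts) may differ, so treat this as an illustration of the experiment's shape rather than the authors' implementation. The karate club graph is just a stand-in example.

```python
# A rough sketch of the four graph-level perturbations described above
# (original, empty, complete, random), applied to an example graph.
import networkx as nx

def perturb_graph(G, mode, seed=0):
    n = G.number_of_nodes()
    if mode == "original":
        return G.copy()
    if mode == "empty":                       # same node count, no edges at all
        return nx.empty_graph(n)
    if mode == "complete":                    # connect every pair of nodes
        return nx.complete_graph(n)
    if mode == "random":                      # random graph with the same density
        p = nx.density(G)
        return nx.gnp_random_graph(n, p, seed=seed)
    raise ValueError(f"unknown perturbation: {mode}")

G = nx.karate_club_graph()
for mode in ("original", "empty", "complete", "random"):
    H = perturb_graph(G, mode)
    print(f"{mode:>8}: {H.number_of_edges()} edges")
```

In an actual evaluation you would retrain the same model on each variant and compare test metrics; if accuracy barely moves when the structure is destroyed, the dataset is not really testing the graph.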

And so I know you tested this variety of perturbations on a large set of datasets. It's a big task to summarize all of that, but what's the view from 50,000 feet? So the view is that we have some rethinking to do for some of the benchmark datasets, which I think is good. I also think that some of us...

Well, most of us, I think, who have published in that area, we have suspected that for a while now, because we often saw that performance is not necessarily driven by all the available information in the graph. But I guess the main takeaway is

If you're looking for a key takeaway, then I would say there are only a few datasets, mainly in the molecular realm, that have both informative structural information and informative feature information. So both the structure, the graph as such, and the features are useful.

This is one of the bigger trends that we're seeing, that only molecular datasets seem to have this, or I should say only chemoinformatics/bioinformatics datasets seem to have this at the moment, from what we can tell. Well, if we switch to the opposite polarity, is there a dataset that's so basic and obvious that maybe researchers should stop using it? Yeah, there's a couple of those. We point them out.

One is a dataset looking at enzyme information, which I know sounds a little bit contradictory to what I said earlier, but it's a very small dataset, just about 600 graphs, I think, and its performance under standard techniques already varies quite a lot. You go from 25% to 65% or something like that. I've seen everything across the board, which is weird, because if you zero in on a good method, you typically get

a kind of stable performance as well, but with the enzymes dataset that's totally different.

Another is a dataset of social graphs collected from Reddit discussions. There's a Reddit binary one, meaning that it has a binary classification task, and the Reddit multi one, which has a multi-class classification task. And one of our outcomes is also that those should probably be thrown away, or at least repurposed for some other task, because in both cases the structure and the features are not informative, unfortunately.

So I know you'd mentioned the molecular datasets are the ones that seem to have these nice features. Could it be that that's for some simple reason, like those are the biggest datasets? That might be the case. I mean, we are looking at the individual graphs at some level, of course, but of course, bigger datasets might certainly have more variety.

I think what is the saving grace for those data sets is also that molecules naturally come equipped with some kind of structure and some kind of features because they exist in a physical reality. But of course, we can also have the mathematically abstract view. And I think both of those views are kind of important to make a discussion or to make a final classification, for instance, of them. But yeah, it's true. There could also be some underlying causes that we're not aware of.

While I'm used to reading all the standard metrics, F1 score, accuracy, this sort of thing, in papers I read, do you think there's room for authors in the future to also include a section in their results where they apply the RINGS framework and show the decline in those metrics after these perturbations?

I mean, that would be the dream. I mean, what we were envisioning, if we had the support of the community, we would love for people that pitch a new dataset to actually analyze that performance. So if you write a paper and you say, hey, this is my cool new graph dataset, then you describe how you created the graph, you describe how you collected the data, the features, etc.,

and what tasks you envision. And then, as the last step, you would basically also say, and by the way, here's a couple of performance changes under the RINGS framework. Here's some performance separability and some mode complementarity information telling you whether this dataset could be useful or not. I think that would be great. But I'd also settle for a more detailed discussion of the dataset as such, without our framework. That would also be great.

Well, we know the negative case: we do these perturbations and our metrics don't change, telling us that the graph structure isn't influencing our decisions, things along those lines. What would be the ideal case? Obviously, the performance would decline. Would you expect a sharp decline, or what does that look like in practice? Yeah, oh, this is a really tough one to answer. But ideally, we would expect...

a sharp drop if we make big changes to the graph structure. Maybe to put it like this: if you only rewire a couple of edges, then I would guess that everything should stay more or less the same, unless it's a very critical edge that connects, I don't know, maybe two groups in your graph that are not connected otherwise, or stuff like that. But otherwise, I guess the performance should roughly correlate with the amount of changes, with the amount of perturbations I'm doing to the graph, and maybe then also later on to the features.
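A small sketch of that graded idea, assuming networkx: rewire a growing fraction of edges with degree-preserving swaps and watch a structural quantity drift. In a real study the tracked quantity would be a trained model's test accuracy; average clustering stands in here purely for illustration.

```python
# Sketch: rewire an increasing fraction of edges (degree-preserving swaps) and
# watch how a structural statistic changes with the amount of perturbation.
import networkx as nx

G = nx.karate_club_graph()
m = G.number_of_edges()

for frac in (0.0, 0.1, 0.25, 0.5, 1.0):
    H = G.copy()
    nswap = int(frac * m)
    if nswap > 0:
        nx.double_edge_swap(H, nswap=nswap, max_tries=100 * nswap, seed=42)
    print(f"rewired {frac:.0%} of edges -> avg clustering {nx.average_clustering(H):.3f}")
```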

One of the issues highlighted in the paper is that certain popular datasets are biased and non-representative. How does that take form in terms of just nodes and edges?

There are some datasets where the task is apparently too easy because it really doesn't matter what type of perturbation you apply, you get more or less the same performance. There's a famous dataset called the AIDS dataset, which is, I think, analyzing some kind of molecular compounds as well. This dataset has...

almost like perfect performance across the board, regardless of what you're doing to it. And so this would tell you that, well, here's something that would need to change. At least you would need to change something about the task. You control the graph as such, but you also control the task that you want to solve. And in some cases, we have, let's say,

a mild discrepancy between the task and the graph. So we could say the graph is maybe good, and the features as well, but maybe the task is not super good. Maybe the task is too easy. Maybe you should ask it for something else. We're seeing this also with the social network datasets, where essentially all these Reddit datasets are kind of good in terms of the structure that they have. They have lots of variety. They have lots of interesting motifs in there, because these are graphs that essentially mimic the Reddit discussion threads.

But what they don't have is a task that is really, really hard. In fact, the task is so simple that you can actually do it, if I recall correctly, based on just some degree information of the individual nodes. And then you can predict whether this is a discussion in one subreddit or in another.
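Here is a minimal sketch of what "solvable from degree information alone" looks like in practice, using synthetic graphs as stand-ins for the Reddit data (the actual datasets, bin choices, and classifier are assumptions for illustration):

```python
# Degree-only baseline: summarize each graph by a degree histogram and classify
# with an off-the-shelf model. High accuracy from this alone means the task never
# needed graph learning in the first place.
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def degree_histogram_features(G, n_bins=10):
    degrees = [d for _, d in G.degree()]
    hist, _ = np.histogram(degrees, bins=n_bins, range=(0, 20), density=True)
    return hist

# Two hypothetical classes with different degree profiles: star-like threads vs.
# denser random discussions.
graphs, labels = [], []
for i in range(100):
    graphs.append(nx.star_graph(20)); labels.append(0)
    graphs.append(nx.gnp_random_graph(21, 0.2, seed=i)); labels.append(1)

X = np.array([degree_histogram_features(G) for G in graphs])
y = np.array(labels)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("degree-only accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```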

Do you have any aspirations or hopes for future datasets that could be released? If there were maybe a company or an organization or just a field where you'd like to see something produced or released, whatever it may be, where could the most interesting datasets come from?

Putting on my mathematician's hat, I would hope that maths could produce some datasets that are actually very, very helpful. So, I mean, graph theory, of course, is already rife with graphs, right? They are dealing with lots of graphs. These graphs are not necessarily represented in our benchmarks, because from some perspective they're kind of uninteresting: they often don't have any node information, and we

definitely want some kind of node information. So that would be like the pure math community potentially. But then of course, I would love to have something more come out of the molecular, out of the chemistry realm. That would be really, really lovely. Potentially even something from geography directly. So all the things that are

effectively modeling reality to some extent. So a road network or a traffic network, these sorts of things. We have a couple of those already, but they're not at the level of detail and depth that they could be. So, for instance, it would be super cool if Google would say, hey, I'm going to give you the whole public transportation network of Australia and

the USA and Germany and Switzerland and whatnot, right? Because this data already exists. We're using it all the time, but we're not really bringing it into a format that people can use, that people can play around with. So those would be some hopes, I guess. One opportunity that the paper presents is that an independent researcher could adopt your perturbation techniques in their own work. Do you have any recommendations for how they do that?

At the end of the day, I guess, it has to be done in software. Is there a convenient way to do this, or is it a sort of manual effort at this point? "Convenient" is a big word, it's doing some heavy lifting there. But I'd say they could use our code. So we have the code on GitHub. We're planning on making a proper release afterwards, with a lot of tutorials and a lot of information thrown in,

which, incidentally, I also think is very important to do in machine learning research: not only have the paper as an artifact of your research, but also have something usable for the community, for the practitioners, for the rest of the world to try out your model, to try out your new ideas. And

I mean, ideally, if we managed to make that work, we would love to have a simple drop-in solution where people can say, "Here's my new dataset. It just has to be in that format." And you just put it into our framework, and it spits out some of the perturbation metrics that might be interesting.

Maybe we won't get there directly, but maybe we could even think about some very simple assessment of those metrics, some automated assessment. So you give us the data set and we say, well, this looks actually kind of good, but notice that the complete graph also gives you good performance. So maybe you want to look into that, something along those lines.

Do you think we should formalize any of this as, like, a no-free-lunch theorem, if there's no single best metric? I think what we should definitely do in graph learning is think more about where our graphs are coming from. Because I think what we're currently doing is trying to have this free lunch, in the sense that we say, oh, it's a graph, so our method should be able to handle it.

But we completely disregard, or in most cases, we disregard where this graph came from. And it makes a difference. So I can create a graph from some prior geometry. I can sample some points and I can say, well, I connect the nearest neighbors, then I have a graph. But that's a different graph than what maybe a graph theory person in math would call a graph, right? They would say, oh, I'm looking at Ramsey graphs for some reason or Cayley graphs. And they are generated in some other shape or form.

But for some reason, we expect our methods to do well on all of these types of graphs. And I think this is where the no free lunch should definitely kick in and where we should say, well, maybe our methods only work very well for geometric graphs. Maybe they only work well for graphs that don't have an underlying geometry. I don't know, right? But we're trying to kick off this discussion as well in the community and to try to give people a language to talk about these problems.
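To illustrate the two families being contrasted here, this short sketch, under assumed parameters, builds a k-nearest-neighbor graph from a sampled point cloud and a purely combinatorial random regular graph, then compares a simple structural statistic:

```python
# Sketch of the "graph from prior geometry" construction: sample points and
# connect nearest neighbors, then contrast with a graph that has no underlying
# geometry at all.
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))                # a point cloud in R^3

A = kneighbors_graph(points, n_neighbors=5, mode="connectivity")
G_geometric = nx.from_numpy_array(A.toarray())    # k-NN graph with latent geometry

G_combinatorial = nx.random_regular_graph(5, 100, seed=0)  # purely combinatorial

for name, G in (("k-NN", G_geometric), ("random regular", G_combinatorial)):
    print(f"{name}: avg clustering = {nx.average_clustering(G):.3f}")
```

The same graph-learning method may behave very differently on these two families, which is exactly the no-free-lunch worry raised above.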

Are these problems unique to graph datasets? This is a hard one. And I'm inclined to say, to some extent, yes. Not a very precise answer, sorry about that. But they are less prevalent in computer vision, which is not to say that computer vision is any easier, right? But it's just that

If you have a vision task, you have some sensor data, you kind of understand how the images are being formed and what you're trying to do. But in graphs, I think we still lack this principled understanding of how our graph was initially created. To give you a very concrete example, there are some graphs, even in the benchmark datasets, that have been created by thresholding a correlation matrix.

So someone sat down and measured correlations between objects, and then they said, okay, if I put, let's say, a threshold of 0.65, then I get the following graph out of there.

But in a way, this graph is just a snapshot of a much more complicated process, namely this whole correlation matrix that tells you everything about how your objects are interacting. And we don't get to see this correlation matrix. Rather, we see the graph that was created by this one threshold. I'm not sure whether images are suffering from similar things. It doesn't seem that way to me.
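A sketch of the thresholding construction just described, with synthetic correlated data; the 0.65 cutoff is the hypothetical value from the example, and the point is that the resulting graph is only one snapshot of the underlying correlation matrix:

```python
# Build a graph by thresholding a correlation matrix: keep an edge wherever the
# (absolute) correlation between two objects exceeds a cutoff.
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
# Synthetic data: 12 "objects" made of noisy copies of 4 base signals, so some
# pairs are strongly correlated and others are not.
base = rng.normal(size=(200, 4))
data = np.hstack([base + 0.3 * rng.normal(size=(200, 4)) for _ in range(3)])

corr = np.corrcoef(data, rowvar=False)     # 12 x 12 correlation matrix

threshold = 0.65
adjacency = (np.abs(corr) > threshold).astype(int)
np.fill_diagonal(adjacency, 0)             # no self-loops

G = nx.from_numpy_array(adjacency)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges at threshold", threshold)
```

Changing the threshold changes the graph; the full correlation matrix, which carries much more information, never makes it into the dataset.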

Do you see any limitations in the current framework as presented in the paper? We are only looking at message passing graph neural networks at the moment. So we're literally excluding anything else. We are not doing this out of spite or out of negligence, but out of computational considerations, and also because message passing is still the predominant paradigm in

graph learning research. So we're saying, okay, we focus on the biggest chunk first. But this means that all of the results that are in the paper, they are done with message passing methods. And as such, things could change if you have a radically different paradigm, which would be good if we came up with something like this. We also don't have a way to

assess the hardness of the task in all the cases. So we have a notion of mode complementarity where we're looking at how well the features and the graph, how well they are aligned in a certain sense. And this is actually both task independent and model agnostic, which I think is useful to have, but

In all fairness, one could complain and one could say, well, but I'm interested in how well this works for my specific task. And then this metric, at least the specific one that we designed, won't be able to give you an answer. But we hope that we can address some of these limitations, of course, in future work.
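As a loose illustration of the alignment question, and explicitly not the paper's mode-complementarity measure, one could compare feature-space distances with graph distances on a toy example; the graph and the random node features below are assumptions for illustration only:

```python
# Illustrative proxy for "are the features and the structure aligned": correlate
# pairwise feature distances with shortest-path distances on a small graph.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
G = nx.karate_club_graph()
n = G.number_of_nodes()
features = rng.normal(size=(n, 5))                     # made-up node features

# Pairwise Euclidean distances in feature space.
diff = features[:, None, :] - features[None, :, :]
feat_dist = np.linalg.norm(diff, axis=-1)

# Pairwise shortest-path distances on the graph (connected, so all pairs exist).
sp = dict(nx.all_pairs_shortest_path_length(G))
graph_dist = np.array([[sp[i][j] for j in range(n)] for i in range(n)])

iu = np.triu_indices(n, k=1)                           # upper triangle, no diagonal
rho = np.corrcoef(feat_dist[iu], graph_dist[iu])[0, 1]
print(f"correlation between feature and graph distances: {rho:.3f}")
```

With random features the correlation sits near zero, which is what misalignment between the two modes looks like; the paper's own measure is more refined, but the intuition is the same.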

So there's certainly good insight to be taken from the paper. If the community were to embrace it and take it in the ideal direction, what would that look like in your mind? Yeah, I think one ideal outcome would be that people start using that type of framework to discuss new data sets.

But I would also be very happy if they would just accept the paper as is and say, okay, well, I'm trying to be more mindful about the calculations that I'm running, about the task that I'm running. Maybe adding a couple more of what we would call ablation studies, so studies where you change parts of the graph or parts of the model. Maybe being a little bit more mindful about that, and not

seeing the dataset as something that is more or less fixed and cannot be changed anymore. Rather, it would be great if people understood that datasets are also just something that we in the community design and should curate, and not something that should remain static over time, because there is such a thing as oversaturation. At some point we need to replace datasets and build better ones. And I think that time is approaching rapidly.

Well, I've got a curveball question for you as we wrap up. Given your background in mathematics, and what I have to presume is your familiarity with NP-complete problems, do you think topology and geometry might have the insight we need to finally have a proof that P does not equal NP?

This is, I think, a little bit beyond my ken, as they say, because I'm not really that familiar with computational complexity theory. However, I believe that all the big questions in math, all the big questions in life in general, are almost always at the interface of different things, of different domains, right? And so I think what we saw in

topology in the last couple of decades is that the really good proofs, the really good results, they always came from people who wore multiple hats. So someone who was a geometer and a topologist, these sorts of things. So who knows? I mean, if we ever make progress with big questions like this, or like the Riemann hypothesis and so on, I have the hope and the hunch that it will come from someone who tries to combine

different disciplines and bring them together, rather than from someone who is working in one domain specifically. And can you share any ongoing projects at the lab that you're excited about? Yeah, one that is really nice, and a little bit in the spirit of the "No Metric to Rule Them All" paper, is the work that we call MANTRA. It's the Manifold Triangulations Assemblage. The idea here is that we collect a lot of

nice objects, so-called triangulations, from pure mathematics, from two- and three-manifolds. So very nicely, very well studied objects. And then we just let them loose on machine learning models. So we're trying to curate and create a new dataset with some of the lessons of the "No Metric" paper in mind, with some of the lessons of the RINGS framework, I should say, in mind. And what we're seeing is really astonishing. We're seeing that

a lot of the existing methods are not really doing well on very simple topological questions. So even though they have lots of data, there are a few tens of thousands of manifolds in there, and we can generate even more if we want through a process known as barycentric subdivision, but details don't matter here. We see that still

the current methods don't work well, even at answering very simple questions. And that is exciting and astonishing. It's exciting because it means there's a lot of room for new methods. That's always great. But it's a little bit astonishing because one might expect that we would have made more progress in these sorts of endeavors. But it turns out that

Dealing with just purely combinatorial data is still very hard and it's still something that needs to be studied carefully. So that's something that I'm definitely excited about.

What about you specifically? What's next for you? Lots of things. I'm dabbling a little bit in, I guess you could call it efficient small models. I'm very excited about the idea of solving a task just as well as a big model, but at a fraction of the computational cost. These are then, of course, not general purpose methods, but more task-specific methods. But

If this works nicely, then we will see very, very nice results for, let's say, generative models in the realm of geometry and topology. And that would be really, really great. That's something that I'm currently very passionate about.

Very cool, and timely for sure, as models seem to be growing unbounded. Yeah, exactly, exactly. Well, is there anywhere listeners can follow you online? There are a couple of places. I'm still active on some social media under my moniker Pseudomanifold. I have Bluesky at pseudomanifold.topology.rocks.

But I also have a personal web page, which is also reachable under topology.rocks. So I'm trying to stay on brand here. And yeah, I still even have an X handle, which is Pseudomanifold as well, but I'm not that active there at the moment.

Very cool. We'll have links to all of the above in the show notes for listeners to follow up. Bastian, thank you so much for taking the time to come on and share your work. Yeah, and I would also love to use the opportunity to say thank you from the whole team, so from Corinna, Jeremy, Emily, and myself, of course. This is a great opportunity for us to talk about this work, which we consider to be very important, and we are really happy to have this platform for additional outreach and dissemination. Thank you so much.