You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere. Welcome to another installment of Data Skeptic: Graphs and Networks. Today we're going to do another graph database related topic. This one on finding bugs in graph databases themselves through a clever testing technique. Asaf, I get the sense this isn't your area of network science.
Well, if I would code, I would code without bugs. So I can't understand why people do it. Strange behavior, job security. Why do they do it? It's just more work, right? Probably. Yeah, the project we're going to discuss today is called Dinkel, which is a fuzzer, which is a term I was actually vaguely aware of, but had never heard in this context before. I guess I don't get deep into this literature, but a fuzzer is...
is like a tool that's going to send a lot of static noise at something as a means of testing it. In this case, noisy, quite meaningless queries to a graph database, not to performance test it per se, but to find bugs in the underlying core graph database engine.
Interesting topic, and it's graph-related because, of course, they're testing against a graph database. What I liked about talking with Celine was that at the end, when you asked her about the different graph databases, it was interesting to hear from a person who works at, let's say, the ground level of graph databases, who knows the ins and outs, what their view is
of the different graph databases that are out there. Yeah, it doesn't seem that there's one ring to rule them all in this case, that depending on your needs and what you're trying to optimize for, you would make a particular decision about one of the many graph database offerings. Yeah, it's a trade-off. Yeah. So interesting stuff from a software automation testing point of view in here as well that I'm glad we're getting a chance to share. And let's jump right into the interview.
Hi, I'm Celine. I'm a master's student from ETH currently working in automated software testing. My main focus was on testing graph database management systems, which is why I'm here today. So I am studying computer science. I am about a few months away from finishing my master's right now with a focus on secure and reliable systems and data management.
Masters often will have either a project or a thesis. What route are you going? At ETH we do have to have a thesis at the end. I am probably going to stay in this field of automated software testing. Could you tell us a few details about the specific area your thesis gets into? Now, I did have a bachelor thesis on graph database management systems,
like testing graph database management systems, and there's still a lot of work to be done. So I will likely expand upon that, if not just generally helping security people test things. Yeah. And I assume you're referring to Dinkel, is that right? Exactly. Yeah. And for listeners who aren't yet familiar with it, could you give an overview of what Dinkel is? Dinkel originally started as my thesis.
My bachelor thesis, back in 2023. It is a fuzzer, so it automatically generates input for graph database systems in order to test them.
We did mainly focus on really just incorporating a lot of stateful information, making sure the query generation is very complex. It had great results in the end. Well, I get the idea of trying to automate testing of some of the database systems that are out there, specifically graph databases. I'm curious about where you start with that. Like, one thing that comes to mind is, in trying to attack a traditional program, maybe I want to see if I can get a buffer to overflow or something like that.
What's in the toolbox of toys, or whatever, that you can attack a graph database with?
This comes quite naturally if you just aim to generate random queries. So if you do look at some of the bug reports we have, we have encountered buffer overflows, we encountered null values being dereferenced, we encountered weird program states that weren't meant to be reached. We pretty much cover all of these bases, but only if the bug is really reachable through the input of a user, through a query.
So am I correct in saying there's no code inspection? You're really just treating the system as a black box? Yeah, generally you do black-box fuzzing like that. You can expand upon that, but that is pretty much just a way of accelerating the testing, making sure you cover more new code; it isn't really necessary to reach the effectiveness that is needed to find the bugs.
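To make the black-box setup concrete, here is a minimal sketch of such a fuzzing loop in Python. It assumes the official neo4j driver and a locally running instance; the random generator and connection details are illustrative stand-ins, not Dinkel's actual code.

```python
# Minimal black-box fuzzing loop (illustrative sketch, not Dinkel's code).
# Assumes the official neo4j Python driver and a locally running instance;
# generate_random_query() is a trivial stand-in for a real query generator.
import random
from neo4j import GraphDatabase
from neo4j.exceptions import ClientError

def generate_random_query(rng: random.Random) -> str:
    label = rng.choice(["Person", "Movie", "City"])
    prop = rng.choice(["name", "age", "rating"])
    return f"MATCH (n:{label}) WHERE n.{prop} IS NOT NULL RETURN n LIMIT {rng.randint(1, 100)}"

def fuzz(uri="bolt://localhost:7687", auth=("neo4j", "password"), iterations=1000):
    rng = random.Random(0)
    with GraphDatabase.driver(uri, auth=auth) as driver, driver.session() as session:
        for _ in range(iterations):
            query = generate_random_query(rng)
            try:
                session.run(query).consume()
            except ClientError:
                pass  # ordinary "your query is invalid" errors are expected noise
            except Exception as err:
                # anything else (internal errors, dropped connections after a
                # crash) is kept as a bug candidate for manual triage
                print("BUG CANDIDATE:", query, "->", err)

if __name__ == "__main__":
    fuzz()
```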
And I believe you'd mentioned that query generation, specifically complex queries, is part of the effort. Is that something generative AI, as sort of the flavor of the week, handles here? Or can AI even do that? Maybe these have to be very finely crafted, knowing the schema or something. How does that work?
So this really is just a random generator. It is based upon the syntax of Cypher. So it does have to know Cypher in great depth, really be able to make those connections between different clauses and really track the state in a good way. We don't really use AI for this. It really is just a random bit string we use to generate a query in the end.
I can see how AI could maybe be used in some places, but it seems like a really difficult task for it to complete. And you'd mentioned this being state-aware. Could you expand on how?
In Dinkel, we track two different kinds of state. We have the query context and the graph schema. Now in the query context, we just have all the variables that are created and referenced within a query. So you can imagine if you create a variable x and assign it the value of zero, that would be stored in the query context. And we store this information during the generation of a query.
And then additionally, we also have the graph schema, which we essentially simplify as a big set of node and edge labels and different key value properties, which we then reference during the query generation again to make sure that there's a lot of data dependencies within the query, which directly makes it a lot more complex and increases the chances of really triggering deeply hidden bugs in the graph database management system.
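As a rough illustration of the two kinds of state Celine describes, the sketch below (in Python, with hypothetical names rather than Dinkel's actual data structures) shows a query context of live variables and a graph schema of labels and properties that the generator can draw on to create data dependencies.

```python
# Illustrative sketch of state-aware generation (hypothetical structures,
# not Dinkel's implementation): the query context tracks variables created
# so far in the query, the graph schema tracks known labels and properties.
from dataclasses import dataclass, field

@dataclass
class QueryContext:
    variables: dict = field(default_factory=dict)  # name -> kind ("node", "integer", ...)

    def declare(self, name: str, kind: str) -> None:
        self.variables[name] = kind

    def pick(self, kind: str):
        """Reuse an existing variable of the wanted kind to create a data dependency."""
        candidates = [name for name, k in self.variables.items() if k == kind]
        return candidates[0] if candidates else None

@dataclass
class GraphSchema:
    node_labels: set = field(default_factory=set)
    edge_labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)  # label -> {property: type}

# Generating "WITH 0 AS x MATCH (p:Person) WHERE p.age > x RETURN p" would
# both consume and extend this state:
ctx, schema = QueryContext(), GraphSchema()
schema.node_labels.add("Person")
schema.properties["Person"] = {"age": "integer"}
ctx.declare("x", "integer")              # created by the WITH clause
reused = ctx.pick("integer")             # referenced again in the WHERE clause
```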
Are those deeply hidden bugs something like an oversight that there's a missing unit test and that's all it's going to take to fix it?
I mean, a lot of the bug reports we submitted did just result in more tests being added, but they pretty much always had to fix something in the graph database management system. So usually it really is a lot of interactions going on in the background between different parts of the system in a way that the developers didn't expect, which then resulted in those bugs.
We only had one single case where they actually adjusted the documentation of Cypher to disallow certain query structures. Rather than fix it, we'll just make it forbidden, huh? Exactly, yeah.
So when you encounter a bug, what is the user experience? Am I going to get an exception or am I going to get the wrong answer? So we did mainly just focus on exception and crash bugs. This means that either you're going to get an error message saying something went wrong or a really confusing error message that seems very unrelated to your query.
or it just flat out crashes the system. We didn't focus on finding wrong answers, wrong-answer bugs, which we call logic bugs, because that is a whole different field. But the thing is, Dinkel really is just a query generator upon which you can pretty easily expand. So logic-bug fuzzing can be done using Dinkel, but you would have to expand upon Dinkel's generation system.
So Dinkel is a very general tool in that regard, and I know you were able to use it to test many different database options. Could you talk about which databases you've been exploring using Dinkel? In the paper on Dinkel itself, we mainly focused on Neo4j, and we also tested RedisGraph, which is now under the name FalkorDB. And lastly, we also tested Apache AGE.
We found the most bugs in Neo4j because we spent a lot of time testing it, because it is really just a very widely used database, and as far as I know it still is. Yeah, we found a lot of exception bugs there. No crashes though; because it is written in Java, it was very hard to get it to crash. That was a bit different with the others, because they were written in C and C++. We did find quite a few crash bugs here and there.
When you find a bug, I assume you report it? Could you talk about that process?
I mean, for most bugs that aren't security-critical, we can just open a ticket, open an issue on GitHub, and they will fix it from there. But for ones that are security-critical, most developers have a security policy set up where we can send an encrypted email somewhere with the details of the bug, and they will take it from there.
Can you share a few details on that process? I would imagine you're in contact with them a lot if you've submitted more than one bug. Do you know people on the other side of the table that you're now in communication with, or things like that? Yeah, to some extent. I am connected on LinkedIn now with a few developers, but also, Neo4j especially was very thankful and they actually sent me some merch. I have a t-shirt and a beanie from them now, which I...
I can rock during the winter. That's awesome. They ought to pay some bug bounties out too one of these days, I think.
Yeah, no, not really. We didn't get any. Maybe one day. Is there any risk of a bug that's out there that's hard to reproduce? And maybe, you know, you generated a query, but you don't see it consistently. And when you submit the bug, they would have a hard time seeing it. Yeah, there was a few cases where this non-determinism was really an issue. Because mainly when you submit a bug report, you want to reduce the query as much as possible.
Because, I mean, the queries we generate are thousands of characters long with thousands of clauses and very hard to understand. So what we do is we try to find a minimal test case that reproduces this bug. And if there is some non-determinism, that means that we often have to like restart the database again, get it to its initial state, rerun the query a few times to make sure the bug still triggers.
But in the end, it just takes more time to report. But I never had a bug that I reported that they couldn't reproduce, mainly because I tested everything in different Docker containers as well to make sure it's not because of my local setup or anything.
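A simple version of that reduction step might look like the following Python sketch, which is an assumption about the general technique rather than Dinkel's reducer: keep dropping clauses from the failing query as long as the bug still reproduces against a freshly reset database.

```python
# Illustrative test-case reduction (not Dinkel's reducer): greedily drop
# clauses while the bug still triggers. `triggers_bug` is a placeholder that
# would reset the database (e.g. a fresh Docker container) and re-run the query.
def reduce_query(clauses: list[str], triggers_bug) -> list[str]:
    changed = True
    while changed:
        changed = False
        for i in range(len(clauses)):
            candidate = clauses[:i] + clauses[i + 1:]
            if candidate and triggers_bug(" ".join(candidate)):
                clauses = candidate   # keep the shorter query and start over
                changed = True
                break
    return clauses

# Toy example: pretend the bug only needs the CREATE and UNWIND clauses.
failing = ["CREATE (a:Person {age: 1})", "WITH a", "UNWIND [1, 2] AS i", "RETURN a.age + i"]
print(reduce_query(failing, lambda q: "CREATE" in q and "UNWIND" in q))
```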
So you find these bugs by generating these complex queries that, as you said, might be difficult for a human to parse, perhaps. Does that mean we're finding edge cases in the software, or are these bugs that could manifest themselves under other circumstances under more everyday queries as well? Yeah, that is a good question that also has been explored in other projects in completely different fields of automated software testing.
Some of the queries we generate that produce bugs, they do seem pretty nonsensical, but in the end, they could show up as part of a bigger query
actually used in production. Code generators do also sometimes produce structures that don't really seem to make a lot of sense, just because it's easy for the generator. I'm talking about query generators here, where you produce complex queries using very simple code; for example, ORMs are a common place where this shows up,
and they could also have those structures in there. I remember one bug where there was also just another developer who reacted to it saying he was encountering the same issue. But also, most of the bugs we found were in pretty new versions.
So just like bugs that were recently introduced, which means that a lot of the people probably didn't migrate yet to the newest version, so they did not have a chance to really encounter the bug yet. And can you share a few details? I know you've touched on the various systems, but what is maybe the magnitude of bugs you're finding? Are we talking about single digits or hundreds of bugs? How many issues are out there?
In the paper we referenced 60 bugs over three databases. There are still a lot more bugs being found in the background using different strategies. So we are in the triple digits right now and we have tested four databases so far.
It is quite a significant number, especially also taking into account the previous work that has been done on fuzzing graph database management systems. Because essentially, the more naive approaches especially, we can cover completely. So all the bugs that they found, we would have also found with Dinkel. Is concurrency something you have to worry about? Mainly maybe in the non-deterministic test cases.
Otherwise, it's not really a big issue. Do you have a vision for a solution like Dinkel finding its way into relational databases as well?
The work on relational database systems, especially just automated testing of them, is already very well explored. There's a ton of papers on it. I kind of doubt it's showing up there, but I mean, I'm open to it. Is Dinkel a completed work? Is it just a collection of these queries, or does it evolve in certain ways?
No, it is continuously evolving, also with the databases themselves, because graph database systems are still more or less in their infancy compared to, for example, relational databases. They are still changing their syntax, their semantics, they are adding new features here and there, and we adapt Dinkel accordingly so that it is also able to test these new features.
What is often done in new work in automated software testing is really just coming up with new strategies of making bugs show up, like making them evident, for example, for logic bugs.
You can't just run a query and say, oh, this result is wrong, we found a bug. You have to have some strategy to really identify this as a bug. That is also something that we took into account when building Dinkel, to really make sure that it's really just the foundation for those strategies to be added on top later on.
Could you give an example of what a strategy is at sort of like a programmatic level? So there are a lot of different papers on different strategies. I would say the easiest one is you generate a query, you run the query, and you transform the query in a way
such that the result shouldn't change. So for example, you could change, let's just say you have a variable declaration in the query assigning 1 to the variable x, and you just change this to be 1 plus 0. So this obviously shouldn't have an effect on the result of the query, but if you rerun this query, this transformed one, and it produces a different output, then you know that something must have gone wrong in the background.
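To make that concrete, here is a minimal sketch of such an equivalence check in Python, assuming the official neo4j driver; the queries and the 1 to 1 + 0 rewrite are illustrative, and this is one possible oracle rather than Dinkel's implementation.

```python
# Illustrative logic-bug oracle: a query and a semantically equivalent rewrite
# (1 -> 1 + 0) must return the same rows; a difference points at a bug.
# Assumes the official neo4j Python driver and a local instance.
from neo4j import GraphDatabase

def rows(session, query):
    return [record.values() for record in session.run(query)]

def check_equivalent(uri="bolt://localhost:7687", auth=("neo4j", "password")):
    original = "WITH 1 AS x MATCH (n:Person) WHERE n.age > x RETURN n.name ORDER BY n.name"
    transformed = original.replace("WITH 1 AS x", "WITH 1 + 0 AS x")  # no semantic change
    with GraphDatabase.driver(uri, auth=auth) as driver, driver.session() as session:
        if rows(session, original) != rows(session, transformed):
            print("Logic bug candidate:")
            print("  original:   ", original)
            print("  transformed:", transformed)

if __name__ == "__main__":
    check_equivalent()
```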
So that feels like a low-level check where we're going to say, oh, you have a tautology here; clearly adding zero should give the same result. But then, you know, it seems like you're casting a very wide net. Can you just expect to make a simple change like that? Like, I think if I did that manually, I could do it for weeks and never find an issue. Is there something more to it? How do you manage the search space?
It's obviously a lot more in-depth than just, for example, doing a +0, especially if you do look at just all of the semantics present in Cypher, which you can then use to transform the query in some way. There's also papers on deconstructing the graph in a way such that you can reuse it in a way where you can also predict the result.
It goes a lot more in depth than just adding zero, for example, to a number, but still small changes could result in the database generating completely different execution models in the background. So it can
get a query that looks pretty similar, but in the end, it actually runs it in a very different manner, which then could result in you pretty much encountering buggy code. And do you have any metrics around like how long you need to fuzz for before you find a bug? Maybe the bugs per 100,000 queries or something like that?
I mean, of course, that decreases significantly once you start reporting them, compared to when we started fuzzing. But even now, when we fuzz the version we originally tested on, we find a bug
seemingly every few dozen queries, maybe. But on the current versions, especially of Neo4j, which has been really extensively tested, we can run it for days and find nothing. So it varies very heavily.
It seems to me it would be to their benefit if they did that internally as part of their continuous integration, continuous deployment strategy. The way you have the code and licensing and things like that, is that a possibility? Yeah, we are running it. Well, we don't have it open source yet because we're still working on expanding it here and there. But we are planning on just releasing it with the MIT license. So pretty much a license that's just...
just allows you to do anything with it. And we are hoping that different companies could adopt Dinkel in their continuous integration, because especially when we see new versions being released and we test them, it's probably annoying for the developers when, within a day or two, we come with five new bug reports which they could have found themselves if they had tested it beforehand. Yeah, I think that'd be a good procedure for them to adopt.
Could you share a few more details on the generative process? As you mentioned, Cypher is a structured language. How do you take that insight and programmatically generate queries? What we do is we construct a so-called abstract syntax tree, which essentially is a data structure that represents the query, but not as a string just yet.
And we just continuously try to build this tree until we decide to stop. And then we translate this tree back into a string. That is a pretty common approach for generating queries like that. And we just have to make sure during this generation to have the nodes of the tree be of the correct type, which then allows us to make use of all those different Cypher features.
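A stripped-down version of that idea might look like the sketch below, using hypothetical node types in Python rather than Dinkel's code: typed clause nodes are assembled into a tree and only rendered to a Cypher string at the end.

```python
# Illustrative AST-based query generation (hypothetical types, not Dinkel's):
# build a tree of typed clause nodes, then render it back into a Cypher string.
import random
from dataclasses import dataclass, field

@dataclass
class MatchClause:
    variable: str
    label: str
    def render(self) -> str:
        return f"MATCH ({self.variable}:{self.label})"

@dataclass
class ReturnClause:
    expressions: list
    def render(self) -> str:
        return "RETURN " + ", ".join(self.expressions)

@dataclass
class Query:
    clauses: list = field(default_factory=list)
    def render(self) -> str:
        return " ".join(c.render() for c in self.clauses)

def generate(rng: random.Random) -> Query:
    # keep appending clauses of the correct type until we decide to stop
    query = Query()
    label = rng.choice(["Person", "Movie"])
    query.clauses.append(MatchClause("n", label))
    query.clauses.append(ReturnClause(["n", "count(n)"]))
    return query

print(generate(random.Random(0)).render())  # e.g. MATCH (n:Person) RETURN n, count(n)
```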
Is the interest in graph databases a focus for you, or is that like your specialization, or is it just part of the bigger landscape of automated testing you're interested in?
Yeah, I feel like I really just stumbled into this field. I really just wanted to write a bachelor thesis in automated software testing. And my supervisor had this idea for a while of trying to test graph database management systems in a more in-depth way. I guess I inadvertently became an expert, but it wasn't really my focus at the start. And I'm curious where you think the future will take you as you continue to explore this or what's next for you?
That is a very good question. It could go many different ways. I could continue on testing graph database management systems because it is fun. I could also switch to a different domain. I already have some plans for maybe different projects.
But it could also be that I just become a software engineer and leave academia behind me. But we'll see where it takes me. Earlier, you'd given a sort of compliment to Java, given its structure and those sorts of things. I guess the JVM gets some credit here; it's harder to make that crash.
Do you have opinions about other languages? If you were maybe going to start a technology company, what are the languages you'd recommend adopting? That probably depends on a lot of different factors. Like, for example, depending on how big the system is going to get, depending on how experienced your developers are going to be, depending on how really fast you want it to be.
Like, for example, I see Neo4j written in Java. I see how huge this company is; there are so many people collaborating on it. And obviously Java is probably not a bad choice there, but you do notice the performance drop-off, especially when you compare it to databases written in C and C++.
Obviously there's a lot of hype around Rust right now, which combines this performance with really just this comfort of knowing that your code is essentially correct and cannot have certain kinds of bugs. But of course, with Rust come slower development speeds. So building the same application with similarly talented developers would probably take a lot longer if they were to do it in Rust compared to Java. But of course, I cannot speak on that in general. I don't have too much experience in the industry, but it's really just a trade-off between a lot of different factors that ultimately decide which language one should choose.
So we got your thoughts on language. If you were going to include a graph database at that company, how would you make your selection? I would say if things are customer-facing, I would go for Neo4j. The thing there is, I would say it's a trade-off between maturity and performance.
So Neo4j is very mature, it is very secure, you don't really have to worry about people being able to attack the database. Whereas if you were using something else like FalkorDB, it's a lot, lot faster. We collected those statistics when testing Dinkel, just how much faster FalkorDB really is compared to Neo4j.
But you do have this possibility of having crashes, which could be used to DoS your system, or there are even CVEs on remote code execution in those database systems. So I would say if it's customer-facing, I would go for Neo4j. If it's just internal and not really a threat to security, I would go for FalkorDB though, because it is just a lot more performant.
Good decision tree, because I know a lot of data people will do some offline, backend, internal-only ML project. Exactly. Yeah. Especially seeing how much FalkorDB is also advertising grounding LLMs in graph database management systems. So I think that probably also has a lot of potential to be used there.
So there's no doubt that large language models can do a certain amount of testing for you. But I also feel like I've seen their limits pretty early on as I've asked it to review code and stuff like that. Do you have a vision for what the...
landscape looks like in the future with the balance between large language models, reading code and more formal systems like you work on? That is very difficult to predict, especially seeing just how volatile the AI industry currently is, especially LLMs. It feels like every day there's new breaking news. So you never know if it's going to suddenly skyrocket in performance and be an absolute code guru or if it's going to stagnate and
really not become any more useful. I guess the thing there really is just the importance of accountability and security. I can see how using LLMs in coding projects can be extremely useful in accelerating development, making the process a whole lot easier. But in the end, it is pretty much just running untrusted code if no one has reviewed it.
And there have been cases of like ChatGPT producing Python code that essentially just directly imports malware. And I've heard of a lot of people telling me how they wanted it to generate some pretty simple code. And in the end, there was like a very obvious edge case it missed that could have easily led to bugs being introduced. I guess in the end, I can see it becoming really useful in certain areas.
But in others, I myself advise caution. But yeah, we'll see if there's maybe tomorrow a new paper on how to do it perfectly and everything I said was invalid.
Well, it would be surprising to me because at the end of the day, we've always got the halting problem. Exactly. Yeah, that's true. But yeah, I guess also just in the end, it's probably just important to really have a person behind the code. Just imagine like in the worst case, if let's just say Google introduces some new code generated by an LLM into the backend of your search engine and it takes everything down.
Or into Chrome? Yeah, for example, yeah. Puts new code into Chrome that pretty much results in Chrome just becoming unusable for users. And then who's to blame in the end? I guess that is probably also something that I'm guessing companies put a lot of focus on. Really making sure that, you know, there is someone to blame in the end because it is... Sure. Yeah, you always need a throat to choke. Exactly, yeah.
Well, Celine, it's been great chatting with you. Is there anywhere listeners can follow you online? You can find me on LinkedIn, just under Celine Wüst, W-U with an umlaut, S-T. Otherwise, I mean, thank you for having me. It's been a pleasure.