So I'm Wei. My full name is actually Weidong Yang, but Wei is easier to pronounce. And I'm the CEO of Kineviz, a visual data analytics company. And I love coffee. I think civilization started with the invention of coffee, so I have to drink coffee. I do add milk, because black coffee is a little bit too strong for me.
Welcome back to another MLOps Community Podcast. Today, we are lucky enough to have not one, but two graph experts who have been doing this for a very long time. I got schooled. I felt like I learned a ton about how to use graphs as tools and ways that we can leverage them better. Let's get into this conversation with Paco and Wei. As always, I'm your host, Demetrios. And you know what is a huge help? If you can hit...
a little review on whatever you are listening to this on, that would mean the world to me. Boom, let's jump into it. And oh yeah, if you are one of those people that is listening on a podcast player, I have got the recommendation for you. For our music reco, this is thanks to one of the
"We Are One" by Maze. Listen at your own pace. We were talking about PII and using...
different methods to anonymize data, right? And Paco, you had said something that I didn't fully understand. And then Wei, you said something else that I didn't fully understand. So maybe we can rehash that and I can understand it the second time.
Awesome. Well, I was going to ask if you all ever came across another podcast that I follow called The Dark Money Files. It's run by a couple of consultants who have worked in banks and understand a lot of the ins and outs of financial crimes and investigations. And I was just going to preface things with it, because they've had a great series recently. Have you ever heard of this thing called a SAR, a Suspicious Activity Report?
And the laws are really weird depending on what country the bank is in. But basically this, if you're at a bank and you see some suspicious activity, like there's a money transfer and the counterparty is like a known terrorist group or something, you see something weird going on. Okay, number one, you have an obligation to report a crime to a criminal investigation unit. If you see something suspicious and you don't report it, that's a crime. Yeah. If you see something suspicious, you have...
Not an obligation, but a responsibility to send it up the chain so that other financial houses might share it. But if you share too much information, you might get sued.
So there's these reports, and it costs on average about $50,000 to process each one. So you don't want to generate too many of them, and machine learning models could generate thousands per day, which would be tens of millions of dollars of liability. So there's this whole space of, what do I do? I'm getting attacked, and what do I do? Because these people are taking money, and under some situations, as a bank, you might have to compensate them
if there is some kind of scam. So you could be losing money and facing legal threats from three sides. And meanwhile, there's this thing called a SAR. I've actually been yelled at for asking how I was supposed to integrate with something. I was like, can I see what the schema is? No, you're not allowed; it's too confidential. So it's like,
It's just this whole can of worms about what you actually do once you have evidence of financial crime, or even suspicion of it. The next steps you take are really tangled. And I think, Weidong, you probably have a lot more experience with this in certain theaters too, so.
I have some similar experiences where even the schema is not allowed to be seen, because the schema may actually reveal some secrets, or certain activities may create liability for certain parties. So that can be pretty tricky. So it basically gives away information: if you were looking at it, because you know the schema, you can guess a few other
parts of this puzzle and get information that people don't want out there? The banks are using a lot of data that comes from providers. There may be other cases where there's data coming from, say, public sector agencies, crime investigations. There may be intelligence reports. And so there may be parts of the schema that are highly sensitive and only certain people are allowed to see. But you were saying that
With graphs and anonymizing that PII, you're still able to gather insights, right? Yeah, that was cool. We were just in a talk where Brad Corey from NICE Actimize was showing how they're preparing to do RAG, and they were using, I think, Bedrock.
And they know that they've got a hot potato. They know they've got a lot of customer PII that just can't go outside the bank. So what they were doing is substituting PII with unique identifiers they generate, tokens they generate on the fly.
And then they make the round trip after they've run through LLMs and made a summary, and they replace the tokens with the highly confidential material that they just have internally. And so this is a way of being able to use some sort of external AI resources, but still manage a lot of data privacy. That's cool.
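To make that round trip concrete, here is a minimal sketch of the pattern, assuming a hypothetical call_llm() client and made-up regex patterns rather than any vendor's actual API:

```python
import re
import uuid

def tokenize_pii(text, patterns):
    """Replace each PII match with an opaque token; keep the mapping internal."""
    vault = {}
    for pattern in patterns:
        for match in set(re.findall(pattern, text)):
            token = f"<PII_{uuid.uuid4().hex[:8]}>"
            vault[token] = match
            text = text.replace(match, token)
    return text, vault

def detokenize(text, vault):
    """Restore the confidential values after the LLM round trip."""
    for token, original in vault.items():
        text = text.replace(token, original)
    return text

# Mask an SSN-like pattern before an external summarization call.
doc = "Customer John Q., SSN 123-45-6789, disputed a wire transfer."
masked, vault = tokenize_pii(doc, [r"\d{3}-\d{2}-\d{4}"])
# summary = call_llm(f"Summarize: {masked}")  # hypothetical; external model sees tokens only
# print(detokenize(summary, vault))           # PII restored internally after the round trip
```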
Yeah, I've seen that. We had these folks on here from Tonic AI, and they were talking about how they would use basically the same information but swap things out. So if it is someone's name, they just change the name, so it went from Paco to John.
And if it is a social security number, they would swap it out and totally randomize the number, but it still is a social security number. So at the end of the day, you get almost like this double blind: even if you're a data scientist looking at the information, you can understand it, but you don't know if it is the true information that would reveal that PII. Interesting. Yeah.
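And a hedged sketch of the irreversible variant described here, with an invented name pool and digit scheme standing in for Tonic's actual method:

```python
import random

FAKE_NAMES = ["John", "Maria", "Chen", "Fatima"]  # illustrative pool, not a real product's

def synthesize_ssn(rng):
    """Random digits that still look like a social security number."""
    return f"{rng.randint(100, 999)}-{rng.randint(10, 99)}-{rng.randint(1000, 9999)}"

rng = random.Random(42)
record = {"name": "Paco", "ssn": "123-45-6789"}
masked = {"name": rng.choice(FAKE_NAMES), "ssn": synthesize_ssn(rng)}
print(masked)  # same shape as the original record, but no real PII survives
```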
Although I do see situations where even the structure of the document itself reveals information that you do not want people to know. In the investigation space, very often you do not want the people being investigated to know that they are being investigated. So even the structure of the document being revealed can become a problem.
So at some point I felt like an in-house, on-prem LLM might be necessary, especially since I just read news that the M3 Ultra Mac Studio with 512 GB of RAM can run large language models at 20 tokens per second. That could potentially be...
Yeah, an interesting solution for that. Yeah, I mean, for our end use cases, like 60% of those are air-gapped. And the largest chunk of that, there are going to be a lot of public sector agencies running in SCIFs. So they can't send any data out, yeah.
And there's good news for running really interesting LLMs on local hardware. There's a lot of really good news. I will shout out to my friends over at Useful Sensors, Pete Warden and company. I'll put that in the chat. You can do a lot with local hardware. Yeah. What are they doing? Useful Sensors. So Pete Warden...
and Manjunath Kudlur, they were part of the TensorFlow team at Google. And for, I think, like eight years, they evangelized the use of deep learning inside of products at Google, like internally.
And then they left, and the team has a startup in Mountain View now. What they're showing is, hey, here's like $50 worth of hardware, an Arm chip with a neural network accelerator on it, and we can run three LLMs on battery power. It's pretty cool, because they came out of the TinyML world. I don't know if you've ever seen the conference. Oh, yeah.
And so, you know, this is a lot of the specialty that Pete has. And Manjunath, you know, he was on the CUDA team at NVIDIA before. So, I mean, these folks really know how to make AI infrastructure run on hardware, and particularly how to handle a lot of low-power and low-latency kinds of situations,
and where to punch through the bottlenecks. You don't necessarily have to have a ginormous GPU cluster, although in some cases it helps. But especially when you're running inference, you can be running on much lower power and doing really interesting things out in the field. So wild. Now, I know that we had originally wanted to chat a bit about this idea that, I think, Wei, you had proposed, and it's a little bit of
a differentiation on GraphRAG. So maybe you can set the scene for us, because I want to go deeper there. Yeah. I run the danger of pulling us way too far afield. Fundamentally, I think with LLMs, the whole way machines process information has changed.
Before LLMs, everything was exact, symbolic: exact matching, all the APIs, all the rigid data structures. Just think about Deep Blue beating chess. Everything was rigid knowledge encoded as rules. LLMs changed everything, because LLMs started to understand things on a contextual basis,
to understand fuzzy things. And they suffer the same weaknesses as a human being: not exact. We glide over information, we draw conclusions, we make leaps and jumps. But at the same time, an LLM's ability to reason like a human, that for me has fundamentally changed how we approach computing.
And so in applying LLMs to analyze documents,
my feeling, my analysis, is that now we can let the LLM work more like a human rather than like the machines we understood in the past. That also implies which data structure is preferred for an LLM. I would argue for a data structure, a data management approach, that preserves as much contextual information as possible, as much nuance as possible, because the subtle nuances may turn out to be important.
So I use the example: my wife is Brazilian. An American tourist in Brazil gets invited to a house party. The invitation says the party starts at 6 p.m. So, as a good American, she's promptly on time at 6 p.m. And the hostess comes out, still wrapped in a shower towel and totally confused.
And so, right. It turned out that over there, 6 p.m. is when the hostess starts thinking about the party, starts going out shopping, preparing food, and getting ready. People usually don't show up until two or three hours later. So that's the cultural difference.
Yeah, right. If we try to capture that in a knowledge graph, what kind of construct allows us to capture those subtle cultural nuances there? And that might become important in understanding the document later. So I think that's the challenge. Paco, you want to add something? I'd like to hear what you think.
Well, from a perspective of natural language, something that the models bring in, but it's kind of a nuance and I don't think it's talked about a lot.
There's a very recursive nature to how we as people talk with each other and tell stories and share information. We do reference it in the sense of like going down the rabbit hole. Like if you follow a thread too far, you're kind of going down the rabbit hole. And there's this very recursive nature of how we think and especially how we express. It certainly comes across in written language, although we tend to think of written language as something linear. There's paragraphs and sentences and it can all be diagrammed.
But when you look at the actual references that are inside of those sentences, they're making recursive calls throughout a story, throughout somebody's speech or throughout a book. And, you know, we can try to linearize that and come up with like an index or a bibliography. But at the end of the day, it's a graph. And you get this very self-referential thing in any text. And this is something that the LLMs have really, I think, pulled out,
And in the talk we were just in, Tom Smoker from WhyHow.AI was also showing how they leverage ontology, they leverage schema, and chase after information recursively. So that's just another kind of view on this. But,
Wei, I love how you all are approaching this. You have a very powerful view of kind of relaxing the constraints up front, but then having the context propagate through. I realize there's an important difference in philosophical approach between East and West. Western philosophy very much drives towards the nature of things.
And it's important: that curiosity about the nature of things, the desire to have a definitive definition of the nature of something, is what led to the great scientific discoveries of the past several hundred years. Eastern philosophy, on the other side, is focused on the contextual, focused on the shifting, changing nature of things.
Like the Taoist bible, the Tao Te Ching. The first verse says 道可道非常道, "the Tao that can be spoken is not the eternal Tao", meaning if you name something, you get it wrong; it's not permanent. It's really focused on the impermanence of things. It focuses on how everything changes its nature in context with other things. So that is essentially a graph.
Now, you're putting both things together. So, okay, I have to say that the attitude of, oh, everything changes, thus we cannot say anything definite, thus everything is fuzzy, very much contributed to why Chinese science and technology, having developed very far by about a thousand years ago, then stalled.
A lot of that is attributable to these philosophical attitudes, which reduce the curiosity and the drive to go deeper into the nature of things.
However, there is a practical application of that approach. Today, with LLMs and graphs, we really see a great combination: you allow certain things to be drilled down, to be very definitively defined, clearly defined within the context, but you allow a lot of contextual information to stay fuzzy.
So in fact, I'm really excited about integrating Senzing and our GraphXR together as a solution, because Senzing helps to drive this definitive part. Once you have the definitive part drilled down, named, defined, it really speeds things up: you can make a lot of assessments fast, definitive, and precise, which is crucially important.
But on the other hand, you allow this loose structure of information, decomposed as a graph, that you can easily retrieve
without losing the nuances, the subtlety, like in that cultural difference example; you still preserve that. When those things come together, my feeling is, that is how you want to ground the LLM: to be precise, accurate, and to know its limits. To know when it does not know,
and not to make a judgment. I think that's also very, very important. So in my mind, graphs and AI right now present an opportunity to let this Western way of driving at the nature of things and this Eastern way of focusing on contextual information come together to solve practical problems. Very well said. And, you know, the challenge we face is we don't really know what the downstream application will be.
Like we're doing investigation. We're doing some kind of discovery, whether you're trying to find, you know, money launderers, or whether you're trying to find who's my best customer for this hotel. It's a discovery process. And by the nature of discovery, you don't know what the answers are. In fact, in a complex system, you don't even know where or how; it's unknown unknowns, right? So by preserving that context, you are sort of fortifying yourself, right?
so that when the time presents itself, you'll be able to make the right discoveries. You won't have cut them off in advance. If you go back to before relational databases came out, to some of the earlier writings from Ted Codd, one of his colleagues was William Kent, who did
a book called Data and Reality. If you go back to some of the early, like 1970s thinking about data management, it's really interesting to see where the lines are drawn because in this Western view, so much of data management was about, let's have a data warehouse. Let's pretty much throw away the relationships. Let's focus on the facts.
We have, as we were saying, a very Western view of: I just want to know millions of facts, and I will piece them together with a query. I'm not really interested in preserving the context. So, I mean, I think we have a long history, from data warehousing, of going too far to the Western side.
What is interesting to me is the conversation we had with Robert Caulk on here, probably three months ago, and how he said they've completely thrown out ontologies. For his specific use case, that isn't the way they wanted to go. And I wonder if you guys have thought through that, and what that looks like, what the benefits are,
And is it one of these things where you potentially are experimenting on those levels too? In my perspective, ontology is important, but you have to know the boundaries. I'll give a parallel to theories in physics, like Newton's laws.
Newton's laws are important. They capture important truths about nature. However, and I'm a physicist, with any physics theory, the moment the theory is proposed, a very important concept is that it is willing to be disproved. So you never accept it as the truth of everything. You have a theory,
Well, Paco is a mad scientist, so I think he's also very familiar with the concept. When you propose a theory, it has to hold, but you're always looking for situations, looking for the boundaries, where the theory stops being true.
So I don't think ontology is anything different. Ontology needs to be very well grounded. The context needs to be defined. And within this context, this ontology's knowledge is real. It's truth.
The problem I see with a lot of traditional knowledge graph approaches is that people ignore the fact that an ontology has to be confined within a specific domain. The moment you step out of the domain, you have a problem.
On the other hand, we think domain ontology is fantastic. It helps you solve problems so much faster, so much more precisely. Again, as long as you can define the boundaries, define the domains, it's great.
You know, what Rob Caulk and Elin Törnquist and others at AskNews are doing is looking at news sources, especially regional news sources across the world. And they really are finding hard evidence, groundbreaking evidence on the ground, literally, if you're doing ESG research
and you're trying to do due diligence on a company or a set of suppliers, and you want to find out, like, what are their operations really like over in that other country where they're based? And then you find out they're engaged in, like, I don't know, child labor or something, and, you know, you want to make other arrangements before your shareholders find out. So I think with Ask News, you know, they're out and they're looking, they're working with those publishers and they're collecting that news and representing it in a graph.
And yeah, as you were saying, I mean, ontologies really don't work across domains. You really want to focus more on like closed world within a domain. Having a full enterprise-wide ontology, nice idea, but I rarely see it work.
And I think that in the case of like understanding news reports on the world, you don't know what the domain is in advance. You only know this is what is being published. And so I think by relaxing that constraint at Ask News, they're able to come up with a graph of like, here are things that are related. You can follow this evidence and you can find more historically about this area.
I think those are very important, but ultimately it will be shaped by some kind of context, some type of shared definitions. And ontology is really more about sharing definitions and making sure we're describing the same thing. Because I swear, you go to a big company, use the word customer in front of one VP in sales, it means something different to the VP in charge of procurement. So even the words themselves don't cross domains. Yeah.
the graph is basically our idea that we know there are connections. If you do have your operations data, but then you also have your sales data, you know there are some connections across there. It's not exactly the same, but some stuff is connecting. So graphs show where those connections are. But think about the example of Google Maps, right?
Like there's different levels of detail. And of course, any video game, of course, has this too. But, you know, if you're taking satellite data and like trying to stitch together a map, you zoom in and you can see the beach and you zoom in, you see the car tracks and you zoom in further. At some point, you're going to get to pixels, right? Yeah.
And you zoom out and maybe you see this landscape of like a beach next to the ocean. But then probably you zoom out at some level and they've got like the name of the beach. Right. So there's like a high level detail. I think graphs are much the same. There are connections at the low level, like Ask News is saying is like, you know, here's reporting from Zimbabwe. This is like the reporters on the ground. But then you zoom out and you're like, OK, well, you know, what impact does this have on our supply network?
Do we have to really make different plans? Is there going to be like a war breaking out that causes, you know, all those shipping containers to be delayed by three months? I think at some level you need to think of the graphs as sort of collecting higher and higher into more abstracted, more refined concepts, if you will.
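One way to read that zoom-out in code, as a rough sketch using networkx with a toy graph standing in for anything real: contract each detected community into a supernode, the way a map abstracts upward.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.karate_club_graph()  # stand-in for a large low-level graph
communities = louvain_communities(G, seed=1)

# One coarse node per community; weighted edges where members connect across.
membership = {n: i for i, c in enumerate(communities) for n in c}
coarse = nx.Graph()
for u, v in G.edges():
    cu, cv = membership[u], membership[v]
    if cu != cv:
        w = coarse.get_edge_data(cu, cv, {"weight": 0})["weight"]
        coarse.add_edge(cu, cv, weight=w + 1)

print(G.number_of_nodes(), "->", coarse.number_of_nodes(), "nodes after zooming out")
```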
And so the stuff at the low level is kind of like, let's see how it all fits together. The stuff at a higher level is like, oh, actually, we can maybe do some inference on this, or we can use this to help structure other data that we're going to piece together. So, Demetrios, you actually touched on a really big subject, that things are now...
In the exploratory process, what comes up is the questions. Knowing what question to ask is often 80 to 90% of the work. So prescribed things that give you the answer often miss the point, or miss the important subtleties.
But the problem is how do you discover the question you need to ask? And so in the way that our brain, our perception, our visual perception, our brain is a fantastic tool
I don't want to call it a machine, or even a tool, but it has this great power of seeing patterns in information. We look out at the sky, we see the clouds, we have some concept. Or, like, you are a performer; I look at your performance, your dance, and there's information being expressed
without your being able to verbalize it, to define it. You have to watch it to feel it. Maybe if you watch long enough, you start to be able to describe it. You start to be able to say, oh, something is there. So in a way, the graph is a fantastic medium for visualization.
You look at the information expressed, and it's just like how our brain works. When we think about you, Demetrios, I immediately think about Paco, because we're in the same room together. So that's association. Yeah.
So this association of multiple pieces of information, entities in a space, if you visualize it effectively, helps you to see the patterns, the missing links, the things that get our attention. And then we start to be able to formulate the question, and to answer the question.
So more than a tabular data structure, I have to say, the graph really helps us to engage our brain in this way, to spot important information. Just go watch a dance performance. You see something definitive happening, but you know it before you engage your language or logical thinking.
Afterwards, concepts start to form, and then you can start to build things around them. Oh, dude. How cool is that? You know it before you can express it in that way. Absolutely. I think a lot of analytics workflows work the other way around. We focus so much on building up the queries, building up the
programs to drive it, to drive the answer. But as Paco and I, and everyone in the investigative space, all know, too often getting the hint is 80% of the work. If you know that you're being attacked, you know that they came in through some vector, there's probably some set of machines that are compromised.
You're not seeing that. You're seeing where, you know, the bad things are happening, stuff is being stolen or whatever. So looking across your network, just building up a graph of like the associations of what's happening during an attack. There are some placeholders. There are definite questions that could be generated, like which machine was compromised? Maybe I should fix that. So I think from the operational perspective, you know, I mean, you kind of have to think of, I mean, we do think about that, right? We do think about like, how do we identify those unknowns?
But the problem is that the more complex the problem becomes, the more that those unknowns are not something that can really be charted. They have to be sort of poked at and explored. Yeah, and I think that's why, Wei, what you're saying with the graph being this visual medium that we can poke at and we can explore.
And it gives us a different perspective with which we can work with and wrestle with the data is something that I hadn't heard before, but it makes complete sense.
From a historical perspective, in terms of data, something to bring up would be spreadsheets, because spreadsheets are sort of my go-to example. It's all in tabular form, very sort of, you know, left brain. Everything is very buttoned down. But the thing about spreadsheets that you never see is that there is a really complex graph behind them. And they only work because of that.
But they never show that. They just show the tabular part. But all the real knowledge and dynamics, all the real information you're capturing in a spreadsheet, is about those dependencies and how that graph functions. Classic. Of course we don't see it, because that would be absolute chaos for us, right? Mind blown. The graph is this front-end medium for this perceptive thinking.
Well, the challenge is, when we talk about graphs, I think we need to really separate two things: graph as the medium of information capture, and graph as the medium to help us think. They're two different things. Graph as information capture: the sole purpose is to capture information
as precisely as possible, as completely as possible. You want to capture as much truth as possible. However, for graph as a way of thinking, if you take the raw graph as captured, it may
preserve a lot of truth, but the problem is we can only hold about seven pieces of information in our brains at any given moment. We'll be overwhelmed by all those graphs. If we think about our brain in that way, even the vector embedding, I call it an implicit graph, because a vector embedding gives you a medium to compute similarity. Effectively, you can construct a graph. Yeah, you can construct a graph on it. Yeah.
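A small sketch of that implicit-graph point, with random vectors standing in for real embeddings: cosine similarity, thresholded into explicit edges.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 64))                    # 10 items, 64-dim embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize so dot product = cosine
sim = emb @ emb.T                                  # pairwise similarity matrix

G = nx.Graph()
for i in range(len(sim)):
    for j in range(i + 1, len(sim)):
        if sim[i, j] > 0.2:                        # manifest an edge where similar enough
            G.add_edge(i, j, weight=float(sim[i, j]))
print(G.number_of_edges(), "edges manifested from the embedding space")
```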
Exactly. You can manifest a graph out of it. So you will see that the graph captured at that layer, at that stage, is really designed to preserve the ground truth, as much truth as possible. But then you need a way to work the data
into a form that we can easily digest with our perceptive power. That is the challenge. This is also why, in my mind, with a lot of graph work, in theory people know the graph is how we think, and thus it is important. But in practice, that is a barrier.
How do you reconcile the need for graph as an information capture medium with graph as a medium to support our perceptive thinking? They're very different things.
Just going back to what you were saying about how we can relate to each other because we're on this podcast together. We've done stuff together. Maybe there are certain things that come up in our memories that are going to be the most pertinent to that graph that we have in our heads. But it's never going to expand more than seven hops, or seven different parts of that graph.
Have you ever worked with this? There's a kind of rubric, I guess you might say, that came out of Carnegie Mellon, out of CMU. Jeanette Wing had this idea of what's called computational thinking. It's sort of a four-step process of breaking down a problem and then being able to abstract it back out. It's really powerful, and I've used it a lot in courses teaching people. But I think that there may be something
kind of emerging as like graph thinking. And so just to throw out like a straw man here, this is kind of thinking out loud, but one of the things that we see in like fin crime in financial investigations is a kind of graph thinking, a four-step process repeated over and over where
you do your best to build out this graph, and it might have hundreds of millions of nodes or billions of nodes, some ginormous number, something beyond human scale, beyond human comprehension. But then, step two: partition. Can we break this enormous graph out into areas, subgraphs, of patterns that are interesting? Like, hey, this looks like a really good customer, or hey, this looks like a money mule fraud scheme.
And so you do this dimensionality reduction, because you go from like 5 billion nodes in a graph down to maybe 10 or 20 subgraphs that are interesting. There are graph algorithms like Louvain or, you know, weakly connected components; there are different ways to get down to that scale.
And in machine learning in general, we're looking at a lot of dimensionality reduction, right? So once you've gotten down to that scale, now you can use other graph algorithms, like maybe betweenness centrality or different forms of centrality, to understand how these parts are connected. And gosh, maybe there's one node in there that's orchestrating the whole crime ring, which is typically the case. There might be a person with a bunch of shell companies, right? And they're doing fraud.
So step three is leveraging certain types of graph algorithms, think of PageRank, to bubble up to the top the parts that are probably the first good places to investigate.
And then step four: put it through a work process. If you're working with people in a bank, put it through case management tools. A level A analyst gets assigned it. They go and they start poking around the graph. They do something interactive. They work with the visualization and they apply what they've learned. Or you may have some agents involved there too, to help summarize and take on parts. But it's a work process. So it's kind of a four-step
process of sort of graph thinking, if you will, that can be applied and can integrate people and also AI technology together.
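A toy walk-through of those four steps, with networkx standing in for a billion-node platform and a print statement standing in for case management:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Step 1: build the graph (a bundled toy network stands in for real transactions).
G = nx.les_miserables_graph()

# Step 2: partition into candidate subgraphs.
communities = louvain_communities(G, seed=7)
flagged = max(communities, key=len)   # pretend the biggest community matched a fraud pattern
sub = G.subgraph(flagged)

# Step 3: rank within the subgraph to surface a likely orchestrator.
centrality = nx.betweenness_centrality(sub)
lead = max(centrality, key=centrality.get)

# Step 4: route into a work process for a human analyst.
print(f"Open a case: start with {lead!r} in a community of {len(flagged)} nodes")
```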
I want to add one more thing to what Paco said. It's really, really important to be able to narrow things down, to be able to identify things and reduce, reduce, reduce. But there's also another aspect, which is simplification, abstraction. Very often when you capture the data, you don't really know the domain, or you don't yet know the future question. So the domain is wide.
But when we look for information, for an answer, the domain is narrowed. When the domain is narrowed, for example: I call Paco a mad scientist. At some point, I can just refer to Paco as the mad scientist. I don't need to add more information, because the mad scientist is Paco. And that only works in a specific domain.
So the reason I say that is because of what happens when the domain is wide. When you capture information, I prefer what I call a pure edge approach. In the graph,
an edge has no properties. It's just an edge, just an association. Anything that needs a property means it's something you may need to amend later, maybe have something point to it or from it, so you keep it as a node. Now, in your thinking, very often, "I know Paco", that relationship, can carry a lot of context in it.
I don't need additional information to tell how I know Paco. It can just be in there; "I know Paco" itself is sufficient. So what that means is, we present "I know Paco", that relationship, as a single relationship, right?
In the data layer, there might be thousands or tens of thousands of pieces of information behind it. But it comes out as one single piece of concise information. I think that is where an analytic workflow, a visual analytic workflow, should go: to be able to start from very detailed, broad, large information and distill or aggregate it
down to a simple representation that is grounded in that particular domain, in that particular context. So we can communicate in simple language rather than carry a lot of information when we don't have to. I know Paco. That's it. We don't need to know how we know each other, or where we know each other from, in a given context. Is it almost like the...
data underneath is like an iceberg, in a way, and you knowing Paco is like the tip of the iceberg? You have that one piece of information, but then if you wanted to get more granular, you could go down and see the whole iceberg. Yes.
Could we say then that, you know, we pull everything, we connect everything together. It's very noisy. We can go up different levels of abstraction. But to your point then, we're going up levels of abstraction in particular domains, like for purpose. So we have some shared definitions. And then we can start to say, okay, now let's do our Louvain partitioning or whatever. Then we start to like drill down into subgraphs. It's like maybe a five-step process.
Even with a Louvain community calculation, or any centrality calculation, the graph has to be simple, because very often the graph we talk about is what I call a multigraph, a multi-domain graph. It has different types of information in one graph. So computing centrality
on that kind of hypergraph is very challenging. And what does the result even mean if you mix humans and emails? It's difficult. So that process itself, to me, means we already need to prepare, to transform, our graph data
into a form that is suitable for that centrality computation. Very often, you have to already project into a specific domain for that computation to happen. Very good. That's what I was thinking: the data that you have
only becomes relevant once you've narrowed it down in a certain way. And you're looking at a certain plane of that domain and you say, okay, now we're going to be focusing in on this plane. That's when certain nodes and certain data and certain connections become relevant because you're looking at that layer almost. In my head, if I visualize it, and we're talking about that Google Maps image
example, again, you're diving deeper and deeper, and you see different structures depending on the layer that you're looking at. And this fits very well with data mesh kinds of concepts, you know, Zhamak Dehghani talking about how different domains share. You have to abstract, you have to come up with the relations. I think Chad also has the idea of contracts, you know, where you have relations across domains, and
So you share some definitions. You have to condense down to that level before you can go across domains. So, yeah, if we use the domains in an organization to kind of guide when and where and how we condense down, then we can really take advantage of this kind of abstraction. But it's almost like, I realized after I said it,
there are two vectors, or two dimensions, that you are looking at when you are zooming in or zooming out, because you're playing on the field of granularity, but you're also playing on the field of the domain and what is relevant in that domain. So if we have that X and Y axis, you can get more granular values
inside of the domain, but then you can also just go on the X-axis and change domains. And so that, like a kaleidoscope, when you turn it, you see a whole different set of relations. Yeah. And I mean, in an enterprise context, this gets really bizarre because, you know, you...
The people in the domains that you depend on may not even know that you're out there. You know, you may be consuming from some log files from another application that are like totally driving your product. So like, can we have some sort of contract so that we know about each other? But yeah, scooting across the domains, that's the key challenge to like leveraging these kinds of technologies because usually...
You are in a particular domain when you're making those decisions, but for most applications, you have to combine a couple domains, right? So it's usually like there's something interesting going on between like sales and procurement or sales and marketing or, you know, some other business unit. So usually, oftentimes, you will have to combine. And do you then try and create...
two different graphs that are connected to each other? Or is it one larger graph? How do you look at it in that regard? Well, federation sounds good. I think trying to have one ginormous graph is usually weird. And those projects usually don't ever end. But federating and being able to go across domains and say, okay, over there, let me send you something. I'd like to know what you can...
What results can you bring back? So are you making a prompt in GraphRag across a different domain? Are you making a query, running some algorithm, whatever? There's some kind of information transfer, but federation.
I can talk about a couple of my personal experiences. First, bringing information into a graph is a step forward, a step up, because information in a tabular format needs to be confined to very specific definitions in a pretty narrow domain.
With a graph... here's one example. I looked at the U.S. flight records. You can download them from the Department of Transportation; they release them every couple of weeks. The damn thing has 140 columns, I think. Really, really wide. And the reason is because a flight may get diverted, and whenever a flight gets diverted, you add about 10 or 15 columns of information.
So then you need to capture that the flight may be diverted more than once. Is twice enough? No, some are diverted three times. Is three enough? No, some are diverted four times. So they actually allow for five diversions. But if your flight gets diverted six times, too bad, it cannot exist. That's the limit of the tabular format for information capture. With a graph, that relaxes a lot.
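As a rough sketch of that relaxation, with invented field names, each diversion simply becomes another node hanging off the flight; six or a thousand diversions need no new columns.

```python
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("flight:UA100", kind="flight", origin="SFO", dest="JFK")

for i, airport in enumerate(["DEN", "ORD", "PIT"]):  # any number of diversions works
    d = f"diversion:UA100:{i}"
    G.add_node(d, kind="diversion", airport=airport, order=i)
    G.add_edge("flight:UA100", d, key="DIVERTED_TO")

print(len(G.out_edges("flight:UA100")), "diversions captured, zero schema changes")
```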
You can naturally have a thousand diversions, I don't care; the graph can just keep amending to it. So that is really a big improvement with the graph, allowing a lot more flexibility in capturing the information. And the other thing is, very often in the tabular format it is very difficult to check for mismatches. Like,
we had an example of bringing together datasets managed by two or three different departments in the same organization. Everybody knows the other department's data has a problem, but you can't force other people to fix it.
But with the graph, when you bring things together, you immediately see the mismatches. We had one example of a company that spent a couple of years and could not reconcile the data, but once they brought the data into a graph, they started to see the mismatches. In one month, they fixed the data problem.
But they started to see the mismatches because of the dependencies? Because, let's see, you know the records are unique, right? But then when you link the other records together, you see, oh, this record is actually duplicated in other systems that recorded it differently. Somebody made a mistake there. Yeah. We see that a lot with entity resolution, where you think a social security number should be unique.
But then you're bringing in data from some other sources. And there was an application where maybe early on the product manager said, yeah, we need to collect the social security number. And then later on they said, oh, no, we can't do that. Just put it in, you know, a dummy number.
And so now you've got this dataset that has, you know, 5,000 instances of the same social security number. So once you start putting it in a graph, you're like, wait, isn't that supposed to be unique? How come there's this enormous node with all these things connected to it? Something's wrong. So it's really also a great way to figure out data quality issues. Yeah.
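A quick sketch of that data quality check, with a made-up dummy value standing in for the real data: link records to their SSN nodes and flag any SSN whose degree is implausibly high.

```python
import networkx as nx

records = [("rec1", "123-45-6789"), ("rec2", "987-65-4321"),
           ("rec3", "000-00-0000"), ("rec4", "000-00-0000"),
           ("rec5", "000-00-0000")]  # imagine 5,000 rows sharing one dummy number

G = nx.Graph()
for rec, ssn in records:
    G.add_edge(rec, f"ssn:{ssn}")   # bipartite: record nodes linked to SSN nodes

for node in G:
    if node.startswith("ssn:") and G.degree(node) > 1:
        print(f"{node} shared by {G.degree(node)} records -- supposed to be unique?")
```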
Although there's security. I mean, going back to what we were talking about before, if you are looking in financial investigations, if you're looking at sort of a criminal investigation, okay, maybe you've got some open data, like here's, you know, sanctioned shell companies or whatever. And then maybe you've got some private information like customers, but maybe you've also got some feeds of like, oh yeah, here's an active investigation. We're looking at these people. But then these particular people,
They have immunity because they're diplomats. So there's all these different levels of security. And you start to pull it all together in a graph, you get a very comprehensive view. Maybe not everybody can even see that. You don't want the police officers who are doing parking tickets to know that XYZ diplomat might be investigated for a crime. That information should not go out. Yeah.
So where do you draw the line? Because the graph really brings it all together. But then how do you handle security issues? Yeah, access control with a graph is automatically harder than with a tabular, relational database. Well, it feels like, with what you were talking about, the ways that you visualize it, you can...
almost create different access controls on the visualizations. I don't know if you've thought it through that way, but is that kind of how you go about it? So fundamentally, access control needs to be in the data management layer. If the database can support access control, you're great.
We have, however, run into situations where the database does not have sufficient access control to support the business need. In that situation, we actually had to implement a filter layer in the data access: when we pull the data from the database, depending on the roles and capabilities
of teams, we actually prohibit certain information from being accessed. But that's not a fundamental solution. The fundamental solution has to be in the data management layer. It's a hard problem. In previous work, which was more like knowledge graphs being used for large-scale manufacturing,
You know, one of the things we ran into is security access because you take like procurement data plus some operations data plus some sales data, put it all into a graph. Suddenly you have a picture of like how the company works, but it's like a really confidential picture. It's like maybe the board could see this, but nobody else in the company should see it. So there's a real power there, but there's always a risk.
And how you manage that is a mind-bogglingly difficult problem. I read a book talking about certain intelligence communities when they go to other countries. In the past, you would use falsified identities.
But today that's not a good idea anymore, because of all the open source intelligence out there. Even if you want to withhold some information, people can stitch together a picture from related pieces of information sitting out there on social media. Maybe there's a picture of you with somebody, a picture you did not take and did not post, but somebody posted it on Instagram.
All that information out there essentially is a graph, and it can link back to you even though you try really hard to stay hidden. That's a fundamental problem in terms of privacy and security, or when you want to control access to information: all those connections in the graph make it really, really hard.
And a corollary with that, when I talk with, you know, people in enterprise who are doing large-scale knowledge graph practices, the one thing that I keep hearing over and over again is companies using graphs for market intelligence or maybe sometimes you would say competitive intelligence. But, you know, a lot of this might be for like sales win-back strategies, trying to understand who's the competitor that got our bid away from us. How can we go back and try to, like, you know,
give them a better quote. Oh, wow. And so I've heard this over and over again. We're like, that's one of the first graphs that starts making a lot of money is like literally doing intelligence inside the enterprise. Yeah, I was going to go down that route of like, let's talk about a few other cool use cases that you have seen, whether it's just graphs or it is graph rag, which is a hot term these days, you know.
I mean, you know, it's interesting. There's a lot of graph database vendors and they really kind of lean heavy on the graph query side of how to run this. And that's something that's very familiar with people in data engineering, data science, you know, using a query. But I think in the graph space, there are other areas that aren't query first, like using graph algorithms or using, there's a whole other area of,
what should be called statistical relational learning, but you know, you've probably heard of like Bayesian nets or causality or different areas over there of using graphs. But then there's also graph neural networks. Like how can we train deep learning models to like understand patterns and try to suggest, Hey, I'm looking at like all the contracts you have with your vendors. And I noticed that these three here are missing some terms. Do you, you know, is that a mistake?
So I think that, you know, there's the queries, there's the algorithms, there's the causality, you know, that area of, there's also the graph neural networks. There's a few other areas too, but these are all like different camps inside of the graph space. They don't always necessarily talk with each other, but I think it's really fascinating now that we're starting to see more and more hybrid integrations of them.
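A tiny illustration of two of those camps on the same toy graph: a hand-written, query-style pattern match versus letting an algorithm like PageRank surface the hub without one.

```python
import networkx as nx

G = nx.DiGraph([("shell_co_A", "person_X"), ("shell_co_B", "person_X"),
                ("shell_co_C", "person_X"), ("person_Y", "shell_co_A")])

# Query-style: who is pointed at by more than two shell companies?
hits = [n for n in G
        if sum(1 for p in G.predecessors(n) if p.startswith("shell")) > 2]
print("query-style match:", hits)

# Algorithm-style: PageRank surfaces the same hub with no hand-written pattern.
ranks = nx.pagerank(G)
print("top by pagerank:", max(ranks, key=ranks.get))
```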
Yeah. I'd like to point out that fundamentally graph and table are two sides of the same coin.
As a physicist, we look at sound, at music, both from the frequency domain, like is it C, D, E, F, what's the frequency distribution, and also from the waveform, the time domain. In some situations you want to filter or access things more in the frequency domain; sometimes it makes more sense in the waveform domain.
It's the same data. A graph, essentially, is a join, I call it. And if you think about the large language model, the neural network, it's a graph,
but it's also a gigantic, extremely sparse matrix, which is a table, right? And the fact that it's such a giant sparse matrix is why NVIDIA is so hot today: NVIDIA has these GPUs that can process those matrices. But guess what? My brain consumes about 19 watts of energy, while
a GPU running a large language model consumes tens of thousands of watts of energy to meet similar computational needs.
That's extremely inefficient, even though the compute units are much smaller than my neurons and you'd think they should compute at higher efficiency. That's precisely because they're dealing with an extremely sparse matrix. They're not treating the neural network
as a graph; they're treating the neural network as a matrix, and that's fundamentally the problem for power efficiency. There are certain models coming out that really treat AI as a graph, with several orders of magnitude of savings in energy consumption. Now, in real-world applications, one of the reasons graph hasn't taken off, as we've all been thinking for the past 20 years, like, oh, graph doesn't take off, graph doesn't take off... and no, it did not.
The fundamental problem is that we are so familiar with all the tools and methodologies, the workflows. Everything is well established in the tabular way of thinking. The Department of Transportation does not release the flight data as a graph; they release it as a table. It's easy to access, and we have all the mature tooling. To change that is extremely difficult.
So in a way, I would argue that AI is almost made for graphs, because AI suddenly allows you to process unstructured information, like emails, reports, podcast transcriptions like this one, videos, into a structured form that a computer can access. But guess what?
It is a graph that AI converts that data into. Some people argue, I think, that 80% of information exists in unstructured form; some argue the percentage is even larger. So AI suddenly makes the majority of information available for analytic workflows and assessment.
And the funny thing is, it needs a graph to do that. So my assessment is that because of AI, because of GenAI, we're actually entering the boom, the exponential growth era of graphs, because of the availability of the data. It's like the Internet of Things. We've been waiting for it to happen since
2010 or 2005, whenever, and it's always just around the corner. But now it does make sense that if you have all of this unstructured data and you have these relations, then that sounds like a graph to me. Yeah.
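A hedged sketch of that conversion step, where extract_triples() is a stub that any LLM client could back; the canned JSON stands in for a real model response.

```python
import json
import networkx as nx

def extract_triples(text):
    # prompt = f"List (subject, relation, object) triples as JSON: {text}"
    # raw = call_llm(prompt)   # hypothetical model call
    raw = '[["Wei", "is_ceo_of", "Kineviz"], ["Kineviz", "builds", "GraphXR"]]'
    return [tuple(t) for t in json.loads(raw)]

G = nx.MultiDiGraph()
for subj, rel, obj in extract_triples("podcast transcript goes here"):
    G.add_edge(subj, obj, key=rel)   # unstructured text, now queryable structure

print(list(G.edges(keys=True)))
```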
And going back to like 1980s era, hard AI, you know, whether we're talking about like A star, B star kind of algorithms or talking about planning systems, all of these were expressed as graphs. And like, you know, some of the early thinking that was like pre-Google that led to Google, they were talking about graphs. Some of that work actually came out of like groupware, but based on graphs. So it's there. Yeah.
Funny you say that because we had one of the talks at the AI Quality Conference back last year was from the guy who created Docker, Solomon. And his whole talk was really like, everything's a graph. If we really break it down, it's just, it's all graphs and how one thing relates to another thing. I'll throw something else in to kind of go back to our early part. We were talking about East meets West and
There's a book, a real favorite of mine from the early days, going back to the early 90s, the early days of neural networks, about this idea that, yeah, there are some conventions in the West, and maybe we can back off them. It's by a USC professor called Bart Kosko. It's called Fuzzy Thinking, and it's sort of his critique of science, but through a lens of more Eastern perspectives.
I know that this book is like more than 30 years old, but I think that there's some really great perspectives there that weigh in a lot, especially what we were saying about like, where are we now with LLMs and how are we leveraging this in the context of graphs? So I think the other thing, was there anything else that you guys wanted to talk about before we jump? I know there's a lot of cool data visualization stuff that you're doing. Yeah, I just want to add one thing.
I just want to say that visualization is not the end. The goal is to support analytics. I know that when it comes to graphs, everybody talks about graph visualization. But in my mind, what we really need is visual analytics:
how can we visually transform the information? How can we go from information that was suited for data management, for data capture, and work it step by step towards information that's suitable for presentation, for answering specific questions in a particular domain?
Those steps require a transformation of the data. It's not just a filter; fundamentally, it's a mutation of the graph schema. The schema you have for data capture is not a schema suitable for presentation. They are two different things.
If you think about the big data era, MapReduce allowed you to have this step-by-step flow of information from the originally captured tabular format into a final, very different table that you can present. In graph, it's the same thing:
what graph analytics needs is a step-by-step, we call it a calculus, or operators, to transform your data from the form in which it was captured to the form you want to present to answer the question. Now, that
calculus, I think, needs to be in two forms. It needs to be in a form where you can process data in large quantities, a large graph, mutating step by step. But it also needs to be visual. You need a parallel set of operators that a data analyst,
or ideally a domain expert, not somebody who can write Python or Cypher queries or GQL, but somebody with the domain knowledge, can use to look at it, because graphs are so visual. You're like, hey, I want to simplify this. Oh, I know:
if Paco and Wei have so many meeting points, let's abstract that out. Let's just create a single relationship that we infer, like Wei and Paco know each other, and get rid of all the other information.
Or this might all say, hey, Paco knows like a million people. Maybe I underestimate Paco a little bit, so sorry about that. No kidding. You probably know more than that. But from the graph, we can quickly compute this number, put it on the Paco node, and make Paco very, very big, because Paco knows a million people.
Right. So that kind of operation is highly intuitive. So I want to stress this: visualization for graphs is not the end. Visualization for graphs is a tool you use to transform the graph to get you the answer. It's a waypoint. Very good. Yeah, that is very much in line with what you were saying earlier, about how not knowing the
question is sometimes the hardest part. So being able to wrestle with the data in different forms, one being visualizing it in different ways, is one tool to hopefully help you get to the answer, or first, to the question, which can then lead to the answer you're looking for. Yeah. And to mutate the graph visually.
So you can start poking it. Yeah, exactly. It does feel like the ability to just mutate the graph is such a strong tool.
because of all these different reasons that we had mentioned, when it comes to the depth and the way that you're able to look at the domains, or you're able to just find anomalies or different data quality issues, whatever your use case is. It's very cool. It does sound, though, instinctively a bit manual, right? Yeah.
So far. I think Wei has brilliant examples of what they're doing with GraphXR: leveraging 3D visualization, zooming in, zooming out, in conjunction with algorithmic ways, using graph algorithms to sort of focus the lens, focus the searchlight. I think more can be automated over time. And maybe this is where agents come in, actually helping determine how to be the cinematographer there on the graph. Yeah.
So there's definitely a way of helping you look at perspectives. Very often we deal with data that has a graph-connected nature but is also dimensional. Each node has so many properties, and each property is a dimension. So it's high-dimensional information.
So which dimension set do you want to take, in combination with the network information, to help you see? You want a versatile, flexible way of choosing the dimension set. Very often, when you shift from one dimension to another, you reveal some flocking, things that go together; some clustering starts happening, and you say, hey, those things always move in the same direction.
Those signals help you formulate a lot of ideas, instincts from the data. And when you see that information, the next thing you want is, hey, I want to capture that as a feature. Can you represent what you see as a feature, so that it becomes a thing, an entity, in your visualization?
And put it back in the graph. Yeah, it can go back in there. That is visual analytics. Whoa. So capturing it as a feature, and then you can feed it into the tabular data, in a way. Yes, exactly.
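A rough sketch of those two mutation operators: collapse many co-occurrence edges into one weighted relationship, then write a computed feature (weighted degree) back onto the node so Paco can be drawn bigger.

```python
import networkx as nx

raw = nx.MultiGraph()
for meeting in ["podcast", "conference", "dinner"]:   # many meeting points
    raw.add_edge("Wei", "Paco", context=meeting)

# Operator 1: abstract parallel edges into a single weighted "knows" edge.
simple = nx.Graph()
for u, v in raw.edges():
    w = simple.get_edge_data(u, v, {"weight": 0})["weight"]
    simple.add_edge(u, v, weight=w + 1)

# Operator 2: turn a computed value into a visible node property (e.g. size).
for n in simple:
    simple.nodes[n]["size"] = simple.degree(n, weight="weight")

print(simple["Wei"]["Paco"]["weight"], simple.nodes["Paco"]["size"])
```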
Guys, this is awesome. Is there anything else that you want to hit on before we stop? I feel like I've learned a ton just from talking to you all. I knew it was going to be a great conversation. I was hanging on to my seat this whole time, like, oh my God, I'm learning so much. Yeah. In terms of cross-domain, I want to share one funny example of how difficult cross-domain is. And in this example, it's extreme cross-domain.
I organize Kinetech Arts, a dance and science nonprofit. One thing we do is, every Wednesday, we bring people from the engineering and science domain and people from the dance, art, and music domain together. We explore something together and have a conversation. The very first meeting, when we brought people together, happened about
11 years ago. We had about 20 people sitting in the room, everybody in a very vibrant conversation. And then I suddenly realized something: it's true that everybody speaks English, but nobody can understand each other. They're using the same vocabularies, but because of the domain difference, just like Paco talked about earlier in the enterprise setting, the words mean totally different things.
When a physicist talks about energy, we have very concrete things that we call energy. A dancer means something very different by energy. When computer people talk about Python, we're not talking about a snake. But when the dancers hear Python, they're like, "Why are you bringing a snake into the conversation?"
So I think, just like what Paco said earlier, in the enterprise data context the domain is very, very important. Be aware of the domain, know the limits of the domain, and find a way to cross domains. For us, it's generally a lot of conversation. I think it's a human problem, not a technical problem. Well, technology can help, but only so much.
We had a conversation on here a few months ago with folks who had created a data analyst agent, and they said one of the hardest parts for the success of this agent was to first create a glossary of business terms for the agent, really trying to nail down
these fuzzy words, these words that for one person mean one thing and for another person mean another thing. The quintessential example of this is an MQL: when you're at one company, an MQL is one thing; when you're on one team, an MQL is one thing, and when you go to another team, an MQL is another thing.
They all mean marketing qualified lead, but when does a person become a marketing qualified lead? What do they have to have done, or what stage are they in? The agents and the LLMs kind of understand what an MQL is, but you really have to flesh out this glossary to let them know
all of these different terms that you use and that are in your database. So when the agent needs to go and pull how many MQLs we had last week, it understands what that means. Yeah, that's your semantic layer right there. That's a controlled vocabulary, and you put enough of these together, you get your ontology. Yeah, yeah, yeah, exactly.
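A minimal sketch of that glossary idea, with invented definitions standing in for anyone's real business rules: resolve the team-specific meaning before the agent touches the database.

```python
# Invented example definitions; real glossaries come from the business itself.
GLOSSARY = {
    ("marketing", "MQL"): "lead who downloaded a whitepaper and opened 3+ emails",
    ("sales", "MQL"): "lead a rep has accepted into the pipeline",
}

def resolve(term, team):
    """Give the agent the controlled-vocabulary definition for its context."""
    return GLOSSARY.get((team, term), f"UNDEFINED: ask a human what '{term}' means here")

print(resolve("MQL", "marketing"))
print(resolve("MQL", "support"))   # falls back instead of guessing
```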