
Networks and Complexity

2025/6/14

Data Skeptic

People
Kyle
Topics
Kyle: I think scalability is critical when evaluating algorithms. When the input data doubles, the algorithm's runtime should stay within an acceptable range as much as possible. However, many graph algorithms run into scalability problems on large-scale data, which creates plenty of challenges in practice. I find sublinear algorithms ideal for graph problems, provided the problem can be expressed with a sublinear-time algorithm. Linear-time algorithms are also very useful, but more complex problems may require polynomial-time algorithms, which usually demand more powerful computing resources and engineering solutions. The hardest are the NP-complete problems: verifying a solution is relatively easy, but finding one is very difficult. Even when facing NP-complete problems, though, I don't think we should give up easily. In practice, you usually don't need the exact optimal solution and can accept an approximation instead. By sacrificing some precision, we can use more efficient algorithms to get close-to-optimal solutions. For that reason, I think graph problems are among the most worthwhile problems to study and solve in industry. Understanding a problem's difficulty level, combined with the expertise of cloud engineers and software engineers, can help us develop scalable solutions to real-world challenges. I firmly believe that in the field of graph algorithms human expertise remains indispensable, and that AI won't be able to fully replace us anytime soon.


Transcript


You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere. Welcome to another installment of Data Skeptic: Graphs and Networks. As a lot of you know, we find the vast majority of our guests through an algorithmic process.

We crawl sites like the arXiv. We have a machine learning prioritization process. But there's some more traditional search on my part too, especially when I feel there's maybe a gap in our coverage. And one of the gaps that I wasn't able to fill with just the right guest was someone to talk about complexity theory.

specifically, obviously, in the context of graphs and networks. So for anyone not deeply familiar with complexity theory, I'll give you a pretty quick crash course. One of the main questions you want to ask about an algorithm is how it scales. If I double its input size, is it going to take twice as long? Maybe worse, maybe four times as long, a hundred times as long.

In rare cases, like binary search, you can double the input size and the time to find your solution doesn't go up by that much. It's more of like a log of n term. So any question we want to ask of a graph, any algorithm we want to run on a network, more often than not you hear about people bumping into scalability challenges, specifically with graph problems.

Although that does not have to be the case. Let's start our tour with maybe some of the fastest algorithms: sublinear time algorithms. So what does sublinear mean? Everything is about the asymptote in computational complexity theory. So we have an input size, which we're always going to call n. However, that's a little confusing with graphs. Sometimes you need n to mean nodes, or v to mean vertices, or e to mean edges.

It's true graphs can grow in both ways. They can grow in node count and in edge count, and that's going to affect some algorithms differently. As the graph gets bigger, how does your algorithm perform? Well, a sublinear method means the runtime grows slower than the input itself: double the input size and the runtime goes up by less than double, because you never even have to look at every element. Now that sounds really great. You can get fast answers to questions about graphs, but only if you can articulate that question using a sublinear time algorithm.

So this could be something like sampling. If you want to know the average number of incoming edges to a node in a directed graph, yeah, you can take some subsample. What's the average of that population? Maybe bootstrap it, do that a couple of times and rely on the central limit theorem. You'll have a pretty good estimate of the average incoming link, or indegree, of your network. But you're just relying on statistics there.
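
Here's a minimal sketch of that sampling idea in code. It's an illustration, not anything from the episode: the in_degree dict format, the sample size, and the bootstrap count are all assumed for the example.

```python
import random
from statistics import mean

def estimate_avg_indegree(in_degree, sample_size=1000, bootstraps=50):
    """Estimate the average in-degree by sampling nodes instead of scanning all of them.

    in_degree: dict mapping node -> number of incoming edges (assumed input format).
    """
    nodes = list(in_degree)
    estimates = []
    for _ in range(bootstraps):
        # sample with replacement, average the sample, repeat a few times
        sample = random.choices(nodes, k=min(sample_size, len(nodes)))
        estimates.append(mean(in_degree[v] for v in sample))
    # the central limit theorem says these bootstrap means cluster around the true average
    return mean(estimates)
```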

If you cared about, let's say, the max, well, the only way to be sure you've got the max is to look at every element. And the odds that you're going to sample exactly the max are pretty low. There are various sketching and streaming algorithms that can do neat things for you. For example, estimating the number of triangles in your network. A triangle being the case where three people are mutually friends or mutually connected. I should say three nodes are mutually connected.
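
For the triangle example, one common trick is wedge sampling: count the length-two paths ("wedges") in the graph, sample some at random, and see what fraction close into a triangle. A rough sketch, with the dict-of-sets graph format assumed for illustration:

```python
import random
from math import comb

def estimate_triangles(adj, samples=10000):
    """Estimate the number of triangles in an undirected graph by wedge sampling.

    adj: dict mapping node -> set of neighbours (assumed input format).
    """
    nodes = list(adj)
    # number of wedges centred at each node: choose 2 of its neighbours
    wedge_counts = [comb(len(adj[v]), 2) for v in nodes]
    total_wedges = sum(wedge_counts)
    if total_wedges == 0:
        return 0
    closed = 0
    for _ in range(samples):
        # pick a wedge uniformly: a centre weighted by its wedge count, then two of its neighbours
        centre = random.choices(nodes, weights=wedge_counts, k=1)[0]
        a, b = random.sample(list(adj[centre]), 2)
        if b in adj[a]:
            closed += 1
    # each triangle closes exactly three wedges
    return round((closed / samples) * total_wedges / 3)
```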

Well, after sublinear, there's linear time algorithms. And if your question can be expressed or answered by a linear time algorithm, you should also count yourself in luck. There's a whole bunch of classic and still very useful algorithms that can run in linear time. And that's what you usually see written as V plus E, or vertices plus edges. So count the number of nodes, count the number of edges, that's your N in terms of defining the size of your network.
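
As a concrete picture of what linear time looks like on a graph, here's a minimal breadth-first search sketch (the dict-of-lists adjacency format is just an assumption): each vertex and each edge gets handled a constant number of times, which is exactly where the O(V + E) comes from.

```python
from collections import deque

def bfs(adj, start):
    """Breadth-first search over an adjacency-list graph, O(V + E).

    adj: dict mapping node -> iterable of neighbours (assumed input format).
    Returns the set of nodes reachable from start.
    """
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj.get(node, ()):
            if neighbour not in visited:   # each edge is examined at most twice
                visited.add(neighbour)
                queue.append(neighbour)
    return visited
```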

Breadth-first search and depth-first search. Both big O of V plus E. Topological sorting, connected components, cycle detection, checking bipartiteness, and a slew of other things I hadn't heard of until I started doing this research. All questions you can efficiently answer about a graph. One could argue this is why Google was able to scale up to web-scale crawling with breadth-first search.

Or why Facebook was successful at doing things like connected components analysis. Now if you're a true complexity theorist, there are a thousand fine-grained differences here, and I've already skipped over a bunch. But if we're going to hit the highlights, we go from linear to polynomial time algorithms. Those that run in, let's say, n squared, n cubed, n to the 4, something like that. So as your graph size goes up, your compute time goes up by a lot.

Yet maybe in a practical way, in a way that arguably we can scale up and find a way to make cost efficient. Calculation of all-pairs shortest paths, the Floyd-Warshall algorithm: if you're working in some, let's say, topological system or maybe navigation system, it's probably useful to pre-calculate the distance between all points. Or the max-flow algorithms that you might encounter in certain optimization problems.
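
To make the polynomial-time tier concrete, here's the textbook Floyd-Warshall recurrence for all-pairs shortest paths, cubic in the number of nodes. The adjacency-matrix input format is assumed for illustration.

```python
import math

def floyd_warshall(weights):
    """All-pairs shortest paths via Floyd-Warshall, O(n^3).

    weights: n x n matrix where weights[i][j] is the edge weight from i to j,
    with math.inf where no edge exists (assumed input format).
    """
    n = len(weights)
    dist = [row[:] for row in weights]      # copy so we don't mutate the input
    for i in range(n):
        dist[i][i] = min(dist[i][i], 0)     # a node is at distance zero from itself
    for k in range(n):                      # allow paths routed through intermediate node k
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```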

PageRank and eigenvector centrality, graph partitioning, and triangle counting, exact this time, as opposed to the approximation I proposed earlier. All of these and more can be done in polynomial time in the worst case. This is the level where, in practice, if you're really truly scaling something up, you probably need an engineering team and a novel cloud solution. If your data is small enough, maybe you can cram it all into one machine, a beefy machine with lots of memory. Otherwise, you'd want to partition the graph into subgraphs and parallelize the work across machines.

If your graph is especially sparse, that parallelization process is pretty easy. However, unlike a lot of other big data challenges, it is rarely true that you can split your graph into neat subgraphs. First of all, it's probably not so sparse. Most real-world graphs are highly connected, or even if they're not, they still have that one giant connected component, the largest component of the graph, which surely will encompass...

an enormous number of nodes if you're dealing with some big data problem. It's not clear we can just cut up that connected component and push it out to infinitely parallelizable machines. We can approximate things that way, but even at scale there are non-obvious solutions to be developed; often something novel is required in industry applications. It's no surprise that a lot of the big players doing stuff with graphs end up inventing new things along the way.

And if our quick tour of complexity theory that started with sublinear, then linear, then polynomial time algorithms has an obvious next visit, it's NP-complete graph problems. These are a very special class of problems, and it relates to one of the biggest open problems in computer science: P versus NP.

In today's data-driven world, the ability to extract value from data isn't just an advantage, it's essential. Mastering analytics can transform both your career and the organization you work for. It's your turn to transform your career and drive organizational success through analytics. Let me tell you about the Scheller College of Business' Business Analytics Graduate Certificate at Georgia Tech.

It's 100% online. Scheller College ranks in the top 10 U.S. business schools for busy business analytics professionals. They have a world-class faculty that can help you graduate in as little as a year.

But maybe you're busy like me and you want to take it a little slower. You can combine flexibility with rigorous education. Scheller's Graduate Certificate Program adapts to your life, not the other way around. Their program is designed for professionals like us who want to leverage data and solve real-world business challenges, but need flexibility with their time and schedule.

That's why you can schedule your classes in a way that makes sense to you. On top of that, you're not just earning a certificate. You're potentially opening doors to Georgia Tech's prestigious MBA programs. Now is the time to become a data-savvy leader with Georgia Tech's Business Analytics Graduate Certificate. Applications are open for Spring 2026.

Visit techgradcertificates.com to learn more and apply before the August 1st deadline at techgradcertificates.com.

Building multi-agent software is hard. Agent-to-agent and agent-to-tool communication is still the Wild West. It's clearly the emerging future, but how do you achieve accuracy and consistency in non-deterministic agentic apps? That's where agency comes in. They have a very clever spelling. Here's how it goes.

A-G-N-T-C-Y. I'll give it to you again in a minute. The agency is an open source collective building the Internet of Agents. The Internet of Agents is a collaboration layer where AI agents can communicate, discover each other, and work across frameworks. For developers, this means standardizing agent discovery tools.

seamless protocols for interagent communication, and modular components to compose and scale multi-agent workflows. Build with other engineers who care about high-quality multi-agent software. See where you can fit in this ecosystem. Visit agency.org and add your support. That's A-G-N-T-C-Y dot O-R-G.

It's NP-complete graph problems.

These are a very special class of problems, and it relates to one of the biggest open problems in computer science, P versus NP. That's all very nice, but what is it? It's a particular type of problem where finding the solution seems pretty hard, but verifying the solution seems quite easy.

There's tons of examples of this. I think maybe the traveling salesman problem is the most popular, maybe the go-to in a lot of lectures and courses. And that suits me just fine because the traveling salesman problem is a graph problem. You have a variety of destinations. Think of them as cities, vendor demos, whatever it is. It's a physical point of contact between the salesman and their customer. There's no zoom allowed.

and the traveling salesperson wants to find an efficient route. Now, in some worlds, that's easy. Imagine you live around a perfectly round lake. There's one ring road surrounding it. Your only real debate is do you go clockwise or counterclockwise on your tour?

And if you think of only these simple examples, it's easy to convince yourself that this problem is readily solvable. Well, it's obviously solvable. The traveling salesman can find a complete route. Can he or she find the most efficient one? That's the challenge.
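
To see why the general case blows up, here's what brute force looks like: try every ordering of the stops and keep the shortest. It's correct, but the factorial number of orderings is exactly the wall the rest of this discussion is about. The coordinate-pair input format is an assumption for the sketch.

```python
import math
from itertools import permutations

def brute_force_tsp(points):
    """Exact travelling-salesperson tour by exhaustive search.

    points: list of (x, y) coordinates (assumed input format). With the start
    fixed, there are (n - 1)! candidate tours, hopeless beyond a dozen or so stops.
    """
    n = len(points)

    def tour_length(order):
        return sum(math.dist(points[order[i]], points[order[(i + 1) % n]])
                   for i in range(n))

    best = min(permutations(range(1, n)), key=lambda rest: tour_length((0,) + rest))
    return (0,) + best
```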

Or more specifically, can they do that in the average case, not our sleepy lake town with one ring road around it? I actually somewhat prefer the analogy of a jigsaw puzzle. When I think of a jigsaw puzzle, I often think of maybe there's a brute force component to it.

I can see parts of the image, I've got certain clues, but there's no algorithm that's going to work for all puzzles. Except maybe just that brute force, try every combination of every coupling of pieces until you've got it. Yet once you've got it, once you assemble the jigsaw puzzle, if you show it to someone, they will very quickly recognize if you have completed the puzzle or not. And this is true of all NP-complete problems. Formally we call it a witness string.

And let's get into formalities instead of my loosey-goosey definitions. So an NP-complete problem is, first of all, one whose solutions can be verified in polynomial time, which is what puts it in NP. And that's very specific: if it's something you could verify even faster than polynomial time, that problem is going to have other qualities about it that almost certainly won't land it in the NP-complete category. Okay, so: easy to verify, or at least as easy as a polynomial time algorithm.
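
Here's what that verification step can look like for the decision version of the traveling salesperson problem: hand me a claimed tour (the witness) and a length budget, and I can check it in polynomial time, even though finding such a tour is the hard part. The input format is assumed for the sketch.

```python
import math

def verify_tsp_tour(points, tour, budget):
    """Polynomial-time check of a claimed TSP witness.

    points: list of (x, y) city coordinates; tour: proposed visiting order as
    indices into points; budget: the total length the tour claims to stay under.
    (All assumed formats for this illustration.)
    """
    # the witness must visit every city exactly once
    if sorted(tour) != list(range(len(points))):
        return False
    # total length of the closed tour, returning to the starting city
    length = sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
                 for i in range(len(tour)))
    return length <= budget
```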

But then the key caveat is there are no known polynomial time solutions. You can verify a solution, you can't reliably always find one in polynomial time. Or at least that's what almost everyone believes.

No one's proved that yet. That is the crux of this central problem P versus NP. It seems almost ridiculous for P to equal NP, yet to date we lack either the proper mathematics to express such a proof, or a clever enough human to articulate the solution in the mathematics we already have. I've called these NP-complete problems a class.

And if you go and study it, you'll learn about the polynomial time reduction between them, this very formal way that we say one equals the other. We're going to skip that here, and I'm just going to give you a list of some of the famous graph problems that all bear this property. Easy to verify, hard to solve. The traveling salesperson, like we talked about. The Hamiltonian cycle. Graph coloring. The sparsest cut problem. Maximum clique. And subgraph isomorphism.

I wish we'd done a formal episode on each of those wonderful topics. Let's just talk about subgraph isomorphism.

It's like asking whether a small pattern exists inside of a larger graph. So think of a graph, maybe your social network, and now let's randomly but consistently swap out all the names you know with fictional names. If you watch me do that process, it's easy to say that I made a one-to-one mapping of your real-world social network onto my fictional characters.

But what if you wanted to pose the question, does anyone else out there on the same social network as me have the same precise structure of friends? Now, if I only have two friends and we're all mutually friends in this little triangle, it'd be easy to identify that there are other triangles out there. But in the vast network of people, the question becomes: does my exact pattern, my local area network, so to speak, have an equivalent structure out there

somewhere inside the larger graph. If there is, if you bring me that solution, I can verify it quite quickly. Just compare the two; there's a small sketch of that check after this paragraph. But where I would even begin looking for such a match, I don't know what good approaches might exist. Whatever they are, as far as anyone knows, none of them get the correct answer in polynomial time. Because if they did,

we could use that same solution to solve tons of other problems. So does this make things hopeless? When we bump into NP-complete problems, do we just give up? With so many graph algorithms and methods for answering questions about graphs in the NP-complete class, does that mean graphs are just inaccessible, overly complicated objects, not worth our analytical time? Well, hardly, obviously.
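
Here's what that quick check can look like for subgraph isomorphism: given a claimed mapping from pattern nodes to host nodes, verifying it is straightforward polynomial-time bookkeeping; it's finding the mapping that's NP-complete. The dict-of-sets graph format and the mapping argument are assumptions for the sketch.

```python
def verify_subgraph_isomorphism(pattern_adj, host_adj, mapping):
    """Check a claimed subgraph isomorphism witness in polynomial time.

    pattern_adj / host_adj: dicts mapping node -> set of neighbours;
    mapping: dict from pattern node -> host node (all assumed formats).
    """
    # every pattern node must be mapped, and to a distinct host node
    if set(mapping) != set(pattern_adj):
        return False
    if len(set(mapping.values())) != len(pattern_adj):
        return False
    # every edge of the pattern must also appear between the mapped host nodes
    for u, neighbours in pattern_adj.items():
        for v in neighbours:
            if mapping[v] not in host_adj.get(mapping[u], ()):
                return False
    return True
```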

The truth is, at least in industry, you rarely need that confirmation of an exact solution. And if you can relax that, NP-complete problems aren't as cumbersome as they might otherwise seem. Okay, maybe I can't find the perfect route for the traveling salesperson to follow. But what if I could get that route epsilon close to optimal, 99% optimal?

Or, better put, at least 99% of the way to the optimal answer. And by taking on that slight compromise, you're able to use an admittedly lesser algorithm. It doesn't get you an exact solution, but if it can quickly, or at least efficiently, get you one that's 99% close, we ought to be able to do something with that. So if anything, maybe the takeaway is that graph problems are the most interesting ones to chase in industry.
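
As one small illustration of that trade-off, here's the classic nearest-neighbour heuristic for the traveling salesperson problem. To be clear, this particular heuristic carries no general 99% guarantee; the figure above is illustrative. The point is simply accepting a good-enough tour in polynomial time instead of searching for a perfect one.

```python
import math

def nearest_neighbour_tour(points):
    """Greedy approximate tour for the travelling salesperson problem.

    points: list of (x, y) coordinates (assumed input format). Runs in O(n^2)
    instead of trying all (n - 1)! orderings; usually decent, never guaranteed optimal.
    """
    unvisited = list(range(1, len(points)))
    tour = [0]                               # start arbitrarily at the first point
    while unvisited:
        last = points[tour[-1]]
        nearest = min(unvisited, key=lambda i: math.dist(last, points[i]))
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour                              # visiting order; close the loop back to the start
```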

It isn't as simple as dumping them into some big data solution and relying on a vendor to do all the interesting stuff. If you can understand these classes, what problems are truly hard, and what viable approximation methods can be deployed,

and you can collaborate with the cloud engineers and software engineers, or take on those hats yourself, in order to deliver scalable solutions that solve some real-world problem, then the inherent difficulty of managing graphs makes you a critical asset, and one that something like an LLM isn't likely to replace anytime soon. Or at least that's my two cents on it. But regardless of the size of the graph data you might have the opportunity to work on now or in the future,

One of the challenges you'll likely have to face is how to scale up your analysis. If the network is of any consequence, and likely to be worth looking into, it's got to have a lot of nodes and edges. Some of the questions you might think are obvious to answer about it, things like calculating shortest paths, or maybe some hotshot in your organization who loves buzzwords thinks you should calculate PageRank and use it in some clever way.

Those tasks can be easier said than done. Being able to articulate why these challenges exist, and the fundamental limits coming from complexity theory, can make you a better communicator within your organization and, hopefully, help you deploy a graph-based solution.