
GitHub Network Analysis

2025/6/22

Data Skeptic


People

Asaf

Gabriel Ramirez

Kyle

Topics

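Asaf: I think organizational network analysis should not be treated as a strict report card, and developers and project managers should not be expected to fit a fixed pattern. Network analysis and centrality measures are not one size fits all, and a dashboard is not the right tool, because every organization and every network is different. Quantitative analysis can point to places that need investigation, but qualitative research is still needed; this is more about organizational health than employee success. If senior experts or subject matter experts turn up at the periphery of the network, it may mean they are not passing on enough of their knowledge, or that their manager is underusing them. That can bear on both organizational health and employee performance.

Kyle: Depending on the situation, an organization can choose to have its experts train others or stay focused on their own work, and the GH Explorer project can help show how to organize.

Gabriel Ramirez: I created a rich data set that includes engineers, project managers, and others who take part in the conversations around making software, not just those who commit code. The links are all the interactions between users and GitHub objects, such as creating an issue, being mentioned in an issue, approving or rejecting a pull request, and taking part in discussions. I like to see all team members working closely together. I don't think these metrics belong on a dashboard, because outside of a manager's conversations they can mean very little, or even mean the opposite. After becoming a manager I realized I was always at the center of the network, which made me see that I was a critical node: if I went on vacation or quit, the network would fall apart. That is not where I wanted to be, since a manager should enable others rather than take over their work. Numbers alone cannot quantify all the qualitative factors, people's stories, and what is actually going on. People could game the system by commenting on everything even though the comments are low quality, so the network metrics could be high while the quality of the work is low. Network analysis is one piece of a bigger puzzle, not a measurement we can base anything on.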


Chapters

This chapter explores using GitHub metadata (pull requests, issues, discussions) for network analysis to understand team collaboration. It introduces the concept of analyzing this data as a bipartite graph and using network centrality measures to reveal organizational dynamics.

GitHub metadata, including pull requests, issues, and discussions, can be analyzed as a bipartite graph to understand team collaboration.
Network centrality measures, such as eigenvector and betweenness centrality, reveal organizational dynamics.
LLMs can be used to analyze networks, particularly smaller ones, providing insights into team collaboration.

Transcript

You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere. Welcome to another installment of Data Skeptic: Graphs and Networks. Today we're exploring one of my favorite topics, metadata around Git and Git commits and pull requests and all that good stuff. The way software development collaboration takes place and how we can view that as a network.

Asaf, I assume this is right up your alley. Is this not organizational network analysis? Exactly so. And a very interesting data set to do it. Really, really cool. Yeah, it's there, right? If your organization is using GitHub, which...

And there's competitors. I'm not trying to be biased in that way, but with Git as your source control, it has a wealth of metadata. Today, we're going to zoom in more on the project management side, where, I don't know if you've used it, but there's GitHub Issues, which I've used a little bit. It's task tracking software similar to lots of other options out there, but it's integrated with GitHub.

And people can at-mention each other, so you can see who is in the same discussion threads and who is commenting on the same PRs, and all of a sudden we have nodes and edges. Something Gabriel and I get into is that even though he's a manager trying to figure out, how can I better support my team and maximize what the group is doing, there's not a strict report card. It's not like I expect developers to look like this type of node and project managers have to look a different way.

It's what suits the organizational needs at the time. This is just a view into how that network collaborates. Usually people, when they hear about network analysis and centrality measures, all kinds of measures, they say, okay, why can't we make a dashboard out of it, right? But that's exactly the thing. It's not one size fits all. Every organization is different. Every network is different. So that's why a dashboard is not the right tool for it.

You can use the quantitative part to point you to places you need to investigate, but then you need to use qualitative investigation. Sure. It's more about organizational health than employee success or something along those lines, in my opinion. Do you have a well-constructed team or do you need to rearrange it somehow?

Well, it depends. Again, back to the quantitative versus qualitative. Let's say your seniors or your SMEs, your specialists, let's say you find them in the periphery of the network. It's something you want to avoid, right? And now you need to know, is it their fault? Maybe they don't engage enough in passing on their knowledge. Or maybe, let's say you're the manager, maybe it's your fault. Maybe you...

I don't know, maybe send them to the periphery and underuse them. So it can be both the health of the organization but also performance.

Yeah, well, there's a polarity there. In one case, I could see saying, I expect my subject matter experts to be trainers and to be socializing the ideas and teaching people. Or in another situation, I say, I need those people heads down. They should just help when they need it and let the other people be self-starters. And neither of those strategies is wrong per se. But some of the tools we'll talk about with the GH Explorer project today might give you some insight into how you want to do your org.

An interesting thing that Gabriel did was use an LLM to analyze the network. I tried to use ChatGPT to analyze some networks. I fed it small, medium, and large networks and prompted it to analyze the network and give me insights. That's the prompt I gave it, a very simple prompt.

On the simple networks, especially networks it could find extra data about on the web, it gave a really good analysis, a really great analysis, plus more information and attributes about the nodes that it got from the Internet. So that was really cool. But the larger the network, the more it, I won't say hallucinated, but...

lost its chain of thought. It didn't give information about the different clusters, although there was information it could give, and sometimes it did, and sometimes it didn't. So I think it's not yet there for large networks, but for small networks, and when I say small, I mean like 100, 200 nodes, it really did a very nice job. Yeah, this is an interesting thing, to ask the language model to do some sort of network analysis, because...

It's not totally obvious that the language model is capable of that. I mean, maybe in small example cases, but if I gave you some network data, I trust you have experience to analyze it. Let's say we gave it to some college students and gave them 48 hours to look at it, and they're new to even the tooling. They can't possibly have the same level of insight. And I don't know the skill level of the LLM at this point.

I didn't either. So that's why I tried and it was really amazing. By the way, to my students, that's one of... Because, you know, everybody uses LLMs now and it's kind of... It's like using... It's like telling people who learn math, don't use a calculator, right? It's ridiculous. So I tell my students, use LLMs, analyze your networks with LLMs. What I'm asking you is to look at it...

critically. So when you get back the analysis, you need to find what's explaining it, find out what's wrong with it, why, and so on. They have to learn how to use network analysis in order to look critically at the results of the LLM. And as I said, for larger networks, you still need a human in the loop to understand what's going on.

For what it's worth, I heard recently that Sergey Brin, the co-founder of Google, was saying on the All-In podcast that he'd done a similar exercise, asked Gemini something about some data he had, and based on that, recommended an employee for a promotion. So maybe that's a nice anecdote for a podcast, but there it is. I'm glad to see that the Google founder and I see eye to eye.

All right. Yeah, let's jump into the interview. Let's go ahead. Let's do it.

So I'm Gabriel Ramirez, and I guess my affiliation would be GitHub. I am the manager for the notifications team at GitHub. And can you share a few details on what that role consists of? Basically, our team runs most of the communications that you probably do get from GitHub. So every time that you open up a pull request and you're asked to review it, or you tag someone to review their pull request, issues, discussions...

Anything that goes on Slack from GitHub and email on the web, of course, also on mobile, all those notifications kind of go through the systems that my team works on and is building currently. If we talk more broadly about, let's just say, source control, of which I guess Git is sort of the preeminent example, for listeners who aren't deeply familiar with it, what sort of background do you think they should have just to appreciate the project we're going to talk about? Hmm.

My background is actually in anthropology, sociology, and to some degree Arabic, which I studied along with my career. And this is actually how I got into network sciences. So it's really about understanding the structures that people communicate with and in.

I think that's the main prerequisite. Of course, there are ways that you can analyze. You don't have to have a specific background to get benefits. Basically, the only prerequisite I see is that you have to have the interest and be able to talk to people about what they're doing, to help contextualize some of the data that you collect.

My inspiration was actually Zachary's Karate Club. That's a really old data set that goes back to, I think, even before they had computational methods for quantifying networks, right? You kind of had people, the anthropologist, I forget what his name was, but he was just drawing lines on a piece of paper.

I think I have to fact check that, but that's what it seemed like to me, and he was trying to figure out who was talking to each other and how the network would split up due to a fight. So I think if you enjoy, you know, exploring people and exploring humans, I think this is the background that you need. Yeah.

Well, you said part of your interest in network structures emerged from your background in anthropology and Arabic and these things. And you said it as though it was so natural. It doesn't seem that that's an obvious insight. Maybe it's one you had to come to. How did you find your way to network structures from these origins?

It started off just studying sociology. It's one of the computational methods that we study. At least I did in undergraduate. I think it's UCINET or something. I forgot how to pronounce it. But that was a tool that we initially used to study networks. At the time, when I was an undergraduate, Twitter was getting really popular. We had events like the Twitter revolution that happened in Iran. And that really caught my eye. It got me interested in networks in general, right? Like in the online networks.

After that, in graduate school, when I was studying Arabic, I kind of fell into a collaboration with my roommate, who was doing a master's in computer science. And he was using NodeXL at the University of Maryland, where they developed that tool.

I was just amazed at how you could take the structures from Twitter and explore them and study them. I helped him out on a project where I helped give a little bit of context to what was happening in the Arab Spring.

from my point of view, right, as someone who was studying the Arabic language and kind of really reading those articles in kind of like a research way and then how he was doing it, right, in a network way using network sciences. So we kind of spent a week or two just exploring these things together and he ended up writing, like collaborating with other people and writing a paper about it. So after that, I was just hooked and I was like,

wow, this is the most interesting thing ever. And it kind of encouraged me to start to learn how to code. And after I graduated from my master's in Arabic and spent some time in Egypt, I came right back to computer science, took a bunch of courses, and ended up doing this kind of network analysis with open source data at the Department of State for a little bit.

After that, I just kept doing it, kept it on the back burner. Every time I wanted to learn something new about the organization I was in or just get an idea of who I was working with, I would always go back to these network structures and these network studies that I had done. And I never gave it up.

Eventually, when I became a manager at GitHub, I found the perfect opportunity to kind of explore the network that I was part of, which are my reports, basically 11 of them, and the projects. And I felt like it would give me better insights into what people were doing so I could help improve my team. And I started using it as a tool to improve myself as a manager. That's how I developed this tooling. I've been developing it over the last year and a half.

So when everyone, at some point, starts to learn to program as you did,

Maybe you encounter Git and source control on day one, I certainly didn't, but it's one of the many things you learn along the way. At what point did it become clear to you that there was a rich source of metadata available?


So it took me a while to get into Git. To be honest, I didn't know the difference between Git and GitHub. Even when I was coding in Python for the first year or two, I wasn't using source control. I was just kind of saving things into my Google Drive. Eventually, when I started coding professionally, right, I got more accustomed to Git and GitHub.

It occurred to me right away, I would say, just because I had that background in social network analysis and I knew that people had done this with email. I knew people had done this with many things related to work.

But it just wasn't something I explored. And part of the reason for that is, you know, the time and also the ethics of it a little bit, to be honest. Like if you're a peer and you're collecting data on your peers' networks and making that known, that's a little bit, I don't know, it's a little bit like a spying-on-your-neighbor type of situation. So I knew that it was there all along. I just didn't want to make people uncomfortable. But then when I became a manager, of course, your whole job is to get to know your team, get to know what they're doing. Maybe I can make the best case for my team, and improve myself as a manager, if I start doing this network analysis. That's how I slowly got into it.

It was something that I had thought about for years, but never really had the time or kind of just felt uncomfortable really getting into it. There's a strong ethical question involved for me always.

Well, for many people, you could say that a large measurable part of their work product is their commits to Git. I mean, you're still doing meetings and this sort of thing, but every day, I think a lot of professionals make many commits, maybe once an hour. And so you have this very interesting log of all the changes they made. And of course, there's pull requests and these sorts of things going back and forth. But if we're going to talk about a network, what are our nodes and edges in this case?

For my case, the data that I'm collecting, this is a bipartite network, right? So the nodes on one end are users. So that'd be my handle, your handle, the people on my team's handles. And then the other side of that network would be what I call kind of GitHub objects. And that can be three main things. I could build out more, but the ones that I use right now are pull requests, issues, and discussions.

And these are the nodes. The links are everything that happens between them. So when you create an issue, that would be a link between you and the issue. When you're mentioned on the issue, it would be a link between you and the issue that you're mentioned on. For PRs, it's the same thing: PRs that are created, PRs that are approved, PRs that are rejected in the review. I have links for those. Same thing with discussions.

I like to create a very rich data set that includes engineers who are actually making commits, but also PMs, directors, people who are participating in the whole conversation related to the making of software, the making of a team, that goes beyond just a commit. I think commits are really interesting. But for me as a manager, I also have to handle all the connections that my team is having. So that's why I go a little bit further in a different direction than just commits.
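The bipartite structure described here, users on one side and GitHub objects on the other with one link per interaction, can be sketched with NetworkX. This is a minimal illustration and not the GH Graph Explorer code: the handles, object ids, and interaction names below are made up, and a `MultiGraph` is an assumed choice so that one user can have several kinds of link to the same object.

```python
import networkx as nx

# Hypothetical interaction records: (user handle, GitHub object id, interaction type).
# All handles, ids, and type names are invented for illustration.
interactions = [
    ("alice", "issue#1", "created"),
    ("bob", "issue#1", "mentioned"),
    ("bob", "pr#7", "created"),
    ("alice", "pr#7", "approved"),
    ("carol", "pr#7", "rejected"),
    ("carol", "discussion#3", "commented"),
]


def build_bipartite(records):
    """Build a user/object multigraph with one edge per interaction."""
    graph = nx.MultiGraph()
    for user, obj, kind in records:
        graph.add_node(user, bipartite="user")
        graph.add_node(obj, bipartite="object")
        graph.add_edge(user, obj, kind=kind)
    return graph


graph = build_bipartite(interactions)
users = sorted(n for n, d in graph.nodes(data=True) if d["bipartite"] == "user")
objects = sorted(n for n, d in graph.nodes(data=True) if d["bipartite"] == "object")
print(users, objects, graph.number_of_edges())
```

Every edge joins a user to an object and never two users directly, which is what makes the network bipartite.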

Yeah, so then we've got a whole structure on top of all the code that's evolving, that's describing, you know, starting with the issue, what the organization wants to change, and then discussions therein. I'm going to reference certain people, maybe reassign it to them, and eventually get it to a pull request. So, bipartite graph, got that, pretty rich data set. What can you tell us broadly about its structure? I have some intuitions, but what are some of the key findings?

It looks like a pretty normal network where there are some nodes that are highly connected.

There are nodes that are kind of on the edges, a little bit more isolated and less connected. So it follows the same kind of power law distribution, I guess. It very clearly shows communities, especially people who tend to work together. It shows them really closely together. People who don't, they're further apart. Usually you tend to see people who are very senior at the center of the network, just because they happen to touch a lot of different issues.

The network also varies across time. Sometimes during vacation, it's obviously very sparse. Like December 25th, there's nothing there. People aren't working. But there are other times, like towards the end of the quarter or when there's heavy work happening, where the network is actually really tightly connected and you have these cases where everyone in the user network, I'll take their network centrality measurements, their eigenvector centrality, where they tend to have almost like

really similar centrality, meaning that everyone's really connected. The degree of connection is very close. And I measure my success based on how connected my network is, right? If I have a network where it's really sparse and there are people that are really centrally connected and everyone else is on the outside, I consider that a failure on my part. I like to form networks where all the reports that I work with are working together and connected in a way.

Are there any metrics that network science gives us that are interesting to you? Like maybe network centrality you kind of touched on. Is this something that should be on a corporate KPI dashboard? You know, we got our centrality up to the goals for Q3, or is it more exploratory phase currently?

It's kind of exploratory. So I can pull up, so the main, I guess, measurements that I use, obviously nodes, right? How many nodes are in the network, the density of those nodes to see how closely everyone's connected. I use eigenvector centrality to find out like the connectivity of like highly connected nodes, people that are at the center. Betweenness centrality to find out the people who are kind of like in between different types of clusters.
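The measurements named here, together with the Louvain community detection Gabriel mentions next, can be sketched in NetworkX. This is an illustration under stated assumptions, not his tooling: the edge list is invented, and projecting the bipartite graph onto users before computing centrality is an assumed step that the conversation does not spell out.

```python
import networkx as nx
from networkx.algorithms import bipartite
from networkx.algorithms.community import louvain_communities

# Hypothetical user-to-object edges; handles and object ids are invented.
edges = [
    ("alice", "issue#1"), ("bob", "issue#1"), ("carol", "issue#1"),
    ("carol", "pr#7"), ("dave", "pr#7"),
    ("dave", "discussion#3"), ("erin", "discussion#3"),
]
users = {user for user, _ in edges}

B = nx.Graph()
B.add_nodes_from(users, bipartite=0)
B.add_nodes_from({obj for _, obj in edges}, bipartite=1)
B.add_edges_from(edges)

# Assumed step: link two users when they touched the same object;
# the edge weight counts how many objects they share.
user_graph = bipartite.weighted_projected_graph(B, users)

node_count = user_graph.number_of_nodes()
density = nx.density(user_graph)
eigen = nx.eigenvector_centrality(user_graph, max_iter=1000, weight="weight")
between = nx.betweenness_centrality(user_graph)
communities = louvain_communities(user_graph, weight="weight", seed=42)

print(node_count, round(density, 2))
print(max(eigen, key=eigen.get), max(between, key=between.get))
print(communities)
```

In this toy data one person is the only route between two groups, so that person tops both the eigenvector and the betweenness ranking. On real data the two measures often disagree, which is part of why they are read side by side.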

So those are the basic network measurements that I tend to use when analyzing things. Nothing super special. I also use the Louvain community detection algorithm, with its modularity scores, to get the communities. So all those are tools that I use. Whether they need to be on a dashboard or

I don't know. I have my, I have strong feelings about that. And I think that the answer is no, mainly because when you take these measurements outside of the conversations that you have as a manager, right. As, as a team, um,

they can mean very little. They can mean two opposite things. Let me give you this example, right? When I first became a manager, I realized that I was always at the center of the network. At first, I was like, oh, great. Like, I'm important. I'm the manager. But after a while, I started to realize that I was the critical node, right? And a critical node, one that's very highly central, it's also one that's like a single point of failure, right? So if I would go on vacation or if I were to quit tomorrow or whatever...

the network would start to fall apart and that's not where I wanted to be. Right. As a manager, you're supposed to be enabling other people, not kind of like taking up all their work and sucking up all the energy in the room. So I don't think it should be on the dashboard. I think it should be a tool for, for people to ask more questions and get curious about what's actually going on. Cause I think, I think, uh,

The bare numbers, in this case, can't quantify all the qualitative stuff, right? The stories that people tell and what's going on. I can think of other ways that people could game the system, right? If I go in and comment on everything, the quality of my comments could be really low, right? I could be saying "hi, hi, hi" on 20 different things, 30 different things, but the quality is just not going to be there. But my network measurements would be really high. My centrality measurements would be really high, or my betweenness centrality would be extremely high, if I did that everywhere. But the quality of my work wouldn't be. So I think, again, it's a piece in a bigger puzzle for me, not a measurement that we can base anything on.
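The critical-node point from earlier in this answer, that a highly central person is also a single point of failure, can be checked by removing that node and counting what is left connected. This is a minimal sketch on an invented team graph; the names and the `components_without` helper are hypothetical.

```python
import networkx as nx


def components_without(graph, node):
    """Connected components that remain after removing one node from a copy."""
    trimmed = graph.copy()
    trimmed.remove_node(node)
    return sorted(sorted(component) for component in nx.connected_components(trimmed))


# Hypothetical collaboration graph: the manager touches everyone,
# and only two reports also work directly with each other.
team = nx.Graph()
team.add_edges_from([
    ("manager", "alice"), ("manager", "bob"),
    ("manager", "carol"), ("manager", "dave"),
    ("alice", "bob"),
])

scores = nx.betweenness_centrality(team)
central = max(scores, key=scores.get)
parts = components_without(team, central)
cut_nodes = set(nx.articulation_points(team))

print(central, parts, cut_nodes)
```

Here the most central node is also the only articulation point, so taking it out splits the team into three disconnected pieces. A high centrality score alone would not have shown whether that fragility was there.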

Well, as a manager, you knew your team, although not omnipotently. Maybe this is a data set that saw more than you could see and getting a summary could bring new insight. Did it mostly have the structure you expected or was there anything novel and unexpected about the results?

So one of them was vacation. Vacation was surprising to me. If someone who's really central takes vacation, I thought that the network would kind of fall apart and then recover. But what I noticed is that when people take vacation, new connections start to form. Looking at it in hindsight, I was like, of course they would, because that central person isn't there to talk to anymore. So people need to talk to each other. People need to review different pull requests to push work forward. But

I kind of underestimated the importance of that. And it kind of reminds me of that one really famous book in engineering, right? The Phoenix Project. There's like a person who's like really central to the project and then he takes vacation and somehow it doesn't go so well, but eventually it does because they're not at the center of the network anymore. They're not the blocker.

In the same way that I saw that happen. And not to say that central nodes are bad or anything. We ask our senior engineers to do a lot, and that's why they're at the center of it. But the same way, for me as a manager, that I have to take a step back and not be involved in everything, I think that was my recommendation for central nodes.

I would say the second thing that I thought was pretty interesting was kind of the isolation measurements. Initially, I thought nodes that were isolated were maybe people that weren't collaborating. I had this bias, right? And that's what it looks like on the network. But when I actually started talking to people and like,

Just being curious, I was like, oh, you know, what's up this week? What are you doing? People were doing all sorts of different things. Sometimes it was, you know, personal reasons, right, why they couldn't participate. And it was always good, instead of me saying, oh, you're doing a bad job, to be curious and ask, right, and to help people. Other times people were learning and researching. And when you're learning and researching, you need heads down time. You can't be social. All those are signs of growth that aren't necessarily positively represented in the network. So for me, a lot of the initial assumptions I had got flipped, not only from the network itself, but also from just talking to people and getting curious. And this is why, after doing this for one and a half years, I strongly believe that it can't be a dashboard tool, just like, what is it called? The B-side or there are other metrics that quantify commits and everything. In the same way that those metrics are abused, I think this one could easily be abused, and I would rather not turn it into that.

I agree with you fully there, and I think you would have, or at least I do have, different expectations of different people.

There's, you know, every once in a while an engineer that needs to go into a cave and solve a problem for a week. And that's their role in the organization. I wouldn't want them penalized on some dashboard because of it. But to have a managerial level view from 50,000 feet kind of insight into, you know, let's face it, I can't read every PR. I can't read every issue and every discussion point. With that in mind, can we talk about GH Graph Explorer? Sure.

We'll have a link in the show notes for people who want to check out the repo, but what are they going to find in this repository? Yeah. In this repository, people will find, first of all, just a little, I would call it a white paper that I wrote about some of the findings that I had. The second thing that they'll find is a series of tools.

So they're all based in Python, and I kind of try my best to make the usage clear for people. And of course, contributors are always welcome. I always enjoy that. So yeah, what you'll find is different ways of collecting data, either through printing stuff out directly on the screen, CSV, and also even a way to use Neo4j, a network-based database that allows you to

kind of collect network data in kind of a native way. So you'll find ways of collecting data and then you'll find different ways of analyzing the data. So I have some tools that can kind of

Read the CSVs, read the Neo4j database and just spit out basic network results and metrics. And then also you'll find an MCP server, which is a tool that you can hook up to Claude or hook up to VS Code. So you can pull down data and actually have the LLM analyze the data for you and kind of give you insights. This has been really interesting because it takes away a little bit of your own bias there and kind of tells you things. And this is how I learned that I was too central, right? It's like, you are the critical node. And I was like, oh, you know, it kind of shocked me with the insights. It gave me a little bit of...

Harsh introspection there. So you'll find that. And you'll also find an action that you can set up to collect the data and kind of save it onto your repo. So essentially, it's just a collection of tools to give you access to this data. You also have the keys that you can get from GitHub to download it. So I would like people just to be able to use it and analyze it for themselves for whoever's interested. Yeah.
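Reading a collected CSV and printing basic network results, as described in this answer, might look roughly like the following. The column names `user`, `object`, and `interaction` are assumptions made for this sketch; the files the repository actually writes may be laid out differently.

```python
import csv
import io

import networkx as nx

# Hypothetical CSV layout; the real export may use other column names.
SAMPLE_CSV = """user,object,interaction
alice,issue#1,created
bob,issue#1,mentioned
bob,pr#7,created
alice,pr#7,approved
"""


def graph_from_csv(handle):
    """Load one edge per CSV row into a multigraph."""
    graph = nx.MultiGraph()
    for row in csv.DictReader(handle):
        graph.add_edge(row["user"], row["object"], interaction=row["interaction"])
    return graph


graph = graph_from_csv(io.StringIO(SAMPLE_CSV))
summary = {
    "nodes": graph.number_of_nodes(),
    "edges": graph.number_of_edges(),
    # Collapse parallel edges before computing density.
    "density": nx.density(nx.Graph(graph)),
}
print(summary)
```

To run this against a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` sample.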

What drove the, I guess, design choices there? Were you already working on Neo4j or what was the impetus to expand the code to offer that service? No, I had never worked on Neo4j before and it was something that was really...

I had known about it for a long time. I just never had the chance to work on it professionally. And I knew it had a lot of interesting tools and visualizations and ways of storing data. So I just went with it because I thought it would be a great tool. And it has been. You can open up the Neo4j Explorer and use Cypher to write queries to get the data back. I think it's a really elegant way of searching through the data and finding the links and

For me, at first it was certainly less intuitive than NetworkX, which is a Python tool for network analysis. But lately, if I just want to go in and check out, oh, what's going on this week, I will use Neo4j and I'll use the Cypher queries to get out the data. So my initial motivation was like, oh, how cool. And now it's become like, oh, wow, this is just a really elegant language and a way to get data that works a whole lot better than a CSV, which I was using before.

I'm wondering if we can zoom in on the onboarding process, because I'd like to think a lot of listeners will check this out. What's it take to go from cloning your repo to getting myself a nice CSV of all the metadata about my repo?

Exactly. So first you would clone it. The second thing is that you need a GitHub token. Maybe I should update the repo to put some instructions on there on how to get a token. You can set it as an environment variable or put it into your repo. Just be careful where you put that token because you can get a lot of stuff stolen.

Then after that, you can just start collecting the data. I have a number of scripts on here that say, if you're interested in the last seven days, you can use this command.

If you're interested in these other ones, you can use that command. There are JupyterLab notebooks. So I think the main thing is download it, get a token, and then you can start running some of the scripts that I've written. After this, I probably will go back and write even more detailed instructions for people. But ideally, I would like it to be simple enough that you can just copy and paste 90% of the instructions and you'll collect data or get some basic analyses.
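The token handling and the "last seven days" window described above can be sketched with the public GitHub REST API. This is not the repository's script: the `GITHUB_TOKEN` variable name, the `build_issues_request` helper, and the `octocat/hello-world` repository are placeholders. The sketch only prepares the request and does not send it.

```python
import os
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import Request


def build_issues_request(owner, repo, days=7, token=None, now=None):
    """Prepare, without sending, a request for issues updated in the last `days` days."""
    # Assumed convention: the token lives in an environment variable,
    # so it never has to be written into the repository.
    token = token or os.environ.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("Set GITHUB_TOKEN rather than hard-coding the token")
    now = now or datetime.now(timezone.utc)
    since = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({"since": since, "state": "all", "per_page": 100})
    url = f"https://api.github.com/repos/{owner}/{repo}/issues?{query}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return Request(url, headers=headers)


request = build_issues_request(
    "octocat", "hello-world",
    token="example-token",  # placeholder value, not a real token
    now=datetime(2025, 6, 22, tzinfo=timezone.utc),
)
print(request.full_url)
```

Passing the result to `urllib.request.urlopen` would perform the call. Note that this endpoint returns pull requests alongside issues, so a collector would still need to tell them apart.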

Well, I very much agree with our earlier point. It's not time to make this a simple dashboard or report card, but any other thoughts on how another manager, I know you've used this in your own work, how could other people pick this up and what sorts of insights might they gain looking at their organizations?

The main insight is just learning what their network looks like. I think initially the network, and I know we talked about this a little bit, surprised me because I started to see more clearly than ever how critical almost everyone on the team is. Usually we tell stories about the lead engineer solving everything or coming up with this great thing. There's a high focus on one or two people who are very good at telling stories, right? And very outwardly communicative. By all means, they deserve credit, but there are also a lot of other people that go into building software, right?

And maybe they're reviewing pull requests. Maybe they're reviewing pull requests a lot, right? And they're stopping bugs from ever getting to production. Maybe they are the ones commenting on issues and helping plan or doing all this glue work. And when you base your understanding of a team on just the leaders or just the people who are most vocal, you don't really see the entire picture. And I think that's just in general, like an important tool for understanding and realizing this.

The second thing is I use it as a report card for myself in a way. And I do that with this really specific kind of measurement. So what I do is I look at these centrality measures for every single person on my team. And then I take the average of those, sometimes the average, sometimes the median. And that gives me a number, which I plot on a line every two weeks.

Every two weeks, I get to see, is the centrality of my team going up or down? Because that for me is a measurement of how closely they're working together. And when I see it go up, you know, I can feel it. It's palpably noticeable on the team. People are in a good mood. People are working together. When it's going down, it means that everyone's kind of bogged down. They're stuck on paperwork. Maybe something wrong is happening, or maybe nothing wrong is happening: maybe half the team is on vacation and that's why it dropped.
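The team-level measurement described above can be sketched in a few lines of Python. This is a hedged illustration, not the code from the repo: the names, the interaction data, and the two-week windows are invented, and degree centrality is swapped in as the simplest measure to compute by hand. The eigenvector or betweenness centrality mentioned earlier in the episode could take its place, for example via NetworkX.

```python
from collections import defaultdict
from statistics import mean, median


def degree_centrality(edges):
    """Degree centrality: the share of other nodes each node is linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def team_cohesion(edges, team):
    """Mean and median centrality over the people on the team."""
    scores = degree_centrality(edges)
    # A teammate with no interactions in the window counts as zero.
    values = [scores.get(person, 0.0) for person in team]
    return mean(values), median(values)


# Hypothetical (person, GitHub object) interactions for two two-week windows.
sprints = {
    "2025-06-01": [("ana", "pr-1"), ("ben", "pr-1"),
                   ("cy", "issue-2"), ("ana", "issue-2")],
    "2025-06-15": [("ana", "pr-3"), ("ben", "pr-3"), ("cy", "pr-3"),
                   ("cy", "issue-4"), ("ben", "issue-4")],
}
team = ["ana", "ben", "cy"]

if __name__ == "__main__":
    for start, edges in sprints.items():
        avg, mid = team_cohesion(edges, team)
        print(start, round(avg, 3), round(mid, 3))
```

Plotting one point per window gives the kind of line described here. Only people are averaged, so the GitHub objects in the bipartite graph shape the scores without being counted as team members.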

But it gives me in general, like a good measurement of like what's going on. One other thing that I've learned too, that I just have to mention here is that when new people come into a team, these measurements go down. The average centrality of the nodes on your team goes down because people are

learning how to work together, right? And as they get more used to it, three or four months later, you start to see it go back up again. People are starting to work together. And those insights kind of mimic what we know as managers. What is it? Forming, storming, norming, and performing. And you kind of get to see that on plots and graphs when you take these measurements. So maybe I take it back. Maybe it is an important measure that we can put on a dashboard, but not one for each individual person, but

an important measurement for a team. But again, it can always be abused. So I go back and forth, but I think it is a good measurement to tell us if there's team cohesion.

Knowing the context of the organization, what it's up to at the moment, and getting these metrics can keep, you know, I would imagine keep you more informed about how the team's doing.

Yes, it definitely does. It definitely does. It also, yeah, there's all sorts of things that I've discovered. I feel like I haven't even...

I'm not even remembering them all. There's so many. If I see an aggregation of nodes one week, I'm like, oh, that's interesting. Maybe this is more important than I thought. Or maybe this is more distracting than I thought. I don't know. So there's all sorts of insights that you can gain from having a different view. And that's what it's about, right? Having a different angle to understand things.

So given the code you've made available, pretty much anybody could follow the same path, although it's admittedly a little DIY. How long before you can just turn this on or get a subscription or something going? Or do you have a vision for how this might become a more integrated product?

I don't actually. And that's a good question. I forget what license I've given it, but I've given it for anyone to use and turn into a product. If someone is interested in collaborating on that, I would love it. But again, my primary job and focus right now is being a manager at GitHub. So that's what I'm using it for.

I will say I would love to do some more exploration in general related to how LLMs and just the participation of LLMs is affecting how software engineers work together.

But those are questions I think are outside of my purview. But I would love to see if anyone using these tools or using any other tools can start to answer those questions. I would love to read those studies. But for now, this is where I'll stay. Just kind of giving people the tools to do their own research.

Do you think a simple scaling up of the tools you're designing to a much, much larger organization could be at least a tool for helping mold it into that more decentralized group?

I think so. I think so. And I think a lot of decentralized decision making probably happens. It's just invisible, right? We don't know those people. They're not really captured anywhere and maybe there's a sense of it.

Well, one could argue that some companies operate in all public Slack channels and that you have a similar data set there, perhaps.

That's true. That's true. GitHub for...

Good or bad, right? We do kind of dog food our own products. So we tend to really cluster around GitHub as a tool that we use for product management and coding and everything, right? For discussions. And that's because we believe in the tool that we have and we're always trying to improve it. So we use it all the time. But for sure, another organization that maybe uses a tool like Jira or something else, I'm sure they have different data sets there that they would be able to extract and see how people are collaborating.

But then same, I guess, more about the fundamental analysis than the data source. Do you think that can scale up to a Microsoft, Google, or Facebook scale, where there are, I don't know if it's hundreds or thousands of engineers collaborating? Do you need a different kind of tool at that scope? Or do you think the same ideas will scale up?

I think the same ideas would scale up. I think so. And I think the reason I started putting things into Neo4j was for that reason. Neo4j can scale up quite a lot. Maybe if one day I become a director or higher, maybe I'll still be doing this. And then just the number of people that I'll be exploring will be bigger. But I'm sure there'll be some fascinating insights to find across an organization.

Yeah, either way, you've built some of the fundamental data engineering tools to unlock a lot of work here. So great job on that.

Thank you. Yeah, that's my hope, that other people go out and try to use it and get their own insights.

Well, Gabriel, is there anywhere listeners can follow you online?

Not really. Only my GitHub handle. So that's GE Ramirez. I'm sure you'll post a link and everything. But that's the main place where I say I'm social.

Well, thank you so much for taking the time to come on and share this project.

Thank you so much. Take care.
