
Fraud Networks

2025/4/1

Data Skeptic

People
Asaf
Bavo D.C. Kempo
Topics
Asaf: I was impressed by Bavo's implementation, especially since a core difficulty in studying insurance fraud is getting data on fraudsters. He solved that problem by building a simulator. For my part, I experimented with Federal Election Commission data and found interesting patterns that resemble fraud data. Analyzing the political-donation network graph, I found anomalies, such as unusually large donations, and used them as leads, which is the same logic as looking for anomalies in a fraud database. Social network analysis is not limited to social media; it covers every network that people are associated with, and in insurance fraud detection it can reveal hidden relationships between claims and the involved parties.

Bavo D.C. Kempo: I am a data scientist and statistician, and my work combines the strengths of actuarial science and social network analysis. Actuarial science can be understood as mathematical statistics for insurance: it covers the quantitative side of insurance, such as pricing the premium for car insurance. In insurance pricing we use historical data, analyze important predictive variables (such as age), and build predictive models to predict losses. Insurance fraud means submitting false or exaggerated claims; the more severe cases involve multiple illegitimate claims filed for profit. We can detect insurance fraud using traditional claim characteristics (for example, a claim amount higher than the initial value of the car) together with social network analysis. The "ground truth" label for insurance fraud usually comes from an expert judgment based on an investigation, but expert judgments can also contain errors. My research builds on Óskarsdóttir's work, whose data set contains car insurance claims together with related claims from other lines of business, forming a social network of claims and involved parties. We use the BiRank algorithm (based on PageRank) to rank claims: known fraudulent claims steer the ranking, and claims closely connected to known fraud are ranked higher. The network relationships we detect are dense connections among fraudsters, not isolated individual fraud. Our analysis does not use social media data; it is based on the relationships between claims and the involved parties. If a repair shop is involved in many fraud cases, it will receive a high fraud score in the analysis. To address the small number of fraud cases in real data sets, I developed a simulator called iFraud that generates synthetic data sets with a network structure for training and testing fraud detection models. The iFraud simulator uses an iterative algorithm to generate network structures closer to real-world characteristics, simulating policyholder characteristics, contract characteristics, claim frequency, and loss cost. It lets users control the proportion of fraud in the simulated data set and the weights of other features. The BiRank algorithm provides a ranking by fraud probability rather than a simple threshold decision. Fraud detection models need continual updating to keep up with the dynamic nature of fraud. Insurance companies use the claim information they already have to build graph databases for fraud detection. Converting graph data to tabular data is usually done by extracting graph features as table columns. Many insurance companies have already adopted analytical approaches to combat fraud. Although false positives are a risk, probabilistic models and expert review keep that risk to a minimum. My research and work span actuarial science, medical research, and statistical methods. I developed a package called "Easy Calibration Curves" for assessing the accuracy of predictive models. I am happy to keep working on graph- and network-related research, and I welcome anyone to get in touch.


Transcript

You're listening to Data Skeptic: Graphs and Networks, the podcast exploring how the graph data structure has an impact in science, industry, and elsewhere. Welcome to another installment of Data Skeptic: Graphs and Networks. Today we're talking about insurance fraud and how graphs and networks can be used to analyze and find some of that. Asaf, what did you think of Bavo's implementation here?

Bavo created a simulation for fraud because, well, there's a problem when trying to study fraud and fraud networks: it's hard to find fraudsters' data, right? It's hard to find. His solution was to create a simulator.

What I did actually was I used the FEC, the Federal Election Commission data. I tried to practice fraud on it. Is that to presume that the Federal Election Commission is actively engaged in fraud, or –

Explain this out a little bit more for me. I think you should trust your government more than you do, because the fraud is not the FEC. What I'm trying to say is I was looking at the donations, the data set of the donations that are on the FEC's website. And what I looked at was who donated to whom. And you can find lots of interesting data inside, which...

kind of resembles fraud data, like what's their address and so on. Yeah, that's a novel data set to compare against. Even if it's donations, not that they're fraud, I would assume the dynamics of such a network are similar. Yeah, you can find interesting things. So I focused on donations to PACs. A PAC is a political action committee. It's an organization to funnel political donations and so on. You can do many things, like

Two examples. One was I projected the network. What I mean is I connected PACs that got donations from the same person. So I think the data set I used was from 2022. What I got was, of course, two large communities, the Democratic PACs and the Republican PACs. That's proving the method, yeah. Yeah, with many links between them. But the interesting example was Liz Cheney. Right.

Liz Cheney is famously a Republican, but her PAC belonged to the Democratic community because of her stand against Trump. So it was interesting that people who usually donate to Democratic PACs donated to her PAC. Another example relating to our case, which is how to find fraud on a graph, is looking at anomalies. When I'm looking at the graph, I'm looking for anomalies.
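As a rough sketch of the projection step Asaf describes (linking PACs that received donations from the same person, then checking for the two big communities), something like the following could be done with networkx. The file name and column names are hypothetical stand-ins for the FEC bulk data; this is an illustration, not Asaf's actual code.

```python
# Illustrative sketch: project a donor -> PAC bipartite graph onto the PAC side.
# Assumes a hypothetical CSV "fec_donations.csv" with columns donor_id, pac_id.
import csv

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
with open("fec_donations.csv", newline="") as f:
    for row in csv.DictReader(f):
        donor, pac = "D:" + row["donor_id"], "P:" + row["pac_id"]
        B.add_node(donor, bipartite=0)
        B.add_node(pac, bipartite=1)
        B.add_edge(donor, pac)

pacs = {n for n, d in B.nodes(data=True) if d["bipartite"] == 1}

# Project onto the PAC side: two PACs are linked when they share a donor,
# and the edge weight counts how many donors they share.
P = bipartite.weighted_projected_graph(B, pacs)

# A simple community detection pass should then recover the two large
# communities described above (Democratic-leaning and Republican-leaning PACs).
communities = nx.algorithms.community.greedy_modularity_communities(P, weight="weight")
print(f"{P.number_of_nodes()} PACs in {len(communities)} communities")
```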

In this case, I wasn't looking specifically for fraud, but you can use the same logic. For example, when I looked at the graph, I could see a very thick edge, meaning an edge that carried many donations. It was large sums of money donated to a PAC. And when I dived into it, I found that it was a donation made by SBF, Sam Bankman-Fried. Aha!

who ironically was convicted of fraud in relation to his crypto company, right?

Using the graph, I could easily find his social relationships by looking at, let's say, who used the same addresses he used. So I could follow up on the donations they made, his other, let's say, friends or family, and see the cluster. So it's the same logic when looking at a fraud database or a fraud dataset. Very interesting. Yeah. Bavo is smart.
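A minimal sketch of the shared-address trick mentioned here, again with made-up file and column names: link donors that report the same address and pull out the connected component around one donor of interest. The seed string is purely illustrative.

```python
# Illustrative only; "fec_donations.csv" and its columns are assumptions.
import csv
from collections import defaultdict

import networkx as nx

donors_at_address = defaultdict(set)
with open("fec_donations.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (row["street"].strip().lower(), row["zip"][:5])
        donors_at_address[key].add(row["donor_name"])

# Link every pair of donors that report the same address.
G = nx.Graph()
for donors in donors_at_address.values():
    donors = sorted(donors)
    for i, a in enumerate(donors):
        for b in donors[i + 1:]:
            G.add_edge(a, b)

seed = "BANKMAN-FRIED, SAMUEL"   # hypothetical spelling of the record
if seed in G:
    # Everyone reachable through shared addresses: possible family, friends, etc.
    print(sorted(nx.node_connected_component(G, seed)))
```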

He says that he analyzes social networks, but the name "social network" can be misleading. Social network analysis has existed for about a century. So you can deduce from that that when people say SNA, or social network analysis, they don't mean just social media like Facebook and LinkedIn, but all the networks that humans are associated with.

In this case, what he studies are the relationships not of friends on LinkedIn or Facebook, but of people whose names were filed together in a fraudulent insurance claim, for example.

Almost certainly there's some in there, but these are people actively trying to hide themselves. So you have to find new fingerprints and the network seems to reveal them a bit in this case. They have to reveal some of the relationships because they file a claim, right? So you can use it.

And look for the same people, the same addresses, the same thing I did with the FEC dataset, for example. Or not to spoil one of Bavo's most interesting points, but detecting a fraudulent repair shop involved in the ring somehow can appear in the data. Spoiler alert. Yeah, right? Yeah.

One of the techniques that he mentions using just in passing is SMOTE sampling, which I've never actually applied myself, but I've read up on it a couple of times because it seems like a very interesting technique. I guess from the way your face is moving, you're not familiar with SMOTE. It is a sampling technique. So SMOTE is an acronym, a sort of a forced one for Synthetic Minority Oversampling Technique.

Which, of course, matters if you want to do some sampling in this case: if you sample from an insurance claim network, probably you're going to sample all good claims. You would adjust your sampling technique to be more likely to draw from the minority class in this case. Not minority in the social sense, but minority as the less common class, since most insurance claims are not fraudulent. The fraudulent ones are probably less than 1%. Yes.

Well, I didn't remember that he mentioned SMOTE, so that's why I was baffled. It's a pretty neat technique. I hope I get to pull that out of the toolbox one day myself. Well, let's jump right into the interview then.
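Before the interview, a quick illustrative aside for readers who want to try SMOTE: a hedged sketch using the third-party imbalanced-learn package on synthetic data with roughly the 1% fraud rate mentioned above. It is a sketch, not code from the episode.

```python
# SMOTE oversampling sketch on synthetic, heavily imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# ~1% positive ("fraud") class, mimicking the imbalance discussed above.
X, y = make_classification(n_samples=20_000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE synthesizes new minority-class points by interpolating between a
# minority sample and its nearest minority neighbours; apply it to training data only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("fraud recall on the untouched test set:",
      (clf.predict(X_te)[y_te == 1] == 1).mean())
```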

I'm Bavo D.C. Kempo. I'm a data scientist and a statistician, and in my work I try to combine the best of two worlds. Now, I did my PhD in actuarial science, and before this I worked as a biostatistician at KU Leuven, where I contributed to research on ovarian cancer.

Now currently I work at an insurance company where I analyze risks through stochastic simulation, and I'm also a research associate at Imperial College and KU Leuven.

For listeners who don't know actuarial sciences particularly well, could you give us a quick high-level summary? Yeah, you could kind of see it as mathematical statistics for insurance. Everything that has to do with the quantitative aspect of insurance, for example, pricing the premium of your car insurance.

How do people with that sort of academic background usually transition into industry? What kind of roles are available? There's actually quite a lot of different roles that you can take on in the industry. One of them is, for example, as a pricing actuary.

Now here you will develop pricing models, and the pricing models are predictive models to predict the losses based on the policyholder characteristics. Now just a simple example: take, for example, car insurance. Here you can develop a pricing model, a prediction model, to predict

how many claims on average a policyholder will have with certain characteristics and also the loss that will be expected. And I'm curious if you could share some of the challenges in work like that. I would imagine you're trying to predict the future and that's a notoriously hard thing to do. Yeah, true. And this is why we still work with the expected value just like everything else in predictive modeling.

We work with the expected value based on the information that we have, that is, historical data that the insurance company gathered. So we will analyze this data, try to find some characteristics that are important. For example, age. This is a well-known important or predictive covariate

in car insurance, and this also has a very interesting non-linear effect. So young policyholders, they do not have a lot of experience, and because of this they present a high risk. Then the more experience you get as you get older, you will have fewer claims and also less severe claims on average, of course. But then once you reach a certain age, when you get older,

you actually present a higher risk again. So this is just a short summary of how we use predictive analytics within insurance. I know another facet of that that you've looked into is the presence of insurance fraud. Yes. And I guess how to detect it and things like that. To kick it off, could you give me a summary of what is the nature of insurance fraud? How might I commit it if I was trying to? Yeah.

So this involves submitting an illegitimate claim. Now, this is again a very technical term, but

Just a simple example, say that you had a car crash and you actually exaggerate the claim cost. Then you report it to the insurance company. Now this is just a minor example of fraud, but the more severe cases are where there are multiple illegitimate claims and these are then reported by fraudsters who actually try to make a profit.

And this is, well, detrimental not only for the insurance company, but also for the policyholders, because the insurance company, they will also increase the premium to offset these losses. Makes sense, yeah. Could you expand on how we might use data and data science to fight insurance fraud?

So one of the things that we could first use is to look at the characteristics of the claim. We can refer to the traditional claim characteristics as the characteristics that were initially used within fraud research. For example, whether the claim is higher than the initial value of the car. So within car insurance,

If you submit multiple claims and the cumulative claim amount is higher than the initial value of the car, this is of course suspicious. And this is one of the characteristics of the claim that you can then use to detect fraudulent claims. Now, one of the interesting new approaches is where we incorporate social network analytics in a context of

of insurance fraud. This refers to the relationship between the claims on the one hand and the policy holders or the involved parties on the other hand, because this allows us to go beyond the traditional claim characteristics. One of the difficulties within fraud research is the evolving nature of fraud. Most fraudsters aren't stupid and as soon as they

kind of detect a pattern, or if they get detected, they will also adjust their strategy. But using social network analytics we can potentially uncover fraudsters that are trying to hide their tracks, and this is through collaboration with other fraudsters. Well, I definitely want to zoom in on the social network analytics component, but before we get there

Could you say something about the nature of, I guess, ground truth? You know, it seems natural that if the claim is that, you know, I need to be paid out more than the initial value of my car, that's quite suspicious. But what if I had, you know, done tons of upgrades after I bought it? Or maybe it was in a situation like we've seen recently where the price of used cars has soared from some supply chain shortages. What, you know, notion of ground truth do you have?

Delete.me makes it easy, quick, and safe to remove your personal data online at a time when surveillance and data breaches are common enough to make everyone vulnerable. Your data is a commodity. Anyone on the web can buy your private details. This can lead to identity theft, phishing attempts, or harassment. But now you can protect your privacy. That's why I've been using Delete.me. One of the best things about the service is when you first sign up, they give you the flexibility to start with just basic information. You choose what details you want them to protect.

I started conservatively, but after seeing their detailed remover reports and experiencing their service firsthand, I felt confident enough to expand my protection. The peace of mind that comes with Delete.me's service is invaluable. Knowing that a dedicated team of human privacy experts is actively working to protect your personal information lets you focus on what matters most in your life.

Some removals happen within 24 hours, while others might take a few weeks. But Delete.me manages it all. They keep you informed throughout the process, and their quarterly reports show you exactly what they're doing to protect your privacy. Take control of your data and keep your private life private by signing up for Delete.me now at a special discount for our listeners. Today, get 20% off your Delete.me plan by texting DATA to 64000. The only way to get 20% off is to text DATA

to 64000. That's DATA to 64000. Message and data rates may apply. What notion of ground truth do you have? Yeah, well, these are indeed all things that need to be taken into account in the predictive model. And I would see them as confounding variables.

Because we initially start from, let's say, a historic database. And this is a database that contains all investigated claims in the past. And here there will be an expert judgment. Once a claim has been flagged as suspicious, an expert will investigate this claim in detail.

He or she will decide based on an in-depth investigation whether this claim is fraudulent or not. And this is what we most often refer to as, well, I wouldn't exactly call it a ground truth label.

Because in a very small amount of cases I could also imagine that label that is given by the expert is not exactly the same as the ground truth label. If you understand what I mean or should I be a bit more specific?

I guess my interpretation is that any expert is going to make some amount of errors, however small they might be. Maybe it's one in a thousand, but it can't be perfect, right? Yeah, exactly. So that's what I'm referring to. And then let's zoom in on the social network aspects. What sort of data set do you have available to link in cases like this? So my paper was based on the research of Óskarsdóttir.

In her research, she used a car insurance data set. And here we had information not only on the car claims, but also claims that were linked to the initial car claims in other lines of business. So this might be fire insurance or anything that's related to it.

There were two parts to this data set. You had the one with the traditional claim characteristics that was in a tabular format. But then the second part, it was more of a social network. And here we had the relationships between the claims and the involved parties.

Now the involved parties, they commonly, well of course, always include the policyholder. You also have the garage, for example, which is another involved party in the claim. You also have the broker, the expert, and this is the initial data that we had.

With the social network, of course, it's a graph structure and it does not have the tabular format that is often required by predictive models. What Óskarsdóttir did is that she designed an algorithm, the BiRank algorithm, to rank the claims with respect to the known fraudulent claims. Now, the BiRank algorithm is based on the PageRank algorithm, which was initially used by Google to rank web pages.

But okay, going back to the BiRank algorithm: using the known fraudulent claims, you will then use this as a... well, we put this in a vector. This is what we call the query vector in the article. And this is actually used to steer, or actually to give, a ranking to all the claims.

The result is then a kind of ranking of all claims with respect to fraud. So the ones who are more densely connected to known fraudulent claims, they will be ranked higher. We then have our first kind of ranking or score for both the claims and the policyholder. And then based on this score, we can also construct or engineer several features.
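To make the idea concrete, here is a compact, simplified sketch of a BiRank-style propagation over a claims-by-parties matrix with a query vector of known fraudulent claims. It follows the general BiRank recipe (symmetric normalization plus a restart on the query vector), but it is not the authors' implementation, and the toy data is invented.

```python
# Simplified BiRank-style fraud score propagation on a bipartite claims/parties graph.
import numpy as np

def birank(W, query, alpha=0.85, n_iter=100, tol=1e-9):
    """W: claims-by-parties adjacency (0/1); query: known-fraud indicator per claim."""
    # Symmetric normalization dampens the effect of very high-degree nodes
    # (the "density" normalization mentioned later in the interview).
    d_c = np.maximum(W.sum(axis=1), 1.0)          # claim degrees
    d_p = np.maximum(W.sum(axis=0), 1.0)          # party degrees
    S = W / np.sqrt(d_c[:, None]) / np.sqrt(d_p[None, :])

    c0 = query / query.sum()                      # normalized query vector
    c = np.full(W.shape[0], 1.0 / W.shape[0])     # claim scores
    p = np.full(W.shape[1], 1.0 / W.shape[1])     # party scores
    for _ in range(n_iter):
        p = S.T @ c                               # propagate claim scores to parties
        c_new = alpha * (S @ p) + (1 - alpha) * c0  # restart on known fraud
        if np.abs(c_new - c).sum() < tol:
            c = c_new
            break
        c = c_new
    return c, p                                   # fraud scores for claims and parties

# Toy example: 4 claims, 3 involved parties; claim 0 is known fraud.
W = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
claim_scores, party_scores = birank(W, query=np.array([1.0, 0, 0, 0]))
print(np.argsort(-claim_scores))                  # ranking, most suspicious first
```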

So we have two different types of features that we can then construct: the neighborhood-based features and the score-based features. Now, with the score-based features we're going to look at the fraud scores that we arrived at by running the BiRank algorithm, for example for one of the claims. Yeah, for the claims we can then, for example, look

at the first-order neighborhood, so the connected parties. We can then construct several features, such as the fraud score

in the first order neighborhoods of a specific claim. And this is one type of feature or one example of a score-based feature. Another thing that we can construct, so it's the other type of features, are the neighborhood-based features. And then we're going to look at the neighborhoods of the claims.

And here we can, for example, look how big the second order neighborhood is, but we can also look at the ratio of the fraudulent claims in the neighborhood of the claims.
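A small illustration of how such score-based and neighborhood-based features could be engineered from the bipartite claims/parties graph; the feature names and the toy graph are my own, not taken from the paper.

```python
# Sketch: turn graph structure plus BiRank-style scores into tabular features.
import networkx as nx

def claim_features(G, claim, fraud_score, known_fraud):
    """G: bipartite graph of claim and party nodes.
    fraud_score: dict node -> score; known_fraud: set of claim nodes."""
    parties = list(G.neighbors(claim))                       # first-order neighborhood
    second_order = {c for p in parties for c in G.neighbors(p) if c != claim}

    party_scores = [fraud_score[p] for p in parties] or [0.0]
    n2 = len(second_order)
    return {
        # score-based features
        "max_party_score": max(party_scores),
        "mean_party_score": sum(party_scores) / len(party_scores),
        # neighborhood-based features
        "size_2nd_order": n2,
        "fraud_ratio_2nd_order": (len(second_order & known_fraud) / n2) if n2 else 0.0,
    }

# Toy usage: claims c0..c2, with c0 and c1 sharing a garage, c1 known to be fraud.
G = nx.Graph([("c0", "garage"), ("c1", "garage"), ("c1", "holder_a"), ("c2", "holder_b")])
scores = {n: 0.1 for n in G}
scores["garage"] = 0.9
print(claim_features(G, "c0", scores, known_fraud={"c1"}))
```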

I'm wondering if density plays a role in your analysis. You know, if I think of someone like maybe living in a big high-rise in Manhattan and there's a thousand people in their building, probably one of them committed insurance fraud, but who knows if they know each other even. Would they be connected because they're in the same neighborhood or is it purely from social elements? No, no, it's purely from social elements. And in the BiRank algorithm, we also account for the fact that there are high-density networks, let's say. There's a sort of normalization in the algorithm that takes place to lessen this impact. So it's

mostly steered or it's predominantly steered by the fact that you have a lot of fraudulent claims in the neighborhood of the original claim. So when I think of an accident, and I haven't been in that many so I don't have my own data set to really analyze, but they were always with strangers, people I didn't know.

So are you saying that maybe I would have a confederate and I would tell my buddy, hey, we're going to split the money if I just give you a minor fender bender? Or is it more that I'm amongst a community of thieves, so to speak? What sort of network relationship are you detecting? I would say the second one, because this is also one of the basic assumptions of the method: that there are dense networks of fraudsters who

will work together and they try to remain undetected. They all work together because it also instills trust and also within criminology it's a well-known fact that fraudsters work together. But this is also exactly what we are trying to unravel, let's say. So it's a dense connection of fraudsters and it will not be a single fraudulent claim, let's say.

So when you begin with social network data, I would think that a smart criminal would tell his criminal friends not to add him on LinkedIn or Facebook so that they wouldn't create this sort of digital fingerprint. Does that mean that essentially these people are making errors in their work, that they should just be better criminals if they want to pull it off? We do not use any...

social media data. So it's purely based on the relationship between the claims on the one hand and the involved parties on the other hand. So it's really based on the connection between the filed claims. Now if you see that there are a couple of fraudsters who file a lot of claims together, or maybe that there's a

Like you have a car accident, well, you kind of fake a car accident with your buddy. If this buddy pops up in a lot of other car accidents, because you're like a huge group of criminals, this will pop up.

So this is also the basic assumption that we're working with. I believe you also mentioned there could be like a shop involved. What if there was one criminal shop that was somehow aiding and abetting, I guess might be the right term. How does that appear in your data? Yeah, this is actually a very good question. And here I can maybe go back to, let's say, the fraud scores that we assign

through the BiRank algorithm. So one, the claims, they will get a fraud score, but also the parties. So if there's like this garage that's aiding a lot of fraudsters, this will pop up and it will get a high fraud score. So all the claims that are connected to this garage that commits a lot of fraud, this will pop up in the network, or yeah, this will pop up in the analysis.

Well, I know one problem that is common in fraud detection is a class imbalance. I do have enough confidence in my society that most people are honest, even though I know there is fraud out there. I would hope it's some small percentage. I don't know what number exactly, but that presents a challenge for any sort of analytical or machine learning approach because you have so few positive examples of fraud. How do you face that challenge in your work? Yeah, this is a huge challenge indeed. And

There's a lot of research done about or performed at the moment about this class imbalance. Now, one of the

Well, reasons why it's affected is the mechanics of the machine learning algorithm itself and also how you optimize it. If you're just going to optimize it by the accuracy, for example, your machine learning algorithm is just going to classify everyone as non-fraudulent. One of the techniques that we could then apply, and it's often applied, is resampling techniques such as SMOTE.

There has also been research showing that just a simple logistic regression model is often sufficient and does not suffer from this class imbalance. Purely, well, because I also researched this, but this is purely because of how it works here: we optimize the likelihood and we do not work with the accuracy.
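A quick illustration of that point on synthetic data: the "flag nothing" baseline already looks excellent on accuracy, while a plain maximum-likelihood logistic regression still yields a usable ranking of suspicious cases. The data and numbers here are invented for the sketch.

```python
# Accuracy is misleading under class imbalance; likelihood-based models still rank well.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01], random_state=1)

# Baseline that predicts "non-fraud" for everyone: ~99% accuracy, zero usefulness.
print("accuracy of the all-non-fraud baseline:", accuracy_score(y, np.zeros_like(y)))

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
print("logistic regression AUC (ranking quality):", roc_auc_score(y, proba))
```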

And I know one major component of your work, I presume it relates a little bit to the class imbalance problem, is building this simulation engine. Could you tell listeners about the iFraud Simulator? So one of the challenges within fraud research is the availability of public data sets.

A lot of research is performed, but it's often difficult to replicate this research because none of the data sets is shared. In addition, just as you just mentioned, we have the class imbalance and it's not perfectly clear how much of a role this plays or how much this actually biases our predictive models. And this is why I came up with this simulation engine.

because this allows us to create a synthetic data set that has a network structure and has the same available covariates, similar to the real-life insurance fraud data set analyzed by Óskarsdóttir. And in this simulation engine, you can also adjust several characteristics, specify dependencies, for example, between policyholder characteristics. And at the end, you have a synthetic data set

that you can use to train your model or even pre-test your predictive model.

Well, one of the challenges I know in creating a simulated network is if you make a random network, just initialize some number of nodes and then roll the dice and make random edges, the graph you get looks nothing like real-world graphs in general. So how do you go about the process of simulating a network structure that bears those sort of real-world features?

Now what I first do is that I simulate the policyholder characteristics. After this I simulate some contract specific characteristics and we use this as input into a data generating claim frequency model.

With this we can then generate the number of claims. After this I use another data generating claims severity model to simulate the loss cost per claim. And with this we have our first simulated dataset. Here I only retain the observations that had at least one claim. And at this point I will randomly allocate

the involved parties to the claim. And after this, because at this moment I will generate the network structure, I will take a small subset and this is an iterative algorithm, but I will always take a small subset. In this small subset

I will then construct the social network features. Now, I can already hear you think, but in the beginning, you don't have any social network features that you can use because none of the claim labels have been generated. So we do not know whether a claim is fraudulent or not. So in a very small subset that I start with, I will generate a claim label without the social network features.

After this, I will then run an iterative algorithm. In the first step, I always take the claims that are linked to the fraudulent claims generated in the previous iteration. And then I take another random subset of other claims and then based on, well, for this I just use a simple logistic

regression model to generate the claim labels. And with this, I incorporate both the social network features and the, well, let's say the traditional claim characteristics. And at the end, we then have a network structure that more resembles a real life data set, because we will also have a couple of claims that are not connected to any other claims.
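To give a feel for the simulation loop just described, here is a heavily simplified sketch. It is not the iFraud package; all distributions, coefficients, and the single "garage" party type are invented. It follows the same outline, though: simulate policyholder characteristics, draw claim counts and severities, attach parties, then iteratively generate fraud labels so that fraud clusters around shared parties.

```python
# Toy, illustrative version of an iFraud-style network simulation loop.
import numpy as np
rng = np.random.default_rng(7)

n_policies, n_parties = 5_000, 800

# 1) policyholder / contract characteristics
age = rng.integers(18, 90, n_policies)
exposure = rng.uniform(0.2, 1.0, n_policies)

# 2) data-generating claim frequency (Poisson) and severity (gamma) models
lam = exposure * np.exp(-3.0 + 0.02 * np.abs(age - 45))
n_claims = rng.poisson(lam)
claims = np.repeat(np.arange(n_policies), n_claims)        # one row per claim
severity = rng.gamma(shape=2.0, scale=1_000.0, size=claims.size)

# 3) randomly attach an involved party (e.g. a garage) to every claim
party = rng.integers(0, n_parties, claims.size)

# 4) iterative label generation: claims linked to previously generated fraud
#    are revisited, so fraud ends up clustered around shared parties
fraud = np.zeros(claims.size, dtype=bool)
fraud[rng.choice(claims.size, size=10, replace=False)] = True   # initial seed labels
for _ in range(20):
    fraudulent_parties = np.unique(party[fraud])
    candidates = np.flatnonzero(np.isin(party, fraudulent_parties) & ~fraud)
    if candidates.size == 0:
        break
    # "logistic" data-generating model mixing a traditional feature (severity)
    # with a network feature (being linked to known fraud, the +2.0 term)
    logit = -4.0 + 0.0005 * severity[candidates] + 2.0
    fraud[candidates] = rng.random(candidates.size) < 1 / (1 + np.exp(-logit))

print("simulated fraud rate:", fraud.mean())
```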

Are you able to control that? I assume that your simulated data set would be very useful to someone building the next generation of fraud detection. Do you maybe seed it with a certain amount of fraud that you hope they detect, or is that perhaps a parameter they would want to control? The simulation engine has quite a lot of flexibility, so you can...

kind of decide on the weights that you assign to each of the features. And this includes both the traditional claim characteristics, but also the social network features. So if you want to, for example, just switch off the social network effect, you can do so. And this is also what I investigated in my paper.

Because of course I built a simulation engine and well I wanted to verify whether it worked. One of the things that I first investigated was whether I could generate a dense network of fraudsters

And how we assess this is by computing the dyadicity and heterophilicity of a network. Now, shortly summarized: with the dyadicity, we check how strongly fraudulent claims are connected to other fraudulent claims. And then with the heterophilicity, we assess how strongly fraudulent claims are connected to non-fraudulent claims. Now, we want the dyadicity to be high,

higher than one, and the heterophilicity we want to be low, because in essence we are assuming that there are dense networks of fraudulent claims, meaning that they have strong connections between each other and also that they are less well connected to non-fraudulent claims.
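For reference, dyadicity and heterophilicity can be computed as the observed fraud-to-fraud and fraud-to-non-fraud edge counts divided by what a random network of the same density would give. The sketch below is illustrative and not taken from the paper.

```python
# Dyadicity and heterophilicity for a labelled claim network.
import networkx as nx

def dyadicity_heterophilicity(G, fraud_nodes):
    n, m = G.number_of_nodes(), G.number_of_edges()
    n1 = len(fraud_nodes)
    p = 2.0 * m / (n * (n - 1))                  # connectance (edge density)

    m11 = sum(1 for u, v in G.edges() if u in fraud_nodes and v in fraud_nodes)
    m10 = sum(1 for u, v in G.edges() if (u in fraud_nodes) != (v in fraud_nodes))

    dyadicity = m11 / (n1 * (n1 - 1) / 2.0 * p)          # > 1: fraud sticks together
    heterophilicity = m10 / (n1 * (n - n1) * p)          # < 1: fraud avoids non-fraud
    return dyadicity, heterophilicity

# Toy example: fraudulent claims f1-f3 form a triangle, loosely tied to the rest.
G = nx.Graph([("f1", "f2"), ("f2", "f3"), ("f1", "f3"),
              ("f3", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c4")])
print(dyadicity_heterophilicity(G, fraud_nodes={"f1", "f2", "f3"}))
# -> dyadicity 3.0 (> 1) and heterophilicity 0.25 (< 1) for this toy graph
```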

It seems to me that there could be a gradient here, like maybe an innocent person is accidentally connected to one case of fraud, but if you're connected to 99 cases of fraud, for sure you're a, or not for sure, but you're most likely a fraudster too. I guess for sure, yeah. Where do you put the dotted line? There isn't a specific threshold that I could give you, because going back to the BiRank algorithm, it's more about the ranking.

So if you're higher up the rank, you're more probable to commit fraud. But then again, it doesn't necessarily mean that you also committed fraud. This we also take into account by constructing the different social network features. So it's kind of a combination of the values for the different social network features that gives you a probability of having a fraudulent claim.

As with any probabilistic classifier, you will get a probability. And it's very difficult to put a threshold there because here we're also going back to the class imbalance. Because of the class imbalance, the probability will also be affected by this. If you have, for example, a prevalence of 50%, well, your probabilities will also reflect this.

How, I guess, obvious is it once you have the right systems in play to find this kind of fraud? Well, if you have a well-performing predictive model, also referring back to the dynamic nature of fraud, I guess it will be quite accurate for a while. But you should always keep updating it.

With new information and also new networks will be formed to take this into account. So you might come up with a very good predictive model, but I think you should also be aware of the fact that you should retrain this.

In your simulation, I guess you can simulate whatever network data you want. Do you have a sense of, in the real world, if, you know, I presume the insurance companies are not necessarily looking at Facebook, how do they develop the graph that's going to power the analysis an expert would do? Once a claim is filed, they have a lot of information available.

And then to put this on a graph, they're gonna, well, first make all of the connections and use a specific program for this. I think they also have Neo4j, for example, and then you will use this

to build your data set. So this will then make a graph of all the claims with all the involved parties. And you can also then kind of extend this not only to the current insurance product, but also other insurance products, for example. And this is how they built their graph database.

Traditional machine learning or even just sort of statistical analysis approach usually appreciates having tabular data, something that goes nicely into a CSV or Excel or something like that. But of course, graphs don't really fit in that format. Maybe we can summarize a graph by certain features it has and use those as columns. Is that the general approach? And if so, what are some of the common features you'd recommend?

Yeah, that's indeed the general approach that we apply. This is actually also what you find in, let's say, the result of the iFraud simulator. What you have then is a synthetic data set in a tabular structure.

And here you have, again, the traditional claim characteristics but also the social network features. And these social network features, they're actually quantifications of some of the aspects that we find within the graph data.

Then this traces back to the original BIRANK algorithm and the fraud scores that we obtained by running the BIRANK algorithm on the graph network.

And with regard to the iFraud simulator, could you talk about it a little bit, maybe as if you were pitching it as a software package for someone to use? What are its features? Why would I adopt it? For one, I think as far as I know, it's the only software package that allows you to generate synthetic fraud network data. I'm not aware of any other software package.

that also incorporates this social network structure into the synthetic dataset.

In addition, it has a lot of things that you can think about, so you can adjust a lot of aspects of the resulting synthetic dataset. One of the things that you can also adjust is the class imbalance, for example. You could, for example, use my package to generate multiple synthetic datasets with a different class imbalance. In addition, you can also specify the value

or the weights of each of the features, and you can kind of play around with it to make one feature more important than another, and this allows you to generate a wide array of different datasets. Another very interesting aspect is that you could potentially use it to incorporate the dynamic nature of fraud, so you can

generate a first data set with certain characteristics and then you evolve the characteristics or maybe the features that have an importance and you can then maybe investigate how you would best update your model to stay as accurate as possible.

Well, as we discussed earlier, the presence of fraud in any system is going to drive up the cost for the good users. And as a non-fraudulent insurance purchaser, I would hope that my insurer is adopting tools like yours. Do you have any insight into the degree to which they've gone ahead and done that? Yeah, actually quite a lot of companies currently adopt analytical approaches to combat fraud.

So I do know of quite some insurance companies who adopt this approach to prevent and to identify fraud. Do you have any thoughts on the risk that, however small, some probability of false positives will occur here? And an innocent person would be labeled as a fraudster despite not committing fraud? Well, here I can...

maybe assure you that even though you have the predictive model, it's still a small number of cases that are then flagged as suspicious because they're not immediately flagged as fraud. And this is also why we use a probabilistic model that gives you a probability of committing fraud. So even if you have like a high probability, well,

Well, also the mathematics says that there's still a chance that this is not fraud. So mostly the claims that are identified as suspicious, so that have a high probability of being a fraudulent claim, they will be forwarded to an expert and they will then conduct an in-depth investigation whether this claim is fraudulent or not.

So is work like this the focus of your career in your research, or is this just one facet of what you do? Yeah, well, this is just one facet of what I did during my PhD. I actually focus on quite a lot of things, and especially in my free time, because I'm a research associate, as I previously mentioned, but I'm also involved in medical research still.

So this is not only about, let's say, the statistics behind it, but I'm also working with real life medical data sets to examine a certain research question. Interesting. Is that along the lines of medical fraud then or something different? Very little of what I currently do is actually related to fraud. Mostly

well, examining theoretical aspects of statistics. One of the things that I'm also very passionate about is assessing the accuracy of probabilities or predicted values, and this is just one aspect of my work. Another aspect is where I just contribute as a biostatistician and where I analyze a data set, but here it is mostly

the doctors who ask a question and where I help them out by performing the statistical analysis and investigating it together with them. Then apart from that, I'm also still involved in actuarial science, and here I try to stay up to date with the latest research.

What's next for you? At the moment I'm contributing to another paper with some ex-colleagues and I'm very much looking forward to that. So this is the first thing that I will finish. Well, that's next week. After this I'm finally going to make some time to update one of my software packages and push it to CRAN, the calibration curves package. Updating the vignette, everything else.

And then I still have so much to work on, so it's kind of difficult to summarize everything. Could you maybe expand on that "Easy Calibration Curves" package? What will our users benefit from that when it gets released? This refers back to assessing the accuracy of your predictions.

And this is why you can use the calibration curves package to assess the accuracy of your predictions. Now currently it has implementations for a logistic model. I also implemented something for generalized linear models or just where the outcome follows a distribution from the exponential family.
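As a rough Python counterpart to the idea described here, scikit-learn's calibration_curve compares predicted probabilities against observed event frequencies. This sketch uses synthetic data and is only meant to illustrate the concept, not to reproduce the guest's R package.

```python
# Calibration curve sketch: do predicted probabilities match observed frequencies?
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10, strategy="quantile")

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("mean predicted probability")
plt.ylabel("observed fraction of positives")
plt.legend()
plt.show()
```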

And since the current theme of our podcast season is all about networks and graphs, and that played a major role in your work that we talked about today, I wanted to ask, where do networks and graphs fit into your current research and maybe future research?

Whenever I get confronted with a new research project that involves graphs, that's where I'll continue, I guess. But it could also be that I will no longer continue on this specific project. But it doesn't mean that... I also want to say this to the listeners. If anyone has any questions...

please feel free to contact me. It doesn't mean that I'm not currently involved in the subject, that I no longer want to help out. I'm more than happy to answer anybody's questions or maybe questions related to the iFraud simulator or anything else. And is there anywhere listeners can follow you online? It's mostly LinkedIn. I don't have any other social media that I actively use.

Yeah, so if you want to follow me or just ask me a question, you can always add me on LinkedIn. But you can also go to my personal website where you can always send me a mail. We'll have links in the show notes for listeners who want to follow up.

Bavo, thank you so much for taking the time to come on and share your work. Yeah, thank you so much for inviting me and for covering my work and also for asking all the interesting questions. I hope it was also helpful for you and interesting for the listeners too. Definitely, very cool project.