
Episode 7: Lies, Damn Lies, and Statistics

2020/7/5

The Theory of Anything

People
Bruce Nilsen
Cameo
Topics
Bruce Nilsen: This episode discusses the widespread use and misuse of statistics in society, and how people use statistics to paper over their ignorance of the unknown. Through several examples, such as a woman's probability of getting breast cancer, the coronavirus death rate, and the accuracy of disease screening tests, he shows that statistics generally apply to populations rather than individuals, and that applying population statistics directly to judgments about an individual is inaccurate. He also introduces the two uses of probability theory: one is to describe actual probabilities, and the other is to compensate for our ignorance of the unknown; he notes that we usually use probability theory to cover our ignorance rather than merely to compute true probabilities. In discussing machine learning, he points out that although machine learning has its own tools and vocabulary, it is fundamentally built on statistics, and its core problem is how to apply statistics to individual predictions, for which there is no simple answer. He also discusses possible racial and gender biases in machine learning, and how adjusting training data or algorithms can mitigate them. Finally, he concludes that we use probability theory in all kinds of situations, but its applicability and meaning are not always clear, and he stresses that statistics apply to populations, not individuals, so their limitations must be kept in mind.

Cameo: Cameo joins Bruce Nilsen to discuss the use and misuse of statistics. She points out that statistics apply to populations rather than individuals, so population statistics cannot be used directly for individual predictions. She also notes that people often wrongly assume that statistics (such as disease death rates) are distributed evenly across a population, ignoring individual variation. In discussing coronavirus death rates, she observes that we use statistics to cover our ignorance of the unknown, which is especially apparent in explanations of regional differences in death rates. On machine learning, she argues that addressing bias in models requires thinking about how to improve predictive accuracy for different groups, not just about eliminating bias itself, and that removing bias involves a trade-off between accuracy and fairness. Finally, she concludes that people often use statistics to support their own biases and to make assumptions about other groups.


Chapters
Cameo and Bruce discuss the common misuse of statistics, particularly in the context of individual probabilities versus population statistics, using examples like breast cancer risks.

Transcript


The Theory of Anything podcast could use your help. We have a small but loyal audience and we'd like to get the word out about the podcast to others so others can enjoy it as well. To the best of our knowledge, we're the only podcast that covers all four strands of David Deutsch's philosophy as well as other interesting subjects. If you're enjoying this podcast, please give us a five-star rating on Apple Podcasts. This can usually be done right inside your podcast player or you can google The Theory of Anything podcast Apple or something like that.

Some players have their own rating system, and giving us a five-star rating on any rating system would be helpful. If you enjoy a particular episode, please consider tweeting about us or linking to us on Facebook or other social media to help get the word out.

If you are interested in financially supporting the podcast, we have two ways to do that. The first is via our podcast host site, Anchor. Just go to anchor.fm/four-strands, that's F-O-U-R, dash, S-T-R-A-N-D-S. There's a support button available that allows you to do recurring donations. If you want to make a one-time donation, go to our blog, which is fourstrands.org. There is a donation button there that uses PayPal.

Thank you. Welcome to the Theory of Anything podcast. I'm Bruce Nilsen. I've got Cameo here with me. And today we're going to do a subject that Cameo herself chose: statistics. So you've probably heard the joke about lies, damn lies, and statistics. That's what we're going to talk about today: how people use, or misuse, statistics.

So I've got a couple of things to use as examples. Cameo and I started this conversation over lunch. This was right during the coronavirus scare, so it was an online lunch. And here are some things we talked about. So suppose there's a woman who's age 52, and she hears this statistic that women over 40 have a 5% chance of getting breast cancer. Okay. What is her chance of getting breast cancer?

Anybody know? Probably the assumption would be a 5% chance, right? Right. Okay, but suppose later she hears this statistic that women over 50 have a 6% chance of getting breast cancer. Well, she's both over 40 and over 50. So which one applies to her? Again, maybe you would assume it means 6%: oh, she's actually in the 6% category. But now suppose this woman later hears that women over 60 have a 10% chance of getting breast cancer. Now what is her percent chance of getting breast cancer?

Let's stop and think about that one for a second. What do you think, Cameo, at this point? Well, because you and I have already been talking about this, of course, none of those percentages actually represent her likelihood of having breast cancer. Yes. Because breast cancer is typically based on personal and individual things, you know? Yeah. So in this case, it's hard to even figure out.

She's not exactly in any of these categories, right? I mean, women over 50 includes women over 60, who have a much higher chance. So does that mean her chance would be lower than 6%? Probably. You know, it's hard to figure out based on these numbers, because statistics apply to populations. They don't apply to individuals. And so we use statistics as if they apply to individuals, but they really never directly apply to individuals. Yeah, and you know, even...

Even within populations. So the reason why we wanted to talk about this is, you know, we see a lot of statistics about how many people are going to die because of coronavirus. Right. You know, originally people weren't so scared, like two or three weeks ago, because you'd hear the death rate is lower than the flu, or the death rate is just 1% higher than the flu, or a variety of different percentages that we were seeing. And when people think about those percentages, a lot of times they assume that they're going to be applied consistently across the population. Right. So if we know that the death rate is 10%,

that's pretty scary, because one in 10 people that I know... or, I'm sorry, yeah, one in 10 people could be dying. But the book that has been in my head a lot lately is Doomsday Book by Connie Willis, in which a small population ends up having 100% mortality, because they're unable to care for themselves, and they all end up dying from something that only has a 20% death rate. - Right.

And so that was what you and I first started talking about that got us wanting to talk about probability. And I love that you have this slide now: the two uses of probability theory. Yes. So there are two uses for probability theory, and this is something that I think people don't think that much about. Okay, so take a die. Before you roll it, what are the chances you're going to roll a six?

Well, obviously it's one in six. Now, and you probably don't think about this, but now I take a die and I roll it and I cover it with a bowl, so I don't know what I just rolled. Okay. What are the chances that the die underneath the bowl is a six? Well, obviously it's one in six. Okay. But there's actually a difference between these two cases. One is actual probability, and the other is ignorance. So we use probability to cover ignorance, to discuss what we don't know.

And in fact, this is probably the main way we use probability theory today. And you can see from this example that it makes some sense, right? This isn't just something stupid. The die is either a six or it isn't, right? It's already been rolled. It's been determined. It's either a six or it isn't. So there are no probabilities at all involved, in the strict sense of probabilities.

And yet to say there's a one in six probability still makes sense to us, right? Because we just don't know what it is, but we know it was produced by a process that gave it a one in six chance, right? You're breaking my brain a little bit here.

So it gets crazier, though, right? So what's the difference between these two examples? Well, it's a little hard to describe what the difference between the two examples is, other than the obvious one: that one's a straight-up probability and the other one is somehow related to ignorance. Okay. So we use probability to cover our ignorance, to measure our ignorance. And there are ways in which this makes sense and ways in which it doesn't.

And it's not always obvious which way it is when we start using probability theory on things. Okay. So suppose we discover that you have a 100% chance of getting some disease if you have a certain gene, and the gene exists in one in 10,000 people. Okay. So the question I want to ask is: what are the odds that I'll get this disease? Okay. Yeah.

Right off the bat, we could say, well, one in 10,000 people have this gene. We can treat that as a probability. We could say there's a one in 10,000 chance you're going to get this disease, if you live long enough or whatever, right? Sure. Okay. And this makes sense, right? This isn't complete stupidity, using probability theory in this way, but there's something a little off about it too. Okay.

Okay, so just a quick primer on probability theory, so I can use some notation here. So we write P(disease). If you're watching this on YouTube, you can see the screen; if you're just listening, you're going to have to bear with me here for a second. So P, and in parentheses, disease: that's the percentage chance that you're going to get the disease. And we're postulating that it's a 0.01% chance, one in 10,000.

Okay. Then the next one is the probability of the disease given (the bar means given) that you have the gene. Well, that's a 100% chance. Okay. So if you know you have the gene, there isn't a probability that you're going to get the disease: you are going to get the disease. Okay. Then there's the probability that you're going to get the disease if you don't have the gene, and in this case we're claiming it's 0%. If you know you don't have the gene, there is no chance you're going to get the disease. Okay.

So then the real question that we're asking is: what's the probability that you have the gene? Right. We're postulating that that is also 0.01%, because one in 10,000 people have the gene. But the odds of you having the gene are not random. Okay. Obviously it depends on who your parents were, right? Whether they had the gene or not.

And that's not going to be randomly distributed throughout the population. That may be the odds across the entire world, but like maybe nobody in India has the gene or it's way more rare there or something like that. Right. So which population you're a part of matters in this case. And that's,

that 0.01% odds isn't really going to apply to you, right? It's just the best we could come up with to express our ignorance, if that makes any sense.
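For anyone listening rather than watching, here is a minimal reconstruction of what the slide's notation says, using the numbers from the episode; the final line, a sanity check via the law of total probability, is our addition:

```latex
\begin{align*}
P(\text{gene}) &= \tfrac{1}{10\,000} = 0.0001 \\
P(\text{disease} \mid \text{gene}) &= 1, \qquad P(\text{disease} \mid \text{no gene}) = 0 \\
P(\text{disease}) &= 1 \cdot 0.0001 + 0 \cdot 0.9999 = 0.0001 = 0.01\%
\end{align*}
```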

It actually makes great sense. I love this example, especially given coronavirus and how much conversation we're having about which populations are likely to be hit hardest. When you look at death rates right now in Germany as compared to Italy,

There's so much of this under the surface and so much of people postulating about what's causing different death rates. Right. Where really we're just doing it to cover our ignorance. It's fascinating. Go on, Bruce. Okay. So this is a famous example that comes up. So I'm obviously studying machine learning and working on a master's degree. This came up in my machine learning class. This comes up everywhere. They always bring this up.

So this is an example that gets used. So it's the idea of screening for a disease.

Okay. So you have a test for a disease. Let's say it's the coronavirus test or whatever, right? And it's 95% accurate. Okay. What we mean by that is that the probability that it's going to show a positive if you have the disease is 95%. So that's the second line down there, right? Right. Okay. But it gives 5% false positives. That means that if you do not have the disease, there's a 5% chance it will still say positive. Okay.

Okay. So now let's say the disease is, you know, one in a thousand, so 0.001. And you've just been tested positive. What are the odds that you have the disease? So just using your intuition cameo, make up what you think the odds are going to be based on what I've told you so far.

Well, you know, the easy answer, the answer my brain wants is that there's a 95% chance that I have the disease. Yeah. Okay. Because that's easy math, right? Is to just say, well, that's as good as I can guess anyway. So why do you think that that number is wrong? Well, I'm not sure I believe that it's wrong or right because...

because I don't believe any of this data probably is necessarily relevant to me or to anybody in this specific instance. All right. So there's actually a way to calculate the actual probability using something called Bayes' theorem, which deserves its own show, so I won't get too far into it. If you look on the screen here, though, I give what the formula is for Bayes' theorem. And what Bayes' theorem does is it allows you to –

switch the order. So we know the probability of getting a positive, if you have the disease, is 95%. What we really want to know is: what's the chance that you have the disease if you got a positive? Okay. So we want to take that

and flip that. You see how I'm showing how it can be flipped there with Bayes' theorem? Yeah. Okay. Now, I'm not going to use Bayes' theorem. I'm going to just give you straight numbers that will be far more intuitively obvious. Okay. Here's the actual numbers. So let's say we have a population of 100,000. And how many of them, based on our numbers, have the disease? Well, it's one in 1,000. So 100 of them have the disease. Okay. And the rest don't.

Now we test all of them. And of those 100 that have the disease, 95 of them get a positive and five get a negative. Okay. Because that's the numbers. That's the chances, right? Right. Okay. Now of the 99,900, 5% of them get a positive and 95% of them get a negative. Okay. So that's 4,995 that get a positive.

If you have a positive from this test, you're in that group of 5,090, of which only 95 have the disease. In other words, there's a 1.9% chance you have the disease. Interesting. Okay. Okay. If I tested positive. If you tested positive. So this is actually why it doesn't make sense to have a screening test like this and give it to the whole population.
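Here is a quick sketch of both routes to that 1.9%, the head-count version just described and the same answer via Bayes' theorem. The numbers are the episode's; the code itself is our illustration:

```python
population = 100_000
prevalence = 1 / 1_000          # P(disease)
sensitivity = 0.95              # P(positive | disease)
false_positive_rate = 0.05      # P(positive | no disease)

# Head-count version: 100 sick -> 95 true positives;
# 99,900 healthy -> 4,995 false positives.
sick = population * prevalence
true_pos = sick * sensitivity
false_pos = (population - sick) * false_positive_rate
print(f"{true_pos / (true_pos + false_pos):.1%}")   # 1.9%

# Same answer via Bayes' theorem:
# P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|not D)P(not D)]
posterior = (sensitivity * prevalence) / (
    sensitivity * prevalence + false_positive_rate * (1 - prevalence)
)
print(f"{posterior:.1%}")                           # 1.9%
```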

So this assumption here is that we're giving it to the whole population, which in this case we're saying is 100,000 people. Sure. You would never really do this in real life. It would be stupid to use a test like that that way in real life. What you actually do is you go to the doctor. The doctor first checks you out, says, oh, your symptoms suggest you have this disease, which –

Although I don't know what the probability is at that point, it's moved you into a much higher probability category. Certainly. For having the disease. Then the test makes some sense, right? Sure. But if you just went out and tested everybody, the test would be basically useless. Yeah, because, well, you know,

For us as humans, when we look at numbers, we think that 95% accuracy is a pretty good number. Yes. When we look at that, we say, okay. In fact, when you put this example up here, one of the first things that I thought about was condoms. Condoms are considered to be 98% accurate when they're used correctly every time.

But even 98% isn't great if what you're trying to do is prevent pregnancy. Right. And it's also interesting: obviously, if for some reason you had to test the entire world, what you would actually want to do is run the test multiple times, right? Oh, right. It would be super expensive, but you would want the test to get down to: okay, this is the group that is most likely to have it.

And then you'd want to test them again. And then you'd want to test them again until you got it down to some level of odds that was reasonable. It would take a while to get there.
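A sketch of what repeated testing does to the odds. The big assumption, and it's a real one, since repeat tests on the same person are rarely fully independent, is that each test errs independently:

```python
def bayes_update(prior, sensitivity=0.95, false_positive_rate=0.05):
    """Posterior probability of disease after one more positive test."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p = 1 / 1_000                    # start at the prevalence
for test in (1, 2, 3):
    p = bayes_update(p)
    print(f"after positive test {test}: {p:.1%}")
# after positive test 1: 1.9%
# after positive test 2: 26.5%
# after positive test 3: 87.3%
```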

Well, because you would have that group of roughly 5,000 people who had tested positive, whom you would then test again. Yeah. And you'd probably narrow it down by about the same percentage again as you went. Yeah. But consider also: what you're trying to do is stop the spread of the disease. Let's say it's contagious. You've still got those five out there who had the disease but tested negative. Right. It's...

you got to deal with them somehow, right? So...

Fascinating. Okay. Okay. So now let's talk about machine learning. So machine learning is based on statistics. There's a joke meme out there (I included it in the original version of the show and then thought, you know, I'm not sure I should, because I'm not sure who it came from) about how machine learning is really just statistics, right? But it's been reframed so that people like it.

Right. We can't tell that people like it. Yeah, so this isn't quite true. Machine learning is its own field of study that has its own vocabulary and it's got its own tools that are different from statistics, but it is rooted in statistics. That part is certainly true. And a lot of the machine learning techniques are actually statistical learning techniques, which are used by statisticians.

And then machine learning has its own techniques, separate from statistical learning techniques, such as neural nets, which aren't used by statisticians. Right, right. But they all work in the same way. The idea is that

we're trying to make a prediction for an individual based on some sample that we've been trained on. And it's the exact same problem: how do you apply statistics to the individual? And there's no easy answer to it. So machine learning is trying to come up with some way to automatically split up the population to where it's going to give you a good answer for an individual.

And there are tons of problems with it. That's what the whole field of study is about: how do we solve these problems? Right. And so this graph I've got here, let me describe it. This is from one of my actual master's papers. Right. So there's a data set called the diabetes data set, where they were checking for diabetes in Pima Indians, and

you can use the machine learning techniques to try to predict if someone's going to have diabetes or not based on certain characteristics, okay?

So what I've done here is I've run something called t-SNE, which takes a whole bunch of different dimensions and flattens them down to two dimensions, so that you can get kind of an intuitive feel for what's going on in the machine learning. And what we have here is class zero and class one. Class zero is red; that means you don't have diabetes. Class one is blue; that means you do have diabetes.

So what's this machine learning technique doing right here? Well, it's taking this giant population, based on a number of different statistics that have been flattened to a simple X and Y, and it says: I'm going to draw a line, and if you're above the line, then I'm predicting that you have diabetes, and if you're below the line, I'm predicting that you don't. Okay. And you can see the line there, right? Okay.

Well, now, if you really look at it, the diabetes and not diabetes are all pretty intermixed. Okay. But you can see that it rather intelligently drew the line such that there's going to be, if it predicts that you have diabetes, there's a better than 50% chance that it's right. And if it predicts that you don't have diabetes, there's a better than 50% chance that it's right.

So it just draws that line, and it says: okay, this is it, predicting diabetes for these and not for these. And it knows it's going to maximize its percent chance with this line-drawing technique, which is called logistic regression.
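Here's a minimal sketch of that setup; the file name diabetes.csv and the column name has_diabetes are hypothetical stand-ins for whatever the real dataset uses:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE

# Hypothetical file/column names; the real dataset will differ.
df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="has_diabetes"), df["has_diabetes"]

# t-SNE flattens the many feature dimensions to two, purely so a human
# can eyeball how intermixed class 0 and class 1 are.
coords = TSNE(n_components=2, random_state=0).fit_transform(X)

# Logistic regression draws a single linear boundary, placed to maximize
# how often its yes/no guesses come out right on the data it sees.
model = LogisticRegression(max_iter=1000).fit(X, y)
print("accuracy on the data it trained on:", model.score(X, y))
```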

It's going to maximize its predictions by drawing the line right there. Okay. Okay. Now, when I put it this way, you see what a simple mathematical technique it is, right? It's not even all that intelligent, right? I mean, we treat machine learning like it's some sort of counterpart to human intelligence, but really it's just simple

statistics, right? It's just measuring, okay, I'm going to maximize my predictions if I guess here. Now we do that on the sample set and notice that I refer to it as the training set. Okay. It gets a 70% accuracy on the training set. And then

on my cross-validation set, it's a little less. Well, the reason for that is, of course, that it scores really well on the set it trained on. Okay. The question is: how well does it do in real life? And we don't know that; you never actually get to know how well it does in real life. What you do to try to mimic that is, first, we have something called a cross-validation set, where I hold back some of the samples and don't let it look at them until after I've trained. And then I have something called the test set, which I don't show here.

Maybe each time I train it, I then use the cross-validation set. But slowly, over time, that contaminates the cross-validation set, because I keep tweaking based on "oh, it's not doing very well yet," and it starts to indirectly learn the cross-validation set. So then I have a test set I use at the very end that it's never seen at all. And that's supposed to give me some sort of confidence that my final

training, what I came up with, will actually work in real life, because now I'm checking it against a certain number of samples that I've held back, that it's just never seen before. Right. And it never scores as well on that test set as it does on the training or cross-validation sets; it always scores less well, and in real life it'll do even worse. Right.
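A minimal sketch of that train / cross-validation / test discipline, continuing with the X and y from the earlier sketch; the split ratios are just an illustration:

```python
from sklearn.model_selection import train_test_split

# Carve off the test set first and lock it away until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Split the remainder into training and cross-validation sets. Every
# peek at the CV score to tweak the model leaks a little information
# into your choices; that's exactly why the test set stays untouched.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
# Net result: 60% train, 20% cross-validation, 20% test.
```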

And that's precisely because there is no easy, straightforward way to apply statistics to individuals. So these are all the techniques we've had to come up with over the years to try to deal with that. Okay. And there are really interesting

things that have come out of this. So maybe you've heard about racist machine learning, or sexist machine learning. Okay, this is the exact same problem: the statistics problem of trying to apply statistics to individuals, but in a special case. So I've got two actual online magazine quotes here that I've put up, one from The Verge and one from Wired.

The Verge one talks about how Google Photos had a problem where somebody noticed that the photos of their African-American friends were getting tagged as gorillas.

And so they were embarrassed by that. So how did they take care of it? They simply removed gorilla as a label so that it couldn't possibly do that. They didn't have any direct way to go in and get it to tag correctly, so they just removed gorilla as a label, and that way it couldn't do that, and they wouldn't be embarrassed by it. So they didn't really ever fix the problem at all. The other one that's interesting is the Wired story. And the way they word this

kind of downplays what's really going on here. The idea is that they had this thing that could predict your gender from your photo, and it worked really well. But it says that if you were white, it would predict well, but if you were an African-American woman, or a woman with darker skin, then it tended to err a lot. Okay. Well, what's really going on here is that

Its sample set was probably based on a population in, say, the United States or something like that. There were more white people than there were black people. And so when it trained, it found dark skin as a good predictor that it should guess that it was male. So it tended to guess that black women were male. Oh, interesting. Okay. And because...

Because they're only a small part of the training set, this didn't really screw up its statistics for trying to do the predictions. And so it made it so that the end result of the predicting engine was really only useful if you were white.

Right. Right. And so what they actually had to do is go back and do a separate training just for African-Americans, so that it would learn to predict correctly without using skin color as the basis. You know... go ahead. Well, it's super fascinating. On the first example: Google dumps tons and tons of money into things like this. They

might've been appalled at the mistake. This quote says nearly three years on, Google hasn't really fixed anything, so I'm assuming this was written in 2018. What do you think Google's trying to do to deal with these problems? I mean, I assume that the fact that they haven't done anything is misportraying it a little bit. I'm assuming that they're actually trying to fix the problem. Yeah.

So, you know, there are techniques you can use to fix problems like this. And this is actually a big area of study right now: how to make, as they usually put it, machine learning not racist. But there's a broader problem here, which is: how do you just make machine learning accurate to begin with, so that it isn't trying to make guesses for a broad population, but is better at making guesses for the population that you are a part of, that you care about? Right.

And there was an interesting, similar study to this. There was that guy that SolutionStream did some business with. Ben... I forget his last name. Ben Taylor. Oh yeah, yeah, yeah. So I went to one of his talks at a conference, and he had created a machine learning

engine that would take a picture of you and then tell you how good looking you were. Okay. Interesting. And obviously there are ethics around that too. I mean, do you want your kids to take a picture of themselves and then get rated as a four, you know? But it worked. Okay. It would take a picture. You could put

Keira Knightley into it and it would tell you that she was good looking. You know, it actually did properly predict whether a person was good looking or not. Right.

One of the things that he pointed out is that they had to do special training to get there. The data set they got was off of a dating site they found where they could download the data: they could download the pictures, and they could download the ratings from real people. Right. Interesting.

And if I remember correctly, they had to download it slowly so they didn't get caught. So they took the data off this site and then trained on it.

Now, here's the thing, though: it's being trained on a population that matches whoever that dating site catered to, which, for the sake of argument, let's say is the United States. Okay. Well, the sample is going to match the racial breakdown of that population. So obviously African-Americans are going to be in a minority, right? Sure.

And so they would then do these ratings. And what would happen, of course (and you can look this up), is that certain races are more popular for dating than others, right? This is something that's actually been shown in a number of different studies on dating sites. Well, the model would match those. So if you were of a race that was less popular for dating, you would automatically get rated lower in terms of your looks.

So what he did is he would tweak it so that it would rate equivalently, no matter what your race was, by only looking at just that population and then giving a rating based on just that population. So if it's giving you a seven, then you're a seven for that subpopulation rather than for the overall population. And that was his way of making it non-racist. Oh, interesting. Okay.
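A sketch of that per-group adjustment: percentile-rank each person against their own subpopulation only, then map the percentile onto the rating scale. The data and column names here are made up for illustration:

```python
import pandas as pd

# Toy scores; imagine raw_score is the model's output and group is
# the subpopulation used for the adjustment.
df = pd.DataFrame({
    "group":     ["a", "a", "a", "b", "b", "b"],
    "raw_score": [8.0, 6.0, 4.0, 5.0, 3.0, 2.0],
})

# Percentile rank within each group, rescaled to a 1-10 rating.
# A 7 now means "a 7 for that subpopulation," not for the overall pool.
pct = df.groupby("group")["raw_score"].rank(pct=True)
df["adjusted"] = 1 + 9 * pct
print(df)
```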

Now, one of the things that's interesting here, though, is let's say that you're just a regular person on a dating site.

the pre-adjusted numbers actually may well represent your views, right? So you've adjusted it so that it's not racist, but it's actually in a sense less accurate in terms of prediction for certain people now, right? Presumably the people in the majority. Right, because while the population itself may not be racist, it is likely to find the things that it finds attractive, attractive. That's right.

Yeah, fascinating. Okay, so it kind of shows how there's no real easy answer for these questions, right? When you make it so that the machine learning isn't racist, you are giving up a certain amount of accuracy to do that. Or you have to find some way... I mean, there are other interesting things that come out of this. For instance, the fact that even if you don't have race in your training set, it might be able to figure out race

through your zip code or something like that. Right. And so it might end up becoming racist, even if there's not an ounce of race listed anywhere within the numbers that it's playing with. If it has some sort of correlation to race, it will still find it and use it.
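A toy illustration of that proxy effect (all names and numbers made up): the protected attribute is never shown to the model, but a correlated feature lets its predictions track it anyway:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# "protected" is never given to the model, but "zip_feature" happens to
# correlate with it, and the historical outcomes track it too, as
# biased historical data often does.
df = pd.DataFrame({
    "zip_feature": [1, 2, 2, 8, 9, 9],
    "protected":   [0, 0, 0, 1, 1, 1],
    "outcome":     [0, 0, 0, 1, 1, 1],
})

model = LogisticRegression().fit(df[["zip_feature"]], df["outcome"])
print(list(model.predict(df[["zip_feature"]])))  # [0, 0, 0, 1, 1, 1]
print(list(df["protected"]))                     # the same split
```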

Because it's just trying to... and there are cases where they'll predict your chances of going back to jail, or something like that. They'll use it for parole. And in a case like that, you know, if it starts using race and it starts saying, okay, all black men are not going to get parole. Right.

Right now, you're really talking about straight up racist machine learning where it's greatly favoring one race. And you can see why people would want to come up with some way to tweak that so it doesn't do that anymore. So, you know, bringing it back to statistics.

It's actually interesting because the way we use and talk about statistics within our culture is kind of that way. We use statistics a lot of times to back up our own biases. Yes. And to make assumptions about other portions of populations or things like that, where, you know, we have statistics that say,

how likely people are to go back to jail. And we use them as ways to be biased against certain populations. Right. It's really, really interesting. I mean, there's not a way to fix, even if you fix machine learning, it's a lot harder to fix people's brains. Yeah.

And this is actually an interesting thing is there's this big concern about racism in machine learning. But one of the things that you can demonstrate is that humans are biased, racist, sexist, right? And even when they're not intending to be, they can be. But there's some really interesting ones out there. Like there was a...

There was somebody who found, statistically... so they had these inmates in jail, and they gave some of them facelifts, plastic surgery, so they would be more attractive. And they found that the ones they gave the surgery to didn't return to jail.

And so they had this theory where they said: okay, it must be that now that they're more accepted, because they're more attractive, they're able to go get a job and become part of society, and there's no reason for them to go rob a bank and come back to jail again, right? So this was the original theory. Well, some scientist

came up with an alternative theory. He said, I wonder if that's not what's going on. So he went and did a study (or, I don't know if it was a he or a she, but I'm going to say he). They took people who were in court, and they had pictures of them, and they would rate them by their attractiveness. And then they would see how many of them ended up going to jail. And what they found is that juries do not send attractive people to jail.

Oh, interesting. So this is now the alternative explanation: you gave these guys a facelift, and they aren't necessarily fitting into society better. They're just not getting sent back to jail anymore. [laughter]

Just a side note: the interesting thing to me about that study is that almost nobody goes to trial anymore anyway. Our judicial system has all but eliminated trial as part of what it means to go through the judicial system, and it's really, really rare for people to ever go to trial. In fact, I saw some statistic that said

I don't remember. Actually, I don't remember. Anyway, okay. So this is awesome: I have a friend who's a lawyer who told me this. He goes to trial. He and the lawyers he partners with, they go to trial. They were saying, we actually go to trial. But if you see one of those big billboards, call so-and-so if you got injured, or something like that: he says they never go to trial, ever. And everybody knows that, right? They're just trying to settle out of court. Right.

So part of his case here is that you should really go with a group of lawyers that intends to go to trial, right? They'll settle out of court fine, but they will go to trial if they have to, because then they're actually in a more powerful position than the ones who make their money by not going to trial. Right, right. That's interesting. Well,

I think there are some interesting topics we could get into around that. We'll have to talk about that later. Okay. All right. So: what are the chances? All right. So this was actually just intended to be a bunch of examples; we don't have to talk about any of them specifically. The point I'm making here is that we use probability theory

on all sorts of different things. And I've got this large list here of things that maybe we would try to use probability theory on. Some of them make perfect sense. Some of them make sense as a measure of ignorance under some circumstances. And some of them don't make any sense at all.

So what are the chances of rolling a six with a six-sided die? What are the chances of winning at poker? What are the chances of catching coronavirus? You don't have it on here, but: what are the chances of dying of coronavirus? Right. What are the chances of winning the lottery? What are the chances that the 49ers will beat the Miami Dolphins? Oh, I like this one: what are the chances that Obama is the best president of the last century? Yes. And we say things like this, and we use probability theory language, at least.

And it's not always that clear what we mean, right? So, like, the chances of Obama being the best president of the last century: that's probably meaningless in terms of probability theory. It probably doesn't mean anything at all, right? What are the chances that global warming is a real problem?

It's not clear whether probability language makes sense in that setting or not. Okay. The chances that the 49ers will beat the Dolphins: we would use probability theory in this situation, and we would base it on a population of their most recent wins. Right. And it's not clear what that means.

You know, it's probably a decent thing to do. It's probably not completely inappropriate, unlike trying to say whether Obama's the best president, but it's not super clear how meaningful it is either. And so whatever odds you come up with, they should probably not be taken too seriously, because the outcome is actually going to be determined by a bunch of other factors. Have you ever seen or read Rosencrantz and Guildenstern Are Dead?

So it's a play, an absurdist, existentialist play. And the first part of the play starts with the two characters; one character is flipping a coin over and over again, heads, heads, heads, and he's on this roll: heads, heads, heads, heads, heads.

And it's this really beautiful thing, because when people think about chances, one of the things we want is for chance to be applied evenly.

They do studies where they have people try to fake randomness with number strings. Going back to your six-sided die: they might have a group of people put together a pretend sequence of what the rolls would look like, what the numbers would look like. And they can always

figure out which one was made by people, because people want chance to be applied evenly. They want heads to come up 50% of the time, and ultimately it will, but it's just a question of how long it takes to get there. You might roll heads a hundred times. It could happen. That's because each flip is a new chance; it doesn't have any knowledge of the previous one. Right. Right.
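A tiny simulation of that point (our illustration): genuinely random flips contain longer streaks than people intuitively produce; in 100 fair flips the longest run is typically around six or seven:

```python
import random

random.seed(1)
flips = [random.choice("HT") for _ in range(100)]

# Find the longest run of identical results.
longest = run = 1
for prev, cur in zip(flips, flips[1:]):
    run = run + 1 if cur == prev else 1
    longest = max(longest, run)

print("".join(flips))
print("longest streak:", longest)
```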

So in fact, that is one of the things that is interesting: humans do not mimic randomness well. Take rock, paper, scissors, which SolutionStream does a lot. The trick I use (I don't always win, but it usually gets me to the end of the competition, and it got me a reputation for being good at it) is that people overwhelmingly open with rock. Right.

Right. So you just select paper as your first go. And then after that, you just have to pick whatever. Right. And you'll win more often that way. And my wife and I used to do rock, paper, scissors to determine who was going to go do something unpleasant. And I won every single time.

And it drove her nuts, right? And back then I didn't know that it was non-random; I found that out later. But for whatever reason, the way I happened to play beat the strategy she happened to play. Presumably, I was less likely to go with rock on my first throw, and she was more likely to go with rock on hers. And so it was vastly disproportionate how often I won.

So rock, paper, scissors is not random, right? It's even possible to have an explicit strategy that wins it for you. I like to always...

say, you know, you play with people's heads. That's also interesting because of the "what are the chances of winning at poker" one. In poker, there are statistics around the chances of getting a hand, but you don't win poker based on what hand you have. Right. You win poker based on the way you play

the table. Yes. And the way you manipulate the people around you, and manipulate their understanding of the statistics: the probability of each individual hand at any given time, the cards that have already hit the table, and so many other things that have nothing to do with chance.

They recently did this. People have been working on artificial intelligence algorithms for each of the different games. So obviously we had chess originally, with Deep Blue, and then Go was dominated by AlphaGo. They just recently did Texas Hold'em.

And the thing that's interesting is they discussed whether or not they should try to read people's facial expressions as part of the game or not. And they decided not to because they realized at some point that if they tried to read people's facial expressions during the game, people would learn to beat the AI by faking facial expressions.

And so they'd have it just play the odds and it got to the point where it could beat all the humans just playing the odds. But that's not the way humans play, right? Humans actually read each other. So anyhow, yeah. And that's why I use that as an example is because poker is not a purely chance thing. You know, it seems like it's heavily based on chance and it is, but it really isn't truly all chance.

Another one that's interesting is Microsoft rising by 10 points. We measure stocks and their risk based on applying probability theory and a bell curve, a normal distribution to stocks and how much they move. But what actually causes stocks to move is events in the real world.

right, that have nothing to do with probability distributions. Right. And this is the concept of Nassim Nicholas Taleb's The Black Swan: that it's the real-life, rare black swan event

that will cause stocks to move. You know, if stocks really followed a probability distribution, if financial things really followed a probability distribution, then a 10-sigma event would only happen once in the entire history of the universe, if that. Whereas they happen all the time on the stock market, you know, once every 50 years or something like that, right? Rare, but not that rare. And in fact, those rare moves

are what determine, you know, half the value of the stock market. And so if you only had your money in the stock market on, you know, the top 20 moves for the decade or something like that, that would be half the value of the stock market.

And so we try to use probability theory. In this case, it is something to do with probability, but it just doesn't follow probability theory, right? It's determined by something else entirely: the random events that are just completely unpredictable, that do not follow probability distributions at all. And this is what Nassim Nicholas Taleb's book The Black Swan is about, right?

It's the fact that we're trying to use probability theory in situations that have something to do with probability, but do not follow probability distributions. - Right. - And so we use the numbers because it's, quote, the best we can do, but really it's just an inappropriate use of probability theory. - Well, and going back to, I think, one of your very first slides: it's us using statistics to cover our ignorance again.
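A quick back-of-the-envelope check on that 10-sigma claim (our illustration, under the normal-distribution assumption; scipy's survival function gives the upper-tail probability):

```python
from scipy.stats import norm

# Probability of a move at least 10 standard deviations above the mean,
# if daily moves really were normally distributed.
p = norm.sf(10)
print(p)  # ~7.6e-24

# At roughly 252 trading days a year, the expected wait between such
# moves would dwarf the age of the universe (~1.4e10 years).
print(f"about one every {1 / (p * 252):.1e} years")
```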

So what are our chances of dying of the coronavirus? You know, we don't know, right? And what they're trying to do is say: okay, out of the people that have the coronavirus, how many of them die? Well, that makes a certain amount of sense, right? In terms of using probability theory to cover our ignorance, you would think that

Look, if 100 people got the coronavirus and one of them dies, then your chances of dying are one in 100, right?

Okay, but first of all, we don't actually know what the population is, because we don't know how many people have the coronavirus and how many don't. Right. We only know of the cases that are reported. And obviously, a case is far more likely to be reported if you die. Yes. So that automatically inflates the numbers for deaths from the coronavirus. That's why you hear numbers like 3.8% or something like that; that's where that number comes from. And it won't turn out to be that bad.

We don't know how bad it is. It could be bad. The flu is 0.1%. Coronavirus might be something like 0.7%. That'd be seven times higher than the flu, which is why we're being so cautious with the coronavirus: it may be that it kills quite a bit more often than the flu. Right. However, even then, it's not evenly distributed across the population.

Like the flu. I mean, I get the flu all the time, and I'm never even a little bit worried about dying, right? Because I'm not in the population that's most at risk of dying from the flu, which would be older people. And the coronavirus is going to have the same sort of thing: your chances of dying from it will depend on your age, right? I mean, let's say that it was a 1% death rate.

For you personally, it won't be 1%. It will depend on what your age is, right? Well, and age is also likely only one of the factors. That's correct. You know, are you 60, and have you been smoking for the last 30 or 40 years? Yeah.

How good is your healthcare in your community? What I suspect is that we'll see populations that

have much, much higher death rates, and populations that have much, much lower death rates, maybe even at the state level. You know, Utah's a very healthy state; we may end up with much lower death rates than places where there's a lot of obesity. I mean, we haven't actually seen, or have any real knowledge of, what the

contributors to death are. Right. And obviously it's also going to be related to: what are your living conditions? How good is your healthcare system? It's going to be based on a number of different things, probably including timing:

earlier on, when it was catching people off guard, versus now that we know this is a virus we have to worry about. Right. That will probably greatly affect the death rates. So trying to use probability theory in this case maybe makes some sense. But again, you have to keep in mind that statistics apply to populations, not to you. Right. Yeah.

Well, that was a great wrap up for this week's podcast. Yes. And on that note, thank you. This was fun. This was a great conversation. Yes. Thank you, Cameo. Thank you.