
The Hidden Stories In The U.S. Census

2023/3/16

FiveThirtyEight Politics

People
Dan Bouk
Galen Druk
Topics
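Galen Druk: Data should not be taken at face value; its sources, methods, and interpretation need critical analysis. U.S. Census data is no exception, and understanding it requires a close look at the stories and limitations behind it.

Dan Bouk: Many of the important things that shape our lives are hidden in dull bureaucracies and data, and more people need to pay attention to them and interpret them. Bureaucratic complexity sometimes exists to keep too many people from interfering, but it can also mask problems in how a bureaucracy operates. The results of data analysis are affected by many factors, including social and political factors and human judgment, and data can serve as cover for other social decisions and politics. The classifications used in data collection shape the results and can be used to mask discrimination; classification is especially important in the census. The U.S. Census is mandated by the Constitution, and its fundamental purpose is to apportion political representation by population, which makes democracy fundamentally dependent on data. The census is used not only to apportion political power but also for policymaking and public debate, supplying reliable facts for governance. Censuses have a long history, but using one as the basis for apportioning political representation probably began in 1789. The census has wide-ranging effects on politics, policy, business, and more. Census data is used by many parties: companies use it for store siting and sales, unions use it in wage negotiations, and it is the basis for all kinds of polling and sampling. Skepticism about government collection of big data, especially personal income data, is long-standing and reflects American concerns about data privacy and government power. Worries about big data collection include not only fear of government abuse of power but also concern about personal information being disclosed. Getting accurate census data faces systemic problems, such as distrust of government and undercounts of particular groups (such as Hispanics). Adding a citizenship question to the census may have been intended to reduce responses from particular groups, thereby affecting districting and political representation. Undercounts and overcounts are pervasive, but a great deal of progress has been made in recent decades and the undercount rate is now low. Overcounts usually result from double counting, while undercounts are linked to factors such as housing stability and language barriers. Undercounts and overcounts affect the apportionment of political representation, and how much each state invests in the census may also affect its result. Census results are used to allocate trillions of dollars in federal funds, but the Census Bureau uses subsequent data and models to correct errors. Historically there has been deliberate manipulation of census data, such as "curb stoning," tied to the political corruption of the time. Census data collection is closely tied to politics, with conflicts of interest and the possibility of manipulation, but there are also mechanisms to prevent and correct such behavior. The number of census questions has fallen, partly because of advances in sampling and growing concern for privacy. Census classifications shape how people understand their own identities. When filling out the census form, people choose a category based on their own circumstances and the options the form provides, which reflects the interaction between individuals and the state's data system. Categories such as race and sexual orientation in the census affect the apportionment of political representation and shape how people understand themselves and their groups. Hispanic Americans were classified as white in response to segregation and discrimination. Racial and ethnic categories in the census have changed over time, reflecting the evolution of racial politics in American society. Future censuses may need to adjust categories for race, ethnicity, sexual orientation, and gender identity. Adding sexual orientation and gender identity questions requires ensuring the confidentiality of the data. Understanding the human flaws and biases in data helps strengthen critical thinking about data and improve how it is used. When census data is used to allocate political power and funds, the uncertainty of the data needs to be considered and more reasonable allocation methods explored. Studying the history of American numbers can help us understand the role and evolution of data in society. Data affects how people understand themselves and society, but data is not always objective or accurate. Media classifications of political views can mask the diversity and complexity of people's views, for example the definition of "moderate." Macroeconomic data and data such as crime rates affect how people perceive social reality, which in turn affects voting behavior.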


Chapters
The podcast introduces the U.S. Census as a pivotal data set in America, influencing political representation and federal funding allocation. The conversation sets the stage for a deeper dive into the complexities and implications of census data.

Transcript


You're a podcast listener, and this is a podcast ad. Reach great listeners like yourself with podcast advertising from Libsyn Ads. Choose from hundreds of top podcasts offering host endorsements, or run a pre-produced ad like this one across thousands of shows to reach your target audience with Libsyn Ads. Go to LibsynAds.com now. That's L-I-B-S-Y-N-Ads.com.

These kinds of microphones, you can really like make love to them. So you can be like, hey, hey, right. We've known each other for a long time. It's like Christmas Eve in the city. You're like sending out a message to like all the lonely lovers. Who just want to talk about data. Yeah, exactly. Won't you be my census friend?

Hello and welcome to the FiveThirtyEight Politics Podcast. I'm Galen Druk. We have a regular segment on this show, the premise of which is that data shouldn't simply be taken at face value. It's important to ask, is it a good or bad use of polling or a good or bad use of data, as listeners well know?

And what we've tried to stress over all these years is that data can go awry in all kinds of ways. When it comes to polling, question design matters, how you reach people matters, how you make up for a skewed sample after the fact matters, and of course, how resulting numbers are framed certainly matters.

Our guest today makes the case that America's most famous data set should not be taken at face value either. That data set is, of course, the census. Dan Bouk is his name, and his new book is called Democracy's Data: The Hidden Stories in the U.S. Census and How to Read Them. He is a history professor at Colgate University. Thank you so much for joining me today.

This is pretty much the ideal podcast for me. So thank you so much. We're excited to have you. I should ask, first and foremost, would you agree with me that America's most famous data set is the census? It's probably the nerdiest thing I've ever said on this podcast. It's on every, like, it's the cover image for every statistical magazine. Yeah, like there is no more famous glamour shot than a statistical table. It's like the Angelina Jolie of data. Yeah, no, by far.

So just to like set the table here a little bit, on your website it says that you research the history of bureaucracies, quantification, and other modern things shrouded in cloaks of boringness. What does that mean? So my theory is that a lot of stuff that really shapes how our lives work –

Yeah.

And so like one of our tasks, and I think this is something you all here do as well, is trying to convince people that they should be engaged with this stuff.

Wait, and so is this almost like a conspiratorial view where it's like, oh, they purposely make it boring so no one realizes what's actually happening? Or you're just saying that by dint of the fact that this stuff is complicated, most people don't want to engage with it and just ignore the bureaucratic mess, and therefore you want to shine a light on it? Or is it the former? Are you a conspiracy theorist?

I don't believe that those two positions are mutually exclusive, in the sense that a bureaucracy, insofar as it wants to operate successfully, builds complicated systems. To some degree, it's easier if there are fewer people putting their noses into the bureaucracy, into the kitchen, trying to like mess with things. And so there's a certain advantage to complexity for complexity's sake, even though a lot of that complexity is also just necessary to run this, like, gigantic technological system, which is the census.

So obviously we're going to talk about the census, but just to sort of give me some sense of what you're talking about here, what are some other examples?

Well, so my first book was about the life insurance industry. And so there, one of the kind of fascinating things is this is an industry which is predicated on the idea that we will use these scientifically accurate tables of life expectancy as a means of then pricing policies and saying this person pays so much, this person pays so much.

And at some level, it's like, oh, all right, yeah, this is like science, data, right, that's really being used to then turn this big multibillion-dollar industry into something that's completely rational.

And then you start to dig into the process, which is like full of all of this like – all these trappings of science. And then when it turns to actually producing these policies, setting premiums, this sort of thing, you have these very precise numbers upon which are added 30 percent fudge factors because they're like, actually, there's a lot of uncertainty in this. We're not really sure what's happening. And a lot of it turns out to be that this is a way of justifying different forms of discrimination in pricing that may or may not –

be the most important way of understanding like how risk is distributed in the society. So like that was kind of one of the first things that got me into the sense that there are these –

realms in which there is important kind of data work going on. The data work is not just for nothing. It does inform the decisions being made. And yet it also can be used as kind of a cover over other kinds of social decisions, politics, other factors that are shaping how it is that a system works.

I think we actually talk a lot about this in our podcast episodes that we call Model Talk, where we dig deep into the forecast and talk excessively about polling, and that in crafting a survey, there is both an art and a science to it. You know, the scientific method underpins it, but pollsters are making a lot of decisions, as I mentioned, about how to craft a survey, even how to weight after the fact, that can include, you know, maybe your own biases or whatever you think about the current political environment in the country. You're like sort of putting your finger on the scale in a certain way. Is that what you're talking about?

Yeah. I mean, so in that, for like that first book, one of the moments that was really striking to me was when I found these

Like a thesis that was written by a black insurance scholar in like the 1940s in which he took this very famous table that had been produced by a major metropolitan life insurance company statistician that showed like white racial stocks and their different longevities who all in life insurance tables were classified as white and therefore paid the kind of white rate.

And this person inserted the life expectancy for African-Americans into that table showing it existed within the distribution of these so-called white racial stocks. And it was this very powerful argument about how because the industry had separated white from black and created this big white bucket, it said it was just for all of those people to pay the same rate even though it said it was also just for African-Americans to pay a higher rate in those premiums. And it was not because of –

the science, it was because of the decision about how to make those classifications. And of course, the census is all about making classifications. It is all about making classifications. So let's talk about that. And I'm curious, before we dig into the nitty gritty, and there's like a lot to, you know, you wrote a whole book about the stories behind the census.

What exactly is the census? Why do we have it? And do other countries have a census the way that we do?

It's really kind of remarkable. The census is in the Constitution. So the Constitution is not primarily a document that is set up to do scientific work. And yet it does call for this complete enumeration of the entire population, the premise of which is that this will allow for representation to move with population. So the kind of fundamental idea of democracy, that you will be represented and that the power of your state will change depending on how many people are there, is only made possible by the fact that there's this census every 10 years.

And that's why you say in your book that democratic representation is like fundamentally data-driven. Fundamentally data-driven, exactly. I love that. Yes. Well, it doesn't have to be that way, but that has been the mechanism by which this has been produced, this idea that you just can't actually represent a mass society without producing that data. I mean, and it goes even further than that, right? Because it is necessary to figure out how to apportion political power.

But then we also use the census. It has come over time to be used to answer all these other questions. So I think of it as like a factory of American facts. And those facts are then also what we use as citizens and residents to fight about things and talk about things, what politicians use to make points. And so it's necessary for governance in that sense as well, that like democracy depends on us having some kind of reliable facts, even if we contest them, something we can kind of hold together as shared.

So is the census then an American idea?

I mean, right, censuses go back way, way, way, way, way back. Well, at least to the Roman Empire, right? Yeah, right. I mean, they famously show up in the Bible as things that people were not necessarily a fan of. Those censuses, one of the key things that they did was try to both find people so you could tax them and find people so that you could pull them in for military service. And so for this reason, censuses have often not been particularly favored by individuals. And indeed, there were a bunch of these censuses that happened in

the British colonies before the founding of the United States. What is maybe distinctive, like I think probably this is an invention of 1789, is to use the census as like not just something you use to decide how to tax somebody, but also this is what your representation is going to be premised on. So there's a carrot to go with the stick.

Okay, so apportionment, redistricting, important parts of the census, and maybe the most high-profile way the census is used, at least for our podcast purposes. But I mean, it's used for all kinds of things, right? In terms of how policies are administered, how corporations use the census, like give us some examples of all the ways that the census touches our lives.

Well, so for instance, in 1939, there's this conference held to figure out what the questions will be for the 1940 census. And so you can look at the people in the room and get a sense of like who is planning to be a data user henceforth, right? One of the most important people in the room is like the Jeff Bezos of that moment, the chair of Sears, Roebuck and Company, which at that point is like the big mail order firm. It's also beginning to create a brick and mortar empire of stores. And so they're in this room and saying, we want information like what is the income of all these individuals in the census, because they want to use that to figure out where they can place their stores. They want to figure out how they can optimize their sales forces. Similarly, we see people from the labor unions, the new labor unions are in that room, and they too are interested in things like income because they're trying to think about how they could then use that to talk about wages and try to negotiate for better wages for their workers.

So that's one level of things that would happen. The other thing that's really important, and again, we can trace this in some ways to the 1940 census, is the full count of the census, which is like, as we said, not perfect, but it's basically the best thing we have. It is our gold standard even if it's not great. There's nothing better. And it is then the sample frame. It gives us the basis upon which all kinds of polling and sampling can build in the future. Right.

So in this case, in the 1940 census, this is the first time in which they bring probabilistic methods into the census operation. And so on each of these census sheets, there would be these two bars drawn in. And whoever happened to fall into that sheet then would be asked a series of extra questions, which could then be used to extrapolate answers for the rest of the population. So from what I understand in reading your book – actually, I listened to your book, the audiobook. Did they do a good job? Yeah. All right. Yeah.

The idea of adding an income question to the census in 1940 went over like a lead balloon. And this is a window into American skepticism of big data, collection of personal data, things like that. When did a skepticism of big data begin, particularly big data in government? And why was income specifically such a hot button issue?

So when we think of skepticism of big data, I think one of the things we're often imagining is that the Orwellian big brother, right? That government has this database and they're going to look at it and find us and then do something bad to us. That's true. That's a very reasonable fear. But that's not the primary fear that was motivating people who were concerned about this income question. So some part of it is that

Every 20 years, the census coincides with the presidential election. And so every 20 years, you're basically guaranteed that there's going to be a very politically contentious census because whichever party isn't in power is going to use the census as a reason or a means to try to air some grievances or try to score some political points. So that's what happens here. The Republicans –

use the census and particularly this income question as a means to attack the Roosevelt administration's attempts to use the New Deal to actually think about and improve the economic lives of individuals and citizens. So we have like at the big political level, people saying, be afraid. This is a terrible thing that is happening.

But then when we dig in and see the kinds of fears that individuals are actually expressing about this question, they're a lot less about like what's going to happen at the center and they're much more about – I'm concerned about like the person, my neighbor, who's coming around asking these questions. I don't want them to know what my income is. I don't want them to know these details about my life. There's like this great moment which really hit this home for me where a woman writes a letter to her senator and she says, look –

I make $260 a month or a year or whatever and I really don't want the government to know this.

And she's like, she just wrote this to her senator. But for her, it wasn't really that she didn't want the government to know. She didn't want to tell it to like the enumerator wandering around.

Okay, well, this is a perfect segue into maybe the most important question as relates to how we think about polling. What is the challenge of getting an accurate census? Is it largely that people don't want to reveal information about themselves to either the form on the internet or the person who shows up on their doorstep?

Or is it something more systemic than that?

I think systemic is the more reasonable answer. I mean, there is always some amount of, do I really want somebody to come to my door and knock and ask this question? But I mean, to that point, you don't have to these days necessarily talk to somebody at the door. In 2020, for the first time at mass scale, thank heavens, we had an internet self-response so that people could fill this in just on the internet. But somehow that might still be that people don't trust the government and therefore don't want to answer questions.

I mean, when I talk to people and when I look at people who have done this research, they often point to other kinds of factors, right? So there was a very significant undercount in the Hispanic ethnic category, for instance, in 2020.

And there, you know, one could very plausibly point to the way in which the Trump administration tried to put a citizenship question on the census, clearly as a mechanism to try to depress responses. And that's probably something.

Hold on. Was it to try to depress responses or was it to try to say that it could be possible for states or municipalities to district according to citizenship status only? Because I know that that is a legal battle that has sort of gone on for a while and it's likely to continue to go on because there are some places in the country that have tried to draw districts based on citizenship population, which gets to the question of do we have representation proportionate to the number of voting eligible people or just the number of people?

It seems constitutionally like it's the number of people, period, although there are some debates about whether it should be the voting eligible population. And then you get into questions about, OK, well, what about prisons? What about children? What about all of these different things? That was what I understood part of at least that debate to be. I mean, yes, this is a really important question. So I'll answer it and then I'll dig into the deeper part of it. So

I would say that, yes, there's a lot of evidence the Trump administration would have been very happy to get a count of people based on citizenship. And then they certainly later tried to also produce this file of undocumented people that they were also, we think, probably going to try to use to somehow remove that number of people from the –

But even if that didn't work out, I think there was probably a sense in which bonus will be we can decrease – we can depress responses in the first place by just the idea of citizenship. And what you're saying is the advantage there is that then when the districts get apportioned, they're sort of like spread out larger in places where –

say, immigrants who are in the country illegally are concentrated. And then, therefore, there's less democratic representation at the end of the day. That's what you're saying the goal is.

Yeah. In the end, right, the idea here is since the Constitution, since the 14th Amendment, the census has very clearly counted persons. And so to that point, it isn't voters. And there have been fights about this since at least the 1920s in which people have said like, you know, maybe it should be voters in terms of just at the apportionment level. And states have often used only voters. New York State up until like the 60s was drawing its lines based on citizens, not based on persons.

But at the constitutional level, it is persons and it would take a constitutional amendment in order to make it something other than persons as the basis for apportionment. So that means then if you're trying to skew the numbers –

And you think you can make it so that a district that somebody else is going to win, you can pack it more with people who aren't counted. That might be to your advantage. And so that's a concern about what are the other concerns about how an undercount or an overcount might result? And actually, I want to read a quote. You wrote that the Census Bureau's post-enumeration survey, so they go back and try to figure out where they messed up.

Like, how did that happen?

How did they count an extra million people? Yeah, I mean, I think to most people listening, they'd be like, that sounds shady. That sounds weird. Yeah, it's actually, I'm glad that you put it like that, like this sounds kind of shady. One of the reasons I wrote this book and one of the reasons I hope people will read it and think about it is that people often talk about the making of census as the making of like sausage making, right? No one wants to see the way the sausage is made. It might taste delicious, but if you look at it, you're like, whoa. I mean, this is, I think, actually the way that most data systems work.

And when you start looking at how the data system works, you're like, oh, actually, there's a lot of messy stuff happening in here. This has been often used to the advantage of groups who want to sow doubt. So you think of like climate change, skepticism, cigarettes and causing cancer, right? This kind of – there's a whole industry that's been built on the idea of taking ordinary scientific process –

putting it in front of people and saying, oh, this looks awful shady. And the way that you sort of do away with that is you shine light. So how did this happen? How do overcounts and undercounts happen? You shine light. But also, I think one of the things you do is you say, like, actually, this is not that bad an undercount. I mean, this is now a bad overcount or undercount because for the last 50 or 60 years, there's been so much energy put into trying to limit those overcounts and undercounts.

But the total undercount in 1940 was 3% of the population. Amongst African-Americans, it was between 13% and 15% of the population. So the fact that we're now talking about single percentage points is actually a lot of progress. It's part of why even though we now rely on people to self-respond, we send half a million people or hundreds of thousands of people out into the field to try to count just those people who couldn't be found in the first place.

So when we have a large overcount like this, what that probably means – what it surely means is some people are being double counted. And in the pandemic in 2020, you can see pretty easily how that would happen. Somebody gets a form sent to them at their home in New York City where we live here. And then they also get sent to their home where like I teach in upstate New York. And if I had a home there, I could have gotten a form there. If I filled it out in both places and the census wasn't able to figure out I was the same person in both places –

Bam, I've just been counted twice. And that problem of deduplication turns out to be very tricky. The Census Bureau has not been able to do it particularly well. And when we even say that in recent censuses, there's been usually about a 1% or sometimes even less than 1% undercount, that's a net undercount. So some amount of the inequity in the census comes from the fact that there are double counts of usually more privileged people who have multiple homes and

And an undercount of people who are in less stable home positions. If you were thinking about structural forces, it's like having forms that aren't in your language or not having a home is one of the primary things. Like people who are more transient are much harder to count. And what are the consequences of...

over counts and under counts. Obviously, from an apportionment perspective, it's diluting your representation or over-representing you in Congress. So I guess we here in New York are over-represented. I mean, right now, when the numbers came out,

Everyone that I was talking to imagined that New York was about to lose two seats. Yeah. And then it came like, what, a couple thousand away from not losing any seat at all? This is one of the both funniest and most infuriating moments of my entire time here because somebody asked a Census Bureau official about the New York counts. And the response I was not expecting was the official said, yeah, we're just a couple hundred people short of not losing a seat at all.

And my head blew up because one, that was incredible compared to what we were expecting. I mean not like ridiculous, not a sign of shady dealings. It's perfectly within the reasonable way in which like a weird census turns out. But that number that like there's – we're a couple hundred away.

infuriates me. It's like whenever you hear this number, do not trust these kinds of numbers, because the premise of that is that the Census Bureau somehow held all the other states at exactly their level and then added 500 people to New York state. But there's no plausible methodological change that would hold all the other states' populations the same while adding 500 people to New York, right? Anything that the Census Bureau did differently to count people would undoubtedly cause changes through all of the states' numbers. And so all we can tell from that number, or it was 81 was the number, I think it was like 81 people short, it's not like that there were 81 people who if they had just been counted would have like –
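Editor's illustration: the objection to "N people short of a seat" figures can be made concrete with a toy apportionment. The sketch below assumes the method of equal proportions (Huntington-Hill), the method used for U.S. House apportionment since the 1940s; the conversation itself does not name a method. The three states, their populations, and the 10-seat house are invented for illustration and are not census figures.

```python
import heapq
from math import sqrt


def apportion(populations, seats):
    """Assign seats with the method of equal proportions (Huntington-Hill)."""
    if seats < len(populations):
        raise ValueError("need at least one seat per state")
    # Every state starts with one seat.
    allocation = {state: 1 for state in populations}
    # A state's claim on its next seat is pop / sqrt(n * (n + 1)),
    # where n is the number of seats it already holds.
    # heapq is a min-heap, so priorities are stored negated.
    heap = [(-pop / sqrt(2), state) for state, pop in populations.items()]
    heapq.heapify(heap)
    for _ in range(seats - len(populations)):
        _, state = heapq.heappop(heap)
        allocation[state] += 1
        n = allocation[state]
        heapq.heappush(heap, (-populations[state] / sqrt(n * (n + 1)), state))
    return allocation


# Hypothetical populations, not real census figures.
toy = {"A": 1_000_000, "B": 600_000, "C": 250_000}
base = apportion(toy, 10)

# Counterfactual 1: add 9,000 people to C and hold A and B fixed.
c_only = apportion({**toy, "C": 259_000}, 10)

# Counterfactual 2: the same change in counting also raises A by 3.6%.
both = apportion({**toy, "A": 1_036_000, "C": 259_000}, 10)

print(base)    # {'A': 6, 'B': 3, 'C': 1}
print(c_only)  # {'A': 5, 'B': 3, 'C': 2}
print(both)    # {'A': 6, 'B': 3, 'C': 1}
```

In this toy case, state C looks "9,000 people short" of a second seat only if A and B are frozen. Once the same counting change also lifts A, the seat stays where it was, which is why a shortfall figure computed with every other state held fixed says little about what a different count would have produced.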

I mean, this is a political thing, too, though, because like in blue states, you saw a really big effort to get counted in the census. You know, we have those Wi-Fi hubs on the sidewalk in New York City that also now have like digital ads. And it was like celebrities being like, make sure you get counted, make sure you have a voice, whatever. Whereas other states, oftentimes red states, were not spending a lot of money trying to make sure that people were responding to the census, which is kind of, you know, like shooting yourself in the foot in some ways. But there is like a political aspect here in that New York was one of those places that was very diligent and other places were not quite as diligent.

Yes. And this is a point that, I mean, some people I think very reasonably are concerned about the idea that the amount of money you spend on this campaign

count might then factor into how well you were counted. Of course, there's a reason that New York State spends so much money on this count. It's because, to answer your other question, one of the primary things that the census does, apart from representation, is that it helps to apportion trillions of dollars of federal funds and

A lot of that has increasingly come to be essentially automated where Congress seldom now directly puts funds into various places. It says we're going to apportion these funds according to census –

both direct census counts and then the census estimates. The nice thing about that, and maybe that puts some listeners at ease a little bit, is that the census isn't the final say on this, right? So the Census Bureau doesn't want to overcount or undercount. It would like to get as accurate counts as possible.

And so it can use then subsequent data and modeling through the population estimates program to actually mean – to make it so that those numbers – they can make up for some of these miscounts. And as a result, when they do go to apportion funds in subsequent years, it's more accurate. At least we hope that it's more accurate. You bring up the idea though in your book – and I'm not saying that at all that this is what happened in New York –

There is such a thing as census padding and that this was especially maybe an issue back in the days of shadier census practices where it was all enumerators going door to door, which is literally making people up. What's the history of that? I mean, there's this like there's just a wonderful like whole vocabulary of census fraud that has shown up over the years. So padding is the general term.

The particular method by which an enumerator would do this had its own term as well, which was called curb stoning because the idea was the enumerator would go someplace, sit down on the curb and just start writing out the names of all these people that they're inventing. In its worst manifestations – so we had a number of these big events happening in the late 19th and early 20th century –

These would be coordinated events though. So it's not just like some lazy enumerator but rather it would be local boosters who were interested in trying to essentially draw business to their cities. And so they would pay a whole bunch of enumerators to invent a large subset of individuals. And you can now go back into the census records and we can find – I had a researcher, Ethan So, and he went through and I think it was in –

Maybe Wichita? I can't remember which place he went. But he went and found a number of people who he was like, oh, these are totally made up. This is definitely a whole bunch of – like a series of made up families. And this was all part of sort of corrupt machine politics of the time where even getting an enumerator job was like –

A cushy position or like some – a reward that local politicians would give out to their constituents? Yeah. I mean so because it's – I would hesitate to tie those two so closely together because just because it's a patronage job doesn't mean that they're necessarily making things up.

OK. Yeah, yeah, yeah.

And it's really like the history of the census, if anything it tells us, is that like when you have a mass mobilization, when you need hundreds of thousands of people to be involved in this thing, there's just no separation between politics and statistics. Like so it makes sense that throughout most of the census, there has been some mechanism by which Congress is encouraged to be bought into this process, whether it's by tying their representation to it or whether it's by saying, all right, we need you. We're going to give you the power, if your party is in control of the Congress, to decide who should be hired and who should then hire people to be these enumerators.

With the check that for the most part, these congresspeople want there to be an accurate count. If there isn't an accurate count or if they don't count all of their people, they might lose their seat. So there is a certain kind of, I mean, right? It could be an incentive for them to be inventing individuals, right? There is that possibility there. But then there are other mechanisms in place to try to catch and prevent that kind of fraud. Yeah.

You say that sort of in this way, statistics and politics are inextricably woven together in the census. And I think that that was a really important and interesting point because it is such an important statistical exercise that impacts business and governance and all kinds of things. One of the areas, we already sort of touched on this a little bit, where politics has gotten involved and been debated and whatever is what we actually ask people. So in terms of income, in terms of citizenship status, we used to ask 30 questions. In fact, in most of your book, you dig into the 1940 census because after 72 years, they reveal the sort of personalized data of the census. And it was the most recent census for which that had happened at the time you wrote the book. Now we have the 1950 data as well. But back then in 1940, they were asking 30 questions. Now we ask only 10.

Why? What happened? So one answer is there were new technological possibilities, right? So we're talking about how sampling came to be a possibility starting in the 1940s. Once it was possible to bring the casino into the census and start using probabilistic methods to take a full count of everybody but then like draw out samples, in that case in the 1940 census it was a 5% sample of the population.

In subsequent decades, the Census Bureau starts to move some questions out of what they increasingly call the short form and into what they call the long form, in which there's something like a 20 percent sample of the population that will be asked many more questions than everyone else. So some part of that is an idea that this decreases the burden on individuals and hopefully improves response rates.

The other thing that's happening is that there is a serious political move coming from both the left and the right in the 1960s and 1970s in favor of privacy and with concerns about privacy. And so as people start to say, we're not sure if we want the government to know this much about all of us, sampling becomes a very convenient method by which the Census Bureau can say, "All right, fine. We're going to secure the privacy of many people by simply not asking many of them these questions."

We see that dynamic play out with income. We were just talking about that. Income was very controversial in 1940 when it was asked of everybody. Not surprisingly, in 1950, it becomes a sampling question, so that only a fifth of all people are asked about their income, as a means of trying to tamp down on some of that. And citizenship is also a sampling question. Right. Yes. So they're trying to move many of these questions that can maybe be seen as more sensitive out of that form.
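The short-form and long-form split described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Census Bureau's actual procedure: the function names `draw_sample` and `estimate_total`, the synthetic incomes, and the use of simple random sampling are all hypothetical choices made for illustration. Everyone is still counted; only the sampled fraction is asked the sensitive question, and the sample answer is scaled up to estimate the whole.

```python
import random


def draw_sample(people, fraction, seed=0):
    """Pick a simple random sample covering `fraction` of `people`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    k = round(len(people) * fraction)
    return rng.sample(people, k)


def estimate_total(sample_values, fraction):
    """Scale a sample sum up to an estimate for the whole population."""
    return sum(sample_values) / fraction


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical incomes for 100,000 people; not real census data.
    incomes = [rng.randint(0, 5000) for _ in range(100_000)]

    # Only one fifth of people are asked the income question.
    long_form = draw_sample(incomes, 0.20)
    print("people asked about income:", len(long_form))
    print("estimated total income:", round(estimate_total(long_form, 0.20)))
    print("true total income:", sum(incomes))
```

The estimate will not match the true total exactly, which is the trade the passage describes: four fifths of people are never asked, at the cost of some sampling error.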

You're a podcast listener, and this is a podcast ad. Reach great listeners like yourself with podcast advertising from Libsyn Ads. Choose from hundreds of top podcasts offering host endorsements, or run a pre-produced ad like this one across thousands of shows to reach your target audience with Libsyn Ads. Go to LibsynAds.com now. That's L-I-B-S-Y-N-Ads.com.

We have become very used to, especially in, you know, on a podcast like this, at a website like FiveThirtyEight, categorizing people in certain ways, you know, by race, education, income. How does the census, you know, both the questions that we ask on the census and the fact that we have one to begin with, shape how we view ourselves? So this is one of the primary questions that I had in writing this book was we—

I mean, in a podcast like this, am I allowed to say Michel Foucault's name? Go for it. There are these theorists that have talked about the way in which populations' identities are shaped by the way in which states produce categories and impose those categories on people, and then through that, they come to know themselves differently. That's a little bit true, right? It is true that the census helps produce these categories. A category like white is actually a census invention in some ways, one that comes to incorporate different people at different periods and helps them to understand themselves through that process.

But one of the things that's like so interesting in the census because we get to like see every person's record after 72 years is that we can then also see the work that goes into like each person trying to figure out how do I fit in this system, right? The state might give us a series of categories.

But then it's between that person looking at the forms and, if there's an enumerator involved, the enumerator is there also. And they're essentially talking to one another and trying to figure out, how do I take my own particular situation and find a way to make it work for this form? So for me in this book, one of the things that got me really interested in this question was that, in part because of the concern about fraud, the census had this category for relationship. The theory was, well, it will be harder for enumerators to make up plausible households if they also have to make up plausible family structures. So there's a head of the household, and then every other person in the household has to have a relationship to that head.

The way it's structured, it works really well for a very straight household. It's assuming a male head of household. It's assuming that there's a woman who is the spouse. The punch card itself says head, and then it has WF for wife, right? So it's built into the structure that this is going to be a heteronormative household. So how did queer households, which did exist in the 1940s, work out and get fit into the census? I mean, there are a number of different answers, but one of them is that there was this kind of acceptable category, which appears in some of the instructions and not in others, called a partner, which turned out to be one of the mechanisms by which, in queer communities, people could identify themselves as part of a household and fit into the data system. Did they then, as a result of this, talk about themselves as partners in their communities?

I actually, I have no idea. Here we're talking about somewhat specific situations. The structure of the census obviously applied to the vast majority of people living at the time. And so when we do zoom back out and think about, okay, these are the ways that we're categorizing the mass public by race, by education, by income, by, you know, urban, rural, by population density. How does that shape how we view ourselves today?

Right. I mean, one of the reasons that we spend a lot of time on, and see a lot of advocacy around, race labels, for instance, is precisely because of what happens when this is all ultimately tabulated, outside of the question of us talking to an enumerator and figuring out how we're being labeled, right? Throughout most of these censuses, an enumerator would have looked at a person and ascribed to them a race and then written that down. But then when those get tabulated, those categories and the numbers associated with them shape the way that politicians look at their constituencies. It essentially makes people visible to their political representatives.

And therefore, one of the reasons we often see arguments about trying to more finely differentiate, for instance, racial groups, or arguments about trying to put sexual orientation and gender identity questions on the census, or these sorts of subdividing questions, is because that makes it easier to then actually get representation for a group, right? When you appear in the census and when your category exists in the census and when a table has numbers for you, you can prove to somebody with political power or somebody who sells goods to you, look, we are a market. We are voters. We are a constituency. Here are the numbers. You can see that we matter. How did Hispanic Americans come to be considered white?

In 1930, right, so there was a move to create a Mexican racial category for the first time. What happens, initially it's not clear this is particularly contentious.

Many of the people who are being labeled Mexican in the census in 1930 and afterwards are noticing a series of other trends happening around the same time. This is a time of Jim Crow segregation, right? So like throughout the South, there's strict racial segregation between white and black individuals.

So to be labeled not white, especially in the South, comes to have a real danger associated with it. And so this construction of a Mexican racial category in the 1930s comes to be contested by many Mexican-American groups. And Mexico itself, right? And ultimately they draw in the Mexican government as well, to say this can't happen, we need you to remove this category. And henceforth all people who had been labeled as racially Mexican become racially white. And that can only be understood as a kind of response to the horrors and terrors of the Jim Crow South, in which being outside of whiteness was essentially dangerous. How have the other racial and ethnic categories been developed over time and changed over time?

So, when we think about the first census, it had three categories drawn out for white people, I think a couple for men of different ages and one for women, and then free people of color and enslaved people. That's because the Constitution is built on this compromise in which enslaved people would count as three-fifths of individuals.

Then the white people are being differentiated because there's an interest in trying to figure out essentially the number of people who are of military age or of taxable age, of labor, of working age. But so it's building in that first kind of racial category, a white and other category. Yeah.

Very quickly, throughout the 19th century, that starts to proliferate into a whole variety of different racial categories, which track the changing racial politics of the time. So as there is an increasing fear about interracial relationships, and an increasing desire to surveil people in the Jim Crow system, we have this differentiation of black into multiple different groups, mulatto and quadroon and these sorts of categories.

Later on, we see the introduction of other categories. In 1940, the categories on the census included Hindu as a racial category, and Japanese and Chinese are listed as racial categories as well. And they end up being quite consequential, right? With the 1940 census, one of the things that happens is that, when the United States enters World War II, the Census Bureau wants to make itself useful. And so one of the early things it does is produce fairly fine-grained tabulations of German and Italian immigrants, but then also of anyone of Japanese racial heritage, especially on the West Coast, which then facilitates the movement and incarceration of Japanese people as part of World War II.

So it sounds like there's a real push and pull here between wanting special categorization so that you can prove your numbers or your political power or whatever, but also then not wanting it in other circumstances so that you can avoid being targeted by the government in some way or another.

One way that we often talk about the story is that for the first 150 years of the census, this was a data system that was very clearly in place to enforce white supremacy. And so these counts are built in a way to make sure that the white population continues to have a predominant place.

From the 1950s, 1960s on, because of the efforts of civil rights activists and all kinds of other politics that are happening, there was a move to turn this around and make representation and visibility a tool for gaining political power. So that's part of why this thing, in which for a long time visibility was maybe particularly dangerous, has changed. Now visibility has increasingly come to be seen as necessary to be able to fight against the hundreds of years of discrimination that came before.

What are ways in which the census could be different now? What are the different categories that we might include on the census that would shape how we understand the country or talk about the country in politics? So, I mean, one thing that's under consideration right now is new guidelines which would join the ethnic and racial categories into a single question, which people have long thought about. And there are arguments from all kinds of different directions as to why it might or might not make sense, and fears that by merging those two questions, some people would then pick only one of these boxes.

One of the tricks here is that under the current racial system, you can check multiple boxes. But savvy census watchers look at this and say, oh, I might do that, but I also know that when people ultimately publish these tables, they're often going to put people who check multiple boxes under a two or more boxes umbrella and not show me as representing one of these other racial groups to which I belong. So some of the things ahead of us might be arguments about how to consolidate some of these questions.

Certainly another thing that people are talking about a lot is the sexual orientation and gender identity questions, which it looks very much like we might end up having in the future. What that calls forth, when we think about the way in which there are increasingly laws that can make it dangerous to have a trans child, for instance, in some states, is that it's going to make confidentiality of census questions, census results, a continued and really serious concern. Because once we have sexual orientation and gender identity questions, which I think should be there, to get people to answer honestly, they're going to have to really believe that these are going to be confidential responses.

Yeah, I got that push and pull. You write in your book that to find the stories in the data, we must widen our lens to take in not only the numbers but also the process that generated those numbers.

In doing that, what's the takeaway that, you know, I mean, one takeaway could be that like, it's not perfect. And, you know, New York really screwed the pooch in 2020. But like, what do you hope people take away from understanding that data sets have human created flaws and have human biases? Like, is that what you want people to take away? Like what once we know that, what should we do with that information?

So, three answers to that. The first one: if you are a historian, then one of the things you take away from this is that this data set, which looks like it might just be a table of numbers, is also something we can read like a story telling us about the society that produced it, right? So it's not just that it tells us through the tables about what America was like. We can look at every stage in the production of this data, and it tells us about the values and about the way in which that society operated. So in history, you would call this historiography. In data, you would call it like...

Data – datography? Datography. That sounds like a new art form. Da-da, da-da, datography. Right. So that's – I mean one way and I suspect like not all listeners are going to be particularly keen about that one way of approaching this. But the other two ways which I think are really important – I mean one, you kind of alluded to this. I do think this can act as a kind of inoculation against data doubt.

We know there's this consistent method by which people who want to foment doubt show people very ordinary data practices and go, ah! And one way of getting people to be more resistant to that kind of method is to show them data sets that they can and should trust, like the Census Bureau's data sets, like the census, and say, look, this is what data looks like. It is messy. Right.

I want you to get used to the mess of it, so that when you encounter the mess in the rest of your life, you recognize it. Sometimes it does deserve to be shrieked at. But there's a difference between shrieking at it and refusing to use it to make policy decisions. And then the final thing is that I think it's important to recognize this uncertainty buried deep in the data, because then it can shape how it is we rely on it. So one of the big moves, since 1920 in apportionment and since the 1970s in terms of federal funds, has been automatizing the use of this data, where the Census Bureau produces numbers that are seemingly very precise, and then that seeming precision is used to allocate political power and allocate funds. And I just don't think that's a great idea. I mean, I think we need to recognize –

What's the alternative? One alternative is to find ways to build in some allowance for that variance. So in apportionment, which is the way I've thought about this the most –

The old mechanism by which one dealt with the fact that these were uncertain numbers was by essentially allowing the pot to grow, right? So you would increase the number of seats in the House so that fewer states, or no states, would be penalized by losing a seat. And others would grow as well, with the benefit being that there's also more representation. I'm skeptical of this, because ultimately it is a zero-sum game, right? You're going to have proportionally more or less representation than other states, than other people. I don't see how growing the number of seats in the House will change the fact that if there are inaccuracies in the census, there will be inaccuracies in representation. So you can't get rid of it, right? You can never get rid of any of these things.

At least you're not taking away seats from a place and also taking away representation from people by – So emotionally it might be easier to stomach. But like functionally, is it any different? Not even just emotionally, right? Like –

This is about how close you are to your representative. And if there's a miscount and a state loses a representative, suddenly the remaining representatives have bigger districts that they're trying to cover.

And this kind of continues to compound. If we are increasing the size of the House, it might be that we're still wrong and that there are still errors here. But at least the size of the district is decreasing. There's a better chance that you have some kind of grasp on your representative and on being represented. So it's like not adding the insult of an expanding district to the injury of a miscount. Yeah.
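The exchange above can be made concrete with a small apportionment sketch. It assumes the method of equal proportions (Huntington-Hill), the formula used for the U.S. House since the 1940s; the speakers do not name a method, and the state names and population counts below are hypothetical. The function name `apportion` is also an assumption of this sketch. Each state starts with one seat, and every remaining seat goes to the state with the highest priority value, population divided by sqrt(n(n+1)), where n is the state's current seat count.

```python
import heapq
from math import sqrt


def apportion(populations, seats):
    """Assign `seats` to states by the method of equal proportions."""
    if seats < len(populations):
        raise ValueError("need at least one seat per state")
    alloc = {state: 1 for state in populations}  # every state gets one seat
    # Max-heap via negated priorities; priority for a 2nd seat is pop / sqrt(1 * 2).
    heap = [(-pop / sqrt(2), state) for state, pop in populations.items()]
    heapq.heapify(heap)
    for _ in range(seats - len(populations)):
        _, state = heapq.heappop(heap)
        alloc[state] += 1
        n = alloc[state]
        # Priority for this state's next seat.
        heapq.heappush(heap, (-populations[state] / sqrt(n * (n + 1)), state))
    return alloc


if __name__ == "__main__":
    counts = {"A": 9000, "B": 5000, "C": 1000}  # hypothetical census counts
    for house_size in (10, 12):
        result = apportion(counts, house_size)
        shares = {s: round(n / house_size, 3) for s, n in result.items()}
        print(house_size, result, shares)
```

Because seats are handed out one at a time, a larger House never gives any state fewer seats than a smaller House would with the same counts, which is the guest's point about districts not expanding. Each state's share of the chamber is still relative to every other state's, which is the host's zero-sum objection: an error in the counts still shows up in the shares.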

Okay, interesting. Interesting. I have a final question here, which is a little hokey, but I saw on your website that in 2023, you are teaching a course called the history of numbers in America. What is the best number in America? Or what is a uniquely American number? What is a uniquely American number? What's the best number in America?

I mean, I could say 538, right? I mean, why would you not? That's really the answer I was looking for. I'm going to go with 538. Are you going to include that in your course? Oh, yeah. Just so we don't overindulge here, I mean, what is the lesson of the history of numbers in America? Is it about how statistics has evolved over time?

So I've done it different ways at different times. Sometimes I'll take a series of different kinds of influential numbers and look at them. So we'll look at the history of the body mass index or credit rating systems or things like apportionment numbers, and we'll try to trace their histories. Sometimes I'll do it more as a long history, starting with Alexander von Humboldt wandering around very closely measuring... Creating data science, right? Yeah, creating data science, sure. Wandering around doing precise measurements and these interesting data visualizations in the 18th century, straight through to Moneyball, and then trying to think about how and why at different moments and different times we have new institutions committed to doing quantification.

But I mean, that's really what this is about. I always treat the history of numbers as a kind of cultural history. And so I'm interested in how and why institutions produce numbers. What do those mean? Yeah. I mean, in reading your book, or again, listening to your book, it got me thinking, OK.

In America today, I know what the average height of a male is, so I can compare myself to that average. You mentioned body mass index; I can compare myself to the average on that. There are all these different ways, and especially on partisanship, we're always talking about how there's this many Republicans and this many Democrats, and this many people believe this and that many people believe the other thing. In a world where that kind of data doesn't exist, how do we develop our sense of something? Do you just look around and say, well, I'm not quite as tall as most people here? And do you understand yourself to be a short person in a non-statistical way? What is the impact of all of the different ways that we have now come to quantify people?

Yeah, no, that's interesting. I mean, one way of reframing that, right, is, how do we move from a norm being set by statuary to a norm being set by statistics? How is it that you go from thinking, well, a proper height is the height of that statue of David over there, to a proper height set by statistics?

And to that point, for statuary, I think it was in the 1890s, at one of these world's fairs, that anthropologists went and measured, right, a whole bunch of Penn students or something, a bunch of elite college students, took those measurements, averaged them, and generated from them statues which were the ideal Americans, and put them on exhibit so that people could go and stand next to them and see how they fit and compared to these, again, white, privileged, elite students who then were the models of what Americans should look like. So there is a distinct set of processes trying to teach people to understand themselves and to think about themselves in relationship to these statistical categories, which is also kind of weird, right? Especially when you think about it, because statistics are supposed to be group things.

One of the things that I got most interested in, thinking about life insurance, was this move towards trying to measure individuals against statistical norms, because you're not actually supposed to do that, right? Statistics only work for groups. They're not supposed to work for individuals. No individual should look like the statistics. And yet we try to take individuals and measure them that way, and we do this now all the time in society, but it's kind of a weird move.

Interesting. I mean, maybe one way that that is seen, and I'll just use this example because it's one of the areas I know best, is in political coverage. When you ask a whole bunch of people where they fall on a certain set of policy questions, you're going to find that there's a bunch of people who land in a moderate category, because lots of Americans don't have left-right ideologically consistent views, because a lot of policies don't necessarily have anything to do with each other. Whether you support X policy on healthcare and Y policy on immigration may have nothing to do with each other. And so Americans who aren't tapped into the MSNBC, the Fox, whatever, do have these ideologically inconsistent views, and they get categorized as moderate. But they really do have specific views on specific issues; it just comes to look like they're an equal combination of left and right. And then in the media, the way that gets reported is that there are all of these moderate voters who don't have an extreme view on this, but fall right in the middle. They want the average between the Republican policy on immigration and the Democratic policy on immigration. They want the average between the Republican policy on healthcare and the Democratic policy on healthcare.

Then we go out in the world and try to find such people or whatever. Right, exactly. We've invented these moderates who are in fact not a single coherent group. What it makes me think of too is this thing we often hear about how, when we think about the way the economy affects elections, and I probably heard this on this podcast, right, people are partly looking at how am I doing in my life. But often, right, it's the GDP. It's these big influential political numbers that shape people's perceptions of what the economy is, and that drives voting as much or more than even their individual circumstances. Well, it's certainly the case with crime.

Which is why you see so often, like, are you worried about crime? Yes. Are you worried about crime in your neighborhood? No. Yeah. Well, we could talk forever. This has been a thoroughly nerdy podcast. I hope listeners have enjoyed it nonetheless or specifically because of it. But we're going to leave it there for today. So thank you so much, Dan. All right. Thank you. My name is Galen Druk. Kevin Ryder and Anna Rothschild are in the control room. Tony Chow is on video editing and Chadwick Matlin is our editorial director.

You can get in touch by emailing us at podcasts@fivethirtyeight.com. You can also, of course, tweet at us with any questions or comments. If you're a fan of the show, leave us a rating or review in the Apple Podcast Store or tell someone about us. Thanks for listening, and we will see you soon. Bye.