
A new data infrastructure for the social sciences?

2025/6/4
logo of podcast LSE: Public lectures and events

LSE: Public lectures and events

People
David B Grusky
Mike Savage
Topics
David B Grusky: The natural sciences care deeply about measurement and are committed to innovating measurement tools. By contrast, I think the social sciences are more focused on conceptual innovation. What would happen if the social sciences attended to measurement problems the way the natural sciences do? I propose a thought experiment: what if our commitment to building a social observatory matched our commitment to building a space observatory? I will discuss how the measurement infrastructure in the United States falls short and what we can do about it. If a new infrastructure were actually built, it would solve some problems and bring on some new ones. I will first present the system in its most stark form, and then discuss how a revised infrastructure could address those problems.


Transcript


Welcome to the LSE events podcast by the London School of Economics and Political Science. Get ready to hear from some of the most influential international figures in the social sciences. Welcome to the London School of Economics. It's great to see you tonight. I am Mike Savage. I'm Emeritus Professor of Sociology at the LSE and I'm a Professorial Research Fellow at the International Inequalities Institute.

And I'm really pleased to welcome David Grusky to this event tonight. David is the Edward Ames Edmonds Professor in the School of Humanities and Sciences, Professor of Sociology, and Director of the Center on Poverty and Inequality at Stanford University. As many of us know, he's one of the leading American sociologists of inequality, poverty, and stratification.

His publications are manifold. What stands out to me is his willingness to think about American research in a much bigger global context. As well as making major contributions to studying poverty in the US, he's been very concerned with its multidimensional aspects, thinking about questions of gender. He's contributed to debates we've had about social class, questioning some of the ways in which British sociologists have understood class. So he's really been prepared to think big

and think globally. And today he's going to be presenting this very bold vision about a new data infrastructure for the social sciences, question mark. So we're looking forward, he's going to speak for about an hour and that will leave about half an hour for questions and discussion.

If you want to join this conversation on social media, the hashtag is there. I know we don't really use X anymore, do we? But whatever your preferred platform is, you can tweet, or whatever the appropriate expression is, using that hashtag. At the end, there'll be a chance for questions, both from people in the room and from the big online audience; I've got an iPad which seamlessly feeds those questions to me. So, OK, without any further ado, welcome David Grusky. APPLAUSE

Thank you, Mike, for the generous introduction. It's nice to see some friends out there in these times. I'll get right to it. I'm evidently talking about a telescope. I put this up there because, to my mind, one of the most striking differences between the natural and social sciences is that in the natural sciences, they care deeply about measurement.

and are very committed to innovating when it comes to tools for measurement. And by contrast in the social sciences, I think we're a bit more interested in conceptual innovation. One might even say we fetishize that type of innovation. So I want to ask here, what would transpire if we attended to measurement issues to the extent that natural scientists attend to them? So here's my thought experiment.

What if our commitment to building a social observatory, an observatory where we really take seriously our commitment to measuring what's happening out there in the world, what if our commitment to building a social observatory matched our commitment to building a space observatory? Like the James Webb Space Telescope, approximately $10 billion.

So here's the game plan. Here's what I'm going to do in this talk. I'm going to talk about how our measurement infrastructure in the U.S. falls short. I think it's similar to what's the case in other countries, but I'll focus on the country I know. So: how it falls short, why it's falling short, and what we can do about it. And then after laying that out, I'll describe the consequences of this new infrastructure, if indeed we were to build it.

It would solve some problems, and it would bring on some new problems. I should warn you that this is not perhaps the typical talk that I would give. It will start off normal and standard, and maybe boring, and then there's going to be a curveball. It's going to go in a direction you might not have thought, and you're going to be worried about that direction. That's why I'm going to talk about the problems that this new infrastructure would engender. It would entail problems, and I'm going to

talk about how we could address them with revised infrastructure, but I'm going to first lay it out in its most stark form. Okay, but problem first. What are the problems we're trying to solve? In what ways does the current measurement infrastructure fall short? First problem is that we rely too much on studies that are potentially unrepresentative. So I want you to think about

the kinds of data that we need to make the big decisions within the U.S. that affect the lives of everyone. So I'm referring, for example, to decisions about how to balance, say, our commitment to keeping unemployment as low as possible and inflation as low as possible. How do we match, how do we deal with those competing sorts of objectives? Well, we certainly rely heavily on data to make those decisions, decisions that affect the everyday lives of all Americans.

We rely in particular on the Current Population Survey, but some of these other surveys as well, to make those decisions and many other decisions that are kind of the fundament of U.S. policy. And the big point that I want to emphasize here is that response rates to those critical surveys are declining often quite precipitously. Now, of course,

We don't care about response rates in and of themselves. We care about whether or not the parameters that we want to estimate are biased, right? And there are ways to deal with declines in response rate, ways to ensure that we're still getting unbiased estimates of the parameters we care about, things like adaptive survey design and so forth. But my point here is that they are very expensive.

And we have no reason to believe that they will continue to work as response rates continue to fall. So because it's so expensive to try to deal with non-response in the context of a probability sample, increasingly social scientists are turning to non-probability sampling. This chart here is put out by Pew Research Center, and it's all about going to all the survey houses in the U.S. and figuring out what kinds of

sampling they're undertaking. And increasingly, what we see in the orange here is that they're relying on non-probability sampling, in particular online opt-in. Now, why do we care? We care because they further went on to show, and it's difficult to show this, but they did the best they could, they further went on to show that when you rely on non-probability sampling, as survey houses increasingly do,

you're less likely to nail the parameter of interest. There's more bias. And so that comes, that decision, it's cheaper, but that decision comes with a cost in terms of informing our policy in the ways that we want to inform it, in terms of getting the science done that we want to get done.

So that's the first problem. We're relying on surveys that are facing either increasing non-response or because of that we're relying on non-probability samples that are problematic in their own ways. Second problem, our studies are often underpowered. We seem to love small samples.

especially, you know, in the social sciences and in psychology and sociology in particular, we often rely on small samples and that means our studies are underpowered and our results are often inconclusive. Now part of the problem is that we

have built a culture within, say, sociology in which we take it as kind of a rite of passage that you collect your own data, that you own those data. And that's seen as a fundamental part of being a scholar. But this privatization of the means of scientific production has implications in terms of the power of our analyses and the extent to which the results are conclusive. So the small sample problem is another problem about which we should worry.

Here's problem number three, the slow science problem. We live in a world that's crisis rich. That's clearly the case. And in that crisis rich world, we need to understand how people are responding, what they're feeling, what they're thinking, what they're doing as these crises cascade through the system. I'm referring, for example, to distributional crises.

rising inequality, profound increases in inequality on a lot of dimensions, rapid climate change, failing governance, out-of-control technological change. Lots is happening, and the population is responding in ways that we might not be able to understand unless we engage in real-time monitoring, and yet we don't have the capacity to do that well. And if we fail to do it? Here's a failure we have to own: the failure to detect rising populism.

A staggering amount of damage was caused because we didn't see it and respond accordingly. Staggering. I submit that it's worth doing the monitoring we need to do in order to get our job done as social scientists. We've failed. So we need tools for discovery in this crisis-rich period. We don't know what is going to happen as further crises course through the system, in part because we've failed in our job of monitoring in the past.

And so we can't rely on omniscient survey designers to ask the questions that need to be asked in order to understand where the population is. That's too hard. We need to have tools for discovery, right? That don't require omniscience, but just let us find out what's happening. We haven't done that. We need to. So here's the fourth problem that I'd like to point out, and that's an excessive division of labor when it comes to data collection.

It's a very fragmented data environment. The ICPSR has over 21,000 studies and the cross-study division of labor in terms of method and topic is formidable. So we have studies, surveys,

specializing in family and child well-being, social and economic mobility, substance use and abuse, values, attitudes, and opinions, social networks, aging and retirement, and on and on and on. What's the cost of this fragmentation by topic and method? It's very hard to understand and examine cross-domain processes. We understand what's happening in silos, but you need to be able to put it all together, and that means some omnibus studies are in order.

This fragmentation is especially lethal when it's combined with privatization. So the original data owners, that's us, the people, right? The original data owners have those data appropriated by government agencies, by survey houses, by corporations, and by individual scientists. And we're reluctant, once appropriating the data from the people, we're reluctant to share them, right?

So if you have fragmentation and privatization, you have a problem in terms of understanding what's happening because we have privatized ownership of the people's data. So I've shared four problems with you. What's the fallout of these four problems? I would say slow accumulation and abject scientific failures, like, for example, the failure to detect the rise of populism swiftly enough.

Now the typical diagnosis of the sources of this problem seems to be the open science diagnosis. That is, our science is not working well, we need to open it up, we need to pre-register, we need to have normative guardrails that

prevent the bad apples, the data fabricators from doing their bad work. And if we set up this normative and regulative system more successfully, we could make sure that science plays out as it should. And it's hard to object to setting up a more rigorous set of norms guiding scientific production, absolutely.

but I think what's also in play, and what I've tried to argue, is that there is a host of data problems that also account for our failures, slow accumulation, and other shortcomings of scientific production. And in fact, some of these data problems enable that: if you have very small samples, that allows, indeed encourages, p-hacking. And we need to own up to how our data environment

is causing some of the problems, and that we ought to try to fix the data environment. It's not a given. It didn't have to come down the way it has come down. Hopefully we have some agency, some capacity to build a better data environment. Okay. So why does our data infrastructure fall short on so many fronts? I think it's money. We've undercapitalized science.

In the US we decided to pull back even more, but even before that it was deeply undercapitalized, at least as against the standards of capitalization that we have for the natural sciences. So again, here's the question: what if our commitment to building a social observatory, one that could monitor what's happening in the world, what if our commitment matched our commitment to building a space telescope?

capitalized at about 10 billion. I don't begrudge one of those 10 billion dollars. I think that space telescope is immensely valuable. But I think we should have an equal commitment to building a social observatory that can protect us against the problems that arise when we don't monitor well. I've talked a lot about the James Webb telescope, but there are other projects in play within the natural science world that are even more expensive. I think part of the problem

is that social scientists don't dream big. They don't think there's any possibility that those dreams will be realized, and that becomes then a self-fulfilling prophecy. So I'm encouraging us to try to think about what kind of monitoring infrastructure we need if there weren't capital constraints. It may not happen in the US, but it could happen somewhere. And try to then make those dreams come true rather than just giving up from the get-go.

So here's a dream. In the end, I'm going to say it's a bad dream and I'm going to revise it. But I want to lay out the dream first and then we can talk about how it should be revised. And so this dream is what I'm going to tag a dynamic data system. And it would solve our infrastructural problems by providing instantaneous population level measurements of attitudes, behaviors, and social processes.

So the core idea is to build an agent-based network with as many agents as there are individuals in the US, 260 million adults, and then allow those agents to come alive, to act in accord with behavioral blueprints, and to interact with one another. You can think of it as a simulation, but hopefully one deeply resonant with what's happening in the world. So I'm going to talk about the threats that this dynamic data system would imply, the costs of that innovation, and I'll do so

throughout the balance of the talk, but in the main I'm just going to lay out the system without talking about the threats. And you're going to say, "Oh my god." Because the threats are going to come through loud and clear. And then we'll talk about how to address those threats. But I first want to lay it out. So let that happen. And then we'll talk about how we can address the threats that would arise if this were actually built out in the way that I'm describing.

Okay, so there are four steps that are entailed in building out this dynamic data system. Not four threats, four steps. The first one is pretty standard. It's just about exploiting existing administrative data to build a population spine for the full population. Standard-issue stuff. The next three steps are less standard issue. Step two is about implementing new protocols for sharing data. That's going to be hard to do, and it

might be worrying in some of its implications. The next two steps are even more worrying. It's activating the data system through behavioral blueprints and then allowing the agents to interact. I'll walk you through each of those four steps. I should say though, before I even get into those four steps,

All of the analysis of the resulting data infrastructure would have to occur within very secure facilities like federal statistical research data centers, which we already have in the U.S. and have yet to have any major breaches. So they're one of the great achievements of government, extraordinary achievements that have enabled

Very important research, and this kind of system would have to reside in a secure setting such as a federal statistical research data center. Okay, let's get to it. Four steps. Step one is to build the population spine. And that's no more than simply linking existing administrative data for the full population of approximately 260 million adults in the U.S. Already underway.

I've been working alongside many, many other people to help build this system.

It involves linking tax, census, and other administrative data and making them available to qualified researchers in secure federal statistical research data centers. Some of the most important research in the United States is happening with these linked data. What we know about, say, social mobility in the U.S. comes out of those linked data, by and large. There are other sources of what we know as well, but some of the most important results of late have come out of

those linked administrative data, and much other research exploits these data as well. It's one of the great achievements. But that work has to be put on steroids: increasing opportunities for linkage, making it possible for more scholars to have access to these data, all without relaxing in any way our commitment to confidentiality, which the Federal Statistical Research Data Centers have done a good job of protecting to date. So: continue this good work.
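The linkage at the heart of step one amounts to joining per-person records from different agencies on a shared, de-identified person key. Here's a minimal Python sketch of that idea; the field names, the "PIK" key, and the keep-first merge rule are all hypothetical illustrations of mine, not the real FSRDC schema or process:

```python
# Sketch of building a "population spine" by joining administrative
# records from several sources on a shared, de-identified person ID.
# All field names and records are hypothetical illustrations.

def build_spine(*sources):
    """Merge per-person records from several administrative sources.

    Each source maps a protected person key (here called a PIK) to a
    dict of attributes. Later sources add fields; on conflict, the
    first-seen value wins, so provenance stays predictable.
    """
    spine = {}
    for source in sources:
        for pik, record in source.items():
            entry = spine.setdefault(pik, {})
            for field, value in record.items():
                entry.setdefault(field, value)
    return spine

# Hypothetical toy inputs: a tax extract and a census extract.
tax = {"P001": {"income": 52000}, "P002": {"income": 31000}}
census = {"P001": {"age": 44, "state": "CA"}, "P003": {"age": 29}}

spine = build_spine(tax, census)
# P001 carries both income and census attributes; P002 and P003
# appear with partial coverage, as real linked data typically do.
```

The point of the sketch is just the shape of the operation: one row per person, widened as each new administrative source is linked in.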

Here's a little bit of an aside about some worries even with this existing type of linkage effort. I haven't even gotten into the nasty stuff, and already there are some worries. The good news, as I've already stressed, is that the foregoing FSRDC-based initiative has successfully protected confidentiality while supporting critical social science research. No major breaches, all good, great research getting done,

confidentiality protected. That's the good news. The bad news: there's a separate effort, via the data science company Palantir in collaboration with the federal government, to build another linked data system, a separate, independent linked data system. It's ostensibly to address national security threats, but in the end it's a surveillance system. So what is this system all about? First off, you're going to declassify data

that are classified to make it possible for them to be exploited for this purpose. And secondly, you're giving access to these data to agency heads. So we now have a two-system setup. We have a Palantir-fueled surveillance system and a separate census-based research system. Now, this is kind of an important point from my point of view, but maybe it doesn't matter. The research system didn't enable, I would argue, the surveillance system.

It was Trump's executive order that established this entirely new data linkage initiative. So you could say we should defund the research system so as to protect against the surveillance system, but they're really quite independent systems. It wasn't that one was just taken over for purposes of surveillance, but a new one was built, and it's being built with the very best data science we have that's embedded mainly in that famous company a couple blocks away from where I live, Palantir.

Although you might say that we should still do research and get good stuff done with linked administrative data, and I think that would be true. There's no point in just stopping doing that. We'll still have surveillance no matter what. But another question that does need to be asked and is relevant for what we'll be talking about later on is whether or not any of the efforts that are happening within the research system might yield results

tools and insights that could be applied within the surveillance system. So you have to worry about those kinds of spillovers in a way that before you wouldn't have had to. Now you might argue that Palantir already has all the data science firepower it would ever need, and that's probably true. There are extraordinary folks over there. A lot of resignations, but there are a lot of people willing to do this work. Okay. So already some problems even with our existing data linkage system in terms of malign

actors using it, not the research system, but a similar system that's now being set up for surveillance purposes. But let's go on. Let's go on to step two. So step one is just, you know, kind of put on steroids the existing data linkage system used for research purposes. What we also need to do is exploit survey data. And that requires more, so we need more than just administrative data, but there's a problem in that survey data are usually collected, say, by survey houses, they're used for one-off purposes, and there's no sharing.

And so that means that respondents who agree to participate in those surveys are being, one might say, exploited and abused in the sense that their data aren't used to the full capacity that they should be used. And that's why we have so many surveys. People keep getting asked the same questions over and over again. So you would want to...

take advantage of such surveys as one has and bring them into this infrastructure. And this already happens again. Just as we already link administrative data, we link survey data in some cases to those administrative data, but more of that would need to happen under this ideal system that I'm describing. We have hundreds of relevant government surveys, thousands of proprietary surveys,

We don't, of course, have them for everyone in the country, far from it, but nonetheless they are very valuable for the purposes of calibrating the data system that I'll be discussing and for updating it. So you'd want to take advantage of those data. So you have to incentivize survey houses to contribute the data to this larger collective infrastructure. Right now they're hoarded by survey houses for one-off purposes. So you need a new social compact. It might be as follows. You could say to a survey house,

If you agree to share your data with this collective infrastructure, in exchange you can have access to the data that are being collected through the administrative data system.

That would be of immense value to survey houses because they wouldn't have to recollect data that are already known. They would focus instead on their value add data. Now, of course, if you did this, there would have to be a new type of consent agreement. Of course, respondents would have to be informed that these data would be contributed to this collective infrastructure, and they would have to be okay with that. But I do believe survey houses would take this deal because it would make sense.

for a much more efficient data collection effort. And it would end the data throwaway debacle, replace the single study data collection effort that yields so many surveys with people being asked the same questions over and over again with a data sharing commitment, thus respecting the contributions of respondents. Okay, so that's step two, incentivize data sharing so we don't have to carry out so many surveys. They can be more targeted to the value-add data we need. The next steps, steps three and four, are the most critical ones.

Let's get to them. They entail activating the agents and allowing the agents to interact. So far it's just conventional linkage of administrative data and linking also with survey data. We already do that. It's just doing more of it. The next step though is a bit more in the land of secret sauce. It's activating the data system through qualitative data and large language models.

So this rests on new and still emerging evidence that qualitative data, and in particular American Voices project data, can provide dark matter infused behavioral blueprints that can be used to activate these agents. I'll talk more about this, but before I do that, let's take a look at the AVP just so you understand where the data are coming from. So it's the largest representative omnibus qualitative study in the US. It features two to three hour immersive conversations.

The protocol covers key life domains, the full arc of one's life, turning points, crises, family life, friends, employment, health and mental health, and much more. It leaves off with the prompt, "Tell me the story of your life," and then a host of similar engaging prompts.

In my reading of the transcripts, it's often a therapeutic and cathartic experience. It's not like the survey ordeal to which many of us have been exposed. It's an engaging conversation. So how would a dynamic data system use the AVP? And the answer is that each AVP transcript serves as a behavioral blueprint that tells us how the respondent thinks and acts. It captures the dark matter that activates the agent and

allows us then to successfully represent the respondent through that agent. Now, this dark matter claim has long been made, but there's now evidence on behalf of it. I want to describe some of that evidence. It's already in a TED talk. You probably have heard of it, or maybe not, but let me lay it out. I'm going to lay out a paper by Park, Zou, Shaw, Hill, Cai, Ringel Morris, Willer, Liang, and Bernstein. This is what they did.

They took an AVP transcript. Now, I should say from the get-go: not the American Voices Project data. They asked us to use those data; we looked at our consent agreement, and we did not think it supported the uses to which they wanted to put the data. So we said we could not share the AVP data with them, but we were willing to share the AVP protocol. And they did use that protocol, and they collected an entirely new data set in which

they did inform the respondents about how they would like to use the data. And so they had a consent agreement that was consistent with their uses. Okay, so now they have these new AVP data. Not our data, but data with a consent agreement that supports the uses to which they wanted to put the data. They took an AVP transcript and fed it into a local LLM, a local LLM, so it didn't expose the respondents to any threat.

of compromising their confidentiality. And they told the local LLM to basically act like this person after processing the transcript. So now you have an LLM acting like this person, and you can pose questions to that LLM that's been instructed to act like the person represented in the transcript.

and they asked some questions that weren't directly embedded in the transcript itself, that is, ones that would require some extrapolation from the contents of that transcript. They asked questions of the LLM that was posing as the respondent. Then they went back to the original respondent from whom the transcript came and posed the very same question and compared the responses. What they found was a very high correspondence between what the LLM delivered when posing as the respondent and what the respondent themselves delivered.
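The evaluation loop just described, condition a model on the transcript, pose questions, and score agreement with the respondent's own answers, can be sketched in a few lines of Python. The prompt wording, function names, and the stub standing in for a local LLM call are all assumptions of mine; the actual study's pipeline is not detailed in this talk:

```python
# Sketch of the interview-as-persona evaluation described above.
# The model call is an injected function, so the shape of the
# pipeline is runnable without any actual LLM. Prompts and names
# are illustrative assumptions, not the study's real code.

def persona_prompt(transcript: str, question: str) -> str:
    """Condition the model on the transcript, then pose a question."""
    return (
        "You are role-playing the person whose interview follows.\n"
        f"--- interview ---\n{transcript}\n--- end interview ---\n"
        f"Answer as that person would: {question}"
    )

def agreement_rate(ask_model, transcript, questions, human_answers):
    """Fraction of questions where the persona matches the human."""
    hits = 0
    for question, truth in zip(questions, human_answers):
        if ask_model(persona_prompt(transcript, question)) == truth:
            hits += 1
    return hits / len(questions)

# Stub "model" keyed off the prompt text, standing in for a local
# LLM completion so the example runs self-contained.
def stub_model(prompt: str) -> str:
    return "agree" if "community" in prompt else "disagree"

rate = agreement_rate(
    stub_model,
    transcript="(two-to-three-hour immersive interview text)",
    questions=["Do you trust your community?", "Is the economy fair?"],
    human_answers=["agree", "disagree"],
)
```

In the real study, of course, the model is a genuine local LLM and the comparison is against the respondent's re-asked answers, which is what makes the high correspondence striking.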

Now you might ask: can an equally high-quality twin be built with quantitative data? And their answer, and actually it's been replicated with many other studies since then, is no. Survey data don't contain, so it seems, the dark matter that makes it possible for an LLM to act successfully like the human the transcript represents. So here's the performance scorecard. If you give the LLM just survey data and

something similar to what administrative data would provide, the performance that it can deliver is pretty inadequate. There isn't any dark matter in those data. But if you give the LLM AVP data,

You get much better performance. Now if you add in survey and administrative data to the AVP data, their performance doesn't improve much. So it seems like all of the dark matter is in the qualitative data. It's quite extraordinary. Qualitative researchers have said for a long time that there's dark matter. A lot of people say, really? Is there really dark matter? Well, it seems like there is.

If by this we mean that an LLM, when provided with that transcript, can do a pretty good job of predicting what the responses of the person would be. So you could imagine building activated agents, 260 million of them, in which some of the agents would be really high quality because you would have AVP data for them. Other agents would be lower quality because you just have survey or administrative data for them.

That's step three. You're activating the agents with behavioral blueprints. Step four, we want to make it an interactive data system. So far we don't allow these people to interact and affect one another, but let's do that now. And you could do that by overlaying a network structure on these agents. You could, for example, use cell phone data in the US, which indicate when people are coming into contact with one another through their pings.

You could use that type of network structure, but there are others as well, to approximate the way in which people are interacting with one another in the US. And there are ways to fill in missing holes within the national network structure and actually overlay on the entirety of the adult population something that's pretty close to the ways in which those people are interacting with one another.
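Once a contact network is overlaid on the agents, the diffusion processes mentioned here can be run directly on it. A minimal sketch, with a toy adjacency structure and a deliberately simple "every contact transmits" rule, both assumptions of mine rather than anything proposed in the talk:

```python
# Minimal sketch of overlaying a contact network on the agents and
# letting a simple diffusion process run. Real contact graphs would
# come from sources like cell-phone co-location pings; this graph
# and the adoption rule are toy assumptions.

def diffuse(adjacency, seeds, steps):
    """Simple contagion: each step, every neighbor of an active
    agent becomes active. Returns the active set after `steps`."""
    active = set(seeds)
    for _ in range(steps):
        frontier = set()
        for node in active:
            frontier.update(adjacency.get(node, ()))
        new = frontier - active
        if not new:
            break
        active |= new
    return active

# Toy contact graph: an A-B-C chain plus an isolated pair D-E.
contacts = {
    "A": ["B"], "B": ["A", "C"], "C": ["B"],
    "D": ["E"], "E": ["D"],
}

after_one = diffuse(contacts, seeds={"A"}, steps=1)  # reaches B
after_two = diffuse(contacts, seeds={"A"}, steps=2)  # reaches C
```

The isolated D-E pair stays untouched no matter how long the process runs, which is the whole point of filling in holes in the national network: missing edges silently wall off parts of the simulated population.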

And now you could allow for network diffusion processes to operate, so that people are coming into contact with one another in ways that are quite resonant with what's actually transpiring, and then they can affect one another. Okay. So I call it a dynamic data infrastructure for four reasons.

First off, as new administrative data become available, you can update the agents with changes that are reflected in those administrative data. Changes in their income, their employment, and much more. You can do that for the full population. You could update with survey data. That's more spotty coverage, but some proportion of the population would have ongoing engagement with surveys, and you could update their vectors accordingly.

You could update with new AVP interviews, and that should be done with the purposive design that I'll talk about shortly. And then you have network-induced diffusion processes. So it's dynamic in these four senses. Let me talk a bit about the AVP refreshment interviews. You want to keep this population of agents fresh.

And to do that, insofar as the behavioral blueprints that the AVP provides are critical, you need to make sure that you administer a tractable number of those AVP interviews. They're expensive. So you would probably want a purposive design of this sort. Two prongs. The first prong: interview people who sit at critical nodes in the network, the ones who are actually, as it were, opinion leaders. Central players, people who are bridging

between two portions of the network, those who are likely to have influence. You'd want to identify those, and you can through the network overlay, and oversample them, so that we have the key opinion leaders, and how they're changing, under control. And then the second prong is that you want to interview all types of people. Let me talk a little more about what I mean by types of people.
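Before turning to types, the first prong, flagging central and bridging agents from the network overlay, can be sketched concretely. The toy graph, the top-k degree cutoff, and the use of articulation points as a proxy for "bridging" are all assumptions of mine, not methods named in the talk:

```python
# Sketch of the first prong of the purposive design: flag agents who
# are structurally central (high degree) or who bridge otherwise
# separate parts of the contact network (articulation points).

def articulation_points(adj):
    """Nodes whose removal disconnects the graph (classic DFS
    low-link algorithm), a simple proxy for 'bridging' agents."""
    visited, depth, low, points = {}, {}, {}, set()

    def dfs(node, parent, d):
        visited[node] = True
        depth[node] = low[node] = d
        children = 0
        for nb in adj[node]:
            if nb == parent:
                continue
            if nb in visited:
                low[node] = min(low[node], depth[nb])
            else:
                children += 1
                dfs(nb, node, d + 1)
                low[node] = min(low[node], low[nb])
                if parent is not None and low[nb] >= depth[node]:
                    points.add(node)
        if parent is None and children > 1:
            points.add(node)

    for start in adj:
        if start not in visited:
            dfs(start, None, 0)
    return points

def oversample_targets(adj, k=2):
    """Union of the k highest-degree nodes and all bridging nodes."""
    by_degree = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    return set(by_degree[:k]) | articulation_points(adj)

# Toy network: a tight cluster A-B-C joined to D-E only through C.
net = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E"], "E": ["D"],
}
targets = oversample_targets(net)  # C and D bridge; C has top degree
```

On a real national contact graph you would use heavier machinery (betweenness, community bridging scores), but the selection logic, "interview the nodes whose opinions propagate", is the same.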

You want your interviews to be representative of the sorts of behavioral blueprints that are out there. So how might you do that? So what do I mean by types of people? This is really the interaction of personalities, positions, and places. What we want to know is the dimensionality of the human experience.

There are people out there who have had life experiences that are roughly like mine and who are going to have opinions, attitudes, and beliefs that are roughly like mine. I want to figure out how many types of people there are out there in the U.S. that have roughly similar life experiences. We want to make sure that we represent each of those types and interview them frequently and re-interview them frequently so we can see how people are changing as all these crises course through the country.

So we don't know the dimensionality of the human experience. Again, it's roughly going to be an interaction of personalities, positions, and places, but there are a lot of places, a lot of different types of identities and positionalities, and lots of personalities; there could be millions and millions of types of people to characterize if we're to do a good job of capturing behavioral blueprints. If there are many millions of types, we need to administer lots of AVP interviews. If there is a smaller number of types, the data acquisition costs are reduced.
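To make "types" concrete in the simplest possible way: if two people respond identically to every prompt, they count as one type, which is the definition the lecture itself uses later. A toy sketch (in practice responses are noisy text, so you'd cluster approximately rather than match exactly):

```python
from collections import Counter

def count_types(response_vectors):
    """Two people are the same 'type' if they answer every prompt identically.
    Returns the number of distinct types and each type's population share."""
    counts = Counter(tuple(v) for v in response_vectors)
    n = len(response_vectors)
    return len(counts), {t: c / n for t, c in counts.items()}

# Toy data: five people, three coded prompt responses each.
people = [(1, 0, 2), (1, 0, 2), (0, 1, 1), (1, 0, 2), (0, 1, 1)]
n_types, shares = count_types(people)  # two types, with shares 0.6 and 0.4
```

The number of distinct types is exactly the quantity that drives the interviewing budget: more types, more AVP interviews.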

Now, the other question that has to be borne in mind is that there can be mutations of types, right? Over time, what was once one type could split off into two. People in, say, slightly different neighborhoods might respond in slightly different ways as new events course through the country. You'd have to keep track of that. I'm not saying this is cheap or easy, but this is the data infrastructure that we need in order to keep track of what's happening.

So how frequently do they change? We don't know, but we'd have to keep track of that. If there are mutations, if one type splits off into two, you've got to know that. So what's the worst-case scenario in terms of how costly this would be, how many AVP interviews you would need to undertake? Well, the worst-case scenario is that there are lots of types of people, millions and millions of types,

and they change rapidly, and we find out that indeed, as early research suggests (though it's still early days), we need the AVP in order to understand behavioral blueprints. That's a worst-case scenario in the sense that it's expensive to carry out an immersive interview. But how expensive is it?

I haven't talked about how the team behind that paper I discussed at length collected their AVP data. They collected their data in a very different way than we collected ours. In the initial AVP fielding, it cost approximately $4 million after you subtract out some front-end costs of figuring out how to do this, about $1,500 per respondent. It's not cheap. You could drive that down, I'm sure, but not cheap.

In that most recent AVP fielding by the team that I discussed, you know how much it cost? $40 a person, because they used AI interviewing. And one thing to bear in mind: even with AI interviewing, they got very powerful behavioral blueprints. Imagine what we could do face-to-face. But the implication is that at that price point, you could actually do a lot of AVP interviewing.

Now, you'd want to do it well, obviously, and I'm no expert in interviewing, far from it. But obviously you couldn't do, I wouldn't argue for, completely AI-based interviewing. You would need to have a human overseer who would introduce the project, talk to the interviewee, ask them if they're okay with a tool that's AI-based for the purposes of interviewing,

and always be overseeing the interview to make sure that they intervene as necessary. So it's not going to be nearly as cheap as was possible with that team that didn't have this type of more complicated infrastructure. You could also imagine an AI interview in which you have real-time guidance about where the most important and richest sources of data are to be found.

You can imagine ways to make those interviews especially successful in understanding what's happening in people's lives. Now, we're not going to do any of this, right, for reasons that we'll talk about, and indeed I hope we're not going to do any of this. But I want to first talk about what we would get if we did, and then talk about how we could revise this formulation in ways that would make it palatable.

So

Let me just talk about the benefits if you had this kind of infrastructure. In the main, these benefits are that they address the infrastructural problems with which I led off. So let me march through those benefits. First off, it solves the real-time monitoring problem, right? We have real problems on that front. We don't have the capacity to understand where the population is, what they're thinking, feeling, and doing, in anything approaching real time.

If we did, the benefits would be extraordinary. We would know where people stand with respect to political attitudes. We would know, for example, whether or not new variants of populism are emerging. We would have a better understanding of social attitudes: are new types of anomie and disaffection emerging? That's pretty important to know.

We would know more about economic attitudes: are new types of economic rebellion emerging? Or, more prosaically, we would know about consumer sentiment, the foundation of our decisions about monetary policy and how we balance our commitment to high employment and low inflation. If we could have better measurement of economic attitudes alone, the value of that would be extraordinary.

and so much more. Basically, we would have a real-time monitoring infrastructure that would mean that we would understand developments like populism, not when it's too late, but as they're happening. We could have a more responsive system that understands where people are.

The agents could also respond to survey questions, they could engage in immersive interviewing, they could serve as participants in experiments, all without any respondent burden, without any costs except the cost of building the infrastructure in the first place. No marginal costs, right? I talked a lot about the cost of having underpowered analyses.

We're inured to those costs. We think they're inevitable. We run much of our science on small samples that are often inconclusive and misleading. If we had this type of infrastructure,

Full population analyses would be freely available. We'd have ample power for all intersectional analyses, for analyzing small and hidden populations, for analyzing very granular geographies at the neighborhood level, for evaluating possible interventions, even local ones. The compromises that we regularly make on sample size simply disappear.

What about response rates? I talked a lot about the problems we're facing with response rates and the consequent unrepresentativeness. Within the data-system environment, there's a guaranteed 100% response rate, right? The twins would always respond. Within the real human environment, the argument would be that the decline in response rates would be less rapid, because there'd be less respondent burden under this system. You don't have to ask people the same thing over and over and over again, and that reduction in burden would stem the decline.

What about the rise of non-probability sampling? Well, the main reason why we resort to non-probability samples is that they're cheap, right? If we had this dynamic data system, there'd be no need to resort to non-probability sampling. You would have access

for free, as a researcher, to this data system. So if you had this kind of system, how do you think it would play out in terms of the way in which science, social science, would happen? My guess is there would be a gradual transition to this platform. So initially, it would just be used as synthetic trial data, and we have data sets like that already now that are used in this way. So imagine that you wanted to

undertake an intervention for some social problem. You read the literature, there's about a hundred possible interventions that have been proposed. You don't have the money to go to the field with a hundred possible interventions, so you want to sort through them and try to get a guess at which ones are most likely to have payoff. So you could test them out within this data infrastructure. Maybe you find the ten that seem most promising, then you go to field with those, do an RCT.
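That screening step might be sketched like this, with entirely hypothetical effect estimates standing in for actual runs against the simulated population:

```python
import random

def screen_interventions(simulated_effects, top_k=10):
    """Rank candidate interventions by their estimated effect in the
    synthetic population; keep the most promising for a field RCT."""
    ranked = sorted(simulated_effects.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical effect estimates for 100 candidates, standing in for
# what runs against the simulated population might produce.
random.seed(0)
candidates = {f"intervention_{i}": random.gauss(0.0, 0.1) for i in range(100)}
shortlist = screen_interventions(candidates, top_k=10)
```

The point is purely economic: simulated trials are cheap, so the expensive field RCTs are reserved for the shortlist.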

and you would be able to do a much more comprehensive analysis, sorting in a cost-effective way with these synthetic trial data. But what if we found, over and over and over again, that when you go to field, you're getting exactly the same results as you got with the synthetic trial data? At some point, we might say the real gold standard is right here, with this dynamic data system. Of course, you'd always have to do ongoing calibration to make sure that the system isn't misleading you. That would be

Probably the most important cost would be constant calibration and checking. But we might well find that it delivers and that the real gold standard is the dynamic data system. I want to talk a little bit about all the work that it would take to get to this kind of system, and then I'll talk about what would be a more viable system than what I'm describing here. But first, how would we get there?

There are some data tasks, some analytic tasks, some LLM tasks, some network tasks; there's a lot of work. I can't talk about all of it, but let me just give you a sampling of what would have to happen. Obviously, it's very early days, and much of what I said might be possible could in the end not be possible, depending on how the research comes down. But what would one want to do? You would want to develop a better AVP that is more successful in building out these behavioral blueprints.

The AVP was never devised for the purpose of building a behavioral blueprint, and there's certainly reason to believe that one could do a better job if that were the objective. You would also want to explore whether or not you need the AVP in the first place. There's some suggestive evidence that you do, but it's early days and that may well be wrong. It could well be that a combination of survey and administrative data could get the job done and build behavioral blueprints that are quite good.

A lot of CS tasks have to be done. The research that I described relied on very simple prompt engineering. They did very well with that, which suggests that if you instead fine-tuned a local LLM, you could do yet better. So you can imagine creating a massive training set, providing for each person the full set of available administrative, survey, and qualitative data as prompts, along with their responses.
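One way to picture such a training set, as a sketch with invented field names (this is not the actual pipeline, and real records would be far richer):

```python
import json

def build_training_record(person):
    """Assemble one fine-tuning example: administrative, survey, and
    qualitative data as the prompt; observed responses as the target.
    All field names here are invented for illustration."""
    prompt = (
        f"Administrative: {json.dumps(person['admin'])}\n"
        f"Survey: {json.dumps(person['survey'])}\n"
        f"Interview excerpt: {person['interview']}"
    )
    return {"prompt": prompt, "completion": person["responses"]}

person = {
    "admin": {"income": 38500, "county": "X"},
    "survey": {"consumer_sentiment": "low"},
    "interview": "Work has been unsteady since the plant cut shifts.",
    "responses": "I don't trust that things are getting better.",
}
record = build_training_record(person)
```

Each record pairs everything known about a person with how that person actually responded, which is what fine-tuning would learn to reproduce.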

And then you would make the resulting weights available to qualified researchers in FSRDCs. That may well yield a better result. We don't know. We need to find out. And that's all I'm going to say about further tasks. There are many more besides those. But I want to turn now to some threats and why we wouldn't want to build what I just described. And it's probably already obvious to you, but let me lay them out and then talk about what we could build that I think would address these threats.

So some of the threats aren't all that troubling. They're just technical failures. I've already mentioned some of them. We might not be able to incentivize cooperation and data sharing in the way that I discussed. We might not be able to successfully capture population heterogeneity; LLMs have had problems with that. The basic task here, I should say, is that right now an LLM is conceived as a kind of virtual assistant, and we want to make them successful at mimicking humans. It's a big ask, and so we may not do a good job of that. We may fail to capture trends well: the effects of aging, period effects, cohort replacement. I talked about how we might do that, but it might not happen very successfully. And there could be growing non-response. I claimed that there wouldn't be, that it would go in exactly the opposite direction, but I could imagine respondents rebelling, saying, "I do not want to participate in this type of system,"

and we might have yet more extreme non-response problems. And if you don't have the ground truth of the surveys underlying this system, you've got nothing. Here are the more troubling zones: compromised confidentiality and misuse. There are simple security breaches: the servers could be breached. I said all this would have to happen within FSRDCs, but that system could be compromised.

You could have disclosure avoidance review that's compromised. You could have all sorts of breaches of confidentiality that would be very troubling. But here's the really troubling stuff: government breaches, right? This is why you can't do this. So let's talk about the context, the political context, in which you could do something like this. If you had a benign government, if you knew that the research system wouldn't be repurposed for surveillance, you could do this. But we will never know whether or not

this transition from a benign government to a surveillance state will happen. So it does not seem like this condition will ever be met. Now you might say, okay, even within a surveillance state, you could imagine some context in which it would be okay to build a system like this. So there are two types of protection you might imagine in play, but I doubt that either, in the end, will reassure any of us.

You could imagine having what I'm calling here a differing-objectives protection. The research system may not be built for objectives that are the same as the objectives of a surveillance system, and so you could say that whatever we learn within the research system wouldn't help with surveillance, that they'd have to build a separate surveillance system, and that we're not in any way enabling that surveillance system. I just don't think that's true. The second kind of protection is what I'm calling a resources protection. You might say, well, however

substantial the resources might be that could be devoted to the build-out of this research system, they'll always pale in comparison to the amount of resources the government has to build a surveillance system. The amount of money that's being poured into Palantir now in the US is incredible. Those resources will always swamp what happens for purposes of research. And so you might say, well then, that research-based system will be useless for surveillance, because you'll just build a much better system within the surveillance sector. Maybe there's a point to be made there, but still, I don't think it's enough to carry the day and convince one that we should proceed. So let's suppose, as I do now, that the calculus resolves against the kind of data system that I just described. What's the fallback?

Is there a fallback that is palatable? And I'm almost done, Mike. One hour, five minutes. The fallback is shockingly similar to what I've described, but I believe it's safe. Although if I'm not right, I want you to tell me. We'll see, see what you think. Here's the fallback system. It's very similar; that's why I described the system in all its glory, because it's going to get you here, I think. So I want to build a simulated population of agents. How would you do that?

You first, and it's very similar, you first have to identify the number of types of people. So it's the same process, where types are defined by the intersection of personalities, places, and identities or positionalities. There could be a lot of them. So you still have to do that work of figuring out how many life experiences there are, in effect, in the U.S. And I should say that

sameness of type is defined in a very simple way, right? Two people are the same type if, when prompted with a certain question, they respond in the same way. If they respond in the same way, they're the same type for our purposes. It's just that they have to respond to lots of prompts in the same way, right? Because we want this to be available for all sorts of research purposes. But anyway, you still have to identify the number of types of people. Then you populate the country with fabricated agents. Again, 260 million agents, one for every adult, based on these types.

So we would know all the types in the US, and we would know the proportion of the population that falls into each of these types. We would use that: we're going to create a population that replicates the types that are in play in the US. And then you would populate the neighborhoods in the US. All the existing neighborhoods in the US would be populated with the right number of agents, and with the right types of agents.

And you would ensure that the simulated neighborhood-level marginal distributions, by age, gender, race, attitudes, income, and all the other available variables, reproduce the actual neighborhood-level marginal distributions. So we're now creating authentic neighborhoods rather than authentic people. We're doing the best we can in terms of populating those neighborhoods, but there's no longer a one-to-one correspondence between person and agent. At the neighborhood level, the marginals are right,
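Matching neighborhood-level marginals is a standard synthetic-population task, often done with iterative proportional fitting, also known as raking. A two-dimensional sketch with made-up counts (real applications rake over many margins at once):

```python
def rake(seed, row_targets, col_targets, iters=100):
    """Iterative proportional fitting: rescale a seed table of agent counts
    until its row and column sums match target marginals, e.g. a real
    neighborhood's age-by-income distribution."""
    t = [row[:] for row in seed]
    for _ in range(iters):
        for i, target in enumerate(row_targets):       # match row sums
            s = sum(t[i])
            t[i] = [x * target / s for x in t[i]]
        for j, target in enumerate(col_targets):       # match column sums
            s = sum(row[j] for row in t)
            for row in t:
                row[j] *= target / s
    return t

# Made-up seed counts (age group x income band) and target marginals.
seed = [[10.0, 5.0], [5.0, 10.0]]
fitted = rake(seed, row_targets=[18.0, 12.0], col_targets=[14.0, 16.0])
```

The fitted table reproduces the neighborhood's marginal distributions without any cell corresponding to a particular real person, which is the whole point of the fallback design.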

but we don't have a one-to-one correspondence. You again overlay that with the simulated self-initiated network, and you update using administrative data, survey data, AVP interviews. It's all the same, except that you're disrupting the one-to-one correspondence at the neighborhood level. I believe this would eliminate our worries about using such a system for surveillance. Last slide, Mike. Right on time. So, closing point.

I do believe that we have a problem. We need to understand what's happening in this crisis-rich world. It's deeply important. Failing in the way that we failed in the past is extremely costly. We ought to spend up to the level of the cost that's implied in ensuring that we have a monitoring system that gets the job done. Failing to do that has implications that are staggeringly problematic, like what we're living in now.

That's the cost of not building a better monitoring system. We don't even know what our next failure will be if we don't do a better job of listening to the voices of the people. We need to figure out what's happening, monitor in real time as crisis after crisis courses through because we've failed to monitor well in the past. So I think the need is real. We just need to make sure that we do it in a way that doesn't enable surveillance. And I think that's possible.

So, I'll leave it at that. Thank you. Wonderful. I mean, I love this big-picture ambition; I think social scientists need to be ambitious, so we all applaud David. You certainly can't be accused of a lack of ambition. We've got some good questions online, but let's begin with questions from the floor. So if you can just raise your hand, say a bit about who you are, and then ask your question. Over there.

Hi, thanks very much. I'm Kristen Zarek. I'm in the Department of Sociology here. Listening to your talk, I was kind of reminded of this book by David Lodge called Changing Places, where there's a professor from the Bay Area who comes to the UK (it's in the north of the UK, I think), and the one from the UK goes to the Bay Area, and their entire lives change just by shifting nation states and where they are. And actually, I was thinking about your presentation as being in a very much Bay Area style: a presentation with big ideas

going on within it. It also made me think about the ways that everything you're talking about is still locked within the nation state, in terms of the data gathering, the measurement, everything else, when the crises you want to address are fundamentally global. And I think this raises an issue. Say you do get your wish list, say you do get some sort of tech-billionaire funding and everything, this is great, there's a lot of money for you, and you can run with it and you can build your realistic version of this.

What are the potential issues that arise when you have this amazing data set, this amazing agent-based, agentive kind of thing where you can simulate how American society works, when the inputs are the ones that you can gather from within the confines of the nation state, and those simulations are also structured by them too, when in the end the questions that we really want to answer are fundamentally global?

Do you want to stop? Yeah, I think it's a great question. Part of the answer, and it's simplistic and unsatisfactory, but I'll just lead with the unsatisfactory response to a great question, is that, and Mike's involved in this, we would need to have similar sorts of initiatives in lots of countries. And I think that would be viable. A second, maybe also unsatisfactory, response, absent that approach, would be that a lot of the global shocks that are in play ultimately impinge on people in the US. And if you monitored those key nodes within the network that I described, you would see how those global shocks impinge on people. And you could do a nominal job, as it were, of monitoring how people in the US respond to shocks, some of which are global, some of which are domestic.

But nonetheless, from the narrow perspective, as you put it, and I think I'm right with you, from the narrow perspective of trying to build a better society in the U.S., you could get that job done, because you'd at least know how the population is responding to those global shocks. But I agree that ideally you would want more. Yeah, even I can't dream as big as you're asking me to dream. Any more questions? Yes, Jane.

Thanks so much. So I'm Jane Elliott. I'm leading a pilot project at the moment, which is the UK Voices project based at the III. So I've already had some exchanges with David. And thank you, David. That was such a thought-provoking talk. Wonderful lecture. Thank you. I've got lots of questions I could ask you. I'll just focus on one. So you start with the focus on natural science and the fact that they think so much about measurement.

And that is one way of characterizing it. The other is that some of the questions in the natural sciences are actually quite well defined. So they know what they want to know, but it's incredibly difficult to find out. So they need the measurement, they need the Large Hadron Collider or the telescope.

In the social sciences, part of our problem is actually working out what the question should be. And that's what was missing for me a little bit from this wonderful vision and provocative and troubling vision is what questions could we actually ask that would benefit people? And it's fascinating that, I mean, obviously you can't cover everything in a lecture, but when you're sort of saying we failed by not seeing the rise of populism,

another way to put it is to say that's not a failure. The failure is that we have created the conditions of inequality that have allowed populism to thrive, and that populism is a completely logical response to people's material conditions. And it's not clear to me: if we knew that there was a rise of this dissatisfaction, would that then also give us the clue as to what to do about it?

So that's probably more of a comment than a question. But it's fascinating. Thank you. That's a great question. Can I respond nonetheless, even though it's a comment? I couldn't agree more. But I think I overdramatized in saying that had we known about populism sooner than we did, we could have, maybe we could have stopped it. I don't think that's impossible, but let me step back and talk about some crises that I think could have been addressed because they're more tractable. So think about fentanyl, for example. It really wasn't understood until a lot of carnage made it obvious, but had we had a system like this, we could have picked it up sooner

and done something about it. I think that's a more tractable type of problem, and I think we have a lot of tractable problems that we don't take care of because we just don't know what's happening. Now, to return to the populism case, I don't think it's impossible there either. If we had picked this up sooner, it's possible that those who had long been talking about the takeoff of income inequality,

long been talking about how certain sectors of the country are hurting, and hurting really badly, feeling lost, often for reasons that are just the loss of privilege, but nonetheless

no less real from their point of view. If we had known that sooner, it's possible that those arguments about addressing the structural causes could have been adhered to and listened to. But I agree, they might not have been. Still, the chances of that happening would have been raised, and even a slight increase in the chances of a meaningful response would be something we would want.

I've actually got a couple of questions online which I'll link on to this. I'll ask them, if that's okay. The first is from Carl, an ex-LSE sociology undergraduate and a postgraduate LSE student of history. I disagree with the supposed failure to predict populism in the humanities. For instance, there was an entire very influential course at the LSE called Theories and Problems of Nationalism. It never underestimated the potential for populist resurgence.

It was not based on quantitative methods and better data, but in some ways the course was spot on, as it stressed the primacy of vertical over horizontal ties, what Anthony Smith called at times the ethnic revival. As such, various failures to anticipate trends and social movements might be a failure of theory. So I guess the issue here is:

perhaps social science is the problem, and we should be thinking about history, better history. Which relates to the second question, from Vijay Swao, another former LSE student: Is social science science? Totalitarian, communist, and fascist regimes, as well as the USA, use the term social science, whereas it would be more humble and accurate to describe these fields as social studies. If you call it science,

then isn't there a danger you're producing propaganda? I guess the issue is, in a way, whether we should invest in more and better, more scientific social science, or whether we should recognise the limits of social science in general and think about more humanities-oriented or other modes of understanding problems. Yeah, that is one way to go. Throw in the towel. Give up the dream of social science and

try to take care of our problems in other ways. I think that would be misguided. I would totally agree that science as it's currently practiced is deficient, problematic in all the ways that I hope all of us appreciate. And the open science movement addresses some of those problems, but by no means all of them. And so that science is a very deficient, problematic enterprise.

is, I agree, indisputable. That there is no hope for building a science that can help us is where I part ways with that comment. What was the one before that? I already forget, but I had a reaction to that one too. The first one, about how populism was in fact predicted by humanities and history scholars?

We have long-term historical trends. Yeah, I think that's absolutely right. There are many very, very important early analyses of what was happening, just not enough. There has to be a chorus of voices. There has to be a data set that shows it time and again and that lots of analyses would reveal. I just don't think there was enough. There's always some people who get it right, but we needed a chorus of voices and we didn't have it.

Hi, Yann Renizio, CNRS sociologist at Sciences Po Paris. Thank you very much

It was a very, very interesting presentation. To be a little bit provocative, I would say that you lack a little bit of ambition here, in the sense that you mentioned many times the term behavioral blueprints, but it seems to me that your entire design relies on only one kind of behavior, which is answering questions,

and that might be a strong shortfall for predicting what you want to predict, populism, big crises, because there is often a big gap between what people say and what they do when things matter a lot. So yeah, could you react to that, please? Thank you. Yeah. These are great questions. Too good.

But I think that you don't have to take literally what people say, right? You look behind what they say and understand what is meant by what they say. And that's the job of inspired qualitative research.

And this is simply providing the data that makes that research possible. I mean, it's not just classic interpretive analysis, which is immensely important and that could help address that problem you lay out so well, but also natural language processing, all the various techniques that we have to kind of reach behind what people say and understand what that means. I think we have pretty good tools in that regard and qualitative data of the sort that the AVP and other sources make available.

can supercharge those efforts. One reason why survey data are not what we need is that they aren't rich in that same way, right? You only have the response to a question. Well, what can you do with that? It's so thin, right? But if you have an immersive interview, it's rich, and you can reach beyond what people say to what it means. Hi, thank you so much. My name's Thomas, I'm a mathematics PhD student

I have two comments and a question. My first comment is relating actually to what I think Jane said, which is that you should focus more on the kind of questions that you'd want to answer. I actually think the opposite. I think the strength of this pitch is that it's question agnostic. You know, you're proposing a general infrastructure and then you can pose or researchers can pose whatever questions they want to that model. And so you're not setting out with a specific research question in mind. You're building a kind of

simulation that would then be able to answer any possible question that you ask it. So that's my first comment. Second comment is you mentioned at the end instead of having a one-to-one agent-based model of individuals, you could have a kind of proxy model where

you're not calibrating it to individuals, but you're kind of replicating the macro structures. And then you said, well, we'd first have to kind of hard code the types of people there are into the system. But that, I suppose, could also be a kind of floating parameter that you can calibrate live, right? Where you can learn how many types of people there are from macro data and adjust the system parameters live as you're learning the model.

That's my second comment. The third, the question is, you talked about the interaction network and you talked a lot about spatial, geospatial interaction, cell phone networks. But a lot of the interaction is online, right? It's on the internet. So would you have to build a copy of the internet as well as sort of latent space in the background in which people interact and you'd feed in information on sort of online behavior as well?

Yeah, three questions. On your one comment regarding how I should have responded to Jane, totally agree. I should have and consider it done. That was a great point and I agree with it. So there, Jane. On your other two points, I think you're on the mark again.

Absolutely, I agree that face-to-face interaction is not enough and online interaction would be valuable as well. And there are lots of ways you could make those data available. I don't know, my reading, and I'm not really well schooled in network analysis, but my reading is that most people try to justify online interaction as a fairly strong sort of recapitulation of face-to-face and may not be huge value add.

But insofar as there is, one should exploit it. Absolutely. Totally agree. Actually, I realize now that I didn't fully understand your third point. Maybe you could go back to that? So you said, okay, we build this copy of the real world and we hard-code the personality types. So first we need to go out and collect all this survey data on what people are like, and then we build that into the model. But what if you let that be a parameter that you calibrate, that you also calibrate to macro parameters,

studies of what, you know, societies are like? So you say, okay, the dimensionality of the human personality is n, and n is itself a parameter that you calibrate to survey data as it comes in, and it can adapt. I think... yeah, okay, I understand now, so thank you. And I think my response would be: insofar as you buy

a conclusion that's based really on just one study. And so obviously it's very premature to reach any conclusion at this point. But insofar as you buy the conclusion that we really can't understand the dimensionality of human experience without the American Voices Project, or at least similar qualitative data, that we need that to understand the behavioral blueprint, it's going to be very hard to calibrate from survey data, right? It's just not rich enough. That would be my worry. But that's a testable claim, and you may well be right. How about Abraham?

Thank you. As I said, incredibly provocative and interesting. I want to ask a question about the safety of the last version of the data set that you said. So I suppose what I'm wondering is, let's say we imagine the world of the authoritarian surveillance state with the data set that you've described.

And it occurs to me that it wouldn't prohibit someone from using it in a kind of predictive way to create a kind of stop and search on steroids. So what they presumably would be able to do is to take features of the types of individuals, even locate those in neighborhoods and areas, and then use it to understand what kinds of behaviors or values might people have.

prospectively, because that's what the model can do: it can do some prediction work to try and understand, and then what it enables them to do is identify groups of people that they should observe, at least observe and survey, or indeed, you know, more proactively stop people from pursuing certain kinds of actions. So maybe that could be trying to preemptively stop protest, for example, or perhaps even

something to do with crime or other things, but it could be used predictively to identify groups of individuals who should be the target of government surveillance. And that would be potentially worrying, but maybe I've misunderstood something that you're articulating in what that would do. I think that's right. And...

The question is whether or not that kind of analysis would be more valuable than alternative analyses that would be more pinpointed. My suspicion is that there would be a cheaper way to get there. But if you're right and there's not, then that's a worry. Follow on over here. Could I respond to that? Which is that that presupposes quite heavily, I think, having geospatial information, right? So being able to link an individual to a physical location.

But my point about the internet is that if you're saying, well, most of the interaction happens online actually these days, then people are sort of in this unphysical latent space, right? And it's not that important where they are in the real world. It's more where are they in this sort of ideological space? And that would presumably lessen the risk of surveillance because you wouldn't be able to pinpoint an individual to a certain physical location. I'm going to go here. Jane? The microphone's coming. Yeah.

Okay, so this is such an interesting debate. So to carry on this, to me, this is already happening. So it's not state surveillance pinpointing people and rounding them up and saying, don't go on that protest, but it's Facebook and all sorts of other things that are feeding people through the algorithms what they want to see. And so we've already got a sort of strange social engineering, which means that when I look at Instagram...

I see, well I won't tell you, but I see things that fit my very boring post-menopausal habitus. Yeah, and I bet you don't see the same things I do if you go on. Okay, I always like a bit of humour, but anyway.

So I think that we already live in a society where there are corporations that have this, but so then to return it to the social science question, it's how do we then make sure that social scientists can make the best of this sort of data?

rather than it just being the domain of the Metas of this world to use it. But also I want a quick rejoinder to Tom, because I do know about omnibus studies, but we still need to know what types of question it could answer. And that comes back to Jan's point as well, about whether we can really predict people's behaviour from this. Anyway, I've said plenty. I will shut up. Over to you to respond, David.

Yeah. A lot of behaviors are represented in queries that you could administer to people, and insofar as there is that correspondence, then you're good to go. Now, of course, the correspondence is by no means perfect, and so, again, the point that would, I think, have to be made is that,

although the response to the query, read literally, might not inform you about the behavior, a deeper read of the transcript likely would. That would be my hypothesis. But on the larger point, I think there are a lot of fundamental questions that I did at least hint at that we care about. We care about political attitudes deeply, right? We care about social attitudes and disaffection deeply. We care about...

effects of various interventions, and some of those effects can, I suspect, be adequately picked up through the responses to the queries that we could administer. That's a testable claim, but even if we can't get that, that's the reach goal: knowing about people's political attitudes and social attitudes. Thank you so much. I really enjoyed the talk, and it opens up a whole world of how we can think about the social sciences.

And it's kind of related to the questions we just talked about. So can we really predict these kinds of behavior? And I'm thinking more from the perspective of how can we sort of see

whether these twins, whether these avatars, would have responded in the same way as the real people. So I assume that would also entail that a lot of validity checks with real people should be made. But I also think about how behavior can be so complex, and whether, when we ask an avatar,

what is your political position on this issue, and then ask a real person, I don't know, maybe after a few weeks or a few months, how are we actually comparing the same thing? Or what are the issues that could arise from these different human and non-human subjects? Yeah, and one way to think about it is that to make

this a defensible approach, we only need to do better than what we're currently doing. What we're currently doing is imputing, right? When we have missing data, we impute. And this is just an imputation technique. It's likely to be better. Is it perfect? No, just better. And so it's easy to overrepresent the extent to which this approach is different from what we already do. It's not that different. We impute. Our imputation regimens are not so good. We can do a better job of it. Is it perfect? No. There'll still be error. And so what's an open question is to what extent we can improve upon existing imputation. And so I think we agree.
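The "it's just imputation" framing can be made concrete with a toy comparison of a mean-imputation baseline against a model-based imputer. Everything in this sketch is invented for illustration: the synthetic data, the single covariate, and ordinary least squares standing in for the richer imputation regimen.

```python
import math
import random

rng = random.Random(7)

# Synthetic population: an observed covariate x and a partially missing outcome y.
n = 1000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
missing = [rng.random() < 0.3 for _ in range(n)]  # roughly 30% of y is unobserved

obs_x = [xi for xi, m in zip(x, missing) if not m]
obs_y = [yi for yi, m in zip(y, missing) if not m]

# Baseline regimen: impute every missing y with the observed mean.
mean_y = sum(obs_y) / len(obs_y)

# Richer regimen: ordinary least squares of y on x, fit on the observed cases.
mx = sum(obs_x) / len(obs_x)
beta = (sum((a - mx) * (b - mean_y) for a, b in zip(obs_x, obs_y))
        / sum((a - mx) ** 2 for a in obs_x))
alpha = mean_y - beta * mx

def rmse(impute):
    """Root-mean-square error of an imputer on the truly missing cases."""
    errs = [(impute(xi) - yi) ** 2 for xi, yi, m in zip(x, y, missing) if m]
    return math.sqrt(sum(errs) / len(errs))

rmse_mean = rmse(lambda xi: mean_y)
rmse_ols = rmse(lambda xi: alpha + beta * xi)
print(rmse_ols < rmse_mean)  # prints True: the model-based imputer wins here
```

The point is only the comparison: both regimens are imputation, and the question is whether the richer one beats the one already in use, not whether either is error-free.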

Maybe. I worry that, in fact, we can do a better job than perhaps you think we can. I'd rather it didn't do quite as good a job as I think it could do, but that's a research question. I've got two questions online from Stuart McIver, which ask about AI and link into the various discussions we've had. The first question is: because AI is trained on particular segments of the population and their digital expression, i.e. the digitally savvy and all that,

Doesn't it mean that you might miss out on the kinds of people who might become the populists of the future? And then, linked to that, since AI is based on an expression of past data, isn't there a certain problem that you won't necessarily pick up emerging things if you rely on an AI-based model? Yeah, that would be the worry. And indeed, this comment is on the mark in terms of

as I understand it, one of the biggest failings of existing efforts to reproduce population heterogeneity is that it's suppressed. We don't get as much heterogeneity in AI-generated responses as prevails in the actual world, because some of the more problematic responses that sadly exist in humans have been

trained out of LLMs. And so we need to reinstate, through some sort of fine-tuning, the heterogeneity that obtains in the world. The question is whether or not we can successfully do that. To date, it's been an imperfect exercise. It's hard to recreate the heterogeneity that's been ironed out. But the guess would be that we can do it. And towards the end, is there any last question?

Two questions. Let's have three questions very quickly and then one answer, yeah? Yeah, do you want to go next? Kind of a boring question, probably. It's about your fallback system, basically. You settled on the neighborhood as the lowest level of aggregation, but why would you settle for that in particular, insofar as you have all this data linkage going on and several levels of aggregation that you could choose from?

Thanks David, I really enjoyed the presentation. I really like the ambition of having a £10 billion social science project. My question is with that scale of ambition,

It seems like you're searching for a truth. And in the social sciences, I feel like there's rarely a truth. And the benefit of having the many different approaches and methodologies that we have at the moment is that you at least get to debate those different versions of the truth. And I just wonder what place do you see for other forms of social research if your vision was to materialise? Hold on to that. No, you can't.

David, for the last word. I love these questions. It wasn't boring at all. As I understand it, it would simply be a matter of going down to the lowest possible level that wouldn't run the risk of re-identifying people, right? That's a technical question. I mean, there's always a risk, and we would have to accept some risk of re-identification, and we would want to press down to the lowest possible spatial level that doesn't carry an intolerable level of re-identification risk.
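The "press down to the lowest safe spatial level" rule resembles a k-anonymity check: release the finest geography at which every published cell still contains at least k people. The function, the geographic levels, and the toy population below are all invented for illustration; real disclosure control would be considerably more involved.

```python
from collections import Counter

def finest_safe_level(records, levels, k=10):
    """Return the finest geographic level at which every cell holds >= k records.

    `records` holds one tuple per person, coarsest code first,
    e.g. (state, county, tract); `levels` names those positions.
    """
    safe = None
    for depth, name in enumerate(levels, start=1):
        cells = Counter(r[:depth] for r in records)
        if min(cells.values()) >= k:
            safe = name  # this level is still releasable; try one finer
        else:
            break
    return safe

# Toy population: one sparsely populated tract forces release at the county level.
records = (
    [("CA", "Alameda", "tract-1")] * 40
    + [("CA", "Alameda", "tract-2")] * 55
    + [("CA", "Marin", "tract-8")] * 10
    + [("CA", "Marin", "tract-9")] * 5  # too few people to release at tract level
)
levels = ["state", "county", "tract"]
print(finest_safe_level(records, levels, k=10))  # → county
```

Lowering k to 5 would make the tract level releasable in this toy example, while a threshold above the total population returns None, meaning no level is safe; choosing the acceptable k is exactly the technical question raised here.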

I'd like to think it's not a zero-sum game. Having the capacity to carry out research with this infrastructure wouldn't at all rule out all manner of other approaches to understanding the world.

I don't think it's a zero-sum game. If we had stronger research capacity through this kind of data infrastructure, we would have more social scientists because it would be more payoff to being a social scientist because we could get more very valuable evidence on what's happening in the world, and we wouldn't squeeze out any of the other approaches. We'd just have more good social science, and we would get closer to an understanding of what's happening in the world.

I think we need to stop there. I just want to finish by saying, you know, my own perspective is that so much social science sadly has not risen to the challenges of where we are in the world, and there's very much a pitching towards niche research communities. It is so refreshing to get the insistence that we need to think about

all the challenges and the dilemmas we face, and to think about how we scale up social science, about the right way to think about this. So really, I think it's been a fantastic discussion. Thank you so much to David. And thanks to all of you for coming along. Thank you.

Thank you for listening. You can subscribe to the LSE Events podcast on your favorite podcast app and help other listeners discover us by leaving a review. Visit lse.ac.uk forward slash events to find out what's on next. We hope you join us at another LSE event soon.