Welcome to JAMA AI Conversations. I'm Roy Perlis, Editor-in-Chief of JAMA Plus AI, and I'm pleased to welcome today's guest, Dr. Daniel Herrmann from the Department of Pathology and Laboratory Medicine at the Perelman School of Medicine, University of Pennsylvania.
Today, we'll be discussing his recent study, Reassessing the Inclusion of Race in Prenatal Screening for Open Neural Tube Defects, published in JAMA Pediatrics. Dr. Herrmann, thanks for joining us today. It's a pleasure to be talking with you, Dr. Perlis. Thank you for the opportunity. So before we get into the specifics of your study, can you give us a little bit of clinical context? When and how is this kind of prenatal testing used?
Sure. So, open neural tube defects affect roughly 1 in 1,400 pregnancies, and one of the tools we have for screening for them is measuring alpha-fetoprotein, or AFP, concentrations. This is done prenatally, early in the second trimester. The way it works is we collect a specimen from a pregnant individual, measure the AFP concentration, and compare it to the values we expect.
Got it. So you have sort of a set of standards. Where does race traditionally fit into this kind of model?
Sure. The main reason we make these comparisons is that AFP concentrations rise rapidly during the second trimester, so we can't use a single interpretive threshold that applies equally at, say, 15 weeks of gestational age and at 20 weeks. That's why a more complex interpretive strategy was used than for most laboratory tests.
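To make that interpretive strategy concrete, here is a minimal sketch of how a multiple-of-the-median (MoM) calculation against gestational-age-specific medians might look. The median values and the 2.5 MoM screen-positive cutoff are illustrative assumptions, not values from the study or from Penn Medicine's practice.

```python
# Illustrative sketch of multiple-of-the-median (MoM) interpretation for maternal serum AFP.
# The medians by gestational week and the 2.5 MoM cutoff are assumptions for illustration only.

# Hypothetical gestational-age-specific medians (ng/mL), rising through the second trimester.
MEDIAN_AFP_BY_WEEK = {15: 30.0, 16: 34.0, 17: 38.5, 18: 43.5, 19: 49.0, 20: 55.5}

SCREEN_POSITIVE_CUTOFF_MOM = 2.5  # commonly cited cutoffs are in the ~2.0-2.5 MoM range

def interpret_afp(afp_ng_ml: float, gestational_week: int) -> tuple[float, bool]:
    """Return (MoM, screen_positive) for a measured AFP at a given gestational week."""
    median = MEDIAN_AFP_BY_WEEK[gestational_week]
    mom = afp_ng_ml / median          # normalize against the expected median for that week
    return mom, mom >= SCREEN_POSITIVE_CUTOFF_MOM

# The same raw AFP value means something different at 15 versus 20 weeks:
print(interpret_afp(80.0, 15))  # (~2.67, True)  -> screen positive
print(interpret_afp(80.0, 20))  # (~1.44, False) -> screen negative
```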
After that was done, it was realized that there were other associations with AFP on average, and one of those was race. So conventionally, the adjustment for gestational age is done in a race-specific way. And where did that come from? When did people originally sit down and build these models? Was this like 30 years ago or five years ago?
The initial studies were done in the late 70s, and the statistical methods used for this were developed in the late 70s and early 80s. So we've been using this approach for second-trimester screening for decades. The associations across race were observed around the same time. There have been many studies over the decades, and there's a lot of variability across them: most tend to be small, with small numbers of samples, and geographically localized. Many of them show associations on average across race; some of them don't. So there's a lot of variability across these historical studies.
It's interesting that this goes back so many decades, I think, like a lot of the models that people have taken for granted. In this case, you looked at the inclusion of race in these screening models. What motivated you to ask the question in the first place? Our narrow question was, in our practice at Penn Medicine, do we want to continue making this race adjustment?
So as I said, there's a history of observing, on average, a difference in AFP concentrations in Black pregnant patients as compared to others. And we knew that in our clinical practice we were following the guidelines, which were to make this adjustment.
And we also know that race is a social construct, and it's inaccurate and imprecise. We think there should be a high bar of evidence for systematically incorporating race into clinical practice like this. There had been a recent study at the University of Washington
that was trying to reevaluate this and saw that the association with race in their population appeared to disappear after adjusting for other patient factors.
And because of how much variability there is across historical studies, we weren't sure what made sense for our patient population. So we wanted to look back and understand whether we saw the same association, and to think more deeply about whether, in the larger context, it made sense to continue this practice.
So if I understand you correctly, this was really a clinically focused quality improvement question: should we keep doing this? Which is interesting, because it's different from a lot of modeling studies where the research question comes first. In this case, it sounds like it was a very pragmatic question.
That's fair. I mean, we knew that we were adjusting for race, and we knew, looking back at the literature, that it wasn't clear from the evidence out there that this made sense. So we wanted to better understand, in our patient population, what the impact of that race adjustment was, how to weigh that impact against the harms of race-based medicine in general, and exactly how removing the race adjustment would change outcomes in patients.
And I guess this is probably the point where I should ask you, what did you find? Sure. We looked back at data from our clinical practice at Penn Medicine over three years, covering about 7,000 pregnant patients. What we found was consistent with some other studies: on average, Black patients had slightly elevated alpha-fetoprotein multiples of the median, about 8% higher.
When we traced that through the process and applied the standard interpretive thresholds, we found a slightly higher frequency of false-positive interpretations in Black patients as compared to others when using a race-agnostic model as compared to a race-adjusted model.
We can look at those associations in different ways, but on an absolute scale, the difference in false-positive rates if we were to move from a race-adjusted model to a race-agnostic model was 0.6%. In other words, moving from a race-adjusted to a race-agnostic model, we'd expect one more false positive in Black patients compared to others for every 170 patients tested.
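As a quick check on that arithmetic: the "one per N tested" framing is just the reciprocal of the absolute rate difference reported in the conversation.

```python
# Converting an absolute difference in false-positive rates into
# "one extra false positive per N patients tested".
absolute_fp_rate_difference = 0.006  # the 0.6% figure from the discussion above
patients_per_additional_false_positive = 1 / absolute_fp_rate_difference
print(round(patients_per_additional_false_positive))  # ~167, i.e. roughly 1 per 170 tested
```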
So I like the way you're framing that. It's helpful to think in terms of absolute numbers. Maybe not a fair question, but going into the analysis, did you have an idea of what the threshold would be where you would want to retain race in the models? Like, is there a number, you know, if you said it was a 20% false positive rate, would you be framing it differently?
That's a good question. And our thinking about this evolved as we analyzed the data, as we thought more deeply about the consequences of this. I mean, there are lots of different ways of asking a question about fairness.
The reason why we historically have this race adjustment is that on average, Black pregnant patients have slightly higher concentrations, it seems. And the epidemiological evidence doesn't suggest that fetuses for Black patients have a higher frequency of open neural tube defects. So if everything else were equal,
it makes a lot of sense to minimize the number of false positives and to equalize the false-positive rate across patient groups. As we thought about it more deeply, though, the screen-positive rate and the false-positive rate are only part of the picture. The purpose of this test is to screen for open neural tube defects, and one thing that's unknown is how this race adjustment affects sensitivity for open neural tube defects.
Exactly how to balance an effect on sensitivity against false-positive rates is an important question. But here, we don't actually know how the race adjustment affects sensitivity for open neural tube defects. In the absence of that evidence, there's nothing to balance. We need more studies that can address that question, that can tell us whether the race adjustment affects sensitivity
and tell us about outcomes in these patients, which is what's most important. And I guess the important notion here is that probably that trade-off depends a lot on the context. So in this case, clinically, I think you make the point in your paper, there is a follow-up for a positive test. Is that right?
Exactly. If a patient screens positive, the follow-up testing now is non-invasive ultrasound, which is diagnostic in the vast majority of cases. I mean, I wouldn't wish a false positive on anyone; false positives in this setting can lead to considerable anxiety and to downstream testing.
But that harm is mitigated in part because diagnostic ultrasound is readily available and can be performed soon after getting a screen positive result. Got it. One of the things I should point out here is this is a podcast about AI. A lot of times we're talking about new AI applications.
But I'm wondering, this is a fairly straightforward question that you're trying to address about race and the role of race in models. But since this is an AI podcast, do you think there's something to take away from this for people who are building these really complicated AI models and have a choice of either incorporating race or not incorporating race in what they're doing?
Yeah, it's a great question. Zooming out, our particular question here focused on prenatal screening. It's different from that general case in that the model is very, very small. It's transparent, and we can understand explicitly the difference between including race or not including race. One of the takeaways, I think, is that it's really important to have good outcome measures. Thinking more broadly, that means being able to assess, for a particular application of AI to a particular clinical question, how it affects different groups of patients on the clinical outcomes we think are most important.
And it's important to do that at the time you're developing the model. It's also really important for us to develop tools that would allow us to do this practically, to monitor and ask: if I brought this tool into practice, did it change the frequency of this important clinical outcome across patients? Here, the model is much more explicit.
And we can think about how incorporating or not incorporating race affects the performance and the downstream clinical outcomes in patients grouped by race.
With AI models in general, we know there's a lot of bias that's gone into the training of these models, and it's recapitulated as these models are used. So I think the first principle is to be able to assess that well at the development stage and at the implementation stage. And then we as a community need to spend the time to understand that, to formulate what we think are appropriate fairness goals for individual questions, and to apply those values, which come from a diverse set of stakeholders, as we are training, applying, and then monitoring AI models.
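A minimal sketch of the kind of post-deployment monitoring described here: computing an outcome rate per patient group from logged results. The field names and records are hypothetical, not drawn from the study.

```python
# Minimal sketch of monitoring an outcome rate (e.g., a false-positive rate)
# by patient group after deployment. Field names and records are hypothetical.
from collections import defaultdict

def rate_by_group(records, group_key="group", outcome_key="false_positive"):
    """Return {group: fraction of records with the outcome} from deployment logs."""
    counts, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        counts[r[group_key]] += int(bool(r[outcome_key]))
    return {g: counts[g] / totals[g] for g in totals}

# Hypothetical logged results:
logs = [
    {"group": "A", "false_positive": True},
    {"group": "A", "false_positive": False},
    {"group": "B", "false_positive": False},
    {"group": "B", "false_positive": False},
]
print(rate_by_group(logs))  # {'A': 0.5, 'B': 0.0}
```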
So in the case of this particular model, or this particular test, did it lead you to change your practice? Is Penn changing how it applies these thresholds now? Yes. We're going through a two-step approach to this. Overall, we see that there is a small difference in false-positive rates, but because of what we don't know about the effect on sensitivity, and because we don't know the mechanism by which Black patients have a higher AFP on average,
we don't think there's enough evidence to continue adjusting for race in this practice. So right now we have changed our practice: we are calculating the medians, and the risk and AFP multiples of the median, in a way that's agnostic of patient race. Our next step is that we're building a new application that will allow us to remove race entirely from the ordering and resulting process, to incorporate our new methods, and to improve the way information is communicated through the ordering and resulting process.
So 50 years after these kinds of tests were first developed, you're going to have a new iteration that hopefully takes advantage of newer data and newer technologies. Yes. Not much has changed; the core of the test is the same. As part of the study, we actually updated the methods. We said, well, instead of doing serial adjustments for gestational age and then weight, let's use a multivariable model. Instead of grouping patients and having to look week by week, we used an explicit quantile regression approach to estimate the medians. This is not the best that one can envision doing. It'd be better to have a method that doesn't have this association, this bias on average, and we'd like to understand the mechanism, to understand why we see this bias.
It'd be better if we could incorporate new biomarkers. There is a little bit of preliminary evidence that specific forms of AFP with differential glycosylation might be a better biomarker in general, but we don't actually know how those different variants of AFP are associated across race. So there are opportunities, things the community should be moving toward, that could improve this approach on average
and improve the equity in using this test. But this is the step we can take right now: we can remove race from the calculations, and we can do additional studies to ask the question about sensitivity, and additional studies to ask, are there underlying genetic factors? Are there underlying environmental factors? Is there a better analyte that we could be measuring?
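For readers curious what the quantile-regression re-estimation described a moment ago might look like in practice, here is a rough sketch: estimating the conditional median of log AFP as a function of gestational age and maternal weight, rather than grouping patients week by week. The column names, the log-linear model form, and the use of statsmodels are assumptions for illustration; this is not the study's actual implementation.

```python
# Illustrative sketch: estimating gestational-age- and weight-dependent median AFP
# with quantile regression (q = 0.5), instead of serial week-by-week adjustments.
# Column names and model form are assumptions, not the study's actual code.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_median_model(df: pd.DataFrame):
    """df needs columns: afp (ng/mL), ga_days (gestational age in days), weight_kg."""
    df = df.assign(log_afp=np.log(df["afp"]))
    model = smf.quantreg("log_afp ~ ga_days + weight_kg", df)
    return model.fit(q=0.5)  # q = 0.5 targets the conditional median

def mom(result, afp: float, ga_days: float, weight_kg: float) -> float:
    """Multiple of the median against the model-predicted median for this patient."""
    new = pd.DataFrame({"ga_days": [ga_days], "weight_kg": [weight_kg]})
    predicted_log_median = np.asarray(result.predict(new))[0]
    return afp / float(np.exp(predicted_log_median))
```

Because the log transform is monotone, the conditional median of log AFP maps back to the median of AFP itself, which is why exponentiating the prediction gives a patient-specific median to divide by.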
So, more work to be done. Dr. Herrmann, thanks again for talking to us about your study in JAMA Pediatrics. To our listeners, if you want to read more about this study, you can find a link to the article in the episode description. To follow this and other JAMA Network podcasts, visit us online at jamanetworkaudio.com or search for JAMA Network wherever you get your podcasts. This episode was produced by Daniel Musisi at the JAMA Network.
Thanks for joining us, and we'll see you next time.