
Real-World Performance of AI in Screening for Diabetic Retinopathy

2025/4/18

JAMA Medical News

People
Arthur Brant
Sunny Virmani
Topics
Arthur Brant: I am an ophthalmologist and took part in this study of the real-world performance of AI in screening for diabetic retinopathy. The significance of the study is that it evaluates whether the AI model's performance in a real-world setting matches the results of earlier research studies. Diabetic retinopathy is a serious public health problem; in both the US and India, many patients do not get screened in a timely way. The study used an AI system deployed in India that screened 600,000 patients, with a 1% sample of those cases re-reviewed by human graders. The results show that the model performs well in the real world: it accurately identifies the severe patients who need to be seen immediately, and it remains stable with respect to model drift. Although the model has a certain false-positive rate, this is still a major improvement over relying entirely on manual screening. The study also highlights issues to watch for when deploying AI models in real-world settings: the model's sensitivity and specificity set points need to be adjusted to local conditions; clinic readiness, technician training, and patient cooperation all need to be considered; and the entire care workflow matters, including data input, handling of the output, and potential bottlenecks.

Sunny Virmani: I lead the Health AI product team at Google and took part in this study of the real-world performance of AI in screening for diabetic retinopathy. Our goal was to evaluate whether the AI model's real-world performance is consistent with earlier studies, and whether its safety standards hold up at larger scale and in more complex environments. In India, roughly half of people with diabetes do not even know they have it, which makes diabetic retinopathy screening a huge challenge: the difficulty lies not only in the screening itself but also in patients' awareness of their disease and their access to care. AI can improve access to eye care for patients in remote areas, because they no longer have to travel long distances to large hospitals to be screened. Model performance drift can come from many sources, including shifts in the patient population, camera hardware being updated or replaced, and differences in operator skill. Real-world images often contain other pathologies that earlier studies deliberately excluded, which can degrade performance in deployment. To make sure the model remains effective in the real world, we proactively collected data and had it re-reviewed by human graders so that problems could be caught and addressed quickly. Beyond the model itself, clinic readiness, technician training, and patient cooperation all have to be addressed for a successful real-world deployment.


Chapters
Diabetic retinopathy is a significant public health issue, affecting millions worldwide. Current screening rates are far below recommended levels, particularly in countries like India. The study highlights the challenges of access to care and the need for innovative screening methods.
  • Diabetic retinopathy is a leading cause of preventable blindness.
  • Screening rates are insufficient in the US and even lower in other countries like India.
  • Access to care, including ophthalmologists and appropriate equipment, is a major barrier to effective screening.

Transcript


Welcome to JAMA+ AI Conversations. I'm Roy Perlis, Editor-in-Chief of JAMA+ AI, and I'm pleased to welcome today's guests, Dr. Arthur Brant and Sunny Virmani. Dr. Brant is Chief Resident in Ophthalmology at Stanford University.

And Sunny Virmani is Group Product Manager at Google, where he leads the Health AI product team. Today we'll be discussing their recent study published in JAMA Network Open that looked at real-world performance of an AI-based tool for detecting diabetic retinopathy and macular edema in a large-scale clinical setting. Guys, thanks for joining us today. Thanks for having us. Thank you. Arthur, we'll start with you.

I'm a psychiatrist, but I do still have my ophthalmoscope. I felt like I had to bring it just to show you, just for credibility. Of course, folks listening can't see it. You'll have to take my word for it. But before we get into the specifics of the study, can you give us a little bit of background? How significant a public health problem is this?

Yeah, absolutely. So Google's AI, called ARDA, has been screening patients in India for diabetic retinopathy for several years now. There are a number of FDA-approved devices, both within ophthalmology and in other fields of medicine, that also screen. And the question that we wanted to ask is: does the performance in the real world match, or come close to, the performance in the retrospective and prospective studies?

And the importance of that is you need to make sure that whatever safety benchmarks you held yourself accountable to in the work prior to approval continue to be met after approval, when you're in a much more diverse, multi-site environment.

So specifically, after screening 600,000 patients, we subsampled about 1% of the patients between 2019 and 2023, had those images graded by a human grader, and then compared the grade generated by the AI to the human grade to ensure that performance has not degraded in the real world, across 45 different sites with three different types of fundus cameras.
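As a rough illustration of the subsample comparison just described, here is a minimal sketch of computing how often human-graded severe-plus cases were flagged as referable by the AI. The grade labels, function name, and toy data are hypothetical, not the study's actual pipeline.

```python
# Hypothetical sketch of the subsample check described above: compare AI grades
# against human reference grades and ask whether every severe-plus case was
# referred. Labels and data are illustrative, not from the study.
SEVERE_PLUS = {"severe NPDR", "PDR"}           # sight-threatening categories
REFERABLE = {"moderate NPDR"} | SEVERE_PLUS    # referral threshold is broader

def severe_plus_referral_rate(ai_grades, human_grades):
    """Fraction of human-graded severe-plus eyes that the AI flagged as referable."""
    referred = missed = 0
    for ai, human in zip(ai_grades, human_grades):
        if human in SEVERE_PLUS:
            if ai in REFERABLE:
                referred += 1
            else:
                missed += 1
    total = referred + missed
    return referred / total if total else float("nan")

# Toy example with made-up grades:
ai = ["moderate NPDR", "PDR", "none", "severe NPDR"]
human = ["severe NPDR", "PDR", "none", "severe NPDR"]
print(severe_plus_referral_rate(ai, human))    # 1.0 in this toy sample
```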

I'm going to pause and zoom out even further. What's the public health context for this kind of screening in general? In other words, for listeners who aren't ophthalmologists, how important is this screening? How is it usually done prior to this technology? What was the standard of care?

Yeah, so the reality is it's not done nearly as frequently as it should be, both in the US and, even more so, in other countries. In the US, every patient with diabetes should be screened at least once a year, and we really only screen about half of the patients who meet screening criteria. In India, we estimate that there are now 100 million patients with diabetes, and almost certainly only a very small fraction of those undergo routine screening.

At our partner site Aravind, where the data for this particular study comes from, they've deployed what I think most people would consider to be the flagship eye care model throughout the state of Tamil Nadu. They not only have multiple major tertiary hospitals that can provide all subspecialties within ophthalmology, but they also have about 100 vision centers in the periphery so that patients can get routine follow-up

with an ophthalmic tech or an optometrist much closer to their home. So these cameras are predominantly placed in those vision screening centers where anytime a diabetic patient presents to one of them, in addition to getting any other eye checkup, they can also get fundus photographs. A smaller portion of the cameras are placed directly in the diabetes clinics where we recommend that every single patient with diabetes

that presents undergoes photography at least once a year, depending on the state of disease. If they have a little bit of disease, then they will come back more frequently. If they have a lot or more severe disease, they'll be referred to the local ophthalmologist.

So this sounds like kind of a best-case scenario for how you're screening. Outside of that region, what's the norm? If you don't have these cameras, if you don't have a network of optometrists, who's doing this work? How often does it happen? How well does it work?

I can talk a little bit about that. So what's happening in a lot of places is, for example, Arthur just talked about how 100 million patients in India actually have diabetes at this time.

We have some statistics that say around 50% of these patients don't even know that they have diabetes. So it's not just about getting screening done; it's also about knowing what you need in terms of care. And even if people are aware of what they need in terms of diabetes and diabetic retinopathy screening, they're not able to find access to care, especially in rural parts of India and Thailand and other places where we have actually done this work.

So then the question becomes: these patients can go to primary care clinics, but is a camera available there to actually get screening done? Are ophthalmologists available? So it's not just about the technology, but also about access to specialists like ophthalmologists who should be doing these kinds of screenings.

That is why we looked at how AI can actually help improve access to care in places where patients are not able to get to an ophthalmologist very easily. They would have to travel to tertiary eye hospitals, which could sometimes mean taking a whole day, leaving their villages, and going to the main cities. So just to sum it up, it is a huge problem, not just in terms of how and where the screening should be done, but also whether the patients actually have access to that kind of care.

I think in AI in general, we're still sort of at the gee-whiz stage, right? Isn't it cool that I can take an image and classify it? But what I liked about this paper is you're several steps beyond that, right? The technology is validated and deployed. I think this is the next wave of studies: once we deploy it, how does it do in the real world? So I appreciate that. I guess one question I had is,

I can understand how models move over time with things like risk prediction, right? Like the inputs change in a lot of ways. Why were you worried about drift with images? What did you think was likely to change over time with images? Like, why was this a concern?

Yeah, so there are a few factors where drift could happen. Generally what happens is you train your AI models on a certain set of images. You could have a very large quantity of images and they could all be from one particular race, ethnicity, geography.

When you actually go to test this algorithm in the real world, you're not going to just be deploying and testing it in that particular location. So the question is, how is the patient population drifting or changing over time as you test these algorithms out, as you deploy them clinically?

The other thing that could change is the cameras. The retinal cameras required in this case for diabetic retinopathy screening update, change, and improve over time. Different manufacturers come into the market and supply these cameras to the doctors who are using this technology. Another thing that could change is

the people who are actually taking the images, the technicians at these clinics. Their level of training on these cameras changes too, and new people, of course, take more time to learn to capture good images. We talk about ungradable images in our paper too, so that is exactly the kind of thing that could happen. This is why it's very important, before you actually deploy clinically, to be testing for all of these changes. Is your model,

which is trained on a certain set of images, generalizing to a different set of images, which you could call your test set? That's really the work we had done even before we asked for regulatory approval: do these models actually perform equally well, or even better, on data beyond what they were trained on? And of course, having a more diverse set of training data helps here, but the post-training testing is very, very important.
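A minimal sketch of the kind of post-training check being described, assuming a simple `model(image) -> bool` interface (True meaning referable) and two labeled datasets; the function names, tolerance, and data handling are illustrative assumptions, not the actual evaluation code.

```python
# Hypothetical drift check: evaluate the same model on the curated test set it
# was validated on and on an uncurated real-world sample, then flag large drops.
def sensitivity_specificity(preds, labels):
    """preds and labels are booleans: True = referable disease."""
    tp = sum(p and l for p, l in zip(preds, labels))
    tn = sum(not p and not l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    return tp / max(tp + fn, 1), tn / max(tn + fp, 1)

def drift_flag(model, curated, real_world, tolerance=0.02):
    """Return per-set metrics and whether real-world performance dropped too far."""
    metrics = {}
    for name, (images, labels) in (("curated", curated), ("real_world", real_world)):
        preds = [model(img) for img in images]
        metrics[name] = sensitivity_specificity(preds, labels)
    sens_drop = metrics["curated"][0] - metrics["real_world"][0]
    spec_drop = metrics["curated"][1] - metrics["real_world"][1]
    return metrics, (sens_drop > tolerance or spec_drop > tolerance)
```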

If I could just add two things to that as well, there are really two other components that we were worried about. One is, if you look at a lot of the other prospective and retrospective studies in the literature, they deliberately exclude photographs that have other pathology in them: if the patient has glaucoma, if they have a scar on the side, if they have a dense cataract, et cetera. They're highly curated sets of data that those papers are validated on. But in the real world, that's not the case. You're going to see everything with time, and you want to make sure that it holds up and the performance doesn't degrade on a non-curated data set.

I think just one additional piece of information is that these cameras, in my experience at Stanford, each have their own little artifacts that accrue with time. A little smudge on the lens, a little fleck of dust, and over time you'll actually know which camera photographed the patient based on a little artifact that you can see. They all have their own little signature.

I'm sure that's probably true in this deployment as well, and you want to make sure that, as wear and tear happens to the equipment, you keep patients safe. So let me ask you what might be a harder question. Whose responsibility is it to do these studies? In this case, it's sort of a joint academic-Google kind of initiative, I presume, but

As we see more and more of these, who is responsible, once the technology is deployed, for making sure that it doesn't drift? Is that a regulatory thing? Is it the company that develops it? Who should be in charge? So I'll give you an analogy here. In the past, before AI models, with products that are not AI models, it could be just a camera, let's say.

Usually what happens is manufacturers rely on the customers who are using these cameras to provide feedback that there is some issue going on with the device and it needs to be fixed, right? It is very reactive, and it could actually take a lot of time. The great thing about AI technology is that we can proactively think about how our models are going to be performing

in the wild. What that also gives us is the opportunity to make sure that everything we were hoping this algorithm or model would do actually checks out and stays that way. So now the question becomes: whose responsibility is it, right?

Just to be clear, when we were putting this algorithm out in the market, especially in a clinical setting like Aravind Eye Hospital, we decided to do this very proactively. We figured out how to get access to the data, this small sample of images that we were able to retrieve back from these clinics and have reread by an ophthalmologist.

Because our system was served on the cloud, all of this was possible, and we were able to test on an almost real-time basis whether our algorithms were doing well or not, whether they were meeting the benchmark that we had set. So if there was a problem,

then we would be able to figure it out very quickly. So it's less about responsibility and more about who it benefits. At the end of the day, we want to make sure that patients are getting the best possible care. That's what the doctors care about, and that's what the manufacturers care about too. So I think this is good for everyone, which is why we have been able to do this proactively, and it was always our plan.

Speaking of which, Arthur, I think I interrupted you before you ever got to the punchline: how did the model acquit itself? How did you do in terms of looking at drift? Yeah. So there are various ways you can look at it, depending on exactly what your endpoint is. We deliberately picked an endpoint called severe-plus: did the patient have either severe

non-proliferative diabetic retinopathy or proliferative diabetic retinopathy. These are the two categories where, if you miss it, the patient can go irreversibly blind. For those, 100.0% of patients were referred to clinic. Our threshold for referral is moderate or severe non-proliferative disease, PDR, or DME, and as long as you used that slightly broader category, 100% of the patients that I am most worried about were referred to clinic in our 1% subsample. So that was as reassuring as you can possibly get.

That does sound reassuring. What about false positives? A typical concern with this kind of technology is that you're going to increase referral rates, that you're going to be sending too many people for follow-up. How did it do there?

The total positive predictive value was about 50%, so you're going to send two patients for every one that actually has disease. But again, the alternative would be to have every single patient screened by an ophthalmologist. So overall, I think it's still a pretty big win.
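To make that referral arithmetic concrete, here is a tiny illustrative calculation; the counts are hypothetical, not study data.

```python
# Illustrative arithmetic only, not study data: with PPV around 0.5, roughly
# two patients are referred for every one who truly has referable disease.
ppv = 0.5                      # approximate positive predictive value discussed above
true_positives = 100           # hypothetical number of referred patients with disease
total_referrals = true_positives / ppv           # 200 referrals in total
false_positives = total_referrals - true_positives
print(total_referrals, false_positives)          # 200.0 100.0
```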

Got it. So if you take a step back, in doing this study, do you think you learned anything beyond the technology about how we should be thinking about other kinds of models like this in the field?

I think another key question is really where you set your set point on the ROC curve and how you balance sensitivity and specificity. This may vary depending on which region you're in and what positive predictive value and negative predictive value the specific country or region is looking for.

From a regulatory standpoint, whether you can have multiple set points for different countries, I think that's very much uncharted territory and something that will need to mature with time. But I think every environment may have a slightly different set point that is optimal for its specific situation.
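Here is a minimal sketch of how such a region-specific set point might be chosen from labeled validation data, assuming a target minimum sensitivity; the function name, the threshold policy, and the inputs are illustrative assumptions, not the deployed system's logic.

```python
# Hypothetical set-point selection: among thresholds that meet a minimum
# sensitivity target, pick the one with the best specificity. A region with more
# referral capacity might raise the sensitivity target; one with scarce
# ophthalmology capacity might accept a lower one.
import numpy as np

def pick_set_point(scores, labels, min_sensitivity=0.95):
    """scores: model outputs in [0, 1]; labels: 1 = referable disease, 0 = not."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    positives, negatives = (labels == 1).sum(), (labels == 0).sum()
    best = None
    for t in np.unique(scores):
        pred = scores >= t
        sens = (pred & (labels == 1)).sum() / max(positives, 1)
        spec = (~pred & (labels == 0)).sum() / max(negatives, 1)
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (float(t), float(sens), float(spec))
    return best   # (threshold, sensitivity, specificity), or None if unattainable
```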

Yeah, plus one to that. I would also like to add that, besides what is in this study and what has been published in the paper, which is really about the model and how it performs compared to the clinical work we had done in the past, I think there were some other important things that we learned. Speaking of uncharted territory: because these are rural areas, and these were vision centers where such screening was not being done before, some of these clinics weren't ready to figure out

how retinal screening can actually be done. Where should the camera be placed, how dark should the room be, and what kind of environment should be provided? That really matters for the quality of the images that come from these cameras. And do we actually have technicians who can be trained to use these cameras?

Do the patients understand how to position themselves in front of these cameras, how to sit there, how to put their chin on the chin rest? There were so many subtle things that we were learning as

Aravind Eye Hospital was deploying newer sites and different types of cameras at each of these sites. So what we realized is that our model is just one piece of it. It's the core piece, but the input, the output, and what is done with the output all matter a lot. So it's really important to actually

test these systems not just in isolation, but overall, from a healthcare perspective and from a workflow perspective, figuring out where the gaps and the bottlenecks are and how we can make sure this is a success for everyone, not just for the model. Thank you. I think that's probably a great place for us to wrap up. Arthur, Sunny, thanks again for talking to us about your study in JAMA Network Open. To our listeners, if you want to read more about this study, you can find a link to the article in the episode description.

To follow this and other JAMA Network podcasts, visit us online at jamanetworkaudio.com or search for JAMA Network wherever you get your podcasts. This episode was produced by Daniel Morrow at the JAMA Network. Thanks for joining us and we'll see you next time. This content is protected by copyright by the American Medical Association with all rights reserved, including those for text and data mining, AI training, and similar technologies.