I'm Yulin Xun, Associate Editor of JAMA and JAMA Plus AI, and you are listening to JAMA Plus AI Conversations.
In this episode, we delve into the power of machine learning in electronic health record (EHR) data analysis, explore the concept of living models, and unpack the potential of federated learning. My guest today is Dr. Sanjat Kanjilal, an assistant professor in the Department of Population Medicine at Harvard Medical School and the Harvard Pilgrim Health Care Institute, who analyzes real-world data using machine learning to improve the diagnosis and treatment of infectious diseases.
Welcome. Thank you very much. So you've been working in this area for some time. This might be new to everyone else, but can you tell us a little bit about how you started in this field using machine learning to improve, for instance, infectious disease diagnosis and management? It's actually kind of an interesting story. I started out in my infectious disease fellowship with an interest in studying antimicrobial resistance.
And one of the ways to study that is using electronic health record data. And so the first project I did in this space was to connect whole genome sequencing of a bacterium called Staph aureus, an organism that causes skin and soft tissue infections,
and merge that data with a large electronic health record database to get a sense of the kinds of patients who are having these infections and connecting that to genetic and genomic information. And while that work was interesting, it really opened my eyes to the possibility and potential of mining electronic health record data for insights into the diagnostics and treatment of infectious diseases.
And so I wanted to pursue that a little bit more and decided to cold call a colleague at MIT who has now been my longtime collaborator, Dr. David Sontag, who is the co-last author of the JAMA Network Open article,
and pitched him an idea to use machine learning on EHR data to try to improve prediction models, inform clinical decision-making, and even understand treatment outcomes. And that kind of kicked off a large set of projects that are now ongoing for me. What is it in the EHR data that is useful for better understanding these insights? And how do you use these machine learning tools to actually capture information and to better understand what is going on? I think the key to effectively using EHR data is to begin by having a solid understanding of the pathophysiology of the disease you're interested in.
And so for antimicrobial resistance, or predicting antimicrobial resistance, we know from decades of research that prior antibiotic exposure and prior infections impact the likelihood of you having a drug-resistant infection now. And once you have that solid understanding of the disease process, you map those features to things that exist in the EHR data and note those features that are not observed in the EHR data that you have to infer.
And if you have a good sort of overlap of those features, then you can consider using EHR data for modeling that disease process. But if there's not good overlap, for instance, if the disease you're studying is really happening out in the community and not in clinics or hospitals, you're not going to be able to use EHR data for that purpose. Fortunately, for the things that I study, almost all the factors that go into predicting antimicrobial resistance, or at least the ones we know of,
are actually pretty well captured in electronic health record data, making it possible to use analytic methods to try to both predict antibiotic resistance before it happens and even predict treatment outcome in response to antibiotic treatment. That being said, even though the data exists in the EHR, it is very, very messy.
and is riddled with biases, as well as what we call informative censoring, meaning that people are lost to follow-up in a non-random way. And so it must be approached very carefully and with a lot of caution and caveats. But once you do put those guardrails in place, there are actually a lot of interesting findings that you can come up with, such as the topic of discussion for this paper that we're publishing.
There is an overwhelming fear that AI will just continue to reproduce these biases if we don't have these kinds of checks and balances. So can you tell me how you feel we should be making these guardrails, or the ways that you make sure that you reduce those biases?
Yeah, it's a topic that's very important to me as well as to many others. I think there's no one-size-fits-all approach, but the way I've conceptualized it, and most of this was in the realm of generative AI and large language models, but it applies to the kind of models I'm working on as well. But really, there's a threefold approach that I take. The first is leadership from medical societies like the Infectious Diseases Society of America or the American Heart Association
to put out statements that really place the needs of providers and patients above the needs of deploying an AI model, meaning that we need to prioritize the patient-provider relationship. And that means putting guardrails in place to reduce the likelihood of AI bias, discrimination, and erroneous predictions. The second involves researchers like myself.
We need to be doing active research in this space to actually study biases, the ones that arise that are unexpected, the directionality of them, quantifying them, and then coming up with solutions in the data science field to mitigate those biases. And then lastly, the third area of work really should be focused on educating providers overall, the ones who are not working on this but are often the target of these models, so that they understand and are critical of them, but not overly critical.
And so it involves what we call in the field explainable AI, you know, things that allow AI models to explain how they reached their conclusions and being able to put labels of uncertainty around those things. Those are all areas of interest to me, as well as to many others in the field.
Tell me about your study: how you started it, its methods, and its research findings, in summary. We all know as infectious diseases physicians that there are a lot of factors that go into treatment response. And many of those are focused on the organism, the pathogen, and that can involve their virulence, antimicrobial resistance, their ability to form biofilms, many things. And a lot of those things are highly dynamic.
We may not be measuring them because our surveillance is focused mostly on drug resistance, but we do know from other studies that pathogen epidemiology is constantly changing in response to both intrinsic and extrinsic forces. At the same time, we know that there are many factors in the host that affect treatment response, not just comorbidities, but immune system and health-seeking behaviors, diagnostic testing patterns, all those things affect how well you respond to treatment.
And despite that, the national guidelines for the treatment of uncomplicated UTI published by the Infectious Diseases Society of America were last updated in 2011. And those guidelines themselves were based on studies from the late 90s and early 2000s.
Only a small number of those were randomized controlled trials. They weren't large trials in and of themselves. And many of them were in populations that differ greatly from what we see now. And so the motivation for this study is to reassess whether those guidelines, based on decades-old data, still apply to contemporary populations. And we took advantage...
of a very large claims database that was shared by a collaborator at Independence Blue Cross to try to answer this question using machine learning methods.
And so our data set consists of millions of claims from the Independence Blue Cross population, which is mostly in the Pennsylvania area, but also scattered across a few other areas across the country. And the really interesting thing about this data set is that it's formatted in the Observational Medical Outcomes Partnership, or OMOP, Common Data Model.
And so a common data model is simply a way to translate institution-specific data into a common syntax that then allows for universal algorithms to work on. And so we took that data set and we analyzed it in two ways.
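The translation step described here can be sketched in a few lines of Python. This is a minimal illustration of the common-data-model idea only, not the actual OMOP specification: the table layout, the local codes, and the concept ID 920293 are all made up for the example.

```python
# A minimal sketch of the common-data-model idea: institution-specific
# source codes are mapped to shared concept IDs, so one algorithm can
# run unchanged against data from any participating site.

# Each site's raw export uses its own local coding (hypothetical codes).
site_a_rows = [{"local_code": "NITRO_100MG", "patient": 1}]
site_b_rows = [{"local_code": "RX-00931", "patient": 2}]

# Site-specific vocabularies map local codes to one shared concept ID
# (920293 is a made-up stand-in for "nitrofurantoin exposure").
SITE_A_MAP = {"NITRO_100MG": 920293}
SITE_B_MAP = {"RX-00931": 920293}

def to_common_model(rows, code_map):
    """Translate institution-specific rows into the shared syntax."""
    return [
        {"concept_id": code_map[r["local_code"]], "patient": r["patient"]}
        for r in rows
    ]

# A single downstream algorithm now sees one vocabulary across sites.
harmonized = (to_common_model(site_a_rows, SITE_A_MAP)
              + to_common_model(site_b_rows, SITE_B_MAP))
```

Once every site speaks the same syntax, a "universal" analysis script needs no per-institution logic, which is what makes the turnkey analyses discussed later possible.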
The first way was to use domain experts like myself and Dr. Advani, who's on the authorship list, who are infectious disease experts, to pick and choose which features we think are most predictive of treatment response. And they're what you think they would be. Comorbidities, visits to the hospital, prior infections, et cetera. The other way is using a software package developed by my collaborator, David Sontag, at MIT to actually just
build features from that OMOP-formatted data set automatically and kind of run it brute force through a model and then see what rises to the top. And what is really interesting about the findings of our paper, there's a couple of things. The first is that it turns out that despite changes in the epidemiology, the long and short of it is that the IDSA guidelines for treatment of uncomplicated UTI are still recommending excellent antibiotics for treatment. In fact, first-line antibiotics may be associated with a slightly lower risk of adverse events and are definitely as efficacious as second-line treatment. So that's really good news for antimicrobial stewardship programs that want newer evidence in support of using guideline-concordant therapy.
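The "brute force" feature construction described above, building features automatically from coded data rather than by expert selection, can be pictured roughly as follows. The concept names and the simple count-based featurization are illustrative assumptions, not the actual software package from MIT.

```python
# A rough sketch of automated feature construction from a coded data
# set: every concept a patient has becomes a count feature, with no
# expert pre-selection, and the model's own weights then reveal which
# features "rise to the top."
from collections import Counter

def build_features(events, concept_vocab):
    """Turn a patient's raw coded events into a count vector,
    one column per concept in the vocabulary."""
    counts = Counter(e["concept"] for e in events)
    return [counts.get(c, 0) for c in concept_vocab]

# Hypothetical concept vocabulary and patient history.
vocab = ["prior_uti", "prior_nitrofurantoin", "diabetes", "hospital_visit"]
patient_events = [
    {"concept": "prior_uti"},
    {"concept": "prior_uti"},
    {"concept": "prior_nitrofurantoin"},
]
x = build_features(patient_events, vocab)
# x == [2, 1, 0, 0]: the model, not the expert, decides which counts matter.
```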
The second really interesting finding is that using this software package, we were able to recapitulate what we as domain experts picked as the most important features. And the output of that model was essentially identical to the output of the model that we designed ourselves by hand. And so that raises the possibility of doing these types of studies with observational data on a more turnkey basis.
where we don't need my support to pick the features. We only need my support to validate the results. And hopefully that will lower the barrier for these types of analyses. And frankly, that would allow us to constantly update the results as new data accrues, which I think is the core of any learning health system. With this hopeful new study and the potential findings that you might see, if all goes well,
then how would you like to see this implemented into practice, for instance? How do you see the future of it being deployed into practice and in policy? I think besides trying to understand how provider preferences change based on the clinical context of the patient, what I would actually really love to do with this information is to build a second model
that actually learns provider preferences and provides a personalized, provider-focused AI recommender. So it not only learns what is personalized to the patient, but it learns the preferences of the provider, such that it can optimally support decision-making and influence decisions most effectively. And doing that is actually not that difficult. The difficult part is actually the implementation within an electronic health record system.
Why would you say that it is difficult? What are the barriers that you see, and how can we try to make that implementation smoother? I will say that I have had difficulty developing decision support tools that capture what I am trying to do as a designer in existing EHR vendor software systems, let's just put it that way.
And I'll be very frank with you. I think to really realize the potential of AI, it's not going to happen with our existing EHRs. These are systems built on legacy data architectures. They're not designed for data analytics; they're designed for capturing billing. And their user interfaces are challenging to use.
And I think to really realize the potential of AI in healthcare, we need to build something new, truly disruptive, from the ground up. How do you think that this can shift? Because trust me when I say we all want this. We all want the system refurbished, not even refurbished, maybe blown up and recreated. But how do you think we can do it, or what needs to be done to do it?
Yeah, it's a challenging area. It's something that I'm very passionate about. I literally want to do this myself. I have an idea, but it takes more than an idea. And I think it will require a very, very long runway of funding to build a product that can do what we envision, what we hope it can do. And it will require a little bit of governmental support to open a pathway for introduction into healthcare systems. The way it is right now is once a healthcare system invests in a specific EHR, it's really hard to break them away from it because of the amount of investment needed. However, it is possible to start layering things on top of existing EHRs. In my opinion, that's not going to work.
I really think you do have to build from the ground up. And so how you break that knot is challenging. But again, I think with a strong funder, a lot of investment, and a really good team, you don't actually need that much to get an excellent product that will change minds. The example I use is what Mark Cuban did with Cost Plus Drugs: 20 employees, a long runway of funding, massive impact in three years. I think it is absolutely possible, but it does require some external support.
It's not a small venture. This is a very large venture. But what I'm talking about is something that I think is so revolutionary, it can quickly change the game in terms of how we approach medical care. Imagine an EHR that is making a diagnosis alongside you, surfacing the data that you need to make that diagnosis using the same information you do. It can learn this, right? That's the future to me.
decentralized care, direct interaction with patients where it knows what's going on with the patient, not sort of generic messages. That's all possible. And the billing side of it is actually, it's straightforward once you have the other things in place. If it has the ability to understand diagnoses the way a clinician does, it has the ability to do billing just as easily. So that's what I'm envisioning. It just needs, again, that investment to build a solid, viable product and then to showcase that.
I think we're all hoping for that. How do you feel that models today could make guidelines that last into the future when everything is changing so rapidly? Sexually transmitted infections, for instance: we know that resistance has been rising and guidelines are continually changing
to account for the new epidemiology that is arising in our population in a series of waves. And so you raise, again, a good question. How do you keep guidelines up to date, and can machine learning help with that? And I think the answer is absolutely it can, with some caveats. In the infectious disease community, there are already sets of living guidelines for treatments of various syndromes. And some of those are published in JAMA Network Open.
But they require a lot of manual curation of data and expert opinion on subject matter that is ambiguous, that is not easily deconvoluted with models.
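The living-model idea the conversation turns to here, a model that periodically refits on recent data rather than a guideline frozen at publication, can be sketched as follows. The record format, the one-year window, and the stand-in "model" (a simple resistance rate) are illustrative assumptions, not the actual analysis pipeline.

```python
# A toy sketch of a "living model": refit on a rolling window of the
# most recent data so recommendations track current epidemiology.
from datetime import date, timedelta

def rolling_window(records, today, days=365):
    """Keep only records from the year prior to `today`."""
    cutoff = today - timedelta(days=days)
    return [r for r in records if r["date"] >= cutoff]

def refit(records):
    """Stand-in for model fitting: here, just the resistance rate."""
    resistant = sum(r["resistant"] for r in records)
    return resistant / len(records)

# Hypothetical isolates; the oldest falls outside the window.
records = [
    {"date": date(2021, 6, 1), "resistant": 0},   # stale, dropped
    {"date": date(2023, 3, 1), "resistant": 1},
    {"date": date(2023, 9, 1), "resistant": 0},
]
recent = rolling_window(records, today=date(2023, 12, 1))
rate = refit(recent)   # reflects only the prior year: 1 of 2 resistant
```

On a schedule (say, quarterly) the window slides forward and the model is refit, so the recommendations always reflect roughly the year prior, which is the property described in the conversation below.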
Well, I think that is great. It is a lot of work. And so how can ML help? Well, if we had data infrastructure to continually accrue, clean, process, and create models from data streams that are already in existence, we can actually build a living model
that is periodically updating as new data accrues to provide hyper-localized, up-to-date recommendations that reflect exactly what is going on in the year prior. And there's no reason to think we can't do that, especially with the existence of common data models or using what's called a federated learning approach. Thank you so much for being here. I appreciate it. Thank you so much for having me.
I am Yulin Xun, Associate Editor at JAMA and JAMA+ AI, and I've been speaking with Dr. Sanjat Kanjilal about the strengths and challenges in integrating AI-driven tools in the healthcare setting. You can find a link to the article in this episode's description. And for more content like this, please visit our new JAMA+ AI channel at jamaai.org.
To follow this and other JAMA Network podcasts, please visit us online at jamanetworkaudio.com or search for JAMA Network wherever you get your podcasts. This episode was produced by Shelley Steffens at JAMA Network. Thanks for listening. This content is protected by copyright by the American Medical Association with all rights reserved, including those for text and data mining, AI training, and similar technologies.