Welcome to JAMA Plus AI Conversations. I'm Roy Perlis, Editor-in-Chief of JAMA Plus AI, and I'm pleased to welcome today's guests, Dr. Raj Manrai and Thomas Buckley from the Department of Biomedical Informatics at Harvard Medical School. Today we'll be discussing their recent study, Comparison of Frontier Open Source and Proprietary Large Language Models for Complex Diagnosis, published in JAMA Health Forum.
The paper looks at how open-source AI models compare to closed-source models, in this case GPT-4, in generating differential diagnoses for cases originally published in, well, another medical journal. Raj, Tom, thanks for joining us today. Yeah, thank you. It's great to be here. Thanks for inviting us. Absolutely. So let's start with the basics. Raj, can you tell us a little bit about the study? Sure. So, you know, as the title suggests,
we are really interested in this particular study in comparing how well these so-called frontier, or leading, open-source and proprietary large language models perform for complex diagnoses. And so until relatively recently, I think the common understanding has been that the proprietary models, models like ChatGPT from OpenAI, have been the dominant, the leading models for many applications.
And some of the early studies suggested that some of the open-source models didn't really perform quite as well. This has changed, I think, relatively recently, particularly with models produced by Meta and now a few others, the Llama series of models, for example, that have really gotten much better on tasks outside of medicine. And so we sought in this study to take these challenging cases from the Case Records of the Massachusetts General Hospital, also known as the CPCs, or the Clinicopathological Conferences, published by the New England Journal of Medicine, and evaluate just how well one of the larger, newer open-source models from Meta, the Llama 3.1 405-billion-parameter model, performed on these cases compared to GPT-4.
And so that was the goal. And we really set it up very similarly to a previous JAMA study, a very influential one that was published by our colleagues Adam Rodman and Zahir Kanjee back in JAMA in 2023.
So first of all, I want to thank you for calling it the Massachusetts General Hospital. It really, it warms my heart to hear that. Can you say a little more about, for folks who may not be so familiar with the difference between sort of a proprietary model and an open source model, like what's the difference? Why should the average person care about these models in terms of open source or not open source?
Yeah, so with these proprietary models, ones that you'll be very familiar with and most listeners will be familiar with, like ChatGPT from OpenAI, you have to use their interface. Sometimes there's another company, like Microsoft, for example, that serves the OpenAI models through their Azure platform. But you're essentially using a model that we don't have access to. Most researchers, most clinicians, patients, and doctors who are using these models don't have access to the weights; we're sending our queries away to another platform, where the model is then serving or providing a response. So this might be something simple like you log into the ChatGPT website, or you're using the API, but you're still sort of sending the query out, you're sending the data out to another platform and then getting the response.
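To make that "sending the query out" pattern concrete, here is a minimal sketch using the OpenAI Python client. The model name, prompt, and case text are illustrative placeholders rather than the study's actual configuration; the point is simply that the query text is transmitted to an external service and the response comes back over the network.

```python
# Minimal sketch of the proprietary, API-based pattern: the query is sent to an
# external platform, which runs the model and returns a response.
# Model name and case text are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

case_presentation = (
    "A 62-year-old presents with fever, weight loss, and a new heart murmur..."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Provide a ranked differential diagnosis."},
        {"role": "user", "content": case_presentation},  # this text leaves your institution
    ],
)
print(response.choices[0].message.content)
```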
Very, very different, and I think a very critical difference for healthcare applications, are these open-source models, where the weights are available. We can literally download these models, small versions to our laptops, I can have them running on my computer here, and larger versions we can run on our hospital's sort of secure compute, behind the hospital firewall, locally. And then we can have patient data, we can have queries running without leaving the hospital, on that sort of local compute, because the model is able to be retrieved and run locally. It can also be fine-tuned, it can be tailored, it can be changed locally. And I think this is just seen very differently by hospital IT, by leadership, by folks who are worried about egress and very important things like privacy of data, that the data doesn't have to leave, the queries don't have to leave the hospital environment. They can stay local and we can run those models locally.
You alluded to this, but I'm at a place where Wi-Fi doesn't even work consistently. Is it fair to say that a 400-billion-parameter model, the one that you used in this paper, is still kind of beyond what most people have the capacity to run locally? Is that fair?
And you're absolutely right. I think the setups within the hospital are very, very different. The setups across hospitals are very different. This is a huge, important question, and largely, I think, we're still at the beginning of understanding how to take these large open-source models and operationalize them at different hospitals.
So maybe Thomas can speak to this; I'd love his thoughts about how he actually thinks about this, but also some of the problems that I think he's encountered and challenges he's sort of overcome in navigating this across, not just the BI, where we do a lot of our work, but the different hospitals that we're working with. Yeah, yeah. I think that's a great summary. To me, this study is kind of like an existence proof that an open-source model can do such a challenging task, one we really didn't even consider was possible. The OpenAI models presumably cost millions and millions of dollars to train. It's been presumed that they're nearly a trillion parameters large, which is so large it's completely infeasible for most. But this model, you actually can run it at full precision with 12 A100s, which is getting into the realm of something that a hospital can deploy. So for example, at the BI, we have eight A100s, which is actually sufficient to run a model like this if you do tricks where you quantize the weights or you distill the model down. And then at the same time, because we know an open-source model can do such a challenging task, I think I'm really hopeful that smaller and smaller models will be able to do the same thing. We're seeing really impressive trends of models just becoming more efficient, smaller in size, but still performing at the same level on benchmarks. So I think it'll be a very short time before this is pretty feasible to run, maybe even on your personal computer.
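For contrast with the API pattern above, here is a minimal sketch of running an open-weight Llama model entirely on local hardware, using the Hugging Face transformers library with 4-bit quantization via bitsandbytes, one of the "tricks" mentioned for fitting a large model onto fewer GPUs. The model identifier, prompt, and generation settings are illustrative assumptions (the sketch uses a 70B variant; the paper evaluated the 405B model), and the Meta Llama repositories are gated, so license acceptance and an access token are required.

```python
# Minimal sketch: load an open-weight Llama model locally with 4-bit quantization,
# so inference runs on GPUs behind the hospital firewall and no data leaves.
# Model ID, prompt, and settings are illustrative, not the study's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # assumed repo name; the study used the 405B variant

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # shrink the memory footprint vs. bf16 weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across whatever local GPUs are available
)

messages = [
    {"role": "system", "content": "You are assisting with a differential diagnosis."},
    {"role": "user", "content": "62-year-old with fever, weight loss, and a new murmur. Differential?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```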
So it sounds like sort of a proof of concept. Since we've introduced that, what did you find? What were the sort of punchline results? Yes, we made the extremely surprising finding that the open-source Llama model performs on par with the proprietary GPT-4 model.
I mean, I was shocked by this. Like, GPT-4 has been just the be-all, end-all LLM for, you know, multiple years. This is the model that almost every open-source LLM has been trained against. Like, data sets are created with GPT-4, and open-source models are trained on them. It's really been just the benchmark that you need to pass. So I think this is truly an inflection point for open-source model development, that a model you can literally download to your computer performs on par with this model.
So this is the part of the show where we ask the guests a completely unfair question and ask them to speculate. I guess, you know, one of the things that journals like ours are struggling with is how do you decide when to publish a paper using a particular set of models, given how quickly the field is moving, right? So, you know, we accept a paper that uses a particular flavor of GPT. And by the time the paper comes out, we're two generations past that.
How did you think about that, or how do you think about that in your own work? And what would your advice be to journals and to readers? Like, should we pay attention to whether this version is better than that version?
Yeah, I think, Roy, it's a fantastic question. It is one that I think, as you're saying, is super relevant to editors and to journals that are considering what is transient, what is interesting, what is durable, and also to researchers in selecting problems and selecting projects to work on in this field.
This is also a conversation that we've had from the very beginning of starting NEJM AI. It's a very, very important question. What we've really wrestled with there, and I think this has even informed our approach to selecting problems in this project, this paper that we're talking about today, is trying to find something that's durable beyond just the comparison of two models that happen to be released at this point in time and maybe are 1%, 2% better on our benchmarks, 10% better on our benchmarks. We're really looking for something that opens up or unlocks new avenues of scientific exploration, new avenues of clinical deployment. In this instance, it's not really about the particular models themselves. I think this project was interesting for us,
and I suspect interesting for the editors, because it's unlocking this scientific question, and this opportunity now, around open-source models having caught up on this task, which even two years ago, I'd say, felt like science fiction, were it not for that JAMA paper by Kanjee and colleagues. That on this hard task, on these hard cases, this open-source model is able to perform on par with the until recently dominant GPT-4 model is saying, as Thomas said earlier, that this is an existence proof; it's suggesting that open source has really closed the gap in a very meaningful way for models that we considered very, very capable. And therefore, there's a lot of interesting work that we can do now with EHR records, now with data that's local to the hospital that really can't leave, now with environments where we can do real-time inference on the hospitals' clusters and serve up these second opinions, for example. There are lots of studies that we can start doing once we've established that open source is competent and capable, like the leading proprietary models. And that we saw as a little bit more scientifically and clinically durable. But I think your point is spot on. We have to avoid the sort of transient, one-off comparisons.
All right. So you answered my hard question. I guess I'll follow up with- Is that a harder question or easier question coming? We're going to move on to the next- Adaptive podcast, yeah. Exactly. Although we are going to have to bleep out the name of that other journal that you mentioned. So the paper itself was published in JAMA Health Forum. And I'm looking for kind of a policy angle here. If you're advising the CIO of your hospital,
What would you take away from this paper? I mean, should they be looking to use the cloud version of these kinds of models? Should they be buying the fancy hardware that doesn't have to be quite so fancy anymore? Like, what does this work mean for what hospitals and healthcare institutions should be investing in if they want to have the capacity to do this kind of work?
It kind of depends on the goals of the hospital. For example, we're researching a problem where we want to do error identification from EHR records, and it would just be way too cumbersome to de-identify all of these notes or pre-process them before we can run a model on them. So I think if your goal is to be able to just use the records that are kind of siloed at your hospital immediately, it makes a lot of sense to actually deploy one of these models locally.
I think at the same time, if you need kind of the best-performing model, you can use an API. So I think we'll see kind of a combined use of these two, based on your use case. And it's not something you directly address in the paper, but because the two of you are both experts in this, do most hospitals need the best-performing model?
Are the frontier models going to continue to be necessary, or are we now at the point where the things that are more amenable to running locally and are smaller and quicker probably can get much of the job done almost as well?
It's a great question. I think we are still at the very beginning of sort of rigorously mapping out what models can be used for what tasks. And so we wanted to almost set this project up with a hard diagnostic challenge, right? I mean, you could argue, and I think it's a limitation of our study, that these cases are not representative of clinical practice. I think there's a lot of work that the physicians are honestly doing in writing the presentation of the case. They're sorting through all that information, this sort of overwhelming, noisy environment that is medicine that a doctor would have to sort through, piecing together that presentation of the case. And that's what we're giving the models.
So I think you're hinting at something extremely important, which is that we need to study what models can be used for what tasks. And we need to get much closer to the specific tasks that are useful for physicians. I think we see this, again, and I'm going to come back to it because I really like what Thomas said, as an existence proof, right? This is set up in a way where these are complex, challenging cases. If the open-source models are capable of doing this, I think it's an interesting question: can we take the 70-billion-parameter model and try to do the same exact task? Can we do a related task? Can we use it for something that is more along the lines of information extraction, or conversing with the patient in a safe way? I think we are only starting to sort of explore these questions systematically. I completely share what I think is your suspicion too, Roy, which is that there are going to be many models available
that are less powerful than the frontier, that can be used for many important things that are faster, that are lower latency, that are able to be used in a way that doesn't annoy the doctor because it doesn't have to spend 10 seconds to think about it and doesn't scare the CIO because no data is leaving the hospital. And I think there's reason to believe that we are going to move quickly in exploring that kind of scientific frontier around how models can match up with tasks.
But I completely agree. It is something that we need to study. And I also share the suspicion that we don't need frontier models for everything. So many good points. I like that right at the beginning you pointed out that these vignettes are written in a very specific, careful way to convey particular information. And I know for the CPCs, for example, you know, all the necessary information is in there.
You've got to look for it, but it's in there, and it's laid out in such a way that it's not necessarily leading you to the answer, but if you're inclined to find it, it's there.
And my worry with a lot of the vignette studies, and I think right now we need a gold standard, it's a very reasonable thing to do, but my worry is that it's not really fair, because it's not really how the case plays out in the individual doc's experience. So I'm really glad you made that point. I want to switch gears. I guess one last question for both of you, which is,
Thinking about your paper, but also just thinking more generally about the work you do, let's say you stroll across the quad to the Brigham or to Beth Israel to go see your doc. And in the midst of your appointment, she's looking up your symptoms online; she's typing them into her laptop. Do you feel better or worse that she's using a tool like that? I think I personally would feel better. I feel like a chatbot in the hands of a trained physician, right now, is a pretty robust thing. I think they're already using a lot of these AI-powered tools. I think it makes sense to trust that they'll have good judgment about using these. I don't think they should use it for completely diagnosing me. But if they just have a follow-up question or they want clarification, I would rather have them use a lookup tool or use a search engine than just try to come up with it from memory, I think, is how I think about it. But yeah, there certainly are limitations. I wouldn't want to be completely diagnosed by the chatbot. I'm hoping that they're using their expertise.
That's fair. Raj, what about you? Yeah, I think it's going to be, I'm going to give you sort of a little bit of a wishy-washy answer, but I think it's very case-by-case. I think in general, it would make me feel better in the way that I think I would typically imagine one of these chatbots being used by a physician.
But if they're looking up something very basic that I would expect my doctor to know, that would also alarm me, right? So I think we're kind of taking a shortcut here: what are the types of doctors who would be using one of these chatbots, and then using it in the actual clinical care visit or the encounter with the patient?
And I think maybe there's some assumptions that are baked into that. But I do think that, you know, whether or not I'm comfortable, doctors are using this. I think they're using many different versions of these LLMs and other tools. And I think Thomas said something, which I think is very important. And I think this is a message that we need to get out there. There are still problems with all of these models.
And there are hallucinations. They make stuff up confidently. They have errors. And so I think we're still at the point where it's very critical to have human judgment and have sort of humans overseeing or making sure that the outputs from these models are not used in harmful ways.
But as a way of coming up with something that you might be missing, as a second opinion, as a way to help the patient or even the doctor understand care better or values better, I think there's a lot that these models can do and are already doing.
Raj, Thomas, many thanks for talking to us today about your study in JAMA Health Forum. To our listeners, if you want to read more about this study, you can find a link to the article in the episode description. To follow this and other JAMA Network podcasts, visit us online at jamanetworkaudio.com or search for JAMA Network wherever you get your podcasts. This episode was produced by Daniel Morrow at the JAMA Network. Thanks for joining us and we'll see you next time.
This content is protected by copyright by the American Medical Association with all rights reserved, including those for text and data mining, AI training, and similar technologies.