Welcome to a new special deep dive of AI Unraveled, the show created and produced by Etienne Newman, senior engineer and passionate soccer dad from Canada. Great to be here. And if you're getting value from these deep dives, please take a moment to like and subscribe on Apple. It really helps us out. Absolutely. And be sure to check the show notes. We've got our referral links and a special discount code.
could be up to 20% off for Google Workspace. - Oh, nice. - Yeah, you can unlock the power of Google Gemini Advanced Pro, which is arguably the top AI model out there today. - It's certainly up there. - Plus you get all those fantastic benefits like Teams features, personalized emails, Notebook LM Plus, and well,
Well, much more. Definitely worth checking out. So today we're tackling a really critical question, I think. How do we make artificial intelligence in healthcare truly reliable, specifically for medical diagnoses? A huge topic. Yeah. And for you, the learner, this deep dive is going to explore a powerful solution or potential solution called conformal prediction, CP for short. Nice.
Forget just getting an AI's best guess. Imagine getting a set of likely diagnoses, but with a statistical guarantee of actually including the right one. That guarantee part is key. Exactly. Because in medicine, a wrong call can have just devastating consequences, right? Think about...
the millions of deaths globally just from delayed diagnoses of severe bacterial infections. It's a stark reminder of the stakes. So this shift towards reliable AI is just paramount. We're going to unpack exactly what CP is, how it works its magic, and why it's causing such excitement as a way to build safer, more useful medical AI. Yeah. And what's fascinating here, I think, is this fundamental rethink of how we even judge AI in medicine.
How so? Well, the initial gold rush, it was all about chasing the highest accuracy scores, wasn't it? Get that percentage up. Right. Ninety nine percent accuracy. Everyone's happy. Or so they thought. But that nagging question of reliability, you know, can we really trust these black box systems when it's life or death?
That's pushed the focus towards something, well, more profound, provable reliability. Provable reliability. OK. Yeah. It's not just about performing well on average anymore. It's about offering actual guarantees. Yeah. And this growing demand for trustworthy AI, we're seeing it echoed by regulatory bodies, too, like the FDA. The FDA. Right. They're increasingly scrutinizing the safety and, crucially, the effectiveness of these
AI powered medical devices. The bar is getting higher. Okay. Let's unpack that core problem a bit more, then. AI, I mean, it holds incredible promise, right? Sifting through mountains of complex medical data: images, EHRs, genomics, monitoring signals. The sheer volume is staggering. And the potential to detect these subtle patterns, things maybe invisible to the human eye. It's truly revolutionary. It really is. But here's the catch, right?
An AI prediction, no matter how sophisticated, it isn't infallible. And in that high stakes clinical environment, an unreliable AI, well, it can lead to serious errors, real consequences for patients. Precisely. And a key challenge here is the inherent uncertainty that's just baked into AI predictions. It's not simply that the AI sometimes makes mistakes. There's a
fundamental randomness in the data itself, we call that aleatoric uncertainty. Aleatoric. And then there are the limitations of our models, you know, and the data we use to train them. That's epistemic uncertainty, our lack of knowledge. So two kinds of uncertainty to deal with. Exactly. And the crucial step isn't to ignore this uncertainty, but to actively quantify it.
and communicate it clearly to clinicians. Right. Hiding it doesn't help anyone. Not at all. Otherwise, we risk that danger of overconfident predictions that turn out to be wrong, that erodes trust and, well, potentially causes harm. So the whole field of uncertainty quantification, UQ, is dedicated to tackling this, right?
We've seen different approaches, Bayesian methods, ensembles. Lots of different techniques. Bayesian methods try to capture sort of a range of possible model parameters. Ensembles train multiple models and see where they disagree. But today our focus is squarely on conformal prediction. What's the core idea? What makes CP stand out in this UQ landscape? Well, the really compelling core of conformal prediction, CP,
is its promise of a statistical guarantee. That's the differentiator. OK, a guarantee. Instead of just a single prediction, which could be brittle, CP delivers a prediction set for classification tasks. Like a list of possibilities. Exactly. Imagine a short, maybe even prioritized, list of likely diagnoses. Or for regression tasks,
like predicting a lab value, it gives you a prediction interval, a range. - Okay, a set or an interval? - And here's the crucial difference. This set or interval comes with a user-defined guarantee, say 95%, that will contain the true but currently unknown
Outcome. Wow. OK, so it's not just high accuracy. It's a guarantee about including the truth. That's the fundamental insight. Moving from just trusting high accuracy to having a statistical guarantee about the correctness of the suggestions. And you mentioned something really interesting earlier. This guarantee is distribution free and holds
even with finite samples. That sounds incredibly powerful, especially in medicine where data can be messy or limited. It is powerful. And that robustness hinges on a relatively mild assumption called data exchangeability. Exchangeability. Okay, what does that mean in practice? Think of it like this. If you shuffled the order of your data points past patient records and the new one you're looking at,
Would the underlying statistical patterns fundamentally change? In many idealized research data sets, the answer is no, they wouldn't change much. But in the real world, with evolving treatments, changing patient populations, maybe not always so simple. Right. Things change over time in a hospital. Exactly. Exchangeability basically means the joint probability distribution of the data sequence doesn't change if you permute the order.
It's weaker than assuming the data is independent and identically distributed IID, which is often a big oversimplification in healthcare. So it's a less strict assumption than IID, but still important. It's the key reason CP's guarantees are so generally applicable. And another huge advantage, CP is model agnostic. Yep. You can essentially wrap it around almost any existing machine learning model you've already trained.
a deep neural network, a random forest, gradient boosting, whatever. Without retraining the original model? Largely, yes. Especially with the common split CP method we'll get into. You don't need to fundamentally change or retrain that core prediction engine. It acts like a wrapper. Okay, so let's picture this. Instead of an AI looking at a scan and just saying malignant tumor, conformal prediction might give us a set like
Malignant tumor, benign cyst. Potentially, yes. Or maybe it just gives malignant tumor alone if it's very confident and the calibration supports that. But if there's ambiguity based on the data and the model's calibration, it might give you that set. It acknowledges the uncertainty. And does it include probabilities, or just the labels? It typically just provides the set of labels that satisfy the conformal criteria based on the calibration.
The underlying model might provide probabilities that feed into the calculation, but the output is the set with the coverage guarantee. Got it. And for regression predicting, say, drug response, instead of just 80% effective, we get a range like between 75% and 85% effective. Exactly. With that same statistical guarantee,
that the true effectiveness likely falls within that interval, say 95% of the time. So this shift from a single point to a set or interval is how CP tackles uncertainty head on. Precisely. In many medical situations, different conditions can look really similar. Overlapping symptoms, subtle imaging features,
A single definitive call is tough, even for experts. So providing a set of likely possibilities backed by that statistical guarantee can actually be much more useful clinically and hopefully safer.
OK, now let's dive into the nuts and bolts. How does CP actually build these sets with that promised coverage? You mentioned exchangeability and its potential pitfalls in medicine. Let's revisit that. Right. So exchangeability is the foundation. It assumes shuffling the data order doesn't change the underlying joint statistics. And the pitfall. Real world clinical data is rarely static, is it?
EHRs evolve, new treatments emerge, imaging protocols get updated, patient populations shift demographics. Distribution shift is a constant battle. Exactly. Temporal dependencies, shifts between calibration data and new data, systematic differences between patient groups.
These things could potentially violate strict exchangeability. And if exchangeability is violated, then the theoretical coverage guarantee might not hold exactly as promised in practice. It might degrade.
It's something you absolutely need to consider and monitor. That's a really important caveat. So while being distribution-free is great, we still need to be thoughtful about whether exchangeability is a reasonable assumption for our specific use case. Absolutely critical. Okay. Now, you also mentioned nonconformity scores. Sounds technical, but you said they're the engine. They really are the engine, yes. An NCS, or nonconformity score, is basically a way to measure how weird or atypical a specific data point, meaning an input paired with a
potential output looks compared to other data the system has seen. Higher score means more nonconforming, more unusual. For classification, a really common NCS is simply 1 minus the probability the base AI model assigns to a specific class, say pneumonia. Okay, so low probability means high nonconformity score. Exactly. If the model thinks pneumonia is unlikely for this patient's x-ray, that potential label gets a high score.
For regression, a typical score is just the absolute difference between the predicted value and the actual observed value, the residual. Bigger error, higher score. So the score reflects how much the AI disagrees with a potential answer. That's a good way to put it. How much this potential input-output pair deviates from the patterns the model learned. And the choice of this score...
Does it affect the reliability, that coverage guarantee? Ah, interestingly, no. The choice of NCS doesn't break the validity of the coverage guarantee. That's robust. But, and this is a big but, the choice of NCS dramatically affects the efficiency of the prediction sets. Efficiency meaning size. Exactly. How big or small the sets are. A well-chosen NCS, one that really captures the model's uncertainty accurately for different kinds of inputs,
will generally lead to smaller, tighter, more clinically useful prediction sets. And a bad NCS. Could lead to huge sets, like condition A, B, C, all the way to Z. Technically, it might still contain the true answer 95% of the time, but it's not very helpful. Right, not very informative. So a lot of research in CP actually goes into designing clever nonconformity scores tailored to specific conditions.
problems and models to get those useful small sets while keeping the guarantee. Okay, so we have these scores measuring atypicality. What's the next step? How do we use them to build the set with the guarantee? You mentioned a calibration data set. Yes, the calibration set is crucial. This is a separate chunk of labeled data distinct from the training data. And it has to be exchangeable with the test data. Critically, yes. It needs to be exchangeable with the new unseen test data you'll be making predictions on.
Then you calculate the nonconformity scores for every point in this calibration set using your chosen NCS. Okay, so you get a big list of scores from the calibration data. Exactly. Then, based on your desired confidence level, let's stick with 95%, so alpha is 0.05, you find the corresponding quantile of those calibration scores. The quantile, like the 95th percentile? Precisely, the 1 - alpha empirical quantile, with a small finite-sample adjustment. So you rank
all the scores from the calibration set and find the value below which roughly 95% of the scores fall. That value becomes your threshold. Let's call it q-hat, written q̂. Okay, so we calculate scores on calibration data, find the 95% cutoff point, q̂. Now what? Now, when a new test point comes in, a new patient scan, say, you consider every possible diagnosis or label y.
For each possible label, you calculate its nonconformity score, s(x_test, y). Using the same NCS function. Using the exact same NCS function. And then you include in your final prediction set all the labels y whose nonconformity score s(x_test, y) is less than or equal to that threshold q̂ you found earlier. Ah, I see. So if a potential diagnosis looks less weird than 95% of the calibration examples, it gets included in the set. That's the essence of it. Another way to frame it is using p-values.
For each possible label y, you can calculate a p-value, which is basically the proportion of calibration scores that were greater than or equal to the score for this test label. Okay. Then you simply include all labels whose p-value is greater than your significance level alpha, here 0.05. It's mathematically equivalent. And this whole procedure, calculating scores, finding the threshold from calibration data, comparing test scores, that gives us the marginal coverage guarantee. That's it. The guarantee that P(y_true ∈ C(x_test)) ≥ 1 - alpha.
For any new data point drawn from the same exchangeable distribution, the probability that its true label falls within the generated prediction set C(x_test) is at least 1 - alpha. So averaged over many predictions, our 95% set will contain the truth at least 95% of the time. That's the powerful theoretical result of CP. It holds regardless of the data distribution, the AI model, the NCS, as long as exchangeability holds. You mentioned marginal coverage. What does that mean exactly?
Marginal means it's an average guarantee across all possible test points. It doesn't guarantee 95% coverage for every single test point or subgroup, just on average overall. Okay, that's an important distinction. And in practice, because we calculate the threshold based on a finite calibration set, the actual coverage is often slightly higher than 1 - alpha. It tends to be a bit conservative, which isn't necessarily a bad thing in medicine. Better safe than sorry, maybe.
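For the learner who wants to see this written down, here is the standard split-CP threshold and coverage statement in the usual textbook notation; the upper bound on coverage additionally assumes no ties among the calibration scores.

```latex
% Split conformal prediction: threshold from n calibration scores s_1, ..., s_n
\hat{q} = \text{the } \tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}\text{ empirical quantile of } s_1,\dots,s_n,
\qquad
\mathcal{C}(x_{\mathrm{test}}) = \{\, y : s(x_{\mathrm{test}}, y) \le \hat{q} \,\}.

% Marginal coverage guarantee (upper bound assumes almost surely distinct scores)
1 - \alpha \;\le\; \Pr\!\big( y_{\mathrm{true}} \in \mathcal{C}(x_{\mathrm{test}}) \big) \;\le\; 1 - \alpha + \tfrac{1}{n+1}.
```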
Now, you also touched on two main ways to implement this: full CP and split CP.
Why is split CP the go-to method now? Right. Full or transductive CP is the original idea. For every single new test point and for every possible label it could have, you'd temporarily add that test point with that hypothetical label to your entire training data set. Wait, add it to the training data? Yep. Then you'd retrain your whole AI model from scratch with this slightly modified data set. Then you'd calculate all the nonconformity scores again and see if that hypothetical label makes the cut for the prediction set. You'd have to retrain the model...
potentially hundreds or thousands of times for just one prediction. Exactly. With modern deep learning models that take hours or days to train and maybe hundreds of possible diagnoses.
It's just computationally completely infeasible. Totally impractical. Okay, yeah, that sounds like a non-starter for real-world use. Absolutely, which is why split, or inductive, CP, ICP for short, was developed. It's much, much more efficient. How does ICP work differently? With ICP, you take your initial labeled data and you split it up front, typically into two sets, a proper training set and a calibration set.
Sometimes three if you need a separate validation set for model tuning. Okay, so dedicated sets. You train your AI model only once on the proper training set. You take that fixed trained model and use it to calculate the nonconformity scores just for the points in your separate calibration set. Right, find the threshold q̂ using only the calibration set. Exactly. You determine your threshold q̂ based on the calibration scores.
Now, when a new test point comes along, you use that same single already trained model to calculate the nonconformity scores for all possible labels for that test point. And compare them to the threshold q̂ you already calculated. Precisely. Compare the scores to q̂, build the prediction set, no retraining needed at prediction time. It's vastly more computationally efficient and makes CP practical for complex models and large data sets. That makes a huge difference. Train once, calibrate once, then predict efficiently. Mm-hmm.
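For readers following along at a keyboard, here is a minimal sketch of the split-CP recipe just described, for classification. It assumes a scikit-learn-style trained classifier with a predict_proba method and integer-encoded labels; the function name and interface are illustrative, not a particular library's API.

```python
import numpy as np

def split_conformal_classification(model, X_cal, y_cal, X_test, alpha=0.05):
    """Split (inductive) conformal prediction for classification.

    Assumes `model` is already trained and exposes a scikit-learn-style
    predict_proba(X) method, and that y_cal holds integer class indices.
    Nonconformity score: 1 minus the probability of the candidate class.
    """
    # 1. Nonconformity scores of the true labels on the calibration set.
    cal_probs = model.predict_proba(X_cal)
    cal_scores = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]

    # 2. Finite-sample-corrected (1 - alpha) quantile gives the threshold q_hat.
    n = len(cal_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(cal_scores, q_level, method="higher")

    # 3. Prediction set: every label whose score falls at or below q_hat.
    test_scores = 1.0 - model.predict_proba(X_test)
    return [np.where(row <= q_hat)[0] for row in test_scores]
```

Each returned array of class indices is one patient's conformal prediction set; a front end would map those indices back to diagnosis labels.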
Okay. But you mentioned a potential issue. Even with efficient ICP, the prediction sets can sometimes be impractically large, right? Especially for complex tasks. Yes. That's a real practical challenge. Having a guarantee is great, but if your prediction set for a chest X-ray lists 50 possible conditions...
That's not very helpful for a clinician trying to make a decision. Not really narrowing it down much. Exactly. And this is where techniques like test time augmentation or TTA come into the picture as a way to potentially improve the efficiency, the tightness of those sets. Test time augmentation. What's the basic idea there? TTA is a technique often used in computer vision to boost model performance at the time of prediction.
The core idea is you take your single test input, say a medical image, and you create multiple slightly modified versions of it. How modified? You might crop different sections, flip it horizontally, rotate it slightly, maybe adjust brightness or contrast, standard image augmentations. Okay. Then you run your already trained AI model on each of these augmented versions. So you get multiple predictions for that single original image. Like getting slightly different views. Exactly.
And then you aggregate these predictions often by averaging the predicted probabilities for each class to get a final, hopefully more robust and accurate prediction for the original image. So it's like combining multiple opinions from the same model on slightly varied inputs. Precisely.
It tends to smooth out the model's sensitivity to small variations, noise, or specific framing in the input, and often leads to better accuracy and better calibrated confidence scores from the base model itself. Okay, makes sense. So how does this help with conformal prediction sets being too large? Well, researchers, notably at MIT, had the insight. If TTA improves the quality and calibration of the underlying predictions,
Maybe you could also make the resulting conformal prediction sets smaller and more informative. Ah, improve the input to the CP process. Exactly. This led to what they call TTA-enhanced conformal prediction, or TTA-CP. Okay, so walk me through how TTA gets integrated into the CP workflow. Sure. In TTA-CP, the process looks something like this. First, you take your labeled data and split it. But now you might need three sets.
A main training set, optional if the model is pre-trained. A set to learn the best TTA strategy, let's call it D_TTA. And your usual conformal calibration set, D_cal. Okay, a dedicated set to figure out the best TTA approach. Right. Then for images in D_TTA and D_cal, and later for test images, you generate multiple augmented versions using a chosen set of augmentations. You run your base model on all these augmented images. Get lots of predictions. Yep. Now, using the D_TTA set, you learn an aggregation function.
How should you best combine the predictions from the augmented images? Maybe a simple average, maybe a weighted average. You learn what works best on D_TTA to maximize accuracy. So you learn the optimal way to combine the TTA results. Exactly. Once you have this learned TTA policy, which augmentations to use and how to aggregate, you apply it to your calibration set D_cal. You get aggregated predictions for all calibration points. Using the learned TTA policy. Correct.
Then you calculate your nonconformity scores based on these aggregated TTA predictions and find your conformal threshold q̂, just like in standard ICP. Based on the improved TTA predictions. Yes. Finally, when a new test image arrives, you apply the same learned TTA policy, create augmentations, run the model, aggregate predictions using your learned function.
Then you calculate the nonconformity score for each possible label based on this aggregated prediction and compare to your threshold q̂ to form the final prediction set. That's quite clever. You're essentially using some data to fine-tune the prediction process before applying conformal calibration. And crucially, you didn't need to retrain the original base AI model itself. Exactly. It acts like a sophisticated post-processing wrapper around the original model.
And a key detail for maintaining the guarantee is using disjoint data sets for learning the TTA policy, D_TTA, and for the conformal calibration, D_cal. Why is that separation so important? It ensures that the calibration data, D_cal, remains exchangeable with the test data conditioned on the now fixed TTA policy. If you use the same data for both, you introduce dependencies that violate the assumptions needed for the CP guarantee to hold rigorously. Got it. Careful data splitting is key.
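Here is a rough sketch of the general shape of TTA-enhanced conformal prediction with a simple mean aggregator. It is not the exact procedure from the MIT work discussed here, which learns the aggregation function on the separate D_TTA split; the augmentation list, helper names, and predict_proba interface are assumptions for illustration.

```python
import numpy as np

def tta_probs(model, images, augmentations):
    """Average predicted probabilities over augmented copies of each image.
    Simple mean aggregation; a learned, weighted aggregator would instead be
    fit on the separate D_TTA split."""
    return np.mean([model.predict_proba(aug(images)) for aug in augmentations], axis=0)

def tta_conformal_sets(model, augmentations, X_cal, y_cal, X_test, alpha=0.05):
    # Calibrate on TTA-aggregated probabilities, then proceed exactly as in split CP.
    cal_scores = 1.0 - tta_probs(model, X_cal, augmentations)[np.arange(len(y_cal)), y_cal]
    n = len(cal_scores)
    q_hat = np.quantile(cal_scores,
                        min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0),
                        method="higher")
    # Apply the same fixed TTA policy at test time, then threshold as usual.
    test_scores = 1.0 - tta_probs(model, X_test, augmentations)
    return [np.where(row <= q_hat)[0] for row in test_scores]

# Hypothetical augmentation policy: identity plus a horizontal flip,
# assuming images are arrays shaped (batch, height, width).
augmentations = [lambda x: x, lambda x: x[:, :, ::-1]]
```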
And what were the results? Did TTA-CP actually shrink the prediction sets? Yes, the results were quite significant. On standard benchmarks, including medical imaging data sets, they typically saw reductions in the average prediction set size in the range of 10% to 30%, sometimes even more, compared to standard CP applied to the same base model. That's a substantial improvement in efficiency. It really is. And critically, this was achieved while maintaining the theoretical marginal coverage guarantee.
The sets were smaller, more focused, but still contained the true diagnosis at least 1 - alpha of the time on average, 95% in this example. So more informative without sacrificing that core reliability. A win-win. Exactly. It's a very appealing outcome. Did they find anything else interesting, like how data allocation affects things? Yes, they found an interesting trade-off.
Even though they used some labeled data just for learning the TTA policy, meaning less data was available for the final calibration step compared to standard ICP, the improvement in prediction quality from using TTA often outweighed the effect of having slightly less calibration data. So investing some data in TTA paid off overall? It suggests so, yes.
Strategically using some data for these kinds of post-training refinement techniques can lead to better practical uncertainty quantification than just throwing all available labeled data at calibration alone. Interesting. Any impact on specific types of predictions? They also noted TTA seemed particularly helpful for classes the base model initially had low confidence in.
By aggregating views, TTA could sometimes boost the rank or score of the true class, even if it wasn't the top guess initially. Ah, so it can rescue predictions that might have been borderline. Potentially, yes, which in turn influences the nonconformity scores and can lead to smaller, more accurate sets. The bottom line is that improving the quality of the base predictions directly helps improve the efficiency of conformal prediction. Makes sense. Better input
leads to better output. And TTA-CP is attractive because it's relatively plug and play. You don't need a whole new model architecture or complex retraining schedules. You can often apply it to existing models to get more useful uncertainty guarantees. Okay, so TTA helps with the size of the sets, but we touched on the idea that the standard marginal coverage guarantee is an average. What if we need stronger assurances for specific groups or even individual patients? That average might hide problems, right? You're hitting on a really critical point, especially for medicine.
That marginal guarantee, P(y_true ∈ C(x)) ≥ 1 - alpha on average, is a great start, but it doesn't tell you anything about performance on, say, different demographic groups or patients with specific comorbidities or even individual hard cases. Yeah. An AI could be 95 percent reliable overall, but consistently fail for a specific subgroup. That's not acceptable. Absolutely not. It could be clinically dangerous and ethically problematic.
This need for more granular, more adaptive reliability guarantees is what drives the development of more advanced conformal methods. We want guarantees that hold under specific conditions. So the first step seems to be conditional conformal prediction, trying to get guarantees that are conditional on certain features or groups. Exactly. The goal is coverage guarantees valid within specific contexts. One of the earlier approaches here is group conditional CP, often called Mondrian CP. Mondrian CP. Yes.
The idea is you pre-define some distinct, non-overlapping groups in your data.
Maybe based on age brackets, for example under 18, 18 to 65, and over 65, or maybe different hospitals participating in a study or different scanners used. Okay. Partition the data. Right. Then you perform the standard conformal calibration process separately within each group. You calculate group-specific nonconformity scores and determine a separate threshold q̂ for each group. So you get different thresholds for different groups. Precisely. This gives you a stronger guarantee.
The probability that the true outcome is in the prediction set, given that the input belongs to a specific group G, is at least 1 - alpha:
P(Y ∈ C(X) | X ∈ group G) ≥ 1 - alpha. That's much stronger for those specific groups. It is. It provides exact conditional coverage for those predefined disjoint groups. But it has limitations. Such as? Well, what if your groups overlap? Or what if you want a guarantee conditional on a continuous feature, like a specific biomarker level, not just broad categories? Mondrian CP doesn't directly handle that.
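A minimal sketch of the Mondrian calibration step described above, assuming the per-point nonconformity scores have already been computed and each calibration point carries a group label; only the per-group thresholding is shown, and the names are illustrative.

```python
import numpy as np

def mondrian_thresholds(cal_scores, cal_groups, alpha=0.05):
    """Group-conditional (Mondrian) calibration: one threshold q_hat per
    predefined, disjoint group (e.g. age bracket, hospital, scanner).

    cal_scores : nonconformity scores on the calibration set
    cal_groups : the group label of each calibration point
    """
    thresholds = {}
    for g in np.unique(cal_groups):
        scores_g = cal_scores[cal_groups == g]
        n_g = len(scores_g)
        q_level = min(np.ceil((n_g + 1) * (1 - alpha)) / n_g, 1.0)
        thresholds[g] = np.quantile(scores_g, q_level, method="higher")
    return thresholds

# At test time, a patient from group g is compared against thresholds[g],
# which is what yields the per-group coverage guarantee discussed above.
```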
Okay, so it works for clear, separate groups, but not for more complex conditional needs. What else is there? Well, recognizing that getting exact conditional coverage for every possible specific input characteristic is basically theoretically impossible with finite data unless you make very strong assumptions. Right. That sounds hard.
Researchers develop methods for approximate conditional coverage. The target here is to get coverage that holds approximately conditionally, maybe not on the input features themselves, but on some statistics derived from the input or the model's output. Like what kind of statistics? For example, trying to ensure coverage conditional on the model's own confidence score. So you'd want 95% coverage both when the model is highly confident and when it's uncertain.
or conditional on some measure of how typical the input looks compared to the training data. So adapting the guarantee based on the model's own behavior or the input's characteristics
Kind of. Another related idea is label conditional CP, where you might calibrate differently depending on the predicted label. This has shown promise in areas like analyzing EHR text for disease surveillance, calibrating differently for common versus rare conditions. Interesting. Now, another angle seems to be adapting the size of the prediction set itself based on local characteristics, making it wider or narrower depending on how uncertain things look for the specific patient. That leads to
locally adaptive CP, and conformalized quantile regression, CQR. Exactly. The intuition is that uncertainty isn't uniform, right? Some patient cases are straightforward. Others are inherently ambiguous. Standard CP might give sets that are unnecessarily large for easy cases or too small for hard ones if it just applies one global threshold. So we want the set size to reflect the local difficulty. Precisely.
Locally adaptive CP methods try to achieve this. One common way is to modify the nonconformity score itself. Instead of just using the raw error or probability, you normalize it by an estimate of the local error scale or variability. How do you estimate local error scale? You might use a secondary model, perhaps trained on the calibration set, to predict the expected error magnitude based on the input features. So the nonconformity score becomes something like...
the prediction error divided by the estimated local error standard deviation, i.e. s(x, y) = |y - ŷ(x)| / σ̂(x). Ah, so a large error in a region where large errors are expected is less nonconforming than the same error in a region where errors are usually small. Exactly. This naturally leads to wider intervals or larger sets in high uncertainty regions and tighter ones where the model is typically accurate.
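A sketch of one common way to build these normalized scores, assuming a pre-trained scikit-learn-style regression model; the gradient-boosting error model and the data-split names are illustrative choices, not a specific published system.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def locally_adaptive_intervals(model, X_res, y_res, X_cal, y_cal, X_test, alpha=0.1):
    """Normalized-score conformal regression (one common construction).

    `model` is a pre-trained scikit-learn-style regressor. A secondary model
    learns to predict the size of its absolute error; nonconformity scores
    are |error| / predicted_error_scale, so intervals widen where errors
    tend to be large and tighten where the model is usually accurate.
    X_res / y_res should be held-out data, disjoint from the calibration set.
    """
    # Fit the error-scale model on held-out residuals.
    sigma_model = GradientBoostingRegressor().fit(
        X_res, np.abs(y_res - model.predict(X_res)))

    eps = 1e-8  # guard against a predicted scale of (near) zero
    sigma_cal = np.maximum(sigma_model.predict(X_cal), eps)
    cal_scores = np.abs(y_cal - model.predict(X_cal)) / sigma_cal

    n = len(cal_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(cal_scores, q_level, method="higher")

    # Adaptive interval: point prediction +/- q_hat times the local error scale.
    pred = model.predict(X_test)
    half_width = q_hat * np.maximum(sigma_model.predict(X_test), eps)
    return pred - half_width, pred + half_width
```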
Techniques like gradient boosting are often used to learn these adaptive score functions effectively. That sounds much more nuanced. What about CQR, conformalized quantile regression? How does that achieve adaptivity? CQR takes a different but related route. It starts by leveraging quantile regression models.
Unlike standard regression, which predicts the mean, quantile regression predicts specific quantiles, like the fifth percentile and the 95th percentile of the outcome, conditional on the input features. So it directly models the range. Yes. And quantile regression models are inherently adaptive.
The distance between the predicted lower and upper quantiles can naturally vary depending on the input features, capturing changing uncertainty, heteroscedasticity. Okay, so you train quantile regression models first. Right. They give you an initial prediction interval, say from the predicted 5th to the 95th quantile, but this initial interval doesn't have that guaranteed CP coverage yet. So how do you get the guarantee? You then use your calibration set.
For each calibration point, you measure how far the true value falls outside the initial quantile interval.
This gives you a set of errors or conformity scores. Based on the initial quantile prediction. Yes. Then you find the appropriate quantile of these conformity scores and use that to adjust or conformally calibrate the initial interval width. You basically add a buffer based on the calibration errors. Ah, so you use quantile regression for the adaptivity and then CP on the residuals to lock in the coverage guarantee. Exactly.
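A minimal CQR sketch, assuming two pre-trained quantile regressors (say, for the 5th and 95th percentiles) with scikit-learn-style predict methods; the names are illustrative.

```python
import numpy as np

def cqr_intervals(q_lo_model, q_hi_model, X_cal, y_cal, X_test, alpha=0.1):
    """Conformalized quantile regression (CQR).

    q_lo_model / q_hi_model are pre-trained quantile regressors, e.g.
    gradient boosting fit with the pinball loss at the 5th and 95th percentiles.
    """
    lo_cal = q_lo_model.predict(X_cal)
    hi_cal = q_hi_model.predict(X_cal)

    # Conformity score: how far the true value lands outside the initial
    # quantile interval (negative when it sits comfortably inside).
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)

    n = len(scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    buffer = np.quantile(scores, q_level, method="higher")

    # Widen (or shrink, if the buffer is negative) the interval by the buffer.
    return q_lo_model.predict(X_test) - buffer, q_hi_model.predict(X_test) + buffer
```

Because the quantile models already widen and narrow with the input, the conformal buffer mostly just corrects their calibration.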
CQR is popular because it inherits the nice adaptivity of quantile regression, often producing shorter, more informative intervals where appropriate, while still providing the rigorous finite sample coverage guarantee from CP. Very clever. Now, sometimes just controlling the overall coverage isn't enough. In medicine, certain types of errors are much worse than others, right? Like missing a cancer diagnosis. Absolutely. A false negative can be catastrophic, while a false positive might lead to more tests, but might be less harmful overall, depending on the context.
Standard CP controls the overall miscoverage rate, but doesn't distinguish between error types. So is there a way to use CP to control specific risks like the false negative rate? Yes. That's the domain of conformal risk control, or CRC. It extends the CP framework to allow control over other user-defined risk metrics beyond just miscoverage. Like false negative rate, FNR, or maybe false discovery rate.
Exactly. You can set a target, say, I want the FNR to be below 5%, and CRC provides methods to construct prediction sets or make decisions that provably meet this risk control objective, again, under exchangeability. How does it work conceptually? Often, it involves carefully defining the nonconformity scores and calibration procedure in a way that relates directly to the risk metric you want to control.
For instance, for FNR control and binary classification, the scores might be related to the model's predicted probability of the positive class, and the threshold is set to ensure the desired FNR bound. Okay. And you mentioned conformal risk adaptation, CRA. CRA is a more recent development, particularly aimed at tasks like medical image segmentation. It tries to achieve better conditional risk control.
So instead of just controlling the average FNR across all images, it aims for a more consistent FNR per image using adaptive prediction sets and specially designed score functions. It sort of bridges ideas from CRC and adaptive CP.
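Stepping back to the plain CRC recipe for a moment, here is a schematic sketch of false-negative-rate control in a multi-label setting, under the standard assumptions of a loss bounded by one that only shrinks as the prediction set grows. The probability-cutoff parametrization and the grid search are illustrative simplifications, not the CRA method just described.

```python
import numpy as np

def crc_fnr_cutoff(cal_probs, cal_labels, target_fnr=0.05):
    """Schematic conformal risk control for false-negative-rate control.

    cal_probs  : (n, k) predicted per-label probabilities (multi-label setting)
    cal_labels : (n, k) binary ground-truth label matrix
    A label enters the prediction set when its probability >= lam; lowering
    lam grows the sets, so the FNR only goes down as lam decreases.
    """
    n = len(cal_labels)

    def mean_fnr(lam):
        predicted = cal_probs >= lam
        missed = np.logical_and(cal_labels == 1, ~predicted).sum(axis=1)
        positives = np.maximum(cal_labels.sum(axis=1), 1)  # avoid divide-by-zero
        return np.mean(missed / positives)

    # Scan from the strictest cutoff down; keep the largest lam whose
    # finite-sample-corrected empirical risk meets the target (loss bound B = 1).
    for lam in np.linspace(1.0, 0.0, 501):
        if (n / (n + 1)) * mean_fnr(lam) + 1.0 / (n + 1) <= target_fnr:
            return lam
    return 0.0  # fall back to including every label
```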
It sounds like these advanced methods, conditional coverage, local adaptivity, risk control, offer much more tailored reliability guarantees. They do. They move beyond the basic marginal coverage to address more nuanced clinical needs. But I imagine they might be more complex to implement or require more data. That's generally the tradeoff, yes.
Stronger, more granular guarantees often come with more methodological complexity, potentially stronger assumptions or assumptions that are harder to verify, maybe larger data requirements for calibration,
or the need to train secondary models, like for estimating local error. So practitioners need to balance the desire for these sophisticated guarantees against the practical costs and complexities. Absolutely. It's about choosing the right tool for the job, considering the specific clinical question, the available data, and the computational resources.
Basic CP is often a great starting point, and these advanced methods offer powerful options when more specific control is needed. Okay, we've covered the theory and the advanced methods. Let's talk applications. Where is conformal prediction actually being used or explored in medical diagnosis right now? It seems incredibly versatile. It really is. That model-agnostic nature combined with the guarantees makes it applicable almost anywhere machine learning is used in medicine. Medical imaging seems like a prime candidate.
radiology, pathology. Definitely. We're seeing CP used to provide sets of differential diagnoses for things like chest x-rays, pneumonia, heart failure, and so on.
Or pathology slides. Instead of one answer, the clinician gets a statistically backed set of possibilities. Helping to manage uncertainty in interpretation. Exactly. And we mentioned conformal triage before, using CP to stratify scans, like head CTs after trauma, into low-risk (high negative predictive value, guaranteed), high-risk (high positive predictive value, guaranteed), and uncertain,
which needs expert review. This can really help optimize workflow. Streamlining the process based on reliable risk assessment. Right. And utility-directed CP aims to make sets even more useful by grouping diagnoses that lead to similar treatments. Beyond diagnosis, what about segmentation?
Outlining tumors. Huge area. CP can quantify pixel-level uncertainty in segmentation. Is a pixel definitely part of the tumor, probably part of it, or uncertain? Mondrian ICP has been used for prostate MRI segmentation, for example. By identifying uncertain boundary pixels, you can get more reliable volume measurements. Which is critical for tracking treatment response. Absolutely.
And CRA, conformal risk adaptation, is being applied to things like polyp segmentation in colonoscopy videos, aiming for consistent detection rates, controlling false negatives across different patients and conditions. Okay, moving from images to genomics, another complex area. Yes, CP is finding traction there too. Providing confidence sets for genomic variant calls distinguishing real mutations from sequencing noise.
predicting patient response to drugs based on their genetic profile, pharmacogenomics, but with a confidence interval around the prediction. Predicting immunotherapy response, antimicrobial resistance. Those too.
Anywhere you have complex biological data and predictive models, adding that layer of guaranteed uncertainty quantification is valuable. It helps manage expectations and guide decisions based on the level of certainty. What about clinical risk prediction using EHR data, sepsis, for example? Very active area. There are systems using Mondrian CP combined with models like gradient boosting trained on EHR data to predict sepsis mortality risk.
They provide the risk score and a confidence level, flagging uncertain cases. Helping clinicians focus attention. Exactly. Other work uses CP with deep learning for early sepsis detection in non-ICU patients, aiming to improve specificity by providing confidence scores.
Also predicting the likely site of a bacterial infection, airway, urine, blood, with calibrated confidence to guide initial antibiotic choice. Even disease surveillance using text from EHRs. Yes. Label conditional CP combined with active learning has been explored for monitoring disease outbreaks by analyzing clinical notes.
trying to reliably identify mentions of suspicious symptoms or conditions. It's amazing how broad the applications are. Drug discovery, too. Absolutely. Predicting molecular properties, screening drug candidates, assessing potential toxicity, predicting pharmacokinetics, all areas where ML is used, and where adding a reliable confidence measure via CP can make the predictions more trustworthy and useful for decision-making. Identifying drug targets, too. And even more niche areas. Traditional Chinese medicine. Mental health. We're seeing explorations everywhere.
differentiating syndromes in traditional Chinese medicine based on symptoms, predicting depression severity from facial expression videos, even making large language models provide more reliable answer sets for medical Q&A by having them output conformal sets of possible answers. Wow. So CP's flexibility really shines across different algorithms, traditional ML, deep learning, LLMs, and diverse data types, images, genomics, EHRs, text, chemical structures. That's the key takeaway. The versatility is remarkable.
But you also mentioned that the most successful applications often go beyond basic CP, right? They use tailored variants. Yes, that's an important point.
While basic split CP is a great starting point and provides value, achieving the best results in complex medical domains often requires leveraging those advanced methods we discussed: Mondrian CP for group fairness, CQR for adaptive intervals, CRC for risk control, conformal triage for workflow optimization. Adaptation is often key to unlocking the full practical potential. Okay, so we have these powerful tools, increasingly sophisticated ones.
But how do we actually get them used effectively in the clinic? What are the hurdles to translating a conformal prediction set, maybe eczema, psoriasis, into a real clinical action? That's the million-dollar question, isn't it? Bridging that gap between a statistically valid output and genuine clinical utility. Because getting a set, does it help narrow possibilities or does it just increase the cognitive load on the clinician? It's a valid concern.
How should a clinician interpret that set? Rule out things not in the set, focus on everything in the set, use the set size itself as a measure of uncertainty to guide further testing? Yeah. And what if the conditions in the set require very different treatments? A statistically valid set might not align perfectly with clinical decision pathways. This is where ideas like utility directed CP trying to build sets based on treatment implications become really relevant. And it's not just about the AI's output, it's the interaction with the human clinician, right? They have their own knowledge.
Absolutely. Human clinical reasoning is complex. Clinicians bring priors, experience, intuition. And how they integrate a CP set with their own thinking, research shows, can have nuanced effects, sometimes positive, sometimes potentially introducing new biases. So it's not a simple plug and play replacement for judgment. Definitely not. We need careful human centered design, thinking about visualization, explanation, and workflow integration.
What are some promising pathways for actually integrating CP into clinical workflows? Several models are emerging. One is using CP primarily for decision support and case flagging. Flagging. Yeah, flagging cases where the AI is highly uncertain. Maybe the prediction set is very large or contains very disparate diagnoses.
These flagged cases would then be prioritized for human expert review, like the sepsis risk systems we mentioned. Okay, using uncertainty to guide attention. Another is the triage system model, like conformal triage, stratifying patients into risk categories, high PPV, high NPV, uncertain, with guarantees, directly influencing workflow and resource allocation. More structured integration. A third way is using CP as a safety layer for existing AI tools.
Before deploying a new diagnostic AI, you could wrap it in CP to get uncertainty estimates and reliability guarantees, adding a layer of assurance. Like a quality control check. Kind of.
And underlying all this is the need for seamless integration with EHRs and imaging systems like PACS. The CP output needs to be readily available, easily visualized, and part of the normal workflow. Not some separate tool clinicians have to consciously open and consult. Makes sense. But what are the practical roadblocks to getting there? There are definitely challenges. Data is a big one. ICP needs that representative calibration set.
Ideally, it should reflect the local patient population where the tool is used. So local calibration might be necessary. Often, yes, to ensure the guarantees really hold in that specific environment.
That requires infrastructure for collecting and managing local data, which can be a hurdle. And the data splitting in ICP itself can be inefficient if labeled data is very scarce. Computational cost. Less so for basic ICP, but some advanced methods, like TTA-CP or those using complex secondary models for adaptivity, can add computational overhead.
We need solutions that are lean enough for busy clinical settings. And the human element, trust and training. Hugely important. Yeah. Clinicians need to understand what a CP set is and what the one-a-guarantee actually means and what it doesn't mean. Like it's not a guarantee for this specific patient.
Otherwise, there's a risk of misuse or falling back into automation bias, even with the uncertainty info. Good training and clear communication are vital. And choosing alpha, that 95% or 90% confidence level, who decides? That's another key decision. Choosing alpha involves a trade-off between the strength of the guarantee, lower alpha means a stronger guarantee, and the utility of the set. Lower alpha generally means larger, less informative sets.
This often requires careful thought and input from clinical domain experts to find the right balance for a specific application. So it really highlights that implementing this isn't just a tech problem. It's about human factors, workflow, training, trust. Absolutely. Human-centered design, intuitive ways to visualize uncertainty, robust training programs. These are just as important as the algorithms themselves. And that idea of local calibration, while a challenge, maybe it's also an opportunity.
To address distribution shifts and build trust locally. Exactly. Performing calibration using data specific to the hospital or clinic where the tool is deployed means the reliability guarantees are tailored to that context. That can significantly boost clinician trust and the real-world usefulness of the AI.
Okay, let's shift gears slightly to the ethical and regulatory side. Fairness is a huge concern with AI. How does CP interact with fairness considerations? It's a complex interaction. On the one hand, CP provides transparency about uncertainty, which could potentially highlight disparities.
but it doesn't automatically solve fairness issues. How so? Well, even if you achieve the overall marginal coverage guarantee, the prediction sets might be systematically larger or less accurate within the set for certain demographic groups compared to others. This is disparate impact. Average coverage doesn't guarantee fairness across groups. Right. And even if you use something like Mondrian CP to achieve equal coverage per group, that doesn't necessarily lead to fair outcomes.
What do you mean? Some research suggests that enforcing equal coverage per group might, counterintuitively, sometimes lead to worse decision making or exacerbate disparities when humans interact with those sets compared to just using standard marginal CP. It's complicated. So equal statistical guarantees don't automatically equal fair real world impact. Exactly.
Some researchers propose focusing on metrics like equalized set size across groups as potentially a better heuristic for fairness in practice, but it's an active area of research. We need end-to-end evaluation, looking at the whole human AI system. What about safety and trust? Does CP help mitigate things like automation bias? It has the potential to.
By explicitly flagging uncertainty, e.g. through larger sets or specific flags, CP can signal to clinicians when not to blindly trust the AI's top prediction. This can counteract automation bias. Making the uncertainty visible. Yes. That transparency about the limits of the AI's knowledge, backed by the statistical guarantee, is crucial for building justified trust and enhancing safety.
How do regulators like the FDA view these kinds of uncertainty quantification techniques? Are they factoring this into approvals for AI medical devices? The regulatory landscape, especially for AI and machine learning in medicine, what they often call SaMD, software as a medical device, is definitely evolving. It must be hard to regulate systems that can learn or adapt. It is. Traditional medical device regulation wasn't really built for algorithms that might change over time.
The FDA uses a risk-based classification system, but adaptive AI poses unique challenges. So what are they focusing on? There's a growing focus on robust performance assessment and, yes, uncertainty quantification. They're actively working on developing appropriate metrics, methodologies, and tools for evaluating AI safety and effectiveness.
including how uncertainty is handled and communicated. Are there mechanisms for approving AI that might be updated? Yes. They've introduced concepts like the Predetermined Change Control Plan, PCCP. This allows manufacturers to pre-specify certain types of modifications they plan to make to their AI algorithm after it's been approved without needing a completely new submission for every minor update.
As long as they follow the plan and monitor performance? Exactly. It requires transparency, robust monitoring, and adherence to the pre-agreed plan. They also emphasize good machine learning practice, GMLP, which covers things like data management, model training, ensuring interpretability where possible, and rigorous evaluation, all areas where techniques like CP can play a role.
So the FDA encourages reliable AI, but systems using CP would still need robust protocols for that calibration step for ongoing monitoring, especially if local calibration or updates are involved. Absolutely. Documenting the calibration process, ensuring the representativeness of the calibration data, monitoring for potential violations of exchangeability or performance drift over time. These would all be critical parts of the regulatory submission and post-market surveillance for a CP-based medical AI device. Okay, so wrapping things up.
Conformal prediction offers this really compelling path towards more reliable uncertainty quantification in medical AI. We've seen its model agnostic nature, those distribution-free guarantees, the idea of calibrated prediction sets. It's a powerful framework. And we explored advancements like TTA for making sets smaller, more efficient, and these advanced methods like Mondrian CP, CQR, CRC for getting more specific guarantees, conditional coverage,
adaptive intervals, controlling specific risks. Leading to applications across so many areas, imaging, genomics, EHRs, drug discovery. Right. But we also have to acknowledge the persistent challenges, getting truly robust conditional validity, improving efficiency further, making the outputs easily interpretable for
clinicians, ensuring fairness in practice, integrating smoothly into workflows, and navigating that evolving regulatory landscape. Definitely still work to be done on all those fronts. So looking ahead, what are the key research directions? Where do you see the field focusing next? I think we'll continue to see a big push towards better and more practical methods for conditional coverage guarantees that hold reliably for specific subgroups or input types.
Designing even smarter nonconformity scores tailored for specific medical data and tasks is also crucial for efficiency. Understanding the human factor better. Absolutely. Yeah. More research on human-AI interaction, how clinicians actually use these sets, how to visualize the information effectively, how it impacts decisions and potential biases. Addressing fairness not just statistically, but in terms of real-world outcomes.
is critical. Handling more complex data. Yes. Extending CP to handle multimodal data like images plus text plus labs and longitudinal data tracking patients over time more effectively. Developing standardized validation and monitoring protocols specifically for CP based systems will be vital for regulatory acceptance and clinical trust. And of course, ongoing work on computational efficiency for deployment. It's a fascinating and rapidly evolving field.
So here's a final thought for you, the learner. As AI becomes more and more deeply integrated into the very fabric of healthcare, how will sophisticated techniques like conformal prediction ultimately reshape the fundamental relationship between clinicians and technology? How do we move towards a future where AI assistance isn't just remarkably powerful, but also verifiably trustworthy and reliably safe for patients?
This deep dive into the principles and potential of conformal prediction offers us, I think, a fascinating glimpse into that future. A future where being well informed about medical AI increasingly means understanding not just what the AI predicts, but also precisely how confident we can reasonably be in those predictions. Knowing the limits. Exactly. Knowing the limits with a statistical backing.
So now, as you consider this whole landscape we've discussed, what specific areas within medical AI do you think stand to gain the most significant benefits from this kind of rigorous, reliable uncertainty quantification? Where could CP make the biggest difference?