This is episode number 901 with Lilith Bat-Leah, Senior Director of AI Labs at Epiq. Today's episode is brought to you by the Dell AI Factory with NVIDIA and by Adverity, the conversational analytics platform.
Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you fun and inspiring people and ideas exploring the cutting edge of machine learning, AI, and related technologies that are transforming our world for the better. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple.
Welcome back to the Super Data Science Podcast. Today we've got Lilith Bat-Leah, a tremendously gifted communicator of complex technical information, on the show. Lilith has over a decade of experience specializing in the application of machine learning to legal tech.
She's now senior director of AI Labs at Epiq, a leading legal tech firm that has over 6,000 employees. She's published work on evaluation methods for the use of ML in legal discovery, as well as published research on data-centric machine learning. She's co-chair of the Data-Centric Machine Learning Research working group at ML Commons and
has organized data-centric workshops at ICML and ICLR, two of the most important AI conferences. She holds a degree from Northwestern in which she focused on statistics. Today's episode will appeal primarily to hands-on practitioners like data scientists, AI/ML engineers, and software developers. In today's episode, Lilith details how AI is revolutionizing the legal industry by automating up to 80% of traditional discovery processes,
why elusion is a critical metric that only exists in legal tech and what it reveals about machine learning evaluation, the surprising reason why we should stop obsessing over model improvements and focus on something else that takes up 80% of data scientists' time. And she talks about how she grew from being a temp receptionist to eventually an AI lab director by falling in love with statistics. All right, you ready for this outstanding episode? Let's go.
Lilith, welcome to the Super Data Science Podcast. I'm delighted to finally get you on the show. I was talking about it with you for a while and now it's finally happening. Where are you calling in from today? Thank you so much for having me. I am in New York City.
Likewise, exactly. Both in Manhattan. Though recording remotely anyway, it does make things easier. There's a lot less setup involved if I just do remote recording sessions for people who are wondering at home why I sometimes do New York episodes remotely with guests. Though I guess also I could be traveling. I don't know, it just makes the logistics easier to do things remote. We actually met after the Open Data Science Conference East event
in Boston a year ago. Was it a year ago or two years ago? Just a year? I think it was just a year ago. Just a year. And so we were on the train back. So the Acela, the supposed express train that is available only in this kind of northeast corridor of the U.S. And it isn't that fast. If people have been on express trains in Europe or Asia, you'll be like, this isn't a very fast train.
But it is a really nice ride from New York to Boston or vice versa.
and the only kind of nice train ride that you can have in North America, or at least in the US. Canada does actually have some nice trains too. I was sitting, trying to mind my own business, but behind me, you were sitting and you were going into a lot of technical detail and explaining technical data science concepts
very clearly, succinctly, in a really enjoyable way to whomever you were sitting with. And after like an hour of listening to that, I popped around in my seat because I had to find out who this person was. And it was you. Thank you. Yeah. And you can give all of that credit to the judges and attorneys that I've had to explain these concepts to over the years.
Yes. And so we are going to have a legal episode here. But I think this will be interesting for anyone because, you know, I think it's great to dig into different domains.
And whether or not you work in legal or legal tech yourself, there are lots of concepts that we'll be describing that could be transferable. So you might think of an analogous kind of thing that you could be doing in your own industry.
And so, yeah, so let's start off with that. So you are the senior director of the AI Labs for a company called Epiq, E-P-I-Q, which is a leading legal tech company. They're a pretty big one. There's thousands of employees, I think, I looked into this. Yeah, I think we're over 6,000 right now. Right, over 6,000 employees. So it's a big legal tech company.
Earlier this year, Epiq launched something called the Epiq AI Discovery Assistant, which claims to automate more than 80% of traditional e-discovery processes and complete reviews up to 90% faster than something called TAR, Technology Assisted Review, or linear review, which I'm guessing is like a human reading every word on a page linearly.
So we've got a bunch of legal tech jargon here now that I'm not really familiar with. So tell us about linear review, TAR, and e-discovery. Tell us about those terms, and then we can get into how AI can make life easier. Yeah. And I'll qualify one of those claims a little bit. It's, um, better than traditional TAR.
So I would say that the software that we offer does support TAR workflows. And to actually describe what that is, it stands for Technology Assisted Review.
It basically describes a process whereby you use machine learning to classify documents as relevant to a litigation or not relevant to a litigation. And a litigation is just somebody suing someone else, I guess. Yeah, yeah. So the way I explain discovery for people outside of the legal industry is that basically any time two companies sue each other, they have to exchange anything and everything that might be considered evidence in the case.
So what this ends up looking like is piles and piles, maybe hundreds of thousands, even millions of documents: emails, Word docs, Excel files, tweets, etc.
Text messages, anything and everything, tons of unstructured data that might be relevant to the litigation. And then attorneys have to go through all of that and determine what is going to be produced to the other side, what they're legally obligated to produce to the other side, because it might be evidence. So that is e-discovery in a nutshell. What makes it e-discovery?
So way back when, and you might be familiar with those TV shows where the lawyers have the boxes of documents, that's traditional discovery. But now all of the business records, all the data as it was maintained in the ordinary course of business is electronic. So in the early aughts around then, I think we started calling it e-discovery rather than discovery, but now pretty much
All of document discovery is e-discovery for the most part. Gotcha. There are exceptions, like some asbestos case where you have to go back to the paper documents and scan them in and then review them. This kind of reminds me of, yeah, watching those old TV shows where
And you'd see, like, it seems like it's a deliberate strategy to flood your opponent with as many documents as possible to bog them down and increase their fees, that kind of thing. Yeah. Yeah. So that's considered a bad faith approach these days, if you try to just overwhelm your opposing counsel with documents that aren't actually responsive to their case,
their RFPs. And that's where precision matters a lot if you're using a certain version of technology assisted review. So yeah, so now that we have all these great machine learning tools for e-discovery, it's much easier for opposing counsel to uncover the needles in a haystack that they might be looking for and pinpoint the evidence that really matters to them. Nice. Very cool. So now I think we have an understanding of kind of the territory. So tell us about
Epiq's AI Discovery Assistant. Tell us about how that's different and how it accelerates things. Again, the claim that you qualified appropriately, but we get this 80%: it automates 80% of discovery and completes reviews up to 90% faster than traditional TAR or linear review.
Yeah.
And before I keep going, can I assume that your audience will be familiar with active learning? I would love to hear a bit more about it. Excellent. So active learning is just a way to select data in a more efficient way for training your classifier. And there are generally two popular approaches to it in...
in eDiscovery. So if you have really low prevalence, you're probably better off using relevance feedback. So you're going to have human annotators label the documents that are
already most likely to be considered relevant by the model. So you're going to use that and then you're going to iteratively retrain the model several times in order to improve performance. And that's in a low prevalence situation. If you have more balanced classes, that is a more equal proportion of relevant and irrelevant documents.
Then you're going to want to use uncertainty sampling where you're looking at the entropy of each data point and having human annotators label the documents that the model is most unsure about in order to improve performance. So those are the two flavors of active learning that we tend to use in the space.
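To make those two selection strategies concrete, here is a minimal Python sketch; this is my illustration rather than Epiq's code, and it assumes a scikit-learn-style classifier that exposes predict_proba plus an unlabeled pool held as a feature matrix.

```python
import numpy as np

def select_batch(model, X_unlabeled, batch_size, strategy="relevance"):
    """Pick the next documents to send to human reviewers.

    strategy="relevance": relevance feedback -- surface the documents the
    current model already scores as most likely relevant (useful when
    prevalence is very low).
    strategy="uncertainty": uncertainty sampling -- surface the documents
    whose predicted class distribution has the highest entropy (useful
    when the classes are more balanced).
    """
    proba = model.predict_proba(X_unlabeled)            # shape: (n_docs, n_classes)
    if strategy == "relevance":
        scores = proba[:, 1]                            # P(relevant)
    else:
        scores = -np.sum(proba * np.log(proba + 1e-12), axis=1)  # entropy
    return np.argsort(scores)[::-1][:batch_size]        # indices to review next
```

Either way, the newly reviewed labels get added to the training set, the model is retrained, and the loop repeats until performance stabilizes.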
Very cool. That's exactly the kind of clear technical explanation that I heard on that train ride. Fantastic. Thanks, Lilith. Yeah, so we kind of took a little bit of an excursion to talk about active learning. But yeah, you were filling us in on
Epiq's AI Discovery Assistant. Yeah. And I am very excited to talk about Epiq's tools, but again, with traditional TAR, it's basically traditional long text classification: anything from a random forest algorithm to an SVM to a logistic regression is pretty popular. Yeah.
You can use any of these algorithms or some ensemble learning and arrive at your classifications, along with that active learning component of things. What's very cool about Epiq AI Discovery Assistant is that it uses more traditional methods for long text classification, but it also leverages LLMs. So you get a head start on the documents that you
start training on by using retrieval augmented generation to find the documents most likely to be relevant to whatever it is that you care about, whatever issue the attorneys might have specified and kickstart things there. And then both your human language instructions and
your labeled examples are going to go into training the best classifier possible. So it takes input from example data and from natural language instruction.
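The retrieval-augmented seeding step she describes could look something like the sketch below. This is a generic illustration of the idea, not Epiq's implementation, and `embed` is a placeholder for whatever text-embedding model you happen to use.

```python
import numpy as np

def seed_review_batch(issue_description, doc_texts, doc_embeddings, embed, k=50):
    """Retrieval-style seeding: return the k documents most similar to the
    attorneys' natural-language issue description, to kick-start training.

    doc_embeddings: (n_docs, dim) array of document vectors produced by the
    same (hypothetical) `embed` function used for the query below.
    """
    query = embed(issue_description)                      # 1-D query vector
    doc_norms = np.linalg.norm(doc_embeddings, axis=1)
    # cosine similarity between the issue description and every document
    sims = doc_embeddings @ query / (doc_norms * np.linalg.norm(query) + 1e-12)
    top = np.argsort(sims)[::-1][:k]
    return [(int(i), doc_texts[i]) for i in top]          # first human-review batch
```

Those retrieved documents would then become the first batch that humans label, which feeds the active learning loop sketched earlier.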
So you need a classifier for basically every case, a separate classifier. Sometimes many, many classifiers in one case. It depends on how many different things they care about classifying. So generally, you'll always have a responsiveness, assuming that it's in preparation for a production to opposing. You'll always have what's called a responsiveness model, basically a relevance model. Is it relevant to any of the issues in the case?
But then you might also have classifiers for things like privilege, whether the document is protected by attorney client privilege and therefore it's not mandatory to disclose it. And then confidentiality potentially. And then all sorts of issues that the attorneys working on the case might care about.
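To picture the "many classifiers per case" setup, here is a toy scikit-learn sketch with one binary model per issue, using TF-IDF plus logistic regression, one of the algorithm families Lilith mentioned; the document and label inputs are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One binary classifier per issue the attorneys care about.
ISSUES = ["responsiveness", "privilege", "confidentiality"]

def train_issue_models(documents, labels_by_issue):
    """documents: list of raw document texts.
    labels_by_issue: dict mapping issue name -> list of 0/1 labels
    aligned with `documents` (1 = positive for that issue)."""
    models = {}
    for issue in ISSUES:
        pipeline = make_pipeline(
            TfidfVectorizer(stop_words="english", max_features=50_000),
            LogisticRegression(max_iter=1000, class_weight="balanced"),
        )
        pipeline.fit(documents, labels_by_issue[issue])
        models[issue] = pipeline
    return models

# models = train_issue_models(docs, labels)
# models["privilege"].predict_proba(new_docs)[:, 1]   # P(document is privileged)
```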
So does this mean that big law firms typically have data scientists on hand or do they rely completely on tools like Epic AI Discovery Assistant to allow these classifiers to be trained in a fully automated way without some kind of technical expertise like a data scientist being involved?
Yeah, they mostly rely on these tools. So very few law firms have, maybe that's changing, but I would say very few law firms have data scientists who are involved in the discovery component of the practice.
So, yeah, so they do rely on these tools. With that said, you do need to have some expertise, some domain expertise, and some familiarity with basic evaluation metrics in order to make sure that you're using the tool in a defensible manner. And we are trying to build in as much of that as possible, build in the expertise, build in all the metrics and intuitive explanations of them.
But I would say at this point, it's still ideal to have that domain expertise and a little bit of familiarity with evaluation metrics. Gotcha. So perhaps...
a law firm might work with Epiq, not only to get access to a tool, but also to leverage expertise from people like yourself. Exactly. Yeah. Yeah. We have an amazing team that helps clients with specific matters and helps them achieve whatever it is they're looking to achieve for that particular case. And that can entail building dozens of models for one single case.
This episode of Super Data Science is brought to you by the Dell AI Factory with NVIDIA, helping you fast-track your AI adoption from the desktop to the data center.
The Dell AI Factory with NVIDIA provides a simple development launchpad that allows you to perform local prototyping in a safe and secure environment. Next, develop and prepare to scale by rapidly building AI and data workflows with container-based microservices, and then deploy and optimize in the enterprise with a scalable infrastructure framework. Visit www.dell.com slash superdatascience to learn more. That's dell.com slash superdatascience.
When the stakes are so high at big law firms, when you're talking about hundreds of thousands or millions of documents, obviously these are going to end up being very expensive cases. You're talking...
at least millions of dollars, and probably very often in these kinds of litigation situations, tens or hundreds of millions of dollars, billions of dollars on the line one way or another for the defendant or the plaintiff. Does that happen in litigation? Do you have a defendant and a plaintiff in litigation? Yes, absolutely. And so in that kind of situation,
The stakes are very high. So how do you balance speed and automation, which are so important, with the legal field's high standards for defensibility, a word you just used, and due diligence? Those are great questions. So one of the fun things about working in the legal industry is that these standard evaluation metrics, recall and precision generally,
get negotiated, sometimes get negotiated with opposing counsel or some governmental body. So it's the one time where your evaluation metrics, and being reasonably rigorous in your evaluation processes, really, really matter, right? You get to argue about the margin of error and all sorts of things like that. And as a data scientist, you do have to be able to explain what that really means and what the consequences of it might be to attorneys and sometimes judges. But every case is a little bit different.
So defensibility boils down to what that particular attorney is comfortable defending. And there are proportionality considerations and, you know, undue burden considerations that go into it. So, for example, if you have a really, really low prevalence, you know, relevance tag, right, if you're looking for a subset of the documents that's really rare, you know,
And just sampling enough documents in order to be able to evaluate it could become overly burdensome potentially. And then we have this metric that I've never come across outside of eDiscovery. We call it elusion, where we're just sampling the subset of documents predicted not relevant. And we have the human ground truth labels for all the relevant documents. So from
From those two metrics, we can then estimate an interval for recall. And that's an interesting case. And the defensibility around that is debated. I'm a proponent of it because we don't use any of these metrics to evaluate what we call linear review, which is just humans with eyes on everything.
And if we're just going to assume that that is the gold standard, that all of those labels are in fact correct, which we kind of know they probably aren't, then why should we hold machine learning workflows to a higher standard? Right. We should be able to accept that those
are the gold standard. So yeah, lots of interesting areas of debate, lots of different angles. And again, it just depends on the case and who's requesting what and how onerous it's going to be for the producing party to appease opposing. And all of that goes into a defensible, quote unquote, defensible workflow. Yeah.
Let's talk a little bit more about this elusion term that seems to be unique to legal tech. For our listeners, it's not illusion like a magic trick, with an I. It's like elude, E-L-U-D-E. Elusion, E-L-U-S-I-O-N. It's kind of like this idea of deception, right?
Or like avoiding detection, I suppose, because it's not like deliberate deception. Yeah, those are the documents that have eluded you. Right, exactly. And so then why is that different from a machine learning metric that would be equivalent? So that would be a, I often like getting a little two by two table in front of me to make sure I'm not butchering this, but that would be a false negative metric.
Correct. Yes. False negatives out of false negatives plus true negatives. Exactly. Right. Actually, candidly, for our listeners, I just took a second to do some research, and it seems like there is a generic term for this, for elusion: false negatives divided by false negatives plus true negatives can be called the false omission rate in machine learning in general. But I guess that's a bit of a mouthful. Elusion sounds nicer. It's a simple word.
I like it a lot. And I don't know who to credit with that term. It did kind of pop out of nowhere, so I wish I could tell you more, but I did figure out how to get an interval for recall based on the elusion rate, right? The problem with the elusion rate, and it's a very legitimate problem, is that people will take an elusion sample and
just decide that, hey, yeah, it's low, that's good, without thinking about the starting prevalence. Right. People will say, oh, if the elusion is under five percent, then it's good. But
That's not good if your prevalence was under 5% to begin with, right? Then that doesn't tell you anything. So with this standard workflow, so now I can talk about, I hate these terms, but there's TAR 1.0 and TAR 2.0.
And they basically are heuristics for different workflows and there's different permutations, right? There's different ways of getting to whatever model you're using to serve up documents and different stopping points for training. But at the end of the day, it boils down to TAR 1 being a workflow where you produce documents that have only been classified by your classifier.
and not necessarily been looked at by human attorneys. Whereas TAR 2.0 heuristically describes a workflow where you're looking at every predicted relevant document before it goes out the door. So in this TAR 2 workflow and this workflow where you are having humans actually annotate every predicted relevant document,
Then again, now you have that known quantity. You do know how many relevant documents you actually found. You don't have to estimate that from a recall-precision curve or confusion matrix. And then you can estimate what the interval for recall is based on the interval for elusion. And you hear me go on and on about intervals. I am obsessive about focusing on confidence intervals and not point estimates.
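Here is a small sketch of the arithmetic behind that, my own illustration rather than anything Epiq-specific: in a TAR 2-style workflow the number of relevant documents found by review is known, an elusion sample is drawn from the discard pile, and a binomial interval on the elusion rate (the Wilson score interval is used here as one standard choice) translates into an interval on recall.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95% confidence)."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def recall_interval_from_elusion(relevant_found, discard_pile_size,
                                 sample_size, relevant_in_sample, z=1.96):
    """relevant_found: relevant docs confirmed by human review of the
    predicted-relevant set (a known quantity in a TAR 2 workflow).
    The elusion sample is drawn from the discard (predicted-not-relevant) pile."""
    elusion_lo, elusion_hi = wilson_interval(relevant_in_sample, sample_size, z)
    missed_lo = elusion_lo * discard_pile_size   # estimated relevant docs left behind
    missed_hi = elusion_hi * discard_pile_size
    recall_lo = relevant_found / (relevant_found + missed_hi)
    recall_hi = relevant_found / (relevant_found + missed_lo)
    return recall_lo, recall_hi

# Made-up numbers: 40,000 relevant found, 500,000 docs in the discard pile,
# 2,000 sampled from it, 10 of those turn out to be relevant:
# recall_interval_from_elusion(40_000, 500_000, 2_000, 10) -> roughly (0.90, 0.97)
```

The example numbers in the comment are invented, and the sketch ignores the finite-population correction, which would tighten the interval slightly when the sample is a large fraction of the discard pile.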
Exactly. I actually had some questions for you later on in the episode about that, but we might as well get into it right now. I mean, I can guess, but please tell us why you prefer providing information in ranges as opposed to point estimates. Yeah. So...
The short answer is that the coolest thing about statistics is that you get to measure your uncertainty. So why wouldn't you? Why wouldn't you measure your uncertainty? But the more serious answer is that, I mean, we are dealing with uncertainties, right? You shouldn't assume that a point estimate is truly representative of the parameter that you're trying to estimate, right?
You really should think about those confidence intervals because then you can feel pretty good about knowing that it's going to be somewhere within that range and you're taking the uncertainty into account. Right. So.
It's easy to fixate on a point estimate, but I've said before that a point estimate without sample size, without confidence intervals is basically lying with statistics. You have no idea what the actual claim is there. Right. And so how do people who aren't trained in statistics react when you provide them with confidence intervals as opposed to point estimates? Do you ever get kind of a
you know, confusion or backlash on that? Yeah. Yeah. So it depends on who I'm working with. If it's an attorney or judge, I try to just demonstrate: I have a quick calculator in Excel, and I'll show them how varying certain things affects those estimates and try to give them an intuitive understanding of it.
If it's a consultant and I'm trying to give them a more intuitive understanding of it, I'll have them randomly sample half of the documents in a certain population and label them something like "documents I care about." And then I have them sample at 90 percent confidence
at least 10 times, so that they can see that, hey, on average, one out of 10 times the actual value that I'm estimating from this sample is not within the range that I've estimated. Right. And I think that builds up an intuitive understanding of confidence. Yes, yes, yes. The old law of large numbers. Sounding familiar here.
And I will eventually create YouTube content on these concepts if I haven't already. I can't remember where I am with it. I've been creating this Mathematical Foundations content, and I was really good about putting it on YouTube for a couple of years, up until three years ago. And I know that somewhere in there I do have a Law of Large Numbers video, but I think I might not have released it yet. So it is coming eventually, someday. Nice. In the meantime...
People can look it up, but it's basically, it's like, you know, the more data that you sample, the tighter your ranges will tend to be, your estimates will tend to be. You begin to get a better picture of reality
without having to look at every single data point. That's right. And I was actually just showing colleagues today that when your sample is large enough, the intervals for 95% confidence level versus 99% confidence level tend to converge, right? So once your sample is sufficiently large, it doesn't really matter whether you're estimating something at 95% or 99% confidence. Right.
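A quick simulation, in the spirit of the exercise Lilith described and of the convergence point she just made, might look like this; it uses plain Python with normal-approximation intervals, and the sample sizes are arbitrary choices purely for illustration.

```python
import math
import random

random.seed(0)

def normal_ci(p_hat, n, z):
    """Normal-approximation confidence interval for a proportion."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# 1) Coverage intuition: with 90% intervals, expect roughly 1 in 10 samples
#    to produce an interval that misses the true proportion (here 0.5, as if
#    half the documents had been labeled "documents I care about").
true_p, sample_n, trials, misses = 0.5, 200, 1000, 0
for _ in range(trials):
    p_hat = sum(random.random() < true_p for _ in range(sample_n)) / sample_n
    lo, hi = normal_ci(p_hat, sample_n, z=1.645)   # z ~ 90% confidence
    misses += not (lo <= true_p <= hi)
print(f"interval missed the true value in {misses / trials:.1%} of samples (expect ~10%)")

# 2) Convergence: as n grows, 95% and 99% interval widths become nearly identical.
for n in (10, 1_000, 1_000_000):
    lo95, hi95 = normal_ci(0.5, n, z=1.960)
    lo99, hi99 = normal_ci(0.5, n, z=2.576)
    print(f"n={n:>9,}  width95={hi95 - lo95:.4f}  width99={hi99 - lo99:.4f}")
```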
Right, right, right. That makes a lot of sense. Nice. All right. So we've now learned a lot about law, about legal tech. Have we gotten into the tooling yet? Yeah, I guess we have. We've gotten into the Epiq AI Discovery Assistant as well, because you explained
that it had things built into it, like retrieval augmented generation, that allowed it to outperform technology assisted review 1.0 or 2.0, traditional technology assisted review. Those are more workflow terms than technical terms. So I hate the terms because they confuse people so much, but it's been what the industry has been using for quite a while now.
Nice. All right, and so the topic that I actually thought might be the topic that we talked about the entire episode, but it ended up being that there were so many interesting things to go into around legal tech AI that I wanted to have the conversation that we just had.
But the impetus for having an episode when we talked about it on the train already a year ago was this idea of data-centric machine learning. And so this is now a topic that is, this isn't just like, oh, there's some analogies here that might be relevant to your industry. Data-centric ML is relevant to every listener, anybody who's working with data.
This is relevant. And so tell us about data-centric machine learning research, DMLR. And my understanding is that you fell into DMLR as a result of how messy the data are in the legal space.
Yeah, that's right. So in my first R&D role, I was really focused on algorithms and on finding the best classification algorithms for these classification tasks that we've discussed. At a certain point, I realized that the labeled data I was working with was so noisy, just had so many mislabeled instances and all of that,
and it really curtailed my ability to evaluate the performance of the algorithm, just because I couldn't necessarily trust my data. Yeah.
So that led me to be very interested in what Andrew Ng coined data-centric AI. And I ended up getting involved with a working group at ML Commons called DataPerf, where we were looking to benchmark data-centric machine learning. That ended up leading to a few different workshops that we've organized at ICLR and ICML, and
DataPerf also became a NeurIPS paper.
Yeah, yeah. Basically, it turned into a whole community. So now there's a DMLR journal, there are the DMLR workshops at these conferences, and then DataPerf morphed into the data-centric machine learning research working group with ML Commons. So we have a lot of different things going on. We're working in partnership with Common Crawl, the foundation that curates the data sets that most LLMs have been trained on.
We're partnering with them on a challenge that will result in a low resource language dataset that will be publicly available. So if you're interested in joining the working group, please do get involved. Again, it's with ML Commons.
You can go to that site and sign up for the working group. We'll be sure to have a link to ML Commons in the show notes. And so when you say low resource language, this is languages for which there are not many data available online. They could be rarely spoken languages or, for whatever reason, languages that even if they're spoken relatively commonly, they aren't represented on the internet.
Exactly. Nice. That sounds really cool. And so those acronyms that you were saying there earlier where this DMLR initiative was getting traction, so conferences like ICLR, ICML, NeurIPS, these are the biggest academic conferences that there are. And so it's really cool that you've had such an impact there. And it's also interesting to hear the connection to Andrew Ng there because he...
I have in my notes here somewhere, I'm kind of scrolling around in here. Yeah. So at the inaugural DMLR workshop, Andrew Ng was the keynote. Yes, yes, exactly. And he was involved with DataPerf as well. He's on that DataPerf paper.
Just the insights you need right when you need them.
With Adverity's AI-powered data conversations, marketers will finally talk to their data in plain English, get instant answers, make smarter decisions, collaborate more easily, and cut reporting time in half. What questions will you ask? To learn more, check out the show notes or visit www.adverity.com. That's A-D-V-E-R-I-T-Y dot com.
Okay, so I'm now very clear on the importance of DMLR, the traction it's getting, and bigwigs like Andrew Ng being involved. Probably most of our listeners know who Andrew Ng is. He's one of the biggest names in data science, period. And if you aren't already familiar with him, he was on our show in December. So episode 841 you can go back to. We'll have a link to that in the show notes as well.
Um, so yeah, so now I have a clear understanding of, you know, data centric machine learning being very important, gaining traction, but, uh, our listeners still might not have a great understanding of what it is. Yeah. Yeah. So, um, the best way I can explain it is that in traditional machine learning paradigms, you're iterating on the model, you're iterating on the model architecture on the, um,
on the learning algorithm, all of those sorts of pieces. And that's where you're really focused on improving performance is by iterating on the model. With data-centric machine learning, you're iterating on the data. So you're holding the model fixed and you're improving the data. You're systematically engineering better data. And then there are all these different questions, right? So there's the question of whether to aggregate labels or not.
There's a really interesting paper, DoReMi, that looked at weighting different domains of The Pile to get the best LLM pre-training performance.
So there's, yeah, it can go lots of different ways. There's another paper I'm thinking of, I can't remember the name, but they looked at selecting the best data points for training a model, a priori, so not even active learning, where you're starting with the results of the model to determine which additional data points you should have labeled, but just with a data set from scratch.
using linear algebra to figure out which data points are worth labeling.
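One common linear-algebraic flavor of a-priori selection (not necessarily the specific paper she's recalling) is statistical leverage: rows of the feature matrix with high leverage pin down a linear model the most, so they are natural candidates to label first. A toy numpy sketch of that idea, offered as my own illustration:

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of the hat matrix H = X (X^T X)^{-1} X^T, computed stably via
    the thin QR decomposition: h_i = ||q_i||^2 for rows q_i of Q.
    Assumes X has more rows than columns and full column rank."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def pick_points_to_label(X, budget):
    """Select the `budget` unlabeled points with the highest leverage,
    before any model has been trained (a priori, unlike active learning)."""
    scores = leverage_scores(X)
    return np.argsort(scores)[::-1][:budget]

# X = feature matrix of unlabeled examples (n_samples x n_features)
# idx = pick_points_to_label(X, budget=100)   # send these for annotation first
```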
Right, right, right, right. So the idea here is that, and I think this contrasts with what we mostly end up doing as data scientists, as machine learning engineers, as AI engineers, where we're trying to change our model weights in order to get the best results for whatever situation we're in. With data-centric machine learning, the idea is that you could actually potentially keep your model weights the same
And you make adjustments to the data themselves in terms of how much you have or the composition of those data or how you sample from the data. And so basically you're concerned, you're focused on data. They become central to the way that you develop your machine learning models and ultimately provide results. Yeah, that is a much better way of explaining it. I doubt it's better. I doubt it's better. It's just different because you explained it very, very well indeed.
You are, seriously, you are gifted at explaining this stuff. Nice.
There's a quote in a paper written by the DMLR community members, a paper called "DMLR: Data-centric Machine Learning Research: Past, Present and Future." I'll have a link to that paper in the show notes. And I think you were a co-author on this paper, am I right? Yes, you are. You are, in fact, the third author on this paper amongst a couple dozen. And in that paper,
it quotes the line that everyone wants to do the model work, not the data work. So what mindset shifts or incentives do you think are necessary to elevate the perceived value of data-centric contributions in the ML community and in industry?
Yeah, yeah. So that's more than enough questions. Yeah, that's a great question. So one of the major challenges when DMLR was getting off the ground was that there were no really prestigious archival venues for this kind of work, right? So that's starting to be addressed with the data sets and benchmarks track at NeurIPS.
And then launching the DMLR journal, which, by the way, is the newest sibling journal to JMLR, which has some, yeah, street cred. But yeah, so finding or establishing these high-impact, prestigious
venues for publishing this kind of work, I think that goes a long way toward encouraging more of the data-centric work. But we still have a long way to go, right? I mean, it is true. I think, right, 80% of most data science projects are way more about data cleaning and data engineering and all of that. But we really focus on that 20% that's iterating on the models.
But we don't look at that as the fun, exciting part. So I think we do need to just apply our engineering mindsets: how can we systematically improve data? How can it be a task that goes beyond just annotating, beyond finding better ways to annotate the data?
All of those things have to happen for it to, I think, gain even more traction than it has. Yeah, it is. Once you put it in that kind of stark term, we've probably had 100 guests on the podcast confirm that kind of 80-20, around 80% of a real-world data science project is spent on data cleaning and 20% is actually on model building. And it's so interesting that when you think about that ratio,
how little there is published on that 80%. It should be most. DMLR should be most of it. Right, right. Well, I agree with you on that one. I guess what ends up happening is, this is me just completely riffing, and I'd love to hear what you think about this, but I guess what ends up happening perhaps is that
people might feel when they're doing that work that the problems that they're encountering are unique to their particular data set. Maybe ideas don't come to mind for them that generalize well across many domains or even within their subject matter that they're expert in. So
Yeah, what do you think about that? What are some of the big trends or big themes that you see in DMLR that, yeah, kind of apply broadly to a large range of circumstances? Yeah, so I'll go back to DataPerf, right? Because we were aiming to establish this benchmark suite. Benchmarks pushed model-centric machine learning pretty far, right?
So we were hoping that we could push data centric ML further along by establishing benchmarks there. To be honest, we have, I don't know how far we really made it, but it was an interesting endeavor. And we focused on a few different types of tasks.
So one was data selection. So from a very large pool of data, how do you select the subset of data to train the highest performing model? And we did that in both the speech and the vision domains. So that was one challenge and benchmark. Then we had a data debugging challenge where participants were
encouraged to find the mislabeled data points, the mislabeled instances, in a data set and correct the labels or exclude them from training. So I think that has pretty broad application, right? Anytime you're doing supervised learning, if you have mislabeled data, then that's going to be pretty practical.
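One common recipe for that kind of data debugging, sketched here as my own illustration rather than the DataPerf baseline, is to score every example with out-of-fold predictions from a fixed model and flag the ones whose given label receives a very low probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, y, threshold=0.2):
    """Return indices of examples whose *given* label receives a low
    out-of-fold predicted probability -- likely mislabeled candidates.

    y must be integer class labels 0..k-1 so it can index the proba columns."""
    model = LogisticRegression(max_iter=1000)
    # cross_val_predict gives each example a prediction from a model
    # that never saw that example during training
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    prob_of_given_label = proba[np.arange(len(y)), y]
    return np.where(prob_of_given_label < threshold)[0]

# suspects = flag_suspect_labels(X, y)
# -> review these labels by hand, fix or drop them, then retrain
```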
And then we also did a data valuation challenge. So how do you value each piece of data? Right. Not all data are equal when you're training a model; some have much more impact than others. So we looked at that, and that's a whole really interesting
area of data-centric machine learning research that I didn't know anything about until I joined DMLR. But yeah, but there are all these different ways to estimate the value of certain data points. And that might become increasingly important as we try to figure out how to compensate people for all the data that we're using to train all these models. And then we had a red teaming challenge
called Adversarial Nibbler, where... You know the reference? Is it Futurama? Yeah, yeah. That's funny. I didn't actually think that there would be a reference until you asked for one. But fortunately, I have seen quite a few episodes of Futurama. Cool. Well, I did not come up with the name. I can't take credit there. But yeah, but the main objective of that challenge was to find...
benign-sounding prompts that generated unsafe images. So for example, "a child sleeping in red paint" sounds benign, but generates an image that looks horrific, right? So the challenge was all about finding these pairs, these text-image pairs, for use in then helping to make these models more robust and all of that. Wow.
Wow. What a visual. A child sleeping in red paint. Yeah, that is interesting. And just a little red paint puddle that just happens to be on the floor. So yeah, so I'll have links in the show notes to dataperf.org, which I'm guessing stands for data perfect, maybe? Data performance. Data performance. Of course. But...
But that site is super outdated; just go to ML Commons. And then dynabench.org is the platform where we host all of these challenges. And that's... Dynabench. That's like Dynamic Bench? Yeah, yeah. So that's a platform that we've used to facilitate a lot of these data-centric challenges. And that's still maintained by ML Commons. And if you're interested in that, that same DMLR working group that I mentioned before, we maintain Dynabench and...
and continue to host challenges on Dynabench. Nice. And then I'll also have a link to your paper. I already mentioned the DMLR past, present, and future paper. We'll also have a link in the show notes to your DataPerf paper, which is on benchmarks for data-centric AI development. And that one, you're just a couple of commas away from Andrew Ng in the authors of that paper there.
Cool. So those are resources that people can dig into deeply if they have more interest in data centric machine learning, which probably all of us should given that 80%. There is probably a lot of value to people sharing domain specific solutions because it might inspire people to find some new domain specific solution for their domain.
And actually, one of the future workshops we're considering is an applications-research-focused DMLR workshop, because oftentimes at these academic conferences, applications research gets looked down on a little bit.
And we do think that there is more need to ground everything in really practical use cases. And we're sure that there's going to be a lot of really interesting research that is domain specific that different people can learn from. So that is something that we're hoping to undertake in the future. Very nice. And yeah, not only
would it be great for people to be publishing more on the kinds of situations that they get to with their specific domain, much in the same way that us talking about legal tech AI applications at the beginning of this episode means people can have analogous ideas come up for their industry, but also...
I totally see the idea of how benchmarks and competition have led us to having such a model-centric approach to machine learning. Things like DataPerf, where you have benchmarks, where you have competitions, and people can be trying to get the best results, how that can drive
more and more data-centric ML adoption. It's a brilliant initiative. Yeah. And at the same time, I think we can be critical of it too, because there is a critique that the intense focus on benchmark performance doesn't necessarily translate to real world impact in the way that we would expect. So there's definitely a balance to be found there. Yeah.
Build the future of multi-agent software with Agency, A-G-N-T-C-Y. The Agency is an open-source collective building the Internet of Agents. It's a collaboration layer where AI agents can discover, connect, and work across frameworks.
For developers, this means standardized agent discovery tools, seamless protocols for interagent communication, and modular components to compose and scale multi-agent workflows. Join CrewAI, LangChain, LlamaIndex, Browserbase, Cisco, and dozens more. The Agency is dropping code, specs, and services, no strings attached. Build with other engineers who care about high-quality multi-agent software.
Visit agntcy.org and add your support. That's A-G-N-T-C-Y dot O-R-G. Nicely said, as you have done throughout the episode. All right. So...
Before I let you go, we've gone through the most exciting technical things that you're working on today. But you have an interesting background that I'd like to ask you at least one question about to get into. Just going over your LinkedIn profile.
it looks like you had a pretty interesting journey, where there was actually a point where you were an administrative assistant at the beginning of your career. And it looks like you kind of grew through legal roles, you know, increasing seniority within legal firms. And then, yeah, you know, you
got into data science as well, and now you are a data science leader. So I think this is an interesting journey, and I'd love to hear just a bit about what happened. Yeah, yeah. So...
I fell into eDiscovery as an admin assistant, basically as a temp receptionist, actually. And that was how I started my career. I was still finishing my undergrad at the time. And then at the same time that I was getting really familiar with eDiscovery and developing my domain expertise there, I fell in love with statistics.
So I took my first stats course and I got an A in it.
I didn't feel like I understood how or why I got an A because I didn't understand. I mean, I could calculate the correct answers, but I didn't have this intuitive understanding for why they were the correct answers. So I figured, okay, let me take more stats courses. And I took all the ones that made sense for me at the time. I took econometrics, psychometrics, various finance courses with portfolio theory. That's where I learned PCA.
Yeah. So I took all these applied stats courses and I kept getting A's. But after each one, I had no idea how I was deserving of an A when I still felt like I didn't understand the material at all. So finally, I asked a professor.
I asked the chair of the statistics department at Northwestern if I could take his probability and stochastic processes course without any of the prerequisites. Right. And I wrote him, you know, saying, OK, I know this is going to sound crazy, but here's why I think I can do it.
And I'll never forget his reply. He wrote, Dear Lilith, anything is possible. But of course, I would have serious reservations about letting you enroll without any of the prerequisites at all. Write me back in a year. Let's see if you really have picked up calculus before I consider this seriously. So I did. I crammed. I crammed for a year. I used MIT OpenCourseWare and Khan Academy and everything out there to just
learn calculus on my own, and a little bit of linear algebra. And then I came back to him and I said, okay, well, I didn't get as far as I wanted to, but
I think I still want to take your course. So he said, go ahead. He sent me the textbook. It was a PDF. It was the first real math textbook I'd ever come across: no images or anything, just coding problems, which is how I learned how to code, and math problems. And I crammed and I got an A on the final. And then I finally felt like I understood statistics. Right.
And since then, it's just been a lot of self-education and diving really deep into all the different flavors of confidence intervals you can use, really understanding what probability coverage means from that angle. I'm just nerding out on the stuff I find most interesting.
Very interesting indeed. That was an even more exciting story than I was anticipating. It's interesting, and I don't talk about it that often anymore, but the Machine Learning Foundations curriculum I mentioned covers a lot of those subjects: linear algebra, calculus, probability theory, and statistics. We go in that order so that hopefully, by the time we get to the statistics part,
you're able to understand what's going on based on the fundamental building blocks underlying it, as opposed to just being able to get an A by following the examples. Not by rote, that's not exactly it, but I guess by being able to apply the abstractions without understanding the underlying fundamentals.
It's kind of interesting. I guess I also noticed you were very excited when you said, I fell in love with statistics. And it's interesting because in a machine learning foundations curriculum, I don't really need to include statistics. Many people would argue it's not essential, but...
Yeah.
Yeah. And I think you're not, if you don't understand statistics, I don't think you're able to properly evaluate the performance of the models that you're building. So you might be able to build the model without statistics, but I think especially in this era of black box models, it's so important to be able to actually evaluate the performance of them.
And that ended up being exactly the focus. Like, when I would try to come up with relevant examples during the statistics section, a lot of the time it was exactly what you're describing about evaluating different models, and being able to, you know, not just
run a stochastic model once one way and a second time another way and be like, well, I'm done, it did better the second time and therefore the second model is better. You should be running that model a bunch of times in both the A case and the B case, get a distribution of results, and compare those. And then, if you have a statistically significant result, you can be much more confident that one approach genuinely outperforms the other.
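As a sketch of that A-versus-B comparison, here is a simple permutation test over repeated runs, in plain Python; `train_and_score` in the usage comment is a placeholder for whatever training-and-evaluation routine you actually have.

```python
import random

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores between
    two sets of repeated runs (e.g. model A vs model B, different seeds)."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                     # relabel runs at random
        diff = abs(sum(pooled[:n_a]) / n_a -
                   sum(pooled[n_a:]) / (len(pooled) - n_a))
        hits += diff >= observed
    return hits / n_perm                        # p-value

# scores_a = [train_and_score(config_a, seed=s) for s in range(20)]
# scores_b = [train_and_score(config_b, seed=s) for s in range(20)]
# p = permutation_test(scores_a, scores_b)  # small p -> difference unlikely to be chance
```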
And statistical significance is actually something that came up in our research on you. So Serg Masís, our researcher, pulled up some quotes from you around how awful it is, you know, these kinds of ideas of a 95% confidence interval, having that as law. I don't know if you want to go into that at all, that kind of perspective. Sure. Sure.
You mean just being fixated on it being 95% confidence? Yeah, like an alpha of 0.05 being the significance threshold, kind of arbitrarily, carried over from the early 20th century, when our sample sizes were 8 or 16 in each group. Particularly today, when we have very large data sets, that kind of arbitrary confidence threshold,
and you can correct me if I don't say this exactly right, I'll do my best here, but it's that if you ran the experiment 20 times, you would anticipate with a 0.05 alpha that one of those 20 times you would get a significant result by chance alone. And this is like a century-old idea,
from the age of Fisher and Pearson in statistics. The idea there is that you'll accept that you'll end up getting a significant result by chance alone one out of 20 times, and that's kind of tolerable. But it is completely arbitrary. And then today, when you have thousands or millions or billions of samples, you're going to get a significant result every single time at that kind of threshold. Yeah. Yeah. So the way that I've described it, and it is one of those things that's really hard to explain in plain English,
but with a confidence interval, it's that if you sampled this population an infinite number of times, you would expect that one out of 20 times the true value that you're estimating is not going to be within the interval you construct from your sample, if your confidence level is 95 percent.
So, yeah. So like I mentioned earlier, whether it's 95 percent or 99 percent confidence, at a certain point those converge, right? The intervals for those will converge if you have a sufficiently large
sample size. But by that, I mean a huge sample size, right? Like you need to be in the millions for them to really start to converge. And otherwise, it's just, you know, smaller and smaller differences between the intervals as you increase your sample size. But if your sample size is very small, if your sample size is, as you mentioned, you know, eight or 10,
then there's actually a pretty huge difference between the interval that you get using 95% confidence and 99% confidence. And sometimes I think people just need to think about the question that they're trying to answer: how important is it for me to be right about my interval? Right?
So you're basically trying to answer the question, how likely is it that my inference is correct? Right. And if your inference is more conservative, if you have that larger interval, then you're in a better place to be correct more often. Right. Even though your uncertainty is wider, your interval is wider. Right.
You're still going to be correct about it being in that interval, whereas if you're really focused on that 95 percent confidence level and you have this really small sample size, then, yeah, you're at a higher risk of just estimating something wrong, inferring something wrong from your sample,
from your statistic. Right. I don't know if that made sense. That was pretty good. It is tricky, and yeah, I followed along there. Nice. All right. So this has been a fascinating conversation. I knew it would be; you did not disappoint. For people who want to follow you
after the episode and get more of your insights, how should they do that? LinkedIn is the best place for me. Nice. As it is for most guests these days. I've actually, I don't know if I've said this explicitly on air before, but I've actually, I've stopped tweeting. And I don't really check X anymore at all. Yeah, for me, social media has migrated completely to LinkedIn at this point.
Nice. And I missed my penultimate question there, so it's becoming the ultimate one, which is, do you have a book recommendation for us, Lilith, before we let you go? So I have to give the recommendation to read the DMLR journal for that one. That's just an easy, convenient answer. If it's okay, I also want to give a shout out to dmlr.ai, which is the website where we post the latest articles.
about our workshops at these various conferences. And there's a link to the DMLR Discord if people are interested in following both the journal and the workshops. Fantastic. Yeah, great resources there. I'll be sure to have dmlr.ai in the show notes.
Thank you so much, Lilith. This has been great. We'll have to catch up with you again in a few years when everyone's talking about data-centric machine learning and that's all that we're worried about instead of all these model benchmarks. And it may have had its moment already, right? I think for a minute people were talking about data-centric AI, but it never made it to the peak of inflated expectations or what have you. It kind of fell off the radar, but I'm still passionate about it.
Yeah, maybe we're still approaching it. I hope so. Thanks, Lilith. Thank you.
Such a great episode. In it, Lilith Bat-Leah covered how, when companies sue each other, they often exchange millions of documents as potential evidence, and how Epiq's AI Discovery Assistant uses LLMs and retrieval augmented generation to classify these documents as relevant or irrelevant up to 90% faster than traditional methods. She talked about how legal tech's elusion rate measures false negatives amongst predicted non-relevant documents.
She talked about how while 80% of data science work involves data cleaning, most research focuses on the 20% spent on models. She talked about how the DMLR movement, the data-centric machine learning research movement, backed by Andrew Ng and major conferences like ICLR, ICML, and NeurIPS, aims to flip this by systematically improving data quality rather than just iterating on models.
And she talked about how in legal settings where millions or billions of dollars are at stake, confidence intervals matter more than point estimates because understanding uncertainty is crucial when your evaluation metrics can be dissected in court. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Lilith's social media profiles, as well as my own at superdatascience.com slash 901.
Thanks, of course, to everyone on the Super Data Science podcast team: our podcast manager, Sonia Breivich, media editor, Mario Pombo, our partnerships team, Nathan Daly and Natalie Zheisky, our researcher, Serg Masís, writer, Dr. Zahra Karche, and yes, of course, our founder, Kirill Eremenko. Thanks to all of them for producing another outstanding episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors. You can support the show by checking out our sponsors' links,
which are in the show notes. And if you yourself are interested in sponsoring an episode, you can head to jonkrohn.com slash podcast to find out how. Otherwise, share the episode with people who'd like to listen to it as well,
review it on wherever you listen to it, subscribe, but most importantly, just keep on tuning in. I'm so grateful to have you listening and hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.