Hello and welcome to the Legal Tech Startup Focus podcast. I'm your podcast host, Charlie Uniman. On this podcast, I'll be interviewing the people who build, invest in, comment on, and use the apps made by legal tech startups.
My guests and I will be discussing many different startup-related topics, covering, among other things, startup management and startup life, startup investing, pricing and revenue models, and the factors that affect how users decide to purchase legal tech. We're not going to focus on legal tech per se. Instead, we'll be focusing on the startups that develop, market, and sell that tech. So whether you're a startup founder or investor,
a lawyer or other legal professional, or a law professor, law student, or commentator who thinks about LegalTech startups, sit back, listen, and learn from my guests about just what it takes for LegalTech startups to succeed. And if you're interested in LegalTech startups and enjoyed this podcast, please become a member of the free LegalTech Startup Focus community by signing up at www.legaltechstartupfocus.com.
Hello, everyone, listeners to the LegalTech Startup Focus podcast. I'm really jazzed about having as our guest today Damien Riehl, who is VP, Solutions Champion at vLex, a legal tech company we're going to talk a little bit about.
And also a leader at SALI, and I thought I had memorized what S-A-L-I stands for, but I'll have Damien tell us. A most important contributor to an organization that contributes to making law work better, and especially to making some legal tech work better. So welcome, Damien.
Thank you so much for having me, Charlie. I'm really thrilled to be here. All right, I am thrilled to have you. A little bit of, well, tell us what SALI stands for, if I'm not putting you on the spot. Sure. It's Standards Advancement for the Legal Industry, S-A-L-I, otherwise known as SALI.
And, you know, I think they chose the acronym SALI, and they certainly chose it before I came. SALI's been around since 2017. They may have chosen the name SALI before they actually chose what SALI stands for. So most everyone knows it as SALI rather than the Standards Advancement for the Legal Industry. Yeah, I like the acronym. So, Damien, excuse me, a little tickle in my throat here. You have trained as a lawyer.
And now you're in legal tech at vLex. You're, as I mentioned, Solutions Champion. Tell us what vLex does broadly, and we'll riff a little bit more on what vLex's secret sauce does and perhaps more. So give us a little intro to vLex, if you would.
Sure. vLex is a legal data company. So we are legal solutions, not just legal data, where we have all the cases, all the statutes, all the regulations, judicial opinions, briefs, pleadings, motions in the United States. So we have over a billion legal documents, and not just in the United States, but worldwide. So we have those U.S. cases, statutes, regulations, briefs, pleadings, motions. We also have those for the United Kingdom,
so London. We also have those for continental Europe, so Spain, European Union writ large. We also have Latin America. We also have the Commonwealth countries that are Australia, New Zealand, et cetera. So we have over a billion documents in over 100 countries worldwide.
And my job is to help the team at vLex do data science on those billion legal documents. So you can imagine being able to tag up everything that matters: breach of contract, motion to dismiss in whatever jurisdiction you happen to be in, or merger agreement, force majeure clause. Each of the things I just mentioned is a SALI tag. And you can imagine taking those SALI tags, about 18,000 of them,
and tagging up every single legal document in a billion documents for over 100 countries worldwide. And then once you have that well-structured, well-tagged data, you run large language models across it to answer legal questions, not just in the United States, not just a 50-state survey, but eventually a 50-country survey,
to be able to ask a legal question and then get a legal answer based on ground truth. That is, ground truth cases, ground truth statutes, ground truth regulations.
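To make the tagging idea concrete, here is a toy sketch, not the real SALI code: a few invented tag names (the actual SALI standard publishes roughly 18,000 hierarchical identifiers) matched against document text by trigger phrase, after which retrieval becomes a simple filter.

```python
# A toy illustration of taxonomy tagging. The tag names and trigger
# phrases below are illustrative stand-ins, not real SALI identifiers.
from dataclasses import dataclass, field

TAGS = {
    "BreachOfContract": ["breach of contract"],
    "MotionToDismiss": ["motion to dismiss"],
    "ForceMajeureClause": ["force majeure"],
    "MergerAgreement": ["merger agreement"],
}

@dataclass
class Document:
    doc_id: str
    text: str
    tags: set = field(default_factory=set)

def tag_document(doc: Document) -> Document:
    """Attach every tag whose trigger phrase appears in the text."""
    lower = doc.text.lower()
    for tag, phrases in TAGS.items():
        if any(p in lower for p in phrases):
            doc.tags.add(tag)
    return doc

def find_by_tag(docs, tag):
    """Once the corpus is tagged, retrieval is a simple filter."""
    return [d for d in docs if tag in d.tags]

docs = [
    tag_document(Document("us-001", "Plaintiff alleges breach of contract ...")),
    tag_document(Document("uk-002", "The merger agreement has a force majeure clause ...")),
]
print([d.doc_id for d in find_by_tag(docs, "ForceMajeureClause")])  # ['uk-002']
```

The point of the sketch is only the two-step shape: tag once, then query the structured tags rather than raw text.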
Really, I would say that the vLex dataset, and if you know Fastcase, Fastcase merged with vLex back in 2023, about a year ago. And so, really, I would say that we probably have the richest, broadest, and deepest dataset in the world of those cases, statutes, regulations, motions, briefs, pleadings. And so, it's really a great playground to be a part of.
Indeed, that's quite a repository of data. We're going to touch on the old saw, now old in the world of data science, that data is the new oil and what that really means. And then, as you just said, Damian, having data is one thing. Having data that is usable...
is another thing. And one aspect of usability is structuring it a bit, and where I guess SALI can come in, both for vLex and others, is to help give some structure to the data, tagging being principally, as I understand it, what SALI affords. SALI is what? A not-for-profit, am I correct on that?
That's right. SALI's a nonprofit. I'm a volunteer for that nonprofit. And everything that we do is free and open source. It's free as in speech, in that you can extend it and use it however you want. Also free as in beer, like you pay $0 for it. You just go to GitHub, download all the things. So we have 18,000 tags. That is a very well-structured way of how the legal world works.
So we really have created a taxonomy, an ontology, of not only every legal concept, but also every business concept. So flat fee is something in SALI. Also hourly fee is something in SALI. Collared rate, you know, capped rate. Sure. So all of these things, every single thing that matters to, number one, the substantive law, or, number two, the business of law, we are counting all of those things. And we're getting donations from Thomson Reuters,
from LexisNexis, from iManage, from NetDocuments, from all the biggest law firms in the world. I'm just going through over 1,500 donations from Mayer Brown. Everything that they're counting taxonomically, I'm now integrating into SALI, doing the same thing for K&L Gates, the same thing for Allen & Overy, and a whole bunch of others. So we just take donations from everyone and we normalize them. So everything that TR and iManage and NetDocuments have donated. Kira and Zuva, that is Noah Waisberg, he has donated his data set. We just have hundreds of donations right now from all the biggest...
and the smartest people in the world. And SALI is a taxonomy that is normalizing all of those data points. Well, you know, you rise to the level of your goals, but I dare say you fall to the level of your taxonomy. So at least as far as being able to automate things goes, if you don't have the taxonomy and ontology straight, you're going to be limping along. Mm-hmm.
You know, one thought that occurred to me, and I think I shared it with you before we hit the record button in some email correspondence. When you talk about large language models, there is this adjacent notion of retrieval augmented generation.
which is an effort to, putting it simply, prompt the LLM further, not only with your typical question-and-answer prompts that you might insert, but also with sort of vetted documentation of some sort, to nudge the LLM in the direction of ground truth and push it a bit, not entirely, but somewhat, away from hallucinating. Am I correct in understanding that with vLex, all the documentation that you provide sort of serves as RAG, as retrieval-augmented generation is known, sort of a RAG check on what LLMs can do? Is that sort of right?
That's exactly right. Yeah, so we have a five-stage process for what we do. Stage one is we do a vector embedding match between the user's query, that is the user's question, and the source text, that is the case's text or the statute's text or the regulation's text.
And so we put that text into a vector embedding model. And then we use the source text as RAG and do a vector embedding search and a vector embedding match between the case's text, or the statute's or regulation's text, and the question's text. And if there's a match, then we finish stage one. And then stage two is we say, okay, large language model, tell me how well or how poorly this case's text answers the user's question.
And then after it gives about a paragraph's worth of analysis as to how well or how poorly that case answers the question, stage three then is to be able to say, give me a confidence score from 0% to 100% as to how well or how poorly this case answers the question. Then the large language model gives me a confidence score.
And then what we do after that is we take everything that is 70% or higher, that is, it answers the question 70% of the way or more, and we give that to the user. So that's stage three.
Then stage four, what we do is we do even better than retrieval augmented generation. We go into all of the cases that are above that 70% threshold. And we say, well, if that case that answers the question answers our question, maybe the case that that case cites might also answer the question. So we go through every single case that that 70% threshold or above case cites. And if any of those cases that it cites meets that 70% threshold, then we add that to the list.
So we call that going down the tree. And then we also go up the tree because lots of cases cite the original case. They say the case is 15 years old. There might be 100 or 200 cases that have cited that other case. So we go through all of those 100 or 200 cases that cite that original case. And if any of those 100 or 200 cases are above that 70% threshold, that 70% confidence score, we add that to the list too.
So this is better than just a retrieval-augmented generation with a vector embedding search, which is what a lot of technologies do. We do that too. But then we also go down the tree of all the cases that that case cites, and up the tree of all the cases that cite that case. And then we do a fifth step, stage five at this point, which is to analyze the extent to which that case has been reviewed by a citator. Has that case been overruled? Or has that case been distinguished by another case? And so we take that as part of the analysis as well. So if Roe v. Wade comes up in the results, it will show up in the user's interface as being overruled. And we might mention it, that is, Vincent might mention it in the memorandum, but it will mention it in the context of, you know, for over 30 years Roe v. Wade was the law of the land, but it has been supplanted by Dobbs.
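The five stages Damien walks through can be sketched, very schematically, in Python. Everything here is a stand-in: `embed()`, `llm_explain()`, `llm_score()`, and `citator_status()` are placeholders for real embedding, LLM, and citator services, and `Case` is a toy record, not vLex's actual data model; only the 70% threshold comes from the description above.

```python
# A schematic sketch of the five-stage pipeline described above.
# All service functions are injected stand-ins, not a real API.
from dataclasses import dataclass, field

@dataclass
class Case:
    id: str
    text: str
    cites: list = field(default_factory=list)      # cases this case cites
    cited_by: list = field(default_factory=list)   # cases citing this case

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def answer_question(query, corpus, embed, llm_explain, llm_score,
                    citator_status, threshold=0.70, top_k=20):
    # Stage 1: vector-embedding match between the query and source texts.
    qv = embed(query)
    candidates = sorted(corpus, key=lambda c: cosine(qv, embed(c.text)),
                        reverse=True)[:top_k]

    # Stage 2: ask the LLM how well each candidate answers the question.
    # Stage 3: ask for a confidence score; keep only scores >= threshold.
    results = []
    for case in candidates:
        analysis = llm_explain(query, case.text)
        score = llm_score(query, case.text)  # normalized to 0.0..1.0
        if score >= threshold:
            results.append((case, score, analysis))

    # Stage 4: walk the citation graph one level "down" (cases each
    # qualifying case cites) and "up" (cases that cite it), holding
    # each neighbor to the same threshold.
    seen = {c.id for c, _, _ in results}
    for case, _, _ in list(results):
        for neighbor in case.cites + case.cited_by:
            if neighbor.id in seen:
                continue
            seen.add(neighbor.id)
            score = llm_score(query, neighbor.text)
            if score >= threshold:
                results.append((neighbor, score,
                                llm_explain(query, neighbor.text)))

    # Stage 5: citator check -- flag anything overruled or distinguished.
    return [(case, score, analysis, citator_status(case))
            for case, score, analysis in results]
```

The citation-graph walk in stage 4 is what distinguishes this from plain vector-search RAG: a case can enter the answer set purely because a qualifying case cites it, or is cited by it.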
So all that's to say is that this is better than your average RAG. This is vector embedding plus all of the symbolic AI of going down the tree and up the tree, and all of the human editors saying whether this case is good law or bad law. We're going to get to some of that symbolic versus sub-symbolic, it's sometimes called connectionist AI,
neural network type reasoning, if you want to label that reasoning, and I guess you can. Inductive reasoning as opposed to deductive reasoning, as I like to put it. But when... So vLex is... Its customer base is what? Law firms that are looking just for that sort of analysis? And...
And I guess, you know, anywhere in the country, at any size law firm. Yeah, that's right. Yes, we serve anywhere in the country, any size law firm. We also sell internationally. vLex was started in 1999 in Barcelona. So it started in Spain, expanded to Latin America, expanded to the United Kingdom, and expanded, when they brought in Fastcase, to the United States. So we sell worldwide. We're not just the United States. And not only, go ahead.
No, you go. Yeah. So not only to law firms worldwide, but we also sell to their clients, that is, to the Fortune 20 customers that hired those law firms. And we don't just answer questions like I just described, but to the extent you want to upload a complaint, we will give you between 150 and 200 pages of good legal analysis of that complaint,
to be able to say, here is an analysis of the claims. Here is an analysis of potential legal defenses you might be able to argue based on that. Here are some questions you can ask your client to help them win. We give about 200 pages' worth of really good information, no prompting necessary. You just get the output. So as you can imagine, what I just described usually would take an associate 80, 90, 100 hours' worth of work to give that kind of output.
And so plaintiff-side lawyers really love this. You know, you can imagine that the contingency fee lawyers are saying, wow, the less time I spend, the more money I make. So plaintiff-side lawyers love this. Defense-side hourly lawyers love it less, because then they say, wait a minute, if this is 100 hours of billable time, what happens next?
But you know who really loves it is the clients. That is the Fortune 20 companies that we serve, the in-house law departments. Those law departments love it because then they can say, this is amazing. Vincent is amazing. What we were going to do is we're going to do two things. The first thing we're going to do is we're going to take complaints that we've received over the last year.
And we're going to run some of those through Vincent, get the 200 pages, that is, the output, and then compare those 200 pages to what our lawyers did and say, did our lawyers really add anything on top of these 200 pages? And if the answer is no, they didn't, that's going to say something. And then they're going to say, and did these 200 pages have things that we wish our lawyers had told us, but they didn't?
And if the answer is a lot, that's also going to say something. Indeed it will. That's going backward. And then going forward, they say, what we're going to do is take new complaints that come in, and before we send those new complaints to our lawyers, we're going to run them through Vincent. And then we're going to take the output, the 200 pages, and send those to our lawyers. And then they're going to say, what can you do on top of this? Because that's all we're going to pay you for. We're not going to pay you to recreate this wheel.
So this is a drastic shift: in the ability of plaintiff-side lawyers to make more money; for in-house counsel to really shift the dynamics between them and outside counsel; and for the defense-side lawyers that are billing hourly to maybe rethink the way that they're doing work. Because if they move to the flat-fee world, they can actually make more money than they can under the billable-hour world, and actually satisfy their clients in a way that they haven't in the past.
Well, we're not going to devote this podcast episode, someday perhaps another one, to the implications of gen AI for the law firm business model and the entire world of private practice. But, you know, the implications are enormous, are startling. And one of them is, what do you do if you're billing by the hour? And my, you know, sort of simple-minded reaction is, stop billing by the hour, but do more work, and enjoy more of the work you're doing than you were enjoying when you were billing by the hour. But, well, that again is thoughts for another episode. But yeah, the implications are just profound,
as you started to elucidate, Damien. So just so I understand, and our listeners understand, are the LLM sources that vLex uses with its Vincent tool, you know, among the favored few, the standard foundation models, the original OG foundation models: OpenAI, Gemini,
Meta. Where do you get your LLMs? Yes, we use an ensemble of LLMs, and we use every single one of those that you've mentioned there and a few others as well, including some of our home-baked things. And as you know, and really as most people who listen to this podcast probably know, this is such a rapidly evolving space that we think that tying our horse to any one of those models is probably a bad idea, because they always go neck and neck and are beating each other left and right.
That's on the ability side. And then there's also a cost side, where there are all sorts of dropping costs. And so it used to be that it'd be great to use Mixtral because it was free and open source, and Llama was free and open source, so maybe use those for your cheaper things. But now the newer OpenAI models are, on the API side, so cheap that maybe you don't have to skimp.
Maybe you can actually use the OpenAI ones and don't have to settle for some of the open source models. So all that's to say, in this rapidly evolving space, we are agnostic as to foundational model. We are using the best model for the task, the best model both for performance, that is, what is the reasoning capability, and also for cost, to make sure that we're giving reasonable value to our customers. Yeah, being able to witness this rapid-fire evolution
of gen AI, particularly just focusing on the large language models, the changes and competition and, as you put it, the beating one another back and forth on various measures of performance. I dare say too, and I hope that this is the case, that the research, and we'll expand upon this, the research into
the way in which these large language models work will expand to include other ways of reasoning, for want of a better set of words, so that we're not pinned down only to the neural-net connectionist version of inductive reasoning that they perform, but, as we'll talk about, the models are subject to research that will improve their reasoning with a more traditional deductive-style reasoning, so as to further rein in the hallucination problem: give them some world models of what the world is like, and what the legal world is like, from which these large language models can better their performance. But we'll get to that in a little bit.
One of the things I had written to Damien about, something we might discuss, is, you know, are we overbuilding in this space of the use of generative AI, large language models?
I dare say, and I say this not just to flatter vLex, but because I genuinely believe this: what vLex offers as sort of a special sauce is marrying the analysis that can be done with the tools available in Vincent with that enormous database of data that Damien described. But if you don't have a sauce like that, if you don't have a distinguishing feature, a differentiator, do you think we're going to face a bit of a market crash in legal tech,
let alone in the greater world of LLMs anytime soon? Is that a risk that vendors ought to bear in mind in your view, Damian? I think that there will be a reconciliation of those that talk the talk versus those that walk the walk.
And as large language models have increasingly become performant, that is, as OpenAI and Gemini and Llama 3 and all of the others increasingly demonstrate reasoning, if you are merely a wrapper of those foundational models, without really anything on top of that wrapper, then I think the wisdom of the crowds will be that you will be punished for being a mere wrapper.
You really have to provide something above just what GPT-4 is. Otherwise, why would I, as a large law firm? Option one is to buy your product and have you as a wrapper. Option number two is for me just to hire a prompt engineer, because that prompt engineer is going to be able to prompt engineer just as well as you do. And hiring that person is going to be way cheaper than having to pay for your product year over year.
And so really, if you're a mere wrapper on a foundational model, I think that there will be a reconciliation in the marketplace for such things. But I think that those that will win in the marketplace will be those who can sufficiently demonstrate that they are providing product value. That is, they're providing something that makes it better, faster, stronger for them to buy
your product than to merely build it themselves. It's always a build versus buy question. And you have to demonstrate your value and why it's worth it for them to buy your product. Is that making it faster? Maybe. Is that giving an ensemble of things that you can do, no prompting necessary? Maybe.
Is it having the data that nobody else has? Probably. And we're going to be getting into that later. But that's all of those things. You have to demonstrate something more than being a mere wrapper over GPT-4. And I think people are figuring that out today as we sit here in August of 2024, and it will be increasingly so and increasingly evident in the marketplace.
I think you put it very well. I've heard people try to defend against the pejorative use of the word wrapper by saying, well, there are these large language models, but if I can be specific to a particular vertical, legal tech being one example of that specificity, then I'm okay. But I think you're right. You're not okay just because you've wrapped it around a little bit of legal language
familiarity. Instead, you have to offer more than merely a front end to a large language model. Some real value add, whether it's through the data, whether it's through a different sort of reasoning, whether it's a different structure to the data that you offer as opposed to the volume of the data. And if you can't do any of that, I agree with you, Damien. The wrapper
with a W, not just an R, is going to be lost in the market exhaust. I can't agree with you more.
And to add a tad on that: you know, I vacillate, and I've probably heard the same people that you've been hearing, but some people say, well, yeah, but websites are just a wrapper around SQL servers. And that's kind of true, right? But so what does a website, a SaaS product, give you that a SQL server itself doesn't? And then we go back to foundations: does Fastcase provide you 25 years of cases, statutes, and regulations?
Are they one of only three repositories of those cases, statutes, and regulations in the United States and the world? And the answer is yes. They have 800 million dockets and documents, so that you're able to go through the motions, briefs, and pleadings in all the federal courts and 38 state courts. Yes. So in that way, before large language models, when we were just a SaaS website product, we were a wrapper around SQL servers, but we also provided this amazing data set on top of it.
So what is it in our large language model world that you're providing beyond just the SQL servers, just the websites, and beyond just the large language models? That's the question that everybody has to answer. No, you're right. Back in the SaaS days, and there's even some talk about whether SaaS as a business model is...
is doomed in the face of generative AI. Back in the old SaaS days, you had the back end, you had the front end, and you had the glue between the two. But now you've got the analysis that gen AI affords you. And you can do things with data that you couldn't do before, when you merely had a UI giving you an entree into the data and some sort of magic glue software that pulled the data out.
Now you can actually do analysis. And if all you're bringing to the table is just that glue between the two, the LLM and a user interface, I agree, I think you've got a problem with how you're going to make money in the future. One thing I've talked about, with myself and with others, is whether legal tech perhaps will end up suffering what big tech, the Magnificent Seven more broadly, has done to tech generally. And that is, these Magnificent Seven, you know, the Apples
of the world, the Metas, the Amazons, the Googles, and the others are, by some measure, sucking the life out of the startup ecosystem, doing acquisitions and buying startups that might have otherwise out-innovated the Magnificent Seven. There's some pushback on that from the antitrust regulators, on those grounds and others,
or doing acqui-hires to avoid some of the antitrust scrutiny, or "hack-quihires," as I think the latest versions of these acquisitions are called. But we have TR, we have LexisNexis. Do you see any risk of a magnificent two beginning to suck the air out of the more nimble but potentially disruptive vendors in legal tech?
Beginning to or has for the last 40 years? Take it however you wish. So I would say that there is a real threat to, I'm not going to name names of particular companies, but if there is a duopoly in legal tech, you can imagine that there would be a threat to that duopoly. And the real question is,
Who has that data? That is, for a while, courts would give their judicial opinions to a particular vendor, essentially saying, well, we as a court don't want to distribute these cases, so we will leave it to these private vendors to distribute the cases for us. That was in the 1980s and the 1990s, when printing was very expensive. Printing is not very expensive these days.
And so, this outsourcing of the law, that is, the law that binds you and me, and everyone is bound by the law, therefore everyone needs to have access to the law. Is this outsourcing and privatization of the law something we'll want to continue?
And if the answer is no, we don't want to continue that, because now we don't need to print up books and mail them around the world, we can just put them in bits and distribute them as the Free Law Project has done, then that's a much different world. And really then, if you look at it from that world, if we were to free the law, then what moat does the duopoly have?
Because they in the past have had a moat of the cases, statutes and regulations. Maybe they don't have that moat anymore. They have also had a separate moat of secondary materials where a professor will do a gathering of all the cases, statutes and regulations and then do an analysis of those. But it turns out that large language models can retrieve and do analyses not in a general sense but taking your particular facts
and applying your particular facts to those analyses. And in contrast to the secondary materials, where it might take them six months or a year to update their materials with the latest cases, statutes, and regulations, if the law were available, like vLex has the law, we can update every day. So we can give you yesterday's case, whereas the treatise can't do that. So the moat of the secondary materials kind of evaporates in our large language model world.
So if you get rid of the moat of the cases, statutes, and regulations, and if you get rid of the moat of the secondary materials, because they don't really matter anymore, what is the future of a duopoly that has rested on those laurels for so long?
And so I would say that really the bigger question, you talked about big tech, you know, the Googles of the world, Amazon, Facebook, Meta, and Microsoft, et cetera. I think there's probably a bigger chance of them just absorbing legal, not because they're targeting legal, but just as roadkill.
They're absorbing all of the words. And it turns out that the law, all we do is words. We as lawyers, every single task that we do, we ingest words, we analyze words, and we output words. And it turns out that large language models can do all three of those things really, really well.
So as the Facebooks of the world are ingesting all of the data, as the Googles of the world are ingesting all the data, as Microsoft and OpenAI, as everyone is ingesting, they need high-quality text. And it turns out that the law is high-quality text. And not only is it high-quality text, but it's also high-quality, mostly human-created text, which is a scarcity these days. Every day, judges are outputting reams and reams of high-quality human text that will be catnip to Google and to Facebook
and to Meta and to Microsoft and to OpenAI. So maybe they just ingest all the law. And then what moat does the duopoly in legal tech have? And so really, I think the bigger risk is to the big tech than there is to any historic duopoly. No, you're right. There's all this talk about what will happen if
The big guys only have synthetic data to train on because all the human created stuff has already been used or the trickle that continues to be provided isn't enough to satisfy, fill the maw of these LLMs. But you're right too that judges are humans and they're creating all this stuff and that's the kind of raw material that every large language model requires.
really likes. Interesting point that you raise. You may have written the roadmap for Lina Khan, or her successor at the FTC, to bring a case down the road against big tech if they start to close in on our legal tech roadkill. I never thought until recently that legal tech was a big enough market, despite its absolute size, for some of the big players like Microsoft to care about. But they're going to start caring, and indeed they've begun to do so. And as to your point about words, you know, the revolution came about with gen AI because gen AI is words. It's language. And that's what we lawyers do. So if you can put an interface in front of us lawyers that's language-based, which is what the large language models afford, and other interfaces afford
interfaces with photos and video. But for us lawyers, language, of course we're going to be drawn to it like moths to a flame, hence all this interest in generative AI. It's our meat and potatoes. It's language.
Agreed. And to your point about judges, and maybe that being fodder for the large language models: I think the statement that judges are slow to adopt technology is a massive understatement. They are among the slowest to adopt technology. But maybe that's a feature, not a bug, because probably the last bastion of non-machine-created, that is fully human-created, text
will probably be those human judges, who are going to be cranking out human-written things, continuously giving the foundational models not synthetic data but actual human-generated data, for years to come.
So that's thing number one. Thing number two, you talked about Lina Khan. And, you know, if we as legal tech become, you know, big tech roadkill, a way to avoid that is maybe for the federal government that she works for to make PACER free. Yes. Because, you know, PACER currently costs about $2 billion, with a B, to download every single document. And that is a massive moat.
vLex has spent a bunch toward that $2 billion, and we then have to charge our customers to make up that cost that we bear. But what if Lina Khan's government, that is, our federal government, what if they were to make that cost zero, not $2 billion? There could be a thousand legal tech companies that would spring up to essentially compete against the duopoly. And
then where would we be? On an anti-competitive or a far more competitive stance? So I would say that making the cases free, making the statutes and regulations free and readily available, that is one step toward avoiding any duopolies in the future. Yeah, let a thousand flowers and more bloom. I'm all in favor of that. We're going to turn to something a little bit more general,
but we'll try to make it of interest to our listeners. You know, I started to say earlier that the current version of the large language models, and I say current, it could change tomorrow, is one where very generally put,
It is autocomplete on steroids. Now, I don't say that disparagingly. It's quite a feat to have trained them to do what they do. But, quote unquote, and I put scare quotes around this, all that they do is really predict, after tons of expensive training on tons of expensive data, what the next word will be. Or, more properly, they come up with a probability range of what the next words ought to be in the output that they're providing you. And sometimes they even pick an outlier, just to be more so-called creative.
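Charlie's "probability range of what the next words ought to be" can be made concrete with a toy sampler. The vocabulary and scores below are invented for illustration; temperature is the knob that occasionally lets the model pick the outlier he mentions.

```python
# A toy illustration of next-token prediction: a made-up score
# distribution and a temperature-controlled sampler.
import math
import random

# Scores ("logits") a model might assign after "The court granted the ..."
logits = {"motion": 4.0, "injunction": 2.5, "appeal": 1.0, "banana": -3.0}

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; low T sharpens the
    distribution, high T flattens it and gives outliers a chance."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    mx = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - mx) for tok, v in scaled.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def sample_next(logits, temperature=1.0, rng=random):
    """Pick the next token according to the probability distribution."""
    probs = softmax_with_temperature(logits, temperature)
    r, acc = rng.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # guard against floating-point rounding

probs = softmax_with_temperature(logits, temperature=1.0)
print(max(probs, key=probs.get))  # prints "motion", the most likely next word
```

At a temperature near zero the sampler is effectively deterministic; at higher temperatures the improbable "banana" gets a real, if small, chance, which is the "creative outlier" behavior in miniature.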
But that is inductive reasoning. It's sub-symbolic. It's not being trained to work with the concepts that we humans use to make our way through the world when using language. Instead, what it's doing is seeing subtle patterns that perhaps we limited humans can't perceive, and inductively from those patterns,
forming among all those neurons linking together with weights in all sorts of different configurations, a set of rules buried in those weights and neurons for predicting the next or most likely word. There is another way to reason, though. And the other way to reason is, you know...
All men are mortal, Socrates is a man, therefore Socrates is mortal. Good old-fashioned deductive, symbolic reasoning, as it's called. And there's an AI that was at one time, and I'm doing this by way of background, pretty much before neural nets, confined to that sort of deductive thinking: good old-fashioned AI. And programmers today, when they're programming in a standard way, are using concepts
to put together logic to create a program. Some of the critics, Gary Marcus, professor emeritus from NYU, most notably, say that one of the great faults of large language models, despite the marvelous things they do, is that they can't
reason deductively. They don't have a set of concepts that map out a world from which they can reason. And he says, and I somewhat agree, although I am still marveling at what LLMs as they're presently configured can do, he suggests that we combine the two in some fashion, call it neurosymbolic
call it neural nets with trees that are more akin to deductive logic. What do you think is going to happen there? Particularly as we lawyers are used to deductive reasoning, although I do
recall that as a law student you were taught to read a bunch of cases and inductively create a rule, but then we had to apply that rule in a deductive fashion. What do you think is going to happen between these two modes of artificial intelligence reasoning?
That is probably my favorite question I've ever been asked on a podcast. And I will also say I've never been on a podcast where someone has thrown "sub-symbolic" out there. So the fact that I don't have to dumb down my language for this podcast, and I can actually get deep in the weeds as you've asked me to do, is a great joy. So first, thank you for that great joy. Secondly, Pablo Arredondo, co-founder of Casetext. We've been friends for a decade now.
In July of 2022, at ILTACON, we were giving a talk together, and he was really pounding the table about neural nets. Now keep in mind, this is July of '22, just a few months before ChatGPT came onto the scene.
So he said, Damien, neural nets are going to take over the world. And I said, no, no, no, no. Symbolic AI is the thing. Things like Sally, the knowledge graphs where Sally is representing the world in a symbolic way. That is really the thing. So I said, hey, Pablo, let's have a celebrity death match.
where you take the neural net side and I take the symbolic AI side. But then, after we prepped for the session, we actually agreed with each other more than we disagreed: it is not just one or the other, but the combination of the two that really is the answer. And the reason is what Google and Facebook and Amazon and really every other big tech company have figured out: knowledge graphs
are the name of the game. If you want to be able to say a friend of your friend is probably a friend of mine, or people who like this book probably like that book, that's all enabled by a knowledge graph. That is done, whether you'd call it inductive or deductive, through actual rules-based reasoning: if this person has a common friend, then they probably have something in common with each other. So that is done through a knowledge graph, and it's no coincidence that Sally is also a knowledge graph.
A knowledge graph is two nodes connected by an edge. So: lawyer drafts motion to dismiss. Lawyer is a node, motion to dismiss is a node, and drafts is an edge between those nodes. Nodes you can think of as nouns, and the edges between those nodes are verbs.
And so, Sally has lawyer, Sally has motion to dismiss, Sally has drafts. So you can now build symbolically, through a knowledge graph, a representation of every single legal task, because we have everything that matters to the substantive law, and we have everything that matters to the business of law.
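As a rough sketch of that node-and-edge structure (the labels below are illustrative stand-ins, not actual Sally identifiers), a knowledge graph can be modeled as subject-edge-object triples:

```python
# Minimal knowledge-graph sketch: nodes are nouns, edges are verbs.
# Each triple is (subject_node, edge, object_node). Labels are made up
# for illustration; they are not real SALI tags.
triples = [
    ("Lawyer", "drafts", "Motion to Dismiss"),
    ("Lawyer", "files", "Complaint"),
    ("Judge", "grants", "Motion to Dismiss"),
]

def objects_of(subject, edge, graph):
    """Return every node reachable from `subject` along `edge`."""
    return [o for s, e, o in graph if s == subject and e == edge]

print(objects_of("Lawyer", "drafts", triples))  # ['Motion to Dismiss']
```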
And so what does that get you that a large language model doesn't? Well, first of all, it gets you analytics. If you were to say to the large language model, tell me what percentage of motions to dismiss for breach of contract Judge Smith grants, that large language model is going to hallucinate all over the place.
In contrast, if you say, here, let's tag up all of Judge Smith's orders, tag up which of them are motions to dismiss, and then tag up which of them are for breach of contract, you will get a certainty, a deterministic certainty, way better than a probabilistic large language model, as to those analytics. You need that symbolic reasoning. Secondly, the thing that large language models don't give you is interoperability.
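A minimal sketch of that deterministic-analytics point, using made-up tagged orders: once documents carry tags for judge, motion type, claim type, and outcome, the grant rate is plain counting rather than a probabilistic guess:

```python
# Hypothetical tagged orders; the tag values are invented for illustration.
orders = [
    {"judge": "Smith", "motion": "motion_to_dismiss", "claim": "breach_of_contract", "granted": True},
    {"judge": "Smith", "motion": "motion_to_dismiss", "claim": "breach_of_contract", "granted": False},
    {"judge": "Smith", "motion": "motion_to_dismiss", "claim": "negligence", "granted": True},
    {"judge": "Jones", "motion": "motion_to_dismiss", "claim": "breach_of_contract", "granted": True},
]

def grant_rate(orders, judge, motion, claim):
    """Deterministic grant rate over tagged orders; None if no matches."""
    matches = [o for o in orders
               if o["judge"] == judge and o["motion"] == motion and o["claim"] == claim]
    granted = sum(o["granted"] for o in matches)
    return granted / len(matches) if matches else None

print(grant_rate(orders, "Smith", "motion_to_dismiss", "breach_of_contract"))  # 0.5
```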
If you were to try to take all of your motions to dismiss for breach of contract before Judge Smith from your legal data source, say VLex, and put them into your document management system, like iManage or NetDocuments: if we use neural net number one and the destination uses neural net number two, the odds of us using the same neural net are next to zero. And even if we did, the odds of the weights matching are next to zero.
What you need to have that interoperability is a common language. That is, motion to dismiss has to be tagged the same on my side, VLex's side, as on iManage's side, so that we can push and pull, through queries, the right data: motions to dismiss for breach of contract.
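To illustrate the interoperability point, here is a toy sketch in which two document stores share a common tag vocabulary (the `sali:` labels are invented for illustration, not real Sally identifiers), so the same query works unchanged against both:

```python
# Two hypothetical repositories that both tag documents with a shared
# vocabulary. Document IDs and tag names are made up for illustration.
vlex_docs = [
    {"id": "v1", "tags": {"sali:MotionToDismiss", "sali:BreachOfContract"}},
    {"id": "v2", "tags": {"sali:Complaint"}},
]
imanage_docs = [
    {"id": "m9", "tags": {"sali:MotionToDismiss", "sali:BreachOfContract"}},
]

def query(docs, *tags):
    """Return IDs of documents carrying every requested tag."""
    wanted = set(tags)
    return [d["id"] for d in docs if wanted <= d["tags"]]

# Identical query, two systems: that is the interoperability a common
# tag language buys you.
print(query(vlex_docs, "sali:MotionToDismiss", "sali:BreachOfContract"))     # ['v1']
print(query(imanage_docs, "sali:MotionToDismiss", "sali:BreachOfContract"))  # ['m9']
```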
Large language models are not going to permit that interoperability. So that's thing number two. Number one, analytics. Number two, interoperability. Number three is the reasoning that I know you really wanted to talk about. And the reasoning issue is that large language models today have a relatively small context window.
And that relatively small context window results in hallucinations; it results in not actually getting the right portions of the 300-page document that you wish you could have. Of course, context windows are expanding all the time, so this might be like an old guy saying, "Yeah, back in 1980, we only had 8K of RAM." This might be a short-lived limitation.
But for now, with our currently limited context windows, what if you were to have symbolic reasoning to say that negligent misrepresentation is a type of negligence claim, it is also a type of misrepresentation claim, and it is also, third, a type of defamation claim? And to have all of those relationships symbolically connected, each of them a node connected through an edge.
And then you could say, if I run a query for negligence or for misrepresentation or for defamation, all of those roads will symbolically lead to negligent misrepresentation. Then the large language model doesn't have to do as much work. That is, you put the knowledge graph into the limited context window, and the large language model benefits from that symbolic AI to give better insights and better outputs.
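That query-expansion idea can be sketched as follows; the is-a hierarchy below is a made-up fragment, not the actual Sally graph:

```python
# Hypothetical is-a edges: a narrower claim type points to its broader
# parents. Querying any parent symbolically reaches the child, before any
# language model does any work.
is_a = {
    "negligent_misrepresentation": {"negligence", "misrepresentation", "defamation"},
}

def expand(query_term, graph):
    """Return every narrower term whose is-a parents include query_term."""
    return {child for child, parents in graph.items() if query_term in parents}

# All three roads lead to negligent misrepresentation.
for parent in ("negligence", "misrepresentation", "defamation"):
    print(parent, "->", expand(parent, is_a))
```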
So all that's to say they're much better together than separately. The peanut butter and the chocolate go better together. That's exactly where I was going. The Reese's peanut butter cups analogy. Put them together and you have something that's greater than the sum of the parts. And to your point about...
the context window getting larger: the pushback on symbolic reasoning and good old-fashioned AI from some of the neural net proponents is that the nets themselves can learn reasoning just by looking at these examples. And there's a flaw in that argument, I think, and certainly I don't mean to put myself up there with Geoffrey Hinton and, you know,
others who are professors of great renown in the world of large language models. The problem, though, is that they can learn the patterns of reasoning that their data set provides. And when the data set is the entire web, there's a whole hell of a lot of patterns of reasoning. But they're not learning the ability to
reason outside of the patterns they saw in their training data. And, as I understand it, they're having difficulty generalizing from those patterns of reasoning they have discovered. Hence some of the hallucinations and boners you find them making that are just hilarious. If instead you provide them, as Sally would, with what I now see from Sally as a worldview of the law,
then they get to see these concepts, they being the large language models, or the adjacent symbolic reasoner working in tandem with the large language models. They get to see these concepts, these ideas from which you can reason deductively, and if one serves as a check on the other, namely the symbolic reasoner on the large language model, and if the large language model provides the vast learning abilities that it can offer
data-wise to the symbolic reasoner, then I think you have that Reese's peanut butter cup. I agree 100%. Sorry to interrupt, but mostly I'm so excited about what you've just said that I have to jump in. That is, we can take that top-down representation of how the legal world works, and the real benefit of large language models is that we can connect that top-down reasoning with the bottom-up reasoning of neural nets.
You probably know 273 Ventures, that is Mike Bommarito and Dan Katz. They've created their foundational model by ingesting all the cases, all the statutes, all the regulations, all the SEC materials, all of the government materials around the world. That foundational model is really good at being a bottom-up representation of all the concepts that matter. They're one of the companies doing this. Another is the Free Law Project,
that is Mike Lissner and team. Your listeners may or may not know that they've actually taken the baton from the Harvard Case Law Access Project. For the Case Law Access Project, covering from 2018 back to the beginning of time, Harvard scanned all of their case reporters, for all the federal and state cases. And as of about six months ago, that's now free and open source.
So the Free Law Project is taking that baton, and now they've said that from 2018 forward they're going to scan all the books and make those cases free and open source too. I'm working with a guy from the Free Law Project, his name is Enrico, and what he's doing is taking that Case Law Access Project data and, number one, putting it into vector embedding space.
So that is now free and open source on Hugging Face. And number two, Enrico did topic modeling on it. So you can see, for example, in the California cases, the top 10,000 legal topics that come up.
So this is a bottom-up representation of the topics that matter to California case law, or to United States case law. And we can take that bottom-up representation of the law and connect it with Sally's top-down representation to fill in the cracks of what Sally hasn't yet done. So this bottom-up plus top-down will give us a better symbolic representation to then further feed the neural nets, in the kind of gestalt, peanut-butter-and-chocolate way we talked about earlier. Yeah, I...
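As a toy illustration of that bottom-up step, here is a sketch that uses crude bag-of-words vectors as stand-ins for real embeddings and compares cases by cosine similarity, the kind of signal topic modeling builds on. The case texts are invented:

```python
# Toy "embeddings": bag-of-words counts standing in for learned vectors.
from collections import Counter
import math

cases = {
    "c1": "landlord tenant eviction notice",
    "c2": "tenant eviction landlord lease",
    "c3": "patent infringement damages claim",
}

def embed(text):
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

e = {cid: embed(t) for cid, t in cases.items()}
# c1 and c2 share an eviction topic; c3 does not, so it sits farther away.
print(cosine(e["c1"], e["c2"]) > cosine(e["c1"], e["c3"]))  # True
```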
Yeah, I'm glad you mentioned the Free Law Project. I had forgotten all about those guys, in addition to 273 Ventures' great work.
We've been talking about, let me backtrack. You have the learning side of large language models, which is key to the data that the language models can access. And then you have the inference side of large language models. And we've been talking about that inference side, the reasoning that they can do.
the actual programming that goes into allowing them to output something from what they've learned, whether it's symbolic
reasoning that's driving the app, or bottom-up learning, inductive reasoning as I call it, or a combination of the two. But let's get back to the data. You had said to me, in something you sent along recently, as I mentioned earlier: if data is the new oil, is it public or private oil? You've talked about that a little bit. But then you said, are you merely a refinery? What did you mean by that,
Damien, is that the wrapper problem? That's exactly right. And so let's define terms a little bit. So we've known data as oil for a bunch of years. You have to first extract that oil, then you have to refine the oil, and then you have to ship it to customers, and then you have to make it into a product. So if it's plastic, is it a plastic toy? Is it a plastic water bottle? And then you have to get it to market.
So, when you think about that oil life cycle, that's the same thing with legal data, where legal data is cases, statutes, regulations, judicial opinions, motions, briefs, pleadings. All of these are legal oil that is public oil. And then you also have private oil: you have contracts within the corporation, you have settlement agreements. These are all private oil. And so, really,
you used to have to have thousands of people at one of the large duopoly players summarizing that oil and then tagging it up. But it turns out that with large language models, you don't need thousands of people to summarize or to tag. Large language models can do that very well. And that refinery task is what I mean when I ask:
are you merely a refinery of the oil? That is, are you taking that public oil, cases, statutes, and regulations, or private oil, contracts and settlement agreements, and merely refining it? If you are a mere refinery, then there's really no moat: you will be run over, or you won't be providing sufficient value to overcome the build-versus-buy analysis that we discussed earlier.
But then, you know, the real question is what is the product you're building on top of that? Is it plastic? Is it a toy or is it a water bottle? So that product that you're building on top of it, for that you need really smart lawyers to be able to say when I was practicing,
this is a pain point that I always thought was worth paying money for. Then you need good product people to turn that into a product. And then you also need good UX people to make it Apple-like, because lawyers love simplicity. And then, on the marketing side, there has to be trust in the company. Do you really trust that this company is going to be around in a year, or five years, or ten years?
So really, are you building number one, a product that is really worth overcoming the build versus buy analysis? And number two, is your marketing and trust in the marketplace good enough that people are going to give you money because they know you'll be around for a while?
Yeah, and the investors too will have a say in all that. And if they're smart, they're going to be looking at it along the lines that you just described and funneling money to the startups that really have more than merely a refinery operation going on. Because if that, as you put it, is all they're doing, that moat's going to dry up and the invaders will invade the castle.
Let's wrap it up with, you had several future facing ideas to talk about. I guess, what's your favorite among the several that you had listed? One that sort of got my attention was, what's the blocking and tackling that will elevate the winners? I have a feeling it goes along with the discussion that we just had on getting beyond the refinery stage. Is that right?
Yeah, I think that's right. So saying what is the product? What is the value you're providing that is worth enough money for a lawyer to pull out their pocketbooks? Because it turns out that that is a very hard thing to get a lawyer to do. So what is your value? What is your product that is beyond just a refinery of oil?
And I would say that as you think about what that product is going to be, you said earlier in this discussion that you live and die by your taxonomy. And so you could either, one, try to figure out the things that matter to the law out of whole cloth. That's option number one. That's really expensive and hard. Option two is just to use Sally.
Sally is free. It's open source. And it has the benefit of Thomson Reuters and LexisNexis and iManage and NetDocuments, and the biggest law firms in the world, and the biggest corporations like Microsoft and Intel. All of them have donated their data sets to make Sally better. So if you rise and fall on your taxonomy, why would you try to build it from scratch when you can get the benefit of all those big companies? So then I would say that the...
The winners going forward are probably going to be the ones who figure out what that product is going to be. And I said earlier that the winners are probably going to be those who have lawyers on staff who haven't just been lawyers for six months, but have actually been in the trenches long enough to see all the pain points of the current processes, whether it's a litigation process or a transactional process.
And they see the pain points in that process and can ask: how can I make those pain points better, in a way that makes it worth it for that lawyer or that in-house counsel to open up their pocketbooks and pay me money? And when I worked for Thomson Reuters, one of my jobs as a subject matter expert was to write requirements. I worked on the product side. And I would say: I as a lawyer want to do X, Y, and Z to achieve A, B, and C. And so really,
it turns out that what I just described is a prompt. What used to be a requirement in the past is literally the way that you build these systems today. I used to have to give those requirements to my engineering team, which would take a long time to turn them into code. But there's no delay anymore. I just write that prompt, and all of a sudden that is the product.
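A minimal sketch of that requirement-as-prompt idea, using an invented user-story template; no real model API is assumed, only the text that would be sent to one:

```python
# The classic product-requirement template, reused verbatim as a prompt.
# Template wording and example values are hypothetical.
REQUIREMENT = "As a {role}, I want to {action} so that {outcome}."

def to_prompt(role, action, outcome):
    """Fill the user-story template; the result can go straight to a model."""
    return REQUIREMENT.format(role=role, action=action, outcome=outcome)

prompt = to_prompt(
    role="litigation associate",
    action="summarize every order Judge Smith issued on motions to dismiss",
    outcome="I can estimate how she is likely to rule on mine",
)
print(prompt)
```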
So really, I think the winners are going to be people that have seasoned lawyers. And it turns out that lawyers, number one, they have the expertise to be able to say these are the pain points. And number two, more importantly, we know how to use words really, really well. And so we as lawyers are really good prompt engineers because we speak unambiguously.
And we speak precisely. And those are two things that prompting really requires: unambiguity and precision. So in this way, I think the winners are going to be those that bring in enough seasoned lawyers who can write well enough to draft the legal language. The best, most important coding language in the world is English. And it turns out that lawyers can speak English very well. That's what we do. Damian, I can't thank you enough.
for spending an hour's time with me. This is a fabulous example of what I think, at least, good podcasting can be for people interested in legal tech. If people want to reach you, you write on LinkedIn quite a bit, if I'm not mistaken? Yeah, that's the best place, LinkedIn, yeah. And VLex, of course, and Sally are two organizations where you can see what...
Damien has his hand in. I say this at the end of every podcast. I started saying it when the pandemic was raging and had to hope that it could happen, and now it's more likely to happen than not: I look forward to seeing you in real life and having a chance to hoist a beer together. And again, thank you so much. I do love hoisting beer. And yeah, I hope to see you in 3D very soon. Wonderful. Thanks again.
Thank you for listening to the LegalTech Startup Focus podcast. If you're interested in LegalTech startups and enjoyed this podcast, please consider joining the free LegalTech Startup Focus community by going to www.legaltechstartupfocus.com and signing up. Again, thanks.