
Agents, Lawyers, and LLMs

2025/2/21

AI + a16z

People
Aatish Nayak
Topics
Aatish Nayak: I have extensive product leadership experience at fast-growing AI startups and currently lead product at Harvey. Harvey is domain-specific AI for legal and professional services, helping lawyers automate drafting, synthesis, strategic advice memos, and more. We mainly serve transactional work (M&A, financing, etc.), litigation, and in-house enterprise legal teams. The arrival of ChatGPT raised awareness of generative AI and pushed the legal industry to adopt AI faster, a trend reinforced by market competition. Harvey embeds legal expertise across every function, for example hiring lawyers as salespeople and product managers, to serve customers better and drive product adoption. Harvey's AI system mirrors how a law firm works: tasks are decomposed and dispatched to different AI engines, which ultimately deliver the result. Demand for legal work is effectively infinite while supply is limited, so AI can raise efficiency and free lawyers from repetitive work. Generative AI in legal is not full automation; complex tasks require collaboration with humans. Harvey's AI agents must collaborate with humans because humans may hold information or intent the agent lacks. Infinite demand forces law firms to become more efficient and adopt AI to meet market needs. Enterprise users need training and support to use AI products effectively, and Harvey raises utilization in several ways. Harvey will stay focused on legal, but will use existing projects and customer relationships to expand gradually into other professional services, drawing on its legal experience and customer base to move into adjacent areas such as tax and finance. It expands into new verticals through large enterprise customers and through partnerships with professional services firms such as PwC, with whom it co-builds custom tax and financial due diligence systems that draw on PwC's data and expertise. Harvey takes enterprise data security seriously, with strict data protection measures to earn customer trust, and has prioritized enterprise readiness and a strict data security posture from the start. Its product philosophy is to build an AI-native user experience in which the AI collaborates with users like a colleague; because the legal field lacks a unified software tool, Harvey chose to build an AI-native UX rather than rely on existing tools. That UX is meant to let the AI interact with users like a coworker, completing work through conversation and feedback. The design draws on the "IKEA effect": users who participate in the workflow feel more ownership and trust. Harvey's agents proactively ask users for feedback and information and iterate on results accordingly. AI-native UX is still in its early stages and needs more innovation, for example non-text interaction. Harvey mainly uses OpenAI models, but its architecture is modular, so different models can be swapped in; swapping models requires extensive evaluation to ensure product quality does not degrade. Harvey uses internal and external evals to improve its AI systems and communicate value to customers, and released the Big Law Bench benchmark to evaluate its systems on real legal work. OpenAI's new reasoning models improved Harvey's long-form drafting and reasoning capabilities. Enterprise awareness and adoption of AI is still early, with many companies not yet making full use of the technology; applications and business models are still evolving, though some companies are already exploring actively. Over the next few years, the value of enterprise AI will come from deeply understanding customer workflows and building targeted products and user experiences. Kimberly Tan:

Deep Dive

Chapters
The legal field faces a challenge: an exponentially increasing demand for services, coupled with a limited supply of lawyers. This leads to long working hours and lawyers spending time on mundane tasks, hindering their ability to engage in more creative and impactful legal work. AI offers a solution to alleviate these issues.
  • Infinite demand for legal work due to globalization, the internet, and AI.
  • Constrained supply of lawyers leads to long working hours and mundane tasks.
  • AI can help automate mundane tasks, freeing up lawyers' time for more creative work.

Transcript


Globalization, the internet, AI has increased legal work exponentially over the last few decades. And so you have basically infinite demand for legal work because companies are wanting to do different transactions, litigations, etc. So you have infinite demand and

And then what that means is that the supply is very constrained. And the unfortunate human cost of supply constraints is very long hours, often doing very mundane, boring tasks. We talked to lawyers who we've hired, our customers. They haven't become lawyers to write the fifth draft of the same document the fifth time or ask the same legal research question, right? They became lawyers to apply the law in creative ways, publish opinions, shape the fabric of society.

Thanks for listening to the A16Z AI podcast. If you're into applied AI and specifically building products for specialized and possibly regulated vertical markets, you should take a lot from this discussion between A16Z partner Kimberly Tan and Aatish Nayak, who's head of product at Harvey,

which, if you're not familiar, is a fast-growing startup targeting the legal industry with LLMs. Aatish talks through the various areas of legal work that Harvey is targeting, but more importantly, he gets into the very critical aspects of any successful vertical application, such as working closely with customers, integrating with their existing tools and workflows, and having industry expertise in-house. Looking more broadly, he also touches on how Harvey thinks about expanding into other fields of knowledge work and its strategy for adopting and innovating on top of today's best foundation models.

It's an exciting discussion that you'll hear after these disclosures. As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com slash disclosures.

I'm Aatish. I lead product at Harvey. I joined Harvey a year and a half ago, when we were around 30 people, and since then we've scaled to 250 people, so it's been quite a journey going through that growth. And generally, my background, and where I've spent most of my career, is actually in hypergrowth AI startups. Before this, I was at Scale for four and a half years as a product leader. And before that, I was at Shield AI, also an a16z portfolio company.

Really, it's a privilege to go through hypergrowth for a third time, now at Harvey, because it's such a pivotal moment in human history. And I think a lot of things are going to change and I'm excited to kind of be at the forefront. Maybe for people in the audience who aren't familiar, or people who are listening online,

What exactly does Harvey do? Because I think a lot of people know that Harvey exists, but we might not know in extreme detail what the product offering is. So Harvey is domain-specific AI for legal and professional services. Our product basically helps users and lawyers automate drafts, synthesis, strategic advice, memos, and more. Got it.

And before we get like a little bit deeper into the practice of building applied AI, are there like specific use cases that Harvey tackles the most, knowing that there are a lot of different workflows one could theoretically do for legal or professional services? Broadly for legal, there's maybe two or three types of legal work. So there's transactional work, which is essentially for mergers, acquisitions,

venture funding, large transactions that involve tremendous amounts of money. And then there's litigations, which is if someone sues someone, if there's a case in court, also often involving a lot of money. And then probably the third is really focused on in-house, so enterprise counsel and enterprise in-house teams. So these three are the larger buckets. We serve all these in various ways. And so if you think about

what you need in a merger or acquisition. You need to do due diligence in that. You need to understand all the liabilities, you understand the financials, you understand where the gotchas are of the target and the acquirer. So each due diligence can be broken up into

you know, almost 10 to 12 different workflows. And we help in different ways in those workflows and then same thing in litigation. So there's that like high level and really focused on different steps in that journey. - For a long time in Silicon Valley circles, people believed that selling to law firms or selling to professional services just wasn't the most fruitful area given they weren't known for adopting technology quickly. A lot of people thought the billing model wasn't aligned to increasing efficiency or using technology.

I'm curious, what has Harvey seen in that regard? So I think there's two things. There's the market and what Harvey has specifically done. So I think market timing for any startup is incredibly important. If you look overall, when ChatGPT came out in November 2022, it really unleashed the power of Gen AI for a lot of people. So lawyers, in-house counsel, managing partners, CIOs really started understanding this technology and saying,

oh, wow, this actually can change a lot of things. Before ChatGPT, generative AI was just maybe this hidden thing that you don't really know how to apply. But because ChatGPT put it in people's hands, the cat was out of the bag: the practice of law was going to change. Everyone knows that it's going to happen. And because everyone knows that's going to happen, a lot of enterprises are saying, hey, law firm,

my law firm X, we use AI. I've seen AI in action. You all should use AI to become more efficient, do more work, etc. Law firms started feeling pressure from clients. And then the law firm market, and the legal market in general, is very competitive. In any region, there's four or five major players really going tooth and nail at each other. And so...

It's important for a law firm to signal that they're innovative because they'll get more clients and they're more efficient. And so because of this competitive dynamic now, everyone really wanted to adopt technology. And I think it's this perfect storm of market timing and where Harvey was. And so there was these macro movements and pressures from the market. And then I think what Harvey early on, and we still do, I think really well, is really embedding the legal expertise

across all different functions. So what that meant was, early on, we actually had lawyers selling the product. So lawyers as account executives, and our CEO is a lawyer. And our head of legal research actually is also a lawyer, which I'll go into in a second. But yeah, we had lawyers selling the product. And so they would go to a law firm and speak the language, speak the lingo, be super empathetic. And they would actually come from a lot of the customers that we were serving. So they knew exactly how things worked.

And that really allowed us to get the distribution and really get the GTM going. And then on the product and AI side, we also have lawyers embedded in our product and AI teams. We have like a legal research function that works hand in hand with product managers and AI engineers. And what they really do is...

convert basically legal process into algorithms. So we have an agentic or compound AI system that basically functions how a law firm would function. So in a law firm, if a partner gets a deal or litigation, they break it up into multiple different pieces, maybe give it to the junior partners, junior partners break it up further, give it to associates, and it's passed down the chain. And then

And then because law firms are fairly hierarchical organizations, the associates do the work, then they pass it up for approvals and checks, and then ultimately the partner delivers the end product to the client. And our lawyers who work with our engineers actually just basically replicate that same model for different types of tasks and

convert, and literally whiteboard out, different processes so that AI engineers can convert them into model systems. Do you consider these different agentic workflows, then, to be replacing any kind of labor that people were previously doing? Or do you view it more in the classic labor-replacement-versus-copilot framing? That's a good question. I think it's a bit of a narrow take. I think the legal landscape overall is...

very complex and getting even more complex, and honestly very costly to navigate. Globalization, the internet, AI has increased legal work exponentially over the last few decades. And so you have basically infinite demand for legal work because companies are wanting to do different transactions, litigations, et cetera. So you have infinite demand. And then what that means is that the supply is very constrained.

And the unfortunate human cost of supply constraints is very long hours, often doing very mundane, kind of boring tasks. We talk to lawyers who we've hired, our customers, they haven't become lawyers to write the fifth draft of the same document the fifth time or ask the same legal research question, right? They became lawyers to apply the law in creative ways, publish opinions, shape the fabric of society.

And so we hear this from customers all the time, like Harvey gives 30%, 40% of their time back because it really helps them automate that mundane rote work. Actually, the other day, one of our customers said Harvey allows them to go home to their family in time because it's been able to accelerate a lot of things. So infinite demand,

a lot of supply constraints and it's a great place for AI to help. Can you talk more about that a little bit? What would that interaction pattern actually look like? So this is a general question with, I think, generative AI. Like what is the human component? How much is it fully automated? I think the reality is

Let's say you're drafting an S-4 or like an S-1. An S-1 is when you go public. You're not going to one-shot that into the biggest reasoning model and say, hey, write me an S-1, and you're done, right? Because all the bankers are safe. Yeah, all the bankers are safe. o1 is not going to one-shot your S-1. The process of doing an S-1, or the process of doing a merger, is really interactive between all the parties: the law firm, the client, and any other parties involved.

And so we think basically these agents have to collaborate well with humans to get the work done because humans may have some particular intent that they haven't told the agent or they may have some data that the agent doesn't actually have.

And so we think about building these agents in a nice kind of AI native UX way so that they can actually collaborate with different organizations to actually get the work done and say, hey, you know, I wrote this draft. Am I on the right track? You know, give me this more information because I don't know what to do about this decision here. So, yeah.
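The collaborate-before-drafting pattern described here can be sketched as a simple gate: the agent checks whether it has the user's source material and intent, and asks a clarifying question instead of guessing. A minimal, hypothetical illustration (the class and field names are invented for this sketch, not Harvey's actual API):

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DraftRequest:
    task: str
    source_documents: list[str] = field(default_factory=list)
    format_reference: str | None = None  # e.g. a precedent whose tone/format to match


def next_agent_action(request: DraftRequest) -> str:
    """Ask for missing context before drafting, rather than one-shotting the task."""
    if not request.source_documents:
        return "ASK: Which documents should I base this draft on?"
    if request.format_reference is None:
        return "ASK: Is there a precedent whose format and tone I should match?"
    return f"DRAFT: {request.task}"


req = DraftRequest(task="disclosure schedule")
print(next_agent_action(req))  # the agent asks for source documents first
```

A real agent would interleave these "shoulder taps" with model calls, but the control flow, asking before producing, is the same.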

I think we'll start to see more proactive agents that really ping different people at each firm and really collaborate effectively to get something done. And maybe circling back a little to the question I had just asked, which is around how the market has changed and how ChatGPT really was this moment for enterprises to realize that the cat was out of the bag. Has that changed...

how legal or law firms think about charging, etc.? Because one of the things that people believed about legal for a long time was that, because of the billable-hour model, it actually didn't matter from a profit perspective how many hours you spent on it, even if maybe people just wanted to go home to their kids. I think this goes back to the market dynamic where you have infinite demand. You just have to get more efficient to service all that demand. We started...

in a seat-based model. We charge basically on a per-seat basis. And it's not because we don't believe outcome-based pricing, or paying for the work, is the future. It's really just because

We want to make it understandable for enterprise buyers. Like I think there's this VC statement, outcome-based pricing is the future or it's happening. I think it will happen. But I think what people have to understand is enterprises don't really know how to reason about buying outcome-based work, especially for such an experimental product like AI. And so I think it'll happen over time. I know one thing also about

deploying AI into the enterprise for maybe the first time ever in some of these customers, people might not know how to use it. It's sort of a new UX experience. People don't really know how to prompt agents a lot of the time. How do you guys think about the types of things that you need to do to actually get an enterprise to meaningfully get value out of an AI product? So our utilization has grown from 40% earlier last year

to 70% now. What is the metric? So it's active users over a number of seats on a monthly basis, basically. Okay. Yeah, I think a lot of that growth has been driven by good old-fashioned discipline across different functions. So maybe starting with the GTM sales team, as I mentioned, we have lawyers embedded in the sales team. And they really, because they come from this field, because they come from a lot of our customer archetypes, they put a lot of emphasis into a very...

specific kind of like onboarding program and use case building where, you know, they speak the lingo, they speak exactly how to accomplish a certain use case. And so it makes it a lot more approachable for our users. So that's one in sales and GTM side. On the customer success side, we've really tried to actually gamify a lot of deployments internally. So our customer success team often does big launches or like use case contests and law firms love to post on LinkedIn. And so if...

We say, hey, this person is the best AI prompt engineer or whatever. They love to talk about that on LinkedIn, and it creates a nice kind of healthy

competitive mentality. Yeah. And then the other question is, as you expand to other industries, you're two years or so into the company now and you actually want to expand beyond legal. So I'd love to maybe first understand just the rationale behind doing that versus maybe going deeper into legal and then how applicable you think the product set as well as the go-to-market strategy would be for the new verticals. Good question. I think, first of all, we have a lot of legal customers, but we don't want to rest on our laurels and become complacent. We actually have a cultural principle that says, you know, job's not finished.

It's referencing the Kobe quote. I don't know if you're aware of it. And so... I wasn't, but now I am. We never want to be complacent. And so a lot of our effort is still focused on legal. But I think overall...

If you look at transactions, if you look at litigation, if you look at lawyers and legal work overall, there's oftentimes a lot of professions involved that are not just legal. Like in a transaction, if you're doing an M&A, there's tax people involved, there's financial people involved, there's HR people involved to combine the two teams. And so in general, I think it would be a disservice to say only lawyers can use Harvey and take advantage of it inside of this transaction. And so...

The way we think about it is like, as we're doing these larger project-based workflows, using that to expand to, hey, maybe the tax professional needs to know the same thing as the legal person with one maybe incremental thing on top. And so we're really using the lawyers and the projects that they work on to expand kind of naturally to these verticals.

And there's like a few ways to do it. I mean, generally we take a very customer-driven approach. A lot of our enterprise customers actually already have their compliance and HR teams on Harvey, because if you're reviewing employment contracts, the HR team is obviously gonna be very involved. And so that's like one avenue:

kind of organically expanding inside of enterprises and then being very customer driven and partnering with leading firms. So we work with PwC to build basically custom tax and financial diligence systems because PwC

especially internationally, they're the experts in tax law, they're experts in financial diligence. And they've really helped us learn a lot about those domains and really push us in that direction. So we've been kind of laying the seeds for that expansion for a bit. And over the next two, three years, we're really going to naturally expand to those areas. What do you mean when you say like custom models or custom workflows for those domains? Like is that...

custom as in PwC-specific, and therefore something you actually actively don't want to bring to maybe similar customers? Particularly for the tax work. You know, tax attorneys across the world

ask a lot of questions about certain tax laws, how it can be applied to their clients. And so a lot of that knowledge is actually just in PwC. The world's leading tax experts in UK law or UK tax law are actually at PwC. And so when we say we're building custom systems there, we're actually using a lot of the data that they've curated, as well as using the expertise and

eval from their experts to improve that system. So we build various fine-tuned models and RAG systems that incorporate that data and the evals from those customers.
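At its simplest, a RAG system over partner-curated data reduces to retrieval over a curated corpus plus prompt assembly. This is a toy sketch: the term-overlap retriever stands in for a real embedding-based retriever, and the corpus snippets are invented:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank curated passages by naive term overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_terms & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble a grounded prompt from the retrieved curated context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Using only the curated context below, answer the question.\n{context}\nQ: {query}"


# Invented examples of expert-curated passages:
corpus = [
    "UK VAT registration threshold guidance (curated by tax experts).",
    "Financial diligence checklist for share purchases.",
    "Employment contract review notes.",
]
top = retrieve("UK VAT registration question", corpus)
print(top[0])  # the VAT passage ranks first
```

In production the retriever, the curated data, and the downstream (possibly fine-tuned) model would each be far richer; the sketch only shows how curated expertise enters the prompt.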

I think PwC is unique in that sense, but over time we may start to work with other professional service providers as well. So I do want to talk a little bit more about the product building and how you guys think about evals, how you think about selecting model providers, etc. But maybe one last point on this is you talk about how PwC is a great partner in designing some of these more custom projects that you guys didn't have previously.

I imagine that that required first a lot of trust because they're giving you very sensitive data and then a lot of open questions that I think anyone building for the enterprise or any enterprise buyers have around how is my data actually being used both in this context? Is it getting fed back to the models? Is it going to go to some of my competitors, etc.? So I'm curious how you guys think about those questions. I think this is an under-discussed topic in enterprise software in general, not just AI.

enterprise readiness goes way beyond just SOC 2. It is...

I think a culture you have to build with particularly your product and engineering teams really from the beginning. And so examples of what we've done really from the beginning, because we started with the hardest customers first. They work on extremely sensitive work across the world. And it's a big thing for them to actually trust a small startup relatively to do that. So a few things that we implement from the beginning is I think one, a strict no training policy for data that's sent.

by default, all our paperwork disallows Harvey from training on that data. But beyond not training,

people at Harvey can't even look at the data. We call this term eyes-off: no one at Harvey can even access most of our customer data because it's such a sensitive set of data. Another part of it is we have a very strict external vendor list. So we're only allowed to use, for example, Azure-deployed models to improve our system and power our product. And it's because, again, Azure has a lot of trust in the enterprise. Like, all

all our customers, they're all on huge Azure deployments. And so they do trust Azure a lot. And what that also means, though, is if a new model comes out, Google, Anthropic, or a new fancy tool comes out on Twitter or something, we can't use it right away. We have to be very strict about that. And I think, again, this goes back to product engineering culture. We really have to make sure engineers understand that you can't actually just use

use the product or deploy it. We're really strict about that. And I think the last thing is we hired a security team very early on. Our head of security, I think, was one of the first 15 employees or something. And he's really helped us develop a really robust security program. And when he goes in front of a CIO or...

a C-level person, they know we are legitimate and we don't sound like a startup, basically. So I think that mix of things has been really crucial to gaining that trust. What is your philosophy around building applied AI products? On the one hand, you get to own the customer, and that's great. On the other hand, there's new fun things coming out on Twitter every single day. There's new models basically every month nowadays. And I imagine that's a very tough foundation to be able to build a consistent product on top of.

Yeah, so I think there's a few ways. There's another question we also often get: how much do you focus on existing workflows and existing surface areas for lawyers, versus a net-new AI-native UX? I think the one thing maybe to highlight is there is no IDE for lawyers. There's no VS Code or Cursor or whatever for lawyers. The two tools that they use the most are Word and email, or basically Outlook. And

We are integrating with both of those on email and Word. But ultimately, we didn't really have a choice to build on top of existing tools or existing software because there really isn't one. And so we really opted for an AI-native UX. Yeah. What does that mean? What is an AI-native UX? Ultimately, one of the main principles is we want Harvey to feel like a co-worker and not just an AI or software. We want it to feel like a human. And

If you're working with a human at a law firm or an enterprise, you can basically talk to them and go back and forth a lot if you give them work. So let's say I go up to someone and say, hey, can you draft me this one-on-one disclosure? If you're their good coworker, they will ask you, hey, I need more information. Can you give me what is the information source? What should I base the format and the tone on? Or what deal are we even doing? And then they may write a draft of it and say, hey, can you check my work? Am I on the right track?

And I think that's really how we want Harvey to feel like is you're going this back and forth and you're being guided to do that work. Is it like a chatbot UI still or what is the actual UI that people are using here? It's kind of like a chat UI with dynamic UI components that are surfaced. And I

I think the other principle that we really want to take into account here is there's this principle called the IKEA effect, which is basically the idea that people feel a lot more responsible for what they do if they help build it. And IKEA really took advantage of this, right? They have, they've really kind of

made the process of building their furniture really delightful and fun and really invest a lot in the manuals and everything. And people... There's like a cult following for IKEA because people assemble it themselves. Maybe nowadays they don't as much, but... They used to. They used to, yeah. And so I think for us, this goes back to...

you can't one-shot an S1 with O1. There's so much back and forth that goes into this actual legal work. It's complex. You need humans, unique data sets, where if we were just like,

hey, you know, draft this disclosure schedule and Harvey did it, no one would trust it because they had no idea what actually went into creating that. And so we want to bake in these like nudges and kind of we call it shoulder taps so that Harvey asks for feedback, asks for data, asks for intent before actually producing the outcome. Can you talk

through like if I'm an individual lawyer, like what does that look like in practice? I know like one of the UX experiments a lot of people are trying to figure out is while the agent is doing work, it'll like expose and I'll tell you what it's doing. But there's also like some level of downtime that happens there. Like does the lawyer get like a little notification? It's like, oh, come back. I have a question. Like how do they integrate that with their day-to-day work so that it's not just sitting there like monitoring the agent? I think one thing

One interesting thing for our user base and our product is that we're not very latency constrained. I think for a lot of chat products or your consumer AI products, most people expect an instant answer. But because the quality of the output that Harvey produces is so good and so human-like,

people are okay waiting two minutes, three minutes, four minutes to actually get an outcome. And because of that, we're able to basically shove more intelligence into every single pass, more model calls, more algorithms. And so...

people can wait and are fine waiting. And we're starting to add basically asynchronous agents, where it'll email you when it's done or ping you when it's done. And so that latency constraint is just not a big constraint for us, which gives us a lot of freedom. And as long as the agent is basically providing some transparency of what it's doing, and it's not just endless spinning, I think it works out for our user base.
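The asynchronous pattern described here, kick off a long-running job and notify the user when it finishes, can be sketched with a background thread; the job and the notification channel below are placeholders for model calls and an email/in-app ping:

```python
import threading


def run_async(job, notify):
    """Run a slow job off the main thread and ping the user when it completes."""
    def worker():
        result = job()    # stand-in for minutes of chained model calls
        notify(result)    # stand-in for an email or in-app notification
    t = threading.Thread(target=worker)
    t.start()
    return t


pings = []
t = run_async(lambda: "Draft complete", pings.append)
t.join()  # in practice the user simply carries on working instead of joining
print(pings)  # ['Draft complete']
```

A production system would use a durable task queue rather than a raw thread, but the user-facing contract is the same: submit, walk away, get notified.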

Do you think we've arrived at the point that we know what is the best AI native UI or UX experience yet? And if the answer is yes, I'd love to know what it is. And if the answer is no, what do you think are the experiments still being run? Or what are the types of workflows you think people haven't quite figured out yet? Yeah, so short answer is no. Chat is the command line of AI. I think when MS-DOS first came out, you were just typing into a terminal to move things around. That's where we are with AI. And I

Actually, I think hopefully in 2025, we see a lot more innovation on AI-native UX, dynamic UX, ways to interact with the model that are not just text-based. First of all, I think what people have to realize is most users, and certainly our users, have very underspecified queries or prompts. It's interesting how comfortable people have gotten with AI, where they just assume

The AI knows everything. We get a lot of support tickets saying, go into my email and search up this thing and produce this outcome. Or do you remember when I talked about this last time? Use that to come up with the answer. I think it's an educational thing, but also I think AI really has to work collaboratively again with the individual to actually extract the intent.

from the individual versus just relying on the one-shot prompt to get it exactly right. So I'm hoping to see more unique back and forths and guidance that the agent can provide instead of just a text-based prompt. I think with enterprises, you actually kind of need this AI native UX even more because the work is so complex and difficult and

oftentimes work is being done by teams of people or humans. And so you do need a more full-fledged kind of natural UX versus I think consumer because the use cases are so varied and because

because there's so many ways to use AI. Maybe the best UI is a chat, right? Because it's so open and you can capture the whole market with just an open-ended UI and it's kind of what we're seeing. So I do think enterprises, there should be more experimentation with AI native UXs because the workflows are so specific because the work is so difficult. And again, never one shot. Yeah, makes sense. Maybe switching gears slightly. I'd love to know like...

To the extent that you guys can talk about it, how do you think about the infrastructure under the hood? Are you primarily using one model? And if so, what is that? How do you think about swapping out models as new capabilities come out, etc.? So as I mentioned previously, Harvey consists of

hundreds of different model calls, basically an agentic or compound AI system to produce the output. And currently, we primarily use OpenAI models, either OpenAI directly or OpenAI through Azure in production. And that's particularly because, well, one, models are really good. Both OpenAI and Azure's infrastructure is really good and fast. And security and customer trust. As I mentioned earlier, people really, really want to make sure Azure is the default cloud.

of choice for us. And that's really how we've been able to gain trust. But in general, we're not really tied to OpenAI. We work with all the major labs already actually to basically eval their products and provide guidance on how they should think about legal reasoning and sharing data sets, sharing insights that we've gleaned. And so

We are certainly open to using all sorts of different models. It's just because of security and infrastructure constraints, we haven't gotten to that yet. Yeah. How easy is it to swap a model for you guys? Because you can have like, because they're non-deterministic, you can imagine like something weird happens. Like how do you run evals on that afterwards to make sure that the experience is still consistent if you do swap out a model? So from an AI infrastructure perspective, again, I think from early on, we've really

tried to emphasize modularity so that we can swap model strings in and out and endpoints in and out. The more difficult thing is actually the eval, as you mentioned. Each model has a different personality, characteristic, behavior. Maybe the same prompts or data for fine-tuning don't work the same way for different models. And so...

Swapping a model in and out does require a lot of eval because we want to make sure it doesn't degrade quality. So have you guys built out internal eval infrastructure to do this? Eval is a big focus for us. You know, I come from Scale. I know human expert data is extremely important to building AI systems. I think there's like

two kind of aspects to eval that we think about. One is basically internal eval to improve our AI systems. And then there's external eval to communicate the value. So on the internal side, we have basically a mix of human experts that we have internally or that we contract. So like lawyers in all different countries, all different practice areas to be able to do all sorts of absolute or relative eval. So absolute is like, look at this piece of content and

and rank it based on this rubric or whatever it is. And then side-by-side is like, okay, look at two different versions that are algorithm and then rank it side-by-side.
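As a rough sketch, the two eval modes can be scored like this (the grading and preference functions stand in for human expert judgments; none of this is Harvey's actual harness):

```python
from statistics import mean


def absolute_eval(output: str, rubric: list[str], grade) -> float:
    """Absolute eval: average rubric score for one output.

    `grade(output, criterion)` is a stand-in for a human expert scoring
    the output against one rubric criterion on a [0, 1] scale.
    """
    return mean(grade(output, criterion) for criterion in rubric)


def side_by_side_eval(output_a: str, output_b: str, prefer) -> str:
    """Relative eval: `prefer(a, b)` is a stand-in for a human expert
    returning whichever of two algorithm versions' outputs is better."""
    return "A" if prefer(output_a, output_b) == output_a else "B"
```

Absolute scores track quality over time against a fixed rubric, while side-by-side comparisons are what you'd run when deciding whether a new model or prompt version should replace the current one.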

And we really invested a lot in that and have kind of scaled that up as we've grown. On the external side, the difficulty is that a lot of legal work is actually applying subjective opinions to objective facts. And judging subjective opinions is very hard. There's certainly no objective truth. Like, did you apply the law in this way, or is your interpretation worse or better than the other? So eval overall, and communicating it externally, is really hard. And then there are just so many legal tasks. If you look at the legal taxonomy of tasks out there, there are almost 10,000 leaf nodes. And lawyers have actually mapped this out. And so I think part of the challenge here is...

How do you communicate to customers that Harvey is good or accurate or doesn't hallucinate, whatever it is? And so we've spent a lot of time on this. We released a benchmark called BigLaw Bench last year, which basically presents tasks that represent real billable work that lawyers do on a daily basis. And it's the first benchmark of its kind:

All public legal benchmarks so far have been multiple choice. I would love it if legal were multiple choice, but legal is not multiple choice. It's very open and messy. And so the benchmark we produced is really saying, here's real work that we know lawyers do, and here's how Harvey performs. And I think one other unique thing we did is that we're not necessarily measuring accuracy. We're measuring the percent of work that the model does compared to a 100% human response. You mean like time is the metric? More like the total work. Got it. So maybe it gets you 85, 90% of the way there to drafting a disclosure schedule, and maybe the human just has to do the last 10%. The reason is because...

If you just frame things based on accuracy, no one wants a 90% accurate agentic system, right? It's not the right framework for communicating value, because even a 90% complete product is still more helpful than starting from zero. Yeah. And then one last question on this front, which is a little bit of a tangent, but I was thinking about it as you were talking about the infrastructure around swapping out models, while doing evals to make sure that the experience is consistent and the product doesn't degrade.

What are your thoughts on the new OpenAI reasoning models? Because I imagine legal is actually one of the use cases that is more reasoning-heavy than a lot of others. Have you seen that to be a dramatic difference? And how has that applied to you guys thinking about which models you would actually want to use? I think that's been a huge unlock for our product and our customers. One nice thing, as I mentioned earlier, is that for our customers, latency isn't a big constraint. The one downside of these reasoning models is that they take time to think and show their thought process and chain of thought, and our customers are already used to that. So putting in these reasoning models has actually been very natural because of the way we've designed our product. And then on the AI side, the models are actually really, really good at

long-form drafting and long-form reasoning. Like, drafting a whole motion-to-dismiss argument by pulling from various different facts wouldn't have been possible before these reasoning models. - Maybe this is getting a little bit too in the weeds, but one of the nice things about seat-based pricing, like you said, is that it's a very clean metric, and usage-based is kind of a clean metric too. For support tickets, a ticket is the unit of measure. How are you defining a unit of work being done for one of these eval sets? Because I imagine people have a hard time grokking exactly what that means, given that this is relatively new.

Yeah, so it's incredibly difficult in general, and it does vary a lot based on the task. But it also varies based on our customers: the way you'd create a chronology for a case might be very different from law firm to law firm. And so we've thought about it like, let's try to standardize the names and the taxonomies of these tasks first, and then devise rubrics. Maybe law firm A and law firm B put the date column in a chronology in a different place, but it at least has the date. Right. And so we've actually developed a whole rubric for each major task that we've evaluated (this is where a lot of our internal legal expertise comes in), and that rubric is unique to that task. We've tried to standardize it, but there is so much variance. Makes sense.
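A task-specific rubric of that kind might look something like this sketch: it checks that each chronology entry contains the required fields, without caring where a given firm puts its columns. The field names and scoring are assumptions for illustration, not Harvey's actual rubric:

```python
def score_chronology(rows: list[dict]) -> float:
    """Hypothetical rubric check for a case-chronology task.

    Each entry must contain the required fields, wherever the firm
    happens to place the columns. Returns the fraction of complete rows.
    """
    required = {"date", "event", "source"}  # illustrative criteria only
    if not rows:
        return 0.0
    complete = sum(1 for row in rows if required <= row.keys())
    return complete / len(rows)
```

Standardizing the criteria (a chronology must have a date, an event, a source) while staying agnostic to layout is one way to get a comparable score across firms that format the same work product differently.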

Has Harvey built its own foundation model? Or do you guys have any plans to? The short answer is no, we have not built our own foundational model. Instead, we've kind of worked really closely with OpenAI to fine-tune, to post-train, to prompt engineer, to do RAG, to build up these agentic kind of compound AI systems. Do you guys...

want to build your own foundation model eventually? And I'm just curious, whatever the answer is, like what was your rationale behind either yes or no? So short answer is no. We don't want to build our own foundation model. I think the compute stats are out there, but it's extremely expensive. And we'd rather... You've raised a lot of money. Yeah, they did raise a lot of money. Billions.

And we'd rather leave that to the experts and really focus on delivering our own customer value and kind of the products around that. Okay, so you guys don't want to build your own foundation model. I'm curious then, as you think about the foundation models getting better and better, you know, a lot of people are like, oh, AGI is almost three to five years away or whatever. Do you view the foundation models as ultimately competitors as they generally get better at reasoning capabilities? There is the ability to do more domain specific things now. We have to assume that the models just get better and better.

And so what does that mean for us? We have to accumulate different types of advantages and not just the model itself. And so a few of those advantages are product, data, network, and brand. So there's UX and kind of the enterprise platform. So I think most people, again, underestimate what it takes to actually deploy products in the enterprise. I think even

even AGI is probably going to underestimate what it takes to go through security checks at a bank. And so again, we've built a lot of these security checks, permissions, audit logging, usage dashboards, all this enterprise and admin functionality that's really required. And companies like SAP, ServiceNow, Workday, they've invested decades in this stuff. And this is why enterprises like them and enjoy them. So I think investing in enterprise platform is important.

UX is also extremely important. As I mentioned, the UX that AI is going to use to collaborate with whole organizations is not going to be a chat-based product. So we need to really innovate on the UX and how you do workflow-specific UX that you can collaborate with AI on. So that's another one. And then data sets, I think, is really important. So AGI is not going to have...

the data that is sitting on some on-prem server at a law firm, right? And this happens. A lot of law firms have on-prem servers. And so what really makes a law firm unique is a lot of the historic deals and cases and data that they actually have. And so we're starting to basically have Harvey be able to use that data and tailor outputs and workflows based on that data.

So I think overall, there are these product and UX kinds of advantages that you accumulate. How much has all the AI zeitgeist, all the things that we hear about coming out weekly, actually permeated the enterprise? And what do you think the latency is between us hearing about something and it actually getting deployed there? Yeah, that's a good question. I think

In a similar way to how Silicon Valley gets information, you know, oftentimes through X now, a lot of our law firm customers get information through LinkedIn. And so the best way for me to understand our personas is actually to look at a lot of LinkedIn posts from our personas and see what they're liking and who they're following, because that's really where the zeitgeist and conversation happens. I think, overall, maybe this time last year, we would actually go to customers and they would have never heard about ChatGPT. Sure, AI, but never heard about ChatGPT. Like end of 2023, beginning of 2024? No, beginning of 2024. Yeah, exactly. They would have never heard of ChatGPT, and they had never even used it. And that was a wake-up call for me, because, again, coming from Scale, I was surrounded by AI for a long time. I'm like,

okay, well, this has not actually permeated as much as I thought. I think fast forward to now, most people have heard about ChatGPT, but, you know, oftentimes people don't use it. I think now if you're in tech and you don't use ChatGPT, you're at a disadvantage. But most of our law firm customers and that world often don't use it, though they've at least heard about it. And then for the enterprises, it's been like two-ish, two and a half years since ChatGPT, so

they have at least deployed some internal chatbot or have purchased Copilot and maybe use it to draft emails or whatever. But

We haven't really seen, even in leading enterprises, and not just law firms, workflow-specific adoption of AI in the way that Harvey is trying to push. So, going back to my bottleneck question, I just think we're so early. AGI takeoff could happen and the law firms on LinkedIn are never going to hear about it for five years. So I think...

That's honestly been a good empathy test for a lot of our team: most people don't know that this is happening. And so that's another reason for a lot more applied AI startups to really, really go into these, you know, quote-unquote hidden markets, because it is just wide open. So then I guess my next question, and you may have already answered it, is,

Have they thought about how their business model or staffing model needs to adapt as a result of AI? And maybe the answer is like, no, because on LinkedIn, you're not seeing people talk about impending AGI. But at least in Silicon Valley, people talk about that a lot when it comes to professional services or billing-based models. Yeah, I think the mindset has...

It actually changes maybe every three to six months, and that's probably the lead or lag time of information. Six months ago, clients of law firms would have basically said, don't use AI on my projects, because of X, Y, and Z trust concerns and risk concerns. But by the end of last year, they were saying, you have to use AI on our projects, because it's going to be more efficient. So this understanding is evolving quite a bit. I think there are...

more leading-edge companies and customers that we've partnered with who have really leaned in: hey, we think AI is going to fully change how our practice works, and we should get on it and try to drive it and control it. So I think there are the more visionary firms who are thinking about this.

But in general, people know something is going to happen, but they don't know what, and they don't know how it's going to change. Neither do we. Yeah, neither do we. AI gets better seemingly every single day, and there are new capabilities and new companies popping up all the time now. How do you think about the next couple of years? Like, if you have any predictions on...

Where do you think most people are actually going to find value from AI in the enterprise in particular? What do you think are the unlocks that still need to happen so that more places can actually see ROI, et cetera? - I think in Silicon Valley, we talk a lot about AI takeoff or AGI takeoff.

The idea being that the model's going to get so good, and it's going to ramp, and then everyone's going to live happily and never have to work again. Yeah, just retire in two years. And just retire. I just think intelligence isn't the only thing you need. You run into human bottlenecks deploying this stuff. You run into your quote-unquote software bottlenecks, like trust, like the ability to work well with the model. And so, and hopefully we see this more in 2025, I would encourage more enterprise AI companies to get really, really deep with their customers and understand their workflows at a pretty deep level, so that they can bring AI to them in very specific ways, build the product and UX around it, and establish that enterprise trust. So I don't believe that at least in the next two or three years we're going to reach AGI heaven. It's going to continue to be really customer-focused builders applying AI in unique ways to enterprise workflows.

There you have it. Another episode in the books. Thanks for listening. And we hope you learned at least a little something. As LLMs in particular and the ecosystems around them continue to mature, the intricacies of building production-ready, enterprise-grade products will only become more important to understand. And it's something we'll keep covering. As a reminder, if you like this episode or if you like the podcast overall, please do rate, review, and generally share it with your networks.