TrueLaw chose DSPy due to its modular nature, ease of customization, and better object hierarchy, which allowed for more efficient and transparent query rewriting and iterative improvements.
Fine-tuning is necessary for domain-specific tasks in legal AI because off-the-shelf models often lack the precision and quality required by lawyers, who are not skilled prompt engineers. Fine-tuning helps contextualize queries and align the output with the firm's specific expectations.
Embedding model fine-tuning is cost-effective because the models are relatively small and the main cost is generating contrastive data, which is cheaper compared to training large models from scratch.
TrueLaw decided to use SaaS providers for infrastructure to leverage existing services, reduce costs, and focus on their core IP, which is data generation and fine-tuning. This approach is more efficient and scalable, especially for a startup with resource constraints.
TrueLaw chose Temporal for managing long-running workflows because it provided a robust and flexible workflow engine that handled retries, interruptions, and notifications, which would have taken significant time and effort to build in-house.
Shiva believes his broad experience has been beneficial because it allows him to draw parallels between seemingly unrelated areas, apply fundamental principles across different domains, and understand the core concepts that are universally applicable in solving performance and system-level issues.
I am the CTO of TrueLaw. We build bespoke AI solutions for law firms. And as for how I take my coffee: black, no sugar, no milk. I think that's just the way I like it. I do feel like, you know, I'm from Calcutta. We grew up drinking tea.
And we don't drink tea with milk and sugar and all of those things added. I think the actual flavor of tea comes through in Darjeeling tea. I think the same thing applies to coffee, where the milk and all of these additions get in the way. I mean, current American coffee is like a recipe for diabetes, I feel, versus just plain black.
No sugar, just coffee. That's what I mean. What is happening, good people of Earth? This is another MLOps Community Podcast. We're talking with Shiv. And what a leader this guy is. He has done so much in the engineering world. I feel honored to have gotten to speak with him. Before we get into the conversation, I want to play a little song for you. Get that recommendation engine that you have
spiced up just a bit. Maybe you can consider this destroying your algorithm. Maybe you can consider it upgrading it. That all depends on you. The only problem I have with the song we're about to play, which is called Glimpse by Varely, is that it is too short. I put this song on a playlist
that I called the sound of angels' wings, because that's literally what it feels like to me when I listen to it. As always, if you enjoy this episode, just share it with one friend so we can keep the MLOps Community at the top of the podcast charts. Correct me if I'm wrong, but you guys are using DSPy, right? Yes.
So I think I put something out and said, if anybody is using DSPy in production, please get a hold of me, because I would love to talk to you. And you reached out, and it was like, hey, what's going on? Let's talk. Now break it down for me: what exactly is going on, and how have you found it?
So, you know, we were experimenting with prompting as usual, as things were coming along beyond zero-shot prompting. And we played around with LangChain at the time. And...
There's always this notion, and I resonated a lot with Omar's point, that these prompts are brittle: if you change them here and there, things sort of break. And at the time, when the GPT revisions were coming much faster, or we wanted to experiment with other LLMs, there were definitely a lot of changes we needed to make to the prompts to get this working. Things have improved over time, I'm sure; now there are the structured outputs that OpenAI is doing. But at that time this was not so easy, and there were a lot of prompt changes we had to continuously make.
And that resonated quite a bit. I mean, I worked at Databricks before, so I had Matei in my LinkedIn feed, and I think from there I kind of learned about DSPy. I think that's what we call it now; I was calling it Dibby before, so no shade to however you want to pronounce it.
And then I went through a couple of the videos that Omar has put out, and they resonated very strongly, especially in the context of what we were doing: it's modular, and the optimizers make a lot of sense in terms of how you can iteratively improve the product. You still have to go through, and we can talk about that later, the limitation that the optimizers still take a bunch of examples to optimize on. But in cases where you do have that limited set of examples you can show to mimic what you're looking for, I think this is a better approach.
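To make that concrete, here is a minimal sketch of the optimizer workflow being described. This is not TrueLaw's actual pipeline; the signature, metric, and examples are hypothetical, and the exact API surface has shifted across DSPy releases.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Hypothetical setup: any LM supported by dspy.LM could be configured here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class LegalQA(dspy.Signature):
    """Answer a legal research question from retrieved context."""
    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(LegalQA)

# A limited set of labeled examples is enough for the optimizer to bootstrap demos.
trainset = [
    dspy.Example(
        question="Does the indemnification clause survive termination?",
        context="Section 9.2: The indemnification obligations shall survive ...",
        answer="Yes, under Section 9.2 the obligations survive termination.",
    ).with_inputs("context", "question"),
    # ... a few dozen examples, not thousands
]

def answer_match(example, prediction, trace=None):
    # Crude illustrative metric only.
    return example.answer.lower() in prediction.answer.lower()

optimizer = BootstrapFewShot(metric=answer_match, max_bootstrapped_demos=4)
compiled_program = optimizer.compile(program, trainset=trainset)
```

The optimizer uses those few examples to bootstrap demonstrations and improve the prompts iteratively, which is the point being made about working from a limited example set.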
That sort of makes a lot of sense. I also really like the Py aspect of DSPy: it's very PyTorch-like, very modular in nature. So from a programming point of view, it was very modular, and we can go a little bit over the stack on the retrieval side that we have built. It was easy to use different re-rankers, easy to use those kinds of things. So I think that's why we stuck with DSPy. We also use it for generating synthetic data in some cases. Nice. But yeah.
Yeah, we can talk about the synthetic data in just a second. But the burning question that I have is, aren't you afraid that DSPy is...
basically a research project, and it is not necessarily something that you want to put that much faith into? Yeah, I mean, the thing is that we obviously have made changes. The way this works is you clone the repo and build it yourself, and you're making changes along the way, because you obviously don't know when your upstream change is going to get pushed in. In that way, actually, the community has been pretty good; we have pushed one or two changes that got accepted very quickly. But yeah, at the end of the day, it's Python code. You're looking at it and seeing what it's doing, and that is also true for the LangChains of the world. But...
There were certain things, I mean, that was one of the good things I felt about DSPy, because of its modular nature and how it works: you can write your own module and see how it works. Versus, and again maybe I made mistakes when making these kinds of changes in LangChain, I felt like your chain had to go through a lot of layers to make the same changes. In DSPy, we were using our own re-ranker, and it was easy to wrap that as a separate module and plug it into the framework, something like the sketch below, while you're still calling the DSPy retriever. The object hierarchy and the calling are just better and more modular, in my opinion. In general, the nice thing is that this is code that is just running on your servers.
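As a rough illustration (a minimal sketch under assumptions, not TrueLaw's actual module; the reranker callable here is a placeholder), a custom DSPy module that wraps a retriever and re-scores its passages might look like this:

```python
import dspy

class RerankedRetrieve(dspy.Module):
    """Wrap any retriever and re-score its passages with an external re-ranker."""

    def __init__(self, retriever, reranker, k=5):
        super().__init__()
        self.retriever = retriever  # e.g. a dspy.Retrieve instance
        self.reranker = reranker    # hypothetical callable: (query, passages) -> list of scores
        self.k = k

    def forward(self, query):
        passages = self.retriever(query).passages
        scores = self.reranker(query, passages)
        # Keep the top-k passages after re-scoring.
        ranked = [p for _, p in sorted(zip(scores, passages), key=lambda x: x[0], reverse=True)]
        return dspy.Prediction(passages=ranked[: self.k])
```

Because it is just another `dspy.Module`, it plugs into a larger DSPy program the same way a built-in retriever would.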
But our use cases are really latency-bound rather than anything else, and you can enable tracing and basically see the calls that you're making. Actually, one of the things we have done, which has been quite helpful, is that because of the chaining and the hops that are happening, and the query rewriting that happens, we have exposed that intermediary step to the end user. So they get a good sense of what is happening and how the query is becoming more contextualized with the content they're searching, compared to where they started. So in some of those cases that visibility has actually helped the product in terms of what we have done. Yeah, it's fascinating to me, because DSPy is the one
open source tool, I think, out there that does not have a big, well-funded VC behind it, a company that just got all that money from a VC. And yet you see that there's an incredible community, a lot of people, and a lot of energy around it.
But I know that some folks get a little scared because there isn't a company that is in a way giving it that stamp of approval and saying like, yeah, we're going to be the shepherd of this open source tool. Yeah, I mean.
That's a very interesting point. I think you're right: why does LangChain have funding, having gotten started, and not DSPy? There's definitely the first-mover advantage in these things, and how much active development goes on, and stuff like that. I think with
DSPy, my gut feeling here is that because there are these subtleties around the improvements and the iterative inference, the prompting that happens in the back, that's probably not as exposed as in LangChain. In LangChain you're still explicitly calling out, you know, a particular chain of thought or something like that, very explicitly. And again, even these things have become more modular and more abstract over time, so it's been a while since I have used LangChain. But yeah,
with DSPy, when we first looked at it, the learning curve was, I would say, a little steeper than LangChain's. And I don't know if that causes an adoption issue. But ultimately, it is a fascinating thing that these are equivalent systems; they still suffer from the same issues. This is ultimately prompt engineering that is happening, so there are limitations around that. But it is a good point. The main thing, in my opinion, is that typically the way I've seen things get productionized is you have to provide a service. Like, if you were taking this thing
and you needed a server or some state management that you had to manage yourself, that's when a company comes in behind it and says, we are going to take on all the operational headache of you doing this, and that becomes a legitimate service. That's what you see with LangSmith and stuff like that, where the drag-and-drop and the observability of those chains become part of it. I'm sure, if someone invests the time, they could make an argument for the same sort of suite of things on the DSPy side of the world that LangChain has done. But yeah, if you just think of it as a framework and an SDK, a library that you're using, then most people will be like, well, I'm running it myself; all this code and server stuff is running under my control, in my environment, so what is the point of that? So I think you always have to associate a service with this, and you have to think through what that would be, and that probably would justify building out a company.
Okay, so getting back to what you're working on: you have a RAG use case, right? And you mentioned that sometimes DSPy is the interesting route, and other times you fine-tune. And I would imagine other times you don't do much and it's just naive RAG, as they call it. Can you break down these different scenarios and what you found to work best when?
So one thing: we have focused on the legal domain. We build bespoke solutions for, effectively, lawyers and law firms.
And here, precision and quality trump latency a lot. They still want search results to be quick, but it doesn't have to be instantaneous like your Google search experience, right? Yeah, I agree. So we always had to focus on quality in that way.
Also, this demographic of users is not made up of the best prompt engineers. They're not thinking in terms of what is the best way to write a prompt. And what we have found is that, given the way questions are posed by a typical lawyer, it's very important to contextualize that question in order to be able to do a better retrieval. That's where this query rewriting helps a lot, and that's what we are using DSPy for. And then it's very parameterizable, which means you can determine the number of hops you want to do, what depth of retrieval you want to achieve, how many documents you want to see in your top-k, and all of those things. These are, again, important to the lawyer: they're making a call between recall and precision in terms of the amount of data they want to get back.
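Here is a rough sketch of the kind of parameterizable multi-hop query rewriting being described. It is a hypothetical illustration, not TrueLaw's code; it assumes a retrieval backend has already been configured, and it surfaces the intermediate rewritten queries the way the earlier tracing discussion mentioned.

```python
import dspy

# Assumes something like dspy.configure(rm=...) has set up a retrieval backend.

class RewriteQuery(dspy.Signature):
    """Rewrite a lawyer's question into a more contextualized search query."""
    question = dspy.InputField()
    context = dspy.InputField(desc="passages retrieved so far")
    search_query = dspy.OutputField()

class MultiHopSearch(dspy.Module):
    def __init__(self, num_hops=2, k_per_hop=5):
        super().__init__()
        self.num_hops = num_hops                    # trade latency for quality
        self.retrieve = dspy.Retrieve(k=k_per_hop)  # top-k documents per hop
        self.rewrite = dspy.ChainOfThought(RewriteQuery)

    def forward(self, question):
        context, rewrites = [], []
        for _ in range(self.num_hops):
            query = self.rewrite(question=question, context="\n".join(context)).search_query
            rewrites.append(query)  # intermediate queries can be shown to the end user
            context += self.retrieve(query).passages
        return dspy.Prediction(context=context, rewritten_queries=rewrites)
```

More hops and a larger k lean toward recall at the cost of latency; fewer hops lean toward speed and precision, which is the knob the lawyers are effectively turning.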
And because we could play around with the latency, we don't have to rush. I think they understand that the more hops you are doing, the better quality you are going to get with this approach, but at the expense of the results taking a little bit longer. And this is especially true for very domain-specific question answering. For the fine-tuning work,
we have done some work around embedding models too, because that's one of the other pieces there. When you're doing the search and the retrieval, if the embedding model is trained on the corpus of data you're searching over and the kinds of queries you're making, you just get a better match. And again, I'm talking about domain-specific things; this tends to work a bit better. But then there is also the question around the generation part of it, whereby
certain firms have a particular way of seeing the answer, or they expect a certain format, because typically this would have been done by junior associates: you would effectively have told other folks to do this search and come back, and they have a certain way of presenting this data. So that alignment of the generation is something we have also used fine-tuning for.
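One common route for that kind of generation alignment is fine-tuning a hosted model on examples written in the firm's preferred style. A minimal sketch with the OpenAI fine-tuning API is below; the file name and model are illustrative assumptions, not necessarily what TrueLaw uses.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical JSONL of chat examples pairing lawyer questions with
# answers written in the firm's preferred style and format.
training_file = client.files.create(
    file=open("firm_style_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # a small, cheap-to-tune base model
)
print(job.id, job.status)
```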
And, you know, typical RAG: when we started this, we found a bunch of off-the-shelf RAG approaches, and again, on domain-specific things and for the lawyers, in our experiments they didn't perform that well. We always had to do a few different things, either extracting metadata first or basically making the retrieval process much more contextualized in different ways before we aggregated the data. If you were just to use a regular off-the-shelf embedding and do the retrieval, the quality was not there. Well, so there are a few things that I would love to know about, specifically on the fine-tuning of the embedding models. I think you can do that relatively cheaply these days, right?
Yeah, I think the main thing here is that the embedding model itself doesn't have to be very big. You can actually use just the encoder part of these models, and the dimensions don't have to be very, very large. Right.
And the main thing with an embedding model is to figure out the training data. You're basically giving it some contrastive data, and the generation of that contrastive data is the hard part. It's cheap money-wise, but it is expensive in resources, expensive in figuring out how to generate that contrastive data. I think that's the harder part. But in general, even the training and all of these things are becoming extremely commoditized.
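For a sense of scale, fine-tuning a small encoder on contrastive pairs is a few lines with a library like sentence-transformers. This is a generic sketch, not TrueLaw's setup; the example pair and model name are assumptions, and generating the pairs is the expensive step, as noted above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical contrastive pairs: (lawyer-style query, passage that should match).
train_examples = [
    InputExample(texts=[
        "When does the indemnification clause survive termination?",
        "Section 9.2: The indemnification obligations shall survive termination ...",
    ]),
    # ... many more generated pairs; producing these is the hard, expensive part
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small encoder-only model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

# A short fine-tune is often enough to adapt the embedding space to the domain.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("legal-embedding-model")
```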
So we actually don't focus on the training infrastructure per se. We have an orchestrator layer that's very agnostic to where things get trained. But in general, from when we started to even now, the drop in price is significant, and I think that will just continue. I mean, even GPT-4o mini is a very powerful model, and training that is very cheap. Yeah. So you know what I would love to break down: you've worked at a whole slew of incredible companies. You're now CTO at TrueLaw, and you knew coming into it that you were going to be setting up infrastructure that is
itself very prone to change. Technology in general is prone to change, but with AI and LLMs that is taken to the max; it's very volatile, right? Because, as you said, things get cheap, and
a model that is super powerful, all of a sudden it becomes just a commodity overnight almost, it feels like. So as you're going through and you're setting up the stack and you're thinking about which pieces to value, which pieces to try to make future compatible or making bets on what's going to come down in price, how are you thinking through all of these stages and
Since you have that bird's eye view and you're the one who's ultimately the guy in charge on the technology side and you decide with your teams what gets implemented.
Can you walk me through your decision making process and how you think about that? Yeah, that's a very good question. I think like, you know, in general, like, obviously, you know, one of the starting points is sort of money constraint. Like as a startup, we are have to be scrappy. And so like when to build versus buy is always a decision we have to make. And then
you also have to juggle the fact that we cannot take six months to build a feature, because we have to go through that quick iterative cycle of development. And very soon into this, we realized that building a foundation model is very, very difficult and challenging.
Even if it were technically feasible for us to learn and do this, the resources needed are just very different. So at that time we started focusing more on fine-tuning approaches. I think Matei and Omar have a paper around compound AI systems, which is about building a system of smaller language models that you can orchestrate to do a bunch of these things. The analogy is the brain: different parts of the brain do different things, and there's coordination around them. And there's this whole debate about one large model doing it versus a bunch of smaller things coordinating and doing things. Yeah.
And I think we took this compound-system approach, which was more feasible. In terms of the infrastructure, we understood that in order to get the price point right, you need a certain amount of scale to do certain things, right? And it's very hard for us to build out training infrastructure just for the few models we are training, or the volume at which we are training. So from the very get-go, we understood that we have to leverage the
training services, the model training services, that are available. And my previous experience at Confluent, where we were doing provisioning orchestration for Confluent servers or Kafka servers and such, got me some insights into building this as an orchestration service that handles the training. And then there's a data generation pipeline. So if you think of it, the way we have focused, all our IP is around this data generation and how we are fine-tuning things
and incorporating them into the models. In that way, it is not too dissimilar from my previous experiences, which were about data management and orchestrating that data flow. But the way we have always built our stack is that we haven't built the training infrastructure ourselves; we have always leveraged other SaaS providers, using their infrastructure to train this. And this has been a blessing; cost-wise it has been quite efficient.
Partly because we give options on which infrastructure things can be trained on. Of course, we can train it in our cloud, and we can train in the customer's own Azure environment as well. So that flexibility is actually quite powerful. You said something there that I want to double-click on, which is around
how you used to be dealing with data flows and now you're still dealing with data flows. It's just the data has changed a little bit. So I imagine you used to be dealing with event click data, that kind of data flowing around or purchase this person, purchase this user ID, purchase something, et cetera, et cetera. Now you're dealing with prompts, I am assuming, and outputs of those prompts and how you can best manage
bringing the output back into the fold to make sure that you're constantly leveling up the pipeline. Is that what I understood? Yeah. Yeah, in essence, at the end of the day, whether you're doing prompting or fine-tuning approaches, it's input and output in the way you're talking to the LLMs. And in general, yeah, at Confluent, where I was on the data governance team for a few years, it was all about the flow of data, and in that particular case the flow of metadata that you have to worry about: where it originates and how it gets distributed everywhere else. Here it's a similar thing, where you're getting data,
you're generating data, and you're aggregating this data. Of course, this is all unstructured data, so the benefit of LLMs is that you're trying to get a decent sense of what it is. But at the end of the day, it is about incorporating the feedback, moving this data around, and constructing the training set from it. So of course everything is data, and what you're moving and the context of that data changes quite a bit. But the sort of infrastructure that you need, keeping versions of it, being redundant, being able to replay back, that's all the same building ethos and engineering ethos around how you do it. There's a lot of similarity. So there's stuff that you decided to buy and
not build. One obvious one is the LLMs, which, in hindsight, I think was a very good choice, knowing how much it costs to train them and seeing how quickly they are going down in price. That makes total sense. Are there things that you are a little bit more surprised about?
That you went for the buying option or you are happy, I guess is how I could frame it, that you bought instead of built or vice versa. You're happy that you built instead of bought.
Yeah, so when we first started, of course ours is a microservices architecture, and we needed a communication mechanism for how the services talk, some sort of communication mechanics between your microservices. Coming from Confluent, I leveraged Confluent Cloud, but ultimately we built a messaging system between these services to handle a bunch of asynchronous work, something like the sketch below. Again, as I said, most of this work dealing with LLMs is latency-bound, so asynchronous communication had to be part of the stack from whatever we built.
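For flavor, the asynchronous hand-off between services described here might look roughly like this with the confluent-kafka client. This is a generic sketch; the topic name, config, and payload are made-up placeholders, not TrueLaw's actual setup.

```python
import json
from confluent_kafka import Consumer, Producer

# Producer side: enqueue a long-running LLM job instead of blocking the request path.
producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder config
payload = {"doc_id": "doc-123", "task": "summarize"}
producer.produce("inference-requests", value=json.dumps(payload).encode("utf-8"))
producer.flush()

# Consumer side: a worker picks jobs up and runs the slow LLM calls asynchronously.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "inference-workers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["inference-requests"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    job = json.loads(msg.value())
    # ... run the model call and publish results to another topic
```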
And that served us pretty well, until we started getting production use cases of doing massive inferencing over, say, thousands of emails, very large-scale inferencing that takes hours to run, and things like that. In those cases, we are actually using Temporal. I don't know if you know it; they call it durable workflows, and it started at Uber, in terms of how they manage this. Ultimately, again, it is a workflow management sort of thing. And
we made a good decision there, because, talking to the developers, the first pass of doing this used our existing infrastructure. And the loopholes there, the kind of guarding you need to do, retries, how much to retry, what happens when things get interrupted, all of those things you need to take care of yourself, versus using a company for which this is effectively their whole thing. First of all, we're a small group of engineers, and using them to handle these durable workflows has worked out very well for us when running these very long-running inferences or doing training: getting notified if things get interrupted and getting it handled. Of course, you have to write code to make use of it, but I think
we haven't had infrastructure problems related to that, and I'm sure it would have taken us at least a couple of months to perfect that sort of infrastructure for ourselves. And then it would be very custom to just our use case and not very generic, so every time we needed to make any changes to the state machine, we would have had to go back, make those changes, test them, and whatnot.
Versus using Temporal has been very useful for us. It's a workflow engine; we still have to build the state machine, of course, but writing that state machine logic is much simpler, much more additive in nature, and then depending on them to execute it has been much easier for us.
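For illustration, a long-running inference job on Temporal's Python SDK might be sketched like this. It is a hypothetical example, not TrueLaw's workflow; the batch IDs and timeouts are assumptions. Temporal persists the workflow state, so retries and interruptions are handled without hand-rolled recovery logic.

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def run_inference_batch(batch_id: str) -> str:
    # Call the model or inference service here; this may run for hours and may fail.
    return f"results for {batch_id}"

@workflow.defn
class BulkInferenceWorkflow:
    @workflow.run
    async def run(self, batch_ids: list[str]) -> list[str]:
        results = []
        for batch_id in batch_ids:
            # Each activity gets its own timeout and retry policy; state survives restarts.
            result = await workflow.execute_activity(
                run_inference_batch,
                batch_id,
                start_to_close_timeout=timedelta(hours=2),
                retry_policy=RetryPolicy(maximum_attempts=5),
            )
            results.append(result)
        return results
```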
One thing that I noticed about you is that you've worked in many different areas of the stack, we could say. And I think it's more even...
and horizontally, I could say. And I don't know if you agree with this, but knowing your background, you've gone very low level. You've also gone from the data side or the front end, back end, DevOps-y, now LLMs, MLOps-y. How do you see things now? Is there stuff that you can look at and say, okay,
Because I know this, I can draw parallels with certain parts of the stack or the LLM side. I know you mentioned earlier that there are data flows; now it's just LLM and AI data flows, and all of this metadata is the important thing. And you came from Confluent, so you had that data flow kind of in your blood. I know you were also working in many different awesome places, one of which was Apple. And so you got to see the gamut. Have you noticed things that shouldn't necessarily be related, but that you were able to relate because of this breadth of experience?
Yeah, I mean, I've been very fortunate. In certain cases, things have clicked in unexpected ways. For instance, I clearly remember I was at Riverbed and we were doing file systems there. Riverbed at that time was building a deduplicated file system to do compression on your primary file system.
And the thing there is that performance ultimately comes down to how well your data reads and writes happen on the disk and where they get blocked, how much you can parallelize, how you pipeline those data operations. At Riverbed, I remember we used this very macro-based C to get ourselves a sort of version of async I/O, if you think of it that way, but at a much deeper level. They used this notion of, I forget exactly what it was called, R-threads, if I remember correctly, Riverbed threads; that was the idea there. But it was all a macro-level way of doing it, and you had to get used to that framework. It wasn't like normal C; it was written in C, but you had to understand when things would unwind and so on. Yeah.
Surprisingly, when I went to Apple and was explaining what I had worked on in the Riverbed side of the world, I didn't even know at that time that they were working on Grand Central Dispatch, which is their approach to multi-threading libraries. And suddenly they were like, oh, this makes sense; you must have a lot of context around this. And I got started working on it, and I was one of the two maintainers of GCD at the time.
But that sort of correlation, those are unexpected things that happen. I also feel like it gives you great insight: the stack you're working on at one company could be related to other stuff that seems completely unrelated, but at the core they're connected. I remember at Arista, I think it was Adam Sweeney, a very senior principal or a VP at that time, who was asking me questions around circular buffers and things like that. And then he mentioned that all of computer science could be boiled down into seven or eight algorithmic questions, and if you're good at those, you have a pretty good grasp of most of these system-level things. And I think that has been true. If you know buffering, how to deal with memory overflow, out-of-core algorithms, and things like that, you find applications for them in several different places. And I think that has definitely helped. But yeah, I think,
going back to your original question, has the breadth helped in understanding? It has, for sure. There are still areas, like GPUs and how the GPU works, where I don't have very good depth; I never really worked on those kinds of things. But in principle, when I read some of these papers, it's all, again, short-circuiting certain things, caching certain results for quick access. They are all consistent approaches to solving performance-related issues, and that has been sort of universal. One last thing I should add: even at Databricks when I was there, one of the things that got repeated all the time was that
Spark in general is an amalgamation of all these different concepts: let's do caching properly, let's distribute and partition the data properly, let's do immutability, and things like that. That was repeated quite a bit. Are we doing something very revolutionary? Not in any one particular angle, but if you aggregate all of those features, Spark as a system seems very complete in terms of having implemented all of these best practices. And I think that stuck with me. If you look at any particular feature of it, it had basically been drawn from the experience of what the issues with other systems were, and then addressed all of those principles in essence. I think that was part of its popularity. And of course, the DataFrame API was very good to work with, with the declarative idea around it. But yes. So many great teachings. Awesome, dude. We'll end it here.