
Designing Reliable AI Systems with DSPy (w/ Omar Khattab)

2024/8/9

Neural Search Talks — Zeta Alpha

People
Omar Khattab
Topics
Omar Khattab: Although large language models make it easy to build impressive demos, building reliable, scalable, production-ready systems with them remains challenging. Their monolithic nature makes them hard to control when building systems, so new approaches are needed to build reliable and scalable systems. Building controllable AI systems requires moving away from ad-hoc methods like prompt engineering toward more systematic approaches that look more like programming. When building AI systems, the focus should be on designing the system architecture rather than on low-level details like model fine-tuning. Large language models excel at standardized tasks covered by their training data, so reliability can be improved by building modular systems that use the model for well-scoped subtasks. In practice, successful AI systems are usually compound AI systems that compose multiple language model calls. DSPy treats prompting techniques as metaprogramming functions and builds modular systems by defining each module's input/output interface. DSPy treats building modular AI systems as a compilation process that compiles high-level code into low-level language model calls. The DSPy approach differs from agent-based approaches: it has the programmer decompose the task into subtasks and define evaluation metrics, rather than handing the LLM a high-level goal. DSPy emphasizes that trustworthy AI systems must be grounded in explicit metrics and iterative development, not in interactions among large language models. ColBERT's modular design inspired the development of DSPy. DSPy's optimizers have moved beyond simple example selection to more sophisticated optimization, such as using models to generate examples and then filtering them. The MIPRO optimizer analyzes the program's structure and successful examples to generate more effective instructions for the language models. Open-sourcing DSPy and receiving community feedback has shaped the project's direction and lets contributors focus on improving specific modules. In the coming years, modular systems will dominate AI, and programming and machine learning will merge. The AGI hype will fade, replaced by a focus on API (artificial programmable intelligence): building programs that exhibit intelligence in specific applications.


Key Insights

Why is it challenging to reliably integrate large language models into production systems?

Large language models are powerful but opaque, making it difficult to control their outputs and ensure they fit into a larger system cohesively. Each module in a compound AI system must interact with others, requiring a lot of dependencies and precise tuning.

How does DSPy help in building modular AI systems?

DSPy provides a framework to build modular systems where each module is responsible for a well-scoped subtask. It introduces programming abstractions and optimization techniques to generate and select effective examples and instructions, making the system more controllable and iteratively improvable.

Why did Omar shift his focus from information retrieval to modular AI systems?

Omar's work on ColBERT and question answering systems like ColBERT-QA and Baleen highlighted the importance of modularity. He realized that modular designs could better leverage the strengths of language models while addressing their shortcomings, leading to more reliable and scalable systems.

What are the key components of the optimization process in DSPy?

The key components include generating initial examples, filtering and selecting the best ones, and using techniques like rejection sampling and best-of-n to iteratively improve the system. DSPy also borrows from hyperparameter optimization to efficiently explore different configurations and optimize the system's performance.

Why is the separation of concerns important in DSPy?

Separation of concerns in DSPy means that the data (inputs and outputs) and the program (system design) are kept distinct. This allows developers to focus on their problem and the system architecture, while the framework handles the optimization and tuning of individual modules.

How does DSPy differ from agentic approaches in AI?

DSPy focuses on building systems that are grounded and controllable, where humans specify the system design, modules, and objectives. Agentic approaches, on the other hand, often rely on a high-level goal and expect the LLM to decompose and execute it, which can be less reliable and harder to trust.

What is Omar's vision for the future of AI and DSPy?

Omar envisions a future where AI models become commodities, and developers use frameworks like DSPy to write programs that exhibit intelligence in specific applications. The focus will be on artificial programmable intelligence (API) rather than artificial general intelligence (AGI), enabling more grounded and iteratively improvable systems.

How does open-sourcing DSPy influence its development?

Open-sourcing DSPy has led to a modular development process where contributors focus on specific components like optimizers, assertions, and internal abstractions. This allows for faster iteration and the integration of powerful ideas without the distraction of multiple use cases, making the framework more robust and adaptable.

Chapters
This chapter explores the challenges of integrating LLMs into production systems, emphasizing the need for reliability and scalability. It introduces DSPy's philosophy as a shift from ad-hoc prompting to more principled, systematic approaches resembling classical software engineering, focusing on building controllable systems that leverage the strengths of LLMs while addressing their shortcomings. The chapter highlights DSPy's paradigm of writing structured natural language programs.
  • Current LLMs are impressive but lack reliability and scalability for production systems.
  • DSPy focuses on building controllable systems through principled, systematic approaches.
  • DSPy enables writing natural language programs in a structured way.

Transcript


Hi, this is Neural Search Talks, post-SIGIR conference, in person from the beautiful courtyard of the William Gates Building at Stanford. And I'm here with Omar Khattab, a final-year PhD student here at the Computer Science Department. Welcome on the show, Omar. Yeah, thanks for having me. So...

Omar is well known for a number of things, in fact, both ColBERT and the DSPy framework. And today we want to zoom in a little bit into the DSPy framework and kind of the recent advances in that. So at SIGIR in Washington, D.C., you gave a very nice talk in the Large Language Models Day. So large language models have kind of taken over SIGIR, right?

You gave a very nice talk in this session about this sort of new paradigm that DSPy started, right? Can you tell us a little bit about what was the gist of the talk? Yeah, for sure. So, you know, the context of a lot of this work is that

you know, language models have made it so easy to build things we just couldn't build before, couldn't imagine building before. Yeah. But a lot of us are quickly realizing that these are really impressive demos, but when it comes to building systems we want to put in front of users, systems that are reliably part of bigger architectures, things that are scalable, things that are reliable, you know, our demos only go so far and they don't really permit us to do this.

And this really matters in practice because people's ambitions have become really high and we want to build these systems that are able to do all sorts of things. They're like the fastest new generation of technology to go from research to production actually, right? Precisely. I mean, and the scope is impressive and the opportunities are truly endless.

But the challenge here is, you know, language models are so inherently... you know, they keep getting better. We just got Llama 3.1 out, you know, there's been a Mistral release, there will be more. But fundamentally, what happens is that these language models are so opaque, they are sort of

kind of by design, and this is really what makes them so powerful, they're so monolithic that controlling them in the way that people need when they're trying to build systems is just really hard. So my work is sort of revolving around how do we build reliable and scalable systems with these language models, and DSPy sort of

comes to that with the angle of, well, if you want to build controllable systems, if you want to build systems that you can improve and iterate upon in ways in which you're essentially more systematically engineering them, then we need to move away from

prompting or ad hoc ways of fine tuning or whatever ways in which we are using this technology towards more sort of principled or more systematic things that look more like programming. More like classical software engineering. Or more like classical software engineering or more like, you know, how a machine learning workflow or data science workflow would have looked, you know, before language models. But of course, you want to do this in ways that leverage the strengths of these models while addressing their shortcomings.

So I think the core of DSPy is on the one hand this new paradigm, right, where you can write natural language programs in a structured way. Right, right.

Which also is embodied in a lot of other frameworks of chaining together different agents or modules. But I think the DSPy approach is also very special because it has this optimization mindset. Right. So the main thing is

you don't want to be in the mode of thinking, how do I coerce this language model to do my task? You want to be thinking- - Don't do this, don't do that. - Some of this eventually may be useful or important or not, but these are lower-order things. These are like hyperparameters you tune after you build your system. The most important thing is building your system. So setting up the architecture. And the key idea here, or one of the key ideas here, is really,

These models are really powerful at tasks that they have seen and tasks that are standard enough that they were trained on.

And so we can sort of leverage that by building modular systems where these models are responsible for well-scoped individual subtasks. And we compose these modules into bigger systems that sort of just call out to the model to approximate or solve the subtasks individually. And what sort of,

helps us go from just saying that to actually doing it is a series of programming abstractions or new algorithms that sort of make this whole thing cohesive. So, like, it's one thing to say you want to build modular systems with language models, and then you sit down and you write, like, you know, five separate prompts. And it's another thing to build an actual modular system with language models. And so there's been a lot of progress in the last couple of years.

And if you look at sort of systems that really make it into production in a satisfactory way, they're generally like this. It's what we call these language model programs or compound AI systems that are making so many little calls to language models and sort of composing them carefully in code.

And, you know, there are essentially two ways to go about building this. The standard way is to say, well, you know, here is my system architecture. Here is my diagram. I want to build a system that, you know, is going to

answer user questions, or is going to autocomplete text, you know, for people writing code, or is going to do any number of those things. And so I have the following modules. And then for each module, you sit down and say, well, my input... Yeah. And you sit with a model. Exactly. Instructions like two pages. Right. You write two pages of instructions. You iterate on each and every one of these prompts. And the challenge here is that

getting the model to sort of perform the specific task that you want is hard, but it's a lot harder in these compound AI systems because each of these modules has to interact with the rest of the code. It's not enough that it's giving you valid output. It has to give you the output that the other modules expect. Yeah, there's a lot of dependencies. There's a lot of dependencies here and the whole system has to be cohesive. So it is actually not that hard in practice, if you know what you're doing, to get a model to sort of perform a specific task. Writing prompts

is challenging, but it's not that challenging. What is challenging is maintaining a system of several prompts that sort of are supposed to interact in a bigger system. So DSPy introduces a few ideas in this space. One of those is prompting techniques sort of become essentially metaprogramming functions or sort of

function generators. So you give them a signature, like a function signature that just defines the relationship between inputs and outputs. So you say, I want a module that can take questions from users and generate search queries. Or I want a function that can take

maybe a goal and a set of tools, and maybe I want it to select the right tool to call. And this is the full description of the interface of the module. Now, how the module gets filled in, that essentially defines a kind of empty, unspecified implementation, but we've defined the interface. So you can specify many modules in this way, and you can say, well, for this module, I would like,

I would like it to be implemented in the form of a chain of thought. So that's a particular prompting technique that sort of in practice, people usually sit down and write examples of here is how the model should reason about this particular type of input to generate that particular type of output. But here we just say, well,

for this particular signature, for this interface, I'm interested in getting the model to reason in order to do it. - Right. - For this other module, I have a different interface. Maybe the interface is RAG-like. So take a question, take some context passages that were retrieved in some way, and please answer, give me an answer accordingly.
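As a rough sketch, here is what such interface declarations look like in DSPy's inline signature form (illustrative only; API details vary by version):

```python
import dspy

# A signature declares only the input/output interface of a module;
# the prompting technique that implements it is chosen separately.
generate_query = dspy.ChainOfThought("question -> search_query")
answer_question = dspy.ChainOfThought("context, question -> answer")
```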

And so that's just an interface. And you could say, well, I want this interface to be addressed through, you know, a program of thought. So a program of thought just means this sort of meta flow where models generate code that gets executed. And based on the execution result, a model sort of generates an output accordingly. And so now we're sort of moving and talking about, you know, these interfaces at the level of abstraction that we design these systems at. Yeah.

And you just write your function, you call these modules. - And then the prompts kind of become hyperparameters within this larger system, right? - And the prompts become parameters in this system. But you're basically just describing your control flow with respect to these modules. And well, you could think of this as a normal computer program. You're just writing Python, or it could have been any other language, and you're just calling these fuzzy modules, these unimplemented functions that

have these fuzzy interfaces. Now the question becomes, you know, who should these functions be filled in by, and how? So what's the process that should happen here? And the most important thing is, well, you know,

we do already have a notion and sort of have a lot of intuition about the process of taking high-level code and converting it into low-level code, maybe by generating code underneath or metaprogramming or other tricks. And essentially, it's this notion of compiling the code from a high-level

language to a lower-level language. Except that the lower-level language here is: how do I take these modules that are specified in a fuzzy way and sort of map them into actual language model calls? Which generally means what? It generally means, well, what model should I call? What are the weights assigned? Like, you know, are we updating the weights? Are we using them as is? And how are we formatting the inputs into an actual prompt that allows us to call this model?
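Putting the pieces together, a hedged sketch of such a program, loosely following DSPy's introductory RAG example (the retrieval backend is assumed to be configured, not shown):

```python
import dspy

class RAG(dspy.Module):
    """Plain-Python control flow over two fuzzy modules: retrieve, then generate."""
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)  # assumes a retrieval backend is configured
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```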

And normally when you're sort of thinking about compilers or metaprogramming or other such techniques, normally the behavior of the program is well-defined. So the behavior, like, there is correct behavior and there's incorrect behavior, and sort of there are semantics around this programming language that

guard all of this work. Is it essential for the DSPy framework that the programmer, let's say the human designer of the flow, actually breaks the work down into these tasks? And how does that kind of contrast with maybe sort of the more general agentic approach, where you have a high-level goal and

you actually ask an LLM to decompose it into tasks. - Right, so let me finish this thought and then come to this. So in normal compiling, kind of, the program is enough to specify what behavior should happen. And then you're compiling because you want to map to a more efficient language, or you wanna go from a high-level to a lower-level language, or whatever goals you have. And so, you know,

behavior of the function is kind of well defined, but where you have a lot of scope for optimization is usually efficiency. What we're looking at here is cases where no one actually knows how to implement a function that's supposed to generate answers or generate queries that go into a search engine and the search engine is supposed to retrieve high quality stuff. We don't know what the right implementation is because there is no right implementation. And so the missing element here is asking users or asking the developer

what is the function you're trying to maximize? Like, give me some inputs, give me some examples of just the input questions that you want to answer or input documents you want to translate or input reports you want to summarize and give me a sort of a guiding metric, some kind of objective that sort of tells me

tells me when success has happened, or gives me a reward, or sort of gives me a notion of, like, good or bad attempts. - A way to evaluate. - A way to evaluate. So now you've asked, like, well, now the programmer has to specify three things. They have to actually design this system; they have to break the actual function down into modules. They have to sort of figure out

what types of inputs they expect to see. Now, maybe in many cases, 30 inputs or 50 inputs are enough, but you still need some. And they need to think of what's the metric they're trying to maximize.
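In code, those three things might look roughly like this, reusing the RAG sketch above (the examples and metric are illustrative placeholders, not from the interview):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# 1) The program: the RAG module sketched above.
# 2) Some example inputs (with labels where available).
trainset = [
    dspy.Example(question="Who wrote Middlemarch?", answer="George Eliot").with_inputs("question"),
    # ... a few dozen examples are often enough
]

# 3) A metric that says when an attempt counts as a success.
def answer_match(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

# Compile: the optimizer tunes the modules against the metric.
compiled_rag = BootstrapFewShot(metric=answer_match).compile(RAG(), trainset=trainset)
```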

- You might contrast this with much more ambitious approaches that are essentially saying, just give me an objective in English. Like I want to buy a ticket for a flight from San Francisco to DC or something like that. And the system just sort of spawns lots of agents and they talk and they interact. I think the problem here is we wanna enable people to build systems that they can trust. And for that, we need to be grounded. So language models,

talking to language models is an impressive way to generate a very large bill, but it's not necessarily a reliable way to sort of build systems. - Certainly not yet. We've all seen the demos and how they can go. - So the point here is we want to leverage humans and developers and engineers for what they are good at. So they're good at thinking about their problems. We can't replace that. We don't know how to,

sort of abstract away problem solving in the general sense. This is a much, much more manageable sort of machine learning problem of like, well, you sit down, you write a simple initial program, you write a simple initial metric, you optimize,

You see what the system gives you. You're probably not happy immediately, just like any other system you're building, but you know how to iterate and you know how to add complexity in sort of increasing levels step-by-step as you go along. And so we enable sort of people to go back to an iterative development that is sort of grounded in metrics, grounded in feedback. They're collecting data over time. And, you know, this is really important. I see a lot of people that like DSPy who sort of use it

sometimes successfully, but this is still not sort of optimal, where they kind of write a very simple, you know, chain-of-thought program or RAG program. They don't think about it anymore. It becomes a black-box agent, essentially. And they sort of, like, expect the optimizers to just magically figure things out for them. That's not how it's supposed to be used. It's essentially a programming language. You are solving your task, not the system. And optimizing is part of your loop. Exactly. The system is just making sure that the parts that

you're not best equipped to deal with, which is figuring out these sort of parameters is delegated outside, but you're still the person designing the system. Right.

So before we go further into sort of these ideas about optimization and the more recent developments in DSPy, I just want to go actually way back. Sure. Because you're now in your final year here as a PhD student. You didn't start your project working on DSPy, right? You were essentially working on information retrieval problems and ColBERT and all these things. So what was kind of the eureka moment when you thought,

wow, this is a cool idea. I need to change directions. I need to start working on this. Yeah, so the process here was I worked on ColBERT in late 2019, my first year, and

If you think of ColBERT, what's really cool about it is that we're using BERT as a language model. So maybe some context: ColBERT is a retrieval model that looks quite a bit different from other retrieval models in the neural era. Most models that use language models or transformers for search, they kind of

use the transformer as an encoder: you give it a document and it spits out a high-dimensional vector, and you do the same for queries. And basically similarity is a dot product. And that's really cool because we know how to scale dot products up to massive datasets. It's very fast, but it leaves a lot of quality on the table because now you're basically asking models

to compress a long document or a long passage, you know, a couple hundred words at least, into this one high-dimensional vector, which is very hard. Yeah. Especially

when you're trying to, like, generalize from a domain you trained on to a domain you didn't train on. - Yeah, so you avoid that with ColBERT, but how does that lead to DSPy? - So you avoid that with ColBERT by essentially building this modular structure, a modular system in which the model is used, you know, to produce representations

that are not just atomic, but they're actually sort of decomposed into pieces that capture individual tokens and the system uses them in an interesting way. So was that already the seed of this modularization idea? I mean, I think it's certainly consistent with it, but the interesting thing is after working on Colbert and sort of seeing a little bit of that modularity, although I wouldn't say that was...

a crisp idea at the time. What I worked on next was sort of what was called at the time open-domain question answering. So nowadays we call that RAG, but this is sort of the general problem of how do you answer questions

when the answer is not supplied to you in a context. And maybe in a complicated multi-hop reasoning kind of framework. A very important special case that is multi-hop, where basically the problem here is you're asked a question, and in order to answer it, you resort to retrieval. So the model doesn't remember everything, and even if it does remember everything about the past, the world keeps changing, and you want systems that can cite their sources, and there are so many reasons that sort of motivate us to...

build systems that are now modular. So now these systems are going to retrieve first and then use those retrieved elements as actual text that goes again into the model, and the model downstream generates answers accordingly. So I built a system called ColBERT-QA. And a lot of the work there sort of thinks about retrieval and generation, but you immediately see the notion of modularity directly here, where, you know, you're saying,

You have a module that is responsible for generating answers and you can optimize it for that, but you have a separate module that's responsible for actually finding relevant sources and you can optimize it for that. So then I worked after that on a system called Baleen that was, as you said, for multi-hop sort of reasoning. And multi-hop reasoning is kind of interesting.

It's kind of a problem that is designed to encourage a specific sort of solution, but that sort of solution is a lot more general than the specific benchmarks seeking it. So multi-hop question answering is basically saying: what if you want these systems that are generally retrieval-based to answer questions that require combining information that's usually not found in any specific context? So language models, what makes them so powerful

beyond Google-style search, which is something that we've had for decades now, is that they can sort of synthesize, in their parameters and in their activations during execution,

a lot of nuggets from different places. So if we are going to move towards a modular design where we have a retriever and a generator, well, we better be able to also answer these types of complex questions. So a system like Baleen, what it does is it sort of takes complex questions, essentially it breaks them down into representations of simpler questions, it retrieves stuff,

and then it actually reads those documents. It goes back into the system. The system retrieves more stuff that sort of fills in the gaps of knowledge that it's accumulating. And it's using all of these as intermediate clues to sort of build up a bigger puzzle. And so that is now really an instantiation of a truly modular design. Baleen in particular had like seven or eight models at least. So, models that were retrieving, and essentially in doing that, they were

kind of producing queries, models that were summarizing. So like, you know, you retrieve a hundred documents or you retrieve 20 documents, but you can't possibly

you know, keep accumulating those documents. So you take these 20 and you produce a summary of what is actually relevant here. And then you sort of generate more queries and you proceed. And so, like, you know, this process has modules. - So you were trying different things and then you thought, this is not a great programmer experience. I need some sort of language to declaratively iterate over these different systems. - Yes, yes. But so much else was happening. So this was before,

This was before GPT-3 had public access. And so, like, GPT-3 was, you know, not instruction-tuned yet. It was really not competitive with systems we could fine-tune. And so Baleen was a system that was fine-tuned. And so it had, I can't remember the exact number, but I wouldn't be surprised if it was nine; you know, you had to train nine different models together. Between the retrievers, I think we had to train four retrievers, and, you know,

at least one re-ranker and at least one summarization module and then an actual question answering module. All separately. Separately. But for it to work well, you had to train them on data that was in domain enough. Like, you know, you can't,

If you want to train a re-ranker, you want to make sure that it's trained on the distribution of outputs you're getting from the retriever that will be deployed with it. You need this end-to-end optimization effect, basically. You want to approximate an end-to-end optimization, but it's actually hard. It is too hard to actually make it end-to-end trainable, so you approximate it in a workflow. And this is really hard because

basically, most of these modules are trying to do things that are intuitive and make sense, but no one will give you data for these. There are datasets like HotpotQA or HoVer (HoVer is really cool) that support multi-hop question answering or multi-hop reasoning workflows, but they certainly

could not have anticipated all the possible modules you will build. So no one is going to give you training data for all of these individual modules. And so the difficulty of building a system like Baleen, I mean, in hindsight, it's pretty clear what modules you want, but, like, how do you generate the training data that will allow you to train these modules well together? So Baleen was a specific, you know, here is a,

very, very powerful but specific strategy for generating the data and getting these modules to work together. And I remember back then I was still thinking about models, not programs. So when you think about models, you say, well... and this is still the pervasive way

the overwhelming majority of the language model community, the NLP community, and beyond in general, you know, the way we think of these systems is that we have a bunch of models. So when you think of it this way, you know, it's really hard, because now you have basically all these scripts that are training models, and then you're just connecting these models as an ad-hoc thing at the end of the day. So the DSP and DSPy sort of motivating

kind of goal was to basically say: what if the focus is shifted from the model to the system? What if I'm not building nine models in Baleen, but I'm building the Baleen system? - And then scripting them together... - Yeah, let's actually think of building the system itself, the Baleen architecture itself, and then the modules become very similar to layers in neural networks. Like if you're building, if you're,

sort of building ResNet, or you're building kind of AlexNet, or you're building a transformer, BERT or something. It has 12 layers of attention.

You don't think of... I mean, you just compose these layers. You want to have the right abstraction, where you're composing these layers, and you want to specify some kind of objective, and you want to have some optimizers that are doing the right thing under the hood, and, you know, maybe better optimizers come along in the future. But at the end of the day, the idea is you want to,

have a data-driven way of sort of learning how to fill in the blanks or how to update the parameters of these systems. Yeah. So a system like Baleen that was, you know, several thousand lines of code thinking about models could be expressed in general in like 15 lines of, well, I have these, you know, nine modules. It's a lot of modules. Yeah. But each of them could be one line that just tells me, well, I should take this and give that. Yeah.
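For reference, a hedged sketch of what that kind of 15-line version might look like, modeled on DSPy's simplified-Baleen tutorial (the hop count and signatures are illustrative):

```python
import dspy

class SimplifiedBaleen(dspy.Module):
    """Multi-hop QA: alternate query generation and retrieval, then answer."""
    def __init__(self, max_hops=2):
        super().__init__()
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.retrieve = dspy.Retrieve(k=3)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for _ in range(self.max_hops):  # each hop fills in missing knowledge
            query = self.generate_query(context=context, question=question).search_query
            context += self.retrieve(query).passages
        return self.generate_answer(context=context, question=question)
```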

And there should be a function that's like, here is the loop that composes them, very simple Python. And as long as I can specify a right objective, and as long as I can have optimizers that take that objective and kind of reflect it into the choices made in the parameters of the system, then we will be able to successfully iterate on these systems. And so you could go into a system like this, remove a module, put in another, change the language model that's underlying all of this. Now that was not possible at the time of Baleen. It was not possible because

to synthesize data here was hard. So it was a manual process where you basically needed a lot of heuristics to connect the pieces of the system together. And the heuristics were very problem-specific. What happened in late 2021 was we started getting

language models that actually worked, in the sense that instruction tuning started emerging and we got things like davinci-001 and davinci-002. They were very slow and expensive and unreliable. But basically you started to see that few-shot prompting or instruction prompting could give you systems that

were unreliable, but could do just about anything you could describe in English. And like few-shot prompting came up really big, right? Right, right. Just provide a handful of examples. Yep. And so I've talked to a couple of people before this interview and they're like,

DSPy, you know, great new paradigm to build these kinds of systems. And a lot of people still think it's about, like, selecting the examples, right? That's like a preconception from the original work, where the optimizers were very good at that. So a lot of the benefits came from selecting the right examples.

but you have progressed so much further, right? So more recent work, I want to go a little bit into these optimizers. Yeah. So what kind of things can you optimize and how does it really like play along with the power of these very recent language models and the way they can follow instructions and be very like precise about those instructions? Right. So

The most important thing is, if you're building a framework, DSPy or otherwise, that needs people to give you examples for these modules, you're not going to go very far. And the reason is the whole point is to enable people to iterate on these system designs, which means that they'll take out modules and they'll throw them away. They'll add other modules and they want to be able to essentially recompile and see what happens. They need to be able to iterate. Now, the examples that are needed or necessary are

at the level of the function as a black box: what are you taking in as an input, and what do you expect as an output? The internal structure should be entirely sort of independent of the data that you bring. 'Cause the data is about the task, the problem, and the program is about your solution. So there's a bit of a separation of concerns here.

So what Demonstrate-Search-Predict, the precursor of DSPy, does in the Demonstrate step is the notion of: can we use the actual program that you've given us to generate examples? And if we could generate those examples successfully, can we basically then look at selecting among them? So that was sort of in the original DSP. And that is,

far more powerful than most people sort of realize. I mean, obviously I'll discuss more recent optimizers, but you know, this is two years old, but basically, - Ancient by today's standards. - Yeah, ancient by today's standards. But this is incredibly powerful because what you're saying is I'll take the inputs you give me, I'll sort of make a guess about the initial sort of configuration of every module, which means I'm going to keep the parameters of the model unchanged. And I'll just guess an initial template that sort of says,

the signature you gave me, I'll convert it to this prompt. And basically, you can then sample through your system by running the function. And every time there's a language model call, you can sample outputs, and you can sort of chain them together, and then run your metric, and basically do

things like rejection sampling. So, like, if the answer is sort of liked by the metric, you keep it, you keep the whole trace; otherwise you throw it away. Or things like best-of-n, you know, if you have many, you could run it many times and keep the best one if the metric is continuous, things of that sort. These traces, now, what they give you for free is examples for every module along the way. And the question is, are they correct? Are they, like, you know, useful? Is it like a test case that is actually...

representative of the core distribution. - Right, a lot of people look at this and then they say, well, I mean, you generated some reasoning, then you generated the search query, then that search query led to some passages being retrieved. Those passages were then used to generate an answer in the RAG system. Were any of these intermediate steps correct?

And once you start thinking of these as parameters, it doesn't really matter if they're correct per se. As long as they're helpful. As long as they're helpful. And so not every example is equally helpful. The space becomes so high-dimensional and expands so exponentially

that you don't really have an idea of where the helpful support points are, right? Precisely. So the first set of optimizers were all about how do we generate enough of these examples and how do we figure out which of their combinations should be plugged into each module. And the surprising thing is it is actually really hard to beat random search. So if you generate a lot of these examples and you try basically random subsets of these combinations,

bootstrapped demonstrations is what we call them, you can get really far. Because basically, the average trace you're generating is not very effective. But for some reason, some of the more extreme ones, when you plug them in, raise the average quality substantially; it's very powerful. Maybe just to make it a little bit more tangible for our viewers. So

what does random search in this case mean? Because you're actually tweaking the prompts with this search process. So can you give an example of what would be a step in that search process? Right. So the first thing is you have a program, you make an initial guess in the system about what the prompt should look like. So that's

There are two ways. Basically, it's either just a template that takes your signature and plugs them in, or you ask a model, just give me an initial-- - So let's say for RAG, you have something like, here are a bunch of retrieved documents, find the answer to them, stick to the facts, only-- - Well, that's too much work. It might just be if your signature is, I'll take a question and some context and generate a passage, you tell the model, I have context,

a question and, sorry, I'll take context and a question and generate an answer. - Yeah. - You have the model, given context and a question, generate an answer. There's no, like, no customization. - Yeah. So how do you kind of evolve that into more sophisticated instructions? - So now the idea is you can take the question inputs that you have, run them through the system, meaning they will run through this prompt and run through the retrieval and run through the rest of the program. - Yeah. - And then you have a way of filtering, you know, was this actually correct? Like, did I get the right answer?

This allows you to keep those inputs and outputs of every module as demonstrations if the process is successful. So the random search is basically saying, well, I have so many of these; if I grab five of them randomly and plug them into this module and plug those into that module, and then take this as an updated program under the hood, and I evaluate it, I validate this whole thing.
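DSPy packages this loop as the BootstrapFewShotWithRandomSearch optimizer; the idea, sketched with hypothetical helpers (attach_demos and evaluate are not the library's internals, and bootstrapped_demos stands for the successful traces just described):

```python
import random

# Sketch of random search over bootstrapped demonstrations.
# bootstrapped_demos: traces that passed the metric (rejection sampling).
best_score, best_program = float("-inf"), None
for _ in range(16):                                   # a handful of random candidates
    demos = random.sample(bootstrapped_demos, k=5)    # random subset of successful traces
    candidate = attach_demos(RAG(), demos)            # hypothetical: plug demos into the modules
    score = evaluate(candidate, valset, metric=answer_match)  # hypothetical validation helper
    if score > best_score:
        best_score, best_program = score, candidate
```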

What do I expect it to perform like? Do I see that it's performing on average pretty well, or do I see that it's performing even worse than the zero-shot version or otherwise? So you're plugging in those examples that you generated. So this is the simplest thing you could do in order to sort of

maximize that function given examples that you're building. That's a very predictable dimension of variation, right? You're just generating examples which all work, they just work to different degrees, and then you select the best one. And when you look at tasks, this can take you, like, basically this random search process in many cases can take you from like 20%

to like 50% to 80% on the same task. So what is actually being conveyed by these different combinations of examples is pretty interesting. But the amazing thing is now that you've built those examples, your toolkit is much bigger, which is what a lot of the more sophisticated optimizers do. So the first thing that is like pretty easy you could do here is, well, if you generate enough of these examples, you can train the model on them. Like instead of showing them

to the model in the prompt, you can basically go to the model and do supervised fine-tuning where you say, given this input for this module, this is the output that worked in the past, so you should generate more of that. You could do something like DPO or such. We don't have that in DSPy. Currently, there's a lot of sort of related ongoing research where you say, well, given this input,

this output performed better than that output, and you basically sort of optimize as such. Another thing you can do, which is what MIPRO does, is you look at those... Which is your very recent paper, right? It was public... MIPRO is like...

eight months old at this point, but the paper is newer. The paper was kind of a retrospective of like, let's look at many of the optimizers. So MIPRO was kind of introduced in the open-source framework and then later the paper came out. Yeah, so we introduced MIPRO in December, or at least released it in December. I mean, people used it successfully, and then we sort of tweeted and told people about it publicly in March, and then we released the paper on it in June. So what MIPRO does is basically

it looks at those examples and it says, well,

how can we get the model to generate instructions that convey the function that these successful examples are conveying. But it actually goes a lot further, because it studies the program itself. Like it looks at the structure of your program and it tries to figure out: what is this program trying to achieve? And what is my module here? What is the role this module is trying to play in the program? What are the examples sort of conveying? And through that, it generates a lot of

instructions. And, you know, so it's like kind of a meta-level understanding of the overall end-to-end task, right? So trying to understand the end-to-end task and the role of each specific module in it, so that the instructions it generates for each module... You know, language models are really good at brainstorming. They're really good at sort of

There's a really cool paper, led by a student here, that I'm also part of, called STORM. And the name STORM comes from the notion of brainstorming. So language models are sort of really good at brainstorming, but they need to be grounded. You need to give them what to brainstorm around. And the other thing is you need to filter all of this brainstormed,

sort of low-quality stuff for the occasional high-quality things. So MIPRO does exactly that. It has notions of how do we ground the language model proposals in the actual context of the program and in the context of successful examples. But then a lot of prompt optimizers almost stop there. They basically have these notions of let's ask the model to critique and let's use that critique to generate an instruction.

odds are the instruction the model generates is worse than whatever you started with in many cases. So what you want to do is you want to over-generate and you want to test. And the problem here is when you have a program that has five modules and you're trying to think of 20 potential instructions in each and for each of them there are 20 potential few-shot sets that we generated,

it becomes a massive discrete search space. It's a lot smaller than the search space of, sort of, all possible strings, but it's pretty large. So the nice thing here is that one of the easiest things we can do very successfully is borrow from the hyperparameter optimization land. Basically say, like, well, you know, I have a very expensive function to optimize. The function is: plug things into my program, run it on a validation set, sort of, with some inputs, and check the metric. Very expensive.

So you want to do what is called sequential model-based optimization. You want to essentially be, in some sense, training or sort of optimizing some external small model

that tells you: if you combine these selections, here is what your system will perform like. Here's what the expensive evaluation function will perform like. - So it becomes a little bit like reinforcement learning in some sense, right? - It becomes like reinforcement learning, but it's a lot more discrete and cheap, in the sense that RL is like, if you're optimizing a policy, really, I think in that case,

somewhere between reinforcement learning and meta-learning. So in that case, it's a really expensive and finicky process. But in general, things are inching towards more and more reinforcement-learning-like things. And when you're optimizing the weights in a DSPy program by updating them, I guess this is a little technical, but basically the DSPy optimization problem is two RL problems.

One is how do I update the policy? How do I update the model that's making the choices in the modules? So how do I get the language model, which is operating as sort of my agent or my policy and taking actions, which is generating strings that are part of this program, how do I update it to maximize expected returns?

When I say something like you generate those examples and you do rejection sampling and you train on them, rejection-sampling-based fine-tuning or expert iteration or STaR or these types of ideas, that is basically a

kind of a poor man's reinforcement learning, is one way to think of it, in the sense that, you know, it has analogies to REINFORCE and the like. But you could easily see extensions of this that do PPO or DPO or the like as well. Now, of course, the closer you go to things like PPO, the more finicky the whole process gets, because hyperparameters become very important. And, you know, in general, it's currently a process that

generally has only been achieved by engineers sort of tinkering with them. Can we get reinforcement learning, deep reinforcement learning methods that just work out of the box, and use them to build optimizers in DSPy? That's an open question. I think, frankly, we will be able to, but it's still an open question. Yeah, I mean, you really opened up a whole new box here, right? Right, right.
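For reference, the optimizer discussed in this stretch of the conversation ships in DSPy as a teleprompter; a hedged usage sketch reusing the earlier program and metric (class and parameter names vary across versions):

```python
from dspy.teleprompt import MIPROv2

# MIPRO proposes instructions grounded in the program structure and in
# bootstrapped examples, then searches instruction/demo combinations with
# a sequential model-based (Bayesian) optimizer over the validation metric.
optimizer = MIPROv2(metric=answer_match, auto="light")
tuned_rag = optimizer.compile(RAG(), trainset=trainset)
```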

One of the things that I was thinking about is this quote, I think it comes from somewhere at MIT, which says: a good thesis answers an important question in the field, a great thesis puts a new perspective on the whole field, and an excellent thesis kind of defines a whole new field of research. So I think you're definitely somewhere between those last two. And so, yeah,

It really seems like you've opened up this kind of new way of thinking about these systems as coherent ensembles, kind of in the old Marvin Minsky sense of the society of mind, where different agents are interacting and the whole behaves, nicely, as a mind.

Right, right. But we get to do that while thinking about it in new ways. So what is the problem with modularity in general? I mean, a lot of old AI was highly modular, and there were systems that were built very thoughtfully, and, you know, in some sense they didn't solve the same problems that deep neural networks solved. And this is something I think about often. So,

I think you want modularity at the right level of abstraction. So when you think of humans and you think of the way in which we recognize that this is a nice tree, or this is a flower, or this is sort of a dangerous place to enter or a safe place to enter, all these reflexive decisions are happening without premeditated planning or without sort of decomposed thoughts there yet. And so I think basically,

modularity has to happen at the level of abstraction where, like, we've already more or less solved these reflexive, System 1-type things. - Yeah, that's basically the single model kind of- - That's basically the single model. And so a lot of the attempts at modularity for some problems, not for all problems, just couldn't get far enough. If you're spending all your time thinking about building parse trees,

guess what, humans talk without thinking of parse trees. So that doesn't mean parse trees- - I hope that's true still. - Well, so in some sense that doesn't mean parse trees are not useful. - I think here on campus there are still a few people who would disagree with that. - I actually don't know, I'm not sure. - At CSLI or something like that. - I think we're all much more aligned. But so, I'm not saying parse trees are not useful. I am saying that

getting successful parse trees is clearly not essential to successfully mastering language. And in some sense, it ends up being a distraction if you're thinking about modular systems. But being able to sort of divide the complex task, like writing a research paper, into, well, I need to find sources, I need to draft it first, I need to iterate on my draft, you know,

basically having multiple passes and multiple steps, is so fundamental that no model we will ever train will sort of write general, open-ended articles about anything you might think of in one forward pass of a neural network. Now, these sort of recursive approaches might generalize, and the structure itself might be self-generated by the model, so there might be

it's maybe scaffolding its own architecture, but at the end of the day, sort of like a simple forward pass through a neural network is only really learning things that you have to compose on top, you know, as

you know, functions or modules, in order to really maximize what you're doing out of these systems. And so that's the bet that DSPy makes. I mean, something that I haven't said in public too many times because I wasn't asked about it, but, you know, that I said a lot in private talks, is, you know,

The goal of DSPy is not that you're specifying five modules, so we will actually optimize five models. It is very possible that you say, I want a retrieve module and a generate module, and I want the retrieve to pass into generate. And under the hood, the optimizer or the compiler decides, actually, I have one of these neural networks, one of these models that can actually do the two together.

And so it will fuse the calls. It's the same thing as what compilers do when you say, I want an add and I want a multiply in the context of a matrix multiplication, and they actually compile it into a single instruction that's multiply-and-add. So I think the same notion applies here: models will sort of give us bigger, more coarse-grained, and more powerful, you know, instruction sets, if you will. And

the abstractions, the code we write, will look the same, because smart compilers will go from the level at which we want to describe our programs, you know, to the effective ways in which we want to teach our systems. So I described the reinforcement learning problem that sort of DSPy makes

a lot easier because it has those declarative modules. Basically, you could have thought of the same RL problem 10 years ago, but the reason it would have been so intractable is that you couldn't have sort of warm started each module by basically taking a signature and guessing an initial prompt and sampling decently successful patterns out of it. So that's a really key thing. But there's actually a separate learning problem, which is kind of a bit

a little bit more meta-learning on top. It's really the optimization problem where you're not trying to learn the policy inside the program; you're trying to learn the policy that decides what choices to make in order to optimize the system. And so here, this is a much more expensive problem, because you're not just getting a signal from every data point, you're actually getting a signal from, basically,

in principle, like, you know, the entire validation set or samples of it. But this is a much more powerful space because, you know, now you can kind of see the whole process on top and basically learn across tasks and say, like, you know, learn across models. Like, hey, I'm optimizing for GPT-5, but I have seen

20 previous models, and so I can kind of say, well, this is a bigger model, it's a more powerful model, and so I can extrapolate. You could have, really, this system be an ensemble of many different models, maybe, which have different strengths or which are at different levels and can kind of support each other and all that. Right. So I'm curious to hear your thoughts on this sort of

end-to-end learning, right, where the main problem is kind of credit assignment, right? Like, which model gets updated. So there was this recent paper, also coming out of here, called TextGrad, which was clearly inspired by your work. Is that kind of the direction where

you see this thing going. I thought it was kind of metaphorical. It introduces kind of backpropagation of gradients notation, which was not really like mathematically accurate.

Correct. I think they're definitely thinking in the right level of abstraction. I've seen some people sort of dislike, you know, some prominent people dislike the analogy. I think it's an amazing analogy. I think they did a really good job conveying it. I think the abstractions are great. So let's see.

I think sort of the notion of back-propagating errors across this whole non-differentiable system. In language. It's an idea that's here to stay. I think it's a good idea. I think on its own, it's pretty weak. And the reason is,

to back-propagate errors, you need to know what the errors are, and to know what the errors are, you need to ask a model what went wrong. And I think the problem is models are very good at sort of giving you canned, general, diverse (in principle, hopefully, if they're not badly tuned)

suggestions. They're good at brainstorming. But not at surgical precision. Yeah, but if you want to be precise and you want to fix the actual problem, models are not there and I

I haven't made up my mind if they will get there, or if this is a very fundamental type of limitation. I think something about it is fundamental, although we can train them enough that most tasks end up being addressed with precise enough feedback. So the problem here is you're at the mercy of the language model critiquing the right thing, taking that critique to make the right proposal for fixing, and making the right proposal for what to propagate up the stack.

That is highly limiting. So what you actually want, and I think where the future is headed, is you definitely need a component where the model is actively brainstorming what went wrong. But then, you know, we have some optimizers that we've built here that take this insight and do a whole lot more with it. But basically, you really need to make things driven by actual experimentation. So,

you're sampling from this distribution of hopefully getting some decent suggestions, but unless you've tried them, you can't trust them. And so this is what something like MIPRO is really good at. It's good at efficiently exploring many different suggestions without exhaustively testing everything. And I think the moment you start realizing this is the moment you really think of, well, if you're building those systems by hand,

there's just no way you're exploring this massive space of, like, how do I build examples to use them in my prompts, and then to use them to fine-tune my system, and then to use them to generate instructions, and to explore those for, you know, new models as they come along, when you're starting fresh all the time. It really has to be an automated process. And so I think...

seeing things like TextGrad is very encouraging, to see that, you know, other people are picking up this optimization problem and are coming up with creative approaches. It's this kind of paradigm shift, right? Right, right, right. And nobody knows yet exactly where it will go. We don't know, yeah. But it's impressive that you kind of inspired this paradigm shift with DSPy. I think that's fair to say. So I want to shift the conversation a little bit more to practical stuff, right? This is all research. It's...

fascinating intellectual endeavor. But really, when you talk about engineering, the code matters. And so you've open-sourced DSPy. First of all, I'm kind of interested how open-sourcing it, and receiving feedback from the community and contributions to the code, has kind of influenced the direction that you're taking with this project. I think what makes this manageable and successful is that

most contributors are contributing in a highly modular way. And so what I mean by that is no one is, like, working on DSPy as a whole.

At this point, that even includes me. So no one is really working on DSPy actively. What people are working on are specific parts of DSPy. So there are the- To make it work for their application. Well, actually, no. To make specific capabilities or components better. So we have the DSPy optimizers team. I mentioned Krista, Michael, Dilara, Josh, whoever...

Am I missing anyone else? These are the main four. So, you know, those folks are absolute experts when it comes to how do we do prompt optimization, how do we do weight optimization in DSPy, and such. There are other folks who are sort of responsible for the assertions side. It's a

highly sophisticated module we have in DSPy. So that would be Arnav, Shangyin, Manish. And, like, they know about assertions better than I do by a large margin. And then we have folks who are, like, in the core APIs of the LLMs and internal abstractions. And so, like, Cyrus, you know, Amir and Kyle and others. And, you know,

things related to that here and there by just folks from the community that are just wonderful contributors. I'm thinking Thomas and others. And so I think what works nicely here is

I think our code is in need of a pretty deep refactor and it needs to be cleaner, but conceptually the separation of concerns between the parts is effective enough that folks can just say: I study optimization. I don't do signatures, I don't do types, I don't do assertions, but I will focus on getting optimizers to work really well. And I think that's been inspiring, because you have those folks who have all these

really powerful ideas, and they're not distracted by ten different use cases that are outside their scope, et cetera. Everybody's really focused on their individual elements, and they're all motivated by the many successful things they're building. Yeah. What's your vision for the future of this field? I mean, I know it's very hard for anybody in AI to have a vision which extends

further than three to six months these days, right? But what's your research vision? Because I think you're choosing to stay in academia for the foreseeable future, right? Right. So what's your vision for a research program that you want to pursue? So I think we are

in a place where not three to six months, but three to six years is visible. Beyond that, I genuinely have no idea. But for three to six years... You're very optimistic. For three to six years, I really do see the shift towards modular systems. I really see

a lot of people reinterpreting whatever it is that they learned from things that have so much value, like the bitter lesson; I think the bitter lesson is fundamental to people familiar with it. Notions of machine learning, of building things, of ideas that scale. And I think a fundamental idea from all of computer science is that you want to avoid

premature optimization. Some people say premature optimization is the root of all evil, and this applies to machine learning just as to anything else. Another thing that applies to machine learning just as it applies to anything else is that you really need modular systems, you need separation of concerns, and you need fast iteration cycles.

Deep neural networks on their own give you none of these three, but through the notions in DSPy, and notions similar to them, I think it's pretty clear that they can serve as the kind of leaves, or modules, that we can

control through optimizers. But at the end of the day, your program is a program; your control flow is actual control flow. So what's happening over three to six years is that a lot of this really comes together: machine learning is one thing, and it's really good at learning, but it's pretty bad at composition. And I think

the idea that machine learning and programming are going to become the same is an ambitious but, I think, very realistic vision, where very few major projects are going to be programmed in anything that is not essentially a language model program. So most large programs will have modules that are fuzzy and that come from a language model,

though "language model" as a term might not even exist by then; essentially it's some kind of deep neural network that understands language as well as other things. Maybe "foundation model" is a more long-lasting term, maybe not, I'm not sure. But this is a fundamental way in which

a lot of people will do this without even thinking that they're doing machine learning; they're just going to be thinking, I'm just writing a system. And so that means that systems will come with metrics, because the only way you can optimize or compile this is if you define what the objective is. So systems will have objectives.

They might not be metrics per se; they might be statements, or what are they called in some other contexts, constitutions or rubrics, or whatever it is, but basically systems will have those statements of purpose. Systems will have a purpose and they will be optimized accordingly.
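To illustrate what such a statement of purpose might look like operationally, here is a hedged sketch in which the objective is itself a small LM judge. The signature string, field names, and score parsing are all illustrative assumptions:

```python
import dspy

# The rubric is a plain statement of purpose; the judge is itself a small
# LM program that scores outputs against it.
RUBRIC = "Answers must be factually grounded, concise, and cite a source."

judge = dspy.Predict("rubric, question, answer -> score")

def rubric_metric(example, pred, trace=None):
    verdict = judge(rubric=RUBRIC, question=example.question, answer=pred.answer)
    # Assumes the judge emits a number in [0, 1] that we can parse; a real
    # system would validate or constrain this output.
    return float(verdict.score) >= 0.8
```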

And the other thing, from the other extreme, is that very few machine learning projects will not take the form of programs as well. So if you're building a system, if you're releasing agents, whatever you're doing, people will think first of the function rather than of the models. Models will become...

I think all of this is already here; it's just that it hasn't been internalized. Models have become, and will become more clearly, like chips. So think of Nvidia, AMD, and others producing GPUs. For most of us building software, or building models, or doing machine learning... I mean, some people and some academics are interested in building better chips, and some of our collaborators do that, and it's very important.

But for the vast majority of us doing machine learning, it's just something that you buy from the experts who keep making it better every year through amazing processes. But it's kind of a commodity, and whoever is producing the best one might have a monopoly for a while or might make a lot of money out of it. But at the end of the day, the process is that you're building software, you're thinking of the software, and you just buy the best hardware you can get your hands on, and you use it.

Language models, or foundation models, are basically going to become the same: really, they are commodities. The vast majority of academics, the vast majority of everyone interested in machine learning, will just license them or buy them or download them or whatever it is. And because of languages like DSPy, you just get the latest one, recompile your code, and expect that it will just work.
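As a rough illustration of that "swap the chip, recompile" workflow, continuing the earlier compilation sketch (so AnswerQuestion, exact_match, and trainset are the illustrative names defined there, and the model name is a placeholder):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Swap in a newer "chip": only the LM configuration changes.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-4o"))

# Recompiling re-runs the optimizer, so demonstrations and prompts are
# re-derived for the new model rather than carried over blindly.
optimizer = BootstrapFewShot(metric=exact_match)
recompiled = optimizer.compile(AnswerQuestion(), trainset=trainset)
```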

Now, similarly to GPUs, sometimes your CUDA code that is supposed to be faster on the bigger GPU will actually be slower. You tweak it a little, open up the hood, and find there was some premature optimization, or a bug in the compiler. Oh no. Yes, and we will need to fix it. But at the end of the day, the paradigm is that models are a commodity. They are just devices. Software is what matters, and

programming and machine learning come together, because we need learning always and we need composition always, and they will keep getting closer. And I think a lot of the hype around artificial general intelligence... I was just going to mention that. Yeah, it will start dissipating. And I hope it doesn't dissipate all of the interest in related things. I make that joke too often, and I think it's becoming old at this point, but I don't work on AGI.

I don't work on, and I don't believe in, AGI, but I do work on API, which means artificial programmable intelligence. I want people to be able to write code, programs, that exhibit intelligence in the applications they care about. It's general technology for

specific intelligence: general tools for people who are motivated. We already have developers; they can already write code, they can already solve problems. What if they could solve the problems they care about by expressing the specs of the solutions, the

declarative scaffolding of the solution, and we just help them map that, compile that, to machine learning constructs under the hood? I think that is where I see the future headed.

That's great. So I think that makes you a humanist rather than an AGI believer, and I think that's wonderful. I think we need a future where AI systems serve humans and not the other way around. So thanks for this interview, and thank you for opening up this new direction of research. It's fascinating. We'll see what comes out of it.

So thanks for watching. This was Neural Search Talks with Omar Khattab, live from Stanford's computer science courtyard. Stay tuned and enjoy discovery.