Starting with templated responses ensures that no generated text is sent to users, minimizing risks like hallucinations or prompt injections. This approach builds confidence in the system before gradually introducing more dynamic elements.
Word2Vec provided a numerical representation of words, enabling mathematical operations like similarity comparisons. It revolutionized NLP by allowing systems to handle natural language more effectively, serving as a foundational tool for early chatbot development.
Rasa combines the natural language understanding of LLMs with deterministic business logic. The LLM handles the complexity of conversations, while a simple, rule-based system manages tasks and state, ensuring reliability and scalability.
LLMs struggle with maintaining consistent state in transactional tasks, such as reserving and unreserving seats. These tasks require deterministic systems to handle edge cases and ensure state consistency, which LLMs alone cannot reliably manage.
The 'prompt and pray' approach lacks control over LLM outputs and requires trial and error to adjust prompts. It is inefficient and unreliable for enterprise systems, where predictable and accurate responses are critical.
RAG dynamically retrieves relevant information from external sources to augment LLM prompts, improving accuracy and relevance. It addresses the limitations of static prompts by incorporating up-to-date and context-specific data.
Rasa uses templated responses by default, eliminating opportunities for LLMs to generate incorrect outputs. This approach ensures compliance and reliability, especially in regulated industries like banking.
Maintaining state allows conversational AI systems to track user interactions, retrieve relevant information, and handle multi-turn conversations effectively. It ensures continuity and context-awareness in dialogues.
Enterprises start with LLMs for understanding user inputs while using templated responses to minimize risks. As confidence grows, they introduce dynamic elements like paraphrasing and RAG to enhance personalization and naturalness.
Dynamic systems, like LLMs, handle unpredictable and fuzzy aspects of conversations, while deterministic systems manage structured, rule-based tasks. Combining both ensures flexibility for natural language interactions and reliability for business logic execution.
The on-ramp is typically, look, let's start with something where we're only using templated responses. We're going to use the LLM for all the understanding stuff, but we know that we're not sending any generated text to users. And we can say with full confidence, we're not at risk of hallucination or we're not at risk of prompt injections and hijacking and all that kind of stuff. We're not going to end up on the front page of the Wall Street Journal for the wrong reasons. And then as you sort of build that confidence, then you start to sort of open up a little bit more.
So it's sort of that confidence building exercise of something that feels very familiar and feels not like a million miles away from the familiar old world. And then being able to kind of open it up rather than just the starting point being, well, GPT-4 is speaking on our behalf.
Welcome back to the A16Z AI podcast, and happy new year. I'm Derek Harris, and this week we have another episode digging into the topic of AI agents, customer support, and specifically chatbots and other natural language interfaces for interacting with what is essentially business data.
It features A16Z general partner Martin Casado and Rasa co-founder and CTO Alan Nichol, who's been working in the natural language processing space for over a decade and has explored numerous angles for applying NLP to business problems. They walk through the history of failed approaches as well as the promise and perils of LLMs. And then Alan explains why his approach to solving this riddle involves pairing the natural language understanding of LLMs with more traditional techniques for executing on a user's request.
The goal is baby steps and building confidence, not rushing into a world where LLMs are making business decisions and creating their own business logic on the fly. And you can hear their whole discussion after these disclaimers.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com/disclosures.
So my background was in physics originally. My PhD was in machine learning applied to physics. And I think physics to ML is sort of a well-trodden path. But anyway, I got more and more interested in it and then started working on the first sort of startup idea, which was a search engine. So kind of ML into search. And then from search, I got interested in how do you do more than just answer a question? How do you do more than just like search for something? Because actually one of the first...
products that I was working on at the time, which is kind of a startup idea, was like English to SQL. At the time, we would have to train a model for every schema. And now you can just like put schema in context, like put it in a prompt and it'll figure things out. So we would have to like train a model for a particular schema. And we had this product and we were selling it to marketing teams. And so they could say, what am I spending on Facebook ads at the moment? And like, we would turn that into SQL and, you know, either show a chart or, you know, a table or something like that.
And you could ask these really sophisticated questions. You could say like, "What was the ROI on all my campaigns by country for the last three weeks?" And of course, you very quickly realize that marketing teams don't speak like that. People who formulate a question like that, they would just write SQL themselves, right? So people would just talk to this chatbot. It was in Slack. It was one of the first Slack chatbots. And people would just talk to it and they would just say, "Oh, how's Facebook?" So you've got to kind of get into this interesting sort of multi-turn question.
I know we're still going through your background, but I kind of actually want to just dig in here really quickly. It's something that I've always wondered, which is: formal languages are formal for a reason, and that is because they're non-ambiguous and you need to be precise. And so I've always just wondered, is no code and low code in natural language actually a thing? Like, does it have to be the case that if you're trying to do something that requires a formal language, then over time you're going to have to actually converge on a formal language? And like,
the no-code world version of this is: it's great, you give me boxes and arrows, but at some point in time, I'm going to need to create subroutines, and I'm going to need to think in, basically, let's call it, recursion. And in natural languages, I'm going to have to use a subset that isn't ambiguous. Like, for example, let me give you a sentence: the dog brought me the ball, and I kicked it.
Well, semantically, maybe I kicked the ball, but maybe I kicked the dog and who knows? And like, you know, I feel like it's strange to me that we even try to map natural languages to formal languages. Do you think this is even worth going down? Do you think that there's actually a happy destination here?
I think it's a very valid question. And for a long time, I've been really bearish on no-code conversational AI. Much to many people's chagrin. Talking to people I look up to in the field who have been at this for a while, it's like: you ship one of these things, and then a customer comes in and says, well, here we need to compare two integers. And you're like, oh, OK, fine. And it's like, oh, here we need to check a string. And you're like, oh, OK. And then exactly as you say, before you know it, you've built a general-purpose programming language inside your UI, and you're scratching your head wondering,
where it all went wrong, right? What I like about the way that we've now cracked the problem is that we just don't do that. We don't say it's no code. We give you a nice UI for specifying your business logic. And if you have other stuff you need to do, you just break out, you know, you color outside the lines, and we make no apologies for giving you access to the code so you can go and subclass something or write a bit of code somewhere. Well,
I love it. Okay, we're going to get back to this a little bit, but back to your background. So you were trying to do natural language to SQL, creating a model for each schema. Yeah. I wrote this essay in 2016, and there was a paragraph in the middle of it where I talked about how deep learning and rigid rules are really unhappy bedfellows, and how do you actually inject a bit of extra logic into a deep learning-based system and everything else. And I kind of wonder...
You know, what if I went back to 2016 me and said, oh, actually the solution to this is to train a really big RNN and then just explain it in English at the start and then just start sampling tokens from there. Anyway, so we were working on that. And I realized that actually dialogue is a very interesting problem. Yeah. Right. And actually having a multi-turn conversation is a very interesting problem.
Ironically, I found out that a lot of the research on that topic had happened in the building where I did my PhD, in the floor below me. And I never knew that any of those people existed at the time, but I discovered that literature later. This was like first chatbot hype, right? 2016 era chatbot hype, which was hype despite a lack of progress in the technology rather than what we're seeing now, where obviously it's tech-led. It was Messenger opened up as a platform. It's very exciting. People are building apps for it.
And we just kind of looked around and we said, look, we've used all the tools that are out there for devs to build chatbots. And they're pretty dire. Like there's a hundred things you can use to build Hello World and there's nothing you can use to build whatever you want to build after Hello World. So that kind of, you know, set us on that journey. And then the first thing we did was we open sourced an NLP
library just for chatbots. So for classifying what people are saying and then doing some named entity recognition, extracting some entities and things like that. I mean, the algorithm of that first version was so simple, right? These were the days of Word2Vec and GloVe. So the way that it worked was it would take whatever the sentence was, it would add up the word vectors, and we just trained a classifier on that representation, right? And what I like to think of it as is, you know, I once met a very famous drummer and he said, my favorite drum beat is the one that goes...
He's like, I make all my money with that drum beat. So all of you think you're so clever playing this fancy stuff — I make all my money playing this beat. And you all think you can do it, but they still call me and they pay me the big money. And that's exactly how I think about training a classifier on Word2Vec. It's such a workhorse...
that got so many people so far and was so widely adopted. And it's actually a pretty hard-to-beat baseline, right? - Maybe just for the listeners, maybe a bit more of a description of Word2Vec. - It was the first time, I think, for many people — at least for me, and I think for a large group of people in NLP — that we really had a nice way of saying, okay, well, we want to do NLP. We want to have a computer deal with natural language, but computers don't really know how to represent natural language. You don't have a way that you can
do things with it, do maths with it, sort it, compare it, see what's similar to other things. And then Word2Vec kind of was the first approach where we said, okay, well, actually, we're going to get you a numerical representation of not every word,
I mean, what is a word really? But a large group of words. And now you can do maths, right? And the sort of classic example that blew everybody's mind was that you could take the representation for king, you would subtract man and add woman, and you would get to the representation for queen. And this was pretty revolutionary, right? So we're like, oh, we have a way to do maths with words. When did Word2Vec actually show up? I think the paper was like 2013. Yeah, that's right. Maybe. And then there was GloVe, maybe 2014, something like that.
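To make that concrete, here's a minimal sketch of both tricks Alan describes — the analogy arithmetic on pretrained vectors, and the "workhorse" of a plain classifier trained on added-up word vectors. It assumes gensim's pretrained GloVe download and scikit-learn; the toy intent dataset is invented, and Rasa's actual early pipeline differed in the details.

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# Pretrained GloVe vectors: a mapping from word -> 100-dim vector.
vectors = api.load("glove-wiki-gigaword-100")

# The analogy that blew everybody's mind: king - man + woman ~ queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The workhorse: add up (here: average) the word vectors of a sentence,
# then train an ordinary classifier on that fixed-size representation.
def featurize(sentence):
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

data = [  # invented toy training set
    ("hello there", "greet"),
    ("hi good morning", "greet"),
    ("i want my money back", "refund"),
    ("please refund my order", "refund"),
]
X = np.stack([featurize(text) for text, _ in data])
y = [label for _, label in data]
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict([featurize("good morning to you")]))  # -> ['greet']
```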
So my view of history is that a lot of like the chatbot stuff came after Word2Vec. And so that feels modestly tech-led.
I would say so, but I mean, for me, Word2Vec was like a source of inspiration. But if you think about the paradigm for building a chatbot circa 2012, which was the thing that annoyed me so much that we started working on this problem was, you know, user sends a message, we're going to categorize this into one of these predefined categories. They're either saying hello or they're saying goodbye or they're asking for a receipt or they're doing a refund or whatever it is. You know, you have your predefined categories and you just have to categorize things. And that's
And that's kind of the understanding part, right? It's just picking the right category. Is Word2Vec really a big step up versus something simpler? You know, depends on how much stats you have, I guess. But it was certainly an inspiration for me and for many people. Did we get to the point in your history where you started Rasa yet? Kind of, yeah, yeah. So when we created that
NLU library, we open sourced that thing. That was called Rasa. And then we were cool, right? Because everyone was using these APIs at the time for natural language understanding. Microsoft had one and Google had one. It was pretty sketchy to be in the situation where
every interaction with your app goes through a third party before you know what it is. What if every button click had to go through a provider before you knew what button somebody clicked in your app? It's a pretty precarious situation to be in as a startup. So we said, okay, here's an open source drop-in replacement for this thing. And, you know, it was cool. People liked it. And then all of a sudden every developer in the chatbot world
knew what Rasa was and thought we were cool and thought we were legit and wanted to use the tool. And that sort of started the journey of the company that became Rasa. I think it'd be very useful to just do your rough sketch of how people would traditionally do NLU for, say, chatbots in a program using Rasa. And then we'll talk about how LLMs changed that, and how that's evolved the paradigm.
So the original sin of everything is this categorization thing that I just talked about, this classification problem. So the paradigm is you categorize what the user says into one of these buckets. And then you add a bunch of if statements. That's your chatbot. That was it. That sounds like programming. It does. And I will tell you that most widely used enterprise scale conversational AI platforms
you scratch beneath the surface and that's what's happening, right? Like there's something classifying a user message and then a bunch of rules to say what to do, you know, depending on where you are in the conversation and what's happening so far. And so for many years, we pursued this direction of, okay, accept that as the paradigm and then try and build a really clever dialogue system that's going to deal with all the
limitations of that approach. I mean, we spent a lot of time, wrote a bunch of papers, did a bunch of work on, like, how do you build a contextual engine that can kind of reinterpret that misinterpretation. The classic example — maybe to make this more concrete — of why representing everything as a classification problem is so impoverished comes from one of our customers, which is an airline. They asked the user, are you going to be traveling in economy class?
And the person said, sadly. It's perfectly clear to anyone reading that what that means, right? Like it's an implicature, right? We know, of course, that that means yes. But it's not that like the word sadly means yes in general, right? So if you have this sort of
this very naive way of understanding language where you're just like looking at a message and categorizing it. There's a limited amount you can do. So we kind of built a lot of machinery around dealing with that ambiguity, like reinterpreting things and all that kind of stuff. And of course that feels all very arcane in the world of LLMs where you could just like open up any of these tools that are, you know, so widely available and easy to use. You just have like a very fluent, natural conversation with them, right? And obviously there isn't like a...
little categorization going on in the background, a little classifier. There's something very dramatically different happening, right? So of course, all the technologies had to evolve with that. Just kind of ghost of Christmas future here, but they also introduced a whole new host of problems. So in the arc of the conversation, that's about where we are. We're going to go from the old problems to the new problems.
No, absolutely. And I remember building all this stuff, these clever dialogue managers and clever contextual stuff. And then it puts a bunch of work on the person building the chatbot to be really clever and do a bunch of clever things and collect data and do this evaluation and stuff. I remember always thinking in the back of my head, I was like, the way people really think about this problem, what people really want is they want a quote, like perfect understanding engine. And then they want like a really simple engine for managing their business logic.
So they actually want the dialogue part to be dumb and the understanding part to be infinitely smart. And I think that's much closer to what we have today, where you do just... So we kind of attacked the problem from the wrong side, but I guess you find that out over time. All right, so this is kind of the state of the world circa, what, 2019? Something like that?
Yeah, something like that. 2020, we started building the first dialog engines that really didn't rely on this classification problem at all. It was 2020, we published that paper and shipped the feature to go with it and all that kind of stuff. So sort of some end-to-end work.
I assumed a lot of what Rasa and you were doing was this classification approach. It's just providing kind of good tools to do that. Did you evolve the approach pre-LLMs? 2019, I wrote a blog post saying it's time to get rid of intents. So in other words, it's time to get rid of this like classification paradigm.
And, you know, again, it resonated with people, because everybody felt that when you're building one of these things, you're really building a house of cards — and the cards are the intents. It's what the if statements are conditioned on. And that's the thing that just doesn't scale when you're building with a team and you're like, well, where does this message really fit, and all that kind of thing. So, yeah. So we started breaking that paradigm in a meaningful way around sort of 2020. And then
I think now everybody's on board. No one wants to build intent-based chatbots anymore, right? Clearly, I think ChatGPT was ground zero for that.
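For a sense of what everyone is abandoning: the whole pre-LLM paradigm — classify, then branch — really does fit in a few lines. This is a caricature, reusing the `featurize` and `clf` from the Word2Vec sketch above; the intents and canned responses are invented.

```python
def respond(message: str, state: dict) -> str:
    # Step 1: squash the message into one of a fixed set of intents.
    intent = clf.predict([featurize(message)])[0]

    # Step 2: a pile of if statements, conditioned on intent and on
    # whatever state the conversation has accumulated so far.
    if intent == "greet":
        return "Hello! How can I help you today?"
    if intent == "refund":
        if "order_id" not in state:
            return "Sure - what's your order number?"
        return f"Starting a refund for order {state['order_id']}."
    return "Sorry, I didn't get that."  # the dreaded fallback

print(respond("hi there", state={}))  # -> "Hello! How can I help you today?"
```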
Maybe it's worth talking through how that landed with existing people doing chat, because it's not actually super obvious how you kind of retrofit that into an existing program or problem or company, right? Not at all. So you're like, OpenAI is doing this cool thing where you can chat with this LLM, but I need a customer service agent. How do I reconcile this? So maybe talk through, you know, these things land, and then just kind of the realization that they're not —
they would be great, but maybe not so easy to use. Yeah, I mean, from a vendor's perspective, it massively shifted the goalposts. As someone building a platform, everyone's like, well, when's our chatbot going to look like that one? Also made this a C-level topic in every company, which is obviously, you know, good for us and helpful for us. Creates a lot of noise and everything else as well. But, you know, the way you're thinking through it, exactly as you say, and it really is the same as that point that I made back in 2016 was, okay, how do you marry...
this sort of fuzzy system with something that isn't fuzzy. And you build something that is useful and does things, but still leverages what you get from these models. So I mean, the way that we tackled it was,
Look, we have at this point six or seven years of experience. We know the kinds of conversations that enterprises want to build and care about and the kinds of things they need to automate for their customers. So we just actually built an enormous database of conversation patterns, types of conversations, exact things that we wanted to be able to do. All the nuanced things, all the difficult things that were never really possible in that old paradigm.
And then we just said, okay, we have a new category of models. We know the system, what it needs to do, and let's just throw everything out and kind of rebuild. And so we just went into the trenches and built a whole new paradigm — a totally classification-free, intent-free approach to building these things. And yeah, very happy with what came out of it. Maybe let's back up a bit and talk about, like, okay, so LLMs show up.
And then let's say company X, airline X wants to like book tickets with it. It just feels to me there was this kind of five stages of grief with LLMs where like, oh, it seems like it'd be really easy to integrate, but it kind of wasn't. So an LLM shows up and I'm an airline and I'm like, okay, I want to use this LLM to talk to customers to book a ticket, right? And I've got OpenAI or whatever it is on the backend. So the first thing is it would have to understand everything.
the systems, in the context of the airline, right? So can you kind of walk through how people thought through integrating these things? I do feel like the entire industry learned how to do it, and there were a lot of wrong paths. Like one of my favorite tweets recently is a customer support tweet where the customer was like, are you an LLM? And the support, you know, response is no.
The customer says, can you create me a React component? Then it spits out all this. So clearly there's stuff that you need to do to like actually make these things work. So can you just kind of walk through like the evolution of the realization of how that you incorporate LLMs in this stuff?
Yeah, there's a naive approach. And this is what, you know, certain hyperscalers will promise you: that all you need is a big foundation model, and then you're just going to expose all your APIs and it's going to be great. It's not even obvious, I think, to people listening — like, how do you do that? And the stuff that's unique to my company — how would it even know about my products to support? Yeah, yeah, for sure. So I think the level one, the basic approach is,
you list all the APIs you have and like what parameters they take. Yeah. And then you tell the LLM, here's what the user said. So you fill out the prompts and you say, here's what the user said. And you say, here's a list of APIs that you could potentially make use of. And if it looks like you need the answer from one of these APIs, then like,
formulate a request, or just say that you need it, and then there's a bit of code that actually calls that API — something like that. You're just exposing the raw existence of those APIs to the LLM and just saying, figure it out, take a stab, take a best guess. But they're going to have to be really well-named APIs or something. I mean, I'm just trying to make this practical. Let's say somebody wants to get support for a product, right? Like somehow the LLM would have to
have to know about the product, right? - Yeah, the simplest possible approach is you write a whole bunch of information in a text blob. That's your prompt. You say, "Here, we're this company, this is what we do. These are our policies. Always be polite." Don't offer things. - The entire company knowledge base in the prompt. - Exactly. But the problem with this is this makes an incredible demo. It's 10 minutes of work.
It does 80% of the stuff for you well, and it looks super, super convincing. So I think that's part of the issue, or part of the challenge. There's a chart I saw recently that was sort of illustrating this point: the experience of building something with gen AI is the exact opposite of building something with traditional software, where you spend a week building a back end and an auth system, and you still have nothing to show for it. And then slowly, incrementally, you start to have something you're a little bit excited about, right? And here, you have all the dopamine
in the first 10 minutes, and then it's just downhill from there. - That's actually a brilliant observation. It's totally true. Why not just put everything in? I'll try to dumb it down just a little bit. I mean, the way these LLMs work is you just go ahead and you say something, and then it gives you an answer, right? In English.
Right. The thing that you feed it in is a prompt. And so what you're saying is, well, you just take the prompt and then you just append all the information. So this is a customer support request from a user,
you are an LLM that is a support agent representing this company, Acme Company. Acme Company is a hardware company. It has these products, it has these people, you know — and you just kind of explain everything about the company in it. So what is the shortcoming of just putting it all in the prompt? Well, I call this approach prompt and pray, because you really don't have control over the output, right? Okay. So there's one modification to this, which is that maybe
the whole prompt doesn't have everything; it's a template, and some things are pulled in dynamically depending on what's relevant or what the user asked or something like that. So it's sort of assembled right before you go and call the LLM, and you pull in some relevant context, and people call that RAG, right? So, retrieval augmented generation. Maybe let's just walk through RAG really quickly. So the reason that you don't put it all in the prompt is because you can't. There's just too much information to put into the prompt, and it's changing all the time, and that's just stupid, right? Okay.
You can't describe everything you ever need to know and put it in a prompt. So instead, we have traditional systems, and these traditional systems will store state, whatever it is. And then the hope is, somehow you pull that information out of these systems and then you add it to the prompt, which has templates for these things, right? And that's what RAG is. Normally, you take information, put it in a vector database. Right.
and then you're somehow pulling it out and filling the prompt. Yeah, I think the simplest thing to imagine is you have a big collection of FAQs. When the user asks a question, you first query and pull the Q&A pairs for the things that are most similar to what the user asked. And then you add that to the prompt and say, by the way, if the answer's in here, just use that. It's a way of steering it a little bit more.
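Here's a minimal sketch of that FAQ flavor of RAG: embed the FAQ entries, pull the most similar ones by cosine similarity, and splice them into the prompt. It assumes the sentence-transformers library; the FAQ entries and prompt wording are invented, and a real system would then send the returned prompt to an LLM.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

faqs = [  # invented FAQ pairs
    ("How do I change my flight?", "Use 'Manage booking' up to 2 hours before departure."),
    ("What is the baggage allowance?", "One 23kg checked bag plus one carry-on."),
]
faq_embeddings = model.encode([q for q, _ in faqs])

def build_prompt(user_message: str, top_k: int = 1) -> str:
    # Cosine similarity between the user message and each stored question.
    scores = util.cos_sim(model.encode(user_message), faq_embeddings)[0]
    best = scores.argsort(descending=True)[:top_k].tolist()
    context = "\n".join(f"Q: {faqs[i][0]}\nA: {faqs[i][1]}" for i in best)
    return f"If the answer is in this context, use it:\n{context}\n\nUser: {user_message}"

print(build_prompt("how many bags can I bring?"))
# The failure mode discussed below: a bare follow-up like "tell me more"
# embeds to nothing useful on its own, so retrieval returns "any old junk".
```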
In many cases, this initial query for data to create the prompt requires reasoning, right? Like you kind of have to know what to look for. So let's say that you're asking a question — I'd have to know which PDFs are relevant, right? And normally RAG is just basic similarity search, like cosine similarity. So it doesn't have reasoning in the way that an LLM does. And so it always has felt like a very inexact
approach to me. Is that fair? Oh, totally. I mean, take the classic sort of hello world RAG app: ask a question, get an answer, and then say, tell me more. You won't get a good answer, because it'll just query "tell me more" and it will come up with any old junk. So, I mean, yeah, it's a hello world, but it's not the be-all and end-all, right? But a couple of problems with that approach are kind of more fundamental, right? Because, I mean, you can work around a lot of these things. You have very limited control over what the output is, right? Yeah.
Yeah. And so it will say something plausible, but not necessarily what you hope for it to say. And when you change the prompt, the only way to know the impact that that change is going to have is to run it and see. So that's also a frustrating trial and error type of process, which is why I call it prompt and pray because you're just kind of throwing stuff at the wall. And because these models, especially the instruction tuned ones, which are the ones you typically tend to use,
These are trained on human feedback, and they're very eager to please, because they sort of get rewarded during training time for pleasing. So they will offer stuff that they can't do,
because they think it'll make you happy, right? So that's the classic thing. You build the first version of your chatbot, and it's like, oh, would you also like me to do this? Or would you like me to compare some products? Or would you like me to offer you...? They're just offering stuff that they think will make you happy. There's no basis for it, right? All right. So we're at a point where, you know, we have the LLMs. You can't put it all in the prompt. So you do this RAG thing. So it's in the vector database, or whatever you're doing for retrieval, but that, you know, is not sufficient.
And let me tell you, as a programmer — so I program periodically with LLMs. I do it for just kind of silly apps. In my experience, it's very, very tough to map these to a formal language, just because you can't predict what they're going to say. Even if you try and tell it... I'm working on a silly video game. The LLM
returns objects in JSON. And my problem just getting it to produce the JSON right is literally, like, the stop token and the formatting of the JSON. I mean, I have to evolve it all of the time. Even then, every tenth time it just decides not to, or it decides to add something. Right. So how do you even think about marrying these LLMs, where you don't know what they're going to say, to a formal program? Like, is that even something that can be done? For sure. But I mean,
I think we're really getting into it now. And this is the question of how do you build a system that merges these two worlds, right? That's right. I don't know if I have a general solution, but I can speak to it from the context of building conversational AI. That'd be great. And especially from enterprises who are building a voice or a chatbot that is automating something, customer support or something like that. Let me just specify the problem domain and then say exactly what you were going to say before, which is,
So we've been talking about using an LLM to actually have a conversation. You clearly are going to, A, want to constrain what it says, and then, B, if it's going to do anything like book a ticket or something, it's got to interface with formal systems, right? And so the challenge of somebody taking an LLM
and making it effective conversational AI. So both that it speaks in a way that makes sense and then impacts the system requires you to solve this problem of how do you map it to, well, the first one, actually, we haven't even talked about, which is like, how do you make sure you're not saying stupid shit to a customer? Let's do that one first. How do you make sure it doesn't say stupid stuff
to the customer. And then the second one is, how do you actually map it to a formal system? So we can tell — I mean, we work a lot with regulated industries and, you know, we have a lot of big banks as customers and stuff like that. We always tell them: you can hold your hand on your heart and tell your compliance team, Rasa will never hallucinate.
And then people are like, oh, wow, what kind of secret sauce do you have there? Well, by default, you only send templated answers. So there's no opportunity for generating a response. But of course, you can introduce that. The way I think about all these things is that LLM paraphrasing is the little secret sauce, the little fairy dust on top to make things a little nicer. So I guess the way I think about it is: LLMs are unbelievably powerful components for
turning freeform natural language into structured data, and vice versa. And so that's kind of getting into the point about the formal system. Well, if you're going to do something on behalf of the user, right? If you just kind of fetch some data — you just go call an API and fetch their, I don't know, their latest order or something like that — that's just a single-turn kind of thing. Most things are not like that. Most things are: there's some state, there's some multi-turn interaction going on. So for us,
it's just about translating what the user is saying into what that implies in this context. What does that imply for progressing the conversation, progressing the task, going to the next step, that kind of thing. So to go back to the sort of sadly example: when you ask the user, are you traveling in economy class? And the person says, sadly. LLMs are perfectly smart enough to understand that, right? And so the way we represent it is we
we output just, hey, you're setting this variable to true. So the economy class variable is set to true. And so in general, our approach is, you know, we translate what's happening in the conversation into like a small set of verbs that say, you know, we call these commands and they just say, here's how the user wants to progress the conversation. That doesn't even mean that that's what's going to happen, but here's how the user would like to progress the conversation. This is what they're trying to achieve. This is what they're saying implies
in this context and how we're going to move forward. And then you have just a very simple deterministic engine that says, "Okay, well, I know the steps. I know that if they want to do this, I need this information from them. I need this information from an API. I have these decision points. I'm going to ask them a couple more questions and then we'll get there." The nice thing about the approach that we have is you say, "Look,
All the messy, ugly, confusing stuff around having a conversation — having a fluent conversation and dealing with digressions and people correcting themselves and having follow-up questions and changing their mind about things and switching from one thing to another — all of that complexity is handled by the LLM. And the task logic is just done simply, deterministically. You write it down, right? And that's just a very, very efficient system, both from a development perspective but also
just from an ongoing maintenance perspective, right? Because the alternative approach is you just let the LLM do everything and figure it out.
And the most infuriating thing about that approach — kind of going back to the prompt and pray thing — is that when it doesn't do what you want, you can't systematically influence it so that it does do what you want, right? So, you know, I've found a bug, but I have no predetermined set of steps that I can follow that will fix it for me, right? So that's where you start pulling your hair out, trying to make this thing reliable, playing whack-a-mole: you found another edge case, and another edge case. Yeah.
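To pin down the "commands" idea from a moment ago: the LLM's only job is to emit a small set of verbs describing how the user wants to progress the conversation, and a deterministic engine decides what actually happens. A sketch in that spirit — the command names, flow definition, and engine here are invented for illustration, not Rasa's actual CALM API.

```python
from dataclasses import dataclass

@dataclass
class SetSlot:      # "the user supplied a piece of information"
    name: str
    value: object

@dataclass
class StartFlow:    # "the user wants to begin this task"
    flow: str

FLOWS = {"book_ticket": ["destination", "travel_class", "date"]}  # invented

def advance(state: dict, commands: list) -> str:
    # Apply what the LLM understood. No business logic happens here.
    for cmd in commands:
        if isinstance(cmd, SetSlot):
            state["slots"][cmd.name] = cmd.value
        elif isinstance(cmd, StartFlow):
            state["active_flow"] = cmd.flow
    # Deterministic flow engine: ask for the next missing slot, in order.
    for slot in FLOWS.get(state.get("active_flow"), []):
        if slot not in state["slots"]:
            return f"What is your {slot}?"
    return "Great - booking now."

# "Are you traveling in economy class?" -> "Sadly." The LLM's whole job is
# to emit SetSlot("travel_class", "economy"); the engine does the rest.
state = {"active_flow": "book_ticket", "slots": {"destination": "NYC"}}
print(advance(state, [SetSlot("travel_class", "economy")]))  # -> What is your date?
```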
You're kind of chasing your tail. I understand how you deal with the LLM getting it wrong when it involves a human. So let's say that you, whatever, the LLM didn't pick up some colloquialism. So instead of saying, sadly,
You know, there was another colloquialism that just nobody knows, like in a small area, and it gets it wrong. Yeah. So I can understand how you deal with that, which is you just ask, is this right? And then the user is like, no, that's not right. And then actually maybe speaks more clearly or something. I get how you could build a system like that. Yeah.
Let's say that you're trying to book a ticket. And so you get the user information, and then you reserve the seat, and then you do a credit card transaction, and the credit card transaction was denied. And so you have to somehow unreserve the seat. So you're actually trying to do a stateful set of steps. The LLM is trying to do a stateful set of steps that require a transaction. And if you stop halfway, there's more stuff to do. If it's inaccurate during that —
like, if it's somehow hallucinating during that — you'll end up with state inconsistency and partial transactions. It is very, very tough to build a formal system where you're trying to have consistent state if that's happening. Again, I get the user thing. I just don't get the machine thing. Like, how do you build a system this way? You don't let the LLM do it.
So you just say, here's what the user told me. OK, I want to book a ticket. I need these three bits of information from the user. So you've got some-- we call it a flow, whatever. But you've got some set of steps you go through to do that for the user. And whenever you reach a step that says, I need information from the user, you go and ask them. But in between, when you're doing stuff, you're just executing the actions. And then you reach some decision point, and you have little if-else. And then you go and execute and all that kind of stuff. So you only pause to ask the user for input.
But in between those steps, it's just code. It's determinism, right? It's just written down. And it's funny, because there's this really in vogue idea that the LLM should be reasoning about that and should be doing that, and it should be dynamic. How many use cases are there really where that logic is dynamic at runtime — where every time a user interacts with it, it could change? Your view is...
Anything that's multi-step and transactional is handled by a traditional system. And then you just expose it as: do this. Either it does it or it doesn't do it, but you'll never get it in some funky inconsistent state. Right. And then when you get to a point that you're returning to the natural language, you're like, we're having a
conversation, we're going to ask you for something. And then you go back into natural language world, right? And so, you take the output from the system, which is maybe some structured data, you use some LLM magic to format it nicely in a human friendly way that the person can consume it and reason about it. Let me push on this. When I'm talking to a human to book a ticket, they reserve a seat and then keep talking just so it doesn't go away.
In this case, the LLM would not be able to reserve the seat while you're having the conversation, right? Well, I mean, and now you're getting into the difference between spoken dialogue and written dialogue. Well, no, no, no. And you're just going to wait until the thing responds. No, no, no. I'm trying to say something a little differently, which is, listen, you have to reserve the seat, but then you have to somehow unreserve the seat if the transaction doesn't go through, right? So you're doing this stateful stuff.
that's changing the system that you may have to unwind later. And then it would be the LLM that would have to figure out how to do that, right? And if it doesn't unwind it correctly, then you end up with a reserved seat that nobody's filled. It's
It's a very tough problem to get an LLM to do something like book a ticket if it involves first reserving a seat, getting more information, maybe like doing like the payment transaction. And if that fails, then unwinding the seat or changing the seat or like just all the stateful stuff seems very hard to do from an LLM. I mean, I agree. If you know...
You know that logic, you know what to do when you have to unreserve a seat, and it is fundamentally very silly to launder that information through an LLM and then hope that it comes out obeying those instructions, versus just saying, "I've got a system that will do that and will handle all the edge cases and will unwind it if needed." But you maintain state in the conversation as well? The conversation, at least within Rasa — the conversation itself also maintains state. It maintains some things like, here's some things I know about the user.
Here's some things that I've retrieved from APIs so far. And here are things that we've been talking about and that they might refer back to, those kinds of things. So if you've got something that's pending and needs to be cleaned up at the end of a conversation, or if a conversation gets interrupted, you've got that state there. And that's your tidy-up and wrap-up type of action. So basically,
each conversation is one big transaction: either it goes all the way through, or, if it doesn't, you have to have the ability to kind of roll back and undo. Yeah, yeah. Clean up and unwind and that kind of thing. I see. So it's kind of this very blunt instrument for you to tie a transaction to a conversation. Well, I mean, the nice thing is, you know, you can have arbitrary composition of different tasks that you might do, right? So you might check,
in the middle before, you know, you complete and book, you might go and ask about availability for return flights or something like that. Or ask if there's an upgrade available, right? Like, there might be all sorts of, like, little sidebars and things that happen. So, and that's, again, where, like, the LLM magic comes in, dealing with all the complexity of all the ways that you can, like, deviate. And you don't have to, like, map out all the ways that conversations can go anymore, which is nice. Yeah.
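And to make the seat-reservation point concrete: the reserve/charge/unwind logic lives in plain code, where the edge cases are handled deterministically — the LLM never "decides" whether to roll back. All the helper functions below are invented stand-ins, just to show the shape.

```python
class PaymentDeclined(Exception):
    pass

# Invented stand-ins for real inventory and payment systems.
def reserve_seat(flight, seat):  return {"flight": flight, "seat": seat}
def unreserve_seat(res):         print(f"released seat {res['seat']}")
def ticket_price(flight):        return 199.00
def charge(card, amount):        raise PaymentDeclined  # simulate a declined card

def book_ticket(flight, seat, card):
    reservation = reserve_seat(flight, seat)   # stateful: the seat is now held
    try:
        charge(card, ticket_price(flight))     # may fail mid-transaction
    except PaymentDeclined:
        unreserve_seat(reservation)            # deterministic unwind; no LLM
        raise                                  # involved, never left half-booked
    return reservation

try:
    book_ticket("BA123", "14C", card="4242...")
except PaymentDeclined:
    print("payment declined, seat released")
```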
But you know, the pieces of business logic you can just like write down and have executed reliably. Well listen, as we round this out, I would love for you to talk through like
maybe real-world examples of companies that have taken LLMs and put them in, and how that's worked out. Because I feel it's still pretty wild west for getting LLMs into production for useful systems, and I think you're probably the world's expert on this, so just a little insight on how it's actually done in production with real customers would be great. Yeah, no, for sure. And so the system I've been talking about, with this sort of
combining rigorous logic with an LLM for all the understanding parts — we call it CALM, for Conversational AI with Language Models. A year ago, I gave a talk on it in New York. We did a meetup, and there was an MD from one of the three biggest US banks. And he said, "Oh, I was absolutely 100% convinced we were not allowed to do LLMs." And he's like, "Now that I've seen this system, the only limitation is how many GPUs can I get?"
And I was like, nice. All right, let's go. So we did the first proper GA release of the CALM system 12 months ago. So yeah, last December. We now have a double-digit number of large enterprises running it in production for their customer service voice and chat interactions. I have to say, I have to give a lot of credit to just the overall momentum and the C-level buy-in for doing these things, right? And the belief everywhere that
things can be done better. If we had just invented this thing in a vacuum, there's no way it would have gone at this pace. Yeah. But the on-ramp is typically: look, let's start with something where we're only using templated responses. We're going to use the LLM for all the understanding stuff, but we know that we're not sending any generated text to users. And, you know, we can say with full confidence, we're not at risk of hallucination, we're not at risk of
prompt injections and hijacking and all that kind of stuff. We're not going to end up on the front page of the Wall Street Journal for the wrong reasons, that kind of thing, right? And then, as you sort of build that confidence, you start to sort of open up a little bit more. Maybe you introduce some RAG, and you introduce some rephrasing, some paraphrasing, that kind of stuff. And all that stuff adds a lot of value, right? It means you can handle more long-tail questions. It means you can customize and personalize things, sound a lot more natural. So it's sort of that
confidence-building exercise of something that feels very familiar, not like a million miles away from the familiar old world. And then being able to kind of open it up, rather than
the starting point being, well, GPT-4 is speaking on our behalf. I mean, honestly, this word guardrails — it really winds me up. The things that people describe as LLM guardrails are nothing of the sort. It's like, oh, we have some output filtering. Check for swear words on the output. Congratulations. I think you've underestimated the problem. Or when people just
offer something that's like observability, right? You can see everything that's happening. And then the promise is like, you can now ship robust generative AI. I'm like, listen, you can put a toddler on your customer service hotline and like listen into all the calls. That doesn't make it robust. It just means you'll know just how terrible your system is, right? So yeah, it's really about sort of providing something that you can debug, you can write end-to-end tests, you can like build confidence that you can ship it and it's going to work.
I would maybe just encourage people to think more critically, especially when you're dealing with an LLM and you're building a larger application. You're building like a piece of infrastructure and LLMs are part of it. Think about which pieces are truly dynamic and those pieces which are truly dynamic and fuzzy and unpredictable, handle them with an LLM.
And everything else — just use something that works, right? Ask yourself: is this going to be a different journey for every single user who interacts with it? If so, fair enough, that's all right. If not, what are you doing? I think that's a great way to dissect the problem. Awesome. That's it for this episode. If you enjoyed it, if you learned something, or if it struck a chord some other way, please do rate the podcast wherever you listen. Until next time, thanks for listening.