
RAG Inventor Talks Agents, Grounded AI, and Enterprise Impact

2025/3/27

Founded & Funded

People
Douwe Kiela, Jon Turow
Topics
Douwe Kiela: I'm the co-founder and CEO of Contextual AI and a co-inventor of RAG (retrieval-augmented generation). We build next-generation retrieval systems for the enterprise, and I want Contextual AI to aim higher than being one small piece of the RAG stack; I want the whole market. RAG began as a way to give language models the knowledge context they lacked, grounding them in external text such as Wikipedia so they could understand and generate language better. Its success owed partly to the availability of Facebook AI's similarity search technology, an early vector database, which made it possible to feed retrieval results into a generative model; refreshing a model's stale knowledge was part of the original motivation. RAG is not mutually exclusive with fine-tuning or long context windows; combining them gives the best performance. It caught on because it is a simple way to point large language models at internal enterprise data. The next generation of RAG systems will let developers focus on business value and differentiation rather than technical details like chunking strategies, though developers still tune those details today. Enterprise adoption is uneven: some companies are still exploring, others have built their own RAG platforms, sometimes on mistaken assumptions. Many enterprises aim too low in their choice of use cases; ROI depends on the complexity of the use case and the number of employees it touches, and generative AI deployments split roughly into cost savings and business transformation. Common misconceptions include underestimating how hard it is to get RAG into production and misjudging what RAG is for: it is good at answering specific questions, not at summarizing documents. Contextual AI's RAG 2.0, which jointly optimizes the components of the RAG pipeline, has changed how we talk with enterprise executives; we help customers with use-case discovery, success metrics, test sets, and user acceptance testing (UAT), which is critical because real users behave differently than testers, and we support customers through deployment. We build systems around large language models rather than training foundation models, because the models themselves will be commoditized. Active retrieval, unlike passive retrieval, lets the language model decide whether and how to retrieve, and RAG agents reason over context to pick the best sources. A major open challenge is combining structured and unstructured data, and for a specific problem a specialized model always beats a generic one. Our research targets real customer problems rather than pure academic questions; in AI the line between research and product is blurred, and research has to become product quickly. AI-native companies, unlike SaaS companies, have to explore and learn while they build; turning research into product alongside enterprises has been exciting. The core technology already has enormous economic potential, but non-technical problems remain: law, regulation, and organizational change. In deployment it is essential to focus on the inaccuracy and how to mitigate its risk. Domain experts, technologists, and research founders all matter; deep domain knowledge matters less than leadership and vision, and "wrapper companies" built on foundational technology can be great businesses. The opportunity for new AI companies lies in exploiting the technology's growing maturity, attacking harder problems, and minding cost-effectiveness; newcomers must differentiate and avoid being swallowed by incumbents, adapt fluidly to the market, stay humble, and not be afraid to aim big. A company's core is its data and the expertise around that data, and that should be reflected in how it evaluates its models. "Hallucination" is an imprecise term; accuracy is the better measure, and hallucination is really about groundedness: a model that adheres closely to its context hallucinates less. Building an AI company and solving real problems has been much harder than expected. The neglected problem in AI is evaluation: it is critical to successful deployment, yet most companies underinvest in it. Enterprises can use Contextual AI's tooling for evaluation, but the evaluation expertise itself has to be built in-house.

Jon Turow: A common critique of early generative models was the knowledge cutoff; models in 2020 and 2021 did not know about COVID-19. How RAG relates to techniques like fine-tuning, teaming, and in-context learning, and what role it was meant to play in that constellation. RAG's popularity and its misreading as a silver bullet; the aha moments enterprise executives have with it and the convenience it brings. CTOs want RAG to slot into existing architectures and work out of the box, while developers still want to tune details like chunking strategies. The adoption timeline: 2023 was the year of the demo, 2024 of productionization, 2025 of the pursuit of ROI, with some enterprises measuring cost savings and others business transformation and revenue. Whether RAG solves hallucinations; how views on AI adoption and capability have changed; and why evaluation is the problem the industry talks about far too little.


Chapters
This chapter details the creation of RAG at Facebook AI Research, highlighting its initial goal of grounding language models in external text, particularly Wikipedia. It emphasizes the collaboration with other researchers and the role of vector databases in enabling the combination of retrieval and generative models.
  • RAG originated from grounding language models in external text.
  • Initial grounding attempts used Wikipedia.
  • Collaboration with researchers at Facebook and elsewhere was crucial.
  • Early RAG models were multimodal, though primarily language-focused in application.

Transcript


It's very easy for people to say, like, you should do one thing and you should do it well. Sure, maybe, but I'd like to be more ambitious than that. We could have been like one small part of a RAG stack and we probably would have been the best in the world at that particular thing. But then we're just slotting into this ecosystem where we're just like a small piece and I want the whole pie, ideally.

Welcome to Founded and Funded. I'm Madrona Partner Jon Turow, and I'm here with Douwe Kiela, who is the founder and CEO of Contextual AI, a company building next-generation retrieval systems for the enterprise. Douwe is also the co-creator of Retrieval Augmented Generation, also known as RAG. There's a saying that when you have a hammer, everything looks like a nail. Douwe, who created RAG, has resisted that urge.

RAG has become one of the most widely adopted techniques in enterprise AI, but he's continued pushing the envelope even when customers weren't always ready to hear about what comes next. RAG was never meant to be the final answer. It was the beginning of something bigger. And so, Douwe has this rare perspective as a researcher and a founder bringing that innovation to market. Whether you're a builder,

an investor, or an AI practitioner, this is a conversation that will challenge how you think about the future of enterprise AI.

So let's get it going. So Douwe, take us back to the beginning of RAG. You know, what was the problem that you were trying to solve when you came up with that? So the history of the RAG project, so we were at Facebook AI Research, obviously FAIR, and I had been doing a lot of work on grounding already for my PhD thesis. And grounding at the time really meant understanding language with respect to something else.

So it was like, if you want to know the meaning of the word cat, like the embedding, word embedding of the word cat, this was before we had like sentence embeddings, then ideally you would also know what cats look like because then you understand the meaning of cat better. So that type of perceptual grounding was something that a lot of people were looking at at the time.

And then I was talking with one of my PhD students, Ethan Perez, about can we ground it in something else? Maybe we can ground in other text instead of in images. So the obvious source at the time to ground in would be Wikipedia. So we would say this is true, sort of true. And then you can understand language with respect to that ground truth.

That was the origin of RAG. Ethan and I were looking at that and then we found that some folks in London had been working on open-domain question answering, mostly Sebastian Riedel and Patrick Lewis. And they had amazing first models in that space and it was really a very interesting problem. How can I make a generative model work on any type of data and then answer questions on top of it?

We joined forces there. We happened to get very lucky at the time because the folks at Facebook AI had FAISS, Facebook AI Similarity Search, I think is what it stands for. Basically the first vector database, but it was just there. And so we were like, we have to take the output from the vector database and

give it to a generative model. This was before we called it language models. Then the language model can generate answers grounded on the things you retrieve. And that became RAG. We always joke with the folks who were on the original paper that we should have come up with a much better name than that. But yeah, somehow it stuck. And so this was by no means the only project that was doing this. There were people at Google working on very similar things.

Like REALM is an amazing paper from around the same time. Why RAG, I think, stuck was because the whole field was moving towards Gen AI. And so the G in RAG stands for generative. So we were really the first ones to show that you could make this combination of a vector database and a generative model actually work. You know, there's an insight in here that RAG from its very inception was multimodal.

You know, you were starting with image grounding and things like that, and it's been heavily language-centric in the way people have applied it. But from that very beginning place,

were you imagining that you were going to come back and apply it with images? We had some papers from around that time. There's a paper we did with more applied folks in Facebook where we were looking at, I think it was called Extra, and it was basically RAG, but then on top of images. And so, yeah, that feels like a long time ago now, but that was always very much the idea, right? It's like you can have arbitrary data that is not captured by the parameters of the generative model,

And you can do retrieval over that arbitrary data to augment the generative model so that it can do its job. So it's all about the context that you give it. Well, you know, this takes me back to another common critique of these early generative models that for the amazing Q&A that they were capable of, the knowledge cutoff was really striking. You've had models in 2020 and 2021 that were not aware of COVID-19.

that obviously was so important to society. Was that part of the motivation? Was that part of the solve that you can make these things fresher? Yeah, it was part of the original motivation, right? So that is sort of what grounding is, the vision behind the original RAG project. And so we did a lot of work after that on that question as well, is can I have a very lightweight language model that basically has no knowledge?

It's very good at reasoning and speaking English or any language, but it knows nothing. And so it has to rely completely on this other model, the retriever, which does a lot of the heavy lifting to ensure that the language model has the right context, but that they really have separate responsibilities. But yeah, getting that to work turned out to be quite difficult. And so at the time...

Now we have RAG, and we still have this constellation of other techniques. We have training, and we have teaming, and we have in-context learning. And that was, I'm sure, very hard to navigate for research labs, let alone enterprises. In the conception of RAG, in the early implementations of it, what was in your head about how RAG was going to fit into that constellation? Was it meant to be standalone?

Yeah, it's interesting because the concept of in-context learning didn't really exist at the time. That really became a thing with GPT-3, I think, where they showed that that works. And that's just an amazing paper and an amazing proof point that that actually works. And I think that really unlocked a lot of possibilities. But in the original RAG paper, we have a baseline, what we call the frozen baseline, where we don't do any training and we just give it as context.

So that's in table six. And we kind of show that it doesn't really work, or at least that you can do a lot better if you optimize the parameters. Right?

So in context learning is great, but you can probably always beat it through machine learning if you're able to do that. So if you have access to the parameters, which is obviously not the case with a lot of these black box frontier language models, but if you have access to the parameters and you can optimize them for the data you're working on or the problem you're solving, then at least theoretically, you should always be able to do better. So I see a lot of kind of

false dichotomies around RAG. So the one I often hear is like it's either RAG or fine-tuning. That's wrong. You can fine-tune a RAG system and then it will be even better. The other dichotomy I often hear is it's RAG or long context.

Like those are kind of the same thing. Like RAG is a different way to solve the problem where you have more information than you can put in the context. So one solution is to try to grow the context, which doesn't actually really work yet, even though people like to pretend that it does. The other is just to use information retrieval, which is pretty well established as a computer science research field.

and just leverage all of that and make sure that the language model can do its job. And I think things get oversimplified where it's like, you should be doing all of those things. You should be doing RAG, you should have a long context window as long as you can get, and you should fine-tune that thing. That's how you get the best performance. - What has happened since then is that, and we'll talk about how this is all getting combined in more sophisticated ways today, but I think it's fair to say that in the past 18, 24, 36 months,

RAG has caught fire and even become misunderstood as the single silver bullet. Why do you think it's been so seductive?

It's seductive because it's easy. I honestly, I think like long context is actually even more seductive if you're lazy, right? Because then you don't even have to worry about the retrieval anymore. You just put it all there and you pay a heavy price for having all of that data in the context. You're like every single time you're answering a question about Harry Potter, you have to read the whole book in order to answer the question, which is not great. So RAG is seductive, I think, because you need to have

a way to get these language models to work on top of your data. And so in the old paradigm of machine learning, we would actually probably do that in a much more sophisticated way. But because these frontier models are behind black box APIs and we have no access to what they're actually doing, the only way to really make them do the job on your data is to use retrieval to augment them. It's a function of sort of what the ecosystem has looked like over the past two years since ChatGPT.
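
To make that concrete, here is a minimal sketch of the retrieve-then-generate loop being described, with toy stand-ins for the embedding model and the black-box generative model. None of this is Contextual AI's code; the function names and the character-frequency "embedding" are illustrative assumptions only.

```python
from math import sqrt

def embed(text: str) -> list[float]:
    # Toy stand-in for an embedding model: a normalized
    # character-frequency vector. A real system would call a model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Vector-database step: rank documents by similarity to the query.
    q = embed(query)
    score = lambda d: sum(x * y for x, y in zip(q, embed(d)))
    return sorted(docs, key=score, reverse=True)[:k]

def generate(prompt: str) -> str:
    # Stand-in for the black-box LLM call.
    return f"[model answer grounded on: {prompt[:60]}...]"

def rag_answer(query: str, docs: list[str]) -> str:
    # Retrieval-augmented generation: stuff retrieved context into the prompt.
    context = "\n".join(retrieve(query, docs))
    return generate(f"Use ONLY this context:\n{context}\n\nQuestion: {query}")
```

The point of the pattern is exactly what's said above: the model only ever sees your data through the retrieved context.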

And we'll get to the part where we're talking about how you need to move beyond a cool demo. But I think the power of a cool demo should not be underestimated. And RAG enables that. What are some of the aha moments that you see with enterprise executives?

- Yeah, I mean, there are lots of aha moments. I think like that's part of the joy of my job. I think it's where you get to show what this can do and it's just amazing sometimes what these models can do. But yeah, so basic aha moments for us.

So accuracy is almost kind of table stakes at this point. It's like, okay, like you have some data, it's like one document, you can probably answer lots of questions about that document pretty well. It becomes much harder when you have a million documents or tens of millions of documents and they're all very complicated or they have...

very specific things in them. So we've worked with Qualcomm, and there are, like, circuit design diagrams inside those documents. It's much harder to make sense of that type of information. So the initial wow factor, at least from people using our platform, is that you can stand this up in like a minute. I can build like a state-of-the-art RAG agent in like three clicks, basically. And so

that time to value used to be very difficult to achieve, right? Because you had your developers, they have to think about like the optimal chunking strategy for the documents and things that you really don't want your developers thinking about, but they had to because the technology was so immature.

So the next generation of these systems and platforms for building these RAG agents is going to enable developers to think much more about business value and differentiation, essentially. How can I be better than my competitors because I've solved this problem so much better? So your chunking strategy should really not be important for solving that problem. Well, so if I now connect what we were just talking about to what you said now,

the seduction of long context and RAG are that it's straightforward and it's easy. It plugs into my existing architecture. And as a CTO, if I have finite resources to go implement new pieces of technology, let alone dig into concepts like chunking strategies and how the vector similarity for non-dairy will look similar to the vector similarity for milk, things like this. Is it fair to say that CTOs are wanting something coherent

That can be something that works out of the box. So you would think so. And I think that's probably true for CTOs and CIOs and CAIOs and CDOs and the sort of folks who are thinking about it from that level. But then what we often find is that we talk to these people and then they talk to their

architects and their developers. And those developers love thinking about chunking strategies because that's what it means in a modern era to be an AI engineer is to be very good at prompt engineering and evaluation and optimizing all the different parts of the RAG stack.

So I think it's very important to have the flexibility to play with these different strategies. But you need to have very, very good defaults so that these people don't have to do that unless they really want to squeeze like the final percent and then they can do that. So that's what we're trying to offer: you don't have to worry about all this basic stuff.
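
As a sketch of what "very good defaults you can still override" might look like, here is a hypothetical chunking interface. The parameter names and default values are assumptions for illustration, not Contextual AI's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkingConfig:
    chunk_size: int = 512  # words per chunk; a sensible default
    overlap: int = 64      # words shared between consecutive chunks

def chunk(text: str, config: ChunkingConfig = ChunkingConfig()) -> list[str]:
    # Sliding-window chunking: most users never touch the config;
    # power users can pass their own to squeeze out the final percent.
    words = text.split()
    step = max(config.chunk_size - config.overlap, 1)
    return [
        " ".join(words[i : i + config.chunk_size])
        for i in range(0, len(words), step)
    ]

# Default path: chunk(document)
# Tuned path:   chunk(document, ChunkingConfig(chunk_size=256, overlap=32))
```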

You should be thinking about how to really use the AI to deliver value. So it's really a journey. I think a lot of companies, so the maturity curve is very wide and very flat. So it's like some companies are really just figuring it out. It's like, what use case should I look at? And others have a full-blown RAG platform that they built themselves

based on completely wrong assumptions for where the field is going to go. And now they're kind of stuck in this paradigm. It's really all over the place, which means it's still very early in the market. Can you take me through some of the milestones on that maturity curve from the cool demo all the way through to the ninja level results?

The timeline is basically 2023 was the year of the demo. ChatGPT, it just happened. Everybody was kind of playing with it. There was a lot of experimental budget. Last year has been about trying to productionize it and you could probably get promoted if you were in a large enterprise, if you were the first one to ship GenAI into production. So there's been a lot of kind of kneecapping of those solutions happening in order to be the first one to get it into production.

This year, those first past the post, so they got past the post, but only in a limited way, because it's actually very hard to get the real thing past the post. Right. So this year, people are really under a lot of pressure to deliver return on investment for all of those investments and all of the experimentation that has been happening. So it turns out that actually getting that ROI is a very different question.

That's where you need a lot of deep expertise around the problem, but also you need to just have better components than what exists out there in an open-source, easy framework for you to cobble together a Frankenstein RAG solution. That's great for the demo, but that doesn't scale. How do customers think about the ROI? How do they measure, perceive that? Yeah.

It really depends on the customer. Some are very sophisticated, really trying to think through sort of the metrics, like how do I measure it? How do I prioritize it? I think a lot of consulting firms are trying to be helpful there as well, thinking through, okay, like this use case is interesting, but it touches 10 people.

They're very highly specialized, but we have this other use case. It touches 10,000 people who are maybe slightly less specialized, but there's much more impact there. So it's kind of a trade-off. I think my general stance on like use case adoption is that I see a lot of people kind of aiming too low.

Where it's like, oh, we have AI running in production. It's like, oh, what do you have? Well, we have something that can tell us who our 401k provider is and how many vacation days I get. And that's nice. Is that where you get the ROI of AI from? Obviously not. You need to move up in terms of complexity. Or if you think of the org chart of a company, you want to go for the specialized

roles where they have like really hard problems, and if you can make them 10 or 20 percent more effective at that problem, you can save the company tens or hundreds of millions of dollars just by making those people better at their job. There's an equation you're kind of getting at, which is the complexity or sophistication of the work being done times the number of employees

that it impacts. Yeah, so there's roughly two categories for Gen-AI deployment, right? One is cost savings. So I have lots of people doing one thing. If I make all of them slightly more effective, then I can save myself a lot of money. And the other is more around business transformation and generating new revenue.

So that second one is obviously much harder to measure. And you need to really think through the metrics, like what am I optimizing for here? So as a result of that, I think you see a lot more production deployments in the former category where it's just about cost saving. What are some big misunderstandings that you see around what the technology is or is not capable of?

I see some confusion around this kind of gap between demo and production. The common misconception we see is like, oh, yeah, it's great, I can easily do this myself. And then it turns out that everything breaks down after like 100 documents and they have a million. And so that is the most common one that we see. But I think there are other misconceptions maybe around what RAG is good for

and what it is not. So what is a RAG problem and what is not a RAG problem? And so people, I think, don't have the same kind of mental model that maybe AI researchers like myself have, where if I give them access to a RAG agent, often the first question they ask is, what's in the data?

So that is not a RAG problem, actually. Or it's a RAG problem on sort of the metadata. It's not on the data itself, right? So a RAG question would be like, what was Meta's R&D expense in Q4 of 2024 and how did it compare to the previous year? Something like that, right? So it's kind of a specific question where you can extract the information and then kind of reason over it and synthesize that different information.

a lot of questions that people like to ask are not RAG problems. So it's like summarize the document is another one. Summarization is not a RAG problem. Ideally, you want to put the whole document in the context and then just summarize it. So there are different...

different strategies that work well for different questions and why ChatGPT is such a great product is because they kind of abstracted away some of those decisions that go into it, but that's still very much happening under the surface. So I think

People need to understand better what type of use case they have. Like if I'm a Qualcomm customer engineer and I need very specific answers to very specific questions, that's very clearly a RAG problem. If I need to summarize a document, just put that in context of a long context model. And so now we have Contextual, which is...

an amalgamation of multiple techniques. And you have what you call RAG 2.0, and you have fine-tuning, and there's a lot of things that happen under the covers that customers ideally don't have to worry about until they choose to do so. And I expect that changes radically the conversation you have with an enterprise executive. So how do you describe the kinds of problems that they should go find and apply and prioritize?

Yeah, so we often help people with use case discovery. So really just thinking through, okay, what are the RAG problems? What are maybe not really RAG problems? And then for the RAG problems, how do you prioritize them? How do you define success? How do you come up with a proper test set?

so that you can evaluate whether it actually works, what is the process for after that doing what we call UAT, user acceptance testing. So putting it in front of real people, that's really the thing that really matters, right? Sometimes we see production deployments and they're in production and then I ask them how many people use this and the answer is zero. But during the initial UAT, everything was great and everybody was saying, oh yeah, this is so great. But then when your boss asks you the question and your job is on the line,

then you do it yourself. You don't ask AI in that particular use case. It's a transformation that a lot of these companies still have to go through. Do the companies want

support through that journey today, either direct from Contextual or from a solution partner to get such things implemented? Yes. So I think it's very tempting to pretend that AI products are mature enough to be fully self-serve and standalone. It's sort of decent if you do that, but in order to get it to be really great, you just need to put in the work.

And so we do that for our customers or we can also work through systems integrators who can do that for us. I want to talk about two sides of the organization that you've had to build in order to bring all this for customers. One is scaling up the research and engineering function to keep pushing the envelope.

And there are a couple of very special things that Contextual has, something you call RAG 2.0, something you call active versus passive retrieval. Can you talk about some of those innovations that you've got inside Contextual and why they're important? We really want to be a frontier company, but we don't want to train foundation models.

I mean, obviously that's a very, very capital intensive business. I think language models are going to get commoditized. The really interesting problems are around how do you build systems around these models that solve the real problem.

And so most of the business problems that we encounter, they need to be solved by a system. So then there are a ton of super exciting research problems around how do I get that system to really work well together? So that's what RAG 2.0 is in our case. So like, how do you jointly optimize these components so that they can work well together?

But there's also other things like making sure that your generations are very grounded. So it's not a general language model. It's a language model that has been trained specifically for RAG and RAG only. It's not doing creative writing. It can only talk about what's in the context. And similarly, when you build these production systems, you need to have a state-of-the-art re-ranker. And ideally, that re-ranker can also follow instructions. So it's a smarter model.

So there's a lot of really innovative stuff that we're doing around building the RAG pipeline better and then how you incorporate feedback into that RAG pipeline as well. So we've done work on like KTO and APO and things like that. So really different ways to incorporate human preferences into entire systems and not just models. But that takes a very special team, which we have. I'm very proud of it.
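
A rough sketch of how the stages named here (first-stage retrieval, an instruction-following re-ranker, a RAG-only generator) compose into one pipeline. All three functions are illustrative stubs; the whole point of RAG 2.0 as described is that the real components are trained jointly, which plain composition like this does not capture.

```python
def retrieve(query: str, corpus: list[str], k: int = 10) -> list[str]:
    # First-stage recall: cheap lexical overlap as a stand-in.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:k]

def rerank(query: str, candidates: list[str], instruction: str) -> list[str]:
    # Stand-in for an instruction-following re-ranker: score candidates
    # against both the query and a natural-language instruction.
    keys = set((query + " " + instruction).lower().split())
    return sorted(candidates, key=lambda d: -len(keys & set(d.lower().split())))

def grounded_generate(query: str, context: list[str]) -> str:
    # A generator trained for RAG and RAG only: no creative writing,
    # it can only talk about what is in the context.
    return f"[answer to {query!r} from {len(context)} passages]"

def pipeline(query: str, corpus: list[str]) -> str:
    candidates = retrieve(query, corpus)
    top = rerank(query, candidates, instruction="prefer recent filings")[:3]
    return grounded_generate(query, top)
```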

Can you talk about active versus passive retrieval? Yeah. So passive retrieval is basically old-school RAG. It's like I get a query and I always retrieve. And then I take the results of that retrieval and I give them to the language model and it generates. So that doesn't really work. Very often you need the language model to think, first of all, where am I going to retrieve it from? And like, how am I going to retrieve it? Are there maybe better ways to search for the thing I'm looking for?

than just copy-pasting the query. So modern production RAG pipelines are already way more sophisticated than just having a vector database and a language model. And one of the interesting things that you can do in the new paradigm of agentic things and test-time reasoning is decide for yourself if you want to retrieve something. So it's active retrieval. It's like if you give me a query like, hi, how are you?

I don't have to retrieve in order to answer that, right? So I can just say, "I'm doing well, how can I help you?" And then you ask me a question and now I decide that I need to go and retrieve. But maybe I make a mistake with my initial retrieval. So then I need to go and think like, "Oh, actually, maybe I should have gone here instead." And so that sort of active retrieval, that's all getting unlocked now. So this is what we call RAG agents. And this really is the future, I think, because agents are great.

But we need a way to get them to work on your data. And that's where RAG comes in. This implies two relationships of Contextual and RAG to the agent. There is the supplying of information to the agent so that it can be performant. But if I probe into what you said, active retrieval implies a certain kind of reasoning. Maybe even

longer reasoning about, okay, what is the best source of the information that I've been asked to provide? Yeah, exactly. So it's like I enjoy saying, everything is contextual. That's very true for an enterprise, right? So the context that the data exists in, that really matters for the reasoning that the agent does in terms of finding the right information that all comes together in these RAG agents.
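
Here is a toy rendering of the passive-versus-active distinction just described: passive RAG always retrieves; the active version first decides whether retrieval is needed at all, then picks a source and can reconsider after a miss. The decision heuristics below stand in for the test-time reasoning a real RAG agent would do, and the sources are made up.

```python
def needs_retrieval(query: str) -> bool:
    # "Hi, how are you?" needs no retrieval; factual questions do.
    q = "".join(ch for ch in query.lower() if ch.isalpha() or ch == " ").strip()
    return q not in {"hi", "hello", "thanks", "hi how are you", "how are you"}

def search(source: str, query: str) -> list[str]:
    # Hypothetical per-source search; returns [] on a miss.
    corpora = {
        "wiki": {"who invented rag": ["RAG was introduced at FAIR."]},
        "filings": {"meta r&d expense q4 2024": ["R&D expense: <figure>."]},
    }
    return corpora.get(source, {}).get(query.lower().strip(" ?"), [])

def active_answer(query: str) -> str:
    if not needs_retrieval(query):
        return "I'm doing well, how can I help you?"
    # The agent picks a source, and reconsiders if the first choice misses.
    for source in ("wiki", "filings"):
        hits = search(source, query)
        if hits:
            return f"[grounded answer from {source}: {hits[0]}]"
    return "I couldn't find a grounded answer to that."
```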

What is a really thorny problem that you'd like your team and the industry to try and attack in the coming years? The most interesting problems that I see everywhere in enterprises are at the intersection of structured and unstructured. And so we have great companies working on unstructured data. There are great companies working on structured data. But once

You have the capability, which we're starting to have now, where you can reason over both of these very different kinds of data modalities using the same model. Then that unlocks so many cool use cases. That's really, I think, going to happen like this year, next year, just thinking through the different data modalities and how you can reason on top of all of them with these agents. Will that happen under the covers with one common

piece of infrastructure, or will it be a coherent single pane of glass across many different Lego bricks? So I'd like to think that it would be one solution and that is our platform. Let's imagine that, but behind the covers will you be accomplishing that with many different components, each handling the structured versus the unstructured? So they are different components, right? It's just

Despite what some people maybe like to pretend, I can always train up a better text-to-SQL model if I specialize it for text-to-SQL than taking a generic off-the-shelf language model and telling it, like, generate some SQL query. So specialization is always going to be better than generalization.

for specific problems if you know what the problem is that you're solving. The real question is much more around like, is it worth actually investing the money to do that? And so it costs money to specialize and it sort of sometimes hampers kind of economies of scale that you might want to have. If I look at the other side of your organization that you've had to build, so you've had to build a very sophisticated research function, but Contextual is not a research lab, it's a company. Mm-hmm.

So what are the other kinds of disciplines and capabilities you had to build up at Contextual that complement all the research that's happening here? Yeah. First of all, I think our researchers are really special in that we're not focused on like publishing papers or like being too far out on the frontier.

As a company, I don't think you can afford that until you're much bigger and if you're like Zuck and you can afford to have FAIR. And the stuff I was working on at FAIR at the time, I was doing like Wittgensteinian language games and like all kinds of crazy stuff that I would never let people do here, honestly.

But there's a place for that. And that's not a startup. So the way we do research is we're very much looking at what the customer problems are that we think we can solve better than anybody else. And then really just focusing again, like thinking from the system's perspective about all of these problems. How can we make sure that we have the best system and then make that system jointly optimized and really specialized or specializable for different use cases? That's kind of

what we can do. So that means that it's a very fluid boundary between your research and like applied research, basically. So all of our research is applied, but in AI right now, I think there's a very fine line between sort of product and research, where the research basically is the product. And that's not true just for us. I think it's true for OpenAI, Anthropic, everybody. Like the field is moving so quickly

that you have to productize research almost immediately. Like as soon as it's ready, like you don't even have time to write a paper about it anymore. You just like have to ship it into product very quickly because it's such a fast moving space. How do you allocate your research attention? Is there some element of play, even 5%, 10%? The team would probably say not enough. But not zero. Yeah, as a researcher, I think,

You always want to play more, but you have limited time. So yeah, it's a trade off. I don't think we're like officially committing, like we don't have a 20% rule or something like Google would have. It's more like we're just trying to solve cool problems as quickly as we can and hopefully have some impact on the world. So not just like work in isolation, but really try to focus on things that matter. Yeah.

I think I'm hearing you say that it's non-zero, even in an environment with finite resources and moving fast. Every environment has finite resources. I think it's more like if you really want to do special things, then you need to try new stuff. And so that's, I think, very different.

For like AI companies or AI native companies like us, if you compare this generation of companies with like SaaS companies, there is like, okay, all like the LAMP stack, everything was already there. You just have to basically go and like implement it. That's not the case here is that we're very much just figuring out what we're doing, like flying the airplane as we're building it sort of thing, which is exciting, I think. What is it like?

to now take this research that you're doing and go out into the world and have that make contact with enterprises? What has that been like for you personally? And what has that been like for the company to transform from research-led to a product company? Yeah. So, I mean, that's kind of my personal journey as well, right? I started off like I did a PhD. I was very much like a pure research person. And

kind of slowly transitioned to where I am now. Yeah, the key observation is really that the research is the product. And so this is a special point in time. It's not going to always be like that, I think. I think that's just been a lot of fun, honestly. I've been on a podcast before,

a while back, and they asked me what other job I would find interesting. And I said, maybe being the head of AI at JP Morgan. And they were like, really? Well, I think right now, at this particular point in time, that actually is a very interesting job, because you have to think about how you are going to change this giant company to use this latest piece of technology that is frankly going to change everything, right? It's going to change our entire society. And so, yeah,

I think, for me, it just gives me a lot of joy talking to people like that and thinking about what the future of the world is going to look like. I think there are going to be people problems and organizational problems and

regulatory and domain constraints that fall outside the bounds of the paper? I would maybe argue that those are the main problems still to be overcome. I don't care about AGI and all of those discussions. I think the core technology is already here for huge economic disruption.

So all the building blocks are here. The questions are more around: how do we get lawyers to understand that? How do we get the model risk management (MRM) people to figure out what an acceptable risk is? One thing that we are very big on is not thinking about the accuracy, but thinking about the inaccuracy. If you have 98% accuracy, what do you do with the remaining 2% to make sure that you can mitigate that risk?

A lot of this is happening right now. There's a lot of change management that we're going to need to do in these organizations. All of that is outside of the research questions, where I think we have all the pieces to completely disrupt the global economy right now. It's just a question of executing on it, which is kind of scary and exciting at the same time. You know, Douwe, you and I have had a conversation many times about

different archetypes of founders and their capabilities. There's one lens that really stuck with me that has three click stops on it. A: there's the domain expert, who has expertise in, say, revenue cycle management but really may not be that technical at all. B: there's somebody who is technical and able to write code but is not a PhD researcher; Mark Zuckerberg is a really famous example of that. And then, C,

there's the research founder, who has deep technical capabilities and really advanced vision into the frontier. What do you see as the role for each of those types of founders

in the next wave of companies that need to get built? Yeah, I think that's a very interesting question. I would ask: how many PhDs does Zuck have working for him? That's a lot, right? It's a lot. I don't think it really matters how deep your expertise in a specific domain is. As long as you are a good leader and a good visionary, you can recruit the PhDs to go and work for you.

But at the same time, obviously, it gives you an advantage if you are very deep in one field and that field just happens to take off, which is sort of what happened to me. I just got very lucky, I think, with a lot of the timing there as well. But overall, I think one underlying question you're asking there is about AI wrapper companies, for example, right?

To what extent should companies go horizontal or vertical using this technology? There's been a lot of disdain for these wrapper companies: "Oh, that's just a wrapper for OpenAI." Well, it turns out you can make an amazing business just from that, right? I think Cursor is Anthropic's biggest customer right now. I think it's

fine to be a wrapper company as long as you have an amazing business. People should have a lot more respect for companies building on top of fundamental new technology, discovering whole new business problems that never existed before, and then solving them much better than anything else. Well, so I'm really thinking also about the comment you made, that we have

a lot of technology that is capable of a lot of economic impact even today, without the new breakthroughs that, yes, we will also get. Does that change the types of companies that should be founded in the coming year?

I think so. I mean, I am also learning a lot of this myself, about how to be a good founder, basically. But I think it's always good to plan for what's going to come, not for what is here right now. That's how you really get to ride the wave in the right way. And what's going to come is that a lot of this stuff is going to become much more mature. One of the big problems we had even two years ago was that AI infrastructure

was just very, very immature. Everything would break down all the time. There were bugs in the attention mechanism implementations of the frameworks we were using, really basic stuff. All of that has been solved now. With that maturity also comes the ability to scale much better and to think much more rigorously about cost-quality trade-offs and things like that. So there's a lot of business value just right there.

What do new founders ask you? What kind of advice do they ask you for? They ask me a lot about this wrapper-company thing, and about moats and differentiation. I think there's some fear that incumbents are just going to eat everything, and they obviously have amazing distribution. But I think there are just massive opportunities for companies to be AI-native companies

and to really think from day one as an AI company. If you do that right, you have a massive opportunity to be the next Google or Facebook or whatever, if you play your cards right. What is some advice that you've gotten? And I'll actually ask you to break it into two. What is advice that you've gotten that you disagree with, and what do you think about that? And then, what is advice that you've gotten that you take a lot from?

Maybe we can start with the advice I really like, which is one observation around why Facebook is so successful.

It's: be fluid like water. Whatever the market is telling you, or your users are telling you, fit into that. Don't be too rigid about what is right and wrong. Be humble, I think, and just look at what the data tells you, then try to optimize for that. That is advice that, when I got it, I didn't fully appreciate, and I'm starting to appreciate it much more right now. Honestly, it took me too long to understand that. In terms of advice that I've gotten that I disagree with,

it's very easy for people to say you should do one thing and do it well. Sure, maybe, but I'd like to be more ambitious than that. We could have been one small part of a RAG stack, and we probably would have been the best in the world at that particular thing. But then we're just slotting into this ecosystem as one small piece.

And I want the whole pie, ideally. That's why we've invested so much time in building this platform, making sure that all the individual components are state of the art and that they've been made to work together, so that you can really solve this much bigger problem. But that is also a lot harder to do. And so not everyone would give me the advice to go and solve that hard problem. But I think,

over time, as a company, that is where your moat comes from, right? Doing something that everybody else thinks is kind of crazy. So that would be my advice to founders: go and do something that everybody else thinks is crazy. You're probably going to tell me that that reflects in the team that comes to join you. Yeah, I mean, the company is the team, especially the early team. We've been very fortunate with the people who joined us early on, and that is what the company is, right? It's the people.

So if I page back a little bit and we get back into the technology for a minute, there's a common question, maybe even a misunderstanding, that I hear about RAG: that this is the thing that's going to solve hallucinations. And you and I have spoken about this so many times. Where is your head at right now on that?

What hallucinations are, what they are not, does RAG solve them? What's the outlook there? I think hallucination is not a very technical term. That's right. We used to have a pretty good word for it: accuracy. If you were inaccurate, if you were wrong, then one way to explain that, or to anthropomorphize it, would be to say, oh, the model hallucinated. I think it's a very ill-defined term, honestly. If I had to try to turn it into a technical definition, I would say the generation of the language model is not grounded in the context that it's given, where it is told that that context is true. So basically, hallucination is about groundedness. If you have a model that really adheres to its context, then it will hallucinate less.
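To make that definition concrete, here is a minimal sketch of a groundedness score: the fraction of a response's claims that are supported by the retrieved context. The `toy_judge` keyword heuristic and the example data are invented for illustration; a real system would use a trained judge or entailment model, not keyword overlap.

```python
# Minimal sketch: groundedness as the fraction of claims supported by context.
# toy_judge is a naive keyword stand-in for a real judge/entailment model.

def toy_judge(context: str, claim: str) -> bool:
    """Treat a claim as supported if most of its content words appear
    in the context. A real judge would be a trained model, not this."""
    words = [w.lower().strip(".,") for w in claim.split() if len(w) > 3]
    if not words:
        return True
    hits = sum(1 for w in words if w in context.lower())
    return hits / len(words) >= 0.8

def groundedness_score(context: str, claims: list[str]) -> float:
    """Fraction of claims judged to be grounded in the given context."""
    if not claims:
        return 1.0
    return sum(toy_judge(context, c) for c in claims) / len(claims)

context = "Acme's Q3 revenue was $12M, up 8% year over year."
claims = [
    "Acme's Q3 revenue was $12M.",             # grounded
    "Acme's revenue grew 8% year over year.",  # grounded
    "Acme plans to expand into Europe.",       # not in the context
]
print(groundedness_score(context, claims))  # ~0.67: one ungrounded claim
```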

But hallucination itself is arguably a feature for a general-purpose language model, not a bug, right? If you have a creative writing or marketing use case, content generation, I think hallucination is great.

As long as you have a way to fix it; you probably have a human somewhere double-checking it and rewriting some stuff. So hallucination in itself is not necessarily a bad thing. It is a bad thing if you have a RAG problem, though, and you cannot afford to make a mistake. That's why we have a grounded language model that has been trained specifically not to hallucinate, or to hallucinate less.

Because one other misconception that I sometimes see is that people think that these probabilistic systems can have 100% accuracy. And that, I think, is just a pipe dream. It's the same with people, right? If you look at a big bank,

there are people in these banks, and people make mistakes too. And so AI also... SEC filings have mistakes. Exactly. And the whole reason we have the SEC, and that this is a regulated market, is so that we have mechanisms built into the market: if a person makes a mistake, then at least we made reasonable efforts to mitigate the risk around that.

It's the same with AI deployments. That's why I keep talking about how to mitigate the risk of inaccuracies. We're not going to get it to 100%, so you need to think about the remaining 2, 3, 5, 10%, depending on how hard the use case is, where you might still not be perfect. How do you deal with that? What are some of the things that you might have believed a year ago

about AI adoption or AI capabilities that you think very differently about today? Many things. The main thing that turned out not to be true was that I thought this would be easy.

What is "this"? This, as in building the company and solving real problems with AI. I think we were very naive, especially in the beginning of the company. We were like, "Oh yeah, we just get a research cluster, get a bunch of GPUs in there, we train some models, it's going to be great."

And then it turned out that getting a working GPU cluster was actually very hard. And then it turned out that training something on that GPU cluster in a way that actually works is also hard: if you're using other people's code, maybe that code is not that great yet. So you have to build your own framework for a lot of the stuff that you're doing if you want to make sure that it's really, really good.

So we just had to do a lot of plumbing that we really did not expect to have to do. Now I'm very happy that we did all that work, but at the time, it was very frustrating. What are we, you and I, or the industry, not talking about nearly enough that we should be? Evaluation. I've been

doing a lot of work on evaluation in my research career, things like DynaBench, which was really about how we might get rid of benchmarks altogether and have a more dynamic way to measure model performance.

But evaluation is just very boring. People don't seem to care about it. I care deeply about it, so that always surprises me. We did this amazing launch, I thought, around LMUnit. It's natural-language unit testing: you have a response from a language model, and you want to check very specific things about that response. Did it contain this? Did it avoid this mistake? Ideally, you can write unit tests, as a person, for what a good response looks like.

You can do that with our approach, and we have a model that is by far state of the art at verifying whether these unit tests pass or fail. So I think this is awesome. I just love talking about this, but people don't seem to really care. It's like: oh yeah, evaluation, we have a spreadsheet somewhere with 10 examples. How is that possible? It's such an important problem. When you deploy AI, you need to know whether it actually works, you need to know where it falls short, you need to have trust in your deployment, and you need to think about the things that might go wrong.
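For readers who want to see the shape of the idea, here is an illustrative sketch of natural-language unit testing. This is not Contextual AI's actual LMUnit API; the `UnitTest` class is hypothetical, and the keyword checks stand in for the trained judge model that would verify each criterion in a real system.

```python
# Illustrative sketch of natural-language unit tests over an LLM response.
# Each test is a plain-English criterion; here, simple keyword rules stand
# in for the judge model that would verify the criterion in a real system.

from dataclasses import dataclass
from typing import Callable

@dataclass
class UnitTest:
    criterion: str                # plain-English description of a good response
    check: Callable[[str], bool]  # stand-in for a judge-model call

def run_unit_tests(response: str, tests: list[UnitTest]) -> dict[str, bool]:
    """Return a pass/fail verdict for each natural-language unit test."""
    return {t.criterion: t.check(response) for t in tests}

response = "Revenue was $12M in Q3, per the 10-Q filing."
tests = [
    UnitTest("Cites the source filing.", lambda r: "10-q" in r.lower()),
    UnitTest("States the revenue figure.", lambda r: "$12m" in r.lower()),
    UnitTest("Avoids hedged speculation.", lambda r: "probably" not in r.lower()),
]

for criterion, passed in run_unit_tests(response, tests).items():
    print(f"{'PASS' if passed else 'FAIL'}: {criterion}")
```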

So it's been very surprising to me just how immature a lot of companies are when it comes to evaluation. And this includes huge companies. Yeah. You know, Garry Tan posted on social media not too long ago that evaluation is the secret weapon of the

strongest AI application companies. Also AI research companies, by the way. Part of why OpenAI and Anthropic are so great is that they're amazing at evaluation too; they know exactly what good looks like. That's also why we are doing all of that in-house. We're not outsourcing evaluation to somebody else. If you are an AI company and AI is your product, then

you can only assess the quality of your product through evaluation. So it's really core to all of these companies. So, whoever is lucky enough to get that cool JP Morgan head-of-AI job that you would be doing in another life: is what the evals really need to look like the intellectual property of JP Morgan? Or is this something they can ultimately ask Contextual to cover for them? No. I think they can use us for the evaluation tooling,

but the actual expertise that goes into that evaluation, the unit tests,

they should write themselves, right? We talked about how a company is its people, but in the limit, that might not even be true, because a company might be mostly AI with only a few people. So what makes a company a company is its data, the expertise around that data, and the institutional knowledge. That is really what defines a company, and that should be captured in how you evaluate the systems you deploy in your company.

Maybe we can leave it there. Douwe Kiela, thank you so much. This was a lot of fun. Thank you.