
RAG Inventor Talks Agents, Grounded AI, and Enterprise Impact

2025/3/27

Founded & Funded


People

Douwe Kiela

Jon Turow


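Douwe Kiela: I am the co-founder and CEO of Contextual AI and a co-inventor of RAG (retrieval-augmented generation). Our company builds next-generation retrieval systems for enterprises. I want Contextual AI to be more ambitious: not just one small part of the RAG stack, but the owner of the whole market. RAG originated as a way to address language models' lack of background knowledge: combining a language model with an external knowledge base lets it understand and generate text better. The initial inspiration was to combine language models with external text knowledge bases such as Wikipedia. RAG's success is partly due to the availability at the time of Facebook AI Similarity Search (FAISS, effectively a vector database), which made it possible to combine the output of a vector database with a generative model. One of RAG's original motivations was to address stale knowledge in generative models by retrieving up-to-date information. RAG is not mutually exclusive with fine-tuning or long context windows; they can be combined for the best performance. RAG is popular because it offers a simple way to get large language models working on enterprise data. Next-generation RAG systems will let developers focus on business value and differentiation rather than technical details such as chunking strategy. Although CTOs want RAG to be easy to use, developers still need to attend to technical details such as chunking strategy to optimize performance. Enterprise adoption of RAG is uneven: some enterprises are still exploring, while others have built their own RAG platforms, possibly on wrong assumptions. Many enterprises aim too low when choosing RAG use cases; they should pick more complex, higher-impact use cases to get a better return on investment. The ROI of RAG depends on the complexity of the use case and the number of employees affected. Generative AI deployments fall mainly into two categories: cost savings and business transformation. Common enterprise misconceptions include underestimating how hard it is to get RAG into production and misunderstanding which scenarios RAG suits. It is important to distinguish the problems RAG can solve from those it cannot; for example, RAG is good at answering specific questions but not at summarizing documents. Contextual AI's RAG 2.0, together with its integration of other techniques, changes the conversation with enterprise executives. Contextual AI helps enterprises discover suitable RAG use cases and define success metrics and test sets to evaluate effectiveness. User acceptance testing (UAT) is critical to a successful RAG deployment, because user behavior in real use may differ from the test phase. Contextual AI supports enterprises through RAG deployment and adoption. Contextual AI focuses on building systems around large language models rather than training foundation models, because large language models will be commoditized. RAG 2.0 focuses on jointly optimizing the components of the RAG pipeline to improve performance. On active versus passive retrieval: active retrieval lets the language model decide whether to retrieve and adjust its retrieval strategy as needed. RAG agents select the best information source through contextual reasoning. An important future challenge for RAG is bringing structured and unstructured data together. For a specific problem, a specialized model always beats a general one. Contextual AI's research focuses on solving real customer problems rather than purely academic research. In AI, the boundary between research and product development is very blurry, and research results must be turned into product quickly. On AI-native companies versus SaaS companies: AI-native companies must keep exploring and learning while they build the product. Turning research into product and working with enterprises has been an exciting experience. Current AI technology is already capable of huge economic impact, but non-technical problems remain, such as regulation and organizational management. In AI deployment, it is critical to focus on inaccuracy and how to mitigate the risk. Different types of founders (domain experts, technical experts, and research founders) all play important roles in AI startups. Whether a founder has deep domain knowledge matters less than leadership and vision. "Wrapper companies" built on foundational technology also have huge business opportunities. Future AI startup opportunities lie in using the maturity of AI technology to solve more complex problems, with attention to cost-effectiveness. Emerging AI companies need to focus on differentiation and avoid being swallowed by large companies. Adapt flexibly to market changes and stay humble. Don't be afraid to take on hard problems; aim for bigger goals. A company's core is its data and its expertise about that data, and this should be reflected in how it evaluates models. "Hallucination" is an imprecise term; it should be measured as accuracy. Hallucination relates to a model's groundedness: if the model adheres closely to its context, it hallucinates less. Building an AI company and solving real problems is much harder than expected. An important overlooked issue in AI is model evaluation. Evaluation is critical to successful AI deployment, but many companies pay too little attention to it. Enterprises can use Contextual AI's tooling for evaluation, but the evaluation expertise has to be built up by the enterprise itself.

Jon Turow: A common critique of early generative models was their knowledge cutoff; for example, models from 2020 and 2021 were not aware of COVID-19. How RAG relates to other techniques (such as training, fine-tuning, and in-context learning) and the role it plays alongside them. RAG's popularity and how it is misunderstood: it is seen as a silver bullet for every problem. Enterprise executives' positive reactions to RAG and the convenience it brings. Next-generation RAG systems will let developers focus on business value and differentiation rather than technical details such as chunking strategy. CTOs want RAG to integrate easily into their existing architecture and work out of the box. Although CTOs want RAG to be easy to use, developers still need to attend to technical details such as chunking strategy to optimize performance. Enterprise adoption of RAG is uneven: some enterprises are still exploring, while others have built their own RAG platforms, possibly on wrong assumptions. The stages of RAG adoption: 2023 was the demo stage, 2024 the productionization stage, and 2025 the stage of pursuing return on investment. Enterprises measure the ROI of RAG differently: some focus on cost savings, others on business transformation and revenue growth. Many enterprises aim too low when choosing RAG use cases; they should pick more complex, higher-impact use cases to get a better return on investment. The ROI of RAG depends on the complexity of the use case and the number of employees affected. Generative AI deployments fall mainly into two categories: cost savings and business transformation. Common enterprise misconceptions include underestimating how hard it is to get RAG into production and misunderstanding which scenarios RAG suits. Contextual AI's RAG 2.0, together with its integration of other techniques, changes the conversation with enterprise executives. The discussion of hallucination: whether RAG can solve it. "Hallucination" is an imprecise term; it should be measured as accuracy. Changing views on AI adoption and capabilities. Building an AI company and solving real problems is much harder than expected. An important overlooked issue in AI is model evaluation. Evaluation is critical to successful AI deployment, but many companies pay too little attention to it.

Supporting evidence
Douwe Kiela: 'So the history of the RAG project, so we were at Facebook AI Research, obviously FAIR, and I had been doing a lot of work on grounding already for my PhD thesis. And grounding at the time really meant understanding language with respect to something else.' Douwe Kiela: 'So it was like, if you want to know the meaning of the word cat, like the embedding, word embedding of the word cat, this was before we had like sentence embeddings, then ideally you would also know what cats look like because then you understand the meaning of cat better. So that type of perceptual grounding was something that a lot of people were looking at at the time.' Douwe Kiela: 'And then I was talking with one of my PhD students, Ethan Perez, about can we ground it in something else? Maybe we can ground in other text instead of in images. So the obvious source at the time to ground in would be Wikipedia. So we would say this is true, sort of true. And then you can understand language with respect to that ground truth.' Jon Turow: 'Well, you know, this takes me back to another common critique of these early generative models that for the amazing Q&A that they were capable of, the knowledge cutoff was really striking.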
You've had models in 2020 and 2021 that were not aware of COVID-19.' Douwe Kiela: 'Yeah, it was part of the original motivation, right? So that is sort of what grounding is, the vision behind the original RAG project. And so we did a lot of work after that on that question as well, is can I have a very lightweight language model that basically has no knowledge?' Jon Turow: 'Now we have RAG, and we still have this constellation of other techniques. We have training, and we have tuning, and we have in-context learning. And that was, I'm sure, very hard to navigate for research labs, let alone enterprises. In the conception of RAG, in the early implementations of it, what was in your head about how RAG was going to fit into that constellation? Was it meant to be standalone?' Douwe Kiela: 'Yeah, it's interesting because the concept of in-context learning didn't really exist at the time. That really became a thing with GPT-3, I think, where they showed that that works. And that's just an amazing paper and an amazing proof point that that actually works. And I think that really unlocked a lot of possibilities. But in the original RAG paper, we have a baseline, what we call the frozen baseline, where we don't do any training and we just give it as context.' Douwe Kiela: 'So in-context learning is great, but you can probably always beat it through machine learning if you're able to do that. So if you have access to the parameters, which is obviously not the case with a lot of these black-box frontier language models, but if you have access to the parameters and you can optimize them for the data you're working on or the problem you're solving, then at least theoretically, you should always be able to do better.
So I see a lot of kind of' Jon Turow: 'What has happened since then is that, and we'll talk about how this is all getting combined in more sophisticated ways today, but I think it's fair to say that in the past 18, 24, 36 months, Rag has caught fire and even become misunderstood as the single silver bullet. Why do you think it's been so seductive?' Douwe Kiela: 'It's seductive because it's easy. I honestly, I think like long context is actually even more seductive if you're lazy, right? Because then you don't even have to worry about the retrieval anymore. You just put it all there and you pay a heavy price for having all of that data in the context. You're like every single time you're answering a question about Harry Potter, you have to read the whole book in order to answer the question, which is not great. So RAG is seductive, I think, because you need to have' Jon Turow: 'And we'll get to the part where we're talking about how you need to move beyond a cool demo. But I think the power of a cool demo should not be underestimated. And RAG enables that. What are some of the aha moments that you see with enterprise executives?' Douwe Kiela: 'Yeah, I mean, there are lots of aha moments. I think like that's part of the joy of my job. I think it's where you get to show what this can do and it's just amazing sometimes what these models can do. But yeah, so basic aha moments for us.' Douwe Kiela: 'So the next generation of these systems and platforms for building these RAG agents is going to enable developers to think much more about business value and differentiation, essentially. How can I be better than my competitors because I've solved this problem so much better? So your chunking strategy should really not be important for solving that problem.' Jon Turow: 'Well, so if I now connect what we were just talking about to what you said now, the seduction of long context and RAG are that it's straightforward and it's easy. It plugs into my existing architecture. 
And as a CTO, if I have finite resources to go implement new pieces of technology, let alone dig into concepts like chunking strategies and how the vector similarity for non-dairy will look similar to the vector similarity for milk, things like this. Is it fair to say that CTOs are wanting something coherent' Douwe Kiela: 'But then what we often find is that we talk to these people and then they talk to their architects and their developers. And those developers love thinking about chunking strategies because that's what it means in a modern era to be an AI engineer is to be very good at prompt engineering and evaluation and optimizing all the different parts of the RAG stack.' Douwe Kiela: 'So I think it's very important to have the flexibility to play with these different strategies. But you need to have very, very good defaults so that these people don't have to do that unless they really want to squeeze like the final percent and then they can do that. So that's what we're trying to offer is like you don't have to worry about all this stuff.' Douwe Kiela: 'The timeline is basically 2023 was the year of the demo. ChatGPT, it just happened. Everybody was kind of playing with it. There was a lot of experimental budget. Last year has been about trying to productionize it and you could probably get promoted if you were in a large enterprise, if you were the first one to ship GenAI into production. So there's been a lot of kind of kneecapping of those solutions happening in order to be the first one to get it into production.' Douwe Kiela: 'This year, those first past the post, so I say past the post, but in a limited way, because it's actually very hard to get the real thing past the post. Right. So this year, people are really under a lot of pressure to deliver return on investment for all of those investments and all of the experimentation that has been happening. So it turns out that actually getting that ROI is a very different question.'
Douwe Kiela: 'I think my general stance on like use case adoption is that I see a lot of people kind of aiming too low. Where it's like, oh, we have AI running in production. It's like, oh, what do you have? Well, we have something that can tell us who our 401(k) provider is and how many vacation days I get. And that's nice. Is that where you get the ROI of AI from? Obviously not. You need to move up in terms of complexity.' Douwe Kiela: 'Yeah, so there's roughly two categories for GenAI deployment, right? One is cost savings. So I have lots of people doing one thing. If I make all of them slightly more effective, then I can save myself a lot of money. And the other is more around business transformation and generating new revenue.' Douwe Kiela: 'I see some confusion around this kind of the gap between demo and production. A lot of people think that the common misconception we see is like, oh, yeah, it's great. I can easily do this myself. And then it turns out that everything breaks down after like 100 documents and they have a million. And so that is the most common one that we see. But I think there are other misconceptions maybe around what RAG is good for.' Douwe Kiela: 'and what is not. So what is a RAG problem and what is not a RAG problem? And so people, I think, don't have the same kind of mental model that maybe AI researchers like myself have, where if I give them access to a RAG agent, often the first question they ask is, what's in the data?' Jon Turow: 'So now we have Contextual, which is an amalgamation of multiple techniques. And you have what you call RAG 2.0, and you have fine-tuning, and there's a lot of things that happen under the covers that customers ideally don't have to worry about until they choose to do so. And I expect that changes radically the conversation you have with an enterprise executive. So how do you describe the kinds of problems that they should go find and apply and prioritize?'
Douwe Kiela: 'Yeah, so we often help people with use case discovery. So really just thinking through, okay, what are the RAG problems? What are maybe not really RAG problems? And then for the RAG problems, how do you prioritize them? How do you define success? How do you come up with a proper test set?' Douwe Kiela: 'so that you can evaluate whether it actually works, what is the process for after that doing what we call UAT, user acceptance testing. So putting it in front of real people, that's really the thing that really matters, right? Sometimes we see production deployments and they're in production and then I ask them how many people use this and the answer is zero.' Douwe Kiela: 'Yes. So I think it's very tempting to pretend that AI products are mature enough to be fully self-serve and standalone. It's sort of decent if you do that, but in order to get it to be really great, you just need to put in the work.' Jon Turow: 'I want to talk about two sides of the organization that you've had to build in order to bring all this for customers. One is scaling up the research and engineering function to keep pushing the envelope. And there are a couple of very special things that Contextual has, something you call RAG 2.0, something you call active versus passive retrieval. Can you talk about some of those innovations that you've got inside Contextual and why they're important?' Douwe Kiela: 'We really want to be a frontier company, but we don't want to train foundation models. I mean, obviously that's a very, very capital intensive business. I think language models are going to get commoditized. The really interesting problems are around how do you build systems around these models that solve the real problem.' Douwe Kiela: 'And so most of the business problems that we encounter, they need to be solved by a system. So then there are a ton of super exciting research problems around how do I get that system to really work well together?
So that's what RAG 2.0 is in our case. So like, how do you jointly optimize these components so that they can work well together?' Jon Turow: 'Can you talk about active versus passive retrieval?' Douwe Kiela: 'Yeah. So passive retrieval is basically old school RAG. It's like I get a query and I always retrieve. And then I take the results of that retrieval and I give them to the language model and it generates. So that doesn't really work. Very often you need the language model to think, first of all, where am I going to retrieve it from? And like, how am I going to retrieve it? Are there maybe better ways to search for the thing I'm looking for?' Jon Turow: 'This implies two uses, two relationships, of Contextual and RAG to the agent. There is the supplying of information to the agent so that it can be performant. But if I probe into what you said, active retrieval implies a certain kind of reasoning. Maybe even' Douwe Kiela: 'Yeah, exactly. So it's like I enjoy saying, everything is contextual. That's very true for an enterprise, right? So the context that the data exists in, that really matters for the reasoning that the agent does in terms of finding the right information that all comes together in these RAG agents.' Jon Turow: 'What is a really thorny problem that you'd like your team and the industry to try and attack in the coming years?' Douwe Kiela: 'The most interesting problems that I see everywhere in enterprises are at the intersection of structured and unstructured. And so we have great companies working on unstructured data. There are great companies working on structured data. But once' Douwe Kiela: 'So they are different components, right? Despite what some people maybe like to pretend, I can always train up a better text-to-SQL model if I specialize it for text-to-SQL than taking a generic off-the-shelf language model and telling it, like, generate some SQL query. So specialization is always going to be better than generalization.' Douwe Kiela: 'Yeah. First of all, I think our researchers are really special in that we're not focused on like publishing papers or like being too far out on the frontier. As a company, I don't think you can afford that until you're much bigger and if you're like Zuck and you can afford to have FAIR. And the stuff I was working on at FAIR at the time, I was doing like Wittgensteinian language games and like all kinds of crazy stuff that I would never let people do here, honestly.' Douwe Kiela: 'But there's a place for that. And that's not a startup. So the way we do research is we're very much looking at what the customer problems are that we think we can solve better than anybody else. And then really just focusing again, like thinking from the system's perspective about all of these problems. How can we make sure that we have the best system and then make that system jointly optimized and really specialized or specializable for different use cases? That's kind of.' Douwe Kiela: 'For like AI companies or AI native companies like us, if you compare this generation of companies with like SaaS companies, there is like, okay, all like the LAMP stack, everything was already there. You just have to basically go and like implement it. That's not the case here is that we're very much just figuring out what we're doing, like flying the airplane as we're building it sort of thing, which is exciting, I think.' Douwe Kiela: 'Yeah. So, I mean, that's kind of my personal journey as well, right? I started off like I did a PhD.
I was very much like a pure research person. And' Jon Turow: 'I think there's going to be people problems and organizational problems and regulatory and domain constraints that fall outside the bounds of the paper?' Douwe Kiela: 'I would maybe argue that those are the main problems to still be overcome. I don't care about AGI and all of those discussions. I think the core technology is already here for huge economic disruption.' Douwe Kiela: 'So all the building blocks are here. The questions are more around how do we get lawyers to understand that? How do we get the MRM people to figure out what is an acceptable risk? One thing that we are very big on is not thinking about the accuracy, but thinking about the inaccuracy. And what do you do with the, like, if you have 98% accuracy, what do you do with the remaining 2% to make sure that you can mitigate that risk?' Jon Turow: 'What do new founders ask you? What kind of advice do they ask you?' Douwe Kiela: 'They ask me a lot about this like wrapper company thing and moats and differentiation. I think there's some fear that like incumbents are just going to eat everything. And so they obviously have amazing distribution. But yeah, I think there are just massive opportunities for companies to be AI native companies.' Douwe Kiela: 'Yeah, I think that's a very interesting question. I would argue like how many PhDs does Zuck have working for him? That's a lot, right? It's a lot. I don't think it really matters like how deep your expertise in a specific domain is. As long as you are a good leader and a good visionary, then you can recruit the PhDs to go and work for you.' Douwe Kiela: 'It's fine to be a wrapper company as long as you have an amazing business. People should have a lot more respect for companies building on top of fundamental new technology and then just discovering whole new business problems that we didn't really know existed, and then solving them much better than anything else.' Douwe Kiela: 'I think so.
I mean, I am also learning a lot of this myself, like about how to be a good founder basically. But I think it's always good to sort of plan for what's going to come and not for what is here right now. And that's how you really get to ride that wave in the right way. And so what's going to come is that a lot of this stuff is going to become much more mature.' Jon Turow: 'What is some advice that you've gotten? And I'll actually ask you to break it into two. What is advice that you've gotten that you disagree with? And what do you think about that? And then what is advice that you've gotten that you take a lot from?' Douwe Kiela: 'Maybe we can start with the advice I really like, which is one observation around why Facebook is so successful. It's like be fluid like water. It's like whatever the market is telling you or your users are telling you, like fit into that. Don't be too rigorous in like what is right and wrong. Just like be humble, I think, and just like look at what the data tells you and then try to optimize for that. That is advice that when I got it, I didn't really appreciate it fully. And I'm starting to appreciate that much more right now. Honestly, it took me too long to understand that.' Douwe Kiela: 'In terms of advice that I've gotten that I disagree with, it's very easy for people to say, like, you should do one thing and you should do it well. Sure, maybe, but I'd like to be more ambitious than that. So we could have been like one small part of a RAG stack and we probably would have been the best in the world at that particular thing.
But then we're just slotting into this ecosystem where we're just like a small piece.' Jon Turow: 'So if I page back a little bit and we get back into the technology for a minute, there's a common question, maybe even misunderstanding that I hear about RAG, that, oh, this is the thing that's going to solve hallucinations. And you and I have spoken about this so many times. Where is your head at right now on that?' Jon Turow: 'What hallucinations are, what they are not, does RAG solve it? What's the outlook there?' Douwe Kiela: 'I think like hallucination is not a very technical term. That's right. So we used to have a pretty good word for it. It was just accuracy. And so if you were like inaccurate, if you were wrong, then one way I guess to explain that or to anthropomorphize it would be to say, oh, the model hallucinated. I think it's a very ill-defined term, honestly.' Douwe Kiela: 'If I would have to try to turn it into a technical definition, I would say the generation of the language model is not grounded in the context that it's given, where it is told that context is true. So basically, hallucination is about groundedness. If you have a model that really adheres to its context, then it will hallucinate less.' Jon Turow: 'What are some of the things that you might have believed a year ago about AI adoption or AI capabilities that you think very differently about today?' Douwe Kiela: 'Many things. The main thing I thought that turned out not to be true was that I thought this would be easy.' Jon Turow: 'What is this?' Douwe Kiela: 'This, like building the company and solving real problems with AI. I think we were very naive, especially in the beginning of the company. We were like, "Oh yeah, we just get a research cluster, get a bunch of GPUs in there, we train some models, it's going to be great."' Jon Turow: 'What are we, either you and I, or are we the industry not talking about nearly enough that we should be?' Douwe Kiela: 'Evaluation.
I've been doing a lot of work on evaluation in my research career. Things like Dynabench where it was really about like how do we hopefully maybe get rid of like benchmarks altogether and sort of have a more dynamic way to measure model performance.' Douwe Kiela: 'But evaluation is just very boring. People don't seem to care about it. I care deeply about it. So that always surprises me. Like we did this amazing launch, I thought, around LMUnit. It's natural language unit testing. So you have a response from a language model and now you want to check very specific things about that response. It's like, did it contain this? Did it not make this mistake? Like ideally, you can write unit tests as a person for what a good response looks like.' Jon Turow: 'So whoever is lucky enough to get that cool JP Morgan head of AI job that you would be doing in another life, is that intellectual property of JP Morgan what the evals really need to look like? Or is this something that they can ultimately ask Contextual to cover for them?' Douwe Kiela: 'No, so I think the tooling for evaluation they can use us for.' Douwe Kiela: 'but the actual expertise that goes into that evaluation, so the unit tests, they should write that themselves, right? Like in the limit we talked about, like a company is its people, but in the limit that might not even be true, right? Because there might be AI mostly and maybe only a few people. So what makes a company a company is its data and the expertise around that data and sort of the institutional knowledge. And so that is really what defines a company. And so that should be captured in how you evaluate the systems that you deploy in your company.'
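The idea of writing unit tests for a model response ("did it contain this? did it not make this mistake?") can be illustrated with a small sketch. This is not LMUnit and does not use Contextual AI's API: LMUnit is described above as natural language unit testing, whereas this example substitutes plain Python predicates so that it runs without a model. The names `ResponseTest`, `run_tests`, and `pass_rate`, and the sample response text, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ResponseTest:
    """One named check on a model response (hypothetical helper type)."""
    description: str
    check: Callable[[str], bool]


def run_tests(response: str, tests: List[ResponseTest]) -> Dict[str, bool]:
    """Run every check against the response and report pass/fail per test."""
    return {t.description: t.check(response) for t in tests}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of tests that passed; 0.0 when there are no tests."""
    return sum(results.values()) / len(results) if results else 0.0


# Domain experts would write these; the two below are made-up examples.
tests = [
    ResponseTest("names the 401(k) provider", lambda r: "401(k)" in r),
    ResponseTest("does not promise unlimited vacation",
                 lambda r: "unlimited vacation" not in r.lower()),
]

good = "Your 401(k) provider is Example Trust, and you get 20 vacation days."
bad = "You have unlimited vacation."

good_results = run_tests(good, tests)
bad_results = run_tests(bad, tests)
```

The point of the design is the one made in the quote: the tooling (here, `run_tests`) is generic, while the list of tests encodes the company's own expertise about what a good answer looks like.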


Chapters

This chapter details the creation of RAG at Facebook AI Research, highlighting its initial goal of grounding language models in external text, particularly Wikipedia. It emphasizes the collaboration with other researchers and the role of vector databases in enabling the combination of retrieval and generative models.

RAG originated from grounding language models in external text.
Initial grounding attempts used Wikipedia.
Collaboration with researchers at Facebook and elsewhere was crucial.
Early RAG models were multimodal, though primarily language-focused in application.

Transcript

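It's very easy for people to say that you should do one thing and do it well. Sure, maybe, but I'd like to be more ambitious than that. We could have been one small part of a RAG stack, and we probably would have been the best in the world at that particular thing. But then we would just be slotting into this ecosystem as a small piece, and ideally I want the whole pie.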

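Welcome to Founded and Funded. I'm Jon Turow, a partner at Madrona, and with me today is Douwe Kiela, founder and CEO of Contextual AI, which builds next-generation retrieval systems for enterprises. Douwe is also the co-creator of retrieval-augmented generation (RAG). As the saying goes, to do good work you must first sharpen your tools. Douwe created RAG, but he has resisted that tendency. RAG has become one of the most widely adopted techniques in enterprise AI, yet he keeps pushing the boundaries, even when customers are not always ready to hear what comes next. RAG was never meant to be the final answer; it was the start of something bigger. So Douwe has a unique perspective, as both a researcher and a founder who brings innovation to market. Whether you're a builder, an investor, or an AI practitioner, this conversation will challenge how you think about the future of enterprise AI.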

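So let's dive in. Douwe, take us back to the origins of RAG. What problem were you trying to solve when you came up with it?

The history of the RAG project is this: we were at Facebook AI Research (FAIR), and I had already been doing a lot of work on grounding for my PhD thesis. Grounding at the time really meant understanding language with respect to something else.

For example, if you want to know the meaning of the word "cat", say the word embedding of "cat" (this was before we had sentence embeddings), then ideally you would also know what cats look like, because then you understand the meaning of "cat" better. That type of perceptual grounding was something a lot of people were looking at at the time.

Then I was talking with one of my PhD students, Ethan Perez, about whether we could ground it in something else. Maybe we could ground in other text instead of in images. The obvious source to ground in at the time was Wikipedia. So we would say this is true, or sort of true, and then you can understand language with respect to that ground truth.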

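That's the origin of RAG. Ethan and I were looking at this, and then we found out that some folks in London had been working on open-domain question answering, mainly Sebastian Riedel and Patrick Lewis. They had amazing early models in that space, and it was a really interesting problem: how can I get a generative model to work on any type of data and then answer questions on top of it?

So we joined forces. We were very lucky at the time, because the people behind Facebook AI Similarity Search (FAISS; I think that is what the name stands for), basically the first vector database, were right there. So we thought: we have to take the output of the vector database and give it to a generative model. This was before we called them language models. The language model can then generate an answer grounded in what you retrieved. And that is RAG. We always joke with the people who were on the original paper that we should have come up with a much better name. But somehow it stuck. This was by no means the only project doing this. People at Google were working on very similar things.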

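For example, REALM is a wonderful paper that came out around the same time. I think RAG caught on because the whole field was moving toward generative AI, and the G in RAG stands for generation. So we were the first to show that you could make this combination of a vector database and a generative model actually work.

There's an insight in here, which is that RAG was multimodal from the very beginning. You know, you started with image grounding and things like that, and it has been language-centric in the way people have applied it, but from the very beginning it was... Were you imagining back then that you would apply it with images?

We had some papers around that time. We did a paper with more applied folks at Facebook where we were looking at, I think it was called Extra, which was basically RAG on top of images. So yes, that feels like a long time ago, but that was always the idea, right? You can have arbitrary data that is not captured by the parameters of the generative model, and you can do retrieval over that arbitrary data to augment the generative model so that it can do its job. So it's all about the context you give it.

Well, this takes me back to another common critique of these early generative models: for all the amazing Q&A they were capable of, the knowledge cutoff was really striking. You had models in 2020 and 2021 that were not aware of COVID-19.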

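That was obviously very important to society. Was that part of the motivation? Was that part of how you could make these things fresher?

Yes, it was part of the original motivation, right? That is sort of what grounding is, the vision behind the original RAG project. And we did a lot of work after that on this question as well: can I have a very lightweight language model that basically has no knowledge? It is very good at reasoning and at speaking English, or any language, but it knows nothing. So it has to rely completely on another model, the retriever, which does most of the heavy lifting to make sure the language model has the right context, but the two really have separate responsibilities. Getting that to work turned out to be quite difficult, though. So at the time...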
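The division of labor described above (a retriever that finds context, and a generator that answers only from that context) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the original RAG implementation: the "embedding" is a bag-of-words count rather than a learned dense vector, the index is a plain list rather than FAISS, and `generate` is a placeholder where a language model call would go. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would use a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def generate(query: str, context: List[str]) -> str:
    """Placeholder for the language model: it only sees the retrieved context."""
    return f"Q: {query}\nContext: {' | '.join(context)}"


def rag(query: str, docs: List[str], k: int = 1) -> str:
    """Passive RAG: always retrieve, then generate from what was retrieved."""
    return generate(query, retrieve(query, docs, k))


docs = [
    "COVID-19 is a disease first identified in 2019.",
    "Wikipedia is a free online encyclopedia.",
    "Cats are small domesticated mammals.",
]
top = retrieve("What is COVID-19?", docs, k=1)
```

Because the knowledge lives in `docs` rather than in the generator, updating the document list is enough to change what the system can answer, which is the freshness argument made in the conversation.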

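Now we have RAG, and we still have this constellation of other techniques. We have training, we have tuning, and we have in-context learning. That was, I'm sure, very hard to navigate for research labs, let alone enterprises. In the conception of RAG, in its early implementations, what was in your head about how RAG would fit into that constellation? Was it meant to be standalone?

It's interesting, because the concept of in-context learning didn't really exist at the time. That really became a thing with GPT-3, I think, where they showed that it works. That is an amazing paper and an amazing proof point that it actually works, and I think it really unlocked a lot of possibilities. But in the original RAG paper we have a baseline, what we call the frozen baseline, where we don't do any training and we just give it as context.

That's in table six. We sort of showed that it doesn't really work, or at least that you can do a lot better if you optimize the parameters. Right?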

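So in-context learning is great, but you can probably always beat it through machine learning if you are able to do that. If you have access to the parameters, which is obviously not the case with a lot of these black-box frontier language models, and you can optimize them for the data you are working on or the problem you are solving, then at least theoretically you should always be able to do better. So I see a lot of false dichotomies around RAG. One I often hear is that it is either RAG or fine-tuning. That's wrong. You can fine-tune a RAG system, and then it gets even better. Another dichotomy I often hear is RAG or long context. Those are really about the same thing: RAG is a different way to solve the problem of having more information than you can put in the context. One solution is to try to grow the context, but that doesn't really work yet, even though people like to pretend it does. The other is to use information retrieval, which is pretty well established as a computer science research field, leverage all of that, and make sure the language model can do its job. I think things get oversimplified; you should be doing all of these things. You should be doing RAG, you should have as long a context window as you can get, and you should fine-tune. That's how you get the best performance.

What has happened since then, and we'll talk about how this is all being combined in more sophisticated ways today, is that in the past 18, 24, 36 months RAG has caught fire and has even become misunderstood as the single silver bullet. Why do you think it has been so seductive?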

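It's seductive because it's easy. Honestly, I think long context is even more seductive if you're lazy, right? Because then you don't even have to worry about retrieval anymore. You just put everything in there, and you pay a heavy price for having all of that data in the context. Every single time you answer a question about Harry Potter, you have to read the whole book in order to answer it, which is not great. So RAG is seductive, I think, because you need a way to get these language models to work on top of your data. In the old machine learning paradigm we would probably have done that in a more sophisticated way. But because these frontier models sit behind black-box APIs and we have no access to what they are actually doing, the only way to really make them work on your data is to augment them with retrieval. It's a function of what the ecosystem has looked like over the past two years, since ChatGPT.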

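We'll get to the part about how you need to move beyond a cool demo. But I think the power of a cool demo should not be underestimated, and RAG enables that. What are some of the aha moments that you see with enterprise executives?

Yeah, there are lots of aha moments. That's part of the joy of my job: you get to show what this can do, and sometimes what these models can do is just amazing. But the basic aha moments for us: accuracy is almost table stakes at this point. It's like, okay, you have some data, say one document, and you can probably answer a lot of questions about that document pretty well. It gets much harder when you have a million documents, or tens of millions, and they are all very complicated or contain very specific things. We have worked with Qualcomm, and their documents contain circuit design diagrams. It is much harder to make sense of that type of information. So the initial wow factor, at least for people using our platform, is that you can stand this up in a minute. I can build a state-of-the-art RAG agent in basically three clicks.

That time to value used to be very hard to achieve, right? You had your developers, and they had to think about the optimal chunking strategy for the documents, things you really don't want your developers thinking about, but they had to because the technology was so immature.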
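For readers unfamiliar with the term, a "chunking strategy" is the rule for splitting long documents into pieces small enough to index and retrieve. The sketch below shows one of the simplest such rules, fixed-size word windows with overlap. It is only an illustration of the kind of decision developers have had to make; it is not Contextual AI's approach, and the function name and default sizes are assumptions chosen for the example.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into windows of `chunk_size` words, where consecutive
    windows share `overlap` words so that a sentence cut at a boundary
    still appears whole in at least one chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


example = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
```

Even this toy version exposes two tunable parameters, and real documents add tables, diagrams, and headings on top. That tuning burden is what the speaker argues good platform defaults should remove.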

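So the next generation of these systems and platforms for building RAG agents is going to enable developers to think much more about business value and differentiation, essentially: how can I be better than my competitors because I've solved this problem so much better? Your chunking strategy should really not be important for solving that problem.

Well, if I now connect what we were just talking about to what you said now, the seduction of long context and RAG is that it's straightforward and easy. It plugs into my existing architecture. As a CTO, I have finite resources to implement new pieces of technology, let alone dig into concepts like chunking strategies and how the vector similarity for "non-dairy" will look similar to the vector similarity for "milk", things like this. Is it fair to say that CTOs want something coherent that works out of the box?

You would think so. And that is probably true for CTOs, CIOs, CAIOs, CDOs, and the folks who think about this at that level. But what we often find is that we talk to these people, and then they talk to their architects and their developers. And those developers love thinking about chunking strategies, because that's what it means in the modern era to be an AI engineer: being very good at prompt engineering, evaluation, and optimizing all the different parts of the RAG stack.

So I think it's very important to have the flexibility to play with these different strategies. But you need very, very good defaults, so that people don't have to do that unless they really want to squeeze out the final percent, and then they can. That's what we are trying to offer: you don't have to worry about all this basic stuff. You should be thinking about how to really use AI to deliver value. So it really is a journey. The maturity curve is very wide and very flat. Some companies are only just starting to figure out which use case to look at, while others have a full RAG platform that they built themselves, based on completely wrong assumptions about where the field is going, and now they are kind of stuck in that paradigm. It's all over the place, which means it's still very early in the market.

Can you walk me through a few milestones on that maturity curve, from the cool demo all the way to ninja-level results?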

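The timeline is basically this: 2023 was the year of the demo. ChatGPT had just happened. Everybody was playing with it, and there was a lot of experimental budget. Last year was about trying to productionize it, and you could probably get promoted in a large enterprise if you were the first one to ship GenAI into production. So there has been a lot of kneecapping of those solutions in order to be first into production.

This year, those first ones are past the post. I say past the post, but only in a limited way, because it's actually very hard to get the real thing past the post. So this year people are under a lot of pressure to deliver return on investment for all of those investments and all of the experimentation that has been happening. And it turns out that actually getting that ROI is a very different question. That is where you need a lot of deep expertise around the problem, but you also need better components than what exists in open source, where you can easily cobble together a Frankenstein RAG solution. That's great for the demo, but it doesn't scale.

How do customers think about ROI? How do they measure it and perceive it?

It really depends on the customer. Some are very sophisticated and really try to think through the metrics: how do I measure it, how do I prioritize it? I think a lot of consulting firms are trying to help with this as well, thinking through: okay, this use case is interesting, but it touches only 10 people. They are highly specialized, but we have another use case that has 10,000 people, who are maybe slightly less specialized, but the impact is greater. So it's a trade-off. My general stance on use case adoption is that I see a lot of people aiming too low. It's like: "Oh, we have AI running in production." "Oh, what do you have?" "Well, we have something that can tell us who our 401(k) provider is and how many vacation days I get." That's nice. Is that where you get the ROI of AI from? Obviously not. You need to move up in terms of complexity. Or, if you think of the org chart of the company, you want to go for the specialized roles where people have really hard problems. If you can make them 10% to 20% more effective at solving those problems, you can save the company tens or hundreds of millions of dollars just by making those people better at their job.

You're describing an equation that is roughly the complexity and sophistication of the work being done times the number of employees it affects.

Yes. There are roughly two categories of GenAI deployment, right? One is cost savings: I have lots of people doing one thing, and if I make all of them slightly more effective, I can save myself a lot of money. The other is more about business transformation and generating new revenue. That second one is obviously much harder to measure, and you need to really think through the metrics: what am I optimizing for here? As a result, you see many more production deployments in the former category, where it's just about cost savings.

What are some of the big misconceptions you see about what this technology can or cannot do?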

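I see some confusion around the gap between demo and production. The most common misconception we see is, "Oh yeah, this is great, I can easily do this myself." Then it turns out that everything breaks down after about 100 documents, and they have a million. That's the most common one. But I think there are other misconceptions, maybe around what RAG is good for and what it is not. What is a RAG problem and what is not a RAG problem? People don't have the same mental model that AI researchers like myself have. If I give them access to a RAG agent, often the first question they ask is, "What's in the data?"

That is not really a RAG problem, or it is a RAG problem on the metadata, not on the data itself, right? A RAG problem would be something like: what was Meta's R&D expense in Q4 of 2024, and how did it compare to the previous year? That's a specific question where you can extract the information and then reason over it and synthesize the different pieces.

A lot of the questions people like to ask are not RAG problems. Summarizing a document is another example. Summarization is not a RAG problem; ideally you want to put the whole document in the context and then summarize it. So different strategies work for different questions. The reason ChatGPT is such a great product is that they have abstracted away some of those decisions, but they are still happening under the surface. So I think people need to understand better what type of use case they have. If I'm a Qualcomm customer engineer and I need very specific answers to very specific questions, that is clearly a RAG problem. If I need to summarize a document, just put it in the context of a long-context model.

So now we have Contextual, which is an amalgamation of multiple techniques. You have what you call RAG 2.0, you have fine-tuning, and there are a lot of things happening under the covers that customers ideally don't have to worry about until they choose to. I expect that radically changes the conversation you have with an enterprise executive. So how do you describe the kinds of problems they should go find, apply this to, and prioritize?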

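Yeah, we often help people with use case discovery. That means really thinking through: what are the RAG problems, and what are maybe not really RAG problems? Then, for the RAG problems, how do you prioritize them? How do you define success? How do you come up with a proper test set so you can evaluate whether it actually works? And what is the process after that for doing what we call UAT, user acceptance testing: putting it in front of real people. That is the thing that really matters, right? Sometimes we see production deployments, and they are in production, and then I ask how many people use it and the answer is zero. During the initial UAT everything was great and everyone said, "Oh yeah, this is fantastic." But when your boss asks you a question and your job is on the line, you do it yourself. You don't ask the AI in that particular use case. That is a transformation that many companies still have to go through.

Do companies want support through that journey today, either directly from Contextual or from a solutions partner, to get these things implemented?

Yes. I think it's very tempting to pretend that AI products are mature enough to be fully self-serve and standalone. It's sort of decent if you do that, but to make it really great you just have to put in the work. We do that for our customers, or we can also work through systems integrators who do it for us.

I want to talk about two sides of the organization that you've had to build in order to bring all this to customers. One is scaling up the research and engineering function to keep pushing the envelope.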

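There are some very special things at Contextual: what you call RAG 2.0, and what you call active versus passive retrieval. Can you talk about some of the innovations inside Contextual and why they matter?

We really want to be a frontier company, but we don't want to train foundation models. I mean, that's obviously a very, very capital-intensive business, and I think language models are going to be commoditized. The really interesting question is how you build systems around these models that solve real problems.

Most of the business problems we encounter need a system to solve them. And then a lot of very exciting research questions come up around how you get that system to work well together. That's what we call RAG 2.0: for example, how do you jointly optimize these components so that they work well together?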

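But there are other things too, such as making sure your generations are very grounded. It's not a general-purpose language model. It's a language model trained specifically for RAG and only for RAG. It doesn't do creative writing. It can only talk about what's in the context. Similarly, when you build these production systems, you need a state-of-the-art reranker, and ideally that reranker can also follow instructions, so it's a smarter model.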

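So we're doing a lot of really innovative work on building RAG pipelines better, and on incorporating feedback into the RAG pipeline as well. We've done work on things like KTO and APO, which are different ways of incorporating human preferences into entire systems rather than just models. That takes a very special team, which we have, and I'm very proud of it.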

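Can you talk about active versus passive retrieval?

Yeah. Passive retrieval is basically old-school RAG. I get a query and I always retrieve. Then I give the results of that retrieval to the language model, and it generates. That doesn't really work well. Very often you need the language model to think first: where am I going to retrieve from, and how am I going to retrieve it? Is there a better way to search for what I'm looking for than just copy-pasting the query? So modern production RAG pipelines are already much more sophisticated than just a vector database and a language model. One of the interesting things you can do in the new paradigm of agents and test-time reasoning is decide for yourself whether you want to retrieve something at all. That's active retrieval. If you give me a query like, "Hi, how are you?", I don't have to retrieve anything to answer that, right? I can just say, "I'm doing well, how can I help you?" Then you ask me a question, and now I decide that I need to go and retrieve. But maybe I made a mistake with my initial retrieval, so then I need to think, "Oh, actually, maybe I should have gone over there." That kind of active retrieval is all getting unlocked now. This is what we call RAG agents. I think this really is the future, because agents are great, but we need a way to get them to work on your data, and that's where RAG comes in.

That implies two uses of, and two relationships between, Contextual and RAG on the one hand and agents on the other. One is supplying information to the agent so that it can perform well. But if I probe into what you said, active retrieval implies a certain kind of reasoning, maybe even longer reasoning, about: okay, what is the best source for the information I'm being asked to provide?

Yeah, exactly. I like to say that everything is contextual. That's very true for an enterprise, right? The context that the data exists in really matters for the agent's reasoning about finding the right information, and all of that comes together in these RAG agents.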
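The difference between passive and active retrieval described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Contextual's implementation: every function name, the greeting check, the two toy sources, and the "non-empty result means good enough" rule are hypothetical stand-ins for decisions that a language model would make in a real system.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], str]


def passive_rag(query: str, retrieve: Retriever, generate: Generator) -> str:
    """Old-school RAG: always retrieve with the raw query, then generate."""
    return generate(query, retrieve(query))


def active_rag(
    query: str,
    sources: Dict[str, Retriever],
    needs_retrieval: Callable[[str], bool],
    plan: Callable[[str, List[str]], List[str]],
    generate: Generator,
) -> str:
    """Active retrieval: decide whether to retrieve, then try sources in a planned order."""
    if not needs_retrieval(query):
        return generate(query, [])           # small talk needs no lookup
    for name in plan(query, list(sources)):  # ordered list of source names to try
        passages = sources[name](query)
        if passages:                         # toy check for "was this retrieval good enough?"
            return generate(query, passages)
    return generate(query, [])               # nothing found anywhere


# Hypothetical toy components; a real agent would use a model for each decision.
GREETINGS = {"hi", "hello", "hello, how are you?"}

def needs_retrieval(q: str) -> bool:
    return q.strip().lower() not in GREETINGS

def plan(q: str, names: List[str]) -> List[str]:
    return names  # naive plan: try sources in the order given

sources: Dict[str, Retriever] = {
    "wiki": lambda q: [],  # the first guess turns out to be the wrong place to look
    "finance": lambda q: ["R&D spend rose year over year."] if "spend" in q.lower() else [],
}

def generate(q: str, passages: List[str]) -> str:
    return f"{q} -> {passages[0]}" if passages else f"{q} -> (no context)"


print(active_rag("Hello, how are you?", sources, needs_retrieval, plan, generate))
print(active_rag("What was R&D spend?", sources, needs_retrieval, plan, generate))
```

In the second call the first source returns nothing, so the loop moves on to another source, which mirrors the "maybe I should have gone over there" step; the passive version has no such fallback because it retrieves once from a fixed place.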

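What's a really thorny problem that you'd like your team and the industry to try to solve in the next couple of years?

The most interesting problems I see in enterprises are at the intersection of structured and unstructured data. We have great companies working on unstructured data, and great companies working on structured data. But once you have the capability, which we're starting to have now, to reason over both of these very different data modalities using the same model, that unlocks a lot of cool use cases. I think that's going to happen this year or next: thinking about different data modalities and how you can use these agents to reason on top of all of them.

Will that happen under the hood with one common piece of infrastructure, or will it be a coherent single pane of glass over many different Lego bricks?

I think it should be one solution, and that's our platform.

Let's imagine that. But under the hood, would you do the work with many different components, each handling structured versus unstructured data?

They are different components, right? Despite what some people might like to pretend, I can always train a better text-to-SQL model if I specialize it for text-to-SQL than if I take a generic off-the-shelf language model and tell it to generate some SQL query. So specialization is always going to beat generalization for a specific problem, if you know what problem you're solving. The real question is more whether it's worth actually investing the money to do it. Specialization costs money, and it sometimes gets in the way of the economies of scale you might want.

If I look at the other side of the organization you've had to build: you've had to build a very sophisticated research function, but Contextual isn't a research lab, it's a company.

Mm-hmm.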

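So what are the other kinds of disciplines and capabilities you've had to build at Contextual that complement all of the research going on here?

Yeah. First of all, I think our researchers are really special in that we're not focused on publishing papers or on being too far out ahead. As a company, I don't think you can afford that until you're much bigger. If you're Zuckerberg, you can afford to have FAIR. The things I was working on at FAIR at the time, I was doing Wittgensteinian language games and all kinds of crazy stuff that, honestly, I would never let people do here. There's a place for that, but it's not a startup. The way we do research is that we look very closely at the customer problems that we think we can solve better than anybody else, and then we really focus on thinking about all of those problems from a systems perspective. How can we make sure we have the best system, and then make that system jointly optimized and really specialized, or specializable, for different use cases? That's what we can do.

That means there's a very fluid boundary between your research and applied research.

All of our research is applied research. But in AI right now, I think there's a very fine line between product and research; the research basically is the product. That's not just true for us. I think it's true for OpenAI, Anthropic, and all the companies like them, because the field is moving so quickly.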

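You have to productize research right away. As soon as it's ready, you don't even have time to write a paper anymore. You just have to ship it into the product very quickly, because the space moves so fast.

How do you allocate your research attention? Is there some element of play, even if it's only 5% or 10%?

The team would probably say it's not enough. But it's not zero. As a researcher, I think you always want to play more, but your time is limited. So yes, it's a trade-off. I don't think we have an official commitment; we don't have something like Google's 20% rule. It's more that we just try to solve cool problems as quickly as we can and hopefully have some impact on the world. So it's not about working in isolation, but about really trying to focus on the things that matter.

Yeah.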

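I think I hear you saying that even in a resource-constrained, fast-moving environment, it's not zero.

Every environment is resource-constrained. I think it's more that if you really want to do something special, you need to try new things. I think that's very different for AI companies, or AI-native companies like us. If you compare this generation of companies with SaaS companies, there it was like: okay, the whole LAMP stack and everything already existed, and you just had to go and implement it. That's not the case here. We're very much figuring out what we're doing, like flying the plane while we're building it, which I think is exciting.

What is it like to take the research you're doing out into the world now and have it make contact with enterprises? What has that been like for you personally, and for the company, going from a research-led company to a product company?

Yeah. I mean, that's been my personal journey as well, right? I started out doing a PhD. I was very much a pure researcher, and I slowly transitioned to where I am now. The key observation is that the research is the product. This is a special moment, and I don't think it will always be like this. I find it really fun, to be honest. I was on a podcast a while back, and they asked me, "What other job do you think would be interesting?" I said, "Maybe head of AI at JPMorgan." And they said, "Really?" Well, I think that at this particular point in time it's actually a very interesting job, because you have to think about how you're going to change this giant company to use this latest technology that, frankly, is going to change everything, right? It's going to change our whole society. So, yeah, for me it's really gratifying to talk to people like that and think about what the future of the world is going to look like.

I imagine there are people problems, organizational problems, and regulatory and domain constraints that go beyond the scope of a paper?

I would argue that those are the main problems still to be overcome. I don't care about AGI and all those discussions. I think the core technology is already here to cause massive economic disruption.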

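So all the building blocks are here. The question is more: how do we get the lawyers to understand this? How do we get the model risk management (MRM) people to figure out what an acceptable risk is? One thing we're very big on is thinking not about accuracy but about inaccuracy. If you have 98% accuracy, what do you do with the remaining 2% to make sure you can mitigate that risk?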
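To see why the remaining 2% matters, it can help to turn the percentage into a count. The sketch below is illustrative arithmetic only: the query volume, the idea of a review step, and its 90% catch rate are hypothetical assumptions, not figures from the conversation.

```python
def residual_errors(queries: int, accuracy: float, review_catch_rate: float = 0.0) -> float:
    """Expected number of wrong answers that still reach users after a review step.

    `review_catch_rate` is the assumed fraction of wrong answers that a mitigation
    (for example, human review of flagged responses) intercepts.
    """
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= review_catch_rate <= 1.0):
        raise ValueError("accuracy and review_catch_rate must be between 0 and 1")
    wrong = queries * (1.0 - accuracy)
    return wrong * (1.0 - review_catch_rate)


# Hypothetical volume: 10,000 queries at the 98% accuracy mentioned above.
unreviewed = residual_errors(10_000, 0.98)
reviewed = residual_errors(10_000, 0.98, review_catch_rate=0.9)

print(round(unreviewed), round(reviewed))
```

Under these assumptions about 200 wrong answers reach users with no mitigation and about 20 with the review step, so the practical question becomes what the mitigation is and how much of the error it actually catches.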

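So there's a lot of that happening right now. We need a lot of change management in these organizations. All of that goes beyond the research questions. I think we have all the pieces needed to completely disrupt the global economy right now. It's just a matter of execution, which is both scary and exciting.

You know, Douwe, you and I have talked many times about different types of founders and their capabilities. There's one lens that has stuck with me, which has three categories. A, there's the domain expert, who has expertise in something like revenue cycle management but might not be technical at all. B, there's someone who is technical and can write code but isn't a PhD researcher; Mark Zuckerberg is a very famous example. And then there are founders who do research, who have deep technical capability and a very advanced view of the frontier. What role do you think each of these types of founders plays in the next wave of companies that needs to be built?

Yeah, I think that's a very interesting question. I'd ask: how many PhDs does Zuckerberg have working for him? A lot, right? A lot. I don't think it really matters how deep your expertise is in a specific domain. As long as you're a good leader and a visionary, you can recruit the PhDs to work for you.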

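At the same time, obviously, if you're very deep in a field and that field happens to take off, that gives you an advantage, and that's what happened for me. I think I got very lucky with the timing. But overall, I think one underlying question you might be raising there is about AI wrapper companies, for example, right?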

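To what extent should companies go horizontal versus vertical using this technology? I think there's been a lot of disdain for these wrapper companies: "Oh, that's just a wrapper on OpenAI." Well, it turns out you can build an amazing business just from that, right? I think Cursor is Anthropic's biggest customer right now. I think it's fine to be a wrapper company as long as you have an amazing business. People should have much more respect for companies that build on top of fundamental new technology, discover whole new business problems that we didn't really know existed before, and then solve them better than anything else.

Okay, so I'm also really thinking about the comment you just made, that we have a lot of technology capable of a great deal of economic impact even today, without the new breakthroughs that, yes, we will also get. Does that change the kinds of companies that should be founded in the next few years?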

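I think so. I mean, I'm also learning a lot myself about how to be a good founder. But I think it's always good to plan for what's coming rather than for what's here right now. That's how you really ride the wave in the right way. And what's coming is that a lot of this stuff is going to get much more mature. One of the big problems we had even just two years ago was that AI infrastructure was very, very immature. Everything would break down all the time. There were bugs in the attention mechanism implementation of the frameworks we were using, really basic things like that. All of that has been solved now. With that maturity also comes the ability to scale much better and to think much more rigorously about things like cost-quality trade-offs. So there's a lot of business value right there.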

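What do new founders ask you? What kind of advice do they ask you for?

They often ask me about the wrapper company question, and about moats and differentiation. I think there's some fear that the incumbents are going to eat everything, since they obviously have amazing distribution. But I think there's still a huge opportunity for companies to be AI-native and to really think as an AI company from day one. If you do that right, and you play your cards right, you have a chance to be the next Google or Facebook or whatever.

What is some of the advice that you've received? I'll actually have you break it into two. What's advice you've gotten that you disagree with, and what do you think about that? And then what's advice you've gotten that you've taken a lot from?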

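Maybe we can start with the advice I really like, which is an observation about why Facebook was so successful: be fluid like water. Whatever the market is telling you, or your users are telling you, fit into that. Don't be too rigid about what's right and what's wrong. Be humble, I think, look at what the data tells you, and then try to optimize for that. When I got that advice, I didn't fully appreciate it. I'm starting to appreciate it much more now. Honestly, it took me too long to understand it.

As for advice I've gotten that I disagree with: it's very easy for people to say you should do one thing and do it well. Sure, maybe, but I'd like to be more ambitious than that. We could have been one small part of a RAG stack, and we would probably have been the best in the world at that particular thing. But then we would just be slotting into this ecosystem as one small piece. Ideally, I want the whole pie. That's why we've invested so much time in building this platform, making sure that all the individual components are state of the art and that they've been made to work together, so that you can actually solve the bigger problem. But yes, that's also a lot harder to do. So not everyone would advise me to go and solve that hard problem. But I think that, over time, as a company, that's where your moat comes from, right? Doing something that everybody else thinks is kind of crazy. So my advice to founders would be: go and do something that everybody else thinks is crazy.

You'd probably tell me that's reflected in the team you've brought on.

Yeah. I mean, the company is the team, especially the early team. We've been very lucky with the people who joined us early on. That's what the company is, right? It's the people.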

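If I zoom out a little, and then we can come back to the technology for a few minutes: there's a common question, or even misconception, that I hear about RAG: "Oh, this is the thing that's going to solve hallucinations." You and I have talked about this many times. Where's your head on that right now? What hallucinations are, what they aren't, whether RAG solves them, and what the outlook is there?

I think hallucination is not a very technical term.

Right.

We used to have a pretty good word for it: it was just accuracy. If you're inaccurate, if you're wrong, then I guess one way to explain that, or to anthropomorphize it, is to say, "Oh, the model hallucinated." I think it's a very ill-defined term, honestly. If I had to try to turn it into a technical definition, I would say that the language model's generation is not grounded in the context it was given, a context it was told is true. So basically, hallucination is about groundedness. If you have a model that really adheres to its context, it will hallucinate less.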

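But hallucination itself is arguably a feature of a general-purpose language model. It's not a bug, right? If you have a creative writing department or a marketing department doing content generation, I think hallucination is great, as long as you have a way to fix it; you probably have a human somewhere who checks it and rewrites some things. So hallucination is not necessarily a bad thing in itself. But if you have a RAG problem and you cannot afford to make mistakes, then it is a bad thing. That's why we have a grounded language model that has been trained specifically not to hallucinate, or to hallucinate less.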

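Another misconception I sometimes see is that people think these probabilistic systems can reach 100% accuracy. I think that's just a pipe dream. It's the same with people, right? If you look at a big bank, there are people in those banks, and people make mistakes too. So AI will as well...

SEC filings have mistakes too.

Exactly. The whole reason we have the SEC, and that this is a regulated market, is so that we have mechanisms built into the market: if a person makes a mistake, at least we have made reasonable efforts to mitigate the risk around that. It's the same with AI deployments. That's why I talk about how to mitigate the risk of inaccuracy. We're not going to get to 100% perfect. So you need to think about the 2%, 3%, 5%, or 10%, depending on how hard the use case is, where you may still be imperfect. How do you deal with that?

What are some things you might have believed a year ago about AI adoption or AI capabilities that you see very differently today?

A lot of things. I think the main thing that turned out not to be true is that I thought this would be easy.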

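What is "this"?

Building a company and solving real problems with AI. I think we were very naive, especially at the beginning of the company. "Oh yeah, we just need a research cluster, get a bunch of GPUs in there, we train some models, and it'll be great." Then it turns out that getting a working GPU cluster is actually very hard. And then it turns out that for training on that GPU cluster in a way that actually works, if you use other people's code, that code may just not be good enough. So if you want to make sure it's really, really good, you now have to build your own framework for what you're doing.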

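So we had to do a lot of plumbing that we really didn't expect to have to do. Now I'm very happy that we did all that work, but at the time it was very frustrating.

What's something that we, or you and I, or we as an industry, don't talk about enough?

Evaluation. I've done a lot of work on evaluation throughout my research career. Things like Dynabench, which was really about how we might hopefully get rid of benchmarks altogether and have a more dynamic way of measuring model performance.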

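But evaluation is very boring. People don't seem to care about it. I care about it deeply, so that always surprises me. We had, I think, an amazing launch around LMUnit, which is natural language unit testing. You have a response from a language model, and now you want to check very specific things about that response: did it contain this? Did it avoid making that mistake? Ideally, as a person, you can write unit tests for what a good response looks like.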
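The idea of writing unit tests for a model response can be sketched as a small harness. This is not the LMUnit API: the `run_unit_tests` function, the `mentions:` / `avoids:` convention, and the keyword-matching judge are all hypothetical placeholders. In the approach described here a trained model decides whether each natural-language test passes; the keyword judge below only stands in for that model so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List

# A judge takes (unit test text, model response) and returns pass/fail.
Judge = Callable[[str, str], bool]


@dataclass
class UnitTestResult:
    test: str
    passed: bool


def run_unit_tests(response: str, tests: List[str], judge: Judge) -> List[UnitTestResult]:
    """Score one response against every unit test using the supplied judge."""
    return [UnitTestResult(test, judge(test, response)) for test in tests]


def keyword_judge(test: str, response: str) -> bool:
    """Placeholder judge (hypothetical): 'mentions: X' passes if X appears,
    'avoids: X' passes if X does not appear. A real judge would be a model."""
    kind, _, needle = test.partition(":")
    found = needle.strip().lower() in response.lower()
    return found if kind.strip() == "mentions" else not found


response = "R&D spending grew compared with the prior year."
tests = [
    "mentions: prior year",  # did it contain the comparison?
    "avoids: guaranteed",    # did it avoid an overclaim?
    "mentions: 2019",        # a check this response does not satisfy
]

results = run_unit_tests(response, tests, keyword_judge)
for r in results:
    print("PASS" if r.passed else "FAIL", r.test)
```

Keeping the judge as a swappable function means the test definitions, which encode what a good answer looks like for a given organization, stay separate from whatever model or rule ends up scoring them.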

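So you can do that with our approach. We have a model that is by far the state of the art at verifying whether those unit tests pass or fail. I think this is awesome. I love talking about it, but people don't seem to care. It's like, "Oh yeah, eval. Yeah, we have a spreadsheet somewhere with 10 examples." How is that possible? This is such an important problem. When you deploy AI, you need to know whether it actually works. You need to know where it falls short, you need to have confidence in your deployment, you need to think about the things that might go wrong, all of that. So it has been very surprising to me how immature a lot of companies are when it comes to evaluation, and that includes very large companies.

Yeah. You know, Garry Tan posted on social media a while ago that evals are the secret weapon of the strongest AI application companies.

And of AI research companies too, by the way. Part of why OpenAI and Anthropic are so good is that they're also extremely good at evaluation. They know exactly what good looks like. That's also why we do all of this in-house. We don't outsource evaluation to someone else. If you're an AI company and AI is your product, then you can only assess the quality of your product through evaluation. So it's core to all of these companies.

So for whoever is lucky enough to land that cool job you would have in another life, head of AI at JPMorgan: what is JPMorgan's own IP here, and what does evaluation really need to look like? Or is that something they can ultimately ask Contextual to take care of for them?

No. I think they can use our tools for doing evaluation.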

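But the actual expertise that goes into the evaluation, the unit tests, is something they should write themselves, right? We talked about how a company is its people, but in the limit that might not even be true, right? Because it might be mostly AI with only a few people. So what makes a company a company? It's its data, the expertise around that data, and its institutional knowledge. That is really what defines a company, and it should be reflected in how you evaluate the systems you deploy in that company.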

Maybe we can leave it there. Douwe Kiela, thank you so much. This was fun. Thank you.
