He felt the timing was right for commercialization as AI technologies had matured, making it easier to apply AI to industry with foundation models. The process of applying AI had become much simpler, requiring only prompt tuning and retrieval-augmented generation (RAG) on top of pre-trained models.
The quality of the retrieval part is the main bottleneck. If the retrieved documents are relevant, the large language model can synthesize good answers, but poor retrieval quality significantly impacts the response quality.
RAG is much cheaper than long-context transformers because the latter requires storing all intermediate computations for large contexts, which can be prohibitively expensive. RAG, being a hierarchical system, is more cost-efficient as it retrieves only relevant information for each query.
There are two ways to improve retrieval quality. One is to improve the neural networks, such as embedding models and re-rankers, which requires heavy, data-driven training. The other is to improve the software engineering around them: better data chunking, iterative retrieval, and incorporating metadata.
Domain-specific fine-tuning allows embedding models to excel in particular domains by customizing the limited number of parameters to focus on specific tasks. This can lead to improvements of 5% to 20% in retrieval quality, depending on the domain and the amount of data available.
He suggests starting with a prototype and immediately profiling both latency and retrieval quality. If retrieval quality is the bottleneck, companies should consider swapping components like embedding models or re-rankers to improve performance.
He predicts that RAG systems will become simpler, with fewer components and less need for complex software engineering. Embedding models will handle multi-modality and data formats more effectively, reducing the need for manual preprocessing.
He believes academia should focus on long-term innovations and research questions that industry may not prioritize due to short-term incentives. This includes working on efficiency improvements and challenging reasoning tasks that require innovative approaches.
Welcome to No Priors. Today we're talking to Tengyu Ma, assistant professor of computer science at Stanford and the co-founder and CEO of Voyage. Voyage trains state-of-the-art components for next-generation retrieval systems, including embedding models and re-rankers. We're really excited to talk about his research and the RAG debate today. Welcome, Tengyu. Yeah, thanks so much. Thanks for having me here.
We're looking forward to the debate. Yeah. Why don't we start with just a little bit of an overview of your research agenda to date? Because I think uniquely it covers a broad range of fields within and around deep learning from theory to RL to embeddings and optimizers. So can you talk a little bit about sort of how you pick the directions you have?
Yeah, so I think most of the papers I wrote have some theoretical thinking in it. I guess maybe that's the commonality. And besides that, I think I worked on quite a few topics, as you mentioned, ranging from the theoretical understanding, mathematical proofs of
deep learning systems, all the way to practical large language models, reinforcement learning, deep reinforcement learning. And these days, what we are working on is more centered on the efficiency of training large language models and on improving reasoning for large language models.
So my vision is that in the future, efficiency is very important because we are running out of data and compute, so we have to use the data much better and use the compute much better. Reasoning also seems to be a pretty important direction, and applications too.
It's also, in some sense, a risky direction, because we don't know exactly how fast we can solve those challenging reasoning questions yet. Can you mention a few of the key papers or work that you or students in your lab have done, just so our listeners can look them up?
In the very early days, I worked on optimization for matrix completion. That's like 10 years ago. And then I moved on to embedding models, like sentence embeddings, vector embeddings. One of the papers we wrote is actually a very simple paper where we average the word embeddings to get sentence embeddings, and then we did some transformations using PCA to make the performance much better. That was even before the Transformer came out.
And then I moved on to transformers, large language models, and contrastive learning, which is the new way of training embedding models. That direction started with some of the papers on using contrastive learning for images, and we worked on improving those methods and understanding why contrastive learning works.
And recently, we have worked on optimizers for large language models. For example, one of the papers we wrote last year was Sophia, where we introduced a new optimizer that can improve training efficiency by 2x for pre-training. This is great. Adam is very old at this point.
Yeah, it's 10 years old now. I think that's the interesting part about it. So optimizers, you know, people have tried so many times in the last 10 years; there were so many papers published with improvements over Adam in various cases, but so far Adam is still the default algorithm for training large language models. And that's why we thought it was the time to
really invest in this. We spent a lot of time on it. I think I started probably around 2018, 2019, and I asked a few students to work on this. And finally we had one paper out after a few years, after a few failed projects and failed ideas. And recently, some friends at Facebook actually used this in their large-scale multi-modal training. I don't know exactly how many parameters they have, but I assume it's more than a hundred billion parameters. They found that on that scale, there is a 1.6x improvement in the efficiency of the training. So that's like $10 million versus $16 million.
That's super exciting. Yeah, I think, you know, Sophia has an opportunity to be really, really impactful. You started a company last year, taking leave from Stanford. Given your work has been theoretical, but with connections to
practical applications, what drove you to do that? I think I came to Stanford partly because there's a very strong industry connection here compared to some of the other universities. And also, entrepreneurship is probably just part of my career plan anyways. And in terms of the timing, I felt that this was the right timing in the sense that
the technologies are more and more mature, so it seemed that now was the right time for commercialization. For example, one story I have is that I looked up some of my lecture notes
for Stanford CS229 from seven years ago, when I started to teach at Stanford. At that point, we had a lecture with Chris Ré on applied machine learning: how do you apply machine learning in industry?
And there were seven steps there. The first step is you define your problem, the second step is you collect your data, then you choose the loss function, you train, you iterate, and so on and so forth. So it was pretty complicated at that point. Now foundation models have come to power, and in the new foundation model era, the only thing you have to do is, you know, someone will train the
foundation model for you, and then you tune a prompt and you add retrieval-augmented generation on top of it, and that's pretty much it. So applying machine learning and AI in an industry environment is much, much easier than seven years ago. And that's why I felt that this is probably the right time to commercialize many of these technologies, because the technologies are more mature.
Yeah, this is actually a core premise even for the investing fund that I started, Conviction: that, you know, somebody's doing the bulk of the work for you in a more general way. And so the application of AI in industry is just much, much cheaper, right? Because you only do the last few steps, or a different set of steps, but the last few steps in essence. So maybe you can talk about, given your wide range of research, the problem you focused on with Voyage
that you saw with customers? Yeah, so with Voyage, we are mostly building these two components, re-rankers and embedding models, for improving the quality of the retrieval or search system.
So the reason why we focus on this is because we talked to so many customers and we found that right now, implementing RAG is not very hard. You can just connect the components and have your RAG system ready very quickly. But the bottleneck seems to be the quality of the response. And the quality of the response is
heavily affected, almost bottlenecked, by the quality of the retrieval part. If the large language model sees very relevant documents, then it can synthesize very good answers. Even Llama 70B can do that very well. Can you just give a general intuition for what a RAG system is and some of the applications of it?
Yeah, so just a little bit of background. Retrieval-augmented generation: the idea is that there's a retrieval step and there's a generation step. The main point here is that if you just use a large language model as a black box, as is, then the large language model wouldn't know anything about the proprietary information inside the company. And it doesn't know enough context about the use cases.
And the retrieval-augmented generation stack is about first retrieving some knowledge from, for example, inside a company, and then giving that knowledge to the large language model so that the large language model can generate or synthesize a good answer without hallucination. This has been found to be very, very useful in reducing the hallucination rate.
And so there are two steps. The first step is to retrieve some relevant information given the query, and then this relevant information is given to the large language model. The retrieval step is important because once the large language model sees the relevant information, it can reduce the hallucination rate dramatically, because it uses the relevant information as an anchor to refine the answers, in some sense.
And what we are doing here is that we want to improve the quality of the retrieval, the relevancy or accuracy of the retrieved documents and information. And the way to do this is that there are two steps. The first step is that you vectorize all of your documents, all of your knowledge base. So you turn the documents into vectors, you turn the videos into vectors, you turn your code into vectors. Code into vectors, everything into vectors.
And so the vectors are the representations of each piece of the knowledge or the documents. And then you put these vectors into a vector database and you search for the relevant information using the vectors as indices.
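To make the two steps concrete, here is a minimal sketch of the flow just described: vectorize the knowledge base offline, then at query time embed the query, find the nearest document vectors, and hand them to the language model. The `embed` and `generate` functions are hypothetical placeholders for whichever embedding model and LLM you use, not any specific vendor's API.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Hypothetical placeholder: call your embedding model; return unit-norm vectors.
    raise NotImplementedError

def generate(prompt: str) -> str:
    # Hypothetical placeholder: call your large language model.
    raise NotImplementedError

def build_index(documents: list[str]) -> np.ndarray:
    # 1) Offline: vectorize the whole knowledge base once; the vectors act as the index.
    return embed(documents)                         # shape: (num_docs, dim)

def answer(query: str, documents: list[str], doc_vectors: np.ndarray, k: int = 5) -> str:
    # 2) Online: vectorize the query, retrieve the nearest documents, ground the answer.
    query_vector = embed([query])[0]                # shape: (dim,)
    scores = doc_vectors @ query_vector             # cosine similarity for unit-norm vectors
    top_docs = [documents[i] for i in np.argsort(-scores)[:k]]
    prompt = ("Answer the question using only this context:\n"
              + "\n---\n".join(top_docs)
              + f"\n\nQuestion: {query}")
    return generate(prompt)
```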
Where are you seeing RAG applications today? Like, what are customers building, or what are the most common systems? Yeah, so we have a lot of users and they are all over the place. We even have a customer that is a chemistry company, building a RAG system to understand their chemistry documents and product descriptions.
And I think it's almost everywhere: finance, legal, code retrieval, code generation, and so on and so forth. I think it can be applied to almost any case, and even to individual users, where you have a lot of personal information
and you want to have a RAG system on your phone so that you can access your past information much more easily. For example, we have all seen that when you search your documents on your laptop, it's actually pretty hard; you have to use the exact file name. It would be much easier if this search could be semantic.
RAG is a relatively new architecture. I think your average enterprise technology leader had not heard the term before the last year or so, and it became popularized in research over the last few years. But there is already a debate, in terms of opinions from people at different large labs and in academia, about whether or not
you need a RAG architecture to work on proprietary data. And just to describe some of the alternative views, I think there are two alternative points of view given. One is a sort of agent chaining architecture, where you are inputting your data and knowledge, you know, chemistry, code, law, finance, whatever documents are involved,
into a series of LLMs that just operate with instruction on it, for example, to summarize or categorize it. Or you simply feed everything into LLMs with infinite context or actively managed context versus explicitly vectorizing anything. And so I would love to get your reaction to that as an alternative to RAG. Actually, there was also a debate last year about RAG versus fine-tuning. Yeah.
And I think that debate is kind of getting to a consensus now. It sounds like RAG is much easier than fine-tuning, and fine-tuning in many cases doesn't work because you need a lot of data to see results, and there are still hallucinations even after fine-tuning. And now, as you said, the debate becomes RAG versus agent chaining or long context. So maybe let's talk about long context first.
So I think there are probably two answers to this, from different angles, because long context right now is not practical yet. So we have to anticipate what long-context transformers can do and then have the debate at a future time, in some sense. In the near term, I think the long-context transformer approach, where you just put all the proprietary data, say 1 billion tokens, into the context of the transformer,
will be very, very expensive, right? If you use the prices right now, it's going to be just impossible to do it. It's probably like five or ten orders of magnitude of difference, depending on how many documents you have in the context.
Of course, you can bring the cost down. For example, one approach is to cache the activations of all the internal computations over the documents you put in the context. That will bring the cost down by a lot. But I think still, if you do the calculation, theoretically it's still much more expensive than RAG.
So I think that's the more practical answer: in terms of cost, it's going to be much more expensive than RAG, because you have to save all of these activations, or intermediate computations, most likely in GPU memory, or maybe in CPU memory, for the whole 1-billion-token context.
You know, you may argue that, OK, over time, everything will become cheaper and cheaper, but RAG will be cheaper as well, right? Because many of the technologies under RAG are neural network based and the GPUs will become cheaper, the neural networks will become smaller. So my prediction is that RAG will be much cheaper than long context going forward.
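As a rough illustration of the gap he is describing, here is a back-of-envelope comparison of the memory needed to cache activations (the KV cache) for a 1-billion-token context versus the size of an embedding index over the same corpus. The model and embedding dimensions are assumptions for a hypothetical 70B-class transformer and a typical embedding model, not figures from the conversation.

```python
# Back-of-envelope: KV-cache size for a 1B-token context vs. a RAG embedding index.
# All dimensions below are assumptions for a hypothetical 70B-class transformer
# with grouped-query attention and a typical embedding model.

layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # keys + values
context_tokens = 1_000_000_000
print(f"KV cache for 1B tokens: ~{kv_bytes_per_token * context_tokens / 1e12:.0f} TB")

# The same corpus as a RAG index: one vector per ~512-token chunk.
chunk_tokens, embed_dim, bytes_fp32 = 512, 1024, 4
num_chunks = context_tokens // chunk_tokens
print(f"Embedding index for the same corpus: ~{num_chunks * embed_dim * bytes_fp32 / 1e9:.0f} GB")
```

Under these assumptions the cache runs to hundreds of terabytes while the embedding index is a few gigabytes, which is the kind of orders-of-magnitude gap described above.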
And another way to think about this is from first principles. My analogy is that the context is, in some sense, the short-term memory, and RAG is more like long-term memory. So the question is: when you answer a question, why would you have to go through the entire library every time
and put the entire library in your short-term memory to answer a single question? It sounds like the right approach should be that for every single question, you retrieve some subset of the information and use that to answer the question. That seems to be the most efficient way to do it.
There should be some kind of hierarchy in how we solve the problem so that we can get the best efficiency. Even in computer architecture, the hardware stuff, you have different levels of caching, right? You have disk, you have CPU caches, and so forth. So in that sense, I feel like a more hierarchical, two-level system like RAG is more cost-efficient. Yeah.
Yeah. I mean, the analogy certainly makes sense. I think there is another thread of discussion of like, what does long-term memory for LLMs look like where, you know, it is something managed by the LLM itself. But I do not think that is a well-answered question. And like RAG may just be a
part of that answer? So the embedding model and the re-ranker are, in some sense, the large language models that are managing the long-term memory. Of course, there might be variants and other ways to manage the long-term memory, but I think it will be somewhat similar. It's going to be more like,
you know, the technology always evolves, right? Gradually, right? So maybe two years later, Voyage or maybe other companies will have a new version of the long-term memory, which is based on, you know, embedding models, but, you know, kind of like extending the embedding model in some way. That's entirely possible.
Yeah, I do think it's useful to sort of contextualize for people who are not working with sort of data sources for LLMs at scale every day, like what sort of token limitations are, right? You know, we go from a few thousand tokens to something like,
Gemini 1.5 Pro's context window of a million tokens, right? And if you think of that in word count, that's maybe five books, or like 25,000 to 30,000 lines of code.
And obviously a limited amount of video and audio. And so I think the ability to make reasoning decisions on more than that amount of data is obviously going to be needed. And the questions to me are really: does efficiency matter, both from a cost perspective and a speed, latency perspective? Yeah.
Right. And how much can you push the context window? And does hallucination management matter? And so I think there are lots of arguments for RAG being very persistent here.
Yeah, yeah, exactly. And just to add a little bit on that. So 1 million tokens is 5 books, right? But many companies have 100 million tokens. That's a 100x difference, right? And 100x for cost is a big difference. That could be, you know, $100K versus like $10 million, right? $10 million is
unacceptable, but $100K sounds okay. Yeah, I think that's probably what's going to happen, at least for many of the companies. Right now, if they have 100 million tokens, I don't think they can use long-context transformers at all, because it's way too expensive. Yeah.
Right. And the simplest example for me is actually a system that can look at the entire code base, or some representation of the entire code base, versus the portion of it that could fit into context today. What about the other piece, the idea of agent chaining and using LLMs to manage the data in that form?
So agent chaining, this is a growing area and many people are doing research on it. I think it's a little bit less well defined, in some sense. At the first level, I would say that it's kind of orthogonal to embedding models and re-rankers to some degree, because even when you have agent chaining, you still probably use embedding models
as part of the chain, right? You probably do iterative retrieval as part of the chain. And of course, you use large language models as part of the chain as well. In some sense, it's an orthogonal direction. So I probably would rephrase agent chaining as more like an iterative, multi-step,
retrieval-plus-large-language-model augmented system. Some part of the retrieval is probably done by a large language model, some part of the system is done by a small large language model, and some part of the system is done by an embedding model, and so on and so forth. So in that sense, I feel like it's somewhat orthogonal.
Yeah, and I feel like some of the motivation for agent chaining to begin with is the same efficiency motivation as RAG. Yeah, exactly. But if you use a very, very large language model to manage the system, the knowledge system, I think you again lose the efficiency, right? So it has to be a somewhat smaller model to manage
the knowledge. And at that point, an embedding model might be the right thing to use in that agent chaining framework. Maybe another angle to look at this is whether we should do iterative retrieval versus retrieving just once. I think iterative retrieval is definitely useful, especially because there is still a lot of headroom in embedding models' performance. That's why sometimes you have to retrieve multiple times: because the models are not clever enough.
However, in the long run, my suspicion is that iterative retrieval will be useful, but it will be a bit less useful if the embedding models become more and more clever. So once the embedding models are more clever, then maybe one round or two rounds is going to be enough.
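For concreteness, here is a rough sketch of what iterative retrieval can look like: retrieve, let a model judge whether the evidence is sufficient, and if not, rewrite the query and retrieve again. The three helper functions are hypothetical placeholders, not a specific framework's API.

```python
# Sketch of iterative (multi-step) retrieval. The helpers below are hypothetical
# placeholders: retrieve() wraps the embedding model + vector search (and possibly
# a re-ranker), and the other two are small LLM calls.

def retrieve(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError("embedding + vector-database (+ re-ranker) search")

def evidence_is_sufficient(question: str, evidence: list[str]) -> bool:
    raise NotImplementedError("LLM judge: is this enough to answer the question?")

def rewrite_query(question: str, evidence: list[str]) -> str:
    raise NotImplementedError("LLM call: what is still missing? Produce a new query.")

def iterative_retrieve(question: str, max_rounds: int = 3) -> list[str]:
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence += retrieve(query)
        if evidence_is_sufficient(question, evidence):
            break                      # one or two rounds may be enough with a strong retriever
        query = rewrite_query(question, evidence)
    return evidence
```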
If we go ahead and just assume that RAG is at least a dominant architecture for enterprise use cases, where you care about proprietary data at scale and about reliability, how do you go about improving a RAG system? You can improve the LLM itself, but what are the other components that you guys are working on, or what are the challenges from the user's, the builder's, perspective in improving retrieval quality? Yeah.
Yeah, so I guess there are a few ways, right? One way is that you improve the prompting of the large language models. So, for example, you could tell the large language models to abstain if there's no relevant information in the retrieved documents. But because the large language models are so good these days, I think you don't need a lot of prompting anymore. It just responds to the instructions so well.
And then the next thing is to improve the retrieval part, which is the bottleneck in my opinion, because most of our users found that if they improve the retrieval quality, that directly affects the response quality. And for improving the retrieval part, I think there are two ways. One way is you improve the embedding model. The other is that you improve some of the other things on top of that: for example, how you chunk the data, whether you do iterative
retrieval, whether you put some of the metadata into the data, and so on and so forth. So basically I would say there are two ways of improving. One way is you improve the neural networks, either the embedding models or the re-rankers, or you improve the ways you use those networks with software engineering: better chunking, iterations, or other kinds of heuristics or tricks on top of that.
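To illustrate the software-engineering side he mentions, here is a small sketch of naive fixed-size chunking with overlap, with metadata attached to each chunk so it can be filtered or surfaced alongside the text. The field names and sizes are illustrative assumptions, not a prescribed scheme.

```python
# Sketch of the software-engineering layer: fixed-size chunking with overlap plus
# per-chunk metadata. Field names and sizes here are illustrative, not a standard.

def chunk_document(text: str, source: str, chunk_words: int = 400, overlap: int = 50) -> list[dict]:
    words = text.split()               # crude stand-in for a real tokenizer
    chunks = []
    step = chunk_words - overlap
    for start in range(0, max(len(words), 1), step):
        piece = " ".join(words[start:start + chunk_words])
        if not piece:
            break
        chunks.append({
            "text": piece,
            "metadata": {"source": source, "start_word": start},  # e.g. filename, section, date
        })
    return chunks
```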
So what we specialize in is improving the networks, because that requires a lot of heavy lifting. It's a very data-driven approach: we train our networks on trillions of tokens at least, and we fine-tune them for special use cases. And this is something that a company like ours should probably do, instead of every end user optimizing it themselves.
And my long-term vision here is that some of the software engineering layers on top of the networks will be less and less needed as the networks get more and more clever. For example, right now we already see that chunking becomes less needed, because the context window becomes longer and longer and we have long-context, or
relatively long-context, embedding models. Long context here means like 10K tokens, for example, maybe 16K, so that you can put a 50-page PDF into it. Because these long-context embedding models have become much better, there's less of a need to chunk the documents into pieces of like 512 tokens.
And I think this will happen in other dimensions as well. So maybe in the future, you won't have to turn your images into text descriptions and then give those to the text embedding model. That's what people are doing right now: everything is turned into text, and they use a text embedding model. But when the embedding models are more clever and multi-modal, you won't have to do that anymore.
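As a concrete picture of the workaround he describes, here is a sketch of the text-only path: caption each image with a vision-language model, then embed the caption. Both helper calls are hypothetical placeholders; a natively multi-modal embedding model would replace the two steps with one.

```python
# The current workaround: turn images into text descriptions, then embed the text.
# caption_image() and embed_text() are hypothetical placeholders for a vision-language
# model and a text embedding model; a multi-modal embedding model would embed the
# image directly and skip the lossy captioning step.

def caption_image(image_path: str) -> str:
    raise NotImplementedError("vision-language model call goes here")

def embed_text(text: str) -> list[float]:
    raise NotImplementedError("text embedding model call goes here")

def index_image(image_path: str) -> dict:
    description = caption_image(image_path)   # lossy: only what the caption mentions survives
    return {"source": image_path, "text": description, "vector": embed_text(description)}
```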
Can you talk a little bit about the intuition for how fine-tuning or domain-specific embeddings improve performance? Yeah, fine-tuning and domain-specific embedding models are what we are very good at at Voyage. Just to give some context: what we do is start with a general-purpose base embedding model, which we also train from scratch.
And from there, we first fine-tune, or continue to pre-train, whatever you call it, on some domain-specific data. So, for example, we fine-tune on two trillion tokens of code snippets.
That's how we get the code embedding model, and we do the fine-tuning on one trillion legal tokens, and that's how we got the legal embedding model. For these domain-specific embedding models, we didn't use any proprietary data, so everyone can use them, but they really excel in one particular domain, and the performance in other domains doesn't change much. And the reason why we do this is because the number of parameters in the embedding model is limited.
Because you only have a latency budget, something like maybe one second, sometimes 200 milliseconds, and some people even want 50 milliseconds, it's basically impossible to use more than 10 billion parameters for embedding models. So we have a limited number of parameters, and customization is very important, because customization means that you use that limited number of parameters on the right tasks, in
the right domain, so that you excel in that domain. There's no way you can use these 10 billion parameters to excel at everything. That's why you have to specialize in one domain. And we have seen 5% to 20% improvements from this domain-specific
fine-tuning, depending on the particular domain. For code, we have seen 15% to 20% improvement, partly because we have a lot of data there and the headroom is also bigger, because code retrieval requires a lot of deep understanding of the algorithmic part of the code.
And for the legal domain, the baseline is a little better, so the headroom is slightly smaller. That's why we see 5% to 15% improvement, depending on the datasets. For some of the very complex legal datasets, we have seen bigger improvements. Just to make sure that our listeners can picture exactly where the latency cost is coming from here: in a search system, your data has been vectorized by an embedding model, but then every query
also needs to be translated into an embedding and then compared to the embeddings of your knowledge, in order to feed the LLM for the generation that you want, right? And so there's inference-time latency here as well. I just think that's not obvious if somebody hasn't built a RAG system.
Yeah, exactly, exactly. So basically, at inference time, you have to first turn the query into vectors and then do the search with the vector database. And actually, related to this, the dimension of the vectors you produce also affects the latency of the vector-based search. If the dimension of the
embedding is like 100, only 100, then it's going to be much, much faster than when the dimension of the embeddings is 1,000. And actually this is something we are very good at as well: we produce embeddings that have 3x or 4x smaller dimensions than some of the competitors.
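A quick, runnable way to see the dimension effect is to time brute-force similarity search over the same number of synthetic vectors at two different dimensions; the corpus size and dimensions below are arbitrary assumptions for illustration.

```python
# Illustration: query-time cost of brute-force similarity search grows with the
# embedding dimension. Vectors here are synthetic; sizes are arbitrary examples.
import time
import numpy as np

num_docs = 100_000
for dim in (1024, 256):
    rng = np.random.default_rng(0)
    docs = rng.standard_normal((num_docs, dim), dtype=np.float32)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)        # unit-norm rows
    query = rng.standard_normal(dim, dtype=np.float32)

    start = time.perf_counter()
    scores = docs @ query                                      # cosine similarity
    top10 = np.argpartition(-scores, 10)[:10]                  # top-10 nearest chunks
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"dim={dim}: searched {num_docs} vectors in {elapsed_ms:.1f} ms")
```

Real systems typically use approximate nearest-neighbor indexes rather than brute force, but the vector dimension still drives memory use and query latency in roughly the same way.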
Yep, that makes sense. I mean, intuitively, you are creating embedding models that use a limited number of parameters and dimensions, given the latency budget that any application has, to create the best possible representation of proprietary or domain-specific data.
Yeah, exactly. And going back to the domain specificity and fine tuning. So the second level of customization is that we can customize to a particular company, right? So we fine tune on the proprietary data of a particular company and we can see
10 to 20% improvement on top of the domain-specific fine-tuning as well. So of course, there's a total budget in terms of how much additive improvements you have there, right? So if you start with like 50% accuracy, then you only have 50% headroom. But if you start with 90%, you only have 10% headroom. So the improvement, the absolute improvement varies a little bit across the domains.
Maybe just some advice for people who are building RAG systems: at what point do they begin to invest in some of these retrieval components?
Yeah, I think they can do it even from day one, as long as they have a prototype available. My default suggestion for our users is that when they have the RAG system, first of all, of course, you want to connect the components and at least see some responses. Then do some basic profiling of the latency and the quality, so you can
check the retrieval quality, meaning how often you retrieve relevant documents. There are some standard ways to evaluate retrieval quality. And then you also do the end-to-end evaluation of the responses, and then you can see which part is the bottleneck. In many cases, people find that the retrieval quality is not good, so the final response is not good.
And then you can swap some of the components. You can say, I'm going to try the Voyage embeddings, I can try the Voyage re-rankers, which we haven't discussed much. And you can try various different embeddings, and possibly various different large language models as well.
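The retrieval profiling he describes usually boils down to a small labeled set of queries with known relevant documents and a metric like recall@k; here is a minimal sketch, where `search` is a hypothetical wrapper around your embedding model and vector database.

```python
# Basic retrieval profiling: over a small labeled set of (query -> relevant doc ids),
# measure how often a relevant document appears in the top-k results.
# search() is a hypothetical wrapper around your embedding model + vector database.

def search(query: str, k: int) -> list[str]:
    raise NotImplementedError("embed the query and query the vector database here")

def recall_at_k(labeled_queries: dict[str, set[str]], k: int = 5) -> float:
    hits = 0
    for query, relevant_ids in labeled_queries.items():
        retrieved = set(search(query, k))
        if retrieved & relevant_ids:       # count a hit if any relevant doc was retrieved
            hits += 1
    return hits / len(labeled_queries)

# Re-run the same evaluation after swapping the embedding model or adding a re-ranker
# to see whether retrieval, rather than generation, is the bottleneck.
```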
Maybe just zooming out: you started by saying that in order to have the debate about RAG versus alternative architectures for working on proprietary data, you need to predict forward, right? Any predictions for how these systems change as LLMs improve dramatically, if we look at the next generations of GPT from OpenAI, and Claude, and the Mistral models, and Llama, and such?
Yeah, so my prediction is that the systems will become simpler and simpler. Maybe this is my biased view; at least this is something that we are working towards. The idea would be that it's a very, very simple system: you just have three components, a large language model,
a vector database, and an embedding model, and maybe a fourth component, a re-ranker, which refines the retrieved results. You connect all of these, and each of them is a neural network. Everything else you don't have to worry about: chunking, multi-modality, changing the data format, because
the neural networks can do most of that, right? Seven years ago, if you talked to any of the so-called language models of that time, you had to turn your input into a very, very clean format. Now you talk to GPT-4, you can have typos, you can have all kinds of weird formats, you can even dump JSON files into it, right? The same thing will happen for embedding models as well. So my vision is that in the future, AI will just be
a very simple software engineering layer on top of a few very strong neural network components. Yes. I think the bias toward "it's actually all going to be AI" versus complex, discretized software systems is clear, but I believe it directionally, right.
Maybe zooming out to get a little bit of your perspective as a founder: what are one or two top learnings you have about starting the company as an academic, even despite your work with Google and other companies before? Yeah.
Yeah, I think it's very, very different. Founding a company is very different from doing research at Big Tech. Actually, it's a little bit closer to being in academia, because to run a university lab, I'm the CEO, CTO, CFO,
and HR for the lab, right? So you touch a little bit of everything, but at a slightly different scale. I think one of the biggest things I learned, actually from one of our angel investors, is that I should read some of the books.
I think for an experienced entrepreneur, many of the books are probably very basic, but for me they are very, very useful, even the basic ones, including Elad's book, by the way. But his book is a little bit
advanced, in the sense that it's about how to scale from 10 people to a thousand people, and I only read a few chapters of it because we are about 10 people right now. And also talking to a lot of angel investors, talking to Sarah and
my other lead investors. I think all of this helped me a lot in reducing the unforced mistakes in this process. To me, it's really about how to reduce the number of errors you make so that you can maximize the efficiency. At least that is what happened to me. Also, how to correct the mistakes as fast as possible, right? If you can correct mistakes
one week after you made them, versus one month after you made them, then that's a 4x efficiency improvement.
Very theoretically consistent with your vein of research. Last question. You have been personally productive, you run a productive research lab, and you've started a company. What do you think the role of academia in AI is in this age of scaling? Because most of your former students essentially all work at OpenAI or Anthropic, with a few professors and Citadel folks in the mix.
And the ones working with you. Yes, yes. And academia, this is a bit of a controversial topic; I think different people have different views. My view is that academia probably should work on
somewhat different questions from what industry is good at, right? If we are only working on how to scale up the systems, then obviously the incentive is not right; we don't have enough capital there. And, you know, even for OpenAI, I guess Sam Altman argues that you need a lot of capital to do this, in some sense. At the very beginning, I think their point was that
it cannot be non-profit, because if it's non-profit, then you don't have enough capital and you cannot scale up enough. I think I kind of agree with that. And that's why in academia, it's very hard to scale up and have enough resources to do the large-scale research. However, I think in academia there are many, many other things that we can do on a smaller scale, and we probably should focus more on long-term innovations.
So what I told my students at the lab is that we should think about what will be the breakthrough in three to five years, as opposed to how to help OpenAI improve their large language models for GPT-5. That's why we work on optimizers: Adam is a 10-year-old optimizer, and we said, okay, that sounds like a long-term project.
Maybe in five years we can improve the optimization efficiency by five to 10x. That's going to be a game changer for the whole landscape, right? If we improve the efficiency by 10x, I guess that's like $100 million versus $10 million for training GPT-5. I think that would change the landscape a lot in the industry. So efficiency is one of the things I spend a lot of time on. Another thing is reasoning tasks.
I think the reason why I identified that as one of my lab's directions is because it's challenging and it requires a lot of
very innovative research. It's very unclear whether the scaling laws are really enough to get you to prove the Riemann hypothesis or any of the math conjectures. And also, you have to have superhuman performance in some sense, right? If you train on just the Common Crawl data on the web, can you be a good mathematician? It's kind of
very hard to believe that. So we need more innovations there. That's pretty much what we are doing at the university lab: we try to work on the three-to-five-year agenda, and on a smaller scale. I think that's an inspiring note to end on, and a very open-minded one about what is still to be figured out. Thanks so much for doing this, Tengyu. Thanks so much.
Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.