丁丁: The core of model evaluation is analyzing and assessing how well a model performs; benchmarks are the standardized test sets used for that assessment. Key points:
- The "second half" of AI is more about defining problems than chasing score gains on existing benchmarks, because those benchmarks can diverge from real application scenarios; a model that excels on benchmarks may still underperform in practice.
- Beyond benchmarks, evaluation must account for the business scenario and how users actually behave, since benchmarks deviate from real usage.
- Chasing DAU as a single metric is a kind of laziness. DAU matters, but pursuing it alone does nothing for model capability; it must be combined with high-quality user data and benchmarks tied to capability gains. User data in turn has to stay aligned with the capabilities you want to improve, which means choosing the right benchmarks.
- A model's final performance depends on many stages: pre-training, fine-tuning, reinforcement learning, and so on.
- The biggest gain from working at a model company is the understanding of evaluation and benchmarks: model capability iterates dynamically, so benchmarks along different dimensions must be continually defined. Evaluation is implemented through benchmarks, i.e. sets of questions designed by third parties or in-house.
- Different businesses evaluate differently, e.g. deep search versus emotional-companionship products; abstract the evaluation dimensions and their priorities, then combine them with real user scenarios.
- In the future there may be many models, each with its own personality, and benchmarks may be segmented by user population; a model must be capable enough to meet personalized needs even when those needs run against the model's own goals.
- Companies understand and design benchmarks differently, which drives different iteration directions. A good benchmark genuinely reflects user needs, has adequate difficulty and discrimination, and adjusts as the model iterates.
- Benchmarks should correlate strongly with end-user metrics; if benchmark gains don't move user metrics, the benchmark needs to change. Both auto eval and human eval require continual calibration against real user experience.
- Startups and big companies differ in how they understand and practice benchmarks; startups iterate faster. Refresh frequency depends on model iteration speed and shifts in user needs; an early-stage startup can keep its benchmark to roughly 400 questions and adjust it dynamically based on user behavior.
- A bad benchmark is too easy, or covers only a single dimension at a single difficulty.
- Improving model capability is like allocating skill points: you can focus on one long board or spread evenly.
- AI companionship products are hard because the evaluation standard is hard to define: there is no standard answer. More niche, offbeat products may appear, since benchmarks cannot cover every user.
- Evaluation standards are usually a mapping of human values, and whether that mapping is accurate is worth questioning.
- When looking for strong AI product managers, focus on hands-on experience, building ability, and understanding of models.
曲凯: Steers the topics and follows up on 丁丁's points with additions and questions.
Supporting evidence:
丁丁: 'The "Second Half" article has been very widely circulated, and it makes an important point: at this stage, defining the problem may be more important than chasing score gains on existing benchmarks, because those benchmarks may still have a gap from real business scenarios or actual user needs.'
丁丁: 'I remember a point in that article that I think is quite right: judging by benchmark scores, many AI models have reached graduate- or doctoral-student level, yet when deployed they may not even perform at an intern's level. The reason behind this is how the benchmarks are set up, because in our real business scenarios, at least from what I have seen...'
丁丁: 'Besides the common benchmark tests, we also pay attention to many business-related and user-related benchmarks, because the general benchmarks mentioned above still deviate considerably from real model products and the inputs of different businesses.'
丁丁: 'So you endorse the "second half of AI" argument, right? I fully agree.'
丁丁: 'First of all, in the first half, everyone was still working hard to improve the base model's capability, still exploring the potential of pre-training. Kimi also realized the importance of RL very early, but RL's effect ultimately has to rest on a good base model, a good pre-training stage.'
丁丁: 'I think blindly pursuing DAU as a single indicator is, to some extent, inertial thinking, or a somewhat lazy behavior.'
丁丁: 'I'm not saying DAU is unimportant; you must have users to get feedback and real user input. It's just that blindly pursuing DAU may not help with, for example, improving model capability. Here we can bring in the sources of benchmarks to understand this: the benchmarks we mentioned come from several sources, one being common public benchmarks, another being real feedback from online users.'
丁丁: 'User data is still important, but the high-quality user data and the model capabilities you want to improve must be aligned; that is, you must choose the right benchmark to help improve your model's intelligence.'
丁丁: 'So from the perspective of your model product: if you were Liang Wenfeng half a year ago, would you take those DAUs and that data? If I had sufficient resources, I would certainly want to take them; the premise is that resources are sufficient, right? But they just weren't enough. Still, the instinct is to take them. Is that a common reflex of classical product managers? It is, right? With so many users coming, you still have that urge. I think so, but in the end they seem to have let it be; they didn't particularly want to take it on.'
丁丁: 'I think it is still the understanding of model evaluation and benchmarks. I used to do search products, and search has evaluation work that is somewhat similar; it's just that in search, your evaluation data and the pace of change were not nearly as fast.'
丁丁: 'Put simply, evaluation is certainly the most important thing for the model's performance and the product's final performance. Evaluation is carried out and implemented through benchmarks, and a benchmark is, for example, a set of questions designed by a third party or by the product team itself.'
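To make "a benchmark is a set of questions" concrete, here is a minimal harness sketch in Python; the `ask_model` callable and the exact-match grader are illustrative assumptions, not the tooling 丁丁 describes.

```python
# Minimal benchmark-harness sketch. `ask_model` is a hypothetical
# stand-in for whatever inference API the team actually uses.
from typing import Callable

def run_benchmark(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Score a model on a list of {"prompt": ..., "expected": ...} items."""
    correct = 0
    for item in questions:
        answer = ask_model(item["prompt"])
        # Exact match is the simplest grader; real benchmarks often use
        # rubric-based or model-based grading instead.
        if answer.strip() == item["expected"].strip():
            correct += 1
    return correct / len(questions)

# Toy usage: a "model" that always answers "4".
score = run_benchmark(
    [{"prompt": "2+2=?", "expected": "4"}],
    ask_model=lambda prompt: "4",
)
print(f"pass rate: {score:.0%}")  # pass rate: 100%
```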
丁丁: 'But in different businesses, the evaluation standards or points of concern are very different. As we just mentioned, if you are doing deep search, what do you hope the model outputs? Most likely a relatively truthful and comprehensive answer grounded in all the data sources you have fed it. But if you are doing, say, a C.AI-style product, it may be about emotional companionship...'
丁丁: 'So in this process you abstract not only the elements and categories but also their relative importance. There is a very interesting example here: when DeepSeek became popular, a large part of the reason was that everyone felt DeepSeek's style was very interesting, because it seemed very philosophical and elegant.'
丁丁: 'On the first point, I think so; some preferences already exist. For programming you will surely pick Claude as the first choice, while for deep search you might use o3 today. On the second question, I think it can be converted: you still need to abstract the personalization into some kind of model capability or product capability. A simple example: can I solve your personalized preference through Memory?'
丁丁: 'I think it's possible, and OpenAI itself is also working on this, so it may not need to be segmented as finely as you said, but the ultimate goal must be achieved through some capability internalized in the model. Let me take an extreme example: suppose it is still a C.AI-style product.'
丁丁: 'First of all, the difficulty of benchmarks in the same field can differ, and that difficulty reflects how well you understand the business. For example, in the beginning, when you do search, you might use some very simple questions.'
丁丁: 'There are a few principles. First, the benchmark must be real, able to reflect the needs of online users, and it must have a certain degree of difficulty and discrimination; the questions should not all be of the same difficulty. Second, the benchmark has to move with your model's entire iteration life cycle.'
丁丁: 'Is there a strong correlation between the benchmark and the final user metrics? That is, if the benchmark I designed today improves, in theory those user metrics should also improve, right? Yes; and if they don't, it means you need to change your benchmark, at least to keep the two continuously aligned, otherwise your evaluation is meaningless. It should be a positive correlation.'
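One way to operationalize this positive-correlation check, sketched under the assumption that per-release benchmark scores and a user metric such as task-success rate are both logged (the numbers below are invented):

```python
# Sketch: does the benchmark track a user metric across releases?
from statistics import correlation  # Pearson r, Python 3.10+

benchmark_scores = [0.61, 0.64, 0.70, 0.69, 0.75]  # one entry per release
user_metric      = [0.32, 0.33, 0.38, 0.37, 0.41]  # e.g. task-success rate

r = correlation(benchmark_scores, user_metric)
print(f"Pearson r = {r:.2f}")
# A weak or negative r suggests the benchmark has drifted away from what
# users experience, i.e. it is time to change the benchmark.
```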
丁丁: 'When we do evaluation, we involve both auto eval and human eval: using a large model to judge your own model's task completion, and using humans to evaluate the end-to-end result. These two kinds of eval also need to be constantly cross-checked, otherwise you have a model scoring automatically and then discover there is actually a gap from the real user experience.'
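A minimal sketch of that cross-check, assuming the auto eval and human eval have labeled a shared sample of outputs (raw agreement is my choice of statistic; the transcript names none):

```python
# Sketch: calibrate an LLM judge against human labels on shared samples.
# Both label lists are illustrative placeholders.
auto_labels  = ["good", "bad", "good", "good", "bad", "good"]  # LLM judge
human_labels = ["good", "bad", "bad",  "good", "bad", "good"]  # annotators

agreement = sum(a == h for a, h in zip(auto_labels, human_labels)) / len(auto_labels)
print(f"judge/human agreement: {agreement:.0%}")

# If agreement drops below a threshold, the auto eval is drifting from
# real user experience and its prompt or rubric needs rework.
THRESHOLD = 0.85
if agreement < THRESHOLD:
    print("recalibrate the LLM judge")
```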
丁丁: 'First of all, startups and big companies differ. What I see at big companies is that different teams still work in the old hand-off fashion; for example, high-quality data annotation, including the evaluation set, is done entirely by the data team.'
丁丁: 'How often is it reasonable to refresh the benchmark? I don't think there is a standard answer; the faster the better, or follow the data. Faster means your model capability is iterating quickly. But for many startups that don't touch the model itself, what they iterate on is things like their pre-prepared prompts and their engineering and testing capabilities, which affect the results.'
丁丁: 'I might give myself, say, 400 questions. Is that too many or too few? I don't think so; as long as it can measure your model's performance, that's OK. Then can I say: launch the product first, let users use it, and then sort through all the user prompts?'
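A hedged sketch of that "launch first, then sort the user prompts" step; exact-text deduplication and random sampling are my simplifications, and a real pipeline would more likely cluster prompts semantically:

```python
# Sketch: distill logged user prompts into a ~400-item eval set.
import random

def build_eval_set(user_prompts: list[str], target_size: int = 400) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for p in user_prompts:
        key = p.strip().lower()  # crude normalization for deduplication
        if key not in seen:
            seen.add(key)
            unique.append(p)
    random.shuffle(unique)
    return unique[:target_size]

logged = [
    "How do I reset my password?",
    "how do i reset my password?  ",  # duplicate after normalization
    "Plan a 3-day trip to Kyoto",
]
print(build_eval_set(logged))  # two unique prompts survive; order varies
```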
丁丁: 'We have talked about so many good benchmarks, yet no one has ever asked me what a bad benchmark looks like. For example, one that is particularly easy, or one that covers only a single dimension at a single difficulty; that is a very bad benchmark.'
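The "single dimension, single difficulty" failure mode can be caught mechanically; here is a sketch (my construction, with invented pass/fail data) that flags items offering no discrimination because every evaluated model passes or fails them:

```python
# Sketch: flag benchmark items that no longer discriminate between models.
# Rows are models, columns are items; True means the item was passed.
results = {
    "model_a": [True, True,  False, True],
    "model_b": [True, False, False, True],
    "model_c": [True, True,  False, True],
}

n_items = len(next(iter(results.values())))
for i in range(n_items):
    passes = [row[i] for row in results.values()]
    if all(passes):
        print(f"item {i}: too easy, every model passes")  # items 0 and 3
    elif not any(passes):
        print(f"item {i}: too hard, every model fails")   # item 2
# Items that all models pass or all fail add no discrimination and are
# candidates for replacement as the models iterate.
```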
丁丁: 'So is it possible that I put all my points into one item, make the long board long enough that users clearly perceive my product's strength and I stand out? Or is spreading the points evenly the better choice? This question probably has to be split into two layers; one is the capability of the base model, and what we see is that the stronger the base model...'
丁丁: 'Let's take an example everyone has probably used in daily life, a C.AI-style chat product. Suppose you are now building an AI companion chat product: what do you think the difficulties would be, and how would you approach them?'
丁丁: 'Because, as you said, your benchmark definition can only cover 80% of users; perhaps personalization plus a sufficiently strong model can solve the rest as well.'
丁丁: 'I'm actually thinking about a more abstract question: when we set many evaluation standards or values, you find that they are generally a mapping of human values. But is that right? This is really abstract, too abstract. It amounts to this: you believe that, across the whole human world, a certain problem must have a relatively good answer, so you build this mapping. But is the mapping really right? I don't know.'
丁丁: 'I think there are some product principles. One is to do product structure first, then functional details. For example, WeChat has many functions; if each were expressed as a different tab, without hierarchical decomposition, it would be very redundant and very complex today. Yet even today WeChat remains relatively simple, with only four tabs.'
丁丁: 'Another point is that your hands-on ability should be very strong. You don't need the strong module hand-off mindset of past product work; in fact you need to throw that mindset away entirely. Treat yourself as the product manager, the designer, and the front-end engineer at the same time; you may not yet be able to fully implement the back-end, but you can still try, and you complete the full closed loop of the process. I think that helps you understand the model far more.'
丁丁: 'First of all, in terms of background, I think I would really prefer people from early-stage model products or smaller companies, people who have completed zero-to-one or end-to-end work; that is, more full-stack people who have done things from beginning to end themselves, right?'