丁丁: The core of model evaluation is analyzing and assessing how well a model performs; benchmarks are the standardized test sets used for that assessment. Key points:
- The "second half" of AI is more about defining problems than chasing score gains on existing benchmarks, because those benchmarks can diverge from real application scenarios; a model that excels on benchmarks may still underperform in practice.
- Beyond benchmarks, evaluation must account for the business scenario and how users actually behave, since benchmarks deviate from real usage.
- Chasing DAU as a single metric is a kind of laziness. DAU matters, but pursuing it alone does nothing for model capability; it must be combined with high-quality user data and benchmarks tied to capability gains. User data in turn has to stay aligned with the capabilities you want to improve, which means choosing the right benchmarks.
- A model's final performance depends on many stages: pre-training, fine-tuning, reinforcement learning, and so on.
- The biggest gain from working at a model company is the understanding of evaluation and benchmarks: model capability iterates dynamically, so benchmarks along different dimensions must be continually defined. Evaluation is implemented through benchmarks, i.e. sets of questions designed by third parties or in-house.
- Different businesses evaluate differently, e.g. deep search versus emotional-companionship products; abstract the evaluation dimensions and their priorities, then combine them with real user scenarios.
- In the future there may be many models, each with its own personality, and benchmarks may be segmented by user population; a model must be capable enough to meet personalized needs even when those needs run against the model's own goals.
- Companies understand and design benchmarks differently, which drives different iteration directions. A good benchmark genuinely reflects user needs, has adequate difficulty and discrimination, and adjusts as the model iterates.
- Benchmarks should correlate strongly with end-user metrics; if benchmark gains don't move user metrics, the benchmark needs to change. Both auto eval and human eval require continual calibration against real user experience.
- Startups and big companies differ in how they understand and practice benchmarks; startups iterate faster. Refresh frequency depends on model iteration speed and shifts in user needs; an early-stage startup can keep its benchmark to roughly 400 questions and adjust it dynamically based on user behavior.
- A bad benchmark is too easy, or covers only a single dimension at a single difficulty.
- Improving model capability is like allocating skill points: you can focus on one long board or spread evenly.
- AI companionship products are hard because the evaluation standard is hard to define: there is no standard answer. More niche, offbeat products may appear, since benchmarks cannot cover every user.
- Evaluation standards are usually a mapping of human values, and whether that mapping is accurate is worth questioning.
- When looking for strong AI product managers, focus on hands-on experience, building ability, and understanding of models.
曲凯: Steers the topics and follows up on 丁丁's points with additions and questions.
Supporting evidence:
丁丁: 'The "Second Half" article has been very widely circulated, and it makes an important point: at this stage, defining the problem may be more important than chasing score gains on existing benchmarks, because those benchmarks may still have a gap from real business scenarios or actual user needs.'
丁丁: 'I remember a point in that article that I think is quite right: judging by benchmark scores, many AI models have reached graduate- or doctoral-student level, yet when deployed they may not even perform at an intern's level. The reason behind this is how the benchmarks are set up, because in our real business scenarios, at least from what I have seen...'
丁丁: 'Besides the common benchmark tests, we also pay attention to many business-related and user-related benchmarks, because the general benchmarks mentioned above still deviate considerably from real model products and the inputs of different businesses.'
丁丁: 'So you endorse the "second half of AI" argument, right? I fully agree.'
丁丁: 'First of all, in the first half, everyone was still working hard to improve the base model's capability, still exploring the potential of pre-training. Kimi also realized the importance of RL very early, but RL's effect ultimately has to rest on a good base model, a good pre-training stage.'
丁丁: 'I think blindly pursuing DAU as a single indicator is, to some extent, inertial thinking, or a somewhat lazy behavior.'
丁丁: 'I'm not saying DAU is unimportant; you must have users to get feedback and real user input. It's just that blindly pursuing DAU may not help with, for example, improving model capability. Here we can bring in the sources of benchmarks to understand this: the benchmarks we mentioned come from several sources, one being common public benchmarks, another being real feedback from online users.'
丁丁: 'User data is still important, but the high-quality user data and the model capabilities you want to improve must be aligned; that is, you must choose the right benchmark to help improve your model's intelligence.'
丁丁: 'So from the perspective of your model product: if you were Liang Wenfeng half a year ago, would you take those DAUs and that data? If I had sufficient resources, I would certainly want to take them; the premise is that resources are sufficient, right? But they just weren't enough. Still, the instinct is to take them. Is that a common reflex of classical product managers? It is, right? With so many users coming, you still have that urge. I think so, but in the end they seem to have let it be; they didn't particularly want to take it on.'
丁丁: 'I think it is still the understanding of model evaluation and benchmarks. I used to do search products, and search has evaluation work that is somewhat similar; it's just that in search, your evaluation data and the pace of change were not nearly as fast.'
丁丁: 'Put simply, evaluation is certainly the most important thing for the model's performance and the product's final performance. Evaluation is carried out and implemented through benchmarks, and a benchmark is, for example, a set of questions designed by a third party or by the product team itself.'
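To make "a benchmark is a set of questions" concrete, here is a minimal harness sketch in Python; the `ask_model` callable and the exact-match grader are illustrative assumptions, not the tooling 丁丁 describes.

```python
# Minimal benchmark-harness sketch. `ask_model` is a hypothetical
# stand-in for whatever inference API the team actually uses.
from typing import Callable

def run_benchmark(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Score a model on a list of {"prompt": ..., "expected": ...} items."""
    correct = 0
    for item in questions:
        answer = ask_model(item["prompt"])
        # Exact match is the simplest grader; real benchmarks often use
        # rubric-based or model-based grading instead.
        if answer.strip() == item["expected"].strip():
            correct += 1
    return correct / len(questions)

# Toy usage: a "model" that always answers "4".
score = run_benchmark(
    [{"prompt": "2+2=?", "expected": "4"}],
    ask_model=lambda prompt: "4",
)
print(f"pass rate: {score:.0%}")  # pass rate: 100%
```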
丁丁: 'But in different businesses, the evaluation standards or points of concern are very different. As we just mentioned, if you are doing deep search, what do you hope the model outputs? Most likely a relatively truthful and comprehensive answer grounded in all the data sources you have fed it. But if you are doing, say, a C.AI-style product, it may be about emotional companionship...'
丁丁: 'So in this process you abstract not only the elements and categories but also their relative importance. There is a very interesting example here: when DeepSeek became popular, a large part of the reason was that everyone felt DeepSeek's style was very interesting, because it seemed very philosophical and elegant.'
丁丁: 'On the first point, I think so; some preferences already exist. For programming you will surely pick Claude as the first choice, while for deep search you might use o3 today. On the second question, I think it can be converted: you still need to abstract the personalization into some kind of model capability or product capability. A simple example: can I solve your personalized preference through Memory?'
丁丁: 'I think it's possible, and OpenAI itself is also working on this, so it may not need to be segmented as finely as you said, but the ultimate goal must be achieved through some capability internalized in the model. Let me take an extreme example: suppose it is still a C.AI-style product.'
丁丁: 'First of all, the difficulty of benchmarks in the same field can differ, and that difficulty reflects how well you understand the business. For example, in the beginning, when you do search, you might use some very simple questions.'
丁丁: 'There are a few principles. First, the benchmark must be real, able to reflect the needs of online users, and it must have a certain degree of difficulty and discrimination; the questions should not all be of the same difficulty. Second, the benchmark has to move with your model's entire iteration life cycle.'
丁丁: 'Is there a strong correlation between the benchmark and the final user metrics? That is, if the benchmark I designed today improves, in theory those user metrics should also improve, right? Yes; and if they don't, it means you need to change your benchmark, at least to keep the two continuously aligned, otherwise your evaluation is meaningless. It should be a positive correlation.'
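One way to operationalize this positive-correlation check, sketched under the assumption that per-release benchmark scores and a user metric such as task-success rate are both logged (the numbers below are invented):

```python
# Sketch: does the benchmark track a user metric across releases?
from statistics import correlation  # Pearson r, Python 3.10+

benchmark_scores = [0.61, 0.64, 0.70, 0.69, 0.75]  # one entry per release
user_metric      = [0.32, 0.33, 0.38, 0.37, 0.41]  # e.g. task-success rate

r = correlation(benchmark_scores, user_metric)
print(f"Pearson r = {r:.2f}")
# A weak or negative r suggests the benchmark has drifted away from what
# users experience, i.e. it is time to change the benchmark.
```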
丁丁: 'When we do evaluation, we involve both auto eval and human eval: using a large model to judge your own model's task completion, and using humans to evaluate the end-to-end result. These two kinds of eval also need to be constantly cross-checked, otherwise you have a model scoring automatically and then discover there is actually a gap from the real user experience.'
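A minimal sketch of that cross-check, assuming the auto eval and human eval have labeled a shared sample of outputs (raw agreement is my choice of statistic; the transcript names none):

```python
# Sketch: calibrate an LLM judge against human labels on shared samples.
# Both label lists are illustrative placeholders.
auto_labels  = ["good", "bad", "good", "good", "bad", "good"]  # LLM judge
human_labels = ["good", "bad", "bad",  "good", "bad", "good"]  # annotators

agreement = sum(a == h for a, h in zip(auto_labels, human_labels)) / len(auto_labels)
print(f"judge/human agreement: {agreement:.0%}")

# If agreement drops below a threshold, the auto eval is drifting from
# real user experience and its prompt or rubric needs rework.
THRESHOLD = 0.85
if agreement < THRESHOLD:
    print("recalibrate the LLM judge")
```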
丁丁: 'First of all, startups and big companies differ. What I see at big companies is that different teams still work in the old hand-off fashion; for example, high-quality data annotation, including the evaluation set, is done entirely by the data team.'
丁丁: 'How often is it reasonable to refresh the benchmark? I don't think there is a standard answer; the faster the better, or follow the data. Faster means your model capability is iterating quickly. But for many startups that don't touch the model itself, what they iterate on is things like their pre-prepared prompts and their engineering and testing capabilities, which affect the results.'
丁丁: 'I might give myself, say, 400 questions. Is that too many or too few? I don't think so; as long as it can measure your model's performance, that's OK. Then can I say: launch the product first, let users use it, and then sort through all the user prompts?'
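A hedged sketch of that "launch first, then sort the user prompts" step; exact-text deduplication and random sampling are my simplifications, and a real pipeline would more likely cluster prompts semantically:

```python
# Sketch: distill logged user prompts into a ~400-item eval set.
import random

def build_eval_set(user_prompts: list[str], target_size: int = 400) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for p in user_prompts:
        key = p.strip().lower()  # crude normalization for deduplication
        if key not in seen:
            seen.add(key)
            unique.append(p)
    random.shuffle(unique)
    return unique[:target_size]

logged = [
    "How do I reset my password?",
    "how do i reset my password?  ",  # duplicate after normalization
    "Plan a 3-day trip to Kyoto",
]
print(build_eval_set(logged))  # two unique prompts survive; order varies
```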
丁丁: 'We have talked about so many good benchmarks, yet no one has ever asked me what a bad benchmark looks like. For example, one that is particularly easy, or one that covers only a single dimension at a single difficulty; that is a very bad benchmark.'
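The "single dimension, single difficulty" failure mode can be caught mechanically; here is a sketch (my construction, with invented pass/fail data) that flags items offering no discrimination because every evaluated model passes or fails them:

```python
# Sketch: flag benchmark items that no longer discriminate between models.
# Rows are models, columns are items; True means the item was passed.
results = {
    "model_a": [True, True,  False, True],
    "model_b": [True, False, False, True],
    "model_c": [True, True,  False, True],
}

n_items = len(next(iter(results.values())))
for i in range(n_items):
    passes = [row[i] for row in results.values()]
    if all(passes):
        print(f"item {i}: too easy, every model passes")  # items 0 and 3
    elif not any(passes):
        print(f"item {i}: too hard, every model fails")   # item 2
# Items that all models pass or all fail add no discrimination and are
# candidates for replacement as the models iterate.
```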
丁丁: 'So is it possible that I put all my points into one item, make the long board long enough that users clearly perceive my product's strength and I stand out? Or is spreading the points evenly the better choice? This question probably has to be split into two layers; one is the capability of the base model, and what we see is that the stronger the base model...'
丁丁: 'Let's take an example everyone has probably used in daily life, a C.AI-style chat product. Suppose you are now building an AI companion chat product: what do you think the difficulties would be, and how would you approach them?'
丁丁: 'Because, as you said, your benchmark definition can only cover 80% of users; perhaps personalization plus a sufficiently strong model can solve the rest as well.'
丁丁: 'I'm actually thinking about a more abstract question: when we set many evaluation standards or values, you find that they are generally a mapping of human values. But is that right? This is really abstract, too abstract. It amounts to this: you believe that, across the whole human world, a certain problem must have a relatively good answer, so you build this mapping. But is the mapping really right? I don't know.'
丁丁: 'I think there are some product principles. One is to do product structure first, then functional details. For example, WeChat has many functions; if each were expressed as a different tab, without hierarchical decomposition, it would be very redundant and very complex today. Yet even today WeChat remains relatively simple, with only four tabs.'
丁丁: 'Another point is that your hands-on ability should be very strong. You don't need the strong module hand-off mindset of past product work; in fact you need to throw that mindset away entirely. Treat yourself as the product manager, the designer, and the front-end engineer at the same time; you may not yet be able to fully implement the back-end, but you can still try, and you complete the full closed loop of the process. I think that helps you understand the model far more.'
丁丁: 'First of all, in terms of background, I think I would really prefer people from early-stage model products or smaller companies, people who have completed zero-to-one or end-to-end work; that is, more full-stack people who have done things from beginning to end themselves, right?'