Somewhere around GPT-3 and DALL-E, the first phases of those models, we were starting to see something fundamentally changing. Supervised learning was taking a backseat, and unsupervised learning was starting to work.
Around the ChatGPT moment, we started to see RLHF emerge: it is rather tedious to ask people to write essays or solve problems from scratch, but we can capture preferences very easily from humans and experts across different fields. Now, in 2025, we are in a regime where reinforcement learning has come back
and is the new technical vector that all of the AI labs are scaling. My best way to describe it is as meta-learning: instead of telling a computer what is good or bad, the experts are essentially teaching these algorithms how to assess what is good or bad. It's not simply getting the answer right, it's how great the answer is.
Thanks for listening to the a16z AI podcast. In this episode, we dig into one of the unsung heroes of the AI industry: data labeling and evaluation. Now, you've likely heard about Meta's big investment in Scale AI. But even before that news, data labeling was an incredibly important piece of the model training pipeline that largely flew under the radar of non-practitioners. So we brought in Labelbox co-founder and CEO Manu Sharma.
He sat down with a16z infra partner Matt Bornstein to explain the foundation and evolution of data labeling, and how his company has been able to ride that wave from computer vision to reasoning models to, more recently, helping power advances in state-of-the-art voice models. As Manu explains in detail, there was a seismic shift over the past several years as the value moved from labeling pre-training data to evaluating outputs in the reinforcement learning phase, signaling a shift in model capabilities, architectures, and applications, as well as an even greater need for human experts to help models perform across more complex modalities and with more demanding users. It's a great introduction to the space and a great example of riding the wave as a founder and startup. And you'll hear it all after these disclosures.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com slash disclosures.
2014 to '16 or '18 were some really, really interesting times, when we were starting, for the first time, to see computer vision algorithms work.
And I was working at a couple of technology companies in the space industry. I was building technology at Planet Labs, where we scanned the Earth every day with 300 or 400 satellites orbiting the Earth. There was so much data that the obvious thing to do was to use machine learning algorithms to extract insights from it and
power the geospatial industry with those insights. It was around that time that I felt data is so essential to developing these models that we could build something here, we could build a product here. And I think that really led to building Labelbox. We launched it on Reddit,
of all things. And our initial prototype got so popular that in the weeks after the launch, we just started signing up customers. And our customers were across sectors: healthcare, robotics, geospatial, insurance. It was such exciting momentum to see, and we rolled that into building a company around it,
essentially. So these were, I would call it, the early days of traditional machine learning really starting to take off. That's right, exactly. And you were kind of there. That's right. Helping it happen. That's right, exactly. So self-driving car companies were starting to pop up, and there were a few big companies that had just vast amounts of data, and they were applying these
computer vision algorithms to see what products and capabilities they could build. So it was very early innings. However, the trajectory was evident to a group of people who had been in the industry for a while. I remember in 2010 or '12, when I was in academic programs,
my neural network work would be like three layers and maybe 10 neurons. You could count them. And they were in MATLAB and Simulink, and you'd use that to test and train these networks. So there had already been a vast amount of progress from, let's say, 2008 to 2016.
And so if you were to extrapolate that, there were very exciting times ahead: vision was going to be everywhere. Then, I believe in 2020 or so, we started to see transformers taking a lead role, but it wasn't really evident until the ChatGPT moment. And so what was the original problem you were solving with Labelbox?
Yeah. So when you think about building machine learning models at that time, there are basically three things you need: compute, data, and talent. And when it came to data, we were developing algorithms to detect a wide range of things. We were trying to detect
where illegal activities were happening in certain countries. There are signatures you can see from satellite imagery: if it's a very big forest and you see a road pop up, it's likely some sort of signal of illegal activity.
Or detecting vessels of interest in synthetic aperture radar imagery, or even just change detection: how many buildings, how much development happened in ports, how much deforestation was happening. All of these are very different use cases.
And you need specialized datasets for them. Our challenge at the time was: how do we produce a vast amount of labeled data across different areas of expertise very quickly? It wasn't easy to simply outsource that at the time, because the industry was just so new.
And we already had experts in our network at the company; we already knew customers, or their affiliates, who had that expertise.
So the biggest challenge at the time was a collaborative data labeling system. Nearly every tool for labeling data was a desktop tool. If you wanted to label data with, let's say, 100 experts, you'd have to literally install a copy of the software on every computer and then manage all the operations of that.
So the first problem, literally the first problem we solved, was building a very smart queuing system and bringing it onto the web, so that an arbitrary number of people could simply upload data into this prototype, configure the ontology, and start labeling.
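For intuition, here is a minimal sketch of the kind of web-based labeling queue Manu describes: assets are uploaded once, any number of annotators pull the next pending item, and submitted labels are checked against the configured ontology. The class and field names are illustrative, not Labelbox's actual implementation.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class LabelingQueue:
    """Toy collaborative labeling queue: assets go in once, any annotator pulls the next one."""
    ontology: list                                # allowed classes, e.g. ["ship", "building", "road"]
    pending: deque = field(default_factory=deque)
    completed: list = field(default_factory=list)

    def upload(self, asset_ids):
        # Anyone can add assets; they are served first-in, first-out.
        self.pending.extend(asset_ids)

    def next_task(self, annotator):
        # Hand the next unlabeled asset to whichever annotator asks first.
        return self.pending.popleft() if self.pending else None

    def submit(self, asset_id, annotator, labels):
        # Submitted labels must come from the configured ontology.
        assert all(label in self.ontology for label in labels)
        self.completed.append({"asset": asset_id, "by": annotator, "labels": labels})

queue = LabelingQueue(ontology=["ship", "building", "road"])
queue.upload(["img_001", "img_002"])
task = queue.next_task("expert_1")
queue.submit(task, "expert_1", ["ship"])
```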
That enabled all of these experts in different domains to easily label data collaboratively. And because it was very expert-driven in the first place, the experts felt they had everything in control, which also meant they were very close to the data production and could quickly iterate with models. If the model was not performing in some areas, they would simply label more data very quickly and keep going. That was literally the number one problem we solved, and in hindsight, it continues to be one of the biggest engines of the work we do. That's very interesting. So this was modern software for...
computer vision data labeling. That's right. A collaborative one. And this is what you launched on Reddit and people got excited about. That's right, 100%. Because there was simply no alternative at that time. It was all desktop tools, or these legacy services companies where you had to talk to five or six salespeople just to get a demo. And machine learning engineers don't really want to go through all of that. They simply wanted to solve the problem quickly. Yeah.
And be in control. And so you made an interesting comment that the very first problem you solved was queuing: how do you manage a pool of data labelers and annotators? Labelbox has grown and expanded a lot since those early days, right? Now you no longer do just computer vision; you've added text and all these other data modalities.
You don't only provide software, you provide services and labelers, et cetera. But it's interesting, it sounds like you're saying the problems that you're solving today as this big, diversified AI data company are in some sense the same as the problem you set out to solve with the very first Reddit demo. Can you just expand on that point? Yeah, absolutely. I think it's really interesting in hindsight: the core tenets have remained the same, but the context and maybe the nuances and details have changed quite a bit.
So let's look at the arc of AI systems. In 2018, we were primarily in a regime of supervised learning.
In supervised learning, you're essentially producing the labels that you want the machine to predict. The canonical example is drawing a circle around a ship or something like that: you want the vision system to be able to detect humans or ships and so forth, and you're really asking humans to tag them.
That paradigm was in full swing at the time. Now, producing datasets for any supervised learning system required a vast amount of data: the more general the problem you were tackling, the more data you needed.
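To make "producing the labels you want the machine to predict" concrete, a single supervised computer-vision training example might look like the record below; the field names are illustrative rather than any particular tool's schema.

```python
# One illustrative supervised-learning example: the human supplies the exact targets
# (class plus bounding box) that the model is trained to reproduce.
example = {
    "image": "satellite_tile_0421.png",
    "annotations": [
        {"class": "ship", "bbox": [128, 342, 196, 410]},   # x_min, y_min, x_max, y_max
        {"class": "building", "bbox": [40, 15, 88, 70]},
    ],
}
```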
On one hand, for example, self-driving cars are the uber-general problem; there are just so many edge cases, so you needed a ton of data for every city, every lighting condition, weather, and so forth. But if you are, let's say, detecting a certain kind of cell in pathology slides, maybe that problem is a bit more narrow. However, in all of these cases, you needed data,
generally a lot of data. With our software, we enabled these teams to collaboratively label any format of this data. And as the paradigm moved forward, somewhere around GPT-3 and DALL-E, the first phases of those models, we were starting to see that
something fundamental was changing. Supervised learning was taking a backseat, and unsupervised learning was starting to work. It was a period when even the experts in the field were not so sure what would happen. There was certainly a lot of enthusiasm among a group of researchers that unsupervised learning had to work, that it was the only way to scale things forward. And
it started to work. However, it took the world some more time, maybe six to nine months, to figure out that,
while we can have these models learn on vast amounts of data, you still need to train them with human intelligence. That's where the terms pre-training and post-training came from. And in that regime, you still actually needed a lot of expert data.
So when we look at these trends in the fullness of time, it essentially is the same problem. In the former case, we were enabling our customers to label data with their experts. But in the latter configuration, we now needed data from a vast ocean of expertise, of human knowledge.
Because these foundation models are being trained across the board. Terms like RLHF and SFT emerged, which are basically techniques that made these models really useful for humans. You take a pre-trained model, and it's probably not that interesting to talk to or interact with. But when you post-train it with this human data, it becomes an assistant.
And again, same thing. In this context of producing RLHF or SFT data, as foundation model companies and hyperscalers entered, they needed a vast amount of data across mathematics, physics, arts, sciences, coding, and so forth. We took the same core technologies that we had built and
expanded upon them to cater to this new industry that formed. But again, the problems remain the same: you have to work with a vast number of experts, produce the highest-quality data, have
a lot of operational rigor, which in most cases is encoded in the software, and continue to evolve the techniques for producing this data. And now the frontier of AI is so advanced that nearly all data produced to improve state-of-the-art model capabilities requires a fusion of AI, software, and humans.
There is just no way to produce the best data in isolation. So your customers went from 50 radiologists teaching a model, sort of a narrow skill, to thousands of labelers teaching models to sort of be human or interact with humans in reasonable ways. But your argument is that
the data, and the human input into the model, continues to be a key ingredient. Absolutely. Now, the nature of supervision, if you will, the nature of how humans are teaching these models, has evolved quite a bit. And there is a very interesting trend to see. In the early days, the supervision was very detailed: you're really telling a computer, hey, this is this, this is that.
Then with transformers, around the ChatGPT moment, we started to see RLHF emerge. It is rather tedious to ask people to write essays or solve problems from scratch.
But we can capture preferences very easily from humans and experts across different fields. So that became a very important technique for capturing human signal. Now humans were not actually doing the hard work; rather, they were providing preferences, which is still fairly complex work.
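As a rough sketch of the preference-capture idea: instead of writing the answer from scratch, the expert only ranks candidate model outputs, and those comparisons become the signal a reward model is trained on. The record format below is illustrative, not any specific lab's schema.

```python
# Illustrative RLHF preference record: the expert only says which response is better.
preference = {
    "prompt": "Explain why the sky is blue to a 10-year-old.",
    "response_a": "...model output A...",
    "response_b": "...model output B...",
    "preferred": "a",              # the human signal: a ranking, not a hand-written essay
    "annotator": "expert_1138",
}

# A reward model is then fit so that it scores the preferred response above the rejected one,
# and the policy model is optimized against that learned reward.
```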
As the models continue to learn, we are now, in 2025, in a regime where reinforcement learning has come back and is the new technical vector that all of the AI labs are scaling. And in reinforcement learning, our experts are essentially teaching the models everything.
My best way to describe it is as meta-learning: instead of telling a computer what is good or bad, the experts are essentially teaching these algorithms how to assess what is good or bad. And so what's an example of that?
Let's say you're producing state-of-the-art coding models. Software engineers are going to come up with some really interesting real-world problems that they face every day. They'll write the code for solving that problem, but they'll also write perhaps a whole series of tests so that the AI model's outputs for that problem can be automatically graded.
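A minimal sketch of that setup, with hypothetical names: the expert writes the task and a set of unit tests, and the tests act as an automatic grader that scores whatever code the model produces. The `solve` function convention and `grade_solution` helper are assumptions for illustration only.

```python
# Hypothetical grader in the spirit of expert-written tests: score model-generated code
# by the fraction of expert-authored test cases it passes.
def grade_solution(model_code, tests):
    namespace = {}
    try:
        exec(model_code, namespace)        # load the model's candidate solution
    except Exception:
        return 0.0                         # code that doesn't even run scores zero
    solve = namespace.get("solve")
    if solve is None:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass                           # a crashing test case counts as a failure
    return passed / len(tests)             # a smooth score, not just right/wrong

# Expert-authored test cases for one toy task: add two numbers.
tests = [((2, 3), 5), ((10, -4), 6), ((0, 0), 0)]
model_code = "def solve(a, b):\n    return a + b"
print(grade_solution(model_code, tests))   # 1.0
```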
These rubrics, as we call them in our industry, could be in the context of tests that people write in software engineering, or they could be for computer-use agents. All of this, taken together, we call RL environments. So now we are really in a regime where the experts are teaching long-horizon tasks, but also
providing an evaluation scorecard for these AI models to automatically grade whether they got the answer right or not. And it's not simply getting the answer right; it's how great the answer is. So it's a very smooth gradient:
maybe I scored 80%, and that's not good enough; I want the RL systems to really score 95 or 98% in solving these large problems. That's where we are right now in terms of regime. And that means that for all these domains, in health or life sciences or scientific discovery, and the kinds of tools we are making, we have to teach these models
how to perform long-horizon tasks and how to grade them. How do they know that something is actually a really good output? So, yeah, humans are still needed to teach this quality judgment to these models. And so it sounds like you're saying the work we're asking human annotators to do has actually gotten more complicated. It's not like a
hot dog, not hot dog thing. It's a pretty complex domain and set of rubrics and way of inputting data. Yeah. Another analogy I've thought of is that it's like you're teaching a toddler: it's very heavy instructions in the early days, do this, do that. But now we are taking the role of a master, a really good, masterful coach,
where, at a very high level, we are providing instructions and coaching to these models on how to make decisions and judgments. That is the form of alignment that is happening. And now, again, we've just started the journey of AI agents.
What are AI agents, in many ways? One of the ways they may manifest in the world is that nearly every piece of knowledge work we do across industries can be mapped to some sort of workflow. And at some point, these AI systems are going to be so good that they can perform that workflow reliably and efficiently.
For these agents to learn, they need data. They need to understand what a task looks like and how to know that they are actually doing the task successfully. If you take a very long-horizon task, one that may take hours or days for a human, that entire task has to be decomposed into a specialized format of data. And these rubrics help the models understand
whether they got the task right and to what extent they got it right. I think this is an infinite ladder, I suppose, for us to continue climbing as these models and systems get better. Do you believe that
what we call agents, and I'll put that in quotes because we've had some discussions about agents on this podcast before, and I'm on the side that agents may not be a real thing, but with that caveat, do you think that these types of looping AI apps need
unique data to train on? Because it seems quite clear that, say, coding needs special data and math needs special data, and there's some degree of generalization across domains, but not a lot. Do you think this agentic thing of planning and task evaluation and so forth is a separate modality that you need training data to get good at, specifically? Yeah.
Or do you think agents just get better as the underlying capabilities get better over time? I think the answer is both. There is a fundamental capability that is needed, which is reasoning, planning, and so forth, to comprehend the user's intent and then come up with some sort of plan to go perform those actions.
As that gets better, the reliability of these agent orchestration layers will generally get better. And I do think that behind nearly every successful application product we are starting to see, there is a really high-quality eval dataset.
And what is an eval dataset? An eval dataset is simply a held-out set, typically separate from your training data. So I do think that if you are building a really good assistant, say for coding, you as a company want to distill down the taste: what makes your product, whatever company or product you're after, really compelling for that industry, for that workflow. That has to be distilled down into some sort of evaluation dataset. And that evaluation dataset is so important now because, with RL, you can actually use it to optimize all the hyperparameters of the entire system. There are the base models, the foundation models, and then there are all these other parameters you might be using for retrieval, or for multi-agent systems. The answer probably lies in optimizing all of these parameters to make the entire end-to-end system more reliable. So I think both are true, in that sense.
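A hedged sketch of what "using the eval set to optimize all the parameters of the system" can look like: grid-search a few system-level knobs (base model, retrieval depth, prompt style) and keep whichever configuration scores best on the held-out eval set. Every name here (`run_system`, the eval cases, the knob values) is a placeholder, not a real API.

```python
import itertools

# Hypothetical held-out eval set: prompts plus the points a good response must cover.
eval_set = [
    {"prompt": "Refund request, order never arrived", "must_mention": ["refund", "apology"]},
    {"prompt": "Upgrade my plan", "must_mention": ["pricing", "confirmation"]},
]

def score(response, case):
    # Crude rubric: fraction of required points the response actually covers.
    return sum(kw in response.lower() for kw in case["must_mention"]) / len(case["must_mention"])

def run_system(prompt, base_model, top_k, style):
    # Placeholder for the real end-to-end system (model call + retrieval + prompting).
    return f"[{base_model}|k={top_k}|{style}] refund and apology regarding '{prompt}'"

# Search over system-level parameters, keeping the configuration that scores best on the eval set.
configs = itertools.product(["model-small", "model-large"], [3, 10], ["terse", "empathetic"])
best = max(configs, key=lambda cfg: sum(score(run_system(c["prompt"], *cfg), c) for c in eval_set))
print("best configuration:", best)
```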
You need these base models to get better, which I think is happening already. For example, we see from many of our AI lab customers that they are now really trying to bring the capabilities of these application products, like coding, into their base models. And I think that's one of their big
goals this year. If you talk to these AI labs, they will say, hey, we wish you didn't need all this orchestration layer on top of our base models; we want the models to just be smarter. And then the application product companies are innovating very fast by figuring out what the use case is, converting that into an eval, and then also building a great UX and product on top of these base models.
And that is generally true for every customer and every application product we have seen, in every category. That's very interesting. So you're saying for most use cases, there's a general problem to solve, say coding, that the labs will get better and better at, that the base models get better and better at. There's also a system-level optimization that needs to happen, which is specific to that application.
And an application-specific eval set that allows that optimization to take place. Yeah, for sure. Let's take the example of customer service. I think this is the year of voice; we do a lot there, and we have basically been behind every breakthrough in voice models in the last nine months. If you look at customer service, you talk to a company through a voice agent,
and you have a first layer of concierge that asks you about your intent and so forth. That agent might bring you to a second agent, which is, hey, I'm here for refunding your orders, or I'm here to help sell you new things.
These are all customized agents for whatever the job is. And the customer service company deploying these capabilities has to impart its own taste for what great customer service looks like for its customers and how it wants to go about it. That ultimately, one way or the other, comes down to the evaluation dataset
they're going to use to know, objectively, how well their intent and their business goals are reflected in this entire system. And solving this business problem is no longer a single model. It is a number of AI models, maybe customized with system prompts and so forth, but it's a system,
and you have to optimize it to handle millions of calls a day. So this is the kind of problem our industry is starting to tackle now. So you talked about the journey from traditional machine learning to transformers, ChatGPT, voice, robots, all this stuff. That must have been a big deal for Labelbox, right? I'm just trying to imagine what was going through your head, and I know a little bit because we were talking a lot about this.
You could have made the decision: I'm not convinced by this new stuff, let's stick to what we do, stick to what we know, which is traditional machine learning. In retrospect, that would have been a bad decision, right? I'm guessing that would have been tough; you would have missed this huge new opportunity. Versus the decision you made, which was to go big on the new thing, which required new data types, a new business model in some ways, retooling the team. Can you just talk a little bit, as a CEO and founder, about what that is like and what was going through your head? Yeah.
Yeah, a lot of uncertainty, and a lot of first-principles thinking to come up with these high-conviction bets. It's never so clear in the moment. Because Labelbox was a hot company. You were doing really well. This wasn't a pivot; this was just this big thing coming, right? That's the really interesting thing to me.
Yeah, for sure. In the software category, we have been leaders for the products we offer, the data labeling software and so forth. And our long-term belief has stayed the same: we continue to want to make the best products for humans to align AI systems.
In the computer vision era, the supervised learning era, we felt the best way to do that was to offer a tools and software layer that any number of companies could use with their own experts to produce labels and train their models. There was a data engine being built for every use case in every company. After the generative AI moment, it took us some time to really understand what was happening.
Every AI lab was experimenting and tinkering with different techniques. And it took us some time to realize: the world is shifting from building AI models to renting AI intelligence.
A vast number of enterprises around the world are no longer building their own models. They're renting base intelligence and adding on top of it to make that AI work for their company. That was a very big shift.
That also raised the question: where are we ultimately going to accomplish this mission? I think we're going to continue to do so with companies who are customizing these models. But an even bigger opportunity was starting to emerge with the hyperscalers and the AI labs that are spending billions of dollars of capital developing these models and datasets.
We really ought to go figure out and innovate for them, in addition to our other customers. This was, I believe, late 2023, early 2024. Some other players had already started to enter that space because they were born as services companies. For us, it was a big shift from a DNA perspective, because Labelbox was built with a
software tools mindset, and our go-to-market, engineering, product, and design teams all
operated like a software company: truly understanding our customers and their roadmaps, trying to build the features, with the sales craft designed for selling software. To retool that was a very significant undertaking. But I think it really started with having a high-conviction bet:
these are the big shifts that are happening, so let's look at how we would serve the hyperscalers, how we would serve the AI labs. And it really came down to two things.
We had best-in-class software tools and platforms; a vast amount of labeling across many companies happens through our tools. We needed expertise: human experts from around the world whom we could recruit ourselves. And we needed the operational capability to actually run these data pipelines ourselves. So we took our own product and built those capabilities on top of it.
And this was actually less than a year ago. In June or July, we announced our aligner network, where we are hiring and assessing these experts: PhDs in mathematics and physics, different language experts, software engineers, and so forth.
Since then, we have been able to systematically go through every department, every team, and reconfigure it to essentially operate as a data factory.
These decisions in the moment were always hard. The changes are always hard for individuals in the company. But ultimately, the rewarding thing for everyone is being able to see the progress. If you're making these changes through all this uncertainty, as long as the teams are seeing objective progress, like, hey, we are now serving nearly every lab in the world. And that didn't happen overnight.
Every month, every week, we were innovating and doing something that others were not doing. And that gave our teams the energy to continue reconfiguring the company. And what was your operative feeling at the time? Were you in fear mode, like, oh, this might not work and the world is changing? Or were you in
ambitious mode, like, yes, finally, we've been building this foundation and now there are billions of dollars at stake, let's go get it? What were you thinking and feeling? I think a bit of both is probably true, in the sense that whenever we try to experiment with new things, there's always this curiosity about the problem. For example, for us to go hire these human experts and so forth,
there was just a lot of curiosity, like, okay, what would be a Labelbox way to do it? So there were these curiosities and questions,
a kind of beginner's mind. That's interesting. So, a new set of problems to solve. It's a new set of problems to solve, and that's really interesting. But then you also know that this may not work. In the history of our company, we have tried to build new products and new features, and you know that, look, some things are not going to work, and some things are going to work really well.
So there was that nervousness. But I think the hardest part for many of us at that time was just making the decision that we were going to go try it and do it. In companies, that can literally be the hardest part, because you have all these smart people arguing against all these directions and why they may not work, et cetera. But at the end of the day, we channeled our moments of
building something and launching on Reddit: let's just go build and see. Nothing is better than that. Let's just go build an MVP and see what happens. And so when we looked at building this network,
we said, hey, there's got to be a much better way to hire and assess these contractors and experts. Even today in the industry, a lot of this is happening with offline tasks or LeetCode-style coding tasks.
And we made a bet that we were going to use state-of-the-art AI models to live-interview these candidates and really understand their capabilities and skills. Things like that were just so exciting for us to innovate on and ultimately see start to work. That's very interesting. So you're almost saying making the decision was the hard part: we're looking at first principles, we're looking at the market, we have to do this.
Once you'd made the decision, it sounds like you had to manage through it with the team and all of that, but the curiosity took over. And I love this idea, it's such a powerful thing for founders and engineers and builders of all types, of going back to that first thing that really worked and gave you the spark of, oh, there's something here in the company. It almost gave you the license to go out and look for that again, it sounds like.
That's right. I think the hardest part is that there are all these times when you can convince yourself, hey, that seems so different, why would we do that? To build anything, you've got to have a dogmatic drive to believe in the mission and what you're going to go do. Sometimes that drive is pointed in directions that need to be adjusted, and sometimes it gets in the way of doing things that you otherwise
would want to do. Shifting that course is sometimes, I've found, the hardest thing, especially with a large team. And we are a fairly small company compared to the big tech companies; it gets insanely hard as a company grows to be able to try new things. And I think that's really where culture
and the soft things are so important: we want to be able to take these bets and try new things that are perhaps beyond our stated discussions or roadmap or mission, especially in an industry like ours where everything changes in a few months. So you've got to be able to have this sort of...
Yeah.
And the nice thing is that everything is changing every few months, in the sense that the data the labs are going to need six months from now is going to be vastly different from today. That means there is room for innovation and to do it better than the others.
There's this joke of AI years, right? Like dog years: there are seven dog years to one human year. I think there are about a thousand AI years to one human year; every day seems to be a new cycle. I can't let you go without asking about Scale. This is sort of a tectonic
thing that's happening in this industry right now. And I think it connects a lot to what you were just saying, right? Scale was early to this game of hiring people to be data annotators and providing that as a service. You made that decision later, but took, it sounds like, a more modern and more AI-driven approach, which sounds like one thing that's allowed you to scale so quickly. First of all, is that a correct read? And the second question is, how does the
partial acquisition of Scale change the industry and change the opportunity for Labelbox? Yeah, our industry certainly just got a lot of eyeballs and a lot of interest, or I guess limelight. And if there's one message I'd share, it tells you how important
data is for AGI efforts. It comes down to three things: compute, data, and talent. You really have to get all three right to be at the forefront of AGI efforts, or of application products. And I think
the race for AGI is so fierce that this move by Meta and Zuckerberg, in the grand scheme of things, totally makes sense. There's just tremendous opportunity for Meta to innovate across its suite of products.
There are all these reasons you can imagine why capabilities like Gemini or ChatGPT should exist across Meta's apps and so forth. And again, there's an opportunity for them to go build completely new product experiences. So my guess is that Zuckerberg sees that opportunity and wants to try something new and different.
And so Zuck agrees with you that data is worth spending $15 billion on, at least. Well, my best read here is that it's actually more about talent than anything else. But data obviously continues to be a very big part of it, because with the Llama series, they have had the compute, and they've had the data already.
They've been working with Scale and others from very early on. So I think it's probably something like, hey, he needs a Navy SEALs team to go build this new effort and get that energy back. That's my best guess. But yes, data continues to be a very big, important key factor for all of these efforts and projects.
And the importance is actually getting even stronger every few months, because the complexity of the data is now in the realm of knowledge work.
To go after all of these use cases and agentic workflows that people talk about, you're really going to need the best experts in those industries and to translate their problems into these RL environments for the algorithms to learn from. And I think human knowledge is very unique; it is probably not as trivial as people think, and it's probably infinitely vast
in terms of the intellect and the creativity and all the different parts of knowledge that exist across the world. You mentioned that AGI is the goal of many of these labs, which is in some ways the grandest goal that humans can work on and achieve right now. How does data help us achieve AGI? And do you think we're running out of data to get there?
Yeah, so I do think that the data for pre-training, or the pre-training of these models, has reached some sort of plateau.
There are still gains to be made in pre-training; I'm sure there will be some really interesting efficiency gains over time. But it is no longer the focus where you're getting the best ROI. Most of the gains in these foundation models are coming from post-training.
And within post-training, as we discussed, reinforcement learning is becoming a bigger and bigger part of the compute budget. Many of the datasets for these techniques don't exist, because now we are trying to have these models exhibit behavior that very smart individuals and the knowledge workforce would exhibit, and to do those tasks reliably. That kind of data
actually does not exist in the format that these models need to learn from. So it has to be produced; there's just no way around it.
So there's that. And I do think you've got to keep an open mind that there may be novel ways to produce datasets, whether synthetic techniques and so forth, that might accelerate the path through all of these use cases for knowledge work. Now, what I think is the more plausible scenario over the next year or two is
what we are seeing right now, just at a bigger scale, across nearly every dataset we are producing with state-of-the-art techniques. That includes verifiable domains like coding and math, but it also includes audio, video, and non-verifiable scenarios. Creative, open-ended. Creative, open-ended. There are techniques now where you're able to apply reinforcement learning
across all of those data modalities. And all of this data that is created is a fusion of AI, software, and human experts. And so,
in many ways, we are using all of these techniques, AI and software and workflows, to take away the tedious part of the work and really have the experts focus on producing the signal. That signal is the most important thing for these models to learn from. And that's how we'll produce datasets at a bigger scale over the next year or two. Now, these datasets don't have to be very large in volume; we are talking about an order of magnitude smaller datasets
in size compared to the computer vision era. Think of them as boutique, small datasets of extremely high quality that resemble a particular task or domain. And the reinforcement learning algorithms are just so good that they can learn from them.
So does that mean we need a little slice of everybody on the planet helping? Like we need a polite person to teach the model manners, a lawyer to teach it law, a mathematician to teach it math? What do the data labelers actually look like?
Yeah, it's really across the board, actually. When it comes to coding, these are our aligners, software engineers, and many of them are working at technology companies right now in the United States or Europe. That's one example. At nearly all of the biggest universities that are leading in sciences, mathematics, and engineering,
students in PhD programs, or even professors, are part of the network, and they're contributing to developing these datasets for these models. In mathematics, for example, you really are at a place where you have to have experts at the top of their game in whatever mathematical domain they are in to produce data that the models can learn from.
It's really up there in terms of complexity and sheer intellect. And if you're in, let's say, healthcare, you've got to work with medical
professionals, certified doctors and nurses and so forth, to teach models how to behave or how to aid in scientific discovery. This is knowledge work, and you're really tapping into people who are at the top of their game in these industries. These are the people who are producing these datasets now.
So we've reached the stage where the base knowledge is there in the models, and it really is experts, in the precise sense of the word, someone who has knowledge or expertise, right,
that normal people don't have. It's really those people who are teaching the models at this point. Yeah, absolutely. Just to give you some intuition about the gradient ascent of human intellect needed to produce these datasets: a year and a half ago or so, everybody was very excited about having writers, English writers, helping these models write good essays and things like that.
We solved that problem fairly well at this point. I think these models can do that really well. Obviously, there's room for improvement, but generally you can call the problem solved.
And then, when the reasoning paradigm appeared, a lot of data was needed from students in universities and so forth, because they can perform a lot of that work. Deep research capabilities, for example, come from people in academia producing those kinds of datasets and rubrics on what a great report looks like; that's what people do in universities, I guess.
And I think we are actually getting past that. To produce state-of-the-art data, you really have to tap into professions and people who are at the top of their game and are probably employed by companies,
whether it's in, say, materials science. If you want models to exhibit great capabilities, you've got to tap into the best polymer scientists in the world, working with researchers to produce these datasets. And that's where the bar is now. Thanks for listening to the end. Hopefully you're now smarter about the topic of data labeling and the incredible amount of work that goes into doing it well. If you enjoyed the discussion, please do rate the podcast and share it far and wide across your network.