
What happens when you train your AI on AI-generated data?

2025/5/19

On Point | Podcast

People
Ari Morcos
Kalyan Veeramachaneni
Mark Zuckerberg
American businessman who founded Facebook and Meta, dedicated to advancing social media and metaverse technology.
Meghna Chakrabarti
Sam Altman
Leads OpenAI in pursuing AGI and superintelligence, redefining the path of AI development and driving the commercialization and application of AI technology.
Topics
Meghna Chakrabarti: AI models are trained by ingesting vast amounts of data from the real world and learning how to create responses that match that real-world data. Large language models read text samples from across the internet, look for patterns in how words work together, and try to guess the next word in a sentence, learning by continually correcting their mistakes until they can understand and write text the way a human does.

Mark Zuckerberg: I think that in the future, training these big models may be more a matter of inference generating synthetic data that is then fed back into the model. This means AI models can use models built on real-world data to create new, artificial data, so-called synthetic data, to train future models.

Sam Altman: As long as you can get over the synthetic data event horizon, where the model is smart enough to generate good synthetic data, everything should be all right. This view highlights the potential of synthetic data for training AI models, but it also raises a key requirement: the model must be capable of generating high-quality synthetic data.

Ari Morcos: A synthetic data point is a data point generated by a model rather than created by a person or the real world. Such data is useful only if it reflects the underlying reality. Although data in the public domain is being exhausted, new data is always being produced, and the amount of data on the internet grows every day. The key to improving models, however, is making better use of existing data rather than simply collecting more. The vast majority of data on the internet is not particularly useful for training models because so much of it is redundant. High-quality data depends on the model's use case and needs to be optimized for specific tasks. Synthetic data is absolutely an important part of the solution, but many models trained primarily on synthetic data actually have serious problems, such as brittleness and poor generalization.

Kalyan Veeramachaneni: The AI we use today is still largely very small, not in scale but in the tasks it can accomplish. We are asking AI to reason and to think, which requires us to provide more data to train models that are more efficient at reasoning and can solve problems we have not yet thought of using AI models to solve. Anything worth predicting happens rarely, and to train models to predict such rare situations we have to create synthetic data, because those situations are too rare to occur often in the real world. For example, fraud in banks happens rarely, so synthetic data is needed to train models to detect fraud.

Deep Dive

Chapters
The traditional method of training AI models involves using vast amounts of real-world data. However, recent research suggests that this data may be running out. The podcast explores the shift from supervised learning to self-supervised learning and the resulting massive increase in data usage, questioning whether we are truly running out of data or simply failing to utilize existing data effectively.
  • Shift from supervised to self-supervised learning massively increased data usage.
  • Current AI models consume trillions of data points.
  • The public internet's data is not static; it's constantly growing, but AI's demand may be growing faster.
  • Data quality and targeted training are crucial for efficient AI model development.

Shownotes Transcript


This episode is brought to you by Amazon's Blink Video Doorbell. Get more at your door with the easy-to-install Blink Video Doorbell. Get more connections. Hey, I'm here for our first date. More deliveries. Hi, I have tacos for two. Oh, thanks. We'll be right down. And more memories. Alan, I have a surprise. All new Blink Video Doorbell with two-year battery, head-to-toe HD view, and simple setup. Shop now at Amazon.com slash Blink for just $69.99.

Support for this podcast comes from Is Business Broken?, a podcast from BU Questrom School of Business. What is short-termism? Is it a buzzword or something that really impacts businesses and the economy? Stick around until the end of this podcast for a preview of a recent episode. WBUR Podcasts, Boston. This is On Point. I'm Meghna Chakrabarti.

So the general understanding of how artificial intelligence models get trained is that they scoop up vast amounts of data from the real world and learn how to create responses that match that real-world data. Here's an example: a large language model, or LLM, the kind of tool that Siri or Alexa uses to answer your questions. In development, those LLMs read billions of text samples from across the internet: books, websites, etc.

The model looks for patterns in how words work together, or really how humans use those words. And as it trains, it tries to guess maybe what word comes next in a sentence. And if it guesses wrong, it fixes the mistake. It learns from that mistake. And then it repeats that process billions and billions and billions of times, each iteration getting better and better and better at guessing the right word.
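To make that training loop concrete, here is a minimal, illustrative sketch of next-word prediction with a toy model. It is a simplification for intuition only, not how any production LLM is actually built; the tiny corpus, model size, and training settings are all made up for the example.

```python
# Toy next-word-prediction loop: the model guesses the next token, the loss
# measures how wrong the guess was, and the optimizer nudges the weights so
# the next guess is better. Real LLMs repeat this over trillions of tokens.
import torch
import torch.nn as nn

corpus = "the model looks for patterns in how words work together".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([stoi[w] for w in corpus])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        return self.out(self.emb(x))  # logits over the vocabulary

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(200):
    inputs, targets = ids[:-1], ids[1:]               # predict each next word
    loss = nn.functional.cross_entropy(model(inputs), targets)
    opt.zero_grad()
    loss.backward()                                    # learn from the mistake
    opt.step()
```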

That's essentially how the LLM learns to understand and write like a human. So what happens when AI models run out of real-world data to train on?

Well, several research papers published in recent years suggest that developers will in fact run out of real-world data in a matter of years. But developers also say there might be a solution to that. I do think in the future, it seems quite possible that more of what we call training for these big models is actually more along the lines of inference generating synthetic data to then go feed into the model.

That's Mark Zuckerberg, of course, talking to AI podcaster Dwarkesh Patel last April. And what he's suggesting is this: that AI models built on real-world data create new, artificial data, so-called synthetic data, as he said, to train future models.

Here's OpenAI's Sam Altman at the Sohn Investment Conference in May of 2023. As long as you can get over the synthetic data event horizon where the model is smart enough to make good synthetic data, I think it should be all right.

But what is that event horizon? Also, I love that analogy because event horizons are also on black holes. So like, could we really just plunge into a black hole of AI synthetic data? Perhaps more importantly, if AI's purpose is ultimately to be used and beneficial to our world, our real human world, how is it even possible to run out of data? Aren't we humans generating data all the time?

And can synthetic data be an adequate or even acceptable replacement for reality? So let's start today with Ari Morcos. He is co-founder and CEO of Datology AI and a former research scientist at Meta's Fundamental AI Research team and at Google DeepMind. Ari, welcome to On Point.

Thank you so much for having me. Also, I see you're joining us from San Jose, California. Actually, not that long ago, I was driving down Highway 101 and every single billboard, every billboard was an AI billboard. OK, so first of all, you heard my sort of populist version of the definition of synthetic data. How would you actually or more precisely define what it is?

I think you actually gave a very good definition. But briefly, a synthetic data point is just a data point that's generated by a model rather than generated by a human or created from the real world. And that can be quite useful so long as that synthetic data point is actually reflective of the underlying reality. Reflective of the underlying reality. Okay, so we're going to talk about how that data is generated and how to get to it.

be sure that it satisfies that important caveat you gave. But let's get right to this question, because it's been perplexing me. How can we run out of real world data? Yeah, so maybe to start, let me take a step back for a moment and just kind of

talk about how we got here to where we are now, where we might be running out of data. In the 2010s, the way you would train a machine learning model is that you would have some amount of data. You would go and you get a bunch of humans to label it. So imagine you have a data set of lots and lots of pictures. Some of them are pictures of cats. Some of them are pictures of dogs. You go and you have a bunch of humans say, this is a cat. This is a dog. This is a cat. This is a dog.

And that is a pretty expensive and time-consuming process, you might imagine, right? And as a result, the largest data sets that we would train our models on would be a million data points or something like that. There is a very famous academic benchmark called ImageNet that was used for a lot of progress in the last decade that was about a million images. And that's called supervised learning because it's being supervised by a human saying, this is a cat, this is a dog.

But then in the late 2010s, we had this incredible breakthrough, which is called self-supervised learning, which means we figured out how to train models on data that hadn't gone through this manual annotation process by a human. And the vast, vast, vast majority of data that we have is not labeled, right? A human hasn't ever looked at it and given it a label. So this massively unlocked

the amount of data that we were able to train models on, going from a million data points being really large, say circa 2018, to now trillions of tokens, the entire internet. So when you put that into perspective, that's about a million-fold increase in the scale of data that we are feeding into these models today.

on the order of three to five years. Which is really wild when you come to think about it. And this is also why the compute spend has gone up so dramatically because the more data you see, the more GPU hours you need, which is why NVIDIA, of course, does so well in all of this. And that's why we've seen this massive explosion. But this now literally means that for many of these models, we are feeding the entire total of the public internet into these models at this point.
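A rough sketch of the distinction Ari is drawing, with made-up examples rather than anything from a real pipeline: supervised learning needs a human-written label for every example, while self-supervised learning manufactures its training targets from the raw data itself, which is why unlabeled web text suddenly became usable at enormous scale.

```python
# Supervised learning: every training pair requires a human annotation.
supervised_examples = [
    ("photo_0001.jpg", "cat"),  # a person looked at this image and typed "cat"
    ("photo_0002.jpg", "dog"),
]

# Self-supervised learning: the target is just the next token of the text itself.
def next_token_pairs(tokens):
    """Turn raw, unlabeled text into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("data is all you need".split())
# [(['data'], 'is'), (['data', 'is'], 'all'), ...] -- no human labeling step.
```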

And the public internet, though, isn't a fixed thing, right? I mean, we're pouring data into it every second of the day, including all the now like AI slop that's out there as well.

Yes, and that's another big problem, right? That if we do train on synthetic data, we want to make sure it's high quality synthetic data that was intentional. We don't want to train it on accidental synthetic data that just has made it onto the internet through AI slop, as you call it. So yeah, the internet is absolutely growing, but the hunger of these models in some ways is growing faster. Okay, so let me parse some of what you said just a little bit more. So first of all, are we running out of real world data? Yes or no?

So I would push back on this, I think. And I would beg this question a bit because I think, well, yes, in the public domain, we are exhausting what is currently available. There's, of course, always new data growing and the amount of data that's being put onto the Internet is increasing with every passing day. So there always will be new data there. But we have exhausted the majority of it.

However, this question presupposes a notion that all data are created equal and that the only way for us to improve our models is to get more data rather than making better use of the data we already have.

And I would argue that there's orders of magnitude, hundredfold improvements left by just making better use of the data we have already rather than needing to collect more data. The vast majority of the data on the internet is not particularly useful for training models for a whole bunch of reasons. One is that it's just a lot of it's very redundant. For example, think about how many different summaries of Hamlet there are on the internet. A model doesn't need all of those.

Some fraction of them will be enough for the model to understand the plot of Hamlet. So there's a lot of data that's not useful and a lot of data that's only useful at certain times. For example, imagine you were teaching a middle schooler math class.

If you showed them a bunch of arithmetic problems, it would be too simple for the student. They know how to do addition and subtraction, basic multiplication, division. It wouldn't teach them anything. And similarly, if you were to show them calculus, it also wouldn't be very helpful for the student. Calculus is way too difficult for an average middle schooler. You need to show them geometry and algebra. That's where they're going to be learning.

Well, when we train these models, we just mix all of those together and show it all to it at all times of training rather than actually thinking about what is the data that's going to teach the model the most based off of what the model understands right now. And then using that to target and, in a sort of curriculum, actually teach the model in the best way. And that can enable massive increases in data efficiency.
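Here is a hedged sketch of those two curation ideas, removing redundant documents and keeping the examples the model can actually learn from right now. The similarity threshold, the assumption of precomputed embeddings, and the loss-based difficulty band are illustrative choices, not Datology's actual method.

```python
import numpy as np

def deduplicate(embeddings, threshold=0.95):
    """Keep one representative from each cluster of near-duplicate documents.
    Assumes unit-normalized embedding vectors, so a dot product is a cosine
    similarity; anything above the threshold is treated as redundant."""
    kept = []
    for i, e in enumerate(embeddings):
        if all(np.dot(e, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept  # indices of documents worth keeping

def curriculum_filter(per_example_loss, low=0.5, high=3.0):
    """Drop examples the model already finds trivial (very low loss) or
    hopelessly hard (very high loss); keep the band it can learn from now."""
    return [i for i, l in enumerate(per_example_loss) if low <= l <= high]
```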

Okay, so targeted training, that's going to be my little slug for that. Targeted training. But you also, so this gets us to another phrase that you used a little earlier, high quality data, right? So is that what you're talking about? I mean, how would you define what high quality data is?

Yeah, so that is exactly what I'm talking about. As to how to define high quality, in many ways, I think that is the billion or perhaps trillion dollar question even. And in many ways, that's entirely what my company, Datology, was built to do, is to try to solve this problem of how do we understand what is really high quality data and make use of that to make models better and to solve these problems.

The first and most important thing to understand about quality is that there's no one silver bullet for this in the sense that quality is very dependent upon what the use case of the model is.

For example, if I want to train a model that's really good at legal questions and can serve as a legal assistant, obviously I'm going to value legal data more highly than I would data about movies or about history in some cases. Whereas if I'm training a model that's going to help doctors, a healthcare system of some sort, obviously I'm going to value healthcare data more. So the first thing to note is that it depends on what you're going to do.

Now, as to how you actually do this, it's a real frontier research problem. And most of this research is, of course, being done within these big frontier labs, OpenAI, Anthropic, DeepMind, et cetera. And this is literally the secret sauce that is distinguishing between these different models and labs. Okay. Lurking behind a lot of this, though, Ari, I think you, again, you tantalized us with that a little earlier, is money. Money.

Right? It sounds like, look, I'm just inferring here, but it sounds like perhaps one of the impediments to trying to use the data we have better is that it may cost companies more to do that. I think actually it's a bit the opposite. If you can do it better, it actually saves a dramatic amount of money. Then why aren't more companies doing it? Why are we hearing Sam Altman say we need synthetic data? I think there's two reasons. I think one, because it's really hard. Yeah.

And I think synthetic data is absolutely a big part of the solution here. Don't get me wrong. I think that these are not mutually exclusive. And for example, at Datology, we also use quite a bit of synthetic data. Now, it's not the panacea, I think, that it's often made out to be. People talk about synthetic data as if it will be the end-all, be-all and fully replace it. I think what we've seen is actually that a number of models that have been trained primarily on synthetic data actually have a lot of problems.

In particular, they get very brittle and kind of weird. They're very good on the exact data that they're trained on, but they don't generalize to new formats or things that are a little bit different. Okay. So Ari, I'm going to ask you to just hold there for a second because we have to take a quick break, but this is the perfect place to pause because when we come back,

I'm going to bring another guest in and we're going to have a good discussion on, again, the pros and cons as it looks like we're going to see more use of synthetic data to train AI systems. So we'll be right back. This is On Point. Support for On Point comes from Indeed. You just realized that your business needed to hire someone yesterday. How can you find amazing candidates fast? Easy. Just use Indeed. There's no need to wait. You can speed up your hiring with Indeed.

and On Point listeners will get a $75 sponsored job credit to get your jobs more visibility at Indeed.com slash On Point. Just go to Indeed.com slash On Point right now and support the show by saying you heard about Indeed on this podcast. Indeed.com slash On Point. Terms and conditions apply. Hiring? Indeed is all you need.

Support for this podcast comes from Is Business Broken? A podcast from BU Questrom School of Business. A recent episode explores the potential dangers of short-termism when companies chase quick wins and lose sight of long-term goals. I think it's a huge problem because I think it's a behavioral issue, not a systemic issue. And when I see these kinds of systemic ideas of changing capitalism, it scares me.

Follow Is Business Broken wherever you get your podcasts and stick around until the end of this podcast for a sneak preview.

I want to bring Kalyan Veeramachaneni into the conversation. He's co-founder and CEO of DataCebo and principal research scientist at the MIT Schwarzman College of Computing. Kalyan, welcome to On Point. Thank you. Thank you for having me. Okay, so Ari did a really good job, I think, of laying out sort of the subtleties when we talk about what synthetic data is. But I want to just get a check from you. I mean, what do you think about his assertion that

synthetic data is going to be part of the picture moving forward, but we're not actually running out of real-world data. We just have to use what we have in the real world better. To a certain extent, I agree with that, but I wanted to give another perspective. I think...

The AI that we have as of today and we are using is largely very small so far. I don't mean that as in size, but in the tasks that it can do. And as days go by, we are asking more and more of it. So originally it was just like, let's chat with it. Let's see if it finds us something. Let's do search. And now we are asking like legal questions. We are asking, what do you think about this question? So we are asking it to reason. We are asking it to think.

So that requires us to provide more data to train models that are much more efficient at reasoning and can solve problems that we haven't yet thought of solving with such models in AI. So in AI, I always say that anything worth predicting is very rare to happen. So that's generally true, and most of the AI models depend on predicting either the next word or the label of the sentence or a sentiment of a sentence and so on and so forth.

So as a result, for us to be able to train these models to predict such rare situations, we would have to create synthetic data, because they are just rare. They don't happen that much in the world. Okay, so let me, I'm just a tad bit confused here, because you said anything that's worth predicting doesn't happen very often. Yeah. Because it's like, that's maybe why we want to be able to predict it. But with what we're asking models to do, like the LLMs right now, is it

worth predicting the next word? I mean, next words happen all the time. So I'm not quite sure what you're saying there. Right. So an LLM's next-word prediction happens all the time, and we can predict the words. But what we are asking now is specific tasks, saying that, hey, I have this set of texts. Does this mean fraud? Okay.

So we are asking at a meta level, we are asking, does this group of words mean something, you know, a fraud or a sort of a hate speech or something else? So we are asking such questions of it. So we are asking more. And why can't AI be trained on whatever real world data that we have? Why is what we have right now not satisfactory to make AI models be good at that kind of work?

Great question. So I think if you just take the example of fraud, I mean, thankfully, in banks and so on and so forth, the fraud happens rarely. So you have, you know, 10 million transactions that are not fraudulent, and you have like 10,000 transactions that are fraudulent, and you have reports for those fraud transactions, fraudulent transactions. So as a result, when banks are training a model to be able to detect from a certain report whether it's truly fraud or not, you only have 10,000.

and you have a million or 10 million transactions that are not even reports, you have data that is not fraud at all. So as a result, when you try to train a model, it just latches on to the non-fraudulent examples and doesn't have enough to learn from the fraudulent examples.
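Put in rough numbers, the imbalance Kalyan describes looks like this. The weighting formula and the target mix below are generic illustrations of how practitioners often compensate for a rare class, not a description of any particular bank's system.

```python
n_not_fraud, n_fraud = 10_000_000, 10_000   # roughly the 1000:1 imbalance described

# Option 1: weight the rare class so both classes contribute equally to the loss.
w_fraud = n_not_fraud / n_fraud             # each fraud example counts ~1000x
w_not_fraud = 1.0

# Option 2: add synthetic (or oversampled) fraud examples until the mix is workable.
target_fraud_share = 0.10                   # say we want ~10% fraud in training
n_synthetic_needed = int(target_fraud_share * n_not_fraud / (1 - target_fraud_share)) - n_fraud
print(n_synthetic_needed)                   # ~1.1 million synthetic fraud examples
```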

So that's just one example where there's rare occurrence of an event that we want to predict or we want to reason about. Okay. So full disclosure here, my undergraduate majors were in civil and environmental engineering, so I'm definitely a very, like, hands-on concrete person. If I don't have to wear a hard hat, it's a little challenging for me to understand it. So...

So I'd love, you know, as we have this discussion, I'd love both of you to bring in as many real world examples as you can to help us understand this. So Ari, what do you make of...

what Kalyan said about, let's take, fraud is a really good one, right? Because that's a highly important area that we want as much, you know, AI assistance as possible. That the kinds of data that we have, Ari, right now, as Kalyan's saying, are inadequate in order to predict new kinds of fraud.

Yeah, so I think this is a good example for a couple reasons. First off, I think this reveals one aspect of that we're running out of data problem, which people don't talk about, which is that the vast majority of data in the world is not public. The vast majority of data in the world is private sitting in large companies. As an example, there is very little data around fraudulent credit transactions in the public internet, but there is a whole bunch at Amex and Visa and Chase and large financial institutions.

And that data is useful for various problems, but currently the big foundational labs wouldn't have access to those data. There are several companies that might license the data, but for the most part, that data is a company's really valuable moat that enables it to build its own applications that can be really strong.

But I want to touch on this notion of kind of the edge cases or outlier examples, or they're sometimes called the long tail that Kalyan was just referring to, because that's absolutely correct. I think one really salient version of this is self-driving cars. Imagine, for example, all Teslas are constantly recording video data as they're driving. If you think about that data set that has been collected, the vast majority of that data is going to be on highways.

And highways are actually pretty simple for self-driving cars comparatively. They've been pretty good at them for a while, right? Autopilot has worked well on highways for a long time. They're pretty predictable. You don't have to worry nearly as much about a woman with a stroller who may or may not step off the curb and get in the way of the car, or construction zones, or things like that.

Those are the edge cases that you really need to be looking at to make sure that your self-driving car isn't going to have a terrible accident. And Kalyan's right. Those are rarely represented in real data sets. However, one of the things we can do is identify those examples and then up-sample them, repeat them or up-weight them in some way so that the model sees them more frequently.

Another place for this coming up. Ari, I'm going to make you stop there because you're stealing our thunder because we talked to someone specifically about autonomous vehicles. I mean, driving is a perfect example, I'd say, because a while ago, again, I was in the Bay Area, Ari, and I was in an autonomous taxi and it...

pulled into the hotel where I was staying. And there was, I don't know why, but someone left a dumpster sort of halfway covering the exit to the pull-in in front of the hotel lobby. And the taxi was stumped. Like it did not know what to do. And it was just sitting there and we had to call the company and have some human come get us. And I was really confused about why it didn't even think just to back up.

Like a human would automatically be like, "Let's just back up and go another direction." But the taxi at that time, this was a couple of years ago, couldn't do it. But okay. Anyway, here's a developer who's working in the autonomous vehicle space. He's Felix Heide, professor of computer science at Princeton and head of AI at Torc Robotics. Felix Heide: At this point, we're able to generate high quality novel trajectories for autonomous vehicles that are almost photorealistic.

So we can take an existing driving sequence that we have observed, simulate our ego vehicle driving on that same route, but on the opposite side or in a squiggly line or driving off the shoulder, leaving the drivable area or crashing into another vehicle that is driving ahead of us. And he tells us that these simulations can create incredibly realistic environments with other vehicles, pedestrians, trees, buildings, even fine detail like parking meters and

trash cans. Good to know. Along with cameras, LIDAR and other sensor technologies, Professor Heide says AI models can learn in a self-play type of way.

I can put them into a synthetic environment where I have a closed-loop environment that I over and over again provide them with new scenarios that challenge the model. So through the self-play, we can really unlock the original idea of reinforcement learning in a very convincing, exciting way, where we have the best superhuman driver trained in the simulation world until it sees all of the crashes that it needs to see in order to understand how to react.
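A heavily simplified sketch of that closed-loop, self-play idea. The simulator and policy interfaces and the crash penalty here are assumptions made for illustration, not how Torc's or anyone else's system actually works.

```python
def train_in_simulation(policy, simulator, episodes=10_000):
    """Closed-loop training: the simulator keeps serving challenging scenarios,
    and the driving policy is penalized for crashes, so it can 'see all the
    crashes it needs to see' without any real-world risk.
    `policy` and `simulator` are hypothetical interfaces used for illustration."""
    for _ in range(episodes):
        state = simulator.reset(scenario=simulator.sample_hard_scenario())
        done = False
        while not done:
            action = policy.act(state)
            state, crashed, done = simulator.step(action)
            reward = -100.0 if crashed else 1.0   # heavily penalize collisions
            policy.update(state, action, reward)
    return policy
```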

And Professor Heide says these environments help provide data points for situations that are rare or haven't yet occurred in the real world, as we've been talking about.

And it's a key step, he says, for ensuring that the technology is safe. If you look at the Waymo deployments, for example, they're sort of city by city, relatively slow geofenced deployments of 100 vehicles here and there. So this is exciting and shows the potential of the technology, and I'm super excited about it. But to really bring it at scale, this is one of the key technologies that will allow us to bring these vehicles out in the hundreds of thousands of vehicles and do it in a safe manner.

So that's Felix Heide, professor of computer science at Princeton and head of AI at Torc Robotics. And Kalyan, let's stick with this example for a second because, again, the hard hat wearer in me can understand it. But I also feel like there's a trust but verify aspect to this because these autonomous vehicles may do perfectly in training using the synthetic data that's given to them. But in terms of unleashing it...

into the real world, wouldn't we want to have, like, sort of a very tight regulatory scheme to be sure that they perform well in the real world? Absolutely. Absolutely. And autonomous vehicles especially have much more stringent testing requirements before you put them out in the real world. And look, the synthetic data creation, just stepping back a bit, you know,

we did synthetic data generation even 20 years ago. In 2005, I was at GE. And at that point, they were generating synthetic data using a computational fluid dynamics-based simulator for aircraft engines, GE90 engines, right? So they will create the data. They'll pretend as if the flight is happening, and this is through a software framework, and inject some faults and create the data.

So what's very important is that when you take the synthetic data, you mix it in with the real data in your actual development of the model. So you don't essentially just train it with synthetic data. So you mix in real data, you train a model, and then you test it rigorously. So in this case, I think they would try to test that autonomous car, I guess, in some locations and drive around with the new model. Downtown Boston. Downtown Boston. Or maybe behind the dumpster. Yeah.

And see, so actually all the situations, like the one situation that you mentioned just previously, also become part of the test suite. So we now test whether the car, or the autonomous driving, is able to handle the new situations that it was not able to handle before.

So that sort of rigorous stress testing is required before they are deployed in the real world. So Ari, I guess Kalyan just said what you were telling us earlier, that it's a mix, that new AI models should be trained on this mix of synthetic and high quality data.

Yeah, I think that's exactly right. You need to find the high quality real data, and that could involve finding a lot of those outliers. That could involve finding the most difficult examples. And then you need to mix it in with appropriate synthetic data. When you think about what's going to make synthetic data work, there are generally two things that are really important. First of all, of course, synthetic data has to be reflective of the real world, right? Imagine I have a simulator where the laws of physics are different.

Obviously, a model is not going to generalize from that to our world where, you know, if gravity is half of what it is. So the simulation actually has to match reality in order for this to work, number one. And then number two is that you have to make sure you generate diverse data. Diversity is in many ways the most important thing for high quality data curation and for making these models learn. You have to make sure it covers lots and lots and lots of scenarios, every possible way something could be presented.
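One crude, illustrative way to put a number on that diversity requirement (a made-up heuristic, not an industry standard): embed each synthetic example and check how similar they all are to each other. If everything is nearly identical, the generator is not covering many scenarios.

```python
import numpy as np

def average_pairwise_similarity(embeddings):
    """embeddings: (n, d) array of unit-normalized vectors, one per synthetic
    example. Values near 1.0 mean the examples are nearly identical (low
    diversity); lower values mean broader coverage of scenarios."""
    sims = embeddings @ embeddings.T
    n = len(embeddings)
    return (sims.sum() - n) / (n * (n - 1))  # exclude each example's self-similarity
```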

Wait, but how can you do that when the – so this is, again, forgive me for just being so gauche here, but how can you do that when part of the problem is we can't actually predict the infinite number of scenarios that we even as humans can be presented with every day? I think the answer is you can't do it perfectly.

You can do it well. And then what you do is you kind of make a virtuous cycle there where you start with some synthetic data, use that to make a model better. That model can do better at generating more data, use that, and so on and so forth until eventually you get a model that's getting better and better and better. It's often the way that people think about this. But it doesn't have to be perfect. It just has to be more informative than what the model currently understands. So long as it teaches the model something new, you can get to a certain point.
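A minimal sketch of that virtuous cycle, with placeholder functions standing in for real generation, quality filtering, and training; the point is only the shape of the loop: generate, filter, mix back in with real data, retrain, repeat.

```python
import random

def generate_synthetic(model, n=1_000):
    """Placeholder generator -- stands in for sampling from a trained model."""
    return [f"synthetic_example_{random.random():.6f}" for _ in range(n)]

def quality_score(example):
    """Placeholder realism check -- in practice a learned filter or artifact
    detector; here it returns a random score just so the loop runs."""
    return random.random()

def train(model, dataset):
    """Placeholder training step -- returns the 'model' unchanged here."""
    return model

def bootstrap(real_data, model, generations=3, keep_threshold=0.8):
    data = list(real_data)
    for _ in range(generations):
        synthetic = generate_synthetic(model)
        kept = [x for x in synthetic if quality_score(x) >= keep_threshold]
        data = list(real_data) + kept   # mix filtered synthetic data with real data
        model = train(model, data)      # a better model can generate better data next round
    return model
```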

Now, that said, it does mean that if the synthetic data has a ceiling in quality, eventually you would reach a ceiling. Now, the bet that many folks are making is that we can get past the ceiling with synthetic data, which I think there is some reasonable evidence to suggest we may be able to. But we haven't yet reached that point, and we'll have to see when we get there. Kalyan, you're leaning in here. Go ahead. Yeah.

So I think to be able to generate synthetic data, sometimes those rare examples that we find, we would use that to create more of them in the adjacent neighborhoods. And then once we create more of them, sometimes we do verification or engineer them. Like sometimes even we will go back to humans and verify those examples to see if they make sense and generate that. So there's an ability for us to engineer synthetic examples that can give us those

new situations. And also, I wanted to add, like the example that you gave of the autonomous car and having a dumpster, when such situations happen, there is a recording of that data that is fed back. And then we can use that data to create even more situations. So we'll just move the dumpster around or we'll do more sort of creation of synthetic data examples in that neighborhood, in that neighborhood of that example. So in a way, we are able to create more novel scenarios, even though...

We may not have that many to begin with. Yeah. I want to go back to the example that you gave earlier about fraud detection in the financial world because I think it's really important when you said, look, the idea of using simulations essentially is a longstanding practice in technology development. I mean, decades and decades and decades old. But why?

To your point, what we're asking or want to ask AI to do is really different than, let's say, training a fighter pilot in a simulator, right? Because we're going to eventually, we are asking these, even now, these machines to make decisions for us that are in many ways removing the human element, okay? And the reason why I say this is, is this a world in which maybe a possible good way to train AI on real-world

financial fraud data is to say, well, anything that doesn't match these known non-fraudulent acts, okay, should be flagged? Meaning just, like, program the AI to have a lot of false positives instead of trying to predict what new kinds of fraud could be. Does that make sense? Yeah. Yeah. I think we can program the AI to say, you know,

flag a lot of examples that are non-fraudulent, that we know are non-fraudulent, but that we still think are very close to the patterns of the fraud. So those examples, we actually do that. We actually find examples that are very close to the fraud, but we know that they are non-fraudulent. So as a result, what we are seeing is we found out how people are bypassing our checks and balances, right? Because the fraudulent examples are very close to the non-fraud.

and use that to create sort of synthetic data. Okay, we'll have more in just a second. This is On Point. Support for AI coverage in On Point comes from MathWorks, creator of MATLAB and Simulink software for technical computing and model-based design. MathWorks, accelerating the pace of discovery in engineering and science. Learn more at mathworks.com.

and from Olin College of Engineering, committed to introducing students to the ethical implications of artificial intelligence in engineering through classes like AI and Society, olin.edu.

Craftsman days are here at Lowe's with big savings on the tools you need. Save $100 on the Craftsman V26 Tool Power Tool Combo Kit, now at $199. No matter what the project is, Craftsman's high-quality, high-performance products empower you to build on. Stop by your nearest Lowe's store and check out the full line of Craftsman tools today. Valid through 618 while supplies last. Selection varies by location.

Running a business comes with a lot of what-ifs. But luckily, there's a simple answer to them. Shopify. It's the commerce platform behind millions of businesses, including Thrive Cosmetics and Momofuku. And it'll help you with everything you need. From website design and marketing to boosting sales and expanding operations, Shopify can get the job done and make your dream a reality. Turn those what-ifs into... Sign up for your $1 per month trial at shopify.com slash special offer.

Before we return to our conversation about the need, potentially, of synthetic data in training AI models, I want to give you a quick heads up on another AI-related show that we're working on for later this week, and that has to do with artificial intelligence analytics.

And job applications and the job search or how AI is being used by companies when they're screening job candidates. And then also on the flip side of that, how some job seekers are using AI to get noticed by recruiters. So if you've been job hunting recently, have you encountered AI?

AI in your job search, or maybe think that you might have? Maybe that rejection came really, really quickly. Have you maybe received that rejection because you think some keywords are missing from your resume? Have you tweaked your resume in order to get past that AI gatekeeper? Here's another thing. Have you maybe potentially been interviewed

by an AI system? And if you are on the recruiter side of things, are you using artificial intelligence to help you find the right candidates?

We absolutely want to know how AI is having an impact on the world of job hunting. So grab your phone and get the On Point VoxPop app wherever you get your apps, because that way you can send us a very high quality message, or you can give us a call at 617-353-0683. So we want to hear your AI and job search stories for later this week. Again, that topic is AI and your job search.

Today, I'm joined by Ari Morcos. He's co-founder and CEO of Datology AI. He's in San Jose, California. And here with me in the On Point studio is Kalyan Veeramachaneni. He's co-founder and CEO of DataCebo. And I now want to, gentlemen, I want to dig in much more deeply into really the potential downsides because I'm

I was highly skeptical coming into this hour about the need for synthetic data, but I'm relaxing that skepticism a little bit. But nevertheless, Ari, you had mentioned some words a little earlier, like brittle.

And so to that point, let's listen to Rich Baraniuk, who's a professor of electrical and computer engineering at Rice University in Houston, Texas. And he and his team have been running experiments to see what happens when you train a new AI model using a combination of real world data and synthetic data created by other generative AI models. For example, he's asking models to produce realistic images of

human faces. Okay. And the result, he says, sometimes literally is not pretty.

If your generative model creates even imperceptible artifacts in the output, maybe there's a little bit of a distortion in the picture. Well, then as you continue this process over subsequent generations, those artifacts are going to be increasingly amplified. Okay, so what he found is that the models trained on synthetic data were at the beginning producing realistic human faces. But then as the training continued on those images...

later outputs would have very strange patterns appearing on the faces. So Ari, the way I read this is there's a high risk of, to put it bluntly, error amplification in using synthetic data. Is that so?

Yeah, I think that's right. I mean, like all things in machine learning, there are far more ways to do synthetic data incorrectly than there are to do it correctly. It's much easier to mess it up than it is to get it correct. And I think if you just naively have a model generate synthetic data, feed that into a new model, have that model generate synthetic data, feed that into a new model repeatedly, you are absolutely going to get the sort of terrible artifacts that Rich is describing.

I think the way around that is that every time you generate the synthetic data, you then filter it very aggressively. So you then say, what's the synthetic data that came out of the model that's actually realistic? Let's keep that. What's the synthetic data that came out of the model that's a bit weird? Let's remove that. And I think this also dovetails into some of what Kalyan was saying earlier. I think there are two ways you can approach synthetic data at a philosophical level. One is, let me generate completely novel data that I've never seen before.

that's really hard and you're likely to make mistakes that are gonna propagate when you do that. The other way is instead to say, let me take an example, like a fraud example that I've seen already or an outlier self-driving car case that I've seen already and then let me just tweak it a little bit. Let me make it so that it looks a little bit different as if it was another presentation of the same sort of error.
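As a rough illustration of that "tweak an example you've already seen" approach, here is a sketch that perturbs a single rare fraud-style record into nearby synthetic variants. The field names and jitter sizes are invented for the example; a real system would perturb far more carefully and validate the results.

```python
import random

def perturb_example(example, n_variants=5):
    """Create nearby synthetic variants of one real, rare example."""
    variants = []
    for _ in range(n_variants):
        v = dict(example)
        v["amount"] = round(example["amount"] * random.uniform(0.8, 1.2), 2)
        v["hour_of_day"] = (example["hour_of_day"] + random.randint(-2, 2)) % 24
        variants.append(v)
    return variants

rare_fraud_case = {"amount": 912.50, "hour_of_day": 3, "merchant": "unknown_vendor"}
synthetic_neighbors = perturb_example(rare_fraud_case)
```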

That sort of tweaking is a lot easier to do and is a lot less risky. So I think we're first going to see that sort of synthetic data. And that's what we do at Datology a lot of times. We'll take documents, for example, and we'll rephrase them into different formats so that the model can kind of understand them, you know, as they're

presented in different ways. And that form of synthetic data, I think, is a lot easier to get right, a lot harder to mess up. When you start going to, I want to build an entirely, you know, imagine an entirely new type of scenario that might go wrong, that's where you're more likely to start having these errors. Okay, well, Kalyan, let me push on this a little bit, though, because another term I've heard kicked around here is model collapse.

Right. Because if these tiny errors or artifacts do get amplified in the way that it seems to me inevitably could happen, right? Because we're talking about billions and billions and billions of iterations in the training of the model. Yeah.

I'm not entirely convinced that we should put that aside as a concern. Yes. Yes, we shouldn't put that aside as a concern. It is an important concern. And as Ari pointed out, like a lot of the synthetic data, while it's generated by AI, we are there as part of the process to include it in the next model training as engineers, right? So we watch how those training examples have some artifacts or how they're fed into the model. Is the model collapsing? We have measures to check that.

So that's a lot of engineering that goes behind putting these synthetic data into training the models and checking how the training is going. The second thing I'd also push back on is, I mean, after the model is trained, there's a lot of checks and balances before that model is deployed. So at DataCebo, we do that all the time. Any software or model that we deploy in the real world, there's a lot of automatic checks and balances that we do. So, you know, to Richard's point, the professor from Rice, I mean, one of the things that

One of the checks is what he used to detect the artifacts. If you can imagine, we wouldn't deploy such a model. I mean, he had a check, you know, whether it was visual or automatic. One of the things that we do now is implement a lot of automatic checks, because we don't want to depend on humans. So we check those after the model is trained. We do a lot of checks to make sure the model is performant

and is not producing weird artifacts like that. Well, let's listen to a little bit more of what Professor Baraniuk had to say because he did sort of...

offer a kind of a caution in terms of the use of synthetic data, because he says there is an important question right now that's still left unresolved. One of the big problems that we have is that there's such a limited understanding of this phenomenon. We're still early days in trying to provide authoritative guidance on how much synthetic data is okay and how much isn't okay. So that's an area that we really need to advance.

What do you think, Kalyan? How much is okay? How much isn't okay? I mean, is it case dependent? It is case dependent. It is very case dependent and use case dependent. Yeah. So again, the proportion is a parameter that we fine-tune as engineers and developers when using synthetic data. Okay. Use case dependent. So Ari, let me turn to you on this, and I want to hear from both of you on this.

Again, from the public's point of view, AI is a very powerful and awesome tool, but it's also already problematic. I mean, we've done shows, in the way that we can here at On Point, where we focus on health care law and AI. And in the ways that, depending on the question that you ask the AI to do, or what you're asking the AI to look for in, let's say, approving or disapproving insurance claims, you

It's, you know, summarily rejecting people who actually deserve to have their claims fulfilled. And it's very, very hard in real time to catch those errors. OK, so, I mean, Ari, wouldn't the use of synthetic data potentially make that problem even worse? Yeah.

I think it could go either way. I think it depends on how you use it. Again, if you use it well, it could make the problem much better. If you use it poorly, it can absolutely make it worse, which is why you need to have verification audits on these systems and why you have to be very careful that you're putting in data that's actually going to be representative as well. I think this also gets to Rich's point that

This is still a frontier research problem, not just synthetic data, data research in general. There are a whole bunch of cultural reasons why data research has largely been overlooked by the machine learning community relative to things like architectures or other areas of AI research.

And there's a lot more for us to understand here. And that's actually a lot of why we created Datology was to do this research and then make it so that when we work with folks who want to train models, they get a really good use of synthetic data and real data that's not going to result in these sorts of errors. And for example, we found that going beyond half of the data being synthetic pretty quickly causes issues. So we usually will cap it out at about 50% synthetic data.
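The 50% figure is Ari's; the function below is just one illustrative way to enforce that kind of cap when assembling a training mix, so that real data always makes up at least half.

```python
import random

def mix_datasets(real, synthetic, max_synthetic_fraction=0.5):
    """Combine real and synthetic examples, capping the synthetic share."""
    max_synth = int(len(real) * max_synthetic_fraction / (1 - max_synthetic_fraction))
    synth_used = random.sample(synthetic, min(len(synthetic), max_synth))
    return real + synth_used
```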

Mm. Kalyan, go ahead. Yeah, I wanted to add to your example. I think I'm holding in my hands a paper called Single Word Change is All You Need. It's one of the papers we wrote, about a classifier that decides whether, for example, to give a loan or not. And all you have to change is one word in a sentence, and it will just reject it.

And there is no change in the meaning. There is no change in the sentence structure, nothing. It's just one word that made that classifier very fragile. And that classifier was not trained on synthetic data at all. It was trained on the real data. So one of the things that we now do in academic research community as well as in business is that we try to create examples that will break a classifier.

that is trying to decide whether to give a loan or not. And they call them adversarial examples. So basically, you create an example that should go through the classifier, right? And that should get a positive result.

But just because you changed a word or maybe even put a comma at a wrong place, it's rejecting. So now when we create such examples, we retrain the classifier or the model to make it better. And as a result, in doing so, what we're doing is we're essentially using synthetic training examples to make the model better, right? Because we took the examples that should pass, we tweaked them a little bit, see how fragile the model is, and use that data again to retrain the classifier so that it becomes more robust, right?

And so this is a very ongoing, very popular field of research called robustness of these models of how to make them more robust by tweaking parameters and creating synthetic examples to train them better. So you can use it to address exactly the problem that you're seeing where one word changed everything.
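In the spirit of what Kalyan describes (this is a simplified illustration, not the actual algorithm from the paper he mentions), a robustness loop looks roughly like this: probe the classifier with one-word substitutions, collect the variants that flip an approval into a rejection even though the meaning has not changed, and fold those adversarial examples back into training.

```python
def one_word_variants(sentence, substitutes):
    """Yield copies of the sentence with a single word swapped for a synonym."""
    words = sentence.split()
    for i, word in enumerate(words):
        for sub in substitutes.get(word, []):
            yield " ".join(words[:i] + [sub] + words[i + 1:])

def find_adversarial(classifier, sentence, substitutes, expected="approve"):
    """Return variants whose prediction flips even though the meaning shouldn't."""
    return [v for v in one_word_variants(sentence, substitutes)
            if classifier(v) != expected]

# Retraining on (adversarial_variant, expected_label) pairs is what makes the
# classifier stop being sensitive to a single harmless word swap.
```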

That is really interesting. But I'm also afraid I'm taking the wrong lesson from your example, Kalyan, which is the lesson that's like screaming in my head is, wow, a lot of this can seem very arbitrary.

Do you know what I mean? No, I'm serious because, again, from the normal human perspective, if we are in a world now where these AI tools, in certain examples like you're giving, having a comma in the wrong place that we have to test for that and the outcomes for that before we unleash the tool out into the real world, again, just speaking –

purely from the point of view of, like, what we already know about how businesses operate, can we trust the industry or industries who are developing these models, I mean, you two are willing to talk to me, many people aren't, to be that robust in their testing while they're developing, before the models are unleashed? Um,

I mean, they will see that as a result in the business metrics as well. At least we hope so. For example, if it's a fraud detection thing and you're producing a lot of false positives, a lot of rejection of transactions, they will see that in customer satisfaction. They'll see that a lot in the results that they're seeing at the end of the day.

Where it becomes tricky is when your result is not immediately observable for long periods of time. So healthcare is a tricky place where, if you start deploying them, you have to be extremely careful, because you won't see the effect for a long time.

Things where there is an immediate measurement available, and businesses already have age-old practices to measure the outcomes, customer satisfaction, number of false positives, you know, things that are black and white, you can just measure them, it's easy to test and it's easy to deploy. So I agree with you. I think it's very important for areas where we don't know

and where we can't measure the outcomes that quickly. And it takes time. Okay. So, Ari, I want to actually circle back to roughly where we started. Because there is a whole different way of thinking about this, right? Which is if you parse out the high-quality data from the vastly large data sets that we have, right? Train a model on it.

see what the model's doing right and what it's doing wrong, tweak the model, and then train it again on that same real data, why isn't that good enough?

I mean, I think that will get us pretty far. But the challenge is at a certain point, it will be challenging to find enough of that high quality data. Although I think, again, if we can get access to what's the private data that's present and use that for particular use cases, that can do a lot. But ultimately, the data is everything for models. One of my favorite catchphrases is models are what they eat.

If you show them really high quality data, they're going to be high quality. If you show them low quality data, they're going to be low quality. In order for us to solve this problem, it's going to require bringing all of our solutions and all of the tools in our toolkit to bear. We're going to have to do a lot on data curation of real data to enrich that and make that higher quality. And then use that higher quality real data as the guide to generate more high quality synthetic data as well. And then combine the two of them to

massively improve the data efficiency of our models. So how quickly they can train, what performance they can reach, the reliability of our models. This is the number one problem with AI models in the real world, is that they're not reliable enough. And also the cost of actually deploying these models, which is another huge factor, that, you know, running these models is quite expensive. And as, of course, these AI products get more and more users, we're going to spend more and more data center compute costs

on running these models. And when you use better data, you can get smaller models that are just as good, which means you can both save compute costs, which both saves financial costs, but also saves the environmental costs of training these models. So we're going to have to take all of these tools in our toolkit and bring them to bear in order to solve these problems. But I'm quite optimistic. I don't think we're going to completely... I think when we say we're running out of data, we're being a bit hyperbolic.

There's a lot more we can do with our existing data. Okay, so we have less than 30 seconds left. I want to ask you a tweak on the same question that I asked Kalyan, because ultimately my interest is in trying to have conversations where we get to a place where we understand what can we do as this technology is being developed to minimize the harm that may happen, right? So that people don't get hurt in the ways that we've described can already happen with AI tools. So regarding synthetic data, Ari,

What do you think should the industry do? What should regulators do to try to minimize negative outcomes? Let's put it that way. I think ultimately we have to test and measure. You have to have a reliable testing framework. When we deploy a model, we come up with clear evaluation suites to understand how they're performing and where their harms are. And then also make sure we look at the real harms, like bias and claims denials and things like that, that are actually going to affect real people in the near term.

Well, Ari Morcos, co-founder and CEO of Datology AI in San Jose, California, thank you so much for joining us today. Thank you for having me. And Kalyan Veeramachaneni here in the On Point studio, co-founder and CEO of DataCebo. Thank you so much for being with us. Thank you. Thank you for having me. I'm Meghna Chakrabarti. This is On Point.

Support for this podcast comes from Is Business Broken? A podcast from BU Questrom School of Business. How should companies balance short-term pressures with long-term interests? In the relentless pursuit of profits in the present, are we sacrificing the future? These are questions posed at a recent panel hosted by BU Questrom School of Business. The full conversation is available on the Is Business Broken podcast. Listen on for a preview.

Just in your mind, what is short-termism? If there's a picture in the dictionary, what's the picture? I'll start with one ugly one. When I was still doing activism as global head of activism and defense, so banker defending corporations, I worked with Toshiba in Japan. And those guys had five different activists, each one of which had a very different idea of what they should do right now, like short-term.

Very different perspectives. And unfortunately, under pressure from the shareholders, the company had to go through two different rounds of breaking itself up, selling itself and going for shareholder votes. I mean, that company was effectively broken because the leadership had to yield under the pressure of shareholders who couldn't even agree on what's needed in the short term. So to me, that is when this behavioral problem, you're under pressure and you can't think long term, becomes a real problem.

A real disaster. Tony, you didn't have a board like that. I mean, the obvious ones, I mean, you look at there's quarterly earnings. We all know that you have businesses that will do everything they can to make a quarterly earning, right? And then we'll get into analysts and what causes that. I'm not even going to go there. But there's also, there's a lot of pressure on businesses to, if you've got a portfolio of businesses, sell off an element of that portfolio. And as a manager, you say, wait, this is a really good business. Might be down this year, might be, but it's a great business.

Another one is R&D spending. You can cut your R&D spend if you want to, and you can make your numbers for a year or two, but we all know where that's going to lead a company. And you can see those decisions every day, and you can see businesses that don't make that sacrifice. And I think in the long term, they win.

Andy, I'm going to turn to you. Maybe you want to give an example of people complaining about short-termism that you think isn't. I don't really believe it exists. I mean, you know, again, I don't really even understand what it is. But what I hear is we take some stories and then we impose on them this idea that had they behaved differently, thought about the long term, they would have behaved differently. That's not really science.

Find the full episode by searching for Is Business Broken wherever you get your podcasts and learn more about the Mehrotra Institute for Business, Markets and Society at ibms.bu.edu.