The race to create AI applications is creating demand for training data in China

2025/6/29

All Things Considered

People

Henry Chen

Huang Rui

Olga Megorskaya

Rogier Creemers

Topics

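Henry Chen: As the founder of Sapien AI, I see enormous demand in China for high-quality training data, especially since the emergence of homegrown AI models such as DeepSeek. Our company collects, labels, and organizes data around the world to support a variety of AI applications. To meet the needs of domestic AI models, we have to process data inside China and make sure it does not leave the country. The government has given us a lot of help, including low-interest loans and flexible office space, which has helped us grow the business.

Rogier Creemers: I think data is regarded in China as an important economic input, similar to raw materials. China hopes to reap economic benefits by developing industries of the future such as AI. Local governments, such as Shenyang's, are actively recruiting AI data processing companies to drive economic transformation and development.

Olga Megorskaya: I compare early AI models to small children who learn from simple pictures, while more advanced AI models are like university students who need to read large amounts of complex material. AI models therefore have to keep absorbing more advanced datasets in order to improve.

Huang Rui: As a data quality specialist, I think data processing work is well suited to detail-oriented people. The work can be somewhat tedious, but it is crucial to AI innovation.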

The race to create more powerful artificial intelligence applications has also created a huge demand for high-quality training data and competition over who gets to use that data. And a lot of that demand is in China, as NPR's Emily Feng and Aowen Cao report.

In this brand-spanking-new office building in northeastern China, rows and rows of people sit silently clicking at their computer screens.

This is the fuel that powers so much of generative AI: raw data. And this data processing center is the brainchild of this man. "My name is Henry, Henry Chen." He's the founder of Sapien AI. The company hires people around the world to collect data and tag and organize it so it can be used to train a variety of artificial intelligence applications.

"China is a big market, especially after DeepSeek came out." DeepSeek is the Chinese chatbot that performs on par with American-trained chatbots but was trained at a fraction of the cost.

That demand for data is why Chen's company now has about 60 employees in China labeling maps of Chinese streets. The data being labeled today is being used to train an autonomous driving program. "It looks very abstract." That's NPR producer Aowen Cao. "I see people working in front of computers, but on the computer screens there are black backgrounds with..."

"Squares." "Squares and green dots." It almost looks like, Aowen says, laughing, the television show "Severance." The data may look abstract, but it's a valuable commodity, says Rogier Creemers. He's a professor at Leiden University in the Netherlands who studies China's digital technology policies. "They believe that data is an economic input."

"And in a way, they see it as akin, in that sense, to raw materials." Chatbots today, like ChatGPT, need literally trillions of data points to get up to speed. And who owns that data has increasingly become a competition between companies and between countries like the U.S. and China. Each wants an edge over the other in AI, and that means hoarding data.

Data is such a choke point that since last year, China's cyberspace regulators have had to approve any bulk export of data out of the country, which is in part why Sapien AI, a Canadian company, is in China to begin with. "For the AI models that are trained here, the data needs to be processed in the country and cannot leave the country," Chen says. The race to create and protect data is also intensifying because the data AI companies want is getting more complicated.

Olga Megorskaya, the founder of an Amsterdam-registered data processing company called Toloka, now specializes in creating datasets for highly technical scientific and engineering fields. She uses an analogy that compares early AI models to human toddlers. "The person is like two years old. He or she is taught by kids' books with very bright pictures."

And more advanced AI models are like university students. "When she goes to the university, there are dozens of textbooks that she needs to read." For an AI model, that means gobbling up more and more advanced datasets. The data industry is crucial enough that local governments in China, once dependent on dying industries like steelmaking and coal mining, are actively recruiting AI data processing companies.

Here's Creemers at Leiden University again. "China wants to make a large amount of money through developing the industries of the future." The Rust Belt city of Shenyang, where Sapien AI chose to locate one of its offices, is one of seven Chinese cities that say they want to become an AI data hub. The city offers low interest rates on loans and flexible, affordable office space.

Here's Chen again at Sapien AI, which benefited from this help. "So they give us a lot of help as well. So we find a really good environment to set up the office here." Data processing also employs a lot of young people. China's economy never fully recovered from the global coronavirus pandemic, and youth unemployment has concerned policymakers enough that they briefly stopped publishing that statistic.

One of the young people working at Sapien AI is Huang Rui, age 21. She's a data quality specialist. She says the work of data processing is suitable for people with obsessive-compulsive tendencies because it requires a high level of attention to detail.

Data processing is admittedly not the most exciting work, says Chen, her boss. "Just picture yourself sitting at a desk and try to draw bounding boxes around cars for 40 hours a week." But sometimes innovation requires someone, actually a whole lot of people, to do the boring work. Emily Feng, NPR News.
