The race to create more powerful artificial intelligence applications has also created a huge demand for high-quality training data and competition over who gets to use that data. And a lot of that demand is in China. NPR's Emily Feng and Aowen Cao report. In this brand-spanking-new office building in northeastern China, rows and rows of people sit silently clicking at their computer screens.
This is the fuel that powers so much of generative AI, raw data. And this data processing center is the brainchild of this man. My name is Henry, Henry Chen. He's the founder of Sapien AI. It hires people around the world to collect data and tag and organize it so it can be used to train a variety of artificial intelligence applications.
China is a big market. Especially after DeepSeek came out. DeepSeek, the Chinese chatbot performing on par with American-trained chatbots, but trained at a fraction of the cost.
That demand for data is why Chen's company now has about 60 employees in China labeling maps of Chinese streets. This data today is being used to train an autonomous driving program. It looks very abstract. That's NPR producer Aowen Cao. I see people working in front of computers, but on the computer screens there are black backgrounds with...
Squares. Squares and green dots. It almost looks like, Aowen says, laughing, the television show Severance. The data may look abstract, but it's a valuable commodity, says Rogier Creemers. He's a professor at Leiden University in the Netherlands who studies China's digital technology policies. They believe that data is an economic input.
And in a way, they see it as akin in that sense to raw materials. Chatbots today, like ChatGPT, need literally trillions of data points to get up to speed. And who owns that data has increasingly been a competition between companies and between countries like the U.S. and China. Each wants an edge over the other in AI, and that means hoarding data.
Data is such a choke point that since last year, China's cyberspace regulators have to approve any bulk export of data out of the country, which is in part why Sapien AI, a Canadian company, is in China to begin with. For the AI models that are trained here, the data needs to be processed in the country and cannot leave the country. The race to create and protect data is also because the data AI companies want is getting more complicated.
Olga Megorskaya, the founder of an Amsterdam-registered data processing company called Toloka, now specializes in creating datasets for highly technical scientific and engineering fields. She uses an analogy that compares early AI models to human toddlers. The person is like two years old. He or she is taught by kids' books with very bright pictures.
And more advanced AI models are like university students. When she goes to the university, there are dozens of textbooks that she needs to read. For an AI model, that means gobbling up more and more advanced data sets. The data industry is crucial enough that local governments in China, once dependent on dying industries like steelmaking and coal mining, are actively recruiting AI data processing companies.
Here's Creemers at Leiden University again. China wants to make a large amount of money through developing the industries of the future. The Rust Belt city of Shenyang, where Sapien AI chose to locate one of its offices, is one of seven Chinese cities that say they want to become AI data hubs. The city offers low interest rates on loans and flexible and affordable office space.
Here's Chen again at Sapien AI. They benefited from this help. So they give us a lot of help as well. So we find a really good environment to set up the office here. Because data processing employs a lot of young people. China's economy never fully recovered from the global coronavirus pandemic. And youth unemployment has concerned policymakers enough that they briefly stopped publishing that statistic.
One of the young people working at Sapien AI is Huang Rui, age 21. She's a data quality specialist. She says the work of data processing is suitable for people with obsessive-compulsive tendencies because it requires a high level of attention to detail.
Data processing is admittedly not the most exciting work, says Chen, her boss. Just picture yourself sitting at a desk and try to draw bounding boxes around cars for 40 hours a week. But sometimes innovation requires someone, actually a whole lot of people, to do the boring work. Emily Feng, NPR News.