Inside the race to train AI robots how to act human in the real world

Now that artificial intelligence has mastered almost everything we do online, it needs help learning how we physically move around in the real world.

A growing global army of trainers is helping it escape our computers and enter our living rooms, offices and factories by teaching it how we move.

In an industrial town in southern India, Naveen Kumar, 28, stands at his desk and starts his job for the day: folding hand towels hundreds of times, as precisely as possible.

He doesn’t work at a hotel; he works for a startup that creates physical data used to train AI.

A robot practices for the 100-meter race before the opening ceremony of the World Humanoid Robot Games in Beijing in August.

(Ng Han Guan / Associated Press)

He mounts a GoPro camera to his forehead and follows a regimented list of hand movements to capture exact point-of-view footage of how a human folds.

That day, he had to pick up each towel from a basket on the right side of his desk using only his right hand, shake it straight with both hands, fold it neatly three times and place it in the left corner of the desk.

If it takes more than a minute or he misses any steps, he has to start over.
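The protocol amounts to a strict quality gate: wrong order, missed step or overrun, and the take is discarded. As a purely illustrative Python sketch (the article does not describe Objectways' actual tooling, and the step names below are invented), the rule could be expressed like this:

```python
# Toy quality gate for one recorded take, mirroring the rules described above:
# duration under a minute and every step present, in order. Illustrative only.
REQUIRED_STEPS = [
    "pick_right_hand",    # grab towel from basket, right hand only
    "shake_straight",     # straighten with both hands
    "fold", "fold", "fold",
    "place_left_corner",
]
MAX_SECONDS = 60.0

def take_is_valid(recorded_steps: list[str], duration_seconds: float) -> bool:
    """Reject the take (forcing a re-record) unless it follows the protocol."""
    return duration_seconds <= MAX_SECONDS and recorded_steps == REQUIRED_STEPS

print(take_is_valid(REQUIRED_STEPS, 48.2))  # True  -> keep the video
print(take_is_valid(REQUIRED_STEPS, 73.5))  # False -> start over
```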

His firm, a data labeling company called Objectways, sent 200 towel-folding videos to its client in the United States. The company has more than 2,000 employees; about half of them label sensor data from autonomous cars and robotics, and the rest work on generative AI.

Most of them are engineers, and few are well-practiced in folding towels, so they take turns doing the physical labor.

“Sometimes we have to delete nearly 150 or 200 videos because of silly errors in how we’re folding or placing items,” said Kumar, an engineering graduate who has worked at Objectways for six years.

The carefully choreographed movements are meant to capture all the nuances of how humans fold clothes: the arm reaching, the fingers gripping, the fabric sliding.

The captured videos are then annotated by Kumar and his team. They draw boxes around the different parts of each video, tag the towels, label whether the arm moved left or right and classify each gesture.
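The article doesn't describe Objectways' annotation format, but a hypothetical record like the Python sketch below conveys the kinds of fields such labels typically carry: bounding boxes, object tags, arm direction and a gesture class. All names are invented for illustration.

```python
# A hypothetical annotation record for one video frame, loosely based on the
# steps the article describes. Field names are illustrative, not Objectways'
# actual schema.
from dataclasses import dataclass, field

@dataclass
class BoundingBox:
    x: int        # left edge, in pixels
    y: int        # top edge, in pixels
    width: int
    height: int
    label: str    # e.g. "towel", "right_hand"

@dataclass
class FrameAnnotation:
    video_id: str
    frame_index: int
    boxes: list[BoundingBox] = field(default_factory=list)
    arm_direction: str = "none"   # "left", "right" or "none"
    gesture: str = "idle"         # e.g. "reach", "grip", "fold", "place"

# One annotated frame: the right hand gripping a towel mid-fold.
frame = FrameAnnotation(
    video_id="towel_fold_0042",
    frame_index=118,
    boxes=[
        BoundingBox(x=410, y=220, width=180, height=140, label="towel"),
        BoundingBox(x=520, y=300, width=90, height=110, label="right_hand"),
    ],
    arm_direction="right",
    gesture="fold",
)
```

A consistent schema like this is what lets thousands of clips from different annotators be merged into a single training set.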

Kumar and his colleagues in the town of Karur, which is about 300 miles south of Bengaluru, are an unlikely batch of tutors for the next generation of AI-powered robots.

“Companies are building foundation models fit for the physical world,” said Ulrik Stig Hansen, co-founder of Encord, a data management platform in San Francisco that contracts with Objectways to collect human demonstration data. “There’s this huge resurgence in robotics.”

Encord works with robotics companies such as Jeff Bezos-backed Physical Intelligence and Dyna Robotics.

Tesla, Boston Dynamics and Nvidia are among the U.S. leaders in the race to develop the next generation of robots. Tesla already uses its Optimus robots, which often appear to be remotely controlled, at company events. Google has its own AI models for robotics, and OpenAI is beefing up its robotics ambitions.

Nvidia projects the humanoid robot market could reach $38 billion over the next decade.

There are also many lesser-known companies trying to provide the hardware, software and data to make a mass-produced, multitasking humanoid robot a reality.


Robots are displayed at Nvidia’s booth during the China International Supply Chain Expo in Beijing in July.

(Mahesh Kumar A. / Associated Press)

Large language models that power chatbots such as ChatGPT have mastered language, images, music, coding and other skills by hoovering up everything online. They use the entire internet to figure out how things are connected and to mimic how we do things, such as answering questions and creating photorealistic videos.

Data on how the physical world works — how much force is required to fold a napkin, for example — is harder to get and translate into something AI can use.

As robotics improves and combines with AI that knows how to move in the physical world, it could bring more robots into the workplace and the home. While many fear this could lead to job losses, optimists think advanced robots would free humans from tedious work, lower labor costs and eventually give people more time to relax or focus on more interesting and important work.

Many companies have entered the fray as shovel sellers in the AI gold rush, seeing an opportunity to gather data for what is being called physical AI.

One group of companies is teaching AI how to act in the real world by having humans guide robots remotely.

Ali Ansari, founder of San Francisco-based Micro1, said emerging robotics data collection increasingly focuses on teleoperation: humans with controllers guide a robot through a task, such as picking up a cup or making tea. The AI is then fed videos of the successful and failed attempts and learns to perform the task itself.
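The article doesn't say which learning method these companies use, but a common way to learn from such demonstrations is behavior cloning: train a neural network to map camera frames to the operator's recorded actions. Below is a minimal, illustrative PyTorch sketch in which the dataset, network architecture and action dimensions are all assumptions.

```python
# Minimal behavior-cloning sketch: learn to imitate teleoperator actions from
# recorded frames. Shapes, data and network are illustrative; real robotics
# stacks are far more elaborate.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, action_dim: int = 7):  # e.g. a 7-DoF arm command
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, action_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)  # predicted action for each frame

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Stand-in batch: 8 RGB camera frames and the operator's recorded actions.
frames = torch.randn(8, 3, 96, 96)
expert_actions = torch.randn(8, 7)

for step in range(100):
    predicted = policy(frames)
    loss = loss_fn(predicted, expert_actions)  # match the human operator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Failed attempts can be filtered out or down-weighted before training, which is one reason the labeling of successes and failures described above matters.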

The remote-control training can happen in the same room as the robots or with the controller in a different country. Encord’s Hansen said that there are warehouses planned in Eastern Europe where large teams of operators will sit with joysticks, guiding robots across the world.

More of these facilities, which some have dubbed “arm farms,” are popping up as demand increases, said Mohammad Musa, founder of Deepen AI, a data annotation firm headquartered in California.

“Today, a mix of real and synthetic data is being used, gathered from human demonstrations, teleoperation sessions and staged environments,” he said. “Much of this work still occurs outside the West, but automation and simulation are reducing that dependency over time.”

Some have criticized teleoperated humanoids as more sizzle than substance: they can look impressive when humans are controlling them but remain far from fully autonomous.

Ansari’s Micro1 also does what it calls human data capture, paying people in Brazil, Argentina, India and the United States to wear smart glasses that record their everyday actions.

San Jose-based Figure AI partnered with real estate giant Brookfield to capture footage from inside 100,000 homes. It will collect data about human movement to teach humanoid robots how to move through human spaces. The company said it will spend much of the $1 billion it raised to collect first-person human data.

Meta-backed Scale AI has collected 100,000 hours of similar training footage for robotics through a prototype laboratory it set up in San Francisco.

Still, training bots isn’t always easy.

Twenty-year-old Dev Mandal created a company in Bengaluru, hoping to cash in on the need for physical data to train AI. He offered India’s inexpensive labor to capture movements. After advertising his services, he got requests to help train a robotic arm to cook food as well as a robot to plug and unplug cables in data centers.

But he gave up the business because potential clients needed the movement data collected in very specific ways, making it hard to turn a profit even with India’s inexpensive labor. One client, for example, wanted a particular robot arm used on a certain kind of table lit with purple lights.

“Everything, down to the color of the table, had to be specified by them,” he said. “And they said that this has to be the exact color.”

Still, there’s lots of work for the towel folders of Karur.

Their boss, Objectways founder Ravi Shankar, says that in recent months, his firm has captured and annotated footage of robotic arms folding cardboard boxes and T-shirts and picking out certain colored objects on a table.

It recently started annotating videos from more advanced humanoid robots, helping train them to sort a mix of towels and clothes, fold them and place them in different corners of a table. His team had to annotate 15,000 videos of the robots doing these jobs.

“Sometimes the robot’s arms throw the clothes and won’t fold properly. Sometimes it scatters the stack,” said Kavin, 27, an Objectways employee who goes by one name, adding that the robots are learning quickly. “In five or 10 years, they’ll be able to do all the jobs and there will be none left for us.”
