August 1, 2025
5 min read
Matthew S. Smith
Meta’s $14.3 billion bet on Scale AI highlights the critical role of data labeling in advancing agentic AI and synthetic-data training.
Data Labeling Is the Hot New Thing in AI
Earlier this summer, Meta made a US $14.3 billion bet on a company most people had never heard of: Scale AI. The deal, which gave Meta a 49 percent stake, sent Meta’s competitors, including OpenAI and Google, scrambling to exit their contracts with Scale AI for fear it might give Meta insight into how they train and fine-tune their AI models.

Scale AI is a leader in data labeling for AI models. It’s an industry that, at its core, does what it says on the tin. The most basic example is the pair of thumbs-up and thumbs-down icons you’ve likely seen if you’ve ever used ChatGPT: one labels a reply as positive, the other as negative. But as AI models grow, both in size and popularity, this seemingly simple task has become a beast that every organization looking to train or tune a model must manage.

“The vast majority of compute is used on pre-training data that’s of poor quality,” says Sara Hooker, a vice president of research at Cohere Labs. “We need to mitigate that, to improve it, applying super high-quality gold dust data in post-training.”
What Is Data Labeling?
Computer scientists have long relied on the axiom “garbage in, garbage out”: bad inputs lead to bad outputs. As Hooker suggests, however, the training of modern AI models defies that axiom. Large language models are trained on raw text scraped from the public Internet, much of which is of low quality (Reddit posts tend to outnumber academic papers). Cleaning and sorting training data makes sense in theory, but with modern models training on petabytes of data, it’s impractical in practice.

That’s a problem, because popular AI training datasets are known to include racist, sexist, and criminal content. Training data can also harbor subtler issues, like sarcastic or purposefully misleading advice. Put simply: a lot of garbage finds its way into the training data.

Data labeling steps in to clean up the mess. Rather than trying to scrub every problematic element from the training data, human experts manually provide feedback on the AI model’s output after the model is trained. This molds the model, reducing undesirable replies and changing the model’s demeanor. Sajjad Abdoli, founding AI scientist at data labeling company Perle, describes this as a process of creating “golden benchmarks” to fine-tune AI models. What exactly a benchmark contains depends on the purpose of the model.

“We walk our customers through the procedure, and create the criteria for a quality assessment,” says Abdoli.

Consider a typical chatbot. Most companies want a chatbot that’s helpful, accurate, and concise, so data labelers provide feedback with those goals in mind. Human labelers read the replies the model generates for a set of test prompts. A reply that answers the prompt with concise, accurate information is labeled positive; a meandering reply that ends in an insult is labeled negative.
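To make that feedback loop concrete, here is a minimal sketch, in Python, of how thumbs-up and thumbs-down ratings might be collected into structured records for later fine-tuning. Every name in it (FeedbackRecord, is_positive, and so on) is a hypothetical illustration, not any labeling vendor’s actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical schema for a single piece of human feedback.
# Real labeling pipelines are far richer, but the core record
# looks something like this.
@dataclass
class FeedbackRecord:
    prompt: str            # the test prompt shown to the model
    response: str          # the model's generated reply
    label: int             # +1 = thumbs up, -1 = thumbs down
    rationale: str = ""    # optional note from the human labeler
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def is_positive(record: FeedbackRecord) -> bool:
    """A reply is 'positive' if a labeler judged it helpful and accurate."""
    return record.label > 0

# Usage: a labeler reviews one reply and files a judgment.
record = FeedbackRecord(
    prompt="What is the capital of Australia?",
    response="Canberra.",
    label=+1,
    rationale="Concise and correct.",
)

# Records like these accumulate into the kind of "golden benchmark"
# used to fine-tune a model after pre-training.
dataset = [record]
positives = [r for r in dataset if is_positive(r)]
print(f"{len(positives)} of {len(dataset)} replies rated positive")
```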
Not all AI models are meant to be chatbots, however, or focus on text. As a counterpoint, Abdoli described Perle’s work assisting a customer building a model to label images. Perle contracted human experts to meticulously label the objects in thousands of images, creating a standard that could be used to improve the model.

“We found a huge gap between what the human experts mentioned in an image, and what the machine learning model could recognize,” Abdoli says.
Why Meta Invested Billions in Scale AI
Data labeling is necessary to fine-tune any AI model, but that alone doesn’t explain why Meta was willing to invest over $14 billion in Scale AI. To understand that, we need to understand the AI industry’s latest obsession: agentic AI.

OpenAI’s CEO, Sam Altman, believes AI will make it possible for a single person to build a company worth $1 billion or more. To make that dream come true, though, AI companies need agentic AI models capable of complex, multi-step workflows that might span days or even weeks and involve numerous software tools. And it turns out that data labeling is a key ingredient in the agentic AI recipe.

“Take a universe where you have multiple agents interacting with each other,” said Jason Liang, a senior vice president at AI data labeling company SuperAnnotate. “Somebody will have to come in and review, did the agent call the right tool? Did it call the next agent properly?”

In fact, the problem is more complicated than it first appears, because it requires evaluating both specific actions and the AI agent’s overall plan. Several agents might call one another in sequence, for example, each for reasons that seem justifiable.
“But actually, the first agent could have just called the fourth one and skipped the two in the middle,” says Liang.
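A review step like the one Liang describes might look, in spirit, like the following sketch: a script applies human-written expectations to an agent’s recorded trajectory and flags wrong or unnecessary tool calls. All names here (Step, review_trajectory, the agent labels) are hypothetical; real agent-evaluation tooling varies widely.

```python
from dataclasses import dataclass

@dataclass
class Step:
    agent: str   # which agent acted
    tool: str    # which tool (or next agent) it called

def review_trajectory(trajectory: list[Step],
                      expected_plan: list[str]) -> list[str]:
    """Compare the tools an agent actually called against a
    human-written expected plan, flagging deviations and detours."""
    issues = []
    called = [s.tool for s in trajectory]
    for i, (got, want) in enumerate(zip(called, expected_plan)):
        if got != want:
            issues.append(f"step {i}: called {got!r}, expected {want!r}")
    if len(called) > len(expected_plan):
        # Liang's point: the agent may route through intermediaries
        # it could have skipped entirely.
        issues.append(f"{len(called) - len(expected_plan)} unnecessary call(s)")
    return issues

# Usage: agent_1 routed through two middlemen instead of
# calling agent_4 directly.
trajectory = [
    Step("agent_1", "agent_2"),
    Step("agent_2", "agent_3"),
    Step("agent_3", "agent_4"),
]
expected_plan = ["agent_4"]  # the direct route a reviewer would prefer
for issue in review_trajectory(trajectory, expected_plan):
    print(issue)
```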
Agentic AI also requires models that can solve problems in high-stakes fields, where an agent’s results could have life-or-death consequences. Perle’s Abdoli pointed to medicine as a leading example. An agentic AI doctor capable of accurate diagnosis, even in a single specialty or in limited circumstances, could prove immensely valuable. But creating such an agent, if it’s even possible, will push the data labeling industry to its limits.

“If you’re collecting medical notes, or data from CT scans, or data like that, you need to source physicians [to label and annotate the data]. And they’re quite expensive,” says Abdoli. “However, for these kinds of activities, the precision and quality of the data is the most important thing.”
Synthetic Data’s Impact on AI Training
However, if AI models require human experts to judge and improve them, where does that need end? Will we have teams of doctors labeling data in offices instead of practicing medicine? That’s where synthetic data steps in. Rather than relying entirely on human experts, data labeling companies often use AI models to generate training data for other AI models, essentially letting machines teach machines. Modern data labeling is often a mix of manual human feedback and automated AI teachers designed to reinforce desirable model behavior.

“You have a teacher, and your teacher, which in this case is just another deep neural network, is outputting an example,” says Cohere’s Hooker. “And then the student model is trained on that example.” The key, she notes, is to use a high-quality teacher, and to use multiple different AI “teachers” rather than relying on a single model. That guards against model collapse, in which a model trained on AI-generated data sees its output quality sharply degrade.

DeepSeek R1, the model from the Chinese company of the same name that made waves in January for how cheaply it was trained, is an extreme example of how synthetic data can work in practice. It achieved reasoning performance comparable to the best models from OpenAI, Anthropic, and Google without traditional human feedback. Instead, DeepSeek R1 started from “cold start” data consisting of a few thousand human-selected examples of chain-of-thought reasoning; after that, DeepSeek used rules-based rewards to reinforce the model’s reasoning behavior.
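A teacher-student setup along the lines Hooker describes might be sketched as follows: several teacher models each contribute synthetic examples, and the student trains on the pooled set rather than on any single teacher’s output. The generate_example and build_synthetic_dataset functions here are hypothetical stand-ins, not Cohere’s (or anyone’s) actual pipeline.

```python
import random

# Stand-in for querying a teacher model. In practice this would be
# an API call to a large, high-quality model.
def generate_example(teacher: str, prompt: str) -> dict:
    return {"teacher": teacher, "prompt": prompt,
            "completion": f"[{teacher}'s answer to: {prompt}]"}

def build_synthetic_dataset(teachers: list[str],
                            prompts: list[str]) -> list[dict]:
    """Pool examples from multiple teachers. Mixing teachers is one
    guard against model collapse: the student never overfits to a
    single model's quirks and errors."""
    dataset = []
    for prompt in prompts:
        teacher = random.choice(teachers)  # rotate teachers per prompt
        dataset.append(generate_example(teacher, prompt))
    return dataset

# Usage: three hypothetical teachers, a handful of prompts.
teachers = ["teacher_a", "teacher_b", "teacher_c"]
prompts = ["Summarize the water cycle.", "Explain binary search."]
synthetic = build_synthetic_dataset(teachers, prompts)
for ex in synthetic:
    print(ex["teacher"], "->", ex["prompt"])
# The student model would then be fine-tuned on `synthetic`
# instead of (or alongside) human-labeled data.
```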
However, SuperAnnotate’s Liang cautioned that synthetic data isn’t a silver bullet. While the AI industry is eager to automate wherever possible, using models for ever-more-complex tasks can reveal edge cases that only humans catch.

“As we’re starting to see enterprises putting models into production, they’re all coming to the realization, holy moly, I need to get humans into the mix,” he says.

That’s precisely why data labeling companies like Scale AI, Perle, and SuperAnnotate (among dozens of others) are enjoying the spotlight. The best method for tuning agentic AI models to tackle complicated or niche use cases, whether through human feedback, synthetic data, some combination, or techniques yet to be discovered, remains an open question. Meta’s $14 billion bet suggests the answer won’t come cheap.
Originally published at IEEE Spectrum on Fri, 01 Aug 2025.