The next AI breakthrough won’t come from bigger models, but from better data

Artificial intelligence does not advance at the same pace across industries. It presses forward in some directions while lagging behind in others.

Spend time with today’s most advanced AI applications, and this contrast becomes obvious. In software development, AI is quickly becoming ubiquitous. It writes production-ready code, explains obscure libraries, and iterates at a pace human teams have difficulty matching.

But place that same AI model inside a complex customer support workflow or ask it to reason through a nuanced clinical scenario, and the cracks begin to show. Multi-step reasoning falters. Context gets lost. Performance drops in ways that can seem inconsistent with the model’s strengths elsewhere.

These AI models are often similar. They run on similar hardware and are often trained in similar ways. So why the mismatch in performance across tasks? The simplest explanation is also the most overlooked: data.

Software engineering benefits from an immense, structured, and highly visible digital record. Code is written in standardized languages, benefits from robust documentation, is reviewed in public forums, and is discussed at scale. That ecosystem has generated a robust and massively useful pool of training material.

Other fields often do not. For example, healthcare data is scattered across institutions, wrapped in privacy constraints, expressed in multiple modalities, and rarely ready out-of-the-box for AI training. Enterprise workflows are captured in internal systems that were never designed for training AI. Multilingual speech data varies widely in quality and representation.

This imbalance creates what I describe as “the data gap.” This is the distance between what models are capable of in theory and what they can achieve in practice, because the right data does not yet exist in usable form. Closing this data gap may be the most important—and least glamorous—challenge in AI today.

Three forces are driving the recent advances in AI: the models, the chips, and the data.

On the AI models, the field has invested heavily. Major research organizations employ thousands of researchers and scientists who are actively refining architectures, training techniques, and evaluation methods. Breakthroughs are measured in benchmark scores, conference papers, and model performance on human tasks. On the computing chips, the investment has been equally intense. Hardware manufacturers and infrastructure providers are pouring billions of dollars into building and supporting data centers that deliver faster results via large-scale training.

Yet data has not received the same institutional focus for AI development. In conversations, researchers at frontier AI labs voice a similar frustration: today’s model capabilities in key use cases, such as healthcare, are limited less by architectural imagination than by the availability of high-quality, domain-specific data. The bottleneck is not always a lack of ideas, but a lack of reliable inputs.

We are long past the point where scraping the internet for useful data is enough, and that path does not scale. Progress depends on building and curating datasets that reflect the complexity of true lived experience and organizational processes. That work requires both scientific rigor and research specialization in the field of data for AI.

The history of AI reinforces a consistent lesson: major leaps in model capability follow major leaps in the availability of quality data. From early vision systems that relied on clearly labeled images to today’s language models trained on massive text collections, each major leap has depended on access to more high-quality data.

Architectural innovation alone is rarely enough. The value of these new approaches only emerges when paired with large, structured, and representative datasets that reveal what the models can actually do in practice. Whether in vision or language, progress has depended on the painstaking work of collecting, organizing, and validating the underlying data.

Large language models illustrate this clearly. Their emergence was not just the result of better training techniques, but of access to an unprecedented volume of data. The models did not generate that data. They relied on it. That pattern raises a pressing question for the present: who is building the next generation of foundational datasets?

Across domains ranging from healthcare to audio to agentic task performance, there is no widely accepted blueprint. What constitutes a gold-standard dataset for training an AI agent to handle complex enterprise tasks? What does a clinically meaningful evaluation look like for a model that will assist in medical decision-making? How should multilingual speech data be curated to ensure broad representation and reliable performance?

These are not simple sourcing problems. They are fundamental research challenges that need to be solved.

Too often, consequential data decisions are handled like procurement exercises. An organization requests “medical conversations” or “wildlife scenes,” and the request is routed to internal procurement or data sourcing teams, or to external data vendors, who assemble data that appears to match the description. The implicit assumption is that data is interchangeable, that one dataset is as good as another so long as it meets a basic specification.

Actual application suggests otherwise. Seemingly small choices about factors such as inclusion criteria, annotation standards, filtering rules, and validation protocols can dramatically alter downstream performance. Data design shapes model behavior as much as architecture does.

Three structural issues compound the problem:

The rise of annotation providers and reinforcement learning services has addressed part of the need. Rating model outputs, labeling text, and evaluating structured information are essential for many optimization tasks. But these activities generate data that is carefully constructed for specific, bounded purposes.

The frontier challenges in AI require more. They demand datasets derived from real human activity and organic organizational processes. Such data is complex, multimodal, and sensitive. It is rarely AI-ready by default. And converting it into reliable training and evaluation material is a scientific undertaking.

If high-quality data is a central bottleneck, then scientific rigor is part of the solution. Just as leading model-builders have dedicated research labs and hardware has dedicated development ecosystems, the data layer for AI requires focused, scientifically-grounded institutions. This means engaging directly with core questions like dataset design, evaluation methodology, and quality control. The conversation cannot end at volume; it must address data structure, representativeness, and expert validation.

Dataset construction must be approached as experimental design. Protocols must be documented and validated. Evaluation frameworks must test whether the dataset truly reflects the intended applications.

The field also requires standards and benchmarks that reflect real-world complexity, not simplified proxies. In healthcare, for instance, evaluating a system intended for clinical assistance with generic question-and-answer tests is insufficient. Real-world clinical environments involve multimodal inputs and contextual judgment. Benchmarks must reflect that reality if they are to function as meaningful gates before deployment.

Quality measurement is another crucial frontier. Finance relies on standardized metrics such as credit scores to assess risk. AI lacks an equivalent for datasets and benchmarks. Developing clear methodology to quantify dataset quality and evaluation reliability brings clarity to model assessment.

The criteria for evaluating a multilingual audio library will differ from those of a multimodal oncology dataset. Yet the underlying principle remains constant. Better models require better-defined, better-measured data.

As AI systems move closer to high-stakes deployment, weak data practices carry tangible risks.

Benchmarks cannot be created with the same data that is used for training; that amounts to giving the model the test answers ahead of time. Scaling data volume without prioritizing data quality and selection yields diminishing gains in model performance, and can even bias models against underrepresented populations or omit them entirely. These are methodological challenges, and ones that must be solved.
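To make the train/test overlap problem concrete, here is a minimal sketch of one common style of contamination check: flagging benchmark items that share a long run of words with any training document. This is an illustration rather than a method described in this article; the function names, the sample texts, and the n-gram length are all assumptions chosen for the example, and a real pipeline would need fuzzier matching and far more scalable indexing.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    # Texts shorter than n words produce an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated_items(
    benchmark: list[str], training_docs: list[str], n: int = 8
) -> list[int]:
    """Indices of benchmark items sharing at least one n-gram with training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark) if ngrams(item, n) & train_grams]


# Hypothetical sample data, for illustration only.
training_docs = ["The patient presents with chest pain and shortness of breath."]
benchmark = [
    "A patient presents with chest pain after exercise.",
    "What is the capital of France?",
]

# A short n is used here only because the sample texts are tiny.
flagged = contaminated_items(benchmark, training_docs, n=3)
print(flagged)
```

Exact n-gram matching only catches verbatim reuse. Paraphrased or translated leakage would pass this check, which is part of why contamination is a research problem rather than a simple filtering step.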

The rigor required at the data layer may not attract headlines. It does not typically lend itself to dramatic product launches. Yet the data layer is foundational to trust, safety, and sustained progress across AI.

No single organization can resolve the data gap alone. What is needed is an ecosystem of AI data labs and research groups, each focused on different domains and challenges but united by a commitment to scientific discipline. These institutions would collaborate with model researchers and domain experts to tackle challenges such as dataset contamination, factuality, groundedness, de-identification, international representation, and bias. They would design benchmarks that mirror real-world complexity rather than simplified abstractions.

AI’s trajectory will not be determined solely by larger models or faster chips. It will be shaped by the datasets we construct, the standards we adopt, and the rigor we apply at the foundation. The uneven frontier we see today reflects an uneven data landscape. Bridging the gap requires deliberate, research-driven dataset design.

If we want AI systems capable of operating reliably in clinical contexts, navigating enterprise workflows, and functioning responsibly across languages and cultures, we must treat data for AI as a first-class scientific endeavor. AI models have their research labs. AI chip-builders have their fabrication plants. AI data needs institutions of equal seriousness and ambition.

—

New Tech Forum provides a venue for technology leaders, including vendors and other outside contributors, to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

source & further reading

infoworld.com (original article)
