The AI arms race may soon center on a competition for ‘expert’ data

Welcome to AI Decoded, Fast Company’s weekly newsletter that breaks down the most important news in the world of AI. You can sign up to receive this newsletter every week here.

The AI arms race will soon focus on competition for data

We use benchmark tests such as MMLU and HellaSwag to test large language models’ knowledge and problem-solving capability. But over the past six months it’s become clear that the performance gaps between well-known models are narrowing. A year ago, OpenAI’s GPT-4 was considered the undisputed champion of LLMs, but now models from Anthropic, Mistral, Meta, Cohere, and Google are producing similar or better scores, depending on the benchmark.

In the past, we’ve improved large models by giving them more training data and compute power. But the performance returns on training with data scraped indiscriminately from the public web are limited, many believe. As a result, we’re left with a growing group of LLMs with roughly equal performance. Now, AI developers will likely try to gain an edge by acquiring stores of specialized data, such as health data.

“We built really great general purpose machines that talk like humans, but just like humans [who] are not experts, they’re generalists,” says Ali Golshan, cofounder and CEO of the synthetic data company Gretel. “Now, what we’re saying is that these general purpose machines need to become experts.” But “expert” training data is usually not public; it’s proprietary, held close by corporations. Gretel’s platform can be used to anonymize such data for use in training models.

We’ve already seen a number of AI developers strike deals to license content from publishers. Earlier this week, in fact, OpenAI said it had signed a deal to use content from the Financial Times. Reuters reported in February that Google was licensing data from the social platform Reddit. The New York Times Company sued OpenAI for using its content without permission, and the suit may well result in some form of licensing deal.

But as AI companies intensify their quest for specialized domain data, we may see deals that go well beyond licensing agreements. It’s very possible that AI companies will buy content companies outright, just for their training data. Stephen DeAngelis, founder and CEO of the reasoning AI company Enterra Solutions, believes that Wikipedia, WolframAlpha, or even Getty Images could be targets in this type of acquisition. Tech firms could also be eyeing lesser-known companies that possess a kind of data needed to fill some crucial gap in an LLM’s knowledge. Or, AI companies might try to tap into academic knowledge, DeAngelis says. “I could see these large firms saying to colleges, ‘We’ll pay you a lot of money so you can fuel your investment in research, and can we license a copy of that [research] content to put into our LLM,’” he says.

The budding AI industry is already seeing an alarming migration of research and engineering talent, and specialized computer chips, to powerful, deep-pocketed players, such as Microsoft, Meta, and, increasingly, Elon Musk’s X (formerly Twitter). This concentration of resources could also soon include training data, further entrenching the players with the most buying power.

California frontier AI bill is moving through the Senate

A bill that would impose safety guidelines on AI companies developing large AI models has been moving through the state’s Senate and will get a full hearing May 6. The bill, called the Safe and Secure Innovation for Frontier Artificial Intelligence Systems Act (SB 1047), would require AI companies to study the safety implications before training large models, satisfy certain safety requirements, and report any safety incidents caused by the model. The bill would also establish a “Frontier Model Division” within the state’s Department of Technology that would collect information on large model development and oversee certification, and it proposes civil penalties for companies that violate the requirements of the Act.

It’s still unclear how the state would actually enforce such requirements. And the law doesn’t automatically hold developers accountable if a model causes harm, or even a catastrophe, as Dan Hendrycks of the Center for AI Safety points out on X. “The question is whether they took reasonable measures to prevent that,” he writes. “This bill could have used strict liability, where developers would be liable for catastrophic harms regardless of fault, but that’s not what the bill does.”

As the bill moves toward an eventual vote, some in the AI community are voicing fears that it could stifle the work of smaller AI startups and people working on open-source models.

The bill is unique for its focus on the development of the largest models, such as OpenAI’s GPT-4 and Google’s Gemini. The last major milestone in AI model safety came when Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI all pledged to the Biden administration in September 2023 to study the societal risks of new models (such as bias and privacy violations), proactively manage risk, and empower their internal safety teams.

For now AI regulation is happening mostly on a state level, with a good deal of focus on deepfakes and employment discrimination. SB 1047 is especially important given that California is often a first mover on technology regulation, and laws that pass in Sacramento are often used as models or templates for other states.

Survey: AI is already changing the way people search the web

The Verge survey about “how Americans are using and thinking about AI” includes some notable findings around new chatbot users (there are fewer of them in 2024) and AI usage patterns (people are finding more, and increasingly advanced, ways of using the technology). But most interesting is what the survey turns up around search: “The first meaningful disruption in search in 20 years is coming into full view,” The Verge editors write.

The survey of 2,000 users asked: “Do you use AI tools in place of search engines (like Google) to find information about a topic?” 61% of Gen Z respondents said they did, along with 53% of millennials. And 63% of millennials and 52% of Gen Z say they “trust the veracity of information that AI provides,” compared to only 32% of baby boomers. More than half of the respondents said they think AI can do a better job on common search tasks, such as planning a family activity or outing or discovering new recipes.

These levels of AI-native search adoption could push Google to make its own AI-native search, called Search Generative Experience (SGE), a regular part of its traditional search experience sooner than expected. The results also bode well for AI search upstarts like Perplexity.


https://www.fastcompany.com/91116610/the-ai-arms-race-may-soon-center-on-a-competition-for-expert-data?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

Created 02.05.2024, 11:40:06
