Synthetic data is the new AI gold rush, but critics call it ‘data laundering’

AI development is moving at a rapid pace, but it risks running headlong into a wall. As websites increasingly place barriers on scraping (some of which are allegedly ignored), and as the remaining content is voraciously collected by scrapers to train AI models, concerns are growing that we may run out of usable training data.

The industry’s answer? Synthetic data.

“Recently in the industry, synthetic data has been talked about a lot,” said Sebastien Bubeck, a member of technical staff at OpenAI, in the company’s livestreamed release of GPT-5 last week. Bubeck stressed its importance for the future of AI models—an idea echoed by his boss, Sam Altman, who live-tweeted the event, saying he was “excited for much more to come.”

The prospect of relying heavily on synthetic data hasn’t gone unnoticed by the creative industries. “I believe the main reason companies like OpenAI are having to rely more on synthetic data now is that they’ve run out of high-quality human-created data to mine from the public-facing internet,” says Reid Southern, a film concept artist and illustrator.

Southern believes there’s another motive. “It further distances them from any copyrighted materials they’ve trained on that could land them in hot water.”

For this reason, he has publicly called the practice “data laundering.” He argues that AI companies could train their models on copyrighted works, generate AI variations, then remove the originals from their datasets. They could then “claim their training set is ‘ethical’ because it didn’t technically train on the original image by their logic,” says Southern. “That’s why we call it data laundering, because in a sense, they’re attempting to clean the data and strip it of its copyright.” (OpenAI did not respond to Fast Company’s request for comment.)

The issue is more nuanced, according to Felix Simon, an AI researcher at the University of Oxford. “In one sense, it doesn’t really remediate the original harm over which creators and AI firms squabble,” he says. “After all, synthetic data isn’t plucked from the ether but presumably created with models that have reportedly been trained with data from creators and copyright holders—often without their permission and without compensation.” From the perspective of societal justice, rights, and duties, “these rights holders still are owed something even with the use of synthetic data—be that compensation, acknowledgements, or both.”

Ed Newton-Rex, founder of Fairly Trained—a non-profit certifying AI companies that respect creators’ intellectual property rights—shares Southern’s concerns. “I think synthetic data is a legitimately helpful way to augment your dataset,” he says. “If you’re training an AI model, it’s a way of increasing the coverage of your training data. And at a time when we’re butting up against the limits of legitimately accessible training data, it’s seen as a way to extend the usable life of that data.”

Still, Newton-Rex acknowledges its darker side. “At the same time, I think unfortunately its effect is, at least in part, one of copyright laundering,” he says. “I think both are true.”

He warns against taking AI firms’ promises at face value. “Synthetic data is not a panacea from the incredibly important copyright questions,” he says. “I think there tends to be so much of a feeling that synthetic data helps you, as an AI developer, get around copyright concerns.” That belief, he says, is wrong.

The framing of synthetic data, and the way AI companies talk about model training, also helps them distance themselves from the individuals whose work they may be using, Newton-Rex says. “The average listener, if they hear this model was trained on synthetic data, they’re bound to think, ‘Oh, right, okay. Well, this probably isn’t Ed Sheeran’s latest album, right?’ It further moves us away from an easy understanding of how these models are actually made, which is ultimately by exploiting people’s life’s work.”

He compares it to plastic recycling, where a recycled container might once have been a toy, a car bumper, or something else entirely. “The fact these AI models mash all this stuff up and generate, quote-unquote, ‘new output’, does nothing to reduce their reliance on the original work.”

For Newton-Rex, this is the critical takeaway: “Really the absolutely critical element here, and it’s just got to be remembered, is that even in a world of synthetic data, what’s happening is people’s work is being exploited in order to compete with them.”

https://www.fastcompany.com/91385285/synthetic-data-is-the-new-ai-gold-rush-but-critics-call-it-data-laundering?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

Created: 14 Aug 2025, 12:40:08

