AI development is moving at a rapid pace, but it risks running headlong into a wall. As websites increasingly place barriers on scraping (some of which are allegedly ignored), and as the remaining content is voraciously collected by scrapers to train AI models, concerns are growing that we may run out of usable training data.
The industry’s answer? Synthetic data.
“Recently in the industry, synthetic data has been talked about a lot,” said Sebastien Bubeck, a member of technical staff at OpenAI, in the company’s livestreamed release of GPT-5 last week. Bubeck stressed its importance for the future of AI models—an idea echoed by his boss, Sam Altman, who live-tweeted the event, saying he was “excited for much more to come.”
The prospect of relying heavily on synthetic data hasn’t gone unnoticed by the creative industries. “I believe the main reason companies like OpenAI are having to rely more on synthetic data now is that they’ve run out of high-quality, human-created data to mine from the public-facing internet,” says Reid Southern, a film concept artist and illustrator.
Southern believes there’s another motive. “It further distances them from any copyrighted materials they’ve trained on that could land them in hot water.”
For this reason, he has publicly called the practice “data laundering.” He argues that AI companies could train their models on copyrighted works, generate AI variations, then remove the originals from their datasets. They could then “claim their training set is ‘ethical’ because it didn’t technically train on the original image by their logic,” says Southern. “That’s why we call it data laundering, because in a sense, they’re attempting to clean the data and strip it of its copyright.” (OpenAI did not respond to Fast Company’s request for comment.)
The issue is more nuanced, according to Felix Simon, an AI researcher at the University of Oxford. “In one sense, it doesn’t really remediate the original harm over which creators and AI firms squabble,” he says. “After all, synthetic data isn’t plucked from the ether but presumably created with models that have reportedly been trained with data from creators and copyright holders—often without their permission and without compensation.” From the perspective of societal justice, rights, and duties, “these rights holders still are owed something even with the use of synthetic data—be that compensation, acknowledgements, or both.”
Ed Newton-Rex, founder of Fairly Trained—a non-profit certifying AI companies that respect creators’ intellectual property rights—shares Southern’s concerns. “I think synthetic data is a legitimately helpful way to augment your dataset,” he says. “If you’re training an AI model, it’s a way of increasing the coverage of your training data. And at a time when we’re butting up against the limits of legitimately accessible training data, it’s seen as a way to extend the usable life of that data.”
Still, Newton-Rex acknowledges its darker side. “At the same time, I think unfortunately its effect is, at least in part, one of copyright laundering,” he says. “I think both are true.”
He warns against taking AI firms’ promises at face value. “Synthetic data is not a panacea from the incredibly important copyright questions,” he says. “I think there tends to be so much of a feeling that synthetic data helps you, as an AI developer, get around copyright concerns.” That belief, he says, is wrong.
The framing of synthetic data—and the way AI companies talk about model training—also helps them distance themselves from the individuals whose work they may be using. “The average listener, if they hear this model was trained on synthetic data, they’re bound to think, ‘Oh, right, okay. Well, this probably isn’t Ed Sheeran’s latest album, right?’ It further moves us away from an easy understanding of how these models are actually made, which is ultimately by exploiting people’s life’s work.”
He compares it to plastic recycling, where a recycled container might once have been a toy, a car bumper, or something else entirely. “The fact these AI models mash all this stuff up and generate, quote-unquote, ‘new output’, does nothing to reduce their reliance on the original work.”
For Newton-Rex, this is the critical takeaway: “Really the absolutely critical element here, and it’s just got to be remembered, is that even in a world of synthetic data, what’s happening is people’s work is being exploited in order to compete with them.”