Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the impact that AI crawlers — bots that are scraping text and multimedia from the encyclopedia to train generative artificial intelligence models — have been having on its servers, leading to increased costs and slower load times for human users in some cases. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." It contains no references or other "non-prose elements," such as video clips. The lack of references could make attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that, since it all comes from Wikipedia, the content in the dataset is freely licensed under Creative Commons, public domain and similar terms.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created 17 Apr 2025, 15:30:20

