Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the strain that AI crawlers (bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models) put on its servers, which in some cases has led to increased costs and slower load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that, because it all comes from Wikipedia, the content in the dataset is freely licensed under Creative Commons, the public domain and so on.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created 1mo | 17 Apr. 2025, 15:30:20


