Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the impact of AI crawlers (bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models) on its servers, which in some cases has led to increased costs and slower load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer a beta release of a structured dataset in both English and French. According to Google, which owns Kaggle, the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise says the content in the dataset is freely licensed, under Creative Commons licenses, the public domain and similar terms, since it all comes from Wikipedia.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created 25d | Apr 17, 2025, 15:30:20


