Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the strain that AI crawlers, bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models, put on its servers, which in some cases has meant higher costs and slower load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make the issue of attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that the content in the dataset is freely licensed under Creative Commons, the public domain and so on since it's all from Wikipedia.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Apr 17, 2025, 3:30:20 PM

