Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the impact of AI crawlers, bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models. The crawlers have strained its servers, leading to increased costs and, in some cases, slower load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer a beta release of a structured dataset in both English and French. According to Google, which owns Kaggle, the dataset is formatted with machine learning in mind, making it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that because all of the content comes from Wikipedia, it is freely licensed, whether under Creative Commons, in the public domain or on similar terms.
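For a sense of what working with that kind of structured record might look like, here is a minimal Python sketch. It assumes each article arrives as one JSON object per line, and every field name in it (`name`, `abstract`, `description`, `infobox`, `image_url`, `sections`) is a hypothetical stand-in chosen for illustration, not the dataset's documented schema.

```python
import json

# Hypothetical record layout: these field names are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "abstract": "A short summary of the article.",
    "description": "illustrative entry",
    "infobox": {"Type": "Example", "Founded": "2001"},
    "image_url": "https://example.org/image.png",
    "sections": [
        {"title": "History", "text": "Prose for the history section."},
        {"title": "Usage", "text": "Prose for the usage section."},
    ],
})


def summarize_record(line: str) -> dict:
    """Parse one JSON line and pull out the pieces the article lists."""
    record = json.loads(line)
    return {
        "title": record.get("name", ""),
        "abstract": record.get("abstract", ""),
        # Infobox data is treated as a flat key-value mapping (an assumption).
        "infobox_keys": sorted(record.get("infobox", {})),
        # Sections are assumed to be a list of {"title", "text"} objects.
        "section_titles": [s.get("title", "") for s in record.get("sections", [])],
    }


if __name__ == "__main__":
    print(summarize_record(SAMPLE_LINE))
```

Because missing fields fall back to empty values, the same function can be run over records that lack an infobox or sections without raising an error.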

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created 19d | Apr 17, 2025, 3:30:20 PM

