Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the impact of AI crawlers (bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models) on its servers, which in some cases has led to increased costs and slower load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make the issue of attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that the content in the dataset is freely licensed under Creative Commons, the public domain and so on since it's all from Wikipedia.
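To make the description of the record contents concrete, here is a minimal Python sketch of how a developer might read one record and flatten it into plain text. The article does not specify the file format or schema, so everything here is an assumption for illustration: that each record is a single JSON object, and the field names `abstract`, `description`, `infobox`, `image_url` and `sections` (with `title` and `text` inside each section) are all hypothetical.

```python
import json

# Hypothetical record. The field names are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "abstract": "A one-paragraph summary.",
    "description": "short description",
    "infobox": {"Founded": "2001"},
    "image_url": "https://example.org/image.png",
    "sections": [
        {"title": "History", "text": "First section text."},
        {"title": "Usage", "text": "Second section text."},
    ],
})


def section_texts(line):
    """Return (title, text) pairs for each section in one JSON record."""
    record = json.loads(line)
    return [
        (section.get("title", ""), section.get("text", ""))
        for section in record.get("sections", [])
    ]


def to_training_text(line):
    """Join the abstract and the segmented sections into one plain-text string."""
    record = json.loads(line)
    parts = [record.get("abstract", "")]
    parts += [f"{title}\n{text}" for title, text in section_texts(line)]
    # Skip empty pieces so missing fields do not leave blank paragraphs.
    return "\n\n".join(part for part in parts if part)


if __name__ == "__main__":
    print(to_training_text(SAMPLE_LINE))
```

Because the sections arrive already segmented, a consumer along these lines would not need to scrape or parse rendered article pages, which is the load the dataset is meant to take off the public site.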

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created April 17, 2025, 15:30:20

