Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the load that AI crawlers (bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models) place on its servers, which in some cases has meant higher costs and slower load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make the issue of attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that the content in the dataset is freely licensed under Creative Commons, the public domain and so on since it's all from Wikipedia.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created 23d | 17. 4. 2025 15:30:20
