Wikipedia and Kaggle Release AI-Friendly Beta Dataset

April 18, 2025
Wikipedia has partnered with Kaggle to release a structured beta dataset in English and French, designed to provide AI developers and machine learning practitioners with high-utility content such as abstracts, infobox-style data, and image links, while reducing server strain from web scraping.

Wikipedia has partnered with Kaggle to release a beta dataset designed for AI developers and machine learning practitioners. This structured dataset, available in English and French, aims to provide a more efficient and developer-friendly alternative to scraping raw Wikipedia content. The dataset includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections, making it ideal for training AI models, building features, and testing NLP pipelines.
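As a rough illustration of how such structured records might be consumed, the sketch below parses a couple of made-up records and flattens them into the fields listed above. It assumes a JSON Lines layout (one JSON object per line), and every field name in it (`name`, `abstract`, `description`, `infobox`, `image_url`, `sections`) is a hypothetical stand-in rather than the dataset's documented schema, so check the dataset page on Kaggle before relying on any of them.

```python
import io
import json

# Made-up sample data. The JSON Lines layout and all field names here are
# illustrative assumptions, not the dataset's documented schema.
SAMPLE_JSONL = """\
{"name": "Example article", "description": "short description", "abstract": "An example abstract.", "infobox": {"Type": "Demo"}, "image_url": "https://example.org/a.png", "sections": [{"title": "History", "text": "Some text."}]}
{"name": "Second article", "abstract": "Another abstract.", "sections": []}
"""


def iter_records(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record):
    """Flatten a record into a few feature-friendly fields.

    Missing keys fall back to empty defaults so partial records do not fail.
    """
    return {
        "title": record.get("name", ""),
        "abstract": record.get("abstract", ""),
        "description": record.get("description", ""),
        "infobox": record.get("infobox", {}),
        "image_url": record.get("image_url"),
        "section_titles": [s.get("title", "") for s in record.get("sections", [])],
    }


# An in-memory file object stands in for a downloaded file here.
rows = [summarize(r) for r in iter_records(io.StringIO(SAMPLE_JSONL))]
for row in rows:
    print(row["title"], "|", row["section_titles"])
```

Reading line by line keeps memory use flat for large files, and the `.get` defaults tolerate records that lack an infobox or image.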

By offering this dataset, the Wikimedia Foundation seeks to reduce the strain on its servers caused by AI crawlers scraping public Wikipedia content. That scraping has led to increased costs and slower load times for human users. The dataset is formatted for machine learning workflows and is freely licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0), the GNU Free Documentation License (GFDL), and other applicable licenses.

Kaggle, a leading data science platform owned by Google, hosts this dataset, providing a collaborative environment for the machine learning community to experiment with and refine their models. The beta release also invites feedback and suggestions to improve the dataset for future production use.

For more details, you can access the dataset directly on Kaggle or read the official announcement on the Wikimedia Enterprise blog.
