Video: Kaggle Free Dataset for ML and Data Science #ai #machinelearning #stackdev
The Wikimedia Foundation has partnered with Kaggle to release a beta dataset of structured Wikipedia content for AI developers and machine learning practitioners. Available in English and French, the dataset is meant to be a more efficient, developer-friendly alternative to scraping raw Wikipedia pages. It includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections, which makes it well suited to training AI models, building features, and testing NLP pipelines.
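To make the structure described above concrete, here is a minimal Python sketch of turning one article record into a training-ready example. It assumes the data is read as JSON Lines, one article per line, and every field name (`name`, `description`, `abstract`, `image`, `infobox`, `sections`) is a hypothetical stand-in chosen for illustration. The sample record is also made up. Check the schema published on the Kaggle dataset page before relying on any of these names.

```python
import io
import json
from typing import Iterable, Iterator

# Made-up sample record. The field names are assumptions for illustration,
# not the dataset's confirmed schema.
SAMPLE_JSONL = (
    '{"name": "Ada Lovelace", "description": "English mathematician", '
    '"abstract": "Ada Lovelace was an English mathematician and writer.", '
    '"image": "https://example.org/ada.jpg", '
    '"infobox": {"Born": "1815", "Died": "1852"}, '
    '"sections": [{"title": "Early life", "text": "She was born in London."}, '
    '{"title": "Work", "text": "She wrote notes on the Analytical Engine."}]}\n'
)


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed article per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_training_example(record: dict) -> dict:
    """Flatten a (hypothetical) article record into fields an NLP pipeline can use."""
    sections = record.get("sections", [])
    return {
        "title": record.get("name", ""),
        # Prefer the abstract; fall back to the short description.
        "summary": record.get("abstract") or record.get("description", ""),
        # Infobox-style key-value data as sorted (key, value) pairs.
        "facts": sorted(record.get("infobox", {}).items()),
        "section_titles": [s.get("title", "") for s in sections],
        # Join section texts into one body, separated by blank lines.
        "body": "\n\n".join(s.get("text", "") for s in sections),
    }


# In practice the lines would come from an opened dataset file
# rather than an in-memory string.
examples = [to_training_example(r) for r in iter_records(io.StringIO(SAMPLE_JSONL))]
print(examples[0]["title"], examples[0]["section_titles"])
```

Because the sections arrive already segmented, the sketch needs no HTML or wikitext parsing, which is the main practical difference from working with scraped pages.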
By offering this dataset, the Wikimedia Foundation seeks to reduce the strain that AI crawlers place on its servers when they scrape public Wikipedia content, which has increased costs and slowed load times for human users. The dataset is formatted for machine learning workflows and is freely licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0), the GNU Free Documentation License (GFDL), and other applicable licenses.
Kaggle, a leading data science platform owned by Google, hosts this dataset, providing a collaborative environment for the machine learning community to experiment with and refine their models. The beta release also invites feedback and suggestions to improve the dataset for future production use.
For more details, you can access the dataset directly on Kaggle or read the official announcement on the Wikimedia Enterprise blog.