The German research organization LAION recently released a new dataset, Re-LAION-5B, which it says has been thoroughly cleaned of links to suspected child sexual abuse material (CSAM). The dataset is a re-release of an older dataset, LAION-5B, with fixes implemented on the recommendations of several outside organizations.
The Cleaned Dataset
The Re-LAION-5B dataset is available in two versions: Research and Research-Safe. In both versions, thousands of links to known and likely CSAM have been filtered out. LAION emphasized its commitment to removing illegal content from its datasets and stated that it removes such content as soon as it becomes aware of it.
LAION’s datasets do not contain images. They are indexes of links to images and image alt text, curated from Common Crawl, a dataset of scraped sites and web pages. This distinction is crucial to understanding what the dataset is and what its use implies.
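To make the idea of a link index concrete, here is a minimal sketch of filtering such an index against a blocklist. It is an illustration only, not LAION's actual pipeline, which this article does not describe. The `LinkRecord` type, the `url_digest` and `filter_index` helpers, and the use of SHA-256 digests of URLs as the blocklist format are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkRecord:
    """One hypothetical index entry: a link and its alt text, no image data."""
    url: str
    alt_text: str


def url_digest(url: str) -> str:
    """Return the SHA-256 hex digest of a URL (assumed blocklist format)."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def filter_index(records, blocked_digests):
    """Drop records whose URL digest is on the blocklist.

    Returns the kept records and the number of records removed.
    """
    kept = [r for r in records if url_digest(r.url) not in blocked_digests]
    return kept, len(records) - len(kept)


if __name__ == "__main__":
    # Placeholder entries; example.com URLs stand in for real index rows.
    index = [
        LinkRecord("https://example.com/cat.jpg", "a cat on a sofa"),
        LinkRecord("https://example.com/flagged.jpg", "placeholder entry"),
    ]
    # In practice a blocklist would come from a trusted outside organization.
    blocklist = {url_digest("https://example.com/flagged.jpg")}

    cleaned, removed = filter_index(index, blocklist)
    print(f"kept {len(cleaned)} record(s), removed {removed}")
```

Comparing digests rather than raw URLs means whoever does the filtering never has to hold a readable list of the flagged links. Whether LAION's process works this way is not stated here.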
The release of Re-LAION-5B follows an investigation by the Stanford Internet Observatory that found links to illegal images and inappropriate content in LAION-5B. This discovery prompted LAION to take the dataset offline temporarily and address the issues identified by the investigation.
The Stanford report recommended that models trained on LAION-5B should be deprecated, and distribution should cease where feasible. This recommendation underscores the importance of addressing the presence of CSAM and inappropriate content in training datasets used for generative AI models.
The temporary removal of LAION-5B and the recommendations in the Stanford report also have consequences for models already trained on the dataset. Runway, for example, took down its Stable Diffusion 1.5 model.
LAION stresses that its datasets are intended for research purposes and not commercial use. However, past instances of organizations using LAION datasets for training models raise questions about the ethical and legal implications of using datasets with potentially illegal content.
The release of Re-LAION-5B raises awareness about the importance of thorough data cleaning and ethical considerations in AI research. The issues identified with LAION-5B highlight the need for transparency, accountability, and responsible data practices in the development of AI models.