2024-07-16

A new report has found that Apple, NVIDIA, Anthropic, and others have used a dataset that includes YouTube video transcripts to train AI models.

In early April, YouTube sent a clear message to AI model developers: downloading data from the platform and using it to train AI models violates YouTube's terms of service.

This sentiment was reinforced in the same week as YouTube's public comment about its content being used to train AI models, this time by a Google spokesperson who told the New York Times that any "unauthorized scraping or downloading of YouTube content" is prohibited. However, a new report from Proof News has found that YouTube has been scraped for its data, and that some of the biggest tech companies advancing AI have used that data to train models.

According to a Proof News investigation, subtitles from 172,535 YouTube videos were siphoned from more than 48,000 channels. Some of these channels belong to prominent creators on the platform, such as MKBHD (19 million subscribers), MrBeast (289 million), Jacksepticeye (31 million), PewDiePie (111 million), Stephen Colbert, John Oliver, Jimmy Kimmel, and more. Notably, the video transcriptions are subtitle files.

The report found that Apple, NVIDIA, Salesforce, Anthropic, and others used a dataset called the Pile, which is openly accessible to anyone with internet access. Moreover, the report notes that Apple, NVIDIA, and Salesforce have acknowledged in their respective research papers that the Pile was used to train their AI models. In Apple's case, the Pile dataset was used to train OpenELM, a new AI model that was released in April, only weeks before the Cupertino company unveiled Apple Intelligence.

It should be noted that none of the big tech companies listed above downloaded the YouTube video transcriptions themselves. That was done by EleutherAI, which created the dataset for educational and academic purposes. However, it appears big tech discovered the dataset and decided to use it to train their models. This raises a question: what happens when a company trains an AI model on a third-party dataset that contains data whose creators did not consent to its use for training?