Apple and NVIDIA busted swiping YouTube videos to train AI models

A new report has found that Apple, NVIDIA, Anthropic, and others have used a dataset that includes YouTube video transcripts to train AI models.

Tech and Science Editor

In early April, YouTube sent a clear message to AI model developers: downloading data from the platform and using it to train AI models violates YouTube's terms of service.

That same week, the message was reinforced by a Google spokesperson, who told the New York Times that any "unauthorized scraping or downloading of YouTube content" is prohibited. However, a new report from Proof News has found that YouTube has been scraped for its data, and that some of the biggest tech companies advancing AI have used it to train models.

According to a Proof News investigation, subtitles from 173,536 YouTube videos were siphoned from more than 48,000 channels, including those of prominent creators and hosts such as MKBHD (19 million subscribers), MrBeast (289 million), Jacksepticeye (31 million), PewDiePie (111 million), Stephen Colbert, John Oliver, Jimmy Kimmel, and more. Notably, the video transcriptions are subtitle files.

The report found that Apple, NVIDIA, Salesforce, Anthropic, and others used a dataset called the Pile, which is publicly available to anyone with internet access. Moreover, the report states that Apple, NVIDIA, and Salesforce have each said in their research papers that the Pile was used to train their AI models. In Apple's case, the Pile was used to train OpenELM, a new AI model released in April, only weeks before the Cupertino company unveiled Apple Intelligence.

It should be noted that none of the big tech companies listed above downloaded the YouTube video transcriptions themselves. That was done by EleutherAI, which created the dataset for educational and academic purposes. However, it appears big tech discovered the dataset and decided to use it to train their models. This raises the question of what happens when a company trains an AI model on a third-party dataset that contains data users didn't consent to have used for training.

"AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube's rules against harvesting materials from the platform without permission.

"Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, NVIDIA, Apple, and Salesforce. The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did 'The Late Show With Stephen Colbert,' 'Last Week Tonight With John Oliver,' and 'Jimmy Kimmel Live,'" reads Proof News' YouTube description.



Jak joined the TweakTown team in 2017 and has since reviewed hundreds of new tech products and kept us informed daily on the latest science, space, and artificial intelligence news. Jak's love for science, space, and technology, and, more specifically, PC gaming, began at 10 years old, on the day his dad showed him how to play Age of Empires on an old Compaq PC. Ever since that day, Jak has been in love with games and the progression of the technology industry in all its forms.
