AI-powered tools and applications are certainly impressive in what they can generate, but where AI companies get the data to train their models remains ambiguous or completely hidden from the public.
A report published just last week revealed that Apple, NVIDIA, Anthropic, and others used a public dataset containing hundreds of thousands of YouTube video transcripts to train their AI models. Although Apple, NVIDIA, and the others did not download the transcripts themselves, the data was still used to train AI models, which violates YouTube's Terms of Service (ToS). Earlier in the year, YouTube's CEO stated that downloading any data from its platform is a violation of its ToS.
Now, a new report from 404 Media states that Runway, a popular AI video generation company, trained its Gen-3 Alpha model on thousands of YouTube videos without obtaining permission from the creators or YouTube. The report also states that the company used pirated content for AI model training. 404 Media was sent a spreadsheet listing how many videos were taken from each source, and judging from the list, the sources are extensive and cover a wide variety of channels.
Notably, the spreadsheet states the AI was trained on 21,000 Washington Post videos, 10,000 New York Times videos, 27,000 Wall Street Journal videos, and hundreds or thousands more videos from various other channels. The report caught the attention of popular technology reviewer MKBHD, who found himself on the spreadsheet, with 1,600 of his videos used to train the Runway AI.
"The channels in that spreadsheet were a company-wide effort to find good quality videos to build the model with," an unnamed former employee told 404 Media. "This was then used as input to a massive web crawler which downloaded all the videos from all those channels, using proxies to avoid getting blocked by Google."