A new report from Proof News alleged that Apple, NVIDIA and other big tech companies used a dataset containing copyrighted IP to train their respective AI models. That copyrighted IP included transcripts of YouTube videos from prominent creators, such as MKBHD, one of the platform's biggest technology reviewers.
The report cited an investigation into the dataset known as the Pile, with the reporters claiming to have discovered transcripts or subtitles of more than 170,000 YouTube videos across 40,000 different channels. Some of those videos were from creators such as MrBeast, MKBHD, Jimmy Kimmel, Stephen Colbert, PewDiePie and many others. The report also cited statements from companies saying they used the Pile dataset to train their AI models, as the dataset is free and open for public use.
This newly surfaced report raises the question of what happens to AI companies that use datasets containing copyrighted IP to train their AI models. Is the owner of the AI model responsible, or the company that assembled the dataset? Or both? OpenAI landed in hot water over the same issue only a few months ago, when Chief Technology Officer (CTO) Mira Murati was unable to answer whether OpenAI uses YouTube videos to train its AI models.
Following the ambiguity around OpenAI's training data, YouTube's CEO issued a public reminder that scraping data from YouTube violates its terms of service.
MKBHD has since responded to the new report in a one-minute YouTube Short, explaining the situation and adding that he pays for high-quality subtitles to be made for each of his videos, which would mean that content is being "stolen" more than once.