Only last month, an investigative report found that Apple, NVIDIA, and many other big-name players in the AI race had used a public data set containing YouTube video transcripts to train their respective AI products, which is a violation of YouTube's terms of service (TOS).
YouTube has said in the past that any "unauthorized scraping or downloading of YouTube content" is strictly prohibited, particularly when that data is then used for commercial projects. Last month, a Proof News investigation found that NVIDIA, Apple, and other AI companies used an academic data set containing subtitles from more than 170,000 YouTube videos to train AI models, and now NVIDIA is in the spotlight again following a report from 404 Media.
According to the publication, which spoke with a former NVIDIA employee about the company's internal processes, employees were instructed to scrape videos from Netflix, YouTube, and other sources to add to the data sets being used to train an AI model for NVIDIA's Omniverse 3D world generator, self-driving car systems, a "digital human" AI avatar product, and the Cosmos deep learning model.
Additionally, the report states NVIDIA made efforts to hide its tracks from YouTube by running multiple "virtual machines" to avoid detection.
"We are finalizing the v1 data pipeline and securing the necessary computing resources," Ming-Yu Liu, NVIDIA's VP of Research and a leader on the Cosmos project, wrote in a May email, according to 404, "to build a video data factory that can yield a human lifetime visual experience worth of training data per day."
Internal conversations viewed by 404 Media revealed that when employees raised concerns about the source of the data and the ethics of how it was acquired, managers assured them they had clearance from the highest levels of the company to use the content for training.
"This is an executive decision," Liu wrote to a hesitant underling on one such occasion, according to Slack messages reviewed by 404. "We have an umbrella approval for all of the data."
NVIDIA was asked to comment on the report's allegations, and the driving force behind the AI push replied that its AI training practices are in "full compliance with the letter and the spirit of copyright law."