Artificial intelligence-powered systems are truly impressive, but what datasets are they trained on? OpenAI has kept the answer to that question behind closed doors, and now YouTube has issued a warning to the company ahead of the release of Sora, OpenAI's AI-powered text-to-video generation tool.
Creators of AI models use large amounts of data to train their tools for whatever task they are designed to perform. However, there is a major problem with simply grabbing data off the internet and using it to train an AI model that will potentially be used to generate money: copyrighted IP. This problem isn't new, as The New York Times and Getty Images have already filed lawsuits against AI creators, alleging that copyrighted data was used without permission to train models that are then used to generate profit.
The copyright debate regarding AI models heated up again in March when OpenAI CTO Mira Murati told The Wall Street Journal that she wasn't sure if Sora's training included data from YouTube, Instagram, or Facebook. Now, in an interview with Bloomberg Originals, YouTube CEO Neal Mohan reminded OpenAI that taking any kind of data from the platform and using it to train AI models is strictly against the platform's terms of service.
"From a creator's perspective, when a creator uploads their hard work to our platform, they have certain expectations. One of those expectations is that the terms of service is going to be abided by. It does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform," said Mohan.