OpenAI says DeepSeek stole its data to train its breakthrough AI

OpenAI has alleged that DeepSeek stole data from the company to train its R1 AI model and used a special technique to mask its tracks.

Jak Connor

Tech and Science Editor

Published Jan 30, 2025 1:02 AM CST

TL;DR: OpenAI has accused DeepSeek of stealing data to train its R1 AI model and employing a special technique to conceal its actions.

It has been a tumultuous week in the AI industry. Chinese company DeepSeek unveiled its R1 model, and approximately $1 trillion was wiped from the value of Silicon Valley AI companies after DeepSeek said its model was on par with OpenAI's ChatGPT but cost just $6 million to create, a fraction of the billions of dollars poured into ChatGPT.

DeepSeek has truly shaken up the AI industry and has caused AI heavyweights such as OpenAI, Microsoft, and others to reassess their own AI chatbots and processes. But now accusations are beginning to fly, with OpenAI and Microsoft alleging they have obtained evidence DeepSeek used stolen OpenAI data to train its R1 model. In a recent Financial Times article, OpenAI claimed DeepSeek trained its AI with OpenAI's models, while Microsoft told Bloomberg that it believes it has evidence of an OpenAI developer account being connected to DeepSeek.

The accusations don't stop there: Microsoft said this OpenAI developer account is linked to the theft of large amounts of data from OpenAI. These accusations were backed up by President Trump's artificial intelligence and cryptocurrency advisor David Sacks, who told Fox News, "There's substantial evidence that what DeepSeek did here is distilled the knowledge out of OpenAI's models." Sacks added it's "possible" DeepSeek has engaged in IP theft.

"There's a technique in AI called distillation, which you're going to hear a lot about, and it's when one model learns from another model, effectively what happens is that the student model asks the parent model a lot of questions, just like a human would learn, but AIs can do this asking millions of questions, and they can essentially mimic the reasoning process they learn from the parent model and they can kind of suck the knowledge of the parent model," Sacks told Fox News. "There's substantial evidence that what DeepSeek did here is they distilled the knowledge out of OpenAI's models and I don't think OpenAI is very happy about this."
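For readers who want to see the idea in miniature, here is a rough sketch of distillation as Sacks describes it: a student model asks a parent model many questions and fits itself to the answers. Everything in it is an assumption made for illustration. The `teacher` function is a hypothetical stand-in for a model that can only be queried, the student is a two-parameter toy, and none of it reflects how DeepSeek or OpenAI actually train their models.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def teacher(x):
    """Hypothetical 'parent' model: we may only call it, not inspect it."""
    return sigmoid(3.0 * x - 1.0)

def distill(queries, steps=5000, lr=0.5):
    """Fit a student sigmoid(w*x + b) to the teacher's answers."""
    answers = [teacher(x) for x in queries]  # ask the parent model
    w, b = 0.0, 0.0                          # student starts knowing nothing
    n = len(queries)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t in zip(queries, answers):
            s = sigmoid(w * x + b)
            # Gradient of cross-entropy against the teacher's soft answer t
            grad_w += (s - t) * x
            grad_b += (s - t)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# The "questions": 41 evenly spaced inputs between -2 and 2.
queries = [i / 10.0 for i in range(-20, 21)]
w, b = distill(queries)

# Compare student and teacher on inputs that were never asked about.
held_out = [-1.55, -0.35, 0.25, 0.85, 1.75]
max_gap = max(abs(sigmoid(w * x + b) - teacher(x)) for x in held_out)
print(f"student w={w:.2f}, b={b:.2f}, max gap={max_gap:.4f}")
```

The student never sees the teacher's internal numbers, only its answers, yet it ends up closely matching the teacher on inputs it was never trained on. That is the sense in which one model can "suck the knowledge" out of another.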

OpenAI accusing DeepSeek of IP theft comes with a sense of irony, considering OpenAI has itself been accused multiple times of the same practice, taking data not from companies but from consumers, people like you and me. OpenAI has rejected these claims, saying it has used publicly available datasets, which it argues fall under fair use. But what if those publicly available datasets contain illegally obtained copyrighted material? The Pile dataset, for example, was found to contain as many as 170,000 YouTube video transcripts and subtitles.

The Pile is a dataset used for academic purposes, and scraping YouTube video transcripts is strictly against YouTube's terms of service. Additionally, OpenAI's now former Chief Technology Officer, Ermira Murati, failed to answer a simple question last year: whether OpenAI uses YouTube videos to train its AI models.

Judging by OpenAI and Microsoft's response to DeepSeek and the accusations now flying, we expect to see some legal action taken against DeepSeek. Funnily enough, if the legal proceedings run their full course, OpenAI and Microsoft could set a new legal precedent on AI training data and fair use, which could ultimately backfire if OpenAI is discovered to have stolen large swaths of data itself.

"As the leading builder of AI, we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models, and believe as we go forward that it is critically important that we are working closely with the U.S. government to best protect the most capable models from efforts by adversaries and competitors to take U.S. technology," said OpenAI.
