Lawsuit alleges NVIDIA approved use of pirated books to train AI models

NVIDIA has been accused of contacting Anna's Archive to purchase terabytes of copyrighted books to train its large language models.

TL;DR: A lawsuit alleges NVIDIA executives approved partnering with Anna's Archive, a site hosting millions of pirated books and papers, to use its data for training large language models. According to the complaint, internal emails show NVIDIA pursued the deal amid competitive pressures, and Anna's Archive offered roughly 500 terabytes of illegally obtained content.

A complaint filed in US District Court claims NVIDIA executives approved contact with Anna's Archive, a website that hosts millions of copyrighted books and academic papers, to discuss a partnership in which Anna's Archive's collection would serve as a dataset for training NVIDIA's large language models (LLMs).


The complaint alleges that "competitive pressures drove NVIDIA to piracy," and that internal NVIDIA emails show a member of the company's data strategy team contacting Anna's Archive about the collaboration. The complaint also states that Anna's Archive warned NVIDIA that its treasure trove of data had been obtained illegally, and asked how Team Green wanted to proceed.

The lawsuit states that within a week, NVIDIA approved of the collaboration, and in response, Anna's Archive offered NVIDIA approximately 500 terabytes of data. "Desperate for books, NVIDIA contacted Anna's Archive -- the largest and most brazen of the remaining shadow libraries -- about acquiring its millions of pirated materials and 'including Anna's Archive in pre-training data for our LLMs,'" the complaint notes.


The complaint further states that the 500 terabytes of data included millions of books that are otherwise accessible only through the Internet Archive's digital lending system. Notably, the complaint does not explicitly state whether NVIDIA ultimately paid for access to the dataset offered by Anna's Archive.


"Because Anna's Archive charged tens of thousands of dollars for 'high-speed access' to its pirated collections [] NVIDIA sought to find out what "high-speed access" to the data would look like," reads the complaint