A complaint filed in the US District Court claims NVIDIA executives approved contact with Anna's Archive, a website that hosts millions of copyrighted books and academic papers, to discuss a partnership in which NVIDIA would use Anna's Archive as a dataset for training its large language models (LLMs).

The complaint alleges that "competitive pressures drove NVIDIA to piracy," and that internal NVIDIA emails show a member of the company's data strategy team contacting Anna's Archive about the collaboration. It also states that Anna's Archive warned NVIDIA that its treasure trove of data was obtained illegally, and asked how Team Green wanted to proceed.

The lawsuit states that within a week, NVIDIA approved the collaboration, and in response, Anna's Archive offered NVIDIA approximately 500 terabytes of data. "Desperate for books, NVIDIA contacted Anna's Archive -- the largest and most brazen of the remaining shadow libraries -- about acquiring its millions of pirated materials and 'including Anna's Archive in pre-training data for our LLMs,'" the complaint notes.

The complaint adds that the 500 terabytes of data included millions of books that are otherwise accessible only through the Internet Archive's digital lending system. Notably, the complaint does not explicitly state whether NVIDIA followed through and paid for access to the dataset offered by Anna's Archive.

"Because Anna's Archive charged tens of thousands of dollars for 'high-speed access' to its pirated collections [] NVIDIA sought to find out what 'high-speed access' to the data would look like," reads the complaint.



