Date: 2026-01-20

NVIDIA has been accused of contacting Anna's Archive to purchase terabytes of copyrighted books to train its large language models.

TL;DR: A lawsuit alleges NVIDIA executives approved partnering with Anna's Archive, a site hosting millions of pirated books and papers, to use its data for training large language models. According to the complaint, internal emails show NVIDIA sought access to roughly 500 terabytes of illegally obtained content amid competitive pressure.

A complaint filed in the US District Court claims NVIDIA executives approved contact with Anna's Archive, a website that hosts millions of copyrighted books and academic papers, to discuss a partnership in which NVIDIA would use the site's data to train its large language models (LLMs).

The complaint alleges that "competitive pressures drove NVIDIA to piracy," and that internal NVIDIA emails show a member of the company's data strategy team contacting Anna's Archive about the collaboration. The complaint also states that Anna's Archive warned NVIDIA that its trove of data had been obtained illegally and asked how NVIDIA wanted to proceed.

The lawsuit states that NVIDIA approved the collaboration within a week, and that Anna's Archive responded by offering NVIDIA approximately 500 terabytes of data. "Desperate for books, NVIDIA contacted Anna's Archive -- the largest and most brazen of the remaining shadow libraries -- about acquiring its millions of pirated materials and 'including Anna's Archive in pre-training data for our LLMs,'" the complaint notes.

The complaint adds that the 500 terabytes of data included millions of books that are otherwise accessible only through the Internet Archive's digital lending system. Notably, the complaint does not say whether NVIDIA followed through and paid for access to the dataset Anna's Archive offered.

