Was AI training based on shadowy pirate libraries?


An article in The Times reports that so-called shadow libraries, sites that host millions of book titles, in many cases without permission, have been used as training data for AI models.

Several file-sharing sites on the internet host an enormous number of books, magazines and other printed material that you would normally have to pay for.

Free libraries such as Library Genesis, Z-Library or the Library offer more material than anyone has time to read. At the same time, users can also upload their own material.

This is a huge resource, and the developers of artificial intelligence did not leave it untapped. AI companies have acknowledged that they relied on shadow libraries in their research.

OpenAI's GPT-1 was trained on BookCorpus, which contains over 7,000 unpublished titles pulled from the self-publishing platform Smashwords.

For the training of GPT-3, OpenAI said that about 16 percent of the data it used came from two “Internet-based groups of books” it called “Books1” and “Books2”.


According to a lawsuit filed against OpenAI by Sarah Silverman and two other authors, Books2 is likely a "blatantly illegal" shadow library.

Efforts to shut down these sites have failed. Last year, the FBI, with the help of the Authors Guild, charged two people accused of running Z-Library with copyright infringement, fraud and money laundering.

However, after all this fuss, tech companies are becoming increasingly strict about the data used to train their systems.

