Was AI training based on shadowy pirate libraries?
According to an article in The New York Times, so-called shadow libraries, sites where millions of book titles are stored, in many cases without permission, are being used as training data for AI models.
There are several file-sharing sites on the internet that host an enormous number of books, magazines and other printed material that you would normally have to pay for.
Of course, don't expect your provider's DNS to resolve these links. They are blocked, and you will need to switch to a different DNS service, such as Cloudflare or Google.
This is a huge resource, and the developers of artificial intelligence did not leave it unexploited. AI companies have acknowledged that they relied on shadow libraries in their research work.
For the training of GPT-3, OpenAI said that about 16 percent of the data it used came from two “Internet-based groups of books” it called “Books1” and “Books2”.
According to a lawsuit filed against OpenAI by Sarah Silverman and two other authors, Books2 likely comes from a "blatantly illegal" shadow library.
Efforts to shut down these sites have failed. Last year, the FBI, with the help of the Authors Guild, charged two people accused of running Z-Library with copyright infringement, fraud and money laundering.
But afterwards, some of these sites moved to the Dark Web and to torrent sites, making them harder to track down. And because many of these sites operate anonymously and outside the United States, punishing their operators is genuinely difficult.
However, after all this fuss, tech companies are becoming increasingly strict about the data used to train their systems.