Hacking libraries and AI training

Was AI training based on shadowy pirate libraries?


According to an article in The New York Times, so-called shadow libraries, sites where millions of book titles are stored, in many cases without permission, are being used as training data for AI models.

There are several file-sharing sites on the internet that host an enormous number of books, magazines and other printed material that you would normally have to pay for.

Free libraries like Library Genesis and Z-Library offer more material than you could ever find time to read. At the same time, you can also upload your own material.


Of course, don't expect your provider's DNS to resolve these links. They are blocked, and you will need to switch to the DNS servers of Cloudflare or Google.

This is a huge resource, and the developers of artificial intelligence did not leave it unexploited. AI companies have acknowledged in their research work that they relied on shadow libraries.

OpenAI's GPT-1 was trained on BookCorpus, which contains over 7,000 unpublished titles pulled from the self-publishing platform Smashwords.

For the training of GPT-3, OpenAI said that about 16 percent of the data it used came from two "internet-based books corpora" that it called "Books1" and "Books2".


According to a lawsuit filed against OpenAI by Sarah Silverman and two other authors, Books2 is likely a "blatantly illegal" shadow library.

Efforts to shut down these sites have failed. Last year, the FBI, with the help of the Authors Guild, charged two people accused of running Z-Library with copyright infringement, fraud and money laundering.

But then some of these sites moved to the Dark Web and torrent sites, making them harder to track down. And since many of these sites operate outside of the United States and anonymously, punishing the operators is a real challenge.

However, after all this fuss, tech companies are becoming increasingly strict about the data used to train their systems.


Written by Dimitris
