Harvard University announced Thursday that it is releasing a high-quality dataset of nearly one million public-domain books that anyone can use to train language models and other AI tools.
The dataset was created by Harvard's newly formed Institutional Data Initiative with funding from Microsoft and OpenAI. It contains books that were scanned as part of the Google Books project and are no longer protected by copyright.
It is about five times larger than the notorious Books3 dataset that was used to train AI models such as Meta's Llama. The Institutional Data Initiative's dataset spans genres, decades, and languages, with classic literature from Shakespeare, Charles Dickens, and Dante included alongside Czech math textbooks and Welsh pocket dictionaries.
Greg Leppert, executive director of the Institutional Data Initiative, says the project is an effort to "level the playing field" by giving the general public, including members of the AI industry and individual researchers, access to the kind of highly refined content repositories that typically only established tech companies have the resources to assemble.
Leppert believes the new public dataset could be used in conjunction with other licensed datasets to build artificial intelligence models.
"I think of it a bit like the way Linux has become a foundational operating system for so much of the world," he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.