Harvard University announced the publication of a high-quality dataset of nearly one million public domain books that anyone can use to train large language models (LLMs) and other artificial intelligence tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative (IDI) with funding from Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.
The IDI database spans genres, decades and languages: classics by Shakespeare, Charles Dickens and Dante, alongside obscure Czech maths textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the type of repositories that normally only Big Tech has had the resources to assemble. “It’s gone through a rigorous review,” he adds.
Additional data for chatbots
Leppert believes the new public domain database could be used alongside other licensed materials to build AI models: “It’s similar to the way Linux has become a fundamental operating system for much of the world.” He notes that companies will still need additional training data to differentiate their models from those of their rivals.
Burton Davis, vice president and deputy general counsel for intellectual property at Microsoft, stressed that the company’s support for the project is consistent with its broader belief in creating “accessible pools of data” for AI companies to use, pools that are “managed in the interest of the public.” That does not mean Microsoft plans to replace all of its AI training data with public domain alternatives such as the books in Harvard’s new dataset: “We use public domain data to train our models,” Davis said. For his part, Tom Rubin, head of intellectual property and content at OpenAI, said in a statement that the company was “delighted” to support the project.
As dozens of lawsuits filed over the use of copyrighted data for AI training make their way through the courts, the future of how artificial intelligence tools are built hangs in the balance. If AI companies win their cases, they will be able to continue collecting data from the internet without needing to sign licensing agreements. But if they lose, they will be forced to overhaul how their models are created. A wave of projects like the Harvard database is moving forward on the assumption that, no matter what, there will be an appetite for public domain datasets.
In addition to the trove of books, the Institutional Data Initiative is working with the Boston Public Library to scan millions of articles from different newspapers that are now in the public domain, and it says it is open to forming similar collaborations down the road. Exactly how the book dataset will be made public is yet to be decided. In a statement, Kent Walker, Google’s president of global affairs, said the company was “proud to support” the project.
A long list of support for AI data banks
Regardless of how the IDI dataset is released, it will join a host of similar projects, startups and initiatives that promise to give companies access to substantial, high-quality AI training materials without the risk of running into copyright issues. Companies such as Calliope Networks and ProRata handle licensing and manage compensation plans so that creators and owners of works are paid for providing data to feed artificial intelligence models.
There are also other new public domain projects. Last spring, French AI company Pleias released its own dataset, Common Corpus, containing between 3 and 4 million books and periodical collections, according to the project coordinator, Pierre-Carl Langlais. The dataset, backed by the French Ministry of Culture, has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced the publication of its first set of LLMs trained on this dataset. Langlais told WIRED this is a step toward building “the first models trained exclusively with open data and in accordance with the European Union Law on AI.”