AI needs more data. Will library stacks be its next frontier?
Cambridge, Mass.
Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly 1 million books, published as early as the 15th century and in 254 languages, are part of a Harvard University collection being released to AI researchers this week. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists, and others whose creative works have been scooped up without their consent to train AI chatbots.
"It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright," said Burton Davis, a deputy general counsel at Microsoft.
Mr. Davis said libraries also hold "significant amounts of interesting cultural, historical, and language data" that is missing from the past few decades of online commentary that AI chatbots have mostly learned from. Fears of running out of data have also led AI developers to turn to "synthetic" data, generated by the chatbots themselves and generally of lower quality.
Supported by "unrestricted gifts" from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries and museums around the world on how to make their historic collections AI-ready in a way that also benefits the communities they serve.
"We're trying to move some of the power from this current AI moment back to these institutions," said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. "Librarians have always been the stewards of data and the stewards of information."
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earliest works is from the 1400s: a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law, and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
"A lot of the data that's been used in AI training has not come from original sources," said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes "all the way back to the physical copy that was scanned by the institutions that actually collected those items," he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit, and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens: units of data, each of which can represent a piece of a word.
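To make that concrete, here is a minimal sketch of how a sentence breaks into tokens, using OpenAI's open-source tiktoken tokenizer; the sample sentence is illustrative, not drawn from the Harvard collection.

```python
# A minimal sketch of tokenization using OpenAI's open-source
# tiktoken library (pip install tiktoken). The sample sentence is
# illustrative; real training corpora run to billions of tokens.
import tiktoken

# "cl100k_base" is the encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Librarians have always been the stewards of information."
token_ids = enc.encode(text)

print(f"{len(token_ids)} tokens")
for tid in token_ids:
    # Each id maps back to a text fragment, often a piece of a word.
    print(tid, repr(enc.decode([tid])))
```

Running the sketch shows common words surviving as single tokens while rarer ones split into fragments, which is why token counts, not word counts, are the industry's yardstick for training data.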
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a fraction of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images, and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from "shadow libraries" of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the United States, the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
"OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning," Ms. Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
Harvard's collection was already digitized, starting in 2006, for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016, when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
The new effort was applauded Thursday by the same authors' group that sued Google over its book project and more recently has brought AI companies to court.
"Many of these titles exist only in the stacks of major libraries, and the creation and use of this dataset will provide expanded access to these volumes and the knowledge within," said Mary Rasenberger, CEO of the Authors Guild, in a Thursday statement. "Importantly, the creation of a legal, large training dataset will democratize the creation of new AI models."
How useful all of this will be for the next generation of AI tools remains to be seen. The data was shared on June 12 on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
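For those who want to explore the collection themselves, a minimal sketch of streaming it with the Hugging Face datasets library follows; the dataset identifier and split name below are assumptions for illustration, so check the Hugging Face hub for the exact listing.

```python
# A minimal sketch of streaming the collection from Hugging Face
# (pip install datasets). The dataset identifier and split name are
# assumptions for illustration; check huggingface.co for the real ones.
from datasets import load_dataset

# Stream instead of downloading: a corpus of hundreds of millions of
# scanned pages is far too large to pull onto a laptop in one go.
books = load_dataset(
    "institutional/institutional-books-1.0",  # assumed identifier
    split="train",                            # assumed split name
    streaming=True,
)

# Peek at the first record to see what fields each volume carries.
first_record = next(iter(books))
print(list(first_record.keys()))
```

Streaming mode fetches records on demand rather than materializing the whole dataset locally, which is the practical way to sample a collection of this size.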
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish, and Latin.
Mr. Leppert said a book collection steeped in 19th-century thought could also be "immensely critical" for the tech industry's efforts to build AI agents that can plan and reason as well as humans.
"At a university, you have a lot of pedagogy around what it means to reason," Mr. Leppert said. "You have a lot of scientific information about how to run processes and how to run analyses."
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist and colonial narratives.
"When you're dealing with such a large data set, there are some tricky issues around harmful content and language," said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab. She said the initiative is trying to give users of the data guidance on mitigating those risks, to "help them make their own informed decisions and use AI responsibly."
This story was reported by The Associated Press.