The chatbots of AI need more books to learn and several US libraries will

Tech Giants Leverage Historic Libraries to Power AI Development

Technology giants such as Microsoft and OpenAI are turning to historic libraries for AI training data. The initiative, built around a vast collection of public domain books from Harvard’s libraries, aims to advance AI development while addressing concerns over data quality and copyright.

Harvard’s Vast Book Collection Unlocked

Harvard University has recently opened its archive of over one million books in 254 languages, scanned and made available to researchers in a dataset dubbed “Institutional Books 1.0.” This collection, which includes works dating back to the fifteenth century, is part of an ambitious effort to provide AI developers with high-quality, reliable data.

Addressing Data Quality Concerns

The shift from online comments and social media to public domain texts aims to improve the precision and reliability of AI systems. Greg Leppert, the executive director of Harvard’s data initiative, highlights the importance of using original sources over pirated data. “Much of the data that has been used in AI training has not come from original sources,” he notes.

Cooperation with Libraries and Museums

Microsoft and OpenAI are collaborating with libraries and museums around the world to make historical collections accessible for AI training. The partnership seeks to “return some power back to these institutions,” which have long been responsible for stewarding data, according to Aristana Scourtas, who directs research at Harvard Law’s Library Innovation Laboratory.

Addressing Copyright Issues

Both Microsoft and OpenAI are taking steps to mitigate copyright concerns. Microsoft has decided to start with public domain information, which is less controversial. Additionally, OpenAI has donated $50 million to research institutions, including Oxford’s 400-year-old Bodleian Library, for AI transcription projects.

Expanding Access to AI Training Data

The initiative also includes documents from the Boston Public Library, which has digitized newspapers and government documents. This effort not only aids AI developers but also funds projects aimed at enhancing library services. Digitization is costly, so such collaborations are crucial to making historical information more accessible.

Digitization of Harvard’s collection began in 2006 in collaboration with Google, though Google faced copyright issues with newer works. Now, both Google and Harvard are working to share public domain volumes, thereby providing AI developers with more reliable data.

The Implications for AI Development

This collaborative effort could democratize AI development by creating extensive, legally usable datasets. Mary Rasenberger, chief executive of the Authors Guild, sees value in expanding access to these volumes, which would allow broader use of the books and the creation of more AI models.

However, there are challenges ahead. The vast dataset includes sensitive historical content, which carries a risk of harmful language and content. Kristi Mukk, coordinator of the Harvard Library Innovation Laboratory, emphasizes the need for guidance to mitigate these risks and promote responsible AI use.

Looking ahead, the initiative could have a significant impact on the field of AI. By drawing on diverse linguistic resources and original sources, AI developers may be able to build more sophisticated and reliable systems with stronger reasoning and planning abilities.
