Since 2011 the HathiTrust Research Center (HTRC) has been developing tools to promote “non-consumptive research” of its 4 billion pages of public domain material. University Librarian John Unsworth has long been a practitioner of text-mining. As a member of the HTRC Executive Management Group, Unsworth is part of the HTRC’s effort to encourage participation in the growing field of quantitative literary analysis, which deals less with the meaning of individual works than with literary texts as collections of words. For instance, it may be instructive to know that bestsellers share an uncanny number of textual features.
Now HathiTrust is dramatically expanding the number of works available for non-consumptive research. With the updated release of HTRC Analytics, HTRC provides access to its complete 16.7-million-item corpus for data mining and computational analysis, including items protected by copyright. U.S. courts have recognized the solid legal basis for non-consumptive research on copyrighted materials. In 2016, HathiTrust established a working group to develop the Non-Consumptive Use Research Policy to ensure the responsible research use of copyrighted items. If you aren’t sure what sorts of research are non-consumptive, please contact Director of Information Policy Brandon Butler.
This extraordinary opportunity to use copyrighted materials for non-consumptive research is sustained by HathiTrust’s 140+ member libraries, including UVA’s. You can access HTRC’s easy-to-use computational tools, some ideal for beginners and others more complex to meet advanced data analysis needs:
- HTRC Algorithms: a set of tools for assembling collections of digitized text from the HathiTrust corpus and performing text analysis on them. Includes copyrighted items for ALL USERS.
- Extracted Features Dataset: a dataset allowing non-consumptive analysis of specific features extracted from the full text of the HathiTrust corpus. Includes copyrighted items for ALL USERS.
- HathiTrust+Bookworm: a tool for visualizing and analyzing word usage trends in the HathiTrust corpus. Includes copyrighted items for ALL USERS.
- HTRC Data Capsule: a secure computing environment for researcher-driven text analysis on the HathiTrust corpus. All users may access public domain items. Access to copyrighted items is available ONLY to member-affiliated researchers.