The HathiTrust Research Center (HTRC) is working to promote one of the hottest areas of literary research. Text-mining and non-consumptive research refer to a type of literary analysis that does not depend on close reading of the text; what matters is how often, and in what patterns, words occur. Scholars will of course continue to find meaning in literary texts, but studying the frequency and patterns of words published in thousands of books over time is opening new avenues to understanding literature. Computational analysis applied to the language of Shakespeare, for instance, is helping to determine which parts of Shakespeare’s plays were written by him, which parts were written by contemporaries working with him, and who those contemporaries were. Beyond questions of authorship, the impersonal analysis of words may also reveal previously unknown forces that have contributed to literary movements. Collecting the data is only half the battle; the rest is figuring out what it all means.
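For readers curious what "studying the frequency of words over time" looks like in practice, here is a minimal sketch in Python. It is not the HTRC's method or tooling. The three-sentence corpus and the function names (`tokenize`, `counts_by_year`, `relative_frequency`) are invented for illustration. It only shows the basic idea of counting how large a share of each year's words a given word accounts for.

```python
from collections import Counter, defaultdict
import re

# Hypothetical toy corpus: (publication year, text). Real research would draw on
# word counts from thousands of volumes rather than a few raw strings.
TOY_CORPUS = [
    (1850, "The whale and the sea and the ship"),
    (1850, "A ship upon the sea"),
    (1900, "The city and the machine"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def counts_by_year(corpus):
    """Return {year: Counter of word occurrences} for the given corpus."""
    by_year = defaultdict(Counter)
    for year, text in corpus:
        by_year[year].update(tokenize(text))
    return dict(by_year)

def relative_frequency(word, corpus):
    """Return {year: share of all tokens in that year equal to `word`}."""
    result = {}
    for year, counter in counts_by_year(corpus).items():
        total = sum(counter.values())
        result[year] = counter[word.lower()] / total if total else 0.0
    return result

if __name__ == "__main__":
    # Print the share of each year's words that are "sea".
    for year, share in sorted(relative_frequency("sea", TOY_CORPUS).items()):
        print(year, round(share, 3))
```

Dividing by each year's total word count, rather than reporting raw counts, keeps years with more published text from dominating the comparison. Note that nothing in the sketch requires a person to read the underlying passages, which is the sense in which such work is called non-consumptive.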
University Librarian John Unsworth has had a particular interest in text-mining since he was a faculty member at the University of Illinois. As a member of the HTRC Executive Management Group, he is part of the HTRC’s effort to encourage participation in the growing field of quantitative literary analysis. To tempt more scholars on board, the HTRC is offering a sneak preview of some of the sites it has in development, putting its 4 billion pages of scanned data at their disposal. Note that these are all development sites and can be slow to load at first, but your patience will be rewarded.
- HTRC Extracted Feature Solr Search—Faceted Unigram Page-based Extracted Feature Search (Prototype)
- bookworm: HathiTrust—Search for trends in millions of volumes
- Bookworm Map—See where a word occurs in the 15 million volume HathiTrust collection
- Research Datasets—Downloadable, non-consumptive book data