HathiTrust Non-consumptive Research Pilot: Analysing the New Zealand Corpus

Arrived in Auckland on Friday. What more could I ask for? Beautiful weather (well, at least for the first two days), a fantastic city (or so say my first impressions), and an interesting project to work on. For the rest of the summer (another benefit - I get to escape from winter in the northern hemisphere!) I’ll be working on the HathiTrust pilot, sponsored by James Smithies.

The Project

The HathiTrust is a digital library of approximately 10 million scanned books, 3 million of which are in the public domain. The HathiTrust Research Center is developing a set of tools to facilitate ‘non-consumptive research’; that is, large-scale text and data analysis by scholars working in the digital humanities and related fields. The computing resources behind these tools are provided by Indiana University and the University of Illinois at Urbana-Champaign. This project aims to use and test these tools by performing a preliminary analysis of the New Zealand content in the HathiTrust. There are a number of questions to answer: How do we define New Zealand content? How much of the New Zealand content is in the public domain? What can we say about the digital quality of the works? What sort of information can we gather regarding historical context (age) and genre? Can we do some interesting topic modeling and text analysis on the corpus?

Immediate Goals

  1. Background research: learn more about the HathiTrust, get in touch with people who could serve as resources, read up on the copyright laws that restrict access to HathiTrust materials and, in general, get oriented with the project.
  2. Familiarize myself with the HathiTrust Data API and the Meandre Workbenches, and write some preliminary code to learn how to use these tools.

The specifics of the longer-term goals will, of course, develop naturally as I learn more about the project. And boy, do I have a lot to learn! It looks like I’ll be giving myself a crash course in text and big data analysis, as well as learning about digital humanities and who knows what else as the project evolves. I suppose that’s the beauty of an exploratory project: you never know quite where you’ll end up.

Submitted by robertmarchman on