Data & Methods
We have received data from What Middletown Read in two forms: the comma-delimited query results available for download from the web site, and the relational database that Ball State University shared with us. In both cases, the data is normalized (see Wikipedia for more on normalization), so we denormalized ("flattened", as we call it) the data to make subsequent processes simpler and faster. All of this, both normalizing the data as Ball State has done and denormalizing it for processing as we did, conforms to standard practices.
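To make "flattening" concrete, here is a minimal sketch in Python. The table layout, column names, and the book record are hypothetical placeholders; they are not the actual What Middletown Read schema. The idea is only that each transaction is joined to the records it points at, so that later steps can read one wide row instead of following keys across tables.

```python
# Hypothetical normalized tables; names and columns are illustrative only
# and do not reflect the real What Middletown Read schema.
patrons = {1: {"patron_name": "Kate Wilson"}}
books = {10: {"title": "Example Title", "author": "Example Author"}}
transactions = [
    {"transaction_id": 100, "patron_id": 1, "book_id": 10,
     "borrower_name": "Bob Knowlton"},
]


def flatten(transactions, patrons, books):
    """Join each transaction to its patron and book, giving one wide row each."""
    flat = []
    for t in transactions:
        row = dict(t)                                # copy the transaction fields
        row.update(patrons.get(t["patron_id"], {}))  # add patron columns, if found
        row.update(books.get(t["book_id"], {}))      # add book columns, if found
        flat.append(row)
    return flat


flat_rows = flatten(transactions, patrons, books)
```

Each entry in `flat_rows` then carries the borrower, patron, title and author together, at the cost of repeating patron and book fields on every transaction.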
It's important to note how the data relates people to transactions. As near as I can tell from the data, the Muncie Library seems to have regarded each transaction as involving up to three different people, two of whom (borrower and patron) are important to what follows.
First, there's the obviously named "borrower", who I understand to be the person who actually walked out of the library with the book, and who I regard as an approximation for "reader." Second, there's the "patron". Often the patron is not the same person as the borrower. For example, someone named Kate Wilson was the patron for 471 transactions, but those transactions are connected to over 300 different borrowers. According to John Plotz at Slate, Kate Wilson seems to have been the librarian (it's easy to imagine her as the sort of librarian that book lovers would love, the sort who checks out books for out-of-town visitors, for wayward youths, for people who for whatever reason don't have a card, etc.). In any case, I regard "patron" as meaning "someone with a library card."
In the database, census data is associated with patrons, but not explicitly with borrowers. For example, Kate Wilson, who the census identifies as female, is the patron on a transaction for borrower Bob Knowlton. If we were to uncritically use Kate Wilson's census information for the transaction, then we would count Bob Knowlton among female readers.
Our solution was to examine every transaction, and to use the census information only when the borrower and patron names were more or less the same (our precise definition is, "when the Levenshtein string distance between the two names, ignoring titles, honorifics and capitalization, was 2 or less"). The result is that our processes look at fewer transactions—at 108,000 instead of 176,000—than would otherwise be the case.
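The matching rule can be sketched roughly as follows. This is an illustration under assumptions, not our production code: the list of honorifics, the decision to keep only ASCII letters, and the function names are all hypothetical. Only the threshold of 2 and the idea of ignoring titles, honorifics and capitalization come from the definition above.

```python
import re

# Assumed set of titles/honorifics to ignore; the actual list is not given.
HONORIFICS = {"mr", "mrs", "miss", "ms", "dr", "rev", "prof"}


def normalize_name(name):
    """Lowercase, keep only runs of ASCII letters, and drop honorific tokens."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(t for t in tokens if t not in HONORIFICS)


def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def same_person(borrower, patron, max_distance=2):
    """True when the normalized names are within max_distance edits."""
    return levenshtein(normalize_name(borrower),
                       normalize_name(patron)) <= max_distance
```

Under this sketch, `same_person("Mrs. Kate Wilson", "kate wilson")` is true, so the patron's census data would be used, while `same_person("Bob Knowlton", "Kate Wilson")` is false, so that transaction would be set aside.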
Our processes for market basket analysis and for investigating (a)typical readers are limited to only the most common borrowers, authors and readers; otherwise, our processes would take unreasonably long to complete. For the purposes of finding similarities and common patterns in reading, such limits seem reasonable; however, we do understand that they also cost us the ability to look for the "long tail" in the circulation data, and to measure the degree of difference in readers' habits.
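As a rough illustration of that kind of limit, the sketch below keeps only the most frequent borrowers and authors and then counts how often two authors appear in the same borrower's "basket." The row fields (`borrower`, `author`), the cutoffs, and the choice of authors as basket items are assumptions made for the example; they are not a description of our actual pipeline.

```python
from collections import Counter
from itertools import combinations


def top_n(values, n):
    """Return the set of the n most frequent values."""
    return {value for value, _ in Counter(values).most_common(n)}


def author_pair_counts(rows, n_borrowers, n_authors):
    """Count author pairs that share a borrower, after trimming to the top-N."""
    keep_borrowers = top_n((r["borrower"] for r in rows), n_borrowers)
    keep_authors = top_n((r["author"] for r in rows), n_authors)

    # One basket per retained borrower: the set of retained authors they borrowed.
    baskets = {}
    for r in rows:
        if r["borrower"] in keep_borrowers and r["author"] in keep_authors:
            baskets.setdefault(r["borrower"], set()).add(r["author"])

    # Count each unordered author pair once per basket.
    pairs = Counter()
    for authors in baskets.values():
        pairs.update(combinations(sorted(authors), 2))
    return pairs
```

The number of candidate pairs grows roughly with the square of the number of distinct authors, which is why trimming to the most common ones shortens the run time so sharply, and also why the rare borrowers and authors in the "long tail" drop out of view.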