Book lists

The list of "our" novels.
The list of "other" novels.

"Thanksgiving" deliverables

Overlap spreadsheets for 8 different year-long time slices:

1901-12-01 -- 1902-11-30 overlap > 10% overlap > 5% average overlap
1900-12-01 -- 1901-11-30 overlap > 10% overlap > 5% average overlap
1899-12-01 -- 1900-11-30 overlap > 10% overlap > 5% average overlap
1898-12-01 -- 1899-11-30 overlap > 10% overlap > 5% average overlap
1897-12-01 -- 1898-11-30 overlap > 10% overlap > 5% average overlap
1896-12-01 -- 1897-11-30 overlap > 10% overlap > 5% average overlap
1895-12-01 -- 1896-11-30 overlap > 10% overlap > 5% average overlap
1894-12-01 -- 1895-11-30 overlap > 10% overlap > 5% average overlap

Revised average overlap, etc., (November 17, 2015)

Family reading for the period 12/1/1901 to 11/29/1902.

Readers who overlapped with 5 book choices or more for the period 12/1/1901 to 11/29/1902.

A list of "our" books and the number of readers for each for the period 12/1/1901 to 11/29/1902. And a list of books other than "ours".

A spreadsheet listing all book-book combinations with an overlap >= 10%. In making this spreadsheet, I considered all books, even those with a very low number of readers. This spreadsheet is similar to the last overlap spreadsheet we looked at, except that I added an additional "score" column. It covers the period 12/1/1901 to 11/29/1902. And the corresponding "hairball" network diagram.

Same as above, except the overlap is >= 5%. And the corresponding "hairball" network diagram.

A spreadsheet which shows the average overlap between a book and "our" books. There are three columns: book, readers, and average overlap. "Average overlap" is calculated by summing the overlaps between the book and all "our" books, then dividing by the number of "our" books. I also added an additional "score" column. It covers the period 12/1/1901 to 11/29/1902.
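As a rough illustration of the "average overlap" calculation described above, here is a minimal Python sketch. The function name `average_overlap`, the dictionary layout, and the sample numbers are assumptions made for illustration, not the code or data behind the spreadsheet. Treating a missing pair as zero overlap is also an assumption.

```python
def average_overlap(book, our_books, pair_overlap):
    """Sum the overlaps between `book` and every one of "our" books,
    then divide by the number of "our" books."""
    total = 0.0
    for ours in our_books:
        # A pair may be stored in either order; a missing pair counts as 0 (assumption).
        total += pair_overlap.get((book, ours), pair_overlap.get((ours, book), 0.0))
    return total / len(our_books)


# Made-up overlap percentages for a hypothetical book "X" and three of "our" books.
our_books = ["A", "B", "C"]
pair_overlap = {("X", "A"): 12.0, ("B", "X"): 6.0}
result = average_overlap("X", our_books, pair_overlap)
```

Note that the divisor is the full count of "our" books, so a book that overlaps strongly with only one or two of them still ends up with a low average.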

Average overlap, etc., (November 9, 2015)

A list of "our" books and the number of readers for each during the time frame.

A spreadsheet listing all book-book combinations with an overlap >= 10%. In making this spreadsheet, I considered all books, even those with a very low number of readers. This spreadsheet is similar to the last overlap spreadsheet we looked at.

A spreadsheet which shows the average overlap between a book and "our" books. There are three columns: book, readers, and average overlap. "Average overlap" is calculated by summing the overlaps between the book and all "our" books, then dividing by the number of "our" books.

Overlap data for one book which figures prominently in the average overlap spreadsheet.

Recomputing overlap, etc., (October 30, 2015)

The "peak books" spreadsheet (days / transactions).

The "greater than or equal to 10% overlap with Wister translations" spreadsheet from October 29, 2015. And the corresponding "hairball" network diagram. Caution: the "hairball" may be a little slow to load.

The overlap was based on a rule change, which we discussed on October 27, 2015. The new rule is: when we need gender and age for an analysis, use TRUST_THIS_CENSUS and borrower name as a proxy for "reader"; otherwise, use patron as a proxy for "reader" (which we did for the overlap). This spreadsheet shows some selected patron-borrower name combinations.
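The rule above could be sketched as a single lookup function. This is only an illustration: the notes do not say how TRUST_THIS_CENSUS is stored, so treating it as a per-transaction field that is paired with the borrower name is an assumption, and the field names `borrower_name` and `patron` and the sample values are hypothetical.

```python
def reader_key(transaction, needs_demographics):
    """Return the identifier that stands in for "reader" in an analysis."""
    if needs_demographics:
        # Gender/age analyses: TRUST_THIS_CENSUS plus borrower name (assumed layout).
        return (transaction["TRUST_THIS_CENSUS"], transaction["borrower_name"])
    # Everything else, including the overlap: the patron.
    return transaction["patron"]


# Hypothetical transaction record with made-up values.
transaction = {
    "patron": "P-0001",
    "borrower_name": "Jane Doe",
    "TRUST_THIS_CENSUS": "C-17",
}
demographic_reader = reader_key(transaction, needs_demographics=True)
overlap_reader = reader_key(transaction, needs_demographics=False)
```

Keeping the choice in one place means an analysis cannot mix the two proxies by accident.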

Recolored, selected PCA scatterplots. Dark blue is Jane Eyre. Yellow is Wister. Green is books in the "10%" overlap list. Blue-green are other books by authors who are in the "10%" overlap list. Gray is everything else.

full texts
standard stopwords and
proper nouns removed:

10 topics
25 topics
endings (last 10% of each novel)
standard stopwords and
proper nouns removed:

10 topics
25 topics
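For reference, the "endings" texts above are the last 10% of each novel. A minimal sketch of one way to cut such an ending, assuming length is measured in whitespace-separated words (the notes do not say whether words, characters, or sentences were used), with the hypothetical helper name `ending`:

```python
def ending(tokens, fraction=0.10):
    """Return the final `fraction` of a token list (at least one token if any exist)."""
    keep = max(1, round(len(tokens) * fraction))
    return tokens[-keep:]


# A made-up 20-word "novel"; the last 10% is its final 2 words.
novel_tokens = ["w%d" % i for i in range(1, 21)]
last_part = ending(novel_tokens)
```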

Topic modeling, etc., Round 2 (October 6, 2015)

demographics, etc

Latest demographic files:

A web page giving some examples of family reading. And another page with more and better family reading!

Doug's Word document and Word document of October 20th relating to OMS readership.

A graph showing the distribution of overlapping reading for all books checked out 50 or more times. Most books' readership overlaps weakly; for example, the peak in the graph points out that for 218,252 pairs, only 1% of their readers overlapped.
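A graph like the one above can be built by computing an overlap percentage for every pair of books and counting the pairs at each percentage. The sketch below is a hedged illustration only: the overlap definition (shared readers as a share of all readers of either book) is an assumption, since these notes do not spell out the formula, and the function names and reader sets are made up. The "50 or more checkouts" filter is left out for brevity.

```python
from collections import Counter
from itertools import combinations


def overlap_percent(readers_a, readers_b):
    """Assumed definition: shared readers as a percentage of all readers of either book."""
    union = readers_a | readers_b
    if not union:
        return 0.0
    return 100.0 * len(readers_a & readers_b) / len(union)


def overlap_histogram(readers_by_book):
    """Count how many book pairs fall at each whole-number overlap percentage."""
    histogram = Counter()
    for a, b in combinations(sorted(readers_by_book), 2):
        percent = round(overlap_percent(readers_by_book[a], readers_by_book[b]))
        histogram[percent] += 1
    return histogram


# Made-up reader sets for three hypothetical books.
readers_by_book = {
    "Book 1": {"r1", "r2", "r3", "r4"},
    "Book 2": {"r3", "r4", "r5", "r6"},
    "Book 3": {"r7", "r8"},
}
histogram = overlap_histogram(readers_by_book)
```

Plotting the counts in `histogram` against the percentage gives the distribution; a tall bar at a low percentage corresponds to the "weak overlap" peak described above.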

A spreadsheet listing the top 5,000 book-to-book reading pairs. This is a selection from the data that went into the graph, showing only the most "overlappy" book-to-book pairs. And a revised spreadsheet implementing a different method of calculating overlap (see the last column in the spreadsheet).

A spreadsheet listing the "Wister" book-to-book reading pairs. Another selection from the data that went into the graph, showing only the book-to-book pairs which include at least one Wister book. And a revised spreadsheet implementing a different method of calculating overlap (see the last column in the spreadsheet).

A version of the old bubble graph showing just Wister (red), Alcott and Finley (blue), and Alger and Fosdick (green). This graph uses the data we extracted aeons ago, so if we like this, I'd prefer to re-extract the data.

A spreadsheet listing Mable and William Harmans' library transactions. She starts using the library in 1896 at age 12, and uses the library through the end of the data at age 18, checking out 249 items. He starts using the library in 1898 at age 16, and uses the library through the end of the data at age 20, checking out 143 items. I suspect that before 12/31/1898, at least some of Mable's transactions are on behalf of William. See especially the period between August 8, 1897, and September 13, 1897, when she checks out 13 books by Horatio Alger, sometimes on successive days. Interesting: when they are both actively checking books out of the library, they tend to use the library on different days. Although it's not unusual for them to go on the same day, 2/3 of the time they go on different days.
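One way a "different days" share like the one above could be computed is sketched below. This is an assumption about the method, not a record of how the 2/3 figure was obtained: it takes every day on which either person used the library and asks on what fraction of those days only one of them did. The function name and the dates are made up.

```python
from datetime import date


def share_of_separate_days(days_a, days_b):
    """Of all days on which either person used the library,
    return the fraction on which only one of them did."""
    all_days = days_a | days_b
    if not all_days:
        return 0.0
    # Symmetric difference: days that appear in exactly one of the two sets.
    return len(days_a ^ days_b) / len(all_days)


# Made-up visit dates for two hypothetical family members.
person_a = {date(1899, 1, 3), date(1899, 1, 5), date(1899, 1, 9)}
person_b = {date(1899, 1, 4), date(1899, 1, 5), date(1899, 1, 10), date(1899, 1, 12)}
share = share_of_separate_days(person_a, person_b)
```

For a fair comparison the two date sets should be limited to the period when both people are actively using the library.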

A similar spreadsheet for Rosa and Huston Burmaster.

A graph of the number of readers per number of checkouts. 45% of library patrons check out only 1 item. 70% of readers check out 5 or fewer items.
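Percentages like the two above are cumulative shares of readers at or below a given number of checkouts. A minimal sketch, with a hypothetical helper name and made-up counts (not the library data):

```python
def cumulative_share(checkouts_per_reader, limit):
    """Fraction of readers who checked out `limit` items or fewer."""
    if not checkouts_per_reader:
        return 0.0
    within = sum(1 for n in checkouts_per_reader.values() if n <= limit)
    return within / len(checkouts_per_reader)


# Made-up checkout counts for five hypothetical patrons.
checkouts_per_reader = {"p1": 1, "p2": 1, "p3": 3, "p4": 7, "p5": 20}
only_one = cumulative_share(checkouts_per_reader, 1)
five_or_fewer = cumulative_share(checkouts_per_reader, 5)
```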

Older demographic files:

A spreadsheet ranking Wister readers' "preferences" for books. The "most favored" books are at the top, the "least favored" are at the bottom.

Network graphs showing how the readerships of novels correlate with each other: > 0.29, > 0.19, and > 0.09.

The "color all of the Wister novels red, and their connected books orange" version: > 0.09.
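The three network graphs above differ only in the correlation cutoff used to decide which pairs of novels are connected. A small sketch of that filtering step, using a hypothetical function name and made-up correlation values:

```python
def edges_above(correlations, threshold):
    """Keep only the book pairs whose readership correlation exceeds `threshold`."""
    return sorted(pair for pair, r in correlations.items() if r > threshold)


# Made-up correlations between four hypothetical novels.
correlations = {
    ("A", "B"): 0.35,
    ("A", "C"): 0.22,
    ("B", "C"): 0.05,
    ("C", "D"): 0.12,
}
edge_counts = {t: len(edges_above(correlations, t)) for t in (0.29, 0.19, 0.09)}
```

Lowering the cutoff can only add edges, which is why the > 0.09 graph is the densest of the three.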

topic modeling -- PCA scatter plots

In all instances, the 10 topic run seems to offer the best separation between the texts.

full texts
standard stopwords removed:


10 topics
25 topics
50 topics
100 topics
200 topics
full texts
standard stopwords and words which
occur in only one text removed:


10 topics
25 topics
50 topics
100 topics
200 topics
full texts
standard stopwords and
proper nouns removed:


10 topics
25 topics
50 topics
100 topics
200 topics
endings (last 10% of each novel)
standard stopwords removed:


10 topics
25 topics
50 topics
100 topics
200 topics
endings (last 10% of each novel)
standard stopwords and words which
occur in only one text removed:


10 topics
25 topics
50 topics
100 topics
200 topics
endings (last 10% of each novel)
standard stopwords and
proper nouns removed:


10 topics
25 topics
50 topics
100 topics
200 topics

topic modeling -- cluster diagrams

full texts
standard stopwords removed:


10 topics (words)
25 topics (words)
50 topics (words)
100 topics (words)
200 topics (words)
full texts
standard stopwords and words which
occur in only one text removed:


10 topics (words)
25 topics (words)
50 topics (words)
100 topics (words)
200 topics (words)
full texts
standard stopwords and
proper nouns removed:


10 topics (words)
25 topics (words)
50 topics (words)
100 topics (words)
200 topics (words)
endings (last 10% of each novel)
standard stopwords removed:


10 topics (words)
25 topics (words)
50 topics (words)
100 topics (words)
200 topics (words)
endings (last 10% of each novel)
standard stopwords and words which
occur in only one text removed:


10 topics (words)
25 topics (words)
50 topics (words)
100 topics (words)
200 topics (words)
endings (last 10% of each novel)
standard stopwords and
proper nouns removed:


10 topics (words)
25 topics (words)
50 topics (words)
100 topics (words)
200 topics (words)

topic modeling -- correlation reports

These reports show (or should show) how positively or negatively various topics correlate with "our" novels.

full texts
standard stopwords removed:


10 topics
25 topics
50 topics
100 topics
200 topics
full texts
standard stopwords and words which
occur in only one text removed:


10 topics
25 topics
50 topics
100 topics
200 topics
full texts
standard stopwords and
proper nouns removed:


10 topics
25 topics
50 topics
100 topics
200 topics
endings (last 10% of each novel)
standard stopwords removed:


10 topics
25 topics
50 topics
100 topics
200 topics
endings (last 10% of each novel)
standard stopwords and words which
occur in only one text removed:


10 topics
25 topics
50 topics
100 topics
200 topics
endings (last 10% of each novel)
standard stopwords and
proper nouns removed:


10 topics
25 topics
50 topics
100 topics
200 topics
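As a hedged illustration of what a correlation report like the ones above might compute, the sketch below correlates a topic's weight in each novel with a 0/1 flag marking "our" novels. Using Pearson correlation against a binary flag is an assumption, as the notes do not name the statistic; the function name and the numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant list; 0.0 is a placeholder choice
    return cov / sqrt(var_x * var_y)


# Made-up weights of one topic in four novels; the first two novels are "ours".
topic_weight = [0.40, 0.35, 0.10, 0.05]
is_ours = [1, 1, 0, 0]
r = pearson(topic_weight, is_ours)
```

A value near +1 means the topic is concentrated in "our" novels, and a value near -1 means it is concentrated in the others.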

topic modeling -- sentence-level topic modeling

For these runs, each sentence was considered a separate "text".

full texts
standard stopwords removed:


50 topics
100 topics
full texts
standard stopwords and words which
occur in only one text removed:


50 topics
100 topics
full texts
standard stopwords and
proper nouns removed:


10 topics
25 topics
50 topics
100 topics
full texts
standard stopwords and
proper nouns removed
(improved interface):


10 topics
25 topics
endings (last 10% of each novel)
standard stopwords removed:


50 topics
100 topics
endings (last 10% of each novel)
standard stopwords and words which
occur in only one text removed:


50 topics
100 topics
endings (last 10% of each novel)
standard stopwords and
proper nouns removed:


10 topics
25 topics
50 topics
100 topics
endings (last 10% of each novel)
standard stopwords and
proper nouns removed
(improved interface):


10 topics
25 topics
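For the sentence-level runs above, each novel first has to be broken into sentences so that every sentence can be handed to the topic modeler as its own "text". The sketch below uses a deliberately naive regular-expression splitter; it is an assumption for illustration, not the splitter actually used, and it will mis-split abbreviations such as "Mr." and quoted dialogue.

```python
import re


def split_sentences(text):
    """Very naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# A made-up snippet standing in for a novel.
novel = "She paused. Was it true? It was!"
texts = split_sentences(novel)
```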

Topic modeling, etc., Round 1 (September 23, 2015)

topic modeling

A sparklines viewer for a 10 topic run. This run was the one that led me to suspect the presence of a "melodrama" topic (topic 0).

The same, but for a 20 topic run. If there's melodrama, it's probably topic 4.

topic modeling -- expanded corpus

PCA plot from a 10 topic run which included "our" 106 novels, plus 753 other books also in the Muncie library.

PCA plot from a 25 topic run.

PCA plot from a 50 topic run.

PCA plot from a 100 topic run.

PCA plot from a 200 topic run.


Cluster diagrams. These may be hard to look at, at first, because they are very large.

Hierarchical cluster diagram, 10 topics.

Hierarchical cluster diagram, 25 topics.

Hierarchical cluster diagram, 50 topics.

Hierarchical cluster diagram, 100 topics.

Hierarchical cluster diagram, 200 topics.


A set of reports showing how well or poorly topics correlated with the novels we identified as being favored by Marlit readers:

10 topics

25 topics

50 topics

100 topics

200 topics

word2vec -- expanded corpus

A set of PCA plots, using as features the relative percentage of word2vec word classes for each novel:


200 word classes

500 word classes

1000 word classes

Not at all helpful, but included here because it was on the to-do list . . .
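The features behind the PCA plots above are, for each novel, the relative percentage of its words that fall into each word2vec word class. A minimal sketch of building one such feature vector follows. The word-to-class mapping would come from clustering the word2vec vectors, which is not shown here; the function name, the mapping, and the tokens are all made up, and skipping words with no class is an assumption.

```python
from collections import Counter


def class_percentages(tokens, word_class, n_classes):
    """Percentage of a novel's classified words that fall into each word class."""
    # Words with no assigned class are skipped (assumption).
    counts = Counter(word_class[t] for t in tokens if t in word_class)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_classes
    return [100.0 * counts[c] / total for c in range(n_classes)]


# Made-up mapping from words to 3 hypothetical word classes, and a tiny "novel".
word_class = {"sea": 0, "ship": 0, "love": 1}
tokens = ["sea", "ship", "love", "sea", "the"]
features = class_percentages(tokens, word_class, n_classes=3)
```

Stacking one such vector per novel gives the matrix that a PCA would then reduce to two dimensions for plotting.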


A set of reports showing how well or poorly word2vec word classes correlated with the novels we identified as being favored by Marlit readers:

200 word classes

500 word classes

1000 word classes