Topic Modeling

For the past year or so, we've been using topic modeling to find differences and similarities within collections of novels. In topic modeling, a topic is defined as "a set of words which characterizes the content of a text." At its simplest, it's a way of thinking about and categorizing the vocabularies of texts. We use the Mallet toolkit from the University of Massachusetts. For explanations of topic modeling intended for a non-specialist audience, please see Matthew Jockers' and Ted Underwood's web pages on the subject.

Topic modeling can produce interesting results. For an especially good example, see Mining the Dispatch, a project which used topic modeling to categorize articles printed during the Civil War in the Richmond Daily Dispatch.

Since we have a subset of the top 300 most popular books in the Muncie Public Library, we thought it might be interesting to topic model them.

The first output of topic modeling is a spreadsheet which shows how topics are distributed across the novels in our corpus, and how each novel is made up of different topics. For example, the first couple of lines of the spreadsheet read:

                                                Topic 0  Pct 0   Topic 1  Pct 1   Topic 2  Pct 2
22  Donald_and_Dorothy_by_Mary_Mapes_Dodge.txt  0        15.76%  29       12.41%  12       11.20%

This line says that Mary Mapes Dodge's novel Donald and Dorothy is made of about 16% topic 0, 12% topic 29, 11% topic 12, and so on.
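To make the layout of that row concrete, here is a minimal Python sketch that splits it into topic/percentage pairs. It assumes the spreadsheet is exported as tab-separated text, with a row number and file name followed by alternating topic numbers and percentages; the function name parse_doc_topics_row is made up for illustration and is not part of Mallet.

```python
def parse_doc_topics_row(row, sep="\t"):
    """Split one row into (doc_id, filename, [(topic, share), ...]).

    Assumes: field 0 is a row number, field 1 is the file name, and the
    remaining fields alternate between a topic number and a percentage.
    """
    fields = row.rstrip("\n").split(sep)
    doc_id, filename = int(fields[0]), fields[1]
    rest = fields[2:]
    pairs = []
    # Pair up every topic number with the percentage that follows it.
    for topic, pct in zip(rest[0::2], rest[1::2]):
        pairs.append((int(topic), float(pct.rstrip("%")) / 100))
    return doc_id, filename, pairs


# The example row from the spreadsheet, written here with tab separators.
row = "22\tDonald_and_Dorothy_by_Mary_Mapes_Dodge.txt\t0\t15.76%\t29\t12.41%\t12\t11.20%"
doc_id, filename, pairs = parse_doc_topics_row(row)

print(filename)
for topic, share in pairs:
    print(f"  topic {topic}: {share:.2%}")
```

Storing the percentages as fractions (0.1576 rather than "15.76%") makes them easier to compare and to do arithmetic on later.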

The next output--a list of the words associated with each topic--answers the question, "what is topic 0, topic 29, topic 12, etc.?" Topic modeling doesn't say anything like, "topic 0 is a vocabulary of class and luxury"; instead, we have to infer a description of the topic from the list of words that make up the topic. For example, the words for topic 0 are in part:

earl kate francis george grandfather london ladyship sailor lordship madame england mamma french charming castle pony color charity lawyer dearest america pink mary monsieur papa lace afterward park heir yellow american lee clever gallery dresses humor porch diamonds library . . .

If you scroll down the list of topic words, you'll see the words for topic 4:

colonel army officers officer major fort camp soldiers soldier troops march british wounded guns regiment arthur st prisoners fighting sergeant lieutenant military governor saddle ranks fought virginia cavalry sword staff cannon . . .

Clearly topic 4 represents a military vocabulary. If you now go back to the spreadsheet and look down the columns labeled "Topic 0, Pct 0, Topic 1, Pct 1, Topic 2, Pct 2", you should have no trouble seeing that topic 4--a vocabulary of militarism--occurs in a surprising number of the texts. It isn't surprising that the militarism topic figures in a book like Frank on a Gun Boat by Harry Castlemon; but what is it doing in Martha Finley's Elsie's Vacation and After Events?

There's a third set of outputs we might look at. By using the topic numbers and percentages for the novels, we can compute the distance between any two of the novels. If two novels have similar percentages of the same topics, then we can understand them as being "close" (or "similar") to each other. On the other hand, if two novels are made up of different topics, then we can see them as "distant."
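As a rough illustration of what "distance" can mean here, the Python sketch below turns each novel's topic percentages into a vector and measures the Euclidean distance between vectors. This is only one possible measure, chosen for the example; it is not necessarily the one behind our results. The three novels and their percentages are invented, and the topic count of 30 is an assumption (we only know that a topic numbered 29 exists).

```python
import math

N_TOPICS = 30  # assumption: topic 29 exists, so there are at least 30 topics


def to_vector(pairs, n_topics=N_TOPICS):
    """Place (topic, share) pairs into a fixed-length list, zero elsewhere."""
    vec = [0.0] * n_topics
    for topic, share in pairs:
        vec[topic] = share
    return vec


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical novels with made-up topic shares, for illustration only.
novel_a = to_vector([(0, 0.16), (29, 0.12), (12, 0.11)])
novel_b = to_vector([(0, 0.14), (29, 0.13), (12, 0.10)])  # similar mix to novel_a
novel_c = to_vector([(4, 0.30), (7, 0.20), (12, 0.05)])   # mostly other topics

print("a vs b:", round(euclidean(novel_a, novel_b), 3))
print("a vs c:", round(euclidean(novel_a, novel_c), 3))
```

Novels a and b share the same topics in similar proportions, so the distance between them comes out small; novel c is built from different topics, so it sits much farther from a.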

We've prepared two visualizations (here and here), made with slightly different versions of the texts, which help us visualize the similarity or closeness between novels. In these visualizations, novels are represented as circles. If two novels are similar to each other (close in distance), then they are connected with a line.
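The idea of connecting similar novels with a line can be sketched in Python as follows. This is an illustration under assumptions, not a description of how the visualizations were actually built: the cutoff of 0.1, the Euclidean distance, and the three four-topic novels are all invented for the example.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_edges(vectors, threshold):
    """Return a pair of names for every two novels closer than the threshold."""
    edges = []
    # combinations() visits each unordered pair of novels exactly once.
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(vectors.items()), 2):
        if euclidean(vec_a, vec_b) <= threshold:
            edges.append((name_a, name_b))
    return edges


# Hypothetical topic shares over four topics, for illustration only.
vectors = {
    "novel_a": [0.16, 0.12, 0.11, 0.00],
    "novel_b": [0.14, 0.13, 0.10, 0.00],
    "novel_c": [0.00, 0.00, 0.05, 0.30],
}

edges = similarity_edges(vectors, threshold=0.1)  # the cutoff is an assumption
print(edges)
```

Each pair in the result would be drawn as one line between two circles. Raising the threshold adds lines and makes the network denser; lowering it leaves only the closest pairs connected.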

These visualizations yield both expected and unexpected results. On one hand, it's not surprising that the novels of individual authors (search the visualizations for Alcott, Alger and Finley) are close to each other, but not necessarily close to other authors' novels. It's not a surprise because topic modeling is all about vocabulary, and because we understand vocabulary to be in part individual.

The surprise is that a novel and author I've never heard of, The Story of A Bad Boy by Thomas Bailey Aldrich, is at the center of the larger, non-author-specific networks in both visualizations (search the visualizations for Aldrich). Aldrich's novel is, in other words, at the center of these networks of closeness and similarity. From this perspective, it looks like an archetypal novel among this set.

The novel opens curiously:

This is the story of a bad boy. Well, not such a very bad, but a pretty bad boy; and I ought to know, for I am, or rather I was, that boy myself.

Lest the title should mislead the reader, I hasten to assure him here that I have no dark confessions to make. I call my story the story of a bad boy, partly to distinguish myself from those faultless young gentlemen who generally figure in narratives of this kind, and partly because I really was not a cherub. I may truthfully say I was an amiable, impulsive lad, blessed with fine digestive powers, and no hypocrite. I didn't want to be an angel and with the angels stand; I didn't think the missionary tracts presented to me by the Rev. Wibird Hawkins were half so nice as Robinson Crusoe; and I didn't send my little pocket-money to the natives of the Feejee Islands, but spent it royally in peppermint-drops and taffy candy. In short, I was a real human boy, such as you may meet anywhere in New England, and no more like the impossible boy in a storybook than a sound orange is like one that has been sucked dry. But let us begin at the beginning.

It's an opening that understands its place within the universe of popular reading, circa the 1880s and 90s. I haven't read the whole novel, but I have looked through it, and I feel pretty comfortable in saying (with only modest hyperbole) that if it happens in a novel, it happens in The Story of A Bad Boy.

Thomas Bailey Aldrich was, it turns out, a figure of some literary importance (editor of the Atlantic Monthly, poet, etc.). In other words, he's the sort of writer we might expect to write something that is at the same time both a post hoc ur-version of a genre and a send-up of it.

It says quite a bit about the Muncie readers that some of them--no matter how locked in they are on Alger, Alcott and Finley--would have the sophistication to appreciate Aldrich's positioning of his book within his--and their--reading culture.