Nov 4, 2011: Prototypes for visualization, etc

Materials on this page are of prototype quality, intended not as finished products but as samples for possible refinement!

Background

These notes describe a number of attempts to refine our use of the standard Mallet outputs. This spreadsheet suggested these prototypes:

How far are the other novels from OMS?

We were especially interested in the distance between The Old Mamselles' Secret (OMS) and Jane Eyre (JE). Based on her close reading, Professor Tatlock was surprised that the two novels were not connected in the "standard" network graphs offered on the main page (for example, here).

The spreadsheet shows two things: (a) that the absolute distance between OMS and JE (0.279004040192) is greater than the largest distance that would have caused us to connect two novels in the "standard" network graph; and (b) that JE was still relatively nearer to OMS than 180 of the 193 novels in our set. So we asked: are we looking at the right things?
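The difference between those two readings (absolute distance versus relative rank) can be sketched in a few lines. This is only an illustration, not the code behind the spreadsheet: it assumes each novel is a vector of Mallet topic proportions and assumes plain Euclidean distance (these notes do not say which measure was used). The titles "Novel A" and "Novel B", every vector, and the `threshold` value are made up.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_from(target, novels):
    """Return (title, distance) for every other novel, nearest first."""
    return sorted(
        ((title, euclidean(novels[target], vec))
         for title, vec in novels.items() if title != target),
        key=lambda pair: pair[1],
    )

# Hypothetical topic proportions, three topics per novel.
novels = {
    "OMS": [0.50, 0.30, 0.20],
    "JE": [0.40, 0.35, 0.25],
    "Novel A": [0.10, 0.10, 0.80],
    "Novel B": [0.45, 0.30, 0.25],
}
threshold = 0.10  # hypothetical cut-off for drawing an edge in the graph

ranked = rank_from("OMS", novels)
for position, (title, dist) in enumerate(ranked, start=1):
    linked = dist <= threshold
    print(f"{position}. {title}: {dist:.3f} linked={linked}")
```

In this toy data JE falls outside the cut-off, so a threshold-based graph would not connect it to OMS, even though it is the second-nearest novel.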

Are Steve's distance measurements right?

I desk-checked the calculations, and they seemed fine. I also checked them against the distance measures available in the Orange data visualization and analysis package. My results do not match Orange's numerically (the Orange distance measures perform a "normalization of values"), but they are consistent with Orange's (i.e., if I say two things are close, Orange agrees).
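One way to make "consistent" concrete is to compare the two sets of distances by rank rather than by value. The sketch below computes a Spearman rank correlation in plain Python; it is not how the check above was actually done, it does not call Orange, and both lists of distances are invented. It also assumes there are no tied values.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical distances from one novel to four others, two measures.
mine = [0.07, 0.12, 0.28, 0.75]
normalized = [0.21, 0.33, 0.61, 0.98]

rho = spearman(mine, normalized)
print(rho)
```

A value near 1 means the two measures order the novels the same way even though the numbers differ; a value near -1 would mean they disagree about what is close.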

Multi-dimensional scaling

Merlin asked that I look into multi-dimensional scaling (MDS) for his project. I tried it out on Tatlock's data using a couple of different scaling approaches from the Orange toolkit:

MDS performed on Tatlock's novels one way . . .

. . . and another.

(Apologies for the image size, but these are quick prototypes, so they're not exactly as smooth as I'd like.) OMS is colored red, JE is green, and the other Marlitt books are blue. I expected to see most of Marlitt's books close together, and all of them within some reasonable closeness to JE.

Because the image size was so awkward, I put together a more interactive visualization, and I also ran it for Merlin's novels (click on the little gray squares to see which novel they represent):

MDS for Tatlock's corpus . . .

. . . and for Merlin's

In looking at the Tatlock novels, I don't quite see what I expected. I'm pleased that OMS and Marlitt's Little Moorland Princess end up being drawn close to each other, and it seems right that JE is in the middle of the images. On the other hand, the Marlitt novels seem too spread out, which is odd, since they're closer to each other than to anything else.

What's the problem? It may be that I don't understand how to read the images. Or it simply may not be possible to compress a 193-dimension dataset (our full distance spreadsheet) down to two dimensions while having the result make sense in every respect. Or it may be that the images are right.
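The second possibility can be checked numerically: compare the original distances with the distances between the points as drawn, and see how much is lost. The sketch below computes a normalized stress score in the spirit of Kruskal's stress. The function name `layout_stress` and both toy datasets are made up for illustration, and this is not Orange's own stress calculation.

```python
import math
from itertools import combinations

def layout_stress(original, layout):
    """Normalized mismatch between original distances and drawn distances.

    original: square matrix of distances between items
    layout:   list of (x, y) positions, in the same item order
    Returns 0.0 for a perfect drawing; larger values mean more distortion.
    """
    mismatch = 0.0
    total = 0.0
    for i, j in combinations(range(len(layout)), 2):
        drawn = math.dist(layout[i], layout[j])
        mismatch += (original[i][j] - drawn) ** 2
        total += original[i][j] ** 2
    return math.sqrt(mismatch / total)

# Four mutually equidistant items cannot be drawn exactly in a plane.
equidistant = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
flat_stress = layout_stress(equidistant, square)

# A 3-4-5 triangle is already two-dimensional, so nothing is lost.
triangle = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
corners = [(0, 0), (3, 0), (0, 4)]
exact_stress = layout_stress(triangle, corners)

print(round(flat_stress, 3), exact_stress)
```

A high score for the real layouts would suggest that the two-dimensional picture is distorting the distances, rather than that the images are being misread.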

Relative distances, novel-by-novel

I sketched up a couple of visualizations that plot the distances between novels on a novel-by-novel basis. I take each novel in the corpus, and prepare a graph(ish thing) that shows the distance between that novel and all the other novels (click on the little gray squares to see which novel they represent).

Novel-by-novel viewer for Tatlock's corpus . . .

. . . and for Merlin's

(The upside-down column graph is a good sign that these prototypes were built as quickly as possible.)

Other visualizations

I tried a couple of other kinds of clustering and visualization, none of which are much more helpful than what we've already seen. First, the common hierarchical cluster (I made no effort to make these easier to use, because they didn't seem worth it):

Hierarchically clustered distances

And I tried a self-organizing map. I don't have an image for this, but I do have the underlying data, which shows which novels would be grouped into which (x.y) locations in an 8 by 8 checkerboard:

Data for a self-organizing map

The hierarchical cluster, besides being hard to read, doesn't tell us anything that we didn't already know (authors cluster together). And the self-organizing map might be interesting, except that OMS ends up being rather farther from JE than seems right. But still, I may have been too quick to dismiss these visualizations . . . perhaps this deserves more work?
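For readers who have not met hierarchical clustering, here is a bare-bones sketch of the idea: repeatedly merge the two closest groups. It assumes single linkage (the gap between two groups is their closest pair of members), which may not be the linkage used for the cluster above, and the titles and distances are invented.

```python
def single_linkage(titles, dist):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [[t] for t in titles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: gap = closest pair across the two clusters.
                gap = min(dist[x][y] for x in clusters[a] for y in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        gap, a, b = best
        merges.append((clusters[a], clusters[b], gap))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical distances: two books each by two made-up authors, A and B.
titles = ["A1", "A2", "B1", "B2"]
pairs = {
    ("A1", "A2"): 0.05, ("B1", "B2"): 0.08,
    ("A1", "B1"): 0.40, ("A1", "B2"): 0.45,
    ("A2", "B1"): 0.35, ("A2", "B2"): 0.50,
}
dist = {t: {} for t in titles}
for (x, y), d in pairs.items():
    dist[x][y] = d
    dist[y][x] = d

merges = single_linkage(titles, dist)
for left, right, gap in merges:
    print(left, "+", right, "at", gap)
```

With this toy data each author's books merge first and the two authors join last, which is the "authors cluster together" pattern described above.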

What makes two novels close?

What causes us to decide that one novel is close to another? The distance measure used topic percentages; however, beneath the topic percentages are the actual word-topic combinations that make up each novel's data in the Mallet outputs. The following page compares OMS to every other novel in the Tatlock corpus. Each novel's top 100 word-topic combinations are listed and compared, and the matches are totaled. The novels are compared in the order of their closeness to OMS (close novels are compared first, then less close), in the same order as the spreadsheet that started all this.

Comparison of top 100 word_topic combinations, OMS vs everything else

What causes us to decide that one novel is close to another? Generally speaking, close novels have a higher number of matches than far novels.
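The matching step can be sketched as a set intersection over each novel's most frequent (word, topic) pairs. This is an illustration only, not the script that built the comparison page: the function names, the words, the topic numbers, and the counts are all hypothetical, and the example uses the top 3 pairs instead of the top 100 to stay readable.

```python
def top_pairs(counts, n=100):
    """The n most frequent (word, topic) pairs for one novel, as a set."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {pair for pair, _ in ranked[:n]}

def matches(counts_a, counts_b, n=100):
    """How many top-n (word, topic) pairs two novels have in common."""
    return len(top_pairs(counts_a, n) & top_pairs(counts_b, n))

# Hypothetical counts keyed by (word, topic number).
oms = {("secret", 4): 30, ("aunt", 4): 22, ("house", 7): 18, ("garden", 2): 5}
near = {("secret", 4): 12, ("house", 7): 25, ("letter", 1): 9, ("aunt", 4): 3}
far = {("ship", 9): 40, ("sea", 9): 31, ("house", 7): 6, ("captain", 9): 20}

print(matches(oms, near, n=3))
print(matches(oms, far, n=3))
```

Because a match requires both the word and the topic to agree, a variant spelling or inflection of a word counts as a different pair.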

Note that this is a good page for understanding what Mallet does with frequently occurring words, and how morphological variation affects Mallet.

Next?

I'm doing #1 next week. I'm not so sure about the other ones.

1. "Germanness," "Happy endings," and the comparison of passages generally (the "chunk" process).
2. Modify network viewer so I connect one novel to its closest neighbor.
3. Feature selection through topic modeling? Bigrams and trigrams?
4. How many OCR errors are in the texts? It really surfaces in some instances . . .
5. What happens if I select only the top words, and topic model them? Or otherwise cluster them, etc.?
6. Classification.
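As a rough sketch of the network viewer idea in the list above (connecting each novel to its closest neighbor), the function below builds one edge per novel from a table of distances. Nothing here comes from the existing viewer: the function name, the titles, and the distances are made up.

```python
def nearest_neighbor_edges(dist):
    """One undirected edge per novel, joining it to its closest other novel."""
    edges = set()
    for title, row in dist.items():
        nearest = min((other for other in row if other != title), key=row.get)
        # Sort the pair so A-B and B-A collapse into a single edge.
        edges.add(tuple(sorted((title, nearest))))
    return sorted(edges)

# Hypothetical symmetric distances between three novels.
dist = {
    "A": {"B": 0.05, "C": 0.30},
    "B": {"A": 0.05, "C": 0.28},
    "C": {"A": 0.30, "B": 0.28},
}

edges = nearest_neighbor_edges(dist)
print(edges)
```

Unlike a fixed distance cut-off, this guarantees that every novel appears in the graph with at least one connection, however far away its nearest neighbor is.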