Finding passages of "German-ish-ness"

Failed Attempt 1

We've had a long-standing desire to see if we could use topic modeling outputs to identify passages with similar content. We were especially interested to see if we could find passages where one or more versions of "the German national character" (aka "German-ish-ness") were discussed. Toward this end, Cici assembled a document which contained passages wherein Germans and Germany were discussed. And, to make passage-specific topic modeling possible, we divided the novels in our corpus into roughly 2,000-word "chunks". Instead of topic modeling the whole novels, we topic modeled the "chunks".

At first, this seemed like a simple assignment: a) topic model the chunks; b) find a chunk which contains some of the material Cici identified ("the target chunk"); c) let the target chunk's topic percentages lead us to other chunks with similar topic percentages, and thus with similar contents.

Unfortunately, the resulting "similar" chunks weren't even remotely similar. I think the problem was that "the target chunk" contained 80% content other than the German-ish-ness we were looking for, so we ended up matching on the wrong words.
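Step (c) can be sketched in a few lines of Python. This is only an illustration: the chunk names and topic percentages below are made up, and cosine similarity is an assumed choice of comparison, not necessarily the measure we used. It also shows why the approach is fragile: the target vector describes the whole chunk, so a passage that fills only a fifth of the chunk contributes little to the match.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length topic-proportion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar_chunks(target_name, doc_topics, top_n=5):
    """Rank every other chunk by similarity to the target chunk."""
    target = doc_topics[target_name]
    scored = [
        (name, cosine_similarity(target, vec))
        for name, vec in doc_topics.items()
        if name != target_name
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy data (hypothetical): three chunks, three topics each.
doc_topics = {
    "target_chunk": [0.80, 0.15, 0.05],
    "chunk_a": [0.75, 0.20, 0.05],
    "chunk_b": [0.05, 0.15, 0.80],
}

ranked = most_similar_chunks("target_chunk", doc_topics)
for name, score in ranked:
    print(name, round(score, 3))
```

In the toy data, "chunk_a" ranks first because it shares the target's dominant topic, whatever that topic happens to be about.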

Better?

Stephen Aiken and I discussed the problem a day or so ago, and he suggested that I look up the words from a passage of German-ish-ness in the state files (the state files are an output from topic modeling which shows the topic number assigned to every word in the corpus). The process I followed was:

  1. Settled on a sample of German-ish-ness, drawn from three novels.

    The sample is available here.

    It turns out that in our corpus passages discussing Germans are not especially common. Most mentions of Germans and Germany are off-handed, and occur in one of two forms: as an adjective ("German silver"), or as an entry in a list of European countries or languages ("France, Germany and Italy"). I'm pretty sure the sample does not describe some version of "the German national character"; instead, I think a set of Victorian ideals is listed in the sample, and "German" gets attached to the list, much as "German" gets attached to "engineering" now.

  2. Settled on the 75 most important words in the sample. More on this process later.

    The words are available here.

  3. Extracted the sample's 75 most important words from the state file created when I topic modeled the chunks. I summarized, counted, etc., asking, "Of the 75 top words, which occur the most often? What topics do those words fall into? Which of those topics occur the most often?" This took a lot of looking at results, rejecting anomalies, etc. (ask about the problem of "practical aunt 61").

  4. The sample's 75 most important words were most often and significantly associated with topics 197, 87, 11 and 45.

    The chunks associated with these topics are isolated on a spreadsheet here.

    Names in bold are included in the subset listed in the next point.

  5. I pulled a subset of the chunks which the process suggests might have some similarity to the sample:

    Beecher_Norwood_8260.txt
    Beecher_Norwood_8299.txt
    Bronte_Jane_Eyre_8895.txt
    Collins_No_Name_6993.txt
    Collins_The_Law_and_the_Lady_8245.txt
    Cooper_The_Pioneers_3246.txt
    Cooper_The_Spy_1026.txt
    DeMille_The_Cryptogram_1595.txt
    Hillern_Only_a_Girl_7494.txt
    Hillern_Only_a_Girl_7520.txt
    Hugo_Les_Miserables_12094.txt
    Hugo_Les_Miserables_12148.txt
    Hugo_Les_Miserables_12151.txt
    Jay_Shiloh_6480.txt
    Kingsley_Hypatia_1756.txt
    Porter_Thaddeus_of_Warsaw_5338.txt
    Porter_Thaddeus_of_Warsaw_5339.txt
    Stowe_Uncle_Toms_Cabin_3698.txt
    Stowe_Uncle_Toms_Cabin_3700.txt
    Stowe_Uncle_Toms_Cabin_3701.txt
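The counting in steps 3 and 4 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact procedure I followed: it assumes a MALLET-style state file layout (whitespace-separated columns `doc source pos typeindex type topic`, with `#` comment lines), and the words, chunk names and topic numbers in the toy data are invented for illustration.

```python
from collections import Counter

def tally_topics(state_lines, key_words):
    """Count topic assignments for tokens whose word is one of key_words.

    Assumes each data line has six whitespace-separated fields:
    doc source pos typeindex type topic (an assumed, MALLET-style layout).
    """
    wanted = {w.lower() for w in key_words}
    topic_counts = Counter()        # topic -> number of key-word tokens
    word_topic_counts = Counter()   # (word, topic) -> number of tokens
    for line in state_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if len(fields) < 6:
            continue  # skip malformed lines rather than guess at them
        word, topic = fields[4].lower(), int(fields[5])
        if word in wanted:
            topic_counts[topic] += 1
            word_topic_counts[(word, topic)] += 1
    return topic_counts, word_topic_counts

# Toy state data (hypothetical words, chunk names and topic numbers).
sample_state = """\
#doc source pos typeindex type topic
0 chunk_0001.txt 0 0 duty 3
0 chunk_0001.txt 1 1 order 3
0 chunk_0001.txt 2 2 garden 7
1 chunk_0002.txt 0 0 duty 3
1 chunk_0002.txt 1 1 order 7
""".splitlines()

topic_counts, word_topic_counts = tally_topics(sample_state, ["duty", "order"])
for topic, count in topic_counts.most_common():
    print("topic", topic, "count", count)
```

Keeping the per-word counts alongside the per-topic totals makes it easier to spot anomalies, such as a single frequent word dominating a topic's total.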

My conclusion--based on an admittedly sketchy reading of the subset of results--is that the process is finding passages of similar content. The similarity isn't a matter of German-ness; instead, it seems to arise from common discussion of piety, rationalism, order, etc. The discussion is at times forensic or didactic. There seem often to be ruler-subject, teacher-student, court-witness relationships. In other words, I think these results suggest we're finding something coherent, even if I can't find a name for it.