09_wordnet_jane_eyre

Are there specific kinds of content associated with passages with a high number (>= 7) of sight-related words? Beyond, of course, the sight-related words themselves? And how does the content of high-sight passages compare with the content of passages with no sight-related words?

Passages with a high number (>= 7) of sight-related words are those printed out in notebook 06_eye_look_see_etc.ipynb. There are 37 such passages. 110 passages contain no sight-related words.

What synsets are characteristic of all high-sight passages?

TL;DR

To skip to the point, scroll down to the cell labeled "Compare high-sight and no-sight synset counts"; nothing much happens before then, at least in terms of interesting results . . .

Load spacy, . . .

. . . load the text, and pass to spacy for part-of-speech tagging and lemmatization.

In [1]:
import spacy
nlp = spacy.load('en')
In [2]:
import codecs, re

CORPUS_FOLDER = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'

text = codecs.open(CORPUS_FOLDER + 'Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', 
                   'r', encoding='utf-8').read()
text = re.sub(r'\s+', ' ', text).strip()

doc = nlp(text)

Slice the texts into bins, etc

Like I did in notebook 06_eye_look_see_etc.ipynb, I'm slicing Jane Eyre into 1,000 231-token bins (there's a 1,001st bin, but it holds only the last 203 tokens of the novel). For each bin, I'm saving the original tokens, counting sight-related words (same method as in 06_eye_look_see_etc.ipynb), saving a list of the synsets associated with only the non-sight-related words in the passage, and saving a list of the words associated with each synset. I'm prepending prefixes to synset names (e.g., "W.", "0.", etc.) so I can keep track of how far up the tree the synset occurs: "W." marks a synset of the word itself, and the numbered prefixes index the reversed hypernym list, so low numbers sit near the top of the tree.
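The binning arithmetic and the prefix scheme described above can be checked on their own, without spacy or Wordnet. This is only a sketch: `assign_bins` and `label_chain` are hypothetical helper names, and the synset names passed to `label_chain` are placeholders rather than real Wordnet lookups. The token count (231,203) is the one printed by the cell below.

```python
def assign_bins(n_tokens, n_bins):
    """Count how many tokens land in each bin when bins have a fixed size."""
    bin_size = n_tokens // n_bins      # integer division, like Python 2's `/` on ints
    counts = [0] * (n_bins + 1)        # one extra bin for the leftover tokens
    for i in range(n_tokens):
        counts[i // bin_size] += 1
    return bin_size, counts


def label_chain(word_synset, hypernyms_nearest_first):
    """Prefix a word's synset with 'W.' and its hypernyms with their
    position in the reversed list (0 = farthest from the word)."""
    chain = list(reversed(hypernyms_nearest_first))
    keys = ['W.' + word_synset]
    keys.extend(str(depth) + '.' + name for depth, name in enumerate(chain))
    return keys


bin_size, counts = assign_bins(231203, 1000)
print(bin_size, len(counts), counts[0], counts[-1])   # prints: 231 1001 231 203

# Placeholder names, not real Wordnet synsets.
keys = label_chain('leaf.n.01', ['parent.n.01', 'grandparent.n.01', 'root.n.01'])
print(keys)
```

The last bin is short because 1,000 bins of 231 tokens account for only 231,000 of the 231,203 tokens.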

In [3]:
from textblob import Word

sight_related_synsets = [
    'eye.n.',
    'look.n.',
    'sight.n.',
    'stare.n.',
    'gaze.n.',
    'vision.n.',
    'see.v.01',
    'detect.v.01',
    'spy.v.03',
    'appear.v.04',
    'look.v.01',
    'visualize.v.01',
    'see.v.23',
    'look.v.03',
    'watch.v.01',
]

# ------------------------------------------------------------------------

n_tokens = doc.__len__()

N_BINS = 1000

bin_size = n_tokens / N_BINS

print 'n_tokens', n_tokens, 'N_BINS', N_BINS, 'bin_size', bin_size

# ------------------------------------------------------------------------

bins = []
for a in range(0, N_BINS + 1):
    bins.append({'original_tokens': [], 
                 'synsets': [], 
                 'n_sight_words': 0, 
                 'synsets_to_words': {}, 
                 'lemma': []})

hyper = lambda s: s.hypernyms()

for t in doc:
        
    #if t.i > 100:
    #    break
            
    bin_number = t.i / bin_size
    
    bins[bin_number]['original_tokens'].append(t.text)
    
    # -------------------------------------------------
        
    is_seeing_lemma = False
        
    if t.lemma_ not in ['-PRON-', 'which', 'what', 'who', 'be'] and \
        t.pos_ in ['NOUN', 'VERB',]:
        
        word_synsets =  Word(t.lemma_).get_synsets(pos=t.pos_[0].lower())

        for w in word_synsets:
                
            for sight_related_synset in sight_related_synsets:
                if w.name().startswith(sight_related_synset) == True or \
                    w.name() == sight_related_synset:
                    is_seeing_lemma = True
            
            h = list(w.closure(hyper, depth=10))
            h.reverse()
            
            for hn, hypernym in enumerate(h):
                
                for sight_related_synset in sight_related_synsets:
                    if hypernym.name().startswith(sight_related_synset) == True or \
                        hypernym.name() == sight_related_synset:
                        is_seeing_lemma = True
                        
    # -------------------------------------------------
    
    if is_seeing_lemma == False:
    
        if t.pos_ not in ['PUNCT']:
            bins[bin_number]['lemma'].append(t.lemma_)
    
        if t.lemma_ not in ['-PRON-', 'which', 'what', 'who', 'be'] and \
            t.pos_ in ['NOUN', 'VERB',]:

            word_synsets =  Word(t.lemma_).get_synsets(pos=t.pos_[0].lower())

            for w in word_synsets:
                
                synset_key = 'W.' + w.name()

                bins[bin_number]['synsets'].append(synset_key)
                
                try:
                    bins[bin_number]['synsets_to_words'][synset_key].append(t.lemma_)
                except KeyError:
                    bins[bin_number]['synsets_to_words'][synset_key] = [t.lemma_]

                h = list(w.closure(hyper, depth=10))
                h.reverse()

                for hn, hypernym in enumerate(h):
                
                    synset_key = str(hn) + '.' + hypernym.name()

                    bins[bin_number]['synsets'].append(synset_key)
                
                    try:
                        bins[bin_number]['synsets_to_words'][synset_key].append(t.lemma_)
                    except KeyError:
                        bins[bin_number]['synsets_to_words'][synset_key] = [t.lemma_]
                 
    if is_seeing_lemma == True:
        bins[bin_number]['n_sight_words'] += 1
            
print 'len(bins)', len(bins),  'len(bins[0][\'original_tokens\'])', len(bins[0]['original_tokens'])

# ------------------------------------------------------------------------
n_tokens 231203 N_BINS 1000 bin_size 231
len(bins) 1001 len(bins[0]['original_tokens']) 231

What synsets are characteristic of all high-sight passages?

I count up the number of high-sight passages in which synsets occur. Here, it doesn't matter how many times a synset occurs in any high-sight passage; once is enough for me to count it.

After a couple of debugging-related outputs (the number of high-sight passages, and the positions of those passages in the novel), this cell outputs:

  • A table listing synsets and the number of high-sight bins in which they occur. An "n bins" count of 37 means that the synset occurs at least once in every high-sight bin. For the sake of brevity, I list only synsets with "n bins" counts of 37.
  • A list of those same synsets, each followed by the lemmas in the high-sight bins that point to it, with their counts. For example, "3.cognition.n.01" occurs in every high-sight bin, and its ten most frequent lemmas there include "heart" (26 occurrences), "light" (24), and "feature" (18); see this snippet of the output:

      3.cognition.n.01
    
          heart 26, light 24, feature 18, feeling 16, rule 15, hand
          13, mind 12, attention 12, thought 12, head 11

Note that here and in what follows, I'm discarding synsets which occur either at the topmost nodes of the Wordnet hierarchy (prefixes "0." and "1.") or at the bottom, word level (prefix "W."), my feeling being that these levels won't yield interesting information.

Results? At first glance, this seems interesting . . . for example, every high-sight bin contains at least one word which points toward synset "3.cognition.n.01", and so it's tempting to conclude, "In Jane Eyre, sight and cognition occur together." However, we should look at the next set of cells . . .
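The counting rule used in the cell below (a synset counts once per bin, however often it recurs there, and the top-level and word-level keys are dropped) can be sketched on toy data. The function name `bin_frequency` and the three toy bins are assumptions made for illustration; they are not taken from the novel.

```python
from collections import Counter


def bin_frequency(bins, skip_prefixes=('0.', '1.', 'W.')):
    """For each synset key, count the number of bins it appears in."""
    df = Counter()
    for synset_keys in bins:
        # A set collapses repeats, so each key counts once per bin.
        kept = {k for k in synset_keys if not k.startswith(skip_prefixes)}
        df.update(kept)
    return df


# Toy bins: each is the list of prefixed synset keys saved for one passage.
toy_bins = [
    ['W.face.n.01', '2.attribute.n.02', '2.attribute.n.02', '3.cognition.n.01'],
    ['2.attribute.n.02', '0.entity.n.01'],
    ['3.cognition.n.01', '2.attribute.n.02'],
]

df = bin_frequency(toy_bins)
in_every_bin = sorted(k for k, n in df.items() if n == len(toy_bins))
print(df)
print(in_every_bin)
```

Only "2.attribute.n.02" appears in all three toy bins, so it is the only key that would survive a filter like the "n bins == 37" test in the notebook.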

In [4]:
from collections import defaultdict, Counter
import tabletext, textwrap

n_high_sight_bins = 0
high_sight_bin_numbers = []
high_synset_df = defaultdict(int)
check_synsets = []

for bn, b in enumerate(bins):
    if b['n_sight_words'] >= 7:
        
        high_sight_bin_numbers.append(bn)
        n_high_sight_bins += 1
        synset_wf = defaultdict(int)
        
        for s in b['synsets']:
            if s[:2] not in ['0.', '1.', 'W.']:
                synset_wf[s] += 1
                check_synsets.append(s)
                
        for k in synset_wf:
            high_synset_df[k] += 1
            
print
print 'n_high_sight_bins', n_high_sight_bins
print
print 'high_sight_bin_numbers', high_sight_bin_numbers
print
print 'len(set(check_synsets))', len(set(check_synsets))
print 'len(high_synset_df)', len(high_synset_df)
print

print

results = [['synset', 'n bins']]

synsets_to_lookup = []
synsets_word_counts = defaultdict(lambda : defaultdict(int))

for w in Counter(high_synset_df).most_common():
    if w[1] != 37:
        break
    synsets_to_lookup.append(w[0])
    results.append([w[0], w[1]])
        
print tabletext.to_text(results)

synsets_to_lookup = set(synsets_to_lookup)

for bn, b in enumerate(bins):
    if b['n_sight_words'] >= 7:
        for s, words in b['synsets_to_words'].iteritems():
            if s in synsets_to_lookup:
                for w in words:
                    synsets_word_counts[s][w] += 1

for k in synsets_word_counts.keys():
    
    word_list = []
    for w in Counter(synsets_word_counts[k]).most_common():
        word_list.append(w[0] + ' ' + str(w[1]))
    
    print
    print k
    print 
    print '\t' + '\n\t'.join(textwrap.wrap(', '.join(word_list[:10]) ,60))
        
n_high_sight_bins 37

high_sight_bin_numbers [93, 98, 103, 165, 219, 247, 277, 334, 339, 361, 372, 373, 374, 375, 396, 397, 400, 408, 421, 429, 431, 434, 558, 611, 627, 687, 700, 756, 802, 805, 809, 819, 828, 936, 937, 946, 998]

len(set(check_synsets)) 3384
len(high_synset_df) 3384


┌──────────────────────────────┬────────┐
│ synset                       │ n bins │
├──────────────────────────────┼────────┤
│ 3.living_thing.n.01          │     37 │
├──────────────────────────────┼────────┤
│ 5.activity.n.01              │     37 │
├──────────────────────────────┼────────┤
│ 3.event.n.01                 │     37 │
├──────────────────────────────┼────────┤
│ 4.physical_entity.n.01       │     37 │
├──────────────────────────────┼────────┤
│ 3.whole.n.02                 │     37 │
├──────────────────────────────┼────────┤
│ 4.act.n.02                   │     37 │
├──────────────────────────────┼────────┤
│ 3.cognition.n.01             │     37 │
├──────────────────────────────┼────────┤
│ 3.state.n.02                 │     37 │
├──────────────────────────────┼────────┤
│ 2.attribute.n.02             │     37 │
├──────────────────────────────┼────────┤
│ 2.communication.n.02         │     37 │
├──────────────────────────────┼────────┤
│ 5.organism.n.01              │     37 │
├──────────────────────────────┼────────┤
│ 2.entity.n.01                │     37 │
├──────────────────────────────┼────────┤
│ 2.group.n.01                 │     37 │
├──────────────────────────────┼────────┤
│ 3.location.n.01              │     37 │
├──────────────────────────────┼────────┤
│ 2.process.n.06               │     37 │
├──────────────────────────────┼────────┤
│ 2.measure.n.02               │     37 │
├──────────────────────────────┼────────┤
│ 2.psychological_feature.n.01 │     37 │
├──────────────────────────────┼────────┤
│ 3.inform.v.01                │     37 │
├──────────────────────────────┼────────┤
│ 4.artifact.n.01              │     37 │
├──────────────────────────────┼────────┤
│ 2.object.n.01                │     37 │
├──────────────────────────────┼────────┤
│ 7.person.n.01                │     37 │
├──────────────────────────────┼────────┤
│ 2.communicate.v.02           │     37 │
├──────────────────────────────┼────────┤
│ 2.copulate.v.01              │     37 │
├──────────────────────────────┼────────┤
│ 6.causal_agent.n.01          │     37 │
└──────────────────────────────┴────────┘

5.activity.n.01

	hand 13, effort 9, part 8, position 8, fire 7, chair 6,
	service 6, covering 6, place 6, examination 4

3.cognition.n.01

	heart 26, light 24, feature 18, feeling 16, rule 15, hand
	13, mind 12, attention 12, thought 12, head 11

4.physical_entity.n.01

	man 42, master 40, lady 39, hand 39, head 33, girl 30,
	reader 28, face 22, woman 15, brother 15

2.attribute.n.02

	face 88, heart 52, light 32, time 27, colour 25, head 22,
	softness 18, love 16, life 16, power 15

2.group.n.01

	house 28, hand 13, band 12, party 12, head 11, room 10,
	family 7, company 7, book 6, form 6

2.entity.n.01

	man 42, master 40, lady 39, hand 39, head 33, girl 30,
	reader 28, face 22, woman 15, brother 15

5.organism.n.01

	head 55, man 42, master 40, lady 39, hand 39, girl 30,
	reader 24, face 22, woman 15, brother 15

3.location.n.01

	head 33, face 22, place 21, front 20, heart 13, hand 13, air
	10, side 9, part 8, light 8

3.whole.n.02

	head 99, face 66, door 40, hall 32, window 30, chair 18,
	band 18, light 16, picture 15, hand 13

7.person.n.01

	man 42, master 40, lady 39, hand 39, head 33, reader 24,
	face 22, girl 18, woman 15, brother 15

2.measure.n.02

	day 42, time 36, night 25, life 16, moment 14, hand 13, step
	12, head 11, evening 9, love 8

3.event.n.01

	step 24, head 22, fire 21, time 18, appearance 16, voice 15,
	word 14, hand 13, opening 12, effort 12

3.inform.v.01

	say 44, give 14, tell 12, talk 9, smile 7, draw 7, get 7,
	ask 7, point 6, fear 5

4.artifact.n.01

	head 66, face 66, door 40, hall 32, window 30, chair 18,
	band 18, light 16, picture 15, hand 13

2.process.n.06

	head 22, love 8, light 8, fire 7, moment 7, cloud 6, air 5,
	power 5, front 5, answer 4

2.psychological_feature.n.01

	head 33, light 32, time 27, heart 26, hand 26, step 24, fire
	21, attention 20, appearance 20, feature 18

2.communicate.v.02

	say 44, give 42, call 35, speak 21, ask 21, talk 18, smile
	14, get 14, tell 12, read 10

3.state.n.02

	face 22, love 16, light 16, heart 13, life 12, room 10,
	place 9, cloud 9, dream 9, pride 8

3.living_thing.n.01

	man 42, master 40, lady 39, hand 39, head 33, girl 30,
	reader 28, face 22, woman 15, brother 15

2.copulate.v.01

	have 92, know 11, take 11, stand 8, love 8, mount 3, cover 1

6.causal_agent.n.01

	man 42, master 40, lady 39, hand 39, head 33, girl 30,
	reader 24, face 22, woman 15, brother 15

4.act.n.02

	step 18, hand 13, effort 12, appearance 12, air 10, voice
	10, position 10, movement 8, opening 8, expression 8

2.communication.n.02

	face 44, word 35, head 33, hand 26, expression 16, voice 15,
	book 15, picture 15, step 12, mark 12

2.object.n.01

	head 154, face 88, door 40, hall 32, window 30, hand 26,
	front 25, light 24, place 21, chair 18

What synsets are characteristic of no-sight passages?

Here, I'm producing a pair of tables similar to the one in the previous cell. Again, I'm counting how many bins--no-sight bins in this case--contain lemma which point to particular synsets.

The first table lists the synsets which occur in the most no-sight bins (I'm listing only 15 such synsets for the sake of space).

The second table lists the synsets which occur in all high-sight bins, followed by the number of no-sight bins in which each synset occurs. In other words, for a synset to be listed in this table, it had to appear in every high-sight bin.

What's the news here? Of the 24 synsets which appear in every one of the high-sight bins, 17 also occur in every one of the 110 no-sight bins, and the remaining seven each occur in at least 99 of them. The exceptions are insignificant.

In other words, the only feature that distinguishes all high-sight bins from all no-sight bins is the presence or absence of sight-related words. Or, to put it another way, it doesn't seem possible to say, "When sight occurs in JE, X always occurs, and when sight does not occur, X does not occur."
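To put a number on how small the exceptions are, the no-sight bin counts that fall short of 110 can be turned into percentages. The counts here are copied from the "HIGH_SIGHT-SYNSETS" table printed below; the variable names are mine, and this is a quick check rather than part of the original analysis.

```python
N_NO_SIGHT_BINS = 110

# Synsets found in all 37 high-sight bins but not in all 110 no-sight bins,
# with the number of no-sight bins that contain them.
no_sight_counts = {
    '3.inform.v.01': 109,
    '2.copulate.v.01': 109,
    '2.group.n.01': 109,
    '2.measure.n.02': 108,
    '3.location.n.01': 106,
    '5.activity.n.01': 103,
    '2.process.n.06': 99,
}

shares = {k: round(100.0 * n / N_NO_SIGHT_BINS, 1)
          for k, n in no_sight_counts.items()}
lowest = min(shares, key=shares.get)

for k in sorted(shares, key=shares.get):
    print('%-20s %5.1f%%' % (k, shares[k]))
```

Even the weakest case, "2.process.n.06", still turns up in 90.0% of the no-sight bins, so none of these synsets separates the two kinds of passage.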

In [5]:
from collections import defaultdict, Counter
import tabletext

n_low_sight_bins = 0
low_sight_bin_numbers = []
low_synset_df = defaultdict(int)
check_synsets = []

for bn, b in enumerate(bins):
    if b['n_sight_words'] == 0:
        
        low_sight_bin_numbers.append(bn)
        n_low_sight_bins += 1
        low_synset_wf = defaultdict(int)
        
        for s in b['synsets']:
            if s[:2] not in ['0.', '1.', 'W.']:
                low_synset_wf[s] += 1
                check_synsets.append(s)
        
        for k in low_synset_wf:
            low_synset_df[k] += 1
            
print
print 'n_low_sight_bins', n_low_sight_bins
print
print 'low_sight_bin_numbers', low_sight_bin_numbers
print
print 'len(set(check_synsets))', len(set(check_synsets))
print 'len(low_synset_df)', len(low_synset_df)
print

results = [['synset', 'n bins']]

for w in Counter(low_synset_df).most_common(15):
    results.append([w[0], w[1]])
 
print
print 'MOST COMMON SYNSETS'
print
print tabletext.to_text(results)
print

results = [['synset', 'n NO-SIGHT bins']]

for w in Counter(low_synset_df).most_common():
    if w[0] in synsets_to_lookup:
        results.append([w[0], w[1]])

print
print 'HIGH_SIGHT-SYNSETS'
print     
print tabletext.to_text(results)
        
n_low_sight_bins 110

low_sight_bin_numbers [0, 4, 7, 15, 18, 24, 28, 31, 33, 41, 44, 57, 61, 71, 78, 99, 100, 101, 110, 115, 120, 138, 144, 153, 157, 158, 167, 172, 179, 213, 216, 226, 228, 267, 279, 287, 291, 294, 304, 307, 336, 342, 346, 352, 357, 382, 383, 395, 411, 418, 458, 468, 477, 483, 488, 490, 508, 510, 511, 521, 542, 547, 549, 554, 564, 571, 582, 590, 598, 603, 613, 635, 643, 655, 657, 674, 691, 692, 694, 711, 716, 724, 738, 742, 755, 757, 789, 810, 825, 830, 840, 846, 859, 864, 867, 896, 900, 905, 909, 921, 932, 944, 963, 974, 975, 980, 983, 984, 988, 999]

len(set(check_synsets)) 5096
len(low_synset_df) 5096


MOST COMMON SYNSETS

┌──────────────────────────────┬────────┐
│ synset                       │ n bins │
├──────────────────────────────┼────────┤
│ 4.physical_entity.n.01       │    110 │
├──────────────────────────────┼────────┤
│ 2.entity.n.01                │    110 │
├──────────────────────────────┼────────┤
│ 3.whole.n.02                 │    110 │
├──────────────────────────────┼────────┤
│ 2.communicate.v.02           │    110 │
├──────────────────────────────┼────────┤
│ 6.causal_agent.n.01          │    110 │
├──────────────────────────────┼────────┤
│ 4.act.n.02                   │    110 │
├──────────────────────────────┼────────┤
│ 3.living_thing.n.01          │    110 │
├──────────────────────────────┼────────┤
│ 3.state.n.02                 │    110 │
├──────────────────────────────┼────────┤
│ 3.event.n.01                 │    110 │
├──────────────────────────────┼────────┤
│ 3.cognition.n.01             │    110 │
├──────────────────────────────┼────────┤
│ 2.attribute.n.02             │    110 │
├──────────────────────────────┼────────┤
│ 5.organism.n.01              │    110 │
├──────────────────────────────┼────────┤
│ 2.psychological_feature.n.01 │    110 │
├──────────────────────────────┼────────┤
│ 4.artifact.n.01              │    110 │
├──────────────────────────────┼────────┤
│ 2.object.n.01                │    110 │
└──────────────────────────────┴────────┘


HIGH_SIGHT-SYNSETS

┌──────────────────────────────┬─────────────────┐
│ synset                       │ n NO-SIGHT bins │
├──────────────────────────────┼─────────────────┤
│ 4.physical_entity.n.01       │             110 │
├──────────────────────────────┼─────────────────┤
│ 2.entity.n.01                │             110 │
├──────────────────────────────┼─────────────────┤
│ 3.whole.n.02                 │             110 │
├──────────────────────────────┼─────────────────┤
│ 2.communicate.v.02           │             110 │
├──────────────────────────────┼─────────────────┤
│ 6.causal_agent.n.01          │             110 │
├──────────────────────────────┼─────────────────┤
│ 4.act.n.02                   │             110 │
├──────────────────────────────┼─────────────────┤
│ 3.living_thing.n.01          │             110 │
├──────────────────────────────┼─────────────────┤
│ 3.state.n.02                 │             110 │
├──────────────────────────────┼─────────────────┤
│ 3.event.n.01                 │             110 │
├──────────────────────────────┼─────────────────┤
│ 3.cognition.n.01             │             110 │
├──────────────────────────────┼─────────────────┤
│ 2.attribute.n.02             │             110 │
├──────────────────────────────┼─────────────────┤
│ 5.organism.n.01              │             110 │
├──────────────────────────────┼─────────────────┤
│ 2.psychological_feature.n.01 │             110 │
├──────────────────────────────┼─────────────────┤
│ 4.artifact.n.01              │             110 │
├──────────────────────────────┼─────────────────┤
│ 2.object.n.01                │             110 │
├──────────────────────────────┼─────────────────┤
│ 7.person.n.01                │             110 │
├──────────────────────────────┼─────────────────┤
│ 2.communication.n.02         │             110 │
├──────────────────────────────┼─────────────────┤
│ 3.inform.v.01                │             109 │
├──────────────────────────────┼─────────────────┤
│ 2.copulate.v.01              │             109 │
├──────────────────────────────┼─────────────────┤
│ 2.group.n.01                 │             109 │
├──────────────────────────────┼─────────────────┤
│ 2.measure.n.02               │             108 │
├──────────────────────────────┼─────────────────┤
│ 3.location.n.01              │             106 │
├──────────────────────────────┼─────────────────┤
│ 5.activity.n.01              │             103 │
├──────────────────────────────┼─────────────────┤
│ 2.process.n.06               │              99 │
└──────────────────────────────┴─────────────────┘

Compare high-sight and no-sight synset counts

This cell produces a table which ranks synsets based on the difference in the percentages of high-sight and no-sight bins "containing" that synset. For example:

┌───────┬─────────┬───────────┬──────────────────────────────┐
│ diff  │ sight % │ no sight% │ synset                       │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 46.6% │ 73.0%   │ 26.4%     │ 4.appearance.n.01            │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 43.0% │ 73.0%   │ 30.0%     │ 7.surface.n.02               │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 42.0% │ 78.4%   │ 36.4%     │ 6.boundary.n.01              │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 41.4% │ 51.4%   │ 10.0%     │ 5.facial_expression.n.01     │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 41.3% │ 54.1%   │ 12.7%     │ 4.gesture.n.02               │
└───────┴─────────┴───────────┴──────────────────────────────┘

In other words, there do seem to be differences between high-sight and no-sight passages. It's not surprising that high-sight passages traffic in matters of appearance, expression and gesture, since those are so often visual in the novel.
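The ranking itself is simple arithmetic, sketched here under some loudly stated assumptions: `rank_by_difference` is a hypothetical name, and the bin counts (27, 19, 16 and 29, 11, 79) are back-calculated by me from the rounded percentages in the output table, so they are not values printed by the notebook.

```python
def rank_by_difference(high_df, low_df, n_high, n_low):
    """Rank synsets by the absolute gap between the percentage of
    high-sight bins and the percentage of no-sight bins containing them."""
    rows = []
    for key in set(high_df) | set(low_df):
        high_pct = 100.0 * high_df.get(key, 0) / n_high
        low_pct = 100.0 * low_df.get(key, 0) / n_low
        rows.append((abs(high_pct - low_pct), high_pct, low_pct, key))
    rows.sort(reverse=True)
    return rows


# Assumed bin counts, back-calculated from the table's rounded percentages.
high_df = {'4.appearance.n.01': 27, '5.facial_expression.n.01': 19, '2.supply.v.01': 16}
low_df = {'4.appearance.n.01': 29, '5.facial_expression.n.01': 11, '2.supply.v.01': 79}

rows = rank_by_difference(high_df, low_df, n_high=37, n_low=110)
for diff, high_pct, low_pct, key in rows:
    print('%.1f%%  %.1f%%  %.1f%%  %s' % (diff, high_pct, low_pct, key))
```

Because the difference is an absolute value, the "diff" column does not say which side is larger: "2.supply.v.01" makes the list because it is more common in no-sight passages, which only the two percentage columns reveal.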

I also list out the top 10 words associated with each synset (when there are at least 10; otherwise, I print them all), since the synset name may be misleading. For example, "5.aggressiveness.n.01" occurs in many more high-sight than no-sight passages ("Aha!"). Unfortunately, that's because one sense of "face" rolls up to "aggressiveness" in Wordnet.

Bottom line? We probably could have gotten here by a shorter path; e.g., we might have asked, "which words occur more (or less) frequently in high-sight vs no-sight passages?" (in fact, see the last cell of this notebook, where I ask that question). Still, with this data, we can see the outline of the differences between high-sight and no-sight passages:

  • High-sight passages contain many more references to "people from the neck up". It's mostly faces, but also heads and related parts. Not so interesting, since we've known or suspected for a long time that Jane reads faces.
  • Ditto re words and images on paper.

I think, however, that it's very interesting that "2.bend.v.01" occurs in 43.2% of high-sight passages, but only 13.6% of no-sight passages, and that "2.bend.v.01" includes words like:

fall 11, lean 6, bend 6, incline 5, stoop 4, ascend 3, curl
3, cower 2, bow 1, creep 1

We might be able to tease more out of this by including more abstract Wordnet verb nodes; however, I'm going to look at word-frequencies instead (see the last notebook).
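For what the shorter, word-frequency route mentioned above might look like, here is a minimal sketch. It assumes each bin is reduced to a list of lemmas (like the 'lemma' list saved per bin earlier), and it compares rates per 1,000 lemmas; the function names and the toy bins are illustrative assumptions, not the method used in the later notebook.

```python
from collections import Counter


def per_thousand(bins):
    """Relative frequency of each lemma, per 1,000 lemmas, across a group of bins."""
    counts = Counter(lemma for b in bins for lemma in b)
    total = sum(counts.values())
    return {w: 1000.0 * n / total for w, n in counts.items()}


def compare_groups(high_bins, low_bins):
    """Rank lemmas by (high-sight rate - no-sight rate), largest first."""
    high = per_thousand(high_bins)
    low = per_thousand(low_bins)
    rows = [(high.get(w, 0.0) - low.get(w, 0.0), w) for w in set(high) | set(low)]
    rows.sort(reverse=True)
    return rows


# Toy lemma lists standing in for high-sight and no-sight bins.
toy_high = [['face', 'smile', 'face', 'door'], ['face', 'lean', 'word']]
toy_low = [['give', 'word', 'door', 'leave'], ['give', 'word', 'day', 'face']]

ranked = compare_groups(toy_high, toy_low)
for diff, lemma in ranked:
    print('%8.1f  %s' % (diff, lemma))
```

Positive values mark lemmas that lean toward high-sight passages and negative values those that lean toward no-sight passages, so unlike the absolute "diff" column, the sign carries the direction.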

In [6]:
import math


results = []
high_synsets = []
low_synsets = []

for k in high_synset_df.keys():
    high_synsets.append(k)
    if k in low_synset_df:
        
        high_pct = (float(high_synset_df[k]) / float(n_high_sight_bins) * 100.0)
        low_pct = (float(low_synset_df[k]) / float(n_low_sight_bins) * 100.0)
        diff = math.fabs(high_pct - low_pct)
        results.append([diff, high_pct, low_pct, k])
        
    else:
        
        high_pct = (float(high_synset_df[k]) / float(n_high_sight_bins) * 100.0)
        low_pct = 0.0
        diff = high_pct
        results.append([diff, high_pct, low_pct, k])

for k in low_synset_df.keys():
    low_synsets.append(k)
    if k not in high_synset_df:
        
        high_pct = 0.0
        low_pct = (float(low_synset_df[k]) / float(n_low_sight_bins) * 100.0)
        diff = low_pct
        results.append([diff, high_pct, low_pct, k])

print
print 'len(set(high_synsets))', len(set(high_synsets))
print 'len(set(low_synset_df))', len(set(low_synset_df))       
print
        
results.sort(reverse=True)

synsets_to_lookup = []

output = [['diff', 'sight %', 'no sight%', 'synset']]
for r in results:
    if r[0] < 25.0:
        break
    output.append([('%.1f' % r[0]) + '%', 
                    ('%.1f' % r[1]) + '%', 
                    ('%.1f' % r[2]) + '%', 
                   r[3]])
    synsets_to_lookup.append(r[3])
    
print tabletext.to_text(output)

print 

synsets_word_counts = defaultdict(lambda : defaultdict(int))

#synsets_to_lookup = set(synsets_to_lookup)

for bn, b in enumerate(bins):
    if b['n_sight_words'] >= 7 or b['n_sight_words'] == 0:
        for s, words in b['synsets_to_words'].iteritems():
            if s in synsets_to_lookup:
                for w in words:
                    synsets_word_counts[s][w] += 1

for k in synsets_to_lookup:
    
    word_list = []
    for w in Counter(synsets_word_counts[k]).most_common():
        word_list.append(w[0] + ' ' + str(w[1]))
    
    print
    print k
    print 
    print '\t' + '\n\t'.join(textwrap.wrap(', '.join(word_list[:10]) ,60))
len(set(high_synsets)) 3384
len(set(low_synset_df)) 5096

┌───────┬─────────┬───────────┬──────────────────────────────┐
│ diff  │ sight % │ no sight% │ synset                       │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 46.6% │ 73.0%   │ 26.4%     │ 4.appearance.n.01            │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 43.0% │ 73.0%   │ 30.0%     │ 7.surface.n.02               │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 42.0% │ 78.4%   │ 36.4%     │ 6.boundary.n.01              │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 41.4% │ 51.4%   │ 10.0%     │ 5.facial_expression.n.01     │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 41.3% │ 54.1%   │ 12.7%     │ 4.gesture.n.02               │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 41.3% │ 56.8%   │ 15.5%     │ 6.character.n.08             │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 40.5% │ 48.6%   │ 8.2%      │ 7.front.n.04                 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 40.5% │ 48.6%   │ 8.2%      │ 6.vertical_surface.n.01      │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 40.4% │ 51.4%   │ 10.9%     │ 5.countenance.n.01           │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 39.5% │ 54.1%   │ 14.5%     │ 5.aggressiveness.n.01        │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 39.4% │ 62.2%   │ 22.7%     │ 3.visual_communication.n.01  │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 38.6% │ 48.6%   │ 10.0%     │ 7.type.n.04                  │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 37.6% │ 67.6%   │ 30.0%     │ 6.side.n.05                  │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 36.6% │ 78.4%   │ 41.8%     │ 5.extremity.n.04             │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 35.0% │ 48.6%   │ 13.6%     │ 7.property.n.04              │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 33.9% │ 67.6%   │ 33.6%     │ 5.structure.n.04             │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 33.9% │ 73.0%   │ 39.1%     │ 5.written_symbol.n.01        │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 33.7% │ 89.2%   │ 55.5%     │ 3.signal.n.01                │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 31.5% │ 32.4%   │ 0.9%      │ 8.social_event.n.01          │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 31.5% │ 32.4%   │ 0.9%      │ 11.product.n.02              │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 31.5% │ 32.4%   │ 0.9%      │ 10.show.n.03                 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 30.6% │ 32.4%   │ 1.8%      │ 6.event.n.01                 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 30.4% │ 54.1%   │ 23.6%     │ 4.drive.n.05                 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 30.4% │ 56.8%   │ 26.4%     │ 4.visual_property.n.01       │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 29.6% │ 43.2%   │ 13.6%     │ 2.bend.v.01                  │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 29.3% │ 75.7%   │ 46.4%     │ 5.surface.n.01               │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 28.8% │ 32.4%   │ 3.6%      │ 4.psychological_feature.n.01 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 28.6% │ 43.2%   │ 71.8%     │ 2.supply.v.01                │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 27.6% │ 64.9%   │ 37.3%     │ 2.abstraction.n.06           │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 27.0% │ 29.7%   │ 2.7%      │ 6.expressive_style.n.01      │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 27.0% │ 32.4%   │ 5.5%      │ 9.creation.n.02              │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 27.0% │ 32.4%   │ 5.5%      │ 7.artifact.n.01              │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 26.6% │ 75.7%   │ 49.1%     │ 5.external_body_part.n.01    │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 26.1% │ 27.0%   │ 0.9%      │ 8.writing_style.n.01         │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 26.1% │ 27.0%   │ 0.9%      │ 13.article.n.01              │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 26.1% │ 27.0%   │ 0.9%      │ 12.nonfiction.n.01           │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 26.1% │ 27.0%   │ 0.9%      │ 11.piece.n.06                │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 26.1% │ 27.0%   │ 0.9%      │ 10.prose.n.01                │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 25.8% │ 40.5%   │ 66.4%     │ 2.recite.v.02                │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 25.1% │ 35.1%   │ 10.0%     │ 3.object.n.01                │
└───────┴─────────┴───────────┴──────────────────────────────┘


4.appearance.n.01

	face 66, colour 12, expression 6, beauty 5, form 5,
	countenance 5, shape 4, mark 4, figure 3, disguise 3

7.surface.n.02

	face 33, head 26, side 7, front 6, end 5, crown 4, top 2,
	bilge 1, sphere 1, interior 1

6.boundary.n.01

	face 33, head 26, end 10, border 8, side 7, front 6, crown
	4, edge 4, limb 3, shoulder 3

5.facial_expression.n.01

	face 33, laugh 6, smile 3

4.gesture.n.02

	face 33, laugh 6, smile 3, wave 2, sign 2, beck 1

6.character.n.08

	face 33, blank 4, case 2, star 2, space 2, dagger 1, fount
	1, t 1, letter 1, alpha 1

7.front.n.04

	face 33, nose 4

6.vertical_surface.n.01

	face 33

5.countenance.n.01

	face 33, expression 6, aspect 3

5.aggressiveness.n.01

	face 33, cheek 6, nerve 4, audacity 1, cheeks 1, brass 1

3.visual_communication.n.01

	face 33, picture 12, laugh 6, expression 6, figure 3, smile
	3, drawing 3, frame 2, illustration 2, sign 2

7.type.n.04

	face 33, case 2, fount 1

6.side.n.05

	face 33, head 26, lip 11, front 6, nose 4, quarter 4, back
	3, edge 2, beam 1, brim 1

5.extremity.n.04

	face 33, head 26, end 15, point 12, border 8, side 7, crown
	6, front 6, edge 4, limb 3

7.property.n.04

	feature 18, character 8, side 7, attention 6, aspect 3,
	excellence 2, sphere 1, attraction 1

5.structure.n.04

	head 52, mouth 10, hair 9, chamber 8, ear 7, brain 7, pocket
	4, arch 4, germ 4, back 3

5.written_symbol.n.01

	face 33, head 26, point 24, character 8, blank 4, grave 3,
	star 2, mark 2, case 2, space 2

3.signal.n.01

	face 33, head 26, point 24, light 13, character 8, mark 8,
	pound 4, crown 4, number 4, blank 4

8.social_event.n.01

	feature 9, picture 6

11.product.n.02

	feature 9, picture 6

10.show.n.03

	feature 9, picture 6

6.event.n.01

	feature 9, picture 6, singing 1

4.drive.n.05

	face 33, energy 8, cheek 6, nerve 4, ambition 2, action 2,
	audacity 1, brass 1, cheeks 1

4.visual_property.n.01

	light 13, tone 9, colour 6, darkness 4, rose 3, coffee 3,
	cherry 2, gold 2, lustre 2, pearl 2

2.bend.v.01

	fall 11, lean 6, bend 6, incline 5, stoop 4, ascend 3, curl
	3, cower 2, bow 1, creep 1

5.surface.n.01

	face 99, head 26, bed 12, lip 11, side 7, front 6, ring 5,
	ground 4, quarter 4, nose 4

4.psychological_feature.n.01

	feature 9, picture 6, defence 1, projection 1, isolation 1

2.supply.v.01

	give 45, leave 38, offer 20, keep 17, open 11, set 5, charge
	4, fit 3, provide 3, allow 3

2.abstraction.n.06

	word 30, day 25, feature 18, light 13, picture 6, length 4,
	paper 3, note 3, promise 2, spark 2

6.expressive_style.n.01

	feature 9, soul 5, paper 3

9.creation.n.02

	feature 18, picture 6, note 3, paper 3, letter 1, line 1

7.artifact.n.01

	feature 18, picture 6, note 3, paper 3, letter 1, line 1

5.external_body_part.n.01

	face 66, hand 34, head 26, foot 20, arm 12, breast 8,
	countenance 5, finger 5, neck 4, right 3

8.writing_style.n.01

	feature 9, paper 3

13.article.n.01

	feature 9, paper 3

12.nonfiction.n.01

	feature 9, paper 3

11.piece.n.06

	feature 9, paper 3

10.prose.n.01

	feature 9, paper 3

2.recite.v.02

	say 128, count 1

3.object.n.01

	feature 18, picture 6, vault 4, note 3, paper 3, iceberg 1,
	repository 1, queen 1, letter 1, line 1

Very Basic QA

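Just checking some numbers: the average number of synsets (and of unique synsets) per bin, and the number of distinct synsets overall, for high-sight and for no-sight passages . . .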

In [7]:
import numpy as np

synset_counts = []
unique_synset_counts = []
synsets = []
reported_synsets = []

print
for bn, b in enumerate(bins):
    if b['n_sight_words'] >= 7:
        
        synset_counts.append(len(b['synsets']))
        unique_synset_counts.append(len(set(b['synsets'])))
        synsets += b['synsets']
        for s in b['synsets']:
            if s[:2] not in ['0.', '1.', 'W.']:
                reported_synsets.append(s)
                
        #print bn, b['n_sight_words'], len(b['synsets']), len(set(b['synsets']))
        
print 'high-sight -- average n synsets/bin', np.average(synset_counts)
print 'high-sight -- average n unique synsets/bin', np.average(unique_synset_counts)
print 'high-sight -- unique synsets', len(set(synsets))
print 'high-sight -- unique REPORTED synsets', len(set(reported_synsets))

synset_counts = []
unique_synset_counts = []
synsets = []
reported_synsets = []
        
print
for bn, b in enumerate(bins):
    if b['n_sight_words'] == 0:
        
        synset_counts.append(len(b['synsets']))
        unique_synset_counts.append(len(set(b['synsets'])))
        synsets += b['synsets']
        for s in b['synsets']:
            if s[:2] not in ['0.', '1.', 'W.']:
                reported_synsets.append(s)
        
        #print bn, b['n_sight_words'], len(b['synsets']), len(set(b['synsets']))
        
print 'no-sight -- average n synsets/bin', np.average(synset_counts)
print 'no-sight -- average n unique synsets/bin', np.average(unique_synset_counts)
print 'no-sight -- unique synsets', len(set(synsets))
print 'no-sight -- unique REPORTED synsets', len(set(reported_synsets))
high-sight -- average n synsets/bin 2232.43243243
high-sight -- average n unique synsets/bin 961.864864865
high-sight -- unique synsets 9185
high-sight -- unique REPORTED synsets 3384

no-sight -- average n synsets/bin 2232.77272727
no-sight -- average n unique synsets/bin 937.681818182
no-sight -- unique synsets 14797
no-sight -- unique REPORTED synsets 5096

Word lists: high-sight vs no-sight

Two tables of words (lemmas, really) follow: one of words which are more common in high-sight passages, and one of words which are more common in no-sight passages. The table format should be familiar:

┌──────────┬──────────┬─────────────┬────────────┐
│ diff     │ sight wf │ no sight wf │ word       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.005838 │ 0.007101 │ 0.001264    │ face       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.004476 │ 0.007424 │ 0.002948    │ rochester  │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.003381 │ 0.005487 │ 0.002106    │ yet        │
└──────────┴──────────┴─────────────┴────────────┘

"sight wf" is the number of times the lemma occurs in high-sight passages, divided by the total number of lemmas in high-sight passages (stop words are left out of both counts). "no sight wf" is the same ratio for no-sight passages, and "diff" is "sight wf" minus "no sight wf".

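As a rough illustration of how the three numeric columns relate, here is a minimal sketch on toy data. The lemma lists, the stop word set, and the helper name `relative_frequencies` are all invented for the example; they are not the notebook's data or code.

```python
from collections import Counter

def relative_frequencies(lemmas, stop_words):
    """Return {lemma: count / total}, counting only non-stop-word lemmas."""
    kept = [l for l in lemmas if l not in stop_words]
    total = float(len(kept))
    counts = Counter(kept)
    return dict((l, n / total) for l, n in counts.items())

# Toy data, invented for illustration only (not taken from the novel).
toy_stop_words = set(['the', 'a', 'be'])
toy_high_sight = ['the', 'face', 'be', 'dark', 'face', 'smile']
toy_no_sight = ['a', 'face', 'leave', 'say', 'say', 'the', 'jane']

high_wf = relative_frequencies(toy_high_sight, toy_stop_words)  # "sight wf"
no_wf = relative_frequencies(toy_no_sight, toy_stop_words)      # "no sight wf"

# "diff" = "sight wf" minus "no sight wf"; a lemma that never occurs
# in one group has a relative frequency of 0.0 there.
rows = []
for lemma in set(high_wf) | set(no_wf):
    h = high_wf.get(lemma, 0.0)
    n = no_wf.get(lemma, 0.0)
    rows.append((h - n, h, n, lemma))

rows.sort(reverse=True)  # largest positive diff first

for diff, h, n, lemma in rows:
    print('%9.6f  %.6f  %.6f  %s' % (diff, h, n, lemma))
```

Sorting in descending order puts the lemmas most characteristic of the toy high-sight group first; reading the same list from the bottom gives the lemmas most characteristic of the toy no-sight group.
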
What does this reveal? Not much, really. Jane sees faces, and everything that goes with faces. She sees Rochester.

There are, however, some suggestive differences:

  • "'s" (the possessive token, which spacy handles as a separate token) is about 50% more common in high-sight than in no-sight passages. Does Jane see her own whatever?
  • "remember" is almost 5 times more likely in high-sight than in no-sight passages. Is memory visual in Jane Eyre?
  • Ditto "full". To what extent is sight necessary for completeness?
  • Notice the characters in the "more common in no-sight" list (and, I suppose, the characters in the high-sight list). Does Jane "see" the object of her desire (Rochester) and her competition for him (Ingram), and not see the other characters (Reed, John, Bessie)? Is desire visual?
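
The "about 50%" and "almost 5 times" figures in the list above are ratios of the two relative-frequency columns, not the "diff" column. A small sketch of that arithmetic follows, using values copied from the "MORE COMMON IN HIGH-SIGHT THAN NO-SIGHT" table; the helper name `frequency_ratio` is made up for the example.

```python
# (sight wf, no sight wf) pairs copied from the high-sight table.
table_values = {
    "'s": (0.007747, 0.005265),
    'remember': (0.002905, 0.000632),
    'full': (0.003551, 0.000737),
}

def frequency_ratio(sight_wf, no_sight_wf):
    """How many times more frequent a lemma is in high-sight passages.

    Returns None when the lemma never occurs in no-sight passages,
    since the ratio is undefined in that case.
    """
    if no_sight_wf == 0.0:
        return None
    return sight_wf / no_sight_wf

ratios = dict((word, frequency_ratio(s, n))
              for word, (s, n) in table_values.items())

for word in sorted(ratios):
    print('%-10s %.2f' % (word, ratios[word]))
```

The ratios come out at roughly 1.5 for "'s", 4.6 for "remember" and 4.8 for "full". Lemmas such as "feature" or "mouth", whose no-sight frequency is 0.000000, have no defined ratio, which is one reason the tables are sorted by difference instead.
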
In [8]:
from collections import defaultdict
import tabletext
from nltk.corpus import stopwords
sw = set(stopwords.words('english') + ['-PRON-'])


no_sight_lemma_counts = defaultdict(int)
n_no_sight_lemma = 0

high_sight_lemma_counts = defaultdict(int)
n_high_sight_lemma = 0


for bn, b in enumerate(bins):
    
    if b['n_sight_words'] == 0:
        for l in b['lemma']:
            if l not in sw:
                no_sight_lemma_counts[l] += 1
                n_no_sight_lemma += 1
            
    if b['n_sight_words'] >= 7:
        for l in b['lemma']:
            if l not in sw:
                high_sight_lemma_counts[l] += 1
                n_high_sight_lemma += 1
            
results = []

# Every lemma seen in either group; a lemma missing from one group
# gets a relative frequency of 0.0 there.
for lemma in set(no_sight_lemma_counts) | set(high_sight_lemma_counts):

    no_sight_rel_freq = float(no_sight_lemma_counts.get(lemma, 0)) / float(n_no_sight_lemma)
    high_sight_rel_freq = float(high_sight_lemma_counts.get(lemma, 0)) / float(n_high_sight_lemma)

    results.append([
        (high_sight_rel_freq - no_sight_rel_freq),
        high_sight_rel_freq,
        no_sight_rel_freq,
        lemma
    ])
        
results.sort(reverse=True)

output = [['diff', 'sight wf', 'no sight wf', 'word']]
for r in results[:100]:
    output.append([('%.6f' % r[0]), 
                    ('%.6f' % r[1]), 
                    ('%.6f' % r[2]), 
                   r[3]])
    synsets_to_lookup.append(r[3])

print
print 'MORE COMMON IN HIGH-SIGHT THAN NO-SIGHT'
print
print tabletext.to_text(output)
        
results.sort()

output = [['diff', 'sight wf', 'no sight wf', 'word']]
for r in results[:100]:
    output.append([('%.6f' % r[0]), 
                    ('%.6f' % r[1]), 
                    ('%.6f' % r[2]), 
                   r[3]])
    synsets_to_lookup.append(r[3])

print
print 'MORE COMMON IN NO-SIGHT THAN HIGH-SIGHT'
print
print tabletext.to_text(output)
MORE COMMON IN HIGH-SIGHT THAN NO-SIGHT

┌──────────┬──────────┬─────────────┬────────────┐
│ diff     │ sight wf │ no sight wf │ word       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.005838 │ 0.007101 │ 0.001264    │ face       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.004476 │ 0.007424 │ 0.002948    │ rochester  │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.003381 │ 0.005487 │ 0.002106    │ yet        │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.003045 │ 0.004519 │ 0.001474    │ turn       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.003022 │ 0.008393 │ 0.005370    │ mr.        │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002905 │ 0.002905 │ 0.000000    │ feature    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002814 │ 0.003551 │ 0.000737    │ full       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002708 │ 0.003551 │ 0.000842    │ light      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002617 │ 0.004196 │ 0.001579    │ lady       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002498 │ 0.003551 │ 0.001053    │ something  │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002484 │ 0.002905 │ 0.000421    │ smile      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002484 │ 0.002905 │ 0.000421    │ dark       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002482 │ 0.007747 │ 0.005265    │ 's         │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002273 │ 0.002905 │ 0.000632    │ remember   │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002175 │ 0.003228 │ 0.001053    │ though     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002063 │ 0.002905 │ 0.000842    │ ingram     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002056 │ 0.002582 │ 0.000526    │ longer     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.002047 │ 0.007101 │ 0.005054    │ could      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001985 │ 0.004196 │ 0.002211    │ hand       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001971 │ 0.003551 │ 0.001579    │ head       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001944 │ 0.002260 │ 0.000316    │ step       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001900 │ 0.005165 │ 0.003264    │ miss       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001852 │ 0.002905 │ 0.001053    │ back       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001845 │ 0.002582 │ 0.000737    │ master     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001831 │ 0.001937 │ 0.000105    │ handsome   │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001831 │ 0.001937 │ 0.000105    │ glow       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001726 │ 0.001937 │ 0.000211    │ diana      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001690 │ 0.005165 │ 0.003475    │ love       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001662 │ 0.003873 │ 0.002211    │ much       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001614 │ 0.001614 │ 0.000000    │ mouth      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001564 │ 0.004196 │ 0.002632    │ heart      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001522 │ 0.002260 │ 0.000737    │ round      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001516 │ 0.001937 │ 0.000421    │ white      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001516 │ 0.001937 │ 0.000421    │ chair      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001516 │ 0.001937 │ 0.000421    │ black      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001509 │ 0.001614 │ 0.000105    │ picture    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001509 │ 0.001614 │ 0.000105    │ front      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001509 │ 0.001614 │ 0.000105    │ colour     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001509 │ 0.001614 │ 0.000105    │ brown      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001410 │ 0.001937 │ 0.000526    │ nature     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001403 │ 0.001614 │ 0.000211    │ recall     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001403 │ 0.001614 │ 0.000211    │ brilliant  │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001333 │ 0.003228 │ 0.001895    │ first      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001312 │ 0.002260 │ 0.000948    │ draw       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001298 │ 0.001614 │ 0.000316    │ read       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001298 │ 0.001614 │ 0.000316    │ mary       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001298 │ 0.001614 │ 0.000316    │ lift       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001298 │ 0.001614 │ 0.000316    │ form       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001291 │ 0.001291 │ 0.000000    │ glass      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001207 │ 0.002260 │ 0.001053    │ fire       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001193 │ 0.001614 │ 0.000421    │ towards    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001193 │ 0.001614 │ 0.000421    │ hair       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001193 │ 0.001614 │ 0.000421    │ fine       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001193 │ 0.001614 │ 0.000421    │ fairfax    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001193 │ 0.001614 │ 0.000421    │ dress      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001193 │ 0.001614 │ 0.000421    │ behind     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001186 │ 0.001291 │ 0.000105    │ touch      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001186 │ 0.001291 │ 0.000105    │ sky        │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001186 │ 0.001291 │ 0.000105    │ party      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001186 │ 0.001291 │ 0.000105    │ grave      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001186 │ 0.001291 │ 0.000105    │ fix        │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001122 │ 0.003228 │ 0.002106    │ door       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001094 │ 0.001937 │ 0.000842    │ low        │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001087 │ 0.001614 │ 0.000526    │ window     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001087 │ 0.001614 │ 0.000526    │ power      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ tremble    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ pride      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ lean       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ expression │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ distant    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ dent       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ curl       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ bend       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001081 │ 0.001291 │ 0.000211    │ attention  │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.001017 │ 0.003228 │ 0.002211    │ answer     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000996 │ 0.002260 │ 0.001264    │ moment     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000989 │ 0.001937 │ 0.000948    │ even       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000982 │ 0.001614 │ 0.000632    │ often      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000982 │ 0.001614 │ 0.000632    │ beautiful  │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000982 │ 0.001614 │ 0.000632    │ air        │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000975 │ 0.001291 │ 0.000316    │ wonder     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000975 │ 0.001291 │ 0.000316    │ brain      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000975 │ 0.001291 │ 0.000316    │ appearance │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000975 │ 0.001291 │ 0.000316    │ amongst    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ survey     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ sudden     │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ stoop      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ rule       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ quick      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ mount      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ grim       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ dull       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ drawing    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000968 │ 0.000968 │ 0.000000    │ band       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000898 │ 0.002582 │ 0.001685    │ stand      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000898 │ 0.002582 │ 0.001685    │ quite      │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000884 │ 0.001937 │ 0.001053    │ girl       │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000877 │ 0.001614 │ 0.000737    │ strange    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000877 │ 0.001614 │ 0.000737    │ receive    │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.000877 │ 0.001614 │ 0.000737    │ moon       │
└──────────┴──────────┴─────────────┴────────────┘

MORE COMMON IN NO-SIGHT THAN HIGH-SIGHT

┌───────────┬──────────┬─────────────┬──────────────┐
│ diff      │ sight wf │ no sight wf │ word         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.004928 │ 0.000968 │ 0.005897    │ jane         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.004060 │ 0.007101 │ 0.011161    │ say          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.003356 │ 0.000646 │ 0.004001    │ leave        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.003124 │ 0.001614 │ 0.004738    │ sir          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.002802 │ 0.001937 │ 0.004738    │ must         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.002774 │ 0.003228 │ 0.006002    │ one          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.002703 │ 0.001614 │ 0.004317    │ little       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.002520 │ 0.000323 │ 0.002843    │ shall        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.002374 │ 0.002260 │ 0.004633    │ good         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.002317 │ 0.000000 │ 0.002317    │ live         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.002213 │ 0.004842 │ 0.007055    │ go           │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.002030 │ 0.003551 │ 0.005581    │ know         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001973 │ 0.001291 │ 0.003264    │ tell         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001888 │ 0.000323 │ 0.002211    │ child        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001882 │ 0.000646 │ 0.002527    │ john         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001742 │ 0.002260 │ 0.004001    │ give         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001735 │ 0.002582 │ 0.004317    │ make         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001685 │ 0.000000 │ 0.001685    │ bad          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001671 │ 0.000646 │ 0.002317    │ bessie       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001579 │ 0.000000 │ 0.001579    │ reed         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001566 │ 0.000646 │ 0.002211    │ way          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001467 │ 0.000323 │ 0.001790    │ mean         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001440 │ 0.001614 │ 0.003054    │ yes          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001369 │ 0.000000 │ 0.001369    │ anything     │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001369 │ 0.000000 │ 0.001369    │ cold         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001355 │ 0.000646 │ 0.002001    │ god          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001355 │ 0.000646 │ 0.002001    │ st.          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001355 │ 0.000646 │ 0.002001    │ thing        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001279 │ 0.004196 │ 0.005475    │ come         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001264 │ 0.000000 │ 0.001264    │ lowood       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001264 │ 0.000000 │ 0.001264    │ stay         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001264 │ 0.000000 │ 0.001264    │ three        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001257 │ 0.000323 │ 0.001579    │ die          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001257 │ 0.000323 │ 0.001579    │ right        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001151 │ 0.000323 │ 0.001474    │ forget       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001151 │ 0.000323 │ 0.001474    │ home         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001144 │ 0.000646 │ 0.001790    │ hour         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001138 │ 0.000968 │ 0.002106    │ away         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001131 │ 0.001291 │ 0.002422    │ house        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001131 │ 0.001291 │ 0.002422    │ want         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001124 │ 0.001614 │ 0.002738    │ night        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001046 │ 0.000323 │ 0.001369    │ always       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001046 │ 0.000323 │ 0.001369    │ let          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001046 │ 0.000323 │ 0.001369    │ year         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001039 │ 0.000646 │ 0.001685    │ return       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001018 │ 0.001614 │ 0.002632    │ ever         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.001012 │ 0.001937 │ 0.002948    │ man          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000948 │ 0.000000 │ 0.000948    │ listen       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000941 │ 0.000323 │ 0.001264    │ hope         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000941 │ 0.000323 │ 0.001264    │ new          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000941 │ 0.000323 │ 0.001264    │ school       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000941 │ 0.000323 │ 0.001264    │ true         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000927 │ 0.000968 │ 0.001895    │ morning      │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000913 │ 0.001614 │ 0.002527    │ last         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000842 │ 0.000000 │ 0.000842    │ breakfast    │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000842 │ 0.000000 │ 0.000842    │ entertain    │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000842 │ 0.000000 │ 0.000842    │ however      │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000842 │ 0.000000 │ 0.000842    │ money        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000842 │ 0.000000 │ 0.000842    │ nursery      │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000835 │ 0.000323 │ 0.001158    │ gentleman    │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000835 │ 0.000323 │ 0.001158    │ oh           │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000835 │ 0.000323 │ 0.001158    │ old          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000835 │ 0.000323 │ 0.001158    │ thornfield   │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000835 │ 0.000323 │ 0.001158    │ wife         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000829 │ 0.000646 │ 0.001474    │ lie          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000829 │ 0.000646 │ 0.001474    │ perhaps      │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000815 │ 0.001291 │ 0.002106    │ place        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000794 │ 0.002260 │ 0.003054    │ ask          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000739 │ 0.004842 │ 0.005581    │ think        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ abbot        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ benefactress │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ duty         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ endure       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ exclaim      │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ flesh        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ georgiana    │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ governess    │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ madame       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ month        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ prayer       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ press        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ teacher      │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000737 │ 0.000000 │ 0.000737    │ water        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000730 │ 0.000323 │ 0.001053    │ minute       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000730 │ 0.000323 │ 0.001053    │ show         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000730 │ 0.000323 │ 0.001053    │ walk         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000723 │ 0.000646 │ 0.001369    │ every        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000716 │ 0.000968 │ 0.001685    │ mind         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000716 │ 0.000968 │ 0.001685    │ name         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000709 │ 0.001291 │ 0.002001    │ mrs.         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000709 │ 0.001291 │ 0.002001    │ sit          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000689 │ 0.002260 │ 0.002948    │ get          │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000682 │ 0.002582 │ 0.003264    │ never        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000632 │ 0.000000 │ 0.000632    │ alone        │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000632 │ 0.000000 │ 0.000632    │ blow         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000632 │ 0.000000 │ 0.000632    │ brocklehurst │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000632 │ 0.000000 │ 0.000632    │ cousin       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000632 │ 0.000000 │ 0.000632    │ else         │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000632 │ 0.000000 │ 0.000632    │ french       │
├───────────┼──────────┼─────────────┼──────────────┤
│ -0.000632 │ 0.000000 │ 0.000632    │ hate         │
└───────────┴──────────┴─────────────┴──────────────┘