Are there specific kinds of content associated with passages with a high number (>= 7) of sight-related words, beyond, of course, the sight-related words themselves? And how does the content of those high-sight passages compare with the content of passages with no sight-related words?
Passages with a high number (>= 7) of sight-related words are those printed out in notebook 06_eye_look_see_etc.ipynb. There are 37 such passages. 110 passages contain no sight-related words.
What synsets are characteristic of all high-sight passages?
To skip to the point, scroll down to the cell labeled "Compare high-sight and no-sight synset counts"; nothing much happens before then, at least in terms of interesting results . . .
. . . load the text, and pass it to spacy for part-of-speech tagging and lemmatization.
import spacy
nlp = spacy.load('en')

import codecs, re

CORPUS_FOLDER = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'

text = codecs.open(CORPUS_FOLDER + 'Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt',
                   'r', encoding='utf-8').read()
text = re.sub('\s+', ' ', text).strip()

doc = nlp(text)
As I did in notebook 06_eye_look_see_etc.ipynb, I'm slicing Jane Eyre into 1,000 231-token bins (there's a 1,001st bin, but it holds only the last 203 tokens of the novel). For each bin, I'm saving the original tokens, counting sight-related words (same method as in 06_eye_look_see_etc.ipynb), saving a list of synsets associated with only the non-sight-related words in the passage, and saving a list of the words associated with each synset. I'm appending prefixes to synset names (e.g., "W.", "0.", etc.) so that I can keep track of how far up the tree each synset occurs.
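Before the full loop below, here's a minimal standalone sketch of what those prefixed keys look like for a single word ("glance" is my arbitrary example; it isn't part of the analysis itself):

# A minimal, standalone sketch of the prefix convention, using 'glance'
# as an arbitrary example word (not part of the analysis itself).
from textblob import Word

hyper = lambda s: s.hypernyms()

for w in Word('glance').get_synsets(pos='n'):
    print 'W.' + w.name()                       # word-level synset key
    h = list(w.closure(hyper, depth=10))
    h.reverse()                                 # root first
    for hn, hypernym in enumerate(h):
        print str(hn) + '.' + hypernym.name()   # depth-prefixed hypernym keys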
from textblob import Word

sight_related_synsets = [
    'eye.n.',
    'look.n.',
    'sight.n.',
    'stare.n.',
    'gaze.n.',
    'vision.n.',
    'see.v.01',
    'detect.v.01',
    'spy.v.03',
    'appear.v.04',
    'look.v.01',
    'visualize.v.01',
    'see.v.23',
    'look.v.03',
    'watch.v.01',
]
# ------------------------------------------------------------------------
n_tokens = len(doc)
N_BINS = 1000
bin_size = n_tokens / N_BINS    # integer division (Python 2)

print 'n_tokens', n_tokens, 'N_BINS', N_BINS, 'bin_size', bin_size
# ------------------------------------------------------------------------
bins = []
for a in range(0, N_BINS + 1):
    bins.append({'original_tokens': [],
                 'synsets': [],
                 'n_sight_words': 0,
                 'synsets_to_words': {},
                 'lemma': []})
hyper = lambda s: s.hypernyms()    # relation for walking up the hypernym tree

for t in doc:

    bin_number = t.i / bin_size    # integer division (Python 2)
    bins[bin_number]['original_tokens'].append(t.text)

    # -------------------------------------------------
    # First pass: decide whether this token is sight-related, either
    # because one of its synsets matches the list above, or because
    # one of its synsets' hypernyms does.

    is_seeing_lemma = False

    if t.lemma_ not in ['-PRON-', 'which', 'what', 'who', 'be'] and \
            t.pos_ in ['NOUN', 'VERB']:

        word_synsets = Word(t.lemma_).get_synsets(pos=t.pos_[0].lower())

        for w in word_synsets:

            for sight_related_synset in sight_related_synsets:
                if w.name().startswith(sight_related_synset):
                    is_seeing_lemma = True

            h = list(w.closure(hyper, depth=10))
            h.reverse()

            for hn, hypernym in enumerate(h):
                for sight_related_synset in sight_related_synsets:
                    if hypernym.name().startswith(sight_related_synset):
                        is_seeing_lemma = True

    # -------------------------------------------------
    # Second pass: for non-sight tokens, record the lemma and all of its
    # (depth-prefixed) synsets; for sight tokens, just bump the count.

    if not is_seeing_lemma:

        if t.pos_ not in ['PUNCT']:
            bins[bin_number]['lemma'].append(t.lemma_)

        if t.lemma_ not in ['-PRON-', 'which', 'what', 'who', 'be'] and \
                t.pos_ in ['NOUN', 'VERB']:

            word_synsets = Word(t.lemma_).get_synsets(pos=t.pos_[0].lower())

            for w in word_synsets:

                synset_key = 'W.' + w.name()    # 'W.' marks the word-level synset
                bins[bin_number]['synsets'].append(synset_key)
                try:
                    bins[bin_number]['synsets_to_words'][synset_key].append(t.lemma_)
                except KeyError:
                    bins[bin_number]['synsets_to_words'][synset_key] = [t.lemma_]

                h = list(w.closure(hyper, depth=10))
                h.reverse()    # root first, so the prefix counts down from the top

                for hn, hypernym in enumerate(h):
                    synset_key = str(hn) + '.' + hypernym.name()
                    bins[bin_number]['synsets'].append(synset_key)
                    try:
                        bins[bin_number]['synsets_to_words'][synset_key].append(t.lemma_)
                    except KeyError:
                        bins[bin_number]['synsets_to_words'][synset_key] = [t.lemma_]

    if is_seeing_lemma:
        bins[bin_number]['n_sight_words'] += 1
print 'len(bins)', len(bins), 'len(bins[0][\'original_tokens\'])', len(bins[0]['original_tokens'])
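(Just to make the bin structure concrete before moving on, a quick peek at one arbitrary bin:)

# Peek at a single (arbitrary) bin to confirm the structure built above.
b = bins[0]
print 'n_sight_words:', b['n_sight_words']
print 'first tokens:', ' '.join(b['original_tokens'][:10])
print 'sample synset keys:', sorted(set(b['synsets']))[:5]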
# ------------------------------------------------------------------------
Here, I count the number of high-sight passages in which each synset occurs. It doesn't matter how many times a synset occurs within any one passage--once is enough for me to count it.
After a couple of debugging-related outputs (the number of high-sight passages, and the positions of those passages in the novel), this cell outputs:
A list of synsets (drawn from the table immediately above), each with the lemma (and lemma counts) from the high-sight bins which point to that synset. For example, "3.cognition.n.01" is one of the synsets which occurs in every high-sight bin, and the top 10 lemma in those bins which point to it include "heart" (26 occurrences), "light" (24), "feature" (18), etc. (see this snippet of the output to follow along):
3.cognition.n.01
heart 26, light 24, feature 18, feeling 16, rule 15, hand
13, mind 12, attention 12, thought 12, head 11
Note that here and in what follows, I'm discarding synsets which occur either at the topmost nodes of the Wordnet hierarchy, or else at the bottom word level, my feeling being that these levels won't yield interesting information.
Results? At first glance, this seems interesting . . . for example, every high-sight bin contains at least one word which points toward synset "3.cognition.n.01", and so it's tempting to conclude, "In Jane Eyre, sight and cognition occur together." However, we should look at the next set of cells . . .
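(To see for yourself why a lemma like "heart" can end up under "3.cognition.n.01", one quick check--my sketch, not part of the original pipeline--is to print the root-first hypernym chain for each of its noun senses and look for cognition.n.01 at depth 3:)

# Print the root-first hypernym chain for every noun sense of 'heart';
# whichever chain shows cognition.n.01 at depth 3 is the sense that
# lands the lemma under '3.cognition.n.01'.
from textblob import Word

hyper = lambda s: s.hypernyms()

for w in Word('heart').get_synsets(pos='n'):
    h = list(w.closure(hyper, depth=10))
    h.reverse()
    print w.name(), '->', ' '.join(str(hn) + '.' + s.name() for hn, s in enumerate(h))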
from collections import defaultdict, Counter
import tabletext, textwrap

n_high_sight_bins = 0
high_sight_bin_numbers = []
high_synset_df = defaultdict(int)

check_synsets = []

for bn, b in enumerate(bins):
    if b['n_sight_words'] >= 7:

        high_sight_bin_numbers.append(bn)
        n_high_sight_bins += 1

        synset_wf = defaultdict(int)

        for s in b['synsets']:
            if s[:2] not in ['0.', '1.', 'W.']:    # skip topmost and word-level synsets
                synset_wf[s] += 1
                check_synsets.append(s)

        for k in synset_wf:
            high_synset_df[k] += 1

print
print 'n_high_sight_bins', n_high_sight_bins
print
print 'high_sight_bin_numbers', high_sight_bin_numbers
print
print 'len(set(check_synsets))', len(set(check_synsets))
print 'len(high_synset_df)', len(high_synset_df)
print
print

results = [['synset', 'n bins']]

synsets_to_lookup = []
synsets_word_counts = defaultdict(lambda: defaultdict(int))

for w in Counter(high_synset_df).most_common():
    if w[1] != n_high_sight_bins:    # keep only synsets present in every high-sight bin
        break
    synsets_to_lookup.append(w[0])
    results.append([w[0], w[1]])

print tabletext.to_text(results)

synsets_to_lookup = set(synsets_to_lookup)

for bn, b in enumerate(bins):
    if b['n_sight_words'] >= 7:
        for s, words in b['synsets_to_words'].iteritems():
            if s in synsets_to_lookup:
                for w in words:
                    synsets_word_counts[s][w] += 1

for k in synsets_word_counts.keys():

    word_list = []
    for w in Counter(synsets_word_counts[k]).most_common():
        word_list.append(w[0] + ' ' + str(w[1]))

    print
    print k
    print
    print '\t' + '\n\t'.join(textwrap.wrap(', '.join(word_list[:10]), 60))
Here, I'm producing a pair of tables similar to the one in the previous cell. Again, I'm counting how many bins--no-sight bins in this case--contain lemma which point to particular synsets.
The first table lists the synsets which occur in the most no-sight bins (I'm listing only 15 such synsets for the sake of space).
The second table lists the synsets which occur in all high-sight bins, followed by the number of no-sight bins in which each synset occurs. In other words, for a synset to be listed in this table, it had to appear in every high-sight bin.
What's the news here? With few exceptions, any synset which appears in every one of the high-sight bins also occurs in every one of the no-sight bins. The exceptions are insignificant.
In other words, the only feature that distinguishes all high-sight bins from all no-sight bins is the presence or absence of sight-related words. Or, to put it another way, it doesn't seem possible to say, "When sight occurs in JE, X always occurs, and when sight does not occur, X does not occur."
from collections import defaultdict, Counter
import tabletext

n_low_sight_bins = 0
low_sight_bin_numbers = []
low_synset_df = defaultdict(int)

check_synsets = []

for bn, b in enumerate(bins):
    if b['n_sight_words'] == 0:

        low_sight_bin_numbers.append(bn)
        n_low_sight_bins += 1

        low_synset_wf = defaultdict(int)

        for s in b['synsets']:
            if s[:2] not in ['0.', '1.', 'W.']:    # skip topmost and word-level synsets
                low_synset_wf[s] += 1
                check_synsets.append(s)

        for k in low_synset_wf:
            low_synset_df[k] += 1

print
print 'n_low_sight_bins', n_low_sight_bins
print
print 'low_sight_bin_numbers', low_sight_bin_numbers
print
print 'len(set(check_synsets))', len(set(check_synsets))
print 'len(low_synset_df)', len(low_synset_df)
print

results = [['synset', 'n bins']]
for w in Counter(low_synset_df).most_common(15):
    results.append([w[0], w[1]])

print
print 'MOST COMMON SYNSETS'
print
print tabletext.to_text(results)
print

results = [['synset', 'n NO-SIGHT bins']]
for w in Counter(low_synset_df).most_common():
    if w[0] in synsets_to_lookup:
        results.append([w[0], w[1]])

print
print 'HIGH-SIGHT SYNSETS'
print
print tabletext.to_text(results)
This cell produces a table which ranks synsets based on the difference between the percentages of high-sight and no-sight bins "containing" each synset. For example:
┌───────┬─────────┬───────────┬──────────────────────────────┐
│ diff │ sight % │ no sight% │ synset │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 46.6% │ 73.0% │ 26.4% │ 4.appearance.n.01 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 43.0% │ 73.0% │ 30.0% │ 7.surface.n.02 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 42.0% │ 78.4% │ 36.4% │ 6.boundary.n.01 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 41.4% │ 51.4% │ 10.0% │ 5.facial_expression.n.01 │
├───────┼─────────┼───────────┼──────────────────────────────┤
│ 41.3% │ 54.1% │ 12.7% │ 4.gesture.n.02 │
├───────┼─────────┼───────────┼──────────────────────────────┤
In other words, there do seem to be differences between high-sight and no-sight passages. It's not surprising that high-sight passages traffic in matters of appearance, expression and gesture, since those are so often visual in the novel.
I also list out the top 10 words associated with each synset (when there are at least 10; otherwise, I print them all), since the synset name may be misleading. For example, "5.aggressiveness.n.01" occurs in many more high-sight than no-sight passages ("Aha!"). Unfortunately, that's because one sense of "face" rolls up to "aggressiveness" in Wordnet.
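(Claims like that are easy to verify; here's a small sketch of mine that asks which noun senses of "face" have aggressiveness.n.01 anywhere in their hypernym closure:)

# Report which noun senses of 'face' roll up to aggressiveness.n.01.
from textblob import Word

hyper = lambda s: s.hypernyms()

for w in Word('face').get_synsets(pos='n'):
    ancestors = [s.name() for s in w.closure(hyper, depth=10)]
    if 'aggressiveness.n.01' in ancestors:
        print w.name(), '--', w.definition()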
Bottom line? We probably could have gotten here by a shorter path; e.g., we might have asked, "which words occur more (or less) frequently in high-sight vs no-sight passages?" (in fact, see the last cell of this notebook, where I ask that question). Still, with this data, we can see the outline of the differences between high-sight and no-sight passages.
I think, however, that it's very interesting that "2.bend.v.01" occurs in 43.2% of high-sight passages, but in only 13.6% of no-sight passages, and that "2.bend.v.01" includes words like:
fall 11, lean 6, bend 6, incline 5, stoop 4, ascend 3, curl
3, cower 2, bow 1, creep 1
We might be able to tease more out of this by including more abstract Wordnet verb nodes; however, I'm going to look at word frequencies instead (see the last notebook).
import math

results = []

high_synsets = []
low_synsets = []

for k in high_synset_df.keys():
    high_synsets.append(k)
    high_pct = (float(high_synset_df[k]) / float(n_high_sight_bins) * 100.0)
    if k in low_synset_df:
        low_pct = (float(low_synset_df[k]) / float(n_low_sight_bins) * 100.0)
    else:
        low_pct = 0.0
    diff = math.fabs(high_pct - low_pct)
    results.append([diff, high_pct, low_pct, k])

for k in low_synset_df.keys():
    low_synsets.append(k)
    if k not in high_synset_df:
        low_pct = (float(low_synset_df[k]) / float(n_low_sight_bins) * 100.0)
        results.append([low_pct, 0.0, low_pct, k])

print
print 'len(set(high_synsets))', len(set(high_synsets))
print 'len(set(low_synset_df))', len(set(low_synset_df))
print

results.sort(reverse=True)

synsets_to_lookup = []

output = [['diff', 'sight %', 'no sight%', 'synset']]
for r in results:
    if r[0] < 25.0:    # report only synsets with a difference of 25 points or more
        break
    output.append([('%.1f' % r[0]) + '%',
                   ('%.1f' % r[1]) + '%',
                   ('%.1f' % r[2]) + '%',
                   r[3]])
    synsets_to_lookup.append(r[3])

print tabletext.to_text(output)
print

synsets_word_counts = defaultdict(lambda: defaultdict(int))

for bn, b in enumerate(bins):
    if b['n_sight_words'] >= 7 or b['n_sight_words'] == 0:
        for s, words in b['synsets_to_words'].iteritems():
            if s in synsets_to_lookup:
                for w in words:
                    synsets_word_counts[s][w] += 1

for k in synsets_to_lookup:

    word_list = []
    for w in Counter(synsets_word_counts[k]).most_common():
        word_list.append(w[0] + ' ' + str(w[1]))

    print
    print k
    print
    print '\t' + '\n\t'.join(textwrap.wrap(', '.join(word_list[:10]), 60))
Just checking some numbers . . .
import numpy as np

synset_counts = []
unique_synset_counts = []
synsets = []
reported_synsets = []

print

for bn, b in enumerate(bins):
    if b['n_sight_words'] >= 7:
        synset_counts.append(len(b['synsets']))
        unique_synset_counts.append(len(set(b['synsets'])))
        synsets += b['synsets']
        for s in b['synsets']:
            if s[:2] not in ['0.', '1.', 'W.']:
                reported_synsets.append(s)

print 'high-sight -- average n synsets/bin', np.average(synset_counts)
print 'high-sight -- average n unique synsets/bin', np.average(unique_synset_counts)
print 'high-sight -- unique synsets', len(set(synsets))
print 'high-sight -- unique REPORTED synsets', len(set(reported_synsets))

synset_counts = []
unique_synset_counts = []
synsets = []
reported_synsets = []

print

for bn, b in enumerate(bins):
    if b['n_sight_words'] == 0:
        synset_counts.append(len(b['synsets']))
        unique_synset_counts.append(len(set(b['synsets'])))
        synsets += b['synsets']
        for s in b['synsets']:
            if s[:2] not in ['0.', '1.', 'W.']:
                reported_synsets.append(s)

print 'no-sight -- average n synsets/bin', np.average(synset_counts)
print 'no-sight -- average n unique synsets/bin', np.average(unique_synset_counts)
print 'no-sight -- unique synsets', len(set(synsets))
print 'no-sight -- unique REPORTED synsets', len(set(reported_synsets))
Two tables of words (lemma, really) follow, one of words which are more common in high-sight passages, and one of words which are more common in no-sight passages. The table format should be familiar:
┌──────────┬──────────┬─────────────┬────────────┐
│ diff │ sight wf │ no sight wf │ word │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.005838 │ 0.007101 │ 0.001263 │ face │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.004477 │ 0.007424 │ 0.002947 │ rochester │
├──────────┼──────────┼─────────────┼────────────┤
│ 0.003382 │ 0.005487 │ 0.002105 │ yet │
├──────────┼──────────┼─────────────┼────────────┤
"sight wf" is the number of times the lemma occurs in high-sight passages, divided by the total number of words in high-sight passages.
What does this reveal? Not much, really. Jane sees faces, and everything that goes with faces. She sees Rochester.
There are, however, some suggestive differences:
from nltk.corpus import stopwords

sw = set(stopwords.words('english') + ['-PRON-'])

no_sight_lemma_counts = defaultdict(int)
n_no_sight_lemma = 0
high_sight_lemma_counts = defaultdict(int)
n_high_sight_lemma = 0

for bn, b in enumerate(bins):

    if b['n_sight_words'] == 0:
        for l in b['lemma']:
            if l not in sw:
                no_sight_lemma_counts[l] += 1
                n_no_sight_lemma += 1

    if b['n_sight_words'] >= 7:
        for l in b['lemma']:
            if l not in sw:
                high_sight_lemma_counts[l] += 1
                n_high_sight_lemma += 1

results = []

for lemma in no_sight_lemma_counts:
    no_sight_rel_freq = float(no_sight_lemma_counts[lemma]) / float(n_no_sight_lemma)
    if lemma in high_sight_lemma_counts:
        high_sight_rel_freq = float(high_sight_lemma_counts[lemma]) / float(n_high_sight_lemma)
    else:
        high_sight_rel_freq = 0.0
    results.append([(high_sight_rel_freq - no_sight_rel_freq),
                    high_sight_rel_freq,
                    no_sight_rel_freq,
                    lemma])

for lemma in high_sight_lemma_counts:
    if lemma not in no_sight_lemma_counts:
        high_sight_rel_freq = float(high_sight_lemma_counts[lemma]) / float(n_high_sight_lemma)
        results.append([high_sight_rel_freq, high_sight_rel_freq, 0.0, lemma])

results.sort(reverse=True)

output = [['diff', 'sight wf', 'no sight wf', 'word']]
for r in results[:100]:
    output.append([('%.6f' % r[0]),
                   ('%.6f' % r[1]),
                   ('%.6f' % r[2]),
                   r[3]])

print
print 'MORE COMMON IN HIGH-SIGHT THAN NO-SIGHT'
print
print tabletext.to_text(output)

results.sort()

output = [['diff', 'sight wf', 'no sight wf', 'word']]
for r in results[:100]:
    output.append([('%.6f' % r[0]),
                   ('%.6f' % r[1]),
                   ('%.6f' % r[2]),
                   r[3]])

print
print 'MORE COMMON IN NO-SIGHT THAN HIGH-SIGHT'
print
print tabletext.to_text(output)