This notebook was rerun with additional author birth date information
How does "eye" and "eyes" appear in Jane Eyre, and, in this respect, how does Jane Eyre compare with David Copperfield and Vanity Fair? What happens if I perform some simple comparision of "eye" and "eyes" across the whole corpus?
Why did I start with David Copperfield and Vanity Fair? In our earlier conversations, we identified them as close comparables, at least in terms of date of composition, with Jane Eyre. In other work, I sometimes had trouble determining whether texts in our corpus were closer to Dickens or Bronte, or, as we put it, in differentiating between a "Jane Eyre effect" and a "Dickens effect". And, in light of the preface to Jane Eyre, some Bronte-Thackeray comparison seems reasonable.
What did I learn with this notebook?
Compared to David Copperfield and Vanity Fair, Jane Eyre uses "eye" and "eyes" more frequently and more distinctively. I'm particularly taken by Jane Eyre's use of "eye" as the subject of sentences.
However, when I look across the whole Muncie fiction corpus (1,100 texts), Jane Eyre does not have a remarkable amount of "eye" and "eyes". Marlitt, on the other hand, does (consistent with Tomek's summer 2016 findings).
I reran the methods of the Jane Eyre-David Copperfield-Vanity Fair comparison, except that I swapped in OMS and Gisela in place of David Copperfield and Vanity Fair; those results appear below, after the very long listing of the relative frequency of "eye[s]" in the corpus.
Marlitt use "eye[s]" as the subject of sentences even more than Jane Eyre, again a finding consistent with Tomek's.
Bottom line? I'd like to find some way to focus this, especially in how "eye[s]" acts and is modified; the modifiers and verbs feel "sparse" (i.e., there are a lot of words which occur once), and the words which do occur frequently (pronouns, for example) seem like uninteresting data points. I'd also like to run the parsing analysis across the whole corpus: how does the use of "eye[s]" as a sentence subject figure across the corpus? (I'll start that as a separate process and let it run tonight.)
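(For reference, a sketch of what that overnight job might look like--untested, and assuming the same corpus layout as the cells below:)

import codecs, re, glob
import nltk
import spacy
nlp = spacy.load('en')
CORPUS_FOLDER = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'
# Sketch only: relative frequency of the lemma "eye" as sentence
# subject ('nsubj'), per text, across the whole corpus.
for path_to_file in glob.glob(CORPUS_FOLDER + '*.txt'):
    raw_text = codecs.open(path_to_file, 'r', encoding='utf-8').read()
    doc = nlp(re.sub(r'\s+', ' ', raw_text))
    n_nsubj = 0
    for token in doc:
        if token.lemma_.lower() == 'eye' and token.dep_ == 'nsubj':
            n_nsubj += 1
    n_tokens = len(nltk.word_tokenize(raw_text))
    print '%.7f' % (float(n_nsubj) / n_tokens), path_to_file.split('/')[-1]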
Still, I think the uses of "eye[s]" deserve attention. The word seems like a distinctive feature of Jane Eyre, it occurs even more often in Marlitt, and its uses--not as a body part, but as a way of figuring observation, judgement and discernment, and at times unreliably--get right at our focus on "the gaze."
There's only preparation work in this cell (nothing to see here!): read Jane Eyre, David Copperfield and Vanity Fair, and use them to load up nltk and spacy objects. The nltk objects serve for printing out key words in context, for providing easy-to-access token counts, and for making plain text available for regex searches. The spacy objects provide access to lemmas, and to sentence-level dependency parses.
Note that I'm using spacy 1.9.0 here; the last time I checked, version 2 was still buggy.
import codecs, re
import nltk
import spacy
nlp = spacy.load('en')
CORPUS_FOLDER = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'
texts = [
{'file_name': 'Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Dickens_Charles_David_Copperfield_PG_766.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Thackeray_William_Makepeace_Vanity_Fair_PG_599.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
]
for t in texts:
    print t['file_name']
t['raw_text'] = codecs.open(CORPUS_FOLDER + t['file_name'], 'r', encoding='utf-8').read()
t['tokens'] = nltk.word_tokenize(t['raw_text'])
t['text_obj'] = nltk.Text(t['tokens'])
    cleaned_text = re.sub(r'\s+', ' ', t['raw_text'])
t['spacy_doc'] = nlp(cleaned_text)
print 'Done!'
t = texts[0]
t['raw_text'][:1000]
I recapitulate a point from a previous notebook: "eye" is the most common noun lemma in Jane Eyre. Here, I list the top 10 noun lemmas, along with the number of times each occurs (notice that I ignore "what" and "who", which I suspect are part-of-speech tagging errors by spacy). This step is important only because I want to be sure I am chasing a significant lexical feature of Jane Eyre.
I also print the top 10 lemmas for the other two novels. Are the relative positions of "eye" and "hand" of interest?
from collections import defaultdict, Counter
for t in texts:
print
print t['file_name']
print
    lemma_counts = defaultdict(int)
    # use a separate name for tokens so we don't clobber t, the text dict
    for token in t['spacy_doc']:
        if token.pos_ == 'NOUN' and token.lemma_ not in ['what', 'who']:
            lemma_counts[token.lemma_] += 1
    for w in Counter(lemma_counts).most_common(10):
        print '\t', w[0], w[1]
Close-read some passages: first, passages from the three novels containing "eye"; next, passages containing "eyes".
One thing is immediately obvious (and should have been obvious to me before I did this): "eye" and, to a lesser extent, "eyes" appear to be not so much about an actual anatomical feature; instead, the words function a) as a metonymy (?) for gaze, sight, evaluative inspection, discernment, etc. (one "falls under the eye" of some person); b) as a conveyor of emotion and character; and c) sometimes as both at once ("strict eye").
This last use is particularly interesting, especially in Jane Eyre, because it suggests prejudgement, or even the failure of the eye to discern--to see correctly ("severe eye", "strict eye", etc.).
(I print only the first 20 examples for each word-novel pair; it would be trivial to print more, or even all of them, if that would be helpful.)
print
print '--------------------------- EYE ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('eye', lines=20, width=115)
print
print
print '--------------------------- EYES ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('eyes', lines=20, width=115)
Do the simplest thing possible: compute the relative frequencies of "eye", "eyes", "[pronoun] eye[s]", etc.
What do we observe?
Jane Eyre uses "eye" and "eyes" more than David Copperfield and Vanity Fair; the difference is especially pronounced for the singular "eye". How to read this? Perhaps as a gesture meaning, "here, I really do mean 'eye' as a metonymy." I'm especially struck by how much more often Jane Eyre uses "my/his/her eye" (although, to be fair, David Copperfield seems to prefer "my/his/her eyes").
Note that I'm actually undercounting "my/his/her eye[s]", since I'm not making allowances for intervening adjectives; i.e., I'm not counting "my/his/her [adjective] eye[s]".
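If I wanted to gauge the size of that undercount, a minimal sketch (my own regexes, unvetted) would allow at most one intervening word:

import re
adj_regexes = [
    r'\b(?:my|his|her)\s+(?:\w+\s+)?eye\b',
    r'\b(?:my|his|her)\s+(?:\w+\s+)?eyes\b',
]
for t in texts:
    # the (?:\w+\s+)? group admits one optional intervening word,
    # which is usually (but not always) an adjective
    lowered = re.sub(r'\s+', ' ', t['raw_text']).lower()
    for r in adj_regexes:
        print t['file_name'], r, len(re.findall(r, lowered))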
import re, tabletext
results = [
['', 'Jane Eyre', 'David C', 'Vanity F'],
['EYE[S]', '', '', ''],
['EYE', '', '', ''],
['EYES', '', '', ''],
['PRON EYE', '', '', ''],
['PRON EYES', '', '', ''],
]
regexes = [
r'\beye\b|\beyes\b',
r'\beye\b',
r'\beyes\b',
r'\bmy eye\b|\bhis eye\b|\bher eye\b',
r'\bmy eyes\b|\bhis eyes\b|\bher eyes\b'
]
for a, t in enumerate(texts):
    cleaned_text = re.sub(r'\s+', ' ', t['raw_text'])
for b, r in enumerate(regexes):
matches = re.finditer(r, cleaned_text.lower())
n_matches = 0
for m in matches:
n_matches += 1
results[b + 1][a + 1] = '%.7f' % (float(n_matches) / float(len(t['tokens'])))
print tabletext.to_text(results)
I'm printing out relative frequencies for the 10 most common (eye/eyes, dependency code) pairs in Jane Eyre, along with the corresponding relative frequencies for the other two novels. The dependency codes ("pobj", "nsubjpass", etc.) can be found at:
https://nlp.stanford.edu/software/dependencies_manual.pdf
Anything interesting here?
The lemma of "eye" as the subject of a sentence ('nsubj') occurs much more often in Jane Eyre than in the other two novels; the difference is even more pronounced for the singular "eye". For the other syntactic functions (object of a preposition [pobj], direct object [dobj], the relative frequencies are more or less the same in the novels. I provisionally take the high frequency of "eye" as subject to be evidence of the extent to which "eye" in all its meanings becomes something like a character (or an active agent, or a way to say, "the person is important only to the extent he or she is interogating his or her environment") in Jane Eyre.
import tabletext
word_dependency_counts = {}
for a, t in enumerate(texts):
for s in t['spacy_doc'].sents:
for token in s:
if token.lemma_.lower() == 'eye':
try:
noop = word_dependency_counts[(token.text.lower(), token.dep_)]
except KeyError:
word_dependency_counts[(token.text.lower(), token.dep_)] = [0, 0, 0]
try:
noop = word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)]
except KeyError:
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)] = [0, 0, 0]
word_dependency_counts[(token.text.lower(), token.dep_)][a] += 1
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)][a] += 1
# Normalize after all the texts have been counted, so that entries for
# pairs first seen in a later text get formatted too (otherwise they'd
# stay raw integer zeroes and sort inconsistently).
for k in word_dependency_counts.keys():
    for a, t in enumerate(texts):
        word_dependency_counts[k][a] = '%.7f' % (word_dependency_counts[k][a] / float(len(t['tokens'])))
sort_results = []
for k, v in word_dependency_counts.iteritems():
sort_results.append([v[0], ' '.join(k)] + v)
sort_results.sort(reverse=True)
final_results = [['', 'Jane Eyre', 'David C', 'Vanity F'],]
for r in sort_results[:10]:
final_results.append(r[1:])
print tabletext.to_text(final_results)
How much information can I extract about the instances of "eye" which function as the subjects of sentences? What modifies them? What actions do they perform?
Note that here, I'm using the lemma of "eye" and "eyes"; the two words are collapsed into one. For the modifiers, I report counts of the actual words (not the lemmas); for the actions, I report the lemmas. Confusing, I realize, but necessary if I'm going to get words to collapse usefully.
Does this tell us anything?
Two answers: if we're trying to get to some sort of word-frequency matrix from this data, then it isn't going to be all that useful. Too many modifiers and actions occur only once, and the frequency of the pronouns seems to be as much the result of the narrative voice as anything else.
However, it is, I suppose, somewhat interesting that "eyes" can do so much, and someone might make something of the number and kind of different actions "eyes" can perform. But that seems more an activity for close reading (although I'd be happy to make these sentences available as a convenient set).
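To make the word/lemma distinction concrete, a quick check on an invented sentence (again using the nlp object from the first cell); both surface forms share the lemma "eye", which is what lets them collapse:

for token in nlp(u'Her eyes met my eye.'):
    print token.text, '->', token.lemma_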
import string, textwrap
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
eye_modifiers = defaultdict(int)
eye_actions = defaultdict(int)
n_eye_nsubj = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'eye' and token.dep_ in ['nsubj']:
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'eye' and token.dep_ in ['nsubj']:
n_eye_nsubj += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
eye_modifiers[child.text.lower()] += 1
#print
eye_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_eye_nsubj', n_eye_nsubj
print
print '\t', 'eye_modifiers'.upper()
print
output = []
for w in Counter(eye_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
print
print '\t', 'eye_actions'.upper()
print
output = []
for w in Counter(eye_actions).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
Here, I'm digging out the words which modify all instances of "eye" and "eyes". Note that we pick up some instances of proper nouns as modifiers. I suspect, although I haven't confirmed, that this is an artifact of how spacy parses sentences; something like "Brocklehurst's eyes" is broken into three tokens ("Brocklehurst", "'s", and "eyes"), and the proper noun gets connected to "eyes" in the parse.
This may not be terribly useful; I include it only for the sake of completeness.
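A quick way to test that suspicion on an invented sentence (nlp from the first cell); if spacy attaches the proper noun directly to "eyes" (with a dep_ like 'poss'), that would explain the proper nouns in the modifier lists:

for token in nlp(u"Mr. Brocklehurst's eyes narrowed."):
    if token.text == 'eyes':
        print [(child.text, child.dep_) for child in token.children]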
import string
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
eye_modifiers = defaultdict(int)
eye_actions = defaultdict(int)
n_eye = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'eye':
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'eye':
n_eye += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation and child.text.lower() not in ['--',]:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
eye_modifiers[child.text.lower()] += 1
#print
if token.head.lemma_.lower() not in string.punctuation:
eye_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_eye', n_eye
print
print '\t', 'eye_modifiers'.upper()
print
output = []
for w in Counter(eye_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#print
#print '\t', 'eye "heads"'.upper()
#print
#output = []
#for w in Counter(eye_actions).most_common():
# output.append(w[0] + ' ' + str(w[1]))
#print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
In the previous cells, I didn't see the number of instances of "dark" that I expected. Here, I check whether the earlier results are reasonable.
Well, more or less. Sometimes "dark" comes some distance before or after "eye" or "eyes", but there's not a lot of that.
import re
print
print texts[0]['file_name']
print
cleaned_text = re.sub(r'\s+', ' ', texts[0]['raw_text'])
matches = re.finditer(r'\bdark eye', cleaned_text.lower())
for m in matches:
print cleaned_text[m.start() - 40: m.end() + 40]
print
for s in texts[0]['spacy_doc'].sents:
has_dark = False
has_eye = False
for token in s:
if token.text.lower() == 'dark':
has_dark = True
if token.lemma_.lower() == 'eye':
has_eye = True
if has_dark == True and has_eye == True:
s_text = unicode(s)
s_text = re.sub(r'\bdark\b', '*dark*', s_text)
s_text = re.sub(r'\beye\b', '*eye*', s_text)
s_text = re.sub(r'\beyes\b', '*eyes*', s_text)
print
print s_text
Just a quick check to see how common "eye" and "eyes" are in the whole Muncie fiction corpus. Does Jane Eyre really have a lot? How does it compare to other texts? In the very long list that follows, Jane Eyre, David Copperfield and Vanity Fair are prefixed with asterisks; other works by Charlotte Bronte are prefixed with dashes, and Marlitt's texts are prefixed with ">>>>".
Please note: This cell produces over 1,100 lines of output. Please scroll to the bottom--there's more after this. Also, please note that I'm using a method to count "eye" and "eyes" that's different from what I used above, so the relative frequencies for "eye" and "eyes" are slightly different for Jane Eyre, etc.
The results are not what I expected. Jane Eyre is 293rd on the ranked list of novels; i.e., 292 of the 1,100 novels in the corpus have more "eye" and "eyes". What's more interesting? The Old Mam'selle's Secret is 24th on the list, Gisela is 32nd, and The Owl's Nest is 50th. This seems consistent with Tomek's findings from the summer of 2016 ("if Bronte uses a face/hand/eye word, Marlitt uses it more").
import codecs, re, glob, time, json
import nltk
birth_date_lookup_table = json.loads(codecs.open('birth_date_lookup_table.js', 'r', encoding='utf-8').read())
start_time = time.time()
CORPUS_FOLDER = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'
results = []
all_rel_freqs = []
for a, path_to_file in enumerate(glob.glob(CORPUS_FOLDER + '*.txt')):
if a % 100 == 0:
print 'processing', a
#if a > 100:
# break
file_name = path_to_file.split('/')[-1]
birth_date = '????'
try:
birth_date = str(birth_date_lookup_table[file_name])
except KeyError:
if 'Marlitt' in file_name:
birth_date = '1825'
file_name = birth_date + ' ' + file_name
if 'Jane_Eyre' in file_name:
file_name = '**** ' + file_name
elif 'David_Copperfield' in file_name:
file_name = '**** ' + file_name
elif 'Vanity_Fair' in file_name:
file_name = '**** ' + file_name
elif 'Bront_Charl' in file_name:
file_name = '---- ' + file_name
elif 'Marlitt' in file_name:
file_name = '>>>> ' + file_name
raw_text = codecs.open(path_to_file, 'r', encoding='utf-8').read()
tokens = nltk.word_tokenize(raw_text)
n_eyes = 0
for t in tokens:
if t.lower() in ['eye', 'eyes']:
n_eyes += 1
results.append([(float(n_eyes) / len(tokens)), file_name])
all_rel_freqs.append((float(n_eyes) / len(tokens)))
results.sort(reverse=True)
stop_time = time.time()
print 'Done gathering', (stop_time - start_time)
print
for a, r in enumerate(results):
print '%.7f' % r[0], ('(' + str(a + 1) + ')'), r[1]
Get some numbers (average, standard deviation, etc.) to use against the results of the previous cell.
It's not the most regular of distributions. But if I regard any novel with a relative frequency of "eye" and "eyes" greater than 0.00200222508952 as remarkable, then both OMS and Gisela have a remarkable amount of "eye" and "eyes".
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
plt.rcParams['figure.figsize']=(15,5)
ax = sns.distplot(all_rel_freqs, bins=100)
ax.set(xlabel='REL FREQ EYE[S]', ylabel='n texts')
plt.show()
print
print 'mean', np.mean(all_rel_freqs), \
'median', np.median(all_rel_freqs), \
'std', np.std(all_rel_freqs), \
'plus 1 std', (np.mean(all_rel_freqs) + (1 * np.std(all_rel_freqs))), \
'plus 2 std', (np.mean(all_rel_freqs) + (2 * np.std(all_rel_freqs)))
I'm going light on the notes in what follows, since I'm simply recapitulating methods with a different trio of texts.
import codecs, re
import nltk
import spacy
nlp = spacy.load('en')
CORPUS_FOLDER = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'
texts = [
{'file_name': 'Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_At_the_Councillor_s_or_A_Nameless_History_PG_43393_0.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_Baliff.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_Countess_Gisela_corrected_4_10_2018.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_Gold_Elsie_PG_42426.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_Im Schillingshof_4_26_2018.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_Lady_with_the_Rubies_corrected_3_13_208.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_Little_Moorland_Princess_cleaned_121817.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_OMS_translation_cleaned_110617.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_Owls_Nest_corrected_4_21_2018.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': 'Marlitt_Wister_The_Second_Wife_corrected.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
]
for t in texts:
t['raw_text'] = codecs.open(CORPUS_FOLDER + t['file_name'], 'r', encoding='utf-8').read()
t['tokens'] = nltk.word_tokenize(t['raw_text'])
t['text_obj'] = nltk.Text(t['tokens'])
    cleaned_text = re.sub(r'\s+', ' ', t['raw_text'])
t['spacy_doc'] = nlp(cleaned_text)
print 'Done!'
It's interesting that even though "eye" and "eyes" occur more often in OMS and Gisela, "eye" is not the most common noun lemma in either text.
from collections import defaultdict, Counter
for t in texts:
print
print t['file_name'], len(t['tokens'])
print
    lemma_counts = defaultdict(int)
    # again, use a separate name for tokens so we don't clobber t
    for token in t['spacy_doc']:
        if token.pos_ == 'NOUN' and token.lemma_ not in ['what', 'who']:
            lemma_counts[token.lemma_] += 1
    for w in Counter(lemma_counts).most_common(10):
        print '\t', w[0], w[1]
One important difference from Jane Eyre: the Wister translations of OMS and Gisela use the singular "eye" very little. Lines like "Displaying 20 of 186 matches:" in the concordance output below make it possible to see how many times a particular word occurs in a text.
Also, please note that OMS and Gisela are much shorter than Jane Eyre.
print
print '--------------------------- EYE ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('eye', lines=20, width=115)
print
print
print '--------------------------- EYES ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('eyes', lines=20, width=115)
Lots more "eye[s]" in OMS and Gisela. Wister's/Marlitt's preference for the plural "eyes" is a significant different.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))
import re, tabletext
results = [
['', 'Jane Eyre', 'At the C', 'Baliff', 'Gisela', 'Gold_Elsie',
'Im Schill', 'Rubies', 'Moorland', 'OMS', 'Owls', '2 Wife'],
['EYE[S]', '', '', '', '', '', '', '', '', '', '', ''],
['EYE', '', '', '', '', '', '', '', '', '', '', ''],
['EYES', '', '', '', '', '', '', '', '', '', '', ''],
['PRON EYE', '', '', '', '', '', '', '', '', '', '', ''],
['PRON EYES', '', '', '', '', '', '', '', '', '', '', ''],
]
regexes = [
r'\beye\b|\beyes\b',
r'\beye\b',
r'\beyes\b',
r'\bmy eye\b|\bhis eye\b|\bher eye\b',
r'\bmy eyes\b|\bhis eyes\b|\bher eyes\b'
]
for a, t in enumerate(texts):
    cleaned_text = re.sub(r'\s+', ' ', t['raw_text'])
for b, r in enumerate(regexes):
matches = re.finditer(r, cleaned_text.lower())
n_matches = 0
for m in matches:
n_matches += 1
results[b + 1][a + 1] = '%.7f' % (float(n_matches) / float(len(t['tokens'])))
print tabletext.to_text(results)
Marlitt, again, super-sizes Jane Eyre.
import tabletext
word_dependency_counts = {}
for a, t in enumerate(texts):
for s in t['spacy_doc'].sents:
for token in s:
if token.lemma_.lower() == 'eye':
try:
noop = word_dependency_counts[(token.text.lower(), token.dep_)]
except KeyError:
word_dependency_counts[(token.text.lower(), token.dep_)] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
try:
noop = word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)]
except KeyError:
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word_dependency_counts[(token.text.lower(), token.dep_)][a] += 1
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)][a] += 1
# As above, normalize after all eleven texts have been counted, so
# entries for pairs first seen in a later text get formatted too.
for k in word_dependency_counts.keys():
    for a, t in enumerate(texts):
        word_dependency_counts[k][a] = '%.7f' % (word_dependency_counts[k][a] / float(len(t['tokens'])))
sort_results = []
for k, v in word_dependency_counts.iteritems():
sort_results.append([v[0], ' '.join(k)] + v)
sort_results.sort(reverse=True)
# one header column per text, matching the table in the regex cell above
final_results = [['', 'Jane Eyre', 'At the C', 'Baliff', 'Gisela', 'Gold_Elsie',
                  'Im Schill', 'Rubies', 'Moorland', 'OMS', 'Owls', '2 Wife'],]
for r in sort_results[:10]:
final_results.append(r[1:])
print tabletext.to_text(final_results)
These word lists need to be close-read. My suspicion is that the Marlitt word lists are more Jane Eyre-like than the Dickens and Thackeray lists, but that could be wishful thinking.
Is there, I wonder, some way to test the similarity of these lists?
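One naive possibility, sketched below and untested against this data: treat each novel's modifier counts as a vector and compute pairwise cosine similarity. That would mean keeping one eye_modifiers dict per text instead of overwriting it, and the once-only words would probably need filtering first.

import math
def counter_cosine(c1, c2):
    # cosine similarity between two {word: count} dicts
    num = sum(c1[w] * c2.get(w, 0) for w in c1)
    den = math.sqrt(sum(v * v for v in c1.values())) * \
          math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0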
import string, textwrap
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
eye_modifiers = defaultdict(int)
eye_actions = defaultdict(int)
n_eye_nsubj = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'eye' and token.dep_ in ['nsubj']:
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'eye' and token.dep_ in ['nsubj']:
n_eye_nsubj += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
eye_modifiers[child.text.lower()] += 1
#print
eye_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_eye_nsubj', n_eye_nsubj
print
print '\t', 'eye_modifiers'.upper()
print
output = []
for w in Counter(eye_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
print
print '\t', 'eye_actions'.upper()
print
output = []
for w in Counter(eye_actions).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
Ditto re the word lists.
import string
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
eye_modifiers = defaultdict(int)
eye_actions = defaultdict(int)
n_eye = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'eye':
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'eye':
n_eye += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation and child.text.lower() not in ['--',]:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
eye_modifiers[child.text.lower()] += 1
#print
if token.head.lemma_.lower() not in string.punctuation:
eye_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_eye', n_eye
print
print '\t', 'eye_modifiers'.upper()
print
output = []
for w in Counter(eye_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))