This notebook was rerun with additional author birth date information
How does "face" and "faces" (and synonyms, especially "countenance[s] and "visage[s]," as well as related words such as "brow" and "eyebrows" appear in Jane Eyre, and, in this respect, how does Jane Eyre compare with David Copperfield and Vanity Fair? What happens if I perform some simple comparision of these words across the whole corpus?
Why did I start with David Copperfield and Vanity Fair? In our earlier conversations, we identified them as close comparables, at least in terms of date of composition, with Jane Eyre. In other work, I sometimes had trouble determining whether texts in our corpus were closer to Dickens or Bronte, or, as we put it, in differentiating between a "Jane Eyre effect" and a "Dickens effect". And, in light of the preface to Jane Eyre, some Bronte-Thackeray comparison seems reasonable.
What did I learn with this notebook?
Compared to David Copperfield and Vanity Fair, Jane Eyre uses "eye" and "eyes" more frequently, and distinctively. I'm particularly taken by Jane Eyre's use of "eye" as the subject of sentences.
However, when I look across the whole Muncie fiction corpus (1,100 texts), Jane Eyre does not use "eye" and "eyes" remarkably often. Marlitt, on the other hand, does (consistent with Tomek's summer 2016 findings).
I reran the methods of the Jane Eyre-David Copperfield-Vanity Fair comparison, except that I swapped in OMS and Gisela in place of David Copperfield and Vanity Fair; those results appear below, after the very long listing of the relative frequency of "eye[s]" in the corpus.
Marlitt use "eye[s]" as the subject of sentences even more than Jane Eyre, again a finding consistent with Tomek's.
Bottom line? I'd like to find some way to focus this, especially on how "eye[s]" acts and is modified; the modifiers and verbs feel "sparse" (i.e., a lot of words occur only once), and the words which do occur frequently (pronouns, for example) seem like uninteresting data points. I'd also like to run the parsing analysis across the whole corpus: how does the use of "eye[s]" as a sentence subject figure across the corpus? (I'll start that as a separate process and let it run tonight; a rough sketch of what that pass might look like follows the next paragraph.)
Still, I think the use of "eye[s]" deserves attention. It seems like a distinctive feature of Jane Eyre, the words occur even more often in Marlitt, and their uses--not as a body part, but as a way of figuring observation, judgement, and discernment, at times unreliably--get right at our focus on "the gaze."
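Here is a minimal sketch of what that corpus-wide pass might look like (not the overnight job itself, and not run in this notebook): it reuses the corpus folder from the cells below, filters on the "nsubj" dependency only, normalizes by spacy token count rather than nltk token count, and prints the 25 highest-ranking texts; all of those choices are assumptions to revisit.
import codecs, glob, re
import en_core_web_sm
# Sketch: relative frequency of the lemma "eye" parsed as a sentence subject,
# for every text in the corpus. Parsing 1,100 full novels this way is slow.
nlp = en_core_web_sm.load()
CORPUS_FOLDER = '..\muncie_public_library_corpus\PG_no_backmatter_fiction'
eye_nsubj_rates = []
for path_to_file in glob.glob(CORPUS_FOLDER + '\\' + '*.txt'):
    raw_text = codecs.open(path_to_file, 'r', encoding='utf-8').read()
    doc = nlp(re.sub(r'\s+', ' ', raw_text))
    n_eye_nsubj = 0
    for token in doc:
        if token.lemma_.lower() == 'eye' and token.dep_ == 'nsubj':
            n_eye_nsubj += 1
    # normalize by spacy token count so long and short novels are comparable
    eye_nsubj_rates.append((float(n_eye_nsubj) / len(doc), path_to_file.split('\\')[-1]))
eye_nsubj_rates.sort(reverse=True)
for rate, file_name in eye_nsubj_rates[:25]:
    print '%.7f' % rate, file_name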
There's only preparation work in this cell (nothing to see here!): read Jane Eyre, David Copperfield and Vanity Fair, and use them to load up nltk and spacy objects. The nltk objects serve for printing out key words in context, for providing easy-to-access token counts, and for making plain text available for regex searches. The spacy objects provide access to lemmas and to sentence-level dependency parses.
Note that I'm using spacy 1.9.0 here; the last time I checked, version 2 was still buggy.
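A small guard like the following could go at the top of the first cell (it is not part of the original notebook); it simply fails early if a spacy 2.x install sneaks in.
import spacy
# spacy.__version__ reports the installed version; this notebook assumes 1.x
print spacy.__version__
assert spacy.__version__.startswith('1.'), 'this notebook was written against spacy 1.9.0'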
import codecs, re
import nltk
import spacy
import en_core_web_sm
#nlp = spacy.load('en')
nlp = en_core_web_sm.load()
CORPUS_FOLDER = '..\muncie_public_library_corpus\PG_no_backmatter_fiction'
#users\tombr\Documents\WUSTL\Work\HDW\tatlock_spring_2018\muncie_public_library_corpus\PG_no_backmatter
texts = [
{'file_name': '\Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': '\Dickens_Charles_David_Copperfield_PG_766.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': '\Thackeray_William_Makepeace_Vanity_Fair_PG_599.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
]
for t in texts:
t['raw_text'] = codecs.open(CORPUS_FOLDER + t['file_name'], 'r', encoding='utf-8').read()
t['tokens'] = nltk.word_tokenize(t['raw_text'])
t['text_obj'] = nltk.Text(t['tokens'])
cleaned_text = re.sub('\s+', ' ', t['raw_text'])
t['spacy_doc'] = nlp(cleaned_text)
print 'Done!'
I recapitulate a point from a previous notebook: "eye" is the most common noun lemma in Jane Eyre. Here, I list the top 10 noun lemmas, along with the number of times each occurs (notice that I ignore "what" and "who", which I suspect are part-of-speech tagging errors by spacy). This step is important only because I want to be sure I am chasing a significant lexical feature of Jane Eyre.
I also print the top 10 lemmas for the other two novels. Are the relative positions of "eye" and "hand" of interest?
from collections import defaultdict, Counter
for t in texts:
print
print t['file_name']
print
lemma_counts = defaultdict(int)
for t in t['spacy_doc']:
if t.pos_ == 'NOUN' and t.lemma_ not in ['what', 'who']:
lemma_counts[t.lemma_] += 1
for w in Counter(lemma_counts).most_common(10):
print '\t', w[0], w[1]
Close-read some passages: first, passages from the three novels containing "face"/"countenance"/"visage"; next, passages containing "faces"/"countenances"/"visages"; and then passages containing "brow", "brows", and "eyebrows".
(I print only the first 20 examples for each word-novel pair; it would be trivial to print more, or even all of them, if that would be helpful.)
In JE "faces" is more metaphorical than "face" (less likely to refer to an anatomical feature. "Face" appears as a body part, by which we recognize a person, a part of a character described to give them distinctive features, as well as body part mentioned in random circumstances with other parts: "face and hands," "nails and face" in all three novels. "Countenance" is more likely to express emotions than "face" or "faces" (THIS IS VERY EVIDENT). "Visage" more likely to be used with descriptions of permanent features in JE. "Brow" is almost always used to describe character's permanent features. "Eyebrows" sometimes describe permanent features, sometimes to describe emotional reactions.
print
print '--------------------------- FACE ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('face', lines=20, width=115)
print
print
print '--------------------------- FACES ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('faces', lines=20, width=115)
print
print
print '--------------------------- COUNTENANCE ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('countenance', lines=20, width=115)
print
print
print '--------------------------- VISAGE ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('visage', lines=20, width=115)
print
print
print '--------------------------- BROW ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('brow', lines=20, width=115)
print
print
print '--------------------------- BROWS ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('brows', lines=20, width=115)
print
print
print '--------------------------- EYEBROWS ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('eyebrows', lines=20, width=115)
Do the simplest thing possible: compute the relative frequencies of "face", "faces", "[pronoun] face[s]", etc.
What do we observe?
The frequency of "face" and "faces" in Jane Eyre is not significantly different from the frequency of these words in David Copperfield or Vanity Fair; it is highest in David Copperfield and lowest in Vanity Fair.
However, Jane Eyre uses "countenance" and "visage" more. Bronte evidently draws on a more diverse vocabulary related to "face," using synonyms more often than the other authors. However, even after adding together all instances of "face," "countenance," and "visage" (the FACE[S] ALL row), David Copperfield still has a higher relative frequency of "face" words.
IS FIRST PERSON NARRATOR MORE LIKELY TO TALK ABOUT FACE
import re
import tabletext
results = [
['', 'Jane Eyre', 'David C', 'Vanity F'],
['FACE', '', '', ''],
['FACE[S]', '', '', ''],
['COUNTENANCE[S]', '', '', ''],
['COUNTENANCE', '', '', ''],
['VISAGE[S]', '', '', ''],
['VISAGE', '', '', ''],
['FACE[S] ALL', '', '', ''],
['FACE ALL', '', '', ''],
['FACES ALL', '', '', ''],
['PRON FACE', '', '', ''],
['PRON FACES', '', '', ''],
]
regexes = [
r'\bface\b',
r'\bface\b|\bfaces\b',
r'\bcountenance\b|\bcountenances\b',
r'\bcountenance\b',
r'\bvisage\b|\bvisages\b',
r'\bvisage\b',
r'\bface\b|\bfaces\b|\bcountenance\b|\bcountenances\b|\bvisage\b|\bvisages\b',
r'\bface\b|\bcountenance\b|\bvisage\b',
r'\bfaces\b|\bcountenances\b|\bvisages\b',
#HOW TO MAKE ONE ON THE LEFT SIDE OF + MATCH ONE TO THE RIGHT OF IT
r'\bmy face\b|\bhis face\b|\bher face\b',
r'\bmy faces\b|\bhis faces\b|\bher faces\b'
]
for a, t in enumerate(texts):
cleaned_text = re.sub('\s+', ' ', t['raw_text'])
for b, r in enumerate(regexes):
matches = re.finditer(r, cleaned_text.lower())
n_matches = 0
for m in matches:
n_matches += 1
results[b + 1][a + 1] = '%.7f' % (float(n_matches) / float(len(t['tokens'])))
print tabletext.to_text(results)
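On the question in the comment above the PRON FACE regexes: one way to make a single pattern cover any of the pronouns followed by the noun is a non-capturing group around the alternation, with an optional plural "s". A small sketch (the pattern and the sample sentence are mine, for illustration only):
import re
# Grouped alternation: any one of my/his/her, then "face" or "faces".
pron_face = re.compile(r'\b(?:my|his|her) faces?\b')
sample = 'She turned her face away; I hid my face; their faces were pale.'
print pron_face.findall(sample.lower())
# prints ['her face', 'my face']; "their faces" is skipped because of the \b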
I'm printing out relative frequencies for the 10 most common face/faces (etc.) and dependency-code pairs in Jane Eyre, along with the corresponding relative frequencies for the other two novels. The dependency codes ("pobj", "nsubjpass", etc.) can be found at:
https://nlp.stanford.edu/software/dependencies_manual.pdf
Unlike face and countenance, visage never appears as the subject of a sentence in any of the novels (or, at least, visage nsubj is not among the top 10 dependency pairs for "visage"; a direct check appears after the visage table below).
import tabletext
word_dependency_counts = {}
for a, t in enumerate(texts):
for s in t['spacy_doc'].sents:
for token in s:
if token.lemma_.lower() == 'face':
try:
noop = word_dependency_counts[(token.text.lower(), token.dep_)]
except KeyError:
word_dependency_counts[(token.text.lower(), token.dep_)] = [0, 0, 0]
try:
noop = word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)]
except KeyError:
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)] = [0, 0, 0]
word_dependency_counts[(token.text.lower(), token.dep_)][a] += 1
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)][a] += 1
for k in word_dependency_counts.keys():
word_dependency_counts[k][a] = '%.7f' % (word_dependency_counts[k][a] / float(len(t['tokens'])))
sort_results = []
for k, v in word_dependency_counts.iteritems():
sort_results.append([v[0], ' '.join(k)] + v)
sort_results.sort(reverse=True)
final_results = [['', 'Jane Eyre', 'David C', 'Vanity F'],]
for r in sort_results[:10]:
final_results.append(r[1:])
print tabletext.to_text(final_results)
import tabletext
word_dependency_counts = {}
for a, t in enumerate(texts):
for s in t['spacy_doc'].sents:
for token in s:
if token.lemma_.lower() == 'countenance':
try:
noop = word_dependency_counts[(token.text.lower(), token.dep_)]
except KeyError:
word_dependency_counts[(token.text.lower(), token.dep_)] = [0, 0, 0]
try:
noop = word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)]
except KeyError:
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)] = [0, 0, 0]
word_dependency_counts[(token.text.lower(), token.dep_)][a] += 1
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)][a] += 1
for k in word_dependency_counts.keys():
word_dependency_counts[k][a] = '%.7f' % (word_dependency_counts[k][a] / float(len(t['tokens'])))
sort_results = []
for k, v in word_dependency_counts.iteritems():
sort_results.append([v[0], ' '.join(k)] + v)
sort_results.sort(reverse=True)
final_results = [['', 'Jane Eyre', 'David C', 'Vanity F'],]
for r in sort_results[:10]:
final_results.append(r[1:])
print tabletext.to_text(final_results)
import tabletext
word_dependency_counts = {}
for a, t in enumerate(texts):
for s in t['spacy_doc'].sents:
for token in s:
if token.lemma_.lower() == 'visage':
try:
noop = word_dependency_counts[(token.text.lower(), token.dep_)]
except KeyError:
word_dependency_counts[(token.text.lower(), token.dep_)] = [0, 0, 0]
try:
noop = word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)]
except KeyError:
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)] = [0, 0, 0]
word_dependency_counts[(token.text.lower(), token.dep_)][a] += 1
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)][a] += 1
for k in word_dependency_counts.keys():
word_dependency_counts[k][a] = '%.7f' % (word_dependency_counts[k][a] / float(len(t['tokens'])))
sort_results = []
for k, v in word_dependency_counts.iteritems():
sort_results.append([v[0], ' '.join(k)] + v)
sort_results.sort(reverse=True)
final_results = [['', 'Jane Eyre', 'David C', 'Vanity F'],]
for r in sort_results[:10]:
final_results.append(r[1:])
print tabletext.to_text(final_results)
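As a quick sanity check on the claim above that "visage" never shows up as a sentence subject, something like this could be run immediately after the visage cell (it reuses that cell's word_dependency_counts, which at this point holds only visage-lemma tokens):
# list any (word, dependency) pairs whose dependency code is 'nsubj';
# if the parses bear the claim out, this prints an empty list
nsubj_pairs = [k for k in word_dependency_counts.keys() if k[1] == 'nsubj']
print nsubj_pairs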
How much information can I extract about the instances of "face" which function as the subjects of sentences? What modifies them? What actions do they perform?
Note that here, I'm using the lemma of "face" and "faces" (etc.), so the singular and plural are collapsed into one. For the modifiers I'm reporting counts of the actual words (not the lemmas), but for the actions I'm reporting the lemmas. Confusing, I realize, but necessary if I'm going to get words to collapse usefully.
Does this tell us anything?
Two answers: if we're trying to get to some sort of word-frequency matrix from this data, then it isn't going to be all that useful. Too many modifiers and actions occur only once, and the frequencies of the pronouns seem to be as much the result of the narrative voice as anything else.
import string, textwrap
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
face_modifiers = defaultdict(int)
face_actions = defaultdict(int)
n_face_nsubj = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'face' and token.dep_ in ['nsubj']:
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'face' and token.dep_ in ['nsubj']:
n_face_nsubj += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
face_modifiers[child.text.lower()] += 1
#print
face_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_face_nsubj', n_face_nsubj
print
print '\t', 'face_modifiers'.upper()
print
output = []
for w in Counter(face_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
print
print '\t', 'face_actions'.upper()
print
output = []
for w in Counter(face_actions).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
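A rough way to quantify the "sparseness" complained about above: count how many of the modifiers and actions occur exactly once. As written this only sees the counters left over from the last novel processed in the cell above; moving it inside that loop would give per-novel figures.
# fraction of hapax modifiers/actions for the last text processed above
for label, counts in [('face_modifiers', face_modifiers), ('face_actions', face_actions)]:
    hapaxes = [w for w, c in counts.items() if c == 1]
    print label, ':', len(hapaxes), 'of', len(counts), 'occur only once'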
import string, textwrap
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
countenance_modifiers = defaultdict(int)
countenance_actions = defaultdict(int)
n_countenance_nsubj = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'countenance' and token.dep_ in ['nsubj']:
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'countenance' and token.dep_ in ['nsubj']:
n_countenance_nsubj += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
countenance_modifiers[child.text.lower()] += 1
#print
countenance_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_countenance_nsubj', n_countenance_nsubj
print
print '\t', 'countenance_modifiers'.upper()
print
output = []
for w in Counter(countenance_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
print
print '\t', 'countenance_actions'.upper()
print
output = []
for w in Counter(countenance_actions).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
import string, textwrap
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
face_all_modifiers = defaultdict(int)
face_all_actions = defaultdict(int)
n_face_all_nsubj = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if (token.lemma_.lower() == 'face' or token.lemma_.lower() == 'countenance') and token.dep_ in ['nsubj']:
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if (token.lemma_.lower() == 'face' or token.lemma_.lower() == 'countenance') and token.dep_ in ['nsubj']:
n_face_all_nsubj += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
face_all_modifiers[child.text.lower()] += 1
#print
face_all_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_face_countenance_nsubj', n_face_all_nsubj
print
print '\t', 'face_countenance_modifiers'.upper()
print
output = []
for w in Counter(face_all_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
print
print '\t', 'face_countenance_actions'.upper()
print
output = []
for w in Counter(face_all_actions).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
Here, I'm digging out the words which modify all instances of "face" and "faces" (etc.). Note that we pick up some instances of proper nouns as modifiers. I suspect, although I haven't confirmed, that this is an artifact of how spacy parses sentences; something like "Brocklehurst's eyes" gets broken into three tokens ("Brocklehurst", "'s", and "eyes"), and the proper noun gets connected to "eyes" in the parse.
This may not be terribly useful; I include it only for the sake of completeness.
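A quick way to check the possessive suspicion (my addition; the expectation that possessives attach to the noun with the "poss" dependency label is an assumption about this parser, not something verified above): print any proper-noun children of "face" tokens along with their dependency labels.
# proper-noun children of "face" tokens; a 'poss' label would confirm the
# "Brocklehurst's ..." style of attachment suspected above
for t in texts:
    print
    print t['file_name']
    for token in t['spacy_doc']:
        if token.lemma_.lower() == 'face':
            for child in token.children:
                if child.pos_ == 'PROPN':
                    print '\t', child.text, child.dep_, '->', token.text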
import string
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
face_modifiers = defaultdict(int)
face_actions = defaultdict(int)
n_face = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'face':
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'face':
n_face += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation and child.text.lower() not in ['--',]:
#print child.text.lower(),
#face_modifiers[(child.text.lower(), child.tag_)] += 1
face_modifiers[child.text.lower()] += 1
#print
if token.head.lemma_.lower() not in string.punctuation:
face_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_face', n_face
print
print '\t', 'face_modifiers'.upper()
print
output = []
for w in Counter(face_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#print
#print '\t', 'face "heads"'.upper()
#print
#output = []
#for w in Counter(face_actions).most_common():
# output.append(w[0] + ' ' + str(w[1]))
#print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
import string
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
countenance_modifiers = defaultdict(int)
countenance_actions = defaultdict(int)
n_countenance = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'countenance':
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'countenance':
n_countenance += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation and child.text.lower() not in ['--',]:
#print child.text.lower(),
#countenance_modifiers[(child.text.lower(), child.tag_)] += 1
countenance_modifiers[child.text.lower()] += 1
#print
if token.head.lemma_.lower() not in string.punctuation:
countenance_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_countenance', n_countenance
print
print '\t', 'countenance_modifiers'.upper()
print
output = []
for w in Counter(countenance_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#print
#print '\t', 'countenance "heads"'.upper()
#print
#output = []
#for w in Counter(countenance_actions).most_common():
# output.append(w[0] + ' ' + str(w[1]))
#print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
import string
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
visage_modifiers = defaultdict(int)
visage_actions = defaultdict(int)
n_visage = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'visage':
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'visage':
n_visage += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation and child.text.lower() not in ['--',]:
#print child.text.lower(),
#countenance_modifiers[(child.text.lower(), child.tag_)] += 1
visage_modifiers[child.text.lower()] += 1
#print
if token.head.lemma_.lower() not in string.punctuation:
visage_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_visage', n_visage
print
print '\t', 'visage_modifiers'.upper()
print
output = []
for w in Counter(visage_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#print
#print '\t', 'visage "heads"'.upper()
#print
#output = []
#for w in Counter(visage_actions).most_common():
# output.append(w[0] + ' ' + str(w[1]))
#print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
Just a quick check to see how common "eye" and "eyes" are in the whole Muncie fiction corpus. Does Jane Eyre really have a lot? How does it compare to other texts? In the very long list that follows, Jane Eyre, David Copperfield and Vanity Fair are prefixed with asterisks; other works by Charlotte Bronte are prefixed with dashes, and Marlitt's texts are prefixed with ">>>>".
Please note: This cell produces over 1,100 lines of output. Please scroll to the bottom--there's more after this. Also, please note that I'm using a method to count "eye" and "eyes" that's different from what I used above, so the relative frequencies for "eye" and "eyes" are slightly different for Jane Eyre, etc.
The results are not what I expected. Jane Eyre is 293rd on the ranked list of novels; i.e., 292 of the 1,100 novels in the corpus have more "eye" and "eyes". What's more interesting? The Old Mam'selle's Secret is 24th on the list, Gisela is 32nd, and the Owl's Nest is 50th. This seems consistent with Tomek's findings from the summer of 2016 ("if Bronte uses a face/hand/eye word, Marlitt uses it more").
import codecs, re, glob, time, json
import nltk
#birth_date_lookup_table = json.loads(codecs.open('birth_date_lookup_table.js', 'r', encoding='utf-8').read())
start_time = time.time()
CORPUS_FOLDER = '..\muncie_public_library_corpus\PG_no_backmatter_fiction'
results = []
all_rel_freqs = []
for a, path_to_file in enumerate(glob.glob(CORPUS_FOLDER + '\\' + '*.txt')):
if a % 100 == 0:
print 'processing', a
#if a > 100:
# break
file_name = path_to_file.split('\\')[-1]
# birth_date = '????'
#try:
# birth_date = str(birth_date_lookup_table[file_name])
# except KeyError:
# if 'Marlitt' in file_name:
# birth_date = '1825'
# file_name = birth_date + ' ' + file_name
if 'Jane_Eyre' in file_name:
file_name = '**** ' + file_name
elif 'David_Copperfield' in file_name:
file_name = '**** ' + file_name
elif 'Vanity_Fair' in file_name:
file_name = '**** ' + file_name
elif 'Bront_Charl' in file_name:
file_name = '---- ' + file_name
elif 'Marlitt' in file_name:
file_name = '>>>> ' + file_name
raw_text = codecs.open(path_to_file, 'r', encoding='utf-8').read()
tokens = nltk.word_tokenize(raw_text)
n_eyes = 0
for t in tokens:
if t.lower() in ['eye', 'eyes']:
n_eyes += 1
results.append([(float(n_eyes) / len(tokens)), file_name])
all_rel_freqs.append((float(n_eyes) / len(tokens)))
results.sort(reverse=True)
stop_time = time.time()
print 'Done gathering', (stop_time - start_time)
print
for a, r in enumerate(results):
print '%.7f' % r[0], ('(' + str(a + 1) + ')'), r[1]
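all_rel_freqs gets built above but never used; here is a short follow-up that could be run right after that cell to put the Jane Eyre figure in context (the median and the rank are computed from the results list built above).
# corpus median and Jane Eyre's rank, using the lists from the cell above
je = [r for r in results if 'Jane_Eyre' in r[1]][0]
all_rel_freqs.sort(reverse=True)
print 'corpus median relative frequency: %.7f' % all_rel_freqs[len(all_rel_freqs) / 2]
print 'Jane Eyre: %.7f (rank %d of %d)' % (je[0], results.index(je) + 1, len(results))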
I'm going light on the notes in what follows, since I'm simply recapitulating methods with a different trio of texts.
import codecs, re
import nltk
import spacy
import en_core_web_sm
#nlp = spacy.load('en')
nlp = en_core_web_sm.load()
CORPUS_FOLDER = '..\muncie_public_library_corpus\PG_no_backmatter_fiction'
texts = [
{'file_name': '\Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': '\Marlitt_Wister_OMS_translation_cleaned_110617.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
{'file_name': '\Marlitt_Wister_Countess_Gisela_corrected_4_10_2018.txt',
'raw_text': '', 'tokens': [], 'text_obj': None, 'spacy_doc': None},
]
for t in texts:
t['raw_text'] = codecs.open(CORPUS_FOLDER + t['file_name'], 'r', encoding='utf-8').read()
t['tokens'] = nltk.word_tokenize(t['raw_text'])
t['text_obj'] = nltk.Text(t['tokens'])
cleaned_text = re.sub('\s+', ' ', t['raw_text'])
t['spacy_doc'] = nlp(cleaned_text)
print 'Done!'
from collections import defaultdict, Counter
for t in texts:
print
print t['file_name'], len(t['tokens'])
print
lemma_counts = defaultdict(int)
for t in t['spacy_doc']:
if t.pos_ == 'NOUN' and t.lemma_ not in ['what', 'who']:
lemma_counts[t.lemma_] += 1
for w in Counter(lemma_counts).most_common(10):
print '\t', w[0], w[1]
Please note that OMS and Gisela are much shorter than Jane Eyre.
print
print '--------------------------- FACE ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('face', lines=20, width=115)
print
print
print '--------------------------- FACES ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('faces', lines=20, width=115)
print
print
print '--------------------------- COUNTENANCE ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('countenance', lines=20, width=115)
print
print
print '--------------------------- VISAGE ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('visage', lines=20, width=115)
print
print
print '--------------------------- BROW ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('brow', lines=20, width=115)
print
print
print '--------------------------- BROWS ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('brows', lines=20, width=115)
print
print
print '--------------------------- EYEBROWS ---------------------------'
for t in texts:
print
print t['file_name']
print
t['text_obj'].concordance('eyebrows', lines=20, width=115)
import re, tabletext
results = [
['', 'Jane Eyre', 'OMS', 'Gisela'],
['EYE[S]', '', '', ''],
['EYE', '', '', ''],
['EYES', '', '', ''],
['PRON EYE', '', '', ''],
['PRON EYES', '', '', ''],
]
regexes = [
r'\beye\b|\beyes\b',
r'\beye\b',
r'\beyes\b',
r'\bmy eye\b|\bhis eye\b|\bher eye\b',
r'\bmy eyes\b|\bhis eyes\b|\bher eyes\b'
]
for a, t in enumerate(texts):
cleaned_text = re.sub('\s+', ' ', t['raw_text'])
for b, r in enumerate(regexes):
matches = re.finditer(r, cleaned_text.lower())
n_matches = 0
for m in matches:
n_matches += 1
results[b + 1][a + 1] = '%.7f' % (float(n_matches) / float(len(t['tokens'])))
print tabletext.to_text(results)
import tabletext
word_dependency_counts = {}
for a, t in enumerate(texts):
for s in t['spacy_doc'].sents:
for token in s:
if token.lemma_.lower() == 'face':
try:
noop = word_dependency_counts[(token.text.lower(), token.dep_)]
except KeyError:
word_dependency_counts[(token.text.lower(), token.dep_)] = [0, 0, 0]
try:
noop = word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)]
except KeyError:
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)] = [0, 0, 0]
word_dependency_counts[(token.text.lower(), token.dep_)][a] += 1
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)][a] += 1
for k in word_dependency_counts.keys():
word_dependency_counts[k][a] = '%.7f' % (word_dependency_counts[k][a] / float(len(t['tokens'])))
sort_results = []
for k, v in word_dependency_counts.iteritems():
sort_results.append([v[0], ' '.join(k)] + v)
sort_results.sort(reverse=True)
final_results = [['', 'Jane Eyre', 'OMS', 'Gisela'],]
for r in sort_results[:10]:
final_results.append(r[1:])
print tabletext.to_text(final_results)
import tabletext
word_dependency_counts = {}
for a, t in enumerate(texts):
for s in t['spacy_doc'].sents:
for token in s:
if token.lemma_.lower() == 'countenance':
try:
noop = word_dependency_counts[(token.text.lower(), token.dep_)]
except KeyError:
word_dependency_counts[(token.text.lower(), token.dep_)] = [0, 0, 0]
try:
noop = word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)]
except KeyError:
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)] = [0, 0, 0]
word_dependency_counts[(token.text.lower(), token.dep_)][a] += 1
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)][a] += 1
for k in word_dependency_counts.keys():
word_dependency_counts[k][a] = '%.7f' % (word_dependency_counts[k][a] / float(len(t['tokens'])))
sort_results = []
for k, v in word_dependency_counts.iteritems():
sort_results.append([v[0], ' '.join(k)] + v)
sort_results.sort(reverse=True)
final_results = [['', 'Jane Eyre', 'OMS', 'Gisela'],]
for r in sort_results[:10]:
final_results.append(r[1:])
print tabletext.to_text(final_results)
import tabletext
word_dependency_counts = {}
for a, t in enumerate(texts):
for s in t['spacy_doc'].sents:
for token in s:
if token.lemma_.lower() == 'visage':
try:
noop = word_dependency_counts[(token.text.lower(), token.dep_)]
except KeyError:
word_dependency_counts[(token.text.lower(), token.dep_)] = [0, 0, 0]
try:
noop = word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)]
except KeyError:
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)] = [0, 0, 0]
word_dependency_counts[(token.text.lower(), token.dep_)][a] += 1
word_dependency_counts[(token.lemma_.lower() + ' (lem)', token.dep_)][a] += 1
for k in word_dependency_counts.keys():
word_dependency_counts[k][a] = '%.7f' % (word_dependency_counts[k][a] / float(len(t['tokens'])))
sort_results = []
for k, v in word_dependency_counts.iteritems():
sort_results.append([v[0], ' '.join(k)] + v)
sort_results.sort(reverse=True)
final_results = [['', 'Jane Eyre', 'OMS', 'Gisela'],]
for r in sort_results[:10]:
final_results.append(r[1:])
print tabletext.to_text(final_results)
import string, textwrap
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
face_modifiers = defaultdict(int)
face_actions = defaultdict(int)
n_face_nsubj = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'face' and token.dep_ in ['nsubj']:
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'face' and token.dep_ in ['nsubj']:
n_face_nsubj += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
face_modifiers[child.text.lower()] += 1
#print
face_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_face_nsubj', n_face_nsubj
print
print '\t', 'face_modifiers'.upper()
print
output = []
for w in Counter(face_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
print
print '\t', 'face_actions'.upper()
print
output = []
for w in Counter(face_actions).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
import string, textwrap
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
countenance_modifiers = defaultdict(int)
countenance_actions = defaultdict(int)
n_countenance_nsubj = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'countenance' and token.dep_ in ['nsubj']:
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'countenance' and token.dep_ in ['nsubj']:
n_countenance_nsubj += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
countenance_modifiers[child.text.lower()] += 1
#print
countenance_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_countenance_nsubj', n_countenance_nsubj
print
print '\t', 'countenance_modifiers'.upper()
print
output = []
for w in Counter(countenance_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
print
print '\t', 'countenance_actions'.upper()
print
output = []
for w in Counter(countenance_actions).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
import string, textwrap
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
face_all_modifiers = defaultdict(int)
face_all_actions = defaultdict(int)
n_face_all_nsubj = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if (token.lemma_.lower() == 'face' or token.lemma_.lower() == 'countenance') and token.dep_ in ['nsubj']:
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if (token.lemma_.lower() == 'face' or token.lemma_.lower() == 'countenance') and token.dep_ in ['nsubj']:
n_face_all_nsubj += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation:
#print child.text.lower(),
#eye_modifiers[(child.text.lower(), child.tag_)] += 1
face_all_modifiers[child.text.lower()] += 1
#print
face_all_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_face_countenance_nsubj', n_face_all_nsubj
print
print '\t', 'face_countenance_modifiers'.upper()
print
output = []
for w in Counter(face_all_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
print
print '\t', 'face_countenance_actions'.upper()
print
output = []
for w in Counter(face_all_actions).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
import string
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
face_modifiers = defaultdict(int)
face_actions = defaultdict(int)
n_face = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'face':
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'face':
n_face += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation and child.text.lower() not in ['--',]:
#print child.text.lower(),
#face_modifiers[(child.text.lower(), child.tag_)] += 1
face_modifiers[child.text.lower()] += 1
#print
if token.head.lemma_.lower() not in string.punctuation:
face_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_face', n_face
print
print '\t', 'face_modifiers'.upper()
print
output = []
for w in Counter(face_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#print
#print '\t', 'face "heads"'.upper()
#print
#output = []
#for w in Counter(face_actions).most_common():
# output.append(w[0] + ' ' + str(w[1]))
#print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
import string
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
countenance_modifiers = defaultdict(int)
countenance_actions = defaultdict(int)
n_countenance = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'countenance':
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'countenance':
n_countenance += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation and child.text.lower() not in ['--',]:
#print child.text.lower(),
#countenance_modifiers[(child.text.lower(), child.tag_)] += 1
countenance_modifiers[child.text.lower()] += 1
#print
if token.head.lemma_.lower() not in string.punctuation:
countenance_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_countenance', n_countenance
print
print '\t', 'countenance_modifiers'.upper()
print
output = []
for w in Counter(countenance_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#print
#print '\t', 'countenance "heads"'.upper()
#print
#output = []
#for w in Counter(countenance_actions).most_common():
# output.append(w[0] + ' ' + str(w[1]))
#print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break
import string
from collections import defaultdict, Counter
for a, t in enumerate(texts):
print
print t['file_name']
visage_modifiers = defaultdict(int)
visage_actions = defaultdict(int)
n_visage = 0
for s in t['spacy_doc'].sents:
do_this_sentence = False
for token in s:
if token.lemma_.lower() == 'visage':
do_this_sentence = True
break
if do_this_sentence == True:
#print
#print s
#print
for token in s:
if token.lemma_.lower() == 'visage':
n_visage += 1
#print [child for child in token.children], '>>', \
# token.i, token.text, token.pos_, token.dep_, '>>', \
# token.head.i, token.head.text, token.head.pos_
for child in token.children:
if child.text.lower() not in string.punctuation and child.text.lower() not in ['--',]:
#print child.text.lower(),
#countenance_modifiers[(child.text.lower(), child.tag_)] += 1
visage_modifiers[child.text.lower()] += 1
#print
if token.head.lemma_.lower() not in string.punctuation:
visage_actions[token.head.lemma_.lower()] += 1
#print token.head.text.lower()
print
print '\t', 'n_visage', n_visage
print
print '\t', 'visage_modifiers'.upper()
print
output = []
for w in Counter(visage_modifiers).most_common():
output.append(w[0] + ' ' + str(w[1]))
print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#print
#print '\t', 'visage "heads"'.upper()
#print
#output = []
#for w in Counter(visage_actions).most_common():
# output.append(w[0] + ' ' + str(w[1]))
#print '\t' + '\n\t'.join(textwrap.wrap(', '.join(output), width=80))
#break