This notebook looks for concentrations of words associated with vision in Jane Eyre. I originally thought to look for just the lemmas of "eye", "look", and "see"; however, although Brontë leans especially hard on those words, her sight-related vocabulary is much larger. This notebook uses WordNet to identify words beyond "eye", "look", and "see" which should be included in a vision-related word count.
In Jane Eyre, there does seem to be a "rhythm" of seeing, then not seeing so much (or at all), then seeing again, and so on. I don't think the findings here are definitive; I might, for example, re-run the main visualization below with the text sliced into different-sized bins. And I might also examine other novels to see if they too follow a similar pattern (I'll start this second task directly, since it's largely a matter of copying code from here and letting it run).
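If I do try different bin sizes, the cheapest route is probably to aggregate adjacent bins rather than re-tokenize; a minimal sketch, assuming the bin_counts list computed further down in this notebook:

# A minimal sketch of re-binning (assumes bin_counts, computed below):
# summing adjacent bins coarsens the resolution without re-running spacy.
def rebin(counts, factor):
    return [sum(counts[a: a + factor]) for a in range(0, len(counts), factor)]

# e.g., 1,000 bins of ~231 tokens each -> 200 bins of ~1,155 tokens each:
# coarser_counts = rebin(bin_counts, 5)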
Note: this notebook contains a fair-sized section of passages from Jane Eyre; please scroll past them, since a bit more analysis follows.
. . . load the text, and pass it to spacy for part-of-speech tagging and lemmatization.
import spacy
nlp = spacy.load('en')
import codecs, re
CORPUS_FOLDER = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'
text = codecs.open(CORPUS_FOLDER + 'Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt',
'r', encoding='utf-8').read()
text = re.sub('\s+', ' ', text).strip()
doc = nlp(text)
This cell lists the noun and verb synsets and their associated lemmas. The format is like
idea.n.01 3178 64 aspect 17, attention 28, beauty 34, character 54, charm 26, claim 16
where the first column ("idea.n.01 3178 64") contains the synset name ("idea.n.01"), the number of times a lemma in the text points to that synset ("3178"), and the number of unique lemmas that point to the synset ("64"). The second column ("aspect 17, attention 28, beauty 34 . . . ") lists the unique words that point to the synset, each followed by the number of times that lemma occurs in the text.
A fair definition of synset is available at https://en.wikipedia.org/wiki/Synonym_ring; a synset can be thought of as a container which groups (roughly) semantically-equivalent words. Note that in WordNet, synsets are organized in hierarchies; as one moves up a hierarchy, the semantic equivalence gets shakier and shakier.
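For instance, here's what the climb up the hierarchy looks like for "gaze" (a small illustration using NLTK's WordNet interface directly, which TextBlob wraps; it assumes the WordNet corpus is installed):

from nltk.corpus import wordnet as wn

# Walk up the hypernym chain from gaze.n.01; note how the semantic
# equivalence loosens at every step.
synset = wn.synset('gaze.n.01')
while synset is not None:
    print synset.name(), '--', synset.definition()
    hypernyms = synset.hypernyms()
    synset = hypernyms[0] if hypernyms else None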
In this cell, I print a limited number of examples; otherwise, the output would be quite lengthy. The examples printed below are from fairly far up in the hierarchy, which explains their seemingly faulty semantic equivalence. When I used this cell to identify "interesting" or "vision-related" synsets, I printed many more examples by commenting out the two lines following the "LIMITING OUTPUT HERE" comment.
import textwrap
import tabletext
from IPython.display import HTML, display
import tabulate
from collections import defaultdict, Counter
from textblob import Word
hyper = lambda s: s.hypernyms()
pos_lemma_counts = defaultdict(lambda : defaultdict(int))
for t in doc:
if t.lemma_ not in ['-PRON-', 'which', 'what', 'who']:
pos_lemma_counts[t.pos_][t.lemma_] += 1
for pos in sorted(pos_lemma_counts.keys()):
#if pos not in ['NOUN', 'VERB', 'ADJ', 'ADV']:
if pos not in ['NOUN', 'VERB',]:
continue
synset_words = defaultdict(list)
synset_counts = defaultdict(int)
for pos_lemma_counter in Counter(pos_lemma_counts[pos]).most_common():
if pos_lemma_counter[1] < 10:
break
word_synsets = Word(pos_lemma_counter[0]).get_synsets(pos=pos[0].lower())
for w in word_synsets:
synset_words[w.name()].append(pos_lemma_counter[0] + ' ' + str(pos_lemma_counter[1]))
synset_counts[w.name()] += pos_lemma_counter[1]
h = list(w.closure(hyper, depth=10))
for s in h:
synset_words[s.name()].append(pos_lemma_counter[0] + ' ' + str(pos_lemma_counter[1]))
synset_counts[s.name()] += pos_lemma_counter[1]
output_table = []
for synset, synset_count in Counter(synset_counts).most_common():
word_count_string = ', '.join(sorted(list(set(synset_words[synset]))))
n_words_in_synset = len(set(synset_words[synset]))
okay_to_print = False
if pos == 'NOUN' and n_words_in_synset < 100:
for important_word in ['eye', 'attention', 'observation', 'regard', 'vision']:
if important_word + ' ' in word_count_string:
okay_to_print = True
if pos == 'VERB' and n_words_in_synset < 100:
okay_to_print = True
if okay_to_print == True:
output_table.append([synset + ' ' + str(synset_count) + ' ' + str(n_words_in_synset),
'\n'.join(textwrap.wrap(word_count_string, 50))])
# LIMITING OUTPUT HERE
if len(output_table) > 5:
break
print
print pos
print
display(HTML(tabulate.tabulate(output_table, tablefmt='html')))
Here, I count the number of sight-related noun and verb lemmas which occur in Jane Eyre. A lemma is sight-related if its synset, or a synset higher up in the lemma's hierarchy, matches one of the synsets in the "interesting_synsets" list below.
"interesting_synsets" is a list I built using the preceeding Wordnet-related cell. It's a fairly good list; I'm counting "seem", which is not correct, but I'm also counting words like "witness", "observation, "glare", "peep" and "scrutiny", which are words I wouldn't have looked for. Note also that I could have cast I much wider net: there's much in Jane Eyre
I count lemma occurrences per "bin"; I've divided Jane Eyre into 1,000 non-overlapping bins of 231 tokens each.
This cell also prints out the lemmas that matched the "interesting_synsets", along with the number of times each lemma occurred.
Lastly, please note that I'm also counting the number of quotation marks per bin . . . I'm going to use that later.
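One detail worth flagging: matching against "interesting_synsets" is done with startswith, so entries that end in a bare period (like 'eye.n.') deliberately match every sense of the word, while fully qualified names (like 'see.v.01') match only that one sense. For example:

# The matching convention used below: a trailing period prefix-matches
# every sense; a fully qualified name matches one sense only.
print 'eye.n.01'.startswith('eye.n.')    # True: any sense of the noun "eye"
print 'see.v.23'.startswith('see.v.01')  # False: only see.v.01 itself matches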
from collections import defaultdict, Counter
interesting_synsets = [
'eye.n.',
'look.n.',
'sight.n.',
'stare.n.',
'gaze.n.',
'vision.n.',
'see.v.01',
'detect.v.01',
'spy.v.03',
'appear.v.04',
'look.v.01',
'visualize.v.01',
'see.v.23',
'look.v.03',
'watch.v.01',
]
hyper = lambda s: s.hypernyms()
words_counted = defaultdict(int)
n_tokens = len(doc)
N_BINS = 1000
bin_size = n_tokens / N_BINS
print 'n_tokens', n_tokens, 'N_BINS', N_BINS, 'bin_size', bin_size
quotation_mark_counts = []
bin_counts = []
for a in range(0, N_BINS + 1):
bin_counts.append(0)
quotation_mark_counts.append(0)
for t in doc:
if t.text == '"':
        bin_number = min(t.i / bin_size, N_BINS)  # clamp stragglers into the last bin
quotation_mark_counts[bin_number] += 1
if t.lemma_ not in ['-PRON-', 'which', 'what', 'who'] and t.pos_ in ['NOUN', 'VERB',]:
#if t.i > 1000:
# break
is_seeing_lemma = False
word_synsets = Word(t.lemma_).get_synsets(pos=t.pos_[0].lower())
for w in word_synsets:
            for interesting_synset in interesting_synsets:
                if w.name().startswith(interesting_synset):
                    is_seeing_lemma = True
                else:
                    # also accept a match anywhere up the hypernym chain
                    for h in w.closure(hyper, depth=10):
                        if h.name().startswith(interesting_synset):
                            is_seeing_lemma = True
if is_seeing_lemma == True:
            bin_number = min(t.i / bin_size, N_BINS)  # clamp stragglers into the last bin
bin_counts[bin_number] += 1
words_counted[(t.lemma_, t.pos_)] += 1
print
print 'WORDS COUNTED'
print
for w in Counter(words_counted).most_common():
print '\t', w[0], w[1]
high_count = max(bin_counts)
print
print 'high_count', high_count
#print bin_counts
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import unicodecsv as csv
from pylab import rcParams
rcParams['figure.figsize'] = 100, 3
import seaborn as sns
sns.set(style="whitegrid")
plt.bar(range(len(bin_counts)), bin_counts, align='center', color='#98AFC7', alpha=1.0)
plt.title('"seeing" words')
plt.xlabel('bin')
plt.ylabel('n words')
plt.ylim(0, 12)
plt.show()
In the previous cell, the graph seemed to suggest a rhythm of bins like "seeing, not so much (or no) seeing, seeing, not so much, etc." But is it the rhythm of randomness?
In the cell below, I first graph the last 100 bins from JE. Then I produce 5 graphs; for each, I randomly shuffle all the bins from JE and plot the last 100. Lastly, I produce 5 more graphs, each plotting 100 randomly generated bin values.
When I randomly shuffle the bins from JE (the first set of 5 graphs), I get results which look more or less like the last 100 (unshuffled) bins from JE (the first graph). However, when I randomly generate 100 bin values (the second set of 5 graphs), the results do not look like JE (the first graph).
This suggests, although doesn't quite prove, that there's more to the rhythm of "seeing, not seeing" than just a random sprinkling of sight-related words throughout the text; instead, the rhythm seems to be the result of the relative proportion of "seeing" and "not seeing" bins.
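A more formal way to firm this up might be a permutation test on a run statistic; a rough sketch (using the bin_counts computed above): if the zero bins cluster into long stretches, the real sequence should contain fewer distinct runs of zeros than most shuffled copies of itself.

import random

# A rough permutation-test sketch (not a definitive analysis): count the
# maximal runs of consecutive zero bins, then compare against shuffles.
def count_zero_runs(counts):
    runs, in_run = 0, False
    for c in counts:
        if c == 0 and not in_run:
            runs += 1
        in_run = (c == 0)
    return runs

observed = count_zero_runs(bin_counts)
work = list(bin_counts)
n_as_clustered = 0
for trial in range(1000):
    random.shuffle(work)
    # fewer runs than the shuffle = the real zeros are at least this clustered
    if count_zero_runs(work) <= observed:
        n_as_clustered += 1
print 'observed zero runs', observed, 'approx p', n_as_clustered / 1000.0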
%matplotlib inline
import copy, random
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import unicodecsv as csv
from pylab import rcParams
rcParams['figure.figsize'] = 15, 3
import seaborn as sns
sns.set(style="whitegrid")
# -----------------------------------------------------------------------
def plot_subgraph(sub_bins, title):
plt.bar(range(len(sub_bins)), sub_bins, align='center', color='#98AFC7', alpha=1.0)
plt.title(title)
plt.xlabel('bin')
plt.ylabel('n words')
plt.ylim(0, 12)
plt.show()
# -----------------------------------------------------------------------
print
print '************************************'
print 'LAST 100 BINS FROM JANE EYRE'
print '************************************'
print
sub_bins = bin_counts[901:]
plot_subgraph(sub_bins, '"seeing" words -- last 100 from JE')
print
print '************************************'
print 'LAST 100, RANDOMLY SHUFFLED BINS FROM JE'
print '************************************'
print
work_bin_counts = copy.deepcopy(bin_counts)
for a in range(0, 5):
random.shuffle(work_bin_counts)
sub_bins = work_bin_counts[901:]
plot_subgraph(sub_bins, '"seeing" words -- shuffled JE last 100 -- ' + str(a + 1))
print
print '************************************'
print '100, RANDOMLY GENERATED'
print '************************************'
print
for a in range(0, 5):
work_bin_counts = []
for b in range(0, 100):
work_bin_counts.append(random.randint(0, 11))
plot_subgraph(work_bin_counts, '"seeing" words -- 100 random bins -- ' + str(a + 1))
Pick through the bin data, and find sequences of two or more bins where there are either no sight-related lemmas, or else a lot of them (i.e., >= 6 lemmas in a bin).
sequences_of_zero_bins = []
temp_bins = []
for a in range(0, len(bin_counts)):
    if bin_counts[a] == 0:
        temp_bins.append(a)
    else:
        # keep only runs of two or more consecutive zero bins
        if len(temp_bins) > 1:
            sequences_of_zero_bins.append(temp_bins)
        temp_bins = []
if len(temp_bins) > 1:
    sequences_of_zero_bins.append(temp_bins)
print
for a in sequences_of_zero_bins:
print 'zero sequence', a
sequences_of_high_bins = []
temp_bins = []
for a in range(0, len(bin_counts)):
    if bin_counts[a] >= 6:
        temp_bins.append(a)
    else:
        # keep only runs of two or more consecutive high bins
        if len(temp_bins) > 1:
            sequences_of_high_bins.append(temp_bins)
        temp_bins = []
if len(temp_bins) > 1:
    sequences_of_high_bins.append(temp_bins)
print
for a in sequences_of_high_bins:
print 'high sequence', a
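As an aside, the same run-finding can be written more compactly with itertools.groupby; a sketch equivalent to the two loops above:

from itertools import groupby

# Group consecutive bin indices by whether they satisfy the predicate,
# then keep only the qualifying groups of length two or more.
def find_runs(counts, predicate):
    runs = []
    for key, group in groupby(range(len(counts)), lambda a: predicate(counts[a])):
        indices = list(group)
        if key and len(indices) > 1:
            runs.append(indices)
    return runs

# sequences_of_zero_bins = find_runs(bin_counts, lambda c: c == 0)
# sequences_of_high_bins = find_runs(bin_counts, lambda c: c >= 6)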
Print the passages identified in the previous steps, so they can be available for close reading.
original_tokens = [t.text for t in doc]
print
print '************************************'
print 'ZERO SIGHT SEQUENCES'
print '************************************'
for s in sequences_of_zero_bins:
from_a = s[0] * bin_size
to_a = (s[-1] * bin_size) + bin_size
print
print '\t' + '\n\t'.join(textwrap.wrap(' '.join(original_tokens[from_a: to_a]), 100))
print
print '************************************'
print 'HIGH SIGHT SEQUENCES'
print '************************************'
for s in sequences_of_high_bins:
from_a = s[0] * bin_size
to_a = (s[-1] * bin_size) + bin_size
print
print '\t' + '\n\t'.join(textwrap.wrap(' '.join(original_tokens[from_a: to_a]), 100))
Here, I plot the number of sight-related lemma (y axis) against the number of quotation marks (x axis) per bin. I wondered whether the two counts had any relation to each other; whether, for example, passages with lots of sight-related lemma had fewer quotation marks, and vice-versa. In other words, I wondered whether there might be a "see, speak, see, speak . . . " rhythm to the novel.
There seems to be a slight tendency in that direction, if I squint at the plot, although I think I'd have a hard time convincing anyone else. Part of the problem may be that I'm not using the right input data; instead of the plain text, I should be using the XML version, which has dialog and narration separated . . .
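A quick way to put a number on that squint (a sketch; it assumes scipy is available) is a rank correlation between the two per-bin counts:

from scipy.stats import spearmanr

# Spearman rank correlation between per-bin quotation-mark counts and
# per-bin sight-lemma counts; a negative rho would support the
# "see, speak" alternation.
rho, p_value = spearmanr(quotation_mark_counts, bin_counts)
print 'spearman rho', rho, 'p', p_value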
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import unicodecsv as csv
from pylab import rcParams
rcParams['figure.figsize'] = 20,4
import seaborn as sns
sns.set(style="whitegrid")
plt.title('quotation_mark_counts vs sight_lemma_counts')
plt.xlabel('quotation_mark_counts')
plt.ylabel('sight_lemma_counts')
plt.scatter(quotation_mark_counts, bin_counts)
plt.show()
In the previous scatter plot, bins drawn from the plain text were the basis for analysis. Those bins, however, arbitrarily mix dialog and narration. In the XML version of the text, narration and dialog are clearly separated, which should enable me to test what I suspect: that instead of a "see, don't see" rhythm, there's a "see, speak" rhythm.
The results are a couple of cells down in the notebook, after the cell titled "then, do some basic counting".
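For reference, the XPath in the next cell implies paragraph markup roughly like the following reconstruction (the actual tagging may differ): typed <p> elements whose text is wrapped in <dialog> and <narration> children.

from lxml import etree

# A tiny reconstructed example of the assumed markup (the real XML may differ).
sample = etree.fromstring(
    '<p type="mixed">'
    '<narration>I looked at him: </narration>'
    '<dialog>"Do you think me handsome?"</dialog>'
    '</p>')
print sample.get('type'), '|', sample.xpath('descendant::dialog')[0].text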
. . . paragraph-by-paragraph, keeping the counts for dialog and narration separate.
from lxml import etree
# ----------------------------------------------------------
def count_text_content(text):
n_seeing_lemma = 0
n_tokens = 0
seeing_lemma = []
p_doc = nlp(unicode(text))
for t in p_doc:
n_tokens += 1
if t.lemma_ not in ['-PRON-', 'which', 'what', 'who'] and t.pos_ in ['NOUN', 'VERB',]:
is_seeing_lemma = False
word_synsets = Word(t.lemma_).get_synsets(pos=t.pos_[0].lower())
for w in word_synsets:
                for interesting_synset in interesting_synsets:
                    if w.name().startswith(interesting_synset):
                        is_seeing_lemma = True
                    else:
                        # also accept a match anywhere up the hypernym chain
                        for h in w.closure(hyper, depth=10):
                            if h.name().startswith(interesting_synset):
                                is_seeing_lemma = True
if is_seeing_lemma == True:
n_seeing_lemma += 1
seeing_lemma.append(t.lemma_)
return n_seeing_lemma, n_tokens, seeing_lemma
# ----------------------------------------------------------
XML_CORPUS_FOLDER = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction_XML/'
tree = etree.parse(XML_CORPUS_FOLDER + 'Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.xml')
root = tree.getroot()
p_counts_seeing_lemma = []
dialog_seeing_lemma = []
narration_seeing_lemma = []
for p in root.xpath('//p'):
p_counts_seeing_lemma.append({'type': p.get('type'),
'n_seeing_lemma_dialog': 0, 'n_tokens_dialog': 0,
'n_seeing_lemma_narration': 0, 'n_tokens_narration': 0})
for a in p.xpath('descendant::dialog'):
n_seeing_lemma, n_tokens, seeing_lemma = count_text_content(a.text)
p_counts_seeing_lemma[-1]['n_seeing_lemma_dialog'] += n_seeing_lemma
p_counts_seeing_lemma[-1]['n_tokens_dialog'] += n_tokens
dialog_seeing_lemma += seeing_lemma
for a in p.xpath('descendant::narration'):
n_seeing_lemma, n_tokens, seeing_lemma = count_text_content(a.text)
p_counts_seeing_lemma[-1]['n_seeing_lemma_narration'] += n_seeing_lemma
p_counts_seeing_lemma[-1]['n_tokens_narration'] += n_tokens
narration_seeing_lemma += seeing_lemma
print 'Done!'
At a basic level, seeing lemmas are about 2/3 more common in narration (relative frequency 0.0126) than in dialog (0.0075). About twice as many narration passages (38.1%) contain seeing lemmas as dialog passages (18.0%); I'm a little suspicious of this last measurement, since Jane Eyre contains many short "paragraphs" of dialog.
import tabletext
n_dialog_paragraphs = 0
n_narration_paragraphs = 0
n_mixed_paragraphs = 0
n_dialog_passages = 0
n_dialog_passages_with_seeing = 0
n_narration_passages = 0
n_narration_passages_with_seeing = 0
total_seeing_lemma_dialog = 0
total_tokens_dialog = 0
total_seeing_lemma_narration = 0
total_tokens_narration = 0
for p in p_counts_seeing_lemma:
if p['type'] == 'dialog':
n_dialog_paragraphs += 1
if p['type'] == 'narration':
n_narration_paragraphs += 1
if p['type'] == 'mixed':
n_mixed_paragraphs += 1
if p['n_tokens_dialog'] > 0:
n_dialog_passages += 1
if p['n_seeing_lemma_dialog'] > 0:
n_dialog_passages_with_seeing += 1
if p['n_tokens_narration'] > 0:
n_narration_passages += 1
if p['n_seeing_lemma_narration'] > 0:
n_narration_passages_with_seeing += 1
total_seeing_lemma_dialog += p['n_seeing_lemma_dialog']
total_tokens_dialog += p['n_tokens_dialog']
total_seeing_lemma_narration += p['n_seeing_lemma_narration']
total_tokens_narration += p['n_tokens_narration']
results = [
['n_dialog_paragraphs', n_dialog_paragraphs],
['n_narration_paragraphs', n_narration_paragraphs],
['n_mixed_paragraphs', n_mixed_paragraphs],
['', ''],
['n_dialog_passages', n_dialog_passages],
['n_dialog_passages_with_seeing', n_dialog_passages_with_seeing],
['% passages w/seeing -- dialog',
('%.1f' % (float(n_dialog_passages_with_seeing) / float(n_dialog_passages) * 100.0)) + '%'],
['', ''],
['n_narration_passages', n_narration_passages],
['n_narration_passages_with_seeing', n_narration_passages_with_seeing],
['% passages w/seeing -- narration',
('%.1f' % (float(n_narration_passages_with_seeing) / float(n_narration_passages) * 100.0)) + '%'],
['', ''],
['total_seeing_lemma_dialog', total_seeing_lemma_dialog],
['total_tokens_dialog', total_tokens_dialog],
['rel freq seeing -- dialog', '%.4f' % (float(total_seeing_lemma_dialog) / float(total_tokens_dialog))],
['', ''],
['total_seeing_lemma_narration', total_seeing_lemma_narration],
['total_tokens_narration', total_tokens_narration],
    ['rel freq seeing -- narration', '%.4f' % (float(total_seeing_lemma_narration) / float(total_tokens_narration))],
]
print tabletext.to_text(results)
print
print 'len(dialog_seeing_lemma)', len(dialog_seeing_lemma)
print 'len(narration_seeing_lemma)', len(narration_seeing_lemma)
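To check whether the dialog/narration gap could be chance, a chi-square test on the token counts just computed is one option (a sketch; assumes scipy is available):

from scipy.stats import chi2_contingency

# 2x2 table: seeing vs. non-seeing tokens, in dialog vs. narration.
table = [
    [total_seeing_lemma_dialog, total_tokens_dialog - total_seeing_lemma_dialog],
    [total_seeing_lemma_narration, total_tokens_narration - total_seeing_lemma_narration],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print 'chi2', chi2, 'p', p_value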
original_tokens = [t.text for t in doc]
passage_n = 0
for a in range(0, len(bin_counts)):
if bin_counts[a] >= 8:
from_a = a * bin_size
to_a = (a * bin_size) + bin_size
passage_n += 1
print
        print str(passage_n), '(8 or more sight-related words)'
print
print '\t' + '\n\t'.join(textwrap.wrap(' '.join(original_tokens[from_a: to_a]), 100))
original_tokens = [t.text for t in doc]
passage_n = 0
for a in range(0, len(bin_counts)):
    if bin_counts[a] == 7:
from_a = a * bin_size
to_a = (a * bin_size) + bin_size
passage_n += 1
print
print str(passage_n), '(7 sight-related words)'
print
print '\t' + '\n\t'.join(textwrap.wrap(' '.join(original_tokens[from_a: to_a]), 100))