In what follows, I'm working from two critical/theoretical sources: 1) in her research proposal titled "The Purchase of Romans in a Time of Inequality, 1847-1920" (circulated in the summer of 2016), Professor Tatlock wrote that she wanted to "trace the presence of what I am provisionally labeling the 'Jane-Eyre-Effekt'", and 2) Andrew Piper's paper, "The Werther Effect I: Goethe, Objecthood, and the Handling of Knowledge," to which Professor Tatlock pointed in her research proposal.
Andrew Piper's essay describes a straightforward method. He defines the "Wertherness" of texts as a function of "the relative presence or absence of a set of words within them drawn from Werther." He begins by identifying 91 "most frequent significant words" ("significant" meaning not on a stopword list). He does not lemmatize these words. Using his list of 91 words, he constructs a document-term matrix, then uses the matrix to compute the Euclidean distance between the documents. Texts which are close share a similar distribution of the 91 words.
Piper goes on to visualize his results, although the particular form of his visualization isn't essential: the distances alone are sufficient to communicate the relative "Wertherness" of texts.
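A minimal sketch of that computation, with a made-up word list and made-up counts standing in for the 91 words and the real texts; only the shape of the calculation (a document-term matrix over a fixed word list, then pairwise Euclidean distances) follows Piper's description.
import numpy as np
from scipy.spatial.distance import pdist, squareform
# Hypothetical stand-ins for Piper's 91 most frequent significant Werther words.
werther_words = ['heart', 'soul', 'nature', 'love']
# Hypothetical raw counts of those words in three texts (one row per text).
doc_term_matrix = np.array([
    [120, 80, 60, 150],   # Werther
    [100, 70, 65, 140],   # a text with a similar word distribution
    [  5, 10,  2,   8],   # a distant text
], dtype=float)
# Pairwise Euclidean distances; a smaller distance means more "Wertherness".
distances = squareform(pdist(doc_term_matrix, metric='euclidean'))
print distances[0]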
This notebook follows that method, more or less, although there are some important differences:
Please note that, despite the title of this notebook, I am not actually testing for an "effect"; instead, I'm testing for textual similarities.
Note that I'm not using the current version of spacy, which seems to be buggy on my platform.
import glob, re, codecs, json
from collections import defaultdict, Counter
INPUT_FOLDER = '/home/spenteco/0/muncie_public_library_corpus/PG_no_backmatter_fiction/'
import spacy
nlp = spacy.load('en')
print spacy.__version__
Note that I can pass in several lists to select specific parts of speech, to drop some lemmas, and to include only certain words in the results; if this last list is empty, then all words for the selected part(s) of speech are included.
# Tokenize a text with spaCy and return a list of its lowercased lemmas, keeping
# only the selected parts of speech and (optionally) only the selected words.
def get_document(path_to_file, selected_part_of_speech, lemma_to_drop, words_to_select):
text = codecs.open(path_to_file, 'r', encoding='utf-8').read()
doc = nlp(unicode(text))
document = []
for t in doc:
if t.pos_ in selected_part_of_speech and t.lemma_.lower() not in lemma_to_drop:
if len(words_to_select) == 0 or t.lemma_.lower() in words_to_select:
document.append(t.lemma_.lower())
return document
This code, in addition to producing a nice report, also pulls lists for use in later steps, should we want to restrict our analysis to a set of most frequent words, much as Piper does.
Many of the words which have interested us in the past (eye, face, hand, voice) are among the most common words in JE. I was more than a little surprised to see that "eye" is the most common noun in the novel!
However, I should have been less surprised: in another notebook, I learned that the top 25 words in JE each appear in almost every novel in the corpus. I.e., these are common words not just in JE, but in the corpus.
Note that I do use je_common_nouns in the last cell of the notebook, although I use it only to draw attention to words, and not for any calculation.
It's rather astounding how often words we've been interested in occur in JE. I would not have guessed, for example, that "eye" is the most common (non-proper) noun in the novel, nor that words relating to time appear so high on the list.
def pad_width(s, width):
new_s = s
while len(new_s) < width:
new_s += ' '
return new_s
# Count the lemmas of one part of speech in Jane Eyre, print the n_to_list most
# frequent of them, and return them as a list.
def count_most_frequent_words_in_je(part_of_speech, n_to_list, words_to_select):
je_doc = get_document(INPUT_FOLDER + 'Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt',
[part_of_speech], ['-pron-', 'what', 'who'], words_to_select)
je_counts = defaultdict(int)
for a in je_doc:
je_counts[a] += 1
print
print part_of_speech
print
most_frequent_words = []
n_printed = 0
for w in Counter(je_counts).most_common(n_to_list):
most_frequent_words.append(w[0])
print pad_width(w[0] + ' ' + str(w[1]), 20),
n_printed += 1
if n_printed > 0 and n_printed % 5 == 0:
print
print
return most_frequent_words
# -------------------------------------------------------------
je_common_nouns = count_most_frequent_words_in_je('NOUN', 250, [])
je_common_verbs = count_most_frequent_words_in_je('VERB', 50, [])
je_common_adj = count_most_frequent_words_in_je('ADJ', 25, [])
je_common_adv = count_most_frequent_words_in_je('ADV', 25, [])
A "document" is a list of lemmas selected from a text.
This version selects nouns whether or not they appear in JE (WORDS_TO_SELECT is empty), and it drops pronouns.
Why nouns? I believe that nouns carry most of the semantic content of text; they seem to have more lexical variety than other parts of speech.
Why all nouns? If, as I suspect, the nouns carry most of a text's semantic content, then including all of the nouns of a text is a more accurate reflection of its content than limiting the word list to just those words appearing in JE.
labels = []
documents = []
WORDS_TO_SELECT = []
#WORDS_TO_SELECT = je_common_nouns
for a, path_to_file in enumerate(glob.glob(INPUT_FOLDER + '*.txt')):
if a % 100 == 0:
print 'processing', a
labels.append(path_to_file.split('/')[-1])
documents.append(get_document(path_to_file,
['NOUN'],
['-pron-', 'what', 'who'],
WORDS_TO_SELECT))
f = codecs.open('labels.js', 'w', encoding='utf-8')
f.write(json.dumps(labels))
f.close()
f = codecs.open('documents.js', 'w', encoding='utf-8')
f.write(json.dumps(documents))
f.close()
I'm doing this just so I can restart the notebook, since the previous step takes quite a bit of time to complete.
import codecs, json
f = codecs.open('labels.js', 'r', encoding='utf-8')
labels = json.loads(f.read())
f.close()
f = codecs.open('documents.js', 'r', encoding='utf-8')
documents = json.loads(f.read())
f.close()
I use as much of the gensim machinery as I can; the creation of a dictionary (a mapping between words and word ids), a corpus (a set of word counts for each text), and a MatrixSimilarity index are all boilerplate straight from the gensim tutorials.
The wrinkle is corpus_tf. The gensim tutorials describe creating several kinds of corpora, of which the tf-idf corpus is closest to what we might use here. However, the gensim tf-idf process computes zero tf-idf scores for some words (eye, day, time, hand, etc.) which are very common both in JE and in the corpus as a whole (eye, for example, appears at least once in every text in the corpus). Because their tf-idf scores are zero, these words would not figure in the following distance calculations. Therefore, I create a simple term-frequency corpus, so that every word included in documents gets some sort of score.
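To see the zero-score behavior concretely, here is a toy sketch (three made-up "documents"): gensim's default idf is log2(n_docs / doc_freq), so a word appearing in every document gets an idf of log2(1) = 0 and drops out of the tf-idf vectors.
from gensim import corpora, models
# Three hypothetical documents; 'eye' appears in all of them.
toy_docs = [['eye', 'storm'], ['eye', 'moor'], ['eye', 'garden']]
toy_dictionary = corpora.Dictionary(toy_docs)
toy_corpus = [toy_dictionary.doc2bow(d) for d in toy_docs]
tfidf = models.TfidfModel(toy_corpus)
# 'eye' gets weight 0 and is omitted; only 'storm' is scored for the first document.
print tfidf[toy_corpus[0]]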
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(text) for text in documents]
corpus_tf = []
for a in range(0, len(corpus)):
new_row = []
for b in corpus[a]:
new_row.append([b[0], float(b[1]) / float(len(documents[a]))])
corpus_tf.append(new_row)
index_tf = similarities.MatrixSimilarity(corpus_tf)
The process here is to 1) select a novel, then 2) look up the similarity between it and every other novel in the corpus. In some cases, I list the 10 novels most similar to the selected novel and the 10 novels most distant from it. And I always graph the distribution of similarity scores for the selected novel.
In this scheme, the higher the score, the more similar the novels: a score of 1.0 means the two noun-frequency profiles are identical, while a score of 0.0 means the novels have no selected words in common.
Note that this information hearkens back to the "Manhattan" visualizations we constructed four or five years ago.
Also, please note that this sort of similarity information is usually presented in one of two ways: 1) as a heat map, which I chose not to do because the resulting visualization would be enormous; or 2) as a network of one kind or another (e.g., a hierarchical clustering diagram, or a Voronoi diagram, as Piper does in "The Werther Effect"), which I did not do because such visualizations flatten information about the similarities between novels.
I did experiment with hierarchical clustering (see below, where I discuss my dissatisfaction with it).
The method seems reasonable. I would, for example, expect that any novel by Charlotte Bronte would be similar to other novels by her, and the results bear that out. Similarly, I would have expected that JE would be very different from the Frank in/on a Thing novels, and that's what I see.
The similarity scores from one novel to every other novel are skewed (see the graphs); it feels as if there is a lot of JE-like material in the corpus.
The JE and Gold Elsie relationship is interesting: some 68 novels are more similar to JE than Gold Elsie is; however, only about 10 novels are more similar to Gold Elsie than JE is. This suggests it is possible for Gold Elsie and a hypothetical Novel A to be equally similar to JE while not being particularly close to each other.
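A toy illustration of that last point, with made-up two-word frequency vectors: two novels can be equally similar to JE without being at all similar to each other.
import numpy as np
def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# Hypothetical noun frequencies over a two-word vocabulary.
je = np.array([1.0, 1.0])
novel_a = np.array([1.0, 0.0])
novel_b = np.array([0.0, 1.0])
# Both are about 0.71 similar to JE, but 0.0 similar to each other.
print cosine_sim(novel_a, je), cosine_sim(novel_b, je), cosine_sim(novel_a, novel_b)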
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
plt.rcParams['figure.figsize']=(15,5)
# For the selected novel, print its most similar novels, the ranks of the Marlitt and
# Jane Eyre texts, its least similar novels, summary statistics, and a histogram of scores.
def distances_from_one_novel_to_all_others(novel_label, print_details):
for a in range(0, len(corpus_tf)):
if labels[a].find(novel_label) == -1:
continue
sims = index_tf[corpus_tf[a]]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print
print labels[a].upper()
print
all_sims_scores = []
for b in range(1, len(sims)):
all_sims_scores.append(sims[b][1])
if b < 11 and print_details == True:
print '\t', sims[b][1], labels[sims[b][0]]
if print_details == True:
print
for b in range(1, len(sims)):
if labels[sims[b][0]].find('Marlitt') > -1 or labels[sims[b][0]].find('Jane_Eyre') > -1:
print '\t', sims[b][1], ('(' + str(b) + ')'), labels[sims[b][0]]
reversed_sims = []
for s in sims:
reversed_sims.append((s[1], s[0]))
reversed_sims.sort()
print
for b in range(0, 10):
print '\t', reversed_sims[b][0], labels[reversed_sims[b][1]]
print
print '\tmean', np.mean(all_sims_scores), \
'median', np.median(all_sims_scores), \
'std', np.std(all_sims_scores), \
'plus 1 std', (np.mean(all_sims_scores) + (1 * np.std(all_sims_scores)))
print
ax = sns.distplot(all_sims_scores, bins=100)
ax.set(xlabel='sim score', ylabel='n sims')
plt.show()
# ------------------------------------------------
distances_from_one_novel_to_all_others('Bront_Charlotte_Jane_Eyre', True)
distances_from_one_novel_to_all_others('Marlitt_E_Eugenie_Gold_Elsie', True)
distances_from_one_novel_to_all_others('Marlitt_E_Eugenie_At_the_Councillor', True)
distances_from_one_novel_to_all_others('Marlitt_OMS_Wister', True)
distances_from_one_novel_to_all_others('Malory_Thomas_Sir_King_Arthur', False)
distances_from_one_novel_to_all_others('Castlemon_Harry_Frank_in_the_Woods', False)
I need a document-term matrix for what follows . . .
matrix = []
for a in range(0, len(corpus_tf)):
row = []
for b in range(0, len(dictionary)):
row.append(0.0)
for b in corpus_tf[a]:
row[b[0]] = b[1]
matrix.append(row)
f = codecs.open('matrix.js', 'w', encoding='utf-8')
f.write(json.dumps(matrix))
f.close()
print 'len(matrix)', len(matrix)
print 'len(matrix[0])', len(matrix[0])
. . . so the notebook is restartable at this point.
import codecs, json
f = codecs.open('matrix.js', 'r', encoding='utf-8')
matrix = json.loads(f.read())
f.close()
Much as I did with similarity scores (see above), I compute the distances between a selected novel and the rest of the novels. If the distance between two novels is 0.0, they are identical; the greater the distance, the more unlike they are.
The point here is to be sure that the results are not an artifact of any particular distance- or similarity-calculating scheme.
Not much changes, really. We're getting roughly the same results we saw earlier: the same skewed graphs, except flipped; the same close and distant novels, more or less; the same Marlitt-JE relationships.
The cityblock distance metric does seem to draw JE and Marlitt closer than the other methods do. I use cityblock in what follows, not because it gives the results we like, but because I think it will make it easier to explain word-by-word contributions to the results.
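The reason cityblock lends itself to a word-by-word explanation is that the total distance is simply the sum of per-word absolute differences in frequency, so each word's contribution can be read off directly. A minimal sketch with made-up frequencies:
import numpy as np
from scipy.spatial.distance import cityblock
# Hypothetical relative frequencies for three words in two novels.
vocab = ['eye', 'hand', 'moor']
novel_x = np.array([0.010, 0.004, 0.000])
novel_y = np.array([0.008, 0.004, 0.003])
# One non-negative contribution per word; the contributions sum to the cityblock distance.
contributions = np.abs(novel_x - novel_y)
print zip(vocab, contributions)
print contributions.sum(), cityblock(novel_x, novel_y)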
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import *
import numpy as np
# Repeat the one-to-all comparison for a selected novel using cosine, Euclidean,
# and cityblock distances computed from the document-term matrix.
def examine_scipy_distances(novel_label, print_details):
sns.set(color_codes=True)
plt.rcParams['figure.figsize']=(15,5)
je_a = -1
for a in range(0, len(labels)):
if novel_label in labels[a]:
je_a = a
break
print
print labels[je_a].upper()
print
print 'Cosine'
distances = []
graph_distances = []
for a in range(0, len(matrix)):
distances.append([cosine(matrix[a], matrix[je_a]), labels[a]])
graph_distances.append(cosine(matrix[a], matrix[je_a]))
distances.sort()
if print_details == True:
print
for a in distances[1:11]:
print '\t', a[0], a[1]
print
for position, a in enumerate(distances[1:]):
if a[1].find('Marlitt') > -1 or a[1].find('Jane_Eyre') > -1:
print '\t', a[0], ('(' + str(position) + ')'), a[1]
print
for a in distances[len(distances) - 10:]:
print '\t', a[0], a[1]
print
print '\tmean', np.mean(graph_distances), \
'median', np.median(graph_distances), \
'std', np.std(graph_distances), \
'less 1 std', (np.mean(graph_distances) - (1 * np.std(graph_distances)))
ax = sns.distplot(graph_distances, bins=100)
ax.set(xlabel='COSINE', ylabel='n texts')
plt.show()
print
print 'Euclidean'
distances = []
graph_distances = []
for a in range(0, len(matrix)):
distances.append([euclidean(matrix[a], matrix[je_a]), labels[a]])
graph_distances.append(euclidean(matrix[a], matrix[je_a]))
distances.sort()
if print_details == True:
print
for a in distances[1:11]:
print '\t', a[0], a[1]
print
for position, a in enumerate(distances[1:]):
if a[1].find('Marlitt') > -1 or a[1].find('Jane_Eyre') > -1:
print '\t', a[0], ('(' + str(position) + ')'), a[1]
print
for a in distances[len(distances) - 10:]:
print '\t', a[0], a[1]
print
print '\tmean', np.mean(graph_distances), \
'median', np.median(graph_distances), \
'std', np.std(graph_distances), \
'less 1 std', (np.mean(graph_distances) - (1 * np.std(graph_distances)))
ax = sns.distplot(graph_distances, bins=100)
ax.set(xlabel='EUCLIDEAN', ylabel='n texts')
plt.show()
print
print 'Cityblock'
distances = []
graph_distances = []
for a in range(0, len(matrix)):
distances.append([cityblock(matrix[a], matrix[je_a]), labels[a]])
graph_distances.append(cityblock(matrix[a], matrix[je_a]))
distances.sort()
if print_details == True:
print
for a in distances[1:11]:
print '\t', a[0], a[1]
print
for position, a in enumerate(distances[1:]):
if a[1].find('Marlitt') > -1 or a[1].find('Jane_Eyre') > -1:
print '\t', a[0], ('(' + str(position) + ')'), a[1]
print
for a in distances[len(distances) - 10:]:
print '\t', a[0], a[1]
print
print '\tmean', np.mean(graph_distances), \
'median', np.median(graph_distances), \
'std', np.std(graph_distances), \
'less 1 std', (np.mean(graph_distances) - (1 * np.std(graph_distances)))
ax = sns.distplot(graph_distances, bins=100)
ax.set(xlabel='CITYBLOCK', ylabel='n texts')
plt.show()
# --------------------------------------------------------
# ------------------------------------------------
examine_scipy_distances('Bront_Charlotte_Jane_Eyre', True)
examine_scipy_distances('Marlitt_E_Eugenie_Gold_Elsie', True)
#examine_scipy_distances('Marlitt_E_Eugenie_At_the_Councillor', True)
#examine_scipy_distances('Marlitt_OMS_Wister', True)
#examine_scipy_distances('Malory_Thomas_Sir_King_Arthur', False)
#examine_scipy_distances('Castlemon_Harry_Frank_in_the_Woods', False)
The cophenetic correlation (see https://en.wikipedia.org/wiki/Cophenetic_correlation) "is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points."
I.e., how much does the following dendrogram flatten the information? I want to pick the best (i.e., highest-scoring) linkage type when making the dendrogram.
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage
# Report the cophenetic correlation coefficient for one linkage type.
def find_best_linkage_type(matrix, pdist_matrix, linkage_type):
Z = linkage(matrix, linkage_type)
c, coph_dists = cophenet(Z, pdist_matrix)
print linkage_type, 'cophenet c', c
# ----------------------------------------------------
pdist_matrix = pdist(matrix)
#find_best_linkage_type(matrix, pdist_matrix, 'ward')
#find_best_linkage_type(matrix, pdist_matrix, 'single')
#find_best_linkage_type(matrix, pdist_matrix, 'complete')
#find_best_linkage_type(matrix, pdist_matrix, 'average')
#find_best_linkage_type(matrix, pdist_matrix, 'weighted')
#find_best_linkage_type(matrix, pdist_matrix, 'centroid')
#find_best_linkage_type(matrix, pdist_matrix, 'median')
I include the code for this only because it is generally expected; however, I don't much care for the results, even though I tried it with the linkage types that have higher cophenetic correlations (average, centroid, single, median). Therefore, I don't include the visualizations. They are also quite large, which makes the notebook cumbersome.
Why don't I like the results?
%matplotlib inline
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
import numpy as np
# Draw a dendrogram of the whole corpus for one linkage type.
def draw_cluster_diagram(matrix, row_labels, linkage_type):
print
print '****', linkage_type
print
Z = linkage(matrix, linkage_type)
plt.figure(figsize=(10, 500))
plt.title('ALL WORDS')
plt.xlabel('')
plt.ylabel('distance')
dendrogram(
Z,
orientation='left',
labels=row_labels,
leaf_font_size=20,
color_threshold=0.02
)
plt.show()
# ----------------------------------------------------
new_labels = []
for a in labels:
if 'Jane_Eyre' in a or 'Marlitt' in a:
new_labels.append('****** ' + a)
else:
new_labels.append(a)
#draw_cluster_diagram(matrix, new_labels, 'average')
#draw_cluster_diagram(matrix, new_labels, 'centroid')
#draw_cluster_diagram(matrix, new_labels, 'single')
#draw_cluster_diagram(matrix, new_labels, 'median')
I compute everything-to-everything distances. I use the city block method because I think that it will make it easier for me to later determine which words are contributing to distance.
from scipy.spatial.distance import *
distances = []
for a in range(0, len(matrix) - 1):
if a % 100 == 0:
print 'processing', a
for b in range(a + 1, len(matrix)):
distances.append([cityblock(matrix[a], matrix[b]), a, labels[a], b, labels[b]])
f = codecs.open('distances.js', 'w', encoding='utf-8')
f.write(json.dumps(distances))
f.close()
. . . so I can restart the notebook here, if necessary.
import codecs, json
f = codecs.open('distances.js', 'r', encoding='utf-8')
distances = json.loads(f.read())
f.close()
Useful things accomplished here:
Notice that nothing is very close to Marlitt's Owl's Nest, Rubies, or Schillingscourt. I do not believe that this is in fact so; instead, I think we're observing the result of problems parsing these texts.
At this point, I might question the composition of the corpus. To review, the corpus consists of fiction which circulated in the Muncie Public Library in the 1890s, and for which we were able to find good electronic copies (i.e., from Project Gutenberg).
We could, of course, compose a corpus which draws JE-Marlitt comparisons more sharply. A ridiculous example: a corpus of JE, Marlitt, and the Frank novels, and nothing else. But to do so, even non-ridiculously, would, I think, misrepresent this kind of similarity between JE and the fiction it was read alongside.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
# List every novel whose cityblock distance to the selected novel is at or below
# lower_distance, flagging Jane Eyre and Marlitt titles with asterisks.
def print_very_close(novel_label, lower_distance):
print
print 'VERY CLOSE TO', novel_label.upper()
print
for d in distances:
if d[0] <= lower_distance:
if novel_label in d[2]:
if 'Jane_Eyre' in d[4] or 'Marlitt' in d[4]:
print '\t', '****', d[4]
else:
print '\t', d[4]
if novel_label in d[4]:
if 'Jane_Eyre' in d[2] or 'Marlitt' in d[2]:
print '\t', '****', d[2]
else:
print '\t', d[2]
graph_distances = []
for d in distances:
graph_distances.append(d[0])
sns.set(color_codes=True)
plt.rcParams['figure.figsize']=(15,5)
ax = sns.distplot(graph_distances, bins=100)
ax.set(xlabel='CITYBLOCK', ylabel='n texts')
plt.show()
print
print 'mean', np.mean(graph_distances), \
'median', np.median(graph_distances), \
'std', np.std(graph_distances), \
'less 1 std', (np.mean(graph_distances) - (1 * np.std(graph_distances))), \
'less 2 std', (np.mean(graph_distances) - (2 * np.std(graph_distances)))
n_very_close = 0
very_close_counts = defaultdict(int)
lower_distance = (np.mean(graph_distances) - (2 * np.std(graph_distances)))
very_close_distances = []
for d in distances:
if d[0] <= lower_distance:
n_very_close += 1
very_close_counts[d[2]] += 1
very_close_counts[d[4]] += 1
very_close_distances.append(d)
print
print 'n_very_close', n_very_close
print
for w in Counter(very_close_counts).most_common(25):
if 'Jane_Eyre' in w[0] or 'Marlitt' in w[0]:
print '****', w[0], w[1]
else:
print w[0], w[1]
print_very_close('Bront_Charlotte_Jane_Eyre', lower_distance)
print_very_close('Marlitt_E_Eugenie_At_the_Councillor_s_or_A_Nameless_History_PG_43393_0.txt', lower_distance)
print_very_close('Marlitt_E_Eugenie_Gold_Elsie_PG_42426.txt', lower_distance)
print_very_close('Marlitt_OMS_Wister translation_cleaned_110617.txt', lower_distance)
print_very_close('Marlitt_Wister_Baliff.txt', lower_distance)
print_very_close('Marlitt_Wister_Gisela.txt', lower_distance)
print_very_close('Marlitt_Wister_Little_Moorland_Princess_cleaned_121817.txt', lower_distance)
print_very_close('Marlitt_Wister_Owls.txt', lower_distance)
print_very_close('Marlitt_Wister_Rubies.txt', lower_distance)
print_very_close('Marlitt_Wister_Schillingscourt.txt', lower_distance)
Can I determine which words contribute--or do not contribute--to the distance between two novels?
Here, I simply scatter plot the relative frequencies of words in two novels; the X-axis is the frequency in one novel, and the Y-axis is the frequency in the other. I draw plots for ten novels, each compared to JE (JE is always the X-axis). Four of those novels are very close to JE, two are Marlitt novels, and four are far from JE.
The scatter plots for JE and novels close to JE look much like we would expect them to: most of the points are along a 45 degree slope upward and to the right from the origin; i.e., the plots suggest that a large number of words occur at about the same frequency in the two novels.
The Marlitt novels look much like the other novels which are very close to JE, although, since the Marlitt novels are a little more distant from JE, their plots show a "fuzzier" 45 degree slope upward and to the right.
It's harder to see what's going on in the plots for novels which are distant from JE. The root cause seems to be that these novels have very high-frequency words which do not appear in JE, and which alter the scale of the graph.
Bottom line? These plots suggest that it would be possible to identify word-by-word the causes of the (dis)similarity between two novels.
%matplotlib inline
from scipy.spatial.distance import *
import math
import matplotlib.pyplot as plt
import seaborn as sns
# Scatter plot the word frequencies of two novels against each other (one point per
# word that appears in at least one of the two).
def what_makes_two_novels_close(novel_a, novel_b):
n_a = -1
n_b = -1
for n in range(0, len(labels)):
if novel_a == labels[n]:
n_a = n
if novel_b == labels[n]:
n_b = n
distance = cityblock(matrix[n_a], matrix[n_b])
print
print
print novel_a
print novel_b
print 'distance', distance
sns.set(color_codes=True)
plt.rcParams['figure.figsize']=(15,15)
x = []
y = []
close_frequent_words = []
for w in range(0, len(matrix[n_a])):
if matrix[n_a][w] == 0 and matrix[n_b][w] == 0:
pass
else:
x.append(matrix[n_a][w])
y.append(matrix[n_b][w])
high_dimension = -1
for a in x:
if a > high_dimension:
high_dimension = a
for a in y:
if a > high_dimension:
high_dimension = a
plt.scatter(x, y)
plt.xlim(0, high_dimension + 0.001)
plt.ylim(0, high_dimension + 0.001)
plt.title(novel_a + ' -- ' + novel_b + '\nWord Frequencies (distance ' + str(distance) + ')')
plt.xlabel(novel_a)
plt.ylabel(novel_b)
plt.show()
# ------------------------------------------------------------------------------------
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Bront_Charlotte_Shirley_PG_30486.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Dickens_Charles_Dombey_and_Son_PG_821.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Harland_Marion_Alone_PG_46505.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Gaskell_Elizabeth_Cleghorn_Ruth_PG_4275.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Marlitt_E_Eugenie_Gold_Elsie_PG_42426.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Marlitt_E_Eugenie_At_the_Councillor_s_or_A_Nameless_History_PG_43393_0.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Alcott_Louisa_May_Pratt_Anna_Bronson_Alcott_Comic_Tragedies_Written_by_PG_33986.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Castlemon_Harry_Frank_in_the_Woods_PG_42307_8.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Castlemon_Harry_Frank_on_the_Prairie_PG_42101_0.txt')
what_makes_two_novels_close('Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt', \
'Billings_Josh_Josh_Billings_on_Ice_and_Other_Things_PG_41025.txt')
I create a network from the very close distances; if two novels are very close, then I connect them with an edge. I color Marlitt's nodes in blue, Charlotte Bronte's in red, and Dickens' in green.
The network graph is available at http://talus.artsci.wustl.edu/je_effect_web/distance_network.html.
Use your mouse wheel to zoom in and out of the graph. Hover over a node to see which novel it represents. Click and drag over white space to move the graph in the window.
I also list the 25 novels with the highest network centrality, as well as the 25 novels with the most triangles in the network.
I'm speculating to myself about a "Jane Eyre Effect" along these lines: 1) some sort of general cultural effect (periodicals?) in England causes Bronte and Dickens to produce fiction which, at the level of the noun, is similar; 2) JE (and Dombey?) goes to Germany, where Marlitt and others write fiction influenced by 1; 3) those German works come back to Anglophone countries in translation, where, in combination with "native" English writers influenced by 1, they spark off another round of the "Jane Eyre Effect".
Note to self: There's a novel labeled The_Second_Wife_Wister_corrected.txt, which should be labeled with its original author (Marlitt?).
import json, codecs
import networkx as nx
from networkx.readwrite import json_graph
from networkx.algorithms import *
from collections import defaultdict, Counter
sns.set(color_codes=True)
plt.rcParams['figure.figsize']=(15,15)
nodes_added = []
G=nx.Graph()
for d in very_close_distances:
if d[2] not in nodes_added:
nodes_added.append(d[2])
G.add_node(d[2])
if d[4] not in nodes_added:
nodes_added.append(d[4])
G.add_node(d[4])
G.add_edge(d[2], d[4])
f = codecs.open('je_effect_web/node_link_data.js', 'w', encoding='utf-8')
f.write(json.dumps(json_graph.node_link_data(G)))
f.close()
dc = degree_centrality(G)
print
for w in Counter(dc).most_common(25):
if 'Jane_Eyre' in w[0]:
print '****', w[0], w[1]
else:
print w[0], w[1]
tris = triangles(G)
print
for w in Counter(tris).most_common(25):
if 'Jane_Eyre' in w[0]:
print '****', w[0], w[1]
else:
print w[0], w[1]
Here, I dig into pairs of novels, looking for words which occur at a relatively high frequency but don't especially contribute to the distance between the novels. I.e., I'm looking for words which occur at roughly the same frequency in both novels, and which occur relatively frequently in both.
The output is:
Bront_Charlotte_Jane_Eyre_An_Autobiography_PG_1260.txt
Marlitt_E_Eugenie_Gold_Elsie_PG_42426.txt
matrix distance 0.786523244186
DAY DOOR EYE FACE ROOM TIME WORD
absence action addition address advice affection afresh afternoon age agony
arrangement article back basin beauty being bell black blood blow board body
bond bone boot boudoir brain bride bridle brow . . .
important_common_words_distance 0.0218369461188 (2%)
important_common_words_frequency 0.191273434414 (19%)
I list the two novels I examine, then the distance between them. Next, I list, in upper case, those of the top 25 JE nouns which occur in both novels at more or less the same frequency. Then I list the top 250 words which occur at more or less the same frequency in both novels. Lastly, I list the distance those words contribute to the total distance, and the frequency of those words in the first of the two novels.
At the bottom of the listing, I list the top 100 nouns from the preceding details.
Note that I only print the details for the first 20 or so JE-to-other-novel comparisons (plus any involving Marlitt), although I count the nouns from all of the comparisons, printed or not.
from scipy.spatial.distance import *
import math, textwrap
from collections import defaultdict, Counter
# Find words that occur at roughly the same frequency in both novels (frequency ratio
# between 0.8 and 1.2) and report how much they contribute to the cityblock distance.
def what_makes_two_novels_close_V2(novel_a, novel_b, common_word_count, distance_number):
n_a = -1
n_b = -1
for n in range(0, len(labels)):
if novel_a == labels[n]:
n_a = n
if novel_b == labels[n]:
n_b = n
print_details = True
if distance_number > 20:
if 'Marlitt' in labels[n_a] or 'Marlitt' in labels[n_b]:
pass
else:
print_details = False
if print_details == True:
print
print
print labels[n_a]
print labels[n_b]
print
print '\t', 'matrix distance', cityblock(matrix[n_a], matrix[n_b])
individual_distances = []
for w in range(0, len(matrix[n_a])):
if matrix[n_a][w] == 0 and matrix[n_b][w] == 0:
pass
else:
word_distance = math.fabs(matrix[n_a][w] - matrix[n_b][w])
word_frequency_a = matrix[n_a][w]
word_frequency_b = matrix[n_b][w]
word = dictionary[w]
if word_frequency_b > 0 and word_frequency_a > 0:
word_ratio = (word_frequency_a / word_frequency_b)
if word_ratio >= 0.8 and word_ratio <= 1.2:
individual_distances.append([word_frequency_a, word_frequency_b, word_distance, word])
individual_distances.sort(reverse=True)
important_common_words = []
important_common_je_words = []
important_common_words_distance = 0.0
important_common_words_frequency = 0.0
for i in individual_distances[:250]:
if i[3] in je_common_nouns[:25]:
important_common_je_words.append(i[3])
#common_word_count[i[3].upper()] += 1
important_common_words.append(i[3])
important_common_words_distance += i[2]
important_common_words_frequency += i[0]
common_word_count[i[3]] += 1
important_common_je_words.sort()
important_common_words.sort()
if print_details == True:
print
print '\t', '\n\t'.join(textwrap.wrap(' '.join(important_common_je_words).upper(), 80))
print
print '\t', '\n\t'.join(textwrap.wrap(' '.join(important_common_words), 80))
print
print '\t', 'important_common_words_distance', important_common_words_distance, \
('(' + str(int(important_common_words_distance / cityblock(matrix[n_a], matrix[n_b]) * 100)) + '%)')
print '\t', 'important_common_words_frequency', important_common_words_frequency, \
('(' + str(int(important_common_words_frequency * 100)) + '%)')
word_by_word_distance = 0
for a in individual_distances:
word_by_word_distance += a[0]
#print
#print '\t', 'word_by_word_distance', word_by_word_distance
# --------------------------------------------------------------------------
print 'je_common_nouns[:25]', je_common_nouns[:25]
print
print '**********************************************************************************'
print 'SELECTED NOVEL PAIRS'
print '**********************************************************************************'
common_word_count = defaultdict(int)
for d in very_close_distances:
if 'Jane_Eyre' in d[2] and 'Gold_Elsie' in d[4]:
what_makes_two_novels_close_V2(d[2], d[4], common_word_count, -1)
if 'Gold_Elsie' in d[2] and 'Jane_Eyre' in d[4]:
what_makes_two_novels_close_V2(d[2], d[4], common_word_count, -1)
for d in very_close_distances:
if 'Dombey' in d[2] and 'Gold_Elsie' in d[4]:
what_makes_two_novels_close_V2(d[2], d[4], common_word_count, -1)
if 'Gold_Elsie' in d[2] and 'Dombey' in d[4]:
what_makes_two_novels_close_V2(d[2], d[4], common_word_count, -1)
for d in very_close_distances:
if 'Jane_Eyre' in d[2] and 'Dombey' in d[4]:
what_makes_two_novels_close_V2(d[2], d[4], common_word_count, -1)
if 'Dombey' in d[2] and 'Jane_Eyre' in d[4]:
what_makes_two_novels_close_V2(d[2], d[4], common_word_count, -1)
print
print '**********************************************************************************'
print 'JANE EYRE AND EVERYTHING CLOSE TO IT'
print '**********************************************************************************'
n_je_distances = 0
common_word_count = defaultdict(int)
for d in very_close_distances:
if 'Jane_Eyre' in d[2] or 'Jane_Eyre' in d[4]:
n_je_distances += 1
what_makes_two_novels_close_V2(d[2], d[4], common_word_count, n_je_distances)
print
print
print 'n_je_distances', n_je_distances
print
for w in Counter(common_word_count).most_common(100):
print w[0], w[1]