Try to find words which are "clumped" in the Greiffenberg poem, "clumps" being a proxy for progression or word frequency changes. Do some words occur more often in specific parts of the poem?
This process looks only at non-stopwords which occur 25 times or more in the poem (159 words; called words_for_analysis in the code). It processes the poem as a set of shingles instead of as a set of chunks. Shingles are like chunks except that, unlike chunks as we've used them in the past, shingles overlap each other, which produces a smoother set of bar plots than chunks. Any given word can appear in more than one shingle, depending on where the word falls in the poem; the overlap mitigates some of the arbitrary chopping apart of the poem that results from chunks.
The process is set to run with a shingle size of 2100 and a shingle overlap of 400. This produces 11 shingles (numbered 0 through 10) of roughly 18,000 words each. Word counts (not word frequencies) are used in determining variance, finding "clumps", and so on; since the shingles are about the same size, word counts are functionally equivalent to word frequencies.
The process tries to find words_for_analysis which have a high variance in their distribution across shingles. Words with a high variance are words which are "clumped" in the text: for example, they may tend to occur mostly in the middle of the poem, or at the beginning and the end of the poem.
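As a toy illustration of why variance works as a "clumpiness" score, here are two made-up distributions (not taken from the poem) with the same total count; the evenly spread word gets a small variance, the clumped one a large variance:

import numpy as np

# Illustration only: made-up per-shingle counts for two words, 11 shingles each.
even_word    = [10, 11, 9, 10, 10, 11, 9, 10, 10, 11, 9]   # spread evenly across the poem
clumped_word = [40, 30, 2, 1, 0, 1, 0, 1, 2, 15, 18]       # piled up at the beginning and end

print 'even    variance', np.var(even_word)      # ~0.5
print 'clumped variance', np.var(clumped_word)   # ~178, despite the same total count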
Outputs are described in detail in the section "Main process loop" below.
I'm using the standard (i.e., modern) set of NLTK German stopwords. It's not the best set, perhaps, although I don't think it makes a lot of difference. I also added "u" to the list.
import codecs, re, textwrap
from collections import defaultdict, Counter
from nltk.corpus import stopwords
sw = set(stopwords.words('german') + ['u'])
wrapper = textwrap.TextWrapper(width=60)
print 'stopwords:',
print
print '\n' + '\n'.join(textwrap.wrap(' '.join(sorted(list(sw))), 80))
Strip out blank lines, normalize spaces.
poem_lines = []
for l in codecs.open('Siegessäule_corrected.txt', 'r', encoding='utf-8').read().split('\n'):
if l.strip() > '':
poem_lines.append(re.sub('\s+', ' ', l.strip()))
print 'len(poem_lines)', len(poem_lines)
How many times do words occur? On what lines?
Of particular importance is word_lines, which for every word contains a list of the lines on which the word occurs, like:
u'herr': [602, 654, 698, 768, 784, 890, 915, 926, 929, 1194, 1594, 2929, 2993, 4120, 5057, 5063, 5373, 5379, 5931,
6066],
u'dorn': [4801],
u'truge': [1864, 3208, 3772, 4158]
"herr", for example, occurs on lines 602, 654, 698, etc; "dorn" only on line 4801; and "truge" on lines 1864, 3208, etc.
This cell also creates words_for_analysis, a list of non-stopwords which occur 25 times or more in the poem (controlled, and easily changed, by variable LOWER_WORD_LIMIT).
word_counts = defaultdict(int)
n_words = 0
for t in re.split(u'\s+|\.|!|/|’|;|:|\'|-|’|\)|\(|\?|\,', ' '.join(poem_lines).lower()):
if t > '' and t not in sw:
n_words += 1
word_counts[t] += 1
print
print 'n_words', n_words
print 'len(word_counts)', len(word_counts)
word_lines = defaultdict(list)
lines_words = {}
for line_n, line in enumerate(poem_lines):
lines_words[line_n] = []
for t in re.split(u'\s+|\.|!|/|’|;|:|\'|-|’|\)|\(|\?|\,', line.lower()):
if t > '' and t not in sw:
word_lines[t].append(line_n)
lines_words[line_n].append(t)
LOWER_WORD_LIMIT = 25
words_for_analysis = []
for word, lines in word_lines.iteritems():
if len(lines) >= LOWER_WORD_LIMIT and word not in sw:
words_for_analysis.append(word)
print
print 'len(word_lines)', len(word_lines)
print 'len(lines_words)', len(lines_words)
print 'len(words_for_analysis)', len(words_for_analysis)
print
print 'words_for_analysis'
print
print '\n' + '\n'.join(textwrap.wrap(' '.join(sorted(words_for_analysis)), 80))
The next three cells contain functions called from the "main process loop" (see below).
get_shingles breaks the poem into overlapping "shingles" (shingles are like chunks, except that they overlap). Note that shingle_size and shingle_overlap are passed into this routine as parameters, so it's very easy to change them and to run this notebook with different settings. In the code, shingle_overlap is really the step between shingle start lines, so adjacent shingles overlap by shingle_size - shingle_overlap lines. Interestingly enough, if shingle_size == shingle_overlap, then this routine will produce non-overlapping shingles (i.e., "chunks" as we usually have understood them).
graph_word produces the bar plots that appear below.
find_local_maximums locates the "peak" or "peaks" in the bar plots. It works with the shingle_size and shingle_overlap settings which produced the bar plots below; however, this line of code:
window_size = int(len(shingle_scores) * 0.25)
may keep the function from working correctly with other shingle_size and shingle_overlap settings; the problem is the fixed 0.25 factor used to set window_size.
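As a rough illustration (the shingle counts below are made up) of how that factor behaves: window_size shrinks to nothing when there are only a few shingles and grows quite wide when there are many:

# Illustration only: how the fixed 0.25 factor sets the peak-detection window
# for different numbers of shingles.
for n_shingles in [3, 11, 30, 60]:
    print 'n_shingles', n_shingles, 'window_size', int(n_shingles * 0.25)
# n_shingles 3   window_size 0   (every non-zero shingle counts as a peak)
# n_shingles 11  window_size 2
# n_shingles 30  window_size 7
# n_shingles 60  window_size 15  (only very broad peaks survive)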
import itertools
def get_shingles(lines_words, shingle_size, shingle_overlap):
shingles = []
n_shingles = (len(lines_words) / shingle_overlap) + 1
for a in range(0, n_shingles):
shingle_start = (a * shingle_overlap)
shingle_stop = ((a * shingle_overlap) + shingle_size)
shingles.append(list(itertools.chain.from_iterable(lines_words[shingle_start: shingle_stop])))
if shingle_stop >= len(lines_words):
break
return shingles
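A quick usage sketch on made-up data (twelve fake one-word "lines", shingle_size 5, overlap/step 3) shows how adjacent shingles share lines and how the last shingle is truncated:

# Illustration only: twelve fake one-word "lines".
toy_lines_words = [['w%d' % n] for n in range(12)]
for sn, s in enumerate(get_shingles(toy_lines_words, 5, 3)):
    print 'shingle', sn, s
# shingle 0 ['w0', 'w1', 'w2', 'w3', 'w4']
# shingle 1 ['w3', 'w4', 'w5', 'w6', 'w7']
# shingle 2 ['w6', 'w7', 'w8', 'w9', 'w10']
# shingle 3 ['w9', 'w10', 'w11']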
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import unicodecsv as csv
from pylab import rcParams
rcParams['figure.figsize'] = 15, 3
import seaborn as sns
def graph_word(variance, word, n_occurences, shingle_scores, high_score, local_maximums):
print
print word, 'n_occurences', n_occurences, \
'variance', variance, \
'local_maximums', local_maximums
plt.bar(range(len(shingle_scores)), shingle_scores.values(), align='center', color='#98AFC7', alpha=1.0)
plt.title(word)
plt.xlabel('shingle')
plt.ylabel('n words')
plt.ylim(0, high_score)
plt.show()
def find_local_maximums(shingle_scores):
window_size = int(len(shingle_scores) * 0.25)
local_maximums = []
for a in range(0, len(shingle_scores)):
slice_start = a - window_size
slice_end = a + window_size
a_is_local_max = True
for b in range(slice_start, slice_end):
if b != a and b >= 0 and b < len(shingle_scores.values()):
if shingle_scores.values()[b] > shingle_scores.values()[a]:
a_is_local_max = False
if a_is_local_max == True and shingle_scores.values()[a] != 0:
local_maximums.append((a, shingle_scores.values()[a]))
local_maximums = sorted(list(set(local_maximums)))
return local_maximums
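A quick sanity check of find_local_maximums on made-up counts (not from the poem): a word piled up at both ends of the poem should come back with two peaks, one per end:

# Illustration only: fake per-shingle counts for a word clumped at both ends,
# keyed by shingle number the same way plot_results is built in the main loop below.
toy_scores = {}
for sn, count in enumerate([40, 12, 5, 3, 2, 2, 2, 3, 5, 20, 35]):
    toy_scores[sn] = count
print find_local_maximums(toy_scores)
# [(0, 40), (10, 35)]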
This cell, which calls the functions listed in the previous three cells, produces two outputs:
For every word in words_for_analysis (159 words, in this run), output a line of text listing its number of occurrences, its variance (in this run, variance divided by the mean, per the code below; a measure of "clumpiness"), and its local maximums. For example:
gott n_occurences 228 variance 15.6544929059 local_maximums [(0, 69), (10, 130)]
Local maximums are expressed as pairs (shingle_number, number of occurrences). "gott", for example, has two local maximums, one in shingle 0 (69 occurrences; shingle counting starts with zero, not one), and one in shingle 10 (130 occurrences). So "gott" is clumped at the beginning and the end of the poem.
Words are listed in variance ("clumpiness") order, high to low.
This cell contains a lot of commented-out code (the lines prefixed with "#"), where I experiment with different shingle sizes, check the number of words in the resulting shingles, etc.
There's a lot of clumping, and a lot of similar words clumping, in shingles 0 and 10 (i.e., at the beginning and end of the poem). Does the poem begin and end with similar concerns?
There's significant clumping in shingles 4, 5 and 6, although not as much as in 0 and 10. One of these shingles (5, the middle of the poem) shows that "turk", etc. appears there, much as we expected. Interestingly, and unlike 0 and 10, the clumpy words in 4, 5 and 6 differ from one another.
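To check observations like these against the text, a small sketch (using the same arithmetic as get_shingles, and the SHINGLE_SIZE and SHINGLE_OVERLAP values set in the main loop below) maps a shingle number back to the range of poem lines it covers:

# Illustration only: shingle n starts at line n * SHINGLE_OVERLAP and runs for
# SHINGLE_SIZE lines (the last shingle is truncated at the end of the poem).
SHINGLE_SIZE, SHINGLE_OVERLAP = 2100, 400
for shingle_n in [0, 5, 10]:
    start = shingle_n * SHINGLE_OVERLAP
    print 'shingle', shingle_n, 'covers poem lines', start, 'through', start + SHINGLE_SIZE - 1
# shingle 0 covers poem lines 0 through 2099
# shingle 5 covers poem lines 2000 through 4099
# shingle 10 covers poem lines 4000 through 6099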
from gensim import corpora, models
import numpy as np
highest_score = -1.0
#for SHINGLE_SIZE, SHINGLE_OVERLAP, HIGH_SCORE in [[1000, 200, 76], [2000, 400, 125]]:
for SHINGLE_SIZE, SHINGLE_OVERLAP, HIGH_SCORE in [[2100, 400, 135]]:
shingles = get_shingles(lines_words.values(), SHINGLE_SIZE, SHINGLE_OVERLAP)
#for sn, s in enumerate(shingles):
# print 'shingle number', sn, 'number of words', len(s)
#print
print
print '************************************************************'
print 'SHINGLE_SIZE', SHINGLE_SIZE, 'SHINGLE_OVERLAP', SHINGLE_OVERLAP, 'len(shingles)', len(shingles)
print '************************************************************'
dictionary = corpora.Dictionary(shingles)
corpus = [dictionary.doc2bow(doc) for doc in shingles]
#tfidf = models.TfidfModel(corpus)
#corpus_tfidf = tfidf[corpus]
#corpus_tf = []
#for a in range(0, len(corpus)):
# new_row = []
# for b in corpus[a]:
# new_row.append([b[0], float(b[1]) / float(len(shingles[a]))])
# corpus_tf.append(new_row)
doc_word_scores = []
#for doc in corpus_tfidf:
#for doc in corpus_tf:
for doc in corpus:
word_scores = {}
for id, value in doc:
word = dictionary.get(id)
if word in words_for_analysis:
word_scores[word] = value
if value > highest_score:
highest_score = value
doc_word_scores.append(word_scores)
scores_by_variance = []
for word in words_for_analysis:
plot_results = {}
for dn, d in enumerate(doc_word_scores):
plot_results[dn] = 0.0
try:
plot_results[dn] = d[word]
except KeyError:
pass
plot_results_total = 0.0
for v in plot_results.values():
plot_results_total += v
plot_results_scaled = []
for v in plot_results.values():
plot_results_scaled.append(v / plot_results_total)
# COMPUTE VARIANCE USING THE RAW DF SCORES, OR SCALED SCORES?
#scores_by_variance.append([np.var(plot_results_scaled), word, len(word_lines[word]), plot_results])
#scores_by_variance.append([np.var(plot_results.values()), word, len(word_lines[word]), plot_results])
scores_by_variance.append([(np.var(plot_results.values()) / np.mean(plot_results.values())),
word, len(word_lines[word]), plot_results])
all_local_maximums = {}
scores_by_variance.sort(reverse=True)
print
print 'ALL ********************************************************'
#print 'HIGH *******************************************************'
#print 'LOW ********************************************************'
for s in scores_by_variance:
#for s in scores_by_variance[:10]:
#for s in scores_by_variance[-10:]:
local_maximums = find_local_maximums(s[3])
if s[0] > 1.0 and len(local_maximums) <= 3:
for l in local_maximums:
try:
all_local_maximums[l[0]].append([s[1], l[1]])
except KeyError:
all_local_maximums[l[0]] = [[s[1], l[1]]]
graph_word(s[0], s[1], s[2], s[3], HIGH_SCORE, local_maximums)
print
print 'LOCAL MAXIMUMS **********************************************'
print
for shingle_n in sorted(all_local_maximums.keys()):
print 'shingle', shingle_n, 'words:',
for wn, w in enumerate(all_local_maximums[shingle_n]):
if wn == len(all_local_maximums[shingle_n]) - 1:
print w[0] + ' ' + str(w[1])
else:
print w[0] + ' ' + str(w[1]) + ',',
print
print
print 'highest_score', highest_score
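unicodecsv is imported above but never used in this run; if the per-shingle summary turns out to be worth keeping, something like the following sketch (the file name is just a placeholder) would dump all_local_maximums to a CSV file:

# Sketch only: write the per-shingle local-maximum words to CSV.
# 'local_maximums.csv' is a placeholder file name.
with open('local_maximums.csv', 'wb') as f:
    writer = csv.writer(f, encoding='utf-8')
    writer.writerow(['shingle', 'word', 'n_occurrences'])
    for shingle_n in sorted(all_local_maximums.keys()):
        for word, count in all_local_maximums[shingle_n]:
            writer.writerow([shingle_n, word, count])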