Try to find words that are "clumped" in the Greiffenberg poem, "clumps" being a proxy for progression or changes in word frequency. Do some words occur more often in specific parts of the poem?
import codecs, re
from collections import defaultdict
from nltk.corpus import stopwords
sw = set(stopwords.words('german'))
Strip out blank lines, normalize spaces.
poem_lines = []
for l in codecs.open('Greiffenberg_line endings.txt', 'r', encoding='utf-8').read().split('\n'):
    if l.strip():  # skip blank lines
        poem_lines.append(re.sub(r'\s+', ' ', l.strip()))
print 'len(poem_lines)', len(poem_lines)
How many times do words occur? On what lines?
Of particular importance is word_lines, which contains, for every word that occurs 10 or more times, a list of the lines on which the word occurs.
# Splitting on these separators keeps them as tokens (because of the capture
# group); PUNCTUATION is the set of separator tokens to discard afterwards.
SEPARATORS = re.compile(u'(\s+|\.|!|/|’|;|:|\'|-|\)|\(|\?)')
PUNCTUATION = set([' ', '.', '!', '/', u'’', ';', ':', '-', '(', ')', '?', '\''])

word_counts = defaultdict(int)
n_words = 0
for t in SEPARATORS.split(' '.join(poem_lines).lower()):
    if t and t not in PUNCTUATION:
        n_words += 1
        word_counts[t] += 1
print
print 'n_words', n_words
print 'len(word_counts)', len(word_counts)
word_lines = defaultdict(list)
lines_words = defaultdict(list)
for line_n, line in enumerate(poem_lines):
    # Reuse SEPARATORS and PUNCTUATION from the cell above.
    for t in SEPARATORS.split(line.lower()):
        if t and t not in PUNCTUATION:
            if t not in sw and word_counts[t] >= 10:
                word_lines[t].append(line_n)
                lines_words[line_n].append(t)
print
print 'len(word_lines)', len(word_lines)
print 'len(lines_words)', len(lines_words)
From the previous step, we have word_lines, a dictionary that maps each word to a list of the lines on which that word occurs. For example:
{u'all': [68, 408, 512, 570, 571, 572, 584, 599, 713, 737, 1060, 1074, 1120, 1162, 1322, 1326, 1401, 1560, 1660, 1676, 2294, 2547, 2560, 2599, 3084, 3228, 3587, 4350, 4357, 4478, 4523, 4626, 4732, 4746, 4911, 5013, 5046, 5084, 5091, 5108, 5199, 5268, 5285, 5475, 5674, 5726, 5997],
u'ganz': [9, 348, 377, 549, 610, 645, 746, 778, 779, 790, 855, 981, 1165, 1174, 1213, 1227, 1238, 1395, 1586, 1708, 1712, 1760, 1859, 1862, 1864, 1888, 1891, 1896, 1901, 1943, 1951, 2014, 2048, 2051, 2054, 2096, 2115, 2117, 2128, 2287, 2416, 2433, 2462, 2613, 2643, 2739, 2814, 2829, 2863, 3062, 3139, 3230, 3282, 3521, 3793, 3887, 3959, 4059, 4090, 4190, 4280, 4359, 4471, 4535, 4650, 4665, 4819, 4977, 5105, 5269, 5659, 5725, 5742, 5813] . . .
where "all" occurs in lines 68, 408, 512, 570, etc, "gans" occurs in lines 9, 348, 377, 549, etc.
word_lines contains only those words which occur 10 or more times in the poem.
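As a quick sanity check on this structure, we can look up one of the keys shown above and print a few of the lines on which it occurs (a minimal sketch; u'all' is just a key we know exists from the sample):

# Eyeball the contexts of u'all': print the first few lines it occurs on.
for line_n in word_lines[u'all'][:5]:
    print line_n, poem_lines[line_n]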
This process starts immediately after the "THE ACTUAL BUSINESS OF LOOPING OVER THE WORDS" comment. First, we compute the bins used for plotting the graphs (graph_bins); graph bins are 50 lines wide. Then, for each word, we count its occurrences in a second, coarser set of bins (variance_bin_counts), scale those counts by the word's total number of occurrences (scaled_variance_bin_counts), and compute the variance of the scaled counts.
A couple of things to note here. First, there are far more bins used for graphing (about 120) than for computing the variance (6). Why? I wanted to show more detail in the graphs while having a simple-to-understand (and easy-to-debug) method for computing the variance.
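To make the two binnings concrete, here is how a single line number maps into each scheme (a sketch; 50 is CHUNK_SIZE and 1012 is the hard-coded variance-bin width from the loop below):

line_n = 2750
print line_n / 50    # graph bin 55: graph bins are 50 lines wide
print line_n / 1012  # variance bin 2: variance bins are ~1012 lines wide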
Second, variance is a number which indicates how (un)evenly a word is distributed across the 6 variance bins. If a word has a relatively high variance, then it occurs unevenly in the poem; for example, it might tend to occur mostly in the middle of the poem, or at the beginning and the end. On the other hand, if a word has a relatively low variance, then it is more or less evenly sprinkled across the poem.
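For intuition, compare the variance of an evenly spread word against a clumped one (a toy example; the scaled bin counts below are made up, not taken from the poem):

import numpy as np
even = [1 / 6.0] * 6                      # spread evenly across all 6 bins
clumped = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]  # concentrated in the middle bins
print np.var(even)     # 0.0
print np.var(clumped)  # ~0.056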
Once I've computed variances for all the words in word_lines, I sort them from high variance to low variance (i.e., from high "clumpiness" to low "clumpiness"), then graph them and write them to a CSV.
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import unicodecsv as csv
from pylab import rcParams
rcParams['figure.figsize'] = 15, 3
import seaborn as sns
# --------------------------------------------------------------------------
# A FUNCTION TO OUTPUT ONE WORD'S GRAPH AND CSV LINE.
# --------------------------------------------------------------------------
def report_word(v, graph_bins, csv_writer):
    # v is [variance, word, bin_counts, scaled_bin_counts, n_occurrences].
    variance, word, bin_counts, scaled_bin_counts, n_occurrences = v
    print
    print word, 'n_occurrences', n_occurrences, 'bin_counts', bin_counts, 'variance', variance
    if csv_writer is not None:
        csv_writer.writerow((word, n_occurrences, variance,
                             bin_counts[0], bin_counts[1], bin_counts[2],
                             bin_counts[3], bin_counts[4], bin_counts[5]))
    plt.hist(word_lines[word], bins=graph_bins)
    plt.title(word + ' (variance ' + str(variance) + ')')
    plt.xlabel('line number')
    plt.ylabel('n occurrences')
    plt.ylim(0, 10)
    plt.xlim(0, 6100)
    plt.show()
# --------------------------------------------------------------------------
# THE ACTUAL BUSINESS OF LOOPING OVER THE WORDS.
# --------------------------------------------------------------------------
CHUNK_SIZE = 50
N_CHUNKS = (len(poem_lines) / CHUNK_SIZE) + 1
graph_bins = [a * CHUNK_SIZE for a in range(N_CHUNKS)]

variances_words = []
for word in sorted(word_lines.keys()):
    # Count the word's occurrences in 6 coarse bins, each 1012 lines wide;
    # anything past the sixth bin is clamped into it.
    variance_bin_counts = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
    for n in word_lines[word]:
        variance_bin_n = min(n / 1012, 5)
        variance_bin_counts[variance_bin_n] += 1
    # Scale the bin counts by the word's total number of occurrences.
    scaled_variance_bin_counts = {}
    for k in variance_bin_counts.keys():
        scaled_variance_bin_counts[k] = float(variance_bin_counts[k]) / float(len(word_lines[word]))
    variances_words.append([np.var(scaled_variance_bin_counts.values()), word,
                            variance_bin_counts, scaled_variance_bin_counts, len(word_lines[word])])
variances_words.sort(reverse=True)
f = open('word_bin_variances.csv', 'wb')
w = csv.writer(f, encoding='utf-8')
w.writerow(('word', 'n_occurrences', 'variance',
            'bin_counts[0]', 'bin_counts[1]', 'bin_counts[2]',
            'bin_counts[3]', 'bin_counts[4]', 'bin_counts[5]'))
for v in variances_words:
    report_word(v, graph_bins, w)
f.close()