Try to find words that are "clumped" in the Greiffenberg poem, "clumps" being a proxy for progression or changes in word frequency. Do some words occur more often in specific parts of the poem?
import codecs, re
from collections import defaultdict
from nltk.corpus import stopwords
sw = set(stopwords.words('german'))
Strip out blank lines, normalize spaces.
poem_lines = []
for l in codecs.open('Greiffenberg_line endings.txt', 'r', encoding='utf-8').read().split('\n'):
    if l.strip():  # skip blank lines
        poem_lines.append(re.sub(r'\s+', ' ', l.strip()))
print 'len(poem_lines)', len(poem_lines)
How many times do words occur? On what lines?
Of particular importance is word_lines, which contains, for every word that occurs 10 or more times, a list of the lines on which the word occurs.
# Splitting on these separators keeps them as tokens (because of the capture
# group); PUNCTUATION is the set of separator tokens to discard afterwards.
SEPARATORS = re.compile(u'(\s+|\.|!|/|’|;|:|\'|-|\)|\(|\?)')
PUNCTUATION = set([' ', '.', '!', '/', u'’', ';', ':', '-', '(', ')', '?', '\''])

word_counts = defaultdict(int)
n_words = 0
for t in SEPARATORS.split(' '.join(poem_lines).lower()):
    if t and t not in PUNCTUATION:
        n_words += 1
        word_counts[t] += 1
print
print 'n_words', n_words
print 'len(word_counts)', len(word_counts)
word_lines = defaultdict(list)
lines_words = defaultdict(list)
for line_n, line in enumerate(poem_lines):
    # Reuse SEPARATORS and PUNCTUATION from the cell above.
    for t in SEPARATORS.split(line.lower()):
        if t and t not in PUNCTUATION:
            if t not in sw and word_counts[t] >= 10:
                word_lines[t].append(line_n)
                lines_words[line_n].append(t)
print
print 'len(word_lines)', len(word_lines)
print 'len(lines_words)', len(lines_words)
From the previous step, we have word_lines, a dictionary that maps each word to a list of the lines on which that word occurs. For example:
{u'all': [68, 408, 512, 570, 571, 572, 584, 599, 713, 737, 1060, 1074, 1120, 1162, 1322, 1326, 1401, 1560, 1660, 1676, 2294, 2547, 2560, 2599, 3084, 3228, 3587, 4350, 4357, 4478, 4523, 4626, 4732, 4746, 4911, 5013, 5046, 5084, 5091, 5108, 5199, 5268, 5285, 5475, 5674, 5726, 5997],
u'ganz': [9, 348, 377, 549, 610, 645, 746, 778, 779, 790, 855, 981, 1165, 1174, 1213, 1227, 1238, 1395, 1586, 1708, 1712, 1760, 1859, 1862, 1864, 1888, 1891, 1896, 1901, 1943, 1951, 2014, 2048, 2051, 2054, 2096, 2115, 2117, 2128, 2287, 2416, 2433, 2462, 2613, 2643, 2739, 2814, 2829, 2863, 3062, 3139, 3230, 3282, 3521, 3793, 3887, 3959, 4059, 4090, 4190, 4280, 4359, 4471, 4535, 4650, 4665, 4819, 4977, 5105, 5269, 5659, 5725, 5742, 5813] . . .
where "all" occurs in lines 68, 408, 512, 570, etc, "gans" occurs in lines 9, 348, 377, 549, etc.
word_lines contains only those words which occur 10 or more times in the poem.
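As a quick sanity check on this structure, we can look up one of the keys shown above and print a few of the lines on which it occurs (a minimal sketch; u'all' is just a key we know exists from the sample):

# Eyeball the contexts of u'all': print the first few lines it occurs on.
for line_n in word_lines[u'all'][:5]:
    print line_n, poem_lines[line_n]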
This process starts immediately after the "THE ACTUAL BUSINESS OF LOOPING OVER THE WORDS" comment. First, we compute the bins used for plotting the graphs (graph_bins); graph bins are 50 lines wide. Then, for each word, we count its occurrences in a second, coarser set of bins (variance_bin_counts), scale those counts by the word's total number of occurrences (scaled_variance_bin_counts), and compute the variance of the scaled counts.
A couple of things to note here. First, there are far more bins used for graphing (about 120) than for computing the variance (6). Why? I wanted to show more detail in the graphs while having a simple-to-understand (and easy-to-debug) method for computing the variance.
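To make the two binnings concrete, here is how a single line number maps into each scheme (a sketch; 50 is CHUNK_SIZE and 1012 is the hard-coded variance-bin width from the loop below):

line_n = 2750
print line_n / 50    # graph bin 55: graph bins are 50 lines wide
print line_n / 1012  # variance bin 2: variance bins are ~1012 lines wide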
Second, variance is a number which indicates how (un)evenly a word is distributed across the 6 variance bins. If a word has a relatively high variance, then it occurs unevenly in the poem; for example, it might tend to occur mostly in the middle of the poem, or at the beginning and the end. On the other hand, if a word has a relatively low variance, then it is more or less evenly sprinkled across the poem.
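For intuition, compare the variance of an evenly spread word against a clumped one (a toy example; the scaled bin counts below are made up, not taken from the poem):

import numpy as np
even = [1 / 6.0] * 6                      # spread evenly across all 6 bins
clumped = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]  # concentrated in the middle bins
print np.var(even)     # 0.0
print np.var(clumped)  # ~0.056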
Once I've computed variances for all the words in word_lines, I sort them from high variance to low variance (i.e., from high "clumpiness" to low "clumpiness"), then graph them and write them to a CSV.
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import unicodecsv as csv
from pylab import rcParams
rcParams['figure.figsize'] = 15, 3
import seaborn as sns
# --------------------------------------------------------------------------
# A FUNCTION TO OUTPUT ONE WORD'S GRAPH AND CSV LINE.
# --------------------------------------------------------------------------
def report_word(v, graph_bins, csv_writer):
    # v is [variance, word, bin_counts, scaled_bin_counts, n_occurrences].
    variance, word, bin_counts, scaled_bin_counts, n_occurrences = v
    print
    print word, 'n_occurrences', n_occurrences, 'bin_counts', bin_counts, 'variance', variance
    if csv_writer is not None:
        csv_writer.writerow((word, n_occurrences, variance,
                             bin_counts[0], bin_counts[1], bin_counts[2],
                             bin_counts[3], bin_counts[4], bin_counts[5]))
    plt.hist(word_lines[word], bins=graph_bins)
    plt.title(word + ' (variance ' + str(variance) + ')')
    plt.xlabel('line number')
    plt.ylabel('n occurrences')
    plt.ylim(0, 10)
    plt.xlim(0, 6100)
    plt.show()
# --------------------------------------------------------------------------
# THE ACTUAL BUSINESS OF LOOPING OVER THE WORDS.
# --------------------------------------------------------------------------
CHUNK_SIZE = 50
N_CHUNKS = (len(poem_lines) / CHUNK_SIZE) + 1
graph_bins = [a * CHUNK_SIZE for a in range(N_CHUNKS)]

variances_words = []
for word in sorted(word_lines.keys()):
    # Count the word's occurrences in 6 coarse bins, each 1012 lines wide;
    # anything past the sixth bin is clamped into it.
    variance_bin_counts = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
    for n in word_lines[word]:
        variance_bin_n = min(n / 1012, 5)
        variance_bin_counts[variance_bin_n] += 1
    # Scale the bin counts by the word's total number of occurrences.
    scaled_variance_bin_counts = {}
    for k in variance_bin_counts.keys():
        scaled_variance_bin_counts[k] = float(variance_bin_counts[k]) / float(len(word_lines[word]))
    variances_words.append([np.var(scaled_variance_bin_counts.values()), word,
                            variance_bin_counts, scaled_variance_bin_counts, len(word_lines[word])])
variances_words.sort(reverse=True)
f = open('word_bin_variances.csv', 'wb')
w = csv.writer(f, encoding='utf-8')
w.writerow(('word', 'n_occurrences', 'variance',
            'bin_counts[0]', 'bin_counts[1]', 'bin_counts[2]',
            'bin_counts[3]', 'bin_counts[4]', 'bin_counts[5]'))
for v in variances_words:
    report_word(v, graph_bins, w)
f.close()