Try to find words which are "clumped" in the Greiffenberg poem, "clumps" being a proxy for progression or word frequency changes. Do some words occur more often in specific parts of the poem?
This process looks only at non-stopwords which occur 25 times or more in the poem (159 words; called words_for_analysis in the code). It processes the poem as a set of shingles instead of as a set of chunks. Shingles are like chunks except that, unlike chunks as we've used them in the past, shingles overlap each other, which produces a smoother set of bar plots than chunks. Any given word can appear in more than one shingle, depending on where the word falls in the poem; the overlap mitigates some of the arbitrary chopping apart of the poem that results from chunks.
The process is set to run with a shingle size of 2100 and a shingle overlap of 400. This produces 11 shingles (numbered 0 through 10) of roughly 18,000 words each. Word counts (not word frequencies) are used in determining variance, finding "clumps", and so on; since the shingles are about the same size, word counts are functionally equivalent to word frequencies.
The process tries to find words_for_analysis which have a high variance in their distribution across shingles. Words with a high variance are words which are "clumped" in the text: for example, they may tend to occur mostly in the middle of the poem, or at the beginning and the end of the poem.
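As a toy illustration of why variance works as a "clumpiness" score, here are two made-up distributions (not taken from the poem) with the same total count; the evenly spread word gets a small variance, the clumped one a large variance:

import numpy as np

# Illustration only: made-up per-shingle counts for two words, 11 shingles each.
even_word    = [10, 11, 9, 10, 10, 11, 9, 10, 10, 11, 9]   # spread evenly across the poem
clumped_word = [40, 30, 2, 1, 0, 1, 0, 1, 2, 15, 18]       # piled up at the beginning and end

print 'even    variance', np.var(even_word)      # ~0.5
print 'clumped variance', np.var(clumped_word)   # ~178, despite the same total count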
Outputs are described in detail in the section "Main process loop" below.
I'm using the standard (i.e., modern) set of NLTK German stopwords. It's not the best set, perhaps, although I don't think it makes a lot of difference. I also added "u" to the list.
import codecs, re, textwrap
from collections import defaultdict, Counter
from nltk.corpus import stopwords
sw = set(stopwords.words('german') + ['u'])
wrapper = textwrap.TextWrapper(width=60)
print 'stopwords:',
print
print '\n' + '\n'.join(textwrap.wrap(' '.join(sorted(list(sw))), 80))
Strip out blank lines, normalize spaces.
poem_lines = []
for l in codecs.open('Siegessäule_corrected.txt', 'r', encoding='utf-8').read().split('\n'):
if l.strip() > '':
poem_lines.append(re.sub('\s+', ' ', l.strip()))
print 'len(poem_lines)', len(poem_lines)
How many times do words occur? On what lines?
Of particular importance is word_lines, which for every word contains a list of the lines on which the word occurs, like:
u'herr': [602, 654, 698, 768, 784, 890, 915, 926, 929, 1194, 1594, 2929, 2993, 4120, 5057, 5063, 5373, 5379, 5931,
6066],
u'dorn': [4801],
u'truge': [1864, 3208, 3772, 4158]
"herr", for example, occurs on lines 602, 654, 698, etc; "dorn" only on line 4801; and "truge" on lines 1864, 3208, etc.
This cell also creates words_for_analysis, a list of non-stopwords which occur 25 times or more in the poem (controlled, and easily changed, by variable LOWER_WORD_LIMIT).
word_counts = defaultdict(int)
n_words = 0
for t in re.split(u'\s+|\.|!|/|’|;|:|\'|-|’|\)|\(|\?|\,', ' '.join(poem_lines).lower()):
if t > '' and t not in sw:
n_words += 1
word_counts[t] += 1
print
print 'n_words', n_words
print 'len(word_counts)', len(word_counts)
word_lines = defaultdict(list)
lines_words = {}
for line_n, line in enumerate(poem_lines):
lines_words[line_n] = []
for t in re.split(u'\s+|\.|!|/|’|;|:|\'|-|’|\)|\(|\?|\,', line.lower()):
if t > '' and t not in sw:
word_lines[t].append(line_n)
lines_words[line_n].append(t)
LOWER_WORD_LIMIT = 25
words_for_analysis = []
for word, lines in word_lines.iteritems():
if len(lines) >= LOWER_WORD_LIMIT and word not in sw:
words_for_analysis.append(word)
print
print 'len(word_lines)', len(word_lines)
print 'len(lines_words)', len(lines_words)
print 'len(words_for_analysis)', len(words_for_analysis)
print
print 'words_for_analysis'
print
print '\n' + '\n'.join(textwrap.wrap(' '.join(sorted(words_for_analysis)), 80))
The next three cells contain functions called from the "main process loop" (see below).
get_shingles breaks the poem into overlapping "shingles" (shingles are like chunks, except that they overlap). Note that shingle_size and shingle_overlap are passed into this routine as parameters, so it's very easy to change them and to run this notebook with different settings. In the code, shingle_overlap is really the step between shingle start lines, so adjacent shingles overlap by shingle_size - shingle_overlap lines. Interestingly enough, if shingle_size == shingle_overlap, then this routine will produce non-overlapping shingles (i.e., "chunks" as we usually have understood them).
graph_word produces the bar plots that appear below.
find_local_maximums locates the "peak" or "peaks" in the bar plots. It works with the shingle_size and shingle_overlap settings which produced the bar plots below; however, this line of code:
window_size = int(len(shingle_scores) * 0.25)
may keep the function from working correctly with other shingle_size and shingle_overlap settings; the problem is the fixed 0.25 factor used to set window_size.
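As a rough illustration (the shingle counts below are made up) of how that factor behaves: window_size shrinks to nothing when there are only a few shingles and grows quite wide when there are many:

# Illustration only: how the fixed 0.25 factor sets the peak-detection window
# for different numbers of shingles.
for n_shingles in [3, 11, 30, 60]:
    print 'n_shingles', n_shingles, 'window_size', int(n_shingles * 0.25)
# n_shingles 3   window_size 0   (every non-zero shingle counts as a peak)
# n_shingles 11  window_size 2
# n_shingles 30  window_size 7
# n_shingles 60  window_size 15  (only very broad peaks survive)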
import itertools
def get_shingles(lines_words, shingle_size, shingle_overlap):
shingles = []
n_shingles = (len(lines_words) / shingle_overlap) + 1
for a in range(0, n_shingles):
shingle_start = (a * shingle_overlap)
shingle_stop = ((a * shingle_overlap) + shingle_size)
shingles.append(list(itertools.chain.from_iterable(lines_words[shingle_start: shingle_stop])))
if shingle_stop >= len(lines_words):
break
return shingles
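A quick usage sketch on made-up data (twelve fake one-word "lines", shingle_size 5, overlap/step 3) shows how adjacent shingles share lines and how the last shingle is truncated:

# Illustration only: twelve fake one-word "lines".
toy_lines_words = [['w%d' % n] for n in range(12)]
for sn, s in enumerate(get_shingles(toy_lines_words, 5, 3)):
    print 'shingle', sn, s
# shingle 0 ['w0', 'w1', 'w2', 'w3', 'w4']
# shingle 1 ['w3', 'w4', 'w5', 'w6', 'w7']
# shingle 2 ['w6', 'w7', 'w8', 'w9', 'w10']
# shingle 3 ['w9', 'w10', 'w11']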
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import unicodecsv as csv
from pylab import rcParams
rcParams['figure.figsize'] = 15, 3
import seaborn as sns
def graph_word(variance, word, n_occurences, shingle_scores, high_score, local_maximums):
print
print word, 'n_occurences', n_occurences, \
'variance', variance, \
'local_maximums', local_maximums
plt.bar(range(len(shingle_scores)), shingle_scores.values(), align='center', color='#98AFC7', alpha=1.0)
plt.title(word)
plt.xlabel('shingle')
plt.ylabel('n words')
plt.ylim(0, high_score)
plt.show()
def find_local_maximums(shingle_scores):
window_size = int(len(shingle_scores) * 0.25)
local_maximums = []
for a in range(0, len(shingle_scores)):
slice_start = a - window_size
slice_end = a + window_size
a_is_local_max = True
for b in range(slice_start, slice_end):
if b != a and b >= 0 and b < len(shingle_scores.values()):
if shingle_scores.values()[b] > shingle_scores.values()[a]:
a_is_local_max = False
if a_is_local_max == True and shingle_scores.values()[a] != 0:
local_maximums.append((a, shingle_scores.values()[a]))
local_maximums = sorted(list(set(local_maximums)))
return local_maximums
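A quick sanity check of find_local_maximums on made-up counts (not from the poem): a word piled up at both ends of the poem should come back with two peaks, one per end:

# Illustration only: fake per-shingle counts for a word clumped at both ends,
# keyed by shingle number the same way plot_results is built in the main loop below.
toy_scores = {}
for sn, count in enumerate([40, 12, 5, 3, 2, 2, 2, 3, 5, 20, 35]):
    toy_scores[sn] = count
print find_local_maximums(toy_scores)
# [(0, 40), (10, 35)]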
This cell, which calls the functions listed in the previous three cells, produces two outputs:
For every word in words_for_analysis (159 words, in this run), output a line of text listing its number of occurrences, its variance (in this run, variance divided by the mean, per the code below; a measure of "clumpiness"), and its local maximums. For example:
gott n_occurences 228 variance 15.6544929059 local_maximums [(0, 69), (10, 130)]
Local maximums are expressed as pairs (shingle_number, number of occurrences). "gott", for example, has two local maximums, one in shingle 0 (69 occurrences; shingle counting starts with zero, not one), and one in shingle 10 (130 occurrences). So "gott" is clumped at the beginning and the end of the poem.
Words are listed in variance ("clumpiness") order, high to low.
This cell contains a lot of commented-out code (the lines prefixed with "#"), where I experiment with different shingle sizes, check the number of words in the resulting shingles, etc.
There's a lot of clumping, and a lot of similar words clumping, in shingles 0 and 10 (i.e., at the beginning and end of the poem). Does the poem begin and end with similar concerns?
There's significant clumping in shingles 4, 5 and 6, although not as much as in 0 and 10. One of these shingles (5, the middle of the poem) shows that "turk", etc. appears there, much as we expected. Interestingly, and unlike 0 and 10, the clumpy words in 4, 5 and 6 differ from one another.
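To check observations like these against the text, a small sketch (using the same arithmetic as get_shingles, and the SHINGLE_SIZE and SHINGLE_OVERLAP values set in the main loop below) maps a shingle number back to the range of poem lines it covers:

# Illustration only: shingle n starts at line n * SHINGLE_OVERLAP and runs for
# SHINGLE_SIZE lines (the last shingle is truncated at the end of the poem).
SHINGLE_SIZE, SHINGLE_OVERLAP = 2100, 400
for shingle_n in [0, 5, 10]:
    start = shingle_n * SHINGLE_OVERLAP
    print 'shingle', shingle_n, 'covers poem lines', start, 'through', start + SHINGLE_SIZE - 1
# shingle 0 covers poem lines 0 through 2099
# shingle 5 covers poem lines 2000 through 4099
# shingle 10 covers poem lines 4000 through 6099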
from gensim import corpora, models
import numpy as np
highest_score = -1.0
#for SHINGLE_SIZE, SHINGLE_OVERLAP, HIGH_SCORE in [[1000, 200, 76], [2000, 400, 125]]:
for SHINGLE_SIZE, SHINGLE_OVERLAP, HIGH_SCORE in [[2100, 400, 135]]:
shingles = get_shingles(lines_words.values(), SHINGLE_SIZE, SHINGLE_OVERLAP)
#for sn, s in enumerate(shingles):
# print 'shingle number', sn, 'number of words', len(s)
#print
print
print '************************************************************'
print 'SHINGLE_SIZE', SHINGLE_SIZE, 'SHINGLE_OVERLAP', SHINGLE_OVERLAP, 'len(shingles)', len(shingles)
print '************************************************************'
dictionary = corpora.Dictionary(shingles)
corpus = [dictionary.doc2bow(doc) for doc in shingles]
#tfidf = models.TfidfModel(corpus)
#corpus_tfidf = tfidf[corpus]
#corpus_tf = []
#for a in range(0, len(corpus)):
# new_row = []
# for b in corpus[a]:
# new_row.append([b[0], float(b[1]) / float(len(shingles[a]))])
# corpus_tf.append(new_row)
doc_word_scores = []
#for doc in corpus_tfidf:
#for doc in corpus_tf:
for doc in corpus:
word_scores = {}
for id, value in doc:
word = dictionary.get(id)
if word in words_for_analysis:
word_scores[word] = value
if value > highest_score:
highest_score = value
doc_word_scores.append(word_scores)
scores_by_variance = []
for word in words_for_analysis:
plot_results = {}
for dn, d in enumerate(doc_word_scores):
plot_results[dn] = 0.0
try:
plot_results[dn] = d[word]
except KeyError:
pass
plot_results_total = 0.0
for v in plot_results.values():
plot_results_total += v
plot_results_scaled = []
for v in plot_results.values():
plot_results_scaled.append(v / plot_results_total)
# COMPUTE VARIANCE USING THE RAW DF SCORES, OR SCALED SCORES?
#scores_by_variance.append([np.var(plot_results_scaled), word, len(word_lines[word]), plot_results])
#scores_by_variance.append([np.var(plot_results.values()), word, len(word_lines[word]), plot_results])
scores_by_variance.append([(np.var(plot_results.values()) / np.mean(plot_results.values())),
word, len(word_lines[word]), plot_results])
all_local_maximums = {}
scores_by_variance.sort(reverse=True)
print
print 'ALL ********************************************************'
#print 'HIGH *******************************************************'
#print 'LOW ********************************************************'
for s in scores_by_variance:
#for s in scores_by_variance[:10]:
#for s in scores_by_variance[-10:]:
local_maximums = find_local_maximums(s[3])
if s[0] > 1.0 and len(local_maximums) <= 3:
for l in local_maximums:
try:
all_local_maximums[l[0]].append([s[1], l[1]])
except KeyError:
all_local_maximums[l[0]] = [[s[1], l[1]]]
graph_word(s[0], s[1], s[2], s[3], HIGH_SCORE, local_maximums)
print
print 'LOCAL MAXIMUMS **********************************************'
print
for shingle_n in sorted(all_local_maximums.keys()):
print 'shingle', shingle_n, 'words:',
for wn, w in enumerate(all_local_maximums[shingle_n]):
if wn == len(all_local_maximums[shingle_n]) - 1:
print w[0] + ' ' + str(w[1])
else:
print w[0] + ' ' + str(w[1]) + ',',
print
print
print 'highest_score', highest_score
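unicodecsv is imported above but never used in this run; if the per-shingle summary turns out to be worth keeping, something like the following sketch (the file name is just a placeholder) would dump all_local_maximums to a CSV file:

# Sketch only: write the per-shingle local-maximum words to CSV.
# 'local_maximums.csv' is a placeholder file name.
with open('local_maximums.csv', 'wb') as f:
    writer = csv.writer(f, encoding='utf-8')
    writer.writerow(['shingle', 'word', 'n_occurrences'])
    for shingle_n in sorted(all_local_maximums.keys()):
        for word, count in all_local_maximums[shingle_n]:
            writer.writerow([shingle_n, word, count])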