Average sentence lengths

Note that I consider any sentence length of 1 to be bogus, and do not consider it when computing a text's average sentence length. Such one-token "sentences" are a consequence of how the spacy sentence splitting worked when I part-of-speech tagged the Chicago corpus.

The Chicago corpus contains 14,097 such "sentences", or 0.3% of 4,618,493 total sentences. The net result is that, even after dropping the one-token "sentences", I'm still overcounting sentences in the Chicago corpus by ~0.3%, and the sentence-length averages are almost certainly somewhat shorter than they are in reality. I.e., these one-token sentences mark places where, if we were doing this by hand, we would have combined the sentences on either side into one.
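A toy illustration of that bias (made-up token counts, not corpus data): a spurious split turns one 20-token sentence into a 1-token fragment plus a 19-token remainder, and dropping the fragment still leaves the remainder counted shorter than the true sentence.

```python
# Hypothetical numbers for illustration only.
true_sentences = [20, 15, 25]          # token counts if split by hand
after_bad_split = [1, 19, 15, 25]      # same text after one spurious split
kept = [n for n in after_bad_split if n > 1]   # drop one-token "sentences"

true_mean = sum(true_sentences) / float(len(true_sentences))
kept_mean = sum(kept) / float(len(kept))

# kept_mean comes out slightly shorter than true_mean
print(true_mean, kept_mean)
```

So even after filtering, the averages are biased slightly downward, as described above.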

Note that the sentence lengths for Kafka do not depend on spacy's sentence splitting; for Kafka, we determine sentence boundaries, and thus sentence lengths, using our hand-curated, one-sentence-per-line data.

Also, please note that toward the end of this notebook, I run some of the texts through TextBlob and Stanford CoreNLP and derive average sentence lengths for comparison.

In [32]:
import glob, codecs, re, numpy

BASELINE_CORPUS_FOLDER = 'chicago_pos/'
KAFKA_CORPUS_FOLDER = 'kafka_pos/'

paths_to_files = glob.glob(BASELINE_CORPUS_FOLDER + '*.txt') + \
                    glob.glob(KAFKA_CORPUS_FOLDER + '*.txt')

mean_sentence_lengths = []
median_sentence_lengths = []

for p in paths_to_files:
    
    sentence_lengths = []
    
    text = codecs.open(p, 'r', encoding='utf-8').read()
    sentences = text.split('\n')
    for s in sentences:
        tokens = re.split(r'\s+', s.strip())
        # skip blank lines and one-token "sentences"
        if s.strip() != '' and len(tokens) > 1:
            sentence_lengths.append(len(tokens))
    
    mean_sentence_lengths.append(numpy.mean(sentence_lengths))
    median_sentence_lengths.append(numpy.median(sentence_lengths))
    
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from pylab import rcParams
rcParams['figure.figsize'] = 10, 6

sns.set_style("whitegrid")

kafka_values = []
non_kafka_values = []
for pn, p in enumerate(paths_to_files):
    if p.find('kafka') > -1:
        kafka_values.append(mean_sentence_lengths[pn])
    else:
        non_kafka_values.append(mean_sentence_lengths[pn])
                
n, bins, patches = plt.hist(non_kafka_values, bins=40, facecolor='#809DBA', alpha=0.5)

for v in kafka_values:
    plt.axvline(v, color='#DFA11C', linestyle='solid', linewidth=1)

plt.title('MEAN SENTENCE LENGTH')
plt.xlabel('mean sentence length')
plt.ylabel('n texts')
plt.xlim(xmax=40)

plt.show()

kafka_values = []
non_kafka_values = []
for pn, p in enumerate(paths_to_files):
    if p.find('kafka') > -1:
        kafka_values.append(median_sentence_lengths[pn])
    else:
        non_kafka_values.append(median_sentence_lengths[pn])
                
n, bins, patches = plt.hist(non_kafka_values, bins=40, facecolor='#809DBA', alpha=0.5)

for v in kafka_values:
    plt.axvline(v, color='#DFA11C', linestyle='solid', linewidth=1)

plt.title('MEDIAN SENTENCE LENGTH')
plt.xlabel('median sentence length')
plt.ylabel('n texts')
plt.xlim(xmax=40)

plt.show()

Can that be right? Are the Kafka translations' sentences really so long?

This is the code that pointed out the problem with 1-token "sentences".

In [48]:
paths_to_files = [p for p in glob.glob(BASELINE_CORPUS_FOLDER + '*.txt')[:25] + \
                    glob.glob(KAFKA_CORPUS_FOLDER + '*.txt') if p.find('/deu_') == -1]

print
print 'FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE'
print 

for p in paths_to_files:
    
    file_name = p.split('/')[-1]
    
    sentence_lengths = []
    
    text = codecs.open(p, 'r', encoding='utf-8').read()
    sentences = text.split('\n')
    for s in sentences:
        # here we keep the one-token "sentences", to show the problem
        if s.strip() != '':
            sentence_lengths.append(len(re.split(r'\s+', s.strip())))
            
    print file_name, len(sentences), numpy.mean(sentence_lengths), numpy.amin(sentence_lengths), numpy.amax(sentence_lengths)
FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE

00010587.txt 6636 13.331424265259985 1 116
00010836.txt 1497 19.924465240641712 1 108
00010336.txt 2707 22.511456023651146 1 310
00010686.txt 7640 13.060479120303704 1 74
00010418.txt 8365 16.947154471544714 1 130
00010362.txt 8033 14.50996015936255 1 75
00010416.txt 7733 13.85657009829281 1 165
00010574.txt 10643 15.718286036459313 1 82
00010927.txt 6391 12.23849765258216 1 61
00010716.txt 3805 14.40720294426919 1 74
00010658.txt 5927 13.164360445494431 1 88
00010583.txt 7705 19.76414849428868 1 301
00010604.txt 6855 12.963524948934928 1 101
00010883.txt 6861 12.870408163265306 1 88
00010957.txt 19246 11.820784619381657 1 100
00010725.txt 2411 16.06058091286307 1 111
00011011.txt 5592 12.03416204614559 1 106
00010747.txt 5856 17.407173356105893 1 332
00010346.txt 5198 23.767558206657686 1 630
00010442.txt 6341 17.060094637223976 1 88
00010808.txt 5574 10.607213350080746 1 108
00010767.txt 5704 14.408907592495177 1 92
00010472.txt 12911 13.680480247869868 1 113
00010673.txt 3022 12.203905991393578 1 75
00010527.txt 7492 12.42410893071686 1 94
eng_Johnston_1999_Metamorphosis_text_lined_corrected.txt 917 27.292576419213972 2 155
eng_Bernofsky_2014_Metamorphosis_text_lined_corrected.txt 707 35.34985835694051 2 183
eng_Underwood_1981_Metamorphosis_text_lined_corrected.txt 706 34.81418439716312 3 176
eng_Pasley_1992_Transformation_text_lined_corrected.txt 701 35.854285714285716 2 181
eng_Corngold_1972_Metamorphosis_text_lined_corrected.txt 714 34.42917251051893 2 183
eng_Freed_1996_Metamorphosis_text_lined_corrected.txt 709 31.39265536723164 2 151
eng_Applebaum_1993_Metamorphosis_text_lined_corrected.txt 714 34.78260869565217 2 162
eng_Lloyd_1937_Metamorphosis_text_lined_corrected.txt 748 31.7429718875502 2 144
eng_Crick_2009_Metamorphosis_lined_corrected.txt 749 33.16577540106952 2 189
eng_Neugroschel_1993_Metamorphisis_text_lined_corrected.txt 818 28.626682986536107 2 153
eng_Muir_1948_Metamorphosis_text_lined_corrected.txt 696 34.50935251798561 2 193
eng_Hofmann_2006_Metamorphosis_text_lined_corrected.txt 718 34.3207810320781 2 180

Double checking the numbers . . .

. . . by going back to the full-text versions and running the text through another NLP package (TextBlob). Note that the numbers don't match the spacy numbers; in some cases, they're off by quite a bit. I wouldn't spend a lot of time reconciling these numbers with the numbers above, since I consider TextBlob to be a quick-and-dirty tool, now largely replaced by spacy. I include it here mostly to get another perspective.

Also, note that here, for Kafka, the TextBlob sentence splitting doesn't produce the same number of sentences as our hand-curated versions. That isn't a problem for our analysis per se; rather, it points out how much automatic sentence splitting can vary from hand-split sentences.
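To see why splitters disagree, here's a toy example (invented text, and two hypothetical regex rules rather than what spacy, TextBlob, or CoreNLP actually do): whether a rule treats "Mr." as a sentence boundary changes the sentence count.

```python
import re

# Invented sample text; abbreviation-heavy prose is where splitters diverge.
text = "Mr. Samsa woke up. He looked at the clock. It was 6 a.m. already."

# Rule A: split after any ., !, or ? followed by whitespace.
rule_a = re.split(r'(?<=[.!?])\s+', text)

# Rule B: same, but don't split after "Mr."-style abbreviations
# (an uppercase letter, a lowercase letter, then a period).
rule_b = re.split(r'(?<![A-Z][a-z]\.)(?<=[.!?])\s+', text)

print(len(rule_a), len(rule_b))   # rule A finds one more "sentence" than rule B
```

Neither rule is right in general (rule B still breaks after "a.m."); the point is just that reasonable-looking rules yield different sentence counts, and therefore different average lengths.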

In [49]:
from textblob import TextBlob

paths_to_files = [p for p in glob.glob('../from_box/Master_Files_Fall_2018/chicago_corpus/*.txt')[:25] + \
                    glob.glob('../from_box/Master_Files_Fall_2018/English_Translation_Files/*.txt') \
                              if p.find('/deu_') == -1]

print
print 'FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE'
print 

for p in paths_to_files:
    
    file_name = p.split('/')[-1]
    
    sentence_lengths = []
    
    text = re.sub(r'\s+', ' ', codecs.open(p, 'r', encoding='utf-8').read())
    blob = TextBlob(text)
    for s in blob.sentences:
        sentence_lengths.append(len(s.tokens))
            
    print file_name, len(blob.sentences), numpy.mean(sentence_lengths), numpy.amin(sentence_lengths), numpy.amax(sentence_lengths)
FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE

00010587.txt 4337 20.463454000461148 2 162
00010836.txt 1383 21.496746203904554 1 104
00010336.txt 2228 27.001795332136446 1 219
00010686.txt 5920 16.919425675675676 2 124
00010418.txt 6624 21.49381038647343 2 128
00010362.txt 6078 19.1967752550181 1 118
00010416.txt 5537 19.12696405996027 2 157
00010574.txt 10224 16.079029733959313 2 95
00010927.txt 4774 16.528906577293675 2 97
00010716.txt 2419 22.72715998346424 1 172
00010658.txt 4352 17.805836397058822 2 103
00010583.txt 7397 20.602676760848993 1 302
00010604.txt 5063 17.45546118901837 2 103
00010883.txt 4980 17.259638554216867 1 95
00010957.txt 15289 14.997514552946562 2 100
00010725.txt 1915 20.255352480417756 2 112
00011011.txt 4197 15.800095306171075 2 103
00010747.txt 4542 22.116248348745046 2 332
00010346.txt 3867 31.52521334367727 2 807
00010442.txt 5284 20.730696442089325 2 133
00010808.txt 5284 10.834595003785012 2 107
00010767.txt 4435 18.6304396843292 1 123
00010472.txt 9810 17.896228338430173 2 214
00010673.txt 2436 15.246305418719212 1 92
00010527.txt 5229 17.873589596481164 1 160
eng_Johnston_1999_Metamorphosis_text_lined_corrected.txt 950 26.286315789473683 2 155
eng_Bernofsky_2014_Metamorphosis_text_lined_corrected.txt 739 33.70230040595399 1 183
eng_Underwood_1981_Metamorphosis_text_lined_corrected.txt 738 33.00135501355014 3 174
eng_Pasley_1992_Transformation_text_lined_corrected.txt 732 34.060109289617486 2 179
eng_Corngold_1972_Metamorphosis_text_lined_corrected.txt 747 32.74966532797858 2 183
eng_Freed_1996_Metamorphosis_text_lined_corrected.txt 739 30.027063599458728 2 151
eng_Applebaum_1993_Metamorphosis_text_lined_corrected.txt 747 33.10977242302543 2 160
eng_Lloyd_1937_Metamorphosis_text_lined_corrected.txt 790 29.931645569620255 1 144
eng_Crick_2009_Metamorphosis_lined_corrected.txt 781 31.51472471190781 2 187
eng_Neugroschel_1993_Metamorphisis_text_lined_corrected.txt 852 27.377934272300468 2 153
eng_Muir_1948_Metamorphosis_text_lined_corrected.txt 744 32.12231182795699 2 181
eng_Hofmann_2006_Metamorphosis_text_lined_corrected.txt 753 32.53253652058433 2 180

Trying the Stanford CoreNLP package . . .

I ran the raw text through the Stanford CoreNLP package, which I consider state-of-the-art, but tend to use less than spacy because spacy, once loaded, is much faster. These numbers still don't match our spacy-derived and (for Kafka) hand-curated data, but they're closer . . .

In [39]:
!pwd
!echo "annotators = tokenize, ssplit" > stanford.properties
!cat stanford.properties
/data/1/kafka/my_notebooks
annotators = tokenize, ssplit
In [50]:
import commands
from lxml import etree

paths_to_files = [p for p in glob.glob('/data/1/kafka/from_box/Master_Files_Fall_2018/chicago_corpus/*.txt')[:25] + \
                    glob.glob('/data/1/kafka/from_box/Master_Files_Fall_2018/English_Translation_Files/*.txt') \
                              if p.find('/deu_') == -1]

print
print 'FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE'
print 

for p in paths_to_files:
    
    xml_file_name = p.split('/')[-1] + '.xml'
    
    cmd = 'java -Xmx4g -cp "/home/spenteco/Downloads/stanford-corenlp-full-2018-02-27/*" ' + \
            'edu.stanford.nlp.pipeline.StanfordCoreNLP -file ' + p + \
            ' -props /data/1/kafka/my_notebooks/stanford.properties'
    
    noop = commands.getoutput(cmd)
    
    tree = etree.parse(xml_file_name)
    
    n_sentences = len(tree.xpath('//sentence'))
    
    sentence_lengths = []
    for s in tree.xpath('//sentence'):
        sentence_lengths.append(len(s.xpath('descendant::token')))
        
    noop = commands.getoutput('rm ' + xml_file_name)
    
    print xml_file_name, n_sentences, numpy.mean(sentence_lengths), numpy.amin(sentence_lengths), numpy.amax(sentence_lengths)
FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE

00010587.txt.xml 6199 14.334892724633006 1 121
00010836.txt.xml 1420 20.78098591549296 3 108
00010336.txt.xml 2671 22.605016847622615 2 341
00010686.txt.xml 7347 13.483598747788212 1 127
00010418.txt.xml 8126 17.385060300270737 2 128
00010362.txt.xml 8076 14.320331847449232 2 75
00010416.txt.xml 7187 14.750104355085572 1 158
00010574.txt.xml 10219 16.20540170271064 2 96
00010927.txt.xml 6309 12.335869392930734 1 59
00010716.txt.xml 3569 15.232838330064444 2 74
00010658.txt.xml 5929 13.022432113341205 2 80
00010583.txt.xml 6872 22.005966239813738 2 323
00010604.txt.xml 6266 14.19885094158953 2 101
00010883.txt.xml 6436 13.66174642635177 1 88
00010957.txt.xml 18898 11.987035665149751 2 100
00010725.txt.xml 2401 16.020408163265305 2 111
00011011.txt.xml 5276 12.635898407884762 2 104
00010747.txt.xml 5441 18.522881823194265 1 298
00010346.txt.xml 4809 25.353295903514244 2 807
00010442.txt.xml 6348 16.99448645242596 2 88
00010808.txt.xml 5276 11.236732373009856 2 109
00010767.txt.xml 5606 14.530146271851587 2 91
00010472.txt.xml 12116 14.432403433476395 2 209
00010673.txt.xml 3052 11.988204456094364 1 75
00010527.txt.xml 7153 12.860198518104292 2 90
eng_Johnston_1999_Metamorphosis_text_lined_corrected.txt.xml 937 26.637139807897544 2 155
eng_Bernofsky_2014_Metamorphosis_text_lined_corrected.txt.xml 734 33.927792915531334 2 183
eng_Underwood_1981_Metamorphosis_text_lined_corrected.txt.xml 739 32.95805142083897 3 174
eng_Pasley_1992_Transformation_text_lined_corrected.txt.xml 735 33.926530612244896 2 179
eng_Corngold_1972_Metamorphosis_text_lined_corrected.txt.xml 746 32.792225201072384 2 183
eng_Freed_1996_Metamorphosis_text_lined_corrected.txt.xml 739 30.027063599458728 2 151
eng_Applebaum_1993_Metamorphosis_text_lined_corrected.txt.xml 747 33.10575635876841 2 160
eng_Lloyd_1937_Metamorphosis_text_lined_corrected.txt.xml 786 30.081424936386767 2 144
eng_Crick_2009_Metamorphosis_lined_corrected.txt.xml 782 31.475703324808183 2 187
eng_Neugroschel_1993_Metamorphisis_text_lined_corrected.txt.xml 850 27.436470588235295 2 153
eng_Muir_1948_Metamorphosis_text_lined_corrected.txt.xml 744 32.12231182795699 2 181
eng_Hofmann_2006_Metamorphosis_text_lined_corrected.txt.xml 753 32.53253652058433 2 180