Note that I consider any sentence length of 1 to be bogus, and do not consider it when computing a text's average sentence length. Such one-token "sentences" are a consequence of how spacy's sentence splitting worked when I part-of-speech tagged the Chicago corpus.
The Chicago corpus contains 14,097 such "sentences", or 0.3% of 4,618,493 total sentences. The net result is that, even after dropping the one-token "sentences", I'm still overcounting sentences in the Chicago corpus by roughly 0.3%, and the sentence-length averages are almost certainly somewhat shorter than they are in reality. That is, these one-token sentences mark places where, if we were doing this by hand, we would have combined the sentences on either side.
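The 0.3% figure is just the share of one-token lines among all split sentences; a quick back-of-the-envelope check, using the two counts quoted above:
# back-of-the-envelope check of the 0.3% figure quoted above
n_one_token = 14097
n_total = 4618493
print round(100.0 * n_one_token / n_total, 2), '% of sentences are one-token'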
Note that the sentence lengths for Kafka do not depend on spacy's sentence splitting; for Kafka, we determine sentence boundaries, and thus sentence lengths, using our hand-curated, one-sentence-per-line data.
Also, please note that toward the end of this notebook, I run some of the texts through TextBlob and Stanford CoreNLP and derive average sentence lengths for comparison.
import glob, codecs, re, numpy

BASELINE_CORPUS_FOLDER = 'chicago_pos/'
KAFKA_CORPUS_FOLDER = 'kafka_pos/'

paths_to_files = [p for p in glob.glob(BASELINE_CORPUS_FOLDER + '*.txt') + \
                        glob.glob(KAFKA_CORPUS_FOLDER + '*.txt')]

mean_sentence_lengths = []
median_sentence_lengths = []

for p in paths_to_files:

    sentence_lengths = []

    text = codecs.open(p, 'r', encoding='utf-8').read()
    sentences = text.split('\n')

    for s in sentences:
        if s.strip() > '':
            # skip the one-token "sentences" discussed above
            tokens = re.split('\s+', s.strip())
            if len(tokens) > 1:
                sentence_lengths.append(len(tokens))

    mean_sentence_lengths.append(numpy.mean(sentence_lengths))
    median_sentence_lengths.append(numpy.median(sentence_lengths))
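A quick sanity check one might run at this point, assuming the cell above has populated both lists (one value per file):
# sanity check: one mean and one median per file
print len(paths_to_files), 'texts processed'
print 'grand mean of per-text mean sentence lengths:', round(numpy.mean(mean_sentence_lengths), 2)
print 'grand mean of per-text median sentence lengths:', round(numpy.mean(median_sentence_lengths), 2)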
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams

rcParams['figure.figsize'] = 10, 6
sns.set_style("whitegrid")

kafka_values = []
non_kafka_values = []

for pn, p in enumerate(paths_to_files):
    if p.find('kafka') > -1:
        kafka_values.append(mean_sentence_lengths[pn])
    else:
        non_kafka_values.append(mean_sentence_lengths[pn])

# histogram of the baseline (Chicago) texts, with one vertical line per Kafka text
n, bins, patches = plt.hist(non_kafka_values, bins=40, facecolor='#809DBA', alpha=0.5)
for v in kafka_values:
    plt.axvline(v, color='#DFA11C', linestyle='solid', linewidth=1)

plt.title('MEAN SENTENCE LENGTH')
plt.xlabel('mean sentence length')
plt.ylabel('n texts')
plt.xlim(xmax=40)

plt.show()
kafka_values = []
non_kafka_values = []

for pn, p in enumerate(paths_to_files):
    if p.find('kafka') > -1:
        kafka_values.append(median_sentence_lengths[pn])
    else:
        non_kafka_values.append(median_sentence_lengths[pn])

n, bins, patches = plt.hist(non_kafka_values, bins=40, facecolor='#809DBA', alpha=0.5)
for v in kafka_values:
    plt.axvline(v, color='#DFA11C', linestyle='solid', linewidth=1)

plt.title('MEDIAN SENTENCE LENGTH')
plt.xlabel('median sentence length')
plt.ylabel('n texts')
plt.xlim(xmax=40)

plt.show()
This is the code that pointed out the problem with 1-token "sentences".
paths_to_files = [p for p in glob.glob(BASELINE_CORPUS_FOLDER + '*.txt')[:25] + \
                        glob.glob(KAFKA_CORPUS_FOLDER + '*.txt') if p.find('/deu_') == -1]

print
print 'FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE'
print

for p in paths_to_files:

    file_name = p.split('/')[-1]

    sentence_lengths = []

    text = codecs.open(p, 'r', encoding='utf-8').read()
    sentences = text.split('\n')

    for s in sentences:
        if s.strip() > '':
            sentence_lengths.append(len(re.split('\s+', s.strip())))

    print file_name, len(sentences), numpy.mean(sentence_lengths), numpy.amin(sentence_lengths), numpy.amax(sentence_lengths)
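To quantify the problem per file, one could also count the one-token lines directly; a minimal sketch, reusing the same whitespace tokenization as the cells above:
# count the one-token "sentences" in each file, using the same whitespace split as above
for p in paths_to_files:
    file_name = p.split('/')[-1]
    n_one_token = 0
    n_lines = 0
    for s in codecs.open(p, 'r', encoding='utf-8').read().split('\n'):
        if s.strip() > '':
            n_lines += 1
            if len(re.split('\s+', s.strip())) == 1:
                n_one_token += 1
    print file_name, n_one_token, 'one-token sentences out of', n_lines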
I double-checked these numbers by going back to the full-text versions and running the text through another NLP package (TextBlob). Note that the numbers don't match the spacy numbers; in some cases, they're off by quite a bit. I wouldn't spend a lot of time reconciling these numbers with the numbers above, since I consider TextBlob to be a quick-and-dirty tool, now largely replaced by spacy. I include it here mostly to get another perspective.
Also, note that here, for Kafka, TextBlob's sentence splitting doesn't produce the same number of sentences as our hand-curated versions contain. That isn't a problem for our analysis per se; rather, it points out how much automatic sentence splitting can vary from hand-split sentences.
from textblob import TextBlob

paths_to_files = [p for p in glob.glob('../from_box/Master_Files_Fall_2018/chicago_corpus/*.txt')[:25] + \
                        glob.glob('../from_box/Master_Files_Fall_2018/English_Translation_Files/*.txt') \
                        if p.find('/deu_') == -1]

print
print 'FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE'
print

for p in paths_to_files:

    file_name = p.split('/')[-1]

    sentence_lengths = []

    # collapse all whitespace (including newlines) to single spaces before handing the text to TextBlob
    text = re.sub('\s+', ' ', codecs.open(p, 'r', encoding='utf-8').read())
    blob = TextBlob(text)

    for s in blob.sentences:
        sentence_lengths.append(len(s.tokens))

    print file_name, len(blob.sentences), numpy.mean(sentence_lengths), numpy.amin(sentence_lengths), numpy.amax(sentence_lengths)
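To see the size of that gap for a single Kafka text, one could set TextBlob's sentence count next to the line count of the corresponding hand-curated file. A minimal sketch; the two paths below are hypothetical placeholders and would need to point at the same text:
# compare TextBlob's sentence count against the hand-curated one-sentence-per-line count
# (hypothetical paths; substitute a matching pair of files)
curated_path = 'kafka_pos/some_kafka_text.txt'
fulltext_path = '../from_box/Master_Files_Fall_2018/English_Translation_Files/some_kafka_text.txt'

curated_lines = [s for s in codecs.open(curated_path, 'r', encoding='utf-8').read().split('\n') if s.strip() > '']
blob = TextBlob(re.sub('\s+', ' ', codecs.open(fulltext_path, 'r', encoding='utf-8').read()))

print 'hand-curated sentences:', len(curated_lines)
print 'TextBlob sentences:', len(blob.sentences)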
I ran the raw text through the Stanford CoreNLP package, which I consider state-of-the-art, but which I tend to use less than spacy because spacy, once loaded, is much faster. These numbers still don't match our spacy-derived and (for Kafka) hand-curated data, but they're closer . . .
!pwd
!echo "annotators = tokenize, ssplit" > stanford.properties
!cat stanford.properties
import commands
from lxml import etree

paths_to_files = [p for p in glob.glob('/data/1/kafka/from_box/Master_Files_Fall_2018/chicago_corpus/*.txt')[:25] + \
                        glob.glob('/data/1/kafka/from_box/Master_Files_Fall_2018/English_Translation_Files/*.txt') \
                        if p.find('/deu_') == -1]

print
print 'FILE NAME, N SENTENCES, AVG LEN, SHORTEST SENTENCE, LONGEST SENTENCE'
print

for p in paths_to_files:

    xml_file_name = p.split('/')[-1] + '.xml'

    # run CoreNLP (tokenize + ssplit, per stanford.properties) over the file;
    # the resulting XML lands in the working directory as xml_file_name
    cmd = 'java -Xmx4g -cp "/home/spenteco/Downloads/stanford-corenlp-full-2018-02-27/*" ' + \
            'edu.stanford.nlp.pipeline.StanfordCoreNLP -file ' + p + \
            ' -props /data/1/kafka/my_notebooks/stanford.properties'
    noop = commands.getoutput(cmd)

    tree = etree.parse(xml_file_name)

    n_sentences = len(tree.xpath('//sentence'))

    sentence_lengths = []
    for s in tree.xpath('//sentence'):
        sentence_lengths.append(len(s.xpath('descendant::token')))

    # clean up the intermediate XML
    noop = commands.getoutput('rm ' + xml_file_name)

    print xml_file_name, n_sentences, numpy.mean(sentence_lengths), numpy.amin(sentence_lengths), numpy.amax(sentence_lengths)