A quick notebook to run a sample of texts through spacy, grab a set of data relating to named entity recognition, and report the results.
Conclusions? Ugh. Something is wrong here. Perhaps I need to get a bigger model for spacy. Maybe I should look at another package (Stanford's?). There may be a bug somewhere in the code below. But these results are very discouraging . . .
!ls -1 /home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction | wc -l
import spacy
print spacy.__version__
nlp = spacy.load('en')
nlp.max_length = 2000000
PATH_TO_CORPUS = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'
Pick 10 texts at "random" for testing.
import random, glob
random.seed()
paths_to_files = random.sample(glob.glob(PATH_TO_CORPUS + '*.txt'), 10)
print paths_to_files
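I'm grabbing three kinds of data here: a count of tokens for each entity type (including tokens with no entity type at all); a count of each capitalized token which spacy did not tag as part of a named entity; and, for each entity type, a count of each named entity (with multi-token entities joined back into one string).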
import codecs, re
from collections import defaultdict, Counter
entity_types = defaultdict(int)
capitalized_not_named_entity = defaultdict(int)
named_entities = defaultdict(lambda: defaultdict(int))
for pn, p in enumerate(paths_to_files):
    novel_label = p.split('/')[-1]
    print pn, novel_label
    # Collapse all runs of whitespace so line breaks do not split entities.
    with codecs.open(p, 'r', encoding='utf-8') as f:
        text = re.sub(r'\s+', ' ', f.read())
    doc = nlp(unicode(text))
    named_entities_in_text = []
    for t in doc:
        entity_types[t.ent_type_] += 1
        is_capitalized = False
        if t.text[0] == t.text[0].upper() and t.pos_ not in ['PUNCT', 'SPACE'] and t.tag_ not in ['POS']:
            is_capitalized = True
        # Tokens inside an entity are saved for grouping; possessive markers are skipped.
        if t.ent_type_ > '' and t.tag_ not in ['POS']:
            named_entities_in_text.append([t.ent_type_, t.ent_iob_, t.text])
        elif is_capitalized == True:
            capitalized_not_named_entity[t.text] += 1
    # Rebuild multi-token entities: 'B' starts an entity, anything else continues it.
    grouped_named_entities = []
    for a in named_entities_in_text:
        # Also start a new group if there is nothing yet to continue.
        if a[1] == 'B' or not grouped_named_entities:
            grouped_named_entities.append([a[0], [a[2]]])
        else:
            grouped_named_entities[-1][1].append(a[2])
    for a in grouped_named_entities:
        named_entities[a[0]][' '.join(a[1])] += 1
print 'Done!'
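The grouping step in the loop above is where a bug could most easily hide, so here is a minimal sketch of it as a standalone function that can be checked without loading spacy. The function name `group_iob` and the sample triples are hypothetical, not part of the notebook or the corpus; the triples just mimic the `[ent_type_, ent_iob_, text]` lists collected above. Unlike the loop above, this version also starts a new group when a continuation token has a different entity type from the open group, which is my assumption about the intended behavior. It is written to run under Python 2 or 3.

```python
from collections import Counter, defaultdict


def group_iob(tagged_tokens):
    """Group (ent_type, iob, text) triples into (ent_type, phrase) pairs."""
    groups = []
    for ent_type, iob, text in tagged_tokens:
        # Start a new group on 'B', or when a continuation token has
        # nothing to attach to (no open group, or a different type).
        if iob == 'B' or not groups or groups[-1][0] != ent_type:
            groups.append([ent_type, [text]])
        else:
            groups[-1][1].append(text)
    return [(ent_type, ' '.join(words)) for ent_type, words in groups]


# Hypothetical sample, not drawn from the corpus.
sample = [
    ('PERSON', 'B', 'Becky'),
    ('PERSON', 'I', 'Sharp'),
    ('GPE', 'B', 'London'),
    ('PERSON', 'B', 'Becky'),
    ('PERSON', 'I', 'Sharp'),
]

# Same nested-count shape as named_entities in the notebook.
counts = defaultdict(Counter)
for ent_type, phrase in group_iob(sample):
    counts[ent_type][phrase] += 1
```

spacy also exposes whole entity spans as `doc.ents` (each with a `label_` and `text`), which might be a simpler source for the same counts, though I have not compared the two approaches on this corpus.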
Note that the first line (" 1281250") effectively means "none" or "not a named entity."
print
for c in Counter(entity_types).most_common():
    print c[0], c[1]
Here, I list out the top 100 capitalized words which spacy did not identify as a named entity. I'm looking for instances in which spacy should have identified something, but didn't.
It does seem to miss some things: common name prefixes ("Mr.", "Mrs", "Miss"); titles ("Captain", "General", "Major"); and a few names ("Becky", "Triplet", "Maurice").
print
for c in Counter(capitalized_not_named_entity).most_common(100):
    print c[0], c[1]
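One caveat about the list above: the capitalization test compares the first character with its own uppercase form, and that comparison is also true for characters which have no case at all. Tokens which start with a digit or a symbol, and which survive the PUNCT/SPACE filter, may therefore be counted as "capitalized." A minimal sketch of the difference, using hypothetical function names and made-up sample tokens (not output from the corpus):

```python
def is_capitalized_original(text):
    # Same comparison as the notebook's token test.
    return text[0] == text[0].upper()


def is_capitalized_strict(text):
    # Assumed alternative: true only for an actual uppercase letter.
    return text[0].isupper()


# Made-up tokens for illustration only.
samples = ['Becky', 'captain', '1865', '_']

passed_original = [s for s in samples if is_capitalized_original(s)]
passed_strict = [s for s in samples if is_capitalized_strict(s)]
```

Here `passed_original` keeps `'1865'` and `'_'` alongside `'Becky'`, while `passed_strict` keeps only `'Becky'`. Whether this matters for the top-100 list depends on how often such tokens occur in the sample.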
Here, I list the top 25 named entities for each type of named entity. Such lists should give us a broad overview of how well or poorly spacy does.
Bottom line? It's a mess.
for k in sorted(named_entities.keys()):
    print
    for c in Counter(named_entities[k]).most_common(25):
        print k, c[0], c[1]
I list every place name which occurs 2 or more times in the sampled data. Place names seem to fall into two types, "GPE" and "LOC".
I'm not happy with the results.
for k in ['GPE', 'LOC']:
    print
    for c in Counter(named_entities[k]).most_common():
        # most_common() is sorted by count, so everything after the first 1 is also a 1.
        if c[1] == 1:
            break
        print k, c[0], c[1]
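Since the same place can turn up under both labels, it may be easier to judge the results with the two types merged into one list. The following is a sketch only: the helper `frequent_places` and its parameters are hypothetical, the sample counts are invented, and it assumes the nested-dictionary shape of `named_entities` built earlier.

```python
from collections import Counter


def frequent_places(named_entities, labels=('GPE', 'LOC'), min_count=2):
    """Merge counts across labels and keep names seen at least min_count times."""
    combined = Counter()
    for label in labels:
        # .get() avoids creating an empty entry when a label is absent.
        combined.update(named_entities.get(label, {}))
    return [(name, n) for name, n in combined.most_common() if n >= min_count]


# Invented counts for illustration, not results from the corpus.
sample_entities = {
    'GPE': {'London': 5, 'Paris': 1},
    'LOC': {'London': 1, 'Europe': 3},
}

places = frequent_places(sample_entities)
```

With these invented counts, "London" is counted 6 times once the two labels are merged, and "Paris" drops out because it occurs only once.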