A quick notebook to run a sample of texts through the Stanford Named Entity Recognizer, grab a set of data relating to named entity recognition, and report the results.
Conclusions? It's much better than spacy. It's not perfect, but it's good enough that I feel comfortable using it to mine all the place names in the corpus.
!ls -1 /home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction | wc -l
PATH_TO_CORPUS = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'
PATH_TO_STANFORD_NER = '/home/spenteco/Downloads/stanford-ner-2018-02-27/'
Select ten texts at "random" for testing.
import random, glob
random.seed()
paths_to_files = random.sample(glob.glob(PATH_TO_CORPUS + '*.txt'), 10)
print paths_to_files
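Because `random.seed()` is called without an argument, each run of the cell above picks a different ten texts. If a repeatable sample is ever wanted, something like the following sketch might do. The helper name `sample_paths` and the seed value are hypothetical, not part of the notebook, and the paths are sorted first on the assumption that `glob.glob` does not guarantee an order.

```python
import random

def sample_paths(paths, k, seed):
    """Return k paths chosen pseudo-randomly, repeatable for a given seed."""
    rng = random.Random(seed)           # private generator; global state untouched
    return rng.sample(sorted(paths), k)  # sort first so input order does not matter

# Made-up file names, standing in for glob.glob(PATH_TO_CORPUS + '*.txt').
example_paths = ['text_%02d.txt' % i for i in range(20)]
first = sample_paths(example_paths, 10, seed=2019)
second = sample_paths(list(reversed(example_paths)), 10, seed=2019)
print(first == second)
```

The same seed gives the same ten files, which makes it easier to compare one NER run against another.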
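I'm grabbing three kinds of data here: a count of each entity type, the capitalized words which were not tagged as named entities, and the named entities themselves, grouped by entity type.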
import commands, re, string
from collections import defaultdict, Counter
entity_types = defaultdict(int)
capitalized_not_named_entity = defaultdict(int)
named_entities = defaultdict(lambda: defaultdict(int))
for pn, p in enumerate(paths_to_files):
    novel_label = p.split('/')[-1]
    # Run the Stanford NER shell script on one text and capture its output.
    cmd = 'cd ' + PATH_TO_STANFORD_NER + '; ./ner.sh ' + p
    result_lines = commands.getoutput(cmd).split('\n')
    # The tagged text sits between a header line and a footer line.
    end_of_header_line_number = -1
    start_of_footer_line_number = -1
    for ln, line in enumerate(result_lines):
        if line.startswith('Loading classifier from'):
            end_of_header_line_number = ln
        if line.startswith('CRFClassifier tagged'):
            start_of_footer_line_number = ln
    print novel_label, end_of_header_line_number, start_of_footer_line_number
    one_text_named_entities = []
    one_text_capitalized_not_named_entity = []
    # Each token looks like word/TYPE; a type of 'O' means "not a named entity".
    for tn, t in enumerate(re.split(r'\s+', ' '.join(result_lines[end_of_header_line_number + 1:
                                                     start_of_footer_line_number]))):
        if t.strip() == '':
            continue
        try:
            word = t.split('/')[0]
            entity_type = t.split('/')[1]
            if word[0] == word[0].upper() and entity_type == 'O':
                if word[0] not in string.punctuation:
                    one_text_capitalized_not_named_entity.append(word)
            elif entity_type != 'O':
                # NOTE: this tests named_entities (the cross-text dict) rather than
                # one_text_named_entities, so consecutive entity tokens
                # (e.g. "New" "York") are not reliably merged.
                if len(named_entities) > 0 and tn > 0 and named_entities[-1][0] == tn - 1:
                    one_text_named_entities[-1][2] = one_text_named_entities[-1][2] + ' ' + word
                    one_text_named_entities[-1][0] = tn
                else:
                    one_text_named_entities.append([tn, entity_type, word])
        except IndexError:
            print 'IndexError', t
    # Fold this text's results into the corpus-wide counts.
    for w in one_text_capitalized_not_named_entity:
        capitalized_not_named_entity[w] += 1
    for w in one_text_named_entities:
        entity_types[w[1]] += 1
        named_entities[w[1]][w[2]] += 1
print 'Done!'
print
for c in Counter(entity_types).most_common():
    print c[0], c[1]
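The parsing step in the cell above (slice out the lines between the header and footer, then split each `word/TYPE` token) can be sketched on its own. This is a minimal illustration, not the notebook's code: `parse_tagged_output` is a hypothetical helper, the sample lines are made up to mimic the shape the loop expects rather than copied from real Stanford NER output, and two details differ from the original on purpose. It splits on the last slash so that a word containing a slash keeps its text, and it falls back to the end of the output when no footer line is found.

```python
def parse_tagged_output(result_lines):
    """Return (word, entity_type) pairs found between the header and footer lines."""
    header = -1
    footer = len(result_lines)  # assumption: with no footer, read to the end
    for ln, line in enumerate(result_lines):
        if line.startswith('Loading classifier from'):
            header = ln
        if line.startswith('CRFClassifier tagged'):
            footer = ln
    pairs = []
    for token in ' '.join(result_lines[header + 1:footer]).split():
        # Split on the LAST slash so a slash inside the word itself survives.
        word, sep, entity_type = token.rpartition('/')
        if sep and word and entity_type:
            pairs.append((word, entity_type))
    return pairs

# Hand-written sample in the expected shape; not real NER output.
sample = [
    'Loading classifier from (hypothetical path)',
    'She/O met/O Fanny/PERSON in/O Paris/LOCATION ./O',
    'CRFClassifier tagged (hypothetical footer)',
]
pairs = parse_tagged_output(sample)
print(pairs)
```

Tokens without a slash are skipped quietly here, where the original loop reports them through its `IndexError` handler.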
Here, I list out the top 100 capitalized words which the Stanford NER did not identify as a named entity. I'm looking for instances in which it should have identified something, but didn't.
Common name prefixes ("Mr.", "Mrs", "Miss") aren't included in named entities. Spacy does the same thing: is that perhaps a standard convention of this kind of software? And the Stanford NER does miss some names ("Fanny", "Teddy"). But this seems much better than spacy.
print
for c in Counter(capitalized_not_named_entity).most_common(100):
    print c[0], c[1]
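One detail of the capitalization test is worth a note: `word[0] == word[0].upper()` is also true for digits, so a token such as "1850" counts as "capitalized" unless it is filtered out some other way. A stricter variant might use `str.isupper()`, as in this sketch. The helper name `capitalized_untagged` and the sample pairs are hypothetical and hand-written, not taken from the corpus.

```python
from collections import Counter

def capitalized_untagged(pairs):
    """Count words tagged 'O' whose first character is an uppercase letter."""
    counts = Counter()
    for word, entity_type in pairs:
        # isupper() is False for digits and punctuation, so no extra filter is needed.
        if entity_type == 'O' and word[:1].isupper():
            counts[word] += 1
    return counts

# Hand-written (word, entity_type) pairs for illustration only.
sample_pairs = [
    ('Mr.', 'O'), ('Teddy', 'O'), ('1850', 'O'), ('"', 'O'),
    ('walked', 'O'), ('Teddy', 'O'), ('London', 'LOCATION'),
]
counts = capitalized_untagged(sample_pairs)
print(counts.most_common(100))
```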
Here, I list the top 25 named entities for each type of named entity. Such lists should give us a broad overview of how well or poorly the Stanford NER does. Bottom line? "ORGANIZATION" feels useless. Perhaps I'm doing something wrong in parsing the results; perhaps it's just a mess. "LOCATION" and "PERSON" are much better than spacy. There are some things I don't like here: "New" and "York" are listed as separate names (almost certainly a problem in my code, and not in the Stanford code), and, for some reason, it thinks "D'Artagnan" is a location, and not a person.
for k in sorted(named_entities.keys()):
    print
    for c in Counter(named_entities[k]).most_common(25):
        print k, c[0], c[1]
A quick test to see if the "New" "York" and "D'Artagnan" problems are me, or the Stanford NER.
Results? I'm screwing up "New" "York" (easy to fix and test). Stanford is responsible for thinking "D'Artagnan" is a LOCATION, and not a PERSON.
!pwd
!echo "\"I want to go to New York,\" D'Artagnan said." > test.txt
!cat /home/spenteco/1/muncie_2019/tatlock/notebooks/test.txt
!cd /home/spenteco/Downloads/stanford-ner-2018-02-27/; ./ner.sh /home/spenteco/1/muncie_2019/tatlock/notebooks/test.txt
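The "New" "York" fix might look something like the sketch below. In the collection loop, the adjacency check reads `named_entities` (the dict shared across texts) where it presumably meant `one_text_named_entities` (the per-text list), so consecutive tokens are not joined. This version checks the per-text list, and also requires that the two tokens share an entity type, which is an added assumption the original loop does not make. The function name `merge_adjacent_entities` is hypothetical, and the tagged pairs are written by hand to reflect the result described above, not copied from actual NER output.

```python
def merge_adjacent_entities(pairs):
    """Join consecutive tokens with the same non-'O' type into one entity."""
    entities = []  # each item: [index_of_last_token, entity_type, text]
    for tn, (word, entity_type) in enumerate(pairs):
        if entity_type == 'O':
            continue
        previous = entities[-1] if entities else None
        if previous and previous[0] == tn - 1 and previous[1] == entity_type:
            previous[2] += ' ' + word  # extend the running entity
            previous[0] = tn           # remember where it now ends
        else:
            entities.append([tn, entity_type, word])
    return [(entity_type, text) for _, entity_type, text in entities]

# Hand-written tags for the test sentence; the real tokenization may differ.
test_pairs = [
    ('I', 'O'), ('want', 'O'), ('to', 'O'), ('go', 'O'), ('to', 'O'),
    ('New', 'LOCATION'), ('York', 'LOCATION'), (',', 'O'),
    ("D'Artagnan", 'LOCATION'), ('said', 'O'), ('.', 'O'),
]
merged = merge_adjacent_entities(test_pairs)
print(merged)
```

With these pairs, "New" and "York" come back as one LOCATION, while "D'Artagnan" stays a separate LOCATION because a comma sits between them. The mislabel itself is Stanford's, so no amount of merging fixes it.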