01_evaluate_ner_stanford

A quick notebook to run a sample of texts through the Stanford Named Entity Recognizer, collect counts of the entities it reports (and of the capitalized words it misses), and report the results.

Conclusions? It's much better than spacy. It's not perfect, but it's good enough that I feel comfortable using it to mine all the place names in the corpus.

Where are my files?

In [1]:
!ls -1 /home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction | wc -l
883

Set paths, etc

In [2]:
PATH_TO_CORPUS = '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/'
PATH_TO_STANFORD_NER = '/home/spenteco/Downloads/stanford-ner-2018-02-27/'

Grab 10 files . . .

. . . at "random" for testing.

In [3]:
import random, glob

random.seed()

paths_to_files = random.sample(glob.glob(PATH_TO_CORPUS + '*.txt'), 10)

print paths_to_files
['/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Connor_Ralph_Black_Rock_A_Tale_of_the_Selkirks_PG_3245.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Yonge_Charlotte_M_Charlotte_Mary_Chantry_House_PG_7378_0.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Holmes_Mary_Jane_Tempest_and_Sunshine_PG_17260_0.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Austin_Jane_G_Jane_Goodwin_Outpost_PG_4676.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Hough_Emerson_The_Girl_at_the_Halfway_House_A_Story_of_the_Plains_PG_14948.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Wharton_Edith_The_Greater_Inclination_PG_9190.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Twain_Mark_A_Tramp_Abroad_PG_119.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Dumas_Alexandre_Twenty_Years_After_PG_1259_8.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/A_L_O_E_Hebrew_Heroes_A_Tale_Founded_on_Jewish_History_PG_26094.txt', '/home/spenteco/0/corpora/muncie_public_library_corpus/PG_no_backmatter_fiction/Reid_Mayne_The_Tiger_Hunter_PG_25127.txt']
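Note: random.seed() with no argument seeds from the system clock, so the sample changes on every run (hence the scare quotes around "random"). If a repeatable sample is wanted, pass a fixed seed; a minimal sketch (the seed value 42 is an arbitrary choice of mine):

import random, glob

random.seed(42)    # any fixed value makes the 10-file sample reproducible

paths_to_files = random.sample(glob.glob(PATH_TO_CORPUS + '*.txt'), 10)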

Run the texts through the Stanford NER

I'm grabbing three kinds of data here:

  • entity_types, which are counts of all of the kinds of named entities reported by the Stanford NER.
  • capitalized_not_named_entity, which includes any capitalized word which the Stanford NER did not recognize as a named entity.
  • named_entities, which are counts of all of the individual named entities reported by the Stanford NER, grouped by named entity type. (The slash-tag token format these are parsed from is sketched below.)
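For reference, ner.sh emits slash-tagged tokens, one tag per word, with 'O' marking words that are not part of any entity (see the raw output in the one-line test at the bottom of this notebook). A minimal sketch of the split the loop below performs, on a made-up tagged string:

tagged = "New/LOCATION York/LOCATION ,/O D'Artagnan/LOCATION said/O ./O"

for t in tagged.split():
    word = t.split('/')[0]          # e.g. 'York'
    entity_type = t.split('/')[1]   # e.g. 'LOCATION', or 'O' for no entity
    print word, entity_type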
In [23]:
import commands, re, string
from collections import defaultdict, Counter

entity_types = defaultdict(int)
capitalized_not_named_entity = defaultdict(int)

named_entities = defaultdict(lambda: defaultdict(int))

for pn, p in enumerate(paths_to_files):
        
    novel_label = p.split('/')[-1]
    
    cmd = 'cd ' + PATH_TO_STANFORD_NER + '; ./ner.sh ' + p
    
    result_lines = commands.getoutput(cmd).split('\n')
    
    end_of_header_line_number = -1
    start_of_footer_line_number = -1
    for ln, line in enumerate(result_lines):
        if line.startswith('Loading classifier from'):
            end_of_header_line_number = ln
        if line.startswith('CRFClassifier tagged'):
            start_of_footer_line_number = ln
            
    print novel_label, end_of_header_line_number, start_of_footer_line_number
    
    one_text_named_entities = []
    one_text_capitalized_not_named_entity = []
    
    for tn, t in enumerate(re.split('\s+', ' '.join(result_lines[end_of_header_line_number + 1: 
                                                    start_of_footer_line_number]))):
        
        if t.strip() == '': 
            continue
        
        try:
            word = t.split('/')[0]
            entity_type = t.split('/')[1]

            if word[0] == word[0].upper() and entity_type == 'O':
                if word[0] not in string.punctuation:
                    one_text_capitalized_not_named_entity.append(word)
            elif entity_type != 'O':

                # Merge consecutive tagged tokens into one multi-word entity.
                # BUG (diagnosed at the bottom of this notebook): this tests
                # named_entities, not one_text_named_entities, so the merge
                # almost never fires ("New" and "York" stay separate) and a
                # stray -1 key gets planted in named_entities.
                if len(named_entities) > 0 and tn > 0 and named_entities[-1][0] == tn - 1:
                    one_text_named_entities[-1][2] = one_text_named_entities[-1][2] + ' ' + word
                    one_text_named_entities[-1][0] = tn
                else:
                    one_text_named_entities.append([tn, entity_type, word])
        except IndexError:
            print 'IndexError', t
    
    for w in one_text_capitalized_not_named_entity:
        capitalized_not_named_entity[w] += 1
        
        
    for w in one_text_named_entities:
        entity_types[w[1]] += 1
        named_entities[w[1]][w[2]] += 1
        
print 'Done!'
Connor_Ralph_Black_Rock_A_Tale_of_the_Selkirks_PG_3245.txt 3 3391
Yonge_Charlotte_M_Charlotte_Mary_Chantry_House_PG_7378_0.txt 3 4060
Holmes_Mary_Jane_Tempest_and_Sunshine_PG_17260_0.txt 3 5209
Austin_Jane_G_Jane_Goodwin_Outpost_PG_4676.txt 3 4130
IndexError Untokenizable:
IndexError Š
IndexError (U+8A,
IndexError decimal:
IndexError 138)
Hough_Emerson_The_Girl_at_the_Halfway_House_A_Story_of_the_Plains_PG_14948.txt 3 5162
Wharton_Edith_The_Greater_Inclination_PG_9190.txt 3 3499
Twain_Mark_A_Tramp_Abroad_PG_119.txt 3 7370
Dumas_Alexandre_Twenty_Years_After_PG_1259_8.txt 3 17640
A_L_O_E_Hebrew_Heroes_A_Tale_Founded_on_Jewish_History_PG_26094.txt 3 3327
Reid_Mayne_The_Tiger_Hunter_PG_25127.txt 3 6819
Done!
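An aside on the shell call: the commands module is Python 2-only (it's gone in Python 3). The same call can be made with subprocess, which also takes a cwd argument, so there's no need to build a 'cd ...;' string. A sketch of what that could look like, not what I actually ran:

import subprocess

def run_stanford_ner(path_to_text):
    # commands.getoutput merges stderr into stdout; the 'Loading classifier'
    # and 'CRFClassifier tagged' header/footer lines arrive on stderr, so
    # merge it here too, to keep the header/footer scan above working.
    output = subprocess.check_output(['./ner.sh', path_to_text],
                                     cwd=PATH_TO_STANFORD_NER,
                                     stderr=subprocess.STDOUT)
    return output.split('\n')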

What kinds of named entities does the Stanford NER recognize?

In [24]:
print
for c in Counter(entity_types).most_common():
    print c[0], c[1]
PERSON 28493
LOCATION 7568
ORGANIZATION 1798

False negatives?

Here, I list out the top 100 capitalized words which the Stanford NER did not identify as a named entity. I'm looking for instances in which it should have identified something, but didn't.

Common name prefixes ("Mr.", "Mrs.", "Miss") aren't included in named entities. Spacy does the same thing: is that perhaps a standard convention for this kind of software? And the Stanford NER does miss some names ("Fanny", "Teddy"). But this seems much better than spacy.
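To make misses like "Fanny" and "Teddy" easier to spot, the honorifics and common sentence-starters could be screened out before ranking. The stoplist below is my own ad-hoc list, not something produced by the run above:

STOP = set(['Mr.', 'Mrs.', 'Miss', 'Dr.', 'Sir', 'Lord',
            'I', 'The', 'He', 'It', 'But', 'And', 'You', 'She', 'We'])

likely_misses = [(w, n)
                 for w, n in Counter(capitalized_not_named_entity).most_common()
                 if w not in STOP]

for w, n in likely_misses[:25]:
    print w, n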

In [25]:
print
for c in Counter(capitalized_not_named_entity).most_common(100):
    print c[0], c[1]
I 16073
The 4891
He 2541
It 2267
Mr. 1560
But 1538
And 1528
You 1302
She 1099
A 1074
We 1013
Mrs. 959
What 922
There 899
Yes 870
In 861
They 745
Oh 736
At 729
This 700
Then 686
Well 680
No 677
As 602
That 596
If 508
My 504
His 501
When 491
CHAPTER 490
God 472
Dr. 470
Ah 456
Do 434
Now 429
So 428
THE 389
Why 384
Athos 374
For 352
On 348
One 348
To 341
How 328
Costal 321
Miss 280
Fanny 277
All 271
Let 249
Lord 242
After 237
Here 235
Come 224
Indian 219
Is 218
Her 216
Porthos 213
German 205
An 202
Not 200
Teddy 199
These 190
With 181
Captain 177
By 173
Where 170
Colonel 167
From 163
English 163
Sunshine 162
Of 160
Who 148
Perhaps 146
Have 146
Sir 140
Good 134
Curly 129
OF 126
Did 123
Very 123
Hebrew 119
O 118
Go 117
Isabel 115
Your 115
Are 114
French 114
Some 110
Greek 109
Just 106
Winter 105
However 103
Italian 98
Indeed 97
Heaven 97
While 96
Our 95
Two 94
Hadassah 91
Poor 90

What did it find?

Here, I list the top 25 named entities for each type of named entity. Such lists should give us a broad overview of how well or poorly the Stanford NER does.

Bottom line? "ORGANIZATION" feels useless. Perhaps I'm doing something wrong in parsing the results; perhaps it's just a mess. "LOCATION" and "PERSON" are much better than spacy. There are some things I don't like here: "New" and "York" are separate names (almost certainly a problem in my code, and not in the Stanford code), and for some reason it thinks "D'Artagnan" is a location, and not a person. (The stray "-1 0 0" line at the top of the output below is the phantom key which the merge bug noted above plants in named_entities.)

In [26]:
for k in sorted(named_entities.keys()):
    print
    for c in Counter(named_entities[k]).most_common(25):
        print k, c[0], c[1]
-1 0 0

LOCATION D'Artagnan 1523
LOCATION Paris 221
LOCATION New 192
LOCATION Porthos 140
LOCATION London 133
LOCATION France 131
LOCATION England 113
LOCATION Morelos 104
LOCATION Florence 100
LOCATION York 93
LOCATION Ellisville 89
LOCATION Frankfort 85
LOCATION Orleans 82
LOCATION Jerusalem 79
LOCATION San 73
LOCATION Europe 69
LOCATION Woburn 69
LOCATION Kentucky 67
LOCATION Palmas 62
LOCATION Austria 61
LOCATION Las 61
LOCATION Heidelberg 58
LOCATION Palais 58
LOCATION Royal 58
LOCATION Oajaca 53

ORGANIZATION Athos 168
ORGANIZATION House 75
ORGANIZATION de 74
ORGANIZATION Fanny 59
ORGANIZATION Vard 49
ORGANIZATION Gannett 46
ORGANIZATION Chantry 42
ORGANIZATION la 27
ORGANIZATION of 26
ORGANIZATION Parson 25
ORGANIZATION Frank 23
ORGANIZATION Hotel 23
ORGANIZATION Lens 22
ORGANIZATION Valle 20
ORGANIZATION Comte 19
ORGANIZATION Madame 19
ORGANIZATION Fere 18
ORGANIZATION Del 18
ORGANIZATION Comminges 17
ORGANIZATION Corps 17
ORGANIZATION Black 13
ORGANIZATION Rue 12
ORGANIZATION Halfway 11
ORGANIZATION Protestant 10
ORGANIZATION Argus 10

PERSON Don 834
PERSON Aramis 689
PERSON Julia 630
PERSON de 579
PERSON Mazarin 554
PERSON Monsieur 549
PERSON Fanny 548
PERSON Clarence 543
PERSON Athos 494
PERSON Porthos 484
PERSON Middleton 442
PERSON Lacey 438
PERSON Zarah 427
PERSON Rafael 404
PERSON Franklin 383
PERSON Dora 366
PERSON Raoul 334
PERSON Grimaud 314
PERSON Griff 282
PERSON Ellen 277
PERSON Craig 262
PERSON Emily 254
PERSON Wilmot 245
PERSON Mordaunt 234
PERSON Cornelio 224

Test a one-line text . . .

. . . to see if the "New" "York" and "D'Artagnan" problems are me, or the Stanford NER.

Results? I'm screwing up "New" "York" (easy to fix and test). Stanford is responsible for thinking "D'Artagnan" is a LOCATION, and not a PERSON.

In [33]:
!pwd
!echo "\"I want to go to New York,\" D'Artagnan said." > test.txt
!cat /home/spenteco/1/muncie_2019/tatlock/notebooks/test.txt
!cd /home/spenteco/Downloads/stanford-ner-2018-02-27/; ./ner.sh /home/spenteco/1/muncie_2019/tatlock/notebooks/test.txt
/home/spenteco/1/muncie_2019/tatlock/notebooks
"I want to go to New York," D'Artagnan said.
Invoked on Thu Sep 06 10:49:03 CDT 2018 with arguments: -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz -textFile /home/spenteco/1/muncie_2019/tatlock/notebooks/test.txt
loadClassifier=./classifiers/english.all.3class.distsim.crf.ser.gz
textFile=/home/spenteco/1/muncie_2019/tatlock/notebooks/test.txt
Loading classifier from ./classifiers/english.all.3class.distsim.crf.ser.gz ... done [1.3 sec].
``/O I/O want/O to/O go/O to/O New/LOCATION York/LOCATION ,/O ''/O D'Artagnan/LOCATION said/O ./O 
CRFClassifier tagged 13 words in 1 documents at 209.68 words per second.
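
And the fix for the "New" "York" problem is indeed easy: the adjacency test in the big loop above reads from named_entities when it should read from one_text_named_entities. A sketch of the corrected branch, same variable names as above (I've also added a check that the entity types match, which the original didn't have):

if (len(one_text_named_entities) > 0 and tn > 0
        and one_text_named_entities[-1][0] == tn - 1
        and one_text_named_entities[-1][1] == entity_type):
    # Previous token was also tagged, and adjacent in the text: extend it,
    # so New/LOCATION York/LOCATION becomes one 'New York' entity.
    one_text_named_entities[-1][2] = one_text_named_entities[-1][2] + ' ' + word
    one_text_named_entities[-1][0] = tn
else:
    one_text_named_entities.append([tn, entity_type, word])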