A named entity chunker can be trained using the ieer corpus, whose name stands for Information Extraction: Entity Recognition. The ieer corpus has chunk trees but no part-of-speech tags for the words, so preparing training data from it takes some extra work.
Named entity chunk trees can be created from the ieer corpus using the ieertree2conlltags() and ieer_chunked_sents() functions. These trees can then be used to train the ClassifierChunker class created in the Classification-based chunking recipe.
Code #1 : ieertree2conlltags()
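import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
    # tree.pos() pairs each word with the label of its parent node
    words, ents = zip(*tree.pos())
    iobs = []
    prev = None

    for ent in ents:
        if ent == tree.label():
            # the word hangs directly off the root, so it is outside any entity
            iobs.append('O')
            prev = None
        elif prev == ent:
            # same label as the previous word: continue the current chunk
            iobs.append('I-%s' % ent)
        else:
            # a new label starts a new chunk
            iobs.append('B-%s' % ent)
            prev = ent

    # the corpus has no part-of-speech tags, so guess them with the tagger
    words, tags = zip(*tag(words))
    return zip(words, tags, iobs)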
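To see the tagging rule in isolation, here is a minimal sketch that applies the same B-/I-/O logic as ieertree2conlltags() to hand-written (word, label) pairs, with no NLTK dependency. The function name labels_to_iob and the sample pairs are made up for illustration and are not part of the recipe.

```python
def labels_to_iob(pairs, root_label="S"):
    """Turn (word, parent_label) pairs into (word, iob_tag) pairs."""
    iobs = []
    prev = None
    for word, label in pairs:
        if label == root_label:
            # word sits directly under the root: outside any entity
            iobs.append((word, "O"))
            prev = None
        elif label == prev:
            # same label as the previous word: inside the current chunk
            iobs.append((word, "I-%s" % label))
        else:
            # label changed: begin a new chunk
            iobs.append((word, "B-%s" % label))
            prev = label
    return iobs


# Hypothetical pairs, shaped like the output of tree.pos() on a shallow tree.
sample = [
    ("Pierre", "PERSON"), ("Vinken", "PERSON"), ("will", "S"),
    ("join", "S"), ("Nov.", "DATE"), ("29", "DATE"),
]

for word, iob in labels_to_iob(sample):
    print(word, iob)
```

One consequence of this rule is that two adjacent entities with the same label are merged into a single chunk, because the second entity continues with I- tags instead of starting with a new B- tag.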
Code #2 : ieer_chunked_sents()
import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieer_chunked_sents(tag=nltk.tag.pos_tag):
    for doc in ieer.parsed_docs():
        tagged = ieertree2conlltags(doc.text, tag)
        yield conlltags2tree(tagged)
The corpus yields 94 chunk trees. The first 80 are used for training and the remaining 14 for testing.
Code #3 : How the classifier works on the first sentence of the treebank_chunk corpus.
from nltk.corpus import ieer
from chunkers import ieer_chunked_sents, ClassifierChunker
from nltk.corpus import treebank_chunk

ieer_chunks = list(ieer_chunked_sents())

print("Length of ieer_chunks : ", len(ieer_chunks))

# initializing chunker
chunker = ClassifierChunker(ieer_chunks[:80])
print("\nparsing : \n", chunker.parse(
    treebank_chunk.tagged_sents()[0]))

# evaluating
score = chunker.evaluate(ieer_chunks[80:])

a = score.accuracy()
p = score.precision()
r = score.recall()

print("\nAccuracy : ", a)
print("\nPrecision : ", p)
print("\nRecall : ", r)
Output :
Length of ieer_chunks : 94

parsing :
 Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]), Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])

Accuracy : 0.8829018388070625

Precision : 0.4088717454194793

Recall : 0.5053635280095352
How it works?
The ieer trees generated by ieer_chunked_sents() are not entirely accurate. There are no explicit sentence breaks, so each document is a single tree. Also, the words are not explicitly tagged, so the part-of-speech tags are guesses produced by nltk.tag.pos_tag().
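The scores in the output show high accuracy alongside much lower precision and recall. A likely reason, assuming accuracy is counted per token over IOB tags while precision and recall are counted per whole chunk, is that most tokens are tagged O and are easy to get right. The sketch below illustrates that difference on invented tag sequences. The helper iob_to_chunks and the gold and guess lists are hypothetical, and this is not NLTK's own scoring code.

```python
def iob_to_chunks(tags):
    """Collect (label, start, end) spans from a list of IOB tags."""
    chunks = set()
    start = label = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last chunk
        # a chunk ends on O, on a new B- tag, or when the label changes
        if label is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != label
        ):
            chunks.add((label, start, i))
            start = label = None
        if tag != "O" and label is None:
            start, label = i, tag[2:]
    return chunks


# Invented example: the guess finds the right span but the wrong entity type.
gold = ["B-PERSON", "I-PERSON", "O", "O", "O", "O", "O", "O", "B-DATE", "I-DATE"]
guess = ["B-LOCATION", "I-LOCATION", "O", "O", "O", "O", "O", "O", "B-DATE", "I-DATE"]

# token-level accuracy: fraction of positions with an identical IOB tag
accuracy = sum(g == p for g, p in zip(gold, guess)) / len(gold)

# chunk-level precision and recall: a chunk counts only if label and span match
gold_chunks = iob_to_chunks(gold)
guess_chunks = iob_to_chunks(guess)
correct = gold_chunks & guess_chunks
precision = len(correct) / len(guess_chunks)
recall = len(correct) / len(gold_chunks)

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
```

Here a single mislabelled two-word entity costs only two of ten tokens in accuracy, but half of the chunks in precision and recall.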