A different kind of ChunkParserI subclass can be used to identify LOCATION chunks. It relies on the gazetteers corpus to recognize location words. The gazetteers corpus is a WordListCorpusReader instance
that contains the following location words:
- Country names
- U.S. states and abbreviations
- Mexican states
- Major U.S. cities
- Canadian provinces
The LocationChunker class iterates over a tagged sentence, looking for words that are found in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The IOB LOCATION tags are produced by the iob_locations() method, and the parse() method converts those IOB tags to a Tree.
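To make the IOB scheme concrete, here is a minimal pure-Python sketch of how (word, tag, IOB) triples map to LOCATION chunks. The helper name group_location_chunks and the hand-written triples are assumptions for illustration only. In the chunker itself this conversion is done by NLTK's conlltags2tree(), which builds a Tree rather than plain lists.

```python
def group_location_chunks(iob_triples):
    """Collect LOCATION chunks from (word, pos_tag, iob_tag) triples.

    Hypothetical helper for illustration; it loosely mimics the grouping
    that conlltags2tree() performs, but returns plain lists.
    """
    chunks = []
    current = []
    for word, pos, iob in iob_triples:
        if iob == 'B-LOCATION' or (iob == 'I-LOCATION' and not current):
            # A B- tag (or an I- tag with no open chunk) starts a new chunk.
            if current:
                chunks.append(current)
            current = [(word, pos)]
        elif iob == 'I-LOCATION':
            # An I- tag continues the chunk that is already open.
            current.append((word, pos))
        else:
            # Any other tag, such as O, closes the open chunk.
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


# Hand-written IOB triples, in the shape iob_locations() yields.
triples = [
    ('San', 'NNP', 'B-LOCATION'),
    ('Francisco', 'NNP', 'I-LOCATION'),
    ('CA', 'NNP', 'I-LOCATION'),
    ('is', 'BE', 'O'),
    ('cold', 'JJ', 'O'),
]
print(group_location_chunks(triples))
```

The B- tag marks the first word of a chunk, I- tags mark the words that continue it, and O marks words outside any chunk.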
Code #1 : LocationChunker class
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers


class LocationChunker(ChunkParserI):
    def __init__(self):
        # All known location names, stored in a set for fast lookup.
        self.locations = set(gazetteers.words())
        # Largest number of spaces found in any location name.
        self.lookahead = 0

        for loc in self.locations:
            nwords = loc.count(' ')

            if nwords > self.lookahead:
                self.lookahead = nwords
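The lookahead value computed in __init__ can be illustrated with a small hand-written set standing in for gazetteers.words(). The sample entries below are assumptions chosen for illustration, not a dump of the corpus. The value is the largest number of spaces found in any entry.

```python
# Toy stand-in for set(gazetteers.words()); the entries are illustrative.
sample_locations = {'CA', 'Mexico', 'San Francisco', 'New York City'}

# Same result as the loop in LocationChunker.__init__, written with max().
lookahead = max((loc.count(' ') for loc in sample_locations), default=0)

print(lookahead)  # 2, because 'New York City' contains two spaces
```

In iob_locations(), this value bounds how far past the current word the inner loop scans (k = j + self.lookahead).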
Code #2 : iob_locations() method
    # These methods continue the body of the LocationChunker class from Code #1.
    def iob_locations(self, tagged_sent):
        i = 0
        l = len(tagged_sent)
        inside = False

        while i < l:
            word, tag = tagged_sent[i]
            j = i + 1
            k = j + self.lookahead
            nextwords, nexttags = [], []
            loc = False

            while j < k:
                # Does the current word plus the words collected so far form a location?
                if ' '.join([word] + nextwords) in self.locations:
                    if inside:
                        yield word, tag, 'I-LOCATION'
                    else:
                        yield word, tag, 'B-LOCATION'

                    for nword, ntag in zip(nextwords, nexttags):
                        yield nword, ntag, 'I-LOCATION'

                    loc, inside = True, True
                    i = j
                    break

                if j < l:
                    # Extend the candidate phrase with the next word.
                    nextword, nexttag = tagged_sent[j]
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break

            if not loc:
                # No location starts at this word, so tag it as outside any chunk.
                inside = False
                i += 1
                yield word, tag, 'O'

    def parse(self, tagged_sent):
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)
Code #3 : use the LocationChunker class to parse the sentence
from chunkers import sub_leaves
from chunkers import LocationChunker

# Create the chunker; the gazetteers word list is loaded here.
loc = LocationChunker()

t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP'),
               ('is', 'BE'), ('cold', 'JJ'), ('compared', 'VBD'),
               ('to', 'TO'), ('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')])

print("Location : \n", sub_leaves(t, 'LOCATION'))
Output :
Location : [[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')], [('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]
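The example imports sub_leaves from a local chunkers module that is not shown in this section. A minimal sketch of such a helper, assuming it simply collects the leaves of every subtree with a given label, could look like the following. The hand-built Tree is a stand-in for the result of parse(), so the snippet runs without the gazetteers data.

```python
from nltk.tree import Tree


def sub_leaves(tree, label):
    """Return the leaves of every subtree of `tree` labelled `label`.

    Assumed implementation; the actual helper lives in the chunkers module.
    """
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]


# Hand-built tree standing in for the result of LocationChunker.parse().
tree = Tree('S', [
    Tree('LOCATION', [('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]),
    ('is', 'BE'),
    ('warm', 'JJ'),
])
print(sub_leaves(tree, 'LOCATION'))
```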