NLP | Location Tags Extraction

26 July 2024

A custom ChunkParserI subclass can be used to identify LOCATION chunks. It relies on the gazetteers corpus to recognize location words. The gazetteers corpus is a WordListCorpusReader that contains the following location words:

Country names
U.S. states and abbreviations
Mexican states
Major U.S. cities
Canadian provinces

The LocationChunker class iterates over a tagged sentence, looking for words that are found in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The IOB LOCATION tags are produced by the iob_locations() method, and the parse() method converts those IOB tags to a Tree.
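
To make the IOB idea concrete, here is a minimal sketch in plain Python that groups IOB-tagged triples into chunks. It only illustrates the tagging scheme: the function name iob_to_chunks and the hand-written tags are assumptions for this example, not part of NLTK or of the LocationChunker class.

```python
def iob_to_chunks(iob_triples, label="LOCATION"):
    """Group (word, tag, iob) triples into one list of (word, tag) per chunk."""
    chunks = []
    current = None  # the chunk being built, or None when outside a chunk
    for word, tag, iob in iob_triples:
        if iob == "B-" + label:
            # B- (begin) always starts a new chunk
            current = [(word, tag)]
            chunks.append(current)
        elif iob == "I-" + label and current is not None:
            # I- (inside) extends the chunk that is currently open
            current.append((word, tag))
        else:
            # O (outside), or a stray I- tag, closes any open chunk
            current = None
    return chunks


# Hand-written IOB tags (hypothetical, for illustration only)
iob = [
    ("San", "NNP", "B-LOCATION"),
    ("Francisco", "NNP", "I-LOCATION"),
    ("is", "BE", "O"),
    ("cold", "JJ", "O"),
]
chunks = iob_to_chunks(iob)
print(chunks)  # [[('San', 'NNP'), ('Francisco', 'NNP')]]
```

In the chunker itself this grouping step is handled by conlltags2tree, which builds a Tree instead of plain lists.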

Code #1 : LocationChunker class

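from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        # Store every gazetteer entry in a set for fast membership tests
        self.locations = set(gazetteers.words())
        # lookahead = largest number of spaces in any location name;
        # it bounds how many following words are examined per position
        self.lookahead = 0
        for loc in self.locations:
            nwords = loc.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords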
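
The lookahead value is easiest to see on a small input. The sketch below repeats the same computation on a hand-picked set of names; sample_locations and max_spaces are hypothetical stand-ins for the gazetteers word list and the loop in __init__().

```python
def max_spaces(locations):
    """Return the largest number of spaces found in any location name."""
    # default=0 covers the case of an empty collection
    return max((loc.count(" ") for loc in locations), default=0)


# Hypothetical sample, not the real gazetteers corpus
sample_locations = {"Texas", "New York", "San Francisco", "Salt Lake City"}
lookahead = max_spaces(sample_locations)
print(lookahead)  # 2, because "Salt Lake City" contains two spaces
```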

Code #2 : iob_locations() method

    # Methods of the LocationChunker class (continued from Code #1)
    def iob_locations(self, tagged_sent):
        i = 0
        l = len(tagged_sent)
        inside = False

        while i < l:
            word, tag = tagged_sent[i]
            j = i + 1
            k = j + self.lookahead
            nextwords, nexttags = [], []
            loc = False

            # Grow the candidate phrase one word at a time
            while j < k:
                if ' '.join([word] + nextwords) in self.locations:
                    # The first word continues a chunk if the previous
                    # word was a location, otherwise it begins one
                    if inside:
                        yield word, tag, 'I-LOCATION'
                    else:
                        yield word, tag, 'B-LOCATION'

                    for nword, ntag in zip(nextwords, nexttags):
                        yield nword, ntag, 'I-LOCATION'

                    loc, inside = True, True
                    i = j  # skip past the words just consumed
                    break

                if j < l:
                    nextword, nexttag = tagged_sent[j]
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break

            # No location starts at this word
            if not loc:
                inside = False
                i += 1
                yield word, tag, 'O'

    def parse(self, tagged_sent):
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)

Code #3 : use the LocationChunker class to parse the sentence

# chunkers is the local module that holds LocationChunker and sub_leaves
from chunkers import sub_leaves
from chunkers import LocationChunker

loc = LocationChunker()
t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'),
               ('CA', 'NNP'), ('is', 'BE'), ('cold', 'JJ'),
               ('compared', 'VBD'), ('to', 'TO'), ('San', 'NNP'),
               ('Jose', 'NNP'), ('CA', 'NNP')])

print("Location : \n", sub_leaves(t, 'LOCATION'))

Output :

Location : 
[[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')], 
[('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]
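
The sub_leaves helper is imported from the chunkers module and is not shown above. One plausible definition, given here as an assumption rather than the module's actual code, collects the leaves of every subtree with the requested label. The tree below is built by hand so the sketch runs without the gazetteers corpus.

```python
from nltk.tree import Tree


def sub_leaves(tree, label):
    """Return the leaves of each subtree whose label matches `label`."""
    # Assumed implementation; the real helper in chunkers may differ
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]


# Hand-built tree shaped like the result of LocationChunker.parse()
tree = Tree('S', [
    Tree('LOCATION', [('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')]),
    ('is', 'BE'),
    ('cold', 'JJ'),
])
locations = sub_leaves(tree, 'LOCATION')
print(locations)
```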
