NLP | Proper Noun Extraction

27 July 2024

Chunking all proper nouns (tagged with NNP) is a very simple way to perform named entity extraction. A simple grammar that combines all proper nouns into a NAME chunk can be created using the RegexpParser class.

Then, we can test this on the first tagged sentence of treebank_chunk to compare the results with the previous recipe:

Code #1 : Testing it on the first tagged sentence of treebank_chunk

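from nltk.corpus import treebank_chunk
from nltk.chunk import RegexpParser
# sub_leaves comes from a separate chunkers helper module, not from NLTK itself
from chunkers import sub_leaves

# One or more consecutive NNP tokens form a single NAME chunk
chunker = RegexpParser(r'''
    NAME:
        {<NNP>+}
    ''')

print("Named Entities : \n",
      sub_leaves(chunker.parse(
          treebank_chunk.tagged_sents()[0]), 'NAME'))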

Output :

Named Entities : 
[[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]

Note : The code above returns all the proper nouns: ‘Pierre’, ‘Vinken’ and ‘Nov.’
The NAME chunker is a simple use of the RegexpParser class: every sequence of consecutive NNP-tagged words is combined into one NAME chunk.
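
To see what the {<NNP>+} rule amounts to, the grouping can be imitated in plain Python without NLTK. This is only an illustrative sketch: the function name nnp_runs is hypothetical, and the tagged sentence is a shortened, hand-typed stand-in rather than data loaded from treebank_chunk.

```python
from itertools import groupby

# Hand-typed (word, tag) pairs used as a stand-in for a corpus sentence
# (an assumption for illustration, not loaded from treebank_chunk).
tagged_sent = [
    ('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
    ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
    ('join', 'VB'), ('the', 'DT'), ('board', 'NN'),
    ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.'),
]

def nnp_runs(tagged):
    """Return each maximal run of consecutive NNP-tagged tokens."""
    runs = []
    # groupby splits the sentence into alternating NNP / non-NNP stretches
    for is_nnp, group in groupby(tagged, key=lambda pair: pair[1] == 'NNP'):
        if is_nnp:
            runs.append(list(group))
    return runs

print(nnp_runs(tagged_sent))
# [[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]
```

Each run corresponds to one NAME chunk; RegexpParser additionally wraps the runs in a Tree so they can be combined with other chunk rules.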
The PersonChunker class can be used if one only wants to chunk the names of people.

Code #2 : PersonChunker class

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import names

class PersonChunker(ChunkParserI):
    def __init__(self):
        # Every word in the names corpus, stored as a set for fast lookup
        self.name_set = set(names.words())

    def parse(self, tagged_sent):
        iobs = []
        in_person = False
        for word, tag in tagged_sent:
            if word in self.name_set and in_person:
                # Continues the PERSON chunk started by the previous word
                iobs.append((word, tag, 'I-PERSON'))
            elif word in self.name_set:
                # Starts a new PERSON chunk
                iobs.append((word, tag, 'B-PERSON'))
                in_person = True
            else:
                # Not a known name: outside any chunk
                iobs.append((word, tag, 'O'))
                in_person = False

        return conlltags2tree(iobs)

The PersonChunker class iterates over the tagged sentence and checks whether each word is in its name_set (constructed from the names corpus). If the current word is in name_set, it is given a B-PERSON or I-PERSON IOB tag, depending on whether the previous word was also in name_set. A word that is not in name_set is given the O IOB tag. Once the loop finishes, the list of IOB tags is converted to a Tree using conlltags2tree().
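
To make the IOB tagging concrete, the loop from parse() can be run on its own. The sketch below is not part of the original recipe: the function name person_iob, the tiny NAME_SET and the sample sentence are all assumptions chosen for illustration, standing in for the much larger set built from the names corpus.

```python
# Hypothetical stand-in for set(names.words()); the real corpus is far larger.
NAME_SET = {'Pierre', 'Mary', 'Ann'}

def person_iob(tagged_sent, name_set=NAME_SET):
    """Return (word, tag, iob) triples using the same rule as PersonChunker."""
    iobs = []
    in_person = False
    for word, tag in tagged_sent:
        if word in name_set:
            # B- opens a chunk, I- continues the chunk opened by the previous word
            iobs.append((word, tag, 'I-PERSON' if in_person else 'B-PERSON'))
            in_person = True
        else:
            iobs.append((word, tag, 'O'))
            in_person = False
    return iobs

# Made-up tagged sentence for illustration only
sample = [('Mary', 'NNP'), ('Ann', 'NNP'), ('met', 'VBD'),
          ('Pierre', 'NNP'), ('Vinken', 'NNP')]

for triple in person_iob(sample):
    print(triple)
# ('Mary', 'NNP', 'B-PERSON')
# ('Ann', 'NNP', 'I-PERSON')
# ('met', 'VBD', 'O')
# ('Pierre', 'NNP', 'B-PERSON')
# ('Vinken', 'NNP', 'O')
```

Because 'Vinken' is not in the name set, it receives the O tag and the PERSON chunk ends after 'Pierre'.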

Code #3 : Using PersonChunker class on the same tagged sentence

from nltk.corpus import treebank_chunk
# Assumes the PersonChunker class from Code #2 is saved in chunkers.py
from chunkers import sub_leaves, PersonChunker

chunker = PersonChunker()
print("Person name  : ",
      sub_leaves(chunker.parse(
          treebank_chunk.tagged_sents()[0]), 'PERSON'))

Output :

Person name  : [[('Pierre', 'NNP')]]
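
Code #1 and Code #3 import sub_leaves from a chunkers module that is not shown in this article. A minimal sketch of what such a helper might look like is given below, under the assumption that it simply collects the leaves of every subtree carrying the requested label; the hand-built tree is illustrative and the actual helper may differ.

```python
from nltk.tree import Tree

def sub_leaves(tree, label):
    """Return the leaves of every subtree of `tree` labelled `label`.

    Assumed behaviour, inferred from how the helper is called above.
    """
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]

# Small hand-built chunk tree for illustration (not parser output)
tree = Tree('S', [
    Tree('NAME', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
    (',', ','),
    ('will', 'MD'),
    ('join', 'VB'),
    Tree('NAME', [('Nov.', 'NNP')]),
])

print(sub_leaves(tree, 'NAME'))
# [[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]
```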
