
Named Entity Recognition

Named entity recognition (NER) is one of the most popular data preprocessing tasks. It involves identifying key information in the text and classifying it into a set of predefined categories. An entity is essentially the thing that is consistently talked about or referred to in the text.

NER is a subtask of Natural Language Processing (NLP).

At its core, NER is a two-step process; below are the two steps that are involved:

  • Detecting the entities from the text
  • Classifying them into different categories

Some of the most important categories in NER are:

  • Person
  • Organization
  • Place/location

Other common categories include the following (a small illustrative example follows these lists):

  • Date/time expressions
  • Numerical measurements (money, percent, weight, etc.)
  • E-mail addresses
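
To make these categories concrete, here is a purely illustrative sketch of what an NER system is expected to produce for a single sentence. The sentence and its annotations are made up by hand for this example and are not the output of any library.

Python3

# A hand-annotated example of the two NER steps on one sentence.
# The sentence and the labels are illustrative only.
sentence = "Sundar Pichai visited the Google office in London on Monday."

# Step 1: detect the entity spans in the text.
# Step 2: classify each span into one of the predefined categories.
expected_entities = [
    ("Sundar Pichai", "Person"),
    ("Google", "Organization"),
    ("London", "Place/location"),
    ("Monday", "Date/time"),
]

for span, category in expected_entities:
    print(span, "->", category)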

Ambiguity in Named Entities (NE)

  • For a person, the category definition is intuitively quite clear, but for computers there is some ambiguity in classification. Let's look at some ambiguous examples (a small model-based sketch follows this list):
    • England (Organization) won the 2019 World Cup vs. The 2019 World Cup happened in England (Location).
    • Washington (Location) is the capital of the US vs. The first president of the US was Washington (Person).
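
The sketch below shows how a statistical model resolves this kind of ambiguity from context. It assumes spaCy with the en_core_web_sm model (installed in the Implementation section below); the exact labels you get depend on the model and its version.

Python3

import spacy

nlp = spacy.load('en_core_web_sm')

# The same surface form can receive different labels depending on context.
sentences = ["The first president of the US was Washington.",
             "Washington is the capital of the US."]
for sent in sentences:
    doc = nlp(sent)
    print(sent, '->', [(ent.text, ent.label_) for ent in doc.ents])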

Methods of NER

  • One way is to train a multi-class classification model using classical machine learning algorithms, but this requires a large amount of labelled data. Besides labelling, the model also needs a deep understanding of context to deal with the ambiguity of sentences, which makes it a challenging task for a simple machine learning algorithm.
  • Another way is the Conditional Random Field (CRF), which is implemented by both the NLP Speech Tagger and NLTK. It is a probabilistic model that can be used to model sequential data such as words, and it can capture a deeper understanding of the context of a sentence. In this model, the input is a sequence of word vectors X = \left\{ \vec{x}_{1}, \vec{x}_{2}, \ldots, \vec{x}_{T} \right\} and the output is a label sequence y = \left\{ y_{1}, y_{2}, \ldots, y_{T} \right\}, whose conditional probability is modelled as (a CRF sketch follows this list):

p\left( y \mid \vec{x} \right) = \frac{1}{Z\left( \vec{x} \right)} \prod_{t = 1}^{T} \exp\left\{ \sum_{k = 1}^{K} \omega_{k} f_{k}\left( y_{t}, y_{t - 1}, \vec{x}_{t} \right) \right\}

where the f_{k} are feature functions with learned weights \omega_{k} and Z(\vec{x}) is a normalization factor.

  • Deep Learning Based NER: deep learning NER is considerably more accurate than the previous methods because it uses word embeddings, which capture the semantic and syntactic relationships between words. It can also learn topic-specific as well as high-level features automatically, which makes deep learning NER applicable to a wide range of tasks. Since deep learning can do most of the repetitive work itself, researchers can use their time more efficiently (a transformer-based sketch follows this list).
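
As a minimal sketch of the CRF approach, the snippet below uses the sklearn-crfsuite library (an assumption on my part, not mentioned in the article; install it with pip install sklearn-crfsuite). The feature function and the two hand-labelled toy sentences are made up purely for illustration.

Python3

import sklearn_crfsuite

def word_features(sent, i):
    # Simple per-token features f_k for a linear-chain CRF.
    word = sent[i][0]
    return {
        'lower': word.lower(),
        'is_title': word.istitle(),
        'is_digit': word.isdigit(),
        'prev': sent[i - 1][0].lower() if i > 0 else '<START>',
        'next': sent[i + 1][0].lower() if i < len(sent) - 1 else '<END>',
    }

# Toy training data: (token, BIO tag) pairs, made up for this example.
train_sents = [
    [('Washington', 'B-PER'), ('was', 'O'), ('the', 'O'),
     ('first', 'O'), ('president', 'O')],
    [('The', 'O'), ('capital', 'O'), ('is', 'O'), ('Washington', 'B-LOC')],
]

X_train = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
y_train = [[tag for _, tag in s] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))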
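
For the deep learning approach, a minimal sketch using the Hugging Face transformers library is shown below (also an assumption, not part of the original code; install it with pip install transformers). The model name dslim/bert-base-NER is just one publicly available pretrained token-classification model.

Python3

from transformers import pipeline

# Load a pretrained transformer-based NER model (the model name is an assumption).
ner = pipeline('ner', model='dslim/bert-base-NER', aggregation_strategy='simple')

for ent in ner('Washington was the first president of the US.'):
    print(ent['word'], ent['entity_group'], round(float(ent['score']), 3))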

Implementation

  • In this implementation, we will perform Named Entity Recognition using two different frameworks: spaCy and NLTK. The code can be run on Colab, but for visualization purposes I recommend a local environment. Both frameworks can be installed with pip, as shown at the top of the code below.
  • First, we perform Named Entity Recognition using spaCy.

Python3




# command to run before code
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
 
# imports and load spacy english language package
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
 
# Load the text and process it
# I copied the text from the Python wiki
text = ("Python is an interpreted, high-level and general-purpose programming language. "
        "Pythons design philosophy emphasizes code readability with "
        "its notable use of significant indentation. "
        "Its language constructs and object-oriented approach aim to "
        "help programmers write clear and "
        "logical code for small and large-scale projects")
# text2 = # copy the paragraphs from  https://www.python.org/doc/essays/
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list(doc.sents)
print(sentences)
# tokenization
for token in doc:
    print(token.text)
# print entities
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# now we use the displacy function on doc
displacy.render(doc, style='ent', jupyter=True)


[Python is an interpreted, high-level and general-purpose programming language.,
 Pythons design philosophy emphasizes code readability with its notable use of significant indentation.,
 Its language constructs and object-oriented approach aim to help programmers write clear and logical code for small and large-scale projects]
 # tokens
 Python
is
an
interpreted
,
high
-
level
and
general
-
purpose
programming
language
.
Pythons
design
philosophy
emphasizes
code
readability
with
its
notable
use
of
significant
indentation
.
Its
language
constructs
and
object
-
oriented
approach
aim
to
help
programmers
write
clear
and
logical
code
for
small
and
large
-
scale
projects
# named entity
[('Python', 0, 6, 'ORG')]

# here, ORG stands for Organization

spaCy entity tags rendered by displacy on doc2
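
Note that displacy.render(..., jupyter=True) only displays inside a notebook. If you run the script as a plain Python file in a local environment (as recommended above), a minimal alternative sketch is displacy.serve, which starts a small local web server (http://localhost:5000 by default); the example text here is my own.

Python3

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Python was created by Guido van Rossum and is maintained "
          "by the Python Software Foundation.")

# Serve the entity visualization locally instead of rendering it inline.
# Open the printed URL in a browser; stop the server with Ctrl+C.
displacy.serve(doc, style='ent')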

  • Below is a list of spaCy entity tags and their meanings:

spaCy Named Entity Recognition tags
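
Since the table of tags above is an image, a quick way to look up what a given label means programmatically is spacy.explain; the labels listed below are just a sample of common ones.

Python3

import spacy

# Print a short description for a few common spaCy entity labels.
for label in ['PERSON', 'ORG', 'GPE', 'LOC', 'DATE', 'TIME', 'MONEY', 'PERCENT']:
    print(label, '->', spacy.explain(label))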

  • Now we perform the Named Entity Recognition task using NLTK.

Python3




# import modules and download packages
import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('state_union')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
 
# process the text and print Named entities
# tokenization
train_text = state_union.raw()
 
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
# function
def get_named_entity():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=False)
            namedEnt.draw()
    except Exception as e:
        print(str(e))
get_named_entity()


A sentence example of NER, drawn by namedEnt.draw()
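
Note that namedEnt.draw() opens a GUI window, which will not work on Colab or a headless server. As a minimal text-only alternative (my own sketch, not part of the original tutorial), the chunk trees can be printed instead; it assumes the NLTK packages downloaded in the block above.

Python3

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Reuse the same corpus as above and print the chunk trees as text.
sample_text = state_union.raw("2006-GWBush.txt")
tokenized = PunktSentenceTokenizer(state_union.raw()).tokenize(sample_text)

for sentence in tokenized[:2]:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    print(nltk.ne_chunk(tagged, binary=False))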
