Named Entity Recognition

28 July 2024

1

The named entity recognition (NER) is one of the most popular data preprocessing task. It involves the identification of key information in the text and classification into a set of predefined categories. An entity is basically the thing that is consistently talked about or refer to in the text.

NER is the form of NLP.

At its core, NLP is just a two-step process, below are the two steps that are involved:

Detecting the entities from the text
Classifying them into different categories

Some of the categories that are the most important architecture in NER such that:

Person
Organization
Place/ location

Other common tasks include classifying of the following:

date/time.
expression
Numeral measurement (money, percent, weight, etc)
E-mail address

Ambiguity in NE

For a person, the category definition is intuitively quite clear, but for computers, there is some ambiguity in classification. Let’s look at some ambiguous example:
- England (Organisation) won the 2019 world cup vs The 2019 world cup happened in England(Location).
- Washington(Location) is the capital of the US vs The first president of the US was Washington(Person).

Methods of NER

One way is to train the model for multi-class classification using different machine learning algorithms, but it requires a lot of labelling. In addition to labelling the model also requires a deep understanding of context to deal with the ambiguity of the sentences. This makes it a challenging task for a simple machine learning algorithm.
Another way is that Conditional random field that is implemented by both NLP Speech Tagger and NLTK. It is a probabilistic model that can be used to model sequential data such as words. The CRF can capture a deep understanding of the context of the sentence. In this model, the input ${\text{X}} = \left \{ \vec{x}_{1} ,\vec{x}_{2} ,\vec{x}_{3}, \ldots,\vec{x}_{T} \right \}$

${\text{p}}\left( {{\text{y|}}{\mathbf{x}}} \right) = \frac{1}{{z\left( \vec{x} \right)}}\mathop \prod \limits_{t = 1}^{T} { \exp }\left\{ {\mathop \sum \limits_{k = 1}^{k} \omega_{k} f_{k} \left( {y_{t} ,y_{t - 1} ,\vec{x}_{t} } \right)} \right\}$

Deep Learning Based NER: deep learning NER is much more accurate than previous method, as it is capable to assemble words. This is due to the fact that it used a method called word embedding, that is capable of understanding the semantic and syntactic relationship between various words. It is also able to learn analyzes topic-specific as well as high level words automatically. This makes deep learning NER applicable for performing multiple tasks. Deep learning can do most of the repetitive work itself, hence researchers for example can use their time more efficiently.

Implementation

In this implementation, we will perform Named Entity Recognition using two different frameworks: Spacy and NLTK. This code can be run on colab, however for visualization purpose. I recommend the local environment. We can install the following frameworks using pip install
First, we performed Named Entity recognition using Spacy.

Python3

# command to run before code
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
 
# imports and load spacy english language package
import spacy
from spacy import displacy
from spacy import tokenizer
nlp = spacy.load('en_core_web_sm')
 
#Load the text and process it
# I copied the text from python wiki
text =("Python is an interpreted, high-level and general-purpose programming language
       "Pythons design philosophy emphasizes code readability with"
       "its notable use of significant indentation."
       "Its language constructs and object-oriented approach aim to"
       "help programmers write clear and"
       "logical code for small and large-scale projects")
# text2 = # copy the paragraphs from  https://www.python.org/doc/essays/
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list(doc.sents)
print(sentences)
# tokenization
for token in doc:
    print(token.text)
# print entities
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# now we use displaycy function on doc2
displacy.render(doc, style='ent', jupyter=True)

[Python is an interpreted, high-level and general-purpose programming language.,
 Pythons design philosophy emphasizes code readability with its notable use of significant indentation.,
 Its language constructs and object-oriented approachaim to help programmers write clear, logical code for small and large-scale projects]
 # tokens
 Python
is
an
interpreted
,
high
-
level
and
general
-
purpose
programming
language
.
Pythons
design
philosophy
emphasizes
code
readability
with
its
notable
use
of
significant
indentation
.
Its
language
constructs
and
object
-
oriented
approachaim
to
help
programmers
write
clear
,
logical
code
for
small
and
large
-
scale
projects
# named entity
[('Python', 0, 6, 'ORG')]

#here ORG stands for Organization

Spacy entity tags on doc2

Below is a list and their meaning of spacy entity tags:

Spacy Named Entity recognition Tags

Now we performed the named entity recognition task on NLTK.

Python3

# import modules and download packages
import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('state_union')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
 
# process the text and print Named entities
# tokenization
train_text = state_union.raw()
 
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
# function
def get_named _entity():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=False)
            namedEnt.draw()
    except:
        pass
get_named_entity()

A sentence Example of NER

Named Entity Recognition

Python3

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

Interview With Bill Reed – CEO at RemotelyMe by Shauli Zacks

Samsung’s Galaxy S24 FE plummets to the price it should have been at launch

Samsung’s new periscope camera fits telephoto lenses into an even slimmer design

OnePlus’ decision to ditch Samsung’s OLED screens could backfire in the US

Recent Comments

EDITOR PICKS

Interview With Bill Reed – CEO at RemotelyMe by Shauli Zacks

Samsung’s Galaxy S24 FE plummets to the price it should have been at launch

Samsung’s new periscope camera fits telephoto lenses into an even slimmer design

POPULAR POSTS

Interview With Bill Reed – CEO at RemotelyMe by Shauli Zacks

Samsung’s Galaxy S24 FE plummets to the price it should have been at launch

Samsung’s new periscope camera fits telephoto lenses into an even slimmer design

POPULAR CATEGORY

ABOUT US

FOLLOW US