Named entity recognition (NER) is one of the most popular data preprocessing tasks. It involves identifying key information in a text and classifying it into a set of predefined categories. An entity is the thing that is consistently talked about or referred to in the text.
NER is a form of natural language processing (NLP).
At its core, NER is a two-step process:
- Detecting the entities from the text
- Classifying them into different categories
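The two steps can be sketched with a toy, rule-based example. This is only an illustration under strong assumptions: candidate entities are runs of capitalized words, and the `GAZETTEER` lookup table, the function names and the example sentence are all hypothetical. Real NER systems learn both steps from labeled data.

```python
import re

# Hypothetical, hand-written lookup table (a "gazetteer") used for illustration only.
GAZETTEER = {
    "Barack Obama": "Person",
    "Google": "Organization",
    "Paris": "Place/location",
}


def detect_entities(text):
    """Step 1: detect candidate spans (here: runs of capitalized words)."""
    pattern = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]


def classify_entity(span):
    """Step 2: assign a category, or None if the span is not in the table."""
    return GAZETTEER.get(span)


def toy_ner(text):
    """Run detection, then classification, and keep only the known entities."""
    results = []
    for span, start, end in detect_entities(text):
        label = classify_entity(span)
        if label is not None:
            results.append((span, start, end, label))
    return results


sentence = "In 2015, Barack Obama visited Google in Paris."
for span, start, end, label in toy_ner(sentence):
    print(span, start, end, label)
```

Note that the capitalized word "In" is also detected as a candidate in step 1 and is only discarded in step 2 because it is not in the lookup table, which hints at why simple rules are not enough in practice.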
Some of the most important categories in NER are:
- Person
- Organization
- Place/location
Other common tasks include classifying the following:
- Date/time expressions
- Numerical measurements (money, percent, weight, etc.)
- E-mail address
Ambiguity in NER
- For a person, the category definition is intuitively quite clear, but for computers, there is some ambiguity in classification. Let’s look at some ambiguous examples:
- "England (Organisation) won the 2019 World Cup" vs. "The 2019 World Cup happened in England (Location)".
- "Washington (Location) is the capital of the US" vs. "The first president of the US was Washington (Person)".
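The examples above show that the entity string alone does not determine the category; the surrounding words do. The following is a minimal sketch of that idea, assuming a tiny hand-picked set of context cue words (`CONTEXT_CUES` and `guess_label` are hypothetical names, and the cues only cover these four sentences). A trained model would learn such cues from data rather than from a fixed table.

```python
# Hypothetical cue words, hand-picked for the four example sentences only.
CONTEXT_CUES = {
    "Organisation": {"won"},
    "Location": {"in", "happened", "capital"},
    "Person": {"president"},
}


def guess_label(sentence):
    """Pick the label whose cue words overlap most with the sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    scores = {label: len(words & cues) for label, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


examples = [
    ("England", "England won the 2019 world cup."),
    ("England", "The 2019 world cup happened in England."),
    ("Washington", "Washington is the capital of the US."),
    ("Washington", "The first president of the US was Washington."),
]

for entity, sentence in examples:
    print(entity, "->", guess_label(sentence))
```

The same word receives different labels depending on the sentence it appears in, which is exactly the behaviour a dictionary lookup on the word alone cannot provide.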
Methods of NER
- One way is to train a model for multi-class classification using different machine learning algorithms, but this requires a lot of labelled data. In addition to labelling, the model also requires a deep understanding of context to deal with the ambiguity of sentences. This makes it a challenging task for a simple machine learning algorithm.
- Another way is to use a conditional random field (CRF), which is implemented by both NLP Speech Tagger and NLTK. It is a probabilistic model that can be used to model sequential data such as words. The CRF can capture a deep understanding of the context of the sentence. In this model, the input
- Deep learning based NER: deep learning NER is much more accurate than the previous methods, as it is capable of assembling words. This is because it uses a method called word embedding, which is capable of capturing the semantic and syntactic relationships between words. It is also able to learn and analyze topic-specific as well as high-level words automatically. This makes deep learning NER applicable to multiple tasks. Deep learning can do most of the repetitive work itself, so researchers, for example, can use their time more efficiently.
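Whichever method is used, sequence models such as CRFs and neural taggers are usually trained on one label per token. A common labelling convention is BIO (Begin, Inside, Outside). The sketch below shows how entity spans can be turned into BIO tags and how simple per-token context features, of the kind often fed to a CRF, can be built. The function names, the `PER`/`LOC` tag names, the feature set and the example sentence are assumptions for illustration, not part of any particular library.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def token_features(tokens, i):
    """Build a small feature dictionary for token i, including its neighbours."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# Hand-labelled example sentence (hypothetical training instance).
tokens = ["George", "Washington", "was", "the", "first",
          "president", "of", "the", "US"]
spans = [(0, 2, "PER"), (8, 9, "LOC")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)

print(token_features(tokens, 1))
```

The `prev` and `next` features are what give a sequence model access to context, which is needed to resolve ambiguous names such as "Washington".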
Implementation
- In this implementation, we will perform named entity recognition using two different frameworks: spaCy and NLTK. This code can be run on Colab; however, for visualization purposes, I recommend a local environment. We can install both frameworks using pip install.
- First, we perform named entity recognition using spaCy.
```python
# Commands to run before the code (notebook cell syntax):
# !pip install spacy
# !pip install nltk
# !python -m spacy download en_core_web_sm

# Imports and load the spaCy English language package
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

# Load the text and process it
# I copied the text from the Python wiki
text = (
    "Python is an interpreted, high-level and general-purpose programming language. "
    "Pythons design philosophy emphasizes code readability with "
    "its notable use of significant indentation. "
    "Its language constructs and object-oriented approach aim to "
    "help programmers write clear, "
    "logical code for small and large-scale projects"
)
# text2 = # copy the paragraphs from https://www.python.org/doc/essays/
doc = nlp(text)
# doc2 = nlp(text2)

# Sentence segmentation
sentences = list(doc.sents)
print(sentences)

# Tokenization
for token in doc:
    print(token.text)

# Print entities as (text, start offset, end offset, label)
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# Now we use the displacy function on doc (renders inside a Jupyter notebook)
displacy.render(doc, style="ent", jupyter=True)
```
```
[Python is an interpreted, high-level and general-purpose programming language., Pythons design philosophy emphasizes code readability with its notable use of significant indentation., Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects]
# tokens
Python is an interpreted , high - level and general - purpose programming language .
Pythons design philosophy emphasizes code readability with its notable use of significant indentation .
Its language constructs and object - oriented approach aim to help programmers write clear , logical code for small and large - scale projects
# named entity
[('Python', 0, 6, 'ORG')]  # here ORG stands for Organization
```
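The two numbers in each entity tuple are character offsets into the original text (start inclusive, end exclusive). As a small sketch of how they can be used without a notebook, the hypothetical helper `annotate` below marks each entity inline. The entity list is copied from the output above rather than recomputed, so this snippet runs without spaCy.

```python
text = ("Python is an interpreted, high-level and general-purpose "
        "programming language.")

# Entity tuple copied from the output shown above: (text, start, end, label).
ents = [("Python", 0, 6, "ORG")]


def annotate(text, ents):
    """Wrap every entity span as [entity](LABEL), working right to left
    so that earlier offsets stay valid while the string grows."""
    out = text
    for _ent_text, start, end, label in sorted(ents, key=lambda e: e[1], reverse=True):
        out = out[:start] + f"[{out[start:end]}]({label})" + out[end:]
    return out


# The offsets slice the entity text back out of the original string.
for ent_text, start, end, label in ents:
    print(repr(text[start:end]), label)

print(annotate(text, ents))
```

This is a plain-text alternative to the displacy rendering, which can be handy in a terminal or in log files.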
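- Below is a list of spaCy entity tags and their meanings (for English models such as en_core_web_sm):
  - PERSON: people, including fictional
  - NORP: nationalities or religious or political groups
  - FAC: buildings, airports, highways, bridges, etc.
  - ORG: companies, agencies, institutions, etc.
  - GPE: countries, cities, states
  - LOC: non-GPE locations, mountain ranges, bodies of water
  - PRODUCT: objects, vehicles, foods, etc. (not services)
  - EVENT: named hurricanes, battles, wars, sports events, etc.
  - WORK_OF_ART: titles of books, songs, etc.
  - LAW: named documents made into laws
  - LANGUAGE: any named language
  - DATE: absolute or relative dates or periods
  - TIME: times smaller than a day
  - PERCENT: percentage, including "%"
  - MONEY: monetary values, including unit
  - QUANTITY: measurements, as of weight or distance
  - ORDINAL: "first", "second", etc.
  - CARDINAL: numerals that do not fall under another type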
- Now we perform the named entity recognition task using NLTK.
```python
# Import modules and download packages
# Note: recent NLTK releases may ask for differently named resources;
# if so, the error message names the resource to download.
import nltk

nltk.download("words")
nltk.download("punkt")
nltk.download("maxent_ne_chunker")
nltk.download("averaged_perceptron_tagger")
nltk.download("state_union")

from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Process the text and display named entities
# Sentence tokenization with a tokenizer trained on the corpus
train_text = state_union.raw()
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)


# Tokenize, POS-tag and chunk each sentence, then draw its entity tree
def get_named_entity():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            named_ent = nltk.ne_chunk(tagged, binary=False)
            named_ent.draw()  # opens a window for each sentence
    except Exception as e:
        # Report the error instead of silently ignoring it
        print(e)


get_named_entity()
```
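The NLTK code above only draws each tree in a window. To work with the entities as data, one option is to convert each tree into (word, POS tag, IOB tag) triples with `nltk.chunk.tree2conlltags` and then group the tagged tokens. The sketch below shows only the grouping step, on hand-written triples that imitate that format; they are illustrative input, not actual NLTK output, and `collect_entities` is a hypothetical helper name.

```python
def collect_entities(iob_triples):
    """Group (word, pos, iob) triples into (entity text, label) pairs."""
    entities = []
    current_words, current_label = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_label))

    for word, _pos, iob in iob_triples:
        if iob.startswith("B-") or (iob.startswith("I-") and current_label != iob[2:]):
            flush()                              # a new entity starts here
            current_words, current_label = [word], iob[2:]
        elif iob.startswith("I-"):
            current_words.append(word)           # continue the current entity
        else:
            flush()                              # "O": outside any entity
            current_words, current_label = [], None
    flush()
    return entities


# Hand-written sample in the (word, POS, IOB) format; not real model output.
sample = [
    ("George", "NNP", "B-PERSON"),
    ("Bush", "NNP", "I-PERSON"),
    ("spoke", "VBD", "O"),
    ("in", "IN", "O"),
    ("Washington", "NNP", "B-GPE"),
]

print(collect_entities(sample))
```

With NLTK installed, the `sample` list could be replaced by the result of `nltk.chunk.tree2conlltags(named_ent)` for each chunked sentence.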