This article was published as a part of the Data Science Blogathon
INTRODUCTION
Artificial Intelligence (AI) has been a trendy term among people for many years. Earlier, when we heard the term “AI”, we could only think of robots. However, AI is not limited to robots; nowadays, almost every electronic device we use has some AI associated with it, be it smartphones, smart TVs, refrigerators, or air conditioners. AI essentially means that a machine can make decisions without human intervention. For example, a refrigerator can control its cooling based on the number of items present in it, and an air conditioner can adjust its settings on its own based on the outside temperature. In smartphones, almost every application uses AI in some way.
Now that we have a basic understanding of AI, let’s talk about NLP. NLP stands for Natural Language Processing. It is the branch of artificial intelligence through which computers can understand and generate human language. Through NLP, a computer can understand text and speech, preprocess it, apply algorithms and different ML or DL models to it, and generate output as per our needs.
NLP in data science
NLP plays a very important role in data science. According to IDC, 80 per cent of the world’s data is in unstructured form, i.e. in the form of text, audio, video, and images. Therefore, NLP becomes crucial for analyzing text and audio.
NLP is used in autocomplete search, in language translators that convert text from one language to another, and in building chatbots that understand our queries and respond based on that understanding. Voice assistants such as Alexa and Siri are built on top of NLP, and even online grammar-checking tools use NLP. NLP is also very successful in sentiment analysis, where we can understand the mood of users from their comments on social media platforms.
In this blog, I will cover the basics of NLP, and later I will apply NLP in Python to PM Modi’s Independence Day speech of 2020 and extract the important topics from it.
Important terms and topics
I will explain some of the terms and topics that will help in understanding the implementation of NLP.
Tokenization
Before understanding what tokenization is, we need to understand the layout of texts. A collection of texts stored in multiple files is called a corpus. The text stored in a single file, such as a .txt file, is called a document. A document is divided into several paragraphs, and a paragraph is divided into several sentences. A sentence is made up of a group of words and phrases; these units are called tokens.
Corpus> Documents> Paragraphs> Sentences> Tokens
Tokenization is the process of splitting text into smaller chunks: paragraphs are divided into sentences, and sentences are divided into phrases or words.
Example: consider the sentence “Ram is a boy”.
After tokenization, we get [‘Ram’, ‘is’, ‘a’, ‘boy’].
Tokenization is an important process in data cleaning.
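As a quick illustration, here is a minimal tokenization sketch using NLTK (the example sentence is my own, and it assumes the punkt tokenizer data has been downloaded):

import nltk
nltk.download('punkt')  # sentence/word tokenizer models, needed only once

text = "Ram is a boy. He lives in Delhi."
sentences = nltk.sent_tokenize(text)       # split the text into sentences
tokens = nltk.word_tokenize(sentences[0])  # split the first sentence into tokens
print(sentences)  # ['Ram is a boy.', 'He lives in Delhi.']
print(tokens)     # ['Ram', 'is', 'a', 'boy', '.']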
Normalization
Normalization means reducing a word to its base or root form. A word can be described in 3 parts:
prefix + root form + suffix
In normalization, we remove the prefix and suffix from the word to convert it into its root form.
Normalization is basically done in 2 ways:
Stemming
It is the process of reducing inflected words to their base form. After stemming, the resulting base word may or may not have a meaning.
Example – stemming will reduce the words “history” and “historical” to truncated stems such as “histori”, which need not be valid words.
Lemmatization
Lemmatization is an advanced form of stemming. It makes sure that the resulting base word is a meaningful dictionary word.
Example – lemmatization will convert the words “history” and “historical” to the meaningful base word “history”.
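To see the difference in practice, here is a small sketch with NLTK’s PorterStemmer and WordNetLemmatizer (the example words are my own; it assumes the wordnet data has been downloaded):

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # dictionary used by the lemmatizer, needed only once
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["histories", "studies", "running"]:
    # stems can be truncated non-words (e.g. 'histori'); lemmas are valid words
    print(word, "| stem:", ps.stem(word), "| lemma:", lemmatizer.lemmatize(word))

print(lemmatizer.lemmatize("running", pos="v"))  # 'run' (the part-of-speech tag matters here)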
Stop Words
Stop words are words that do not add much value to the text on which we are applying NLP. They are common words used everywhere to maintain the grammatical structure of sentences.
Ex- I, me, the, it, etc.
Removing stop words from the text is also a process of data cleaning.
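A minimal stop-word removal sketch with NLTK could look like this (the sample sentence is my own; it assumes the stopwords and punkt data have been downloaded):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

tokens = nltk.word_tokenize("I think it is the best movie of the year")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # words such as 'I', 'it', 'is', 'the', 'of' are removed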
Bag of Words
The tokenized and lemmatized text can be converted into a bag of words: a dictionary whose key-value pairs are the words and their frequencies in the corpus.
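For illustration, a tiny bag of words can be built with collections.Counter (the tokens below are made up, not taken from the speech):

from collections import Counter

tokens = ["india", "country", "india", "new", "country", "india"]
bag_of_words = Counter(tokens)      # maps each word to its frequency
print(bag_of_words)                 # Counter({'india': 3, 'country': 2, 'new': 1})
print(bag_of_words.most_common(2))  # [('india', 3), ('country', 2)]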
Word2vec
Word2vec is a method in which we convert words into vectors such that the semantic relationships between the words are preserved. Each word is represented as a dense vector (100 dimensions by default in gensim), and words whose vectors are close to each other are semantically similar.
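As a rough sketch, training a Word2vec model with gensim on a toy corpus could look like this (the sentences are illustrative only; real corpora need far more text for meaningful vectors):

from gensim.models import Word2Vec

toy_sentences = [["india", "celebrates", "independence"],
                 ["india", "is", "a", "country"],
                 ["people", "love", "their", "country"]]
model = Word2Vec(toy_sentences, vector_size=100, min_count=1)  # 100-dimensional vectors
print(model.wv["india"].shape)                   # (100,)
print(model.wv.similarity("india", "country"))   # cosine similarity between the two vectors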
LDA
LDA stands for Latent Dirichlet Allocation. It is an unsupervised algorithm used to discover the topics present in a collection of texts. LDA assumes that the words in a text are somehow interrelated and that each document is a mixture of topics.
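To make the idea concrete, here is a toy LDA sketch with gensim on two tiny hand-made documents (the real speech is used in the implementation below):

import gensim
from gensim import corpora

docs = [["health", "hospital", "doctor", "health"],
        ["road", "railway", "infrastructure", "road"]]
dictionary = corpora.Dictionary(docs)            # word <-> id mapping
bow = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words per document
lda = gensim.models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for idx, topic in lda.print_topics(-1):
    print(idx, topic)  # each topic is printed as a weighted mix of words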
I will implement this algorithm for topic detection on PM Modi’s Independence Day speech.
Procedure
First of all, the text is stored in a variable, either by reading a file or through web scraping. Then we clean the data with the help of tokenization, stemming or lemmatization, and stop-word removal. The cleaned text is converted into a vectorized form through various techniques; bag of words and Word2vec are two of them. Lastly, we implement LDA on the data.
Implementation in Python
Now, I will walk you through the code.
Libraries Used
The re library is used to filter the text based on regular expressions.
The nltk library provides most of the functions needed for NLP.
The gensim library is used to implement Word2vec and LDA.
pip install nltk
pip install gensim
nltk.download()
In the first two lines above, we install the nltk and gensim libraries. The third line, run in Python after importing nltk, downloads all the data packages (corpora, tokenizers, etc.) that nltk needs.
from collections import Counter
import re
import nltk  # needed for nltk.sent_tokenize below
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import gensim
from gensim.models import Word2Vec

ps = PorterStemmer()
wordnet = WordNetLemmatizer()
In the above code, we import all the libraries and modules that will be used in our code and also initialize the stemmer and lemmatizer classes.
file = open('15Aug.txt', 'r', encoding="utf8")
paragraph = file.read()
file.close()
print(paragraph)
Then, we store the text of PM Modi’s speech in the paragraph variable.
sentences = nltk.sent_tokenize(paragraph)
corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep only alphabetic characters
    review = review.lower()
    review = review.split()
    review = [wordnet.lemmatize(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
print(sentences[1], '\n', corpus[1])
#OUTPUT
Today, we are able to live in an independent India because of the sacrifices of millions of sons and daughters of Mother India.
today able live independent india sacrifice million son daughter mother india
Here, we divide the paragraph into sentences. Then, looping through each sentence, we remove all characters other than alphabets, split the sentence into words, lemmatize each word, and remove the stop words. Finally, each cleaned sentence is stored in the corpus list.
st = ""
for sentence in corpus:
    st = st + sentence + " "  # add a space so words from adjacent sentences do not merge
tokens = st.split()
a = Counter(tokens)
d = a.most_common(20)  # the 20 most frequent words
print(d)
#OUTPUT
[('india', 70), ('country', 53), ('new', 38), ('self', 37), ('one', 34), ('countryman', 31), ('world', 29), ('also', 28), ('get', 25), ('year', 23), ('crore', 22), ('reliant', 21), ('sector', 20), ('people', 19), ('development', 19), ('work', 19), ('corona', 18), ('would', 18), ('make', 17), ('every', 17)]
Here, we have generated the top 20 words from PM Modi’s speech along with their frequencies. Since it was an Independence Day speech, the most spoken word was “india”, followed by words like “country”, “new”, “self”, and “one”.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

dic = {}
for row in d:
    dic[row[0]] = row[1]
word_freq = dic

# Generate the word cloud from the word frequencies
wc = WordCloud(width=400, height=200, max_words=100, background_color='white').generate_from_frequencies(word_freq)
plt.figure(figsize=(12, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
#OUTPUT
The above code generates a beautiful word cloud of the 20 most frequent words in the speech.
dataset = [d.split() for d in corpus]    # list of tokenized sentences
model = Word2Vec(dataset, min_count=1)
words = list(model.wv.index_to_key)      # vocabulary learned by the model
vector = model.wv['india']               # vector for the word 'india'
similar = model.wv.most_similar('india')
print(similar)
#OUTPUT
[('rupee', 0.37193065881729126), ('asia', 0.3691994547843933), ('corona', 0.32324033975601196), ('similarly', 0.30869385600090027), ('going', 0.29958200454711914), ('year', 0.2857002019882202), ('creativity', 0.28366604447364807), ('vinoba', 0.277676522731781), ('pride', 0.2725961208343506), ('towards', 0.27066749334335327)]
In the above code, we implement Word2Vec. The vector variable stores the vector representation of the word “india”, and the most_similar function finds the words most similar to “india”. As we can see, words such as “rupee”, “asia”, “corona”, “creativity”, and “pride” are related to the word “india”.
dictionary = gensim.corpora.Dictionary(dataset)
dictionary.filter_extremes(no_below=10, no_above=0.1, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in dataset]
With the help of the gensim Dictionary, we create a mapping of words along with their frequencies, and then filter out the extreme words, i.e. words that occur very frequently and words that occur very rarely. The doc2bow function converts each document into the bag-of-words format, i.e. a list of (token_id, token_count) tuples.
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=5,
                                       id2word=dictionary,
                                       passes=50,
                                       workers=2,
                                       alpha=0.5,
                                       eta=0.5)
print(lda_model)
#OUTPUT
LdaModel(num_terms=64, num_topics=5, decay=0.5, chunksize=2000)
Finally, we build the lda_model with the gensim.models library. Parameter optimization is one of the most important steps in getting meaningful topics. num_topics defines the number of topics we want; here I have set the value to 5. id2word is a mapping from word ids (integers) to words (strings). alpha and eta control whether each document contains only a few topics or a mixture of many topics, and whether each topic is made up of a few words or a mixture of many words, respectively.
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
#OUTPUT
Lastly, we extract the topics from the LDA model we have trained. For the first topic, words such as “self”, “reliant”, “year”, “health”, and “indian” are present, which basically means PM Modi was talking about a self-reliant India, particularly in the health sector. Similarly, the second topic talks about development and infrastructure. With further parameter optimization and analysis, we can extract several other topics.
There are several other algorithms for topic detection. I found LDA quite easy and effective, so I implemented it in this project.
Summary
We started with an intuition of AI and NLP. Then we talked about how helpful NLP is in data science, along with its applications. Next, we discussed the important terms and topics one needs to know to implement NLP. Finally, I showed a step-by-step guide to implementing NLP on PM Modi’s Independence Day speech.
About Me
Hi! I am Ashish Choudhary. I am pursuing a B.Tech from JC Bose University of Science & Technology. Data Science is my passion, and I feel proud to write interesting blogs related to it. Feel free to contact me on LinkedIn.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.