Today, organizations of all kinds, be it online shopping platforms, government and private-sector organizations, the catering and tourism industry, or other institutions that offer customer services, are concerned about their customers and ask for feedback every time we use their services. These companies may receive enormous amounts of user feedback every single day, and it would be quite tedious for management to sit and analyze each piece of it.
But technology today has reached a point where it can take over many tasks once done by humans, and the field that makes this happen is Machine Learning. Machines have become capable of understanding human languages using Natural Language Processing (NLP), and a great deal of research is being done in the field of text analytics.
One such application of text analytics and NLP is a feedback summarizer, which condenses and shortens the text of user feedback. This can be done with an algorithm that reduces a body of text while keeping its original meaning, or that gives a useful insight into the original text.
If you’re interested in Data Analytics, you will find learning about Natural Language Processing very useful. Python provides immense library support for NLP. We will be using NLTK, the Natural Language Toolkit, which will serve our purpose well.
Install the NLTK module on your system using:
sudo pip install nltk
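If this is your first time using NLTK, you will also need to download the resources the summarizer relies on: the English stop-word list and the Punkt tokenizer models. This is a one-time step, run from a Python shell:

import nltk
nltk.download('stopwords')  # stop-word lists used by stopwords.words("english")
nltk.download('punkt')      # tokenizer models used by word_tokenize / sent_tokenize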
Let’s understand the steps:
Step 1: Importing required libraries
There are two NLTK modules that will be necessary for building an efficient feedback summarizer.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
Terms used:
- Corpus: A collection of text. It could be a data set of anything containing text, be it poems by a certain poet, bodies of work by a certain author, etc. In this case, we are going to use a data set of predetermined stop words.
- Tokenizers: These divide a text into a series of tokens. There are three main tokenizers: word, sentence, and regex. We will use only the word and sentence tokenizers.
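A quick, minimal illustration of the two tokenizers (the sample string here is just an example):

from nltk.tokenize import word_tokenize, sent_tokenize

sample = "Tokenizers split text. They are easy to use."
# Split into sentences
print(sent_tokenize(sample))
# ['Tokenizers split text.', 'They are easy to use.']
# Split into word-level tokens (punctuation becomes its own token)
print(word_tokenize(sample))
# ['Tokenizers', 'split', 'text', '.', 'They', 'are', 'easy', 'to', 'use', '.']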
Step 2: Removing stop words and storing them in a separate array of words
Stop word
Any word (like is, a, an, the, or for) that does not add value to the meaning of a sentence. For example, let’s say we have the sentence:
GeeksForGeeks is one of the most useful websites for competitive programming.
After removing stop words, we can reduce the number of words while preserving the meaning, as follows:
['GeeksForGeeks', 'one', 'useful', 'website', 'competitive', 'programming', '.']
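A minimal sketch of this filtering step, using the example sentence above:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "GeeksForGeeks is one of the most useful websites for competitive programming."
stopWords = set(stopwords.words("english"))

# Keep every token whose lower-cased form is not a stop word
filteredWords = [word for word in word_tokenize(text)
                 if word.lower() not in stopWords]
print(filteredWords)
# ['GeeksForGeeks', 'one', 'useful', 'websites', 'competitive', 'programming', '.']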
Step 3: Create a frequency table of words
A Python dictionary will keep a record of how many times each word appears in the feedback after the stop words are removed. We can then use this dictionary over every sentence to determine which sentences have the most relevant content in the overall text.
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
freqTable = dict()
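The table is then filled by lower-casing each token, skipping stop words, and counting everything else; this is the same loop that appears in the complete implementation below:

for word in words:
    word = word.lower()
    if word in stopWords:
        continue               # ignore stop words entirely
    if word in freqTable:
        freqTable[word] += 1   # word seen before: increment its count
    else:
        freqTable[word] = 1    # first occurrence of the word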
Step 4: Assign a score to each sentence depending on the words it contains and the frequency table
We can use the sent_tokenize() method to create the array of sentences. Secondly, we will need a dictionary to keep the score of each sentence; we will later go through this dictionary to generate the summary.
sentences = sent_tokenize(text)
sentenceValue = dict()
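Each sentence is then scored by summing the frequencies of the table words that occur in it (again, the same loop as in the complete implementation below). Note that `word in sentence.lower()` is a substring check, so a short table word can also match inside a longer word; tokenizing each sentence first would give a stricter match:

for sentence in sentences:
    for word, freq in freqTable.items():
        # Substring check: add the word's frequency to every
        # sentence that contains it
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq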
Step 5: Assign a threshold score against which to compare the sentences within the feedback
A simple approach to compare our scores would be to find the average score of a sentence. The average itself can be a good threshold.
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))
Apply the threshold value and store the sentences, in order, in the summary.
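In the complete implementation below, a sentence is kept only if its score exceeds 1.2 times the average; the multiplier simply makes the cut-off stricter than the plain average:

summary = ''
for sentence in sentences:
    # Keep sentences that score well above the average
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence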
Code: Complete implementation of the text summarizer using Python
# importing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Input text - to summarize
text = """ """

# Tokenizing the text
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

# Creating a frequency table to keep the
# score of each word
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

# Creating a dictionary to keep the score
# of each sentence
sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))

# Storing sentences into our summary.
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence

print(summary)
Input:
There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. It’s good to understand Cosine similarity to make the best use of the code you are going to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Its measures cosine of the angle between vectors. The angle will be 0 if sentences are similar.
Output:
There are many techniques available to generate extractive summarization. Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.