Saturday, November 16, 2024
Google search engine
HomeLanguagesPython | Text Summarizer

Python | Text Summarizer

Today various organizations, be it online shopping, government and private sector organizations, catering and tourism industry or other institutions that offer customer services are concerned about their customers and ask for feedback every single time we use their services. Consider the fact, that these companies may be receiving enormous amounts of user feedback every single day. And it would become quite tedious for the management to sit and analyze each of those.
But, the technologies today have reached to an extent where they can do all the tasks of human beings. And the field which makes these things happen is Machine Learning. The machines have become capable of understanding human languages using Natural Language Processing. Today researches are being done in the field of text analytics.
And one such application of text analytics and NLP is a Feedback Summarizer which helps in summarizing and shortening the text in the user feedback. This can be done an algorithm to reduce bodies of text but keeping its original meaning, or giving a great insight into the original text.

If you’re interested in Data Analytics, you will find learning about Natural Language Processing very useful. Python provides immense library support for NLP. We will be using NLTK – the Natural Language Toolkit. which will serve our purpose right.

Install NLTK module on your system using :
sudo pip install nltk

Let’s understand the steps –
Step 1: Importing required libraries

There are two NLTK libraries that will be necessary for building an efficient feedback summarizer.




from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize


Terms Used :

  • Corpus
    Corpus means a collection of text. It could be data sets of anything containing texts be it poems by a certain poet, bodies of work by a certain author, etc. In this case, we are going to use a data set of pre-determined stop words.
  • Tokenizers
    it divides a text into a series of tokens. There are three main tokenizers – word, sentence, and regex tokenizer. We will only use the word and sentence tokenizer

Step 2: Removing Stop Words and storing them in a separate array of words.

Stop Word

Any word like (is, a, an, the, for) that does not add value to the meaning of a sentence. For example, let’s say we have the sentence

 GeeksForGeeks is one of the most useful websites for competitive programming.

After removing stop words, we can narrow the number of words and preserve the meaning as follows:

['GeeksForGeeks', 'one', 'useful', 'website', 'competitive', 'programming', '.']

Step 3: Create a frequency table of words
A python dictionary that’ll keep a record of how many times each word appears in the feedback after removing the stop words.we can use the dictionary over every sentence to know which sentences have the most relevant content in the overall text.




stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
freqTable = dict()


Step 4: Assign score to each sentence depending on the words it contains and the frequency table

We can use the sent_tokenize() method to create the array of sentences. Secondly, we will need a dictionary to keep the score of each sentence, we will later go through the dictionary to generate the summary.




sentences = sent_tokenize(text)
sentenceValue = dict()


Step 5: Assign a certain score to compare the sentences within the feedback.
A simple approach to compare our scores would be to find the average score of a sentence. The average itself can be a good threshold.




sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
average = int(sumValues / len(sentenceValue))


Apply the threshold value and store sentences in order into the summary.

Code : Complete implementation of Text Summarizer using Python




# importing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
   
# Input text - to summarize 
text = """ """
   
# Tokenizing the text
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
   
# Creating a frequency table to keep the 
# score of each word
   
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
   
# Creating a dictionary to keep the score
# of each sentence
sentences = sent_tokenize(text)
sentenceValue = dict()
   
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq
   
   
   
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
   
# Average value of a sentence from the original text
   
average = int(sumValues / len(sentenceValue))
   
# Storing sentences into our summary.
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)


Input:

There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. It’s good to understand Cosine similarity to make the best use of the code you are going to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Its measures cosine of the angle between vectors. The angle will be 0 if sentences are similar.

Output

There are many techniques available to generate extractive summarization. Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.

RELATED ARTICLES

Most Popular

Recent Comments