In this article, we are going to tokenize sentence, paragraph, and webpage contents using the NLTK toolkit in the Python environment. We will then remove stop words and apply stemming to the contents of the sentence, paragraph, and webpage. Finally, we will compute the frequency of words after removing stop words and stemming.
Modules Needed
bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. To install this library, type the following command in the IDE/terminal.
pip install bs4
urllib: The urllib package is Python's library for handling Uniform Resource Locators (URLs); it is used to fetch webpages. It is part of the Python standard library, so it does not need to be installed separately.
nltk: The NLTK library is a comprehensive toolkit for Natural Language Processing in Python; it provides the building blocks for the entire NLP pipeline used here. To install this library, type the following command in the IDE/terminal.
pip install nltk
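textblob: TextBlob is a Python library for processing textual data; it is used in Step 8 below for spelling correction. To install this library, type the following command in the IDE/terminal.
pip install textblob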
Stepwise Implementation:
Step 1:
- Save the files sentence.txt and paragraph.txt in the current directory.
- Open the files using the open() method and store them in file objects named file1 and file2.
- Read the file contents using the read() method, which stores the entire contents of each file in a single string.
- Display the file contents.
- Close the file objects.
Python
import nltk

s = input('Enter the file name which contains a sentence: ')
file1 = open(s)
sentence = file1.read()
file1.close()

p = input('Enter the file name which contains a paragraph: ')
file2 = open(p)
paragraph = file2.read()
file2.close()

# display the file contents
print(sentence)
print(paragraph)
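As a side note, a more idiomatic way to read a file is with a context manager, which closes the file automatically; a minimal sketch using the same variable names as above:

Python

# equivalent to the open()/read()/close() sequence above
with open(s) as file1:
    sentence = file1.read()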
Step 2:
- Import urllib.request for opening and reading webpage contents.
- From bs4, import BeautifulSoup, which allows us to pull data out of HTML documents.
- Using urllib.request, send a request to the server of that particular URL.
- The server responds and returns the HTML document.
- Read the contents of the webpage using the read() method.
- Pass the webpage data into BeautifulSoup, which helps us organize and format the messy web data by fixing bad HTML and presenting it to us in easily traversable structures.
Python
import urllib.request
from bs4 import BeautifulSoup

url = input('Enter URL of Webpage: ')
print('\n')

url_request = urllib.request.Request(url)
url_response = urllib.request.urlopen(url_request)
webpage_data = url_response.read()

soup = BeautifulSoup(webpage_data, 'html.parser')
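Some servers reject requests that carry urllib's default User-Agent. If the request above fails with an HTTP error, passing a browser-like header often helps; the header value below is just an illustration:

Python

# variant of the request above with an explicit User-Agent header
url_request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})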
Step 3:
- To simplify the task of tokenizing, we are going to extract only a portion of the HTML page.
- Using the BeautifulSoup object, extract all the paragraph tags present in the HTML document.
- soup('p') returns a list of items containing all the paragraph tags present on the webpage.
- Create an empty string named web_page_data.
- For each tag present in the list, concatenate the text enclosed between the tags to that string.
Python
web_page_paragraph_contents = soup('p')
web_page_data = ''
for para in web_page_paragraph_contents:
    web_page_data = web_page_data + str(para.text)
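Calling the soup object directly is BeautifulSoup shorthand for find_all(). A tiny standalone check on a toy document (the HTML string here is purely illustrative):

Python

from bs4 import BeautifulSoup

html = '<p>First</p><p>Second</p>'  # toy document for illustration
demo = BeautifulSoup(html, 'html.parser')
print(demo('p'))  # same result as demo.find_all('p')
# [<p>First</p>, <p>Second</p>]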
Step 4:
- Using re.sub(), replace the punctuation characters with an empty string.
- re.sub() takes a regular expression, a replacement string, and the input string as arguments, and returns the modified string (every match of the pattern in the input string is replaced with the replacement string).
- ^ – inside square brackets, negates the character class, so the pattern matches any character NOT listed in the brackets.
- \w – matches any word character (letters, digits, and the underscore), and \s – matches any whitespace character. Together, [^\w\s] therefore matches punctuation such as "!", "?", and ","; a quick sanity check follows the code below.
Python
from nltk.tokenize import word_tokenize
import re

sentence_without_punctuations = re.sub(r'[^\w\s]', '', sentence)
paragraph_without_punctuations = re.sub(r'[^\w\s]', '', paragraph)
web_page_paragraphs_without_punctuations = re.sub(r'[^\w\s]', '', web_page_data)
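A quick sanity check of the pattern on a toy string (the input text is just an illustration):

Python

import re

# [^\w\s] matches every character that is neither a word character nor whitespace
print(re.sub(r'[^\w\s]', '', 'Hello, world! Is this clean?'))
# Output: Hello world Is this clean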
Step 5:
- Pass the sentence, paragraph, and webpage contents (with punctuation and unnecessary characters removed) into word_tokenize(), which returns the tokenized sentence, paragraph, and web strings as lists of words.
- word_tokenize() relies on NLTK's Punkt tokenizer models, so they are downloaded first.
- Display the contents of the tokenized sentence, tokenized paragraph, and tokenized web string.
Python
nltk.download('punkt')  # word_tokenize() needs the Punkt tokenizer models

sentence_after_tokenizing = word_tokenize(sentence_without_punctuations)
paragraph_after_tokenizing = word_tokenize(paragraph_without_punctuations)
webpage_after_tokenizing = word_tokenize(web_page_paragraphs_without_punctuations)
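For instance, on a short sample string (illustrative only):

Python

from nltk.tokenize import word_tokenize

# assumes the Punkt models from this step are already downloaded
print(word_tokenize('NLTK makes tokenizing easy'))
# ['NLTK', 'makes', 'tokenizing', 'easy']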
Step 6:
- From nltk.corpus, import stopwords.
- Download the stop words using nltk.download('stopwords').
- Store the English stop words in nltk_stop_words.
- Compare each word in the tokenized sentence, tokenized paragraph, and tokenized web string with the words present in nltk_stop_words; if any word in our data occurs in the NLTK stop words, we ignore that word.
Python
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk_stop_words = stopwords.words('english')

sentence_without_stopwords = [
    i for i in sentence_after_tokenizing if not i.lower() in nltk_stop_words]
paragraph_without_stopwords = [
    j for j in paragraph_after_tokenizing if not j.lower() in nltk_stop_words]
webpage_without_stopwords = [
    k for k in webpage_after_tokenizing if not k.lower() in nltk_stop_words]
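To get a feel for what is being filtered out, you can peek at the list (the exact contents can vary slightly between NLTK versions):

Python

from nltk.corpus import stopwords

print(stopwords.words('english')[:5])
# ['i', 'me', 'my', 'myself', 'we']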
Step 7:
- From nltk.stem.porter, import PorterStemmer.
- Perform stemming using NLTK: remove the suffix of each word, reducing it to its root form.
- Create three empty lists for storing the stemmed words of the sentence, paragraph, and webpage.
- Using stemmer.stem(), stem each word present in the previous lists and store the results in the newly created lists.
Python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# creating empty lists for storing stemmed words
sentence_after_stemming = []
paragraph_after_stemming = []
webpage_after_stemming = []

for word in sentence_without_stopwords:
    sentence_after_stemming.append(stemmer.stem(word))
for word in paragraph_without_stopwords:
    paragraph_after_stemming.append(stemmer.stem(word))
for word in webpage_without_stopwords:
    webpage_after_stemming.append(stemmer.stem(word))
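A small standalone example shows both a clean stem and the kind of misspelled-looking output that motivates Step 8 (the sample words are illustrative):

Python

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('running'))  # run
print(stemmer.stem('studies'))  # studi  (not a real word)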
Step 8:
- Stemming can produce misspelled words, because the stemmer strips suffixes without checking that the result is a real word.
- Using the TextBlob module, we can find the relevant correct word for a particular misspelled word.
- For each word in sentence_after_stemming, paragraph_after_stemming, and webpage_after_stemming, find the corrected spelling of that word using the correct() method.
- Check whether the corrected word is present in the stop words. If it is, keep the original stemmed word; otherwise, replace the stemmed word with the corrected one.
Python
from textblob import TextBlob

final_words_sentence = []
final_words_paragraph = []
final_words_webpage = []

for i in range(len(sentence_after_stemming)):
    final_words_sentence.append(0)
    present_word = sentence_after_stemming[i]
    b = TextBlob(sentence_after_stemming[i])
    if str(b.correct()).lower() in nltk_stop_words:
        final_words_sentence[i] = present_word
    else:
        final_words_sentence[i] = str(b.correct())
print(final_words_sentence)
print('\n')

for i in range(len(paragraph_after_stemming)):
    final_words_paragraph.append(0)
    present_word = paragraph_after_stemming[i]
    b = TextBlob(paragraph_after_stemming[i])
    if str(b.correct()).lower() in nltk_stop_words:
        final_words_paragraph[i] = present_word
    else:
        final_words_paragraph[i] = str(b.correct())
print(final_words_paragraph)
print('\n')

for i in range(len(webpage_after_stemming)):
    final_words_webpage.append(0)
    present_word = webpage_after_stemming[i]
    b = TextBlob(webpage_after_stemming[i])
    if str(b.correct()).lower() in nltk_stop_words:
        final_words_webpage[i] = present_word
    else:
        final_words_webpage[i] = str(b.correct())
print(final_words_webpage)
print('\n')
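As a quick standalone check of correct(), this mirrors the example in the TextBlob documentation:

Python

from textblob import TextBlob

b = TextBlob('I havv goood speling!')
print(b.correct())
# I have good spelling!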
Step 9:
- Using the Counter class in the collections module, find the frequency of words in the sentence, paragraph, and webpage. A Python Counter is a container that holds the count of each of the elements present in it.
- Counter returns a dictionary-like object with key-value pairs of the form {'word': word_count}.
Python
from collections import Counter

sentence_count = Counter(final_words_sentence)
paragraph_count = Counter(final_words_paragraph)
webpage_count = Counter(final_words_webpage)
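For example, on a toy list (illustrative only):

Python

from collections import Counter

print(Counter(['run', 'run', 'walk']))
# Counter({'run': 2, 'walk': 1})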
Below is the full implementation:
Python
from collections import Counter
from textblob import TextBlob
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup
import urllib.request
import nltk

nltk.download('punkt')      # tokenizer models needed by word_tokenize()
nltk.download('stopwords')  # English stop word list

s = input('Enter the file name which contains a sentence: ')
file1 = open(s)
sentence = file1.read()
file1.close()

p = input('Enter the file name which contains a paragraph: ')
file2 = open(p)
paragraph = file2.read()
file2.close()

url = input('Enter URL of Webpage: ')
print('\n')
url_request = urllib.request.Request(url)
url_response = urllib.request.urlopen(url_request)
webpage_data = url_response.read()
soup = BeautifulSoup(webpage_data, 'html.parser')

print('<------------------------------Initial Contents of Sentence are------------------------------> \n')
print(sentence)
print('\n')

print('<------------------------------Initial Contents of Paragraph are------------------------------> \n')
print(paragraph)
print('\n')

print('<------------------------------Initial Contents of Webpage are------------------------------> \n')
print(soup)
print('\n')

web_page_paragraph_contents = soup('p')
web_page_data = ''
for para in web_page_paragraph_contents:
    web_page_data = web_page_data + str(para.text)

print('<------------------------------Contents enclosed between the paragraph tags in the web page are------------------------------> \n')
print(web_page_data)
print('\n')

sentence_without_punctuations = re.sub(r'[^\w\s]', '', sentence)
paragraph_without_punctuations = re.sub(r'[^\w\s]', '', paragraph)
web_page_paragraphs_without_punctuations = re.sub(r'[^\w\s]', '', web_page_data)

print('<------------------------------Contents of sentence after removing punctuations------------------------------> \n')
print(sentence_without_punctuations)
print('\n')

print('<------------------------------Contents of paragraph after removing punctuations------------------------------> \n')
print(paragraph_without_punctuations)
print('\n')

print('<------------------------------Contents of webpage after removing punctuations------------------------------> \n')
print(web_page_paragraphs_without_punctuations)
print('\n')

sentence_after_tokenizing = word_tokenize(sentence_without_punctuations)
paragraph_after_tokenizing = word_tokenize(paragraph_without_punctuations)
webpage_after_tokenizing = word_tokenize(web_page_paragraphs_without_punctuations)

print('<------------------------------Contents of sentence after tokenizing------------------------------> \n')
print(sentence_after_tokenizing)
print('\n')

print('<------------------------------Contents of paragraph after tokenizing------------------------------> \n')
print(paragraph_after_tokenizing)
print('\n')

print('<------------------------------Contents of webpage after tokenizing------------------------------> \n')
print(webpage_after_tokenizing)
print('\n')

nltk_stop_words = stopwords.words('english')
sentence_without_stopwords = [
    i for i in sentence_after_tokenizing if not i.lower() in nltk_stop_words]
paragraph_without_stopwords = [
    j for j in paragraph_after_tokenizing if not j.lower() in nltk_stop_words]
webpage_without_stopwords = [
    k for k in webpage_after_tokenizing if not k.lower() in nltk_stop_words]

print('<------------------------------Contents of sentence after removing stopwords------------------------------> \n')
print(sentence_without_stopwords)
print('\n')

print('<------------------------------Contents of paragraph after removing stopwords------------------------------> \n')
print(paragraph_without_stopwords)
print('\n')

print('<------------------------------Contents of webpage after removing stopwords------------------------------> \n')
print(webpage_without_stopwords)
print('\n')

stemmer = PorterStemmer()

# creating empty lists for storing stemmed words
sentence_after_stemming = []
paragraph_after_stemming = []
webpage_after_stemming = []

for word in sentence_without_stopwords:
    sentence_after_stemming.append(stemmer.stem(word))
for word in paragraph_without_stopwords:
    paragraph_after_stemming.append(stemmer.stem(word))
for word in webpage_without_stopwords:
    webpage_after_stemming.append(stemmer.stem(word))

print('<------------------------------Contents of sentence after doing stemming------------------------------> \n')
print(sentence_after_stemming)
print('\n')

print('<------------------------------Contents of paragraph after doing stemming------------------------------> \n')
print(paragraph_after_stemming)
print('\n')

print('<------------------------------Contents of webpage after doing stemming------------------------------> \n')
print(webpage_after_stemming)
print('\n')

final_words_sentence = []
final_words_paragraph = []
final_words_webpage = []

for i in range(len(sentence_after_stemming)):
    final_words_sentence.append(0)
    present_word = sentence_after_stemming[i]
    b = TextBlob(sentence_after_stemming[i])
    if str(b.correct()).lower() in nltk_stop_words:
        final_words_sentence[i] = present_word
    else:
        final_words_sentence[i] = str(b.correct())

print('<------------------------------Contents of sentence after correcting misspelled words------------------------------> \n')
print(final_words_sentence)
print('\n')

for i in range(len(paragraph_after_stemming)):
    final_words_paragraph.append(0)
    present_word = paragraph_after_stemming[i]
    b = TextBlob(paragraph_after_stemming[i])
    if str(b.correct()).lower() in nltk_stop_words:
        final_words_paragraph[i] = present_word
    else:
        final_words_paragraph[i] = str(b.correct())

print('<------------------------------Contents of paragraph after correcting misspelled words------------------------------> \n')
print(final_words_paragraph)
print('\n')

for i in range(len(webpage_after_stemming)):
    final_words_webpage.append(0)
    present_word = webpage_after_stemming[i]
    b = TextBlob(webpage_after_stemming[i])
    if str(b.correct()).lower() in nltk_stop_words:
        final_words_webpage[i] = present_word
    else:
        final_words_webpage[i] = str(b.correct())

print('<------------------------------Contents of webpage after correcting misspelled words------------------------------> \n')
print(final_words_webpage)
print('\n')

sentence_count = Counter(final_words_sentence)
paragraph_count = Counter(final_words_paragraph)
webpage_count = Counter(final_words_webpage)

print('<------------------------------Frequency of words in sentence------------------------------> \n')
print(sentence_count)
print('\n')

print('<------------------------------Frequency of words in paragraph------------------------------> \n')
print(paragraph_count)
print('\n')

print('<------------------------------Frequency of words in webpage------------------------------> \n')
print(webpage_count)