Introduction
N-grams are one of the fundamental concepts every data scientist and computer science professional must know while working with text data. In this beginner-level tutorial, we will learn what n-grams are and explore them on text data in Python. The objective of the blog is to analyze different types of n-grams on the given text data and hence decide which n-gram works the best for our data.
Learning Objectives
- Implement n-gram in Python from scratch and using nltk
- Understand n-grams and their importance
- Know the applications of n-grams in NLP
This article was published as a part of the Data Science Blogathon.
Table of contents
Quiz Time
Step into the realm of N-Grams and their implementation in Python using NLTK library. Good luck!
Question
Your answer:
Correct answer:
Your Answers
What Are N-Grams(ngrams)?
N-grams are continuous sequences of words or symbols, or tokens in a document. In technical terms, they can be defined as the neighboring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks. They have a wide range of applications, like language models, semantic features, spelling correction, machine translation, text mining, etc.
How Are N-Grams Classified?
Did you notice the ‘n’ in the term “n-grams”? Can you guess what this ‘n’ possibly is?
Remember when we learned how to input an array by first inputting its size(n) or even a number from the user? Generally, we used to store such values in a variable declared as ‘n’! Apart from programming, you must have extensively encountered ‘n’ in the formulae of the sum of series and so on. What do you think ‘n’ was over there?
Summing up, ‘n’ is just a variable that can have positive integer values, including 1,2,3, and so on.’n’ basically refers to multiple.
Thinking along the same lines, n-grams are classified into the following types, depending on the value that ‘n’ takes.
n | Term |
1 | Unigram |
2 | Bigram |
3 | Trigram |
n | n-gram |
As clearly depicted in the table above, when n=1, it is said to be a unigram. When n=2, it is said to be a bigram, and so on.
Now, you must be wondering why we need many different types of n-grams?! This is because different types of n-grams are suitable for different types of applications. You should try different n-grams on your data in order to confidently conclude which one works the best among all for your text analysis. For instance, research has substantiated that trigrams and 4 grams work the best in the case of spam filtering.
Example of N-Grams
Let’s understand n-grams practically with the help of the following sample sentence:
“I reside in Bengaluru”.
SL.No. | Type of n-gram | Generated n-grams |
1 | Unigram | [“I”,”reside”,”in”,”Bengaluru”] |
2 | Bigram | [“I reside”,”reside in”,”in Bengaluru”] |
3 | Trigram | [“I reside in”, “reside in Bengaluru”] |
from nltk import ngrams
sentence = 'I reside in Bengaluru.'
n = 1
unigrams = ngrams(sentence.split(), n)
for grams in unigrams:
print grams
For the time being, let’s not consider the removal of stop-words :
From the table above, it’s clear that unigram means taking only one word at a time, bigram means taking two words at a time, and trigram means taking three words at a time. We will be implementing only till trigrams here in this blog. Feel free to proceed ahead and explore 4 grams, 5 grams, and so on from your takeaways from the blog!
Step-By-Step Implementation of N-Grams in Python
And here comes the most interesting section of the blog! Unless we practically implement what we learn, there is absolutely no fun in learning it! So, let’s proceed to code and generate n-grams on Google Colab in Python. You can also build a simple n-gram language model on top of this code.
Step 1: Explore the Dataset
I will be using sentiment analysis for the financial news dataset. The sentiments are from the perspective of retail investors. It is an open-source Kaggle dataset. Download it from here before moving ahead.
Let’s begin, as usual, by importing the required libraries and reading and understanding the data:
Python Code:
df.info()
You can see that the dataset has 4846 rows and two columns, namely,’ Sentiment’ and ‘News Headline.’
NOTE: When you download the dataset from Kaggle directly, you will notice that the columns are nameless! So, I named them later and updated them in the all-data.csv file before reading it using pandas. Ensure that you do not miss this step.
df.isna().sum()
The data is just perfect, with absolutely no missing values at all! That’s our luck, indeed!
df['Sentiment'].value_counts()
We can undoubtedly infer that the dataset includes three categories of sentiments:
- Neutral
- Positive
- Negative
Out of 4846 sentiments, 2879 have been found to be neutral, 1363 positive, and the rest negative.
Step 2: Feature Extraction
Our objective is to predict the sentiment of a given news headline. Obviously, the ‘News Headline’ column is our only feature, and the ‘Sentiment’ column is our target variable.
y=df['Sentiment'].values
y.shape
x=df['News Headline'].values
x.shape
Both the outputs return a shape of (4846,) which means 4846 rows and 1 column as we have 4846 rows of data and just 1 feature and a target for x and y, respectively.
Step 3: Train-Test Split
In any machine learning, deep learning, or NLP(Natural Language Processing) task, splitting the data into train and test is indeed a highly crucial step. The train_test_split() method provided by sklearn is widely used for the same. So, let’s begin by importing it:
from sklearn.model_selection import train_test_split
Here’s how I’ve split the data: 60% for the train and the rest 40% for the test. I had started with 20% for the test. I kept on playing with the test_size parameter only to realize that the 60-40 ratio of split provides more useful and meaningful insights from the trigrams generated. Don’t worry; we will be looking at trigrams in just a while.
(x_train,x_test,y_train,y_test)=train_test_split(x,y,test_size=0.4)
x_train.shape
y_train.shape
x_test.shape
y_test.shape
On executing the codes above, you will observe that 2907 rows have been considered as train data, and the rest of the 1939 rows have been considered as test data.
Our next step is to convert these NumPy arrays to Pandas data frames and thus create two data frames, namely,df_train and df_test. The former is created by concatenating x_train and y_train arrays. The latter data frame is created by concatenating x_test and y_test arrays. This is necessary to count the number of positive, negative, and neutral sentiments in both train and test datasets which we will be doing in a while.
df1=pd.DataFrame(x_train)
df1=df1.rename(columns={0:'news'})
df2=pd.DataFrame(y_train)
df2=df2.rename(columns={0:'sentiment'})
df_train=pd.concat([df1,df2],axis=1)
df_train.head()
df3=pd.DataFrame(x_test)
df3=df3.rename(columns={0:'news'})
df4=pd.DataFrame(y_test)
df4=df2.rename(columns={0:'sentiment'})
df_test=pd.concat([df3,df4],axis=1)
df_test.head()
Step 4: Basic Pre-Processing of Train and Test Data
Here, in order to pre-process our text data, we will remove punctuations in train and test data for the ‘news’ column using punctuation provided by the string library.
#removing punctuations
#library that contains punctuation
import string
string.punctuation
#defining the function to remove punctuation
def remove_punctuation(text):
if(type(text)==float):
return text
ans=""
for i in text:
if i not in string.punctuation:
ans+=i
return ans
#storing the puntuation free text in a new column called clean_msg
df_train['news']= df_train['news'].apply(lambda x:remove_punctuation(x))
df_test['news']= df_test['news'].apply(lambda x:remove_punctuation(x))
df_train.head()
#punctuations are removed from news column in train dataset
Compare the above output with the previous output of df_train. You can observe that punctuations have been successfully removed from the text present in the feature column(news column) of the training dataset. Similarly, from the above codes, punctuations will be removed successfully from the news column of the test data frame as well. You can optionally view df_test.head() as well to note it.
As a next step, we have to remove stopwords from the news column. For this, let’s use the stopwords provided by nltk as follows:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
We will be using this to generate n-grams in the very next step.
Step 5: Code to Generate N-grams
Let’s code a custom function to generate n-grams for a given text as follows:
#method to generate n-grams:
#params:
#text-the text for which we have to generate n-grams
#ngram-number of grams to be generated from the text(1,2,3,4 etc., default value=1)
def generate_N_grams(text,ngram=1):
words=[word for word in text.split(" ") if word not in set(stopwords.words('english'))]
print("Sentence after removing stopwords:",words)
temp=zip(*[words[i:] for i in range(0,ngram)])
ans=[' '.join(ngram) for ngram in temp]
return ans
The above function inputs two parameters, namely, text and ngram, which refer to the text data for which we want to generate a given number of n-grams and the number of grams to be generated, respectively. Firstly, word tokenization is done where the stop words are ignored, and the remaining words are retained. From the example section, you must have been clear on how to generate n-grams manually for a given text. We have coded the very same logic in the function generate_N_grams() above. It will thus consider n words at a time from the text where n is given by the value of the ngram parameter of the function.
Let’s check the working of the function with the help of a simple example to create bigrams as follows:
#sample!
generate_N_grams("The sun rises in the east",2)
Great! We are now set to proceed.
Step 6: Creating Unigrams
Let’s follow the steps below to create unigrams for the news column of the df_train data frame:
- Create unigrams for each of the news records belonging to each of the three categories of sentiments.
- Store the word and its count in the corresponding dictionaries.
- Convert these dictionaries to corresponding data frames.
- Fetch the top 10 most frequently used words.
- Visualize the most frequently used words for all the 3 categories-positive, negative and neutral.
Have a look at the codes below to understand the steps better.
from collections import defaultdict
positiveValues=defaultdict(int)
negativeValues=defaultdict(int)
neutralValues=defaultdict(int)
#get the count of every word in both the columns of df_train and df_test dataframes
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="positive"
for text in df_train[df_train.sentiment=="positive"].news:
for word in generate_N_grams(text):
positiveValues[word]+=1
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="negative"
for text in df_train[df_train.sentiment=="negative"].news:
for word in generate_N_grams(text):
negativeValues[word]+=1
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="neutral"
for text in df_train[df_train.sentiment=="neutral"].news:
for word in generate_N_grams(text):
neutralValues[word]+=1
#focus on more frequently occuring words for every sentiment=>
#sort in DO wrt 2nd column in each of positiveValues,negativeValues and neutralValues
df_positive=pd.DataFrame(sorted(positiveValues.items(),key=lambda x:x[1],reverse=True))
df_negative=pd.DataFrame(sorted(negativeValues.items(),key=lambda x:x[1],reverse=True))
df_neutral=pd.DataFrame(sorted(neutralValues.items(),key=lambda x:x[1],reverse=True))
pd1=df_positive[0][:10]
pd2=df_positive[1][:10]
ned1=df_negative[0][:10]
ned2=df_negative[1][:10]
nud1=df_neutral[0][:10]
nud2=df_neutral[1][:10]
plt.figure(1,figsize=(16,4))
plt.bar(pd1,pd2, color ='green',
width = 0.4)
plt.xlabel("Words in positive dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in positive dataframe-UNIGRAM ANALYSIS")
plt.savefig("positive-unigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(ned1,ned2, color ='red',
width = 0.4)
plt.xlabel("Words in negative dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in negative dataframe-UNIGRAM ANALYSIS")
plt.savefig("negative-unigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(nud1,nud2, color ='yellow',
width = 0.4)
plt.xlabel("Words in neutral dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in neutral dataframe-UNIGRAM ANALYSIS")
plt.savefig("neutral-unigram.png")
plt.show()
Step 7: Creating Bigrams
Repeat the same steps which we followed to analyze our data using unigrams, except that you have to pass parameter 2 while invoking the generate_N_grams() function. You can optionally consider changing the names of the data frames, which I have done.
positiveValues2=defaultdict(int)
negativeValues2=defaultdict(int)
neutralValues2=defaultdict(int)
#get the count of every word in both the columns of df_train and df_test dataframes
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="positive"
for text in df_train[df_train.sentiment=="positive"].news:
for word in generate_N_grams(text,2):
positiveValues2[word]+=1
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="negative"
for text in df_train[df_train.sentiment=="negative"].news:
for word in generate_N_grams(text,2):
negativeValues2[word]+=1
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="neutral"
for text in df_train[df_train.sentiment=="neutral"].news:
for word in generate_N_grams(text,2):
neutralValues2[word]+=1
#focus on more frequently occuring words for every sentiment=>
#sort in DO wrt 2nd column in each of positiveValues,negativeValues and neutralValues
df_positive2=pd.DataFrame(sorted(positiveValues2.items(),key=lambda x:x[1],reverse=True))
df_negative2=pd.DataFrame(sorted(negativeValues2.items(),key=lambda x:x[1],reverse=True))
df_neutral2=pd.DataFrame(sorted(neutralValues2.items(),key=lambda x:x[1],reverse=True))
pd1bi=df_positive2[0][:10]
pd2bi=df_positive2[1][:10]
ned1bi=df_negative2[0][:10]
ned2bi=df_negative2[1][:10]
nud1bi=df_neutral2[0][:10]
nud2bi=df_neutral2[1][:10]
plt.figure(1,figsize=(16,4))
plt.bar(pd1bi,pd2bi, color ='green',width = 0.4)
plt.xlabel("Words in positive dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in positive dataframe-BIGRAM ANALYSIS")
plt.savefig("positive-bigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(ned1bi,ned2bi, color ='red',
width = 0.4)
plt.xlabel("Words in negative dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in negative dataframe-BIGRAM ANALYSIS")
plt.savefig("negative-bigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(nud1bi,nud2bi, color ='yellow',
width = 0.4)
plt.xlabel("Words in neutral dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in neutral dataframe-BIGRAM ANALYSIS")
plt.savefig("neutral-bigram.png")
plt.show()
Step 8: Creating Trigrams
Repeat the same steps which we followed to analyze our data using unigrams, except that you have to pass parameter 3 while invoking the generate_N_grams() function. You can optionally consider changing the names of the data frames, which I have done.
positiveValues3=defaultdict(int)
negativeValues3=defaultdict(int)
neutralValues3=defaultdict(int)
#get the count of every word in both the columns of df_train and df_test dataframes
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="positive"
for text in df_train[df_train.sentiment=="positive"].news:
for word in generate_N_grams(text,3):
positiveValues3[word]+=1
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="negative"
for text in df_train[df_train.sentiment=="negative"].news:
for word in generate_N_grams(text,3):
negativeValues3[word]+=1
#get the count of every word in both the columns of df_train and df_test dataframes where sentiment="neutral"
for text in df_train[df_train.sentiment=="neutral"].news:
for word in generate_N_grams(text,3):
neutralValues3[word]+=1#focus on more frequently occuring words for every sentiment=>
#sort in DO wrt 2nd column in each of positiveValues,negativeValues and neutralValues
df_positive3=pd.DataFrame(sorted(positiveValues3.items(),key=lambda x:x[1],reverse=True))
df_negative3=pd.DataFrame(sorted(negativeValues3.items(),key=lambda x:x[1],reverse=True))
df_neutral3=pd.DataFrame(sorted(neutralValues3.items(),key=lambda x:x[1],reverse=True))
pd1tri=df_positive3[0][:10]
pd2tri=df_positive3[1][:10]
ned1tri=df_negative3[0][:10]
ned2tri=df_negative3[1][:10]
nud1tri=df_neutral3[0][:10]
nud2tri=df_neutral3[1][:10]
plt.figure(1,figsize=(16,4))
plt.bar(pd1tri,pd2tri, color ='green',
width = 0.4)
plt.xlabel("Words in positive dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in positive dataframe-TRIGRAM ANALYSIS")
plt.savefig("positive-trigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(ned1tri,ned2tri, color ='red',
width = 0.4)
plt.xlabel("Words in negative dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in negative dataframe-TRIGRAM ANALYSIS")
plt.savefig("negative-trigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(nud1tri,nud2tri, color ='yellow',
width = 0.4)
plt.xlabel("Words in neutral dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in neutral dataframe-TRIGRAM ANALYSIS")
plt.savefig("neutral-trigram.png")
plt.show()
Results of the Model
From the above graphs, we can conclude that trigrams perform the best on our train data. This is because it provides more useful words frequently, such as profit rose EUR, a year earlier for the positive data frame, corresponding period, period 2007, names of companies such as HEL for the negative data frame and Finland, the company said and again names of companies such as HEL, OMX Helsinki and so on for the neutral data frame.
Conclusion
Therefore, n-grams are one of the most powerful techniques for extracting features from the text while working on a text problem. You can find the entire code here. In this blog, we have successfully learned what n-grams are and how we can generate n-grams for a given text dataset easily in Python. We also understood the applications of n-grams in NLP and generated n-grams in the case study of sentiment analysis.
Key Takeaways
- N-grams are the most powerful technique to extract the features from the text.
- N-grams have a wide range of applications in language models, spelling correctors, text classification problems, and more.
Frequently Asked Questions
A. Below is the n-gram implementation code for Python.from nltk import ngrams
sentence = 'Hi! How are you doing today?'
n = 2
bigrams = ngrams(sentence.split(), 2)
for grams in bigrams:
print grams
A. N-grams split the sentence into multiple sequences of tokens depending upon the value of n. For example, given n=3, n-grams for the following sentence “I am doing well today” looks like [“I am doing”, “am doing good”, “doing good today”]
A. N-grams are used in the various use cases of NLP, such as spelling correction, machine translation, language models, semantic feature extraction, etc.
A. The ‘n’ in n-grams refers to the no. of sequences of tokens. Hence, when the value of n=2, it’s known as bigrams.
A. Here are the advantages and disadvantages of n-grams in NLP.
Pros
The concept of n-grams is simple and easy to use yet powerful. Hence, it can be used to build a variety of applications in NLP, like language models, spelling correctors, etc.
Cons
N-grams cannot deal Out Of Vocabulary (OOV) words. It works well with the words present in the training set. In the case of an Out Of Vocabulary (OOV) word, n-grams fail to tackle it.
Another serious concern about n-grams is that it deals with large sparsity.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.