Amazon Product Reviews Sentiment Analysis in Python
Amazon gives small businesses and companies with modest resources a platform to grow. Because of its popularity, people actually spend time writing detailed reviews about brands and products, so analyzing that data can tell companies a lot about their products and about ways to improve their quality. That amount of data, however, is far too large for any person to analyze by hand.
This is where machine learning, specifically Natural Language Processing (NLP), comes in: it lets us analyze large volumes of text automatically. Our task is to predict whether a given review is positive or negative. A real dataset scraped from the website might include millions of reviews, so we have preprocessed the data for you.
Before starting the code, download the dataset by clicking the link.
Steps to be followed
- Importing Libraries and Datasets
- Preprocessing and cleaning the reviews
- Analysis of the Dataset
- Converting text into Vectors
- Model training, Evaluation, and Prediction
Let’s start with the code now.
Importing Libraries and Datasets
The libraries used are:
- Pandas: for importing the dataset.
- Scikit-learn: for importing the model, the accuracy module, and TfidfVectorizer.
- warnings: to suppress all warning messages.
- Matplotlib: to plot the visualizations; WordCloud is used alongside it.
Python3
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
For the NLP part, we will use the NLTK library. From it we need the stopwords corpus and the punkt tokenizer, so let's download and import them using the commands below.
Python3
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
After that, import the downloaded dataset using the code below.
Python3
data = pd.read_csv('AmazonReview.csv')
data.head()
Output:
Preprocessing and cleaning the reviews
Python3
data.info()
Output:
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Review     24999 non-null  object
 1   Sentiment  25000 non-null  int64
Now, to drop the null values (if any), run the command below.
Python3
data.dropna(inplace=True)
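If you want to confirm the drop worked, a quick optional sanity check (not part of the original walkthrough) is:
Python3
# optional sanity check: every column should now show 0 missing values
data.isnull().sum()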
To predict the sentiment as positive (numerical value = 1) or negative (numerical value = 0), we need to map the rating values to those two categories. The condition: if the rating is less than or equal to 3, the sentiment is negative (0); otherwise it is positive (1). For a better understanding, refer to the code below.
Python3
# ratings 1, 2, 3 -> negative (i.e. 0)
data.loc[data['Sentiment'] <= 3, 'Sentiment'] = 0

# ratings 4, 5 -> positive (i.e. 1)
data.loc[data['Sentiment'] > 3, 'Sentiment'] = 1
Now that the labels are ready, we will clean the Review column by removing stopwords. The code for that is given below.
Python3
stp_words = stopwords.words('english')

def clean_review(review):
    # keep only the words that are not stopwords
    cleanreview = " ".join(word for word in review.split()
                           if word not in stp_words)
    return cleanreview

data['Review'] = data['Review'].apply(clean_review)
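The cleaner above is intentionally minimal: it is case-sensitive and leaves punctuation attached to words. If you want slightly more thorough cleaning, here is a sketch of a variant (clean_review_v2 is our name, not part of the original pipeline) that lowercases the text and strips punctuation first:
Python3
import string

def clean_review_v2(review):
    # lowercase and drop punctuation before removing stopwords
    review = review.lower().translate(str.maketrans('', '', string.punctuation))
    return " ".join(word for word in review.split() if word not in stp_words)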
Once the preprocessing is done, let's look at the top 5 rows to see the improved dataset.
Python3
data.head()
Output:
Analysis of the Dataset
Let's check how many reviews there are for each sentiment, positive and negative.
Python3
data['Sentiment'].value_counts()
Output:
0    15000
1     9999
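The classes are imbalanced: 15,000 negative reviews versus 9,999 positive ones. A quick bar plot (an optional extra using the Matplotlib import from earlier) makes this easy to see:
Python3
# visualize the class distribution
data['Sentiment'].value_counts().plot(kind='bar')
plt.xticks([0, 1], ['Negative (0)', 'Positive (1)'], rotation=0)
plt.ylabel('Number of reviews')
plt.show()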
To get a better picture of the importance of the words, let's create a word cloud of all the words in reviews with sentiment = 0, i.e. negative.
Python3
# join all negative reviews into one string
consolidated = ' '.join(word for word in data['Review'][data['Sentiment'] == 0].astype(str))
wordCloud = WordCloud(width=1600, height=800, random_state=21, max_font_size=110)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
Let's do the same for all the words with sentiment = 1, i.e. positive.
Python3
# join all positive reviews into one string
consolidated = ' '.join(word for word in data['Review'][data['Sentiment'] == 1].astype(str))
wordCloud = WordCloud(width=1600, height=800, random_state=21, max_font_size=110)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
Now we have a clear picture of the words that appear in both categories.
Let’s create the vectors.
Converting text into Vectors
TF-IDF (term frequency–inverse document frequency) measures how relevant a word is to a document within a collection (corpus). A word's weight grows with how often it appears in that document, but it is offset by how often the word appears across the whole corpus (the dataset), so very common words are down-weighted.
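For reference, scikit-learn's TfidfVectorizer with its default smoothing computes approximately (the document vectors are then L2-normalized):

\[ \text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right) \]

where tf(t, d) is the count of term t in document d, n is the number of documents, and df(t) is the number of documents containing t. We will be implementing this with the code below.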
Python3
cv = TfidfVectorizer(max_features=2500)
X = cv.fit_transform(data['Review']).toarray()
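To build intuition for what the vectorizer produces, here is a tiny self-contained illustration on a made-up two-review corpus (toy_corpus and toy_cv are our names, not from the dataset):
Python3
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ['great product love it',
              'terrible product waste of money']
toy_cv = TfidfVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus)

print(toy_cv.get_feature_names_out())  # the learned vocabulary
print(toy_X.shape)                     # (2 documents, 8 unique terms)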
Model training, Evaluation, and Prediction
Once analysis and vectorization are done, we can train any machine learning model on the data. But before that, perform the train-test split.
Python3
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, data['Sentiment'],
                                                    test_size=0.25,
                                                    random_state=42)
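Since the classes are imbalanced, you could optionally pass stratify so the negative/positive ratio is preserved in both splits. A variant of the split above (not what the original code does):
Python3
# optional: keep the class ratio identical in train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    X, data['Sentiment'], test_size=0.25,
    random_state=42, stratify=data['Sentiment'])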
Now we can train any model. Let's try Logistic Regression.
Python3
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()

# model fitting
model.fit(x_train, y_train)

# testing the model
pred = model.predict(x_test)

# model accuracy
print(accuracy_score(y_test, pred))
Output:
0.81632
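Accuracy alone can be misleading on imbalanced data, so it may also be worth looking at per-class precision and recall (an optional extra, not in the original walkthrough):
Python3
from sklearn.metrics import classification_report

# per-class precision, recall, and F1 on the test split
print(classification_report(y_test, pred,
                            target_names=['negative', 'positive']))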
Let’s see the confusion matrix for the results.
Python3
from sklearn import metrics

cm = metrics.confusion_matrix(y_test, pred)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                            display_labels=[False, True])
cm_display.plot()
plt.show()
Output:
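Finally, to score a brand-new review with the trained model, apply the same cleaning step and the already-fitted vectorizer (using transform, not fit_transform). A minimal sketch with a made-up example review:
Python3
def predict_sentiment(review):
    # reuse the fitted TF-IDF vectorizer and the trained model
    vec = cv.transform([clean_review(review)]).toarray()
    return 'positive' if model.predict(vec)[0] == 1 else 'negative'

print(predict_sentiment('This product exceeded my expectations'))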