BERT stands for Bidirectional Encoder Representations from Transformers and was proposed by researchers at Google AI Language in 2018. Although its original aim was to improve how Google Search understands the meaning of queries, BERT has become one of the most important and versatile architectures for natural language tasks, producing state-of-the-art results on sentence-pair classification, question answering, and more.
Bidirectional Encoder Representations from Transformers (BERT)
BERT is a powerful technique for natural language processing that improves how well computers understand human language. Its foundation is the idea of exploiting bidirectional context to learn rich, insightful word and phrase representations. By examining both sides of a word's context simultaneously, BERT captures the full meaning of a word in its context, in contrast to earlier models that considered only the left or the right context. This lets BERT handle ambiguous and complex linguistic phenomena such as polysemy, co-reference, and long-distance dependencies.
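The effect of bidirectional context is easy to see with BERT's masked-language-modelling head. Below is a minimal sketch, not part of the original tutorial (the example sentence is illustrative), that uses the Hugging Face fill-mask pipeline: BERT ranks candidates for [MASK] using the words on both its left and its right.
Python3
# Minimal sketch: BERT's masked-language-model head uses context on both
# sides of the mask. The example sentence is illustrative only.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Both "went to the" (left context) and "to deposit his paycheck"
# (right context) influence the ranking of candidate tokens.
for result in unmasker("He went to the [MASK] to deposit his paycheck."):
    print(result["token_str"], round(result["score"], 3))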
The paper also describes how the same pre-trained model can be adapted to different downstream tasks. In this post, we will use the BERT architecture for a sentiment classification task, specifically the setup used for the CoLA (Corpus of Linguistic Acceptability) binary classification task.
BERT was released in two sizes:
- BERT (BASE): 12 layers of encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
- BERT (LARGE): 24 layers of encoder stack with 16 bidirectional self-attention heads and 1024 hidden units.
For the TensorFlow implementation, Google has released two variants of both BERT BASE and BERT LARGE: Uncased and Cased. In the uncased version, text is lowercased before WordPiece tokenization. The short sketch below checks these configurations and the lowercasing behaviour.
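As a quick check, the configuration of each published checkpoint can be inspected directly. This is a minimal sketch (assuming the Hugging Face checkpoints 'bert-base-uncased' and 'bert-large-uncased' can be downloaded) that prints the layer, head, and hidden-unit counts and shows the uncased tokenizer lowercasing its input.
Python3
# Minimal sketch: inspect the BERT BASE / LARGE configurations and the
# lowercasing behaviour of the uncased tokenizer.
from transformers import BertConfig, BertTokenizer

for name in ["bert-base-uncased", "bert-large-uncased"]:
    config = BertConfig.from_pretrained(name)
    print(name,
          "| layers:", config.num_hidden_layers,
          "| attention heads:", config.num_attention_heads,
          "| hidden units:", config.hidden_size)

# The uncased tokenizer lowercases text before WordPiece tokenization.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("BERT Handles Mixed CASE"))  # all-lowercase subwords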
Sentiment Classification Using BERT:
Step 1: Import the necessary libraries
Python3
import os
import shutil
import tarfile
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

import pandas as pd
from bs4 import BeautifulSoup
import re

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Step 2: Load the dataset
Python3
# Get the current working directory
current_folder = os.getcwd()

# Download and extract the IMDB reviews archive
# (the standard Stanford aclImdb dataset)
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz",
    origin="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    cache_dir=current_folder,
    extract=True)
Check the dataset folder
Python3
dataset_path = os.path.dirname(dataset)

# Check the dataset
os.listdir(dataset_path)
Output:
['aclImdb.tar.gz', 'aclImdb']
Check the ‘aclImdb’ directory
Python3
# Dataset directory
dataset_dir = os.path.join(dataset_path, 'aclImdb')

# Check the Dataset directory
os.listdir(dataset_dir)
Output:
['imdb.vocab', 'train', 'imdbEr.txt', 'README', 'test']
Check the ‘Train’ dataset folder
Python3
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)
Output:
['labeledBow.feat',
'urls_neg.txt',
'unsupBow.feat',
'urls_unsup.txt',
'urls_pos.txt',
'pos',
'neg',
'unsup']
Read the files of the 'train' directory
Python3
for file in os.listdir(train_dir):
    file_path = os.path.join(train_dir, file)

    # Check if it's a file (not a directory)
    if os.path.isfile(file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            first_value = f.readline().strip()
            print(f"{file}: {first_value}")
    else:
        print(f"{file}: {file_path}")
Output:
labeledBow.feat: 9 0:9 1:1 2:4 3:4 4:6 5:4 6:2 7:2 8:4 10:4 12:2 26:1 27:1 28:1 29:2 32:1 41:1 45:1 47:1 50:1 54:2 57:1 59:1 63:2 64:1 66:1 68:2 70:1 72:1 78:1 100:1 106:1 116:1 122:1 125:1 136:1 140:1 142:1 150:1 167:1 183:1 201:1 207:1 208:1 213:1 217:1 230:1 255:1 321:5 343:1 357:1 370:1 390:2 468:1 514:1 571:1 619:1 671:1 766:1 877:1 1057:1 1179:1 1192:1 1402:2 1416:1 1477:2 1940:1 1941:1 2096:1 2243:1 2285:1 2379:1 2934:1 2938:1 3520:1 3647:1 4938:1 5138:4 5715:1 5726:1 5731:1 5812:1 8319:1 8567:1 10480:1 14239:1 20604:1 22409:4 24551:1 47304:1
urls_neg.txt: http://www.imdb.com/title/tt0064354/usercomments
unsupBow.feat: 0 0:8 1:6 3:5 4:2 5:1 7:1 8:5 9:2 10:1 11:2 13:3 16:1 17:1 18:1 19:1 22:3 24:1 26:3 28:1 30:1 31:1 35:2 36:1 39:2 40:1 41:2 46:2 47:1 48:1 52:1 63:1 67:1 68:1 74:1 81:1 83:1 87:1 104:1 105:1 112:1 117:1 131:1 151:1 155:1 170:1 198:1 225:1 226:1 288:2 291:1 320:1 331:1 342:1 364:1 374:1 384:2 385:1 407:1 437:1 441:1 465:1 468:1 470:1 519:1 595:1 615:1 650:1 692:1 851:1 937:1 940:1 1100:1 1264:1 1297:1 1317:1 1514:1 1728:1 1793:1 1948:1 2088:1 2257:1 2358:1 2584:2 2645:1 2735:1 3050:1 4297:1 5385:1 5858:1 7382:1 7767:1 7773:1 9306:1 10413:1 11881:1 15907:1 18613:1 18877:1 25479:1
urls_unsup.txt: http://www.imdb.com/title/tt0018515/usercomments
urls_pos.txt: http://www.imdb.com/title/tt0453418/usercomments
pos: datasets/aclImdb/train/pos
neg: datasets/aclImdb/train/neg
unsup: datasets/aclImdb/train/unsup
Load the movie reviews and convert them into a pandas DataFrame with their respective sentiment labels.
Here 0 means Negative and 1 means Positive
Python3
def load_dataset(directory):
    data = {"sentence": [], "sentiment": []}
    for file_name in os.listdir(directory):
        print(file_name)
        if file_name == 'pos':
            positive_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(positive_dir):
                text = os.path.join(positive_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(1)
        elif file_name == 'neg':
            negative_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(negative_dir):
                text = os.path.join(negative_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(0)
    return pd.DataFrame.from_dict(data)
Load the training dataset
Python3
# Load the dataset from the train_dir
train_df = load_dataset(train_dir)
print(train_df.head())
Output:
labeledBow.feat
urls_neg.txt
unsupBow.feat
urls_unsup.txt
urls_pos.txt
pos
neg
unsup
sentence sentiment
0 A surprisingly complex and well crafted study ... 1
1 I thought this movie was fantastic. It was hil... 1
2 This is an extremely long movie, which means y... 1
3 A Pentagon science team seem to have perfected... 1
4 I was amazed at the improvements made in an an... 1
Load the test dataset in the same way
Python3
test_dir = os.path.join(dataset_dir, 'test')

# Load the dataset from the test_dir
test_df = load_dataset(test_dir)
print(test_df.head())
Output:
labeledBow.feat
urls_neg.txt
urls_pos.txt
pos
neg
sentence sentiment
0 Watched this on KQED, with Frank Baxter commen... 1
1 Frank Sinatra took this role, chewed it up wit... 1
2 I don't pretend to be a huge Asterix fan, havi... 1
3 A very interesting documentary - certainly a l... 1
4 Good drama/comedy, with two good performances ... 1
Step 3: Preprocessing
Python3
# Count the positive and negative reviews in the training set
sentiment_counts = train_df['sentiment'].value_counts()

# Bar chart of the sentiment distribution
fig = px.bar(x=sentiment_counts.index.map({0: 'Negative', 1: 'Positive'}),
             y=sentiment_counts.values,
             color=sentiment_counts.index,
             color_discrete_sequence=px.colors.qualitative.Dark24,
             title='<b>Sentiments Counts')
fig.update_layout(title='Sentiments Counts',
                  xaxis_title='Sentiment',
                  yaxis_title='Counts',
                  template='plotly_dark')

# Show the bar chart
fig.show()
pyo.plot(fig, filename='Sentiments Counts.html', auto_open=True)
Output: (bar chart of positive vs. negative review counts in the training set)
Text Cleaning
Python3
def text_cleaning(text):
    # Strip HTML tags from the review
    soup = BeautifulSoup(text, "html.parser")
    # Remove text in square brackets
    text = re.sub(r'\[[^]]*\]', '', soup.get_text())
    # Keep only letters, digits, whitespace, commas and apostrophes
    pattern = r"[^a-zA-Z0-9\s,']"
    text = re.sub(pattern, '', text)
    return text
Apply text_cleaning
Python3
# Train dataset
train_df['Cleaned_sentence'] = train_df['sentence'].apply(text_cleaning).tolist()
# Test dataset
test_df['Cleaned_sentence'] = test_df['sentence'].apply(text_cleaning)
Plot reviews as word clouds
Python3
# Function to generate a word cloud from a list of texts
def generate_wordcloud(text, Title):
    all_text = " ".join(text)
    wordcloud = WordCloud(width=800,
                          height=400,
                          stopwords=set(STOPWORDS),
                          background_color='black').generate(all_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(Title)
    plt.show()
Positive Reviews
Python3
positive = train_df[train_df['sentiment'] == 1]['Cleaned_sentence'].tolist()
generate_wordcloud(positive, 'Positive Review')
Output: (word cloud of the most frequent words in positive reviews)
Negative Reviews
Python3
negative = train_df[train_df['sentiment'] == 0]['Cleaned_sentence'].tolist()
generate_wordcloud(negative, 'Negative Review')
Output: (word cloud of the most frequent words in negative reviews)
Separate the input text and the target sentiment for both the train and test sets
Python3
# Training data
# Reviews = "[CLS] " + train_df['Cleaned_sentence'] + "[SEP]"
Reviews = train_df['Cleaned_sentence']
Target = train_df['sentiment']

# Test data
# test_reviews = "[CLS] " + test_df['Cleaned_sentence'] + "[SEP]"
test_reviews = test_df['Cleaned_sentence']
test_targets = test_df['sentiment']
Split the test data into validation and test sets
Python3
x_val, x_test, y_val, y_test = train_test_split(test_reviews,
                                                test_targets,
                                                test_size=0.5,
                                                stratify=test_targets)
Step 4: Tokenization & Encoding
BERT tokenization converts raw text into the numerical inputs that the BERT model expects. It tokenizes the text and performs some preprocessing to prepare it for the model's input format. Let's look at the key features of the BERT tokenizer; a short sketch after the list below illustrates them.
- The BERT tokenizer splits words into subwords, or wordpieces. For example, a rare word such as "embeddings" may be split into "em", "##bed", "##ding", and "##s". The "##" prefix indicates that the subword continues the previous one. This keeps the vocabulary small and helps the model deal with rare or unknown words.
- The BERT tokenizer adds special tokens such as [CLS], [SEP], and [MASK] to the sequence. These tokens have special meanings:
- [CLS] is used for classification; its final hidden state represents the entire input in tasks such as sentiment analysis,
- [SEP] is used as a separator, i.e. to mark the boundaries between different sentences or segments,
- [MASK] is used for masking, i.e. to hide some tokens from the model during pre-training.
- The BERT tokenizer returns three components as outputs:
- input_ids: The numerical identifiers of the vocabulary tokens.
- token_type_ids: Identifies which segment or sentence each token belongs to.
- attention_mask: Flags that tell the model which tokens to pay attention to and which (such as padding) to ignore.
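As a quick illustration before encoding the whole dataset, here is a minimal sketch (the example sentence is illustrative, and the exact subword splits depend on the checkpoint's vocabulary) showing the WordPiece splits from tokenize() and the three arrays returned by encode_plus().
Python3
# Minimal sketch: what the BERT tokenizer produces for one sentence.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rare words are split into "##"-prefixed subwords,
# e.g. "embeddings" -> ['em', '##bed', '##ding', '##s'].
print(tokenizer.tokenize("BERT embeddings capture context"))

# encode_plus adds [CLS]/[SEP], pads to max_length and returns the
# input_ids, token_type_ids and attention_mask described above.
encoded = tokenizer.encode_plus("BERT embeddings capture context",
                                padding="max_length",
                                truncation=True,
                                max_length=12)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
print(encoded["attention_mask"])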
Load the pre-trained BERT tokenizer
Python3
# Load the pre-trained BERT tokenizer (uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
Apply BERT tokenization to the training, validation, and test datasets
Python3
max_len = 128

# Tokenize and encode the sentences
X_train_encoded = tokenizer.batch_encode_plus(Reviews.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length=max_len,
                                              return_tensors='tf')

X_val_encoded = tokenizer.batch_encode_plus(x_val.tolist(),
                                            padding=True,
                                            truncation=True,
                                            max_length=max_len,
                                            return_tensors='tf')

X_test_encoded = tokenizer.batch_encode_plus(x_test.tolist(),
                                             padding=True,
                                             truncation=True,
                                             max_length=max_len,
                                             return_tensors='tf')
Check the encoded dataset
Python3
k = 0
print('Training Comments -->>', Reviews[k])
print('\nInput Ids -->>\n', X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->>\n', tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->>\n', X_train_encoded['attention_mask'][k])
print('\nLabels -->>', Target[k])
Output:
Training Comments -->> Valentine is a horrible movie This is what I thought of itActing Very bad Katherine Heigl can not act
The other's weren't much betterStory The story was okay, but it could have been more developed This movie had the potential to be a great movie,
but it failedMusic Yes, some of the music was pretty coolOriginality Not very original The name Paige Prescott' Recognize PrescottBottom Line
Don't see Valentine It's a really stupid movie110
Input Ids -->>
tf.Tensor(
[ 101 10113 2003 1037 9202 3185 2023 2003 2054 1045 2245 1997
2009 18908 2075 2200 2919 9477 2002 8004 2140 2064 2025 2552
1996 2060 1005 1055 4694 1005 1056 2172 2488 23809 2100 1996
2466 2001 3100 1010 2021 2009 2071 2031 2042 2062 2764 2023
3185 2018 1996 4022 2000 2022 1037 2307 3185 1010 2021 2009
3478 27275 2748 1010 2070 1997 1996 2189 2001 3492 4658 10050
24965 3012 2025 2200 2434 1996 2171 17031 20719 1005 6807 20719
18384 20389 2240 2123 1005 1056 2156 10113 2009 1005 1055 1037
2428 5236 3185 14526 2692 102 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)
Decoded Ids -->>
[CLS] valentine is a horrible movie this is what i thought of itacting very bad katherine heigl can not act
the other's weren't much betterstory the story was okay, but it could have been more developed this movie had the potential to be a great movie,
but it failedmusic yes, some of the music was pretty cooloriginality not very original the name paige prescott'recognize prescottbottom line
don't see valentine it's a really stupid movie110 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Attention Mask -->>
tf.Tensor(
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)
Labels -->> 0
Step 5: Build the classification model
Load the model
Python3
# Initialize the model with a binary classification head
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                        num_labels=2)
Output:
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 109482240
dropout_37 (Dropout) multiple 0
classifier (Dense) multiple 1538
=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
If the task at hand is similar to the one on which a checkpoint was fine-tuned, TFBertForSequenceClassification can provide predictions without further training. Here, however, the classification head added on top of 'bert-base-uncased' is newly initialized, so we fine-tune the model on our sentiment data.
Compile the model
Python3
# Compile the model with an appropriate optimizer, loss function, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
Train the model
Python3
# Step 5: Train the model
history = model.fit(
    [X_train_encoded['input_ids'],
     X_train_encoded['token_type_ids'],
     X_train_encoded['attention_mask']],
    Target,
    validation_data=(
        [X_val_encoded['input_ids'],
         X_val_encoded['token_type_ids'],
         X_val_encoded['attention_mask']],
        y_val),
    batch_size=32,
    epochs=3)
Output:
Epoch 1/3
782/782 [==============================] - 513s 587ms/step - loss: 0.3445 - accuracy: 0.8446 - val_loss: 0.2710 - val_accuracy: 0.8880
Epoch 2/3
782/782 [==============================] - 432s 552ms/step - loss: 0.2062 - accuracy: 0.9186 - val_loss: 0.2686 - val_accuracy: 0.8886
Epoch 3/3
782/782 [==============================] - 431s 551ms/step - loss: 0.1105 - accuracy: 0.9615 - val_loss: 0.3235 - val_accuracy: 0.8908
Step 6: Evaluate the model
Python3
# Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
    [X_test_encoded['input_ids'],
     X_test_encoded['token_type_ids'],
     X_test_encoded['attention_mask']],
    y_test)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')
Output:
391/391 [==============================] - 67s 171ms/step - loss: 0.3417 - accuracy: 0.8873
Test loss: 0.3417432904243469, Test accuracy: 0.8872799873352051
Save the model and tokenizer to the local folder
Python3
path = 'path-to-save'

# Save tokenizer
tokenizer.save_pretrained(path + '/Tokenizer')

# Save model
model.save_pretrained(path + '/Model')
Load the model and tokenizer from the local folder
Python3
# Load tokenizer
bert_tokenizer = BertTokenizer.from_pretrained(path + '/Tokenizer')

# Load model
bert_model = TFBertForSequenceClassification.from_pretrained(path + '/Model')
Predict the sentiment of the test dataset
Python3
pred = bert_model.predict(
    [X_test_encoded['input_ids'],
     X_test_encoded['token_type_ids'],
     X_test_encoded['attention_mask']])

# pred is of type TFSequenceClassifierOutput
logits = pred.logits

# Use argmax along the appropriate axis to get the predicted labels
pred_labels = tf.argmax(logits, axis=1)

# Convert the predicted labels to a NumPy array
pred_labels = pred_labels.numpy()

label = {1: 'positive', 0: 'Negative'}

# Map the predicted labels to their corresponding strings using the label dictionary
pred_labels = [label[i] for i in pred_labels]
Actual = [label[i] for i in y_test]

print('Predicted Label :', pred_labels[:10])
print('Actual Label    :', Actual[:10])
Output:
391/391 [==============================] - 68s 167ms/step
Predicted Label : ['positive', 'positive', 'positive', 'positive', 'positive', 'Negative', 'Negative', 'Negative', 'positive', 'positive']
Actual Label : ['positive', 'positive', 'positive', 'Negative', 'positive', 'Negative', 'Negative', 'Negative', 'positive', 'positive']
Classification Report
Python3
print("Classification Report: \n", classification_report(Actual, pred_labels))
Output:
Classification Report:
precision recall f1-score support
Negative 0.91 0.86 0.88 6250
positive 0.87 0.91 0.89 6250
accuracy 0.89 12500
macro avg 0.89 0.89 0.89 12500
weighted avg 0.89 0.89 0.89 12500
Step 7: Prediction with user inputs
Python3
def Get_sentiment(Review, Tokenizer=bert_tokenizer, Model=bert_model):
    # Convert Review to a list if it's not already a list
    if not isinstance(Review, list):
        Review = [Review]

    Input_ids, Token_type_ids, Attention_mask = Tokenizer.batch_encode_plus(
        Review,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors='tf').values()

    prediction = Model.predict([Input_ids, Token_type_ids, Attention_mask])

    # Use argmax along the appropriate axis to get the predicted labels
    pred_labels = tf.argmax(prediction.logits, axis=1)

    # Convert the TensorFlow tensor to a NumPy array and then to a list
    # to get the predicted sentiment labels
    pred_labels = [label[i] for i in pred_labels.numpy().tolist()]
    return pred_labels
Let’s predict with our own review
Python
Review = '''Bahubali is a blockbuster Indian movie that was released in 2015.
It is the first part of a two-part epic saga that tells the story of a legendary
hero who fights for his kingdom and his love. The movie has received rave
reviews from critics and audiences alike for its stunning visuals, spectacular
action scenes, and captivating storyline.'''
Get_sentiment(Review)
Output:
1/1 [==============================] - 3s 3s/step
['positive']
Conclusions
In this post, we showed how to use BERT for sentiment classification on the IMDB movie reviews dataset. We discussed the architecture and key features of BERT, such as bidirectional context, WordPiece tokenization, and fine-tuning, and walked through the code for loading the pre-trained model, building a classifier on top of it, training and evaluating it, and making predictions on new inputs. The fine-tuned model achieves strong accuracy on the sentiment classification task and handles complex and varied linguistic expressions well. This article is a practical guide for anyone interested in using BERT for text classification.