
Sentiment Classification Using BERT

BERT stands for Bidirectional Encoder Representations from Transformers and was proposed by researchers at Google AI Language in 2018. Although its original aim was to improve the understanding of queries in Google Search, BERT has become one of the most important and versatile architectures for natural language tasks, producing state-of-the-art results on sentence-pair classification, question answering, and more.

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a powerful technique for natural language processing that improves how well computers comprehend human language. Its foundation is the idea of exploiting bidirectional context to learn rich and insightful representations of words and phrases. By examining both sides of a word’s context simultaneously, BERT captures a word’s full meaning in context, in contrast to earlier models that considered only the left or the right context. This enables BERT to handle ambiguous and complex linguistic phenomena such as polysemy, co-reference, and long-distance dependencies.
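
For a quick intuition of what bidirectional context buys us, here is a small illustrative sketch (not part of the original walkthrough, and assuming the Hugging Face transformers and TensorFlow packages are installed). It pulls the contextual vector of the word “bank” from two different sentences with TFBertModel and compares them; the two vectors differ because the surrounding context differs.

Python3

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = TFBertModel.from_pretrained('bert-base-uncased')

sentences = ["He deposited cash at the bank.",
             "They had a picnic on the river bank."]
encoded = tokenizer(sentences, padding=True, return_tensors='tf')
outputs = bert(encoded)   # last_hidden_state has shape (batch, seq_len, 768)

# Pull out the contextual vector of the token "bank" in each sentence
bank_id = tokenizer.convert_tokens_to_ids('bank')
vectors = []
for i in range(len(sentences)):
    position = int(tf.argmax(tf.cast(encoded['input_ids'][i] == bank_id, tf.int32)))
    vectors.append(outputs.last_hidden_state[i, position])

# A cosine similarity below 1.0 shows the two "bank" embeddings are not identical:
# each one reflects its own sentence-level context
similarity = -tf.keras.losses.cosine_similarity(vectors[0], vectors[1])
print(float(similarity))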

The paper also proposed architectures for different downstream tasks. In this post, we will use BERT for a sentiment classification task, specifically the single-sentence classification architecture used for the CoLA (Corpus of Linguistic Acceptability) binary classification task.


Single Sentence Classification Task

BERT was released in two sizes:

  • BERT (BASE): 12 layers of encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
  • BERT (LARGE): 24 layers of encoder stack with 16 bidirectional self-attention heads and 1024 hidden units.

For the TensorFlow implementation, Google has provided two variants of both BERT BASE and BERT LARGE: Uncased and Cased. In the uncased version, the text is lowercased before WordPiece tokenization.
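
Since we will use the uncased variant below, here is a small illustrative comparison (an addition to the original article) of how the two tokenizers treat the same text; the exact subword splits depend on each model’s vocabulary.

Python3

from transformers import BertTokenizer

uncased_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
cased_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

sample = "Sentiment Classification Using BERT"
print(uncased_tokenizer.tokenize(sample))  # text is lowercased before WordPiece splitting
print(cased_tokenizer.tokenize(sample))    # case is preserved, so the splits can differ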

Sentiment Classification Using BERT:

Step 1: Import the necessary libraries

Python3




import os
import shutil
import tarfile
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
import pandas as pd
from bs4 import BeautifulSoup
import re
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


Step 2: Load the dataset

Python3




# Get the current working directory
current_folder = os.getcwd()
 
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz",
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    cache_dir=current_folder,
    extract=True)


Check the dataset folder

Python3




dataset_path = os.path.dirname(dataset)
# Check the dataset
os.listdir(dataset_path)


Output:

['aclImdb.tar.gz', 'aclImdb']

Check the ‘aclImdb’ directory

Python3




# Dataset directory
dataset_dir = os.path.join(dataset_path, 'aclImdb')
 
# Check the Dataset directory
os.listdir(dataset_dir)


Output:

['imdb.vocab', 'train', 'imdbEr.txt', 'README', 'test']

Check the ‘train’ dataset folder

Python3




train_dir = os.path.join(dataset_dir,'train')
os.listdir(train_dir)


Output:

['labeledBow.feat',
'urls_neg.txt',
'unsupBow.feat',
'urls_unsup.txt',
'urls_pos.txt',
'pos',
'neg',
'unsup']

Inspect the files in the ‘train’ directory by printing the first line of each file

Python3




for file in os.listdir(train_dir):
    file_path = os.path.join(train_dir, file)
    # Check if it's a file (not a directory)
    if os.path.isfile(file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            first_value = f.readline().strip()
            print(f"{file}: {first_value}")
    else:
        print(f"{file}: {file_path}")


Output:

labeledBow.feat: 9 0:9 1:1 2:4 3:4 4:6 5:4 6:2 7:2 8:4 10:4 12:2 26:1 27:1 28:1 29:2 32:1 41:1 45:1 47:1 50:1 54:2 57:1 59:1 63:2 64:1 66:1 68:2 70:1 72:1 78:1 100:1 106:1 116:1 122:1 125:1 136:1 140:1 142:1 150:1 167:1 183:1 201:1 207:1 208:1 213:1 217:1 230:1 255:1 321:5 343:1 357:1 370:1 390:2 468:1 514:1 571:1 619:1 671:1 766:1 877:1 1057:1 1179:1 1192:1 1402:2 1416:1 1477:2 1940:1 1941:1 2096:1 2243:1 2285:1 2379:1 2934:1 2938:1 3520:1 3647:1 4938:1 5138:4 5715:1 5726:1 5731:1 5812:1 8319:1 8567:1 10480:1 14239:1 20604:1 22409:4 24551:1 47304:1
urls_neg.txt: http://www.imdb.com/title/tt0064354/usercomments
unsupBow.feat: 0 0:8 1:6 3:5 4:2 5:1 7:1 8:5 9:2 10:1 11:2 13:3 16:1 17:1 18:1 19:1 22:3 24:1 26:3 28:1 30:1 31:1 35:2 36:1 39:2 40:1 41:2 46:2 47:1 48:1 52:1 63:1 67:1 68:1 74:1 81:1 83:1 87:1 104:1 105:1 112:1 117:1 131:1 151:1 155:1 170:1 198:1 225:1 226:1 288:2 291:1 320:1 331:1 342:1 364:1 374:1 384:2 385:1 407:1 437:1 441:1 465:1 468:1 470:1 519:1 595:1 615:1 650:1 692:1 851:1 937:1 940:1 1100:1 1264:1 1297:1 1317:1 1514:1 1728:1 1793:1 1948:1 2088:1 2257:1 2358:1 2584:2 2645:1 2735:1 3050:1 4297:1 5385:1 5858:1 7382:1 7767:1 7773:1 9306:1 10413:1 11881:1 15907:1 18613:1 18877:1 25479:1
urls_unsup.txt: http://www.imdb.com/title/tt0018515/usercomments
urls_pos.txt: http://www.imdb.com/title/tt0453418/usercomments
pos: datasets/aclImdb/train/pos
neg: datasets/aclImdb/train/neg
unsup: datasets/aclImdb/train/unsup

Load the movie reviews and convert them into a pandas DataFrame with their respective sentiments.

Here, 0 means negative and 1 means positive.

Python3




def load_dataset(directory):
    data = {"sentence": [], "sentiment": []}
    for file_name in os.listdir(directory):
        print(file_name)
        if file_name == 'pos':
            positive_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(positive_dir):
                text = os.path.join(positive_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(1)
        elif file_name == 'neg':
            negative_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(negative_dir):
                text = os.path.join(negative_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(0)
             
    return pd.DataFrame.from_dict(data)


Load the training datasets

Python3




# Load the dataset from the train_dir
train_df = load_dataset(train_dir)
print(train_df.head())


Output:

labeledBow.feat
urls_neg.txt
unsupBow.feat
urls_unsup.txt
urls_pos.txt
pos
neg
unsup
sentence sentiment
0 A surprisingly complex and well crafted study ... 1
1 I thought this movie was fantastic. It was hil... 1
2 This is an extremely long movie, which means y... 1
3 A Pentagon science team seem to have perfected... 1
4 I was amazed at the improvements made in an an... 1

Load the test dataset in the same way

Python3




test_dir = os.path.join(dataset_dir,'test')
 
# Load the dataset from the test_dir
test_df = load_dataset(test_dir)
print(test_df.head())


Output:

labeledBow.feat
urls_neg.txt
urls_pos.txt
pos
neg
sentence sentiment
0 Watched this on KQED, with Frank Baxter commen... 1
1 Frank Sinatra took this role, chewed it up wit... 1
2 I don't pretend to be a huge Asterix fan, havi... 1
3 A very interesting documentary - certainly a l... 1
4 Good drama/comedy, with two good performances ... 1

Step 3: Preprocessing

Python3




sentiment_counts = train_df['sentiment'].value_counts()

fig = px.bar(x=sentiment_counts.index.map({0: 'Negative', 1: 'Positive'}),
             y=sentiment_counts.values,
             color=sentiment_counts.index,
             color_discrete_sequence=px.colors.qualitative.Dark24,
             title='Sentiment Counts')

fig.update_layout(title='Sentiment Counts',
                  xaxis_title='Sentiment',
                  yaxis_title='Counts',
                  template='plotly_dark')

# Show the bar chart
fig.show()
pyo.plot(fig, filename='Sentiment Counts.html', auto_open=True)


Output:


Sentiment Counts

Text Cleaning

Python3




def text_cleaning(text):
    soup = BeautifulSoup(text, "html.parser")
    text = re.sub(r'\[[^]]*\]', '', soup.get_text())
    pattern = r"[^a-zA-Z0-9\s,']"
    text = re.sub(pattern, '', text)
    return text
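
A quick sanity check of the cleaner (the sample string below is made up for illustration): HTML tags and bracketed text are removed, and everything except letters, digits, whitespace, commas, and apostrophes is stripped.

Python3

sample_review = "<b>Loved</b> this movie!<br /> [contains spoilers] A solid 9/10, wasn't it?"
print(text_cleaning(sample_review))
# The <b> and <br /> tags and the bracketed note disappear; punctuation such as
# '!' and '/' is dropped, while commas and apostrophes (e.g. wasn't) are kept.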


Apply text_cleaning

Python3




# Train dataset
train_df['Cleaned_sentence'] = train_df['sentence'].apply(text_cleaning)
# Test dataset
test_df['Cleaned_sentence'] = test_df['sentence'].apply(text_cleaning)


Plot the reviews as word clouds

Python3




# Function to generate word cloud
def generate_wordcloud(text,Title):
    all_text = " ".join(text)
    wordcloud = WordCloud(width=800,
                          height=400,
                          stopwords=set(STOPWORDS),
                          background_color='black').generate(all_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(Title)
    plt.show()


Positive Reviews

Python3




positive = train_df[train_df['sentiment']==1]['Cleaned_sentence'].tolist()
generate_wordcloud(positive,'Positive Review')


Output:


Positive Reviews WordCloud

Negative Reviews

Python3




negative = train_df[train_df['sentiment']==0]['Cleaned_sentence'].tolist()
generate_wordcloud(negative,'Negative Review')


Output:


Negative Reviews WordCloud

Separate the input text and target sentiment for both the train and test sets

Python3




# Training data
#Reviews = "[CLS] " +train_df['Cleaned_sentence'] + "[SEP]"
Reviews = train_df['Cleaned_sentence']
Target = train_df['sentiment']
 
# Test data
#test_reviews =  "[CLS] " +test_df['Cleaned_sentence'] + "[SEP]"
test_reviews = test_df['Cleaned_sentence']
test_targets = test_df['sentiment']


Split the test data into test and validation sets

Python3




x_val, x_test, y_val, y_test = train_test_split(test_reviews,
                                                    test_targets,
                                                    test_size=0.5,
                                                    stratify = test_targets)


Step 4: Tokenization & Encoding

BERT tokenization is used to convert the raw text into numerical inputs that can be fed into the BERT model. It tokenizes the text and performs some preprocessing to prepare it for the model’s input format. Let’s understand some of the key features of the BERT tokenizer.

  • The BERT tokenizer splits words into subwords, or wordpieces. For example, the word “geeksforgeeks” can be split into “geeks”, “##for”, and “##geeks”. The “##” prefix indicates that the subword is a continuation of the previous one. This reduces the vocabulary size and helps the model deal with rare or unknown words.
  • The BERT tokenizer adds special tokens like [CLS], [SEP], and [MASK] to the sequence. These tokens have special meanings:
    • [CLS] is used for classification and represents the entire input in the case of sentiment analysis,
    • [SEP] is used as a separator, i.e. to mark the boundaries between different sentences or segments,
    • [MASK] is used for masking, i.e. to hide some tokens from the model during pre-training.
  • The BERT tokenizer returns three components as outputs (a small example follows this list):
    • input_ids: the numerical identifiers of the vocabulary tokens
    • token_type_ids: identifies which segment or sentence each token belongs to
    • attention_mask: flags that tell the model which tokens to pay attention to and which to disregard
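
Before encoding the whole dataset, here is a quick illustrative look (not in the original article) at what these outputs look like for one short review:

Python3

from transformers import BertTokenizer

demo_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
demo = demo_tokenizer("The movie was surprisingly good", return_tensors='tf')

# The special [CLS] and [SEP] tokens are added automatically
print(demo_tokenizer.convert_ids_to_tokens(demo['input_ids'][0].numpy().tolist()))

# The three components described above
print(list(demo.keys()))   # ['input_ids', 'token_type_ids', 'attention_mask']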

Load the pre-trained BERT tokenizer

Python3




#Tokenize and encode the data using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


Apply the BERT tokenizer to the training, validation, and test datasets

Python3




max_len= 128
# Tokenize and encode the sentences
X_train_encoded = tokenizer.batch_encode_plus(Reviews.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')
 
X_val_encoded = tokenizer.batch_encode_plus(x_val.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')
 
X_test_encoded = tokenizer.batch_encode_plus(x_test.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')


Check the encoded dataset

Python3




k = 0
print('Training Comments -->>',Reviews[k])
print('\nInput Ids -->>\n',X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->>\n',tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->>\n',X_train_encoded['attention_mask'][k])
print('\nLabels -->>',Target[k])


Output:

Training Comments -->> Valentine is a horrible movie This is what I thought of itActing Very bad Katherine Heigl can not act 
The other's weren't much betterStory The story was okay, but it could have been more developed This movie had the potential to be a great movie,
but it failedMusic Yes, some of the music was pretty coolOriginality Not very original The name Paige Prescott' Recognize PrescottBottom Line
Don't see Valentine It's a really stupid movie110
Input Ids -->>
tf.Tensor(
[ 101 10113 2003 1037 9202 3185 2023 2003 2054 1045 2245 1997
2009 18908 2075 2200 2919 9477 2002 8004 2140 2064 2025 2552
1996 2060 1005 1055 4694 1005 1056 2172 2488 23809 2100 1996
2466 2001 3100 1010 2021 2009 2071 2031 2042 2062 2764 2023
3185 2018 1996 4022 2000 2022 1037 2307 3185 1010 2021 2009
3478 27275 2748 1010 2070 1997 1996 2189 2001 3492 4658 10050
24965 3012 2025 2200 2434 1996 2171 17031 20719 1005 6807 20719
18384 20389 2240 2123 1005 1056 2156 10113 2009 1005 1055 1037
2428 5236 3185 14526 2692 102 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)
Decoded Ids -->>
[CLS] valentine is a horrible movie this is what i thought of itacting very bad katherine heigl can not act
the other's weren't much betterstory the story was okay, but it could have been more developed this movie had the potential to be a great movie,
but it failedmusic yes, some of the music was pretty cooloriginality not very original the name paige prescott'recognize prescottbottom line
don't see valentine it's a really stupid movie110 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Attention Mask -->>
tf.Tensor(
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)
Labels -->> 0

Step 5: Build the classification model

Load the model

Python3




# Initialize the model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)


Output:

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 109482240

dropout_37 (Dropout) multiple 0

classifier (Dense) multiple 1538

=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________

If the task at hand is similar to the one on which the checkpoint model was trained, we can use TFBertForSequenceClassification to provide predictions without further training.
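
As a minimal sketch of that idea (the checkpoint name below is a placeholder, not a real model id), a BERT checkpoint that has already been fine-tuned on a similar sentiment task could be loaded and used for prediction directly:

Python3

# Hypothetical: load a checkpoint already fine-tuned for binary sentiment
pretrained_sentiment = TFBertForSequenceClassification.from_pretrained(
    "path/to/finetuned-sentiment-bert")   # placeholder path / model id

sample = tokenizer("A genuinely moving film", padding=True, truncation=True,
                   max_length=128, return_tensors='tf')
sample_logits = pretrained_sentiment(sample).logits
print(tf.argmax(sample_logits, axis=1).numpy())   # 0 = negative, 1 = positive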

Compile the model

Python3




# Compile the model with an appropriate optimizer, loss function, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])


Train the model

Python3




# Step 5: Train the model
history = model.fit(
    [X_train_encoded['input_ids'], X_train_encoded['token_type_ids'], X_train_encoded['attention_mask']],
    Target,
    validation_data=(
      [X_val_encoded['input_ids'], X_val_encoded['token_type_ids'], X_val_encoded['attention_mask']],y_val),
    batch_size=32,
    epochs=3
)


Output:

Epoch 1/3
782/782 [==============================] - 513s 587ms/step - loss: 0.3445 - accuracy: 0.8446 - val_loss: 0.2710 - val_accuracy: 0.8880
Epoch 2/3
782/782 [==============================] - 432s 552ms/step - loss: 0.2062 - accuracy: 0.9186 - val_loss: 0.2686 - val_accuracy: 0.8886
Epoch 3/3
782/782 [==============================] - 431s 551ms/step - loss: 0.1105 - accuracy: 0.9615 - val_loss: 0.3235 - val_accuracy: 0.8908

Step 6: Evaluate the model

Python3




#Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
    [X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']],
    y_test
)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')


Output:

391/391 [==============================] - 67s 171ms/step - loss: 0.3417 - accuracy: 0.8873
Test loss: 0.3417432904243469, Test accuracy: 0.8872799873352051

Save the model and tokenizer to the local folder

Python3




path = 'path-to-save'
# Save tokenizer
tokenizer.save_pretrained(path +'/Tokenizer')
 
# Save model
model.save_pretrained(path +'/Model')


Load the model and tokenizer from the local folder

Python3




# Load tokenizer
bert_tokenizer = BertTokenizer.from_pretrained(path +'/Tokenizer')
 
# Load model
bert_model = TFBertForSequenceClassification.from_pretrained(path +'/Model')


Predict the sentiment of the test dataset

Python3




pred = bert_model.predict(
    [X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']])
 
# pred is of type TFSequenceClassifierOutput
logits = pred.logits
 
# Use argmax along the appropriate axis to get the predicted labels
pred_labels = tf.argmax(logits, axis=1)
 
# Convert the predicted labels to a NumPy array
pred_labels = pred_labels.numpy()
 
label = {
    1: 'positive',
    0: 'Negative'
}
 
# Map the predicted labels to their corresponding strings using the label dictionary
pred_labels = [label[i] for i in pred_labels]
Actual = [label[i] for i in y_test]
 
print('Predicted Label :', pred_labels[:10])
print('Actual Label    :', Actual[:10])


Output:

391/391 [==============================] - 68s 167ms/step
Predicted Label : ['positive', 'positive', 'positive', 'positive', 'positive', 'Negative', 'Negative', 'Negative', 'positive', 'positive']
Actual Label : ['positive', 'positive', 'positive', 'Negative', 'positive', 'Negative', 'Negative', 'Negative', 'positive', 'positive']

Classification Report

Python3




print("Classification Report: \n", classification_report(Actual, pred_labels))


Output:

Classification Report: 
              precision    recall  f1-score   support

    Negative       0.91      0.86      0.88      6250
    positive       0.87      0.91      0.89      6250

    accuracy                           0.89     12500
   macro avg       0.89      0.89      0.89     12500
weighted avg       0.89      0.89      0.89     12500

Step 7: Prediction with user inputs

Python3




def Get_sentiment(Review, Tokenizer=bert_tokenizer, Model=bert_model):
    # Convert Review to a list if it's not already a list
    if not isinstance(Review, list):
        Review = [Review]
 
    Input_ids, Token_type_ids, Attention_mask = Tokenizer.batch_encode_plus(Review,
                                                                             padding=True,
                                                                             truncation=True,
                                                                             max_length=128,
                                                                             return_tensors='tf').values()
    prediction = Model.predict([Input_ids, Token_type_ids, Attention_mask])
 
    # Use argmax along the appropriate axis to get the predicted labels
    pred_labels = tf.argmax(prediction.logits, axis=1)
 
    # Convert the TensorFlow tensor to a NumPy array and then to a list to get the predicted sentiment labels
    pred_labels = [label[i] for i in pred_labels.numpy().tolist()]
    return pred_labels


Let’s predict with our own review

Python3




Review ='''Bahubali is a blockbuster Indian movie that was released in 2015.
It is the first part of a two-part epic saga that tells the story of a legendary hero who fights for his kingdom and his love.
The movie has received rave reviews from critics and audiences alike for its stunning visuals,
spectacular action scenes, and captivating storyline.'''
Get_sentiment(Review)


Output:

1/1 [==============================] - 3s 3s/step
['positive']

Conclusions

In this post, we showed how to use BERT for sentiment classification on the IMDB movie reviews dataset. We discussed the architecture and key ideas behind BERT, such as bidirectional context, WordPiece tokenization, and fine-tuning. We also walked through the code for loading the BERT model, building a classifier on top of it, training and evaluating it, and making predictions on fresh inputs. BERT achieves strong accuracy on this sentiment classification task and handles complex and varied linguistic expressions well. This article should serve as a helpful, practical guide for anyone interested in using BERT for text classification tasks.
