BERT stands for Bidirectional Encoder Representations from Transformers and was proposed by researchers at Google AI Language in 2018. Although its original aim was to improve how Google Search understands the meaning of queries, BERT has become one of the most important and versatile architectures for natural language tasks, producing state-of-the-art results on sentence-pair classification, question answering, and more.
Bidirectional Encoder Representations from Transformers (BERT)
BERT is a powerful technique for natural language processing that improves how well computers understand human language. Its foundation is the idea of exploiting bidirectional context to learn rich, insightful word and phrase representations. By examining both sides of a word's context simultaneously, BERT captures the full meaning of a word in its context, in contrast to earlier models that considered only the left or the right context. This lets BERT handle ambiguous and complex linguistic phenomena such as polysemy, co-reference, and long-distance dependencies.
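The effect of bidirectional context is easy to see with BERT's masked-language-modelling head. Below is a minimal sketch, not part of the original tutorial (the example sentence is illustrative), that uses the Hugging Face fill-mask pipeline: BERT ranks candidates for [MASK] using the words on both its left and its right.
Python3
# Minimal sketch: BERT's masked-language-model head uses context on both
# sides of the mask. The example sentence is illustrative only.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Both "went to the" (left context) and "to deposit his paycheck"
# (right context) influence the ranking of candidate tokens.
for result in unmasker("He went to the [MASK] to deposit his paycheck."):
    print(result["token_str"], round(result["score"], 3))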
The paper also describes how the same pre-trained model can be adapted to different downstream tasks. In this post, we will use the BERT architecture for a sentiment classification task, specifically the setup used for the CoLA (Corpus of Linguistic Acceptability) binary classification task.
BERT was released in two sizes:
- BERT (BASE): 12 layers of encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
- BERT (LARGE): 24 layers of encoder stack with 16 bidirectional self-attention heads and 1024 hidden units.
For the TensorFlow implementation, Google has released two variants of both BERT BASE and BERT LARGE: Uncased and Cased. In the uncased version, text is lowercased before WordPiece tokenization. The short sketch below checks these configurations and the lowercasing behaviour.
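As a quick check, the configuration of each published checkpoint can be inspected directly. This is a minimal sketch (assuming the Hugging Face checkpoints 'bert-base-uncased' and 'bert-large-uncased' can be downloaded) that prints the layer, head, and hidden-unit counts and shows the uncased tokenizer lowercasing its input.
Python3
# Minimal sketch: inspect the BERT BASE / LARGE configurations and the
# lowercasing behaviour of the uncased tokenizer.
from transformers import BertConfig, BertTokenizer

for name in ["bert-base-uncased", "bert-large-uncased"]:
    config = BertConfig.from_pretrained(name)
    print(name,
          "| layers:", config.num_hidden_layers,
          "| attention heads:", config.num_attention_heads,
          "| hidden units:", config.hidden_size)

# The uncased tokenizer lowercases text before WordPiece tokenization.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("BERT Handles Mixed CASE"))  # all-lowercase subwords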
Sentiment Classification Using BERT:
Step 1: Import the necessary libraries
Python3
import os
import shutil
import tarfile
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

import pandas as pd
from bs4 import BeautifulSoup
import re

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Step 2: Load the dataset
Python3
# Get the current working directory
current_folder = os.getcwd()

# Download and extract the IMDB reviews archive
# (the standard Stanford aclImdb dataset)
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz",
    origin="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    cache_dir=current_folder,
    extract=True)
Check the dataset folder
Python3
dataset_path = os.path.dirname(dataset)

# Check the dataset
os.listdir(dataset_path)
Output:
['aclImdb.tar.gz', 'aclImdb']
Check the ‘aclImdb’ directory
Python3
# Dataset directory
dataset_dir = os.path.join(dataset_path, 'aclImdb')

# Check the Dataset directory
os.listdir(dataset_dir)
Output:
['imdb.vocab', 'train', 'imdbEr.txt', 'README', 'test']
Check the ‘Train’ dataset folder
Python3
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)
Output:
['labeledBow.feat',
'urls_neg.txt',
'unsupBow.feat',
'urls_unsup.txt',
'urls_pos.txt',
'pos',
'neg',
'unsup']
Read the files of the 'train' directory
Python3
for file in os.listdir(train_dir):
    file_path = os.path.join(train_dir, file)

    # Check if it's a file (not a directory)
    if os.path.isfile(file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            first_value = f.readline().strip()
            print(f"{file}: {first_value}")
    else:
        print(f"{file}: {file_path}")
Output:
labeledBow.feat: 9 0:9 1:1 2:4 3:4 4:6 5:4 6:2 7:2 8:4 10:4 12:2 26:1 27:1 28:1 29:2 32:1 41:1 45:1 47:1 50:1 54:2 57:1 59:1 63:2 64:1 66:1 68:2 70:1 72:1 78:1 100:1 106:1 116:1 122:1 125:1 136:1 140:1 142:1 150:1 167:1 183:1 201:1 207:1 208:1 213:1 217:1 230:1 255:1 321:5 343:1 357:1 370:1 390:2 468:1 514:1 571:1 619:1 671:1 766:1 877:1 1057:1 1179:1 1192:1 1402:2 1416:1 1477:2 1940:1 1941:1 2096:1 2243:1 2285:1 2379:1 2934:1 2938:1 3520:1 3647:1 4938:1 5138:4 5715:1 5726:1 5731:1 5812:1 8319:1 8567:1 10480:1 14239:1 20604:1 22409:4 24551:1 47304:1
urls_neg.txt: http://www.imdb.com/title/tt0064354/usercomments
unsupBow.feat: 0 0:8 1:6 3:5 4:2 5:1 7:1 8:5 9:2 10:1 11:2 13:3 16:1 17:1 18:1 19:1 22:3 24:1 26:3 28:1 30:1 31:1 35:2 36:1 39:2 40:1 41:2 46:2 47:1 48:1 52:1 63:1 67:1 68:1 74:1 81:1 83:1 87:1 104:1 105:1 112:1 117:1 131:1 151:1 155:1 170:1 198:1 225:1 226:1 288:2 291:1 320:1 331:1 342:1 364:1 374:1 384:2 385:1 407:1 437:1 441:1 465:1 468:1 470:1 519:1 595:1 615:1 650:1 692:1 851:1 937:1 940:1 1100:1 1264:1 1297:1 1317:1 1514:1 1728:1 1793:1 1948:1 2088:1 2257:1 2358:1 2584:2 2645:1 2735:1 3050:1 4297:1 5385:1 5858:1 7382:1 7767:1 7773:1 9306:1 10413:1 11881:1 15907:1 18613:1 18877:1 25479:1
urls_unsup.txt: http://www.imdb.com/title/tt0018515/usercomments
urls_pos.txt: http://www.imdb.com/title/tt0453418/usercomments
pos: datasets/aclImdb/train/pos
neg: datasets/aclImdb/train/neg
unsup: datasets/aclImdb/train/unsup
Load the movie reviews and convert them into a pandas DataFrame with their respective sentiment labels.
Here 0 means Negative and 1 means Positive
Python3
def load_dataset(directory):
    data = {"sentence": [], "sentiment": []}
    for file_name in os.listdir(directory):
        print(file_name)
        if file_name == 'pos':
            positive_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(positive_dir):
                text = os.path.join(positive_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(1)
        elif file_name == 'neg':
            negative_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(negative_dir):
                text = os.path.join(negative_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(0)
    return pd.DataFrame.from_dict(data)
Load the training dataset
Python3
# Load the dataset from the train_dir
train_df = load_dataset(train_dir)
print(train_df.head())
Output:
labeledBow.feat
urls_neg.txt
unsupBow.feat
urls_unsup.txt
urls_pos.txt
pos
neg
unsup
sentence sentiment
0 A surprisingly complex and well crafted study ... 1
1 I thought this movie was fantastic. It was hil... 1
2 This is an extremely long movie, which means y... 1
3 A Pentagon science team seem to have perfected... 1
4 I was amazed at the improvements made in an an... 1
Load the test dataset in the same way
Python3
test_dir = os.path.join(dataset_dir, 'test')

# Load the dataset from the test_dir
test_df = load_dataset(test_dir)
print(test_df.head())
Output:
labeledBow.feat
urls_neg.txt
urls_pos.txt
pos
neg
sentence sentiment
0 Watched this on KQED, with Frank Baxter commen... 1
1 Frank Sinatra took this role, chewed it up wit... 1
2 I don't pretend to be a huge Asterix fan, havi... 1
3 A very interesting documentary - certainly a l... 1
4 Good drama/comedy, with two good performances ... 1
Step 3: Preprocessing
Python3
# Count the positive and negative reviews in the training set
sentiment_counts = train_df['sentiment'].value_counts()

# Bar chart of the sentiment distribution
fig = px.bar(x=sentiment_counts.index.map({0: 'Negative', 1: 'Positive'}),
             y=sentiment_counts.values,
             color=sentiment_counts.index,
             color_discrete_sequence=px.colors.qualitative.Dark24,
             title='<b>Sentiments Counts')
fig.update_layout(title='Sentiments Counts',
                  xaxis_title='Sentiment',
                  yaxis_title='Counts',
                  template='plotly_dark')

# Show the bar chart
fig.show()
pyo.plot(fig, filename='Sentiments Counts.html', auto_open=True)
Output: (bar chart of positive vs. negative review counts in the training set)
Text Cleaning
Python3
def text_cleaning(text):
    # Strip HTML tags from the review
    soup = BeautifulSoup(text, "html.parser")
    # Remove text in square brackets
    text = re.sub(r'\[[^]]*\]', '', soup.get_text())
    # Keep only letters, digits, whitespace, commas and apostrophes
    pattern = r"[^a-zA-Z0-9\s,']"
    text = re.sub(pattern, '', text)
    return text
Apply text_cleaning
Python3
# Train dataset
train_df['Cleaned_sentence'] = train_df['sentence'].apply(text_cleaning).tolist()
# Test dataset
test_df['Cleaned_sentence'] = test_df['sentence'].apply(text_cleaning)
Plot reviews as word clouds
Python3
# Function to generate a word cloud from a list of texts
def generate_wordcloud(text, Title):
    all_text = " ".join(text)
    wordcloud = WordCloud(width=800,
                          height=400,
                          stopwords=set(STOPWORDS),
                          background_color='black').generate(all_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(Title)
    plt.show()
Positive Reviews
Python3
positive = train_df[train_df['sentiment'] == 1]['Cleaned_sentence'].tolist()
generate_wordcloud(positive, 'Positive Review')
Output: (word cloud of the most frequent words in positive reviews)
Negative Reviews
Python3
negative = train_df[train_df['sentiment'] == 0]['Cleaned_sentence'].tolist()
generate_wordcloud(negative, 'Negative Review')
Output: (word cloud of the most frequent words in negative reviews)
Separate the input text and the target sentiment for both the train and test sets
Python3
# Training data
# Reviews = "[CLS] " + train_df['Cleaned_sentence'] + "[SEP]"
Reviews = train_df['Cleaned_sentence']
Target = train_df['sentiment']

# Test data
# test_reviews = "[CLS] " + test_df['Cleaned_sentence'] + "[SEP]"
test_reviews = test_df['Cleaned_sentence']
test_targets = test_df['sentiment']
Split the test data into validation and test sets
Python3
x_val, x_test, y_val, y_test = train_test_split(test_reviews,
                                                test_targets,
                                                test_size=0.5,
                                                stratify=test_targets)
Step 4: Tokenization & Encoding
BERT tokenization converts raw text into the numerical inputs that the BERT model expects. It tokenizes the text and performs some preprocessing to prepare it for the model's input format. Let's look at the key features of the BERT tokenizer; a short sketch after the list below illustrates them.
- The BERT tokenizer splits words into subwords, or wordpieces. For example, a rare word such as "embeddings" may be split into "em", "##bed", "##ding", and "##s". The "##" prefix indicates that the subword continues the previous one. This keeps the vocabulary small and helps the model deal with rare or unknown words.
- The BERT tokenizer adds special tokens such as [CLS], [SEP], and [MASK] to the sequence. These tokens have special meanings:
- [CLS] is used for classification; its final hidden state represents the entire input in tasks such as sentiment analysis,
- [SEP] is used as a separator, i.e. to mark the boundaries between different sentences or segments,
- [MASK] is used for masking, i.e. to hide some tokens from the model during pre-training.
- The BERT tokenizer returns three components as outputs:
- input_ids: The numerical identifiers of the vocabulary tokens.
- token_type_ids: Identifies which segment or sentence each token belongs to.
- attention_mask: Flags that tell the model which tokens to pay attention to and which (such as padding) to ignore.
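As a quick illustration before encoding the whole dataset, here is a minimal sketch (the example sentence is illustrative, and the exact subword splits depend on the checkpoint's vocabulary) showing the WordPiece splits from tokenize() and the three arrays returned by encode_plus().
Python3
# Minimal sketch: what the BERT tokenizer produces for one sentence.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rare words are split into "##"-prefixed subwords,
# e.g. "embeddings" -> ['em', '##bed', '##ding', '##s'].
print(tokenizer.tokenize("BERT embeddings capture context"))

# encode_plus adds [CLS]/[SEP], pads to max_length and returns the
# input_ids, token_type_ids and attention_mask described above.
encoded = tokenizer.encode_plus("BERT embeddings capture context",
                                padding="max_length",
                                truncation=True,
                                max_length=12)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
print(encoded["attention_mask"])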
Load the pre-trained BERT tokenizer
Python3
# Load the pre-trained BERT tokenizer (uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
Apply BERT tokenization to the training, validation, and test datasets
Python3
max_len = 128

# Tokenize and encode the sentences
X_train_encoded = tokenizer.batch_encode_plus(Reviews.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length=max_len,
                                              return_tensors='tf')

X_val_encoded = tokenizer.batch_encode_plus(x_val.tolist(),
                                            padding=True,
                                            truncation=True,
                                            max_length=max_len,
                                            return_tensors='tf')

X_test_encoded = tokenizer.batch_encode_plus(x_test.tolist(),
                                             padding=True,
                                             truncation=True,
                                             max_length=max_len,
                                             return_tensors='tf')
Check the encoded dataset
Python3
k = 0
print('Training Comments -->>', Reviews[k])
print('\nInput Ids -->>\n', X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->>\n', tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->>\n', X_train_encoded['attention_mask'][k])
print('\nLabels -->>', Target[k])
Output:
Training Comments -->> Valentine is a horrible movie This is what I thought of itActing Very bad Katherine Heigl can not act
The other's weren't much betterStory The story was okay, but it could have been more developed This movie had the potential to be a great movie,
but it failedMusic Yes, some of the music was pretty coolOriginality Not very original The name Paige Prescott' Recognize PrescottBottom Line
Don't see Valentine It's a really stupid movie110
Input Ids -->>
tf.Tensor(
[ 101 10113 2003 1037 9202 3185 2023 2003 2054 1045 2245 1997
2009 18908 2075 2200 2919 9477 2002 8004 2140 2064 2025 2552
1996 2060 1005 1055 4694 1005 1056 2172 2488 23809 2100 1996
2466 2001 3100 1010 2021 2009 2071 2031 2042 2062 2764 2023
3185 2018 1996 4022 2000 2022 1037 2307 3185 1010 2021 2009
3478 27275 2748 1010 2070 1997 1996 2189 2001 3492 4658 10050
24965 3012 2025 2200 2434 1996 2171 17031 20719 1005 6807 20719
18384 20389 2240 2123 1005 1056 2156 10113 2009 1005 1055 1037
2428 5236 3185 14526 2692 102 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)
Decoded Ids -->>
[CLS] valentine is a horrible movie this is what i thought of itacting very bad katherine heigl can not act
the other's weren't much betterstory the story was okay, but it could have been more developed this movie had the potential to be a great movie,
but it failedmusic yes, some of the music was pretty cooloriginality not very original the name paige prescott'recognize prescottbottom line
don't see valentine it's a really stupid movie110 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Attention Mask -->>
tf.Tensor(
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)
Labels -->> 0
Step 5: Build the classification model
Load the model
Python3
# Initialize the model with a binary classification head
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                        num_labels=2)
Output:
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 109482240
dropout_37 (Dropout) multiple 0
classifier (Dense) multiple 1538
=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
If the task at hand is similar to the one on which a checkpoint was fine-tuned, TFBertForSequenceClassification can provide predictions without further training. Here, however, the classification head added on top of 'bert-base-uncased' is newly initialized, so we fine-tune the model on our sentiment data.
Compile the model
Python3
# Compile the model with an appropriate optimizer, loss function, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
Train the model
Python3
# Step 5: Train the model
history = model.fit(
    [X_train_encoded['input_ids'],
     X_train_encoded['token_type_ids'],
     X_train_encoded['attention_mask']],
    Target,
    validation_data=(
        [X_val_encoded['input_ids'],
         X_val_encoded['token_type_ids'],
         X_val_encoded['attention_mask']],
        y_val),
    batch_size=32,
    epochs=3)
Output:
Epoch 1/3
782/782 [==============================] - 513s 587ms/step - loss: 0.3445 - accuracy: 0.8446 - val_loss: 0.2710 - val_accuracy: 0.8880
Epoch 2/3
782/782 [==============================] - 432s 552ms/step - loss: 0.2062 - accuracy: 0.9186 - val_loss: 0.2686 - val_accuracy: 0.8886
Epoch 3/3
782/782 [==============================] - 431s 551ms/step - loss: 0.1105 - accuracy: 0.9615 - val_loss: 0.3235 - val_accuracy: 0.8908
Step 6: Evaluate the model
Python3
# Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
    [X_test_encoded['input_ids'],
     X_test_encoded['token_type_ids'],
     X_test_encoded['attention_mask']],
    y_test)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')
Output:
391/391 [==============================] - 67s 171ms/step - loss: 0.3417 - accuracy: 0.8873
Test loss: 0.3417432904243469, Test accuracy: 0.8872799873352051
Save the model and tokenizer to the local folder
Python3
path = 'path-to-save'

# Save tokenizer
tokenizer.save_pretrained(path + '/Tokenizer')

# Save model
model.save_pretrained(path + '/Model')
Load the model and tokenizer from the local folder
Python3
# Load tokenizer
bert_tokenizer = BertTokenizer.from_pretrained(path + '/Tokenizer')

# Load model
bert_model = TFBertForSequenceClassification.from_pretrained(path + '/Model')
Predict the sentiment of the test dataset
Python3
pred = bert_model.predict(
    [X_test_encoded['input_ids'],
     X_test_encoded['token_type_ids'],
     X_test_encoded['attention_mask']])

# pred is of type TFSequenceClassifierOutput
logits = pred.logits

# Use argmax along the appropriate axis to get the predicted labels
pred_labels = tf.argmax(logits, axis=1)

# Convert the predicted labels to a NumPy array
pred_labels = pred_labels.numpy()

label = {1: 'positive', 0: 'Negative'}

# Map the predicted labels to their corresponding strings using the label dictionary
pred_labels = [label[i] for i in pred_labels]
Actual = [label[i] for i in y_test]

print('Predicted Label :', pred_labels[:10])
print('Actual Label    :', Actual[:10])
Output:
391/391 [==============================] - 68s 167ms/step
Predicted Label : ['positive', 'positive', 'positive', 'positive', 'positive', 'Negative', 'Negative', 'Negative', 'positive', 'positive']
Actual Label : ['positive', 'positive', 'positive', 'Negative', 'positive', 'Negative', 'Negative', 'Negative', 'positive', 'positive']
Classification Report
Python3
print("Classification Report: \n", classification_report(Actual, pred_labels))
Output:
Classification Report:
precision recall f1-score support
Negative 0.91 0.86 0.88 6250
positive 0.87 0.91 0.89 6250
accuracy 0.89 12500
macro avg 0.89 0.89 0.89 12500
weighted avg 0.89 0.89 0.89 12500
Step 7: Prediction with user inputs
Python3
def Get_sentiment(Review, Tokenizer=bert_tokenizer, Model=bert_model):
    # Convert Review to a list if it's not already a list
    if not isinstance(Review, list):
        Review = [Review]

    Input_ids, Token_type_ids, Attention_mask = Tokenizer.batch_encode_plus(
        Review,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors='tf').values()

    prediction = Model.predict([Input_ids, Token_type_ids, Attention_mask])

    # Use argmax along the appropriate axis to get the predicted labels
    pred_labels = tf.argmax(prediction.logits, axis=1)

    # Convert the TensorFlow tensor to a NumPy array and then to a list
    # to get the predicted sentiment labels
    pred_labels = [label[i] for i in pred_labels.numpy().tolist()]
    return pred_labels
Let’s predict with our own review
Python
Review = '''Bahubali is a blockbuster Indian movie that was released in 2015.
It is the first part of a two-part epic saga that tells the story of a legendary
hero who fights for his kingdom and his love. The movie has received rave
reviews from critics and audiences alike for its stunning visuals, spectacular
action scenes, and captivating storyline.'''
Get_sentiment(Review)
Output:
1/1 [==============================] - 3s 3s/step
['positive']
Conclusions
In this post, we showed how to use BERT for sentiment classification on the IMDB movie reviews dataset. We discussed the architecture and key features of BERT, such as bidirectional context, WordPiece tokenization, and fine-tuning, and walked through the code for loading the pre-trained model, building a classifier on top of it, training and evaluating it, and making predictions on new inputs. The fine-tuned model achieves strong accuracy on the sentiment classification task and handles complex and varied linguistic expressions well. This article is a practical guide for anyone interested in using BERT for text classification.