Saturday, November 16, 2024

Hate Speech Detection using Deep Learning

You have probably come across social media posts whose main aim is to spread hate, stir up controversy, or abuse other users. Because such posts consist of text, NLP is a natural tool for filtering them out. This is one of the main applications of NLP and is known as a sentence classification task.

In this article, we will learn how to build an NLP-based sequence classification model that labels tweets as Hate Speech, Offensive Language, or Neither.

Importing Libraries and Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.

  • Pandas – This library loads the data into a 2D data frame and provides many functions for analysis.
  • Numpy – Numpy arrays are fast and can perform large computations in a short time.
  • Matplotlib/Seaborn/Wordcloud – These libraries are used to draw visualizations.
  • NLTK – The Natural Language Toolkit provides various functions to process raw textual data.

Python3




%%capture
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
 
# Text Pre-processing libraries
import nltk
import string
import warnings
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
 
# Tensorflow imports to build the model.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
 
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('wordnet')
warnings.filterwarnings('ignore')


Now let’s load the dataset into a pandas data frame and look at the first five rows of the dataset.

Python3




df = pd.read_csv('hate_speech.csv')
df.head()


Output:

First Five rows of the dataset


To see how many tweets we have, let’s print the shape of the data frame.

Python3




df.shape


Output:

(19826, 2)

Although there are only two columns in this dataset, let’s check their data types and non-null counts.

Python3




df.info()


Output:

Info about the dataset


The number of rows in the data frame matches the number of non-null values in each column, so there are no null values in the dataset. Next, let’s look at how the tweets are distributed across the classes.

Python3




plt.pie(df['class'].value_counts().values,
        labels = df['class'].value_counts().index,
        autopct='%1.1f%%')
plt.show()


Output:

Pie Chart of the distribution of classes


Here the three labels are as follows:

0 - Hate Speech
1 - Offensive Language
2 - Neither

The classes are imbalanced, so we need to handle this problem before we train a model on this dataset.

Text Preprocessing

Textual data is highly unstructured and needs attention on several fronts, such as:

  • Lowercasing
  • Punctuation removal
  • Stop word removal
  • Lemmatization

Although removing data means losing some information, we need to do this to make the data suitable for feeding into a machine learning model.

Python3




# Lower case all the words of the tweet before any preprocessing
df['tweet'] = df['tweet'].str.lower()
 
# Removing punctuations present in the text
punctuations_list = string.punctuation
def remove_punctuations(text):
    temp = str.maketrans('', '', punctuations_list)
    return text.translate(temp)
 
df['tweet']= df['tweet'].apply(lambda x: remove_punctuations(x))
df.head()


Output:

Dataset after removal of punctuation
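To see what the punctuation step does to a single tweet, here is a minimal standalone sketch using only the standard library. The sample tweet is made up for illustration. Note that `string.punctuation` covers ASCII punctuation only, and that it also strips `@` and `#`, so mentions and hashtags are reduced to plain words.

```python
import string

def remove_punctuations(text):
    # Translation table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

# Made-up tweet, used only for illustration
sample = "@user: that's #not ok!!!"
cleaned = remove_punctuations(sample.lower())
print(cleaned)
```

If mentions or hashtags carry signal for your task, you may prefer to handle them before this step rather than let them merge into ordinary words.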

The function below is a helper that removes the stop words and lemmatizes the remaining words.

Python3




# Build these once instead of once per word
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
 
def remove_stopwords(text):
    imp_words = []
 
    # Storing the important words
    for word in str(text).split():
 
        if word not in stop_words:
 
            # Lemmatize the word and keep the result
            # before appending to the imp_words list.
            imp_words.append(lemmatizer.lemmatize(word))
 
    output = " ".join(imp_words)
 
    return output
 
 
df['tweet'] = df['tweet'].apply(lambda text: remove_stopwords(text))
df.head()


Output:

Dataset after removal of stop words and lemmatization


A word cloud is a text visualization tool that helps us get insights into the most frequent words in the corpus.

Python3




def plot_word_cloud(data, typ):
  # Joining all the tweets to get the corpus
  tweet_corpus = " ".join(data['tweet'])
 
  plt.figure(figsize = (10,10))
   
  # Forming the word cloud
  wc = WordCloud(max_words = 100,
                width = 200,
                height = 100,
                collocations = False).generate(tweet_corpus)
   
  # Plotting the wordcloud obtained above
  plt.title(f'WordCloud for {typ} tweets.', fontsize = 15)
  plt.axis('off')
  plt.imshow(wc)
  plt.show()
  print()
 
plot_word_cloud(df[df['class']==2], typ='Neither')


Output:

Word cloud for the neither class of data


As we saw above, the data is highly imbalanced. We will address this with a mixture of downsampling and upsampling: we sample 3,500 tweets from class 1 and repeat class 0 three times.

Python3




class_2 = df[df['class'] == 2]
class_1 = df[df['class'] == 1].sample(n=3500)
class_0 = df[df['class'] == 0]
 
balanced_df = pd.concat([class_0, class_0, class_0, class_1, class_2], axis=0)
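The same idea can be tried on a toy dataset without pandas. This is only a sketch: `rebalance` is a hypothetical helper, the rows are invented, and the seed is an assumption added so that the downsampled subset is reproducible.

```python
import random
from collections import Counter

def rebalance(rows, down_to, up_factor, seed=22):
    """rows is a list of (text, label) pairs with labels 0, 1, 2.

    Label 1 is downsampled to `down_to` rows, label 0 is repeated
    `up_factor` times, and label 2 is kept unchanged.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: [], 2: []}
    for row in rows:
        by_label[row[1]].append(row)
    sampled_1 = rng.sample(by_label[1], down_to)
    return by_label[0] * up_factor + sampled_1 + by_label[2]

# Invented toy data: 2 rows of class 0, 20 of class 1, 6 of class 2
rows = ([(f"hate {i}", 0) for i in range(2)]
        + [(f"offensive {i}", 1) for i in range(20)]
        + [(f"neither {i}", 2) for i in range(6)])

balanced = rebalance(rows, down_to=6, up_factor=3)
counts = Counter(label for _, label in balanced)
print(counts)
```

One thing to keep in mind with this approach is that repeated rows are exact copies, so if the data is split after upsampling, the same tweet can land in both the training and the validation set.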


Now let’s check the data distribution across the three classes.

Python3




plt.pie(balanced_df['class'].value_counts().values,
        labels=balanced_df['class'].value_counts().index,
        autopct='%1.1f%%')
plt.show()


Output:

Pie chart for the distribution of the data in three classes


After this step, the data is much more evenly distributed across the three classes.

Converting Words to Vectors

We cannot feed words to a machine learning model because models work on numbers only. So, first, we will map each word to an integer token ID and pad the resulting sequences to a common length, after which our textual data is ready to be fed to a model. The dense word vectors themselves are learned later by the model's Embedding layer.

Python3




features = balanced_df['tweet']
target = balanced_df['class']
 
X_train, X_val, Y_train, Y_val = train_test_split(features,
                                                  target,
                                                  test_size=0.2,
                                                  random_state=22)
X_train.shape, X_val.shape


Output:

((8201,), (2051,))

We have successfully divided our data into training and validation data.

Python3




Y_train = pd.get_dummies(Y_train)
Y_val = pd.get_dummies(Y_val)
Y_train.shape, Y_val.shape


Output:

((8201, 3), (2051, 3))
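As a rough illustration of what one-hot encoding produces for the three class labels, here is a small pure-Python sketch. `one_hot` is a hypothetical helper, not part of pandas; it assumes the labels are the integers 0, 1, and 2, with one column per label in ascending order.

```python
def one_hot(labels, num_classes=3):
    # One row per label, with a 1 in the column matching the label
    return [[1 if label == k else 0 for k in range(num_classes)]
            for label in labels]

encoded = one_hot([0, 2, 1])
for row in encoded:
    print(row)
```

Each row has exactly one 1, which is the target format that the `categorical_crossentropy` loss expects.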

The class labels have been converted into one-hot encoded vectors. To tokenize the tweets, we will use a vocabulary size of 5000 and limit each tweet to at most 100 tokens.

Python3




max_words = 5000
max_len = 100
 
token = Tokenizer(num_words=max_words,
                  lower=True,
                  split=' ')
 
token.fit_on_texts(X_train)


We have fitted the tokenizer on our training data. Now we will use it to convert both the training and the validation data to padded sequences of token IDs.

Python3




# The tokenizer was already fitted on X_train above, so we reuse it here
 
# Converting the tweets to padded sequences of token IDs
Training_seq = token.texts_to_sequences(X_train)
Training_pad = pad_sequences(Training_seq,
                             maxlen=max_len,
                             padding='post',
                             truncating='post')
 
Testing_seq = token.texts_to_sequences(X_val)
Testing_pad = pad_sequences(Testing_seq,
                            maxlen=max_len,
                            padding='post',
                            truncating='post')
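To make the tokenize-then-pad step less of a black box, the following pure-Python sketch imitates it on a tiny invented corpus. `fit_vocab` and `to_padded` are hypothetical helpers written for this illustration, not Keras functions. The sketch assumes the behaviour used above: more frequent words get smaller IDs, ID 0 is reserved for padding, only IDs below `num_words` are kept, unknown words are dropped, and padding and truncation happen at the end of the sequence.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    # Count words across the training texts only
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # IDs start at 1 (0 is the padding value) and must stay below num_words
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_padded(texts, vocab, max_len):
    rows = []
    for text in texts:
        # Words outside the vocabulary are silently dropped
        ids = [vocab[word] for word in text.lower().split() if word in vocab]
        ids = ids[:max_len]                            # truncate at the end
        rows.append(ids + [0] * (max_len - len(ids)))  # pad at the end
    return rows

train_texts = ["good day", "bad day today", "good good vibes"]
vocab = fit_vocab(train_texts, num_words=4)
padded = to_padded(["bad day today", "unknown good"], vocab, max_len=4)
print(vocab)
print(padded)
```

Fitting the vocabulary on the training texts alone and reusing it for the validation texts mirrors the article's approach and keeps validation words from influencing the token IDs.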


Model Development and Evaluation

We will implement a Sequential model which will contain the following parts:

  • An Embedding layer to learn vector representations of the input token IDs.
  • A Bidirectional LSTM layer to identify useful patterns in the sequence.
  • Then we will have one fully connected layer.
  • We have included a BatchNormalization layer to enable stable and fast training and a Dropout layer before the final layer to reduce overfitting.
  • The final layer is the output layer which outputs soft probabilities for the three classes.

Python3




model = keras.models.Sequential([
    layers.Embedding(max_words, 32, input_length=max_len),
    layers.Bidirectional(layers.LSTM(16)),
    layers.Dense(512, activation='relu', kernel_regularizer='l1'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(3, activation='softmax')
])
 
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
 
model.summary()


Output:

Summary of the model architecture
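The parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below assumes the usual parameter formulas for Embedding, LSTM, Dense, and BatchNormalization layers and the layer sizes from the model definition above. The helper functions are hypothetical names used only for this calculation, so treat the result as a sanity check against the summary rather than a replacement for it.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per token ID
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

max_words, embed_dim, lstm_units = 5000, 32, 16

counts = {
    "embedding": embedding_params(max_words, embed_dim),
    # Bidirectional wraps two LSTMs, one per direction
    "bidirectional_lstm": 2 * lstm_params(embed_dim, lstm_units),
    # The two directions are concatenated, so the input width is 2 * 16
    "dense_512": dense_params(2 * lstm_units, 512),
    # gamma, beta, moving mean, moving variance for each of 512 features
    "batch_norm": 4 * 512,
    "dense_3": dense_params(512, 3),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>20}: {n}")
print(f"{'total':>20}: {total}")
```

Most of the parameters sit in the Embedding layer, which is why the vocabulary size has such a large effect on model size.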

While compiling a model we provide these three essential parameters:

  • optimizer – The method that minimizes the cost function using gradient descent (here, Adam).
  • loss – The function the optimizer minimizes. We also monitor it to see whether the model is improving with training.
  • metrics – Used to evaluate the model's predictions on the training and validation data (here, accuracy).
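To give a feel for the loss and metric chosen here, this sketch computes categorical cross-entropy and accuracy by hand for a couple of invented predictions. The functions are simplified stand-ins written for illustration, not the Keras implementations; in particular, the clipping constant `eps` is an assumption to avoid taking the log of zero.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Only the probability assigned to the true class contributes
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(true_rows, pred_rows):
    hits = sum(argmax(t) == argmax(p) for t, p in zip(true_rows, pred_rows))
    return hits / len(true_rows)

# Invented one-hot labels and softmax outputs for two tweets
y_true = [[0, 1, 0], [1, 0, 0]]
y_pred = [[0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]

loss_first = categorical_crossentropy(y_true[0], y_pred[0])
acc = accuracy(y_true, y_pred)
print(round(loss_first, 4), acc)
```

The loss rewards confident correct predictions, while accuracy only checks whether the most probable class matches the label, which is why the two curves can move differently during training.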

Python3




keras.utils.plot_model(
    model,
    show_shapes=True,
    show_dtype=True,
    show_layer_activations=True
)


Output:

Change in input as it passes in the model


Callback

Callbacks are used to check whether the model is improving with each epoch and to act when it is not. ReduceLROnPlateau lowers the learning rate when the monitored metric stops improving. If performance still does not improve, EarlyStopping halts training. We can also define custom callbacks to stop training once the desired results have been obtained.

Python3




from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
 
es = EarlyStopping(patience=3,
                   monitor = 'val_accuracy',
                   restore_best_weights = True)
 
lr = ReduceLROnPlateau(patience = 2,
                       monitor = 'val_loss',
                       factor = 0.5,
                       verbose = 0)
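The interplay of the two callbacks can be simulated on a made-up training history. The sketch below is a simplified model of their logic: it uses the same patience values and factor as above, but it ignores options such as `min_delta`, `cooldown`, and `min_lr`. `simulate_callbacks` and the metric values are hypothetical and exist only for this illustration.

```python
def simulate_callbacks(val_loss, val_acc, lr=1e-3,
                       lr_patience=2, factor=0.5, stop_patience=3):
    best_loss, best_acc = float("inf"), float("-inf")
    lr_wait = stop_wait = 0
    best_epoch = 0

    for epoch, (loss, acc) in enumerate(zip(val_loss, val_acc), start=1):
        # ReduceLROnPlateau-style rule, monitoring validation loss
        if loss < best_loss:
            best_loss, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor
                lr_wait = 0

        # EarlyStopping-style rule, monitoring validation accuracy
        if acc > best_acc:
            best_acc, best_epoch, stop_wait = acc, epoch, 0
        else:
            stop_wait += 1
            if stop_wait >= stop_patience:
                return {"stopped_epoch": epoch, "best_epoch": best_epoch, "lr": lr}

    return {"stopped_epoch": None, "best_epoch": best_epoch, "lr": lr}

# Invented history: the model peaks at epoch 3 and then degrades
val_loss = [0.90, 0.70, 0.60, 0.65, 0.66, 0.67, 0.70]
val_acc = [0.60, 0.70, 0.80, 0.79, 0.78, 0.77, 0.76]

result = simulate_callbacks(val_loss, val_acc)
print(result)
```

With `restore_best_weights = True`, the weights from the best epoch are the ones kept after training stops, so the extra epochs spent waiting do not hurt the final model.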


So, finally, we have reached the step when we will train our model.

Python3




history = model.fit(Training_pad, Y_train,
                    validation_data=(Testing_pad, Y_val),
                    epochs=50,
                    verbose=1,
                    batch_size=32,
                    callbacks=[lr, es])


Output:

Training progress

 

To get a better picture of the training progress we should plot the graph of loss and accuracy epoch-by-epoch.

Python3




history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
history_df.loc[:, ['accuracy', 'val_accuracy']].plot()
plt.show()


Output:

Graph of loss and accuracy epoch by epoch

Conclusion

The model we have trained overfits the training data a little, which we can address with different regularization techniques. Still, we achieved 90% accuracy on the validation data, which shows how well LSTM models can work on NLP-related tasks.
