Recommender systems are important in many sectors because they help users discover relevant products or content based on their preferences and behavior. These systems analyze user-item interactions and generate personalized suggestions using a variety of algorithms. In this post, we will look specifically at tweet ranking and how recommender systems can be used to improve the Twitter user experience.
The goal is to build a tweet ranking recommender system using TensorFlow, a well-known deep learning framework. TensorFlow's rich collection of tools and APIs lets us develop an effective and efficient tweet ranking model.
Concepts related to the topic:
- Collaborative filtering is a widely used approach in recommender systems. It leverages the collective behavior of users to make predictions or recommendations. By analyzing the historical interactions between users and items, collaborative filtering can identify patterns and similarities, enabling the system to make relevant recommendations.
- User-item interactions form the foundation of collaborative filtering. These interactions include various types of user actions, such as likes, retweets, and comments on tweets. By capturing and analyzing these interactions, we can gain insights into user preferences and use them to make informed recommendations.
- Matrix factorization is a technique commonly employed in collaborative filtering. It aims to extract latent factors from user-item interaction data. By decomposing the interaction matrix into lower-dimensional matrices, matrix factorization can represent users and items in a latent space, capturing their underlying characteristics.
- Embeddings are essential for capturing the characteristics of users and items. They are low-dimensional vector representations that encode the inherent features or attributes of users and items. Embeddings help establish relationships between users and items, enabling the system to recommend items based on similar users or items with shared characteristics. A minimal sketch of these ideas follows this list.
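To make the matrix-factorization and embedding ideas concrete, here is a minimal, self-contained sketch of collaborative filtering with Keras embeddings. The toy interaction data, layer sizes, and dot-product scoring below are illustrative assumptions, not part of the tweet ranking model built later in this post.
Python3
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot
from tensorflow.keras.models import Model

# Toy interaction data: (user index, item index) -> implicit rating (illustrative only)
user_idx = np.array([0, 0, 1, 2, 2])
item_idx = np.array([1, 2, 0, 1, 3])
ratings = np.array([1.0, 0.0, 1.0, 1.0, 0.0])

n_users, n_items, latent_dim = 3, 4, 8

# Each user and item is mapped to a latent vector (an embedding)
user_in = Input(shape=(1,))
item_in = Input(shape=(1,))
user_vec = Flatten()(Embedding(n_users, latent_dim)(user_in))
item_vec = Flatten()(Embedding(n_items, latent_dim)(item_in))

# The predicted preference is the dot product of the two latent vectors,
# i.e. a low-rank factorization of the user-item interaction matrix
score = Dot(axes=1)([user_vec, item_vec])

mf_model = Model([user_in, item_in], score)
mf_model.compile(optimizer='adam', loss='mse')
mf_model.fit([user_idx, item_idx], ratings, epochs=5, verbose=0)
Training this toy model nudges the user and item embeddings so that their dot product approximates the observed interactions, which is exactly the latent-factor decomposition described above.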
Steps needed:
Step 1: Import the required libraries and modules
Python3
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate, Flatten
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
Step 2: Load the dataset from a CSV file
Dataset link: https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
Python3
dataset_path = "twcs.csv"
df = pd.read_csv(dataset_path)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2811774 entries, 0 to 2811773
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 tweet_id int64
1 author_id object
2 inbound bool
3 created_at object
4 text object
5 response_tweet_id object
6 in_response_to_tweet_id float64
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 131.4+ MB
Print the first 5 rows
Python3
# Print the first 5 rows
print(df.head())
Output:
tweet_id author_id inbound created_at \
0 1 sprintcare False Tue Oct 31 22:10:47 +0000 2017
1 2 115712 True Tue Oct 31 22:11:45 +0000 2017
2 3 115712 True Tue Oct 31 22:08:27 +0000 2017
3 4 sprintcare False Tue Oct 31 21:54:49 +0000 2017
4 5 115712 True Tue Oct 31 21:49:35 +0000 2017
text response_tweet_id \
0 @115712 I understand. I would like to assist y... 2
1 @sprintcare and how do you propose we do that NaN
2 @sprintcare I have sent several private messag... 1
3 @115712 Please send us a Private Message so th... 3
4 @sprintcare I did. 4
in_response_to_tweet_id
0 3.0
1 1.0
2 4.0
3 5.0
4 6.0
Step 3: Extract tweet texts, user IDs, and labels from the DataFrame
Python3
tweets = df["text"].tolist()
user_ids = df["author_id"].tolist()
labels = df["inbound"].tolist()

# Check the data
print('Tweets :', tweets[0])
print('User Id:', user_ids[0])
print('Labels :', labels[0])
Output:
Tweets : @115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.
User Id: sprintcare
Labels : False
Step 4: Perform label encoding on user IDs and labels
Python3
user_encoder = LabelEncoder()
label_encoder = LabelEncoder()

user_ids_encoded = user_encoder.fit_transform(user_ids)
labels_encoded = label_encoder.fit_transform(labels)

print(user_ids[0], ':', user_ids_encoded[0])
print(labels[0], ':', labels_encoded[0])
Output:
sprintcare : 702776
False : 0
Label encoding is used to convert categorical data (user IDs and labels) into numerical representations.
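As a quick illustration, LabelEncoder simply assigns each distinct category an integer and can map it back with inverse_transform. The values below are assumed toy inputs, not rows from the dataset.
Python3
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["sprintcare", "115712", "sprintcare"])  # e.g. [1, 0, 1]
print(codes)
print(enc.inverse_transform(codes))  # back to the original IDs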
Step 5: Perform min-max scaling on user IDs
Python3
scaler = MinMaxScaler()
user_ids_scaled = scaler.fit_transform(user_ids_encoded.reshape(-1, 1))
Min-max scaling is used to normalize the user IDs to a specific range, typically between 0 and 1.
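For reference, MinMaxScaler applies the transformation x_scaled = (x - x_min) / (x_max - x_min) per feature. The tiny example below uses assumed toy values to show the scaler and the equivalent manual computation.
Python3
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[0.0], [2.0], [5.0], [10.0]])  # toy encoded IDs
print(MinMaxScaler().fit_transform(x).ravel())       # [0.  0.2 0.5 1. ]
print(((x - x.min()) / (x.max() - x.min())).ravel())  # same result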
Step 6: Reduce the dataset size to 500000 samples
Python3
tweets = tweets[:500000]
user_ids_scaled = user_ids_scaled[:500000]
labels_encoded = labels_encoded[:500000]
This step limits the dataset to the first 500000 samples for faster processing and testing.
Step 7: Split the data into training and testing sets
Python3
train_tweets, test_tweets, train_user_ids, test_user_ids, train_labels, test_labels = train_test_split(
    tweets, user_ids_scaled, labels_encoded, test_size=0.2, random_state=42)
The dataset is split into training and testing sets using the train_test_split function. The test size is set to 20% of the data, and a random state is specified for reproducibility.
Step 8: Tokenize tweet texts
Python3
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(train_tweets)

train_sequences = tokenizer.texts_to_sequences(train_tweets)
test_sequences = tokenizer.texts_to_sequences(test_tweets)

print('Tweets:', train_tweets[0])
print('Encoded Tweet :', train_sequences[0])
Output:
Tweets: @189743 Would you mind providing more details about the issue? We'd certainly follow up. ^R https://t.co/iuwZCjz4Or
Encoded Tweet : [98281, 89, 3, 567, 1119, 74, 91, 65, 2, 86, 136, 715, 161, 58, 882, 11, 9, 10, 2451]
The Tokenizer class from Keras is used to convert the tweet texts into sequences of integers. The tokenizer is fitted on the training tweets and then used to tokenize both the training and testing tweets.
Step 9: Pad sequences to ensure they have the same length
Python3
max_sequence_length = max(len(seq) for seq in train_sequences + test_sequences)

train_sequences = tf.keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=max_sequence_length)
test_sequences = tf.keras.preprocessing.sequence.pad_sequences(test_sequences, maxlen=max_sequence_length)
To ensure that all sequences have the same length, padding is applied to the sequences. The maximum sequence length is determined, and both the training and testing sequences are padded to this length.
Step 10: Create the user embedding model
Python3
user_embedding_dim = 10

user_embedding_input = Input(shape=(1,))
user_embedding_output = Embedding(len(user_encoder.classes_), user_embedding_dim)(user_embedding_input)
user_embedding_output = Flatten()(user_embedding_output)

user_embedding_model = Model(user_embedding_input, user_embedding_output)
In this step, we create the user embedding model. The user IDs are passed to the model through an Input layer. The Embedding layer then maps each user ID to a dense vector of user_embedding_dim dimensions; the number of unique user IDs is given by len(user_encoder.classes_). The output of the Embedding layer is flattened with the Flatten layer to obtain a 1D vector. Finally, we define the user embedding model by specifying the input and output layers using the Model class.
Step 11: Create the tweet embedding model using LSTM
Python3
tweet_embedding_dim = 50

tweet_embedding_input = Input(shape=(max_sequence_length,))
tweet_embedding_output = Embedding(len(tokenizer.word_index) + 1, tweet_embedding_dim)(tweet_embedding_input)
tweet_lstm_output = LSTM(64)(tweet_embedding_output)
In this step, we create the tweet embedding model using an LSTM (Long Short-Term Memory) layer. The input to the model is the sequences of tokenized tweets, represented as integers. An Input layer is defined with a shape of (max_sequence_length,), where max_sequence_length is the maximum length of the padded sequences. The Embedding layer maps the tokenized sequences to dense vectors of tweet_embedding_dim dimensions; the vocabulary size is len(tokenizer.word_index) + 1. The LSTM layer then processes the tweet embeddings and produces a fixed-dimensional output of size 64.
Step 12: Concatenate user and tweet embeddings
Python3
concatenated = Concatenate()([user_embedding_output, tweet_lstm_output])
In this step, the user embeddings from the user embedding model and the tweet LSTM output are combined using the Concatenate layer. This merges the information from both the user and the tweet to capture the relationship between them.
Step 13: Add dense layers for ranking
Python3
dense_layer_units = [64, 32]

for units in dense_layer_units:
    concatenated = Dense(units, activation='relu')(concatenated)

output_layer = Dense(1, activation='sigmoid')(concatenated)
We add dense layers to the concatenated embeddings for ranking. In this example, two dense layers with 64 and 32 units, respectively, are added. Each dense layer uses the ReLU activation function. The final dense layer has a single unit and uses the sigmoid activation function to output a probability between 0 and 1, indicating the ranking score.
Step 14: Create the model
Python3
model = Model(inputs=[user_embedding_input, tweet_embedding_input], outputs=output_layer)
model.summary()
Output:
Model: "model_5"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 1)] 0 []
input_6 (InputLayer) [(None, 93)] 0 []
embedding_4 (Embedding) (None, 1, 10) 7027770 ['input_5[0][0]']
embedding_5 (Embedding) (None, 93, 50) 14688800 ['input_6[0][0]']
flatten_2 (Flatten) (None, 10) 0 ['embedding_4[0][0]']
lstm_2 (LSTM) (None, 64) 29440 ['embedding_5[0][0]']
concatenate_1 (Concatenate) (None, 74) 0 ['flatten_2[0][0]',
'lstm_2[0][0]']
dense_6 (Dense) (None, 64) 4800 ['concatenate_1[0][0]']
dense_7 (Dense) (None, 32) 2080 ['dense_6[0][0]']
dense_8 (Dense) (None, 1) 33 ['dense_7[0][0]']
...
Total params: 21,752,923
Trainable params: 21,752,923
Non-trainable params: 0
The model is created by specifying the input and output layers. The inputs are the user embeddings and tweet embeddings, and the output is the ranking score.
Step 15: Compile the model
Python3
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The model is compiled with the binary cross-entropy loss function, which is commonly used for binary classification tasks. The Adam optimizer is used for optimization, and accuracy is chosen as the evaluation metric.
Step 16: Train the model
Python3
num_epochs = 10
model.fit([train_user_ids, train_sequences], train_labels, epochs=num_epochs, batch_size=32)
In this step, the model is trained using the fit method. The training data consists of the user IDs (train_user_ids) and the padded tweet sequences (train_sequences), with the corresponding labels (train_labels). The number of epochs is set to 10, indicating how many times the entire training dataset is passed through the model. The batch size is set to 32, which determines the number of samples processed before the model's weights are updated.
Step 17: Evaluate the model
Python3
loss, accuracy = model.evaluate([test_user_ids, test_sequences], test_labels)
print("Test accuracy:", accuracy)
Output:
1/63 [..............................] - ETA: 24s - loss: 1.1737 - accuracy: 0.8750
63/63 [==============================] - 1s 6ms/step - loss: 0.1655 - accuracy: 0.9750
Test accuracy: 0.9750000238418579
The trained model is evaluated on the testing data, which consists of the user IDs (test_user_ids) and the padded tweet sequences (test_sequences), along with the corresponding labels (test_labels). The evaluate method returns the loss and accuracy of the model on the testing data. Finally, the test accuracy is printed to the console.
Step 18: Save the trained model
Python3
model.save("recommender_model.h5")
This line saves the trained model to the specified file path ("recommender_model.h5").
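Later, the saved model can be loaded back without retraining. The short sketch below assumes the file from Step 18 exists in the working directory.
Python3
# Reload the trained model from disk (path assumed from Step 18)
loaded_model = tf.keras.models.load_model("recommender_model.h5")
loaded_model.summary()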
Step 19: Function to recommend tweets based on user preference
Python3
# Function to recommend tweets based on user preference
def recommend_tweets(user_id, tweets):
    # Encode the user ID
    user_id_encoded = user_encoder.transform([user_id])[0]

    # Tokenize and pad the tweet sequences
    sequences = tokenizer.texts_to_sequences(tweets)
    sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_sequence_length)

    # Repeat the user ID for each tweet
    user_ids = np.full((len(tweets),), user_id_encoded)

    # Predict tweet relevance using the trained model
    predictions = model.predict([user_ids, sequences])

    # Sort the tweets based on predicted relevance
    sorted_indices = np.argsort(predictions, axis=0)[::-1].flatten()
    sorted_tweets = [tweets[i] for i in sorted_indices]

    return sorted_tweets
The recommend_tweets function takes a user ID and a list of tweets as input and returns a list of recommended tweets based on the user's preference. It performs the following steps:
- Encodes the user ID using user_encoder.transform.
- Tokenizes and pads the tweet sequences using tokenizer.texts_to_sequences and tf.keras.preprocessing.sequence.pad_sequences.
- Predicts tweet relevance using the trained model by calling model.predict.
- Sorts the tweets based on predicted relevance using np.argsort.
- Returns the sorted tweets.
Step 20: Usage example
Python3
user_id = "115801"

tweets = [
    "Great article! I learned a lot.",
    "Not sure about this one...",
    "Interesting perspective.",
    "I agree with the points made.",
    "Could be better explained.",
]

recommended_tweets = recommend_tweets(user_id, tweets)
recommended_tweets
Output:
1/1 [==============================] - 0s 16ms/step
['Could be better explained.',
'Interesting perspective.',
'Great article! I learned a lot.',
'Not sure about this one...',
'I agree with the points made.']
This code sets a user ID and a list of tweets, calls the recommend_tweets function to obtain a list of recommended tweets, and then displays them.
Please make sure you have defined and trained the model before attempting to save it and use it for recommendations.