Recommender systems are important in many sectors because they help users discover relevant products or content based on their preferences and behavior. These systems analyze user-item interactions and generate personalized suggestions using a variety of algorithms. In this post, we will look specifically at tweet ranking and how recommender systems can be used to improve the Twitter user experience.
The goal is to build a tweet ranking recommender system using TensorFlow, a well-known deep learning framework. TensorFlow's rich collection of tools and APIs lets us develop an effective and efficient tweet ranking model.
Concepts related to the topic:
- Collaborative filtering is a widely used approach in recommender systems. It leverages the collective behavior of users to make predictions or recommendations. By analyzing the historical interactions between users and items, collaborative filtering can identify patterns and similarities, enabling the system to make relevant recommendations.
- User-item interactions form the foundation of collaborative filtering. These interactions include various types of user actions, such as likes, retweets, and comments on tweets. By capturing and analyzing these interactions, we can gain insights into user preferences and use them to make informed recommendations.
- Matrix factorization is a technique commonly employed in collaborative filtering. It aims to extract latent factors from user-item interaction data. By decomposing the interaction matrix into lower-dimensional matrices, matrix factorization can represent users and items in a latent space, capturing their underlying characteristics.
- Embeddings are essential for capturing the characteristics of users and items. They are low-dimensional vector representations that encode the inherent features or attributes of users and items. Embeddings help establish relationships between users and items, enabling the system to recommend items based on similar users or items with shared characteristics. A minimal sketch of these ideas follows this list.
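To make the matrix-factorization and embedding ideas concrete, here is a minimal, self-contained sketch of collaborative filtering with Keras embeddings. The toy interaction data, layer sizes, and dot-product scoring below are illustrative assumptions, not part of the tweet ranking model built later in this post.
Python3
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot
from tensorflow.keras.models import Model

# Toy interaction data: (user index, item index) -> implicit rating (illustrative only)
user_idx = np.array([0, 0, 1, 2, 2])
item_idx = np.array([1, 2, 0, 1, 3])
ratings = np.array([1.0, 0.0, 1.0, 1.0, 0.0])

n_users, n_items, latent_dim = 3, 4, 8

# Each user and item is mapped to a latent vector (an embedding)
user_in = Input(shape=(1,))
item_in = Input(shape=(1,))
user_vec = Flatten()(Embedding(n_users, latent_dim)(user_in))
item_vec = Flatten()(Embedding(n_items, latent_dim)(item_in))

# The predicted preference is the dot product of the two latent vectors,
# i.e. a low-rank factorization of the user-item interaction matrix
score = Dot(axes=1)([user_vec, item_vec])

mf_model = Model([user_in, item_in], score)
mf_model.compile(optimizer='adam', loss='mse')
mf_model.fit([user_idx, item_idx], ratings, epochs=5, verbose=0)
Training this toy model nudges the user and item embeddings so that their dot product approximates the observed interactions, which is exactly the latent-factor decomposition described above.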
Steps needed:
Step 1: Import the required libraries and modules
Python3
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate, Flatten
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
Step 2: Load the dataset from a CSV file
Dataset link: https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
Python3
dataset_path = "twcs.csv"
df = pd.read_csv(dataset_path)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2811774 entries, 0 to 2811773
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 tweet_id int64
1 author_id object
2 inbound bool
3 created_at object
4 text object
5 response_tweet_id object
6 in_response_to_tweet_id float64
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 131.4+ MB
Print the first 5 rows
Python3
# Print the first 5 rows
print(df.head())
Output:
tweet_id author_id inbound created_at \
0 1 sprintcare False Tue Oct 31 22:10:47 +0000 2017
1 2 115712 True Tue Oct 31 22:11:45 +0000 2017
2 3 115712 True Tue Oct 31 22:08:27 +0000 2017
3 4 sprintcare False Tue Oct 31 21:54:49 +0000 2017
4 5 115712 True Tue Oct 31 21:49:35 +0000 2017
text response_tweet_id \
0 @115712 I understand. I would like to assist y... 2
1 @sprintcare and how do you propose we do that NaN
2 @sprintcare I have sent several private messag... 1
3 @115712 Please send us a Private Message so th... 3
4 @sprintcare I did. 4
in_response_to_tweet_id
0 3.0
1 1.0
2 4.0
3 5.0
4 6.0
Step 3: Extract tweet texts, user IDs, and labels from the DataFrame
Python3
tweets = df["text"].tolist()
user_ids = df["author_id"].tolist()
labels = df["inbound"].tolist()

# Check the data
print('Tweets :', tweets[0])
print('User Id:', user_ids[0])
print('Labels :', labels[0])
Output:
Tweets : @115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.
User Id: sprintcare
Labels : False
Step 4: Perform label encoding on user IDs and labels
Python3
user_encoder = LabelEncoder()
label_encoder = LabelEncoder()

user_ids_encoded = user_encoder.fit_transform(user_ids)
labels_encoded = label_encoder.fit_transform(labels)

print(user_ids[0], ':', user_ids_encoded[0])
print(labels[0], ':', labels_encoded[0])
Output:
sprintcare : 702776
False : 0
Label encoding is used to convert categorical data (user IDs and labels) into numerical representations.
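As a quick illustration, LabelEncoder simply assigns each distinct category an integer and can map it back with inverse_transform. The values below are assumed toy inputs, not rows from the dataset.
Python3
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["sprintcare", "115712", "sprintcare"])  # e.g. [1, 0, 1]
print(codes)
print(enc.inverse_transform(codes))  # back to the original IDs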
Step 5: Perform min-max scaling on user IDs
Python3
scaler = MinMaxScaler()
user_ids_scaled = scaler.fit_transform(user_ids_encoded.reshape(-1, 1))
Min-max scaling is used to normalize the user IDs to a specific range, typically between 0 and 1.
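For reference, MinMaxScaler applies the transformation x_scaled = (x - x_min) / (x_max - x_min) per feature. The tiny example below uses assumed toy values to show the scaler and the equivalent manual computation.
Python3
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[0.0], [2.0], [5.0], [10.0]])  # toy encoded IDs
print(MinMaxScaler().fit_transform(x).ravel())       # [0.  0.2 0.5 1. ]
print(((x - x.min()) / (x.max() - x.min())).ravel())  # same result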
Step 6: Reduce the dataset size to 500000 samples
Python3
tweets = tweets[:500000]
user_ids_scaled = user_ids_scaled[:500000]
labels_encoded = labels_encoded[:500000]
This step limits the dataset to the first 500000 samples for faster processing and testing.
Step 7: Split the data into training and testing sets
Python3
train_tweets, test_tweets, train_user_ids, test_user_ids, train_labels, test_labels = train_test_split(
    tweets, user_ids_scaled, labels_encoded, test_size=0.2, random_state=42)
The dataset is split into training and testing sets using the train_test_split function. The test size is set to 20% of the data, and a random state is specified for reproducibility.
Step 8: Tokenize tweet texts
Python3
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(train_tweets)

train_sequences = tokenizer.texts_to_sequences(train_tweets)
test_sequences = tokenizer.texts_to_sequences(test_tweets)

print('Tweets:', train_tweets[0])
print('Encoded Tweet :', train_sequences[0])
Output:
Tweets: @189743 Would you mind providing more details about the issue? We'd certainly follow up. ^R https://t.co/iuwZCjz4Or
Encoded Tweet : [98281, 89, 3, 567, 1119, 74, 91, 65, 2, 86, 136, 715, 161, 58, 882, 11, 9, 10, 2451]
The Tokenizer class from Keras is used to convert the tweet texts into sequences of integers. The tokenizer is fitted on the training tweets and then used to tokenize both the training and testing tweets.
Step 9: Pad sequences to ensure they have the same length
Python3
max_sequence_length = max(len(seq) for seq in train_sequences + test_sequences)

train_sequences = tf.keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=max_sequence_length)
test_sequences = tf.keras.preprocessing.sequence.pad_sequences(test_sequences, maxlen=max_sequence_length)
To ensure that all sequences have the same length, padding is applied to the sequences. The maximum sequence length is determined, and both the training and testing sequences are padded to this length.
Step 10: Create the user embedding model
Python3
user_embedding_dim = 10

user_embedding_input = Input(shape=(1,))
user_embedding_output = Embedding(len(user_encoder.classes_), user_embedding_dim)(user_embedding_input)
user_embedding_output = Flatten()(user_embedding_output)

user_embedding_model = Model(user_embedding_input, user_embedding_output)
In this step, we create the user embedding model. The user IDs are passed to the model through an Input layer. The Embedding layer then maps each user ID to a dense vector of user_embedding_dim dimensions; the number of unique user IDs is given by len(user_encoder.classes_). The output of the Embedding layer is flattened with the Flatten layer to obtain a 1D vector. Finally, we define the user embedding model by specifying the input and output layers using the Model class.
Step 11: Create the tweet embedding model using LSTM
Python3
tweet_embedding_dim = 50

tweet_embedding_input = Input(shape=(max_sequence_length,))
tweet_embedding_output = Embedding(len(tokenizer.word_index) + 1, tweet_embedding_dim)(tweet_embedding_input)
tweet_lstm_output = LSTM(64)(tweet_embedding_output)
In this step, we create the tweet embedding model using an LSTM (Long Short-Term Memory) layer. The input to the model is the sequences of tokenized tweets, represented as integers. An Input layer is defined with a shape of (max_sequence_length,), where max_sequence_length is the maximum length of the padded sequences. The Embedding layer maps the tokenized sequences to dense vectors of tweet_embedding_dim dimensions; the vocabulary size is len(tokenizer.word_index) + 1. The LSTM layer then processes the tweet embeddings and produces a fixed-dimensional output of size 64.
Step 12: Concatenate user and tweet embeddings
Python3
concatenated = Concatenate()([user_embedding_output, tweet_lstm_output])
In this step, the user embeddings from the user embedding model and the tweet LSTM output are combined using the Concatenate layer. This merges the information from both the user and the tweet to capture the relationship between them.
Step 13: Add dense layers for ranking
Python3
dense_layer_units = [64, 32]

for units in dense_layer_units:
    concatenated = Dense(units, activation='relu')(concatenated)

output_layer = Dense(1, activation='sigmoid')(concatenated)
We add dense layers to the concatenated embeddings for ranking. In this example, two dense layers with 64 and 32 units, respectively, are added. Each dense layer uses the ReLU activation function. The final dense layer has a single unit and uses the sigmoid activation function to output a probability between 0 and 1, indicating the ranking score.
Step 14: Create the model
Python3
model = Model(inputs=[user_embedding_input, tweet_embedding_input], outputs=output_layer)
model.summary()
Output:
Model: "model_5"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 1)] 0 []
input_6 (InputLayer) [(None, 93)] 0 []
embedding_4 (Embedding) (None, 1, 10) 7027770 ['input_5[0][0]']
embedding_5 (Embedding) (None, 93, 50) 14688800 ['input_6[0][0]']
flatten_2 (Flatten) (None, 10) 0 ['embedding_4[0][0]']
lstm_2 (LSTM) (None, 64) 29440 ['embedding_5[0][0]']
concatenate_1 (Concatenate) (None, 74) 0 ['flatten_2[0][0]',
'lstm_2[0][0]']
dense_6 (Dense) (None, 64) 4800 ['concatenate_1[0][0]']
dense_7 (Dense) (None, 32) 2080 ['dense_6[0][0]']
dense_8 (Dense) (None, 1) 33 ['dense_7[0][0]']
...
Total params: 21,752,923
Trainable params: 21,752,923
Non-trainable params: 0
The model is created by specifying the input and output layers. The inputs are the user embeddings and tweet embeddings, and the output is the ranking score.
Step 15: Compile the model
Python3
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The model is compiled with the binary cross-entropy loss function, which is commonly used for binary classification tasks. The Adam optimizer is used for optimization, and accuracy is chosen as the evaluation metric.
Step 16: Train the model
Python3
num_epochs = 10
model.fit([train_user_ids, train_sequences], train_labels, epochs=num_epochs, batch_size=32)
In this step, the model is trained using the fit method. The training data consists of the user IDs (train_user_ids) and the padded tweet sequences (train_sequences), with the corresponding labels (train_labels). The number of epochs is set to 10, indicating how many times the entire training dataset is passed through the model. The batch size is set to 32, which determines the number of samples processed before the model's weights are updated.
Step 17: Evaluate the model
Python3
loss, accuracy = model.evaluate([test_user_ids, test_sequences], test_labels)
print("Test accuracy:", accuracy)
Output:
1/63 [..............................] - ETA: 24s - loss: 1.1737 - accuracy: 0.8750
63/63 [==============================] - 1s 6ms/step - loss: 0.1655 - accuracy: 0.9750
Test accuracy: 0.9750000238418579
The trained model is evaluated on the testing data, which consists of the user IDs (test_user_ids) and the padded tweet sequences (test_sequences), along with the corresponding labels (test_labels). The evaluate method returns the loss and accuracy of the model on the testing data. Finally, the test accuracy is printed to the console.
Step 18: Save the trained model
Python3
model.save("recommender_model.h5")
This line saves the trained model to the specified file path ("recommender_model.h5").
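Later, the saved model can be loaded back without retraining. The short sketch below assumes the file from Step 18 exists in the working directory.
Python3
# Reload the trained model from disk (path assumed from Step 18)
loaded_model = tf.keras.models.load_model("recommender_model.h5")
loaded_model.summary()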
Step 19: Function to recommend tweets based on user preference
Python3
# Function to recommend tweets based on user preference
def recommend_tweets(user_id, tweets):
    # Encode the user ID
    user_id_encoded = user_encoder.transform([user_id])[0]

    # Tokenize and pad the tweet sequences
    sequences = tokenizer.texts_to_sequences(tweets)
    sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_sequence_length)

    # Repeat the user ID for each tweet
    user_ids = np.full((len(tweets),), user_id_encoded)

    # Predict tweet relevance using the trained model
    predictions = model.predict([user_ids, sequences])

    # Sort the tweets based on predicted relevance
    sorted_indices = np.argsort(predictions, axis=0)[::-1].flatten()
    sorted_tweets = [tweets[i] for i in sorted_indices]

    return sorted_tweets
The recommend_tweets function takes a user ID and a list of tweets as input and returns a list of recommended tweets based on the user's preference. It performs the following steps:
- Encodes the user ID using user_encoder.transform.
- Tokenizes and pads the tweet sequences using tokenizer.texts_to_sequences and tf.keras.preprocessing.sequence.pad_sequences.
- Predicts tweet relevance using the trained model by calling model.predict.
- Sorts the tweets based on predicted relevance using np.argsort.
- Returns the sorted tweets.
Step 20: Usage example
Python3
user_id = "115801"

tweets = [
    "Great article! I learned a lot.",
    "Not sure about this one...",
    "Interesting perspective.",
    "I agree with the points made.",
    "Could be better explained.",
]

recommended_tweets = recommend_tweets(user_id, tweets)
recommended_tweets
Output:
1/1 [==============================] - 0s 16ms/step
['Could be better explained.',
'Interesting perspective.',
'Great article! I learned a lot.',
'Not sure about this one...',
'I agree with the points made.']
This code sets a user ID and a list of tweets, calls the recommend_tweets function to obtain a list of recommended tweets, and then displays them.
Please make sure you have defined and trained the model before attempting to save it and use it for recommendations.