In this article, we learn how to build a basic recommendation engine from scratch using Pandas.
Building Movie Recommendation Engines using Pandas
A recommendation engine, also called a recommender system, is a system that predicts or filters items according to each user's preferences. Recommender systems deliver a list of suggestions via collaborative filtering or content-based filtering.
Recommendation engines are among the most popular and widely used applications of machine learning. Almost all big tech companies, such as e-commerce websites, Netflix, Amazon Prime and more, use recommendation engines to suggest suitable items or movies to their users. The approach is based on the intuition that similar users are likely to give similar ratings to similar items.
Now let's create a very basic recommender engine using pandas. We will concentrate on a simple engine that presents the items most comparable to a given item, in this case a movie, based on correlation and number of ratings. It simply tells us which movies are considered similar to the user's film choice.
To download the files: file.tsv, Movie_Id_Titles.csv.
Popularity Based Filtering
Popularity-based filtering is one of the simplest, and least personalized, techniques for building a recommender system. It surfaces the items that are currently most popular and hides the rest. For example, in our movies dataset, a movie that has been rated by many users has presumably been watched by many users and is trending. Only the movies with the highest number of ratings are suggested to users. The approach lacks personalization because it is not sensitive to any individual user's taste.
Example:
First, we import the pandas library, which we will use to build the recommendation engine. Then we load the datasets from the given paths, assign column names to the ratings data, and merge the two tables.
Python3
# Import the pandas library
import pandas as pd

# Column names for the ratings file
col_names = ['user_id', 'item_id', 'rating', 'timestamp']

# Load the ratings dataset (tab-separated, no header row)
path = 'https://media.geeksforgeeks.org/wp-content/uploads/file.tsv'
ratings = pd.read_csv(path, sep='\t', names=col_names)

# Check the head of the data
print(ratings.head())

# Load the movie titles and their respective IDs
movies = pd.read_csv(
    'https://media.geeksforgeeks.org/wp-content/uploads/Movie_Id_Titles.csv')
print(movies.head())

# Merge the two datasets on item_id
movies_merge = pd.merge(ratings, movies, on='item_id')
movies_merge.head()
Output
Now we rank the movies by the number of ratings each one has received. Since we are doing popularity-based filtering, the movies watched by more users will have more ratings.
Python3
# Count the ratings per title and sort from most to least rated
pop_movies = (
    movies_merge.groupby("title")["user_id"]
    .count()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={"user_id": "score"})
)

# Rank 1 is the most rated movie; ties are broken by order of appearance
pop_movies['Rank'] = pop_movies['score'].rank(ascending=False, method='first')
pop_movies
Output
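As a small illustration of the ranking step, the sketch below applies the same count, sort and rank chain to a made-up table. The data and the names toy and toy_pop are assumptions for illustration only and are not part of the article's dataset.

```python
import pandas as pd

# Toy ratings table (made-up data, not the article's dataset)
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 1],
    "title":   ["A", "A", "A", "B", "B", "C"],
    "rating":  [5, 4, 3, 2, 5, 4],
})

# Count ratings per title and sort from most to least rated
toy_pop = (
    toy.groupby("title")["user_id"]
    .count()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={"user_id": "score"})
)

# method="first" assigns unique ranks even when two scores are equal
toy_pop["Rank"] = toy_pop["score"].rank(ascending=False, method="first")
print(toy_pop)
```

Movie A has three ratings, B has two and C has one, so they receive ranks 1, 2 and 3 respectively.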
Then we visualize the top 10 movies with the highest rating counts:
Python3
import matplotlib.pyplot as plt

# Horizontal bar chart of the 10 most rated movies
plt.figure(figsize=(12, 4))
plt.barh(pop_movies['title'].head(10),
         pop_movies['score'].head(10),
         align='center',
         color='red')
plt.xlabel("Popularity")
plt.title("Popular Movies")

# Put the most popular movie at the top of the chart
plt.gca().invert_yaxis()
plt.show()
Output
Collaborative Filtering
User-based filtering:
User-based techniques suggest to a user the items that similar users have picked. In user-based collaborative filtering, we compute a similarity score between users, using either Pearson correlation or cosine similarity. Collaborative filtering relies on the behaviour of the crowd. For example, if many people read both book A and book B, and a new user has read only book B, then the recommendation engine will suggest that the user read book A.
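The two similarity measures mentioned above can be sketched on a tiny made-up rating matrix. This is only an illustration under assumed data: the users u1 to u3, the book ratings and the variable names are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user-by-item rating matrix (made-up values); rows are users
toy_ratings = pd.DataFrame(
    {"Book A": [5, 4, 1], "Book B": [4, 5, 2], "Book C": [1, 2, 5]},
    index=["u1", "u2", "u3"],
)

# Pearson correlation between users: transpose so users become columns,
# because DataFrame.corr() compares columns pairwise
user_pearson = toy_ratings.T.corr(method="pearson")

# Cosine similarity between users, computed directly with NumPy
values = toy_ratings.to_numpy(dtype=float)
norms = np.linalg.norm(values, axis=1, keepdims=True)
user_cosine = pd.DataFrame(
    (values @ values.T) / (norms @ norms.T),
    index=toy_ratings.index,
    columns=toy_ratings.index,
)

print(user_pearson.round(2))
print(user_cosine.round(2))
```

Under both measures u1 is closer to u2 than to u3, so a user-based recommender would lean on u2's ratings when suggesting items to u1.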
Item-Based Collaborative Filtering:
Instead of calculating the similarity between users, item-based collaborative filtering suggests items based on their similarity to the items that the target user has rated. As before, the similarity can be calculated with Pearson correlation or cosine similarity. For example, if users who rate movie P highly also tend to rate movie Q highly, then the recommender will suggest movie Q to a user who liked movie P.
The code below demonstrates item-based collaborative filtering.
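Before turning to the real dataset, the idea of item-to-item similarity can be sketched on a toy matrix. The movies P, Q and R, the users and all ratings below are made up for illustration.

```python
import pandas as pd

# Toy user-by-movie matrix (made-up values); rows are users, columns are movies
toy_matrix = pd.DataFrame(
    {
        "Movie P": [5, 4, 1, 2],
        "Movie Q": [4, 5, 2, 1],
        "Movie R": [1, 2, 5, 4],
    },
    index=["u1", "u2", "u3", "u4"],
)

# DataFrame.corr() correlates columns pairwise, i.e. item against item
item_similarity = toy_matrix.corr(method="pearson")

# Items most similar to "Movie P", excluding the movie itself
similar_to_p = (
    item_similarity["Movie P"].drop("Movie P").sort_values(ascending=False)
)
print(similar_to_p)
```

Here Movie Q is rated in the same pattern as Movie P, while Movie R is rated in the opposite pattern, so Q would be the suggestion for someone who liked P.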
Example:
First, we import the pandas library, which we will use to build the recommendation engine. Then we load the datasets from the given paths and assign column names to the ratings data.
Python3
# Import the pandas library
import pandas as pd

# Column names for the ratings file
col_names = ['user_id', 'item_id', 'rating', 'timestamp']

# Load the ratings dataset (tab-separated, no header row)
path = 'https://media.geeksforgeeks.org/wp-content/uploads/file.tsv'
ratings = pd.read_csv(path, sep='\t', names=col_names)

# Check the head of the data
print(ratings.head())

# Load the movie titles and their respective IDs
movies = pd.read_csv(
    'https://media.geeksforgeeks.org/wp-content/uploads/Movie_Id_Titles.csv')
print(movies.head())
Output
Now we merge the two datasets on item_id, the key column they have in common.
Python3
# Attach the movie title to every rating row
movies_merge = pd.merge(ratings, movies, on='item_id')
movies_merge.head()
Output
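To make the behaviour of the merge concrete, here is a minimal sketch on two made-up tables. The names toy_ratings, toy_movies, toy_merged and checked are illustrative assumptions. The validate argument is an optional pandas check that is not used in the article's code.

```python
import pandas as pd

# Made-up miniature versions of the two tables
toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "item_id": [10, 20, 10], "rating": [5, 3, 4]}
)
toy_movies = pd.DataFrame({"item_id": [10, 20, 30], "title": ["A", "B", "C"]})

# Default inner join: keeps only item_ids present in both tables,
# so movie C (item_id 30, never rated) does not appear in the result
toy_merged = pd.merge(toy_ratings, toy_movies, on="item_id")
print(toy_merged)

# validate="many_to_one" raises an error if item_id repeats in toy_movies
checked = pd.merge(toy_ratings, toy_movies, on="item_id", validate="many_to_one")
```

Each rating row gains a title column, and the number of rows equals the number of ratings whose item_id has a matching title.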
Here we calculate the mean rating of each movie, and then the number of ratings each movie received. In both cases we sort in descending order and show the top rows, as we can see in the output.
Python3
# Movies with the highest mean rating
print(movies_merge.groupby('title')['rating'].mean().sort_values(ascending=False).head())

# Movies with the highest number of ratings
print(movies_merge.groupby('title')['rating'].count().sort_values(ascending=False).head())
Output
Now we create a new dataframe named ratings_mean_count_data that holds the mean rating (column rating) and the number of ratings (column rating_counts) for each movie title, since both values are needed to filter out the best suggestions for the user.
Python3
# Mean rating per title
ratings_mean_count_data = pd.DataFrame(
    movies_merge.groupby('title')['rating'].mean())

# Number of ratings per title, aligned on the title index
ratings_mean_count_data['rating_counts'] = pd.DataFrame(
    movies_merge.groupby('title')['rating'].count())
ratings_mean_count_data
Output
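The same two statistics can also be computed in a single pass with named aggregation. The sketch below is an alternative formulation on made-up data, not the article's method, and the names toy and summary are assumptions.

```python
import pandas as pd

# Made-up ratings for three titles
toy = pd.DataFrame(
    {"title": ["A", "A", "A", "B", "B", "C"], "rating": [5, 4, 3, 2, 4, 5]}
)

# Named aggregation: each keyword becomes an output column
summary = toy.groupby("title")["rating"].agg(rating="mean", rating_counts="count")
print(summary)
```

Note that title C has the highest mean but only a single rating, which is exactly why the count is kept next to the mean.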
In the newly created dataframe, we can see the movies along with their mean rating and number of ratings. Next we want to create a matrix that shows each user's rating for each movie. To do so, we use the following code.
Python3
# Rows are users, columns are movie titles, cells are ratings
user_rating = movies_merge.pivot_table(
    index='user_id', columns='title', values='rating')
user_rating.head()
Output
Here each column contains all the ratings of all users of a particular movie making it easy for us to find ratings of our movie of choice.
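A toy version of the pivot makes the shape of this matrix easier to see. The data and the name toy_user_rating below are made up for illustration.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, title) pair
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["A", "B", "A", "B"],
    "rating": [5, 3, 4, 2],
})

# Users become rows and titles become columns; a missing rating becomes NaN.
# pivot_table averages the values if a (user, title) pair appears more than once.
toy_user_rating = toy.pivot_table(index="user_id", columns="title", values="rating")
print(toy_user_rating)
```

User 2 never rated B and user 3 never rated A, so those cells are NaN. Most cells of the real matrix are NaN for the same reason.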
We will look at the ratings of Star Wars (1977), as it has the highest rating count. Since we want to find correlations for a movie with many ratings, this is a good choice. We display the first 15 ratings.
Python3
# Ratings given to Star Wars (1977), one entry per user
Star_Wars_ratings = user_rating['Star Wars (1977)']
Star_Wars_ratings.head(15)
Output
Now we find the movies whose ratings correlate with those of Star Wars (1977) using the corrwith() function. We store the correlation values under the column Correlation in a new dataframe called corr_Star_Wars and remove the NaN values from it.
We display the first 10 rows, and then the 25 movies most highly correlated with Star Wars (1977), sorted in descending order using the parameter ascending=False.
Python3
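# Correlate every movie's rating column with the Star Wars ratings
movies_like_Star_Wars = user_rating.corrwith(Star_Wars_ratings)

# Store the result in a dataframe and drop movies with no valid correlation
corr_Star_Wars = pd.DataFrame(movies_like_Star_Wars, columns=['Correlation'])
corr_Star_Wars.dropna(inplace=True)
corr_Star_Wars.head(10)

# Show the 25 most highly correlated movies
corr_Star_Wars.sort_values('Correlation', ascending=False).head(25)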
Output
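To see what corrwith() does, the sketch below runs it on a tiny made-up matrix. The column names Target, Similar, Opposite and Rarely rated, and all the values, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up user-by-movie matrix; NaN means the user did not rate the movie
toy_user_rating = pd.DataFrame(
    {
        "Target": [5, 4, 2, 1, np.nan],
        "Similar": [4, 5, 1, 2, 3],
        "Opposite": [1, 2, 4, 5, 3],
        "Rarely rated": [np.nan, np.nan, np.nan, np.nan, 3],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Each column is correlated with "Target" using only users who rated both
toy_corr = toy_user_rating.corrwith(toy_user_rating["Target"])

# Columns that share no raters with "Target" get NaN and are dropped
toy_corr_df = pd.DataFrame(toy_corr, columns=["Correlation"]).dropna()
print(toy_corr_df.sort_values("Correlation", ascending=False))
```

The column that shares no raters with Target disappears after dropna(), which mirrors why the NaN values are removed in the article's code.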
From the above output, we can see that the movies most highly correlated with Star Wars (1977) are not all famous or well known.
There can be cases where only a handful of users rated a particular movie, and their ratings happen to line up with their Star Wars ratings. A correlation computed from so few ratings is not reliable.
Correlation alone is therefore not a good metric for picking the best suggestions. So we add the rating_counts column to the dataframe to account for the number of ratings along with the correlation.
Python3
# Add the number of ratings per movie next to its correlation
corr_Star_Wars_count = corr_Star_Wars.join(
    ratings_mean_count_data['rating_counts'])
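The sketch below illustrates why the count matters, using two made-up rating series. The names popular and obscure are hypothetical. The min_periods argument of Series.corr is an optional pandas safeguard that the article's code does not use.

```python
import numpy as np
import pandas as pd

# Made-up ratings: only users 1 and 4 rated both movies
popular = pd.Series([5, 3, 4, 2, 5], index=[1, 2, 3, 4, 5])
obscure = pd.Series([4, np.nan, np.nan, 1, np.nan], index=[1, 2, 3, 4, 5])

# With just two overlapping ratings the points lie on a straight line,
# so the Pearson correlation comes out as a perfect 1.0
naive_corr = popular.corr(obscure)

# min_periods requires at least that many overlapping ratings, else NaN
guarded_corr = popular.corr(obscure, min_periods=3)

print(naive_corr, guarded_corr)
```

A perfect score based on two users says very little, which is why a minimum number of ratings is applied in the next step.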
We assume that movies worth recommending have more than 100 ratings. The code below therefore keeps only the movies rated by more than 100 users and sorts them by correlation.
Python3
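# Keep movies with more than 100 ratings, most correlated first
corr_Star_Wars_count = corr_Star_Wars_count[
    corr_Star_Wars_count['rating_counts'] > 100
].sort_values('Correlation', ascending=False)

# Turn the title index into a regular column
corr_Star_Wars_count = corr_Star_Wars_count.reset_index()
corr_Star_Wars_count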
Output
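The correlation, count and threshold steps can be gathered into one small function. This is only a sketch under stated assumptions: similar_movies, its parameters and the toy data are hypothetical and not part of the article's code. It also assumes each user rates a movie at most once, so that counting non-NaN cells gives the number of ratings.

```python
import numpy as np
import pandas as pd


def similar_movies(user_rating, title, min_ratings):
    """Rank movies by correlation with `title`, keeping well-rated ones only.

    Hypothetical helper; the name and signature are illustrative.
    """
    # Correlation of every movie column with the chosen movie
    corr = user_rating.corrwith(user_rating[title])
    # Non-NaN cells per column = number of ratings per movie
    result = pd.DataFrame(
        {"Correlation": corr, "rating_counts": user_rating.count()}
    )
    # Drop undefined correlations and the chosen movie itself
    result = result.dropna(subset=["Correlation"]).drop(index=title)
    # Apply the minimum-ratings threshold, then sort best first
    result = result[result["rating_counts"] > min_ratings]
    return result.sort_values("Correlation", ascending=False)


# Made-up matrix: "Sparse" matches "Target" perfectly but has only 2 ratings
toy_user_rating = pd.DataFrame(
    {
        "Target": [5, 4, 2, 1, 3],
        "Well rated": [4, 5, 1, 2, 3],
        "Sparse": [5, np.nan, np.nan, 1, np.nan],
    }
)

unfiltered = similar_movies(toy_user_rating, "Target", min_ratings=0)
filtered = similar_movies(toy_user_rating, "Target", min_ratings=3)
print(unfiltered)
print(filtered)
```

Without a threshold the sparsely rated movie ranks first. With the threshold only the well-rated movie remains, which matches the reasoning behind the cutoff of 100 ratings.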
We can visualize the final set of recommended movies:
Python3
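import matplotlib.pyplot as plt

# Select the 10 most correlated movies that have more than 100 ratings
top_movies = corr_Star_Wars_count[
    corr_Star_Wars_count['rating_counts'] > 100
].sort_values('Correlation', ascending=False).head(10)

plt.figure(figsize=(12, 4))
plt.barh(top_movies['title'], top_movies['Correlation'],
         align='center', color='red')
plt.xlabel("Correlation with Star Wars (1977)")
plt.title("Top 10 Movies Most Correlated with Star Wars (1977)")

# Put the most correlated movie at the top of the chart
plt.gca().invert_yaxis()
plt.show()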
Output
The movies above will therefore be recommended to users who have watched Star Wars (1977). In this way, we can build a very basic recommender system with pandas. For real-time recommender engines, pandas alone will not be sufficient. For that, we would have to implement more complex machine learning algorithms and frameworks.