Many websites collect data from their users and use it to predict what those users will like or dislike, which allows them to recommend relevant content. A recommender system is a way of suggesting items and ideas that match a user's specific tastes.
Recommender systems come in two main types:
- Collaborative Filtering: Collaborative Filtering recommends items based on similarity measures between users and/or items. The basic assumption behind the algorithm is that users with similar interests have common preferences.
- Content-Based Recommendation: It uses supervised machine learning to induce a classifier that discriminates between items that are interesting and uninteresting to the user.
Content-Based Recommendation System: Content-based systems recommend items to the customer that are similar to items the customer rated highly in the past. They use the features and properties of the items; from these properties, they can calculate the similarity between items.
In a content-based recommendation system, we first need to create a profile for each item, which represents the properties of that item. From the item profiles, a user profile is inferred for each particular user. We then use these user profiles to recommend items to the users from the catalog.
Item profile:
In a content-based recommendation system, we need to build a profile for each item, containing its important properties. For example, if a movie is an item, then its actors, director, release year, and genre are its important properties; for a document, the important properties are the type of content and the set of important words in it.
Let's have a look at how to create an item profile. First, we compute a TF-IDF vector for each item. Here, the TF (term frequency) of a word is the number of times it appears in a document, and the IDF (inverse document frequency) of a word is a measure of how significant that term is across the whole corpus. They are calculated with the following formulas:
- The term frequency is calculated as:

  $$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$$

  where $f_{ij}$ is the frequency of term (feature) $i$ in document (item) $j$, normalized by the most frequent term in that document.
- The inverse document frequency is calculated as:

  $$IDF_i = \log\frac{N}{n_i}$$

  where $n_i$ is the number of documents that mention term $i$, and $N$ is the total number of documents.
- Therefore, the total TF-IDF score is:

  $$w_{ij} = TF_{ij} \times IDF_i$$

  Here, the doc profile is the set of words with the highest TF-IDF scores, together with their scores. A code sketch follows.
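As a rough sketch of building item profiles in practice, assuming scikit-learn's TfidfVectorizer and a hypothetical toy corpus of item descriptions (note that scikit-learn uses a smoothed variant of the IDF formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: one text description per item
item_descriptions = [
    "action sci-fi space adventure",
    "romantic comedy in new york",
    "space opera adventure saga",
]

# Each row of the resulting matrix is an item profile:
# a TF-IDF weight for every term in the vocabulary
vectorizer = TfidfVectorizer()
item_profiles = vectorizer.fit_transform(item_descriptions)

print(vectorizer.get_feature_names_out())
print(item_profiles.toarray().round(2))
```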
User profile:
The user profile is a vector that describes the user's preferences. When creating a user profile, we use a utility matrix that describes the relationship between users and items. From this information, the best estimate of which items the user likes is some aggregation of the profiles of the items that user has rated.
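As a minimal sketch of this aggregation (the item vectors and ratings below are hypothetical, and the rating-weighted average is just one common choice of aggregation):

```python
import numpy as np

# Hypothetical item profiles (one row per item the user has rated)
item_vectors = np.array([
    [0.9, 0.1, 0.0],   # item 1
    [0.0, 0.8, 0.3],   # item 2
    [0.7, 0.0, 0.5],   # item 3
])
user_ratings = np.array([5.0, 1.0, 4.0])

# User profile = rating-weighted average of the rated item profiles
user_profile = user_ratings @ item_vectors / user_ratings.sum()

# Recommend catalog items by cosine similarity to the user profile
catalog = np.array([[0.8, 0.1, 0.2], [0.1, 0.9, 0.1]])
scores = catalog @ user_profile / (
    np.linalg.norm(catalog, axis=1) * np.linalg.norm(user_profile)
)
print(scores)  # higher score = better match
```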
Advantages and Disadvantages:
- Advantages:
- No need for data on other users: recommendations depend only on the target user's own history.
- Able to recommend to users with unique tastes.
- Able to recommend new and unpopular items (no first-rater problem).
- Able to provide explanations for recommended items by listing the content features that caused them.
- Disadvantages:
- Finding the appropriate features is hard.
- Doesn't recommend items outside the user's profile, so it rarely helps users discover new interests.
Collaborative Filtering: Collaborative filtering is based on the idea that similar people (based on the data) generally tend to like similar things. It predicts which item a user will like based on the item preferences of other similar users.
Collaborative filtering uses a user-item matrix to generate recommendations. This matrix contains the values that indicate a user’s preference towards a given item. These values can represent either explicit feedback (direct user ratings) or implicit feedback (indirect user behavior such as listening, purchasing, watching).
- Explicit Feedback: Data that users provide deliberately when they choose to do so, such as ratings. Users often choose not to provide it, so explicit feedback is scarce and can be costly to collect.
- Implicit Feedback: In implicit feedback, we track user behavior, such as listening, purchasing, or watching, to infer the user's preferences.
Example:
- Consider a user x. We need to find other users whose ratings are similar to x's ratings, and then estimate x's missing ratings based on those users.
- Let's create a matrix representing different users and movies (empty cells are unrated):

|   | M_1 | M_2 | M_3 | M_4 | M_5 | M_6 | M_7 |
|---|-----|-----|-----|-----|-----|-----|-----|
| A | 4   |     |     | 5   | 1   |     |     |
| B | 5   | 5   | 4   |     |     |     |     |
| C |     |     |     | 2   | 4   | 5   |     |
| D |     |     | 3   |     |     |     | 3   |
- Consider two users x and y with rating vectors rx and ry. We need to choose a similarity measure to compute sim(x, y). There are many ways to calculate similarity, such as Jaccard similarity, cosine similarity, and Pearson similarity. Here, we use centered cosine similarity (Pearson similarity), where we normalize each rating by subtracting the user's mean rating before taking the cosine:

  $$sim(x, y) = \frac{\sum_{i}(r_{xi} - \bar{r}_x)(r_{yi} - \bar{r}_y)}{\sqrt{\sum_{i}(r_{xi} - \bar{r}_x)^2}\,\sqrt{\sum_{i}(r_{yi} - \bar{r}_y)^2}}$$

  After centering, the matrix becomes:
|   | M_1 | M_2 | M_3  | M_4  | M_5  | M_6 | M_7 |
|---|-----|-----|------|------|------|-----|-----|
| A | 2/3 |     |      | 5/3  | -7/3 |     |     |
| B | 1/3 | 1/3 | -2/3 |      |      |     |     |
| C |     |     |      | -5/3 | 1/3  | 4/3 |     |
| D |     |     | 0    |      |      |     | 0   |
- Now we can calculate the similarities. For example, sim(A, B) = cos(rA, rB) = 0.09 and sim(A, C) = -0.56. Since sim(A, B) > sim(A, C), user A is more similar to B than to C (see the sketch below).
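A minimal sketch, assuming NumPy and NaN for missing ratings, that reproduces the similarities above:

```python
import numpy as np

# Utility matrix for users A-D and movies M_1..M_7 (NaN = not rated)
R = np.array([
    [4, np.nan, np.nan, 5, 1, np.nan, np.nan],        # A
    [5, 5, 4, np.nan, np.nan, np.nan, np.nan],         # B
    [np.nan, np.nan, np.nan, 2, 4, 5, np.nan],         # C
    [np.nan, np.nan, 3, np.nan, np.nan, np.nan, 3],    # D
])

def centered_cosine(u, v):
    # Subtract each user's mean over rated items, treat missing as 0
    u = np.nan_to_num(u - np.nanmean(u))
    v = np.nan_to_num(v - np.nanmean(v))
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(centered_cosine(R[0], R[1]), 2))  # sim(A, B) =  0.09
print(round(centered_cosine(R[0], R[2]), 2))  # sim(A, C) = -0.56
```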
Rating Predictions:
- Let rx be the vector of user x's ratings, and let N be the set of the k users most similar to x who have also rated item i. Then we can predict user x's rating for item i as the similarity-weighted average of the neighbors' ratings, as sketched below:

  $$r_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}}, \qquad s_{xy} = sim(x, y)$$
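A minimal sketch of this prediction rule (the neighbor similarities and ratings below are hypothetical; the formula assumes positive similarity weights):

```python
import numpy as np

def predict_rating(neighbor_sims, neighbor_ratings):
    # r_xi = sum(s_xy * r_yi) / sum(s_xy) over the k nearest neighbors
    neighbor_sims = np.asarray(neighbor_sims, dtype=float)
    neighbor_ratings = np.asarray(neighbor_ratings, dtype=float)
    return neighbor_sims @ neighbor_ratings / neighbor_sims.sum()

# Hypothetical example: three similar users rated item i as 4, 5, 3
print(predict_rating([0.9, 0.7, 0.4], [4, 5, 3]))  # 4.15
```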
Advantages and Disadvantages:
- Advantages:
- No need for domain knowledge because the embeddings are learned automatically.
- Captures inherent, subtle characteristics of users and items.
- Disadvantages:
- Cannot handle fresh items due to the cold-start problem.
- Hard to add any new features that may improve the quality of the model.
Implementation:
Python3
```python
import numpy as np
import pandas as pd
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

# Load the MovieLens ml-latest-small dataset
# (assumes ratings.csv and movies.csv are in the working directory)
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

ratings.head()
movies.head()

n_ratings = len(ratings)
n_movies = len(ratings['movieId'].unique())
n_users = len(ratings['userId'].unique())

print(f"Number of ratings: {n_ratings}")
print(f"Number of unique movieId's: {n_movies}")
print(f"Number of unique users: {n_users}")
print(f"Average number of ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average number of ratings per movie: {round(n_ratings/n_movies, 2)}")

user_freq = ratings[['userId', 'movieId']].groupby('userId').count().reset_index()
user_freq.columns = ['userId', 'n_ratings']
user_freq.head()

# Find the lowest and highest rated movies
mean_rating = ratings.groupby('movieId')[['rating']].mean()

# Lowest rated movie
lowest_rated = mean_rating['rating'].idxmin()
movies.loc[movies['movieId'] == lowest_rated]

# Highest rated movie
highest_rated = mean_rating['rating'].idxmax()
movies.loc[movies['movieId'] == highest_rated]

# Show the users who rated the highest rated movie
ratings[ratings['movieId'] == highest_rated]

# Show the users who rated the lowest rated movie
ratings[ratings['movieId'] == lowest_rated]

# The movies above have very few ratings, so we compute per-movie
# count and mean as a basis for a Bayesian average
movie_stats = ratings.groupby('movieId')[['rating']].agg(['count', 'mean'])
movie_stats.columns = movie_stats.columns.droplevel()

# Now, create the user-item matrix using scipy's csr_matrix
from scipy.sparse import csr_matrix

def create_matrix(df):
    N = len(df['userId'].unique())
    M = len(df['movieId'].unique())

    # Map IDs to indices
    user_mapper = dict(zip(np.unique(df["userId"]), list(range(N))))
    movie_mapper = dict(zip(np.unique(df["movieId"]), list(range(M))))

    # Map indices to IDs
    user_inv_mapper = dict(zip(list(range(N)), np.unique(df["userId"])))
    movie_inv_mapper = dict(zip(list(range(M)), np.unique(df["movieId"])))

    user_index = [user_mapper[i] for i in df['userId']]
    movie_index = [movie_mapper[i] for i in df['movieId']]

    X = csr_matrix((df["rating"], (movie_index, user_index)), shape=(M, N))

    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_matrix(ratings)

from sklearn.neighbors import NearestNeighbors

def find_similar_movies(movie_id, X, k, metric='cosine', show_distance=False):
    """Find similar movies using KNN on the movie rows of the user-item matrix."""
    neighbour_ids = []

    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind]
    k += 1  # the movie itself comes back as its own nearest neighbour
    kNN = NearestNeighbors(n_neighbors=k, algorithm="brute", metric=metric)
    kNN.fit(X)
    movie_vec = movie_vec.reshape(1, -1)
    neighbour = kNN.kneighbors(movie_vec, return_distance=show_distance)
    for i in range(0, k):
        n = neighbour.item(i)
        neighbour_ids.append(movie_inv_mapper[n])
    neighbour_ids.pop(0)  # drop the movie itself
    return neighbour_ids

movie_titles = dict(zip(movies['movieId'], movies['title']))

movie_id = 3
similar_ids = find_similar_movies(movie_id, X, k=10)
movie_title = movie_titles[movie_id]

print(f"Since you watched {movie_title}")
for i in similar_ids:
    print(movie_titles[i])
```
Output:
```
Number of ratings: 100836
Number of unique movieId's: 9724
Number of unique users: 610
Average number of ratings per user: 165.3
Average number of ratings per movie: 10.37
==========================================
# lowest rated
      movieId         title   genres
2689     3604  Gypsy (1962)  Musical

# highest rated
    movieId            title           genres
48       53  Lamerica (1994)  Adventure|Drama

# who rated the highest rated movie
       userId  movieId  rating  timestamp
13368      85       53     5.0  889468268
96115     603       53     5.0  963180003

# who rated the lowest rated movie
       userId  movieId  rating   timestamp
13633      89     3604     0.5  1520408880

Since you watched Grumpier Old Men (1995)
Grumpy Old Men (1993)
Striptease (1996)
Nutty Professor, The (1996)
Twister (1996)
Father of the Bride Part II (1995)
Broken Arrow (1996)
Bio-Dome (1996)
Truth About Cats & Dogs, The (1996)
Sabrina (1995)
Birdcage, The (1996)
```