Movie Recommendation System using Machine Learning

Updated on Mar 20, 2023 16:09 IST

Every time you open YouTube just to find the solution to a problem or catch the latest news, you end up spending more time there than you planned. A similar thing happens when you decide to binge a single movie or series on an OTT platform: you end up watching more than you had in mind. Ever wondered how they manage to do that? Most OTT platforms depend on their movie recommendation system.

But, what is a Recommendation System exactly?

A movie recommendation system is a fancy way to describe a process that tries to predict the items you will prefer, based on your own choices or the choices of people similar to you.

In layman’s terms, we can say that a Recommendation System is a tool designed to predict/filter the items as per the user’s behavior.

Why exactly do we need Recommendation Systems?

From a user’s perspective, recommendation systems are meant to fulfil the user’s needs in the shortest time possible. Take the type of content you watch on Netflix or Hulu: a person who watches only Korean dramas will mostly see related titles, while a person who likes action titles will see those on their home screen.

From an organization’s perspective, the goal is to keep the user on the platform for as long as possible so that it generates the most possible profit. Better recommendations also lead to positive feedback from users. What good is a library of 500K+ titles to the organization if it cannot provide proper recommendations?

Recommendations are a great way to keep you watching, but the recommendations Raghu gets are wrong. But how? Recommendation systems are tailored to a single user, not to multiple users sharing one account. Raghu lives in a joint family, and everyone uses a single system to watch what they want. OTT platforms do let you add multiple profiles, but everyone else has already taken those, and he is left with a single profile to share with his grandparents. So, Raghu decides to create his own movie recommendation system. Before getting started, he should understand the different types of recommendation systems.

Types of Recommendation Systems

The following figure shows different kinds of recommender systems:

2022_01_Types-of-Recommendation-Systems.jpg

Collaborative Filtering

There are two types of collaborative filtering:

User-Based: Here, we try to find similar users based on their item choices and recommend items that those similar users liked. A user-item rating matrix is created first. Then, we find the correlations between the users and recommend items based on those correlations.

2022_01_Collaborative-Filtering.jpg

In the figure above, we can see that:

  • Jet likes Drama, Adventure, and Fantasy-based movies.
  • Rosie likes Action and Fantasy-based movies.
  • Eva likes Drama and Adventure-based movies.

From the above data, we can say that Eva is highly correlated to Jet. Thus, we can recommend her Fantasy movies as well.
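The Jet, Rosie, and Eva example can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the 0/1 "likes" matrix below is a hypothetical stand-in for a real user-item rating matrix, and `recommend_genres` is a helper name made up for this sketch.

```python
import pandas as pd

# Hypothetical user-item matrix: 1 means the user likes the genre, 0 means no signal
likes = pd.DataFrame(
    {
        "Drama": [1, 0, 1],
        "Adventure": [1, 0, 1],
        "Fantasy": [1, 1, 0],
        "Action": [0, 1, 0],
    },
    index=["Jet", "Rosie", "Eva"],
)

def recommend_genres(user, likes):
    """Return the most correlated other user and the genres only they like."""
    # Transpose so that users become columns, then correlate user with user
    user_corr = likes.T.corr()
    nearest = user_corr[user].drop(user).idxmax()
    # Genres the nearest user likes that the target user has not picked yet
    mask = (likes.loc[nearest] == 1) & (likes.loc[user] == 0)
    return nearest, mask[mask].index.tolist()

nearest, suggestions = recommend_genres("Eva", likes)
print(nearest, suggestions)
```

With this toy data, Jet is the user most correlated with Eva, and Fantasy is the only genre Jet likes that Eva has not picked, which matches the reasoning above.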

Item-Based: Here, we try to find similar items based on the choices of the users who rated them and recommend those items. A user-item rating matrix is created first. Then, we find the correlations between the items and recommend items based on those correlations.

Collaborative filtering recommendations can become stale when user preferences or the available items change, because the correlations are computed from past ratings.

Content-Based Filtering

In this type, we will try to find similar items to the user’s selected item. Consider the below figure:

2022_01_Content-Based-Filtering.jpg

Let’s say Raghu watches a movie X. In this case, the model will try to find similar movies based on features such as genres, actors, and directors. For example, if a user likes to watch movies like Central Intelligence, where Dwayne Johnson is the protagonist, the model recommends movies where Dwayne Johnson is either the protagonist or plays some other part.
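A minimal sketch of this idea, assuming each movie is described by a set of feature tags and that similarity is measured by set overlap (Jaccard similarity). The catalogue, the tags, and the function names are hypothetical and only serve the illustration; the article itself uses count vectors and cosine similarity later on.

```python
def jaccard(a, b):
    """Share of features two movies have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical catalogue: every movie is a set of genre and cast tags
catalogue = {
    "Movie X": {"action", "comedy", "actor_a"},
    "Movie Y": {"action", "adventure", "actor_a"},
    "Movie Z": {"drama", "romance", "actor_b"},
}

def most_similar(title, catalogue):
    """Rank all other movies by feature overlap with the given title."""
    scores = [
        (other, jaccard(catalogue[title], features))
        for other, features in catalogue.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("Movie X", catalogue))
```

Movie Y shares a genre and an actor with Movie X, so it ranks above Movie Z, which shares nothing.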

Raghu wants exactly this type of recommender system, where he can input a movie name and get related movies as recommendations. Let’s see how he applies machine learning to create it.

To create the movie recommendation system, Raghu has taken data from the TMDB API. You can also request an API key:

2022_01_create-the-recommendation-system.jpg

Movie Dataset

The data gathered by Raghu has the following details:

  • Title: Movie Title.
  • Overview: Abstract of the Movie.
  • Popularity: Movie popularity rating as per TMDB.
  • Vote_average: Votes average out of 10.
  • Vote_count: Number of votes from the users.
  • Release_date: Date of release of the movie.
  • Keywords: Keywords for the movie by TMDB, as a list.
  • Genres: Movie genres, as a list.
  • Cast: Cast of the movie, as a list.
  • Crew: Crew of the movie, as a list.

Reading Movies Data:

As Raghu loads the data, let’s see how it looks:


 
import pandas as pd

# Load the zipped TMDB export, using the TMDB movie id as the index
data = pd.read_csv('tmdb.csv.zip', compression='zip', index_col='id')
data.head()

2022_01_Reading-Movies-Data.jpg

Cleaning Data

As you can see, before applying any machine learning models or even exploring the data, we need to clean it:

Removing Unnamed Column:

The “Unnamed: 0” column is an irritating leftover from the export. To remove it, Raghu reassigns the full list of column names, which renames “Unnamed: 0” to “temp”, and then drops that column:

# Rename all columns; the leftover 'Unnamed: 0' column becomes 'temp'
data.columns = ['temp', 'title', 'overview', 'popularity', 'vote_average',
                'vote_count', 'release_date', 'keywords', 'genres', 'cast', 'crew']
# Drop the leftover column
data.drop('temp', axis=1, inplace=True)

The output after dropping the column:

2022_01_output-1.jpg

Changing Data Type

After filling the null values for empty columns, Raghu realizes that he will have to change the data type for most of them:

2022_01_Changing-Data-Type.jpg

He creates a dictionary with columns as keys and their new type as values. Then, changes the datatype:


 
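# Map each column to its new dtype, then convert one column at a time.
# 'datetime64[ns]' spells out the unit, which newer pandas versions expect.
new_types = {'title': str, 'overview': str, 'release_date': 'datetime64[ns]'}
for col, new_type in new_types.items():
    data[col] = data[col].astype(new_type)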

It seems that he has not treated the list columns yet. These columns still contain some empty values, and if he casts them to the list type directly, he gets the following error:

2022_01_list-column-error.jpg

The error means that pandas does not accept list as a target data type here. Instead, he converts each of these columns to string type and keeps the values comma-separated:

for col in ['keywords', 'genres', 'cast', 'crew']:
    # Strip the list brackets and quotes, leaving comma-separated values
    for val in ['[', ']', '\'']:
        # regex=False treats '[' and ']' as literal characters
        data[col] = data[col].str.replace(val, '', regex=False)
    data[col] = data[col].astype(str)

2022_01_above-Data-Exploration.jpg
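To make the effect of this cleaning step concrete, here is a small self-contained sketch on toy data. The three rows and the helper name `flatten_list_column` are assumptions for illustration; unlike the snippet above, the helper also fills missing values with an empty string first, so that they do not end up as the text "nan".

```python
import pandas as pd

# Toy stand-in for a TMDB list column stored as text; None marks a missing value
toy = pd.DataFrame({"genres": ["['Action', 'Comedy']", "['Drama']", None]})

def flatten_list_column(series):
    """Turn "['A', 'B']" style strings into "A, B" style strings."""
    out = series.fillna("")
    for ch in ["[", "]", "'"]:
        # regex=False makes pandas treat the bracket as a plain character
        out = out.str.replace(ch, "", regex=False)
    return out.astype(str)

toy["genres"] = flatten_list_column(toy["genres"])
print(toy["genres"].tolist())
```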

Data Exploration

After cleaning the data, Raghu wants to do some analysis of the data. He creates two functions for list columns:

  • get_uniques(data,col): Returns a list of unique items.

def get_uniques(data, col):
    '''
    data: Dataframe object
    col: column name with comma separated values
    ---
    returns: a list of unique category values in that column
    '''
    out = set([val.strip().lower() for val in ','.join(data[col].unique()).split(',')])
    # Drop the empty string left behind by empty cells, if present
    out.discard('')
    return list(out)

  • get_counts(data,col,categories): Returns the counts for the unique items.

from tqdm import tqdm

def get_counts(data, col, categories):
    '''
    data: dataframe object
    col: name of the column
    categories: categories present
    ----
    return a dictionary with counts of each category
    '''
    categ = {category: None for category in categories}
    for category in tqdm(categories):
        val = 0
        for index in data.index:
            # Substring match: counts every row whose text contains the category
            if category in data.at[index, col].lower():
                val += 1
        categ[category] = val
    return categ
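As a point of comparison, the same kind of count can be sketched without explicit loops by splitting and exploding the column. This is an alternative technique, not the article's method, and the toy Series and the name `count_exact_tokens` are assumptions. Note one behavioural difference: `get_counts` matches substrings, while this version only counts whole comma-separated tokens.

```python
import pandas as pd

def count_exact_tokens(series):
    """Count comma-separated tokens, matching whole tokens only."""
    tokens = (
        series.fillna("")
        .str.split(",")   # "Action, Comedy" -> ["Action", " Comedy"]
        .explode()        # one row per token
        .str.strip()
        .str.lower()
    )
    # Empty cells produce empty tokens, which are not real categories
    tokens = tokens[tokens != ""]
    return tokens.value_counts()

toy_genres = pd.Series(["Action, Comedy", "Drama", "action, Drama", ""])
genre_counts = count_exact_tokens(toy_genres)
print(genre_counts)
```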

Using the two functions he creates a plotly chart to see most popular genres:


 
import plotly.express as px

# Assumption: the original snippet does not show how `genres` is defined;
# here it is taken to be the unique genre names in the data
genres = get_uniques(data, 'genres')
# Get the base counts for each category and sort them by counts
base_counts = get_counts(data, 'genres', genres)
base_counts = pd.DataFrame(index=list(base_counts.keys()),
                           data=list(base_counts.values()),
                           columns=['Counts'])
base_counts.sort_values(by='Counts', inplace=True)
# Plot the chart of top genres; genres with fewer than 1000 movies are greyed out
colors = ['#abaeab' if i < 1000 else '#A0E045' for i in base_counts.Counts]
fig = px.bar(x=base_counts.index,
             y=base_counts['Counts'],
             title='Most Popular Genre', color_discrete_sequence=colors, color=base_counts.index)
fig.show()

2022_01_Data-Exploration.jpg

Later, he plots the number of movies released per year:

# Function to plot value counts plots
def plot_value_counts_bar(data, col):
    '''
    data: Dataframe
    col: Name of the column to be plotted
    ----
    returns a plotly figure
    '''
    # Name both columns explicitly so the code does not depend on
    # how the installed pandas version names the value_counts() output
    vc = data[col].value_counts().rename_axis('cat').reset_index(name='count')
    fig = px.bar(vc, x='cat', y='count', color='cat', title=col)
    fig.update_layout()
    return fig

data['year'] = data.release_date.dt.year
plot_value_counts_bar(data, 'year')

2022_01_plot_value_counts_bar.jpg

Then, he creates another function to find the ratings by popularity, vote_count, vote_average:


 
def get_ratings(data, col, ratings_col, categories):
    '''
    data: dataframe object
    col: name of the column
    ratings_col: name of the numeric column to average
    categories: categories present
    ----
    return a dictionary with average ratings of each category
    '''
    categ = {category: None for category in categories}
    for category in tqdm(categories):
        val = 0
        ratings = 0
        for index in data.index:
            if category in data.at[index, col].lower():
                val += 1
                ratings += data.at[index, ratings_col]
        # Average of ratings_col over the movies in this category
        categ[category] = round(ratings / val, 2)
    return categ

base_counts = get_ratings(data, 'genres', 'vote_count', genres)
base_counts = pd.DataFrame(index=list(base_counts.keys()),
                           data=list(base_counts.values()),
                           columns=['Counts'])
base_counts.sort_values(by='Counts', inplace=True)
fig = px.pie(names=base_counts.index,
             values=base_counts['Counts'],
             title='Most Popular Genre by Votes', color=base_counts.index)
fig.show()

2022_01_popular-genre.jpg

You can explore the data further with the above functions, for example to find the most popular crew or the most voted crew.

Building Model

Raghu will be building the model in two ways:

Using CountVectorizer

It converts a collection of text documents into a matrix of token counts, with one count per word per document.

Take an example with 3 sentences:

I enjoy Marvel movies.

I like Dwayne.

I like Iron Man.

The count vectorizer will create a matrix where it determines the frequency of each word.

2022_01_Using-CountVectorizer.jpg

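In the matrix that CountVectorizer produces, each row corresponds to a sentence and each column to a word, and each cell holds the number of times that word occurs in that sentence. For example, the row for “I like Iron Man.” has a count of 1 under “like”, “iron”, and “man”. With scikit-learn’s default settings, single-character tokens such as “I” are dropped. The other rows are calculated similarly.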

Raghu creates the sentences for the CountVectorizer:


 
def create_soup(row):
    # Build one text string per movie for CountVectorizer to work with
    att = row['title'].lower()
    # Append the remaining fields (keywords, genres, cast, crew)
    for i in row.iloc[1:]:
        att = att + ' ' + str(i)
    return att

model_data = data.copy()
model_data = model_data[['title', 'keywords', 'genres', 'cast', 'crew']]
model_data['soup'] = model_data.apply(create_soup, axis=1)

He gets the data in the following way:

2022_01_CountVectorizer.jpg

Now, he gets the cosine similarity scores:


 
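from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(model_data['soup'])
# Pairwise cosine similarity between every pair of movies
cosine_sim2 = cosine_similarity(count_matrix)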
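For intuition, cosine similarity between two count vectors u and v is their dot product divided by the product of their lengths, so it depends on the angle between the vectors and not on their magnitudes. The sketch below computes it by hand with NumPy on hypothetical count vectors over the vocabulary [dwayne, enjoy, iron, like, man, marvel, movies]; the function name `cosine` is made up for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical count vectors for three short sentences
enjoy_marvel = [0, 1, 0, 0, 0, 1, 1]   # enjoy, marvel, movies
like_dwayne = [1, 0, 0, 1, 0, 0, 0]    # dwayne, like
like_iron_man = [0, 0, 1, 1, 1, 0, 0]  # iron, like, man

# One shared word ("like") out of vectors of length sqrt(2) and sqrt(3)
print(cosine(like_dwayne, like_iron_man))
# No shared words at all
print(cosine(enjoy_marvel, like_dwayne))
```

Vectors with no words in common score 0, identical vectors score 1, and everything else falls in between.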

Since we have the cosine similarity scores, we can now get the recommendations. The function below finds the 10 most similar movies and returns them sorted by popularity:

import plotly.figure_factory as ff

def get_recommendations_new(title, data, orig_data, cosine_sim=cosine_sim2):
    '''
    title: movie title
    data: model_data
    orig_data: original dataframe
    cosine_sim: cosine similarity matrix to use.
    ---
    returns: Table plot of plotly where top 10 movies by popularity are sorted.
    '''
    # Map each title to its row position in the similarity matrix
    # (the dataframe index holds TMDB ids, not row positions)
    indices = pd.Series(range(len(data)), index=data['title'])
    idx = indices[title]
    # If several movies share the title, use the first one
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies (position 0 is the movie itself)
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    out = orig_data[[
        'title', 'vote_average', 'genres', 'crew', 'popularity'
    ]].iloc[movie_indices]
    out.genres = out.genres.str.replace(',', '<br>')
    out.crew = out.crew.str.replace(',', '<br>')
    final = out.sort_values(by='popularity', ascending=False)
    colorscale = [[0, '#477BA8'], [.5, '#ece4db'], [1, '#d8e2dc']]
    fig = ff.create_table(final, colorscale=colorscale, height_constant=70)
    return fig
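The core of this function, picking the most similar rows from a similarity matrix, can be isolated in a few lines. The sketch below uses a hypothetical 4x4 similarity matrix and a made-up helper name, `top_k_similar`. Instead of slicing off the first sorted entry, it removes the query row explicitly, which also behaves sensibly if another movie happens to have a similarity of exactly 1.

```python
import numpy as np

def top_k_similar(sim, idx, k):
    """Row positions of the k most similar items to row idx, excluding idx."""
    # Sort positions by descending similarity; stable sort keeps ties in order
    order = np.argsort(-sim[idx], kind="stable")
    # Drop the query item itself rather than assuming it sorts first
    order = order[order != idx]
    return order[:k].tolist()

# Hypothetical symmetric similarity matrix for four movies
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(top_k_similar(sim, 0, 2))
```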

Let’s try for “The Shawshank Redemption”:

2022_01_Shawshank-Redemption.jpg

Let’s see for another title “Spirited Away”:

2022_01_Spirited-Away.jpg

Using NearestNeighbors

We can use NearestNeighbors as well to create our recommendation system. Before training the model, we need to process the data for optimal performance:


 
from sklearn.preprocessing import LabelEncoder

nn_data = data.copy()

def fill_genre(value, col):
    # 1 if the genre name occurs in the movie's genre string, else 0
    return 1 if col in value.lower() else 0

# Create one 0/1 column per genre
for col in genres:
    nn_data[col] = nn_data.genres.apply(fill_genre, args=(col,))
# Drop the text and date columns that the model cannot use directly
nn_data.drop(['overview', 'release_date', 'genres', 'title'], axis=1, inplace=True)
# Encode the remaining string columns as integer labels
for col in ['keywords', 'cast', 'crew']:
    nn_data[col] = LabelEncoder().fit_transform(nn_data[col])

Training the model:

from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric='cosine',
                             algorithm='auto',
                             n_neighbors=20,
                             n_jobs=-1)
model_knn.fit(nn_data)
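To see what a fitted NearestNeighbors model returns, here is a toy sketch with four hypothetical two-feature points. `kneighbors` gives back two arrays, the cosine distances and the row positions of the neighbours, and a point that was part of the training data comes back as its own nearest neighbour.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical feature matrix: rows 0 and 1 point in a similar direction, as do rows 2 and 3
features = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

knn = NearestNeighbors(metric="cosine", n_neighbors=2)
knn.fit(features)

# Query with row 0; the double brackets keep the 2-D shape (1, n_features)
distances, indices = knn.kneighbors(features[[0]], n_neighbors=2)
print(indices[0].tolist(), distances[0].tolist())
```

Row 0 appears first with a distance of (approximately) zero, followed by row 1. This is worth keeping in mind when asking the real model for 10 neighbours of a movie that is already in the data.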

Now, let’s test our model:

import numpy as np

# Create a function to recommend top 10 movies
def recommend_movies(movie, nn_data, orig_data):
    # Work on re-indexed copies so repeated calls do not modify the inputs
    orig = orig_data.reset_index()
    features = nn_data.reset_index(drop=True)
    # Row position of the first movie with this title
    movie_index = np.flatnonzero((orig.title == movie).to_numpy())[:1]
    # The 10 nearest neighbours include the queried movie itself
    distances, indices = model_knn.kneighbors(features.iloc[movie_index],
                                              n_neighbors=10)
    out = orig[[
        'title', 'vote_average', 'genres', 'crew', 'popularity'
    ]].iloc[indices[0]]
    out.genres = out.genres.str.replace(',', '<br>')
    out.crew = out.crew.str.replace(',', '<br>')
    final = out.sort_values(by='popularity', ascending=False)
    colorscale = [[0, '#fad2e1'], [.5, '#fde2e4'], [1, '#fff1e6']]
    fig = ff.create_table(final, colorscale=colorscale, height_constant=70)
    return fig

Let’s check for the movie “Thor”:

2022_01_Thor.jpg

Let’s try for “Eternals”:

2022_01_Eternals.jpg

Conclusion

In this article, we have learned how to create a recommendation system using machine learning. Apart from movie recommendations, you can try building recommender systems for shopping products, news, typing assistance, and so on.

By Sameer Jain
