Explore, Expand, Discover!!!
Introduction
This post demonstrates building a recommendation system for Indian movies using collaborative filtering. The code snippet covers data preparation, cleaning, user-item matrix creation, cosine similarity calculation, and visualization using seaborn and matplotlib, all aimed at improving movie recommendations and the user experience.
Building a recommendation system for movies or books with collaborative filtering is a common machine learning project. Collaborative filtering uses similarities between users and items to provide personalized suggestions, and it can be implemented with different techniques, such as nearest neighbors, matrix factorization, or deep learning. One example is a movie recommender that uses a feedback matrix of user ratings to predict which movies a user might like based on the preferences of similar users. To build such a system, one needs to preprocess and clean the data, choose an appropriate algorithm and evaluation metric, train and test the model, and generate recommendations.
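To make the "evaluation metric" and "train and test" steps a little more concrete, here is a minimal sketch that holds out one rating per user and scores a very simple baseline with root mean squared error (RMSE). The toy ratings, the user-mean baseline, and all variable names are assumptions made for illustration; a real project would swap in its own model and data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; the column names mirror a typical ratings file
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30, 10, 20, 30],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 2.0, 1.0, 2.0, 5.0],
})

# Hold out one rating per user as the test set (here: each user's last row)
test = ratings.groupby("userId").tail(1)
train = ratings.drop(test.index)

# Baseline "model": predict each user's mean training rating
user_means = train.groupby("userId")["rating"].mean()
predictions = test["userId"].map(user_means)

# Root mean squared error between the held-out and predicted ratings
rmse = float(np.sqrt(((test["rating"] - predictions) ** 2).mean()))
print(f"RMSE of the user-mean baseline: {rmse:.3f}")
```

A collaborative filtering model is only worth keeping if it beats a simple baseline like this one on held-out ratings.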
Nearest Neighbors Technique
The nearest neighbors technique for collaborative filtering finds the users or items most similar to a target, based on shared ratings or preferences, and makes predictions from the ratings of the top-k nearest neighbors, typically by averaging them. For example, if a user wants to watch a movie that they have not rated, the algorithm can find other users who have rated that movie and have similar tastes, and use their ratings to estimate how much the user would like it. There are two variants: user-based and item-based. User-based CF uses similarities between users to generate recommendations, while item-based CF uses similarities between items. The technique is simple and often effective, but it has drawbacks: it scales poorly as the number of users and items grows, it struggles when the rating matrix is sparse, and it cannot make recommendations for new users or new items that have no ratings yet (the cold start problem).
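The user-based variant described above can be sketched on a tiny example. The ratings matrix, the user labels, and the helper predict_rating are all hypothetical; the sketch uses a plain (unweighted) average of the k nearest neighbors, which is only one of several common choices.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy user-item matrix; NaN means "not rated"
ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 5.0],
        "Movie B": [4.0, 5.0, 2.0, 4.0],
        "Movie C": [1.0, 2.0, 5.0, 1.0],
        "Movie D": [np.nan, 5.0, 1.0, 4.0],
    },
    index=["u1", "u2", "u3", "u4"],
)

def predict_rating(matrix, user, movie, k=2):
    """Average the ratings of the k users most similar to `user` who rated `movie`."""
    # Compare users only on the other movies so the target movie is not used
    others = matrix.drop(columns=movie).fillna(0)
    sims = pd.Series(
        cosine_similarity(others.loc[[user]], others)[0], index=matrix.index
    ).drop(user)
    # Keep only candidate neighbors who actually rated the movie
    raters = matrix[movie].dropna().index
    neighbors = sims[sims.index.isin(raters)].nlargest(k).index
    return matrix.loc[neighbors, movie].mean()

estimate = predict_rating(ratings, "u1", "Movie D", k=2)
print(f"Estimated rating of u1 for Movie D: {estimate:.2f}")
```

Here u4 and u2 have the tastes closest to u1, so their ratings for Movie D drive the estimate, while u3, whose preferences run the other way, is ignored.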
Example:
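Here's a complete code snippet that downloads movie data from Kaggle, prepares and cleans it, suggests movies using collaborative filtering, and plots a visualization. Note that the snippet uses the rounakbanik/the-movies-dataset dataset and applies no country or language filter, so restricting it to Indian movies would need an additional filtering step: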
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import matplotlib.pyplot as plt
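# Download the movie data from Kaggle (notebook shell commands; replace the placeholders with your own credentials)
!pip install kaggle
!mkdir -p ~/.kaggle
!echo '{"username":"<username>","key":"<API key>"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
# --unzip extracts the CSV files so they can be read directly below
!kaggle datasets download -d rounakbanik/the-movies-dataset --unzip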
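# Load movie data into a pandas dataframe (low_memory=False avoids mixed-dtype warnings on this file)
movie_data = pd.read_csv('movies_metadata.csv', low_memory=False)
# Clean the data: keep the relevant columns, then drop rows with malformed dates, non-numeric ids or missing titles
movie_data = movie_data[['id', 'title', 'genres', 'release_date']]
movie_data = movie_data[movie_data['release_date'].str.len() == 10].copy()
movie_data['id'] = pd.to_numeric(movie_data['id'], errors='coerce')
movie_data = movie_data.dropna(subset=['id', 'title'])
movie_data['id'] = movie_data['id'].astype(int)
# Keep one row per id so that merging with the ratings does not duplicate them
movie_data = movie_data.drop_duplicates(subset='id')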
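# Load movie ratings data into a pandas dataframe
# (ratings_small.csv is the smaller ratings file shipped with the dataset; the full ratings.csv is too large to pivot into a dense matrix in memory)
movie_ratings = pd.read_csv('ratings_small.csv')
# Clean the data by dropping irrelevant columns
movie_ratings = movie_ratings[['userId', 'movieId', 'rating']]
# The ratings use MovieLens movieId values, while movies_metadata.csv uses TMDB ids; links_small.csv maps one to the other
links = pd.read_csv('links_small.csv')[['movieId', 'tmdbId']].dropna()
links['tmdbId'] = links['tmdbId'].astype(int)
movie_ratings = pd.merge(movie_ratings, links, on='movieId')
# Attach titles by merging with the movie data
movie_ratings = pd.merge(movie_ratings, movie_data, left_on='tmdbId', right_on='id')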
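# Create a user-item matrix (pivot_table averages duplicate user/title pairs, which plain pivot would reject)
user_item_matrix = movie_ratings.pivot_table(index='userId', columns='title', values='rating', aggfunc='mean')
# Calculate pairwise cosine similarity between users
# Unrated movies are filled with 0 because cosine_similarity cannot handle NaN
user_similarity = pd.DataFrame(
    cosine_similarity(user_item_matrix.fillna(0)),
    index=user_item_matrix.index,
    columns=user_item_matrix.index,
)
# Define a function to find the k most similar users to a target user
def find_similar_users(target_user_id, k):
    # Look up by user ID label (not row position) and exclude the target user
    similarities = user_similarity.loc[target_user_id].drop(target_user_id)
    return similarities.sort_values(ascending=False).head(k).index.tolist()
# Define a function to recommend movies to a target user
def recommend_movies(target_user_id, k):
    similar_users = find_similar_users(target_user_id, k)
    target_user_ratings = user_item_matrix.loc[target_user_id].dropna()
    # Average rating of each movie among the similar users; movies none of them rated are dropped
    similar_user_ratings = user_item_matrix.loc[similar_users].mean().dropna()
    # Only recommend movies the target user has not rated yet
    similar_user_ratings = similar_user_ratings.drop(target_user_ratings.index, errors='ignore')
    # Keep the top 30% by average rating (>= so that ties at the cutoff are not all discarded)
    similar_user_ratings = similar_user_ratings[similar_user_ratings >= similar_user_ratings.quantile(0.7)]
    recommended_movies = similar_user_ratings.sort_values(ascending=False).head(10)
    return recommended_movies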
# Test the model on a sample user
target_user_id = 100
k = 5
recommended_movies = recommend_movies(target_user_id, k)
print('Recommended movies for user {}:'.format(target_user_id))
print(recommended_movies)
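# Visualize the results with a heatmap of rating correlations among the 20 most-rated movies
# (correlating every pair of titles is slow and produces an unreadable plot)
top_movies = user_item_matrix.count().sort_values(ascending=False).head(20).index
sns.set(style='white')
plt.figure(figsize=(10, 10))
sns.heatmap(user_item_matrix[top_movies].corr(), cmap='coolwarm', annot=False)
plt.title('Movie Rating Correlations')
plt.show()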
In this example, we first download the movie data from Kaggle using the Kaggle API, then load the movie metadata into a pandas DataFrame and clean it by keeping only the relevant columns and dropping rows with malformed release dates or IDs. We also load the ratings data into a pandas DataFrame, keep only the relevant columns, and merge it with the movie data. We then create a user-item matrix and calculate the pairwise cosine similarity between users.
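To see what the user-item matrix and the cosine similarity step produce, here is a minimal sketch on a handful of made-up ratings. The data and variable names are hypothetical, and filling unrated movies with 0 is an assumption of this sketch (cosine_similarity does not accept NaN values), not the only possible choice.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical long-format ratings, one row per (user, title) pair
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "title":  ["Movie A", "Movie B", "Movie A", "Movie B", "Movie B"],
    "rating": [4.0, 2.0, 2.0, 1.0, 5.0],
})

# Rows are users, columns are titles; missing combinations become NaN
matrix = ratings.pivot_table(index="userId", columns="title", values="rating")

# cosine_similarity rejects NaN, so unrated movies are filled with 0 here
similarity = pd.DataFrame(
    cosine_similarity(matrix.fillna(0)),
    index=matrix.index,
    columns=matrix.index,
)
print(matrix)
print(similarity.round(3))
```

Users 1 and 2 get a similarity of 1.0 even though user 2 rates everything lower, because cosine similarity compares the direction of the rating vectors and ignores their magnitude.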
Next, we define two functions: find_similar_users and recommend_movies. The find_similar_users function takes a target user ID and a number k as inputs and returns the k users most similar to the target user based on their rating history. The recommend_movies function takes the same inputs, uses find_similar_users to find the k most similar users, and then recommends movies based on their rating history.
We test the model on a sample user with user ID 100 and k=5, and display the recommended movies.
Finally, we visualize the results with a heatmap using seaborn and matplotlib, showing the correlation between movie ratings across all users. The visualization can help us identify patterns and relationships between movies that may be useful for future recommendation systems.
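The movie-to-movie correlations behind such a heatmap can also be queried directly, which is the basic idea of item-based filtering. The sketch below is a hedged illustration on toy data: the matrix, the min_periods threshold of 3, and the helper most_correlated are assumptions, not part of the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy user-item matrix; NaN means "not rated"
matrix = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0, 5.0],
        "Movie B": [4.0, 5.0, 2.0, 1.0, 4.0],
        "Movie C": [1.0, 2.0, 5.0, 4.0, np.nan],
    }
)

# Pearson correlation between movie columns, requiring at least 3 common raters
correlations = matrix.corr(min_periods=3)

def most_correlated(title, n=2):
    """Return the n movies whose ratings correlate most with `title`."""
    return correlations[title].drop(title).sort_values(ascending=False).head(n)

print(most_correlated("Movie A"))
```

Requiring a minimum number of common raters matters on real data, where a correlation computed from only one or two shared users is mostly noise.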
Conclusion
Building a recommendation system for Indian movies using collaborative filtering can enhance the user experience and increase engagement with a platform. By drawing on user ratings and preferences, the system can recommend movies that are likely to interest the user. In this project, we demonstrated data preparation, cleaning, and user-item matrix creation, as well as how to calculate cosine similarity and visualize the results. While there are other approaches to recommendation, collaborative filtering is a widely used one that can yield relevant recommendations. Overall, this project shows how machine learning can improve user experience and satisfaction.