This repository contains code that runs collaborative filtering on the MovieLens-100k dataset to generate movie recommendations for users. It also runs a feature analysis to determine whether the user and movie matrices learned by the SVD factorization contain information about user gender and movie release year.
Requirements: numpy, scikit-learn, pandas, and the surprise package (published on PyPI as scikit-surprise).
The folder recommendation_system contains files:
modelselectionsvd.py : runs GridSearchCV to determine the best regularization parameter for the SVD algorithm
evaluationbyMAE : compares every user-movie rating predicted by the model against the actual rating in the test set and reports the Mean Absolute Error.
evalutationbytop5 : generates the top 5 movie recommendations for each user, then averages the actual ratings of those recommendations that appear in the test set.
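The two evaluation ideas above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names, the dictionary layout keyed by (user, item), and the toy ratings are all assumptions made for the example, and the actual scripts may differ (for instance in how they handle movies a user has already rated).

```python
from collections import defaultdict


def mean_absolute_error(predicted, actual):
    """Average |predicted - actual| over the (user, item) pairs in `actual`."""
    errors = [abs(predicted[pair] - rating) for pair, rating in actual.items()]
    return sum(errors) / len(errors)


def top_n_mean_rating(predicted, actual, n=5):
    """Average the actual ratings of each user's top-n predictions found in `actual`."""
    by_user = defaultdict(list)
    for (user, item), score in predicted.items():
        by_user[user].append((score, item))

    found = []
    for user, scored in by_user.items():
        # Highest predicted score first; keep the n best items for this user.
        top_items = [item for _, item in sorted(scored, reverse=True)[:n]]
        # Only recommendations that were actually rated in the test set count.
        found.extend(actual[(user, item)] for item in top_items if (user, item) in actual)
    return sum(found) / len(found) if found else float("nan")


# Toy data, made up for illustration (not taken from MovieLens).
predicted = {(0, 0): 4.5, (0, 1): 3.0, (0, 2): 2.0, (1, 0): 1.5, (1, 1): 4.0}
actual = {(0, 0): 5.0, (0, 2): 2.0, (1, 1): 3.0}

mae = mean_absolute_error(predicted, actual)
top1 = top_n_mean_rating(predicted, actual, n=1)
print(f"MAE: {mae:.3f}, mean rating of top-1 recommendations: {top1:.3f}")
```

A higher top-n mean suggests the recommended movies are ones users actually liked, while a lower MAE suggests the predicted ratings are close to the real ones. The two numbers can disagree, which is presumably why both scripts exist.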
The folder feature_analysis contains files:
userfeatures.py : Takes the user matrix learned by the SVD factorization and trains a Logistic Regression classifier on its features, using the actual gender of each user as the label, to test whether user gender can be predicted solely from ratings.
moviefeatures.py : Takes the movie matrix learned by the SVD factorization and trains a Kernel Ridge Regression model on its features, using the actual release year of each movie as the target, to test whether movie release year can be predicted solely from ratings. The model is then compared with a naive baseline that always predicts the mean movie release year.
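The comparison against the naive mean baseline can be illustrated as follows. This is a sketch under stated assumptions: the error metric (mean absolute error), the function names, and every number below are hypothetical, and the "model predictions" are placeholders standing in for the output of a fitted regressor.

```python
def mean_baseline_mae(train_years, test_years):
    """MAE of always predicting the mean of the training release years."""
    mean_year = sum(train_years) / len(train_years)
    return sum(abs(year - mean_year) for year in test_years) / len(test_years)


def model_mae(predicted_years, test_years):
    """MAE of per-movie predictions against the actual release years."""
    pairs = zip(predicted_years, test_years)
    return sum(abs(pred - year) for pred, year in pairs) / len(test_years)


# Hypothetical values for illustration only.
train_years = [1990, 1994, 1996, 1980]  # mean is 1990
test_years = [1995, 1985]
model_predictions = [1993, 1988]        # placeholder for a fitted model's output

baseline = mean_baseline_mae(train_years, test_years)
fitted = model_mae(model_predictions, test_years)
print(f"baseline MAE: {baseline:.2f}, model MAE: {fitted:.2f}")
```

If the model's error is clearly below the baseline's, the learned movie features carry some information about release year. If the two are close, they probably do not.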
trainset.csv: the training set of user ratings. There are three columns, (user-id, item-id, rating), as the headers indicate. There are 943 users, with ids ranging from 0 to 942, and 1681 items, with ids ranging from 0 to 1680.
testset.csv: the test set of user ratings. It has the same structure as the training set.
gender.csv: genders of 943 users. Female is 0, and male is 1.
release_year.csv: release years of 1681 movies
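As a rough sketch of how the ratings files might be read and sanity-checked using only the standard library: the header names below follow the description of trainset.csv above, but their exact spelling in the file is an assumption, and the in-memory sample stands in for the real file.

```python
import csv
import io

# Stand-in for trainset.csv; the header spelling is assumed from the description.
SAMPLE = "user-id,item-id,rating\n0,0,5\n0,1680,3\n942,12,4\n"


def load_ratings(handle):
    """Read (user, item, rating) triples from an open CSV file or file-like object."""
    reader = csv.DictReader(handle)
    return [
        (int(row["user-id"]), int(row["item-id"]), float(row["rating"]))
        for row in reader
    ]


def check_id_ranges(ratings, n_users=943, n_items=1681):
    """Raise ValueError if any id falls outside the documented 0-based ranges."""
    for user, item, _ in ratings:
        if not (0 <= user < n_users and 0 <= item < n_items):
            raise ValueError(f"id out of range: user={user}, item={item}")


ratings = load_ratings(io.StringIO(SAMPLE))
check_id_ranges(ratings)
print(f"loaded {len(ratings)} ratings")
```

To read the real file, replace the `io.StringIO(SAMPLE)` argument with a handle from `open("trainset.csv", newline="")`, adjusting the path to wherever the file lives in the repository.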