Distributed Recommender System for GitHub repositories based on the implicit feedback captured when a user stars a repository.
The system relies on a dataset of user interactions on the GitHub website. To collect it, we started from a well-known repository, React, and stored all the users who starred it. Then, for each of those users, we stored the other repositories they have starred.
In particular, the following data is available for each repository:
- Repository name
- Repository owner
- Number of stars
- Number of forks
- Main language
- About (small description of the repo)
- Sponsor (true/false; some repositories have donations enabled)
- Last update time
For users, we collected which repositories each user starred.
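The collection procedure described above can be sketched as follows. This is only an illustration, not the code used to build the dataset: the function `collect_star_pairs` and its two fetcher callbacks are hypothetical names, and the toy dictionaries stand in for whatever GitHub API calls or scraping were actually used.

```python
from typing import Callable, Iterable, Set, Tuple


def collect_star_pairs(
    seed_repo: str,
    fetch_stargazers: Callable[[str], Iterable[str]],
    fetch_starred: Callable[[str], Iterable[str]],
) -> Set[Tuple[str, str]]:
    """Return the (user, repo) star pairs reachable from one seed repository.

    Both fetchers are injected (hypothetical helpers), so the crawl logic
    stays independent of how the data is actually retrieved.
    """
    pairs: Set[Tuple[str, str]] = set()
    for user in fetch_stargazers(seed_repo):
        # Every stargazer of the seed has, by definition, starred the seed.
        pairs.add((user, seed_repo))
        # Then record every other repository the same user starred.
        for repo in fetch_starred(user):
            pairs.add((user, repo))
    return pairs


# Toy stand-ins for the real data source (made-up users, for illustration only).
STARGAZERS = {"facebook/react": ["alice", "bob"]}
STARRED = {
    "alice": ["facebook/react", "vuejs/vue"],
    "bob": ["facebook/react", "apache/spark"],
}

pairs = collect_star_pairs(
    "facebook/react",
    lambda repo: STARGAZERS.get(repo, []),
    lambda user: STARRED.get(user, []),
)
```

Each resulting pair is one implicit-feedback interaction: the user starred the repository, with no explicit rating attached.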
For the recommender system, we experimented with Content-based Filtering and with Collaborative Filtering (CF), implementing both Neighbourhood CF and Matrix Factorization.
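To give an idea of how neighbourhood CF works on star data, here is a minimal pure-Python sketch of an item-based variant. It assumes cosine similarity over binary star vectors and a sum-of-similarities score; the notebook's `pyspark` implementation may use a different similarity or scoring rule, and the names `item_similarities` and `recommend` are illustrative only.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(user_stars):
    """Cosine similarity between repositories on binary star data.

    user_stars maps each user to the set of repositories they starred.
    For binary vectors, cos(a, b) = |users(a) & users(b)| / sqrt(|users(a)| * |users(b)|).
    """
    star_count = defaultdict(int)  # number of users who starred each repo
    co_count = defaultdict(int)    # number of users who starred both repos
    for repos in user_stars.values():
        for a in repos:
            star_count[a] += 1
            for b in repos:
                if a != b:
                    co_count[(a, b)] += 1
    return {
        (a, b): c / sqrt(star_count[a] * star_count[b])
        for (a, b), c in co_count.items()
    }


def recommend(user, user_stars, sims, k=5):
    """Score unseen repos by summing their similarity to the user's starred repos."""
    seen = user_stars[user]
    scores = defaultdict(float)
    for (a, b), s in sims.items():
        if a in seen and b not in seen:
            scores[b] += s
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [repo for repo, _ in ranked[:k]]


# Toy interactions (made-up users, for illustration only).
user_stars = {
    "alice": {"react", "vue"},
    "bob": {"react", "vue", "svelte"},
    "carol": {"react", "spark"},
    "dave": {"spark", "hadoop"},
}
sims = item_similarities(user_stars)
top = recommend("alice", user_stars, sims, k=2)
```

In this toy example `svelte` ranks first for `alice` because it co-occurs with both of her starred repositories, while `spark` co-occurs only with `react`.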
Finally, to evaluate our work, we used two metrics: MAP@K and Personalization.
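Both metrics can be sketched in a few lines of Python. The definitions below are assumptions, since several variants exist: average precision is normalised by `min(K, number of relevant items)`, and Personalization is taken as one minus the mean pairwise cosine similarity between users' recommendation lists. The notebook may compute them differently.

```python
from itertools import combinations
from math import sqrt


def average_precision_at_k(recommended, relevant, k):
    """AP@K for one user; assumes `recommended` has no duplicate items."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(k, len(relevant))


def map_at_k(recs, truth, k):
    """Mean of AP@K over all users that received recommendations."""
    total = sum(
        average_precision_at_k(items, truth.get(user, set()), k)
        for user, items in recs.items()
    )
    return total / len(recs)


def personalization(recs):
    """1 - mean pairwise cosine similarity between recommendation lists.

    Assumes at least two users and non-empty lists.
    """
    sims = []
    for a, b in combinations(recs.values(), 2):
        sa, sb = set(a), set(b)
        sims.append(len(sa & sb) / sqrt(len(sa) * len(sb)))
    return 1.0 - sum(sims) / len(sims)


# Toy recommendations and held-out stars (made-up, for illustration only).
recs = {
    "alice": ["svelte", "spark"],
    "bob": ["spark", "hadoop"],
    "carol": ["vue", "svelte"],
}
truth = {"alice": {"svelte"}, "bob": {"hadoop"}, "carol": {"angular"}}

map2 = map_at_k(recs, truth, 2)
pers = personalization(recs)
```

A higher MAP@K means relevant repositories appear earlier in the list, while a Personalization score close to 1 means different users receive largely different recommendations.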
The notebook used to create and validate the recommender system, `GitHub_Recommender_System.ipynb`, is stored in this repo.
We suggest running the notebook in Colab, since the project is built with the `pyspark` library.
The following table reports the results obtained by the models:
Model | MAP@1 | MAP@2 | MAP@3 | MAP@4 | MAP@5 | Personalization |
---|---|---|---|---|---|---|
Content-based Filtering | 0.067 | 0.051 | 0.043 | 0.037 | 0.033 | 0.676 |
User-based Collaborative Filtering | 0.357 | 0.261 | 0.213 | 0.183 | 0.164 | 0.965 |
Item-based Collaborative Filtering | 0.355 | 0.300 | 0.264 | 0.239 | 0.216 | 0.874 |
Matrix Factorization | 0.506 | 0.393 | 0.332 | 0.298 | 0.268 | 0.864 |
As expected, Matrix Factorization is the best model in terms of MAP, with the highest value at every rank K. User-based Collaborative Filtering, on the other hand, obtains the highest Personalization score (0.965).
Using the notebook as a backend, we built a small demo that lets you inspect the recommendations for every user in the dataset. It also lets you define a fake user by selecting the repositories they starred, in order to test the system.
The frontend is available at https://recommend-hub.netlify.app/, but unfortunately the backend is no longer hosted. However, you can host it yourself directly in Colab by following the instructions provided at the end of the notebook.
The system is implemented using the Python language and the following libraries:
- `pyspark` as the distributed framework
- `pandas` to work with data
- `matplotlib` and `seaborn` to perform the EDA
Moreover, for the Web server the following technologies have been used:
- `flask` and `ngrok` Python libraries for the back-end
- `React.js` for the front-end