MovieLens analysis PySpark scripts project
These scripts can be run in a PySpark environment, preferably in Docker (I recommend big-data-europe, updated with PySpark and Python 3.9).
This repository contains scripts for analyzing the MovieLens dataset using PySpark and Hadoop inside a Docker environment. The analysis focuses on various insights including top-rated movies, genre trends, user behavior, and correlations.
Before running the scripts, ensure you have the following installed:
- Docker (for running Hadoop & PySpark)
- Git (for cloning the repository)
- Jupyter Notebook (for visualization)
All scripts are located in the scripts/ folder.
spark-submit --master local[*] scripts/load_data.py --path hdfs:///movielens/
This script loads the dataset into HDFS.
spark-submit --master local[*] scripts/best_movies_by_genre.py --path hdfs:///movielens/
Finds the best-rated movie for each genre and saves results in HDFS.
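The script itself is not reproduced here, so the following is only a plain-Python sketch of the per-genre selection logic, not the PySpark implementation. It assumes the standard MovieLens layout, where each movie carries a pipe-separated genres string. The function name, the min_ratings threshold, and the sample rows are hypothetical.

```python
from collections import defaultdict


def best_movie_per_genre(movies, ratings, min_ratings=1):
    """Return {genre: (title, mean_rating)} for the top-rated movie per genre.

    movies:      iterable of (movieId, title, genres), genres pipe-separated
    ratings:     iterable of (userId, movieId, rating)
    min_ratings: hypothetical cutoff; movies with fewer ratings are skipped
    """
    # movieId -> [sum of ratings, number of ratings]
    totals = defaultdict(lambda: [0.0, 0])
    for _user_id, movie_id, rating in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1

    best = {}
    for movie_id, title, genres in movies:
        if movie_id not in totals:
            continue  # movie was never rated
        total, count = totals[movie_id]
        if count < min_ratings:
            continue
        mean = total / count
        # A movie competes in every genre it is tagged with.
        for genre in genres.split("|"):
            # Strict ">" means the first movie seen wins a tie.
            if genre not in best or mean > best[genre][1]:
                best[genre] = (title, mean)
    return best


# Made-up rows, for illustration only.
sample_movies = [
    (1, "Movie A", "Comedy|Drama"),
    (2, "Movie B", "Comedy"),
    (3, "Movie C", "Drama"),
]
sample_ratings = [
    (10, 1, 4.0), (11, 1, 5.0),
    (10, 2, 3.0),
    (11, 3, 5.0), (12, 3, 5.0),
]
result = best_movie_per_genre(sample_movies, sample_ratings)
```

A minimum-ratings cutoff is a common way to keep a movie with a single 5-star rating from winning its genre; whether the actual script applies one is not stated here.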
spark-submit --master local[*] scripts/generate_chart_data.py --path hdfs:///movielens/
Generates various statistical analyses (ratings, trends, correlations) and stores them in HDFS.
docker cp <namenode-container-id>:/tmp/charts C:\Programming\Movie_Lens\
This copies the generated CSVs to your local machine.
mv C:\Programming\Movie_Lens\charts C:\Programming\Movie_Lens\analysis\charts
jupyter notebook
http://localhost:8888/notebooks/Movie_Lens/analysis/analysis.ipynb
- Ratings Distribution - Histogram of all ratings.
- Top 10 Movies by Rating - Bar chart of highest-rated movies.
- Ratings Over Time - How movie ratings evolved yearly.
- Most Popular Genres - Which genres received the most reviews over time.
- User Rating Behavior - Scatter plot showing rating activity.
- Movie Rating Variance - Box plot of rating consistency.
- Correlation: Reviews & Ratings - Scatter plot showing relationship.
- Most Reviewed Genres Per Year - Stacked bar chart.
- Most Popular Tags Per Year - Bar chart of frequent tags.
- Yearly Trends: Ratings & Reviews - Dual line chart.
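As an illustration of the kind of aggregation behind the yearly charts above, here is a minimal plain-Python sketch that derives a review count and mean rating per year. It assumes MovieLens-style rating rows whose timestamp is in Unix seconds (UTC). The function name and sample rows are hypothetical, and the project's own PySpark code may compute this differently.

```python
from collections import defaultdict
from datetime import datetime, timezone


def yearly_trends(ratings):
    """Return a list of (year, review_count, mean_rating), sorted by year.

    ratings: iterable of (userId, movieId, rating, timestamp),
             with timestamp in seconds since the Unix epoch (UTC)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _user_id, _movie_id, rating, timestamp in ratings:
        # Convert the epoch timestamp to a calendar year in UTC.
        year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
        sums[year] += rating
        counts[year] += 1
    return [(year, counts[year], sums[year] / counts[year])
            for year in sorted(counts)]


# Made-up rows: two ratings in 2000, one in 2001.
sample = [
    (1, 1, 4.0, 946684800),         # 2000-01-01 00:00:00 UTC
    (2, 1, 2.0, 946684800 + 3600),  # one hour later
    (1, 2, 5.0, 978307200),         # 2001-01-01 00:00:00 UTC
]
trends = yearly_trends(sample)
```

The two series in each tuple (count and mean) correspond to the two lines of a dual line chart.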
hdfs dfs -ls /movielens/
hdfs dfs -cat /movielens/output_best_movies_by_genre.txt/part-00000
pyspark
docker restart namenode datanode resourcemanager nodemanager historyserver
- The script paths use hdfs:///movielens/ since the dataset is stored in HDFS.
- If a script fails, check if HDFS is running and data exists in HDFS (hdfs dfs -ls /movielens/).
- To add more nodes for parallel processing, update the Spark submit command to use a cluster instead of local[*].
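The notes above can be sketched as two small Python helpers: one builds the spark-submit argument list with a configurable master, the other checks that a path exists in HDFS before a run. Both function names are hypothetical. The "yarn" master is only an assumption that your cluster accepts YARN submissions; substitute whatever master URL your setup uses.

```python
import subprocess


def build_submit_command(script, data_path="hdfs:///movielens/", master="local[*]"):
    """Build the spark-submit argument list used throughout this README."""
    return ["spark-submit", "--master", master, script, "--path", data_path]


def hdfs_path_exists(path):
    """Return True if `hdfs dfs -test -e` reports that the path exists.

    Requires the hdfs CLI on PATH (for example, inside the namenode container).
    """
    result = subprocess.run(["hdfs", "dfs", "-test", "-e", path])
    return result.returncode == 0


# Local run, same as the commands shown earlier.
local_cmd = build_submit_command("scripts/load_data.py")

# Cluster run: only the master changes (assumes a working YARN setup).
cluster_cmd = build_submit_command("scripts/load_data.py", master="yarn")

# To launch a job after checking the data is present:
# if hdfs_path_exists("/movielens/"):
#     subprocess.run(local_cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with the brackets and asterisk in local[*].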