I will be using Kaggle data available here to analyze and predict the ratings of video games based on 16 features. The dataset has 16,719 records, so I will use pandas and visualization libraries to analyze it, and build the predictions with Machine Learning models from scientific libraries such as scikit-learn.
Tasks in the project:
- Data pre-processing
- Visualization to understand the data patterns
- Learn about different Machine Learning models used for the data pattern observed in Step 2
- Model building
- Evaluation of the performance of the built Model
- Results report
What can I learn?
- pandas, NumPy, etc.
- Data processing techniques
- Visualization through libraries like matplotlib, etc.
- Prediction techniques
Target: By the end of Week 4, I will have gathered the data and collected information about the important steps, tools, and packages used in data analysis from different blogs and articles online.
Target: By the end of Week 5, I will have completed the data pre-processing and data inspection portion of my Data Analysis project.
“GARBAGE IN, GARBAGE OUT”
Planned steps:
After reading different articles and blogs about pre-processing in data analysis projects, I have found the following steps crucial. It has been reported that “45% of a data scientist’s time is spent on data preparation tasks” - Datanami
- Data cleaning
  - Checking for missing values
  - Categorical vs. numerical data
  - Splitting data into training and test datasets
  - Feature scaling
- Data integration
  - In my project, this step is skipped, as my dataset is already aggregated by Kaggle
- Data reduction
  - Attribute selection
  - Numerosity reduction
  - Dimensionality reduction
- Data transformation
  - Aggregation
  - Normalization
  - Feature selection
  - Discretization
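The data cleaning steps above can be sketched roughly as follows. This is only an illustration on a tiny made-up sample, not the project's actual notebook code: the column names (`Genre`, `Global_Sales`, `Critic_Score`, `User_Score`), the choice of `User_Score` as the target, and the median fill strategy are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample; the column names are assumptions, not read from the real dataset.
df = pd.DataFrame({
    "Genre": ["Action", "Sports", "Action", "Puzzle", "Sports", "Racing"],
    "Global_Sales": [8.2, 3.1, np.nan, 0.4, 1.9, 2.7],
    "Critic_Score": [91.0, 76.0, 83.0, np.nan, 70.0, 88.0],
    "User_Score": [8.5, 7.1, 8.0, 6.2, np.nan, 7.9],
})

# 1. Check for missing values in each column.
missing_counts = df.isna().sum()

# 2. Drop rows without a target, then fill the remaining numeric gaps
#    with the column median (one possible strategy among several).
clean = df.dropna(subset=["User_Score"]).copy()
numeric_cols = ["Global_Sales", "Critic_Score"]
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())

# 3. Categorical vs. numerical data: one-hot encode the categorical column.
features = pd.get_dummies(clean.drop(columns="User_Score"), columns=["Genre"])
target = clean["User_Score"]

# 4. Split into training and test datasets.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# 5. Feature scaling: fit the scaler on the training data only,
#    then apply the same transformation to the test data.
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])
```

Fitting the scaler only on the training split keeps information from the test set out of the model, which is why the split comes before the scaling here.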
Target: By the end of Week 6, I will have visualized my dataset and tried to understand the correlation between the different features at my disposal. This will enable feature selection.
“Data visualization is the language of decision making”
Planned steps:
- To analyze data in a chronological sense (by year)
  - Questions to answer:
    - How has the video game trend changed over time?
    - Is the trend common across different geographical locations?
- To visualize the prominence of branding in video games (by publisher)
  - Questions to answer:
    - Who are the prominent players?
    - How does the brand affect the ratings of video games?
- To visualize based on genre and platform
  - Questions to answer:
    - Which genre dominates the industry?
    - How many games are released in those genres?
    - Does the platform play a role (e.g., PS2 vs. PS4)?
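Most of the questions above reduce to a group-by followed by a plot. The sketch below shows one way to do that with pandas and matplotlib on a small made-up table. The column names, the placeholder publishers, and the sales figures are assumptions for illustration only and are not taken from the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows with assumed column names; publishers are placeholders.
games = pd.DataFrame({
    "Year_of_Release": [2004, 2004, 2008, 2008, 2014, 2014],
    "Genre": ["Action", "Sports", "Action", "Racing", "Action", "Sports"],
    "Publisher": ["Publisher A", "Publisher B", "Publisher A",
                  "Publisher B", "Publisher A", "Publisher A"],
    "Platform": ["PS2", "PS2", "PS2", "PS2", "PS4", "PS4"],
    "NA_Sales": [2.0, 1.0, 3.0, 1.5, 2.5, 0.5],
    "EU_Sales": [1.0, 0.5, 2.0, 1.0, 2.0, 0.5],
})

# By year: total sales per region, to compare trends across locations.
sales_by_year = games.groupby("Year_of_Release")[["NA_Sales", "EU_Sales"]].sum()

# By publisher: who are the prominent players (ranked by sales in one region).
top_publishers = games.groupby("Publisher")["NA_Sales"].sum().sort_values(ascending=False)

# By genre and platform: number of releases per genre, mean sales per platform.
games_per_genre = games["Genre"].value_counts()
mean_sales_by_platform = games.groupby("Platform")["NA_Sales"].mean()

# One line chart for the yearly trend, one bar chart for the genre counts.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sales_by_year.plot(ax=axes[0], marker="o", title="Sales by year and region")
games_per_genre.plot(kind="bar", ax=axes[1], title="Games released per genre")
fig.tight_layout()
```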
Target: To study the different models suited to the data patterns observed and experiment with those models on my dataset. Also, look into implementation tools (Google Colab?).
Planned tasks:
- Implement different models
- Compare results
- Regression vs. classification (need to read more about it)
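Whether this is a regression or a classification problem depends on how the rating is represented: a numeric score calls for regression, a category label for classification. The sketch below compares a few scikit-learn models on synthetic data for both cases. The data, the binning thresholds, and the choice of models are assumptions made for illustration, not results from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Numeric target (e.g. a score): a regression problem.
score = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Categorical target (e.g. a rating class): a classification problem.
# Here the classes 0, 1, 2 come from binning the numeric score.
rating_class = np.digitize(score, bins=[-1.0, 1.0])

# Regression is scored with R^2, classification with accuracy,
# both estimated by 5-fold cross-validation.
regression_r2 = cross_val_score(LinearRegression(), X, score, cv=5, scoring="r2")

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
classification_accuracy = {
    name: cross_val_score(model, X, rating_class, cv=5, scoring="accuracy").mean()
    for name, model in classifiers.items()
}

print("Regression mean R^2:", regression_r2.mean())
for name, accuracy in classification_accuracy.items():
    print(f"{name} mean accuracy: {accuracy:.3f}")
```

Using the same cross-validation setup for every model keeps the comparison fair, since each model is scored on the same kind of held-out folds.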
Target: To tune each model, since every model has parameters that affect its performance.
“THE BRAIN of THE MODEL”
Planned tasks:
- Optimize performance through parameter tuning
- Interpret results
  - What does the pattern convey? Can a prediction be made?
  - How accurate is the model?
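One common way to do the parameter tuning described above is a grid search with cross-validation. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data. The model, the parameter grid, and the metrics are assumptions chosen for illustration and would need to be adapted to the models actually built.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the real dataset (an assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate parameter values; the grid is deliberately tiny to keep the run short.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# Try every combination with 3-fold cross-validation on the training data only.
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)

# Evaluate the best combination on the held-out test data.
best_model = search.best_estimator_
predictions = best_model.predict(X_test)
test_r2 = r2_score(y_test, predictions)
test_mae = mean_absolute_error(y_test, predictions)

print("Best parameters:", search.best_params_)
print("Test R^2:", test_r2)
print("Test mean absolute error:", test_mae)
```

The test set is touched only once, after tuning, so the reported accuracy is not inflated by the search itself.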
Planned tasks:
- Make a video for submission
- Prepare a report of the results
Dataset: Kaggle
Video Link: Panopto
How to run the project? Install the requirements specified in requirements.txt (for example, with pip install -r requirements.txt) and run the notebook using Jupyter.
If you get an error about numeric_only=True, it is caused by a different version of pandas. Please install the requirements from requirements.txt. Alternatively, remove the numeric_only=True parameter.
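If removing the parameter causes a different error because of text columns, a workaround that does not depend on the pandas version is to select the numeric columns explicitly before computing the statistic. The sketch below assumes the failing call is a correlation such as `df.corr(numeric_only=True)`. That call and the column names are assumptions, since the notebook code is not shown here.

```python
import pandas as pd

# Made-up sample with one text column and two numeric columns (assumed names).
df = pd.DataFrame({
    "Genre": ["Action", "Sports", "Racing", "Action"],
    "Critic_Score": [90.0, 70.0, 80.0, 85.0],
    "Global_Sales": [8.0, 2.0, 4.0, 6.0],
})

# Keep only the numeric columns, then compute the correlation matrix.
# This avoids passing numeric_only, which older pandas releases reject.
numeric_df = df.select_dtypes(include="number")
corr = numeric_df.corr()

print(corr)
```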