- LinkedIn Profile
- Project Overview
- How will this project help?
- Resources Used
- Exploratory Data Analysis (EDA) and Data Cleaning
- Feature Engineering
- Model Building and Evaluation
- Model Prediction
For any queries about this project, contact me.
Link: https://www.linkedin.com/in/anil-l-b023631b6/
• Created a machine learning model that estimates the salary of a data scientist from features such as rating, company_founded, etc.
• Engineered features from the text of each job description to quantify the value companies put on Python, Excel, Tableau, and SQL
• This project helps data scientists and analysts negotiate their income for an existing or a new job
• Packages: pandas, numpy, sklearn, matplotlib, seaborn.
• Dataset by Ken Jee: https://github.com/PlayingNumbers/ds_salary_proj
• Removed unwanted columns: 'Unnamed: 0'
• Plotted bar graphs and count plots for numerical and categorical features, respectively, for EDA
• Numerical Features (Rating, Founded): Replaced NaN or -1 values with the mean or median, depending on their distribution
• Categorical Features: Replaced NaN or -1 values with 'Other'/'Unknown' category
• Removed unwanted alphabetic and special characters from the Salary feature
• Converted the Salary column to a single scale, i.e. from (per hour, per annum, employer provided salary) to (per annum)
• Created new features from existing features, e.g. job_in_headquaters from (job_location, headquarters), etc.
• Trimmed columns, i.e. trimmed features having more than 10 categories to reduce the dimensionality
• Handled ordinal and nominal categorical features
• Selected features using information gain (mutual_info_regression) and a correlation matrix
• Scaled features using StandardScaler
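
The salary normalization and job-description keyword features described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `annual_salary_range` and `skill_flags` are hypothetical, the raw salary string formats are assumed from typical job-listing data, and the conversion of hourly pay assumes 2080 working hours per year.

```python
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours/week * 52 weeks
SKILLS = ("python", "excel", "tableau", "sql")


def annual_salary_range(raw):
    """Return (low, high) in thousands per year, or None if no range is found.

    Hypothetical helper; assumes strings such as "$53K-$91K (Glassdoor est.)",
    "Employer Provided Salary:$80K-$100K" or "$17-$24 Per Hour".
    """
    text = raw.lower()
    hourly = "per hour" in text
    # Keep only the numbers; labels, "$" and "K" are discarded.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if len(numbers) < 2:
        return None  # e.g. a missing value encoded as "-1"
    low, high = numbers[0], numbers[1]
    if hourly:
        # Hourly figures are in dollars; convert to thousands per year.
        low = low * HOURS_PER_YEAR / 1000
        high = high * HOURS_PER_YEAR / 1000
    return low, high


def skill_flags(description):
    """Return a 0/1 flag per skill, matching whole words only."""
    text = description.lower()
    return {
        skill: int(re.search(rf"\b{skill}\b", text) is not None)
        for skill in SKILLS
    }


print(annual_salary_range("$53K-$91K (Glassdoor est.)"))
print(annual_salary_range("$17-$24 Per Hour"))
print(skill_flags("Excellent SQL and Python skills required"))
```

Matching on word boundaries keeps "excellent" from being counted as a mention of Excel, which a plain substring check would get wrong.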
Metric: Negative Root Mean Squared Error (NRMSE); values closer to zero indicate a better fit
• Multiple Linear Regression: -27.523
• Lasso Regression: -27.993
• Random Forest: -17.637
• Gradient Boosting: -24.429
• Voting (Random Forest + Gradient Boosting): -19.136
Note: Evaluation scores are obtained using cross-validation.
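
As a rough sketch of how cross-validated negative RMSE scores like those above could be produced with scikit-learn, the example below compares the same model families on synthetic data. The data, hyperparameters, and 5-fold setting are assumptions for illustration, so the printed numbers will not match the scores listed above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned salary features (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

models = {
    # Scaling sits inside the pipeline so it is refit on each training fold.
    "Multiple Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Lasso Regression": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Voting (Random Forest + Gradient Boosting)": VotingRegressor(
        [
            ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
            ("gb", GradientBoostingRegressor(random_state=0)),
        ]
    ),
}

scores = {}
for name, model in models.items():
    # scikit-learn negates RMSE so that a higher score is always better.
    fold_scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

Because the scorer returns the negated error, the best model is the one with the highest (least negative) mean score.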