Data-Scientist-Salary-Prediction

Table of Contents

  • LinkedIn Profile
  • Project Overview
  • How will this project help?
  • Resources Used
  • Exploratory Data Analysis (EDA) and Data Cleaning
  • Feature Engineering
  • Model Building and Evaluation
  • Model Prediction

LinkedIn Profile

For any queries about this project, contact me on LinkedIn.

Link : https://www.linkedin.com/in/anil-l-b023631b6/

Dataset Python 3.6 library

Project Overview

• Created a machine learning model that estimates the salary of a data scientist from features such as rating and company_founded.
• Engineered features from the text of each job description to quantify the value companies put on Python, Excel, Tableau and SQL.

How will this project help?

• This project helps data scientists and analysts negotiate their pay for an existing or a new job.

Resources Used

• Packages: pandas, numpy, sklearn, matplotlib, seaborn.
• Dataset by Ken Jee: https://github.com/PlayingNumbers/ds_salary_proj

Exploratory Data Analysis (EDA) and Data Cleaning

• Removed the unwanted column 'Unnamed: 0'.
• Plotted bar graphs and count plots for numerical and categorical features, respectively, for EDA.
• Numerical features (Rating, Founded): replaced NaN or -1 values with the mean or median, depending on each feature's distribution.
• Categorical features: replaced NaN or -1 values with an 'Other'/'Unknown' category.
• Removed unwanted letters and special characters from the Salary feature.
• Converted the Salary column to a single scale: per-hour, per-annum and employer-provided salaries were all expressed per annum.
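The cleaning steps above can be sketched in plain Python so the logic is easy to follow. This is a minimal illustration, not the project's actual code: the function names, the raw salary string formats (such as "$53K-$91K (Glassdoor est.)") and the figure of 2,000 working hours per year are all assumptions made for the example.

```python
import re
import statistics

# Assumption: 40 hours/week * 50 weeks, used to annualise hourly pay.
HOURS_PER_YEAR = 2000


def parse_salary_to_annual_k(raw):
    """Return (low, high) annual salary in thousands of dollars, or None if missing."""
    if raw is None or raw.strip() in ("", "-1"):
        return None
    text = raw.lower()
    hourly = "per hour" in text
    # Drop notes in parentheses, then keep only the numbers in the range.
    text = re.sub(r"\(.*?\)", "", text)
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return None
    low, high = min(numbers), max(numbers)
    if hourly:
        # Hourly dollars -> annual thousands of dollars.
        low = low * HOURS_PER_YEAR / 1000
        high = high * HOURS_PER_YEAR / 1000
    return low, high


def impute_missing(values, strategy="median", missing=(-1, None)):
    """Replace missing markers (-1, None, NaN) with the mean or median of the rest."""
    def is_missing(v):
        return v in missing or v != v  # v != v is True only for NaN

    observed = [v for v in values if not is_missing(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if is_missing(v) else v for v in values]


print(parse_salary_to_annual_k("$53K-$91K (Glassdoor est.)"))
print(parse_salary_to_annual_k("Employer Provided Salary:$80K-$90K"))
print(parse_salary_to_annual_k("$17-$24 Per Hour"))
print(impute_missing([1990, -1, 2010, 2000], strategy="median"))
```

In a pandas workflow the same parser could be applied to the Salary column with `Series.apply`. The median is usually the safer fill value for skewed features, while the mean suits roughly symmetric ones.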

Feature Engineering

• Created new features from existing ones, e.g. job_in_headquaters from (job_location, headquarters).
• Trimmed columns: features with more than 10 categories were trimmed to reduce dimensionality.
• Handled ordinal and nominal categorical features.
• Selected features using information gain (mutual_info_regression) and a correlation matrix.
• Scaled features using StandardScaler.
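The first two feature-engineering steps might look roughly like the following in pandas. The sample rows, the `sector` column and the `trim_categories` helper are hypothetical, and grouping rare categories under 'Other' is one plausible reading of "trimming"; the feature name follows the spelling used in this README.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns.
df = pd.DataFrame({
    "job_location": ["New York, NY", "Austin, TX", "Boston, MA", "Austin, TX"],
    "headquarters": ["New York, NY", "Seattle, WA", "Boston, MA", "Austin, TX"],
    "sector": ["Finance", "Tech", "Health", "Tech"],
})

# 1 if the job is located in the same place as the company headquarters, else 0.
df["job_in_headquaters"] = (df["job_location"] == df["headquarters"]).astype(int)


def trim_categories(series, top_n=10, other="Other"):
    """Keep the top_n most frequent categories and group the rest under `other`."""
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other)


# top_n=1 only because the toy data is tiny; the README uses a limit of 10.
df["sector_trimmed"] = trim_categories(df["sector"], top_n=1)

print(df[["job_in_headquaters", "sector_trimmed"]])
```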

(Images: information gain scores; correlation matrix)
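As a rough sketch of the selection and scaling steps, the snippet below ranks features with scikit-learn's `mutual_info_regression`, builds a correlation matrix and standardises the features with `StandardScaler`. The data is synthetic and the `noise` column is made up for illustration; only the two library calls are taken from this README.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the target depends on `rating` only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "rating": rng.uniform(1, 5, n),
    "company_founded": rng.integers(1900, 2020, n).astype(float),
    "noise": rng.normal(size=n),
})
y = 20 * X["rating"] + rng.normal(scale=1.0, size=n)

# Information gain (mutual information) between each feature and the target.
info_gain = pd.Series(
    mutual_info_regression(X, y, random_state=0), index=X.columns
).sort_values(ascending=False)
print(info_gain)

# Pairwise Pearson correlations, including the target column.
corr = X.assign(target=y).corr()
print(corr["target"])

# Standardise each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
```

When a separate test set is used, the scaler is normally fitted on the training split only and then applied to the test split, so that no test-set statistics leak into training.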

Model Building and Evaluation

Metric: Negative Root Mean Squared Error (NRMSE); scores closer to 0 are better.
• Multiple Linear Regression: -27.523
• Lasso Regression: -27.993
• Random Forest: -17.637
• Gradient Boosting: -24.429
• Voting (Random Forest + Gradient Boosting): -19.136

Note: Evaluation scores were obtained using cross-validation.
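A comparison like the one above could be produced with scikit-learn's `cross_val_score` and the `neg_root_mean_squared_error` scorer, which negates RMSE so that higher scores are better. The sketch below is illustrative only: it uses synthetic data from `make_regression`, and the fold count and hyperparameters are assumptions, so its scores will not match the ones reported here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data; the project uses the cleaned salary dataset.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

models = {
    "Multiple Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Voting (Random Forest + Gradient Boosting)": VotingRegressor([
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
    ]),
}

scores = {}
for name, model in models.items():
    # Each fold returns -RMSE; the mean over folds is the reported score.
    fold_scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```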

Model Prediction

(Image: prediction)