Employee Churn Analysis, Feature Importance and Prediction Using Ensembling Model
Employee churn is the overall turnover in an organization's staff as existing employees leave and new ones are hired. Put simply, churn occurs when employees leave the organization. More generally, churn occurs whenever a member of a population leaves that population.
This project analyzes and predicts employee churn, using exploratory visualization of key employee indicators. It covers the following topics:
- Data Loading and Pre-processing
- Exploratory Visualization
- Understanding Features and their importance
- Building the Ensemble Model
- Evaluating Model Performance
Ensemble Method:
Ensembling is a machine learning technique that combines the decisions of multiple models to improve overall performance. There are various ensembling techniques; in our case we used the Bagging (Bootstrap Aggregating) ensemble method. In this method, the results from different models, each trained on a bootstrap sample of the dataset, are aggregated to yield the final prediction.
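As a rough illustration of bagging, the sketch below assumes scikit-learn and a synthetic binary dataset standing in for the churn data; neither the library nor the settings are taken from the project itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (assumption: binary target).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each of the 25 trees is fit on a bootstrap sample (drawn with replacement)
# of the training set; their outputs are aggregated into one prediction.
bagger = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=25, bootstrap=True, random_state=0
)
bagger.fit(X_train, y_train)

bagging_accuracy = bagger.score(X_test, y_test)
print(f"Bagging test accuracy: {bagging_accuracy:.3f}")
```

Decision trees are used here only because they are a common base model for bagging; any classifier could take their place.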
Label Encoding:
Label encoding is a technique for handling categorical values. It converts each label into a numeric code so that the data is in machine-readable form.
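A minimal sketch of label encoding, assuming scikit-learn's `LabelEncoder`; the category values below are hypothetical and not taken from the project's dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a categorical column such as Business Travel.
business_travel = ["Rarely", "Never", "Frequently", "Rarely"]

encoder = LabelEncoder()
codes = encoder.fit_transform(business_travel)

# Classes are sorted alphabetically, then numbered from 0.
print(list(encoder.classes_))  # ['Frequently', 'Never', 'Rarely']
print(codes.tolist())          # [2, 1, 0, 2]

# The mapping is reversible.
labels_back = encoder.inverse_transform(codes)
print(list(labels_back))
```

Note that the integer codes imply an ordering that the categories may not have, which can matter for models that treat the codes as magnitudes.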
Grid Search:
Grid search is used to find the optimal hyperparameters of the model: it evaluates every combination in a specified grid of hyperparameter values and selects the one that yields the most accurate predictions.
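The following sketch shows one way a grid search might look, assuming scikit-learn's `GridSearchCV`, a random forest, and synthetic data. The grid values and the choice of F1 as the scoring metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every combination in this grid (2 x 2 = 4) is scored with 3-fold
# cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=3
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("Best parameters:", best_params)
print(f"Best cross-validated F1: {best_score:.3f}")
```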
F-Score:
The F score, also called the F1 score or F measure, is a measure of a test’s accuracy. The F1 score is the harmonic mean of the test’s precision and recall; the more general F-beta score is a weighted harmonic mean of the two.
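To make the definition concrete, here is a small pure-Python sketch that computes the score from precision and recall. The function name `f_beta` and the toy labels are made up for illustration; `beta=1` gives the F1 score.

```python
def f_beta(y_true, y_pred, beta=1.0):
    """Weighted harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy labels: 1 = employee left, 0 = employee stayed.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]

# Here precision = 2/3 and recall = 1/2, so F1 = 4/7.
f1 = f_beta(y_true, y_pred)
f2 = f_beta(y_true, y_pred, beta=2.0)  # beta > 1 weights recall more heavily
print(f"F1 = {f1:.3f}, F2 = {f2:.3f}")
```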
Below are the key indicators we have considered to analyze and predict employee churn:
- Age
- Daily Rate
- Office Distance from Home
- Hourly Rate
- Monthly Income
- Monthly Rate
- Number of Companies Worked With
- Percent Salary Hike
- Total Working Years
- Training Times Last Year
- Experience Year at Company
- Years In Current Role
- Years Since Last Promotion
- Years with Current Manager
- Business Travel
- Education
- Education Level
- Environment Satisfaction
- Gender
- Job Involvement
- Job Level
- Job Role
- Job Satisfaction
- Marital Status
- Overtime
- Performance Rating
- Relationship Satisfaction
- Stock Option Level
- Work Life Balance
Feature (variable) importance was calculated using random forest and gradient boosting machine learning models.
Here is an interesting bar graph of feature importance created using the RF model:
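Importances of the kind plotted in such a graph can be read from fitted tree-based models. The sketch below assumes scikit-learn's `feature_importances_` attribute and uses synthetic data; the five feature names are borrowed from the indicator list purely as labels and do not reflect real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Labels only: the data below is synthetic, so the ranking is not meaningful.
feature_names = ["Age", "Monthly Income", "Total Working Years",
                 "Job Level", "Percent Salary Hike"]

X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

importances = {}
for name, model in models.items():
    model.fit(X, y)
    # Pair each feature with its importance and sort from most to least.
    ranked = sorted(
        zip(feature_names, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    importances[name] = ranked
    print(name)
    for feature, score in ranked:
        print(f"  {feature:<22} {score:.3f}")
```

The ranked pairs can be passed to any plotting library to draw a bar graph.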
Below are the ML models used for the predictions:
- Logistic Regression Classifier
- Random Forest Classifier
- Gradient Boosting
- Neural Net
Later, the Bootstrap Aggregating ensemble method was used to get the final prediction.
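One possible reading of this step is sketched below: each of the four model types is trained on its own bootstrap sample, and the predicted probabilities are averaged. The project does not spell out how the results were aggregated, so the averaging rule, the 0.5 threshold, the scikit-learn estimators, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The four model types listed above; scaling helps the linear and neural models.
base_models = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
    make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
]

rng = np.random.default_rng(0)
probabilities = []
for model in base_models:
    # Bootstrap: sample training rows with replacement, same size as the original.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    model.fit(X_train[idx], y_train[idx])
    probabilities.append(model.predict_proba(X_test)[:, 1])

# Aggregate: average the positive-class probabilities, then threshold.
avg_probability = np.mean(probabilities, axis=0)
final_prediction = (avg_probability >= 0.5).astype(int)

ensemble_f1 = f1_score(y_test, final_prediction)
print(f"Ensemble F1 on held-out data: {ensemble_f1:.3f}")
```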