https://github.com/ShehanaIqbal/pumpItUpChallenge-ML
Data were visualized and analysed with the following techniques (a brief sketch is given after the list).
- Listing data types of all columns
- Printing the head of dataframes
- Listing the value counts of columns
- Listing unique values of columns
- Plotting bar plots with each label in a unique colour
- Printing crosstabs against each other
- Plotting correlation maps including new features
- Plotting the confusion matrix
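A minimal sketch of these exploration steps, assuming the training values and labels live in hypothetical files `train_values.csv` and `train_labels.csv` (the column names are from the Pump It Up dataset):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('train_values.csv')      # assumed file name
labels = pd.read_csv('train_labels.csv')  # assumed file name

print(df.dtypes)                      # data types of all columns
print(df.head())                      # head of the dataframe
print(df['quantity'].value_counts())  # value counts of a column
print(df['source'].unique())          # unique values of a column

# bar plot of the label distribution, one colour per label
labels['status_group'].value_counts().plot(kind='bar', color=['green', 'orange', 'red'])
plt.show()

# crosstab of two columns that may carry the same information
print(pd.crosstab(df['quantity'], df['quantity_group']))

# correlation map over the numeric (and newly engineered) columns
sns.heatmap(df.select_dtypes('number').corr(), cmap='coolwarm')
plt.show()

# the confusion matrix is plotted later, once a model has been trained
```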
- NaN values of columns with data type 'float' were filled with the mean of the respective columns.
- NaN values of the other columns, except 'scheme_name', were filled with the most frequently occurring value of the respective columns.
- The column 'scheme_name' was removed since 35258 of the data rows did not contain a value for it.
- The 'id' column was removed since it makes no contribution towards classification. These steps are sketched below.
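A sketch of the imputation and drop steps, continuing with the dataframe `df` loaded in the sketch above:

```python
# fill float columns with the column mean
for col in df.select_dtypes(include='float').columns:
    df[col] = df[col].fillna(df[col].mean())

# fill the remaining columns (except 'scheme_name') with their most frequent value
for col in df.columns:
    if col != 'scheme_name' and df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode()[0])

# drop 'scheme_name' (mostly missing) and 'id' (no predictive value)
df = df.drop(columns=['scheme_name', 'id'])
```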
The following columns were found to carry redundant information when compared against related columns using crosstabs, and were therefore removed from the dataframe (see the sketch after the list).
- 'waterpoint_type'
- 'recorded_by'
- 'wpt_name'
- 'subvillage'
- 'date_recorded'
- 'scheme_name'
- 'installer'
- 'quantity_group'
- 'source_type'
- 'payment_type'
- 'region'
- 'extraction_type_group'
- 'waterpoint_type_group'
- 'quality_group'
- 'management_group'
- 'funder'
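A sketch of the redundancy check and the resulting drops; 'quantity' vs. 'quantity_group' is used as the illustrative pair:

```python
# if each value of one column maps to a single value of the other,
# the pair carries the same information and one of the two can be dropped
print(pd.crosstab(df['quantity'], df['quantity_group']))

redundant = ['waterpoint_type', 'recorded_by', 'wpt_name', 'subvillage',
             'date_recorded', 'scheme_name', 'installer', 'quantity_group',
             'source_type', 'payment_type', 'region', 'extraction_type_group',
             'waterpoint_type_group', 'quality_group', 'management_group', 'funder']
df = df.drop(columns=redundant, errors='ignore')  # 'scheme_name' was already dropped
```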
The following new features were created (a few of them are sketched below).
- 'is_population_high' - if population is greater than the population mean value
- 'is_source_spring' - if the source is Spring
- 'is_source_shallow_well' - if the source is shallow well
- 'is_insufficient_and_soft' - if quantity is insufficient and water_quality is soft
- 'is_enough_and_soft' - if the quantity is enough and water_quality is soft
- 'water_quality_is_salty' - if water_quality is salty
- 'water_quality_is_unknown' - if water_quality is unknown
- 'water_quality_is_soft' - if water_quality is soft
- 'amount_tsh_range' - grouping rows based on the value of amount_tsh
- 'gps_height_is_zero' - if gps_height is zero
- 'gps_height_less' - if gps_height is less than zero
- 'gps_height_more' - if gps_height is greater than zero
- 'is_num_private_zero' - if num_private is zero
- 'decade' - replacing construction year by the decade that it belongs to
- 'is_old_construction'- if construction is old (60s, 70s, 90s)
- 'is_new_construction'- if construction belongs to 2000-2009
- 'is_quantity_dry' - if quantity is dry
- 'is_quantity_enough' - if quantity is enough
- 'is_quantity_insuficient' - if quantity is insufficient
- 'is_funder_Government_Of_Tanzania' - if funder is Government_Of_Tanzania
- 'is_funder_Danida' - if funder is Danida
- 'is_funder_rwwsp' - if funder is Rwwsp
- 'is_extraction_type_class_gravity' - if extraction type class is gravity
- 'is_extraction_type_class_other' - if extraction type class is other
- 'is_never_pay' - if payment is 'never pay'
- During new feature creation, a few important values from some columns were one-hot encoded separately (without encoding the whole column), since those columns contain many other, less informative values that do not need to be one-hot encoded.
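A sketch of a handful of these features; the remaining indicator features follow the same pattern, and the bin edges for 'amount_tsh_range' are illustrative rather than the ones actually used:

```python
import numpy as np
import pandas as pd

# df is the dataframe produced by the earlier preprocessing steps
df['is_population_high'] = (df['population'] > df['population'].mean()).astype(int)
df['is_source_spring'] = (df['source'] == 'spring').astype(int)
df['water_quality_is_salty'] = (df['water_quality'] == 'salty').astype(int)
df['is_quantity_dry'] = (df['quantity'] == 'dry').astype(int)
df['gps_height_is_zero'] = (df['gps_height'] == 0).astype(int)
df['is_num_private_zero'] = (df['num_private'] == 0).astype(int)

# replace the construction year by the decade it belongs to
df['decade'] = (df['construction_year'] // 10) * 10
df['is_old_construction'] = df['decade'].isin([1960, 1970, 1990]).astype(int)
df['is_new_construction'] = (df['decade'] == 2000).astype(int)

# group rows into coarse ranges of amount_tsh (illustrative bin edges)
df['amount_tsh_range'] = pd.cut(df['amount_tsh'], bins=[-1, 0, 100, 1000, np.inf],
                                labels=[0, 1, 2, 3]).astype(int)
```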
- After feature creation, all columns except those of float data types were categorically encoded.
- Labels were also categorically encoded.
- To handle class imbalance, the training data were oversampled with SMOTE.
- Data were split in the ratio of 2:1 for training and testing purposes.
- Data were normalized using RobustScaler (the encoding, oversampling, splitting and scaling steps are sketched below).
- Feature selection was tried with RandomForestClassifier, and the model was trained with only the selected features. Since it performed worse than training with all features, feature selection was not used for the final model. (The commented-out code is available in model-PumpItUp.ipynb.)
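A sketch of these preprocessing steps, assuming `X` holds the engineered feature columns and `y` the labels (SMOTE comes from the imbalanced-learn package; the random seeds are placeholders):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE

# categorically encode every non-float column, and the labels
for col in X.columns:
    if X[col].dtype != 'float64':
        X[col] = X[col].astype('category').cat.codes
y = y.astype('category').cat.codes

# 2:1 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

# oversample only the training data to counter class imbalance
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# robust scaling, fitted on the training split and applied to both splits
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```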
The following algorithms were tried when choosing the best model.
- Random forest
- XGBoost
- KNN
Since XGBoost performed best among all the approaches tried, it was selected as the final model.
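A minimal training sketch with XGBoost; the hyperparameters below are placeholders, and the values actually used are in model-PumpItUp.ipynb:

```python
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=300, max_depth=10, learning_rate=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```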
- accuracy_score - Provides the accuracy of the model
- balanced_accuracy_score - Provides the balanced accuracy (the average of per-class recall)
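Both metrics can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

print('Accuracy         :', accuracy_score(y_test, y_pred))
print('Balanced accuracy:', balanced_accuracy_score(y_test, y_pred))
```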
Given below are the scores achieved by the last submitted version of the model.
Accuracy:
=========
TRAIN: 0.9754318322023442
TEST: 0.8614146601120957
Balanced Accuracy:
==================
TRAIN: 0.9754343630754786
TEST: 0.8613801396006276