Badge Source
- Business Problem
- Data Source
- Methods
- Tech Stack
- Quick glance at the Results
- Lessons learned and Recommendation
- Limitation and what can be Improved
- Run Locally
- Explore the notebook
- Report and Presentation
- Deployment on streamlit
- App deployed on streamlit
- Contribution
- License
Our airline has been receiving a fair amount of negative sentiment about its flight services. We want to identify the root causes behind this sentiment and increase overall passenger satisfaction with our brand. Over a certain period we surveyed our passengers, asking specific questions about their experience that may hint at what we can improve. To achieve this, we first need a machine learning model that accurately predicts a passenger's satisfaction level from their responses across several customer service categories. Our goal is to increase customer retention: we believe that satisfied customers are more likely to use our services again.
- Exploratory Data Analysis
- Multivariate Analysis
- Visualizations
- Modeling
- Reporting
- App Deployment
- R (Data Cleansing and Exploratory Analysis)
- Python (Machine Learning Modeling and App preparation)
- GitHub Pages (R Markdown Deployment onto Web)
- Microsoft Office (Reporting & Presentation)
- Streamlit (Interface for model)
Correlation Matrix between numeric features.
Confusion Matrix of Random Forest Classifier.
Random Forest Feature Importance Plot.
Top 3 models on the testing set (with default parameters)
Model | Accuracy | Sensitivity (Recall) | Specificity |
---|---|---|---|
Logistic Regression | 87.5% | 90.4% | 83.7% |
Random Forest | 96.5% | 98.2% | 94.2% |
Gradient Boosting | 95.4% | 97.1% | 93% |
- Final Model used: Random Forest Classifier
- Why choose Random Forest Classifier over the other models: Random Forest gave the best results not only in accuracy but also in sensitivity, specificity, and precision. Its precision was about 97.47%, and precision is generally a more informative metric than accuracy when the target has imbalanced classes. Our target variable, satisfaction level, had more unsatisfactory/neutral passengers than satisfactory ones in our surveys. Random Forest also provides feature importance, determined across the trees by which splits/nodes give the greatest overall decrease in the Gini index. This provides further insight into our passengers' views and what affects satisfaction level the most. However, Logistic Regression or Gradient Boosting would also have been sufficient for this analysis, since the differences in metric scores were not huge.
- Metric used: Specificity
- Why choose Specificity as a metric: Our response variable, satisfaction level, had imbalanced classes. This is a problem for machine learning algorithms because they cannot learn each class equally well, so our model might learn unsatisfactory/neutral passengers better since we had more observations of them. Because we want to identify satisfactory passengers accurately, we want the specificity score to be as high as possible. In the confusion matrix above, 1 represents satisfactory and 0 represents unsatisfactory/neutral passengers. We therefore want to maximize the lower-right cell of the confusion matrix, which counts the satisfactory passengers correctly predicted as satisfactory. Its share of all satisfactory passengers is the specificity when class 0 is taken as the positive class, and it is equivalent to the recall of the satisfactory class (class 1).
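The relationship between the confusion matrix cells and the metrics reported in the table can be sketched as follows. This is a minimal illustration, assuming class 0 (unsatisfactory/neutral) is treated as the positive class and that rows are actual classes and columns are predicted classes. The function name `binary_metrics` and the counts are hypothetical and are not the project's actual results.

```python
def binary_metrics(cm):
    """Compute summary metrics from a 2x2 confusion matrix.

    cm[i][j] is the number of passengers whose actual class is i
    and whose predicted class is j.
    Assumption: class 0 (unsatisfactory/neutral) is the positive class,
    so the lower-right cell cm[1][1] holds the true negatives, i.e.
    satisfactory passengers correctly predicted as satisfactory.
    """
    (tp, fn), (fp, tn) = cm  # row 0: actual class 0, row 1: actual class 1
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall of class 0
        "specificity": tn / (tn + fp),  # recall of class 1 (satisfactory)
        "precision": tp / (tp + fp),
    }


# Hypothetical counts, for illustration only.
example_cm = [[90, 10],
              [20, 80]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this convention, raising the lower-right count relative to the rest of its row is what raises specificity.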
- In this project I learned how to leverage feature importance from our Random Forest and Gradient Boosting models to determine what influences our response the most. It's important to note that a feature may influence the response variable either negatively or positively. Since the goal of this project is to increase satisfaction level, we want to identify not only the most important features but also the ones with a positive influence. For example, Cleanliness is one of the features, and logic suggests that as cleanliness decreases, so does satisfaction. Conversely, if we increase cleanliness we would expect customers to be more satisfied with their experience, so this feature has a positive influence. A negative influence means that increasing the feature decreases satisfaction level, or vice versa.
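Impurity-based feature importance tells us how much a feature matters but not in which direction, so the sign of the influence has to be checked separately. One simple way to do that is sketched below: compare the average feature value for satisfied and non-satisfied passengers. This is only an illustrative heuristic, not the method used in the project. The function name `influence_direction`, the `departure_delay` feature, and all values are hypothetical.

```python
from statistics import mean


def influence_direction(feature_values, satisfied_flags):
    """Label a feature's influence by comparing class means.

    Returns "positive" if satisfied passengers (flag 1) have a higher
    average feature value than the others (flag 0), "negative" if lower,
    and "none" if the averages are equal.
    """
    satisfied = [x for x, y in zip(feature_values, satisfied_flags) if y == 1]
    others = [x for x, y in zip(feature_values, satisfied_flags) if y == 0]
    diff = mean(satisfied) - mean(others)
    if diff > 0:
        return "positive"
    if diff < 0:
        return "negative"
    return "none"


# Hypothetical survey responses, for illustration only.
satisfied_flags = [1, 1, 1, 0, 0, 0]
cleanliness = [5, 4, 4, 2, 1, 3]            # rating from 1 to 5
departure_delay = [0, 5, 10, 60, 45, 30]    # minutes

cleanliness_direction = influence_direction(cleanliness, satisfied_flags)
delay_direction = influence_direction(departure_delay, satisfied_flags)
print("cleanliness:", cleanliness_direction)
print("departure_delay:", delay_direction)
```

A check like this only describes association in the survey data; it does not show that changing the feature would change satisfaction.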
- Some limitations of this project come from external factors that were not provided or not considered. For example, the data set did not include the ticket fare for each passenger. A passenger who paid a higher fare is likely to receive a higher level of service throughout their travel, which would increase their overall satisfaction with the airline.
- Another piece of information missing from our data was whether the passenger had companions. For example, did the passenger travel alone or with friends/family? Knowing this would provide further insight into why the passenger was traveling: for vacation, business, or an emergency. These scenarios might affect their airline experience, since external factors might already have contributed to their mood.
- A final question is whether a person's income level influences their satisfaction with the airline. Given a passenger's income, we could classify passengers into brackets such as lower-, middle-, and high-income groups. Since money generally gives access to better features and services, their satisfaction level might be more easily met.
First, open your command line or terminal and go to the directory where you want to save the project.
git clone https://github.com/luisosorio3214/Airline-Satisfaction-Prediction-App.git
cd Airline-Satisfaction-Prediction-App
python -m venv "env_name"
For Windows users: env_name\Scripts\activate
For Mac users: source env_name/bin/activate
pip install -r requirements.txt
streamlit run app.py
If you are having issues with streamlit, please follow this tutorial on how to set up streamlit.
To explore the R notebook file click here.
The Report and Presentation were done collaboratively with other students at Long Beach State University. I am grateful for the work they contributed.
To read the Full Report of the Analysis click here.
To see the Full Presentation given click here.
To deploy this project on Streamlit share, follow these steps:
- Make sure you have a GitHub repository with the full project files, including the requirements.txt file
- Go to Streamlit share
- Log in with GitHub, Google, etc.
- Click on the new button
- Select the GitHub repo, the branch, and the Python file containing the Streamlit code
- Click Save and Deploy
Video to gif tool
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change or contribute.
MIT License
Copyright (c) 2022 Stern Semasuka
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Learn more about MIT license