
eCommerce Business Trends

Group Members

Elaine Ng - Peer reviews on code (Square role), Branches: Elaine, technology.md, Machine_Learning, readme_wk2, readme_wk3, .gitignore, requirements.txt

Luz Maria Cabral - Dashboard (Circle role), Branches: Luz, readme_wk2, readme_wk3

Manupriya Sharma - Quality Assurance and test the code (X role), Branches: Manupriya, Database, readme_wk2, readme_wk3

Corinne Hume - Draft presentation (Triangle role), Branches: Corinne, visuals_wk2, readme_wk2, readme_wk3, presentation_wk3

Presentation

Link to Google Slides

Project

Selected Topic: Brazilian eCommerce Business Trends

In this project, we will analyze customer orders placed between 2016 and 2018 with sellers on the Brazilian eCommerce platform Olist Store. From this data, we will first determine the number of customers by geographic region (looking at zip code prefix, city, and state), most popular products, number of repeat customers and purchases, and total purchases by date. After data cleaning and initial analysis, we will use machine learning to make predictions on customer behavior, providing sellers on Olist the opportunity to increase sales and their customer base.

Why we selected our topic:

We selected this topic because of the analysis options available from eCommerce data, with information available on customers, sellers, products, and geographic regions. We would like to know whether consumer behavior differs by customers' geographic location and whether future purchases could be predicted using geolocation data. We would also like to know which products are likely to be purchased depending on the time of year. Our ultimate goal is to predict consumer behavior from this dataset using machine learning.

Description of our source of data:

We chose the Brazilian-eCommerce dataset from Kaggle for our analysis. This dataset contains approximately 100,000 customer orders, along with corresponding files on product information and English translations of product categories originally in Portuguese. Seller names in this dataset were anonymized and replaced with Game of Thrones House names. Six files from the original Kaggle dataset were chosen for further analysis: geolocation dataset, olist_customers_dataset, olist_product_dataset, order_item_dataset, olist_orders_dataset, and product_category_name_translation.

Data Source: https://www.kaggle.com/olistbr/brazilian-ecommerce

Questions we hope to answer with our data:

With this data, we hope to answer...

  • What is the ordering behavior of consumers in each geographical region?
  • What types of products would a company expect customers to buy more of depending on the season?
  • Can we analyze reviews and develop a predictive model based on the given parameters?

Database

We used Postgres to create tables in SQL. The ERD below shows connectivity among 9 tables.

(ERD exported from QuickDBD)

The description of these tables is as follows:

  1. olist_orders_dataset: This table is connected to four other tables and links together all the details related to an order.
  2. olist_order_items_dataset: Contains the details of each purchased item, such as shipping date, price, and so on.
  3. olist_order_reviews_dataset: Contains the reviews posted by customers on the products they purchased.
  4. olist_products_dataset: Contains details related to a product, such as its ID, category name, and measurements.
  5. olist_order_payments_dataset: Contains the payment details associated with a particular order.
  6. olist_customers_dataset: Details the firm's customer base.
  7. olist_geolocation_dataset: Contains geographical information for both sellers and customers.
  8. olist_sellers_dataset: Contains information on all the sellers registered with the firm.
  9. olist_product_category_name_translation: Maps product category names to their English translations and connects to the products table.

We cleaned all these tables in a Jupyter Notebook and imported them into a PostgreSQL 11 server, where they were joined into one large table for further analysis. SQL inner joins were used to connect all the tables.
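As a rough illustration of that workflow (the cleaning steps and the connection string below are simplified assumptions; the actual notebooks are in the repo):

```python
# Minimal sketch: clean one raw Olist CSV in pandas and load it into Postgres.
import pandas as pd
from sqlalchemy import create_engine

orders = pd.read_csv("olist_orders_dataset.csv")
orders = orders.drop_duplicates()
orders = orders.dropna(subset=["order_id", "customer_id"])  # keep rows with key fields

engine = create_engine("postgresql://user:password@localhost:5432/ecommerce")
orders.to_sql("olist_orders_dataset", engine, if_exists="replace", index=False)
```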

SQL

Using RDS on AWS with Jupyter notebooks

Step 1: Setting up the Database

We used an AWS free tier template to create our RDS database on AWS. We learned that free templates are only available for Postgres 12 and higher versions.

(Screenshot: AWS RDS database setup)

We made our database publicly accessible so it can be reached from outside the default VPC.

Step 2: Create a Postgres server for the AWS database

Once the database was created on AWS, we created a server named final_project in Postgres.

(Screenshot: creating the final_project server)

We used the schema we created before to create all the tables again and then join them using inner joins.

(Screenshot: SQL join query)

The final joined table was imported into a Jupyter notebook; it has a total of 34 columns and 91,596 rows.

(Screenshot: DataFrame info for the joined table)
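For illustration, a condensed version of that join and import (only two of the nine tables are shown; the real query joins all of them the same way):

```python
# Sketch: INNER JOIN two of the cleaned tables and pull the result into pandas.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/ecommerce")

query = """
SELECT o.*, i.price, i.freight_value
FROM olist_orders_dataset AS o
INNER JOIN olist_order_items_dataset AS i
    ON o.order_id = i.order_id;
"""
final_df = pd.read_sql(query, engine)
final_df.info()  # the full joined table has 34 columns and 91,596 rows
```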

Step 3: Simplifying our data (Machine Learning)

We created a profile of our data to understand the relationships between the various features. We found that many features did not affect the review score, so we dropped those columns: "customer_city", "customer_state", "geolocation_lat", "geolocation_lng", "customer_id", "order_purchase_timestamp", "order_approved_at", "order_delivered_carrier_date", and "payment_sequential".

(Screenshot: data profiling report)
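A minimal sketch of that column drop (the names are the ones listed above):

```python
# Drop features that the profiling report suggested have little effect on review_score.
cols_to_drop = [
    "customer_city", "customer_state", "geolocation_lat", "geolocation_lng",
    "customer_id", "order_purchase_timestamp", "order_approved_at",
    "order_delivered_carrier_date", "payment_sequential",
]
final_df = final_df.drop(columns=cols_to_drop)
```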

We confirmed our results by creating a correlation matrix:

(Screenshot: correlation matrix)
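One way such a matrix can be produced (using seaborn is an assumption; the screenshot above may have been generated differently):

```python
# Correlation matrix of the numeric features, visualised as a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

corr = final_df.corr(numeric_only=True)
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlation matrix")
plt.show()
```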

Finally, we decided to keep the following columns in our data:

  • zipcode
  • order_status
  • price
  • freight_value
  • review_score
  • payment_type
  • payment_value
  • product_id
  • product_photos_qty
  • product_category
  • seller_zip
  • seller_state
  • time_order_to_delivery
  • time_estimate_to_delivery

Step 4: Feature Engineering (Machine Learning)

We used the following methods to encode the categorical variables:

  • lambda function
  • label encoder

(Screenshots: feature engineering code)
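A sketch of those two approaches (which columns get which treatment is an assumption):

```python
# Encode categorical variables: a lambda for a simple binary mapping,
# LabelEncoder for multi-valued string columns.
from sklearn.preprocessing import LabelEncoder

# lambda function: e.g. map order_status to 1 (delivered) / 0 (anything else)
final_df["order_status"] = final_df["order_status"].apply(lambda s: 1 if s == "delivered" else 0)

# label encoder: turn string categories into integer codes
le = LabelEncoder()
for col in ["payment_type", "product_category", "seller_state"]:
    final_df[col] = le.fit_transform(final_df[col])
```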

Our final data for machine learning has 13 columns:

(Screenshot: final DataFrame)

Step 5: Exporting to CSV

As a first step toward connecting with the AWS database, we exported our DataFrame to a CSV file.

(Screenshot: DataFrame exported to CSV)

Step 6: Setting up a config file

Once psycopg2 was imported, we created a config.py file to store the details for accessing our database. To connect, we needed the following details:

  • Endpoint
  • Port
  • Name
  • User’s Name
  • User’s Password

Step 7: Connecting & Creating a table

After importing psycopg2 and our config file, we created a function that connects to the database and sets up a cursor. This function uses our credentials from our config.py file to create the conn_string, and uses the conn_string to create the connection to our database hosted by AWS.

(Screenshot: database connection function)
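A minimal sketch of that helper function (the variable names inside config.py are assumptions based on the list in Step 6):

```python
# Connect to the AWS-hosted Postgres database using the credentials in config.py.
# config.py is assumed to define: endpoint, port, db_name, db_user, db_password.
import psycopg2
import config

def connect():
    conn_string = (
        f"host={config.endpoint} port={config.port} dbname={config.db_name} "
        f"user={config.db_user} password={config.db_password}"
    )
    conn = psycopg2.connect(conn_string)
    cursor = conn.cursor()
    return conn, cursor
```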

Step 8: Create Table

Once we had a connection and a cursor, we wrote SQL queries in Python. PostgreSQL queries from Python using the psycopg2 library need four elements:

  • Establish Connection
  • Establish Cursor
  • Execute Cursor
  • Commit Connection
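Put together, those four elements look roughly like this for one of our tables (the column list is abbreviated and the table name is an assumption based on the final_customers data used later):

```python
# Create a table from Python: connection -> cursor -> execute -> commit.
conn, cursor = connect()  # connection and cursor from the Step 7 helper

create_table_sql = """
CREATE TABLE IF NOT EXISTS final_customers (
    order_id      VARCHAR,
    review_score  INT,
    price         FLOAT,
    freight_value FLOAT
);
"""
cursor.execute(create_table_sql)  # execute the cursor
conn.commit()                     # commit the connection
```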

Step 9: Loading data into a Postgres table from CSV

Once we had our tables, we copied data from CSV files with psycopg2 using the copy_from() method.

(Screenshot: loading CSV data with copy_from)
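As a sketch (assuming the CSV exported in Step 5 is named final_customers.csv):

```python
# Bulk-load the exported CSV into the Postgres table with copy_from().
with open("final_customers.csv", "r") as f:
    next(f)  # skip the header row
    cursor.copy_from(f, "final_customers", sep=",")
conn.commit()
```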

Step 10: Importing data from the AWS database into a Jupyter notebook for Machine Learning (code included in the picture above)

We used pandas to read the SQL table into a Jupyter notebook, and we used the final_customers_sql data to create our machine learning models.

(Screenshot: reading the SQL table into pandas)

Machine Learning:

Goal: To create a machine learning model that predicts the review score. We converted the review score to a binary variable using the following code:

(Screenshot: review score binarization code)
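The screenshot above contains our actual code; as a sketch of the idea (treating 4- and 5-star reviews as positive is an assumption about the cut-off):

```python
# Convert the 1-5 star review_score into a binary target:
# 1 = positive review, 0 = negative/neutral review.
final_df["review_score"] = final_df["review_score"].apply(lambda s: 1 if s >= 4 else 0)
```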

(See Feature Engineering and Data Preprocessing for Machine Learning in Database section above)

Description of how data was split into training and testing sets:

To train and evaluate a model that predicts the review score, we used the review score as the target column (y); X contains every other column. We then split X and y into training and testing sets (X_train, X_test, y_train, y_test) using a 75/25 split, so 25% of the data was held out for testing.

(Screenshot: train/test split code)
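A minimal sketch of that split (the random_state is arbitrary):

```python
from sklearn.model_selection import train_test_split

X = final_df.drop(columns=["review_score"])  # everything except the target
y = final_df["review_score"]                 # target: the binary review score

# 75/25 split: 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
```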

Resampling

Our data has more positive reviews than negative reviews, as the pie chart below shows.

(Pie chart: distribution of positive vs. negative reviews)

We used random undersampling to resample our data using the following code:

(Screenshot: resampling code)
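A sketch of that resampling step with imbalanced-learn:

```python
# Balance the classes by randomly undersampling the majority (positive) class.
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=1)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
```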

Model Choice:

As our target variable is binary, we chose classification models. We created 5 different models and compared their accuracies. We got accuracy above 70% for every model except the Artificial Neural Network, which gave us only about 50% accuracy. Random Forest performed the best on our data:

  1. Linear Logistic Regression: We got an accuracy of 74%. Logistic regression is easy to implement and interpret and very efficient to train, but it is hard to capture complex relationships with it, and it can overfit the model.

  2. K-NN: We were hoping for higher accuracy with K-NN, but at 67% it came in below logistic regression. One benefit of the KNN algorithm is that it doesn't require training before making predictions, so new data can be added seamlessly without impacting the accuracy of the algorithm.

  3. Decision Tree: We got an accuracy of ~92%. The Decision Tree algorithm is intuitive and easy to understand, but a small change in the data can change the prediction significantly.

  4. Ensemble - Random Forest: Because it combines many decision trees, this model gave us the best accuracy, 94%. Random Forest does not overfit the model, but it is slower to run.

  5. Artificial Neural Network: We were hoping to get the highest accuracy with this one, but it gave us only 50%. An ANN can overfit the data and takes a long time to train.

Here is a graph that shows the comparison of all the models we tested:

(Chart: model accuracy comparison)
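As a sketch, the best-performing model (Random Forest) was trained roughly like this; the hyperparameters shown are assumptions:

```python
# Fit a Random Forest classifier on the resampled training data and score it on the test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_resampled, y_resampled)

y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))  # ~0.94 in the runs reported above
```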

Feature Importance

We used the Random Forest feature importance technique (since Random Forest had the highest accuracy, ~94%, of all the models we tried) to assign each input feature an importance score based on how useful it is for predicting the target variable. As the picture below shows, the most important feature is "time_estimate_to_delivery", the difference between the estimated and the actual delivery date, and the least important feature is "order_status".

(Chart: Random Forest feature importances)
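A sketch of how the importance scores can be pulled out of the fitted Random Forest:

```python
# Rank the input features by their Random Forest importance scores.
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
# time_estimate_to_delivery ranks highest; order_status ranks lowest
```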

Predictive Web App

We used Streamlit to create an interactive web app that predicts the review score for various combinations of the features, using the Random Forest classifier model to make the predictions.

(Screenshot: Streamlit predictive web app)
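For reference, a condensed sketch of how such an app can be wired up with Streamlit (the widget choices, file names, and the three-feature input below are simplifications and assumptions, not the full app):

```python
# streamlit_app.py -- minimal sketch of a review-score prediction app.
# Run with: streamlit run streamlit_app.py
# The real app would collect all of the model's features; only three are shown here.
import pandas as pd
import streamlit as st
import joblib

model = joblib.load("random_forest_model.pkl")  # trained classifier saved earlier (assumed file name)

st.title("Olist Review Score Predictor")
price = st.number_input("Price", min_value=0.0, value=100.0)
freight = st.number_input("Freight value", min_value=0.0, value=20.0)
delivery_gap = st.slider("Days between estimated and actual delivery", -30, 60, 0)

if st.button("Predict review score"):
    features = pd.DataFrame(
        [[price, freight, delivery_gap]],
        columns=["price", "freight_value", "time_estimate_to_delivery"],
    )
    prediction = model.predict(features)[0]
    st.write("Predicted review:", "positive" if prediction == 1 else "negative")
```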

Dashboard

Link to Story Board on Google Slides

Link to Draft Story on Tableau

For the Dashboard, we began working in Google Slides to figure out how our presentation would flow. We decided to tell our story with Google Slides and show our final dashboard in Tableau. Since a couple of us will be working in Tableau, we started a free trial of Tableau Online so we can all download and edit the same files.

This report starts by telling the story of Olist: how the company began, where its customers are, what products sell the most, and ways to improve its services. Olist began its online presence in the fall of 2016 and built a robust presence in 2017 and 2018. We review the products it helps distribute and how they are performing around the country. As part of its quality control, Olist sends out surveys for feedback on its services, and we were able to use the star ratings to see how customers felt about the services provided. Finally, since logistics and delivery are a large part of the service Olist provides and are discussed heavily in the reviews, we attempt to draw this picture in our visualizations.

Visualization Tools

The team decided to move forward with Tableau for our dashboard needs. The data, as previously described, was cleaned, merged, and further processed with Python, PostgreSQL, and AWS. The dataset we have been working with includes some of the changes made for the machine learning models, such as changing strings into integers. In addition to the course module, YouTube videos helped along our Tableau learning journey, with guidance on connecting worksheets for interactivity and overlaying maps to present the data.

Sales Dashboard

(Dashboard screenshot: Olist Sales Review)

In this board we see the popularity of items sold on Olist: the most popular items are home goods, and the least popular are security services. We used a bubble (cluster) chart to review this and a map to locate the places where these items were sold most frequently.

Review Dashboard

(Dashboard screenshot: Review Dashboard)

The next dashboard illustrates the reviews and how most categories were rated. Overall, Olist provides a good service, with most items receiving reviews of 3.5 and greater. There are negative reviews, and these tend to appear where delivering large-ticket items may not be as easy as delivering smaller items; the logistics may simply be different. For this reason we also reviewed freight costs and services. Most of our machine learning visualizations and results are located in this frame as well.

Outline

Our preliminary slides will show how we chose the topic of eCommerce Business Trends, where we gathered our data from, and our topics of exploration.

(Screenshot: topic selection slide)

For the Database portion, we will show how our database connects with the machine learning model in a Jupyter notebook and how we used SQL joins to create our final database.

(Screenshot: database slide)

For the machine learning section we will show the results of our machine learning models, predictions, and which algorithm performed best.

Storyboarding our dashboard:

Questions we will address with our Dashboard (using Olist Sales Data from 2016-2018)

  • Did the number of sellers increase over time?
  • How did the number of sales by product category change over time?
  • How did customer review scores change over time?

We will show this information with:

Interactive elements:

  • Line graphs that can be filtered by year
  • Maps that can be filtered by customer review scores, product category, and city

Additional Ideas:

  • Review whether the number of seller accounts has increased over time
  • Another dashboard for logistics
    • How do logistics impact the buying and selling experience?

Communication Protocol

  • We will coordinate and communicate through Slack and Zoom. Our team will meet on Zoom every Monday, Wednesday and Thursday.

Planning Meeting:

  • Every Monday we will meet to plan the deliverables and responsibilities for the week. Everyone will work on their own branch or feature branch on GitHub.
  • Google Drive: To add websites with data, brainstorm topics to choose from, and share our files for the project

Debugging Session:

  • Every Wednesday we will discuss challenges and try to debug issues together.

Merging work to main branch

  • We will post on Slack before merging to the main branch and after creating new branches.

  • On Sundays at 7pm, we will check in on Slack to review our GitHub page before turning in the assignment on Canvas.
