ETL and Sentiment Analysis of Amazon reviews with AWS, PySpark, PostgreSQL, NLP

dshetty100/Amazon_Vine_Analysis

Amazon Vine Analysis

Overview of the Analysis

This project analyzes reviews written by paid members of the Amazon Vine program (an Amazon service that lets manufacturers and publishers receive reviews for their products) using Amazon Web Services (AWS) Relational Database Service (RDS), PostgreSQL, Google Colab, and PySpark. The goal was to determine whether the available data show any bias toward favorable reviews from the paid members.

The Amazon Review Datasets offered 50 datasets to choose from, each containing reviews for one product category, from apparel to wireless products. I chose to analyze reviews in the Camera category.

For the analysis, PySpark was used to extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into PostgreSQL, using the schema challenge_schema with pgAdmin. The complete ETL (Extract-Transform-Load) program is in Amazon_Reviews_ETL.ipynb, written in Google Colab. PySpark was also used to determine whether the chosen dataset shows any bias toward favorable reviews from Vine members; that program is in Vine_Reviews_Analysis.ipynb, also written in Google Colab.
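
As a rough illustration of the load step, the sketch below builds a PostgreSQL JDBC URL and hands a DataFrame to Spark's JDBC writer. It is not taken from Amazon_Reviews_ETL.ipynb: the helper names, the placeholder endpoint, database, credentials, and the table name `vine_table` are all assumptions, and the Spark session is assumed to have the PostgreSQL JDBC driver available.

```python
# Sketch of the "load" step: write a transformed Spark DataFrame to PostgreSQL on RDS.
# Every connection value and the table name below is a hypothetical placeholder.


def jdbc_settings(endpoint, database, user, password, port=5432):
    """Build the JDBC URL and connection properties for a PostgreSQL target."""
    url = f"jdbc:postgresql://{endpoint}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, properties


def load_table(df, table, url, properties, mode="append"):
    """Write one DataFrame to one table through Spark's JDBC writer.

    df is expected to be a pyspark.sql.DataFrame; df.write.jdbc is the
    PySpark DataFrameWriter call that performs the insert.
    """
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)


url, properties = jdbc_settings(
    endpoint="<your-rds-endpoint>",
    database="<your-database>",
    user="postgres",
    password="<your-password>",
)

# With a transformed DataFrame in hand (here called vine_df, a hypothetical name):
# load_table(vine_df, "vine_table", url, properties)
```

Keeping the connection settings in one helper means each transformed DataFrame can be written with a single call, and the credentials stay out of the transformation code.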

Results

The reviews in the chosen dataset were filtered to those with more than 20 total votes, of which more than 50% were "helpful" votes.
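
The filter can be sketched in plain Python as a predicate over one review. The field names `total_votes` and `helpful_votes` and the sample rows are assumptions made for illustration, and the strict comparisons follow the wording above rather than the notebook itself. In PySpark the same condition would typically be passed to `DataFrame.filter`.

```python
def is_helpful(review, min_votes=20, min_ratio=0.5):
    """Return True when a review passes both thresholds.

    The vote-count check runs first, so the ratio is never computed
    for a review with zero total votes.
    """
    total = review["total_votes"]
    helpful = review["helpful_votes"]
    return total > min_votes and helpful / total > min_ratio


# Made-up rows, only to exercise the predicate.
sample = [
    {"review_id": "A", "total_votes": 40, "helpful_votes": 30},  # 75% helpful
    {"review_id": "B", "total_votes": 12, "helpful_votes": 12},  # too few votes
    {"review_id": "C", "total_votes": 50, "helpful_votes": 20},  # 40% helpful
]

kept = [r["review_id"] for r in sample if is_helpful(r)]
print(kept)  # ['A']
```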

Results of the calculation on the filtered dataset (see also Vine_Reviews_Analysis.ipynb) are shown below:

  • Paid and Unpaid Reviews
    There were 50,516 non-Vine (unpaid) reviews and 607 Vine (paid) reviews.

  • Five-star Paid and Unpaid Reviews
    There were 25,300 non-Vine (unpaid) 5-star reviews and 257 Vine (paid) 5-star reviews.

  • Five-star reviews as a percent of paid reviews
    Out of 607 total Vine (paid) reviews, 257 (42.3%) were 5-star reviews.

  • Five-star reviews as a percent of non-paid reviews
    Out of 50,516 non-Vine (unpaid) reviews, 25,300 (50.08%) were 5-star reviews.
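
The two percentages follow directly from the counts listed above. A minimal check in plain Python, where the helper name `five_star_share` is made up for this example:

```python
def five_star_share(five_star, total):
    """Five-star reviews as a percentage of all reviews, to two decimals."""
    return round(100 * five_star / total, 2)


vine_pct = five_star_share(257, 607)            # Vine (paid) reviews
non_vine_pct = five_star_share(25_300, 50_516)  # non-Vine (unpaid) reviews

print(vine_pct, non_vine_pct)  # 42.34 50.08
```

At one decimal place the Vine share is the 42.3% reported above.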

Summary

The results above do not provide enough evidence to suggest a bias toward five-star reviews from paid Amazon Vine reviewers: in the filtered Camera data, 42.3% of Vine reviews were five-star, compared with 50.08% of non-Vine reviews. To reach a more definitive conclusion, it would be useful to repeat the analysis for paid and unpaid reviews in several other product categories.
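
One way to put a number on this comparison is a two-proportion z-test on the counts reported in the Results. This is not part of the original notebooks; it is a sketch that assumes the reviews can be treated as independent samples, and the function name is made up for the example.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference p1 - p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Group 1: Vine (paid) reviews; group 2: non-Vine (unpaid) reviews.
z = two_proportion_z(257, 607, 25_300, 50_516)

# One-sided p-value for the hypothesis "the Vine five-star share is higher":
# P(Z >= z) for a standard normal Z.
p_vine_higher = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.2f}, P(Z >= z) = {p_vine_higher:.4f}")
```

The statistic is negative because the Vine five-star share is below the non-Vine share, so this test gives no support for a bias toward five-star reviews among Vine members in this dataset.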
