- Yellow Taxis charge standard metered fares for all street-hail trips.
Variable Name | Description |
---|---|
vendorid | A code indicating the TPEP provider that provided the record. |
pickup_datetime | The date and time when the meter was engaged. |
dropoff_datetime | The date and time when the meter was disengaged. |
passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
trip_distance | The elapsed trip distance in miles reported by the taximeter. |
PULocationID | Taxi Zone in which the taximeter was engaged. |
DOLocationID | Taxi Zone in which the taximeter was disengaged. |
ratecodeid | The final rate code in effect at the end of the trip. |
store_fwd_flg | Indicates whether the trip record was held in vehicle memory before being sent to the vendor ("store and forward") because the vehicle had no connection to the server. |
pay_type | How the passenger paid for the trip. |
fare_amount | The time-and-distance fare calculated by the meter. |
surcharge | Miscellaneous surcharges: $0.30 improvement surcharge, $0.50 overnight surcharge (8 pm to 6 am), and the $2.50 New York State congestion surcharge. |
mta_tax | $0.50 MTA tax, automatically triggered based on the metered rate in use. |
tip_amount | Tip amount; automatically populated for credit card tips. Cash tips are not included. |
toll_amount | Total amount of all tolls paid during the trip. |
total_amount | The total amount charged to passengers. Does not include cash tips. |
- The core objective of this project is to analyse the factors driving taxi demand: where the most pickups and drop-offs occur, when traffic is heaviest, and how to meet the public's need for taxis.
Code to read raw data from the S3 bucket into a PySpark DataFrame, perform cleaning and transformations, and create a table in Hive:
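A minimal sketch of that pipeline, assuming hypothetical S3 paths, a `taxi` Hive database, and the column names from the data dictionary above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# enableHiveSupport() lets saveAsTable register the result in the Hive metastore.
spark = (SparkSession.builder
         .appName("nyc-taxi-etl")
         .enableHiveSupport()
         .getOrCreate())

# Read the raw CSVs from S3 (bucket and prefix are placeholders).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://nyc-taxi-raw/yellow/"))

# Cleaning: parse timestamps, drop rows missing key fields, and filter
# out physically implausible records.
clean = (df
         .withColumn("pickup_datetime", F.to_timestamp("pickup_datetime"))
         .withColumn("dropoff_datetime", F.to_timestamp("dropoff_datetime"))
         .dropna(subset=["pickup_datetime", "dropoff_datetime",
                         "PULocationID", "DOLocationID"])
         .filter((F.col("trip_distance") > 0) & (F.col("fare_amount") >= 0))
         .filter(F.col("passenger_count") > 0))

# Derive time features used later in the demand analysis.
clean = (clean
         .withColumn("pickup_hour", F.hour("pickup_datetime"))
         .withColumn("pickup_day", F.dayofweek("pickup_datetime")))

# Persist as a Hive table in a columnar format.
spark.sql("CREATE DATABASE IF NOT EXISTS taxi")
clean.write.mode("overwrite").format("parquet").saveAsTable("taxi.yellow_trips")
```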
Slow query performance
- Because of the large volume (30 GB) of our data, query performance became very poor; we tackled this problem with columnar big data file formats (ORC, Parquet, etc.).
- After converting to Parquet, our data volume shrank to 8 GB and query performance improved; a sketch of the conversion follows.
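A minimal sketch of the conversion, with placeholder S3 paths; Parquet's columnar layout and compression account for both the smaller footprint and the faster scans:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSVs once...
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://nyc-taxi-raw/yellow/"))

# ...and rewrite them as compressed, columnar Parquet. Partitioning by
# pickup zone means zone-level queries scan only the partitions they need.
(df.write
   .mode("overwrite")
   .partitionBy("PULocationID")
   .parquet("s3://nyc-taxi-curated/yellow_parquet/"))

# Later queries read the Parquet copy instead of the raw CSVs.
trips = spark.read.parquet("s3://nyc-taxi-curated/yellow_parquet/")
```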
EMR steps
- Initially we struggled a lot with EMR steps, but after reading the AWS documentation and some trial and error the problem was solved; a sketch of submitting a Spark step follows.
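A minimal sketch of submitting a Spark job as an EMR step via boto3; the cluster ID, region, and script path are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark job as an EMR step. command-runner.jar runs the
# spark-submit command on the cluster's master node.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "taxi-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://nyc-taxi-scripts/etl.py"],
        },
    }],
)
print(response["StepIds"])
```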
CloudFormation
- Our EMR cluster used to throw errors while we were launching it through CloudFormation, but after backtracking through the error messages we were able to launch the EMR cluster successfully using a CloudFormation template (CFT); a minimal template sketch follows.
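A minimal sketch of launching the cluster from a CloudFormation template via boto3; the subnet ID, log bucket, release label, and instance sizes are placeholders, and the IAM roles assume the EMR default roles already exist:

```python
import json
import boto3

# A minimal EMR cluster template with placeholder values.
template = {
    "Resources": {
        "TaxiCluster": {
            "Type": "AWS::EMR::Cluster",
            "Properties": {
                "Name": "taxi-analysis",
                "ReleaseLabel": "emr-6.10.0",
                "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
                "JobFlowRole": "EMR_EC2_DefaultRole",
                "ServiceRole": "EMR_DefaultRole",
                "LogUri": "s3://nyc-taxi-logs/emr/",
                "Instances": {
                    "Ec2SubnetId": "subnet-0123456789abcdef0",
                    "MasterInstanceGroup": {
                        "InstanceCount": 1, "InstanceType": "m5.xlarge"},
                    "CoreInstanceGroup": {
                        "InstanceCount": 2, "InstanceType": "m5.xlarge"},
                },
            },
        }
    }
}

cf = boto3.client("cloudformation", region_name="us-east-1")
cf.create_stack(
    StackName="taxi-emr-stack",
    TemplateBody=json.dumps(template),
)
```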
Loading data in Tableau
- Our biggest challenge was connecting Tableau to the data and loading it: even after converting to big data file formats (ORC, Parquet), the volume was still too large for Tableau to execute queries and build visualizations. We solved this by saving a Tableau extract of the data on the local machine, after which everything went very smoothly.
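The extract itself is created in the Tableau UI, but pre-aggregating in Spark first is a complementary way to shrink what Tableau must load; a minimal sketch, assuming the `taxi.yellow_trips` table and `pickup_hour` column from the ETL sketch above and a placeholder output path:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Summarize trips per pickup zone and hour; a few thousand summary rows
# are far easier for Tableau to handle than 30 GB of raw trips.
summary = (spark.table("taxi.yellow_trips")
           .groupBy("PULocationID", "pickup_hour")
           .agg(F.count("*").alias("trips"),
                F.avg("trip_distance").alias("avg_distance"),
                F.avg("total_amount").alias("avg_total")))

# Write a single small CSV that Tableau can open directly.
(summary.coalesce(1)
        .write.mode("overwrite")
        .option("header", "true")
        .csv("s3://nyc-taxi-curated/zone_hour_summary/"))
```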