- Yellow Taxis charge standard metered fares for all street-hail trips.
Variable Name | Description |
---|---|
vendorid | A code indicating the TPEP provider that provided the record. |
pickup_datetime | The date and time when the meter was engaged. |
dropoff_datetime | The date and time when the meter was disengaged. |
passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
trip_distance | The elapsed trip distance in miles reported by the taximeter. |
PULocationID | Taxi Zone in which the taximeter was engaged. |
DOLocationID | Taxi Zone in which the taximeter was disengaged. |
ratecodeid | The final rate code in effect at the end of the trip. |
store_fwd_flg | Indicates whether the trip record was held in vehicle memory before being sent to the vendor ("store and forward") because the vehicle had no connection to the server. |
pay_type | How the passenger paid for the trip. |
fare_amount | The time-and-distance fare calculated by the meter. |
surcharge | Miscellaneous surcharges: $0.30 improvement surcharge, $0.50 overnight surcharge (8 pm to 6 am), and the $2.50 New York State congestion surcharge. |
mta_tax | $0.50 MTA tax, automatically triggered based on the metered rate in use. |
tip_amount | Tip amount; automatically populated for credit card tips. Cash tips are not included. |
toll_amount | Total amount of all tolls paid during the trip. |
total_amount | The total amount charged to passengers. Does not include cash tips. |
- The core objective of this project is to analyse the factors driving taxi demand: where the most pickups and drop-offs occur, when traffic is heaviest, and how to meet the public's need for taxis.
Code to read raw data from the S3 bucket into a PySpark DataFrame, perform cleaning and transformations, and create a table in Hive:
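A minimal sketch of that pipeline, assuming hypothetical S3 paths, a `taxi` Hive database, and the column names from the data dictionary above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# enableHiveSupport() lets saveAsTable register the result in the Hive metastore.
spark = (SparkSession.builder
         .appName("nyc-taxi-etl")
         .enableHiveSupport()
         .getOrCreate())

# Read the raw CSVs from S3 (bucket and prefix are placeholders).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://nyc-taxi-raw/yellow/"))

# Cleaning: parse timestamps, drop rows missing key fields, and filter
# out physically implausible records.
clean = (df
         .withColumn("pickup_datetime", F.to_timestamp("pickup_datetime"))
         .withColumn("dropoff_datetime", F.to_timestamp("dropoff_datetime"))
         .dropna(subset=["pickup_datetime", "dropoff_datetime",
                         "PULocationID", "DOLocationID"])
         .filter((F.col("trip_distance") > 0) & (F.col("fare_amount") >= 0))
         .filter(F.col("passenger_count") > 0))

# Derive time features used later in the demand analysis.
clean = (clean
         .withColumn("pickup_hour", F.hour("pickup_datetime"))
         .withColumn("pickup_day", F.dayofweek("pickup_datetime")))

# Persist as a Hive table in a columnar format.
spark.sql("CREATE DATABASE IF NOT EXISTS taxi")
clean.write.mode("overwrite").format("parquet").saveAsTable("taxi.yellow_trips")
```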
Slow query performance
- Because of the large volume (30 GB) of our data, query performance became very poor; we tackled this problem with columnar big data file formats (ORC, Parquet, etc.).
- After converting to Parquet, our data volume shrank to 8 GB and query performance improved; a sketch of the conversion follows.
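A minimal sketch of the conversion, with placeholder S3 paths; Parquet's columnar layout and compression account for both the smaller footprint and the faster scans:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSVs once...
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://nyc-taxi-raw/yellow/"))

# ...and rewrite them as compressed, columnar Parquet. Partitioning by
# pickup zone means zone-level queries scan only the partitions they need.
(df.write
   .mode("overwrite")
   .partitionBy("PULocationID")
   .parquet("s3://nyc-taxi-curated/yellow_parquet/"))

# Later queries read the Parquet copy instead of the raw CSVs.
trips = spark.read.parquet("s3://nyc-taxi-curated/yellow_parquet/")
```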
EMR steps
- Initially we struggled a lot with EMR steps, but after reading the AWS documentation and some trial and error the problem was solved; a sketch of submitting a Spark step follows.
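A minimal sketch of submitting a Spark job as an EMR step via boto3; the cluster ID, region, and script path are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark job as an EMR step. command-runner.jar runs the
# spark-submit command on the cluster's master node.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "taxi-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://nyc-taxi-scripts/etl.py"],
        },
    }],
)
print(response["StepIds"])
```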
CloudFormation
- Our EMR cluster used to throw errors while we were launching it through CloudFormation, but after backtracking through the error messages we were able to launch the EMR cluster successfully using a CloudFormation template (CFT); a minimal template sketch follows.
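A minimal sketch of launching the cluster from a CloudFormation template via boto3; the subnet ID, log bucket, release label, and instance sizes are placeholders, and the IAM roles assume the EMR default roles already exist:

```python
import json
import boto3

# A minimal EMR cluster template with placeholder values.
template = {
    "Resources": {
        "TaxiCluster": {
            "Type": "AWS::EMR::Cluster",
            "Properties": {
                "Name": "taxi-analysis",
                "ReleaseLabel": "emr-6.10.0",
                "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
                "JobFlowRole": "EMR_EC2_DefaultRole",
                "ServiceRole": "EMR_DefaultRole",
                "LogUri": "s3://nyc-taxi-logs/emr/",
                "Instances": {
                    "Ec2SubnetId": "subnet-0123456789abcdef0",
                    "MasterInstanceGroup": {
                        "InstanceCount": 1, "InstanceType": "m5.xlarge"},
                    "CoreInstanceGroup": {
                        "InstanceCount": 2, "InstanceType": "m5.xlarge"},
                },
            },
        }
    }
}

cf = boto3.client("cloudformation", region_name="us-east-1")
cf.create_stack(
    StackName="taxi-emr-stack",
    TemplateBody=json.dumps(template),
)
```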
Loading data in Tableau
- Our biggest challenge was connecting Tableau to the data and loading it: even after converting to big data file formats (ORC, Parquet), the volume was still too large for Tableau to execute queries and build visualizations. We solved this by saving a Tableau extract of the data on the local machine, after which everything went very smoothly.
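The extract itself is created in the Tableau UI, but pre-aggregating in Spark first is a complementary way to shrink what Tableau must load; a minimal sketch, assuming the `taxi.yellow_trips` table and `pickup_hour` column from the ETL sketch above and a placeholder output path:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Summarize trips per pickup zone and hour; a few thousand summary rows
# are far easier for Tableau to handle than 30 GB of raw trips.
summary = (spark.table("taxi.yellow_trips")
           .groupBy("PULocationID", "pickup_hour")
           .agg(F.count("*").alias("trips"),
                F.avg("trip_distance").alias("avg_distance"),
                F.avg("total_amount").alias("avg_total")))

# Write a single small CSV that Tableau can open directly.
(summary.coalesce(1)
        .write.mode("overwrite")
        .option("header", "true")
        .csv("s3://nyc-taxi-curated/zone_hour_summary/"))
```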