Elastic Stack demo for airline data

An ETL flow for US domestic flights, enriched with weather, geo, delay and airline data, for the country's top 5 airports. It uses Logstash, Elasticsearch and Kibana, and optionally the Kibana plugin Timelion.

As an added bonus, there is a separate data set with 2014 TSA claims data.

Running the demo

Complete these steps:

  1. Download data:

    • sh wget.sh
    • The download is about 2.5 GB. Because the data is filtered, it takes up much less space in Elasticsearch: below 100 MB.
  2. Create Elasticsearch indices and templates:

    • sh create_flight_template.sh URI [USERNAME:PASSWORD]
    • sh create_tsaclaims_index.sh URI [USERNAME:PASSWORD]
  3. Ingest the TSA claims and flight data into Elasticsearch with Logstash:

    • Optionally set the username, password and host in import_*.conf
    • sh load_tsaclaims.sh && sh load_flights.sh
  4. Create an alias called flights, composed of all flights-* indices:

    • sh create_flight_alias.sh
  5. Create the index patterns in Kibana:

    • tsaclaims with Date Received as time field
    • flights with FlightDateTime as time field
  6. Import Kibana visuals and dashboards:

    • In Kibana, go to Settings, then Objects, then Import kibana_import.json
    • Optional: Timelion is a time series graphing plugin for Kibana, developed by Elastic. Read more about Timelion and how to get it here. Timelion sheets cannot currently be exported or imported, so the charts have to be created by hand. Open Timelion and, for each of the six expressions below, add a chart to the sheet and paste the expression in. Don't forget to save the sheet.

    .es(index=flights).label("All Flights"), .es(index=flights, q=ArrDelayMinutes:>0).label("Delayed Flights")

    .static(55).color(red).label("Red Line"), .static(50).color(orange).label("Orange Line"), .es(index=flights, q=ArrDelayMinutes:>0).label("Delayed Flights Percentage").divide(.es(index=flights)).multiply(100).color(navy).movingaverage(5)

    .es(index=flights, metric=avg:tmax).color(orange).lines(width=2).movingaverage(5).label("Maximum Temperature (celsius) mavg=5"), .es(index=flights, metric=avg:tmin).color(lightblue).lines(width=2).movingaverage(5).label("Minimum Temperature (celsius) mavg=5"), .es(index=flights, metric=avg:WeatherDelay).color(Red).movingaverage(5).label("Weather Delay (in minutes) mavg=5")

    .es(index=flights, q=ArrDelayMinutes:>0).label("Delayed Flights").color(navy).movingaverage(10), .es(index=flights, metric=sum:terribility).label("Terribility Index").movingaverage(10)

    .es(index=tsaclaims, timefield="Date Received").movingaverage(7).label("TSA Claims mavg(7)"), .es(index=flights).movingaverage(7).divide(10).label("Flights mavg(7) /10")

    .es(index=flights, metric=avg:snowfall).divide(10).add(.es(index=flights, metric=avg:thunder)).sum(.es(index=flights, metric=avg:hail).multiply(3)).sum(.es(index=flights, metric=avg:glaze).multiply(2)).sum(.es(index=flights, metric=avg:fog).multiply(1)).sum(.es(index=flights, metric=avg:heavy_fog).multiply(5)).sum(.es(index=flights, metric=avg:dust_ash).multiply(10)).label("Average Terribility(R)").points(4).color(Navy), .es(index=flights, metric=avg:terribility).label("Ingested Terribility(R)")
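The last Timelion expression above builds its "Average Terribility" series as a weighted sum of the averaged weather fields: snowfall divided by 10, plus thunder, plus hail times 3, glaze times 2, fog times 1, heavy_fog times 5 and dust_ash times 10. The sketch below reproduces only that arithmetic for a single set of values. The function name `terribility` and its argument order are hypothetical, and whether the ingested `terribility` field is computed with the same weights is an assumption this README does not confirm.

```shell
#!/bin/sh
# Weighted sum mirroring the chart expression above (illustrative only).
# Hypothetical argument order: snowfall thunder hail glaze fog heavy_fog dust_ash
terribility() {
  awk -v snow="$1" -v thunder="$2" -v hail="$3" -v glaze="$4" \
      -v fog="$5" -v heavy_fog="$6" -v dust_ash="$7" 'BEGIN {
    # Same weights as in the Timelion expression.
    printf "%.2f\n", snow / 10 + thunder + 3 * hail + 2 * glaze + fog + 5 * heavy_fog + 10 * dust_ash
  }'
}

# Example: a snowfall reading of 20 with thunder and fog, nothing else.
terribility 20 1 0 0 1 0 0   # prints 4.00
```

Writing the weights out this way makes it easier to see why a single dust or ash reading outweighs every other field in the chart.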

Prerequisites

  1. Elasticsearch 2.3
  2. Kibana 4.4
  3. Logstash 2.3
  4. Timelion 4.4 (optional)

Other versions may work but are untested. If another version works for you, please consider letting us know with a pull request against this README.
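Before running the demo it can help to check whether an installed component falls in one of the tested series listed above. The following is a minimal sketch, not part of this repository: the helper name `matches_tested` is hypothetical, and the version string is assumed to have been obtained already (for Elasticsearch, for example, from the `version.number` field returned by the root endpoint).

```shell
#!/bin/sh
# Succeeds when VERSION equals TESTED or starts with "TESTED." (e.g. 2.3.4 vs 2.3).
# Usage: matches_tested VERSION TESTED
matches_tested() {
  case "$1" in
    "$2"|"$2".*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with hard-coded version strings.
matches_tested 2.3.4 2.3 && echo "Elasticsearch 2.3.4: tested series"
matches_tested 5.0.0 2.3 || echo "Elasticsearch 5.0.0: untested, may still work"
```

The match is on the leading `major.minor` part only, so `2.30.1` is not mistaken for a `2.3` release.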

What's included

  1. create_*.sh: sets up Elasticsearch templates, mappings (actual mappings in mapping*.json) and aliases
  2. lookup_data/*: airport timezone and weather data for enriching the flight data
  3. logstash/filters/*.rb: four simple Logstash filters to join the lookup data
  4. load_*.sh: invoke Logstash to import the flat data files
  5. remove_indices.sh: remove all indices, mappings, templates and
  6. wget.sh: downloads the flight data files
  7. import_*.conf: configuration files for Logstash. The host is hardcoded in these files, so change it to suit your setup
  8. kibana_import.json: Two Dashboards and 43 Visualizations for Kibana

Data sources

  1. The airline data is taken from US BTS and is limited to 2014 and the 5 busiest airports: ATL, ORD, JFK, LAX and DFW. A flight qualifies only if both its origin and its destination are among these airports.
  2. The weather data is taken from NCEI. For each of the 5 airports I used the closest weather station; in every case, the readings are taken at the airport itself.
  3. The timezone data was provided by jpatokal.
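The qualification rule for flights (both endpoints must be among the five airports) can be illustrated with a small filter. This is a sketch under stated assumptions, not the repository's implementation: the function name `filter_top5` and the simplified, unquoted `date,origin,dest` CSV layout are hypothetical, and real BTS files contain quoted fields that a plain comma split would not handle.

```shell
#!/bin/sh
# Keep only rows whose origin AND destination are both among the five airports.
# Assumes a simplified, unquoted CSV with columns: date,origin,dest (hypothetical layout).
filter_top5() {
  awk -F, '
    BEGIN { n = split("ATL ORD JFK LAX DFW", a, " "); for (i = 1; i <= n; i++) top[a[i]] = 1 }
    NR == 1 { print; next }                 # pass the header through
    ($2 in top) && ($3 in top) { print }    # both endpoints must qualify
  '
}

# Made-up sample rows: only ATL-ORD and LAX-DFW survive the filter.
printf '%s\n' \
  'date,origin,dest' \
  '2014-01-01,ATL,ORD' \
  '2014-01-01,ATL,SFO' \
  '2014-01-02,BOS,JFK' \
  '2014-01-02,LAX,DFW' | filter_top5
```

A flight such as ATL to SFO is dropped even though its origin qualifies, which is why the data ends up so much smaller than the download.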