MUSA 508 FINAL PROJECT NOTES

Project option 4 - Forecast Metro train delays in and around NYC: An amazing new dataset has popped up on Kaggle recently that lists origin/destination delays for Amtrak and NJ Transit trains. Can you predict train delays? Consider the time frame in which it would be useful to have such predictions. Predicting 5 minutes out is not going to be as useful as 2-3 hours out. Consider training on a month and predicting for the next week or two. Consider time/space (train line, county, etc.) cross-validation. Many app use cases here.

tentative time period: september 2019 as training data, first two weeks of october 2019 as test data? (alt. use three sets: training, validation, and test?)
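
rough sketch of the chronological split described above, in plain python rather than R. the rows and the timestamp format are made up for illustration; only scheduled_time and delay_minutes are field names from the dataset, and the cutoff dates just encode the tentative plan.

```python
from datetime import datetime

# Made-up rows for illustration; the "%Y-%m-%d %H:%M:%S" format is an assumption.
rows = [
    {"scheduled_time": "2019-09-03 08:15:00", "delay_minutes": 2.0},
    {"scheduled_time": "2019-09-28 17:40:00", "delay_minutes": 6.5},
    {"scheduled_time": "2019-10-04 07:05:00", "delay_minutes": 0.0},
    {"scheduled_time": "2019-10-20 12:00:00", "delay_minutes": 3.0},
]

TRAIN_START = datetime(2019, 9, 1)
TEST_START = datetime(2019, 10, 1)   # training ends where testing begins
TEST_END = datetime(2019, 10, 15)    # exclusive, so the test set covers Oct 1-14


def parse(ts):
    """Parse a timestamp string in the assumed format."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


# Split strictly by time so no test-period information leaks into training.
train = [r for r in rows if TRAIN_START <= parse(r["scheduled_time"]) < TEST_START]
test = [r for r in rows if TEST_START <= parse(r["scheduled_time"]) < TEST_END]

print(len(train), len(test))
```

if we go with three sets, a validation window could be carved out of the end of september in the same way.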

data issues

amtrak trains have no scheduled_time data, and therefore also no delay_minutes. (they also don't have stop_sequence data, but that probably doesn't matter if we don't have delays.) this may mean we have to limit our predictions to nj transit trains, but maybe we could incorporate amtrak traffic as a feature somehow? not sure how to do this technically, though, or what exactly the amtrak data will allow.
some nj transit trains may also have missing values
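
a quick way to size the missing-data problem before deciding what to drop, sketched in plain python. the "type" field (operator name) and the sample rows are assumptions for illustration; scheduled_time and delay_minutes are the fields mentioned above.

```python
# Made-up rows; "type" is a hypothetical operator column.
rows = [
    {"type": "NJ Transit", "scheduled_time": "2019-09-03 08:15:00", "delay_minutes": 2.0},
    {"type": "NJ Transit", "scheduled_time": "2019-09-03 08:30:00", "delay_minutes": None},
    {"type": "Amtrak", "scheduled_time": None, "delay_minutes": None},
]


def missing_share(group, field):
    """Fraction of rows in `group` where `field` is missing (None)."""
    return sum(r[field] is None for r in group) / len(group)


# Group rows by operator, then compute the share of missing delays in each group.
by_type = {}
for r in rows:
    by_type.setdefault(r["type"], []).append(r)

missing_by_type = {t: missing_share(g, "delay_minutes") for t, g in by_type.items()}

# Rows that can serve as training examples: the target must be present.
usable = [r for r in rows if r["delay_minutes"] is not None]

print(missing_by_type, len(usable))
```

rows without a target can't be trained on, but they could still be counted (e.g. amtrak trains per station per hour) if we try the traffic-as-feature idea.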

time

use lubridate as in ch. 8 to standardize time format and create features for e.g. day of week
how granular should we be? chapter used 15- and 60-minute time increments; is that meaningful here? (probably for scheduled_time? could experiment with different intervals?)
holidays: labor day (sept 2, 2019) falls in the training month; columbus day (oct 14, 2019) falls in the proposed test window. anything else?
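
sketch of the time features discussed above (day of week, interval bucket, holiday flag), in plain python as a stand-in for the lubridate approach. the timestamp format and the function name time_features are assumptions; the interval flooring only works for widths that divide 60 evenly (e.g. 15 or 60).

```python
from datetime import date, datetime

# Holiday dates to flag; only Labor Day 2019 is listed here, extend as needed.
HOLIDAYS = {date(2019, 9, 2)}


def time_features(ts, interval_minutes=60):
    """Derive calendar features from a timestamp string (assumed format)."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # Floor the minute to the start of its interval (valid when 60 % interval_minutes == 0).
    floored_minute = (dt.minute // interval_minutes) * interval_minutes
    return {
        "day_of_week": dt.weekday(),          # 0 = Monday ... 6 = Sunday
        "hour": dt.hour,
        "interval_start": dt.replace(minute=floored_minute, second=0, microsecond=0),
        "is_weekend": dt.weekday() >= 5,
        "is_holiday": dt.date() in HOLIDAYS,
    }


quarter_hour = time_features("2019-09-02 08:47:10", interval_minutes=15)
hourly = time_features("2019-09-02 08:47:10", interval_minutes=60)
print(quarter_hour["interval_start"], hourly["interval_start"])
```

computing both bucket widths side by side makes it easy to test which granularity actually helps.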

space

ch. 8 example used census tracts as units and calculated spatial lag that way
train data arguably has two spatial aspects: the current station (and all trains passing through it), and the previous station on the same line. do we incorporate lag features for both? maybe test both alone and in combination?
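
sketch of the two lag ideas above: the delay at the previous stop of the same train, and the delay of the previous train through the same station. train_id, station, the station names, and the helper add_lag are assumptions for illustration; stop_sequence, scheduled_time, and delay_minutes come from the dataset.

```python
# Made-up rows: two trains, each calling at the same two stations.
rows = [
    {"train_id": "A", "stop_sequence": 1, "station": "Newark",
     "scheduled_time": "2019-09-03 08:00:00", "delay_minutes": 1.0},
    {"train_id": "A", "stop_sequence": 2, "station": "Secaucus",
     "scheduled_time": "2019-09-03 08:10:00", "delay_minutes": 4.0},
    {"train_id": "B", "stop_sequence": 1, "station": "Newark",
     "scheduled_time": "2019-09-03 08:20:00", "delay_minutes": 3.0},
    {"train_id": "B", "stop_sequence": 2, "station": "Secaucus",
     "scheduled_time": "2019-09-03 08:30:00", "delay_minutes": 5.0},
]


def add_lag(rows, group_key, order_key, out_name):
    """Attach the previous row's delay within each group, ordered by order_key.

    Rows are modified in place; the first row of each group gets None.
    """
    last_seen = {}
    for r in sorted(rows, key=lambda r: (r[group_key], r[order_key])):
        r[out_name] = last_seen.get(r[group_key])
        last_seen[r[group_key]] = r["delay_minutes"]


# Lag 1: delay at the previous stop of the same train.
add_lag(rows, "train_id", "stop_sequence", "lag_prev_stop")
# Lag 2: delay of the previous train at the same station
# (ISO-style timestamp strings sort chronologically).
add_lag(rows, "station", "scheduled_time", "lag_station")

for r in rows:
    print(r["train_id"], r["station"], r["lag_prev_stop"], r["lag_station"])
```

two caveats: in the real data a train id probably repeats across days, so the grouping would need the service date too; and for a 2-3 hour forecast horizon the lags would have to come from at least that far back, otherwise the model sees information it wouldn't have at prediction time.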

model

ch. 8 used linear regression; should we try something like random forest, or is that overkill? (probably start with linear regardless to have more to show at presentation)
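
minimal sketch of the "start with linear" idea: a one-predictor least-squares fit scored by mean absolute error against a predict-the-mean baseline. the numbers are invented and the single predictor (say, a lagged delay) is an assumption; a real model would use many features and a proper library.

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented data: x could be a lagged delay, y the delay to predict (minutes).
train_x, train_y = [0.0, 2.0, 4.0, 6.0], [1.0, 2.0, 3.0, 4.0]
test_x, test_y = [1.0, 5.0], [1.5, 4.0]

slope, intercept = fit_simple_ols(train_x, train_y)
model_pred = [intercept + slope * xi for xi in test_x]

# Naive baseline: always predict the mean training delay.
baseline_pred = [sum(train_y) / len(train_y)] * len(test_y)

model_mae = mae(test_y, model_pred)
baseline_mae = mae(test_y, baseline_pred)
print(model_mae, baseline_mae)
```

scoring every candidate (linear, random forest, etc.) against the same naive baseline on the same test window would make it easier to judge whether the extra complexity pays off.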

predictions

presumably, the prediction target is delay_minutes
