binary model evaluation #20

Open
mattmills49 opened this issue Feb 10, 2017 · 3 comments

@mattmills49 (Owner)

Not sure what the output should be.

@mattmills49 (Owner, Author)

This is a function I wrote for some work analysis:

# requires dplyr (data_frame, filter, group_by, arrange, mutate, summarize, lead, left_join)
library(dplyr)

model_acc <- function(.data, .model){
  # predicted probabilities from the fitted model
  preds <- predict(.model, newdata = .data, type = "response")
  
  # accuracy / discrimination metrics, computed separately for each value of .data$train
  pred_data <- data_frame(actual = .data$Attn, preds = preds, type = .data$train) %>%
    filter(!is.na(preds)) %>%
    group_by(type) %>%
    arrange(desc(preds)) %>%
    mutate(TPR = cumsum(actual) / sum(actual),
           FPR = cumsum(1 - actual) / sum(1 - actual)) %>%
    summarize(MSE = mean((preds - actual)^2),
              AUC = sum(diff(FPR) * na.omit(lead(TPR) + TPR)) / 2,  # trapezoidal rule over the ROC curve
              TPR = mean(preds > .5 & actual == 1),
              TNR = mean(preds <= .5 & actual == 0),
              LSR = mean(actual * log(preds) + (1 - actual) * log(1 - preds)))
  
  # calibration: bin predictions (~1000 obs per bin) and compare mean prediction to mean outcome
  cal <- data_frame(actual = .data$Attn, preds = preds, type = .data$train) %>%
    filter(!is.na(preds)) %>%
    group_by(type) %>%
    mutate(pred_group = cut(preds, breaks = floor(n() / 1000), include.lowest = TRUE)) %>%
    group_by(type, pred_group) %>%
    summarize(mean_pred = mean(preds), mean_actual = mean(actual)) %>%
    summarize(Bias = mean(mean_pred - mean_actual))
  
  return(left_join(pred_data, cal, by = "type"))
}

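A minimal usage sketch with simulated data: example_data, x1, and fit are made up for illustration, but the Attn and train columns match what model_acc hard-codes.

# purely illustrative data; needs enough rows that the ~1000-obs calibration bins exist
set.seed(1)
example_data <- data.frame(
  x1 = rnorm(10000),
  train = rep(c("train", "test"), each = 5000)
)
example_data$Attn <- rbinom(10000, 1, plogis(example_data$x1))

fit <- glm(Attn ~ x1, data = example_data, family = binomial)

# one row per split with MSE, AUC, TPR, TNR, LSR, and Bias columns
model_acc(example_data, fit)
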
Positives:

  • Easily calculates model scores including MSE, AUC, True Positive Rate, True Negative Rate, and the Logistic Scoring Rule
  • Returns results in a data frame

Negatives:

  • Doesn't generalize to new data: the prediction column, the dependent column, and any grouping columns are hard-coded
  • Takes in both the model and data frame, which I'm not sure is necessary.
  • Calculates all the metrics. What if you only want one? What if you want to add one?

@mattmills49 (Owner, Author) commented Mar 14, 2017

To generalize, we could probably use formulas; for example, the call

binary_model_evaluation <- function(.data = model_data, prediction_formula = dependent ~ prediction, group_var = "split")

would tell us that, in the .data data frame, the dependent column is our "Y" variable and the prediction column holds the predictions from the model we are validating. We could also include a grouping variable (group_var) that would let us calculate the accuracy measurements on each split of the data (think a train/test split, or a time-based variable like season).
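A rough sketch of what that formula interface might look like, assuming dplyr; the function body, the helper column names, and the pred_prob column in the example call are illustrative only, not settled API.

library(dplyr)

binary_model_evaluation <- function(.data, prediction_formula, group_var = NULL) {
  # LHS of the formula names the observed 0/1 column, RHS names the predicted-probability column
  vars <- all.vars(prediction_formula)

  eval_data <- tibble(
    actual = .data[[vars[1]]],
    preds  = .data[[vars[2]]],
    group  = if (is.null(group_var)) "all" else .data[[group_var]]
  ) %>%
    filter(!is.na(preds)) %>%
    group_by(group) %>%
    arrange(desc(preds), .by_group = TRUE)

  eval_data %>%
    mutate(tpr_curve = cumsum(actual) / sum(actual),
           fpr_curve = cumsum(1 - actual) / sum(1 - actual)) %>%
    summarize(MSE = mean((preds - actual)^2),
              AUC = sum(diff(fpr_curve) * na.omit(lead(tpr_curve) + tpr_curve)) / 2,
              TPR = mean(preds > 0.5 & actual == 1),
              TNR = mean(preds <= 0.5 & actual == 0),
              LSR = mean(actual * log(preds) + (1 - actual) * log(1 - preds)))
}

# e.g. binary_model_evaluation(model_data, Attn ~ pred_prob, group_var = "train")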

@mattmills49 (Owner, Author)

If we wanted to make the metrics portable, we would have to write each one as a separate function and then pass the different measures in as a list or named vector.
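One possible shape for that, sketched below: each metric is a standalone function of (actual, preds), and the evaluator loops over whatever named list it is handed. The names here (evaluate_metrics, mse_metric, pred_prob, and so on) are made up for illustration.

# each metric is its own function of the observed outcomes and predicted probabilities
mse_metric <- function(actual, preds) mean((preds - actual)^2)
lsr_metric <- function(actual, preds) mean(actual * log(preds) + (1 - actual) * log(1 - preds))

# illustrative evaluator: apply every metric in the named list to each split of the data
evaluate_metrics <- function(.data, actual_col, pred_col, group_var,
                             metrics = list(MSE = mse_metric, LSR = lsr_metric)) {
  groups <- split(.data, .data[[group_var]])
  rows <- lapply(names(groups), function(g) {
    d <- groups[[g]]
    scores <- vapply(metrics, function(f) f(d[[actual_col]], d[[pred_col]]), numeric(1))
    data.frame(type = g, t(scores))
  })
  do.call(rbind, rows)
}

# e.g. evaluate_metrics(model_data, "Attn", "pred_prob", "train",
#                       metrics = list(MSE = mse_metric, LSR = lsr_metric))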
