best boosting AUC? #15
Comments
One thing that might be interesting to try is to use integer encoding for the dates (I seemed to get a far better result with simply depth = 6).
If no overfitting is happening, eta = 0.01 is good enough. Another interesting thing to try, which I always do to optimize AUC, is to re-balance the weights. In particular, setting the ratio of negative to positive case weights will re-balance the two classes, which usually works better for AUC. The effect is more significant on unbalanced datasets, though, so I am not sure what will happen here.
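A minimal sketch of that kind of re-weighting in the xgboost R interface, under the assumption that the elided setting is the scale_pos_weight parameter; synthetic data stands in for the benchmark set:

```r
library(xgboost)

# synthetic stand-in for the benchmark data (X: numeric matrix, y: 0/1 labels)
set.seed(1)
X <- matrix(rnorm(2000 * 5), ncol = 5)
y <- as.numeric(X[, 1] + rnorm(2000) > 0)
dtrain <- xgb.DMatrix(X, label = y)

# assumed re-balancing: weight the positive class by the negative/positive ratio
spw <- sum(y == 0) / sum(y == 1)

params <- list(objective = "binary:logistic", eval_metric = "auc",
               eta = 0.01, max_depth = 6, scale_pos_weight = spw)

bst <- xgb.train(params, dtrain, nrounds = 300)
```

On a well-balanced set the ratio is close to 1, so the effect should be small, which matches the reservation above.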
Well, I deliberately considered that "forbidden" :) See some discussion here: #1 The reason is that I want the benchmark on a dataset with a mix of categoricals and numerics (with more categoricals), similar to an industry/business dataset. So I'm making the day of the week, month etc. somewhat artificially categorical.
If I made them ordinal/numeric, those variables would dominate the prediction importance-wise and the dataset would have too many "numeric" features. So the game between RF and boosting is on, in the sense that those variables need to stay categorical and there is no other feature engineering ;)
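A small illustration of the two encodings being contrasted here, with hypothetical column names standing in for the benchmark's date-derived fields (the actual preprocessing is in the linked repo):

```r
# toy stand-in for the benchmark's date-derived fields
d <- data.frame(Month = c(1, 6, 12), DayOfWeek = c(2, 5, 7))

# categorical encoding (the benchmark's choice): treat each value as a level,
# which later expands into one dummy column per level
d_cat <- transform(d, Month = as.factor(Month), DayOfWeek = as.factor(DayOfWeek))
str(d_cat)

# integer/ordinal encoding (the suggestion above): keep the natural ordering,
# one numeric column per field
d_int <- transform(d, Month = as.integer(Month), DayOfWeek = as.integer(DayOfWeek))
str(d_int)
```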
Re @tqchen's 2nd comment: this dataset is pretty well balanced. What's very handy in xgboost (and missing from the other tools) is the early stopping :) I'm using it in this run and I'll let you know when it finishes: https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R
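A minimal sketch of early stopping in the xgboost R interface, assuming a held-out validation set and current parameter names (the linked script may use older spellings); synthetic data stands in for the benchmark:

```r
library(xgboost)

# synthetic stand-in for a train/validation split
set.seed(1)
X <- matrix(rnorm(2000 * 5), ncol = 5)
y <- as.numeric(X[, 1] + rnorm(2000) > 0)
tr <- 1:1500
dtrain <- xgb.DMatrix(X[tr, ], label = y[tr])
dvalid <- xgb.DMatrix(X[-tr, ], label = y[-tr])

params <- list(objective = "binary:logistic", eval_metric = "auc",
               eta = 0.01, max_depth = 10, subsample = 0.5)

# stop once validation AUC has not improved for 50 rounds
bst <- xgb.train(params, dtrain, nrounds = 5000,
                 watchlist = list(valid = dvalid),
                 early_stopping_rounds = 50)
```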
I might also try decreasing the learning rate over the iterations.
I do not know if there is any theory on decreasing the learning rate. I see you set the subsample parameter; there is another one, colsample_bytree, which sub-samples columns. It usually makes the result less prone to overfitting and the training faster (set it to 0.5 or 0.3).
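For concreteness, a hedged sketch of combining the two sampling knobs mentioned here (values are the suggested ones, not tuned for this benchmark):

```r
# row sub-sampling plus per-tree column sub-sampling, as suggested above
params <- list(objective = "binary:logistic",
               eval_metric = "auc",
               eta = 0.01,
               max_depth = 10,
               subsample = 0.5,        # use half the rows for each tree
               colsample_bytree = 0.5) # use half the columns for each tree
```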
Although GBM usually wins on complicated cases with more features, where many features play an important role, maybe this dataset is not such a case. Since the date fields are made explicitly categorical, there are currently too few integer features.
Hm... a quick google does not bring up anything. Besides VW, there is simulated annealing in various contexts (e.g. neural nets) etc. This might be useful for boosting...
Yeah, I wish I had chosen a dataset with more columns... Anyway, my main focus here is to see which tools can run on 10M rows in decent time and with decent AUC, and the AUC for boosting is pretty close to RF.
Just for reference, I sometimes try this adaptive learning rate method: http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf . It is not implemented in xgboost.
Oh, now I remember reading this http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf which is implemented in H2O deep learning. Btw, do you think this has been adapted for GBMs? @hetong007: when you say "sometimes I try this...", what tool are you using? (Is it boosting/GBM or something else like linear etc.?)
The nature of boosting is very different from a linear solver, so AdaGrad may not be directly applicable here. Actually, even for deep learning, AdaGrad is not the best choice for a common convnet (maybe a safe choice). However, it is actually straightforward to tweak the R/Python code of xgboost to implement a decaying learning rate without touching the C++ part, so maybe it is interesting to try.
Sometimes in my research I will write small prototypes to compare, but basically matrix factorization. @tqchen I think currently we cannot get the gradient of each update from R/Python, so at least the method I posted is not applicable.
Here are some options for decay (after a quick web search): http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40808.pdf I was doing some simulated-annealing-related stuff in physics 20 yrs ago ;) Anyway, maybe it's easy to change at the C++ level in xgboost?
There is a thing called a customized loss function in xgboost, which should do the job in R. Adding a learning rate is equivalent to scaling the h statistics :)
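A rough sketch of that idea in R, assuming a customized logistic objective whose hessian is scaled up over the rounds (scaling h shrinks the leaf values, which acts like a decaying step size; the decay schedule and factor here are made up for illustration):

```r
library(xgboost)

# synthetic stand-in data
set.seed(1)
X <- matrix(rnorm(2000 * 5), ncol = 5)
y <- as.numeric(X[, 1] + rnorm(2000) > 0)
dtrain <- xgb.DMatrix(X, label = y)

# closure keeping an iteration counter; each call scales the hessian up,
# so leaf values (~ -G / (scale * H), ignoring lambda) get smaller over time
make_decaying_logloss <- function(decay = 1.05) {
  iter <- 0
  function(preds, dtrain) {
    iter <<- iter + 1
    labels <- getinfo(dtrain, "label")
    p <- 1 / (1 + exp(-preds))
    scale <- decay ^ iter            # grows each round => smaller effective step
    list(grad = p - labels, hess = scale * p * (1 - p))
  }
}

params <- list(eta = 0.1, max_depth = 6, eval_metric = "auc")
bst <- xgb.train(params, dtrain, nrounds = 100,
                 obj = make_decaying_logloss(1.05),
                 watchlist = list(train = dtrain))
```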
BTW, @szilard maybe it is interesting to do some feature importance analysis on the trees learnt (see this example). I guess the result will have a few very important features.
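A minimal sketch of such an importance analysis with the xgboost R helpers; a toy model stands in here for the benchmark's fitted booster:

```r
library(xgboost)

# toy model in place of the benchmark's fitted booster
set.seed(1)
X <- matrix(rnorm(1000 * 8), ncol = 8)
colnames(X) <- paste0("f", 1:8)
y <- as.numeric(X[, 1] + 0.5 * X[, 2] + rnorm(1000) > 0)
dtrain <- xgb.DMatrix(X, label = y)
bst <- xgb.train(list(objective = "binary:logistic", max_depth = 6, eta = 0.1),
                 dtrain, nrounds = 50)

# per-feature importance (gain, cover, frequency) extracted from the learnt trees
imp <- xgb.importance(feature_names = colnames(X), model = bst)
print(imp)
```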
Yeah, I think I took a quick look at the variable importance in RF in H2O. There are only 8 variables though...
I'll have to look at custom loss functions, probably tomorrow...
@tqchen @hetong007 See some boosting results here (GBM beats RF now): https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines
Great, thanks @szilard
Here is Time/AUC for a few settings:
@tqchen @hetong007 I'm trying to get a good AUC with boosting for the largest dataset (n = 10M). Would be nice to beat random forests :)

So far I did some basic grid search https://github.com/szilard/benchm-ml/blob/master/3-boosting/0-xgboost-init-grid.R for n = 1M (not the largest dataset) and it seems like deeper trees, min_child_weight = 1 and subsample = 0.5 work well.

I'm running now https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R with n = 10M by just looping over max_depth = c(2,5,10,20,50), but it's been running for a while. Any suggestions?

The smallest learning rate I'm using is eta = 0.01; any experience with smaller values?

PS: See results so far here: https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines
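A minimal sketch of the kind of loop described above; the real script is the linked 6a-xgboost-grid.R, and the toy data, nrounds, early-stopping setting and the best_score field used for reporting are assumptions for illustration:

```r
library(xgboost)

# toy stand-in for the benchmark's train/validation matrices
set.seed(1)
X <- matrix(rnorm(5000 * 8), ncol = 8)
y <- as.numeric(X[, 1] + rnorm(5000) > 0)
tr <- 1:4000
dtrain <- xgb.DMatrix(X[tr, ], label = y[tr])
dvalid <- xgb.DMatrix(X[-tr, ], label = y[-tr])

for (md in c(2, 5, 10, 20, 50)) {
  params <- list(objective = "binary:logistic", eval_metric = "auc",
                 eta = 0.01, max_depth = md,
                 min_child_weight = 1, subsample = 0.5)
  t0 <- proc.time()
  bst <- xgb.train(params, dtrain, nrounds = 1000,
                   watchlist = list(valid = dvalid),
                   early_stopping_rounds = 50, verbose = 0)
  elapsed <- (proc.time() - t0)[3]
  cat("max_depth =", md,
      " best valid AUC =", bst$best_score,
      " time =", round(elapsed), "s\n")
}
```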