LightGBM results #46
Comments
wow! microsoft wins again!
@earino 😱
@szilard The timing result of LightGBM is much slower than I expected. I will try to figure out why. Update: I think I figured out why... Thanks
@guolinke thank you for looking into this. First, LightGBM is right there among the fastest tools already. There are many tools I've tried that are 10x slower than the state of the art. This kind of data is very common in business (fraud detection, churn prediction, credit scoring etc.). It has categorical variables that are effectively translated to sparse features. I started the benchmark with this focus in mind (the airline dataset is similar in structure/size), but of course I recognize that it would be desirable to benchmark the algorithms on a wider variety of data (it's just that my time is pretty limited).
Try ordinal encoding all the categorical features into integers.
@zachmayer you already asked about "integer encoding" 1 year ago: #22. Again, the point of this benchmark is not to obtain the highest accuracy possible with feature engineering, but to see how the algos run (time) and perform (accuracy) on (the same) data that resembles fraud/churn/credit scoring, i.e. a mix of numeric and categorical variables, with more categoricals. So I'm deliberately making Month, DoW etc. categorical to mimic that with the airline dataset.
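To make the difference between the two encodings concrete, here is a minimal R sketch; the column and file names are assumed from the airline benchmark data and may not match the actual scripts.

```r
# Minimal sketch of the two encodings discussed above; column/file names are
# assumed from the airline benchmark data and may not match the actual scripts.
d <- read.csv("train-1m.csv", stringsAsFactors = TRUE)   # hypothetical file name

# What the benchmark does: keep Month, DayOfWeek etc. as categorical factors
d$Month     <- as.factor(d$Month)
d$DayOfWeek <- as.factor(d$DayOfWeek)

# The suggested "ordinal integer encoding": map each category to an integer,
# which imposes an (arbitrary) ordering on the categories
d_ord <- d
d_ord$UniqueCarrier <- as.integer(as.factor(d_ord$UniqueCarrier))
d_ord$Origin        <- as.integer(as.factor(d_ord$Origin))
```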
@szilard Hah, thanks for the reminder. I came here through a different path this time, and didn't realize it was the same site. Anyways, thanks for the benchmarks. They're very useful.
@zachmayer Sure, no worries :) Thanks for the feedback :)
@szilard I am trying to let LightGBM support categorical features directly (without one-hot encoding). I think it can bring a further speed-up in such tasks.
Yeah, I've seen a 10x speedup (and less RAM usage) from using a sparse vs dense representation, e.g. in glmnet or xgboost.
Does xgboost support categorical features without one-hot encoding?
Well, yes and no :) h2o.ai supports categorical data directly (without doing one-hot encoding). For xgboost you need to do one-hot encoding to get numerical values, but what I was referring to above is that you can have either a dense or a sparse representation (e.g. in R using model.matrix or sparse.model.matrix), and the latter is a huge speedup.
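A rough R sketch of that dense vs sparse point; the file name, column names and parameter values are placeholders, not the benchmark's actual settings.

```r
# Sketch of dense vs sparse one-hot encoding for xgboost; assumes a data frame
# with factor predictors and a Y/N target dep_delayed_15min (placeholder names).
library(Matrix)
library(xgboost)

d <- read.csv("train-1m.csv", stringsAsFactors = TRUE)   # hypothetical file name
y <- ifelse(d$dep_delayed_15min == "Y", 1, 0)

X_dense  <- model.matrix(dep_delayed_15min ~ . - 1, data = d)         # dense one-hot
X_sparse <- sparse.model.matrix(dep_delayed_15min ~ . - 1, data = d)  # sparse (dgCMatrix)

# xgboost accepts both; the sparse matrix is typically much faster and uses less RAM
dtrain <- xgb.DMatrix(data = X_sparse, label = y)
md <- xgb.train(params = list(objective = "binary:logistic",
                              max_depth = 10, eta = 0.1),
                data = dtrain, nrounds = 100)
```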
@szilard I finished this part; it speeds up 10x on my desktop (i7, 4 cores) with almost the same accuracy. Scripts:
BTW, you should clone the
Fantastic, @guolinke! It's 8x faster than before :)
Just to make sure, you are not using ordinal integer encoding, right? I mean this: #1
@szilard Thanks. It is the same speed-up as my benchmark result on 16-core machines. For numerical features, we find a threshold, and the left child is (value <= threshold), the right child is (value > threshold). For categorical features, we find a category k, and the left child is (value == k), the right child is (value != k).
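A toy R illustration of those two split rules (made-up values, not LightGBM internals):

```r
# Toy illustration of the two split types described above (made-up values).
x_num <- c(3.2, 7.5, 1.1, 9.0)
x_cat <- factor(c("AA", "UA", "AA", "DL"))

# Numerical feature: split on a threshold
threshold <- 5.0
left_num <- x_num <= threshold      # TRUE goes to the left child

# Categorical feature: split as one category vs the rest (no ordering imposed)
k <- "AA"
left_cat <- x_cat == k              # TRUE goes to the left child
```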
Since I'm a fanatical R user, I hope development of the R package will now get traction ;)
@szilard Sure, the R package is an urgent item.
As you've seen, I started a new GitHub repo for quickly benchmarking the best new GBMs: https://github.com/szilard/GBM-perf. Therefore I'm closing this old issue here.
New GBM implementation released by Microsoft: https://github.com/Microsoft/LightGBM
Results on the 10M dataset, r3.8xlarge, trying to match the xgboost & LightGBM params.
Code to get the results here
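The exact settings are in the linked code; the R sketch below only illustrates the kind of parameter alignment meant between the two libraries (placeholder values, not the benchmark's).

```r
# Placeholder sketch of roughly "matched" parameters between xgboost and LightGBM;
# the actual values used for the results are in the linked code.
xgb_params <- list(
  objective = "binary:logistic",
  max_depth = 10,          # depth-wise tree growth
  eta       = 0.1          # learning rate
)

lgb_params <- list(
  objective     = "binary",
  num_leaves    = 1024,    # leaf-wise growth; 2^max_depth as a rough analogue
  learning_rate = 0.1
)
```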