
LightGBM results #46

Closed
szilard opened this issue Nov 10, 2016 · 18 comments

@szilard
Owner

szilard commented Nov 10, 2016

New GBM implementation released by Microsoft: https://github.com/Microsoft/LightGBM

On the 10M-row dataset, on an r3.8xlarge instance, trying to match the xgboost & LightGBM params:

xgboost:    nround = 100, max_depth = 10, eta = 0.1
LightGBM 1: num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=100
LightGBM 2: num_iterations=100  learning_rate=0.1  num_leaves=512   min_data_in_leaf=100
LightGBM 3: num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=0
Tool         time (s)   AUC
xgboost         350     0.7511
LightGBM 1      500     0.7848
LightGBM 2      350     0.7729
LightGBM 3      450     0.7897

Code to get the results here
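
For orientation only (the actual benchmark code is linked above), a minimal Python-API sketch of how these settings are meant to correspond: xgboost's max_depth = 10 allows at most 2^10 = 1024 leaves per tree, which is what num_leaves = 1024 caps in LightGBM's leaf-wise growth. The random data below is just a placeholder, not the airline data:

import numpy as np
import xgboost as xgb
import lightgbm as lgb

# placeholder data, standing in for the 1-hot encoded airline features
X = np.random.rand(10000, 8)
y = np.random.randint(0, 2, 10000)

# xgboost: 100 rounds of depth-limited trees
xgb_model = xgb.train({"eta": 0.1, "max_depth": 10, "objective": "binary:logistic"},
                      xgb.DMatrix(X, label=y), num_boost_round=100)

# LightGBM 1: 100 iterations of leaf-count-limited trees
lgb_model = lgb.train({"learning_rate": 0.1, "num_leaves": 1024,
                       "min_data_in_leaf": 100, "objective": "binary"},
                      lgb.Dataset(X, label=y), num_boost_round=100)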

@earino
Contributor

earino commented Nov 10, 2016

wow! microsoft wins again!

@szilard
Owner Author

szilard commented Nov 10, 2016

@earino 😱

@guolinke

guolinke commented Nov 11, 2016

@szilard The timing results for LightGBM are much slower than I expected. I will try to figure out why.
Thanks.

Update:

I think I figured out why...
Almost all features in your data are sparse features. Although we have optimized for sparse features, they still carry some additional cost.
I think we can continue to optimize for sparse features.

Thanks

@szilard
Owner Author

szilard commented Nov 11, 2016

@guolinke thank you for looking into this

First, LightGBM is right there among the fastest tools already. There are many tools I've tried that are 10x slower than the state-of-the-art.

This kind of data is very common in business (fraud detection, churn prediction, credit scoring, etc.). It has categorical variables that are effectively translated to sparse features. I started the benchmark with this focus in mind (the airline dataset is similar in structure/size), but of course I recognize that it would be desirable to benchmark the algorithms on a wider variety of data (it's just that my time is pretty limited).
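
As an illustration of that point (not part of the benchmark code), a short sketch of why 1-hot encoding these columns yields a sparse matrix; the column names come from the airline dataset, the tiny frame here is made up:

import pandas as pd

# made-up, miniature version of the airline categoricals
d = pd.DataFrame({"Month": ["c-1", "c-7", "c-7"],
                  "UniqueCarrier": ["AA", "DL", "WN"],
                  "Origin": ["SFO", "ORD", "JFK"]})

X = pd.get_dummies(d)          # one 0/1 column per category level
density = X.values.mean()      # fraction of non-zero entries
print(X.shape, density)        # each row has exactly one 1 per original column,
                               # so density shrinks as the number of levels grows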

@zachmayer

Try ordinal encoding all the categorical features into integers
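
For concreteness, a minimal sketch (toy data, not from the benchmark) of what that integer encoding looks like in pandas:

import pandas as pd

d = pd.DataFrame({"UniqueCarrier": ["AA", "DL", "AA", "WN"]})
# ordinal/integer encoding: each category level becomes an arbitrary integer code,
# and the tree then treats the column as numeric
d["UniqueCarrier"] = d["UniqueCarrier"].astype("category").cat.codes
print(d)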

@szilard
Owner Author

szilard commented Nov 14, 2016

@zachmayer you already asked about "integer encoding" 1 year ago #22

Again, the point of this benchmark is not to obtain the highest accuracy possible with feature engineering, but to see how the algos run (time) and perform (accuracy) on (the same) data that resembles fraud/churn/credit scoring, i.e. a mix of numeric and categorical variables with more categoricals. So I'm deliberately treating Month, DoW, etc. as categoricals to mimic that with the airline dataset.

@zachmayer

@szilard Hah, thanks for the reminder. I came here through a different path this time, and didn't realize it was the same site.

Anyways, thanks for the benchmarks. They're very useful.

@szilard
Owner Author

szilard commented Nov 14, 2016

@zachmayer Sure, no worries :) Thanks for feedback :)

@guolinke

guolinke commented Dec 1, 2016

@szilard I am trying to let LightGBM support categorical features directly (without one-hot coding). I think it can further speed up such tasks.

@szilard
Owner Author

szilard commented Dec 1, 2016

Yeah, I've seen 10x speedups (and lower RAM usage) from using a sparse vs. a dense representation, e.g. in glmnet or xgboost.

@guolinke

guolinke commented Dec 1, 2016

Does xgboost support categorical features without one-hot coding?
I don't mean simply using a categorical feature as a numerical feature.

@szilard
Owner Author

szilard commented Dec 1, 2016

Well, yes and no :)

h2o.ai supports categorical data directly (without doing 1-hot encoding).

For xgboost you need to do 1-hot encoding to get numerical values, but what I was referring to above is that you can use either a dense or a sparse representation (e.g. in R using model.matrix or sparse.model.matrix), and the latter is a huge speedup.
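
A Python analogue of that dense-vs-sparse choice (a hedged sketch; the benchmark itself uses R's model.matrix / sparse.model.matrix), assuming a small made-up frame of categoricals:

import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

d = pd.DataFrame({"UniqueCarrier": ["AA", "DL", "AA", "WN"],
                  "Origin": ["SFO", "ORD", "JFK", "SFO"]})
y = [0, 1, 0, 1]

# dense 1-hot: one column per level, mostly zeros, all stored explicitly
X_dense = pd.get_dummies(d).astype(float)

# sparse 1-hot: same information in a scipy.sparse matrix (only non-zeros stored)
X_sparse = OneHotEncoder().fit_transform(d)

# xgboost accepts both; on wide categorical data the sparse version is much
# faster to build and train on, and uses far less RAM
dtrain_dense  = xgb.DMatrix(X_dense,  label=y)
dtrain_sparse = xgb.DMatrix(X_sparse, label=y)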

@guolinke

guolinke commented Dec 3, 2016

@szilard I finished this part; it gives a ~10x speed-up on my desktop (i7, 4 cores) with almost the same accuracy.
I am not sure about the performance on 32-core servers (since there are only 8 features, and LightGBM multi-threads over features, it will only use 8 cores).
Can you help benchmark this?

scripts:

import pandas as pd
import numpy as np
from sklearn.datasets import dump_svmlight_file
from sklearn.preprocessing import LabelEncoder

d = pd.read_csv("all.csv")
X = d.drop('dep_delayed_15min', axis=1)
y = d[["dep_delayed_15min"]]

categorical_names = ["Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest"]

# encode each categorical column to integer codes
# (LightGBM treats them as categorical via categorical_column below)
for name in categorical_names:
    le = LabelEncoder().fit(X[name])
    X[name] = le.transform(X[name])

# target: "Y"/"N" -> 1/0
y_num = np.where(y == "Y", 1, 0)[:, 0]
dump_svmlight_file(X, y_num, 'all.libsvm')


head -10000000 all.libsvm > train.libsvm
tail -100000 all.libsvm > test.libsvm

time lightgbm data=train.libsvm  task=train  objective=binary \
     num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=100 categorical_column=0,1,2,4,5,6

time  lightgbm  data=train.libsvm  task=train  objective=binary \
     num_iterations=100  learning_rate=0.1  num_leaves=512  min_data_in_leaf=100 categorical_column=0,1,2,4,5,6

time lightgbm data=train.libsvm  task=train  objective=binary \
     num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=0 categorical_column=0,1,2,4,5,6

BTW, you should clone the categorical-feature-support branch (will merge to master soon):

git clone --recursive https://github.com/Microsoft/LightGBM -b categorical-feature-support
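
To get the AUC for these runs, one option (a hedged sketch; it assumes the CLI writes the trained model to its default LightGBM_model.txt file) is to score test.libsvm from Python:

import lightgbm as lgb
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import roc_auc_score

# load the held-out rows written by the tail command above (8 features in this dataset)
X_test, y_test = load_svmlight_file("test.libsvm", n_features=8)

# load the model produced by the CLI training run (default output_model name assumed)
bst = lgb.Booster(model_file="LightGBM_model.txt")

p = bst.predict(X_test)
print(roc_auc_score(y_test, p))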

@szilard
Owner Author

szilard commented Dec 6, 2016

Fantastic @guolinke It's 8x faster than before :)

Tool         time (s)   AUC
xgboost         350     0.7511
LightGBM 1       60     0.7857
LightGBM 2       50     0.7747
LightGBM 3       55     0.7889

Just to make sure, you are not using ordinal integer encoding, right? I mean this: #1
(that would inflate the AUC for this dataset, but would probably not work with other datasets that have "real" categorical data)

@guolinke

guolinke commented Dec 6, 2016

@szilard Thanks, that is the same speed-up I see in my benchmark on a 16-core machine.
I use the categorical features directly. Actually, trees can handle categorical features natively (I am not sure why xgboost didn't enable this); the split rule is just different from the one for numerical features.

For numerical features, we find a threshold; the left child is (<= threshold), the right child is (> threshold).

For categorical features, we find a category A; the left child is (is A), the right child is (not A). I think this is equivalent to one-hot coding (let one category = 1 and the others = 0, so the split rule is again (is A) vs (not A)).
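
A toy illustration of the two split rules described above (not LightGBM's internal code, just the decision rules applied to integer-coded data):

import numpy as np

x_num = np.array([3.2, 1.5, 7.8, 0.4])
threshold = 2.0
goes_left_num = x_num <= threshold     # numerical split: (<= threshold) vs (> threshold)

x_cat = np.array([4, 2, 4, 7])         # integer-coded category values
A = 4
goes_left_cat = (x_cat == A)           # categorical split: (is A) vs (not A)
# this one-vs-rest rule equals a threshold split on the 1-hot column for category A

print(goes_left_num, goes_left_cat)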

@szilard
Owner Author

szilard commented Dec 6, 2016

The (is A) / (not A) split sounds great. Congrats on the huge speedup!

Since I'm a fanatical R user, I hope development of the R package will now get traction ;)

@guolinke

guolinke commented Dec 6, 2016

@szilard Sure, the R package is an urgent item.

@szilard
Owner Author

szilard commented May 30, 2017

As you've seen, I started a new GitHub repo for quickly benchmarking the best new GBMs: https://github.com/szilard/GBM-perf

Therefore I'm closing this old issue here.

@szilard szilard closed this as completed May 30, 2017