sklearn HistGBT #36

Open
szilard opened this issue Sep 10, 2020 · 23 comments

Comments

@szilard
Owner

szilard commented Sep 10, 2020

New tool https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html

based on POC https://github.com/ogrisel/pygbm mentioned earlier here #15

@szilard szilard mentioned this issue Sep 10, 2020
@szilard
Owner Author

szilard commented Sep 10, 2020

Code to compare HistGBT to lightgbm:

import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")

d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))


md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

Looks like HistGBT does not support sparse matrices:

Traceback (most recent call last):
  File "run.py", line 43, in <module>
    md.fit(X_train, y_train)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 121, in fit
    X, y = self._validate_data(X, y, dtype=[X_DTYPE],
  File "/usr/local/lib/python3.8/dist-packages/sklearn/base.py", line 432, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 795, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 575, in check_array
    array = _ensure_sparse_format(array, accept_sparse=accept_sparse,
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 353, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

@szilard
Owner Author

szilard commented Sep 10, 2020

Experiments on m5.2xlarge (8 cores, 32GB RAM):

lightgbm:

md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
1.8572165966033936
0.7300012836184978

Will need dense matrices for lightgbm as well, for a fair comparison to HistGBT:

X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()

md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
2.1161305904388428
0.7300012836184978
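
The `time.time()` bookkeeping repeated in these snippets could be wrapped in a small helper. A minimal sketch using only the standard library; the name `timed` and the toy workload are hypothetical and not part of the benchmark:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with timed("toy workload", results):
    # Stand-in for md.fit(X_train, y_train); any workload can go here.
    total = sum(i * i for i in range(100_000))

print(f"toy workload: {results['toy workload']:.4f}s")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring intervals.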

@szilard
Owner Author

szilard commented Sep 10, 2020

HistGBT with dense matrices:

md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

The process slowly fills up the RAM until it runs out of memory and the OS kills it:

python3 run.py
Killed

@szilard
Owner Author

szilard commented Sep 10, 2020

HistGBT does not crash (OOM) with a smaller max_leaf_nodes:

md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
34.628803730010986
0.731705293267934

which is considerably slower than lightgbm:

md = lgb.LGBMClassifier(num_leaves=128, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
0.948988676071167
0.7334361781282212
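
To put a number on "considerably slower", the two timings above can be divided directly. A small worked calculation using only the figures reported in this comment:

```python
# Training times in seconds, copied from the two runs above (128 leaves each).
hist_gbt_seconds = 34.628803730010986
lightgbm_seconds = 0.948988676071167

slowdown = hist_gbt_seconds / lightgbm_seconds
print(f"HistGBT / lightgbm training time: {slowdown:.1f}x")
```

On this one-hot encoded, dense input HistGBT takes roughly 36 times as long as lightgbm, while the AUC values differ only in the third decimal place.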

@szilard
Owner Author

szilard commented Sep 11, 2020

Memory usage:

md = lgb.LGBMClassifier(num_leaves=128, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

RAM usage on server:

  • after loading data, sparse matrix: 0.7 GB
  • after transforming to dense matrix 1.79 GB
  • while training lightgbm: 1.80 GB
md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
  • while training HistGBT: 14.5 GB

HistGBT with fewer trees:

md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=10)

6.3 GB

HistGBT with shallower trees:

md = HistGradientBoostingClassifier(max_leaf_nodes=16, learning_rate=0.1, max_iter=100)

3.3 GB
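
How the RAM figures above were collected is not stated in the thread. One way to get a comparable number from inside the script is the process's peak resident set size. A minimal sketch, assuming a Unix system (the `resource` module is not available on Windows); `peak_rss_gb` is a hypothetical helper name:

```python
import resource
import sys


def peak_rss_gb():
    """Peak resident set size of this process so far, in GB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1e9


before = peak_rss_gb()
# Stand-in for md.fit(...): a list of ten million references
# (about 80 MB on 64-bit CPython).
buffer = [0.0] * 10_000_000
after = peak_rss_gb()
print(f"peak RSS before: {before:.3f} GB, after: {after:.3f} GB")
```

Calling the helper right after each `md.fit(...)` would give the high-water mark up to that point. Because it is a peak, it never decreases, so each model is best measured in a fresh process.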

@szilard
Owner Author

szilard commented Sep 11, 2020

Summary so far:

HistGBT (sklearn's HistGradientBoostingClassifier) currently does not support sparse matrices for encoding categorical variables.

On the airline dataset (with dense matrices) it uses a lot of memory (lightgbm does not use considerable memory other than the data - even with dense matrices). Is this a memory leak? The amount of memory increases with number of trees and the depth of trees. It runs out of memory even for small data/not too deep trees.

It also runs slow compared to lightgbm (both on dense matrices).

Am I doing something wrong? @amueller @ogrisel @Laurae2 Is this because of categorical data (vs previous benchmarks were on numeric data)? Can I change something in the code to make it better?

@ogrisel

ogrisel commented Sep 11, 2020

HistGBT (sklearn's HistGradientBoostingClassifier) currently does not support sparse matrices for encoding categorical variables.

This is a known limitation. However for categorical variables there are three solutions that do not involve sparse preprocessing of the training data:

On the airline dataset (with dense matrices) it uses a lot of memory (lightgbm does not use considerable memory other than the data - even with dense matrices). Is this a memory leak? The amount of memory increases with number of trees and the depth of trees. It runs out of memory even for small data/not too deep trees.

Do you use scikit-learn master? We recently fixed some cyclic references that prevented the GC from releasing memory in a timely fashion: scikit-learn/scikit-learn#18334

If you want to try the master branch without building from source, feel free to use the nightly builds:

https://scikit-learn.org/0.21/developers/advanced_installation.html#installing-nightly-builds
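
For readers unfamiliar with why cyclic references delay memory release in CPython, here is a generic illustration. It is not the code touched by the scikit-learn fix; the `Node` class is a toy made up for this sketch:

```python
import gc
import weakref


class Node:
    """Toy object that can hold a reference to another Node."""

    def __init__(self):
        self.other = None


gc.disable()  # keep the cycle collector from running on its own

a, b = Node(), Node()
a.other, b.other = b, a      # reference cycle: a -> b -> a
probe = weakref.ref(a)       # lets us observe whether the object is still alive

del a, b
# Reference counting alone cannot free the pair: each still references the other.
alive_before_collect = probe() is not None

gc.collect()                 # the cycle collector finds and frees the pair
alive_after_collect = probe() is not None

gc.enable()
print(alive_before_collect, alive_after_collect)
```

Objects caught in a cycle are only reclaimed when the cycle collector runs, so large buffers referenced from such objects can pile up between collections.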

It also runs slow compared to lightgbm (both on dense matrices).

lightgbm 3.0 is known to be ~2x faster than scikit-learn, probably because of the new row-wise parallelism in histograms:

microsoft/LightGBM#2791 (comment)

Also, on hyper-threaded machines, you want to limit the number of threads explicitly with OMP_NUM_THREADS=number_of_physical_cores python benchmark.py to avoid over-subscription issues. We want to do that automatically but this needs a bit of work.
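
The same limit can also be set from inside the script, as long as it happens before the numerical libraries are imported. A minimal sketch; halving the logical CPU count is an assumption (two hardware threads per physical core) and will be wrong on machines without hyper-threading:

```python
import os

# Assumption: hyper-threading exposes two logical CPUs per physical core.
# os.cpu_count() reports logical CPUs, so halve it as a rough estimate.
logical_cpus = os.cpu_count() or 1
physical_estimate = max(1, logical_cpus // 2)

# This should run before importing lightgbm or scikit-learn, since OpenMP
# runtimes typically read the variable when they are initialised.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("OMP_NUM_THREADS", str(physical_estimate))

print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```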

@szilard
Owner Author

szilard commented Sep 11, 2020

Thanks @ogrisel for the very quick answer.

Encoding: yeah, I know I could use ordinal or target encoding. In fact there was some discussion of this in 2015 on exactly this dataset, and it seemed that ordinal encoding could actually get a better AUC (in random forest) than 1-hot. I guess the reason I used 1-hot in the benchmark is that it has been (used to be?) the preferred method for practitioners, and it was also the common denominator for all packages (e.g. I kept using 1-hot even with lightgbm in the benchmark, even after lightgbm started using "direct" encoding).

So maybe I should try ordinal encoding (and others), but then I should do the same with all the tools. Or at least try it out. But then of course there are so many other things I should also do (e.g. using more datasets of different structure, sparsity etc) to make the benchmark more meaningful. All I managed to do is create a list a while ago.

@szilard
Owner Author

szilard commented Sep 11, 2020

Thanks @ogrisel for the suggestions. I just used the latest release; I'll try out the nightly build and also OMP_NUM_THREADS, and will add the results here.

@szilard
Owner Author

szilard commented Sep 11, 2020

Keeping track of things:

sudo apt install python3-pip
sudo pip3 install -U pandas lightgbm scikit-learn

Data size in RAM:

import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
import sys

d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")

d_all = pd.concat([d_train,d_test])
sys.getsizeof(d_all)/1e6

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

sys.getsizeof(d_all)/1e6
  
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()

X_all_cat.data.nbytes/1e6
X_all.data.nbytes/1e6

X_all_DENSE = X_all.toarray()

X_all_DENSE.nbytes/1e6
>>> sys.getsizeof(d_all)/1e6
88.390248
>>> sys.getsizeof(d_all)/1e6
26.000016
>>> X_all_cat.data.nbytes/1e6
9.6
>>> X_all.data.nbytes/1e6
12.8
>>> X_all_DENSE.nbytes/1e6
1102.4
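
The sizes above are enough to recover the shape of the dense matrix. A worked calculation, assuming every value is stored as float64 (8 bytes, the `OneHotEncoder` default dtype); note that `.data.nbytes` counts only the stored values of a sparse matrix, not its index arrays:

```python
BYTES_PER_FLOAT64 = 8  # assumption: both matrices store float64 values

# Figures reported above, converted from MB to bytes.
onehot_data_bytes = 9_600_000        # X_all_cat.data.nbytes
sparse_data_bytes = 12_800_000       # X_all.data.nbytes
dense_bytes = 1_102_400_000          # X_all_DENSE.nbytes
n_categorical = 6                    # len(vars_cat)
n_numeric = 2                        # len(vars_num)

# Each row has exactly one non-zero per one-hot encoded categorical variable.
n_rows = onehot_data_bytes // BYTES_PER_FLOAT64 // n_categorical
n_cols = dense_bytes // BYTES_PER_FLOAT64 // n_rows
n_onehot_cols = n_cols - n_numeric

print(f"rows={n_rows}, dense columns={n_cols}, one-hot columns={n_onehot_cols}")
print(f"dense / sparse data size: {dense_bytes / sparse_data_bytes:.0f}x")
```

So the dense matrix has 200,000 rows and 689 columns (687 one-hot columns plus the 2 numeric ones) and is about 86 times larger than the stored values of the sparse one.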

@ogrisel

ogrisel commented Sep 11, 2020

I was too curious, here are the results with ordinal encoding:

import pandas as pd
import sklearn
from sklearn import preprocessing
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.compose import ColumnTransformer

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
vars_cat = ["Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime", "Distance"]
input_all = vars_cat + vars_num

ordinal_encoder = preprocessing.OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

preprocessor = ColumnTransformer([
        ("cat", ordinal_encoder, vars_cat),
    ],
    remainder="passthrough",
)
X_train = preprocessor.fit_transform(d_train[vars_cat + vars_num])
y_train = (d_train["dep_delayed_15min"] == "Y").values

X_test = preprocessor.transform(d_test[vars_cat + vars_num])
y_test = (d_test["dep_delayed_15min"] == "Y").values

print(f"n_samples={X_train.shape[0]}")
print(f"n_features={X_train.shape[1]}")

print(f"LightGBM {lgb.__version__}:")
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(f"  - training time: {time.time() - start_time:.3f}s")

y_pred = md.predict_proba(X_test)[:, 1]
print(f"  - ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.3f}")

print(f"scikit-learn {sklearn.__version__}:")
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1,
                                    max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(f"  - training time: {time.time() - start_time:.3f}s")

y_pred = md.predict_proba(X_test)[:, 1]
print(f"  - ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.3f}")

I set: OMP_NUM_THREADS=4 (this dataset has too few features to really benefit from many threads and both lightgbm and scikit-learn suffer from over-subscription):

n_samples=100000
n_features=8
LightGBM 3.0.0:
  - training time: 1.508s
  - ROC AUC: 0.718
scikit-learn 0.24.dev0:
  - training time: 3.487s
  - ROC AUC: 0.718

So indeed LightGBM 3.0 is a bit more than 2x faster than scikit-learn master, but there is some variability on such short runs.
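
For readers who have not used ordinal encoding, here is a pure-Python stand-in for what `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)` does in the script above: each categorical variable stays a single integer column, and categories not seen during fitting map to -1. The carrier codes are made up for illustration:

```python
def fit_ordinal(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping, unknown_value=-1):
    """Encode values; categories unseen during fit get `unknown_value`."""
    return [mapping.get(value, unknown_value) for value in values]


# Hypothetical carrier codes, for illustration only.
train_carriers = ["AA", "UA", "DL", "AA", "WN"]
test_carriers = ["DL", "B6", "AA"]   # "B6" never appears in training

mapping = fit_ordinal(train_carriers)
encoded_test = transform_ordinal(test_carriers, mapping)
print(mapping)        # {'AA': 0, 'DL': 1, 'UA': 2, 'WN': 3}
print(encoded_test)   # [1, -1, 0]
```

This is why the run above reports `n_features=8`: six ordinal columns plus two numeric ones, instead of one column per category.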

@szilard
Owner Author

szilard commented Sep 11, 2020

Nice @ogrisel, I was working on that as well, you got there first 👍

It's then weird that with 1-hot encoding HistGBT is 30x slower (see above). It's just another data matrix (though wider) and it has mostly 0s and 1s (though I use dense matrices, so lightgbm has no knowledge of that either).

@ogrisel

ogrisel commented Sep 11, 2020

It's expected for high cardinality features: it has a lot more work to do to build the histograms for the expanded features and for each such feature we treat the majority of 0s as any other value. LightGBM is very smart at handling sparse features but this is not (yet) implemented in scikit-learn.
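
A toy model of the point about histogram work: when every value of a dense column is visited, the number of bin updates grows with the number of columns, so one-hot expansion multiplies the work. This is a simplified pure-Python sketch, not scikit-learn's actual implementation; all names and sizes are made up:

```python
import random


def build_histograms(binned_rows, gradients, n_bins):
    """Sum gradients per (feature, bin) and count the bin updates performed."""
    n_features = len(binned_rows[0])
    histograms = [[0.0] * n_bins for _ in range(n_features)]
    updates = 0
    for row, gradient in zip(binned_rows, gradients):
        for feature, bin_index in enumerate(row):
            # A zero in a dense one-hot column is visited like any other value.
            histograms[feature][bin_index] += gradient
            updates += 1
    return histograms, updates


random.seed(0)
n_samples, n_categories = 1_000, 50
categories = [random.randrange(n_categories) for _ in range(n_samples)]
gradients = [random.uniform(-1.0, 1.0) for _ in range(n_samples)]

# Ordinal encoding: a single column whose bin is the category code.
ordinal_rows = [[c] for c in categories]
# Dense one-hot encoding: n_categories columns, each with bins {0, 1}.
onehot_rows = [[1 if c == j else 0 for j in range(n_categories)] for c in categories]

_, ordinal_updates = build_histograms(ordinal_rows, gradients, n_bins=n_categories)
_, onehot_updates = build_histograms(onehot_rows, gradients, n_bins=2)
print(ordinal_updates, onehot_updates)   # 1000 50000
```

The one-hot version performs 50 times as many updates for the same information, and almost all of them land in the "0" bin.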

@szilard
Owner Author

szilard commented Sep 11, 2020

Here's my version (with LabelEncoder, pardon me, I'm mainly an R guy LOL):

import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")

d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat].values
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))


md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

Timings:

LGBMClassifier(num_leaves=512)
>>> print(time.time() - start_time)
1.4476277828216553
>>> print(metrics.roc_auc_score(y_test, y_pred))
0.7177775781298882

HistGradientBoostingClassifier(max_leaf_nodes=512)
>>> print(time.time() - start_time)
3.791149616241455
>>> print(metrics.roc_auc_score(y_test, y_pred))
0.7164761138428702

@szilard
Owner Author

szilard commented Sep 11, 2020

It's expected for high cardinality features: it has a lot more work to do to build the histograms for the expanded features and for each such feature we treat the majority of 0s as any other value. LightGBM is very smart at handling sparse features but this is not (yet) implemented in scikit-learn.

I used dense matrices with lightgbm as well, are you suggesting lightgbm might do a "preprocessing" step to lump the 0s into a bin and that does not need to be repeated over and over the iterations or something like that?

@ogrisel

ogrisel commented Sep 11, 2020

I used dense matrices with lightgbm as well, are you suggesting lightgbm might do a "preprocessing" step to lump the 0s into a bin and that does not need to be repeated over and over the iterations or something like that?

Indeed I did not get that. LightGBM might be clever enough to automatically detect sparse patterns even in dense arrays. Furthermore, if it detects sparsity patterns it might also benefit from Exclusive Feature Bundling https://lightgbm.readthedocs.io/en/latest/Parameters.html#enable_bundle which we do not implement in scikit-learn at the moment.
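
The idea behind bundling mutually exclusive features can be shown on a toy example: columns that are never non-zero in the same row can be merged into one column of bin indices without losing information. This is a deliberately simplified sketch and not LightGBM's actual algorithm; the function names and data are hypothetical:

```python
def are_mutually_exclusive(columns):
    """True if no row has a non-zero value in more than one column."""
    return all(sum(1 for v in row if v != 0) <= 1 for row in zip(*columns))


def bundle(columns):
    """Merge exclusive 0/1 columns into one column of bin indices.

    Bin 0 means "all columns are zero"; bin k means column k-1 is set.
    """
    if not are_mutually_exclusive(columns):
        raise ValueError("columns conflict and cannot be bundled losslessly")
    bundled = []
    for row in zip(*columns):
        active = [k + 1 for k, v in enumerate(row) if v != 0]
        bundled.append(active[0] if active else 0)
    return bundled


# Three one-hot columns derived from one categorical variable (toy data).
is_mon = [1, 0, 0, 0]
is_tue = [0, 1, 0, 1]
is_wed = [0, 0, 1, 0]

bundled = bundle([is_mon, is_tue, is_wed])
print(bundled)   # [1, 2, 3, 2]
```

One-hot columns produced from the same categorical variable are exclusive by construction, which is why this kind of bundling can collapse them back into a handful of features before histograms are built.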

@NicolasHug

NicolasHug commented Sep 11, 2020

Lightgbm does feature bundling for features that are mutually exclusive as is the case for OHEd features.

Lol @ogrisel is a few seconds faster as usual

@NicolasHug

@szilard we have an internal sklearn.ensemble._hist_gradient_boosting.utils.get_equivalent_estimator in sklearn to have fairer comparisons between models (we deactivate feature bundling here), in case you're curious about various potential discrepancies.
This is mostly to make sure we get equivalent predictions. Obviously, using it for benchmarking purposes would not be fair to LightGBM and others, because we deactivate some advanced fancy stuff that isn't yet implemented in sklearn.

@ogrisel

ogrisel commented Sep 11, 2020

I am not sure that deactivating feature bundling can be considered "fair" for end-user facing benchmarks. But it's useful for us to debug the scikit-learn performance and treat one problem at a time though.

@szilard
Owner Author

szilard commented Sep 11, 2020

Thanks @ogrisel and @NicolasHug for insights. I will look at a few things, will post all findings here.

@szilard
Owner Author

szilard commented Sep 11, 2020

Using nightly builds:

sudo apt install python3-pip
sudo pip3 install -U pandas lightgbm scikit-learn
sudo pip3 install -U --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn
    Found existing installation: scikit-learn 0.23.2
    Uninstalling scikit-learn-0.23.2:
      Successfully uninstalled scikit-learn-0.23.2
Successfully installed scikit-learn-0.24.dev0

Running the original 1-hot encoding with dense matrices (both lightgbm and HistGBT):

import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")

d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]

X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()


md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))


md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

Great news @ogrisel: the memory issue is fixed (looks like it was the memory leak you mentioned). The code above no longer crashes on 32 GB; memory usage is:

  • after loading data, create (dense) matrices: 1.71 GB
  • lightgbm training: 1.79 GB
  • HistGBT: 3.1 GB

So HistGBT is still using some memory (vs lightgbm using very little), but now it's much better than before. Thanks @ogrisel and @amueller for suggesting to use the dev version (master/nightly builds).

@szilard
Owner Author

szilard commented Sep 11, 2020

Notwithstanding the other encoding options, if we look at 1-hot encoding:

Run time [sec] and AUC:

data size 100K:

lightgbm sparse
1.8356289863586426
0.7300012836184978
lightgbm dense
2.1029934883117676
0.7300012836184978
HistGBT dense
35.00905084609985
0.728212555455098

data size 1M:

lightgbm sparse
4.715407609939575
0.764772836574283
lightgbm dense
6.551331520080566
0.764772836574283
HistGBT dense
170.7247931957245
0.7655385730787526

data size 10M:

Cannot create the dense matrix with 32 GB RAM. Will need a bigger cloud instance. (TODO)
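
A rough estimate shows why 32 GB is not enough. The 1102.4 MB dense matrix reported earlier for 200K rows implies 689 float64 columns; the sketch below assumes the 10M data has at least that many columns and that the test set has 100K rows (both are assumptions, since the larger training set may contain more distinct categories):

```python
BYTES_PER_FLOAT64 = 8
n_cols = 689                       # assumption: same width as the 100K run
n_rows = 10_000_000 + 100_000      # assumption: 10M train rows + 100K test rows

dense_gb = n_rows * n_cols * BYTES_PER_FLOAT64 / 1e9
print(f"estimated dense matrix: {dense_gb:.1f} GB")
```

That is roughly 56 GB for the matrix alone, before any training overhead.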

Code:

import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-10m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")

d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]

X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()


print("lightgbm sparse")

md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))


print("lightgbm dense")

md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))


print("HistGBT dense")

md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)

y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

@ogrisel

ogrisel commented Sep 14, 2020

The Exclusive Feature Bundling feature of LGBM is really efficient at dealing with OHE categorical variables. However, I still think that OHE is useless for decision-tree-based algorithms and that ordinal encoding (or native categorical variable support) is a better option.
