sklearn HistGBT #36
Code to compare HistGBT to lightgbm:
Looks like HistGBT does not support sparse matrices:
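For context, a minimal sketch of what such a 1-hot comparison looks like (illustrative, not the exact script; it assumes the 0.1m airline files and OneHotEncoder, and the HistGBT fit on the sparse matrix is the part that fails):

```python
# Minimal sketch (not the original script): one-hot encode the categorical
# columns and try to fit both models on the resulting sparse matrix.
import pandas as pd
from scipy import sparse
import lightgbm as lgb
from sklearn.preprocessing import OneHotEncoder
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")

vars_cat = ["Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime", "Distance"]

# Fit the encoder on train+test together so the test set has no unseen categories.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(pd.concat([d_train[vars_cat], d_test[vars_cat]]))

X_train = sparse.hstack([enc.transform(d_train[vars_cat]),
                         sparse.csr_matrix(d_train[vars_num].values)]).tocsr()
y_train = (d_train["dep_delayed_15min"] == "Y").values

md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
md.fit(X_train, y_train)   # lightgbm accepts the sparse CSR matrix

md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1,
                                    max_iter=100)
md.fit(X_train, y_train)   # raises TypeError: HistGBT requires dense data
```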
Experiments on m5.2xlarge (8 cores, 32GB RAM): lightgbm:
Will need dense matrices for lightgbm as well for fair comparison to HistGBT:
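Continuing the sketch above, the densified variant that both libraries can consume would just be:

```python
# Continuing the sketch above: densify the one-hot matrix so lightgbm and
# HistGBT are compared on exactly the same (dense) input.
X_train_dense = X_train.toarray()

md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
md.fit(X_train_dense, y_train)
```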
HistGBT with dense matrices:
The process slowly fills up the RAM and eventually runs out of memory (the OS kills the process):
HistGBT will not crash (OOM) with smaller max_leaves, but it is considerably slower than lightgbm:
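The only change relative to the dense HistGBT run is a smaller max_leaf_nodes, along these lines (the value shown is illustrative, not the one from the run above):

```python
# Illustrative: same dense setup, but with fewer leaves per tree to keep the
# memory footprint down (the exact value used in the run above is not shown).
md = HistGradientBoostingClassifier(max_leaf_nodes=32, learning_rate=0.1,
                                    max_iter=100)
md.fit(X_train_dense, y_train)
```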
Memory usage (RAM usage on the server):
HistGBT with fewer trees: 6.3 GB
HistGBT with shallower trees: 3.3 GB
Summary so far: HistGBT (sklearn's HistGradientBoostingClassifier) currently does not support sparse matrices for encoding categorical variables. On the airline dataset (with dense matrices) it uses a lot of memory, while lightgbm does not use much memory beyond the data itself, even with dense matrices. Is this a memory leak? The amount of memory increases with the number of trees and the depth of the trees. It runs out of memory even for small data and not-too-deep trees. It also runs slowly compared to lightgbm (both on dense matrices). Am I doing something wrong? @amueller @ogrisel @Laurae2 Is this because of categorical data (whereas previous benchmarks were on numeric data)? Can I change something in the code to make it better?
This is a known limitation. However for categorical variables there are three solutions that do not involve sparse preprocessing of the training data:
Do you use scikit-learn master? We recently fixed some cyclic references that prevented the GC from releasing memory in a timely fashion: scikit-learn/scikit-learn#18334. If you want to try the master branch without building from source, feel free to use the nightly builds: https://scikit-learn.org/0.21/developers/advanced_installation.html#installing-nightly-builds
lightgbm 3.0 is known to be ~2x faster than scikit-learn, probably because of the new row-wise parallelism in histograms: microsoft/LightGBM#2791 (comment). Also, on hyper-threaded machines, you want to limit the number of threads explicitly with OMP_NUM_THREADS.
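For example (an illustrative sketch, not from the comment above), the thread count can be pinned before anything OpenMP-based is imported, which is equivalent to exporting OMP_NUM_THREADS in the shell:

```python
# Illustrative: limit OpenMP to the physical cores (4 here, assuming the
# m5.2xlarge's 8 reported cores are hyper-threads) before importing
# scikit-learn / lightgbm; equivalent to `export OMP_NUM_THREADS=4` in the shell.
import os
os.environ["OMP_NUM_THREADS"] = "4"

import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
```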
Thanks @ogrisel for the very quick answer. Encoding: yeah, I know I could use ordinal or target encoding. In fact there was some discussion on this in 2015 on exactly this dataset, and it seemed that ordinal encoding could actually get better AUC (in random forest) than 1-hot. The reason I did 1-hot in the benchmark, I guess, is that it has been (used to be?) the preferred method for practitioners and it was also the common denominator for all packages (e.g. I kept using 1-hot even with lightgbm in the benchmark, even after lightgbm started using "direct" encoding). So maybe I should try ordinal encoding (and others), but then I should do the same with all the tools. Or at least try it out. But then of course there are so many other things I should also do (e.g. using more datasets of different structure, sparsity etc.) to make the benchmark more meaningful. All I managed to do is create a list a while ago.
Thanks @ogrisel for the suggestions. I just used the latest release; I'll try out the nightly build and also OMP_NUM_THREADS. Will add results here.
Keeping track of things:
Data size in RAM:
I was too curious, here are the results with ordinal encoding:

```python
import pandas as pd
import sklearn
from sklearn import preprocessing
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
vars_cat = ["Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime", "Distance"]
input_all = vars_cat + vars_num
ordinal_encoder = preprocessing.OrdinalEncoder(
handle_unknown="use_encoded_value",
unknown_value=-1,
)
preprocessor = ColumnTransformer([
("cat", ordinal_encoder, vars_cat),
],
remainder="passthrough",
)
X_train = preprocessor.fit_transform(d_train[vars_cat + vars_num])
y_train = (d_train["dep_delayed_15min"] == "Y").values
X_test = preprocessor.transform(d_test[vars_cat + vars_num])
y_test = (d_test["dep_delayed_15min"] == "Y").values
print(f"n_samples={X_train.shape[0]}")
print(f"n_features={X_train.shape[1]}")
print(f"LightGBM {lgb.__version__}:")
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(f" - training time: {time.time() - start_time:.3f}s")
y_pred = md.predict_proba(X_test)[:, 1]
print(f" - ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.3f}")
print(f"scikit-learn {sklearn.__version__}:")
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1,
max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(f" - training time: {time.time() - start_time:.3f}s")
y_pred = md.predict_proba(X_test)[:, 1]
print(f" - ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.3f}") I set:
So indeed LightGBM 3.0 is a bit more than 2x faster than scikit-learn master, but there is some variability on such short runs.
Nice @ogrisel, I was working on that as well, you got there first 👍 It's weird, then, that with 1-hot encoding HistGBT is 30x slower (see above). It's just another data matrix (though wider) and it has mostly 0s and 1s (though I use dense matrices, so lightgbm does not have knowledge of that either).
It's expected for high-cardinality features: it has a lot more work to do to build the histograms for the expanded features, and for each such feature we treat the majority of 0s as any other value. LightGBM is very smart at handling sparse features but this is not (yet) implemented in scikit-learn.
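To put a number on the "expanded features" point, the one-hot width on this dataset can be checked directly (a quick sketch, reusing the frames and column lists loaded earlier):

```python
# Quick sketch: how wide the one-hot expansion gets on the 0.1m airline data.
n_ohe = sum(d_train[c].nunique() for c in vars_cat)
print(f"{len(vars_cat)} categorical columns expand to roughly {n_ohe} one-hot columns")
```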
Here's my version (with LabelEncoder, pardon me, I'm mainly an R guy LOL):
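A minimal sketch of what such a per-column LabelEncoder preprocessing might look like (illustrative, not the exact code; each encoder is fit on the combined train+test categories so the test set has no unseen labels):

```python
# Illustrative per-column LabelEncoder variant (not the original code).
from sklearn.preprocessing import LabelEncoder

X_train_le = d_train[vars_cat + vars_num].copy()
X_test_le = d_test[vars_cat + vars_num].copy()
for col in vars_cat:
    le = LabelEncoder().fit(pd.concat([d_train[col], d_test[col]]))
    X_train_le[col] = le.transform(d_train[col])
    X_test_le[col] = le.transform(d_test[col])
```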
Timings:
I used dense matrices with lightgbm as well; are you suggesting lightgbm might do a "preprocessing" step to lump the 0s into a bin, so that this does not need to be repeated over and over across the iterations, or something like that?
Indeed, I did not get that. LightGBM might be clever enough to automatically detect sparsity patterns even in dense arrays. Furthermore, if it detects sparsity patterns it might also benefit from Exclusive Feature Bundling (https://lightgbm.readthedocs.io/en/latest/Parameters.html#enable_bundle), which we do not implement in scikit-learn at the moment.
LightGBM does feature bundling for features that are mutually exclusive, as is the case for OHE'd features. Lol, @ogrisel is a few seconds faster, as usual.
@szilard we have an internal
I am not sure that deactivating feature bundling can be considered "fair" for end-user facing benchmarks. But it's useful for us to debug the scikit-learn performance and treat one problem at a time.
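For anyone who wants to quantify that effect, EFB can be switched off explicitly via LightGBM's enable_bundle parameter; a sketch reusing the dense one-hot setup from the sketches above:

```python
# Sketch: rerun the dense 1-hot LightGBM benchmark with Exclusive Feature
# Bundling disabled, to see how much of the speed gap it accounts for.
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100,
                        enable_bundle=False)
md.fit(X_train_dense, y_train)
```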
Thanks @ogrisel and @NicolasHug for insights. I will look at a few things, will post all findings here.
Using nightly builds:
Running the original 1-hot encoding with dense matrices (both lightgbm and HistGBT):
Great thing @ogrisel, the memory issue is fixed (looks like it was that memory leak you mentioned). Now the code above does not crash on 32 GB; memory usage is:
So HistGBT is still using some memory (vs lightgbm using very little), but now it's much better than before. Thanks @ogrisel and @amueller for suggesting to use the dev version (master/nightly builds).
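(Side note on how such numbers can be recorded from within the training script itself; an illustrative snippet using the standard library:)

```python
# Illustrative: peak resident memory of the current process after training.
# On Linux, ru_maxrss is reported in kilobytes.
import resource
peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2
print(f"peak RSS: {peak_gb:.1f} GB")
```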
Notwithstanding the other encoding options, if we look at 1-hot encoding, run time [sec] and AUC are:
data size 100K:
data size 1M:
data size 10M: Cannot create the dense matrix with 32 GB RAM. Will need a bigger cloud instance. (TODO) Code:
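The full script is not reproduced here; a sketch of how the three sizes can be driven, assuming the 1m and 10m training files follow the same naming pattern as the 0.1m file:

```python
# Sketch: loop the dense 1-hot benchmark over the three training set sizes.
# The 1m/10m file names are an assumption based on the 0.1m file used above.
for size in ["0.1m", "1m", "10m"]:
    d_train = pd.read_csv(
        f"https://s3.amazonaws.com/benchm-ml--main/train-{size}.csv")
    # ... one-hot encode, densify and fit both models as in the earlier sketches
```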
The Exclusive Feature Bundling feature of LGBM is really efficient at dealing with OHE categorical variables. However, I still think that OHE is useless for decision-tree-based algorithms and that ordinal encoding (or native categorical variable support) is a better option.
New tool https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html
based on the POC https://github.com/ogrisel/pygbm mentioned earlier in #15