I'm comparing predictions from LightGBM's Dask interface against local LightGBM training, and noticed that with the same model setup, the Dask model gives a worse prediction. I think it's reasonable to expect similar predictions from Dask and local training.
Reproducible example
Prepare train and test data
import pandas as pd
import lightgbm as lgb
import dask.dataframe
import sklearn.metrics
import sklearn.model_selection
sample_data = pd.read_csv("sample_data.csv")
train, test = sklearn.model_selection.train_test_split(sample_data, test_size=0.2, shuffle=False)
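For reference, with shuffle=False the split is just an ordered 80/20 cut, so the test set is the last 20% of rows. A minimal sketch of the same cut in plain Python (a list of row indices stands in for the DataFrame; not part of the original report):

```python
# Ordered 80/20 split, like train_test_split(..., test_size=0.2, shuffle=False)
rows = list(range(10))             # stand-in for sample_data's row indices
cut = int(len(rows) * (1 - 0.2))   # index where the test set begins
train_rows, test_rows = rows[:cut], rows[cut:]
print(train_rows)  # first 80% of rows
print(test_rows)   # last 20% of rows
```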
Local model training and prediction
model = lgb.LGBMRegressor(
max_depth=8,
learning_rate=0.01,
tree_learner="data",
n_estimators=100,
)
model.fit(train[['x0', 'x1']], train['y'], sample_weight=train['weight'])
pred = model.predict(X=test[['x0', 'x1']])
r_squared = sklearn.metrics.r2_score(test['y'], pred, sample_weight=test['weight'])
print(f"r_squared: {r_squared}")
# the result is ~0.003
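For context, r2_score with sample_weight computes a weighted R²: both the residual and total sums of squares are weighted, and the baseline is the weighted mean. A minimal hand computation with toy numbers (not the data from this issue), matching sklearn's definition:

```python
# Weighted R^2 = 1 - SS_res / SS_tot, with weights applied to both sums
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
w      = [1.0, 1.0, 2.0, 2.0]

mean_w = sum(wi * yt for wi, yt in zip(w, y_true)) / sum(w)  # weighted mean of y_true
ss_res = sum(wi * (yt - yp) ** 2 for wi, yt, yp in zip(w, y_true, y_pred))
ss_tot = sum(wi * (yt - mean_w) ** 2 for wi, yt in zip(w, y_true))
r2 = 1 - ss_res / ss_tot
print(f"r_squared: {r2}")
```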
Distributed model training and prediction
from dask.distributed import Client

client = Client()  # local cluster; the original snippet assumes an existing client

train_data_dask = dask.dataframe.from_pandas(train, npartitions=4)
X_train = train_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_train = train_data_dask["y"].to_dask_array(lengths=True)
w_train = train_data_dask["weight"].to_dask_array(lengths=True)
test_data_dask = dask.dataframe.from_pandas(test, npartitions=4)
X_test = test_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_test = test_data_dask["y"].to_dask_array(lengths=True)
w_test = test_data_dask["weight"].to_dask_array(lengths=True)
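As an aside, dask.dataframe.from_pandas(..., npartitions=4) splits the rows into roughly equal contiguous chunks, and with tree_learner="data" each chunk is what a worker trains on. A rough sketch of that chunking in plain Python (Dask's actual partition boundaries may differ slightly; this is illustrative only):

```python
# Contiguous, roughly-equal row chunks, similar in spirit to from_pandas(npartitions=4)
rows = list(range(10))                    # stand-in for the DataFrame's rows
npartitions = 4
chunk = -(-len(rows) // npartitions)      # ceiling division: rows per partition
parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
print(parts)
```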
model = lgb.DaskLGBMRegressor(
client=client,
max_depth=8,
learning_rate=0.01,
tree_learner="data",
n_estimators=100,
)
model.fit(X_train, y_train, sample_weight=w_train)
y_pred = model.predict(X=X_test)
# Measure the result using r_squared
y_local = y_test.compute()
w_local = w_test.compute()
y_pred_local = y_pred.compute()
r_squared = sklearn.metrics.r2_score(y_local, y_pred_local, sample_weight=w_local)
print(f"r_squared: {r_squared}")
# The result is ~0.001
Thank you for raising this @szhang5947. We've seen these kinds of differences between local and distributed training and are investigating them in #3835. They seem to be related to distributed training in general (not just the Dask interface). You can subscribe to that issue if you want to follow our progress.
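As a toy illustration of why data-parallel training can diverge from single-machine training (a deliberate simplification, not LightGBM's actual histogram-merging logic): a statistic computed per partition and then merged is not always the same as that statistic computed over the full dataset. For example, averaging per-partition medians differs from the global median when the partitions are unevenly distributed:

```python
# Global statistic vs. merged per-partition statistics (illustrative only)
data = [1, 1, 1, 1, 2, 3, 4, 100]   # skewed data; the outlier sits in one partition
partitions = [data[:4], data[4:]]   # two "workers", each holding half the rows

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

global_median = median(data)                                    # computed on all rows
merged = sum(median(p) for p in partitions) / len(partitions)   # merged per-worker stats
print(global_median, merged)  # the two disagree
```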
jameslamb changed the title from "lightgbm + dask generates worse prediction compared to local training" to "[dask] lightgbm + dask generates worse prediction compared to local training" on Nov 1, 2021.
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Environment info
lightgbm: 3.3.0
dask: 2021.05.1
sample_data.csv