
[dask] lightgbm + dask generates worse prediction compared to local training #4744

Closed
szhang5947 opened this issue Oct 29, 2021 · 3 comments
szhang5947 commented Oct 29, 2021

Description

I'm comparing predictions from LightGBM's Dask interface against local LightGBM training, and noticed that with the same model setup, the Dask version gives a worse prediction. I think it's reasonable to expect similar predictions from Dask and local training.

Reproducible example

Prepare train and test data

import pandas as pd
import lightgbm as lgb
import dask.dataframe
import sklearn.metrics
import sklearn.model_selection

sample_data = pd.read_csv("sample_data.csv")
train, test = sklearn.model_selection.train_test_split(sample_data, test_size=0.2, shuffle=False)

Local model training and prediction

model = lgb.LGBMRegressor(
    max_depth=8,
    learning_rate=0.01,
    tree_learner="data",
    n_estimators=100,
)

model.fit(train[['x0', 'x1']], train['y'], sample_weight=train['weight'])
pred = model.predict(X=test[['x0', 'x1']])

r_squared = sklearn.metrics.r2_score(test['y'], pred, sample_weight=test['weight'])
print(f"r_squared: {r_squared}")

# the result is ~0.003

Distributed model training and prediction

from dask.distributed import Client, LocalCluster

# a local cluster for reproduction; any running Dask client works here
client = Client(LocalCluster(n_workers=4))

train_data_dask = dask.dataframe.from_pandas(train, npartitions=4)
X_train = train_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_train = train_data_dask["y"].to_dask_array(lengths=True)
w_train = train_data_dask["weight"].to_dask_array(lengths=True)

test_data_dask = dask.dataframe.from_pandas(test, npartitions=4)
X_test = test_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_test = test_data_dask["y"].to_dask_array(lengths=True)
w_test = test_data_dask["weight"].to_dask_array(lengths=True)

model = lgb.DaskLGBMRegressor(
    client=client,
    max_depth=8,
    learning_rate=0.01,
    tree_learner="data",
    n_estimators=100,
)

model.fit(X_train, y_train, sample_weight=w_train)
y_pred = model.predict(X=X_test)

# Measure the result using r_squared
y_local = y_test.compute()
w_local = w_test.compute()
y_pred_local = y_pred.compute()

r_squared = sklearn.metrics.r2_score(y_local, y_pred_local, sample_weight=w_local)
print(f"r_squared: {r_squared}")

# The result is ~0.001
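
As a sanity check (a minimal sketch, assuming the training data fits on a single worker): repartitioning to one partition routes all rows to one machine, so distributed training should closely reproduce the local score. If it does, that would point to the data distribution across workers rather than the model parameters.

# Sanity check: with npartitions=1 all data sits on one worker,
# so distributed training should behave like local training.
train_one = dask.dataframe.from_pandas(train, npartitions=1)
X_one = train_one[["x0", "x1"]].to_dask_array(lengths=True)
y_one = train_one["y"].to_dask_array(lengths=True)
w_one = train_one["weight"].to_dask_array(lengths=True)

model_one = lgb.DaskLGBMRegressor(
    client=client,
    max_depth=8,
    learning_rate=0.01,
    tree_learner="data",
    n_estimators=100,
)
model_one.fit(X_one, y_one, sample_weight=w_one)
pred_one = model_one.predict(X_test).compute()

r_squared_one = sklearn.metrics.r2_score(y_local, pred_one, sample_weight=w_local)
print(f"r_squared (1 partition): {r_squared_one}")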

Environment info

lightgbm: 3.3.0
dask: 2021.05.1
sample_data.csv

@jameslamb jameslamb added the dask label Oct 31, 2021
jmoralez (Collaborator) commented Nov 1, 2021

Thank you for raising this @szhang5947. We've seen this kind of difference between local and distributed training and are investigating it in #3835. It seems to be related to distributed training in general (not just the Dask interface). You can subscribe to that issue if you want to follow our progress.
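
In the meantime, it helps to rule out run-to-run randomness when comparing scores by fixing the seed in both setups; a minimal sketch (seed and deterministic are standard LightGBM parameters, accepted as keyword arguments by both estimators):

import lightgbm as lgb

common_params = dict(
    max_depth=8,
    learning_rate=0.01,
    tree_learner="data",
    n_estimators=100,
    seed=42,             # fixes all RNG-driven behavior
    deterministic=True,  # reproducible training at some speed cost
)
local_model = lgb.LGBMRegressor(**common_params)
dask_model = lgb.DaskLGBMRegressor(client=client, **common_params)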

@jameslamb jameslamb changed the title lightgbm + dask generates worse prediction compared to local training [dask] lightgbm + dask generates worse prediction compared to local training Nov 1, 2021
no-response bot commented Dec 1, 2021

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@no-response no-response bot closed this as completed Dec 1, 2021
github-actions bot commented
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023