
[CUDA] GPU build on Ubuntu 20.04 crashing on training #6705

Closed
empowerNate opened this issue Oct 29, 2024 · 8 comments
empowerNate commented Oct 29, 2024

Description

CUDA GPU version on Ubuntu 20.04 Linux crashes during training.

Reproducible example

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a dummy classification dataset
X, y = make_classification(n_samples=100000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the LightGBM classifier with GPU support
clf = lgb.LGBMClassifier(
    objective='binary',
    device='cuda',
    verbose=1,
)

clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Environment info

LightGBM version or commit hash: 4.5.0

Command(s) you used to install LightGBM

conda install -c conda-forge 'lightgbm>=4.4.0'

I've tried installing several different ways and in Python 3.11 and 3.10. It always seems to end up with this error.

Additional Comments

It works fine on CPU or with the other GPU build, just not with CUDA.

The GPU is NVIDIA Tesla M60

The NVIDIA driver version is 535 and CUDA 12.2 (I think, may have also been 12.7). Also tried with 565.57.01 and 12.7.

The traceback I'm getting is this:

{
	"name": "LightGBMError",
	"message": "Check failed: (split_indices_block_size_data_partition) > (0) at /home/conda/feedstock_root/build_artifacts/liblightgbm_1728547676427/work/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .
",
	"stack": "---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
Cell In[1], line 19
     12 # Create and train the LightGBM classifier with GPU support
     13 clf = lgb.LGBMClassifier(
     14     objective='binary',
     15     device='cuda',
     16     verbose=1,
     17 )
---> 19 clf.fit(X_train, y_train)
     21 # Predict and evaluate
     22 y_pred = clf.predict(X_test)

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/sklearn.py:1284, in LGBMClassifier.fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, feature_name, categorical_feature, callbacks, init_model)
   1281         else:
   1282             valid_sets.append((valid_x, self._le.transform(valid_y)))
-> 1284 super().fit(
   1285     X,
   1286     _y,
   1287     sample_weight=sample_weight,
   1288     init_score=init_score,
   1289     eval_set=valid_sets,
   1290     eval_names=eval_names,
   1291     eval_sample_weight=eval_sample_weight,
   1292     eval_class_weight=eval_class_weight,
   1293     eval_init_score=eval_init_score,
   1294     eval_metric=eval_metric,
   1295     feature_name=feature_name,
   1296     categorical_feature=categorical_feature,
   1297     callbacks=callbacks,
   1298     init_model=init_model,
   1299 )
   1300 return self

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/sklearn.py:955, in LGBMModel.fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, feature_name, categorical_feature, callbacks, init_model)
    952 evals_result: _EvalResultDict = {}
    953 callbacks.append(record_evaluation(evals_result))
--> 955 self._Booster = train(
    956     params=params,
    957     train_set=train_set,
    958     num_boost_round=self.n_estimators,
    959     valid_sets=valid_sets,
    960     valid_names=eval_names,
    961     feval=eval_metrics_callable,  # type: ignore[arg-type]
    962     init_model=init_model,
    963     callbacks=callbacks,
    964 )
    966 self._evals_result = evals_result
    967 self._best_iteration = self._Booster.best_iteration

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/engine.py:307, in train(params, train_set, num_boost_round, valid_sets, valid_names, feval, init_model, feature_name, categorical_feature, keep_training_booster, callbacks)
    295 for cb in callbacks_before_iter:
    296     cb(
    297         callback.CallbackEnv(
    298             model=booster,
   (...)
    304         )
    305     )
--> 307 booster.update(fobj=fobj)
    309 evaluation_result_list: List[_LGBM_BoosterEvalMethodResultType] = []
    310 # check evaluation result.

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/basic.py:4135, in Booster.update(self, train_set, fobj)
   4133 if self.__set_objective_to_none:
   4134     raise LightGBMError(\"Cannot update due to null objective function.\")
-> 4135 _safe_call(
   4136     _LIB.LGBM_BoosterUpdateOneIter(
   4137         self._handle,
   4138         ctypes.byref(is_finished),
   4139     )
   4140 )
   4141 self.__is_predicted_cur_iter = [False for _ in range(self.__num_dataset)]
   4142 return is_finished.value == 1

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/basic.py:296, in _safe_call(ret)
    288 \"\"\"Check the return value from C API call.
    289 
    290 Parameters
   (...)
    293     The return value from C API calls.
    294 \"\"\"
    295 if ret != 0:
--> 296     raise LightGBMError(_LIB.LGBM_GetLastError().decode(\"utf-8\"))

LightGBMError: Check failed: (split_indices_block_size_data_partition) > (0) at /home/conda/feedstock_root/build_artifacts/liblightgbm_1728547676427/work/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .
"
}
@jameslamb jameslamb added the bug label Oct 30, 2024
@jameslamb (Collaborator)

Thanks very much for the excellent report.

I think that maybe LightGBM's CUDA build doesn't currently support Tesla M60 (CUDA compute capability 5.2). The oldest compute capability we support is 6.0 (Pascal).

LightGBM/CMakeLists.txt

Lines 226 to 239 in dc0ed53

set(CUDA_ARCHS "60" "61" "62" "70" "75")
if(CUDA_VERSION VERSION_GREATER_EQUAL "110")
  list(APPEND CUDA_ARCHS "80")
endif()
if(CUDA_VERSION VERSION_GREATER_EQUAL "111")
  list(APPEND CUDA_ARCHS "86")
endif()
if(CUDA_VERSION VERSION_GREATER_EQUAL "115")
  list(APPEND CUDA_ARCHS "87")
endif()
if(CUDA_VERSION VERSION_GREATER_EQUAL "118")
  list(APPEND CUDA_ARCHS "89")
  list(APPEND CUDA_ARCHS "90")
endif()

But I'm not sure. @shiyu1994 can you help investigate this report?
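
As an illustration (this is not LightGBM code, just a Python sketch mirroring the CMake logic quoted above, with CUDA versions modeled as `(major, minor)` tuples), you can check whether a given compute capability is among the default build targets:

```python
def supported_cuda_archs(cuda_version: tuple) -> set:
    """Return the CUDA compute capabilities targeted by the default build,
    mirroring the version gates in LightGBM's CMakeLists.txt."""
    archs = {"60", "61", "62", "70", "75"}
    if cuda_version >= (11, 0):
        archs.add("80")
    if cuda_version >= (11, 1):
        archs.add("86")
    if cuda_version >= (11, 5):
        archs.add("87")
    if cuda_version >= (11, 8):
        archs.update({"89", "90"})
    return archs

# Tesla M60 is compute capability 5.2 ("52"), which is absent even with CUDA 12.x:
print("52" in supported_cuda_archs((12, 2)))  # -> False
print("70" in supported_cuda_archs((12, 2)))  # -> True (e.g. V100)
```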

@jameslamb jameslamb changed the title CUDA GPU build on Ubuntu 20.04 crashing on training [CUDA] GPU build on Ubuntu 20.04 crashing on training Oct 30, 2024

empowerNate commented Oct 30, 2024

Oh, I see, thanks. And it looks like support goes up to compute capability 9.0, which would cover other Azure options like the V100 and T4.

I assume find_package(CUDAToolkit 11.0 REQUIRED) means anything >= 11.0?

@empowerNate (Author)

BTW, is the CUDA build substantially better/faster than the OpenCL version? I was assuming so...

@empowerNate (Author)

Ah yeah, that was it - on M60 it doesn't work, on V100 it does. D'oh.

@jameslamb (Collaborator)

I assume find_package(CUDAToolkit 11.0 REQUIRED) means anything >= 11.0?

Yes. This line:

find_package(CUDAToolkit 11.0 REQUIRED)
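
For what it's worth, a version passed to `find_package` is treated as a minimum unless `EXACT` is added, so this line accepts any CUDA Toolkit >= 11.0:

```cmake
# Accepts any CUDA Toolkit version >= 11.0 (the version argument is a minimum):
find_package(CUDAToolkit 11.0 REQUIRED)
# Requiring exactly 11.0 would instead be spelled:
# find_package(CUDAToolkit 11.0 EXACT REQUIRED)
```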

There are some details on this here:

You can also see some hints about this in the logs of a recent CI job here using CUDA 12.6.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.6     |
...
-- The CUDA compiler identification is NVIDIA 12.6.68
...
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found suitable version "12.6.68", minimum required is "11.0")

(build link)

If you want to see how the conda-forge packages are built, see https://github.com/conda-forge/lightgbm-feedstock.

BTW, is the CUDA build substantially better/faster than the OpenCL version?

Yes. The OpenCL version here is basically unmaintained at this point: #4946 (comment)

The CUDA version does more work on the GPU, with less copying between host and device. It's more actively developed and more thoroughly tested. If you have a compatible NVIDIA GPU, prefer "device": "cuda" to "device": "gpu" for LightGBM.
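
As a minimal sketch of that parameter choice (the `device` values are as documented by LightGBM; everything else here is just illustration):

```python
# Parameter dicts you would pass to lgb.train(params, ...) or
# lgb.LGBMClassifier(**params).
# "device": "cuda" selects the CUDA tree learner (preferred);
# "device": "gpu" selects the legacy OpenCL tree learner.
cuda_params = {"objective": "binary", "device": "cuda"}
opencl_params = {"objective": "binary", "device": "gpu"}
print(cuda_params["device"])  # -> cuda
```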

Ah yeah, that was it - on M60 it doesn't work, on V100 it does.

Ah great! Sorry there isn't a more informative error there. I'm glad it's working well for you on V100s.

I'm going to close this, as it seems that resolves the issue, but please post if you have additional questions. At this point, we won't add support for older GPUs. Even RAPIDS dropped support for Pascal earlier this year: https://docs.rapids.ai/notices/rsn0034/

@jameslamb (Collaborator)

I just noticed that you double-posted this here and on Stack Overflow (link). Please do not do that.

Maintainers here also monitor the [lightgbm] tag on Stack Overflow. I could have been spending time preparing an answer here while another maintainer was spending time answering your Stack Overflow post, which would have been a waste of maintainers' limited attention that could otherwise have been spent improving this project. Double-posting also makes it less likely that others with a similar question will find the relevant discussion and answer.

Since we've answered this here and your Stack Overflow post hasn't received any votes or comments, I think you should delete it.

@empowerNate (Author)

Got it, thanks for all the helpful info. Should we just close the SO question (needs one more vote)?

@jameslamb (Collaborator)

Thanks for that. I don't have sufficient reputation there to vote to close it; maybe someone else will come along. I appreciate that you linked back to this issue in your answer, that helps! We just want to be sure we're making good use of everyone's time.

Thanks again for the excellent report, the reproducible example and details you shared made it easy to get to a resolution quickly, and I know that takes some effort to put together.
