
[CUDA] GPU build on Ubuntu 20.04 crashing on training #6705

Closed
empowerNate opened this issue Oct 29, 2024 · 8 comments
empowerNate commented Oct 29, 2024

Description

CUDA GPU version on Ubuntu 20.04 Linux crashes during training.

Reproducible example

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a dummy classification dataset
X, y = make_classification(n_samples=100000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the LightGBM classifier with GPU support
clf = lgb.LGBMClassifier(
    objective='binary',
    device='cuda',
    verbose=1,
)

clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Environment info

LightGBM version or commit hash: 4.5.0

Command(s) you used to install LightGBM

conda install -c conda-forge 'lightgbm>=4.4.0'

I've tried installing several different ways and in Python 3.11 and 3.10. It always seems to end up with this error.

Additional Comments

It works fine on CPU or with the other GPU build, just not with CUDA.

The GPU is NVIDIA Tesla M60

The NVIDIA driver version is 535 and CUDA 12.2 (I think, may have also been 12.7). Also tried with 565.57.01 and 12.7.

The traceback I'm getting is this:

{
	"name": "LightGBMError",
	"message": "Check failed: (split_indices_block_size_data_partition) > (0) at /home/conda/feedstock_root/build_artifacts/liblightgbm_1728547676427/work/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .
",
	"stack": "---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
Cell In[1], line 19
     12 # Create and train the LightGBM classifier with GPU support
     13 clf = lgb.LGBMClassifier(
     14     objective='binary',
     15     device='cuda',
     16     verbose=1,
     17 )
---> 19 clf.fit(X_train, y_train)
     21 # Predict and evaluate
     22 y_pred = clf.predict(X_test)

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/sklearn.py:1284, in LGBMClassifier.fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, feature_name, categorical_feature, callbacks, init_model)
   1281         else:
   1282             valid_sets.append((valid_x, self._le.transform(valid_y)))
-> 1284 super().fit(
   1285     X,
   1286     _y,
   1287     sample_weight=sample_weight,
   1288     init_score=init_score,
   1289     eval_set=valid_sets,
   1290     eval_names=eval_names,
   1291     eval_sample_weight=eval_sample_weight,
   1292     eval_class_weight=eval_class_weight,
   1293     eval_init_score=eval_init_score,
   1294     eval_metric=eval_metric,
   1295     feature_name=feature_name,
   1296     categorical_feature=categorical_feature,
   1297     callbacks=callbacks,
   1298     init_model=init_model,
   1299 )
   1300 return self

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/sklearn.py:955, in LGBMModel.fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, feature_name, categorical_feature, callbacks, init_model)
    952 evals_result: _EvalResultDict = {}
    953 callbacks.append(record_evaluation(evals_result))
--> 955 self._Booster = train(
    956     params=params,
    957     train_set=train_set,
    958     num_boost_round=self.n_estimators,
    959     valid_sets=valid_sets,
    960     valid_names=eval_names,
    961     feval=eval_metrics_callable,  # type: ignore[arg-type]
    962     init_model=init_model,
    963     callbacks=callbacks,
    964 )
    966 self._evals_result = evals_result
    967 self._best_iteration = self._Booster.best_iteration

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/engine.py:307, in train(params, train_set, num_boost_round, valid_sets, valid_names, feval, init_model, feature_name, categorical_feature, keep_training_booster, callbacks)
    295 for cb in callbacks_before_iter:
    296     cb(
    297         callback.CallbackEnv(
    298             model=booster,
   (...)
    304         )
    305     )
--> 307 booster.update(fobj=fobj)
    309 evaluation_result_list: List[_LGBM_BoosterEvalMethodResultType] = []
    310 # check evaluation result.

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/basic.py:4135, in Booster.update(self, train_set, fobj)
   4133 if self.__set_objective_to_none:
   4134     raise LightGBMError(\"Cannot update due to null objective function.\")
-> 4135 _safe_call(
   4136     _LIB.LGBM_BoosterUpdateOneIter(
   4137         self._handle,
   4138         ctypes.byref(is_finished),
   4139     )
   4140 )
   4141 self.__is_predicted_cur_iter = [False for _ in range(self.__num_dataset)]
   4142 return is_finished.value == 1

File /anaconda/envs/py311/lib/python3.11/site-packages/lightgbm/basic.py:296, in _safe_call(ret)
    288 \"\"\"Check the return value from C API call.
    289 
    290 Parameters
   (...)
    293     The return value from C API calls.
    294 \"\"\"
    295 if ret != 0:
--> 296     raise LightGBMError(_LIB.LGBM_GetLastError().decode(\"utf-8\"))

LightGBMError: Check failed: (split_indices_block_size_data_partition) > (0) at /home/conda/feedstock_root/build_artifacts/liblightgbm_1728547676427/work/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .
"
}
@jameslamb jameslamb added the bug label Oct 30, 2024
@jameslamb (Collaborator)

Thanks very much for the excellent report.

I think that maybe LightGBM's CUDA build doesn't currently support Tesla M60 (CUDA compute capability 5.2). The oldest compute capability we support is 6.0 (Pascal).

LightGBM/CMakeLists.txt

Lines 226 to 239 in dc0ed53

set(CUDA_ARCHS "60" "61" "62" "70" "75")
if(CUDA_VERSION VERSION_GREATER_EQUAL "110")
  list(APPEND CUDA_ARCHS "80")
endif()
if(CUDA_VERSION VERSION_GREATER_EQUAL "111")
  list(APPEND CUDA_ARCHS "86")
endif()
if(CUDA_VERSION VERSION_GREATER_EQUAL "115")
  list(APPEND CUDA_ARCHS "87")
endif()
if(CUDA_VERSION VERSION_GREATER_EQUAL "118")
  list(APPEND CUDA_ARCHS "89")
  list(APPEND CUDA_ARCHS "90")
endif()

But I'm not sure. @shiyu1994 can you help investigate this report?
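
As an illustration (this is not LightGBM code, just a Python sketch mirroring the CMake logic quoted above, with CUDA versions modeled as `(major, minor)` tuples), you can check whether a given compute capability is among the default build targets:

```python
def supported_cuda_archs(cuda_version: tuple) -> set:
    """Return the CUDA compute capabilities targeted by the default build,
    mirroring the version gates in LightGBM's CMakeLists.txt."""
    archs = {"60", "61", "62", "70", "75"}
    if cuda_version >= (11, 0):
        archs.add("80")
    if cuda_version >= (11, 1):
        archs.add("86")
    if cuda_version >= (11, 5):
        archs.add("87")
    if cuda_version >= (11, 8):
        archs.update({"89", "90"})
    return archs

# Tesla M60 is compute capability 5.2 ("52"), which is absent even with CUDA 12.x:
print("52" in supported_cuda_archs((12, 2)))  # -> False
print("70" in supported_cuda_archs((12, 2)))  # -> True (e.g. V100)
```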

@jameslamb jameslamb changed the title CUDA GPU build on Ubuntu 20.04 crashing on training [CUDA] GPU build on Ubuntu 20.04 crashing on training Oct 30, 2024

empowerNate commented Oct 30, 2024

Oh, I see, thanks. And it looks like support goes up to compute capability 9.0, which would cover other Azure options like the V100 and T4.

I assume find_package(CUDAToolkit 11.0 REQUIRED) means anything >= 11.0?

@empowerNate (Author)

BTW, is the CUDA build substantially better/faster than the OpenCL version? I was assuming so...

@empowerNate (Author)

Ah yeah, that was it - on M60 it doesn't work, on V100 it does. D'oh.

@jameslamb (Collaborator)

I assume find_package(CUDAToolkit 11.0 REQUIRED) means anything >= 11.0?

Yes. This line:

find_package(CUDAToolkit 11.0 REQUIRED)
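
For what it's worth, a version passed to `find_package` is treated as a minimum unless `EXACT` is added, so this line accepts any CUDA Toolkit >= 11.0:

```cmake
# Accepts any CUDA Toolkit version >= 11.0 (the version argument is a minimum):
find_package(CUDAToolkit 11.0 REQUIRED)
# Requiring exactly 11.0 would instead be spelled:
# find_package(CUDAToolkit 11.0 EXACT REQUIRED)
```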

There are some details on this here:

You can also see some hints about this in the logs of a recent CI job here using CUDA 12.6.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.6     |
...
-- The CUDA compiler identification is NVIDIA 12.6.68
...
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found suitable version "12.6.68", minimum required is "11.0")

(build link)

If you want to see how the conda-forge packages are built, see https://github.com/conda-forge/lightgbm-feedstock.

BTW, is the CUDA build substantially better/faster than the OpenCL version?

Yes. The OpenCL version here is basically unmaintained at this point: #4946 (comment)

The CUDA version does more work on the GPU, with less copying between host and device. It's more actively developed and more thoroughly tested. If you have a compatible NVIDIA GPU, prefer "device": "cuda" to "device": "gpu" for LightGBM.
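
As a minimal sketch of that parameter choice (the `device` values are as documented by LightGBM; everything else here is just illustration):

```python
# Parameter dicts you would pass to lgb.train(params, ...) or
# lgb.LGBMClassifier(**params).
# "device": "cuda" selects the CUDA tree learner (preferred);
# "device": "gpu" selects the legacy OpenCL tree learner.
cuda_params = {"objective": "binary", "device": "cuda"}
opencl_params = {"objective": "binary", "device": "gpu"}
print(cuda_params["device"])  # -> cuda
```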

Ah yeah, that was it - on M60 it doesn't work, on V100 it does.

Ah great! Sorry there isn't a more informative error there. I'm glad it's working well for you on V100s.

I'm going to close this, as it seems that resolves the issue, but please post if you have additional questions. At this point, we won't add support for older GPUs. Even RAPIDS dropped support for Pascal earlier this year: https://docs.rapids.ai/notices/rsn0034/

@jameslamb (Collaborator)

I just noticed that you double-posted this here and on Stack Overflow (link). Please do not do that.

Maintainers here also monitor the [lightgbm] tag on Stack Overflow. I could have been spending time preparing an answer here while another maintainer was spending time answering your Stack Overflow post, which would have been a waste of maintainers' limited attention that could otherwise have been spent improving this project. Double-posting also makes it less likely that others with a similar question will find the relevant discussion and answer.

Since we've answered this here and your Stack Overflow post hasn't received any votes or comments, I think you should delete it.

@empowerNate (Author)

Got it, thanks for all the helpful info. Should we just close the SO question (needs one more vote)?

@jameslamb (Collaborator)

Thanks for that. I don't have sufficient reputation there to vote to close it; maybe someone else will come along. I appreciate that you linked back to this issue in your answer, that helps! We just want to be sure we're making good use of everyone's time.

Thanks again for the excellent report, the reproducible example and details you shared made it easy to get to a resolution quickly, and I know that takes some effort to put together.
