[CUDA] GPU build on Ubuntu 20.04 crashing on training #6705
Comments
Thanks very much for the excellent report. I think LightGBM's CUDA build may not currently support the Tesla M60 (CUDA compute capability 5.2). The oldest compute capability we support is 6.0 (Pascal). See lines 226 to 239 in dc0ed53.
But I'm not sure. @shiyu1994 can you help investigate this report?
Oh, I see, thanks. And it looks like support goes up to 9.0, I guess. That seems like it would include other options on Azure like V100 and T4. I assume
BTW, is the CUDA build substantially better/faster than the OpenCL version? I was assuming so...
Ah yeah, that was it: on M60 it doesn't work, on V100 it does. D'oh.
Yes. This line: Line 220 in dc0ed53
There are some details on this here:
You can also see some hints about this in the logs of a recent CI job here using CUDA 12.6.
If you want to see how the
Yes. The OpenCL version here is basically unmaintained at this point: #4946 (comment) The CUDA version does more work on the GPU, with less copying between host and device. It's more actively developed and more thoroughly tested. If you have a compatible NVIDIA GPU, prefer
Ah great! Sorry there is not a more informative error there. I'm glad it's working well for you on V100s. I'm going to close this as it seems that resolves the issue, but please post if you have additional questions. At this point, we won't add support for older GPUs. Even RAPIDS dropped support for Pascal earlier this year: https://docs.rapids.ai/notices/rsn0034/
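Since the crash itself gives no hint about the cause, a pre-flight check can make the failure mode clearer. The following is a minimal sketch, not part of LightGBM: it compares a GPU's compute capability against the 6.0 minimum mentioned earlier in this thread. The helper names (`parse_compute_capability`, `supports_lightgbm_cuda`, `query_compute_capabilities`) are hypothetical, and the `nvidia-smi --query-gpu=compute_cap` query is assumed to be available, which may not hold for older drivers.

```python
import subprocess

# Minimum compute capability for the CUDA build, as stated in this thread (Pascal).
MIN_CUDA_CAPABILITY = (6, 0)


def parse_compute_capability(text):
    """Turn a string such as "5.2" into a comparable (major, minor) tuple."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def supports_lightgbm_cuda(capability, minimum=MIN_CUDA_CAPABILITY):
    """Return True if the capability string meets the assumed minimum."""
    return parse_compute_capability(capability) >= minimum


def query_compute_capabilities():
    """Ask nvidia-smi for each GPU's compute capability.

    Returns an empty list if nvidia-smi is missing, fails, or times out.
    Assumes the driver supports the `compute_cap` query field.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for cap in query_compute_capabilities():
        try:
            ok = supports_lightgbm_cuda(cap)
        except ValueError:
            print(f"could not parse compute capability: {cap!r}")
            continue
        status = "ok" if ok else "too old for the CUDA build"
        print(f"compute capability {cap}: {status}")
```

With the cards discussed here, a Tesla M60 (5.2) fails the check, while a V100 (7.0) or T4 (7.5) passes.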
I just noticed that you double-posted this here and on Stack Overflow (link). Please do not do that. Maintainers here also monitor the Since we've answered this here and your Stack Overflow post hasn't received any votes or comments, I think you should delete it.
Got it, thanks for all the helpful info. Should we just close the SO question (needs one more vote)?
Thanks for that. I don't have sufficient reputation there to vote to close it, maybe someone else will come along. I appreciate that you linked back to this issue in your answer, that helps! We just want to be sure we're making good use of everyone's time. Thanks again for the excellent report. The reproducible example and details you shared made it easy to get to a resolution quickly, and I know that takes some effort to put together.
Description
CUDA GPU version on Ubuntu 20.04 Linux crashes during training.
Reproducible example
Environment info
LightGBM version or commit hash: 4.5.0
Command(s) you used to install LightGBM
conda install -c conda-forge 'lightgbm>=4.4.0'
I've tried installing several different ways and in Python 3.11 and 3.10. It always seems to end up with this error.
Additional Comments
It works fine on CPU or with the other GPU build, just not with CUDA.
The GPU is an NVIDIA Tesla M60.
The NVIDIA driver version is 535 with CUDA 12.2 (I think; it may also have been 12.7). I also tried driver 565.57.01 with CUDA 12.7.
The traceback I'm getting is this: