Kaggle: cuML 24.10 import failure #487
Comments
Yes, it looks like the issue was mixing packages from the nvidia and conda-forge channels. Upgrading to cuML 24.12 + forcing the use of conda-forge libcusparse gets the import working:

# SHA from https://console.cloud.google.com/artifacts/docker/kaggle-gpu-images/us/gcr.io/python
IMAGE_SHA="f9647cc12ad6b5bff2567807c9baa993ae44f17e770a4e099dbde5f4e3a2f1ae"
docker run \
--rm \
--runtime nvidia \
--gpus "0,1" \
-it "gcr.io/kaggle-gpu-images/python@sha256:${IMAGE_SHA}" \
bash
# this fails
python -c "import cuml; print(cuml.__version__)"
# upgrade cuML + force the use of conda-forge libcusparse
conda install \
--override-channels \
-c nodefaults \
-c rapidsai \
-c conda-forge \
-c nvidia \
-c pytorch \
'cudf=24.12' \
'cuml=24.12' \
'conda-forge::libcusparse'
# this now succeeds
python -c "import cuml; print(cuml.__version__)" The relevant change in the conda environment is this:
full change summary from that 'conda install' (click me)
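To confirm that the conda-forge build of libcusparse is really the one getting picked up after that install, here is a small sketch of mine (not from the thread, Linux-only) that imports cuml and then looks at which libcusparse the process actually mapped:

# Sketch: verify which libcusparse gets loaded when cuml is imported,
# by inspecting /proc/self/maps after the import (Linux-only).
import cuml  # importing cuml pulls in the CUDA math libraries

with open("/proc/self/maps") as maps:
    paths = {line.split()[-1] for line in maps if "libcusparse" in line}

for path in sorted(paths):
    # a path under the conda environment's lib/ directory points at the
    # conda-forge package rather than a copy shipped some other way
    print(path)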
Between when this investigation started and today, Kaggle changed the base image for its main GPU + Python image:
It's no longer using the image this issue was originally reported against. Here's how to poke around in the new one and install cuML.

IMAGE_SHA="1003a82bef5df3c098b2041d936cb5f1836e52d7b610e8f0f4dedc194fb3b773"
docker run \
--rm \
--runtime nvidia \
--gpus "0,1" \
-it "gcr.io/kaggle-gpu-images/python@sha256:${IMAGE_SHA}" \
bash
# see what's installed
pip freeze
# get cuML
python -m pip install \
--extra-index-url https://pypi.nvidia.com/ \
'cuml-cu12==24.12.*'
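As a quick follow-up (my own sketch, not part of the original comment), a tiny end-to-end run confirms the wheel actually works on the GPU rather than just importing:

# Sketch: smoke test for the pip-installed cuml wheel; assumes a visible GPU.
import cuml
from cuml.datasets import make_blobs
from cuml.cluster import KMeans

X, _ = make_blobs(n_samples=10_000, n_features=8, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
print(cuml.__version__, labels.shape)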
I was about to comment "should we make a PR to add cuml back?" but then saw in https://github.com/Kaggle/docker-python/blob/083bc20f00eda74a422ab91b9c18de7a80806d07/kaggle_requirements.txt#L27C1-L27C10 that it mentions cuml. Kaggle/docker-python#1459 is the PR that added it. I think no new image has been published to gcr.io/kaggle-gpu-images/python since then.
So maybe we need to wait a bit for a new image to be published and then see if it is fixed?
I tried building the latest image from the repository myself, but it exhausted the disk space available on the machine :-/
Looks like there is a new image (went up 2 weeks ago)... it has cuML, and loading it succeeds!

# SHA from https://console.cloud.google.com/artifacts/docker/kaggle-gpu-images/us/gcr.io/python
IMAGE_SHA="57cb636a65386fd6c74fc9969211623034c487f7d483f9cd2c8456ebe2619345"
docker run \
--rm \
--runtime nvidia \
--gpus "0,1" \
-it "gcr.io/kaggle-gpu-images/python@sha256:${IMAGE_SHA}" \
bash
python -c "import cuml; print(cuml.__version__)"
# 24.12.00

Ran a small example there too, just for fun (from "Random Forest Classification and Accuracy Metrics" in the cuML docs):

import cuml
from cupy import asnumpy
from joblib import dump, load
from cuml.datasets.classification import make_classification
from cuml.model_selection import train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score
# synthetic dataset dimensions
n_samples = 1000
n_features = 10
n_classes = 2
# random forest depth and size
n_estimators = 25
max_depth = 10
# generate synthetic data [ binary classification task ]
X, y = make_classification ( n_classes = n_classes,
                             n_features = n_features,
                             n_samples = n_samples,
                             random_state = 0 )
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state = 0 )
model = cuRF( max_depth = max_depth,
              n_estimators = n_estimators,
              random_state = 0 )
trained_RF = model.fit ( X_train, y_train )
predictions = model.predict ( X_test )
cu_score = cuml.metrics.accuracy_score( y_test, predictions )
sk_score = accuracy_score( asnumpy( y_test ), asnumpy( predictions ) )
print( " cuml accuracy: ", cu_score )
# cuml accuracy: 0.9959999918937683
print( " sklearn accuracy : ", sk_score )
# sklearn accuracy : 0.996
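A side note on the example above: it imports dump and load from joblib but never uses them; in the cuML docs they're presumably there for persisting the trained model. A minimal continuation of the run above (my sketch; it reuses trained_RF, X_test, and y_test, and the file path is arbitrary):

# Sketch: persist and reload the trained forest with the joblib imports above.
dump(trained_RF, "/tmp/cuml_rf.joblib")

reloaded = load("/tmp/cuml_rf.joblib")
print(cuml.metrics.accuracy_score(y_test, reloaded.predict(X_test)))
# should match the cuml accuracy printed above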
I also want to add... the root cause of the original problem was related to how Kaggle was creating this image with conda. conda isn't even present in the new image, and the RAPIDS libraries now come in as pip wheels:

which conda
# (empty)

pip freeze | grep -E '\-cu12'

'pip freeze' output showing cuDF, cuML, cuVS, and more from wheels (click me)
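If you want to confirm that programmatically (a sketch of mine; cuml-cu12 is the wheel name used earlier in this thread), the packaging metadata records how a distribution was installed:

# Sketch: check that cuml is provided by a pip-installed wheel.
from importlib.metadata import distribution

dist = distribution("cuml-cu12")
print(dist.version)                                 # e.g. 24.12.*
print((dist.read_text("INSTALLER") or "").strip())  # "pip" for wheel installs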
So this is fixed in the new image, and in a way that shouldn't be affected by the root cause of the original problem again in the future 🎉
Description
In the Kaggle notebook environment, using the latest kaggle-python image (which contains RAPIDS 24.10 libraries), importing cuml fails like this:

full stacktrace (click me)
Reproducible Example
I was able to reproduce this outside of the Kaggle environment, using just the image built from https://github.com/Kaggle/docker-python.
NOTE: pinning to a specific SHA so this issue will be reproducible in the future if changes are made to kaggle-gpu-images/python.

output of 'nvidia-smi' (click me)
output of 'conda info' (click me)
output of 'conda list --explicit' (click me)
output of 'conda config --get' (click me)
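The diagnostics above were attached by hand; a small script like the following (my own sketch, using only the command names listed above) can collect the same outputs in one pass for reports like this:

# Sketch: capture nvidia-smi and conda diagnostics into a single text file.
import subprocess

COMMANDS = [
    ["nvidia-smi"],
    ["conda", "info"],
    ["conda", "list", "--explicit"],
    ["conda", "config", "--get"],
]

with open("diagnostics.txt", "w") as out:
    for cmd in COMMANDS:
        out.write(f"$ {' '.join(cmd)}\n")
        result = subprocess.run(cmd, capture_output=True, text=True)
        out.write(result.stdout)
        out.write(result.stderr)
        out.write("\n")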
Notes
At first glance, it looks like this might be a result of mixing packages from the nvidia and conda-forge channels. Look at this channel priority from conda config --get:

At https://docs.rapids.ai/install/#selector, we recommend a different order.

And notice in the output of conda list above that libraries like libcusparse are coming from the nvidia channel, and that libnvjitlink is not installed at all.
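One way to see that last point concretely (a sketch under the assumption of a conda-based environment at CONDA_PREFIX; the file names are the usual CUDA 12 ones, and CUDA 12's libcusparse links against libnvJitLink):

# Sketch: list the relevant CUDA math libraries in the environment's lib directory.
# In the broken image you'd expect libcusparse* to be present (from the nvidia
# channel) while libnvJitLink* is missing, matching the missing libnvjitlink package.
import glob
import os

prefix = os.environ.get("CONDA_PREFIX", "/opt/conda")  # /opt/conda is an assumed fallback
for pattern in ("libcusparse*", "libnvJitLink*"):
    hits = glob.glob(os.path.join(prefix, "lib", pattern))
    print(pattern, "->", hits if hits else "not found")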