Problems setting up nebari on existing k8s #2906
Comments
Hi @RaMaTHA, thanks for opening an issue describing the problem and what you've tried so far. Are you running completely bare metal, or is this part of a VM in a cloud setup? If so, which cloud provider? GPU virtualization on complete bare-metal deployments hasn't been tested with Nebari yet, so I can't assure you it will work; if that's your case, I will need a bit more info about your cluster to replicate it myself. On the other hand, if this is running on cloud resources, feel free to try the notes below. Regarding your first issue, usually when
I see that you compared a previously "working" pod with yours to double-check the mounting, but could you attach a screenshot of the labels of your JupyterLab pod when you launch it? Please don't forget to sanitize any project-specific labels if they show up. Usually, in cloud deployments, depending on the provider, it might be missing a
I think you are running on complete bare metal, so I did some quick research on kubespray and GPUs. I am not entirely sure how it handles installing the drivers from the toolkit, but you could try manually installing the NVIDIA GPU Operator and check if that works:
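The exact steps that followed in the original comment aren't preserved here; a minimal sketch of the usual Helm-based GPU Operator install, with the release name and namespace chosen as examples, is:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Deploys the operator, which in turn manages the driver, container toolkit, and device plugin
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator

Once the operator pods are running, the node should advertise an nvidia.com/gpu resource.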
If this works, we probably need to add this to Nebari as part of the driver setup for GPUs here.
Thanks for the quick response. I'm running Nebari on a complete, self-hosted, bare-metal setup based on kubespray. As far as I know, kubespray doesn't install the NVIDIA GPU stack by itself, so I installed the NVIDIA GPU driver myself (more or less identical to your instructions above). These are the GPU pods that are currently running on my k8s:
I have followed the NVIDIA guides for testing the installed gpu-operator. This is how I tested it:
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job-gpu
spec:
  template:
    spec:
      runtimeClassName: nvidia
      containers:
        - name: nvidia-test
          image: nvidia/cuda:12.0.0-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        kubernetes.io/hostname: node1
      restartPolicy: Never
Running the above job seems to detect the GPU:
As you can see in the job example, I used the GPU mapped to the hostname node1 (which is the default in kubespray).
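The screenshots referenced above aren't preserved in this thread. A sketch of how such a job's result can be checked, assuming the Job manifest above was saved as test-job-gpu.yaml (a hypothetical filename):

kubectl apply -f test-job-gpu.yaml
kubectl wait --for=condition=complete job/test-job-gpu --timeout=120s
# Prints the nvidia-smi table if the GPU is visible inside the container
kubectl logs job/test-job-gpu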
Thanks for following up. I see. To debug further, I will attempt to replicate the installation on my side; right now, I can't see why your node would not correctly report the nvidia-smi command. Since you submitted the job properly, the k8s configuration is correct, and you also used the proper image. Regarding your other issue with conda-store: its authentication is driven by your current config and uses the same authentication system as JupyterHub. I would double-check which group your user is currently assigned to through the Keycloak admin console:
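A sketch of checking group membership with the Keycloak admin CLI instead of the console; the realm name nebari and the username are assumptions, and kcadm.sh must already be authenticated via kcadm.sh config credentials:

# Find the user's id, then list the groups it belongs to (realm name assumed to be "nebari")
kcadm.sh get users -r nebari -q username=<username>
kcadm.sh get users/<user-id>/groups -r nebari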
Hi @RaMaTHA, I will look deeper into this by the end of the week. I haven't had the chance yet due to the upcoming 2025.1.1 release of Nebari, but I just wanted to let you know that this hasn't been forgotten :)
Describe the bug
I have successfully deployed nebari on my existing k8s (kubespray) server.
To do this, I followed the instructions on: https://www.nebari.dev/docs/get-started/installing-nebari
After successful installation, I initialized and deployed nebari using the following commands:
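The commands themselves aren't included in the report. For reference, a typical init/deploy sequence against an existing cluster looks roughly like the following; the project name and domain are illustrative placeholders:

nebari init existing --project-name mynebari --domain nebari.example.com
nebari deploy -c nebari-config.yaml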
So far, so good: Nebari completed its deployment tasks successfully.
I'm now also able to access the web interface, and all pods are running as expected.
--
However, there are now two things that don't work as expected.
First issue:
When trying to access the GPU, I get the following error:
bash: nvidia-smi: command not found
I have checked the website, where I have found the following guide: https://www.nebari.dev/docs/how-tos/use-gpus
So I followed the instructions and modified nebari-config.yaml the way I expected it to work, but unfortunately it didn't help.
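For context, the relevant part of that guide for requesting a GPU from JupyterLab is a profile entry in nebari-config.yaml. A minimal sketch, with the image tag and resource sizes as placeholders, looks roughly like:

profiles:
  jupyterlab:
    - display_name: GPU Instance
      description: JupyterLab server with access to one GPU
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.11.1   # tag is a placeholder
        cpu_limit: 4
        cpu_guarantee: 2
        mem_limit: 16G
        mem_guarantee: 8G
        extra_resource_limits:
          nvidia.com/gpu: 1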
I then spent a while debugging, but to make a long story short, I couldn't get the GPU to work; not even the nvidia-smi command returned anything.
However, when checking my k8s cluster (single node, named node1), I can see that the notebook is successfully trying to access the GPU. I used the following command to check this:
Here I can see that the GPU is (presumably) successfully mounted in the pod. I checked this with another pod (a different application), which returned an error that the GPU is already in use, so the mounting should be correct.
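The exact command and its output aren't shown in the report; a sketch of one way to verify the GPU request on the running JupyterLab pod (the namespace and pod name are assumptions):

# List pods in the namespace Nebari deploys into (assumed here to be "dev"), then inspect the user pod
kubectl get pods -n dev -o wide
kubectl describe pod <jupyterlab-pod-name> -n dev | grep -A 2 "nvidia.com/gpu"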
Since this didn't solve my problem, I started trying things with the environment setup (conda_store), which led me to the second problem.
Second issue:
I can't log in to the conda-store interface. I can log in to Keycloak and to my notebook server, of course, but not to conda-store.
So, I followed the instructions as recommended at: https://www.nebari.dev/docs/how-tos/configuring-keycloak
But I still get the following error when logging in:
{"status":"error","message":"Invalid authentication credentials"}
I tried debugging this for a while too, but unfortunately that didn't help either. I'm not sure whether the two problems are related, because at the moment I can't set the variables in conda-store (e.g. CONDA_OVERRIDE_CUDA: "12.1").
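For context, that variable would normally be set in the conda-store environment specification. A minimal sketch of such a spec, with the channel and package list as placeholders, looks roughly like:

channels:
  - conda-forge
dependencies:
  - python=3.11
  - pytorch
variables:
  CONDA_OVERRIDE_CUDA: "12.1"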
I would really appreciate any help. Thanks in advance!
Expected behavior
OS and architecture in which you are running Nebari
Distributor ID: Debian
Description: Debian GNU/Linux 12 (bookworm)
Release: 12
Codename: bookworm
How to Reproduce the problem?
nebari-config.txt
Command output
No response
Versions and dependencies used.
conda 4.13.0
Client Version: v1.28.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6
Nebari 2024.11.1
Compute environment
None
Integrations
No response
Anything else?
No response