
Problems setting up nebari on existing k8s #2906

Open
RaMaTHA opened this issue Jan 10, 2025 · 5 comments
Labels
  • area: k8s ⎈
  • area: user experience 👩🏻‍💻
  • needs: follow-up 📫 (Someone needs to get back to this issue or PR)
  • needs: investigation 🔍 (Someone in the team needs to find the root cause and replicate this bug)
  • provider: Existing
  • type: bug 🐛 (Something isn't working)

Comments

RaMaTHA commented Jan 10, 2025

Describe the bug

I have successfully deployed nebari on my existing k8s (kubespray) server.
To do this, I followed the instructions on: https://www.nebari.dev/docs/get-started/installing-nebari

After successful installation, I initialized and deployed nebari using the following commands:

nebari init --guided-init
nebari deploy -c nebari-config.yaml

So far, so good. Nebari has successfully processed its task.
I'm now also able to access the web interface and all pods are running as expected.

--

However, there are now two things that don't work as expected.

First issue:

When trying to access the GPU from a notebook, I get the following error:
bash: nvidia-smi: command not found

I have checked the website, where I have found the following guide: https://www.nebari.dev/docs/how-tos/use-gpus

So I followed the instructions and modified the nebari-config.yaml accordingly, but unfortunately it didn't help.
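
For reference, the GPU profile described in that guide looks roughly like the following (a sketch based on the linked docs, not my exact config; the image tag and display name are placeholders):

# Sketch of the GPU profile from the use-gpus guide (placeholder values)
profiles:
  jupyterlab:
    - display_name: GPU Instance
      description: JupyterLab server with access to one GPU
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.11.1
        extra_resource_limits:
          nvidia.com/gpu: 1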

I then spent a while debugging, but to make a long story short, I couldn't get the GPU to work: nvidia-smi never returned anything.

However, when checking my k8s cluster (a single node named node1), I can see that the notebook pod does request the GPU. I used the following command to check this:

kubectl describe node node1 | grep -A5 "nvidia.com/gpu"

Here I can see that the GPU is (presumably) successfully mounted in the pod. I checked this with another pod (a different application), which returns an error saying the GPU is already in use, so the mounting should be correct.
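
Two generic node-level checks (plain kubectl, nothing Nebari-specific) that show the same information are:

# How many GPUs the node advertises as allocatable
kubectl get node node1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'

# Which pods currently have resources (including the GPU) allocated on the node
kubectl describe node node1 | grep -A10 "Allocated resources"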

Since this didn't solve my problem, I started trying things with the environment setup (conda_store), which led me to the second problem.

Second issue:

I can't log in to the conda_store interface. Of course, I can log in to keycloak and my notebook server, but not to the conda_store.

So, I followed the instructions as recommended at: https://www.nebari.dev/docs/how-tos/configuring-keycloak

But I still get the following error, when logging in:
{"status":"error","message":"Invalid authentication credentials"}

I tried debugging here for a while, but unfortunately that didn't help either. I'm not sure whether the two problems are related, because at the moment I can't set the environment variables in the conda_store (e.g. CONDA_OVERRIDE_CUDA: "12.1").
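
For reference, what I'm trying to do is set the variable from the GPU guide in a conda-store environment specification, roughly like this (a sketch; the channels and packages are only illustrative, and the variables section assumes a recent conda-store version):

# Sketch of a conda-store environment spec with the CUDA override variable
# (channels/packages are illustrative, not my actual environment)
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pytorch
variables:
  CONDA_OVERRIDE_CUDA: "12.1"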

I would really appreciate any help. Thanks in advance

Expected behavior

  1. That the notebook is using my GPU
  2. That I can log in to the conda_store and adjust the environment there

OS and architecture in which you are running Nebari

Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm

How to Reproduce the problem?

nebari-config.txt

Command output

No response

Versions and dependencies used.

conda --version

conda 4.13.0

kubectl version

Client Version: v1.28.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6

nebari -V

2024.11.1

Compute environment

None

Integrations

No response

Anything else?

No response

RaMaTHA added the needs: triage 🚦 and type: bug 🐛 labels on Jan 10, 2025
viniciusdc (Contributor) commented Jan 13, 2025

Hi @RaMaTHA, thanks for opening an issue describing the problem and what you've tried so far.

Are you running completely bare metal, or is this part of a smaller VM unit in a cloud setup? If so, which cloud provider?

GPU virtualization on complete bare-metal deployments hasn't been tested with Nebari yet, so I can't assure you it will work; if that's the case, I will need a bit more info about your cluster to replicate it myself. On the other hand, if this runs on cloud resources, feel free to try the notes below:

Regarding your first issue, usually when nvidia-smi doesn't show up, it is one of two things:

  • the expected GPU drivers, in this case the NVIDIA daemon, were not installed (a quick check is sketched right after this list);
  • the type of machine used for the node does not match the architecture (cloud only);
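
For the first point, a quick generic check (plain kubectl, nothing Nebari-specific) is to confirm the NVIDIA daemonsets are actually running on the node:

# List NVIDIA-related daemonsets (device plugin, driver, toolkit) across namespaces
kubectl get daemonsets -A | grep -i nvidia

# Confirm the corresponding pods are Running on the GPU node
kubectl get pods -A -o wide | grep -i nvidia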

kubectl describe node node1 | grep -A5 "nvidia.com/gpu"
Here I can see that the GPU is (presumably) successfully mounted in the pod. I checked this with another pod (a different application), which returned an error that the GPU was already in use. So the mounting should be correct.

I see that you compared a previously "working" pod with yours to double-check the mounting, but could you attach a screenshot of the labels of your JupyterLab pod when you launch it? Please don't forget to sanitize any project-specific labels if they show up.

Usually, in cloud deployments, depending on the provider, a gpu: enabled or gpu_accelerators field might be missing from the node_groups section of the nebari-config.yaml.
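
For example, on AWS that section looks roughly like this (a sketch from memory, so please double-check the exact field names against the GPU docs for your provider and Nebari version):

# Sketch only -- instance type is a placeholder, field names may vary by version
amazon_web_services:
  node_groups:
    gpu-node:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 1
      gpu: true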

viniciusdc added the needs: follow-up 📫 label on Jan 13, 2025
viniciusdc (Contributor) commented Jan 13, 2025

I think you are running on complete bare metal, so I did some quick research on kubespray and GPUs. I am not entirely sure whether it installs the drivers from the NVIDIA toolkit automatically, but you could try manually installing the NVIDIA GPU operator and checking if that works:

  • kubectl create ns gpu-operator
  • kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
  • helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  • helm install --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

If this works, we probably need to add this to Nebari as part of the driver setup for GPUs here

RaMaTHA (Author) commented Jan 13, 2025

Thanks for the quick response.

I'm running nebari on a complete, self-hosted, bare-metal setup based on kubespray.

As far as I know, kubespray doesn't install NVIDIA GPU support by itself, so I installed the NVIDIA GPU driver myself (more or less following your instructions above).

These are the gpu pods that are currently running on my k8s:

[screenshot: GPU-related pods currently running in the cluster]

I have followed the NVIDIA guides for testing the installed gpu-operator. This is how I tested it:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-job-gpu
spec:
  template:
    spec:
      runtimeClassName: nvidia
      containers:
      - name: nvidia-test
        image: nvidia/cuda:12.0.0-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        kubernetes.io/hostname: node1
      restartPolicy: Never

Running the above job seems to detect the GPU:

[screenshot: nvidia-smi output from the test job, showing the GPU is detected]

As you can see in the job example, I used a nodeSelector that maps the GPU workload to the hostname node1 (the default node name in kubespray).
So I decided to use the same mapping inside Nebari (see my uploaded nebari-config.txt). But at this point I'm not quite sure whether the mapping is correct or where the error in my setup might be.
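
Concretely, by "mapping" I mean the node selectors of the existing-provider section, roughly like this (a simplified sketch, not my exact file; the full config is in the attached nebari-config.txt, and the layout assumes the existing/local provider schema):

# Simplified sketch of the node selector mapping (not my exact config)
provider: existing
existing:
  kube_context: ...          # my kubespray context
  node_selectors:
    general:
      key: kubernetes.io/hostname
      value: node1
    user:
      key: kubernetes.io/hostname
      value: node1
    worker:
      key: kubernetes.io/hostname
      value: node1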

viniciusdc (Contributor) commented

Thanks for following up. I see. To debug further, I will attempt to replicate the installation on my side. Right now, I can't see why your node would not correctly expose the nvidia-smi command: you submitted the job properly, the K8s configuration is correct, and you also used the proper image.

Regarding your other issue with conda-store: authentication is based on your current config and uses the same authentication system as JupyterHub. I would double-check which group your user is currently assigned to through the Keycloak admin console at <domain>/auth/admin -- add yourself to the admin group if you're not already in it, and then test again whether you can access conda-store.
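
If you need the root credentials for that console, they are the ones generated at init time and stored in your config, roughly here (a sketch; the actual value is unique to your deployment):

# In nebari-config.yaml -- the root user for <domain>/auth/admin
security:
  keycloak:
    initial_root_password: <generated-at-init>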

viniciusdc added the area: user experience 👩🏻‍💻, needs: investigation 🔍, area: k8s ⎈, and provider: Existing labels and removed the needs: triage 🚦 label on Jan 13, 2025
viniciusdc (Contributor) commented Jan 15, 2025

Hi @RaMaTHA, I will look deeper into this by the end of the week. I haven't had the chance yet due to the upcoming 2025.1.1 release of Nebari, but I just wanted to let you know that this hasn't been forgotten :)
