2.21.0 hybrid #29
Conversation
I'll have more feedback when I test this
@@ -596,6 +596,7 @@ resource "openstack_compute_instance_v2" "k8s_nodes" {
   user_data = each.value.cloudinit != null ? templatefile("${path.module}/templates/cloudinit.yaml.tmpl", {
     extra_partitions = each.value.cloudinit.extra_partitions
   }) : data.cloudinit_config.cloudinit.rendered
+  security_groups = var.port_security_enabled ? local.worker_sec_groups : null
is this possibly due to a bug in kubespray?
I was wondering the same thing, and after some reading I concluded that this likely is a problem that merits another PR or at least further discussion. In summary, I think it was an oversight when removing the port definitions and then fixing the broken security groups in a later commit. Since we hadn't used the k8s_nodes resource, it was never updated.
I think that series of commits was done to force Terraform to add instances to the auto_allocated_network. I would suggest we find a way to accomplish this without removing the ports resources so that we diverge as little as possible from "vanilla" Kubespray. This would make updating easier.
I can open a new issue that describes what I think the problem is in more depth and work on it when I have time?
yes, sure, thanks. Keep it low priority; maybe we can reconsider this the next time we update Kubespray.
As a first step I tested this branch by creating a simple CPU-only deployment and it worked fine; next I'll do a GPU-only one and then a hybrid.
@ana-v-espinoza what do you get as the container runtime for GPU nodes? I was expecting it to be different from the master nodes' runtime, but I get the same runtime reported for the GPU nodes as for the masters, so I am wondering if anything is wrong.
I get the same. As far as I know, this is normal, as I believe this only shows which container runtime interface (CRI) K8s is using for the node. To see the actual container runtime, ssh into your GPU node and look at the containerd configuration; you should see a block of the config that defines the nvidia runtime. This is as expected, given what is defined in the gpu-node group_vars in this PR.
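For reference, a minimal sketch of the kind of group_vars entry involved, assuming Kubespray's containerd_additional_runtimes variable (the variable name and exact option formatting are assumptions from memory, not copied from this PR; check the gpu-node group_vars file for the real contents):

containerd_additional_runtimes:
  # register an "nvidia" runtime next to the default runc runtime
  - name: nvidia
    type: "io.containerd.runc.v2"
    engine: ""
    root: ""
    options:
      # the embedded quotes are intentional: the value ends up verbatim
      # in containerd's config.toml
      BinaryName: '"/usr/bin/nvidia-container-runtime"'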
ok, that works fine, thanks!
That's not something I considered. My first guess would be some kind of mismatch between the components involved. What does the GPU node itself report? Does a quick check on the node work? Or maybe the kubelet logs on the GPU node show something? I may test this with Ubuntu 22 tomorrow myself if it seems like this will be a difficult problem to track down.
there are some strange errors in the pods, for example failing to mount a configmap. Maybe a networking issue?
gpu-only deployment with Ubuntu 20 worked fine
# "az" = "nova"
# "flavor": "10"
# "floating_ip": false
# "extra_groups": "gpu-node"
@ana-v-espinoza How do I create 2 profiles, 1 for CPU and 1 for GPU?
I see there is an extra group here, but it seems it is only in terraform and not in Kubernetes.
This works for GPU pods, but not for CPU:
https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/gpu/jupyterhub_gpu.yaml#L2-L9
Hey Andrea,
I'm using the same "profiles" config option. Here's my snippet. You'll notice that I don't override the image in the GPU profile, as I'm using an image similar to that discussed in zonca/jupyterhub-deploy-kubernetes-jetstream#72
singleuser:
  image:
    name: "unidata/hybrid-gpu"
    tag: "minimal-tf"
  profileList:
    - display_name: "CPU Server"
      default: true
    - display_name: "GPU Server"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"
The problem is that I request a CPU server but I spawn on a GPU node. I am wondering if it is better to restrict CPU-only users to run on CPU nodes.
Ah okay I see what you mean! Yeah, you can taint the GPU node(s), then add a toleration in kubespawner_override for the GPU profile.
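A minimal sketch of what that could look like on the JupyterHub side, assuming the GPU node(s) have been tainted with a hypothetical taint such as nvidia.com/gpu=present:NoSchedule (the taint key, value, and effect are illustrative, not something defined in this PR):

singleuser:
  profileList:
    - display_name: "CPU Server"
      default: true
    - display_name: "GPU Server"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"
        # only GPU-profile pods tolerate the taint on the GPU nodes;
        # CPU-profile pods have no toleration, so they stay on CPU nodes
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "present"
            effect: "NoSchedule"

With this, CPU servers are kept off the tainted GPU nodes while GPU servers can still land there.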
Deployment of Kubernetes on the hybrid cluster on Ubuntu 20 worked fine; now starting to test JupyterHub on the hybrid cluster.
ok, all tests completed, it works great.
@zonca Looks good to me, but I think my perception of the post might be skewed since you and I have been working on this in depth for some time now. Perhaps @julienchastang can give better feedback about anything that might need more detail or clarification. Julien, could you please take a look at Andrea's new blog post about the work we've been doing here in this PR? (also linked above)
Thanks. I read it once, but I would like to study it more carefully and actually launch a cluster according to what you have described. I will try to find time to do that in the near future. One thing I noticed concerns the base image it is still built upon.
CC: @julienchastang
Hey Andrea,
Apologies for the long PR description, but I feel like it contains some relevant information.
Here are the changes needed to deploy a hybrid cluster. In short, to deploy a hybrid cluster do everything as you normally would, with the exception of these 3 things:

1. Set number_of_k8s_nodes and number_of_k8s_nodes_no_floating_ip to 0. This is required.
2. Define your nodes in the k8s_nodes variable. The availability zone (az), flavor and floating_ip are the only required fields.
3. For each node that should be GPU enabled, add the "extra_groups": "gpu-node" value.

A few notes:
The necessary changes to containerd.yml to enable the nvidia container runtime are contained in an ansible group_vars file specific only to the "gpu-node" group. As such, there is no longer a need for a different branch_v<version>_gpu branch for anything other than book-keeping and documenting the changes necessary to enable GPU capability. Simply specify which nodes should be GPU enabled by adding them to the group, or, if deploying a fully GPU cluster, add them all to the group with the supplementary_node_groups var. See my note on this in cluster.tfvars.
When running a JupyterHub on top of a CPU/GPU hybrid cluster, it may be necessary to do two things: 1) create two separate single-user images, one for CPU usage and one for GPU usage, and set the appropriate kubespawner_override values; and 2) disable the hook image puller and the continuous image puller (see the JHub config snippet below). This is because some GPU enabled singleuser images are ultimately based off of a CUDA image and will expect GPUs to be available, as is the case with the one currently in your JupyterHub deployment repository. In other words, attempting to run this image in a hybrid cluster will result in some errors.

I am currently working on creating some single-user images that make this a non-problem by installing CUDA in a conda environment. You can see some preliminary work for this here. Expect a PR to address this problem in that repository soon.
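The image puller part of that JHub config snippet presumably looks something like the following (a sketch using the standard Zero to JupyterHub options, not copied verbatim from the deployment repository):

prePuller:
  hook:
    # don't pre-pull singleuser images during helm install/upgrade
    enabled: false
  continuous:
    # don't continuously pre-pull singleuser images onto every node
    enabled: false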
In principle, this could be applied to things other than a CPU/GPU hybrid cluster. For example, we've run across instances where multiple people concurrently running computationally intensive tasks crash the JupyterHub. The solution to this is to run the JHub "core" pods on a dedicated node, which probably can be something smaller than the m3.medium that we typically use for our cluster. See more details about this here, and a sketch of the relevant config below.
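A minimal sketch of how the core pods could be pinned to such a dedicated node using Zero to JupyterHub's scheduling options, assuming that node has been labeled hub.jupyter.org/node-purpose=core (the label and the "require" setting are standard z2jh conventions, not something configured in this PR):

scheduling:
  corePods:
    nodeAffinity:
      # "require" schedules core pods (hub, proxy, ...) only on nodes
      # labeled hub.jupyter.org/node-purpose=core ("prefer" is the default)
      matchNodePurpose: require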
Let me know if you have any questions,

Ana