
feat: get information about global resource availability #222

Open
lorenzo-cavazzi opened this issue Nov 22, 2019 · 6 comments

@lorenzo-cavazzi (Member)

Sometimes users end up with notebooks in a pending state without knowing that there aren't enough resources available in the cluster. This is especially true for GPUs, which are very limited in some deployments.
We could get this information from Kubernetes and use it for one (or both) of the following:

  • Provide a /stats API
  • Fail early in the environment launch phase when resources are unavailable
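The fail-early option could be sketched roughly like this, assuming the backend already knows which resources are currently free (the function name, resource keys, and numbers below are all hypothetical, not an existing API):

```python
# Hypothetical sketch: refuse a session launch when any requested resource
# exceeds what is currently free. `free` would come from the Kubernetes API;
# the values used here are made up.

def check_availability(requested: dict, free: dict) -> None:
    """Raise with a clear message when a requested resource is unavailable."""
    for resource, amount in requested.items():
        available = free.get(resource, 0)
        if amount > available:
            raise RuntimeError(
                f"Cannot start session: requested {amount} {resource}, "
                f"but only {available} currently available."
            )

# Example: asking for 1 GPU when none are free fails with a clear message
free = {"cpu": 6, "gpu": 0}
try:
    check_availability({"cpu": 2, "gpu": 1}, free)
except RuntimeError as error:
    print(error)
```

The hard part is of course producing an accurate `free` dict, which the rest of this thread discusses.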
@rokroskar (Member)

@lorenzo-cavazzi can we clarify what the /stats API should do? cc @olevski

@lorenzo-cavazzi (Member Author)

The initial idea was to get information about the availability of the resources to prevent stalling when starting a new environment requiring currently unavailable resources (e.g. when asking for 1 gpu and no gpus are available).

A few things have changed in the meantime, and I guess it's not easy to have something like this in our environment -- with different nodes, it may not be obvious where the pod for a new interactive environment ends up.

A better solution could be to fail the launch with a reasonable error message when no resources are available. That seems easier to implement, and the UX won't be much worse than in the original proposal.

@rokroskar (Member)

I believe it should be possible to query the kubernetes API about resources and provide constraints on the types of nodes you want to consider. imho preventing a launch is better than recovering from a failed launch
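The node-constrained query could look roughly like the sketch below, operating on plain dicts shaped like the node objects the Kubernetes API returns (the label key, node data, and integer quantities are invented for illustration; real `allocatable` values are quantity strings and would need parsing, and a real version would fetch nodes with the official client):

```python
# Hypothetical sketch: sum allocatable resources over the nodes that match a
# label selector, mirroring a constrained query against the Kubernetes API.

def total_allocatable(nodes, selector):
    """Total each resource over nodes whose labels match every selector entry."""
    totals = {}
    for node in nodes:
        labels = node["metadata"]["labels"]
        if all(labels.get(key) == value for key, value in selector.items()):
            for resource, quantity in node["status"]["allocatable"].items():
                totals[resource] = totals.get(resource, 0) + quantity
    return totals

# Invented example: only consider nodes labeled for user sessions
nodes = [
    {"metadata": {"labels": {"renku.io/pool": "user"}},
     "status": {"allocatable": {"cpu": 8, "nvidia.com/gpu": 1}}},
    {"metadata": {"labels": {"renku.io/pool": "system"}},
     "status": {"allocatable": {"cpu": 4}}},
]
print(total_allocatable(nodes, {"renku.io/pool": "user"}))
# {'cpu': 8, 'nvidia.com/gpu': 1}
```

Subtracting the sum of existing pod requests from these totals would give the free capacity needed for a pre-launch check.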

@lorenzo-cavazzi (Member Author)

Ok, then for the UI the idea would be to get extra information about the current availability of the resources.
The GET /options endpoint returns the resources available to the pods. Say we know an interactive environment can request from 1 to 4 CPUs and from 0 to 2 GPUs. The /stats endpoint should tell us how many CPUs and GPUs are currently available so that we can further limit the resources -- or, even better, notify the user that it's temporarily not possible to start a session with more than <currently_available> resources.

If that is feasible, I can quickly formalize the endpoint proposal by sketching it in SwaggerHub.
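For instance, the /stats response could look roughly like this (the field names are a sketch of the idea, not a finalized schema):

```json
{
  "cpu": {"total": 16, "requested": 12, "available": 4},
  "gpu": {"total": 2, "requested": 2, "available": 0}
}
```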

@rokroskar (Member)

Apparently this is not easily solvable (see this discussion), but some tools exist: https://github.com/davidB/kubectl-view-allocations

@olevski (Member)

olevski commented Feb 12, 2021

I am sorry for being late to this. My GitHub notifications are not set up correctly, so I get way too many emails about random things and miss the ones where I am needed.

@rokroskar that is a very good link you posted. In there I found a command like this, which will give you the CPU requested by every pod in a namespace (or in all namespaces).

kubectl get po -o=jsonpath="{range .items[*]}{.metadata.namespace}:{.metadata.name}{'\n'}{range .spec.containers[*]}  {.name}:{.resources.requests.cpu}{'\n'}{end}{'\n'}{end}"

So if we use this command on the backend and combine it with something like kubectl top nodes (or similar) to see the total capacity of the nodes, then I believe we can get the data that @lorenzo-cavazzi is looking for.
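One detail when summing that jsonpath output: CPU requests come back in Kubernetes quantity notation (e.g. 100m vs 2), so the backend would need to normalize them before comparing against capacity. A minimal sketch, covering only the plain and milli-CPU forms (function names are hypothetical):

```python
# Hypothetical sketch: normalize Kubernetes CPU quantity strings to millicores
# so per-pod requests can be summed and compared against node capacity.

def cpu_to_millicores(quantity: str) -> int:
    """Convert '500m' -> 500 and '2' -> 2000; other suffixes are out of scope here."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

def total_requested_cpu(requests) -> int:
    """Sum a list of CPU request strings, in millicores."""
    return sum(cpu_to_millicores(q) for q in requests)

print(total_requested_cpu(["500m", "2", "0.5"]))  # 3000
```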

Some other questions (that I think can be resolved) if we pursue this are:

  • If I am not mistaken, we do not use taints, affinities, etc. to ensure that user servers run on a limited set of nodes? So then the question is how to determine the total capacity that is available for user servers. To resolve this properly we may need to designate which nodes will run user servers and enforce this with affinities, taints, and tolerations.
  • I am not sure whether the credentials that the API uses to access the k8s cluster have enough permissions to run a command like kubectl get nodes (or something similar) to get the total capacity of the nodes.

@Panaetius Panaetius moved this to Backlog in renku-python May 18, 2022
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

3 participants