
feat: get information about global resource availability #222

Open
lorenzo-cavazzi opened this issue Nov 22, 2019 · 6 comments

@lorenzo-cavazzi (Member)

Sometimes users end up with notebooks in a pending state without knowing that there aren't enough resources available in the cluster. This is especially true for GPUs, which are very limited in some deployments.
We could get this information from Kubernetes and use it for one (or both) of the following:

  • Provide a /stats API
  • Fail early in the environment launch phase when resources are unavailable
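The fail-early option could be sketched roughly like this, assuming the backend already knows which resources are currently free (the function name, resource keys, and numbers below are all hypothetical, not an existing API):

```python
# Hypothetical sketch: refuse a session launch when any requested resource
# exceeds what is currently free. `free` would come from the Kubernetes API;
# the values used here are made up.

def check_availability(requested: dict, free: dict) -> None:
    """Raise with a clear message when a requested resource is unavailable."""
    for resource, amount in requested.items():
        available = free.get(resource, 0)
        if amount > available:
            raise RuntimeError(
                f"Cannot start session: requested {amount} {resource}, "
                f"but only {available} currently available."
            )

# Example: asking for 1 GPU when none are free fails with a clear message
free = {"cpu": 6, "gpu": 0}
try:
    check_availability({"cpu": 2, "gpu": 1}, free)
except RuntimeError as error:
    print(error)
```

The hard part is of course producing an accurate `free` dict, which the rest of this thread discusses.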
@rokroskar (Member)

@lorenzo-cavazzi can we clarify what the /stats API should do? cc @olevski

@lorenzo-cavazzi (Member Author)

The initial idea was to get information about the availability of the resources to prevent stalling when starting a new environment requiring currently unavailable resources (e.g. when asking for 1 gpu and no gpus are available).

A few things have changed in the meantime, and I guess it's not easy to have something like this in our environment -- with different nodes, it may not be obvious where the pod for a new interactive environment ends up.

A better solution could be to fail the launch with a reasonable error message when no resources are available. That seems easier to implement, and the UX won't be much worse than in the original proposal.

@rokroskar (Member)

I believe it should be possible to query the kubernetes API about resources and provide constraints on the types of nodes you want to consider. imho preventing a launch is better than recovering from a failed launch
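The node-constrained query could look roughly like the sketch below, operating on plain dicts shaped like the node objects the Kubernetes API returns (the label key, node data, and integer quantities are invented for illustration; real `allocatable` values are quantity strings and would need parsing, and a real version would fetch nodes with the official client):

```python
# Hypothetical sketch: sum allocatable resources over the nodes that match a
# label selector, mirroring a constrained query against the Kubernetes API.

def total_allocatable(nodes, selector):
    """Total each resource over nodes whose labels match every selector entry."""
    totals = {}
    for node in nodes:
        labels = node["metadata"]["labels"]
        if all(labels.get(key) == value for key, value in selector.items()):
            for resource, quantity in node["status"]["allocatable"].items():
                totals[resource] = totals.get(resource, 0) + quantity
    return totals

# Invented example: only consider nodes labeled for user sessions
nodes = [
    {"metadata": {"labels": {"renku.io/pool": "user"}},
     "status": {"allocatable": {"cpu": 8, "nvidia.com/gpu": 1}}},
    {"metadata": {"labels": {"renku.io/pool": "system"}},
     "status": {"allocatable": {"cpu": 4}}},
]
print(total_allocatable(nodes, {"renku.io/pool": "user"}))
# {'cpu': 8, 'nvidia.com/gpu': 1}
```

Subtracting the sum of existing pod requests from these totals would give the free capacity needed for a pre-launch check.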

@lorenzo-cavazzi (Member Author)

Ok, then for the UI the idea would be to get extra information about the current availability of the resources.
The GET /options endpoint returns the resources available to the pods. Say we know an interactive environment can request from 1 to 4 CPUs and from 0 to 2 GPUs. The /stats endpoint should tell us how many CPUs and GPUs are currently available so that we can further limit the resources -- or, even better, notify the user that it's temporarily not possible to start a session with more than <currently_available> resources.

If that is feasible, I can quickly formalize the endpoint proposal by sketching it in SwaggerHub.
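For instance, the /stats response could look roughly like this (the field names are a sketch of the idea, not a finalized schema):

```json
{
  "cpu": {"total": 16, "requested": 12, "available": 4},
  "gpu": {"total": 2, "requested": 2, "available": 0}
}
```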

@rokroskar (Member)

Apparently this is not easily solvable (see this discussion), but some tools exist: https://github.com/davidB/kubectl-view-allocations

@olevski (Member)

olevski commented Feb 12, 2021

I am sorry for being late to this. My GitHub notifications are not set up correctly, so I get way too many emails about random things and miss the ones where I am needed.

@rokroskar that is a very good link you posted. In there I found a command like this, which will give you the CPU requested by every pod in a namespace (or in all namespaces).

kubectl get po -o=jsonpath="{range .items[*]}{.metadata.namespace}:{.metadata.name}{'\n'}{range .spec.containers[*]}  {.name}:{.resources.requests.cpu}{'\n'}{end}{'\n'}{end}"

So if we use this command on the backend and combine it with something like kubectl top nodes (or similar) to see the total capacity of the nodes, then I believe we can get the data that @lorenzo-cavazzi is looking for.
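One detail when summing that jsonpath output: CPU requests come back in Kubernetes quantity notation (e.g. 100m vs 2), so the backend would need to normalize them before comparing against capacity. A minimal sketch, covering only the plain and milli-CPU forms (function names are hypothetical):

```python
# Hypothetical sketch: normalize Kubernetes CPU quantity strings to millicores
# so per-pod requests can be summed and compared against node capacity.

def cpu_to_millicores(quantity: str) -> int:
    """Convert '500m' -> 500 and '2' -> 2000; other suffixes are out of scope here."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

def total_requested_cpu(requests) -> int:
    """Sum a list of CPU request strings, in millicores."""
    return sum(cpu_to_millicores(q) for q in requests)

print(total_requested_cpu(["500m", "2", "0.5"]))  # 3000
```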

Some other questions (that I think can be resolved) if we pursue this are:

  • If I am not mistaken, we do not use taints, affinities, etc. to ensure that user servers run on a limited set of nodes? So then the question is how to determine the total capacity that is available for user servers. To resolve this properly we may need to designate which nodes will run user servers and enforce this with affinities, taints, and tolerations.
  • I am not sure whether the credentials that the API uses to access the k8s cluster have enough permissions to run a command like kubectl get nodes (or something similar) to get the total capacity of the nodes.

@Panaetius Panaetius moved this to Backlog in renku-python May 18, 2022
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

3 participants