Add support for nvidia gpu limit #346
Conversation
sdk/python/kfp/compiler/compiler.py
Outdated
  template['container']['resources'] = {}
- if op.memory_limit or op.cpu_limit:
+ if op.memory_limit or op.cpu_limit or op.nvidia_gpu_limit:
For each resource there is a "limit" and a "request". Do you also have the change for set_gpu_request()?
I tried `requests/gpu`, but it doesn't work as expected. According to https://cloud.google.com/kubernetes-engine/docs/how-to/gpus, only the limit is mentioned, so GPU requests appear to be unsupported.
Found this: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
"GPUs are only supposed to be specified in the limits section, which means:
You can specify GPU limits without specifying requests because Kubernetes will use the limit as the request value by default.
You can specify GPU in both limits and requests but these two values must be equal.
You cannot specify GPU requests without specifying limits."
The naming in the doc is misleading to me: `limits` sounds like a max and `requests` like a min, but for GPUs the limit actually acts as both. This is different from cpu and memory.
Is it a good idea to stop the confusion at the DSL side, say by just calling it `set_gpu()`?
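The Kubernetes behavior quoted above can be illustrated with a small sketch (not the actual compiler code; the helper name and parameters are assumptions) of how a GPU limit would be emitted into a container's resources dict:

```python
def build_resources(cpu_limit=None, memory_limit=None, nvidia_gpu_limit=None):
    """Build a Kubernetes-style resources dict.

    Per the Kubernetes scheduling-gpus doc, GPUs go only under 'limits';
    Kubernetes then uses the limit as the request value by default.
    """
    limits = {}
    if cpu_limit:
        limits['cpu'] = cpu_limit
    if memory_limit:
        limits['memory'] = memory_limit
    if nvidia_gpu_limit:
        # Extended resource name used by the NVIDIA device plugin.
        limits['nvidia.com/gpu'] = nvidia_gpu_limit
    return {'limits': limits} if limits else {}
```

For example, `build_resources(nvidia_gpu_limit='1')` yields `{'limits': {'nvidia.com/gpu': '1'}}`, with no `requests` section at all.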
I feel it's better to just leave it open and let users specify the values via `add_resource_limit` or `add_resource_request`. It will be easier for users to follow the instructions in the GKE documentation, and less maintenance for us if things change in the future, e.g. if GKE starts supporting GPUs from other vendors.
SG. Add the link to docstring then?
Ah. You removed the explicit gpu API. Then it is fine.
hmm... maybe it is also worthwhile to keep `set_gpu_limit()`? A lot of Python users (notebook users) rely on IntelliSense, and an explicit GPU setter helps them find it easily. I would also prefer a `set_gpu*` prefix so autocomplete can help here. What do you think of `set_gpu_nvidia_limit()`? Other suggestions? With only `set_resource_limit` plus the cpu/memory setters, it is harder for users to tell whether and how to set a GPU.
Done
sdk/python/kfp/dsl/_container_op.py
Outdated
@@ -155,6 +164,16 @@ def set_cpu_limit(self, cpu):
     self._validate_cpu_string(cpu)
     self.cpu_limit = cpu

+  def set_nvidia_gpu_limit(self, gpu):
Not sure how many types of GPUs we will support, but I am under the impression that NVIDIA is the mainstream one. Would it be better to do:
`set_gpu_limit(self, num_gpu, vendor='nvidia')`?
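The vendor-parameterized signature suggested above could look roughly like this (a hypothetical sketch, not the merged implementation; the class shape and attribute names are assumptions):

```python
class ContainerOp:
    """Minimal stand-in for the DSL container op (illustration only)."""

    def __init__(self):
        self.resource_limits = {}

    def set_gpu_limit(self, num_gpu, vendor='nvidia'):
        """Set a GPU limit, mapping the vendor to its Kubernetes
        extended-resource name, e.g. 'nvidia.com/gpu' or 'amd.com/gpu'."""
        self.resource_limits['%s.com/gpu' % vendor] = str(num_gpu)
        return self
```

With a default of `vendor='nvidia'` the common case stays a one-argument call, while other vendors remain reachable without a new method per GPU type.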
I hope we don't bake that assumption in here; e.g. a user might want to use an AMD GPU. The API should be extensible.
I would also hope that we just expose the Kubernetes resource limits directly. It's hard to predict what resources users will want in the future.
+1 for exposing resource limits and requests directly as generic key/value pairs. We should keep the existing cpu and memory setters, which are used frequently.
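The generic key/value API proposed here could be sketched as follows (method and attribute names are assumptions for illustration, not the final SDK API):

```python
class ContainerOp:
    """Minimal stand-in for the DSL container op (illustration only)."""

    def __init__(self):
        self.resource_limits = {}
        self.resource_requests = {}

    def add_resource_limit(self, resource_name, value):
        """Add a limit for any Kubernetes resource, e.g. 'nvidia.com/gpu'."""
        self.resource_limits[resource_name] = value
        return self

    def add_resource_request(self, resource_name, value):
        """Add a request for any Kubernetes resource, e.g. 'cpu'."""
        self.resource_requests[resource_name] = value
        return self
```

Returning `self` allows chaining, e.g. `op.add_resource_limit('nvidia.com/gpu', '1').add_resource_request('cpu', '500m')`, and any future extended resource works without SDK changes.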
Okay.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: qimingj. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Great work! Super happy to see this implemented so quickly. Thanks!!