Proposal: using gang scheduling API for generic distributed training support in Kubeflow #37
Comments
/cc @k82cn
In this proposal, how is the distributed training environment set up for each framework? E.g. the TF_CONFIG env variable in TensorFlow (https://www.tensorflow.org/guide/distribute_strategy#setting_up_tf_config_environment_variable) and MASTER_ADDR, MASTER_PORT, etc. in PyTorch (https://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods). It looks similar to the common operator discussion, which would support all of the frameworks described in the proposal.
Some input here: gang scheduling/co-scheduling is a requirement on the scheduler, so the common operator defines the …
The user's code will be responsible for setting the right environment variables for the framework they are using. The gang scheduler/controller can set a cluster spec in an env variable similar to TF_CONFIG. Taking that cluster spec and converting it into framework-specific env variables should be trivial.
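For illustration, here is a minimal sketch of what that conversion could look like for PyTorch, assuming the controller publishes a TF_CONFIG-like JSON cluster spec; the env variable name CLUSTER_SPEC and its layout are assumptions for this sketch, not an existing Kubeflow API.

```python
# Hypothetical sketch: translate a generic cluster spec (similar in shape to
# TF_CONFIG) into the env variables PyTorch's env:// init method expects.
# The variable name "CLUSTER_SPEC" and its JSON layout are assumptions here,
# not part of any existing Kubeflow API.
import json
import os

import torch.distributed as dist


def init_pytorch_from_cluster_spec():
    # Assumed layout:
    # {"cluster": {"worker": ["host0:2222", "host1:2222", ...]},
    #  "task": {"type": "worker", "index": 0}}
    spec = json.loads(os.environ["CLUSTER_SPEC"])
    workers = spec["cluster"]["worker"]
    rank = spec["task"]["index"]

    # PyTorch's env:// init method reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
    master_host, master_port = workers[0].split(":")
    os.environ["MASTER_ADDR"] = master_host
    os.environ["MASTER_PORT"] = master_port
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(len(workers))

    dist.init_process_group(backend="gloo", init_method="env://")
```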
Where can I find the "common operator discussion"? From the name it sounds similar.
@k82cn https://github.com/volcano-sh/volcano/blob/master/docs/design/job-api.md describes everything that I need and even goes beyond that!
Very glad to hear that :)
Volcano is meant to enhance k8s's batch capability (based on kubernetes/kubernetes#68357); Kubeflow makes it easier for users to use ML frameworks :) And we're going to work together on the batch scheduling part :)
@karthikv2k This is the issue tracking it: kubeflow/training-operator#960
Problem
Currently in Kubeflow, we have a controller per framework (e.g. TF-Job and PyTorch-Operator), and to support a new framework, the message we are giving is that users have to write a new controller. This is a lot of friction for data scientists, who most likely don't know Go and K8s. Even if they do, getting a new version of a controller deployed in a corporate cluster is not easy.
Proposed Solution
In reality, users don't have to write a new controller if they have a generic gang scheduling API; the TF-Job controller already exposes a restricted version of such an API that works for almost all use cases. In fact, the Google AI Platform team implemented distributed PyTorch and XGBoost jobs using the TF-Job API for the Google AI Hub. So if we create a controller for gang scheduling, it will make it easy to add support for new frameworks.
Advantages
Less effort to support a new framework (users don't need K8s or Go expertise)
A better story for portability between Kubeflow and other platforms like Mesos. The same container can be used in other platforms without any changes.
Other infrastructures that support some version of a gang scheduling API
Framework Support
From my understanding, distributed training for the following frameworks can be implemented easily using just a generic gang scheduling API.
Rough API Spec
Almost the same as the current TF-Job spec, but with more generic names and a generalized number of worker groups.
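To make the idea concrete, below is a hypothetical sketch of such a generic spec, written as a dict submitted through the Kubernetes Python client. The group/version/kind (kubeflow.org/v1alpha1, GangJob) and field names such as replicaGroups are made up for illustration only and are not an existing Kubeflow API.

```python
# Hypothetical illustration: a generalized, gang-scheduled job spec in the
# spirit of the TF-Job spec, but with arbitrary named replica groups instead
# of fixed PS/Worker/Chief roles. All API names here are assumptions.
from kubernetes import client, config

gang_job = {
    "apiVersion": "kubeflow.org/v1alpha1",   # assumed group/version
    "kind": "GangJob",                       # assumed kind
    "metadata": {"name": "pytorch-dist-example"},
    "spec": {
        # Any number of named replica groups, each with its own pod template,
        # analogous to the replica specs in today's TF-Job.
        "replicaGroups": [
            {
                "name": "worker",
                "replicas": 4,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "trainer",
                            "image": "example.com/my-trainer:latest",
                        }],
                        "restartPolicy": "OnFailure",
                    }
                },
            }
        ],
    },
}

# Submit the custom object; assumes a CRD named "gangjobs" exists in the cluster.
config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1alpha1",
    namespace="default",
    plural="gangjobs",
    body=gang_job,
)
```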