Proposal: using gang scheduling API for generic distributed training support in Kubeflow #37
Comments
/cc @k82cn
In this proposal, how is the distributed training environment set up for each framework? E.g. the TF_CONFIG env variable in TensorFlow (https://www.tensorflow.org/guide/distribute_strategy#setting_up_tf_config_environment_variable) and MASTER_ADDR, MASTER_PORT, etc. in PyTorch (https://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods). It looks similar to the common operator discussion, which would support all of the frameworks described in the proposal.
Some input here: gang scheduling/co-scheduling is a requirement on the scheduler, so the common operator defines the …
The user's code will be responsible for setting the right environment variables for the framework they are using. The gang scheduler/controller can set a cluster spec in an env variable similar to TF_CONFIG. Taking that cluster spec and converting it into framework-specific env variables should be trivial.
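For illustration, here is a minimal sketch of what that conversion could look like for PyTorch, assuming the controller publishes a TF_CONFIG-like JSON cluster spec; the env variable name CLUSTER_SPEC and its layout are assumptions for this sketch, not an existing Kubeflow API.

```python
# Hypothetical sketch: translate a generic cluster spec (similar in shape to
# TF_CONFIG) into the env variables PyTorch's env:// init method expects.
# The variable name "CLUSTER_SPEC" and its JSON layout are assumptions here,
# not part of any existing Kubeflow API.
import json
import os

import torch.distributed as dist


def init_pytorch_from_cluster_spec():
    # Assumed layout:
    # {"cluster": {"worker": ["host0:2222", "host1:2222", ...]},
    #  "task": {"type": "worker", "index": 0}}
    spec = json.loads(os.environ["CLUSTER_SPEC"])
    workers = spec["cluster"]["worker"]
    rank = spec["task"]["index"]

    # PyTorch's env:// init method reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
    master_host, master_port = workers[0].split(":")
    os.environ["MASTER_ADDR"] = master_host
    os.environ["MASTER_PORT"] = master_port
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(len(workers))

    dist.init_process_group(backend="gloo", init_method="env://")
```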
Where can I find the "common operator discussion"? From the name it sounds similar.
@k82cn https://github.com/volcano-sh/volcano/blob/master/docs/design/job-api.md describes everything that I need and even goes beyond that!
Very glad to hear that :)
Volcano is meant to enhance k8s's batch capability (based on kubernetes/kubernetes#68357); Kubeflow makes it easier for users to use ML frameworks :) And we're going to work together on the batch scheduling part :)
@karthikv2k This is the issue tracking it: kubeflow/training-operator#960
Problem
Currently in Kubeflow, we have a controller per framework (e.g. TF-Job and PyTorch-Operator), and to support a new framework, the message we are giving is that users have to write a new controller. This is a lot of friction for data scientists, who most likely don't know Go and K8s. Even if they do, getting a new version of a controller deployed in a corporate cluster is not easy.
Proposed Solution
In reality, users don't have to write a new controller if they have a generic gang scheduling API; the TF-Job controller already exposes a restricted version of such an API that works for almost all use cases. In fact, the Google AI Platform team implemented distributed PyTorch and XGBoost jobs using the TF-Job API for the Google AI Hub. So if we create a controller for gang scheduling, it will make it easy to add support for new frameworks.
Advantages
Less effort to support a new framework (users don't need K8s or Go expertise)
A better story for portability between Kubeflow and other platforms like Mesos. The same container can be used in other platforms without any changes.
Other infrastructures that support some version of a gang scheduling API
Framework Support
From my understanding, distributed training for the following frameworks can be implemented easily using just a generic gang scheduling API.
Rough API Spec
Almost the same as the current TF-Job spec, but with more generic names and a generalized number of worker groups.
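To make the idea concrete, below is a hypothetical sketch of such a generic spec, written as a dict submitted through the Kubernetes Python client. The group/version/kind (kubeflow.org/v1alpha1, GangJob) and field names such as replicaGroups are made up for illustration only and are not an existing Kubeflow API.

```python
# Hypothetical illustration: a generalized, gang-scheduled job spec in the
# spirit of the TF-Job spec, but with arbitrary named replica groups instead
# of fixed PS/Worker/Chief roles. All API names here are assumptions.
from kubernetes import client, config

gang_job = {
    "apiVersion": "kubeflow.org/v1alpha1",   # assumed group/version
    "kind": "GangJob",                       # assumed kind
    "metadata": {"name": "pytorch-dist-example"},
    "spec": {
        # Any number of named replica groups, each with its own pod template,
        # analogous to the replica specs in today's TF-Job.
        "replicaGroups": [
            {
                "name": "worker",
                "replicas": 4,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "trainer",
                            "image": "example.com/my-trainer:latest",
                        }],
                        "restartPolicy": "OnFailure",
                    }
                },
            }
        ],
    },
}

# Submit the custom object; assumes a CRD named "gangjobs" exists in the cluster.
config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1alpha1",
    namespace="default",
    plural="gangjobs",
    body=gang_job,
)
```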