Rename term gpu to leaf cell (#28)
* Rename gpuType/gpuNumber to skuType/skuNumber.
* Rename gpu to device when referring to affinity and index.
* Add explanation for SKU type and device.
* Revert the terms sku and device to leaf cell.
* Fix.
* Convert old spec annotations for backward compatibility.
* Update README.
* Resolve review comments.
* Update.
abuccts authored Jul 27, 2020
1 parent 76ed604 commit fbff5b0
Showing 55 changed files with 1,049 additions and 998 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -20,7 +20,7 @@ The killer feature that distinguishes HiveD is that it provides resource guarant

HiveD protects VCs' resources in terms of **cell**, a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of an 8-GPU node, and the VC can be assigned one such cell. Then, HiveD will ensure that *there is always one 8-GPU node available for the VC*, regardless of the other workloads in the cluster.

HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different GPU models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device models (e.g., NVIDIA V100 GPU, AMD Radeon MI100 GPU, Cloud TPU v3), or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
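As a concrete illustration (a sketch only, following the `skuTypes`/`cellTypes` format used in the example configs; the names and numbers below are hypothetical), an 8-GPU V100 node with two PCI-e switches could be modeled as:

```yaml
physicalCluster:
  skuTypes:
    V100:                          # illustrative SKU: one GPU plus its share of CPU
      gpu: 1
      cpu: 6
  cellTypes:
    V100-PCIE-SWITCH:
      childCellType: V100          # leaf cell = a single V100 GPU
      childCellNumber: 4           # 4 GPUs under one PCI-e switch
    V100-NODE:
      childCellType: V100-PCIE-SWITCH
      childCellNumber: 2           # 2 switches -> one 8-GPU node
      isNodeLevel: true
```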

### [Gang Scheduling](example/feature/README.md#Gang-Scheduling)

@@ -34,8 +34,8 @@ HiveD supports multiple job **priorities**. Higher-priority jobs can **[preempt]

## Feature
1. [Multi-Tenancy: Virtual Cluster (VC)](example/feature/README.md#VC-Safety)
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible Hardware Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
4. Optimized Resource Fragmentation and Less Starvation
5. [Priorities](example/feature/README.md#Guaranteed-Job), [Overuse with Low Priority](example/feature/README.md#Opportunistic-Job), and [Inter-](example/feature/README.md#Inter-VC-Preemption)/[Intra-VC Preemption](example/feature/README.md#Intra-VC-Preemption)
6. [Job (Full/Partial) Gang Scheduling/Preemption](example/feature/README.md#Gang-Scheduling)
12 changes: 6 additions & 6 deletions doc/design/state-machine.md
@@ -68,7 +68,7 @@ For all cells currently associated with other AGs:

`Used` (by other AGs) -> `Reserving` (by this AG) (e<sub>2</sub> in cell state machine);

`Reserving`/`Reserved` (by other AGs) -> `Reserving`/`Reserved` (by this AG) (e<sub>3</sub>/e<sub>6</sub> in cell state machine);

For free cells:

@@ -94,7 +94,7 @@ __e<sub>4</sub>__:

Condition: all pods of this AG are deleted.

Operation:
all cells `Used` (by this AG) -> `Free` (e<sub>1</sub> in cell state machine).

__e<sub>5</sub>__:
@@ -118,7 +118,7 @@ __e<sub>7</sub>__:

Condition: all pods of this AG are deleted.

Operation:

All the `Reserving` cells (by this AG) -> `Used` (by the `Being preempted` AG currently associated with the cell) (e<sub>4</sub> in cell state machine).

@@ -132,7 +132,7 @@ Operation: none.

## Cell State Machine

Cell is the resource unit in HiveD. The figure below shows the state machine of cell. Note that here cells are _lowest-level physical cells_, e.g., single-GPU cells in typical configs (we record states only in these cells).
Cell is the resource unit in HiveD. The figure below shows the state machine of a cell. Note that the cells here are _lowest-level physical cells_, e.g., leaf cells in typical configs (we record states only in these cells).

<p style="text-align: center;">
<img src="img/cell-state-machine.png" title="cell" alt="cell" width="70%"/>
@@ -188,7 +188,7 @@ __e<sub>2</sub>__:

Condition: triggered by another AG from `Pending` to `Preempting` (i.e., that AG is preempting the `Allocated` AG currently associated with this cell) (e<sub>1</sub> in AG state machine).

Operation:

The `Allocated` AG on this cell -> `Being preempted` (e<sub>6</sub> in AG state machine);

@@ -236,7 +236,7 @@ __e<sub>8</sub>__:

Condition: triggered by (i) there is currently a `Preempting` AG on this cell but another `Allocated` AG is now associated with the cell (e<sub>0</sub> in AG state machine); OR (ii) the `Preempting` AG currently associated with this cell transitions to `Allocated` (e<sub>2</sub> in AG state machine).

Operation:

For (i): the `Preempting` AG on this cell -> `Pending` (e<sub>5</sub> in AG state machine); release the cell and then allocate it to the new `Allocated` AG.

38 changes: 36 additions & 2 deletions doc/user-manual.md
@@ -2,6 +2,7 @@

## <a name="Index">Index</a>
- [Config](#Config)
- [Scheduling GPUs](#Scheduling-GPUs)

## <a name="Config">Config</a>
### <a name="ConfigQuickStart">Config QuickStart</a>
@@ -14,7 +15,6 @@
Notes:
1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types).
2. Currently, `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to set up proportional Pod resource requests and limits. So, if you are not using the [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip configuring it.
3. It is previously known as `gpuTypes`, and we are in the progress to rename it to `skuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents.

**Example:**
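A minimal sketch of such a `skuTypes` entry, assuming the same fields as the feature example configs (the values below are illustrative, not prescriptive):

```yaml
physicalCluster:
  skuTypes:
    K80:
      gpu: 1        # GPUs per leaf cell
      cpu: 4        # CPU cores reserved per leaf cell
      memory: 23Gi  # memory reserved per leaf cell (illustrative value)
```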

@@ -117,7 +117,7 @@
5. Put it together

**Example:**

Finally, after the above steps, your config would be:
```yaml
physicalCluster:
@@ -155,3 +155,37 @@

### <a name="ConfigDetail">Config Detail</a>
[Detail Example](../example/config)

## <a name="Scheduling-GPUs">Scheduling GPUs</a>

To leverage this scheduler to schedule GPUs, a container in the Pod that wants to use the GPUs allocated to the whole Pod should contain the corresponding environment variable below:

* NVIDIA GPUs

```yaml
env:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```
The scheduler directly delivers its GPU isolation decision to [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) through the Pod env `NVIDIA_VISIBLE_DEVICES`.

* AMD GPUs

```yaml
env:
- name: AMD_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```
The scheduler directly delivers its GPU isolation decision to [rocm-container-runtime](https://github.com/abuccts/rocm-container-runtime) through the Pod env `AMD_VISIBLE_DEVICES`.

The annotation referenced by the env will be populated by the scheduler when it binds the Pod.

If multiple containers in the Pod contain the env, all the allocated GPUs are visible to them, so it is up to these containers to decide how to share the GPUs.
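For reference, a minimal end-to-end Pod sketch might look like the following; the `pod-scheduling-spec` fields, their values, and the scheduler name are assumptions based on a typical HiveD deployment after this rename and may differ in your setup:

```yaml
# Sketch only: a 1-GPU Pod scheduled by HiveD with NVIDIA isolation.
apiVersion: v1
kind: Pod
metadata:
  name: demo-1gpu-pod                # hypothetical Pod name
  annotations:
    hivedscheduler.microsoft.com/pod-scheduling-spec: |-
      virtualCluster: VC1            # assumed VC name
      priority: 1000
      leafCellType: K80              # optional; omit to accept any leaf cell type
      leafCellNumber: 1
spec:
  schedulerName: hivedscheduler      # assumed scheduler name in your deployment
  containers:
  - name: worker
    image: nvidia/cuda:10.0-base
    env:
    - name: NVIDIA_VISIBLE_DEVICES   # isolation decision injected at bind time
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```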
14 changes: 7 additions & 7 deletions example/config/design/hivedscheduler.yaml
@@ -11,7 +11,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
#
# Constrains:
# 1. All cellTypes should form a forest, i.e. a disjoint union of trees.
# 2. All physicalCells should contain at most one physical specific GPU.
# 2. All physicalCells should contain at most one physical specific device.
# 3. Each physicalCell should contain exactly one node level cellType.
# 4. Each physicalCell should specify full hierarchies defined by its cellType.
# 5. A pinnedCellId should be able to universally locate one physicalCell.
@@ -24,7 +24,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
################################################################################
physicalCluster:
# Define the cell structures.
# Each leaf cellType contains a single GPU and also defines a gpuType of the
# Each leaf cellType contains a single device and also defines a leafCellType of the
# same name.
cellTypes:
#######################################
@@ -35,8 +35,8 @@ physicalCluster:
childCellType: CT1
# Specify how many child cells it contains.
childCellNumber: 2
# Specify whether it is a node level cellType, i.e. contains all GPUs of
# its corresponding gpuType within one node and only contains these GPUs.
# Specify whether it is a node level cellType, i.e. contains all leaf cells of
# its corresponding leafCellType within one node and only contains these leaf cells.
# Defaults to false.
isNodeLevel: true

@@ -149,15 +149,15 @@ physicalCluster:
cellAddress: 0.0.0.0
- cellType: CT1-NODE
cellAddress: 0.0.0.1
# One node has multiple gpu types and
# non-standard gpu indices (by explicitly specifying cell addresses)
# One node has multiple leaf cell types and
# non-standard leaf cell indices (by explicitly specifying cell addresses)
- cellType: CT1-NODE
cellAddress: 1.0.0.2 # NODE Name
cellChildren:
- cellAddress: 8 # GPU Index
pinnedCellId: VC1-YQW-CT1
- cellAddress: 9 # GPU Index
# One cell has non-standard gpu indices
# One cell has non-standard leaf cell indices
- cellType: 3-DGX1-P100-NODE
cellChildren:
# cellAddress can be omitted for non-node level cellType, which defaults to
40 changes: 21 additions & 19 deletions example/feature/README.md
@@ -9,7 +9,7 @@

HiveD guarantees **quota safety for all VCs**, in the sense that the requests to cells defined in each VC can always be satisfied.

VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#GPU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never make fragmentation inside other VCs:
VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#SKU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never make fragmentation inside other VCs:

Consider two DGX-2s and two VCs, each owning one DGX-2 node. A traditional scheduler translates this into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to VC1, the user in VC2 might not be able to run a 16-GPU job, due to a possible fragmentation issue caused by VC1. HiveD, in contrast, guarantees that each VC always has one entire node available for its dedicated use.

@@ -30,19 +30,21 @@ This is similar to [K8S Taints and Tolerations](https://kubernetes.io/docs/conce
2. Submit job [itc-pin](file/itc-pin.yaml) to VC1, all tasks in task role vc1pinned will be on node 10.151.41.25 (which is pinned), all tasks in task role vc1nopinned will NOT be on node 10.151.41.25.
<img src="file/itc-pin.png" width="900"/>

## GPU Type
## SKU Type
### Description
If `gpuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated.
`skuType` is the leaf `cellType`, which has no internal topology.

If `skuType` is specified in the job, only that type of leaf cell will be allocated to the job; otherwise, any type of leaf cell can be allocated.

This is similar to [K8S Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels), but with [VC Safety](#VC-Safety) guaranteed.

### Reproduce Steps
#### `gpuType` specified
#### `skuType` specified
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-k80-type](file/itc-k80-type.yaml), it will be partially running (some tasks waiting because all the specified K80 GPUs are used).
<img src="file/itc-k80-type.png" width="900"/>

#### `gpuType` not specified
#### `skuType` not specified
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-no-type](file/itc-no-type.yaml), it will be fully running, and some tasks are using K80 (10.151.41.18) while others are using M60 (10.151.41.26).
<img src="file/itc-no-type.png" width="900"/>
@@ -135,7 +137,7 @@ One VC's [Guaranteed Job](#Guaranteed-Job) can preempt other VCs' [Opportunistic

## Topology-Aware Intra-VC Scheduling
### Description
Within one VC, HiveD chooses nearest GPUs for one `AffinityGroup` in best effort.
Within one VC, HiveD chooses the nearest leaf cells for an `AffinityGroup` on a best-effort basis.

### Reproduce Steps
1. Use [hived-config-2](file/hived-config-2.yaml).
@@ -147,40 +149,40 @@ Within one VC, HiveD chooses nearest GPUs for one `AffinityGroup` in best effort

## Work-Preserving Reconfiguration
### Description
HiveD can be reconfigured without unnecessary user impacts, such as add/update/delete physical/virtual clusters, GPU types/topologies, etc.
HiveD can be reconfigured without unnecessary user impact, such as adding/updating/deleting physical/virtual clusters, device types/topologies, etc.

### Reproduce Steps
#### PhysicalCluster Reconfig - Delete PhysicalCell
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `gpuType`. Wait until it is running.
3. Delete all M60 `gpuType` related PhysicalCells and VirtualCells from [hived-config-2](file/hived-config-2.yaml), i.e. becomes [hived-config-33](file/hived-config-33.yaml).
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `skuType`. Wait until it is running.
3. Delete all M60 `skuType` related PhysicalCells and VirtualCells from [hived-config-2](file/hived-config-2.yaml), i.e. becomes [hived-config-33](file/hived-config-33.yaml).
4. Use [hived-config-33](file/hived-config-33.yaml), and restart HiveD.
5. The job will still run without any impact, but its M60 usage is ignored by HiveD.
*However, normally, the job will still fail if the corresponding physical node is later deleted from K8S or unhealthy.*
<img src="file/itc-reconfig-1.png" width="900"/>

#### PhysicalCluster Reconfig - Add PhysicalCell
1. Use [hived-config-33](file/hived-config-33.yaml).
2. Submit job [itc-k80-type](file/itc-k80-type.yaml) which requests K80 `gpuType`. Wait until it is running.
3. Add all M60 `gpuType` related PhysicalCells and VirtualCells into [hived-config-33](file/hived-config-33.yaml), i.e. becomes [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-k80-type](file/itc-k80-type.yaml) which requests K80 `skuType`. Wait until it is running.
3. Add all M60 `skuType` related PhysicalCells and VirtualCells into [hived-config-33](file/hived-config-33.yaml), i.e. becomes [hived-config-2](file/hived-config-2.yaml).
4. Use [hived-config-2](file/hived-config-2.yaml), and restart HiveD.
5. The job will still run without any impact, and its K80 usage is still accounted by HiveD.
<img src="file/itc-k80-type.png" width="900"/>

#### PhysicalCluster Reconfig - Update PhysicalCell - Add Node
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `gpuType`. Wait until it is running.
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `skuType`. Wait until it is running.
3. Add one M60 node into a PhysicalCell, then becomes [hived-config-4](file/hived-config-4.yaml).
4. Use [hived-config-4](file/hived-config-4.yaml), and restart HiveD.
5. The job will still run without any impact, and its M60 usage is still accounted by HiveD.
6. To confirm the job is not impacted, such as [lazy preempted](#Lazy-Preemption). Submit job [itc-reconfig-2](file/itc-reconfig-2.yaml) which requests all M60 nodes and has the same priority as [itc-reconfig-1](file/itc-reconfig-1.yaml). The job will be waiting instead of preempting [itc-reconfig-1](file/itc-reconfig-1.yaml).
<img src="file/itc-reconfig-2.png" width="900"/>

#### PhysicalCluster Reconfig - Update PhysicalCell - Delete Node
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) which requests K80 `gpuType`. Wait until it is running.
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) which requests K80 `skuType`. Wait until it is running.
3. Delete one K80 node used by [itc-reconfig-3](file/itc-reconfig-3.yaml) from a PhysicalCell, then becomes [hived-config-7](file/hived-config-7.yaml).
4. Use [hived-config-7](file/hived-config-7.yaml), and restart HiveD.
5. The job will still run without any impact, but its deleted node usage is ignored by HiveD.
*However, normally, the job will still fail if the corresponding physical node is later deleted from K8S or unhealthy.*
<img src="file/itc-reconfig-3-1.png" width="900"/>
@@ -189,7 +191,7 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to default VC. Wait until it is running.
3. Delete the default VC and move its quota to VC1, then becomes [hived-config-5](file/hived-config-5.yaml).
4. Use [hived-config-5](file/hived-config-5.yaml), and restart HiveD.
5. The job will still run without any interruption but [lazy preempted](#Lazy-Preemption) by HiveD.
<img src="file/itc-reconfig-3.png" width="900"/>
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-4](file/itc-reconfig-4.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
@@ -199,7 +201,7 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to default VC. Wait until it is running.
3. Move one K80-NODE cell from default VC to VC1, then becomes [hived-config-6](file/hived-config-6.yaml).
4. Use [hived-config-6](file/hived-config-6.yaml), and restart HiveD.
5. The job will still run without any interruption but [lazy preempted](#Lazy-Preemption) by HiveD.
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-5](file/itc-reconfig-5.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
<img src="file/itc-reconfig-5.png" width="900"/>
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-1.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-2.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-3.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-33.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-4.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
