Rename term gpu to leaf cell (#28)
* Rename gpuType/gpuNumber to skuType/skuNumber.
* Rename gpu to device when referring to affinity and index.
* Add explanation for SKU type and device.
* Revert the terms sku and device to leaf cell.
* Fix.
* Convert old spec annotations for backward compatibility.
* Update README.
* Resolve review comments.
* Update.
abuccts authored Jul 27, 2020
1 parent 76ed604 commit fbff5b0
Showing 55 changed files with 1,049 additions and 998 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -20,7 +20,7 @@ The killer feature that distinguishes HiveD is that it provides resource guarant

HiveD protects VCs' resources in terms of **cell**, a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of an 8-GPU node, and the VC can be assigned one such cell. Then, HiveD will ensure that *there is always one 8-GPU node available for the VC*, regardless of the other workloads in the cluster.

HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different GPU models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device models (e.g., NVIDIA V100 GPU, AMD Radeon MI100 GPU, Cloud TPU v3), or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
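As a concrete illustration (a sketch only, following the `skuTypes`/`cellTypes` format used in the example configs; the names and numbers below are hypothetical), an 8-GPU V100 node with two PCI-e switches could be modeled as:

```yaml
physicalCluster:
  skuTypes:
    V100:                          # illustrative SKU: one GPU plus its share of CPU
      gpu: 1
      cpu: 6
  cellTypes:
    V100-PCIE-SWITCH:
      childCellType: V100          # leaf cell = a single V100 GPU
      childCellNumber: 4           # 4 GPUs under one PCI-e switch
    V100-NODE:
      childCellType: V100-PCIE-SWITCH
      childCellNumber: 2           # 2 switches -> one 8-GPU node
      isNodeLevel: true
```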

### [Gang Scheduling](example/feature/README.md#Gang-Scheduling)

@@ -34,8 +34,8 @@ HiveD supports multiple job **priorities**. Higher-priority jobs can **[preempt]

## Feature
1. [Multi-Tenancy: Virtual Cluster (VC)](example/feature/README.md#VC-Safety)
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible Hardware Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
4. Optimized Resource Fragmentation and Less Starvation
5. [Priorities](example/feature/README.md#Guaranteed-Job), [Overuse with Low Priority](example/feature/README.md#Opportunistic-Job), and [Inter-](example/feature/README.md#Inter-VC-Preemption)/[Intra-VC Preemption](example/feature/README.md#Intra-VC-Preemption)
6. [Job (Full/Partial) Gang Scheduling/Preemption](example/feature/README.md#Gang-Scheduling)
12 changes: 6 additions & 6 deletions doc/design/state-machine.md
@@ -68,7 +68,7 @@ For all cells currently associated with other AGs:

`Used` (by other AGs) -> `Reserving` (by this AG) (e<sub>2</sub> in cell state machine);

`Reserving`/`Reserved` (by other AGs) -> `Reserving`/`Reserved` (by this AG) (e<sub>3</sub>/e<sub>6</sub> in cell state machine);

For free cells:

@@ -94,7 +94,7 @@ __e<sub>4</sub>__:

Condition: all pods of this AG are deleted.

Operation:
all cells `Used` (by this AG) -> `Free` (e<sub>1</sub> in cell state machine).

__e<sub>5</sub>__:
@@ -118,7 +118,7 @@ __e<sub>7</sub>__:

Condition: all pods of this AG are deleted.

Operation:

All the `Reserving` cells (by this AG) -> `Used` (by the `Being preempted` AG currently associated with the cell) (e<sub>4</sub> in cell state machine).

@@ -132,7 +132,7 @@ Operation: none.

## Cell State Machine

Cell is the resource unit in HiveD. The figure below shows the state machine of cell. Note that here cells are _lowest-level physical cells_, e.g., single-GPU cells in typical configs (we record states only in these cells).
Cell is the resource unit in HiveD. The figure below shows the state machine of a cell. Note that the cells here are _lowest-level physical cells_, e.g., leaf cells in typical configs (we record states only in these cells).

<p style="text-align: center;">
<img src="img/cell-state-machine.png" title="cell" alt="cell" width="70%"/>
@@ -188,7 +188,7 @@ __e<sub>2</sub>__:

Condition: triggered by another AG from `Pending` to `Preempting` (i.e., that AG is preempting the `Allocated` AG currently associated with this cell) (e<sub>1</sub> in AG state machine).

Operation:

The `Allocated` AG on this cell -> `Being preempted` (e<sub>6</sub> in AG state machine);

@@ -236,7 +236,7 @@ __e<sub>8</sub>__:

Condition: triggered by (i) there is currently a `Preempting` AG on this cell but another `Allocated` AG is now associated with the cell (e<sub>0</sub> in AG state machine); OR (ii) the `Preempting` AG currently associated with this cell transitions to `Allocated` (e<sub>2</sub> in AG state machine).

Operation:

For (i): the `Preempting` AG on this cell -> `Pending` (e<sub>5</sub> in AG state machine); release the cell and then allocate it to the new `Allocated` AG.

38 changes: 36 additions & 2 deletions doc/user-manual.md
@@ -2,6 +2,7 @@

## <a name="Index">Index</a>
- [Config](#Config)
- [Scheduling GPUs](#Scheduling-GPUs)

## <a name="Config">Config</a>
### <a name="ConfigQuickStart">Config QuickStart</a>
@@ -14,7 +15,6 @@
Notes:
1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types).
2. Currently, `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to set up proportional Pod resource requests and limits. So, if you are not using the [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip configuring it.
3. It is previously known as `gpuTypes`, and we are in the progress to rename it to `skuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents.

**Example:**
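A minimal sketch of such a `skuTypes` entry, assuming the same fields as the feature example configs (the values below are illustrative, not prescriptive):

```yaml
physicalCluster:
  skuTypes:
    K80:
      gpu: 1        # GPUs per leaf cell
      cpu: 4        # CPU cores reserved per leaf cell
      memory: 23Gi  # memory reserved per leaf cell (illustrative value)
```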

@@ -117,7 +117,7 @@
5. Put it together

**Example:**

Finally, after the above steps, your config would be:
```yaml
physicalCluster:
@@ -155,3 +155,37 @@

### <a name="ConfigDetail">Config Detail</a>
[Detail Example](../example/config)

## <a name="Scheduling-GPUs">Scheduling GPUs</a>

To leverage this scheduler to schedule GPUs, a container in the Pod that wants to use the GPUs allocated to the whole Pod should contain the corresponding environment variable below:

* NVIDIA GPUs

```yaml
env:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```
The scheduler directly delivers its GPU isolation decision to [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) through the Pod env `NVIDIA_VISIBLE_DEVICES`.

* AMD GPUs

```yaml
env:
- name: AMD_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```
The scheduler directly delivers its GPU isolation decision to [rocm-container-runtime](https://github.com/abuccts/rocm-container-runtime) through the Pod env `AMD_VISIBLE_DEVICES`.

The annotation referenced by the env will be populated by the scheduler when it binds the Pod.

If multiple containers in the Pod contain the env, all the allocated GPUs are visible to them, so it is up to these containers to decide how to share the GPUs.
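For reference, a minimal end-to-end Pod sketch might look like the following; the `pod-scheduling-spec` fields, their values, and the scheduler name are assumptions based on a typical HiveD deployment after this rename and may differ in your setup:

```yaml
# Sketch only: a 1-GPU Pod scheduled by HiveD with NVIDIA isolation.
apiVersion: v1
kind: Pod
metadata:
  name: demo-1gpu-pod                # hypothetical Pod name
  annotations:
    hivedscheduler.microsoft.com/pod-scheduling-spec: |-
      virtualCluster: VC1            # assumed VC name
      priority: 1000
      leafCellType: K80              # optional; omit to accept any leaf cell type
      leafCellNumber: 1
spec:
  schedulerName: hivedscheduler      # assumed scheduler name in your deployment
  containers:
  - name: worker
    image: nvidia/cuda:10.0-base
    env:
    - name: NVIDIA_VISIBLE_DEVICES   # isolation decision injected at bind time
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```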
14 changes: 7 additions & 7 deletions example/config/design/hivedscheduler.yaml
@@ -11,7 +11,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
#
# Constrains:
# 1. All cellTypes should form a forest, i.e. a disjoint union of trees.
# 2. All physicalCells should contain at most one physical specific GPU.
# 2. All physicalCells should contain at most one physical specific device.
# 3. Each physicalCell should contain exactly one node level cellType.
# 4. Each physicalCell should specify full hierarchies defined by its cellType.
# 5. A pinnedCellId should be able to universally locate one physicalCell.
@@ -24,7 +24,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
################################################################################
physicalCluster:
# Define the cell structures.
# Each leaf cellType contains a single GPU and also defines a gpuType of the
# Each leaf cellType contains a single device and also defines a leafCellType of the
# same name.
cellTypes:
#######################################
@@ -35,8 +35,8 @@ physicalCluster:
childCellType: CT1
# Specify how many child cells it contains.
childCellNumber: 2
# Specify whether it is a node level cellType, i.e. contains all GPUs of
# its corresponding gpuType within one node and only contains these GPUs.
# Specify whether it is a node level cellType, i.e. contains all leaf cells of
# its corresponding leafCellType within one node and only contains these leaf cells.
# Defaults to false.
isNodeLevel: true

@@ -149,15 +149,15 @@ physicalCluster:
cellAddress: 0.0.0.0
- cellType: CT1-NODE
cellAddress: 0.0.0.1
# One node has multiple gpu types and
# non-standard gpu indices (by explicitly specifying cell addresses)
# One node has multiple leaf cell types and
# non-standard leaf cell indices (by explicitly specifying cell addresses)
- cellType: CT1-NODE
cellAddress: 1.0.0.2 # NODE Name
cellChildren:
- cellAddress: 8 # GPU Index
pinnedCellId: VC1-YQW-CT1
- cellAddress: 9 # GPU Index
# One cell has non-standard gpu indices
# One cell has non-standard leaf cell indices
- cellType: 3-DGX1-P100-NODE
cellChildren:
# cellAddress can be omitted for non-node level cellType, which defaults to
40 changes: 21 additions & 19 deletions example/feature/README.md
@@ -9,7 +9,7 @@

HiveD guarantees **quota safety for all VCs**, in the sense that the requests to cells defined in each VC can always be satisfied.

VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#GPU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never make fragmentation inside other VCs:
VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#SKU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never make fragmentation inside other VCs:

Consider two DGX-2s and two VCs, each owning one DGX-2 node. A traditional scheduler translates this into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to VC1, the user in VC2 might not be able to run a 16-GPU job, due to a possible fragmentation issue caused by VC1. HiveD, in contrast, guarantees that each VC always has one entire node available for its dedicated use.

@@ -30,19 +30,21 @@ This is similar to [K8S Taints and Tolerations](https://kubernetes.io/docs/conce
2. Submit job [itc-pin](file/itc-pin.yaml) to VC1, all tasks in task role vc1pinned will be on node 10.151.41.25 (which is pinned), all tasks in task role vc1nopinned will NOT be on node 10.151.41.25.
<img src="file/itc-pin.png" width="900"/>

## GPU Type
## SKU Type
### Description
If `gpuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated.
`skuType` is the leaf `cellType`, which has no internal topology.

If `skuType` is specified in the job, only that type of leaf cell will be allocated to the job; otherwise, any type of leaf cell can be allocated.

This is similar to [K8S Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels), but with [VC Safety](#VC-Safety) guaranteed.

### Reproduce Steps
#### `gpuType` specified
#### `skuType` specified
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-k80-type](file/itc-k80-type.yaml), it will be partially running (some tasks waiting because all the specified K80 GPUs are used).
<img src="file/itc-k80-type.png" width="900"/>

#### `gpuType` not specified
#### `skuType` not specified
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-no-type](file/itc-no-type.yaml), it will be fully running, and some tasks are using K80 (10.151.41.18) while others are using M60 (10.151.41.26).
<img src="file/itc-no-type.png" width="900"/>
@@ -135,7 +137,7 @@ One VC's [Guaranteed Job](#Guaranteed-Job) can preempt other VCs' [Opportunistic

## Topology-Aware Intra-VC Scheduling
### Description
Within one VC, HiveD chooses nearest GPUs for one `AffinityGroup` in best effort.
Within one VC, HiveD chooses the nearest leaf cells for an `AffinityGroup` on a best-effort basis.

### Reproduce Steps
1. Use [hived-config-2](file/hived-config-2.yaml).
@@ -147,40 +149,40 @@ Within one VC, HiveD chooses nearest GPUs for one `AffinityGroup` in best effort

## Work-Preserving Reconfiguration
### Description
HiveD can be reconfigured without unnecessary user impacts, such as add/update/delete physical/virtual clusters, GPU types/topologies, etc.
HiveD can be reconfigured without unnecessary user impact, such as adding/updating/deleting physical/virtual clusters, device types/topologies, etc.

### Reproduce Steps
#### PhysicalCluster Reconfig - Delete PhysicalCell
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `gpuType`. Wait until it is running.
3. Delete all M60 `gpuType` related PhysicalCells and VirtualCells from [hived-config-2](file/hived-config-2.yaml), i.e. becomes [hived-config-33](file/hived-config-33.yaml).
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `skuType`. Wait until it is running.
3. Delete all M60 `skuType` related PhysicalCells and VirtualCells from [hived-config-2](file/hived-config-2.yaml), i.e. becomes [hived-config-33](file/hived-config-33.yaml).
4. Use [hived-config-33](file/hived-config-33.yaml), and restart HiveD.
5. The job will still run without any impact, but its M60 usage is ignored by HiveD.
*However, normally, the job will still fail if the corresponding physical node is later deleted from K8S or unhealthy.*
<img src="file/itc-reconfig-1.png" width="900"/>

#### PhysicalCluster Reconfig - Add PhysicalCell
1. Use [hived-config-33](file/hived-config-33.yaml).
2. Submit job [itc-k80-type](file/itc-k80-type.yaml) which requests K80 `gpuType`. Wait until it is running.
3. Add all M60 `gpuType` related PhysicalCells and VirtualCells into [hived-config-33](file/hived-config-33.yaml), i.e. becomes [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-k80-type](file/itc-k80-type.yaml) which requests K80 `skuType`. Wait until it is running.
3. Add all M60 `skuType` related PhysicalCells and VirtualCells into [hived-config-33](file/hived-config-33.yaml), i.e. becomes [hived-config-2](file/hived-config-2.yaml).
4. Use [hived-config-2](file/hived-config-2.yaml), and restart HiveD.
5. The job will still run without any impact, and its K80 usage is still accounted by HiveD.
<img src="file/itc-k80-type.png" width="900"/>

#### PhysicalCluster Reconfig - Update PhysicalCell - Add Node
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `gpuType`. Wait until it is running.
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `skuType`. Wait until it is running.
3. Add one M60 node into a PhysicalCell, then becomes [hived-config-4](file/hived-config-4.yaml).
4. Use [hived-config-4](file/hived-config-4.yaml), and restart HiveD.
5. The job will still run without any impact, and its M60 usage is still accounted by HiveD.
6. To confirm the job is not impacted, such as [lazy preempted](#Lazy-Preemption). Submit job [itc-reconfig-2](file/itc-reconfig-2.yaml) which requests all M60 nodes and has the same priority as [itc-reconfig-1](file/itc-reconfig-1.yaml). The job will be waiting instead of preempting [itc-reconfig-1](file/itc-reconfig-1.yaml).
<img src="file/itc-reconfig-2.png" width="900"/>

#### PhysicalCluster Reconfig - Update PhysicalCell - Delete Node
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) which requests K80 `gpuType`. Wait until it is running.
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) which requests K80 `skuType`. Wait until it is running.
3. Delete one K80 node used by [itc-reconfig-3](file/itc-reconfig-3.yaml) from a PhysicalCell, then becomes [hived-config-7](file/hived-config-7.yaml).
4. Use [hived-config-7](file/hived-config-7.yaml), and restart HiveD.
5. The job will still run without any impact, but its deleted node usage is ignored by HiveD.
*However, normally, the job will still fail if the corresponding physical node is later deleted from K8S or unhealthy.*
<img src="file/itc-reconfig-3-1.png" width="900"/>
@@ -189,7 +191,7 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to default VC. Wait until it is running.
3. Delete the default VC and move its quota to VC1, then becomes [hived-config-5](file/hived-config-5.yaml).
4. Use [hived-config-5](file/hived-config-5.yaml), and restart HiveD.
5. The job will still run without any interruption but [lazy preempted](#Lazy-Preemption) by HiveD.
<img src="file/itc-reconfig-3.png" width="900"/>
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-4](file/itc-reconfig-4.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
@@ -199,7 +201,7 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to default VC. Wait until it is running.
3. Move one K80-NODE cell from default VC to VC1, then becomes [hived-config-6](file/hived-config-6.yaml).
4. Use [hived-config-6](file/hived-config-6.yaml), and restart HiveD.
5. The job will still run without any interruption but [lazy preempted](#Lazy-Preemption) by HiveD.
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-5](file/itc-reconfig-5.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
<img src="file/itc-reconfig-5.png" width="900"/>
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-1.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-2.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-3.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-33.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
2 changes: 1 addition & 1 deletion example/feature/file/hived-config-4.yaml
@@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080

physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4
