Rename term gpu to leaf cell #28

abuccts · 2020-07-15T05:39:23Z

Rename gpuType -> leafCellType, gpuNumber -> leafCellNumber, etc.

Rename gpuType -> skuType, gpuNumber -> skuNumber.

codecov-commenter · 2020-07-15T05:41:03Z

Codecov Report

Merging #28 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #28   +/-   ##
=======================================
  Coverage   89.06%   89.06%           
=======================================
  Files           8        8           
  Lines        2231     2231           
=======================================
  Hits         1987     1987           
  Misses        192      192           
  Partials       52       52

Flag	Coverage Δ
#unittests	`89.06% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...com/microsoft/hivedscheduler/pkg/algorithm/cell.go	`92.15% <0.00%> (ø)`
...om/microsoft/hivedscheduler/pkg/algorithm/types.go	`86.30% <0.00%> (ø)`
...om/microsoft/hivedscheduler/pkg/algorithm/utils.go	`86.54% <0.00%> (ø)`
...m/microsoft/hivedscheduler/pkg/algorithm/config.go	`96.69% <0.00%> (ø)`
...hivedscheduler/pkg/algorithm/intra_vc_scheduler.go	`100.00% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 76ed604...ff98dc7. Read the comment docs.

yqwang-ms · 2020-07-15T05:44:56Z

pkg/algorithm/topology_aware_scheduler.go

 	// affinity of the currently picked GPUs, defined as the lowest common ancestor
 	// of the GPUs in the cell hierarchy (lower level means better affinity)
-	currentAffinity := make(CellList, gpuNum)
+	currentAffinity := make(CellList, skuNum)
 	// GPUs with the best affinity ever seen


Still some leftovers:
// GPUs with the best affinity ever seen
bestAffinityGpus
bestAffinityGpuIndices #Closed

zhypku · 2020-07-15T06:06:48Z

doc/user-manual.md

@@ -14,7 +14,7 @@
    Notes:
    1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types).
    2. Currently, the `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to setup proportional Pod resource requests and limits. So, if you are not using with [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip to config it.
-    3. It is previously known as `gpuTypes`, and we are in the progress to rename it to `skuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents.
+    3. It is previously known as `gpuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents.


HiveD is only aware of

zhypku · 2020-07-15T06:07:29Z

example/feature/README.md

 ### Description
-If `gpuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated.
+If `skuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated.


still "GPU"

zhypku · 2020-07-15T06:11:51Z

Maybe we need to explain the concept of SKU somewhere in the documentation, as this name is not that intuitive. For example we can say
A SKU type defines the concrete resource type, for example a certain GPU type (K80, V100, etc.) or CPU cores (for CPU-only jobs).

doc/user-manual.md

zhypku · 2020-07-15T11:55:40Z

example/config/design/hivedscheduler.yaml

@@ -24,7 +24,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
 ################################################################################
 physicalCluster:
  # Define the cell structures.
-  # Each leaf cellType contains a single GPU and also defines a gpuType of the
+  # Each leaf cellType contains a single GPU and also defines a skuType of the


maybe you could search without case sensitivity :)

zhypku · 2020-07-15T12:04:39Z

Seems there is still a lot of "GPU" in the code. I suggest to search for "GPU" without case sensitive or match word, instead of for gpuNumber/gpuType explicitly

Rename gpu to device when referring affinity and index.

Add explanation for sku type and device.

doc/user-manual.md

example/feature/README.md

Revert term sku and device to leaf cell.

Fix.

zhypku

Maybe we should clarify somewhere that leaf cell index in the physical device index, e.g., CPU core index and GPU index. It is not a virtual, internal index.
Or you have mentioned it but I didn't see it :P

zhypku · 2020-07-24T01:16:59Z

README.md

@@ -20,7 +20,7 @@ The killer feature that distinguishes HiveD is that it provides resource guarant

 HiveD protects VCs' resources in terms of **cell**, a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of 8-GPU node, and the VC can be assigned one of such cell. Then, HiveD will ensure that *there is always one 8-GPU node available for the VC*, regardless of the other workloads in the cluster.

-HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different GPU models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
+HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device (e.g. GPU, TPU) models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.


(e.g. GPU, TPU) does not imply different GPU types. How about
different device models (e.g., V100 GPU, P100 GPU, TPU)

zhypku · 2020-07-24T01:23:29Z

pkg/algorithm/cell.go

 	"github.com/microsoft/hivedscheduler/pkg/api"
 	"k8s.io/klog"
 )

-// A Cell represents a set of GPUs affinitized by their interconnection topology.
+// A Cell represents a set of leaf cells affinitized by their interconnection topology.


Can we use cell to define cell?
Maybe device is more appropriate here

recursive definition? using leaf cell to define cell is like using sub tree to define tree

pkg/api/types.go

Convert old spec annotations for backward compatibility.

Update README.

yqwang-ms · 2020-07-27T04:04:16Z

pkg/scheduler/scheduler.go

-	// after the GPUs it runs on become unhealthy.
+	// bad leaf cells or newly deleted Nodes will lead Pod retried.
+	// For newly bad leaf cells, it is like the normal behaviour that a pod will fail
+	// after the leaf cells it runs on become unhealthy.


Here using device may be better? #Closed

yqwang-ms · 2020-07-27T04:07:33Z

pkg/api/config.go

@@ -72,7 +73,7 @@ type Config struct {
 	WaitingPodSchedulingBlockMilliSec *int64 `yaml:"waitingPodSchedulingBlockMilliSec"`

 	// Specify the whole physical cluster
-	// TODO: Automatically construct it based on node info from GPU and Network Device Plugins
+	// TODO: Automatically construct it based on node info from Device and Network Device Plugins


info from Device and Network Device Plugins [](start = 51, length = 43)

info from Device Plugins #Closed

yqwang-ms · 2020-07-27T04:11:53Z

README.md

-2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
-3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
+2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
+3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible SKU Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.


We just use pai sku concept for demo, so in the homepage, we would better not depend on the pai concept. Using device type and Flexible Device Types is better ? #Closed

the sku type here is just links to feature demo readme section, there's no need to introduce a new "device type" here

yqwang-ms · 2020-07-27T04:14:10Z

doc/user-manual.md

@@ -14,7 +14,7 @@
    Notes:
    1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types).
    2. Currently, the `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to setup proportional Pod resource requests and limits. So, if you are not using with [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip to config it.
-    3. It is previously known as `gpuTypes`, and we are in the progress to rename it to `skuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents.
+    3. It is previously known as `gpuTypes`, as HiveD is only aware of the abstract `cell` concept instead of the concrete hardware that the `cell` represents.



Remove gpuTypes comment.
And add doc, to explain the relation between skuType (pai concept) and leafCellType (Hived concept) #Closed

ok, see the doc below

In reply to: 460637628 [](ancestors = 460637628)

yqwang-ms · 2020-07-27T04:27:17Z

pkg/api/constants.go

-	EnvNameNvidiaVisibleDevices  = "NVIDIA_VISIBLE_DEVICES"
-	AnnotationKeyPodGpuIsolation = GroupName + "/pod-gpu-isolation"
+	//    to share these leaf cells.
+	EnvNameNvidiaVisibleDevices       = "NVIDIA_VISIBLE_DEVICES"


As the NVIDIA_VISIBLE_DEVICES is not used in hived code, can you move it to user manual doc?
Such as section:
Scheduling GPUs

NVidia GPUs
NVIDIA_VISIBLE_DEVICES and its comment here.

AMDGPUs
XXXX_VISIBLE_DEVICES and its comment here similar? #Closed

As some places in the comment, still use gpu for NVIDIA_VISIBLE_DEVICES is better.

In reply to: 460640199 [](ancestors = 460640199)

yqwang-ms · 2020-07-27T04:32:07Z

pkg/internal/utils.go

 	bindingPod.Annotations[si.AnnotationKeyPodBindInfo] =
 		common.ToYaml(podBindInfo)

 	return bindingPod
 }

+// ConvertOldAnnotation converts old spec annotations for backward compatibility
+func ConvertOldAnnotation(annotation string) string {


ConvertOldAnnotation [](start = 5, length = 20)

just private func? #Closed

yqwang-ms · 2020-07-27T04:35:20Z

pkg/internal/utils.go

+		}
+	} else {
+		return map[string]string{
+			si.AnnotationKeyPodLeafCellIsolation: allocatedPod.Annotations[strings.Replace(si.AnnotationKeyPodLeafCellIsolation, "leaf-cell", "gpu", -1)],


strings.Replace(si.AnnotationKeyPodLeafCellIsolation, "leaf-cell", "gpu", -1) [](start = 66, length = 77)

Also define the deprecated Annotation in constants.go, instead of generating in runtime.
Such as DeprecatedAnnotationKeyXXX? #Closed

Resolve comments.

yqwang-ms · 2020-07-27T07:13:51Z

README.md

-2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
-3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
+2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
+3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible SKU Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.


Flexible SKU Types [](start = 118, length = 18)

But below is Bad-Hardware-Awareness.
How about here rename to Flexible Hardware Types?

Or rename all "Hardware" word to device? #Closed

yqwang-ms · 2020-07-27T07:15:55Z

doc/user-manual.md

+
+## <a name="Scheduling-GPUs">Scheduling GPUs</a>
+
+To leverage this scheduler, if one container in the Pod want to use the allocated GPUs for the whole Pod,


To leverage this scheduler [](start = 0, length = 26)

To leverage this scheduler to schedule GPUs #Closed

Update.

yqwang-ms

Rename gpuType/gpuNumber to skuType/skuNumber

80614d6

Rename gpuType -> skuType, gpuNumber -> skuNumber.

abuccts requested review from zhypku and yqwang-ms July 15, 2020 05:39

yqwang-ms reviewed Jul 15, 2020

View reviewed changes

zhypku reviewed Jul 15, 2020

View reviewed changes

doc/user-manual.md Show resolved Hide resolved

zhypku reviewed Jul 15, 2020

View reviewed changes

abuccts added 2 commits July 20, 2020 14:31

Rename gpu to device

71c9110

Rename gpu to device when referring affinity and index.

Add explanation for sku type and device

18dde17

Add explanation for sku type and device.

zhypku reviewed Jul 21, 2020

View reviewed changes

doc/user-manual.md Outdated Show resolved Hide resolved

example/feature/README.md Outdated Show resolved Hide resolved

abuccts added 2 commits July 23, 2020 19:05

Revert term sku and device to leaf cell

9aa07f2

Revert term sku and device to leaf cell.

Fix

2b07ddd

Fix.

abuccts changed the title ~~Rename gpuType/gpuNumber to skuType/skuNumber~~ Rename term gpu to leaf cell Jul 23, 2020

zhypku reviewed Jul 24, 2020

View reviewed changes

abuccts added 2 commits July 24, 2020 12:42

Convert old spec annotations for compatibility

9cf2a59

Convert old spec annotations for backward compatibility.

Update README

6404e0e

Update README.

zhypku approved these changes Jul 27, 2020

View reviewed changes

yqwang-ms reviewed Jul 27, 2020

View reviewed changes

Resolve comments

cfeb8ad

Resolve comments.

yqwang-ms reviewed Jul 27, 2020

View reviewed changes

Update

ff98dc7

Update.

yqwang-ms approved these changes Jul 27, 2020

View reviewed changes

abuccts merged commit fbff5b0 into master Jul 27, 2020

abuccts deleted the xiongyf/rename-sku branch July 27, 2020 07:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rename term gpu to leaf cell #28

Rename term gpu to leaf cell #28

abuccts commented Jul 15, 2020 •

edited

Loading

codecov-commenter commented Jul 15, 2020 •

edited

Loading

yqwang-ms Jul 15, 2020 •

edited

Loading

abuccts Jul 20, 2020

zhypku Jul 15, 2020

abuccts Jul 20, 2020

zhypku Jul 15, 2020

abuccts Jul 20, 2020

zhypku commented Jul 15, 2020

zhypku Jul 15, 2020

zhypku Jul 15, 2020

zhypku commented Jul 15, 2020

zhypku left a comment

zhypku Jul 24, 2020

zhypku Jul 24, 2020

abuccts Jul 24, 2020

yqwang-ms Jul 27, 2020 •

edited

Loading

abuccts Jul 27, 2020

yqwang-ms Jul 27, 2020 •

edited

Loading

abuccts Jul 27, 2020

yqwang-ms Jul 27, 2020 •

edited

Loading

abuccts Jul 27, 2020

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020

abuccts Jul 27, 2020

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020

abuccts Jul 27, 2020

yqwang-ms Jul 27, 2020 •

edited

Loading

abuccts Jul 27, 2020

yqwang-ms Jul 27, 2020 •

edited

Loading

abuccts Jul 27, 2020

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms left a comment


		## <a name="Scheduling-GPUs">Scheduling GPUs</a>

		To leverage this scheduler, if one container in the Pod want to use the allocated GPUs for the whole Pod,

Rename term gpu to leaf cell #28

Rename term gpu to leaf cell #28

Conversation

abuccts commented Jul 15, 2020 • edited Loading

codecov-commenter commented Jul 15, 2020 • edited Loading

Codecov Report

yqwang-ms Jul 15, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhypku commented Jul 15, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhypku commented Jul 15, 2020

zhypku left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

yqwang-ms Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

yqwang-ms left a comment

Choose a reason for hiding this comment

abuccts commented Jul 15, 2020 •

edited

Loading

codecov-commenter commented Jul 15, 2020 •

edited

Loading

yqwang-ms Jul 15, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading

yqwang-ms Jul 27, 2020 •

edited

Loading