Skip to content
This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Rename term gpu to leaf cell #28

Merged
merged 9 commits into from
Jul 27, 2020
Merged

Rename term gpu to leaf cell #28

merged 9 commits into from
Jul 27, 2020

Conversation

abuccts
Copy link
Member

@abuccts abuccts commented Jul 15, 2020

Rename gpuType -> leafCellType, gpuNumber -> leafCellNumber, etc.

Rename gpuType -> skuType, gpuNumber -> skuNumber.
@abuccts abuccts requested review from zhypku and yqwang-ms July 15, 2020 05:39
@codecov-commenter
Copy link

codecov-commenter commented Jul 15, 2020

Codecov Report

Merging #28 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #28   +/-   ##
=======================================
  Coverage   89.06%   89.06%           
=======================================
  Files           8        8           
  Lines        2231     2231           
=======================================
  Hits         1987     1987           
  Misses        192      192           
  Partials       52       52           
Flag Coverage Δ
#unittests 89.06% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...com/microsoft/hivedscheduler/pkg/algorithm/cell.go 92.15% <0.00%> (ø)
...om/microsoft/hivedscheduler/pkg/algorithm/types.go 86.30% <0.00%> (ø)
...om/microsoft/hivedscheduler/pkg/algorithm/utils.go 86.54% <0.00%> (ø)
...m/microsoft/hivedscheduler/pkg/algorithm/config.go 96.69% <0.00%> (ø)
...hivedscheduler/pkg/algorithm/intra_vc_scheduler.go 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 76ed604...ff98dc7. Read the comment docs.

// affinity of the currently picked GPUs, defined as the lowest common ancestor
// of the GPUs in the cell hierarchy (lower level means better affinity)
currentAffinity := make(CellList, gpuNum)
currentAffinity := make(CellList, skuNum)
// GPUs with the best affinity ever seen
Copy link
Member

@yqwang-ms yqwang-ms Jul 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still some leftovers:
// GPUs with the best affinity ever seen
bestAffinityGpus
bestAffinityGpuIndices #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -14,7 +14,7 @@
Notes:
1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types).
2. Currently, the `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to setup proportional Pod resource requests and limits. So, if you are not using with [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip to config it.
3. It is previously known as `gpuTypes`, and we are in the progress to rename it to `skuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents.
3. It is previously known as `gpuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HiveD is only aware of

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

### Description
If `gpuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated.
If `skuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still "GPU"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@zhypku
Copy link
Contributor

zhypku commented Jul 15, 2020

Maybe we need to explain the concept of SKU somewhere in the documentation, as this name is not that intuitive. For example we can say
A SKU type defines the concrete resource type, for example a certain GPU type (K80, V100, etc.) or CPU cores (for CPU-only jobs).

@@ -24,7 +24,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
################################################################################
physicalCluster:
# Define the cell structures.
# Each leaf cellType contains a single GPU and also defines a gpuType of the
# Each leaf cellType contains a single GPU and also defines a skuType of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPU

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you could search without case sensitivity :)

@zhypku
Copy link
Contributor

zhypku commented Jul 15, 2020

Seems there is still a lot of "GPU" in the code. I suggest to search for "GPU" without case sensitive or match word, instead of for gpuNumber/gpuType explicitly

abuccts added 2 commits July 20, 2020 14:31
Rename gpu to device when referring affinity and index.
Add explanation for sku type and device.
abuccts added 2 commits July 23, 2020 19:05
Revert term sku and device to leaf cell.
Fix.
@abuccts abuccts changed the title Rename gpuType/gpuNumber to skuType/skuNumber Rename term gpu to leaf cell Jul 23, 2020
Copy link
Contributor

@zhypku zhypku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should clarify somewhere that leaf cell index in the physical device index, e.g., CPU core index and GPU index. It is not a virtual, internal index.
Or you have mentioned it but I didn't see it :P

README.md Outdated
@@ -20,7 +20,7 @@ The killer feature that distinguishes HiveD is that it provides resource guarant

HiveD protects VCs' resources in terms of **cell**, a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of 8-GPU node, and the VC can be assigned one of such cell. Then, HiveD will ensure that *there is always one 8-GPU node available for the VC*, regardless of the other workloads in the cluster.

HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different GPU models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device (e.g. GPU, TPU) models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(e.g. GPU, TPU) does not imply different GPU types. How about
different device models (e.g., V100 GPU, P100 GPU, TPU)

"github.com/microsoft/hivedscheduler/pkg/api"
"k8s.io/klog"
)

// A Cell represents a set of GPUs affinitized by their interconnection topology.
// A Cell represents a set of leaf cells affinitized by their interconnection topology.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use cell to define cell?
Maybe device is more appropriate here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recursive definition? using leaf cell to define cell is like using sub tree to define tree

abuccts added 2 commits July 24, 2020 12:42
Convert old spec annotations for backward compatibility.
Update README.
// after the GPUs it runs on become unhealthy.
// bad leaf cells or newly deleted Nodes will lead Pod retried.
// For newly bad leaf cells, it is like the normal behaviour that a pod will fail
// after the leaf cells it runs on become unhealthy.
Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here using device may be better? #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -72,7 +73,7 @@ type Config struct {
WaitingPodSchedulingBlockMilliSec *int64 `yaml:"waitingPodSchedulingBlockMilliSec"`

// Specify the whole physical cluster
// TODO: Automatically construct it based on node info from GPU and Network Device Plugins
// TODO: Automatically construct it based on node info from Device and Network Device Plugins
Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

info from Device and Network Device Plugins [](start = 51, length = 43)

info from Device Plugins #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

README.md Outdated
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible SKU Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just use pai sku concept for demo, so in the homepage, we would better not depend on the pai concept. Using device type and Flexible Device Types is better ? #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sku type here is just links to feature demo readme section, there's no need to introduce a new "device type" here

@@ -14,7 +14,7 @@
Notes:
1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types).
2. Currently, the `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to setup proportional Pod resource requests and limits. So, if you are not using with [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip to config it.
3. It is previously known as `gpuTypes`, and we are in the progress to rename it to `skuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents.
3. It is previously known as `gpuTypes`, as HiveD is only aware of the abstract `cell` concept instead of the concrete hardware that the `cell` represents.

Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove gpuTypes comment.
And add doc, to explain the relation between skuType (pai concept) and leafCellType (Hived concept) #Closed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, see the doc below


In reply to: 460637628 [](ancestors = 460637628)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

EnvNameNvidiaVisibleDevices = "NVIDIA_VISIBLE_DEVICES"
AnnotationKeyPodGpuIsolation = GroupName + "/pod-gpu-isolation"
// to share these leaf cells.
EnvNameNvidiaVisibleDevices = "NVIDIA_VISIBLE_DEVICES"
Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the NVIDIA_VISIBLE_DEVICES is not used in hived code, can you move it to user manual doc?
Such as section:
Scheduling GPUs

  1. NVidia GPUs
    NVIDIA_VISIBLE_DEVICES and its comment here.
  2. AMDGPUs
    XXXX_VISIBLE_DEVICES and its comment here similar? #Closed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As some places in the comment, still use gpu for NVIDIA_VISIBLE_DEVICES is better.


In reply to: 460640199 [](ancestors = 460640199)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

bindingPod.Annotations[si.AnnotationKeyPodBindInfo] =
common.ToYaml(podBindInfo)

return bindingPod
}

// ConvertOldAnnotation converts old spec annotations for backward compatibility
func ConvertOldAnnotation(annotation string) string {
Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConvertOldAnnotation [](start = 5, length = 20)

just private func? #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

}
} else {
return map[string]string{
si.AnnotationKeyPodLeafCellIsolation: allocatedPod.Annotations[strings.Replace(si.AnnotationKeyPodLeafCellIsolation, "leaf-cell", "gpu", -1)],
Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings.Replace(si.AnnotationKeyPodLeafCellIsolation, "leaf-cell", "gpu", -1) [](start = 66, length = 77)

Also define the deprecated Annotation in constants.go, instead of generating in runtime.
Such as DeprecatedAnnotationKeyXXX? #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Resolve comments.
README.md Outdated
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible SKU Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flexible SKU Types [](start = 118, length = 18)

But below is Bad-Hardware-Awareness.
How about here rename to Flexible Hardware Types?

Or rename all "Hardware" word to device? #Closed


## <a name="Scheduling-GPUs">Scheduling GPUs</a>

To leverage this scheduler, if one container in the Pod want to use the allocated GPUs for the whole Pod,
Copy link
Member

@yqwang-ms yqwang-ms Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To leverage this scheduler [](start = 0, length = 26)

To leverage this scheduler to schedule GPUs #Closed

Update.
Copy link
Member

@yqwang-ms yqwang-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@abuccts abuccts merged commit fbff5b0 into master Jul 27, 2020
@abuccts abuccts deleted the xiongyf/rename-sku branch July 27, 2020 07:57
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants