-
Notifications
You must be signed in to change notification settings - Fork 39
Conversation
Rename gpuType -> skuType, gpuNumber -> skuNumber.
Codecov Report
@@ Coverage Diff @@
## master #28 +/- ##
=======================================
Coverage 89.06% 89.06%
=======================================
Files 8 8
Lines 2231 2231
=======================================
Hits 1987 1987
Misses 192 192
Partials 52 52
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
// affinity of the currently picked GPUs, defined as the lowest common ancestor | ||
// of the GPUs in the cell hierarchy (lower level means better affinity) | ||
currentAffinity := make(CellList, gpuNum) | ||
currentAffinity := make(CellList, skuNum) | ||
// GPUs with the best affinity ever seen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still some leftovers:
// GPUs with the best affinity ever seen
bestAffinityGpus
bestAffinityGpuIndices #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
doc/user-manual.md
Outdated
@@ -14,7 +14,7 @@ | |||
Notes: | |||
1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types). | |||
2. Currently, the `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to setup proportional Pod resource requests and limits. So, if you are not using with [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip to config it. | |||
3. It is previously known as `gpuTypes`, and we are in the progress to rename it to `skuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents. | |||
3. It is previously known as `gpuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HiveD is only aware of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
example/feature/README.md
Outdated
### Description | ||
If `gpuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated. | ||
If `skuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still "GPU"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Maybe we need to explain the concept of SKU somewhere in the documentation, as this name is not that intuitive. For example we can say |
@@ -24,7 +24,7 @@ kubeApiServerAddress: http://10.10.10.10:8080 | |||
################################################################################ | |||
physicalCluster: | |||
# Define the cell structures. | |||
# Each leaf cellType contains a single GPU and also defines a gpuType of the | |||
# Each leaf cellType contains a single GPU and also defines a skuType of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPU
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you could search without case sensitivity :)
Seems there is still a lot of "GPU" in the code. I suggest to search for "GPU" without case sensitive or match word, instead of for gpuNumber/gpuType explicitly |
Rename gpu to device when referring affinity and index.
Add explanation for sku type and device.
Revert term sku and device to leaf cell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should clarify somewhere that leaf cell index in the physical device index, e.g., CPU core index and GPU index. It is not a virtual, internal index.
Or you have mentioned it but I didn't see it :P
README.md
Outdated
@@ -20,7 +20,7 @@ The killer feature that distinguishes HiveD is that it provides resource guarant | |||
|
|||
HiveD protects VCs' resources in terms of **cell**, a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of 8-GPU node, and the VC can be assigned one of such cell. Then, HiveD will ensure that *there is always one 8-GPU node available for the VC*, regardless of the other workloads in the cluster. | |||
|
|||
HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different GPU models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them. | |||
HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device (e.g. GPU, TPU) models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(e.g. GPU, TPU)
does not imply different GPU types. How about
different device models (e.g., V100 GPU, P100 GPU, TPU)
"github.com/microsoft/hivedscheduler/pkg/api" | ||
"k8s.io/klog" | ||
) | ||
|
||
// A Cell represents a set of GPUs affinitized by their interconnection topology. | ||
// A Cell represents a set of leaf cells affinitized by their interconnection topology. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use cell to define cell?
Maybe device is more appropriate here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recursive definition? using leaf cell to define cell is like using sub tree to define tree
Convert old spec annotations for backward compatibility.
Update README.
pkg/scheduler/scheduler.go
Outdated
// after the GPUs it runs on become unhealthy. | ||
// bad leaf cells or newly deleted Nodes will lead Pod retried. | ||
// For newly bad leaf cells, it is like the normal behaviour that a pod will fail | ||
// after the leaf cells it runs on become unhealthy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here using device may be better? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
pkg/api/config.go
Outdated
@@ -72,7 +73,7 @@ type Config struct { | |||
WaitingPodSchedulingBlockMilliSec *int64 `yaml:"waitingPodSchedulingBlockMilliSec"` | |||
|
|||
// Specify the whole physical cluster | |||
// TODO: Automatically construct it based on node info from GPU and Network Device Plugins | |||
// TODO: Automatically construct it based on node info from Device and Network Device Plugins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
info from Device and Network Device Plugins [](start = 51, length = 43)
info from Device Plugins #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
README.md
Outdated
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc. | ||
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc. | ||
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc. | ||
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible SKU Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just use pai sku concept for demo, so in the homepage, we would better not depend on the pai concept. Using device type and Flexible Device Types is better ? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the sku type here is just links to feature demo readme section, there's no need to introduce a new "device type" here
@@ -14,7 +14,7 @@ | |||
Notes: | |||
1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types). | |||
2. Currently, the `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to setup proportional Pod resource requests and limits. So, if you are not using with [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip to config it. | |||
3. It is previously known as `gpuTypes`, and we are in the progress to rename it to `skuTypes`, as HiveD only awares the abstract `cell` concept instead of the concrete hardware that the `cell` represents. | |||
3. It is previously known as `gpuTypes`, as HiveD is only aware of the abstract `cell` concept instead of the concrete hardware that the `cell` represents. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove gpuTypes comment.
And add doc, to explain the relation between skuType (pai concept) and leafCellType (Hived concept) #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
pkg/api/constants.go
Outdated
EnvNameNvidiaVisibleDevices = "NVIDIA_VISIBLE_DEVICES" | ||
AnnotationKeyPodGpuIsolation = GroupName + "/pod-gpu-isolation" | ||
// to share these leaf cells. | ||
EnvNameNvidiaVisibleDevices = "NVIDIA_VISIBLE_DEVICES" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the NVIDIA_VISIBLE_DEVICES is not used in hived code, can you move it to user manual doc?
Such as section:
Scheduling GPUs
- NVidia GPUs
NVIDIA_VISIBLE_DEVICES and its comment here. - AMDGPUs
XXXX_VISIBLE_DEVICES and its comment here similar? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As some places in the comment, still use gpu for NVIDIA_VISIBLE_DEVICES is better.
In reply to: 460640199 [](ancestors = 460640199)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
pkg/internal/utils.go
Outdated
bindingPod.Annotations[si.AnnotationKeyPodBindInfo] = | ||
common.ToYaml(podBindInfo) | ||
|
||
return bindingPod | ||
} | ||
|
||
// ConvertOldAnnotation converts old spec annotations for backward compatibility | ||
func ConvertOldAnnotation(annotation string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConvertOldAnnotation [](start = 5, length = 20)
just private func? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
pkg/internal/utils.go
Outdated
} | ||
} else { | ||
return map[string]string{ | ||
si.AnnotationKeyPodLeafCellIsolation: allocatedPod.Annotations[strings.Replace(si.AnnotationKeyPodLeafCellIsolation, "leaf-cell", "gpu", -1)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strings.Replace(si.AnnotationKeyPodLeafCellIsolation, "leaf-cell", "gpu", -1) [](start = 66, length = 77)
Also define the deprecated Annotation in constants.go, instead of generating in runtime.
Such as DeprecatedAnnotationKeyXXX? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Resolve comments.
README.md
Outdated
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc. | ||
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc. | ||
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc. | ||
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible SKU Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flexible SKU Types [](start = 118, length = 18)
But below is Bad-Hardware-Awareness.
How about here rename to Flexible Hardware Types?
Or rename all "Hardware" word to device? #Closed
doc/user-manual.md
Outdated
|
||
## <a name="Scheduling-GPUs">Scheduling GPUs</a> | ||
|
||
To leverage this scheduler, if one container in the Pod want to use the allocated GPUs for the whole Pod, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To leverage this scheduler [](start = 0, length = 26)
To leverage this scheduler to schedule GPUs #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename gpuType -> leafCellType, gpuNumber -> leafCellNumber, etc.