Merge pull request #48 from Project-HAMi/add-unified-config
Add unified configMap, fix license, update libvgpu.so
archlitchi authored Feb 18, 2025
2 parents d3d3797 + b24d45a commit 44e38a7
Showing 12 changed files with 40 additions and 60 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -130,6 +130,7 @@ kubernetes.tar.gz
coverage.txt

updateso.sh
volcano-vgpu-device-plugin

lib/nvidia/libvgpu/build

3 changes: 2 additions & 1 deletion README.md
@@ -1,5 +1,7 @@
# Volcano vgpu device plugin for Kubernetes

[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FProject-HAMi%2Fvolcano-vgpu-device-plugin.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2FProject-HAMi%2Fvolcano-vgpu-device-plugin?ref=badge_shield)

**Note**:

Volcano vgpu device-plugin can provide device-sharing mechanism for NVIDIA devices managed by volcano.
@@ -177,7 +179,6 @@ EOF
You can validate device memory using nvidia-smi inside container:

![img](./doc/hard_limit.jpg)
[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FProject-HAMi%2Fvolcano-vgpu-device-plugin.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2FProject-HAMi%2Fvolcano-vgpu-device-plugin?ref=badge_shield)

> **WARNING:** *if you don't request GPUs when using the device plugin with NVIDIA images all
> the GPUs on the machine will be exposed inside your container.
1 change: 0 additions & 1 deletion go.mod
@@ -23,7 +23,6 @@ require (
k8s.io/klog v1.0.0
k8s.io/klog/v2 v2.80.1
k8s.io/kubelet v0.0.0
k8s.io/kubernetes v1.18.2
sigs.k8s.io/yaml v1.2.0
)

7 changes: 0 additions & 7 deletions go.sum
@@ -294,7 +294,6 @@ github.com/hashicorp/hcl v1.0.0 h1:0Anlzjpi4vEasTeNFn2mLJgTSwt0+6sfsiTG8qcWGx4=
github.com/hashicorp/hcl v1.0.0/go.mod h1:E5yfLk+7swimpb2L/Alb/PJmXilQ/rhwaUYs4T20WEQ=
github.com/heketi/heketi v9.0.1-0.20190917153846-c2e2a4ab7ab9+incompatible/go.mod h1:bB9ly3RchcQqsQ9CpyaQwvva7RS5ytVoSoholZQON6o=
github.com/heketi/tests v0.0.0-20151005000721-f3775cbcefd6/go.mod h1:xGMAM8JLi7UkZt1i4FQeQy0R2T8GLUwQhOP5M1gBhy4=
github.com/hpcloud/tail v1.0.0 h1:nfCOvKYfkgYP8hkirhJocXT2+zOD8yUNjXaWfTlyFKI=
github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU=
github.com/imdario/mergo v0.3.5 h1:JboBksRwiiAJWvIYJVo46AfV+IAIKZpfrSzVKj42R4Q=
github.com/imdario/mergo v0.3.5/go.mod h1:2EnlNZ0deacrJVfApfmtdGgDfMuh/nq6Ok1EcJh5FfA=
@@ -397,13 +396,11 @@ github.com/olekukonko/tablewriter v0.0.0-20170122224234-a0225b3f23b5/go.mod h1:v
github.com/onsi/ginkgo v0.0.0-20170829012221-11459a886d9c/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
github.com/onsi/ginkgo v1.8.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
github.com/onsi/ginkgo v1.11.0 h1:JAKSXpt1YjtLA7YpPiqO9ss6sNXEsPfSGdwN0UHqzrw=
github.com/onsi/ginkgo v1.11.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
github.com/onsi/gomega v0.0.0-20170829124025-dcabb60a477c/go.mod h1:C1qb7wdrVGGVU+Z6iS04AVkA3Q65CEZX59MT0QO5uiA=
github.com/onsi/gomega v1.4.2/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY=
github.com/onsi/gomega v1.4.3/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY=
github.com/onsi/gomega v1.5.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY=
github.com/onsi/gomega v1.7.0 h1:XPnZz8VVBHjVsy1vzJmRwIcSwiUO+JFfrv/xGiigmME=
github.com/onsi/gomega v1.7.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY=
github.com/opencontainers/go-digest v1.0.0-rc1/go.mod h1:cMLVZDEM3+U2I4VmLI6N8jQYUd2OVphdqWwCJHrFt2s=
github.com/opencontainers/image-spec v1.0.1/go.mod h1:BtxoFyWECRxE4U/7sNtV5W15zMzWCbyJoFRP3s7yZA0=
@@ -717,7 +714,6 @@ gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127 h1:qIbj1fsPNlZgppZ+VLlY7N33
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/cheggaaa/pb.v1 v1.0.25/go.mod h1:V/YB90LKu/1FcN3WVnfiiE5oMCibMjukxqG/qStrOgw=
gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
gopkg.in/fsnotify.v1 v1.4.7 h1:xOHLXZwVvI9hhs+cLKq5+I5onOuwQLhQwiu63xxlHs4=
gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys=
gopkg.in/gcfg.v1 v1.2.0/go.mod h1:yesOnuUOFQAhST5vPY4nbZsb/huCgGGXlipJsBn0b3o=
gopkg.in/gemnasium/logrus-airbrake-hook.v2 v2.1.2/go.mod h1:Xk6kEKp8OKb+X14hQBKWaSkCsqBpgog8nAV2xsGOxlo=
@@ -727,7 +723,6 @@ gopkg.in/mcuadros/go-syslog.v2 v2.2.1/go.mod h1:l5LPIyOOyIdQquNg+oU6Z3524YwrcqEm
gopkg.in/natefinch/lumberjack.v2 v2.0.0/go.mod h1:l0ndWWf7gzL7RNwBG7wST/UCcT4T24xpD6X8LsfU/+k=
gopkg.in/resty.v1 v1.12.0/go.mod h1:mDo4pnntr5jdWRML875a/NmxYqAlA73dVijT2AXvQQo=
gopkg.in/square/go-jose.v2 v2.2.2/go.mod h1:M9dMgbHiYLoDGQrXy7OpJDJWiKiU//h+vD76mk0e1AI=
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 h1:uRGJdciOHaEIrze2W8Q3AKkepLTh2hOroT7a+7czfdQ=
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw=
gopkg.in/warnings.v0 v0.1.1/go.mod h1:jksf8JmL6Qr/oQM2OXTHunEvvTAsrWBLb6OOjuVWRNI=
gopkg.in/yaml.v2 v2.0.0-20170812160011-eb3733d160e7/go.mod h1:JAlM8MvJe8wmxCU4Bli9HhUf9+ttbYbLASfIpnQbh74=
@@ -774,14 +769,12 @@ k8s.io/klog/v2 v2.80.1 h1:atnLQ121W371wYYFawwYx1aEY2eUfs4l3J72wtgAwV4=
k8s.io/klog/v2 v2.80.1/go.mod h1:y1WjHnz7Dj687irZUWR/WLkLc5N1YHtjLdmgWjndZn0=
k8s.io/kube-aggregator v0.18.2/go.mod h1:ijq6FnNUoKinA6kKbkN6svdTacSoQVNtKqmQ1+XJEYQ=
k8s.io/kube-controller-manager v0.18.2/go.mod h1:v45wCqexTrOltgwj92V4ve7hm5f70GQzi4a47/RQ0HQ=
k8s.io/kube-openapi v0.0.0-20200121204235-bf4fb3bd569c h1:/KUFqjjqAcY4Us6luF5RDNZ16KJtb49HfR3ZHB9qYXM=
k8s.io/kube-openapi v0.0.0-20200121204235-bf4fb3bd569c/go.mod h1:GRQhZsXIAJ1xR0C9bd8UpWHZ5plfAS9fzPjJuQ6JL3E=
k8s.io/kube-proxy v0.18.2/go.mod h1:VTgyDMdylYGgHVqLQo/Nt4yDWkh/LRsSnxRiG8GVgDo=
k8s.io/kube-scheduler v0.18.2/go.mod h1:dL+C0Hp/ahQOQK3BsgmV8btb3BtMZvz6ONUw/v1N8sk=
k8s.io/kubectl v0.18.2/go.mod h1:OdgFa3AlsPKRpFFYE7ICTwulXOcMGXHTc+UKhHKvrb4=
k8s.io/kubelet v0.18.2 h1:DXXwda6vfm2zKNiL/eCYr0N3ab6CU26UkYioBHySUMQ=
k8s.io/kubelet v0.18.2/go.mod h1:7x/nzlIWJLg7vOfmbQ4lgsYazEB0gOhjiYiHK1Gii4M=
k8s.io/kubernetes v1.18.2 h1:37sJPq6p+gx5hEHQSwCWXIiXDc9AajzV1A5UrswnDq0=
k8s.io/kubernetes v1.18.2/go.mod h1:z8xjOOO1Ljz+TaHpOxVGC7cxtF32TesIamoQ+BZrVS0=
k8s.io/legacy-cloud-providers v0.18.2/go.mod h1:zzFRqgDC6cP1SgPl7lMmo1fjILDZ+bsNtTjLnxAfgI0=
k8s.io/metrics v0.18.2/go.mod h1:qga8E7QfYNR9Q89cSCAjinC9pTZ7yv1XSVGUB0vJypg=
29 changes: 2 additions & 27 deletions pkg/plugin/nvidia/kube_interactor.go
@@ -24,16 +24,14 @@ import (
"time"

v1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/resource"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/fields"
"k8s.io/apimachinery/pkg/types"
"k8s.io/apimachinery/pkg/util/wait"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/klog"
pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
nodeutil "k8s.io/kubernetes/pkg/util/node"
"volcano.sh/k8s-device-plugin/pkg/plugin/vgpu/util"
)

type KubeInteractor struct {
@@ -92,29 +90,6 @@ func (ki *KubeInteractor) GetPendingPodsOnNode() ([]v1.Pod, error) {
return res, nil
}

func (ki *KubeInteractor) PatchGPUResourceOnNode(gpuCount int) error {
var err error
err = wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
var node *v1.Node
node, err = ki.clientset.CoreV1().Nodes().Get(context.TODO(), ki.nodeName, metav1.GetOptions{})
if err != nil {
klog.V(4).Infof("failed to get node %s: %v", ki.nodeName, err)
return false, nil
}

newNode := node.DeepCopy()
newNode.Status.Capacity[VolcanoGPUNumber] = *resource.NewQuantity(int64(gpuCount), resource.DecimalSI)
newNode.Status.Allocatable[VolcanoGPUNumber] = *resource.NewQuantity(int64(gpuCount), resource.DecimalSI)
_, _, err = nodeutil.PatchNodeStatus(ki.clientset.CoreV1(), types.NodeName(ki.nodeName), node, newNode)
if err != nil {
klog.V(4).Infof("failed to patch volcano gpu resource: %v", err)
return false, nil
}
return true, nil
})
return err
}

func (ki *KubeInteractor) PatchUnhealthyGPUListOnNode(devices []*Device) error {
var err error
unhealthyGPUsStr := ""
@@ -144,7 +119,7 @@ func (ki *KubeInteractor) PatchUnhealthyGPUListOnNode(devices []*Device) error {
} else {
delete(newNode.Annotations, UnhealthyGPUIDs)
}
_, _, err = nodeutil.PatchNodeStatus(ki.clientset.CoreV1(), types.NodeName(ki.nodeName), node, newNode)
err = util.PatchNodeAnnotations(node, newNode.Annotations)
if err != nil {
klog.V(4).Infof("failed to patch volcano unhealthy gpu list %s: %v", unhealthyGPUsStr, err)
return false, nil
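
Reviewer note: the hunk above swaps `nodeutil.PatchNodeStatus` for `util.PatchNodeAnnotations`, whose body is not part of this diff. As a rough sketch only, assuming the helper sends a JSON merge patch on `metadata.annotations`, the patch body could be built as below. The function name `annotationPatch` and the annotation key are hypothetical and are not taken from the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// annotationPatch builds a JSON merge patch body for node annotations.
// Keys in set are written with their values; keys in remove are sent as
// null, which is how a merge patch asks the server to delete a key.
// Hypothetical helper, not the repository's util.PatchNodeAnnotations.
func annotationPatch(set map[string]string, remove []string) ([]byte, error) {
	ann := make(map[string]*string, len(set)+len(remove))
	for k, v := range set {
		v := v // copy so each entry points at its own value
		ann[k] = &v
	}
	for _, k := range remove {
		ann[k] = nil // marshals to JSON null
	}
	return json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{"annotations": ann},
	})
}

func main() {
	key := "example.com/unhealthy-gpu-ids" // hypothetical annotation key

	// Record two unhealthy GPUs.
	body, err := annotationPatch(map[string]string{key: "GPU-0,GPU-1"}, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))

	// Clear the annotation once all GPUs are healthy again.
	body, err = annotationPatch(nil, []string{key})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

One thing that may be worth checking in review: under merge-patch semantics a key that is simply absent from the patch is left unchanged on the server. If the helper builds its patch from `newNode.Annotations` alone, the `delete(newNode.Annotations, UnhealthyGPUIDs)` branch might not clear an existing annotation.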
10 changes: 1 addition & 9 deletions pkg/plugin/nvidia/server.go
@@ -135,14 +135,6 @@ func (m *NvidiaDevicePlugin) Name() string {
// and starts the device healthchecks.
func (m *NvidiaDevicePlugin) Start() error {
m.initialize()
// must be called after initialize
if m.resourceName == VolcanoGPUMemory {
if err := m.kubeInteractor.PatchGPUResourceOnNode(len(m.physicalDevices)); err != nil {
log.Printf("failed to patch gpu resource: %v", err)
m.cleanup()
return fmt.Errorf("failed to patch gpu resource: %v", err)
}
}

err := m.Serve()
if err != nil {
@@ -261,7 +253,7 @@ func (m *NvidiaDevicePlugin) GetDevicePluginOptions(context.Context, *pluginapi.
// ListAndWatch lists devices and update that list according to the health status
func (m *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
if m.resourceName == VolcanoGPUMemory {
klog.Infoln("virtualDevices=-0=-=-=-=", len(m.virtualDevices))
klog.Infoln("virtualDevices=", len(m.virtualDevices))
err := s.Send(&pluginapi.ListAndWatchResponse{Devices: m.virtualDevices})
if err != nil {
log.Fatalf("failed sending devices %d: %v", len(m.virtualDevices), err)
6 changes: 5 additions & 1 deletion pkg/plugin/vgpu/config/config.go
@@ -37,6 +37,7 @@ type NvidiaConfig struct {
var (
DeviceSplitCount uint
GPUMemoryFactor uint
Mode string
DeviceCoresScaling float64
NodeName string
RuntimeSocketFlag string
@@ -55,7 +56,10 @@ type MigTemplateUsage struct {
InUse bool `json:"inuse,omitempty"`
}

type Geometry []MigTemplate
type Geometry struct {
Group string `yaml:"group"`
Instances []MigTemplate `yaml:"geometries"`
}

type MIGS []MigTemplateUsage

1 change: 1 addition & 0 deletions pkg/plugin/vgpu/plugin.go
@@ -114,6 +114,7 @@ func readFromConfigFile(sConfig *config.NvidiaConfig) (string, error) {
}
if len(val.OperatingMode) > 0 {
mode = val.OperatingMode
config.Mode = mode
}
klog.Infof("FilterDevice: %v", val.FilterDevice)
}
1 change: 1 addition & 0 deletions pkg/plugin/vgpu/register.go
@@ -69,6 +69,7 @@ func (r *DeviceRegister) apiDevices() *[]*util.DeviceInfo {
Id: dev.ID,
Count: int32(config.DeviceSplitCount),
Devmem: registeredmem,
Mode: config.Mode,
Type: fmt.Sprintf("%v-%v", "NVIDIA", *ndev.Model),
Health: strings.EqualFold(dev.Health, "healthy"),
})
2 changes: 1 addition & 1 deletion pkg/plugin/vgpu/util/util.go
@@ -143,7 +143,7 @@ func DecodeNodeDevices(str string) []*DeviceInfo {
func EncodeNodeDevices(dlist []*DeviceInfo) string {
tmp := ""
for _, val := range dlist {
tmp += val.Id + "," + strconv.FormatInt(int64(val.Count), 10) + "," + strconv.Itoa(int(val.Devmem)) + "," + val.Type + "," + strconv.FormatBool(val.Health) + ":"
tmp += val.Id + "," + strconv.FormatInt(int64(val.Count), 10) + "," + strconv.Itoa(int(val.Devmem)) + "," + val.Type + "," + strconv.FormatBool(val.Health) + "," + val.Mode + ":"
}
klog.V(3).Infoln("Encoded node Devices", tmp)
return tmp
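
Reviewer note: `EncodeNodeDevices` now writes a sixth comma-separated field, `Mode`, into each `:`-terminated record. The matching change to `DecodeNodeDevices` is not visible in this diff, so the following is only a sketch of how a decoder might accept both the old five-field and the new six-field records. The `DeviceInfo` field types, the lowercase function names, and the sample values (`"mig"`, `"NVIDIA-A30"`) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DeviceInfo mirrors only the fields the encoder in this diff touches.
// The integer widths are assumed, not copied from the repository.
type DeviceInfo struct {
	Id     string
	Count  int32
	Devmem int32
	Type   string
	Health bool
	Mode   string
}

// encodeNodeDevices follows the post-change layout:
// id,count,devmem,type,health,mode followed by ':' for every device.
func encodeNodeDevices(dlist []*DeviceInfo) string {
	var b strings.Builder
	for _, d := range dlist {
		b.WriteString(strings.Join([]string{
			d.Id,
			strconv.FormatInt(int64(d.Count), 10),
			strconv.Itoa(int(d.Devmem)),
			d.Type,
			strconv.FormatBool(d.Health),
			d.Mode,
		}, ","))
		b.WriteByte(':')
	}
	return b.String()
}

// decodeNodeDevices is a hypothetical decoder. It requires the five
// original fields and treats Mode as optional, so annotations written
// before this change still parse.
func decodeNodeDevices(s string) ([]*DeviceInfo, error) {
	var out []*DeviceInfo
	for _, rec := range strings.Split(s, ":") {
		if rec == "" {
			continue // skip the empty piece after the trailing ':'
		}
		f := strings.Split(rec, ",")
		if len(f) < 5 {
			return nil, fmt.Errorf("malformed device record %q", rec)
		}
		count, err := strconv.ParseInt(f[1], 10, 32)
		if err != nil {
			return nil, fmt.Errorf("bad count in %q: %w", rec, err)
		}
		mem, err := strconv.ParseInt(f[2], 10, 32)
		if err != nil {
			return nil, fmt.Errorf("bad devmem in %q: %w", rec, err)
		}
		health, err := strconv.ParseBool(f[4])
		if err != nil {
			return nil, fmt.Errorf("bad health in %q: %w", rec, err)
		}
		d := &DeviceInfo{Id: f[0], Count: int32(count), Devmem: int32(mem), Type: f[3], Health: health}
		if len(f) >= 6 {
			d.Mode = f[5]
		}
		out = append(out, d)
	}
	return out, nil
}

func main() {
	enc := encodeNodeDevices([]*DeviceInfo{
		{Id: "GPU-0", Count: 10, Devmem: 24576, Type: "NVIDIA-A30", Health: true, Mode: "mig"},
	})
	fmt.Println(enc) // GPU-0,10,24576,NVIDIA-A30,true,mig:

	devs, err := decodeNodeDevices(enc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *devs[0])
}
```

Because the new field is appended at the end of the record, any consumer that indexes fields by position keeps working for the first five; one that checks for exactly five fields would need updating.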
37 changes: 25 additions & 12 deletions volcano-vgpu-device-plugin.yml
@@ -33,60 +33,72 @@ data:
deviceSplitCount: 10
deviceMemoryScaling: 1
deviceCoreScaling: 1
gpuMemoryFactor: 1
knownMigGeometries:
- models: [ "A30" ]
allowedGeometries:
-
- group: group1
geometries:
- name: 1g.6gb
memory: 6144
count: 4
-
- group: group2
geometries:
- name: 2g.12gb
memory: 12288
count: 2
-
- group: group3
geometries:
- name: 4g.24gb
memory: 24576
count: 1
- models: [ "A100-SXM4-40GB", "A100-40GB-PCIe", "A100-PCIE-40GB", "A100-SXM4-40GB" ]
allowedGeometries:
-
- group: "group1"
geometries:
- name: 1g.5gb
memory: 5120
count: 7
-
- group: "group2"
geometries:
- name: 2g.10gb
memory: 10240
count: 3
- name: 1g.5gb
memory: 5120
count: 1
-
- group: "group3"
geometries:
- name: 3g.20gb
memory: 20480
count: 2
-
- group: "group4"
geometries:
- name: 7g.40gb
memory: 40960
count: 1
- models: [ "A100-SXM4-80GB", "A100-80GB-PCIe", "A100-PCIE-80GB"]
allowedGeometries:
-
- group: "group1"
geometries:
- name: 1g.10gb
memory: 10240
count: 7
-
- group: "group2"
geometries:
- name: 2g.20gb
memory: 20480
count: 3
- name: 1g.10gb
memory: 10240
count: 1
-
- group: "group3"
geometries:
- name: 3g.40gb
memory: 40960
count: 2
-
- group: "group4"
geometries:
- name: 7g.79gb
memory: 80896
count: 1
@@ -189,7 +201,8 @@ spec:
serviceAccount: volcano-device-plugin
containers:
- image: docker.io/projecthami/volcano-vgpu-device-plugin:v1.9.3
args: ["--device-split-count=10"]
command: ["sleep","infinity"]
#args: ["--device-split-count=10"]
lifecycle:
postStart:
exec: