Bump version to v0.15.0-rc.2 #596

Merged
merged 3 commits into from
Mar 16, 2024
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,10 +1,15 @@
## Changelog

### Version v0.15.0-rc.2
- Bump CUDA base image version to 12.3.2
- Add `cdi-cri` device list strategy. This uses the CDIDevices CRI field to request CDI devices instead of annotations.
- Set MPS memory limit by device index and not device UUID. This is a workaround for an issue where
these limits are not applied for devices if set by UUID.
- Update MPS sharing to disallow requests for multiple devices if MPS sharing is configured.
- Set mps device memory limit by index.
- Explicitly set sharing.mps.failRequestsGreaterThanOne = true.
- Run tail -f for each MPS daemon to output logs.
- Enforce replica limits for MPS sharing.
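
The memory-limit item above can be illustrated with a minimal shell sketch. It assumes the limit is applied with the `set_default_device_pinned_mem_limit` command of `nvidia-cuda-mps-control`, addressed by device index (0, 1, ...) rather than by UUID. The function name `mps_mem_limit_commands`, the device count, and the `4096M` limit are hypothetical and are not taken from this PR. The function only prints the commands and does not contact a daemon.

```shell
# Print one MPS control command per device index.
# $1: number of devices (hypothetical input)
# $2: memory limit, e.g. 4096M (hypothetical value)
mps_mem_limit_commands() {
    count="$1"
    limit="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Address the device by index, not by UUID.
        printf 'set_default_device_pinned_mem_limit %s %s\n' "$i" "$limit"
        i=$((i + 1))
    done
}

# Example: two devices, each limited to 4096M.
mps_mem_limit_commands 2 4096M
```

In a real setup each printed line might be piped to `nvidia-cuda-mps-control`; how the plugin derives the limit value is not shown in this PR.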
Contributor

Is this really all that changed?

Collaborator

❯  git log --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit -90 |grep -v Merge |grep -v Bump
11c7131e - Enforce maximum MPS replicas (25 hours ago) <Evan Lezar>
b3218699 - Run tail -f for each MPS daemon to output logs (2 days ago) <Evan Lezar>
93f161cc - Add mig-strategy flag to mps-control-daemon (2 days ago) <Evan Lezar>
4f3e4a58 - Cleanup log dir on stop (2 days ago) <Evan Lezar>
95be0832 - Explicitly set sharing.mps.failRequestsGreaterThanOne = true (2 days ago) <Evan Lezar>
13cf3b4c - Change validation logic for MPS sharing (2 days ago) <Evan Lezar>
d5f33b7d - Factor out allocate request validation (2 days ago) <Evan Lezar>
5edd66f6 - Set mps device memory limit by index (2 days ago) <Evan Lezar>
7302a18e - (origin/badge, badge) Add Status badges (8 days ago) <Carlos Eduardo Arango Gutierrez>
f3586af5 - (upstream/e2eactions, e2eactions) Add e2e github action (8 days ago) <Carlos Eduardo Arango Gutierrez>
00f34100 - Add dependabot config to update actions for gh-pages (9 days ago) <Evan Lezar>
6eb8d576 - Fix GitHub staging registry (9 days ago) <Evan Lezar>
dc38950a - Add cdi-cri device list strategy (11 days ago) <Evan Lezar>
cfcdcceb - Refactor label output (2 weeks ago) <Evan Lezar>
ca84c1b3 - Replace k8s-client.go with client sets (2 weeks ago) <Evan Lezar>
0727338f - Move use-node-feature-api to config structs (2 weeks ago) <Evan Lezar>
362d1d0f - Update nvidia-container-toolkit instructions (2 weeks ago) <Evan Lezar>
93b39b67 - (upstream/clean-go-mod) clean up replace directives in go.mod (2 weeks ago) <Tariq Ibrahim>
516945ce - TOFIX: Allow go mod tidy to set go version (2 weeks ago) <Evan Lezar>
2dbf357c - Add vendor check to actions (2 weeks ago) <Evan Lezar>
9e9fb58a - Extract GOLANG_VERSION from versions.mk (2 weeks ago) <Evan Lezar>
9cfdf86a - (origin/operations-per-run, operations-per-run) Increase operations-per-run on stale action (2 weeks ago) <Carlos Eduardo Arango Gutierrez>
ffb4b015 - Update github.com/mittwald/go-helm-client to v0.12.8 (2 weeks ago) <Evan Lezar>
86d94ffd - (origin/lifecyce, lifecyce) Edit stale message (2 weeks ago) <Carlos Eduardo Arango Gutierrez>
bc818aac - Use github image as staging image (3 weeks ago) <Evan Lezar>
4ac55b89 - Remove deprecated extensions/v1beta1 static deployment (3 weeks ago) <Evan Lezar>
6e209be5 - Update k8s.io/kubernetes to v1.28.7 (3 weeks ago) <Evan Lezar>
637326e1 - Fix typo on label name at stale action def (3 weeks ago) <Carlos Eduardo Arango Gutierrez>
cb9f53a2 - (origin/go-version-file, go-version-file) Update golang gh-Action to use go-version-file (3 weeks ago) <Carlos Eduardo Arango Gutierrez>
cb3949df - (origin/ghaction-stale, ghaction-stale) Add actions/stale gh-action (3 weeks ago) <Carlos Eduardo Arango Gutierrez>
778f0740 - Add changelog for v0.15.0-rc.1 (3 weeks ago) <Evan Lezar>
05897874 - Add MPS sharing section to README (3 weeks ago) <Evan Lezar>

These are the non-merge, non-dependabot commits since rc.1.

Contributor

I'd say these are the relevant ones to include then:

11c7131e - Enforce maximum MPS replicas (25 hours ago) <Evan Lezar>
b3218699 - Run tail -f for each MPS daemon to output logs (2 days ago) <Evan Lezar>
95be0832 - Explicitly set sharing.mps.failRequestsGreaterThanOne = true (2 days ago) <Evan Lezar>
5edd66f6 - Set mps device memory limit by index (2 days ago) <Evan Lezar>

Member Author

You're right. I have updated it.

As a side note, it would be good to generate this automatically if possible.

Collaborator

There's a way to do this in the GitHub UI. I'll show you next week.

Member Author

I know about the release notes for releases; what I mean is updating the changelog.

But maybe the issue is that I try to keep it up to date while making the changes.
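
One way the changelog draft discussed in this thread could be generated is sketched below. It is only a starting point: the function names `changelog_bullets` and `changelog_since` are hypothetical, and the filter is anchored to subjects that start with `Merge` or `Bump`, which is slightly stricter than the `grep -v Merge | grep -v Bump` used earlier in the thread. The output would still need the manual pruning the reviewers did here.

```shell
# Turn commit subjects (one per line on stdin) into changelog bullets,
# dropping merge commits and version/dependency bumps.
changelog_bullets() {
    grep -v -E '^(Merge|Bump) ' | sed 's/^/- /'
}

# List bullets for all commits after the given ref (for example a tag).
# --format=%s prints only the subject line of each commit.
changelog_since() {
    git log --no-merges --format=%s "$1..HEAD" | changelog_bullets
}
```

For example, `changelog_since v0.15.0-rc.1` would print a draft list for the commits made since that tag, assuming the tag exists in the local clone.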


### Version v0.15.0-rc.1
- Import GPU Feature Discovery into the GPU Device Plugin repo. This means that
4 changes: 2 additions & 2 deletions deployments/helm/nvidia-device-plugin/Chart.yaml
@@ -2,8 +2,8 @@ apiVersion: v2
name: nvidia-device-plugin
type: application
description: A Helm chart for the nvidia-device-plugin on Kubernetes
version: "0.15.0-rc.1"
appVersion: "0.15.0-rc.1"
version: "0.15.0-rc.2"
appVersion: "0.15.0-rc.2"
kubeVersion: ">= 1.10.0-0"
home: https://github.com/NVIDIA/k8s-device-plugin

@@ -4,7 +4,7 @@ metadata:
name: gpu-feature-discovery
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/version: 0.14.2
app.kubernetes.io/version: 0.15.0-rc.2
app.kubernetes.io/part-of: nvidia-gpu
spec:
selector:
@@ -15,11 +15,11 @@ spec:
metadata:
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/version: 0.14.2
app.kubernetes.io/version: 0.15.0-rc.2
app.kubernetes.io/part-of: nvidia-gpu
spec:
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
name: gpu-feature-discovery
command: ["/usr/bin/gpu-feature-discovery"]
volumeMounts:
@@ -4,7 +4,7 @@ metadata:
name: gpu-feature-discovery
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/version: 0.14.2
app.kubernetes.io/version: 0.15.0-rc.2
app.kubernetes.io/part-of: nvidia-gpu
spec:
selector:
@@ -15,11 +15,11 @@ spec:
metadata:
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/version: 0.14.2
app.kubernetes.io/version: 0.15.0-rc.2
app.kubernetes.io/part-of: nvidia-gpu
spec:
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
name: gpu-feature-discovery
command: ["/usr/bin/gpu-feature-discovery"]
volumeMounts:
6 changes: 3 additions & 3 deletions deployments/static/gpu-feature-discovery-daemonset.yaml
@@ -4,7 +4,7 @@ metadata:
name: gpu-feature-discovery
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/version: 0.14.2
app.kubernetes.io/version: 0.15.0-rc.2
app.kubernetes.io/part-of: nvidia-gpu
spec:
selector:
@@ -15,11 +15,11 @@ spec:
metadata:
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/version: 0.14.2
app.kubernetes.io/version: 0.15.0-rc.2
app.kubernetes.io/part-of: nvidia-gpu
spec:
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
name: gpu-feature-discovery
command: ["/usr/bin/gpu-feature-discovery"]
volumeMounts:
6 changes: 3 additions & 3 deletions deployments/static/gpu-feature-discovery-job.yaml.template
@@ -4,19 +4,19 @@ metadata:
name: gpu-feature-discovery
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/version: 0.14.2
app.kubernetes.io/version: 0.15.0-rc.2
app.kubernetes.io/part-of: nvidia-gpu
spec:
template:
metadata:
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/version: 0.14.2
app.kubernetes.io/version: 0.15.0-rc.2
app.kubernetes.io/part-of: nvidia-gpu
spec:
nodeName: NODE_NAME
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
name: gpu-feature-discovery
command: ["/usr/bin/gpu-feature-discovery"]
args:
@@ -38,7 +38,7 @@ spec:
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
priorityClassName: "system-node-critical"
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.14.4
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
name: nvidia-device-plugin-ctr
env:
- name: FAIL_ON_INIT_ERROR
@@ -124,7 +124,7 @@ spec:
- env:
- name: PASS_DEVICE_SPECS
value: "true"
image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.1
image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
name: nvidia-device-plugin-ctr
securityContext:
privileged: true
2 changes: 1 addition & 1 deletion deployments/static/nvidia-device-plugin.yml
@@ -38,7 +38,7 @@ spec:
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
priorityClassName: "system-node-critical"
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.1
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
name: nvidia-device-plugin-ctr
env:
- name: FAIL_ON_INIT_ERROR
2 changes: 1 addition & 1 deletion nvidia-device-plugin.yml
@@ -38,7 +38,7 @@ spec:
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
priorityClassName: "system-node-critical"
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.1
- image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
name: nvidia-device-plugin-ctr
env:
- name: FAIL_ON_INIT_ERROR
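
Because the same image tag is repeated across several manifests, a small check could flag references that a version bump missed. The sketch below is an illustration, not part of this PR: the function name `stale_image_refs` is hypothetical, and it assumes manifests reference the image as `nvcr.io/nvidia/k8s-device-plugin:<tag>`, as in the hunks above.

```shell
# Read manifest text on stdin and print every k8s-device-plugin image
# reference whose tag differs from the expected version in $1.
# Prints nothing (and returns non-zero) when all references match.
stale_image_refs() {
    grep -o 'nvcr\.io/nvidia/k8s-device-plugin:[^[:space:]"]*' |
        grep -v -F -x "nvcr.io/nvidia/k8s-device-plugin:$1"
}
```

For example, `stale_image_refs v0.15.0-rc.2 < nvidia-device-plugin.yml` would print any image reference in that file that still carries a different tag.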
2 changes: 1 addition & 1 deletion versions.mk
@@ -17,7 +17,7 @@ MODULE := github.com/NVIDIA/$(DRIVER_NAME)

REGISTRY ?= nvcr.io/nvidia

VERSION ?= v0.15.0-rc.1
VERSION ?= v0.15.0-rc.2

# vVERSION represents the version with a guaranteed v-prefix
vVERSION := v$(VERSION:v%=%)
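
The `vVERSION` line strips a leading `v` from `VERSION` (if present) and then prepends one, so the result carries exactly one `v` prefix whether or not the input had it. A minimal shell equivalent, with the hypothetical function name `ensure_v_prefix`, might look like this:

```shell
# Print the version in $1 with a guaranteed single leading "v".
# ${1#v} removes one leading "v" if present; printf then adds it back.
ensure_v_prefix() {
    printf 'v%s\n' "${1#v}"
}

ensure_v_prefix v0.15.0-rc.2
ensure_v_prefix 0.15.0-rc.2
```

Both calls print the same `v`-prefixed version string.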