rbd: add volume healer #2108
Conversation
35faadc to e00874b
@pkalever you can try adding nolint in the following manner and see if that solves it.
@Rakshith-R this would apply the nolint to the whole function (the block of code), but we want nolint applied to just one single line (func funcName(param1, param2 ...) {). Hope it makes sense now.
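For illustration, here is a minimal sketch of the line-level placement being discussed, assuming golangci-lint's nolint directives; the linter name (lll), the long parameter name, and the reason text are placeholders, not the exact change in this PR:

```go
package example

// A trailing //nolint directive on the signature line silences the named
// linter for that single line only; a standalone //nolint comment on the
// line above the function would instead apply to the whole function block.
func funcName(param1 string, param2 int, aVeryLongParameterNameToTripTheLineLengthLinter map[string]string) (string, error) { //nolint:lll // long signature is intentional
	return param1, nil
}
```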
@pkalever Thanks for this one. Let's get confirmation on the approach, then we can do the PR review.
[As discussed already] Many thanks!
How about using https://kubernetes.io/docs/concepts/workloads/pods/init-containers/? (Or at least have a knob to activate/deactivate the volume healer.)
The ultimate goal is to bring the rbd-nbd processes back to life within the csi-rbdplugin pod. We thought about init containers initially, but init containers have to start and finish before the main container starts. Here our main container is csi-rbdplugin, where the rbd-nbd processes should start and keep running; what would we do in the init containers, and how would we be done there before the main container starts? Given that the csi-rbdplugin container will not be started until the init containers finish, we want something that runs in parallel with csi-rbdplugin, so that the main container can handle the NodeStageVolume calls.
Thanks, @nixpanic for the review. I will respin ASAP.
Thanks for the clarification! But a command-line argument to enable the volume healer (default being false) would be great, considering we don't want to run this operation for people not using rbd-nbd.
We do not know in advance whether users will configure a StorageClass with rbd-nbd or not. Having it as a global option/parameter would make this more error prone. It is more user-friendly to detect the need for it automatically: when a worker node does not have any VolumeAttachments of the nbd mounter type, the healer simply has nothing to do.
Completely agree with Niels here. I don't think we should have an option to enable the volume healer.
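To illustrate how this automatic detection could look, here is a hedged Go sketch (not the actual code in this PR) that lists the VolumeAttachments on a node via client-go and keeps only the attached ones belonging to this driver; the function and parameter names are assumptions, and the rbd-nbd mounter check itself would still come from the referenced PersistentVolume and StorageClass parameters. This is also the kind of access that motivates the extra volumeattachments and persistentvolumes cluster roles added in this PR:

```go
package healer

import (
	"context"
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAttachedVolumes is a hypothetical helper: it returns the
// VolumeAttachments on the given node that are handled by this CSI driver and
// are currently attached. If the result is empty, the healer has nothing to
// do on this node, so no global enable/disable option is required.
func listAttachedVolumes(ctx context.Context, c kubernetes.Interface, driver, node string) ([]storagev1.VolumeAttachment, error) {
	vaList, err := c.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing volumeattachments: %w", err)
	}

	var candidates []storagev1.VolumeAttachment
	for _, va := range vaList.Items {
		if va.Spec.Attacher == driver && va.Spec.NodeName == node && va.Status.Attached {
			// Whether the volume actually uses the rbd-nbd mounter is decided
			// from the referenced PersistentVolume, which needs the extra
			// "persistentvolumes: get" permission.
			candidates = append(candidates, va)
		}
	}
	return candidates, nil
}
```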
1a259c8 to e6823bb
@humblec could you please remove the DNM flag? The DNM has been blocking us on this route for close to 9-10 days now, so it would be nice if we can lift it and let this go. Given that the 3.4.0 release is very close, it is high time we make this decision now to avoid any last-minute regressions. @Madhu-1 @nixpanic, with the latest rebase the approvals are lost; please help restore them. Thanks!
@pkalever please accept the fact that, in between, this has happened:
As discussed in yesterday's and previous calls, even though it is alpha, if there are regressions it is difficult to get fixes cycled out to users once Rook has a release with the old images, so the extra care here.
@Mergifyio rebase
As mentioned earlier, it was a quick move!
Yes, the conflicting part is resolved.
We still have this on our plate; we will make sure to get to them soon.
Yes, getting a second pair of eyes from others is always good, but I request that it happens at the earliest, when the PR is posted or the design is proposed/documented. (Again, I'm not saying that it wasn't my mistake to not document the design.)
Yes, which is why I want this to land ASAP, so that there is some chance for CI to catch the regressions, if there are any.
Yes, and thanks for removing the DNM flag. Cheers to the userspace mounters!!
k8s-e2e-external-storage-1.21 fails
mini-e2e-helm_k8s-1.20 & mini-e2e_k8s-1.19
/test ci/centos/mini-e2e-helm/k8s-1.21
path is used by standard package. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
As part of the stage transaction, if the mounter is of type nbd, capture the device path after a successful rbd-nbd map. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
The nodeplugin needs the below cluster roles: persistentvolumes: get; volumeattachments: list, get. These additional permissions are needed by the volume healer. The volume healer aims at fixing volume health issues at the very startup time of the nodeplugin. As part of its operations, the volume healer has to run through the list of volume attachments and understand details about each persistentvolume. Later commits will use these additional cluster roles. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Problem: For rbd-nbd userspace mounter backends, after a restart of the nodeplugin all the mounts will start seeing IO errors. This is because, for rbd-nbd backends, there is a userspace mount daemon running per volume, and after a restart of the nodeplugin pod there is no way to restore those daemons back to life. Solution: The volume healer is a one-time activity that is triggered at the startup time of the rbd nodeplugin. It navigates through the list of volume attachments on the node and acts accordingly. For now, it is limited to nbd-type storage only, but it is flexible and can be extended in the future for other backend types as needed. From a few feet above: this solves a severe problem for nbd-backed CSI volumes. While going through the list of volume attachments on the node, if the healer finds a volume that is in the attached state and is of type nbd, it attempts to fix the rbd-nbd volume by sending a NodeStageVolume request with the required volume attributes (secrets, device name, image attributes, etc.), which finally helps start the required rbd-nbd daemons in the nodeplugin's csi-rbdplugin container. This allows reattaching the backend images to the right nbd device, letting applications perform IO without any interruption even after a nodeplugin restart. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This will bring down the healer run time by a great factor. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Now that the healer functionality for mounter processes is available, let's start using it. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
warnings from golangci-lint: e2e/pod.go:207:122: directive `//nolint:unparam,lll // cn can be used with different inputs later` is unused for linter unparam (nolintlint) func execCommandInContainer(f *framework.Framework, c, ns, cn string, opt *metav1.ListOptions) (string, string, error) { //nolint:unparam,lll // cn can be used with different inputs later e2e/pod.go:307:70: directive `//nolint:unparam // skipNotFound can be used with different inputs later` is unused for linter unparam (nolintlint) func deletePodWithLabel(label, ns string, skipNotFound bool) error { //nolint:unparam // skipNotFound can be used with different inputs later Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Describe what this PR does
Problem:
For rbd-nbd userspace mounter backends, after a restart of the nodeplugin all the mounts will start seeing IO errors. This is because, for rbd-nbd backends, there is a userspace mount daemon running per volume, and after a restart of the nodeplugin pod there is no way to restore those daemons back to life.
Solution:
The volume healer is a one-time activity that is triggered at the startup time of the rbd nodeplugin. It navigates through the list of volume attachments on the node and acts accordingly.
For now, it is limited to nbd type storage only, but it is flexible and can be extended in the future for other backend types as needed.
From a few feet above:
This solves a severe problem for nbd-backed CSI volumes. While going through the list of volume attachments on the node, if the healer finds a volume that is in the attached state and is of type nbd, it attempts to fix the rbd-nbd volume by sending a NodeStageVolume request with the required volume attributes (secrets, device name, image attributes, etc.), which finally helps start the required rbd-nbd daemons in the nodeplugin's csi-rbdplugin container. This allows reattaching the backend images to the right nbd device, letting applications perform IO without any interruption even after a nodeplugin restart.
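To make the flow concrete, here is a hedged Go sketch of the healer loop described above; it is not the PR's actual implementation, and buildNodeStageRequest is a hypothetical placeholder standing in for the real work of reading the PersistentVolume, checking the mounter type, and reassembling secrets, device name, and image attributes:

```go
package healer

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
	storagev1 "k8s.io/api/storage/v1"
)

// buildNodeStageRequest is a placeholder: a real healer would fetch the PV,
// confirm that its mounter is nbd, and rebuild the full staging request
// (volume ID, staging path, secrets, volume context, ...). It is stubbed out
// here only to keep the sketch compilable.
func buildNodeStageRequest(ctx context.Context, pvName string) (*csi.NodeStageVolumeRequest, bool, error) {
	return &csi.NodeStageVolumeRequest{VolumeId: pvName}, false, nil
}

// runVolumeHealer re-issues NodeStageVolume for every attached nbd-type volume
// on this node so that the csi-rbdplugin container starts the missing rbd-nbd
// daemons again after a nodeplugin restart.
func runVolumeHealer(ctx context.Context, ns csi.NodeServer, attachments []storagev1.VolumeAttachment) {
	for _, va := range attachments {
		pvName := va.Spec.Source.PersistentVolumeName
		if pvName == nil || !va.Status.Attached {
			continue
		}
		req, needsHealing, err := buildNodeStageRequest(ctx, *pvName)
		if err != nil || !needsHealing {
			continue // skip volumes that are not nbd-backed, or log the error
		}
		// A failure here (for example, the lock is held by another operation
		// on this attachment) is reported for this volume only; the healer
		// carries on with the next attachment on the node.
		if _, err := ns.NodeStageVolume(ctx, req); err != nil {
			fmt.Printf("healer: NodeStageVolume failed for %s: %v\n", *pvName, err)
		}
	}
}
```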
Some points to be noted
Q: What is the route for the upgrade path?
A: For upgrade:
Or
Q: Should this patch use rbd integrated CLI instead of rbd-nbd CLI?
A: Maybe once a release including ceph/ceph#41279 is available; for now, the attach/detach commands are not promoted in the rbd CLI.
Q: What if the Lock is acquired by some other op for a given attachment?
A: Throw an error for that attachment and continue with the next attachment on the node.
Q: What is the rbd-nbd reattach timeout?
A: 5 minutes; this means the nodeplugin pod should come back up and the healer should finish triggering NodeStageVolume requests for all attachments on that node within 300 seconds.
Q: What version of rbd-nbd is needed on the node plugin pod?
A: The rbd-nbd version should be at least Ceph Pacific (v16).
Q: What happens if the attach command is absent?
A: You will see errors like 'NodeStageVolume failed, err: ... unknown option 'attach cephfs.a.data/csi-vol-6dd424b9-b971-11eb-af01-7a53e508ee3b'' in the logs.
Q: What happens when there is an rbd-nbd command failure?
A: Similar to above
Q: What kind of logs will we see at the nodeplugin on a successful healer run?
A: Here are some logs:
Initial discussion summary:
We have captured the initial discussion at https://hackmd.io/mA8CtRUPS4SSV9oVcDFVeg. This PR has a few changes from that design; mainly, we don't start a new sidecar, but rather trigger the healer inside the csi-rbdplugin container itself.
Fixes: #667 #1929